Linear trends of the surface air temperature (SAT) simulated by selected models from the Coupled Model Intercomparison Project (CMIP3 and CMIP5) historical experiments are evaluated against observations to document (1) the expected range and characteristics of the errors in hindcasting the 'change' in SAT at different spatiotemporal scales, (2) whether there are 'threshold' spatiotemporal scales across which the models show substantially improved performance, and (3) how these errors differ between CMIP3 and CMIP5. Root mean square error, linear correlation, and Brier score show better agreement with the observations as the spatiotemporal scale increases, but the skill at regional (5° × 5° to 20° × 20° grid) and decadal (10- to ∼30-year trends) scales is rather limited. Rapid improvements are seen between the 30° × 30° grid and zonal averages and around 30 years, although the details depend on the performance statistic. The rather abrupt change in performance from the 30° × 30° grid to the zonal average implies that averaging out longitudinal features, such as the land-ocean contrast, might significantly improve the reliability of the simulated SAT trend. The mean bias and ensemble spread relative to the observed variability, which are crucial to the reliability of the ensemble distribution, are not necessarily improved with increasing scales and may affect probabilistic predictions more at longer temporal scales. No significant differences are found between the performance of CMIP3 and CMIP5 at large spatiotemporal scales, but at smaller scales the CMIP5 ensemble often shows better correlation and Brier score, indicating improvements in CMIP5 in representing the temporal dynamics of SAT at regional and decadal scales.
Global climate models (coupled general circulation or Earth system models) play a crucial role in improving our understanding of the climate system and in providing projections of future climate under anthropogenic influence. The latest generation of climate models has recently been introduced to the community through phase five of the Coupled Model Intercomparison Project (CMIP5), and it includes unprecedented complexity and realism in its representation of the Earth system [Taylor et al., 2012]. The skill of these models, and their improvements over the previous generation, phase three of CMIP (CMIP3) [Meehl et al., 2007], is of critical importance for scientific endeavors using the simulation outputs. In particular, the increasing interest in decadal and regional climate projections necessitates quantitative evaluations of the global climate simulations not only on the climatological mean field but also on the change of the climate state over time at various spatiotemporal scales [Stott and Tett, 1998; Masson and Knutti, 2011b].
Our recent study [Sakaguchi et al., 2012] attempted to provide a quantitative summary of the scale dependency of the skill of three climate models from CMIP3 in hindcasting the surface air temperature (SAT) trend. The observed and simulated SAT linear trends were compared for all possible running windows for which sufficient observations are available, with the length of the running windows varying from 10 to 50 years. By repeating the evaluation based on linear correlation and RMSE from the smallest 5° × 5° grid to the global average, it was concluded that the multimodel ensemble mean (EM) shows robust performance at spatial scales of the zonal mean over 30° latitudinal bands or larger and at temporal scales of 40 years or longer.
Questions remain as to whether such scale dependency may change (or improve) with a larger number of models, a newer generation of models, and probabilistic predictions. This study aims to address these questions by assessing (1) the expected range and characteristics of the model errors across diverse spatiotemporal scales, (2) whether there exist 'threshold' spatiotemporal scales beyond which the performance of SAT trend predictions substantially improves, and (3) whether the performance has improved from CMIP3 to CMIP5. The assessment is made for the individual simulations from each model ('ensemble members') as well as for the EMs and probabilistic predictions, which have been actively discussed using CMIP models [Räisänen and Palmer, 2001; Tebaldi and Knutti, 2007; Gleckler et al., 2008; Reichler and Kim, 2008; Knutti et al., 2010a; Annan and Hargreaves, 2010, 2011].
2. Data and Method
Most model assessments have been based on the climatological mean, but the correlation between the skill for the mean climate and that for the trend has been found to be weak [Jun et al., 2008; Knutti et al., 2010a]. Here we consider the trend of SAT rather than the mean. It is also more relevant, particularly for impact studies, to assess the dependence of skill on temporal scale using, for example, the 50-year trend instead of the 50-year mean SAT. Our prior analysis indicated that models show higher skill in simulating the mean than the trend [Sakaguchi et al., 2012].
We consider running time windows of lengths 10, 20, 30, 40, and 50 years, over which linear trends are calculated by least squares regression. The slopes of the linear trends from the successive time windows constitute a new time series, which we call the 'running trend time series'. The running trends are evaluated instead of a single linear trend over a particular period in order to sample a wider range of external forcings and internally generated variability in the historical period [Easterling and Wehner, 2009; Liebmann et al., 2010; Meehl et al., 2011]. The running trend time series is calculated for each grid box (or averaged region) at different spatial scales, which are obtained by area-weighted averaging from the 5° × 5° grid to 10° × 10°, 15° × 15°, 20° × 20°, and 30° × 30° grids, as well as to zonal means (5°, 15°, and 30° widths) and hemispheric and global means. Examples of global mean running trends are presented in auxiliary material Figure S1.
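The running-trend construction described above can be sketched in a few lines of Python/NumPy. This is an illustrative helper we call `running_trends`; the function name and the use of `np.polyfit` are our own choices, not taken from the paper.

```python
import numpy as np

def running_trends(series, window, step=1):
    """Least squares linear trend (slope) over each running window.

    series : 1-D array of annual mean SAT anomalies
    window : window length in years (e.g., 10, 20, ..., 50)
    Returns one slope per window position ('running trend time series').
    """
    t = np.arange(window)
    slopes = []
    for start in range(0, len(series) - window + 1, step):
        y = series[start:start + window]
        # np.polyfit with degree 1 returns [slope, intercept]
        slope, _ = np.polyfit(t, y, 1)
        slopes.append(slope)
    return np.asarray(slopes)
```

For a 60-year series and a 10-year window this yields 51 overlapping trend values, which is why the windows are strongly serially correlated (addressed in section 2 via effective sample sizes).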
The observational data used are the historical reconstructions of the monthly mean SAT anomaly on a 5° × 5° global grid by the Hadley Centre and University of East Anglia Climate Research Unit (HadCRUT3) [Brohan et al., 2006] and the National Oceanic and Atmospheric Administration (NOAA) National Climatic Data Center (NCDC) [Smith et al., 2008]. There are several uncertainties in these SAT data, as documented in the above two references and others [Pielke et al., 2007; Thompson et al., 2008; Deser et al., 2010], and our use of the two data sets is a simple, albeit limited, way to reflect the observational uncertainty in the analysis of model errors. Specifically, we calculate the performance statistics between the two data sets as a reference for model performance. We also estimate the influence of observational uncertainty through a 'perturbed' ensemble approach: for each grid point or averaged region, Gaussian random noise with zero mean and a standard deviation derived from the difference between the two observational data sets is added to each ensemble member. This is repeated 500 times to obtain the possible range of the performance statistics under the random noise. This approach has some limitations and other methods are available [Bowler, 2006, 2008; Candille and Talagrand, 2008], but its simplicity makes it desirable for our analysis across diverse spatiotemporal scales.
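The perturbed-ensemble procedure can be illustrated as follows. The zero-mean Gaussian noise and the 500 repeats follow the text; the helper name `perturbed_stat`, the statistic callable, the seeding, and the percentile summary are our assumptions.

```python
import numpy as np

def perturbed_stat(members, obs, obs_diff_std, stat, n_repeat=500, seed=0):
    """Range of a performance statistic under observational uncertainty.

    members      : (n_members, n_time) simulated running trends
    obs          : (n_time,) observed running trends
    obs_diff_std : std of the difference between the two observational
                   data sets at this grid point (scalar)
    stat         : callable(prediction, obs) -> float, e.g., RMSE
    Returns the 2.5th and 97.5th percentiles of the statistic across repeats.
    """
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_repeat):
        # add independent Gaussian noise to every ensemble member
        noise = rng.normal(0.0, obs_diff_std, size=members.shape)
        perturbed = members + noise
        # evaluate the statistic for the ensemble mean of the perturbed members
        vals.append(stat(perturbed.mean(axis=0), obs))
    return np.percentile(vals, [2.5, 97.5])
```

The resulting interval is the gray shading around the CMIP5 results discussed in section 3.3.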
The climate model outputs of the monthly mean SAT were downloaded from the Program for Climate Model Diagnosis and Intercomparison (PCMDI) CMIP3 and CMIP5 archives [Meehl et al., 2007; Taylor et al., 2012]. We focus on the historical simulations with prescribed greenhouse gas concentrations and other external forcings. The differences between the CMIP3 and CMIP5 generations include, but are not limited to, prognostic aerosol distributions, dynamic land surface representation (anthropogenic land cover change and dynamic vegetation), stratospheric circulations, atmospheric chemistry, carbon cycles, grid resolutions, and numerical scheme changes.
The model selection from CMIP3 is based on the inclusion of temporally varying greenhouse gas forcings during the 20th century [Hegerl et al., 2007] and the availability of more than one simulation, because of our interest in how internal climate variability affects the performance metrics. Most of the selected models employed time-varying natural forcings as well. Only one model from each modeling center is chosen to decrease interdependence among models [Masson and Knutti, 2011a]. These criteria leave us with seven models (28 members) from CMIP3 (referred to as the 'CMIP3 ensemble', Table 1), which we consider a reasonable number given the asymptotic performance improvement of ensemble averaging at five to ten models [Hagedorn et al., 2005; Knutti et al., 2010a]. In contrast, only three CMIP3 models with 13 members were used in Sakaguchi et al. [2012]. The sensitivity of some of the performance statistics to the number of models will be discussed in section 4. The direct descendants of those CMIP3 models were given priority in the selection from CMIP5 (the 'CMIP5 ensemble' with 29 members, Table 1). The new generation of MIUB/KMA ECHO-G was not available in CMIP5 at the time of writing; thus, MPI ESM-LR was chosen because of the close relation between ECHO-G and the previous version of the MPI model, ECHAM4 [Masson and Knutti, 2011a]. The details of each model are given in the references in Table S1 in Text S1.
Table 1. The number of ensemble members is given in parentheses. A more detailed list of the modeling centers and references for the models is given in Table S1 in Text S1.

CMIP3                       CMIP5
GFDL CM2.0 (3)              GFDL CM3 (4)
GISS E-R (5)                GISS E2-R (5)
MIUB/KMA ECHO-G (5)         MPI ESM-LR
CCSR MIROC3.2 medres (3)    MIROC-ESM
MOHC HadGEM1 (2)            MOHC HadGEM2-ES (4)
MRI CGCM2.3.2 (5)           MRI CGCM3 (5)
NCAR CCSM3 (5)              NCAR CCSM4 (5)
If more than five members were available from a single model, five were arbitrarily chosen to avoid overweighting a particular model. Further, we computed the multimodel EM in two ways: as a simple mean across all members, and as the average of each model's ensemble average so that the models are weighted equally [Knutti et al., 2010b]. The results from the latter are shown below, although no significant differences in performance were found between the two approaches. The model outputs are regridded to the same grid as the observational data, onto which the missing flags of HadCRUT3 are imported. The annual mean SAT anomaly is calculated for each grid box following Hegerl et al. [2007].
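The two ways of forming the multimodel EM can be written compactly. This is an illustrative sketch; `ensemble_means` is a hypothetical name.

```python
import numpy as np

def ensemble_means(runs_by_model):
    """Two ways to form the multimodel ensemble mean (EM).

    runs_by_model : list of arrays, one per model, each (n_members_m, n_time)
    Returns (simple mean over all members,
             mean of the per-model ensemble averages).
    """
    # (a) simple mean: every member counts equally
    all_members = np.concatenate(runs_by_model, axis=0)
    simple_em = all_members.mean(axis=0)
    # (b) equal weight per model: average each model first, then average models
    model_means = np.stack([m.mean(axis=0) for m in runs_by_model])
    weighted_em = model_means.mean(axis=0)
    return simple_em, weighted_em
```

The two definitions differ only when models contribute unequal numbers of members; with one member from model A and three from model B, (a) weights B three times as heavily while (b) weights the two models equally.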
Several performance statistics (root mean square error (RMSE), linear correlation (r), mean bias (MB), etc.) are calculated between the simulated and observed running trend time series. The accuracy of probabilistic prediction is evaluated by the Brier score (BS) [Brier, 1950] based on a simple dichotomous categorization into 'positive trend' and 'negative or zero trend'. For example, if half of the ensemble members predict a positive trend for a given time window, this is cast as a 50% probability of a positive trend. The so-called 'reliability' of the ensemble prediction, one of the three elements in the partitioning of BS [Murphy, 1973], refers to whether the ensemble probability distribution is statistically indistinguishable from the observation. This aspect of the CMIP3 models has been assessed over the climatological field by Annan and Hargreaves [2010] and has recently drawn attention concerning the interpretation of the CMIP ensembles [Knutti et al., 2010b]. In this study the rank histogram [Anderson, 1996; Talagrand et al., 1997], MB, and spread of the ensemble are analyzed to diagnose the reliability of the CMIP3 and CMIP5 ensembles and its scale dependence. No post-processing such as model weighting or bias correction [e.g., Tebaldi and Knutti, 2007; Johnson and Bowler, 2009] is applied to the model outputs.
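For the dichotomous 'positive trend' event described above, the Brier score reduces to the mean squared difference between the forecast probability and the binary outcome. A minimal sketch (the helper name `brier_score` is ours):

```python
import numpy as np

def brier_score(member_trends, obs_trends):
    """Brier score for the dichotomous event 'positive trend'.

    member_trends : (n_members, n_windows) simulated running trends
    obs_trends    : (n_windows,) observed running trends
    The forecast probability is the fraction of members with a
    positive trend in each window; lower scores are better.
    """
    prob = (member_trends > 0).mean(axis=0)   # forecast probability
    outcome = (obs_trends > 0).astype(float)  # 1 if observed trend positive
    return float(np.mean((prob - outcome) ** 2))
```

A constant 50% forecast gives BS = 0.25 regardless of the outcome, which is the random-prediction reference used in section 3.1.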
In assessing the sampling uncertainty of the performance statistics [Jolliffe, 2007], the degrees of freedom should be smaller than the number of moving windows, since the windows are not independent of one another. We estimated effective sample sizes for RMSE and correlation based on the serial correlation of the running trend time series, following Bretherton et al. [1999]. The sample variance of BS [Bradley et al., 2008] and the chi-square test for the rank histogram are also adjusted for autocorrelation as in Wilks [2004, 2010]. See the auxiliary material for details (section 2.1 of Text S1).
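One common form of the Bretherton et al. effective sample size for the correlation of two autocorrelated series is n_eff = n (1 - r1 r2) / (1 + r1 r2), with r1 and r2 the lag-1 autocorrelations. The sketch below assumes this particular variant; the helper names are ours and the paper may use a different estimator.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))

def effective_sample_size(x, y):
    """Effective sample size of two serially correlated series:
    n_eff = n (1 - r1 r2) / (1 + r1 r2), one Bretherton et al. variant."""
    r1, r2 = lag1_autocorr(x), lag1_autocorr(y)
    n = len(x)
    return n * (1.0 - r1 * r2) / (1.0 + r1 * r2)
```

Two positively autocorrelated series (such as overlapping running trends) yield n_eff well below n, which widens the confidence intervals of RMSE and r accordingly.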
3.1. Hindcast Skills
We first summarize the skill (RMSE and r) of the EM, as it is widely used in climate-related studies and known for its superior performance relative to individual members [Gleckler et al., 2008; Reichler and Kim, 2008], together with the probabilistic prediction skill of each ensemble as measured by BS. The three performance statistics show clear improvement with increasing scale (Figure 1). The RMSE is divided by the standard deviation of the HadCRUT3 running trend (σobs) for consistency across scales, since linear trends vary by more than one order of magnitude, and the 75th percentiles of this normalized RMSE across grid points (i.e., 75% of the available grid points have smaller RMSE/σobs) from the EMs are shown in Figure 1a. Division by σobs is also used in forecasting skill scores that take climatology as a reference [Wilks, 2006]. In both the CMIP3 and CMIP5 EMs, the normalized RMSE decreases with increasing scale and becomes smaller than the observed variability for spatial scales of the 30° zonal mean or larger and temporal scales of 20 years or longer. As an additional reference, pre-industrial control simulations by several CMIP5 models are randomly resampled 1000 times, and the RMSE between these samples and HadCRUT3 is compared to that of the historical simulations by the two EMs [Sakaguchi et al., 2012]. This simple comparison attempts to identify the spatiotemporal scales at which the effect of the external forcings during the 20th century (not included in the pre-industrial experiment) dominates the influence of natural variability (realized in both the pre-industrial and historical runs) [Hawkins and Sutton, 2009]. Presumably the former is more predictable than the latter, so we can place more confidence in the model predictions at these scales if the simulated natural variability is realistic. More rigorous methods would be needed to understand the contributions from different forcing agents [Hegerl et al., 2007], but that is not our focus here.
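The normalization of RMSE by σobs used in Figure 1a amounts to the following (minimal sketch; the helper name is ours):

```python
import numpy as np

def normalized_rmse(sim, obs):
    """RMSE between simulated and observed running trends, divided by
    the standard deviation of the observed running trend (sigma_obs),
    so that values below one mean the error is smaller than the
    observed variability."""
    rmse = np.sqrt(np.mean((sim - obs) ** 2))
    return float(rmse / np.std(obs))
```

Because running trends at the 5° × 5°, 10-year scale are an order of magnitude larger than at the global, 50-year scale, this normalization is what makes the panels of Figure 1a comparable.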
The spatiotemporal scales with significant differences from the control simulations at the α = 0.10 level are enclosed by dashed black lines in Figure 1a. Both EMs exhibit significant differences at most of the scales at this level. Exceptions are the two smallest spatial scales at longer temporal scales, which could be attributed to the dominance of natural variability and to the more pronounced mean bias of the historical simulations relative to σobs at longer temporal scales (discussed later). At α = 0.05, the differences at the two smallest spatial scales are generally not significant.
The temporal dynamics of the linear trends is first measured by r, whose 25th percentiles across grid points (i.e., 75% of the available grid points have higher correlation) from the EMs are shown in Figure 1b. A high r is not expected at smaller spatiotemporal scales for the historical simulations, since they are initialized from pre-industrial control simulations near equilibrium, which are by no means designed to capture the phase of the internal variability of the real world around 1850 (the starting year of the historical simulation; Taylor et al. [2012]). Our results show that at small to intermediate spatiotemporal scales r is below 0.5 for most of the grid cells, with the EMs in general attaining higher r than individual members (shown later). Both EMs reach r of ∼0.5 at 30 years and ∼0.7 (corresponding to r2 = 0.5) at temporal scales of 40 years or longer for hemispheric or global averages. Significant differences from the preindustrial simulations are restricted to the three largest spatial scales at the α = 0.10 level.
Individual simulations from the arbitrarily initialized historical experiments are thus limited in reproducing the temporal evolution of the internal variability. How, then, does an ensemble prediction fare? Probabilistic predictions are constructed for each ensemble as described in section 2, and Figure 1c displays the 75th percentiles of BS for each scale. One reference for ensemble prediction skill is whether the majority of the grid points attain a BS lower than 0.25 (lower is better), the score of a random prediction assigning a 50% chance to each of the two events. The 75th percentiles of BS become smaller than 0.25 once the scale exceeds the 30° × 30° grid and 20 years for the CMIP5 ensemble. The CMIP5 ensemble shows slightly better BS than the CMIP3 ensemble around these threshold scales.
The reliability of the probabilistic prediction is assessed using the rank histogram (section 2). Our χ2 tests for the rank histograms do not reject the null hypothesis of uniformity at most grid points and averaged regions. This seemingly suggests that the CMIP ensemble predictions are consistent with the observed probability distributions at almost all scales. However, the rank histogram test at the grid point level may have limited statistical power for the running SAT trends because of the small effective number of samples at longer temporal scales and the high observational uncertainty at small scales. A lack of power in the significance test for the rank histogram, associated with the unknown degrees of freedom, may also not be unusual [Jolliffe and Primo, 2008; Annan and Hargreaves, 2010]. There are some indications of bias and of under- and overdispersion in the simulated SAT trend from visual inspection of the trend time series, the reliability diagram [Wilks, 2006], and the decomposition of the chi-square test into null hypotheses other than uniformity [Jolliffe and Primo, 2008]. These ensemble characteristics are further discussed in the following sections.
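The rank histogram underlying this test simply counts where the observation falls within the ordered ensemble at each time window. The sketch below is illustrative; the treatment of ties and the exact rank convention are our assumptions.

```python
import numpy as np

def rank_histogram(members, obs):
    """Rank of the observation within the ensemble at each time window.

    members : (n_members, n_time) simulated running trends
    obs     : (n_time,) observed running trends
    Returns counts over the n_members + 1 possible ranks; a reliable
    ensemble yields an approximately flat histogram, while a U-shape
    suggests underdispersion and a hump overdispersion.
    """
    n_members = members.shape[0]
    # rank = number of members below the observation (0 .. n_members)
    ranks = (members < obs).sum(axis=0)
    return np.bincount(ranks, minlength=n_members + 1)
```

Uniformity of these counts is what the χ2 test evaluates, with the effective sample size reduced for the serial correlation of the running trends.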
The three performance statistics (although r is a component of RMSE [Murphy, 1988]) depend on the spatiotemporal scales somewhat differently, but jointly suggest a rapid increase in skill beyond the spatial scale of 30° × 30° and the temporal scale of 20–30 years. No substantial differences are seen between the CMIP3 and CMIP5 EMs, although there are hints of better performance in CMIP5 at smaller spatiotemporal scales (weaker colors for r and BS). The zonal mean over 30° latitudinal bands is shown in Figure 1, but the performance statistics are quite similar among the 5°, 15°, and 30° widths.
3.2. Spatial Pattern
The previous section summarizes the global characteristics of the performance statistics at each scale, but how they vary geographically and across ensemble members is also of interest. Figure 2 shows the spatial patterns on the 5° × 5° grid of RMSE and r, as well as MB and ensemble spread, for the CMIP5 ensemble. The first two rows illustrate how RMSE (relative to σobs) changes between the two temporal scales for the EM (Figure 2a) and for the individual members, the latter by averaging the RMSE of each member (Figure 2b). At the 10-year scale the EM's RMSE is mostly comparable to the observed variability, with some exceptions such as being slightly greater over the equatorial and northwestern Pacific (Figure 2a, left). The average RMSE of the individual members is larger, particularly over the high-latitude oceans (Figure 2b, left). For the 50-year trend, high RMSE (relative to σobs) is distributed mainly over land, the North Atlantic, and the western boundary currents in the Pacific Ocean (Figures 2a (right) and 2b (right)). Larger spatial averaging usually diffuses the patterns seen at the 5° × 5° scale into smoother, smaller values (not shown).
The average r among individual members for the 10-year trend is such that small but positive r is found mostly over the oceans, except over the tropical Pacific (Figure 2c, left). For the 50-year trend, higher, positive r is dominant, and low r is restricted to the midlatitude Pacific Ocean, the Atlantic Ocean near Greenland, and some land regions (Figure 2c, right). The geographical distributions of BS appear similar to those of r and are thus not shown.
Part of the RMSE is the mean model bias [Murphy, 1988], which is useful information for interpreting or calibrating the model outputs. Recall that MB here is for the SAT trend rather than the SAT itself; it thus represents the average degree of over- or underestimation of SAT changes by each model. The MB of the EMs also indicates how the simulated probability distribution of the SAT change is located relative to that of the observations. The spatial pattern and magnitude of the MB are found to be largely independent of the temporal scale, particularly over the continents (Figure 2d). The average of the individual members' MB from CMIP5 is shown, but the result is almost identical for its EM. For the 10-year trend, negative biases (i.e., on average simulating a smaller SAT change than observed) are found over the midlatitudes of Eurasia and the high latitudes of North America, while positive biases are found in the high latitudes and the Pacific Ocean (Figure 2d, left). With longer temporal scales the MB decreases over the ocean, in places changing from positive to negative, but not so much over land (Figure 2d, right).
The running trend time series and corresponding rank histograms from two grid points in the Atlantic Ocean are shown in Figure 3 to illustrate how the MB and ensemble spread change with geography and temporal scale. Over the tropical Atlantic, the CMIP5 ensemble captures both the mean and variability of the 10-year SAT trend reasonably well (Figure 3a). The ratio of the χ2 test statistic to the critical χ2 value at the α = 0.10 level is 0.75 (values greater than one favor an uneven histogram, i.e., an ensemble inconsistent with the observation), which suggests insufficient evidence to reject reliability at this level. In the figure we also show the contributions (%) from the MB and from the spread error of the ensemble to the total χ2 statistic. We note that the spread-error component (U-shaped rank histogram) can also be influenced by MB [Hamill, 2001], but in this example it more likely reflects slight underdispersion. For the 50-year trend at the same grid point, the ensemble is clearly biased toward a low SAT trend for the periods centered around 1920 and 1970 (Figure 3b). Over the North Atlantic, the CMIP5 ensemble is not heavily biased but is overdispersive for the 10-year trend (Figure 3c). This overdispersion, however, is not clearly expressed by the rank histogram because there are several time windows in which the observed trend is in the lower end of the ensemble range. For the 50-year trend, the ensemble becomes more biased toward a high SAT trend (Figure 3d). The contribution of MB to the total error for the 50-year trends is apparent, and it may also be included in the ensemble spread-error component (13% and 14% for the tropical and North Atlantic grid boxes, respectively).
The consistency of the ensemble spread with the observed variability can also be assessed by the closeness of the RMSE of the EM to the standard deviation (spread) of the ensemble, assuming the ensemble is unbiased [Palmer et al., 2006; Johnson and Bowler, 2009]. Note that the RMSE represents the 'average' difference between the EM and the observation, while the ensemble standard deviation is the average distance between the EM and the ensemble members; their equality is thus one of the necessary conditions for the ensemble members and the observation to be statistically indistinguishable [Johnson and Bowler, 2009]. Here we compare the ensemble standard deviation averaged over time (σens) with the 'centered pattern RMSE' (CRMSE hereafter) of the EM (CRMSE2 = RMSE2 − MB2 [Taylor, 2001]), which represents the differences in the temporal pattern between the two time series. We use CRMSE instead of RMSE to focus on the spread of the ensemble. The spatial pattern of the σens/CRMSE ratio suggests that for the shorter temporal scale the SAT trends in the CMIP5 ensemble tend to be underdispersive (<1; the ensemble spread does not extend enough to cover the observed trend variability) over most of the Southern Hemisphere oceans, while they appear more overdispersive (>1) over the northern Atlantic and Pacific Oceans, particularly near the continents (Figure 2e, left). For the longer trends, overdispersive predictions become more common and extend over the continents (Figure 2e, right). It is interesting that the pronounced reddish color over the northwestern Pacific and North Atlantic for the 10-year trend remains over the former but diminishes over the latter region for the 50-year trend, indicating different sources of model error in the two regions. These results in Figure 2 are similar for the CMIP3 models (not shown).
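The σens/CRMSE diagnostic follows directly from the definitions above, with CRMSE^2 = RMSE^2 - MB^2 as in Taylor [2001]. A minimal sketch (variable and function names are ours):

```python
import numpy as np

def spread_error_ratio(members, obs):
    """sigma_ens / CRMSE, where CRMSE^2 = RMSE^2 - MB^2 [Taylor, 2001].

    members : (n_members, n_time) simulated running trends
    obs     : (n_time,) observed running trends
    Ratios below one suggest underdispersion, above one overdispersion.
    """
    em = members.mean(axis=0)                       # ensemble mean
    mb = float(np.mean(em - obs))                   # mean bias
    rmse = float(np.sqrt(np.mean((em - obs) ** 2)))
    crmse = np.sqrt(rmse ** 2 - mb ** 2)            # centered pattern RMSE
    sigma_ens = float(np.mean(members.std(axis=0))) # time-mean ensemble spread
    return sigma_ens / crmse
```

Subtracting MB before the comparison is what lets the ratio isolate the spread error from the bias discussed above.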
3.3. Scale Dependency
The distributions of the above error statistics are summarized in box plots for all the studied scales (Figure 4). As described in section 2, the difference between the two observational data sets is used in the 'perturbed ensemble' approach to estimate its effect on the performance statistics, and the typical range (median across grid points) of the 95% confidence interval is shown as gray shading around the CMIP5 results. The error statistics between the two observations are also included for RMSE and r. For all the statistics shown in this section, the zonal means with different latitudinal widths (5°, 15°, and 30°) give nearly identical interquartile ranges.
It is clear that the RMSE of the EMs is usually lower than that of individual members and that the EMs of CMIP3 and CMIP5 have indistinguishable RMSE (Figures 4a and 4b). For individual members, at the three largest spatial scales and temporal scales of 30 years or longer, the average RMSE is smaller for the CMIP3 ensemble. The actual magnitude of RMSE ranges from 0.1 to 1°C decade−1 at the smallest scale (5° × 5° grid and 10-year trend) and is reduced to 0.03–0.1°C decade−1 at the global and 50-year scale (Figure S2).
It is clearly seen that r strongly depends on the spatiotemporal scale for both the EMs and individual members (Figures 4c and 4d). Only the EM and a few members of the CMIP5 ensemble achieve r higher than 0.5 at spatial scales ≤ 30° × 30° for the 40- and 50-year trends. Both EMs generally show higher r than individual members at these spatial scales, likely because the smoother time series of the EMs can achieve higher r by chance than a single, noisier member at shorter temporal scales. At larger spatial scales several members exhibit higher r than the EMs, indicating that the greater predictability of the SAT trend at these scales gives a single member a better chance of exceeding the r of the EM. We note the higher sampling uncertainty when fewer regions are available to derive the median than at the gridded scales, as shown by the broad 95% confidence intervals for the available length of observation (dashed lines in the global scale panel of Figure 4). Note also that the uncertainty based on the difference between the two observational data sets is minimal at the large scales. No particular model has consistently higher r than the others or than the EM (not shown).
The distribution also shows that CMIP5 consistently attains better r than CMIP3, particularly at the smaller spatial scales, although the differences between the CMIP3 and CMIP5 EMs do not attain field significance [Livezey and Chen, 1983; Wilks, 1997]. The global distribution of BS leads to conclusions similar to those for r (Figure 4e). The CMIP5 ensemble generally obtains better BS than CMIP3, particularly at the smaller spatial scales with longer temporal scales. The CMIP5 ensemble also shows better skill on average than that obtained from a constant climatological probability based on HadCRUT3.
Three features are clear in the distribution of MB (Figure 5a): (1) relatively similar magnitude across spatial scales, (2) higher sensitivity to temporal scale at the zonal mean or larger spatial scales, and (3) strong model dependence. The implication of (1) is that MB becomes the dominant error component at large scales as the model errors in the variability and temporal phase decrease with increasing scale [see also Sakaguchi et al., 2012]. The model dependence is fairly consistent across scales, possibly reflecting the characteristic sensitivity of SAT to changes in climate forcings in each model [e.g., Soden and Held, 2006]. For example, the median MB of all the members of ESM-LR and CCSM4 is positive at all spatiotemporal scales, while those of CM2, MIROC-ESM, and CGCM3 tend to be negative. Such scale and model dependencies of MB are also observed in the CMIP3 ensemble. The seven models selected for the CMIP3 ensemble happen to have a more balanced distribution of negatively and positively biased models, giving the CMIP3 EM a smaller SAT trend bias than the CMIP5 EM. This could differ if another set of models were chosen.
Last, the distribution of the σens/CRMSE ratio shows that the CMIP5 ensemble tends to be underdispersive at spatial scales of 20° × 20° or smaller, but it becomes increasingly overdispersive as the spatiotemporal scale increases (Figure 5b). The σens/CRMSE ratio reaches ∼1.4 for the 50-year trend of the global mean. Similar to MB, this ratio depends on temporal scale in a rather complex manner rather than simply improving with longer temporal scales. The CMIP3 ensemble tends to be more underdispersive than the CMIP5 ensemble.
As the first goal of this study, we attempted to document the expected range and characteristics of the model errors in the SAT trend across spatiotemporal scales. Part of our analysis characterizes the model error by MB, particularly that of the EMs, and by the variability of the SAT trend expressed as the spread of the ensemble. The results suggest that the dependence of these ensemble prediction errors on temporal scale is strong at larger spatial scales but relatively weak at the smaller (<30° × 30°) spatial scales. Therefore, if the model performance over the historical period is used to correct the model bias or the variability of the ensemble for a future projection of the SAT trend over a given grid box, the same correction could be used across different temporal scales for that grid box. The same idea may not apply to model weighting based on RMSE, however, since the model ranking may depend on the temporal scale. Obviously, such corrections or model weightings should not be applied across spatiotemporal scales.
In addition, the model-dependent MB indicates that model weighting based on squared errors may not fully optimize the ensemble prediction, since squaring loses the sign of the errors. This is more important at longer temporal scales, at which MB becomes the dominant component of RMSE. We also see a weaker but clear model dependence in the variability error (the difference between the simulated and observed standard deviations of the running trends, not shown), so that each model contributes differently to the ensemble spread of the running trend. It is also interesting that the model spread increased from CMIP3 to CMIP5. Several studies have suggested that CMIP ensembles are likely underdispersed, based on the narrower range of climate sensitivity in the participating GCMs compared to other observational and modeling estimates, and on their construction as 'ensembles of opportunity' [Tebaldi and Knutti, 2007; Knutti et al., 2010a], although other studies reach different conclusions [Annan and Hargreaves, 2010, 2011]. Our study shows that the two ensembles tend to be underdispersed across most of the scales (Figure 4), but the CMIP5 ensemble has a wider spread than the CMIP3 ensemble and is possibly overdispersive at large spatiotemporal scales.
For our limited use of the rank histogram results, an interesting issue could be raised: even though the inter-dependence of the verification samples (i.e., auto-correlation between the running trend values) is accounted for to some degree, possible inter-dependence among the ensemble members and models [Jun et al., 2008; Masson and Knutti, 2011a] may reduce the degrees of freedom of the test (which equal the number of ensemble members) for the rank histogram. Previous studies rescaled the degrees of freedom based on estimates of the effective degrees of freedom in the field of interest [Jolliffe and Primo, 2008; Annan and Hargreaves, 2010], but we are not aware of a similar rescaling based on the inter-dependence among models. Further, one could hypothesize that the independence of the models, if it can be quantified, is scale-dependent, for two reasons: at smaller spatiotemporal scales the simulated natural climate variability dominates the SAT trend and makes the ensemble members differ more from each other; at larger spatiotemporal scales the signals from the external forcings become more prominent, so two members from the same model, or two models from the same institution that share the same forcings, may be quite alike.
While a direct answer may not be available, we can still partially address the above issue through sensitivity tests. Figure 6a shows the median and the 25th–75th percentile range of the χ2 test statistic across the grid from the CMIP5 ensemble, after division by the critical χ2 value at the α = 0.10 level. The degrees of freedom are reduced to 50% and 20% for the green and blue lines, respectively. Reducing the degrees of freedom by 50% changes this fraction by ∼0.1 on average, but the 10-year scale is more sensitive (Figure 6a). With the reduction to 20%, the 10-year trend across all spatial scales by the CMIP5 ensemble would be considered inconsistent with the observation. Figure 6b shows the same χ2 test fractions from different sets of ensembles: another CMIP5 ensemble composed of 14 models (51 members), our default CMIP5 ensemble with seven models (29 members), and two single-model ensembles by CSIRO Mk3.6.0 (10 members) and NCAR CCSM4 (6 members). The dashed lines are the two multimodel CMIP5 ensembles, but their members are the ensemble mean of each model instead of the raw ensemble members (therefore 14 and 7 members, respectively). Comparing Figures 6a and 6b indicates that the sensitivity of the test statistic to a change of 50% or more in the degrees of freedom is comparable to that of considerable changes in the ensemble configuration. It is not possible to infer more from this comparison, since Figure 6b also reflects changes in ensemble performance while Figure 6a does not. For instance, in Figure 6b the ratio approaches unity as the number of ensemble members increases, particularly over the smaller spatiotemporal scales. This indicates that either the ensemble becomes less consistent with the observed SAT trend or the statistical test gains power as more members are added, although the former is less likely.
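The mechanics of this sensitivity test can be sketched as follows: rank each observation within its ensemble, form the rank histogram, compute the χ2 goodness-of-fit statistic against uniformity, and divide by critical values computed with the full and rescaled degrees of freedom. The data below are synthetic (a statistically consistent 29-member ensemble), purely for illustration; shrinking the degrees of freedom shrinks the critical value and so inflates the ratio.

```python
import numpy as np
from scipy.stats import chi2

def rank_histogram(obs, ens):
    """Counts of the rank of each observation within its ensemble (ranks 0..m)."""
    ranks = (ens < obs[:, None]).sum(axis=1)     # obs: (n,), ens: (n, m)
    m = ens.shape[1]
    return np.bincount(ranks, minlength=m + 1)

def chi2_ratio(counts, dof_fraction=1.0, alpha=0.10):
    """Chi-square statistic divided by the critical value, with rescaled dof."""
    n, k = counts.sum(), len(counts)             # k = m + 1 bins
    expected = n / k
    stat = ((counts - expected) ** 2 / expected).sum()
    dof = max(1, round((k - 1) * dof_fraction))  # e.g. 0.5 or 0.2 of (k - 1)
    return stat / chi2.ppf(1 - alpha, dof)       # ratio > 1: reject uniformity

rng = np.random.default_rng(0)
obs = rng.normal(size=200)
ens = rng.normal(size=(200, 29))                 # 29 'members', same distribution
counts = rank_histogram(obs, ens)
for f in (1.0, 0.5, 0.2):
    print(f"dof x{f}: stat/critical = {chi2_ratio(counts, f):.2f}")
```

With independent verification samples the ratio at the full degrees of freedom should fall below unity here; the same statistic judged against 20% of the degrees of freedom can cross the rejection threshold, which is the effect discussed for the 10-year trends.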
The single-model and multimodel ensembles in Figure 6b show similar results to each other when the numbers of members are close (CMIP5–7EMs and CCSM4, CMIP5–14EMs and Mk3.6.0) over the smaller spatial scales. However, the error characteristics of the single-model and the multimodel ensembles can be quite different. One such example is shown in Figure 6c, where χ2 test fractions are again plotted, but for the spread error component (with some influence from MB) after decomposition of the χ2 test statistic following Jolliffe and Primo [2008]. Compared to the raw multimodel ensembles, both the single-model ensembles and the model-mean ensembles have a higher fraction of dispersion errors. Interestingly, their dependence on the temporal scale differs. For the two single-model ensembles, the dispersion error component generally increases with the temporal scale, while for the multimodel-mean ensembles the opposite holds, at least for the 10° × 10° to 30° × 30° spatial scales. This result probably reflects the limited sampling of the possible SAT trends by a single model.
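The idea behind this decomposition is to project the rank histogram's deviations from uniformity onto orthonormal contrast vectors: a linear contrast isolates a bias-like component and a V-shaped (ends-versus-middle) contrast isolates a dispersion component, with all such components summing to the total χ2. The sketch below is a hypothetical illustration of that projection, not a reproduction of the computation in the paper; the example counts are invented.

```python
import numpy as np

def jp_components(counts):
    """Bias and dispersion components of the rank-histogram chi-square.

    Projects the normalized deviations onto orthonormal contrasts
    (linear -> bias; V-shaped -> over/under-dispersion). The full set of
    k-1 such components would sum to the total chi-square statistic.
    """
    counts = np.asarray(counts, dtype=float)
    k, n = len(counts), counts.sum()
    dev = (counts - n / k) / np.sqrt(n / k)      # normalized deviations
    total = (dev ** 2).sum()                     # usual chi-square statistic

    lin = np.arange(k) - (k - 1) / 2             # linear contrast: bias
    lin /= np.linalg.norm(lin)
    vee = np.abs(np.arange(k) - (k - 1) / 2)     # V shape: dispersion
    vee -= vee.mean()
    vee -= (vee @ lin) * lin                     # keep contrasts orthogonal
    vee /= np.linalg.norm(vee)

    return total, (lin @ dev) ** 2, (vee @ dev) ** 2

# U-shaped histogram, typical of an underdispersed ensemble
total, bias_c, disp_c = jp_components([30, 10, 5, 5, 10, 30])
print(f"total={total:.1f}  bias={bias_c:.1f}  dispersion={disp_c:.1f}")
```

For the symmetric U-shaped counts above, essentially all of the statistic loads onto the dispersion contrast; a sloped histogram would instead load onto the bias contrast.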
We also explored the sensitivity of BS to these different ensembles. The single-model ensembles do not show the clear improvement of BS with increasing temporal scale that the multimodel ensembles do (Figure 6d). Somewhat surprisingly, we see little change in BS from our default 7-model CMIP5 ensemble to the 14-model CMIP5 ensemble. The difference in BS between the raw-member ensembles and the model-mean ensembles is large at smaller spatiotemporal scales, up to 0.4 for the seven-model CMIP5 ensembles. However, this difference diminishes at the 50-year trend. These two points suggest that the sampling of natural variability becomes less important than the sampling of model structural errors with increasing scale, and that there is redundancy in the information provided by additional ensemble members/models.
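The Brier score comparisons used throughout can be sketched for the event 'positive SAT trend': the forecast probability is the fraction of members simulating warming, and BS is the mean squared difference from the binary outcome, so a 50–50 guess always scores 0.25. The series below are synthetic stand-ins, not the CMIP trends, chosen only to contrast a skillful ensemble with a no-skill one.

```python
import numpy as np

def brier_score(trends_ens, trends_obs):
    """BS for the event 'positive trend', from ensemble member trends.

    trends_ens: (n_cases, n_members); trends_obs: (n_cases,). Lower is better.
    """
    p = (np.asarray(trends_ens) > 0).mean(axis=1)    # forecast probability
    o = (np.asarray(trends_obs) > 0).astype(float)   # binary outcome
    return ((p - o) ** 2).mean()

rng = np.random.default_rng(1)
obs = rng.normal(0.1, 0.2, size=500)                   # 'observed' trends
skill = obs[:, None] + rng.normal(0, 0.1, (500, 29))   # members track obs
noise = rng.normal(0.1, 0.2, (500, 29))                # members ignore obs

print("skillful ensemble:", brier_score(skill, obs))
print("no-skill ensemble:", brier_score(noise, obs))
print("50-50 guess      :", ((0.5 - (obs > 0)) ** 2).mean())   # always 0.25
```

The skillful ensemble beats both the no-skill ensemble and the 0.25 coin-toss reference, which is the sense in which the 20-year and longer trends are said to outperform a 50–50 guess.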
In comparing the performance of CMIP3 and CMIP5, the CMIP5 ensemble often shows improved performance, but the result is mixed and mostly not statistically significant. RMSE from the two ensembles is practically equivalent, and even appears smaller for the CMIP3 than the CMIP5 ensemble at the largest spatiotemporal scales. A potential reason is that the CMIP5 ensemble of this study has two models (CCSM4 and CM3) whose late 20th century warming is reported to be less consistent with the observed change than that of their previous-generation models [Donner et al., 2011; Gent et al., 2011], possibly contributing to the larger RMSE of the CMIP5 ensemble. On the other hand, r and BS are generally improved in the CMIP5 simulations. Although the differences between the two ensembles are mostly within the uncertainty due to observational error or limited sampling, it can be argued that the better r and BS at smaller spatial scales support an improvement of CMIP5 over CMIP3, since the SAT change at large scales is usually used to tune key parameters during model development [Knutti, 2008; Gent et al., 2011].
The goals of this study are to document (1) the expected range and characteristics of the model errors across spatiotemporal scales, (2) the 'threshold' spatiotemporal scales where model performance improves significantly, if one exists, and (3) whether the performance has improved from CMIP3 to CMIP5, in terms of the surface air temperature (SAT) trend. For (1), we first described the model skill using RMSE, correlation (r), and Brier score (BS) across the eight spatial scales from the 5° × 5° grid to the global average, and from 10-year to 50-year trends. For example, the typical errors for SAT change by the CMIP5 ensemble over 10, 30, and 50 years over a 5° × 5° grid are ∼0.80 (0.62), 0.20 (0.16), and 0.12 (0.09)°C decade−1, given as the median RMSE averaged across the individual members, with that of the EM in parentheses (Figure S2). The sign of the 10-year trend at the 5° × 5° – 30° × 30° scales would be difficult to simulate reliably: probabilistic predictions by the ensemble for the 10-year trend may not be much better than a simple 50–50 guess according to BS. However, for 20-year or longer trends, the CMIP5 ensemble prediction is equally or more skillful than the 50–50 coin toss or the climatological forecast. Since the long-term CMIP simulations are not designed to systematically explore the uncertainty due to natural variability (although some attempts were made, such as sampling different states of the North Atlantic meridional overturning circulation in CCSM4 by Gent et al. [2011]), ensemble reliability at small scales seems difficult to obtain.
As for the second goal, we saw a significant improvement in model skill across the 30° × 30° scale and around the 30-year temporal scale. At spatiotemporal scales smaller than these, the hindcasts do not show strong skill. The rapid improvement from the gridded 30° × 30° scale to the zonal average implies that averaging out longitudinal features, such as the land-ocean contrast or the differing natural variability over the Pacific and Atlantic Oceans, might significantly improve the reliability of the simulated SAT trend. The spatiotemporal scales with more reliable model skill identified in this study are consistent with previous studies [Randall et al., 2007] and suggest caution in directly using the outputs of the long-term simulations for regional and decadal studies.
On our third goal, we find no significant differences between the performance of CMIP3 and CMIP5 at the large spatiotemporal scales. However, the CMIP5 ensemble often shows better r and BS at smaller scales, arguably indicating improvements made in CMIP5 in the temporal dynamics of SAT.
Since the scope of this study is to provide a general diagnosis of scale dependency from the long-term CMIP experiments, it leaves out several interesting issues that need more focused attention. Quantification of model interdependence is being attempted in the community [e.g., Sanderson et al., 2012], and its application to model evaluation will be one of the key issues. The grid structure is not designed for specific geographical regions with good or poor model performance or with high interest for global climate (e.g., grid boxes at different spatial scales are not centered on a particular ENSO index region or arranged to match those often used for regional climate studies following Giorgi and Francisco). Other examples are the different types of CMIP5 experiments. Our preliminary analysis of five realizations of the coupled-carbon historical experiments showed that their RMSE and r are within the ensemble range of the standard historical simulations, but the ensemble characteristics need to be evaluated. Also, the initialized decadal (30-year) simulations showed a wider range of performance, some achieving substantially better performance statistics than the members from the long-term experiment. Such additional work will be reported elsewhere.
This work was supported by NASA grant NNX09A021G, NSF grant AGS-0944101, and DOE grant DE-SC0006773. We acknowledge the World Climate Research Program's Working Group on Coupled Modeling, which is responsible for CMIP, and the modeling centers (listed in Table 1 and Figure 6) for making the model output available. For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provided coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. We thank Avelino Arellano, Benjamin Sanderson, and the anonymous reviewers for their insightful comments. The HadCRUT3 data were obtained from the Web site of the University of East Anglia Climatic Research Unit. The NCDC data were obtained from the NOAA NCDC Web site.