Linear trends of surface air temperature (SAT) simulated by selected models from the Coupled Model Intercomparison Project (CMIP3 and CMIP5) historical experiments are evaluated against observations to document (1) the expected range and characteristics of the errors in hindcasting the ‘change’ in SAT at different spatiotemporal scales, (2) whether there are ‘threshold’ spatiotemporal scales across which the models show substantially improved performance, and (3) how these errors and thresholds differ between CMIP3 and CMIP5. Root mean square error, linear correlation, and Brier score all show better agreement with the observations as the spatiotemporal scale increases, but the skill at regional (5° × 5° – 20° × 20° grid) and decadal (10 – ∼30-year trends) scales is rather limited. Rapid improvements are seen between the 30° × 30° grid and the zonal average, and at around 30 years, although the exact thresholds depend on the performance statistic used. The rather abrupt change in performance from the 30° × 30° grid to the zonal average implies that averaging out longitudinal features, such as the land-ocean contrast, might significantly improve the reliability of the simulated SAT trend. The mean bias and the ensemble spread relative to the observed variability, which are crucial to the reliability of the ensemble distribution, do not necessarily improve with increasing scale and may affect probabilistic predictions more at longer temporal scales. No significant differences are found between the performance of CMIP3 and CMIP5 at large spatiotemporal scales, but at smaller scales the CMIP5 ensemble often shows better correlation and Brier score, indicating improvements in CMIP5 in the temporal dynamics of SAT at regional and decadal scales.
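
As a rough illustration of the kind of evaluation summarized above, the following Python sketch computes sliding linear trends and the three performance statistics (root mean square error, linear correlation, Brier score). It is not the procedure used in the study: the function names, the array layout (ensemble members by cases), and in particular the Brier-score event ("trend exceeds a threshold", with the forecast probability taken as the fraction of ensemble members exceeding it) are assumptions made only for illustration.

```python
import numpy as np


def linear_trend(series, years):
    """Least-squares slope of `series` against `years` (units per year)."""
    years = np.asarray(years, dtype=float)
    series = np.asarray(series, dtype=float)
    # Centre the time axis so the fit stays well conditioned.
    slope, _intercept = np.polyfit(years - years.mean(), series, 1)
    return float(slope)


def sliding_trends(series, years, window):
    """Trends over every consecutive `window`-year segment of the series."""
    n_segments = len(series) - window + 1
    return np.array([
        linear_trend(series[i:i + window], years[i:i + window])
        for i in range(n_segments)
    ])


def rmse(sim, obs):
    """Root mean square error between simulated and observed trends."""
    diff = np.asarray(sim, dtype=float) - np.asarray(obs, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def correlation(sim, obs):
    """Pearson linear correlation between simulated and observed trends."""
    return float(np.corrcoef(sim, obs)[0, 1])


def brier_score(ensemble_trends, obs_trends, threshold=0.0):
    """Brier score for the assumed event 'trend > threshold'.

    ensemble_trends: array of shape (members, cases)
    obs_trends:      array of shape (cases,)
    """
    ensemble_trends = np.asarray(ensemble_trends, dtype=float)
    obs_trends = np.asarray(obs_trends, dtype=float)
    # Forecast probability: fraction of members predicting the event.
    prob = np.mean(ensemble_trends > threshold, axis=0)
    # Observed outcome: 1 where the event occurred, 0 otherwise.
    outcome = (obs_trends > threshold).astype(float)
    return float(np.mean((prob - outcome) ** 2))
```

Under these assumptions, coarser spatial scales would be examined by averaging the gridded series over larger boxes (or over all longitudes for a zonal average) before calling `sliding_trends`, and longer temporal scales by increasing `window`.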