• climate model selection;
• model skill;
• multi-model ensemble

[1] The principle of selecting climate models based on their agreement with observations has been tested for surface temperature using 17 of the IPCC AR4 models. Models that simulate global mean, Siberian and European 20th-century surface temperature with a lower error than the total ensemble in one period do not, on average, do so in a subsequent period. Error in the ensemble mean decreases systematically with ensemble size, N, and for a random selection of models it decreases approximately as 1/N^α, where α lies between 0.6 and 1. This exponent is larger than that of a random sample (α = 0.5) and appears to be an indicator of systematic bias in the model simulations. There is no evidence that any subset of models delivers a significant improvement in prediction accuracy compared to the total ensemble.
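The scaling of ensemble-mean error with ensemble size can be illustrated with a minimal sketch. It is not the procedure used in this study: the function name `ensemble_error_exponent`, the number of random draws, and the synthetic error array are all assumptions made for illustration. The sketch draws random N-member subsets, computes the root-mean-square error of each subset mean, and fits α in error ≈ 1/N^α by a straight-line fit in log-log space. The synthetic errors are independent and zero-mean, so the fitted exponent should come out near the random-sample value of 0.5.

```python
import numpy as np


def ensemble_error_exponent(errors, n_draws=200, seed=0):
    """Fit alpha in RMSE(N) ~ 1/N**alpha for random N-member subsets.

    errors: array of shape (n_models, n_times) holding model-minus-observation
    values. This is a hypothetical input layout, not the study's data format.
    """
    rng = np.random.default_rng(seed)
    n_models = errors.shape[0]
    sizes = np.arange(1, n_models + 1)
    mean_rmse = np.empty(sizes.size)

    for i, n in enumerate(sizes):
        rmse = np.empty(n_draws)
        for d in range(n_draws):
            # Pick n distinct models at random.
            members = rng.choice(n_models, size=n, replace=False)
            # Error of the subset mean equals the mean of the member errors.
            ens_mean_error = errors[members].mean(axis=0)
            rmse[d] = np.sqrt(np.mean(ens_mean_error ** 2))
        mean_rmse[i] = rmse.mean()

    # The slope of log(RMSE) against log(N) is -alpha.
    slope, _ = np.polyfit(np.log(sizes), np.log(mean_rmse), 1)
    return -slope, sizes, mean_rmse


# Synthetic stand-in for 17 models: independent, zero-mean errors (assumption).
synthetic_errors = np.random.default_rng(42).normal(0.0, 1.0, size=(17, 2000))
alpha, sizes, mean_rmse = ensemble_error_exponent(synthetic_errors)
print(f"fitted alpha = {alpha:.2f}")
```

Applied to real model-minus-observation series, the same fit would give the exponent to compare against the range of 0.6 to 1 reported above.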