Coupled Model Intercomparison Project phase 3 (CMIP3) simulations that included time-varying radiative forcings were ranked according to their ability to consistently reproduce twentieth-century intradecadal to multidecadal (IMD) surface temperature variability at the 5° by 5° spatial scale. IMD variability was identified using the running Mann-Whitney Z method. Model rankings were given context by comparing the IMD variability in preindustrial control runs to observations and by contrasting the IMD variability among the ensemble members within each model. These experiments confirmed that the inclusion of time-varying external forcings brought simulations into closer agreement with observations. They also showed that the magnitude of unforced variability differed between models, which motivated a supplementary metric that assessed each model's ability to reproduce observations while accounting for its own degree of unforced variability. Together, the two metrics revealed that discernible differences in skill exist between models and that none of the models reproduced observations at its theoretical optimum level. Overall, these results demonstrate a methodology for assessing coupled models relative to each other within a multimodel framework.
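
The running Mann-Whitney Z idea can be illustrated with a minimal sketch. It assumes a single grid-cell time series, ranks all values once, and converts the Mann-Whitney U statistic of each moving window (window versus the rest of the record) to a Z score using the textbook normal approximation without a tie correction. The function names `average_ranks` and `running_mann_whitney_z` and the `window` parameter are hypothetical, and the null distribution used in the study itself may have been derived differently (for example, by Monte Carlo resampling), so this is not the authors' implementation.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the full run of values tied with order[i].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def running_mann_whitney_z(series, window):
    """Z score of each moving window's ranks against the rest of the series.

    Assumption: normal approximation to the null distribution of U,
    with no tie correction and no adjustment for autocorrelation.
    """
    n = len(series)
    if not 0 < window < n:
        raise ValueError("window must be between 1 and len(series) - 1")
    ranks = average_ranks(series)
    mean_u = window * (n - window) / 2
    sd_u = math.sqrt(window * (n - window) * (n + 1) / 12)
    z_scores = []
    for start in range(n - window + 1):
        rank_sum = sum(ranks[start:start + window])
        u = rank_sum - window * (window + 1) / 2
        z_scores.append((u - mean_u) / sd_u)
    return z_scores


if __name__ == "__main__":
    # Hypothetical annual anomalies with a warm run at the end of the record.
    anomalies = [-0.2, -0.1, -0.3, 0.0, -0.1, 0.1, 0.3, 0.4, 0.5, 0.6]
    for start, z in enumerate(running_mann_whitney_z(anomalies, window=3)):
        print(start, round(z, 2))
```

Windows with strongly positive Z contain a concentration of high-ranked (warm) values and windows with strongly negative Z a concentration of low-ranked (cool) values. Repeating the calculation over several window lengths would be one way to span intradecadal to multidecadal scales.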