Attribution of observed historical near‒surface temperature variations to anthropogenic and natural causes using CMIP5 simulations



[1] We have carried out an investigation into the causes of changes in near‒surface temperatures from 1860 to 2010. We analyze the HadCRUT4 observational data set which has the most comprehensive set of adjustments available to date for systematic biases in sea surface temperatures and the CMIP5 ensemble of coupled models which represents the most sophisticated multi‒model climate modeling exercise yet carried out. Simulations that incorporate both anthropogenic and natural factors span changes in observed temperatures between 1860 and 2010, while simulations of natural factors do not warm as much as observed. As a result of sampling a much wider range of structural modeling uncertainty, we find a wider spread of historic temperature changes in CMIP5 than was simulated by the previous multi‒model ensemble, CMIP3. However, calculations of attributable temperature trends based on optimal detection support previous conclusions that human‒induced greenhouse gases dominate observed global warming since the mid‒20th century. With a much wider exploration of model uncertainty than previously carried out, we find that individually the models give a wide range of possible counteracting cooling from the direct and indirect effects of aerosols and other non‒greenhouse gas anthropogenic forcings. Analyzing the multi‒model mean over 1951–2010 (focusing on the most robust result), we estimate a range of possible contributions to the observed warming of approximately 0.6 K from greenhouse gases of between 0.6 and 1.2 K, balanced by a counteracting cooling from other anthropogenic forcings of between 0 and −0.5 K.

1 Introduction

[2] Successive reports of the Intergovernmental Panel on Climate Change (IPCC) have come to increasingly confident assessments of the dominant role of human‒induced greenhouse gas emissions in causing increasing global near‒surface temperatures [e.g., IPCC, 2007a]. Furthermore, analyses of changes across the climate system, including of sub‒surface ocean temperatures, of the water cycle, and of the cryosphere, show that as the observational evidence accumulates, there is an increasingly remote possibility that climate change is dominated by natural rather than anthropogenic factors [Stott et al., 2010]. Thus, the IPCC concluded in its fourth assessment report that “most of the observed increase in global average temperatures since the mid‒20th century is very likely due to the observed increase in anthropogenic greenhouse gas concentrations” [IPCC, 2007a].

[3] However, despite the high level of confidence in the conclusion that greenhouse gases caused a substantial part of the observed warming, there remain many uncertainties that have so far limited the potential to be more precise about such attribution statements. These uncertainties are associated with observational uncertainties caused by remaining biases in data sets and gaps in global coverage [Morice et al., 2012] and modeling uncertainties, which limit the ability to define the expected fingerprints of change due to anthropogenic and natural factors, and which result from errors in model formulation, deficiencies in model resolution, and inadequacies in the way external climate forcings are specified. All attribution results are contingent on such remaining uncertainties and until now they have been explored in a relatively limited way. Many previous attribution studies have been limited to a single observational data set and a single climate model [e.g., Tett et al., 2002] or a rather limited ensemble of different climate models [e.g., Hegerl et al., 2000; Gillett et al., 2002; Huntingford et al., 2006] while another study used simple climate models to emulate global mean temperatures from over a dozen models [Stone et al., 2007].

[4] There have also been studies that have used simple time series methods to determine contributions of forcings to observed global temperatures [Lockwood, 2008; Lean and Rind, 2008; Kaufmann et al., 2011; Foster and Rahmstorf, 2011]. While many such studies are broadly consistent with studies using coupled climate models in finding a dominant role for greenhouse gases in explaining recent warming, the advantage of using climate models rather than simple statistical relationships with forcings is that coupled atmosphere‒ocean general circulation models (AOGCMs) attempt to simulate all the most important physical processes in the climate system that lead to a model's response to a particular forcing. Responses are therefore emergent from the model and not imposed upon it [Hegerl and Zwiers, 2011]. Also climate models produce temperature variations over space as well as time and this study analyzes these space‒time variations.

[5] There has been a growing realization in the climate science community of the importance of testing the robustness of results to observational uncertainty. One study [Hegerl et al., 2001] deduced that one type of observational uncertainty, grid box sampling error, had little impact on the attribution of anthropogenic influence on temperature trends. A more recent study [Jones and Stott, 2011] analyzed four global temperature data sets (GISS, NCDC, JMA, and HadCRUT3; see section 3) together with one climate model and concluded that the choice of observational data set had little impact on the attribution of greenhouse gas warming. Therefore, the conclusion that greenhouse gases were the dominant contributor to global warming over the last 50 years of the 20th century is robust to that observational uncertainty.

[6] Since that study, there have been two important developments. First, a new analysis of global temperatures, HadCRUT4, has been released that includes a much more thorough investigation of systematic biases in sea surface temperature measurements than carried out previously [Morice et al., 2012]. While an overall global warming trend is still seen, the detailed nature of the time series has changed, particularly in the middle part of the 20th century [Thompson et al., 2008]. It is therefore worth exploring how this change affects attribution results. Second, results from the CMIP5 experiment have become available, the most complete exploration of climate model uncertainty in simulating the last 150 years ever undertaken [Taylor et al., 2012]. The new multi‒model ensemble of opportunity includes a new generation of climate models with more sophisticated treatments of forcings, including aerosols and land use changes. As part of the experimental design, it also includes ensembles of simulations with both anthropogenic and natural forcings, as well as alternative ensembles which just include natural forcings, and ensembles which include just changes in well‒mixed greenhouse gases. Therefore, the CMIP5 ensemble provides a much more thorough and up to date exploration of modeling uncertainty than available from the CMIP3 ensemble [Meehl et al., 2007a].

[7] In this paper therefore we have the opportunity to undertake the widest exploration yet of the effects of modeling uncertainty when applied to the most up‒to‒date observational estimates of data until 2010. In section 2 we describe the CMIP model simulations, in section 3 we discuss the observational data sets we use, in section 4 we compare results from the CMIP3 and CMIP5 model ensembles, and in section 5 we compare the CMIP models with observations. In section 6 we carry out optimal detection analyses and describe the results using standard techniques that have been widely applied to near‒surface temperature data and other climate data over the last 10 years [Tett et al., 2002; Gillett et al., 2002; Nagashima et al., 2006; Zhang et al., 2007]. In section 7 we provide conclusions.

2 Climate Model Intercomparison Project

[8] The World Climate Research Programme's Coupled Model Intercomparison Project phase 3 (CMIP3) and phase 5 (CMIP5) are arguably the largest collaborative effort for bringing together climate model data for access by climate scientists across the world. The CMIP3 repository [Meehl et al., 2007a] was a major contributor for model data used in studies assessed by the IPCC's fourth assessment report [IPCC, 2007b] while the CMIP5 repository [Taylor et al., 2012] will be used in many studies to be assessed by the IPCC's fifth assessment report due to be published in 2013. Over 20 different institutions and groups have used over 60 different climate models to produce simulations for dozens of experiments to contribute data to both CMIP archives. All the climate models examined in this study are atmosphere‒ocean general circulation models (AOGCMs), where the ocean and atmosphere components are coupled together, covering a wide range of resolutions and sophistication of physical modeling.

[9] The different experiments represent sets of simulations that had different scenarios of forcing factors applied [Taylor et al., 2012]. The basic experiments examined in this study are piControl, historical, historicalNat, and historicalGHG (Table 1). The piControl experiment is a long simulation with no variations in external forcings; factors such as greenhouse gas concentrations are held fixed at pre‒industrial values. The other experiments are parallel to the piControl but with different forcing factor variations applied, each initialized from a different point in the piControl. It should be noted that the names of the experiments differ between CMIP5 and CMIP3; for instance, the historical experiment was called 20C3M in CMIP3 [Meehl et al., 2007a]. The 20C3M CMIP3 experiments comprised not only simulations driven by anthropogenic and natural forcing factors but also simulations driven by anthropogenic forcing factors only [Stone et al., 2007]. To be consistent with the CMIP5 historical experiments, we use only those 20C3M CMIP3 simulations that were driven by both anthropogenic and natural forcing factors. There are no historicalNat or historicalGHG type simulations in the CMIP3 archive, although historicalNat‒type simulations are presented in Figure 9.5 of Hegerl et al. [2007]; the authors retrieved those simulations from the institutions concerned (Dáithí Stone, personal communication). We have tried to collect the same data [Supplementary Materials in Hegerl et al., 2007] from the institutions and will call them CMIP3 simulations for simplicity's sake. Additional CMIP5 experiments used are historicalExt and rcp45 (Table 1), both extensions to the historical experiments [Taylor et al., 2012]. The 20C3M CMIP3 experiments are extended if an appropriate A1B simulation is available [Meehl et al., 2007a]. These extensions take their initial conditions from the end of the equivalent historical experiment, in the case of CMIP5 in 2005. To avoid the use of confusingly different terminology, we will generally use the CMIP5 terms (Table 1).

Table 1. Model Experiment Definitions as Used in the CMIP5 Archive [Taylor et al., 2012], With Equivalent CMIP3 Terms [Meehl et al., 2007a]
Experiment Name | Description | CMIP3 Name
piControl | Pre‒industrial control simulation | picntrl
historical | 20th century (1850–2005) forced by anthropogenic and natural factors | 20c3m
historicalNat | 20th century (1850–2005) forced by only natural factors | N.A.
historicalGHG | 20th century (1850–2005) forced by only greenhouse gas factors | N.A.
historicalExt | Extension to historical experiment from 2005 onward | A1B
rcp45 | 21st century (2005–2100) forced by RCP4.5 factors | A1B

2.1 Models

[10] A list of the different institutions involved in producing data for CMIP3 and CMIP5 is given in Table 2. Tables 3 and 4 list the models from CMIP3 and CMIP5 used in this study indicating the institutions involved and describing which experiments were made, how many initial condition ensembles were produced, and the length of the pre‒industrial control simulations that were available.

Table 2. Institutions That Supplied Model Data to CMIP3 and CMIP5 Repositories Used in This Study^a

Key | Institution
a | Atmosphere and Ocean Research Institute (The University of Tokyo), Japan
b | Beijing Climate Center, China Meteorological Administration, China
c | Bureau of Meteorology, Australia
d | Canadian Centre for Climate Modelling and Analysis, Canada
e | Centre National de Recherches Météorologiques, Météo‒France, France
f | Centre Europeen de Recherches et de Formation Avancee en Calcul Scientifique, France
g | Commonwealth Scientific and Industrial Research Organisation, Marine and Atmospheric Research, Australia
h | Institute for Numerical Mathematics, Russia
i | Institute of Atmospheric Physics, Chinese Academy of Sciences
j | Institut Pierre Simon Laplace, France
k | Japan Agency for Marine‒Earth Science and Technology, Japan
l | Max Planck Institute for Meteorology, Germany
m | Met Office Hadley Centre, UK
n | Meteorological Institute of the University of Bonn, Germany
o | Meteorological Research Institute, Japan
p | Meteorological Research Institute of KMA, Korea
q | NASA Goddard Institute for Space Studies, USA
r | National Center for Atmospheric Research, USA
s | National Institute for Environmental Studies, Japan
t | NOAA Geophysical Fluid Dynamics Laboratory, USA
u | Norwegian Climate Centre, Norway
v | Queensland Climate Change Centre of Excellence, Australia
w | Technology and National Institute for Environmental Studies, Japan
x | Tsinghua University, China

^a The key is used in Tables 3 and 4.
Table 3. CMIP3 Models Used in This Study, Their Institutions (“Inst.”, See Table 2), the Number of Initial Condition Ensemble Members for Each Experiment, and the Number of Years Available From Each piControl^a

Model | Inst. | historical | historicalNat | piControl Length
gfdl_cm2.0 | t | 3 [1] | — | 500
gfdl_cm2.1 | t | 5 [3] | — | 500
giss_model_eh | q | 5 [3] | — | 400
giss_model_er | q | 9 [5] | — | 500
inmcm3_0 | h | 1 [1] | — | 330
miroc3.2_hires | a,k,s | 1 [1] | — | —
miroc3.2_medres | a,k,s | 10 [10] | 10 [0] | 500
miub_echo_g | n,p | 4 [3] | 3 [0] | 340
mri_cgcm2.3.2a | o | 5 [0] | 4 [0] | 500
ncar_ccsm3 | r | 9 [7] | 5 [0] | 230
ncar_pcm1 | r | 4 [0] | 4 [0] | 350
ukmo_hadcm3 | m | 3 [0] | 4 [0] | 340
ukmo_hadgem1 | m | 4 [0] | — | 240

^a Numbers in square brackets (e.g., “[1]”) represent the number of ensemble members that cover the period ending in 2010. Names of models are the same as used in CMIP3. There are some minor differences between which historical and historicalNat CMIP3 simulations we use and those used in Hegerl et al. [2007] (Table S9.1) and in Stone et al. [2007].
Table 4. CMIP5 Models Used in This Study, Their Institutions (“Inst.”, See Table 2), the Number of Initial Condition Ensemble Members for Each Experiment, and the Number of Years Available From the piControl of the Model^a

Model | Inst. | historical | historicalNat | historicalGHG | piControl Length
ACCESS1‒0 | g,c | 1 [1] | — | — | 250
CCSM4 | r | 6 [6] | 4 [0] | 6 [0] | 500
CNRM‒CM5 | e,f | 10 [10] | 6 [6] | 6 [6] | 850
CSIRO‒Mk3‒6‒0 | g,v | 10 [10] | 5 [5] | 5 [5] | 500
CanESM2 | d | 5 [5] | 5 [5] | 5 [5] | 995
FGOALS‒g2 | i,x | 4 [2] | 1 [1] | — | 900
FGOALS‒s2 | i | 3 [3] | — | — | 500
GFDL‒CM3 | t | 5 [1] | 3 [0] | 3 [0] | 500
GFDL‒ESM2G | t | 3 [1] | — | — | 500
GFDL‒ESM2M | t | 1 [1] | 1 [0] | 1 [0] | 500
GISS‒E2‒H | q | 5 [5] | 5 [5] | 5 [5] | 240,240
GISS‒E2‒R | q | 6 [5] | 5 [5] | 5 [5] | 300,500
HadCM3 | m | 7 [7] | — | — | —
HadGEM2‒CC | m | 1 [1] | — | — | 250
HadGEM2‒ES | m | 4 [4] | 4 [4] | 4 [4] | 1030
IPSL‒CM5A‒LR | j | 5 [4] | 3 [0] | 1 [0] | 950
IPSL‒CM5A‒MR | j | 1 [1] | — | — | 300
MIROC‒ESM | a,k,s | 3 [1] | 1 [0] | 1 [0] | 530
MIROC‒ESM‒CHEM | a,k,s | 1 [1] | 1 [0] | 1 [0] | 250
MIROC5 | a,k,s | 4 [4] | — | — | 670
MPI‒ESM‒LR | l | 3 [3] | — | — | 1000
MRI‒CGCM3 | o | 3 [3] | 1 [0] | 1 [0] | 500
NorESM1‒M | u | 3 [3] | 1 [1] | 1 [1] | 500
NorESM1‒ME | u | 1 [0] | — | — | —
bcc‒csm1‒1 | b | 3 [3] | 1 [1] | 1 [1] | 500
inmcm4 | h | 1 [1] | — | — | 500

^a Numbers in square brackets (e.g., “[1]”) represent the number of ensemble members that cover the period ending in 2010. Names of models are as used in CMIP5. The GISS‒E2‒H and GISS‒E2‒R models provided two separate piControl simulations, hence the two numbers in the piControl length column for those models.

[11] For a given model experiment, up to 10 initial condition ensembles were produced. These took their initial conditions from the model's piControl with the period between samples varying greatly from model to model. We do not consider the impact of different periods between initial conditions, although some studies have tried to minimize the impact of any sampling bias by selecting initial conditions according to the ocean's state [e.g., Jones et al., 2011a].

[12] The CMIP3 and CMIP5 simulations, a multi‒model ensemble (MME), are often called an ensemble of opportunity [Allen and Ingram, 2002]. Ideally models should sample uncertainties (physical modeling, forcing, and internal variability) as widely as possible [Collins et al., 2010]. In practice, an ensemble of opportunity, like CMIP3 and CMIP5, would not methodically sample the full range of possible uncertainties. For instance, the models are not independent [Jun et al., 2008; Abramowitz and Gupta, 2008; Masson and Knutti, 2011], with many sharing common components and algorithms [Knutti et al., 2010], which can be seen in climate responses sharing common patterns [Masson and Knutti, 2011]. As a result, the effective number of independent models in such an ensemble of opportunity is less than the actual number [Pennell and Reichler, 2011]. Ensembles of models in which parameters in physics schemes have been perturbed (so‒called perturbed physics ensembles) have been used to sample model uncertainty in a more methodical manner [Murphy et al., 2004; Stainforth et al., 2005] although they do not sample structural model uncertainty, i.e., uncertainty due to processes not incorporated in a particular model. Similarly the CMIP5 models do not systematically sample forcing uncertainties [Taylor et al., 2012].

[13] Bearing in mind the caveat that there are limitations to the statistical interpretation of ensembles of opportunity [Knutti, 2010; Pennell and Reichler, 2011; von Storch and Zwiers, 2013], we use the CMIP3 and CMIP5 MME spread to examine the confidence of any agreement between the ensembles and the observations [Taylor et al., 2012]. In this study, we generally treat each model as being an equal member of the MME. While such “one model one vote” [Knutti, 2010] methods may underestimate the uncertainty in the model spread, it is arguably the simplest approach to use [Weigel et al., 2010]. In any event, it is difficult to justify any particular weighting scheme based on a model's base climate since a measure of skill of the CMIP5 models to represent mean climate bears little relation to the skill of the models in simulating the observed trend [Jun et al., 2008].

[14] While the CMIP3 archive was about 36 TB in size, the CMIP5 archive is estimated to be 2.5–3 PB or even larger [Overpeck et al., 2011; Taylor et al., 2012]. In this study, because we are interested in near‒surface temperatures only, we examine only a tiny fraction of the total CMIP archive: monthly means of one climate variable on a single level for up to 10 initial condition ensemble members for each of seven experiments from 46 models. In total, more than 66 thousand model years, equating to about 65 GB of storage, are analyzed in this study.

[15] We have endeavored to retrieve the latest versions of the data available in the CMIP5 archives up to 1 March 2012, in order to allow time to carry out the analysis and write up the results. Owing to CMIP5's limited version control, it has been a non‒trivial task to keep track of data set changes; thus, we cannot guarantee that all the data used in this study were up to date as of March 2012. Nonetheless, the CMIP5 data repository is a substantial undertaking and a great success, and without such a project studies such as this would not be possible. Since March 2012 more models have been added to the CMIP5 archive; we hope to be able to include these models in future analyses.

2.2 External Forcing Factors

[16] Exactly what forcing factors are applied, and how they are modeled, for a given experiment (Table 1) differs somewhat across the models (see each model's documentation for more details). Details of which forcings were included in the CMIP3 historical experiments were deduced from Table 10.1 in Meehl et al. [2007b]. Information about which forcings were included in the CMIP5 simulations was extracted from the metadata contained within the data [Taylor et al., 2012], with additional information being obtained from the institutions. The minimum criteria for a model being included in the following analyses are that its historical experiments must include at least variations in long‒lived well‒mixed greenhouse gases, direct sulfate aerosol, ozone (tropospheric and stratospheric), solar, and explosive volcanic influences. Therefore, as stated earlier, we do not examine those CMIP3 historical experiments (20C3M) that did not also include natural forcings. Only historicalNat simulations that include changes in both total solar irradiance and stratospheric volcanic aerosols are examined. The long‒lived well‒mixed greenhouse gases simulated in the historical and historicalGHG experiments include concentration changes in carbon dioxide, methane, and nitrous oxide (or carbon dioxide equivalent), with some variation across the models in which CFC species are included. All the historical simulations include direct sulfate aerosol effects, but how other non‒greenhouse gas anthropogenic forcing factors are applied differs across the models. Tables 5 and 6 give a summary of which model historical simulations include the indirect effects of sulfate aerosols, the effects of carbonaceous aerosols (black carbon and/or organic carbon), and land use influences. Further details of the intricacies of the forcings and how they are implemented in particular models are available from the individual models' documentation.

Table 5. Forcing Factors Included in CMIP3 Historical Experiments Additional to Greenhouse Gases, Sulfate Direct Effects, Ozone, Solar Irradiance, and Stratospheric Volcanic Aerosol Factors^a

^a SI, sulfate indirect effects (first and/or second effects); CA, carbonaceous aerosols (black carbon and organic carbon); and LU, land use changes (Y, factor included; N, factor not included).

Table 6. Same as Table 5 but for CMIP5 Models

[17] There are a few notable oddities in the way some model experiments have been set up which make them different from the rest of the models in the archive. The historicalGHG experiments for the CNRM‒CM5, GFDL‒CM3, MIROC‒ESM, MIROC‒ESM‒CHEM, MRI‒CGCM3, and NorESM1‒M models include variations in ozone concentrations in addition to the well‒mixed greenhouse gas variations. The CMIP3 models miub_echo_g and mri_cgcm2_3_2a and the CMIP5 model IPSL‒CM5A‒LR simulate volcanic influences by perturbing the shortwave radiation at the top of the atmosphere. Whereas the ukmo‒hadcm3 (run1 and run2) and ukmo‒hadgem1 (run1) 20C3M simulations listed in CMIP3 contained anthropogenic only forced simulations, the ukmo‒hadcm3 and ukmo‒hadgem1 20C3M simulations we analyze here include both anthropogenic and natural forcings [Stott et al., 2000, 2006b].

2.3 Simulation Details

[18] Monthly mean near‒surface air temperatures (TAS) were retrieved from the CMIP3 and CMIP5 archives. The historical CMIP3 simulations had start dates varying between 1850 and 1900 and end dates varying between 1999 and 2002. To enable a continuation of the CMIP3 historical simulations to near‒present day, we use any available A1B SRES scenario simulations [Meehl et al., 2007a] that are continuations of the 20C3M experiments. For CMIP5, the “historic” period is defined as starting in the mid‒19th century and ending in 2005, so to extend the historical simulations up to 2010, we append the historicalExt experiment or, if that is not available, the rcp45 experiment (Table 1) [Taylor et al., 2012]. There are some minor differences between the anthropogenic emissions and concentrations of the different representative concentration pathways (RCPs) during the first 10 years of the 21st century [van Vuuren et al., 2011], but these are very small compared to the differences over the whole century, so which RCP experiment is chosen to extend the historical experiment to 2010 is unlikely to be important. There are bigger differences between the forcing factors in the CMIP3 SRES and the CMIP5 RCP experiments over the first few decades of the 21st century, but differences in the climatic responses are also relatively small [Knutti and Sedláček, 2013]. How the solar and volcanic forcing factors are applied during this period will also differ across the models, so there will be additional forcing uncertainties due to these choices. While the official CMIP5 guidance was for the historicalNat and historicalGHG simulations to cover the mid‒19th century to 2005 period [Taylor et al., 2012], a number of institutions supplied historicalNat and historicalGHG data to CMIP5 covering the period up to 2010 (Table 4). None of the CMIP3 historicalNat simulations have data beyond the year 2000.
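The splicing step described above can be sketched in a few lines. This is an illustrative sketch only, not code from the study; the function and variable names are our own, and annual‒mean series are assumed for simplicity.

```python
import numpy as np

def extend_historical(hist_years, hist_tas, extensions,
                      last_hist_year=2005, end_year=2010):
    """Extend an annual-mean historical series to end_year using the first
    available extension experiment (historicalExt preferred, else rcp45).

    extensions maps an experiment name to a (years, tas) pair of 1-D arrays.
    """
    hist_years = np.asarray(hist_years)
    hist_tas = np.asarray(hist_tas)
    keep = hist_years <= last_hist_year
    years, tas = hist_years[keep], hist_tas[keep]
    # Order of preference follows the text: historicalExt, then rcp45
    for name in ("historicalExt", "rcp45"):
        if name not in extensions:
            continue
        ext_years, ext_tas = (np.asarray(a) for a in extensions[name])
        use = (ext_years > last_hist_year) & (ext_years <= end_year)
        return (np.concatenate([years, ext_years[use]]),
                np.concatenate([tas, ext_tas[use]]))
    raise ValueError("no extension experiment available")
```

The same pattern applies to the CMIP3 case, with the A1B SRES simulation taking the place of the RCP extension.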

3 Observed Near‒Surface Temperatures

[19] Measurements of near‒surface temperatures have been used to create the longest global scale diagnostics of observed climate going back to the mid‒19th century. In this study we analyze HadCRUT4 produced by the Met Office and the Climatic Research Unit, University of East Anglia [Morice et al., 2012], GISS produced by NASA Goddard Institute for Space Studies [Hansen et al., 2006], NCDC produced by NOAA's National Climatic Data Center [Smith et al., 2008], and JMA produced by the Japan Meteorological Agency [accessed 8/8/2011; Ishii et al., 2005]. The data sets differ in how raw data have been quality controlled, and in the homogenization adjustments and bias corrections made. They also differ in their coverage. While HadCRUT4 and JMA have areas of missing data where no observations are available (i.e., no infilling outside of grid boxes with data), GISS extrapolate over data‒sparse regions using data within a radius of 1200 km, and NCDC use large area averages from low‒frequency components of the data and spatial covariance patterns for the high frequency components to extrapolate data. The data sets incorporate land air temperatures and near‒surface temperatures over the ocean (e.g., sea surface temperatures) into a global temperature record. The main data set we analyze is HadCRUT4, an update to HadCRUT3 [Brohan et al., 2006] with additional data, bias corrections, and a sophisticated error model. The data set is provided as an ensemble that samples a number of uncertainties and bias corrections that are correlated in time and space, as well as statistical descriptions of the other uncertainties [Kennedy et al., 2011; Morice et al., 2012]. For the purposes of this study, we use the median field of the HadCRUT4 bias realizations (see Morice et al. [2012] for more details) as the best estimate of the data set. We plan to do a thorough investigation using HadCRUT4's error model in a future study.
The HadCRUT4, NCDC, and JMA data sets have a gridded spatial resolution of 5°×5° and GISS a resolution of 1°×1°. The periods covered by the data sets are 1850–2010 for HadCRUT4, 1880–2010 for GISS and NCDC, and 1891–2010 for JMA.
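For data sets such as HadCRUT4 and JMA that leave unobserved grid boxes empty rather than infilling them, global means must be taken over the available boxes only, with area weighting. The sketch below (our own illustration, not code from the study) shows one minimal way to do this for a single gridded field:

```python
import numpy as np

def global_mean(field, lats):
    """Area-weighted global mean of a (nlat, nlon) field on a regular grid.

    Grid boxes with no data (NaN) receive zero weight rather than being
    infilled, mirroring how non-interpolated data sets are averaged.
    """
    # cos(latitude) approximates the relative area of each grid box row
    weights = np.cos(np.radians(lats))[:, None] * np.ones_like(field)
    weights = np.where(np.isnan(field), 0.0, weights)
    return np.nansum(field * weights) / weights.sum()
```

A subtlety this simple version ignores is that incomplete coverage can itself bias the global mean when the missing regions (e.g., the Arctic) warm faster than the sampled ones.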

4 Comparison of CMIP3 With CMIP5 Models

[20] We first compare the CMIP3 and CMIP5 multi‒model ensembles (MMEs). Annual mean spatial fields are created by averaging the monthly gridded data (January–December), and when comparing spatial patterns, model data are projected onto a 5°×5° spatial grid.
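As a sketch of this preprocessing (illustrative only; the study does not specify whether months are weighted by length, and block averaging here stands in for a proper area‒weighted regridding):

```python
import numpy as np

def annual_means(monthly):
    """January-December annual means from monthly fields of shape
    (nyears*12, nlat, nlon), weighting each month by its average length.
    Month-length weighting is an assumption; a plain mean is also common.
    """
    days = np.array([31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
    w = days / days.sum()
    nyears = monthly.shape[0] // 12
    m = monthly[: nyears * 12].reshape(nyears, 12, *monthly.shape[1:])
    return np.tensordot(m, w, axes=([1], [0]))

def block_regrid(field, factor):
    """Coarsen a (nlat, nlon) field by simple block averaging,
    e.g. factor=5 takes a 1-degree grid to 5 degrees."""
    nlat, nlon = field.shape
    return (field.reshape(nlat // factor, factor, nlon // factor, factor)
                 .mean(axis=(1, 3)))
```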

4.1 Differences in CMIP3 and CMIP5 Variability

[21] For consistency when comparing the CMIP3 and CMIP5 piControl experiments, we limit the length of simulations being examined to that of the shortest piControl simulation, 230 years (Tables 3 and 4). Often a model simulation with no changes in external forcing (piControl) will have a drift in the climate diagnostics due to various flux imbalances in the model [Gupta et al., 2012]. Some studies attempt to account for possible model climate drifts; for instance, Figure 9.5 in Hegerl et al. [2007] did not include transient simulations of the 20th century if the long‒term trend of the piControl was greater in magnitude than 0.2 K/century (Appendix 9.C in Hegerl et al. [2007]). Another technique is to remove from the transient simulations the trend deduced from a parallel section of the piControl [e.g., Knutson et al., 2006]. However, whether one should always remove the piControl trend, and how to do it in practice, is not a trivial issue [Taylor et al., 2012; Gupta et al., 2012]. Only two of the CMIP model simulations analyzed in this paper, giss‒model‒eh and csiro‒mk3‒0 (both from CMIP3), have trends with magnitude greater than 0.2 K/century (Figure S1 in the supporting information). We choose not to remove the piControl trend from parallel simulations of the same model in this study because of the impact it would have on long‒term variability, i.e., the possibility that part of the trend in the piControl may be long‒term internal variability that may or may not happen in a parallel experiment when additional forcing has been applied. The overall range of CMIP5 piControl trends has a smaller spread than that of the CMIP3 piControl trends, and all are lower in magnitude than about 0.1 K/century. While the variance of the annual mean TAS of the first 230 years of the piControl (Figure S1) has a lower spread in CMIP5 than in CMIP3, the differences are not statistically significant. The models with the very highest variability, which are all in CMIP3, have lower variability when the drift in the piControl is removed (Figure S1).
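The drift diagnostics discussed above reduce to a least-squares linear trend on the annual global-mean series. A minimal sketch (our own illustration; the 0.2 K/century threshold is the screening criterion of Appendix 9.C in Hegerl et al. [2007], which this study reports but does not apply):

```python
import numpy as np

def trend_per_century(tas):
    """Least-squares linear trend of an annual-mean series, in K/century."""
    years = np.arange(tas.size)
    return np.polyfit(years, tas, 1)[0] * 100.0

def detrend(tas):
    """Remove the least-squares linear trend, one way of treating control
    drift (this study deliberately does NOT apply this to parallel runs)."""
    years = np.arange(tas.size)
    return tas - np.polyval(np.polyfit(years, tas, 1), years)

def drift_acceptable(tas, threshold=0.2):
    """Screen as in Appendix 9.C of Hegerl et al. [2007]: reject controls
    drifting by more than 0.2 K/century in magnitude."""
    return abs(trend_per_century(tas)) <= threshold
```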

[22] The interannual variability across latitudes in the CMIP3 and CMIP5 models is shown in Figure 1. The spread of the models is greatest around the tropics and high latitudes, with the range across the models being quite large in places. For CMIP3, the tropical variability is very weak in a few models and very strong in others; for CMIP5, there is closer agreement in the tropics. The reduction in the range of tropical variability across the models in CMIP5 is probably due to a reduction in the range of El Niño–Southern Oscillation variability across the models relative to CMIP3 [Guilyardi et al., 2012; Kim and Yu, 2012]. Variability across the models over the high latitudes is similar between CMIP3 and CMIP5, apart from one model forming an outlier in CMIP5.

Figure 1.

Standard deviation of zonal annual mean TAS from piControls for CMIP3 (top) and CMIP5 (bottom). The first 230 years of each model's piControl were used to calculate annual zonal means, and then the standard deviation for each latitude was calculated.
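The diagnostic in Figure 1 can be computed in two steps, sketched below (our own illustration, assuming annual-mean fields have already been formed):

```python
import numpy as np

def zonal_interannual_std(annual_fields):
    """Standard deviation across years of zonal annual means.

    annual_fields : array (nyears, nlat, nlon) of annual-mean TAS from a
    piControl; returns one value per latitude band, as plotted in Figure 1.
    """
    zonal = annual_fields.mean(axis=2)       # zonal mean: (nyears, nlat)
    return zonal.std(axis=0, ddof=1)         # interannual spread per latitude
```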

4.2 Historic Transient Experiments

[23] Global annual mean TAS for simulations of the historic period from 1850 to 2010 for CMIP3 (Table 3) and CMIP5 models (Table 4) are shown in Figure 2. Included in the figure are the individual historical, historicalNat, and historicalGHG simulations for CMIP3/5 together with the CMIP3 and CMIP5 ensemble averages. While figure 9.5 in Hegerl et al. [2007] showed the simple average of all the historical and historicalNat simulations together with each individual simulation, here we take a slightly different approach. Some models have as many as 10 initial condition ensemble members in the CMIP5 archive while others have only a single simulation for a given experiment, so a simple average would give most weight to the models with the most ensemble members. To avoid this bias, we calculate a weighted average that gives equal weight to each model [Santer et al., 2007] regardless of how many ensemble members it has (see the supporting information for details). Each simulation in a given model's ensemble receives a weight equal to the inverse of the number of ensemble members for that model multiplied by the inverse of the number of different models. The resultant weighted average is equivalent to taking the average of all the models' ensemble averages. To estimate the statistical spread of the MME, we rank the simulations and construct a cumulative probability distribution, assigning each simulation a probability equal to its weight (as described above). The cumulative probability distribution can then be sampled at whatever percentiles are of interest to obtain MME ranges. A more basic analysis could just use the average and percentile range of the ranked simulations, which would be equivalent to setting the weights to the inverse of the total number of simulations. However, by giving equal weight to each model, rather than to each simulation, the statistical properties of the MME are not dominated by the models with many ensemble members. The weighted ensemble means for CMIP3 and CMIP5, separately, are shown as thick lines in Figure 2.
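The equal-model weighting described above can be sketched in a few lines. This is an illustrative implementation with hypothetical inputs; the function name and data are ours, not taken from the CMIP analysis code.

```python
import numpy as np

def model_weighted_stats(values, model_ids, percentiles=(5, 95)):
    """Weighted MME statistics giving equal weight to each model.

    values    : one scalar diagnostic per simulation
    model_ids : parallel list naming the model each simulation belongs to
    Each simulation's weight is 1 / (n_models * n_members_of_its_model),
    so weights sum to 1 and every model contributes equally.
    """
    values = np.asarray(values, dtype=float)
    model_ids = np.asarray(model_ids)
    models = np.unique(model_ids)
    weights = np.empty_like(values)
    for m in models:
        members = model_ids == m
        weights[members] = 1.0 / (len(models) * members.sum())

    # Equals the average of the model ensemble means.
    mean = np.sum(weights * values)

    # Weighted cumulative distribution: rank simulations, accumulate
    # their weights, then read off percentiles by interpolation.
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) - 0.5 * w  # mid-point convention
    ranges = np.interp(np.asarray(percentiles) / 100.0, cdf, v)
    return mean, ranges
```

For example, three members of a model at 1.0, 1.2, 1.4 K and one member of another model at 2.0 K give a weighted mean of 1.6 K, the average of the two ensemble means (1.2 and 2.0), rather than the simple simulation average of 1.4 K.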

Figure 2.

Global annual mean TAS for both CMIP3 and CMIP5 for historical (top), historicalNat (middle), and historicalGHG (bottom) individual ensemble members (thin light lines). The weighted ensemble averages for CMIP3 (thick blue line) and CMIP5 (thick red line) are estimated by finding the average of the model ensemble means (supporting information). TAS annual means shown with respect to 1880–1919.

[24] All the CMIP3 and CMIP5 historical experiments (see Figures S2 and S3 for the responses for each model shown separately) show warming over the historic period. The spread in the increase in TAS across the models is larger than the spread across the initial condition ensemble for any individual model. The historicalNat simulations show little overall warming due to the combined influences of solar and volcanic activity. In most models, the historicalGHG experiment warms more than the historical experiment, consistent with the other anthropogenic factors—sulfate and carbonaceous aerosols, ozone, land use—and natural factors having an overall net cooling influence. However, the CCSM4, GFDL‒ESM2M, and bcc‒csm1 models have historicalGHG simulations that warm by similar amounts to the equivalent historical simulations by 2000 (Figure S3), which implies that the non‒greenhouse gas forcing factors have little or no net warming/cooling influence over the whole period. These were the only CMIP5 models providing historical and historicalGHG experiments whose historical simulations did not include the indirect effects of sulfate aerosols, i.e., the effects of aerosols making clouds brighter (first indirect effect) or longer lasting (second indirect effect).

[25] The gradual warming seen in the historical simulations is punctuated with short periods of cooling—of varying degrees—from the major volcanic eruptions [Driscoll et al., 2012]. Far less obvious is any response to solar irradiance changes with little evidence of an “11 year” cycle in the weighted mean of the historical or historicalNat simulations, supporting previous examinations of the response of climate models to solar forcing that suggest they have a weak global mean response to the “11 year” solar cycle [Jones et al., 2012].

[26] With the increased number of models available in CMIP5 compared to CMIP3, there is a wider variety of responses in CMIP5 than in CMIP3. The spread in TAS in the first decade of the 21st century for the historical CMIP5 simulations is somewhat wider than for the CMIP3 simulations (Figure 2). The spread is almost identical up to 1960 but then widens for CMIP5 relative to CMIP3. A non‒parametric Kolmogorov‒Smirnov test [Press et al., 1992, p. 617] does not rule out that the CMIP3 and CMIP5 distributions of 2001–2010 mean temperatures are drawn from the same population distribution, i.e., the distributions are not significantly different; however, the test is not robust to differences in the tails of the distributions, which appear to be the main difference between CMIP3 and CMIP5. One of the striking differences between the CMIP3 and CMIP5 MME is the stronger cooling some of the models show in the historical experiments from around the 1950s to the 1980s (Figures S2 and S3). These differences indicate that some changes in modeling are potentially increasing the variety of temperature responses in the CMIP5 historical simulations, despite the similarity of the range of the transient climate response (TCR) in CMIP3 and CMIP5 [Andrews et al., 2012]. Differences in the way forcing factors are applied in the CMIP models, and the resulting uncertainty in the radiative forcings [Forster and Taylor, 2006; Forster et al., 2013], may also be contributing to the wider spread in TAS responses in CMIP5. For instance, a higher proportion of the CMIP5 models include land use changes and a wider range of aerosol influences than in CMIP3 (Tables 5 and 6). Also, the 2007 IPCC assessment [Forster et al., 2007] estimated that historic total solar irradiance (TSI) increases were half those estimated by the previous report, leading to a recommendation for CMIP5 to use a TSI reconstruction with a smaller increase over the first half of the 20th century than those used by CMIP3.
On the other hand, some of the CMIP5 models have very similar variations in historical TAS to that of their earlier generation model counterparts in CMIP3, for example giss_model_e_r and GISS‒E2‒R or ncar_ccsm3_0 and CCSM4 models (Figures S2 and S3).
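A two-sample Kolmogorov‒Smirnov comparison of this kind can be sketched as follows, using synthetic stand-ins for the 2001–2010 ensemble-mean temperatures (the real analysis would use the CMIP3 and CMIP5 simulation values; the sample sizes and distributions here are purely illustrative).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical 2001-2010 mean-temperature anomalies (K) for two MMEs.
cmip3_like = rng.normal(loc=0.8, scale=0.15, size=40)
cmip5_like = rng.normal(loc=0.8, scale=0.25, size=60)  # wider tails

stat, p = ks_2samp(cmip3_like, cmip5_like)
# A large p-value means a common population distribution cannot be
# ruled out. Note the KS statistic is the maximum CDF separation,
# which typically occurs near the centre of the distributions, so the
# test has little power against differences confined to the tails.
```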

[27] The multi‒model weighted average of the CMIP3 historical ensemble is very similar to the CMIP5 historical weighted average (Figure 2), which, given the wide range of responses in the individual models in CMIP5 compared to CMIP3 and the differences in the models and forcing factors applied, is perhaps surprising. The weighted means for the CMIP3 and CMIP5 historicalNat simulations are also very similar, with the mean cooling following the major volcanic eruptions being very similar even though there are large differences between individual models.

5 Comparison of CMIP Models With Observed Temperatures

[28] In this section we compare the CMIP models with observed near surface temperatures. When comparing models with observations, it is important to treat the model data in as similar a way as possible to the observed data [Santer et al., 1995]. All data are projected onto a 5°×5° spatial grid and then monthly anomalies, relative to 1961–1990, are masked by HadCRUT4's spatial coverage. Annual means (January to December) are calculated for a grid point if at least 2 months of data are available (see the supporting information for more details of the pre‒processing). Imposing HadCRUT4's spatial coverage on all data means that the other observational data sets, in particular GISS and NCDC, have reduced coverage [Jones and Stott, 2011], but this allows a consistent comparison between the observations and the models.
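The coverage-masking and annual-mean step could be sketched as below. The array shapes and function name are illustrative assumptions, not the actual processing code used for the paper.

```python
import numpy as np

def masked_annual_mean(monthly, obs_coverage, min_months=2):
    """Annual mean at each grid point after imposing observed coverage.

    monthly      : (12, nlat, nlon) monthly anomalies (K), NaN = missing
    obs_coverage : (12, nlat, nlon) boolean observational coverage mask
    A grid point gets an annual (Jan-Dec) mean only if, after imposing
    the coverage mask, at least `min_months` months remain.
    """
    monthly = np.asarray(monthly, dtype=float)
    valid = np.asarray(obs_coverage, dtype=bool) & ~np.isnan(monthly)
    n_valid = valid.sum(axis=0)
    total = np.where(valid, monthly, 0.0).sum(axis=0)
    # Divide only where enough months survive; elsewhere mark missing.
    return np.where(n_valid >= min_months,
                    total / np.maximum(n_valid, 1), np.nan)
```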

5.1 Global Annual Mean Temperatures

[29] Global annual mean near‒surface temperatures for each model and the four observational data sets are shown in Figure 3 for the CMIP5 MME (see Figure S7 in the supporting information for the equivalent figure for the CMIP3 models). Global mean anomalies were calculated by removing the average of the global annual means over the 1880–1919 period (see the supporting information). As has been noted elsewhere [Jones and Stott, 2011; Morice et al., 2012], all observational data sets track each other relatively closely over the 1850–2010 period. Each model's historical simulations warm more than, and track the observations more closely than, the equivalent historicalNat simulations. A comparison with simulations that have complete spatial coverage (supporting information) shows that the vast majority (>95%) of the historical simulations warm less over the historic period when masked by the observational coverage. This is predominantly caused by the high northern latitude warming from greenhouse gases in the models being masked by the lack of coverage in HadCRUT4 in that region. The largest reduction in the 1901–2010 linear trend of an individual simulation due to applying the observational coverage mask is −0.18 K per 110 years, demonstrating the importance of comparing like with like. The historicalGHG simulations consistently warm more than the observations across all the CMIP5 models (Figure 3).

Figure 3.

Global annual mean TAS variations, 1850–2010, for the CMIP5 historical, historicalNat and historicalGHG experiments and the four observational data sets. TAS annual means shown with respect to 1880–1919. Model and observed data have same spatial coverage as HadCRUT4.

[30] Many of the models' historical simulations (Figure 3) capture the general temporal shape of the observed TAS: an increase from the 1900s to the 1940s, then a flattening or even cooling to the 1970s, followed by an increase to the present day. The spread in a given model's warming across its ensemble is relatively small over the whole period compared to the spread in warming across the models. This indicates that over the 100 year timescale differences in the forcing factors applied to the models and their responses are more important than internal variability, although on shorter timescales the opposite may be the case [Smith et al., 2007; Hawkins and Sutton, 2009].

[31] Global annual mean TAS for the MME for both CMIP3 and CMIP5 for the historical, historicalNat, and historicalGHG simulations are given in Figure 4 together with the CMIP3 and CMIP5 model weighted averages and the four observational data sets. As suggested in previous analyses [e.g., Stott et al., 2000] and as documented in Figure 9.5 in Hegerl et al. [2007], the historical simulations describe the variations of the observed annual mean near surface temperatures fairly well (Figure S8 in the supporting information is an alternative version of Figure 4 showing the overall spread of the CMIP3 and CMIP5 simulations combined). Linear trends for the observed data sets and the model experiments are given in Table 7 for different periods. While all of the observational data sets show similar time histories (Figure 4), there is a small spread in linear trends, with the GISS data set having a slightly smaller trend than the other observations [Jones and Stott, 2011; Morice et al., 2012] over 1901–2010. The observational data sets show a linear warming of between 0.64 and 0.75 K per 100 years over 1901–2010, while the central 95% of the historical simulations' trends span 0.33–1.11 K per 100 years. It should be noted that linear trends in these circumstances are just summary statistics and do not imply that linear climate changes are expected or observed.
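As a sketch, such a trend summary statistic can be computed by ordinary least squares over the non-missing years; the helper below is illustrative code, not the authors' software.

```python
import numpy as np

def trend_per_century(years, temps):
    """Least-squares linear trend in K per 100 years, skipping missing
    (NaN) years. As cautioned above, this is purely a summary statistic
    and does not imply the underlying change is linear.
    """
    years = np.asarray(years, dtype=float)
    temps = np.asarray(temps, dtype=float)
    ok = ~np.isnan(temps)
    slope, _intercept = np.polyfit(years[ok], temps[ok], 1)
    return 100.0 * slope  # K/yr -> K per 100 years
```

For example, a series warming at a steady 0.007 K/yr over 1901–2010 yields a trend of 0.7 K per 100 years.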

Figure 4.

Global annual mean TAS for CMIP3 (thin blue lines) and CMIP5 (thin red lines) for (a) historical, (b) historicalNat, and (c) historicalGHG ensemble members, compared to the four observational data sets (black lines)—also shown individually in the insert of Figure 4a. The weighted ensemble averages for CMIP3 (thick blue line) and CMIP5 (thick red line) are estimated by giving equal weight to each model's ensemble mean (supporting information). All model and observed data have the same spatial coverage as HadCRUT4. TAS anomalies with respect to the 1880–1919 period.

Table 7. Global Mean Linear Trends for the Observed Data Sets and Both CMIP3 and CMIP5 MME (K per 100 Years)a
  1. a
     The average of the MME trends together with the 2.5–97.5% range (in brackets) are given for the CMIP experiments (giving equal weight to each model). All observations and model simulations have the same temporal‒spatial coverage as HadCRUT4. Trends are calculated for a period only when fewer than 10 years have missing data, apart from 2001–2010, when the trend is calculated only when all 10 years are available.

Experiment       1901–2010           1901–1950           1951–2010            1979–2010           2001–2010
historical       0.65 (0.33, 1.11)   0.65 (0.24, 1.11)   1.23 (0.63, 1.93)    2.11 (0.91, 3.23)   1.87 (−0.47, 4.92)
historicalNat    0.00 (−0.13, 0.13)  0.43 (0.08, 0.78)   −0.14 (−0.58, 0.15)  0.16 (−0.79, 1.07)  0.07 (−2.49, 2.43)
historicalGHG    1.09 (0.81, 1.59)   0.37 (0.05, 0.72)   1.93 (1.47, 2.74)    2.07 (1.26, 3.13)   1.93 (0.41, 4.14)

[32] While the observed trend over the first half of the 20th century is higher than the historical MME mean, it is within that ensemble's spread, but not within the historicalNat ensemble spread (Table 7). The spread of the historical MME trends over the 1951–2010 and 1979–2010 periods encompasses the observed trends while the historicalNat MME does not. One must be careful not to draw too many conclusions about the significance of differences just between the model mean and the observations. Averaging many models' simulations substantially reduces the internal variability in the ensemble mean of the historical experiments, leaving an average model response that is predominantly a forced signal with almost no internal variability, even though the observations still contain internal variability. So, leaving aside issues of observational uncertainty, the difference between the multi‒model mean and the observations is mainly due to mean model error and observed internal variability [Weigel et al., 2010]. This means it is important to account for the MME spread, and measures of internal variability, when comparing models with observations [Santer et al., 2008], and not just contrast the MME mean with an observational data set [e.g., Wild, 2012].

[33] A comparison of the variability of the global mean of the models with the observations on different timescales is shown in Figure 5 as a power spectral density (PSD) plot (see also Figures S10 and S11 for the individual models PSDs). The method used is described elsewhere [Mitchell et al., 2001; Allen et al., 2006; Stone et al., 2007; Hegerl et al., 2007]. The spectra contain variance from internal variability and the response to external forcings, as the data has not been de‒trended. The CMIP3 and CMIP5 historical MME encompass the variability of all four observational data sets on all the timescales examined. The historicalNat MME starts to diverge from the observations after periodicities of 20 or so years and for periodicities of about 35 years no historicalNat simulations have variability as large as observed. Together with Figure 4 this is strong evidence that observed temperature variations are detectable over internal and externally forced natural variability on the longer timescales, whereas on timescales shorter than 30 years changes are indistinguishable [Hegerl and Zwiers, 2011].
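A rough sketch of such a spectral estimate is given below, using a Hann taper as a stand-in for the Tukey‒Hanning window described for Figure 5. The window length and use of `scipy.signal.welch` are our assumptions, not the authors' code.

```python
import numpy as np
from scipy.signal import welch

def annual_psd(series, window_len=97):
    """PSD of an annual-mean series (not detrended, as in the text).

    A Hann taper of `window_len` years stands in for the Tukey-Hanning
    window; frequencies are in cycles per year, so a periodicity of
    T years corresponds to f = 1/T.
    """
    x = np.asarray(series, dtype=float)
    freqs, psd = welch(x, fs=1.0, window="hann", nperseg=window_len,
                       detrend=False)
    return freqs, psd
```

As a check, a pure sinusoid with an 11 year period in a 110 year series produces a spectral peak near f ≈ 0.09 cycles per year.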

Figure 5.

Power spectral density for the 1901–2010 period for both CMIP3 and CMIP5 simulations and the observations. Analysis on annual mean data as shown in Figure 4. A Tukey‒Hanning window of 97 years in length is applied to all data. The central 90% ranges of the historical and historicalNat multi‒model ensembles are shown separately as shaded areas. The 5–95% ranges are calculated giving equal weight to each model (see section 4.2). The HadCRUT4, GISS, NCDC, and JMA global mean near surface temperature observations are as shown in the key.

[34] Figure 6 shows a summary of three statistical indicators for the CMIP simulations compared with HadCRUT4, on a Taylor diagram [Taylor, 2001]. The Taylor diagram enables the simultaneous representation of the standard deviations of each simulation's and HadCRUT4's global annual mean TAS, and the root mean square error (RMSE) and correlation of the simulations with HadCRUT4. The period 1901–2005 is used, to increase the number of simulations that can be examined, with global annual means having their whole‒period mean removed. Perhaps unsurprisingly, the historicalNat simulations (green points in Figure 6) have the lowest standard deviations and the lowest correlations with HadCRUT4. None of the historicalNat simulations has an RMSE lower than 0.2 K. All the historicalGHG simulations have correlations with HadCRUT4 of around 0.8 and RMSEs of up to 0.4 K. The historical experiment includes the simulations with the lowest RMSEs, with correlations with HadCRUT4 varying from just above 0.4 to just below 0.9. While the historicalNat simulations are clustered away from the other simulations, there is some overlap between the clusters of historical and historicalGHG simulations.
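The three Taylor diagram statistics, and the geometric identity that lets them share a single 2-D diagram, can be sketched as follows (illustrative code; the function name is ours).

```python
import numpy as np

def taylor_stats(sim, obs):
    """Return (sigma_sim, correlation, centred RMSE) for one simulation,
    computed from series with their period means removed, as in the
    1901-2005 analysis above. The statistics obey the exact identity
    E'^2 = s_sim^2 + s_obs^2 - 2 * s_sim * s_obs * R,
    which is why all three fit on one diagram.
    """
    sim = np.asarray(sim, dtype=float) - np.mean(sim)
    obs = np.asarray(obs, dtype=float) - np.mean(obs)
    s_sim = sim.std()
    r = np.corrcoef(sim, obs)[0, 1]
    e = np.sqrt(np.mean((sim - obs) ** 2))  # centred RMSE
    return s_sim, r, e
```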

Figure 6.

Taylor diagram [Taylor, 2001] for global annual mean TAS for the CMIP5 models compared with HadCRUT4 for the 1901–2005 period. Historical, historicalNat, and historicalGHG model data have the same spatial coverage as HadCRUT4. Shown are lines of equal standard deviation (dashed), correlation (dash‒dot), and centered root mean square error (RMSE, dotted). Each colored point represents the summary statistics for a single simulation, so some models have more simulations in the diagram than others. As only one variable is being examined, the data in the diagram have not been normalized.

5.2 Continental‒Scale Mean Temperatures

[35] Climate changes from internal variability and external forcings would not be expected to be uniform across the globe [Santer et al., 1995]. We examine annual mean temperatures over sea, over land, and for seven continental land areas. We group pre‒defined regions used by the IPCC in a report on climate extremes [SREX, 2012] into seven continental regions (Figure 7 insert). These SREX areas (Figure 3.1 and Table 3.A‒1 in SREX [2012]) do not always align perfectly with common geographic or political definitions of the continents, but for convenience we group and call the areas North America, South America, Africa, Europe, Asia, Australasia, and Antarctica (insert in Figure 7). All data, models and HadCRUT4, are processed in the same way to construct the global and regional land and global ocean temperatures. We use the proportion of land area in each of HadCRUT4's grid boxes to deduce which grid boxes, in HadCRUT4 and the models, to use. Only those grid boxes where there is 25% or more land in HadCRUT4 are used to calculate land temperatures and only those grid boxes with 0% land are used to calculate ocean temperatures (see the supporting information for further details).
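The land/ocean grid box selection rule can be sketched directly. The function name and the land-fraction field below are illustrative; the real analysis uses HadCRUT4's land-area fractions.

```python
import numpy as np

def land_ocean_masks(land_fraction):
    """Grid box selection from a fractional land-area field.

    land_fraction : (nlat, nlon) array, fraction of land per grid box.
    Boxes with >= 25% land contribute to land temperatures; only boxes
    with exactly 0% land contribute to ocean temperatures, so boxes
    with a small non-zero land fraction are used for neither.
    """
    land_fraction = np.asarray(land_fraction, dtype=float)
    land_mask = land_fraction >= 0.25
    ocean_mask = land_fraction == 0.0
    return land_mask, ocean_mask
```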

Figure 7.

Global, land, ocean, and continental annual mean temperatures for the CMIP3 and CMIP5 historical (red) and historicalNat (blue) MME and HadCRUT4 (black). Weighted model means are shown as thick dark lines and 5–95% ranges as shaded areas. Continental regions are as defined in the insert. Temperatures are shown with respect to the 1880–1919 period, apart from Antarctica, which is shown with respect to the 1951–1980 period mean.

[36] The observed (HadCRUT4) data coverage across the regions changes substantially over the period being examined (Figure S6). Europe has the least variation, increasing from 60% in 1860 to near 100% by the 1950s. The other continental regions have much lower observational coverage, increasing from less than 10% in the 1860s to around 80% by the 1960s. Antarctica has very little spatial coverage even during recent times and none before the 1950s, so the Antarctic temperatures are shown as anomalies with respect to the 1951–1980 period.

[37] Figure 7 shows TAS for HadCRUT4 and the weighted CMIP MME average and 5–95% range (equal weight given to each model; supporting information) for the historical and historicalNat experiments for the globe, land only, ocean only, and the seven continents. The historical simulations generally capture the observed variations over land and ocean, both showing more warming over land than sea. The historicalNat simulations show little overall warming trend and diverge from the historical experiment by the 1950s. While the continental regions have more interannual variability than the globe, they all show warming over the whole period in HadCRUT4. Antarctica, which only covers the period after 1950, shows some warming but also a great deal of interannual variability; nevertheless, the anthropogenic component of warming has been detected in the region in a spatial‒temporal analysis [Gillett et al., 2008].

5.3 Spatial Temperature Trends

[38] As indicated by the continental scale mean TAS variations, observed and modeled temperatures are not warming uniformly across the globe. This can be investigated further by examining spatial linear trends. Figure 8 shows spatial trends for four different periods for HadCRUT4 and the historical, historicalNat, and historicalGHG CMIP simulations. The periods examined, 1901–2010, 1901–1950, 1951–2010, and 1979–2010, capture the most important periods of temperature change since the start of the 20th century. For each simulation at each grid point, linear trends are calculated where no more than five consecutive years have missing data (see the supporting information). Linear trends are calculated for each simulation and then the weighted model average is calculated for each model experiment. In Figure 8, grid boxes are outlined where the central 90% range (where equal weight is given to each model rather than each individual simulation—section 4.2 and the supporting information) of the CMIP MME does not encompass HadCRUT4's trend [Karoly and Wu, 2005; Knutson et al., 2006]. This indicates where 95% or more of the simulated trends are larger than observed (or where 95% or more have trends smaller than observed) and thus where there is some inconsistency between the CMIP MME and the observations. This can be viewed as a simple detection and attribution analysis at the grid point level [Karoly and Wu, 2005]: where an observed trend is inconsistent with the historicalNat MME, it can be said to be detected, and where it is also consistent with the historical MME, it is attributed. However, this is not a strong attribution statement [Hegerl and Zwiers, 2011], especially considering the limitations of an ensemble of opportunity [Knutti, 2010; Pennell and Reichler, 2011; von Storch and Zwiers, 2013].
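The grid box consistency check amounts to testing whether the observed trend lies inside the weighted central 90% range of simulated trends. A minimal sketch follows (our own helper; the equal-model weights of section 4.2 are supplied by the caller and assumed to sum to 1).

```python
import numpy as np

def flag_inconsistent(obs_trend, sim_trends, weights, lo=5.0, hi=95.0):
    """True where the observed trend falls outside the weighted central
    90% range of the MME trends at a grid box, as outlined in Figure 8.

    sim_trends, weights : parallel 1-D arrays over simulations.
    """
    order = np.argsort(sim_trends)
    t = np.asarray(sim_trends, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w) - 0.5 * w  # weighted cumulative distribution
    lo_t, hi_t = np.interp([lo / 100.0, hi / 100.0], cdf, t)
    return bool(obs_trend < lo_t or obs_trend > hi_t)
```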

Figure 8.

Spatial linear trends for four periods, 1901–2010, 1901–1950, 1951–2010, and 1979–2010. Trend shown as temperature change over period, with each period having different lengths. HadCRUT4 trends (first row), weighted ensemble averages for CMIP3 and CMIP5 historical (second row), historicalNat (third row), and historicalGHG (fourth row) simulations. Black boxes placed around grid points where the central 90% range of the simulations (with equal weight given to each model—supporting information) do not encompass the observed trend. Numbers in bottom left‒hand corner of panels give the percentage number of non‒missing data grid points that are highlighted with a black box.

[39] Over the 1901–2010 period (Figure 8 first column), HadCRUT4 warms almost everywhere, except for a small region south of Greenland, with slightly higher warming at the higher latitudes. The historical MME average has a similar pattern of warming, though with not quite as much warming at the higher latitudes, and only a small number of grid boxes show an inconsistency between HadCRUT4 and the central 90% range of the MME. The historicalNat mean shows little change, generally a slight cooling, with only 12% of grid points showing trends that are consistent with HadCRUT4—the cooling off Greenland. The historicalGHG MME generally warms more than observed almost everywhere, except for some regions in the ocean south of Australia and southern South America.

[40] The 1901–1950 period (Figure 8 second column) sees HadCRUT4 warming significantly more than the historical MME in the north Pacific and parts of the Atlantic and cooling more in the tropical Pacific. Many areas of the historicalNat and historicalGHG MME do not warm as much as HadCRUT4.

[41] HadCRUT4 warms over the 1951–2010 period (Figure 8 third column), with a strong signal over land and the Northern Hemisphere. The historical simulations capture this land warming trend, although the MME range is inconsistent in places over Eurasia and Africa. Some parts of the Pacific, in HadCRUT4, do not warm as much as 95% of the historical simulations. The historicalNat simulations are largely inconsistent with HadCRUT4 over 1951–2010, showing an overall cooling trend. Over a large proportion of the globe, the historicalGHG simulations warm more than HadCRUT4.

[42] While much of the globe warms in HadCRUT4 over the 1979–2010 period (Figure 8 fourth column), there are some interesting features: areas of cooling over the Southern Ocean and in a distinct “v” shape in the Pacific, and strong warming over the south Pacific and some areas in Africa and Eurasia. These areas are outside the central 90% of both the historical and historicalGHG MME trends. The historicalNat simulations show inconsistencies over much of the globe, again generally showing less warming than HadCRUT4. The HadCRUT4 patterns of warming/cooling over the Pacific may be related to a change in the Pacific Decadal Oscillation [Mantua and Hare, 2002] from a warm phase to a cool phase over the 1979–2010 period [Trenberth et al., 2007]. Several of the historical simulations have patterns similar to the Pacific cooling seen in HadCRUT4, but only one simulation (the third ensemble member of MIROC‒ESM) cools with a similar magnitude.

[43] Latitudinal zonal trends were calculated for each observational data set and simulation for the four periods shown in Figure 9. We only calculate the zonal trend at a given latitude for an observational data set where it has 50% or more of the number of grid points that HadCRUT4 has. This is to reduce the impact of coverage differences on comparisons between the observed zonal trends, particularly for the NCDC and JMA data sets, which have less coverage at high latitudes than HadCRUT4. The central 90% ranges (where equal weight is given to each model) of the historical and historicalNat simulations are shown as shaded regions and the four observational data sets as colored lines—HadCRUT4 as the black line. The spread of the observational data sets gives an indication of the uncertainties in the observations. Reducing the spatial coverage of the model simulations to that of HadCRUT4 increases the spread across the MME at the higher latitudes (compare with Figure S4, where the model data have full coverage). The observations generally warm more than the historicalNat simulations across the latitudes for three of the periods, but not over the 1901–1950 period. In contrast, the observations lie within the spread of the historical simulations for most of the latitudes and periods examined.
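The 50% coverage criterion for zonal means can be sketched as follows (an illustrative helper of our own, with NaNs marking missing grid boxes):

```python
import numpy as np

def masked_zonal_mean(field, ref_count, min_frac=0.5):
    """Zonal mean per latitude, set to missing where a data set has
    fewer than `min_frac` of HadCRUT4's valid grid points there.

    field     : (nlat, nlon) trend field, NaN = missing box
    ref_count : (nlat,) number of valid HadCRUT4 boxes per latitude
    """
    field = np.asarray(field, dtype=float)
    valid = ~np.isnan(field)
    count = valid.sum(axis=1)
    total = np.where(valid, field, 0.0).sum(axis=1)
    zonal = np.where(count > 0, total / np.maximum(count, 1), np.nan)
    zonal[count < min_frac * np.asarray(ref_count)] = np.nan
    return zonal
```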

Figure 9.

Latitudinal zonal TAS trends for observations and CMIP3 and CMIP5 simulations for four periods: 1901–2010, 1901–1950, 1951–2010, and 1979–2010. For each simulation, the average of the trends at each latitude is calculated, and then the 5–95% (equal model weight) range of the available ensemble members for each latitude is estimated. All observed data sets and simulations have the same spatial coverage as HadCRUT4. Where the other observed data sets have less than 50% of the coverage of HadCRUT4 for a given latitude, the zonal mean is set to a missing data value.

[44] The observed cooling over the Southern Ocean during the 1979–2010 period seen in both Figures 8 and 9 may be related to changes in the Southern Annular Mode [Trenberth et al., 2007], although there may also be forcing contributions to these patterns of change [Karpechko et al., 2009]. This cooling is not captured by most of the historical simulations, but a few do show cooling trends of a similar magnitude to HadCRUT4.

5.4 Models Including Indirect Aerosol Effects

[45] We have not made any judgements about the quality or skill of the models' responses, such as how well the models simulate an observed climatology. But one criterion of “quality” that could be used is how completely the estimates of forcing factors are incorporated by each model. For instance, do models with a broader set of available forcing factors respond substantially differently to those with a smaller collection of forcing factors? Arguably the most dominant grouping of forcing factors after greenhouse gas concentrations is sulfate aerosols [Haywood and Schulz, 2007]. How the different aerosol processes are modeled varies considerably across the models [Forster et al., 2007], which can produce differences in the climatic responses of models (e.g., temperatures over the Atlantic region) [Booth et al., 2012]. It could be reasonable to down‒weight simulations that do not include important physical processes, such as indirect aerosol mechanisms [Weigel et al., 2010].

[46] Figure 10 compares the TAS variations over 1850–2010 for the CMIP3 and CMIP5 historical simulations that include only the direct effects of sulfate aerosols and those that include both direct and indirect effects. By the end of the 20th century, the weighted mean of the models incorporating only aerosol direct effects warms considerably more than that of the models that also include aerosol indirect effects. This is largely because the simulations that include the direct effect only do not cool as much over the 1940–1979 period; both sets of simulations have a similar rate of increase thereafter. Maps of spatial trends (Figure S5) show that the weighted average of the direct‒effects‒only simulations warms more at the higher northern latitudes over the 1901–2010 period. Over the 1940–1979 period, the weighted average of the simulations that include both direct and indirect effects has a strong cooling in the Northern Hemisphere centered on the midlatitudes, a region over which the average of the direct‒effects‒only models continues to warm. This indicates that including the indirect effects of sulfate aerosols in models produces a distinct signal in TAS over the latter half of the 20th century, while not including indirect effects produces a net warming bias over the whole century [Wild, 2012].

Figure 10.

Global annual mean TAS for CMIP historical simulations with and without indirect aerosol effects compared with the observations (black lines). Showing the weighted average (thick dark lines) and 5–95% range (thin light lines) for the simulations including direct sulfate aerosol effects only (blue) and those including both direct and indirect effects (red). Data processed as in Figure 4.

[47] Figure 10 also shows the global annual mean TAS for the different observational data sets. By the start of the 21st century, the observations are cooler than most of the simulations that only include direct aerosol effects but are in the center of the distribution of models that include both direct and indirect aerosol effects. It is intriguing to consider whether selecting a priori those models with a more “complete” set of forcing factors may provide a more accurate representation of past climate changes than models with a limited set of forcings. It should be remembered that there are still many uncertainties in how climate‒influencing factors are modeled and in their radiative forcings, and that important factors may still not be included, which could mean any apparent agreement between models and observations is fortuitously due to a cancellation of errors.

6 Detection Analysis on CMIP5 Models

[48] Of the CMIP5 models available, only eight have the required historical, historicalNat, and historicalGHG simulations covering the period 1901–2010 (BCC‒CSM1‒1, CNRM‒CM5, CSIRO‒Mk3‒6‒0, CanESM2, GISS‒E2‒H, GISS‒E2‒R, HadGEM2‒ES, and NorESM1‒M; Table 4) to carry out a detection and attribution analysis that seeks to partition the observed temperature changes into contributions from greenhouse gases only, from other anthropogenic forcings and from natural forcings. We only use the HadCRUT4 data set in this detection analysis. We plan to examine the impact of choice of observational data set and uncertainties in HadCRUT4 (see Section 3) in a future detection study.

6.1 The Method of Optimal Detection

[49] We apply a standard implementation of optimal detection, the methodology of which has been extensively covered elsewhere [Allen and Stott, 2003; Hasselmann, 1997; Hegerl et al., 1996; Stott et al., 2003]. The following description is adapted from Jones et al. [2008, 2011b] and Jones and Stott [2011]. The method is a linear regression of observed climate changes against simulated climate pattern responses, optimized by weighting down modes of variability with a low signal‒to‒noise ratio. Patterns (which can be spatial, temporal, or both) are filtered by projecting them onto an estimate of the leading orthogonal modes of internal climate variability, which have been whitened. This produces “optimal fingerprints” in which modes of variability with high signal‒to‒noise ratios (SNR) have more prominence than those with low SNR. The basis of orthogonal modes—empirical orthogonal functions (EOFs)—is usually deduced from model simulations.

[50] Scaling factors for different forcing factors are deduced by regressing the filtered observed changes (response variable) against the forced climate changes optimal fingerprints (explanatory variables), allowing for noise in both (the total least squares technique):

\[ y = \sum_{i=1}^{I} \beta_i \, (x_i - \nu_i) + \nu_0 \qquad (1) \]

where y is the observed pattern and xi the ith simulated pattern of I signals. The scaling factors to be estimated in the regression are βi. ν0 and νi are estimates of the internal variability in the observations and in the simulated temperatures, respectively. If more than one ensemble member is available, then xi is the ensemble average, with the model noise scaled to allow for the changed noise characteristics [Allen and Stott, 2003]. An independent estimate of noise is used to estimate the uncertainties in βi and for a residual consistency test. The consistency of the linear regression to over/under fitting is examined by comparing the residual of the regression with the expected variance, estimated from an independent estimate of noise variability, using an F test [Allen and Tett, 1999] at a two‒sided significance level of 10%. The uncertainties in the scaling factors (βi) indicate whether a particular forcing factor is detected: a detection occurs when the scaling factor's uncertainty range does not cross zero, rejecting the null hypothesis that the scaling factor is consistent with zero. A response is attributed by also testing the null hypothesis that the scaling factor is consistent with one, i.e., whether the response is consistent with that expected from the model. If all known major forcing factors are included in a multiple signal analysis, then confidence in an attribution may be strengthened [Hegerl et al., 2007], by showing consistency with the understanding of the physics and helping to exclude alternative factors as sole drivers of the observed changes. The use of such an attribution consistency test [Hasselmann, 1997; Allen and Tett, 1999], while not a “complete attribution assessment” [Hegerl et al., 2007], may be considered a practical approach within the analysis framework we use [Mitchell et al., 2001].
If the factors are not detected or attributed, the pattern and magnitude of the forced responses may have errors, other important forcing factors may not have been included, or a mode of internal variability may not be simulated appropriately; alternatively, the assumption that the responses to different forcing factors can be linearly added together may be inappropriate. The key point of the detection method is that uncertainty in the magnitude of the forcing factor and/or climate response can be compensated for by the scaling factor (βi). However, this method cannot account for uncertainties in the space‒time patterns of forcing and response. A variant of this method, using “errors‒in‒variables,” attempts to account for inter‒model differences [Huntingford et al., 2006]; however, to avoid adding further complexity, we do not use that method here.
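As a concrete illustration of the total least squares regression, the sketch below fits synthetic "observed" and "simulated" patterns using the standard augmented-matrix SVD solution; all data, dimensions, and noise levels here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical whitened data: n filtered "modes" and I = 2 signal patterns.
n, I = 24, 2
true_beta = np.array([1.1, 0.4])
X_true = rng.normal(size=(n, I))                    # noise-free signal patterns
X = X_true + 0.1 * rng.normal(size=(n, I))          # simulated patterns (noise nu_i)
y = X_true @ true_beta + 0.1 * rng.normal(size=n)   # observations (noise nu_0)

def tls_scaling_factors(X, y):
    """Total least squares, allowing noise in both X and y (Allen & Stott, 2003):
    the solution comes from the right singular vector of [X | y] belonging to
    the smallest singular value."""
    Z = np.column_stack([X, y])
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    v = Vt[-1]                      # singular vector of the smallest singular value
    return -v[:-1] / v[-1]          # the scaling factors beta_i

beta = tls_scaling_factors(X, y)
print(beta)  # close to true_beta when the noise is small
```

In the full method the patterns would first be projected onto the whitened EOF basis and the noise scaled for ensemble averaging; those steps are omitted here.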

[51] The values of the scaling factors depend on the number of EOFs used (the truncation in EOF space). The fewer EOFs used, the less of the overall variability of the observations and simulated patterns is captured; but the more EOFs used, the more modes of variability are included that add little information about the signals. The maximum number of EOFs that can be used is determined by the number of degrees of freedom available when the EOFs are created. But if ranges of EOFs give unbounded scaling factors or fail the residual F test, a lower EOF truncation can be chosen.
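The truncation dependence described above can be sketched as a scan over candidate truncations, checking the regression residual against independent noise with an F test; the series, EOF basis, and significance bounds below are entirely synthetic and only outline the procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical control-run segments provide an EOF basis for a 40-point pattern.
n, n_seg = 40, 60
control = rng.normal(size=(n_seg, n))
eofs = np.linalg.svd(control - control.mean(0), full_matrices=False)[2]  # rows = EOFs

x = rng.normal(size=n)                  # one synthetic signal pattern
y = 0.9 * x + rng.normal(size=n)        # synthetic "observations"

control2 = rng.normal(size=(n_seg, n))  # independent noise for the residual test
for k in (5, 10, 20, 30):               # candidate EOF truncations
    P = eofs[:k]                        # projection onto the leading k EOFs
    yk, xk = P @ y, P @ x
    beta = (xk @ yk) / (xk @ xk)        # OLS estimate in the truncated space
    resid_var = np.sum((yk - beta * xk) ** 2) / (k - 1)
    noise_var = np.mean((control2 @ P.T) ** 2)   # expected variance from noise
    F = resid_var / noise_var
    ok = stats.f.ppf(0.05, k - 1, n_seg) < F < stats.f.ppf(0.95, k - 1, n_seg)
    print(k, round(float(beta), 2), "pass" if ok else "fail")
```

In practice the basis would be whitened and the F-test degrees of freedom chosen to match how the noise estimate was constructed; the scan simply shows how both the scaling factor and the residual test vary with truncation.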

[52] We are interested in investigating the different contributions to observed changes from greenhouse gases only, from non‒greenhouse gas anthropogenic factors, and from natural influences, but we only have the historical, historicalGHG, and historicalNat experiments to use. To deduce the scaling factors for greenhouse gases only (G), the other anthropogenic forcings (OA), and natural influences (N), we make a transformation of the historical, historicalNat, and historicalGHG scaling factors— βhistorical, βhistoricalNat, and βhistoricalGHG, respectively—as calculated in a multiple signal regression (equation (1)) [as described in Tett et al., 2002], i.e.,

\[ \beta_{G} = \beta_{\mathrm{historical}} + \beta_{\mathrm{historicalGHG}}, \qquad \beta_{OA} = \beta_{\mathrm{historical}}, \qquad \beta_{N} = \beta_{\mathrm{historical}} + \beta_{\mathrm{historicalNat}} \qquad (2) \]
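Equation (2) follows from assuming the historical response is the linear sum of the G, OA, and N responses, so that regressing on the (historical, historicalGHG, historicalNat) patterns and transforming recovers the per-forcing scalings. A small synthetic check of this algebra, with invented scaling values, is:

```python
import numpy as np

def transform_betas(b_hist, b_nat, b_ghg):
    """Convert scaling factors for (historical, historicalNat, historicalGHG)
    into scalings for G, OA, and N, assuming the historical response is the
    linear sum G + OA + N (Tett et al., 2002)."""
    beta_G = b_hist + b_ghg
    beta_OA = b_hist
    beta_N = b_hist + b_nat
    return beta_G, beta_OA, beta_N

# Synthetic check: build y from known per-forcing scalings, regress on the
# experiment patterns, then transform back.
rng = np.random.default_rng(2)
G, OA, N = rng.normal(size=(3, 50))
ALL = G + OA + N                               # "historical" pattern
y = 1.2 * G + 0.7 * OA + 0.9 * N               # synthetic observations
X = np.column_stack([ALL, G, N])               # historical, historicalGHG, historicalNat
b_hist, b_ghg, b_nat = np.linalg.lstsq(X, y, rcond=None)[0]
bg, boa, bn = transform_betas(b_hist, b_nat, b_ghg)
print(round(float(bg), 3), round(float(boa), 3), round(float(bn), 3))  # 1.2 0.7 0.9
```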

[53] A basic assumption of the optimal detection analysis is that the estimate of internal variability used is comparable with the real world's internal variability. As the observations are influenced by external forcing, and we do not have a non‒externally forced alternative reality with which to test this assumption, a common alternative is to compare the power spectral density (PSD) of the observations with that of the model simulations that include external forcings. We have already seen that overall the CMIP5 and CMIP3 model variability compares favorably across different periodicities with HadCRUT4‒observed variability (Figure 5). Figure S11 (in the supporting information) includes the PSDs for each of the eight models (BCC‒CSM1‒1, CNRM‒CM5, CSIRO‒Mk3‒6‒0, CanESM2, GISS‒E2‒H, GISS‒E2‒R, HadGEM2‒ES, and NorESM1‒M) that can be examined in the detection analysis. Variability for the historical experiment in most of the models compares favorably with HadCRUT4 over the range of periodicities, except for HadGEM2‒ES, whose very long period variability is lower due to its lower overall trend than observed, and for CanESM2 and BCC‒CSM1‒1, whose decadal and longer period variability is larger than observed. While not a strict test, Figure S11 suggests that the models have an adequate representation of internal variability—at least at the global mean level. In addition, we use the residual test from the regression to test whether there are any gross failings in the models' representation of internal variability.
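A PSD comparison of the kind shown in Figure S11 can be sketched as follows, with AR(1) red noise standing in for the observed and simulated global annual mean series; everything here is synthetic.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(3)

def ar1(n, phi, rng):
    """AR(1) red noise as a stand-in for a global annual mean series."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

obs = ar1(110, 0.6, rng)      # stand-in for observed annual means, 1901-2010
model = ar1(110, 0.6, rng)    # stand-in for one historical ensemble member

# Welch PSD; a window a few decades long resolves decadal periodicities.
f_obs, p_obs = signal.welch(obs, fs=1.0, nperseg=32)
f_mod, p_mod = signal.welch(model, fs=1.0, nperseg=32)

# Compare variability band by band (period in years = 1 / frequency).
for f, po, pm in zip(f_obs[1:4], p_obs[1:4], p_mod[1:4]):
    print(f"period {1 / f:5.1f} yr   obs PSD {po:6.2f}   model PSD {pm:6.2f}")
```

A real comparison would use the actual HadCRUT4 and masked model series and plot the spectra with confidence intervals rather than printing a few bands.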

[54] Two types of analysis are examined here. The first uses samples of the eight models' piControls to create a common EOF basis with which to analyze all the data [Gillett et al., 2002]. In conjunction with this method, averages of the three signals are constructed from the simulations: the first a simple average and the second a weighted average giving equal weight to each model (following the method described in Huntingford et al. [2006]; see also the supporting information). The second analysis looks at the six models that have sufficient simulated data to estimate their own EOF basis (CNRM‒CM5, CSIRO‒Mk3‒6‒0, CanESM2, GISS‒E2‒H, GISS‒E2‒R, and HadGEM2‒ES). These are the models with long enough piControls and enough ensemble members in the experiments to allow the generation of a sufficiently large number of EOFs to represent internal variability.

[55] Two periods are examined, 1901–2010 and 1951–2010, using 10 year mean anomalies (relative to periods being examined) filtered onto 5000 km spatial scales [Stott and Tett, 1998]. The decadal means are calculated for the simulations by averaging annual means and then masking by HadCRUT4's 10 year means. The technique for creating the filtered data is the same as described in previous detection analyses [Tett et al., 2002; Stott et al., 2006b; Jones et al., 2011b] apart from using annual means calculated over January to December (further details in the supporting information).
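The construction of the masked decadal-mean anomalies can be sketched as follows; the grid size and gap fraction are invented, and the projection onto 5000 km spherical harmonic scales is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented fields: 60 years x 36 grid boxes, with gaps in the observations.
years, nbox = 60, 36
obs = rng.normal(size=(years, nbox))
obs[rng.random(obs.shape) < 0.3] = np.nan      # missing observed values
sim = rng.normal(size=(years, nbox))           # model annual means (complete)

# Decadal means; the simulation is then masked wherever the observed decadal
# mean is undefined, so both fields are sampled identically.
obs_dec = np.nanmean(obs.reshape(6, 10, nbox), axis=1)
sim_dec = sim.reshape(6, 10, nbox).mean(axis=1)
sim_dec[np.isnan(obs_dec)] = np.nan

# Anomalies relative to the period being examined.
obs_anom = obs_dec - np.nanmean(obs_dec, axis=0)
sim_anom = sim_dec - np.nanmean(sim_dec, axis=0)
print(obs_anom.shape)  # (6, 36)
```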

6.2 Analysis of 1901–2010 Period

[56] Results for the 1901–2010 period using a common EOF basis are shown in Figure 11 and using each model's own EOF basis in Figure 12.

Figure 11.

Optimal detection scaling factors (βG, βOA and βN) for 1901–2010 period with 10 year means projected onto spherical harmonics. Scaling factors (5–95 % ranges) for each signal shown over range of available EOF truncations. Analysis using model common EOF basis is based on eight models. The triangle symbols represent EOF truncations where the residual consistency test fails at the 10% two‒sided significance level. Number of degrees of freedom that can be examined is 24.

Figure 12.

Same as Figure 11 but using individual model basis (i.e., analysis of each model using EOF basis produced by that model).

[57] A common EOF basis onto which the observational and model data are projected is constructed, following Gillett et al. [2002], from the eight models that have historical, historicalNat, historicalGHG, and piControl simulations. The technique of using a common EOF basis created from a range of climate models has been used in a number of studies [Verbeek, 1997; Barnett, 1998; Stouffer et al., 2000; Hegerl et al., 2000; Gillett et al., 2005; Zhang et al., 2006; Santer et al., 2007; Christidis et al., 2010]. Equal length (240 year) segments from each of the eight models' piControls are used to draw 110 year lengths, overlapping by 10 years except between models, to create a covariance matrix. This is repeated for a different 240 year long segment from the models' piControls. The first noise estimate is used to create the common EOF basis and the second for uncertainty estimates and residual testing. We do not examine the sensitivity of the results to using other methods of creating a common EOF basis [Stott et al., 2006a; Gillett et al., 2012b].
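The segment-drawing step can be sketched as follows; the piControl series are random stand-ins, and the whitening and second (independent) noise estimate are omitted.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical stand-in for the common-basis construction: from each model's
# piControl, a 240 year segment yields 110 year chunks overlapping by 10 years.
n_models, seg_len, chunk_len, overlap = 8, 240, 110, 10
chunks = []
for _ in range(n_models):
    ctrl = rng.normal(size=seg_len)                    # one model's piControl
    for start in range(0, seg_len - chunk_len + 1, chunk_len - overlap):
        c = ctrl[start:start + chunk_len]
        chunks.append(c - c.mean())                    # anomaly chunk
X = np.array(chunks)                                   # 2 chunks x 8 models

# Common EOF basis from the pooled chunks via SVD; rows of `eofs` are EOFs.
_, svals, eofs = np.linalg.svd(X, full_matrices=False)
explained = svals**2 / np.sum(svals**2)

# Degrees of freedom estimated as 1.5 x the number of segments [Allen and Smith, 1996].
print(len(chunks), "chunks; estimated DOF =", int(1.5 * len(chunks)))  # 16 chunks; estimated DOF = 24
```

With two overlapping 110 year chunks per 240 year segment and eight models, this reproduces the 16 segments and 24 estimated degrees of freedom quoted in the text.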

Table 8. CMIP5 Models Used in the Detection Analysis Using the Models' Own EOF Basis (Individual EOF Basis) and a Common EOF Basis for Both Periods Being Examined, 1901–2010 and 1951–2010a

  a Each row gives the maximum truncation of EOF being used (Trunc.), the percentage captured variance (Cap.Var.) for HadCRUT4 (Obs.), and the historical experiment ensemble average (Mod.) when projected onto the truncated EOF basis.

                 Individual EOF Basis        Common EOF Basis
  Simple Avg.    —          96.5             —          99.1
  Wgt. Avg.      —          96.5             —          99.0

[58] The common EOF basis has 24 degrees of freedom (estimated from the number of independent 110 year length segments, multiplied by 1.5) [Allen and Smith, 1996] and so a maximum possible EOF truncation of 24 (Table 8). Using this maximum truncation captures 93.6% of the observed variance [Tett et al., 2002] and more than 95% of the historical variance for each of the models. Figure 11 shows the scaling factors for the eight models across the range of maximum EOFs being used. CNRM‒CM5, CanESM2, and NorESM1‒M give fairly consistent scaling factors across the range of EOFs. HadGEM2‒ES is more varied and fails the residual consistency test for most of the maximum truncation choices. The other models give poorly constrained results. Figure 13a (left‒hand side) shows the results for the eight models at the truncation of 24; the same truncation is used for each model to enable a consistent comparison when using the common EOF basis. Five of the models fail to detect greenhouse gases (G), with CSIRO‒Mk3‒6‒0, GISS‒E2‒H, GISS‒E2‒R, and BCC‒CSM1‒1 having very large ranges of scaling factors and HadGEM2‒ES failing the residual consistency test. Of the three models that do detect G, CNRM‒CM5 has scaling values consistent with unity, CanESM2 has values lower than one, and NorESM1‒M has values higher than one. Only NorESM1‒M detects other anthropogenic forcings (OA), with a value significantly greater than 1; CanESM2 and HadGEM2‒ES both fail to detect OA but have negative best estimates. CanESM2, HadGEM2‒ES, and NorESM1‒M detect natural forcings (N) with values consistent with 1. We use two alternative ways of creating average temporal‒spatial patterns across the models. The first simply averages all the available ensemble members from the eight models for the historical (45 members), historicalNat (32 members), and historicalGHG (32 members) experiments, treating the ensemble members as if they were sampled from the same model [Gillett et al., 2002].
The second method averages the ensemble members for each model and then averages the model means [Santer et al., 2007], thus giving equal weight to each model rather than to each ensemble member, with the noise estimates scaled appropriately. For the simple model average (Simple avg) and weighted model average (Weighted avg), G is robustly detected across the choice of truncation with values consistent with 1 (Figures 11 and 13). N is also detected with values consistent with 1, but OA is not detected in either case. For both “Simple avg” and “Weighted avg,” the residual consistency test fails over wide ranges of EOF truncations (Figure 11), although not for the maximum truncation.
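The two averaging schemes can be sketched in a few lines; the ensemble sizes and patterns below are invented, and the accompanying noise rescaling is omitted.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical ensembles for one experiment: unequal numbers of members per
# model, each member a pattern of length 20.
members_per_model = [10, 3, 5, 1]
ensembles = [rng.normal(size=(m, 20)) for m in members_per_model]

# "Simple avg": pool every member, as if all were drawn from one model.
simple_avg = np.vstack(ensembles).mean(axis=0)

# "Weighted avg": average within each model first, then across models,
# giving each model (not each member) equal weight.
weighted_avg = np.mean([e.mean(axis=0) for e in ensembles], axis=0)

print(simple_avg.shape, weighted_avg.shape)  # (20,) (20,)
```

The two averages differ whenever ensemble sizes are unequal, which is why the choice between them matters for the multi-model signals.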

Figure 13.

Optimal detection results for 1901–2010 period, 10 year means projected onto spherical harmonics for chosen maximum EOF truncation (Table 8). Results using common EOF basis on left and using the model's own EOF basis on the right. (a) Scaling factors (βG, βOA, and βN), (b) linear trends of scaled reconstructed temperatures over whole period, and (c) linear temperature trend over sub‒period 1951–2010 while using scaling factors shown in Figure 13a. Best estimates shown with central 90% uncertainty range. Trends in decadal global mean HadCRUT4 shown as thin horizontal black lines. Triangles at top of panels indicate where analysis fails residual consistency test.

[59] The sensitivity of the results to which model is examined may be a consequence of the common EOF basis being composed of a relatively small amount of data with few degrees of freedom, although the captured variance in the models' historical simulations is between 95.2% and 96.7%, suggesting most modes of variability should be captured. Problems of reliably estimating scaling factors are more apparent when using the EOF basis produced from each model. For each model, covariance matrices are constructed from the model's piControl and intra‒ensemble variability [Tett et al., 2002]. These are used to create the EOF basis for the model and the noise estimates for uncertainty and residual testing. The scaling factors for the six models able to construct their own EOF spaces are shown in Figure 12. Each of the six models has a piControl of different length and a different number of historical, historicalNat, and historicalGHG ensemble members with which to create intra‒ensemble variability estimates [Tett et al., 2002], so it is not possible to make the same choices of which parts of the noise estimates are used in the analysis for each model. The supporting information gives details of the choices made. The number of degrees of freedom, NDOF, is therefore different for each model's analysis. The maximum EOF truncation, set to the NDOF of the model's EOF basis, will also be different for each model, varying between 14 and 26 (see Table 8 for details, together with the captured observed variance for that truncation). Both CNRM‒CM5 and CanESM2 detect G, OA, and N, although CanESM2 finds a negative value for OA and fails the residual consistency test over much of the lower EOF truncations. While CSIRO‒Mk3‒6‒0 detects G, this is not a robust result, as G is not detected for the other choices of truncation (Figure 12). The scaling factors shown in Figure 13a (right‒hand side) are for the maximum truncations allowed for each model (Table 8).
Unfortunately, the CMIP5 archive does not yet include the millennia-length control simulations needed for a robust estimation of the modes of variability required by such analyses; the most reliable method of estimating scaling factors and attributable temperature trends is therefore arguably to use the common EOF basis technique. This also has the advantage that the observations projected onto this basis do not change when comparing across the different models.

[60] Reconstructed temperature trends [Allen and Stott, 2003], best estimates and 5–95% ranges, are shown in Figures 13b and 13c. Where G is detected, it is the forcing with the dominant attributed contribution to the observed temperature trend. Where OA is also detected, the warming attributed to G is significantly larger than the HadCRUT4 trend. For the cases where the attributed trend for G is low, OA has either a small cooling or even a warming contribution, as when using CanESM2. The model average analyses give similar attributed trends. For instance, the “Weighted avg” analysis finds ranges of attributed trends for G of 0.50–1.14 K, OA of −0.41 to 0.24 K, and N of 0–0.01 K over the 1901–2010 period (given as 5–95% ranges), compared to the observed trend of 0.76 K per 110 years. Examining the trends over the sub‒period 1951–2010 shows that G had a slightly larger relative contribution to observed trends; e.g., the “Weighted avg” analysis gives attributed trends (in K per 60 years) for G of 0.49–1.13 K, OA of −0.31 to 0.18 K, and N of −0.002 to 0.003 K, with the HadCRUT4 trend being 0.65 K per 60 years.
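The reconstruction of an attributable trend (a scaling factor multiplied by the linear trend of the corresponding simulated response [Allen and Stott, 2003]) can be sketched as follows; the decadal responses and scaling values are invented for illustration.

```python
import numpy as np

decades = np.arange(6)                        # six decadal means spanning the period

def linear_trend(series, x=decades):
    """Total linear change over the period from a least-squares fit."""
    slope = np.polyfit(x, series, 1)[0]
    return slope * (x[-1] - x[0])

# Invented global mean responses (K) and illustrative scaling factors.
ghg = 0.15 * decades                          # simulated G response
oa = -0.05 * decades                          # simulated OA response
betas = {"G": 1.1, "OA": 0.8}
signals = {"G": ghg, "OA": oa}

# Attributable trend = scaling factor x trend of the simulated response.
attributed = {name: betas[name] * linear_trend(series)
              for name, series in signals.items()}
print({k: round(float(v), 3) for k, v in attributed.items()})  # {'G': 0.825, 'OA': -0.2}
```

In the paper the uncertainty ranges on these trends come from the 5–95% ranges of the scaling factors, not shown in this sketch.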

6.3 Analysis of 1951–2010 Period

[61] We next examine the 1951–2010 period, the part of the instrumental record that is best observed. Figures 14 and 15 show the results for the common EOF and the individual model EOF analyses, respectively.

Figure 14.

Optimal detection scaling factors (βG, βOA, and βN) for 1951–2010 period with 10 year means projected onto spherical harmonics. Scaling factors (5–95 % ranges) for each signal shown over range of available EOF truncations. Analysis using model common EOF basis is based on eight models. The triangle symbols represent EOF truncations where the residual consistency test fails at the 10% two‒sided significance level. Number of degrees of freedom that can be examined is 48.

Figure 15.

Same as Figure 14 but for individual EOF analysis.

[62] For the common EOF basis set, the shortness of the period together with the amount of data increases the number of degrees of freedom and EOFs that are available for use. Using the eight climate models provides noise estimates that have 48 degrees of freedom, which is thus the maximum EOF truncation that can be examined. Figure 14 shows the scaling factors for choices of EOF truncation up to the maximum allowable value. Many of the models give unconstrained values for the higher truncations, so to compare across the models we choose an EOF truncation of 24 (Table 8) to examine the scaling factors in a consistent manner (Figure 16). The observed variance captured by using the first 24 EOFs is 96.7% and is above 98% for each of the models' historical simulations. Greenhouse gases (G) are detected in every model except GISS‒E2‒H. The CNRM‒CM5, CSIRO‒Mk3‒6‒0, HadGEM2‒ES, and NorESM1‒M models detect G with values consistent with 1, but CanESM2, GISS‒E2‒R, and BCC‒CSM1‒1 detect G with values significantly lower than 1. Other anthropogenic forcings (OA) are detected by CSIRO‒Mk3‒6‒0, HadGEM2‒ES, and NorESM1‒M, and natural forcings (N) are detected by CanESM2, GISS‒E2‒H, HadGEM2‒ES, NorESM1‒M, and BCC‒CSM1‒1. Only NorESM1‒M detects all three signals with values consistent with 1. Again, in those models where OA is detected, G has a value consistent with 1, and in those models where G has a value less than 1, OA is not detected.

Figure 16.

As Figure 13 but showing results for 1951–2010 period. Table 8 gives the EOF truncations used.

[63] Both analyses of model averages (“Simple avg” and “Weighted avg”) for the 1951–2010 period detect G and OA with values consistent with 1; N is detected only in the “Weighted avg” case. While the conclusion of detection of greenhouse gases with a scaling factor consistent with 1 is robust to truncation choice (Figure 14), exact values of the scaling factors are sensitive to the number of EOFs included. For instance, G scaling factors generally reduce in amplitude as the truncation increases when the weighted model average is used (Figure 14).

[64] The temperature trends of the reconstructed global mean scaled patterns are shown in Figure 16b. For the common EOF analysis (left‒hand side of Figure 16b), the best estimates of the attributed linear trend for G, for the individual models that detect G, vary between 0.63 and 1.48 K over the 60 year period, with only CanESM2 showing G warming less than the observed trend of 0.64 K per 60 years. This attributed warming is offset by OA cooling of up to −0.80 K per 60 years in magnitude in all the models apart from CanESM2 and BCC‒CSM1‒1, which show near‒zero OA trends. The model average analyses (“Simple avg” and “Weighted avg”) give warmings attributed to greenhouse gases greater than observed, with some offsetting by other anthropogenic forcings. For instance, the “Weighted avg” analysis gives attributed trends (5–95% ranges over 60 years) for G of 0.66–1.22 K, OA of −0.51 to 0 K, and N of around 0 K.

[65] In contrast, the individual EOF basis analysis (Figure 15 and right‒hand side of Figure 16) gives a wider range of scaling factors across the models than the common EOF basis. The scaling factors for CNRM‒CM5 and CanESM2 are unbounded for the higher EOF truncations, so we use lower EOF truncations (Table 8) to show their scaling factors in Figure 16. The best estimates of the attributed G trends, for the five models that detect G, range from 0.86 to 2.06 K per 60 years, while the OA trends vary between −1.36 and 0.24 K per 60 years. It is possible that, whereas the optimal detection regression procedure is able to scale for gross errors in a model's transient climate response or net forcing, errors in the space‒time patterns of these models (which the regression scaling cannot account for) may be biasing the results, a model bias that is less important when models are averaged. We therefore consider the more stable results using the multi‒model mean, in particular “Weighted avg,” to be the most robust of the analyses.

[66] These analyses have a number of sources of sensitivity due to choices made in the methodology. For instance, the choice of which data are used to create the model noise estimates and EOF bases is somewhat arbitrary. Swapping the noise estimates, so that the estimates used to create the EOF basis are instead used for uncertainty estimates and residual consistency testing (and vice versa), gives generally similar results but with some interesting differences (supporting information). The most striking difference is for the simple and weighted average scaling factors with the common EOF basis, where OA is now consistently detected in both the 1901–2010 and 1951–2010 periods (Figures S12 and S13).

[67] Three other studies have used optimal detection analyses on observed near‒surface air temperatures for periods ending in 2010 [Stott and Jones, 2012; Gillett et al., 2012a, 2012b]. While the studies produce a variety of results, owing to different analysis choices, a broadly consistent conclusion is the detection of G with attributed warming near or greater than that observed, adding confidence to the results reported here. Gillett et al. [2012b] examined a selection of seven CMIP5 models, some of which are the same as used in this analysis (CNRM‒CM5, CSIRO‒Mk3‒6‒0, CanESM2, HadGEM2‒ES, NorESM1‒M, and BCC‒CSM1‒1), and HadCRUT4 for the period 1861–2010, using a common EOF basis created with different criteria from ours. In both Gillett et al. [2012b] and our study, G and OA are scaled down for both CanESM2 and HadGEM2‒ES, while the other models have scaling values for G consistent with or greater than 1. The “Simple avg” result for both periods examined is similar to the model average result in Gillett et al. [2012b], with G detected and the scaling factors for both G and OA consistent with 1. The differences in the results, however, indicate their sensitivity to analysis choices, even when similar model and observational data are used.

[68] The results presented here are somewhat different from previously reported detection studies on centennial time scales. The last IPCC assessment [Hegerl et al., 2007] reported on studies that compared different models with observed changes over the 20th century [Nozawa et al., 2005; Stott et al., 2006a], using methodologies similar to that described in this analysis, which showed “...that there is a robust identification of a significant greenhouse warming contribution to observed warming that is likely greater than the observed warming over the last 50 years ...” For the 1951–2010 period examined in this analysis, we find a wider range of attributed greenhouse warming across the variety of models than assessed in Hegerl et al. [2007] for the 1950–1999 period.

7 Conclusions and Discussion

[69] Our analysis of the HadCRUT4 observational data set and the CMIP5 multi‒model ensemble supports previous analyses in showing that the observed warming seen since the mid‒20th century cannot be explained by natural forcings and natural internal variability alone. This conclusion still holds even though we find a wider spread of modeled historic global mean temperature changes than in the CMIP3 ensemble. This wider spread is possibly associated with a greater exploration of modeling uncertainty than hitherto, including a much wider exploration of aerosol uncertainty in models that now include a much more sophisticated treatment of aerosol physics.

[70] Despite this wider spread of model results, calculations of attributable temperature trends based on optimal detection support previous conclusions that human‒induced greenhouse gases dominate observed global warming since the middle part of the 20th century. This is the first time that eight different climate models have been used in this type of space‒time TAS analysis, with the consequence that a wider range of aerosol and other forcing uncertainty is explored. We find a wider range of warming from greenhouse gases, and of counteracting cooling from the direct and indirect effects of tropospheric aerosols and other non‒greenhouse gas anthropogenic forcings, than previously reported in detection studies of decadal variations in near‒surface temperatures.

[71] Analyzing 1951–2010 (thereby concentrating on the part of the instrumental record that is best observed and when the external forcings are most dominant), and using the multi‒model mean (“Weighted avg”), we estimate a range of possible contributions to the observed warming of approximately 0.6 K (to the nearest 0.1 K): between 0.6 and 1.2 K from greenhouse gases, balanced by a counteracting cooling from other anthropogenic forcings of between 0 and −0.5 K. Our comprehensive set of sensitivity studies has shown that there remains some dependence on model and methodological choices, such as those associated with the choice of EOF truncation and with which set of data is used to estimate the EOF basis onto which model signals are projected. The most stable results are obtained when model signals are averaged and a common EOF basis is used.

[72] There remain continuing uncertainties associated with the methodology, in particular the extent to which the methodology can compensate for model errors. While the regression‒based methodology can compensate for gross model errors in transient climate response and net radiative forcing by scaling the overall pattern of response to a particular forcing, it cannot compensate for errors in the model patterns themselves since the whole pattern is either scaled up or down. Further work is required to understand the extent to which model error might be influencing results and to develop the methodology to cope with this in a multi‒model setting.


[73] We acknowledge the Program for Climate Model Diagnosis and Intercomparison and the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups for producing and making available their model output. For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. We give thanks to the hard work of the institution modelers, data processors, and those who developed and maintain the gateways. The model data used in this study were obtained from http://www‒ (CMIP3) and http://cmip‒ and associated gateways (CMIP5). We wish to thank Dáithí Stone for discussions about analysis of CMIP3 data in the IPCC 2007 report as well as providing access to additional data. We acknowledge our colleagues who retrieved some of the data from the CMIP archives including Jamie Kettleborough and Peter Good. Thanks also goes to Piers Forster, Tim Andrews, Mat Collins, Glen Harris, Ben Booth, David Sexton, John Kennedy, Colin Morice, Nathan Gillett, and Nathan Bindoff for information, discussions, and comments. The work of the authors was supported by the Joint DECC/Defra Met Office Hadley Centre Climate Programme (GA01101). We are also grateful to the reviewers, David Karoly, Gabi Hegerl, and Francis Zwiers, for their helpful comments and insights.