In this paper, we break down the temperature response of coupled ocean-atmosphere climate models into components due to radiative forcing, climate feedback, and heat storage and transport to understand how well climate models reproduce the observed 20th century temperature record. Despite large differences in feedback strength between models, they generally reproduce the temperature response well, but for different reasons in each model. We show that the differences in forcing and heat storage and transport give rise to a considerable part of the intermodel variability in global, Arctic, and tropical mean temperature responses over the 20th century. Projected future warming trends are much more dependent on a model's feedback strength, suggesting that constraining future climate change by weighting these models on the basis of their skill in reproducing the 20th century record is not possible. We find that tropical 20th century warming is too large and Arctic amplification is unrealistically low in the Geophysical Fluid Dynamics Laboratory CM2.1, Meteorological Research Institute CGCM2.3.2a, and MIROC3.2(hires) models because of unrealistic forcing distributions. The Arctic amplification in both National Center for Atmospheric Research models is unrealistically high because of high feedback contributions in the Arctic compared to the tropics. Few models reproduce the strong observed warming trend from 1918 to 1940. The simulated trend is too low, particularly in the tropics, even allowing for internal variability, suggesting there is too little positive forcing or too much negative forcing in the models at this time. Over the whole of the 20th century, the feedback strength is likely to be underestimated by the multimodel mean.
 Time-dependent surface temperature response to radiative forcing depends on the forcing applied, on the radiative and dynamic feedbacks inherent in the climate system, i.e., the climate sensitivity, and on the rate of uptake of heat by the oceans. Predicting our future influence on climate requires us to have confidence in the climate models used to make predictions, and in particular that the model's climate sensitivity and ocean heat storage characteristics are realistic. Confidence is gained by assessing how well climate models reproduce current climatology and climate variability, and how their feedback parameters compare with estimates from observations [Randall et al., 2007]. Unfortunately, estimates of observed feedbacks have so far put little constraint on the models, and have not narrowed the estimated range of climate sensitivity over the years. Climate models are also assessed for how well they reproduce past climates (ancient and modern) and for how much influence humans have had over the 20th century climate.
 Polar regions are expected to warm more strongly than the tropics (polar amplification) in future, partly because of stronger positive feedbacks acting there [Meehl et al., 2007], but there are considerable differences between climate models in the extent of this amplification [Holland and Bitz, 2003]. Observations suggest that the Arctic has warmed at twice the rate of the global mean over the last 100 years [Trenberth et al., 2007] with implications for local populations, degradation of the permafrost carbon sink, and global sea level rise.
 Here we determine the surface temperature response contributions due to long-term radiative feedbacks, atmosphere-adjusted forcing, and heat storage and transport for a number of coupled ocean-atmosphere climate models. We compare the linear trends of global mean, Arctic (60°N–90°N) mean and tropical (30°S–30°N) mean surface temperature responses of these models with observations over several time periods. We investigate why models do or do not reproduce the observed temperature response patterns. We also perform optimal fingerprinting analyses on the components of surface temperature response to test their forcing, feedback and heat storage responses. The observation data, model data and analysis methods are described in section 2, the results are presented in section 3 and final conclusions are presented in section 4.
2. Data and Methods
 Our observations were the HadCRUT3 data set of 20th century surface temperature anomalies [Brohan et al., 2006]. This consists of land and sea surface temperature anomalies from the 1961–1990 mean on a 5° × 5° grid with no infilling of missing data. This data set has very similar features to other temperature data sets such as the Goddard Institute for Space Studies (GISS) surface temperature analysis [Hansen et al., 2010]. Uncorrected biases in the instrumental record due to differences in the way sea surface temperatures were measured during the Second World War are partly responsible for the observed rapid cooling around 1945 [Thompson et al., 2008]. Correcting for this is expected to only affect temperatures between 1940 and 1960 so trends over the whole period, the pre-1940 period and post-1960 period should not be affected.
 We compared the HadCRUT3 data with surface temperature anomalies from simulations of the 20th century climate from coupled ocean-atmosphere climate models taking part in the World Climate Research Programme's (WCRP's) Coupled Model Intercomparison Project phase 3 (CMIP3). Given that previous studies have shown that both natural and anthropogenic forcings are required to reproduce the warming pattern of the 20th century [Hegerl et al., 2007; Stone et al., 2009; Min and Hense, 2006], only those CMIP3 models which have been forced with both anthropogenic and solar and volcanic forcings [see Forster and Taylor, 2006, Table 1] were used. However, the aerosol forcing varies across all models as does whether land use changes have been included. The model responses were split into forcing, feedback and heat storage and transport terms as outlined below.
Table 1. Surface Temperature Response of CMIP3 Models for 1pctto2x Simulations^a
^a The multimodel ensemble means with 2 standard deviations are also given. Arctic amplification was taken as the Arctic mean minus tropical mean surface temperature response divided by the global mean surface temperature response.
^b These models have 20th century simulations with both anthropogenic and natural forcing and are used in the subsequent analysis.
2.2. Determining Temperature Response Contributions
 Following standard linear feedback analysis methods [Bony et al., 2006], we assume that the top of atmosphere (TOA) net downward radiative flux ΔR can be approximated as a forcing term, F, and a radiative feedback term. The feedback term is approximated as a linear function of the surface temperature response, ΔTs, with the proportionality constant being the climate feedback parameter, Ytotal. Ytotal can be split into the Planck feedback term, YPlanck, due to the blackbody response, and the remaining feedback term, Y. All terms are generally functions of space and time but we assume the feedback parameters are constant in time.
$$\Delta R = F + Y_{\mathrm{total}}\,\Delta T_s = F + (Y_{\mathrm{Planck}} + Y)\,\Delta T_s \qquad (1)$$
By rearranging equation (1), the temperature response can be split into three components due to heat storage and transport (the ΔR term), due to the forcing (the F term), and due to the non-Planck feedbacks (the YΔTs term):
$$\Delta T_s = \frac{\Delta R}{Y_{\mathrm{Planck}}} - \frac{F}{Y_{\mathrm{Planck}}} - \frac{Y\,\Delta T_s}{Y_{\mathrm{Planck}}} \qquad (2)$$
 The temperature response due to the forcing is the response that would be obtained if there were no feedbacks or heat storage and transport operating, i.e., the Planck response. Both equations (1) and (2) have been shown to hold in the zonal mean as well as the global mean, but tend to break down at smaller spatial scales [Crook et al., 2011] where there is too much noise in the ΔR and ΔTs terms. Note that in the zonal mean, equation (1) calculates a local feedback parameter, not a local contribution to the global mean feedback parameter that is often used in other studies [e.g., Boer and Yu, 2003]. We need the local feedback parameter to find the local feedback contribution to the local temperature. Effects on local temperature due to distant forcings are seen in the heat transport (ΔR) term.
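As a rough illustration of the decomposition in equation (2), the sketch below splits a temperature response into heat storage and transport, forcing, and non-Planck feedback parts and confirms that they sum back to the total. The function name `decompose_response` and every numerical value are hypothetical, and a negative Planck feedback parameter of about −3.2 W m−2 K−1 is assumed for the sign convention; this is not the code used in the study.

```python
import numpy as np

def decompose_response(dR, F, dTs, y_total, y_planck):
    """Split dTs into heat, forcing and non-Planck feedback parts (equation (2))."""
    y = y_total - y_planck              # non-Planck feedback parameter Y
    dT_heat = dR / y_planck             # heat storage and transport term
    dT_forcing = -F / y_planck          # Planck response to the forcing alone
    dT_feedback = -y * dTs / y_planck   # contribution of non-Planck feedbacks
    return dT_heat, dT_forcing, dT_feedback

# Illustrative values only (assumptions, not taken from any model)
y_planck = -3.2                         # W m-2 K-1
y_total = -1.2                          # W m-2 K-1
dTs = np.array([0.2, 0.5, 0.8])         # K
F = np.array([0.5, 1.0, 1.6])           # W m-2
dR = F + y_total * dTs                  # equation (1)

parts = decompose_response(dR, F, dTs, y_total, y_planck)
total = sum(parts)                      # should reproduce dTs
```

With a positive forcing and a negative Planck parameter, the forcing part is positive, while the heat term is negative whenever the system is still taking up heat (ΔR > 0).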
 In order to use equation (2) to break down the 20th century surface temperature response into these components, the total feedback, the Planck feedback and the 20th century forcing are required. The total feedback parameter was determined from simulations forced with a 1% annual increase in CO2 to the point of doubling (1pctto2x) where we know the forcing reasonably accurately. Equation (1) was used in the zonal mean to obtain the total zonal mean feedback parameter by regressing ΔR − F against ΔTs over the time when the forcing is changing. The regression method allows for latitudinally dependent rapid atmospheric adjustments to the forcing [Gregory et al., 2004; Forster and Taylor, 2006; Gregory and Webb, 2008; Andrews and Forster, 2008] that are traditionally included in the feedback term. This has been shown to produce a feedback parameter that is much more independent of the forcing mechanism than using the instantaneous or stratosphere-adjusted forcing [Shine et al., 2003; Hansen et al., 2005; Crook et al., 2011] and is also more time independent [Williams et al., 2008]. For these reasons, our assumption that 1pctto2x feedback parameters can be applied to the 20th century is reasonable [see Forster and Taylor, 2006]. Using the regression method means our 20th century forcing term will also include rapid atmospheric adjustments. We chose to use zonal means because smaller spatial scales show more nonlinearities but using larger spatial scales would cause problems because of changing spatial coverage in the observations over the 20th century, and different forcing patterns in the 20th century compared to 1pctto2x simulations [see Crook et al., 2011]. Myhre et al.  showed that the forcing due to CO2 takes the form
$$F = \alpha \ln\left(\frac{C}{C_0}\right) \qquad (3)$$
where C is the current concentration and C0 is the initial concentration of CO2 and in our case α is a function of latitude. A small number of CMIP3 models provide TOA stratosphere-adjusted forcing for 2 × CO2 (National Center for Atmospheric Research (NCAR) CCSM3.0, GISS ER, Meteorological Research Institute (MRI) CGCM2.3.2a, and Institut Pierre-Simon Laplace (IPSL) CM4). Equation (3) with C = 2C0 was used to determine α for each of these models and the ensemble mean α was used to determine the forcing for the 1pctto2x scenario. Because the concentration grows by 1% per year, C/C0 = 1.01^y, and the 1pctto2x forcing in year y is given by
$$F(y) = \alpha \ln\left(1.01^{y}\right) = \alpha\, y \ln(1.01)$$
 The partial radiative perturbation method with the Edwards Slingo radiation code [see Crook et al., 2011] was used to find the Planck feedback parameter for a subset of the models (Geophysical Fluid Dynamics Laboratory (GFDL) CM2.1, NCAR CCSM3.0, GISS EH, UK Met Office (UKMO) HadGEM1 and MIROC3.2(medres); these were simply the first of the models we analyzed). Temperature and humidity profiles for the 1pctto2x case were taken as the 30 year mean after the point of doubling of CO2 (year 70). Although calculations were performed for all sky conditions, the model's cloud profile was not included as this would be treated differently in each model's own radiation code. The Planck feedback parameter for these models is very similar and, therefore, the exercise was not repeated for the remaining models. For all models the ensemble mean Planck feedback parameter was used in analysis of 20th century simulations.
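The 1pctto2x forcing ramp and the regression used to obtain the total feedback parameter can be sketched as follows for a single latitude band. This is a minimal example under stated assumptions: the helper names, the 3.7 W m−2 doubling forcing, the "true" feedback of −1.3 W m−2 K−1 and the synthetic noise are all hypothetical, and ordinary least squares via `numpy.polyfit` stands in for whatever regression code was actually used.

```python
import numpy as np

def alpha_from_doubling(f_2x):
    """Invert F = alpha * ln(C/C0) at C = 2*C0 to get alpha."""
    return f_2x / np.log(2.0)

def ramp_forcing(alpha, years):
    """Forcing for a 1% per year CO2 increase: alpha * ln(1.01**y)."""
    return alpha * years * np.log(1.01)

def feedback_parameter(dR, F, dTs):
    """Slope of a least squares fit of (dR - F) against dTs."""
    slope, _intercept = np.polyfit(dTs, dR - F, 1)
    return slope

# Synthetic single-band example (all numbers are assumptions)
rng = np.random.default_rng(0)
years = np.arange(1, 71)                  # years up to CO2 doubling
alpha = alpha_from_doubling(3.7)          # assumed 2xCO2 forcing, W m-2
F = ramp_forcing(alpha, years)
true_y = -1.3                             # assumed feedback, W m-2 K-1
dTs = 0.025 * years + rng.normal(0.0, 0.05, years.size)
dR = F + true_y * dTs + rng.normal(0.0, 0.1, years.size)

y_est = feedback_parameter(dR, F, dTs)    # should be close to true_y
```

Note that 1.01^70 is slightly larger than 2, so the ramp forcing at year 70 exceeds the doubling forcing by roughly half a percent.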
Equation (1) was then used to determine the 20th century zonal mean forcing in each model using the total zonal mean 1pctto2x feedback parameter and the 20th century total temperature response and TOA net downward radiative flux change. Finally, the partial temperature response time series was determined using equation (2) in the zonal mean. Surface temperature observations have a considerable amount of missing data, especially during the early part of the 20th century. Given that we are comparing modeled 20th century temperature responses with observations, only the locations with valid observations must be included in the determination of zonal means at each point in time. Therefore, interpolation and masking were performed on the total temperature response and TOA net downward radiative flux change before taking zonal means. Note that internal variability will form a part of all three components of temperature response. Some of this variability can be eliminated by taking ensemble means of a number of simulations for each model.
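The masking and forcing diagnosis described above might look like the following sketch. It assumes the model fields have already been interpolated onto the observational grid and that a boolean array marks where observations are valid; the function names and the tiny example arrays are hypothetical.

```python
import numpy as np

def masked_zonal_mean(field, obs_valid):
    """Zonal mean using only longitudes where observations exist.

    field, obs_valid: arrays of shape (time, lat, lon); obs_valid is boolean.
    Returns shape (time, lat), NaN where a latitude band has no valid cell.
    """
    counts = obs_valid.sum(axis=-1)
    sums = np.where(obs_valid, field, 0.0).sum(axis=-1)
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

def diagnose_forcing(dR_zm, dTs_zm, y_total_zm):
    """Equation (1) rearranged: F = dR - Y_total * dTs per time and latitude."""
    return dR_zm - y_total_zm[np.newaxis, :] * dTs_zm

# Tiny made-up example: 2 time steps, 2 latitude bands, 3 longitudes
dTs = np.array([[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]])
valid = np.array([[[True, False, True], [False, False, False]],
                  [[True, True, True], [False, True, False]]])
dTs_zm = masked_zonal_mean(dTs, valid)

dR_zm = np.ones((2, 2))                  # assumed TOA flux change, W m-2
y_total_zm = np.array([-1.0, -2.0])      # assumed zonal feedback parameters
F_zm = diagnose_forcing(dR_zm, dTs_zm, y_total_zm)
```

Bands with no valid observations stay NaN, so they drop out of any later regional mean rather than being silently filled with model values.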
2.3. Linear Trend Comparisons
 Linear regression was used to obtain trends for the global mean, Arctic (60°N–90°N) mean and tropical (30°S–30°N) mean surface temperature response over the whole time period available for all models (1900–1999) and over the two particularly strong warming periods (1918–1940 and 1965–1999) seen in the observations. We used linear trends rather than simple differences between two time periods to reduce the effect of strong or weak responses to the volcanic eruptions of 1902, 1963, and 1991. None of the models has more than five simulations for the 20th century, and so most models do not give us a good sense of the likely spread of possible trends because of internal variability. Therefore, control data from all CMIP3 models were used to assess whether each model has an adequate representation of the observed warming in these regions in each of these time periods within expected internal variability. This assumes that the control data contain an adequate measure of internal variability. Models generally reproduce the large-scale patterns of seasonal surface temperature variation, temperature extremes, and the dominant extratropical patterns of variability such as annular modes and the Pacific Decadal Oscillation, but there still remain problems in adequately representing the El Niño–Southern Oscillation and the Madden-Julian Oscillation [Randall et al., 2007]. The control data were divided into sections of the same number of years as each of the three time periods and the same missing data mask as for the observations was applied before determining linear trends.
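A sketch of the regional means, the linear trends and the division of control data into sections is given below. The cosine-of-latitude area weighting is an assumption (the weighting is not spelled out in the text), the function names are hypothetical, and the missing-data mask is assumed to have been applied upstream by setting masked values to NaN.

```python
import numpy as np

def band_mean(zonal_mean, lats, lat_min, lat_max):
    """cos(latitude)-weighted mean of a (time, lat) array over a latitude band."""
    sel = (lats >= lat_min) & (lats <= lat_max)
    w = np.cos(np.deg2rad(lats[sel]))
    data = zonal_mean[:, sel]
    ok = np.isfinite(data)
    wsum = (w * ok).sum(axis=1)
    total = (w * np.where(ok, data, 0.0)).sum(axis=1)
    return np.where(wsum > 0, total / np.maximum(wsum, 1e-12), np.nan)

def linear_trend(years, series):
    """Least squares slope of series against years, ignoring missing values."""
    ok = np.isfinite(series)
    return np.polyfit(years[ok], series[ok], 1)[0]

def control_segment_trends(control, seg_len):
    """Trends of consecutive non-overlapping sections of a control series."""
    n_seg = control.size // seg_len
    t = np.arange(seg_len)
    return np.array([linear_trend(t, control[i * seg_len:(i + 1) * seg_len])
                     for i in range(n_seg)])

# Synthetic check: uniform warming of 0.01 K per year on a 5 degree grid
lats = np.arange(-87.5, 90.0, 5.0)
years = np.arange(1900, 2000)
zonal = 0.01 * (years - 1900)[:, np.newaxis] * np.ones(lats.size)
arctic = band_mean(zonal, lats, 60.0, 90.0)
tropics = band_mean(zonal, lats, -30.0, 30.0)
arctic_trend = linear_trend(years, arctic)

ctrl_trends = control_segment_trends(np.arange(200.0) * 0.5, 100)
```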
 The model mean trend for the period was added to these control trends and we checked whether the observed trend fell within two standard deviations (2σ) of the mean. The 20th century results were compared with what might be expected on the basis of the transient climate response (TCR) and Arctic amplification of the 1pctto2x simulations. The TCR was taken as the 20 year mean global mean surface temperature response centered on the point of doubling of CO2, i.e., year 70, [Cubasch et al., 2001] and the Arctic amplification was taken as the Arctic mean minus tropical mean surface temperature response divided by the global mean surface temperature response for the same 20 year mean [see Crook et al., 2011]. We use this definition of Arctic amplification rather than a simple ratio of Arctic warming to global mean warming because some components of the temperature response may warm the Arctic more than the tropics and others may warm the tropics more than the Arctic. The TCR and Arctic amplification of the 1pctto2x simulations for all CMIP3 models are shown in Table 1. From this it is clear that those models that were analyzed under 20th century forcing (see footnote b in Table 1) cover the range of TCR and Arctic amplification of all CMIP3 models. We do not expect the Arctic amplification to be the same in 1pctto2x and 20th century simulations because the forcing in the 20th century is less homogeneous, but we might expect those models with high Arctic amplification in 1pctto2x to have high Arctic amplification in 20th century simulations.
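The consistency check and the Arctic amplification measure described above can be written compactly. In this sketch the function names and the example trends are hypothetical, and the sample standard deviation (`ddof=1`) is an assumption about how σ was estimated.

```python
import numpy as np

def observed_within_2sigma(obs_trend, model_mean_trend, control_trends):
    """Check obs_trend against control trends shifted by the model mean trend."""
    shifted = model_mean_trend + np.asarray(control_trends)
    centre = shifted.mean()
    sigma = shifted.std(ddof=1)
    return abs(obs_trend - centre) <= 2.0 * sigma

def arctic_amplification(dT_arctic, dT_tropics, dT_global):
    """(Arctic mean - tropical mean) / global mean temperature response."""
    return (dT_arctic - dT_tropics) / dT_global

# Made-up control trends, for illustration only
ctrl = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
inside = observed_within_2sigma(0.67, 0.6, ctrl)
outside = observed_within_2sigma(1.0, 0.6, ctrl)
aa = arctic_amplification(3.0, 1.0, 2.0)
```

Because the measure is a difference rather than a ratio, a component that warms the tropics more than the Arctic simply contributes a negative amount, and the contributions from separate components add up.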
2.4. Optimal Fingerprint Analysis
 Given that the climate models include internal variability and there are only a small number of realizations available of each one, total least squares (TLS) optimal regression was used to allow for noise in the model data [Stott et al., 2003]:
$$y = \sum_{i} (x_i - \nu_i)\,\beta_i + \nu_0$$
where y are the observations (in our case the optimized observed temperature anomaly), xi are the modeled responses (in our case the optimized components of the total temperature anomaly), νi is the noise in the modeled responses, ν0 is the noise in the observations, and βi are the scaling factors. For the optimal fingerprint analysis, 5 year means from 1900 to 1999 were used as all the models chosen cover this time period. The model data and HadCRUT3 data set were converted to anomalies from the 1900–1999 mean. Control data from as many CMIP3 models as possible were used to provide the estimates of internal variability required by the optimal fingerprint analysis code. This gave us 2 independent sets of 42 segments of nonoverlapped control data. One set was required for the “prewhitening” operator, which is used to produce the optimized fingerprints from the temperature anomaly components, and the second set was required for the model consistency checks [Allen and Tett, 1999], allowing us to perform the analysis with up to the first 42 eigenvectors of internal variability (truncation of 42). The analysis was performed on the ensemble mean of all the available runs for each model for 30° latitude band means using truncations 2 to 42 and global means using truncations 2 to 19 (note that the global mean data has a vector size of 20 and therefore only 19 eigenvectors are needed). We do not make any assumptions on the best number of truncations to use, although it is probably best to use more than 4, but we present results for a range of truncations. The consistency checks can indicate when the number of truncations is unsuitable. It was not possible to detect all three components of temperature response in one regression. We therefore combined two at a time and performed the regression analysis on dTforcing and dTfeedback + dTheat, dTfeedback and dTforcing+ dTheat, and dTheat and dTforcing + dTfeedback so that our regression equations become
$$y = (dT_{\mathrm{forcing}} - \nu_1)\,\beta_1 + (dT_{\mathrm{feedback}} + dT_{\mathrm{heat}} - \nu_2)\,\beta_2 + \nu_0$$
$$y = (dT_{\mathrm{feedback}} - \nu_1)\,\beta_1 + (dT_{\mathrm{forcing}} + dT_{\mathrm{heat}} - \nu_2)\,\beta_2 + \nu_0$$
$$y = (dT_{\mathrm{heat}} - \nu_1)\,\beta_1 + (dT_{\mathrm{forcing}} + dT_{\mathrm{feedback}} - \nu_2)\,\beta_2 + \nu_0$$
 Our aim in this analysis was to see if we could distinguish between models through their different contributions to temperature response, but we also performed the analysis for the multimodel mean results.
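For readers unfamiliar with total least squares, the core of the scaling factor estimate can be sketched with a singular value decomposition. This is a bare-bones illustration, not the optimal fingerprint code used here: it assumes the inputs are already prewhitened so that the noise has equal variance in every column, it omits the ensemble-size scaling of model noise and the consistency checks, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def tls_scaling_factors(X, y):
    """Total least squares estimate of beta in y ~ X @ beta.

    X: (n, m) modeled response patterns; y: (n,) observations.
    Both are assumed to be prewhitened already.
    """
    Z = np.column_stack([X, y])
    # The right singular vector with the smallest singular value gives
    # the direction v for which Z @ v is as small as possible.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    v = vt[-1]
    return -v[:-1] / v[-1]

# Noise-free synthetic check with two response patterns
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
beta_true = np.array([0.8, 1.3])
y = X @ beta_true
beta_est = tls_scaling_factors(X, y)
```

In the noise-free case the fit is exact, so the estimate recovers the scaling factors used to build `y`; with noise in both `X` and `y`, the same vector minimizes the perpendicular rather than the vertical misfit.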
3. Results and Discussion
3.1. Global Mean Linear Trend Comparisons
 The linear trends in the global mean temperature response of the models over the 20th century are given in Table 2. Using 84 sections of 100 years of control data from all CMIP3 models, the observed trend over the whole time period was found to be within the model ensemble mean ±2σ for all models except MIROC3.2(medres), which has too little warming. NCAR CCSM3 has a warming trend on the upper limit and GISS ER has a warming trend on the lower limit.
Table 2. Global Mean Surface Temperature Response (Total and Contributions) for 20th Century Simulations Expressed as a Linear Trend Over the Whole Time Period^a
^a The fraction of the total is given in parentheses. For each model these are ensemble means of the number of simulations, which are given in parentheses in the first column. The multimodel ensemble mean linear trends with 2 standard deviations and the observed linear trend with its uncertainty from the linear regression are also given.
Model (Simulations Included in Ensemble Mean): NCAR PCM1 (2); GFDL CM2.1 (3); GFDL CM2.0 (3); NCAR CCSM3.0 (5); GISS ER (5); GISS EH (5); MRI CGCM2.3.2a (5); UKMO HadGEM1 (1); MIUB ECHO-G (3)
Multimodel mean ±2σ: 0.61 ± 0.27; 0.34 ± 0.16; 0.41 ± 0.19; −0.13 ± 0.09
Observed: 0.67 ± 0.08
 The global mean trends over the 20th century are not in the same order as for 1pctto2x (see Table 2 and Figure 1a). For example, one might expect the high climate sensitivity MIROC3.2(medres) model to show one of the greatest warming trends over the 20th century but in fact it shows the least, and NCAR CCSM3.0, a model of low to middle climate sensitivity, shows the greatest warming trend. This is because these models have not included the same forcings over the 20th century, varying in whether they include ozone, black carbon, organic carbon, mineral dust, sea salt, land use changes and the indirect effects of sulphate aerosols. Kiehl  and Knutti  pointed out how surprising it is that models with quite different climate sensitivities and projected future warming agree so well in simulating 20th century temperature response. They found this was partly caused by the different forcing applied in each model. In fact it is possible that the models may have had parameters tuned to match the observed 20th century surface temperature with their included forcings, and if they included extra forcings they may not capture the 20th century response so effectively with that particular tuning [Knutti, 2008]. We found those models with the least 20th century warming (GISS ER, GISS EH, and MIROC3.2(medres)) have the smallest forcing contribution, whereas NCAR CCSM3.0 has the largest warming and largest forcing contribution. Unlike greenhouse gas forcing, aerosol forcing is far more inhomogeneous and is likely to cause different heat storage and transport contributions to the temperature response. MIROC3.2(hires) has a stronger than expected forcing contribution compared to its TCR because of its strong heat storage in the 20th century. Not including MIROC3.2(hires), we found a weak anticorrelation between dTforcing and TCR of −0.30 (Figure 1b). Knutti  found a similar anticorrelation between global mean forcing and climate sensitivity.
The global mean linear trends in temperature response due to the feedback, expressed as a fraction of the total (Table 2), are unsurprisingly more in line with the TCR of the model, although the latitudinal pattern of forcing affects this. The temperature response contributions due to the forcing and due to the feedback have similar standard deviations between models (Table 2) showing the importance of both the differences in the forcing and in the climate sensitivity between models. It should be noted that the forcing contribution includes rapid atmospheric adjustments which can be quite different in different models, and the standard deviation between models in their instantaneous forcing (if that were available) would likely be much smaller.
 Contributions to the temperature response for up to three simulations (where available) of each model under the SRES A1B emissions scenario were calculated. In contrast to the 20th century warming, the projected warming for the 21st century is positively correlated with the TCR of the model (Figures 1a and 1c). There are considerable differences between models in the SRES A1B forcing and there is a weak positive correlation between dTforcing and TCR of 0.36 (Figure 1d) that enhances the relationship between temperature response and TCR. The forcing differences were further analyzed by examining the linear trends of the shortwave and longwave forcing components (Figure 2). The 20th century shortwave forcing trend is negative in all models, whereas in the 21st century the shortwave forcing trend is positive in some models and negative in others, resulting in large differences between models. In the 20th century the shortwave forcing is dominated by volcanic eruptions, although anthropogenic aerosols increase, giving a negative shortwave forcing. However, in the 21st century there are no volcanic eruptions specified and we expect the shortwave forcing to be dominated by decreasing anthropogenic aerosols. The SRES A1B scenario specifies sulphur emissions, but the treatment of nonsulphate aerosols and whether the indirect effect of aerosols is included are left to the discretion of the modeling centers. Shortwave forcing is affected by the direct effect of aerosols, but both shortwave and longwave forcings are affected by the rapid tropospheric adjustments (semidirect effect) to the aerosol forcing and by indirect effects on clouds, which will be different in each model. The different aerosol forcings cause convergence of modeled temperature response in the 20th century, but divergence in the 21st century.
 Attempts to constrain climate sensitivity and future projected warming in multimodel ensembles by weighting models on the basis of their skill in reproducing recent past climate have not been very successful and there is no consensus on how best to obtain model weights [Weigel et al., 2010]. Climate sensitivity has been constrained using model weighting in perturbed physics parameter ensembles and in energy balance models [Hegerl et al., 2007] to some extent, although the upper limit is still poorly constrained. The climate sensitivity of CMIP3 models typically has a narrower range and lies within these limits. Our results show that for CMIP3 models, skill in reproducing 20th century global mean temperature response is unrelated to both TCR and 21st century global mean temperature response because of the differences in aerosol forcing in models. The measure of skill in reproducing 20th century global mean temperature response (and possibly other climate variables) is more a measure of how well the CMIP3 model has been tuned to fit the observations given its included forcing, rather than how well its climate sensitivity matches that of the real world, which helps to explain why constraining climate sensitivity by weighting multimodel ensembles has not been very successful. However, measuring skill in producing the greenhouse gas contribution to the 20th century warming is useful in constraining future warming [Stott et al., 2006].
3.2. Arctic and Tropics Trend Comparisons
 The total response of the models and the contributions over the whole 20th century in terms of the linear trend in the Arctic and tropics are shown in Figure 3. All three temperature response contributions for both Arctic and tropics vary considerably between models and have the same order of magnitude in their standard deviation (Table 3). Relative warming between Arctic and tropics is highly dependent on the latitudinal distribution of the forcing, which varies between models, and is much less dependent on the Arctic amplification in 1pctto2x. The Arctic and tropics trends over the two rapid warming periods (1918–1940 and 1965–1999) and their contributions in these periods are shown in Figure 4.
Table 3. Multimodel Ensemble Mean ( ±2 Standard Deviations) Surface Temperature Response (Total and Contributions) in the Arctic and Tropics Expressed as a Linear Trend Over the Whole 20th Century
1.16 ± 0.85
0.60 ± 0.27
0.27 ± 0.41
0.39 ± 0.20
0.90 ± 0.71
0.39 ± 0.26
−0.01 ± 0.38
−0.19 ± 0.19
 Using 84 sections of 100 years of control data from all CMIP3 models to assess the role of internal variability, the observed 1900–1999 trends in both the Arctic and tropics were found to be within ±2σ for eight of the eleven models. However, there are three models for which this is not the case and another two models in which the relative warming in the Arctic compared to the tropics is probably unrealistic. We now discuss these five models in detail. Although both NCAR models have plausible 1900–1999 Arctic and tropics trends, the tropical warming for NCAR PCM1 is on the low side and the Arctic warming for NCAR CCSM3.0 is on the high side. There were no simulations that gave a warming less than or equal to the observed trend in the Arctic at the same time as a warming greater than or equal to the observed trend in the tropics, implying that both these models tend to produce too much warming in the Arctic compared to the tropics. For NCAR PCM1 both the forcing and feedback contributions to the temperature response are considerably higher in the Arctic than the tropics, leading to this high Arctic amplification. This model also had a high Arctic amplification in 1pctto2x and shows strong Arctic amplification in both early and late warming periods (Figure 4). In fact the observed 1965–1999 Arctic trend is outside ±2σ, being considerably lower than the modeled trend. For NCAR CCSM3.0 the high Arctic amplification is due to the high feedback in the Arctic compared to the tropics. This model had the highest Arctic amplification in 1pctto2x. For this model the 1pctto2x feedback parameters were obtained individually for the albedo feedback, shortwave cloud feedback, water vapor plus lapse rate feedback and longwave cloudy sky feedback, and the partial temperature contributions due to each of these feedbacks were calculated using the method described by Crook et al.  for both the 1pctto2x run and the first run of the 20th century.
The percentage contributions to the Arctic amplification from the forcing, heat storage and transport and the individual feedbacks are given in Table 4. In both forcing scenarios, the albedo feedback, water vapor plus lapse rate feedback, and longwave cloudy sky feedback provide a similar positive contribution to the Arctic amplification; the shortwave cloud contribution is only weakly negative. Crook et al.  also found a weak negative shortwave cloud contribution for the equivalent slab ocean model forced with 2 × CO2, whereas for many other models, there was a strong negative shortwave cloud contribution.
Table 4. Percentage Arctic Amplification Contributions Due to Forcing, Heat Storage and Transport, and the Different Feedbacks for the 1pctto2x and 20th Century NCAR CCSM3.0 Runs
20th Century Run 1
Water vapor + lapse rate
Longwave cloudy sky
 The observed 1900–1999 trends in the tropics are outside ±2σ for GFDL CM2.1; the model has too much warming in the tropics. There were no simulations where the warming was greater than or equal to the observed trend in the Arctic at the same time as being less than or equal to the observed trend in the tropics, implying this model produces too little warming in the Arctic compared to the tropics. This model has a strong response to volcanic forcing, particularly to Krakatau in 1883 [Knutson et al., 2006], even without including the aerosol indirect effect. In the tropics, the temperature anomaly recovers only gradually from this cooling effect, with another small cooling presumably due to Santa Maria in 1902, and is still considerably lower than observations in the first decade of the 20th century (Figure 5b). This may account for the large 20th century tropical warming trend. In the Arctic, however, although the influence of Krakatau can be seen, the temperature anomaly recovers very rapidly, resulting in an unusually high anomaly at the beginning of the century, and any cooling due to Santa Maria or Katmai (1912) is not enough to bring the anomaly in line with observations at this time (Figure 5a). This is the only model which has a negative forcing trend in the Arctic over the whole 20th century. The Arctic forcing contribution shows a gradual decrease from 1920 until the 1970s at which point it shows an increase (Figure 5e). Knutson et al.  pointed out that this model has particularly large internal variability. Although the rapid warming from the 1890s to 1920 is likely due to recovery from the Krakatau eruption plus internal variability, the subsequent decrease in Arctic forcing until the 1970s is most likely due to negative aerosol forcing outweighing positive greenhouse gas forcing (note this does not appear to be the case in the tropics).
 The observed 1900–1999 trends in the tropics are outside ±2σ for MRI CGCM2.3.2a; the model has too much warming in the tropics. There were no simulations where the warming was greater than or equal to the observed trend in the Arctic at the same time as being less than or equal to the observed trend in the tropics, implying this model produces too little warming in the Arctic compared to the tropics. The model is cooler in the tropics than observations pre-1910 and then warms quite strongly in the latter part of the 20th century due to strong forcing and feedback contributions. The forcing contribution in the Arctic is very small and warming here is largely caused by heat transport from the tropics.
 The observed 1900–1999 trends in the tropics are outside ±2σ for MIROC3.2(hires); the model has too much warming in the tropics. There were no simulations where the warming was greater than or equal to the observed trend in the Arctic at the same time as being less than or equal to the observed trend in the tropics, implying this model produces too little warming in the Arctic compared to the tropics. However, it should be noted that we only had one simulation for this model from which to produce a model mean. The 1918–1940 warming trends in both the Arctic and tropics are too low (Figure 4) because of very little forcing; in fact the Arctic cools slightly in the early period. Increasing warming occurs post-1940 in both regions because of strong feedback in the Arctic and strong forcing in the tropics.
3.3. Early Warming Trend Comparisons
Figure 4 shows that the 1918–1940 warming in both the Arctic and tropics is not well captured by most models. However, the history of warming prior to 1940 followed by cooling, followed by further warming from the 1960s, seen in the observations, and which is much more distinct in the Arctic than the tropics, is found in models to some extent. To ascertain the role of internal variability in the early warming, 366 control sections of 22 years were used with the same missing data mask applied as for the 1918–1940 observations. Probability distribution functions were produced for the Arctic and tropics mean warming trends of all 20th century simulations (Figures 6a and 6b), of all control simulations (Figures 6c and 6d), and of all control simulations plus the multimodel mean warming trend from 20th century simulations (Figures 6e and 6f). This shows that it is exceedingly unlikely that the early warming was due to internal variability alone (both Arctic and tropics observed trends are outside ±2σ for controls) and that the observed trends are still at the upper end of the 20th century modeled trends which include some forced contribution. When comparing the 366 control trends plus the multimodel mean 20th century trend, it is found that the Arctic is just within ±2σ and the tropics is not. Repeating this for each model mean, it is found that only GFDL CM2.1, UKMO HadGEM1 and MIUB ECHO-G have adequate warming in the tropics. Therefore, assuming control runs reproduce multidecadal internal variability realistically, it is unlikely that internal variability can explain the difference between the multimodel mean 20th century trend and the observed trend in the tropics, and likely that many models are missing some positive forcing or have too much negative forcing here during this time. 
In the Arctic there is more internal variability than in the tropics and so we cannot rule out that internal variability (albeit a large realization of it) can explain the difference between observations and each model except in the case of MIROC3.2(medres) and MIROC3.2(hires). Certainly in the MIROC3.2 models more positive forcing would be needed to simulate the observed trends.
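The comparison against control variability can be sketched as follows, under simplifying assumptions. The control series here is synthetic white noise, the segments do not overlap, and the missing data mask applied in our analysis is not reproduced; the function names and numerical values are hypothetical.

```python
import numpy as np

def segment_trends(control, seg_len):
    """Least-squares linear trend (per time step) of each non-overlapping
    segment of length seg_len taken from a control series."""
    n_seg = len(control) // seg_len
    t = np.arange(seg_len)
    trends = [np.polyfit(t, control[i * seg_len:(i + 1) * seg_len], 1)[0]
              for i in range(n_seg)]
    return np.array(trends)

def outside_two_sigma(obs_trend, control_trends, forced_trend=0.0):
    """True if obs_trend lies outside mean ± 2σ of the control trends
    after shifting them by a forced (e.g. multimodel mean) trend."""
    shifted = np.asarray(control_trends) + forced_trend
    mu = shifted.mean()
    sigma = shifted.std(ddof=1)
    return bool(abs(obs_trend - mu) > 2.0 * sigma)

# Synthetic control run: 100 segments of 22 annual values (illustrative only).
rng = np.random.default_rng(0)
control = rng.normal(0.0, 0.1, size=22 * 100)
trends = segment_trends(control, seg_len=22)

# Hypothetical observed and forced trends in K per year.
unforced_only = outside_two_sigma(0.02, trends)
with_forcing = outside_two_sigma(0.02, trends, forced_trend=0.018)
print(unforced_only, with_forcing)
```

Adding the forced trend shifts the whole control distribution, which is why an observed trend can be inconsistent with internal variability alone yet consistent with internal variability plus forcing.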
Wang et al.  suggested Arctic early warming was consistent with internal variability in some but not all CMIP3 models, but they looked at mean anomalies over 1939–1949 rather than looking at trends. They also noted that whereas the observed warming was multidecadal, the modeled warming was only decadal. Although our results broadly agree with this, we wish to stress that a large warming response due to internal variability is required to match trends in the Arctic. Delworth and Knutson  used trends from their older GFDL model and also concluded that only an unusually large realization of internal variability on top of greenhouse gas and sulphate aerosol forcing could have produced the early 20th century warming, although they could not quantify contributions from natural forcing. The more recent study of Knutson et al.  on the CMIP3 versions of the GFDL models suggests that the observed early global mean warming can be produced by a combination of anthropogenic forcing and natural forcing, or either anthropogenic forcing or natural forcing only in combination with an unusually strong warming from internal variability, but it should be noted that GFDL CM2.1 was one of only three models with adequate tropical early warming. Shindell and Faluvegi  showed that internal variability and a net positive aerosol forcing on top of the greenhouse gas, ozone and natural forcing was required to match 1890–1930 increases in the Arctic minus SH extratropics gradient. They also inferred a net negative aerosol forcing in the tropics over this time. Our results suggest it is likely that many CMIP3 models have too much negative aerosol forcing in the tropics from 1918 to 1940, and, although eight of the models do include black carbon, it is possible that they do not include enough, or they have too strong an aerosol indirect effect. 
Another possibility may be that some models do not cool enough in response to the 1883, 1902 and 1912 volcanic eruptions and, therefore, have less to recover from subsequently. Because of a lack of observations, the global distribution of aerosol optical depth has to be estimated for these eruptions and this is highly dependent on the circulation patterns at the time of the eruption as well as the amount of SO2 ejected. Many models use the Sato et al.  volcanic data set where simple assumptions have been made about distributions. The Ammann et al.  data set used by the NCAR models attempts to improve the estimates of aerosol optical depth for the early volcanoes by estimating the spread and decay of the aerosol according to the seasonal stratospheric transport.
 Errors in the observed temperature anomalies were not taken into account in our comparisons. Correcting for biases in the instrumental record due to differences in the way sea surface temperatures were measured during the Second World War is expected to only affect temperatures between 1940 and 1960 so our 1918–1940 trends should not be affected. However, to test this, our calculations could be performed just for land grid boxes or with the corrected CRU temperature data set when it becomes available.
3.4. Optimal Fingerprint Analysis
 Detection of components of temperature response proved difficult because of poor signal-to-noise ratio and degeneracy between components. This was particularly the case for those models with a small response to volcanic forcing compared to noise (MRI CGCM2.3.2a, UKMO HadGEM1, MIROC3.2(medres), and MIROC3.2(hires)), resulting in large uncertainties. Detection was poorer (larger uncertainties) for 30° latitude band means than for global means because the signal-to-noise ratio is lower. It was not possible to detect the temperature response due to the feedback (dTfeedback) and remaining dT (dTforcing + dTheat) for any models or the multimodel mean in either global mean or 30° latitude band means as uncertainties were too large. This is not surprising in the global mean as dTfeedback is a scaled version of dTtotal and therefore degenerate with it. In general dTfeedback and the remaining dT are also quite similar compared to the noise (particularly in the global mean). Sudden changes in dTforcing from volcanic eruptions are opposed by dTheat, so combining these smooths the combined response. It was possible to detect dTforcing and the remaining dT (dTfeedback + dTheat) for many models in both global means and 30° latitude band means (Figure 7 shows the global mean results) and to detect dTheat and remaining dT (dTforcing + dTfeedback) for some models in the global mean (not shown). For these cases most models pass the consistency checks for the majority of truncations, but more confidence should be put in the results of those models which have similar scaling factors across a wide range of truncations. Unfortunately, we found it was not possible to distinguish scaling factors between models because of large uncertainties. This shows that it is possible to reproduce the 20th century temperature response pattern in different ways through a balance of forcing and feedback. 
Nevertheless, our results show that the direct radiative temperature response due to forcings is detectable in the climate record irrespective of the climate feedbacks.
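The role of degeneracy can be illustrated with a deliberately simplified sketch. It uses ordinary least squares in place of the optimal fingerprint method, so it omits the noise covariance estimated from control runs, the truncation, and the consistency checks. All time series are synthetic and the function name is hypothetical.

```python
import numpy as np

def scaling_factors(obs, signals):
    """Ordinary least-squares scaling factors beta with obs ≈ signals @ beta.
    signals has shape (n_time, n_signals). Also returns the numerical rank."""
    beta, _residuals, rank, _singular_values = np.linalg.lstsq(
        signals, obs, rcond=None)
    return beta, rank

t = np.arange(100.0)
dT_forcing = 0.005 * t + 0.1 * np.sin(t / 5.0)  # synthetic forcing response
dT_rest = 0.003 * t                              # synthetic dTfeedback + dTheat
obs = 1.0 * dT_forcing + 1.5 * dT_rest           # noise-free pseudo-observations

# Two distinct signals: both scaling factors are recoverable.
beta, rank = scaling_factors(obs, np.column_stack([dT_forcing, dT_rest]))

# Degenerate case: the feedback response is a scaled copy of the total
# response, so it is collinear with the remainder.
dT_total = dT_forcing + dT_rest
dT_feedback = 0.4 * dT_total
remaining = dT_total - dT_feedback
_, rank_degenerate = scaling_factors(
    obs, np.column_stack([dT_feedback, remaining]))
print(beta, rank, rank_degenerate)
```

In the degenerate case the design matrix has rank one, so the two scaling factors cannot be separated regardless of the noise level.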
 For the multimodel mean we also performed the optimal fingerprint analysis on tropical means and 40°N–60°N means for dTforcing and dTfeedback + dTheat to see how well different regions performed. We used 40°N–60°N rather than the Arctic because detection was very poor in the Arctic, where there are fewer data and more variability. In both regions and the global mean the scaling factors for the dTforcing contribution were close to one, although in the 40°N–60°N region the uncertainties were larger such that the scaling factors were not inconsistent with zero (the mean scaling factors and their 95% uncertainty ranges over truncations 5–19 for which the consistency test passed were 0.97 ± 0.54 for the global mean, 0.76 ± 0.29 for the tropics and 0.54 ± 2.03 for 40°N–60°N). We then subtracted the scaled dTforcing from the observed temperature anomalies so that we could perform an optimal fingerprint analysis of the observed temperature anomaly minus the forcing contribution against the modeled feedback and heat storage and transport contributions. This was our only way to investigate the accuracy of the multimodel mean feedback. The results for these analyses for global means, tropical means, and 40°N–60°N means are shown in Figure 8. Scaling factors for the dTheat contribution are close to one in all these regions. The feedback contribution tends to be underestimated by the multimodel mean, particularly in the global mean and 40°N–60°N mean where the scaling factor is close to 1.5, suggesting the real world feedback may be greater than the multimodel mean feedback, although the uncertainties are such that the scaling factor is consistent with one.
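The second step of this procedure, removing the scaled forcing contribution and regressing the remainder on the feedback and heat terms, can be sketched as below. This again substitutes ordinary least squares for the optimal fingerprint method. The time series are synthetic, the function name is hypothetical, and the scaling factors used to build the pseudo-observations are set by hand for illustration.

```python
import numpy as np

def residual_scaling(obs, dT_forcing, beta_forcing, dT_feedback, dT_heat):
    """Subtract the scaled forcing response from the observations, then
    regress the residual on the feedback and heat contributions."""
    residual = obs - beta_forcing * dT_forcing
    signals = np.column_stack([dT_feedback, dT_heat])
    beta, *_ = np.linalg.lstsq(signals, residual, rcond=None)
    return beta  # [beta_feedback, beta_heat]

t = np.arange(100.0)
dT_forcing = 0.005 * t + 0.1 * np.sin(t / 5.0)  # synthetic
dT_feedback = -0.002 * t                         # synthetic
dT_heat = 0.05 * np.cos(t / 7.0)                 # synthetic

# Pseudo-observations built with hand-picked scaling factors.
obs = 0.97 * dT_forcing + 1.5 * dT_feedback + 1.0 * dT_heat

beta_fb, beta_heat = residual_scaling(
    obs, dT_forcing, 0.97, dT_feedback, dT_heat)
print(beta_fb, beta_heat)
```

Because the forcing scaling factor is fixed before the second regression, any error in it carries through to the feedback and heat scaling factors; the sketch does not propagate that uncertainty.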
 The response of the models over the 20th century in terms of the linear trend in global mean temperature response does not follow the same order as for 1pctto2x because of different 20th century forcing in each model compensating for the climate sensitivity to some extent. Despite being able to detect dTforcing and the remaining dT in most models using optimal fingerprint analysis, we were unable to distinguish between models because of the large uncertainties in the scaling factors. Both these results highlight the difficulty of constraining climate sensitivity when there is so much uncertainty in 20th century forcing. If 20th century forcing could be better constrained and models run under such a forcing scenario this may lead to a better understanding of which models produce the most accurate response and therefore could constrain the feedback and hence climate sensitivity. Modeling groups have been adding missing forcings to their models for the CMIP5 experiments, potentially allowing for easier comparison between models. Results from these experiments may shed interesting light on the current understanding of 20th century forcing and feedback.
 In contrast to the 20th century, projected global mean warming over the 21st century is much more dependent on the TCR. Whereas differences in aerosol forcing cause convergence of temperature response in the 20th century, differences in the 21st century cause divergence, making it very difficult to constrain climate sensitivity and future predictions of climate change by weighting the CMIP3 models according to their 20th century skill. Better understanding of aerosol forcing is therefore of great importance.
 A comparison of modeled and observed warming trends in the Arctic and tropics suggests the tropical warming is too high and Arctic amplification is too low in GFDL CM2.1, MRI CGCM232a, and MIROC3.2(hires) because of too little forcing in the Arctic compared to the tropics. The Arctic amplification in NCAR PCM1 and NCAR CCSM3.0 is unrealistically high because of high feedback contributions in the Arctic compared to the tropics in both these models and also because of a high forcing contribution in NCAR PCM1. It is also evident that few of the models produce the early (1918–1940) warming, particularly in the tropics and that internal variability is unlikely to explain the difference, suggesting many models are missing some positive forcing or have too much negative forcing at this time. Variability is higher in the Arctic and so it is not possible to state the need for more positive forcing here, although a larger positive forcing in the tropics would also cause more warming in the Arctic through increased heat transport from the tropics. The larger positive forcing may be due to more black carbon, a smaller aerosol indirect effect than is currently included in models, or stronger volcanic forcing at the beginning of the 20th century.
 Finally, our multimodel mean optimal fingerprint analysis results suggest that the multimodel mean forcing contribution to the temperature response is quite well detected in global means, 40°N–60°N means and tropical means, but the feedback is lower in these regions than observed temperature anomalies suggest.
 We acknowledge the modeling groups, the Program for Climate Model Diagnosis and Intercomparison (PCMDI), and the WCRP's Working Group on Coupled Modeling (WGCM) for their roles in making available the WCRP CMIP3 multimodel data set. Support of this data set is provided by the Office of Science, U.S. Department of Energy. We also acknowledge CRU, University of East Anglia, for the HadCRUT3 data set. We thank Nathan Gillett and Daithi Stone for their invaluable advice on the use of the attribution and detection code and interpretation of its output. We thank the authors of the attribution code, Myles Allen and Daithi Stone, for allowing us to use the code. Finally, we thank Reto Knutti, Drew Shindell, and one anonymous reviewer for their helpful review comments. This work was funded by NERC grant NE/E016189/1, “An Observationally-Based Quantification of Climate Feedbacks.”