### Abstract

- Top of page
- Abstract
- 1. Introduction
- 2. Data
- 3. Methods
- 4. Results
- 5. Conclusions
- Acknowledgments
- References
- Supporting Information

[1] Surface temperature, upper-air temperature, and ocean heat content data are used to constrain the distributions of the parameters that define three climate system properties in the MIT Integrated Global Systems Model: effective climate sensitivity, the rate of ocean heat uptake into the deep ocean, and net anthropogenic aerosol forcing. Five different surface temperature data records are used to show that the resulting parameter distribution functions are sensitive to the dataset used to estimate the likelihood of model output given the observed climate records. Estimates of effective climate sensitivity mode and mean differ by as much as 1 K between the datasets, with an overall range of 1.2 to 5.3 K. Ocean effective diffusivity distributions are poorly constrained by any dataset. The overall range of net aerosol forcing values, −0.19 to −0.83 Wm^{−2}, is small compared to other uncertainties in climate forcings. Transient climate response (TCR) estimates derived from these distributions range between 0.87 and 2.41 K and the shapes of individual TCR distributions depend on the surface dataset. Understanding the differences in parameter distributions and climate system properties derived from them is critical for understanding the full range of uncertainty involved in climate model calibration and prediction results.

### 1. Introduction

[2] Developing climate models that produce reliable projections of future climate change is a critical research goal. To this end, models must be properly calibrated to have values of climate system properties that yield behavior similar to the true climate system. Uncertainties in the physical processes and feedbacks that define the climate system properties and resulting behavior introduce additional challenges into the calibration problem [*Randall et al.*, 2007]. Because of these multiple uncertainties, joint probability distributions are derived for the model parameters rather than single best-estimate values.

[3] Multiple studies have derived probability distribution functions for uncertain model parameters [*Andronova and Schlesinger*, 2001; *Forest et al.*, 2002, 2008; *Knutti et al.*, 2003; *Tomassini et al.*, 2007; *Sansó and Forest*, 2009; *Urban and Keller*, 2010]. While the specific approaches differ, the same general methodology is used for deriving the distribution functions. Exploiting the computational efficiency of simple climate models and Earth System Models of Intermediate Complexity, hundreds of model runs are used to calibrate model parameters and resulting properties by comparing model output to observational data. Equilibrium climate sensitivity has been extensively studied. A synthesis of current work provided in the IPCC AR4 [*Hegerl et al.*, 2007] estimates climate sensitivity to be between 2.0 and 4.5 °C with a greater than 66-percent probability. However, several studies have shown that climate sensitivities can lie outside of the IPCC upper bound [*Andronova and Schlesinger*, 2001; *Forest et al.*, 2002, 2008; *Knutti et al.*, 2003; *Tomassini et al.*, 2007; *Tanaka et al.*, 2009; *Urban and Keller*, 2010], and a summary of sensitivity estimates is given by *Knutti and Hegerl* [2008]. Given that climate sensitivity is a measure of future warming, these uncertainties have a profound impact on policy decisions [*McInerney and Keller*, 2008].

[4] Transient climate response (TCR) offers an alternative metric of future climate behavior and is a function of both climate sensitivity and the rate of ocean heat uptake [*Sokolov et al.*, 2003; *Andrews and Allen*, 2008]. The IPCC AR4 estimates TCR to lie between 1 and 3 K [*Hegerl et al.*, 2007]. These bounds encompass TCR distributions derived in other studies and TCRs from individual AOGCMs [*Stott and Forest*, 2007; *Forest et al.*, 2008; *Knutti and Tomassini*, 2008]. Defined as the change in global-mean surface temperature at the time of CO_{2} doubling in response to increasing CO_{2} concentrations at 1% per year, TCR allows for all climate processes to be active and contribute fully to climate change.
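As a rough check on the timescale implied by this definition, the following minimal sketch computes how long a 1% per year compound increase takes to double the CO_{2} concentration. The function name is illustrative and is not taken from any model code.

```python
import math


def years_to_reach(factor: float, annual_growth: float) -> float:
    """Years for a quantity growing at a compound annual rate to grow by `factor`."""
    return math.log(factor) / math.log(1.0 + annual_growth)


# CO2 increasing at 1% per year, evaluated at the time of doubling.
doubling_time = years_to_reach(2.0, 0.01)
print(f"CO2 doubles after about {doubling_time:.1f} years")
```

Under this idealized scenario, doubling occurs near year 70, so TCR describes warming on a multidecadal horizon rather than at equilibrium.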

[5] This study explores how the choice of surface temperature dataset used to compare model output with observations affects the parameter constraints and the resulting TCR distributions, a question that few studies have investigated to date. In total, five surface temperature data records from three well-known climate centers are used. TCR estimates are derived from the parameter distributions obtained with each dataset. The resulting distributions show that model calibration is sensitive to the specific surface temperature dataset. Section 2 describes the datasets used and Section 3 describes the methods by which constraints are placed on parameter values. Section 4 presents the results, and Section 5 provides a discussion and summary.

### 2. Data

[6] We use surface temperature data from five climate data records. The first two data records are HadCRUT2 [*Jones and Moberg*, 2003] and HadCRUT3 [*Brohan et al.*, 2006]. The third is the NCDC merged land-ocean dataset [*Smith et al.*, 2008]. The remaining two records are GISTEMP 250 and GISTEMP 1200 [*Hansen et al.*, 2010] from NASA, with the distinctions reflecting the 250 km and 1200 km radii of influence used in the interpolation algorithm. All data are reported as monthly surface temperature anomalies with respect to a given base period on a 5° × 5° grid. The data records differ from one another and potential reasons for these differences are now discussed briefly.

[7] One difference between the records is the land surface data used in the analyses. All records obtain a majority of their land surface data from the Global Historical Climatology Network (GHCN) [*Peterson and Vose*, 1997], but each utilizes the available data differently. For example, the Hadley Centre requires stations to have sufficient data between 1961 and 1990, their climate normal period, to be used in the analysis [*Jones and Moberg*, 2003; *Brohan et al.*, 2006]. Alternatively, NASA requires that stations have a period of overlap of at least 20 years with stations inside of a 1200 km radius to be used in the analysis [*Hansen et al.*, 2010]. A second difference between the data records is that each uses a different sea surface temperature (SST) dataset. Because oceans cover 70% of the Earth's surface, these choices lead to differences between the temperature data records [*Smith et al.*, 2008]. In a test of the sensitivity to ocean data choice, *Hansen et al.* [2010] showed that the global mean temperature calculated from GISTEMP data is affected by the choice of SST data. A last difference between the data records is the method used for filling regions with missing data and how the 5° × 5° grid box anomalies are calculated. Specific details of infilling and grid box averaging methods for each data record can be found in corresponding references. At this stage, we have five surface temperature data records and choose to treat them each as equally plausible. We present the results derived from each of them and do not attempt to merge the results into a single posterior distribution.
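To make the notion of a monthly anomaly with respect to a base period concrete, the sketch below subtracts each calendar month's base-period mean from a monthly series. It is a generic illustration, not the procedure of any of the data centers: the function name and the synthetic record are hypothetical, and the 1961–1990 base period is simply the Hadley Centre normal period mentioned above (other records may use different base periods).

```python
from statistics import mean


def monthly_anomalies(series, base_start, base_end):
    """Subtract each calendar month's base-period mean from a monthly series.

    `series` maps (year, month) -> temperature; missing months are simply absent.
    """
    climatology = {}
    for month in range(1, 13):
        base_values = [t for (y, m), t in series.items()
                       if m == month and base_start <= y <= base_end]
        if base_values:  # months with no base-period data stay undefined
            climatology[month] = mean(base_values)
    return {(y, m): t - climatology[m]
            for (y, m), t in series.items() if m in climatology}


# Synthetic grid-box record (not real data): seasonal cycle plus 0.01 K/yr warming.
seasonal = [0.0, 1.0, 3.0, 6.0, 10.0, 14.0, 16.0, 15.0, 12.0, 8.0, 4.0, 1.0]
record = {(y, m): seasonal[m - 1] + 0.01 * (y - 1946)
          for y in range(1946, 1996) for m in range(1, 13)}

anoms = monthly_anomalies(record, 1961, 1990)
print(f"July 1995 anomaly: {anoms[(1995, 7)]:.3f} K")
```

Because the seasonal cycle is removed month by month, only the departure from the base-period climate remains, which is what allows records with different absolute temperatures to be compared.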

### 3. Methods

[8] Following the work of *Forest et al.* [2008], this study estimates the joint probability distribution of climate model parameters for effective climate sensitivity (*S*_{eff}), effective ocean diffusivity of heat anomalies (*K*_{v}), and net anthropogenic aerosol forcing (*F*_{aer}). Using the climate model component of the MIT Integrated Global Systems Model [*Sokolov and Stone*, 1998; *Sokolov et al.*, 2005], the model simulates historical temperature responses given choices of the three model parameters. In this study, the parameter space is sampled by varying *S*_{eff} between 0 and 8 K, *K*_{v} between 0 and 25 cm^{2}s^{−1}, and *F*_{aer} between −1.5 and 0.5 Wm^{−2}. The value of *F*_{aer} sets the amplitude of the net anthropogenic aerosol forcing in the 1980s in a spatially prescribed forcing pattern and is scaled by historical emissions to represent all unmodeled forcings in simulations [*Forest et al.*, 2001]. Each model run is forced by historical records of greenhouse gas concentrations, sulfate aerosol loadings, tropospheric and stratospheric ozone concentrations, solar irradiance changes, and stratospheric aerosols from volcanic eruptions [*Forest et al.*, 2008]. Model performance under a given set of parameter values is evaluated through comparison of model output to historical data using surface temperature, upper-air temperature, and ocean heat content diagnostics as described by *Forest et al.* [2008].

### 4. Results

[9] Time series of temperature for each zonal band used in the surface temperature diagnostic have been derived using the averaging techniques described in the auxiliary material and are shown in Figure 1, along with the time series used by *Forest et al.* [2008] from *Allen et al.* [2000]. Note that the time series from *Allen et al.* [2000], although also derived from HadCRUT2 data, differs from the HadCRUT2 time series derived in this study; it is not used in the analysis here, so that each dataset is treated identically. In general, the patterns in each zonal band are similar, with the sign of the temperature change consistent across a majority of the decades for each dataset. However, the magnitudes of the changes differ. In particular, agreement is weakest among trends in the Southern Hemisphere, especially for the 30–90 °S region. This is not surprising given that SST datasets differ between the records and a greater fraction of the Earth's surface is ocean in the Southern Hemisphere. Differences are also related to the factors discussed in Section 2.
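The averaging actually used for the zonal bands is described in the auxiliary material. As a generic illustration only, the sketch below computes an area-weighted mean over a latitude band of a 5° × 5° grid, weighting each box by the cosine of its central latitude and skipping boxes without data. The function name, grid layout, and test field are hypothetical.

```python
import math


def zonal_band_mean(grid, lat_south, lat_north, box_deg=5.0):
    """Area-weighted mean anomaly over a latitude band.

    `grid[i][j]` is the anomaly in the box whose southern edge is at
    -90 + i * box_deg degrees; `None` marks a box with no data.
    """
    total, weight_sum = 0.0, 0.0
    for i, row in enumerate(grid):
        lat_center = -90.0 + (i + 0.5) * box_deg
        if not (lat_south <= lat_center <= lat_north):
            continue
        # For equal-angle boxes, area is proportional to cos(central latitude).
        weight = math.cos(math.radians(lat_center))
        for value in row:
            if value is not None:
                total += weight * value
                weight_sum += weight
    return total / weight_sum if weight_sum > 0 else None


# Synthetic field (not real data): uniform 0.5 K anomaly with one missing box.
n_lat, n_lon = 36, 72
grid = [[0.5] * n_lon for _ in range(n_lat)]
grid[2][10] = None
band_mean = zonal_band_mean(grid, -90.0, -30.0)
print(f"30-90S band mean: {band_mean:.2f} K")
```

Different infilling choices change which boxes are `None`, which is one way in which records built from similar station data can still yield different zonal-band averages.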

[10] Another significant difference is the linear temperature trend over the period used in the surface diagnostic, 1946–1995. In all zonal bands, GISTEMP datasets show either similar or weaker warming trends than the other datasets. This is most evident in the 30–90 °S zonal band, where GISTEMP data are the warmest in the first decade yet the coldest in the last decade relative to the other records. Similar, yet weaker, patterns hold in the remaining zonal bands. In general, the NCDC time series yields the next weakest warming trends, followed by the HadCRUT2 and HadCRUT3 datasets. However, the extent of the differences is much less pronounced than with the GISTEMP datasets and the rank order of the trends is not consistent across all zonal bands.
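The trend comparison can be illustrated with an ordinary least-squares slope over 1946–1995. The two series below are synthetic and chosen only to mimic the qualitative pattern described above, in which one record starts warmer and ends colder than another; they are not taken from any of the datasets.

```python
def linear_trend(years, values):
    """Ordinary least-squares slope of `values` against `years` (units per year)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx


years = list(range(1946, 1996))
# Synthetic zonal-band anomalies in K (illustrative values only).
record_a = [0.010 * (y - 1946) for y in years]         # stronger warming
record_b = [0.05 + 0.006 * (y - 1946) for y in years]  # warmer start, weaker warming

trend_a = 10.0 * linear_trend(years, record_a)  # K per decade
trend_b = 10.0 * linear_trend(years, record_b)
print(f"record A: {trend_a:.2f} K/decade, record B: {trend_b:.2f} K/decade")
```

A record that is warmest early and coldest late necessarily has the weakest trend over the period, which is the sense in which a model run with strong warming is harder to reconcile with it.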

[11] Using the derived time series, new surface temperature diagnostics are calculated for each model run and used to estimate a goodness-of-fit statistic (*r*^{2}), which evaluates how well model output matches the observed temperature trends. For each fixed pair of *F*_{aer} and *K*_{v} values, the model is run at ten values of *S*_{eff}, and the resulting *r*^{2} values are smoothed by fitting a sixth-order polynomial to the ten data points. The smoothed *r*^{2} values are then interpolated onto a finer grid (see auxiliary material) using least-squares quadratic interpolation.
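The smoothing step can be sketched as a least-squares fit of a sixth-order polynomial to ten (*S*_{eff}, *r*^{2}) points. The ten *S*_{eff} values, the synthetic *r*^{2} values, and the helper names below are hypothetical, and the normal-equation solver is a simple stand-in chosen to keep the example dependency-free; a library routine such as `numpy.polyfit` would normally be preferred. The subsequent interpolation onto the finer grid is not shown.

```python
def polyfit_least_squares(x, y, degree):
    """Least-squares polynomial coefficients (lowest order first) via normal equations."""
    n = degree + 1
    # Normal equations A c = b with A[j][k] = sum(x^(j+k)) and b[j] = sum(y * x^j).
    a = [[sum(xi ** (j + k) for xi in x) for k in range(n)] for j in range(n)]
    b = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for j in reversed(range(n)):
        tail = sum(a[j][k] * coeffs[k] for k in range(j + 1, n))
        coeffs[j] = (b[j] - tail) / a[j][j]
    return coeffs


def polyval(coeffs, x):
    """Evaluate a polynomial given lowest-order-first coefficients."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Hypothetical S_eff sample points (K) for one fixed (F_aer, K_v) pair.
s_values = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
u = [(s - 4.0) / 4.0 for s in s_values]  # rescale to [-1, 1] for conditioning
# Synthetic r^2 values: a smooth bowl plus small run-to-run scatter.
scatter = [0.4, -0.3, 0.2, -0.4, 0.3, -0.2, 0.4, -0.3, 0.2, -0.4]
r2_raw = [20.0 + 3.0 * (s - 3.0) ** 2 + e for s, e in zip(s_values, scatter)]

coeffs = polyfit_least_squares(u, r2_raw, degree=6)
r2_smoothed = [polyval(coeffs, ui) for ui in u]
for s, raw, smooth in zip(s_values, r2_raw, r2_smoothed):
    print(f"S_eff={s:3.1f} K  raw={raw:6.2f}  smoothed={smooth:6.2f}")
```

With seven coefficients fitted to ten points, the polynomial retains the broad shape of the *r*^{2} curve while damping point-to-point scatter.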

[12] Using an F-test, each *r*^{2} value is then converted to the likelihood that a given model produces output that matches the observations [*Forest et al.*, 2002]. Further details of this process are provided in the auxiliary material. The resulting *r*^{2} values at a net aerosol forcing of −0.25 Wm^{−2} indicate that the regions of the parameter space that are rejected by the surface diagnostic depend on which surface dataset is used (Figure 2). At this particular *F*_{aer} level, model results are consistent with HadCRUT data over a much larger region of the parameter space. In particular, regions that are not inconsistent with the surface data at the 10-percent level are only present when HadCRUT data are used. Significant differences between the surface diagnostics are also evident at the −0.50 and −0.75 Wm^{−2} *F*_{aer} levels, where acceptance regions tend to be larger for the HadCRUT2 and HadCRUT3 datasets and high *S*_{eff} values are strongly rejected for the GISTEMP datasets (see auxiliary material). Upper-air and ocean diagnostics were not changed from *Forest et al.* [2008].
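The conversion itself is specified in *Forest et al.* [2002] and the auxiliary material. As a hedged illustration of how an F-test can turn a goodness-of-fit statistic into a likelihood-like weight, the sketch below assumes that the excess of *r*^{2} over its minimum, divided by the number of free parameters (three here), is compared against an F distribution. That specific form, the 40 noise degrees of freedom, and all function names are assumptions made for illustration and are not details taken from this study.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom.

    Returns 0 at x <= 0, which is the correct limit only for d1 > 2.
    """
    if x <= 0.0:
        return 0.0
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = (0.5 * d1 * math.log(d1 / d2)
               + (0.5 * d1 - 1.0) * math.log(x)
               - 0.5 * (d1 + d2) * math.log(1.0 + d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)


def f_survival(x, d1, d2, steps=4000):
    """P(F > x), from Simpson's-rule integration of the density on [0, x]."""
    if x <= 0.0:
        return 1.0
    h = x / steps  # `steps` must be even for Simpson's rule
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return max(0.0, 1.0 - total * h / 3.0)


def likelihood_from_r2(r2, r2_min, n_params=3, noise_dof=40):
    """Assumed form: weight = P(F > (r2 - r2_min) / n_params)."""
    delta = max(r2 - r2_min, 0.0) / n_params
    return f_survival(delta, n_params, noise_dof)


# Synthetic r^2 values relative to a best fit of 20.0 (illustrative only).
for r2 in (20.0, 23.0, 26.7, 28.5):
    print(f"r2={r2:5.1f}  weight={likelihood_from_r2(r2, 20.0):.3f}")
```

Under these assumptions, a parameter setting whose weight falls below 0.10 would be rejected at the 10-percent level, which is how acceptance and rejection regions such as those in Figure 2 can be read.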

[13] Upon interpolation between *F*_{aer} levels and repeated application of Bayes' Theorem for each diagnostic, the joint probability distribution for the parameter space is derived. The expert prior on *S*_{eff} used by *Forest et al.* [2008] has been applied along with uniform priors on *K*_{v} and *F*_{aer}. The resulting marginal distributions for each parameter are presented in Figures 3a–3c for each surface dataset, along with the distributions derived by *Forest et al.* [2008]. From these distributions, it is clear that the dataset used for the surface diagnostic impacts the parameter distributions. For *S*_{eff}, the distributions derived from the GISTEMP datasets yield the lowest values with 5–95% confidence intervals of 1.3 to 3.6 K (GISTEMP 250) and 1.2 to 3.4 K (GISTEMP 1200). This can be traced to the failure of the surface diagnostic to constrain the lower bound of the distribution and can be attributed to the weaker warming trends previously discussed (i.e., model runs with higher *S*_{eff} values yield warming that is too strong to be consistent with the GISTEMP datasets). The fifth percentiles fall below the lower bound of 2.0 K given in the IPCC AR4. The remaining datasets yield similar, yet still noticeably different, results. Of these, the HadCRUT datasets yield wider 5–95% confidence intervals, 2.0 to 5.3 K (HadCRUT2) and 1.9 to 5.1 K (HadCRUT3), than the NCDC dataset, 1.8 to 4.7 K. Each upper bound is greater than the upper bound of 4.5 K in the IPCC AR4.
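The repeated application of Bayes' Theorem on a parameter grid, followed by marginalization and extraction of a 5–95% interval, can be sketched as follows. Everything numerical here is synthetic: the grid spacing, the Gaussian-shaped likelihoods standing in for the three diagnostics, and the uniform prior on *S*_{eff} (used in place of the expert prior, whose form is not reproduced here) are assumptions for illustration, and only two of the three parameters are included to keep the example short.

```python
import math


def posterior_on_grid(prior, likelihoods):
    """Multiply a gridded prior by each diagnostic's likelihood, then normalize."""
    post = list(prior)
    for like in likelihoods:
        post = [p * l for p, l in zip(post, like)]
    total = sum(post)
    return [p / total for p in post]


def percentile(grid, probs, q):
    """Smallest grid value whose cumulative probability reaches q (0 < q < 1)."""
    cumulative = 0.0
    for value, p in zip(grid, probs):
        cumulative += p
        if cumulative >= q:
            return value
    return grid[-1]


def bell(x, center, width):
    """Unnormalized Gaussian shape used as a stand-in likelihood."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)


# Hypothetical grid: S_eff from 0.5 to 8 K, K_v from 0 to 25 cm^2/s.
s_grid = [0.5 * i for i in range(1, 17)]
k_grid = [5.0 * j for j in range(0, 6)]
points = [(s, k) for s in s_grid for k in k_grid]

prior = [1.0] * len(points)  # uniform stand-in; the study uses an expert prior on S_eff
surface = [bell(s, 3.0, 1.0) for s, k in points]
upper_air = [bell(s, 3.5, 2.0) for s, k in points]
ocean = [bell(k, 5.0, 15.0) for s, k in points]

joint = posterior_on_grid(prior, [surface, upper_air, ocean])

# Marginal distribution of S_eff: sum the joint distribution over K_v.
s_marginal = [sum(p for (s, k), p in zip(points, joint) if s == s0) for s0 in s_grid]
s_low = percentile(s_grid, s_marginal, 0.05)
s_high = percentile(s_grid, s_marginal, 0.95)
print(f"S_eff 5-95% interval on this synthetic grid: {s_low} to {s_high} K")
```

Swapping in a different surface likelihood while holding the other two fixed shifts the marginal and its percentiles, which is the mechanism behind the dataset dependence reported here.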

[14] Based on the wide confidence intervals regardless of which surface dataset is used, *K*_{v} is poorly constrained by the observations. With the exclusion of the GISTEMP datasets, the mode in the distribution is found for low values of ocean heat uptake. This results from the high *S*_{eff} and high *K*_{v} regions being rejected for positive values of net aerosol forcing. GISTEMP datasets demonstrate a long right tail for high values of *K*_{v} and show no pronounced mode. A major difference between the distributions derived in this study and those from *Forest et al.* [2008] is that an estimate of natural variability has been included in the ocean heat content diagnostic. This is analogous to the treatment of the surface diagnostic and accounts for observational errors and natural sources of variability by combining the variability of both sources into a single estimate of the total variability. This estimation results in a decrease in the significance of the ocean heat content signal and leads to weaker constraints on *K*_{v}. As a result, broad distributions for *K*_{v} are derived when the natural variability estimate is included.

[15] Weaker *F*_{aer} values are estimated when using HadCRUT data, with 5–95% intervals of −0.19 to −0.70 Wm^{−2} (HadCRUT2) and −0.22 to −0.74 Wm^{−2} (HadCRUT3). The remaining datasets yield approximately 0.1 Wm^{−2} stronger aerosol forcing, with 5–95% confidence intervals of −0.37 to −0.78 Wm^{−2} (NCDC), −0.32 to −0.83 Wm^{−2} (GISTEMP 250), and −0.33 to −0.80 Wm^{−2} (GISTEMP 1200). The slightly weaker *F*_{aer} values from the HadCRUT datasets can be attributed to the larger acceptance regions at the −0.25 Wm^{−2} aerosol level seen in Figure 2. However, the overall range of *F*_{aer} based on the 5–95% confidence intervals, −0.19 to −0.83 Wm^{−2}, is smaller than errors in other model forcing terms.

[16] TCR distributions are derived from the parameter distributions. From each joint distribution, a 1000-member Latin Hypercube sample [*McKay et al.*, 1979] of *S*_{eff}-*K*_{v} pairs is drawn. Using a functional fit calibrated by prior runs of the model, TCR is calculated for each pair [*Sokolov et al.*, 2003] and cumulative distribution functions are estimated (Figure 3d).
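A minimal sketch of Latin Hypercube sampling is given below. It stratifies the unit square so that each of the 1000 equal-probability strata is sampled exactly once per dimension, then maps the draws through discrete marginal distributions. The marginals are hypothetical, and mapping each parameter through its own marginal ignores any correlation between *S*_{eff} and *K*_{v} in the joint distribution, which is a simplification relative to sampling the joint distribution directly. The functional fit for TCR is not reproduced here.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Points in the unit hypercube with one point per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each of n equal-width strata, then shuffle the order.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)
        columns.append(column)
    return list(zip(*columns))


def inverse_cdf(grid, probs, u):
    """Map a uniform draw u to a grid value through a discrete distribution."""
    cumulative = 0.0
    for value, p in zip(grid, probs):
        cumulative += p
        if u <= cumulative:
            return value
    return grid[-1]


# Hypothetical discrete marginals (values and probabilities are illustrative).
s_grid, s_probs = [1.0, 2.0, 3.0, 4.0, 5.0], [0.1, 0.3, 0.3, 0.2, 0.1]
k_grid, k_probs = [0.5, 2.0, 5.0, 10.0, 20.0], [0.3, 0.3, 0.2, 0.1, 0.1]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
unit_points = latin_hypercube(1000, 2, rng)
pairs = [(inverse_cdf(s_grid, s_probs, u1), inverse_cdf(k_grid, k_probs, u2))
         for u1, u2 in unit_points]
print(pairs[:3])
```

Compared with simple random sampling, the stratification ensures that the tails of each marginal are represented in proportion to their probability, which matters when the quantity of interest depends on rare high-*S*_{eff} or high-*K*_{v} draws.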

[17] We note that the lower bounds on TCR for the GISTEMP results are less than the lower bound of 1 K from the IPCC AR4. Ranges of 0.87 to 1.32 K (GISTEMP 250) and 0.91 to 1.35 K (GISTEMP 1200) mark the 5–95% confidence intervals. These lower TCRs can be attributed to the low values of *S*_{eff} and long right tails in the *K*_{v} distributions derived using the GISTEMP datasets. With a lower *S*_{eff}, the equilibrium temperature change will be less for a given forcing. Furthermore, more efficient mixing of heat into the deep ocean acts to reduce surface temperatures as well. The HadCRUT2, HadCRUT3, and NCDC 5–95% intervals are bounded by 1.24 to 2.31 K, 1.13 to 2.41 K, and 1.10 to 1.96 K, respectively. All three of these distributions fall within the range of TCR values given in the IPCC AR4. Given that the *K*_{v} distributions are similar for these datasets, similar values are drawn in the Latin Hypercube sample and it follows that the *S*_{eff} distributions should dominate the TCR distributions (Figure 3d). Given that the *F*_{aer} distributions are nearly identical across all datasets, these results show that TCR follows shifts in the distributions of *S*_{eff} and *K*_{v} rather than those for *F*_{aer}.
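The overall ranges quoted in the abstract follow from taking the envelope of the per-dataset 5–95% intervals reported in this section. The short sketch below reproduces that bookkeeping using the interval values stated above; the dictionary layout and function names are illustrative.

```python
# 5-95% intervals as reported in Section 4, stored as (lower, upper) in numeric order.
intervals = {
    "S_eff (K)": {
        "HadCRUT2": (2.0, 5.3), "HadCRUT3": (1.9, 5.1), "NCDC": (1.8, 4.7),
        "GISTEMP 250": (1.3, 3.6), "GISTEMP 1200": (1.2, 3.4),
    },
    "F_aer (W m^-2)": {
        "HadCRUT2": (-0.70, -0.19), "HadCRUT3": (-0.74, -0.22), "NCDC": (-0.78, -0.37),
        "GISTEMP 250": (-0.83, -0.32), "GISTEMP 1200": (-0.80, -0.33),
    },
    "TCR (K)": {
        "HadCRUT2": (1.24, 2.31), "HadCRUT3": (1.13, 2.41), "NCDC": (1.10, 1.96),
        "GISTEMP 250": (0.87, 1.32), "GISTEMP 1200": (0.91, 1.35),
    },
}


def envelope(by_dataset):
    """Smallest lower bound and largest upper bound across datasets."""
    lows = [low for low, high in by_dataset.values()]
    highs = [high for low, high in by_dataset.values()]
    return min(lows), max(highs)


def widths(by_dataset):
    """Interval width for each dataset, rounded to two decimals."""
    return {name: round(high - low, 2) for name, (low, high) in by_dataset.items()}


for quantity, by_dataset in intervals.items():
    low, high = envelope(by_dataset)
    print(f"{quantity}: {low} to {high}")
print("TCR interval widths (K):", widths(intervals["TCR (K)"]))
```

The envelope hides the structure that the widths reveal: the GISTEMP-based TCR intervals are less than half as wide as those derived from the HadCRUT datasets.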

[18] As a surrogate for future warming, TCR distributions measure the model response, in terms of global mean temperature change, to idealized forcing scenarios. Similar to climate sensitivity estimates, TCR has a profound impact on policy decisions regarding climate change adaptation and mitigation strategies. If the information from TCR distributions is properly included in policy decisions, the effectiveness of these strategies can potentially be improved.

### 5. Conclusions

[19] The results presented here show that climate model parameter constraints are sensitive to the surface dataset used to compare with model output. In general, the ranges of the effective climate sensitivity parameter distributions are comparable, but are shifted relative to each other depending on which surface dataset is used. The biggest shift in effective climate sensitivity distributions is observed when the GISTEMP datasets are used. Using the 5–95% confidence intervals and considering all datasets, climate sensitivity is found to be between 1.2 and 5.3 K. Regardless of the surface data used, effective ocean diffusivity is poorly constrained by the data. Anthropogenic aerosol forcing is found to be between −0.19 and −0.83 Wm^{−2} when considering all datasets.

[20] TCR estimates are also sensitive to the choice of surface data. When all surface datasets are considered, transient warming is found to lie between 0.87 and 2.41 K. However, this range masks the differences that exist between the individual distributions. The TCR distributions derived from GISTEMP data are narrower and yield only minimal warming. In contrast, distributions derived from Hadley Centre datasets are wider and yield stronger warming. Given that both the parameter and TCR distributions differ when using different datasets, additional uncertainty is present in model calibration and climate projection studies. Future studies using these datasets must account for these differences to avoid overconfidence in predictions through mistreatment of the uncertainty.

### Supporting Information

Auxiliary material for this article contains additional discussion of methodology and results.

Additional file information is provided in the readme.txt.
