Surface temperature, upper-air temperature, and ocean heat content data are used to constrain the distributions of the parameters that define three climate system properties in the MIT Integrated Global Systems Model: effective climate sensitivity, the rate of ocean heat uptake into the deep ocean, and net anthropogenic aerosol forcing. Five different surface temperature data records are used to show that the resulting parameter distribution functions are sensitive to the dataset used to estimate the likelihood of model output given the observed climate records. Estimates of effective climate sensitivity mode and mean differ by as much as 1 K between the datasets, with an overall range of 1.2 to 5.3 K. Ocean effective diffusivity distributions are poorly constrained by any dataset. The overall range of net aerosol forcing values, −0.19 to −0.83 Wm−2, is small compared to other uncertainties in climate forcings. Transient climate response (TCR) estimates derived from these distributions range between 0.87 and 2.41 K and the shapes of individual TCR distributions depend on the surface dataset. Understanding the differences in parameter distributions and climate system properties derived from them is critical for understanding the full range of uncertainty involved in climate model calibration and prediction results.
 Developing climate models that produce reliable projections of future climate change is a critical research goal. To this end, models must be properly calibrated to have values of climate system properties that yield behavior similar to the true climate system. Uncertainties in the physical processes and feedbacks that define the climate system properties and resulting behavior introduce additional challenges into the calibration problem [Randall et al., 2007]. Due to the multiple uncertainties, joint probability distributions are derived for the parameters used in the model rather than estimating individual values.
 Transient climate response (TCR) offers an alternative metric of future climate behavior and is a function of both climate sensitivity and the rate of ocean heat uptake [Sokolov et al., 2003; Andrews and Allen, 2008]. The IPCC AR4 estimates TCR to lie between 1 and 3 K [Hegerl et al., 2007]. These bounds encompass TCR distributions derived in other studies and TCRs from individual AOGCMs [Stott and Forest, 2007; Forest et al., 2008; Knutti and Tomassini, 2008]. Defined as the change in global-mean surface temperature at the time of CO2 doubling in response to increasing CO2 concentrations at 1% per year, TCR allows for all climate processes to be active and contribute fully to climate change.
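 As a small worked example of the definition above, the time at which CO2 doubles under a compound 1% per year increase can be computed directly. This is a minimal sketch; the function name `years_to_double` is introduced here for illustration and is not part of the model.

```python
import math


def years_to_double(rate_per_year: float) -> float:
    """Years needed for a quantity growing at a compound annual rate to double."""
    return math.log(2.0) / math.log(1.0 + rate_per_year)


# A 1% per year increase doubles the concentration after roughly 70 years,
# which is the time at which TCR is evaluated.
doubling_years = years_to_double(0.01)
print(f"CO2 doubles after about {doubling_years:.1f} years")
```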
 To date, few studies have investigated how the surface temperature dataset used to compare model output with observational data affects the parameter and TCR distributions. This study explores that sensitivity. In total, five surface temperature data records from three well-known climate centers are used. TCR estimates derived from the parameter distributions for each dataset are also investigated. The resulting distributions show that model calibration is sensitive to the specific surface temperature dataset. Section 2 describes the datasets used and Section 3 describes the methods by which constraints are placed on parameter values. Section 4 presents the results, and Section 5 provides a discussion and summary.
 We use surface temperature data from five climate data records. The first two data records are HadCRUT2 [Jones and Moberg, 2003] and HadCRUT3 [Brohan et al., 2006]. The third is the NCDC merged land-ocean dataset [Smith et al., 2008]. The remaining two records are GISTEMP 250 and GISTEMP 1200 [Hansen et al., 2010] from NASA, with the distinctions reflecting the 250 km and 1200 km radii of influence used in the interpolation algorithm. All data are reported as monthly surface temperature anomalies with respect to a given base period on a 5° × 5° grid. The data records differ from one another and potential reasons for these differences are now discussed briefly.
 One difference between the records is the land surface data used in the analyses. All records obtain a majority of their land surface data from the Global Historical Climatology Network (GHCN) [Peterson and Vose, 1997], but each utilizes the available data differently. For example, the Hadley Centre requires stations to have sufficient data between 1961 and 1990, their climate normal period, to be used in the analysis [Jones and Moberg, 2003; Brohan et al., 2006]. Alternatively, NASA requires that stations have a period of overlap of at least 20 years with stations inside of a 1200 km radius to be used in the analysis [Hansen et al., 2010]. A second difference between the data records is that each uses a different sea surface temperature (SST) dataset. Because oceans cover 70% of the Earth's surface, these choices lead to differences between the temperature data records [Smith et al., 2008]. In a test of the sensitivity to ocean data choice, Hansen et al.  showed that the global mean temperature calculated from GISTEMP data is affected by the choice of SST data. A last difference between the data records is the method used for filling regions with missing data and how the 5° × 5° grid box anomalies are calculated. Specific details of infilling and grid box averaging methods for each data record can be found in corresponding references. At this stage, we have five surface temperature data records and choose to treat them each as equally plausible. We present the results derived from each of them and do not attempt to merge the results into a single posterior distribution.
 Following the work of Forest et al. , this study estimates the joint probability distribution of climate model parameters for effective climate sensitivity (Seff), effective ocean diffusivity of heat anomalies (Kv), and net anthropogenic aerosol forcing (Faer). The climate model component of the MIT Integrated Global Systems Model [Sokolov and Stone, 1998; Sokolov et al., 2005] is used to simulate historical temperature responses for a given choice of the three model parameters. In this study, the parameter space is sampled by varying Seff between 0 and 8 K, Kv between 0 and 25 cm2s−1, and Faer between −1.5 and 0.5 Wm−2. The value of Faer sets the amplitude of the net anthropogenic aerosol forcing in the 1980s in a spatially prescribed forcing pattern and is scaled by historical emissions to represent all unmodeled forcings in simulations [Forest et al., 2001]. Each model run is forced by historical records of greenhouse gas concentrations, sulfate aerosol loadings, tropospheric and stratospheric ozone concentrations, solar irradiance changes, and stratospheric aerosols from volcanic eruptions [Forest et al., 2008]. Model performance under a given set of parameter values is evaluated by comparing model output to historical data using surface temperature, upper-air temperature, and ocean heat content diagnostics as described by Forest et al. .
 Time series of temperature for each zonal band used in the surface temperature diagnostic have been derived using the averaging techniques described in the auxiliary material and are shown in Figure 1, along with the time series used by Forest et al.  from Allen et al. . Although also derived from HadCRUT2 data, the time series from Allen et al.  differs from the time series derived in this study; it is not used in the analysis so that each dataset is treated identically. In general, the patterns in each zonal band are similar, with the sign of the temperature change consistent across a majority of the decades for each dataset. However, the magnitudes of the changes differ. Agreement is weakest among trends in the Southern Hemisphere, particularly for the 30–90 °S region. This is not surprising given that SST datasets differ between the records and a greater fraction of the Earth's surface is ocean in the Southern Hemisphere. Differences are also related to the factors discussed in Section 2.
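 The averaging techniques actually used are given in the auxiliary material and are not reproduced here. Purely as an illustration of what a zonal-band average of gridded anomalies involves, the following sketch assumes a cosine-latitude (area) weighting on a 5° grid and skips missing grid boxes. The function `band_mean`, the toy field, and the weighting choice are assumptions made for this example.

```python
import math


def band_mean(anomalies, lat_centers, lat_min, lat_max):
    """Cosine-latitude weighted mean of gridded anomalies within a latitude band.

    anomalies: one row per latitude; each row holds values or None (missing).
    Returns None when the band contains no valid data.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for lat, row in zip(lat_centers, anomalies):
        if not (lat_min <= lat <= lat_max):
            continue
        weight = math.cos(math.radians(lat))  # grid-box area shrinks toward the poles
        for value in row:
            if value is None:  # missing grid box: contributes nothing
                continue
            weighted_sum += weight * value
            weight_total += weight
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total


# Toy 5 x 5 degree field (36 latitudes x 72 longitudes), NOT real observations:
# a uniform 0.5 K anomaly with one missing box near the South Pole.
lat_centers = [-87.5 + 5.0 * i for i in range(36)]
toy_field = [[0.5] * 72 for _ in lat_centers]
toy_field[0][0] = None

southern_band = band_mean(toy_field, lat_centers, -90.0, -30.0)
print(southern_band)
```

 Because missing boxes are simply dropped, differences in data coverage between records translate directly into differences in the band average, which is one reason the Southern Hemisphere bands are the most sensitive to dataset choice.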
 Another significant difference is the linear temperature trend observed from beginning to end in the period used in the surface diagnostic, 1946–1995. In all zonal bands, GISTEMP datasets show either similar or weaker warming trends than the other datasets. This is most evident in the 30–90 °S zonal band, where GISTEMP data are the warmest in the first decade, yet the coldest in the last decade, relative to the other records. Similar, yet weaker, patterns hold in the remaining zonal bands. In general, the NCDC time series yields the next weakest warming trends, followed by the HadCRUT2 and HadCRUT3 datasets. However, the extent of the differences is much less pronounced than with the GISTEMP datasets and the rank order of the trends is not consistent across all zonal bands.
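 To make the notion of a linear trend over the diagnostic period concrete, the sketch below fits an ordinary least-squares slope to five decadal-mean anomalies. The decade midpoints and anomaly values are invented toy numbers, not values from any of the data records, and `linear_trend` is a helper defined only for this example.

```python
def linear_trend(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    return sxy / sxx


# Toy decadal means for 1946-1995 (hypothetical values, in K).
decade_midpoints = [1950.5, 1960.5, 1970.5, 1980.5, 1990.5]
toy_anomalies = [0.0, 0.1, 0.2, 0.3, 0.4]

trend_per_decade = 10.0 * linear_trend(decade_midpoints, toy_anomalies)
print(f"{trend_per_decade:.2f} K per decade")
```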
 Using the derived time series, new surface temperature diagnostics are calculated for each model run and used to estimate a goodness-of-fit statistic (r2), which evaluates how well model output matches the observed temperature trends. For fixed values of Faer and Kv, the model is run at ten values of Seff, and the resulting r2 values are smoothed by fitting a sixth-order polynomial to the ten data points. After the values have been smoothed, r2 values are interpolated onto a finer grid (see auxiliary material) using least-squares quadratic interpolation of the smoothed data.
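 The smoothing step can be illustrated with a short, self-contained sketch. It fits a sixth-order least-squares polynomial to ten r2 values along the Seff axis and evaluates the fit on a finer grid. The Seff node locations and the r2 values are hypothetical, the normal-equation solver is a generic textbook method rather than the authors' code, and the subsequent quadratic interpolation step is not reproduced.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        partial = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - partial) / a[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate a polynomial (lowest power first) with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def rescale(value, low, high):
    """Map [low, high] onto [-1, 1] to keep the fit well conditioned."""
    return 2.0 * (value - low) / (high - low) - 1.0


# Hypothetical Seff nodes (K) and toy r2 values for one fixed (Kv, Faer) pair.
seff_nodes = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
toy_r2 = [3.0 + 2.0 * (s - 3.0) ** 2 for s in seff_nodes]

low, high = seff_nodes[0], seff_nodes[-1]
coeffs = fit_polynomial([rescale(s, low, high) for s in seff_nodes], toy_r2, degree=6)

# Evaluate the smoothed curve on a finer Seff grid (0.1 K spacing).
fine_grid = [low + 0.1 * i for i in range(76)]
smoothed = [evaluate(coeffs, rescale(s, low, high)) for s in fine_grid]
```

 Rescaling Seff to [-1, 1] before fitting is a numerical convenience chosen for this sketch; it does not change the fitted curve.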
 Using an F-test, each r2 value is then converted to the likelihood that a given model produces output which matches the observations [Forest et al., 2002]. Further details of this process are provided in the auxiliary material. The resulting r2 values at a net aerosol forcing of −0.25 Wm−2 indicate that the regions of the parameter space which are rejected by the surface diagnostic are dependent on which surface dataset is used (Figure 2). At this particular Faer level, model results are consistent with HadCRUT data over a much larger region of the parameter space. In particular, regions that are not inconsistent with the surface data at the 10-percent level are only present when HadCRUT data are used. Significant differences between the surface diagnostics are also evident at the −0.50 and −0.75 Wm−2 Faer levels and tend to show larger acceptance regions for the HadCRUT2 and HadCRUT3 datasets and strong rejection of high Seff values for the GISTEMP datasets (see auxiliary material). Upper-air and ocean diagnostics were not changed from Forest et al. .
 Upon interpolation between Faer levels and repeated application of Bayes' Theorem for each diagnostic, the joint probability distribution for the parameter space is derived. The expert prior on Seff used by Forest et al.  has been applied along with uniform priors on Kv and Faer. The resulting marginal distributions for each parameter are presented in Figures 3a–3c for each surface dataset, along with the distributions derived by Forest et al. . From these distributions, it is clear that the dataset used for the surface diagnostic impacts the parameter distributions. For Seff, the distributions derived from the GISTEMP datasets yield the lowest values with 5–95% confidence intervals of 1.3 to 3.6 K (GISTEMP 250) and 1.2 to 3.4 K (GISTEMP 1200). This can be traced to the failure of the surface diagnostic in constraining the lower bound of the distribution and can be attributed to the weaker warming trends previously discussed (i.e., model runs with higher Seff values yield warming which is too strong to be consistent with the GISTEMP datasets). These fifth percentiles fall below the lower bound of 2.0 K given in the IPCC AR4. The remaining datasets yield similar, yet still noticeably different, results. Of these, the HadCRUT datasets yield wider 5–95% confidence intervals, 2.0 to 5.3 K (HadCRUT2) and 1.9 to 5.1 K (HadCRUT3), than the NCDC dataset, for which the interval is 1.8 to 4.7 K. Each upper bound is greater than the upper bound of 4.5 K in the IPCC AR4.
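 The repeated application of Bayes' Theorem on a gridded parameter space can be sketched in a few lines. The example below works on a one-dimensional Seff grid with a uniform prior and two made-up diagnostic likelihoods; the study itself uses a three-dimensional grid and an expert prior on Seff, so the numbers and helper names here are illustrative assumptions only.

```python
def posterior_on_grid(prior, likelihoods):
    """Multiply a gridded prior by each diagnostic's likelihood, then normalize."""
    posterior = list(prior)
    for likelihood in likelihoods:
        posterior = [p * l for p, l in zip(posterior, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]


def percentile(grid, probabilities, q):
    """Smallest grid value whose cumulative probability reaches q."""
    cumulative = 0.0
    for value, p in zip(grid, probabilities):
        cumulative += p
        if cumulative >= q:
            return value
    return grid[-1]


# Toy Seff grid (K), uniform prior, and hypothetical likelihoods for two diagnostics.
seff_grid = [1.0, 2.0, 3.0, 4.0, 5.0]
uniform_prior = [0.2] * 5
surface_likelihood = [0.1, 0.6, 1.0, 0.6, 0.1]
upper_air_likelihood = [0.5, 1.0, 1.0, 0.5, 0.2]

posterior = posterior_on_grid(uniform_prior, [surface_likelihood, upper_air_likelihood])
lower = percentile(seff_grid, posterior, 0.05)
upper = percentile(seff_grid, posterior, 0.95)
print(posterior, lower, upper)
```

 In a multi-dimensional setting, the marginal distribution for one parameter is obtained by summing the normalized joint posterior over the other grid axes before computing percentiles.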
 Based on the wide confidence intervals regardless of which surface dataset is used, Kv is poorly constrained by the observations. With the exclusion of the GISTEMP datasets, the mode in the distribution is found for low values of ocean heat uptake. This results from the high Seff and high Kv regions being rejected for positive values of net aerosol forcing. GISTEMP datasets demonstrate a long right tail for high values of Kv and show no pronounced mode. A major difference between the distributions derived in this study and those from Forest et al.  is that an estimate of natural variability has been included in the ocean heat content diagnostic. This is analogous to the treatment of the surface diagnostic and accounts for observational errors and natural sources of variability by combining the variability of both sources into a single estimate of the total variability. This estimation results in a decrease in the significance of the ocean heat content signal and leads to weaker constraints on Kv. As a result, broad distributions for Kv are derived when the natural variability estimate is included.
 Weaker Faer values are estimated when using HadCRUT data, with 5–95% confidence intervals of −0.19 to −0.70 Wm−2 (HadCRUT2) and −0.22 to −0.74 Wm−2 (HadCRUT3). The remaining datasets yield approximately 0.1 Wm−2 stronger aerosol forcing, with 5–95% confidence intervals of −0.37 to −0.78 Wm−2 (NCDC), −0.32 to −0.83 Wm−2 (GISTEMP 250), and −0.33 to −0.80 Wm−2 (GISTEMP 1200). The slightly weaker Faer values from the HadCRUT datasets can be attributed to the larger acceptance regions at the −0.25 Wm−2 aerosol level seen in Figure 2. However, the overall range of Faer based on the 5–95% confidence intervals, −0.19 to −0.83 Wm−2, is smaller than errors in other model forcing terms.
 TCR distributions are derived from the parameter distributions. From each joint distribution, a 1000-member Latin Hypercube sample [McKay et al., 1979] of Seff-Kv pairs is drawn. Using a functional fit calibrated by prior runs of the model, the resulting TCR is calculated for each pair [Sokolov et al., 2003] and cumulative distribution functions are estimated (Figure 3d).
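 For readers unfamiliar with Latin Hypercube sampling, the sketch below draws a 1000-member sample on the unit square so that each of the 1000 equal-width strata in each dimension contains exactly one point. Mapping these points onto Seff-Kv pairs requires the inverse cumulative distributions from the joint posterior, and converting pairs to TCR requires the functional fit of Sokolov et al.; neither is reproduced here. The function name and random seed are choices made for this example.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum per dimension."""
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each of the n_samples equal-width strata.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


rng = random.Random(0)  # fixed seed for reproducibility of the example
unit_sample = latin_hypercube(1000, 2, rng)
print(unit_sample[:3])
```

 Compared with simple random sampling, this stratification guarantees that the tails of each marginal distribution are represented even with a modest sample size.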
 We note that the lower bounds on TCR values for the GISTEMP results are less than the lower bound of 1 K from the IPCC AR4. Ranges of 0.87 to 1.32 K (GISTEMP 250) and 0.91 to 1.35 K (GISTEMP 1200) mark the 5–95% confidence intervals. These lower TCRs can be attributed to the low values of Seff and long right tails in the Kv distributions derived using the GISTEMP datasets. With a lower Seff, the equilibrium temperature change will be less for a given forcing. Furthermore, more efficient mixing of heat into the deep ocean acts to reduce surface temperatures as well. The HadCRUT2, HadCRUT3, and NCDC 5–95% intervals are bounded by 1.24 to 2.31 K, 1.13 to 2.41 K, and 1.10 to 1.96 K, respectively. All three of these intervals fall within the range of TCR values given in the IPCC AR4. Given that the Kv distributions are similar for these datasets, similar values are drawn in the Latin Hypercube sample and it follows that the Seff distributions should dominate the TCR distributions (Figure 3d). Given that the Faer distributions are nearly identical across all datasets, these results show that TCR follows shifts in the distributions of Seff and Kv rather than those for Faer.
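 The interval comparisons in this paragraph are simple arithmetic and can be checked directly. The sketch below uses the 5–95% TCR intervals quoted above and the IPCC AR4 range of 1 to 3 K; the helper names `envelope` and `inside` are introduced here for illustration.

```python
# 5-95% TCR intervals (K) as quoted in the text.
tcr_intervals = {
    "HadCRUT2": (1.24, 2.31),
    "HadCRUT3": (1.13, 2.41),
    "NCDC": (1.10, 1.96),
    "GISTEMP 250": (0.87, 1.32),
    "GISTEMP 1200": (0.91, 1.35),
}
AR4_TCR_RANGE = (1.0, 3.0)  # IPCC AR4 range for TCR (K)


def envelope(intervals):
    """Smallest lower bound and largest upper bound across all intervals."""
    lows = [low for low, _ in intervals.values()]
    highs = [high for _, high in intervals.values()]
    return min(lows), max(highs)


def inside(interval, bounds):
    """True when the interval lies entirely within the given bounds."""
    return bounds[0] <= interval[0] and interval[1] <= bounds[1]


overall_tcr = envelope(tcr_intervals)
within_ar4 = {name: inside(iv, AR4_TCR_RANGE) for name, iv in tcr_intervals.items()}
print(overall_tcr)
print(within_ar4)
```

 The envelope across all five datasets spans 0.87 to 2.41 K, and only the two GISTEMP intervals extend below the AR4 lower bound.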
 As a surrogate for future warming, TCR distributions measure the model's global mean temperature response to idealized forcing scenarios. Similar to climate sensitivity estimates, TCR has a profound impact on policy decisions regarding climate change adaptation and mitigation strategies. If the information from TCR distributions is properly included in policy decisions, the effectiveness of these strategies can potentially be improved.
 The results presented here show that climate model parameter constraints are sensitive to the surface dataset used to compare with model output. In general, the ranges of the effective climate sensitivity parameter distributions are comparable, but are shifted relative to each other depending on which surface dataset is used. The biggest shift in effective climate sensitivity distributions is observed when the GISTEMP datasets are used. Using the 5–95% confidence intervals and considering all datasets, climate sensitivity is found to be between 1.2 and 5.3 K. Regardless of the surface data used, effective ocean diffusivity is poorly constrained by the data. Anthropogenic aerosol forcing is found to be between −0.19 and −0.83 Wm−2 when considering all datasets.
 TCR estimates are also sensitive to the choice of surface data. When all surface datasets are considered, transient warming is found to lie between 0.87 and 2.41 K. However, this range masks the differences that exist between the individual distributions. The TCR distributions derived from GISTEMP data are narrower and yield weaker warming. In contrast, distributions derived from Hadley Centre datasets are wider and yield stronger warming. Given that both the parameter and TCR distributions differ when using different datasets, additional uncertainty is present in model calibration and climate projection studies. Future studies using these datasets must account for these differences to avoid overconfidence in predictions through mistreatment of the uncertainty.
 This project was supported in part by NOAA award NA09OAR4310176, DOE award DE-SC0004956, NSF award SES-0825915, and the MIT Joint Program on the Science and Policy of Global Change. We also thank the National Climatic Data Center, the Hadley Centre for Climate Prediction and Research, and the NASA Goddard Institute for Space Studies for producing publicly available surface data products. The authors thank B. Sanderson and K. Tanaka for helpful and constructive reviews.
 The Editor thanks the two anonymous reviewers for their assistance in evaluating this paper.