##### A2.1. Calculation of p-Values

[57] We calculate two different types of *p*-value. The first type, *p*_{c}(*i*) (where *i* is an index over the number of maximally overlapping observed trends), is for comparisons of observed trends and trends estimated from CMIP-3 model control runs with no changes in natural or anthropogenic forcings. The second type of *p*-value, *p*_{f}(*i*), is based on comparisons of observed trends against the externally-forced trends in the spliced 20CEN/A1B experiments.

[58] As used here and subsequently, ‘overlapping’ signifies trend overlap by all but one month. For *L* = 120 months, the first trend is over January 1979 to December 1988, the second trend is over February 1979 to January 1989, etc. Note that all least-squares linear trends were computed from time series of monthly-mean anomalies of spatially-averaged (82.5°N–70°S) observed and simulated TLT data. Anomalies in the 20CEN/A1B runs were defined relative to climatological monthly means over the 384-month period January 1979 to December 2010. Control run anomalies were defined relative to climatological monthly means over the full length of each model's control integration.
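The definition of maximally overlapping trends can be illustrated with a minimal sketch. It assumes a one-dimensional array of monthly-mean anomalies; the function name `overlapping_trends` and the synthetic series are hypothetical and are not part of the original analysis.

```python
import numpy as np

def overlapping_trends(anomalies, L):
    """Least-squares linear trends (per month) for every L-month window,
    with successive windows shifted by one month."""
    y = np.asarray(anomalies, dtype=float)
    if y.size < L:
        raise ValueError("series is shorter than the trend length L")
    t = np.arange(L, dtype=float)
    t_centered = t - t.mean()
    # Each row is one L-month window; there are len(y) - L + 1 rows.
    windows = np.lib.stride_tricks.sliding_window_view(y, L)
    # OLS slope = sum((t - tbar) * y) / sum((t - tbar)**2)
    return windows @ t_centered / np.sum(t_centered ** 2)

# Synthetic 384-month series (an assumption for illustration only).
rng = np.random.default_rng(0)
series = 0.0015 * np.arange(384) + rng.normal(0.0, 0.1, size=384)
trends = overlapping_trends(series, L=120)
print(trends.size)  # 265 windows for L = 120 and 384 months
```

The first window covers months 1 to 120, the second months 2 to 121, and so on, which gives 384 − 120 + 1 = 265 trends.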

[59] We compute both ‘unweighted’ and ‘weighted’ forms of *p*_{c}(*i*) and *p*_{f}(*i*). The weighted forms, *p*_{c}(*i*)′ and *p*_{f}(*i*)′, are distinguished by the use of prime notation (′), and account for inter-model differences in either the length/number of realizations of the control run or in the number of realizations of the spliced 20CEN/A1B run (respectively).

[60] Consider first the ‘unweighted’ form of *p*_{c}(*i*). For a stipulated trend length *L* (in months), the *p*_{c}(*i*) value is defined as:

$$p_c(i) = \frac{K_c(i)}{N_c}, \qquad i = 1, \ldots, N_o \tag{A1}$$

where *K*_{c}(*i*) is the number of *L*-month trends in the MMSD of control run trends that are larger than *b*_{o}(*i*) (the current *L*-month observed trend), *N*_{c} is the total number of overlapping *L*-month trends in the MMSD of control run trends, and *N*_{o} is the total number of overlapping *L*-month observed trends in the 384-month analysis period. For *L* = 120 months, *N*_{c} = 120965 and *N*_{o} = 265.
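As a rough sketch of the unweighted calculation, the fraction of pooled control run trends that exceed each observed trend could be computed as follows. The function name and the toy trend values are hypothetical; each list entry is assumed to hold the trends from one model's control run.

```python
import numpy as np

def p_c_unweighted(obs_trends, control_trends_by_model):
    """Fraction of pooled control run trends larger than each observed trend."""
    # Pool the trends from all models into one multi-model distribution.
    pooled = np.sort(np.concatenate([np.ravel(m) for m in control_trends_by_model]))
    N_c = pooled.size
    # K_c(i): number of pooled trends strictly larger than b_o(i).
    K_c = N_c - np.searchsorted(pooled, obs_trends, side="right")
    return K_c / N_c

# Toy values (assumptions): a longer and a shorter control run.
control = [np.array([-0.2, 0.0, 0.1, 0.3]), np.array([-0.1, 0.2])]
obs = np.array([0.05, 0.25])
p_c = p_c_unweighted(obs, control)
```

With these toy values, three of the six pooled trends exceed 0.05 and one exceeds 0.25, so the longer control run contributes more samples to the result.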

[61] The time series of spatially-averaged TLT anomalies from individual models are not concatenated prior to trend calculation (which could spuriously inflate trends spanning the ‘splice point’ between two different model control runs). Instead, overlapping trends are calculated separately from each realization of each individual model's TLT time series, and each model's TLT trends are then accumulated in a multi-model trend distribution.

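[62] In the ‘weighted’ form, *p*_{c}(*i*)′, individual *p*_{c}(*i*, *j*) values are first calculated separately for each model, and the accumulated *p*_{c}(*i*, *j*) values are then averaged:

$$p_c(i)' = \frac{1}{N_{\mathrm{model}}} \sum_{j=1}^{N_{\mathrm{model}}} p_c(i, j), \qquad i = 1, \ldots, N_o \tag{A2}$$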

where *j* is a combined index over the number of models and the number of control run realizations per model, and *N*_{model} (the number of CMIP-3 models with pre-industrial control runs from which synthetic MSU temperatures could be calculated) = 22. The individual *p*_{c}(*i*, *j*) values for each model are calculated as follows:

$$p_c(i, j) = \frac{K_c(i, j)}{N_c(j)} \tag{A3}$$

where *K*_{c}(*i*, *j*) is, for the *i*th observed trend and the *j*th model, the number of *L*-month trends in the pre-industrial control run larger than *b*_{o}(*i*), and *N*_{c}(*j*) is the total number of overlapping *L*-month trends in the *j*th model's control run.
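The weighted form can be sketched in the same way. This is an illustrative sketch only: it assumes that each list entry pools all control run trends from one model, and the function name and toy values are hypothetical.

```python
import numpy as np

def p_c_weighted(obs_trends, control_trends_by_model):
    """Average over models of the per-model fraction of control run
    trends larger than each observed trend."""
    obs = np.asarray(obs_trends, dtype=float)
    per_model = []
    for model_trends in control_trends_by_model:
        b_c = np.sort(np.ravel(model_trends))
        # K_c(i, j): trends in model j strictly larger than b_o(i).
        K_cj = b_c.size - np.searchsorted(b_c, obs, side="right")
        per_model.append(K_cj / b_c.size)
    # Every model receives equal weight, whatever its control run length.
    return np.mean(per_model, axis=0)

# Toy values (assumptions): a longer and a shorter control run.
control = [np.array([-0.2, 0.0, 0.1, 0.3]), np.array([-0.1, 0.2])]
obs = np.array([0.05, 0.25])
p_c_prime = p_c_weighted(obs, control)
```

For the second observed trend, the per-model fractions are 1/4 and 0, giving a weighted value of 0.125, whereas pooling all six trends would give 1/6. The two forms differ only when control run lengths differ between models.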

[63] Values of *p*_{c}(*i*) and *p*_{c}(*i*)′ are very similar, indicating that inter-model differences in control run length do not distort our estimates of whether observed TLT trends are unusually large relative to trends arising from internally-generated variability. We show only ‘weighted’ *p*_{c}(*i*)′ values in the main text.

[64] In comparisons involving forced trends from the CMIP-3 20CEN/A1B runs, we seek to determine whether the model TLT trends are unusually large relative to observed trends, as some analysts have claimed [*Douglass et al.*, 2008]. Values of *p*_{f}(*i*) are defined in an analogous way to *p*_{c}(*i*) values:

$$p_f(i) = \frac{K_f(i)}{N_f}, \qquad i = 1, \ldots, N_o \tag{A4}$$

where *K*_{f}(*i*) is the number of *L*-month trends in the 20CEN/A1B MMSD that are smaller than the current observed trend, *N*_{f} is the total number of overlapping *L*-month trends in the 20CEN/A1B MMSD, and *N*_{o} is the total number of overlapping *L*-month observed trends in the 384-month analysis period.

[65] Unlike *p*_{c}(*i*) calculations with the CMIP-3 pre-industrial control runs (where synthetic TLT data for the full length of each control run were used in the calculations, but only 384 months of observational TLT data were analyzed), all *p*_{f}(*i*) and *p*_{f}(*i*)′ values were computed using the same 384-month period (January 1979 to December 2010) in the spliced 20CEN/A1B runs and the observations. The spliced 20CEN/A1B runs provide a total of 51 realizations of forced TLT changes over January 1979 to December 2010. For the case of overlapping 120-month trends, *N*_{f} = 13515 (265 × 51).
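The sample size quoted for 120-month trends follows from 51 realizations with 265 overlapping trends each. The sketch below checks that arithmetic and illustrates the unweighted *p*_{f}(*i*) calculation on synthetic data; the function name and the random trend values are assumptions for illustration, not model output.

```python
import numpy as np

n_months, L, n_realizations = 384, 120, 51
N_o = n_months - L + 1        # overlapping L-month trends per 384-month series
N_f = N_o * n_realizations    # pooled trends across all 20CEN/A1B realizations
print(N_o, N_f)               # 265 13515

def p_f_unweighted(obs_trends, forced_trends):
    """Fraction of pooled forced trends smaller than each observed trend."""
    pooled = np.sort(np.ravel(forced_trends))
    # K_f(i): number of forced trends strictly smaller than b_o(i).
    K_f = np.searchsorted(pooled, obs_trends, side="left")
    return K_f / pooled.size

# Synthetic trends (assumption): one row per realization.
rng = np.random.default_rng(1)
forced = rng.normal(0.020, 0.010, size=(n_realizations, N_o))
obs = rng.normal(0.015, 0.010, size=N_o)
p_f = p_f_unweighted(obs, forced)
```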

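[66] As in the comparisons with control run trends, a ‘weighted’ form of *p*_{f}(*i*) can be calculated:

$$p_f(i)' = \frac{1}{N_{\mathrm{model}}} \sum_{j=1}^{N_{\mathrm{model}}} p_f(i, j), \qquad i = 1, \ldots, N_o \tag{A5}$$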

where *N*_{model} (the number of CMIP-3 models with 20CEN and A1B runs from which synthetic MSU temperatures could be calculated) = 20, and *p*_{f}(*i*, *j*) is defined in an analogous way to *p*_{c}(*i*, *j*) in equation (A3). Averaging the *N*_{o} individual values of *p*_{f}(*i*)′ yields $\overline{p_f}'$:

$$\overline{p_f}' = \frac{1}{N_o} \sum_{i=1}^{N_o} p_f(i)' \tag{A6}$$

with $\overline{p_c}'$ defined similarly.
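The weighted forced-trend calculation and its average over all observed trends might be sketched as follows. The function name and toy values are hypothetical, and each list entry is assumed to contain all trends from all realizations of one model.

```python
import numpy as np

def p_f_weighted(obs_trends, forced_trends_by_model):
    """Average over models of the per-model fraction of forced trends
    smaller than each observed trend."""
    obs = np.asarray(obs_trends, dtype=float)
    per_model = []
    for model_trends in forced_trends_by_model:
        b_f = np.sort(np.ravel(model_trends))
        # K_f(i, j): trends in model j strictly smaller than b_o(i).
        K_fj = np.searchsorted(b_f, obs, side="left")
        per_model.append(K_fj / b_f.size)
    return np.mean(per_model, axis=0)

# Toy values (assumptions): model 1 has two realizations, model 2 has one.
obs = np.array([0.10, 0.20])
forced = [np.array([[0.05, 0.15], [0.25, 0.30]]), np.array([0.15, 0.25])]
p_f_prime = p_f_weighted(obs, forced)   # one value per observed trend
p_f_prime_mean = p_f_prime.mean()       # average over the N_o observed trends
```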

[67] Our use of maximally overlapping trends has the advantage of reducing the impact of seasonal and interannual noise on estimates of the signal components of TLT trends, both in the observations and in the spliced 20CEN/A1B runs. However, it has the disadvantage of decreasing the statistical independence of trend samples.

[68] While non-independence of samples is an important issue in formal statistical significance testing, it is not a serious concern here. This is because our *p*_{c}(*i*)′ and *p*_{f}(*i*)′ values are not used as a basis for formal statistical tests. Instead, they simply provide useful information on whether observed TLT trends are unusually large relative to model-based estimates of unforced trends, or unusually small relative to model estimates of externally-forced trends. Note also that we process observed TLT data and model output in identical ways, with the same overlap between successive *L*-month trends – i.e., we are not generating fundamentally different temporal autocorrelation structure in the model and observational trend samples.

[69] The key point is that whether we employ overlapping or non-overlapping model trends has very little impact on estimates of *p*_{c}(*i*)′ or *p*_{f}(*i*)′. This suggests that the sample sizes of non-overlapping trends (in both the CMIP-3 control runs and the 20CEN/A1B runs) may be adequate for obtaining reasonable estimates of *p*_{c}(*i*)′ and *p*_{f}(*i*)′.

[70] However, because of the relatively short length of satellite temperature records, the use of non-overlapping observed TLT trends can have a large impact on both *p*_{c}(*i*)′ and *p*_{f}(*i*)′. For each observational TLT data set, the 1979 to 2010 analysis period contains three non-overlapping 10-year trends, two non-overlapping trends >10 years and ≤16 years, and only one non-overlapping trend >16 years and ≤32 years. As shown in Figure 3a of the main text, the use of non-overlapping time series segments does not adequately sample the impact of interannual variability on trends. This is why we focus primarily on *p*_{c}(*i*)′ and *p*_{f}(*i*)′ values calculated with overlapping *L*-month observed trends.

[71] The implicit assumption in all of our *p*-value calculations is that results from individual models are independent. This assumption is almost certainly unjustified [*Masson and Knutti*, 2011]. While it would be interesting to explore the sensitivity of trend consistency results to the selection of different subsets of “independent” CMIP-3 models, we do not perform such an analysis here. We suspect that the identification of “independent” model subsets may be sensitive to the variables, statistical procedures, and metrics used to assess inter-model dependencies.

##### A2.2. Calculation of Signal-to-Noise Ratios

[72] Two types of signal-to-noise ratio are shown in Figure 6. The first is the ‘observed’ signal-to-noise ratio, *R*_{o}, in Figure 6c:

$$R_o = \frac{\overline{b_o}}{s\{b_c\}} \tag{A7}$$

where $\overline{b_o}$ is the average of all overlapping *L*-month observed TLT trends, and *s*{*b*_{c}} is the standard deviation of the MMSD of overlapping *L*-month control run TLT trends. The model signal-to-noise ratio in Figure 6c, *R*_{f}, is defined similarly:

$$R_f = \frac{\overline{b_f}}{s\{b_c\}} \tag{A8}$$

where $\overline{b_f}$ is the MMA of the overlapping *L*-month TLT trends obtained from the 20CEN/A1B runs. Figure 6c shows *R*_{o} and *R*_{f} for 23 different values of *L* (120, 132, 144, …, 372, and 384 months).
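A signal-to-noise ratio of this kind, a mean trend divided by the standard deviation of control run trends, can be sketched as below. The function name and toy values are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def signal_to_noise(signal_trends, control_trends):
    """Mean of the signal trends divided by the standard deviation of
    the pooled control run trends."""
    # ddof=1 (sample standard deviation) is an assumption of this sketch.
    return np.mean(signal_trends) / np.std(np.ravel(control_trends), ddof=1)

# The 23 trend lengths for which ratios are shown (120 to 384 months, step 12).
L_values = list(range(120, 385, 12))

# Toy values (assumptions) for a single trend length.
obs_trends = np.array([0.1, 0.2, 0.3])
control_trends = np.array([-0.1, 0.0, 0.1])
R_o = signal_to_noise(obs_trends, control_trends)
```

The same function applies to the model ratio if the multi-model average forced trend is supplied as the signal.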

[73] Note that, in an analogous way to the calculation of unweighted and weighted *p*-values, unweighted and weighted forms of the MMA can be computed. The unweighted MMA is simply the arithmetic average of all available overlapping, *L*-month trends in the 20CEN/A1B runs. The weighted MMA (which is what we use here, and what we show in Figure 6a of the main text) is calculated by first computing (for the *j*th model) the average of all available overlapping, *L*-month trends in all realizations of the *j*th model's 20CEN/A1B runs, and then averaging these ensemble-mean ‘average’ trends. For the sample sizes of forced TLT trends available here, weighted and unweighted forms of the MMA yield very similar results.
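The difference between the two forms of the MMA can be seen in a small sketch. The function names and toy values are hypothetical; each list entry is assumed to be an array of shape (realizations, trends) for one model.

```python
import numpy as np

def mma_unweighted(trends_by_model):
    """Arithmetic average of all trends from all models and realizations."""
    return np.concatenate([np.ravel(m) for m in trends_by_model]).mean()

def mma_weighted(trends_by_model):
    """Average of per-model mean trends, so each model counts once."""
    return np.mean([np.mean(m) for m in trends_by_model])

# Toy values (assumptions): model A has three realizations, model B has one.
model_a = np.full((3, 2), 0.2)
model_b = np.full((1, 2), 0.1)
unweighted = mma_unweighted([model_a, model_b])
weighted = mma_weighted([model_a, model_b])
```

In this toy case the unweighted average (0.175) is pulled toward the model with more realizations, while the weighted average (0.15) treats both models equally.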