Many issues of interest in climate analysis involve comparisons of trends across different data sets. This note explains regression-based methods that yield asymptotically valid parameter variances and covariances while providing a flexible testing framework. Obtaining linear trend coefficients is easy using ordinary least squares (OLS). Obtaining unbiased estimates of the parameter variances and covariances (collectively referred to as the covariance matrix) is more challenging, because the regression residuals may be autocorrelated within each panel, and both heteroskedastic (unequal variance) and correlated across panels. Regressions that use sequenced groups of time series observations are referred to as panel estimators (Davidson and MacKinnon, 2002). They are convenient when panels are unbalanced, i.e. they do not all have the same numbers of observations, but they impose restrictions on the covariance matrix. A nonparametric method introduced by Vogelsang and Franses (2005) handles autocorrelation of unknown dimension; however, it is only applicable to balanced panels.

We explain both methods and the trade-offs between them. In Section 3, we apply them to a comparison of model temperature projections and observations in the tropical troposphere. We test trend significance as well as model-data equivalence. For discussions of the importance of modeling and climatological measurement issues related to the tropical atmosphere, see Karl et al. (2006), Santer et al. (2005, 2008), and Douglass et al. (2007).

2. Methods

2.1. Introduction: two-equation case

We assume that the data are stationary, though autocorrelated, upon detrending; in other words, ‘trend stationary.’ Suppose that there are two series of interest, y_{1τ} and y_{2τ}, where τ = 1, …, T. Trends are fitted using

y_{1τ} = a_{1} + b_{1}τ + u_{1τ}   (1)

and

y_{2τ} = a_{2} + b_{2}τ + u_{2τ}   (2)

A Student's t-test of slope equivalence is

t = (b̂_{1} − b̂_{2}) / √[var(b̂_{1}) + var(b̂_{2}) − 2cov(b̂_{1}, b̂_{2})]   (3)

where ∧ denotes an OLS estimate, var(b̂_{i}) (i = 1, 2) denotes an autocorrelation-robust variance estimator for b̂_{i}, and cov(b̂_{1}, b̂_{2}) is the estimated covariance between the trend terms.
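The role of the covariance term in a slope-equivalence score can be illustrated with a minimal sketch. The function names `ols_trend` and `slope_difference_t` are hypothetical, and the variances and covariance passed in are assumed to come from whatever autocorrelation-robust estimator the analyst has chosen; the sketch does not compute them.

```python
import numpy as np

def ols_trend(y):
    """Fit y on an intercept and tau = 1..T by OLS; return (intercept, slope, residuals)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    tau = np.arange(1, T + 1, dtype=float)
    X = np.column_stack([np.ones(T), tau])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef[0], coef[1], y - X @ coef

def slope_difference_t(b1, b2, var1, var2, cov12):
    """Score for H0: b1 = b2, using Var(b1 - b2) = var1 + var2 - 2*cov12."""
    return (b1 - b2) / np.sqrt(var1 + var2 - 2.0 * cov12)

# Illustrative (made-up) inputs: a positive covariance between the two
# slope estimates shrinks the variance of their difference.
t_no_cov = slope_difference_t(0.3, 0.1, 0.01, 0.01, 0.0)
t_with_cov = slope_difference_t(0.3, 0.1, 0.01, 0.01, 0.005)
```

With these illustrative inputs, omitting a positive covariance understates the score, which is why the covariance term matters when the two series respond to common forcings.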

Karl et al. (2006) drew attention to an apparent discrepancy between observed and model-generated temperature trends in the tropical atmosphere. Douglass et al. (2007) tested surface-matched differences (Supporting Information) using

(4)

where b̂_{1} denotes the trend through model ensemble means, b̂_{2} denotes the trend through observations, and s̃_{1} is the estimated standard error of b̂_{1}. The test (4) incorrectly treats the observations as deterministic and assumes the model observations are independent across time. Santer et al. (2008) instead used

(5)

where ∼ denotes a least-squares estimate and r_{i} denotes the first-order autoregressive (AR1) coefficient in series i. The ratio of AR1 terms is commonly referred to as an ‘effective degrees of freedom’ adjustment (Santer et al., 2000). Instead of a series providing T independent observations, it is said to provide only (1 − r_{i})T/(1 + r_{i}) independent observations. The resulting variance corresponds to an estimate obtained using an AR1 model, but is not equivalent to that derived from higher order autocorrelation models. In addition, it does not yield a correct 2cov(b̂_{1}, b̂_{2}) term (Supporting Information), which was missing in both Equations (4) and (5). While detrended climate model projections may be uncorrelated with observations, the assumption of no covariance among trend coefficients implies that models have no low-frequency correspondence with observations in response to observed forcings, which seems overly pessimistic.
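The effective-sample-size arithmetic described above can be sketched as follows. The helper names are hypothetical, and the lag-1 coefficient is computed with a simple moment estimator, which is an assumption rather than the estimator used in the cited studies.

```python
import numpy as np

def lag1_autocorr(x):
    """Moment estimate of the lag-1 autocorrelation of a (detrended) series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

def effective_sample_size(T, r):
    """Effective number of independent observations: T(1 - r)/(1 + r)."""
    return T * (1.0 - r) / (1.0 + r)

# Illustrative values: 372 monthly observations with r = 0.9.
n_eff = effective_sample_size(372, 0.9)
```

For a strongly persistent series the adjustment is large: with r = 0.9, a 372-month record is treated as carrying fewer than 20 independent observations.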

2.2. Panel regressions

Equation (3) can be obtained using a panel regression. Suppose that the dependent variable is the stacked vector (y_{1}, y_{2})′, and we estimate the following equation:

y = a_{1}(1 1)′ + d_{1}(0 1)′ + b_{1}(τ τ)′ + d_{2}(0 τ)′ + e   (6)

(1 1)′ denotes two stacked T-length vectors of ones, and (0 1)′ denotes a vector of T zeros stacked on T ones. The latter is called an indicator or ‘dummy variable,’ since it indicates (by taking the value 1) that the dependent variable is y_{2}. (τ τ)′ denotes a 2T-length vector consisting of two stacked T-length time trends, and (0 τ)′ is the element-wise product of (τ τ)′ and (0 1)′. A test of d̂_{2} = 0 in Equation (6) can be shown to be equivalent to testing b̂_{1} = b̂_{2} (Kmenta 1986; Supporting Information). Hence, the t-statistic on d̂_{2} in Equation (6) yields the test score (3).
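The equivalence between the dummy-interaction coefficient and the difference of the two separately fitted slopes can be checked numerically. The following is a sketch on simulated data; the series, seed, and the column ordering of the design matrix are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 120
tau = np.arange(1, T + 1, dtype=float)
# Two simulated series with different trends (illustrative values only).
y1 = 0.5 + 0.02 * tau + rng.normal(scale=0.3, size=T)
y2 = -0.2 + 0.01 * tau + rng.normal(scale=0.3, size=T)

def slope(y):
    """OLS slope of y on an intercept and tau."""
    X = np.column_stack([np.ones(T), tau])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

ones, zeros = np.ones(T), np.zeros(T)
# Columns: (1 1)', (0 1)', (tau tau)', (0 tau)'
X = np.column_stack([
    np.concatenate([ones, ones]),
    np.concatenate([zeros, ones]),
    np.concatenate([tau, tau]),
    np.concatenate([zeros, tau]),
])
y = np.concatenate([y1, y2])
coef = np.linalg.lstsq(X, y, rcond=None)[0]

d2_hat = coef[3]                    # coefficient on (0 tau)'
slope_gap = slope(y2) - slope(y1)   # difference of the separate OLS slopes
```

Because the stacked regression is fully interacted, `d2_hat` reproduces `slope_gap` up to floating-point error, so a test on that single coefficient is a test of slope equality.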

To generalize the framework further, suppose that we are comparing m model-generated series and o observational series, making the total number of series N = m + o. Each source i yields T_{i} ≤ T nonmissing observations y_{iτ} over the interval τ = 1, …, T. Define an indicator variable obs_{iτ} = 0 if the record is model generated, and obs_{iτ} = 1 if it is from an observational series. Denote the ith vector as y′_{i} = [y_{i1}, …, y_{iT}]. Stack these vectors into a single NT × 1 vector y as follows:

y = [y′_{1}, y′_{2}, …, y′_{N}]′   (7)

Stack the trend vector τ′ = [1, …, T] N times to get the NT × 1 panel trend vector

(8)

The indicator, or the dummy, variables are likewise stacked to form

(9)

where obs_{i} is (obs_{i1}, …,obs_{iT})′. The regression equation is then written as

(10)

where e is an NT × 1 residual vector with typical element e_{iτ}. Note that all the ‘data’ are on the left-hand side, and the right-hand side consists of dummy variables and trend terms.

When obs_{iτ} = 0, dy_{iτ}/dτ = b̂_{1}, and when obs_{iτ} = 1, dy_{iτ}/dτ = b̂_{1} + b̂_{2}. Thus, a t-statistic on b̂_{1} will test whether the model trend is zero and a test of the linear restriction b̂_{1} + b̂_{2} = 0 indicates the significance of the observed slope. The t-statistic on b̂_{2} tests whether the trend on observations differs significantly from the trend in models.
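A stacked regression of this kind can be sketched on simulated balanced panels. The column ordering used here (intercept, trend, obs × trend, obs) is an assumption, as are the helper name `stack_panels` and the simulated series; only the interpretation of the trend and interaction coefficients is taken from the text.

```python
import numpy as np

def stack_panels(series, is_obs):
    """Stack equal-length series and build [1, tau, obs*tau, obs] regressors."""
    T = len(series[0])
    tau = np.arange(1, T + 1, dtype=float)
    y = np.concatenate(series)
    trend = np.tile(tau, len(series))             # tau repeated N times
    obs = np.repeat(np.asarray(is_obs, float), T) # 0 for models, 1 for observations
    X = np.column_stack([np.ones_like(trend), trend, obs * trend, obs])
    return y, X

rng = np.random.default_rng(1)
T = 60
tau = np.arange(1, T + 1, dtype=float)
models = [0.03 * tau + rng.normal(scale=0.2, size=T) for _ in range(3)]
observed = [0.01 * tau + rng.normal(scale=0.2, size=T) for _ in range(2)]

y, X = stack_panels(models + observed, [0, 0, 0, 1, 1])
b = np.linalg.lstsq(X, y, rcond=None)[0]
model_trend = b[1]         # slope when obs = 0
obs_trend = b[1] + b[2]    # slope when obs = 1
```

With balanced panels, the pooled model slope equals the average of the individual model slopes, and likewise for the observational group, so the interaction coefficient `b[2]` measures the gap between the two group trends.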

Equation (10) can be extended further. Suppose that observations come from two different systems, such as satellites and weather balloons. Define two different indicator variables: d_{1}, which is equal to 1 if an observation is from either system 1 or 2, and d_{2} that is equal to 1 only if the observation is from system 2. The regression equation becomes

(11)

The estimated model trend is b̂_{1}. The trend in observations from system 1 is b̂_{1} + b̂_{2} and from system 2 is b̂_{1} + b̂_{2} + b̂_{4}. The t-statistic on b̂_{4} tests whether the trend in the second observation system differs from that in the first, and so forth.

Hypothesis testing requires a valid estimator of V(b), the covariance matrix of b. The general form is (Davidson and MacKinnon 2002)

(12)

where X denotes the right-hand side variables in Equation (11) and Ω = E(ee′). Obtaining a valid estimate of Ω requires modeling the cross- and within-panel covariances. For a panel i with T observations, define a matrix A_{i} of AR weights using the panel-specific AR1 coefficient ρ_{i}:

(13)

Then a model of Ω can be written as

(14)

where σ_{ij} denotes the covariance between series i and j, I_{i} denotes an identity matrix with dimension T, and σ_{i}^{2} denotes the variance of series i. There are N(N − 1)/2 covariances in Equation (14) that need to be estimated, in addition to the variances and AR1 parameters. If some panels j are shorter than others (T_{j} < T), then the dimensions of the A_{i} matrices need to be adjusted accordingly. Some commercial statistical packages, such as STATA, can accommodate unbalanced data sets.
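How an AR1 error structure feeds into the coefficient covariance matrix can be sketched for a single panel. Two assumptions are made here: the covariance matrix of the OLS coefficients takes the usual sandwich form (X′X)^{-1} X′ΩX (X′X)^{-1}, and the AR1 weights take the Toeplitz form ρ^{|s−t|}. Neither is guaranteed to match Equations (12) to (14) exactly, and the function names are hypothetical.

```python
import numpy as np

def ar1_matrix(T, rho):
    """T x T matrix with (s, t) element rho**|s - t| (assumed AR1 weight form)."""
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sandwich_cov(X, omega):
    """OLS coefficient covariance (X'X)^-1 X' Omega X (X'X)^-1."""
    bread = np.linalg.inv(X.T @ X)
    return bread @ X.T @ omega @ X @ bread

T = 50
tau = np.arange(1, T + 1, dtype=float)
X = np.column_stack([np.ones(T), tau])
sigma2 = 2.0  # illustrative error variance

V_iid = sandwich_cov(X, sigma2 * ar1_matrix(T, 0.0))  # no autocorrelation
V_ar1 = sandwich_cov(X, sigma2 * ar1_matrix(T, 0.8))  # strong persistence
```

With ρ = 0 the sandwich collapses to the textbook σ²(X′X)^{-1}; with positive ρ the trend variance in `V_ar1[1, 1]` is larger, which is why ignoring autocorrelation overstates trend significance.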

2.3. Higher order autocorrelations and multivariate trend models

Vogelsang and Franses (2005, herein VF05) derived two estimators for Ω that impose no parametric restrictions on the lag and correlation structure, as is done in Equation (14). Suppose that the N panels are used one at a time in Equation (1), yielding OLS trend estimates b̂ = b̂_{1}, …, b̂_{N}. Take the N residual series u_{1τ}, …, u_{Nτ} and form the T × N matrix U = [u_{1τ}, …, u_{Nτ}]. VF05 derive two transformations of U that converge in probability to a scalar multiple of Ω. Of their two estimators, we focus on the F2 form, which has higher power and is slightly easier to compute. It is obtained as follows. Denote V = U′ and take the columns v_{j}, for j = 1, …, T, each of length N. Define the vector of partial sums S_{j} = v_{1} + … + v_{j}. Then, VF05 show that

(15)

converges in probability to an unbiased estimate of Ω, regardless of the form of autocorrelation and other departures from the independence assumption. For testing purposes, linear restrictions on the slopes can be written in the matrix form Rb̂ = 0 (Supporting Information). The VF05 test statistic is

(16)

where η = Σ(t − t̄)^{2} and q is the number of restrictions, which in our examples is always equal to 1. Critical values for Equation (16) generated by Monte Carlo simulation are reported in VF05.
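The computation can be sketched in a few lines. The formulas used here are assumptions rather than transcriptions of Equations (15) and (16): the covariance estimate is taken to be 2T^{-2} Σ S_{j}S_{j}′ built from partial sums of the residual vectors, and the statistic is taken to be (Rb̂)′[η^{-1} R Ω̂ R′]^{-1}(Rb̂)/q. The function name `vf_statistic` is hypothetical, and the result should be checked against VF05 before use.

```python
import numpy as np

def vf_statistic(Y, R):
    """Y: T x N balanced panel; R: q x N restriction matrix for H0: R b = 0."""
    T, N = Y.shape
    tau = np.arange(1, T + 1, dtype=float)
    X = np.column_stack([np.ones(T), tau])
    coef = np.linalg.lstsq(X, Y, rcond=None)[0]  # 2 x N: intercepts, slopes
    b = coef[1]                                  # one OLS trend per series
    U = Y - X @ coef                             # T x N residual matrix
    S = np.cumsum(U, axis=0)                     # row j holds the partial sum S_j'
    omega = 2.0 * (S.T @ S) / T**2               # assumed form of the estimator
    eta = np.sum((tau - tau.mean()) ** 2)
    Rb = R @ b
    q = R.shape[0]
    stat = Rb @ np.linalg.inv(R @ omega @ R.T / eta) @ Rb / q
    return float(stat), b, omega

# Illustrative use on two simulated series: test equality of their trends.
rng = np.random.default_rng(3)
T = 200
tau = np.arange(1, T + 1, dtype=float)
Y = np.column_stack([0.02 * tau + rng.normal(size=T),
                     0.01 * tau + rng.normal(size=T)])
R = np.array([[1.0, -1.0]])
stat, b_hat, omega_hat = vf_statistic(Y, R)
```

The resulting score is compared with the simulated critical values tabulated in VF05 rather than with standard F tables. No bandwidth or lag length has to be chosen in this sketch, which is the practical appeal of the approach.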

The VF05 approach improves on the panel method by providing robust trend variances and covariances regardless of the autocorrelation order and the structure of heteroskedasticity. However, it requires balanced panels, which can be a limitation in some cases.

The VF05 statistic, as with all test statistics, has improved size as the sample size increases. Rejection probabilities also increase as ρ → 1. Monte Carlo simulations in VF05 show that for T = 100, when q = 1 and ρ > 0.8, just under 10% of scores exceed the 95th percentile, indicating a tendency to over-reject a true null, although this is an improvement compared to earlier alternatives. Each panel in our full sample has well over 100 observations, but a high ρ value. Hence, VF05 scores that are close to the critical values may overstate significance.

3. Empirical application

3.1. Data

We used the same archive of climate model simulations as used by Santer et al. (2008). The available group now includes 57 runs from 23 models. Each source provides data for both the lower troposphere (LT) and mid-troposphere (MT). Each model uses prescribed forcing inputs up to the end of the twentieth century climate experiment (20C3M; Santer et al., 2005). Projections forward use the A1B emission scenario. Table I lists the models, the number of runs in each ensemble mean, and other details. We used four observational temperature series: two satellite-borne microwave sounding unit (MSU)-derived series and two balloon-borne radiosonde series. We use monthly data starting in 1979, covering the tropics from 20°N to 20°S. The MSU observations come from the University of Alabama-Huntsville (UAH; Spencer and Christy, 1990) and Remote Sensing Systems Inc. (RSS; Mears et al., 2003). The HadAT radiosonde series is an MSU-equivalent published on the Hadley Centre web site (http://hadobs.metoffice.com/hadat/msu_equivalents.html; Thorne et al., 2005). The Radiosonde Innovation Composite Homogenization (RICH) series is published by Haimberger et al. (2008) and is available at ftp://srvx6.img.univie.ac.at/pub/rich_gridded_2009.nc. We used the RICH-gridded data and MSU weights supplied by John Christy (personal communication) to construct MSU-equivalent series (see Supporting Information for details).

Table I. Summary of data series

Panel | Model/obs name | Extra forcings | No. of runs | LT trend (SD) | MT trend (SD) | AR coeffs LT/MT

Each row refers to model ensemble mean (rows 1–23) or observational series (rows 24–27). All models forced with twentieth century greenhouse gases and direct sulfate effects. Rows 10, 11, 19, 22, and 23 also include indirect sulfate effects. ‘Extra forcings’ column indicates which models included other forcing: ozone depletion (O), solar changes (SO), land use (LU), and volcanic eruptions (V). NA: information not supplied to PCMDI. ‘No. of runs’ indicates the number of individual realizations in the ensemble mean. LT and MT trends based on linear regression allowing six AR terms. Standard errors in brackets. AR coeffs: the AR lags that were significant (p < 0.05) for LT/MT layers, respectively.

Our data start in January 1979 and end in December 2009. Thus, we have N = 27 panels, each with 372 monthly observations. Figure 1 displays the (smoothed) MSU series and the mean of the PCM model runs for comparison.

Douglass et al. (2007) and Santer et al. (2008) focused on trends from 1979 to about 1999, with some series extending a few years further. To compare with these results, we first look at data ending in 1999, and then extend the sample to 2009. Since our panels are balanced, we can generate results using both the VF05 and panel regression methods, but since the results are so similar, we report only the VF05 results for the shorter 1979–1999 sample.

Table I summarizes the data. The 1979–2009 trends in °C per decade are shown for the LT and MT levels, with accompanying standard errors, for all ensemble means and observational series. Each series was centered and the trend regression allowed for a six-lag AR process, denoted as AR6. Table I (final column) shows that in 17 of the 23 models and in all 4 observational series, autocorrelation at lags greater than one was observed in at least one atmospheric layer. Hence, an AR1 error specification is likely inadequate. Extended autocorrelation lags were also observed in the individual model runs.
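One simple way to look for autocorrelation beyond lag one is sketched below. It uses a two-step stand-in (detrend by OLS, then regress the residuals on their own first p lags), which is an assumption and not necessarily the estimation procedure behind Table I; the function name `fit_ar_on_detrended` is hypothetical.

```python
import numpy as np

def fit_ar_on_detrended(y, p=6):
    """Detrend y by OLS, then fit an AR(p) to the residuals by least squares."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    tau = np.arange(1, T + 1, dtype=float)
    X = np.column_stack([np.ones(T), tau])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    # Column k holds the residual series lagged by k + 1 periods.
    lags = np.column_stack([resid[p - k - 1:T - k - 1] for k in range(p)])
    target = resid[p:]
    return np.linalg.lstsq(lags, target, rcond=None)[0]

# Illustrative check on a simulated AR1 series with a small trend.
rng = np.random.default_rng(2)
n = 5000
shocks = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + shocks[t]
phi = fit_ar_on_detrended(0.001 * np.arange(n) + x, p=6)
```

For a true AR1 process, the first coefficient should be close to the persistence parameter and the higher lags close to zero; sizeable coefficients at lags two to six would point to the kind of higher order structure reported in Table I.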

All climate models were forced with twentieth century greenhouse gas and sulfate levels; other assumed forcings are listed in Table I.

3.2. Multivariate trend test results

We weighted each model by the number of runs in its ensemble to adjust for the effect of combining runs into an average, although our conclusions would be unchanged if we weighted each model equally.

Table II presents tests of trend significance for the observational series. On data ending in 1999, the VF05 test shows that the trends in the four observational series are insignificant at both the LT and MT layers, individually and averaged together (column ‘Obs’). By extending the data to 2009, the score of combined significance at the LT layer rises from 12.50 to 76.66, thus attaining significance at 5%. All observed LT series are individually significant, except UAH, which is significant at 10%. At the MT layer, extending the sample raises the combined score from 5.06 to 23.77, which is significant at 10%. The UAH and Hadley series are insignificant, RICH is marginal, and RSS is individually significant at 5%.

Table II. Trend significance tests using nonparametric covariance estimator on balanced panels and panel regression on unbalanced panels

Tests of trend significance | Obs | MSU | UAH | RSS | BAL | HAD | RICH | Models

VF method: Shown are Vogelsang and Franses (2005) test scores. The 90% critical value is 20.14, 95% critical value is 41.53, and 99% critical value is 83.96. Panel method refers to panel regression results. Shown are the trend in °C per decade, the standard error of the trend, and the p value of a test of H0: trend = 0. See text for discussion of column groupings. Headings: Obs, average of all observational series; MSU, combined satellite record; UAH, University of Alabama-Huntsville; RSS, remote sensing systems; BAL, combined balloon (radiosonde) series; HAD, HadAT balloon series; RICH, Haimberger balloon series; Models, average of 23 ensemble means.
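Reading a VF05 score against the critical values quoted in the table note can be sketched as a small lookup. The function name `vf_significance` is hypothetical; the thresholds are the 90%, 95%, and 99% critical values stated above, and the example scores are the combined LT and MT scores reported in the text.

```python
# Critical values for q = 1 as quoted in the note to Table II.
VF_CRITICAL = [("1%", 83.96), ("5%", 41.53), ("10%", 20.14)]

def vf_significance(score):
    """Return the strictest significance level whose critical value is exceeded."""
    for level, critical in VF_CRITICAL:
        if score > critical:
            return level
    return "not significant"

# Combined scores reported in the text for samples ending in 1999 and 2009.
lt_1999, lt_2009 = vf_significance(12.50), vf_significance(76.66)
mt_1999, mt_2009 = vf_significance(5.06), vf_significance(23.77)
```

This reproduces the reading in the text: the combined LT score becomes significant at 5% once the sample is extended to 2009, while the combined MT score reaches only the 10% level.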

Trend comparison results are listed in Table III. The second column (‘Obs’) shows that at both the LT and MT layers, on data ending in 1999, the difference between models and observations is only marginally significant, echoing the findings of Santer et al. (2008). However, with the addition of another decade of data the results change, such that the differences between models and observations now exceed the 99% critical value. As shown in Table I and Section 3.3, the model trends are about twice as large as observations in the LT layer, and about four times as large in the MT layer.

Table III. Trend difference tests using nonparametric covariance estimator on balanced panels and panel regression on unbalanced panels

Tests of difference from models | Obs | MSU | UAH | RSS | BAL | RSS versus UAH | BAL versus MSU

VF group results: Vogelsang and Franses (2005) F2 test scores. The 90% critical value is 20.14, 95% critical value is 41.53, and 99% critical value is 83.96. Panel (p) refers to panel regression results. Shown are the p values of a test of whether indicated trend difference = 0. See text for discussion of column groupings. For description of headings, see footnote of Table II.

At both the LT and MT layers, on data ending in either 1999 or 2009, the VF05 tests show that the balloon data are not significantly different from the MSU data, but within the satellite category, the RSS and UAH data are significantly different. Possible reasons for RSS/UAH differences include treatment of intersatellite calibration, orbital decay, and other processing issues (Santer et al., 2005; Karl et al., 2006; Christy and Norris, 2009).

3.3. Panel regression tests

In cases where one or more series is not of full length, the VF05 test will not work. The panel-corrected standard error estimator in the STATA program (command xtpcse) allows an unbalanced panel in the estimate of Equation (14); however, it imposes an AR1 assumption. For comparison purposes, we report these results on data ending in 2009. We again weighted each observation by the number of runs in the ensemble mean. None of the conclusions depend on this step.

In Table II, the panel estimator at the LT layer shows that the observations as a group (column 2) exhibit a significant trend of 0.110 °C per decade, compared to a model trend (column 9) of 0.272 °C per decade. The balloon and MSU series are each jointly significant (p = 0.026 and 0.042, respectively). In the MT layer, the model trend (0.253 °C per decade) remains significant. The mean observed trend is only 0.057 °C per decade. The panel-estimated standard error implies that it is insignificant (p = 0.272), while the VF05 score implies significance at 10%. Among observational series only RSS is individually significant, echoing the VF05 results. The MSU and balloon series are each jointly insignificant. Figures 2 and 3 show the trend magnitudes.

In Table III, the p values of the test scores on a hypothesis of equality between the indicated trends are shown in the bottom row. On data ending in 2009, the trend differences between models and observations (column 2) are significant in both the LT (p = 0.002) and MT (p = 0.000) layers, as was the case with the VF05 tests. The model-observation difference is significant for all data products at both layers, except for the RSS series in the LT layer (p = 0.059).

In the last columns of Table III, we test the differences among the observational series. As was the case with the VF05 tests, the balloons and MSU series are not significantly different from each other (p = 0.880), but within the MSU category, the RSS and UAH series are significantly different (p = 0.000).

4. Discussion and conclusions

Econometric tools are increasingly being used for climate data sets (Fomby and Vogelsang, 2002; Mills, 2010). We present two econometric methods for trend comparisons between data sets. Both add flexibility for multivariate comparisons and provide improved treatment of complex error structures. The multivariate testing method of Vogelsang and Franses (2005) yields a more robust estimator of the covariance matrix, but requires balanced data panels. Panel regression methods can accommodate comparisons of series of unequal lengths, but software limitations typically restrict treatment of within-panel autocorrelation to the AR1 case. In our example, the two methods yielded similar conclusions, indicating that the AR1 approximation in the panel model was likely not overly restrictive. In general, however, for the purpose of multivariate trend comparisons in climatology, we particularly recommend that the VF05 method enter the empirical toolkit.

In our example on temperatures in the tropical troposphere, on data ending in 1999, we find the trend differences between models and observations are only marginally significant, partially confirming the view of Santer et al. (2008) against Douglass et al. (2007). The observed temperature trends themselves are statistically insignificant. Over the 1979–2009 interval, in the LT layer, observed trends are jointly significant and three of four data sets have individually significant trends. In the MT layer, two of four data sets have individually significant trends and the trends are jointly insignificant or marginal depending on the test used. Over the interval 1979–2009, model-projected temperature trends are two to four times larger than observed trends in both the LT and MT and the differences are statistically significant at the 99% level.

Our methods assume that the trends are linear. We found no evidence for nonlinearity in the observed data, but some in the modeled data in the MT. In addition, the fact that the results are sensitive to the end date suggests that they might also be sensitive to the start date. Since the satellite data are unavailable prior to 1979, we cannot extend these series earlier. Interpretation of trend comparisons should, therefore, make reference to the time period analyzed, which, ideally, should have some intrinsic interest. In this case, the 1979–2009 interval is a 31-year span during which the upward trend in surface data strongly suggests a climate-scale warming process. As noted in the studies cited in Section 1, comparing models to observations in the tropical troposphere is an important aspect of testing explanations of the origins of surface warming.