
Keywords:

  • autocorrelation;
  • trend estimation;
  • mean shift;
  • HAC methods;
  • climate models

Abstract

  1. Top of page
  2. Abstract
  3. INTRODUCTION
  4. BASIC SET-UP WITH NO SHIFT OR KNOWN SHIFT DATE
  5. TREATING THE SHIFT DATE AS UNKNOWN
  6. TESTING FOR A LEVEL SHIFT IN A UNIVARIATE TIME SERIES
  7. FINITE SAMPLE PERFORMANCE OF THE VF STATISTIC
  8. APPLICATION: DATA AND METHODS
  9. CONCLUSIONS
  10. Acknowledgements
  11. REFERENCES
  12. Supporting Information

Comparisons of trends across climatic data sets are complicated by the presence of serial correlation and possible step-changes in the mean. We build on heteroskedasticity and autocorrelation robust methods, specifically the Vogelsang–Franses (VF) nonparametric testing approach, to allow for a step-change in the mean (level shift) at a known or unknown date. The VF method provides a powerful multivariate trend estimator robust to unknown serial correlation up to but not including unit roots. We show that the critical values change when the level shift occurs at a known or unknown date. We derive an asymptotic approximation that can be used to simulate critical values, and we outline a simple bootstrap procedure that generates valid critical values and p-values. Our application builds on the literature comparing simulated and observed trends in the tropical lower troposphere and mid-troposphere since 1958. The method identifies a shift in observations around 1977, coinciding with the Pacific Climate Shift. Allowing for a level shift causes apparently significant observed trends to become statistically insignificant. Model overestimation of warming is significant whether or not we account for a level shift, although null rejections are much stronger when the level shift is included. © 2014 The Authors. Environmetrics published by John Wiley & Sons, Ltd.

INTRODUCTION


Many empirical applications involve comparisons of linear trend magnitudes across different time series with autocorrelation and/or heteroskedasticity of unknown form. Vogelsang and Franses (2005, herein VF) derived a class of heteroskedasticity and autocorrelation robust (HAC) tests for this purpose. The VF statistic is similar in form to the familiar regression F-type statistics but remains valid under serial dependence up to but not including unit roots in the time series. For treatments of the theory behind HAC estimation and inference, see Andrews (1991), Kiefer and Vogelsang (2005), Newey and West (1987), Sun et al. (2008), and White and Domowitz (1984) among others. Like many HAC approaches, the VF approach is nonparametric with respect to the serial dependence structure and does not require a specific model of serial correlation or heteroskedasticity to be implemented. Unlike most nonparametric approaches, the VF approach avoids sensitivity to bandwidth selection by setting the bandwidth equal to the entire sample.

Here, we extend the VF approach to the case in which one or more of the series has a possible level shift. Our assumption throughout is that a researcher considers a one-time level shift as a fundamentally different process than a continuous trend. Consequently, if the null hypothesis is that two series have the same trend and one series exhibits a trend while the other exhibits a level shift and no trend, a rejection of the null would be considered valid because the two phenomena are distinct and a prediction of one is not confirmed by observing the other.

Accounting for level shifts does not necessarily increase the likelihood of rejecting a null of trend equivalence. In the top panel of Figure 1, a comparison of the linear trend coefficients would suggest they are similar, but clearly, y1 differs from y2 in that the former is steadily trending while the latter is trendless with a single discrete level shift at the break point Tb. By contrast, in the bottom panel, a failure to account for the shift would overstate the difference between the trend slopes. In each case, the influence of the shift term is highlighted by the fact that if the trend slope comparisons were conducted over the pre-shift or post-shift intervals, they might indicate opposite results to those based on the entire sample (with the shift term omitted).

[Figure 1. Schematics of two series to be compared]

The basic linear trend model is written as

  y_{it} = a_i + b_i t + u_{it}    (1)

where i = 1,…,n denotes a particular time series and t = 1,…,T denotes the time period. The random part of yit is given by uit, which is assumed to be covariance stationary (in which case yit is labeled a trend stationary series, that is, stationary around a linear time trend, if one is present). For a series of length T, we parameterize the break point by denoting the fraction of the sample occurring before it as λ = Tb/T.

The following issues must be addressed in order to derive an HAC robust trend comparison test in the presence of a possible level shift. (i) If λ is known, and specifically is known to be in the (0, 1) interval, the VF test statistic can be generalized, as we show in Section 'BASIC SET-UP WITH NO SHIFT OR KNOWN SHIFT DATE', but the distribution depends on λ and the critical values change. It will turn out that the form of the VF statistic and its critical values are the same whether one is testing hypotheses involving the trend coefficients or other parameters in the trend function. (ii) If λ is unknown, it must be estimated along with the magnitude of the associated shift term. But this gives rise to a problem of nonidentification if we want to allow for the possibility that the true value of the level shift parameter is zero.

The regression model with level shift takes the form

  y_{it} = a_i + b_i t + g_i DU_t(λ) + u_{it}    (2)

where the dummy variable DUt(λ) = 0 if t ≤ λT and 1 otherwise (we will typically suppress the λ argument where it is convenient to do so). Hence, for series i, estimation of (2) by ordinary least squares (OLS) yields an estimated intercept of â_i up to Tb and â_i + ĝ_i thereafter. In our empirical application, we are primarily interested in testing hypotheses about the trend slope parameters, bi, while controlling for the possibility of a level shift. If it is reasonable to view λ as known, then inference about the trend slopes will proceed in a straightforward way with DUt(λ) included in the model, even in the case where the true value of gi is zero. However, if it is more reasonable to treat λ as unknown and we want to be robust to the possibility that gi is nonzero, then inference about the trend slopes (bi) becomes more delicate because λ is not identified when gi is zero. We would face a similar identification problem if we wanted to test the null hypothesis that gi itself is zero when λ is unknown.
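To make the set-up concrete, the following sketch fits model (2) by OLS for a known break fraction. This is illustrative Python with our own variable names; the series and break point are simulated, not the paper's data.

```python
import numpy as np

# Sketch (our code, not the authors'): OLS fit of model (2),
#   y_t = a + b*t + g*DU_t(lambda) + u_t,
# for a known break fraction lam.
def fit_trend_with_shift(y, lam):
    """Return OLS estimates (a, b, g) and the residuals for model (2)."""
    T = len(y)
    t = np.arange(1, T + 1)
    DU = (t > np.floor(lam * T)).astype(float)   # 0 up to Tb, 1 afterwards
    X = np.column_stack([np.ones(T), t, DU])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef[0], coef[1], coef[2], y - X @ coef

# Simulated example: a trendless series with a pure level shift at Tb = 200,
# as in the top panel of Figure 1
rng = np.random.default_rng(0)
T = 500
t = np.arange(1, T + 1)
y = 1.0 + 2.0 * (t > 200) + 0.5 * rng.normal(size=T)
a_hat, b_hat, g_hat, resid = fit_trend_with_shift(y, 0.4)
```

With the shift dummy included, the estimated trend slope for such a series is close to zero, whereas omitting the dummy would load the jump onto the trend term.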

There is now a well-established literature in statistics and econometrics for carrying out inference where a parameter is not identified under the null hypothesis but is identified under the alternative hypothesis. See for example Davies (1987), Andrews (1993), Andrews and Ploberger (1994), and Hansen (1996) among others. One solution to this identification problem involves the use of a supremum function, which is akin to a data-mining approach. In the present case, we can compute the VF statistic for equality of trends for a range of λ allowed to vary across (0, 1) and find the largest VF statistic, the sup-VF statistic. To be robust to the possibility that there are no level shifts in the data, that is, to be robust to the critique that the date of the level shift was chosen to data-mine an outcome for the equality of trends test, we work out the null distribution of the sup-VF statistic for equality of trends for the case where gi = 0. This yields a trend equality test that is very robust to the possibility and location of potential level shifts.

Although our focus is on the problem of trend inference allowing for possible unknown level shifts, our extension of the VF approach is general enough to include tests of the null hypothesis of no level shift, and we report some limited results in the paper for these tests. A potential application of tests for a level shift is the homogenization of weather data. Many long observational records are believed to have been affected by possible equipment and/or sampling changes, changes to the area around monitoring locations, and so forth (see Hansen et al., 1999; Brohan et al., 2006 for examples in the land record; Folland and Parker, 1995; Thompson et al., 2008 for examples in sea surface data). A typical method for detecting and removing level shifts is to construct a reference series that is not expected to exhibit the discontinuity, such as the mean of other weather station records in the vicinity, and then look for one or more jumps in a record relative to its reference series.

While the application of the VF approach to testing for a level shift is potentially quite useful in many empirical settings, the problem of testing for a level shift in a trending series with a known or unknown shift date has already received some attention in the econometrics literature (Vogelsang (1997) and Sayginsoy and Vogelsang (2011)) and the empirical climate literature (see Gallagher et al. (2013) and references therein). Each proposed method has inherent strengths and weaknesses. A complete comparison of the VF approach to existing tests for a level shift would be a substantial undertaking and is beyond the scope of this paper, but we draw some contrasts in Sections 'TESTING FOR A LEVEL SHIFT IN A UNIVARIATE TIME SERIES' and 'FINITE SAMPLE PERFORMANCE OF THE VF STATISTIC'.

The question of whether or not a level shift is present in trending data can strongly affect the resulting trend calculations and tests of equality of trend slopes. If a change point λ is known, the analysis in Section 'BASIC SET-UP WITH NO SHIFT OR KNOWN SHIFT DATE' applies, and if a change point is suspected but the date is unknown, the analysis in Section 'TREATING THE SHIFT DATE AS UNKNOWN' applies. In our application, we focus on the case where there is at most one level shift in each series. In other applications, such as those involving very long weather series, one might suspect there are multiple shifts. If they occur at known dates, then our extension of the VF approach is general enough to apply. However, should it be more reasonable to model the shift dates as unknown, or should there be uncertainty regarding the number of shifts, the analysis becomes greatly complicated, especially from a computational perspective, and is beyond the scope of this paper. In addition, if one thinks level shifts occur frequently and with randomness, then there is the additional difficulty that the range of possible specifications could, in principle, include the case in which the level changes by a random amount at each time, which is equivalent to having a random walk, or unit root, component in uit. If yit has a unit root component, inference in models (1) and (2) becomes more complicated. More importantly, it is difficult to give a physical interpretation to a unit root component of a temperature series. See Mills (2010) for a discussion of temperature trend estimation when a random walk is a possible element of the specification.

In our application, we think it is reasonable to assume that the observed series are well characterized by a trend and at most one level shift at a known date and that the errors are covariance stationary. We focus on the prediction of climate warming in the troposphere over the tropics. As shown in Section 'APPLICATION: DATA AND METHODS', climate models predict a steady warming trend in this region because of rising atmospheric greenhouse gas levels, but none predict a step-change, so trends and shifts can be regarded as distinct phenomena. A number of studies (summarized later) have shown that models likely overstate the warming trend, but there is disagreement as to whether the bias is statistically significant. McKitrick et al. (2010) used the original VF test to examine this issue over the 1979–2009 interval, coinciding with the record available from weather satellites. We extend their analysis to the 1958–2012 interval using data from weather balloons. This long span encompasses a date at which a known climatic event caused a level shift in many observed temperature series. If the shift is nontrivial in magnitude, the comparison would thus be akin to that in Figure 1, such that failure to take it into account could bias the comparison either toward overstating or understating the difference in trend slopes.

The event in question occurred around 1978 and is called the Pacific climate shift (PCS). It manifested itself as an oceanic circulatory system change during which basin-wide wind stress and sea surface temperature anomaly patterns reversed, causing an abrupt step-like change (level shift) in many weather observations, including in the troposphere, as well as in other indicators such as fisheries catch records (see Seidel and Lanzante, 2004; Tsonis et al., 2007; Powell and Xu, 2011, and extensive references therein). For our purposes, we do not try to present a specific physical explanation of the PCS, or even evidence that its origin was exogenous to the climate system; we require only that it was a large event at an approximately known date, whose existence has been documented and studied extensively, and that it resulted in a shift in the mean of the temperature data. We first present results based on the assumption that the PCS occurred at a known date (Section 'Multivariate trend comparisons: no shift and known shift date cases') and then based on the assumption that the PCS is not known to have occurred or that the date of occurrence is unknown (Section 'Multivariate trend comparisons: unknown shift date case'). We find, in some cases, that the shift term is significant at the 5% or 10% level, confirming the overall importance of controlling for this possibility when comparing trends.

If the date of the PCS is taken as given and exogenous, we find that the models project significantly more warming in both the lower troposphere and mid-troposphere than are found in weather balloon records over the interval. This finding remains robust if we treat the date of the PCS as unknown and apply the conservative data-mining approach. In fact, this finding is robust whether or not we include a level shift in the regression model: we reject equivalence of the trend slopes between the observed and model-generated temperature series either way. The evidence against equivalence is simply stronger when we control for a level shift and this is true whether we treat the date of the shift to be known or unknown.

We also find that if the date of the PCS is assumed to be known then (a) the appearance of positive and significant trend slopes in the individual observed temperature series vanishes once we control for the effect of the level shift and (b) we find statistical evidence for a level shift in some but not all observed temperature series. If the date of the PCS is assumed to be unknown, statistical evidence remains for a level shift in the mean of the observed mid-troposphere series but is weak in the lower troposphere series. This is not surprising given that we use the data-mining robust critical value that decreases the power of detecting such a shift.

BASIC SET-UP WITH NO SHIFT OR KNOWN SHIFT DATE


Trend models

The literature on estimation and inference in model (1) is by now well established, and it may hardly seem possible that there is something new to be said on the subject. In fact, the last decade or so has seen some very useful methodological innovations for computing robust confidence intervals, trend significance tests, and trend comparisons in the presence of autocorrelation of unknown form. Many of these robust estimators use the nonparametric HAC approach that is now widely used in the econometrics and empirical finance literatures. In contrast, the nonparametric HAC approach is used less often in applied climatic or geophysical papers, although nonparametric approaches have been proposed by Bloomfield and Nychka (1992) and further examined by Woodward and Gray (1993) and Fomby and Vogelsang (2002) for the univariate case. As far as we know, McKitrick et al. (2010) is the first empirical climate paper to apply nonparametric HAC methods in multivariate settings.

It will be convenient to define a general deterministic trend model that contains (1) and (2) as special cases:

  y_{it} = β_i d_{0t} + δ_i′ d_{1t} + u_{it}    (3)

where d0t is a single deterministic regressor (typically the time trend in our applications) and d1t is a k × 1 vector of additional deterministic regressors (typically the intercept and shift terms) and δi is the corresponding k × 1 vector of parameters. Model (1) is thus obtained for d0t = t, βi = bi, and d1t = 1, δi = ai, and model (2) is obtained for d0t = t, βi = bi, and d1t = (1, DUt) ′, δi = (ai, gi) ′. Notice that we are assuming that each time series yit has the same deterministic regressors. This is needed for the VF statistic to be robust to unknown conditional heteroskedasticity and serial correlation. In some applications, it might be reasonable to model some of the series as having different trend functions. For example, we know that the climate model series in the application do not have level shifts because level shifts are not part of the climate model structures. When we think series could have different functional forms for the trend, we can simply include in d1t the union of trend regressors across all the series. While this will result in a loss of degrees of freedom, in many applications, the regressors will be similar across series, so the loss in degrees of freedom will often be small. We view this loss of degrees of freedom as a small price to pay for robustness to unknown forms of conditional heteroskedasticity and autocorrelation.

We estimate model (3) using OLS equation by equation. OLS has some nice properties in our set up. Because the regressors are the same for each equation, we have the well-known exact equivalence between OLS and generalized least squares estimators that account for cross series correlation. Because we have covariance stationary errors, the well-known Grenander and Rosenblatt (1957) result applies in which case OLS is also asymptotically equivalent to generalized least squares estimators that account for serial dependence in the data.

Defining the n × 1 vector β = ( β1, β2, … , βn) ′ and the k × n matrix δ = (δ1, δ2, … , δn), model (3) can be written in vector notation as

  y_t = β d_{0t} + δ′ d_{1t} + u_t    (4)

where y_t = (y_{1t}, y_{2t}, … , y_{nt})′ and u_t = (u_{1t}, u_{2t}, … , u_{nt})′.

The parameters of interest are in the vector β, so it is convenient to express the OLS estimator using the "partialling out" result for linear regression, also known as the Frisch–Waugh–Lovell result (Davidson and MacKinnon, 2004; Wooldridge, 2005), as follows. Let d̃_{0t} and ỹ_t denote, respectively, the OLS residuals from the regression of d0t on d1t and the regression of yt on d1t. The OLS estimator of β can be written as

  β̂ = (Σ_{t=1}^{T} d̃_{0t}^2)^{-1} Σ_{t=1}^{T} d̃_{0t} ỹ_t    (5)

and it directly follows that

  β̂ − β = (Σ_{t=1}^{T} d̃_{0t}^2)^{-1} Σ_{t=1}^{T} d̃_{0t} ũ_t    (6)

where ũ_t denotes the OLS residuals from the regression of u_t on d1t.

Note that the form of β̂ in Equation (5) would be unchanged if we redefined d0t to be the shift term and d1t to be the intercept and trend terms; however, the definition of the ~ variables would be adjusted accordingly. This implies that test statistics for hypotheses about the shift term take the same form as those for the trend terms when the shift date is known.
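The partialling out step behind Equation (5) can be checked numerically. The sketch below (our code, with made-up data) confirms that the trend coefficient from the full regression equals the ratio computed from the residualized variables:

```python
import numpy as np

# Numerical check of the Frisch-Waugh-Lovell step behind Equation (5)
# (illustrative; the data-generating values are ours, not the paper's)
rng = np.random.default_rng(1)
T = 200
t = np.arange(1, T + 1.0)
DU = (t > 80).astype(float)
d1 = np.column_stack([np.ones(T), DU])    # d_{1t}: intercept and shift
d0 = t                                    # d_{0t}: the trend regressor
y = 0.5 + 0.02 * t + 1.5 * DU + rng.normal(size=T)

# Full OLS on (1, DU, t): the trend coefficient is the last entry
X = np.column_stack([d1, d0])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0][-1]

# FWL: residualize d0 and y on d1, then regress residual on residual
P = d1 @ np.linalg.solve(d1.T @ d1, d1.T)
d0_tilde = d0 - P @ d0                    # residual of d0 on d1
y_tilde = y - P @ y                       # residual of y on d1
beta_fwl = (d0_tilde @ y_tilde) / (d0_tilde @ d0_tilde)   # Equation (5)
```

The two estimates agree to machine precision, which is the algebraic content of the partialling out result.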

The VF test

We are interested in testing null hypotheses of the form

  H_0 : Rβ = r    (7)

against the alternative H_1 : Rβ ≠ r, where R and r are known restriction matrices of dimensions q × n and q × 1, respectively, and q denotes the number of restrictions being tested. The matrix R is assumed to have full row rank. Robust tests of H_0 need to account for correlation across time, correlation across series, and conditional heteroskedasticity, as summarized by the long run variance of ut. This is defined as

  Ω = Σ_{j=−∞}^{∞} Γ_j

where Γ_j = E(u_t u_{t−j}′) is the matrix autocovariance function of ut. Those familiar with the time series literature will notice that Ω is proportional to the spectral density matrix of ut evaluated at frequency zero.

The VF statistic is constructed using the following estimator of Ω:

  Ω̂ = Σ_{j=−(T−1)}^{T−1} (1 − |j|/T) Γ̂_j    (8)

where Γ̂_j = T^{-1} Σ_{t=j+1}^{T} v̂_t v̂_{t−j}′ for j ≥ 0, Γ̂_{−j} = Γ̂_j′, and v̂_t = d̃_{0t} û_t is built from the OLS residuals û_t of model (4). This is the Bartlett kernel nonparametric estimator with a bandwidth (truncation lag) equal to the sample size. The VF statistic for testing H_0 : Rβ = r is given by

  VF = (Rβ̂ − r)′ (R V̂ R′)^{-1} (Rβ̂ − r),   V̂ = T Q̂^{-2} Ω̂,   Q̂ = Σ_{t=1}^{T} d̃_{0t}^2    (9)

In the Supporting information, we provide a finite sample motivation for the form of Ω̂. Also note that (8) was originally proposed by Kiefer et al. (2000, 2001) although in the different but computationally identical form:

  Ω̂ = 2T^{-2} Σ_{t=1}^{T} Ŝ_t Ŝ_t′    (10)

where Ŝ_t = Σ_{j=1}^{t} v̂_j. See Kiefer and Vogelsang (2002) for a formal derivation of the exact equivalence between (8) and (10).
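The equivalence can also be verified directly. The sketch below (our code, scalar case) computes the bandwidth-T Bartlett estimator and the partial-sum form on the same zero-sum input (OLS residual products satisfy Σ v̂_t = 0 by the orthogonality conditions) and checks that they coincide:

```python
import numpy as np

# Check the exact equivalence of forms (8) and (10): the Bartlett kernel
# estimator with bandwidth T equals 2*T^{-2} * sum_t S_t S_t' when the
# inputs sum to zero, as OLS residual products do. Scalar case for brevity.
rng = np.random.default_rng(2)
T = 150
v = rng.normal(size=T)
v = v - v.mean()                      # enforce sum-to-zero, like OLS residuals

# Form (8): Bartlett weights (1 - |j|/T) on the sample autocovariances
gamma = np.array([np.dot(v[j:], v[:T - j]) / T for j in range(T)])
omega_bartlett = gamma[0] + 2.0 * np.sum((1.0 - np.arange(1, T) / T) * gamma[1:])

# Form (10): partial sums S_t = v_1 + ... + v_t
S = np.cumsum(v)
omega_kvb = 2.0 * np.sum(S * S) / T**2
```

The two numbers agree to machine precision, mirroring the Kiefer and Vogelsang (2002) algebra; the partial-sum form (10) is usually the cheaper one to compute.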

When only one restriction is being tested (q = 1), we can define a t-statistic version of VF as

  VF_t = (Rβ̂ − r) / (R V̂ R′)^{1/2}

Asymptotic limit of the VF statistic

We now provide sufficient conditions for obtaining an asymptotic approximation to the sampling distribution of the VF statistic. A formal proof is given in the Supporting information. For a fraction c ∈ (0, 1] of the sample, [cT] denotes the integer part of cT. The symbols ⇒ and →_d denote weak convergence and convergence in distribution, respectively, Λ is the matrix square root of Ω (i.e., Ω = ΛΛ′), and W_j(c) denotes a j × 1 vector of independent standard Wiener processes, where j is a positive integer.

Two assumptions are sufficient for obtaining the limit of VF. Define the partial sums S_t = Σ_{j=1}^{t} u_j. The first assumption is that a functional central limit theorem holds for S_t: as T → ∞,

  T^{-1/2} S_{[cT]} ⇒ Λ W_n(c)    (11)

The second assumption is related to the deterministic regressors in the model. Assume that there is a scalar τ_{0T} and a k × k matrix τ_{1T} such that

  τ_{0T}^{-1} d_{0[sT]} → f_0(s),   τ_{1T}^{-1} d_{1[sT]} → f_1(s),   uniformly in s ∈ [0, 1]    (12)

For example, in model (2), d0t = t, d1t = (1, DUt)′, τ_{0T} = T, τ_{1T} = I_2, f_0(s) = s, and f_1(s) = (1, DU(s > λ))′ where, in this case, DU(s > λ) denotes the indicator function taking the value 1 if s > λ and 0 otherwise.

Stacking the limit functions and projecting f_0(s) on f_1(s) defines

  f(s) = (f_0(s), f_1(s)′)′

  f̃(s) = f_0(s) − (∫_0^1 f_0(r) f_1(r)′ dr)(∫_0^1 f_1(r) f_1(r)′ dr)^{-1} f_1(s)    (13)

so that f̃(s) is the continuous-time analogue of the residualized regressor d̃_{0t}.

In the Supporting information, we show that under assumptions (11) and (12), the limit of VF under the null hypothesis (7) is given by

  VF ⇒ Z_q′ B_q^{-1} Z_q    (14)

where Z_q ~ N(0, I_q) and is independent of the random matrix B_q, a functional of W_q(c) and of f̃(s). The limit of the VF statistic can therefore be seen to be similar to an F random variable; however, it follows a nonstandard distribution that depends on the deterministic regressors in the model through f̃(s). The critical values of the limiting distribution (14) thus depend on the full set of trend function regressors and, by extension, on the value of λ when a level shift dummy variable is included in the model. It is important to note that the critical values do not depend on which regressors are placed in d0t (the regressor of interest for hypothesis testing). For a given value of λ, one uses the same critical values for testing the equality of trend slopes, testing hypotheses about the intercepts, or testing hypotheses about level shifts in model (2).

Obtaining the critical values of the nonstandard asymptotic random variable defined by (14) is straightforward using Monte Carlo simulation methods that are widely used in the econometrics and statistics literatures. In the application, when we take the date of the PCS to be exogenously given at 1977:12, this yields a value of λ = 0.3636. For model (2) with λ = 0.3636, we simulated the asymptotic critical values of VF for testing one restriction (q = 1); these are tabulated in Table 1. The Wiener process that appears in the limiting distribution is approximated by the scaled partial sums of 1000 independent and identically distributed N(0, 1) random deviates. The vector f(s) is approximated using (1, DU(t > 0.3636T), t/T)′ for t = 1, 2, …, T. The integrals are approximated by simple averages; 50,000 replications were used. We see from Table 1 that the right tail of the VF statistic is fatter than that of a χ²_1 random variable.

Table 1. Asymptotic critical values: model (2), known shift date with λ = 0.3636, q = 1
  %        VF_t      VF
  0.700    1.681     11.676
  0.750    2.185     14.685
  0.800    2.737     18.543
  0.850    3.417     23.798
  0.900    4.306     32.353
  0.950    5.688     49.399
  0.975    7.028     70.255
  0.990    8.779     99.978
  0.995    9.999     125.809

  1. The value in the first column shows the percentage in the upper (right) tail exceeding the indicated values of VF_t and VF. Left tail critical values of VF_t follow by symmetry around zero.

  2. VF, Vogelsang–Franses.
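The scheme just described can also be mimicked in finite samples: under iid N(0, 1) errors, the finite-sample VF t-statistic has the same limit, so its simulated percentiles approximate the asymptotic critical values. The sketch below is our reconstruction (the variance term follows our reading of (5) and (8)–(10)) and is run small for speed, so the resulting numbers are illustrative of the procedure rather than a reproduction of Table 1:

```python
import numpy as np

def vf_tstat(y, lam):
    """VF-type t-statistic for the trend slope in model (2), bandwidth = T.
    This is our reconstruction of the statistic, not the authors' code."""
    T = len(y)
    t = np.arange(1, T + 1.0)
    DU = (t > np.floor(lam * T)).astype(float)
    d1 = np.column_stack([np.ones(T), DU])       # d_{1t}: intercept and shift
    P = d1 @ np.linalg.solve(d1.T @ d1, d1.T)
    d0t = t - P @ t                              # residualized trend
    yt = y - P @ y
    Q = d0t @ d0t
    b_hat = (d0t @ yt) / Q                       # Equation (5)
    u_hat = yt - b_hat * d0t                     # regression residuals
    v = d0t * u_hat
    S = np.cumsum(v)
    omega = 2.0 * np.sum(S * S) / T**2           # partial-sum form (10)
    var = T * omega / Q**2                       # variance term as in (9)
    return b_hat / np.sqrt(var)

# Simulate null critical values under iid N(0,1) errors (small run;
# the paper uses 50,000 replications)
rng = np.random.default_rng(3)
reps, T, lam = 500, 200, 0.3636
stats = np.sort([abs(vf_tstat(rng.normal(size=T), lam)) for _ in range(reps)])
crit_90, crit_95 = stats[int(0.90 * reps)], stats[int(0.95 * reps)]
```

Larger replication counts and series lengths sharpen the percentile estimates toward the asymptotic values.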

Bootstrap critical values and p-values

If carrying out simulations of the asymptotic distributions is not easily accomplished using standard statistical packages, an alternative is to use a simple bootstrap that is described in detail in the Supporting information. Residuals from Equation (4) can be resampled and used to compute Ω̂ from Equation (8) and VF from Equation (9); then the percentiles of the bootstrapped VF statistic in many repetitions can provide critical values and p-values. A particular advantage of the VF method is that its asymptotic null distributions do not depend on unknown correlation parameters, and it falls within the general framework considered by Gonçalves and Vogelsang (2011), where it was shown that the simple, or naïve, independent and identically distributed bootstrap will generate valid critical values. No special methods, such as blocking, are required here, and the bootstrap critical values are asymptotically equivalent to the distribution given by (14).
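One simple version of such a bootstrap can be sketched as follows (our code; a plain OLS trend t-ratio stands in for the VF statistic purely for illustration, and in practice `stat_fn` would be the VF statistic itself):

```python
import numpy as np

def design(T, lam):
    """Regressors of model (2): intercept, trend, and level shift dummy."""
    t = np.arange(1, T + 1.0)
    DU = (t > np.floor(lam * T)).astype(float)
    return np.column_stack([np.ones(T), t, DU])

def slope_stat(y, lam):
    """Stand-in statistic: OLS trend t-ratio with an iid-based standard error."""
    X = design(len(y), lam)
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ coef
    s2 = (u @ u) / (len(y) - X.shape[1])
    return coef[1] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

def bootstrap_pvalue(y, lam, stat_fn, B=499, seed=0):
    """Naive iid residual bootstrap p-value, imposing a zero trend slope."""
    rng = np.random.default_rng(seed)
    X = design(len(y), lam)
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    coef0 = coef.copy()
    coef0[1] = 0.0                       # impose the null: no trend
    null_fit = X @ coef0
    s0 = abs(stat_fn(y, lam))
    hits = sum(abs(stat_fn(null_fit + rng.choice(resid, len(y), replace=True),
                           lam)) >= s0
               for _ in range(B))
    return (hits + 1) / (B + 1)

rng = np.random.default_rng(4)
p = bootstrap_pvalue(rng.normal(size=120), 0.4, slope_stat, B=199)
```

Because the resampling is iid over time, this is exactly the naive scheme validated by Gonçalves and Vogelsang (2011) for fixed-bandwidth statistics; no block resampling is needed.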

TREATING THE SHIFT DATE AS UNKNOWN


Many previous authors (e.g., Seidel and Lanzante, 2004) have treated the date of the level shift as known because the PCS was an exogenous event observed across many different climatic data series. As a robustness check, we also report results in which we treat the date of the level shift as unknown. We take a supremum approach that has a long history in the change point literature. For a given hypothesis, we compute the VF statistic over a grid of possible shift dates and select the date that gives the largest VF statistic. In other words, we search for the shift date that gives the strongest evidence against the null hypothesis. The effect on critical values of searching over shift dates must be taken into account; otherwise, this approach would amount to a "data-mining" exercise that could give misleading inference, since the level of the test would be inflated above the nominal level relative to the case where the shift date is assumed to be known. Fortunately, it is easy to obtain critical values that take the search over shift dates into account.

For a given potential shift date Tb, let VF(λ) denote the VF statistic for testing a given null hypothesis. The limiting random variable given by (14) depends on λ through the level shift regressor, and we now label the limit VF_∞(λ) to make explicit the dependence on the shift date used to estimate the model. For technical reasons (Andrews, 1993), we need to "trim" a fraction v from each end of the sample, leaving a grid of potential shift dates given by [vT] + 1, [vT] + 2, …, T − [vT] (in our application, we set v = 0.1). Define the "data-mined" VF statistic as

  supVF = max_{Tb ∈ {[vT]+1, …, T−[vT]}} VF(Tb/T)

Under the null hypothesis (7) and under the assumption there is no level shift in the data, we have

  supVF ⇒ sup_{λ ∈ [v, 1−v]} VF_∞(λ)    (15)

where the limit follows from (14) and application of the continuous mapping theorem. Using simulation methods identical to those used for the known shift date case, we computed asymptotic critical values for supVF for v = 0.1 and q = 1 for testing hypotheses about the trend slope parameters in model (2). These critical values are given in Table 2. Using the supVF statistic along with the critical values given by (15) provides a very conservative test with regard to the shift date.

Table 2. Asymptotic critical values: model (2), unknown shift date, q = 1; 10% trimming (v = 0.1)
  %        supVF trend slope    supVF level shift
  0.700    79.765               95.455
  0.750    88.184               109.94
  0.800    98.532               116.20
  0.850    111.78               130.76
  0.900    131.92               150.99
  0.950    166.41               188.68
  0.975    205.15               225.78
  0.990    261.39               279.85
  0.995    301.94               322.48

  1. The value in the first column shows the percentage in the upper (right) tail exceeding the indicated values of the supVF statistics for, respectively, the trend slope and level shift coefficients.

  2. VF, Vogelsang–Franses.
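The grid-and-maximize computation behind supVF can be sketched as follows (our code; `squared_slope_stat` is a hypothetical stand-in for the VF statistic, used only to illustrate the search over break dates):

```python
import numpy as np

def squared_slope_stat(y, lam):
    """Hypothetical stand-in: squared OLS trend t-ratio in model (2)."""
    T = len(y)
    t = np.arange(1, T + 1.0)
    DU = (t > np.floor(lam * T)).astype(float)
    X = np.column_stack([np.ones(T), t, DU])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ coef
    s2 = (u @ u) / (T - 3)
    return coef[1] ** 2 / (s2 * np.linalg.inv(X.T @ X)[1, 1])

def sup_stat(y, stat_fn, trim=0.10):
    """Maximize stat_fn over the trimmed grid of candidate break fractions,
    Tb = [vT]+1, ..., T-[vT], as in the text."""
    T = len(y)
    k = int(np.ceil(trim * T))
    grid = [Tb / T for Tb in range(k, T - k)]
    return max(stat_fn(y, lam) for lam in grid), grid

rng = np.random.default_rng(5)
y = rng.normal(size=120)
sup_val, grid = sup_stat(y, squared_slope_stat)
```

By construction the maximized statistic weakly exceeds the statistic at any single candidate date, which is why the supVF critical values in Table 2 are so much larger than the known-date values in Table 1.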

TESTING FOR A LEVEL SHIFT IN A UNIVARIATE TIME SERIES


As part of the empirical application, we provide visual evidence that the observed temperature series exhibit level shifts around the time of the PCS. Some formal statistical evidence regarding these level shifts can be provided by application of the VF statistic to an individual time series. Consider model (2) for the case of n = 1, and place the model in the general framework (3) with d0t = DUt, d1t = (1, t) ′, β1 = g1, and δ1 = (a1, b1) ′. If we take the shift date as known, then the VF statistic for testing for no level shift (H0 : g1 = 0) can be computed as before using (9) with R = 1 and r = 0. The asymptotic null critical values are still given by Table 1.

If we treat the shift date as unknown, we can apply the supVF statistic, although the asymptotic critical values depend on which regressor is placed in d0t. While it is true that for a given value of λ, the distribution of VF_∞(λ) is the same regardless of the regressor placed in d0t, the covariance structure of VF_∞(λ) across λ depends on which regressor is placed in d0t. Therefore, the supVF statistic for testing a zero trend slope has different asymptotic critical values than the supVF statistic for testing a zero level shift. We simulated the asymptotic critical values of supVF for testing for a zero level shift for the case of v = 0.1 and q = 1 and provide those critical values in Table 2.

Other tests for a level shift at an unknown date in a trending time series have been proposed in the empirical climate literature. Reeves et al. (2007) provide a review of change point detection methods developed in the climate literature, but the review focuses on tests designed for time series that do not have serial correlation. In contrast, Lund et al. (2007) propose a test for a level shift that allows a specific form of autocorrelation, the first-order periodic autoregressive model. We prefer the VF approach for two reasons. First, the VF approach is robust to more general forms of autocorrelation. Second, we formally derive and characterize the limiting null distribution of the sup statistic, and this allows us to tabulate null critical values. Lund et al. (2007) also use a sup-type statistic, but they do not provide any asymptotic theory that can be used to generate valid approximate critical values. A recent paper by Gallagher et al. (2013) develops asymptotic theory for a level shift test that treats the shift date as unknown, but their analysis is confined to trend models where uit is assumed to be uncorrelated over time. What seems to be missing from the empirical climate literature are level shift tests that allow the shift date to be unknown and permit serial correlation in uit. Fortunately, there are several papers in the econometrics literature that propose level shift tests for trending series with these properties; see Ploberger and Krämer (1996), Vogelsang (1997), and Sayginsoy and Vogelsang (2011). While clearly well outside the scope of this paper, it would be interesting to compare the sup-VF test for a shift at an unknown date with the other tests proposed in the literature.

FINITE SAMPLE PERFORMANCE OF THE VF STATISTIC


In this section, we report results from a small Monte Carlo simulation study of the finite sample performance of the VF statistic. We compare the performance of the VF statistic with traditional Wald statistics. The Wald statistics are configured to be robust to heteroskedasticity and serial correlation over time and to correlation across series. We use two established methods for constructing the Wald statistic. The first method is based on the nonparametric estimator of Ω given by

  • Ω̂ = Γ̂_0 + Σ_{j=1}^{M−1} (1 − j/M)(Γ̂_j + Γ̂_j′),  where Γ̂_j = T^{−1} Σ_{t=j+1}^{T} û_t û′_{t−j},  (16)

which is the Bartlett kernel estimator. The value of the bandwidth, M, was chosen by the data dependent method proposed by Andrews (1991) based on the AR(1) plug-in method. This data dependent method tends to choose values of M that are small relative to T although M tends to be larger when serial correlation in the data is strong compared with cases where serial correlation is weak. The second method also uses (16) but with prewhitening based on autoregression models with lag 1 (AR(1)) fit to the components of inline image. Prewhitening was explored by Andrews and Monahan (1992). We set M = 1 in the prewhitening case that makes the estimator of Ω an AR(1) parametric estimator. We compute Wald-type statistics based on these two estimators of Ω using (9) with inline image replaced with either inline image or the prewhitened estimator. The resulting Wald statistics are denoted by WBart and WPW, respectively. Under the assumptions used to derive the limiting null distribution of the VF statistic, it is well known that WBart and WPW have limiting null distributions given by inline image where inline image denotes a chi-square random variable with q degrees of freedom.
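A minimal sketch of the two Ω estimators, assuming a (T, n) array of demeaned regression residuals. `ar1_plugin_bandwidth` implements the univariate Andrews (1991) AR(1) plug-in rule for the Bartlett kernel, here averaged with equal weights across series (a simplification of Andrews's weighted version); the function names are hypothetical.

```python
import numpy as np

def ar1_plugin_bandwidth(u):
    """Andrews (1991) AR(1) plug-in bandwidth for the Bartlett kernel:
    M = 1.1447 * (alpha(1) * T)^(1/3), with alpha(1) from a fitted AR(1)
    coefficient per series, averaged equally across series (simplified)."""
    T, n = u.shape
    alphas = []
    for i in range(n):
        x, y = u[:-1, i], u[1:, i]
        rho = (x @ y) / (x @ x)
        alphas.append(4 * rho**2 / ((1 - rho)**2 * (1 + rho)**2))
    return max(1, int(np.ceil(1.1447 * (np.mean(alphas) * T) ** (1 / 3))))

def bartlett_hac(u, M):
    """Omega-hat = Gamma_0 + sum_{j=1}^{M-1} (1 - j/M)(Gamma_j + Gamma_j')
    as in (16); M = 1 returns the zero-lag term only."""
    T = u.shape[0]
    omega = u.T @ u / T
    for j in range(1, M):
        gamma = u[j:].T @ u[:-j] / T
        omega += (1 - j / M) * (gamma + gamma.T)
    return omega

def ar1_prewhitened_hac(u):
    """AR(1) prewhitened estimator: fit AR(1) to each residual series,
    estimate Omega from the whitened errors with M = 1 (so only the
    zero-lag term survives, making the estimator parametric AR(1)),
    then recolor through (I - A)^{-1}."""
    T, n = u.shape
    A = np.zeros((n, n))
    for i in range(n):
        x, y = u[:-1, i], u[1:, i]
        A[i, i] = (x @ y) / (x @ x)
    e = u[1:] - u[:-1] @ A.T
    omega_star = bartlett_hac(e, M=1)
    inv = np.linalg.inv(np.eye(n) - A)
    return inv @ omega_star @ inv.T
```

For strongly autocorrelated data, the plug-in bandwidth grows with the serial correlation, and the prewhitened estimator recovers the long-run variance through the AR(1) recoloring step.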

We use the following data generating process:

  • y_it = a_i + b_i t + g_i DU_t + u_it,  i = 1, 2,  (17)
  • u_it = ρ_1 u_{i,t−1} + ρ_2 u_{i,t−2} + e_it,  (18)

where DU_t = 1(t > λT), e_1t = ε_1t, e_2t = η ε_1t + (1 − η²)^{1/2} ε_2t, ε_it ~ iid N(0, 1), and cov(ε_1t, ε_2s) = 0. The innovations are scaled so that the errors, u_it, have unit variances. When η = 0, y_1t and y_2t are uncorrelated with each other.
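A sketch of this DGP, under the assumption (one plausible reading of the set-up) that cross-series dependence enters by mixing the innovations as e_2t = η ε_1t + (1 − η²)^{1/2} ε_2t; the rescaling that gives u_it exactly unit variance is omitted, and `simulate_pair` is a hypothetical name.

```python
import numpy as np

def simulate_pair(T, b1=0.01, b2=0.01, g1=0.0, lam=0.3636,
                  rho1=0.0, rho2=0.0, eta=0.0, rng=None):
    """Simulate two trending series with AR(2) errors and an optional
    level shift in series 1; eta mixes the innovations so that eta = 0
    gives independent series. (Unit-variance scaling of u_it omitted.)"""
    if rng is None:
        rng = np.random.default_rng()
    burn = 200                      # burn-in so the AR(2) reaches stationarity
    n = T + burn
    e1 = rng.standard_normal(n)
    e2 = eta * e1 + np.sqrt(1 - eta**2) * rng.standard_normal(n)

    def ar2(e):
        u = np.zeros(n)
        for t in range(2, n):
            u[t] = rho1 * u[t - 1] + rho2 * u[t - 2] + e[t]
        return u[burn:]

    u1, u2 = ar2(e1), ar2(e2)
    t = np.arange(1, T + 1)
    DU = (t > lam * T).astype(float)   # level-shift dummy at fraction lam
    y1 = b1 * t + g1 * DU + u1
    y2 = b2 * t + u2
    return y1, y2
```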

We report two sets of results. The first set focuses on empirical null rejection probabilities. For y1t, we set b1 = 0.01 and g1 = 0 so that there is no level shift in y1t. For y2t, we set b2 = 0.01 so that the null hypothesis of equal trend slopes holds. We set η = 0 so that the two series are uncorrelated. We report results for T = 120, 240, and 660, and a selection of values of ρ1 and ρ2. In all cases, 50,000 replications were used, and we computed empirical rejection probabilities for the WPW, WBart, VF, and supVF statistics using the appropriate asymptotic critical values. The simulation results for this configuration highlight the impact of the serial correlation structure on null rejection probabilities relative to the sample size.

The results are tabulated in Table 3. There are two sets of results reported for each of the three statistics. The first set of results corresponds to the case where no level shift dummy variable is included in the estimated model. The second set of results corresponds to the case where the level shift dummy is included in the estimated model. In this case, we also report results for the supVF statistic. Results are organized into three blocks corresponding to the three sample sizes. Within a block, results are given for seven configurations of the autoregressive parameters ranging from no serial correlation to strong serial correlation.

Table 3. Empirical null rejections with AR(2) errors

                    Without level shift        With level shift
  T    ρ1    ρ2     WPW     WBart   VF         WPW     WBart   VF      SupVF
  120  0     0      0.062   0.058   0.051      0.064   0.059   0.050   0.042
       0.3   0      0.065   0.091   0.057      0.067   0.095   0.058   0.054
       0.6   0      0.076   0.126   0.069      0.079   0.133   0.070   0.079
       0.9   0      0.141   0.263   0.131      0.140   0.271   0.120   0.203
       0.3   0.3    0.180   0.175   0.079      0.180   0.178   0.078   0.101
       0.6   0.3    0.258   0.303   0.153      0.253   0.299   0.131   0.238
       0.9   −0.3   0.020   0.102   0.057      0.021   0.117   0.060   0.054
  240  0     0      0.057   0.055   0.050      0.058   0.055   0.052   0.045
       0.3   0      0.058   0.080   0.054      0.059   0.081   0.054   0.050
       0.6   0      0.063   0.101   0.060      0.064   0.106   0.061   0.061
       0.9   0      0.100   0.190   0.097      0.101   0.203   0.096   0.137
       0.3   0.3    0.167   0.140   0.064      0.168   0.142   0.066   0.072
       0.6   0.3    0.216   0.226   0.110      0.215   0.232   0.106   0.164
       0.9   −0.3   0.013   0.084   0.054      0.015   0.091   0.055   0.049
  660  0     0      0.052   0.051   0.050      0.053   0.052   0.051   0.050
       0.3   0      0.052   0.067   0.051      0.054   0.068   0.052   0.050
       0.6   0      0.055   0.078   0.053      0.056   0.081   0.055   0.052
       0.9   0      0.068   0.122   0.067      0.070   0.131   0.069   0.078
       0.3   0.3    0.154   0.102   0.055      0.156   0.104   0.056   0.055
       0.6   0.3    0.177   0.149   0.073      0.179   0.155   0.075   0.090
       0.9   −0.3   0.009   0.067   0.051      0.010   0.072   0.053   0.049

Note: H0 : b1 = b2, 5% nominal level, b1 = b2 = 0.01, η = 0, g1 = 0. The data generating process is given by (17) and (18).

If the asymptotic approximations were working perfectly, we would see rejections of 0.05 in all cases. When serial correlation is absent, all statistics have empirical rejection probabilities close to 0.05 regardless of the sample size. Once there is serial correlation in the model, over-rejections can occur depending on the strength of the serial correlation relative to the sample size. First focus on the case of AR(1) errors (ρ2 = 0). Rejections tend to be close to 0.05 when ρ1 is small, but as ρ1 increases, rejections tend to increase. This is especially true for the WBart statistic, where rejections exceed 0.25 when ρ1 = 0.9. In contrast, WPW and VF suffer from less severe over-rejection problems, although they tend to be over-sized when T = 120 and ρ1 = 0.9. For a given value of ρ1, as T increases, over-rejections tend to fall for all three statistics, but most slowly for WBart. Overall, for the AR(1) error case, WPW and VF have similar rejections to each other and outperform WBart. The supVF statistic tends to over-reject more than VF when serial correlation is strong, although the differences between supVF and VF decrease as the sample size increases. It is a common finding that supremum statistics tend to have more over-rejection problems than statistics that treat break dates as known.

One of the reasons that WPW performs relatively well with AR(1) errors is that it is explicitly designed for AR(1) error structures. When the errors are not AR(1), however, WPW can suffer from over-rejection and under-rejection problems. Consider the case ρ1 = 0.3, ρ2 = 0.3, where WPW shows substantial over-rejections that are larger than those of WBart and VF. These over-rejections tend to persist as T increases. In contrast, VF is much less distorted, and its rejections approach 0.05 as T increases. For the case of ρ1 = 0.9, ρ2 = −0.3, WPW under-rejects, and the under-rejection problem becomes more severe as T increases, whereas VF has rejections close to 0.05 for all sample sizes. The WBart statistic tends to over-reject mildly in this case.

In general, Table 3 indicates that the VF statistic has the least severe over-rejection problems and gives the best control of type 1 error.

In the second set of results, we use T = 660 to match the empirical application. We now include a level shift in y1t with λ = 0.3636, and we set b1 = 0.01 and g1 = 0, 0.25. For y2t, we set b2 = 0.01, 0.0105, 0.011, 0.0116, and 0.0121. We report results for η = 0, 0.5. While we ran simulations for a wide range of values for ρ1 and ρ2, we only report results for ρ1 = 0, 0.9 and ρ2 = 0, given that results for other serial correlation configurations have patterns similar to those in Table 3.

The results are given in Table 4. The first block of 20 rows gives results for η = 0, and the second block of 20 rows gives results for η = 0.5. Within each η block, results are first given for g1 = 0 followed by results for g1 = 0.25. For each value of g1, results are given for ρ1 = 0 followed by results for ρ1 = 0.9. When b2 = 0.01, we are observing null rejection probabilities; for other values of b2, we are observing power.

Table 4. Empirical null rejections and empirical power with AR(1) errors

                          Without level shift        With level shift
  η    g1    ρ1   b2      WPW     WBart   VF         WPW     WBart   VF      SupVF
  0    0     0    0.01    0.052   0.051   0.050      0.053   0.052   0.051   0.050
                  0.0105  0.452   0.451   0.364      0.178   0.177   0.150   0.222
                  0.011   0.954   0.955   0.878      0.526   0.524   0.439   0.684
                  0.0116  1.000   1.000   0.994      0.858   0.859   0.762   0.955
                  0.0121  1.000   1.000   1.000      0.981   0.981   0.937   0.997
             0.9  0.01    0.068   0.123   0.068      0.069   0.132   0.069   0.078
                  0.0105  0.093   0.155   0.088      0.078   0.142   0.075   0.089
                  0.011   0.166   0.247   0.148      0.102   0.175   0.096   0.127
                  0.0116  0.284   0.386   0.245      0.142   0.226   0.130   0.188
                  0.0121  0.435   0.551   0.374      0.197   0.293   0.175   0.271
  0    0.25  0    0.01    0.442   0.441   0.292      0.053   0.052   0.051   0.228
                  0.0105  0.051   0.050   0.036      0.178   0.177   0.150   0.078
                  0.011   0.450   0.449   0.293      0.526   0.524   0.439   0.278
                  0.0116  0.954   0.954   0.810      0.858   0.859   0.762   0.698
                  0.0121  1.000   1.000   0.984      0.981   0.981   0.937   0.947
             0.9  0.01    0.092   0.153   0.087      0.069   0.132   0.069   0.091
                  0.0105  0.068   0.122   0.066      0.078   0.142   0.075   0.079
                  0.011   0.092   0.154   0.087      0.102   0.175   0.096   0.093
                  0.0116  0.165   0.245   0.145      0.142   0.226   0.130   0.130
                  0.0121  0.280   0.384   0.242      0.197   0.293   0.175   0.192
  0.5  0     0    0.01    0.051   0.051   0.051      0.052   0.052   0.050   0.050
                  0.0105  0.691   0.691   0.577      0.280   0.278   0.234   0.365
                  0.011   0.998   0.998   0.983      0.779   0.779   0.672   0.908
                  0.0116  1.000   1.000   1.000      0.983   0.983   0.941   0.998
                  0.0121  1.000   1.000   1.000      1.000   1.000   0.995   1.000
             0.9  0.01    0.068   0.123   0.068      0.069   0.130   0.068   0.078
                  0.0105  0.112   0.182   0.105      0.084   0.151   0.081   0.100
                  0.011   0.242   0.341   0.212      0.128   0.208   0.117   0.166
                  0.0116  0.439   0.556   0.379      0.198   0.297   0.178   0.273
                  0.0121  0.656   0.758   0.570      0.292   0.410   0.261   0.417
  0.5  0.25  0    0.01    0.683   0.684   0.419      0.053   0.052   0.050   0.357
                  0.0105  0.050   0.050   0.028      0.280   0.278   0.234   0.105
                  0.011   0.687   0.689   0.424      0.779   0.779   0.672   0.459
                  0.0116  0.998   0.998   0.939      0.983   0.983   0.941   0.903
                  0.0121  1.000   1.000   0.999      1.000   1.000   0.995   0.996
             0.9  0.01    0.111   0.177   0.100      0.069   0.130   0.068   0.099
                  0.0105  0.067   0.121   0.066      0.084   0.151   0.081   0.081
                  0.011   0.111   0.179   0.100      0.128   0.208   0.117   0.103
                  0.0116  0.241   0.338   0.206      0.198   0.297   0.178   0.172
                  0.0121  0.436   0.552   0.370      0.292   0.410   0.261   0.281

Note: H0 : b1 = b2, T = 660, 5% nominal level, b1 = 0.01, ρ2 = 0. The data generating process is given by (17) and (18).

First focus on the results when the null hypothesis is true, that is, b2 = 0.01. For ρ1 = 0, we see that when the level shift dummy is included, rejections are close to 0.05 for all statistics. However, when the level shift dummy is not included and g1 = 0.25, we observe severe over-rejections ranging from 0.292 to 0.442. The statistic with the least severe over-rejection problem is VF. When ρ1 = 0.9, we have relatively strong autocorrelation in the data. When either g1 = 0 or the level shift dummy is included in the model, there are some mild over-rejection problems, ranging from 0.069 for VF to 0.132 for WBart, with WPW in between. Over-rejections are slightly worse for η = 0.5 than for η = 0. As in Table 3, supVF tends to over-reject slightly more than VF when autocorrelation is strong. In addition, when there is a level shift in the data, supVF tends to have rejections above 0.05. This happens because supVF nests the null hypotheses of equal trend slopes and no level shift: a rejection using supVF indicates a level shift and/or differences in trend slopes.

Now focus on the cases where b2 > 0.01. In these cases, y2t has a bigger trend slope than y1t, and we should be rejecting the null of equal trend slopes. When g1 = 0, all statistics have good power when ρ1 = 0, and power is higher for η = 0.5 than for η = 0. Power increases, as expected, as b2 increases. Across the three statistics, VF tends to have lower power than the other two. This illustrates the well known trade-off between over-rejection problems and power. Note that while the power of VF is lowest, it is still relatively good in an absolute sense. If we include the level shift dummy in the model even though it is not needed (g1 = 0), all three tests show the expected reduction in power. An unexpected finding is that the supVF statistic has higher power than VF and the two Wald statistics when the level shift regressor is included but there is no shift in the data. In contrast, and as expected, the power of supVF is lower than that of the other tests when the level shift regressor is not included in the estimated model.

The most interesting power results occur for g1 = 0.25 and b2 = 0.0105 when the level shift dummy is left out of the model. In this case, the estimator of b1 is biased upward, and one can show that the probability limit of the estimator of b1 exactly equals 0.0105. For the case of ρ1 = 0, rejections of all three statistics are then close to the nominal level of 0.050. This shows that an omitted level shift variable can cripple the power of the tests to detect a difference in trend slopes between two series. For larger values of b2, the tests have power even if the level shift variable is not included. When b2 = 0.011, power is higher if the level shift dummy is included, whereas for b2 = 0.0116 and 0.0121, power is higher when the level shift dummy is left out. When a level shift is present in the data, supVF has less power overall than VF, as expected, given that supVF treats the break date as unknown and uses conservative critical values.
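The probability-limit claim can be checked numerically: regressing the noise-free shifted trend on an intercept and trend alone reproduces the 0.0105 slope, since the omitted level shift loads onto the trend regressor.

```python
import numpy as np

# Deterministic check of the omitted-shift bias: with T = 660, break
# fraction 0.3636, b1 = 0.01, and g1 = 0.25, the trend-only fit to the
# noise-free series b1*t + g1*DU_t has slope b1 + g1*Cov(DU, t)/Var(t).
T, lam, b1, g1 = 660, 0.3636, 0.01, 0.25
t = np.arange(1, T + 1)
DU = (t > lam * T).astype(float)
y = b1 * t + g1 * DU              # no noise: exposes the pure bias
X = np.column_stack([np.ones(T), t])
slope = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(round(slope, 4))            # prints 0.0105
```

This is exactly why the b2 = 0.0105 row of Table 4 behaves like a null configuration when the level shift dummy is omitted.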

These simulation results show that (i) the VF statistic has type 1 errors closest to the nominal level; (ii) the VF statistic has lower power, which is the price paid for more accurate type 1 error, although its power is still reasonably good; (iii) including a level shift dummy when there is no level shift in the data lowers power; (iv) failing to include a level shift dummy when there is a level shift in the data causes type 1 errors to be substantially larger than the nominal level and, depending on the magnitude and direction of the level shift, can make it difficult to detect slopes that are different; (v) positive correlation across series (η = 0.5) tends to increase power; and (vi) stronger serial correlation tends to inflate over-rejections under the null while reducing power.

APPLICATION: DATA AND METHODS


Observations and model data

Our empirical application uses data from the tropical lower troposphere and mid-troposphere (LT and MT, respectively), where we compare trends from a large suite of general circulation models (GCMs) to those observed in three monthly radiosonde records over the 1958–2012 interval (T = 660). Held and Soden (2000), Karl et al. (2006), and Thorne et al. (2011) discuss the importance of this region for assessing climate models. In response to rising greenhouse gas levels, models predict that a maximum warming trend will occur in the lower troposphere to mid-troposphere over the tropics. Karl et al. (2006, p. 11) noted that weather balloons had not detected such a trend and deemed the mismatch a "potentially serious inconsistency" with models. Douglass et al. (2007) argued that the inconsistency was real and statistically significant, while Santer et al. (2008) countered that the difference was not significant if autocorrelation was taken into account, although they only used data from 1979 to 1999 and a simple AR(1) correction. McKitrick et al. (2010) extended the data to 2009 and employed the VF approach, concluding the trend differences were significant. Fu et al. (2011) also found that climate models significantly exaggerate the increase in the warming trend with altitude throughout the tropics. Additionally, Bengtsson and Hodges (2009) and Po-Chedley and Fu (2012) found that even models constrained to match observed post-1979 sea-surface temperatures overestimated warming trends in the tropical mid-troposphere.

While the trend discrepancy is now well-established, Santer et al. (2011) emphasize the need for multidecadal comparisons to identify whether it is structural or temporary. By using the half century-length weather balloon records, we meet this concern, but we also span the 1977–1978 Pacific Climate Shift (PCS). All climate models predict a steady trend in response to rising greenhouse gases, and none predict a large one-time jump preceded and followed by decades of static temperatures. Hence, we satisfy the conditions described in Section 'INTRODUCTION', namely that the underlying phenomenon implies a trend as opposed to a jump, and our null hypothesis of trend equivalence requires controlling for a potential step-change at a potentially unknown date.

Our application mainly focuses on trend slopes and comparisons of trend slopes across series in which case we set d0t = t and therefore βi = bi. For model (2), d1t = (1, DUt) ′ with the level shift set at 1977:12, implying λ = 0.3636. We also provide some results on the level shift parameters themselves in which case d0t = DUt, βi = gi and d1t = (1, t) ′. Let inline image denote the OLS estimator of βi for a given parameter of interest for a given time series using either model (1) or model (2). If only one restriction is being tested (q = 1) of the form H0 : βi = 0, then we can write VFt as

  • display math

where

  • display math(19)

and inline image is computed using (8) or (10) using inline image from the respective models. Let cv0.025 denote the 2.5% right tail critical value of the asymptotic distribution of VFt. For model (1), cv0.025 = 6.482 (see Table 1 of Vogelsang and Franses, 2005), and for model (2), cv0.025 = 7.028 (Table 1). A 95% confidence interval (CI) is computed as inline image.
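Assuming the CI takes the usual form β̂i ± cv0.025 × se(β̂i) with the critical values quoted above, a small helper (hypothetical name) makes the computation explicit:

```python
def vf_confidence_interval(beta_hat, se_vf, level_shift=False):
    """95% CI for a single trend parameter using the 2.5% right-tail
    VF critical values quoted in the text: 6.482 for model (1) without
    a level shift, 7.028 for model (2) with a shift at a known date.
    se_vf is the VF standard error of the estimate (an assumption of
    this sketch: CI = beta_hat +/- cv * se_vf)."""
    cv = 7.028 if level_shift else 6.482
    half = cv * se_vf
    return beta_hat - half, beta_hat + half
```

Note that these critical values are far larger than the usual 1.96 because the VF approach uses a fixed-b robust variance, which fattens the tails of the reference distribution.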

The tropics are defined as 20N to 20S. The GCM runs are the same as those used in McKitrick et al. (2010) and were the ones used for the Intergovernmental Panel on Climate Change Fourth Assessment Report (IPCC, 2007). There were 57 runs from 23 models for each of the LT and MT layers. Each model uses prescribed forcing inputs up to the end of the 20th century climate experiment (Santer et al., 2005), and most models include at least one extra forcing such as volcanoes or land use. Projections forward after 2000 use the A1B emission scenario. Tables 5 and 6 report, for the LT and MT layers respectively, the climate models, the extra forcings, the number of runs in each ensemble mean, estimated trend slopes in the cases with and without level shifts, and VF standard errors. All series had a significant AR(1) coefficient, but as reported in McKitrick et al. (2010), over two-thirds also have significant higher order AR terms, which motivates the use of an HAC estimator as opposed to a simple AR(1) treatment as in Santer et al. (2008) and Fu et al. (2011).

Table 5. Summary of lower troposphere data series

                                            Simple trend         Trend + level shift
     Model/obs name                         Trend      95% CI    Trend      95% CI    Level      95% CI
     (extra forcings; no. runs)             (°C/dec)   ± width   (°C/dec)   ± width   shift      ± width
 1   BCCR BCM2.0 (O; 2)                     0.168      0.039     0.165      0.078     0.010      0.258
 2   CCCMA3.1-T47 (NA; 5)                   0.347      0.027     0.345      0.053     0.009      0.176
 3   CCCMA3.1-T63 (NA; 1)                   0.370      0.053     0.385      0.090     −0.062     0.298
 4   CNRM3.0 (O; 1)                         0.234      0.048     0.208      0.074     0.103      0.246
 5   CSIRO3.0 (1)                           0.152      0.046     0.193      0.104     −0.162     0.344
 6   CSIRO3.5 (1)                           0.251      0.062     0.309      0.087     −0.228     0.287
 7   GFDL2.0 (O, LU, SO, V; 1)              0.174      0.071     0.122      0.143     0.207      0.473
 8   GFDL2.1 (O, LU, SO, V; 1)              0.102      0.104     0.118      0.196     −0.062     0.647
 9   GISS_AOM (2)                           0.178      0.048     0.170      0.090     0.030      0.298
10   GISS_EH (O, LU, SO, V; 6)              0.202      0.079     0.240      0.118     −0.149     0.388
11   GISS_ER (O, LU, SO, V; 5)              0.180      0.083     0.208      0.138     −0.112     0.456
12   IAP_FGOALS1.0 (3)                      0.204      0.085     0.250      0.122     −0.184     0.402
13   ECHAM4 (1)                             0.218      0.091     0.249      0.150     −0.122     0.494
14   INMCM3.0 (SO, V; 1)                    0.184      0.051     0.188      0.101     −0.013     0.334
15   IPSL_CM4 (1)                           0.177      0.053     0.137      0.082     0.157      0.272
16   MIROC3.2_T106 (O, LU, SO, V; 1)        0.155      0.053     0.151      0.102     0.016      0.337
17   MIROC3.2_T42 (O, LU, SO, V; 3)         0.216      0.085     0.243      0.139     −0.105     0.459
18   MPI2.3.2a (SO, V; 5)                   0.215      0.096     0.248      0.156     −0.131     0.514
19   ECHAM5 (O; 4)                          0.206      0.034     0.203      0.066     0.013      0.219
20   CCSM3.0 (O, SO, V; 7)                  0.226      0.106     0.274      0.157     −0.192     0.517
21   PCM_B06.57 (O, SO, V; 4)               0.180      0.038     0.179      0.076     0.005      0.251
22   HADCM3 (O; 1)                          0.186      0.038     0.182      0.074     0.018      0.243
23   HADGEM1 (O, LU, SO, V; 1)              0.228      0.063     0.212      0.130     0.063      0.428
24   HadAT                                  0.135      0.054     0.070      0.135     0.258      0.444
25   RICH                                   0.126      0.056     0.090      0.134     0.140      0.441
26   RAOBCORE                               0.149      0.058     0.054      0.113     0.377      0.374

Notes: Each row refers to a model ensemble mean (rows 1–23) or observational series (rows 24–26). All models forced with 20th century greenhouse gases and direct sulfate effects. Rows 10, 11, 19, 22, and 23 also include indirect sulfate effects. "Extra forcing" indicates which models included other forcings: ozone depletion (O), solar changes (SO), land use (LU), and volcanic eruptions (V). NA: information not supplied to Program for Climate Model Diagnosis and Intercomparison (PCMDI). No. runs: number of individual realizations in the ensemble mean. Trend slopes estimated using OLS; the 95% CI is the trend ± the number shown, computed using the VF method (Section 'TESTING FOR A LEVEL SHIFT IN A UNIVARIATE TIME SERIES'). For instance, the RAOBCORE simple trend (bottom row, first entry) is 0.147 ± 0.052 °C/decade.
RAOBCORE, Radiosonde Observation Bias Correction using Reanalyses; RICH, Radiosonde Innovation Composite Homogenization.
Table 6. Summary of mid-troposphere data series

                                            Simple trend         Trend + level shift
     Model/obs name                         Trend      95% CI    Trend      95% CI    Level      95% CI
     (extra forcings; no. runs)             (°C/dec)   ± width   (°C/dec)   ± width   shift      ± width
 1   BCCR BCM2.0 (O; 2)                     0.173      0.034     0.176      0.065     −0.013     0.215
 2   CCCMA3.1-T47 (NA; 5)                   0.370      0.028     0.363      0.053     0.030      0.176
 3   CCCMA3.1-T63 (NA; 1)                   0.395      0.052     0.407      0.092     −0.048     0.303
 4   CNRM3.0 (O; 1)                         0.292      0.057     0.260      0.088     0.128      0.291
 5   CSIRO3.0 (1)                           0.121      0.046     0.162      0.104     −0.165     0.344
 6   CSIRO3.5 (1)                           0.240      0.067     0.302      0.092     −0.244     0.304
 7   GFDL2.0 (O, LU, SO, V; 1)              0.162      0.069     0.114      0.139     0.193      0.457
 8   GFDL2.1 (O, LU, SO, V; 1)              0.097      0.113     0.126      0.205     −0.116     0.675
 9   GISS_AOM (2)                           0.170      0.047     0.170      0.091     −0.001     0.301
10   GISS_EH (O, LU, SO, V; 6)              0.189      0.078     0.226      0.117     −0.144     0.385
11   GISS_ER (O, LU, SO, V; 5)              0.165      0.079     0.186      0.135     −0.084     0.447
12   IAP_FGOALS1.0 (3)                      0.191      0.081     0.232      0.119     −0.164     0.392
13   ECHAM4 (1)                             0.208      0.085     0.232      0.146     −0.096     0.480
14   INMCM3.0 (SO, V; 1)                    0.189      0.054     0.186      0.105     0.008      0.346
15   IPSL_CM4 (1)                           0.181      0.059     0.135      0.086     0.185      0.284
16   MIROC3.2_T106 (O, LU, SO, V; 1)        0.162      0.057     0.152      0.106     0.039      0.349
17   MIROC3.2_T42 (O, LU, SO, V; 3)         0.218      0.093     0.246      0.153     −0.111     0.505
18   MPI2.3.2a (SO, V; 5)                   0.191      0.086     0.212      0.149     −0.082     0.490
19   ECHAM5 (O; 4)                          0.204      0.034     0.202      0.067     0.008      0.222
20   CCSM3.0 (O, SO, V; 7)                  0.209      0.088     0.245      0.137     −0.142     0.451
21   PCM_B06.57 (O, SO, V; 4)               0.164      0.029     0.145      0.057     0.078      0.187
22   HADCM3 (O; 1)                          0.165      0.036     0.158      0.070     0.030      0.230
23   HADGEM1 (O, LU, SO, V; 1)              0.221      0.064     0.212      0.131     0.036      0.431
24   HadAT                                  0.086      0.052     0.004      0.111     0.328      0.366
25   RICH                                   0.090      0.055     0.044      0.137     0.183      0.453
26   RAOBCORE                               0.117      0.054     0.025      0.094     0.366      0.310

Notes same as for Table 5. Trend slopes estimated using OLS; the 95% CI is the trend ± the number shown, computed using the VF method (Section 'TESTING FOR A LEVEL SHIFT IN A UNIVARIATE TIME SERIES'). For instance, the RAOBCORE simple trend (bottom row, first entry) is 0.132 ± 0.053 °C/decade.
RAOBCORE, Radiosonde Observation Bias Correction using Reanalyses; RICH, Radiosonde Innovation Composite Homogenization.

We used three observational temperature series. The HadAT radiosonde series is a set of Microwave Sounding Unit (MSU)-equivalent layer averages published on the Hadley Centre web site (Thorne et al., 2005) spanning 1958 to 2012. We use the 2LT layer to represent the GCM LT-equivalent and the T2 layer to represent the GCM MT-equivalent. The other two series are denoted Radiosonde Observation Bias Correction using Reanalyses (RAOBCORE, Haimberger, 2005) and Radiosonde Innovation Composite Homogenization (RICH, Haimberger et al., 2008). Both were obtained from the Institute for Meteorology and Geophysics at the University of Vienna; however, this site does not provide the data in LT-equivalent and MT-equivalent forms, so MSU-equivalent layer averages of the tropical latitudes were computed for us by John Christy (pers. comm.) using weighting functions that match those used for the HadAT series. We did not use the RATPAC series published by the National Oceanic and Atmospheric Administration because the zonal averages are only available in quarterly or annual form. Another series, IUK-radiosonde from the University of New South Wales, only runs up to 2005. The six series are graphed in Figure 2.

Figure 2. Radiosonde series. Left column: lower troposphere (LT) series; right column: mid-troposphere (MT) series; top row: Hadley; middle row: RICH; bottom row: RAOBCORE.

The HadAT and RICH series deal with the problem of homogenizing short data segments by comparing series at suspected breakpoints to reference series formed using nearby observations to detect if shift terms are needed. The RAOBCORE series uses reference series generated by nearby weather forecasting systems (called “reanalysis data”). Production of these series is therefore an application of the shift-detection methods developed in this paper, but for our current purposes, we will take the data as given and apply it to the model-observation comparison.

The last three lines of Tables 5 and 6 report the estimated trend slopes and VF standard errors for the observed temperature series. Figure 3 displays the model-simulated data for the LT and MT layers (red dots), with a least squares trend line through the model mean, allowing for a break at 1977:12, shown in dark red. The estimated LT trends for, respectively, HadAT, RICH, and RAOBCORE are 0.135, 0.126, and 0.149 °C/decade. The MT trends are, respectively, 0.086, 0.090, and 0.117 °C/decade. The effect of allowing for a level shift (step-change) is also shown in Figure 3: the trend through the average of the three radiosonde series, allowing for a break, is shown as the blue dashed line. Using a shift date of 1977:12, the observed LT trends fall to 0.070, 0.090, and 0.054 °C/decade, and the MT trends fall to 0.004, 0.044, and 0.025 °C/decade. Thus, about half of the positive LT trend in Figure 2 can be attributed to the one-time change at 1977:12, and essentially all of the MT change is accounted for by the step-change.

Figure 3. Model-observation comparisons allowing for an intercept shift after 1977:12. Top: MT layer; red dots indicate monthly data from GCMs; red line is the trend through the average of all GCMs; blue dashed line is the trend through the average of the three radiosonde series. Model data are centered on zero mean. The trend through the observed data is shifted down to line up with the model trend in the starting month. Bottom: same for the LT layer.

The shift terms are insignificant in all model runs. In the observational series, one of three is significant at 5% in the LT layer, and in the MT layer, one is significant at 10% and one at 5%. It might seem surprising that the effect of the shift dummy is so dramatic on the balloon series trend slope parameters, yet the shift coefficients themselves are not more strongly significant. But in general, trend slopes are estimated more efficiently than level shifts (or intercepts), and therefore, it is more difficult to conduct inference about level shifts than about trend slopes. While there is sufficient noise in the data to make inference about level shifts difficult, it is nonetheless clear that the possibility of a level shift must be taken into account and the noise is not so large as to mask information about the trend slopes. Unmodeled level shifts also induce spurious noise into the model when conducting inference about trend slopes. Therefore, controlling for a possible level shift makes inferences about trend slopes more informative.

Figure 4 plots all the estimated trend slopes along with their 95% CIs. The left column leaves out the level shift, and the right column includes it. The model-generated trends are grouped in each panel on the left in red. The trends are ranked from smallest to largest, and the numbers above each marker refer to the GCM number (see Tables 5 and 6 for names). The three blue trends on the right edge are, respectively, the Hadley, RICH, and RAOBCORE series. With or without the shift term, the range of model runs and their associated CIs overlap with those of the observations. In that sense, we could say there is a visual consistency between the models and observations. However, that is too weak a test for the present purpose, because the range of model runs can be made arbitrarily wide through the choice of parameters and internal dynamical schemes. Even if the reasonable range of parameters or schemes is taken to be constrained on empirical or physical grounds, the spread of trends in Figure 4 (spanning roughly 0.1 to 0.4 °C/decade in each layer) indicates that the range is still sufficiently wide as to be effectively unfalsifiable. Also, if we base the comparison on the range of model runs rather than some measure of central tendency, it is impossible to draw any conclusions about the models as a group or as an implied physical theory. Using a range comparison, the fact that models 8, 5, and 16 in Figure 4 are reasonably close to the observational series does not provide any support for models 2, 3, and 4, which are far away. We want to pose the trend comparison in a form that tells us something about the behavior of the models as a group, or as a methodological genre, and this requires the multivariate testing framework.

Figure 4. The 1958–2012 trends and 95% CIs for 23 models (numbered, left to right) and three radiosonde series (Hadley, RICH, and RAOBCORE, respectively). Top row: MT; bottom row: LT; left column: trends computed without allowing for a level shift; right column: level shift term included in the trend model.

Multivariate trend comparisons: no shift and known shift date cases

For each layer, we now treat the 23 climate model-generated series and the three observational series as an n = 26 panel of temperature series. We estimate models (1) and (2) using the methods described in Section 'TREATING THE SHIFT DATE AS UNKNOWN'. The parameters of interest are the trend slopes. We are interested in testing the null hypothesis that the weighted average of the trend slopes in the 23 climate model-generated series is the same as the average trend slope of the observed series. The weight coefficient wi equals the number of runs in model i's ensemble mean, to adjust for the reduction in variance in multi-run ensemble means. Placing the observed series in positions i = 24, 25, 26, the restriction matrices for this null hypothesis are

  • display math

where the wi terms sum to 57.
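The restriction vector can be assembled directly: weights wi/57 on the 23 model trend slopes and −1/3 on each of the three observed slopes, so the restriction equates the weighted model average with the observational average. The function name is hypothetical, and the run counts used in the usage example below are placeholders summing to 57 (the actual wi are the "no. runs" entries in Tables 5 and 6).

```python
import numpy as np

def trend_comparison_restriction(w):
    """Build the 1 x 26 restriction vector R for
    H0: sum_i (w_i / 57) b_i - (1/3)(b_24 + b_25 + b_26) = 0,
    where w has 23 entries (run counts) summing to 57 and the three
    observed series occupy positions 24-26."""
    w = np.asarray(w, dtype=float)
    assert len(w) == 23 and w.sum() == 57
    R = np.concatenate([w / w.sum(), -np.ones(3) / 3])
    return R.reshape(1, 26)
```

By construction the entries of R sum to zero, so the null holds trivially when all 26 trend slopes are equal.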

Table 7 presents the VF statistics for the test of trend equivalence between the climate models and observed data. Also reported are the VF statistics for testing the significance of the individual trends of the observed temperature series, the magnitudes of which (°C/decade) are indicated in parentheses beside the series names. Asymptotic critical values are provided in the table captions, and significance is indicated as described in the table. We also compute bootstrap p-values for the tests using the method outlined in Section 'Bootstrap critical values and p-values' with 10,000 bootstrap replications.
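The paper's bootstrap scheme is defined in its Section 'Bootstrap critical values and p-values', which is not reproduced in this excerpt. As a generic stand-in for the idea, a residual block bootstrap computes a p-value as the fraction of statistics, generated with the null imposed and serial correlation preserved via block resampling, that are at least as large as the observed statistic; all names here are hypothetical.

```python
import numpy as np

def bootstrap_pvalue(stat_fn, y, fit_null_fn, n_boot=1000, block=24, seed=0):
    """Generic residual block-bootstrap p-value.
    fit_null_fn(y) returns (fitted, resid) with the null imposed;
    residual blocks are resampled to preserve serial correlation,
    and the p-value is the exceedance fraction (with the usual +1
    adjustment so it is never exactly zero)."""
    rng = np.random.default_rng(seed)
    fitted, resid = fit_null_fn(y)
    T = len(y)
    observed = stat_fn(y)
    n_blocks = int(np.ceil(T / block))
    exceed = 0
    for _ in range(n_boot):
        starts = rng.integers(0, T - block + 1, size=n_blocks)
        u_star = np.concatenate([resid[s:s + block] for s in starts])[:T]
        if stat_fn(fitted + u_star) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_boot)
```

With 10,000 replications, as in the table, the smallest attainable p-value under this convention is about 0.0001.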

Table 7. Results of hypothesis tests using VF statistic

Test                        Null hypothesis            Test score    Bootstrap p-value

Trend in — no level shift
Hadley LT (0.135)           Trend = 0                  260.4***      <0.001
RICH LT (0.126)             Trend = 0                  212.0***      <0.001
RAOBCORE LT (0.149)         Trend = 0                  281.4***      <0.001
Hadley MT (0.086)           Trend = 0                  117.0***      <0.004
RICH MT (0.090)             Trend = 0                  111.4***      <0.005
RAOBCORE MT (0.117)         Trend = 0                  197.4***      <0.001
LT average                  Models = observed          97.2***       <0.007
MT average                  Models = observed          167.0***      <0.002

Trend in — with level shift at date (assumed known): December 1977
Hadley LT (0.064)           Trend = 0                  13.2          0.273
RICH LT (0.093)             Trend = 0                  22.6          0.160
RAOBCORE LT (0.065)         Trend = 0                  11.2          0.311
Hadley MT (−0.001)          Trend = 0                  0.1           0.925
RICH MT (0.048)             Trend = 0                  5.1           0.485
RAOBCORE MT (0.042)         Trend = 0                  3.4           0.563
LT average                  Models = observed          354.5***      <0.001
MT average                  Models = observed          685.9***      <0.001
Shift term in
LT average                  Avg obs shift term = 0     19.4          0.189
MT average                  Avg obs shift term = 0     30.8          0.107

Trend in — with level shift at date assumed unknown
LT average                  Models = observed          495.6***      <0.001
MT average                  Models = observed          937.7***      <0.001
Shift term in
LT average                  Avg obs shift term = 0     180.4*        0.061 (a)
MT average                  Avg obs shift term = 0     259.8**       <0.025

Notes: Sample period (monthly): January 1958 to December 2012. The bootstrap p-value is computed using the method described in Section 'TREATING THE SHIFT DATE AS UNKNOWN' using 10,000 bootstrap replications. VF critical values: without level shift, 27.14 (10%, denoted *), 41.53 (5%, denoted **), 83.96 (1%, denoted ***); with level shift at known date, 32.35 (10%), 49.40 (5%), 99.98 (1%); with level shift at unknown date, 150.99 (10%), 188.68 (5%), 279.85 (1%).
LT, lower troposphere; MT, mid-troposphere; RAOBCORE, Radiosonde Observation Bias Correction using Reanalyses; RICH, Radiosonde Innovation Composite Homogenization.
(a) Interpolated using critical values in Table 2.

In the trend model without level shifts (top panel of Table 7), the zero-trend hypothesis is rejected at the 1% significance level for all six observed series, apparently indicating strong evidence of a significant warming trend in the tropical troposphere over the 1958–2012 interval. A test that the climate models, on average, predict the same trend as the observational data sets is rejected at 1% significance in both the LT and MT layers. Table 8 repeats the model–observation trend equivalence test for each of the 23 models individually. The differences are significant at 5% or lower in eight cases in the LT layer and in 20 cases in the MT layer. (Not reported are the single-model tests of trend significance, although these can be inferred from Tables 5 and 6 and Figure 4: 22 of 23 models have significant trends (at 5%) in both layers without allowing for a break, and 21 of 23 have significant trends in both layers allowing for a break.) So although the model trends are, on average, significantly different from observations, it can at least be said that if we ignore the step change at 1977:12, almost two-thirds of the models in the LT layer have trends that individually do not differ significantly from the observations.

Table 8. VF tests of equivalent trends between individual model (ensemble mean in cases of multiple runs) and average of balloon series

| Model | LT: no shift | LT: known shift date | LT: unknown shift date | MT: no shift | MT: known shift date | MT: unknown shift date |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 22.53 | 46.39* | 46.57 | 106.15*** | 128.49** | 133.33* |
| 2 | 959.34*** | 280.48*** | 757.12*** | 1710.35*** | 505.84*** | 1281.14*** |
| 3 | 751.09*** | 932.99*** | 980.16*** | 996.59*** | 1561.96*** | 1588.64*** |
| 4 | 78.72** | 33.14 | 147.06* | 317.46*** | 101.12** | 364.58*** |
| 5 | 1.52 | 17.02 | 18.29 | 3.48 | 24.86 | 25.05 |
| 6 | 93.92*** | 154.70** | 165.27* | 118.13*** | 251.16** | 273.53*** |
| 7 | 22.90 | 11.55 | 26.39 | 62.86** | 34.93* | 59.23 |
| 8 | 6.87 | 5.93 | 55.68 | 0.01 | 24.57 | 43.87 |
| 9 | 11.25 | 13.73 | 39.54 | 41.97** | 36.21* | 82.78 |
| 10 | 31.94* | 397.95*** | 479.74*** | 49.94** | 630.06*** | 712.14*** |
| 11 | 14.17 | 194.10*** | 218.06** | 29.57* | 293.10*** | 312.31*** |
| 12 | 29.78* | 413.78*** | 508.79*** | 50.09** | 639.50*** | 702.25*** |
| 13 | 46.21** | 521.81*** | 619.35*** | 73.86** | 648.30*** | 879.13*** |
| 14 | 12.52 | 15.12 | 45.85 | 49.54** | 32.55* | 64.91 |
| 15 | 16.13 | 10.10 | 58.05 | 67.35** | 28.02 | 117.22 |
| 16 | 3.30 | 10.56 | 47.83 | 44.37** | 30.33 | 93.56 |
| 17 | 40.62* | 415.17*** | 426.18*** | 63.33** | 438.44*** | 445.07*** |
| 18 | 32.22* | 232.33*** | 283.33*** | 47.05** | 261.93*** | 323.86*** |
| 19 | 83.54** | 55.32** | 91.36 | 197.34*** | 132.21** | 149.30* |
| 20 | 32.85* | 330.88*** | 395.11*** | 59.54** | 561.83*** | 636.38*** |
| 21 | 44.83** | 74.18** | 92.17 | 155.54*** | 101.48** | 126.00 |
| 22 | 38.97* | 36.35* | 52.30 | 70.26** | 56.90** | 93.34 |
| 23 | 156.67*** | 126.96*** | 172.77** | 222.24*** | 329.72*** | 387.06*** |
| ≤5% | 8/23 | 13/23 | 10/23 | 20/23 | 16/23 | 12/23 |

Notes: Columns 2–4: LT layer with no shift, with shift assumed known at 1977:12, and with shift date assumed unknown; columns 5–7: the same for the MT layer. VF critical values: without level shift, 27.14 (10%, denoted *), 41.53 (5%, denoted **), 83.96 (1%, denoted ***); with level shift known to be at 1977:12, 33.32 (10%), 51.20 (5%), 98.46 (1%); with level shift at unknown date, 131.92 (10%), 166.41 (5%), 261.39 (1%). Last row: fraction of models exhibiting a difference from observations significant at the 5% level or lower.

When we add the level shift dummy at 1977:12 (middle panel of Table 7), the trend magnitudes and the VF statistics for testing the zero-trend hypothesis drop considerably, while the critical values for VF are larger than in the case without the mean-shift dummy. Now none of the observed series has a significant trend. Hence, when the level shift is left out, the increase in the series is spuriously attributed to a trend slope, whereas it is better explained by a jump in the data around 1977. The average shift term is not significant in either layer (Table 7), although, as mentioned previously, in three of the six individual balloon series, the shift term is significant at the 10% or 5% level.
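The mechanism by which a step change masquerades as a trend can be illustrated with a small simulation. This is a sketch on synthetic data using plain OLS, not the paper's data or its VF inference: a series with no true trend, only a hypothetical 0.3-degree step at observation 240 (1977:12 in a January-1958-start monthly sample) plus AR(1) noise:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 660                               # Jan 1958 - Dec 2012, monthly
t = np.arange(1, T + 1)
shift = (t > 240).astype(float)       # level shift after 1977:12 (obs 240)

# Synthetic series with NO true trend: a 0.3-degree step plus AR(1) noise.
e = np.zeros(T)
eps = rng.normal(scale=0.1, size=T)
for i in range(1, T):
    e[i] = 0.7 * e[i - 1] + eps[i]
y = 0.3 * shift + e

# Fit the trend model without and with the level shift dummy.
X1 = np.column_stack([np.ones(T), t])
X2 = np.column_stack([np.ones(T), t, shift])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)

# Leaving the step out inflates the estimated slope; once the step is
# modeled, the slope shrinks toward zero.
print(abs(b2[1]) < abs(b1[1]))
```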

The VF test of equivalence of average trends between the climate models and observed data is rejected more strongly when the level shift dummy is included; the bootstrap p-values drop to essentially zero in this case. This finding is not surprising: as Tables 5 and 6 show, the estimated trend slopes of the observed series decrease when the level shift dummy is included, whereas the estimated trend slopes of the climate model series are not systematically affected by it.5 The discrepancy between the climate model trends and the observed trends therefore widens. Table 8 shows model-by-model comparisons. When a shift with known date is included, the number of 5% rejections in the LT layer rises from 8 to 13 out of 23, but in the MT layer, it drops from 20 to 16. The latter change can be attributed in part to the increase in the critical values, emphasizing the importance of accounting for the dependence of the test distribution on the introduction of the shift term.

Multivariate trend comparisons: unknown shift date case

As a robustness check on the assumption that the shift date of the PCS is known, we report results in which we treat the shift date as unknown and use the supVF statistic. The supVF statistic is calculated by computing the VF score with the shift date Tb set sequentially at each point in the middle 80% of the sample and then selecting the maximum value.
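The grid-search logic can be sketched as follows. The `sup_stat` and `naive_break_stat` names are ours, and the squared mean-difference statistic is only a stand-in: computing the actual VF score requires the multivariate HAC machinery of the earlier sections.

```python
import numpy as np

def sup_stat(y, stat_fn, trim=0.10):
    """Grid-search a break-dependent statistic over the middle 80% of the
    sample; return the maximizing break date and the sup value."""
    T = len(y)
    lo, hi = int(np.floor(trim * T)), int(np.ceil((1 - trim) * T))
    scores = {tb: stat_fn(y, tb) for tb in range(lo, hi)}
    tb_hat = max(scores, key=scores.get)
    return tb_hat, scores[tb_hat]

def naive_break_stat(y, tb):
    # Illustrative stand-in: squared difference in means across the
    # candidate break date tb. NOT the VF statistic.
    return (y[tb:].mean() - y[:tb].mean()) ** 2

# A clean step at observation 240 is recovered exactly by the search.
y = np.concatenate([np.zeros(240), np.ones(420)])
tb, s = sup_stat(y, naive_break_stat)
print(tb, round(s, 6))  # → 240 1.0
```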

For the tests of model–observational equivalence allowing for a level shift, Figure 5 shows the sequence of VF scores along with the 10%, 5%, and 1% critical values. The supVF occurs at 1979:6 in the LT layer and at 1979:5 in the MT layer. These shift dates are close to, although not the same as, 1977:12, but all of the VF scores for dates near 1977:12 far exceed the 1% critical values for the supVF statistic. Because the supVF test is conservative regarding the choice of shift date, this provides strong additional support for the results in Section 'Multivariate trend comparisons: no shift and known shift date cases'.

Figure 5. Grid search of VF scores of model–observational equivalence, varying the break point across the sample. Top: MT; bottom: LT. Horizontal lines show the 10%, 5%, and 1% critical values (131.92, 166.41, and 261.39, respectively). The maximum LT value (495.6) occurs at observation 258 (1979:6) and the maximum MT value (937.7) at observation 257 (1979:5).

The supVF scores of tests of model–observational equivalence are reported in Table 7 for model averages and in Table 8 for individual models. A pattern emerges particularly clearly in Table 7: when we apply the data-mining approach, the test scores get larger, as expected, but so do the critical values. The net effect is to reduce the significance of the test scores when the shift date is treated as unknown. Had we not used the critical values given by (15), we might have spuriously inflated the significance of our findings had the level shifts not been present in the data. This is a useful lesson in the perils of naïve data-mining, in which a specification is selected to maximize the chance of rejecting some null hypothesis without taking into account the effect of the data-mining process on the null distribution of the test. The price one pays for being honest and using the conservative critical values implied by (15) is lower power to detect a deviation from the null hypothesis. Even with lower power, we see in the third panel of Table 7 that the average model is still clearly rejected against the data in both the LT and MT layers. Because this result does not depend on choosing a particular shift date, it provides strong confirmation of the empirical finding obtained assuming a known shift date.

Regarding the trend magnitudes, we do not report the supVF score for the test of a zero trend in the third panel of Table 7. Recall that the search process looks for the location that maximizes the chance of rejecting the null hypothesis. In this case, the significance of the trend would be maximized simply by leaving the shift term out altogether, which corresponds to the test scores in the first panel of Table 7.

The supVF scores for the test of whether the average shift term is zero are shown in the bottom two rows of Table 7. Compared with the known-date case, the VF scores are larger, as expected, and the increase exceeds that in the critical values, yielding marginal significance in the LT layer and significance in the MT layer. Note that even if we were to choose a specification with no level shifts in the observed temperature series, we would still reject model-observational equivalence, as shown in the first block of Table 7. The supVF scores provide additional rationale for the importance of controlling for a possible break at an unknown date.

CONCLUSIONS


Heteroskedasticity and autocorrelation robust (HAC) covariance matrix estimators have been adapted to the linear trend model, permitting robust inferences about trend significance and trend comparisons in data sets with complex and unknown autocorrelation characteristics. Here, we extend the multivariate HAC approach of Vogelsang and Franses (2005) to allow more general deterministic regressors in the model. We show that the asymptotic (approximating) critical values of the test statistics of Vogelsang and Franses (2005) are nonstandard and depend on the specific deterministic regressors included in the model. These critical values can be simulated directly. Alternatively, a simple bootstrap method is available for obtaining valid critical values and p-values.

The empirical focus of the paper is a comparison of trends in climate model-generated temperature data and corresponding observed temperature data in the tropical troposphere. Our empirical innovation is to make the trend model robust to the possibility of a level shift in the observed data corresponding to the PCS that occurred around 1978. With respect to the Vogelsang and Franses (2005) approach, this amounts to adding a level shift dummy to the model that requires a new set of critical values that we provide.

As our empirical findings show, the detection of a trend in the tropical lower troposphere and mid-troposphere data over the 1958–2012 interval is contingent on the decision of whether or not to control for a level shift coinciding with the PCS. If the term is included, a time trend regression with autocorrelation-robust error terms indicates that the trend is small and not statistically different from zero in either the LT or MT layers. Also, most climate models predict a significantly larger trend over this interval than is observed in either layer. We find a statistically significant discrepancy between the average climate model trend and observational trends whether or not the mean-shift term is included. However, with the shift term included, the null hypothesis of trend equivalence is rejected much more strongly (at much smaller significance levels).

Regarding the question of preferred specification (that is, whether to include a shift or not), where the researcher suspects a break has occurred, results ought to be robust to controlling for the possibility. In the multivariate tests, when we fix the break at 1977:12, the shift terms are not significant in either layer, but when we use the grid-search method, the shift is significant at 10% in the LT layer and at 5% in the MT layer. Because breaks are harder to identify than trends, these findings indicate the importance of controlling for the possibility that one is present.

The testing method used herein is both powerful and relatively robust to over-rejections under the null hypothesis caused by strong serial correlation. The power of the test is indicated by the span of test scores in Table 8, in which relatively small changes in modeled trends translate into smaller p-values. Using the data-mining method provides a check on the extent to which the results depend on the assumption of a known shift date.

As such, our empirical approach has many other potential applications on climatic and other data sets in which level shifts are believed to have occurred. Examples could include stratospheric temperature trends that are subject to level shifts coinciding with major volcanic eruptions and land surface trends where it is believed that the measuring equipment has changed or was moved. Generalizing the approach to allow more than one unknown break point is left for subsequent work.

Acknowledgements


We thank numerous seminar and conference participants, as well as the referees, for their helpful comments. R. M. thanks the Institute for New Economic Thinking and The Centre for International Governance Innovation for financial support.

5. The climate models do not explicitly model the Pacific Climate Shift, and so the level shift coefficient has no special meaning for the climate model data. Not surprisingly, the estimated level shift coefficients were positive in 11 of the 23 climate model series and negative in 12.

REFERENCES


Supporting Information

| Filename | Format | Size | Description |
| --- | --- | --- | --- |
| env2294-sup-0001-SUPP0001.docx | Word 2007 document | 104K | Supporting info item |

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.