Four methods for estimating potential seasonal predictability from a single time series are compared. The methods are: an analysis of variance procedure proposed by Shukla and Gutzler (SG), a spectral method proposed by Madden (MN), a bootstrap method proposed by the authors, and an analysis of covariance (ANOCOVA) method proposed by the authors. The time series used for comparison are taken from Monte Carlo simulations, an atmospheric general circulation model (AGCM), and reanalysis data. The comparison clearly reveals that SG systematically underestimates weather noise variance more strongly than the other methods and is therefore not a generally useful method. MN produces the least biased estimates of weather noise variance, but it tends to have a higher probability of identifying insignificant predictability than the other methods. Unfortunately, no simple, universally correct statements can be made regarding the relative performances of MN, ANOCOVA, and bootstrap based on the AGCM output. Overall, the reanalysis-based estimates of potential predictability of seasonal mean temperature derived from these methods are generally in accord with previous estimates, both in spatial structure and in magnitude. Omitting SG, the other three methods consistently identify about 80% of the globe as significantly predictable, and about 5% of the globe as insignificantly predictable. The remaining 15% of the globe, mostly over extratropical land, yields inconsistent assessments of potential predictability, indicating sensitivity to the assumptions underlying each of the methods. Interestingly, winter mean temperature over most of North America is found to be insignificantly predictable by all three methods.
1 Introduction
 The concept of potential predictability is based on the idealization that interannual variability of seasonal means can be partitioned into two independent components: weather noise variability that is inherently unpredictable beyond a few days (or a few weeks for certain large‒scale structures) and a slowly varying component that is potentially predictable [Madden, 1976]. The extent to which seasonal variability exceeds weather noise variability determines the degree of potential predictability. The source of potential predictability often is identified with slowly varying components of the climate system, such as sea surface temperature, sea ice, or soil moisture, and, as such, is relatively constant within a season but can change dramatically from year to year. In principle, potential predictability also can arise from internal atmospheric dynamics [Frederiksen and Zheng, 2007]. Strictly speaking, externally forced changes in greenhouse gases, aerosols, and solar insolation also contribute to potential predictability. The word “potential” is used because the persistent components themselves may not be predictable on seasonal timescales.
 There are two fundamentally different approaches to quantifying potential predictability. The first is based on generating an ensemble of climate realizations using a dynamical model. For instance, all members of an ensemble could be forced by the same sea surface temperature (SST) but initialized at slightly different atmospheric states [Rowell et al., 1995; Kumar and Hoerling, 1995; Zwiers, 1996]. The spread of ensemble members for a given SST measures the weather noise variability, while the variation of the ensemble mean due to varying SSTs measures the boundary‒forced interannual variability. This approach has the advantage of having scope to detect weak signals according to Rowell et al.  but has the disadvantage of relying on models that are always imperfect.
 The second approach is to estimate potential predictability from statistical models fitted to a single realization of observation‒based data. This approach avoids problems due to inadequate dynamical and physical representations but requires long, homogeneous time series and makes statistical assumptions that may be violated. Just as dynamical models differ in their assumptions regarding the behavior of physical processes, statistical methods differ in their assumptions regarding the probabilistic structure of the underlying stochastic process. Madden  proposed a frequency domain approach to estimating potential predictability in which weather noise variance is estimated from power spectra derived from 96 day time series [see also Shukla, 1983; Zwiers, 1987]. Shukla and Gutzler  proposed an analysis of variance method modified to account for autocorrelated time series [see also DelSole and Feng, 2013]. Jones et al.  proposed a time domain approach in which time series are fitted to a mixed model with errors having an autoregressive moving‒average structure. Zheng  proposed this model for estimating autocorrelations, and Feng et al.  proposed a simplified version of this model, called the analysis of covariance (ANOCOVA) model, for testing hypotheses about potential predictability. Zheng et al.  proposed an analysis of variance method that can be applied to monthly mean time series. Finally, Feng et al.  proposed a bootstrap technique that makes less restrictive assumptions about the underlying stochastic process.
 A comparison of the above predictability estimates has not been made using the same data set spanning the same modern time period over the entire globe. The purpose of this study is to compare estimates of potential seasonal predictability derived from observation‒based daily time series. The methods compared are denoted as Shukla‒Gutzler (SG), Madden (MN), ANOCOVA, and bootstrap and are described in more detail in section 2. The method of Zheng et al.  is not included since it is based on monthly means, but we will show that many of our results are consistent with the results shown in Zheng et al. , which is impressive given that it does not process daily information. Furthermore, we consider only the autoregressive approach of Feng et al. , which appears to be adequate in most cases. The estimates of potential predictability produced by these methods are of fundamental interest, and their differences highlight the sensitivity of these estimates to the choice of method. Application of these methods to synthetic data is discussed in section 3, application to AGCM output in section 4, and application to observation‒based data in section 5. The concluding section provides a summary and discussion of results.
2 Statistical Methods
2.1 Traditional Analysis of Variance
 Most studies that attempt to estimate potential predictability from a single realization of daily time series assume that the true state on the dth day in the yth year can be modeled as
$$X_{d,y} = \mu_y + \varepsilon_{d,y}, \tag{1}$$
where εd,y is a stationary, stochastic process representing weather noise, and μy is the change in population mean due to a slowly varying component of the climate system [Madden, 1976; Zwiers, 1987; Zheng et al., 2000]. The variable μy is assumed to be uncorrelated in y and to have variance $\sigma_S^2$ (where “S” stands for “signal”). In the context of seasonal predictability, the slowly varying component usually is identified with the state of a persistent component in a particular season; in which case, different states can be distinguished by different years y. This identification assumes that weather noise varies on timescales much shorter than a season. The weather noise εd,y is assumed to be independent of μy and independent in different years; that is, εd,y and εd′,y′ are independent for all y≠y′ (regardless of d). This implies that εd,y has a mean, variance, and autocorrelation that are independent of y. The variance of εd,y is denoted $\sigma_N^2$, while the mean of εd,y is irrelevant because it only affects the climatological mean, which does not affect predictability. The index d takes on values d=1,…,D, where D is the number of days in a season, and the index y takes on values y=1,…,Y, where Y denotes the total number of years.
 Potential predictability is assessed by testing the null hypothesis that μy is independent of year; that is, by testing the hypothesis
$$H_0:\ \mu_1 = \mu_2 = \cdots = \mu_Y. \tag{2}$$
A standard technique for testing this hypothesis is analysis of variance (ANOVA) [Scheffe, 1959]. However, ANOVA assumes that εd,y is independently and identically distributed as a normal distribution; in particular, it assumes weather noise is serially uncorrelated. This assumption is not realistic but will be assumed temporarily for the purpose of explaining the technique. In this case of uncorrelated errors, a test of the hypothesis (2) is based on the statistic
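$$F = \frac{D\,\hat{\sigma}_T^2}{\hat{\sigma}_N^2}, \tag{3}$$
where $\hat{\sigma}_T^2$ is an unbiased estimator for the total variance of seasonal means
$$\hat{\sigma}_T^2 = \frac{1}{Y-1}\sum_{y=1}^{Y}\left(\bar{X}_y - \bar{X}\right)^2, \tag{4}$$
where $\bar{X}_y$ and $\bar{X}$ are the seasonal means and grand mean, respectively,
$$\bar{X}_y = \frac{1}{D}\sum_{d=1}^{D} X_{d,y}, \qquad \bar{X} = \frac{1}{Y}\sum_{y=1}^{Y}\bar{X}_y, \tag{5}$$
and $\hat{\sigma}_N^2$ is an unbiased estimator for the intraseasonal variance
$$\hat{\sigma}_N^2 = \frac{1}{Y(D-1)}\sum_{y=1}^{Y}\sum_{d=1}^{D}\left(X_{d,y} - \bar{X}_y\right)^2. \tag{6}$$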
The degrees of freedom for (4) and (6) for uncorrelated weather noise are d.f.T=Y−1 and d.f.N=Y(D−1), respectively. If the null hypothesis is true, then the statistic F has an F‒distribution with degrees of freedom d.f.T and d.f.N.
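The white-noise ANOVA test described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `anova_f_statistic` and the array layout (years by days) are assumptions made here, and the statistic is taken to be D times the variance of seasonal means divided by the intraseasonal variance.

```python
import numpy as np

def anova_f_statistic(x):
    """One-way ANOVA statistic for daily data x with shape (Y, D).

    Assumes serially uncorrelated weather noise, as in the text.
    """
    x = np.asarray(x, dtype=float)
    n_years, n_days = x.shape
    seasonal_means = x.mean(axis=1)
    # Unbiased variance of seasonal means, d.f. = Y - 1.
    var_total = seasonal_means.var(ddof=1)
    # Unbiased intraseasonal variance, d.f. = Y (D - 1).
    resid = x - seasonal_means[:, None]
    var_noise = (resid ** 2).sum() / (n_years * (n_days - 1))
    f_stat = n_days * var_total / var_noise
    return f_stat, var_total, var_noise, (n_years - 1, n_years * (n_days - 1))

# Toy input: two "years" of three "days" each.
toy = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
f_stat, var_total, var_noise, dof = anova_f_statistic(toy)
print(f_stat, var_total, var_noise, dof)
```

The returned degrees of freedom are those needed to look up the F‒distribution; a significance test would compare `f_stat` with the corresponding upper percentile.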
 Unfortunately, the above approach based on traditional analysis of variance is inappropriate for autocorrelated time series, such as daily temperature. Several approaches have been proposed for accounting for autocorrelated processes, as discussed next.
2.2 Shukla‒Gutzler (SG)
Shukla and Gutzler  proposed a test for potential seasonal predictability based on analysis of variance, but with certain modifications that depend on the autocorrelation function of the time series. DelSole and Feng  pointed out that the proposed modifications lead to biased estimates of variance and proposed the following alternative methodology that avoids this bias. First, the statistic for testing equality of variance is modified to be
where T0 is a timescale defined as
$$T_0 = 1 + 2\sum_{\tau=1}^{D-1}\left(1 - \frac{\tau}{D}\right)\rho_\tau, \tag{8}$$
and where ρτ is the autocorrelation function at time lag τ. Second, the degrees of freedom in (6) is modified to be
A critical step is the estimation of T0. Numerous authors have pointed out that estimates of T0 based on residuals from the seasonal mean are biased toward small values [Trenberth, 1984a; Zwiers and von Storch, 1995; Zheng, 1996; DelSole and Feng, 2013]. Thiébaux and Zwiers  systematically investigated a variety of methods for estimating T0, including parametric methods based on autoregressive models, and concluded that it is “quite difficult to estimate reliably.” In this study, we estimate T0 in a typical manner, namely by first removing seasonal means from each season and then computing the time lagged covariance, and then substituting the resulting sample autocorrelation function in (8) to estimate T0. Although this estimate is expected to underestimate the population value of T0, it is selected anyway due to lack of a better alternative and to gauge the seriousness of the bias. The sum of (8) is truncated at 15 days, but our results are not sensitive to this truncation.
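The estimation of T0 can be sketched as follows. The code assumes the commonly used tapered form T0 = 1 + 2 Σ (1 − τ/D) ρτ; the function names and the pooling of lagged products over years are choices made here for illustration. For a first‒order autoregressive process with ρτ = φ^τ and D = 90, this form gives values just under 3 for φ = 0.5 and just over 17 for φ = 0.9, which matches the timescales quoted in section 3.

```python
import numpy as np

def timescale_t0(rho, n_days):
    """T0 = 1 + 2 * sum_{tau>=1} (1 - tau/D) * rho_tau (assumed form).

    rho[k] is the autocorrelation at lag k + 1; lags >= D get zero weight.
    """
    rho = np.asarray(rho, dtype=float)
    lags = np.arange(1, rho.size + 1)
    weights = np.clip(1.0 - lags / n_days, 0.0, None)
    return 1.0 + 2.0 * np.sum(weights * rho)

def residual_acf(x, max_lag=15):
    """Sample autocorrelation of residuals about each seasonal mean.

    x has shape (Y, D); lagged products are pooled over all years.
    """
    x = np.asarray(x, dtype=float)
    resid = x - x.mean(axis=1, keepdims=True)
    n_days = x.shape[1]
    c0 = np.mean(resid * resid)
    return np.array([
        np.sum(resid[:, lag:] * resid[:, :n_days - lag]) / (resid.size * c0)
        for lag in range(1, max_lag + 1)
    ])

D = 90
lags = np.arange(1, D)
t0_white = timescale_t0(np.zeros(D - 1), D)
t0_moderate = timescale_t0(0.5 ** lags, D)
t0_persistent = timescale_t0(0.9 ** lags, D)
print(t0_white, t0_moderate, t0_persistent)
```

In practice one would pass `residual_acf(x, max_lag=15)` to `timescale_t0`, which corresponds to the 15 day truncation mentioned above and inherits the low bias discussed in the text.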
 The estimated weather noise variance implied by this method is
Similarly, the fraction of predictable variance (FPV) is estimated as
2.3 Madden (MN)
Madden  proposed estimating interannual variability due to weather noise using spectral methods [see also Zwiers, 1987]. The starting point of this method is the fact that the variance of a D‒day average of a stochastic process with spectral density function S(f) is
$$\sigma_{\bar{X}}^2 = \int_{-1/2}^{1/2} H(f)\,S(f)\,df, \tag{12}$$
where f is the frequency (day−1), and H(f) is the power transfer function [Zwiers, 1987]
$$H(f) = \frac{\sin^2(\pi f D)}{D^2 \sin^2(\pi f)}. \tag{13}$$
In the case in which the sample size and the time averaging window are equal, H(f) vanishes for all multiples of 1/D, except 0 where H(0)=1. Also, Madden approximated the spectral estimate at zero frequency by that at 1/D. This approximation is called the Low‒Frequency White Noise extension. Therefore, we need only estimate the power at f=1/D:
which is asymptotically distributed as a chi‒squared distribution with 2Y degrees of freedom. Substituting this estimate in the first term of (14) and simplifying gives
This estimate is asymptotically independent of . Thus, a natural statistic for testing potential predictability is
Under the null hypothesis of no potential predictability, the estimates are unbiased estimates of the variance of seasonal means, and the statistic (17) has an approximate F‒distribution with Y−1 and 2Y degrees of freedom. The fraction of predictable variance is defined to be
Shukla  questioned whether the spectra at frequency 1/D can be attributed entirely to the effects of weather noise. If the spectrum at that frequency contains some variability due to boundary forcing, then the weather noise variance would be overestimated, resulting in underestimated predictability. Madden  and Thiébaux and Zwiers  also pointed out, in effect, that estimates of the power spectrum at low frequencies are biased in a way that depends on the true power spectrum. Specifically, if the true spectrum is characterized by a relative maximum at the origin, such as that of a first‒order autoregressive process, then the low‒frequency white noise extension tends to underestimate weather noise variance.
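A rough sketch of the spectral estimate may help fix ideas. It is not Madden's original implementation: the periodogram normalization, the function names, and the averaging over years are assumptions made here. The periodogram is scaled so that white noise with variance s2 has expected power s2, in which case the variance of a D‒day mean is the zero‒frequency power divided by D, and the low‒frequency white noise extension replaces that power with the power at f = 1/D.

```python
import numpy as np

def madden_noise_variance(x):
    """Weather-noise variance of seasonal means from power at f = 1/D.

    x has shape (Y, D). Assumed normalization: periodogram
    I(f) = |sum_d x_d exp(-2 pi i f d)|^2 / D, so that white noise with
    variance s2 has E[I] = s2 and a D-day mean has variance s2 / D.
    """
    x = np.asarray(x, dtype=float)
    n_years, n_days = x.shape
    days = np.arange(n_days)
    phase = np.exp(-2j * np.pi * days / n_days)
    power = np.abs(x @ phase) ** 2 / n_days   # one value per year
    # Low-frequency white noise extension: S(0) is replaced by S(1/D).
    return power.mean() / n_days

def madden_f_statistic(x):
    """Ratio of seasonal-mean variance to the spectral noise estimate."""
    x = np.asarray(x, dtype=float)
    var_total = x.mean(axis=1).var(ddof=1)
    return var_total / madden_noise_variance(x)

# Deterministic check: a pure wave at frequency 1/D in every year.
D = 90
wave = np.cos(2 * np.pi * np.arange(D) / D)
noise_var = madden_noise_variance(np.tile(wave, (5, 1)))
offsets = np.arange(5.0)[:, None]             # different seasonal means
f_stat = madden_f_statistic(wave + offsets)
print(noise_var, f_stat)
```

Because only one frequency per year enters the estimate, the noise variance has about 2Y degrees of freedom, which is the source of the relatively large sampling spread of this method.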
2.4 Analysis of Covariance (ANOCOVA)
Jones et al.  proposed a time domain approach to assessing potential predictability based on a mixed model with errors having an autoregressive moving‒average structure [see also Zheng, 1996]. Obtaining maximum likelihood estimates of the parameters of this model requires nonlinear optimization techniques. Feng et al.  proposed a simpler version of the model that can be solved by least squares techniques. Despite being simpler, the residuals produced are white noise even for relatively low‒order models. The fact that the residuals pass a white noise test for low order implies that the model adequately captures the autocorrelation structure of the time series and that the higher order mixed models proposed by Jones et al.  are not necessary for modeling these particular observed time series. Specifically, Feng et al.  suggested modeling daily time series by a combination of an autoregressive model and an ANOVA model of the form (1):
$$X_{d,y} = \varphi_1 X_{d-1,y} + \varphi_2 X_{d-2,y} + \cdots + \varphi_p X_{d-p,y} + \mu_y + \varepsilon_{d,y}, \tag{19}$$
where p is an unknown order, Xd,y and μy have the same meanings as model (1), εd,y is independent and normally distributed with zero mean and identical variances , and φ1,φ2,…,φp are the autoregressive coefficients. A model of the form (19) is the basis of a method called analysis of covariance (ANOCOVA). This model allows the significance of potential predictability to be tested while rigorously accounting for both autocorrelation in daily time series and uncertainty in model parameters. Feng et al.  discuss the properties of time series generated by model (19) and show that the expected seasonal mean is proportional to μy. Therefore, the null hypothesis of no potential predictability is equivalent to the hypothesis that the term μy is constant:
$$H_0:\ \mu_1 = \mu_2 = \cdots = \mu_Y. \tag{20}$$
If the null hypothesis is true, then the ANOCOVA model (19) reduces to
$$X_{d,y} = \varphi_1 X_{d-1,y} + \varphi_2 X_{d-2,y} + \cdots + \varphi_p X_{d-p,y} + \mu + \varepsilon_{d,y}, \tag{21}$$
where μ denotes the constant value imposed by the hypothesis (20). Model (21) is a standard AR model. To distinguish the two models, we call (21) the “reduced” model and (19) the “full” model, since the latter contains a source of variability due to variations in μy that is not included in the former. The statistic for testing the null hypothesis (20) is
$$F_{\mathrm{ANOCOVA}} = \frac{\left(\mathrm{SSR}_{\mathrm{reduced}} - \mathrm{SSR}_{\mathrm{full}}\right)/(Y-1)}{\mathrm{SSR}_{\mathrm{full}}/(DY-Y-p)}, \tag{22}$$
where SSRreduced and SSRfull indicate the sum of the squared errors for the reduced and full models, respectively. If the null hypothesis is true and the model is correct, then FANOCOVA has an F distribution with Y−1 and DY−Y−p degrees of freedom. The statistical test accounts for estimation of the AR coefficients and does not require estimating a timescale, unlike the SG method.
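The comparison of full and reduced models can be sketched with ordinary least squares. This is an illustrative sketch rather than the procedure of Feng et al.: the order is fixed instead of selected adaptively, the function name is hypothetical, and lagged values are taken only from within each season. Under that last assumption the first p days of each season are used only as predictors, so the residual degrees of freedom are Y(D−p)−Y−p rather than the DY−Y−p quoted above.

```python
import numpy as np

def anocova_f_statistic(x, order=1):
    """F statistic comparing full (yearly intercepts) and reduced AR fits.

    x has shape (Y, D) and order >= 1. Lags are taken within each season
    (an assumption), so the first `order` days serve only as predictors.
    """
    x = np.asarray(x, dtype=float)
    n_years, n_days = x.shape
    n_obs = n_days - order
    target = x[:, order:].ravel()
    # Column k holds the value k + 1 days before the target day.
    lagged = np.column_stack([
        x[:, order - k - 1:n_days - k - 1].ravel() for k in range(order)
    ])
    year_dummies = np.kron(np.eye(n_years), np.ones((n_obs, 1)))
    design_full = np.hstack([year_dummies, lagged])
    design_reduced = np.hstack([np.ones((target.size, 1)), lagged])

    def ssr(design):
        coef = np.linalg.lstsq(design, target, rcond=None)[0]
        return np.sum((target - design @ coef) ** 2)

    ssr_full, ssr_reduced = ssr(design_full), ssr(design_reduced)
    dof_num = n_years - 1
    dof_den = target.size - n_years - order
    f_stat = ((ssr_reduced - ssr_full) / dof_num) / (ssr_full / dof_den)
    return f_stat, (dof_num, dof_den)

# Synthetic illustration: one series with a yearly term, one without.
rng = np.random.default_rng(0)
Y, D, phi = 10, 30, 0.5
mu = np.arange(Y, dtype=float)            # strongly varying yearly term
x_signal = np.zeros((Y, D))
x_null = np.zeros((Y, D))
for d in range(1, D):
    x_signal[:, d] = phi * x_signal[:, d - 1] + mu + rng.standard_normal(Y)
    x_null[:, d] = phi * x_null[:, d - 1] + rng.standard_normal(Y)
f_signal, dof = anocova_f_statistic(x_signal, order=1)
f_null, _ = anocova_f_statistic(x_null, order=1)
print(f_signal, f_null, dof)
```

The statistic for the series with a yearly term should be far larger than that for the series without one; a formal test would compare each value with the upper percentile of the F distribution for the returned degrees of freedom.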
 To measure predictable variance, it would not be appropriate to subtract the estimated noise variance from the observed variance, because the total variance from model (19) does not exactly match that of observations. A more consistent measure of predictable variance would avoid mixing observed and modeled quantities and use only model quantities. The difference in total variance between model and observations is small, usually less than 10%, so the use of model variance instead of observed variance involves a small correction. Consequently, we define a measure of FPV that lies between zero and one for ANOCOVA as
$$\mathrm{FPV} = \frac{\hat{\sigma}_S^2}{\hat{\sigma}_S^2 + \hat{\sigma}_N^2}, \tag{23}$$
where $\hat{\sigma}_S^2$ and $\hat{\sigma}_N^2$ are the signal and noise variances, respectively, derived from the model (19). These variances are complicated functions of the model parameters and can be found in Feng et al. . Further details of the ANOCOVA procedure, such as parameter solutions, order selection, variance estimates, and model verification, can be found in Feng et al. .
2.5 Bootstrap
Feng et al.  proposed a bootstrap method for estimating potential predictability. The bootstrap is a standard method for estimating the properties of an estimator based on resampling strategies [Efron and Tibshirani, 1993]. In practice, one draws randomly with replacement from the given sample to construct an empirical distribution of a statistic. An attractive property of the bootstrap method is that it does not require explicitly specifying an evolution model or noise distribution. In the case of autocorrelated data, random resampling shuffles the original data and destroys the ordering that produces the autocorrelation. In order to account for temporal dependence in meteorological daily time series, we follow the standard approach of resampling contiguous blocks of data rather than single elements [Efron and Tibshirani, 1993]. More precisely, we randomly select consecutive time series of length L with replacement from the original data and join them together to generate a random time series of length DY, from which the variance of seasonal means is calculated. This resampling process is repeated 1000 times to build up an empirical distribution of interannual variance. We then compare the observed interannual variance of seasonal means to the 95th percentile of the empirical distribution of variance to determine if the observed variance is significantly larger than that expected from weather noise.
 There is no universally accepted criterion for defining the block length L. Our choice of L is guided by the characteristic timescale T0 as defined in (8). The T0 estimates for temperature are greater than five over most of the globe [Feng et al., 2011, Figure 1]. Hence, we choose a constant block length L of 10 days for temperature.
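The block resampling procedure can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: blocks are drawn from within single seasons so that they never span a season boundary, the function name is hypothetical, and the FPV is computed as one minus the ratio of the mean bootstrap variance to the observed variance, floored at zero.

```python
import numpy as np

def block_bootstrap_noise(x, block_length=10, n_boot=1000, seed=0):
    """Bootstrap distribution of the variance of seasonal means.

    x has shape (Y, D). Blocks are contiguous runs of `block_length` days
    drawn with replacement from within single seasons (an assumption; the
    text does not say whether blocks may span season boundaries).
    """
    x = np.asarray(x, dtype=float)
    n_years, n_days = x.shape
    rng = np.random.default_rng(seed)
    n_blocks = -(-(n_years * n_days) // block_length)   # ceiling division
    offsets = np.arange(block_length)
    boot_var = np.empty(n_boot)
    for b in range(n_boot):
        years = rng.integers(0, n_years, size=n_blocks)
        starts = rng.integers(0, n_days - block_length + 1, size=n_blocks)
        blocks = x[years[:, None], starts[:, None] + offsets]
        series = blocks.ravel()[:n_years * n_days].reshape(n_years, n_days)
        boot_var[b] = series.mean(axis=1).var(ddof=1)
    return boot_var

rng = np.random.default_rng(1)
x = rng.standard_normal((30, 90))            # white noise, no signal
boot_var = block_bootstrap_noise(x, block_length=10, n_boot=200)
observed = x.mean(axis=1).var(ddof=1)
threshold = np.percentile(boot_var, 95)      # significance threshold
noise_var = boot_var.mean()                  # bootstrap noise variance
fpv = max(0.0, 1.0 - noise_var / observed)   # assumed FPV definition
print(observed, threshold, noise_var, fpv, observed > threshold)
```

The example uses 200 resamples only to keep the run short; the text uses 1000.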
 To estimate the fraction of predictable variance, we define the noise variance as the mean of the bootstrap variances of seasonal means, and the corresponding FPV as
 The four methods outlined above differ in their assumptions regarding the underlying stochastic process and approximations used to derive sampling distributions. An indicator for the importance of these differences is the degree to which different methods give different results for the same time series. Thus, it is worthwhile to point out the main differences between the methods. MN assumes that weather noise variability is white on timescales longer than a season, while the bootstrap effectively assumes weather noise variability is white on timescales longer than a block length (10 days in our case). As discussed in DelSole and Feng , SG also has a spectral interpretation similar to MN but estimates the power at zero frequency by a spectral window that differs substantially from MN. In summary, these methods make relatively few assumptions about the shape of the power spectrum of weather noise variability but differ in how the power at low frequencies is estimated. In contrast, ANOCOVA effectively assumes that the power spectrum of weather noise variability is constrained to have a shape that is characteristic of a low‒order AR model.
3 Application to Monte Carlo Experiments
 The previous section summarized four distinct methods for testing potential predictability. To gain insight into the relative merits of the methods, we compare them in the context of Monte Carlo experiments. Specifically, we generate random data from a stochastic model with prescribed autocorrelation structure. Since the data are generated by a known stochastic model, the potential predictability is known exactly. We then apply each method to independent realizations from the stochastic model and assess their performance relative to the truth. To quantify how well each method describes the sampling distribution under the null hypothesis of no potential predictability, we generate data from a simple autoregressive model with no potential predictability:
where the variables in (25) have the same meanings as in (21). The stochastic process generated by this model has no potential predictability because the asymptotic mean does not depend on y. We choose the parameters D=90 days and Y=30 years to be consistent with the reanalysis data, which are described in section 5.
Feng et al.  show that the weather noise variance of Xd,y from (25) is given by
To assess how well the individual methods approximate this noise variance, we generate 1000 independent synthetic time series from (25), and for each time series estimate the noise variances as described in section 2, yielding 1000 estimates of the noise variance for each method. Histograms of the resulting estimates are shown in Figure 1 for different values of the autoregressive parameter φ. The true weather noise is indicated by a vertical line in Figure 1. The relative location of the histogram center to the truth is critical in determining the degree of bias in noise estimates. The spread of the histogram is an indicator of the range of estimate—the smaller the spread, the better the estimate.
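The Monte Carlo setup can be illustrated with a short simulation. This is a sketch under stated assumptions: each season is started from the stationary distribution of a zero‒mean first‒order autoregressive process, and the "true" noise variance is the standard result for the variance of a D‒day mean of such a process, which may differ in form from the expression of Feng et al. The function names are hypothetical.

```python
import numpy as np

def simulate_ar1_seasons(phi, n_years, n_days, rng, noise_std=1.0):
    """Independent stationary AR(1) seasons; returns shape (n_years, n_days)."""
    x = np.empty((n_years, n_days))
    # Draw day 0 from the stationary distribution to avoid spin-up.
    x[:, 0] = rng.standard_normal(n_years) * noise_std / np.sqrt(1.0 - phi**2)
    for d in range(1, n_days):
        x[:, d] = phi * x[:, d - 1] + noise_std * rng.standard_normal(n_years)
    return x

def ar1_seasonal_mean_variance(phi, n_days, noise_std=1.0):
    """Exact variance of a D-day mean of a stationary AR(1) process."""
    lags = np.arange(1, n_days)
    t0 = 1.0 + 2.0 * np.sum((1.0 - lags / n_days) * phi**lags)
    return noise_std**2 / (1.0 - phi**2) * t0 / n_days

rng = np.random.default_rng(0)
phi, D = 0.5, 90
true_var = ar1_seasonal_mean_variance(phi, D)
# Many seasons are used here only to check the formula against simulation.
sample = simulate_ar1_seasons(phi, 4000, D, rng)
empirical_var = sample.mean(axis=1).var(ddof=1)
print(true_var, empirical_var)
```

For the experiments described in the text one would instead call `simulate_ar1_seasons(phi, 30, 90, rng)` 1000 times and apply each estimator to every realization.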
 The weather noise estimates from SG tend to underestimate the true weather noise for each value of φ. This bias is due to the fact that the timescale T0 is systematically underestimated from residuals about the seasonal mean [Trenberth, 1984a; DelSole and Feng, 2013]. ANOCOVA also tends to underestimate the true weather noise, although the bias is smaller. The bias in ANOCOVA shown in Figure 1 for φ=0 turns out to be an artifact of the adaptive order selection procedure; that is, we have independently verified that ANOCOVA gives unbiased estimates of weather noise when the order is correctly specified (i.e., when p=1 in (19)). In practice, the order of the process is unknown and hence must be estimated from data; so this uncertainty should not be neglected. Interestingly, the bias remains for φ=0.9 even when the order is correctly specified. MN gives unbiased estimates of weather noise for small or moderate values of φ but tends to underestimate weather noise for large φ. The bias in MN at large φ is readily explained. As Madden  noted, the MN method tends to underestimate the weather noise when the power spectrum has a relative maximum at zero frequency. Moreover, the degree of underestimation grows with the curvature of the maximum, which in turn is monotonically related to φ. For φ=0.5, this bias is less than 1%, but for φ=0.9, the bias exceeds 40% [Madden, 1976]. Finally, Figure 1 shows that the bootstrap produces unbiased estimates of weather noise at φ=0 but tends to underestimate weather noise for positive φ, with severe biases occurring at large φ. Recall that the bootstrap accounts for serial correlation by selecting contiguous blocks of time series. For φ=0.5, the timescale T0 is less than three, suggesting that a block length of 10 (which was used in the Monte Carlo experiments) is adequate. However, for φ=0.9, the timescale T0 exceeds 17, implying that the block length of 10 is too short. In practice, the true timescale is difficult to estimate; so the most appropriate block length is unknown.
 The frequency with which each method rejects the null hypothesis when it is true (i.e., the type‒1 error rate) is given in the title of each panel (as α). If each method worked properly, then the type‒1 error rate should equal 5%. However, it can be seen that only MN gives type‒1 error rates near 5% for low and moderate values of φ. Not surprisingly, the larger the bias in noise variance, the larger the bias in type‒1 error rates. Hence, SG gives exceedingly large type‒1 error rates. ANOCOVA has type‒1 error rates around 13% for white noise (i.e., for φ=0); but as mentioned earlier, this discrepancy is an artifact of the poor criterion used to select the order. After all, for white noise, ANOCOVA reduces to ANOVA, which is the proper method in this case. The bootstrap gives the correct type‒1 error rate for white noise but tends to have larger type‒1 error rates for serially correlated time series. All methods give excessively large type‒1 error rates for large φ. We suspect this discrepancy at large φ is due to the fact that the timescale of the stochastic process is so long that it becomes difficult to distinguish potential predictability from predictability due to autoregressive processes.
 Another consideration is the behavior of the methods when potential predictability exists. To investigate this case we generate time series from the model
where μy is drawn independently for each y from a standardized normal distribution with zero mean and variance
The above variance ensures that the signal‒to‒noise ratio of the potentially predictable variance from (27) is SNR (the expression is derived by taking the ratio of the two terms on the right‒hand side of (28) in Feng et al. ). We chose SNR=4. The weather noise variance estimated from 1000 realizations of this stochastic process are shown in Figure 2. Comparison between Figures 1 and 2 shows that the noise variance estimates are nearly the same for all methods except the bootstrap. The bootstrap generally produces larger estimates of weather noise relative to the estimates without potential predictability. As a result, even for white noise (φ=0), the bootstrap produces biased estimates of weather noise when potential predictability is present. This bias arises because random resampling “scrambles” the signal, but the signal does not vanish entirely when averaged over a season. One could avoid this bias by bootstrapping the residuals about the seasonal mean, but this approach significantly underestimates the weather noise (not shown), thereby trading one bias for another.
 We also have estimated the type‒2 error rate for the case SNR=4 and indicated it in the title of the appropriate panel (as β). Type‒2 error is the probability of accepting the null hypothesis when it is false. For small φ, all methods have nearly perfect type‒2 error rates, i.e., nearly zero. For moderate φ, MN has the largest type‒2 error rate. The apparent reason for this is that MN produces the largest spread in weather noise, a fact also noted by Trenberth [1984b]. The relatively large uncertainty in MN is related to the specific method for extrapolating seasonal spectra to zero frequency; in particular, the extrapolation is based solely on the power estimate at 1/90 day−1. One can conceive of alternative extrapolation methods that pool power from other frequencies to reduce the uncertainty, but at the expense of introducing additional biases. We will not pursue such refined approaches here. Except for the bootstrap, most methods have large type‒2 error rates at large φ, presumably for the same reason as the large type‒1 error rates discussed previously. While the bootstrap does have good power at large φ, this benefit is undermined by the very large type‒1 error rates at large φ.
 The above results suggest some general rules of thumb. First, SG systematically underestimates the weather noise for all values of φ, unlike the other methods. For this reason, it seems reasonable to use one of the other methods instead of SG. Second, all four methods systematically underestimate weather noise variance for large φ. It should be recognized that for large φ, the process is very persistent (e.g., T0=17 days for φ=0.9) and hence very predictable, in which case the concept of potential predictability may not be relevant. Third, we expect ANOCOVA and bootstrap to give similar results for small or moderate values of φ but expect the bootstrap to produce smaller estimates of weather noise than ANOCOVA for large φ, unless the signal‒to‒noise ratio is large. Finally, we expect MN to produce nearly unbiased estimates of weather noise for small or moderate φ, but the resulting predictability will tend to be deemed insignificant more often by MN than by the other methods.
4 Application to Ensemble Model Output
 The previous section examined estimates of potential predictability when the true process is known to be a first‒order autoregressive process. In nature, the true process is likely to be more complicated. To assess the quality of the estimates for more realistic time series, we apply the methods to time series generated by atmospheric general circulation models (AGCMs). Specifically, we consider an ensemble of AGCM simulations initialized with different atmospheric initial conditions but driven by the same boundary and external forcings. In addition to calculating potential predictability from a single realization of daily time series, the true potential predictability can be estimated directly from the ensembles (e.g., the signal can be estimated from the ensemble mean, while the noise variance can be estimated from deviations about the ensemble mean [Rowell et al., 1995]). Comparison of estimates from different methods provides a basis for judging the quality of the methods. The fact that the AGCM is not perfect is not relevant for our purposes, as the AGCM is used merely to provide a more realistic time series for calculating potential predictability. On the other hand, the AGCM itself is a state‒of‒the‒art model; so the resulting estimates of potential predictability can be considered to be one of the best current estimates of potential predictability.
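The ensemble‒based estimate can be sketched as a simple analysis of variance across members. This is an illustration under assumptions made here, not necessarily the exact estimator used in the study: the noise variance is the pooled spread about each year's ensemble mean, the signal variance is the variance of the ensemble mean corrected for residual noise, and FPV is the signal fraction of the total. The function name and array layout are hypothetical.

```python
import numpy as np

def ensemble_signal_noise(seasonal_means):
    """Signal and noise variance from an ensemble of seasonal means.

    seasonal_means has shape (M, Y): M members, Y years, same forcing.
    """
    z = np.asarray(seasonal_means, dtype=float)
    n_members, n_years = z.shape
    ens_mean = z.mean(axis=0)
    # Noise: spread of members about the ensemble mean of each year.
    noise_var = np.sum((z - ens_mean) ** 2) / (n_years * (n_members - 1))
    # The variance of the ensemble mean still contains noise_var / M.
    signal_var = max(0.0, ens_mean.var(ddof=1) - noise_var / n_members)
    fpv = signal_var / (signal_var + noise_var)
    return signal_var, noise_var, fpv

# Toy input: two members, three years.
toy = np.array([[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]])
signal_var, noise_var, fpv = ensemble_signal_noise(toy)
print(signal_var, noise_var, fpv)
```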
 The model we use is the atmospheric component of version 2 of the National Centers for Environmental Prediction (NCEP) Climate Forecast Model (CFSv2) [Kumar et al., 2012]. The atmospheric model has a resolution of T126 in the horizontal and 64 layers in the vertical. The simulations consist of twelve 59 year integrations starting in January 1950 with slightly different atmospheric states, but driven by the same observed monthly SST, sea ice, and CO2 concentrations. The ensemble of monthly model output is used to estimate the true potential predictability, while statistical estimates of predictability are determined from the daily output, which is available only from one ensemble member.
 We focus on 2 m temperature in the following comparison. The annual cycle is removed by subtracting the first three annual harmonics from the daily time series of 2 m temperature. Narapusetty et al.  demonstrated that this spectral method is more accurate than simple averaging in producing predictions of independent data. The resulting daily anomalies are divided into four seasonal series: December‒January‒February (DJF), March‒April‒May (MAM), June‒July‒August (JJA), and September‒October‒November (SON). Monthly output is partitioned into seasonal time series in the same way as the daily anomalies. For daily data, each season defined in this study is exactly 90 days long, beginning on the first day of each 3 month period.
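Removal of the annual cycle by harmonic regression can be sketched as a least squares fit. The year length of 365.25 days, the removal of the time mean together with the harmonics, and the function name are assumptions made for this illustration.

```python
import numpy as np

def remove_annual_harmonics(series, n_harmonics=3, period=365.25):
    """Subtract the time mean and the first annual harmonics by least squares.

    `period` (days per year) is an assumption; the text does not state it.
    """
    series = np.asarray(series, dtype=float)
    t = np.arange(series.size)
    columns = [np.ones_like(t, dtype=float)]
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * np.pi * k * t / period
        columns += [np.cos(angle), np.sin(angle)]
    design = np.column_stack(columns)
    coef = np.linalg.lstsq(design, series, rcond=None)[0]
    return series - design @ coef

# A synthetic series made only of a mean, the first and the third harmonic.
t = np.arange(10 * 365)
cycle = 10.0 + 5.0 * np.cos(2 * np.pi * t / 365.25) + np.sin(6 * np.pi * t / 365.25)
anomaly = remove_annual_harmonics(cycle)
print(np.max(np.abs(anomaly)))
```

Since the synthetic series lies entirely in the span of the fitted harmonics, the anomalies vanish to within rounding error; real data would retain the weather and interannual variability.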
 Scatter plots of weather noise variance calculated from a single daily time series, and from analysis of variance techniques applied to the 12‒member ensemble, are shown in Figure 3. The parameters b and R2 give the regression slope and coefficient of determination of the least squares line fit. All values are statistically significant at the 5% level. The figure reveals that the weather noise variances from single daily time series are generally underestimated, with regression slopes ranging from 0.54 to 0.93. The largest scatter occurs for DJF and MAM. SG generates the smallest regression slopes, indicating that SG tends to underestimate the true noise due to underestimation of T0, which is consistent with the Monte Carlo experiments. Interestingly, SG also generates some of the largest coefficients of determination. Unfortunately, the regression slope for SG depends on season; so the bias is not sufficiently constant to correct with confidence. MN has the largest regression slopes, indicating that it has the least bias, consistent with the Monte Carlo experiments. The bootstrap and ANOCOVA have comparable regression slopes and R2 values, although ANOCOVA frequently gives the smallest R2 values among the four methods.
 The FPV ratio estimated from the 12‒member ensemble and statistical methods during four seasons is shown in Figure 4, where the regions with insignificant FPV values at the 5% significance level are masked out. All the estimates consistently reveal high FPV in the tropics and low values in the extratropics. In accord with the underestimated noise in Figure 3, FPV is overestimated by all four statistical methods, particularly over land. Such overestimation is more severe in SG, which detects larger predictable areas over the globe than the other methods. Noticeably, there is a vast land area with insignificant FPV values in MN estimates. This is consistent with the Monte Carlo results that MN has a higher type‒2 error rate, i.e., a higher probability of failing to reject the null hypothesis when it is false. ANOCOVA and bootstrap identify more predictable regions than MN, but the bootstrap produces lower FPV than the other methods.
 The above results are generally in agreement with the conclusions derived from the Monte Carlo experiments discussed in section 3. Specifically, SG seriously underestimates weather noise variance and hence overestimates FPV relative to the other methods. Overall, MN appears to be the least biased, with the largest regression slopes (closest to one) and reasonable coefficients of determination. But it tends to have a higher probability of failing to reject the null hypothesis when it is false and hence identifies more insignificant predictability than the other methods. It is interesting to note that SG has the highest R2 values, suggesting that it has the potential to become a very good method if an improved estimate of the timescale T0 could be derived. No simple, universally correct statements can be made regarding the relative performances of ANOCOVA, bootstrap, and MN, although the Monte Carlo experiments imply that the bootstrap is problematic for short and long memory processes.
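The ensemble-based reference values can be illustrated with a standard one-way analysis of variance on seasonal means. The sketch below is one common formulation and is an assumption here, since the exact estimator used for the ensemble is not spelled out in this section; the function name `anova_fpv` is hypothetical.

```python
import numpy as np


def anova_fpv(seasonal_means):
    """One-way ANOVA on an array of seasonal means, shape (n_years, n_members).

    Returns (noise_var, signal_var, fpv, f_stat). Assumption: members share
    the same boundary forcing, so the spread about each year's ensemble mean
    is weather noise.
    """
    y = np.asarray(seasonal_means, dtype=float)
    n_years, n_members = y.shape
    year_mean = y.mean(axis=1)
    # Noise variance: pooled within-year spread about the ensemble mean.
    noise_var = ((y - year_mean[:, None]) ** 2).sum() / (n_years * (n_members - 1))
    # Variance of the ensemble mean across years (signal plus residual noise).
    em_var = year_mean.var(ddof=1)
    # F statistic with (n_years - 1, n_years * (n_members - 1)) degrees of freedom.
    f_stat = n_members * em_var / noise_var
    # Remove the noise contribution to the ensemble-mean variance; clip at zero.
    signal_var = max(em_var - noise_var / n_members, 0.0)
    fpv = signal_var / (signal_var + noise_var)
    return noise_var, signal_var, fpv, f_stat
```

Significance at the 5% level would then be assessed by comparing `f_stat` with the corresponding quantile of the F distribution, which is not computed in this sketch.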
5 Application to Reanalysis Data Set
 In this section, we apply the four methods for estimating potential predictability to an observational data set of the real world. Unfortunately, daily, long‒term, global‒scale in situ observations of surface air temperature are virtually nonexistent. Since reanalysis products are widely used as a convenient supplement to observational data, we used 2 m temperature output from the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) Reanalysis [Kalnay et al., 1996]. This data set is generated by a fixed global data assimilation system and fixed model, and assimilates a comprehensive observational data set, including conventional and remotely sensed data, to produce analyses of the atmosphere every 6 h since 1948 on a 192 × 94 Gaussian grid. The fixed data assimilation system and model prevent inhomogeneities due to the use of different processing and analysis techniques, but changes in the observing system may cause spurious variability, especially the introduction of the satellite observing system during the 1970s. To alleviate concerns about falsely identifying potential predictability due to this particular change in the observing system, we use daily averages of 6 hourly temperature from January 1979 to December 2008. Spurious variability due to other inhomogeneities in the observing system undoubtedly exists, which should be borne in mind when interpreting the results.
5.1 Comparison of Weather Noise Estimates
 We applied the four methods described in section 2 to estimate the weather noise variance of 2 m temperature. To highlight similarities and differences among the methods, we show in Figure 5 the weather noise variance only for the ANOCOVA estimate and then show differences relative to the ANOCOVA estimate in the other panels. The figure reveals that weather noise variance itself is stronger over land than over the ocean, a familiar result attributed to the land‒sea heat capacity contrast, and larger during winter than summer, presumably due to enhanced baroclinic instability during winter. These results are generally in accord with previous studies [Zheng et al., 2000].
 Figure 5 shows that SG produces weather noise estimates very close to those of ANOCOVA over the ocean and tropical land areas, and smaller estimates over extratropical land. Such spatial distribution agrees well with that derived from AGCM ensemble simulations. Figure 6 presents weather noise differences from AGCM daily time series by different methods relative to the true variance determined by AGCM ensembles based on the analysis of variance. The figure further confirms the low noise estimates in SG seen in Figure 5. These results are also consistent with the Monte Carlo experiments: Over the ocean, the effective timescale T0 is large, and both methods tend to underestimate weather noise in the same way, whereas over land, the effective timescale is small, and SG tends to produce smaller estimates of weather noise than ANOCOVA.
 The MN method generally produces larger weather noise variance estimates than ANOCOVA, as shown in Figure 5. These results are consistent with those shown in Figures 1, 2, 3, and 6, which reveal that ANOCOVA underestimates weather noise more strongly than MN.
 In Figure 5, the bootstrap tends to produce larger estimates of weather noise variance over the tropical Pacific, compared to ANOCOVA, but smaller estimates over regions dominated by sea‒ice or snow (especially the polar climates of Antarctica and the Arctic, and subarctic climates such as Northern Canada and Northern Europe). The locations of large and small noise estimates in Figure 5 are in good agreement with substantial positive and negative areas in the bootstrap in Figure 6. The relatively large values over the ocean are consistent with the Monte Carlo results shown in Figure 2, which shows that the bootstrap estimates tend to be larger than the ANOCOVA estimates when the signal‒to‒noise ratio is large (the largest signal‒to‒noise ratios occur over the tropical Pacific). As for the sea‒ice or snow regions, examination of the time series (not shown) reveals intermittent changes in time variability, presumably associated with contrasting climates due to the irregular presence or absence of surface frozen water within the season. Such intermittency is not well captured by autoregressive models but can be captured, in principle, by bootstrap methods. On the other hand, the time series from other regions (not shown) do not show any obvious deviations from autoregressive behavior; so the reason that the bootstrap method yields less weather noise variability than other methods is unclear.
5.2 Comparison of Potential Predictability
 The fraction of predictable variance estimated by all four methods is shown in Figure 7. This figure shows that the potential predictability of temperature from all four methods has a pronounced tropical‒extratropical contrast, with maximum FPV greater than 0.9 in the tropical oceans but with most FPV values less than 0.6 over extratropical land areas. These findings are generally in accord with GCM estimates of potential predictability. For instance, the signal‒to‒noise ratios shown in Figure 3 of Phelps et al. are similar to our FPV plot for DJF, despite the difference in variable (2 m temperature versus 200 hPa geopotential height) and time period (DJF versus JFM). Even the magnitudes are consistent after transforming signal‒to‒noise ratios (SNR) into FPV using the formula FPV=SNR/(SNR+1).
 The fraction of area that is deemed potentially predictable by the different methods in various domains and seasons is summarized in Table 1. Aside from SG, the methods agree that the ocean has more potentially predictable area than land, and that land has more predictable area in JJA than in other seasons. Also, the FPV varies more strongly with season over land than over the ocean. The table also shows that MN identifies less predictable area than the other methods, especially over most extratropical land areas. This is true even though the magnitude of FPV is comparable to ANOCOVA and bootstrap over land, as can be seen in Figure 7. These results are generally consistent with the Monte Carlo experiments, which show that at moderate serial correlation (as found over land) MN has accurate type‒1 error rates and larger type‒2 error rates than the other methods, both of which imply that MN tends to accept the null hypothesis more often than the other methods. The table also reveals that SG identifies more predictable area than the other methods, consistent with the systematic bias discussed in sections 3 and 4.
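The conversion between SNR and FPV follows directly if SNR is taken to be the ratio of signal variance to noise variance, which is the assumption made in this short sketch (the function names are hypothetical).

```python
def fpv_from_snr(snr):
    """Convert a signal-to-noise variance ratio into a fraction of predictable variance.

    With SNR = signal_var / noise_var:
    FPV = signal_var / (signal_var + noise_var) = SNR / (SNR + 1).
    """
    return snr / (snr + 1.0)


def snr_from_fpv(fpv):
    """Inverse mapping, valid for 0 <= fpv < 1."""
    return fpv / (1.0 - fpv)
```

Under this definition, an FPV of 0.9 in the tropical oceans corresponds to a signal variance nine times the noise variance, while an FPV of 0.6 corresponds to a ratio of only 1.5.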
Table 1. The Percentage of Regions Where the Null Hypothesis of No Predictability Is Rejected Over (a) Global, (b) Land, and (c) Ocean Using the SG, MN, ANOCOVA, and Bootstrap Methods at DJF, MAM, JJA, and SON for 2 m Temperature From the NCEP/NCAR Reanalysis During 1979 to 2008
Shukla and Gutzler suggested that the MN method may overestimate weather noise due to the use of 96 day periods and that better estimates might be obtained using 30 day periods. Although the Monte Carlo experiments suggest that the bias in MN is small or negative, these experimental results were based on an idealized first‒order autoregressive model, and more complicated stochastic processes may cause MN to overestimate weather noise variance. The "best" cutoff for MN is really a question about nature; so comparisons with idealized models are not meaningful because the answer is implicitly specified by defining the idealized model. To gain insight into this question, we investigate the sensitivity of MN to the choice of frequency cutoff for the low‒frequency white noise extension. Accordingly, we repeated the calculations for MN with D in equations (13) and (14) set to 30 days. The resulting FPV, shown in Figure 8, reveals significantly enhanced values of FPV over extratropical land regions, especially during DJF and MAM. This result indicates that the MN method is sensitive to the selected cutoff frequency. The proper selection of cutoff frequency in MN is difficult to ascertain.
 The critical issue in the bootstrap is the selection of the moving block length L. Our selection of block length L is guided by the characteristic timescale T0. As shown in Figure 1 of Feng et al., large values of T0 are found in the oceans, with T0 greater than 10 days over the tropical eastern Pacific, whereas T0 is lower over land areas, generally less than 8 days. For computational convenience, a universal block length of 10 days is selected in the bootstrap. To see the effect of block length L on the bootstrap estimates, we chose a block length of 15 days, repeated the resampling, and present the derived FPV in Figure 9. In comparison with Figure 7, FPV decreases with increasing block length, with the largest reduction over the tropical areas where the bootstrap is biased for large T0, as revealed in the Monte Carlo experiments. This result reveals the sensitivity of the bootstrap to the selection of block length.
 Recall that our definition of potential predictability includes contributions due to climate forcing. During the investigation period 1979–2008, the response to anthropogenic and natural forcing is dominated by a trend [Ting et al., 2009; DelSole et al., 2011]. To assess the contribution of these latter sources of predictability, we repeated all four calculations using detrended data and found that the resulting FPV maps (not shown) were virtually indistinguishable from those shown in Figure 7. This result does not exclude the possibility that the trend is a source of predictability, as has been claimed by Doblas‒Reyes et al. based on ensemble dynamical model predictions over 44 years, or as claimed by Folland et al. for Europe based on correlations with climate indices over 133 years. Rather, our result indicates that whatever potential predictability may be associated with the trend, it is too small to be detected from a single realization of a 30 year time series. Other climate forcings that are not well captured by trends, such as natural forcing from volcanoes and solar cycles, also could contribute to potential predictability, but their contribution is probably weak due to the small number of major volcanic eruptions during this period and the difficulty of detecting solar cycles in observational data even with statistical optimization techniques [North and Wu, 2001; Camp and Tung, 2007].
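The role of the block length can be illustrated with a generic moving block bootstrap. The sketch below is not the authors' procedure, whose details are given elsewhere; it assumes that, under the null hypothesis of no predictability, years are exchangeable, so synthetic seasons can be assembled from blocks drawn at random from any year. The function name and its parameters are hypothetical.

```python
import numpy as np


def block_bootstrap_seasonal_means(daily, block_len=10, n_boot=1000, seed=0):
    """Seasonal means of synthetic seasons built from moving blocks.

    daily: array of shape (n_years, n_days) holding daily anomalies for one
    season. Each synthetic season concatenates randomly chosen blocks of
    block_len consecutive days (random year, random start) and is truncated
    to n_days. Assumption: years are exchangeable under the null hypothesis.
    """
    daily = np.asarray(daily, dtype=float)
    n_years, n_days = daily.shape
    if block_len > n_days:
        raise ValueError("block_len must not exceed the season length")
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n_days / block_len))
    max_start = n_days - block_len
    means = np.empty(n_boot)
    for i in range(n_boot):
        years = rng.integers(0, n_years, size=n_blocks)
        starts = rng.integers(0, max_start + 1, size=n_blocks)
        blocks = [daily[y, s:s + block_len] for y, s in zip(years, starts)]
        means[i] = np.concatenate(blocks)[:n_days].mean()
    return means
```

The variance of the returned means is a noise variance estimate for the seasonal mean. Calling the function with `block_len=10` and again with `block_len=15` on the same data gives a direct view of the sensitivity discussed above: longer blocks retain more of the serial correlation and therefore tend to yield larger noise variance for persistent series.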
 To gain further perspective on the differences among the four methods, we show in Figure 10 the consistency of potential predictability estimated by the different methods. Specifically, we plot an “indicator function” that is defined to be ‒1 or 1, according to whether MN, ANOCOVA, and bootstrap methods simultaneously find insignificant or significant potential predictability, respectively, while it is defined to be 0 if the methods yield inconsistent conclusions about the significance of potential predictability. We omit SG from consideration due to its systematic biases revealed from the Monte Carlo experiments and AGCM ensemble comparisons. More than 80% of the globe, mostly over the ocean and tropical land, is consistently identified as significantly potentially predictable by the three methods, while about 5% of the globe, mostly over extratropical land, is consistently identified as insignificantly predictable. Note that much of North America is consistently identified as not significantly predictable during DJF. The differences in FPV between the methods are within ±0.1 over these regions. This leaves about 15% of the globe, again mostly extratropical land, in which the methods yield inconsistent results. These inconsistencies define regions in which conclusions are sensitive to the underlying assumptions of the individual methods.
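The indicator function defined above can be written compactly as follows. This is a minimal sketch that assumes each method supplies a boolean significance mask on a common grid; the function and argument names are hypothetical.

```python
import numpy as np


def consistency_indicator(sig_mn, sig_anocova, sig_bootstrap):
    """Return 1 where all three methods find significant predictability,
    -1 where all three find it insignificant, and 0 where they disagree.

    Each argument is a boolean array (True = null hypothesis rejected).
    """
    stack = np.stack([np.asarray(a, dtype=bool)
                      for a in (sig_mn, sig_anocova, sig_bootstrap)])
    all_significant = stack.all(axis=0)
    none_significant = (~stack).all(axis=0)
    return np.where(all_significant, 1, np.where(none_significant, -1, 0))
```

Fractions of the globe in each category would then follow from averaging over grid points, with whatever area weighting is appropriate for the grid.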
6 Summary and Discussion
 This paper compared reanalysis‒based estimates of potential predictability of seasonal mean 2 m temperature derived from four statistical methods: a spectral method proposed by Madden, a corrected analysis of variance method proposed by Shukla and Gutzler, a bootstrap method proposed by Feng et al., and an analysis of covariance method proposed by Feng et al. The 2 m temperature was obtained from the NCEP/NCAR Reanalysis during 1979–2008. All four methods consistently indicate significant potential predictability over tropical land areas and most of the ocean in response to the interannual variability of SST, with the potentially predictable variance exceeding 90% in the tropical Pacific. Over land, potential predictability varies strongly with season, with very few extratropical land areas maintaining significant potential predictability throughout the year (in contrast to oceans). However, three methods (MN, ANOCOVA, and bootstrap) indicate that much of North America is not potentially predictable during winter. The methods also give inconsistent conclusions regarding the significance of potential predictability over much of Asia in most seasons. Qualitatively, the spatial structure of potential predictability derived from the four methods is consistent across methods and consistent with the estimates of Zheng et al. derived from monthly mean observational data [compare our Figure 7 with Figure 1 of Zheng et al., 2000], although our signal‒to‒total ratios appear to be about 10% larger (which might be due to the fact that Zheng et al. analyzed a different data set and time period).
 Monte Carlo experiments reveal that SG systematically underestimates weather noise variance. This bias, due to the well‒known problem with estimating the timescale T0, is so substantial that we recommend using one of the other methods instead. MN generally produces unbiased estimates of weather noise variance, except for strongly serially correlated processes for which it tends to underestimate weather noise, for reasons anticipated by Madden . For moderate to no serial correlation (as found over land), MN has accurate type‒1 error rates and larger type‒2 error rates than the other methods, implying that it tends to accept the null hypothesis more often than the other methods. Consistent with this, MN tends to identify less potentially predictable area than the other methods. ANOCOVA tends to underestimate weather noise variance because of the uncertainty in determining the order of the autoregressive process—if the order of the process is known exactly (which is not a realistic assumption), then ANOCOVA produces unbiased estimates, except for strongly serially correlated daily processes. The bootstrap produces weather noise variances similar to those produced by ANOCOVA, except for strongly serially correlated processes, for which it greatly underestimates the weather noise. Disagreement among the three methods implies that the assessment is sensitive to the assumptions underlying the estimation method (assuming it is not due to sampling variability). We note that the above biases were identified for first‒order autoregressive processes and that other biases may occur for other processes.
 The four methods also were applied to AGCM data, where the true potential predictability could be estimated from available ensemble members using standard analysis of variance. The comparisons confirm that SG significantly underestimates noise variance and hence overestimates FPV more than the other methods. MN is found to be the least biased method, with the largest regression slopes (closest to one) and reasonable coefficients of determination (R2 > 0.85). However, it tends to have a higher probability of failing to reject the null hypothesis when it is false and consequently identifies more insignificant predictability than the other methods. No simple, universally correct statements can be made regarding the relative performances of ANOCOVA, bootstrap, and MN based on the AGCM data.
 The MN method tends to imply more weather noise variability, and hence less potentially predictable variance, than the other methods. This difference, however, was found to depend sensitively on the choice of cutoff frequency used to compute the “low‒frequency white noise” approximation, a point raised by Shukla . Unfortunately, the best choice for the cutoff frequency generally is unknown.
 The bootstrap is sensitive to the selection of block length, especially over the tropical areas where the effective timescale T0 is large. Moreover, the bootstrap estimates of weather noise variability differ most strongly from the others over subarctic and Arctic regions. In some cases, these differences can be traced to intermittent changes in the time variability of the time series, presumably caused by the appearance or disappearance of frozen water on the underlying surface. Such intermittent changes are difficult to capture with autoregressive models but can be captured by bootstrap methods. Hence, the bootstrap estimates may be more trustworthy in these cases.
 The results presented here suggest that caution is needed in the use of statistical methods for estimating potential predictability. We conclude that SG is the least skillful method and should be avoided unless the estimate of the effective timescale T0 is improved, as shown in DelSole and Feng. Among the other three methods, our study indicates that no method is superior to the others for all scenarios in both the Monte Carlo experiments and the AGCM ensemble comparisons. The best-performing statistical estimate differs with the autoregressive parameters and with the season. If one is interested in estimating potential predictability with the least bias, then MN tends to be a good choice, but it has a higher probability of failing to reject the null hypothesis when it is false (i.e., higher type‒2 error) than the other methods for moderately autocorrelated time series (e.g., T0 around 3 days). ANOCOVA generally gives more accurate, but biased, estimates of weather noise compared to MN. The bootstrap appears to lie between MN and ANOCOVA in some situations and to be inferior to both in others.
 Finally, the predictability estimates based on reanalysis data must be interpreted with caution. In particular, observations of 2 m temperature are not assimilated into the NCEP/NCAR reanalysis; hence, the analyzed 2 m temperature is primarily a model product constrained by other observations. In this respect, one should not interpret 2 m temperature from reanalysis as a real observation. On the other hand, real observations have their own problems associated with instrument error, missing data, instrument bias, and representativeness error related to transforming point observations to gridded data. Since no observation‒based data set is perfect, our results derived from reanalysis are intended merely to serve as a point of reference derived from a very commonly used data set that may be compared to models and other observation‒based data sets.
 This work was sponsored by NASA's Energy and Water Cycle Study (NEWS) program (grant NNX11AE32G). DelSole was supported by grants from the NSF (0830068), the National Oceanic and Atmospheric Administration (NA09OAR4310058), and the National Aeronautics and Space Administration (NNX09AN50G). We thank Arun Kumar and Bhaskar Jha for providing the GCM data used in section 4 to test the methodologies. We also thank three anonymous reviewers for their constructive comments that substantially improved the manuscript.