Empirical Prediction of Short‐Term Annual Global Temperature Variability

Global mean surface air temperature (Tglobal) variability on subdecadal timescales can be of substantial magnitude relative to the long‐term global warming signal, and such variability has been associated with considerable environmental and societal impacts. Therefore, probabilistic foreknowledge of short‐term Tglobal evolution may be of value for anticipating and mitigating some coarse‐resolution climate‐related risks. Here we present a simple, empirically based methodology that utilizes only global spatial patterns of annual mean surface air temperature anomalies to predict subsequent annual Tglobal anomalies via partial least squares regression. The method's skill is primarily achieved via information on the state of long‐term global warming as well as the state and recent evolution of the El Niño–Southern Oscillation and the Interdecadal Pacific Oscillation. We test the out‐of‐sample skill of the methodology using cross validation and in a forecast mode where statistical predictions are made precisely as they would have been if the procedure had been operationalized starting in the year 2000. Forecast errors for lead times of 1 to 4 years are smaller on average than those of naïve benchmarks, and they compare favorably with those of most dynamical Global Climate Models retrospectively initialized to the observed state of the climate system. Thus, this method can be used as a computationally efficient benchmark for dynamical model forecast systems.


Introduction
Human-caused increases in the atmospheric concentration of well-mixed greenhouse gases are causing pronounced global climate change on decadal to centennial timescales (Bindoff et al., 2013). Perhaps the most recognized measure of this change is the long-term effect on global mean surface air temperature (SAT) (T global ), which has increased at a rate of ~0.14°C per decade since the middle of the twentieth century (Hansen et al., 2010). Superimposed on this externally forced decadal- to centennial-scale warming is variability that is mostly unforced and spontaneously generated by interactions internal to the ocean-atmosphere-land system (Bindoff et al., 2013). Such T global variability is not as persistent as the contemporary externally forced signal, but it can be substantial in magnitude over subdecadal timescales.
For example, T global increased by ~0.42°C between 2011 and 2016, which is equivalent to approximately three decades' worth of long-term warming at the aforementioned historically observed rate (Hansen et al., 2010). This suggests that, on subdecadal timescales, deviations in T global can approach 0.5°C in magnitude, which has recently been associated with profound impacts on climate-related risk (Hoegh-Guldberg et al., 2018). For instance, the aggregate probability of exceeding the preindustrial-defined 99.9th percentile of daily heat extremes over land increases by 50% to 100% with such a magnitude shift in T global (Fischer & Knutti, 2015). In addition to being of large relative magnitude, subdecadal T global variability also has an extensive spatial footprint in the sense that approximately 87% of the global surface and 99% of the global land surface exhibits a positive linear relationship between local annual unforced SAT deviations and unforced T global deviations (Figure 1).
Given the extensive spatial footprint and large magnitude of subdecadal T global variability, it is perhaps unsurprising that such variability has been linked to considerable environmental and societal impacts. Unforced subdecadal global temperature variability, typically associated with the state of the El Niño-Southern Oscillation (ENSO) (Trenberth et al., 2002) and its related effects (McPhaden et al., 2006), has been connected to global patterns of primary production (Behrenfeld et al., 2001), the distribution of sea bird, marine mammal, and fish populations (Stenseth et al., 2002), as well as coral bleaching (Walther et al., 2002). Such variability has also been linked to variation in societal phenomena including agricultural output (David & Christopher, 2007), gross domestic product growth (Burke et al., 2015), monetary inflation (Cashin et al., 2017), energy demand (Deschênes & Greenstone, 2011), human mortality (Deschênes & Greenstone, 2011), and civil conflicts (Hsiang et al., 2011). In addition to these substantive impacts, subdecadal T global variability tends to attract significant attention in popular media (Gillis, 2017), which influences the public perception of the urgency/necessity of implementing climate-change mitigation policy.
Given the above concerns, there is potentially substantial utility in the approximate foreknowledge of annual T global values (here, anomalies with respect to the 1951-1980 mean). However, forecasting the particular state of the climate system from months to years ahead of time is notoriously difficult because this timescale is long enough that chaos significantly degrades forecasts based on initial conditions, yet too short for typical externally forced signals to emerge strongly from the "noise" of unforced variability (Kirtman et al., 2013; Meehl et al., 2009).
Despite these challenges, there has been an emphasis on annual to decadal climate prediction using both statistical (Krueger & Storch, 2011; Newman, 2013; Sévellec & Drijfhout, 2018; Suckling et al., 2017; Thomas et al., 2008) and dynamical models (Keenlyside et al., 2008; Kirtman et al., 2013; Smith et al., 2007). A particular focus has been placed on the use of dynamical Global Climate Models (GCMs) initialized to the observed state of the climate system and run forward in time, in a manner similar to the procedure used in numerical weather prediction (Keenlyside et al., 2008; Kirtman et al., 2013; Smith et al., 2007). This method has exhibited potential, especially in certain locations, but these GCM-based predictions come with noteworthy challenges such as large computational expense, incomplete observations for initialization, the necessity to correct for mean biases, and the necessity to correct for model drift due to "coupling shock." Furthermore, accurate GCM simulation of internal modes such as ENSO and its teleconnections is of utmost importance for this application, but many GCMs still struggle in their simulation of physical processes key to ENSO dynamics (Bellenger et al., 2013).
In this study, we introduce a complement to GCM-based decadal prediction that makes use of historically observable empirical relationships between the state of the climate system at any given time and subsequent annual T global anomalies. Subdecadal T global anomalies are primarily related to modes of variability in the climate system such as ENSO (Brown et al., 2014; Trenberth et al., 2002), and thus, some previous efforts to statistically forecast subdecadal T global have relied primarily on the use of ENSO indices as predictor variables (e.g., Smith et al., 2007; Suckling et al., 2017). However, these methods generally require an ad hoc calculation of an ENSO index, which may not fully capture the influence of ENSO on T global. Furthermore, other modes of variability like the Interdecadal Pacific Oscillation (IPO), the Atlantic Multidecadal Oscillation (Chylek et al., 2014), the North Atlantic Oscillation (Li et al., 2013), and variability over the Southern Ocean have all been suggested to influence global temperature, but they are often neglected as potential predictors of T global. Part of the challenge is that many of the aforementioned modes are correlated with each other (either positively or negatively), which precludes their use in statistical frameworks that assume linear independence of predictors.
With the above considerations in mind, we seek a relatively simple empirical methodology for predicting T global deviations that (a) allows predictors of T global to be globally comprehensive and thus does not arbitrarily exclude modes of variability originating from, for example, high latitudes; (b) creates predictors based on the relationship of the data to predictands (as opposed to using, e.g., a priori and/or ad hoc climate indices); and (c) creates predictors that are uncorrelated with each other. We achieve these goals via the use of partial least squares regression (PLSR; Brown & Caldeira, 2017; Wold, 1966).

Methods
We apply PLSR to the problem of forecasting T global anomalies in the following way. First, observed globally gridded SAT (latitude, longitude, and annual time) anomalies were obtained from the four primary instrumental observational data sets: HadCRUT4 (Morice et al., 2012), NASA GISTEMP (Hansen et al., 2010), NOAA (Vose et al., 2012), and Berkeley Earth Surface Temperature (Rohde et al., 2013).
Second, PLSR was performed between predictors made up of gridded SAT fields (normalized locally by their standard deviation across time) and predictands of subsequent T global deviations. SAT fields were used as predictor variables of T global because they represent some of the longest and most spatially comprehensive data available and because much information on external forcings, as well as on the state of various modes of variability in the climate system, is contained in the SAT field.
The PLSR framework is similar to that of a multiple linear regression (MLR) problem. In MLR, application to the current problem would entail finding coefficients, $\vec{b}$, such that the mean squared residuals ($\vec{r}$) are minimized in the system

$$\vec{y}_t = [X]_{t-1}\,\vec{b} + \vec{r}, \quad (1)$$

where $\vec{y}_t$ would be annual T global anomalies as a function of time (e.g., from 1901-2019) and the matrix $[X]_{t-1}$ would contain the global spatial field of SAT that precedes the T global anomalies in time (e.g., the rows would correspond to years 1900-2018 and the columns would correspond to each individual gridded location). Because of the high degree of spatial autocorrelation in the SAT predictor fields, the columns in $[X]_{t-1}$ will inevitably be highly collinear, and thus, $[X]_{t-1}$ will be well below full rank. This precludes the application of MLR to this problem. However, PLSR offers a solution to this issue by creating linear combinations of the columns in $[X]_{t-1}$ (PLSR components) that represent a large portion of $[X]_{t-1}$'s variability. The procedure is similar to principal component analysis, but instead of seeking components that explain the maximum variability in $[X]_{t-1}$ itself, PLSR seeks components in $[X]_{t-1}$ that explain the variability in $\vec{y}_t$. Ultimately, PLSR is akin to the MLR procedure performed on a matrix $[Z]_{t-1}$ containing a relatively low number of PLSR components that represent much of the variability in $[X]_{t-1}$,

$$\vec{y}_t = [Z]_{t-1}\,\vec{b} + \vec{r}. \quad (2)$$

Figure 1. Spatial footprint of annual unforced global mean surface air temperature variability showing that the majority of the surface displays a positive relationship with the global mean. Colors represent the magnitude of the linear regression coefficient (slope) between local unforced (subdecadal timescale) annual surface air temperature deviations and global mean unforced (subdecadal timescale) annual surface air temperature deviations. Stippling indicates that the linear regression coefficient is not statistically different from zero at the 90% confidence level or above. Data are from the GISTEMP data set and span 1950-2019. This time span was selected because it is the longest period with near-global spatial coverage. Unforced variability was isolated from forced variability so that subdecadal variations could be highlighted rather than the long-term trend. This decomposition between forced and unforced variability was achieved via multiple linear regression against historical radiative forcings.
Below, we show results using four PLSR components, but conclusions are not sensitive to this specific number. We carry out PLSR using the MATLAB™ function "plsregress" (https://www.mathworks.com/help/stats/plsregress.html). This function performs PLSR using the SIMPLS algorithm (see also the methods of Brown & Caldeira, 2017, for more details).
Equation 2 is for the specific case of predicting T global anomalies from only the previous year's SAT field. However, in our application, we use the previous 2 years to predict the next 4 years of T global deviations. Two lagged years were used because we found some increase in skill from including information not only on the most recent state of the SAT field but also on its recent evolution. The 4-year lead time was chosen as the maximum lead time because we found that the method was outperformed by the naïve benchmarks in some of the observational data sets beyond 4 years (Figure S4 in the supporting information).
When using two preceding years, the matrices are horizontally concatenated prior to the application of PLSR. So in our application, equation 1 becomes

$$\vec{y}_t = \left[\, X_{t-1} \;\; X_{t-2} \,\right]\vec{b} + \vec{r}. \quad (3)$$

The problem was then conducted separately for each lead time. So for lead times of 2, 3, and 4 years, equation 3 becomes

$$\vec{y}_{t+1} = \left[\, X_{t-1} \;\; X_{t-2} \,\right]\vec{b} + \vec{r}, \quad (4)$$

$$\vec{y}_{t+2} = \left[\, X_{t-1} \;\; X_{t-2} \,\right]\vec{b} + \vec{r}, \quad (5)$$

$$\vec{y}_{t+3} = \left[\, X_{t-1} \;\; X_{t-2} \,\right]\vec{b} + \vec{r}, \quad (6)$$

respectively.
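To make the construction of equations 3-6 concrete, the following is a minimal Python sketch of how the concatenated lagged predictor matrix can be assembled and fed to a PLSR fit. It uses scikit-learn's PLSRegression as a stand-in for MATLAB's plsregress; the function name, array shapes, and toy random data are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def build_lagged_system(sat, t_global, lead):
    """Assemble predictors/predictands for one lead time (cf. equations 3-6).

    sat      : (n_years, n_gridpoints) SAT anomalies, normalized locally by
               their standard deviation across time (illustrative layout).
    t_global : (n_years,) annual global mean SAT anomalies.
    lead     : lead time in years (1 to 4).
    """
    rows_X, rows_y = [], []
    for t in range(2, len(t_global) - lead + 1):
        # Predictors: the two most recent years before the forecast, concatenated.
        rows_X.append(np.concatenate([sat[t - 1], sat[t - 2]]))
        # Predictand: the global mean anomaly "lead" years after the last predictor year.
        rows_y.append(t_global[t + lead - 1])
    return np.array(rows_X), np.array(rows_y)

# Toy stand-in data (random numbers in place of the gridded observations).
rng = np.random.default_rng(0)
sat = rng.standard_normal((120, 2592))
t_global = rng.standard_normal(120)

X, y = build_lagged_system(sat, t_global, lead=1)
pls = PLSRegression(n_components=4)  # four PLSR components, as in the text
pls.fit(X, y)
next_prediction = pls.predict(X[-1:, :])
```

The same lagged design matrix is reused for each lead time; only the predictand vector is shifted, mirroring the separate fits implied by equations 3-6.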

LASSO Regularization
In order to prevent overfitting, we implement least absolute shrinkage and selection operator (LASSO) regularization (Tibshirani, 1996; https://www.mathworks.com/help/stats/lasso.html) using only the PLSR forecast and the most recent T global anomaly as the two predictor variables. This serves as a check against overfitting in the sense that it damps the influence of predictors that cause poor predictions on out-of-sample data. The overall effect of this procedure is that, under circumstances of little to no out-of-sample skill, the predictions will revert to the mean value of the predictand. Henceforth, we refer to our overall procedure as the BC2020 method.
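A minimal sketch of this second stage is given below, assuming the two predictors are the cross-validated PLSR prediction and the most recent observed T global anomaly available at forecast time. The paper uses MATLAB's lasso; here LassoCV from scikit-learn is used as an analogue, and the variable names, synthetic training data, and the cross-validated choice of penalty strength are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n = 100
# Placeholder training data: a PLSR forecast and the most recent observed
# anomaly for each training case, plus the verifying observations.
pls_forecast = rng.standard_normal(n)
last_obs_anomaly = rng.standard_normal(n)
observed = 0.7 * pls_forecast + 0.2 * last_obs_anomaly + 0.1 * rng.standard_normal(n)

X2 = np.column_stack([pls_forecast, last_obs_anomaly])

# LASSO shrinks the weights of predictors that do not improve the fit; with
# no skill, the prediction reverts toward the intercept (the predictand mean).
lasso = LassoCV(cv=5).fit(X2, observed)
final_forecast = lasso.predict(np.array([[0.9, 0.8]]))  # current predictor pair
```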

Validation
Model validation was achieved via leave-one-out cross validation for years prior to 2000 (which we refer to as "hindcast mode") and through completely out-of-sample predictions made on the post-2000 data (which we refer to as "forecast mode").
Under leave-one-out cross validation, each T global anomaly in the time series took a turn acting as a test year, with the remaining years designated as training years. Each test year was held out of the procedure such that the method was blind to the correct T global anomaly for the test year. PLSR was then performed on the training years, and the resulting regression coefficients were used to predict the T global deviation for the test year. If the predictand lead time was 1 year and two lagging years were used as predictors, equation 1 could be expanded as

$$\begin{bmatrix} T_{global,t_3} \\ T_{global,t_4} \\ T_{global,t_5} \\ \vdots \end{bmatrix} = \begin{bmatrix} SAT_{t_2,loc_1} & SAT_{t_2,loc_2} & \cdots & SAT_{t_1,loc_1} & SAT_{t_1,loc_2} & \cdots \\ SAT_{t_3,loc_1} & SAT_{t_3,loc_2} & \cdots & SAT_{t_2,loc_1} & SAT_{t_2,loc_2} & \cdots \\ SAT_{t_4,loc_1} & SAT_{t_4,loc_2} & \cdots & SAT_{t_3,loc_1} & SAT_{t_3,loc_2} & \cdots \\ \vdots & & & \vdots & & \end{bmatrix} \vec{b} + \vec{r}, \quad (7)$$

where the subscript t refers to annual time and the subscript loc refers to the global gridded location. If the T global anomaly corresponding to the fourth year in the data set (t4, second row in equation 7) was designated as the test year, then the row corresponding to this year would be deleted, and T global for t4 would be hindcast with regression coefficients ($\vec{b}$) that were calculated without knowledge of the corresponding predictor-predictand combination. Note that information from these deleted years (t4, t3, and t2) still appears in the linear system elsewhere (e.g., SAT from t2 still informs the prediction of T global for t3).
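A schematic of the leave-one-out loop in code, under the simplifying assumption that the predictor rows and predictand vector have already been assembled as above (the helper name and the toy data are illustrative, not the authors' implementation):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def loo_hindcasts(X, y, n_components=4):
    """Leave-one-out cross-validated predictions.

    Each predictor row of X and its corresponding entry of y (the subsequent
    T_global anomaly) take a turn as the held-out test case; PLSR is refit on
    the remaining rows and used to predict the held-out year.
    """
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = PLSRegression(n_components=n_components).fit(X[keep], y[keep])
        preds[i] = model.predict(X[i:i + 1]).ravel()[0]
    return preds

# Toy stand-in data for the gridded SAT predictors and T_global targets.
rng = np.random.default_rng(2)
X = rng.standard_normal((60, 500))
y = rng.standard_normal(60)
hindcasts = loo_hindcasts(X, y)
rmse = np.sqrt(np.mean((hindcasts - y) ** 2))
```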
In hindcast mode, each year took a turn acting as a test year and the errors between the forecast T global anomalies and the observed T global anomalies were used to summarize the predictive skill of the method and inform the empirically derived confidence intervals of the forecast.
The predictive skill of the BC2020 model is also quantified in forecast mode, where its predictions are tested on completely out-of-sample data in the years following 2000. Forecast mode means that the procedure is run just as it would have been if it had been operationalized starting in the year 2000. That is, no information from future data is used for any part of the training. Specifically, forecasts are made each year (at 1- to 4-year lead times), and model parameters are updated each year prior to the next year's forecast.
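In code, forecast mode corresponds to an expanding-window, walk-forward evaluation: refit on everything observed before a given forecast year, predict that year, then advance. The sketch below makes the same simplifying assumptions as the previous snippets (pre-assembled predictor rows, illustrative names, toy data).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def walk_forward_forecasts(X, y, years, first_forecast_year=2000, n_components=4):
    """Refit on all data available before each forecast year, then predict it.

    No information from the forecast year or later enters the training set,
    mimicking how the procedure would have been operationalized in 2000.
    """
    forecasts = {}
    for i, year in enumerate(years):
        if year < first_forecast_year:
            continue
        model = PLSRegression(n_components=n_components).fit(X[:i], y[:i])
        forecasts[int(year)] = float(model.predict(X[i:i + 1]).ravel()[0])
    return forecasts

# Toy stand-in data: one predictor row and one target per year.
rng = np.random.default_rng(3)
years = np.arange(1901, 2020)
X = rng.standard_normal((len(years), 500))
y = rng.standard_normal(len(years))
out_of_sample = walk_forward_forecasts(X, y, years)
```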

Note on Treatment of Forced Variability
Interannual T global variability is the result of both forced and unforced variability, and many previous efforts have attempted to partition this variability explicitly into these two components (e.g., Brown et al., 2014;Foster & Rahmstorf, 2011;Lean & Rind, 2008). Here, we do not attempt to partition temperature variability into forced and unforced components and instead we allow the PLSR procedure itself to partition most of the forced variability into the first PLSR component. We made this choice because, despite much research on the issue (Frankcombe et al., 2015;Mann et al., 2014), the isolation of unforced from total temperature variability remains a major challenge. When trying to do the decomposition, insufficient temporal/spatial variability of historical forcings may cause an insufficient amount of historical variability to be categorized as forced variability, and thus, too much variability may be categorized as unforced variability. On the other hand, some studies have suggested that the best estimates of decadally varying forcing may have been implicitly overfit to observations to some degree (Tung & Zhou, 2013). Under this view, the observed multidecadal temperature variability would contain a substantial unforced component and attempts to partition variability into a forced component and unforced component may be biased in favor of allocating too much variability into the forced designation and necessarily leaving too little variability in the unforced designation.
Given that there is not a clear consensus on how to best decompose forced and unforced variability and given that we prioritized simplicity in our methodology, we chose to feed the PLSR procedure the raw temperature data sets without any preprocessing. This meant that the PLSR procedure itself partitioned most of the forced variability into the first PLSR component which we refer to as the global warming component.
This methodological decision has the added benefit of allowing the procedure to be completely independent of assumptions about the time evolution of external forcings and/or their representation in dynamical GCM simulations. This is preferable if the method is to serve as a benchmark or point of comparison for dynamical GCMs.
Additionally, a major benefit of not attempting to remove forcing as a preprocessing step is that it eliminates the risk of inadvertently feeding in information that the procedure would not have in an operationalized setting. When the procedure is run in forecast mode, the lack of data preprocessing guarantees that there is no information leakage that would cause spuriously low prediction errors. Thus, forecast mode prediction errors incorporate uncertainty in both forced and unforced variability which is desirable when informing confidence intervals going forward in a real-world setting. However, this also means that because the method cannot anticipate abrupt changes in external forcings, it is at an inherent disadvantage compared to GCMs that incorporate retrospective time-varying forcings and statistical models trained on only the unforced component of variability.

Ability of BC2020 Method to Forecast Idealized Signals
As a demonstration of concept, we first employ the BC2020 method on idealized synthetic data. Specifically, we generated synthetic data over the period 1880-2017 which consisted of combinations of sine waves (Figures 2a and 2b) and random noise (not shown). We inserted one oscillation in the Northern Hemisphere grid points and one in the Southern Hemisphere grid points, with each grid point receiving its own random noise time series. We then ran the BC2020 method on the gridded data. Cross-validated hindcasts as well as an out-of-sample forecast are shown in Figure 2c. It can be seen that the methodology is able to learn the relationships between SAT patterns and subsequent T global anomalies. This is particularly apparent in the forecast period (2014-2017 in Figure 2c) where the method predicts an uptick in T global (based on previous patterns) even though T global had been trending down since ~2004.
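A sketch of how such synthetic input can be generated is shown below. The periods, amplitudes, grid resolution, and noise level are arbitrary illustrative choices, not the values used to produce Figure 2.

```python
import numpy as np

rng = np.random.default_rng(4)
years = np.arange(1880, 2018)
n_lat, n_lon = 36, 72
lats = np.linspace(-87.5, 87.5, n_lat)

# One sine-wave "mode" assigned to Northern Hemisphere grid points and a
# second mode, with a different period, assigned to Southern Hemisphere points.
nh_mode = np.sin(2 * np.pi * (years - 1880) / 15.0)
sh_mode = np.sin(2 * np.pi * (years - 1880) / 40.0)

field = np.empty((len(years), n_lat, n_lon))
for j, lat in enumerate(lats):
    mode = nh_mode if lat > 0 else sh_mode
    # Each grid point receives the hemispheric oscillation plus its own noise.
    field[:, j, :] = mode[:, None] + 0.5 * rng.standard_normal((len(years), n_lon))

# Area-weighted (cosine-of-latitude) global mean serves as the synthetic T_global.
weights = np.cos(np.deg2rad(lats))
t_global = np.average(field.mean(axis=2), axis=1, weights=weights)
```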

CMIP5 Decadal Hindcast Experiments
An alternative method for predicting T global deviations comes from the observationally initialized GCMs that participated in the CMIP5 decadal hindcast experiments (Taylor et al., 2011). These GCMs were initialized to various aspects of the observed state of the climate system (Table 1) and run forward in time, incorporating retrospective estimates of external forcings over the given forecast period (e.g., hindcasts starting in 1990 incorporated forcing in 1991 associated with the Mount Pinatubo volcanic eruption even though that forcing was not predictable in advance). Therefore, the decadal hindcast experiments incorporate retrospective information on external forcing, and thus their hindcast performance is at an advantage relative to the BC2020 method. There are 18 GCMs that participated in this experiment. Several GCMs use multiple initialization methods: some ensemble members are initialized to the absolute observed values of variables (full field initialization), while other ensemble members are initialized with anomalies from observed climatology (anomaly initialization). We treat GCM ensemble sets that use different initialization methods as being effectively different GCMs. This treatment has the effect of expanding the number of GCMs in this experiment from 18 to 24 (Table 1).
When GCMs are initialized to observations, they have a tendency to drift toward their own preferred climate state which is often biased relative to observations. In order to correct for both the bias and the drift, we apply the standard method of drift correction recommended by the International CLIVAR Project office. This method is described below and illustrated for a single GCM (bcc-csm1-1) in Figure S1.
The raw T global hindcasts are represented as $Z_{j\tau}$, where j is the initial forecast time (j = 1, …, n) and τ is the forecast lead time in years (τ = 1, …, m; Figure S1a). The initial forecast time depends on the model (Table 1). The corresponding observations against which the hindcasts are compared are represented as $V_{j\tau}$ (Figure S1c). For the main results, we show RMSEs with respect to GISTEMP, but RMSEs with respect to the other three observational data sets are shown in Figure S4. The averages over the entire series of forecasts are given by

$$\bar{Z}_{\tau} = \frac{1}{n}\sum_{j=1}^{n} Z_{j\tau} \quad \text{and} \quad \bar{V}_{\tau} = \frac{1}{n}\sum_{j=1}^{n} V_{j\tau}$$

(Figures S1d and S1e, respectively).
Drift is calculated on a model-by-model basis as the difference between the ensemble mean forecasts and observations averaged over all cases,

$$d_{\tau} = \bar{Z}_{\tau} - \bar{V}_{\tau}$$

(Figure S1f).
The drift (which implicitly contains any mean bias) is then subtracted from the raw hindcasts to obtain bias/drift-corrected hindcasts,

$$Z^{corrected}_{j\tau} = Z_{j\tau} - d_{\tau} = Z'_{j\tau} + \bar{V}_{\tau}$$

(Figures S1g and S2), where $Z'_{j\tau} = Z_{j\tau} - \bar{Z}_{\tau}$ is the anomaly of the raw forecast relative to the forecast average over all forecast periods. We perform this procedure in a hold-one-out cross-validated manner such that the anomalies over any given hindcast time period of evaluation are not included in the bias/drift calculation for that time period.
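The lead-time-dependent bias/drift removal can be written compactly in code. The following is a minimal illustration of the formulas above on toy arrays (not the authors' processing scripts); a cross-validated variant would simply exclude the hindcast being corrected when forming the averages.

```python
import numpy as np

def drift_correct(Z, V):
    """Time-independent drift correction of decadal hindcasts.

    Z : (n_starts, n_leads) raw model T_global hindcasts, Z[j, tau].
    V : (n_starts, n_leads) verifying observations aligned with Z.

    Returns hindcasts with the lead-time-dependent drift (which implicitly
    contains any mean bias) removed, i.e., Z'_{j,tau} + Vbar_tau.
    """
    Zbar = Z.mean(axis=0)   # average forecast at each lead time
    Vbar = V.mean(axis=0)   # average observation at each lead time
    drift = Zbar - Vbar     # model-minus-observation drift per lead time
    return Z - drift        # equivalently (Z - Zbar) + Vbar

# Toy example: 20 start years, 4 lead times, with an artificial warm drift.
rng = np.random.default_rng(5)
truth = rng.standard_normal((20, 4))
raw = truth + np.array([0.1, 0.2, 0.3, 0.4]) + 0.05 * rng.standard_normal((20, 4))
corrected = drift_correct(raw, truth)
```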
Other studies have suggested yet more involved postprocessing of the decadal hindcast experiments in which the time-dependent aspect of the drift is taken into account. Such postprocessing requires free-running historical experiments which are not available for all of the CMIP5 GCMs considered here, and thus, we use time-independent drift correction.

Comparison to Other Statistical Methods
Several other studies have published statistical procedures capable of predicting T global on subdecadal timescales (Krueger & Storch, 2011; Newman, 2013; Sévellec & Drijfhout, 2018; Suckling et al., 2017; Thomas et al., 2008). None of these procedures are directly comparable to the method outlined in this work because different choices are made regarding the treatment of the forced component of variability, the target T global data set, the time span of the training data, the time span of the test data, and the rigor of the cross validation (Table S1). Nevertheless, we provide comparisons of the BC2020 method to the most analogous results from these previous studies (green lines and magenta dots in Figure S4). However, a rigorous comparison of methods would require a standardization of method protocols, training data sets, evaluation data sets, and so forth and is beyond the scope of this study.
One advantage of our approach over some previous studies comes from the simplicity of using only the antecedent SAT field to derive predictors of T global. This means that our method requires very little preprocessing and requires no decisions to be made about which particular forcings should be included as predictors and which data sets of those forcings should be used (different groups have produced different estimates of time-varying aerosol forcing, for example).

Hindcast Skill Comparison
Figure 3 shows the root-mean-square error (RMSE) of T global hindcasts relative to observations for the BC2020 method (blue), a persistence benchmark (black), an extrapolation of a linear trend benchmark (black), and the GCM decadal hindcast experiments (red) using the NASA GISTEMP data set as observations (the other three data sets are shown in Figure S4). The RMSEs of the BC2020 method for both cross-validated 1900-2000 predictions (hindcast mode) and completely out-of-sample predictions post 2000 (forecast mode) are shown. The RMSEs for the GCM decadal hindcasts are calculated over time periods that vary by model as described in Table S1. The persistence benchmark uses the average of the previous 5 years, and the extrapolation of the linear trend benchmark uses the previous 20 years to fit the trend. Both of these time periods were chosen to minimize the forecast errors from these naïve benchmarks and thus make them as difficult to outperform as possible.
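Both naïve benchmarks are easy to state precisely; a short sketch follows (the window lengths of 5 and 20 years follow the text, while everything else, including the toy record, is illustrative).

```python
import numpy as np

def persistence_forecast(t_global, lead):
    """Forecast: the mean of the 5 most recent observed annual anomalies."""
    return float(np.mean(t_global[-5:]))

def trend_forecast(t_global, lead):
    """Forecast: a linear trend fit to the 20 most recent years, extrapolated."""
    window = t_global[-20:]
    x = np.arange(len(window))
    slope, intercept = np.polyfit(x, window, 1)
    return float(intercept + slope * (len(window) - 1 + lead))

# Example: benchmark forecasts 1 to 4 years beyond the end of a toy record.
rng = np.random.default_rng(6)
record = 0.01 * np.arange(120) + 0.1 * rng.standard_normal(120)
naive = {lead: (persistence_forecast(record, lead), trend_forecast(record, lead))
         for lead in (1, 2, 3, 4)}
```

Note that the persistence benchmark returns the same value at every lead time by construction, whereas the trend benchmark extrapolates further for longer leads.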

Hindcasts and Forecasts
Figures 4 and 5 show the BC2020 method's hindcast and forecast predictions of T global at lead times of 1 to 4 years applied to the NASA GISTEMP data set (the other three data sets are shown in Figure S3). In these figures, forecasts for 2020-2023 show ±1σ (thick lines) and ±2σ (thin lines) confidence intervals derived from the RMSE of the forecast-mode errors, and the gray shading represents the ±1σ confidence interval for the trend benchmark, which projects the given year's global temperature anomaly as an extrapolation of a trend fit to the previous 20 years.
Scientific discussion of decadal climate variability and a potential short-term hiatus in global warming began emerging around the latter portion of the 2000s decade (Easterling & Wehner, 2009). In 2004, the BC2020 method would have predicted a slight cooling through 2008 (the 4-year lead time forecast valid in 2008, magenta dot, was close to but below the actual value) and thus might have hinted at the emergence of what would come to be known as the hiatus (Medhaug et al., 2017). By 2012, there was much scientific discussion of the global warming hiatus (Kaufmann et al., 2011; Meehl et al., 2011; Solomon et al., 2010, 2011) and its potential to persist. In that year, the BC2020 method predicted that T global was primed to experience an uptick over the subsequent 4 years, and it predicted a new historical record in 2016 (magenta dot for 2016). This surge in T global did come to pass, and 2016 did set the historical record, although it surpassed the magnitude of the anomaly forecast by the BC2020 method. The BC2020 method has produced some forecasts that were off by a large margin as well. For example, the 1-year lead time errors for 2006 and 2014 were both about −0.2°C. Note, however, that these errors help inform the confidence intervals for both the hindcasts and the forecasts over the period 2020-2023.
The spatial patterns and time evolution of SAT variability that are the most relevant to the prediction of subsequent T global anomalies are illustrated in Figure 6. Figures 6a, 6d, and 6g show the PLSR scores (analogous to the principal component time series in principal component analysis) and the loadings (analogous to the empirical orthogonal functions) of the first three PLSR components, which explain 85%, 5%, and 3% of the variance in subsequent T global variability at the 1-year lead time. Positive PLSR loadings displayed on the maps denote where local warm SAT anomalies are associated with subsequent warm T global anomalies, and negative PLSR loadings denote where local cool SAT anomalies are associated with subsequent warm T global anomalies.
The first PLSR component (Figures 6a-6c) largely corresponds to the global warming signal associated with increases in well-mixed greenhouse gas concentrations, and it is the dominant explanation of variability since 1900. It is apparent that this component corresponds to an externally forced signal due to its spatial homogeneity (Sedláček & Knutti, 2012). Variations in incoming solar radiation would also likely be partitioned into this PLSR component because they would also have a spatially homogeneous signature.
The second PLSR component (Figures 6d-6f) shows that warm T global anomalies (apart from the global warming signal) are preceded by cool local SAT anomalies in the tropical Pacific 2 years prior (Figure 6e). However, 1 year prior to a warm T global anomaly (again relative to the long-term global warming signal), there tends to be a warm anomaly over the equatorial Pacific (Figure 6f). This pattern indicates that warm T global deviations (apart from the first PLSR component) are associated with a transition from La Niña-like conditions 2 years prior to El Niño-like conditions 1 year prior to the year in consideration.
The third PLSR component shows that warm T global anomalies are associated with a positive IPO. The sharpening of the IPO pattern from 2 years to 1 year prior to the prediction year indicates that positive T global anomalies are associated with an antecedent strengthening of an already positive IPO state. These findings are consistent with independent methodologies that have highlighted the Pacific's prominence in the modulation of unforced T global variability (Brown et al., 2014;England et al., 2014).
Overall, the first three PLSR components agree with the broader literature that the long-term T global evolution can be predicted by the state of global warming, and subsequent refinements of the T global anomaly for any given year can be made by incorporating information on the state of ENSO and of the IPO.
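For readers who wish to reproduce maps and time series like those in Figure 6, the scores and loadings are directly available from a fitted PLSR model. The sketch below uses scikit-learn attribute names (x_scores_, x_loadings_) and assumes the flattened-grid predictor layout of the earlier snippets; it is an illustration, not the authors' plotting code.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
n_years, n_lat, n_lon = 118, 36, 72
X = rng.standard_normal((n_years, n_lat * n_lon))  # flattened predictor SAT fields
y = rng.standard_normal(n_years)                   # subsequent T_global anomalies

pls = PLSRegression(n_components=4).fit(X, y)

scores = pls.x_scores_        # (n_years, 4): analogous to principal component time series
loadings = pls.x_loadings_    # (n_gridpoints, 4): analogous to EOF patterns
first_component_map = loadings[:, 0].reshape(n_lat, n_lon)  # map of the first component
```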

True Forecast
Training on data from 1900-2019 and using SAT deviations from 2018-2019 as predictor fields, the BC2020 method forecasts T global anomalies (with 2σ uncertainty ranges) of +0.98°C (±0.40°C), +0.98°C (±0.44°C), +0.91°C (±0.44°C), and +0.95°C (±0.48°C) above the 1951-1980 mean for 2020 through 2023, respectively, for the NASA GISTEMP data set (Figures 4 and 5). The BC2020 method's forecast for 2020 and 2021 suggests values nearly equal to that of 2019, before slightly lower values in 2022 and 2023 (Figures 4 and 5). Thus, the BC2020 method does not necessarily foresee the 2016 global temperature record being broken in the forecast period. The 2018-2019 SAT anomaly fields informing this forecast are shown in Figures 6j and 6k.

Discussion and Conclusion
The performance of the BC2020 method is encouraging, but the results shown do necessitate a few caveats. First, the physical mechanisms that can be surmised from statistical relationships are necessarily limited. Insight into the mechanistic underpinnings of the BC2020 method's skill can be gleaned from the PLSR loading patterns (Figure 6), but the most comprehensive physical understanding of interannual T global variability will ultimately involve the use of GCMs. Furthermore, GCMs provide geographically specific predictions for many variables, which are necessary for developing process understanding and for anticipating many climate-related impacts. Thus, the method presented here should not be considered a replacement for GCM decadal prediction but rather should be viewed as a complement and/or a benchmark to which GCM predictions can be compared.
Second, the method presented here is only able to forecast T global deviations that imprint leading indicators of their subsequent evolution in global SAT fields. Thus, short-term changes in forcings, like those associated with large volcanic eruptions or major shifts in socioeconomic activity and thus emissions, cannot be anticipated, and their occurrence would reduce the accuracy of our forecasts. However, this aspect of unpredictability is included in our empirically informed confidence intervals to some degree. For example, the method's T global predictions for 1992 were all higher than what actually occurred because it could not anticipate the mid-1991 Mount Pinatubo eruption. This prediction error does get included in the reported RMSEs at all lead times. Thus, if the frequency of future unpredictable short-term forcings is similar to the past, the method's confidence intervals will accurately represent uncertainty on average.
A final caveat that we will note here is that statistical hindcasts can generally not be validated as rigorously as genuine forecasts. We demonstrate the robustness of our hindcast results using cross validation as well as by implementing the procedure on completely held-out test data. Despite these measures, foreknowledge of the correct answers can implicitly bias the research process in the direction of overfitting. Thus, truly rigorous validation of the method used here will only be achieved as actual out-of-sample forecasts are produced and compared with future observations.
Despite the above caveats, the relatively simple empirical method laid out in this study shows skill in predicting interannual global mean SAT variability and even performs better than most dynamical GCMs at this task. This indicates that a substantial fraction of the useful information for constraining short-term T global evolution is contained in the SAT field of the preceding years (the only predictor variable used here). Given that short-term T global variability is of substantial magnitude relative to the long-term trend and it has an extensive global spatial footprint with broad ecological and societal impacts, this tool represents a computationally inexpensive means of anticipating and possibly mitigating some short-term climate effects.