Combining Neural Networks and CMIP6 Simulations to Learn Windows of Opportunity for Skillful Prediction of Multiyear Sea Surface Temperature Variability

We use neural networks and large climate model ensembles to explore predictability of internal variability in sea surface temperature (SST) anomalies on interannual (1-3 years) and decadal (1-5 and 3-7 years) timescales. We find that neural networks can skillfully predict SST anomalies at these lead times, especially in the North Atlantic, North Pacific, Tropical Pacific, Tropical Atlantic and Southern Ocean. The spatial patterns of SST predictability vary across the nine climate models studied. The neural networks identify "windows of opportunity" where future SST anomalies can be predicted with more certainty. Neural networks trained on climate models also make skillful SST predictions in reconstructed observations, although the skill varies depending on which climate model the network was trained on. Our results highlight that neural networks can identify predictable internal variability within existing climate data sets and show important differences in how well patterns of SST predictability in climate models translate to the real world.


Introduction
Skillful predictions of regional climate variability on multiyear to decadal timescales provide valuable information for near-term societal decision making (Findell et al., 2023; Kushnir et al., 2019). While such predictions remain a significant challenge, a number of studies have shown potential for predicting patterns of internal climate variability, particularly those related to large-scale ocean variability. Patterns of ocean variability thought to have predictable components on 3-to-10-year timescales include the El Niño-Southern Oscillation (ENSO), Atlantic Multidecadal Variability, and the Pacific Decadal Oscillation (PDO) (Cassou et al., 2018; Meehl et al., 2009; Van Oldenborgh et al., 2012). These oceanic patterns can also lead to predictability of important processes over land, including rainfall over the Sahel (Martin & Thorncroft, 2014), North American precipitation (Enfield et al., 2001), Atlantic hurricane frequency (Smith et al., 2010), late winter precipitation over Western Europe (Simpson et al., 2019), and North American and European summer temperatures (Sutton & Hodson, 2005).
Many recent insights into multiyear climate prediction come from initialized decadal hindcast (or retrospective forecast) experiments, where model simulations are initialized with starting conditions that match a historical point in time as closely as possible and then run for up to a decade (Delgado-Torres et al., 2022; Meehl et al., 2021; Yeager et al., 2018). The hindcast simulation is then verified against what actually occurred in the real world and is compared to uninitialized simulations to determine whether the initial starting conditions provided any prediction skill. Prior work has shown that higher skill is achieved when more hindcast ensemble members are used, often with at least 10, and sometimes as many as 40 or 80, ensemble members (Koul et al., 2023; Meehl et al., 2021). The computational expense associated with these experiments poses a considerable challenge for decadal prediction. Initialized simulations are also subject to model drift, which occurs when a simulation that has been initialized to match observations drifts toward its own model climatology. How exactly initialized forecasts should be corrected to account for this drift presents another challenge for decadal prediction (Meehl et al., 2022; Risbey et al., 2021).
More recently, data-driven or machine learning (ML) based approaches have been used to explore multiyear climate predictability (e.g., Gordon et al., 2021; Qin et al., 2022; Toms et al., 2021). In these studies, a statistical or ML model is trained to predict a climate variable or pattern of interest using existing climate data sets. Because of the need for large amounts of training data, many (although not all) prior studies have focused on multiyear predictability within large climate model simulations. For example, Toms et al. (2021) and Gordon et al. (2021) both use >1,000 years from the pre-industrial control run of the Community Earth System Model Version 2 (CESM2) to analyze predictability of land surface temperatures and the PDO, respectively.
A benefit of ML-based approaches is the potential to learn about predictability of the climate system from existing general circulation model (GCM) simulations, reducing the need for additional initialized simulations. However, as with any approach that uses GCM simulations, the trained ML models are subject to biases present in the underlying simulations. A few studies have explored whether ML models trained on GCMs can make accurate predictions in observations. For example, Labe and Barnes (2022) show that a neural network trained on CESM2 can predict observed global warming slowdowns. Ham et al. (2019) show skillful predictions of observed ENSO variability at lead times of up to 17 months using a neural network trained on simulations from different GCMs. These studies show potential for using ML models to predict observed climate variability, but whether or not multiyear predictability in climate models reflects predictability of the real climate system more broadly is still an open question.
Here, we analyze the predictability of sea surface temperature (SST) using neural networks and historical simulations from the Coupled Model Intercomparison Project Phase 6 (CMIP6) archive (Eyring et al., 2016). We focus on predicting internal variability of SSTs at interannual (1-3 years) and decadal (1-5 and 3-7 years) timescales, and apply our analysis globally. To have sufficient training data, we analyze GCMs that have at least 30 historical simulations. After evaluating SST predictability within each GCM, we analyze whether the information learned by the neural networks can lead to accurate SST predictions when tested on reconstructed SST observations. Our goal is (a) to provide an overview and comparison of patterns of SST predictability across different GCMs in the CMIP6 archive and (b) to identify regions where the SST predictability learned from GCMs provides the most skillful predictions of the real ocean.

Data and Preprocessing
Before neural network training, we preprocess the data for each GCM separately. First, we regrid all climate model output to a 5° × 5° latitude-longitude grid. We analyze latitudes between 65°S and 65°N. We calculate 12-month, 36-month, and 60-month average SSTs at each grid point. From each time series (12-month, 36-month, and 60-month averages), we subtract the ensemble mean for each year at each grid point. By removing the ensemble-mean response to external forcing, we focus our analysis on learning predictable components of internal climate variability. Once the ensemble mean is removed, we calculate standardized SST anomalies in each grid cell at each timestep based on the grid-cell mean and standard deviation. Lastly, we calculate tercile limits at each grid point that are used to classify each SST anomaly as negative (bottom third), neutral (middle third), or positive (top third). The tercile limits are calculated separately for each simulation because some simulations are consistently cooler or warmer than the ensemble mean over the historical simulation period. Calculating the terciles separately creates a balanced number of negative, neutral, and positive anomalies within each simulation.
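As a minimal sketch of these preprocessing steps for a single grid cell (toy random data and illustrative variable names; the regridding and moving-average steps are omitted), the ensemble-mean removal, standardization, and per-simulation tercile classification might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for one grid cell: (n_members, n_years) of annual-mean SSTs
# sharing a forced warming trend. Shapes and names here are illustrative.
sst = rng.normal(15.0, 1.0, size=(30, 150)) + np.linspace(0.0, 1.0, 150)

# Remove the ensemble mean for each year (the forced response),
# leaving internal variability.
anom = sst - sst.mean(axis=0, keepdims=True)

# Standardize using the grid-cell mean and standard deviation.
std_anom = (anom - anom.mean()) / anom.std()

# Tercile limits computed separately for each simulation, so that each
# member gets a balanced number of negative/neutral/positive anomalies.
labels = np.empty(std_anom.shape, dtype=int)
for m in range(std_anom.shape[0]):
    lo, hi = np.quantile(std_anom[m], [1 / 3, 2 / 3])
    labels[m] = np.digitize(std_anom[m], [lo, hi])  # 0=neg, 1=neutral, 2=pos
```

Because the tercile limits are per simulation, each member here ends up with exactly one third of its years in each class, matching the balanced-class motivation described above.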

Neural Network Architecture and Training
We train convolutional neural networks (CNNs) to predict SST anomalies using the GCM output (Figure 1). The CNN takes four global maps of prior SSTs as input. These maps correspond to SSTs averaged over 0-1 year, 1-2 years, 2-3 years, and 3-8 years prior. While variables such as ocean heat content may also be useful predictors, we only use SST so that we can later test the CNNs using a globally available historical SST reconstruction (see Section 2.4). For each set of input maps, the CNN predicts the SST anomaly at one grid cell at a given time in the future. Each prediction is the relative likelihood of three categories: positive SST anomaly (the top tercile of historical anomalies), neutral anomaly (middle tercile), or negative anomaly (bottom tercile). The softmax transformation normalizes the likelihoods for each prediction so that each likelihood is in the range [0, 1] and the three likelihoods sum to one.
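As a small illustration of this output stage (a generic softmax, not the paper's exact implementation; the raw outputs below are hypothetical), three network outputs for one grid cell are mapped to tercile likelihoods that sum to one:

```python
import numpy as np

def softmax(logits):
    """Map raw network outputs to likelihoods in [0, 1] that sum to one."""
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw outputs for (negative, neutral, positive) SST anomaly.
likelihoods = softmax(np.array([0.5, 0.1, 1.4]))
predicted_class = int(np.argmax(likelihoods))  # here, the positive tercile
```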
We make SST predictions for three future time periods: years 1-3 (i.e., 36-month SST anomalies starting from the prediction date), years 1-5 (60-month SST anomalies starting from the prediction date), and years 3-7 (60-month SST anomalies starting 2 years after the prediction date).
We split the 30 historical simulations from each GCM into a training set of 22 simulations, a validation set of three simulations, and a test set of five simulations (Table S1 in Supporting Information S1). We use hyperparameter tuning to select the CNN architecture shown in Figure 1. Details of the hyperparameter tuning and CNN training are included in the Supporting Information. We train separate CNNs for each ocean grid cell, lead time, and GCM, and for each CNN we test three different random initializations and select the one with the lowest validation loss for later analysis. This corresponds to training over 90,000 CNNs in total. For reference, training all of the CNNs for a single GCM (∼10,000 CNNs) takes approximately 2 days on a single 40-core high-performance computing node.
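The simulation split and the selection among random initializations can be sketched as follows (the validation losses are mocked placeholders; the real training loop and hyperparameters are in the Supporting Information):

```python
import numpy as np

rng = np.random.default_rng(42)
members = rng.permutation(30)  # 30 historical simulations per GCM
train, val, test = members[:22], members[22:25], members[25:]

# For each ocean grid cell, lead time, and GCM, three randomly initialized
# CNNs are trained and the one with the lowest validation loss is kept.
# These losses are mocked for illustration, not real training output.
val_losses = {"init_0": 0.98, "init_1": 1.02, "init_2": 0.95}
best_init = min(val_losses, key=val_losses.get)
```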

Neural Network Accuracy and Windows of Opportunity
After training, we evaluate CNN performance on the testing data (five simulations per GCM). First, we calculate prediction accuracy across all data in the test set, where accuracy is defined as the frequency with which the CNN predicts the correct output category. We also examine whether the CNNs identify "windows of opportunity," which we define as periods of internal variability that are more predictable than others, or in other words, periods where there is less uncertainty about the future outcome (Gordon & Barnes, 2022; Mayer & Barnes, 2021). Following Mayer and Barnes (2021) and Gordon et al. (2023), we use the CNN prediction of the relative likelihood of each outcome as a measure of the "certainty" or "confidence" of the prediction. The highest confidence predictions are those samples where the CNN predicts a higher relative likelihood of one class versus the others. (In contrast, a low confidence prediction would be one where the CNN predicts a similar, or even equal, likelihood across multiple classes.) Higher prediction accuracy among more confident predictions indicates that the CNN has successfully identified windows of opportunity where predictions can be made with more certainty. We calculate accuracy for the subsets of the 40% and 20% most confident predictions within each testing simulation, and then average across the five testing simulations for each GCM.
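A minimal version of this windows-of-opportunity evaluation, using the maximum predicted likelihood as the confidence measure (one reasonable choice consistent with the description above; data and names are illustrative):

```python
import numpy as np

def window_accuracy(probs, true_labels, frac):
    """Accuracy over the `frac` most confident predictions, where
    confidence is the maximum predicted class likelihood."""
    confidence = probs.max(axis=1)
    predicted = probs.argmax(axis=1)
    n_keep = int(np.ceil(frac * len(probs)))
    keep = np.argsort(confidence)[-n_keep:]  # indices of most confident
    return float((predicted[keep] == true_labels[keep]).mean())

# Toy predictions: the confident samples (rows 0 and 2) are correct,
# while the low-confidence samples are not.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.90, 0.05],
                  [0.34, 0.33, 0.33]])
true_labels = np.array([0, 2, 1, 2])
```

In this toy case, accuracy over all samples is 0.5, but accuracy over the 50% most confident samples is 1.0, which is the signature of a window of opportunity.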
We compare the neural network accuracy to a persistence model, which assumes that the future SST anomaly remains unchanged. For example, the SST anomaly prediction for years 1-5 is the same as the SST anomaly for the most recent 5-year period. Because there is no confidence associated with these predictions, we only calculate overall accuracy (not windows of opportunity).

Evaluating Neural Network Performance on Reconstructed SST Observations
We use the NOAA Extended Reconstructed SST Version 5 (ERSSTv5) data set (Huang et al., 2017a) to evaluate how well the trained CNNs can predict historical internal SST variability. The ERSSTv5 data set provides global coverage at 2° × 2° resolution from 1854 to present. We analyze monthly SST averages from January 1854 through October 2022. We perform similar preprocessing steps as for the GCM simulations. We regrid to the same 5° × 5° grid and calculate 12-, 36-, and 60-month moving averages. We subtract the third-order polynomial trend from each grid cell to remove the long-term forcing, similar to Mayer and Barnes (2022). We remove the historical trend instead of the multi-GCM ensemble mean because of known biases in long-term SST trends between GCMs and historical observations (e.g., Wills et al., 2022). We then calculate grid-cell means, standard deviations, and tercile thresholds for the ERSSTv5 data.
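The third-order polynomial detrending can be sketched per grid cell as follows (simplified; the real pipeline also applies the moving averages, standardization, and tercile steps described above):

```python
import numpy as np

def detrend_poly3(series):
    """Remove a third-order polynomial fit from a grid-cell time series,
    approximating the long-term forced trend."""
    t = np.arange(len(series))
    coeffs = np.polyfit(t, series, deg=3)
    return series - np.polyval(coeffs, t)

# A purely cubic "trend" is removed almost exactly; internal variability
# superimposed on such a trend would be left behind as the residual.
t = np.arange(150)
trend = 1e-6 * t**3 - 2e-4 * t**2 + 0.01 * t + 14.0
residual = detrend_poly3(trend)
```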
In analyzing CNN predictions of ERSSTv5 data, we focus specifically on windows of opportunity by looking at the accuracy of the top 20% most confident predictions.We also calculate the accuracy of persistence predictions within the ERSSTv5 data as a baseline comparison.

Results and Discussion
The CNN accuracy results are shown for one model, IPSL-CM6A-LR, in Figure 2, with the remaining models shown in Figures S2-S9 in Supporting Information S1. IPSL-CM6A-LR was chosen to illustrate the general results and not because of better or worse performance compared to other models. Because we have removed the forced response from the GCM simulations, these maps show the accuracy of predicting internal SST variability.
Overall, we find that prediction accuracy is highest for years 1-3, decreases for years 1-5, and is lowest for years 3-7. This pattern of higher prediction accuracy at shorter lead times holds across all nine GCMs. When accuracy is calculated across all test samples (e.g., left column of Figure 2), the CNNs perform slightly better than the persistence model benchmark (Figures S10 and S11 in Supporting Information S1). However, we find that the CNNs can make much more skillful predictions during windows of opportunity, shown in the middle and right columns of Figure 2. In some regions, prediction accuracy can approach 80% or higher for more confident predictions (e.g., Figures 2c and 2f). We find that the CNNs are able to identify windows of opportunity with higher prediction accuracy in all of the GCMs analyzed (Figures S2-S9 in Supporting Information S1).
Regions where future SSTs are predicted most skillfully include the North Pacific, Tropical Pacific, North Atlantic, Tropical Atlantic and Southern Ocean (defined here as ocean regions between 45°S and 65°S). While the most predictable regions are similar across GCMs, there are also clear inter-model differences. For example, CNNs trained and tested on CNRM-CM6-1 detect especially strong predictability in the North Atlantic (Figure S3 in Supporting Information S1). This is likely due to the stronger persistence of SSTs in the North Atlantic in this GCM (Figure S10 in Supporting Information S1). The CNNs trained on CanESM5 or NorCPM1 have much higher accuracy in predicting SST anomalies in the Southern Ocean compared to other regions. As a third example, the CNNs trained on GISS-E2-1-G, MIROC-ES2L and MIROC6 all show strong year 1-3 SST predictability across the tropics, including parts of the Indian Ocean.
Within each ocean basin, the spatial pattern of predictability varies depending on the GCM. For example, within the North Atlantic, many GCMs have higher predictability in the subpolar North Atlantic (e.g., ACCESS-ESM1-5, NorCPM1). For some GCMs, though, the region of high predictability includes areas in the subtropical North Atlantic (e.g., CNRM-CM6-1, IPSL-CM6A-LR). Different GCMs also have different spatial patterns of predictability in the North Pacific. Many GCMs show higher predictability in the subpolar (and especially the western subpolar) North Pacific region, but some models, such as MIROC-ES2L and MIROC6, show higher predictability in the central North Pacific. In the Southern Ocean, the most predictable region depends on both the GCM and the lead time. Many GCMs show high predictability across most of the Southern Ocean for year 1-3 predictions. For year 3-7 predictions, the region of high predictability generally narrows to regions of the South Pacific and South Atlantic, especially just west and east of South America (between around 160°W and 0°W). Some similarities between GCMs can likely be attributed to the fact that GCMs are not structurally independent, but share components or development history (Kuma et al., 2023). For example, IPSL-CM6A-LR and CNRM-CM6-1 both use the NEMO ocean model, and CNNs trained on both of these models show some of the highest predictability in the North Pacific and North Atlantic. The CNNs trained on MIROC6 and MIROC-ES2L also show similarities (including high predictability across the tropics), likely because both of these GCMs were developed from the same earlier model (MIROC5.2).
After training CNNs on each GCM, we look at how well the CNNs perform when tested on ERSSTv5 data. These results are shown in Figure 3 for the year 1-5 lead time (year 1-3 and year 3-7 results are shown in Figures S12 and S13 of Supporting Information S1). We find that the CNNs are able to make skillful predictions on ERSSTv5 data, and that the CNN predictions outperform the historical persistence model (Figure S14 in Supporting Information S1).
The regions with the most accurate predictions in ERSSTv5 are generally the same regions that were most predictable in the GCMs, namely the North Pacific, Tropical Pacific, North Atlantic, Tropical Atlantic, and Southern Ocean. However, there are also differences in the spatial pattern of prediction skill between ERSSTv5 and the GCMs. As an example, in the North Pacific, the regions of highest prediction skill in ERSSTv5 resemble the PDO horseshoe pattern in the central/eastern North Pacific (e.g., Figures 3a, 3e, and 3i). In contrast, when the CNNs are evaluated on the original GCM test simulations (Figure 2 and Figures S2-S9 in Supporting Information S1), most of the GCMs lack the PDO horseshoe pattern and show the highest prediction skill in the western subpolar North Pacific. There are also some small regions of prediction skill in ERSSTv5 that did not appear at all in the GCMs, such as along the coast of Chile.
The CNN skill at predicting ERSSTv5 data generally decreases at the 3-7 years lead time (Figure S13 in Supporting Information S1). One exception is in the North Pacific for CNNs that were trained on ACCESS-ESM1-5, CNRM-CM6-1, or IPSL-CM6A-LR. We find that these CNNs still make relatively skillful predictions in the North Pacific at the 3-7 years lead time when evaluated on ERSSTv5. In fact, the CNNs trained on ACCESS-ESM1-5 and IPSL-CM6A-LR predict ERSSTv5 in the North Pacific better than they predict their respective GCM testing data at the 3-7 years lead time (Figure 4f).
Figure 4 summarizes the CNN performance on the GCM testing data versus ERSSTv5 data at the global scale (Figures 4a-4c) and for the six regions with the most skillful predictions: North Pacific, Tropical Pacific, Southern Ocean, North Atlantic, Tropical Atlantic, and West Indian Ocean. A few interesting patterns emerge. We find that higher predictability in a GCM does not necessarily lead to higher prediction skill in ERSSTv5. For example, in the North Pacific for years 1-3 and in the Tropical Pacific for years 1-3 and 1-5, the GCMs with the highest prediction accuracy have lower accuracy when the CNNs are tested on ERSSTv5 (shown by negative correlations in Figure 4). However, in other locations, such as the Tropical Atlantic for years 1-5 and years 3-7, higher predictability in the GCM does correspond to higher prediction skill in ERSSTv5. This may be because there is a larger spread in predictability in the Tropical Atlantic across the GCMs, which allows for a larger correlation between predictability in the GCM and prediction skill evaluated on ERSSTv5. For the most part, prediction accuracy is higher in the original GCM test data than in ERSSTv5 (shown by most points falling below the one-to-one lines). However, in addition to the example given above for the North Pacific, some CNNs can make more skillful predictions in the Tropical Pacific and Tropical Atlantic in ERSSTv5 than in the original GCM test data (Figures 4h and 4i). When comparing the timing of correct window of opportunity forecasts across the CNNs, we find that many of the CNNs make correct, confident predictions at the same time (Figure S15 in Supporting Information S1). In some grid cells, there are times when all nine CNNs made correct, confident predictions. This indicates that the CNNs learn some consistent patterns from the different GCMs. One possibility for why the CNNs may sometimes have higher prediction skill in ERSSTv5 is that, while the same dynamics may lead to predictability across different GCMs and ERSSTv5, the signal-to-noise ratio may be stronger in ERSSTv5 than in some GCMs, leading to higher skill in some cases.
The spread in prediction accuracy across the five ensemble members in each GCM test set is shown by horizontal bars in Figure 4. In general, the differences in predictability between different GCMs are larger than the differences in predictability between individual simulations. However, we do find that there can be substantial spread in prediction accuracy depending on both the region and the GCM. The West Indian Ocean and Tropical Atlantic have the highest spread in predictability across different simulations (although not in all GCMs). Overall, this indicates that a ∼150-year record (the length of our training and testing simulations) may not be sufficient to characterize multiyear predictability at a given location, and is consistent with other studies that have also shown time-dependent variability in decadal prediction skill (Borchert et al., 2019). This result suggests another reason why prediction skill may sometimes be higher in ERSSTv5 than in the GCMs: the historical record may include periods of relatively high predictability in some regions.
Overall, many of these results are consistent with prior studies on multidecadal climate prediction. One difference is that we measure prediction skill with classification accuracy rather than metrics like the anomaly correlation coefficient. Additionally, while some prior studies remove the forced trend in order to evaluate prediction skill due to internal variability (e.g., Borchert et al., 2021; Delgado-Torres et al., 2022; Smith et al., 2019), many other studies evaluate skill in predicting the combined forced response and internal variability, which makes it difficult to compare the magnitudes of prediction skill with our results. Still, the regions that we find to be most predictable across the GCMs include many regions that have been identified in prior work, such as the North Atlantic (Borchert et al., 2021; Yeager & Robson, 2017; Yeager et al., 2018), Southern Ocean (Zhang et al., 2023), and North Pacific (Choi & Son, 2022; Gordon et al., 2021; Qin et al., 2022).
Our results also emphasize the importance of considering prediction uncertainty or confidence using the window of opportunity framework. We find windows of opportunity for multiyear SST predictability across all GCMs and at all three lead times studied. These findings are aligned with other recent work demonstrating the occurrence of windows of opportunity within the climate system across multiple timescales using both neural networks (Gordon & Barnes, 2022; Mayer & Barnes, 2021) and initialized hindcasts (Borchert et al., 2019; Brune et al., 2018; Mariotti et al., 2020; Sgubin et al., 2021).

Conclusions
We show that ML, specifically CNNs, can learn patterns of global, multiyear SST predictability from existing, uninitialized climate model simulations. Because our approach does not require new GCM simulations, we can efficiently analyze and compare predictability across many different GCMs. We find that the regions with the highest predictability at interannual and decadal lead times include the North Pacific, North Atlantic, Tropical Pacific, Tropical Atlantic and the Southern Ocean. However, when comparing predictability across nine GCMs, we find notable differences in the spatial patterns and magnitude of SST prediction skill. The patterns learned by the CNNs also lead to skillful predictions when tested on the ERSSTv5 data, but the amount of prediction skill in each region varies based on the GCM used for training. We also find different spatial patterns of SST prediction skill in ERSSTv5 compared to the GCMs, although the most predictable regions are generally similar.
These results could lead to multiple future research directions. Recent related work has shown that "explainable ML" methods can be used to understand why CNNs make certain predictions (Davenport & Diffenbaugh, 2021; Gordon et al., 2021; Labe & Barnes, 2021; Toms et al., 2020). These same methods could be applied to the CNNs used here to understand the sources of SST predictability in different regions and how they differ across GCMs and observations, providing insight both into the mechanisms involved in multiyear variability and into GCM biases in how these mechanisms are represented. Further, while the focus of this study was to explore differences in predictability across GCMs, future efforts could focus on training CNNs to produce the best predictions in the observed climate. Here, we used the same number of ensemble members to train each CNN to enable consistent comparisons, but we found that increasing the amount of training data beyond 22 ensemble members typically improves CNN performance. Increasing the training data, or even training on multiple GCMs at once so that each CNN sees a wider variety of patterns during training, may improve prediction skill on ERSSTv5. We also used only SST as our predictor variable so that we could test the performance of predictions using global SST reconstructions. However, future research could test whether CNN accuracy improves when given additional information about the ocean state. Overall, this research supports a growing body of literature showing that ML is a valuable tool for advancing the field of skillful multiyear climate prediction.

Key Points:
• Neural networks trained on climate model output can skillfully predict SST variability in reconstructed observations
• Neural network skill in predicting observed SST variability depends on the climate model used for training

Figure 1. Overview of the convolutional neural network architecture.

Figure 2. Accuracy of sea surface temperature predictions using convolutional neural networks trained and tested on IPSL-CM6A-LR simulations. (a) Accuracy calculated across all year 1-3 test predictions. (b) Accuracy calculated for the 40% most confident year 1-3 test predictions (see Methods). (c) Accuracy for the 20% most confident year 1-3 test predictions. Black boxes indicate regions in Figure 4. Other GCMs are shown in Figures S2-S9 of Supporting Information S1. Panels (d)-(f) same as (a)-(c), but for year 1-5 test predictions. Panels (g)-(i) same as (a)-(c), but for year 3-7 test predictions.

Figure 3. Accuracy of year 1-5 sea surface temperature predictions for windows of opportunity (i.e., the 20% most confident predictions) within Extended Reconstructed Sea Surface Temperature Version 5 (ERSSTv5) data. Panels show convolutional neural networks trained on different GCMs. Other lead times are shown in Figures S12-S13 of Supporting Information S1.

Figure 4. Comparison of windows-of-opportunity (20% most confident) prediction accuracy in general circulation model (GCM) simulations (x-axis) versus Extended Reconstructed Sea Surface Temperature Version 5 (ERSSTv5) data (y-axis). Regional values are area-weighted average accuracy within the boundaries shown in Figures 2c, 2f, 2i, and 3. Horizontal lines show the accuracy range across the five GCM test simulations, with points showing the mean accuracy. The correlation between accuracy in the GCMs versus ERSSTv5 is shown in the bottom right of each panel.