Probabilistic seasonal forecasts of North Atlantic atmospheric circulation using complex systems modelling and comparison with dynamical models

Dynamical seasonal forecast models are improving with time but tend to underestimate the amplitude of atmospheric circulation variability and to have lower skill in predicting summer variability than in winter. Here, we construct Nonlinear AutoRegressive Moving Average models with eXogenous inputs (NARMAX) to develop the analysis of drivers of North Atlantic atmospheric circulation and jet‐stream variability, focusing on the East Atlantic (EA) and Scandinavian (SCA) patterns as well as the North Atlantic Oscillation (NAO) index. New time series of these indices are developed from empirical orthogonal function (EOF) analysis. Geopotential height data from the ERA5 reanalysis are used to generate the EOFs. Sets of predictors with known associations with these drivers are developed and used to formulate a sliding‐window NARMAX model. This model demonstrates a high degree of predictive accuracy, as indicated by its average correlation coefficients over the testing period (2006–2021): 0.78 for NAO, 0.83 for EA and 0.68 for SCA. In comparison, the SEAS5 and GloSea5 dynamical forecast models exhibit lower correlations with observed circulation changes: for NAO, the correlation coefficients are 0.51 for SEAS5 and 0.34 for GloSea5, for EA they are 0.15 and 0.09, respectively, and for SCA, they are 0.28 and 0.24, respectively. Comparison of NARMAX predictions with forecasts and hindcasts from the SEAS5 and GloSea5 models highlights areas where NARMAX can be used to help improve seasonal forecast skill and inform the development of dynamical models, especially in the case of summer.


| INTRODUCTION
The North Atlantic jet stream strongly influences the weather in Northwest Europe and has a significant role in determining the strength and sign of North Atlantic atmospheric circulation indices such as the North Atlantic Oscillation (NAO), East Atlantic (EA) pattern and Scandinavian (SCA) pattern; the anomalous weather patterns of a particular season can be described by the interplay of these modes of variability (Hall & Hanna, 2018). Recent extreme seasons have been characterized by distinctive jet-stream configurations, and jet strength and location are intimately linked with extreme weather conditions (e.g., in temperature and precipitation) experienced across Northwest Europe (Hall & Hanna, 2018). Extreme seasonal weather has important socio-economic implications, in terms of risk avoidance, with costs to the insurance industry (e.g., £1.5 billion across the UK in winter 2013/14 (Davies, 2014)) and impacts on agriculture, food security, energy supply, public health/wellbeing and severe-weather planning.
Until relatively recently, North Atlantic atmospheric variability was thought to be largely due to unpredictable fluctuations (Stephenson et al., 2000). However, dynamical seasonal forecasting systems have been used to develop skillful seasonal forecasts for UK winter weather from a few months ahead (Scaife et al., 2014). Many factors (drivers) appear to influence the NAO and jet-stream changes, and these potential drivers can be broadly grouped into cryospheric effects from variations in sea-ice extent and snow cover, oceanic effects from North Atlantic sea-surface temperatures (SST), tropical influences such as the El Niño-Southern Oscillation (ENSO), and stratospheric effects due to stratospheric circulation variability, solar variability, volcanic eruptions and the Quasi-Biennial Oscillation (QBO) (Hall et al., 2015). These drivers of jet-stream variability can oppose or reinforce one another, and there are indications of interactions between them (Hall et al., 2019). Drivers of jet-stream variability show seasonal variation, and distinctive drivers operate in different seasons. In addition to these identifiable drivers, a significant part of North Atlantic jet changes is driven by internal unforced variability due to chaotic internal dynamical processes (Kushnir et al., 2006; Lorenz, 1963). While a consensus has now been reached that some observed drivers can be reproduced in climate models, improved understanding of more recently identified drivers of the North Atlantic extratropical jet stream is crucial for making progress in UK seasonal climate predictions (Hall et al., 2015).
The focus of government-funded research is on dynamical forecast systems; however, such forecasts are not always accurate, as in winter 2004-2005 (Hall, Scaife, et al., 2017) and more recently in 2013-2014, when dynamical model forecasts did not well predict the positive winter NAO, did not consider the accompanying positive EA pattern, and hence missed the exceptionally heavy rain and flooding in southern England (Maidens et al., 2021). While dynamical seasonal forecast models are sensitive in winter to tropical forcing such as El Niño events, some evidence suggests that they may be relatively insensitive to Arctic variability (Cohen et al., 2019). Compared with winter, dynamical model forecasts show relatively little skill in summer, when there is less forcing from the tropics (Hall et al., 2015). Recent work on seasonal prediction with dynamical models has also revealed an intriguing conundrum called the signal-to-noise paradox: such models predict the year-to-year variability of the winter NAO reasonably well but underpredict its amplitude, due to a systematic underestimation of the mechanisms influencing mid-latitude atmospheric circulation (Eade et al., 2014; Scaife et al., 2014; Siegert et al., 2016; Stockdale et al., 2015). Rare events are usually forecast to have a low probability, partly because of the signal-to-noise paradox, meaning that if a model predicts a low, but above-average, chance of a rare event happening, this does not necessarily constitute a missed event (Legg & Mylne, 2004). Supplementing dynamical seasonal forecasting systems, statistical methods identify slowly varying boundary conditions, such as sea-ice variability, ocean temperatures and influences from the stratosphere, which are capable of 'nudging' the jet stream and providing elements of predictability (e.g., Baker et al., 2018; Hall, Jones, et al., 2017; Hall, Scaife, et al., 2017). In the mid-latitudes, statistical forecasting has been relatively neglected compared with the tropics; however, recent developments in statistical techniques, under the umbrella of 'machine learning' (e.g., Billings, 2013; Hall et al., 2019), have taken place mainly outside the climate-science community and are relatively quick and cheap to implement.
The novel application of these advanced statistical techniques and systems-science methods has significant potential to improve forecast skill and to help inform the development of the next generation of dynamical seasonal forecasting systems. Here, we use a Nonlinear AutoRegressive Moving Average with eXogenous inputs (NARMAX) systems-identification approach (Billings, 2013; Hall et al., 2019), an interpretable machine learning method, to identify and model linear and nonlinear dynamic relationships between a range of meteorological and related variables. In addition to its ability to delineate nonlinear relations, NARMAX can identify nonstationary associations that arise from changes in forcings over time, building on studies in which dynamical models have suggested changes in NAO forecast skill over periods of several decades (Weisheimer et al., 2019); NARMAX therefore has significant potential to help improve Northwest Europe seasonal weather forecasts. In a pioneering application of NARMAX in this area, Hall et al. (2019) found significant skill in NAO winter forecasting and identified key sources of predictability. Here, we extend skillful seasonal forecasting to the summer season, where dynamical seasonal prediction models currently remain most problematic, and identify factors that contribute skill to the forecast. In a further innovation, we also consider two other principal North Atlantic atmospheric circulation patterns that complement the NAO. Our results form a firm basis for improving Northwest European regional seasonal weather prediction and should therefore be of interest to potential end-users as well as to model developers and the broader seasonal prediction scientific community.

| DATA
Updated versions of the three principal EOFs of European and North Atlantic atmospheric circulation variability (NAO, EA and SCA) were generated using 500 hPa geopotential height data from the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis (Hersbach et al., 2020), combined with the Python eofs package (Dawson, 2016). ERA5 is based on the Integrated Forecasting System (IFS) and replaces ECMWF's earlier ERA-40 and ERA-Interim reanalysis products. We obtained ERA5 data, specifically sea-surface temperatures (SSTs), sea-ice cover, sea level pressure and 500 hPa geopotential heights, from the Copernicus Climate Data Store.
The summer EOFs are based only on high summer (July and August), as using the full June/July/August data generates a poorly defined pattern for EOF1. This is consistent with previous analysis, for example Folland et al. (2009), which suggests there is a strong summer NAO signal that characterizes July and August, while the summer NAO behaves differently during June. The three winter EOFs are based on seasonal data for the full winter quarter (December-February). Maps for the winter and summer EOFs are shown in Figure 1. These are largely similar to the EOFs obtained by Hall and Hanna (2018) but also show some notable differences. For example, the winter SCA pattern has the high-pressure anomaly further west than the Hall and Hanna (2018) version, centred to the north and north-east of the British Isles, while the winter NAO has the low-pressure anomaly centred to the south of Greenland rather than over Iceland. The differences are mostly down to the different methodology used to calculate the EOFs, rather than to different time periods.
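The paper computes the EOFs with the Python eofs package; as a hedged illustration of the same operation, the leading patterns and principal-component time series of an anomaly field can be obtained from a plain numpy SVD. The function name, synthetic field and mode count below are illustrative, not the authors' code:

```python
import numpy as np

def compute_eofs(z500, n_modes=3):
    """Leading EOFs and PC time series of an anomaly field.

    z500: array of shape (time, space), e.g. 500 hPa geopotential height
    anomalies with the spatial grid flattened.
    Returns (eofs, pcs, variance_fraction).
    """
    anoms = z500 - z500.mean(axis=0)            # remove the time mean
    u, s, vt = np.linalg.svd(anoms, full_matrices=False)
    eofs = vt[:n_modes]                         # spatial patterns
    pcs = u[:, :n_modes] * s[:n_modes]          # PC (index) time series
    var_frac = s**2 / np.sum(s**2)              # explained variance per mode
    return eofs, pcs, var_frac[:n_modes]
```

The eofs package adds conveniences this sketch omits, notably latitude (cos-weighted) scaling and missing-data handling, which matter for real geopotential height grids.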
Monthly 500 hPa geopotential height forecasts from SEAS5 (provided by ECMWF) and GloSea5 (provided by the UK Met Office) were analysed and projected onto the EOFs using projectField from the eofs package. Seasonal forecast runs were provided by the Copernicus Climate Change Service (C3S) via the Climate Data Store website. For both models, hindcasts are available from 1993 to 2016 inclusive. For SEAS5, a complete set of seasonal forecast runs is also available from 2017 onwards, but at the time of analysis C3S provided only an incomplete set of GloSea5 seasonal forecast runs, covering the winters of 2017/2018 to 2019/2020 inclusive and the summers of 2018 and 2019.

Table 1: Potential drivers of winter and summer atmospheric circulation variability that are used as predictors for NARMAX models.

Dataset
A number of variables that may be used to predict North Atlantic jet-stream and atmospheric circulation variability, and by extension temperature and precipitation over Northwest Europe, have been collected for both winter (DJF) and summer (JJA), building on the drivers identified by Hall (2016), Hall and Hanna (2018) and Hall, Scaife, et al. (2017). A wide range of potential drivers has been assembled so that NARMAX can select from a broad pool of candidate variables. SST anomaly patterns are used, including ENSO and the Atlantic Multidecadal Oscillation (AMO), plus sea-ice anomalies, snow cover anomalies and tropical precipitation anomalies. The stratospheric polar vortex and QBO are used as predictors for winter atmospheric circulation, but not for summer, owing to a lack of evidence for their having a strong influence on summer atmospheric circulation.
SST anomalies, sea-ice coverage anomalies, the AMO, tropical precipitation anomalies and the strength of the stratospheric polar vortex were calculated based on ERA5 reanalysis data. This version of the AMO is based on the region 7-75°W, 25-60°N, subtracting the global SST anomaly from the regional SST anomaly to remove biases that would result from the upward trend in global SSTs. For summer, an SST-based North Atlantic dipole index is used, based on Ossó et al. (2018), who provided evidence for a link between this and a high-pressure anomaly to the west of the British Isles, resulting in relatively anticyclonic weather over Britain. The North Atlantic Horseshoe SST pattern (Cassou, Terray, et al., 2004) is linked with the winter NAO. The North Atlantic tripole is based on the methodology of Marshall et al. (2001), who provided evidence for this being linked especially with the NAO in winter. Snow cover data are based on Estilow et al. (2015). Monthly sunspot numbers were obtained from the Solar Influences Data Analysis Center (1956-2021). The QBO data are obtained from the Free University of Berlin (Naujokat, 1986). A full list of the drivers is provided in Table 1. Predictors are sourced from a range of preceding months, up to 8 months in advance in some cases, to account for possible lagged teleconnections. For example, for the winter season the predictors from March to October are used as inputs, while for the summer season the predictors from the previous September to April are used as inputs to the modelling.
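The detrended AMO construction described above (regional North Atlantic SST anomaly minus the global SST anomaly) can be sketched as follows. The function name and the simple unweighted means are assumptions; a production index would area-weight grid cells by cos(latitude):

```python
import numpy as np

def amo_index(sst, lat, lon):
    """Sketch of the detrended AMO: mean SST anomaly over 7-75W, 25-60N
    minus the global mean SST anomaly, removing the global warming trend.

    sst: (time, lat, lon) SST anomalies (NaN over land is tolerated);
    lat, lon: 1-D coordinate arrays, lon in [-180, 180].
    """
    region = ((lon >= -75) & (lon <= -7))[None, :] & \
             ((lat >= 25) & (lat <= 60))[:, None]
    regional = np.nanmean(np.where(region, sst, np.nan), axis=(1, 2))
    global_mean = np.nanmean(sst, axis=(1, 2))
    return regional - global_mean   # uniform global change cancels out
```

Subtracting the global mean means that a uniform warming applied everywhere leaves the index unchanged, which is the stated purpose of this AMO variant.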

| The NARMAX model
The Nonlinear AutoRegressive Moving Average with eXogenous inputs (NARMAX) model for multiple-input, single-output (MISO) systems is generally represented as follows (Wei, 2019; Wei & Billings, 2022):

y(k) = F[y(k-1), ..., y(k-n_y), u(k-d), ..., u(k-d-n_u), e(k-1), ..., e(k-n_e)] + e(k)    (1)

where y(k), u(k) and e(k) are the system output, input and noise sequences, respectively; n_y, n_u and n_e are the maximum lags for the system output, input and noise, respectively; F[·] is some nonlinear function; and d is a time delay, typically set to d = 0 or d = 1. The noise sequence e(k) is nearly always unknown for real-world modelling, and it is usually estimated using the prediction error e(k) = y(k) - ŷ(k), where ŷ(k) denotes the model prediction. In practice, there are many model structures that can be used to approximate the unknown mapping F[·], including power-form polynomial models, neural networks, radial basis function networks and wavelet expansions (Billings, 2013; Wei et al., 2010). Power-form polynomials, due to their desirable properties, especially their transparency and interpretability, are commonly used to construct NARMAX models (Billings, 2013).
The Nonlinear AutoRegressive with eXogenous inputs (NARX) model is a special case of the NARMAX model (1) that does not include the lagged noise variables:

y(k) = F[y(k-1), ..., y(k-n_y), u(k-d), ..., u(k-d-n_u)] + e(k)    (2)

In many real-world applications, the output y(k) in (1) and (2) is assumed to be independent of previous outputs. For such cases, model (2) reduces to the nonlinear finite impulse response model (NFIR, also known as the Volterra series model) (Billings & Wei, 2008; Wei & Billings, 2009):

y(k) = F[u(k-d), ..., u(k-d-n_u)] + e(k)    (3)

If the time delay and the maximum lags in (3) are all set to zero, that is, d = n_y = n_u = 0, then the NFIR model reduces to a nonlinear multiple regression (NMR) model (Hall et al., 2019):

y(k) = F[u(k)] + e(k)    (4)

For a simple illustration, consider a system with three inputs u1, u2 and u3, for which the full initial model comprising all linear and quadratic terms is

y(k) = c0 + c1 u1(k) + c2 u2(k) + c3 u3(k) + c4 u1(k)u2(k) + c5 u1(k)u3(k) + c6 u2(k)u3(k) + c7 u1^2(k) + c8 u2^2(k) + c9 u3^2(k) + e(k)    (5)

The degree of nonlinearity (nonlinear degree) of model (5) is 2, as it contains quadratic model terms; if a polynomial model contains at least one cubic or cubic cross-product term, its nonlinear degree is 3. However, the terms in (5) are usually not equally important for explaining the system and interpreting the change of the output y(k), meaning that terms making a tiny or negligible contribution to explaining the variation in the response y(k) may be removed from the model. With the help of a model-selection algorithm called Forward Regression with Orthogonal Least Squares (FROLS) (Billings, 2013), the most important model terms can be determined and used to generate a concise and compact model. FROLS uses an effective but simple measure, the error reduction ratio (ERR), to evaluate the contribution of each candidate model term to explaining the variation of the system output. The number of model terms in the final model can be determined using several statistical criteria, such as the Akaike information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978) and the penalized error-to-signal ratio (PESR) (Wei et al., 2010).
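A minimal sketch of the FROLS idea, greedy selection of candidate terms by their error reduction ratio after orthogonalization against the already-selected terms, might look like this. It is a simplified illustration, not the authors' implementation, and it omits the AIC/BIC/PESR stopping criteria in favour of a fixed term count:

```python
import numpy as np

def frols(P, y, n_terms):
    """Simplified FROLS: greedily pick model terms by error reduction ratio.

    P: (N, M) matrix of candidate model terms (one column per term);
    y: (N,) system output; n_terms: number of terms to keep.
    Returns (selected column indices, ERR of each selected term).
    """
    P = np.asarray(P, dtype=float)
    y = np.asarray(y, dtype=float)
    sigma = y @ y                       # total (uncentred) output energy
    selected, errs, W = [], [], []
    for _ in range(n_terms):
        best = (-1.0, None, None)       # (ERR, column index, orthogonal column)
        for j in range(P.shape[1]):
            if j in selected:
                continue
            w = P[:, j].copy()
            for wq in W:                # Gram-Schmidt against chosen terms
                w -= (wq @ P[:, j]) / (wq @ wq) * wq
            denom = w @ w
            if denom < 1e-12:           # term is (nearly) redundant
                continue
            err = (w @ y) ** 2 / (denom * sigma)
            if err > best[0]:
                best = (err, j, w)
        errs.append(best[0])
        selected.append(best[1])
        W.append(best[2])
    return selected, errs
```

Because each candidate is orthogonalized against the terms already chosen, the ERR values measure independent contributions and sum towards 1 as the selected terms explain more of the output variance.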

| The sliding-window NARMAX models
Usually, a NARMAX model can generate reliable predictions and reveal convincing, transparent relationships between the system input and output over the entire period of the dataset. However, it is often more desirable to exploit local information in the dataset, especially for a complex, time-varying dynamic system or process such as the climate. Therefore, in the following we introduce the sliding-window NARMAX model (SW-NARMAX) for seasonal forecasts.
The proposed framework of the sliding-window NARMAX model is shown in Figure 2. Take a single-input, single-output system as an example, where the measured input signal of N samples is denoted by u = [u(1), u(2), ..., u(N)]ᵀ and the output by y = [y(1), y(2), ..., y(N)]ᵀ. In matrix format, the dataset of the system can be represented as D = [u, y]. With a one-step forward sliding window of length w (shown as the dotted rectangles in Figure 2), the original dataset D can be resampled into s overlapping subsets,

D_m = {(u(k), y(k)) : k = m, ..., m + w - 1},  m = 1, 2, ..., s    (6)

For each windowed dataset D_m, a NARMAX model M_m can be generated using the method discussed in Section 3.1. Thus, based on (6), there will be s NARMAX models M = {M_1, M_2, ..., M_s}, represented by the green rectangles in Figure 2, and s corresponding predictions ŷ_1, ..., ŷ_s of the system output over the testing period. Note that the whole set of available observations is split into two parts: (1) around 80% of the data are used for model training, and (2) the remaining 20% are used for model testing. Each windowed dataset is a subset of the training dataset. To find the most appropriate model for each window, the associated windowed dataset is further partitioned into training and validation sub-datasets. For each window, the role of the validation data is threefold: (1) to test and validate model performance on data 'unseen' during the training process; (2) to optimize and adjust model parameters and hyper-parameters, such as the window size and the model structures, where necessary; and (3) to avoid overfitting.
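The one-step-forward resampling of the training data into overlapping windows can be sketched in a few lines (an illustrative helper, not the authors' code):

```python
import numpy as np

def sliding_windows(data, w):
    """Resample a dataset of N rows into s = N - w + 1 overlapping
    windows of length w (one-step-forward sliding window).

    data: (N, d) array with one row per time step (e.g. columns [u, y]);
    returns a list of (w, d) array views, one per window.
    """
    n = data.shape[0]
    return [data[m:m + w] for m in range(n - w + 1)]
```

Each windowed subset would then be split further into training and validation parts before a NARMAX model is fitted to it, as described above.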

| Model selection and averaging
NARMAX modelling encompasses both linear and nonlinear models. In this case, as discussed in Section 3.1, the polynomial form of the model structure, including both linear and nonlinear forms, is considered. Note that different nonlinear degrees can lead to quite different models, which will affect the forecast and the explanation of the system. A model with a higher nonlinear degree may produce a more accurate representation of the system and hence better predictions. However, such a model can become very complex and needs a large number of samples for training and estimation. The present modelling challenge, by contrast, is a typical small-sample problem, in which the number of observations is smaller than the number of candidate regressors. Taking a system with n input variables as an example, if the nonlinear degree of the model is deg, then the number of generated model terms is (n + deg)!/(n! deg!), where '!' denotes the factorial. This implies that models with a high nonlinear degree (4 or higher) are intractable, especially when the number of inputs is large, as the number of generated model terms would be enormous.
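The term-count formula (n + deg)!/(n! deg!) equals the binomial coefficient C(n + deg, deg), which is easy to evaluate directly; for instance, with 130 predictors (the number used later in this study) a cubic model already has several hundred thousand candidate terms. The helper name below is illustrative:

```python
from math import comb

def n_model_terms(n_inputs, deg):
    """Number of polynomial model terms for n_inputs variables and
    nonlinear degree deg: (n + deg)! / (n! deg!) = C(n + deg, deg),
    counting the constant, all linear terms and all cross-products."""
    return comb(n_inputs + deg, deg)
```

For the three-input quadratic example of model (5), this gives 10 terms: one constant, three linear, three squares and three cross-products.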
In order to avoid overfitting and reduce the sensitivity of the model while still guaranteeing a reliable model structure, the highest nonlinear degree is set to three in this study. Then, for each of the M windowed datasets, there are three candidate NARMAX models: the linear model M_m,l, the quadratic model M_m,q and the cubic model M_m,c. Therefore, with a window of length w, there are a total of 3M NARMAX models, including linear and nonlinear forms. To select the best models from this large pool of candidates, the validation set is applied as discussed in Section 3.1. For each model M_m,i (identified from the mth windowed dataset), where i ∈ {l, q, c}, the prediction for the validation set is denoted by ŷᵛ_m,i, and the corresponding mean squared errors (MSE) of the 3M identified models are denoted mseᵛ_m,l, mseᵛ_m,q and mseᵛ_m,c. The best model for each window is selected by comparing the MSE within each group, so that for each window of length w there is one best model, giving M best NARMAX models in total.
NARMAX methods are generally robust for system analysis and prediction, but using a single 'best' model may be risky in some applications. Therefore, it is reasonable to apply a model-averaging algorithm to reduce the risk associated with depending solely on a single model, especially when dealing with small-sample applications; this can also mitigate the sensitivity of the model to noise or uncertainties (Hall et al., 2019). In this study, a weighted-mean scheme is used to deliver the predicted value. However, unlike the method in Hall et al. (2019), the weights are calculated based on the MSE of the M models over the validation period, rather than over the training period.
Assume the mean squared errors (MSEs) of the M NARMAX models over their respective validation periods are mse_1, ..., mse_M. Let the weight assigned to the ith model be

w_i = (1/mse_i) / Σ_{j=1}^{M} (1/mse_j)

so that models with smaller validation error receive larger weights. Then the averaged model prediction can be defined as

ŷ = Σ_{i=1}^{M} w_i ŷ_i

Figure 2: The framework of the sliding-window NARMAX models.
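Model averaging with weights based on each model's validation MSE can be sketched as follows. The inverse-MSE weighting here is an assumption consistent with the description in the text (smaller validation error yields larger weight), and the names are illustrative:

```python
import numpy as np

def weighted_mean_prediction(preds, mses):
    """Average an ensemble of model predictions with weights inversely
    proportional to each model's validation-period MSE.

    preds: (s, T) predictions from s window models over T test steps;
    mses:  (s,) validation MSEs. Returns the (T,) averaged prediction.
    """
    inv = 1.0 / np.asarray(mses, dtype=float)
    w = inv / inv.sum()                 # weights sum to 1
    return w @ np.asarray(preds, dtype=float)
```

With equal MSEs this reduces to a plain ensemble mean; a model with three times the validation error of another receives a third of its weight.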

| Prediction verification
Some typical model validation criteria, including correlation coefficients, mean absolute error (MAE) and root-mean-square error (RMSE), are used to evaluate model performance. The number of forecasts needs to be sufficiently large to make statistical conclusions about forecast skill robust and convincing; the sliding-window NARMAX (SW-NARMAX) method can generate several models from the training set and so supports valid statistical conclusions. As depicted in Figure 3, the original dataset is segmented into a training set, a validation set and a testing set. Models are initially trained on the training set, then refined and optimized on the validation set, and finally assessed on the (fully independent) testing set, ensuring a rigorous evaluation of their performance.

In addition, the continuous ranked probability score (CRPS) (Leutbecher & Haiden, 2021) is used in this study to evaluate the quality of the seasonal forecast models. The CRPS estimates the difference between the observed and expected outcomes and can be viewed as an integral over the possible Brier scores (Bradley et al., 2008). For a categorical forecast it is herein defined as

CRPS = Σ_i (p_i - x_i)²

where x_i takes the value 1 or 0 according to whether or not the event occurred in predefined class i (as in a binary classification forecast), and p_i is the forecast probability for class i. To calculate the CRPS of the SW-NARMAX models, two classes are first defined and the forecast probability p_i must be obtained. By applying the sliding-window method over the training period, s SW-NARMAX models are identified as discussed above; thus, for each year in the testing set there are s predictions. The forecast probability for a class C can then be calculated as

p = count(ŷ_i ∈ C)/s,  i = 1, 2, ..., s

where count(·) returns the number of predictions that meet the condition and ŷ_i denotes the prediction of the ith SW-NARMAX model.

Based on the definition of the classes, the CRPS verifies whether the predictions of the SW-NARMAX models match the class of the observation: the smaller the CRPS, the more consistent the category of the SW-NARMAX predictions is with the category of the observed values.

Figure 3: Schematic illustration of the NARMAX model resampling using the sliding-window approach.
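The per-year forecast probability and two-class CRPS computation described above can be sketched as follows. The class boundary at zero (i.e., the sign of the index) and the function name are assumptions, since the paper's exact class definition is not fully recoverable here; for two classes the score reduces to a Brier-type squared difference:

```python
import numpy as np

def crps_two_class(preds, obs, threshold=0.0):
    """Two-class CRPS for one year of SW-NARMAX ensemble predictions.

    preds: (s,) ensemble predictions for the year; obs: observed index;
    threshold: assumed class boundary (sign of the index).
    Returns a score in [0, 1]; 0 means perfect categorical agreement.
    """
    preds = np.asarray(preds, dtype=float)
    p = np.count_nonzero(preds > threshold) / preds.size  # forecast probability
    x = 1.0 if obs > threshold else 0.0                   # observed outcome
    return (p - x) ** 2
```

A CRPS of 0 means every ensemble member fell in the observed class, and 1 means none did, matching the interpretation given in the text.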
Correlation and its significance are assessed using Pearson's correlation coefficient and the associated p-value, which gives the probability of obtaining the observed result if the true correlation were zero; p < 0.05 and p < 0.01 are commonly used significance thresholds.
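A distribution-free way to attach a p-value to a Pearson correlation is a permutation test, sketched below. The paper reports conventional parametric p-values; this numpy-only alternative is illustrative and the names are assumptions:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided p-value for the null of zero correlation, estimated by
    randomly permuting y and recomputing |r| (add-one correction applied)."""
    rng = np.random.default_rng(seed)
    r0 = abs(pearson_r(x, y))
    hits = sum(abs(pearson_r(x, rng.permutation(y))) >= r0
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)
```

For the short (8-9 year) testing periods used here, a permutation or Monte Carlo approach avoids relying on the t-distribution assumptions behind the parametric p-value.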
To evaluate the statistical significance of the NARMAX models, a Monte Carlo sampling method based on an autoregressive (AR) modelling framework is implemented in the Supplementary Material. This approach involves generating 100 simulated databases from the AR models. By repeatedly sampling from these databases, a distribution of outcomes that reflects the inherent variability is constructed, which in turn allows the statistical significance of the empirical results to be estimated. These simulations provide a robust basis for evaluating the statistical significance of the NARMAX findings, ensuring that conclusions are grounded in empirical evidence and resilient to the stochastic nature of the underlying processes.
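The Monte Carlo AR-based significance check can be sketched as follows: fit a lag-1 autoregressive model to the observed series, generate surrogate series with the same persistence, and build a null distribution of correlations against which the empirical result is compared. The AR(1) choice, the estimator and all names are illustrative simplifications of the supplementary-material procedure:

```python
import numpy as np

def ar1_surrogate_correlations(x, y, n_sims=100, seed=0):
    """Null distribution of |correlation| between x and AR(1) surrogates of y.

    Fits an AR(1) model to y, simulates n_sims synthetic series with the
    same lag-1 autocorrelation and variance, and returns |corr(x, surrogate)|
    for each; the empirical |corr(x, y)| can then be ranked against these.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float) - np.mean(y)
    phi = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])       # lag-1 AR coefficient
    sigma = np.std(y) * np.sqrt(max(1.0 - phi ** 2, 1e-6))
    corrs = []
    for _ in range(n_sims):
        z = np.zeros_like(y)
        eps = rng.standard_normal(y.size) * sigma
        for t in range(1, y.size):                   # simulate AR(1) surrogate
            z[t] = phi * z[t - 1] + eps[t]
        corrs.append(abs(np.corrcoef(x, z)[0, 1]))
    return np.array(corrs)
```

If the observed |correlation| exceeds, say, the 95th percentile of the returned distribution, the result is unlikely to be an artefact of serial persistence alone.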

| Dynamical models
To help assess the accuracy and utility of NARMAX-generated seasonal forecasts, comparisons are made with the seasonal forecasts from two commonly used dynamical models. Dynamical model seasonal forecasts are generated based on runs from up to 1 month in advance. Monthly forecast runs are obtained from the Copernicus Climate Change Service (C3S). For this analysis, hindcast data for 1993 to 2016 are used from the ECMWF SEAS5 model (Johnson et al., 2019) and the Met Office GloSea5 model (MacLachlan et al., 2015). The GloSea5 outputs are based on seven ensemble members, while SEAS5 has 25 ensemble members over this period. The monthly runs are aggregated to produce a seasonal forecast that corresponds to the seasonal prediction from 1 month out (e.g., the winter forecasts are based on December with 1 month lead time, January with 2 months lead time and February with 3 months lead time, corresponding to a seasonal forecast issued in November).
Predictions of 500 hPa heights from the dynamical models are assessed against the three EOFs discussed in Section 2, for both winter and summer. As with NARMAX, in the case of summer, June is considered separately from high summer (July and August), while winter is defined simply as December, January and February and is labelled by the year of the January. Correlation, RMSE and the CRPS are used together to provide an indication of forecast skill.
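Projecting forecast height anomalies onto an EOF pattern, the role played by projectField in the eofs package, amounts to a normalized inner product over the spatial grid. A minimal numpy analogue (illustrative and unweighted, whereas projectField can apply latitude weights):

```python
import numpy as np

def project_onto_eof(field_anom, eof_pattern):
    """Pseudo-PC time series from projecting an anomaly field onto one EOF.

    field_anom: (time, space) anomaly field with the grid flattened;
    eof_pattern: (space,) spatial EOF pattern. Returns a (time,) series.
    """
    eof_pattern = np.asarray(eof_pattern, dtype=float)
    # inner product per time step, normalized by the pattern's own energy
    return np.asarray(field_anom, dtype=float) @ eof_pattern / (eof_pattern @ eof_pattern)
```

Applying this to each dynamical-model forecast field yields NAO/EA/SCA-like index series that can be compared directly with the reanalysis-based EOF time series.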

| RESULTS
In this section, sliding-window NARMAX models are defined by the indices they use (station-based NAO, EA and SCA), by the start year of the predictor dataset (1979) and by season (summer or winter). For example, the Jun_NAO79 (JA_NAO79) summer models are the sliding-window NARMAX models for the summer NAO in June (the July-August average), using the 1979-2022 predictor dataset. In the main part of this paper, we focus on the NARMAX prediction results based on the 1979 (start date) datasets; the remaining prediction results are presented in the Supplementary Information. For a better evaluation of the performance of different models against observations, model results are based on weighted means of the model ensemble members over the validation and test periods.

| Experimental settings
Firstly, we give a brief introduction to the 1979 dataset and the settings for the sliding-window NARMAX models. In this dataset, six indices or system outputs need to be modelled in summer: Jun_NAO79, JA_NAO79, Jun_EA79, JA_EA79, Jun_SCA79 and JA_SCA79. The dataset contains 43 observations (years) for each index, while the number of input variables or predictors is 130. Therefore, this application is a typical small-sample modelling and forecasting problem (Section 3.3). In the experiments, the predictors lead the phenomenon being predicted by up to 8 months (Section 2), with the latest predictor month being April for summer and October for winter.

| NAO summer results
For the June (July and August average) NAO, the 15 most frequent predictors in the sliding-window NARMAX models (16 models for Jun_NAO79 and 19 models for JA_NAO79) are listed in Figure 4a,b. Where the same predictor is shown for different months in the same graph (Figures 4-9), this refers to separate ensemble NARMAX models. To avoid overfitting, different months of a particular predictor, such as the February and April records for summer, are not used in the same NARMAX model. In the Jun_NAO79 models, the most selected predictors are 'WIR' and 'BER', which, counting across all relevant months, are both selected 13 times, followed by 'NAsnow', 'EIR' and 'WPR', which are selected 12, 10 and 10 times, respectively. For the JA_NAO79 models, the most selected predictor is 'BC', selected 13 times, followed by 'GRE' and 'WIR', which are each identified nine times. In addition, in both the Jun_NAO79 and JA_NAO79 models, some predictors, specifically 'WIR', 'BER', 'ESL', 'NAsnow' and 'lead1SS', are relatively frequently selected.
The predictions made by the sliding-window NARMAX models are presented alongside the observed NAO time series in Figure 4c,d. In Figures 4-9, the red lines with 'o' markers represent the observations of the indices (i.e., the EOF time series defined in Section 2), while the blue lines with crosses represent the weighted mean predictions of the sliding-window NARMAX models; the light blue area is the 95% confidence interval (CI) generated by the identified NARMAX models (see Table 1 for a full list of predictor names). As shown in Figure 4c,d, in some years the predictions do not correspond to the same category as the observed values; for such years, the SW-NARMAX weighted mean prediction and the observation fall into different categories.
NARMAX verification statistics against observed data for the model testing period are summarized in Table 2. Model-observation correlation coefficients of the Jun_NAO79 and JA_NAO79 models for the entire testing period (2014-2021) are 0.89 and 0.87, respectively, and are highly significant (p ≤ 0.01). Furthermore, the Jun_NAO79 model performs more skillfully, with smaller RMSE and MAE than the JA_NAO79 model.
The comparison between the predictions of the sliding-window NARMAX models for the EA and the observations is shown in Figure 5c,d, together with the 95% CI areas of the SW-NARMAX models. For JA_EA79, the SW-NARMAX ensemble models show good accuracy in most years of the testing set, although in 2016 and 2019 the predictions are not accurate compared with the observations. Model-observation correlation coefficients of the Jun_EA79 and JA_EA79 models for the testing period (2014-2021) are 0.78 and 0.88, respectively (Table 2), and the correlation coefficient of Jun_EA79 is significant at the p ≤ 0.05 level.

| SCA summer results
The statistical analysis of predictors in the 15 Jun_SCA79 NARMAX models and 18 JA_SCA79 models is shown in Figure 6a,b. Ranked by frequency of predictors in the Jun_SCA79 models, 'BER' appears most often (17 times), followed by 'AMO' (10 times) and 'HUD' (7 times). For JA_SCA79, 'ESL' is the most frequently selected (13 times), followed by 'EPR' and 'BER', which appear 12 and 10 times, respectively.
Figure 6c,d shows the comparison between the weighted mean predictions and the observations over the out-of-sample period for SCA. A majority of the observations (10/16) are, once again, within the prediction band generated by the sliding-window models. However, as before, in a few years, that is, 2016, 2018 and 2019 for Jun_SCA79 and 2015, 2016 and 2019 for JA_SCA79, the probabilistic prediction bands do not encompass the observations. The statistical verification metrics for these models are again summarized in Table 2. Model-observation correlation coefficients of the Jun_SCA79 and JA_SCA79 models for the testing period (2014-2021) are 0.69 and 0.65, respectively (Table 2). The CRPS values in Figure 6c,d show that the SW-NARMAX models for the summer SCA yield accurate predictions in many years of the testing set (2014, 2015, 2017, 2020 and 2021 for Jun_SCA79 and 2014, 2017, 2018, 2020 and 2021 for JA_SCA79), while for the remaining years the weighted mean predictions and the CI areas of the SW-NARMAX models differ markedly from the observations.

| EOF NAO winter results
The frequency analysis of predictors in the winter EOF NAO79 models is shown in Figure 7a, which lists the 15 most frequently chosen predictors among the 21 EOF NAO79 models. As shown in Figure 7a, the joint most selected predictors are 'OctBK' and 'Octsno', with 'sep_SPG_SST' the next most frequently identified. The performance comparison is shown in Figure 7b, and the verification statistics for the weighted mean model are given in Table 3. For the EOF NAO model, the model-observation correlation coefficient over the testing period (2014-2022) is 0.78. The CRPS values in Figure 7b indicate that the SW-NARMAX models show good accuracy over the testing period for the winter NAO, as almost all CRPS values are close to zero, implying that observations and predictions are in the same class; only in 2016 is the CRPS value relatively high. Moreover, the observations lie within the 95% CI of the predictions, further supporting the accuracy of the models.

| EOF EA winter results
Figure 8a shows the relative frequency of predictors included in the 23 models of the winter EOF EA. 'augBK' is the most commonly identified predictor in the winter EOF EA models. Based on the frequency analysis across the EOF EA models, the most commonly selected predictors are 'HUD' and 'BK', indicating that they have more influence on the winter EA.
The performance comparison is shown in Figure 8b. The weighted mean prediction by the EOF EA models (red line) closely follows the observations in both the validation and training sets, and the model prediction band consistently encompasses the observations. This comparison once again highlights the skillful performance of the NARMAX models, particularly evidenced by the significant correlation of 0.87 over the 2014-2022 testing period (Table 3).
As shown by the CRPS values in Figure 8b, the SW-NARMAX models of the winter EA demonstrate high skill over the testing period, as the proximity of all CRPS values to zero suggests that the observations and predictions belong to the same class. Model accuracy is corroborated by nearly all the observations (except 2015) falling within the 95% confidence interval (CI) range of the predictions.

| EOF SCA winter results
The 15 most frequent predictors in the 16 winter EOF SCA models are shown in Figure 9a, where the most selected predictors are 'NAH' and 'TRI'. Figure 9b shows the comparison between the observations and the weighted mean prediction of the identified models in the testing set. Although the RMSE (0.71) and MAE (0.54) of SCA79 in the testing set are larger than those of the other indices, most (7/9) observations lie in the prediction band, with the observations in 2014 and 2016 falling outside it. However, the correlation coefficient (0.84 for SCA; p < 0.01) indicates overall high predictive skill (Table 3).
Figure 9b shows that in most years, the SW-NARMAX models have accurate predictions over the testing period.However, in 2014 and 2016, the performance of the SW-NARMAX models is considerably lower compared with other years, although the models still correctly predict a negative winter SCA.

| Linear and nonlinear NARMAX models
To better demonstrate the effect of the nonlinear relationships between the predictors and the atmospheric circulation indices, we compare experiments carried out using two types of NARMAX model: purely linear and purely nonlinear. This allows us to assess the influence of the nonlinear combination of predictors compared with assuming linearity in the seasonal weather system.
As before, the original dataset is divided into two parts: training and testing subsets with a ratio of 8:2. To guard against overfitting, a validation subset is created within the training set. For consistency, we present the statistical results (RMSE, MAE and correlation) from the 1979 dataset in this section, with the testing period results for summer and winter shown in Tables 4 and 5, respectively.
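The linear-versus-nonlinear contrast can be sketched with a toy polynomial regression. This is a minimal illustration under stated assumptions: the predictors, the cubic relationship and the noise level are all made up, and the real NARMAX machinery (term selection, lagged inputs, moving-average noise terms) is not reproduced here; only the effect of allowing polynomial terms up to degree 3, the paper's maximum nonlinear degree, is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: a yearly index driven nonlinearly by two predictors
n = 44                                    # comparable to a 1979-2022 annual record
u1, u2 = rng.normal(size=n), rng.normal(size=n)
y = 0.6 * u1 - 0.4 * u1 * u2 + 0.3 * u2**2 + 0.1 * rng.normal(size=n)

def poly_terms(u1, u2, degree):
    """Candidate monomial terms in two predictors up to the given degree."""
    cols = [np.ones_like(u1)]
    for d in range(1, degree + 1):
        for k in range(d + 1):
            cols.append(u1 ** (d - k) * u2 ** k)
    return np.column_stack(cols)

split = int(0.8 * n)                      # 8:2 train/test split, as in the text
results = {}
for degree, label in [(1, "linear"), (3, "nonlinear")]:
    X = poly_terms(u1, u2, degree)        # deg = 3 matches the paper's maximum
    coef, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
    pred = X[split:] @ coef
    results[label] = np.sqrt(np.mean((pred - y[split:]) ** 2))  # test RMSE

print({k: round(float(v), 3) for k, v in results.items()})
```

Because the synthetic series contains interaction and quadratic terms, the degree-3 model attains a markedly lower out-of-sample RMSE than the linear one, mirroring the comparison summarized in Tables 4 and 5.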
As shown in Tables 4 and 5, the purely nonlinear NARMAX models produce more accurate predictions than the purely linear models.

TABLE 3 Verification statistics for averaged sliding-window NARMAX models for winter seasonal weather (testing period = 2014-2022). Correlations that are significant at p ≤ 0.05 are in bold, and correlations that are significant at p ≤ 0.01 are underlined.

| Dynamical models
Figure S12 illustrates how SEAS5 fared when predicting the winter and high-summer NAO, EA and SCA, using hindcast data for 1993-2015 and forecast data for 2016-2022, based on forecasts initialized on 1 December (for winter) and 1 June (for summer). Figure S13 shows the corresponding data for GloSea5/6 (GloSea5 was superseded by GloSea6 in 2021), based on forecasts in which all of the ensemble members were initialized during the month leading up to and including 1 December (for winter) and 1 June (for summer). GloSea5 forecast data were not available for 2016 and 2017. There is a consistent tendency to underpredict the amplitude of the extreme seasons, to a greater extent than is observed for the NARMAX predictions, possibly reflecting a greater 'signal-to-noise' problem than we see with NARMAX. Both models showed significant (at p < 0.05) skill for the winter NAO, producing correlations of just over 0.4, but showed little skill at predicting the winter East Atlantic pattern. The GloSea5/6 correlation of 0.42 with the winter NAO is lower than the 0.62 reported by Scaife et al.
(2014), but if the analysis is restricted to the period 1993-2012, the correlation is much closer, at 0.56, suggesting that the difference is primarily due to GloSea5/6 having a lower success rate in recent winters, for example, failing to predict the negative NAO of winter 2020/2021. While winters 2009/2010 and 2010/2011 were both predicted to have a negative NAO, none of the ensemble members of SEAS5 or GloSea5 captured the intensity of the anomaly. While SEAS5 correctly predicted a negative East Atlantic pattern for the winter of 2004/05, the observed outcome was more extreme than predicted by any of the ensemble members. SEAS5 particularly tended to underpredict the variability of the East Atlantic pattern in winter. For the SCA pattern, SEAS5 again underpredicted the variability but performed better overall than GloSea5. In the case of high summer, covering July and August, no skill is found for the summer NAO or SCA, with both SEAS5 and GloSea5/6 giving small negative correlations with the observed values. However, GloSea5/6 showed evidence of some skill at predicting the summer EA, with a correlation of 0.36 with the observed values, albeit still lower than the values obtained from NARMAX. It appears that high summer is the area where NARMAX may prove especially useful as a means of improving the reliability of seasonal forecasts.
When June is considered, the dynamical models perform much better, reflecting their greater skill at 1 month out.SEAS5 outperforms GloSea5/6, with a correlation of 0.71 with the June NAO and a correlation of 0.82 with the observed June East Atlantic pattern, while GloSea5/6 showed correlations of 0.51, 0.44 and 0.42 with the NAO, EA and SCA, respectively.SEAS5 correlations with June SCA were also lower, at 0.51.

| Comparison between dynamic models and NARMAX models
In this section, we compare the performance of the sliding-window NARMAX models discussed earlier against the SEAS5 and GloSea5/6 dynamical models, based on a training period of 1979-2005 (27 years) and an out-of-sample/testing period of 2006-2021 (16 years). Figures 10 and 11 show the NARMAX mean forecast for each year, compared with the observed values and the SEAS5 forecasts (Figure 10) and GloSea5/6 forecasts (Figure 11). It is apparent that NARMAX suffers less from the 'signal-to-noise' issue than the dynamical models, although in some cases (especially winter EA), there is still evidence of NARMAX underpredicting extreme values. A further comparison between the SEAS5 and GloSea5/6 dynamical models and the NARMAX models is shown in Table 6, based on the data shown in Figures 10 and 11. The statistical results for these NARMAX models show higher correlation coefficients and smaller RMSE for all indices, indicating a more skillful performance than the SEAS5 and GloSea5/6 models.
In Table 6, the predictions by NARMAX are compared with the SEAS5 and GloSea5 models for winter and separately for June and high summer (July and August). NARMAX consistently outperforms the dynamical models, and NARMAX forecasts are significantly correlated with the observed outcome in all cases. With the exception of JA_SCA, all correlations are significant at p < 0.01. The dynamical models only showed skill in some cases, with only the SEAS5 predictions of June NAO and June EA proving comparable to the NARMAX predictions. It is worth noting that the GloSea5/6 forecasts of the winter NAO are statistically significant at p < 0.05 when a longer time period (1993-2022) is considered (Figure S13), but due to the smaller sample size, the correlation of 0.47 over 2006-2021 falls short of statistical significance.
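The deterministic verification statistics used throughout these comparisons (correlation, RMSE, MAE) take only a few lines of NumPy; the forecast/observation pair below is made up for illustration, not data from the paper.

```python
import numpy as np

def verify(pred, obs):
    """Forecast verification: Pearson correlation, RMSE and MAE."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    r = np.corrcoef(pred, obs)[0, 1]
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    mae = np.mean(np.abs(pred - obs))
    return r, rmse, mae

# Illustrative values only (not results from the paper)
obs = [1.2, -0.4, 0.9, -1.5, 0.3]
pred = [0.8, -0.1, 0.6, -1.0, 0.5]
r, rmse, mae = verify(pred, obs)
print(f"r = {r:.2f}, RMSE = {rmse:.2f}, MAE = {mae:.2f}")
```

Note that MAE is always less than or equal to RMSE, which is why a large RMSE/MAE gap (as for winter SCA) signals a few comparatively large misses rather than uniformly poor forecasts.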

| DISCUSSION
The results presented here highlight the potential for NARMAX to add considerable value to current dynamical model predictions of the NAO, EA and SCA, especially in the case of summer, where dynamical models tend to struggle to a greater extent than for winter. The NAO alone accounts for only some of the variability in temperature and precipitation over Northwest Europe, making it important to also predict other major modes of atmospheric circulation variability. For example, Hall and Hanna (2018) attributed the exceptionally high rainfall of winter 2013/2014 over much of the UK primarily to a strongly positive East Atlantic pattern. It is therefore encouraging that the NARMAX results are strongly correlated with the observations for the EA and SCA as well as the NAO.
Splitting the summer into June and July/August shows that the dynamical models perform better at forecasting the June NAO, EA and SCA from 1 month ahead than at forecasting July/August, as would be expected. The highest correlation with the observed data is 0.82, in the case of the SEAS5 predictions of June EA (compared with correlations of 0.84 and 0.78 for the 1956 and 1979 NARMAX models for June EA, respectively). Surprisingly, the NARMAX verification rate is similar for June and for July/August, indicating that NARMAX offers less of an advantage over the dynamical models at a relatively short time range but that its accuracy does not decline as much at 2 and 3 months ahead, at least in the case of summer.
Compared with the dynamical models, NARMAX predictions show a reduced 'signal-to-noise' problem; that is, the year-to-year variability of the NAO, EA and SCA is captured accurately, and while the amplitude of extreme events is at times underpredicted, it is generally underpredicted to a much smaller extent than with the dynamical models. When analysing summer predictions for June and for July/August, it is clear that the 'signal-to-noise' problem with the dynamical models increases markedly when predicting 2 and 3 months ahead as opposed to just 1 month ahead, at least in the case of summer. This is far less apparent with the NARMAX predictions, suggesting that NARMAX-assisted forecasts may be especially useful for reducing the 'signal-to-noise' problem when forecasting further ahead. It is particularly encouraging that this appears to be true for summer, as the dynamical models tend to struggle more with seasonal predictions of summer than of winter.
The lists of predictors that the sliding-window NARMAX chooses for summer are mixed, but some consistent results stand out. The 1956 model (see the supporting information) has March Beaufort/Chukchi Sea SSTs as one of the three most selected predictors for both the June and July/August NAO. Sea ice and SSTs dominate among the most often selected predictors, perhaps suggesting feedback between sea-ice concentrations and SST anomalies. The 1979 model is less likely to select sea-ice concentrations, and tropical rainfall dominates among its most selected predictors for the June and July/August NAO. The same is mostly true for the EA, but for the June EA, both the 1956 and 1979 models often select solar activity with varying lag times. For the July/August SCA, the 1979 model has March North Atlantic snow cover as the second most often selected predictor. The recurrence of sea-ice concentrations in the top 10 predictors must be treated with some caution, for, as discussed by Hall, Scaife, et al. (2017), the recent sharp decline in sea-ice concentrations could lead models to overestimate the influence of sea ice and potentially issue poorer forecasts for recent years. However, despite this issue, the NARMAX predictions consistently outperformed the dynamical models, especially in the case of high summer (July and August).
For winter, both sets of models commonly select the North Atlantic Horseshoe (NAH) to predict the winter NAO, especially the 1956 model, which has the May and September NAH as the two most often selected predictors. This is a reassuring result, as it ties in well with the findings of Cassou, Deser, et al. (2004), who identified plausible physical explanations for observed links between the NAH and the subsequent winter NAO, particularly in relation to the NAH during the preceding summer and autumn. The 1979 model most often selects October Barents-Kara sea-ice concentrations as a predictor for the winter NAO, again an encouraging result, as, for example, Warner et al. (2020) found evidence of interannual winter NAO variability being strongly related to Barents-Kara sea ice. The 1979 model's second most selected predictor for the winter NAO is October European snow cover, which is again a plausible result, as observations and dynamical model simulations also point to the existence of links between the two (Wegmann et al., 2020). September Hudson Bay sea-ice concentrations recur as a commonly selected predictor for the winter EA, and to a lesser extent so does the October stratospheric polar vortex. Hall, Scaife, et al. (2017) discussed links between solar activity (with a lead time of 6 months to 2 years), the June tripole and the winter NAO. Neither of those was in the top 10 predictors of the 1956 NARMAX model, although in the 1979 model the October tripole featured among the 10 most frequently selected predictors.
Lagged teleconnection links between sea-ice concentrations, SST anomalies, tropical precipitation and subsequent atmospheric circulation patterns have already been found. For example, there may be links between Barents-Kara sea-ice concentrations and the extratropical atmospheric circulation via complex teleconnections with the Aleutian low and tropical SST and rainfall variations (Warner et al., 2020). This also ties in well with the 1979 NARMAX model frequently choosing October Barents-Kara sea-ice concentrations as a predictor of the winter NAO. There is also evidence for a link between tropical precipitation anomalies and wintertime European precipitation events (Li et al., 2020) and, correspondingly, the East Atlantic pattern (Maidens et al., 2021). Some further discussion of the physical interpretability of the NARMAX results is available in the supporting material (Section 4).
Ongoing research is downscaling the three principal EOFs used here, in order to determine the links between the EOF time series (both observed and predicted by NARMAX) and Northwest European temperatures and precipitation, including links with persistence and variability indices as well as maximum, minimum and mean values, that are relevant for end-users such as the agrifood, energy and tourism industries.

| SUMMARY
These results demonstrate that NARMAX models have considerable potential to improve upon purely dynamical model-based seasonal weather predictions, especially in the case of high summer (July and August), and therefore significantly extend the pilot study of Hall et al. (2019), which focused on winter. NARMAX models designed with a sufficiently long training period of at least around 25 years consistently show skillful performance across the range of atmospheric circulation indices and seasons used here. Links between the individual circulation indices and the potential predictors frequently chosen by NARMAX are a basis for future work, both with the aim of evaluating the physical plausibility of such links and of using NARMAX to assist the identification of new teleconnection links that have not previously been identified and explored.
In this paper, the original dataset is first divided into training and testing sets with a ratio of 8:2, giving 35 points in the training set and 9 points in the testing set (8 points in the testing set for the summer season). To avoid overfitting, a validation process is performed over a subset of around eight or nine samples in the training set. The maximum nonlinear degree, deg, of the NARMAX models is 3 (Section 3.1). Prediction results from the sliding-window NARMAX models are shown in Sections 4.2 and 4.3. To evaluate how the predictors have evolved, a separate NARMAX model for the 1956 predictors dataset (i.e., covering the period 1956-2022) was developed. This model is trained over the period 1956-2008/2009, validated over 2001-2008 and tested over 2009-2021, as detailed in Supplementary Information Section 1.1. We refer to these as the 1979 and 1956 NARMAX models, named after the two different time periods they represent.
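The sliding-window ensemble construction described above can be sketched as follows. The window length and the inverse-error weighting scheme are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
import numpy as np

def sliding_windows(n_samples, window, step=1):
    """(start, stop) index pairs for overlapping training windows over the record."""
    return [(s, s + window) for s in range(0, n_samples - window + 1, step)]

def weighted_mean_forecast(member_preds, member_errors):
    """Combine ensemble member forecasts, weighting each by inverse validation
    error (an assumed weighting scheme, for illustration only)."""
    w = 1.0 / np.asarray(member_errors, float)
    return np.sum(w * np.asarray(member_preds, float)) / np.sum(w)

# 44 annual samples with 25-year windows yield a 20-member model ensemble
wins = sliding_windows(44, 25)
print(len(wins), wins[0], wins[-1])  # 20 (0, 25) (19, 44)
```

Each window yields one member NARMAX model; the ensemble spread then provides the probabilistic prediction band, and the weighted mean gives the single deterministic forecast compared against observations in the figures.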

As shown in Figure 4c, the weighted average predictions by the SW-NARMAX models follow the observations closely in 5 of 8 years and fall outside the CI in the remaining years (2015, 2016 and 2019). Similarly, in Figure 4d, most predictions (5/8) by the SW-NARMAX models are close to the observations, while the observations of the remaining years (2016, 2019 and 2020) fall outside the CI. In Figure 4c,d, the two-digit decimal values are the CRPS based on the NARMAX ensemble weighted mean prediction for each year, and the two horizontal dashed lines indicate the values of −0.5 and 0.5. As defined above, the area between −0.5 and 0.5 is considered 'true', while the areas outside are set as 'false' for the purpose of the CRPS calculations defined in Equations (12) and (13). Smaller CRPS values indicate that the predictions by the SW-NARMAX models are relatively similar to the observations and share the same class, while larger CRPS values indicate a larger departure between the predicted and observed values. Consider Figure 4c as an example. When the predicted value approximates the observed value and the 95% confidence interval (CI) coincides with the category of the observed values, as was the case in 2014, 2017, 2018 and 2021, the CRPS values are minimal, nearing zero. However, for specific years such as 2015 and 2016, when the observed value falls outside the predicted 95% CI range, the CRPS values approximate 0.6.

The verification statistics for the validation and testing periods for the mean predictions by the model ensembles are shown in Table 2. Figure 5c,d show that the weighted mean predictions by the NARMAX models usually perform well in following the observed yearly values from 2014 to 2021 (the testing period) for Jun_EA79 and JA_EA79. For the Jun_EA79, the observations in 2014, 2018 and 2021 are in the 95% CI, while for the JA_EA79, most observations (7/8) fall in the prediction band. Thus, the SW-NARMAX models perform better for JA_EA79 than for Jun_EA79. Similarly, the CRPS values reflect the accuracy of the probabilistic predictions of the SW-NARMAX ensemble models. For Jun_EA79, the predictions in 2014, 2018 and 2021 are relatively accurate compared to other years. For 2017 and 2020, the CRPS values are greater than 1, as the weighted mean predictions of the SW-NARMAX models for these 2 years differ significantly from the observations. For 2015, 2016 and 2019, the CRPS values are less than 1 but greater than 0.4; in these years, the weighted mean predictions of the SW-NARMAX models differ noticeably from the observations, while the observations fall in the same category as the CI.

FIGURE 4 Results (predictors (a, b) and predictions (c, d) by sliding-window NARMAX) of Jun_NAO79 and JA_NAO79. Predictors are shown according to the month in which that value occurred (e.g., AprTAR = tropical Atlantic rainfall for the month of April). Refer to Table 1 for a full list of predictor names.

FIGURE 5 Results (predictors [a, b] and predictions [c, d] by sliding-window NARMAX) of Jun_EA79 and JA_EA79. Predictors are shown according to the month in which that value occurred (e.g., aprDIP = North Atlantic dipole for the month of April). Refer to Table 1 for a full list of predictor names.

FIGURE 6 Results (predictors [a, b] and predictions [c, d] by sliding-window NARMAX) of Jun_SCA79 and JA_SCA79. Predictors are shown according to the month in which that value occurred (e.g., aprESL = E. Siberian/Laptev Seas for the month of April). Refer to Table 1 for a full list of predictor names.

FIGURE 7 Results (predictors [a] and predictions [b] by sliding-window NARMAX) of EOF NAO79 in winter. Predictors are shown according to the month in which that value occurred (e.g., augNAH = North Atlantic Horseshoe for the month of August). Refer to Table 1 for a full list of predictor names.

FIGURE 8 Results (predictors [a] and predictions [b] by sliding-window NARMAX) of EOF EA79 in winter. Predictors are shown according to the month in which that value occurred (e.g., augBK = Barents-Kara Seas for the month of August). Refer to Table 1 for a full list of predictor names.

FIGURE 9 Results (predictors [a] and predictions [b] by sliding-window NARMAX) of EOF SCA79 in winter. Predictors are shown according to the month in which that value occurred (e.g., octLAB = Labrador Sea for the month of October). Refer to Table 1 for a full list of predictor names.

TABLE 2 Verification statistics for averaged sliding-window NARMAX models for summer seasonal prediction (testing period = 2014-2021). Correlations that are significant at p ≤ 0.05 are in bold, and correlations that are significant at p ≤ 0.01 are underlined.

FIGURE 10 Comparisons between NARMAX predictions and SEAS5 hindcasts and forecasts for the period 2006-2021.

FIGURE 11 Comparisons between NARMAX predictions and GloSea5/6 hindcasts and forecasts for the period 2006-2021. GloSea5 forecasts issued in the month ending 1 June were missing for 2016 and 2017.

TABLE 5 Comparison of the numerical performance of linear and nonlinear NARMAX models in winter (testing period: 2014-2022), where significant correlation coefficients are highlighted in bold (p ≤ 0.05).

TABLE 6 Verification statistics comparing the SEAS5 and GloSea5/6 model hindcasts and forecasts with NARMAX, for the period 2006-2021 (with NARMAX using the 1979-2005 training period). Correlations that are significant at p ≤ 0.05 are in bold, and correlations that are significant at p ≤ 0.01 are underlined.