A new approach to extended-range multi-model forecasting: sequential learning algorithms

Multimodel combinations are a well-established methodology in weather and climate prediction and their benefits have been widely discussed in the literature. Typical approaches involve combining the output of different numerical weather prediction (NWP) models using constant weighting factors, either uniformly distributed or determined through a prior skill assessment. This strategy, however, can lead to suboptimal levels of skill, as the performance of NWP models can vary with time (e.g., seasonally varying skill, changes in the forecasting system). Moreover, standard combination methods are not designed to incorporate predictions derived from sources other than NWP systems (e.g., climatological or time-series forecasts). New algorithms developed within the machine learning community provide the opportunity for “online prediction” (also referred to as “sequential learning”). These methods consider a set of weighted predictors or “experts” to produce subsequent predictions in which the combination or “mixture” is updated at each step to optimize a loss or skill function. The predictors are highly flexible and can combine both NWP and statistically derived forecasts transparently. A set of these online prediction


INTRODUCTION
Subseasonal-to-seasonal predictions (hereafter s2s), which range from a few weeks to a few months ahead, fill in the gap between weather prediction and seasonal forecasting. In the past, these forecasting ranges have received little attention, due to the limited predictability of weather forecasts beyond 10 days and the need for longer aggregation periods in order for the effects of boundary conditions to become relevant (e.g., Kirtman et al., 2014;Robertson et al., 2015;Vitart et al., 2015). These forecasting horizons are, nonetheless, of significant relevance for a diverse range of applications (e.g., health, agriculture, water management, humanitarian efforts, energy, etc.) and have been identified as a key research area for climate services development (e.g., Vaughan et al., 2016;White et al., 2017;Vitart and Robertson, 2019). Many different subseasonal hindcasts and operational forecasts are now available, though these differ in important methodological aspects such as launch dates, ensemble sizes, and calibration strategies. A central challenge for all climate service sectors is therefore to derive and maximize the predictive skill available across multiple forecast systems. The benefits of multimodel combinations in climate forecasting have previously been introduced and described for different temporal scales (e.g. Krishnamurti et al., 2000;DelSole, 2007;Tebaldi and Knutti, 2007;Min et al., 2009;Sansom et al., 2013;Siegert and Stephenson, 2019). Most typical combination methodologies involve weighting strategies that assign each model a constant factor, either uniformly or based on an ensemble member skill assessment. Given that the skill of the models can vary at different timescales, and for multiple reasons (e.g., seasonally varying skill, or changes in the forecasting system), the fact that these weights remain constant is a methodological limitation, since it does not allow the resulting combination to adapt to these changes in skill. Other commonly used multimodel combination methodologies such as Bayesian model averaging (BMA: e.g., Raftery et al., 2005) rely on strict assumptions about the probability distributions of the forecasts to obtain a posterior density.
Dynamical prediction systems are known to have limitations on these timescales, given that the lead times are too long to retain much memory of the initial conditions, but too short to be controlled by the boundary conditions (e.g., Vitart et al., 2012). In this context, Cohen et al. (2019) propose that investing in machine-learning techniques for subseasonal forecasting may offer skill advantages over both existing numerical prediction systems and traditional statistical techniques using fixed training periods (e.g., canonical correlation analysis, CCA) alone. A large body of research exists on statistical postprocessing of climate prediction and future projections, including applications of machine-learning techniques (e.g. Monteleoni et al., 2011;DelSole et al., 2015) . Furthermore, statistical postprocessing of model output has often led to enhanced understanding of the physical processes linked to predictability and has therefore led to subsequent improvements to dynamical forecasting systems, and their value can therefore extend beyond forecast skill enhancement (e.g., Doblas-Reyes et al., 2013).
Within the realm of machine learning, a family of algorithms has been developed to perform "online prediction with expert advice" (Cesa-Bianchi and Lugosi, 2006), also known as sequential learning or sequential aggregation rules. These methods consider a set of predictors which, after initially being weighted uniformly, produce subsequent predictions in which the combination or "mixture" is updated continually over time to optimize a loss or skill function. No assumptions are made about the probabilistic properties of the predictors or the forecasts, and the algorithms learn dynamically how the sets are linked. These online prediction methods have several potential advantages for their use in climate prediction.
• The predictor combination is updated in every forecast step, allowing the system to adjust under certain conditions (e.g., time-varying skill or new forecast system releases) to preserve or maximize skill.
• A different combination of predictors can be obtained for different quantiles of the predictand distribution to produce a robust system that maximizes skill for the full forecast probability distribution.
• The risk of including incompetent or counterproductive predictors is minimized by the system adjusting over time to assign them minimal weights.
Sequential learning algorithms have had very little application in weather and climate multimodel prediction. Mallet et al. (2009) first applied a sequential aggregation algorithm to combine multimodel predictions of hourly and daily near-surface ozone concentrations into a deterministic prediction. In this case, they applied a method called exponentiated gradient (Cesa-Bianchi, 1999) and showed that the sequential learning algorithm produced predictions that outperformed the best model and had similar skill to the optimal linear combination. Subsequently, Mallet (2010) enhanced the methodology further through coupling it with a data assimilation scheme, so that the sequential learning algorithm forecasted the model's analyses instead of the observations. In subsequent years, similar implementations were used to forecast other pollutants (e.g., Auder et al., 2016), ocean waves (e.g., O'Donncha et al. 2018;, and several nonmeteorological variables (e.g. Devaine et al., 2013; Hawelka et al., 2017;Amat et al., 2018). Bel (2015, 2016) first applied a similar method in climate forecasting, in the context of decadal prediction of atmospheric variables such as 2-m temperature. They also found that the exponentiated gradient algorithm was able to outperform the individual models and other standard benchmarks such as linear combinations and the climatology. Recently, Strobach and Bel (2020) applied the same exponentiated gradient learning method in the context of future climate projections using the CMIP5 ensemble (Taylor et al., 2012). The authors used a historical period of reanalysis data as a learning set. In addition, they obtained an estimation of multimodel uncertainty by looking at model weight fluctuations during the learning period. Those weights and uncertainties were then used to combine future climate projections, which proved to be less uncertain than the original multimodel ensemble. A set of limitations can be identified in the application of sequential learning algorithms mentioned above. All but Strobach and Bel (2020) focus only on producing a deterministic or point-based mean forecast. They also rely on a fixed learning rate (instead of a time-varying "adaptive" learning rate), which determines the speed at which the algorithm can adjust the combination weights (e.g., Wintenberger, 2017). A fixed learning rate prevents these methods from capturing optimally the temporal variations in the skill of the different predictors considered. Finally, these examples have drawn the predictors in a very simple way (i.e., individual forecasts) and have not explored the potential to use a more complex set of predictors to maximize skill. Sequential learning algorithms allow for improvements in all of these aspects and this work presents results that demonstrate the method's advantages.
The present work tests a set of sequential learning algorithms, with and without time-varying or "adaptive" learning rates, to combine s2s multimodel predictions and to compare the results with more traditional multimodel combination techniques. Two specific algorithms (Bernstein Online Aggregation (BOA) and the exponentiated gradient algorithm (EGA)) were tested considering the following research questions.
• To what extent do these sequential learning algorithms outperform more common "constant-weight" multimodel combinations?
• Can the skill of multimodel forecast combinations be improved through the incorporation of reanalysis-based predictors?
Within the context of the EU-Horizon 2020 S2S4E project (Subseasonal-to-seasonal Forecasting for Energy 1 ), these novel methods are applied to forecasts of national-average electricity demand and wind-power generation across a set of European countries, though the methodology can readily be applied to other s2s forecast properties.

Dynamical subseasonal predictions
Extended-range numerical weather predictions (NWP) and their corresponding reforecasts were compiled by the S2S prediction project 2 . Hindcast ensembles from the European Centre for Medium-range Weather Forecasting (ECMWF: ECMWF ENS-ER 3 ) and National Centers for Environmental Prediction (NCEP: CFS version 2, Saha et al., 2014) modeling centers were obtained from the ECMWF S2S data portal 4 and are described in Table 1.
In the case of ECMWF ENS-ER, the system has an on-the-fly reforecast methodology that generates hindcasts for the 20 years prior to the forecasts, with 11 ensemble members launched biweekly on Mondays and Thursdays. All hindcasts matching forecasts initialized during the calendar year 2016 are included in the analysis, and therefore three separate model cycles had to be considered: • CY41R1: for hindcasts from January 1-March 7; In the case of NCEP, the system has a fixed reforecast period from 1999-2010 and hindcasts were launched daily, creating a four-member ensemble. Given the small size of the ensemble but the frequent launch dates, a lagged ensemble is used to obtain a dataset with a structure and size comparable to ECMWF prior to the multimodel combination: the daily starts of NCEP were subsampled biweekly to match the ECMWF starts (Mondays and Thursdays) and each of those starts was combined with the two preceding ones to generate a larger number of members (12 instead of 4).
The analysis is then performed on the common period for the two datasets, which is 1999-2010. That 12-year period gets reduced to 2000-2010, since the first year is lost to generate one-year persistence forecasts (see Section 2.4). Initial tests revealed that it takes around two years for the aggregation methodologies to achieve a quasi-equilibrium in their weights variability. Therefore, the following two years (2000)(2001) are used to optimize algorithm parameters when necessary, and all skill evaluations are restricted to the nine-year period 2002-2010.
Daily mean 2-m temperatures were obtained for each of the hindcasts and their mean biases were adjusted using equivalent variables from the ERA5 reanalysis (Hersbach et al., 2020). A lead-time-dependent bias correction to the mean is applied, forcing the climatology to match that of ERA5. Furthermore, a variance inflation is then implemented to adjust the hindcasts' spread to match ERA5's while preserving their correlation (Doblas-Reyes et al., 2005). These adjustments are applied at each grid point by first regridding the ERA5 variables to the 1.5 • model grid. A leave-one-out approach is implemented on the hindcast (e.g., the calibration for hindcast year 1999 uses the hindcast climatology of years 2000-2010). A sample time-series plot of UK electricity demand in the different datasets, together with the subperiods of the analysis, is included as Figure S1 in the Supporting Information.

Electricity demand
To illustrate the use of online prediction on an application-relevant variable, daily grid-point 2-m temperatures from both the reanalysis and hindcasts sets are converted to country-aggregate daily electricity demand time series, using a statistical data model (Bloomfield et al., 2020) that accounts for the weather-driven variability while removing time-evolving socio-economic drivers, such as weekly cycles and long-term trends. Grid-point electricity demand time series are aggregated to the country level using European countries' shapefiles (Bloomfield et al., 2020) and compared with the corresponding time series derived from the ERA5 reanalysis (Hersbach et al., 2020) for the same 1999-2010 period. The daily hindcasts were then used to create weekly averages with the following criteria: • week 1: days 1-7; • week 2: days 8-14; • week 3: days 15-21; • week 4: days 22-28; • week 5: days 29-35.
We highlight, however, that the application to electricity demand is included as a proof of concept. The method demonstrated below is very general and can be applied to any forecast variable or derived magnitude. Some additional examples can be found in the Supporting Information (see Section S5).

Sequential learning algorithms: a qualitative description
As a first introduction to sequential learning algorithms (SLAs), we present here a conceptual description of the methodology. A more complete description of the theory of SLAs and all the relevant formulations are included in the Supporting Information (see Section S1.1). This section is meant to describe the "mechanics" of the SLAs in plain language.
Given an observable time series of interest Y (t), these methods consider a set of predictors E(n, t) to be combined into a subsequent prediction of Y (t), for examplê Y (t + 1). The methods consider a linear combination of the predictors and start by assigning them equal weights. Nonetheless, in subsequent time steps (i.e., t + 2, t + 3, etc.), the objective of these methods is progressively to update the set of weights W(n, t + i) so that the resulting predictionŶ (t + i) minimizes a certain loss function of the individual forecasts (i.e., a given measure of error of the predictions). The SLAs can be adjusted to control how quickly the combination weights are allowed to change between forecast steps (referred to as the "learning rate"), how many prior steps are considered in the determination of the updated weights (referred to as the "forgetting factor"), and whether or not there is a lower-bound constraint on the individual weights (referred to as the "fixed share", and typically used to avoid some predictors being discarded permanently).
This article presents comparative results from implementing two different SLA algorithms, one with a time-varying learning rate, Bernstein Online Aggregation (BOA: Wintenberger, 2017), and one with a constant learning rate, the Exponentiated Gradient Algorithm (EGA: Kivinen and Warmuth, 1997). Due to the adaptive learning rates, the BOA algorithm allows for more flexibility in the updating process over time when comparing with the EGA. In each case, a set of improvements to the standard "off-the-shelf" implementations was applied by optimizing the aforementioned parameters over an initial training period. Details about the methods and these implementations can be found in the Supporting Information (see Section S2).

Multimodel combinations and skill references
This study tests the implementation of sequential learning algorithms for the prediction of weekly hindcasts of country-aggregate electricity demand. With this purpose, a simple set of predictors was considered, split into two categories: NWP-based and reanalysis-based. It is emphasized that this predictor set could be expanded, potentially to improve the resulting output forecast (e.g., by including more NWP forecast systems, earlier launch dates, pattern-based indicators or other information sources), but this simple predictor set is sufficient for demonstration purposes.
• Minimum and maximum of the ensemble distribution (FCST_MN, FCST_MX). For each start, the minimum and maximum values from both hindcast systems are retained.

Reanalysis-based
• Quantiles of the climatology (Q10_CLIM, Q35_CLIM, Q50_CLIM, Q65_CLIM, Q90_CLIM). Obtained from the 1.5 • ERA5 climatology for each calendar day using a leave-one-out approach on the years.
• Persistence (PERS). Weekly persistence forecast based on ERA5, calculated using the seven days prior to each hindcast start date.
• Last-year persistence (PERS_1yr The multimodel forecasting method considered in this study is described by the diagram in Figure 1. The predictors described above were combined, fully or partially, using different combinations or aggregation methodologies to obtain a multimodel forecast for a given quantile of the distribution, as described in Section S1.1. By iterating over a set of quantiles ("Qgrid"), the method results in a multimodel probabilistic forecast. As a simple illustration, consider the EGA algorithm applied to combine the NWP predictors (Q10-Q90, FCST_MIN and FCST_MAX) from two systems NCEP and ECMWF. For a given forecast quantile (e.g., the 5th percentile), a new prediction is derived by obtaining combination weights for each of the input predictors. The process is then repeated for the 10th and 15th-95th percentiles (i.e., a separate set of weights is calculated in each case) such that a probabilistic forecast is produced. Given the limited sizes of the reforecast systems considered here, a quantile spacing of 0.05 (5%) was considered. Additionally, a different aggregation rule was created for each hindcast lead (weeks 1-5).
Each mixture was applied using all predictors (denoted as BOA, EGA) and the NWP-based predictors only (denoted as BOA_NWP and EGA_NWP) in order to assess the presence of any added value from including reanalysis-based information in the forecasts. The EGA method was implemented to benchmark the results against prior uses of sequential learning algorithms in climate prediction (e.g., Strobach and Bel 2015;, but it has been considered here with several improvements with respect to its prior uses: EGA is trained for each q i in "Qgrid" using a 2-year training period and optimizing a fixed learning rate across quantiles over this period. Two further innovations were tested and introduced which proved to increase skill: the use of an optimized "forgetting factor", which controls the relevance that older predictions have in the determination of the updated weights, and the use of an optimized "fixed share", which assigns a minimal baseline weight to all the predictors to stop them from being discarded permanently from predictions. These strategies are further described in the Supporting Information (see Section S3). A summary table for all the algorithms is included in the section (see Table S1).
The multimodel combinations were then benchmarked against a set of references described below.
For each q i in "Qgrid", a quantile forecast is obtained from each system (ECMWF and NCEP) and they are then averaged with equal weights (in this case, w = 0.5).
• Ensemble model output statistic calibration (EMOS). A Gaussian EMOS ensemble calibration, also known as nonhomogeneous Gaussian regression , was trained using the probabilistic predictions of the NWP systems (ECMWF, NCEP) from the first TWO forecast years (128 time steps) and then used to generate calibrated probabilistic predictions for the evaluation period and for each q i in "Qgrid". This scheme was considered as an example of ensemble statistical postprocessing and calibration techniques used in probabilistic forecasting.
• Climatology (CLIM). For each q i in "Qgrid", the 1.5 • ERA5 was used to create a climatology for each calendar day using a leave-one-out approach on the years.
• Individual NWP systems (ECMWF, NCEP). For each q in "Qgrid", a quantile forecast is obtained from the ensemble members of each system.
• Oracles (O_NWP_conv, O_NWP_lin). Using the full evaluation period, the oracles are built as the optimal multiple linear regression combinations of all the NWP-based predictors, under two constraints: in the convex case (O_NWP_conv), the weights range between 0 and 1 and have to add up to 1. In the linear case (O_NWP_lin), the weights range freely. Because these oracles require the full knowledge of the entire evaluation period, they provide an upper boundary-estimate for the skill of a fixed-weight performance-based multimodel combination. In an operational setting, "oracle" combinations are not possible, because they require knowledge of the full period, and analogous linear regressions would need to be trained using a previously observed independent period.
It is worth mentioning that, even though persistence is another typical benchmark in weather and climate forecasting, it is not used as such here, because it is part of the predictors set.

RESULTS
The skill of the sequential learning algorithms (SLAs) is first examined in the context of deterministic or point-based skill, as in prior applications of the algorithms for climate prediction (e.g., Strobach and Bel 2015;. As a subsequent step, the skill improvements of the methods are assessed in a probabilistic context, considering all the quantiles in "Qgrid".

3.1.1
Forecast skill for the ensemble median (50th quantile) To illustrate the aggregation methods and their evaluation, we first consider the case of United Kingdom (UK) electricity demand. As an initial step, the skill of the SLA combinations is evaluated using the 50th quantile (Q50) as a measure of the average behavior of the ensemble. As described in Section 2.3, the methods are optimized in terms of a loss function, which in this case was the pinball or quantile loss (see Section S1.1), which is calculated for each prediction time. The average pinball loss for Q50 is equivalent to the mean absolute error (MAE) and is therefore a metric of the deterministic skill of the resulting system. Figure 2 presents these time-averaged losses for the proposed aggregation rules and the benchmark forecasts. Figure 2a shows the full time-average losses, sorted according to lead week 3, and shows that all the SLAs (i.e., all the versions of the EGA and BOA methods considered here) are more skillful than the equal-weights combination (EW_NWP), the EMOS calibration, the climatology (CLIM), and the individual NWP systems (ECMWF, NCEP) for that lead time. Figure 2b presents the relative skill improvements of the predictions relative to the EW_NWP combination, calculated as the ratio between the pinball losses for each week. Three other aspects of the results are highlighted. Firstly, all the SLAs present an increase in skill with respect to EW_NWP for every lead time. This, added to the fact that the oracle combinations (O_NWP_conv and O_NWP_lin, the optimal constant-weight combinations) are less skillful than some of the combinations, indicates that there are potential benefits of SLAs compared with constant-weight combinations and more sophisticated calibrations such as EMOS. The SLAs implemented here resulted in a significant reduction in the pinball loss in weeks 1-5 with respect to all the other predictions (ranging between around 9% and 22%). Finally, these plots also suggest that some benefit arises from including reanalysis-based predictors in the combinations, as the full versions typically have lower losses than the NWP-only SLAs.

Average weights for the 50th quantile
To explore the composition of the aggregation methods and the differences in their performances, Figure 3 presents the time-average weights for different weekly leads and all versions of the BOA and EGA combinations. It shows that, for week 1, the higher weights are assigned to the quantiles of the ECMWF ensemble, though not necessarily centered at Q50. This suggests that the aggregation is correcting residual biases in the shape of the forecasts distribution, which were not addressed by the initial mean correction and variance inflation (see Section 2.1). The NCEP quantiles also present a noticeable contribution to the combinations, though smaller than 10%. As lead time increases (Figure 3b and c, corresponding to weeks 3 and 5), the combinations that include reanalysis-based predictors (black bars) shift weight to the quantiles of the climatology, whereas, for the aggregations that only include NWP-based predictors (BOA_NWP, EGA_NWP), larger weights are assigned to the quantiles of NCEP than in shorter leads. Also, the FCST_MX and FCST_MN gain some relevance. Differences between the BOA and EGA algorithms are noticeable in the relative distributions of the weights, such as along the quantiles of the ECMWF forecasts, but not so much in the evolution with lead time.
It is relevant to point out that, even though this section has discussed the features of the average weights, the fact that they are time-varying is a significant property of the SLAs. As an example, the temporal evolution of a set of weights is discussed in the Supporting Information (see Section S4).

Probabilistic skill: pinball loss as a function of quantile
The prior evaluation of the aggregation rules was focused on the center of the forecast distribution and associated with the deterministic skill. Nonetheless, these aggregation techniques have the potential to generate improvements in skill throughout the forecast distribution. To assess this capability objectively, one can study the behavior of the average pinball loss as a function of the distribution quantile. Figure 4 presents such an analysis for weeks (a) 2 and (b)

F I G U R E 3
Average weights obtained for the UK demand Q50 aggregation rules of the BOA and EGA algorithms over weeks (a) 1, (b) 3, and (c) 5. The black bars correspond to the combinations that consider all the predictors and the gray bars to the NWP-only aggregations. Predictors identified with _1 correspond to the ECMWF system and those marked with _2 to the NCEP system of a model combination or a reference forecast relative to EW_NWP. For week 2 (Figure 4a), the climatology and NCEP are clearly outperformed by EW_NWP for every quantile. ECMWF, however, outperforms EW_NWP everywhere but on the lower quantiles. All the other combinations have smaller losses compared with the uniform combination for most quantiles, but the differences between them are relatively small. It can be seen that, for week 2, the four SLA combinations have clear improvements in the center of the distribution, whereas the oracle combinations show smaller losses in the tails. It is worth recalling that the oracles are not a fair benchmark, since they represent an upper bound to the skill of constant-weights linear combinations (i.e., they were trained using the full period). In general, the EGA algorithms are less stable across the distribution and can become less skillful for particular quantiles. In the case of week 5 (Figure 4b), the differences in losses between the forecasts and references become larger. NCEP and the climatology are both outperformed by EW_NWP for every quantile, whereas ECMWF remains very close in skill to EW_NWP throughout the distribution. The BOA and EGA combinations now exhibit the largest skill improvements for most quantiles, only being outperformed in the extreme tails by the oracles. It is important to note that these tails are subject to larger errors due to the limited size of the sample analyzed here, and therefore these disagreements might not be robust. For this lead week, there is a suggestion that the combinations that include reanalysis-based predictors are more skillful than their NWP-only counterparts for a large part of the distribution, in particular in the case of the BOA algorithm. This suggests that incorporating reanalysis-based information in the online process might result in an increase in skill for longer lead times.

3.2.1
Quantile-mean pinball loss Rather than focusing on any particular quantile of the distribution, one might want to evaluate the overall skill of the resulting combinations. To address that, the average of the losses along Qgrid can be computed. It can be shown that, provided Qgrid is fine enough, this quantile mean pinball loss is a good approximation for the continuous ranked probability score (Taieb et al., 2016). The continuous ranked probability score (CRPS: Matheson and Winkler, 1976) measures the distance between the observed and the predicted cumulative distribution functions (e.g., Hersbach, 2000) and it is widely used in the evaluation of probabilistic forecasting skill. Figure 5a presents the quantile-mean losses (CRPS estimates) for all the combinations and references. In agreement with the previous section, the four SLA combinations show the smallest quantile mean losses for lead weeks 3-5 ( Figure 5a). Additionally, the combinations that include reanalysis-based predictors beat their _NWP counterparts in every case. All the sequential aggregation algorithms show improvements with respect to EW_NWP and the climatology (Figure 5b) for weeks 1-5. The relative improvements in the CRPS estimate of the SLAs for weeks 3-5 range between around 7% and 16%, whereas for week 2 the skill improvements remain close to 5%. For week 1, improvements can be quite high with respect to the EW_NWP, but the SLA combinations do not outperform ECMWF. It is also important to remark that most combinations are more skillful than the oracles for weeks 3-5. This is of relevance, because the linear combination of forecasts (such as through ridge regression) is another standard multimodel forecasting technique used in weather and climate prediction. The oracles, however, provide an upper limit for the skill of constant-weights combinations, since they were optimized using the complete evaluation period, when in practice a fixed independent training period would have been used to obtain the weights. The SLAs, however, can be more skillful than the oracles by allowing the combination weights to vary with time. In a similar way, the SLA combinations outperform the EMOS probabilistic forecast calibration for weeks 3-5.

Statistical significance of skill improvements
The increases in skill presented above, though moderate in some cases, are quite robust. The statistical significance of the skill enhancements can be assessed by using a Diebold-Mariano (DM) test (Diebold and Mariano, 1995), which compares the predictive accuracy of two forecasts using the time series of associated quantile-mean losses.
Results from the DM tests applied to UK demand forecasts are presented in Table 2 and reveal that all the combinations resulted in forecasts that are significantly more skillful than EW_NWP for weeks 1-5. Also, the increased skill of EGA with respect to BOA algorithms appears to be significant only for longer leads. Similarly, the advantages of including reanalysis-based predictors become significant for long leads (3-5 in most cases).
The assessment above might be affected by the multiple testing or multiple comparisons effect (e.g., Benjamini and Hochberg, 1995), but the application of corrections to the significance levels to make the tests more stringent is not universally recommended (e.g., Rothman, 1990;Rubin, 2021). In this context, we have decided to complement the analysis above with the determination of Model Confidence Sets (MCSs: Hansen et al., 2011), selected objectively for each metric and lead week. This method considers the full set of models and the null hypothesis of equal predictive power (or equally, that no inferior model is present in the set). Given a significance level, the procedure sequentially removes the worst-performing model until the null hypothesis can no longer be rejected. The remaining models are therefore the model confidence set for that confidence level. We implemented this method using the R-package estMCS 5 and considering the quantile-mean pinball losses, as in the case of the DM test, for each week individually. The optimal block length for bootstrapping was selected using the method proposed by Patton et al. (2009) and implemented using the R-package np. 6 The resulting MCSs for UK demand forecasts are presented in Table 3. The resulting sets imply that, at the 99% confidence level, no inferior model is contained. The results of this stringent significance test are in line with the assessment presented above, and show that the SLA combinations outperform all the other methods for lead weeks 3-5. For lead week 2, the SLA combinations show similar skill levels to the oracles, but they still outperform the equal-weight combination and the individual NWP models. For lead weeks 3 and 4, it is also shown that combinations that include reanalysis-based predictors outperform the NWP-only ones.

DISCUSSION AND CONCLUSIONS
An innovative set of multimodel aggregation techniques was introduced through application to 6 https://cran.r-project.org/web/packages/np/np.pdf. subseasonal-to-seasonal UK weekly electricity demand forecasting. The results included here have shown very promising results, with a significant skill improvement, particularly at long lead times (i.e., beyond week 3).
Even though some initial benefits of the sequential learning algorithms were obtained when analyzing the deterministic skill of the forecasts (Q50), the full extent of the skill improvements is observed when accounting for the complete quantile distribution. The two multimodel aggregation algorithms tested here, EGA and BOA, showed significant skill improvements with respect to the climatology, standard multimodel methods, and the individual best NWP system (ECMWF) for weeks 2-5. Overall, the optimized BOA and EGA combinations resulted in average improvements ranging around between 7 and 16% in the quantile-mean pinball losses (an estimation for the CRPS) for weeks 3, 4, and 5.
In the results presented here, some additional skill was obtained when the combinations included reanalysis-based predictors. A case study included in the Supporting Information illustrates how the algorithms are able to "learn" from the performance of the predictors through a season and adjust the weights accordingly to minimize losses (see Section S4). These results suggest that, when reanalysis-based predictors were included, this adjustment resulted in better predictions.
These methods were also tested on other large countries such as Germany, France, and Spain, for both electricity demand and wind power, and the main results are included in the Supporting Information (see Section S5). The overall conclusions remain robust, though the relative comparisons with the forecast benchmarks depend on the region and method.
We therefore conclude that the application of these novel multimodel combination techniques is very promising for s2s prediction. Given that the use of these techniques has little computational cost, more complex implementations would also be feasible, including, for example, the following: prior launches of the same forecasting system as predictors; different types of predictors, such as pattern-or regime-based forecasts; or a truly seamless approach to s2s forecasting by incorporating prior seasonal forecasts. Moreover, the "online" nature of these techniques makes them particularly suitable for operational practice: weights are automatically updated with every forecast cycle (rather than relying on some a priori calibration process using a "fixed" sample). Updates and changes in observing systems or NWP models can therefore be incorporated seamlessly without any additional effort.
It is, however, important to note that these SLAs should not be treated as a "black box", and careful implementation and performance verification is required. In particular, while it is clear that the SLAs can always be implemented (i.e., standard versions are available through open-source statistical packages), the process through which skill enhancements are achieved remains unclear. On the one hand, the fact that the SLAs show the largest skill improvements for longer lead times suggests that they act, in part, as an additional bias correction process (i.e., removing residual forecast biases that remained after the initial calibration of the raw NWP model output). On the other hand, the analysis of case studies showing "shocks" in the weight evolution suggests that the SLAs benefit from reanalysis-based predictors that carry memory-like information about how the recent past is behaving with respect to climatology and other predictors. This study introduced the use of SLAs to s2s prediction as a proof of concept, but more research is necessary to identify objectively the sources of the significant skill improvements associated with these methods, and to understand how these might interact with a more complex set of predictors and with additional skill assessment metrics. Additionally, it would be important to assess the skill of these novel methods against more complex multimodel benchmarks that are being applied in s2s prediction (e.g., Wanders and Wood, 2016;Specq and Batté, 2020).