Stratospheric influence on ECMWF sub-seasonal forecast skill for energy-industry-relevant surface weather in European countries

Meteorologists in the energy industry increasingly draw upon the potential for enhanced sub-seasonal predictability of European surface weather following anomalous states of the winter stratospheric polar vortex (SPV). How the link between the SPV and the large-scale tropospheric flow translates into forecast skill for surface weather in individual countries (a spatial scale that is particularly relevant for the energy industry) remains an open question. Here we quantify the effect of anomalously strong and weak SPV states at forecast initial time on the probabilistic extended-range reforecast skill of the European Centre for Medium-Range Weather Forecasts (ECMWF) in predicting country- and month-ahead-averaged anomalies of 2 m temperature, 10 m wind speed, and precipitation. After anomalous SPV states, specific surface weather anomalies emerge, which resemble the opposing phases of the North Atlantic Oscillation. We find that forecast skill is, to first order, only enhanced for countries that are entirely affected by these anomalies. However, the


INTRODUCTION
The energy industry is strongly affected by weather on hourly to seasonal time-scales via influences on electricity supply, demand and prices (e.g., Troccoli et al., 2014; Thornton et al., 2019). With the continuously increasing use of renewable energy sources such as wind or sunlight, this weather dependence has become, and will continue to become, even more critical (e.g., Grams et al., 2017; Bloomfield et al., 2018; van der Wiel et al., 2019). For instance, the daily-averaged wind power generation in Germany can vary by up to 30 GW¹, and by even more for Europe as a whole, depending on the prevailing synoptic situation (e.g., Beerli et al., 2017; Grams et al., 2017). Consulting operational numerical weather forecasts for specific countries is thus an important component of the energy industry's daily business. The steady increase in computational power combined with a continuous improvement of numerical weather prediction (NWP) systems in the last decades (Bauer et al., 2015) has allowed us to extend the operational forecast horizon to sub-seasonal time-scales (10-60 days; Vitart and Robertson, 2019). This time-scale closes the gap toward seasonal climate predictions, which have already been run operationally for more than a decade (e.g., Palmer et al., 2004; Kirtman et al., 2014), and thus advances the worldwide efforts toward seamless weather and climate prediction systems (Vitart and Robertson, 2019). Forecasting European weather for lead times beyond two weeks is still only weakly to moderately skilful, with a strong dependence on the season and the initial and predicted large-scale flow situation (e.g., Weigel et al., 2008; Vitart, 2014; Ferranti et al., 2018), and therefore is an area of active research (Vitart et al., 2017). Nevertheless, the European energy industry has already become one of the most important end-users of operational sub-seasonal forecasts. This requires a profound understanding of whether and how important sources of sub-seasonal predictability enhance forecast skill for energy-industry-tailored parameters. Beerli et al. (2017) tackled this question by investigating the role of the stratosphere for the predictability of country-aggregated wind power generation across Europe on monthly time-scales in reanalysis data (see below for more details). Here we build upon this study and investigate how the predictability from the stratosphere found by Beerli et al. (2017) translates into sub-seasonal numerical forecast skill for energy-industry-relevant European surface weather on a country scale. We continue with a brief discussion of the current knowledge about sources of sub-seasonal predictability and forecast skill for European surface weather, before we discuss the scope of the study at hand in more detail.
¹ https://www.energy-charts.de/power_de.htm?source=solar-wind&year=2019&month=3 (accessed 24 January 2020)
Predictability of the large-scale flow over Europe on sub-seasonal time-scales is generally gained from the state of low-frequency climate modes such as the winter stratospheric polar vortex (SPV; e.g., Baldwin and Dunkerton, 1999; Ambaum and Hoskins, 2002), the Madden-Julian Oscillation (e.g., Cassou, 2008; Lin et al., 2009), the El Niño-Southern Oscillation (e.g., Huang et al., 1998; Toniazzo and Scaife, 2006), tropical rainfall in general (e.g., Scaife et al., 2017; Stan et al., 2017), the Quasi-Biennial Oscillation (e.g., Holton and Tan, 1980; Anstey and Shepherd, 2014; Andrews et al., 2019), or variations in North Atlantic sea surface temperature (e.g., Rodwell et al., 1999) and soil moisture (e.g., Koster et al., 2010; Ardilouze et al., 2017). It is now established that amongst those, the SPV is potentially the most important source of sub-seasonal predictability for European weather during mid- and late winter (Kidston et al., 2015; Butler et al., 2019). This is because anomalously strong SPV states are statistically followed by persistent (up to 60 days) positive phases of the North Atlantic Oscillation (NAO), and anomalously weak SPV states by persistent negative phases of the NAO (Baldwin and Dunkerton, 2001; Charlton-Perez et al., 2018). As the NAO accounts for a major part of the large-scale flow variability over Europe (Hurrell, 1996) and thus strongly influences energy-industry-relevant surface weather (e.g., Brayshaw et al., 2011; Clark et al., 2017; Zubiate et al., 2017), it acts as a connecting link between the stratosphere and the surface. Typically, a positive phase of the NAO is associated with stronger (weaker) than normal near-surface wind and precipitation in Northern (Southern) Europe and higher than normal near-surface temperatures in Central to Northern Europe, and vice versa during a negative phase of the NAO. The mechanisms behind the downward influence from the stratosphere to the troposphere are manifold and partly still debated (Butler et al., 2019). Most likely, the tropospheric conditions before and during the anomalous SPV state (e.g., Ambaum and Hoskins, 2002; Attard et al., 2016; Kodera et al., 2016; Karpechko et al., 2017; Schneidereit et al., 2017; White et al., 2019; Domeisen et al., 2020b) as well as internal dynamics of the SPV itself (e.g., Holton and Mass, 1976; Scott and Haynes, 2000; Albers and Birner, 2014; Jucker, 2016) influence the anomalous SPV event and its subsequent surface impact. A particular type of anomalously weak SPV state is the so-called sudden stratospheric warming (SSW; Scherhag, 1952; Matsuno, 1971; Butler et al., 2015), during which the stratosphere warms abruptly and the westerlies associated with the SPV temporarily reverse. As with weak SPV states as a whole, SSWs can trigger negative phases of the NAO leading to extreme cold waves particularly over Northern Europe and Northern Russia (e.g., Kidston et al., 2015; Kautz et al., 2020). However, due to the large case-to-case variability in the dynamics of SSWs (e.g., Mitchell et al., 2013; Kretschmer et al., 2018), the tropospheric weather regime response can also differ from the classical negative phase of the NAO and for instance yield opposite temperature conditions for Europe (Beerli and Grams, 2019; Domeisen et al., 2020b). SSWs are thus particularly delicate events for the European energy industry.
The potential predictability from the stratosphere has triggered various studies quantifying the effect of anomalous SPV states, in particular of the critical SSWs, on sub-seasonal forecast skill. Most of these studies found an increase in skill both for statistical models (e.g., Baldwin et al., 2003; Charlton et al., 2003; Christiansen, 2005; Karpechko, 2015) and numerical models (e.g., Charlton et al., 2004; Mukougawa et al., 2009; Sigmond et al., 2013; Tripathi et al., 2015a; 2015b; Scaife et al., 2016) in predicting a particular tropospheric weather regime response or anomalous large-scale surface weather conditions directly. However, so far only Beerli et al. (2017) have investigated how anomalous SPV states affect the skill for predicting country-averaged surface weather all over Europe, although this is the scale the energy industry strongly relies on. They used a simple statistical model based on the SPV state in ERA-Interim to predict month-ahead-averaged anomalies of country-aggregated wind power generation in Europe. They obtained considerable skill for Northern European countries, but only moderate skill for Central European and no skill for Southern European countries. The skill was shown to be associated with both anomalously strong and weak SPV states at initial time, with the exception of Southern Europe, where the weak SPV states led to negative skill and thus to no skill overall. They showed that the NAO-related near-surface wind anomaly dipoles that follow the anomalous SPV states and affect primarily Northern and Southern Europe were mainly responsible for this skill pattern across Europe.
In this study, we investigate to what degree the influence of an anomalously strong and weak SPV on statistical forecast skill for wind power generation, as found by Beerli et al. (2017), applies to the probabilistic surface weather reforecast skill of the ECMWF (European Centre for Medium-Range Weather Forecasts) sub-seasonal NWP ensemble. We relate the SPV state at forecast initial time to the skill in predicting country- and month-ahead-averaged 2 m temperature, 10 m wind speed, and precipitation in Europe. These parameters are directly relevant for the energy industry and other end users of sub-seasonal forecasts (such as national weather services).² We analyze the three surface parameters separately, although they often need to be considered in combination, for instance when predicting country-specific energy demand. Investigating the skill from such a multivariate perspective would thus be important as well, but this goes beyond the scope of this study. Section 2 starts with an overview of the datasets used as well as the applied skill score and statistical tests. The results are given in Section 3. They demonstrate how the predicted anomaly composites compare to ERA-Interim and how this translates into the skill for individual countries. Section 4 discusses and summarizes the most important findings.

Model and observations
We investigate the skill of ECMWF sub-seasonal reforecasts (hereafter just denoted "forecasts"), obtained from the Subseasonal-to-Seasonal (S2S) Prediction Project Database (Vitart et al., 2017), for 2 m temperature T2M (daily-averaged), 10 m wind speed U10M (instantaneous at 0000 UTC), and total precipitation P (daily accumulated) during 22 winters between December 1995 and February 2017 (initial dates in DJF, i.e., the lead times of some forecasts extend into March). The reforecasts were initialized from ERA-Interim twice a week, run with 11 ensemble members (1 control and 10 perturbed forecasts) for a lead time of 46 days, and interpolated to a 1.5° × 1.5° grid after computation. Similarly to other studies (e.g., Schiraldi and Roundy, 2017; Deflorio et al., 2018), we use more than two reforecasts per week to increase sample size (to a total of 1,280), which comes with the caveat of mixing different model versions (in our case those are CY41R1, CY41R2, CY43R1, and CY43R3). Some of these versions differ with respect to horizontal resolution of the atmosphere (i.e., from 32 km up to day 10 and 64 km beyond, to 15 km up to day 15 and 31 km beyond) and ocean (i.e., from 1° to 0.25°) and the inclusion of active sea ice. The number of vertical levels (91), which is critical for the representation of stratosphere-troposphere coupling (e.g., Manney et al., 2017; Kawatani et al., 2019), is the same in all model versions. It is further important to note that the reforecast dataset used here likely yields lower levels of skill than the operational sub-seasonal forecasting system of the ECMWF would have, particularly during the first two weeks, due to the better initial conditions from data assimilation and a higher number of ensemble members in the operational set-up (cf., e.g., Vitart, 2014). Nevertheless, investigating reforecast data enables us to elucidate systematic model behaviour. As an observational reference, we use ERA-Interim reanalysis data (Dee et al., 2011) for the same winters between 1995 and 2017, also retrieved on a 1.5° × 1.5° grid. The three surface weather parameters in ERA-Interim are postprocessed to be consistent with the model data: T2M is calculated as an average over the 0000, 0600, 1200, and 1800 UTC values, U10M from the instantaneous 10 m u- and v-components at 0000 UTC, and total P as the sum over the 6-hourly accumulated fields at 0600, 1200, 1800, and 0000 UTC.
² Analyzing 100 m (instead of 10 m) wind speed would be ideal with regard to wind power applications, but this parameter is not available in the Subseasonal-to-Seasonal Prediction Project Database (Vitart et al., 2017) used for this study. However, a comparison of daily 10 m and 100 m wind speed in 40 years of reanalysis data averaged over Germany yields a very high correlation (not shown), which makes us confident in using 10 m wind speed instead.

Definition of country- and month-ahead-averaged anomalies
We evaluate the model skill for country- and month-ahead-averaged anomalies of T2M, U10M, and P, which are defined as follows. First, we calculate the model climatology as a function of the calendar day at forecast initial time and the lead time. For example, the model climatology for the forecast initialized on 1 December 2000 is obtained by averaging over all 11 ensemble members of all the forecasts initialized on the same calendar day (i.e., on 1 December 1995, 1996, 1997, ..., 2000, ...) separately for every lead time. The daily anomalies at every lead time of a certain forecast are then calculated with respect to the corresponding lead-time-dependent model climatology. A similar principle for calculating calibrated model anomalies has been used in various studies analyzing data from the S2S database (e.g., Vitart, 2017; de Andrade et al., 2019; Wulff and Domeisen, 2019). Second, these daily anomalies are averaged spatially over every European country and, third, temporally over the first 31 days of lead time (including forecast initial time). This yields one value per forecast for every country and parameter. The corresponding country- and month-ahead-averaged anomalies in ERA-Interim are obtained in the same way, but with respect to the ERA-Interim climatology, which is calculated as a 31-day running mean over the calendar-day average of all available years. Although the approach for calculating the model climatology is different from the one used for the ERA-Interim climatology (using ensemble mean instead of running mean), it allows for a similarly strong smoothing and eliminates (lead-time-dependent) model drifts. In this study, we focus on month-ahead anomalies as a particularly relevant time-scale for the energy industry (e.g., Dorrington et al., 2020). In addition, predictability on sub-seasonal time-scales is generally higher for temporally and spatially aggregated fields compared to instantaneous fields (e.g., Buizza and Leutbecher, 2015). Nevertheless, it is important to keep in mind that the first two weeks typically contribute most to forecast skill on sub-seasonal time-scales (e.g., Weigel et al., 2008; Vitart, 2014). For this reason, we also briefly discuss forecast skill for weeks 1 to 4.
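The anomaly definition above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the processing code used in the study: the data layout (a dictionary keyed by hypothetical (year, calendar day) pairs that holds country-averaged daily values for each ensemble member) and the function names are invented for the example, and the toy data use three lead days instead of 31. Because averaging is linear, taking the country average before the anomaly step gives the same result as the order described in the text.

```python
from statistics import fmean

def lead_time_climatology(forecasts, calendar_day):
    """Mean over all members of all years sharing `calendar_day`, per lead time."""
    runs = [member
            for (year, day), members in forecasts.items() if day == calendar_day
            for member in members]
    n_leads = len(runs[0])
    return [fmean(run[lead] for run in runs) for lead in range(n_leads)]

def month_ahead_anomalies(forecasts, year, calendar_day, n_days=31):
    """One time-averaged anomaly per ensemble member of a single forecast."""
    clim = lead_time_climatology(forecasts, calendar_day)
    return [fmean(member[lead] - clim[lead] for lead in range(n_days))
            for member in forecasts[(year, calendar_day)]]

# Toy data (assumed layout): two years, two members each, three lead days.
toy = {
    (1995, "12-01"): [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    (1996, "12-01"): [[5.0, 6.0, 7.0], [7.0, 8.0, 9.0]],
}
clim = lead_time_climatology(toy, "12-01")
anoms = month_ahead_anomalies(toy, 1996, "12-01", n_days=3)
```

In this toy case the lead-time-dependent climatology is 4, 5, and 6 for the three lead days, so the two members of the 1996 forecast have averaged anomalies of 1 and 3.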

Definition of stratospheric polar vortex strength
To investigate the model skill dependent on stratospheric conditions, we additionally calculate the strength of the polar vortex in the lower stratosphere (for simplicity just referred to as stratospheric polar vortex, SPV, hereafter) at forecast initial time (Beerli et al., 2017). First, we calculate the instantaneous geopotential height anomalies at 100 hPa for every ensemble member (the 150 hPa level, used by Beerli et al., 2017, is not available in the S2S database) with respect to model climatologies computed as for the surface parameters. The ensemble mean of these anomalies is then spatially averaged over the polar cap north of 60° N. Defining the SPV strength on a level in the lower stratosphere instead of the middle stratosphere (∼10 hPa), where the SPV is typically strongest, has been shown to be most meaningful for evaluating stratospheric surface weather impacts (Baldwin et al., 2003; Karpechko, 2015; Beerli et al., 2017; Charlton-Perez et al., 2018). This is likely because the lower stratospheric circulation is a good indicator for stratosphere-troposphere coupling. To investigate the influence of anomalously weak and strong SPV states on forecast skill, we define percentile bins of the SPV anomalies of all forecast initial times and, based on those, select the forecasts initialized in the 2% strongest, 10% strongest, 10% weakest, and 2% weakest SPV states (weak SPV states correspond to positive polar-cap-averaged geopotential height anomalies, and vice versa).
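The selection of forecast groups by SPV strength can be illustrated with a short sketch. The function name, the rank-based binning, and the synthetic input are assumptions made for this example; the study's exact percentile method is not specified here, so group sizes from this sketch may differ slightly from those reported.

```python
def select_spv_groups(spv_anomalies, fractions=(0.02, 0.10)):
    """Return indices of forecasts in the strongest and weakest SPV tails.

    `spv_anomalies` holds one polar-cap-averaged 100 hPa geopotential height
    anomaly per forecast. Negative anomalies mean a strong vortex, positive
    anomalies a weak vortex.
    """
    order = sorted(range(len(spv_anomalies)), key=lambda i: spv_anomalies[i])
    groups = {}
    for frac in fractions:
        k = max(1, round(frac * len(order)))  # assumed rank-based bin size
        pct = round(frac * 100)
        groups[f"{pct}% strongest"] = order[:k]   # most negative anomalies
        groups[f"{pct}% weakest"] = order[-k:]    # most positive anomalies
    return groups

# Synthetic example: 100 forecasts with anomalies from -50 to 49.
spv = [float(v) for v in range(-50, 50)]
groups = select_spv_groups(spv)
```

With 100 synthetic forecasts, the 2% groups contain two forecasts each and the 10% groups ten each.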

Ranked probability skill score (RPSS)
We use a three-category ranked probability skill score (RPSS; Wilks, 2011) to quantify how well a set of n model forecasts predicts above-normal, near-normal, or below-normal T2M, U10M, and P in a country (i.e., the upper, middle, and lower terciles of the corresponding month-ahead anomalies, respectively) compared to a climatological reference forecast. In our case, the climatological reference forecast assumes an equal occurrence probability of one third for each tercile. The RPSS is calculated by relating the mean ranked probability score (RPS) of the n model forecasts, $\overline{\mathrm{RPS}} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{RPS}_i$, to the mean ranked probability score of the corresponding climatological reference forecasts, $\overline{\mathrm{RPS}}_{\mathrm{clim}}$ (Wilks, 2011):
$$\mathrm{RPSS} = 1 - \frac{\overline{\mathrm{RPS}}}{\overline{\mathrm{RPS}}_{\mathrm{clim}}}$$
The RPS for a single forecast is calculated as (Wilks, 2011)
$$\mathrm{RPS} = \sum_{j=1}^{3}\left(y_j - o_j\right)^2$$
Here, $y_j$ and $o_j$ are the cumulative probability vector components of the forecast and the observation, with j = 1, 2, 3 corresponding to the three terciles. For example, a forecast in which 30% of the ensemble members predict an anomaly belonging to the first tercile, 60% to the second tercile, and 10% to the third tercile yields a cumulative forecast probability vector Y = (0.3, 0.9, 1.0). If the observed anomaly belongs to the second tercile (i.e., the second component of the observation vector has a probability of 1 and the other two components a probability of 0), the cumulative observation vector is O = (0, 1, 1). In our case, the cumulative climatological reference forecast vector required for calculating $\overline{\mathrm{RPS}}_{\mathrm{clim}}$ is consequently Y_clim = (1/3, 2/3, 1). The RPS is thus 0 for a perfect forecast and positive for a non-perfect forecast. Consequently, the RPSS is 1 for a perfect forecast, 0 if the forecast is equally good as the climatological reference forecast, and negative if the forecast is worse than the climatological reference forecast. In this study, the terciles for the forecast vector Y for a specific country and parameter are defined based on the corresponding month-ahead anomalies from all ensemble members in all available forecasts between 1995 and 2017, whereas the terciles for the observation vector O are defined based on the daily month-ahead anomalies in ERA-Interim over the same years.
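The worked example from this section can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names and the zero-based category index are choices made for this example, and the RPS and RPSS follow the standard definitions in Wilks (2011), with the climatological reference assigning equal probability to every category.

```python
from itertools import accumulate

def rps(forecast_probs, observed_category):
    """Ranked probability score of one forecast (0 is a perfect forecast)."""
    cum_forecast = list(accumulate(forecast_probs))
    obs = [1.0 if j == observed_category else 0.0
           for j in range(len(forecast_probs))]
    cum_obs = list(accumulate(obs))
    return sum((y - o) ** 2 for y, o in zip(cum_forecast, cum_obs))

def rpss(forecasts, observations):
    """Skill of the mean RPS relative to an equal-odds climatological forecast."""
    n_cat = len(forecasts[0])
    clim = [1.0 / n_cat] * n_cat
    mean_rps = sum(rps(p, o) for p, o in zip(forecasts, observations)) / len(forecasts)
    mean_rps_clim = sum(rps(clim, o) for o in observations) / len(observations)
    return 1.0 - mean_rps / mean_rps_clim

# Example from the text: 30% / 60% / 10% of members in terciles 1 / 2 / 3,
# observation in the second tercile (index 1).
example_rps = rps([0.3, 0.6, 0.1], 1)
example_rpss = rpss([[0.3, 0.6, 0.1]], [1])
```

For this single forecast the RPS is 0.3² + 0.1² + 0² = 0.10, the climatological RPS is (1/3)² + (1/3)² + 0² = 2/9, and the resulting RPSS is 1 − 0.10/(2/9) = 0.55.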

Statistical testing
Grouping the forecasts into those initialized in the extreme SPV states (Section 2.3), which is a key approach of this study, yields a relatively small sample size by nature. For instance, the two groups associated with the 10% strongest and 10% weakest SPV states consist of 129 and 127 individual forecasts, respectively. In addition, several of these individual forecasts are initialized only some days apart and thus most likely during the same extreme SPV period. The two aforementioned forecast groups consist of only 12 such periods each (mainly representing individual winters), which reduces the samples to a small number of truly independent forecasts. To account for these sample size and autocorrelation problems as well as to test whether forecast groups are significantly different from climatological (i.e., SPV-independent) conditions (see below for details), we create two different bootstrapped distributions for each of these forecast groups: a (group-internal) resampled distribution and a corresponding climatological distribution. To obtain the resampled distribution, we randomly resample 1,000 times 80% of all the winters in a certain forecast group (for instance, 80% of the 12 winters in the case of the aforementioned forecast groups) without replacement. To obtain the climatological distribution corresponding to a certain forecast group, we randomly sample 1,000 times a set of forecasts (with the same size as well as a similar calendar day sequence of initial times as the original forecast group) from all available forecasts independent of the SPV state.
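The two bootstrapped distributions can be sketched as follows. This is a simplified illustration with assumed names and data layout: forecasts are represented by arbitrary identifiers grouped by winter, the number of winters drawn is rounded to the nearest integer, and the climatological sampling ignores the additional constraint on the calendar-day sequence of initial times described above.

```python
import random

def resample_winters(forecasts_by_winter, n_resamples=1000, fraction=0.8, seed=0):
    """Group-internal resampling: draw a fraction of the winters without replacement."""
    rng = random.Random(seed)
    winters = sorted(forecasts_by_winter)
    k = round(fraction * len(winters))
    resamples = []
    for _ in range(n_resamples):
        chosen = rng.sample(winters, k)
        resamples.append([f for w in chosen for f in forecasts_by_winter[w]])
    return resamples

def sample_climatology(all_forecasts, group_size, n_samples=1000, seed=0):
    """SPV-independent sampling: random forecast sets of the same size as the group.

    Simplification: the calendar-day sequence of initial times is not matched.
    """
    rng = random.Random(seed)
    return [rng.sample(all_forecasts, group_size) for _ in range(n_samples)]

# Hypothetical forecast group: 12 winters with two forecasts each.
group = {1996 + i: [f"w{i}_a", f"w{i}_b"] for i in range(12)}
resamples = resample_winters(group)
clim_samples = sample_climatology(list(range(200)), group_size=24)
```

Resampling whole winters rather than individual forecasts keeps forecasts from the same extreme SPV period together, which is the point of the group-internal distribution.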
On the one hand, the resampled and corresponding climatological distributions are used to quantify robustness, as defined by Papritz (2020), and significance of the mean month-ahead anomalies (i.e., the composite maps) of a forecast group (Figures 3, 5, and 6 below in Section 3). Robustness indicates to what degree the mean anomaly is representative for the underlying individual anomalies (i.e., how little the individual anomalies differ from the mean anomaly). To this end, we calculate the mean anomaly for each of the 1,000 resamples of the resampled distribution, which yields a distribution of mean anomalies. Based on this distribution and following Papritz (2020), we define the actual mean anomaly to be robust on the first level if the absolute value of the mean anomaly is greater than the interquartile range of the resampled distribution, and on the second level if it is greater than the difference between the 90th and 10th percentiles (Figure 1). As a second measure, significance quantifies how strongly the anomaly differs from climatology. We determine this by comparing the resampled distribution of mean anomalies of a forecast group to the corresponding climatological distribution of mean anomalies. We define the mean anomaly to be significant if the 12.5th percentile of the resampled distribution is greater than the 87.5th percentile of the corresponding climatological distribution for positive anomalies, and the inverse for negative anomalies (Figure 1; the numbers 12.5 and 87.5% result from the arbitrary but reasonable condition that the centred 75% of the two distributions should not overlap). Although these two statistical characteristics are related, the robustness additionally helps us to understand the RPSS for a certain mean anomaly: for instance, a robust modelled anomaly but non-robust observed anomaly could indicate that the model mostly predicts an anomaly of the same tercile in contrast to the observed anomalies falling into different terciles (due to their larger spread around the mean anomaly), which would yield a low RPSS.
FIGURE 1 Principle for assessing robustness and significance demonstrated with a synthetic example: the mean anomaly A of the forecast group is robust on the first level because A is larger than the interquartile range IQR of the resampled distribution (red curve), and on the second level because it is larger than the range between the 10th and 90th percentiles, P10−90. However, the anomaly is not significant, as the 12.5th percentile P12.5 of the resampled distribution is not larger than the 87.5th percentile P87.5 of the climatological distribution (black curve).
On the other hand, we use the resampled and corresponding climatological distributions to determine whether the RPSS of a forecast group is significantly different from climatological (i.e., SPV-independent) conditions (Figures 4 and 7 below in Section 3). This is achieved by following the same principle as used for testing the significance of the mean anomalies: we calculate the RPSS for each of the 1,000 resamples in the resampled distribution and for each of the 1,000 samples in the corresponding climatological distribution. Based on the resulting two RPSS distributions, we define the RPSS of a forecast group to be significantly different from climatological conditions on the 25, 10, or 5% level, if the 25/75%, 10/90%, or 5/95% percentiles of the two distributions do not overlap.
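The robustness and significance criteria of this section reduce to percentile comparisons, which the following sketch makes explicit. All names are hypothetical, the percentile helper uses simple linear interpolation (an assumption, since the interpolation method is not stated in the text), and the input lists are synthetic stand-ins for the bootstrapped distributions of mean anomalies or RPSS values.

```python
def percentile(values, p):
    """p-th percentile (0 to 100) with linear interpolation between ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def robustness_level(mean_anomaly, resampled_means):
    """0 = not robust, 1 = exceeds the IQR, 2 = exceeds the 10-90 percentile range."""
    iqr = percentile(resampled_means, 75) - percentile(resampled_means, 25)
    p10_90 = percentile(resampled_means, 90) - percentile(resampled_means, 10)
    if abs(mean_anomaly) > p10_90:
        return 2
    return 1 if abs(mean_anomaly) > iqr else 0

def is_significant(mean_anomaly, resampled_means, clim_means):
    """True if the centred 75% of the two distributions do not overlap."""
    if mean_anomaly >= 0:
        return percentile(resampled_means, 12.5) > percentile(clim_means, 87.5)
    return percentile(resampled_means, 87.5) < percentile(clim_means, 12.5)

def rpss_significance_level(resampled_rpss, clim_rpss):
    """Strictest level (5, 10, or 25) at which the percentiles do not overlap."""
    for level in (5, 10, 25):
        lo, hi = level, 100 - level
        above = percentile(resampled_rpss, lo) > percentile(clim_rpss, hi)
        below = percentile(resampled_rpss, hi) < percentile(clim_rpss, lo)
        if above or below:
            return level
    return None

# Synthetic distributions: resampled values from 1 to 2, three climatologies.
resampled = [1.0 + 0.01 * i for i in range(101)]
clim_far = [-1.0 + 0.02 * i for i in range(101)]     # spans -1 to 1
clim_overlap = [0.02 * i for i in range(101)]        # spans 0 to 2
clim_near = [0.012 * i for i in range(101)]          # spans 0 to 1.2
```

With these synthetic inputs, a mean anomaly of 1.5 is robust on the second level; it is significant against `clim_far` but not against `clim_overlap`, which mirrors the robust-but-not-significant situation illustrated in Figure 1.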

Overall month-ahead forecast skill
Figures 2a-c show the RPSS for month-ahead 2 m temperature (T2M), 10 m wind speed (U10M), and precipitation (P) of all forecasts between 1995 and 2017 for most European countries. On average, the RPSS is higher for T2M than for U10M and P, and ranges from 0 up to 0.4. There are large regional variations of skill with a characteristic pattern for all of the three parameters: for T2M, the RPSS tends to be higher in Central to Eastern Europe, Sweden, and Italy than in Western to Southwestern Europe as well as the UK, Norway, Finland, and parts of the Balkans. For U10M, the RPSS is highest in Northern Europe (particularly Sweden) but also in the UK, Baltics, and Iberian Peninsula and much lower in Central to Eastern Europe and the Balkans. A similar pattern occurs for P, although the highest values in Northern Europe are limited to Norway.
Figures 2d-f provide a first qualitative test of whether and how anomalous SPV states at forecast initial time, independently of whether they are anomalously strong or weak, might influence this regional variation in the overall RPSS in Figures 2a-c: they summarize the ERA-Interim anomaly patterns for the three parameters after the 10% strongest and 10% weakest SPV states (see caption for more details; note that the separate anomalies following these two extreme SPV states will be shown and discussed later in Figures 3d,f, 5d,f, and 6d,f). In addition, the mean absolute month-ahead sea level pressure (SLP) of the two forecast groups is indicated with two different contours. Figures 2d-f thus reflect the NAO-related patterns that typically follow anomalous SPV states (e.g., Tripathi et al., 2015b; Beerli et al., 2017): after the 10% strongest SPV states, a strong meridional SLP gradient occurs over the North Atlantic and Europe (solid contours). This likely indicates the positive-NAO-like weather pattern characterized by a strong zonal jet stream. In contrast, the meridional SLP gradient is much weaker after the 10% weakest SPV states (dashed contours), which likely reflects the negative-NAO-like weather pattern with a weaker, wavier, and more southward-shifted jet stream. These circulation patterns lead to the characteristic NAO surface weather anomalies, indicated by the shading, which particularly affect Central to Northeastern Europe in terms of T2M and Northern and Southern Europe in terms of U10M and P.
The regions affected by strong anomalies (Figures 2d-f) are to first order co-located with the regions experiencing a relatively high RPSS (Figures 2a-c). This points toward a potential influence of anomalous SPV states on the overall forecast skill via the modification of the NAO. However, it does not indicate the separate contributions from the anomalies following strong and weak SPV states. In the next section, we thus investigate this correlation, and potential causality, in more detail by quantifying the model skill after anomalously strong and weak SPV states separately. We first analyze how well the model reproduces the anomaly patterns after strong and weak SPV states compared to ERA-Interim. This helps us to understand the RPSS for individual countries after anomalous SPV states, which is discussed subsequently. Our main focus is on T2M (Section 3.2.1); U10M and P are discussed more briefly afterwards (Section 3.2.2).
FIGURE 2 (a-c) RPSS of all 1,280 forecasts for country- and month-ahead-averaged (a) T2M, (b) U10M, and (c) P. Note that the RPSS is only calculated for the countries marked by the respective brownish colour shade. (d-f) Difference (shading) between the mean month-ahead anomalies in ERA-Interim for the dates initialized in the 10% strongest SPV states minus the mean month-ahead anomalies for the dates initialized in the 10% weakest SPV states for, again, (d) T2M, (e) U10M, and (f) P. Note that the anomalies following the 10% strongest and weakest SPV states are not exactly opposite (cf. Figures 3d,f, 5d,f, and 6d,f) and thus contribute differently to the magnitude of the anomaly differences shown here. The mean sea level pressure for the dates initialized in the 10% strongest (weakest) SPV states is indicated by the solid (dashed) contours. The values in the box in the upper right corners indicate the numbers of forecasts in these two forecast groups.

3.2 Month-ahead anomalies and forecast skill after anomalous SPV states

2 m temperature
Figure 3 shows the mean modelled month-ahead T2M anomalies in the forecasts initialized in the 2% strongest, 10% strongest, 10% weakest, and 2% weakest SPV states in comparison to the corresponding mean anomalies in ERA-Interim. In addition, robustness and significance of the anomalies are indicated (Section 2.5 gives details).
After the 2% strongest SPV states (Figures 3a,b), the model predicts a strong and widely robust positive-NAO-like warm anomaly (of up to 2  Southern Europe relatively well but tends to be overconfident with respect to the warm anomalies over Scandinavia after the 2% strongest SPV states. In contrast, it predicts the cold anomalies over Scandinavia relatively well but struggles to capture the correct extent of the cold air masses into Western, Central, and Southern Europe after the 2% weakest SPV states. The overall weaker robustness of both the modelled and observed anomalies after the 2% weakest SPV states may further indicate the stronger case-to-case variability in the large-scale response (or in how far the cold air masses extend into Europe) following weak SPV states as found by Beerli and Grams (2019) and Domeisen et al. (2020b). After normal SPV states, the model predicts hardly any systematic anomalies, which is consistent with ERA-Interim (not shown). In the following, we demonstrate how the discussed model performance after the different SPV states on the large scale translates into the RPSS for T2M in individual countries.
Figure 4 shows the RPSS for T2M for a selection of countries for all forecasts (grey boxes) and for the forecast groups initialized in the specific SPV states (coloured boxes) which we have already discussed in Figure 3. We show distributions of the RPSS based on the bootstrapped resampled distribution introduced in Section 2.5 (filled boxes) in addition to the actual RPSS for a specific forecast group (horizontal black solid lines inside the boxes). To test whether the RPSS of a specific forecast group is significantly different from the climatological (i.e., SPV-independent) RPSS, we compare its resampled RPSS distribution (filled boxes) to the corresponding climatological RPSS distribution (blank narrow boxes) which is based on randomly selected forecast groups of equal size (Section 2.5 gives details). Three significance levels are indicated with one (25% level), two (10% level), or three stars (5% level) above the corresponding boxes. For brevity, results are shown for subjectively selected countries from the different European regions (which, however, does not necessarily mean that they are representative in terms of skill; Figure S1 in the Supporting Information shows all countries). Before focusing on the RPSS conditioned on the SPV states, it is worth noting that the RPSS of all forecasts in Figure 4 (grey boxes; as in the RPSS maps in Figure 2) is significantly different on the 10% level between the countries with the highest skill, such as Italy or Germany, and the countries with the lowest skill, such as Spain (because the 10/90% percentiles of the corresponding grey RPSS distributions do not overlap). This result itself has implications for energy meteorology, considering for instance the dependence of the energy demand of Germany and Spain on near-surface temperature during winter (e.g., Bessec and Fouquau, 2008).
Our main objective is to understand to what degree the forecasts initialized in anomalously strong and weak SPV states relate to and possibly influence the strong regional skill variation shown in Figure 2 (and by the grey boxes in Figure 4). The RPSS after the 2% strongest SPV states (filled dark red boxes) for Germany, France, Spain, and the UK is higher compared to climatological conditions (blank dark red boxes; significant for all except the UK). The RPSS after the 10% strongest SPV states (light red boxes) is also enhanced for some of these countries (significantly for Germany and Spain). For Italy, Sweden, and Romania, the RPSS is unchanged after the 2% and 10% strongest SPV states (with the exception of a significant reduction after the 10% strongest SPV states for Sweden). In contrast, the RPSS is significantly reduced for Norway after both the 2% and 10% strongest SPV states. Note that these modifications in skill are, if at all, significant on very different levels, as indicated by the stars in Figure 4.
Nevertheless, there are some outstanding and highly significant modifications of the RPSS such as the increase for France or Spain. The reason for this skill pattern can be understood from the anomaly maps in Figures 3a-d: the countries with a significantly enhanced RPSS (Germany, France, and Spain) are affected by the well-predicted warm anomaly over Europe, which is mostly robust both in the model and in ERA-Interim. This consistency in the robustness implies that the positive mean anomaly is representative for most individual anomalies. Therefore, most of the individual anomalies are consistently assigned to the upper tercile both in the model and in ERA-Interim, which yields a relatively high RPSS. In contrast, the model predicts a robust warm anomaly for Norway and Sweden after the 2% strongest SPV states, which hardly appears in ERA-Interim (particularly for Norway). This implies that the model predicts a positive anomaly (upper tercile) in most cases with a strong SPV, which is in contrast to the observed individual anomalies varying either between negative and positive values (all three terciles) or around the climatological state (middle tercile). The erroneous prediction of the upper tercile in most cases thus leads to the relatively low, in the case of Norway even significantly reduced, RPSS. Italy, as a third example, is largely unaffected by any (robust) anomaly both in the model and in ERA-Interim, which explains its unchanged RPSS compared to climatological conditions. Note that the consistency in robustness between the model and ERA-Interim thus appears as a better measure to understand differences in the RPSS than the consistency in significance. Therefore, we mainly focus on robustness hereafter.
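The tercile argument above can be illustrated with a minimal sketch of a three-category ranked probability skill score. The exact RPSS definition used in this study is given in the methods section and is not reproduced here; the sketch assumes tercile probabilities taken as the fraction of ensemble members per category and a climatological reference forecast with equal odds of one third per tercile. All function names and the toy numbers are hypothetical.

```python
import numpy as np

def tercile_probs(ensemble, lower, upper):
    """Fraction of ensemble members in the lower, middle, and upper tercile.

    `lower` and `upper` are the climatological tercile boundaries (assumed given).
    """
    ensemble = np.asarray(ensemble, dtype=float)
    below = np.mean(ensemble < lower)
    above = np.mean(ensemble > upper)
    return np.array([below, 1.0 - below - above, above])

def rps(probs, obs_category):
    """Ranked probability score of one forecast (0 is a perfect forecast)."""
    obs = np.zeros(len(probs))
    obs[obs_category] = 1.0
    return float(np.sum((np.cumsum(probs) - np.cumsum(obs)) ** 2))

def rpss(forecast_probs, obs_categories):
    """Skill relative to a climatological forecast with equal tercile odds (assumption)."""
    clim = np.full(3, 1.0 / 3.0)
    rps_fc = np.mean([rps(p, o) for p, o in zip(forecast_probs, obs_categories)])
    rps_cl = np.mean([rps(clim, o) for o in obs_categories])
    return 1.0 - rps_fc / rps_cl

# Toy forecast group: the model favours the upper tercile in every case.
warm_forecasts = [np.array([0.1, 0.2, 0.7])] * 3

# Case 1: observations also fall into the upper tercile every time.
consistent = rpss(warm_forecasts, [2, 2, 2])

# Case 2: observations scatter over all three terciles.
scattered = rpss(warm_forecasts, [0, 1, 2])

print(consistent, scattered)
```

In the first toy case the score is clearly positive, whereas in the second case the same confident forecasts score worse than climatology, which mirrors the contrast described above between countries with a consistent warm anomaly and those where the model is overconfident.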
The skill pattern looks substantially different after the 2% weakest SPV states (filled dark blue boxes in Figure 4): only Sweden tends to have an increased RPSS compared to climatological conditions, although this is not significant. For all the other countries, the RPSS tends to stay unchanged (Italy, UK, and Norway) or to be reduced (Germany, Romania, France, and Spain; significantly for all except Germany). Similar but smaller RPSS changes emerge after the 10% weakest SPV states (light blue boxes) with a few exceptions (a significant decrease for Italy and a non-significant increase for Norway). However, it is important to note that the spread in the RPSS after the 2% weakest SPV states is remarkably large for many countries. This indicates a strong case-to-case variability in the model performance and will be analyzed in more detail subsequently. Figures 3e-h provide an explanation for this skill pattern: after the 2% weakest SPV states (Figures 3g,h), Sweden is the only country where the robust cold anomaly in the model agrees well with ERA-Interim and thus tends to increase the RPSS. For most other countries, the cold anomalies are rather small and only weakly robust (both in the model and in ERA-Interim). This indicates a strong case-to-case variability in the anomalies following the 2% weakest SPV states, which probably explains the non-robust and thus statistically indistinguishable RPSS for some countries (such as Germany) compared to climatological conditions. However, for some Central, Southern European, and Balkan countries such as Romania, France, and Spain, the RPSS is significantly reduced (and in some cases even negative). We hypothesize that this is due to their location at the transition zone from the colder conditions over Northern Europe to the warmer conditions over North Africa and Western Asia. Whenever the model has a subtle error in the prediction of this transition zone (which is at least the case on average as indicated by Figures 3e-h), it can lead to the wrong anomaly (middle or even opposite tercile) and thus a reduced RPSS for these countries. A similar hypothesis has been suggested by Domeisen et al. (2020a), who also found an average mismatch in this transition zone for T2M following weak SPV states between different models and reanalysis. After the 10% weakest SPV states (Figures 3e,f), the model response agrees much better with the observed response, particularly over Northern Europe. Therefore, the RPSS for Scandinavian countries (Sweden and Norway) tends to increase. Nevertheless, there are still mismatches between the model and ERA-Interim with respect to the extension of the cold anomalies into Central and Southern Europe. This explains the significant reduction in the RPSS for Italy, Romania, France, and Spain even after the 10% weakest SPV states and gives further indications that the conditions in the transition zone between cold and warm anomalies are challenging to predict. Figure S1 (same as Figure 4 but for all countries) shows a similar reduction in the RPSS for most Southern European and Balkan countries and thus corroborates the systematic problem in the model in predicting the T2M response in these regions after weak SPV states.
As discussed in Section 2.5, our bootstrapping approach of resampling subsets of winters (i.e., distinct periods of anomalous SPV states) accounts for the small sample size of the forecast groups initialized in the 2% most extreme SPV states. Nevertheless, the resulting RPSS distributions (Figure 4) are biased toward few periods that are represented by many individual forecasts (this is because the RPSS of every resample including this period is dominated by the forecasts of this period). In our dataset, the most obvious period of this kind is the persistent phase of an extremely weak SPV in February 2009, which accounts for around 50% of all the 25 forecasts initialized in the 2% weakest SPV states. To test the sensitivity of our skill analysis to this period, we have recalculated the RPSS distribution after the 2% weakest SPV states after removing the whole winter 2008/2009 from the dataset (hatched blue boxes in Figure 4). Indeed, the RPSS spread becomes much smaller and reveals a clearer skill pattern for many countries (compared to the 2% weakest SPV states including the winter 2008/2009, i.e., the filled blue boxes): the RPSS improves for Norway (significantly) and the UK, which is consistent with Sweden and thus corroborates the argument of enhanced skill for Northern Europe after weak SPV states. On the other hand, the RPSS is significantly reduced and negative for Italy, which aligns with the still poor model performance for other Central, Southern European, and Balkan countries (such as Romania, France, and Spain). Therefore, the weak SPV event in winter 2008/2009 indeed influences the skill pattern in Figure 4.
Nevertheless, removing this outlier demonstrates that the north-south contrast in skill for T2M following the weakest SPV states is a robust finding. The reason for this is that, even without the dominant winter 2008/2009, the anomaly composites (corresponding to Figures 3g,h, but not shown here) indicate a model bias in the extent of the cold air masses into Central and Southern Europe compared to ERA-Interim.
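The effect of one dominant winter on a winter-wise bootstrap can be sketched as follows. This is not the procedure of Section 2.5, which is not reproduced here; it is a minimal illustration under the assumption that whole winters are drawn with replacement and that a generic per-forecast score stands in for the RPSS (in practice the RPSS would be recomputed from the pooled forecasts of each resample). The function name, winter labels, and scores are hypothetical toy values.

```python
import numpy as np

def bootstrap_by_winter(values_by_winter, n_resamples=1000, seed=0):
    """Resample whole winters with replacement and return the mean score per resample."""
    rng = np.random.default_rng(seed)
    winters = sorted(values_by_winter)
    means = np.empty(n_resamples)
    for i in range(n_resamples):
        # Draw as many winters as are available, allowing repeats.
        drawn = rng.choice(len(winters), size=len(winters), replace=True)
        pooled = np.concatenate(
            [np.asarray(values_by_winter[winters[j]], dtype=float) for j in drawn]
        )
        means[i] = pooled.mean()
    return means

# Toy data: winter "D" contributes many forecasts with poor scores.
toy = {
    "A": [0.40, 0.50],
    "B": [0.50, 0.60],
    "C": [0.45, 0.55],
    "D": [-0.50] * 12,
}

with_d = bootstrap_by_winter(toy)
without_d = bootstrap_by_winter({k: v for k, v in toy.items() if k != "D"})

print("spread with dominant winter:   ", np.std(with_d))
print("spread without dominant winter:", np.std(without_d))
```

Every resample that contains the dominant winter is pulled toward its scores, so the resampled distribution becomes wide; removing that winter narrows the spread, which is the qualitative behaviour described for the hatched boxes in Figure 4.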
Aside from determining whether the skill after a specific extreme SPV state is significantly different from climatological conditions, Figure 4 further allows detecting significant differences in skill between strong and weak SPV states directly. For instance, the (enhanced) RPSS for France and Spain after the strongest SPV states is significantly higher than their (reduced) RPSS after the weakest SPV states (as the 10 and 90% percentiles of the equally sized dark red and dark blue RPSS distributions in Figure 4 do not overlap). The same, but vice versa, applies to Norway if the winter 2008/2009 is removed. These country-specific findings can be an important guidance for end-users of sub-seasonal weather forecasts such as energy meteorologists.
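The percentile-overlap criterion used for such direct comparisons can be written down in a few lines. The sketch below assumes two equally sized arrays of resampled RPSS values and treats them as different when their 10th to 90th percentile ranges do not overlap; the function name and the toy distributions are hypothetical and only illustrate the comparison, not the full significance test of Section 2.5.

```python
import numpy as np

def percentile_ranges_overlap(dist_a, dist_b, lower=10.0, upper=90.0):
    """Return True if the [lower, upper] percentile ranges of two samples overlap."""
    a_lo, a_hi = np.percentile(dist_a, [lower, upper])
    b_lo, b_hi = np.percentile(dist_b, [lower, upper])
    return bool(a_lo <= b_hi and b_lo <= a_hi)

# Toy resampled RPSS distributions (hypothetical values).
after_strong = np.linspace(0.40, 0.60, 101)
after_weak = np.linspace(-0.20, 0.10, 101)
all_forecasts = np.linspace(0.50, 0.80, 101)

distinct = not percentile_ranges_overlap(after_strong, after_weak)
similar = percentile_ranges_overlap(after_strong, all_forecasts)
print(distinct, similar)
```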
TABLE 1 Pearson correlation coefficients (r) between the month-ahead T2M anomalies of the ensemble mean and of ERA-Interim for the countries and specific forecast groups shown in Figure 4
To test the sensitivity of our results for T2M to the skill score used, we have performed an analysis similar to the one summarized in Figure 4, but based on correlation instead of the RPSS. More specifically, we have computed the Pearson correlation coefficient between the month-ahead T2M anomalies of the ensemble mean and of ERA-Interim for the same countries and forecast groups shown in Figure 4. The results, summarized in Table 1, confirm the results from the RPSS analysis for many countries at least in a qualitative sense: countries with a significantly enhanced or reduced RPSS after extreme SPV states compared to climatological conditions have a correspondingly enhanced or reduced correlation coefficient after the same extreme SPV states. For instance, the correlation coefficients for France are 0.87 after the 2% strongest SPV states, 0.31 after the 2% weakest SPV states, and 0.65 for all forecasts, which confirms the asymmetric skill pattern shown in Figure 4. Similarly, Sweden has a correlation coefficient of 0.69 after the 2% strongest SPV states, 0.81 after the 2% weakest SPV states, and 0.68 for all forecasts, which is rather inverse to the skill pattern for France, as shown in Figure 4.
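A correlation of this kind can be computed as sketched below, assuming the forecasts of one country and forecast group are stored as an array of shape (forecasts, ensemble members) and the reanalysis as one value per forecast. The function name, array layout, and numbers are illustrative assumptions and do not reproduce the values in Table 1.

```python
import numpy as np

def ensemble_mean_correlation(ensemble_anomalies, reanalysis_anomalies):
    """Pearson r between ensemble-mean anomalies and reanalysis anomalies.

    ensemble_anomalies: array-like of shape (n_forecasts, n_members)
    reanalysis_anomalies: array-like of shape (n_forecasts,)
    """
    ens_mean = np.asarray(ensemble_anomalies, dtype=float).mean(axis=1)
    ref = np.asarray(reanalysis_anomalies, dtype=float)
    return float(np.corrcoef(ens_mean, ref)[0, 1])

# Toy month-ahead T2M anomalies in K (hypothetical): 4 forecasts, 3 members each.
ens = [
    [0.5, 1.5, 1.0],
    [-1.0, 0.0, -0.5],
    [2.0, 1.0, 1.5],
    [0.0, 0.5, -0.5],
]
# Reanalysis chosen as a linear function of the ensemble mean, so r is 1 by construction.
ref = [3.0, 0.0, 4.0, 1.0]

r = ensemble_mean_correlation(ens, ref)
print(r)
```

Because the correlation only uses the ensemble mean, it ignores the ensemble spread that the probabilistic RPSS accounts for, which is one reason to treat it as a qualitative cross-check.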
To interpret these results in a statistically sound way, the correlation coefficients would have to be calculated for the same bootstrapped distributions as in Figure 4 and not only for the ensemble mean. Nevertheless, this simple correlation analysis supports our interpretation of at least the qualitative pattern of country-specific skill modification for T2M after extreme SPV states based on the thorough analysis using the RPSS.
Figures 5a-d demonstrate remarkably well that, on the large scale, the model predicts the positive-NAO-like anomaly dipole structure with stronger than normal U10M over Northern Europe and weaker than normal U10M over Mediterranean Europe after the 2% and 10% strongest SPV states. The same applies to the inverse, negative-NAO-like anomaly dipole after the 2% and 10% weakest SPV states (Figures 5e-h). Most of the anomalies are consistently robust both in the model and in ERA-Interim and they reach values of more than ±1.4 m⋅s⁻¹. The most pronounced mismatches occur on the regional scale: after the 2% and 10% strongest SPV states (Figures 5a-d), the modelled anomalies extend slightly too far to the east. After the 2% weakest SPV states, the transition zone between negative and positive anomalies tends to be slightly displaced in the model compared to ERA-Interim. Furthermore, the positive anomalies in Mediterranean Europe are rather weak, patchy, and hardly robust, both in the model and in ERA-Interim. The large-scale structure and robustness of the P anomalies, shown in Figure 6, strongly correlate with those for U10M (Figure 5) because of the common dependency of the two parameters on cyclone activity. The most important difference is that the maximum P anomalies, reaching values of more than ±1.4 mm⋅day⁻¹, are concentrated along the coast of Norway (mainly the mountainous part) and the western coasts of the Mediterranean, which are primarily affected by the NAO-related meridional excursions of the storm track. Consistent with the mismatches found for U10M (Figure 5), the model tends to predict the extent of the negative P anomaly in the Mediterranean after the 2% and 10% strongest SPV states too far to the east into the Balkans. Likewise, there is a meridional shift in the position of the rather patchy anomaly dipole after the (particularly 2%) weakest SPV states in the model compared to ERA-Interim.

10 m wind speed and precipitation
Figure 7 shows that the RPSS of all forecasts (grey boxes) for U10M (Figure 7a) and P (Figure 7b) tends to be lower than for T2M (Figure 4) on average. This indicates an overall lower predictability for U10M and P compared to T2M at least in the model. Considering the anomalous SPV states, Figure 7a shows that the skill pattern for U10M is qualitatively similar to the skill pattern for T2M (Figure 4) in the sense that most displayed countries experience either an enhanced or unchanged RPSS after strong SPV states (although on a lower significance level on average). However, the key difference is that Southern European and Balkan countries such as Italy, Spain, or Romania do not experience a reduced but rather unchanged or even enhanced RPSS also after weak SPV states (particularly if the winter 2008/2009 is removed). This is because the U10M anomaly dipole (Figures 5e-h) is shifted more equatorward compared to T2M (Figures 3e-h) and thus affects Southern Europe and the Balkans more directly. However, more Central European countries such as Germany or France still tend to experience reduced skill after weak SPV states, which results from their location in the transition zone between the positive and negative anomalies (as discussed above for T2M). The slight positional error of the predicted anomaly dipole and thus this transition zone in the meridional direction after the 2% weakest SPV states (Figures 5g,h) gives evidence for this. In contrast to U10M, for P the symmetric skill pattern with a (in some cases significantly) enhanced RPSS both after strong and weak SPV states is limited to Italy, Norway, and Spain (Figure 7b; again, particularly if the winter 2008/2009 is removed). This is because the considerable P anomalies mainly occur in these regions, as discussed above (Figure 6). For most other countries, the RPSS after the 2% weakest SPV states is either unchanged or tends to be reduced, which is again due to the positional error of the predicted anomaly dipole (for instance over Germany and France; Figures 6g,h). An exception is Romania with an inverse asymmetric skill pattern: the skill tends to be reduced (non-significantly) after the 2% strongest SPV states due to the overconfident dry anomaly of the model over the Balkans (Figures 6a,b), but enhanced (significantly if the winter 2008/2009 is removed) after the 2% weakest SPV states due to the relatively well-predicted and robust wet anomaly concentrated over the Balkans and the Black Sea region (Figures 6g,h).

DISCUSSION AND CONCLUSIONS
We have demonstrated that the ECMWF sub-seasonal reforecast skill (three-category RPSS) for predicting energy-industry-relevant, country- and month-ahead-averaged surface weather anomalies substantially varies among different European regions: for T2M, the skill is generally higher for Central, Eastern, and Northeastern Europe compared to Southwestern Europe, whereas for U10M and P, the skill is higher for Northern and Southern Europe compared to Central Europe. The strong regional variation in forecast skill primarily reflects the anomaly patterns associated with the NAO, which dominates European surface weather variability on sub-seasonal time-scales. Besides other climate modes (Section 1), anomalous SPV states can lead to persistent NAO-like surface weather anomalies and thus considerably influence skill. In this study, we have demonstrated that the changes in skill of the forecasts initialized during anomalous SPV states are not always positive but can be negative when focussing on a country level: for T2M, many Western, Central, Southern European and Balkan countries tend to have enhanced skill after the strongest SPV states (compared to SPV-independent, climatological conditions), but reduced and in some cases even negative skill after the weakest SPV states. The asymmetry tends to be reversed for Scandinavian countries, with enhanced skill after the weakest but reduced skill after the strongest SPV states. This can be explained by a north-south asymmetry in model performance after the most extreme SPV states on average: the model predicts the positive-NAO-like warm anomalies in Western, Central, and Southern Europe relatively well but tends to be overconfident with respect to the warm conditions in Northern Europe, particularly Scandinavia, after the strongest SPV states. In contrast, it predicts the negative-NAO-like cold anomalies in Northern Europe centred over Scandinavia considerably well but struggles to correctly capture the strongly case-dependent extent of the anomalously cold air masses into Central and Southern Europe. It is important to note that a substantial case-to-case variability occurs in the mean anomaly patterns following extreme SPV states, which can critically affect the robustness of such an analysis. However, by applying a thorough significance test, we can show that this SPV-dependent north-south asymmetry in skill for T2M is remarkably robust for various countries such as France, Spain, or Norway (Figure S1 gives an overview of all countries). This is a key finding of our study and suggests a flow-dependent T2M model bias in the corresponding regions following extreme SPV states. Furthermore, it aligns with the enhanced (reduced) skill for T2M in Central to Southern Europe after strong (weak) SPV states found by Domeisen et al. (2020a, their figures 5 and 6), and demonstrates how their findings translate into a country-scale perspective. In contrast to T2M, the model skill for U10M and P tends to increase for both Northern and Southern European countries after both strong and weak SPV states (for P, this increase is mainly limited to Norway and the Iberian Peninsula). This results from the NAO-like north-south dipole in the anomaly pattern, which is predicted relatively well at least on the larger scale. Nevertheless, the skill still tends to be reduced or at least unchanged for certain Central European countries such as France following extreme SPV states. We hypothesize that this is because Central Europe is located at the transition zone between the NAO-induced positive and negative U10M and P anomalies, which is slightly displaced in the model in the meridional direction and thus yields erroneous anomalies on average. Our analysis thus provides indications that a good forecast skill for the NAO index (and thus the large-scale flow pattern) does not necessarily imply a good forecast skill for surface weather in those European regions that are located at the edge of the NAO-related surface weather anomalies and thus are particularly sensitive to a correct forecast of their position. The differences in skill between Northern and Central Europe for U10M are similar to the findings of Beerli et al. (2017) for wind power generation (their figure 8a). However, their purely statistical approach does not have skill for Southern European countries after weak SPV states, which is in contrast to our NWP-based findings and thus demonstrates the added value of using a dynamical model for this region.
When investigating forecast skill on sub-seasonal time-scales, it is important to keep in mind that the first two weeks typically contribute most to the skill (e.g., Weigel et al., 2008; Vitart, 2014). To test this hypothesis, we have performed the same RPSS analysis as summarized in Figure 4 but for the T2M anomalies averaged separately for weeks 1 to 4 after forecast initial time (Figures S4 and S5). Indeed, the RPSS decreases roughly exponentially from week 1 to 4, independent of the SPV state at forecast initial time. In most cases, there is no or even negative skill left for week 4. The most striking exception is the RPSS for weeks 3 and 4 after the 2% and 10% strongest SPV states, which is still considerably high for many Central to Southern European countries and thus substantially contributes to their high RPSS for the month-ahead anomalies. The remarkable skill for T2M for weeks 3 and 4 is useful for the energy industry, which also relies on weather forecasts on weekly time-scales for its planning and decision-making.
Despite remarkable correlations between the initial SPV state and surface weather forecast skill, our results do not tackle the question of causality. Various studies have shown that the large-scale tropospheric flow regime (and thus the surface weather response) following anomalous SPV states does not only depend on the stratosphere itself but also on the tropospheric state before and during the stratospheric event (e.g., Ambaum and Hoskins, 2002; Karpechko et al., 2017). For instance, it is possible that the enhanced skill for T2M for Western, Central, and Southern Europe following strong SPV states, as found in this study, partly results from a relatively predictable positive-NAO-like flow regime that is already active before and during the strong SPV states. This would be in line with the SPV-NAO coupling found by Ambaum and Hoskins (2002) and indicate that the strong SPV is rather an integrator than a trigger of the positive phase of the NAO. Similarly, the tropospheric conditions before and during the weak SPV states investigated here might be co-responsible for how well the model predicts the extent of the cold anomalies, centred over Scandinavia, into Central and Southern Europe (cf., e.g., Domeisen et al., 2020b). Although this is important from a dynamical perspective, it does not diminish the usefulness of the SPV state as an indicator for modified sub-seasonal forecast skill.
The strong regional contrasts in the modification of country-scale model skill after extreme SPV states are a key finding of this study. Two main implications result from this: on the one hand, end-users of sub-seasonal forecasts such as energy meteorologists or national weather services have to be aware of this regional skill modification when making decisions based on country-scale surface weather predictions after extreme SPV states. In particular, the reduced skill for T2M for many Central, Southern European, and Balkan countries following weak SPV states might be of critical importance for the energy industry, which requires correct forecasts of European cold air outbreaks potentially following weak SPV states. On the other hand, our results for T2M reveal two problems in the model: predicting the average temperature conditions in Scandinavia following strong SPV states and in Central to Southern Europe following weak SPV states. The sources of these problems can be manifold. For instance, the model can have errors in the representation of the stratosphere-troposphere coupling, of the large-scale weather regime response following extreme SPV states, or of the regional surface weather imprint of these large-scale weather regimes. We believe that a detailed analysis of the large-scale weather regime response following extreme SPV states would be an important starting point to better understand these regional model biases. The first problem (reduced skill for T2M in some Scandinavian countries following strong SPV states) could be associated with the misrepresentation of variations in the positive-NAO-like weather regime response, which determine if and how far the anomalously warm air extends into Scandinavia. Using a higher number of Atlantic-European weather regimes than just the bimodal NAO, Beerli and Grams (2019) showed that the positive phase of the NAO indeed manifests itself mainly in the zonal regime with a warm anomaly centred over Scandinavia, or the Scandinavian trough regime with a warm anomaly shifted more to the east over Western Russia. Regarding the second problem (reduced skill for T2M in many Central to Southern European countries following weak SPV states), it is worthwhile relating our results to the widely studied sudden stratospheric warmings (SSWs) as particularly extreme cases of weak SPV states. Comparing the forecasts after the 2% weakest SPV states to the SSW compendium by Butler et al. (2017) reveals that all except one forecast in this category were initialized ten or more days after (and, in one case during) an SSW. This indicates that the erroneous extent of the cold anomaly into Central to Southern Europe, which results in reduced skill for many countries, is at least an indirect consequence of SSWs. Therefore, we require a better understanding of the dynamics of weak SPV states, in particular SSWs, their subsequent surface weather response, and how these two aspects are represented in the model. Beerli and Grams (2019) and Domeisen et al. (2020b) showed that weak SPV states or SSWs, respectively, tend to be followed by either the mostly mild and windy Atlantic trough regime or the cold and calm Greenland blocking regime which corresponds to the negative phase of the NAO. Whether the model has a systematic bias in (one of) these regime responses following both strong and weak SPV states is a subject of our future work.

FIGURE 3 (a, c, e, g) Composites of month-ahead T2M anomalies of the forecasts with different SPV states at initialization time: (a) 2% strongest, (c) 10% strongest, (e) 10% weakest, and (g) 2% weakest SPV states. (b, d, f, h) show the corresponding ERA-Interim anomalies. In addition, the robustness is indicated by the point hatching (first, weaker level) and the line hatching (second, stronger level) and the significance is indicated by the black transparent shading (Section 2.5 gives details). The values in the upper-right corners indicate the number of forecasts in the corresponding forecast groups
FIGURE 4 RPSS distributions for T2M in selected European countries of all forecasts (filled grey), of the forecasts initialized in the 2% strongest (filled dark red), the 10% strongest (filled light red), the 10% weakest (filled light blue), the 2% weakest SPV states (filled dark blue), and the 2% weakest SPV states if the winter 2008/2009 is excluded from the dataset (hatched dark blue; text gives details). As described in detail in Section 2.5, the distributions of the filled boxes (including the hatched box) are based on the bootstrapped resampled distributions of the corresponding forecast group. The blank narrow boxes show the bootstrapped climatological distributions corresponding to each forecast group. One, two or three stars above a box indicate that the RPSS (filled boxes) is significantly different from its corresponding climatological RPSS (blank boxes) at the 25, 10, or 5% level. The boxes span the interquartile range, the whiskers indicate the 10th and 90th percentiles, and the horizontal dashed lines show the median. The horizontal solid lines indicate the RPSS calculated over all forecasts of the corresponding bin, independent of the bootstrapped distribution. Figure S1 shows the same analysis but for all European countries

Figures 5a-d demonstrate remarkably well that, on the large scale, the model predicts the positive-NAO-like

FIGURE 6 As Figure 3, but for P
FIGURE 7 As Figure 4, but for (a) U10M and (b) P. Figures S2 and S3 show the same analysis but for all European countries