Beyond skill scores: exploring sub-seasonal forecast value through a case study of French month-ahead energy prediction

We quantify the value of sub-seasonal forecasts for a real-world prediction problem: the forecasting of French month-ahead energy demand. Using surface temperature as a predictor, we construct a trading strategy and assess the financial value of using meteorological forecasts, based on actual energy demand and price data. We show that forecasts with lead times greater than 2 weeks can have value for this application, both on their own and in conjunction with shorter range forecasts, especially during boreal winter. We consider a cost/loss framework based on this example, and show that while it captures the performance of the short range forecasts well, it misses the marginal value present in the longer range forecasts. We also contrast our assessment of forecast value to that given by traditional skill scores, which we show could be misleading if used in isolation. We emphasise the importance of basing assessment of forecast skill on variables actually used by end-users.


| INTRODUCTION
Over the last 15 years, operational forecasting centres have increasingly extended forecasts into the 3-6 week timescale, often called the monthly, extended or sub-seasonal regime. Driven by the notion of seamless prediction [Hoskins (2013)], the aim is to fill the gap between conventional 2-week weather forecasts and longer term seasonal projections. This can be done by harnessing the variability of slow drivers such as sea ice [Chevallier et al. (2019)], the land surface [Dirmeyer et al. (2019)], and atmospheric-oceanic processes such as the Madden-Julian Oscillation [Vitart (2017)].
Despite sub-seasonal forecasting becoming well established, the applications and interpretation associated with these extended forecasts can still be unclear. The importance of clearly understanding the variables and timescales that sub-seasonal forecasts can predict well, and where they have more limited applicability, has of course been discussed, as in [White et al. (2017)].
However, much of the literature focused on assessing sub-seasonal forecast skill uses either mid-tropospheric, large scale fields [Buizza et al. (2005)], or spatially localised station data [Monhart et al. (2018)]. Both are somewhat removed from end-user applications, which in many cases require national averages at high temporal resolution.
It is also not always clear how directly forecast skill as measured by the usual metrics maps to actual realisable value to an end user.
In this paper we take an "application first" approach, by considering a case-study of French month-ahead energy forecasting over a 9-year period. France is a region of particular interest, both due to the substantial impacts of recent European heatwaves, and as a country where electric heating is prevalent, which closely links surface temperature to electricity demand. Section 2 discusses the details of the forecast systems and data pre-processing used in this study.
In section 3 we explain the use case of energy stakeholders trading power on the French energy market. We directly compute the value of current sub-seasonal forecasts in this case, using real-world power price and demand data.
In section 4 we extract the users' cost-loss ratio from the energy example, and so can compute the commonly used metric of potential economic value (PEV). This allows us to move backwards to a more abstracted skill score, and to then directly compare the PEV to our real-world yardstick, in order to assess the validity of the underlying assumptions.
Moving to purely meteorological scores in section 5, we examine several common skill metrics for daily temperature forecasts. We can then see how our perspective on forecast value is altered depending on the verification method used.
We finish in section 6 by commenting on the implications of these results, both for using sub-seasonal forecasts for energy, and for the way in which academic meteorology assesses forecast skill.

| DATA
We analyse hindcast data from three different operational forecasting systems, covering the common period 1999-2018.
In section 3, we only use forecasts for which energy data is available, restricting us to 2010-2018. The IFS extended range forecasting system (hereafter EC45) is initialised twice weekly as a seamless continuation of the ECMWF 15-day forecast, using the same model cycle and coupled to the NEMO ocean model, running for 46 days. We use the model cycle 45R1, one cycle behind the current operational cycle 46R1 at time of writing, in order to have a full hindcast dataset available. JJA 2018 data for the EC45 system was not available, representing the only departure from the common data period.
The SEAS5 seasonal forecasting system, also run by ECMWF, is initialised once per month using an older model cycle (CY43R1), but runs out to 7 months [Stockdale et al. (2018); Johnson et al. (2019)]. We make use of the first 45 days of the SEAS5 data as a baseline against which to evaluate the utility of the EC45 system's increased initialisation frequency and model improvements.
To capture differences in sub-seasonal forecasts between modelling centres we also include the EMC GEFS subseasonal system (hereafter SubX), which is being run weekly in real time out to 35 days on an experimental basis as part of the SubX project [SubX]. This project collates a number of different sub-seasonal hindcasts from different centres; however, we chose not to use other model contributions due to limited hindcast periods and/or small ensemble sizes.
Analysis is performed on both annual and boreal winter (Dec through Feb, or DJF) datasets. The number of initialisation dates for each season, and ensemble size are summarised in table 1. To validate the forecasts we make use of the ERA5 reanalysis [CS3].
In all cases, we analyse French 2-metre temperature data, at daily resolution on a 1 degree grid. A gridpointwise, monthly and lead time dependent bias correction was applied, by mapping quantiles of each forecast's temperature distribution to those of ERA5 during the same time period. A 'drop one out' approach was used: for each hindcast year, data from only the other 19 years were used to bias correct. This prevents over-fitting, as long as we accept the mild assumption that the year-to-year correlation of surface temperatures is negligible. A smooth sinusoidal fit to the seasonal mean was computed for ERA5 and then removed from all datasets, to leave anomaly fields. The area averaging was performed by applying a land-sea mask and taking a cosine-latitude weighted mean over the region [5W-8E,42N-51N], to produce the final scalar anomaly field.
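The pre-processing steps above can be illustrated with a minimal sketch. It assumes NumPy and works on a single 1-D sample, whereas the text applies the correction separately per gridpoint, month and lead time. The function names, the 99 quantile levels and the synthetic data are illustrative assumptions, not the configuration used in the study.

```python
import numpy as np

def quantile_map(fcst, fcst_train, obs_train, n_q=99):
    """Map forecast values onto the reference distribution via empirical quantiles."""
    q = np.linspace(0.01, 0.99, n_q)          # assumed quantile levels
    fq = np.quantile(fcst_train, q)           # forecast climatology quantiles
    oq = np.quantile(obs_train, q)            # reference (e.g. reanalysis) quantiles
    # Values outside the training range are clipped to the end quantiles.
    return np.interp(fcst, fq, oq)

def leave_one_year_out(fcst, obs, years):
    """Bias correct each year using only data from the other years."""
    corrected = np.empty(fcst.shape, dtype=float)
    for year in np.unique(years):
        held_out = years == year
        corrected[held_out] = quantile_map(
            fcst[held_out], fcst[~held_out], obs[~held_out]
        )
    return corrected

def area_mean(field, lats, land_mask):
    """Cosine-latitude weighted mean over land points; field has shape (lat, lon)."""
    weights = np.cos(np.deg2rad(lats))[:, None] * land_mask
    return float((field * weights).sum() / weights.sum())

# Synthetic demonstration only: a forecast that is 2 K too warm on average.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(1999, 2019), 50)  # 20 years, 50 samples each
obs = rng.normal(0.0, 3.0, years.size)
fcst = rng.normal(2.0, 3.0, years.size)
corrected = leave_one_year_out(fcst, obs, years)
print(round(fcst.mean() - obs.mean(), 2), round(corrected.mean() - obs.mean(), 2))
```

The printed pair shows the mean bias before and after correction. Holding out the target year means a forecast never contributes to its own correction.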

| Methodology
In France, as in many European countries, energy to be delivered on a future date can be bought and sold through a liberalised national market. Power providers and traders are of course interested in buying these energy futures for an optimal price, a quantity which has many complex drivers, including meteorological conditions. We focus on baseline energy contracts, where energy is to be delivered throughout the day of interest, avoiding the issue of high-frequency sub-daily variability in the energy demand.
Energy contracts tend to be traded on discrete timescales (quarterly, monthly, daily etc.). Taking the example of a month-ahead contract, the price of energy on a given day will be constant for the whole month of interest. For example in late January, a MWh of electricity purchased for February 1st is priced the same as a MWh for February 28th. Electricity within the current calendar month however is priced on a daily basis, leading to an increased sensitivity to meteorological conditions. While both these prices will evolve over time and a number of different timescales exist, we consider only two prices of interest as a first approximation: the final month-ahead price, and the day-ahead price, as depicted schematically in figure 1.
We would like to develop a plausible real-world assessment of the value of sub-seasonal surface temperature forecasts, while keeping the meteorological element front and centre. To do this we make use of real-world French energy price data covering the period 2010-2018, both month-ahead and day-ahead spot prices (calculated at around 1100 GMT the day before the target date) [EEX]. We also use daily averaged French demand data covering the same period [RTE].
Modelling the evolution of energy prices is an art in and of itself, and we are not focused on producing a high-fidelity price model in this work. Instead we build a 'good enough' trading strategy that will allow us to provide at least a lower bound on the usefulness of forecast data in this sector.
We use monthly anomalies of ERA5 daily surface temperature to derive a weakly quadratic relationship between temperature and demand anomalies for each season. We find that despite a whole host of missing factors, ERA5 T2m makes a reasonable single-variable predictor of demand, especially in winter when the demand for heating is highest, and so consequently the relationship between price and demand is strongest. By including wind chill, humidity, and cloud cover amongst other variables, it is likely an even stronger relation could be derived.
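As an illustration of the kind of single-variable demand curve described here, the following sketch fits a quadratic to temperature and demand anomalies by least squares. It assumes NumPy. The coefficients and noise level of the synthetic data are invented for the example and are not the values found in the study.

```python
import numpy as np

def fit_demand_curve(temp_anom, demand_anom):
    """Least-squares quadratic fit: demand ~ a*T**2 + b*T + c. Returns (a, b, c)."""
    return np.polyfit(temp_anom, demand_anom, deg=2)

def predict_demand(coeffs, temp_anom):
    """Evaluate the fitted demand curve at the given temperature anomalies."""
    return np.polyval(coeffs, temp_anom)

# Synthetic data only: colder days raise demand, with a weak quadratic term.
rng = np.random.default_rng(1)
temp_anom = rng.normal(0.0, 3.0, 500)                       # K
demand_anom = (-40.0 * temp_anom + 1.5 * temp_anom**2
               + rng.normal(0.0, 20.0, 500))                # GWh, invented scale
coeffs = fit_demand_curve(temp_anom, demand_anom)
predicted = predict_demand(coeffs, np.array([-5.0, 0.0, 5.0]))
print(coeffs, predicted)
```

In practice a separate fit would be made for each season, as in the text, and further predictors could be added as extra regressors.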
We start from the basis that the anomalous demand (calculated with respect to month and weekday) is strongly correlated with the day-ahead price but not with the month-ahead price (not shown). This means that if we can predict the energy demand from forecasts of surface temperature, then we can also predict whether it is cheaper to buy a unit of energy at the month-ahead price, or to wait until the day before and buy at the spot price.
Given our estimate of demand, a basic trading strategy is:
• Predict the anomalous demand for a future date.
• If the demand anomaly is greater than a threshold d, buy energy at the month-ahead price.
• Else, buy energy the day before at the spot price.
There are two simple situations we might imagine a user to be in: one where they wish to buy a set amount of energy (as a trader might, for example), and one in which they wish to purchase a certain fraction of the total energy demand (as might be more relevant for an energy provider trying to prevent a production shortfall). In the first case we can evaluate the average cost per unit of power, $C$, using this strategy over $T$ days with a predictor of anomalous demand, $P_D$, as:

\[ C = \frac{1}{T} \sum_{t=1}^{T} \left[ H\left(P_D(t) - d\right) p_{month}(t) + \left(1 - H\left(P_D(t) - d\right)\right) p_{day}(t) \right] \tag{1} \]

where $p_{month}$ is the month-ahead price, $p_{day}$ the day-ahead price, and $H$ the Heaviside step function. In the second case, we must account for the fact that we are buying more power for likely high-demand days than for low-demand days, so we weight the cost of power by the ratio of the daily demand to the mean demand: $\tilde{D}_t := D_t / \bar{D}$. This gives us the average cost of buying a set fraction of power:

\[ \tilde{C} = \frac{1}{T} \sum_{t=1}^{T} \tilde{D}_t \left[ H\left(P_D(t) - d\right) p_{month}(t) + \left(1 - H\left(P_D(t) - d\right)\right) p_{day}(t) \right] \tag{2} \]

As references we consider 4 predictors that make no use of forecast data. We take 2 'perfect' predictors, one in which we know the future demand anomaly precisely, and one in which we know only the future surface temperature perfectly, and must estimate the demand using our simple demand curve. This allows us to separate out any deficiency in meteorological skill from the drawbacks of our simplistic demand model. As climatological references, we take the strategies of always buying at the day-ahead spot price, and a purely random strategy where the probability of buying day-ahead and month-ahead are both 0.5. Buying day-ahead is on average cheaper than buying month-ahead; a premium is paid for the lower volatility of the long-term contracts.
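The two average cost measures described above (set amount and set fraction) can be sketched in a few lines of plain Python. The function and argument names are hypothetical, the prices in the example are invented, and treating a demand anomaly exactly equal to the threshold as "do not buy ahead" is an assumed convention for the step function.

```python
def heaviside(x):
    """Step function; returns 0 at x == 0 (an assumed convention)."""
    return 1.0 if x > 0 else 0.0

def average_cost(pred_demand, p_month, p_day, d=0.0, demand_ratio=None):
    """Average cost per unit of power over T days under the threshold strategy.

    pred_demand  : predicted demand anomaly for each day
    p_month      : month-ahead price applying to each day
    p_day        : day-ahead spot price for each day
    d            : action threshold on the predicted demand anomaly
    demand_ratio : optional daily demand divided by mean demand; if given,
                   the cost is weighted as in the set fraction scenario
    """
    n_days = len(pred_demand)
    if demand_ratio is None:
        demand_ratio = [1.0] * n_days          # set amount scenario
    total = 0.0
    for pred, pm, pd, w in zip(pred_demand, p_month, p_day, demand_ratio):
        buy_ahead = heaviside(pred - d)
        total += w * (buy_ahead * pm + (1.0 - buy_ahead) * pd)
    return total / n_days

# Invented three-day example (prices in EUR/MWh, anomalies in GWh).
pred = [10.0, -5.0, 60.0]
month_price = [50.0, 50.0, 50.0]
day_price = [55.0, 40.0, 70.0]
cost_amount = average_cost(pred, month_price, day_price, d=0.0)
cost_cautious = average_cost(pred, month_price, day_price, d=54.4)
cost_fraction = average_cost(pred, month_price, day_price, d=0.0,
                             demand_ratio=[1.1, 0.9, 1.0])
print(cost_amount, cost_cautious, cost_fraction)
```

The reference strategies fit the same interface: always buying day-ahead corresponds to a predictor that never exceeds the threshold, and the perfect demand predictor passes the observed anomalies as `pred_demand`.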
In order to construct forecast predictors we take from each system the closest forecast to the target date for which the month ahead price is still available (again, as in the schematic figure 1). We use the forecasts' ensemble mean temperature to predict demand. Taking the ensemble mean after predicting demand made no qualitative difference.
In order to better separate out the relative value of the short and long range forecasts, we consider the case where only the first 15 days of the forecast are available, and the case where only long range forecasts (lead times greater than 15 days) are used. For the seasonal SEAS5 system, as forecasts are initialised at the beginning of every month and so only lead times >28 days are ever available for decision making, we use only the overall forecast dataset.
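The choice of which forecast informs each trading decision can be sketched as follows. The sketch assumes the decision is made on the last day of the preceding month (when the final month-ahead price is available), and that a forecast initialised on that day is still usable. Both are assumptions, and the function names and dates are hypothetical.

```python
from datetime import date, timedelta

def decision_date(target):
    """Last day of the month preceding the target date."""
    return target.replace(day=1) - timedelta(days=1)

def select_forecast(target, init_dates, max_lead_days):
    """Return (init_date, lead_days) of the latest usable forecast, or None.

    A forecast is usable if it was initialised on or before the decision
    date and its range reaches the target date.
    """
    cutoff = decision_date(target)
    usable = [
        init for init in init_dates
        if init <= cutoff and (target - init).days <= max_lead_days
    ]
    if not usable:
        return None
    init = max(usable)
    return init, (target - init).days

# Invented initialisation calendars for illustration.
weekly = [date(2015, 1, 1) + timedelta(days=7 * k) for k in range(10)]
monthly = [date(2015, 1, 1), date(2015, 2, 1), date(2015, 3, 1)]

target = date(2015, 2, 10)
weekly_choice = select_forecast(target, weekly, max_lead_days=35)
monthly_choice = select_forecast(target, monthly, max_lead_days=45)
print(weekly_choice, monthly_choice)
```

Restricting decisions to short or long lead times, as in the text, amounts to discarding any selected forecast whose `lead_days` falls outside the chosen range.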
The obvious action threshold to choose is d = 0; we should buy on any positive demand anomaly. However due to the higher average price of month-ahead contracts, and the imperfect skill of forecasts, we also choose to consider a more cautious strategy of only buying in advance if we forecast an upper tercile demand anomaly (≥ 54.4 GWh).
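For illustration, an upper tercile threshold of this kind can be estimated from a sample of demand anomalies as the two-thirds quantile. The sketch below assumes NumPy and uses synthetic anomalies with an invented spread; it does not reproduce the 54.4 GWh value quoted above.

```python
import numpy as np

def upper_tercile_threshold(demand_anom):
    """Demand anomaly exceeded on roughly one third of days."""
    return float(np.quantile(demand_anom, 2.0 / 3.0))

# Synthetic anomalies in GWh; the spread is an invented illustration.
rng = np.random.default_rng(2)
demand_anom = rng.normal(0.0, 120.0, 3000)

d_tercile = upper_tercile_threshold(demand_anom)
frac_tercile = float(np.mean(demand_anom >= d_tercile))   # share of days acted on
frac_positive = float(np.mean(demand_anom > 0.0))         # share under d = 0
print(d_tercile, frac_tercile, frac_positive)
```

The cautious strategy acts on fewer days than the d = 0 strategy, so it pays the month-ahead premium less often.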

| Results
The average energy costs are shown using the set energy amount scenario of equation 1 in figure 2, while the set energy fraction case of equation 2 is shown in figure 3.
Firstly we note that depending on season and exact buying scenario, the potential savings of a perfect temperature forecast over climatological action range from €1-3/MWh, representing a total saving of between 2.5 and 4.5%. With a more sophisticated demand model that could approach the perfect demand forecast results, this could be extended to over 5%. These are not insignificant savings, and would certainly be of interest to industry stakeholders, especially given that weather is just one of many drivers acting on energy prices.
When trading on forecasts of positive demand anomalies, long range forecasts actually perform more poorly than climatology, as can be seen from the SEAS5 results and the >15 day sub-seasonal results in figures 2 and 3 a) and c). For the set fraction case, while the extended range sub-seasonal forecasts do appear to show very marginal value during DJF, this cannot be distinguished from climatology given sampling error. Following from this, we also see that forecasts based only on the first 15 days of forecast data are more useful than those using all forecasts.
With these shorter range forecasts, 15% of the possible annual saving is realised in the constant amount scenario, reaching 40-80% during DJF, where EC45 outperforms SubX substantially. Similar results, with a higher annual saving of 25-30%, are seen for the constant fraction scenario.
When instead trading based on upper tercile demand events, there is a clear improvement, with no forecast-based strategies performing worse than climatology in a statistically significant sense. SEAS5 still shows no added value however in any scenario or season, due to its long lead times.
The >15 day extended range forecasts however now show better than climatological value, realising between 10% (in figure 2b) and d)) and 15% (in figure 3 b) and d)) of the total saving of a perfect temperature forecast.
This low-level skill in longer-range forecasts does not necessarily result in an added benefit when shorter range forecasts are available. This can be seen for the EC45 system, where the value of forecast decisions made only on short range forecasts (≤15 days) and those made using all forecasts is indistinguishable, achieving 40-45% of the possible value annually, and 70% during DJF for both trading scenarios.
However, while the less frequently initialised SubX system can perform just as well, it requires the longer range forecasts to do so, with decisions made using only the first two weeks of SubX forecasts realising less than half as much value.
As mentioned, the poor performance of SEAS5 in this example scenario, and the difference between the EC45 and SubX systems, is naturally explained by the different initialisation frequencies. This is made explicit in figure 4, which shows the cumulative fraction of trading decisions being made with forecasts from each lead time. The SEAS5 system relies entirely on 4-6 week forecasts, and so unsurprisingly performs poorly at the daily timescale forecasting task analysed here. The sub-seasonal systems are more similar, but with a lower average lead time for EC45.
In summary, the simplicity of the temperature-based demand model used here limits the value of the forecast information, but regardless is good enough to demonstrate that (some) forecasts have value. Results were optimal when trading on upper tercile anomalies, and tuning the action threshold more precisely could conceivably increase value yet further. Forecasts give better savings in general for DJF than for the full year, as a result of both higher surface temperature skill and a closer temperature-demand relationship. Value primarily originates from forecasts for the first two weeks, but forecasts ≥ 15 days have value in their own right, and can make up for a lower initialisation frequency as is the case for SubX.

| POTENTIAL ECONOMIC VALUE
We see that under realistic decision making conditions, longer range temperature forecasts can provide value in some cases for end users in energy. However we see this depends strongly on the initialisation date and frequency of a forecast system, which are often neglected in meteorological studies.
We would also now like to consider how closely the value of forecasts shown above can be emulated within the more traditional cost/loss framework of the potential economic value (PEV) [Murphy (1985); Sultan et al. (2010)]. This requires no specialist end-user data, and allows us to ignore the lead-time dependence that can emerge in applications.
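The PEV can be defined in terms of the confusion matrix, $M^{(conf)}$, which describes how likely we are to predict or miss an impactful event $e$, and the cost matrix, $M^{(cost)}$, which tells us the associated cost of acting on our forecast:

\[ M^{(conf)} = \begin{pmatrix} P(f, e) & P(f, \neg e) \\ P(\neg f, e) & P(\neg f, \neg e) \end{pmatrix}, \qquad M^{(cost)} = \begin{pmatrix} C & C \\ L & 0 \end{pmatrix} \]

Here $f$ denotes that the event is forecast, $C$ is the cost of taking action, and $L$ is the loss incurred when the event occurs and no action was taken. If the elements of these matrices can be defined and calculated, then the PEV is given by:

\[ PEV = \frac{C_{clim} - C_{fc}}{C_{clim} - C \cdot P(e)} \]

where

\[ C_{fc} = \sum_{i,j} M^{(conf)}_{ij} M^{(cost)}_{ij}, \qquad C_{clim} = \min\left(C, \; L \cdot P(e)\right). \]

This is simply the cost saving that acting on the forecast provides, expressed as a fraction of the difference between the climatological cost, $C_{clim}$, and the cost unavoidable with even perfect knowledge, $C \cdot P(e)$.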
Attempts to measure the PEV of a forecast are often hampered by difficulty in assessing the cost/loss ratio of the typical user. However here we are in a position to directly extract the ratio from our applied scenario above. We do this by taking the cost of action as the average amount by which month-ahead price exceeds day ahead price, and by taking the loss as the average amount by which day-ahead cost during a period of high demand exceeds day-ahead cost during low demand.
Depending on season or trading scenario chosen, we find the C/L ratio ranges from 0.6-0.7. We find our results are not qualitatively sensitive to the small differences within this range, and so use an intermediate value of 0.65 for our analysis here.
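As a worked illustration of the cost/loss calculation, the sketch below computes the PEV from joint relative frequencies of forecast and event, using the standard cost/loss form: the saving over the best climatological action as a fraction of the saving available to a perfect forecast. The function name and the contingency values are invented for the example. Only the cost-loss ratio of 0.65 comes from the text.

```python
def potential_economic_value(hit, false_alarm, miss, cost_loss_ratio):
    """PEV from joint relative frequencies of forecast and event.

    hit, false_alarm and miss, together with the correct rejections, are
    assumed to sum to 1. Expenses are expressed in units of the loss L.
    """
    a = cost_loss_ratio                 # C / L
    p_event = hit + miss                # climatological event frequency P(e)
    expense_forecast = a * (hit + false_alarm) + miss
    expense_clim = min(a, p_event)      # cheaper of always acting / never acting
    expense_perfect = a * p_event       # act only when the event occurs
    # Undefined when expense_clim equals expense_perfect (e.g. P(e) = 0).
    return (expense_clim - expense_forecast) / (expense_clim - expense_perfect)

# Invented contingency values for an upper tercile event, P(e) = 1/3.
p_e = 1.0 / 3.0
pev_example = potential_economic_value(0.25, 0.05, p_e - 0.25, 0.65)
pev_perfect = potential_economic_value(p_e, 0.0, 0.0, 0.65)
pev_never_act = potential_economic_value(0.0, 0.0, p_e, 0.65)
print(pev_example, pev_perfect, pev_never_act)
```

With a cost-loss ratio above the event frequency, never acting is the climatological baseline, so a forecast only gains value if its hits outweigh the premium paid on false alarms.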
The main simplification involved here in moving to the PEV framework is that the cost and loss no longer depend on the magnitude of the demand continuously, only on whether the action threshold is exceeded or not.
We find in figure 5 that there are only small differences between PEV for DJF and annually, and between action thresholds, with 1-day forecasts tending to realise 70-80% of the potential value. The PEV falls in all cases below 40% between days 7-10, and reaches zero value by day 15.
We also see immediately that the large differences between models vanish, with the exception of a slightly lower annual PEV for SubX.
Based on these results in isolation, we would expect to see no real-world value in forecasts greater than 15 days, and an average PEV of approximately 0.4 over the first two weeks of a forecast, for both action thresholds and across seasons.
This matches broadly speaking with the annual values calculated in the previous section, especially for the ultimately optimal approach of trading on upper tercile events, which also showed 40% of the potential saving being realised.
However, the savings of 10-15% seen for extended range forecasts are in direct contrast to the zero-value PEV, and the results for DJF, with savings of 70% of the potential when using all forecast data, are significantly higher than the PEV would predict.
Therefore we see the PEV in this case misses a valuable component of the end-user case: more extreme demand anomalies have a larger price impact, which is especially true during DJF when the correlation between demand and price is particularly strong. This causes an overly pessimistic estimate of forecast skill. Despite this, especially on an annual basis, we see a broad qualitative agreement between the two approaches which lends confidence to the use of PEV in assessing forecast skill.

| NON-ECONOMIC SKILL SCORES
Having evaluated forecast value in economic terms, it is now useful to consider more traditional, purely meteorological, skill scores. In figure 6 we show the correlation, root-mean-square error and continuous ranked probability score (or CRPS) for daily French surface temperatures, some of the most frequently employed skill metrics in the literature. As for the PEV we see relatively minor differences in skill for the different forecast systems, and most differences are confined to week 1 of the forecast.
Correlation skill for DJF remains above zero out to days 25-30, and out to day 20-30 on an annual basis (fig 6i)), painting the most optimistic picture of any score considered. By day 15 the annual correlation is 0.2, while the slightly noisier DJF results range from 0.2-0.3. From this we might well conclude that the daily forecast retains some marginal value into the extended range as we have realised in section 3, although of course there is no way to quantify the term 'marginal' with a non-economic score.
In contradiction, annual RMS Error saturates by day 15, pushing out to days 16-18 during DJF (fig 6ii)), which would then imply essentially no useful information in the extended range. The CRPS (fig 6iii)) provides essentially the same message, saturating between days 15 and 20, and implying no added value could be gained even from a full probabilistic perspective.
This can be best understood in light of the fact that our trading strategy in section 3 is based on threshold exceedance, reducing sensitivity to the differences between model and verification that would degrade RMS error and CRPS. Within the energy framework, as long as sign and approximate magnitude of anomalies are well predicted, value can be extracted.
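The three scores discussed in this section can be computed from an ensemble as in the following sketch, which assumes NumPy. The CRPS uses the standard empirical ensemble estimator (mean absolute error of the members minus half the mean absolute difference between member pairs). The synthetic ensembles are illustrative, and the normalisation applied in figure 6 is omitted.

```python
import numpy as np

def ensemble_scores(ens, obs):
    """Correlation and RMSE of the ensemble mean, plus the mean CRPS.

    ens : array of shape (n_forecasts, n_members)
    obs : array of shape (n_forecasts,)
    """
    ens_mean = ens.mean(axis=1)
    corr = np.corrcoef(ens_mean, obs)[0, 1]
    rmse = np.sqrt(np.mean((ens_mean - obs) ** 2))
    # Empirical CRPS: E|X - y| - 0.5 * E|X - X'| over ensemble members.
    abs_err = np.abs(ens - obs[:, None]).mean(axis=1)
    spread = np.abs(ens[:, :, None] - ens[:, None, :]).mean(axis=(1, 2))
    crps = np.mean(abs_err - 0.5 * spread)
    return float(corr), float(rmse), float(crps)

# Synthetic data: a skilful ensemble and a climatological one.
rng = np.random.default_rng(3)
signal = rng.normal(size=400)
obs = signal + 0.5 * rng.normal(size=400)
skilful = signal[:, None] + 0.5 * rng.normal(size=(400, 11))
climatological = rng.normal(0.0, obs.std(), size=(400, 11))

scores_skilful = ensemble_scores(skilful, obs)
scores_clim = ensemble_scores(climatological, obs)
print(scores_skilful, scores_clim)
```

Applying the function separately to the forecasts at each lead time would give curves of the kind shown in figure 6.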
In section 3 we have quantified the real world value of sub-seasonal forecasts for energy trading on a national market.
To our knowledge this is the first time a multi-model analysis of operational sub-seasonal and seasonal forecasting systems has been performed in conjunction with real world end-user data, allowing a novel assessment of forecast skill.
By constructing a simple trading strategy, we show that forecasts of daily French surface temperature can be used to forecast energy demand with the potential to save €1-3/MWh, and that for weekly initialised systems such as SubX, the added value provided by forecasts beyond 15 days is considerable.
We also emphasise the often calendar-dependent nature of real world forecast applications, and the important role of frequent forecast initialisations, without which forecast systems that perform well "on paper" can have little value when ultimately applied.
We have managed to extract a cost-loss ratio from our data, and compare the true forecast value to the estimate of a simplified PEV model. We find broad qualitative agreement between the two, but note that the binary cost-loss assumption underlying the PEV can cause an underestimation of forecast value in cases where cost correlates strongly with the magnitude of the weather event (i.e. the binary assumption of event/no event is simplistic). In our case this caused forecasts beyond 15 days to have zero PEV when in reality a small economic value of 10-15% of the perfect forecast scenario was realised. Additionally without the cost-loss ratio we estimated from real-world data, we would have had to average over different ratios, or assume an optimal value, reducing realism further.
A comparison to conventional, non-economic scores highlights that even when RMS error and CRPS are at climatological levels, end users might still be able to extract value from forecasts, even with very low positive correlations.
We acknowledge that we have considered here only a single end-user sector in a single country, and that even within this domain, other frameworks for making forecast-based decisions exist. Given that we have only used the ensemble mean forecasts to assess value, the practical value may extend further than we suggest, under a fully probabilistic trading strategy. We present these results not to draw far-reaching conclusions about current operational forecast value, but to provide a new model for assessing sub-seasonal forecasts in a way that directly connects with end-user requirements.
Extending the methodology of this study to additional national domains, and in general trying to use specific end-user examples to choose variables for forecast skill verification, could help shape model development and increase enthusiasm amongst forecast users.

ACKNOWLEDGMENTS
We acknowledge the National Environmental Research Council for funding of an industrial internship on which the origins of this work are based, ECMWF for making IFS hindcast data available, and the agencies that support the SubX system: NOAA/MAPP, ONR, NASA, and NOAA/NWS.

FIGURE 1 Large shifts in price begin to appear starting from the beginning of the calendar month energy is being purchased for, as prices switch from monthly to daily resolution. We consider using weather forecasts to decide between buying power on the final date of the previous month (green) or on the day before the target date (red), which provides a simple, but representative, characterisation of the price dynamics.

REFERENCES
FIGURE 2 The average cost of a MWh of French baseline energy over the period 2010-2018 under different purchasing strategies. Blue bars show the costs associated with different reference strategies, while orange bars show strategies based on all forecast data, green bars are based on only the first 14 days of forecast data, and red bars use only days 15 onward (see main body for full explanation of each strategy). Subplots a) and b) are the average cost over the full year, while c) and d) are for DJF only. a) and c) are based on buying at month-ahead price when the demand anomaly relative to the month is predicted to be positive. b) and d) are based on buying at month-ahead rate when demand anomaly is in the upper tercile for that month. Error bars show ± 1 standard deviation estimated by bootstrapping over individual days.
FIGURE 3 As for figure 2 but showing the average cost of a MWh of French baseline energy weighted by the amount of demand on that day. This represents the scenario where a user is interested in buying a set fraction of the daily demand. Prices are explicitly valid for a user whose average energy obligation is 1 MWh.

FIGURE 4 The cumulative distribution of forecast lead times used to make trading decisions in figures 2 and 3 for each forecasting system. The increased initialisation frequency of the EC45 system compared to SubX means that the average lead time of forecasts used for decision making is lower.
FIGURE 5 The potential economic value of forecasts of a) demand exceeding the median, and b) demand exceeding the upper tercile, annually in i) and for DJF only in ii). A cost-loss ratio of 0.65 is used in all cases. Shading shows the first standard deviation estimated by bootstrap resampling of forecast days.
FIGURE 6 Measures of ensemble forecast skill for daily French 2-metre temperature anomalies during DJF (a) and annually averaged (b). Correlations (i) and RMS error (ii) are with respect to the ensemble mean, whereas the CRPS (iii) is a fully probabilistic score. RMSE and CRPS have been normalised by the seasonal climatological variance estimated from ERA5. Shaded regions represent the first standard deviation of bootstrap resampling over forecast initialisation dates.