Extending the forecast range of the UK storm surge ensemble

Abstract

Flooding due to coastal storm surges presents a significant threat to life and property. The UK has long had a storm surge forecasting system based on a single ‘deterministic’ simulation. This was augmented with an operational storm surge ensemble in December 2009. By producing several simulations sampling the forecast uncertainty, the ensemble estimates the probability of reaching critical water levels and thus supports a more risk-based approach to civil protection.

The original storm surge ensemble provided forecasts out to T + 54 h, limited by the forecast range of the driving MOGREPS-R atmospheric ensemble. Longer-range forecasts could provide advance notice of the potential for a significant event, allowing suitable preparatory actions to be taken. This study investigates the possibility of extending the storm surge ensemble to between 5 and 7 days using atmospheric data from the lower-resolution Met Office 15-day ensemble (MOGREPS-15).

Both case studies and statistical verification indicate the potential for useful forecasts out to the full 7.25 days tested. The best performance is obtained by extending the existing surge ensemble products with output from separate runs of the storm surge model, which have been driven by MOGREPS-15 meteorology from T + 0 h. An attempt to create a single surge history for each member by switching from MOGREPS-R to MOGREPS-15 input at T + 54 h led to spurious oscillations in some cases, and poorer performance on several statistical measures. These issues might be improved by smoothing the discontinuity in atmospheric forcing.

Following this successful trial, the separate-runs extension to the surge ensemble was implemented operationally in summer 2011. The study also demonstrates the benefit of online bias correction and ‘dressing’ the forecast members to account for errors which the system does not otherwise sample. The operational implementation of these features is left for future work.

1. Introduction

Flooding due to coastal storm surges presents a significant threat to life and property. The North Sea storm surge of 31 January 1953 was the worst natural disaster to affect the UK in recent times, causing the loss of 307 lives in East Anglia (Baxter, 2005) and a further 1836 fatalities in the Netherlands (Gerritsen, 2005). Storm surge forecasting aims to mitigate this threat by providing advance warning of dangerous events, so that protective action can be taken.

The Environment Agency is responsible for issuing coastal flood warnings in England and Wales. These are based on wind, wave and storm surge forecasts produced at the Met Office, which in turn feed into local flood impact models. The storm surge forecast model is developed and maintained by the National Oceanography Centre, formerly Proudman Oceanographic Laboratory. In recent years, this capability has been extended to produce an ensemble of storm surge predictions designed to sample the forecast uncertainty and thus enable better management of the flood risk (Flowerdew et al., 2010; Bocquet et al., 2009). This builds on ensemble techniques and systems developed in the meteorological community (e.g. Buizza et al., 2005; Bowler et al., 2008; and references therein). The Dutch Meteorological Institute (KNMI) has developed a similar storm surge ensemble (de Vries, 2008), providing forecasts into the medium range based on the atmospheric ensemble of the European Centre for Medium-Range Weather Forecasts (ECMWF). More recently, Di Liberto et al. (2011) described two small ensembles for New York City which utilize multiple atmospheric and storm surge models.

The original Flowerdew et al. storm surge ensemble utilized a regional atmospheric ensemble to produce forecasts out to T + 54 h. The verification results highlighted the potential for significant skill beyond this period, suggesting a horizon of about 5 days. This time-scale coincides with the range of the ‘tidal outlook’ provided by the Met Office to the Environment Agency. This paper describes the construction and scientific evaluation of an extended-range storm surge ensemble, which aims to provide a more accurate and rigorous basis for managing storm surge risk on this timescale.

Key questions to be considered in the verification include:

  • The actual lead time at which the forecasts become no better than non-meteorological predictions such as the climatological probability of reaching the alert level. This places an upper bound on the sensible length of a dynamic storm surge prediction. The actual limit of useful skill may be further restricted by the actions available to the user and their associated costs and benefits, but these issues are beyond the scope of this paper.

  • The impact of the reduced resolution at which extended-range meteorological forecasts have to be run in order to constrain computational cost.

  • The impact of any transition between different sources of meteorological input.

  • Whether the ensemble continues to outperform a dressed single forecast in the extended range, and indeed whether its advantage increases as the forecast becomes more uncertain.

  • The impact of changes and improvements to the atmospheric forecasting systems since the original Flowerdew et al. verification.

Following this introduction, section 2 describes the models and forcing arrangements used in this study, while section 3 explains the source and processing of the data used to verify the forecasts. Forecasts for selected cases are discussed in section 4. Section 5 compares the average spread and error of the different forecast systems, while section 6 provides a statistical evaluation of the forecast probabilities. Overall conclusions and future work are summarized in section 7.

2. System description

The following subsections describe the storm surge model, the atmospheric forecasts, and the various configurations in which they can be coupled.

2.1. Storm surge model

As in Flowerdew et al. (2010), all systems considered in this study use the CS3X storm surge model. This is based on the CS3 model of Flather (2000), with the domain extended further to the south and west. It provides a shallow-water depth-averaged hydrodynamic simulation of the entire northwest European continental shelf with a regular grid of 1/9° in latitude, 1/6° in longitude, and a 45 s time step. Meteorological forecasts of 10 m wind and sea-level pressure at 1 h intervals are linearly interpolated to the surge model time steps and grid.
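The time interpolation amounts to a linear resampling of the hourly forcing onto the 45 s model steps. A minimal sketch for a single grid point (placeholder pressure values; the spatial interpolation to the surge grid is omitted):

```python
import numpy as np

# Hourly sea-level pressure at one grid point (values are placeholders)
hours = np.arange(0, 55)                    # forcing times: T+0 .. T+54 h
p_hourly = 1000.0 + np.random.randn(55)     # hPa

# The surge model advances in 45 s steps; interpolate linearly in time
t_model = np.arange(0, 54 * 3600 + 1, 45) / 3600.0   # model times in hours
p_model = np.interp(t_model, hours, p_hourly)
```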

The model includes simulation of the astronomic tide forced by harmonic input at the open boundaries. This allows it to represent the modulating effect of water depth on the generation and propagation of surges, known as tide–surge interaction (Horsburgh and Wilson, 2007). However, the quality of the modelled tide at port locations is limited by the finite resolution of the model's coastline and bathymetry, and its limited ability to simulate higher harmonics generated by nonlinear processes in very shallow water. Harmonic analysis of historic data from the relevant tide gauge is generally more accurate (Flather and Williams, 2005). For this reason, a separate run with no meteorological forcing is used to establish the model's version of the astronomic tide. This is subtracted from the full-forcing runs to give predictions of the meteorologically induced surge, including the component due to tide–surge interaction. Predictions for the total water level at individual ports are then obtained by adding the appropriate harmonic tidal prediction.
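The decomposition into residual and total water level is then straightforward. A minimal sketch, with illustrative arrays standing in for the two model runs and the port's harmonic tide prediction:

```python
import numpy as np

def surge_residual(full_run, tide_only_run):
    # Meteorologically induced surge, including tide-surge interaction:
    # full-forcing run minus the tide-only run of the same model
    return full_run - tide_only_run

def total_water_level(residual, harmonic_tide):
    # Predicted still water level at a port: model residual plus the
    # more accurate harmonic tidal prediction for that port
    return residual + harmonic_tide

# Illustrative 15 min series for one port (all values are placeholders)
t = np.arange(0, 48, 0.25)                        # hours
tide_only = np.sin(2 * np.pi * t / 12.42)         # model's own tide
full = tide_only + 0.3 * np.exp(-((t - 30) / 6) ** 2)
harmonic = 1.02 * np.sin(2 * np.pi * t / 12.42)   # port harmonic prediction

twl = total_water_level(surge_residual(full, tide_only), harmonic)
```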

As in Flowerdew et al. (2010), the ensemble systems are initialized from the existing operational deterministic surge forecast, which uses the same CS3X model. This runs four times per day to T + 48 h, based on forcing from a higher-resolution (12 km) atmospheric forecast over a North Atlantic and European (NAE) domain. Each deterministic surge forecast is preceded by a 6 h ‘hindcast’ driven by analysed meteorology, to produce the best possible estimate of the initial surge state.

There are a number of processes and sources of uncertainty which are not considered by the CS3 model. Since it is a single-layer barotropic model, CS3 does not directly represent the impact of three-dimensional structures such as the thermocline on the response to atmospheric forcing, although results to date show 3D models unable to improve upon 2D models for storm surge prediction (Kevin Horsburgh, personal communication, 2011). The model assumes constant density, neglecting local and remote thermo- and halosteric effects, although mean seasonal variations will be captured through the harmonic tide prediction. The model neglects the variations in surface stress coefficient due to surface waves (Mastenbroek et al., 1993), and variations in bed stress due to bed type, sea grass, etc. The uncertainty associated with these effects could be sampled using appropriate perturbations to surge model parameters. While the basic tidal signal is improved by replacement of the model tide with the harmonic tide prediction, the model's simulation of tide–surge interaction remains based on the model's version of the tide. Without perturbations to sample the uncertainty in the model tide, the system can be overconfident in its prediction of tide–surge interaction, as discussed in Flowerdew et al. (2010). The final water level at ports can also be affected by wave and river set-up. While these two effects are not considered by the surge model itself and will appear as an additional error in the verification presented here, they are considered in the overall Environment Agency flood warning process based on input from separate ocean wave and land hydrology models.

2.2. Atmospheric ensembles

Storm surges are driven by the weather, which is expected to be the dominant source of surge forecast uncertainty. The systems considered here sample this uncertainty by running a storm surge forecast for each member of the Met Office Global and Regional Ensemble Prediction System (MOGREPS; Bowler et al., 2008). This consists of two connected ensembles. MOGREPS-R provides relatively high-resolution regional forecasts to T + 54 h from 0600 and 1800 UTC over the NAE domain. MOGREPS-G provides lower-resolution global forecasts to T + 72 h from 0000 and 1200 UTC. Each ensemble consists of one unperturbed control member and 23 members whose initial state and model evolution are perturbed using methods designed to sample the relevant uncertainties.

MOGREPS-G exists primarily to provide boundary conditions for the regional ensemble, but essentially the same forecasts are also integrated to T + 15 days as the MOGREPS-15 system. It is these MOGREPS-15 forecasts that provide the atmospheric forcing to extend the surge ensemble forecasts beyond T + 54 h. To probe the limits of useful lead time, it was decided to run the trials out to T + 174 h, so that the final 12 h window used for constructing probabilities is centred on T + 7 days.

The MOGREPS atmospheric system has undergone several upgrades since the original surge ensemble trials reported in Flowerdew et al. (2010). In particular, the forecast resolution has been increased from 24 to 18 km in the regional ensemble and from 90 to 60 km (N144 to N216) in the global ensembles. Further enhancements have improved the horizontal and vertical distribution of initial spread (Flowerdew and Bowler, 2011), and the rate of growth of spread (Tennant et al., 2011). The deterministic systems which provide the central state about which the ensemble forecasts are started have also been upgraded. The performance of the storm surge ensemble should have improved as a result of these developments.

2.3. Forcing arrangements

The existing short-range surge ensemble (identified as mogR below) consists of a CS3X storm surge forecast for each MOGREPS-R member. The initial surge state is taken from the corresponding (0600 or 1800 UTC) deterministic surge forecast. Beyond T + 54 h, MOGREPS-15 forcing must be used. The trial tested two alternative ways to construct a surge ensemble for the extended range:

  • mogC: Each member is driven by MOGREPS-R to T + 54 h, and the corresponding MOGREPS-15 member for the remainder of the forecast. The higher-resolution forcing over the first 54 h should provide the best surge state at the start of the extended-range forecast, but carries the risk that any discontinuity between the high- and low-resolution versions of each atmospheric member may lead to imbalance and oscillations in the surge solution.

  • mog15: Produces a separate surge forecast driven by MOGREPS-15 through the whole forecast range. For consistency, the surge model is again initialized from the 0600 or 1800 UTC deterministic surge forecast, and lead times are quoted with respect to this even though the 0000/1200 UTC atmospheric forecasts are actually 6 h older. This approach is technically simpler than mogC, and avoids any discontinuity at T + 54 h. However, it requires a separate run of mogR to provide the best forecasts for the first 54 h, and starts the extended-range forecast from a potentially inferior surge state.

Beyond T + 54 h, mogC and mog15 share the same atmospheric forcing. Their solutions are therefore expected to converge, given the rapid decay in the influence of past forcing observed by Flowerdew et al. (2010). The choice between the two approaches is therefore governed by performance in the period just beyond T + 54 h.
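In terms of the forcing seen by the surge model, mogC amounts to a straight splice of the two atmospheric sources for each member, as sketched below (hourly series for one field at one point; lengths and values are illustrative):

```python
import numpy as np

def splice_forcing(mogreps_r, mogreps_15, switch_hour=54):
    # mogC-style forcing for one member: MOGREPS-R up to and including
    # T+54 h, then the corresponding MOGREPS-15 member. Any mismatch
    # between the two sources at the join is passed straight to the
    # surge model, which is what triggers the oscillations discussed
    # in section 4.4.
    return np.concatenate([mogreps_r[:switch_hour + 1],
                           mogreps_15[switch_hour + 1:]])

p_r  = 1000.0 + np.random.randn(55)     # MOGREPS-R: T+0 .. T+54 h
p_15 = 1000.0 + np.random.randn(175)    # MOGREPS-15: T+0 .. T+174 h
p_mogc = splice_forcing(p_r, p_15)      # length 175, join at index 54
```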

The different types of surge ensemble forecast are summarized in Table 1.

Table 1. Summary of surge ensemble configurations.

Label   Input, T ≤ 54 h   Input, T > 54 h   Comments
mogR    MOGREPS-R         N/A               Existing operational surge ensemble
mogC    MOGREPS-R         MOGREPS-15        Single surge evolution with forcing transition at T + 54 h
mog15   MOGREPS-15        MOGREPS-15        Homogeneous run forced by low-resolution atmosphere
mogS    mogR              mog15             Splices the output of separate surge simulations forced by different-resolution atmospheres (see section 7)

3. Verification data

3.1. Hindcasts

The mogC and mog15 systems were tested in an 8-month trial from 6 July 2010 to the end of February 2011. Performance is measured at 15 min intervals for 36 ‘port’ locations around the British coast. An initial verification requiring no external data can be obtained by comparing the surge forecasts to the hindcasts which the deterministic surge system produces using analysis meteorology on the 12 km atmospheric grid. This offers complete coverage at all ports, without the need for observation quality control. It provides a ‘model-world’ view of each event, focusing on the impact of fundamental meteorological predictability while ignoring issues such as model bias and inaccuracies in either the model or harmonic tide predictions.

3.2. Tide gauge observations

Tide gauge observations provide direct evaluation of the final water level at each port. At the time of the work reported here, quality-controlled observations were available up to the end of 2010, and raw observations for the whole trial period. Separate streams of quality-controlled data were supplied for each of the two independent sensors comprising each tide gauge, with rejected data masked out. Data for this project were mostly taken from the recommended primary channel, but the secondary channel was used for ports where this helped to improve the data volume. Rudimentary quality control was also implemented for the raw observations, as in Flowerdew et al. (2010). Manual blacklisting was used to exclude remaining artefacts from both data sources. The filtered raw observations were then used to fill gaps in the quality-controlled data, including the whole of January and February 2011. The spread and error statistics shown in section 5 appear very similar whether calculated from the full merged-observations dataset or the quality-controlled observations alone, suggesting that the raw and quality-controlled observations can be sensibly combined in this way to improve the total data volume.
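The merge can be summarized as a preference-ordered gap fill. A simplified sketch, assuming missing samples are flagged as NaN (the actual quality control and blacklisting rules are more involved):

```python
import numpy as np

def merge_observations(qc_primary, qc_secondary, raw_filtered, blacklist):
    # Fill gaps in order of preference: quality-controlled primary
    # channel, then the secondary channel, then filtered raw data;
    # manually blacklisted samples are masked out last.
    merged = np.where(np.isnan(qc_primary), qc_secondary, qc_primary)
    merged = np.where(np.isnan(merged), raw_filtered, merged)
    merged[blacklist] = np.nan
    return merged
```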

4. Case studies

The following subsections present a series of examples illustrating the performance of the extended-range surge ensemble in particular forecasting cases.

4.1. Surges at Millport on 8 and 11 November 2010

Early November 2010 saw a period of mobile westerly flow across the Atlantic which brought a number of cyclonic storms and threats of storm surges. One of the larger events occurred on 11 November at Millport in the Firth of Clyde. A deep low of 948 hPa centred just to the west of Scotland produced a strong southwesterly flow into the Clyde region. Observed water levels exceeded the defined alert threshold, although we are not aware of any reports of flooding.

The first three panels of Figure 1 show a series of surge ensemble forecasts for this event. This format will be used throughout the case studies, and shows all the flavours of surge ensemble used within the trial on a single plot. The operational deterministic surge forecast is shown in orange, while the 54 h operational surge ensemble (mogR) is plotted in green. The blue lines show the mog15 configuration, while the lower red lines show the mogC trial. Slightly thicker lines are used for the unperturbed control forecast within each group. The mogC forecasts are identical to the operational ensemble for the first 54 h, and almost perfectly match mog15 after about 96 h. In the intermediate period the red lines are more prominent and show the persisting influence of the higher-resolution forcing.

Figure 1. Surge ensemble forecasts for Millport initiated at (a) 1800 UTC on 9 November, (b) 0600 UTC on 7 November, (c) 0600 UTC on 5 November, and (d) 1800 UTC on 1 November 2010.

All of the above forecasts are presented as surge residuals, predicted by the model after subtraction of the tide-only run. Ultimately, forecasters need to predict the total still water level, including the tide, and identify whether or not this exceeds the alert level defined for each port. The surge required to reach the alert level is shown by the upper red curves, oscillating up and down with the astronomic tide. Two versions of this threshold are plotted: the solid red curves are calculated using the more accurate harmonic tide predictions while the dotted versions are derived from the tide-only run of the CS3X model. Discrepancies between these two lines suggest situations in which the model's simulation of tide–surge interaction may be less accurate, as discussed in Flowerdew et al. (2010). Finally, verification data are shown using thick lines. The black line shows the observed surge residual calculated from raw tide gauge observations by subtraction of the harmonic tide prediction. This crosses above the solid red line when the observed total water level exceeds the alert threshold. The cyan line shows the hindcasts from the deterministic surge system, after subtracting the tide-only run in the same way as occurs for the forecasts.

Figure 1(a) shows that at about 42 h ahead the ensemble predicts a surge of around 1 ± 0.4 m, with around 95% probability of exceeding the alert level by 20 cm (calculated using the harmonic tide estimate of the required surge). Both observations and hindcast indicate that a surge did occur and exceeded the alert level. The ensemble consistently provided a signal for this event from the 0600 UTC forecast on 6 November onwards, over 5 days before the event, with greater spread in the earlier forecasts. In some cases, such as the 0600 UTC forecast on 7 November (Figure 1(b)), none of the ensemble members captured the full height of the observed surge, but there is nevertheless a signal of a significant surge with a 10% probability of exceeding the alert level and a 40% probability of coming within 20 cm of it. In the forecasts initiated at 1800 UTC on 5 November and earlier (e.g. Figure 1(c)) there was considerable ensemble spread, indicating some uncertainty in the forecast, but no clear signal of an event or identified probability of overtopping the alert level. It may be noted that there was an earlier event around 0000 UTC on 8 November that was also strongly and consistently forecast from at least 1800 UTC on 4 November, over 3 days ahead. Figure 1(d) shows that around a 20% probability of exceeding the alert level was first indicated as early as 1800 UTC on 1 November, over 6 days ahead.

These examples show that the extended-range surge ensemble can provide a useful signal for a significant surge event several days ahead and indeed several days earlier than is available from the currently operational 2-day system. Two separate surges at Millport were both successfully predicted on the correct tidal cycle 5–6 days ahead, with strong signals (higher probabilities) 3–4 days ahead. Similar surges were predicted and observed in this period at other ports around the Scottish coast, although the surge did not reach alert levels at those locations.

4.2. Surge at Liverpool on 11 November 2010

The storm on 11 November also produced a significant surge further south on the west coast of Britain, although not sufficient to come close to alert levels. Nevertheless, the ensemble performance is of interest. It is illustrated here with forecast plots for Liverpool, and performance at Heysham was very similar.

Figure 2(a) shows the forecast initiated at 1800 UTC on 6 November. This shows quite a strong signal for a rising surge on day 4–5 at around T + 110 h. The observations show that such an event did indeed occur, although the intensity was underpredicted, and it was followed on the next tidal cycle by another much larger surge which was barely predicted at all. At 4 days ahead the second surge was again missed but at 3 days the system started to give some indication for it (not shown). At 2 or fewer days ahead (Figure 2(c)), the second surge peak is well captured in the forecasts driven by MOGREPS-R (green), but weaker when driven by the lower-resolution MOGREPS-15 (blue). Figure 2(b) shows the intermediate forecast initiated at 0600 UTC on 9 November, for which the second surge falls just outside the MOGREPS-R period. Whereas the 1800 UTC forecast of the same day captures the height of the second surge well, the 0600 UTC forecast significantly underpredicts it. These results suggest that the higher meteorological resolution provided by MOGREPS-R is important for properly forecasting this second surge peak. Even mogC is unable to simulate it properly, despite the fact that less than 12 h have elapsed since the transition from MOGREPS-R to MOGREPS-15 meteorology.

Figure 2. Surge ensemble forecasts for Liverpool initiated at (a) 1800 UTC on 6 November, (b) 0600 UTC on 9 November 2010 and (c) 1800 UTC on 9 November 2010.

4.3. ‘False alarms’

One of the features of probabilistic forecasts is that extreme events, when they occur, are often outliers or near-outliers in the ensemble distribution, simply because the model climate reflects the real climate in that extremes are unlikely and difficult to generate. In order not to miss extreme events, it is therefore important to take note of relatively low probabilities of extremes, particularly where they are supported by other ensemble members producing significant but less extreme events. Unfortunately this does not guarantee that an extreme event will occur. Figure 3 shows a forecast for Cromer from 2 November predicting a small but significant probability of a major surge event between 6 and 7 days ahead. Similar forecasts were generated for other ports in the east from North Shields and Whitby to Felixstowe and Sheerness. In this case, the observations and hindcast show that no surge event occurred. While the meteorological ensemble correctly predicted the potential for significant cyclonic development in this period, there was uncertainty over the precise location. The suggestions of a surge at Cromer resulted from those MOGREPS members which placed the storm in the North Sea rather than its actual location to the west of Scotland. This is an inevitable characteristic of ensemble forecasts and simply reflects the predictability of the system. Indeed, it is precisely the type of case-specific uncertainty for which an ensemble of dynamic simulations might be expected to outperform offline statistical approaches for estimating forecast uncertainty, such as the dressed single forecasts described later in section 6.1. Nonetheless, this example does indicate the need to have effective decision-making policies to avoid overreacting to every extended-range low-probability alert.

Figure 3. Surge ensemble forecast for Cromer from 1800 UTC on 2 November 2010.

Figure 4 illustrates a case initiated at 1800 UTC on 21 October 2010. This forecast is for Cromer, but similar forecasts were generated for other ports along the east coast. At day 3, between 60 and 72 h ahead, there is a strong signal for a large surge between tidal peaks, with a small risk of overtopping the alert level on each of the tidal peaks. The observations show that on this occasion the reality was an outlier on the very edge of the ensemble distribution at the lower surge limit, rather than the higher one. It should be noted that the ensemble is not wrong, since it captures the initial rise of the surge well and the observations are just within the ensemble envelope. This is, nevertheless, an example of when a large surge did not materialize despite being forecast with high probability, and the low probability for a weaker (and earlier) surge was realized instead. Such outcomes must occur from time to time if the forecast probabilities are to be statistically reliable.

Figure 4. Surge ensemble forecast for Cromer from 1800 UTC on 21 October 2010.

4.4. Performance of mogC

The 0600 UTC run on 25 October 2010 provides one example in which mogC performance appears to benefit from the use of higher-resolution MOGREPS-R forcing before T + 54 h. Figure 5 shows results for Cromer, but similar forecasts were produced throughout the east and north coastal regions. Just after the T + 54 h transition, the mogC prediction (now visible in red) shows a moderate positive surge peak not resolved by mog15. The observations confirm that a small surge did occur, although not as large as suggested by the stronger mogC peaks.

Figure 5. Surge ensemble forecast for Cromer from 0600 UTC on 25 October 2010.

The mogC configuration creates a single history for each surge ensemble member, driven by MOGREPS-R for the first 54 h, and lower-resolution MOGREPS-15 input thereafter. Any discontinuity in forcing has the potential to induce spurious oscillations and degrade the forecast, but it was hoped this would be minimized by the fact that each MOGREPS-R member is derived from the corresponding MOGREPS-15 member. One of the aims of the trial was to determine the relative pros and cons of this approach. Early in the trial it became apparent that when there is no strong surge signal the transition can indeed trigger oscillations in the surge residual. This is illustrated in Figure 6, where a sharp short-period surge peak (in red near T + 60 h) stimulates a rapidly decaying oscillation. This is fairly typical of many examples seen, although sometimes the wave period is a little longer than in this case. In general the impact lasts for less than 24 h, but very occasionally the oscillations persist for around 48 h. The outlying mog15 members around T + 72 h make it particularly clear that the corresponding mogC members are following the same trajectory with the addition of these unphysical oscillations.

Figure 6. Surge ensemble forecast for Mumbles from 0600 UTC on 2 December 2010.

Transition shock waves such as these are not uncommon in the mogC trial, although the amplitude is normally sufficiently small that the oscillations are only noticeable where there is no significant surge signal. From this perspective the transition shock phenomenon may have less impact on the usefulness of the ensemble for its main purpose of predicting significant surge events. The overall impact of the mogC approach will be assessed within the statistical results below.

5. Spread and error statistics

While case studies can illustrate forecast behaviour in particular situations, an overall assessment requires representative statistics gathered over a variety of forecasting situations. Ensemble forecasts predict a distribution of possible outcomes, which can only be verified by accumulating observations over many forecasts to see if they fit the distribution predicted by the ensemble. This section considers the match between root mean square (rms) spread and error on a time-step-by-time-step basis. For this purpose, ensemble spread is defined as the standard deviation of the ensemble members (including the control) about the ensemble mean forecast for each time and location. If the ensemble members were perfectly sampling the distribution of possible outcomes, and observation error is neglected, their rms departure from the ensemble mean would match the rms departure of truth from the ensemble mean. In other words, the rms spread should equal the rms error of the ensemble mean. This should apply whether the forecasting cases are aggregated by location, lead time, or forecast variables such as the spread itself.
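In code, this test reduces to comparing two rms quantities over whatever aggregation of cases is chosen. A minimal sketch, assuming a members-by-cases array of forecasts:

```python
import numpy as np

def spread_error_stats(members, truth):
    # members: (n_members, n_cases) forecasts; truth: (n_cases,) values.
    # For a perfectly calibrated ensemble (neglecting observation
    # error), rms spread matches the rms error of the ensemble mean.
    ens_mean = members.mean(axis=0)
    spread = members.std(axis=0)            # std about the mean, per case
    rms_spread = np.sqrt(np.mean(spread ** 2))
    rms_error = np.sqrt(np.mean((ens_mean - truth) ** 2))
    return rms_spread, rms_error
```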

Figure 7 shows scatter plots illustrating the relationship between rms spread and error as a function of port, verified against hindcasts. Over the first 48 h, mogC (a) provides a much cleaner and more linear relationship than mog15 (b), suggesting that the higher-resolution meteorological forcing better reflects the climatological differences between the sites. The mogC ensemble mean has lower rms error than the deterministic forecast at almost every site, whereas the error of the mog15 ensemble mean is worse than the deterministic forecast for almost every site. Both systems provide a good relationship between spread and error as a function of port when averaged over the full lead time range, as shown for mogC in Figure 7(c). This supports the use of the lower-resolution MOGREPS-15 forcing at longer lead times where the errors are higher and probably more dominated by large-scale uncertainties.

Figure 7. Scatter plot of rms error against rms spread for each port and forecast type (denoted by symbols as indicated in the legend), verified against hindcasts. Results are shown for (a) mogC over the first 48 h, (b) mog15 over the first 48 h and (c) mogC over the entire lead time range (note the larger range of both axes in this case). Both of the top two plots show the same set of deterministic forecast errors; they appear different because they are plotted against different ensemble spreads. The dotted diagonal line marks the ideal relationship in which the rms spread is equal to the rms error of the ensemble mean forecast.

Figure 8 shows the average evolution of spread and error over the first 48 h of each forecast. This allows comparison with the deterministic forecast and the performance of the original surge ensemble system as illustrated by the bottom row of Flowerdew et al. (2010) Figure 5. Against hindcasts, the errors of the deterministic, mogC control and mogC ensemble mean forecasts grow slowly over the first few hours, with a different shape to the spread and other errors. This probably reflects the autocorrelation between hindcasts and forecasts started from the same initial state with similar configurations of the atmospheric model. The mogC errors grow more slowly than mog15, and the spread more rapidly, reflecting the higher quality and activity of the driving MOGREPS-R forecasts. As seen in the Flowerdew et al. (2010) trials, the deterministic forecast has slightly lower error over the first 12 h, although this may just reflect its higher autocorrelation with hindcasts driven by the same 12 km meteorology. Beyond about T + 18 h, the mogC ensemble mean has the lowest rms error. By contrast, the mog15 ensemble mean only becomes competitive with the deterministic forecast around T + 48 h.

Figure 8. Rms spread and error with respect to (a) hindcasts and (b) tide gauge observations, as a function of lead time over the first 48 h. Rms errors are shown for the unperturbed ensemble control forecast (blue), the perturbed ensemble members (green), the ensemble mean (including the control; red), and the deterministic forecast using higher-resolution meteorology (cyan). Solid lines are used for mogC and the deterministic forecast, while dotted lines show mog15. The grey histograms show observation density for the mogC forecasts according to the scale on the right of each plot.

Against observations, the results are qualitatively similar to Flowerdew et al. (2010). The errors are dominated by a large initial component, which Flowerdew et al. (2010) showed comes largely from the error in the harmonic tide prediction. Other contributions will include the unmodelled processes discussed in section 2.1, and the fact that the analysis meteorology and storm surge model are not perfect. Ideally, the ensemble would be extended to sample the dynamic impact of these sources of uncertainty. In the meantime, their average influence on threshold exceedance probabilities can be approximated by a dressing as discussed in section 6 below.

In both forms of verification, the magnitudes of the spread and error are lower than in Flowerdew et al. (2010). This presumably reflects improvements in the overall accuracy of the meteorological forecasts as a result of the changes described in section 2.2. Against observations, the rms of the port-specific biases is reduced from 5.5 to 4.9 cm (not shown), and the oscillations of error with lead time appear slightly less pronounced. To the extent that these features are tied to the harmonic tide prediction (including its sampling of seasonal steric effects) this suggests that the harmonic tide predictions may also have been slightly more accurate during the more recent trial. However, it should be remembered that some of the apparent differences in performance may in fact be due to the different sample of situations provided by the two distinct trial periods.

As in Flowerdew et al. (2010), a simple online bias correction can be used to reduce the errors with respect to observations and bring them closer to that component of the error which is explained by the ensemble spread. The harmonic tide prediction is adjusted by the mean difference between the observations and hindcasts in the 12 h prior to data time. Most of this bias is believed to arise from slowly varying errors in the harmonic tide prediction, including any departure of the seasonal steric effect from the mean cycle captured by the harmonic analysis. Since the correction only considers observations made before data time, it could in principle be used in a forecasting context to achieve the indicated performance, given suitable real-time quality-controlled observations. The rms bias is reduced by an order of magnitude (not shown), and the initial rms error from about 9.5 to 7.5 cm (compare Figure 9). The correction remains useful across the full lead time range, slightly reducing (rather than increasing) the rms error at T + 174 h. Other diagnostics such as rank histograms and error binned by spread are also made closer to ideal. Examination of the corresponding port average scatter plots shows that bias correction reduces the rms error for most but not all ports.
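A minimal sketch of such an online bias correction, assuming 15 min series of observations and hindcasts running up to data time; the 12 h window and the application to the harmonic tide follow the description above:

```python
import numpy as np

def online_bias(obs, hindcast, minutes_per_step=15, window_h=12):
    # Mean observation-minus-hindcast difference over the window before
    # data time. Only past data are used, so the correction could be
    # applied in real time given quality-controlled observations.
    n = (window_h * 60) // minutes_per_step
    return float(np.nanmean(obs[-n:] - hindcast[-n:]))

# The returned value is added to the harmonic tide prediction for all
# lead times of the forecast that starts at data time.
```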

Figure 9. As Figure 8, but showing the full lead time range and including online bias correction in the comparison to observations.

Figure 9 shows the evolution of forecast spread and error over the full lead time range. Against hindcasts, the slight rms error advantage of mogC over mog15 at T + 54 h is lost within a handful of hours. The rms error of the mogC control forecast is worse than mog15 for about the following 24 h, while the detriment to the ensemble mean error lasts about 12 h. The mogC system produces a transitory bump in spread just after the T + 54 h transition, perhaps related to any discontinuities in the meteorological evolution of each member. These results suggest that the mogC approach does more harm than good for the period beyond T + 54 h, although the penalty is relatively modest and short lived.

From day 4 onwards, the performance of the two systems is practically identical, as expected from the rapidly decaying influence of previous meteorological forcing. The rate of growth of spread almost keeps up with the error of the ensemble mean forecast, suggesting that the MOGREPS-15 forcing is capturing many of the important developments in uncertainty at this time-scale. Similar results are seen against observations, where online bias correction has now been included. In this case, the spread even appears to be catching up with the rms error, as the relative importance of the unsampled harmonic tide error diminishes. However, calculation shows that the rate of growth of spread is still less than ideal: if the spread s represented all errors except fixed independent contributions (such as harmonic tide error) with variance h², the total variance E² of the ensemble mean forecast error should equal s² + h², and thus the difference E² − s² should be constant. In fact, the gap between mean square spread and error increases with lead time against both hindcasts and observations.

Figure 10 shows the rms error of selected forecasts in bins defined by the ensemble spread. This tests the extent to which variations in the ensemble spread are indicative of genuine variations in forecast error. Against hindcasts, the results show remarkably good prediction of a range of error magnitudes extending well beyond that predicted by port or lead time alone. As noted in Flowerdew et al. (2010), this is one feature that makes the ensemble particularly valuable for managing the storm surge risk: it correctly indicates the greater than usual uncertainty associated with particular surge events. There is some suggestion that the very highest spreads overpredict the magnitude of the corresponding rms errors, although those errors are still larger than those in the preceding spread bin. Against observations, the overall relationship remains good, although the rms error is underpredicted at low spreads due to the impact of harmonic tide error.
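The diagnostic of Figure 10 can be reproduced by binning matched (spread, error) pairs. A minimal sketch using equal-population bins (the binning scheme is an assumption; the paper does not specify it):

```python
import numpy as np

def error_binned_by_spread(spread, error, n_bins=10):
    # Sort cases by ensemble spread, split into equal-population bins,
    # and compute rms spread and rms error within each bin. On the
    # ideal diagonal the two are equal in every bin.
    order = np.argsort(spread)
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    bins = np.array_split(order, n_bins)
    return (np.array([rms(spread[b]) for b in bins]),
            np.array([rms(error[b]) for b in bins]))
```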

Figure 10. Rms error of the forecasts indicated in the legend, binned by ensemble spread. Results are shown for mogC over the full lead time range, verified against (a) hindcasts and (b) tide gauge observations with online bias correction. Results for mog15 are practically identical. The dotted diagonal line shows the ideal situation in which each rms error is equal to the rms spread from the same bin. The grey histograms show observation density according to the scale on the right of each plot.

6. Probability verification

While mean and spread statistics summarize overall forecast accuracy, they only consider the first two moments of the ensemble's prediction of the distribution of possible outcomes. They also tend to be dominated by the more common and relatively unimportant cases with no significant surge. One way to more fully evaluate the probabilistic aspects of an ensemble forecast is to consider the ensemble's prediction of the probability that a given quantity will exceed relevant thresholds. This is particularly appropriate for storm surge forecasting, where the key question is whether the total water level will reach the limit of the flood defences. The probability of reaching this threshold is the key quantity required for rational risk management: given a potential protective action, with estimates of its cost C and of the loss L it would prevent if the event occurs, long-term benefit is maximized by taking the action if and only if the probability of the event exceeds the ratio C/L (Richardson, 2000).
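The cost/loss rule makes the link between forecast probabilities and decisions explicit, as in the minimal illustration below (the cost and loss figures are invented for the example):

```python
def should_act(p_event, cost, loss):
    # Richardson (2000): taking the protective action maximizes
    # long-term benefit iff the event probability exceeds cost/loss.
    return p_event > cost / loss

# e.g. an action costing 50k that would prevent 1M of losses is
# worthwhile whenever the exceedance probability tops 0.05
should_act(0.12, cost=50e3, loss=1e6)    # -> True
```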

The quality of the ensemble's prediction of the probability of exceeding relevant thresholds is thus directly related to its usefulness for risk management. A perfect ensemble would forecast probability 1.0 for situations in which the event occurs, and 0.0 for situations in which it does not. In this way, it is both perfectly sharp (no intermediate probabilities) and perfectly reliable (over cases where a given probability value is forecast, the frequency with which the event occurs matches the forecast probability). The Brier skill score (Wilks, 2006) measures the proximity to this ideal on a scale where 1.0 means a perfect forecast and 0.0 means a forecast no better than always forecasting a probability equal to the overall climatological frequency with which the event occurs. The Brier skill score can be expressed as a difference of two terms: resolution minus reliability. The reliability penalty measures the departure of the forecast probabilities from the corresponding conditional observed frequencies, while the resolution component measures the underlying ability to split the forecasting situations into groups where the event is more or less likely.
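For reference, both terms can be computed directly from a set of forecast probabilities and binary outcomes. A minimal sketch following the standard Murphy partition, with each term normalized by the climatological uncertainty so that skill = resolution − reliability:

```python
import numpy as np

def brier_skill_decomposition(p_fcst, outcome):
    # p_fcst: forecast probabilities; outcome: 0/1 event occurrence.
    o_bar = outcome.mean()
    unc = o_bar * (1.0 - o_bar)             # climatological uncertainty
    rel = res = 0.0
    for p in np.unique(p_fcst):             # bin by forecast probability
        sel = p_fcst == p
        n_k, o_k = sel.sum(), outcome[sel].mean()
        rel += n_k * (p - o_k) ** 2         # reliability penalty
        res += n_k * (o_k - o_bar) ** 2     # resolution
    n = p_fcst.size
    return (res - rel) / (n * unc), res / (n * unc), rel / (n * unc)
```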

The event definitions used here follow Flowerdew et al. (2010). Two types of event are considered: surge residual exceeding a given threshold, and total water level minus the port-specific alert level exceeding a given threshold. The surge thresholds focus primarily on the strength and abnormality of the meteorological situation, while the proximity of total water to the alert level is the most direct indicator of the risk of overtopping flood defences. In both cases, the harmonic tide prediction (with online bias removal when verifying against observations) is used to translate between modelled surge residuals and observed total water levels. All events consider the maximum values observed and forecast within a series of 12 h time windows, spaced at 6 h intervals. This is motivated by the idea that the fundamental decision is whether or not action is required within a given tidal cycle, regardless of precisely when the peak occurs. Thus two ensemble members exceeding the threshold at slightly different times should be counted as a 2/24 rather than 1/24 probability of action being required.
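A minimal sketch of these windowed event probabilities, assuming 15 min forecast series for each ensemble member, with the 12 h windows spaced at 6 h intervals as described above:

```python
import numpy as np

def window_exceedance_probs(members, threshold,
                            steps_per_hour=4, window_h=12, spacing_h=6):
    # members: (n_members, n_steps) forecast series. The probability
    # for each window is the fraction of members whose maximum within
    # that window exceeds the threshold, so two members peaking at
    # slightly different times both count towards the same window.
    w, s = window_h * steps_per_hour, spacing_h * steps_per_hour
    n_steps = members.shape[1]
    return np.array([np.mean(members[:, i:i + w].max(axis=1) > threshold)
                     for i in range(0, n_steps - w + 1, s)])
```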

6.1. Forecast dressing

Ensemble forecasting systems aim to provide a case-specific estimate of forecast uncertainty. As forecast error increases into the medium range, one might expect that its distribution becomes less Gaussian, perhaps multimodal in some cases, and more situation specific. If the ensemble can correctly represent these effects, it should have an advantage over approaches which just make some climatological allowance for forecast uncertainty.

To provide this context, the performance of the storm surge ensemble is compared to a series of probability forecasts based on dressing the unperturbed control forecast with an assumed Gaussian error distribution. The alternative dressings illustrate three different ways to choose the width of this assumed distribution, following Flowerdew et al. (2010). The first (ndress) uses zero width, so that the forecast is assumed to be perfect. This gives probability 1.0 when the forecast is above the threshold, and probability 0.0 otherwise. The second variant in Flowerdew et al. (2010), odress, assumed a fixed width equal to the overall rms error of the forecast. Figure 9 showed significant variation in rms error over the extended forecast range now being considered, particularly against hindcasts. This dressing is therefore adapted to a width that depends linearly on lead time (ldress). This fits the portions of the forecasts which are driven by MOGREPS-15 better than it does the portion of mogC which is driven by MOGREPS-R. The final variant, mdress, adds in quadrature a further term given by a fraction of the magnitude of the forecast surge. This reflects the Flowerdew et al. observation that larger surge forecasts are associated with larger than normal errors. This provides a degree of situation dependence, while only requiring a single forecast as input.
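All of these dressings reduce to a Gaussian exceedance probability with a different choice of width. A minimal sketch; the fitted coefficients below are placeholders, not the values used in the trial:

```python
import numpy as np
from math import erf, sqrt

def exceed_prob(forecast, threshold, sigma):
    # P(truth > threshold) for a forecast dressed with a Gaussian of
    # standard deviation sigma; sigma = 0 reproduces ndress (0 or 1).
    if sigma == 0.0:
        return float(forecast > threshold)
    z = (threshold - forecast) / sigma
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

def ldress_sigma(lead_h, a=0.05, b=0.001):
    return a + b * lead_h                   # width linear in lead time

def mdress_sigma(lead_h, surge, c=0.15):
    # add a fraction of the forecast surge magnitude in quadrature
    return np.hypot(ldress_sigma(lead_h), c * abs(surge))

def ensemble_dressed_prob(members, threshold, sigma0):
    # dressing each ensemble member (described below): average the
    # dressed exceedance probability over the members
    return float(np.mean([exceed_prob(m, threshold, sigma0)
                          for m in members]))
```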

These comparisons are intended to provide a benchmark for the ensemble performance, and illustrate the importance of different effects. They are not intended as a comprehensive attempt to produce the best possible dressed single forecast. As in Flowerdew et al. (2010), no attempt has been made to account for the variation in rms error between ports shown in Figure 7. While the changes from the Flowerdew et al. dressing parameters generally improve the performance of the dressed forecasts, they sometimes make them worse, although this does not affect the main conclusions below. The models of how rms error varies with lead time and forecast magnitude are only approximate. A Gaussian distribution, while simple, may not be the best model for the actual error distribution. Once the assumed and actual error distributions differ, the diagnosed rms error may no longer be the optimal distribution width to use, and an explicit fitting scheme may perform better. The addition of lead time dependence is often neutral or slightly harmful to mdress at long lead times, reducing its advantage over ldress. This may suggest that the lead time and forecast magnitude effects are not independent. The increase in rms error with lead time reflects the forecast sampling a greater range of climatology, while the strength of the forecast magnitude as a predictor of error may decrease with lead time. On the other hand, these considerations simply serve to illustrate the complexities of producing probabilistic forecasts based on historic statistics. By contrast, the ensemble simulates the distribution from first principles, which should be more robust and adaptable, provided that the key aspects are accurate enough to be competitive with reasonable benchmarks.

Following Flowerdew et al. (2010), when verifying against observations, each member of the ensemble is dressed with a Gaussian distribution with standard deviation equal to ldress at zero lead time. This is intended to represent the non-meteorological uncertainties which the ensemble does not sample, including the error in the harmonic tide prediction. For predictions of surge residual, the harmonic tide uncertainty amounts to an observation error, and the dressing provides a fairer comparison to observations similar to Saetra et al. (2004). For predictions of total water level, the harmonic tide is just another source of error in the overall forecasting system, and the dressing should ideally be included when calculating probabilities for operational forecasting. Omission of the dressing reduces the Brier skill score for almost every threshold and lead time (not shown), particularly for surge exceeding zero metres, where it harms both reliability and resolution. There is some suggestion that overestimating this dressing can further improve some Brier skill scores, perhaps by allowing for other errors which the ensemble does not sufficiently represent.

6.2. Bias correction

When verifying against observations, section 5 noted modest reductions in rms error following the application of an online bias correction scheme. The results shown below all use this scheme, with dressing parameters based on the rms errors after its application. Online bias correction improves the Brier skill score in almost all cases (not shown), particularly for non-negative surge thresholds. For negative surges, the Brier skill score and reliability component are still improved, but the resolution is reduced. The fact that the undressed control forecast exhibits the same changes (despite the reduction in overall rms error) suggests the problem is related to the applicability of the deduced bias to the central forecast in these cases.

6.3. Results

Figure 11 shows Brier skill score as a function of lead time for selected thresholds, verified against both hindcasts and observations. The thresholds have been chosen for maximum relevance while still producing a reasonably consistent signal between the different lead times and systems. As an indication of sample size, summed over all ports, the observations contain about 50 windows per lead time with surge exceeding 1 m, and 86 windows per lead time with total water level exceeding the alert threshold.

Figure 11. Brier skill score as a function of the centre of each 12 h lead time window for surge exceeding zero metres (top), surge exceeding 1 m (middle) and total water level exceeding the port-specific alert level (bottom). The forecasts have been verified against hindcasts (left) and observations with online bias correction (right). Results are shown for mogC (solid) and mog15 (dotted), with scores for the ensemble (with fixed-width dressing when compared to observations) and dressed control forecasts as indicated in the legend.

In all cases, the best ensemble forecast is competitive with or superior to the two non-trivial dressings, with particular advantage at longer lead times. Against observations, the reliability penalty (not shown) is generally small, suggesting limited potential for improvement through statistical post-processing. Ignoring forecast uncertainty (ndress) is strongly detrimental against observations since it neglects harmonic tide uncertainty. Using hindcasts to focus on meteorological uncertainty alone, ndress again performs worst at all but the shortest lead time, the gap generally widening with lead time as uncertainty becomes more significant.

Within the first 2 days, mogC generally performs best, as expected from its higher-resolution meteorological forcing. However, by T + 54 h, the gap between mogC and mog15 scores has reduced, suggesting the penalty for having to switch to lower-resolution meteorological forcing may not be too severe. Following the spread/skill results, mog15 is often superior to mogC in the period just beyond T + 54 h, suggesting that the discontinuity in forcing harms subsequent forecast development. One apparent exception to this rule is surge exceeding 0 m verified against observations, which shows a temporary bump in skill following the transition. This comes from a corresponding dip in reliability penalty (not shown), perhaps related to the bump in spread observed in Figure 9. These properties suggest that this temporary increase in skill is an essentially spurious result of the transition itself.

In terms of pure meteorologically driven predictability, measured against hindcasts, the ensemble has declining but non-zero skill to day 7 and potentially beyond for some thresholds. The results for reaching the alert level, in particular, suggest there is value in running the ensemble to this range, provided the user has available actions with a cost/loss ratio appropriate to the inevitably lower skill associated with longer-range forecasts. The forecast range for non-zero skill reduces with surge magnitude, as might be expected. Negative surge thresholds (not shown) show similar decline in resolution with lead time to their positive counterparts, but reach zero skill sooner due to their higher reliability penalty.

For surge thresholds, the verification against observations almost always produces poorer scores than against hindcasts, as would be expected from the extra sources of error involved. For total water thresholds, the situation is more complicated: at longer lead times the Brier skill score against observations exceeds that against hindcasts, by an increasing margin as the threshold rises. The shallower decline of Brier skill score with lead time against observations was noted in the results of the original 2-year trial, but that trial did not extend far enough in lead time for the scores to actually cross.

Most of the apparent benefit in verification against observations arises from a lower reliability penalty which does not decline noticeably with lead time. This may reflect the homogenizing effect of harmonic tide error, particularly since it applies to both the dressed and undressed forecasts. The reliability diagrams against hindcasts (not shown) suggest an over-willingness to predict high water levels, whereas the reliability diagrams with respect to observations are more diagonal. Verification against observations also improves the resolution at long lead times (not shown). This seems harder to explain, unless for instance there is some systematic error in the hindcast representation of large events, so that verification against observations correctly ascribes greater skill to the forecasts. As the gradient of resolution with lead time becomes ever shallower, it suggests the forecasts are running out of actual skill. The fact that the numerical value is non-zero probably reflects the predictability that arises from the tidal cycle and its variation from spring to neap tides. This has a strong influence on the relative likelihood of reaching high water levels on different days, causing a system with skill no better than climatology so far as the surge is concerned to score much better than the overall average climatological frequency with which that level is reached. The question remains why this floor on the skill score for high water level thresholds appears larger against observations than against hindcasts: the harmonic tide must do better at predicting itself (in the verification against hindcasts) than at predicting the observed tidal cycle, and similar reasoning might be expected to apply to the model's simulation of tide–surge interaction.

7. Conclusions

Since December 2009 the Met Office has run an operational storm surge ensemble, driven by the MOGREPS-R atmospheric ensemble out to T + 54 h. This paper has considered two alternative systems designed to extend these forecasts to 5–7 days, forced by the MOGREPS-15 medium-range ensemble. The cheaper mogC system takes the final state of each member of the existing storm surge ensemble, and continues the integration from T + 54 h driven by the corresponding member of the MOGREPS-15 ensemble. The alternative mog15 system runs a separate extended-range surge ensemble driven by MOGREPS-15 meteorology from T + 0 h.

The extended-range systems have been tested in an 8-month trial from 6 July 2010 to the end of February 2011, with runs out to 7.25 days. While this period does not include any major surge events, it does include more minor events at a number of ports, and produces reasonably stable statistics for positive surges up to about 1 m, negative surges down to about -0.6 m and total water levels up to 0.2 m above the alert level. Two sources of verification data have been used. Surge model hindcasts driven by analysed meteorology allow relatively noise-free examination of the fundamental meteorological predictability. Tide gauge observations provide a direct validation of real-world water levels, and the ultimate usefulness of the system as a whole. Performance has been assessed using both case studies and a variety of statistical approaches. The accumulation of these different strands of evidence supports modest confidence in our overall assessment, although conclusions about performance for extreme events would always benefit from increased sample size.

A series of case studies suggested the surge ensemble has the capability to provide useful indications of potential events even out to T + 7 days. The higher-resolution MOGREPS-R meteorology produces better forecasts than MOGREPS-15 for the period where both are available. There was some evidence that the benefit of higher-resolution meteorology persisted into the period just beyond T + 54 h in the mogC system, although the trajectories converge with the mog15 results within 1 or 2 days. One might expect the benefit of high-resolution forcing to persist for longer in situations where the surge is generated, for example, west of Ireland, and takes time to propagate clockwise round the UK to impact ports in the southeast, or countries such as the Netherlands. A wave following the H ∼ 20 m depth contour, travelling at the shallow-water wave speed √(gH) ≈ 14 m s−1, would take about 22 h to travel 1100 km. In other cases, there are clear instances of ‘transition shock’, where the discontinuity in meteorological forcing triggers spurious oscillations in the surge model. These can persist for 1 or 2 days and clearly harm forecast performance, although they seem to predominantly occur in relatively calm situations which are less important for civil protection.
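The travel-time arithmetic is a direct application of the shallow-water wave speed. A quick check of the numbers quoted above:

```python
from math import sqrt

g, H = 9.81, 20.0              # m s^-2; depth contour in m
c = sqrt(g * H)                # shallow-water wave speed, ~14 m/s
hours = 1100e3 / c / 3600.0    # ~22 h to travel 1100 km
```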

Ensemble forecasts aim to predict the distribution of possible outcomes, which has to be evaluated statistically over many cases. Section 5 examined the rms error of the various forecasts, and the usefulness of the ensemble spread as a predictor of that error. Section 6 evaluated the quality of the ensemble's prediction of the probability to exceed relevant thresholds on the surge residual and total water level. When verifying against observations, online bias correction was found to be almost universally beneficial, and dressing each ensemble member with a Gaussian distribution improved the probabilistic performance. These post-processing techniques help to address deficiencies in the harmonic tide prediction and processes not included within the CS3 storm surge model. Neither technique is currently applied to operational forecasts. Their introduction would improve the performance of the forecasting system, although to some extent they represent allowances that a forecaster might naturally make when looking at time series plots.

The statistical results support the suggestion from the case studies that the higher-resolution MOGREPS-R meteorology provides the best forecasts within the 54 h for which it is available. The mogC system performs worse than mog15 on almost all measures for the period just beyond T + 54 h, suggesting that the detrimental consequences of the discontinuity in forcing outweigh any carry-through of the benefit of the high-resolution forcing. This leads to the recommendation that mog15 be adopted for surge forecasts beyond T + 54 h, while the existing mogR system remains the best forecast before this time. Where possible, forecasters may find it useful to have plots similar to those presented in section 4, which show the mogR and mog15 results together over their full lead time ranges. This makes clear the relationship between the two forecasts, and allows the case-specific quality of the mog15 forecasts to be examined by comparison to mogR over the first 54 h. Where it is necessary to deliver a single ensemble forecast for each lead time, this should use mogR followed by mog15. These ‘mogS’ members will often possess some discontinuity as they move from mogR to the corresponding mog15 member, but this occurs only at T + 54 h, and avoids the oscillations observed in mogC.

It may be possible to improve the performance of the mogC configuration by smoothing the transition in wind and pressure forcing. This should reduce the problem of shock-induced oscillations, but may not help situations in which the MOGREPS-15 forcing for a given member just after T + 54 h builds on similar forcing before T + 54 h, which is not present in the corresponding MOGREPS-R member. Similarly, one might consider it desirable to smooth the T + 54 h join in mogS to avoid obvious discontinuities. However, this would simply hide the problem, spreading the discontinuity over a longer time period. Since it moves away from the best available forecast, smoothing may well increase rather than reduce the rms error of the affected time steps. One situation in which smoothing may be necessary is to avoid transition shock effects if mogS data were used to drive a further hydrodynamic model. As above, the preferred solution would be to run separate simulations driven by mogR and mog15 respectively, using the mogR-based system for forecasting up to T + 54 h and the mog15-based one for the extended range.

Section 5 demonstrated that the mogS system has lower rms error than the deterministic and control forecasts from about T + 18 h. Situations in which the ensemble produces a larger than normal spread have correspondingly larger rms errors. The ensemble spread captures much of the variation in rms error as a function of lead time or port, provided separate allowance is made for the non-meteorological errors when verifying against observations. However, there is scope for improvement in the driving atmospheric ensemble, since the mean square spread does not quite keep pace with the growth of mean square error as a function of lead time.

Section 6 compared the ensemble performance to a series of dressed control forecasts, to check its performance against these benchmarks and quantify the benefit provided by the explicit dynamic simulation of forecast uncertainty. In all cases, the ensemble was competitive with or superior to the best dressed single forecast. The ensemble advantage often increases with lead time, as might be expected when uncertainties in weather system occurrence, position and timing start to dominate over the kind of variations which a simple dressing can allow for. The forecast dressings were difficult to optimize, partly due to the interacting effects of lead time and surge magnitude on forecast error. By contrast, the ensemble derives a high-quality estimate of uncertainty from fundamental dynamic simulation, immediately usable without the need for calibration based on historic data. The undressed control forecast, representing a forecasting process that completely ignores forecast uncertainty, generally performed worst and often much worse than the other methods.

Section 4 showed case studies in which indications of notable events were provided at the longest lead times probed by the trial. For low surge thresholds, and the key threshold of total water exceeding the alert level, the statistical verification also showed positive skill across the full lead time range. Against observations, there was some ambiguity about whether a Brier skill score of zero represents the true floor of usable skill, given the predictability of the basic tidal cycle. The noise associated with the harmonic tide prediction and other sources of error may reduce the practical utility of some of the fundamental predictive skill demonstrated in verification against hindcasts. Overall, these strands of evidence suggest there is potentially useful skill in the full lead time range explored by this trial. Based on this evidence, the full 7.25-day mog15 system was implemented operationally at the Met Office in summer 2011, augmenting the existing short-range ensemble driven by MOGREPS-R.

Acknowledgements

This work was funded by the Environment Agency for England and Wales. Tide gauge data were supplied by the British Oceanographic Data Centre as part of the UK Coastal Monitoring and Forecasting service. The authors thank Kevin Horsburgh and Pat Hyder for useful comments on the storm surge aspects of this work.
