Comparison of different global ensemble prediction systems for tropical cyclone intensity forecasting

Many meteorological centers have operationally implemented global model-based ensemble prediction systems (GEPSs), making tropical cyclone (TC) forecasts from these systems available. The relatively low resolution of these GEPSs meant that previous studies focused primarily on TC track forecasting. However, recent GEPS upgrades mean that TC intensity predictions from GEPSs are now also of interest. This study verifies and compares the latest generation of GEPSs for TC intensity forecasts, particularly during the rapid intensification (RI) period, over the western North Pacific (WP), eastern North Pacific (EP), and North Atlantic (NA) basins in 2021–2022. On average, the National Centers for Environmental Prediction (NCEP) GEPS performed best in predicting both TC intensity and RI across all three basins. Nevertheless, the exact timing of RI remains highly uncertain for these GEPSs, indicating significant limitations in using GEPSs to forecast RI.


| INTRODUCTION
Tropical cyclones (TCs) are one of the most destructive natural hazards and have had catastrophic impacts in many countries (Needham et al., 2015). Accurate forecasting of TCs is essential for reducing the enormous socioeconomic losses that are associated with such storms. With improvements in forecasting models, both TC track and intensity predictions have become increasingly accurate since 2010 (Cangialosi et al., 2020). However, the chaotic nature of the atmosphere and the imperfections inherent in numerical weather prediction systems mean that small initial errors can grow into large prediction errors (Lorenz, 1963, 1965, 1993; Thompson, 1957). A single prediction cannot explicitly capture the forecast uncertainty (Palmer, 2000; Zhang & Zhao, 2016). To solve this problem, the ensemble prediction method was developed to generate a set of predictions that start from different initial states instead of a single prediction (Demeritt et al., 2007; Ehrendorfer & Tribbia, 1997; Leith, 1974). The average of the predictions from the individual ensemble members can provide a good forecast estimate with improved accuracy, relative to a single deterministic forecast. The spread of individual predictions within an ensemble can also be used to quantify forecast uncertainty and provide a probabilistic prediction (Froude et al., 2007; Zhu, 2005).
Over the last half-century, a more profound understanding of the ensemble prediction method has driven the development of many different flow-dependent perturbation methods, in preference to static random perturbations (e.g., Feng et al., 2014; Palmer, 1992; Toth & Kalnay, 1993, 1997). To create individual ensemble members, all components of a state-of-the-art forecasting system, such as the physics and the numerical and boundary forcings, are perturbed, instead of perturbing only the initial atmospheric conditions (Inverarity et al., 2023; Leutbecher & Palmer, 2008; Palmer et al., 2009; Zhang, 2021). Many meteorological centers have operationally implemented their own global model-based ensemble forecast systems (GEPSs; e.g., Molteni et al., 1996; Mureau et al., 1993; Toth & Kalnay, 1993; Tracton & Kalnay, 1993), which use different models, resolutions, and perturbation-generation methods, and include different numbers of ensemble members (Yamaguchi & Majumdar, 2010). It is important to assess and compare the performance of GEPSs from different modeling centers. For this purpose, the THORPEX (The Observing System Research and Predictability Experiment) Interactive Grand Global Ensemble (TIGGE) dataset was established by the World Weather Research Program (WWRP; Richardson et al., 2005), which has enhanced collaboration between the research and operational meteorological communities and enabled research into a wide range of topics (Bougeault et al., 2010; Swinbank et al., 2016).
By using the TIGGE dataset, many studies have verified and compared the TC forecasting capabilities of different GEPSs. Overall, the European Centre for Medium-Range Weather Forecasts (ECMWF) GEPS has been shown to have higher skill for TC genesis and TC track forecasting than other state-of-the-art GEPSs for most time windows in the North Atlantic (NA) basin, the western North Pacific (WP) basin, and the South China Sea (Leonardo & Colle, 2017; Nixon, 2012; Titley et al., 2022; Yamaguchi et al., 2015; Zhang et al., 2015). Rama Rao et al. (2015) evaluated TC track forecasts over the northern Indian Ocean from three different GEPSs and showed that the National Centers for Environmental Prediction (NCEP) GEPS has a lower forecast error than the ECMWF and the United Kingdom Met Office (UKMO) GEPSs. Typically, the above-mentioned studies did not compare TC intensity predictions, especially for periods of dramatic increase in TC intensity (rapid intensification [RI]), because it is generally agreed that the resolution of GEPSs is inadequate for predicting TC intensity variations (Strazzo et al., 2016). However, benefiting from the increase in computing power, the horizontal and vertical resolutions of GEPSs have also been increased (e.g., Magnusson et al., 2019; Mamgain et al., 2020; Zhou et al., 2022).


| DATA AND METHODS

Table 1 provides the main properties of the seven GEPSs evaluated in this study: the Australian Community Climate and Earth-System Simulator (ACCESS) GEPS from the Australian Bureau of Meteorology (BoM), the Environment and Climate Change Canada (ECCC) GEPS, the ECMWF GEPS (Buizza et al., 2007), the Japan Meteorological Agency (JMA) GEPS (Takayuki Tokuhiro, 2018), the Korean Integrated Model (KIM) ensemble from the Korea Meteorological Administration (KMA), the NCEP GEPS, and the Met Office Global and Regional Ensemble Prediction System-Global (MOGREPS-G) from the UKMO (Bowler et al., 2008). The best track data for the EP and NA were acquired from the National Hurricane Center (NHC), while those for the WP were sourced from the Joint Typhoon Warning Center (JTWC); both are accessible through the International Best Track Archive for Climate Stewardship (IBTrACS) dataset, Version 4 (Knapp et al., 2010, 2018). Considering the 6-h forecast step interval, the verification in this study was also conducted at 6-h intervals.

| Verification methods
To assess the TC track and intensity forecasts, we used the root-mean-square error (RMSE) and the ensemble spread to measure the deterministic skill of the ensemble forecasting. The ensemble spread measures the difference between the members and is represented by the standard deviation with respect to the ensemble mean, whereas the RMSE measures the distance from the ensemble mean forecast to the true value, as represented by the observation (Zhu, 2005). Cases in which only a single forecast member is available at a particular forecast time are excluded from our analysis. Previous studies have shown that a perfect ensemble forecasting system should have an ensemble spread and an RMSE of the same magnitude, and that a large difference between the two indicates statistical inconsistency (Buckingham et al., 2010; Buizza et al., 2005; Magnusson et al., 2008; Palmer et al., 2006). Therefore, we used the ratio of the RMSE to the ensemble spread to evaluate the reliability of the ensemble predictions (Fortin et al., 2014).

F I G U R E 2 As in Figure 1, but for all named TCs over the EP in 2021–2022. The colors indicate the different forecast centers: light blue for ECCC, yellow for ECMWF, red for NCEP, and blue for UKMO.
The value of the ratio represents the degree of matching between the ensemble spread and the error of the ensemble mean.For a perfect ensemble forecasting system, the ratio should have a value of 1.
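The RMSE, spread, and ratio described above can be sketched as follows. This is a minimal illustration rather than the operational verification code; the function name and array layout are assumptions.

```python
import numpy as np

def ensemble_verification(forecasts, observations):
    """Deterministic verification of an ensemble forecast.

    forecasts:    array of shape (n_cases, n_members), e.g. MSW from each member
    observations: array of shape (n_cases,), e.g. best-track values
    Returns (rmse, spread, ratio).
    """
    forecasts = np.asarray(forecasts, dtype=float)
    observations = np.asarray(observations, dtype=float)

    ens_mean = forecasts.mean(axis=1)
    # RMSE: distance from the ensemble mean forecast to the observation
    rmse = np.sqrt(np.mean((ens_mean - observations) ** 2))
    # Spread: standard deviation of members about the ensemble mean,
    # averaged over all cases in the root-mean-square sense
    spread = np.sqrt(np.mean(forecasts.var(axis=1)))
    # A ratio near 1 indicates a statistically consistent ensemble
    ratio = rmse / spread
    return rmse, spread, ratio
```

A ratio well above 1 (spread too small for the error) is the under-dispersion seen in the intensity forecasts later in the article.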
In this article, RI was defined as an increase in the maximum wind speed of 30 kt (1 kt ≈ 0.51 m s⁻¹) or more in 24 h, based roughly on the 95th percentile of observed intensity changes in the Atlantic basin (Kaplan & DeMaria, 2003). Applied to observational data, this definition determines whether a TC qualifies as an RI event. However, to ensure an adequate sample size, we relaxed the threshold defining RI within the forecasts: intensification rates of 20 kt or greater over the 24-h forecast periods of 0–24 h and 12–36 h were verified in this study. The reliability diagram, Brier skill score (BSS), and relative operating characteristic (ROC) curves are used to verify the RI probabilistic forecasts over the forecast periods of 0–24 and 12–36 h in Section 3.2 (Buizza et al., 2005; Wilks, 2006). The reliability diagram plots the observed frequency against the forecast probability, where the range of forecast probabilities is divided into six bins (0%, 0%–20%, 20%–40%, 40%–60%, 60%–80%, and 80%–100%). The sample size in each bin is often included as a histogram or as values beside the data points. The BSS is a proper scoring rule that measures the accuracy of probabilistic predictions, and the ROC curve is a graphical representation of the tradeoff between the probability of detection (POD) and the probability of false detection (POFD) at various classification thresholds. These indicators provide complementary insights into the performance of ensemble forecasting.
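The RI definition above amounts to a simple check on a 6-hourly intensity series. The helper below is a hypothetical sketch: the 30-kt observational threshold and the relaxed 20-kt forecast threshold can both be passed via `threshold_kt`.

```python
import numpy as np

def flag_ri(msw, threshold_kt=30, window_steps=4):
    """Flag rapid intensification in a 6-hourly MSW series (in kt).

    RI is flagged at time t when MSW(t + 24 h) - MSW(t) >= threshold_kt;
    with 6-hourly data, 24 h corresponds to window_steps = 4.
    Returns a boolean array of length len(msw) - window_steps.
    """
    msw = np.asarray(msw, dtype=float)
    # 24-h intensity change at each 6-h step
    delta_24h = msw[window_steps:] - msw[:-window_steps]
    return delta_24h >= threshold_kt
```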
For any event ϕ, the Brier score (Brier, 1950) is computed as

BS = (1/N) Σ_{i=1}^{N} (p_i − o_i)²,

where N is the number of samples, p_i denotes the forecast probability of event ϕ for the ith sample, and o_i denotes whether event ϕ actually occurred for the ith sample (taking values of only 0 or 1).
The BSS is computed as

BSS = 1 − BS/BS_ref,

where the reference Brier score BS_ref is computed by using the sample climatology as the forecast (i.e., 5%; Kaplan & DeMaria, 2003). The BSS ranges from −∞ to 1: 0 indicates no skill relative to the reference forecast, and 1 is a perfect score.
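The BS and BSS formulas above translate directly into code; the following is a minimal sketch with illustrative function names.

```python
import numpy as np

def brier_score(p, o):
    """Brier score: mean squared difference between forecast
    probability p_i and binary outcome o_i (0 or 1)."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return np.mean((p - o) ** 2)

def brier_skill_score(p, o, p_ref=0.05):
    """BSS = 1 - BS/BS_ref, where the reference forecast is the
    sample climatology (here 5%, following Kaplan & DeMaria, 2003)."""
    o = np.asarray(o, dtype=float)
    bs = brier_score(p, o)
    bs_ref = brier_score(np.full(o.shape, p_ref), o)
    return 1.0 - bs / bs_ref
```

A perfect probabilistic forecast (p_i equal to o_i everywhere) gives BS = 0 and BSS = 1; a forecast no better than climatology gives BSS = 0.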
The ROC curve illustrates the performance of a binary classifier by plotting the POD against the POFD for different threshold values ([0.0, 0.02, 0.04, 0.06, 0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60] in this article). The POD measures the accuracy of forecasting observed events as the ratio of correctly predicted events (hits) to the sum of hits and misses, that is, POD = hits/(hits + misses); it quantifies the ability of the forecast to capture actual occurrences. The POFD, on the other hand, assesses the incorrect forecasting of "no" events as "yes" as the ratio of false alarms to the total number of observed non-occurrences; it measures the rate of false positives in the forecast. A hit is defined as an RI event that was observed and predicted, a miss as an RI event that was observed but not predicted, and a false alarm as an RI event that was forecast to occur but did not occur. The POD ranges from 0 to 1, with 0 indicating that no actual RI events were correctly forecast and 1 indicating that all actual RI events were correctly forecast. The area under the ROC curve (AUC-ROC) is a common summary statistic derived from the ROC curve and provides an overall measure of the classifier's discriminatory power (the diagonal line indicates no skill; Mason, 1982).
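The POD/POFD bookkeeping and the AUC-ROC summary described above can be sketched as follows. Names are illustrative; a forecast counts as "yes" when the ensemble RI probability meets or exceeds the threshold.

```python
import numpy as np

def roc_points(ri_prob, ri_obs, thresholds):
    """POD and POFD at each probability threshold.

    ri_prob: forecast RI probabilities (e.g., fraction of members predicting RI)
    ri_obs:  binary observed RI occurrence
    """
    ri_prob = np.asarray(ri_prob, dtype=float)
    ri_obs = np.asarray(ri_obs, dtype=bool)
    pods, pofds = [], []
    for t in thresholds:
        yes = ri_prob >= t
        hits = np.sum(yes & ri_obs)
        misses = np.sum(~yes & ri_obs)
        false_alarms = np.sum(yes & ~ri_obs)
        correct_negs = np.sum(~yes & ~ri_obs)
        pods.append(hits / (hits + misses))                 # POD
        pofds.append(false_alarms / (false_alarms + correct_negs))  # POFD
    return np.array(pods), np.array(pofds)

def auc(pods, pofds):
    """Area under the ROC curve via the trapezoidal rule."""
    pods, pofds = np.asarray(pods), np.asarray(pofds)
    order = np.lexsort((pods, pofds))  # sort by POFD, breaking ties by POD
    x, y = pofds[order], pods[order]
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))
```

An AUC of 0.5 corresponds to the no-skill diagonal; 1.0 corresponds to a perfect discriminator.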
Finally, the forecasts of 21 RI-TCs prior to their RI timing were used to assess the uncertainty in RI timing. The large uncertainty in the exact timing of RI can result in significant inconsistencies between forecasts and observations, even when errors are minimal (see Figures S2c and S3c). The correlation coefficient between forecasts and observations therefore served as a vital tool for evaluating consistency during RI periods. Additionally, the time window between the earliest and latest forecast RI timings was employed to represent the range of uncertainty in RI timing. Considering the variability in RI periods among different TCs, we used the ratio of this time window to the period of RI as a measure of the uncertainty in RI timing (Equation 8). If the time window for forecast RI is greater than the period of RI, that is, if the uncertainty value exceeds 1, the forecast is highly uncertain (the timing of RI may not be accurately predicted anywhere within the RI period).
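The uncertainty measure described above (Equation 8 in the article) reduces to a one-line ratio. The helper below is a hypothetical sketch matching the prose definition.

```python
import numpy as np

def ri_timing_uncertainty(forecast_ri_times, obs_ri_start, obs_ri_end):
    """Uncertainty in RI timing (cf. Equation 8 of the article).

    forecast_ri_times: RI onset times (hours) from the individual forecasts
                       that do predict RI
    obs_ri_start/end:  bounds (hours) of the observed RI period
    Returns the ratio of the forecast time window to the observed RI
    duration; values above 1 mean the spread of forecast RI timings
    exceeds the RI period itself.
    """
    t = np.asarray(forecast_ri_times, dtype=float)
    window = t.max() - t.min()  # earliest-to-latest forecast RI timing
    return window / (obs_ri_end - obs_ri_start)
```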

| Performance of TC track and intensity forecasts
To assess the performance of the seven GEPSs in terms of TC prediction, we examined all 45 TCs that developed over the northwest Pacific during 2021–2022. The number of forecasts available varied between centers, and the differences in sample sizes are depicted in Figure S1. Despite these variations across centers and forecast times, the available datasets remain large enough to effectively represent average forecast performance. Figure 1a shows the track forecast skills of the seven GEPSs, with solid lines representing the RMSE and dashed lines representing the spread. All RMSEs for the track predictions are <800 km at a lead time of 120 h. The JMA GEPS has the lowest RMSE at lead times of <30 h, whereas the RMSEs of the ECMWF and UKMO GEPS predictions are lower than those of the other GEPSs at lead times of >30 h. The ECCC GEPS predictions have the largest spread at lead times of >30 h. During the first 30 h, the ratio of RMSE to spread is closest to 1 for the KMA GEPS; the UKMO GEPS exhibits the closest ratio between 30 and 60 h, and the NCEP GEPS after 60 h. None of these GEPSs is superior across all forecast lead times (Figure 1d).
We also evaluated the TC intensity forecasts from the seven GEPSs. As indicators of TC intensity, we considered both the maximum surface wind speed (MSW) and the minimum sea level pressure (MSLP). The NCEP GEPS predicted MSW and MSLP with a smaller RMSE, and a larger spread, than the other GEPSs (Figure 1b,c).
F I G U R E 4 Reliability diagrams of the seven GEPSs over the WP basin for forecast periods of (a) 0–24 h and (b) 12–36 h; the black-dotted diagonal line represents perfect reliability, and the table to the right of each panel gives the number of cases in each bin. Relative operating characteristic (ROC) curves for forecast periods of (c) 0–24 h and (d) 12–36 h; the black-dotted line indicates no skill. Brier skill score (BSS) for forecast periods of (e) 0–24 h and (f) 12–36 h. The colors indicate the different forecast centers: purple for BoM, light blue for ECCC, yellow for ECMWF, orange for JMA, pink for KMA, red for NCEP, and blue for UKMO.

In contrast to the track forecasts, the magnitude of the ensemble spread for the intensity forecasts is small compared with the RMSE, which indicates that the GEPSs are less reliable for TC intensity forecasting than for track forecasting. Overall, the NCEP GEPS produced the most reliable TC intensity forecasts over the WP (Figure 1e,f).
Previous studies have established that the factors influencing TC intensity vary among basins (Kaplan et al., 2010; Shu et al., 2012); we therefore analyzed TC forecasting skill for the EP and NA basins during 2021–2022 by considering the 36 and 35 TCs that occurred in those basins, respectively (Figures S1 and S2). Only four of the GEPSs generated TC forecast data for both the EP and NA: ECCC, NCEP, ECMWF, and UKMO. In the EP basin, the ECMWF GEPS yielded the lowest RMSE for the TC track forecasts, and the ECCC GEPS yielded the greatest spread (Figure 2a). The NCEP, UKMO, and ECMWF GEPSs had comparable RMSE-to-spread ratios (Figure 2d). TC intensity forecasts from the NCEP GEPS had the smallest RMSE for MSW and MSLP, and their spread was the largest among these GEPSs (Figure 2b,c). The TC track forecast results were similar in the NA basin, where the ECMWF GEPS predictions had the lowest RMSE and the ECCC GEPS predictions had the largest spread (Figure 3a,d). For TC intensity forecasting, the NCEP GEPS had the lowest RMSE for MSW, while the RMSE of the NCEP GEPS MSLP predictions was slightly higher than that of the ECMWF GEPS at lead times of <30 h but the lowest of all the GEPSs at longer lead times (Figure 3b,c).
In conclusion, the NCEP GEPS performed well for TC intensity prediction across all three basins. However, the differences between the RMSE and spread were large for the TC intensity forecasts, indicating a lower inherent reliability than for the TC track forecasts.

| Performance of RI forecasts
In the preceding section, we compared the forecasts generated by the seven GEPSs with observations over the WP, EP, and NA regions during 2021–2022. The comparison aimed to verify the performance of the intensity forecasts, which remains a major challenge for operational forecasting centers (Cangialosi et al., 2020). Given that the limited horizontal resolution of these GEPSs restricts their ability to predict large intensity changes, probabilistic forecasts of RI also provide useful guidance for comparing these models. In this section, we first examine the performance of the RI probabilistic forecasts of these systems over the WP basin for forecast periods of 0–24 and 12–36 h, as shown in Figure 4. Figure 4a,b shows reliability diagrams for probabilistic RI forecasts for the 0–24 and 12–36 h periods, respectively. The majority of samples are concentrated within the bins of low forecast probability. Within this range, the reliability curves of these models lie above the diagonal line, indicating that RI happened more frequently than the forecast probabilities implied (i.e., the forecasts were under-confident). Additionally, the NCEP GEPS exhibited the largest AUC-ROC and the highest BSS value at both forecast times (Figure 4c–f), demonstrating the highest RI probabilistic forecasting skill among these GEPSs. However, it is important to note that, although the NCEP GEPS exhibits some forecasting skill, that skill remains relatively low, and most models are still unable to predict RI effectively. Similar results were observed in the NA and EP basins (Figures 5 and 6).
According to Judt and Chen (2016), the likelihood of RI is more predictable during the early stages of the TC life cycle (e.g., Figures S2 and S3). However, deterministic forecasting, especially of RI timing, remains challenging owing to the limitations imposed by multiscale interactions. Therefore, a subset of 21 TCs that experienced RI (details are presented in Table S1) was selected for further analysis of their forecasts prior to the RI timing. The RMSEs and spreads for the MSW and MSLP predictions are presented in Figure 7a,b, respectively. Among the seven GEPSs, the ECMWF and NCEP GEPSs yield the lowest RMSEs, with minimal differences between the two. The ratios of RMSE to spread for the ECMWF, NCEP, and UKMO GEPSs are close to 1, indicating their reliability in terms of RI forecasting (Figure 7c,d).
F I G U R E 8 The uncertainty in the rapid intensification (RI) timing of the seven GEPSs, defined by Equation (8), is represented by transparent bars, while the probability of detection (POD) of these GEPSs is indicated by black-filled bars.
In addition to accuracy (represented by the RMSE) and reliability (represented by the ratio), we also considered the correlation coefficient between observations and RI forecasts. We found significant positive correlations (at the 95% level) between observations and MSW forecasts from the ECMWF and NCEP GEPSs (Figure 7e), and between observations and MSLP forecasts from the ECCC, ECMWF, and NCEP GEPSs (Figure 7f). Predictions from the NCEP GEPS have the highest correlation coefficient for both variables, indicating that this model outperforms the other GEPSs. Finally, we found that the uncertainty in RI timing (defined by Equation 8) is much larger than 1 for all of these GEPSs, indicating a lack of capability in forecasting RI timing (Figure 8). Furthermore, comparison with the POD suggests that as the prediction of RI occurrence becomes more accurate, the uncertainty in RI timing also increases. In fact, GEPSs such as the KMA and BoM GEPSs show almost no RI prediction capability (POD close to zero), leaving too few samples to adequately reflect their true uncertainty. In conclusion, although some models, such as the NCEP GEPS, exhibit a certain level of RI prediction skill, current GEPSs still lack the ability to accurately forecast RI timing.

| CONCLUSIONS AND DISCUSSION
This study focuses on the verification and comparison of the latest generation of operational GEPSs for TC intensity forecasts, particularly for the RI period. The analysis is based on all named TCs over the WP, EP, and NA basins during 2021–2022. The results show that, compared with other GEPSs, the NCEP GEPS yields a lower RMSE and an RMSE-to-spread ratio closer to 1 for intensity forecasts in the WP basin. In both the EP and NA regions, the NCEP GEPS also demonstrates the best performance in TC intensity forecasting, although it is still not highly reliable, with ratios well above 1. In addition to the intensity forecasts, the intensity changes of TCs, particularly during RI periods, are of great interest. To assess the probabilistic forecasts of RI among these GEPSs, reliability diagrams, ROC curves, and the BSS were used to compare forecast performance over the forecast periods of 0–24 and 12–36 h. Our analysis reveals that the NCEP GEPS exhibits the highest skill in probabilistic RI forecasting for both forecast periods.
The commendable performance of the NCEP GEPS, in comparison with other GEPSs, can be attributed to an upgrade in late 2020 that included a new dynamical core, higher resolution, a larger ensemble size, a longer forecast length, and the implementation of new model uncertainty schemes. This upgraded system exhibits significant improvements in TC intensity forecasting relative to the previous version (Zhou et al., 2022). As indicated by Chen, Lin, Zhou, et al. (2019) and Chen, Lin, Magnusson, et al. (2019), this enhancement in intensity forecasting capability might be attributed to improvements in both the dynamical core and the microphysics components. However, the analysis in this study reveals limited skill in RI forecasting and significant uncertainty in the timing of RI, despite the capability of the NCEP GEPS to predict RI to some extent. Considering that RI timing is predominantly influenced by multiscale interactions (Judt & Chen, 2016), improving model resolution remains a viable approach to enhancing forecasting skill. Hence, the upcoming upgrade of the ECMWF GEPS in June 2023, with a horizontal resolution of 9 km, holds the potential for significant improvement in RI prediction. Overall, this comprehensive verification and comparison of the latest GEPSs for TC intensity forecasts contributes to improving forecast accuracy and forecast services, and aids researchers in enhancing GEPSs.

ORCID
Deyu Lu https://orcid.org/0000-0001-8322-5296

F I G U R E 1 Comparison of the forecast skills of the seven GEPSs for named TCs in the WP basin in 2021–2022. Plots (a–c) show the RMSE with a solid line and the spread with a dotted line for: (a) TC track forecasts, (b) MSW forecasts, and (c) MSLP forecasts. Plots (d–f) show the ratio of RMSE to ensemble spread for: (d) TC track forecasts, (e) MSW forecasts, and (f) MSLP forecasts. The colors indicate the different forecast centers: purple for BoM, light blue for ECCC, yellow for ECMWF, orange for JMA, pink for KMA, red for NCEP, and blue for UKMO.

F I G U R E 3 As in Figure 1, but for all named TCs over the NA in 2021–2022. The colors indicate the different forecast centers: light blue for ECCC, yellow for ECMWF, red for NCEP, and blue for UKMO.

F I G U R E 5 As in Figure 4, but for the EP basin. The colors indicate the different forecast centers: light blue for ECCC, yellow for ECMWF, red for NCEP, and blue for UKMO.

F I G U R E 6 As in Figure 4, but for the NA basin. The colors indicate the different forecast centers: light blue for ECCC, yellow for ECMWF, red for NCEP, and blue for UKMO.

F I G U R E 7 Comparison of the forecast skills of 21 RI-TCs prior to their RI timing. Plots (a,b) show the RMSE with a solid line and the spread with a dotted line for: (a) MSW forecasts and (b) MSLP forecasts. Plots (c,d) show the ratio of RMSE to ensemble spread for: (c) MSW forecasts and (d) MSLP forecasts. Plots (e,f) show the correlation coefficients of these GEPSs between the observed and mean forecast (e) MSW and (f) MSLP; the black horizontal lines in (e) and (f) indicate the 95% significance level. The colors indicate the different forecast centers: purple for BoM, light blue for ECCC, yellow for ECMWF, orange for JMA, pink for KMA, red for NCEP, and blue for UKMO.
T A B L E 1 Details of the seven global ensemble prediction systems.