Most seasonal forecasts of Atlantic tropical storm numbers are produced using statistical-empirical models. However, forecasts can also be made using numerical models which encode the laws of physics, here referred to as “dynamical models”. Based on 12 years of re-forecasts and 2 years of real-time forecasts, we show that the so-called EUROSIP (EUROpean Seasonal to Inter-annual Prediction) multi-model ensemble of coupled ocean-atmosphere models has substantial skill in probabilistic prediction of the number of Atlantic tropical storms. The EUROSIP real-time forecasts correctly distinguished between the exceptional year of 2005 and the average hurricane year of 2006. These results have implications for the reliability of climate change predictions of tropical cyclone activity using similar dynamically-based coupled ocean-atmosphere models.
 Publicly-available forecasts of seasonal Atlantic tropical cyclone activity include the Colorado State University forecasts from P. J. Klotzbach and W. M. Gray (referred to as CSU forecasts; see Extended range forecast of Atlantic seasonal hurricane activity and U.S. landfall strike probability for 2006, available at http://tropical.atmos.colostate.edu/forecasts/2006/june2006), Tropical Storm Risk [Lea and Saunders, 2006] (referred to as TSR forecasts; see TSR forecasts, http://tropicalstormrisk.com) and NOAA (NOAA outlook, http://www.cpc.ncep.noaa.gov/products/outlooks/hurricane.shtml). Each of these is based on statistical models which encode empirical correlations between Atlantic tropical cyclone activity and precursor climatic predictors. The CSU forecasts also include an analog method and a final adjustment. In spring 2006, there was a consensus among these statistical-empirical forecasts that the 2006 Atlantic season would be active, though less so than the record-breaking 2005 season. These forecasts proved inaccurate: the 2006 season, with 10 tropical storms, was slightly below the long-term climatology of 10.4 from 1958 to 2001. As highlighted in the media [Stevenson, 2006; Johnston, 2006], a number of insurance companies that took action based on these forecasts lost billions of dollars. Saunders attributes the low tropical storm activity in 2006 to El Niño conditions and the presence of African dry air and Saharan dust.
 Dynamical coupled ocean-atmosphere climate models can also be used to predict seasonal tropical cyclone activity. Such global climate forecasts are made from sets of initial conditions preceding the onset of the hurricane season. These initial states are determined from worldwide atmosphere and ocean observations, which are assimilated into the dynamical model. State-dependent uncertainty in the resulting forecasts is estimated by running ensembles of integrations in which the uncertainties in both initial conditions and model equation sets are represented explicitly [Palmer and Hagedorn, 2006]. There exist a number of different methods to represent model uncertainty [Palmer et al., 2005; Collins, 2007]. Here we focus on the use of the so-called multi-model technique, pioneered on the seasonal timescale by the DEMETER project [Palmer et al., 2004].
 Although the horizontal resolution of global operational dynamical seasonal forecasting models is generally insufficient to simulate the intensity of hurricanes, simulated tropical cyclonic systems are nevertheless realistic in other respects. For example, they develop a warm temperature anomaly above the centre of the vortex, which is a characteristic of observed tropical storms. Hindcast experiments have shown that dynamical models forced by observed SSTs [Vitart et al., 1997; Camargo et al., 2005] and fully coupled ocean-atmosphere models [Vitart, 2006] can simulate the inter-annual variability in Atlantic tropical storm activity.
2. The EUROSIP System
 The success of the DEMETER project led directly to the development of the operational EUROSIP multi-model ensemble. EUROSIP presently consists of 3 seasonal forecasting systems from ECMWF, Met Office and Météo-France, and global probability forecasts are produced routinely for several variables. Each forecasting system is run in real-time with initial conditions from the 1st of each month and the forecasts are issued on the 15th of the month (the delay allows acquisition of SST fields from the previous month, time to run the forecasts, and a margin to ensure a reliable operational schedule). In real-time mode, the ensemble size for each model is either 40 or 41. In addition, each coupled ocean-atmosphere model was used to produce a set of re-forecasts initialized with the ocean and atmosphere analyses of the 1st of the month for each year (the SSTs are predicted by the coupled system). The number of re-forecasts varies between models. The re-forecasts share a common period of 1993–2004, with an ensemble size smaller than in real-time, but of at least 5 members.
 The tropical storms produced by each model component of EUROSIP are tracked using the method described by Vitart and Stockdale . Since the component models have biases which can vary from one model to another, the number of model tropical storms is calibrated a posteriori using a set of past integrations (see Appendix A).
 A large portion of the seasonal variability of Atlantic tropical storms is associated with sea surface temperature (SST) variations over the tropical Pacific [Gray, 1984; Shapiro, 1987; Goldenberg and Shapiro, 1997] and Atlantic [Saunders and Harris, 1997; Goldenberg and Shapiro, 1997; Landsea et al., 1999]. Therefore it is crucial for a coupled dynamical system to predict correctly the seasonal variability of SSTs in these two regions. Figure 1 displays the SSTs averaged over the peak August–October period of Atlantic tropical storm activity over the NINO3 region (top panel), often used as an index for ENSO activity, and over the hurricane main development region over the Atlantic (bottom panel), as predicted by the ensemble mean of EUROSIP after calibration for predictions starting from 1st June. The ensemble 2 standard deviation range is also indicated. EUROSIP has clear skill in predicting the evolution of SSTs in both regions, with a correlation between the EUROSIP ensemble mean and verification of 0.92 (p-value of 0.00001) over the NINO3 region and 0.81 (p-value of 0.001) over the Atlantic. Those correlations are larger than the correlations obtained by persisting SST anomalies from the previous month (respectively 0.47 and 0.73 which have p-values of 0.07 and 0.003 respectively).
Figure 2 shows the EUROSIP re-forecasts of Atlantic tropical storm frequency (produced a posteriori) for the common re-forecast period 1993–2004 and the forecasts for 2005 and 2006, along with the observed frequency of Atlantic tropical storms for the period July to November (the forecasts are issued mid-June). The multi-model displays some skill in reproducing the observed inter-annual variability of Atlantic tropical storms, successfully predicting the intense tropical storm activity observed in 1995 and 2005. Table 1 shows the linear correlation and RMS error between the inter-annual variability of the multi-model ensemble median and the observed frequency of Atlantic tropical storms over the full season, along with the same statistics for CSU and TSR forecasts over the same period. For the CSU forecasts, we use the modified final forecasts which were issued in real time. Re-forecasts from TSR were used to cover the full period 1993–2006 and are produced the same way as the real-time forecasts. Table 1 suggests that over the period 1993–2006, EUROSIP had higher skill than CSU and TSR forecasts. In particular, EUROSIP displays an RMS error substantially smaller than the statistical models despite the fact that it displays more variance (2.91, compared with 2.64 for TSR and 2.15 for CSU) during this period. If we remove the year 2006, the TSR and CSU linear correlations increase significantly (to 0.77 and 0.67 respectively), but still remain lower than EUROSIP (0.815), and the RMS error of both statistical models is still much higher than with EUROSIP. Table 1 also indicates that the EUROSIP multi-model produces better scores, particularly for RMS errors, than the individual model components, in agreement with Vitart . The dynamical models also perform better than the statistical methods over a longer period of time (20 years instead of 14 years) (Table 2).
A bootstrap re-sampling procedure with 10,000 samples indicates that the difference in skill between EUROSIP and CSU, and between EUROSIP and TSR, is significant at the 1% level for both time periods. However, the comparison may not be entirely fair for the CSU forecasts, since the statistical model used by CSU before the final adjustment has changed several times in the past. In addition, the EUROSIP forecasts are issued slightly later than the TSR and CSU forecasts.
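The text does not spell out how the bootstrap was set up, so the following is only a minimal sketch of one common variant: a paired bootstrap over years applied to the RMS error difference between two forecast systems. The function names, the one-sided test and the choice to resample whole years are assumptions made here for illustration, and the input numbers are placeholders rather than values from Table 1.

```python
import math
import random


def rmse(forecast, observed):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / n)


def pearson(x, y):
    """Linear (Pearson) correlation; undefined if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_rmse_pvalue(fc_a, fc_b, observed, n_boot=10_000, seed=0):
    """One-sided paired bootstrap (assumed design, not the paper's recipe).

    Years are resampled with replacement; the returned value is the fraction
    of resamples in which system A does NOT have a lower RMSE than system B.
    """
    rng = random.Random(seed)
    n = len(observed)
    not_better = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        a = [fc_a[i] for i in idx]
        b = [fc_b[i] for i in idx]
        o = [observed[i] for i in idx]
        if rmse(a, o) >= rmse(b, o):
            not_better += 1
    return not_better / n_boot


# Placeholder data for illustration only (NOT taken from the paper).
obs = [8, 7, 19, 13, 8, 14, 12, 15]
system_a = [o + (1 if i % 2 else -1) for i, o in enumerate(obs)]  # errors of 1
system_b = [o + (4 if i % 2 else -4) for i, o in enumerate(obs)]  # errors of 4

p_value = bootstrap_rmse_pvalue(system_a, system_b, obs)
print(rmse(system_a, obs), rmse(system_b, obs), p_value)
```

A small returned fraction would indicate that the lower RMS error of system A is unlikely to be an artefact of which years happen to be in the sample. The same loop could be applied to the correlation difference by swapping `rmse` for `pearson`, provided resamples with a constant series are handled separately.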
Table 1. Linear Correlation and Root Mean Square Error Between the Number of Atlantic Tropical Storms Predicted by Each Individual Dynamical Model, EUROSIP, TSR and CSU and the Observed Number of Tropical Storms During the Whole Atlantic Hurricane Season Over the Period 1993–2006
Note: Observed number of tropical storms is taken from http://www.nhc.noaa.gov. Numbers in parentheses correspond to the p-values of the correlation.
Table 2. Same as Table 1 but Over the Period 1987–2006
Note: Only the Met Office and ECMWF seasonal forecasting systems have re-forecasts covering the period 1987–2006. Numbers in parentheses correspond to the p-values of the correlation.
 EUROSIP ensemble forecasts can be issued as probabilities (for instance the probability that the number of tropical storms will be above normal), but a large number of cases (larger than the current size of the model re-forecast common period 1993–2004) is needed to validate the reliability of the probability forecasts of tropical cyclone numbers. However, a version close to EUROSIP was integrated over a period of 49 years for 6 months starting on 1st May for the DEMETER project [Palmer et al., 2004].
 The reliability of DEMETER Atlantic activity forecasts is measured using so-called attributes diagrams [Hsu and Murphy, 1986; Wilks, 2005], which show the conditional relative frequency of occurrence of an event as a function of its forecast probability (see Figure 3), with a perfectly reliable system having data close to the diagonal. For purposes of assessment, forecasts of above and below normal activity were grouped into three categories according to the forecast probability: 0–33.3%, 33.3–66.6% and 66.6–100%. Figure 3 suggests that the multi-model ensemble forecasts have more skill than the trivial climatological information in predicting the probability of a more intense or less intense Atlantic tropical storm season.
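The quantities behind an attributes diagram can be sketched as follows. This is a minimal illustration, assuming that each forecast is a probability between 0 and 1 and each outcome is 1 if the event occurred and 0 otherwise; the function name, the handling of bin edges and the sample numbers are hypothetical and are not taken from the DEMETER data.

```python
def reliability_table(forecast_probs, outcomes, edges=(0.0, 1 / 3, 2 / 3, 1.0)):
    """Group forecasts into probability bins and summarise each bin.

    Returns one tuple per bin: (number of forecasts, mean forecast
    probability, observed relative frequency). Empty bins give (0, None, None).
    Bins are half-open [lo, hi), except the last, which includes 1.0.
    """
    rows = []
    last_bin = len(edges) - 2
    for k in range(len(edges) - 1):
        lo, hi = edges[k], edges[k + 1]
        members = [
            (p, o)
            for p, o in zip(forecast_probs, outcomes)
            if lo <= p < hi or (k == last_bin and p == hi)
        ]
        if not members:
            rows.append((0, None, None))
            continue
        n = len(members)
        mean_prob = sum(p for p, _ in members) / n
        observed_freq = sum(o for _, o in members) / n
        rows.append((n, mean_prob, observed_freq))
    return rows


# Placeholder forecasts and outcomes for illustration only.
probs = [0.1, 0.2, 0.5, 0.5, 0.9, 1.0]
occurred = [0, 0, 1, 0, 1, 1]

for count, mean_prob, freq in reliability_table(probs, occurred):
    print(count, mean_prob, freq)
```

Plotting the observed relative frequency against the mean forecast probability for each bin gives the points of the diagram; for a reliable system they lie near the diagonal.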
3. The 2005 and 2006 Seasons
 Global forecasts from EUROSIP have been produced in real-time each month since early 2005. This means that two real-time seasonal forecasts of the number of Atlantic tropical storms have been produced so far: the 2005 and the 2006 seasons. Table 3 shows the forecast of Atlantic tropical storms for each individual model and for EUROSIP for 2005 and 2006.
Table 3. Forecasts of the Number of Atlantic Tropical Storms Issued in June 2005 and 2006 for Each Component Model of EUROSIP and for the Multi-Model Ensemble, Along With Forecasts From CSU and TSR Issued in June and Forecasts From NOAA Issued in May
Note: If we had used the median as in Figure 2 instead of the mean, the EUROSIP values would have been 18 and 12 for 2005 and 2006 respectively.
 2005 was a record season for Atlantic tropical storm activity, with 27 tropical storms, which is well above the 1965–2004 climatology (10.6 Atlantic tropical storms per year). The multi-model forecast (16.2 Atlantic tropical storms) made in June 2005 was more accurate than the statistical predictions from CSU (15 tropical storms) and TSR (about 14 tropical storms). Although the ensemble mean and median of EUROSIP are well below the number of observed tropical storms in 2005, EUROSIP predicted probabilities of an extreme season in 2005 (37% and 13% probability for the number of Atlantic tropical storms to be above 20 and 27 respectively) higher than the probabilities produced by EUROSIP for those thresholds during the re-forecast period 1993–2004 (18% and 4.2% probabilities respectively). In June 2006, EUROSIP predicted 12.1 tropical storms. This forecast issued in June was in contrast to the forecasts from statistical models, which predicted a very active hurricane season. For instance, CSU and TSR predicted 17 and about 14 tropical storms, respectively. Although EUROSIP predicted higher tropical storm activity than observed (12.1 instead of 10), the multi-model forecast proved more accurate than the statistical models in this case.
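The threshold probabilities quoted above can be read as the fraction of ensemble members exceeding a given storm count. The sketch below assumes exactly that interpretation, with a strict inequality; the function name and the member counts are hypothetical placeholders, not EUROSIP output.

```python
def exceedance_probability(member_counts, threshold):
    """Fraction of ensemble members with a storm count strictly above threshold."""
    if not member_counts:
        raise ValueError("ensemble must contain at least one member")
    return sum(1 for count in member_counts if count > threshold) / len(member_counts)


# Placeholder calibrated storm counts for an 8-member ensemble (illustration only).
ensemble = [10, 14, 16, 18, 21, 22, 25, 28]

print(exceedance_probability(ensemble, 20))  # 4 of 8 members exceed 20
print(exceedance_probability(ensemble, 27))  # 1 of 8 members exceeds 27
```

Comparing such a probability with the one obtained from the re-forecast period shows how unusual a given forecast is: with the values quoted above, 37% against 18% is roughly a factor of two, and 13% against 4.2% roughly a factor of three.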
4. Further Implications
 Because of the coarse resolution of the dynamical models, the simulated tropical storm tracks tend to be unrealistically short (see for example Figure 4a) and the simulated storms are weaker than observed. Because of this, the EUROSIP forecasts are at present limited to the frequency of tropical storms, unlike CSU and TSR forecasts, which are more detailed and include the prediction of the risk of US landfall and the frequency of intense hurricanes. However, planned increases in the resolution of EUROSIP seasonal forecasting models (from about 200 km to about 100 km resolution) might make it possible to produce direct forecasts of tropical storm landfall, although the models still produce significantly less intense hurricanes than observed. Alternatively, statistical techniques could be applied to infer landfall probabilities from model outputs. An illustration of the expected impact of increased resolution is given in Figure 4b.
 In conclusion, this paper describes an inherently different approach from publicly available statistical methods to predict the frequency of tropical storms. Here we use a state-of-the-art dynamical seasonal forecasting system based directly on the laws of physics, which for tropical cyclone activity benefits from the skill of dynamical models to predict ENSO events [van Oldenborgh et al., 2005] and Atlantic SSTs in particular. Results suggest that this method performs better than current publicly available statistical methods. In particular, EUROSIP predicted in June a 2006 season that would not be much different from climatology, unlike the statistical methods, which predicted an active season, causing some insurance companies to lose billions of dollars. The dynamical forecasts discussed in this paper represent a viable alternative to the statistical methods with tropical storm forecasts available over all ocean basins, since the dynamical models are global. Indeed, tropical storm forecasts are just one of the many outputs of dynamical seasonal forecasting systems.
 Another important conclusion of the present paper is that the dynamical multi-model ensemble technique produces probabilistic forecasts of the number of Atlantic tropical storms which have some reliability. Since we use the same class of dynamical models as in the Intergovernmental Panel on Climate Change (IPCC), the investigation of the reliability of seasonal forecasts of hurricane frequency provides an opportunity to assess confidence in predictions of changing frequency of Atlantic tropical storms with climate change.
 Dynamical climate forecasting of tropical storms is primarily serendipitous and an outcome of ENSO-focused research that has led to dynamical coupled ocean-atmosphere forecast model development. As such, future improvements in skill cannot be guaranteed until more targeted research is undertaken to understand the full range of processes within the dynamical models.
Appendix A: Model Calibration
 Dynamical models tend to drift towards a climate that is somewhat different from the observed climate. The effect of the drift on the model calculations is estimated from previous integrations of the model in previous years (the re-forecast). The drift is then removed from the model solution a posteriori (the calibration). For most model variables, including SSTs, the drift is treated as a bias and removed additively. We calibrate the number of tropical storms in a given year by multiplying the number of model storms by a factor such that the median of the model climate equals the median of the observed climate. The calibration of the re-forecast is performed using cross-validation and independently for each model.
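The multiplicative calibration described above can be sketched as follows. The text says only that cross-validation is used, so the leave-one-year-out scheme here is an assumption, as are the function names and the placeholder storm counts.

```python
import statistics


def calibration_factor(model_counts, observed_counts):
    """Factor that scales the median of the model climate to the observed median.

    Raises ZeroDivisionError if the model median is zero.
    """
    return statistics.median(observed_counts) / statistics.median(model_counts)


def calibrate_cross_validated(model_counts, observed_counts):
    """Calibrate each year with a factor estimated from all other years.

    Leave-one-out cross-validation is assumed here; both inputs are lists of
    yearly storm counts for a single model, in the same year order.
    """
    calibrated = []
    for i, raw_count in enumerate(model_counts):
        # Exclude the target year from the climate used to derive the factor.
        train_model = model_counts[:i] + model_counts[i + 1:]
        train_observed = observed_counts[:i] + observed_counts[i + 1:]
        calibrated.append(raw_count * calibration_factor(train_model, train_observed))
    return calibrated


# Placeholder counts for illustration only: this model produces half as many
# storms as observed, so every calibration factor comes out as 2.
model = [2, 4, 6, 8, 10]
observed = [4, 8, 12, 16, 20]

print(calibrate_cross_validated(model, observed))
```

Because the factor is estimated without the year being calibrated, an exceptional year cannot inflate its own calibrated value. In a multi-model setting the function would be called separately for each model.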
 The real-time forecasts of 2005 and 2006 were produced using the mean. The median is, however, a more stable measure of the centre of the distribution, since the model distributions of tropical storm frequency tend to be strongly skewed towards very large values. The observed distribution of tropical storms is also far from normal, and the extremely active 2005 tropical storm season poses a problem of stability when performing cross-validation with the ensemble mean. Most results are calculated using the median, although we also give the actual forecasts issued in real-time (Table 3). We plan to issue future real-time forecasts using the median instead of the mean.
 The authors wish to thank the two anonymous reviewers whose comments proved invaluable in improving the presentation of the material.