A globally consistent reanalysis of hurricane variability and trends



[1] Recently documented trends in the existing records of hurricane intensity and their relationship to increasing sea surface temperatures suggest that hurricane intensity may be increasing due to global warming. However, it is presently being argued that the existing global hurricane records are too inconsistent to accurately measure trends. As a first step in addressing this debate, we constructed a more homogeneous global record of hurricane intensity and found that previously documented trends in some ocean basins are well supported, but in others the existing records contain trends that may be inflated or spurious.

1. Introduction

[2] The relationship between global warming and trends in hurricane activity is presently a topic of active research and debate, and much of the debate is rooted in questions about the suitability of the hurricane records that have been used to identify these trends [Landsea et al., 2006]. These “best track” records [Jarvinen et al., 1984; Chu et al., 2002] comprise global historical measures of hurricane position and intensity. Intensity is defined in terms of sustained surface wind speed, although the details of this definition can vary according to the protocols of individual forecast offices. Teams of forecasters update the best track data at the end of the hurricane season in each ocean basin using data collected during and after each hurricane's lifetime (tropical cyclones are known by different names in the various ocean basins, but here we are using the term “hurricane” in a generic sense). The variability of the available data combined with long time-scale changes in the availability and quality of observing systems, reporting policies, and the methods utilized to analyze the data make the best track records inhomogeneous by construction. Temporal consistency is sacrificed in favor of best possible absolute accuracy at every period during the lifetime of each hurricane.

[3] After the advent of global monitoring with geostationary satellites in the mid to late 1970's, metrics related to hurricane frequency are generally considered accurate, but the known lack of homogeneity in both the data and techniques applied in the post-analyses has resulted in skepticism regarding the consistency of the best track intensity estimates. As a first step toward addressing this shortcoming, we constructed a more homogeneous data record of hurricane intensity by first creating a new consistently analyzed global satellite data archive from 1983 to 2005 [Knapp and Kossin, 2007] and then applying a new objective algorithm to the satellite data to form hurricane intensity estimates. Our new homogeneous record of hurricane intensity is denoted as the UW/NCDC (University of Wisconsin-Madison/National Climatic Data Center) record. Where the best track records sacrifice consistency in favor of best possible absolute accuracy, our new record sacrifices best possible absolute accuracy for temporal consistency. It is important then to note that the UW/NCDC record serves as a complement to the best track, and not as a replacement.

2. Data and Algorithm Development: Objective Estimation of Hurricane Intensity

[4] The new satellite data archive described by Knapp and Kossin [2007] was constructed at NCDC through a careful reanalysis of 23 years of global geostationary infrared satellite imagery (July 1983 to December 2005) to remove sources of time-dependent biases. The archive comprises ∼169,000 observations in more than 2,000 tropical storms. The spatial and temporal resolution of the imagery was made uniform at 8 km and 3 h which are the coarsest resolutions of the earliest data. Each image was re-positioned on the hurricane center at that time using the center-positions from the best track. Here we restricted the data to only include fixes that were over water and between 45°N and 45°S latitude.

[5] The infrared brightness temperature (Tb) fields were azimuthally averaged about the storm center to produce radial Tb profiles. To isolate the leading patterns of variability, an empirical orthogonal function (EOF) analysis was performed on the Tb profiles. The EOFs contain information about hurricane eye temperature (when an eye is present), the height of the convective eyewall clouds, and the average radial structure of cloudiness around the storm [cf. Kossin et al., 2007], and these factors are correlated with hurricane intensity. For example, warmer eye temperature and higher eyewall clouds (indicated by colder cloud-top Tb) are strongly related to greater intensity. This is the foundation of the Dvorak Enhanced Infrared (EIR) technique [Dvorak, 1984], which is utilized by all tropical forecast offices in every ocean basin to estimate hurricane intensity with geostationary infrared imagery. In the Dvorak EIR technique, eye Tb and cloud-top Tb are directly related to intensity with a look-up table. The EOFs represent these temperatures but also contain additional information about radial structure such as eye size and radial extent of the cold cloud-tops above the eyewall. This additional information is also related to intensity [Kossin et al., 2007].
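The azimuthal averaging step can be illustrated with a minimal sketch. The function name `radial_profile`, the one-pixel bin width, and the synthetic 5 × 5 grid are assumptions made for illustration only; they are not the processing code used to build the archive, and the subsequent EOF analysis is not shown.

```python
import math

def radial_profile(tb, center_row, center_col, n_bins, bin_width=1.0):
    """Azimuthally average a 2-D brightness-temperature grid about a center.

    tb is a list of rows in grid units (for example, 8 km pixels). Returns the
    mean Tb in each radial bin, or None for a bin that contains no pixels.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for r, row in enumerate(tb):
        for c, value in enumerate(row):
            radius = math.hypot(r - center_row, c - center_col)
            k = int(radius / bin_width)  # index of the radial bin
            if k < n_bins:
                sums[k] += value
                counts[k] += 1
    return [s / n if n else None for s, n in zip(sums, counts)]

# Synthetic storm (hypothetical values in kelvin): warm eye, cold eyewall
# ring, warmer surroundings.
def synthetic_tb(r, c):
    radius = math.hypot(r - 2, c - 2)
    if radius < 1:
        return 280.0
    if radius < 2:
        return 200.0
    return 240.0

grid = [[synthetic_tb(r, c) for c in range(5)] for r in range(5)]
profile = radial_profile(grid, 2, 2, n_bins=3)
print(profile)
```

Each image in the record would yield one such profile, and the collection of profiles is the input to the EOF analysis.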

[6] The algorithm used for estimating hurricane intensity was formed using a subset of the satellite data corresponding to best track intensity estimates that were contemporaneous with low-level aircraft reconnaissance in the Atlantic. This subset comprises 1,940 measurements in 137 storms from 1989 to 2004. The aircraft-measured intensities were used as ground truth for training the algorithm. After performing the EOF analysis on the entire sample of Atlantic Tb profiles, a multi-variate regression was formed using the aircraft-matched subsets of the time-dependent expansion coefficients [principal components (PCs)] of the analysis, along with storm latitude, local mean solar time in hours, and the logarithm of the age of the storm measured in 3-hourly periods since reaching named-storm status (maximum wind greater than 17 m s−1). These predictors were chosen based on a priori expectations: the relationship between Tb (represented by the PCs) and intensity is analogous to the basis of the Dvorak technique, and latitude has been shown to modify this relationship [e.g., Kossin and Velden, 2004] as well as affect radial structure and size [Kossin et al., 2007]. Storm age serves as a climatology predictor and allows the regression to distinguish between the very cold cloud-tops that tend to occur in nascent tropical storms, and the cold cloud-tops that occur later during the mature stage. Local solar time is included to address the diurnal cycle in the size of the cold-cloud canopy above storms [e.g., Kossin, 2002] and the diurnal cycle of eye size [Muramatsu, 1983], but this predictor adds only a small contribution.

[7] A stepwise regression technique was applied to the predictor pool with the requirement of significance above the 99% confidence level. This procedure identified nine predictors that were then used in the regression algorithm: the first 6 PCs, latitude, age, and local solar time. None of the higher-order PCs were significant. In our training sample (N = 1,940), the regression explains 64% of the variance of aircraft reconnaissance-measured intensity. The order (greatest to least) of relative contributions of the predictors to the regression is: PC1, age, PC3, PC4, PC5, PC2, latitude, PC6, and local solar time. Co-linearity between the predictors is not an issue: the PCs are orthogonal by construction and all other correlations between the PCs and the remaining predictors are insignificant.

[8] The algorithm was cross-validated using a jackknife procedure: each storm was individually removed from the full training sample and EOF analysis was performed on the sub-sample of remaining storms. The regression was then trained on the sub-sample and tested on the storm that was left out. This was done for all storms in the sample and the cumulative errors were tallied. The independently-derived error distribution is shown in Figure 1 and demonstrates reasonable skill of the algorithm. Figure 1 can be compared directly to Figure 8 of Velden et al. [2006], which is based on a very similar error analysis of operational Dvorak technique estimates from the National Hurricane Center [Brown and Franklin, 2004]. They compared Dvorak intensity estimates (from the period 1997–2003) with best track intensity that were contemporaneous with aircraft reconnaissance, and found that 90% of their (absolute) errors were less than 9 m s−1, 75% were less than 6 m s−1, and 50% were less than 3 m s−1. In comparison, 90% of the absolute errors of our algorithm were less than 12 m s−1, 75% were less than 8 m s−1, and 50% were less than 4 m s−1. Overall Root Mean Square (RMS) error was 6 m s−1 for the Dvorak estimates compared with 9 m s−1 using our algorithm.
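The storm-by-storm jackknife can be sketched as follows. This is a simplified illustration under stated assumptions: a single hypothetical predictor and an ordinary least-squares line stand in for the nine-predictor regression, the EOF analysis that was repeated on each sub-sample is omitted, and the function names and sample values are invented for the example.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def leave_one_storm_out(samples):
    """samples: list of (storm_id, predictor, observed_intensity).

    Each storm is withheld in turn, the regression is trained on the
    remaining storms, and the withheld storm is used for testing.
    Returns the absolute errors of all held-out predictions.
    """
    errors = []
    for held_out in sorted({s[0] for s in samples}):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        a, b = fit_line([s[1] for s in train], [s[2] for s in train])
        errors.extend(abs(a + b * x - y) for _, x, y in test)
    return errors

# Hypothetical, exactly linear data (intensity = 10 + 2 * predictor), so the
# held-out errors should vanish apart from rounding.
samples = [("A", 1, 12), ("A", 2, 14), ("B", 3, 16),
           ("B", 4, 18), ("C", 5, 20), ("C", 6, 22)]
errors = leave_one_storm_out(samples)
mae = sum(errors) / len(errors)
rms = math.sqrt(sum(e * e for e in errors) / len(errors))
print(mae, rms)
```

Withholding whole storms rather than individual fixes keeps the serially correlated observations of one storm out of both the training and the test set at the same time.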

Figure 1.

Storm-by-storm cross-validated intensity error distribution for our objective algorithm. Error is defined as the absolute difference between aircraft reconnaissance-based best track intensity and our estimated intensity. Mean Absolute Error (MAE), Root Mean Square error (RMS), and bias are shown.

[9] A physical explanation for the observed intensity trends in the best track has been posited using connections between upward trending tropical SST and maximum potential intensity (MPI) theory [Emanuel, 1988, 2005; Webster et al., 2005; Knutson et al., 2001; Trenberth, 2005; Hoyos et al., 2006]. Stated simply, the argument is that while there is no direct contemporaneous correlation between local SST and hurricane intensity (a hurricane can routinely spend its entire intensity evolution over relatively constant SST), an increase of SST does increase the maximum potential intensity, and over long enough time-scales this should be reflected by an increase on the extreme end of the hurricane intensity spectrum. To uncover this relationship, Emanuel [2005] used a Power Dissipation Index (PDI), which considers the cube of the maximum wind speed and thus accentuates the strongest cases, and Webster et al. [2005] considered the frequency of the most extreme intensities (Saffir-Simpson categories 4 and 5). It is important then that our algorithm capture maximum intensities well. Using the aircraft reconnaissance-based data introduced above, we compared the maximum intensity achieved by each individual storm to the maximum intensity estimated by our algorithm. The errors were normally distributed (bias = 0.005 m s−1, skewness = 0.34, and RMS error = 8.5 m s−1). The seasonally averaged maximum intensities have very small errors (RMS error = 2.7 m s−1) and the seasonal-mean time series of estimated maximum intensity correlates very strongly with the reconnaissance-based time series (r = 0.99). However, closer scrutiny revealed the potential for a problem — the algorithm does tend to under-estimate the strongest intensities. This result represents a potential weakness in the method, but we will provide countering evidence for the fidelity of the algorithm in the following section.

[10] It was previously noted that the operational Dvorak estimates are somewhat more accurate than our simple automated algorithm, and this suggests that a complete reanalysis of our new satellite data record using the Dvorak technique would increase the accuracy of the new intensity estimates. Such a reanalysis would be a significant undertaking however. The Dvorak technique requires a human analyst to follow a set of procedures that may take 3–5 minutes per satellite image. For our new global satellite imagery archive, this would require about 3,400–5,700 hours of labor. Given this constraint, the method applied here, while far from a panacea, is a reasonable first cut at the problem. Future analyses will hopefully expand on our work.

[11] Our algorithm was trained in the Atlantic, but since our goal was to create a consistent intensity record in lieu of a record with maximum absolute accuracy, we can apply the algorithm in any basin. The fundamental relationships between infrared imagery (as well as the other predictors) and intensity do not change among the various ocean basins, and the Dvorak technique is applied in much the same way everywhere. There are differences in the method for converting raw Dvorak technique output into intensity estimates, but these differences can only create a temporally constant bias and this will not affect the trend analyses shown in the next section. The high accuracy of the algorithm in the East Pacific, which will be demonstrated below, also suggests that the algorithm behaves predictably outside of the basin in which it was trained.

3. Results

[12] We first show comparisons between the UW/NCDC and best track records when applied to the North Atlantic [the Atlantic best track is constructed and maintained by personnel at the National Hurricane Center (NHC)]. Our goal in this work, as stated above, is to test the veracity of hurricane variability and trends in the best track, and we are not concerned with absolute contemporaneous comparisons between the records. To focus only on comparisons of variability and trends, all variables discussed here are first normalized by their means and standard deviations.
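The normalization can be sketched in a few lines. The use of the population standard deviation (`pstdev`) is an assumption, since the choice between sample and population forms is not stated here, and the input values are hypothetical.

```python
from statistics import mean, pstdev

def normalize(series):
    """Subtract the mean of the series and divide by its standard deviation."""
    m = mean(series)
    s = pstdev(series)  # assumption: population form
    return [(x - m) / s for x in series]

# Hypothetical seasonal values with mean 5 and standard deviation 2.
z = normalize([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```

After this step, two records with different absolute scales can be compared directly in terms of variability and trend.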

[13] To address the results of Emanuel [2005], Figure 2 (top left plot) compares PDI derived from the best track and our new UW/NCDC record (the other plots are described below) and demonstrates excellent agreement in both variability and trend ($\mathrm{PDI} = \sum_{i=1}^{n} V_i^3$, where $V_i$ is an intensity estimate and $n$ is the total number of intensity estimates in that season). The largest difference is seen in the 1995 season, where the algorithm tended toward greater intensities, but there are no systematic differences between the records and the trends are identical. The trends for both records are significant (following Webster et al. [2005], significance is inferred throughout this work by a Mann-Kendall trend test with a requirement of 95% confidence or greater). These upward trends are known to be part of a multi-decadal variability that has been attributed to natural internal forcing and, more recently, external anthropogenic forcing [Kerr, 2000; Goldenberg et al., 2001; Bell and Chelliah, 2006; Mann and Emanuel, 2006; Trenberth and Shea, 2006].
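A seasonal PDI and the Mann-Kendall test applied to it can be sketched as below. PDI is taken here as the sum of cubed intensity estimates in a season, in line with its description as a measure based on the cube of the wind speed. The test shown is the textbook two-sided form with a normal approximation and no correction for ties; whether the analysis used exactly this variant is an assumption, and the wind values are hypothetical.

```python
import math

def pdi(winds):
    """Power Dissipation Index for one season: sum of cubed intensity estimates."""
    return sum(v ** 3 for v in winds)

def mann_kendall(series):
    """Two-sided Mann-Kendall trend test (normal approximation, no ties).

    Returns (S, z, p), where S is the sum of the signs of all pairwise
    later-minus-earlier differences.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)  # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return s, z, p

# Hypothetical intensity estimates (m/s) for six consecutive seasons.
seasons = [[20, 30], [25, 35], [30, 40], [35, 45], [40, 50], [45, 55]]
pdi_series = [pdi(w) for w in seasons]
s, z, p = mann_kendall(pdi_series)
significant = p < 0.05  # the 95% confidence requirement
print(s, round(z, 3), significant)
```

Because the test depends only on the signs of pairwise differences, it gives the same result for the raw and the normalized time series.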

Figure 2.

Comparisons of best track and UW/NCDC records in each of the six hurricane-prone basins. For each basin, there are three vertically-oriented plots showing normalized PDI, and frequency and percentage of the most intense hurricanes [storms that achieve intensities greater than 2 standard deviations (2σ) from the total 23-year sample mean]. 2σ events in the Atlantic, East, West, and South Pacific, and Northern and Southern Indian Ocean best track records represent maximum intensities greater than 58, 61, 65, 55, 47, and 57 m s−1, respectively. The thicker lines are smoothed with a 1-4-6-4-1 binomial filter. Straight lines on the PDI plots are best-fit lines of the unsmoothed time series.
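The 1-4-6-4-1 binomial filter named in the caption can be sketched as follows. The treatment of the end points (using only the weights that fall inside the series and renormalizing) is an assumption, since the caption does not specify it.

```python
def binomial_smooth(series):
    """Smooth a series with 1-4-6-4-1 weights.

    Near the ends, only the weights that fall inside the series are used
    and their sum is renormalized (an assumed end-point treatment).
    """
    weights = (1, 4, 6, 4, 1)
    n = len(series)
    out = []
    for i in range(n):
        total = 0.0
        wsum = 0
        for k, w in enumerate(weights):
            j = i + k - 2  # the window is centered on i
            if 0 <= j < n:
                total += w * series[j]
                wsum += w
        out.append(total / wsum)
    return out

# A single spike spreads into the filter weights divided by 16.
smoothed = binomial_smooth([0, 0, 0, 16, 0, 0, 0])
print(smoothed)
```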

[14] To address the findings of Webster et al. [2005], we considered frequency and percentage of the most intense storms. In our framework, these are represented by “2σ” events, that is, storms that achieve a maximum intensity greater than two standard deviations from the 23-year mean of all intensities. For the Atlantic best track, a 2σ event denotes a maximum intensity of 58 m s−1. Thus, 2σ events almost exactly represent Saffir-Simpson category 4–5 hurricanes (intensity greater than 59 m s−1). Figure 2 shows the comparison of frequency/percentage of 2σ events in the two records and again we find good agreement in variability and trend (all trends are significant). The UW/NCDC record systematically contains fewer events, but there are no time-dependent differences that affect trends.
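The 2σ counting can be sketched as follows. The threshold is taken as the mean plus two standard deviations of all intensity estimates in the pooled record. The population form of the standard deviation, the data layout, and the sample values are assumptions made for the example.

```python
from statistics import mean, pstdev

def two_sigma_events(seasons):
    """seasons maps year -> list of storms; each storm is a list of
    intensity estimates (m/s).

    Returns (threshold, stats), where stats maps year -> (count, percentage)
    of storms whose maximum intensity exceeds the threshold.
    """
    pooled = [v for storms in seasons.values() for storm in storms for v in storm]
    threshold = mean(pooled) + 2.0 * pstdev(pooled)
    stats = {}
    for year, storms in seasons.items():
        count = sum(1 for storm in storms if max(storm) > threshold)
        stats[year] = (count, 100.0 * count / len(storms))
    return threshold, stats

# Hypothetical two-season record with one very intense storm in 2001.
record = {
    2000: [[20, 25, 30], [20, 22]],
    2001: [[20, 25, 20], [25, 30, 70]],
}
threshold, stats = two_sigma_events(record)
print(round(threshold, 2), stats)
```

Because the threshold is computed separately for each record, a constant bias between two records does not change which storms count as 2σ events.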

[15] The notable similarities between the two Atlantic records demonstrate that our simple objective algorithm captures hurricane intensity well in a mean sense, and that the best track intensities have apparently remained fairly consistent over the past two decades. Note that the algorithm was trained with 1,940 points, but the plots of Figure 2 represent the complete Atlantic record (N = 10,520). Within the limitations of the algorithm and the 23-year span of the data, our results strongly support the findings of Emanuel [2005] and Webster et al. [2005] for the Atlantic.

[16] We now consider the best track for the Eastern Pacific, which is also constructed and maintained by the NHC (in this basin, we considered the period 1984–2005 because there were early-season storms prior to the July 1983 start of our data). Figure 2 (top middle plots) shows the comparison between the UW/NCDC and best track records in that basin, and again we find excellent agreement in variability and trends. The PDI trends are downward and significant, while the frequency and percentage of 2σ events contain no significant trends.

[17] The comparisons between the best track and UW/NCDC records in the Atlantic and East Pacific provide high confidence that our algorithm captures the relevant aspects of hurricane activity, and can perform extremely well outside the Atlantic where it was trained. We now apply the algorithm to the remaining hurricane-prone ocean basins. The best track data for these basins (Northwest Pacific, South Pacific, and the Northern and Southern Indian Oceans) are constructed and maintained by a different agency [the Joint Typhoon Warning Center (JTWC)] using guidelines and protocols that differ from the NHC. Best track data from the JTWC are issued with a warning that confidence in the intensity estimates is generally low and decreases retrogressively in time, and that comparisons between these data and other best track data from other agencies should be made “with extreme caution” [Chu et al., 2002]. Our purpose here is to determine whether these stated data issues exhibit any time dependency that could introduce a spurious trend.

[18] The remaining plots of Figure 2 show comparisons between the JTWC best track and the UW/NCDC record for those four basins (in these basins, we considered the period 1983–2004 because the 2005 JTWC best track data were not yet available). In general, agreement is not as good as that found with the NHC best track, but the algorithm does a reasonable job of capturing inter-seasonal variability. However, there are time-dependent differences in every basin that cause a measurable reduction in the amplitude of the trends in the UW/NCDC record. Note that since we use the storm positions from the best track data to orient the satellite imagery and form our intensity estimates, the differences between the best track and UW/NCDC records are due solely to intensity differences. The 22-year trends of PDI and frequency/percentage of the most intense hurricanes, calculated from the best track, were found to be significant only in the S. Indian Ocean. No trends were significant in the UW/NCDC record. Our results in the Northwest Pacific are in good agreement with the recent study of Wu et al. [2006] which used best track data from two additional sources (the Regional Specialized Meteorological Centre in Japan and the Hong Kong Observatory in China) and found no trends in PDI or category 4–5 storms.

[19] Since aircraft reconnaissance into storms was routine in the Northwest Pacific during the earlier period of our record 1983–1987, the best track intensities during this period are likely to be more accurate than the later period 1988–2004 after the termination of reconnaissance. This indicates a systematic over-estimation of intensity in the later period of the JTWC record when compared to the UW/NCDC data. This systematic bias is also evident in the Northern and Southern Indian Oceans and the South Pacific, but the Indian Ocean comparisons should be viewed with some caution. Until the launch of the MeteoSat-7 satellite in 1998, the S. Indian Ocean was poorly sampled due to a gap in the satellite viewing area, and the view-angle between the existing satellites and storms in both the N. and S. Indian Oceans was often highly oblique [Knapp and Kossin, 2007].

[20] A reanalysis of S. Indian Ocean storms in the Australian region was also performed recently by Harper and Callaghan [2006], in which the Dvorak technique was applied to archival satellite imagery for the period 1968–2004 (twice-daily polar orbiting satellite imagery was used during the period prior to the 1977 launch of the GMS-1 geostationary satellite). They found that upward PDI trends in the S. Indian Ocean best track became insignificant in the reanalyzed data, but trends in the frequency and percentage of Saffir-Simpson category 4–5 storms were similar (although there was a very large overestimation in the absolute values of the existing best track). They argued that the remaining trend was highly dependent on the earliest estimates from 1968–1972, which are the most suspect in terms of fidelity, and the trend of category 4–5 storms dissipates by the mid-1980's. They ultimately concluded that their reanalysis did not exhibit any trends that could not be reasonably explained by expected deficiencies in analysis or natural variability. When compared with our results here, there is fairly compelling evidence that upward trends in the S. Indian Ocean are overstated in the present best track, but there is still enough uncertainty to keep the question open.

[21] When all basins are considered together, which constrains our analyses to the 21-year period 1984–2004, the upward trends in the best track of PDI and frequency/percentage of intense storms are all significant, but the global UW/NCDC record exhibits no significant trends (Figure 3).

Figure 3.

Similar to Figure 2, but for the combined (global) records. The trends of PDI and frequency/percentage of strongest storms are all significant in the best track, but none are significant in the homogeneous UW/NCDC record. 2σ events are based on the statistics of the global sample.

4. Concluding Remarks

[22] The time-dependent differences between the UW/NCDC and JTWC best track records underscore the potential for data inconsistencies to introduce spurious (or spuriously large) upward trends in longer-term measures of hurricane activity. Using a homogeneous record, we were not able to corroborate the presence of upward trends in hurricane intensity over the past two decades in any basin other than the Atlantic. Since the Atlantic basin accounts for less than 15% of global hurricane activity, this result poses a challenge to hypotheses that directly relate globally increasing tropical SST to increases in long-term mean global hurricane intensity.

[23] Efforts are presently underway to maximize the length of our new homogeneous data record but at most we can add another 6–7 years, and whether meaningful trends can be measured or inferred in a 30-year data record remains very much an open question. Given these limitations of the data, the question of whether hurricane intensity is globally trending upwards in a warming climate will likely remain a point of debate in the foreseeable future. Still, the very real and dangerous increases in recent Atlantic hurricane activity will no doubt continue to provide a heightened sense of purpose to research addressing how hurricane behavior might change in our changing climate, and further efforts toward improvement of archival data quality are expected to continue in parallel with efforts to better reconcile the physical processes involved. If our 23-year record is in fact representative of the longer record, then we need to better understand why hurricane activity in the Atlantic basin is varying in a fundamentally different way than the rest of the world despite similar upward trends of SST in each basin.


[24] We are grateful to Peter Webster, Chris Landsea, and an anonymous reviewer for their comments on an earlier version of this paper. This material is based upon work supported by the National Science Foundation under grant ATM-0614812.