Using Video Recognition to Identify Tropical Cyclone Positions

Tropical cyclone (TC) center fixing is a challenge for improving forecasting and for establishing TC climatologies. We propose a novel objective solution through the use of video recognition algorithms. The videos are composed of sequential, hourly, geostationary satellite infrared (IR) images of tropical cyclones in the Western North Pacific. A variety of convolutional neural network architectures are tested. The best performing network implements convolutional layers, a convolutional long short-term memory layer, and fully connected layers. Cloud features rotating around a center are effectively captured by this video-based technique. Networks trained with long-wave IR channels outperform a water vapor channel-based network. The average position across the two IR networks has a 19.3 km median error across all intensities, a 42% lower error than a baseline technique. This video-based method, combined with the high geostationary satellite sampling rate, can provide rapid and accurate automated updates of TC centers.

Several studies have applied CNNs with satellite imagery to TC detection and center fixing. Matsuoka et al. (2018) trained CNNs on simulated outgoing long-wave radiation to detect TCs. P. Wang et al. (2020) utilized CNNs with satellite imagery to detect TCs, but template matching (circle fitting and image segmentation) was needed to identify the exact TC center. Only Yang et al. (2019) have applied CNNs to satellite imagery for TC center fixing. They trained networks with single, non-sequential geostationary satellite images to objectively estimate TC centers. Their algorithm identified all 25 cyclones within 500 images, with a mean position error of 28.6 km; a breakdown of position estimates by intensity was not provided. Generally, the largest uncertainties in center location are found for weaker tropical cyclones or tropical storms (Wimmers & Velden, 2016) (hereafter, TC will be used as a term referring to both tropical cyclones and tropical storms). This is true for both manual and automated center fixing. The aim here is therefore to provide a robust, objective TC center fixing method that significantly reduces position errors at low intensities, maximizing the operational benefit available to the forecaster.
Incorporating recurrent layers into a CNN facilitates video analysis: physically distinct features can be identified and the temporal evolution of those features can be constrained (Yue-Hei Ng et al., 2015). Sequential geostationary images have intrinsic value in identifying regions of increased convection and in determining the TC wind field via feature tracking; both pertain to the TC center (Hasler et al., 1998; Wimmers & Velden, 2016). Recurrent CNNs trained on satellite data have been used to improve precipitation forecasts (Xingjian et al., 2015). Regarding TCs, prior research has applied recurrent CNNs to cyclone identification, center fixing, and track forecasting (Kim, 2019; Kim et al., 2018; Liu et al., 2016). However, those studies train on sequential model data, typically wind fields and surface pressure. This is the first study to combine CNNs with videos of TC satellite observations to address the TC centering problem.

Data set
Himawari infrared (IR) images of 528 TCs in the Western North Pacific between 1996 and 2019 were analyzed. TC positions and intensities were defined by the best track archive (Knapp et al., 2010), a TC data set combining information from global forecasting agencies that provides highly accurate estimates of historical TC location and intensity. Images of TCs with a central minimum pressure Pmin < 1,005 hPa were used, which equates to a TC intensity of tropical storm strength or higher (Simpson & Saffir, 1974). Three channels are considered: two long-wave channels (IR1 = 10.3-11.3 μm and IR2 = 11.5-12.5 μm) and one water vapor channel (WV = 6.5-7 μm). The data set is composed of hourly observations for TCs in the region 0-70°N, 100-160°E at a 5.5 km resolution on a cylindrical projection, and is detailed by Murata et al. (2013). Brightness temperature (BT) calibrations are available from the Japanese Meteorological Agency.
Hourly TC positions were determined by linear interpolation along the best track path (Knapp et al., 2010). For each channel, 9,293 videos were created, each composed of six sequential images. The videos were centered on the TC position at an initial time t0 and encompass r ≤ 550 km from the interpolated best track position at t0. The initial time is 5 h prior to the current position, and t5 is the current time, that is, the time of the most recent observation. The neural network determines the position of the cyclone at t5, the current position. In addition to the six-frame videos, videos composed of 3 and 12 sequential images were tested, but these provided unsatisfactory results.
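The preprocessing described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names, the planar crop geometry (a 200 × 200 pixel window implied by a 550 km radius on the 5.5 km grid), and the synthetic input arrays are all assumptions.

```python
import numpy as np

KM_PER_PX = 5.5              # grid resolution quoted for the data set
R_PX = int(550 / KM_PER_PX)  # 550 km crop radius -> ~100 pixels

def hourly_track(hours_out, track_hours, track_lat, track_lon):
    """Linearly interpolate best-track positions to hourly resolution."""
    return (np.interp(hours_out, track_hours, track_lat),
            np.interp(hours_out, track_hours, track_lon))

def make_video(frames, row0, col0):
    """Stack six sequential hourly frames, each cropped to r <= 550 km
    around the interpolated best-track pixel position at t0."""
    clip = [f[row0 - R_PX:row0 + R_PX, col0 - R_PX:col0 + R_PX]
            for f in frames]
    return np.stack(clip)    # shape (6, 200, 200)

# Toy demonstration with synthetic brightness-temperature frames.
frames = np.zeros((6, 400, 400))
video = make_video(frames, 200, 200)
lat, lon = hourly_track(np.arange(6), [0, 6], [10.0, 11.0], [130.0, 131.0])
```

In practice the interpolation would run on the 6-hourly best track archive times and the crop would be applied to calibrated BT imagery.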

Neural Network Architectures
There are several possible ways classic CNNs can be adapted from image recognition tasks to process the time dimension and tackle this video recognition problem (Ji et al., 2012; Karpathy et al., 2014; Yue-Hei Ng et al., 2015). This study focuses upon CNNs utilizing convolutional long short-term memory (ConvLSTM) layers (Xingjian et al., 2015). General long short-term memory (LSTM) layers are 1D recurrent layers that analyze sequential signals recursively to provide a temporal signal; they contain operations that update, forget, and retain information through this recursive loop. ConvLSTM layers implement these operations within convolutions, thereby overcoming the loss of spatial information in traditional LSTM layers, where feature maps are reduced to 1D (Xingjian et al., 2015). An example network is shown in Figure 1. ConvLSTM networks can be structured with the ConvLSTM layer as either the initial or final layer of convolutions, analyzing the temporal evolution of rudimentary or compound features, respectively. For comparison, the same problem was tackled with CNNs with a traditional 1D LSTM layer (Yue-Hei Ng et al., 2015) (hereafter Recurrent-CNN) and CNNs which implement 3D convolutions spanning the time dimension with no recurrent layers (Ji et al., 2012) (hereafter 3D-CNN). We tested a variety of network configurations of varying depth for each architecture type. All networks followed the basic structure in which convolutional layers preceded dense layers.
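To make the ConvLSTM update concrete, here is a minimal NumPy sketch of a single cell following Xingjian et al. (2015) (without peephole connections). The gates are computed with convolutions rather than dense products, so the hidden and cell states retain their 2D spatial layout; all shapes and the toy weights are illustrative assumptions.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive same-padding 2D convolution: x is (H, W, Cin), w is (k, k, Cin, Cout)."""
    k = w.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k], w, axes=3)
    return out

def convlstm_step(x, h, c, Wx, Wh, b):
    """One ConvLSTM step: gate maps are convolutions of the input frame x
    and hidden state h, so spatial structure is preserved through time."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = conv2d_same(x, Wx) + conv2d_same(h, Wh) + b
    i, f, o, g = np.split(z, 4, axis=-1)   # input, forget, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Run a toy six-frame video through the cell.
rng = np.random.default_rng(0)
T, H, W, Cin, F, k = 6, 8, 8, 1, 4, 3
video = rng.standard_normal((T, H, W, Cin))
Wx = 0.1 * rng.standard_normal((k, k, Cin, 4 * F))
Wh = 0.1 * rng.standard_normal((k, k, F, 4 * F))
b = np.zeros(4 * F)
h = np.zeros((H, W, F))
c = np.zeros((H, W, F))
for t in range(T):
    h, c = convlstm_step(video[t], h, c, Wx, Wh, b)
```

The final hidden state h is a 2D feature map summarizing the sequence, which is what allows downstream dense layers to regress a spatially meaningful center position.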
A total of 15 networks were trained and tested for each frequency channel: three ConvLSTM networks, six Recurrent-CNNs, and six 3D-CNNs. The number of ConvLSTM networks tested was limited by computational capacity. The mean absolute error between the position estimate and the best track position was used in training all networks. Initial tests split observations into train (90%) and test (10%) groups to identify the optimal neural network architecture. The best performing neural network was then cross-validated eight times, with the data split into train (93%) and test (7%) groups; the cross-validation tests spanned half the data and each test set was independent. To stabilize the training process, the train and test positions were normalized by the upper 99% of distances (an average translation speed over 5 h of 15.4 m s−1). The few cyclones that exceeded this speed were removed. All networks were trained using the ADAM optimizer (Kingma & Ba, 2015) with a learning rate of 0.0001 for 60 epochs. Batch sizes were kept constant at 32 videos. ReLU activation functions were used for all convolutional layers (Russakovsky et al., 2015), with tanh activation functions used for fully connected layers. Tests comparing the number of epochs (25 ≤ epochs ≤ 200) and learning rates lr (0.0001 ≤ lr ≤ 0.1) indicated that the values chosen provided good performance while minimizing the time spent training. Significant benefits (∼20%, p ≪ 0.01) were found through data augmentation, which involved randomly rotating videos and their corresponding position vectors by a multiple of 90°.
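The rotation augmentation can be sketched as follows. The key detail is that the video and its target displacement must be rotated together; the sign convention used here (90° counter-clockwise maps (x, y) to (−y, x)) is an illustrative assumption, since the true mapping depends on the image axis orientation.

```python
import numpy as np

def rotate_sample(video, target, k):
    """Rotate a (T, H, W) video by k quarter-turns in the image plane and
    rotate the (x, y) displacement target by the same angle."""
    video = np.rot90(video, k, axes=(1, 2))
    x, y = float(target[0]), float(target[1])
    for _ in range(k % 4):     # one quarter-turn CCW: (x, y) -> (-y, x)
        x, y = -y, x
    return video, np.array([x, y])

# Example: quarter-turn and half-turn of a toy video and target vector.
vid = np.arange(6 * 4 * 4, dtype=float).reshape(6, 4, 4)
rot_vid, rot_target = rotate_sample(vid, (1.0, 0.0), k=1)
rot2_vid, rot2_target = rotate_sample(vid, (1.0, 0.0), k=2)
```

During training, k would be drawn at random per sample; restricting rotations to multiples of 90° avoids interpolation artifacts in the BT imagery.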
The best performing architecture of all those tested is the ConvLSTM trained upon the IR1 band (Table 1). This network applied two convolutional layers (64 filters each), then the ConvLSTM layer (128 filters), and finally three dense layers (512, 128, and 2 nodes). After 60 epochs, the network provides a median test error of 23.0 km across all intensities (Table 1). Table 1 shows the position errors for the ConvLSTM network trained separately on the IR1, IR2, and WV bands. Taking the mean of several CNN position estimates could provide an improvement upon any single CNN estimate; to work effectively, the mean requires comparable quality, independence, and low bias. The Recurrent-CNNs and 3D-CNNs exhibit significant biases, independent of network depth, for all channels (average bias = 11.6 km, −55° measured clockwise from due east). A reduced bias is present in the ConvLSTM network (IR1 bias = 3.5 km, 176.5°; IR2 bias = 5.2 km, 178.9°; WV bias = 12.6 km, 147.3°; Figure 2).
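A Keras-style sketch of this configuration is given below. Only the layer counts and widths (two 64-filter convolutional layers, a 128-filter ConvLSTM layer, and 512/128/2-node dense layers), the activations, the loss, and the optimizer settings come from the text; kernel sizes and all pooling/downsampling choices are assumptions added to keep the sketch small, not the authors' exact network.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Six hourly frames, ~200x200 px (550 km radius at 5.5 km per pixel), 1 IR channel.
model = keras.Sequential([
    layers.Input(shape=(6, 200, 200, 1)),
    layers.TimeDistributed(layers.Conv2D(64, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D(2)),   # pooling is an assumption
    layers.TimeDistributed(layers.Conv2D(64, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D(2)),
    layers.ConvLSTM2D(128, 3, padding="same"),        # returns the final hidden state
    layers.MaxPooling2D(5),                           # keeps the dense head small
    layers.Flatten(),
    layers.Dense(512, activation="tanh"),
    layers.Dense(128, activation="tanh"),
    layers.Dense(2),                                  # normalized (x, y) position at t5
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss="mae")
```

Training under the settings described above would then be a call along the lines of `model.fit(videos, positions, batch_size=32, epochs=60)`.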
Errors for each of the channels are shown in Figure 2. Both Figure 2 and Table 1 show that the WV channel position estimates are significantly worse than the IR1 and IR2 estimates for all intensities (p ≪ 0.01). The AVG position estimates in Table 1 are calculated using networks trained only upon the IR1 and IR2 spectral bands. This average across two networks provides a 19% reduction (p = 0.001) in position error over those found with the IR1 network.
SMITH AND TOUMI 10.1029/2020GL091912
The network position errors are compared to two benchmarks: ARCHER-2 (Wimmers & Velden, 2016) and persistence. Persistence linearly extrapolates from the previous two TC best track center positions (Knapp et al., 2010). The results provided by the average across the ConvLSTM IR1 and IR2 bands (AVG) demonstrate a 42% reduction in median error, from 33 to 19.3 km, compared to ARCHER-2 (Table 1). Error reductions of 38%, 26%, and 15% are found for tropical storms, category 1 TCs, and categories 2-5 TCs, respectively. Against the persistence benchmark, reductions of 4%, 22%, and 34% are found for those intensity bins. This places this objective ConvLSTM network in line with the manual and subjective center estimates undertaken by the Satellite Analysis Branch and the Tropical Analysis and Forecasting Branch presented by Wimmers and Velden (2016).
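The persistence benchmark is simple enough to state exactly: with equal time spacing, linear extrapolation from the two most recent centers gives p_next = 2·p_curr − p_prev. A minimal sketch (the planar coordinates are illustrative):

```python
import numpy as np

def persistence(p_prev, p_curr):
    """Linearly extrapolate the next TC center from the two most recent
    best-track positions, assuming equal time spacing."""
    return 2.0 * np.asarray(p_curr, dtype=float) - np.asarray(p_prev, dtype=float)

forecast = persistence([0.0, 0.0], [10.0, 5.0])  # -> [20., 10.]
```

A fuller implementation would extrapolate in latitude/longitude along the best track path rather than on a plane.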
The performance of the ConvLSTM AVG network is shown for cyclone Namtheun as a case study (Figure 3). There is a tendency for the network to provide more accurate position estimates when the cyclone reaches high intensities, as the median position error drops from 23.5 km for tropical storms to 9.1 km for category 5 TCs. This is to be expected given the more coherent and robust cloud structures that are associated with stronger intensities (Dvorak, 1984).

Table 1 The Position Errors Available When Applying the ConvLSTM Network to Videos Composed of Six Sequential Images of a Cyclone for Each Spectral Band, Binned by Intensity
Note. Position errors are quoted as median averages for persistence, IR1, IR2, and WV; the count value is the number of position estimates made in cross-validation. AVG positions are calculated as the mean position of the IR1 and IR2 networks for any given video. Abbreviations: AVG, average; IR, infrared; RMSE, root mean square error; WV, water vapor.

Examples of strong and weak network performance for Typhoon Damery are shown in Figure 4. In both of these cases, the intensity is relatively weak: Pmin = 998 hPa for (a) and (b), and Pmin = 995 hPa for (c) and (d). This makes the case studies interesting, as the cloud structures do not obviously allude to the TC center. There is no eye formed and cold clouds do not encompass the best track center symmetrically. Both examples exhibit disorganized cloud structures. In the weak performance case (Figures 4a and 4b), low BT values (BT < −40°C) lie 0-500 km east of the TC center in one expansive, cohesive group (contoured and labeled as cluster (1)). Comparing the observations taken 6 h apart, it is clear that the cloud structure does not vary significantly with time. This leads to a poor TC center estimate for this intensity (error = 43 km from the best track position). The neural network is provided with very little temporal evolution that can inform an accurate TC center estimate. The example of good performance (Figures 4c and 4d) also has a disorganized cloud structure, but with a clear temporal cloud evolution. Two distinct cold regions exist in the observations at t0 and t5 (both are labeled (1) and (2)). The motion of these two cloud structures is rotational about the TC center (demonstrated with arrows in Figure 4c). The temporal signal provides the network with information that better informs a reliable TC center estimate (error = 6 km from the best track position).
There is a clear difference in the quality of the position estimates between the different network architectures (Figure 5). While the Recurrent-CNN and 3D-CNN architectures provide position estimates of equal quality, each with a median position error of ∼35 km, both underperform compared to the ConvLSTM network. There is also a large variation between architectures in the trade-off between training error and testing error, indicating that the Recurrent-CNN is over-fitting. ConvLSTM layers require more epochs to learn than discrete convolutional or LSTM layers. Consequently, further reductions in position error may be available with more epochs (Figure 5).
Figure 4 caption (partial): Images in the left column are the first frames in the video (at t0); images in the right column are the last frames in the six-frame video (at t5). The images at t5 are re-centered only for ease of interpretation. A −40°C black BT contour shows colder cloud. All x-y axes are in km from the TC center position.

Discussion and Conclusion
We demonstrate a novel way of objectively estimating the center of TCs; centering low-intensity storms is particularly difficult. The use of video recognition algorithms not only enables features such as spiral rain bands, regions of intensifying convection, and cyclone eyes to be identified, but also enables their motion to be tracked and the rotational center to then be determined. This is easily achieved in high-intensity cases, where the features are spatially consistent and the wind speeds are high, such that temporal evolution is rapid and therefore identifiable. In low-intensity cases, the features are much more inconsistent and may not evolve significantly over the short term (Figure 4). Significant temporal evolution of spatially consistent features is less frequent in low-intensity storms, leading to less accurate position estimates.
The temporal evolution of rotating cloud features and the regions of cloud development are crucial in determining the TC center (Wong & Yip, 2009). The WV channel is most sensitive to water vapor between 500 and 100 hPa (Hillger & Schmit, 2011). Rotation in these layers may be small away from the core (Smith & Toumi, 2021). Cloud motion can also be hidden by the thick clouds aloft. The IR1 and IR2 channels are both capable of imaging to lower heights (Hillger & Schmit, 2011), where rotation extends further from the center (Smith & Toumi, 2021). These more distant rotating cloud features are also less likely to be obscured by the high cloud tops associated with the TC core. The WV channel introduces additional errors when included into an average of networks, so that the two-member average (IR1 and IR2) outperforms a three-member average.
Both the 3D-CNN and the Recurrent-CNN approach the video recognition problem in overly simplistic ways that are equally imperfect (Figure 5). The 3D-CNN identifies spatial features in consecutive time steps simultaneously, with no recurrent layers. The Recurrent-CNN extracts the spatial features within each image and then compresses them to a sequence of 1D vectors for temporal analysis in recurrent layers; this compression loses the spatial inter-relationships of features (Xingjian et al., 2015), which are physically meaningful. These methods poorly approximate either the temporal (3D-CNN) or the spatial (Recurrent-CNN) signals. The ConvLSTM network addresses both of these issues, preserving the spatiotemporal relationships within the ConvLSTM layer. The importance of the spatiotemporal cloud relationship is consistent with the understanding of the cooperative interaction of moist convection, the primary circulation, and the secondary circulation in TCs. Capturing this interaction is challenging, but it is crucial for weaker TCs, when there is low consistency in spatiotemporal patterns. Applying the ConvLSTM layer at the end of several convolutions implies that the most reliable centering information can be obtained through temporal analysis of larger-scale, compounded cloud features (e.g., spiral bands or regions of intense convection) rather than local cloud features (e.g., cloud surface textures/roughness).
We present an investigation into the use of video recognition algorithms to establish TC centers. Several network architectures that exploit information on the temporal evolution of TC imagery have been tested for three different spectral bands. Networks trained on the long-wave channels outperform those trained on the water vapor channel, for good physical reasons. Neural networks applied to videos can provide a rapid, objective TC center estimate to assist the forecaster at the potentially very high temporal resolution (10-15 min) of current geostationary satellite imagery. The accuracy of this method over baseline techniques justifies future research into the use of video recognition algorithms with satellite imagery to monitor TCs.

Data Availability Statement
Data used to create the figures and tables found in this study are available from Smith (2021). External data sets for this research are described in these studies: Murata et al. (2013), Wimmers and Velden (2016), and Knapp et al. (2010). Satellite imagery is available from http://weather.is.kochi-u.ac.jp/ for research purposes only. The IBTrACS data set is available from https://www.ncdc.noaa.gov/ibtracs/. ARCHER-2 center-fix data is available from http://tropic.ssec.wisc.edu/real-time/archerOnline/cyclones/.