An Ensemble Machine Learning Approach for Tropical Cyclone Detection Using ERA5 Reanalysis Data

Tropical Cyclones (TCs) are counted among the most destructive phenomena that can be found in nature. Every year, globally an average of 90 TCs occur over tropical waters, and global warming is making them stronger, larger and more destructive. The accurate detection and tracking of such phenomena have become a relevant and interesting area of research in weather and climate science. Traditionally, TCs have been iden-tiﬁed in large climate datasets through the use of deterministic tracking schemes that rely on subjective thresholds. Machine Learning (ML) models can complement deterministic approaches due to their ability to capture the mapping between the input climatic drivers and the geographical position of the TC center from the available data. This study presents a ML ensemble approach for locating TC center coordinates, embedding both TC classiﬁcation and localization in a single end-to-end learning task. The ensemble combines TC center estimates of diﬀerent ML models that agree about the presence of a TC in input data. ERA5 reanalysis were used for model training and testing jointly with the International Best Track Archive for Climate Stewardship records. Re-sults showed that the ML approach is well-suited for TC detection providing good generalization capabilities on out of sample data. In particular, it was able to accurately detect lower TC categories than those used for training the models. On top of this, the ensemble approach was able to further improve TC localization performance with respect to single model TC center estimates, demonstrating the good capabilities of the proposed approach.


Introduction
Tropical Cyclones (TCs), also known as hurricanes or typhoons, are counted among the most fascinating and destructive phenomena that can be found in nature (K.Emanuel, 2003).Several conditions are at the basis of TC formation.As described by Riehl (2004) and Dunn (1951), the process is triggered by warm ocean waters' condensation that provides most of the energy to the system.A sufficient distance of the TC from the equator allows the Coriolis force to take effect, leading to the typical spinning motion.In addition, stored heat energy is released by condensation, warming up the air and lowering the pressure.Besides the aforementioned conditions, the cyclone center (i.e., its eye) is typically located in a low-pressure region, surrounded by strong winds and deep cumulonimbus.As the TC travels, it becomes an auto-sufficient system that continuously gathers energy from the ocean.If the TC shifts on land (i.e., the so-called landfall), the TC loses the energy drawn by warm water's condensation, thus leading to its rapid dissipation (Kepert, 2010;Rüttgers et al., 2019;MetOffice, 2023).
The geographical areas in which the formation of TCs is supported are called cyclone formation basins.There are seven basins around the world, each with a specific water depth and sea surface temperature: the main consequence is the difference in the number of TCs per year and the season in which they develop (Roy & Kovordányi, 2012).Every year, globally an average of 90 TCs occur over tropical waters (K. A. Emanuel & Nolan, 2004) and global warming is making them stronger, larger and more destructive, as recognized by Elsner et al. (2008); Mendelsohn et al. (2012); Sun et al. (2017).As reported by the World Meteorological Organization (WMO), 1, 942 disasters have been attributed to TCs, which caused US $ 1, 407.6 billion in economic losses and almost 8 million killed people over the past 50 years (World Meteorological Organization, 2023), thus making TCs impact quite significant on different sectors, such as infrastructures, economy, human health, social unrest.
The accurate detection and tracking of such phenomena have become a relevant and interesting area of research in weather and climate science (Dabhade et al., 2021;Scoccimarro et al., 2014).Traditionally, TCs have been identified in large climate datasets through the use of deterministic tracking schemes, also known as TCs trackers (Horn et al., 2014).The latter are algorithms capable of identifying -by means of thresholds applied on variables significant to the cyclogenesis -patterns related to a warm core in gridded datasets and connecting them along the TC trajectory (Bourdin et al., 2022).Depending on the particular variables involved in the tracking process, two main categories of schemes exist: physics-based (see Camargo and Zebiak (2002); Zhao et al. (2009); Zarzycki and Ullrich (2017); Murakami (2014); Horn et al. (2014); Chauvin et al. (2006)) and dynamics-based that include the TRACK method (Hodges et al., 2017;Strachan et al., 2013) and the Okubo-Weiss-Zeta (OWZ) algorithm (Tory et al., 2013).
The aforementioned thresholds are set by the author of the scheme, therefore they are subjective and mainly rely on the human expertise about the phenomena under investigation (Dabhade et al., 2021;Enz et al., 2022).Moreover, thresholds may depend on the particular geographical region of study and the related formation basin, as well as on the TC categories (Bloemendaal et al., 2021;Befort et al., 2020).However, manual threshold tuning may lead to subjective bias, as well as to the potential inability of generalizing the proposed approach to other situations or domains.
The state of different climate variables, which the tracking schemes are applied on, is simulated by physics-based Earth System Models (ESMs) that provide large amounts of data at different spatio-temporal resolutions.In addition to ESM data, ground-based in situ observations and satellite retrievals contribute to further increase the data volume.Such large-scale data introduces issues in terms of how scientific data can be effectively managed and processed to make the best out of it (Gray et al., 2005).Indeed, climate scientists, meteorological agencies and policy decision makers need to process and extract meaningful information from these huge datasets in a cost-effective manner and in a reasonable amount of time (Sebestyén et al., 2021).In this context, High Performance Data analytics systems can address some of the issues and provide support for descriptive/statistical analysis from this large-scale data (Elia et al., 2021).Nevertheless, in the last few years Machine Learning (ML) and Deep Learning (DL) algorithms became popular as data-driven paradigms for supporting feature extraction from the vast amounts of scientific data currently available (Hey et al., 2020).ML and DL algorithms can, actually, go beyond what can be extracted with traditional descriptive and deterministic methodologies.In particular, for the phenomena targeted by this study, ML models are well-suited since they can capture the mapping between the input climatic drivers and the geographical position of the TC center from the available data, without the need of subjective thresholds.As a consequence, such models can complement deterministic tracking schemes for the TC detection task.
To this extent, several research efforts can be found in the scientific literature towards the development of cutting-edge TC detection approaches beyond the existing deterministic tracking schemes.For example, many studies focused on the use of satellite data and DL approaches for accurately locating the TC eye (Carmo et al., 2021).
Several works, such as Lam et al. (2023); Pang et al. (2021); Shakya et al. (2020); Haque et al. (2022), framed the identification of the TC center as an object detection task for which the You Only Look Once (YOLO) v3 DL object detection model was adopted.Similarly, a DL-based object detection approach was proposed in Wang et al. (2020) with the aim of retrieving the TC center through segmentation, edge detection, circle fitting, and comprehensive decision of satellite IR images.Segmentation of the shape and size of the detected TC in high-resolution satellite images was also provided by Nair et al. (2022).To this extent, a pipeline consisting of a Mask Region-Convolutional Neural Network (R-CNN) detector, a wind speed filter and a CNN classifier was adopted to accurately detect TCs.In Xie et al. (2022), a Feature Pyramid Network (FPN) was proposed as a feature extractor and region proposal network that searches for the potential areas of cyclones along with a Faster Region-based CNN (Faster R-CNN) to calibrate the locations of TC regions.Faster R-CNN was also used by Xie et al. (2020) to classify the presence of TCs in Mean Wind Field-Advanced Scatterometer (MWF-ASCAT) satellite data.The authors of S. Kim et al. (2019) exploited a Convolutional LSTM (Con-vLSTM) to detect, track and predict hurricane trajectories on Community Atmospheric Model v5 (CAM5) simulation data.With the aim of capturing both temporal dynamics and spatial distribution, trajectories were modeled as time-sequential density maps.The detection of tropical and extratropical cyclones was addressed as an image segmentation task by Kumler-Bonfanti et al. (2020).They used Global Forecasting System (GFS) and Geostationary Operational Environmental Satellite (GOES) to compare four stateof-the-art U-Net-based models designed for the detection task.In Carmo et al. (2021), data from the Sentinel-1 C-band SAR satellite were used to provide a DL-based detector of the TC center, also providing estimates of the related category according to sea surface wind and rain-related topological patterns.Authors further provided explainability through the analysis of key patterns highlighted by the Gradient-based Class Activation Map method.M. Kim et al. (2019) used eight predictors gathered from the Wind-Sat satellite to frame a TC detection task.Then, they compared the detection skills of three ML algorithms, namely Decision Trees (DT), Random Forest (RF), and Support Vector Machines (SVM) and a model based on Linear Discriminant Analysis (LDA).
A TC detection approach based on ML is proposed in this work and applied on the joint North Pacific and Atlantic TC formation basins.Although the detection task is similar to other studies from the state of the art, there are some important differences in the algorithmic approach used.From a methodological perspective, a ML ensemble approach is proposed to accurately locate the TC center coordinates.Exploiting a single ML model for locating the TC center would have resulted in unreliable results because of the inherent complexity of the TC detection task.The ensemble, instead, allows combining TC center estimates of different ML models that are in agreement about the presence of a TC in input data.In this way, each model can learn different spatial characteristics of the TC structure and the ensemble allows providing more accurate TC center estimates.Additionally, the ML setup used in this study allows embedding both TC classification and localization in a single end-to-end learning task.
Moreover, in contrast to other studies, a total of six ERA5 reanalysis TCs predictors (i.e., mean sea level pressure, 10m wind gust since previous post-processing, instantaneous 10m wind gust, relative vorticity at 850 mb and temperature at 300 and 500 mb) were used in place of satellite data.Reanalysis data combines model simulations and observations to provide the best representation of climate variables in the past (ECMWF, 2020a).Accurate TC centers geographical coordinates were retrieved from the International Best Track Archive for Climate Stewardship (IBTrACS) dataset, the most com-plete global collection of historical TC occurrences (National Oceanic and Atmospheric Administration, 2023).
The rest of the paper is organized as follows: Section 2 describes data sources and the processing steps required to build a suitable dataset for ML training.Moreover, the experimental setup is described, along with models architectures and the ensemble procedure.Section 3 presents the results on the test set, focusing on some meaningful test cases for which the performance of models' ensemble is compared against single model ones.Section 4 discusses the obtained results, highlighting strengths and limitations of the proposed approach, also drawing the main conclusions from this work and points out some relevant future activities.

Data Sources
This subsection provides a description of the two data sources used to build the dataset for the ML setup.

The International Best Tracks Archive for Climate Stewardship
The International Best Track Archive for Climate Stewardship (IBTrACS) presented by Knapp et al. (2010) is an institutional, open access and centralized archive that provides the most complete set of historical TC best track data at a global level.It integrates historical records with observations retrieved from 12 different meteorological agencies.The main aim of IBTrACS is fostering research in the context of such events by keeping track of their geographical position, frequency and intensity worldwide.IBTrACS reports global TCs occurrences at 0.1 • (∼ 10km) of spatial resolution from 1841 to present with a 3-hourly temporal frequency.However, in this study, only TC records between 1980 and 2019 were selected from IBTrACS v4 (Knapp et al., 2018) at a temporal frequency of 6 hours.Although IBTrACS provides TC records from 1841 to date, 1980 is considered the beginning of the Modern Era, characterized by the extensive use of geostationary satellite imagery at a global scale.On the other hand, more recent TC information are subject to frequent reanalysis by the different meteorological agencies contributing to IBTrACS, and, for these reasons, the TC selection was limited to 1980-2019.Furthermore, 6-hourly data provides additional information about the TC characteristic, such as the Maximum Sustained Wind (MSW), contrary to 3-hourly ones (Knapp et al., 2010).
Concerning the geographical domain, this study mainly targets the North Pacific formation basin which is widely recognized as a particularly active region where most TCs occur every year (Roy & Kovordányi, 2012).Since a substantial number of TC events cross both the North Pacific and North Atlantic regions, thus reaching up to 320 • E of longitude, the final domain of interest is 100 − 320 • E, 0 − 70 • N (i.e., joint North Atlantic and North Pacific).

ERA5 Reanalysis
Climate variables that are the main drivers of TCs, thus contributing to the formation and sustainment during their lifetime, were retrieved from the Copernicus Climate Change Service (C3S) ERA5 reanalysis datasets.ERA5 reanalysis combines global numerical weather predictions with newly available observations in an optimal way to produce consistent estimates of the state of the atmosphere (ECMWF, 2020b).In this study, mean sea level pressure [P a] (msl ), 10m wind gust since previous post-processing [ms −1 ] (fg10 ) and the instantaneous 10m wind gust [ms −1 ] (i10fg) were gathered from the ERA5 reanalysis on single levels (Hersbach et al., 2023b), whereas the relative vor-ticity at 850 mb [s −1 ] (vo850 ) and the temperature at 300 and 500 mb [K] (t300 and t500, respectively) were collected from the ERA5 reanalysis on pressure levels dataset (Hersbach et al., 2023a).Each of the aforementioned climatic variables was provided on a regular grid of 0.25 • × 0.25 • (∼ 27km × 27km) of spatial resolution, targeting the geographical domain previously described and it was managed as a 2-dimensional map of size 280 × 880 pixels.Moreover, data was collected with a 6-hourly temporal resolution (i.e., 00.00, 06.00, 12.00 and 18.00 time steps) for the period 1980-2019, thus matching TCs records selected from IBTrACS, except for fg10 that was originally collected with a hourly temporal resolution.In particular, the ERA5 fg10 variable reports the maximum of the wind gust in the preceding hour.Therefore, to match the 6-hourly temporal resolution of this study, for each time step, the maximum over the previous 6 hours was computed.

IBTrACS filtering and selection
Starting from trajectories belonging to the joint North Atlantic and North Pacific geographical domain (100 − 320 • E, 0 − 70 • N ), only IBTrACS records with track type field flagged as main were considered.Therefore, provisional, spur and provisional-spur tracks were implicitly discarded as they are characterized by a higher level of uncertainty (IBTrACS Science Team, 2019).It is noteworthy that data from recent years are typically provided as provisional or spur, meaning that their corresponding values have not been reanalyzed yet and therefore are of lower quality compared to main tracks.This can happen because some variables -such as the intensity, position and storm categories -are subject to change according to posterior reanalysis by the meteorological agencies.Moreover, uncertainties in the observing system may result in contradictory opinions by different agencies about the storm location, leading to spur tracks.This is mainly due to difficulties in localizing the center of circulation or in the case of storms merging (i.e., Fujiwhara effect) (IBTrACS Science Team, 2019).As an additional selection step, tracks were filtered out based on the nature field, specifically discarding those trajectories marked as: (i) Not Reported (NR) whose nature is unknown, (ii) Disturbance Storms (DS ) that correspond to not-well-formed storms characterized by a MSW less than 34 knots, and (iii) Mixture (MX ) that correspond to tracks that received contradicting reports about the nature of the observing system from different agencies.At the end of this filtering and selection process, only Tropical Storms (TS ),Extra Tropical (ET ) and Subtropical Storms (SS ) main tracks were considered at a 6-hourly temporal resolution.
Figure 1 shows the selected TC records occurring in the domain of study for the considered period.TCs tracks were further divided into non-overlapping groups according to the route taken during their lifecycle, reporting also the number of occurrences in each group.The trajectory of TCs in the North Atlantic (light blue), West Pacific (yellow) and East Pacific (orange) remain confined to such basins (i.e., they originate and dissipate in the same basin), whereas East and West Pacific tracks (blue) as well as East Pacific and North Atlantic ones (purple) cross different basins during their lifecycle.

Patches generation and labeling
For each of the six climatic drivers, ERA5 maps (of dimension 280 × 880 pixels) were evenly tiled into 7 × 22 non-overlapping patches of size 40 × 40 pixels each.The TC eye can occur in every pixel of the patch, not necessarily in its center, thus non cyclonecentric patches were generated.Then, drivers were stacked together, resulting in data of dimension 40 × 40 × 6.In order to associate patches containing a TC (from now on referred to as positive patches) with its center position (i.e., the TC eye), the latitude and longitude geographical coordinates extracted from IBTrACS were rounded to match the resolution of the ERA5 grid (0.25 • ×0.25 • ).Subsequently, rounded coordinates were further converted into local-patch positions in terms of row-column index pairs (considering the 40 × 40 patch as a matrix).
Different TC phenomena may simultaneously occur in the domain of interest at a particular time, thus multiple positive patches can be retrieved from a single ERA5 map.The patches that do not contain a TC (from now on referred to as negative patches) were labeled with a negative row-column coordinate (i.e., [−1, −1]), indicating the TC absence.In this way, retrieved local patch row-column pairs were used as the target of the detection task.

Dataset creation
Selected patches in the considered period  were split into training and test sets as follows: patches belonging to two consecutive years for each decade were selected for testing (1983, 1984, 1993, 1994, 2003, 2004, 2013, 2014, respectively), whereas patches in the remaining years were used for training.In this way, the training set comprises patches that span the whole time period, enabling ML models to capture and learn potential climate change patterns that may affect the input atmospheric drivers (World Meteorological Organization, 2022).
In order to build the training set, negative patches were carefully selected to enhance the variance of the dataset, as well as to improve the predictive skills of ML models.Among the edge patches surrounding a positive one, the three corner patches closest to the storm center were considered as negative (referred to as nearest patches, see purple patches in Figure 2).Despite nearest patches are labeled as negative, they may contain residual structures (e.g., spiral wind gust tails, minimum regions of mean sea level pressure, etc.) of the TC located in the central patch.Therefore, including such patches can benefit model training.
Additionally, for each positive patch a further negative sample was randomly selected among the 7 × 22 patches of the map excluding the edge ones previously mentioned and thus ensuring that no major TC phenomena occur in the randomly selected patch.By construction, the training set is imbalanced towards negative samples (i.e., 55, 639 positive patches and 212, 679 negative ones, yielding 20% of samples containing a TC).To address this imbalance ratio, as well as to increase the variance of the training set, a selective data augmentation procedure was used.For each positive sample, three transformations were considered: left-right flip, up-down flip and 180 • rotation (Shorten & Khoshgoftaar, 2019).
Conversely, all the 7×22 patches belonging to each map of the test set years were selected to assess the actual performance of ML models on out-of-sample data.The resulting test set consists of 967, 513 negative patches and 14, 149 positive ones, ending up in a strongly imbalanced test set (i.e., only 1.46% of positive samples).

Neural Network Architectures
For each input patch that comprises the six climatic drivers, the proposed TC detection task aims at predicting the local row-column coordinates corresponding to the TC center, if present.From a logical point of view, the detection task can be split into two consecutive subtasks, i.e., classification and localization.Classification consists in determining the presence or absence of TCs within the input patches, whereas the localization subtask concerns the prediction of the TC center coordinates, if present.To this extent, several VGG-like Neural Network (NN) architectures (Simonyan & Zisserman, 2014) were developed and trained, each differing in terms of number of layers, filters and kernel sizes.
Figure 3 depicts the overall architecture.Input patches are processed by a series of convolutional and max-pooling layers that encode the input volume, thus progressively decreasing height and width dimensions, while, at the same time, increasing the depth of the activation volume.The encoder ends with a bottleneck layer (i.e., the N-th convolutional block in Figure 3) that squeezes the processed information resulting in a lower dimensional representation of the patch content (Barrera, 2022).In the decoding stage, bottleneck output is flattened and processed through a series of dense layers that gradually reconstructs the information and links it with the target row-column coordinates of the TC center within the patch.In this sense the VGG-like network is trained to learn the mapping between input climatic drivers content and the two output coordinates.
A total of four different model configurations have been assessed for the detection task, called VGG V1, VGG V2 and VGG V3 that differ for complexity, as well as VGG V4 which was obtained by replacing each basic convolutional layer with a customized one: Conv + Gaussian Noise (optional) + Batch Normalization (optional) + Dropout (optional) + Leaky ReLU (Maas et al., 2013).Optional hyperparameters are selectively activated on each layer.Additionally, the first two convolutional layers of VGG V4 are variable in the kernel size, enabling the model to capture wider spatial features, thus including the TC eye and its surrounding structure, as well.

Training and Validation
For each model configuration, 25% of the training patches were used for validation purposes, whereas the remaining for the actual training.The six input drivers were normalized in the [0, 1] range and augmented according to the data augmentation procedure presented in Section 2.3.1.The aforementioned models were trained for 500 epochs with a batch size of 8, 192 patches, using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e −4 .
Two different loss functions were implemented and used.The first one is the Mean Absolute Error (MAE) between real and predicted coordinates.The second one is a custom loss defined by the authors for this study, called Cyclone Classification Localization (CCL) loss, which is a linear combination of the MAE, the Binary Cross Entropy (BCE) loss and the Euclidean distance between real and predicted coordinates (L2).CCL tries to contemporarily pursue two goals: (i) minimize the classification error (through BCE) and (ii) minimize the localization error (through MAE and L2 terms).Based on different hyperparameters configuration, a total of 13 models were trained.

Test Metrics
The test set was used to evaluate the generalization capabilities of the trained ML models on out-of-sample data.To this extent, different evaluation metrics were computed according to the classification and localization subtasks: • Classification metrics: False Positives (FP), True Positives (TP), False Negatives (FN) and True Negatives (TN) rates were computed, along with precision and recall.It is worth noting that in this context a FP occurs when the ML model incorrectly classifies a patch as containing a TC although it is not present in reality.
• Localization metrics: the Euclidean distance between the real and predicted TC center coordinates was computed only for those patches correctly classified as positive (TP) by the ML model.

Consensus and Models Ensemble
Since the 13 models are trained with a different set of hyperparameters and/or layers configuration, each of them learns different characteristics and high-level features in the training set patches.Therefore, an ensemble approach (Ganaie et al., 2022) has been assessed in this study to combine the predictions made by different models with the aim of improving the overall accuracy skills (see Figure 4).
As depicted in Figure 4, for each patch of the test set, the approach consists in evaluating first how many models agree in classifying it as positive.The number of models that should be in agreement -m in Figure 4 -represents an additional hyperparameter that indicates the minimum level of consensus required to consider the patch as likely containing a TC.
After a trial and error procedure on the validation set, this parameter was set to 7 in order to reach a trade-off between FP and FN rates on the test set, given that a lower number of FNs is preferable for the task of predicting the occurrence of such Extreme Weather Events (EWEs).Each of the 13 models can potentially provide very different estimates about the location of the TC center for the same input patch.Therefore, the Interquartile Range (IQR) method was adopted as a further filtering step to keep only the estimates closer to their median value.In particular, the method consists in considering as outliers those TC center estimates (x ) that satisfy the following inequality: Indeed, the IQR is computed as the difference between the third (Q 3 ) and the first (Q 1 ) quartile, providing information about the spread of the data around the median value.Finally, the localization of the TC center is performed as the ensemble average of the row-column estimates of inliers.

Results
Table 1 summarizes the averaged results produced by the 13 models on the test set, according to the evaluation metrics presented in Section 2.3.4.These 13 models are involved in the ensemble approach described in Section 2.3.5.
Concerning the localization subtask, VGG V1 and VGG V4 models (i.e., Models #1 and #4) performed best by achieving -on average -a lower Euclidean distance between predicted and actual TC center coordinates with respect to VGG V2 and VGG V3 (i.e., Models #2 and #3).This result still holds when the CCL is used as loss in the place of MAE (i.e., Models #5 to #8).All these models were trained with a kernel size of 3. The VGG V4 model trained with a kernel size of 5 and the MAE loss (Model #9) produced similar results to Model #4, by achieving an Euclidean distance of 116.38 km.Moreover, increasing the kernel size up to 13 seems not to further improve the localization error.Nevertheless, increasing the kernel size allows better extracting spatial TC patterns and, thus, reducing the number of FN (i.e., Models #12 and #13).In general, as it can be observed from the results reported in Table 1, a higher localization error in terms of Euclidean distance corresponds to a reduction of the FN rate (lower is better).Regarding the classification subtask, FP and FN rates are in trade-off, thus a lower FN rate corresponds to a higher FP rate as outlined in the table.However, for studies like this addressing the detection of EWEs it is generally more appropriate having a higher number of FPs rather than FNs.Indeed, a false alarm (FP) is preferable, in such a case, against situations in which models incorrectly classify a TC as not present (FN).less than m (given) models detect the presence and localize the TC in the patch, the TC is considered as absent and the patch is labeled with negative coordinates.
Otherwise, the IQR algorithm is applied on the output of the models detecting the TC; the final estimated location of the TC center is computed by averaging the values of the predictions not filtered by IQR.

Detection through the ML ensemble approach
Figure 5 shows the ML ensemble approach applied on three different patches during TC John evolution (August, 11 th to September, 13 th 1994), overlaid on the fg10 variable.Each row in Figure 5 refers to a specific time step of John's lifetime, whereas each column describes a particular step of the proposed localization approach.For instance, Panel a) reports TC center estimates provided by 12 models (red squares) out of 13, thus only one model predicted the absence of a TC in the input patch.In this case, the minimum consensus of 7 is reached.In terms of localization error, the mean TC center estimate of such 12 models (green diamond) is 61.76 km far from the actual TC center (dark blue cross).Subsequently, in Panel b), the IQR method allows detecting 2 outliers (red squares outside the red box).Therefore, Panel c) reports only 10 remaining inliers (red squares inside the red box) along with the ensemble TC center estimate (purple triangle).Thanks to the IQR method the ensemble TC center estimate (purple triangle) is closer to the actual TC center (dark blue cross) than the initial mean estimate (green diamond).In particular, the distance between the actual TC center and the one provided by the IQR method applied on the ensemble is 39.06 km, with a localization improvement of about 37%.The same description also holds for the examples reported in Panels d-i), where respectively 12 (in f)) and 8 (in i)) models contribute to the computation of the ensemble TC center estimate, with a localization improvement of about 21% and 37%, respectively.It is worth noting that the proposed procedure consisting in the combination of IQR and ensemble approach is of general value and it is particularly suited in such situations in which ML models predictions are spread.Panels d-f) in Figure 5 represent an example of such behavior.
The aforementioned procedure was applied on the entire test set.The ensemble approach allows obtaining an average localization distance between the actual TC center and the estimated one of about 118.46 km (see the second to last row in Table 1).An average localization distance of 127.5 km would have been obtained if the IQR method was not applied.This results in an improved localization accuracy of 9 km.From a classification perspective, evaluation of the ensemble on the test set reports 5.38% of FPs and 10.71% of FNs respectively (second to last row in Table 1).On one hand the recall metric was 89.29%, meaning that the models' ensemble is highly confident in identifying TC centers.On the other hand, considering that the number of FPs (52,051) is more than 4 times greater than the number of TPs (12,634), the precision metric is not very high (i.e., 19.53%).Since there is a strong imbalance between positive and negative samples, these results must be carefully interpreted as it may result in a misleading perception of the ensemble classification skills (Koehler, 1996).To this extent, a further evaluation procedure was conducted over 10, 000 balanced subsets randomly sampled from the test set.Each subset is composed of 16, 000 test patches.Both the randomness and the high number of subsets guarantees the statistical meaningfulness of the experiment.The evaluation metrics (reported in Section 2.3.4) were computed on each of the 10, 000 subsets and averaged together, and the results are reported in Table 1 (last row).It can be noted that besides the precision metric, which showed a substantial increment to 94.31%, the others were not affected at all.In summary, the ensemble shows good classification skills from both precision and recall standpoints on such a balanced dataset.Therefore, it further proves that the ensemble approach is more accurate than using a single ML model for TC detection.

Sensitivity Analysis of the ML ensemble approach
A sensitivity analysis of the proposed approach was applied to assess the performances of the ML ensemble approach during various phases of the TC lifecycle that are typically characterized by different intensities of Maximum Sustained Wind (MSW), as registered in IBTrACS.As an example, in Figure 6 models ensemble predictions are reported for the TC Chantal (10-15 September 1983), overlayed on the vo 850 input driver.
Figure 5. ML ensemble approach applied on three different time steps of TC John lifetime (rows), overlaid on the fg10 variable.In each row, panels represent a particular stage of the proposed procedure.The number of models involved is reported in each panel and their TC center estimates are depicted as red squares, while the actual TC center is represented as a dark blue cross.Their average is reported as a green diamond.In center panels (b, e and h) the IQR is applied to detect outliers among models' TC center estimates.In the right panels, the model ensemble average (purple triangle) is computed only considering inlier values.
In particular, three time steps over the Chantal lifecycle are shown, namely 1983-09-11 at 00.00, 1983-09-12 at 12.00 and 1983-09-15 at 06.00, respectively.In the early and final stages (i.e., a) and c) panels) the vo 850 variable does not show the typical circular spatial patterns surrounding the storm eye and indeed the models ensemble struggle to estimate accurately the actual TC center (blue cross).This results in spread predictions and thus a higher standard deviation of the ML ensemble (red circle).To explain this situation, the MSW was retrieved from the IBTrACS dataset for the corresponding timesteps.The early and final stages of the Chantal TC are characterized by MSW speeds of 35 and 30 knots (i.e., weak TCs in IBTrACS), respectively.On the contrary, during the middle stage of its evolution (Panel b)), when the storm gains more strength (i.e., the MSW increases to 65 knots), spatial circular patterns of the vo 850 become more evident around the eye.This makes the detection task easier and leads to a lower localization error between predicted and actual TC center location.Indeed, models involved in the ensemble predict approximately the same position, leading to a lower standard deviation (i.e., circle radius in Panel b) is smaller) from the ensemble TC center estimate.

Test Cases: Keoni and Julio Storms
Model #13 was used to evaluate the performance of the ensemble approach with respect to a single model.Model #13 was selected since it showed more balanced localization and classification metrics (see Table 1).Specifically, the Keoni and Julio TCs that occurred in the domain of interest were analyzed.The centers detected by Model #13 and the ensemble over the TCs evolution were reported in Figures 7 and 8, respectively.Both figures are organized as follows: the upper panel represents the MSW of the TC, expressed in knots.Middle panel shows the Euclidean distance between true and predicted TC center coordinates, expressed in kilometers, produced by Model #13 and by the ensemble approach.Lastly, lower left and lower right plots represent the true TC track (in gray) along with predictions from the ensemble (in blues) and Model #13 (in red), respectively.

TC Keoni
The TC Keoni occurred from 9 th August to 4 th September 1993.During its lifecycle it became a hurricane characterized by strong winds that reached 115 knots of speed before starting to lose its intensity from 19 th August until early September.
From Figure 7 it can be evinced that early and final stages of Keoni lifecycle are characterized by low MSW, and therefore both Model #13 and the ensemble provided TC center estimates with a higher Euclidean distance from the actual TC center coordinates.On the other hand, as the cyclone gains more strength, input drivers' spatial features around the eye become more evident, thus making the TC detection easier for both the ensemble and Model #13.As a result, Euclidean distance is lower than the other stages This analysis is in line with the one made in Section 3.2 about the relationship between MSW and localization error.
The time steps in which the TC was not detected (from now on referred to as discontinuities) are the same for both Model #13 and the ensemble.This means that most of the models in the ensemble did not find a TC in the corresponding time step.Considering the overall TC trajectory, the ensemble better converges to the actual one provided by IBTrACS (lower left panel), as it implicitly exploits the contribution from all the models to provide more accurate TC center estimates.Conversely, Model #13 (lower right panel) provides a higher localization error in TC center estimates even though it catches the overall trajectory pattern.

TC Julio
Similarly to the Keoni test case, the early and final stages of TC Julio are characterized by low MSW (upper panel) and higher localization errors for both the ensemble and Model #13, as reported in Figure 8.Nevertheless, even though in the aforemen-  tioned stages the cyclone is classified as a Disturbance Storm, the models are still able to capture the phenomena in this status.This is a remarkable result since both Disturbance Storm and Not Reported TCs were filtered out from the training set and therefore their characteristics have not been shown during training.
Over the TC evolution, the Euclidean distance remains stable on average -especially for the ensemble -and slightly increases as the typhoon dissipates its energy.A discontinuity appears only for the model's ensemble, thus showing that Model #13 detected the TC in the corresponding time step, but most of the other models failed (middle panel).Another interesting aspect worth highlighting is the large difference between the localization error of the ensemble model and the Model #13 for the entire TC lifetime.This clearly demonstrates that the ensemble localization approach is more robust than a single model prediction, thus better converging to the real cyclone trajectory over its evolution, as depicted by lower left and right panels.

Conclusion and Discussion
The present study proposed a TC detection approach based on ML that aims at identifying and localizing TCs' center in terms of geographical coordinates by exploiting six climatic drivers that are related to the genesis and sustainment of such EWEs.
Given the inherent complexity of the detection task, trusting the estimation of the TC center position through a single ML model would have led to unreliable results.Therefore, an ensemble approach was proposed to integrate the knowledge learnt by different ML models that are trained for the same detection task.The ensemble relies on 13 VGGlike architectures that are trained with distinct hyperparameters configurations on the same input-output pairs.This allows extracting different intrinsic patterns and features related to the TC evolution during its lifetime, as well as reducing the uncertainty associated with its center position estimate.The present approach is extendable either by adding new ML models to the ensemble or improving the current ones to get higher classification and localization performance.
ERA5 reanalysis data, concerning six input climatic drivers, were jointly exploited with IBTrACS historical records to train and test the designed models.Reanalysis data combine model simulations with observations to provide the best state representation of different climatic variables in the past.However, as recognized by Hodges et al. (2017), no assimilation of TCs is performed in ERA5, unlike other reanalyses such as JRA-55 or NCEP-CFSR datasets.Nonetheless, the ensemble exhibits a good accuracy, specifically providing 10.71% of FN and 5.48% of FP on the test set (Section 3.1).The IQR method applied to the ensemble as an outlier filtering step further improved the performance, thus leading to an average localization accuracy of 118.46 km (almost 1 • of latitude) on the test set.Indeed, Zarzycki et al. (2021) and Roberts et al. (2020) evaluated a series of metrics on ERA5 with comparable performance to JRA-55 and NCEP-CFSR: the main reason can be found in the enhanced resolution of ERA5 with respect to the previous ERA-Interim product.This motivated the use of ERA5 reanalysis for the six input climate drivers in this study.Moreover, the presented processing methodology of ERA5 maps led to non-cyclone-centric input patches, i.e., 40 × 40 images in which the TC center can occur in any position, not necessarily in its center.In this way, ML models were able to learn the drivers spatial patterns and characteristics related to the presence of the TC inside the patch, regardless of its position.As a result, beyond Tropical, Subtropical and Extra Tropical storms, the ensemble was capable of localizing also centers of NR (Not Reported) and DS (Disturbance Storms) cyclones with a low error, even though they were not included in the training set, thus demonstrating good generalization capabilities of the proposed approach.Furthermore, tiling ERA5 maps into non-overlapping patches of fixed size allowed ML models to detect multiple TCs that can simultaneously occur in the joint North Atlantic and Pacific geographical domain targeted in this study.The application of the proposed approach to other formation basins was not assessed in this study and will be subject to future investigation.
Concerning the limitations of the present research, it is important to take into account uncertainties related to IBTrACS and ERA5 data that may lead to biased TC center positioning.In particular, IBTrACS provides TC center geographical coordinates aligned on a 0.1 • ×0.1 • grid and with an uncertainty that is inversely proportional to the storm intensity (IBTrACS Science Team, 2019).ERA5 maps, on the other hand, are provided on a 0.25 • ×0.25 • grid.Therefore, TC centers were aligned on the ERA5 grid, as a preprocessing step (Section 2.2.2).As a result, all the sources of uncertainty implicitly affected both ML training and inference.As higher resolution reanalysis data will become available, the inherent uncertainty will also be reduced.
The proposed study focused only on analyzing the performance of the ML TC detection solution.As a future activity, a comparison with deterministic solutions like TStorm from National Oceanic and Atmospheric Administration (NOAA) (https://www.gfdl.noaa.gov/tstorms/) is envisioned to understand how the ML-based approach performs with respect to a deterministic tracking schema.
Moreover, it must be noted that the workflow for supporting the TC detection solution presented here is very complex and composed of heterogeneous data and software components.It requires large-scale data handling solutions, jointly with ML algorithms and access to High Performance Computing infrastructure.As next step the authors aim at developing an integrated pipeline that can apply the pre-processing and ML model pipeline directly to the output of an ESM simulation.This effort is currently ongoing in the frame of the eFlows4HPC EU project (Ejarque et al., 2022).Finally, the ensemble approach proposed here will be explored in future work, in the context of the inter-Twin project (https://www.intertwin.eu/), by exploiting climate projection data from CMIP experiments with the aim of providing an indication on how climate change affects TCs frequencies and locations in the future.

Figure 1 .
Figure 1.Visualization of 6-hourly TCs location within the 1980-2019 period in the joint North Atlantic and North Pacific geographical domain (100 − 320 • E, 0 − 70 • N ).TCs in each sub-basin are highlighted by a different color, along with the relative number of occurrences.Only IBTrACS records whose nature is TS, SS and ET are shown in the picture.

Figure 2 .
Figure 2. Overview of the dataset building pipeline.ERA5 maps are tiled into non-overlapping patches of size 40 × 40.To create the training and validation subsets, for each tiled ERA5 map, the patch containing the TC is considered along with nearest and random patches, discarding the remaining ones.Concerning the test subset, all the patches are considered.

Figure 3 .
Figure 3. Basic representation of VGG-based models configured to address the TC detection task.Patches related to the six input drivers are stacked together and the local row-column coordinates of the TC center are used as the target.Models differ for complexity as well as different hyperparameters configuration.

Figure 4 .
Figure 4. Diagram representing the models ensemble approach.All the n pre-trained models are fed with the same patches, yielding n row-column couples.If

Figure 6 .
Figure 6.TC Chantal vo 850 spatial patterns during early (left panel), middle (middle panel) and final (right panel) stages of its lifecycle.The ensemble TC center estimate (red cross) along with the true one (blue cross) is represented.The standard deviation of the ensemble TC center estimate is represented through the red circle.

Figure 7 .
Figure7.Comparison between performance of the ML ensemble against Model #13 on TC Keoni track.The first panel depicts MSW during the cyclone evolution.The second one shows the Euclidean distance between real and predicted TC center estimates for both Model #13 (red line) and the ensemble (blue line), with the standard deviation among models in agreement (light blue area) for the latter.Last two panels represent, respectively, the TC geographical coordinates predicted by the ensemble (blue line) and Model #13 (red line) along with real Keoni tracks (gray line) over its lifetime.Discontinuities in middle and last panels time series correspond to time steps in which either Model #13 or the ensemble did not detect the TC.

Figure 8 .
Figure8.Comparison between performance of the ML ensemble against Model #13 on TC Julio track.The first panel depicts MSW during the cyclone evolution.The second one shows the Euclidean distance between real and predicted TC center estimates for both Model #13 (red line) and the ensemble (blue line), with the standard deviation among models in agreement (light blue area) for the latter.Last two panels represent, respectively, the TC geographical coordinates predicted by the ensemble (blue line) and Model #13 (red line) along with real Julio tracks (green line) over its lifetime.Discontinuities in middle and last panels time series correspond to time steps in which either Model #13 or the ensemble did not detect the TC.

Table 1 .
Average metrics over the test set for each of the 13 models.Ensemble performance on both imbalanced and randomized balanced test sets was also reported for comparison in the last two rows.