A Benchmark to Test Generalization Capabilities of Deep Learning Methods to Classify Severe Convective Storms in a Changing Climate

This is a test case study assessing the ability of deep learning methods to generalize to a future climate (end of the 21st century) when trained to classify thunderstorms in model output representative of the present-day climate. A convolutional neural network (CNN) was trained to classify strongly rotating thunderstorms from a current climate created using the Weather Research and Forecasting model at high resolution, then evaluated against thunderstorms from a future climate and found to perform skillfully and comparably in both climates. Despite training with labels derived from a threshold value of a severe thunderstorm diagnostic (updraft helicity), which was not used as an input attribute, the CNN learned physical characteristics of organized convection and environments that are not captured by the diagnostic heuristic. Physical features were not prescribed but rather learned from the data, such as the importance of dry air at mid-levels for intense thunderstorm development when low-level moisture is present (i.e., convective available potential energy). Explanation techniques also revealed that thunderstorms classified as strongly rotating are associated with learned rotation signatures. Results show that the creation of synthetic data with ground truth is a viable alternative to human-labeled data and that a CNN is able to generalize a target using learned features that would be difficult to encode due to spatial complexity. Most importantly, results from this study show that deep learning is capable of generalizing to future climate extremes and can exhibit out-of-sample robustness with hyperparameter tuning in certain applications.

classification and prediction of convective storms in the present climate (Gagne II et al., 2019), modeled using the Weather Research and Forecasting model (WRF; Skamarock & Klemp, 2008) at high resolution (4 km). However, as the climate continues to warm, some future thunderstorms may be outliers in the baseline climate used for training (Trapp & Hoogewind, 2016), and these extreme events may be more difficult for CNNs to identify. This article explores the ability of CNNs to classify convection of a future climate modeled with WRF, along with the physical reasons for the resultant performance.
Climate change is altering the large-scale atmospheric landscape over North America, resulting in changes to the frequency and intensity of organized convection (Prein et al., 2017; K. L. Rasmussen et al., 2017). Future changes to thermodynamic and kinematic fields can impact climatological distributions of convection morphology and associated severe hazards (e.g., tornadoes and large hail; Diffenbaugh et al., 2013; Trapp et al., 2007, 2009). Studies have shown a climate change imprint on various aspects of severe thunderstorms and associated environments (Allen, 2018), including increases in thermodynamic buoyancy and thunderstorm frequency (Brooks, 2013; Hoogewind et al., 2017), increases in convective inhibition (Taszarek et al., 2020), more societal exposure (Ashley & Strader, 2016), and an eastward geographic shift of environments over the United States favorable for severe hazards (Gensini & Brooks, 2018). However, discerning the interplay between thermodynamic and kinematic components on future convection has been more challenging (Brooks, 2013), given that subtle changes to either field can alter the potential of a thunderstorm to produce severe hazards (Doswell et al., 1996). This complex interplay, and varying seasonal and geographical trends, limit the broader conclusions that can be derived from climate studies of severe convective storms.
In current forecasting applications, advancements in delineating thunderstorms capable of producing specific hazards have included the development of environmental proxies and composite indices that take kinematic and thermodynamic factors into account (Gropp & Davenport, 2018; E. N. Rasmussen, 2003; R. L. Thompson et al., 2003, 2007, 2012). Updraft helicity (UH) is an example of a diagnostic parameter, which estimates the magnitude of rotation within a thunderstorm's updraft using vertical wind speeds and vorticity (Kain et al., 2008). Strongly rotating thunderstorms with high magnitudes of UH (e.g., ≥75 m² s⁻²) have a greater likelihood to be of supercell morphology (Clark et al., 2013; Sobash et al., 2016), a type of thunderstorm that observations have shown to be more likely to produce severe hazards (Bunkers et al., 2006; Duda & Gallus Jr, 2010). Scalar thresholds for UH have been used to classify model-simulated convection, with thunderstorms that exceed the predetermined threshold classified as severe (Molina, Allen, & Prein, 2020; Sobash et al., 2011). These dichotomous assignments derived from UH have been used in kilometer-scale climate simulations to estimate changes to severe hazards in a future climate (Gensini & Mote, 2015; Trapp et al., 2011). However, the use of a heuristic to delineate non-severe and severe convection can result in incorrect categorizations of thunderstorms that fall near the predetermined threshold. UH values representative of severe convection also vary seasonally and regionally, based on the climatological environments that drive severe convection activity (Molina, Allen, & Prein, 2020; Sobash & Kain, 2017). Recently, Sobash et al. (2020) trained a CNN to forecast severe hazard potential using severe thunderstorm parameters derived from WRF, showing that a CNN can learn from diagnostics. The focus herein lies in evaluating a CNN's ability to classify convection and its out-of-sample robustness to a future climate.
CNNs are a class of deep learning models canonically used for computer vision tasks because of their capability of processing multiple layers of information to detect nonlinearities and translation-invariant details of features (Krizhevsky et al., 2012; LeCun et al., 1998). Various techniques have been developed to prevent deep learning models from overfitting and to improve training stability, such as dropout and batch normalization (Ioffe & Szegedy, 2015; Srivastava et al., 2014), which help CNNs generalize relationships among input features and increase prediction accuracy. However, explaining the reasons for model skill has been challenging, due to the complex architecture of CNNs that include many trained weights and biases within hidden layers and feature maps. Various techniques have recently been developed to create and improve explanations of machine learning predictions and classifications (Barnes et al., 2019; McGovern et al., 2019). These explanation techniques include saliency maps (Simonyan et al., 2013) and permutation feature importance (Breiman, 2001; Lakshmanan et al., 2015), which have been shown to help explain skillful CNN predictions of convective hazards (Gagne II et al., 2019). Identifying reliable reasons for model performance can increase the trust of atmospheric scientists in machine learning and foster further discovery of the physical processes driving societally impactful weather and climate extremes.
Using deep learning and explanation techniques, the following questions will be analyzed in this study:

1. Are future strongly rotating thunderstorms classified skillfully by a CNN that was trained under current climate conditions?
2. Which input features and spatial patterns are identified to be most important by the deep CNN for classification?
3. What are the reasons (explanations) for incorrect classifications?

Thunderstorm Identification in Climate Simulations
A set of two convection-permitting model simulations created by the Water System Program of the National Center for Atmospheric Research were used to extract thunderstorm objects for this study (C. Liu et al., 2017). The two simulations were created using WRF at 4 km grid spacing over the CONUS. The WRF simulations cover 13 years each and represent a retrospective climate period (October 2000-September 2013) and a future climate period (end of the 21st century). Initial and boundary conditions for both simulations were driven by the 6-hourly, 0.7° ERA-Interim reanalysis (Dee et al., 2011), which is a global climate reanalysis data set produced by the European Centre for Medium-Range Weather Forecasts. A pseudo-global warming (PGW) perturbation signal (Schär et al., 1996), representative of an end-of-the-21st-century business-as-usual climate scenario, was added to state variables of the future climate simulation. The PGW signal was derived from a set of 19 Coupled Model Intercomparison Project Phase 5 (CMIP5) models (Taylor et al., 2012) generated with Representative Concentration Pathway 8.5 (RCP8.5; 8.5 W m⁻² radiative forcing), which is a very high greenhouse gas concentration pathway (Moss et al., 2010). To prevent drifting of the 4 km regional simulation from the reanalysis boundary conditions, large-scale spectral nudging of moderate strength was applied above the planetary boundary layer (von Storch et al., 2000), which provided synoptic-scale fidelity to past weather events yet allowed the mesoscale to evolve with some freedom. These model simulations allow us to isolate thermodynamic signals from kinematic influences on the future climate. Simulation details are available in Table 1 and additional specifications can be found in C. Liu et al. (2017).
The watershed transform (Lakshmanan et al., 2009) was used to identify high-intensity updrafts that constitute thunderstorms from the convection-permitting climate simulations. The watershed transform, as employed herein, identified thunderstorms using a simulated radar reflectivity minimum threshold of 40 dBZ, a quantity proportional to the number of drops per unit volume that provides an estimate of convective precipitation (Trapp et al., 2011). Grid cells adjacent to the detected local maxima that also exceeded a minimum threshold of 20 dBZ were then treated as part of the thunderstorm object. This process was repeated iteratively, with surrounding grid cells continually associated with a thunderstorm until values were either below the minimum threshold of 20 dBZ or the object exceeded a predetermined spatial extent of 128 km (32 grid cells × 4 km grid spacing). Each thunderstorm was saved as an object spanning 128 × 128 km containing the thunderstorm and the adjacent environment, which influences thunderstorm characteristics (R. L. Thompson et al., 2012). Thunderstorms were extracted over land and east of the Rocky Mountains (Figure 1), where severe thunderstorms have a greater climatological likelihood of occurrence (Brooks et al., 2003). The temporal focus of this study was limited to winter (December, January, and February; DJF) and spring months (March, April, and May; MAM). Other seasons were omitted due to a simulated dry bias during summer months across the central CONUS, which was partly associated with land-surface feedbacks (Barlage et al., 2018).

Table 1
Weather Research and Forecasting Model Simulation Parameterization Schemes and Settings, as Detailed in C. Liu et al. (2017)
Planetary boundary layer scheme: Yonsei University (Hong et al., 2006)
Shortwave and longwave radiation scheme: RRTMG (Iacono et al., 2008)
Land surface scheme: Improved Noah-MP land-surface model (Niu et al., 2011)
Note. No sub-grid cloud cover, shallow, or deep cumulus parameterizations were employed.
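The two-threshold object-growing logic described in this section can be sketched as follows. This is a minimal, illustrative region-growing pass (plain breadth-first growth from a reflectivity maximum), not the full Lakshmanan et al. (2009) watershed implementation; the grid values, function name, and neighborhood rule are assumptions for demonstration.

```python
import numpy as np
from collections import deque

def grow_storm_object(refl, seed, core_dbz=40.0, edge_dbz=20.0, max_cells=32):
    """Grow a storm object from a simulated-reflectivity maximum.

    Starting at a seed cell exceeding the 40 dBZ core threshold, iteratively
    attach adjacent cells exceeding the 20 dBZ edge threshold, stopping when
    no neighbors qualify or the object spans 32 cells (128 km at 4 km spacing).
    """
    if refl[seed] < core_dbz:
        return np.zeros_like(refl, dtype=bool)
    mask = np.zeros_like(refl, dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if (0 <= ni < refl.shape[0] and 0 <= nj < refl.shape[1]
                    and not mask[ni, nj] and refl[ni, nj] >= edge_dbz):
                # Enforce the predetermined spatial-extent limit before growing.
                rows, cols = np.where(mask)
                if (rows.max() - rows.min() >= max_cells - 1
                        or cols.max() - cols.min() >= max_cells - 1):
                    return mask
                mask[ni, nj] = True
                queue.append((ni, nj))
    return mask
```

A 4-connected neighborhood is used here for simplicity; the published object-identification code may differ in neighborhood and stopping details.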
Similar to Gagne II et al. (2019), meteorological state variables were extracted from the WRF simulations to train a CNN after creating the thunderstorm objects. Five variables were extracted and interpolated onto four different vertical levels, resulting in a total of 20 input attributes used for training the CNN (Figure 2). The five variables are pressure (P; hPa), temperature (T; K), water vapor mixing ratio (q_w; g kg⁻¹), and zonal (u) and meridional (v) winds (m s⁻¹). Variables were then interpolated onto the following heights above ground level (AGL): 1, 3, 5, and 7 km. AGL heights were preferred over constant pressure surfaces because pressure surfaces might be below ground across portions of the High Plains and AGL heights are more likely to sample similar parts of a thunderstorm updraft. UH (m² s⁻²) was also extracted and is quantified as

UH = ∫_{2 km}^{5 km} w ζ dz,

where the integral of the product of vertical velocity (w) and vertical vorticity (ζ) is computed from 2 to 5 km AGL (Kain et al., 2008). UH was used as ground truth to create training labels, but was not used as an input attribute (i.e., variable) for the CNN. A high-magnitude UH threshold (e.g., ≥75 m² s⁻²) was used to delineate convection more likely to be of supercell morphology (Sobash et al., 2011). The 1D vector containing labels for CNN training and testing was created using binary assignment (i.e., integer encoding) derived from UH and encoded as 0 or 1. Values exceeding the UH threshold (the potentially severe category) were assigned a label of 1, whereas values below the threshold were assigned a label of 0.
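The label-encoding step amounts to a thresholded comparison; a minimal sketch follows, where the UH values are illustrative rather than taken from the simulations.

```python
import numpy as np

# Maximum updraft helicity (m^2 s^-2) per thunderstorm object (illustrative).
uh_max = np.array([12.0, 81.5, 74.9, 130.2, 3.4])

UH_THRESHOLD = 75.0  # delineates potentially severe, strongly rotating convection

# Integer-encoded ground-truth labels: 1 = exceeds threshold, 0 = below it.
labels = (uh_max >= UH_THRESHOLD).astype(int)
# -> [0, 1, 0, 1, 0]
```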
Thunderstorm objects were then split into two subsets prior to CNN training: 60% for training and 40% for testing. Stratified sampling of thunderstorm objects that exceeded or did not exceed the delineated UH threshold was conducted to ensure that training and testing data contained the same percentage of majority and minority classes. Since the meteorological variables contain different dynamic ranges, the training data were standardized by subtracting the training set variable's mean and then dividing by its standard deviation. The testing data were also standardized for model evaluation using the mean and standard deviation values extracted from the current climate training data.
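A minimal sketch of the stratified 60/40 split and the standardization step, assuming `features` is a (samples × variables) array; function names are illustrative. Class proportions are preserved per class, and only training-set statistics are ever applied to the test set.

```python
import numpy as np

def stratified_split(features, labels, train_frac=0.6, seed=0):
    """Split 60/40 while preserving class proportions in both subsets."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        rng.shuffle(idx)
        n_train = int(round(train_frac * idx.size))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

def standardize(train, test):
    """Standardize per variable, using the training set's statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (test - mean) / std
```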

CNN Architecture and Explanations
The deep learning model used in this study was a CNN (LeCun et al., 1990) that consisted of three convolutional layers (similar to Gagne II et al., 2019). The 20 input attributes described in the previous sub-section were fed into the first convolutional layer (Figure 2). A stride length of 1 was used for the filter windows, which consisted of 5 × 5 grid cells, with zero padding also applied to the edges of each feature map. The rectified linear unit (ReLU; max(0, x)) activation function was used for each feature map (except for the last dense layer), which preserved the magnitude of positive signals and negated negative signals when propagated forward through the network (LeCun et al., 2015). Max pooling was performed after each convolutional layer by extracting maximum values of the feature maps within a sliding 2 × 2 filter window with a stride length of 1. Max pooling added translation invariance and allowed the model to learn higher-level features (i.e., spanning more kilometers) in deeper layers. After the three convolution and pooling operations, the resultant data were flattened into a 1D vector, passed through a dense layer (Figure 2), and a ReLU activation function was applied to its output. The 1D vector was then passed through a final dense layer, with a sigmoid activation function applied to produce the model's output as a value between 0 and 1, which was interpreted as the probability that the input attributes contained a strongly rotating thunderstorm.
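The pointwise operations above can be written out directly; this is a toy numpy rendering of single-feature-map ReLU, 2 × 2 stride-1 max pooling, and the sigmoid output, as a sketch of the building blocks rather than the trained network itself.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: passes positive signals, zeroes negatives."""
    return np.maximum(0.0, x)

def max_pool_2x2_stride1(fmap):
    """2 x 2 max pooling with stride 1 over a single 2D feature map."""
    h, w = fmap.shape
    out = np.empty((h - 1, w - 1))
    for i in range(h - 1):
        for j in range(w - 1):
            out[i, j] = fmap[i:i + 2, j:j + 2].max()
    return out

def sigmoid(x):
    """Maps the final dense-layer output to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))
```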
The weights of the CNN were trained to minimize mean squared error (MSE) using the Adam optimization algorithm (Kingma & Ba, 2014) via backpropagation with a learning rate of 0.0001. Glorot uniform was used as the layer weight initializer (Glorot & Bengio, 2010). Sensitivity to random weight initialization was assessed by training several models using different random initializations; skill was comparable across models, potentially due to the large training sample size or the large number of input attributes used during training. During training, a batch size of 128 was used, randomly pulled from the training data population and passed forward through the CNN. To prevent overfitting of weights during training, Ridge (L2 norm; 0.001) regularization was added as a penalty term to reduce the magnitude of the weights at each convolutional layer. Batch normalization was also applied after each convolutional layer and the first dense layer (i.e., before each pooling layer), which involved standardizing layer outputs by subtracting the batch mean and dividing by the batch standard deviation, in effect reducing covariate shift (Ioffe & Szegedy, 2015). 2D spatial dropout (30% in this study; Srivastava et al., 2014) was also employed after batch normalization, which increased the robustness of learned features. We note that numerous training iterations were run with the order of batch normalization and spatial dropout reversed, but results were more skillful with batch normalization preceding spatial dropout in this application. A validation data set was used during training, consisting of 10% of the available training data, which provided insight into the skill of the model during training. Cross-validation on five folds was also performed to assess result sensitivity to the underlying test data distribution, and differences were minimal. The final model settings were selected based on the lowest resultant test data MSE from a hyperparameter grid search that resulted in over 128 independently trained CNNs, each trained for 20 epochs. The classification output of the lowest-MSE model was evaluated using probabilistic and nonprobabilistic skill metrics that will be further detailed within the results. For more details about CNNs, see Goodfellow et al. (2016).
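The two regularization operations described above can be sketched in simplified numpy form, in the order used here (batch normalization, then 2D spatial dropout). Learnable per-channel parameters and inference-time moving averages are omitted for brevity, so this is an illustration of the batch-statistics idea, not a drop-in layer.

```python
import numpy as np

def batch_normalize(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize layer outputs by the batch mean and batch standard
    deviation, then apply a scale (gamma) and shift (beta)."""
    mean = batch.mean(axis=0, keepdims=True)
    std = batch.std(axis=0, keepdims=True)
    return gamma * (batch - mean) / (std + eps) + beta

def spatial_dropout_2d(batch, rate=0.3, rng=None):
    """Zero out whole feature-map channels with probability `rate`
    (training time only); kept channels are rescaled by 1/(1 - rate)."""
    rng = rng or np.random.default_rng()
    keep = rng.random(batch.shape[-1]) >= rate   # batch: (N, H, W, channels)
    return batch * keep / (1.0 - rate)
```

Dropping whole channels (rather than individual grid cells) is what makes the dropout "spatial": adjacent cells in a feature map are strongly correlated, so cell-wise dropout regularizes little.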
To explore the relative importance of specific meteorological variables on CNN classification performance, we used permutation feature importance (PFI; Breiman, 2001), specifically the single-pass forward variant. PFI ranks variables based on how much randomizing them impacts error during testing, with larger magnitude decreases in skill associated with greater importance. Higher relative importance suggests that the respective variables have greater relevance to the classification due to the larger magnitude weights associated with them within the CNN architecture. Five hundred permutations were completed for each of the 20 variables to capture uncertainty associated with shuffling order in PFI. Permuted fields for a set of examples were also visualized to further explain variable importance results. The chosen examples consist of cases that were originally classified as one class by the CNN but switched to another class due to PFI. Certain classified thunderstorms that were switched to incorrect classifications (according to the ground truth label) also consistently appeared in larger skill reductions, and these were used to narrow down the subset of thunderstorms for visualization. To explain model reasoning within a spatial context, image-specific saliency maps (Simonyan et al., 2013) and input*gradient (Shrikumar et al., 2016) were used. Input*gradient is computed as the product of the local gradient and the input (Mamalakis et al., 2021), which shows each variable's contribution to the prediction.
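The single-pass PFI procedure can be sketched generically: shuffle one variable at a time in the test inputs and record the resulting drop in skill. The function signature, the toy model in the usage note, and the repeat count are illustrative assumptions, not the study's implementation.

```python
import numpy as np

def permutation_importance(model_fn, X, y, score_fn, n_repeats=500, seed=0):
    """Single-pass permutation feature importance: shuffle one input
    variable at a time and record the mean drop in test skill."""
    rng = np.random.default_rng(seed)
    base = score_fn(y, model_fn(X))
    drops = np.zeros(X.shape[1])
    for var in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, var])   # break this variable's link to the label
            drops[var] += base - score_fn(y, model_fn(Xp))
    return drops / n_repeats          # larger mean drop = more important
```

With a toy model that only uses its first input, PFI correctly assigns the second input zero importance, since shuffling an ignored variable cannot change the predictions.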

Thunderstorm Classification in a Future Climate
Thunderstorms identified within the future climate model simulation contain warmer temperatures and higher moisture content than thunderstorms identified within the current climate model simulation at all vertical levels (1, 3, 5, and 7 km; Table 2), which is consistent with the applied PGW signal (C. Liu et al., 2017). Table 2 shows that future thunderstorms contain about 1.3 g kg⁻¹ more low-level (at 1 km) water vapor mixing ratio and are about 2.4 K warmer at low levels (at 1 km) than thunderstorms of the current climate (both statistically significant at the 95% confidence level). These results are also consistent with the Clausius-Clapeyron equation, which estimates a 7% increase in saturation vapor pressure per +1°C of warming. Using the Clausius-Clapeyron equation, the water vapor mixing ratio of future thunderstorms should be about 8.76 g kg⁻¹ at 1 km, which is comparable to the 8.9 g kg⁻¹ contained in thunderstorms extracted from the future climate model simulation (Table 2). Extremes within the future climate were also of interest. These extremes were classified as "outlier cases" and were selected as thunderstorms containing 1-km water vapor mixing ratio exceeding the 99th percentile of thunderstorms from the future climate. The focus of outlier cases lies on the 1-km water vapor mixing ratio because increased low-level moisture and thermodynamic buoyancy can result in more intense vertical winds related to stronger thunderstorm updrafts. Added low-level moisture and warmth provide additional thermodynamic buoyancy and vertical instability that could lead to more intense convection in the future (Prein et al., 2017; K. L. Rasmussen et al., 2017). The increased moisture and warmth could also pose the CNN with added difficulty in performing the thunderstorm classification task. Table 2 shows little change in zonal (u) and meridional (v) thunderstorm winds between the current and future climate model simulations. Since the classification task being performed by the CNN is related to winds, the relative consistency in wind magnitude may result in little change in classification skill between the current and future climate model simulations.
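The Clausius-Clapeyron check above is a one-line back-of-envelope calculation. The current-climate 1-km median mixing ratio used below (7.45 g kg⁻¹) is an assumed illustrative value chosen to be consistent with the quoted 8.76 g kg⁻¹ estimate, not a number quoted from Table 2.

```python
# Clausius-Clapeyron scaling: ~7% more saturation vapor pressure per +1 K.
q_current = 7.45   # assumed current-climate 1-km median mixing ratio (g/kg)
warming = 2.4      # low-level warming between the simulations (K)

q_expected = q_current * 1.07 ** warming
print(round(q_expected, 2))  # ~8.76 g/kg, close to the simulated 8.9 g/kg
```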
Here, we evaluate probabilistic forecasts generated by the CNN, which are probabilities that the thunderstorm objects contain a strongly rotating or non-strongly rotating thunderstorm. Strongly rotating thunderstorms are associated with a higher probability magnitude and non-strongly rotating thunderstorms are associated with a lower probability magnitude. The large imbalance between the majority and minority classes was important to consider during the evaluation of the CNN classification skill (Table 3). Therefore, the performance diagram and metrics that are more useful for evaluating forecasts of rare events were used (Roebber, 2009). The minority class in this case consists of strongly rotating thunderstorms, which are rare events that comprise approximately 3% of all thunderstorms in the convection-permitting model simulations. Performance diagrams summarize the probability of detection (POD; ratio of hits to the total of hits and misses), critical success index (CSI; ratio of hits to the total of hits, false alarms, and misses), and frequency bias (ratio of the total of hits and false alarms to the total of hits and misses). Success ratio (SR) is also summarized, which is 1 − false alarm ratio (FAR; ratio of false alarms to the total of hits and false alarms). The curves shown on the performance diagrams were created by varying the probability threshold between 0 and 1 to convert probabilistic forecasts into binary forecasts, and show how skill changes based on the probability threshold used (Figures 3a, 3c, and 3e).
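The four performance-diagram quantities follow directly from a 2 × 2 contingency table; a minimal sketch (the counts in the usage test are illustrative):

```python
def performance_metrics(hits, false_alarms, misses):
    """Contingency-table metrics used in a Roebber (2009) performance diagram."""
    pod = hits / (hits + misses)                    # probability of detection
    far = false_alarms / (hits + false_alarms)      # false alarm ratio
    sr = 1.0 - far                                  # success ratio
    csi = hits / (hits + false_alarms + misses)     # critical success index
    bias = (hits + false_alarms) / (hits + misses)  # frequency bias
    return pod, sr, csi, bias
```

Sweeping the probability threshold from 0 to 1, converting probabilities to binary forecasts at each threshold, and recomputing these counts traces out the curves shown in the performance diagrams.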
The performance diagrams (Figures 3a, 3c, and 3e) show that despite being trained with thunderstorm objects extracted from the current climate model simulation, CNN skill remains consistent and high (0.69 max CSI) when classifying thunderstorms of the future climate model simulation (Figure 3c). These results suggest that a CNN is capable of learning spatial representations and variable relationships that are transferable to a warmer and moister climate. Figure 4a shows that correctly classified strongly rotating thunderstorms (according to the ground truth label) of the future climate contained approximately 4 g kg⁻¹ more low-level moisture (at 1 km) than correctly classified strongly rotating thunderstorms of the current climate. The consistency in CNN skill could be partly related to bulk wind shear (1-5 km) distributions that remained relatively stationary between both climate model simulations (Figure 4a). Despite the imbalance between the majority and minority classes, the CNN was able to perform the classification task skillfully, suggesting that techniques to augment minority classes may not always be necessary (e.g., Chawla et al., 2002). However, model bias exhibits some sensitivity to the forecast threshold used. We note that the same forecast threshold values were evaluated for current, future, and outlier thunderstorms (10,000 values evenly spaced between 0 and 1) in Figure 3, and only a threshold of 0.5 was used for all thunderstorm classes in Table 3. Max CSI and lower bias were achieved when evaluating model skill using a probability threshold of approximately 0.6 in the future climate and 0.5 in the current climate (Figure 3c).

Table 2
Median of Thunderstorm Variables Extracted From the Current and Future Climate Simulations
Note. Environments surrounding the thunderstorms were omitted for these statistics. Future thunderstorms with higher low-level moisture content than most cases in the future climate (i.e., outlier cases exceeding the 99th percentile of 1-km water vapor mixing ratio in the future climate) are also shown. Statistically significant values for the future climate and future outliers are indicated in boldface with asterisks (*) and were computed using confidence intervals of the 2.5th and 97.5th percentiles of a 1,000-member bootstrap from a total sample of 454,242 thunderstorm objects extracted from the current climate simulation.
A probability threshold of 0.5 results in a small overforecasting bias (bias > 1) for strongly rotating thunderstorms of the future (Table 3), which shows that the CNN generally has lower skill in classifying strongly rotating thunderstorms of the warmer and moister climate.
Performance metrics were also computed for the outlier future thunderstorms, which were characterized by a low-level moisture content higher than the 99th percentile of the future climate, in order to further quantify the out-of-sample robustness of the CNN (Table 2). Results show that CNN classification skill for outlier thunderstorms of the future climate remains high, with a max CSI of 0.73 (Figure 3e), which is comparable to the current and future climate subsets (Figures 3a and 3c). These results further substantiate that a CNN can exhibit out-of-sample robustness in climate applications. Results also suggest that deep learning can sufficiently generalize relationships among input variables and remain skillful with extreme events. However, overforecasting of strongly rotating outlier thunderstorms was identified (bias > 1) with a probability threshold of 0.5 (Figure 3e; Table 3), which implies overconfidence in classifying thunderstorms with extreme low-level moisture. Like thunderstorms of the future climate, the consistency in CNN skill could be partly related to bulk wind shear (1-5 km) distributions that remained relatively stationary between current climate and future outlier thunderstorms (Figure 5a). However, in addition to high-end low-level moisture content, future outlier thunderstorms also contained substantially higher moisture content in mid-to-upper levels, as shown in Figure 5 for 5-km maximum water vapor mixing ratio. This result suggests that the spatial arrangement of meteorological fields likely also plays an important role in CNN prediction skill, in addition to the relative magnitudes of variables. We note that a model consisting of the same architecture was trained with class imbalance addressed (i.e., equal-sized potentially severe and non-severe classes) and resulted in more substantial bias (Table 4) than the model trained with large class imbalance (Table 3).
The Brier skill score (BSS) was used as an additional evaluation metric and can be visualized with the attributes diagram (Figures 3b, 3d, and 3f), which shows forecast probabilities against observed relative frequencies (Hsu & Murphy, 1986; Wilks, 2011). An attributes diagram provides a measure of forecast reliability, where the dashed 45° line represents perfect reliability. Attributes diagrams show forecast probabilities (between 0 and 1), which are plotted against the observed relative frequency for that forecast probability (Wandishin et al., 2005). An example of "perfect reliability" for a 0.6 forecast probability (e.g., x-axis in Figures 3b, 3d, and 3f) is when that forecast probability corresponds to a similar observed relative frequency (e.g., y-axis in Figures 3b, 3d, and 3f). The solid horizontal line in Figures 3b, 3d, and 3f shows the climatological probability of strongly rotating thunderstorms occurring within the respective climate sample, which is higher in outlier cases than in the current and future climates. Since attributes diagrams consider climatological and forecast probability frequency, they also show how different forecasts are from climatology (i.e., resolution). The gray shading in Figures 3b, 3d, and 3f shows areas contributing to positive BSS, which are areas where resolution exceeds reliability (Gagne II et al., 2019).

Table 3
Skill Metrics Used for Evaluation of Convolutional Neural Network (CNN) Performance in the Current and Future Climates
Note. Also shown are the total number of true positive (i.e., hits), false positive (i.e., false alarms), false negative (i.e., misses), and true negative predictions made by the CNN. Future thunderstorms that have higher low-level moisture content than most cases in the future climate (i.e., outlier cases exceeding the 99th percentile of 1-km water vapor mixing ratio in the future climate) are also shown. Metrics were computed using a 0.5 forecast probability threshold for the current, future, and outlier thunderstorms.

The inset panels in Figures 3b, 3d, and 3f show the frequency of forecast probabilities for each climate subset, which in this case features a bimodal distribution, with peaks at low (0.05) and high (0.95) forecast probabilities. This bimodal distribution is most pronounced for outlier cases (Figure 3f). The attributes diagram curve closely parallels the dashed 45° diagonal line across all forecast probabilities for the current climate (Figure 3b), which conveys high forecast reliability. However, future and outlier cases have lower reliability between forecast probabilities of 0.2 and 0.7 and higher reliability at low (<0.2) and high (>0.7) forecast probabilities (Figures 3d and 3f). These results corroborate the performance diagram results, which show that the CNN has an overforecasting bias for future and outlier thunderstorms.
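The BSS computation can be sketched directly, using the evaluation sample's climatological event frequency as the reference forecast; the probabilities and outcomes in the usage test are illustrative.

```python
import numpy as np

def brier_skill_score(probs, outcomes):
    """BSS = 1 - BS / BS_climatology, where the reference forecast is the
    climatological event frequency of the evaluation sample."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bs = np.mean((probs - outcomes) ** 2)          # Brier score of the model
    climo = outcomes.mean()                        # climatological frequency
    bs_ref = np.mean((climo - outcomes) ** 2)      # Brier score of climatology
    return 1.0 - bs / bs_ref
```

A perfect forecast scores 1, a forecast no better than always issuing the climatological frequency scores 0, and worse-than-climatology forecasts score negative.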

CNN Interpretation
Permutation feature importance (PFI) was conducted to determine the relative importance of input variables on CNN prediction skill. The area under the receiver operating characteristic curve (AUC; Mason, 1982) was used, which is a scalar that represents model performance encompassing the probabilities of detection and false detection. Using AUC, PFI reveals that zonal (u) and meridional (v) winds at 3 km have the highest relative importance for CNN prediction (Figures 6a and 6d). PFI is consistent for predictions generated using the current climate and future climate thunderstorms (Figures 6a and 6d), which shows that mid-level kinematic fields play an important role in the proper classification of rotating convective storms. This result is physically reasonable given that UH (computed from 2 to 5 km AGL) was used to create the thunderstorm labels that were subsequently used to train the CNN. The climatological homogeneity between current and future climate mid-level winds (Table 2) also likely contributed to the consistency in variable importance across climate subsets. Zonal and meridional winds at 1 and 5 km were also identified as relatively important (Figures 6a and 6d). Several thermodynamic variables also ranked in the top 50th percentile in importance, suggesting that the CNN also relies on characteristics of physical variables that were not included in the UH computation. These relatively higher-ranking thermodynamic variables include temperature at 5 km and water vapor mixing ratio at 1 and 7 km (Figures 6a and 6d).
Additional skill metrics were used for PFI in order to explore the sensitivity of the analysis to the respective evaluation method. PFI using CSI, which is a skill evaluation metric that neglects true negative events (as described earlier), further emphasizes the relative importance of mid-level kinematic fields (Figures 6b and 6e). BSS was also used for PFI (Figures 6c and 6f), and results generally align with the AUC and CSI results in regard to the relative importance of mid-level kinematic fields. Interestingly, however, moisture at 5 km was ranked most important when evaluating the CNN classification skill for current and future climate thunderstorms (Figure 6c). This result suggests that mid-level moisture is an important variable for the classification of strongly rotating thunderstorms, given its lower ranking under AUC, which also takes into account the correct classification of non-strongly rotating thunderstorms (according to the ground truth label).

Figure 5. Same as Figure 4, but for outlier thunderstorms and water vapor mixing ratio at 5 km above ground level.
PFI offers insight into the relative importance of variables based on modulations to the CNN prediction skill, but the method does not provide reasons for the rankings. For instance, it is not immediately clear why water vapor mixing ratio at 5 km has greater relative importance than at 1 km. To explore the reasons for the PFI rankings, visualizations were created of thunderstorms that were initially classified as strongly rotating but switched to a non-strongly rotating classification as a result of the permuted variable (Figure 7). Figure 7c shows an example of a strongly rotating thunderstorm. Its associated water vapor mixing ratio at 5 km (Figure 7a) was permuted to a field that had a greater overall magnitude of moisture (Figure 7b) and peak moisture values that were offset from the thunderstorm locations (Figure 7c), which resulted in the non-strongly rotating classification. Various other thunderstorms exhibited a similar pattern: higher overall moisture content and shifted peak-value locations in the permuted field resulted in non-strongly rotating classifications (not shown). Supercells generally form in environments characterized by moist low levels and drier mid-to-upper levels, while stratiform precipitation or less organized convection can be characterized by higher and more homogeneous moisture profiles (Bunkers et al., 2006; R. L. Thompson et al., 2012). These examples show that the moisture characteristics of vertical atmospheric profiles are likely a feature learned by the CNN. In regard to the high importance of zonal and meridional winds, classifications that switched to non-strongly rotating during PFI were generally due to the uniformity of the zonal or meridional winds in the permuted fields (Figures 7e and 7h), as opposed to the overall magnitude of the horizontal winds. These results show that the CNN learned that wind directional shifts over a small region located near the thunderstorm core are indicative of strong rotation.
Visualizations were also created for thunderstorms that were initially classified as non-strongly rotating, but switched to a strongly rotating classification during PFI (Figure 8). The permuted moisture fields for these examples (e.g., Figure 8b) were generally drier and contained large-magnitude spatial gradients that represented isolated and intense convection. Regarding kinematic fields, the original zonal (Figure 8d) and meridional (Figure 8g) winds lacked rotational characteristics for the respective thunderstorms (Figures 8f and 8i). However, the permuted fields contained strong rotational features (Figures 8e and 8h), which likely resulted in the changed classification.
Individual thunderstorms from the future climate model simulation were chosen to visualize areas of saliency for predictions made by the CNN. Simulated radar reflectivity of the respective examples is shown in Figure 9, which contains a true positive, false positive, false negative, and true negative case. High values of simulated radar reflectivity (65 dBZ) are evident near the thunderstorm core of the true positive case (Figure 9a), which represents a region of high precipitation intensity. The false positive and false negative examples also contain thunderstorms with high reflectivity (65 dBZ; Figures 9b and 9c), but the most intense region for the false negative case is located near the southern edge of the image. The true negative case (Figure 9d) contains lower maximum reflectivity magnitudes than the other examples (<65 dBZ), and convection that is smaller in size and less organized, which possibly contributed to the true negative classification by the CNN.
Saliency maps highlight the areas of input features within the thunderstorm object that are relevant to the CNN prediction. For water vapor mixing ratio (right two columns in Figure 10), positive gradients demarcate the respective pixels that are relevant to the model prediction. Moisture at low- and mid-level heights near the thunderstorm core of the true positive case is relevant to the prediction of strongly rotating thunderstorms (Figures 10c and 10d). While high moisture content may not be related to thunderstorm rotation and horizontal kinematics, it does show that the CNN identified the thunderstorm core (a region of high precipitation intensity, and thus moisture content) as relevant to the strongly rotating prediction. Non-salient regions of the respective variables have zero gradients and therefore correspond to pixels that are not relevant to the model prediction. In the case of zonal and meridional winds at 3 km (left two columns in Figure 10), winds of higher absolute magnitude with opposing signs in close proximity to each other represent regions of rotational winds within the region of the thunderstorm's updraft. These features are evident at the thunderstorm core location, suggesting that the rotation signature is relevant to the strongly rotating prediction. Similar gradient patterns are present in the maps of the false positive and false negative examples at thunderstorm core locations (Figures 10e-10l). Saliency maps for the true negative case are substantially different (Figures 10m-10p): gradients are no longer concentrated in a small and focused region near the thunderstorm core, but rather spread across broad areas of the thunderstorm object. Additionally, gradients from zonal and meridional winds generally no longer align to form an organized circulation (Figures 10m and 10n).
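A saliency map is the gradient of the model output with respect to each input pixel. The sketch below approximates that gradient by central finite differences on an invented toy model; with the actual Keras CNN, tf.GradientTape would return the same gradient analytically, and the numerical version is used here only to keep the illustration dependency-free.

```python
import numpy as np

def saliency_map(model, x, eps=1e-4):
    """Approximate d model(x) / d x for every pixel of a 2D input field
    by central finite differences."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        i = it.multi_index
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        grad[i] = (model(xp) - model(xm)) / (2 * eps)
    return grad

# toy "model" standing in for the CNN output: responds only to a small
# core region of the input field, mimicking a localized salient feature
core = (slice(3, 5), slice(3, 5))
model = lambda field: field[core].sum()

x = np.random.default_rng(2).normal(size=(8, 8))
sal = saliency_map(model, x)
# nonzero gradients appear only inside the 2 x 2 core: the salient region
```

This mirrors the interpretation in the text: pixels with zero gradient do not influence the prediction, while concentrated nonzero gradients mark the regions the model actually uses.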
Another local and post hoc ML explanation method is the input*gradient method, which highlights contributions to the prediction, computed as the product of the local gradient with the input itself (Shrikumar et al., 2016). Mamalakis et al. (2021) conducted an attribution benchmark using attribution known a priori (i.e., ground truth) for regression problems, and found the input*gradient method to produce more skillful explanations than saliency maps. While the benchmark used global sea surface temperatures and consisted of a very different spatial scale as compared to our application, we decided to also use the input*gradient method for further exploration of our model's explanations. Explanations were generally consistent between the saliency map and input*gradient methods, with a few minor exceptions (Figure 11). For instance, the wind rotational signature was mostly of one sign (positive) near the thunderstorm core using the input*gradient method for hits, false alarms, and misses (Figures 11a, 11b, 11e, 11f, 11i, and 11j). The spatial extent of the contour explanations was reduced for true negative cases and primarily of negative magnitude for all variables shown (Figures 11m-11p). The moisture fields at 1 and 5 km for the miss (false negative) example contain mixed signals (positive and negative) that were displaced away from the thunderstorm core, potentially explaining why the CNN failed to classify the supercell as strongly rotating (Figures 11k and 11l). A challenge with comparing ML explanation methods without attribution ground truth is that we cannot quantify explanation skill and must rely heavily on our domain knowledge to assess skill subjectively. It is also possible that certain explanation methods are more useful for certain spatial and temporal scales but not others. These questions are beyond the scope of this study, but should be explored in future work.

Figure 7. Three example cases (c, f, i) that were classified as non-strongly rotating thunderstorms during the permutation feature importance (PFI) analysis. The convolutional neural network classified these future climate thunderstorms as strongly rotating prior to PFI. The top row shows the original water vapor mixing ratio (q_w) field (a) for a pair of strongly rotating thunderstorms (c), and the q_w field that replaced the original during PFI (b). The center and bottom rows show fields for other example thunderstorms, but for permuted meridional (v) and zonal (u) winds, respectively. Updraft helicity exceeding 75 m² s⁻² is indicated with black contours (c, f, i). Vectors show winds at 3 km corresponding to the respective thunderstorm image (c, f, i). The q_w fields were normalized by the maximum value of both plots (a, b).

Results generated using saliency maps and input*gradients can be corroborated by visualizing the frequency of maximum UH for all thunderstorms. Figure 12a shows that the CNN is able to identify strongly rotating thunderstorms across a broad range of UH values that exceed 75 m² s⁻² and is therefore able to capture a variety of thunderstorm rotation intensities. Thunderstorms that were classified as strongly rotating, but evaluated as false alarms because the corresponding UH did not exceed 75 m² s⁻², are heavily skewed toward high UH values (mostly containing UH values that exceeded 40 m² s⁻²; Figure 12b), which past studies have found to also be representative of supercellular convection (Trapp et al., 2011). Missed classifications of strongly rotating thunderstorms generally do not consist of large UH magnitudes (<100 m² s⁻²; Figure 12c), which is characteristic of less organized convective storms (Figure 12d). These results further demonstrate that a CNN is able to generalize a target derived from a heuristic using learned features in the data that would be difficult to encode due to spatial complexity. There is sensitivity, however, to the thunderstorm location within the thunderstorm object, which can be visualized with 2D histograms that contain the frequency of UH exceeding 75 m² s⁻², typically located near the thunderstorm core (Figure 13). Correctly classified strongly rotating thunderstorms (according to the ground truth label) contained regions of high rotation (UH > 75 m² s⁻²) near the center of the thunderstorm object (Figures 13a and 13c), while missed classifications were located near the edges of the thunderstorm object (Figures 13b and 13d) for thunderstorms during both the current and future climates. This comparison shows that CNNs can struggle with classifications of features located near the edges of a spatial region of interest, resulting in missed events.
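The input*gradient attribution described above can be sketched with a finite-difference gradient on a toy linear model (all names and fields here are illustrative, not the study's code; with the actual Keras CNN, tf.GradientTape would supply the gradient). For a linear model, the attribution exactly recovers the model's per-pixel weighting of the input.

```python
import numpy as np

def input_times_gradient(model, x, eps=1e-4):
    """Attribution as the elementwise product of the input with the local
    gradient of the model output (after Shrikumar et al., 2016). The
    gradient is approximated by central finite differences."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        i = it.multi_index
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        grad[i] = (model(xp) - model(xm)) / (2 * eps)
    return x * grad

# toy linear "model": weights are nonzero only near an assumed storm core,
# so the attribution w * x is recovered there and is zero elsewhere
w = np.zeros((6, 6))
w[2:4, 2:4] = 1.5
model = lambda field: (w * field).sum()

x = np.random.default_rng(3).normal(size=(6, 6))
attr = input_times_gradient(model, x)
# attributions carry the sign of the input, which is why input*gradient
# contours can be single-signed where plain saliency gradients are not
```

This sign-carrying behavior is one reason the two explanation methods can disagree in detail, as seen in the comparison between Figures 10 and 11.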

Conclusions
A CNN was trained to learn relationships and identify features among meteorological state variables in order to classify convection types, with a focus on rotation within the updraft core of a thunderstorm. Strong rotation and the associated storm morphology can result in a higher likelihood of convection producing severe hazards, such as tornadoes and large hail, which are a dangerous threat to the public. We hypothesized that, due to climate change, a trained CNN may fail to classify and identify convection that lies outside of the climatological distribution of the data used for training. We conducted a test-case study to address our hypothesis. Using a thermodynamically driven future climate model simulation, our results show that a CNN can remain skillful in classifying convective storms using learned representations of physical variables.
The key results that provide answers to the questions posed in the introduction follow:

1. A CNN trained using a current climate model simulation can skillfully classify out-of-sample (with regard to moisture content) thunderstorms in a thermodynamically driven future climate. This is possibly due in part to the use of batch normalization and spatial dropout; an equivalent model trained without batch normalization and spatial dropout results in an under-forecasting bias (about 0.84) in both the current and future climates.

2. Kinematic fields and mid-level moisture were identified as important variables for skillful classification by the CNN. Spatially, wind rotation signatures with concurrently overlaid sharp mid-level moisture gradients were also important.

3. Thunderstorm classifications that were incorrect according to the associated ground truth label included cases that were near the thunderstorm object edge or had a UH value that was near, but on the opposite side of, the predefined threshold.

Key result 1 shows that a CNN is robust to out-of-sample cases during convection classification, which is a promising result given the changes already occurring to large-scale environments and moisture advection patterns associated with severe thunderstorms (Gensini & Brooks, 2018; Molina & Allen, 2020). Key results 2 and 3 also show that a CNN can learn complex relationships among input features using labels derived from heuristics. Physical features were not prescribed but rather learned from the data, such as the importance of dry air at mid-levels for intense thunderstorm development when low-level moisture is present (i.e., convective available potential energy). We emphasize that UH was only used to create the labels and was not used as an input attribute in the CNN during training. Unlike computer vision classification tasks (Russakovsky et al., 2015), researchers can bypass generating large numbers of hand-labeled samples for training models to perform atmospheric feature
classifications, which would also pose challenges given conflicting definitions of atmospheric phenomena in the scientific literature. Additionally, results show that large imbalances in labeled data may be overcome with sufficient hyperparameter tuning. Overall, results show that the CNN can classify as strongly rotating those thunderstorms that were near the UH threshold and appeared supercellular, learning to generalize the prescribed UH labels.
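One simple way to address such class imbalance, mirroring the equal-sample-size setup described for Table 4, is to undersample the majority class before training. The helper below is a generic illustration with hypothetical names and data, not the study's code.

```python
import numpy as np

def balance_by_undersampling(X, y, seed=0):
    """Randomly drop majority-class samples so both classes have equal
    sample sizes before training."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = rng.permutation(np.concatenate([minority, kept]))
    return X[idx], y[idx]

# hypothetical label set: 100 "potentially severe" vs. 900 "non-severe"
X = np.arange(1000).reshape(1000, 1)
y = (np.arange(1000) < 100).astype(int)
Xb, yb = balance_by_undersampling(X, y)
```

Undersampling discards data; class weighting or oversampling the minority class are common alternatives when every labeled storm is valuable.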
There are several limitations that are important to acknowledge. The focus of this study lies on a future climate that was thermodynamically driven, in order to isolate competing thermodynamic and kinematic signals, but it is possible that a CNN may not generalize well to a future climate that accounts for changes in both the thermodynamics and the large-scale dynamics. We do note that there is a 14% increase in future strongly rotating thunderstorms as compared to the current climate, which is a substantial increase and an indication that changes in large-scale dynamics may not pose a significant issue for the CNN. An additional limitation is that physical interpretation methods require substantial human interpretation, making it possible to miss important features or fail to discover new physical relationships. However, this is a broader issue within machine learning explanations (i.e., interpretability), as it introduces the potential for confirmation bias from human scientists attempting to explain results. Future work should explore incorporating feature uncertainty or physics within the CNN model architecture to explore differences relative to the results contained herein. Additionally, methods to ameliorate missed classifications near the edges of study domains should be explored. As societal exposure to severe hazards continues to increase (e.g., Ashley & Strader, 2016), it is important to continue improving the identification and understanding of severe hazards within climate model simulations. The use of deep learning methods that do not impose rigid thresholds or expert-system decisions should continue to be explored, since meteorological phenomena generally do not neatly fit into predefined classes. Deep learning offers a viable avenue to continue to better understand weather and climate extremes.

archived on Zenodo (Molina, Gagne, & Prein, 2020). The convolutional neural network was trained using the high-level Python-based Keras deep learning library (Chollet, 2015) with the TensorFlow (version 2.3.1) low-level backend (Abadi et al., 2016).

We are grateful to Imme Ebert-Uphoff (Colorado State University) and an anonymous reviewer for their insightful and valuable feedback that improved the quality of the manuscript.

Figure 1 .
Figure 1. Thunderstorm objects for this study were extracted from areas east of the Rocky Mountains (over land) within the dashed-line polygon. An example thunderstorm object is shown over the CONUS for scale, with the inset displaying a larger version, and corresponding state variables are shown on the right. State variables listed from top to bottom are water vapor mixing ratio (q_w; g kg⁻¹), temperature (T; K), pressure (P; hPa), and zonal (u) and meridional (v) winds (m s⁻¹). The four layers for each state variable indicate the four levels (1, 3, 5, and 7 km above ground) at which variables were derived.

Figure 2 .
Figure 2. The architecture of the convolutional neural network (CNN). The model consists of three 2D convolutional layers and max pooling layers. The dimensions of the feature maps are shown in parentheses. 2D filter windows are depicted in pink, of dimension 5 × 5 for each convolutional layer and 2 × 2 for each max pooling layer. The kilometer range of learned features grows in deeper layers of the CNN because of the max pooling layers; the spatial extent learned is 20 × 20 km for the first convolutional layer (filter containing 5 filter pixels × 4 km per pixel), 40 × 40 km for the second convolutional layer, and 80 × 80 km for the third convolutional layer.

Figure 3 .
Figure 3. Performance diagrams (a, c, and e) show curves that represent convolutional neural network skill as a function of the probability of detection (POD) and success ratio (1 − FAR [false alarm ratio]) across various probability thresholds. The grayscale-filled contours show the critical success index (CSI), the dashed lines display the bias, and circles along the curves display probability thresholds (a, c, and e). Attributes diagrams are also displayed, which show forecast probabilities against observed relative frequency, using a forecast probability bin size of 0.05 (b, d) and 0.1 (f). Inset panels in the top left show the frequency of forecast probabilities, and the gray shading shows regions where resolution exceeds reliability (b, d, f). Ninety-fifth percentile confidence intervals (two-tailed) computed from a 1,000-member bootstrap are shown with shading (a-f).

Figure 4 .
Figure 4. Scatter plots showing water vapor mixing ratio (1 km above ground level [AGL]) against bulk wind shear (1-5 km AGL) for thunderstorm objects of the current and future climates, evaluated as (a) hits, (b) false alarms, (c) misses, and (d) correct negatives. The dots represent individual thunderstorm objects of the current (black) and future (red) climates, while the stars show the median of the respective climate's thunderstorm objects. Bivariate density distributions are also shown with marginal plots created using Gaussian kernels. Random subsets of thunderstorm objects are shown for easier visualization.

Figure 6 .
Figure 6. Permutation feature importance (PFI) analysis for the current climate (a-c) and future climate (d-f) thunderstorms shown using box-and-whisker plots. The median of 500 permutations is represented by the vertical line within the box, and the whiskers represent all 500 measured changes in skill. PFI was conducted using various skill metrics, including area under the receiver operating characteristic curve (AUC; a, d), critical success index (CSI; b, e), and Brier skill score (BSS; c, f). Changes in skill were normalized by the maximum change in the respective climate subset and skill metric.

Figure 8 .
Figure 8. Same as Figure 7, but for a subset of thunderstorms that were classified as strongly rotating. The convolutional neural network classified these thunderstorms as non-strongly rotating prior to permutation feature importance. No black contours are included in the thunderstorm plots (c, f, i) because updraft helicity did not exceed 75 m² s⁻² for the shown examples.

Figure 9 .
Figure 9. Simulated radar reflectivity for representative examples of thunderstorms extracted from the future climate simulation, evaluated as a hit (a), false alarm (b), miss (c), and correct negative (d).

Figure 10 .
Figure 10. Saliency maps for representative examples of future climate thunderstorms, including true-positive (a-d), false-positive (e-h), false-negative (i-l), and true-negative (m-p) cases. Variables shown include several denoted as important by the PFI analysis, such as 3 km zonal (a, e, i, m) and meridional (b, f, j, n) winds, and water vapor mixing ratio (q_w) at 5 km (d, h, l, p). Mixing ratio (q_w) at 1 km is also shown (c, g, k, o).

Figure 11 .
Figure 11.The same as Figure 10, but input*gradient values are displayed instead of saliency maps.

Figure 12 .
Figure 12. Histograms show the frequency of maximum updraft helicity (UH) for future climate thunderstorms separated into four subsets: hits (a), false alarms (b), misses (c), and true negatives (d). Frequencies of true negatives (d) are shown at ×10³ magnitude. For comparison, the frequencies of the maximum UH for current climate thunderstorms are shown with blue lines, and for future outlier thunderstorms in the inset plots.

Figure 13 .
Figure 13. Spatial histograms show the frequency of updraft helicity (UH) exceeding 75 m² s⁻², normalized by maximum frequency, for thunderstorm objects classified as hits (a, c) and misses (b, d) during the current (a, b) and future (c, d) climates.
Portions of this study were supported by the Regional and Global Model Analysis (RGMA) component of Earth and Environmental Systems Modeling in the Earth and Environmental Systems Sciences Division of the U.S. Department of Energy's Office of Biological and Environmental Research (BER) through the National Science Foundation IA 1947282. This work also was supported by the National Center for Atmospheric Research (NCAR), which is a major facility sponsored by the National Science Foundation (NSF) under Cooperative Agreement No. 1852977. This material is also based upon work supported by the NSF under Grant No. ICER-2019758 and part of the NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES). M. J. Molina conducted this research as part of the NCAR Advanced Study Program. High-performance computing support on Cheyenne and Casper (https://doi.org/10.5065/D6RX99HX) was provided by the NCAR Computational and Information Systems Laboratory (CISL), some of which was from the CISL Analytics and Integrative Machine Learning (AIML) group allocation.

Table 4
Same as Table 3, but Using a Model Trained With Class Imbalance Addressed (Equal Sample Sizes of Potentially Severe and Non-Severe Thunderstorms)