How Are Tornadic Supercell Soundings Significantly Different From Nearby Baseline Environments?

This study explores how tornadic supercell soundings significantly differ from the same‐location and same‐hour baseline environment soundings, sampled from the days prior to or following the event. Permutation testing is used to identify whether two sounding‐derived parameters, mixed‐layer convective available potential energy and 0–1 km storm‐relative helicity, are significantly different between the tornadic and baseline environments. Typically, in an environment with marginal values of certain key environmental parameters, anomalous values of those environmental parameters are more strongly associated with supercell tornadoes. Furthermore, many tornadic events already exhibit environmental conditions favorable for tornadic supercells a day prior to the event itself. Generally, supercell tornadoes that occur during typical peak tornadic activity time frames are easier to distinguish from baseline (non‐tornadic) environments compared to those occurring in other time frames. Spatiotemporal variations of distinguishability between tornadic and baseline environmental parameters add complexity to traditional parameter‐based fixed threshold forecasting.


Tornado Data
The tornado data set used in this paper is the same as the one featured in A. K. Anderson-Frey et al. (2019), consisting of all the contiguous United States (CONUS) grid-hour supercell tornado events from February 2003 to November 2017. The tornado event data set was created by filtering county tornado segment data and keeps the tornado with the highest (E)F-scale rating within a given hour within a 40 × 40 km² area. The reanalysis data was obtained from the European Centre for Medium-Range Weather Forecasts Reanalysis 5 (ERA5) hourly data with a resolution of 0.25° × 0.25° (Hersbach et al., 2020). This sounding data is then obtained for a range of sizes of grid boxes in the vicinity of each tornado event. For this study, we only consider the events within regions most typically associated with tornadoes in the United States: the Southeast, the Great Plains, and the Midwest regions within a bounding box (Figure 1). The meteorological parameters we consider in the work that follows are the mixed-layer convective available potential energy (MLCAPE) and 0-1 km storm-relative helicity (SRH1). MLCAPE is a thermodynamic instability parameter that can act as a proxy for the theoretical maximum updraft strength, and SRH1 is a kinematic parameter associated with the coupling of updraft lifting and low-level vertical vorticity that is crucial for the longevity of the supercell and is also associated with a heightened potential for tornadogenesis (Brooks et al., 1994, 2003; Coffer & Parker, 2017; Davies-Jones, 1990; Markowski et al., 2014; Rasmussen & Blanchard, 1998). Both MLCAPE and SRH1 are calculated using the XCAPE Python package (Chiara Lepore & Allen, 2021) with default settings.
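The grid-hour filtering step described above can be sketched as follows. This is only an illustration under stated assumptions: the `Segment` record, its field names, and the sample values are hypothetical and are not taken from the original data set or its processing code.

```python
from collections import namedtuple

# Hypothetical record layout; field names and values are illustrative only.
Segment = namedtuple("Segment", ["grid_x", "grid_y", "hour", "ef_rating"])


def strongest_per_grid_hour(segments):
    """Keep only the highest-rated tornado for each (grid cell, hour) key."""
    best = {}
    for seg in segments:
        key = (seg.grid_x, seg.grid_y, seg.hour)
        if key not in best or seg.ef_rating > best[key].ef_rating:
            best[key] = seg
    return list(best.values())


# Invented example: two segments share a cell and hour, one is stronger.
segments = [
    Segment(10, 4, "2010-05-01T22", 2),
    Segment(10, 4, "2010-05-01T22", 4),
    Segment(11, 4, "2010-05-01T22", 1),
]
events = strongest_per_grid_hour(segments)
```

Here each grid cell stands in for one 40 × 40 km² area, so the two segments in cell (10, 4) collapse to a single grid-hour event.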

Baseline Environment Data
The baseline environmental sounding data is selected at the same hour as the tornado event on each of the 15 days before and after that event. More specifically, N × N grid points are selected, centered on a tornado event during the hour when the event occurred; this grid constitutes the tornado environment. The same grid at the same hour is then selected from the non-tornadic days within ±15 days of the tornado event; these grids constitute the baseline environment. Choosing the baseline environments at the same hour as the corresponding event mitigates the impact of the diurnal variation of environmental parameters, as described in A. K. [...] and Hua and Anderson-Frey (2022). The ±15-day increment is chosen so as to maintain a sufficient baseline environment data size for comparison without potentially blending in seasonal variability, as described in R. L. Thompson et al. (2012) and A. K. Anderson-Frey et al. (2016). Note that ERA5 reanalysis data currently has a latency of 5 days, which means that the most recent baseline environment available for comparison dates from 5 days before the time of forecasting; however, we still analyze the impact of using baseline environment data from the entire previous 15 days, considering that future ERA5 reanalysis data could potentially update more frequently (the discussion of the ±15 days both before and after is reserved for Section 3.4). If other tornado events exist within the ±15 days in the vicinity of the tornado (i.e., at the same location and the same hour as the tornado event), the environments of those days are excluded from the calculation, although a few of the remaining environments could still be convective without producing tornadoes.
Multi-day events during a tornado outbreak are still being analyzed individually; what are filtered out are the baseline environments in the same region and hour that are flagged as separate tornado events. In addition to selecting the sampling time scale, selecting the sampling grid size is also critical for statistical testing. The sizes of the grid boxes for both the baseline environments and the tornadic environments are chosen to be 3 × 3, 5 × 5, 7 × 7, and 9 × 9. The center grid box is always the nearest grid box to the location of the tornado event.
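The sampling scheme in this section can be sketched in a few lines. This is a minimal illustration, not the original processing code: the function names, the index-based grid representation, and the example date are assumptions made for the sketch.

```python
from datetime import date, datetime, timedelta


def baseline_times(event_time, other_event_dates=(), window_days=15):
    """Same-hour timestamps on each day of the +/- window.

    The event day itself is skipped, as is any day flagged as holding
    another tornado event at this location and hour.
    """
    skip = set(other_event_dates)
    times = []
    for offset in range(-window_days, window_days + 1):
        if offset == 0:
            continue
        t = event_time + timedelta(days=offset)
        if t.date() not in skip:
            times.append(t)
    return times


def grid_box(center_i, center_j, n):
    """Index pairs of an n x n box (n odd) centered on the nearest grid point."""
    half = n // 2
    return [
        (center_i + di, center_j + dj)
        for di in range(-half, half + 1)
        for dj in range(-half, half + 1)
    ]


# Invented example event: 22 UTC on 10 May 2010, with one flagged nearby day.
event = datetime(2010, 5, 10, 22)
times = baseline_times(event, other_event_dates=[date(2010, 5, 12)])
box = grid_box(100, 200, 3)
```

With one flagged day removed, 29 of the 30 candidate baseline days remain, and the 3 × 3 box holds nine grid points including the center.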

Permutation Tests
We compare tornadic supercell environments to their associated baseline environments using a paired sample permutation hypothesis test (Efron & Tibshirani, 1994). The paired sample permutation test is a dependent nonparametric procedure for testing the null hypothesis that the observations within each pair are drawn from the same underlying distribution (the observations from the same grid should have a similar distribution) and that the sample (tornadic or baseline) to which they are assigned is random. Suppose the data contain two samples, tornadic and baseline. Let n be the number of observations in the tornadic sample, which must equal the number of observations in the baseline sample. The observations of the tornadic and baseline samples are randomly swapped between samples (maintaining their pairings) and the statistic is calculated. For example, given tornadic = [t1, t2, t3] and baseline = [b1, b2, b3], one such permutation is x = [b1, t2, b3] and y = [t1, b2, t3]. This process is repeated for the chosen number of permutations, generating a distribution of the statistic under the null hypothesis. The statistic of the original data is compared to this distribution to determine the p-value. From previous studies related to tornadic environments (e.g., R. L. Thompson et al., 2012), we expect that both MLCAPE and SRH1 should be greater for tornadic environments compared to baseline environments. We therefore use a one-tailed test with a significance level α of 0.05. If the approximate probability of obtaining a test statistic greater than or equal to the observed value under the null hypothesis is less than 5%, we have significant evidence to reject the null hypothesis in favor of the alternative.
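A minimal sketch of the one-tailed paired permutation test follows. The text does not name the test statistic, so the mean of the paired differences is used here as an assumption; the function name, the number of permutations, and the sample values are likewise illustrative.

```python
import random


def paired_permutation_pvalue(tornadic, baseline, n_perm=10000, seed=0):
    """One-tailed paired permutation test (alternative: tornadic > baseline).

    Assumed statistic: mean of the paired differences.
    """
    if len(tornadic) != len(baseline):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    diffs = [t - b for t, b in zip(tornadic, baseline)]
    n = len(diffs)
    observed = sum(diffs) / n
    count = 0
    for _ in range(n_perm):
        # Swapping the two members of a pair flips the sign of its difference.
        stat = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if stat >= observed:
            count += 1
    # The add-one correction keeps the estimated p-value above zero.
    return (count + 1) / (n_perm + 1)


# Invented MLCAPE values (J/kg) on a 3 x 3 grid, flattened to nine pairs.
tornadic = [1500, 1620, 1480, 1710, 1550, 1600, 1530, 1690, 1580]
baseline = [900, 1000, 850, 1100, 950, 980, 940, 1050, 990]
p_value = paired_permutation_pvalue(tornadic, baseline)
reject_null = p_value < 0.05
```

Because every tornadic value exceeds its paired baseline value in this made-up example, only permutations that leave all pairs unswapped can match the observed statistic, and the null hypothesis is rejected at α = 0.05.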
Besides exploring how tornadic environments are different from baseline environments, understanding how often two samples of non-tornadic environments reject the null hypothesis (i.e., how often non-tornadic environments have significantly higher values of MLCAPE or SRH1 than other non-tornadic environments; this situation is hereafter referred to as the null or false-positive scenario) is also important. Including the analysis of such false-positive scenarios provides a typical reference to compare with the probability of tornadic environments failing to reject the null hypothesis. The way we conduct the paired sample permutation test for the false-positive scenario is identical to the tornadic versus baseline environment scenario; the only difference lies in the fact that we replace the tornadic environment with another baseline environment for comparison. For every null scenario that corresponds to a tornado event, we randomly select a non-repeating pair of baseline environments across each day within the ±15-day increment (a total of 30 pairs of samples). A pair of baseline environments is created by randomly matching two of the days within the ±15-day increment that both belong to the same tornadic environment.
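One possible reading of the null-pair construction is sketched below: draw 30 distinct unordered pairs of baseline days, then randomly decide which member of each pair plays the role of the "tornadic" sample. The exact sampling procedure is not specified in the text, so this interpretation, the function name, and the parameters are assumptions.

```python
import itertools
import random


def null_pairs(day_offsets, n_pairs=30, seed=0):
    """Draw non-repeating pairs of baseline days for the null scenario."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(day_offsets, 2))
    chosen = rng.sample(all_pairs, k=n_pairs)
    # Randomly pick which day of each pair stands in for the tornadic sample.
    return [pair if rng.random() < 0.5 else pair[::-1] for pair in chosen]


# Day offsets relative to the event: -15..-1 and +1..+15.
offsets = [d for d in range(-15, 16) if d != 0]
pairs = null_pairs(offsets)
```

Each resulting pair would then be passed through the same paired permutation test used for the tornadic versus baseline comparison.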
Finally, in Sections 3.3 and 3.4, we investigate the aggregate statistics from the results of permutation tests across multiple days. Considering results across multiple days reduces both Type I (false positive) and Type II (false negative) errors.

General Significance Percentages
To begin, we define a "significance percentage" as the percentage of all tornadic supercell environments that reject the null hypothesis (i.e., the percentage of tornadic supercell events that have significantly higher values of MLCAPE or SRH1 compared with the non-tornadic baseline; Figure 2). The goal is to visualize how this significance percentage varies with the number of days relative to the tornado event date, as well as how it varies with the grid size surrounding the tornado event location. The general significance percentage for the null scenario (i.e., in non-tornadic environments, how often can we expect to reject the null hypothesis?) provides an approximate reference establishing context for the significance percentage of the tornadic scenario. Overall, the MLCAPE significance percentage in Figure 2a shows a decreasing trend from 92% to 78% from the days prior to the event to the days following the event; that is, MLCAPE values in the tornadic environment are more commonly above the baseline environment when the baseline being used is prior to the event. For comparison, the MLCAPE significance percentage for the null scenario is around 45 ± 1.5% for all grid sizes (not shown). The tornadic MLCAPE significance percentage on a 3 × 3 grid is overall slightly less than that of the other grid sizes; this slight decrease is similar to that of the null-scenario MLCAPE significance percentage on a 3 × 3 grid of 43.5%, which is also 1%-2% lower compared to other grid sizes.
In contrast to MLCAPE, the SRH1 significance percentage in Figure 2b shows a slight increase from around 87% to 91% from the days prior to the event to the days following the event. Generally, the significance percentage of SRH1 is higher compared to MLCAPE both in the tornadic and the null scenarios (not shown). This reversal in the trend of significance percentage for the tornadic scenarios of MLCAPE and SRH1 could be explained by the seasonality of the environmental parameter: because many tornadoes under consideration occur during spring, which includes a transition from colder temperatures to warmer temperatures, the MLCAPE of the typical environment is also likely to increase, and the spring season is also climatologically associated with a decrease in SRH1. Thus, the MLCAPE (SRH1) of the tornadic environment becomes less (more) different when comparing with a baseline environment at a future date. Furthermore, both MLCAPE and SRH1 tornadic scenarios show a drop in significance percentage 1 day before the tornado event. We hypothesize that a greater portion of events already exhibited favorable synoptic conditions to develop tornadic supercells 1 day before the actual tornado event.
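The significance percentage defined at the start of this section reduces to a simple count over per-event p-values. The sketch below assumes one p-value per event for a given parameter, grid size, and comparison day; the function name and the example p-values are invented.

```python
def significance_percentage(p_values, alpha=0.05):
    """Percentage of events whose permutation test rejects the null hypothesis."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    rejected = sum(1 for p in p_values if p < alpha)
    return 100.0 * rejected / len(p_values)


# Invented p-values for four events: two fall below alpha = 0.05.
example = significance_percentage([0.01, 0.20, 0.03, 0.50])
```

Applying the same function to p-values from the null scenario gives the reference percentage against which the tornadic value is compared.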

Regional Significance Percentage
In addition, we can break down the significance percentage by geographical region (Figure 3). The MLCAPE significance percentage distribution of the Southeast (Figure 3a) is noisier compared to the other two regions, and the Great Plains (Figure 3b) exhibits patterns similar to the general significance percentage in Figure 2a. This less clear-cut pattern of the Southeast is likely because there is a secondary peak of tornado activities in the Southeast around November and December (Long et al., 2018), during which the environment transitions from a higher MLCAPE to a lower MLCAPE. Furthermore, the SRH1 significance percentage in the Southeast (Figure 3d) is typically around 92%, which is significantly higher compared to both the Great Plains (Figure 3e) and the Midwest (Figure 3f) according to the t-test, and is also subject to less variation both in grid sizes and across multiple days.
The high significance percentage of SRH1 for tornadic scenarios in the Southeast matches previous literature investigating high-shear, low-CAPE (HSLC) storms. Both Sherburn and Parker (2014) [...] prevalent from late fall through early spring. The overall significance percentage of SRH1 in the Great Plains is less compared to both the Southeast and the Midwest, indicating lower predictability of tornadic supercells based solely on SRH1 values for the Great Plains. When comparing the overall significance percentage of MLCAPE and SRH1 of tornadic scenarios for the Great Plains (Figures 3b and 3e) and Midwest (Figures 3c and 3f) regions, both parameters in the Midwest have higher significance percentages and are less sensitive to changes in picking the day relative to the event. The significance percentage of the null scenario for each geographical region is similar to the "All Regions" case (not shown), which means the pattern of false alarms should vary less across geographical regions.
[Figure caption fragment] [...] across all regions. This percentage can be interpreted as how significantly different a given non-tornadic environment is from a tornadic environment; low percentages indicate more similarity between non-tornadic and tornadic environments (see text). The x-axis displays the day of the comparison (non-tornadic event) relative to the tornado event. The y-axis represents the numbers of grid squares surrounding the location of the tornado event that are sampled for comparison.

Regionally and Temporally Aggregated Significance Percentages
Understanding how the significance percentage for tornadic scenarios varies diurnally and seasonally provides insight into the temporal variability of tornado predictability. The aggregated significance percentage as described in Section 2.3 is calculated across the entire range of grid sizes over ±15 days in this section. Section 3.4 provides additional details about the sensitivity of aggregated significance percentage with respect to the number of days sampled.
Generally, the aggregated significance percentages (Figure 4) over the time frames with the most tornadic activity are relatively high, whereas time frames outside of those that are climatologically tornado-favorable (e.g., around local sunset and in spring and summer months; A. [...]) have lower aggregated significance percentages but also have noticeably fewer tornado events to support such low aggregated significance percentages. As for the Southeast, the MLCAPE aggregated significance percentage is lower in summer (Figure 4b), but the aggregated significance percentages of SRH1 have similar magnitudes of around 90% both diurnally and seasonally (Figure 4c). This phenomenon could be related to the higher percentage of HSLC environments in the Southeast, as favorable values of SRH1 may compensate for more marginal values of MLCAPE. However, during summer, SRH1 is weaker (although it could still be significantly higher than for the baseline environment) and MLCAPE is typically higher overall. In this case, SRH1 is now the limiting factor and the more favorable MLCAPE compensates, which makes the MLCAPE difference from the baseline environments less prominent. In contrast to the Southeast, the Great Plains have smaller SRH1 aggregated [...] (Figures 4h and 4i) exhibits a similar pattern to the Great Plains but with overall slightly greater magnitude in both MLCAPE and SRH1 aggregated significance percentage.

Sensitivity of Aggregated Significance Percentages
If we reduce the sampling size for comparison in terms of both grid sizes and number of days in order to reduce computational cost (i.e., reduce the necessary number of permutations to generate the null distribution), does the aggregated significance percentage decrease? In a forecasting scenario where only prior information is available, it is more realistic to (randomly) sample days before the event rather than sampling days after the event. Thus, we calculate the mean aggregated significance percentage for all combinations of days that are randomly sampled prior to the tornado event. Figure 5 also provides the aggregated significance percentage of sampling the entire ±15 days as a reference benchmark. The average aggregated significance percentage is also calculated for the null scenario (i.e., no tornado events), which serves as a reference for how potentially sampling additional null pairs could reduce significance percentages (i.e., reducing chances of being identified as a tornadic environment).
As expected, the mean aggregated significance percentage of the tornadic scenario increases as we sample more days with greater grid sizes (Figures 5a, 5e, 5i, 5c, 5g, and 5k). Although the mean aggregated significance percentage of the null scenario (Figures 5b, 5f, 5j, 5d, 5h, and 5l) decreases as we sample more pairs, the mean aggregated significance percentage actually increases as we sample greater grid sizes. Such a reversal of the sensitivity to the grid size provides a way to pick an optimal grid size that has the largest significance percentage difference between the tornadic and null scenarios. For example, considering Figures 5a and 5b in a forecasting scenario, the maximum difference between the mean aggregated significance percentage of tornadic and null scenario lies on the 3 × 3 grid size and sampling that includes data from 15 days prior to the event for comparison. Such a combination of grid size and sampled days achieves optimal predictability while keeping the false-alarm potential low.
Figure 1 (a, d, g) The seasonal and diurnal distributions of tornado events for each region. In all plots, the x-axis displays the month of the tornado event, while the y-axis displays the time of the tornado event with respect to local sunset.
Regional variations in the sensitivity of mean aggregated significance percentages do exist. SRH1 (Figure 5c) of tornadic scenarios in the Southeast is less sensitive to variations in grid sizes and sampling sizes compared to both the Great Plains (Figure 5g) and the Midwest (Figure 5k). The Southeast has the greatest percentage difference between tornadic and null scenarios when selecting a 3 × 3 grid size coupled with sampling 15 days prior to an event for MLCAPE (Figures 5a and 5b) and sampling 15 days prior for SRH1 (Figures 5c and 5d). Similarly for either the MLCAPE or SRH1, both the Great Plains and the Midwest achieve the greatest percentage difference when selecting a 3 × 3 grid size coupled with sampling 15 days prior to the event (Figures 5e, 5f, 5i, and 5j). Picking an optimal grid size in different regions could potentially help reduce the required computation for future studies that utilize spatial features for prediction.
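Choosing the configuration with the largest gap between the tornadic and null scenarios amounts to an argmax over the two tables of percentages. The sketch below assumes both tables are keyed by the same (grid size, number of sampled days or pairs) tuple, which is a simplification; the function name and all percentages are invented for illustration and are not values read from Figure 5.

```python
def best_configuration(tornadic_pct, null_pct):
    """Return the key with the largest tornadic-minus-null percentage gap."""
    common = sorted(tornadic_pct.keys() & null_pct.keys())
    if not common:
        raise ValueError("no shared configurations to compare")
    return max(common, key=lambda k: tornadic_pct[k] - null_pct[k])


# Invented percentages keyed by (grid size, number of sampled days or pairs).
tornadic_pct = {(3, 5): 85.0, (3, 15): 92.0, (9, 15): 95.0}
null_pct = {(3, 5): 40.0, (3, 15): 30.0, (9, 15): 42.0}
best = best_configuration(tornadic_pct, null_pct)
```

In this made-up table the 3 × 3 grid with 15 sampled days wins even though the 9 × 9 grid has the higher tornadic percentage, because its null percentage is lower by a wider margin.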
Finally, the largest mean aggregated significance percentage difference, which is the range of the maximum difference percentage between all possible pairs of grid cells within each subfigure in Figure 5, for both MLCAPE and SRH1 of tornadic scenarios is about 6%-9% compared to 13%-14% for MLCAPE and 18%-20% for SRH1 under the null scenarios. Especially under the null scenarios for MLCAPE (Figures 5b, 5f, and 5j), the drop in the significance percentage from 5 × 5 to 3 × 3 grid size is around 4%, but only around 2% for SRH1 (Figures 5d, 5h, and 5l). Such variability suggests that including additional nearby null soundings is adding noise to the parameter distribution. MLCAPE could also be subject to more noise or variability when compared to SRH1.
Figure 5. Aggregated significance percentages for grid size and sample size variations in the (a, c, e, g, i, k) tornadic and (b, d, f, h, j, l) null scenarios. The x-axis for the tornadic scenarios represents the total number of days that are randomly sampled prior to (−) or after (+) the tornado event. The x-axis for null scenarios represents the total number of null events that have been randomly selected for comparison.

Summary and Discussion
This study explores how tornadic supercell soundings significantly differ from nearby baseline environment soundings by calculating the sounding-derived parameters MLCAPE and SRH1 and quantifying the percentage of events that are significantly different between the tornadic and baseline environments after performing a permutation test. Notably, MLCAPE or SRH1 alone is able to identify around 81%-98% (Figure 5) of tornadic supercell soundings as being significantly different from their typical baseline environment soundings. The significance percentage when comparing baseline environments to baseline environments (i.e., a null hypothesis of no significant difference between environments) is around 22%-44% (Figure 5), dramatically lower than that when comparing tornadic environments to baseline environments. SRH1 shows higher significance percentages in the Southeast, whereas MLCAPE shows higher significance percentages in the Great Plains and Midwest. Tornadoes occurring outside of the typical tornado activity timeframe (e.g., late at night or early in the morning) are harder to predict based on the near-storm environment alone, as shown by the reduction in significance percentages in Figure 4. Finally, significance percentages are less sensitive to variability in the size of grids sampled per event than to the number of typical baseline events that are sampled for comparison using the permutation test (Figure 5).
Future studies should include additional environmental parameters or combinations of parameters (e.g., 0-6-km shear, lifting condensation level height) and also examine how tornado warning false alarms are distinguished from tornadic environments (rather than the easier problem described in this work of tornadic vs. non-tornadic environments). Investigating additional combinations of parameters could potentially increase the significance percentages of tornado events or lower the significance percentages of null scenarios. Exploring the false alarm data set by comparing the significance percentage to tornado events could point to avenues for reducing false alarm rates for certain regions over certain time frames.
In comparison to traditional machine learning classification algorithms that learn a decision boundary or adjust the weight of the parameters through training data, this study provides a simpler perspective of classifying tornado events by identifying whether the environment is significantly different from baseline environments, adding nuance to previous studies that operated under the unstated assumption of a significant difference between tornadic and non-tornadic environments. If recorded tornado events over a certain region and certain time frame are difficult to distinguish from their baseline environments, the classification performance for that specific region and time could decrease even when using complex classification algorithms. Most importantly, these results provide a simple typical reference for the improvement of tornado forecasting by more advanced machine- and deep-learning models.