A machine-learning approach to thunderstorm forecasting through post-processing of simulation data

Thunderstorms pose a major hazard to society and economy, which calls for reliable thunderstorm forecasts. In this work, we introduce SALAMA, a feedforward neural network model for identifying thunderstorm occurrence in numerical weather prediction (NWP) data. The model is trained on convection-resolving ensemble forecasts over Central Europe and lightning observations. Given only a set of pixel-wise input parameters that are extracted from NWP data and related to thunderstorm development, SALAMA infers the probability of thunderstorm occurrence in a reliably calibrated manner. For lead times up to eleven hours, we find a forecast skill superior to classification based only on NWP reflectivity. Varying the spatiotemporal criteria by which we associate lightning observations with NWP data, we show that the time scale for skillful thunderstorm predictions increases linearly with the spatial scale of the forecast.


| INTRODUCTION
While thunderstorms undoubtedly constitute inspiring natural spectacles, their impact in the form of lightning, strong winds and heavy precipitation (including hail) is hazardous to society and the economy. Besides the small but real chance of being struck by lightning (Holle, 2016), thunderstorms pose a threat to crops and livestock (Holle, 2014) and are known to trigger wildfires (Veraverbeke et al., 2017). In addition, they constitute a major safety concern for aviation (Gerz et al., 2012; Borsky and Unterberger, 2019). Furthermore, thunderstorms and lightning damage electrical infrastructure such as wind turbines (Yasuda et al., 2012), which jeopardizes the transition to sustainable energy production. Finally, since the number of severe thunderstorms is expected to increase due to climate change (Diffenbaugh et al., 2013; Rädler et al., 2019), accurate thunderstorm forecasts become ever more relevant.
Thunderstorm forecasts with lead times of more than one hour usually rely on numerical weather prediction (NWP).
This method consists of simulating the future atmospheric state by numerically solving equations derived from the laws of physics. The accuracy of NWP has improved with the advent of high-performance computing, the increased availability of observational data through satellite imagery, and advances in data assimilation (Bauer et al., 2015; Yano et al., 2018). In order to use NWP data for thunderstorm predictions, one needs to know how thunderstorms manifest themselves in terms of the NWP output fields. In a post-processing step, this knowledge is then used to identify signs of thunderstorm occurrence in simulation data.
Various ideas for identifying signs of thunderstorm occurrence have been put forward in recent years. For instance, post-processing of NWP data has been blended with nowcasting methods (Kober et al., 2012; Hwang et al., 2015).
Empirical knowledge of convective activity has been translated into expert systems using fuzzy logic (Lin et al., 2012; Li et al., 2021). The fuzzy logic technique allows the construction of decision rules for thunderstorm occurrence based on domain knowledge. Lately, machine learning (ML) methods based on artificial neural networks have gained popularity.
These methods generalize the fuzzy logic approach in the sense that decision rules are constructed by solving a data-driven optimization problem. Previous studies include neural networks with relatively few neurons (Ukkonen and Mäkelä, 2019; Kamangir et al., 2020; Sobash et al., 2020; Jardines et al., 2021), as well as deep neural networks with convolutional layers and millions of trainable parameters (Geng et al., 2021; Zhou et al., 2022). Findings suggest that neural network models are more skillful at predicting thunderstorm occurrence than comparable ML approaches like random forests (Herman and Schumacher, 2018; Ukkonen and Mäkelä, 2019).
The promising results of ML have encouraged us to apply neural network methods to historical simulation data of the ICOsahedral Nonhydrostatic D2 Ensemble Prediction System (ICON-D2-EPS), an NWP ensemble model for Central Europe with a horizontal resolution of ca. 2 km (Zängl et al., 2015; Reinert et al., 2020). ICON-D2-EPS is a limited-area model which explicitly resolves convection and is run operationally by the German Meteorological Service (DWD). To the best of our knowledge, neural networks have not yet been employed for the identification of thunderstorm occurrence in ensemble data with a comparable horizontal resolution. In this work, we present the neural network model SALAMA (Signature-based Approach of identifying Lightning Activity using MAchine learning), which has been trained to predict thunderstorm occurrence through the post-processing of simulation data.
In section 2, we describe how independent datasets for the training, testing and validation of our model have been compiled from NWP forecasts and lightning data. Details on the ML architecture are provided in section 3. While thunderstorm occurrence is identified in a pixel-wise manner, we systematically vary the spatiotemporal criteria by which the lightning observations are associated with the NWP data. This enables us to study the effect of different spatial scales on the model identification skill and allows us to estimate the advection speed of thunderstorms. Further results are presented in section 4 and demonstrate that, for lead times up to at least eleven hours, SALAMA is more skillful than a baseline method based only on NWP reflectivity. In addition, we show a linear relationship between the spatial resolution scale of our model and the time scale over which skill decreases with lead time. This is consistent with earlier findings that resolving smaller scales brings about faster-growing forecast errors (Lorenz, 1969; Selz and Craig, 2015).

| DATA
We collected simulation data from the ICON-D2-EPS ensemble model, as well as lightning observations from the lightning detection network LINET (LIghtning detection NETwork, Betz et al., 2009). The simulations are used to extract predictors of thunderstorm occurrence, while the lightning observations serve as ground truth.

| Study region and period
The model domain of ICON-D2-EPS covers the areas of Germany, Switzerland, Austria, Denmark, Belgium, the Netherlands and parts of the neighboring countries. For our study, we cropped the model domain at its borders by approximately 100 km to reduce boundary computation errors. In a cylindrical projection, our study region corresponds to a rectangle with the southwest corner located at 45°N, 1°E, the northeast corner located at 56°N, 16°E and all sides being either parallels or meridians (fig. 5).
Model runs are initialized every three hours, starting at 00 UTC each day. We collected simulation data from June to August 2021 over the entire study region in hourly steps, always taking the latest available forecast for each hour. This procedure results in forecasts with lead times of 0 h, 1 h or 2 h.
Each model run has 20 ensemble members, which differ from each other in a manner consistent with the NWP uncertainty in the initial conditions, model error, and boundary conditions (Reinert et al., 2020). In section 4.2, we will relate NWP forecast uncertainty, estimated by ensemble variability, to ML model skill.

| NWP predictors
The atmospheric fields used as predictors of thunderstorm occurrence in this study are given in table 1. They have been selected as follows: we considered as candidate predictors all two-dimensional fields provided in ICON-D2-EPS, as well as two ICON-D2-EPS pressure-level fields associated with deep moist convection in the literature, namely the relative humidity at 700 hPa and the vertical wind speed in pressure coordinates at 500 hPa (Li et al., 2021). In addition, we stipulated that the predictors be available on the open-data server of the DWD (https://opendata.dwd.de, last visit: 2023-03-14), such that the trained model can eventually be used in real time. For a given candidate input field, we compared histograms of the field value distribution during and in the absence of thunderstorm occurrence and kept only fields whose distributions differed significantly between the two cases.
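The screening step can be illustrated with a small sketch. The paper compares histograms visually; the two-sample Kolmogorov–Smirnov statistic below is a hypothetical stand-in for that comparison, and the field values and the 0.5 acceptance threshold are invented for illustration.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    # Maximum gap between the two empirical CDFs.
    a, b = sorted(sample_a), sorted(sample_b)
    def ecdf(s, x):
        return bisect.bisect_right(s, x) / len(s)
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in a + b)

# Invented field values during and in the absence of nearby lightning.
with_storm = [800.0, 1200.0, 1500.0, 900.0, 2000.0, 1100.0]
no_storm = [50.0, 100.0, 200.0, 30.0, 400.0, 150.0]
stat = ks_statistic(with_storm, no_storm)
keep_field = stat > 0.5  # illustrative acceptance threshold
```

A field whose two conditional distributions barely overlap, as in this toy case, would pass the screening; a field with nearly identical distributions would be dropped.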
As shown in table 1, all predictors can be related to thunderstorm activity through physical mechanisms like instability and moisture. In particular, our selection process has led to predictors that agree with findings in the literature (Ukkonen and Mäkelä, 2019; Jardines et al., 2021; Leinonen et al., 2022).
Conversely, convective inhibition (CIN), which is sometimes listed as a convective predictor (Kamangir et al., 2020), has not passed the selection process. This is likely due to the fact that we have checked for predictive power with respect to developed thunderstorms: CIN correlates with the hours leading up to a thunderstorm but is typically depleted once the storm reaches its mature stage.
It is worth stressing that we have excluded certain parameters on purpose, namely the geographical location of a thunderstorm event, the time of day, and the time of year.
In doing so, we assume the existence of a universal signature shared by all thunderstorms, irrespective of where and when they occur. In addition, the list of predictors does not include the lead time of the forecast. We check in section 4 whether our model, which has been trained on data with lead times between 0 h and 2 h, displays skill on data with longer lead times.
TABLE 1 List of the 21 input parameters used in the study ("DIA": including sub-grid scale).
A pixel at position x and time t is assigned the label 1 if at least one lightning strike at position x_ℓ and time t_ℓ satisfies ‖x − x_ℓ‖ ≤ Δr and |t − t_ℓ| ≤ Δt, where ‖⋅‖ denotes the great-circle distance between x and x_ℓ.
We trained our model with different values for the spatial and temporal thresholds Δr and Δt in order to systematically study their relationship with classification skill.
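The association of lightning observations with NWP pixels can be sketched as follows. The paper does not state how the great-circle distance is implemented; the haversine formula below is an assumption, and the strike coordinates are invented. The default thresholds mirror the Δr = 15 km, Δt = 30 min configuration used later in the text.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Haversine formula for the great-circle distance in km.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2.0 * radius_km * math.asin(math.sqrt(a))

def label_pixel(pixel, strikes, dr_km=15.0, dt_min=30.0):
    # Label 1 if any strike lies within dr_km and dt_min of the pixel.
    lat, lon, t = pixel
    return int(any(
        great_circle_km(lat, lon, slat, slon) <= dr_km and abs(t - st) <= dt_min
        for slat, slon, st in strikes
    ))

# Invented strikes: (lat, lon, minutes relative to the forecast valid time).
strikes = [(48.10, 11.60, 10.0), (50.00, 8.00, -45.0)]
near = label_pixel((48.14, 11.58, 0.0), strikes)  # close in space and time
far = label_pixel((52.50, 13.40, 0.0), strikes)   # no strike nearby
```

Sweeping `dr_km` and `dt_min` over a grid of values reproduces the family of label configurations studied in section 4.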

| Compiling independent data sets
The data obtained from NWP and lightning observations can be considered a set of tuples (ξ, y), where ξ ∈ ℝ^N denotes the N = 21 input parameters and y ∈ {0, 1} corresponds to a label of the ground truth (1: thunderstorm occurrence, 0: no thunderstorm occurrence). As the input fields were provided on a triangular grid, we first performed an interpolation onto a regular grid with a spacing of 0.125°. The rarity of thunderstorms makes predicting their occurrence more challenging, as ML models tend to struggle with learning from unbalanced datasets (Sun et al., 2009). As a matter of fact, we verified that, when trained on a climatologically consistent dataset, our model would predict the majority class (i.e. no thunderstorm) at every occasion. We therefore

| METHODS
In this section, we provide details on SALAMA, focusing on how it has been trained and calibrated. In addition, we introduce metrics for the evaluation of model skill and present a baseline model for comparison.

| Model description
It is worthwhile to introduce some ML terminology. The three data sets used for training, testing and validation (section 2.4) are made up of examples (ξ, y). Each example consists of a pattern ξ ∈ ℝ^N of N input features and a label y ∈ {0, 1}.
Given a pattern ξ, the problem at hand is to infer the probability of thunderstorm occurrence, which constitutes a task known as binary classification. In the following, we consider both the pattern and its corresponding label to originate from a random experiment. Therefore, let X be an N-dimensional random variable for the pattern and let Y be a random variable for the label. Denote by (ξ^(i), y^(i))_{i=1…S} the training set with S examples. The most likely configuration of weights, given the training set, is then obtained by minimizing the negative logarithm of the likelihood function,

 − ∑_{i=1}^{S} log P_w(y^(i) | ξ^(i)), (3)

with respect to the weights. The expression in eq. (3) is referred to as the binary cross-entropy loss function in ML terminology. The process of determining the weights that minimize the loss is called training. We trained SALAMA using the robust iterative stochastic method Adam (Kingma and Ba, 2014).
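For concreteness, the two ingredients of this section can be sketched in plain Python: a feedforward pass with ReLU hidden layers and a sigmoid output node, with the 21 → 20 → 20 → 20 → 1 layout following the architecture described for SALAMA, and the per-example binary cross-entropy of eq. (3). The weights below are random placeholders, not the trained SALAMA weights, and this sketch is not the actual implementation.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, W, b):
    # One layer: matrix multiplication plus component-wise bias.
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi for row, bi in zip(W, b)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, layers):
    # ReLU in the hidden layers, sigmoid on the single output node.
    for W, b in layers[:-1]:
        x = relu(dense(x, W, b))
    W, b = layers[-1]
    return sigmoid(dense(x, W, b)[0])

def bce_loss(probs, labels):
    # Mean negative log-likelihood (binary cross-entropy), cf. eq. (3).
    eps = 1e-12  # guard against log(0)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / len(labels)

def rand_layer(n_in, n_out, rng):
    # Placeholder weights; in SALAMA these are set by Adam during training.
    return ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

rng = random.Random(0)
layers = [rand_layer(21, 20, rng), rand_layer(20, 20, rng),
          rand_layer(20, 20, rng), rand_layer(20, 1, rng)]
p_zero = forward([0.0] * 21, layers)       # all-zero input: every pre-activation is 0
uninformed = bce_loss([0.5, 0.5], [1, 0])  # coin-flip forecasts
```

With zero biases, an all-zero input yields a pre-activation of 0 at the output node and hence a probability of exactly 0.5; a constant 0.5 forecast attains a loss of log 2.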
However, if one used the configuration of weights which minimizes eq. (3) exactly, the neural network would likely suffer from overfitting, i.e. it would learn parts of the noise in the data as well. To mitigate this, we implemented an early stopping procedure.

| Analytic model calibration
In order to address the climatological rarity of thunderstorm occurrence, we have artificially increased the fraction of positive examples in the data set used for the training of our neural network (section 2.4). In this section, we explain why this dataset modification causes our model to be miscalibrated, and derive an analytic correction for calibrating the model output.
It is crucial to understand that if the trained model were naively applied to a test set with a different fraction of positive examples than in the training set, the produced probabilities would be inconsistent with the observed relative frequency of thunderstorm occurrence. In order to see this, we use Bayes' theorem to expand the conditional probability of thunderstorm occurrence given a pattern ξ, which yields

 P(Y = 1 | X = ξ) = P(X = ξ | Y = 1) P(Y = 1) / P(X = ξ). (4)

The denominator can be expressed as

 P(X = ξ) = P(X = ξ | Y = 1) P(Y = 1) + P(X = ξ | Y = 0) P(Y = 0). (5)

Let P(Y = 1) = 1 − P(Y = 0) = ε, where ε denotes the climatological probability of thunderstorm occurrence with no prior knowledge. Then,

 P(Y = 1 | X = ξ) = [1 + ((1 − ε)/ε) R(ξ)]^(−1), (6)

with the residual function R(ξ) = P(X = ξ | Y = 0) / P(X = ξ | Y = 1). When we want to apply our neural network, trained with a fraction g of positive examples, to a dataset with ε ≠ g, the correct probability output reads

 f_w(ξ, ε) = [1 + ((1 − ε)/ε) (g/(1 − g)) (1 − f_w(ξ, g))/f_w(ξ, g)]^(−1), (8)

which can be derived by solving eq. (7) for R(ξ) and substituting the result into eq. (6). On the other hand, if the sample climatologies of training set and test set are equal (g = ε), eq. (8) yields f_w(ξ, ε) = f_w(ξ, g), i.e. no probability correction is needed.
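A minimal sketch of the correction, assuming eq. (8) coincides with the standard prior-shift formula (it is algebraically equivalent to the expression above, rewritten in terms of the class priors):

```python
def correct_probability(p_model, eps, g=0.5):
    # Rescale the class-conditional evidence from training climatology g
    # to test climatology eps; equivalent to eq. (8) above.
    num = p_model * eps / g
    den = num + (1 - p_model) * (1 - eps) / (1 - g)
    return num / den

identity = correct_probability(0.3, eps=0.5, g=0.5)  # eps == g: no change
rare = correct_probability(0.5, eps=0.02, g=0.5)     # rare class shrinks the output
```

With a balanced training set (g = 1/2) and a climatology of ε ≈ 2 %, a raw model output of 0.5 (the model is undecided) is mapped to the climatological value, as one would expect from an uninformative forecast.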
If the model probability output is consistent with the observed relative frequency of thunderstorm occurrence, the model forecasts are referred to as reliable. In order to check whether our neural network provides reliable forecasts, we used the test set to produce a reliability diagram. We then define the following two bin-wise terms:

 RES_k = n_k (o_k − ε)² / (n Δ_k ε (1 − ε)), (9)

 REL_k = n_k (f_k − o_k)² / (n Δ_k ε (1 − ε)), (10)

where n_k examples out of the n test-set examples fall into bin k with bin width Δ_k, bin-averaged model probability f_k and observed relative frequency o_k of thunderstorm occurrence. Up to a factor ε(1 − ε), known as the uncertainty term, the sums ∑_k Δ_k RES_k and ∑_k Δ_k REL_k are called the resolution and reliability, respectively, of the model. Resolution measures forecast variance, with higher values of resolution indicating a better ability of the model to differentiate between thunderstorm and non-thunderstorm patterns (Toth et al., 2003).
Reliability quantifies the mean squared deviation of the calibration curve from the diagonal. The bin-wise terms defined in eqs. (9) to (10) offer an overview of how much each probability bin contributes to reliability and resolution (fig. 3 (b)).

| Skill evaluation metrics
Metrics for evaluating classification skill using a test set with n examples include the Brier score (BS),

 BS = (1/n) ∑_{i=1}^{n} (f_w(ξ^(i)) − y^(i))², (11)

which is known for being strictly proper (Bröcker and Smith, 2007b). Normalization with the reference Brier score BS_ref = (1/n) ∑_{i=1}^{n} (ε − y^(i))² of a random climatological model yields the Brier skill score (BSS),

 BSS = 1 − BS/BS_ref. (12)

Murphy (1973) showed that the BSS can be written as the difference between resolution and reliability (section 3.2). Thus, in terms of eqs. (9) to (10), the BSS is given by the area between RES_k and REL_k as functions of f_k. This is illustrated in fig. 3 (b).
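The BSS computation follows directly from eqs. (11) to (12), with the sample climatology as reference forecast; the toy probabilities and labels are invented.

```python
def brier_skill_score(probs, labels):
    # BSS = 1 - BS / BS_ref, with the climatological base rate as reference.
    n = len(labels)
    bs = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    eps = sum(labels) / n  # sample climatology
    bs_ref = sum((eps - y) ** 2 for y in labels) / n
    return 1.0 - bs / bs_ref

bss = brier_skill_score([0.9, 0.1, 0.1, 0.1], [1, 0, 0, 0])  # sharp forecasts
climo = brier_skill_score([0.25] * 4, [1, 0, 0, 0])          # climatological forecast
```

A constant forecast of the base rate scores BSS = 0 by construction, while sharp, well-placed forecasts approach BSS = 1.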
While the BSS directly acts on the probability outputs f_w(ξ^(i)) (eq. (11)) of the model, a large class of classification metrics requires the conversion of probabilities to binary output first. This is done by introducing a decision threshold p. If the model output exceeds p, thunderstorm occurrence for the corresponding example is deemed "true", otherwise "false". In combination with the two options for the label, there are four possible outcomes for each example. They are presented as a contingency matrix in table 2. Here, e.g. "hits" refers to the number of examples in the test set that qualify as "hit" according to table 2. POD is often referred to as recall in the ML literature, while 1 − FAR is also known as precision.
Precision and recall need to be simultaneously optimized for a useful classifier. For instance, perfect recall is easily achieved by predicting the thunderstorm class at every occasion. For problems with class imbalance, a popular choice of combining the two scores consists of taking their harmonic mean, which yields the F1-score:

 F1 = 2 (precision × recall) / (precision + recall).

For the equitable threat score introduced in the following, the hits by accident amount to ε × (hits + false alarms).

| Baseline model
As thunderstorms are accompanied by convective precipitation, radar reflectivity constitutes a natural surrogate for thunderstorm occurrence in the nowcasting community (Dixon and Wiener, 1993; Wilson et al., 1998; Turner et al., 2004). ICON-D2-EPS outputs the column-maximal radar reflectivity (DBZ_CMAX in table 1). Just like SALAMA, the baseline model outputs the probability of thunderstorm occurrence (for the one ensemble member that produced the input reflectivity).
Figure 4 shows the resulting reliability diagram. The light dotted line corresponds to the uncorrected calibration curve, while the dash-dotted line results from applying the probability correction eq. (8). The baseline model produces well-calibrated output for small model probabilities, while it displays underconfidence above probabilities of approximately 0.2. As examples with probabilities higher than 0.2 make up less than 1 % of the test set, we assume that these examples did not contribute sufficiently to the loss function, which instead favored well-calibrated small probabilities. In an effort to construct a competitive baseline model, we used the validation set to fit a linear function ℓ to the part of the dash-dotted calibration curve with probabilities higher than 0.15. Then, if the output of the baseline model after application of the probability correction eq. (8) is denoted by q, the calibrated output reads ℓ(q) for q > 0.15, and q otherwise.
The resulting calibration curve, which is well calibrated, is given by the solid line in the reliability diagram. The histogram and the lower panel in fig. 4 (a) refer to this calibration curve.
One can see that the BSS is essentially determined by the baseline resolution, just like for SALAMA (fig. 3). The baseline scores comparably to SALAMA in terms of reliability. On the other hand, the baseline resolution is significantly worse, which results in a lower BSS compared to SALAMA. At the typical nowcasting threshold of 35 dBZ (Dixon and Wiener, 1993; Mueller et al., 2003), the probability of thunderstorm occurrence reads 0.22.

| RESULTS
SALAMA provides a general post-processing framework for NWP ensemble forecasts. While we trained SALAMA on lead times of up to two hours, we apply the same model to all lead times and all ensemble members individually, using neither the lead time nor the ensemble member index as an input feature. Working with ensemble data, our framework readily allows us to study the ensemble spread of thunderstorm occurrence. For example, a 3 h forecast of ICON-D2-EPS for a given location consists of 20 input feature tuples (one tuple for each ensemble member). One can then compute a thunderstorm probability according to eq. (8) for each member. As we will discuss in section 4.2, the ensemble spread of thunderstorm probability is linked to the NWP forecast uncertainty of the input features. In the following, we compare SALAMA to the baseline model based on reflectivity (section 3.4) and move on to investigating how the spatiotemporal thresholds of the lightning label configuration (section 2.3) influence the classification skill of SALAMA as a function of lead time.

| Comparison to baseline model
In this section, we keep the thresholds of the lightning label configuration (section 2.3) fixed to the particular choice Δr = 15 km, Δt = 30 min. The climatological fraction of thunderstorm examples in the test set amounts to ε = 0.021 in this configuration. The results of this section, however, do not change qualitatively if another configuration is used.
As a first step, we visually compare the performance of SALAMA and the baseline model in a case study. For this purpose, we run SALAMA for three consecutive hours of an evening with thunderstorm occurrence over Southern Germany. This day has not been used for the training of SALAMA.
In fig. 5, we plot the probability of thunderstorm occurrence for an arbitrary member of the NWP ensemble over the entire study domain. Observed thunderstorm occurrence is given by black contours. The corresponding plots for the baseline model are shown below the panels of SALAMA. In this particular case study, SALAMA tends to detect more thunderstorm pixels than the baseline model. On the other hand, SALAMA seems to produce more false alarms.
In order to compare the skill of SALAMA and our baseline quantitatively for the entire study period, we evaluate the skill scores introduced in section 3.3. We use for this purpose the test set introduced in section 2.4. The model lead times for the three hours shown in fig. 5 are 1 h, 2 h, and 0 h, respectively. The color maps display the result for the first ensemble member of ICON-D2-EPS, while lightning labels (Δr = 15 km, Δt = 30 min, section 2.3) are shown as black contours. A jump in the color maps indicates the decision thresholds used for the evaluation of the skill scores in table 3.
The uncertainties are computed here, as well as for the subsequent evaluations, by the bootstrap resampling method introduced in section 3.2. Note that we obtain POD = 1 − FAR = F1 for both models. This results from our choice of decision threshold: recall generally equals precision for unbiased forecasts (Wilks, 2011).

| Lead time dependence of classification skill
The data sets for training, testing and validation introduced in section 2.4 contain forecasts with lead times between 0 h and 2 h only; here, we additionally evaluate SALAMA on longer lead times. Even the 11-hour-lead-time skill of SALAMA is higher than the baseline skill for any of the considered lead times.
FIGURE 7 Classification skill of SALAMA and the baseline model as a function of lead time. Uncertainties are obtained from 200 bootstrap resamples and show the symmetric 90 % confidence interval.
It is tempting to assume that the decrease in skill with lead time originates from an increasing NWP forecast uncertainty at longer lead times. We can use the ensemble data to check this hypothesis. Let φ be either one of the 21 input features or the model thunderstorm probability, i.e. a quantity that is given for each ensemble member and each lead time. Then, define the ensemble spread φ′ of φ as the ensemble standard deviation of φ, where we make the dependence on the lead time t_lead explicit.
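The ensemble spread can be sketched as follows; whether the population or the sample standard deviation is meant is not stated in the text, so the population form is assumed here.

```python
import math

def ensemble_spread(members):
    # Population standard deviation across the ensemble members.
    n = len(members)
    mean = sum(members) / n
    return math.sqrt(sum((m - mean) ** 2 for m in members) / n)

spread = ensemble_spread([1.0, 2.0, 3.0])
no_spread = ensemble_spread([0.5] * 20)  # identical members -> zero spread
```

In the analysis of section 4.2, this quantity would then be averaged over the study region and all test-set times to obtain a single spread value per lead time.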

| Effect of the label size
Classification skill decreases more slowly for coarser label configurations. This is in agreement with the anticipation (Lorenz, 1969) that resolving smaller scales in NWP models is associated with a more rapid growth of forecast errors. Our finding is complementary to convection studies involving a scale-dependent skill score (Roberts, 2008) and to high-resolution simulations (Selz and Craig, 2015).

| CONCLUSION AND PERSPECTIVES
Addressing the need for accurate thunderstorm forecasting and leveraging advances in high-resolution NWP and ML, we have presented SALAMA, a feedforward neural network model that identifies thunderstorm occurrence in NWP forecasts up to 11 h in advance in a pixel-wise manner. This approach has allowed us to ensure reasonable reliability without calibration fits. Furthermore, we have proposed a
undersampled the majority class in the training set, such that both labels appear equally frequently (class balance). On the other hand, the validation and test sets remain climatologically consistent, since we wish to quantify model performance in a realistic setting. Having different sample climatologies in the training and test set, however, requires model output calibration, which is discussed in section 3.2.
Let Y be a random variable for the label of thunderstorm occurrence (1: thunderstorm, 0: no thunderstorm). We are interested in P(Y = 1 | X = ξ), namely the conditional probability of thunderstorm occurrence if the pattern is known. A feedforward artificial neural network model is a function f: ℝ^N → (0, 1) that models the relationship between the input pattern and the corresponding probability of thunderstorm occurrence. We refer to f simply as a neural network. Neural networks use compositions of matrix multiplications, as well as non-linear operations referred to as activation functions. The architecture of our neural network is presented in fig. 2: it consists of the input and output layer as well as hidden layers, where each layer is a vector of numbers obtained from the previous layer by one matrix multiplication and by applying an activation function to the result in a component-wise manner. The complexity of f is adjustable through the number of hidden layers and the size of each hidden layer, i.e. the number of nodes. Our model has three hidden layers and 20 nodes per hidden layer. Moreover, we use rectified linear units for the hidden layers and a sigmoid function to map the output layer to a probability between zero and one.
FIGURE 2 (Color online) The architecture of SALAMA: input features are scaled to order 1. We use rectified linear units as activation functions in the hidden layers. A sigmoid function maps the output layer to the open interval (0, 1).
The entries, also referred to as weights, of the matrices that connect the layers are adjusted according to the data in the training set. We therefore add a subscript w ∈ ℝ^W to f to express the dependence on the W weights. If f_w constitutes an accurate representation of the conditional probability of thunderstorm occurrence, i.e. f_w(ξ) = P(Y = 1 | X = ξ), then the likelihood of observing a label y for a given input feature ξ reads

 P_w(y | ξ) = f_w(ξ)^y (1 − f_w(ξ))^(1−y), (2)

The residual function R(ξ) = P(X = ξ | Y = 0) / P(X = ξ | Y = 1) is not expected to depend on ε. Nevertheless, eq.
(6) shows that the conditional probability of thunderstorm occurrence does carry an implicit ε-dependence through the prefactor (1 − ε)/ε. The training set contains an increased fraction g of positive examples (in our work: g = 1/2), while the corresponding fraction in the test set is (up to fluctuations due to the finite sample size) equal to the climatological value ε. During training, the neural network therefore learns to produce the model output

 f_w(ξ, g) = [1 + ((1 − g)/g) R(ξ)]^(−1). (7)

To produce the reliability diagram, one partitions the interval (0, 1) of possible forecast probabilities into bins. For each bin, one considers all examples whose model probability falls into the bin. Then, one computes the relative frequency of thunderstorm occurrence and plots it against the bin-averaged model probability per bin. The resulting curve is referred to as the calibration function. An example for one configuration of lightning labels is shown in fig. 3 (a), for which 10 equidistant bins have been used. Shown are two calibration functions: the light grey line corresponds to a calibration function without any probability correction, while the solid black line results from applying eq. (8) to our model output. The uncertainty on the observed frequency spans the 5th and 95th percentiles of fluctuations and has been estimated through a bootstrap resampling procedure similar to Bröcker and Smith (2007a): by drawing with replacement, one produces variations of the original test set and considers the sample-to-sample fluctuations of observed relative frequencies. The uncalibrated line severely overestimates the relative frequency of thunderstorm occurrence at all model probabilities. As has been worked out, this is not a result of faulty training but stems from having different sample climatologies in the training and test sets. After calibration, however, our model produces reliable forecasts for probabilities close to 0 and 1. On the other hand, our model slightly underestimates the
relative frequency of thunderstorm occurrence for forecast probabilities below 0.6. Further calibration could be done using statistical methods like isotonic regression (Niculescu-Mizil and Caruana, 2005), which is beyond the scope of this work. Instead, we consider our model sufficiently reliable and appreciate that the level of reliability has been attained by means of the analytical correction eq. (8) alone. In addition to calibration curves, binning the forecast probabilities allows the introduction of two useful metrics of classification skill. Of the n examples in the test set, let n_k fall into bin k with bin width Δ_k, bin-averaged model probability f_k and observed relative frequency o_k of thunderstorm occurrence. The bin-wise terms defined in eqs. (9) to (10) and shown in fig. 3 (b) offer an overview of how much each probability bin contributes to reliability and resolution. For instance, resolution is most impacted by examples with model probabilities of ca. 0.3 and dominates over reliability.
FIGURE 3 Reliability diagram of SALAMA, evaluated for the test set with the label configuration Δr = 15 km, Δt = 30 min (section 2.3). (a) Calibration curve after applying the probability correction eq. (8) (black solid line) and before (light grey dotted line), and histogram of examples per bin. Perfect reliability is indicated by a dashed diagonal. The shaded band corresponds to the symmetric 90 % confidence interval obtained from 200 bootstrap resamples. (b) Bin-wise resolution and reliability (eqs. (9) to (10)) and their relation to the Brier skill score (BSS, section 3.3) as a function of model probability.
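The construction of the calibration function can be sketched as follows, using equidistant bins as in fig. 3 (a); the toy probabilities and labels are invented.

```python
def calibration_curve(probs, labels, n_bins=10):
    # Assign each forecast probability to one of n_bins equidistant bins.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        k = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes into the last bin
        bins[k].append((p, y))
    curve = []
    for content in bins:
        if content:  # skip empty bins
            mean_p = sum(p for p, _ in content) / len(content)
            obs_freq = sum(y for _, y in content) / len(content)
            curve.append((mean_p, obs_freq, len(content)))
    return curve

curve = calibration_curve([0.05, 0.15, 0.95, 0.85], [0, 0, 1, 1])
```

Plotting the observed frequency against the bin-averaged probability for each returned tuple yields the calibration function; the counts per bin give the accompanying histogram.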
Out of the infinite number of options to combine the four possible outcomes into a single skill score, we selected the scores in this study based on their suitability for tasks with significant class imbalance. Namely, we do not wish to reward our model for correctly classifying the majority class, which amounts to dismissing scores that explicitly use correct rejects. Given a test set and a fixed decision threshold, the probability of detection (POD) and false-alarm ratio (FAR) follow from the entries of the contingency matrix in table 2. A modification of the critical success index (CSI) consists of subtracting as many hits as a model randomly classifying according to climatology would obtain; the resulting equitable threat score (ETS) reads

 ETS = (hits − hits by accident) / (hits − hits by accident + misses + false alarms).
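The contingency-based scores follow directly from the definitions above (POD, precision = 1 − FAR, F1 and ETS), with ε taken as the observed base rate of the test set; the toy forecasts are invented.

```python
def contingency_scores(pred, obs):
    # pred, obs: lists of 0/1; build the contingency-matrix entries.
    hits = sum(1 for p, o in zip(pred, obs) if p == 1 and o == 1)
    misses = sum(1 for p, o in zip(pred, obs) if p == 0 and o == 1)
    false_alarms = sum(1 for p, o in zip(pred, obs) if p == 1 and o == 0)
    pod = hits / (hits + misses)              # recall
    precision = hits / (hits + false_alarms)  # 1 - FAR
    f1 = 2 * precision * pod / (precision + pod)
    eps = (hits + misses) / len(obs)          # observed base rate
    hits_rand = eps * (hits + false_alarms)   # "hits by accident"
    ets = (hits - hits_rand) / (hits - hits_rand + misses + false_alarms)
    return pod, precision, f1, ets

pod, precision, f1, ets = contingency_scores([1, 1, 0, 0], [1, 0, 1, 0])
```

In this toy case the forecast does no better than a random climatological forecast, so the ETS vanishes while POD, precision and F1 all equal 0.5.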

Figure 4 (b) shows the learned and calibrated relationship between NWP reflectivity and the corresponding probability of thunderstorm occurrence. The monotonically increasing relationship implies that thunderstorms become more likely as reflectivity increases. A typical threshold for defining thunderstorms in nowcasting is 35 dBZ.
FIGURE 4 Training of the baseline model. (a) Reliability diagram panels as in fig. 3, but for the baseline model. (b) Learned relationship between the baseline input field and the corresponding probability of thunderstorm occurrence.
The test set (section 2.4) consists of examples from the entire summer of 2021. For some of the scores, we need to set a decision threshold. As a criterion, we demand that forecasts be unbiased (the average fraction of examples classified as thunderstorms is equal to the observed fraction of thunderstorm examples), yielding thresholds of 0.193 (SALAMA) and 0.119 (baseline). The thresholds are also indicated in the color bars of fig. 5. The threshold found for reflectivity corresponds to 28 dBZ and is slightly below the typical literature threshold cited in section 3.4. The performance of SALAMA and the baseline is summarized in table 3. Irrespective of the skill score under consideration, SALAMA scores better than the baseline model.
FIGURE 5 (Color online) Probability of thunderstorm occurrence for June 23, 2021 from 19 UTC on, for SALAMA (upper row) and the baseline model (lower row).
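The unbiased-threshold criterion can be sketched as follows: classify exactly as many examples as thunderstorms as there are observed thunderstorm labels, and place the threshold between the corresponding ranked probabilities. The tie-free toy data are invented; the paper does not state how ties are handled.

```python
def unbiased_threshold(probs, labels):
    # Choose the threshold so that the number of examples classified as
    # thunderstorm equals the number of observed thunderstorm labels.
    k = sum(labels)
    ranked = sorted(probs, reverse=True)
    if k == 0:
        return 1.0
    if k == len(ranked):
        return 0.0
    return 0.5 * (ranked[k - 1] + ranked[k])

probs = [0.9, 0.6, 0.3, 0.1]
labels = [1, 0, 1, 0]
t = unbiased_threshold(probs, labels)
n_predicted = sum(1 for p in probs if p > t)
```

By construction, applying the returned threshold reproduces the observed number of positive examples, which is exactly the unbiasedness condition stated above.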
In fig. 7, we plot the SALAMA classification skill, measured in terms of the skill scores introduced in section 3.3, as a function of lead time and compare it to the dependence obtained for the baseline model. Figure 7 shows that, for SALAMA, classification skill decreases approximately exponentially (note the log-scaling of the y-axis) for lead times longer than 1 h, irrespective of the skill score under consideration. The classification skill of SALAMA at a lead time of 1 h is actually higher than at 0 h, which is likely a spin-up effect of the NWP model (Sun et al., 2014). On the other hand, SALAMA skill is systematically superior to baseline skill for all lead times, even at 11 h. So far, the temporal and spatial thresholds of the label configuration have been fixed to Δr = 15 km and Δt = 30 min (henceforth referred to as the default configuration). In this section, we study how varying the spatiotemporal thresholds affects the classification skill of SALAMA. As a first step, we compute reliability diagrams for different label configurations. In panel (a) of fig. 9, we study a configuration with smaller thresholds than for the configuration studied so far. Panel (b) displays a configuration with reduced Δt and increased Δr. In panel (c), both thresholds are increased with respect to the default configuration. The exact choice of Δt and Δr for the three panels is somewhat arbitrary.
FIGURE 9 Reliability diagrams as in fig. 3, but with label configurations (a) Δt = 15 min, Δr = 9 km (ℓ = 14 km), (b) Δt = 10 min, Δr = 21 km (ℓ = 24 km), (c) Δt = 40 min, Δr = 24 km (ℓ = 36 km). The spatial scale ℓ is introduced in eq. (20).
FIGURE 10 (Color online) Classification skill of SALAMA, expressed in terms of the area under the PR curve, as a function of the label configuration (section 2.3). The slope of the dashed lines is chosen such that classification skill is approximately constant along the lines. Each line corresponds to a specific spatial scale ℓ (eq. (20)).
FIGURE 11 Decay time of classification skill (quantified by the area under the PR curve) as a function of the spatial scale. Each data point corresponds to one label configuration in fig. 10. The parameters of a linear fit are also shown, as well as the Pearson coefficient of correlation.
SALAMA identifies thunderstorm occurrence up to 11 h in advance in a pixel-wise manner. The inference of the probability of thunderstorm occurrence is based on input parameters that are physically related to thunderstorm activity and do not explicitly feature information on location, time or forecast range. This gives reason to expect that the signature learned by the model generalizes to thunderstorms outside the study region of this work and remains valid in a changing climate. In addition, the availability of all input features in real time makes SALAMA readily available for operational use. We have addressed the technical challenge caused by the rarity of thunderstorms and the corresponding small fraction of positive examples by increasing this fraction during training and analytically accounting for the increase when testing.
We refer to this field as reflectivity in what follows. In order to construct a baseline for comparison to SALAMA, we repeat the training of our model, but use only reflectivity as input. The architecture of the baseline model is identical to the one presented in fig. 2, except for the input layer, which now has only a single node.
TABLE 3 Scores for classification skill, evaluated on the test set, for SALAMA and the baseline model. The probability thresholds used for evaluation are chosen such that the forecast is unbiased and amount to 0.193 (SALAMA) and 0.119 (baseline). Uncertainties are obtained from 200 bootstrap resamples and show the symmetric 90 % confidence interval.
The brackets ⟨⋅⟩ denote the average over all 20 ensemble members. Denote by φ′(t_lead) the expression obtained by averaging φ′ over the entire study region and all times associated with the test set. Lastly, we define the