A Large Simulation Set of Geomagnetic Storms—Can Simulations Predict Ground Magnetometer Station Observations of Magnetic Field Perturbations?

We use the Space Weather Modeling Framework Geospace configuration to simulate a total of 122 storms from the period 2010–2019. With the focus on the storm main phase, each storm period was run for 54 hr starting from 6 hr prior to the start of the Dst depression. The simulation output of ground magnetic variations, ΔBH in particular, were compared with ground magnetometer station data provided by SuperMAG to statistically assess the Geospace model regional magnetic perturbation prediction performance. Our results show that the regional predictions at mid‐latitudes are quite accurate, but the high‐latitude regional disturbances are still difficult to predict.

. Prediction of the magnetic disturbances caused by the storm time currents is important for space weather, and one of the main motivations of this study.
Rapid variations in the geomagnetic field induce a geoelectric field at Earth's surface, which in turn drive geomagnetically induced currents (GIC), which can have harmful effects on power grids, natural gas pipelines, or other technological systems (Pirjola et al., 2000). The geoelectric field depends on the ground conductivity structure, which means that the local geology influences the formation and intensity of the GIC (Zheng et al., 2013). As the power spectrum of the geoelectric field is dominated by frequencies below 1 Hz, the GICs act as a direct current (DC) component on top of the 50 or 60 Hz alternating current power system (A. Pulkkinen et al., 2017). The effects in the power systems include saturation of transformers, which can generate equipment damage and/ or system-wide disturbances and power outages (Bolduc, 2002;Lanzerotti, 2001). In the natural gas pipelines, the DC currents can cause corrosion and thus shorten the lifetime of the system (Boteler et al., 1998;Pirjola et al., 2005). As the GIC and their effects on the systems depend on the regional rather than global level of disturbances, the global activity indices are not well suited for serving the system operators wishing to get advance warnings and nowcasts of the intensity of the disturbances (Abt Associates Inc., 2019). The Federal Emergency Management Agency National Threat and Hazard Identification and Risk Assessment report (Federal Emergency Management Agency, 2019) recognized space weather-associated power outage as one of two natural hazards (besides a pandemic) that can have nation-wide impacts. Furthermore, the Promoting Research and Observations of Space Weather to Improve the Forecasting of Tomorrow Act, or the PROSWIFT Act (Congress, 2020), recognizes addressing the GIC risk to power systems as critical for the nation's safety and security.
The NOAA Space Weather Prediction Center (SWPC) in Boulder, CO, produces ground magnetic perturbation maps based on the University of Michigan Space Weather Modeling Framework (SWMF) Geospace model for the ground infrastructure operators (Space Weather Prediction Center, 2022). This study addresses the capability of the SWMF Geospace to provide accurate regional predictions of the ground geomagnetic disturbances. As an operational model, the Geospace model has been comprehensively validated with regard to its predictions of the global indices (Cash et al., 2018;Liemohn et al., 2018).
In terms of the more localized magnetospheric responses like with ground magnetometers, A. Pulkkinen et al. (2013) set the ground work for validating a model's performance in predicting dB H /dt. The study included SWMF as well as other models, with a focus on 6 storms and 12 stations in the northern hemisphere. Welling et al. (2017) extended that study by assessing the model performance as a function of the solar wind driver and SYM-H. It is important to note that in this study, we examine two latitude bands, midlatitude stations between −50° and 50° and high-latitude stations between 50° and 90° latitude, while the previous studies focus on the northern hemisphere in the latitude band between 50° and 65°. This study is in the same vein as the previous studies of its kind, but focuses on the localized disturbances recorded at individual ground magnetometer stations to maximize the usability of the results in practical applications.
The exponential increase in computer power and storage have allowed the use of methods like ensemble modeling employed for example, by Morley et al. (2018). They simulated a storm with several variations of the input solar wind, which allowed them to evaluate a quantifiable uncertainty. Our approach is to use solar wind data during a large number of storms to obtain a statistical database of storm simulations, from which we can draw statistical conclusions. A similar study, but with fewer storms, and without including coupling of the inner magnetosphere model that allows the development of an intense ring current during the storm main phase, was carried out by (Kwagala, Norah Kaggwa et al., 2020). In the analysis, we make use of best practices solidified by the community , such as the use of contingency tables and Heidke Skill Scores (HSS) as well as the horizontal component of magnetic perturbations (but not dB/dt in this case).
In this paper we describe a statistical database of storm simulations and use that together with the SuperMAG ground magnetometer observations database to assess the model performance at individual stations, comparing results in different latitude bands. Section 2 describes the methodology and the results, Section 3 summarizes the results and concludes with discussion.

Observations
Following the often-used definition of a geomagnetic storm as an event with a peak Dst value below −50 nT, we examined all periods during 2010-2019 fulfilling that condition. A small subset of the periods were discarded due to lack of clear signature of storm onset or main phase development or significant data gaps in solar wind and interplanetary magnetic field (IMF) records. A total of 122 such periods with minimum SYM-H below −50 nT were included in the study. The storm onset time was selected to be the time when the SYM-H index started to decrease (often following a compression caused by the ICME-associated shock). Each interval had a duration of 54 hr starting from 6 hr prior to the storm onset. While this duration captures all storm main phases in the data set, it does not always extend far enough to capture the entire storm recovery phase. This is due to computational limitations in running this many storms.
The solar wind measurements were obtained from the OMNI database (Goddard Space Flight Center, 2021), which provides the IMF and the solar wind plasma parameters propagated to the upstream bow shock allowing for direct association with the geomagnetic activity indices (Papitashvili et al., 2014). The solar wind driver intensity was assessed using the Newell coupling parameter (Newell et al., 2007), which is proportional to the rate of change of magnetic flux at the nose of the magnetopause, and can be written in the form where θ = tan −1 (B Y /B Z ) is the IMF clock angle and = ( 2 + 2 ) 1∕2 denotes the transverse component of the magnetic field perpendicular to the Sun-Earth line and is given in nT, V is the solar wind speed in km/s, and α is a scaling factor of the order of 10 3 scaling it to dimensional units of Wb/s (Cai & Clauer, 2013). The ground magnetic field variations were analyzed using the SuperMAG (Gjerloev, 2012) database comprising 1-min magnetic measurements from several hundred magnetic stations over the globe through the INTERMA-GNET (2021) network and others. While the total number of stations is large, at any given time instance, the number of available stations ranges from about 50 to 150. We categorize the set of stations into three latitude ranges, the mid-latitudes (−50° to 50° magnetic latitude), high-latitudes (50°-90° magnetic latitude) and auroral region stations (60°-70° latitude). For each event, we used all stations that had continuous data throughout the event interval. Figure 1 illustrates a sample storm in the data set. The major storm occurred on 7-9 May 2014, and had a peak SYM-H intensity below −100 nT. The storm was run using the Space Weather Modeling Framework's Geospace configuration with virtual ground magnetometers in the locations where real magnetometers are found around the globe. The panels show the Newell coupling function illustrating the solar wind driving intensity as well as a summary of the solar wind input to the model, the SYM-H index, the AL index, and the Cross-Polar Cap Potential (CPCP) estimated using a model driven by solar wind parameters and the Polar Cap Index (Ridley & Kihn, 2004). As an example, the two bottom panels show observations from two stations, from Boulder, Colorado, in the mid-latitudes, and from Yellowknife, Northwest Territories, Canada, within the auroral region (note that the observations are shown in different scales).

Space Weather Modeling Framework Geospace Configuration
The Space Weather Modeling Framework (SWMF) is a combination of numerical models to simulate space physics processes from the Sun to the Earth's upper atmosphere and outer heliosphere (Gombosi et al., 2021;Tóth et al., 2012). The core of the SWMF is the Block-Adaptive-Tree-Solar wind-Roe-Upwind-Scheme (BATS-R-US), a 3D model solving the magnetohydrodynamic equations. The Ridley Ionosphere Model (RIM) is a potential field solver for the ionosphere, and the Rice Convection Model (RCM) is a kinetic model of the ring current and inner magnetosphere. The SWMF couplers tie these three components together to simulate the space weather effects in space and on ground. The Geospace configuration used for this study mimics the one used operationally at the NOAA SWPC.
The solar wind and the magnetosphere are modeled by BATS-R-US in ideal MHD mode with the adaptive grid resolution changing between 0.125 R E in the near-Earth region and 8 R E in the distant tail. The simulation box in the Geocentric Solar Magnetospheric coordinates covers the region from 32 R E to −224 R E in the X direction and ±128 R E in the Y and Z directions. The inner boundary is a spherical surface at radial distance R = 2.5 R E . A steady state is found for each solar wind initial condition while the simulation grid resolution varies spatially, with the resolution increasing in areas that are highly dynamic like the magnetosphere. The simulation is then started in a time-accurate mode in which a static refined grid is used.
The RIM model solves the Poisson equation for the electrostatic potential at a two-dimensional ionospheric surface (Ridley et al., 2006). BATS-R-US feeds the RIM the field-aligned currents from the simulation inner

of 16
boundary, and the ionospheric conductances are derived using the field-aligned current intensity and location combined with background dayside and night-side conductances. The potential is set to zero at the lower latitude boundary at 10°. The RIM then solves the Vasyliunas (1970) equation for the electric potential, and gives the velocity boundary condition by using the electric field produced to derive velocity using the MHD equations. The ionosphere and magnetosphere models are coupled at a cadence of 5 s of simulated time.
The model output contains a regular 360 × 180 grid of magnetic disturbances between 0° and 360° longitude and ±90° latitude stored at 1-min cadence. The ground magnetic disturbances predicted by the simulation are computed by Biot-Savart integration of the currents external to the Earth, using both the BATS-R-US (for the magnetospheric currents) and RIM (for the ionospheric currents) models (Gombosi et al., 2021;Yu & Ridley, 2008). The contribution of the RCM is to describe the strong ring current that develops in the inner magnetosphere, and the RCM plasma pressure is coupled back to the BATS-R-US and thereby creates a significant effect on the ground magnetic variations. The resulting model values are rotated to obtain the B North , B East , and B Down components that are directly comparable with the ground magnetic stations. Figure 1 shows the simulated values (in blue) over the observed values as well as the magnitude of the difference between the observed and simulated values normalized by the mean of the observed values shown (red, scale to the right). While the simulated SYM-H follows the observed one quite well, it can be seen that the AL index is not as well reproduced by the simulation: the simulated AL follows the observations early in the storm when the AL is not very large, but the simulation does not catch the very strong currents associated with the substorm activity near the peak of the storm. This is typical of the simulations in the data set (T. I. Pulkkinen et al., 2022). The two bottom panels show representative local magnetic measurements from YKC and BOU stations. These two stations were chosen as they have the largest data sets of magnetic perturbations for the time periods covering the storms. In both latitude bands, the simulation tends to underestimate the magnitude of the disturbance.

Statistical Storm Simulation Data Set
The full data set (Al  comprises 122 storms during years 2010-2019. Figure 2 shows the superposed epoch of the observed, simulated and error (difference) of the SYM-H for all storm time periods used in this study. The zero epoch time was set to the storm onset time defined as the time when the Dst index starts to decrease (as opposed to e.g., using the storm sudden commencement as a storm onset indicator). The blue lines show the mean of the entire data set, the dotted line is the median and the upper and lower gray curves display the inter-quartile range of the data set.
Throughout this study, we define the error as (with y as the time-dependent quantity) It is clear that the superposed epoch curves follow each other very well, and the superposed epoch error is quite small, while showing a trend toward negative errors (model values recovering faster than the observed ones).
Although not shown here, this negative trend is largely caused by large storms with maxima below −100 nT. It is also notable that the scatter of the errors is larger during the storm main phase and reduces toward the storm recovery. This is indicative of the highly varying SYM-H values that can lead to large errors for example, by a small time shift between the model and the simulation. The storms included in the study are shown in Appendix A.

Error Statistics
The one-minute-data from the simulation were compared with the 1 min perturbation observations that were available from the SuperMAG database. Figure 3 shows a 2-dimensional histogram for the entire data set of ΔB H similar in format to those shown in Camporeale et al. (2020) for dB/dt. The intensity scaling is selected such that each vertical column (bin of observed ΔB H ) is normalized by the maximum value within that column for better visualization. The left panels show two individual magnetometer stations in Yellowknife, Canada (YKC) and Boulder, Colorado (BOU) as representatives for high-latitude (50°-90° magnetic latitude) and mid-latitude (−50° to 50° magnetic latitude) regions. We note again that this latitude selection is different from previous work such as A. Pulkkinen et al. (2013), as our study covers all available stations in the SuperMAG database (Gjerloev, 2012) at all latitudes. This analysis confirms previous findings that indicate that SWMF typically underestimates the observed magnetic disturbances, especially at high latitudes.  The error magnitude differences between a mid-latitude and auroral station reflects the different drivers and magnitudes of the disturbances: Under most conditions, the mid-latitude stations record the variations in the ring current, with typical signal intensity of few tens of nT, or during storm times up to −100 nT and even below.
Other currents that contribute are the magnetopause current and field-aligned currents . On the other hand, auroral stations record the strong ionospheric currents associated with substorm activity. The disturbances are of the order of several hundred nT, at times reaching −1,000 nT and even more.
The right panels of Figure 4 show the error statistics for all stations in the latitude band above 50° magnetic latitude in both northern and southern hemispheres (top panel) and in the latitude band −50° < Mlat < 50°, representative of auroral/polar cap stations and mid/low latitude stations, respectively. While the distributions are slightly wider than those for individual stations, they share the same features of relatively symmetric distributions with long tails in the positive error direction.

Error Statistics at Individual Stations
Next we examine the errors at individual stations, to resolve the spatial distribution of the errors over the globe.
To that end, we examine the horizontal magnetic field component B H which always gives a positive signal. In this case, negative errors (model-observed value) mean that the model underestimates the disturbance intensity, while positive errors mean model predicting larger than observed signal. The horizontal component including the Eastward component also records currents that are not strictly in the east-west direction (high latitudes) or along the magnetic equator (mid-latitude stations).   latitude activity associated with substorms or other localized magnetotail processes driving currents coupling to the ionosphere (T. I. Pulkkinen et al., 2022). Interestingly, the median errors in the north component don't show this tendency to the positive in the high-latitude region, but show a larger inter-quartile range indicating the model's difficulty capturing the direction of the perturbation.
We note that the opposite signs of the medians (negative for the horizontal component and positive for the north component) arise from the fact that the horizontal component is a positive quantity while the north component is a negative quantity, and hence the errors have different signs for the model underpredicting both components. The bottom panels depicting the lower-latitude observations show that the median errors are typically in the direction of slight underprediction, and the inter-quartile ranges are similarly skewed toward negative. The error distributions are relatively constant over all latitudes. The similarity of the distributions at all stations is on one hand indicative of the consistent capability of the simulation and on the other hand speaks to the high quality of the data with only a few stations with spurious results. A few stations near the equator produce underpredictions for the north component. It is possible that this is due to the fact that the simulation does not model the equatorial electrojets that may contribute to the measured disturbance and hence to larger errors near the magnetic equator (Forbes, 1981;Onwumechili, 2019;Rastogi, 1989).

Heidke Skill Score Analysis
To quantify the ability to forecast measurements at the individual stations, we assign Heidke skill score values (HSS) (Heidke, 1926) to each station. The HSS is defined in the typical skill score format with skill given by the ratio of the difference between model's score and a random chance score by the difference of the perfect score and random chance score. The HSS can be obtained through a 2 × 2 contingency table where the simulation and observations are compared to a threshold value (Table 1), and obtained using the definition (Jolliffe & Stephenson, 2012) which shows that the HSS maximum value for no misses and no false positives is 1, value of zero indicates no skill, and negative values indicate skill worse than chance.
A key for the usability of the HSS to assess the prediction quality for operational customers lies in a correct selection of the threshold value separating "Hits" (True positives) and "True negatives" (Quiet time). As the typical signal intensity varies with latitude, to gain meaningful results, the thresholds for "True positives" should likely be different for stations at different latitudes. On the other hand, for practical purposes, the threshold should reflect the level of disturbance requiring action from a space weather user.
For the lower latitude stations, the storm limit −50 nT can be argued to be a suitable event threshold. In this database consisting of simulations of storm periods, reaching below the 50 nT threshold should occur for a substantial portion of the time, and such disturbances are likely to drive currents that are of concern for example, to the power system operators. Figure 7 shows a geographic map of the HSS values for each of the stations available through the SuperMAG network using the 50 nT event threshold. The skill scores vary from very low (especially in the polar regions) to above 0.6, with best HSS values at the low and mid-latitudes.
Interestingly, there is a band of lower skill scores in the range 0.3-0.4 in the latitude band around 50°-60° latitude. This may be due to the fact that during storms, this latitude band is often underneath the ionospheric currents, which create very strong disturbances. If the model auroral oval is not able Figure 6. From top to bottom, horizontal component errors in high-latitudes (50°-90° magnetic latitude) and mid-latitude (−50° to 50° magnetic latitude) of individual magnetometer stations. The bars show the inter-quartile ranges of the errors in the stations found on those latitudes. The top is the third-quartile, the cross is the second-quartile (median), and the bottom is the first-quartile. In the top panels, the gray-shaded regions show the typical auroral zones.

Station above Station below
Simulation above H (hit) F (false positive) Simulation below M (miss) N (true negative) Note. "Above" and "Below" indicate field values above and below the threshold. to accurately track the real one, motion of the equatorward edge can cause the station to be on one side of the boundary in the model and on the other side in the simulation.
Note that the skill scores with the 50 nT threshold are quite good in the auroral oval region around 60°-70° geographic latitude. This indicates a high number of hits, highlighting the fact that the model-even if not capturing the intensity of the perturbations-is often able to capture the event occurrence.
The bottom panel of Figure 7 shows the HSS results with a threshold of 200 nT more corresponding to the magnitude of the auroral electrojet currents during geomagnetic activity (Akasofu et al., 1980;Klumpar, 1979;Waters et al., 2001). Obviously, the skill scores at lower latitudes are low, as the disturbances rarely reach such  Figure 8 shows the HSS values for each individual station as function of magnetic latitude. The HSS 95% confidence interval for each station is calculated using bootstrap resampling (Efron, 1979). The confidence interval lengths are relatively small with an average of 0.018 for the threshold of 50 nT and 0.024 for the threshold of 200 nT. The plot clearly demonstrates the dip in HSS around the magnetic latitude of the equatorward edge of the auroral region (shaded gray) as discussed above.
A quantitative summary of the HSS values and their inter-quartile ranges is given in Table 2. The table demonstrates the conclusion discussed above that the performance of the model in the mid-latitudes, with a median HSS of 0.51, is better than that of high-latitudes, with a median HSS of 0.32. A (Wilcoxon, 1945) signed-rank test of the set of mid-latitude and high-latitude stations show that the difference of the medians is indeed statistically significant with a p-value, of 10 −30 . Specifically, the test for whether the mid-latitudes have a higher median gives a p-value of 10 −30 . Furthermore, the higher-latitude stations have a higher inter-quartile range of 0.24 as compared to the mid-latitude error range of 0.11.

Discussion
In this paper, we have addressed the capability of the SWMF Geospace model to predict magnetic disturbances at individual stations. In general, the results are encouraging with positive HSS skill scores in the inter-quartile range of 0.36-0.53.
The magnetic field disturbance intensity has a tendency to be underpredicted in the simulation ( Figure 6). The errors have a longer tail in the direction of underprediction (Figure 4). This indicates that for a sufficiently low threshold for event occurrence is required for the model not to produce an overly large number of misses.
Lower-latitude stations (−50…50° magnetic latitude) generally have a higher skill score than their higher latitude counterparts. This is expected, as the highly variable ionospheric currents arise from disruptions of the tail current associated with substorms and/or the field-aligned currents associated with Earthward propagating flow bursts, both of which are difficult to model to high accuracy in time and location. Furthermore, the model is optimized for computing the Dst and SYM-H indices .   (25)) and 75th percentile (p(75)) which show the inter-quartile range. Further developing the ionospheric response and the coupled magnetotail processes offers an opportunity to improve the accuracy of the higher-latitude responses. In this study, we coupled the BATS-R-US global magnetosphere with 0.125 R E maximum resolution to the RIM ionospheric module. Both increasing resolution of the MHD model (Welling et al., 2019) and improving the description of the conductances in the ionosphere module (Mukhopadhyay et al., 2020) can influence the model accuracy and performance as measured by skill scores.
Also, as shown by Ridley et al. (2010), MHD simulations of the magnetosphere come with inherent limitations. The RIM model is a potential solver that forces a zero potential at 10° latitude. The conductances come from an empirical model, driven by field-aligned currents from the magnetospheric module. The relationship between the conductances and the field-aligned currents is a key factor contributing to the intensity of the ground magnetic perturbations (Mukhopadhyay et al., 2020).
The magnetic perturbations in the mid-latitude and equatorial regions are due to magnetospheric currents, while the perturbations in the auroral regions are caused by currents flowing in the ionosphere, which create an order of magnitude larger ground magnetic perturbation than the magnetospheric (ring and tail) currents. As the auroral ionospheric currents have a sharp equatorward boundary, errors near that boundary may be large if the model is not able to predict the location of the auroral oval correctly.
Generally, the dayside currents are largely directly driven by the solar wind, while the nightside involves more processes arising from the magnetotail plasma sheet, which complicates the relationship between the solar wind and the magnetosphere-ionosphere coupling processes.
Using a similar SWMF configuration as in this study, and a 2-year interval of the operational model, Liemohn et al. (2018) obtained a HSS of 0.51 for the hourly Dst index and −50 nT event threshold. Figure 7 shows that, indeed, stations at the midlatitudes (typically used to derive Dst) have an HSS of 0.5 or higher. It is also important to note that the 2-year interval is dominated by quiet time, while our data set is strongly biased toward storm times. As this leads to more balanced number of hits and true negatives, the skill scores are higher-even for the individual stations.
The observed Dst is calculated as a weighted average of several stations at midlatitudes, attempting to capture the currents flowing near the equatorial plane in the magnetosphere. On the other hand, the Dst index is derived from the SWMF results by Biot-Savart integration of the magnetic disturbance at the center of the Earth Yu and Ridley (2008). Thus, there might be small differences between the simulation Dst as it is calculated now, and that computed from the simulated station disturbances at the actual Dst stations. However, the good correspondence between the simulated and observed Dst shows that this difference in methodology is not a major source of error.
Haiducek et al. (2017) studied statistics of individual storms using the global indices during the month of January 2005. As these are storms not included in this study, they provide an independent comparison. They found an error probability density for SYM-H similar to the errors for mid-latitude stations shown in the bottom right panel of Figure 4. Furthermore, they also assert that increasing the simulation resolution does not necessarily improve the accuracy of SYM-H predictions. Camporeale et al. (2020) examined statistics of individual stations, specifically regarding dB H /dt over a 2-year interval, focusing on three stations FRN, OTT, IQA at low, (sub-)auroral, and high latitudes. They showed that SWMF operational Geospace configuration predicts the changes in perturbation better at mid-latitudes than at high-latitudes, consistent with results in this study. Furthermore, they proposed a machine learning algorithm combined with SWMF, which was shown to increase skill scores significantly.

Summary and Conclusion
In this study, we performed a comprehensive study of over 100 simulations of geomagnetic storms and the resulting predictions of disturbances ΔB H measured at over 300 ground magnetometer stations covering a wide range of latitudes and longitudes. Using event threshold values of 50 nT below 50° latitude and 200 nT at high latitudes above 50°, the SWMF reproduced ground magnetometer signals to accuracy (HSS > 0.6) that is useful for the space weather operators. This leads to the possibility of using the model operationally to predict hazardous levels of geomagnetic activity in a more localized sense, giving regional predictions for auroral and subauroral regions.