Towards GIC forecasting: statistical downscaling of the geomagnetic field to improve geoelectric field forecasts

Abstract
Geomagnetically induced currents (GICs) are an impact of space weather that can occur during periods of enhanced geomagnetic activity. GICs can enter into electrical power grids through earthed conductors, potentially causing network collapse through voltage instability or damaging transformers. It would be beneficial for power grid operators to have a forecast of GICs that could inform decision making on mitigating action. Long lead-time GIC forecasting requires magnetospheric models as drivers of geoelectric field models. However, estimation of the geoelectric field is sensitive to high-frequency geomagnetic field variations, which operational global magneto-hydrodynamic models do not fully capture. Furthermore, an assessment of GIC forecast uncertainty would require a large ensemble of magnetospheric runs, which is computationally expensive. One solution that is widely used in climate science is "downscaling," wherein sub-grid variations are added to model outputs on a statistical basis. We present proof-of-concept results for a method that temporally downscales low-resolution magnetic field data on a 1-hr timescale to 1-min resolution, with the hope of improving subsequent geoelectric field magnitude estimates. An analog ensemble (AnEn) approach is used to select similar hourly averages in a historical data set, from which we separate the high-resolution perturbations to add to the hourly average values. We find that AnEn outperforms the benchmark linear-interpolation approach in its ability to accurately drive an impacts model, suggesting GIC forecasting would be improved. We evaluated the ability of AnEn to predict extreme events using the FSS, HSS, cost/loss analysis, and BSS, finding that AnEn outperforms the "do-nothing" approach.


Introduction
Intensification of magnetospheric and ionospheric current systems drives changes in the geomagnetic field measured on the ground, which induces an enhanced geoelectric field, as expressed by the Maxwell-Faraday equation. The induced geoelectric field drives currents within the Earth that can enter grounded conducting networks as geomagnetically induced currents (GICs) (Koskinen et al., 2017; Pulkkinen et al., 2017). GICs can flow into the power grid through earthing points at substations (Cannon et al., 2013; Oughton et al., 2017), particularly in regions with high ground resistance, where the geoelectric field is larger and the network provides a more favorable path for GICs to flow. The quasi-DC signal introduced into an AC grid system can lead to half-cycle saturation in transformers, causing degradation and, in extreme cases, destruction, failure, and system collapse. The geomagnetic field can be used as a proxy for potential ground effects, and GIC studies commonly use its time derivative, dB/dt, to quantify potential effects.
Nowcasting and advanced forecasting of geomagnetic disturbances is generally achieved through global magnetohydrodynamic (MHD) models (Welling, 2019), driven with near-Earth solar wind observations or, for increased lead time, the output of solar-wind simulations (Merkin et al., 2007). The ground-level magnetic field, which is typically extrapolated from much higher in the magnetospheric domain, is used to drive geoelectric field models. Empirical models also exist (Weimer, 2013, 2019).

HAINES ET AL., 10.1029/2021SW002903
The operational magnetospheric MHD models underestimate the magnitude of the perturbations across a wide frequency range, including the sub-hourly variations important for GICs (Welling, 2019). Pulkkinen et al. (2013) examined dB/dt on a 1-min timescale and found an underestimation of magnitude between a factor of 2 and 10. Without accurate representation of high-frequency variations of the geomagnetic field, peak amplitudes in the derived surface geoelectric field and GICs may be underestimated (Grawe et al., 2018).
However, a counter-example is Raeder et al. (2001), who used an MHD model to simulate the Bastille Day storm and compared their results to observations. Using a power spectral density (PSD) analysis, they found that the model worked well for frequencies of 0-3 mHz and actually overestimated the power at higher frequencies. These results are likely due to using a model configuration with a high grid resolution that would currently be prohibitive for operational forecasting, particularly if large ensembles of magnetospheric runs are required to estimate forecast uncertainty. Figure 1 shows an example of the SWMF power spectrum across a broad range of frequencies. The observed and modeled (using SWMF) horizontal magnetic field, the magnetic field component most relevant to GICs, is shown for the December 2006 CCMC test case (https://ccmc.gsfc.nasa.gov/challenges/dBdt/) at the Newport magnetometer site. The time series are shown in Figure 1a and the resulting power spectra in Figure 1b. The colored lines represent different model configurations. The power spectra show that each configuration of the model underestimates the power spectral density; however, the magnitude of underestimation is highly sensitive to model configuration, with 12a_SWMF, the current operational configuration, performing best. These models output at a 1-min resolution, but the time series is smoother than that observed, meaning the amplitude of the higher-frequency variations is reduced, as shown by the power spectra. These simulation results were provided by the Community Coordinated Modeling Center at Goddard Space Flight Center for the 2013 Space Weather Workshop, and an online interface is available for analysis of the model runs (https://ccmc.gsfc.nasa.gov/challenges/dBdt/).
A general underestimation is in agreement with Pulkkinen et al. (2013), who show in their Figures 3 and 4 that SWMF underestimated dB/dt. Although here we only show that SWMF exhibits this underestimation, we note that it is a general feature of operational models predicting geomagnetic perturbations (Pulkkinen et al., 2010, 2011, 2013). Recent work from Dimmock et al. (2021) tested different spatial-resolution configurations of SWMF for the September 2017 event. They found that the high resolution made a significant improvement to the PSD and GIC forecasts. However, they noted that SWMF performs poorly in substorms and that increasing the resolution has limited benefit in these periods. They concluded that a skillful GIC forecast can be made with SWMF, but that the required computational power makes this operationally difficult. In contrast, Haiducek et al. (2017) compared the performance of SWMF on an event in 2005 using the resolution of the operational model and a higher resolution. They used these configurations to estimate geomagnetic indices and cross-polar cap potential (CPCP). They found that results were not sensitive to resolution, with the exception of predicting AL, which may have been improved. The discrepancy is possibly because Haiducek et al. (2017) did not increase the resolution nearly as much as Dimmock et al. (2021). Mukhopadhyay et al. (2020) also used the configurations of Haiducek et al. (2017), finding that the high-resolution configuration performed generally better under the Heidke skill score for predicting dB/dt.
Several further studies have shown that nonstandard MHD model configurations can achieve excellent results for small-scale phenomena in a statistical sense. Welling et al. (2021) modeled the magnetospheric response to a hypothetical "perfect" coronal mass ejection and successfully resolved high-frequency phenomena. Realistic studies of ULF waves have been made by MHD models (Claudepierre et al., 2009; Hartinger et al., 2014) and small spatial and temporal features have been resolved by a new MHD model (Sorathia et al., 2020). These studies show that MHD models have the capability of properly capturing high-frequency ground perturbations relevant to GICs, but the model configurations required are currently computationally prohibitive for operational real-time forecasting.
A viable operational alternative to increasing MHD model grid resolution is through the use of a method that statistically relates variability across temporal scales, namely a statistical downscaling approach. In addition to improving the geoelectric field reconstruction from a single magnetospheric model run, downscaling also has the potential to allow uncertainty quantification without the need for a magnetospheric model ensemble.

This paper addresses the characterization of high-frequency variability in the magnetic field, B, through statistical downscaling. Downscaling has been used in terrestrial weather forecasting to effectively increase the temporal and spatial resolution of global climate models (GCMs) (Christensen & Christensen, 2003; Maraun et al., 2010). For rainfall, this is done because rainfall typically occurs on subgrid scales and so cannot be accurately captured with a GCM alone. Maraun et al. (2010) classify downscaling into three general categories: perfect prognosis approaches, model output statistics, and weather generators. Perfect prognosis approaches statistically determine relationships between low-resolution predictors and high-resolution predictands. This works if the predictors are realistic, such as from a perfect (low-resolution) forecast model, that is, a perfect prognosis. Model output statistics builds a similar statistical relationship, but with the aim of also correcting the bias of the forecast model; as such, model output statistics are model-specific. Finally, weather generators produce new high-resolution time series that have the same statistical properties as observations, rather than just a probability of a subgrid event. Weather generators can be either perfect prognosis or model output statistics based.
As discussed by Morley (2020), statistical downscaling is relevant to space physics, in particular, to solar wind parameters used as inputs to magnetospheric models. Owens et al. (2014) considered temporal downscaling of solar wind parameters for this purpose. This was done because the magnetospheric models are sensitive to variability at a higher time resolution than is represented in numerical solar wind forecasts. Owens et al. (2014) used a random noise generator that gave high temporal noise with approximately correct statistical properties and added this noise onto the baseline of the solar wind parameters. They found that even relatively simple solar wind downscaling significantly increased the value of the subsequent magnetospheric forecast.
In this work, we employ temporal downscaling to increase the variability of magnetic field time series on the ground. By developing a model-independent perfect prognosis scheme, we are assuming that future global MHD models will provide a perfect representation of the low resolution magnetic field variations and/or model biases can be corrected by other means. However, the approach will be applicable to global MHD models that return a skillful and unbiased representation of the low resolution magnetic field. As the high-frequency variations are sampled from an ensemble of observations, an ensemble of geoelectric field estimates can also be reproduced from a single magnetospheric model run.
In the future, we hope to apply our downscaling methodology directly to forecasts provided by global MHD models and potentially as a means for uncertainty estimation. However, it is important to develop and test the downscaling scheme in isolation, and not to convolve it with the performance of a specific magnetospheric model. Thus, we adopt the widely used perfect prognostic approach and produce a perfect low-resolution forecast time series by taking 1-hr boxcar means of B observed by ground-based magnetometers. This 1-hr series is then linearly interpolated to 1-min resolution. This represents the undownscaled time series. As will be shown in Section 4, this undownscaled series effectively removes all power in variations below 1 hr. Thus, it is not a direct proxy for high-resolution magnetospheric model output. However, we start from this 1-hr linearly interpolated undownscaled series for two reasons. First, we expect magnetospheric models to perform better than this but it can be thought of as a 'worst-case scenario' for low-resolution magnetospheric models such as might be used for real-time forecasting in large ensembles. Second, if the downscaling manages to successfully relate the variability at 1-hr resolution to that at 1-min resolution, it should be more than adequate for use with magnetospheric model output.
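The construction of the undownscaled series described above (non-overlapping 1-hr boxcar means of the 1-min observations, linearly re-interpolated back onto the 1-min grid) can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not from the study's code.

```python
import numpy as np

def undownscaled_series(b_1min, window=60):
    """Build the 'undownscaled' series: non-overlapping hourly (boxcar)
    means of a 1-min magnetic field series, linearly re-interpolated
    back onto the 1-min grid."""
    n_hours = len(b_1min) // window
    b = np.asarray(b_1min[: n_hours * window], dtype=float)
    hourly = b.reshape(n_hours, window).mean(axis=1)      # 1-hr boxcar means
    t_hourly = (np.arange(n_hours) + 0.5) * window        # hour centres, in minutes
    t_min = np.arange(n_hours * window)                   # the 1-min grid
    return np.interp(t_min, t_hourly, hourly)             # linear interpolation
```

As expected for such a series, the minute-to-minute variability is strongly suppressed relative to the observations, which is exactly the power deficit the downscaling scheme aims to restore.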
The downscaling scheme attempts to reintroduce high-frequency perturbations onto the linearly interpolated 1-hr time series to produce a more realistic (in a statistical sense) B time series at the 1-min resolution. By using observations as the undownscaled time series, rather than model output, we remove model error from the process of developing and testing our methodology. Additionally, this approach allows us to easily create a large database of low-resolution, undownscaled "forecasts" with which to test our model, without requiring decades of magnetospheric model output.

Data
The ground-based magnetometer measurements we use are provided by SuperMAG (Gjerloev, 2012) (http://supermag.jhuapl.edu), an international collaboration bringing together data from over 300 magnetometer stations. The SuperMAG ground-level magnetic field perturbation data has been homogenized in terms of coordinate system, processing technique, and file structure.
A ground-based magnetometer measures the magnetic field from all sources in its vicinity. For studies of magnetic perturbations due to ionospheric and magnetospheric current systems, the magnetic baseline needs to be subtracted from the measurements to remove effects from other magnetic sources, such as the Earth's intrinsic magnetic field. Gjerloev (2012) describes the SuperMAG data processing technique for removing the baseline, in which knowledge of the typical timescales of variations of different magnetic fields is used. These amount to a yearly trend, mainly due to the secular variation in the Earth's main field, and a diurnal trend due to the Sq current system, the quiet-day daily variation in ionospheric activity due to solar radiation. These are subtracted from the magnetometer measurements, leaving the prime source of variability as space-weather driven activity.
Magnetometer measurements can occasionally be erroneous. These errors usually take the form of a spike in activity for a single data point during an otherwise quiet period, and can sometimes get past the SuperMAG quality control and into the final data sets. The data used for this analysis is a SuperMAG data set that has been cleaned of occasions where an error exceeded the 99.97th percentile of the change in the magnetic field with time, as described in Rogers et al. (2020). The data may still have errors at lower levels of activity.
In this study, we primarily use data from the Eskdalemuir (ESK) station located in southern Scotland, with geographic coordinates of 55.314°N, 3.206°W.

Analog Ensemble
The Analog Ensemble (AnEn) approach was originally used for terrestrial weather forecasting (e.g., Delle Monache et al., 2013; van den Dool, 1989), but has been far surpassed by physics-based models for that application. However, AnEn has more recently been employed in space and magnetospheric physics, where the physical models are less accurate, largely owing to the limited availability of observations needed to completely characterize the necessary boundary conditions. In such a situation, empirical schemes can be valuable. Haines et al. (2021), Owens et al. (2017), Riley et al. (2017), and Barnard et al. (2011) have experimented with AnEn for forecasting the solar wind, geomagnetic activity, and changes in space climate. In each case AnEn outperformed the benchmarks considered.
The AnEn methodology exploits an extensive historical data set for forecasting purposes through analogy to past evolution of a given system. Specifically, an AnEn examines the present state of the predictors, looks in the historical data set for analogous periods, then takes the predictand from the most analogous period. By selecting multiple analogous periods, an ensemble of predictands can be created, enabling a probabilistic forecast of future evolution.
In this work, AnEn is used not for forecasting, but for temporal downscaling to relate variations on long and short timescales. To demonstrate that the downscaling framework works for ground-level B, we chose 1-hr and 1-min for the long and short timescales somewhat arbitrarily, as described in the previous section. They are intended as examples rather than fixed parameters. At the high frequency, 1-min makes sense as that is the typically available resolution of long-term ground-based B series and also the input resolution for many geoelectric field models. At the low frequency, the timescale of interest will depend on the specific model and the situation in which the model is being used. For example, where real-time forecasting is required and/or ensembles of magnetospheric models are being used, it may be necessary to reduce the model resolution. As noted, the low-resolution timescale of 1-hr is a tunable parameter. If the downscaling is able to successfully relate 1-hr and 1-min variations, it should perform even better at relating, for example, 20-min and 1-min variations. Due to the perfect prognosis approach, we can use the low-resolution time series as predictors. Specifically, the predictors used are the low-resolution values of the horizontal magnetic field at the start and the end of the considered hour. Analogous periods of these are found and used to predict a 1-min resolution time series.
The AnEn algorithm is outlined in Figure 2 and described in the following points, in which the subscript H stands for 1-hr and M for 1-min values:
1. Split the 1-min SuperMAG data into two sets (D1_M and D2_M). D1_M is the test data set containing the short period to be downscaled. D2_M is the independent training data set comprised of the remaining data.
2. Compute low-resolution data using 1-hr box-car means, to give D1_H and D2_H.
3. Using D1_H, take the values at the start (t_1) and end (t_2) of the hour being considered, as shown in Figure 2a.
4. Search D2_H for the N most similar consecutive values, by mean squared error, to those at t_1 and t_2, as in Figure 2b, where N is the chosen number of analogs.
5. Remove the baseline value from the associated D2_M, leaving only the higher-frequency structure of the analog interval, that is, minute-scale variations with the baseline removed, as in Figure 2c. The baseline is defined as the 60-min rolling mean.
6. Add each D2_M analog onto D1_H to produce an ensemble of downscaled values, as in Figure 2d.
7. Repeat this process for each hour in D1_H.
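A minimal sketch of steps 3-6 for a single hour is given below. It simplifies the hourly predictor to consecutive hourly means (the paper matches on start- and end-of-hour values of D1_H), and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def downscale_hour(t1_t2, train_hourly, train_min, n_analogs=10, window=60):
    """Analog-ensemble downscaling of one hour.

    t1_t2        : (2,) hourly values bounding the target hour
    train_hourly : (K,) hourly means of the training set (D2_H)
    train_min    : (K*window,) the 1-min training series (D2_M)
    Returns an (n_analogs, window) ensemble: high-frequency perturbations
    from the best analogs added onto the linear ramp between t1 and t2."""
    # Step 4: rank candidate hours by mean squared error to the predictors.
    pairs = np.column_stack([train_hourly[:-1], train_hourly[1:]])
    mse = ((pairs - np.asarray(t1_t2)) ** 2).mean(axis=1)
    best = np.argsort(mse)[:n_analogs]            # N most similar analogs

    # Step 5: remove the 60-min rolling-mean baseline from the 1-min series,
    # leaving only the minute-scale perturbations.
    kernel = np.ones(window) / window
    perturb = train_min - np.convolve(train_min, kernel, mode="same")

    # Step 6: add each analog's perturbations onto the undownscaled hour.
    ramp = np.linspace(t1_t2[0], t1_t2[1], window)
    return np.array([ramp + perturb[i * window:(i + 1) * window] for i in best])
```

In deployment, this is repeated for each hour of the test set (step 7), yielding an ensemble of plausible 1-min realizations per hour.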
The data is then repeatedly split into different test and training sets so that the whole 34-year period can be downscaled using an independent training set. Note that this procedure uses data from after the 'forecast' time, so it is not strictly a hindcast. However, this approach uses the volume of historical data that would be available to a forecast made today and thus quantifies the current expected performance of downscaling.

Reference Model
We use a reference model, as suggested by Liemohn et al. (2018), as a benchmark for the AnEn's performance. As this is a proof-of-concept study, we choose a reference model that represents a "do-nothing" approach to downscaling: we downscale the 1-hr time series of the magnetic field using linear interpolation, denoted the linear-interpolation approach. This yields a 1-min resolution time series without adding further high-resolution structure.
As stated in Section 2, this 1-hr linear-interpolation series is not representative of ground-level B produced by typical state-of-the-art magnetospheric models, as can be seen from the power spectra in Figures 1 and 4. Instead, 1-hr can be seen more as a worst-case scenario: most magnetospheric models would be expected to reasonably reproduce the B-field fluctuation power at around 0.00028 Hz (the 1-hr frequency), even in real-time ensembles.

MT-Transfer Function
The goal of this work is not to recreate the high resolution magnetic field on a point-by-point basis, but to add in realistic high-frequency variability in a statistical sense. In particular, we are interested in the higher frequency structure insofar as it improves the subsequent estimate of the induced geoelectric field, which is the driver of GICs.
This can be tested with an "impacts" model. For this purpose we use a magnetotelluric (MT) transfer function (Simpson & Bahr, 2020) produced for the ESK site by the British Geological Survey (BGS). The MT-transfer function converts a time series of the local magnetic field into a time series of the local geoelectric field. It first takes a Fourier transform of the magnetic field, then multiplies the result by an empirically determined matrix of coefficients, which accounts for the local ground conductivity, and finally takes an inverse Fourier transform to compute the geoelectric field in the time domain. The matrix of coefficients is derived from simultaneous observations of the magnetic and geoelectric fields at ESK.
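The Fourier-domain procedure just described can be sketched as follows. Note that a real MT-transfer function uses frequency-dependent coefficients empirically derived by BGS; the constant 2x2 matrix Z here is purely illustrative.

```python
import numpy as np

def mt_transfer(bx, by, Z):
    """Frequency-domain sketch of an MT-transfer function: FFT the two
    horizontal B components, multiply by a 2x2 matrix of transfer
    coefficients Z (here frequency-independent, for illustration only),
    and inverse-FFT to get Ex, Ey in the time domain."""
    Bx, By = np.fft.rfft(bx), np.fft.rfft(by)
    Ex = np.fft.irfft(Z[0, 0] * Bx + Z[0, 1] * By, n=len(bx))
    Ey = np.fft.irfft(Z[1, 0] * Bx + Z[1, 1] * By, n=len(bx))
    return Ex, Ey
```

Because the operation is linear and applied in the frequency domain, each B-field ensemble member can be transformed independently to produce the corresponding E-field ensemble member.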
To quantify the performance of the downscaling scheme, we focus on the magnitude of the estimated E-field. Each B-field ensemble member was individually transformed with the MT-transfer function to result in an associated E-field ensemble member. A "good" outcome would be that the |E| from the downscaled series is closer to the |E| obtained from using the observed series, than the linear-interpolation approach. An ideal outcome would be that the observed |E| output falls within the spread of the ensemble of |E| outputs obtained with the ensemble of downscaled series.

Evaluation
The AnEn downscaling approach has been applied to the entire 34-year period (1983-2014) of observations using an ensemble of 100 members built hour by hour as described above. Figure 3 shows an example spanning 6 hr of heightened activity, with the x-component (east-west) in Figure 3a and the y-component (north-south) in Figure 3b. This period was a geomagnetic storm with a minimum Dst of −172 nT.

Figure 3. A 6-hr time series from 1983-02-04 of the magnetic field at ESK in the x (east-west) and y (north-south) directions in the geographic coordinate system. The red line shows the observed 1-min time series, the color bands show the spread of the AnEn series (the 10th-90th and 25th-75th percentiles) with the median in black, and the blue line shows the linear-interpolation approach, taken to be the undownscaled magnetic field, as a reference.

For the interval shown in Figure 3, the 10th-90th percentile band captures some of the variability seen in the observations; however, it seriously underestimates the variability on several occasions. Notably, toward the middle of the period, when the event is at its peak, the ensemble spread captures less of the variability. This suggests that the AnEn will struggle with larger events such as this. By definition, we would expect the observation to sit within the 0th-100th percentile band 100% of the time, in the 10th-90th percentile band 80% of the time, and in the 25th-75th percentile band 50% of the time. In actuality, the percentages of observations in the 0th-100th, 10th-90th, and 25th-75th percentile bands for B_x are 83.4%, 40.3%, and 20.3%, respectively. For B_y these are 98.9%, 51.8%, and 21.3%, respectively, for this illustrative period.

Figure 4 shows the power spectra of the magnetic field from observations and AnEn. Shown are the median and percentile bands of the PSDs achieved by all the ensemble members, computed with Welch's method using the Hanning window without overlap. The AnEn ensemble follows the observations closely, with a general trend to slightly underestimate the power at lower frequencies (0-0.003 Hz) and slightly overestimate the power at higher frequencies (0.007 Hz and above). The 10%-90% range of the AnEn is very narrow, approximately 0.5 at most, reflecting a consistent performance across the whole ensemble. The linear-interpolation approach is shown in blue but has been cut off because, as expected, its power spectral density is very low and hence makes scaling the y-axis difficult. It is clear that AnEn provides a power spectrum much more similar to that of the observations than the linear-interpolation approach achieves.
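The PSD comparison above can be reproduced with SciPy's Welch implementation. The settings below (segment length, and a sampling rate of 1/60 Hz for 1-min data) are illustrative assumptions rather than the exact values used in the study.

```python
import numpy as np
from scipy.signal import welch

def bfield_psd(b, fs=1 / 60.0, nperseg=1024):
    """Welch power spectral density of a 1-min resolution series
    (fs = 1/60 Hz), Hann window, no overlap between segments."""
    f, pxx = welch(b, fs=fs, window="hann", nperseg=nperseg, noverlap=0)
    return f, pxx
```

Applying this to the observed series, each AnEn member, and the linear-interpolation series gives curves directly comparable to Figure 4; the Nyquist frequency for 1-min data is 1/120 Hz.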
To measure the effectiveness of adding higher-frequency structure, we use the B time series from the observations, AnEn, and the linear-interpolation approach to drive the MT-transfer function as described in Section 3.3. The output of the MT-transfer model is shown in Figure 5 for the same 6-hr period shown in Figure 3. We see that the AnEn captures some of the geoelectric field variability within its spread, but the observations lie outside the range of the analog ensemble on many occasions. The percentages of observations in the 0th-100th, 10th-90th, and 25th-75th percentile bands for E_x are 97.4%, 59.5%, and 31.3%, respectively. For E_y these are 97.4%, 51.8%, and 27.4%, respectively, for this illustrative period. Figure 5 also reveals that, as expected, the linear-interpolation series yields very low geoelectric fields, without any significant variation. With a large ensemble size, the AnEn median will tend toward a smooth line despite variations in individual ensemble members. Therefore, the usefulness of AnEn is not in its median but rather in the spread of its ensemble members, which shows possible realizations of the time series. Because of this, it is not useful to directly compare the AnEn median to the linear-interpolation values. However, we do see that the spread of the analog ensemble is of a more similar magnitude to that in the observations than the linear-interpolation time series. In addition, AnEn provides an idea of the uncertainty in a forecast, which is useful for making decisions.
While this example period is illustrative, it is necessary to evaluate AnEn as a downscaling model over the full 34-year period using a set of metrics. In the following evaluation we have taken care to choose metrics that are robust to timing errors, as we make the assumption that the spectral properties of fluctuations and the magnitude of the peaks are generally more important than the phasing for GIC impacts. This is also relevant since operations require a lead time of possible occurrence and an estimate of the severity of that occurrence, as system-wide mitigation cannot be implemented in real time. When comparing data on a point-by-point basis, timing errors, in which a defined event is correctly predicted to occur but at slightly the wrong time, will incur a double penalty by many common metrics (e.g., see Figure 8 of Owens [2018]). For example, accuracy, which gives a fraction of correct predictions across the whole data set, will count the forecast as wrong when it predicts an event that does not occur at the exact time step and wrong when the forecast does not predict an event that is observed, even if the time step is off by just one step.

Figure 4 (caption, continued). The yellow color band shows the 10%-90% range of the AnEn. The linear-interpolation approach is shown in blue, part of which has been cut from the plot due to large differences in scale.

Figure 5. A 6-hr time series from 1983-02-04 at ESK of the geoelectric field computed from the magnetic field using the MT-transfer function. The data is in the x (east-west) and y (north-south) directions in the geographic coordinate system. The red line shows the time series computed from the 1-min observed time series, the color bands show the spread of the geoelectric field computed from the analog ensemble with the median in black, and the blue line shows the geoelectric field computed from the linear-interpolation approach magnetic field.

The sensitive values of GIC magnitude and timescales depend on the setup of individual transformers and the power grid configuration. For example, the size of geoelectric field that will cause a significant GIC depends on the ground conductivity in the region around the transformer. We use the horizontal geoelectric field as a practical solution to provide a general evaluation of the method (Beamish et al., 2002); however, transformers are sensitive to the individual E_x and E_y components, depending on grid configuration (Orr et al., 2021).

Threshold-Exceedance Prediction
In this subsection, we evaluate each individual ensemble member within AnEn for its ability to give a binary prediction of an event at individual time steps. We examine three levels of activity for event classification using the magnitude of the total horizontal geoelectric field, denoted |E|, from the MT-transfer function. The magnitude of the total horizontal geoelectric field is shown for an illustrative period in Figure 6. The chosen thresholds for evaluation are the 99th, 99.9th, and 99.99th percentiles of the magnitude of the total horizontal geoelectric field from the MT-transfer function driven by the observed magnetic field time series over the period 1983 to 2016. These are 22.3, 58.8, and 171.9 mV/km, respectively, and are shown in Figure 6 by the horizontal dashed lines. For context, during the March 1989 storm that led to the Hydro-Quebec network collapse, the peak geoelectric field magnitude at ESK was 411.4 mV/km as computed using the MT-transfer function. It is worth noting that the system collapse experienced during this geomagnetic storm occurred before the peak, due to the rapid onset of a substorm (Boteler, 2019).
In order to allow for timing errors at the minute scale, we evaluate AnEn using the fraction skill score (FSS) (Owens, 2018; Roberts & Lean, 2008). The FSS is most commonly used to measure the fractional occurrence of events in a given spatial window. Here, we use FSS with a 60-min temporal window and count the fraction of predicted time points which are classified as events, and the fraction of observed time points which are events, within the same time window. This is repeated for each ensemble member for time windows covering the whole data set, and the mean squared error (MSE) between the observed and predicted fraction time series is computed. This is repeated for a reference forecast, in this case the linear-interpolation series, and the FSS is taken as 1 − (MSE_forecast/MSE_reference). A perfect forecast would achieve FSS = 1, a forecast with no skill compared to the reference would achieve FSS = 0, and a forecast performing worse than the reference will achieve a negative score. FSS is most useful to end users who need to know if an event will occur within a given time window without needing exact (in this case, to the minute) knowledge of when it will occur. Figure 7 shows the FSS achieved by each of the 100 ensemble members across the entire data set for each of the three event thresholds. Ensemble ID is ordered from best to worst analog considered, where best means the 1-hr values in the analogous period are most similar to present conditions by RMSE. For the 99th percentile threshold (panel a), we see that each ensemble member has a positive FSS, with an average value across the whole ensemble of 0.095, showing it outperforms the reference method. Considering events over the 99.9th percentile (Figure 7b) again shows all ensemble members having a positive FSS, with an average across the ensemble of 0.17. We also see a clear trend in which ensemble members based upon better analogs produce better FSS scores.
The increased visibility of the trend for the 99.9th percentile compared to the 99th percentile suggests that at higher thresholds we are inherently considering rarer events, which reduces the number of good analogs available.
For events over the 99.99th percentile (panel c) the FSS is mainly positive for the first 50 ensemble members and approximately zero for the second 50. The mean FSS for the whole ensemble is 0.067. There is a very stark decrease in the skill of the ensemble members as the ensemble ID increases suggesting that for such a high threshold there are only around 30 to 50 good analogs for AnEn to work with. This finding can help inform a decision on an appropriate ensemble size for deployment. It also suggests that it would be appropriate to weight ensemble members if they are to be combined in any way.
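As a concrete illustration of the FSS computation described above, the sketch below assumes 1-min boolean event series and non-overlapping 60-min windows; the function and variable names are ours, not from this work:

```python
import numpy as np

def windowed_fraction_mse(pred_events, obs_events, window=60):
    """MSE between the fractions of event minutes in each temporal window.

    pred_events, obs_events: 1-D boolean arrays at 1-min resolution.
    """
    n = (len(obs_events) // window) * window  # trim to whole windows
    p = np.asarray(pred_events)[:n].reshape(-1, window).mean(axis=1)
    o = np.asarray(obs_events)[:n].reshape(-1, window).mean(axis=1)
    return np.mean((p - o) ** 2)

def fraction_skill_score(pred, ref, obs, window=60):
    # FSS = 1 - MSE_forecast / MSE_reference
    return 1.0 - (windowed_fraction_mse(pred, obs, window)
                  / windowed_fraction_mse(ref, obs, window))
```

A member that reproduces the observed event fractions exactly scores 1, while one no better than the linear-interpolation reference scores 0.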

1-hr Mean Value Prediction
The impact of GICs on transformers can be dependent on time-integrated effects, meaning that problems occur when GICs exceed a certain threshold for a certain duration (Moodley & Gaunt, 2017). With this in mind, we now evaluate the model using events classified by thresholds on the 1-hr mean value of |E|. The hourly mean of the geoelectric field magnitude is shown for an illustrative period in Figure 8. We again consider thresholds at the 99th, 99.9th, and 99.99th percentiles of the 1-hr means of the horizontal geoelectric field magnitude from the observed time series. These values are 17.9, 47.0, and 139.0 mV/km, respectively, and are shown in Figure 8 by the horizontal dashed lines. For context, the peak hourly mean observed at ESK during the March 1989 storm was 77.1 mV/km, suggesting that although the peaks of this storm were large, they were short lived. These metrics are useful because impacts of a heightened geoelectric field are often caused by values sustained on approximately the tens-of-minutes to 1-hr time scale (Pulkkinen et al., 2017). The metrics in this section are useful to end users who need to know when periods of heightened activity will occur and to users who are impacted by time-integrated effects.

Deterministic Prediction
The first metric chosen is the Heidke skill score (HSS) (Jolliffe & Stephenson, 2003). HSS measures the accuracy of a model while accounting for the number of correct forecasts expected from random chance. If C is the number of correct predictions, n is the total number of predictions, and crf is the number of correct random forecasts (computed from the marginal totals of the contingency table), then HSS = (C − crf)/(n − crf).
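The HSS computation can be sketched from a 2x2 contingency table as follows (hypothetical function name; the table entries would be tallied per time step against the chosen threshold):

```python
def heidke_skill_score(hits, false_alarms, misses, correct_negatives):
    """Heidke skill score from a 2x2 contingency table.

    crf is the number of correct forecasts expected from a random
    forecast sharing the same marginal frequencies as the model.
    """
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    n = a + b + c + d
    crf = ((a + b) * (a + c) + (c + d) * (b + d)) / n
    return (a + d - crf) / (n - crf)
```

A perfect forecast scores 1 and a forecast no better than random chance scores 0.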
The HSS of AnEn is shown in Figure 9 for the three event thresholds considered. HSS has been computed for each ensemble member, shown by the yellow bars, and the HSS for the linear-interpolation approach is shown by the black dashed horizontal line. AnEn clearly outperforms the linear-interpolation approach and generally achieves a good positive score, with the exception of some of the ensemble members based on weaker analogs for the 99.99th percentile threshold. This again suggests that the available data set is too small to provide 100 good analogs of the more extreme events.

Probabilistic Prediction
Next, we evaluate AnEn in its ability to give a probabilistic prediction of an event by counting how many of the ensemble members predict an event and normalizing by the size of the ensemble. This is evaluated using the Cost/Loss analysis (Murphy, 1977;Owens et al., 2017;Richardson, 2000), which allows different end users of a forecast to assess its value for their particular use case. The idea is that taking mitigating action due to a forecast incurs a Cost, C, of fixed value, and experiencing an event without taking mitigating action incurs a Loss, L, of fixed value. The Cost/Loss analysis sums these Costs and Losses for acting on a particular forecast across a long time series and compares the sum to that of a perfect forecast and a climatological forecast method (which, at all times, predicts the probability of an event as the fraction of time in which that event is experienced across the whole data set). The result is the potential economic value (PEV), which is 1 for a perfect forecast, 0 for a forecast of equal ability to climatology, and negative for a forecast with worse ability than climatology. PEV is given as a function of the Cost/Loss ratio, C/L, which is between 0 and 1 for all end users that may find a forecast valuable. In the probabilistic Cost/Loss analysis that we employ here, mitigating action is taken if the probability given by AnEn exceeds the Cost/Loss ratio of the end user. For more details see Murphy (1977) and Richardson (2000). Figure 10 shows the PEV for the Cost/Loss domain (0, 1) for the probabilistic downscaling from the AnEn and the linear-interpolation approach. We see that for all three event thresholds AnEn outperforms the reference method. We also see that the PEV is highest at the lower end of the Cost/Loss domain, which means it will most benefit end users who better tolerate false alarms (false positives) rather than missed events (false negatives).
This is because at the lower end of the C/L domain the cost of taking mitigating action is very low compared to the loss incurred due to not taking action and an event happening. Therefore, these users would generally prefer to take mitigating action on a false alarm than not take action on a real event.
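The probabilistic Cost/Loss evaluation can be sketched as follows (a hypothetical implementation in the spirit of Richardson (2000); expenses are normalized by the loss L, and mitigating action is taken when the forecast probability exceeds C/L):

```python
import numpy as np

def potential_economic_value(prob, obs, cost_loss):
    """PEV of a probabilistic forecast at a single Cost/Loss ratio.

    prob: forecast event probabilities in [0, 1].
    obs: binary observed events.
    All expenses are expressed in units of the loss L.
    """
    prob = np.asarray(prob, float)
    obs = np.asarray(obs, bool)
    act = prob > cost_loss            # take mitigating action
    n = len(obs)
    h = np.sum(act & obs) / n         # hit fraction
    f = np.sum(act & ~obs) / n        # false-alarm fraction
    m = np.sum(~act & obs) / n        # miss fraction
    s = obs.mean()                    # climatological base rate
    e_forecast = (h + f) * cost_loss + m       # cost of acting + losses
    e_climate = min(cost_loss, s)              # always act vs. never act
    e_perfect = s * cost_loss                  # act only on real events
    return (e_climate - e_forecast) / (e_climate - e_perfect)
```

A perfect forecast gives PEV = 1; a forecast indistinguishable from climatology gives PEV = 0.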
Finally, we look at how AnEn performs under the Brier skill score (BSS) (Jolliffe & Stephenson, 2003). Like the Cost/Loss analysis, BSS can compare probabilistic forecasts with deterministic ones, allowing direct comparison of the probabilistic AnEn and the deterministic undownscaled series. BSS is useful to end users who wish to use the probabilistic information of AnEn. To compute BSS, the standard Brier score (BS) must first be computed. The BS is the normalized sum of the square error between the probabilistic forecast and the observations over the whole time series, where the observations take a binary value of 0 or 1 depending on whether an event occurs. Events are again taken to be hours exceeding the 99th, 99.9th, and 99.99th percentiles of observed |E|. BS is computed for both AnEn and the reference model, then combined into the skill score as BSS = 1 − BS_forecast/BS_reference. Similar to the Cost/Loss and FSS, a perfectly skillful forecast receives BSS = 1, a forecast with no skill relative to the reference receives BSS = 0, and a negative score signals the forecast method performs worse than the reference.
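A minimal sketch of the BS and BSS computation (hypothetical function names; obs is the binary event series and the reference is the linear-interpolation-driven prediction):

```python
import numpy as np

def brier_score(prob, obs):
    """Brier score: mean squared error between forecast probabilities
    and binary event outcomes (0 or 1)."""
    return np.mean((np.asarray(prob, float) - np.asarray(obs, float)) ** 2)

def brier_skill_score(prob_forecast, prob_reference, obs):
    # BSS = 1 - BS_forecast / BS_reference
    return 1.0 - brier_score(prob_forecast, obs) / brier_score(prob_reference, obs)
```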
BSS is shown for AnEn for the three event thresholds in Table 1. The 100-member AnEn has skill over the linear-interpolation approach for all considered thresholds but drops in skill for the 99.99th percentile events. This is likely the result of the limited span of the data set and hence the limited number of analogous extreme events. A reduced ensemble size or ensemble-member weighting would likely yield a better BSS, particularly for the 99.99th percentile events. This is shown in the third column of the table, which gives BSS for a 20-member ensemble. We see that the BSS for the 99.99th percentile events increases, bringing it more in line with the lower thresholds.
Figure 9. The Heidke skill score (HSS) for the three event thresholds applied to 1-hourly |E| data. Ensemble members are ordered from best to worst analogs considered. A perfect forecast has a score of 1, a forecast with no skill over random prediction has a score of 0, and a forecast with every prediction incorrect has a negative score.

Discussion and Conclusions
Statistical downscaling of magnetic field data for the purposes of GIC forecasting has been demonstrated in the form of a perfect prognostic approach. We employed the analog ensemble (AnEn) methodology, finding that, with its ensemble spread and higher-frequency contributions, it yields a more accurate E-field mapping than an E-field derived from undownscaled B-field data.
To obtain a "low-resolution" data set, ground-level magnetic field perturbation data were smoothed from high (1-min) to low (1-hr) resolution. High-frequency structure was then reintroduced into the low-resolution (1-hr) series using the AnEn approach. Both the low-resolution and the downscaled time series were then used in a magnetotelluric-transfer function to compute the corresponding horizontal geoelectric fields.
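The downscaling procedure can be sketched as follows. This is a simplified, hypothetical implementation: the analog window length, the RMSE distance metric, and the array layout are our assumptions, not the exact configuration used in this work.

```python
import numpy as np

def anen_downscale(hourly_forecast, hist_hourly, hist_minute, k=100, window=3):
    """Temporal downscaling with an analog ensemble.

    hourly_forecast: (H,) low-resolution series to downscale.
    hist_hourly: (M,) historical hourly means (training set).
    hist_minute: (M, 60) historical 1-min values for each training hour.
    Returns a (k, H, 60) ensemble at 1-min resolution.
    """
    H = len(hourly_forecast)
    half = window // 2
    ens = np.empty((k, H, 60))
    for t in range(H):
        # Hourly values around the target hour (clipped at the edges).
        lo, hi = max(t - half, 0), min(t + half + 1, H)
        target = hourly_forecast[lo:hi]
        w = len(target)
        # RMSE between the target window and every historical window.
        cand = np.lib.stride_tricks.sliding_window_view(hist_hourly, w)
        rmse = np.sqrt(np.mean((cand - target) ** 2, axis=1))
        best = np.argsort(rmse)[:k]      # k closest analogs
        centre = best + (t - lo)         # index of the analogous hour itself
        # Perturbations = 1-min values minus their own hourly mean,
        # added back onto the forecast hourly value.
        pert = hist_minute[centre] - hist_hourly[centre][:, None]
        ens[:, t, :] = hourly_forecast[t] + pert
    return ens
```

By construction each member's 1-min values average back to the forecast hourly value, so the downscaling adds spread and high-frequency structure without changing the hourly means.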
We presented the power spectra of the observations, the AnEn downscaling, and the linear interpolation, showing that AnEn closely resembles the spectral properties of the observations and far outperforms the linear-interpolation approach. Although AnEn has not been applied to the output of a global MHD model, it can be seen that it has the potential to improve the spectral properties of a forecast that underestimates spectral power at high frequencies.
The method was validated using a range of metrics to test different aspects of the downscaling scheme. Specifically, we used the fraction skill score (FSS), Heidke skill score (HSS), Cost/Loss analysis, and Brier skill score (BSS). FSS was used to evaluate AnEn on the occurrence rate of 1-min events within 1-hr windows. The events were defined using three thresholds, namely, the 99th, 99.9th, and 99.99th percentiles of the entire data set. AnEn had a positive FSS for all ensemble members for the 99th and 99.9th percentile thresholds, showing that AnEn outperformed the undownscaled approach. For the 99.99th percentile threshold, some of the weaker analogs achieved a negative FSS, suggesting that the ensemble size of 100 was too large for the current data set to allow good analogs of the most extreme events to be found. Nevertheless, the overall FSS was still positive.
Since impacts of GICs tend to require an elevated geoelectric field over a sustained period, we also evaluated AnEn for its ability to predict the hourly mean value of geoelectric field. This was achieved by defining events as the 1-hr mean value exceeding the thresholds of 99th, 99.9th, and 99.99th percentile of the hourly means of the entire data set. With this event definition, HSS revealed that AnEn outperformed the undownscaled series for all ensemble members in the three event thresholds, except for a small number in the 99.99th percentile events.
This work has evaluated AnEn with an ensemble size of 100. The ensemble size should be chosen large enough that a wide range of possible outcomes can be included, but small enough to ensure the analogs are of good quality and are in fact analogous. The fraction skill score and Heidke skill score revealed that better quality analogs downscaled more skillfully. The number of good quality analogs available depends both on the size of the historical data set and on the rarity of the event considered. This was particularly evident when considering events above the 99.99th percentile, suggesting 100 members is too many to ensure all analogs are of good quality. A more appropriate ensemble size for this threshold would be approximately 20, as shown by the BSS analysis. Future implementations of this method should use these results to inform an appropriate ensemble size for the size of event of interest.

In this work, the probabilistic prediction given by AnEn was made by simple ensemble-member voting. The impact of analog quality could be mitigated if, when converting an ensemble of predictions to a probabilistic prediction, the voting power of each member were made dependent on the quality of the analog, as measured by the inverse of the RMSE between the analog and the period under consideration, normalized across the ensemble. This would mean that the members expected to have the most insight into the situation have greater sway in the overall prediction.
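Such RMSE-weighted voting could be sketched as follows (a hypothetical scheme implementing the inverse-RMSE weighting suggested above):

```python
import numpy as np

def weighted_event_probability(member_events, analog_rmse):
    """Event probability from an ensemble, weighting each member's vote
    by the inverse RMSE of its analog, normalized to sum to 1.

    member_events: (k,) booleans, one event/no-event vote per member.
    analog_rmse: (k,) RMSE between each analog and present conditions.
    """
    votes = np.asarray(member_events, float)
    w = 1.0 / np.asarray(analog_rmse, float)
    w /= w.sum()
    return float(np.sum(w * votes))
```

With equal RMSEs this reduces to the simple voting used in this work; otherwise, better analogs carry proportionally more weight.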
We implemented a probabilistic Cost/Loss analysis revealing that AnEn has a higher potential economic value than the undownscaled approach and that the value of the forecast was greater for end users who can tolerate false alarms at the lower end of the Cost/Loss domain. Like the previous metrics, AnEn performed better in the 99th and 99.9th percentile events.
A shortcoming of AnEn is that there is expected to be a lack of good analogs for the most extreme events. To address this, AnEn could be improved by expanding the predictors used to include such things as geomagnetic indices and estimates of current systems. This could allow AnEn to be more aware of the drivers of geomagnetic activity and thus allow the use of fewer-but-better-quality analogs in a reduced-size ensemble. Although this is a shortcoming, it is important to remember that moderate space weather events are problematic as well as the rarer, more extreme events (e.g., Schrijver, 2015;Schrijver et al., 2014). A further way to increase ensemble-member quality would be to create the training data set, D2M, using a rolling mean rather than a box-car average, as this would create more potential analogous periods and hence increase analog quality overall.
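To illustrate the gain in candidate analogs from rolling-mean training data, a simple counting sketch (assuming 1-min input data and 60-min averaging windows, which are our assumptions for illustration):

```python
def candidate_analog_counts(n_minutes, window=60):
    """Candidate analog periods available from n_minutes of 1-min data:
    non-overlapping box-car means yield one candidate per window, while
    a rolling mean yields one candidate per 1-min shift."""
    boxcar = n_minutes // window
    rolling = n_minutes - window + 1
    return boxcar, rolling
```

For example, 10 hr of 1-min data yields 10 box-car candidates but 541 rolling-mean candidates, roughly a sixty-fold increase in periods to search for analogs.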
We used a perfect prognostic approach to downscaling, which assumes the low-time-resolution forecast given is a perfect forecast. This allowed us to use historical observations as if they were forecast model outputs. However, this approach is limited because the models are not perfect. It is expected that biases in the forecast model would not be corrected but carried through by the downscaling methodology. This paper has focused on the results for the Eskdalemuir station; however, an equivalent analysis has been conducted for the Lerwick and Hartland magnetometer stations in the UK. The AnEn downscaling methodology applied to these stations generally performs similarly to ESK, supporting the claim that this methodology could be applied more broadly. The achieved mean FSS, mean HSS, and BSS for events above the three thresholds are shown in Table 2 for Lerwick and Hartland. The results for ESK are also shown for reference. AnEn is shown to perform to a slightly better standard at Lerwick, particularly for the 99th percentile threshold, and slightly worse at Hartland, particularly for the higher thresholds.
In this work, AnEn has been used both to generate a downscaled time series and to estimate its uncertainty through the use of many ensemble members. It would be quite possible to remove the downscaling element and use the algorithm solely to provide probabilistic information for a forecast that already has the correct spectral properties. This work has given proof of concept that downscaling can be implemented to improve a forecast that lacks realistic high-frequency structure. From here, research should be conducted to create downscaling schemes that are optimized to perform better than AnEn when the downscaled data are used to drive an "impacts" model. The optimization could include finding different model configurations for specific space weather drivers. This would take knowledge of the solar wind driving the magnetosphere and restrict AnEn to choosing analogs from historical periods driven by the same solar wind context. Once downscaling methods have been further investigated, the front runners will need to be adapted to form a "bolt-on" piece for a global MHD model. We finally note that the methods developed here do not attempt to correct for any biases in the magnetospheric models. Thus, it remains to be seen whether the improvements demonstrated here translate directly to a forecasting situation, or whether further bias correction of magnetospheric models is also required.