Forecasting GOES 15 >2 MeV Electron Fluxes From Solar Wind Data and Geomagnetic Indices

The ﬂ ux of > 2 MeV electrons at geosynchronous orbit is used by space weather forecasters as a key indicator of enhanced risk of damage to spacecraft in low, medium, or geosynchronous Earth orbits. We present a methodology that uses the amount of time a single input data set (solar wind data or geomagnetic indices) exceeds a given threshold to produce deterministic and probabilistic forecasts of the >2 MeV ﬂ ux at GEO exceeding 1,000 or 10,000 cm − 2 s − 1 sr − 1 within up to 10 days. By comparing our forecasts with measured ﬂ uxes from GOES 15 between 2014 and 2016, we determine the optimum forecast thresholds for deterministic and probabilistic forecasts by maximizing the receiver ‐ operating characteristic (ROC) and Brier skill scores, respectively. The training data set gives peak ROC scores of 0.71 to 0.87 and peak Brier skill scores of − 0.03 to 0.32. Forecasts from AL give the highest skill scores for forecasts of up to 6 days. AL, solar wind pressure, or SYM ‐ H give the highest skill scores over 7 – 10 days. Hit rates range over 56 – 89% with false alarm rates of 11 – 53%. Applied to 2012, 2013, and 2017, our best forecasts have hit rates of 56 – 83% and false alarm rates of 10 – 20%. Further tuning of the forecasts may improve these. Our hit rates are comparable to those from operational ﬂ uence forecasts, that incorporate ﬂ uence measurements, but our false alarm rates are higher. This proof ‐ of ‐ concept shows that the geosynchronous electron ﬂ ux


Introduction
The risk to satellite operations from highly energetic charged particles in near-Earth space is a key design and operations driver for spacecraft in low, medium, or geosynchronous Earth orbits. The exact flux and energy levels at which a spacecraft becomes vulnerable to energetic charged particles is dependent on its design and the components used, but high fluxes of near-relativistic electrons have been shown to present an increased risk of anomalies or failures in electronic hardware (Baker et al., 1987;Iucci et al., 2005;Romanova et al., 2005;Wrenn & Sims, 1996;Wrenn & Smith, 1996;Wrenn et al., 2002). Typically, there is an increased occurrence of anomalies and phantom commands when the >2 MeV 1-or 2-day fluence exceeds 10 8 cm −2 sr −1 , equivalent to an average electron flux of ∼500-1,000 cm −2 sr −1 s −1 . Space weather monitoring and forecasting organizations, such as NOAA and the Met Office, now issue alerts when the measured electron flux in geosynchronous orbit exceeds 1,000 cm −2 sr −1 s −1 or the fluence forecast by models such as the Relativistic Electron Flux Model (REFM; based on Baker et al., 1990) exceeds 10 8 cm −2 sr −1 .
Modeling the variations in the radiation belts based on physical processes is non-trivial. The complex interplay of particle injections from processes further out in the magnetosphere and the energization, diffusion and scattering of particles from their interactions with electromagnetic waves poses a significant challenge. A full physics-based description of the radiation belts requires knowledge of the ion and electron distribution functions across all pitch angles and at all energies, the wave populations that they generate and the subsequent coupling back of those wave populations into variations in those distribution functions. In practice, electron fluxes at an outer boundary and global maps of average wave power under different solar wind or geomagnetic conditions (e.g., Meredith et al., 2012;Usanova et al., 2012) are used as the inputs to physics-based models (e.g., Horne et al., 2013;Subbotin & Shprits, 2009). However, there is significant overlap in the distributions of wave power categorized by geomagnetic activity which may lead to uncertainties in the outputs of models based upon them Watt et al., 2017). Nevertheless, these models show some success in reconstructing particle fluxes in the radiation belts, with correlation coefficients of 0.72 to 0.99 and skill scores of 0.6 to 0.8 (e.g., Glauert et al., 2018). Such models can be operationally useful if the inputs controlling the wave fields can be forecast. For example, the SPACECAST model (Horne et al., 2013) uses Kp-derived average wave power models, with Kp forecasts of up to 3 days now available from NOAA.
An alternative approach to operationally modeling the radiation belts is to produce empirical models of the electron fluence at geosynchronous orbit based on measured parameters, typically using solar wind data as input (e.g., Baker et al., 1990;Balikhin et al., 2016). These models use solar wind data from several days previous to provide forecasts of the radiation belt fluence over the next 1-3 days. However, atypical changes in the electron flux at geosynchronous orbit, such as rapid drop-outs or increases, are not well predicted from the upstream inputs alone. To correct for this, these models compare themselves to the observed fluence and make adjustments to account for significant offsets between forecasts and observations. As such, these models require both the upstream parameters and in situ measurements to provide accurate forecasts.
In the following, we show a proof-of-concept for forecasts that predict whether the >2 MeV flux at geosynchronous orbit is likely to be above the levels at which alerts are issued by space weather monitoring agencies, rather than predicting the actual flux. These forecasts are based on individual solar wind parameters and geomagnetic indices. Data from 2014-2016 inclusive are used as a training data set for the forecasts, and the forecast performance is checked using data from 2012, 2013, and 2017. We examine whether geomagnetic or solar wind activity observed over 1-10 days can provide a skillful forecast of the flux in the radiation belts exceeding 1,000 or 10,000 cm −2 sr −1 s −1 within the next 1-10 days using standard forecast verification techniques and discuss possible implications for space weather operators and our understanding of radiation belt dynamics.

Data
In this study, we examine the extent to which the >2 MeV flux observed by the GOES 15 spacecraft can be forecast using simple "threshold breaking" forecasts based on different upstream and geomagnetic parameters. The electron flux in geosynchronous orbit is measured by the Energetic Proton Electron and Alpha Detector Electron, part of the Space Environment Monitor, on the GOES 15 spacecraft. In this study we use the 1-min data from the E2 channel of the A sensor (westward facing) which captures the >2 MeV electron flux. We use data from 1 January 2014 and 31 December 2016 to construct our forecasts, encompassing the maximum of solar cycle 24 and its declining phase.
(AL) and the ring current (SYM-H). We also create forecasts using the solar wind velocity and ram pressure and the IMF B Z . The data used in this study were all obtained through the NASA CDAWeb data service.

Terminology
To aid the reader, we define a series of terms that will be used throughout to discuss our methodology, analysis, and results.
1. "Events" are when the GOES 15 >2 MeV electron flux exceeds 1,000 cm −2 s −1 sr −1 (corresponding to approximately the median flux level) or 10,000 cm −2 s −1 sr −1 (corresponding to approximately the 90th percentile of flux) for at least one data point (1 min) within the forecast window. Events for the flux exceeding these two different thresholds are examined separately; 2. "Forecast Window" is a period of between 1 and 10 days in length within which we are forecasting that an event may occur. We note that a 10-day forecast does not indicate that an event will occur in 10 days' time, but instead that an event will occur within the next 10 days. As such, longer forecasts naturally encompasses the shorter forecast windows as well. Forecast Windows directly follow an Input Window; 3. "Input Window" is a period of between 1 and 10 days over which the Input Variable is assessed to provide a forecast. Input Windows directly precede Forecast Windows; 4. "Input Variable" is the variable used to create the forecast, which in this study is one of either AL, SYM-H, IMF B z , solar wind pressure, or solar wind velocity; 5. "Forecast Threshold" is the value the Input Variable must exceed in order to count toward a forecast. AL, SYM-H, IMF B Z must be below the Forecast Threshold and solar wind pressure or velocity must be above the threshold; 6. "Break Duration" is the amount of time that the Input Variable exceeds the Forecast Threshold within the Input Window.

Creating Forecasts
Forecasts were created between 1 January 2014 and 31 December 2016 inclusive. For each day, we determined the break duration over the previous 1-10 days (the input windows) for input thresholds set to each percentile of the input variable (calculated over 1 January 2014 to 31 December 2016). Sets of deterministic forecasts were then derived from this by setting a positive forecast to be returned when the break threshold exceeded each decile of the forecast window length, while sets of probabilistic forecasts were derived by setting the break duration as a proportion of the input window to be the event probability (see below for details). For each day, we also determined whether the 1-min flux exceeded either 1,000 or 10,000 cm −2 s −1 sr −2 within the next 1-10 days (the forecast window).
The start times of each of our forecasts are separated by 1 day. However, because most of the Input Windows and Forecast Windows are greater than 1 day in length, the calculated forecasts are not independent of one another; that is, some of the input data or forecast windows overlap with each other. In order to analyze independent forecasts, we extracted and analyzed subseries of forecasts that are separated by either the Input Window or Forecast Window duration, whichever is greater. For example, for forecasts with a 3-day Input Window and 5-day Forecast Window, we extracted a subseries of forecasts separated by 5 days. In order to use all the available data, this extraction was repeated by varying the start day and then taking the mean of the calculated verification metrics.

Deterministic Forecasts
Deterministic forecasts provide a binary prediction that an event will or will not occur. Analysis of these forecasts is typically based on contingency or truth tables. These show the number of times an event was: (A) predicted and occurred (hit); (B) predicted and did not occur (false alarm); (C) not predicted and occurred (miss); and (D) not predicted and did not occur (correct rejection). A number of measures of the validity or skill of the forecast can be calculated from these tables (for a discussion, see Hogan & Mason, 2012). Commonly used measures include the hit rate (H=A/(A+C)), false alarm rate (F=B/(B+D)), and the difference between them, known as the Peirce score (also known as Hanssen and Kuipers discriminant or True Skill score, Peirce, 1884).
The ability of a variable to distinguish between two different events or states, or the separation of the distributions of said variable for two different events or states, can be assessed using Receiver Operating Characteristics (ROC, e.g., Swets, 1988). In this study, we examine how well the distributions of the break durations are separated for periods of high or low energetic electron flux. This is done by generating a series of contingency tables by varying the value of the break duration for a given forecast threshold above which a positive forecast is made. In this study, this value is set to the deciles of the input window length. The hit rates and false alarm rates from these tables are then calculated and plotted against one another and the area under the curve is calculated, in our case using a trapezoidal sum with the end points fixed at [0,0] and [1,1]. This area is known as the ROC score and is a nonparametric two-sample statistic (e.g., Zweig & Campbell, 1993) that indicates the degree of separation of the distributions of break durations for events and nonevents. If the distributions of break duration are entirely separated, that is, there is a value of the break duration below which only nonevents occur and above which only events occur, then the ROC score is 1. If the distributions of the break duration entirely overlap, that is, break duration shows no skill in separating events and nonevents, then the ROC score is 0.5. We compare the ROC scores for different forecast thresholds to determine the optimum forecast threshold for which break duration can separate events and nonevents.
For a given input variable, the maximum ROC score for an input window/forecast window pair indicates the best input threshold for discriminating between high or low electron flux events. By comparing these maximum ROC scores across variables, we can then determine which variable provides the most skillful deterministic forecasts from our methodology. Given that there is likely to be some overlap in the distributions of break durations preceding events and nonevents, the best break duration can only be determined by a cost-benefit analysis by the forecast user. One simple form of this analysis is to find the break duration that returns the highest Peirce score, although different users may sacrifice higher hit rates for lower false alarm rates or vice versa. We note that the Peirce score from a ROC curve of forecasts with no skill will be 0 while the ROC score is 0.5. As such, the ranges of skilful ROC scores and Peirce scores differ by a factor of 2.
Figures 1 and 2 show matrices of the maximum average ROC scores calculated for each input window (X axis) and forecast window (Y axis) pair for forecasts of the >2 MeV flux exceeding 1,000 cm −2 sr −1 s −1 and 10,000 cm −2 sr −1 s −1 , respectively. In each panel, reds indicate maximum average ROC scores greater than 0.75 and blues indicate scores below this. For each forecast window length, the input window length that gives the highest maximum average ROC score is highlighted. Table 1 shows the variable that gives the highest maximum average ROC score for each forecast window length and the corresponding input window length and forecast threshold. ROC scores cannot indicate which break duration is best, so we report the hit rate, false alarm rate and break duration that give the highest Peirce score for the specified forecast threshold and input window length. For the interested reader, we have similar tables for each input variable in the supporting information.
The variation of the maximum average ROC score with input window and forecast window differs for each input variable. For AL and SYM-H forecasts of both the flux exceeding 1,000 or 10,000 cm −2 sr −1 s −1 , the maximum average ROC score decreases with forecast window length, such that the highest ROC scores for any input window length are for forecasts windows of 1 day. Similarly, for a given forecast window, the ROC scores increase with input window length up to some maximum for an input window length of ∼5 days for AL and ∼3-4 days for SYM-H. The maximum average ROC scores from AL are greater than those for the other input variables for most input window and forecast window pairs. The exception to this is solar wind pressure, from which the maximum average ROC scores increase with input window length and forecast window length. The maximum average ROC scores from the solar wind pressure forecasts are greater than those from the AL forecasts when the sum of the forecast and input window lengths is greater than 16 for forecasts of flux exceeding 1,000 cm −2 sr −1 s −1 or 14 for forecasts of flux exceeding 10,000 cm −2 sr −1 s −1 .
The pattern for the solar wind input variables is distinctly different from that of the geomangetic variables and indeed from each other. As described above, the maximum average ROC scores from solar wind pressure forecasts increase with forecast and input window length. The maximum average ROC scores from IMF B Z forecasts are low for all input windows and forecast window pairs, not exceeding a ROC score of 0.69. The maximum average ROC scores from forecasts by solar wind velocity have a distinct chevron pattern for forecasts of flux exceeding 1,000 cm −2 sr −1 s −1 . The peak maximum average ROC score occurs for increasing input window lengths for forecast window lengths up to 5 days, then for decreasing input window lengths for forecast windows between 6 and 10 days. The maximum average ROC scores are relatively high for all input window and forecast window pairs, just below those for AL. This pattern is not repeated for the solar wind velocity forecasts of flux exceeding 10,000 cm −2 sr −1 s −1 , instead being qualitatively similar to the maximum average ROC scores from SYM-H forecasts. This is in line with the link between solar wind velocity and Dst (and by extension, SYM-H) during storms (e.g., Burton et al., 1975;Klimas et al., 1998). However, solar wind velocity and solar wind pressure show distinctly different patterns of maximum average ROC scores, with the ROC scores from solar wind velocity forecasts generally being higher than those from pressure forecasts. This implies that it is the solar wind coupling rather than the compression of the magnetosphere that is important for the enhancement of flux at geosynchronous orbit, likely due to the radiation belts being compressed to within geosynchronous orbit, and thus the flux at GEO dropping rather than increasing under high solar wind pressures.
For AL, SYM-H, and solar wind pressure, the input window lengths that give the highest maximum average ROC scores remain similar with increasing forecast window length. For AL, the highest maximum ROC scores come from input windows of 5 or 6 days for the lower flux forecasts and 3-6 days for the higher flux forecasts; for SYM-H, they come from input windows of 3-5 days for the lower flux forecasts and 2-4 days for the higher flux forecasts; for solar wind pressure, they predominantly come from input windows of 10 days for forecasts of both flux levels. For forecasts from the solar wind velocity and IMF B Z , the input windows that give the highest maximum ROC scores are more variable, although the input windows are generally shorter for the higher flux forecasts by solar wind velocity. As such, for forecasts of the electron flux exceeding 10,000 cm −2 sr −1 s −1 , less input data are required.  The highest ROC scores for most input window/forecast window pairs come from AL, showing that the electron fluxes at GEO are strongly influenced by geomagnetic activity such as substorms, which are thought to provide seed and source particle injections (e.g. Horne, 2007;Jaynes et al., 2015). The higher ROC scores for SYM-H predicting electron fluxes above 10,000 cm −2 sr −1 s −1 , and the similarity to the ROC scores from forecasts from the solar wind velocity indicates that these fluxes are most likely during storm-like times when the ring current is enhanced. Given that AL and SYM-H are both enhanced during storms, it is difficult to decouple the effects of storms and substorms; however, these results hint that enhancements in the geosynchronous electron flux can occur within ∼3 days following enhancements in SYM-H but within a longer period of ∼5 days following substorm activity. Table 1 shows that for both the lower and higher flux levels, the forecasts with the highest ROC scores and highest Peirce scores have hit rates between 56% and 89% and false alarm rates between 11% and 53%. The average hit rates for the forecasts with the highest ROC scores is 74% for both the higher and lower flux levels. The highest ROC scores and Peirce scores are achieved for forecast windows of 1 day. For both the lower and higher flux thresholds, the break threshold is a similar percentage of the input window for all forecast windows. Similarly, the input windows and input thresholds are similar for AL and solar wind pressure when they provide the highest ROC scores. The ROC scores and Peirce scores decrease with increasing forecast window. The forecast thresholds for electron flux being above 1,000 cm −2 sr −1 s −1 appear to be a relatively low activity level (AL >∼−100 nT), but these correspond to the 28th-39th percentiles of AL.
That is to say that if AL is in the lower tercile of activity for at least 30% of the previous 5-6 days, the flux at geosynchronous orbit is likely to be above 1,000 cm −2 sr −1 s −1 . The forecast thresholds for the electron flux being above 10,000 cm −2 sr −1 s −1 are lower, around the 17th percentile. Given that the input window lengths are similar for the corresponding forecast windows, the break durations for the higher flux forecasts are two thirds of those for forecasts of the lower flux level, indicating that a higher level of activity but for a shorter period of time is needed to predict the electron flux exceeding 10,000 cm −2 sr −1 s −1 . For both flux levels, solar wind pressure outperforms AL for forecast windows of 8-10 days with hit rates comparable to AL for short forecast windows; however, the false alarm rates are higher at ∼50%.
An alternative to using geomagnetic or solar wind conditions is to base forecasts on persistence: predicting that an event will occur within a set forecast window if it also occurred within a set input window. By varying the input window length for each forecast window, we determine different hit rates and false alarm rates based on this principle and thus create a persistence ROC curve for each forecast window. Unlike the above, rather than generating a series of ROC curves for each forecast window/input window pair, we now can only create a single ROC curve for each forecast window. In order to give more complete curves, the input window range was extended to be up to 54 days. As above, independent forecasts are extracted from a complete set of forecasts and the average metrics are calculated. Figure 3 shows the ROC curves for persistence forecasts with forecast windows of 1-10 days for ( Figure 3a) the >2 MeV flux exceeding 1,000 cm −2 sr −1 s −1 and ( Figure 3b) the >2 MeV flux exceeding 10,000 cm −2 sr −1 s −1 . Table 2 shows the maximum average ROC score for each forecast window along with the input window length that gives the highest Peirce score and the corresponding hit rates and false alarm rates.
The ROC scores from the persistence forecasts vary between 0.835 and 0.886 ( Figure 3a) and 0.704 and 0.882 ( Figure 3b), with the highest scores for forecast windows of 1 day or 22 days. The Peirce scores are also high, ranging from 49% to 74%, with hit rates of up to 91% and false alarm rates from 5% to 40%. The ROC scores from the persistence forecasts of flux exceeding 1,000 cm −2 sr −1 s −1 are higher than the corresponding deterministic forecasts by approximately 3 percentage points. In contrast, the ROC scores for persistence forecasts of the flux exceeding 10,000 cm −2 sr −1 s −1 tend to be lower than the corresponding deterministic forecast for forecast windows of up to 5 days. However, the Peirce scores of all the persistence forecasts are higher than the corresponding deterministic forecasts. The high ROC scores and hit rates from the majority of persistence forecasts is unsurprising when one considers variability of the radiation belts and the levels of flux being considered. The e-folding time for the radiation belt is of the order of 3-4 days (Borovsky & Steinberg, 2006) and the flux thresholds tested are close to the median and 90th percentiles. If the flux exceeds these thresholds by ∼>30%, it will generally remain above that threshold for at least 1 day.
In summary, simple forecasts of whether the >2 MeV electron flux at geosynchronous orbit can be constructed by determining the amount of time various solar wind or geomagnetic activity parameters exceed a specified threshold. While persistence-based forecasts of the electron flux exceeding 1,000 cm −2 sr −1 s −1 have ROC scores that are consistently higher than those from our corresponding forecasts, for forecast windows of 2-5 days our forecasts of the flux exceeding 10,000 cm −2 sr −1 s −1 have higher ROC scores than the corresponding persistence-based forecasts, indicating that our deterministic forecasts can be more skilful than persistence. We note, however, that the persistence forecasts tend to have higher maximum Peirce scores. Overall, our deterministic forecasts with the highest ROC and Peirce scores have average hit rates of ∼74% and false alarm rates of ∼27%.

Probabilistic Forecasts
Probabilistic forecasts provide an indication of the likelihood that a binary event will or will not happen and naturally account for some of the uncertainty in forecasting. For a perfect probabilistic forecast, an event would occur 10% of the time that a probability of 10% was forecast. In a similar manner to the deterministic forecasts, we created probabilistic forecasts by examining the proportion of an input window that the input variable exceeds a forecast threshold. We arbitrarily set the probability of an event occurring to the break duration as a proportion of the input window. This assumes that a linear relationship between the amount of time a forecast variable is beyond some limit and the likelihood of an event occurring and does not account for the actual magnitude of the input variable or any more complex relationships.
The nature of probabilistic forecasts means that an individual forecast is neither right nor wrong. For example, if a forecast gives a 90% probability that the fluxes will be high on the next day, but the fluxes are low, then the forecast was not necessarily wrong since 10% of the time a forecast of 90% is given, the fluxes should be low. As such, one cannot determine the components of a contingency table for a probabilistic forecast and a ROC score cannot be generated. Instead, probabilistic forecasts can be examined using Brier scores (Brier, 1950) or Brier skill scores. Brier scores are the mean squared error of the forecasts, calculated as where N is the total number of forecasts, f is the forecast probability (in the range 0 to 1). e is set to 1 if the event occurred or 0 if it did not. A perfect forecast will have a Brier score of 0. The Brier skill score compares the calculated Brier score with that from a reference forecast, typically a climatological forecast. The Brier skill score is then calculated as where B ref is the Brier score of the reference forecast. The Brier skill score varies between -∞ and 1, with a perfect forecast returning a Brier skill score of 1 and a forecast which shows no improvement over the reference forecast returning a score of 0. In this study, we used the climatology (average event occurrence rate for each forecast window length) to create reference forecasts and associated Brier skill scores. Figure 4 shows matrices of the maximum average Brier skill scores calculated for each input window (X axis) and forecast window (Y axis) pair for forecasts of the >2 MeV flux exceeding 1,000 cm −2 sr −1 s −1 , with darker colors indicating higher scores. Figure 5 shows similar results for forecasts of the >2 MeV flux exceeding 10,000 cm −2 sr −1 s −1 .
The overall trends in the Brier skill scores show some similarities to those in the ROC scores from the deterministic forecasts: forecasts from IMF B Z have low Brier skill scores for all forecast window and input window pairs; the Brier skill scores from forecasts using AL and SYM-H decrease with increasing forecast window; the maximum Brier skill score occurs at similar input windows for each forecast window for forecasts of the flux exceeding 10,000 cm −2 sr −1 s −1 by AL and SYM-H; the Brier skill scores from forecasts of the flux exceeding 10,000 cm −2 sr −1 s −1 using solar wind pressure increase with increasing length of forecast window and input window. For the higher flux forecasts, the Brier skill score drops off more rapidly with input window length compared to the lower flux forecasts, but the Brier skill scores from the lower flux forecasts drop off more rapidly with forecast window length than the higher flux forecasts. Unlike the ROC scores of the deterministic forecasts, the Brier skill scores from solar wind velocity forecasts are qualitatively similar to those of SYM-H for forecasts of the flux exceeding both 1,000 and 10,000 cm −2 sr −1 s −1 . Furthermore, the highest Brier skill scores for solar wind pressure forecasts of the flux exceeding 1,000 cm −2 sr −1 s −1 are achieved for long input windows but short (<5 day) forecast windows.
For forecasts by AL and solar wind pressure, and for forecasts of the flux exceeding 10,000 cm −2 sr −1 s −1 by SYM-H and solar wind velocity, the input windows that give the highest maximum average Brier skill scores are similar. For AL, these are typically 3-4 days; for SYM-H they are 4-5 days; for solar wind velocity they are ∼6 days, although the highest Brier skill scores are for input windows of 3-4 days; and for solar wind pressure they are 9-10 days. This is in contrast to the deterministic forecasts, in which SYM-H gives the highest ROC scores in the shortest input windows and indicates that auroral zone geomagnetic activity over a relatively short interval provides a good indicator of the likelihood of high fluxes at GEO.   Table 3 shows the variables and inputs that provide the best Brier skill scores for forecasting the probability of the >2 MeV flux exceeding 1,000 cm −2 sr −1 s −1 (top rows) or 10,000 cm −2 sr −1 s −1 (bottom rows) and for each forecast window. A set of tables which shows similar results for each input variable is provided in the supporting information. For both sets of forecasts, the Brier skill score decreases with increasing forecast window, indicating that our forecasts are tending toward forecasts based on the climatology. This is a natural consequence of our analysis. For our forecasts of the electron flux exceeding 1,000 cm −2 sr −1 s −1 , the Brier

10.1029/2019SW002416
Space Weather score remains relatively constant with increasing forecast window length, indicating that the performance of our best forecasts does not degrade with increasing window length, but rather that the occurrence of flux exceeding this near-median level tends toward the climatology. While SYM-H provides the highest Brier skill scores for forecasts of the flux exceeding 1,000 cm −2 sr −1 s −1 within 7 or more days, these forecasts only perform as well as the climatology. In contrast, the Brier scores for forecasts of the flux exceeding 10,000 cm −2 sr −1 s −1 steadily increase with forecast window length, indicating that our best forecasts perform less well with increasing forecast window, but still perform better than the climatology for all forecast windows.
Table 3 also shows that the forecast threshold increases with increasing forecast window length. Considering the forecast threshold as a percentile of the input variable, this varies between 47% and 81% for the lower flux forecasts and between 16% and 43% for the higher flux forecasts. However, the input window length does not show a similar increasing trend, instead staying relatively constant for a given input variable. The implication of this is that greater activity is required to forecast over shorter periods than longer periods (noting that smaller AL and SYM-H values indicate higher activity levels).
In summary, simple probabilistic forecasts of the >2 MeV flux at geosynchronous orbit exceeding 1,000 or 10,000 cm −2 sr −1 s −1 based on the duration that an input variable exceeds a set threshold can provide skillful forecasts that perform better than the climatology. The mean square error of these forecasts is between 0.1 and 0.22. AL provides the best input variable, with extended periods of moderate activity forecasting the >2 MeV flux exceeding 1,000 cm −2 sr −1 s −1 and shorter periods of greater activity forecasting the >2 MeV flux exceeding 10,000 cm −2 sr −1 s −1 .

Forecast Verification
In the above, we have used data from 2014 to 2016 inclusive to produce simple forecasts of the >2 MeV flux at geosynchronous orbit exceeding either 1,000 or 10,000 cm −2 sr −1 s −1 . We now examine the performance of these forecasts by applying them to data from 2012, 2013, and 2017 which was not used in the forecast training data sets.

Verifying Probabilistic Forecasts
Reliability plots or attributes diagrams (Hsu & Murphy, 1986) show the occurrence rate of events against the grouped forecast probabilities. The accompanying histograms show the number of forecasts within each probability bin. The shaded region, delimited by the climatology (ō, horizontal and vertical lines) and a diagonal line given by Y ¼ðX þ ōÞ=2 indicates those results that contribute positively to the Brier skill score, weighted by their relative occurrence. The results from a perfect forecast will lie along the one-to-one line. To generate the reliability curves in Figure 6, we created forecasts of the flux exceeding 1,000 or 10,000 cm −2 sr −1 s −1 between 2012 and 2017 inclusive using the forecast parameters with the highest Brier skill scores in Table 3. The forecasts between 2014 and 2016 are our training set. As above, we extracted subseries of independent forecasts and calculated the mean occurrence from these subseries in each of the probability bins. Similarly, the histograms show the mean number of events in each bin. In each panel, the climatology used to define the shaded area is the average occurrence of events within the time period of the plot. The forecast Brier scores are given in the panels, along with the Brier skill scores calculated using the climatology within that period as a reference (BSS) or using the climatology of the training data as a reference (BSS 2014(BSS -2016 ).
The reliability curves from our training data set largely fall within the gray-shaded region, consistent with their low Brier score and positive Brier skill scores. For both flux levels, the forecasts tend to over-predict occurrence below the climatology and underpredict occurrence above the climatology, but the trend runs close to the one-to-one line. For the higher flux level forecasts ( Figure 6, lower panels), there are fewer events in the probability bins in which the occurrence rate differs from the forecast probability the most. For example, in 2017 the occurrence rate in the 0.5-0.6 and 0.6-0.7 forecast probability bins was 1.0, deviating from the forecast probability by 30-50%. However, on average there were fewer than five events in the 0.5-0.6 and 0.6-0.7 probability bins compared to 23 in the 0.1-0.2 probability bin, for which the occurrence rate deviated from the probability by only 10%. Given that the Brier score is the mean of the difference of the forecast probability and occurrence (Equation 1), the Brier score is naturally weighted towards the most common probabilities. As such, the deviation of the occurrence rates from the higher probabilities does not strongly influence the Brier score.
The reliability curves from the application of our forecasts to 2012-2013 and 2017 show that the forecasts perform less well in these years, tending to under-predict the occurrence of high flux events. The Brier skill scores calculated using the concurrent climatologies are much lower than the Brier skill scores for the training data set. In contrast, the Brier skill scores calculated using the climatology of the training data set remain positive, indicating that the forecasts outperform forecasts based on the climatology of the training data. This implies that the underlying year-to-year variability in the climatologies of the fluxes exceeding 1,000 or 10,000 cm −2 sr −1 s −1 is not well captured by a single set of forecast parameters. However, the fact that all the reliability curves increase with increasing probability and that the training data provides good Brier skill scores would suggest that this forecast methodology can be tuned to give skilful forecasts of high-flux intervals. Figure 7 shows the daily maximum fluxes measured by GOES-15 color-coded by the 1 day deterministic forecast of the flux exceeding 1,000 or 10,000 cm −2 sr −1 s −1 using the parameters set out in Table 1. Blues indicate a correct forecast, oranges indicate a miss, greens indicate a false alarm and whites indicate a correct nonevent forecast. The top panels show the results for 2012, the middle panel shows the results for 2013, and the bottom panels show the results for 2017. The mean hits, false alarms, misses, and correct rejections for the forecasts are shown by each panel. As for the training data set, we calculate the forecast metrics for each year by extracting multiple subseries of independent forecasts and report the means of metrics of these subseries. Given that the optimum forecasts use a 6-day input window, each subseries contains 60 forecasts. Overall, the forecasts perform relatively well, with a majority of blue or white coloured days. For forecasts of the flux exceeding 1,000 cm −2 sr −1 s −1 , there are a greater number of misses than false alarms and these show a slight tendency to occur following the correct forecast of high fluxes. Given that our forecasts do not predict how high the fluxes get and do not include any decay times, these misses may arise due to the persistence of fluxes above the median level. The hit rates for the forecasts of the flux exceeding 1,000 cm −2 sr −1 s −1 in 2012, 2013, and 2017 were 0.72, 0.63, and 0.72, respectively, somewhat lower than the training data (see Table 1). Similarly, the false alarm rates were 0.26, 0.14, and 0.25, respectively, comparable to the training data. The forecasts of the flux exceeding 10,000 cm −2 sr −1 s −1 had hit rates of 0.64, 0.57, and 0.65 in 2012, 2013, and 2017, respectively, again somewhat lower than from the training data set, and false alarm rates of 0.11, 0.08, and 0.09, slightly lower than the training data. The mean level of flux during the false alarms for forecasts of the flux exceeding 10,000 cm −2 sr −1 s −1 was 3,693, 3,759 and 4,419 cm −2 sr −1 s −1 in 2012, 2013, and 2017, respectively, thus these false alarms tend to occur when the flux is elevated, but does not reach the 10,000 cm −2 sr −1 s −1 level.

Verifying Deterministic Forecasts
The differences in the performances of both the deterministic and probabilistic forecasts between the training data and 2012, 2013, and 2017 indicate that the parameters that provide the highest ROC, Peirce, and Brier skill scores are somewhat variable. Figure 6 shows that the climatology of the occurrence of high-flux events varies, with the flux having been more frequently above 1,000 cm −2 sr −1 s −1 in 2017. Given that our methodology does not use measured values of the flux at GEO as an input to the forecasts, they do not account for this underlying variation in the climatology. This may be corrected for by using, for example, the previous year as training data for the current year and regenerating the forecast parameters more frequently.

Discussion
We have demonstrated that the amount of time a single variable exceeds a set forecast threshold can be used to generate either deterministic or probabilistic forecasts of the >2 MeV electron flux at GEO being above 1,000 or 10,000 cm −2 sr −1 s −1 . The optimum forecast thresholds are determined by maximizing standard forecasting metrics for deterministic and probabilistic forecasts. By comparing these metrics for different input variables, we find that AL, SYM-H, and solar wind pressure provide the most skilful forecasts of high-flux events.

Space Weather
Tables 1 and 3 show that some of the parameters and skill scores of our forecasts with the highest ROC or Brier skill scores show little variation. For example, the AL-based deterministic forecasts of the electron flux exceeding 10,000 cm −2 sr −1 s −1 have ROC scores of 0.71-0.87 from variable percentiles of 13-21% and Peirce scores of 0.29-0.6 from break thresholds of 20%. This is a natural consequence of our methodology. Our methodology gives the highest skill scores for forecasts of the flux being high within 1 day into the future. Longer forecasts must also include this first day, thus the success of longer forecasts is dominated by success in the near-term. However, within our training data, we do find successful longer term forecasts that are not successful due to high fluxes in the short term. For example, there are 48 successful 6-day forecasts of the flux exceeding 1,000 cm −2 sr −1 s −1 when the flux was not high in the first 3-days of the forecast window.

Comparison With Other Forecasts
Earth's radiation belt is heavily populated by a range of spacecraft, all of which have some risk exposure due to the abundance of high-energy particles in this region. Empirically, spacecraft are at an enhanced risk of anomalies due to internal charging effects when the 2-day >2 MeV electron fluence exceeds 10 8 cm −2 sr −1 , corresponding to a mean flux of ∼500 cm −2 sr −1 s −1 (Baker et al., 1987;Iucci et al., 2005;Romanova et al., 2005;Wrenn & Smith, 1996;Wrenn & Sims, 1996;Wrenn et al., 2002). Both NOAA and the Met Office issue "Electron Event Alerts" to satellite operators when the observed >2 MeV flux exceeds 1,000 cm −2 sr −1 s −1 or when the 72-hr fluence predicted by the Relativistic Electron Flux Model exceeds 10 9 cm −2 sr −1 . However, this only allows for nowcasts of the flux level or relies on upstream solar wind data and in-situ flux measurements. In this study, we have shown that it is possible to forecast the likelihood of the fluxes reaching these levels and thus an alert being issued within up to a 10-day interval based on simple measurements of either AL, SYM-H, or solar wind pressure and with hit rates of up to 84%.
A variety of radiation belt models have been developed to predict the current or near-future state of the radiation belts. These typically fall into two categories: physics-based semiempirical models that typically impose a measured particle distribution at an outer boundary and use empirical models of various wave populations to provide the magnetospheric conditions to solve a series of transport equations and thus forecast the fluxes across the whole of the radiation belts (e.g., Glauert et al., 2014aGlauert et al., , 2014bFok et al., 2008;Li et al., 2001;Subbotin & Shprits, 2009;Turner & Li, 2008); or empirical and machine learning models which use a combination of geomagnetic and upstream solar wind conditions and recently measured radiation belt fluxes to predict flux levels for a few days into the future (e.g., Baker et al., 1990;Boynton et al., 2013Boynton et al., , 2015Boynton et al., , 2016Chen et al., 2019;Coleman et al., 2018;Shin et al., 2016;Wei et al., 2018). In these models, a key component is the flux at the outer boundary or the measured flux at geosynchronous orbit. Our forecasts have three key differences to these models: our forecasts only predict whether the flux will exceed some value rather than predicting the actual flux that will be observed; our forecasts do not use in situ measurements of the flux in any form; our forecasts predict over a much longer time range. However, given the way that our forecast windows have been constructed, we do not predict when within a forecast window a flux threshold will be exceeded. As such, our forecasts may be a useful supplement to other operational models since they can operate in the absence of any upstream or in-situ measurements and provide some indication as to the possible upcoming hazardous levels of energetic electron flux. Balikhin et al. (2016) presented contingency tables for the REFM and SNB3GEO models forecasting the 1-day >2 MeV electron fluence exceeding 10 8 or 10 9 cm −2 sr −1 between 2 March 2012 and 31 December 2013. While these forecasts are not directly comparable to our own, comparing their performance is informative. From these tables, the hit rates for these forecasts were 66% (86/129), and 57% (4/7), respectively for the REFM and 82% (106/129) and 57% (4/7), respectively, for SNB3GEO, with the bracketed numbers indicating the number of hits and the sum of the hits and misses. From our training data set, the hit rates of our 1-day forecasts of flux exceeding 1,000 or 10,000 cm −2 sr −1 s −1 were 83% and 71% respectively. Restricting our forecasts to the same period as Balikhin et al. (2016), we get hit rates of 67% (239/357) and 52% (63/121) for forecasts of the flux exceeding 1,000 and 10,000 cm −2 sr −1 s −1 , respectively. As such, our forecasts, which predict the >2 MeV flux exceeding 1,000 or 10,000 cm −2 sr −1 s −1 have similar hit rates to forecasts of the 1-day fluence exceeding 10 8 or 10 9 cm −2 sr −1 , although notably there are a greater number of events for our forecasts to predict. Similarly, we calculate that the false alarm rates were 5% (22/532) and 1% (7/654), respectively, for the REFM and 6% (33/532) and 0.3% (2/654), respectively, for SNB3GEO. In comparison, the false alarm rates for our forecasts were higher at 21% (66/313) and 10% (56/549).
Since both the REFM and SNB3GEO models incorporate corrections based on in-situ measurements of the >2 MeV fluxes, it is informative to compare their hit rates and false alarm rates with those of a persistence forecast. Over the same period as the Balikhin et al. (2016) study, a 1-day persistence forecast of the fluence exceeding 10 8 cm −2 sr −1 had a hit rate of 72% (93/129) and a false alarm rate of 7% (36/530). Similarly, a 1-day forecast of the fluence exceeding 10 9 cm −2 sr −1 had a hit rate of 57% (4/7) and a false alarm rate of 10.1029/2019SW002416 Space Weather 0.5% (3/652). The hit rates for the lower fluence level are higher than the REFM, but 10% lower than SNB3GEO, but the false alarm rates were within a few percentage points. For the higher fluence level, the hit rates and false alarm rates were similar for the REFM, SNB3GEO, and persistence forecasts. This would suggest that the persistence aspect of the REFM and SNB3GEO tends to dominate forecasts of high fluence. By definition, our forecasts have no persistence aspect.
Comparing the hit rates of our forecasts between the training data set, 2012-2013 and 2017 we see that, while our forecast methodology has some merits, the forecast parameters have a level of instability. The Reliability Curves in Figure 6 show that the climatology of the flux exceeding 1,000 or 10,000 cm −2 sr −1 s −1 varied across the three different time periods, becoming notably higher in 2017. Given that our forecasts do not use the measured flux in anyway, they cannot account for this other than by recalculating the forecast parameters. We note that the parameters for the SNB3GEO model were determined from data between 11 July 2004 and 11 October 2005 (Boynton et al., 2015) and this forecast performs relatively well during 2012-2013, especially at the low fluence level. As such, there may be some underlying variability in the radiation belts which is not adequately captured by geomagnetic activity or solar wind driving over a few days, but can be accounted for by including some level of persistence.
To our knowledge, our methodology is the first to provide a probabilistic forecast of the geosynchronous >2 MeV electron flux. As per our deterministic forecasts, AL proved to provide the highest Brier skill scores using inputs over the previous 3-5 days. The Brier skill scores, which are a measure of the improvement over the climatology, were above 0.2 for forecast windows of up to 3 days, a level comparable to some terrestrial weather forecasts (e.g., Barnston et al., 2010;Stefanova & N. Krishnamurti, 2002). In order to calculate the forecast probabilities, we made the arbitrary assumption that the duration that the forecast threshold was broken within the input window was equivalent to the probability of an event occurring. The mean squared errors in the forecasts were between 10-20%, driven by the forecasts overestimating occurrence rates below the climatology and underestimating occurrence rates above the climatology, which would indicate a slightly non-linear relationship between break duration and the probability of an event. In principle, the reliability diagrams in Figure 6 can be used as a calibration curve for the forecasts to correct for this, but more complex relations could be examined. For example, Mourenas et al. (2019) showed that enhancements in the radiation belt occur after the integrated Ap index was above 800nT hr. Further forecasts could be based upon this integrated measure or equivalently the average of a geomagnetic index within a given input window.
The methodology used to verify the forecasts is based on that commonly used in terrestrial weather forecasting. These methods have been used in other aspects of space weather forecasting (e.g., Murray et al., 2017; and are widely applicable within this field.

Radiation Belt Drivers
The >2 MeV electron flux at geosynchronous orbit can vary by several orders of magnitude (e.g., Paulikas & Blake, 1979). These variations are driven by the interactions between the solar wind and Earth's magnetosphere but can be complex as different parts of the system respond in different ways at different times and these responses can be coupled to each other. As a result, geomagnetic activity can arise through near-direct driving by large-scale structures in the solar wind resulting in geomagnetic storms (e.g., Borovsky & Denton, 2006), or as an integrate-and-fire response to solar wind coupling, resulting in geomagnetic substorms Freeman and Morley (2004). While the global response of the radiation belts to geomagnetic storms is repeatable (Murphy et al., 2018), the variations at different locations in the radiation belts are more complex (Reeves et al., 2003). This response is complicated by the fact that the spatial location of closed electron drift paths are strongly dependent on the magnetic field configuration: during geomagnetic storms, a compression of the magnetosphere means that geosynchronous orbit is closer to the outer edge of the magnetosphere and that the radiation belts are located closer to the Earth. The radiation belts can also be enhanced by lower-level geomagnetic activity such as substorms (Horne & Thorne, 1998;Horne, Thorne, Shprits, et al., 2005;Jaynes et al., 2015;Li et al., 2007;Summers et al., 1998), with 50% of substorm intervals followed by an increase in the number of electrons in the radiation belts within 24-48 hr (Forsyth et al., 2016). Overall, the response of the fluxes of electrons at different energies and different L* shells shows stronger correlations with measures of geomagnetic activity than with the solar wind driver (Zhao et al., 2017). The integrated geomagnetic activity can be a good indicator of likely radiation belt enhancements (Mourenas et al., 2019). As such, both internal and external drivers influence the state of the radiation belts and fluxes at set locations in the magnetosphere.
Our results show that, in general, geomagnetic indices provide better forecasts of the >2 MeV flux exceeding either 1,000 or 10,000 cm −2 sr −1 s −1 than solar wind parameters, but that the forecast verification parameters (ROC scores, Brier skill scores) can be similar. This is in keeping with recent studies of the variability of radiation belt fluxes. Zhao et al. (2017) showed correlation coefficients of ∼−0.5 between AL or solar wind velocity and the flux of 1,000 MeV/G particles at an L* of 6 but that across other L* shells, the correlation with AL was generally higher. Other studies have shown there is greater mutual information between the radiation belt seed populations and geomagnetic parameters than with solar wind parameters (Tang et al., 2017). However, that is not to say that the solar wind has no influence. Studies such as Balikhin et al. (2011), Boynton et al. (2013), and Wing et al. (2016) have shown that the coupling between the radiation belts and the solar wind driver are nonlinear and may have different time lags but there is mutual information between them. The complex response of the magnetosphere to both external driving and internal conditions complicates forecasting the radiation belts based on upstream conditions alone. For example, the occurrence of substorms, which are thought to provide the necessary particle injections to energize the radiation belts (Forsyth et al., 2016;Horne, 2007;Jaynes et al., 2015), appears to be somewhat random, possibly occurring as an integrate-and-fire process (Freeman & Morley, 2004). For such events, the processing of the solar wind driver by the magnetosphere, which is reflected in geomagnetic activity, is a critical component in determining changes to the radiation belts.
In this study, we have opted to use AL as an input variable to our forecasts, as opposed to the AE index, which is commonly used as a parameterization for radiation belt model inputs (e.g., Meredith et al., 2012). AE is the difference between the AL index, which is designed to capture the strength of the westward electrojet and is always negative, and AU, which is designed to capture the eastward electrojet and is always positive. Given that substorm injections occur on the nightside in conjunction with the formation of the substorm current wedge and associated enhancement in the westward electrojet, we consider that AL is a more appropriate input to a forecast of the fluxes at the outer edge of the radiation belts.
In order to predict fluxes above 1,000 or 10,000 cm −2 sr −1 s −1 , our deterministic forecasts require that AL drops below −74 or −171 nT, respectively. These are not particularly extreme values of AL, corresponding to the 36th and 17th percentiles. Superposed epoch analysis of substorms shows that, on average, AL just reaches −200 nT at the end of the expansion phase, but activity during the growth phase brings AL below −100 nT (Juusola et al., 2011). We also note that the thresholds for SYM-H to predict geosynchronous flux enhancements are above −10 nT (see supporting information), 45 percentage points above the -50 nT level that is taken to indicate storm activity (Gonzalez et al., 1994). Our results thus show that enhancements in flux at geosynchronous orbit follow periods of ongoing substorm activity, but activity at a level that would not necessarily be classed as a geomagnetic storm.

Conclusions
We have examined whether various geomagnetic indices and solar wind parameters can be used to forecast whether the >2 MeV flux from GOES 15 will exceed either 1,000 or 10,000 cm −2 sr −1 s −1 within a forecast window of up to 10 days. These forecasts are based on the amount of time forecasting variables exceed a set threshold during a window of time prior to the forecasting time and rely entirely on conditions and observations external to the radiation belts. We used 2014-2016 inclusive as a training set for our forecasts. Our results show that the AL index provides the best forecasts and that 1. the deterministic forecasts of the flux exceeding 1,000 cm −2 sr −1 s −1 that gives the highest Peirce score requires AL <−103 nT for 43 hr in the preceding 6 days; 2. the deterministic forecasts of the flux exceeding 10,000 cm −2 sr −1 s −1 that gives the highest Peirce score requires AL <−249 nT for 29 hr in the preceding 6 days; 3. the best probabilistic forecasts of the flux > 1,000 cm −2 sr −1 s −1 are based on the proportion of the preceding 4 days that AL < −50 nT; 4. the best probabilistic forecasts of the flux > 10,000 cm −2 sr −1 s −1 are based on the proportion of the preceding 4 days that AL <− 215 nT.
Our simple deterministic forecasts have hit rates of up to 89% with high ROC scores (>0.8) and high Peirce scores (up to 0.57); however, our false alarm rates are also high (10-53%). Our deterministic forecasts do not perform as well as persistence forecasts of high-flux events. Compared to the results of Balikhin et al. (2016), our forecasts give slightly lower hit rates and higher false alarm rates than forecasts of high fluence; however, this may be improved with an optimised forecast tuning methodology. However, our forecasts are entirely independent of measurements of the electron flux at GEO. Overall, this demonstrates that forecasting the high energetic electron fluxes at geosynchronous orbit over periods of greater than a few days is possible with relatively high success rates based on observations solely from geomagnetic activity or upstream measurements.

Data Availability Statement
GOES flux and OMNI data used in this study were obtained from NASA's CDAWeb (https://cdaweb.gsfc. nasa.gov) using the SPDFCDAWEBCHOOSER IDL interface. GOES fluence was downloaded from this site (ftp://ftp.swpc.noaa.gov/pub/indices/old_indices/). The forecast data were created using the methods that are fully described in the text of this article and reproducible.