Evaluating Auroral Forecasts Against Satellite Observations

The aurora is a readily visible phenomenon of interest to many members of the public. However, the aurora and associated phenomena can also significantly impact communications, ground‐based infrastructure, and high‐altitude radiation exposure. Forecasting the location of the auroral oval is therefore a key component of space weather forecast operations. A version of the OVATION‐Prime 2013 auroral precipitation model (Newell et al., 2014, https://doi.org/10.1002/2014sw001056) was used by the UK Met Office Space Weather Operations Centre (MOSWOC). The operational implementation of the OVATION‐Prime 2013 model at the UK Met Office delivered a 30‐min forecast of the location of the auroral oval and the probability of observing the aurora. Using weather forecast evaluation techniques, we evaluate the ability of the OVATION‐Prime 2013 model forecasts to predict the location and probability of the aurora occurring by comparing the forecasts with auroral boundaries determined from data from the IMAGE satellite between 2000 and 2002. Our analysis shows that the operational model performs well at predicting the location of the auroral oval, with a relative operating characteristic (ROC) score of 0.82. The model performance is reduced in the dayside local time sectors (ROC score = 0.59) and during periods of higher geomagnetic activity (ROC score of 0.55 for Kp = 8). As a probabilistic forecast, OVATION‐Prime 2013 tends to underpredict the occurrence of aurora by a factor of 1.1–6, while probabilities of over 90% are overpredicted.

. Ionospheric currents associated with enhanced auroral activity can induce currents in the ground which can damage ground-based infrastructure such as electricity supply networks (e.g., Cannon et al., 2013;Erinmez et al., 2002;Freeman et al., 2019;Smith et al., 2019). In addition, forecasting the occurrence of visible aurora is of importance for auroral tourism and is a key tool in promoting public awareness and engagement with space weather, through projects such as Aurorasaurus (MacDonald et al., 2015) and AuroraWatch UK (Case et al., 2017).
The auroral oval is highly dynamic with activity driven by factors both internal (e.g., geomagnetic substorms) and external to the magnetosphere (e.g., the interaction with the solar wind). Prolonged periods of southward directed interplanetary magnetic field (IMF) can increase the open flux content of the magnetosphere which causes the auroral oval to expand to lower latitudes (Cowley & Lockwood, 1992). During substorms, the sudden onset of reconnection in the magnetotail leads to a rapid brightening and widening in the nightside auroral oval which spreads eastwards and westwards during the substorm expansion phase (e.g., Akasofu, 1964).
The OVATION auroral forecast model Newell et al., 2002) is an empirical model which predicts the location of the auroral oval based on the upstream solar wind conditions. The most recent version, OVATION-Prime 2013(OP-2013Newell et al., 2014) uses average particle precipitation maps obtained from Defense Meteorological Space Program (DMSP) satellites (Hardy et al., 1984(Hardy et al., , 1985 spanning 21 years between January 1, 1984 and December 31, 2005, UV auroral data from the Global Ultraviolet Imager (GUVI) instrument onboard the Thermosphere Ionosphere Mesosphere Energetics and Dynamics (TIMED) satellite and real time solar wind conditions measured at the L1 point to produce maps of the predicted auroral flux. A version of the OP-2013 auroral forecast model has been implemented in daily operations of leading space weather forecasting centers including the U.S. National Oceanic Atmospheric Administration (NOAA) Space Weather Prediction Center (SWPC), the U.S. Department of Defense, Space Weather Operations Centre (Jones et al., 2017), and the UK Met Office. The OP-2013 model was originally supplied to the Met Office by SWPC, however, the operational implementations at the Met Office and SWPC have since diverged.
Forecast evaluation is an important step in both the implementation and development of space weather forecast models. Model verification can provide information on the skill, accuracy, and reliability of models and also provides quantitative benchmarks to compare different forecast models. Previous verification studies have evaluated the performance of the earlier generation aurora forecast model, OVATION-Prime 2010 (Kosar et al., 2018;Lane et al., 2015;Machol et al., 2012;Mitchell et al., 2013;Newell, Sotirelis, Liou, et al., 2010;. Machol et al. (2012) and Newell, Sotirelis, Liou, et al. (2010) evaluated the auroral forecasts of OP-2010 against ultraviolet images of the aurora from the instruments onboard the Polar satellite. Newell, Sotirelis, Liou, et al. (2010) compared the instantaneous and hourly averaged predicted auroral power to the observed power estimated from the Polar Ultraviolet Imager (UVI) data. The auroral power predicted by OP-2010 was found to be correlated with the observed auroral power from Polar UVI with a correlation coefficient 2 56% r  for the instantaneous power forecast and an 2 58% r  for the hourly averaged auroral power, demonstrating that just over half of the observed auroral power can be forecast by the OP-2010 model. Mitchell et al. (2013) found that OP-2010 described 47% of the variance in the Polar UVI nightside auroral power while a similar auroral model, OVATION-SuperMag (OVATION-SM) which uses averaged DMSP precipitation maps and ground magnetometer data from SuperMAG, described 71% of the nightside variance. Machol et al. (2012) used binary event analysis to evaluate the forecasts from OP-2010 and the suitability of the model as a tool for forecasting visible nightside aurora. Machol et al. (2012) compared the nightside auroral forecast to the boundaries derived from a fixed brightness threshold of the nightside auroral emission in the Polar UVI data. The result of this verification study found that the OP-2010 had a hit rate of 0.58 (the proportion of correct positive forecasts out of the total positive observations of aurora), a false alarm rate of 0.14 (the proportion of aurora forecasts which were not observed) and an overall accuracy of 0.86 (the proportion of correct positive and negative forecasts over the total number of forecasts). Lane et al. (2015) performed a comparison study of the energy flux outputs forecast from three different models: OP-2010, the Kp-based auroral forecast model by Hardy et al. (1991), and a ring-current model from the Space Weather Modeling Framework (Fok et al., 2001;Tóth et al., 2005). Similarly to Machol et al. (2012), Lane et al. (2015) also used fixed thresholds to define the equatorward auroral boundary defined from particle precipitation measurements from the DMSP satellites. The authors presented the results in terms of the prediction efficiency, which is the model's ability to describe the percentage variance in the observed data set. The prediction efficiencies of OP-2010 were found to be 0.55 and 0.58 for the threshold values of 0 Verification is important in monitoring model performance and also acts as a benchmark against which proposed improvements to the model can be tested. Verification techniques that are routinely used in terrestrial weather forecasting are now being applied to space weather forecast models. Binary event analysis is a method of comparing model forecasts with a ground-truth observational data set and is widely used in many applications. The approach of using binary event analysis has been applied to evaluate nowcast and forecast models, for example, in the verification study of OP-2010 by Machol et al. (2012) and verification studies of other space weather models including predicting magnetopause crossings (Lopez et al., 2007;Welling & Ridley, 2010), radiation belt models Ganushkina et al., 2015Ganushkina et al., , 2019, temporal changes in the induced ground magnetic field ( / ) dB dt (Pulkkinen et al., 2013), and solar flare forecasts (Barnes et al., 2016;Kubo et al., 2017;Leka et al., 2019;Murray et al., 2017;. In this study, we present an evaluation of auroral forecasts from the version of OP-2013 that was being used operationally at the Met Office, until December 2020. We compare the auroral forecasts from the model against auroral boundaries derived by Longden et al. (2010) from global FUV images of the auroral oval obtained by the IMAGE satellite. In particular, and in contrast to Lane et al. (2015), Machol et al. (2012), andNewell, Sotirelis, Liou, et al. (2010), we examine the output auroral probabilities from the operational auroral forecast, rather than the physical quantities (the predicted auroral power, energy, or auroral flux) provided by the underlying OP-2013 model. We assess the model performance in predicting the location of the auroral oval using binary event analysis and present the results in relative operating characteristic (ROC) curves. We also assess the forecast probabilities of aurora occurring output by the model using reliability curves. Our results show that, overall, the model performs well at predicting the location of the auroral oval, but the forecast probabilities tend to underpredict auroral occurrence. Furthermore, we show that the model results are substantially less reliable on the dayside and during periods of enhanced geomagnetic activity.

Forecast Model: OP-2013
Both the OP-2010 and OP-2013 versions of the auroral forecast model Newell et al., 2009Newell et al., , 2014 predict the precipitating electron and proton auroral flux based on upstream solar wind conditions, measured at L1. Newell et al. (2009) created averaged particle precipitation maps of the auroral oval collected by the Special Sensor J instruments onboard the DMSP satellites and categorized the DMSP particle precipitation energy spectra into four categories of aurora: mono-energetic, broadband and diffuse electron aurora and ion aurora. Newell et al. (2009) determined a linear scaling between the electron and proton flux from the DMSP data with an empirically derived solar wind coupling function (Newell et al., 2007). The upstream solar wind data required includes the z B and y B components of the IMF, the total magnetic field strength, the solar wind velocity, and the IMF clock angle. In each model grid point, the particle flux was calculated as a function of season and the type of aurora. For OP-2013, additional UV auroral data from the GUVI instrument onboard the TIMED satellite is included to improve the performance of the model at higher values of Kp, between Kp 5 and 8 (Newell et al., 2014). The resultant maps of linear scaling coefficients are then used to predict the precipitating electron and proton fluxes under all upstream conditions. Additional improvements to the model made in the upgrade from OP-2010 to OP-2013 include further noise reduction and a smoother data interpolation in the post-midnight magnetic local time (MLT) sectors (Newell et al., 2014). We direct the interested reader to  and Newell et al. (2007Newell et al. ( , 2009Newell et al. ( , 2014 for full details of the OP-2010 and 2013 models. The Met Office operational implementation of the version of OP-2013 assessed in this study, assumes a fixed 30-min propagation time for the solar wind measured at the L1 point to arrive at Earth. In this operational version, the combined precipitating particle flux from all types of aurora at each grid point is linearly scaled into an estimated probability of aurora occurring which is interpreted as the probability of an observer seeing the visible aurora. The linear conversion of auroral flux to probability implemented in the version of OP-2013 at the Met Office are as originally developed by SWPC and could be further refined. The forecast probabilities were tuned by SWPC in response to citizen science observations under the assumption that the forecast probabilities of aurora occurring were mainly used by members of the public and may underpredict the probability of aurora occurring (Rodney Viereck, private communications). The results of this study could be used to tune the forecast probability to optimize the forecast reliability. Further details on the conversion from the predicted auroral flux to the probability of aurora occurring is included in the supporting information. The operational implementation provides an auroral forecast for both the northern and southern hemispheres 30 min ahead of the current time.
The original OP-2013 Interactive Data Language code was supplied to the Met Office by SWPC. In 2016, the Met Office converted the code to Python and returned the Python version of OP-2013 to SWPC. In October 2020, SWPC implemented an upgraded version of OVATION termed OVATION 2020 which, again, differs from the Met Office implementation. OVATION 2020 uses an improved geomagnetic field model to provide a more accurate auroral location. In addition, OVATION 2020 provides the modeled energy flux in 2 / ergs cm as well as the scaled probability of seeing the aurora. SWPC have also implemented an estimate of the solar wind driving based on Kp data to use as an alternative to run the model when upstream solar wind data is unavailable. Details of the SWPC auroral forecast using OVATION 2020 can be found on the SWPC website.
The version of the OP-2013 model evaluated in this study was used operationally at the Met Office until December 2020. The Met Office currently use an alternative Kp-driven 3-day forecast version of the OP-2013 model. We note that the Kp-driven version was developed at the Met Office independently of the SWPC Kp-driven model. The Met Office may return the 30-min forecast version of OP-2013 evaluated in this study to operational use in the future to operate in parallel with the Kp-driven 3-day forecast version. In this study, we refer to the 30-min auroral forecast as a nowcast to distinguish it from the alternative 3-day auroral forecast which is currently in operation at the Met Office.
In this study, we produce hindcasts of the output from the 30-min nowcast version of OP-2013 used at the Met Office using historic solar wind data for the period between May 2000 and October 2002, not auroral forecasts that were issued in near real time by the Met Office. Figure 1a shows an example output of the OP-2013 northern hemisphere 30 min auroral forecast from September 25, 2000. The model output was produced using Advanced Composition Explorer (ACE) solar wind data. The auroral oval is plotted on geographic coordinates with the color scale showing the forecast probability of aurora occurring.

Observational Data: Auroral Boundaries Derived From IMAGE FUV Data
The NASA IMAGE satellite was in operation between 2000 and 2005 in a highly elliptical, precessing polar orbit which enabled it to capture images of the northern and southern polar regions. The orbit had an initial perigee of 1,000 km and an apogee of 44,000 km (∼7 R e ; Mende, Heetderks, Frey, Lampton, Geller, Habraken, et al., 2000). Between 2000 and 2002, the orbital apogee was situated over the northern hemisphere. IMAGE carried a far-ultraviolet (FUV) wideband imaging camera (WIC) sensitive to emission between 140 and 190 nm (Mende, Heetderks, Frey, Lampton, Geller, Abiad, et al., 2000) which took images of the Earth approximately every 2 min, determined by the spin of the spacecraft (Burch, 2000).
Using IMAGE FUV data, Longden et al. (2010) developed an automated technique to identify the poleward and equatorward luminosity boundaries of the auroral oval. The IMAGE FUV data was converted from geomagnetic coordinates to altitude-adjusted corrected geomagnetic coordinates (AACGM; Baker & Wing, 1989). Longden et al. (2010) created a latitudinal intensity profile of auroral emission in each of the 24 magnetic local time (MLT) sectors and fitted these profiles with both single and double Gaussian profiles. The profile fits to the data were evaluated using the reduced  2 statistic and the best fit function retained. The poleward and equatorward auroral luminosity boundaries (PALBs and EALBs, respectively) are defined as the poleward and equatorward the points on the Gaussian curve where the auroral intensity drops to half the peak value (the full width half maximums [FWHMs]) of the best fitting Gaussian function, offset from the center of the Gaussian peak. We direct the interested reader to the full description of the method published in Longden et al. (2010). The boundaries determined by Longden et al. (2010) provide a single location of the poleward and equatorward boundaries of the auroral oval in each MLT sector, without any assumption of the global shape of the auroral oval. Auroral boundaries were identified for each global auroral image, with a cadence of 2 min.
In this study, we use the poleward and equatorward auroral luminosity boundaries determined from the IMAGE WIC data by Longden et al. (2010) as a ground-truth observational data set to compare with the model forecast probability maps output from OP-2013. The poleward boundary identifications from Longden et al. (2010) have been shown to be co-located with the poleward emission boundary measured from DMSP within 3° on average in all MLT sectors, making the boundaries a suitable observational data set to compare with the OP-2013 forecasts. The auroral boundary data available for the northern auroral oval spanned 30 months from May 2000 to October 2002 (Chisham, 2017). Figure 1b shows a comparison of the probability forecast maps from OP-2013 to the poleward and equatorward auroral boundaries determined by Longden et al. (2010) in MLT and magnetic latitude (MLAT) coordinates. The colors show the 30-min forecast of the probability of aurora occurring as output from OP-2013. Gray regions indicate a forecast probability of aurora occurring of less than 1%. The black lines show the corresponding observed boundaries. We note that, in this example, there is a lack of observed auroral boundaries in some dayside MLT sectors. While the method of Longden et al. (2010) aims to identify the poleward and equatorward auroral luminosity boundaries in each MLT sector, the number of successful boundary identifications in dayside sectors is much lower than on the nightside . The dayside aurora tends to be dimmer and thinner (Carbary, 2005;Holzworth & Meng, 1975) and is more contaminated with dayglow making it more difficult to identify the dayside auroral boundaries. In this study, we only evaluate the model where there are corresponding observational auroral boundaries.

Verification Method
In this study, we have produced the OP-2013 auroral forecasts spanning the period of May 2000-October 2002 (Marsh & Mooney, 2021), coinciding with the available observational auroral boundary data from Longden et al. (2010), using historic solar wind data measured by the ACE satellite, provided by the NOAA. Each forecast requires 4 h of input solar wind data, thus in order to ensure that the forecasts were independent of one another, we down-sampled our forecast data set to four hour resolution. To match the model forecast and the observational ground-truth auroral boundaries, we use the auroral boundaries that were closest in time and within 2.5 min of the 4-h separated forecast time. This resulted in 3,360 corresponding forecast and observation pairs. The MLAT range of the OP-2013 data spans 50°-89.5°and covers 24 h of MLT, with a grid resolution of 0.25 MLT by 0.5° MLAT.
In this evaluation study, we use two verification techniques that are widely used in terrestrial weather forecast verification. First, we apply binary event analysis to evaluate how well the OP-2013 model discriminates between auroral and non-auroral regions via comparison with the Longden et al. (2010) boundaries. This evaluates how well the model performs as a deterministic forecast for predicting the location of the auroral oval. We test over a range of forecast probability levels, between 0% and 100% in 10% increments. At a particular level, for each available forecast and observation pair in each grid cell with a forecast probability that exceeds the level, we determine whether or not the aurora was observed at that grid cell. We repeat this test to build up truth tables for different forecast probability thresholds. If the forecast probability of aurora occurring is equal to or greater than the set level and aurora was also observed, it counts as a hit in our truth table. If the forecast probability of aurora occurring is equal to or greater than the set level but aurora was not observed, it counts as a false alarm. If the forecast probability of aurora occurring is less than the set level and aurora was observed, it counts as a miss. If the forecast probability of aurora occurring is less than the set level but aurora was not observed, it counts as a correct negative. From the truth tables for each level, we evaluate the hit rate (hits/[hits + misses]) and false alarm rate (false alarms/[false alarms + correct negative forecasts])). These hit rates and false alarm rates are combined and presented on ROC curves (Mason, 1982;Swets, 1988;Swets et al., 1955). ROC curves are obtained by plotting the calculated hit rate against the false alarm rate from the truth table, for each 10% probability level. A ROC score, calculated as the fractional area under the ROC curve, provides a quantitative summary of the model discrimination indicated by the ROC plot. A ROC score between 0.5 and 1 indicates that the hit rate exceeds the false alarm rate for most probability levels and that the model is skillful in discriminating events from non-events.
Second, we assess the validity of the forecast probabilities against the observed occurrence of the aurora using reliability (or attribute) diagrams (Hsu & Murphy, 1986;Jolliffe & Stephenson, 2012;Wilks, 2006). The forecast model would be completely reliable if, over all the occasions during the assessment period when the forecast probability was p, the aurora was observed p% of the time. However, if the forecast probabilities and observed frequencies of occurrence do not have a one-to-one correspondence, the reliability diagram provides information on whether the model is under forecasting or over forecasting the probabilities. This information can be used to re-calibrate the forecast probabilities by rescaling the probability of aurora occurring against the observed occurrence of aurora. We provide suggestions of how the forecast probabilities of aurora occurring may be adjusted based on the results of this study in Section 4.2. Attribute diagrams are similar to reliability diagrams, showing the observed frequency of an event against the forecast probabilities but they include additional information such as the average, climatology value of the observations and forecasts which can be used to assess the forecast model in more detail. Further detail on ROC and reliability analysis is provided in the supporting information.
ROC and reliability analysis are standard methods used in forecast verification by the weather community (e.g., Dube et al., 2017). They have been used to evaluate flare forecasts from the Met Office Space Weather Operations Centre (MOSWOC) in studies by Murray et al. (2017) and Sharpe and Murray (2017), to evaluate the performance of a new radiation belt forecast model  and to assess a sudden storm commencement probabilistic forecast model (Smith et al., 2020).
The spherical geometry of the auroral forecasts means that the area of each grid cell is not uniform. This can influence how well the forecast is judged to perform. For example, near the pole, where aurora are not generally expected to occur, there is a greater concentration of grid cells than at 60°, where there is a greater likelihood of auroral activity. To account for this, the inputs into our ROC and reliability analysis were weighted by the cosine of the latitude of each grid cell (e.g., Young, 2010).

Results
In the following section, we present the results of our evaluation of the OP-2013 model using the locations of the auroral boundaries derived from IMAGE WIC data. In Section 3.1, we present the results of the analysis of 2.5 years of data between May 2000 and October 2002 in all MLT sectors. In Sections 3.2 and 3.3, we present the results of the verification during the four seasons of 2001 and in different MLT sectors around the auroral oval to test for seasonal and spatial variations in the forecast performance. In Section 3.4, we present the results of the verification during geomagnetically active times for different values of Kp.

Model Evaluation Between May 2000 and October 2002
Figure 2 shows the ROC curve from the comparison of 2.5 years of model forecast and observation pairs. The curve is constructed by setting the probability threshold in 10% increments to calculate the hit rates and false alarm rates. The ROC curve shown in Figure 2 shows that, over the 2.5-year verification period, the model performs well and has a ROC score (fractional area under the curve) of 0.82. At each 10% probability increment the model hit rate is higher than the false alarm rate with a maximum difference between the hit rate and false alarm rate of 0.6 for probabilities exceeding 5%. Thus, the forecasts perform well at predicting the location of the aurora overall. The probability bin centered on 10% has the largest difference between the hit rate and the false alarm rate, also referred to as the Peirce Score (Peirce, 1884). This shows that a probability of between 5% and 15% is the threshold at which the OP-2013 model performs the best at discriminating between regions of aurora and no aurora, compared to the observed auroral boundaries. Figure 3 shows the reliability diagram for the full 2.5 year verification period, plotting the occurrence rate from auroral observations for given forecast probability ranges. Figure 3 shows that the aurora are largely underpredicted, with occurrence frequencies greater than the forecast probabilities for probabilities up to 80%. The lowest non-zero probabilities of 10% and 20% are underpredicted by a factor of 6 while the 80% probabilities are only underpredicted by a few percent. The 90% and 100% probability bins slightly overpredicted the probability of aurora occurring with the highest probability value of 100% overpredicting the occurrence by 20%, a factor of 1.25.
The dotted horizontal and vertical lines indicate the observed climatological frequency of occurrence of aurora is 0.30, calculated as the fraction of positive auroral observations that the aurora did occur out of the total number of auroral observations. The histogram in Figure 3 shows the number of data points in each forecast probability bin. The histogram shows that the probabilities forecast by the OP-2013 model are distributed across all probability bins and are not clustered around the climatology value. The lowest forecast probability bin contains all forecasts issued with a probability of 5% and lower and has the largest number of data points. This bin is dominated by the grid points where the main auroral oval is rarely or never predicted to occur, for example, at low and high MLATs. The large number of forecasts with a low probability of aurora occurring in this bin correspond to a large number of observations where the aurora was not observed to occur which reduces the overall observed climatology (mean occurrence). The solid pink diagonal line of no skill lies mid-way between the diagonal line of perfect reliability and MOONEY ET AL.   the horizontal climatology line. Points on the reliability curve which lie above/below the line of no skill, contribute positively/negatively to the Brier skill score. Pink shading indicates the region where the forecast is skillful compared with the in-sample climatology. The majority of the points on the reliability line lie in the shaded skill region except for probabilities of 10% and 20% which appear to be extremely underpredicted by the OP-2013 model. The Brier skill score (Brier, 1950;Murphy, 1973) of −0.03 indicates that overall, the OP-2013 model is not more skillful at predicting when the aurora occurs than simply always forecasting the within-sample climatology of 0.30. While the Brier skill score indicates that the OP-2013 model is not more skillful than using a climatological forecast, the attributes diagram shows that the majority of forecast probabilities are skillful. The discrepancy in the conclusions drawn from these two analyses metrics highlights the increased understanding of the model performance that can be gained from using the full attributes diagram rather than only using value of the Brier skill score.

Seasonal Verification During 2001
Seasonal variations in ionospheric conductivity as a result of the solar zenith angle affects the auroral precipitation (Liou et al., 2001;Newell et al., 1996). The seasonal variation in the auroral emission was examined by  and implemented in the OP-2013 model by calculating the the predicted auroral flux as a function of season. Here, we have evaluated the seasonal variability in the model performance. For the seasonal analysis we use data and forecasts from February 5, 2001 into February 4, 2002 as this is the only complete year of WIC observational data including all seasons. The seasons were defined similarly to the those used by Newell, Sotirelis, & Wing (2010) as being 90 days centered on the equinoxes and solstices. The start and end dates of each season were then adjusted slightly to include the six uncategorized days that fall between the seasons by this definition. The seasonal dates used in the analysis are as follows: spring is between February 5 and May 6; summer is between May 7 and August 8; autumn is between August 9 and November 6; winter is between November 7 and February 4.  The seasonal variation in the ROC score may be indicative of the model performance but it may also be due to the seasonal variations in the identification of the auroral boundaries. During the summer months, the increased UV contamination from reflected sunlight reduces the number of successfully identified auroral boundaries in the WIC data. We also note that the ROC scores of summer and autumn 2002 are reduced compared to the same season in previous years. In summer 2002, the IMAGE satellite suffered damage to the boom which affected the satellite pointing and resulting in an increased uncertainty in spacecraft pointing (Frey, 2010) and thus, increased uncertainty and thus in the location of the auroral boundaries. Figure 4 shows the reliability diagram for each of the seasons in 2001. The seasonal reliability is consistent with the overall reliability shown in Figure 3. In all seasons, the occurrence frequency increases rapidly with probability, thus there is an underprediction of the auroral occurrence. For autumn and spring forecasts, the observed auroral occurrence plateaus at  0 8 . and  0 9 . , respectively, for midrange forecast probabilities between ∼20% and 70%, whereas the occurrence rate in summer and winter increases steadily with probability above a forecast probability of 20%. The autumn forecasts are overpredicted at the higher probability values of 70% and above. We note that there is no significant difference in the solar or geomagnetic activity between the seasons in 2001. Generally, there are a higher number of auroral boundary observations to compare with the OP-2013 model forecasts in winter, as indicated by the winter histogram in Figure 4, however, this is not expected to have a considerable effect on the verification results.

MLT Dependence
The shape of the auroral oval varies with MLT sector. Typically, the dayside auroral oval tends to be thinner and dimmer (Carbary, 2005;Holzworth & Meng, 1975) while the nightside aurora generally extends over a wider MLAT range and is more variable with enhanced auroral precipitation linked to magnetospheric activity such as substorms. Here, we evaluate the performance of the OP-2013 model in the noon, dawn, dusk, and midnight regions. Each region is defined as 3 h of MLT centered on MLT sectors 00, 12, 06, and 18. The ROC curves of each 3-h MLT sector are shown in Figure 5 and show that the model performs well in the dawn, dusk and midnight sectors, with ROC scores of between 0.78 and 0.86. However, the ROC score from the noon region is considerably lower, at 0.59 showing that forecast model does not perform as well in this region. Using a probability threshold of 10% to indicate the presence of aurora only gives a hit rate of 0.2, much lower than the hit rates of 0.6-0.85 seen in the other MLT sectors. The results in the truth table for the noon analysis are dominated by missed forecasts and correct rejections where the aurora is not forecast by the model. The lack of forecast aurora in this region may be because of a data gap in the underlying DMSP particle precipitation data, due to the dawn-dusk orbit of the spacecraft. The midnight data gap was interpolated over in the upgrades between OP-2010 and OP-2013 , however, there are no details on whether the corresponding dayside data gap was interpolated.
MOONEY ET AL.
10.1029/2020SW002688 9 of 16  Figure 5b shows the reliability diagrams for each 3-h MLT region. The reliability curves for the dawn, dusk, and midnight sectors are similar to those of the 2.5-year verification shown in Figure 3 with forecast probabilities below 70%-80% being largely underpredicted and greater than 80% being overpredicted. The reliability diagram from the noon MLT sectors is quite different to the other MLT sectors. The reliability curve from the noon MLT sectors shows that the OP-2013 model tends to underpredict when forecasting aurora with probabilities less than 30% and overpredict when forecasting aurora with probabilities between 30% and 60%; whereas, aurora was not forecast with probabilities 70%.

Kp Dependence
In the following section, we evaluate the performance of the OP-2013 aurora forecast model under different levels of geomagnetic activity based on Kp. Kp levels of 5 and above are generally considered to be geomagnetically active periods and so it is important to evaluate the performance of the OP-2013 model during these levels of geomagnetic activity which can have a real impact on daily services at Earth. The OP-2010 model was known to break down at higher levels of geomagnetic activity of Kp  5 (Newell et al., 2014). This led to the inclusion of additional GUVI data at higher Kp levels (Kp 5-8) as part of the upgrade to the OP-2013 generation.
All corresponding forecast and observation pairs between May 2000 and October 2002 were divided into subsets based on the level of Kp measured at the time. The results of the ROC analysis, including all the ROC scores for each Kp level, are shown in Figure 6. The ROC scores generally decrease for increasing levels of Kp, with Kp = 1 having a ROC score of 0.83 and Kp = 8 having a ROC score of 0.55. The ROC scores for Kp = 1-6 are within 0.05 of each other, implying that the model performs relatively well at discriminating between auroral and non-auroral regions at these levels of activity. However, at the highest activity levels of Kp = 7 and Kp = 8, the ROC score drops to 0.7 and 0.55, respectively. While these ROC scores indicate that the forecast has some skill in identifying where the aurora will be, these forecasts are less skillful than at lower activity levels. The results for Kp = 8 show that the hit rates are lower and the false alarm rates are higher compared to the results for lower Kp levels, indicating that the model is predicting that aurora will occur but not always in the correct locations, compared to the observed auroral boundaries. It is not uncommon for models which perform well within a nominal range of average conditions to not perform as well during extreme events. Despite the improvements made to the OP-2013 model using GUVI data during higher levels of Kp, these observations are likely limited due to the rarity of periods of extremely high Kp. It would be informative to repeat this analysis with the predecessor OP-2010 model to quantify the improvement made by including the GUVI data at high levels of Kp. Figures 6b and 6c show the reliability diagrams for Kp levels of 1-8. The reliability curves for Kp levels 1-5 plateau at an observed frequency of 0.8-0.9 for forecast probabilities of 30% and above. The reliability curves for Kp levels 6-7 plateau at a lower observed frequency of aurora of 0.7-0.8 for forecast probabilities of 10% and above. Kp = 8 shows the reliability curve dropping with increasing probability such that the observed occurrence of high probabilities is much lower than the forecast probability indicating a more concerning overprediction. From the histogram, we note that Kp levels between 1 and 3 are the most common, with the highest number of points in these categories representing low geomagnetic activity. Kp levels of 7 and 8 are statistically much more rare events and have the lowest number of data points in the ROC and reliability analysis. The inclusion of more data in the analysis for this level of high geomagnetic activity would help to confirm this evaluation of the OP-2013 model at these high Kp levels.

Discussion
In this study, we have used auroral boundaries derived from global IMAGE FUV data between May 2000 and October 2002 to evaluate the performance of the auroral forecasts made by the OP-2013 model, used operationally at the Met Office. Using a combination of ROC and reliability analysis, we find that overall, the OP-2013 model performs well at predicting the location of the aurora with a ROC scores of between 0.70 and 0.86, although the forecast skill was notably lower around noon (ROC score of 0.59) and at higher Kp (ROC score of 0.55, for Kp = 8). The overall ROC score compares well with other space weather forecasts, such as M-class solar flare forecasts . The OP-2013 forecast probabilities tended to underpredict the occurrence of the aurora, with the observation frequency of the aurora typically plateauing at 0.8 for forecast probabilities exceeding 20%.

Deterministic Auroral Forecasts
The results of the ROC analysis show that overall the model performs well as a deterministic model at discriminating between regions of aurora and no aurora. In the seasonal analysis, while there is some seasonal variability in ROC scores, all ROC scores are greater than 0.74, indicating that the model performs well year round.
In the evaluation of OP-2013 by MLT sector, the model had a lower ROC score in the dayside MLT sectors centered on the noon MLT (11-13 MLT). The noon MLT sectors had a ROC score of 0.59 compared to the dusk (17-19 MLT), dawn (05-07 MLT), and midnight (23-01 MLT) sectors which had ROC scores between 0.78 and 0.86. The higher ROC scores in the nightside MLT sectors indicate that the model performs better at predicting the location of the aurora in these regions.
The width of the auroral oval varies with local time sector. The nightside auroral oval is typically wider and more dynamic than the dayside. The nightside auroral dynamics are primarily driven by substorms which cause a rapid expansion and brightening of the auroral oval. The solar wind driven OP-2013 model is unable to forecast substorm activity and the 30-min resolution of the operational forecasts cannot capture substorm dynamics, however, the change in width of the auroral oval during substorms occurs within the average predicted auroral oval location. Mooney et al. (2020) showed that during substorms, the poleward boundary of the auroral oval moves by up to 3° in the substorm onset MLT sectors. During substorms, the typical width of the auroral oval varies by 10°-17° (Walach et al., 2017). Compared to the width of the auroral oval, the 3° change in the poleward boundary represents a small change of 17%-30% of the total oval width. In addition, after the substorm activity has subsided, the auroral oval generally returns to the same size and width that it had prior to the substorm and so substorms have no lasting effect on the auroral oval. During a substorm, the relatively small expansion in the oval width in the substorm onset sectors near midnight would result in a slight increase in the number of missed forecasts but this does not have a big impact on the overall ROC score. While the OP-2013 model cannot forecast if or when a substorm may occur, the occurrence of a substorm has a relatively low impact on the performance of the OP-2013 model. In contrast, the lower ROC scores of the model in the noon sectors indicate that the model does not forecast the location of the dayside auroral oval particularly well. In the dayside MLT sectors, the auroral oval is generally much thinner and so any offset between the observed and forecast locations of the auroral oval will result in a bigger reduction in the overall ROC score.
In the final part of this study in Section 3.4, we focused on the performance of OP-2013 during periods of different levels of geomagnetic activity, defined by Kp level. Overall, the ROC scores decrease with increasing Kp levels from 0.83 to 0.55 for Kp = 1 and Kp = 8, respectively. All ROC scores for Kp 1-7 are greater than 0.70. The relatively high ROC scores above 0.70 for geomagnetic activity up to Kp = 7 may indicate that the additional GUVI data is having a positive effect on the performance of OP-2013 at disturbance levels between Kp 5 and 7. It would be informative to repeat this analysis to evaluate the performance of OP-2010 during high Kp levels to confirm and quantify the improvement made by including GUVI data. The low ROC score of 0.55 for Kp = 8 is likely due to the rarity of periods of extremely high Kp and thus, the model is less well constrained. This could suggest that the linear scaling of auroral flux with solar wind driving used by Newell et al. (2007) to construct the OP-2010 and OP-2013 models breaks down during more extreme and statistically more rare events of Kp  7.

Evaluating the Forecast Auroral Probabilities
The reliability diagrams show that the forecast probabilities of aurora occurring tend to be underpredicted, that is, that the aurora occurs more frequently than the model predicts, particularly for lower probability values of less than 80%. At the highest forecast probability values, greater than 80%, the model tends toward a slight overprediction of the probability of aurora occurring. This is observed in most cases from the seasonal, MLT sector, and geomagnetic activity analysis.
The observed frequency of aurora does not increase linearly with the forecast probabilities but instead is relatively constant between 0.8 and 0.9 for for all the forecast-observation pairs for forecast probabilities of 20% and above. This means that the lower forecast probabilities of 20% are underpredicted by a factor of 6. As the forecast probability of aurora occurring tends toward the observed frequency of aurora, the difference between the forecast probability and the observed frequency decreases and so the factor of how much the aurora is under or overpredicted also decreases.
The results of the reliability analysis show that the conversion from auroral flux to probability of aurora occurring is not particularly robust, however, this conversion is a non-trivial task. Using the results of the reliability analysis, a correction to re-calibrate the probabilities forecast by the model could be developed to improve the reliability OP-2013 auroral forecasts. The probabilities forecast by the OP-2013 model vary with season, MLT sector and geomagnetic activity (Kp level) which would need to be accounted for if a correction were to be developed. However, the results of the reliability analysis showed that for forecast probabilities of above 20%-30%, the observed occurrence of aurora is approximately constant at around 0.7-0.9, which would make it difficult to linearly re-scale the forecast probabilities. All forecast probabilities of ∼20% or above would be re-scaled to an ∼80% probability of aurora occurring, effectively producing a deterministic forecast. Germany et al. (1998) found that the brightness of the UV electron aurora is proportional to the total electron energy flux with a conversion factor of approximately 0.12 R per 10.1029/2020SW002688 utilized by Case et al. (2017) and Machol et al. (2012) to define the poleward and equatorward boundaries from the Polar UVI images and OP-2010 output. The conversion of the total electron energy flux to brightness could be used to derive a more robust method of converting the predicted auroral fluxes into a probability of aurora occurring. However, given the difficulties of linearly scaling the auroral flux into a probability of aurora occurring, it may be preferable to develop a flux threshold system, using the conversion of Germany et al. (1998). For example, in regions where the predicted auroral flux is greater than zero indicates that there may be some auroral effects. In regions where the auroral flux exceeds a certain brightness threshold would indicate that the aurora should be visible and the brightest aurora would be predicted in the regions of maximum auroral flux.

Comparisons With Previous Auroral Forecast Evaluation Studies
Lane et al. (2015), Machol et al. (2012), and Newell, Sotirelis, Liou, et al. (2010) evaluated the auroral forecasts from OP-2010. From these three studies, the binary event analysis methods applied by Machol et al. (2012) are the most comparable to the analysis applied in this study. Machol et al. (2012) evaluated the use of the OP-2010 model as an operational forecast model for visible aurora by assessing the deterministic ability of the model to forecast the location of the aurora compared to UVI observations. We have similarly examined how well the OP-2013 model performs as a deterministic forecast of the location of the aurora, although using IMAGE FUV data as our ground truth and utilizing the ROC curves and scores to examine the performance of the model. Extending this, we have also examined the validity of the forecast probabilities of aurora occurring as well as examining the performance of the model with season, local time and geomagnetic activity.
The most notable difference between the analysis presented in this study and the analysis of Machol et al. (2012), other than the updated model in this study, is the determination of the ground-truth data. Machol et al. (2012) compared the locations of model predictions of electron fluxes exceeding 1 2 1 erg cm s   and auroral luminosities from Polar UVI exceeding 0.25 kR whereas we used auroral luminosity boundaries determined from IMAGE WIC data by Longden et al. (2010). As such, a direct comparison between the results cannot be used to infer any change in performance between the OP-2010 and OP-2013 models, but may still be informative. Table 1 shows the verification statistics calculated from the 10%, 50%, and 80% binary event analysis in this study with the results of Machol et al. (2012). In the study by Machol et al. (2012), the results were presented in terms of the false alarm ratio (as defined by Wilks, 2006). In Table 1, we present our results for the false alarm rate and the false alarm ratio, for comparison with the results of Machol et al. (2012). The equations for all the verification statistics in Table 1 are provided in the supporting information. Comparing the results of Machol et al. (2012) and the 10% bin from this analysis, all of the statistics are within 15%. Machol et al. (2012) found that by increasing the energy flux threshold used to define the location of the auroral boundaries, resulted in an increase in the number of false positives and a decrease in the number of false MOONEY ET AL.  Machol et al. (2012) negatives in the truth table. The binary event analysis presented here, similarly shows that as the probability threshold is increased, the number of false positives increases and the number of false negatives decreases.
Overall, the results of the verification statistics from both studies show a similar performance for both the OP-2010 and OP-2013 generations of the model. We caution that the results of these two studies cannot be directly compared to assess improvements made between the two generations of the model. Differences in the results between the two studies presented in Table 5.3 may reflect the upgrades made to the model between the OP-2010 and OP-2013 generations, however due to the differences in the observational datasets and the definition of the observed auroral boundaries between this study and the study by Machol et al. (2012), the comparison of the two sets of results cannot be used to quantify the upgrades implemented in the model.

Conclusions
In this study, we have evaluated the performance of the version of OP-2013 that was used operationally by the Met Office in daily space weather forecasts by comparing the forecast outputs with the location of the auroral oval identified from IMAGE FUV data by Longden et al. (2010). We have applied forecast evaluation techniques which are routinely used in terrestrial weather forecast verification to assess both the deterministic and probabilistic nature of the auroral forecast model. Overall, the OP-2013 model performed well at predicting the location of the auroral oval, with ROC scores of between 0.70 and 0.86, although the forecast skill was notably lower around noon (ROC score of 0.59) and at higher Kp (ROC score of 0.55, for Kp = 8).
The reliability analysis showed that the observed frequency of aurora is constant at 80%-90% for forecast probabilities of ∼20% and above and does not scale linearly with increasing forecast probability. This results in the lower forecast probabilities of 20% being significantly underpredicted, by a factor of 6, that is, the aurora occurs six times more frequently than the model predicts for a forecast probability of 20%. The highest forecast probabilities of ∼90%-100% are overpredicted by up to approximately 20%; that is, the aurora occurs up to 20% less frequently than the model predicts for these high forecast probability values. The majority of forecast probabilities are skillful with the exception of the 10% and 20% probabilities which are substantially underpredicted. The results of the reliability analysis from this study could be used to recalibrate the forecast probabilities of aurora occurring and improve the Met Office auroral forecasts.
The ROC and reliability analysis presented in this study show a robust methodology that is widely used in terrestrial weather forecast verification that can also be applied to a wide range of space weather forecast models which have an appropriate set of observations to use in the analysis. These methods can be used to fairly compare forecasts from similar models or to quantify improvements made to space weather models during model development. The results presented in this analysis provide a performance benchmark against which upgrades to the OP-2013 auroral forecast model or alternative auroral forecast models can be fairly and quantitatively tested. Our analysis also highlights the further insight into the reliability of the forecast probabilities of aurora occurring output by the model from using attribute diagrams in addition to calculating the Brier skill score, compared to solely using the Brier skill score.

Data Availability Statement
The auroral hindcast data set produced from the Met Office operational implementation of the Ovation-Prime 2013 nowcast model that was used in this study was provided by the Met Office and is available at http://doi.org/10.5281/zenodo.4653288. The Ovation-Prime 2013 code was provided to the Met Office by Rodney Viereck of the NOAA Space Weather Prediction Center and the original Ovation-Prime code was developed by Patrick Newell and colleagues at Johns Hopkins University. Historic solar wind data from the ACE satellite was provided by Douglas Biesecker at the National Oceanic Atmospheric Administration and are available at https://sohoftp.nascom.nasa.gov/sdb/goes/ace/monthly/. Auroral boundary data were derived and provided by the British Antarctic Survey based on IMAGE satellite data (https://www. bas.ac.uk/project/image-auroral-boundary-data/). These boundary data are freely available from https:// doi.org/10.5285/75aa66c1-47b4-4344-ab5d-52ff2913a61e. The IMAGE FUV data were provided courtesy of the instrument PI Stephen Mende (University of California, Berkeley). IMAGE FUV data are archived at https://cdaweb.gsfc.nasa.gov/pub/data/image/fuv/. Kp index data were provided by ISGI GFZ Potsdam.