Assessing Machine Learning Techniques for Identifying Field Line Resonance Frequencies From Cross‐Phase Spectra

Monitoring the plasmasphere is an important task in the space weather context. A consolidated technique consists of remotely inferring the equatorial plasma mass density in the inner magnetosphere from estimates of Field Line Resonance (FLR) frequencies. FLR frequencies can be obtained via cross-phase analysis of magnetic signals recorded at pairs of latitudinally separated stations. In recent years, machine learning (ML) has been successfully applied in space weather, but this is the first attempt to estimate FLR frequencies with these techniques. We survey several supervised ML algorithms for identifying FLR frequencies using measurements of the European quasi-Meridional Magnetometer Array (EMMA). Our algorithms take as input the 2-hour cross-phase spectra of magnetic signals and return the FLR frequency as output; we evaluate the algorithm performance on four different station pairs from L = 2.4 to L = 5.5. Results show that tree-based algorithms are robust and accurate models for this task. Their performance slightly decreases with increasing latitude and tends to deteriorate during nighttime. The estimation error does not seem to depend on the geomagnetic activity, although at high latitudes the error increases during highly disturbed geomagnetic conditions such as the main phase of a storm. Our approach may become a prominent space weather tool as part of an automatic monitoring system of the plasmasphere. This work represents only a preliminary step in this direction; applying the technique to a more extensive data set and to more station pairs is straightforward and necessary to create more robust and accurate models.

and its higher harmonics the cross-phase presents a maximum and the amplitude ratio crosses one with a positive slope. One or both of these characteristics can be used to identify FLR frequencies, which, in turn, depend on the magnetic field strength and on the plasma mass density distribution along the field line. By making assumptions on the magnetic field topology and on the field line plasma distribution, the governing wave equation (Singer et al., 1981) can be numerically solved, providing an estimate of the plasma mass density at the equatorial crossing point of the field line.
Using EMMA observations, Del Corpo et al. (2019) created a data set of validated frequencies considering only the fundamental harmonic and employing a semi-automated algorithm which performs the above procedure. They used the TS05 model (Tsyganenko & Sitnov, 2005) to represent the geomagnetic field lines and assumed a power-law dependence of the plasma mass density along them. The validation procedure takes care of possible failures of the automated part. One of the most critical situations concerns the detection of FLR frequencies at high latitudes. They can be very close to the spectral resolution, so the fundamental harmonic can be hard to detect, especially during prolonged quiet periods, when the FLR frequency usually decreases. In this case higher harmonics could be accidentally picked up by a fully automated procedure (Del Corpo et al., 2019). Other issues can emerge when a pair is sounding the region near the PBL, where the selected FLR frequency can be mismatched or can even disappear (Fraser et al., 2005; Menk et al., 2004; Milling et al., 2001).
Based on the gradient method mentioned above, a few authors in the past developed algorithms to automatically detect FLRs (Berube et al., 2003; Chi et al., 2013; Lichtenberger et al., 2013; Wharton et al., 2018). In these works, either statistical or morphological properties of the cross-phase spectra were used to identify the resonance peaks. In this study, we propose an alternative approach based on machine learning methods, using the cross-phase spectra by Del Corpo et al. (2019) as input data and their data set of validated frequencies to train the algorithms.
Machine learning techniques represent a subset of artificial intelligence, and they have turned into one of the most powerful tools for addressing space weather problems. Since the early 90s, machine learning has benefited from the high quantity and quality of both space and ground-based data. Nowadays machine learning methods are applied to the most varied space weather topics, and they can be grouped according to their final goal into four macro categories: (i) algorithms for event identification, such as solar image classification (Armstrong & Fletcher, 2019), near-Earth plasma region classification (Breuillard et al., 2020) and the classification of periods with ULF wave activity (Balasis et al., 2019); (ii) methods for revealing causality between high-dimensional data sets and specific events, see Wing et al. (2018) and Heidrich-Meisner and Wimmer-Schweingruber (2018); (iii) forecasting algorithms, widely used for predicting solar flares (Massone et al., 2018), the arrival time of coronal mass ejections (Liu et al., 2018) and the behavior of various geomagnetic indices (Chandorkar & Camporeale, 2018); and (iv) algorithms for modeling non-linear relationships, which try to reveal the physical action of the system starting from first principles (Boynton et al., 2018).
Despite their wide application, machine learning techniques have been applied to the characterization of plasmaspheric regions only in the past few years. The main contribution is from Zhelavskaya et al. (2016, 2017), who applied a convolutional neural network for estimating electron mass density. To the best of our knowledge, the only attempt at estimating FLR frequencies using a ML approach was developed by Fujimoto et al. (2019) for a pair of stations in New Zealand (mid-latitude).
In this work, we assess various machine learning methods for addressing the challenging task of identifying FLR frequencies by using ground-based measurements.
In Section 2 we briefly introduce the machine learning algorithms adopted for this analysis. Section 3 describes in detail the set of data obtained from the EMMA network. Section 4 focuses on the pre-processing, feature selection and model selection procedures which compose the whole machine learning process. Finally, in Section 5 we discuss the results and in Section 6 we outline the main findings and the developments enabled by this analysis.

FOLDES ET AL.
After the training process, SVM and KRR have learned the same model shape, though they use different cost functions, that is, quadratic for KRR and ε-insensitive for SVM. Another difference between the two algorithms is the execution time: since KRR can be computed in closed form, it is faster on medium-sized (fewer than 10^4 samples) data sets, whereas SVM can learn sparse models, resulting in faster performance on massive sets (Murphy, 2012).
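To make the closed-form remark concrete, the following sketch fits a toy kernel ridge regressor with an RBF kernel in pure Python. The data, the hyperparameter values and all function names are invented for illustration; this is not the implementation used in this work.

```python
import math

def rbf(x1, x2, gamma=1.0):
    """RBF (Gaussian) kernel between two scalar inputs."""
    return math.exp(-gamma * (x1 - x2) ** 2)

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def krr_fit(xs, ys, alpha=0.01, gamma=1.0):
    """Closed-form dual coefficients: (K + alpha*I)^-1 y."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j], gamma) + (alpha if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, ys)

def krr_predict(xs, coefs, x, gamma=1.0):
    return sum(c * rbf(xi, x, gamma) for c, xi in zip(coefs, xs))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.8, 0.9, 0.1]
coefs = krr_fit(xs, ys)
pred_train = krr_predict(xs, coefs, 1.0)   # close to the training value 0.8
pred_new = krr_predict(xs, coefs, 1.5)     # interpolated prediction
```

The single linear solve is what makes KRR fast on medium-sized data sets; an SVR, by contrast, solves a constrained optimization problem but keeps only the support vectors.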

Decision Trees
Differently from single-stage algorithms, such as those described in the previous section, decision trees (DTs) are multi-step models which recursively divide the data into two partitions (binary split). Starting from the root node, which contains all the data, the tree structure is composed of intermediate nodes (splits) and terminal nodes (leaves).
The process involves two macro phases: the splitting procedure and the pruning phase. The main objective of pruning techniques (see Esposito et al., 1997, for a review) is to generalize the tree structure and hence avoid over-fitting, see Figure 1. Pruning techniques can be controlled by various parameters which drive the growth of DTs, such as the minimum number of samples in a split/leaf or the maximum number of features to consider for the best split.
The splitting procedure determines the growth of a DT; at each iteration the central operation is finding the best split for the data set according to some error measure (Buntine & Niblett, 1992). The tree growth continues until no further splitting produces a significant gain in the error measure or a stopping criterion is satisfied. DTs are well suited to processing massive data sets: they handle missing values and mixed data (ordinal, categorical and continuous), perform feature selection, ignore redundant features, and produce a user-friendly final structure.
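The core splitting step just described can be sketched in a few lines: scan candidate thresholds on one feature and keep the one minimizing the summed squared error of the two partitions. Data and names are invented for illustration.

```python
def sse(vals):
    """Sum of squared deviations from the partition mean (the error measure)."""
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(xs, ys):
    """Return (threshold, error) of the best binary split on a single feature."""
    best_t, best_err = None, float("inf")
    for t in sorted(set(xs))[1:]:          # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        err = sse(left) + sse(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Two clearly separated groups: the best threshold falls between them.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
threshold, error = best_split(xs, ys)
```

A full tree applies this search recursively to each resulting partition until a stopping criterion is met.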
One of the major shortcomings is poor generalization, which makes this kind of algorithm prone to over-fitting and highly unstable with respect to slight variations in the input data (Li & Belford, 2002). This error, namely the variance error, can be addressed using ensemble (or "forest") models, which combine the results of numerous trees (usually by averaging, e.g., Random Forest) or combine various DTs during the growth phase (e.g., gradient boosting methods).

Ensemble Methods
The central concept behind ensemble methods is that multiple weak estimators can achieve better results than a single complex model; they are more efficient than DTs in reducing the variance error (Breiman, 1996).
Let us consider many shallow DT regressors (usually hundreds) which give K solutions of a real-valued function, F̂_1(X), F̂_2(X), …, F̂_K(X). We can build the final prediction as a linear combination of these estimates,

F̂(X) = Σ_{j=1}^{K} a_j F̂_j(X),

where the coefficients a_j determine the type of procedure adopted in constructing the ensemble. Two main approaches are commonly used: bootstrap aggregating (or bagging) methods, which use the average as an estimator, that is, a_j ≡ 1/K, and boosting techniques, which employ coefficients dependent on the number of estimators (Buhlmann, 2012).
Similarly to bagging methods, the Random Forest (RF) algorithm averages the solutions of many estimators; still, it considers for splitting the data only a subset of features, d < D, producing highly varied estimators and resulting in faster performance on high-dimensional data sets. In RF each tree grows independently, while boosting algorithms create one tree at a time to recursively improve the prediction of the previous model. Usually, RF obtains better performance than bagging methods but worse than boosting ones.
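The variance-reduction mechanism of bagging can be illustrated with a deliberately minimal example, in which each "weak estimator" is just the mean of one bootstrap resample and the ensemble averages K of them (a_j = 1/K). Data and names are invented; a real Random Forest additionally randomizes the features used at each split.

```python
import random

random.seed(42)
data = [random.gauss(10.0, 2.0) for _ in range(200)]
sample_mean = sum(data) / len(data)

def weak_estimate(points):
    """One weak estimator: the mean of a bootstrap resample of the data."""
    boot = [random.choice(points) for _ in points]
    return sum(boot) / len(boot)

K = 100
single = weak_estimate(data)                              # one weak model
bagged = sum(weak_estimate(data) for _ in range(K)) / K   # ensemble average
```

The single bootstrap estimate scatters around the sample mean, while the bagged average of K such estimates has a much smaller spread, which is the effect exploited by forest models.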
A wide group of boosting methods are those related to the gradient descent technique, widely known as gradient boosting models. Considering, for instance, a quadratic loss function, L = (1/2) Σ_n [y_n − F(x_n)]², gradient boosting methods act to minimize L by adding further estimators according to the value of the negative gradient, r(x_n) = y_n − F(x_n) = −∂L/∂F(x_n), which represents how far the model is from a reliable estimation. Gradient boosting algorithms can be directly generalized to different loss functions without modifying any part of the procedure.

Figure 1. Illustration of a typical scenario occurring during hyperparameter tuning for every machine learning method. As the model complexity increases, the estimation error on the training set decreases; still, the model parameters become too specific and hence are not able to reproduce the same high performance on unseen data (i.e., the validation set), namely over-fitting.
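The residual-fitting loop just described can be sketched with one-split "stumps" as weak learners and a quadratic loss, for which the negative gradient is simply the current residual. Data, learning rate and names are invented for illustration; this is not the XGB or LGBM implementation.

```python
def fit_stump(xs, ys):
    """Best single-split regressor: (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    return best[1:]

def gradient_boost(xs, ys, n_stages=50, lr=0.3):
    """Each stage fits a stump to the negative gradient of the quadratic
    loss, r_n = y_n - F(x_n), i.e., the current residuals."""
    pred = [0.0] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, pred)]
        t, vl, vr = fit_stump(xs, residuals)
        pred = [p + lr * (vl if x < t else vr) for p, x in zip(pred, xs)]
    return pred

xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 1.0, 1.0, 4.0, 4.0, 4.0]
pred = gradient_boost(xs, ys)
```

Each stage shrinks the residuals by a constant factor on this toy data, so after a few dozen stages the ensemble reproduces the targets almost exactly.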
In this analysis, we focus our attention on two particular boosting algorithms: eXtreme Gradient Boost (XGB, Chen & Guestrin, 2016) and Light Gradient Boost Model (LGBM, Ke et al., 2017). Both methods have shown in the last years to address regression/classification problems with high and robust performance.
More in detail, XGB is a recent algorithm that has proved highly efficient in regression/classification competitions (see www.kaggle.com) and in many real-world applications (Ivanov et al., 2020; Kunte & Panicker, 2020; Luckner et al., 2017). XGB employs the second-order derivative during gradient descent, resulting in faster and more refined performance. It is particularly reliable in handling massive data sets because of its parallel implementation. Despite the numerous novelties and improvements of XGB, the algorithm can be inefficient with high-dimensional data sets, because its splitting method is very time-consuming.
To speed up the splitting procedure, LGBM combines two additional methods, namely Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). GOSS, in combination with EFB, picks (usually at random) only a small portion of the data points with small gradients when estimating the information gain, since these points give minor contributions to the gradient descent. In addition, compared to XGB, LGBM can handle categorical data directly, without requiring any pre-processing phase.

Data
The plasmaspheric mass density can change drastically during active geomagnetic periods, and the associated field line resonance frequencies change accordingly. Therefore, a good data set, suitable to train ML algorithms, should cover the most comprehensive range of geomagnetic disturbance levels. The data set by Del Corpo et al. (2019) satisfies this requirement: it comprises prolonged quiet geomagnetic periods as well as disturbances of different intensity, including 13 geomagnetic storms, nine of them with a minimum Dst value below −100 nT. For each period, the analysis was carried out over a variable number of station pairs, depending on the data coverage. About 20 pairs were analyzed for each period, but only eight are common to the whole data set.
This work aims to test the viability of a machine learning algorithm to pick up trusted FLR frequencies from magnetic field cross-phase spectra collected by a network like EMMA. Since the frequency generally decreases with latitude, such an algorithm should be tested not only in different geomagnetic conditions, but also on data collected at different latitudes. For this reason, among the eight pairs common to the entire data set, we selected the four that cover the largest range of latitudes; they are summarized in Table 1.
The data set consists of daily Fourier cross-phase spectra computed with a 2-h moving window every half hour. Each 2-h window is divided into a variable number of sub-intervals that depends on the latitude (increasing from 1 to 7 with decreasing latitude). The spectra are evaluated for each sub-interval and successively averaged. This implies a spectral resolution that increases with increasing latitude, a condition necessary to detect FLR frequencies that vary by more than a factor of 10 moving from the lowest to the highest latitude station pair. Each final spectrum is smoothed over 9-11 samples, and the fundamental resonance frequency is eventually provided following a selection and validation process composed of both automated and manual steps.
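The core of a cross-phase estimate can be illustrated with a short, self-contained sketch using synthetic signals and a plain DFT; this is only a didactic example, not EMMA's processing chain, and all numbers are invented.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(N^2)), fine for a short example."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

# Two synthetic "station" signals: the same oscillation at bin k0, with the
# second signal lagging the first by 60 degrees.
N, k0 = 100, 5
s1 = [math.cos(2 * math.pi * k0 * n / N) for n in range(N)]
s2 = [math.cos(2 * math.pi * k0 * n / N - math.pi / 3) for n in range(N)]

X1, X2 = dft(s1), dft(s2)
# Cross-spectrum at bin k0; its argument is the cross-phase at that frequency.
cross = X1[k0] * X2[k0].conjugate()
cross_phase_deg = math.degrees(cmath.phase(cross))
```

Scanning the cross-phase over all frequency bins and locating its maximum is, in essence, the signature used to identify the resonance.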
The fundamental resonance frequencies are provided with a time resolution of 30 min, although the coverage is not continuous. In fact, for several reasons, it is not always possible to detect FLR frequencies from cross-phase spectra: (i) data gaps in one or both stations; (ii) noise in the magnetic signals that inhibits the appearance of the resonance signatures in the spectra; (iii) quiescent ULF wave periods that prevent the excitation of standing Alfvén waves (Balasis et al., 2019). Statistical works have shown that the FLR detection rate also changes with latitude and local time, with a broad peak around noon; in particular, during nighttime it can be as low as 10% (Chi et al., 2013; Del Corpo et al., 2019).
To efficiently sustain a standing Alfvén wave, the field line should be fixed at its footprints, which implies an infinite ionospheric conductivity. This approximation is generally satisfied during daytime, when the ionospheric conductivity is sufficiently high, but after sunset the conductivity decreases, due to the predominant role of recombination processes, and it should be assumed finite (Allan & Knox, 1979). Whether the ionospheric conditions are able to sustain normal modes also during nighttime is still under debate. If the conductivity is sufficiently small, the observed modes should be interpreted as free-end modes. Recent observations by Takahashi and coworkers, however, reported FLRs across the midnight sector compatible with a fixed-end mode configuration. In some cases, only one footprint is sunlit while the other is in darkness, generating quarter-wave modes (Obana et al., 2008, 2015). These situations make the nighttime validated frequencies less reliable.
Although the data set contains some dubious frequencies, especially during nighttime hours, we kept all the frequencies for the training and used sunrise and sunset information to interpret the results according to the reliability of the validated frequencies; the main reason is that we want to maximize the number of samples for the training phase.

Input and Output Data
Every ML algorithm adopted in this analysis requires a so-called feature matrix as input, where rows represent different samples and columns the input features. As previously stated, following the well-established cross-phase technique, our input for the ML algorithms is the cross-phase spectrum, as shown in Figure 3. The output is the estimated value of the resonance frequency f_i for that sample.
Since we would like to test our ML algorithms on different station pairs, the input matrix is created separately for each pair. Typically, the resonance frequency decreases with increasing latitude, except across the PBL, where the behavior is the opposite. Thus, to obtain reasonable average spectral resolutions at every latitude along the EMMA network, different station pairs have cross-phase spectra of different lengths, resulting in feature matrices with different dimensions: TAR-BRZ and SUW-BEL spectra are composed of 212 frequency bins (between 0 and 120 mHz), while OUJ-HAN and MUO-PEL both have 285 input features, varying between 0 and 80 mHz for OUJ-HAN and between 0 and 40 mHz for MUO-PEL.
Input data are transformed using two subsequent operations to improve the performance of the ML algorithms: standardization, (x − μ)/σ, and normalization, (x − x_min)/(x_max − x_min). The same procedure is applied to the output frequencies.
As can be observed in Figure 4, the first transformation gives a more Gaussian-like shape to the frequency distribution, while the second one forces values between 0 and 1. Both operations are commonly adopted in ML applications and result in higher accuracy for every algorithm. After these transformations, the data are split into a training and a test set with roughly a 7:1 ratio (Figure 4); the former is used to create the ML model, while the latter evaluates the model performance.
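The two transformations and the split can be sketched as follows; the raw values and the helper names are invented for illustration, not the actual pipeline code.

```python
def standardize(vals):
    """(x - mu) / sigma"""
    mu = sum(vals) / len(vals)
    sigma = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5
    return [(v - mu) / sigma for v in vals]

def normalize(vals):
    """(x - x_min) / (x_max - x_min): forces values into [0, 1]."""
    lo, hi = min(vals), max(vals)
    return [(v - lo) / (hi - lo) for v in vals]

raw = [12.0, 15.0, 14.0, 30.0, 11.0, 18.0, 13.0, 16.0]
scaled = normalize(standardize(raw))

# Roughly a 7:1 split between training and test set.
n_train = round(len(scaled) * 7 / 8)
train, test = scaled[:n_train], scaled[n_train:]
```

Note that, to avoid information leakage, the split should be fixed before computing the scaling statistics on the training portion, as done in this work.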

Methods
Figure 5 depicts the entire ML pipeline built for FLR frequency identification. After the data pre-processing stage, described in the previous section, our procedure involves the application of typical ML techniques for creating the optimal model and subsequently improving its performance: feature selection and model selection.
Every ML algorithm is then trained via the cross-validation technique to further avoid over-fitting, producing a final model which is used to estimate frequencies from the test set.

Feature Selection
Feature selection techniques are one of the central concepts in ML, since the data set dimensionality, as well as the feature choice, severely affects the algorithm performance. Various works showed that numerous and redundant features may decrease the algorithm accuracy and robustness (Guyon & Elisseeff, 2003; Hua et al., 2004; Kapetanios, 2007; Peng et al., 2005). The reasons for including feature selection routines in a ML pipeline can be summarized in four main points: lower dimensional data sets are simpler to interpret and visualize for end-users (mitigating the curse of dimensionality); they increase the model generalization (avoiding over-fitting); they improve the algorithm accuracy; and they shorten data processing times. For all these reasons we include two different feature selection techniques, for reducing the data set dimensionality as well as for evaluating additional features.

Figure 4. The left panel shows the original data set, while the right panel represents the same data after the pre-processing stage for Tartu-Birzai. In both panels the distributions are divided into training (blue) and test (orange) sets. This division is performed before the pre-processing procedure.
Feature selection methods can be categorized into embedded, filter, and wrapper methods (Guyon & Elisseeff, 2003). Embedded methods are ML models which implement feature selection using different regularization techniques, that is, the l1/l2-norm for linear models and feature sub-sampling for DTs (Lal et al., 2006). Filters are model-independent methods which simply employ some correlation measure to find the optimal subset of features; commonly used measures are the mutual information, the point-wise mutual information, and Pearson's R correlation coefficient. Filters are computationally cheaper than the other methods. Finally, wrapper methods use the model performance (accuracy and robustness) as a metric for evaluating feature subsets. These methods are computationally heavy, especially with high-dimensional data sets, and often they allow only a random (or sparse) search among all the possible subsets.
For reducing the initial matrix dimensionality, we first compute the mutual information (MI) at every frequency bin and for the whole data set. Mutual information can be interpreted as a correlation measure between inputs and outputs, and it represents a powerful tool to identify even non-linear and non-monotonic correlations. Then we apply a time average over the entire period and compare the mutual information values, represented in Figure 6 as a color scale, with the average cross-phase spectrum (Figure 6, top panel). By imposing a confidence level α = 0.95 on the cumulative function of the mutual information (bottom panel of Figure 6, vertical dotted line), we keep only the frequencies below this threshold value. In this way we ensure that, despite the lower number of features, the data set contains at least 95% of the information needed to estimate the output frequencies. Figure 6 summarizes the results of this procedure for TAR-BRZ. As expected, the highest values of mutual information correspond to the first peak of the average cross-phase spectrum, meaning that, on average, this is the most informative part of the spectrum for predicting the frequencies. The first peak occurs at f_1 ≃ 14 mHz and is related to the first harmonic, since it perfectly overlaps with the blue interval (top panel in Figure 6) representing the 0.25 and 0.75 quantiles of the validated FLR frequency distribution (i.e., the interquartile range, IQR). The second peak occurs at f_2 ∼ 50 mHz; the frequency ratio between the two peaks is f_2/f_1 ∼ 3.6, which is compatible with the expected ratio between the third and the fundamental harmonics (Schulz, 1996).
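A toy version of this selection step, using a discrete mutual-information estimate and a cumulative cutoff, can be sketched as follows. The three "frequency bins" and the target values are invented; the real procedure works on continuous cross-phase values.

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Discrete mutual information (in nats) from empirical frequencies."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

# Three toy "frequency bins": the first is a copy of the output, the second
# is partially related, the third is an uninformative constant.
target = [0, 1, 0, 1, 0, 1, 0, 1]
features = [
    [0, 1, 0, 1, 0, 1, 0, 1],   # perfectly informative
    [0, 1, 0, 1, 1, 0, 0, 1],   # partially informative
    [0, 0, 0, 0, 0, 0, 0, 0],   # uninformative
]

mis = [mutual_info(f, target) for f in features]
# Keep the leading bins until 95% of the total information is covered.
total, cum, kept = sum(mis), 0.0, []
for i, mi in enumerate(mis):
    kept.append(i)
    cum += mi
    if cum >= 0.95 * total:
        break
```

On this toy data the cutoff keeps the two informative bins and discards the constant one, mirroring how the 95% threshold trims the least informative parts of the spectra.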

Applying this procedure to the four pairs, we found that, independently of the station pair, the 95% threshold is reached using only 80% of the initial number of features, resulting in 170 features for TAR-BRZ and SUW-BEL, and 230 for OUJ-HAN and MUO-PEL.
The second part of the feature selection stage is the evaluation of additional features using wrapper methods. These novel features are spatio-temporal information, solar activity proxies, and geomagnetic activity indicators derived from geomagnetic indices. All the additional features analyzed in this section are summarized in Table 2. Temporal features are represented by the day of year (DoY), to highlight the seasonal variation of the resonance frequency (Vellante et al., 2007), and the decimal hour, because it is commonly known that resonance frequencies show a diurnal modulation (e.g., Chi et al., 2013; Del Corpo et al., 2019).
To consider the effects of solar activity, mostly for its impact on the production of ion-electron pairs at ionospheric heights, we take into account the daily solar radio flux at 10.7 cm (F107, see Table 2) as provided by OMNIWeb. From the same database we also consider the y-component of the solar wind electric field (EField in Table 2) to investigate any impact of the convection electric field.
Since the frequency also depends on the geomagnetic field line length, we consider the equatorial crossing distance (r_eq), using the TS05 model to map the field line from the station pair position.
Geomagnetic activity indicators are evaluated to account for the FLR frequency variations due to geomagnetic activity. The use of such indicators is widespread in the literature (e.g., Carpenter & Anderson, 1992; Del Corpo et al., 2019; Gallagher et al., 1988; Moldwin et al., 2002; Sheeley et al., 2001; Takahashi et al., 2006). The simplest indicators are the geomagnetic indices themselves. For this analysis, we consider the Kp, Dst, and AE indices provided by the OMNIWeb database (https://omniweb.gsfc.nasa.gov/). More sophisticated indicators, which take into account some effects of the magnetospheric dynamics, can be derived from the Kp and Dst indices. In this work we consider two main approaches: in the first, following Moldwin et al. (2002), the maximum/minimum Kp/Dst is evaluated over a period τ preceding the analyzed time t_0; in the second, following Gallagher et al. (1988), the average Kp over a period τ preceding t_0 is evaluated, weighted with an exponential function of the form w(t) = exp[−(t_0 − t)/τ]. Both procedures are applied for values of τ in the range 12-72 h with a step of 12 h, producing three groups of indices. The first is related to the Kp maximum in the preceding period and is reported in Table 2.

Notes (Table 2). The geomagnetic indices, the solar radio flux, and the convection electric field are derived from the OMNIWeb database.

Figure 7. R² is the score obtained using only the cross-phase as input features. Methods with lower performance (Kernel Ridge Regression and Support Vector Regression) improve with additional features while, as the model accuracy increases, the additional features become less relevant. A comparison of these panels makes it difficult to conclude which features could be more relevant for improving predictions.
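The exponentially weighted Kp indicator can be sketched as follows; the hourly Kp values are hypothetical, and the sign convention is chosen so that the weights decay into the past (recent values count more).

```python
import math

def weighted_kp(times, kps, t0, tau):
    """Average Kp over the window [t0 - tau, t0], weighted by
    w(t) = exp(-(t0 - t)/tau), so recent values count more."""
    pairs = [(math.exp(-(t0 - t) / tau), k)
             for t, k in zip(times, kps) if 0 <= t0 - t <= tau]
    wsum = sum(w for w, _ in pairs)
    return sum(w * k for w, k in pairs) / wsum

hours = list(range(73))                      # 72 h of hourly samples
quiet = [3.0] * len(hours)                   # constant activity
rising = [1.0] * 70 + [6.0, 7.0, 8.0]        # activity rising just before t0

flat = weighted_kp(hours, quiet, t0=72, tau=24)
recent = weighted_kp(hours, rising, t0=72, tau=24)
plain = sum(rising[72 - 24:]) / 25           # unweighted 24-h mean
```

For constant activity the indicator reduces to the plain average, while a rising activity near t_0 is emphasized with respect to the unweighted mean, which is the intended behavior of the Gallagher-style weighting.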
At each step, the selected feature is evaluated multiple times using the cross-validation technique; in this way we can produce a confidence interval defined by the standard deviation of the cross-validation score (orange area in Figure 7).
From the six panels of Figure 7 we observe that the additional features have less influence as the method complexity (and model performance) increases. Except for KRR and SVR, which show poor performance, the other ML methods considered in this analysis (DTR, RF, LGBM, and XGB) exhibit basically the same behavior. Algorithms that reach higher performance tend to return similar results also when adding novel features, meaning that for these algorithms the information from the cross-phase is sufficient to achieve the best accuracy. Another aspect is that it is impossible to select a set of additional quantities yielding the best performance for all the considered methods. Similar results are found for the other station pairs.
Hence, since no additional feature results in a significant improvement of the algorithms performance, we consider as input the cross-phase spectra only.

Model Selection
The choice of the optimal ML method for a specific task and the procedure for tuning the model hyperparameters are both objectives of model selection. In Section 5.1 we will address the choice of the most suitable ML method for predicting resonance frequencies, by observing the model complexity and the accuracy for all four station pairs. In this section we focus on the selection of the best hyperparameters, which control the model complexity of the six ML methods described in Section 2.
The number, typology and value of these parameters determine the model complexity and its performance. It is worth noting that hyperparameters, optimized in this step, are significantly different for the various methods; their optimal values are summarized in Table 3.
The first two rows in Table 3 refer to kernel methods, which control the model complexity via the regularization terms, α for KRR and C for SVM. Higher values of α reduce collinearity between coefficients, leading to higher generalization, while as α approaches zero KRR reverts to a simple OLS. In SVMs applied to regression problems (Support Vector Regression, SVR), the regularization parameter C controls the width of the tolerance region around the division boundary: increasing C results in a narrower tolerance, hence more points lie close to the boundaries and more support vectors are required; conversely, lower values of C mean a wider tolerance and thus fewer points near the boundaries.
In both methods input data are transformed using an RBF kernel for creating non-linear regression models (see Section 2.1). The hyperparameter γ for kernel methods refers to the standard deviation adopted for the Gaussian kernel (RBF kernel).
For decision trees and ensemble methods the main hyperparameters controlling the tree structure and the ensemble population are the maximum depth (i.e., Max. depth) and the number of estimators (i.e., # estimators), as briefly explained in Section 2.3.
Since the hyperparameter space is not too wide, for selecting the optimal values we perform a random search (N = 3,000 points) over the hyperparameter grid of each ML method; for instance, the number of estimators is sampled from the grid [100, 200, 500, 700, 1000, 1500, 2000].
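A random search of this kind can be sketched as follows. The grid values and the stand-in objective are purely illustrative; in the actual pipeline the score of each sampled configuration is the cross-validation result of the corresponding ML model.

```python
import random

random.seed(0)
grid = {
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [100, 200, 500, 700, 1000, 1500, 2000],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}

def sample(g):
    """Draw one random configuration from the hyperparameter grid."""
    return {name: random.choice(values) for name, values in g.items()}

def score(params):
    # Stand-in objective with a known optimum at depth 7, learning rate 0.1;
    # in practice this would be the cross-validation score of the model.
    return -abs(params["max_depth"] - 7) - abs(params["learning_rate"] - 0.1)

candidates = [sample(grid) for _ in range(3000)]
best = max(candidates, key=score)
```

With 3,000 random draws over a grid of this size, essentially every combination is visited, so the search reliably recovers the optimum of the objective.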
The selection of the optimal model hyperparameters is essential in the ML pipeline to avoid over-fitting, reduce the variance error, and increase the model accuracy. We select the best method and model for identifying resonance frequencies by considering two main aspects: the accuracy of the method and the ability to create a common model for all four station pairs, reflected by more stable hyperparameters across the pairs (see Table 3).
After this stage, the next step is training/validation, during which all ML models are trained using the cross-validation (CV) technique with k_CV = 5 folds. CV divides the training set into k_CV partitions; at each step one partition is excluded from the training phase and used only for validation. By optimizing the results obtained on the validation set, over-fitting is avoided and the model generalization increases.
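The fold construction can be sketched in a few lines of plain Python (illustrative helper names; a real pipeline would typically also shuffle the indices).

```python
def kfold(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous, non-overlapping folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold(23, 5)
# At each CV step, one fold is the validation set and the rest form the
# training set.
splits = [(fold, [i for f in folds if f is not fold for i in f])
          for fold in folds]
```

Every sample is used for validation exactly once, and each model is trained on the remaining k_CV − 1 folds.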
The training set for each pair of stations is selected using two main criteria: all the station pairs must have the same proportion between training and test set, and the four pairs must share periods in the test set for a comparison. In particular, we selected DoY 148-162 (2013)

Models Evaluation
The first part of the results focuses on the performance comparison of the various ML methods. The accuracy in regression problems can be evaluated using various estimators: the coefficient of determination (R²), the mean absolute error (MAE), the mean absolute percentage error (MAPE), and the root mean squared error (RMSE):

R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²,
MAE = (1/N) Σ_i |y_i − ŷ_i|,
MAPE = (100/N) Σ_i |y_i − ŷ_i| / |y_i|,
RMSE = [(1/N) Σ_i (y_i − ŷ_i)²]^(1/2),

where y_i are the actual frequencies, ŷ_i the predicted frequencies, and ȳ the average value of y_i.
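The four metrics are standard and can be computed directly from their definitions; the toy values below are invented for illustration.

```python
def r2(y, yhat):
    """Coefficient of determination."""
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    """Mean absolute percentage error."""
    return 100.0 * sum(abs(a - b) / abs(a) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root mean squared error."""
    return (sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)) ** 0.5

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 4.0]   # one prediction off by 1
```

Note that RMSE penalizes large individual errors more strongly than MAE, while MAPE expresses the error relative to the magnitude of the true value.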
For the CV procedure we use R², since it weakly depends on the distribution of the data and is therefore more reliable for regression problems. The results of the CV procedure are shown in the boxplot of Figure 8 for TAR-BRZ; the other station pairs show a similar behavior, hence we show only one pair for the sake of simplicity. The first aspect emerging from this graph is a significant difference in accuracy between kernel-based (KRR and SVR) and tree-based models. The main reason for such a difference could lie in the data: since the cross-phase spectra are discrete, the validated frequencies can assume only a discrete set of values and, by construction, tree-based methods are more suitable, and therefore more accurate, for estimating discrete-like data.
From Figure 8 we can also argue that DTR has a high variance error, meaning that the model is too simple for the specific task. This statement is also supported by the high variability of its hyperparameters across the station pairs (see Table 3). On the other hand, RF achieves high accuracy, but it does not appear sufficiently robust against high-frequency outliers, as highlighted by a large skewness and the presence of a flier point beyond the ends of the whiskers.
LGBM and XGB behave similarly, also in terms of performance, resulting in robust algorithms that avoid over-fitting and reduce the variance error. The two methods also show similar hyperparameter variability across the four station pairs (see Table 3); the main difference lies in the execution time (see Table 4): XGB processes the same number of samples in half the time with respect to LGBM.
After the CV stage, each ML method is applied to the prediction of the resonance frequencies of the test set for each station pair. The frequencies obtained during both the training and the test phase are available from https://doi.org/10.5281/zenodo.4304911 (Foldes et al., 2020). The results for TAR-BRZ (DoY 78-82 and 145-162, 2013, N test = 619) are summarized in Table 4. Considering the parameter stability in Table 3 and the algorithm performance in Table 4, XGB and LGBM show almost the same robustness and accuracy. We choose XGB as the best-performing of the two for two main reasons: it always employs fewer estimators than LGBM to reach the same performance, resulting in a simpler model, and its faster execution time makes it more suitable for real-time applications.
To further explore the performance of the six ML methods we show, as an example, the relative error as a function of time (Figure 9) for a particular case study: 2013, DoY 151 (31 May). The relative error is the difference between the estimated frequency (f meth) and the observed one (f obs), normalized to f obs. The intervals in which one or both field line footprints are in darkness are represented as light or dark shadowed areas, respectively. The sunrise and sunset times (edges of the shadowed areas) are evaluated at a height of 120 km, as suggested by Del Corpo et al. (2019). The error increases during nighttime for all the methods, probably because, as stated above, the night frequencies are harder to detect and the corresponding cross-phase spectra can be poorly informative. During daytime the prediction error is remarkably low for the ensemble methods (especially LGBM and XGB), as previously pointed out in Table 4, and its fluctuations show a Gaussian-like behavior.
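The relative error used in Figure 9 is straightforward to compute; in this NumPy sketch the variable names (`f_meth`, `f_obs`) mirror the notation above, while the frequency values and the day/night mask are invented for illustration.

```python
import numpy as np

# Observed and estimated FLR frequencies (mHz) for a few 2-hour intervals
f_obs = np.array([22.0, 24.0, 26.0, 18.0])
f_meth = np.array([22.5, 23.5, 26.0, 21.0])

# Relative error: (f_meth - f_obs) / f_obs
rel_err = (f_meth - f_obs) / f_obs

# Hypothetical mask flagging intervals with both footprints sunlit;
# the daytime MAPE follows by averaging |rel_err| over those intervals
is_day = np.array([True, True, True, False])
daytime_mape = 100.0 * np.mean(np.abs(rel_err[is_day]))
```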
As stated above, XGB is the most suitable method for estimating FLR frequencies; hence, from this point on, we perform further analyses using only the XGB method.
In Figure 10, we show an overview of the results obtained on the test set of TAR-BRZ. The first scatter plot (left panel) contains the median value of the predicted frequencies, with the interquartile range (IQR) represented as bars. To compute the uncertainty of the model estimates we use two different approaches: for single-estimator algorithms (i.e., KRR, SVR, and DTR) the uncertainty is computed using N = 1,000 different runs, while for multi-estimator algorithms (i.e., RF, LGBM, and XGB) the IQR is derived from the distribution of the estimators. The error increases when we observe a broader spread of the estimators; the uncertainty seems to increase slightly with frequency, but the error is clearly greater when points are farther from the actual frequency value.

Table 4
The General Performance of the Six ML Algorithms for TAR-BRZ
Note. At first glance, we observe a substantial difference between kernel-based (Kernel Ridge Regression and Support Vector Regression) and tree-based models. Light Gradient Boost Model (LGBM) and eXtreme Gradient Boost (XGB) are quite similar methods, as we also see from the results, but XGB tends to be more robust for all four pairs of stations, since it reaches performance comparable with LGBM and RF using a lower number of estimators (i.e., a simpler model, see Table 3); hence we consider XGB the best-performing algorithm for this task.
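The multi-estimator uncertainty described above — the IQR across the per-tree predictions of an ensemble such as RF, LGBM, or XGB — can be sketched as follows; the synthetic per-tree predictions stand in for the output of a real trained ensemble.

```python
import numpy as np

def ensemble_iqr(per_estimator_preds):
    """Median and interquartile range across an ensemble's predictions.

    per_estimator_preds: array of shape (n_estimators, n_samples),
    one row per member of the ensemble (e.g., one row per tree).
    """
    preds = np.asarray(per_estimator_preds, dtype=float)
    median = np.median(preds, axis=0)
    q25, q75 = np.percentile(preds, [25, 75], axis=0)
    return median, q75 - q25

# Toy ensemble: 100 "trees" predicting one FLR frequency around 25 mHz;
# a broader spread of the trees translates into a larger IQR (error bar)
rng = np.random.default_rng(1)
preds = 25.0 + 0.5 * rng.standard_normal((100, 1))
median, iqr = ensemble_iqr(preds)
```

With scikit-learn's `RandomForestRegressor`, for instance, the per-tree rows could be obtained by calling `predict` on each member of `model.estimators_`.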
Colors of the scatter plot represent the day, night, and penumbral frequencies; the latter are those selected when only one footprint of the field line is sunlit. The division is made using the same criteria as Figure 9. Night and penumbral frequencies are harder to observe (Balasis et al., 2019; Del Corpo et al., 2019) and indeed they are less numerous. The main reason is that the cross-phase spectra are barely informative and hence FLR identification is more difficult. During penumbra, frequencies could correspond to quarter-wave mode events (Del Corpo et al., 2019; Obana et al., 2015). The right panel in Figure 10 compares actual and estimated frequencies; since the points accurately follow the identity line (dashed line), the mean values of the two distributions do not show a significant shift. At frequency values higher than 38 mHz, and between 25 and 32 mHz, points move below the bisector, which means that the estimated frequencies are slightly lower than the actual frequencies in these ranges. As can be noticed from the left panel of Figure 10, the lack of points at higher frequencies (≥40 mHz) produces higher errors, which nevertheless can be considered statistical fluctuations.

The June 2013 Geomagnetic Storm: A Case Study
As already stated in Section 4.2, the geomagnetic storm that occurred in June 2013 offers an excellent opportunity to test the algorithm performance, since it comprises highly variable geomagnetic conditions with comprehensive data coverage for all the considered station pairs. The two top panels of Figure 11 show the geomagnetic activity during this interval. A peculiarity of this event is the prolonged period of very low geomagnetic activity that characterizes the days preceding the storm, allowing us to test the efficiency of the algorithm also in such an unusual situation.
The last four panels in Figure 11 show, from top to bottom, the relative error of the estimated frequency for the four station pairs arranged in order of decreasing latitude. As in Figure 9, light and dark shadowed areas correspond to periods in which the field lines have one or both footprints in darkness, respectively. The average relative error at L = 2.4, 2.9, and 4.1 is about +1-2%, suggesting that the distribution of the deviations has a Gaussian-like shape. At L = 5.5 the average relative error is about +4.5%, suggesting a systematic overestimation by the algorithm, even if this larger value might be due to a preponderance of positive deviations on DoY 152, during which the main phase of the storm took place. To further evaluate the performance of the algorithm we can look at the average deviation from zero (the MAPE), which does not exceed 13%, although it increases from 4% to 13% with increasing latitude. As already pointed out, this could be due to the signal-to-noise ratio of the cross-phase spectra, which decreases as the latitude increases. During daytime hours the MAPE assumes the values 2.4%, 2.7%, 4.7%, and 6.9% at L = 2.4, 2.9, 4.1, and 5.5, respectively, suggesting that the method performs significantly better in these conditions. Again, there is the exception of DoY 152 at L = 5.5, for which the daytime MAPE is 17.5%, indicating that, at least for high-latitude pairs, the performance of the method deteriorates during highly dynamical geomagnetic conditions. Also during the extremely quiet conditions on DoY 150 there are some important deviations at L = 4.1 and 5.5, mainly during nighttime. Other than that, there is no evidence of any dependence of the relative error on the geomagnetic activity. It is important to note that almost all the situations in which the algorithm does not perform well are associated with fuzzy or low signal-to-noise-ratio cross-phase spectra, for which even the supervised selection can sometimes be questionable.
These issues can be better understood by looking at Figures 9 and 11. The main phase of the storm starts at the beginning of DoY 152; however, the magnetosphere is already perturbed in the afternoon of DoY 151, as clearly visible from the growth of the Kp index (see Figure 11). During this period the spectra are fuzzy for all the pairs but TAR-BRZ, and important mismatches between the estimated and validated frequencies occur at SUW-BEL on DoY 151 and at MUO-PEL on DoY 152. Such mismatches may not depend exclusively on the algorithm efficacy, but also on the presence of questionable points in the data set by Del Corpo et al. (2019), as is the case for the above-mentioned points at SUW-BEL. They created the data set taking particular care over daytime hours, when the physical assumptions made in the entire procedure are expected to hold, but keeping also nighttime frequencies to study the efficiency of the procedure in these conditions; occasionally, they also kept some unclear points such as the ones just described. In this preliminary work we used the entire data set as it was, analyzing also nocturnal frequencies, but to obtain an operative space weather tool the reliability of the selected night frequencies should be investigated and all questionable data points should be excluded. It is worth noting that, also during nighttime or disturbed geomagnetic conditions, the algorithm performs well when the spectra show smooth and clear traces of the resonance peak. A good example is given by the spectra of TAR-BRZ (panels c and g), in which the only notable deviations from the validated frequencies arise during discontinuities, as in the interval 02:30-03:30 of DoY 151 and at 16:30 on DoY 152, when a quarter-wave event and a PBL crossing probably occurred. In general, the good performance of the algorithm, mainly during daytime hours, is confirmed.

Conclusions
Our work represents a preliminary step toward building a tool for monitoring the plasmasphere dynamics using the whole EMMA network.
We have evaluated several ML algorithms, commonly adopted for regression problems, to estimate FLR frequencies from cross-phase spectra of magnetic field measurements. This survey has shown that ML techniques can be successfully applied to this challenging task; in our particular case, tree-based methods have proved to be the most reliable of the algorithms considered.
Results have shown a slight dependence of the estimation error on the station latitude: the MAPE increases from 4% to 13% passing from L = 2.4 to L = 5.5, but it remains acceptable even at high latitude. In fact, the percentage of points with an absolute relative error ≤0.1 is about 70% for MUO-PEL (L = 5.5), rises to 82% at OUJ-HAN (L = 4.1), and reaches 90% for TAR-BRZ (L = 2.9) and SUW-BEL (L = 2.4). An interesting aspect evidenced by this analysis is that the results generally do not depend on geomagnetic activity, except for the high-latitude pairs, for which extreme geomagnetic conditions (i.e., storm main phase, prolonged quiet conditions) can produce a higher estimation error. We have considered the June 2013 geomagnetic storm as a case study; except at L = 5.5, where the MAPE reaches values greater than 15% in some points, during this event the estimation error remains around its average value passing from quiescent days to the main phase of the geomagnetic storm, and eventually to the recovery phase. This feature makes our ML procedure particularly reliable for space weather purposes within the framework of plasmaspheric density nowcasting.
However, since we have trained our models only on times with observed FLR frequencies, our procedure is unable to distinguish whether a resonance has actually been triggered. To create a complete automated ML procedure suitable for real-time monitoring, an additional ML step may be required before the one we have already built. To classify times when the FLR frequency can or cannot be observed, we could follow an approach similar to that of Balasis et al. (2019), who adopted a fuzzy-logic neural network to classify periods with ULF activity in geomagnetic field time series.
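A minimal sketch of such a two-stage procedure follows; the quality gate here is a crude, hypothetical stand-in for a trained classifier, and the argmax stands in for the XGB regression stage — neither is part of the procedure actually implemented in this work.

```python
import numpy as np

def estimate_flr(spectrum, freqs, quality_threshold=0.5, regressor=None):
    """Two-stage sketch: first decide whether an FLR is detectable,
    then estimate its frequency.

    The 'classifier' here is a crude quality score (prominence of the
    cross-phase peak); a real system would use a trained model such as
    the fuzzy-logic network of Balasis et al. (2019).
    """
    spectrum = np.asarray(spectrum, dtype=float)
    # Stage 1: quality gate -- reject flat, uninformative spectra
    prominence = (spectrum.max() - np.median(spectrum)) / (spectrum.std() + 1e-12)
    if prominence < quality_threshold:
        return None  # no resonance detectable in this interval
    # Stage 2: regression -- a trivial argmax stands in for the trained XGB model
    if regressor is None:
        return freqs[int(np.argmax(spectrum))]
    return regressor(spectrum)

freqs = np.linspace(5.0, 50.0, 91)  # mHz
peaked = np.exp(-0.5 * ((freqs - 25.0) / 2.0) ** 2)  # clear resonance signature
flat = np.zeros_like(freqs)  # uninformative spectrum, gate rejects it
```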
The present analysis could be improved mainly by extending our training set. First, a larger data set means a more balanced proportion between the number of samples, N, and the number of features, M, which in our case is the number of frequencies in the cross-phase spectra. Then, with an extended data set, we could remove from the training data those FLR frequencies which present an unclear trace on the cross-phase spectrum and are hence considered less reliable (e.g., nighttime frequencies, quarter-waves). Lastly, a larger data set would allow us to test our procedure on a wider variety of geomagnetic conditions.
Our approach may be extended to all the EMMA station pairs and, in principle, to any magnetometer network, such as CARISMA (Mann et al., 2008) and McMAC (Chi et al., 2005). Testing this method on other pairs of stations, even ones longitudinally separated from EMMA, would allow us to obtain more information about the accuracy and robustness of our technique.