Prediction of Dynamic Plasmapause Location Using a Neural Network

As a common boundary layer that distinctly separates the regions of high‐density plasmasphere and low‐density plasmatrough, the plasmapause is essential to comprehend the dynamics and variability of the inner magnetosphere. Using the machine learning framework PyTorch and high‐quality Van Allen Probes data set, we develop a neural network model to predict the global dynamic variation of the plasmapause location, along with the identification of 6,537 plasmapause crossing events during the period from 2012 to 2017. To avoid the overfitting and optimize the model generalization, 5,493 events during the period from September 2012 to December 2015 are adopted for division into the training set and validation set in terms of the 10‐fold cross‐validation method, and the remaining 1,044 events are used as the test set. The model parameterized by only AE or Kp index can reproduce the plasmapause locations similar to those modeled using all five considered solar wind and geomagnetic parameters. Model evaluation on the test set indicates that our neural network model is capable of predicting the plasmapause location with the lowest RMSE. Our model can also produce a smooth magnetic local time variation of the plasmapause location with good accuracy, which can be incorporated into global radiation belt simulations and space weather forecasts under a variety of geomagnetic conditions.


Introduction
The plasmasphere, which is full of cold and dense plasma (electrons ∼1 eV with density ∼10 2 -10 4 cm −3 ), plays an important role in modulating the fluxes of energetic particles in the Earth's ring current and radiation belts (e.g., Cao et al., 2017;Darrouzet et al., 2009;Fu et al., 2020;Gu et al., 2011Gu et al., , 2012Gu et al., , 2020Hua et al., 2019;Kozyra et al., 1995;Lemaire & Gringauz, 1998;Ni et al., 2013;Sandel et al., 2003;Takahashi & Anderson, 1992;Yi et al., 2021;Zhou et al., 2020). The plasmasphere strongly influences the plasma properties of the inner magnetosphere by wave-particle interactions (e.g., Fu et al., 2016;Gary et al., 1994;Hua et al., 2020a;Ni et al., 2017Ni et al., , 2014Thorne & Horne, 1992;Wilson et al., 1992;Xiang et al., 2018;Young et al., 1981), and contributes to assessing the low-latitude boundary layer and plasma sheet (e.g., Cao et al., 2016;Elphic et al., 1997). The outer boundary of the plasmasphere is called "plasmapause," which is often defined as a transition region in which the plasma density drops dramatically by at least half an order of magnitude in a short distance of 0.5 R E (R E is Earth's radii) (e.g., Carpenter & Anderson, 1992;Moldwin et al., 2002). The position of the plasmapause (L pp ) is usually balanced by the convection electric field outside the plasmasphere and the corotation electric field inside the plasmasphere. The variation of L pp responds considerably to the changes of solar wind forcing and geomagnetic activity (e.g., Goldstein et al., 2003;Liu et al., 2015). Because the plasmapause is related to the plasmasphere, the ring current, and the radiation belts in the inner magnetosphere, where the wave-particle interactions inside and outside the plasmapause are quite different due to the sharp change in plasma density, the prediction of the plasmapause location becomes very important for magnetospheric research (e.g., Hua et al., 2020b;Khoo et al., 2018Khoo et al., , 2019Ma et al., 2020;Ni et al., 2015;Summers et al., 2008;Xiang et al., 2020;Zhang et al., 2018Zhang et al., , 2019.
Abstract As a common boundary layer that distinctly separates the regions of high-density plasmasphere and low-density plasmatrough, the plasmapause is essential to comprehend the dynamics and variability of the inner magnetosphere. Using the machine learning framework PyTorch and highquality Van Allen Probes data set, we develop a neural network model to predict the global dynamic variation of the plasmapause location, along with the identification of 6,537 plasmapause crossing events during the period from 2012 to 2017. To avoid the overfitting and optimize the model generalization, 5,493 events during the period from September 2012 to December 2015 are adopted for division into the training set and validation set in terms of the 10-fold cross-validation method, and the remaining 1,044 events are used as the test set. The model parameterized by only AE or Kp index can reproduce the plasmapause locations similar to those modeled using all five considered solar wind and geomagnetic parameters. Model evaluation on the test set indicates that our neural network model is capable of predicting the plasmapause location with the lowest RMSE. Our model can also produce a smooth magnetic local time variation of the plasmapause location with good accuracy, which can be incorporated into global radiation belt simulations and space weather forecasts under a variety of geomagnetic conditions. GUO ET AL.
In the past three decades, several statistics-based empirical models using satellite observations have been developed to study the dynamic variations of the plasmasphere and to predict the plasmapause location. Below we review briefly some important empirical models in this regard.
(1) Carpenter and Anderson (1992) (hereafter referred to as CA-1992) proposed a functional Kp-based empirical model to specify the plasmapause location, i.e., L pp = 5.6-0.46Kpmax, where Kpmax is the maximum Kp in the preceding 24 h. By defining the plasmapause as the position where the plasma number density drops by a factor of 5 or more within ΔL < 0.5, this model was developed for magnetic local time (MLT) = 00-15 based on the plasmapause crossing events obtained from the International Sun-Earth Explorer (ISEE 1) data (2) Based on the Combined Release and Radiation Effects Satellite (CRRES) measurements in 1990-1991, Moldwin et al. (2002) where a 1 , a mlt , b 1 , b mlt , and b ϕ are the fitted parameters, Φ denotes the magnetic local time in terms of      2 / 24 mlt , and Q is an indicator associated with the evolution of geomagnetic indices before the plasmapause crossing. A recent study of Bandić et al. (2016) (hereafter referred to as BAN-2016) revisited the same CRRES data sets and constructed a new plasmapause model which is similar to OBM-2003 but with different coefficients (3) The plasmapause locations extracted from the Imager for Magnetosphere-to-Aurora Global Exploration (IMAGE) extreme ultraviolet (EUV) imager data were used by Larsen et al. (2007)  Those models above are mostly developed as functions of solar wind and/or geomagnetic parameters with/ without the MLT dependence, in terms of certain fitting methods based on the statistics of satellite observations. Because the plasmapause varies significantly with changes of the solar wind and geomagnetic activity, a time-dependent model of the plasmapause, including its MLT dependence, is required for a variety of magnetospheric physics studies and magnetospheric dynamic simulations. However, in-situ satellite observations cannot provide this important information of the plasmapause location due to the limited spatiotemporal coverage of satellites in space. Regarding the empirical plasmapause models mentioned above, they use the parameter values at one specific time stamp as inputs to evaluate the plasmapause location, thereby possibly missing some underlying connection of the plasmapause evolution in a specific time period. Recently, the machine learning technique has emerged into its golden age with considerable developments of algorithms, tools, and high-speed computers (Camporeale et al., 2018). A number of models in the field of space physics and space weather have been established based on the machine learning technique and produced good performance (e.g., Bortnik et al., 2016;Chu et al., 2017aChu et al., , 2017bPires de Lima et al., 2020;Shprits et al., 2019;Zhelavskaya et al., 2016Zhelavskaya et al., , 2017Zhelavskaya et al., , 2019. Thus, in this study, we attempt to establish a machine learning model of the plasmapause location in terms of a neural network, aiming to predict the global distribution of the plasmapause location and its temporal evolution with good accuracy.
The outline of this paper is arranged as follows. In Section 2, we introduce the adopted instrumentation and data sets, the criteria that we use to distinguish a plasmapause crossing event, and the adopted back propagation neural network method with the selection of input parameters. Section 3 presents our neural network model results using either single or multiple input parameters, and examines the model performance. Finally, we make the conclusions in Section 4.

Data
In this study, we use the solar wind parameters (V SW , B Z ) and geomagnetic parameters (Kp, Dst, and AE indices) to predict plasmapause locations. These five parameters have good relations with the L pp , according to previous studies (e.g., He et al., 2017;Liu et al., 2015;Moldwin et al., 2002). The time series of these parameter values with 1 h resolution are obtained from OMNIWeb (http://omniweb.gsfc.nasa.gov/). The background electron densities are evaluated from the upper hybrid resonance frequencies measured by the High-frequency Receiver of the Electric and Magnetic Field Instrument Suite and Integrated Science (EMFISIS) instrument (Kletzing et al., 2013;Kurth et al., 2015) onboard the Van Allen Probes (VAP) (Mauk et al., 2012). For this study, we adopt the VAP Level-4 plasma density data during the period from September 2012 to December 2017.

Determination of Plasmapause Crossings
The plasmapause positions measured by Van Allen Probes are used to develop the plasmapause model and to test the accuracy of the model prediction. As widely adopted in previous studies (e.g., Liu & Liu, 2014;Liu et al., 2015;Moldwin et al., 2002), we define the plasmapause where the value of background electron density drops/increases by a factor ≥5 within 0.5 L. Using this criterion, we further determine the satellite position with the sharpest density gradient to represent the plasmapause location during a plasmapause crossing event. For reliability, we only consider the cases of one plasmapause crossing determined in each inbound and outbound satellite trajectory. Figure 1 shows an example of the plasmapause crossings observed by VAP-A on October 15, 2012. Figure 1a shows the temporal variation of the ambient electron density. The blue curve presents the VAP-A Level-4 plasma density data, which fluctuates largely when the electron density is observed below 100 cm −3 . To facilitate the identification of a plasmapause crossing, we adopt the Savitzky-Golay filter (Savitzky & Golay, 1964) to smooth the plasma density profile, as shown by the overplotted green curve. Here the length of window is 51 and the order is 3. Figure 1b shows the L-shell variation corresponding to the VAP-A trajectory.
Note that the dipole geomagnetic field model is used to obtain L values. Figure 1c shows the variation of the ratio between the maximum and minimum electron density during the 0.5 L interval of each data point (i.e., within ±0.25 L). If the density ratio is ≥5, such a satellite position is considered within the plasmapause region, as bounded by the dashed black vertical lines in each panel. Furthermore, the satellite position with the largest plasma density gradient is determined as the plasmapause location, as indicated by the red vertical lines. The plasma density gradient is computed by where i represents the number of the data points, and dg(i) denotes the plasma density gradient for the ith data point.
Applying the above identification criterion to Van Allen Probes measurements, we finally register 6,537 plasmapause crossing events during the period from September 2012 to December 2017. The plasmapause crossing events without valid solar wind parameters are removed to generate a robust database.  Figure 2a, it is clear that the plasmapause crossing is most likely to occur around L = 4-5, consistent with the frequent observations of the plasmasphere boundary at this region. It is also indicated that VAP crossed the plasmapause with the largest number of events on the midnight-to-dawn side (i.e., 00-05 MLT) and the smallest number of events on the postnoon-to-dusk side (i.e., 14-19 MLT). The global distribution of the VAP sample points is shown in Figure 2b. Evidently, the satellite trajectories cover all MLT sectors at L < 6 with similar sample numbers, indicating that the preferential plasmapause location at L ∼ 4-5 on the midnight-to-dawn side is not a result of the orbit configuration but a phenomenon fact.

Statistics of the Plasmapause Crossing Events
To understand the underlying relationship between the plasmapause location and geomagnetic activity, we perform a detailed analysis of the magnitudes of the three geomagnetic indices ( Figure 3 also show that only using the maximum/minimum values of these indices to predict the plasmapause location may possibly miss some underlying evolution of the plasmapause in a specific time period. For example, for the maximum Kp = 2, Lpp can vary largely between L = 4-6. Therefore, in this study, we use the neural network to process a large amount of input information and establish a plasmapause prediction model in terms of an optimal input-output mapper.

Neural Network and Input Selection
Nowadays artificial neural networks (ANN) (e.g., Hebb, 1949;Marr & Poggio, 1976;McCulloch & Pitts, 1943) are extensively used in the approximation of functions, classification, pattern recognition, and clustering. Neural networks are capable of mapping complicated nonlinear functions between inputs and outputs (Zhelavskaya et al., 2016). Neural networks contain three or more layers, i.e., the input layer, the hidden layer (or multiple hidden layers), and the output layer, which are generally mapped by activation functions. To evaluate the performance of a neural network, we usually calculate the root-mean-squared-error (RMSE) using the following formula: where the X obs,i is the observed plasmapause location, X model,i is the modeled plasmapause location, and n is the number of the considered events. To develop a reliable neural network model, we need three parts of data: training set, validation set, and test set. The training set is used to obtain proper weights and thresholds and then to establish a complete mapping model; the validation set is used to adjust proper hyperparameters such as the number of neurons in the hidden layer(s), epoch, learning rate, and so on; the test set is used to test the model performance and evaluate its accuracy (Bortnik et al., 2016). Generally, the training set and validation set should come from the same data set, and the validation set is used for many times to adjust the hyperparameters. In contrast, the test set should not participate in the processes of training the model and tuning hyperparameters so that its comparisons with the model results can evaluate the model performance GUO ET AL.
10.1029/2020SW002622 5 of 13 in a reliable way. In this study, we choose the plasmapause crossing events in 2012-2015 (5,493 events in total) as the training set and validation set. The two sets are divided by the K-fold cross-validation method (Schaffer, 1993;Shao, 1993) for tuning hyperparameters and evaluating the model accuracy. By taking K = 10 in this study, this method can split the original data set into 10 disjoint parts and use nine parts as the training set and the remaining one part as the validation set. By permuting the roles of validation and training sets, we can train 10 different models, whose performance should be approximately equal and whose average performance can be investigated. In this way, we can guarantee that the given model is not specific to an arbitrary choice of a training set and that its good performance is not just by coincidence (Camporeale, 2019). The hyperparameters of the final model are chosen based on the minimum averaged model RMSE of the validation set. To prevent overfitting, we choose the model that has the best performance from the perspective of the validation set rather than the training set. Furthermore, an independent test set that GUO ET AL.
10.1029/2020SW002622 6 of 13 contains the remaining 1,044 plasmapause crossing events in 2016-2017 is adopted to evaluate the model accuracy. The architecture of the neural network adopted in this study is shown in Figure 4.
In this study, a back propagation (BP) neural network model that includes one hidden layer is implemented to develop the plasmapause prediction model. We use a popular neural network library-PyTorch (https://pytorch.org/) to establish our neural network. The number of neurons in the hidden layer is 64 for the single-parameter models and 40 for the multiparameter model. We use real-valued parameters as the network inputs, including Kp, AE, Dst indices, solar wind velocity (V SW ), and IMF B Z . As indicated in Figure 3, the plasmapause position may be related to the geomagnetic activity over a time interval. In order to acquire the reliable time delay for each parameter, we quantify the RMSEs of the single-parameter neural network models by inputting the parameter information over different time durations preceding the plasmapause crossings, the results of which are shown in Figure 5. For each parameter color-coded in the panel, the shadow region represents the variation range of RMSE corresponding to the 10 neural network models based on the 10-fold cross-validation method, and the solid curve displays the average RMSE of the 10 models, which is taken as the final RMSE of the single-parameter model using the specific input parameter. Clearly, if we only use the parameter information at the current time (time delay equal to 0) as the input, the neural network model with the Dst index input can have the best performance with the lowest RMSE ∼ 0.77. But when taking the parameter information preceding the plasmapause crossing (e.g., the time delay effect) into account, the neural network model with the AE or Kp index input exhibits the best performance with the lowest RMSE ∼ 0.7 for the time delay ∼43 and 42 h, respectively. In contrast, corresponding to the usage of IMF B Z , V SW , and Dst as the single-parameter input, the lowest RMSE of the model appears for the time delay ∼36, 7, and 6 h, respectively. As a consequence, we use these hyperparameters in GUO ET AL.
For neural networks, these input data need to be normalized to their minimum/maximum values at the beginning. We use the Min-Max scaler from the Python scikit-learn library (https://scikit-learn.org/stable/, see also Pedregosa et al., 2011).
In addition, Rectified Linear Unit (ReLU) (Nair & Hinton, 2010) is adopted as activation function to map the input signals into output signals that are needed for the neural network to function, as follows:

Model Results and Performance Validation
Following the methodology described above, we establish the single-parameter and multiple-parameter neural network models to predict L pp with the MLT dependence. The time delay for each single-parameter model follows the results of Figure 5  to obtain better performance (RMSE = 0.7 and 0.73, respectively). Furthermore, the performance of the single-parameter model using the AE or Kp indices as inputs only are comparable to the performance of the multiparameter model ( Figure 6f). These results suggest that AE and Kp index are good parameters for predicting plasmapause location. Moreover, Figures 6d and 6e show that the L pp-MOD is scattered over a much narrower L shell range than L pp-OBS when modeled using only AE or only Dst index. This means that using multiparameter, the trend of L pp-MOD lies better with L pp-OBS , with the range of L pp-MOD wider than the range of L pp-MOD when only using AE or Dst.
To investigate the MLT dependences of the model performance, we calculate the RMSE values and standard deviations between L pp-OBS and L pp-MOD in our test set from six plasmapause prediction models in different MLT ranges and show the results in Figure 7.

Discussions and Conclusions
In this study, we develop an L pp prediction model with MLT dependences based on the ANN technique. The results suggest that only using the AE or Kp indices as inputs can yield good predictions of plasmapause positions. Our neural network model is capable of predicting the global plasmapause location and simulating the evolution of the plasmapause with low RMSE.
GUO ET AL.
10.1029/2020SW002622 9 of 13 In our ANN model, only solar wind and geomagnetic parameters are used as inputs. Basically, the current plasmapause positions are also related to previous plasmapause locations since the change of the plasmapause location is continuous. Adding the previous plasmapause locations as inputs in our model is a potential method to address this issue and improve model performance. However, the orbit period of VAP is around 9 h, which is too long to get continuous measurements of plasmapause locations. Generally, one VAP satellite samples the plasmapause in every ∼4.5 h and two VAP satellites can sample the plasmapause in every 2-3 h. In addition, some other satellites can be used to measure the plasmapause positions, like MMS, ERG, and THEMIS. Using all these satellite measurements, it is likely to find approximately continuous measurements of plasmapause locations during some periods. Therefore, it is interesting to build a new plasmapause prediction model based on measurements during these special periods, which is left for further study.
This study aims to obtain a model that can predict plasmapause location and we only focus on the data set obtained from VAP. However, the currently largest available plasmapause location database is compiled based on observations from 18 satellites from 1977 to 2015 . To examine the reliability of our developed ANN model, we display our AE model performance on their database only including VAP observations in Figure 8a and the whole database in Figure 8b, respectively. Clearly, for the observed plasmapause locations in the VAP's scope ∼L = 3-6, our model can reproduce the evolution of the plasmapause and accurately predict the plasmapause location (RMSE = 0.59 with CC = 0.87). However, our model has weak predictive ability on the plasmapause locations beyond VAP's scope (RMSE = 2.10 with CC = 0.67). In addition, to evaluate the performance more precisely with multisatellites databases, our ANN model needs further improvements. For example, when we obtain the whole database to test our model in Figure 8b, we could use a classification machine learning model to judge whether the plasmapause is at L < ∼5.5 or at L > ∼5.5 during all conditions. After the classification, we can model the plasmapause location only for the category that the modeled plasmapause is at L < ∼5.5. For the observed plasmapause at L > ∼5.5, we can first compare the categorization results, i.e., whether the neural network model can correctly predict the times when the observed plasmapause is at L > ∼5.5. Then we can compare the performance for the times when the plasmapause is at L < ∼5.5. This approach will be applied in our future study.
Although previous studies (e.g., Cho et al., 2015;Larsen et al., 2017;O'Brien & Moldwin, 2003) demonstrate that solar wind parameters and geomagnetic indices both have strong correlation with the plasmapause, the prediction results using geomagnetic indices seem better than the results using solar wind parameters (Verbanac et al., 2015). The underlying reasons are still unsolved.

Data Availability Statement
The background electron density data from the EMFISIS instrument were obtained from http://emfisis. physics.uiowa.edu/data/index. The solar wind parameters and geomagnetic indices were obtained from the online OMNIWeb (http://omniweb.gsfc.nasa.gov/). Our crossing events data and model could be found in https://figshare.com/articles/dataset/Prediction_of_Dynamic_Plasmapause_Location_using_a_Neural_Network/13181171. GUO ET AL.