Deep Neural Networks With Convolutional and LSTM Layers for SYM‐H and ASY‐H Forecasting

Geomagnetic indices quantify the disturbance caused by the solar activity on a planetary scale or in particular regions of the Earth. Among them, the SYM‐H and ASY‐H indices represent the (longitudinally) symmetric and asymmetric geomagnetic disturbance of the horizontal component of the magnetic field at midlatitude with a 1‐min resolution. Their resolution, along with their relation to the solar wind parameters, makes the forecasting of the geomagnetic indices a problem that can be addressed through the use of Deep Learning, particularly using Deep Neural Networks (DNNs). In this work, we present two DNNs developed to forecast respectively the SYM‐H and ASY‐H indices. Both networks have been trained using the Interplanetary Magnetic Field (IMF) and the related index for the solar storms occurred in the last two solar cycles. As a result, the networks are able to accurately forecast the indices 2 h in advance, considering the IMF and indices values for the previous 200 min. The evaluation of both networks reveals a great forecasting precision, including good predictions for large storms that occurred during the solar cycle 23 and comparing with the persistence model for the period 2013–2020.


Introduction
Geomagnetic indices quantify the disturbance of the terrestrial magnetosphere due to interplanetary transients. Among all the existing indices (see review by Menvielle et al., 2011), the disturbance storm time, Dst, index (Mayaud, 1968;Sugiura, 1964), is commonly used to describe the geomagnetic storm strength. Taking advantage of this, the Dst index is used to drive operational forecasts of various geospace systems, including the thermosphere for mass density and satellite drag (e.g., Licata et al., 2020, and references therein) or the ionosphere for storm propagation (e.g., Hu et al., 1998). Sometimes, SYM-H index (Iyemori, 1990;Iyemori et al., 1992) is used over the Dst index because of its advantage of having higher time resolution (1-min) data as compared to hourly Dst index. Indeed, both Dst and SYM-H are proxies of the symmetric ring current, while the longitudinally asymmetric part of geomagnetic disturbance field at low latitude to midlatitude is quantified by ASY indices (Iyemori, 1990;Iyemori et al., 1992). For example, SYM-H index is used by Forsyth et al. (2020) to construct forecasts of the >2 MeV flux or by Tshisaphungo et al. (2018) to model the ionospheric F2 layer (foF2) changes during geomagnetic storms.
The official values of several indices, including Dst and SYM-H, are obtained from ground magnetometers and provided through the WDC Kyoto (http://wdc.kugi.kyoto-u.ac.jp/), but provisional Dst is available with 1-h latency and SYM-H is not available in real time. This delay in the nowcasting of the geomagnetic indices makes their forecasting even more necessary to provide the proper input to drive other space weather parameters in an operational situation. For this purpose, real-time solar wind parameters measured at L1 point are used to drive forecast of the indices. The forecasting methods based on physical or empirical relationships between the solar wind and the indices start with the works by Burton et al. (1975) and O'Brien Abstract Geomagnetic indices quantify the disturbance caused by the solar activity on a planetary scale or in particular regions of the Earth. Among them, the SYM-H and ASY-H indices represent the (longitudinally) symmetric and asymmetric geomagnetic disturbance of the horizontal component of the magnetic field at midlatitude with a 1-min resolution. Their resolution, along with their relation to the solar wind parameters, makes the forecasting of the geomagnetic indices a problem that can be addressed through the use of Deep Learning, particularly using Deep Neural Networks (DNNs). In this work, we present two DNNs developed to forecast respectively the SYM-H and ASY-H indices. Both networks have been trained using the Interplanetary Magnetic Field (IMF) and the related index for the solar storms occurred in the last two solar cycles. As a result, the networks are able to accurately forecast the indices 2 h in advance, considering the IMF and indices values for the previous 200 min. The evaluation of both networks reveals a great forecasting precision, including good predictions for large storms that occurred during the solar cycle 23 and comparing with the persistence model for the period . and McPherron (2000, which achieve a reasonable level of accuracy. Rastätter et al. (2013) compares the performance of 30 models forecasting a 1-min Dst index (equivalent to SYM-H) during four events: two events representing highly disturbed times and the other two events represent quieter times. None of the models consistently performed best for all the events.
In this scenario, where accurate forecasting of geomagnetic indices is needed for operations, using Machine Learning (ML) techniques is a current trend due to the promising results that can be found in the literature. In this way, Camporeale (2019) presents a survey in the current state of the art of ML applied to space weather nowcasting and forecasting. This work provides an insight into the most important contributions that have been made in recent years, as well as the most common ML metrics that could be applicable to the field.
Regarding to the metrics, it is also important to highlight the work by Liemohn et al. (2018), where guidelines for geomagnetic index forecasting are provided, establishing which metrics are the most appropriate for each index. They recommend it in two processes: metrics focused on the accuracy of the prediction (Root Mean Squared Error [RMSE] and MAE) and metrics that assess the relationship between the predicted and the real value (Pearson correlation coefficient), that is, how close is the prediction to the original values, and metrics for assessing the event detection performance (HSS, POD, and FAR), which indicates how many events would have been detected if we only relied on our prediction. In a similar direction, Welling et al. (2018) made several recommendations to validate ground magnetic perturbation forecasting systems based on ML.
Focusing on geomagnetic indices prediction, one of the first ML approaches is proposed by Hernandez et al. (1993) for the AL index forecasting, which uses feed-forward networks and Autoregressive Moving Average Models with nonlinear filters. Novel works attempt to forecast the Kp index, such as Zhelavskaya et al. (2019) and Shprits et al. (2019), where different methods were evaluated, achieving the highest performance using a feed-forward neural network with one hidden layer, scoring a RMSE of 0.55 for the nowcasting and 0.7 for the 3-h horizon.
One of the most studied indices is the Dst. For this index, several mathematical and ML forecasting systems have been developed. We can highlight the work by Gruet et al. (2018) that combines the usage of Long Short-Term Memory (LSTM) layers in a Recurrent Neural Network (RNN) with a Gaussian process to provide a forecast up to 6 h ahead of the Dst. This model used hourly data from the OMNIWeb and Global Positioning System databases. Their model obtained great accuracy, with a correlation coefficient higher than 0.873 and an RMSE lower than 9.86. However, despite the overall performance, it is unable to obtain accurate predictions of intense storms where the Dst reaches values lower than −250 nT. Lazzús et al. (2019) explored and compared several ML techniques for the Dst index forecast problem. In his work, several Artificial Neural Networks (ANNs) are studied, as well as its combination with bio-inspired algorithms, such as particle swarm optimization, genetic algorithms, and a hybridization of both, to improve the system's accuracy. In order to evaluate the models' performance, the RMSE and the correlation coefficient metrics were used, which are included in the previously mentioned recommendations. Their model achieved a RMSE lower than 5 nT when forecasting up to 3 h in advance and a RMSE lower than 7 when forecasting up to 6 h ahead.
ML algorithms can take advantage of high-resolution data. In this sense, the hourly nature of the Dst can lead to a loss of relevant information. As a consequence, the latest research is focused on forecasting the SYM-H and ASY-H indices, which have 1-min resolution, so the amount of information that may be lost due to the averaging is almost negligible. For the forecast of these indices, we can highlight the work of Bhaskar and Vichare (2019) where a Nonlinear Autoregressive Network with exogenous input solution provides the SYM-H and ASY-H prediction. Their model considers an input history of 30 min and an output feedback of 120 min. As for the inputs, they consider velocity, density, and Interplanetary Magnetic Field (IMF). For the training data, SYM-H and ASY-H indices during geomagnetic storms from 1998 to 2013 are used. However, only storms when the SYM-H reaches values below −85 nT are used as the target for training the networks. Predictions for both indices during nine geomagnetic storms of the 24th solar cycle are showcased, including the large storm that occurred on St. Patrick's Day in 2015, presenting the model capabilities for predicting these indices about an hour before they happen.
Deep Learning (DL) and ANNs have been used as well to forecast the onset of magnetic storms, as presented in the work of Maimaiti et al. (2019). They use a time history of solar wind speed, proton density, and IMF as inputs to forecast the occurrence probability of the onset of magnetic substorms over the next hour. The model, trained using a data set from the SuperMAG list of magnetic substorm onsets, is able to identify substorms 75% of the time, which is a significant improvement over the previous state of the art algorithms for this particular problem that only identified around 21% of the substorms in the same data set. Siciliano et al. (2020) made a comparison of two ANNs for the SYM-H index forecasting: one based on LSTM layers and another one based on Convolutional Neural Network (CNN) layers. With the objective of developing operational models, they only use the IMF measured by ACE's magnetometer (they discard variables such as speed or density due to the data gaps occurred as a consequence of the instruments saturation during intense storms) to forecast storms in which the SYM-H index achieved values lower than −100 nT. They also compare the performance of the networks both with and without the index itself as an input parameter. This paper proposes a Deep Neural Network (DNN) based on LSTM and single-dimension convolutional layers to forecast the SYM-H and ASY-H indices using the IMF measured by ACE and the indices themselves. Meanwhile the use of LSTM and CNN layers has been explored in the literature, the combination of both is a novel approach in this field that increases the forecasting accuracy.
The paper is structured as follows. Next section describes the database and variables used to train our neural networks. Section 3 presents the network architecture developed to predict the geomagnetic indices. Section 4 showcases the prediction accuracy of our networks, comparing it with previous works. Finally, some conclusions are outlined.

Database
For training, validating, and testing the DNN presented in the next section, we use the 42 geomagnetic storms that occurred between 1998 and 2018 in which the SYM-H index achieved a minimum value of less than −100 nT. For most storms, the time interval consists of 10 days surrounding the storm minus a few exceptions, as shown in Tables 1-3. We retrieve the IMF parameters and the SYM-H and ASY-H indices from https:// cdaweb.gsfc.nasa.gov/index.html (King, 2005). The indices are retrieved in 5-min averages (OMNI_HRO_5MIN), whereas the IMF parameters are retrieved from the level 2 ACE data (AC_H3_MFI) sampled at 1-s intervals, recorded at L1. The selected parameters of the IMF are its magnitude and its three Geocentric Solar Magnetospheric (GSM) components.
Since the time resolutions of the indices and the IMF parameters are different, the IMF data have been grouped using the same resolution as the indices. When grouping the IMF data, the mean and standard deviation of each variable were computed. The aim of computing the standard deviation is to obtain information about the fluctuation of the IMF during the period, which would be lost if only the average is used.
The choice of limiting the input variables to the IMF data is motivated by the limited data availability of the solar wind plasma parameters COLLADO-VILLAVERDE ET AL. Note. From left to right: number used to identify the storm, start and end days, occurrence (Y) or not (N) of a multidip (MP) storm, SYM-H index minimum value, and ASY-H maximum value. For training and validating the ASY-H, the storm 18 from the training set is swapped for the storm 23 in the validation set, to better distribute the instances of each set according to this particular index. Note. For training and validating the ASY-H, the storm 18 from the training set is swapped for the storm 23 in the validation set, to better distribute the instances of each set according to this particular index.

Table 2
Storms Used to Validate the DNN Models for effective implementation of any ML algorithm. Figure 1 in Larrodera and Cid (2020) shows data coverage of the main solar wind parameters measured by the ACE spacecraft from the time it is operational until the end of the year 2017. Data availability for IMF and solar wind velocity is over 98% for any year in the sample (even if some data gaps appear in solar wind velocity due to saturation of SWEPAM instrument during important storm events), but the coverage drops substantially for solar wind temperature or density. Indeed, the proton density and the temperature became increasingly sparse starting in 2010 as the detectors aged. An operational improvement has significantly increased the frequency of good-quality SWEPAM observations since October 23, 2012, but in this scenario, we choose to consider the solar wind parameter with almost full data availability. In fact, the IMF gaps are limited to a few seconds in most cases, which have been filled using linear interpolation. Additionally, by averaging into 5-min windows, the impact of the absence of a few seconds is greatly diminished.
Regarding to the indices, they serve a double purpose: they will be used both as the input of the network and as the output. Thus, the input of the network will be a discrete-time window of the indices in conjunction with the other variables. The output will be the predicted values for the SYM-H or ASY-H indices for 1 and 2 h in advance. Let us introduce a key point in our approach: each index is predicted by a specific DNN, that is, we train two separate neural networks using the IMF variables and one of the indices, the SYM-H or ASY-H separately.
Therefore, each network will receive nine input variables: • 5-min average and standard deviation of the IMF strength, expressed in nT. • 5-min average and standard deviation of the X, Y, and Z GSM components of the IMF, expressed in nT.
• SYM-H index, expressed in nT.
• ASY-H index, expressed in nT.
The split of the storms into the training, validation, and testing sets has been done following the approach proposed by Siciliano et al. (2020). The training set consists of 20 storms (48% of the whole 42 storms, 60,808 samples, Table 1), 5 storms have been saved for validation (12%, 14,688 samples, Table 2), and the remaining 17 storms (40%, 55,072 samples, Table 3) have been used for testing. The split has been done so each set is uniformly populated regarding the intensity of the storms. However, despite that this split is suitable for the SYM-H index, it is not for the ASY-H one, since the two most intense storms for this index would be unseen by the network during training. Trying to improve the split for the ASY-H index, we have swapped the storm number 18 of the training set with the storm number 23 of the validation set, only for the ASY-H index training, validation, and testing, so the SYM-H split remains unchanged.
Additionally, to ease the convergence of the network and improve the forecasting, each variable in the input data has been standardized: (i) subtracting the mean (centering the data on −1) and (ii) dividing it by the standard deviation (scaling the data so it has unit variance). This is shown in Equation 1, where Z is the standardized data, x are the original values, μ is the mean of that variable, and σ its standard deviation. Standardization is preferred in the DL domain since it makes each feature similar to a normal distribution. Additionally, this scaling method is more resilient to outliers than the more common normalization between [-1, 1] and [0, 1], which only considers the minimum and maximum values instead of the overall statistics of the data.

Neural Network Architecture
The prediction of the SYM/ASY-H indices can be seen as a time-series forecasting problem due to the temporal nature of the data. It is well known that the prediction's accuracy is inversely proportional to the look ahead horizon (Bontempi et al., 2013), representing a key constraint in the definition and modeling of forecasting algorithms. In our work, we propose a DNN model for forecasting the SYM-H and ASY-H indices. Therefore, we have to define the temporal scope and forecasting horizon as both properties define the input and output layers of our DNNs.
The networks presented in this section will look back 200 min into the past to make the predictions, forecasting the indices at the subsequent 1 and 2 h. As stated in the previous section, we use 5-min resolution data, so the 200 min used to make our prediction are grouped into 40 time steps. That decision has a great impact on the complexity of the network: it greatly reduces the amount of data needed to make each prediction, which reduces the model size and the required training resources (time and memory). The output will be the predicted value of the indices in the next 1 and 2 h. Figure 1 depicts a sketch of the proposed network for the indices forecasting. It is built by a combination of single-dimensional CNN layers (Koprinska et al., 2018), LSTM layers (Hochreiter & Schmidhuber, 1997), and multilayer perceptron (Gardner & Dorling, 1998, also known as fully connected layers, from now on referenced as dense layers). The boxes shaded in gray represent operations without learnable parameters, whereas the colored ones have weight and bias parameters that need to be optimized.
The CNNs are designed to capture spatial and temporal dependencies. The most used layer of this family in the DL field is the two-dimensional convolutional layer, whose main application is image processing. Although we are not extracting image information, our time-series data are very similar to an image from the programmatic perspective: images are two-dimensional matrices of pixels (width, height, and color channels), while our input variables are matrices whose height is 1, its width is the time steps, and the data COLLADO-VILLAVERDE ET AL.
10.1029/2021SW002748 5 of 15 variables correspond to the color channels in the images. When dealing with data structured in this way, we can take advantage of the extensively researched convolutional layers. Under the hood, this layer applies the convolution operation over several subsequent time steps for each variable, extracting features from them, while maintaining the temporal structure.
The LSTM layer belongs to the family of the RNN (Jain, 2000). This family of layers is designed to solve the temporal dependency in the data. Instead of analyzing each time step individually or the time series as a whole, the layer loops over each time step, processing the input in a timely manner. For each time step, the relevant information is saved and combined with the one obtained from the previous time steps, while simultaneously discarding the features that are no longer relevant. This approach inherently takes into account the temporal dependency of the data, allowing the forecasting of time series with outstanding accuracy.
Focusing on our DNN, we have split the input data into two branches as can be seen in Figure 1: The left branch of the network (dashed red square) considers the whole input sequence, that is, the 40 time steps or 200 min history, whereas the right branch (dashed blue square) focuses on the last 20 time steps or last 100 min of data, that is, the recent history of the input data. This split allows one section of the network to have a more complete vision of the sequence fully analyzing it, meanwhile the other section focuses on the more recent input data.
The left branch consists of a CNN block with residual connections, similar to those used for image recognition (He et al., 2015). It is composed of three stacked CNN layers with different kernel sizes (7, 5, and 3) operating over the input features. Parallel to that stack, there is another CNN layer with a kernel size of 1 is also operating over the input data. The result of each CNN layers is added together as can be seen in Figure 2. This kind of architecture has proven to increase the performance over the usage of a single CNN layer, both in the image recognition task and in this particular forecasting problem.
One issue that has to be addressed in the CNN is the padding of the input features to maintain the sequence length, allowing the later addition following the residual connections architecture (otherwise the left and right branch of the residual CNN block would have different shapes). This is generally done by adding zeros using one of the following approaches: causal or same padding, as can be seen in Figure 3. The last one, the same padding, adds zeros to both ends of the data, meanwhile the causal padding only adds zeros to the left. This means that, using the causal padding, the CNN filter is only computed using the current and previous time steps, maintaining the temporal order. Another valuable advantage regarding the usage of this kind of padding in time series, specially when the convolutional layers are placed before the recurrent ones, is that the last time step processed by the following recurrent layer does not have any padded zeros to the end (i.e., the last time step). When applying filters to data padded with zeros, they are inherently contaminated. For image processing, this is not a significant issue, but for the time-series forecasting problem, it has great importance, since the last time step is the one that carries most of the relevant information when the derivatives are calculated in order to back-propagate and optimize the network (Hochreiter, 1998). These particularities make the causal padding the most suitable one for time-series forecasting tasks and the one used in our model.   Also, we have tested different activation functions for the CNN layers: The Rectified Linear Unit (ReLU) and Exponential Linear Unit (ELU), as well as no activation function. The best performance was achieved without using any activation function. However, the difference was very slight.
The decision to process the raw input data using CNN layers before the LSTM layers is motivated by the fact that CNNs are excellent feature extractors, so they can facilitate the work of the subsequent LSTM layers considerably.
The output of the convolutional block is then processed by a LSTM Encoder-Decoder architecture as shown in Figure 4. This block consists of two connected LSTM layers and is commonly used in Neural Machine Translation tasks. The first layer, called encoder is a sequence-to-vector network, that processes the added filters from the previous convolutional block, encoding the 40 time steps in a single vector, comprising all the important information during the whole sequence. Then, the next LSTM layer, called decoder, works as a vector-to-sequence network, recursively processing the context vector twice. The two resulting vectors for each iteration are used to predict the index value in the first and second hour, respectively. For both LSTM layers, the tanh function is used for the activation, meanwhile the sigmoid is used for the recurrent computations.
The output of the LSTM Encoder-Decoder block is combined with the output of the recent history processing block (right blue dashed box in Figure 1). This block works over the most recent time steps, that is, the last 20 (or last 100 min of input data). We do this separation, instead of working over the whole input sequence, because the recent history of the data has high significance for the forecasting.
The first part of this branch is composed of a single CNN layer, which we named convolutional context. It has a kernel size equal to the amount of time steps used as input. Using this configuration, each filter represents a summary of the last 20 time steps. Since we only want to obtain a single vector, no padding is needed. The second part of the branch consists of calculating the mean and standard deviation of each input variable for the last 20 time steps. This represents the amount of variability, and thus, disturbance, in the moments prior to the prediction.
For combining both branches of the architecture, first, the filters learned from the convolutional context are added to each resulting vector from the LSTM decoder. As the LSTM Decoder provides two output predictions (at t + 60 min and t + 120 min), we combine the filters from the convolutional context and the two predictions into two vectors, one for each predicted horizon, while maintaining the sequence shapes. This combination is inspired by the Residual Networks used in image recognition mentioned earlier.
Then, the mean and standard deviation previously computed are concatenated to each vector to add more information regarding the amount of disturbance of each variable in the latest time steps, increasing the length of each sequence by 18 (mean and standard deviation of the nine input variables).
COLLADO-VILLAVERDE ET AL.   Figure 1). Grayed outputs from the encoder are ignored. The encoder works as a sequence-to-vector network compressing the whole context into a vector. Then, the decoder produces the predictions from that encoder context. Finally, each of the resulting vectors is processed through a dense block composed of three dense layers to output the final prediction as depicted in Figure 1. The first two dense layers apply the ELU activation function; it leads to a slower convergence, but usually it offers a better performance over the ReLU one. Finally, the last dense layer has only one neuron without an activation function to output the final prediction.
The ELU function is similar to the more common ReLU one, except on the negative inputs (the ReLU sets negative values to 0). The ELU function is calculated as in Equation 2, where α is a scale for the negative factor, which we fixed to 1, and z the input value.
Regarding the number of neurons of each layer, we have tested values of 32, 64, and 128, obtaining the best performance with 64 neurons. In total, 134.338 parameters need to be optimized when 64 neurons are used for each layer. The DNNs for both indices have been implemented using Keras, an open source library that provides a Python interface for the development of ANN and acts as an interface for the TensorFlow library.

Training and Validating
All neural networks tend to overfit the training set, losing their ability to generalize the predictions outside the training instances at some point. Thus, is of critical importance to decide which data will be used to train and test the network. The weights and biases of the connections between neurons are updated using the instances from the training set, iterating several times through it. Each iteration is known as epoch. After each epoch, the current weights and biases are evaluated against the validation set, without modifying the weights. This gives us an estimate of how well the network generalizes and when the network is simply overfitting the training data and thus, losing its ability to properly generalize to unseen, new data. By following this procedure, the validation set is only used to decide when to stop the training process.
During the training, the previously motioned weights and biases that govern the behavior of the network have to be modified. There are several optimization algorithms for controlling this process (Géron, 2017). Selecting which algorithm to use has a great impact in the training time, and, usually, the election is based on the experience. In our case, for the training process, we have used the Adam optimizer (Kingma & Ba, 2017). This optimizer is a stochastic gradient descent method, which computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients. It is computationally efficient with a low memory requirement. This last makes it, from our perspective, the best choice for performing the training. The training convergence is controlled by a set of hyperparameters, whose are the following: • Learning rate (0.001): step size at each iteration while moving toward minimizing the loss function.
To finish the training, we have followed an early stop procedure in which, when the validation error has not improved for a defined number of epochs, the training finishes. We have set 50 epochs for the early stop and 200 epochs as a maximum number of epochs for the training process. Once the training process has ended, whether due to exhausting the number of epochs or because the validations loss ceased to improve, the weights and biases are reverted and fixed to those that achieved the best performance on the validation set.
Then, the resulting model is evaluated on the test set, which has never been seen by the DNN.

Results
In this section, we will present an assessment of our DNN architecture for forecasting the SYM-H and ASY-H indices. Before presenting the results, next subsection introduces the metrics used.

Metrics
To evaluate the accuracy of the predictions made by the previously presented neural network, we have followed the evaluation guidelines for geomagnetic index predictions recommended by Liemohn et al. (2018). Such methodology implies computing several fit performance metrics between the predicted indices values and the real ones. The metrics used are the RMSE and the coefficient of determination R 2 , which are also included in the survey by Camporeale (2019) for regression problems and commonly used other works in the field, for example, Yang et al. (2018) and Yang and Shen (2019). Both have been computed after reverting the standardization process done to the variables prior to using them in the network.
The first metric, RMSE (see Equation 3), gives an insight regarding how well the model captures the range of values of the index, highlighting the larger differences between the real values of the indices (y i ) and the predicted ones (ˆi y ) which mainly happens during active times where the error is farther from 0 and can be used to compare the results obtained by our model to those obtained by other authors.
The second metric, R 2 (see Equation 4), represents the amount of variance of the observed data, explained by the predicted, being y the mean of the N samples evaluated. This metrics output values usually ranged between 0 and 1. A R 2 = 1 means that the model perfectly predicts the target variable, meanwhile 0 implies that the model is equivalent to forecasting the mean of the target variable. It can also achieve negative values if the predictive model performs worse than simply predicting the average of the target variable.

Model Evaluation
In this section, we present, discuss, and evaluate the predictions obtained using our DNN model for the 17 test storms (see Table 3) using the metrics discussed above, for both indices and both forecasting horizons (1 and 2 h). The metrics are displayed using three decimal places since we believe that level of precision is important regarding the metrics used. The results are summarized in Tables 4 and 5 for the SYM-H and ASY-H, respectively.
In Table 4, we show the computed RMSE and R 2 metrics for the SYM-H index, for each test storm, for both forecasting horizons. First, in the case of the 1-h horizon, the metrics are compared to the ones obtained by Siciliano et al. (2020), using their LSTM model (i.e., slightly better than their CNN model). If we compare the obtained results between each model, our proposed network achieves a significantly lower RMSE in 12 out of the 17 storms and behaves mostly the same in 2 out of the 17 and worse in 3 out of the 17. However, the mean is significantly lower. Regarding the R 2 metric, the mean results are the same, but the results presented by Siciliano et al. (2020) are provided with a lower precision, providing one decimal place for the RMSE and two for the R 2 , meanwhile, we provide three decimal places in all cases, so the results may not be totally accurate.
Focusing on the predictions at 2 h, the results are compared against a persistence model, referenced as baseline (to our extent there are no other works that make predictions up to 2 h). The behavior of the persistence model is a simple prediction algorithm for which the prediction at both t + 60 and t + 120 is the value of the index at time t. This model has no predictive ability but serves as a basis for comparison. In this case, our model always outperforms the baseline model, the worst performing storm, number 30, only improves the baseline by a slight margin, but all the other storms are predicted significantly better.
Following, Table 5 depicts the computed metrics for the test storms for the ASY-H index for both forecasting horizons. In this case, the metrics are compared against the baseline model for both predictions horizon. The only storm in which our model behaved worse than the baseline is the prediction at 1 h for the storm number 30; in all the other cases, our model performs better.
The only other work that we can compare to regarding the ASY-H index is the one done by Bhaskar and Vichare (2019). In such work, they used a NARX network to predict the storms that happened in 2014 and 2015. This means that the only shared storm between both test subsets is the one that occurred in March 2015 (number 41 in Table 3). However, they considered different days for the start and end of the storm: they considered more than 20 days, meanwhile we only considered 10, so the comparison is not strictly fair, since the longer the period, more nondisturbed time will be predicted which will decrease the overall error metrics. Nevertheless, they reported a RMSE of 20.43, whereas our model achieved an error of 17.924 when forecasting 1 h ahead and 20.380 when the horizon was increased to 2 h.
Another experiment that we have performed is the evaluation of the models' predictions during large, mostly quite periods, that is, period that contained a small amount of storms or relative small storms that have never been seen by the DNN. Since a final objective is to deploy an operational forecasting system, we want to avoid false positives, increasing the reliability of our approach. To evaluate this behavior, we have computed the metrics for both indices, 1 and 2 h ahead, from January 2013, until December 2020. For such period, the minimum SYM-H is −233 (2015) and the maximum ASY-H is 348 (2015). It is important to note that none of the used training storms belong to this period (only two validation storms are in this range). Table 6 shows the metrics computed for the SYM-H index over the predictions for each of the mentioned years. It is important to mention that since the number of disturbances on a yearly scale is low, the baseline model is harder to outperform. Still, only for the last 2 years, the 1-h look ahead model is slightly worse than the baseline model, whereas for the remaining years and for the 2 h look ahead, our model performs better than the baseline. Assessing the ASY-H reported in Table 7, only one year (2014) performs slightly worse than the baseline for the 1-h prediction, whereas for the remaining cases our model outperforms the baseline.
COLLADO-VILLAVERDE ET AL. Note. The combined mean error is depicted in the last row.

Table 4 Metrics for the SYM-H Prediction Over the Test Storms Set ( 3), Comparing With the Work of Siciliano et al. for 1-h Predictions and the Baseline for the 2-h Prediction
Finally, we provide the temporal profile of the SYM-H and ASY-H predictions for 1 and 2 h against the real values for the storms occurred in November 2004 as an example of an intense storm ( Figure 5, storm number 37 in Table 3), and the storm occurred in June 2013 as an example of a moderate one ( Figure 6, storm number 40 in Table 3). The plots for all the storms in the test set can be seen in the supporting information provided.
COLLADO-VILLAVERDE ET AL. Note. The combined mean error is depicted in the last row. Besides the great performance of our models, we still can find a few intervals where the predictions are not as accurate as expected. One of these examples is the about 6-h interval starting at 12 UT on April 16, 1999 (storm number 29) when the SYM-H (both 1-and 2-h) prediction does not accurately match with the observed SYM-H. The reason for this not-so-good performance seems to be related to the fact that the solar wind dynamic pressure is playing a major role in the solar wind-magnetosphere interaction. Figure 7 evidences that dynamic pressure increased during that interval, reaching about 30 nPa. Solar wind plasma parameters involved in dynamic pressure (proton density and flow speed) have not been considered as input parameters in our models. Therefore, the models are not expected to accurately reproduce the magnetospheric response (as seen by geomagnetic SYM-H and ASY-H indices) of an outstanding contribution of the solar wind plasma relative to the IMF contribution.

Conclusions
The prediction of the SYM-H and ASY-H indices is of paramount importance to unravel the information about the response of the Earth's magnetosphere during geomagnetic storms. Given the importance of these predictions, a notable amount of work has been done trying to forecast its values. Even so, the problem has not been fully solved yet. Some forecasting systems cannot predict further enough into the future, COLLADO-VILLAVERDE ET AL.

10.1029/2021SW002748
13 of 15  the horizon in which the predictions are reliable has no real value yet, and when the prediction horizon is further enough into the future so it starts to become of great use, their accuracy is greatly hindered.
Thus, the present work proposes the usage of a DNN system to forecast them by exploiting the advantages of combining CNNs and LSTM layers. Using the IMF measured by ACE, and the index itself, the trained networks are able to predict both indices up to 2 h ahead. Although the overall metrics for both indices are quite accurate, the forecasting during the most intense storms can still be improved. However, the scarce number of events of such intensity, in conjunction with the voracity of neural networks for training data, increases the difficulty of this task.
Based on the evaluation of the network's prediction presented in the article, we can acknowledge that our predictions not only outperform the current state of the art forecasting models for the SYM-H index but also provide a good performance for the ASY-H when comparing to the persistence model. Thus, the proposed DNN will be implemented in a near future as a real-time forecasting product under the Spanish Space Weather Service (http://www.senmes.es). The implementation plan includes the real-time nowcasting of SYM-H and ASY-H.
We have noticed the influence of dynamic pressure in the SYM-H index in some intervals where predictions were not so accurate. We also observed that the ASY-H index is significantly harder to predict than the SYM-H, especially during the highest peaks. This suggests that this index may be influenced by more variables than the ones studied in this work. Future efforts will be devoted to include solar wind plasma parameters like proton density and flow speed among the input parameters of the DNN system. Even if there are several gaps in ACE plasma data during severe solar storm conditions, these problems could be solved by using solar wind plasma data from DSCVR mission.