Short-term load demand forecasting through rich features based on recurrent neural networks

With the emerging penetration of renewables and dynamic loads, understanding grid edge loading conditions becomes increasingly important. Load modelling research commonly consists of explicitly expressed load models and non-explicitly expressed techniques, among which artificial intelligence approaches have become the major path. This paper presents an artificial intelligence-based load modelling technique to enhance the knowledge of current and future load information considering geographical and weather dependencies. Specifically, a recurrent neural network based sequence to sequence (Seq2Seq) model is presented to forecast short-term power loads. In addition, a feature attention mechanism, applied along the channel and time directions, is developed to improve the efficiency of feature learning. Experiments over three publicly available datasets demonstrate the accuracy and effectiveness of the proposed model.


INTRODUCTION
With the continuous development of the economy, society and the smart grid, the power system places increasingly high requirements on the accuracy of load forecasting [1]. Accurate short-term load prediction is an essential backbone of smart grid operation, facilitating economic dispatch, demand response programs, etc. [2]. However, the load characteristics of power systems differ across regions, and the increasing penetration of distributed renewable energy also brings great unpredictability and uncertainty to the load side of the grid. These factors pose greater challenges to power load forecasting [3].
Researchers have endeavoured to improve the accuracy and speed of the forecasting methodologies applied in power systems over the past decades [4,5]. The statistical model is one of the most widely used forecasting methods, which aims to describe the relationship between the time series of the forecasted values and the actual historical load data [6]. Representative statistical approaches are presented in [7,8]. Statistical models can solve the problem of forecasting delay effectively, but their accuracy for long-term forecasting is unsatisfactory. Artificial intelligence, another group of methods, is extensively applied to load forecasting [9][10][11], including neural networks and support vector machines (SVMs).
Neural networks are able to learn complicated non-linear relationships between input and output. They are simple and efficient techniques with good performance and have been widely applied in load and renewable energy forecasting [4]. For instance, the authors in [10] present several improved models using artificial neural networks to predict the aggregated load demand with good performance. However, most neural network models require a large training set to prevent over-fitting, and the training process is usually unstable [12]. SVMs work well with small training sets and are able to model non-linear relationships through the kernel trick, but choosing a good kernel function is non-trivial [13]. The authors in [14] propose a least-squares SVM for annual electric load forecasting. Recently, deep learning approaches have attracted considerable attention and have made tremendous achievements in the forecasting area. They have the capability of learning and generalisation on large datasets and have made several breakthroughs in the fields of computer vision and natural language processing [15,16]. The core idea of deep learning approaches is to model complicated relationships between input and output data by stacking several layers of non-linearities. When applied to load forecasting tasks, instead of being designed manually, these powerful models are able to learn the hidden relationship between the prediction and historical data automatically, given large enough training data. Recurrent neural networks (RNNs), one family of deep learning models, are designed to model the time dependency of data sequences and are therefore often applied in forecasting tasks [17]. However, vanilla RNNs suffer from gradient vanishing over large recurrent steps, which prevents them from modelling long-term time dependencies of the data. In [18], the long short-term memory (LSTM) cell is proposed to avoid this by applying a gating mechanism. The authors in [11] present an LSTM RNN based framework to forecast the electricity use of individual residential customers. Reference [19] also applied the RNN with LSTM units to household load forecasting. In [20], gated recurrent units (GRUs) are proposed with a simpler architecture but provide performance comparable to LSTM. The authors in [21] have adopted the gated RNN model for day-ahead load forecasting in commercial buildings.
In this paper, we propose a forecasting framework based on the sequence to sequence (Seq2Seq) RNN, which has not been adopted for load forecasting so far. The Seq2Seq model was originally proposed to address the end-to-end sequence learning problem with neural networks [22]. This model is able to handle input and output sequences of various lengths and provides good performance on machine translation tasks. In [23], an attention mechanism is proposed to let the Seq2Seq model softly select a set of relevant inputs when predicting each output. This improves machine translation performance, especially for long sentences, and also provides interpretability of the model. Besides machine translation and speech recognition [24], Seq2Seq has been successfully applied to several other tasks including image captioning [25], question answering [26], video representation learning [27] and human motion prediction [28].
Much research on feature selection has been carried out to reduce computational complexity as well as to improve the performance of downstream tasks by removing redundant features [29]. In [30], a subset of weather and historical load features is extracted for load forecasting by ranking candidate features using a conditional mutual information approach. In [31], the optimal feature subset is selected based on the correlation between features and the load. In the proposed model, instead of selecting a fixed set of features for forecasting, different sets of features are selected to predict the power load at different time steps through our feature attention mechanism.
The main contributions of this article can be summarised as follows:

1. An RNN-based Seq2Seq model is proposed to predict short-term loads. The Seq2Seq model has rarely been applied in the load forecasting area, and this paper fills the knowledge gap in this domain. The method is suitable for the load forecasting purpose and is demonstrated to be accurate and efficient.

2. A feature attention mechanism is introduced to learn rich feature representations. Unlike the original Seq2Seq where the decoder is only fed with the previous output, in our model at each output time the decoder also takes as input features that are computed from the raw input features through a feature attention layer. Specifically, a new set of features is computed from the raw features along both the channel and time directions before being fed into the decoder.

The rest of the paper is organised as follows. Section 2 presents the background knowledge of RNNs and GRUs. Section 3 describes the proposed Seq2Seq method for load forecasting. Section 4 discusses the feature attention mechanism applied in our model. Section 5 documents the solution procedure and the experiment setup. Case studies and results are presented in Section 6. Finally, Section 7 concludes this paper.

BACKGROUND
In this section, we briefly introduce the background knowledge of the RNNs [32] and GRUs [20] used in our model; more details can be found in [20,32].

Recurrent neural networks
An RNN is a neural network for modelling sequential data that takes an input sequence of length T and generates a T-step output sequence together with hidden states. At each time step t, the RNN takes the current input x_t and the previous hidden state h_{t-1} to produce h_t and the output y_t. Thus, it is able to model the time dependencies of the data and is suitable for capturing the temporal relationship between the previous and current states. Traditionally, the RNN cell is defined as [17]:

$$ h_t = \tanh(W x_t + U h_{t-1} + b), \qquad y_t = V h_t + c, $$

where tanh(·) is the hyperbolic tangent function, and W, U and V are weight matrices along with bias vectors b and c, which are usually learned through the backpropagation through time (BPTT) algorithm [33] during training.
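For concreteness, the following is a minimal NumPy sketch of one vanilla RNN step as defined above; it is an illustration with randomly initialised weights and assumed shapes, not the authors' code:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, V, b, c):
    """One vanilla RNN step: h_t = tanh(W x_t + U h_{t-1} + b), y_t = V h_t + c."""
    h_t = np.tanh(W @ x_t + U @ h_prev + b)
    y_t = V @ h_t + c
    return h_t, y_t

# Illustrative shapes: 3 input features, 5 hidden units, 1 output.
rng = np.random.default_rng(0)
W, U, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(1, 5))
b, c = np.zeros(5), np.zeros(1)
h = np.zeros(5)
for x in rng.normal(size=(4, 3)):  # unroll over a length-4 input sequence
    h, y = rnn_step(x, h, W, U, V, b, c)
```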

Gated recurrent units
Vanilla RNNs can suffer from vanishing/exploding gradients and thus have difficulty learning long-term dependencies. Sophisticated recurrent cells, including LSTM [18] and GRU [20], have been proposed to avoid this issue by creating paths through time with derivatives that neither vanish nor explode [17]. The GRU is applied in our proposed model and is defined as follows [34]:

$$ z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), $$
$$ r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r), $$
$$ \tilde{h}_t = \tanh\bigl(W x_t + U (r_t \odot h_{t-1}) + b_h\bigr), $$
$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, $$

where σ is the logistic sigmoid function, ⊙ is element-wise multiplication, and W_z, U_z, W_r, U_r, W and U are weight matrices along with bias vectors b_z, b_r and b_h to be learned. The GRU is chosen for our model because we found that it outperformed the LSTM in our experiments. There is no rule that one cell is always better than the other; it depends on the dataset used for modelling. Thus, a practical strategy is to test both cells on a new dataset and choose the one providing the best performance. When both cells perform similarly, however, the GRU is usually preferred because it is computationally more efficient.
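A corresponding NumPy sketch of a single GRU step, again illustrative rather than the authors' implementation, makes the gating explicit:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, W, U, bh):
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)             # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev) + bh)   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde              # gated interpolation

# Illustrative shapes: 3 input features, 5 hidden units.
rng = np.random.default_rng(1)
p = [rng.normal(size=s) for s in [(5, 3), (5, 5), 5, (5, 3), (5, 5), 5, (5, 3), (5, 5), 5]]
h = gru_step(rng.normal(size=3), np.zeros(5), *p)
```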

LOAD FORECASTING USING RECURRENT NEURAL NETWORKS
A dynamic power load sequence can usually be considered as a time series, so the RNN is a suitable tool for dynamic load forecasting. In our work, the Seq2Seq model is applied and feature engineering is also involved. In this section, we first introduce the general model of the power load. Then we discuss how to use a Seq2Seq framework to learn the load model and forecast the load. Finally, we introduce our feature attention mechanism, which reduces the feature engineering effort.

Power and load model
Let y_t and y_t^true be the measured value and the true value of a power load at time t, respectively. Then y_t can be modelled as

$$ y_t = y_t^{\text{true}} + N, \qquad N \sim \mathcal{N}(0, \sigma^2), $$

where N is Gaussian noise with zero mean and standard deviation σ introduced by the measurement uncertainty. Let x_t be a subset of features that affect the value of y_t^true, and let f(x_t, ω) be a load model modelling y_t^true for all t with parameters ω. Then the likelihood of y_t given features x_t is

$$ p(y_t \mid x_t) = \mathcal{N}\bigl(y_t;\; f(x_t, \omega),\; \sigma_{\text{model}}^2\bigr), \qquad (9) $$

where σ_model is a standard deviation with lower bound σ, as the load may not be perfectly modelled using only the feature set x_t. Assuming σ_model is independent of x_t, our goal is to find a function f̂(x_t, ω̂) which approximates the load model f given sufficient observations. In this paper, we use RNNs consisting of an encoder and a decoder as the function approximator, and the load model is fitted by maximum likelihood estimation based on Equation (9). Because of the time dependence of the load sequence, the feature set x_t can include historical load data, weather temperature, and any other external features that can affect the load value at time t.
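Since σ_model is treated as a constant with respect to ω, maximising this likelihood reduces to least squares, which motivates the MSE loss used later; a one-line derivation:

```latex
\log p(y_t \mid x_t)
  = -\frac{\bigl(y_t - f(x_t,\omega)\bigr)^2}{2\sigma_{\mathrm{model}}^2}
    - \log\!\bigl(\sigma_{\mathrm{model}}\sqrt{2\pi}\bigr)
\quad\Longrightarrow\quad
\arg\max_{\omega}\sum_t \log p(y_t \mid x_t)
  = \arg\min_{\omega}\sum_t \bigl(y_t - f(x_t,\omega)\bigr)^2 .
```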

Learning load model through a sequence to sequence model
Seq2Seq [22] is a popular RNN-based model widely used in natural language processing tasks [35][36][37]. It consists of an encoder RNN and a decoder RNN and is able to map an input sequence to a target sequence of a different length. The encoder first maps an arbitrary-length input sequence to a fixed-size vector (the last hidden state of the encoder RNN). Then the decoder is initialised by this fixed-size vector and generates the target sequence. One advantage of this model for load forecasting is that the entire information of the input sequence is utilised to generate the target sequence. Assuming the load sequence of length T depends only on the previous historical data of length T′+1 and the corresponding external features, the likelihood of the target load sequence given the features is [15]

$$ p(y_{t+1}, \ldots, y_{t+T} \mid x_{t-T'}, \ldots, x_{t+T}), \qquad (10) $$

which, with the input sequence summarised into a fixed-size encoding, factorises as

$$ \prod_{i=1}^{T} p(y_{t+i} \mid v, y_{t+1}, \ldots, y_{t+i-1}, d_{t+i}), \qquad (12) $$

where v is the fixed-size encoding of the input sequence, d_t is the multi-channel input feature at time t, and each factor

$$ p(y_{t+i} \mid v, y_{t+1}, \ldots, y_{t+i-1}, d_{t+i}) \qquad (13) $$

is a Gaussian distribution as discussed in Section 3.1. Unlike the original Seq2Seq, where the decoder is only fed with the previous output, in our model at each output time t the decoder also takes the features d_t as input. These features are computed from the original input features through a feature attention layer, which is discussed in detail in the next section. The load model can then be learned by maximising the log-likelihood of Equation (12), which is equivalent to minimising a mean squared error (MSE) loss:

$$ \mathcal{L}(\omega) = \frac{1}{T} \sum_{i=1}^{T} \bigl( y_{t+i} - \hat{y}_{t+i} \bigr)^2, $$

where ŷ_{t+i} is computed from the decoder RNN output followed by a linear fully connected layer. It is observed in [22] that a reversed input order is preferred for language translation tasks. Inspired by this, the encoder is constructed as a bidirectional RNN [38] to maintain both directions of the input sequence. Figure 1 illustrates the proposed model used for load forecasting. Note that the encoder is a bidirectional RNN while the decoder is a single-directional RNN.
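A compact tf.keras sketch of such an encoder–decoder is given below. It is a simplified, single-layer-per-direction illustration under assumed sizes (T_in, T_out, n_feat are placeholders), with the attention-derived features d_t assumed precomputed and fed directly as the decoder input rather than built inside the model:

```python
import tensorflow as tf
from tensorflow.keras import layers

T_in, T_out, n_feat, n_units = 24, 24, 8, 80  # illustrative sizes

# Encoder: bidirectional GRU summarising the historical load sequence into v.
enc_in = layers.Input(shape=(T_in, 1), name="history")
_, h_fwd, h_bwd = layers.Bidirectional(
    layers.GRU(n_units, return_state=True))(enc_in)
v = layers.Concatenate()([h_fwd, h_bwd])  # fixed-size encoding of the input

# Decoder: single-direction GRU initialised with v, taking features d_t as input.
dec_in = layers.Input(shape=(T_out, n_feat), name="features")
dec_seq = layers.GRU(2 * n_units, return_sequences=True)(dec_in, initial_state=v)
y_hat = layers.TimeDistributed(layers.Dense(1))(dec_seq)  # linear output layer

model = tf.keras.Model([enc_in, dec_in], y_hat)
model.compile(optimizer="rmsprop", loss="mse")
```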

FEATURE ATTENTION MECHANISM
The process of feature engineering which heavily relies on experts' domain knowledge is often time-consuming and inflexible in adding new types of features. To reduce the efforts on feature engineering, we introduce a feature attention mechanism which is able to learn rich feature representations automatically and efficiently. The only required feature engineering is to group features based on their dependency.
An attention mechanism is proposed for Seq2Seq in [35] to force the model to learn to focus on specific positions of the encoder input sequence. Inspired by this, as all the features are fed into the decoder in our model, we introduce an attention mechanism that forces the model to learn which input features are important for the output at each output time t. The input features of our Seq2Seq model form a multi-channel feature sequence including historical load data (the encoder input sequence) as well as external features such as weekdays, temperature and any other features affecting the load values at each time step. Though the input sequence is already encoded in the final state of the encoder, we found that explicitly including it as an additional feature channel improves the load forecasting performance in our experiments, as the load demands between two successive days at the same time are highly correlated. This can be interpreted as manually specifying the attentions which are learned in [35].
We then assume that at each output time t, the model should focus on a specific set of features over time and channels. For example, along the time dimension, the load at time t can depend on previous input features or even future features. This is reasonable for the load forecasting purpose; for example, future temperatures can affect the load demand at a particular time of a day because of human planning behaviours. The same holds for the channel dimension.
The feature attention learning is only applied to numerical features; categorical and ordinal feature channels are fed into the decoder directly. Specifically, let D^o_num be the original T×M M-channel numerical feature matrix with each column an input numerical feature sequence, and let D^a_num be the T×N learned N-channel numerical feature matrix. Feature attention learns a set of weights w such that

$$ d_i^a(t) = \sum_{j=1}^{M} \sum_{\tau=1}^{T} w_{i,j,\tau}(t)\, d_j^o(\tau), $$

where d_j^o(t) and d_i^a(t) are the elements in the t-th row of the corresponding columns of D^o_num and D^a_num, respectively. In the proposed approach, attention over channels and attention over time are learned separately to reduce the computational complexity.

Channel attention
Feature attention over channels is learned based on the feature dependency across channels. One example of channel-dependent features is the set of temperatures obtained from several nearby weather stations. Feature channels are first divided into K groups based on their dependency. Then, for each group, the dependency between channels is modelled by a weighted summation over all the features within the group. It is desirable to learn multiple sets of weights, since there may exist multiple useful combinations of features across channels. To reduce the computational complexity, we further assume that within each group, features at different times share the same weights. Thus, for each group, the channel attention output can consist of multiple channels, each of which is a weighted average over the original feature channels.
Let S_d = {d_j^o, j = 1, …, M} be the set of T×1 single-channel input numerical feature sequences, {k_i, i = 1, …, K} be a partition of the index set of S_d based on the feature channel dependency, |k_i| be the cardinality of the set k_i, and n_c be the number of output channels per group. Then the feature attention over channels for each group is computed as

$$ d_{i,l}^c = \sum_{j \in k_i} w_{i,l,j}^c \, d_j^o, \qquad \text{s.t.} \quad \sum_{j \in k_i} w_{i,l,j}^c = 1, \quad l = 1, \ldots, n_c, $$

where d_{i,l}^c is the l-th output channel attention feature vector for the i-th feature group. The constraint is enforced by a softmax [32] over the weights. Note that for groups with only one feature channel, the softmax forces the single weight to 1, so the channel attention reduces to an identity mapping.
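A possible NumPy realisation of this grouped, softmax-constrained channel mixing is sketched below; the names and shapes are illustrative and not taken from the paper:

```python
import numpy as np

def channel_attention(D, groups, logits):
    """Softmax-weighted channel mixing within each dependency group.

    D: (T, M) numerical feature matrix; groups: list of channel-index lists;
    logits[i]: (n_c, |k_i|) learnable scores for the i-th group.
    """
    outputs = []
    for i, idx in enumerate(groups):
        w = np.exp(logits[i])
        w = w / w.sum(axis=1, keepdims=True)  # softmax: weights sum to 1
        outputs.append(D[:, idx] @ w.T)       # (T, n_c) weighted averages
    return np.concatenate(outputs, axis=1)

# e.g. 11 temperature channels in one group, mixed into 2 learned channels
T, M = 24, 11
D = np.random.randn(T, M)
out = channel_attention(D, groups=[list(range(M))],
                        logits=[np.random.randn(2, M)])
```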

Time attention
Feature time attention is applied to each individual feature channel separately. Two types of structures are used to model the feature attention over time. Multiple output channels are used in both structures for the same reason as mentioned before. First, we use a linear projection structure to model time attention weights that depend on the output time t. That is, at different output times t, the model should focus on input features with different patterns. Let n_t be the number of output channels. The time attention for each feature channel d_i^c is modelled as

$$ d_{i,l}^t(t) = \sum_{\tau=1}^{T} w_{l,t,\tau}^t \, d_i^c(\tau), \qquad \text{s.t.} \quad \sum_{\tau=1}^{T} w_{l,t,\tau}^t = 1, \quad l = 1, \ldots, n_t, $$

where d_i^c(t) and d_{i,l}^t(t) are the t-th elements of the channel attention feature vector d_i^c and of the corresponding l-th output time attention feature vector d_{i,l}^t, respectively. The constraint is enforced by a softmax [39] over the weights.
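The linear projection structure can be sketched as follows; the logits tensor is a hypothetical parameterisation of the learnable scores:

```python
import numpy as np

def time_attention(d_c, logits):
    """Output-time-dependent softmax attention over input time steps.

    d_c: (T,) one channel-attention feature sequence;
    logits: (n_t, T_out, T) learnable scores, one weight vector per
    output channel l and output time t.
    """
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over input time
    return w @ d_c                         # (n_t, T_out) attended features

d_t = time_attention(np.random.randn(24), np.random.randn(1, 24, 24))
```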
Then we use a multi-scale weight-sharing structure inspired by [40] to model weights that are independent of the output time t. That is, at each time t, the model should focus on the same set of features centred at time t, which mimics the process of sliding-window feature extraction. A convolutional layer is used to meet the requirement of weight sharing as well as to force the model to focus on the features around the current time t. The convolutional layer structure used in our experiments is illustrated in Figure 2, which is a modified version of the inception layer proposed in [40]. Three modifications are applied:

1. One-dimensional filters are used for one-dimensional sequence data.
2. Pooling is removed, because localisation is critical for attention learning and we want to keep the original sequence length.
3. The 1×1 filters before the larger filters are removed, since dimension reduction is not necessary here (the input feature has a single channel); however, we keep the parallel 1×1 filters to add more non-linearity.
This structure is able to capture multi-scale representations of features over time at a low computational cost. The filter sizes are not necessarily the same as those shown in Figure 2; each filter size can have several output channels, and both can be tuned for specific tasks. For all the filters, a rectified linear unit [41] is used as the activation function.
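One plausible tf.keras rendering of such a modified 1-D inception block is sketched below; the filter sizes (1, 3, 5) and channel counts are assumptions, since Figure 2 is not reproduced here:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_1d(x, n_filters=4):
    """Modified 1-D inception block: parallel convolutions with 'same'
    padding (sequence length preserved), no pooling, ReLU activations."""
    branches = [
        layers.Conv1D(n_filters, k, padding="same", activation="relu")(x)
        for k in (1, 3, 5)  # filter sizes are tunable per task
    ]
    return layers.Concatenate()(branches)

inp = layers.Input(shape=(24, 1))  # one feature channel over T = 24 steps
out = inception_1d(inp)            # (24, 12) multi-scale features
model = tf.keras.Model(inp, out)
```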
The number of parameters required for the inception structure is constant with respect to the input feature length T. For the linear projection structure, however, the number of parameters is L T² n_t, which may lead to over-fitting when T is too large. In this case, one may remove the linear projection part and only keep the inception layer with more output channels and larger filter sizes, since the load value is less likely to be affected by features too far away. Let n_I be the number of filters used in the inception layer; then the number of learned numerical feature channels becomes N = (n_t + n_I)L.

Including original features
To force the model to focus on features at the current time as well, the original features are concatenated with the new features to form the final decoder input features. Thus, the final feature matrix is

$$ D = \bigl[ D_{\text{num}}^o,\; D_{\text{co}}^o,\; D_{\text{num}}^a \bigr], $$

where D^o_co is the matrix of the original categorical and ordinal features with each column an individual feature channel. At each time step t, the input feature vector d_t is the t-th row of the matrix D. With this setting, features can be easily added or removed by adding or removing the corresponding channels from the original feature matrix before it is fed into the feature attention layer.

Solution procedure
As illustrated in Figure 1, the encoder takes an arbitrary length of historical load sequence as input. The decoder RNN is initialised by the final state of the encoder and takes a set of feature sequences as input. Features are first fed into a feature attention layer, which forces the decoder to focus on important features over time and channels at each output time t. Then the decoder generates the load sequence based on the learned features and previous outputs. Note that the input and output sequences can have different lengths. The load model is trained by finding a function f̂ which minimises the distance between the true load and the model-predicted load on a training set.

Experiments
To demonstrate the efficiency of our model, we apply it to the next-day hourly load forecasting task and compare its performance with previous models designed specifically for load forecasting. Load forecasting is a suitable task for demonstrating the proposed feature attention learning, because load demand depends on the historical usage and also heavily depends on external features such as holidays, public events and temperatures [42].
In our experiments, three public datasets are used: the PL dataset, the Global Energy Forecasting Competition 2012 (GEFCom2012) dataset and the ISO New England (ISO-NE) dataset; each is described in its corresponding case study in Section 6.

Network architecture and training details
The choice of hyper-parameters and the model training details are briefly discussed here. The decoder and each direction of the encoder in the Seq2Seq model consist of 2 layers of GRUs with 80 hidden units. The number of output channels is chosen to be 4 for the inception filters and 1 for the other feature attentions.
During the training process, the network takes the load demand of the current day and the corresponding features of the next day as input. Then it generates the estimated hourly load demand of the next day. The lengths of the input and output sequences are not necessarily the same. However, in our experiments, no significant improvement was observed when a longer input sequence was provided, at the cost of higher computational complexity. We tested the proposed network using the demand of the previous week, the same day in the previous week, and the current day as the input feature, and found that the current setting provides the best performance. We further assume that the next-day temperatures can be predicted with high accuracy, so the true temperatures of the next day are used as features where applicable [46,47]. The network is optimised by the RMSProp algorithm [48] through BPTT [33]. As discussed in Section 3.2, the model assumes that the load demand sequence to be predicted depends only on the input historical sequence of the encoder, not on earlier historical load. Thus, we can safely shuffle the data in each training epoch to reduce the risk of over-fitting. At the beginning of each epoch, a set of forecasting pairs containing the load demands of two successive days and the corresponding features is first constructed. The forecasting pairs are then shuffled to form the training set for the current epoch. The batch sizes used for PL, GEFCom2012 and ISO-NE are set to 16, 8 and 1, respectively. The learning rate starts from 0.001 and is divided by 10 every 50 epochs. The models are trained for up to 150 epochs. The number of layers and hidden units, and the learning rate schedule, were found by grid search on a system-level load demand of ISO-NE. Other hyper-parameters were picked manually, since this work does not aim to achieve state-of-the-art performance and the current performance is satisfactory. Higher performance can be expected if all the hyper-parameters are chosen by grid search.
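The stated learning-rate schedule can be reproduced with a standard Keras callback; the following is a sketch of the training configuration under these assumptions, not the authors' script:

```python
import tensorflow as tf

# Step decay as described above: start at 1e-3, divide by 10 every 50 epochs.
def lr_schedule(epoch, lr):
    return 1e-3 * (0.1 ** (epoch // 50))

callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule)]
# model.compile(optimizer=tf.keras.optimizers.RMSprop(1e-3), loss="mse")
# model.fit(train_x, train_y, batch_size=16, epochs=150,
#           shuffle=True, callbacks=callbacks)  # shuffle forecasting pairs
```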
During testing, to obtain higher performance, the final load prediction is obtained by averaging the results of 10 independent experiments with different initialisations and randomisations.
The model is implemented in Python using the TensorFlow framework [49]. The experimental environment is CPU i7-2600 with 8GB RAM. It requires about 10 min, 20 min, and 2 h to train one load demand model for PL, GEFCom2012 and ISO-NE, respectively.

Evaluation metric
Mean absolute percentage error (MAPE) and root mean square error (RMSE) are used as the evaluation metrics in our experiments. Let N be the total number of samples to be predicted, y_i be the ground truth and ŷ_i be the predicted value. Then MAPE is defined as

$$ \text{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \frac{\left| y_i - \hat{y}_i \right|}{y_i}, $$

and RMSE is defined as

$$ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}. $$

Since the true load demand y_i ≫ 0 in all three datasets, we can safely use MAPE for forecasting evaluation without numerical instability. For both metrics, a lower value indicates better performance.
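The two metrics in NumPy, for reference:

```python
import numpy as np

def mape(y, y_hat):
    """Mean absolute percentage error, in percent (valid since y >> 0 here)."""
    return 100.0 * np.mean(np.abs(y - y_hat) / y)

def rmse(y, y_hat):
    """Root mean square error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))
```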

CASE STUDIES AND RESULTS
In this section, three case studies on different datasets are performed. The results demonstrate the effectiveness and the outstanding performance of our proposed model compared with previous work. Discussions of feature attention learning for load forecasting and of the extension of our model to other types of power system load modelling are also presented.

Case I: PL dataset
In the PL dataset, only load demand data from 1 January 2002 to 31 December 2004 is recorded. Time attention is applied to the single-channel historical data. To compare with an artificial immune system based load forecasting model (AISLFS [50]), a neural network model (NN [51]) is also included, because it is reported as the best model on the PL dataset in [50]. Four general forecasting approaches are included for comparison as well. Persistence forecasting (Naive), which predicts the next-day load using the previous-day load, indicates how different the load demands are between two successive days and provides the lower bound of the performance. Random forest (RF) [52] uses a rolling window of length 24 as features to predict the load value 24 points after the window; in our experiments, 500 trees are used to ensure good forecasting performance. A multilayer perceptron (MLP) [53] containing 3 hidden layers with 129, 256 and 128 hidden units takes the same input and output setting as the proposed model. To demonstrate the necessity of the Seq2Seq model, we also use a single RNN with 2 layers of 150 GRU units to directly predict the load with time lag 24 from the historical sequence.
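For reference, one reading of the RF rolling-window setup can be sketched as follows; the exact target offset is our interpretation of "24 points after the window", and the data here is random placeholder input:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rolling_windows(load, width=24):
    """Build (window, target) pairs: each 24-step window predicts the
    value 24 points after the window end."""
    X = np.stack([load[i:i + width] for i in range(len(load) - 2 * width + 1)])
    y = load[2 * width - 1:]
    return X, y

X, y = rolling_windows(np.random.rand(1000))
rf = RandomForestRegressor(n_estimators=500).fit(X, y)
```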
The forecasting performance on the testing set is shown in the first two columns of Table 1, where Seq2Seq denotes the proposed model. The proposed model achieves higher accuracy than the RNN even with a smaller number of hidden units, because the Seq2Seq structure is able to encode the entire previous day of data before prediction. Note that the hyper-parameters of RF, MLP and RNN are not fully tuned for the best performance, so better performance can be expected from them, but they are unlikely to outperform the proposed model without non-trivial modifications. Also note that in Table 1, the performance evaluations of NN and AISLFS are as reported in [50], while that of ESN is as reported in [54]. The final results of RNN, MLP and the proposed Seq2Seq are averaged over 10 independent experiments.

Case II: Global energy forecasting competition 2012 from Kaggle competition dataset
We then use the data from GEFCom2012 and train the model to forecast the load at the system level (the sum of all zonal-level load demands). The dataset consists of hourly load data and hourly temperatures from 11 nearby weather stations from 1 January 2004 to 30 June 2008. The hourly temperatures of the 11 weather stations are thus channel-dependent features and are used for channel attention learning. We use the same setting as [54] to compare with their work on fine-tuning the parameters of forecasting algorithms through evolution. That is, the training samples include data from 2004 to 2007 and the testing samples include data from 2008. Over-fitting was observed when 80 hidden units were used, so 40 hidden units are used for both the encoder and the decoder in this case.
The forecasting performance on the testing data is shown in the third and fourth columns of Table 1. ESN is an echo state network based model [56], which is the best model on GEFCom2012 after parameter tuning through evolution, as reported in [54]. We notice that the RNN has difficulty modelling the load on this dataset. However, our Seq2Seq model, also based on RNNs, still achieves the best performance. Figure 3(a) shows the MAPE comparison over 10 independent experiments between using channel attention across the 11 weather stations and using the averaged temperature of the 11 stations; feature attention across time is applied in both cases. We can see that learning the channel attention weights does improve the accuracy. The learned weights for each weather station from the 10 independent experiments are shown in Figure 4, where different colours indicate different experiments and the height of each bar is the weight learned for the corresponding weather station. Notice that stations 9, 11, 5, 2 and 3 are always assigned higher weights than the other stations, which demonstrates that our model does not simply learn a random set of weights. Thus, besides improving the performance, channel attention also reveals the importance of each weather station for the GEFCom2012 forecasting task.

Case III: ISO-NE dataset
In this case, we use the data from ISO-NE and learn the model to forecast the load at the system level (the sum of all zonal-level load demands). The dataset consists of hourly historical load and weather data. The comparison of forecasting performance on ISO-NE is shown in the last two columns of Table 1.

Figure 6 shows the weeks with the best and the worst forecasting results over 2015. Each week runs from Monday to Sunday; solid curves are the ground truth demands and dashed curves are the predicted demands. We can observe that the higher errors occur around peak demand. In addition, the trend of the load during one week is correctly predicted (demand is relatively low during weekends), which is the contribution of the weekday feature. Notice that the high load demand on the Sunday of the week in Figure 6(a) is caused by the high temperature on that day (84°F at the peak), and our model is able to capture this information correctly. Since US federal holidays are not used as an input feature, the model treats 4 July as an ordinary weekend day and has a low prediction accuracy there. Though adding this feature to our model is straightforward, we did not do so because this paper does not focus on feature selection for load demand forecasting, and the contribution of the weekday feature has already demonstrated the capability of our model to encode this type of feature.

Other than the temperatures at the current time, the model surprisingly pays more attention to the temperatures in the future. This is partially because we keep the original temperatures as another feature channel and the decoder itself is able to memorise past inputs, which lets the model learn weights that focus on the future features. Another reason is the impact of human behaviour: one can decide whether to use a device based on the forecasted temperatures. The bottom-left region with high weight values can be explained as an extension of this tendency, as the dew point temperatures between two successive days at the same time are highly correlated. After around 5:00 PM, the model also learns high weights for the current temperatures, since during that time period the load demand is dominated by household use and the current temperature may immediately prompt someone to turn a device on or off at home. We also show the time attention weights of the temperature in Figure 5(b). A similar pattern is observed; however, after 7:00 PM, the model pays more attention to the temperatures of the past few hours. Thus, besides improving the forecasting performance, the attention learning can also guide the manual design of load models.

CONCLUSION AND DISCUSSION
In this work, we present a Seq2Seq based framework with GRU units for short-term load forecasting. The proposed framework also leverages feature attention learning to improve the accuracy and interpretability of the forecasting model. Unlike the common Seq2Seq, where the decoder is only fed with the previous output, our decoder additionally takes rich features computed from the raw inputs through a feature attention layer. Three publicly available datasets representing different feature scenarios have been used to demonstrate the effectiveness of the proposed model. The method has been compared with other dominant algorithms, and the results validate that our proposed Seq2Seq model performs better than the state-of-the-art load forecasting models while requiring less effort in model design and feature engineering.

This proposed work can evolve into a generalised platform for load modelling and analysis, with short-term load forecasting as just one typical application. Load models can be expressed explicitly in some cases, for example when component-by-component analysis is possible or when an equivalent representation such as ZIP applies. Most of the time, however, the load model of an aggregated node cannot be derived in an explicit manner. Our RNN-based framework learns load models implicitly and can be applied to multiple applications. For example, two Seq2Seq models can be used to learn the active and reactive power of an exponential recovery load model [57]. The historical data can be previous active and reactive power sequences, and the external feature can be the current or voltage sequence, which would enable other major applications in addition to load forecasting.
Our future work will focus on demonstrating extensions of the proposed framework that incorporate other information from power system operation.