Deep Learning for Time Series Forecasting: The Electric Load Case

Management and efficient operations in critical infrastructure such as Smart Grids take huge advantage of accurate power load forecasting which, due to its nonlinear nature, remains a challenging task. Recently, deep learning has emerged in the machine learning field achieving impressive performance in a vast range of tasks, from image classification to machine translation. Applications of deep learning models to the electric load forecasting problem are gaining interest among researchers as well as the industry, but a comprehensive and sound comparison among different architectures is not yet available in the literature. This work aims at filling the gap by reviewing and experimentally evaluating on two real-world datasets the most recent trends in electric load forecasting, by contrasting deep learning architectures on short term forecast (one day ahead prediction). Specifically, we focus on feedforward and recurrent neural networks, sequence to sequence models and temporal convolutional neural networks along with architectural variants, which are known in the signal processing community but are novel to the load forecasting one.


Introduction
Smart grids aim at creating automated and efficient energy delivery networks which improve power delivery reliability and quality, along with network security, energy efficiency, and demand-side management aspects [1]. Modern power distribution systems are supported by advanced monitoring infrastructures that produce immense amounts of data, thus enabling fine-grained analytics and improved forecasting performance. In particular, electric load forecasting emerges as a critical task in the energy field, as it supports decision making, optimal pricing strategies, seamless integration of renewables and maintenance cost reductions. Load forecasting is carried out at different time horizons, ranging from milliseconds to years, depending on the specific problem at hand.
In this work we focus on the day-ahead prediction problem, also referred to in the literature as short term load forecasting (STLF) [2]. Since deregulation of electric energy distribution and wide adoption of renewables strongly affect daily market prices, STLF is of fundamental importance for efficient power supply [3]. Furthermore, we differentiate forecasting by the granularity level at which it is applied. For instance, in the individual household scenario, load prediction is rather difficult as power consumption patterns are highly volatile. On the contrary, aggregated load consumption, i.e., that associated with a neighborhood, a region, or even an entire state, is normally easier to predict as the resulting signal exhibits slower dynamics.

Table 1 (excerpt)
[24] | GRU | D | T, C, other ** | Dongguan, China
[15] | CNN | D | C, TI | USA
[16] | CNN | D | C, TI | Sceaux, France
[25] | CNN + LSTM | D | T, C, TI | North-China
[26] | CNN + LSTM | D | - | North-Italy
Legend: T (temperature), H (humidity), P (pressure), C (calendar, including date and holidays information), TI (time); * other input features were created for this dataset; ** categorical weather information is used (e.g., sunny, cloudy). Dataset: the data source; a link is provided whenever available.
Historical power loads are time-series affected by several external time-variant factors, such as weather conditions, human activities, and temporal and seasonal characteristics, that make their prediction a challenging problem. A large variety of prediction methods has been proposed for electric load forecasting over the years, and only the most relevant ones are reviewed in this section. Autoregressive moving average models (ARMA) were among the first model families used in short-term load forecasting [4,5]. Soon they were replaced by ARIMA and seasonal ARIMA models [6] to cope with the time variance often exhibited by load profiles. In order to include exogenous variables like temperature in the forecasting method, model families were extended to ARMAX [7,8] and ARIMAX [9]. The main shortcoming of these system identification families is the linearity assumption for the system being observed, a hypothesis that does not generally hold. In order to overcome this limitation, nonlinear models like Feed Forward Neural Networks were proposed and became attractive for those scenarios exhibiting significant nonlinearity, as in load forecasting tasks [3,10-13]. The intrinsic sequential nature of time series data was then exploited by considering sophisticated techniques ranging from advanced feed forward architectures with residual connections [14] to convolutional approaches [15,16] and Recurrent Neural Networks [17,18], along with their many variants such as Echo-State Networks [18-20], Long Short-Term Memory [18,21-23] and Gated Recurrent Units [18,24]. Moreover, some hybrid architectures have also been proposed, aiming to capture the temporal dependencies in the data with recurrent networks while performing a more general feature extraction operation with convolutional layers [25,26].
Different reviews address the load forecasting topic by means of (not necessarily deep) neural networks. In [36] the authors focus on the use of some deep learning architectures for load forecasting. However, this review lacks a comprehensive comparative study of performance verified on common load forecasting benchmarks. The absence of a valid cost-performance metric does not allow the report to make conclusive statements. In [18] an exhaustive overview of recurrent neural networks for short term load forecasting is presented. This very detailed work considers one-layer (not deep) recurrent networks only. A comprehensive summary of the most relevant research works dealing with STLF by means of recurrent neural networks, convolutional neural networks and seq2seq models is presented in Table 1. It emerges that most of the works have been performed on different datasets, making it rather difficult (if not impossible) to assess their absolute performance and, consequently, to recommend the best state-of-the-art solutions for load forecasting.
In this survey we consider the most relevant (and recent) deep architectures and contrast them in terms of performance accuracy on open-source benchmarks. The considered architectures include recurrent neural networks, sequence to sequence models and temporal convolutional neural networks. The experimental comparison is performed on two different real-world datasets which are representative of two distinct scenarios. The first one considers power consumption at an individual household level, with a signal characterized by high frequency components, while the second one takes into account the aggregation of several consumers. Our contributions consist in:
• A comprehensive review. The survey provides a comprehensive investigation of deep learning architectures known to the smart grid literature as well as novel recent ones suitable for electric load forecasting.
• A multi-step prediction strategy comparison for recurrent neural networks. We study and compare how different prediction strategies can be applied to recurrent neural networks. To the best of our knowledge this work has not been done yet for deep recurrent neural networks.
• A relevant performance assessment. To the best of our knowledge, the present work provides the first systematic experimental comparison of the most relevant deep learning architectures for the electric load forecasting problems of individual and aggregated electric demand.

It should be noted that the envisaged architectures are domain independent and, as such, can be applied in different forecasting scenarios.
The rest of this paper is organized as follows.
In Section 2 we formally introduce the forecasting problems along with the notation that will be used in this work. In Section 3 we introduce Feed Forward Neural Networks (FNNs) and the main concepts relevant to the learning task. We also provide a short review of the literature regarding the use of FNNs for the load forecasting problem. In Section 4 we provide a general overview of Recurrent Neural Networks (RNNs) and their most advanced architectures: Long Short-Term Memory and Gated Recurrent Unit networks. In Section 5 Sequence To Sequence architectures (seq2seq) are discussed as a general improvement over recurrent neural networks. We present both simple and advanced models built on the sequence to sequence paradigm. In Section 6 Convolutional Neural Networks are introduced and one of their most recent variants, the temporal convolutional network (TCN), is presented as the state-of-the-art method for univariate time-series prediction. In Section 7 the real-world datasets used for model comparison are presented. For each dataset, we provide a description of the preprocessing operations and the techniques that have been used to validate the models' performance. Finally, in Section 8 we draw conclusions based on the performed assessments.
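or a subset of. To ease the notation we express the input and output vectors in the reference system of the time window instead of the time series one. By following this approach, the input vector at discrete time $t$ becomes
$$\mathbf{x}_t = [x_t[0], \dots, x_t[n_T - 1]] \in \mathbb{R}^{n_T}$$
and, analogously, the output vector is $\mathbf{y}_t = [y_t[0], \dots, y_t[n_O - 1]] \in \mathbb{R}^{n_O}$. Similarly, we denote as $\hat{\mathbf{y}}_t = f(\mathbf{x}_t; \Theta) \in \mathbb{R}^{n_O}$ the prediction vector provided by a predictive model $f$ whose parameter vector $\Theta$ has been estimated by optimizing a performance function.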
Without loss of generality, in the remainder of the paper, we drop the subscript $t$ from the inner elements of $\mathbf{x}_t$ and $\mathbf{y}_t$. The introduced notation, along with the sliding window approach, is depicted in Figure 1.
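The sliding window approach can be illustrated with a minimal sketch. It assumes a univariate series, a window that advances by one sample at a time, and targets that immediately follow each input window; the function name `sliding_windows` and the toy values are hypothetical.

```python
def sliding_windows(series, n_T, n_O):
    """Split a univariate series into (input, target) pairs.

    Each input holds n_T consecutive samples and each target holds the
    n_O samples that immediately follow it (assumed stride: one sample).
    """
    pairs = []
    last_start = len(series) - n_T - n_O
    for start in range(last_start + 1):
        x = series[start:start + n_T]                  # regressor window
        y = series[start + n_T:start + n_T + n_O]      # values to forecast
        pairs.append((x, y))
    return pairs


# Hypothetical toy load measurements
load = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]
pairs = sliding_windows(load, n_T=3, n_O=2)
```

With six samples, n_T = 3 and n_O = 2, two input-output pairs are produced.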
In certain applications we will additionally be provided with $d - 1$ exogenous variables (e.g., the temperatures), each of which represents a univariate time series aligned in time with the electricity demand data. In this scenario the components of the regressor vector become vectors, i.e., $\mathbf{x}_t = [\mathbf{x}[0], \dots, \mathbf{x}[n_T - 1]] \in \mathbb{R}^{n_T \times d}$. Indeed, each element of the input sequence is represented as $\mathbf{x}[t] = [x[t], z_1[t], \dots, z_{d-1}[t]] \in \mathbb{R}^d$, where $x[t] \in \mathbb{R}$ is the scalar load measurement at time $t$, while $z_k[t] \in \mathbb{R}$ is the scalar value of the $k$-th exogenous feature.
The nomenclature used in this work is given in Table 2.

Feed Forward Neural Networks
Feed Forward Neural Networks (FNNs) are parametric model families characterized by the universal function approximation property [37]. Their computational architectures are composed of a layered structure consisting of three main building blocks: the input layer, the hidden layer(s) and the output layer. The number of hidden layers ($L > 1$) determines the depth of the network, while the size of each layer, i.e., the number $n_{H,\ell}$ of hidden units of the $\ell$-th layer, defines its complexity in terms of neurons. FNNs provide only direct forward connections between two consecutive layers, each connection being associated with a trainable parameter; note that, given the feedforward nature of the computation, no recursive feedback is allowed. More in detail, given a vector $\mathbf{x} \in \mathbb{R}^{n_T}$ fed at the network input, the FNN's computation can be expressed as:
$$\mathbf{a}_0 = \mathbf{x}, \qquad \mathbf{a}_\ell = \phi_\ell(W_\ell^T \mathbf{a}_{\ell-1} + \mathbf{b}_\ell), \quad \ell = 1, \dots, L, \qquad \hat{\mathbf{y}} = \mathbf{a}_L$$
where $\phi_\ell(\cdot)$ is the activation function of the $\ell$-th layer. Each layer is characterized by its own parameter matrix $W_\ell \in \mathbb{R}^{n_{H,\ell-1} \times n_{H,\ell}}$ and bias vector $\mathbf{b}_\ell \in \mathbb{R}^{n_{H,\ell}}$.
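As an illustration of the layered computation, the following sketch chains affine transformations and activations in plain Python. The helper names (`dense`, `fnn_forward`), the layer sizes and the weight values are hypothetical; weight matrices are stored with one row per input unit, matching the $n_{H,\ell-1} \times n_{H,\ell}$ shape stated in the text.

```python
import math


def dense(a, W, b, activation):
    """One layer: activation(W^T a + b), with W stored as n_in x n_out."""
    n_out = len(b)
    z = [sum(a[i] * W[i][j] for i in range(len(a))) + b[j] for j in range(n_out)]
    return [activation(v) for v in z]


def fnn_forward(x, layers):
    """Propagate x through a list of (W, b, activation) triples."""
    a = x
    for W, b, activation in layers:
        a = dense(a, W, b, activation)
    return a


def identity(v):
    return v


# Hypothetical network: 3 inputs -> 2 tanh units -> 1 linear output
layers = [
    ([[0.5, -0.5], [0.25, 0.75], [1.0, 0.0]], [0.0, 0.1], math.tanh),
    ([[1.0], [2.0]], [0.5], identity),
]
y_hat = fnn_forward([1.0, 2.0, 3.0], layers)
```

Only consecutive layers are connected, so the loop never feeds an activation back to an earlier layer.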
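Hereafter, in order to ease the notation, we incorporate the bias term in the weight matrix, i.e., each layer's input is extended with a constant 1 and $\mathbf{b}_\ell$ becomes an additional row of $W_\ell$. Given a training set of $N$ input-output vectors in the $(\mathbf{x}_i, \mathbf{y}_i)$ form, $i = 1, \dots, N$, the learning procedure aims at identifying a suitable configuration of parameters $\hat{\Theta}$ that minimizes a loss function $\mathcal{L}$ evaluating the discrepancy between the estimated values $f(\mathbf{x}_i; \Theta)$ and the measurements $\mathbf{y}_i$:
$$\hat{\Theta} = \arg\min_{\Theta} \mathcal{L}(\Theta).$$
The mean squared error
$$\mathcal{L}(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \lVert \mathbf{y}_i - f(\mathbf{x}_i; \Theta) \rVert_2^2$$
is a very popular loss function for time series prediction and, not rarely, a regularization penalty term is introduced to prevent overfitting and improve the generalization capabilities of the model:
$$\mathcal{L}(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \lVert \mathbf{y}_i - f(\mathbf{x}_i; \Theta) \rVert_2^2 + \Omega(\Theta). \tag{4}$$
The most used regularization scheme controlling model complexity is the L2 regularization $\Omega(\Theta) = \lambda \lVert \Theta \rVert_2^2$, with $\lambda$ a suitable hyper-parameter controlling the regularization strength.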
As Equation 4 is not convex, the solution cannot be obtained in closed form with linear equation solvers or convex optimization techniques. Parameter estimation (the learning procedure) operates iteratively, e.g., by means of the gradient descent approach:
$$\Theta_k = \Theta_{k-1} - \eta \, \nabla_\Theta \mathcal{L}(\Theta) \big|_{\Theta = \Theta_{k-1}}$$
where $\eta$ is the learning rate and $\nabla_\Theta \mathcal{L}(\Theta)$ the gradient w.r.t. $\Theta$. Stochastic Gradient Descent (SGD), RMSProp [38], Adagrad [39] and Adam [40] are popular learning procedures. The learning procedure yields the estimate $\hat{\Theta} = \Theta_k$ associated with the predictive model $f(\mathbf{x}_t; \hat{\Theta})$.
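The iterative update can be sketched on a deliberately simple case. The example below assumes a hypothetical one-parameter linear model f(x; θ) = θx, for which the regularized mean squared error happens to be convex; it is meant only to show the update rule and the shrinking effect of the L2 penalty, not the non-convex setting of a deep network. All names and values are illustrative.

```python
def grad(theta, xs, ys, lam):
    """Gradient of mean((theta*x - y)^2) + lam * theta^2 w.r.t. theta."""
    g = sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return g + 2.0 * lam * theta


def gradient_descent(xs, ys, lam=0.0, eta=0.05, steps=200, theta=0.0):
    """Repeat theta <- theta - eta * gradient for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - eta * grad(theta, xs, ys, lam)
    return theta


# Hypothetical data generated by y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
theta_hat = gradient_descent(xs, ys)            # no regularization
theta_reg = gradient_descent(xs, ys, lam=1.0)   # L2 penalty pulls theta toward 0
```

Without the penalty the estimate approaches 2; with λ = 1 the minimizer moves to 28/17, showing how the regularization strength trades data fit for smaller parameters.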
In our work, deep FNNs are the baseline model architectures.
In multi-step ahead prediction the output layer dimension coincides with the forecasting horizon n O > 1. The dimension of the input vector depends also on the presence of exogenous variables; this aspect is further discussed in Section 7.

Related Work
The use of Feed Forward Neural Networks in short term load forecasting dates back to the 1990s. The authors of [11] propose a shallow neural network with a single hidden layer to provide a 24-hour forecast using both load and temperature information. In [10] one-day-ahead forecasting is implemented using two different prediction strategies: one network provides all 24 forecast values in a single shot (MIMO strategy), while another single-output network provides the day-ahead prediction by recursively feeding back its last value estimate (recurrent strategy). The recurrent strategy is shown to be more efficient in terms of both training time and forecasting accuracy. In [41] the authors present a feed forward neural network to forecast electric loads on a weekly basis. The sparsely connected feed forward architecture receives the load time-series, temperature readings, as well as the time and day of the week. It is shown that the extra information improves the forecast accuracy compared to an ARIMA model trained on the same task. [12] presents one of the first multi-layer FNNs to forecast the hourly load of a power system.
A detailed review concerning applications of artificial neural networks in short-term load forecasting can be found in [3]. However, this survey dates back to the early 2000s, and does not discuss deep models. More recently, architectural variants of feed forward neural networks have been used; for example, in [14] a ResNet [42] inspired model is used to provide day ahead forecast by leveraging on a very deep architecture. The article shows a significant improvement on aggregated load forecasting when compared to other (not-neural) regression models on different datasets.

Recurrent Neural networks
In this section we overview recurrent neural networks and, in particular, the Elman network architecture [43], Long Short-Term Memory [44] and Gated Recurrent Unit [45] networks. Afterwards, we introduce deep recurrent neural networks and discuss different strategies to perform multi-step ahead forecasting. Finally, we present related work in short-term load forecasting that leverages recurrent networks.

Elman RNNs (ERNN)
Elman Recurrent Neural Networks (ERNN) were proposed in [43] to generalize feedforward neural networks for better handling of ordered data sequences like time-series.
The reason behind the effectiveness of RNNs in dealing with sequences of data comes from their ability to learn a compact representation of the input sequence $\mathbf{x}_t$ by means of a recurrent function $f$ that implements the following mapping:
$$\mathbf{h}[t] = f(\mathbf{h}[t-1], \mathbf{x}[t]; \Theta) \tag{6}$$

Figure 2. (Right) The network after unfolding. Note that the structure resembles that of a (deep) feed forward neural network but, here, each layer is constrained to share the same weights. $\mathbf{h}_{init}$ is the initial state of the network, which is usually set to zero.

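By expanding Equation 6 and given a sequence of inputs $\mathbf{x}_t = [\mathbf{x}[0], \dots, \mathbf{x}[n_T - 1]]$, $\mathbf{x}[t] \in \mathbb{R}^d$, the computation becomes:
$$\mathbf{h}[t] = \phi(W^T \mathbf{h}[t-1] + U^T \mathbf{x}[t]) \tag{8}$$
$$\mathbf{y}[t] = \psi(V^T \mathbf{h}[t]) \tag{9}$$
where $W \in \mathbb{R}^{n_H \times n_H}$, $U \in \mathbb{R}^{d \times n_H}$, $V \in \mathbb{R}^{n_H \times n_O}$ are the weight matrices for hidden-hidden, input-hidden and hidden-output connections respectively, $\phi(\cdot)$ is an activation function (generally the hyperbolic tangent) and $\psi(\cdot)$ is normally a linear function. The computation of a single module in an Elman recurrent neural network is depicted in Figure 3.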
It can be noted that an ERNN processes one element of the sequence at a time, preserving its inherent temporal order. After reading an element from the input sequence x[t] ∈ IR d the network updates its internal state h[t] ∈ IR n H using both (a transformation of) the latest state h[t − 1] and (a transformation of) the current input (Equation 6). The described process can be better visualized as an acyclic graph obtained from the original cyclic graph (left side of Figure 2) via an operation known as time unfolding (right side of Figure 2). It is of fundamental importance to point out that all nodes in the unfolded network share the same parameters, as they are just replicas distributed over time.
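The state update described above can be sketched in a few lines. The example assumes a hyperbolic tangent state activation, a linear read-out applied to the last state only, and a zero initial state; the helper names (`matvec_T`, `ernn_forward`) and the weight values are hypothetical.

```python
import math


def matvec_T(M, v):
    """Return M^T v for M stored as a list of len(v) rows."""
    n_out = len(M[0])
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(n_out)]


def ernn_forward(xs, W, U, V, h_init=None):
    """Read the sequence xs one element at a time and return (output, last state)."""
    n_H = len(W)
    h = list(h_init) if h_init is not None else [0.0] * n_H
    for x in xs:
        rec = matvec_T(W, h)   # transformation of the previous state
        inp = matvec_T(U, x)   # transformation of the current input
        h = [math.tanh(r + i) for r, i in zip(rec, inp)]
    return matvec_T(V, h), h   # linear read-out of the last state


# Hypothetical sizes: d = 1, n_H = 2, n_O = 1
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[1.0, -1.0]]
V = [[1.0], [1.0]]
y, h = ernn_forward([[0.1], [0.2]], W, U, V)
```

The same W, U and V are reused at every timestep, which is exactly the weight sharing of the unfolded network.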
The parameters of the network $\Theta = [W, U, V]$ are usually learned via Backpropagation Through Time (BPTT) [46,47], a generalized version of standard Backpropagation. In order to apply gradient-based optimization, the recurrent neural network has to be transformed through the unfolding procedure shown in Figure 2. In this way, the network is converted into an FNN having as many layers as time intervals in the input sequence, and each layer is constrained to have the same weight matrices. In practice, Truncated Backpropagation Through Time [48], TBPTT($\tau_b$, $\tau_f$), is used. The method processes an input window of length $n_T$ one timestep at a time and runs BPTT for $\tau_b$ timesteps every $\tau_f$ steps. Notice that having $\tau_b < n_T$ does not limit the memory capacity of the network, as the hidden state incorporates information taken from the whole sequence. Despite that, setting $\tau_b$ to a very low number may result in poor performance. In the literature BPTT is considered equivalent to TBPTT($\tau_b = n_T$, $\tau_f = 1$). In this work we used epoch-wise Truncated BPTT, i.e., TBPTT($\tau_b = n_T$, $\tau_f = n_T$), to indicate that the weight update is performed once a whole sequence has been processed. Despite the model's simplicity, Elman RNNs are hard to train due to the ineffectiveness of gradient (back)propagation. In fact, it emerges that the propagation of the gradient is effective for short-term connections but is very likely to fail for long-term ones, where the gradient norm usually shrinks to zero or diverges. These two behaviours are known as the vanishing gradient and the exploding gradient problems [49,50] and have been extensively studied in the machine learning community.

Long Short-Term Memory (LSTM)
Recurrent neural networks with Long Short-Term Memory (LSTM) were introduced to cope with the vanishing and exploding gradient problems occurring in ERNNs and, more generally, in standard RNNs [44]. LSTM networks maintain the same topological structure as ERNNs but differ in the composition of the inner module (or cell).
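Each LSTM cell has the same input and output as an ordinary ERNN cell but, internally, it implements a gated system that controls the neural information processing (see Figures 3 and 4). The key feature of gated networks is their ability to control the gradient flow by acting on the gate values; this makes it possible to tackle the vanishing gradient problem, as an LSTM can maintain its internal memory unaltered for long time intervals. Notice from the equations below that the inner state of the network results from a linear combination of the old state and the new state (Equation 14). Part of the old state is preserved and flows forward, while in the ERNN the state value is completely replaced at each timestep (Equation 8). In detail, the neural computation is:
$$\mathbf{f}[t] = \psi(W_f \mathbf{h}[t-1] + U_f \mathbf{x}[t])$$
$$\mathbf{i}[t] = \psi(W_i \mathbf{h}[t-1] + U_i \mathbf{x}[t])$$
$$\mathbf{o}[t] = \psi(W_o \mathbf{h}[t-1] + U_o \mathbf{x}[t])$$
$$\tilde{\mathbf{c}}[t] = \phi(W_c \mathbf{h}[t-1] + U_c \mathbf{x}[t])$$
$$\mathbf{c}[t] = \mathbf{f}[t] \odot \mathbf{c}[t-1] + \mathbf{i}[t] \odot \tilde{\mathbf{c}}[t] \tag{14}$$
$$\mathbf{h}[t] = \mathbf{o}[t] \odot \phi(\mathbf{c}[t])$$
where $\mathbf{f}[t]$, $\mathbf{i}[t]$ and $\mathbf{o}[t]$ are the forget, input and output gates, $\odot$ is the element-wise product, $\psi(\cdot)$ is generally a sigmoid activation while $\phi(\cdot)$ can be any non-linear one (hyperbolic tangent in the original paper).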
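To make the gating mechanism concrete, the following sketch implements one step of a standard LSTM cell without bias terms. The parameter names (`Wf`, `Uf`, and so on), the dictionary layout and the toy values are assumptions chosen for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, h, U, x):
    """Return W h + U x for row-major W (n_H x n_H) and U (n_H x d)."""
    return [sum(w * hj for w, hj in zip(W_row, h)) +
            sum(u * xj for u, xj in zip(U_row, x))
            for W_row, U_row in zip(W, U)]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p maps hypothetical names such as 'Wf', 'Uf' to matrices."""
    f = [sigmoid(v) for v in affine(p["Wf"], h_prev, p["Uf"], x)]    # forget gate
    i = [sigmoid(v) for v in affine(p["Wi"], h_prev, p["Ui"], x)]    # input gate
    o = [sigmoid(v) for v in affine(p["Wo"], h_prev, p["Uo"], x)]    # output gate
    cand = [math.tanh(v) for v in affine(p["Wc"], h_prev, p["Uc"], x)]
    # The new state mixes the old state with the candidate state.
    c = [fj * cj + ij * nj for fj, cj, ij, nj in zip(f, c_prev, i, cand)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


# Hypothetical one-unit cell with all weights at zero: every gate equals 0.5
zeros = {name: [[0.0]] for name in
         ("Wf", "Uf", "Wi", "Ui", "Wo", "Uo", "Wc", "Uc")}
h, c = lstm_step([1.0], [0.0], [2.0], zeros)
```

With all gates at 0.5 and a zero candidate, half of the previous cell state survives the step, which illustrates how part of the old state flows forward instead of being overwritten.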

Gated Recurrent Units (GRU)
First introduced in [45], GRUs are a simplified variant of LSTM and, as such, belong to the family of gated RNNs. GRUs distinguish themselves from LSTMs by merging into one gate the functionalities controlled by the forget gate and the input gate. This kind of cell ends up having just two gates, which results in a more parsimonious architecture compared to LSTM, which instead has three gates. The basic components of a GRU cell are outlined in Figure 5, whereas the neural computation is controlled by:
$$\mathbf{u}[t] = \psi(W_u \mathbf{h}[t-1] + U_u \mathbf{x}[t])$$
$$\mathbf{r}[t] = \psi(W_r \mathbf{h}[t-1] + U_r \mathbf{x}[t])$$
$$\tilde{\mathbf{h}}[t] = \phi(W_c (\mathbf{r}[t] \odot \mathbf{h}[t-1]) + U_c \mathbf{x}[t])$$
$$\mathbf{h}[t] = \mathbf{u}[t] \odot \mathbf{h}[t-1] + (1 - \mathbf{u}[t]) \odot \tilde{\mathbf{h}}[t]$$
where $W_u, W_r, W_c \in \mathbb{R}^{n_H \times n_H}$ and $U_u, U_r, U_c \in \mathbb{R}^{n_H \times d}$ are the parameters to be learned, $\psi(\cdot)$ is generally a sigmoid activation while $\phi(\cdot)$ can be any kind of non-linearity (in the original work it was a hyperbolic tangent). $\mathbf{u}[t]$ and $\mathbf{r}[t]$ are the update and the reset gates, respectively. Several works in the natural language processing community show that GRUs perform comparably to LSTM but generally train faster due to the lighter computation [52,53].
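A single GRU step can be sketched as follows. Note that the literature uses two opposite conventions for the update gate; this sketch assumes that the gate weights the previous state and its complement weights the candidate state, and it omits bias terms. Parameter names and values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, h, U, x):
    """Return W h + U x for row-major W (n_H x n_H) and U (n_H x d)."""
    return [sum(w * hj for w, hj in zip(W_row, h)) +
            sum(u * xj for u, xj in zip(U_row, x))
            for W_row, U_row in zip(W, U)]


def gru_step(x, h_prev, p):
    """One GRU step (assumed convention: u keeps the old state)."""
    u = [sigmoid(v) for v in affine(p["Wu"], h_prev, p["Uu"], x)]   # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], h_prev, p["Ur"], x)]   # reset gate
    gated = [rj * hj for rj, hj in zip(r, h_prev)]
    cand = [math.tanh(v) for v in affine(p["Wc"], gated, p["Uc"], x)]
    return [uj * hj + (1.0 - uj) * cj for uj, hj, cj in zip(u, h_prev, cand)]


# Hypothetical one-unit cell with all weights at zero: both gates equal 0.5
zeros = {name: [[0.0]] for name in ("Wu", "Uu", "Wr", "Ur", "Wc", "Uc")}
h = gru_step([1.0], [2.0], zeros)
```

Compared with an LSTM step, there is no separate cell state and one gate fewer, which is the source of the lighter computation mentioned in the text.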

Deep Recurrent Neural Networks
All recurrent architectures presented so far are characterized by a single layer. In turn, this implies that the computation is composed of an affine transformation followed by a non-linearity. That said, the concept of depth in RNNs is less straightforward than in feed-forward architectures. Indeed, the latter become deep when the input is processed by a large number of non-linear transformations before generating the output values. However, according to this definition, an unfolded RNN is already a deep model given its multiple non-linear processing layers. That said, deep multi-level processing can be applied to all the transition functions (input-hidden, hidden-hidden, hidden-output), as there are no intermediate layers involved in these computations [54]. Depth can also be introduced in recurrent neural networks by stacking recurrent layers one on top of the other [55]. As this deep architecture is more intriguing, in this work we refer to it as a Deep RNN. By iterating the RNN computation, the function implemented by the deep architecture can be represented as:
$$\mathbf{h}_\ell[t] = f_\ell(\mathbf{h}_\ell[t-1], \mathbf{h}_{\ell-1}[t]; \Theta_\ell), \quad \ell = 1, \dots, L,$$
with $\mathbf{h}_0[t] = \mathbf{x}[t]$. It has been empirically shown in several works that Deep RNNs are better at capturing the temporal hierarchy exhibited by time-series than their shallow counterparts [54,56,57]. Of course, hybrid architectures having different layers (recurrent or not) can be considered as well.
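Stacking recurrent layers can be sketched with scalar Elman-style cells: at each timestep the state of one layer becomes the input of the layer above. The helper names (`make_cell`, `deep_rnn`) and the weights are hypothetical, and scalar states are used only to keep the example short.

```python
import math


def make_cell(w, u):
    """Scalar Elman-style cell: h[t] = tanh(w * h[t-1] + u * input)."""
    return lambda h_prev, inp: math.tanh(w * h_prev + u * inp)


def deep_rnn(xs, cells):
    """Run a stack of recurrent cells over xs and return the last states."""
    states = [0.0] * len(cells)            # one hidden state per layer
    for x in xs:
        layer_input = x
        for l, cell in enumerate(cells):
            states[l] = cell(states[l], layer_input)
            layer_input = states[l]        # feeds the layer above
    return states


# Hypothetical two-layer stack
cells = [make_cell(0.5, 1.0), make_cell(0.5, 1.0)]
states = deep_rnn([0.1, 0.2], cells)
```

Each layer keeps its own state across time, so depth is added along the layer axis while recurrence still runs along the time axis.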

Multi-Step Prediction Schemes
There are five different architecture-independent strategies for multi-step ahead forecasting [58]:

Recursive strategy (Rec): a single model is trained to perform a one-step ahead forecast given the input sequence. Subsequently, during the operational phase, the forecasted output is recursively fed back and considered to be the correct one. By iterating this procedure $n_O$ times we generate the forecast values up to time $t + n_O$. The procedure is described in Algorithm 1, where x[1:] is the input vector without its first element, while the vectorize(·) procedure concatenates the scalar output $o$ to the exogenous input variables.

Algorithm 1
[...]
x ← concatenate(x[1:], vectorize(o))
k ← k + 1
end while
return $\mathbf{o}$ as $\hat{\mathbf{y}}_t$

To summarize, the predictor $f$ receives as input a vector $\mathbf{x}$ of length $n_T$ and outputs a scalar value $o$.

Direct strategy: a set of $n_O$ independent predictors $f_k$, $k = 1, \dots, n_O$, is designed, each of which provides a forecast at time $t + k$. Similarly to the recursive strategy, each predictor $f_k$ outputs a scalar value $o$, but the input vector is the same for all the predictors. Algorithm 2 details the procedure.

Algorithm 2 Direct Strategy for Multi-Step Forecasting
[...]
k ← k + 1
end while
return $\mathbf{o}$ as $\hat{\mathbf{y}}_t$

DirRec strategy [59]: a combination of the above two strategies. Similar to the direct approach, $n_O$ models are used but, here, each predictor leverages an enlarged input set, obtained by adding the results of the forecast at the previous timestep. The procedure is detailed in Algorithm 3.

Algorithm 3 DirRec Strategy for Multi-Step Forecasting
[...]
x ← concatenate(x, vectorize(o))
k ← k + 1
end while
return $\mathbf{o}$ as $\hat{\mathbf{y}}_t$

MIMO strategy (Multiple Input - Multiple Output) [60]: a single predictor $f$ is trained to forecast a whole output sequence of length $n_O$ in one shot, i.e., differently from the previous cases, the output of the model is not a scalar but a vector: $\hat{\mathbf{y}}_t = f(\mathbf{x}_t) \in \mathbb{R}^{n_O}$.

DIRMO strategy [61]: a trade-off between the Direct strategy and the MIMO strategy. It divides the $n_O$-step forecast into smaller forecasting problems, each of length $s$. It follows that $n_O / s$ predictors are used to solve the problem.
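The recursive and direct strategies can be contrasted with a short sketch. It assumes the univariate case, so the vectorize step that appends exogenous variables is omitted, and it uses hypothetical toy predictors in place of trained networks.

```python
def recursive_forecast(f, x, n_O):
    """Roll a one-step predictor f forward n_O times (univariate case)."""
    x = list(x)
    outputs = []
    for _ in range(n_O):
        o = f(x)             # scalar one-step-ahead forecast
        outputs.append(o)
        x = x[1:] + [o]      # drop the oldest value, feed the forecast back
    return outputs


def direct_forecast(predictors, x):
    """Apply n_O independent predictors to the same input window."""
    return [f_k(x) for f_k in predictors]


# Hypothetical toy predictors standing in for trained models
def mean_model(x):
    return sum(x) / len(x)


window = [1.0, 2.0, 3.0]
rec = recursive_forecast(mean_model, window, n_O=2)
direct = direct_forecast([lambda x: x[-1], lambda x: x[-1] + 1.0], window)
```

In the recursive case the second forecast is computed from a window that already contains the first forecast, so early errors propagate; in the direct case every predictor sees only measured values, at the cost of training n_O separate models. A MIMO predictor would instead return the whole list from a single call.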
Given the considerable computational demand required by RNNs during training, we focus on multi-step forecasting strategies that are computationally cheaper, specifically, Recursive and MIMO strategies [58]. We will call them RNN-Rec and RNN-MIMO.
Given the hidden state $\mathbf{h}[t]$ at timestep $t$, the hidden-output mapping is obtained through a fully connected layer on top of the recurrent neural network. The objective of this dense network is to learn the mapping between the last state of the recurrent network, which represents a kind of lossy summary of the task-relevant aspects of the input sequence, and the output domain. This holds for all the presented recurrent networks and is consistent with Equation 9. In this work RNN-Rec and RNN-MIMO differ in the cardinality of the output domain, which is 1 for the former and $n_O$ for the latter, meaning that in Equation 9 either $V \in \mathbb{R}^{n_H \times 1}$ or $V \in \mathbb{R}^{n_H \times n_O}$. The objective function is the regularized mean squared error introduced for FNNs, computed over the network outputs:
$$\mathcal{L}(\Theta) = \frac{1}{n_O} \sum_{t=0}^{n_O - 1} (y[t] - \hat{y}[t])^2 + \Omega(\Theta)$$

Related work
In [17] an Elman recurrent neural network is considered to provide hourly load forecasts. The study also compares the performance of the network when additional weather information such as temperature and humidity is fed to the model. The authors conclude that, as expected, the recurrent network benefits from multi-input data and, in particular, weather data. [28] makes use of an ERNN to forecast household electric consumption obtained from a suburban area in the neighbourhood of Palermo (Italy). In addition to the historical load measurements, the authors introduce several features to enhance the model's predictive capabilities. Besides the weather and the calendar information, a specific ad-hoc index was created to assess the influence of the use of air-conditioning equipment on the electricity demand. In recent years, LSTMs have been adopted in short term load forecasting, proving to be more effective than traditional time-series analysis methods. In [21] LSTM is shown to outperform traditional forecasting methods, being able to exploit the long term dependencies in the time series to forecast the day-ahead load consumption. Several works proved to be successful in enhancing the recurrent neural network capabilities by employing multivariate input data.
In [22] the authors propose a deep, LSTM-based architecture that uses past measurements of the whole household consumption along with some measurements from selected appliances to forecast the consumption of the subsequent time interval (i.e., a one-step prediction). In [23] an LSTM-based network is trained using a multivariate input which includes temperature, holiday/working day information, and date and time information. Similarly, in [31] a power demand forecasting model based on LSTM shows an accuracy improvement compared to more traditional machine learning techniques such as Gradient Boosting Trees and Support Vector Regression.
GRUs have not been used much in the literature as LSTM networks are often preferred. That said, the use of GRU-based networks is reported in [18], while a more recent study [24] uses GRUs for the daily consumption forecast of individual customers. Thus, investigating deep GRU-based architectures is a relevant scientific topic, also thanks to their faster convergence and simpler structure compared to LSTM [52].
Despite all these promising results, an extensive study of recurrent neural networks [18], and in particular of ERNN, LSTM, GRU, ESN [62] and NARX, concludes that none of the investigated recurrent architectures manages to outperform the others in all considered experiments. Moreover, the authors noticed that recurrent cells with gated mechanisms like LSTM and GRU perform comparably to the much simpler ERNN. This may indicate that in short-term load forecasting the gating mechanism may be unnecessary; this issue is further investigated, and evidence found, in the present work.

Sequence To Sequence models
Sequence To Sequence (seq2seq) architectures [63], or encoder-decoder models [45], were initially designed to overcome the inability of RNNs to produce output sequences of arbitrary length. The architecture was first used in neural machine translation [45,64,65] but has emerged as the gold standard in different fields such as speech recognition [66-68] and image captioning [69].
The core idea of this general framework is to employ two networks, resulting in an encoder-decoder architecture. The first neural network (possibly deep) $f$, an encoder, reads the input sequence $\mathbf{x}_t \in \mathbb{R}^{n_T \times d}$ of length $n_T$ one timestep at a time; the computation generates a, generally lossy, fixed dimensional vector representation of it, $\mathbf{c} = f(\mathbf{x}_t, \Theta_f)$, $\mathbf{c} \in \mathbb{R}^d$. This embedded representation is usually called context in the literature and can be the last hidden state of the encoder or a function of it. Then, a second neural network $g$, the decoder, learns how to produce the output sequence $\hat{\mathbf{y}}_t \in \mathbb{R}^{n_O}$ given the context vector, i.e., $\hat{\mathbf{y}} = g(\mathbf{c}, \Theta_g)$. The schematic of the whole architecture is depicted in Figure 6.
The encoder and the decoder modules are generally two recurrent neural networks trained end-to-end to minimize the objective function:
$$\mathcal{L}(\Theta) = \sum_{t=0}^{n_O - 1} (\hat{y}[t] - y[t])^2 + \Omega(\Theta), \qquad \hat{y}[t] = g(y[t-1], \mathbf{h}[t-1], \mathbf{c}; \Theta_g), \quad \mathbf{c} = f(\mathbf{x}; \Theta_f) \tag{23}$$
where $\hat{y}[t]$ is the decoder's estimate at time $t$, $y[t]$ is the real measurement, $\mathbf{h}[t-1]$ is the decoder's last state, $\mathbf{c}$ is the context vector from the encoder, $\mathbf{x}$ is the input sequence and $\Omega(\Theta)$ the regularization term. The training procedure for this type of architecture is called teacher forcing [70]. As shown in Figure 6 and expressed in Equation 23, during training, the decoder's input at time $t$ is the ground-truth value $y[t-1]$, which is then used to generate the next state $\mathbf{h}[t]$ and, then, the estimate $\hat{y}[t]$. During inference the true values are unavailable and are replaced by the estimates:
$$\hat{y}[t] = g(\hat{y}[t-1], \mathbf{h}[t-1], \mathbf{c}; \Theta_g) \tag{24}$$
This discrepancy between training and testing results in errors accumulating over time during inference. In the literature this problem is often referred to as exposure bias [71]. Several solutions have been proposed to address this problem; in [72] the authors present scheduled sampling, a curriculum learning strategy that gradually changes the training process by switching the decoder's inputs from ground-truth values to the model's predictions. The professor forcing algorithm, introduced in [73], uses an adversarial framework to encourage the dynamics of the recurrent network to be the same both at training and operational (test) time. Finally, in recent years, reinforcement learning methods have been adopted to train sequence to sequence models; a comprehensive review is presented in [74].
In this work we investigate two sequence to sequence architectures, one trained via teacher forcing (TF) and one using self-generated (SG) samples. The former is characterized by Equation 23 during training while Equation 24 is used during prediction. The latter architecture adopts Equation 24 both for training and prediction. The decoder's dynamics are summarized in Figure 7. It is clear that the two training procedures differ in the decoder's input source: ground-truth values in teacher forcing, estimated values in self-generated training.
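The difference between the two decoding regimes can be sketched with a generic decoding loop. The `step` function stands for a hypothetical decoder cell that maps the previous value, the previous state and the context to a new estimate and state; the toy cell and all numeric values are illustrative only.

```python
def decode(step, c, h0, y_start, n_O, targets=None):
    """Run a decoder for n_O steps.

    step(y_prev, h_prev, c) -> (y_hat, h) is a hypothetical decoder cell.
    If targets is given, the ground truth is fed back (teacher forcing);
    otherwise the decoder consumes its own previous estimate.
    """
    y_prev, h = y_start, h0
    estimates = []
    for t in range(n_O):
        y_hat, h = step(y_prev, h, c)
        estimates.append(y_hat)
        y_prev = targets[t] if targets is not None else y_hat
    return estimates


# Hypothetical toy cell: the estimate depends on the previous value and the context
def toy_step(y_prev, h_prev, c):
    return 0.5 * y_prev + c, h_prev


tf = decode(toy_step, c=1.0, h0=0.0, y_start=0.0, n_O=3, targets=[2.0, 4.0, 6.0])
sg = decode(toy_step, c=1.0, h0=0.0, y_start=0.0, n_O=3)
```

The first estimate is identical in both runs, while the later ones diverge because the self-generated run conditions on its own outputs; this is the training-inference mismatch that teacher forcing introduces when only the second mode is available at prediction time.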

Related Work
Only recently have seq2seq models been adopted in short term load forecasting. In [33] an LSTM-based encoder-decoder model is shown to produce superior performance compared to a standard LSTM. In [75] the authors demonstrate its better performance with respect to a suite of models ranging from standard RNNs to classical time series techniques.

Convolutional Neural Networks
Convolutional Neural Networks (CNNs) [76] are a family of neural networks designed to work with data that can be structured in a grid-like topology. CNNs were originally used on two-dimensional and three-dimensional images, but they are also suitable for one-dimensional data such as univariate time-series. Once recognized as a very efficient solution for image recognition and classification [42,77-79], CNNs have experienced wide adoption in many different computer vision tasks [80-84]. Moreover, sequence modeling tasks, like short term electric load forecasting, have been mainly addressed with recurrent neural networks, but recent research indicates that convolutional networks can also attain state-of-the-art performance in several applications including audio generation [85], machine translation [86] and time-series prediction [87].
As the name suggests, these kind of networks are based on a discrete convolution operator that produces an output feature map f by sliding a kernel w over the input x. Each element in the output feature map is obtained by summing up the result of the element-wise multiplication between the input patch (i.e., a slice of the input having the same dimensionality of the kernel) and the kernel. The number of kernels (filters) M used in a convolutional layer determines the depth of the output volume (i.e., the number of output feature maps). To control the other spatial dimensions of the output feature maps two hyper-parameters are used: stride and padding. Stride represents the distance between two consecutive input patches and can be defined for each direction of motion. Padding refers to the possibility of implicitly enlarging the inputs by adding (usually) zeros at the borders to control the output size w.r.t the input one. Indeed, without padding, the dimensionality of the output would be reduced after each convolutional layer.
Considering a 1D time-series $x \in \mathbb{R}^{n_T}$ and a one-dimensional kernel $w \in \mathbb{R}^{k}$, the i-th element of the convolution between x and w is:

$$f(i) = (x * w)(i) = \sum_{j=0}^{k-1} x(i-j)\, w(j) \tag{25}$$

with $f \in \mathbb{R}^{n_T-k+1}$ if no zero-padding is used; otherwise padding matches the input dimensionality, i.e., $f \in \mathbb{R}^{n_T}$. Equation 25 refers to the one-dimensional input case but can be easily extended to multi-dimensional inputs (e.g., images, where $x \in \mathbb{R}^{W \times H \times D}$) [88]. The reason behind the success of these networks can be summarized in the following three points:
• local connectivity: each hidden neuron is connected to a subset of input neurons that are close to each other (according to a specific spatio-temporal metric). This property allows the network to drastically reduce the number of parameters to learn (w.r.t. a fully connected network) and facilitates computations.
• parameter sharing: the weights used to compute the output neurons in a feature map are the same, so that the same kernel is used for each location. This reduces the number of parameters to learn.
• translation equivariance: the network is robust to a possible shift of its input.

[Figure: dilated causal convolution. The dilation factor d grows on each layer by a factor of two and the kernel size k is 2, thus the output neuron is influenced by 8 input neurons, i.e., the history size is 8.]
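The one-dimensional convolution of Equation 25 and the effect of padding on the output length can be sketched in a few lines of Python. The function name `conv1d` and the choice of padding only on the left are assumptions made for this illustration; the text above only states that zeros are added at the borders.

```python
def conv1d(x, w, pad=False):
    """Discrete 1D convolution: f(i) = sum_j x(i - j) * w(j), j = 0..k-1.

    Without padding the output has n_T - k + 1 elements; with k - 1 zeros
    prepended (an assumed, left-only padding) it has n_T elements.
    """
    k = len(w)
    if pad:
        x = [0.0] * (k - 1) + list(x)
    # Output position i covers the patch x[i .. i+k-1]; the kernel is flipped.
    return [sum(x[i + k - 1 - j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

signal = [1, 2, 3, 4, 5]
difference_kernel = [1, -1]  # f(i) = x(i) - x(i-1)
valid = conv1d(signal, difference_kernel)            # 4 outputs
same = conv1d(signal, difference_kernel, pad=True)   # 5 outputs
```

The same two kernel weights are reused at every position, which is the parameter sharing property listed above.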
In our work we focus on a convolutional architecture inspired by Wavenet [85], a fully probabilistic and autoregressive model used for generating raw audio waveforms and extended to time-series prediction tasks [87]. To the best of the authors' knowledge this architecture has never been proposed to forecast the electric load. A recent empirical comparison between temporal convolutional networks and recurrent networks has been carried out in [89] on tasks such as polyphonic music and character-level sequence modelling. The authors were the first to use the name Temporal Convolutional Networks (TCNs) to indicate convolutional networks which are autoregressive, able to process sequences of arbitrary length and output a sequence of the same length. To achieve this, the network has to employ causal (dilated) convolutions, and residual connections should be used to handle a very long history size.

Dilated Causal Convolution (DCC)
Since TCNs are a family of autoregressive models, the estimated value at time t must depend only on past samples and not on future ones (Figure 9). To achieve this behavior in a Convolutional Neural Network, the standard convolution operator is replaced by a causal convolution. Moreover, zero-padding of length (filter size − 1) is added to ensure that each layer has the same length as the input layer. To further enhance the network capabilities, dilated causal convolutions are used, which increase the receptive field of the network (i.e., the number of input neurons to which the filter is applied) and its ability to learn long-term dependencies in the time-series. Given a one-dimensional input $x \in \mathbb{R}^{n_T}$ and a kernel $w \in \mathbb{R}^{k}$, the output of a dilated convolution with dilation factor d becomes:

$$f(i) = (x *_d w)(i) = \sum_{j=0}^{k-1} x(i - d\,j)\, w(j)$$

This is a major advantage w.r.t. simple causal convolutions, as in the latter case the receptive field r grows linearly with the depth of the network, $r = k(L-1)$, while with dilated convolutions the dependence is exponential, $r = 2^{L-1}k$, ensuring that a much larger history size is used by the network.
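A short Python sketch can make the growth of the history size concrete. It stacks three dilated causal layers with kernel size 2 and dilation factors 1, 2 and 4, and traces how far a single input sample propagates. The all-ones kernels and the helper names are assumptions used only to trace connectivity; they carry no learned meaning.

```python
def dilated_causal_conv(x, w, d=1):
    """f(i) = sum_j w(j) * x(i - d*j), with x(i) treated as 0 for i < 0."""
    return [sum(w[j] * x[i - d * j] for j in range(len(w)) if i - d * j >= 0)
            for i in range(len(x))]

def stack(x, k=2, dilations=(1, 2, 4)):
    """Apply one layer per dilation factor, each with an all-ones kernel."""
    for d in dilations:
        x = dilated_causal_conv(x, [1.0] * k, d)
    return x

# A unit impulse at t = 0 reaches exactly those outputs whose history
# window contains t = 0, so counting them gives the history size.
impulse = [1.0] + [0.0] * 11
response = stack(impulse)
history_size = sum(1 for v in response if v != 0)
```

With these settings `history_size` is 8, matching $r = 2^{L-1}k$ for L = 3 and k = 2. Because each output only reads indices at or before its own position, changing a future input leaves earlier outputs untouched.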

Residual Connections
Despite the use of dilated convolutions, the CNN still needs a large number of layers to learn the dynamics of the inputs. Moreover, performance often degrades as the network depth increases. The degradation problem was first addressed in [42], where the authors propose a deep residual learning framework. The authors observe that, for an L-layer network with a given training error, inserting k extra layers on top of it should either leave the error unchanged or improve it. Indeed, in the worst case scenario, the new k stacked non-linear layers should learn the identity mapping y = H(x) = x, where x is the output of the network having L layers and y is the output of the network with L + k layers. Although this seems almost trivial, in practice neural networks experience problems in learning this identity mapping. The proposed solution lets these stacked layers fit a residual mapping F(x) = H(x) − x instead of the desired one, H(x). The original mapping is recast into F(x) + x, which is realized by feed forward neural networks with shortcut connections; in this way the identity mapping is learned by simply driving the weights of the stacked layers to zero.
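The shortcut connection can be sketched as follows. The block below uses two small dense layers as the residual branch F; this is an illustrative assumption and not the convolutional block used in the experiments. It shows that with all weights at zero the block reduces to the identity mapping.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def affine(W, b, v):
    """Dense layer without activation: W v + b."""
    return [sum(W[i][j] * v[j] for j in range(len(v))) + b[i]
            for i in range(len(W))]

def residual_block(x, W1, b1, W2, b2):
    """y = F(x) + x, where F is affine -> ReLU -> affine."""
    f = affine(W2, b2, relu(affine(W1, b1, x)))
    return [fi + xi for fi, xi in zip(f, x)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [0.0, 0.0]

unchanged = residual_block([1.5, -2.0], zeros, no_bias, zeros, no_bias)
shifted = residual_block([1.0, -1.0], identity, no_bias, identity, no_bias)
```

Here `unchanged` equals the input, while `shifted` adds the rectified input to itself, so only the positive component is doubled.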
By means of the two aforementioned principles, the temporal convolutional network is able to exploit a large history size in an efficient manner. Indeed, as observed in [89], these models present several computational advantages compared to RNNs. They have lower memory requirements during training, and the predictions for later timesteps are not computed sequentially but in parallel, exploiting parameter sharing. Moreover, TCN training is much more stable than RNN training, avoiding the exploding/vanishing gradient problem. For all the above reasons, TCNs have proven to be a promising area of research for time series prediction problems, and here we aim to assess their forecasting performance w.r.t. state-of-the-art models in short-term load forecasting.
The architecture used in our work is depicted in Figure 10 and is, except for some minor modifications, the network structure detailed in [87]. In the first layer of the network we process separately the load information and, when available, the exogenous information such as temperature readings. The results are then concatenated and processed by a deep residual network with L layers. Each layer consists of a residual block with a 1D dilated causal convolution, a rectified linear unit (ReLU) activation and finally dropout to prevent overfitting. The output layer consists of a 1x1 convolution, which allows the network to output a one-dimensional vector $y \in \mathbb{R}^{n_T}$ having the same dimensionality as the input vector x. To approach multi-step forecasting, we adopt a MIMO strategy.

Related Work
In the literature on short-term load forecasting, CNNs have not been studied to a large extent. Indeed, until recently, these models were not considered for any time-series related problem. Still, several works have tried to address the topic. In [15] a deep convolutional neural network model named DeepEnergy is presented. The proposed network is inspired by the first architectures used in the ImageNet challenge (e.g., [77]), alternating convolutional and pooling layers and halving the width of the feature map after each step. According to the provided experimental results, DeepEnergy can precisely predict the energy load over the next three days, outperforming five other machine learning algorithms including LSTM and FNN. In [16] a CNN is compared to recurrent and feed forward approaches, showing promising results on a benchmark dataset. In [25] a hybrid approach involving both convolutional and recurrent architectures is presented. The authors integrate different input sources and use convolutional layers to extract meaningful features from the historic load, while the main task of the recurrent network is to learn the system's dynamics. The model is evaluated on a large dataset containing hourly loads from a city in North China and is compared with a three-layer feed forward neural network. A different hybrid approach is presented in [26], where the authors process the load information in parallel with a CNN and an LSTM. The features generated by the two networks are then used as input for a final (fully connected) prediction network in charge of forecasting the day-ahead load.

Performance Assessment
In this section we perform evaluation and assessment of all the presented architectures. The testing is carried out by means of three use cases that are based on two different datasets used as benchmarks. We first introduce the performance metrics that we considered for both network optimization and testing, then describe the datasets that have been used and finally we discuss results.

Performance Metrics
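The efficiency of the considered architectures has been measured and quantified using widely adopted error metrics. Specifically, we adopted the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1}\frac{1}{n_O}\sum_{t=0}^{n_O-1}\left(\hat{y}_i[t]-y_i[t]\right)^2} = \sqrt{\left\langle \frac{1}{n_O}\lVert \hat{y}-y\rVert_2^2 \right\rangle}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=0}^{N-1}\frac{1}{n_O}\sum_{t=0}^{n_O-1}\left|\hat{y}_i[t]-y_i[t]\right| = \left\langle \frac{1}{n_O}\lVert \hat{y}-y\rVert_1 \right\rangle$$

where N is the number of input-output pairs provided to the model in the course of testing, and $y_i[t]$ and $\hat{y}_i[t]$ are respectively the real load values and the estimated load values at time t for sample i (i.e., the i-th time window). $\langle\cdot\rangle$ is the mean operator, $\lVert\cdot\rVert_2$ is the Euclidean L2 norm, while $\lVert\cdot\rVert_1$ is the L1 norm. $y \in \mathbb{R}^{n_O}$ and $\hat{y} \in \mathbb{R}^{n_O}$ are the real load values and the estimated load values for one sample, respectively. Still, a more intuitive and indicative interpretation of the prediction efficiency of the estimators can be expressed by the normalized root mean squared error (NRMSE) which, differently from the two metrics above, is independent of the scale of the data:

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{y_{max}-y_{min}}$$

where $y_{max}$ and $y_{min}$ are the maximum and minimum value of the training dataset, respectively. In order to quantify the proportion of variance in the target that is explained by the forecasting methods we also consider the $R^2$ index:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where RSS is the residual sum of squares and TSS is the total sum of squares of the target.
All considered models have been implemented in Keras 2.12 [90] with Tensorflow [91] as backend. The experiments are executed on a Linux cluster with an Intel(R) Xeon(R) Silver CPU and an Nvidia Titan XP.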
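As a rough illustration, the four metrics can be computed over a list of test windows as sketched below in plain Python. The function names are hypothetical, all windows are assumed to have the same length $n_O$ (so that averaging per window and then across windows equals averaging over all points), and the $R^2$ variant pools all test points before computing the sums of squares, which is one common convention and an assumption here.

```python
import math

def _pairs(Y, Y_hat):
    """Flatten lists of windows into (true, estimate) pairs."""
    return [(a, b) for y, y_hat in zip(Y, Y_hat) for a, b in zip(y, y_hat)]

def rmse(Y, Y_hat):
    p = _pairs(Y, Y_hat)
    return math.sqrt(sum((b - a) ** 2 for a, b in p) / len(p))

def mae(Y, Y_hat):
    p = _pairs(Y, Y_hat)
    return sum(abs(b - a) for a, b in p) / len(p)

def nrmse(Y, Y_hat, y_min, y_max):
    """RMSE scaled by the range of the training data."""
    return rmse(Y, Y_hat) / (y_max - y_min)

def r2(Y, Y_hat):
    p = _pairs(Y, Y_hat)
    mean_y = sum(a for a, _ in p) / len(p)
    rss = sum((a - b) ** 2 for a, b in p)       # residual sum of squares
    tss = sum((a - mean_y) ** 2 for a, _ in p)  # total sum of squares
    return 1.0 - rss / tss

Y = [[1.0, 2.0], [3.0, 4.0]]       # two toy windows of true values
Y_hat = [[1.0, 2.0], [3.0, 6.0]]   # corresponding estimates
scores = (rmse(Y, Y_hat), mae(Y, Y_hat), nrmse(Y, Y_hat, 0.0, 10.0), r2(Y, Y_hat))
```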

Use Case I
The first use case considers the Individual household electric power consumption data set (IHEPC), which contains 2.07M measurements of electric power consumption for a single house located in Sceaux (7 km from Paris, France). Measurements were collected every minute between December 2006 and November 2010 (47 months) [29]. In this study we focus on the prediction of the "Global active power" parameter. Nearly 1.25% of the measurements are missing; still, all the available ones come with timestamps. We reconstruct the missing values using the mean power consumption for the corresponding time slot across the different years of measurements. In order to have a unified approach we have decided to resample the dataset using a sampling rate of 15 minutes, which is a widely adopted standard in modern smart meter technologies. Table 3 outlines the sample sizes for each dataset.
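The two preprocessing steps can be sketched as follows. The data layout (one list of slot values per year, with `None` marking a missing reading) and the use of block averaging for the 15-minute resampling are assumptions made for illustration; the text does not specify how readings are aggregated.

```python
def fill_missing(series_by_year):
    """Replace None with the mean of the same time slot in the other years.

    series_by_year: {year: [value or None, ...]}, with aligned time slots.
    """
    n_slots = len(next(iter(series_by_year.values())))
    filled = {year: list(vals) for year, vals in series_by_year.items()}
    for s in range(n_slots):
        known = [vals[s] for vals in series_by_year.values() if vals[s] is not None]
        if not known:
            continue  # slot missing in every year: left as None
        slot_mean = sum(known) / len(known)
        for vals in filled.values():
            if vals[s] is None:
                vals[s] = slot_mean
    return filled

def resample(values, factor=15):
    """Average consecutive blocks of `factor` one-minute readings."""
    return [sum(values[i:i + factor]) / factor
            for i in range(0, len(values) - factor + 1, factor)]

toy = {2007: [1.0, None], 2008: [3.0, 4.0], 2009: [None, 6.0]}
filled = fill_missing(toy)
quarter_hourly = resample([float(v) for v in range(30)])
```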
In this use case we performed the forecasting using only historical load values. The left side of Figure 11 depicts the average weekly electric consumption. As expected, the highest consumption is registered in the morning and evening periods of the day, when the occupancy of residential houses is high. Moreover, the average load profile over a week clearly shows that weekdays are similar while weekends present a different consumption trend.
The figure shows that the data are characterized by high variance. The prediction task consists in forecasting the electric load for the next day, i.e., 96 timesteps ahead.
In order to assess the performance of the architectures we hold out a portion of the data, comprising the last year of measurements, as our test set. The remaining measurements are repeatedly divided into two sets by keeping aside one month of data out of every five. This process allows us to build a training set and a validation set on which different hyper-parameter configurations can be evaluated. Only the best performing configuration is later evaluated on the test set.
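A minimal sketch of this splitting scheme is given below. It works on chronologically ordered month labels; which month of each group of five is kept aside is not stated in the text, so holding out the fifth one is an assumption of this example, as are the function and parameter names.

```python
def split_months(months, test_months=12, every=5):
    """Split ordered month labels into train, validation and test lists."""
    test = months[-test_months:]          # last year of measurements
    rest = months[:-test_months]
    # Keep aside the last month of every group of `every` months.
    val = [m for i, m in enumerate(rest) if i % every == every - 1]
    train = [m for i, m in enumerate(rest) if i % every != every - 1]
    return train, val, test

months = list(range(47))  # 47 months, December 2006 to November 2010
train, val, test = split_months(months)
```

For the 47 months of IHEPC this yields 28 training months, 7 validation months and 12 test months.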

Table 3: Sample size of train, validation and test sets for each dataset.

Dataset       Train     Test
IHEPC         103301    35040
GEFCom2014    44640     8928

Use Case II and III
The other two use cases are based on the GEFCom2014 dataset [35], which was made available for an online forecasting competition that lasted between August 2015 and December 2015. The dataset contains 60.6k hourly measurements of (aggregated) electric power consumption collected by ISO New England between January 2005 and December 2011. Differently from the IHEPC dataset, temperature values are also available and are used by the different architectures to enhance their prediction performance. In particular, the input variables used for forecasting the subsequent $n_O$ values at timestep t include: several previous load measurements, the temperature measurements for the previous timesteps registered by 25 different stations, and the hour, day, month and year of the measurements. We apply standard normalization to load and temperature measurements, while for the other variables we simply apply one-hot encoding, i.e., a K-dimensional vector in which one of the elements equals 1 and all remaining elements equal 0 [92]. On the right side of Figure 11 we observe the average load and the data dispersion on a weekly basis. Compared to IHEPC, the load profiles look much more regular. This meets intuitive expectations, as the load measurements in the first dataset come from a single household, so the randomness introduced by user behaviour has a more pronounced impact on the results. Conversely, the load information in GEFCom2014 comes from the aggregation of the data provided by several different smart meters; clustered data exhibit a more stable and regular pattern. The main task of these use cases, as well as of the previous one, consists in forecasting the electric load for the next day, i.e., 24 timesteps ahead. The hyper-parameter optimization and the final scores for the models follow the same guidelines provided for IHEPC; the number of points for each subset is given in Table 3.
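The two encodings can be sketched as below. The sketch assumes that "standard normalization" means subtracting the mean and dividing by the standard deviation, with both statistics estimated on the training data; the helper names and the toy values are hypothetical.

```python
import math

def fit_scaler(train_values):
    """Estimate mean and (population) standard deviation on training data."""
    mean = sum(train_values) / len(train_values)
    var = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    return mean, math.sqrt(var)

def standardize(values, mean, std):
    return [(v - mean) / std for v in values]

def one_hot(index, K):
    """K-dimensional vector with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(K)]

load = [1.0, 2.0, 3.0]                 # toy load readings
mean, std = fit_scaler(load)
z = standardize(load, mean, std)
hour_code = one_hot(5, 24)             # hour of day, 0..23
features = z[-1:] + hour_code          # one scaled value plus the hour code
```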

Results
The compared architectures are the ones presented in the previous sections, with one exception. We have additionally considered a deeper variant of a feed forward neural network with residual connections, which is named DFNN in the remainder of the work. In accordance with the findings of [93] we have employed a 2-shortcut network, i.e., the input undergoes two affine transformations, each followed by a non-linearity, before being summed to its original values. For regularization purposes we have included Dropout and Batch Normalization [94] in each residual block. We have included this model in the comparison because it represents an evolution of standard feed forward neural networks which is expected to better handle highly complex time-series data.
Table 4 summarizes the best configurations found through grid search for each model and use case. For both datasets we experimented with different input sequence lengths $n_T$. In the end, we used a window size of four days, which represents the best trade-off between performance and memory requirements. The output sequence length $n_O$ is fixed to one day. For each model we identified the optimal number of stacked layers in the network L, the number of hidden units per layer $n_H$, the regularization coefficient λ (L2 regularization) and the dropout rate $p_d$. Moreover, for TCN we additionally tuned the width k of the convolutional kernel and the number of filters M applied at each layer (i.e., the depth of each output volume after the convolution operation). The dilation factor is increased exponentially with the depth of the network, i.e., $d = 2^{\ell}$ with ℓ being the index of the layer in the network. Table 5 summarizes the test scores of the presented architectures obtained for the IHEPC dataset. Certain similarities among networks trained for different use cases can already be spotted at this stage. In particular, we observe that all models exploit a small number of neurons. This is not usual in deep learning but, at least for recurrent architectures, is consistent with [18]. With some exceptions, recurrent networks benefit from a less strict regularization; dropout is almost always set to zero and λ values are small.
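The input and output windows and the dilation schedule can be illustrated with the sketch below. For IHEPC, a four-day input at a 15-minute rate corresponds to 4 × 96 = 384 steps and a one-day output to 96 steps. The helper names, the `step` parameter and the zero-based layer index (so that the first layer has d = 1) are assumptions of this example.

```python
def make_windows(series, n_T, n_O, step=1):
    """Return (input, target) pairs: n_T past values and the next n_O values."""
    return [(series[s:s + n_T], series[s + n_T:s + n_T + n_O])
            for s in range(0, len(series) - n_T - n_O + 1, step)]

def dilations(L):
    """Dilation factor d = 2**l for layers l = 0 .. L-1 (assumed indexing)."""
    return [2 ** l for l in range(L)]

series = list(range(600))  # toy stand-in for a resampled load series
pairs = make_windows(series, n_T=4 * 96, n_O=96, step=96)
schedule = dilations(4)
```

With `step=96` consecutive windows are shifted by one day, so each target covers a distinct day.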
Among Recurrent Neural Networks we observe that, in general, the MIMO strategy outperforms the recursive one in this multi-step prediction task. This is reasonable in such a scenario. Indeed, the recursive strategy, differently from the MIMO one, is highly sensitive to error accumulation which, in a highly volatile time series such as the one addressed here, results in a very inaccurate forecast. Among the MIMO models we observe that gated networks perform significantly better than the simple Elman network. This suggests that gated systems are effectively learning to better exploit the temporal dependency in the data. In general we notice that all the models, except the RNNs trained with the recursive strategy, achieve comparable performance and none really stands out. It is interesting to note that GRU-MIMO and LSTM-MIMO outperform sequence to sequence architectures, which are supposed to better model complex temporal dynamics like the one exhibited by the residential load curve. Nevertheless, by observing the performance of recurrent networks trained with the recursive strategy, this behaviour is less surprising. In fact, compared with the aggregated load profiles, the load curve belonging to a single smart meter is far more volatile and sensitive to customer behaviour. For this reason, leveraging geographical and socio-economic features that characterize the area where the user lives may allow deep networks to generate better predictions.
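The difference between the two multi-step strategies can be sketched with toy models standing in for trained networks. The function names and the lambda models below are hypothetical; the one-step model deliberately carries a small constant bias to show how the recursive strategy compounds it.

```python
def forecast_recursive(one_step_model, history, n_O):
    """Predict one step at a time, feeding each estimate back as input."""
    window, out = list(history), []
    for _ in range(n_O):
        y_hat = one_step_model(window)
        out.append(y_hat)
        window = window[1:] + [y_hat]  # slide the window over the estimate
    return out

def forecast_mimo(multi_step_model, history, n_O):
    """Predict all n_O steps with a single model call."""
    out = list(multi_step_model(history))
    if len(out) != n_O:
        raise ValueError("model must return exactly n_O values")
    return out

# Toy persistence models; the one-step model has a bias of 0.1 per step.
biased_one_step = lambda w: w[-1] + 0.1
persistence_mimo = lambda h: [h[-1]] * 5

rec = forecast_recursive(biased_one_step, [1.0], 5)
mimo = forecast_mimo(persistence_mimo, [1.0], 5)
```

In the recursive run the bias grows with the horizon, reaching about 0.5 at the fifth step, whereas the MIMO output does not depend on its own earlier estimates.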
For visualization purposes we compare the performance of all the models on a single-day prediction scenario on the left side of Figure 12. On the right side of Figure 12 we quantify the differences between the best predictor (the GRU-MIMO) and the actual measurements; the thinner the line, the closer the prediction to the true data. Furthermore, in this figure, we concatenate multiple day predictions to obtain a wider time span and evaluate the model's predictive capabilities. We observe that the model is able to generate a prediction that correctly follows the general trend of the load curve but fails to predict steep peaks. This might come from the design choice of using MSE as the optimization metric, which could discourage deep models from predicting high peaks: large errors are heavily penalized, and therefore predicting a lower and smoother function results in better performance according to this metric. Alternatively, some of the peaks may simply represent noise due to particular user behaviour and are thus unpredictable by definition.

[Figure 12. Legend: Measurements, gru_mimo, dfnn, tcn, lstm_rec, rnn_mimo, lstm_mimo, rnn_rec, gru_rec, fnn, seq2seq_tf, seq2seq_sg. Horizontal axis: Steps.]

Table 5: Individual household electric power consumption data set results. Each model's mean score (± one standard deviation) comes from 10 repeated training processes.
The load curve of the second dataset (GEFCom2014) results from the aggregation of several different load profiles producing a smoother load curve when compared with the individual load case. Hyper-parameters optimization and the final score for the models can be found in Table 4. Table 6 and Table 7 show the experimental results obtained by the models in two different scenarios. In the former case, only load values were provided to the models while in the latter scenario the input vector has been augmented with the exogenous features described before. Compared to the previous dataset this time series exhibits a much more regular pattern; as such we expect the prediction task to be easier. Indeed, we can observe a major improvement in terms of performance across all the models. As already noted in [22,95] the prediction accuracy increases significantly when the forecasting task is carried out on a smooth load curve (resulting from the aggregation of many individual consumers).
We can observe that, in general, all models except plain FNNs benefit from the presence of exogenous variables.
When exogenous variables are adopted, we notice a major improvement by RNNs trained with the recursive strategy, which outperform the MIMO ones. This increase in accuracy can be attributed to a better capacity to leverage the exogenous time series of temperatures to yield a better load forecast. Moreover, RNNs with the MIMO strategy gain negligible improvements compared to their performance when no extra feature is provided. This kind of architecture uses a feedforward neural network to map its final hidden state to a sequence of $n_O$ values, i.e., the estimates. Exogenous variables are processed directly by this FNN which, as observed above, has problems in handling both load data and extra information. Consequently, a better way of injecting exogenous variables into MIMO recurrent networks needs to be found in order to provide a boost in prediction performance comparable to the one achieved by employing the recursive strategy.

Table 6: GEFCom2014 results without any exogenous variable. Each model's mean score (± one standard deviation) comes from 10 repeated training processes.
For reasons that are similar to those discussed above, sequence to sequence models trained via teacher forcing (seq2seq-TF) experienced an improvement when exogenous features are used. Still, seq2seq trained in free-running mode (seq2seq-SG) proves to be a valid alternative to standard seq2seq-TF producing high quality predictions in all use cases. The absence of a discrepancy between training and inference in terms of data generating distribution shows to be an advantage as seq2seq-SG is less sensitive to noise and error propagation.
Finally, we notice that TCNs perform well in all the presented use cases. Considering their lower memory requirements in the training process along with their inherent parallelism this type of networks represents a promising alternative to recurrent neural networks for short-term load forecasting.
The prediction results are presented in Figure 13 in the same fashion as for the previous use case. Observe that, in general, all the considered models are able to produce reasonable estimates, as sudden peaks in consumption are smoothed. Therefore, predictors greatly improve their accuracy when predicting day-ahead values for the aggregated load curves with respect to the individual household scenario.

Conclusions
In this work we have surveyed and experimentally evaluated the most relevant deep learning models applied to the short-term load forecasting problem, paving the way for standardized assessment and identification of the best solutions in this field. The focus has been given to the three main families of models, namely, Recurrent Neural Networks, Sequence to Sequence Architectures and the recently developed Temporal Convolutional Neural Networks.

Table 7: GEFCom2014 results with exogenous variables. Each model's mean score (± one standard deviation) comes from 10 repeated training processes.
An architectural description, along with a technical discussion on how multi-step ahead forecasting is achieved, has been provided for each considered model. Moreover, different forecasting strategies are discussed and evaluated, identifying advantages and drawbacks for each of them. The evaluation has been carried out on three real-world use cases that refer to two distinct scenarios for load forecasting. Indeed, one use case deals with a dataset coming from a single household while the other two tackle the prediction of a load curve that represents several aggregated meters dispersed over a wide area. Our findings concerning the application of recurrent neural networks to short-term load forecasting show that the simple ERNN performs comparably to gated networks such as GRU and LSTM when adopted in aggregated load forecasting. Thus, the less costly alternative provided by ERNN may represent the most effective solution in this scenario, as it reduces the training time without a remarkable impact on prediction accuracy. On the contrary, a significant difference exists for single-house electric load forecasting, where gated networks prove to be superior to Elman ones, suggesting that the gating mechanism allows them to better handle irregular time series. Sequence to sequence models have proven to be quite effective in load forecasting tasks even though they seem to fail to outperform RNNs. In general we can claim that seq2seq architectures do not represent a gold standard in load forecasting as they do in other domains like natural language processing. In addition, regarding this family of architectures, we have observed that teacher forcing may not represent the best solution for training seq2seq models on short-term load forecasting tasks. Despite being harder in terms of convergence, free-running models learn to handle their own errors, avoiding the discrepancy between training and testing that is a well-known issue for teacher forcing. It is therefore worth further investigating the capabilities of seq2seq models trained with intermediate solutions such as professor forcing. Finally, we evaluated the recently developed Temporal Convolutional Neural Networks, which demonstrated convincing performance when applied to load forecasting tasks. Therefore, we strongly believe that the adoption of these networks for sequence modelling in the considered field is very promising and might even introduce a significant advance in this area, which is emerging as being of key importance for future Smart Grid developments.