Deep learning for spatio-temporal supply and demand forecasting in natural gas transmission networks

Germany is the largest market for natural gas in the European Union, with an annual consumption of approx. 95 billion cubic meters. Germany’s high-pressure gas pipeline network is roughly 40,000 km long, which enables highly fluctuating quantities of gas to be transported safely over long distances. Considering that similar amounts of gas are also transshipped through Germany to other EU states, it is clear that Germany’s gas transport system is essential to the European energy supply. Since the average velocity of gas in a pipeline is only 25km/h, an adequate high-precision, high-frequency forecasting of supply and demand is crucial for efficient control and operation of such a transmission network. We propose a deep learning model based on spatio-temporal convolutional neural networks (DLST) to tackle the problem of gas flow forecasting in a complex high-pressure transmission network. Experiments show that our model effectively captures comprehensive spatio-temporal correlations through modeling gas networks and consistently outperforms state-of-theart benchmarks on real-world data sets by at least 21%. The results demonstrate that the proposed model can deal with complex nonlinear gas network flow forecasting with high accuracy and effectiveness.


Introduction
Germany is the largest market for natural gas in the European Union, with an annual consumption of approx. 95 billion cubic meters [1]. Since a similar amount of gas is also transported across Germany to other EU states, Germany represents a vital transit hub for natural gas [11]. Natural gas is the second most used energy source in Germany, and in 2019, its share in Germany's primary energy consumption was 25 % [34]. Even though the heat market is still by far the most important market for natural gas, recently, natural gas is being used for other purposes as well. In particular, gas plays an important role in the German "Energiewende", especially in the transition from fossil fuels to renewables in the power sector. Due to short ramp-up times, gas-power-plants provide a fast source of electricity in times of low availability of renewable sources.
The German high-pressure gas pipeline network's length is roughly 40,000 km, enabling highly fluctuating quantities of gas to be transported efficiently and safely over long distances. Considering that large amounts of gas are also transshipped to other countries, it is clear that the gas transport system is essential to the European energy supply. Expansion measures are cost-intensive and take long time to implement. Thus, the optimal use of existing capacities is crucial for safety of supply and service quality. Since the average gas pipe velocity is only 25km/h, adequate high-precision and high-frequency forecasting of supply and demand is a decisive matter for efficient control and operation of the complex natural gas transmission networks and distribution systems.
Traditionally, statistical models have been commonly used as a tool for natural gas forecasting. In their early work Herbert et al. [23,22] used regression analysis, residual analysis and linear regression equation to analyse effect of heating degree days (HDD), price of natural gas, price of residual fuel oil, and industrial activity on industrial demand for natural gas. Aras and Aras [3] proposed two individual autoregressive time series models for heating and nonheating season instead of attempting to capture the seasonal patterns in a single model while Gorucu [18] proposed multivariable regression for forecasting gas consumption. Timmer and Lamb [40] used linear model external variables like days below percentile (DBP) and heating degree days (HDD) in determining relations between temperature and residential natural gas consumption.
Vondracek et al. (2008) proposed statistical approach to estimate natural gas consumption of individual residential and small commercial customers using nonlinear regression. Autoregressive models are also commonly used in energy forecasting. ARIMA models were reported in Liu and Lin [32] and Erdogdu [14] while Brabec [8] used Nonlinear mixed effects model with individual customer-specific parameters and compared the model with ARIMAX (Autoregressive integrated moving average with eXtra / eXternal process) and ARX (Auto-regressive with exogenous inputs). Recently, Chen et al. [12] proposed advanced Functional AutoRegressive model with eXogenous variables (FARX) for day-ahead hourly gas flow forecast.
Especially in the last two decades machine learning methods have intensively been developed to model more complex data. Kizilaslan and Karlik [28,29] used ANN with several different algorithms (Quick propagation Algorithm, Conjugate Gradient Descent Algorithm, Levenberg-Marquardt Algorithm etc.) to forecast daily, weekly and monthly gas consumption in Istanbul. They used regression analysis to determine effect of different factors. Szpilk [39] presented Multilayer Perceptron(MLP) with additional calendar and weather features for predicting natural gas consumption in the city Szczecin in Poland. Nabavi et al. [33] developed and compared multiple linear regression, logarithmic multiple linear regression methods, and nonlinear autoregressive with exogenous input artificial neural networks (NARAX) model as a tool for predicting annually consumption in residential and commercial sectors in Iran. Many research studies showed that hybrid models (combining or ensembling) improve performance and robustness of forecast relative to individual forecast models. This is especially important when forecasting multiple time series with different behaviour (such as nodes in the gas network representing different demands of suppliers, industrial or residential customers) since hybrid models enhance the advantages of the individual approaches [35,42,25].
It is important to notice that, although the traditional statistical and machine learning models generally have good forecast accuracy in real-world data applications, they often neglect the information available in the spatial-temporal correlations of high-dimensional, complex gas network data. Moreover, the prediction performances of traditional methods depend profoundly on feature engineering, which usually requires expert experience in the corresponding field.
In recent years, deep learning methods have been employed to deal with high-dimensional spatio-temporal data defined on the complex non-Euclidean domains such as graphs. The success of convolutional neural networks (CNN) in extracting spatial features of grid-based data inspired advances in redefining the notation of convolution and adapt to the graph domain. Bruna et al. [10] were pioneers in developing a graph convolution based on the spectral graph theory. Following their work, several researchers improved and extended spectral-based convolutional graph neural networks [31,13,27]. In their survey Wu et al. [43] give a comprehensive overview of graph neural networks in data mining and machine learning fields together with an overview of possible applications of graph neural networks to various domains such as computer vision [44,24], natural language processing [38,5], traffic [47,19,45], chemistry [46], etc... Inspired by the success of recently proposed deep learning models for energy time series forecasting [26,36] we propose a deep learning spatio-temporal predictive model (DLST) to tackle the problem of gas flow forecasting in a complex high-pressure transmission network. In our model, the convolution is employed directly on graph-structured data to extract profoundly meaningful patterns and features in the space and time domain. We formulate the problem of forecasting gas flows on graphs and build the complete convolutional model, enabling us to significantly speed up training while using fewer parameters.
Our experiments show that our model effectively captures comprehensive spatio-temporal correlations through modeling gas networks. It consistently outperforms state-of-the-art benchmarks on real-world data sets for four different forecasting horizons with a skill score (calculated on normalized mean absolute percentage error) between 0.218 and 0.494. These results strongly indicate that the proposed model is capable of dealing with complex nonlinear gas network flow forecasting with high accuracy and effectiveness.
This work is part of a joint project with one of the Germany's biggest transmission system operators (TSO), Open Grid Europe [17]. Together with OGE, we develop a tool for high precision gas flow forecasting, providing multi-step ahead hourly forecasts for more than 100 nodes in the gas network. These nodes have very different statistical characteristics, as they represent a variety of different source and consumption points, ranging from connections to other gas networks or countries over industrial consumers to storage facilities.
The paper is organized as follows: In Section 1 we give a brief introduction to the motivation of the paper and presents related work. Section 2 gives a formulation of the problem and explains the methodologies used in building the DLST model. In Section 3 we introduce the real-world data set we use for the computational verification of our model described in Section 4. In Section 5 we discuss the obtained results. Finally, Section 6 brings some conclusions.

Forecasting on graphs
We formulate the problem of forecasting gas flow in the gas transmission network as follows: predict gas flow (q t+1 , ...,q t+H ) in the next H time steps, where q t ∈ R n is a vector of gas flow values of n nodes at time step t, given the set of features F ∈ R d×n consisting of M historical flow observations (q t−M +1 , ..., q t ) and an (optional) set of exogenous features (ex 1 , ..., ex d−M ) ∈ R (d−M )×n such as weather information, season indicators, economical parameters etc.
The observation q t can be considered as a signal defined on the undirected graph G = (V, A, W ) which consists of a finite set of nodes V with |V| = n representing the nodes of a gas transmission network, a set of arcs A and an adjacency matrix W ∈ R n×n modeling the connections between nodes. If arc a i,j connecting vertices i and j exists, W i,j represents the weight of the arc; otherwise, W i,j = 0.
For modeling connections between nodes, we split the set of arcs A into individual sets A = A p ∪ A a for the different types of network elements considered in this paper i.e. passive and active elements respectively.
The basic passive elements of the transmission network connecting two nodes are pipes. Nodes can be also connected via active elements such as valves, regulators and compressor stations that can potentially change the network topology and are used for controlling pressure along the direction of the flow, see [21].
Furthermore, we split the set of nodes V = V b ∪ V o into a set of boundary nodes and a set of inner nodes, respectively. We denote V + ⊆ V the supply and V − ⊆ V the demand nodes of the network.
Using the information of graph data, we can capture the spatio-temporal interdependence among nodes. The main challenge is that due to non-Euclidean nature of graph data some essential operations of machine learning such as convolutions do not apply to a graph domain [9]. Furthermore, for graph data, instances are not independent from each other since each instance (node) is related to others by its connections.

Spatio Graph Convolution Unit (SGCU)
First, let us define the spatial convolution on a given graph G. We define the diagonal degree matrix as D ∈ R n×n with D ii = j W ij and Laplacian as L = I − D − 1 2 W D where I is an identity matrix. Then the Singular Value Decomposition (SVD) is applied to Laplacian as L = U ΛU T ∈ R n×n where matrix U ∈ R n×n consists of eigenvectors of normalized Graph Laplasian transforms a signal to frequency domain and Λ ∈ R n×n is a diagonal matrix of eigenvalues of L. Now we define the graph Fourier transform of the input graph signal x ∈ R n as U x ∈ R n and the graph Fourier transform of a filter g ∈ R n as U g ∈ R n . Finally, we define the convolution of a graph signal x with filter g as where we denote * ς as graph convolution operator and represents an element wise product. By defining w = U g as the filter in frequency domain, the convolution can be rewritten as The above graph convolution requires a filter w to have the same size as the input signal x, which would be inefficient and hard to calculate when the graph has a large size(O(n 2 ) multiplications with graph Fourier basis).
To localize the filter and reduce the number of parameters, we restricted the kernel diag(w) to a polynomial of Λ as diag(w) = Θ(Λ) = [13].
K represents the kernel size of graph convolution and determines the maximum radius of the convolution from central nodes. The number of trainable parameters are restricted to K.
In graph signal processing Chebyshev polynomial T k (x) is traditionally used to approximate kernels for graph signal as a truncated expansion of order K-1 as withΛ = 2Λ/λ max − I n where λ max denotes the largest eigenvalue of L [20]. Now, the graph convolution can then be rewritten as where T k (L) ∈ R n×n is the Chebyshev polynomial of order k evaluated at the scaled LaplacianL = 2L/λ m ax − I n . Using the polynomial approximation and recursively computing K-localized convolutions we reduce the cost of (1) to O(K|E|) as (4) shows [13].
Further, we extend the graph convolutions operator " * ς" defined on graph signal x ∈ R n to multi-dimensional tensors. First, for a signal with C i channels X ∈ R n×Ci , the graph convolution can be generalized as: are the size of input and output of the feature maps, respectively). The graph convolution for 2-D variables is denoted as "Θ * ςx " with Θ ∈ R K×Ci×Co .
If the graph signal X ∈ R d×n×Ci , is a 3-D variable composed of d matrices whose column i is the C i -dimensional value of X at the i th node in graph G we can apply the equal graph convolution operation with the same kernel Θ for each feature of d. Thus, the graph convolution can be further generalized in 3-D variables, noted as Θ * ςX with X ∈ R d×n×Ci .

Temporal Convolution Unit (TCU)
Although recurrent neural networks (RNN) are still the most common choice for time series forecasting, CNN models recently achieved great success in dealing with time series data [16,6] since they overcome main problems of recurrent models. The weights used to compute the output neurons in a feature map are the same (the same filter is used for each location) so the training of CNN is faster and with fewer parameters. In the case of a long input sequence, RNN use a significant amount of memory to store the partial results for their multiple cell gates while in CNNs the filters are shared across a layer, with the back-propagation path depending only on the network depth. Additionally, the number of parameter to learn is significantly reduced since each hidden neuron is connected only to a subset of input neurons that are close to each other.
To capture the temporal dynamic behavior of gas flows we employ convolutional neural network with a specific design suitable for time series data: Temporal convolutional networks (TCNs) inspired by [41,4,7] and recently wildly used in numerous applications [37,30].
If we consider a 1-D time-series x ∈ R n and a one-dimensional filter Φ ∈ R Kt , we can define the i th element of the convolution between x and Φ as: Instead of the standard convolution operator (Figure 2(a)), temporal convolution networks use causal convolutions which ensure that the prediction made at the time step t,q t+1 can not depend on any of the future time steps feature f t+1 as shown in Figure 2(b).
Futhermore, TCNs use dilated convolutions defined as: where d is a dilation factor and * dil represents modified convolutional operator which apply the same filter over an area larger than its length by skipping input values with a certain step as it is shown in Figure 2(c). Note that with dilation d = 1 we have standard 1-d convolution. Using larger dilation enables an output at the top level to represent a wider range of inputs and effectively expanding the receptive field of a TCN.
The causal dilated convolution can be generalized to 3D variables by employing the same convolution kernel Φ ∈ R Kt×Ci×Co to every node signal with C i channels x i ∈ R d×Ci in graph G equally, noted as "Φ * dil x" with x ∈ R d×n×Ci .
We form a Temporal Convolutional Unit (TCU) (as shown on Figure 3 in a pink block) by stacking causal dilated convolution layers with dilation factors in an increasing order which enables the receptive field of a model to grow exponentially. This way, TCU is capable to capture longer sequences with less layers and saves computational resources. The TCU consist of L 1-D causal dilated convolution with a kernel K t and dilation factors d = 2 l , l ∈ [0, L − 1], followed by a non-linearity and a dropout layer for avoiding overfitting. Furthermore, residual connections are implemented among stacked temporal convolutional layers [41].

DLST architecture
In order to combine features from both spatial and temporal domains, we propose a DLST model to process the graph-structured gas flow time series.
As Figure 3 shows, in the first block of the model we process separately the historical flow information and the exogenous information through a standard temporal convolutional layer with 1D causal convolution, non-linearity unit (NLU) and a dropout layer. Later, the results are concatenated together and processed by the first Spatio Graph Convolution Unit (SGCU). Furthermore, we apply a Temporal Convolutional Unit, as described in Section 2.2, with L layers where each layer consists of a block with 1D dilated causal convolution with increasing dilation factor d = 2 l , l ∈ [0, L − 1], a non-linearity unit (NLU) and a dropout layer to prevent overfitting. After TCU we apply once more SGCU in order to explore spatial dependencies on long term temporal information.
Finally, we apply once more temporal 1D-CNN with fully connected layers to calculate the final prediction.

Data
In this study we aim to forecast hourly natural gas inflows and outflows (supply and demand) in a high-pressure gas pipeline network operated by one of the largest transmission system operators in Germany, Open Grid Europe GmbH [17]. The observed gas network has 397 nodes in total, 25 supply (V + ), 73 demand nodes (V − ) as well as 299 inner nodes(V o ). Even though we calculate the forecast for all nodes in the network, both boundary and inner nodes, when For modeling the connections between the nodes we used the network topology provided by OGE. As described in Section 2, nodes can be connected with a passive (pipes) or an active element (valves, regulators, compressor stations). In the observed network the nodes are connected with 171 pipes, 254 valves, 154 regulators and 34 compressor stations.
We define the adjacency matrix W of the graph based on the distances among nodes in the gas network as follows: where l i,j ∈ R is the length of the pipe connecting nodes i and j. The data set covers 33 months (May 2016-January 2019) of hourly gas flows for each node. Thus, every node of the gas network graph contains 24144 observations (covering 1006 days) in total. Additionally, we use as exogenous features, a set of measured air temperatures at the nodes location with the same time resolution. In order to compare results for different nodes, the data inputs are standardized by Z-Score method. Figure 4 shows summary statistics of 25 supply and 73 demand nodes. The nodes are sorted according to total absolute flow during the observed period and denoted as 'S' for supply nodes 'D' for demand nodes. The distribution of most of the nodes is centered around zero having wide-ranging variations. We can notice that some of nodes are skewed (both left and right) and have a large amount of outliers.  Figure 5 represents the heatmap of standardized aggregated daily flow for the 98 boundary nodes. Large supply nodes (S1-S8) have a stable flow with very small sensitivity to season changing. These nodes usually represent transfer points from and to other networks and the flow is mainly induced by long term trading contracts. Other, smaller supply nodes show some seasonal patterns, especially high peaks during the winter period. Most of the demand nodes show strong seasonal dependency since they belong to the group of so called "Municipal supply nodes". These nodes supply residential areas and small commercial customers. Some of the largest (by total flow) demand nodes do not show a seasonal pattern, yet mostly intermittent behaviour and they belong to the group of Power Plant nodes, where a large amount of gas is needed in a relatively short period of time.
To transform the time series forecasting problem into a supervised machine learning problem we adopt a sliding window approach as illustrated in Figure  6. The window slides and obtains an input-output instance at each position. The output window size is set according to desired forecasting horizon and the optimal value for input window size had to be carefully decided based on the characteristics of the data sets and the model.  We train the model and optimize hyperparameters on the training and validation set by minimizing the square loss. We calculate out-of-sample forecasts on the test set using the fitted model and the rolling window method with the input of previous 24 observations and the output that is determined by the forecasting horizon. During the test, we do not update the hyperparameters of the model.
For each boundary (supply and demand) node we calculate an out-of-sample 1, 3, 6 and 12 hours ahead forecast for period 1.12.2018 until 31.01.2019 (62 days in total) with rolling windows approach.
Architecture and hyperparameters of DLST model are optimized using OP-TUNA software [2]. Both the graph convolution kernel K s and temporal convolution kernel K t are set to 3 and for NLU we choose standard ReLU function. The number of channels are 32 for the input and first SGC unit, 64 for TCU as well as for second SCG unit. The TCU has 4 layers with dilation factor dil = 2 l , l = 0, ..., 3 . The proposed model is trained by minimizing mean squared error using ADAM optimizer for 50 epochs with a batch size of 32. The learning rate is initially set to 0.001 and was decayed for 0.9 every 5 epochs.

Benchmark models
We compare the performance DLST with several well-known benchmarks: ARIMA and the state-of-the-art artificial neural network models like MLP, CNN and LSTM [15] model. We determine the best ARIMA models for an univariate time series of 98 nodes according to a Akaike information criterion (AIC). To estimate the parameters of the ARIMA models we use input data of 28 days rolling window (parameter estimation of the ARIMA model is significantly faster then for neural network models). The LSTM is implemented using one LSTM layer hidden layer with optimized number of hidden neurons and learning rate for each individual node, a dropout of 0.1 followed by a fully connected output layer with the number of neurons equal to the forecast horizon and was trained for 100 epochs. MLP and CNN models are implemented in a similar fashion and optimized for the individual nodes. The input data for MLP, CNN and LSTM model is constructed in the same fashion as for DLST model, respecting the receptive filed of DLST, so that both models can use the same amount of historical information.

Performance measurement
The performance of DLST model is measured and quantified by calculating the forecast accuracy for individual (boundary) nodes, as well as the mean for the entire network. Specifically, we use mean absolute deviation (MAD), root mean squared error (RMSE), mean absolute percentage error(MAPE) as well as normalized mean absolute percentage error(nMAPE) defined as follows: where q d,t,h andq d,t,h are the real and forecast values of the natural gas flows on day d, hour t and H is a forecasting horizon. In order to compare the proposed method with alternative benchmark models we described in Section 4.1 we compute a different skill scores for each of the benchmark model : A perfect forecast results in a skill score value of 1. A forecast which is outperforming the benchmark model will have a skill score between 0 and 1. If the skill is equal to the benchmark forecast the skill score will be 0, and a forecast which is under performing the benchmark forecast will have unbounded negative skill score values.

Results and discussion
In Table 1 we report the average skill scores of the DLST model compared to the four benchmark models for 25 supply and 73 demand nodes as well as the accuracy of the DLST model for reference. All metrics are calculated on the 1488 instances of the out-of-sample test set, as explained in Section 4. The proposed model outperforms all benchmark models for four considered forecasting horizons with a skill score of at least 0.218. Comparing to the "second best" model (marked bold in the Table 1), DLST has a skill score calculated with normalized mean absolute percentage error (nMAPE) between 0.218 and 0.494 depending on forecasting horizons. The proposed model shows high dominance in forecasting 3-steps ahead, while the difference is smallest for the one-step ahead horizon.
High mean accuracy (measured with four performance metrics) for every forecasting horizon and the positive skill score compared to all benchmark models strongly indicate that the DLST model more effectively captures comprehensive spatio-temporal correlations through modeling of the gas network. It benefits from the information available in the spatial-temporal correlations of high-dimensional, complex gas network data. Figure 7 shows the average skill score for individual nodes and different forecasting horizons compared to the "second best" model. The circle size represents the size of the node based on the total amount of flow during the observed period, while the percentage of instances where DLST outperforms the benchmark is shown on the x-axes. While we can see that for some nodes proposed model has negative skill, it is clear that for most nodes, and especially nodes with biggest flows, the DLST model has a positive skill score and provides more accurate forecast for more than 50% of instances. The detailed ranking of the models for forecasting individual nodes is given in Figure 8. The ranking is calculated based on average nMAPE for individual nodes for every instance. DLST model outperforms the benchmark models for the first eight biggest supply and five biggest demand nodes in one step ahead forecasting. For longer horizons, especially H=3 and H=6, DLST is dominant, having the rank one for more then 85% of the nodes. LSTM is, in most of the cases, second most accurate model except for nodes like S15 and S18 where ARIMA model is having the rank one for all horizons. The reason why neural network (machine learning) models are failing to forecast these two nodes lies in the fact that they had zero flows for more than 95% of the hours in the training set. As an illustration, Figure 9 shows measured and forecasted flow for 3 most significant supply and demand nodes (ranked by total absolute flow) for a testing period of two months.

Conclusion
In this paper, we have proposed a deep learning spatio-temporal model based on temporal and graph convolutional neural networks to capture both temporal and spatial dependencies in gas flows in a complex transmission network. The proposed model has been applied to predict hourly natural gas flows with different forecasting horizons (1,3,6 and 12) in one of Germany's longest and most S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20  complex pipeline network. We consider a network with gas flow time series of 98 boundary nodes, 25 supply and 73 demand nodes, and model the connections between the nodes to extract valuable information about mutual influences of different nodes in order to increase forecast accuracy. We have compared the proposed DLST model with several benchmark models popular in the literature.
The obtained results show that the DLST model provides a stable and accurate forecast, consistently outperforming all considered benchmark models with a relative skill score between 0.218 and 0.494, especially in forecasting the most important nodes of the network. Among the benchmark models, the LSTM model tuned for individual nodes is the best one and gives a stable forecast accuracy, particularly for the nodes with strong seasonal patterns. Furthermore, there are special cases where the flow during the training period has significantly different behavior (e.g. high percentage of zero flow hours). For these cases, statistical models like ARIMA have an advantage compared to machine learning models.