STGNet: Short‐term residential load forecasting with spatial–temporal gated fusion network

With the continuous transformation of the global power system, ensuring its flexibility and stability has become a key objective of the global energy industry. This places higher demands on load forecasting, especially high-precision short-term residential load forecasting. However, the high volatility and uncertainty of electric consumption make this problem challenging. Existing methods usually focus on capturing temporal correlations and ignore the underlying spatial correlations in historical loads, which are crucial for accurate forecasting. In this paper, we propose a novel spatial-temporal gated fusion network (STGNet) for short-term residential load forecasting, covering not only individual residential loads but also aggregated loads. By seeking both adaptive neighbors and temporal-similarity neighbors, we construct a dynamic graph with a data-driven method. On this basis, we provide an adaptive graph gated fusion mechanism to capture fine-grained, node-specific spatial dependence more accurately and comprehensively. Finally, a bidirectional gated recurrent unit (BiGRU) is used to learn the temporal dependence of the load data. Extensive experiments on real-world datasets demonstrate the superiority of STGNet over existing approaches.


| INTRODUCTION
Social progress achieved during the past few decades depends almost completely on a stable and continuous supply of electricity. However, power systems worldwide are undergoing great changes to cope with climate change, environmental pollution, fossil energy shortages, and other issues, which challenges their safe and stable operation. Short-term load forecasting (STLF) at the level of residential customers is an important basis for ensuring the efficient and stable operation of microgrid systems. 1 For power companies, accurate and reliable load predictions are very important: over-prediction increases power generation costs, while under-prediction forces the purchase of expensive peak electricity. 2 However, with the increasing penetration of renewable energy 3 and the growing application of flexible loads, 4,5 more and more highly uncertain factors are injected into the power system, which increases the difficulty of STLF.
Many approaches have been proposed to address load forecasting. These approaches can be classified in various ways: (a) by the scope of the forecasting area, into appliance-level, house-level, region-level, and country-level approaches; (b) by forecasting horizon, into very short-term (minutes ahead), short-term (hours or days ahead), midterm (months ahead), and long-term (years ahead) approaches. Among them, high-precision STLF at the level of residential customers faces many challenges because residential electric consumption is more stochastic and dynamic under the influence of random consumer behavior, seasonal factors, climatic conditions, and so forth. Therefore, short-term residential load forecasting remains an open problem and is the main focus of this work.
STLF has been studied as correlated time series analysis for decades. Recent studies show that, unlike general correlated time series prediction, STLF should attend not only to temporal correlations but also to spatial correlations among electric load series from different sources (spaces/regions). Recently, graph modeling of spatial-temporal data has been in the spotlight and successfully applied in many domains with the development of graph neural networks (GNNs). Although introducing GNN-based models into STLF has brought impressive improvements, these models still face two shortcomings. The first is the lack of comprehensive graph construction. Some existing methods only consider local spatial information when aggregating neighbors' information, ignoring distant houses that share similar temporal patterns. Some studies have already made improvements on this problem: Lin et al. 6 utilize a self-adaptive adjacency matrix for graph modeling, but learnable matrices usually lack the ability to represent complicated spatial-temporal dependencies. The second limitation is the lack of ability to learn data-source-specific features. Most GNN-based models use a shared parameter space to capture the prominent patterns shared among all load series, while ignoring fine-grained, data-source-specific patterns. In fact, load series exhibit diversified patterns (e.g., similar, contradictory, and even irrelevant) due to individual load consumption behavior.
To address these problems, we model the spatial dependency with a fully data-driven method, seeking both adaptive neighbors and temporal-similarity neighbors. To compensate for the information deviation caused by artificial modeling, we design the adaptive graph gated fusion (AGGF) module to capture spatial correlations more comprehensively. Motivated by Bai et al., 7 we introduce a graph convolutional network (GCN) with node-adaptive parameter learning to learn node-specific features for each electric load series. Furthermore, we combine AGGF with a bidirectional gated recurrent unit (BiGRU) to propose a novel model, the spatial-temporal gated fusion network (STGNet), for short-term residential load forecasting.
The main contributions of this paper are summarized as follows: (1) We construct a novel informative graph with a data-driven method and introduce a node-specific GCN to capture fine-grained, node-specific spatial dependencies in the load series.

The rest of this paper is organized as follows. In Section 2, we outline the related work. Next, we define the studied problem in Section 3. Section 4 illustrates our proposed model and each of its modules in detail. In Section 5, we validate the effectiveness of our model against six baselines on several real-world datasets. Finally, we conclude our work and outline future work in Section 6.

| Correlated time series forecasting
Traditional methods simply deploy statistical models, for example, auto-regressive moving average, auto-regressive integrated moving average, and fuzzy logic, for STLF. 8,9 In recent years, machine learning, especially deep learning, has dominated correlated time series prediction due to its superior ability to model complex functions and learn correlations from data automatically. 10 Some studies 11-16 rely on the feed-forward neural network (FNN), the recurrent neural network (RNN), and its variants, including long short-term memory (LSTM) and gated recurrent units (GRU), to model load series. Recently, there have been many attempts 17,18 to use transformer-based methods for time series prediction, with very good results. Yang et al. 19 proposed a new hybrid load-forecasting model to improve training effectiveness and reduce sensitivity to outliers. However, these studies do not effectively utilize the spatial correlations between different load series. Some scholars have deeply studied the factors affecting load prediction and introduced them into models to improve forecast performance. Bu et al. 20 proposed a hybrid STLF model that fully considers the related meteorological factors; the experimental results showed that its forecasting accuracy is significantly improved. Bendaoud et al. 21 proposed an STLF model based on generative adversarial networks that takes four influencing factors (maximum and minimum temperature, day of the week, and month) into consideration, and demonstrated excellent performance. However, these models 21-24 all require additional meteorological information.

| Spatial-temporal load forecasting
Exploring available historical load data, Tascikaraoglu et al. 25 discovered that there usually exists an interesting trend between the load series of a target house and those of its surrounding houses, which can be exploited to improve forecast precision. It is easy to understand that customers in the same region may have similar power consumption patterns because they experience similar conditions, such as temperature, season, and holiday effects. Kim et al. 26 proposed a spatial-temporal model based on Euclidean convolution, in which spatial features are first captured by a CNN layer and temporal features are then captured by an RNN or its variants. The CNN layer extracts features from multivariate time series collected from sensors, so the model is limited in scenarios with insufficient information. Hong et al. 27 explored the spatial-temporal correlations among different appliances and presented a short-term residential load forecasting framework based on a deep neural network (DNN) and iterative ResBlocks; experiments on a real-world dataset demonstrated the superiority of the proposed IRBDNN-based method. Recently, several GNN-based approaches 29-33 have been applied to spatial-temporal load forecasting. Inspired by Wu et al., 34 Lin et al. 6 first proposed to tackle short-term residential load forecasting, for both individual and aggregated loads, with a self-adaptive Graph WaveNet-based framework. The framework utilizes a self-adaptive adjacency matrix for graph modeling to capture the hidden spatial correlations among different houses and achieves excellent results compared with other models. However, the learnable adjacency matrix usually lacks the ability to represent complicated spatial-temporal dependencies. 32 Besides, the model uses a shared parameter space to capture the prominent patterns shared among all load series, while ignoring fine-grained, data-source-specific patterns.

| Graph convolutional networks
GCNs generalize the traditional convolution to graph-structured data and have achieved extraordinary performance. They are widely used in many domains, such as node embedding, 35 graph classification, 36 and link prediction. 37 GCNs fall into two main categories: spectral and spatial approaches. 38 Spectral GCNs use spectral graph theory to implement convolutional operations on graphs; ChebNet 39 extracts local features through a parametric convolution kernel while reducing parameter and computational complexity. Spatial GCNs directly apply convolution filters to aggregate neighborhood signals in the spatial domain. As an inductive representation learning framework, GraphSAGE 40 can efficiently generate feature representations for unseen nodes using the rich attribute information of nodes. GAT 41 adaptively matches weights to different neighbors by aggregating neighbor nodes through a self-attention mechanism.

| PROBLEM FORMULATION
In this paper, the goal is to forecast future load information based on the historical information of residential units. The proposed method improves the overall prediction accuracy by learning the spatial-temporal dependencies between different residential load series.

Definition 1 (Residential network G). We treat each house as a node and use a graph G = (V, E, A) to describe the topological structure of the residential network, where V = {v_1, v_2, ..., v_N} is the set of all nodes, N is the number of nodes, and E is the set of edges. The adjacency matrix A ∈ R^{N×N} represents the dependencies between the nodes.

Definition 2 (Feature matrix X_G). We regard the electric load information as the attribute feature of the nodes in the network G, expressed as X ∈ R^{N×C×T}, where X_t ∈ R^{N×C} denotes the values of all features of all nodes at time step t, and x_t^{n,c} denotes the c-th feature value of node n at time step t.
Problem 1. Suppose we have M steps of historical load data and want to predict the next-step load of each house. Short-term residential load forecasting can then be considered as learning a mapping function f(·) given the residential network G and the feature matrix X:

X_{t+1} = f(G; (X_{t−M+1}, ..., X_t)),
where X_{t+1} is the load prediction for each residential house at time step t + 1 (i.e., the 1-h-ahead load prediction).
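The M-steps-in, one-step-out mapping above corresponds to a sliding-window construction over the load tensor. A minimal NumPy sketch is given below; the function name and the (T, N, C) tensor layout are our illustration, not the paper's:

```python
import numpy as np

def make_windows(X, M):
    # X: (T, N, C) load tensor over T time steps, N houses, C features.
    # Returns inputs of shape (T - M, M, N, C) and next-step targets
    # of shape (T - M, N, C), matching X_{t+1} = f(G; X_{t-M+1..t}).
    inputs = np.stack([X[t:t + M] for t in range(len(X) - M)])
    targets = X[M:]
    return inputs, targets
```

Each window of M consecutive steps is paired with the load at the immediately following step, which is exactly the supervision signal for 1-h-ahead forecasting with hourly data.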

| METHODOLOGY
In this section, we describe how the STGNet model realizes the load forecasting task. As shown in Figure 1, STGNet mainly consists of three components: the node-specific graph convolution module (NSGCN), the AGGF module, and the BiGRU. These modules are introduced in detail below.

| Node-specific patterns learning module
To improve the accuracy of load prediction, we need more precise and reliable spatial correlations. The traditional CNN is widely used to extract spatial features due to its superior performance; however, it is applicable to Euclidean space, whereas the residential network takes the form of a graph rather than a two-dimensional grid, as shown in Figure 2. Recently, GCN, which specializes in arbitrary graph-structured data, has been in the spotlight. As shown in Figure 3, when a graph is input to a GCN, the features of each node change from X to Z through several hidden layers; however, regardless of how many hidden layers there are, the connection relationship between the nodes, that is, A, is shared. According to Kipf and Welling, 42 the graph convolution operation can be well approximated by a first-order Chebyshev polynomial expansion and generalized to high dimensions as

Z = D̃^{−1/2} Ã D̃^{−1/2} X W,    Ã = A + I_N,

where X ∈ R^{N×C} and Z ∈ R^{N×F} denote the input graph signal and the output signal, respectively, D̃ is the degree matrix of Ã, and W ∈ R^{C×F} is a parameter matrix shared by all nodes and obtained by learning.
Traditional shared parameters may be efficient for learning the most prominent patterns among all nodes and reduce the number of parameters, but they ignore node-specific patterns. In fact, electric consumption data exhibit diversified patterns (as shown in Figure 4) due to natural and human factors such as outdoor temperature, humidity, and social activities specific to certain time periods (e.g., holiday and festival activities). Khatoon et al. 43 comprehensively analyzed the factors that affect residential electric consumption: the electric load is highly correlated with meteorological factors such as temperature, humidity, wind speed, rain and snowfall, and cloud cover. 21-24 In addition to the data from houses with similar consumption patterns, which contribute significantly to forecasting performance, the data from slightly and even negatively correlated houses also provide useful information to the forecasting model. As a result, capturing only the patterns shared among all nodes is not enough for accurate load forecasting, and it is essential to maintain a unique parameter space for each node to learn node-specific patterns. However, assigning parameters to each node leads to a significant increase in the number of parameters and even to over-fitting. To address this problem, Bai et al. 7 propose to obtain node-specific patterns by parameter matrix factorization. In this paper, we adopt the same operation and learn two smaller parameter matrices instead of directly learning W. Then, W can be expressed as

W = E_g Θ,

where E_g ∈ R^{N×d} is the node embedding matrix and Θ ∈ R^{d×C×F} is the weight pool. In this way, a specific node i has a corresponding parameter matrix W_i = E_g^i Θ ∈ R^{C×F}. Finally, the GCN with node-specific patterns learning (i.e., NSGCN) can be written as

Z = D̃^{−1/2} Ã D̃^{−1/2} X E_g Θ.

GCN can effectively represent nodes by aggregating neighbor information. Many studies have shown that a two-layer GCN usually performs best compared with stacking dozens of layers. 44,45 To better capture the spatial dependence of load series, we use a two-layer GCN whose forward model can be formulated as

Z = softmax(Â ReLU(Â X W^{(0)}) W^{(1)}),    Â = D̃^{−1/2} Ã D̃^{−1/2},

where W^{(0)} and W^{(1)} are the parameter matrices of the first and second layers, which can be decomposed as in Formula (3), and softmax(·) and ReLU(·) are the activation functions.
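A single NSGCN layer with the factorized, node-specific weights can be sketched as follows; this is a NumPy illustration (function names and shapes are ours), whereas the actual model learns E and Θ end-to-end in a deep learning framework:

```python
import numpy as np

def normalize_adj(A):
    # Symmetric normalization with self-loops: D̃^{-1/2} (A + I) D̃^{-1/2}
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def nsgcn_layer(A, X, E, Theta):
    # E: (N, d) node embeddings; Theta: (d, C, F) weight pool.
    # Node-specific weights via matrix factorization: W_i = E_i @ Theta.
    W = np.einsum("nd,dcf->ncf", E, Theta)      # (N, C, F), one matrix per node
    H = normalize_adj(A) @ X                     # neighbor aggregation: (N, C)
    return np.einsum("nc,ncf->nf", H, W)         # per-node transform: (N, F)
```

Instead of one shared (C, F) matrix, each node draws its own (C, F) matrix from the small weight pool via its embedding, so the parameter count grows with d rather than with N.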
FIGURE 2 (A) A classical CNN on two-dimensional grid data; (B) a GCN on electric load data. In the residential network, each node represents a residential unit; the green node denotes the target node, and the red nodes represent its neighbors.
FIGURE 3 A graph convolutional network is a convolutional neural network that directly processes graph-structured data. The core idea of GCN is to update each node's hidden representation by aggregating the feature information from its neighbor nodes, which effectively utilizes the structural information of the inputs.
When forecasting the electric consumption of a target house, incorporating spatial information evidently helps improve forecasting accuracy. The question is how to determine the candidate houses that would best contribute to forecasting performance without increasing the computational burden, that is, how to determine the adjacency matrix. The adjacency matrix is crucial because it determines how target nodes and their neighbors are aggregated in the graph convolution. For residential load forecasting, the correlations between load series are difficult to predetermine quantitatively because users' electricity consumption is random. Furthermore, a given residential load is not only directly related to the surrounding houses; houses with highly similar electricity load curves should also be considered. For example, residents with similar ages, jobs, and family backgrounds may have similar electricity habits even though they are geographically far apart. Unlike existing methods, which generally use geographic distance to construct the graph structure, our adjacency matrix generation method is totally data-driven, with temporal similarity extracted.
Methods for measuring the similarity between time series fall into three categories: (a) time-step-based similarity, generally measured by Minkowski distances; (b) shape-based similarity, generally measured by dynamic time warping (DTW) 46; and (c) change-based similarity, which can be evaluated by a Gaussian mixture model. 47 In our model, we use DTW to effectively calculate the similarity of different load series.
Given two load series X_a = (a_1, a_2, ..., a_n) and X_b = (b_1, b_2, ..., b_m), DTW calculates their similarity by stretching and compressing them against each other. To align the two sequences of unequal length, we construct a distance matrix D ∈ R^{n×m} whose element D_pq = d(a_p, b_q) represents the distance between a_p and b_q. DTW uses the sum of the distances between all aligned points, called the normalized path distance (or warp-path distance), to measure the similarity between the two series. The similarity can then be represented by the cumulative distance M(p, q):

M(p, q) = D_pq + min(M(p − 1, q − 1), M(p − 1, q), M(p, q − 1)),

where min(M(p − 1, q − 1), M(p − 1, q), M(p, q − 1)) is the minimum cumulative distance of the neighboring elements from which point (p, q) can be reached. The procedure is shown in Figure 5: the lengths of the load series X_i and X_j are |X_i| and |X_j|, respectively, each pane corresponds to an element of the distance matrix D, and the black solid line represents the warping path of the two load series calculated by DTW. The heat map of the similarity among load series is shown in Figure 6A.
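The recurrence above can be implemented directly. The following minimal NumPy sketch is quadratic in the series lengths and uses the absolute difference as the pointwise distance d(·, ·); the function name is ours:

```python
import numpy as np

def dtw_distance(a, b):
    # Warp-path distance between two series of (possibly unequal) lengths.
    n, m = len(a), len(b)
    D = np.abs(np.subtract.outer(np.asarray(a, float), np.asarray(b, float)))
    M = np.full((n + 1, m + 1), np.inf)   # cumulative-distance matrix
    M[0, 0] = 0.0
    for p in range(1, n + 1):
        for q in range(1, m + 1):
            # D[p-1, q-1] plus the cheapest of the three reachable neighbors
            M[p, q] = D[p - 1, q - 1] + min(M[p - 1, q - 1],
                                            M[p - 1, q],
                                            M[p, q - 1])
    return M[n, m]
```

A repeated value in one series can be matched to a single value in the other, so shape-identical series of different lengths get distance zero.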
After this calculation, we finally obtain each element Ã_ts^{ij} of the temporal similarity adjacency matrix Ã_ts:

Ã_ts^{ij} = 1 if the cumulative DTW distance between series i and j is no greater than ε, and 0 otherwise,

where ε is the threshold that determines the sparsity of the adjacency matrix Ã_ts. The corresponding heat map is shown in Figure 6B. Combining this matrix with the graph convolution operation, the output of the Ã_ts-NSGCN can be formulated as

Z_ts = D̃_ts^{−1/2} (Ã_ts + I_N) D̃_ts^{−1/2} X_t W,

where W is the node-specific parameter matrix obtained by factorization.
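Thresholding the pairwise distances into Ã_ts can be sketched as below; `dist_fn` stands in for any series-to-series distance such as DTW, and the function name is our illustration:

```python
import numpy as np

def build_ts_adjacency(series, dist_fn, eps):
    # series: list of N load series; an edge i-j exists when the pairwise
    # distance is at most eps (smaller eps -> sparser graph).
    N = len(series)
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            if dist_fn(series[i], series[j]) <= eps:
                A[i, j] = A[j, i] = 1.0
    return A
```

The result is a symmetric 0/1 matrix whose sparsity is controlled entirely by ε, matching the role of the threshold in the text.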

| Adaptive graph gated fusion
Due to the deviation caused by artificial modeling, the spatial correlations in load series are difficult to capture accurately using the temporal similarity adjacency matrix Ã_ts alone. To solve this problem, we propose the AGGF mechanism to automatically learn fine-grained spatial dependence. Gating mechanisms have a proven, strong ability to control the flow of information and play a critical role in RNNs. As shown in Figure 7, we use an adaptive adjacency matrix Ã_adp for the gated fusion to perform GCN operations. Ã_adp can be defined as

Ã_adp = Softmax(ReLU(E_g E_g^T)),

where E_g ∈ R^{N×d} is the randomly initialized, learnable node embedding dictionary for all nodes. Multiplying E_g by E_g^T yields the spatial dependency between all nodes; the ReLU activation eliminates weak connections, and the Softmax function is applied for normalization. This adaptive adjacency matrix requires no prior knowledge and is learned end-to-end through stochastic gradient descent. Therefore, the AGGF mechanism can compensate for the information deviation caused by artificial modeling and further capture the potential fine-grained spatial dependency. Combining it with the graph convolution operation, the gating value can be formulated as

Z_adp = Ã_adp X_t W,

where W is the node-specific parameter matrix.

FIGURE 5 Given two load series X_i and X_j, the black solid line is their warping path calculated by DTW.
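The construction of Ã_adp from the node embeddings can be sketched in NumPy as below; in the model, E_g would be a learnable parameter updated by gradient descent rather than a fixed array:

```python
import numpy as np

def adaptive_adjacency(E_g):
    # E_g: (N, d) node embedding dictionary.
    S = np.maximum(E_g @ E_g.T, 0.0)               # ReLU removes weak/negative links
    expS = np.exp(S - S.max(axis=1, keepdims=True))  # numerically stable softmax
    return expS / expS.sum(axis=1, keepdims=True)    # row-wise normalization
```

Each row of the result is a normalized attention-like distribution over all nodes, so the learned graph needs no prior knowledge of the residential topology.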

FIGURE 6 Heat maps of the similarity among load series: (A) the DTW similarity; (B) the thresholded adjacency matrix Ã_ts.
FIGURE 7 The structure of AGGF.
We design the AGGF as

G(X_t) = σ(Z_adp) ⊙ Z_ts + (1 − σ(Z_adp)) ⊙ Z_adp,

where Z_ts and Z_adp are the outputs of the Ã_ts-NSGCN and the Ã_adp-NSGCN, respectively, ⊙ is the element-wise product between matrices, and σ(·) is the sigmoid function.
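Assuming the gated fusion takes a sigmoid-weighted convex-combination form (our reconstruction from the stated ingredients ⊙ and σ; the paper's exact wiring may differ), it can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggf(Z_ts, Z_adp):
    # Element-wise gate derived from the adaptive-graph branch decides how
    # much of the temporal-similarity branch vs. the adaptive branch to keep.
    g = sigmoid(Z_adp)
    return g * Z_ts + (1.0 - g) * Z_adp
```

Because the gate is element-wise, each node and each feature channel gets its own mixing weight, which is what allows the fusion to be fine-grained.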

| Temporal dependence modeling
Besides spatial correlations, acquiring the temporal dependence is another key problem in electric load forecasting. RNNs and their variants, such as LSTM 48 and GRU, 49 are very effective for sequential data and are widely used in correlated time series analysis. The basic RNN cannot capture long-term dependencies due to vanishing and exploding gradients, 50 as explored in depth by Hochreiter et al. and Bengio et al. LSTM is a special kind of RNN capable of learning long-term dependencies, but its computational cost is relatively high because of its complex internal structure. The GRU is also designed to solve the gradient problem; it maintains the effect of the LSTM while having a simpler structure. In most cases, the GRU performs comparably to the LSTM, and sometimes better, while its biggest advantage is the lower computational overhead. Although the GRU performs well in extracting features from long time series, it cannot propagate information from later steps back to earlier ones, which is solved by the BiGRU. Therefore, we choose the BiGRU model to capture the temporal dependencies of electric load series. Below, we introduce the structure and calculation process of the GRU and BiGRU.
The network structure of the GRU (shown in Figure 8) is very similar to that of the LSTM, but the GRU has only two gates (the update gate z_t and the reset gate r_t) to control the transmission of information.
The update gate z_t determines how much of the previous hidden state is retained. It can be calculated as

z_t = σ(W_z · [h_{t−1}, x_t] + b_z),

where h_{t−1} is the hidden state at time t − 1 and x_t is the input at time t; the sigmoid function maps the result into the range (0, 1).
The reset gate r_t determines how much information from the past needs to be forgotten. It can be calculated as

r_t = σ(W_r · [h_{t−1}, x_t] + b_r).

The formula for r_t has the same form as that for z_t, but the parameters, learned during training, are different. A tanh layer then creates a candidate state vector ĥ_t that can be added to the state:

ĥ_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t] + b_h).
Finally, the hidden state h_t is passed on to the next GRU unit. It can be calculated as

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ ĥ_t.

The BiGRU is an extension of the GRU that better captures bidirectional temporal dependencies. As shown in Figure 9, the BiGRU layer adopts a bidirectional structure based on the basic GRU: a forward GRU layer processes the input sequence in the forward direction, and a backward GRU layer processes it in the reverse direction.
FIGURE 8 The architecture of GRU.
Finally, the output y_t of the BiGRU is obtained by a weighted combination of the forward output →h_t and the backward output ←h_t. The process can be described by the following equations:

→h_t = GRU(x_t, →h_{t−1}),
←h_t = GRU(x_t, ←h_{t+1}),
y_t = W_f →h_t + W_b ←h_t + b_t.
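The GRU step and the bidirectional pass can be sketched together as follows; this is a plain NumPy illustration (parameter names and the dictionary layout are ours), whereas the model itself would use a framework's fused BiGRU layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    # One GRU cell: update gate z, reset gate r, candidate state h_hat.
    v = np.concatenate([h_prev, x_t])
    z = sigmoid(p["W_z"] @ v + p["b_z"])
    r = sigmoid(p["W_r"] @ v + p["b_r"])
    h_hat = np.tanh(p["W_h"] @ np.concatenate([r * h_prev, x_t]) + p["b_h"])
    return (1.0 - z) * h_prev + z * h_hat

def bigru(xs, p_fwd, p_bwd, W_f, W_b, b):
    # Forward GRU over xs, backward GRU over reversed(xs); each output
    # is a weighted combination of the two hidden states at that step.
    H = p_fwd["b_z"].shape[0]
    h, fwd = np.zeros(H), []
    for x in xs:
        h = gru_step(x, h, p_fwd)
        fwd.append(h)
    h, bwd = np.zeros(H), []
    for x in reversed(xs):
        h = gru_step(x, h, p_bwd)
        bwd.append(h)
    bwd.reverse()
    return [W_f @ hf + W_b @ hb + b for hf, hb in zip(fwd, bwd)]
```

Aligning the reversed backward states with the forward states before combining them is what gives each time step access to both past and future context.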

| Spatial-temporal gated fusion network
In this part, we introduce STGNet, which integrates the NSGCN module, the AGGF module, and the BiGRU to capture the spatial and temporal correlations in electric load data. The calculation process is

h_t = BiGRU(G(X_t), h_{t−1}),

where G(X_t), defined in Equation (10), represents the output of the AGGF module and h_t is the hidden state at time step t; the output weights and biases of the BiGRU are trained end-to-end by back-propagation.
In our work, we focus on 1-h-ahead load forecasting, and the goal is to minimize the error between the real load value and the predicted value. Thus, we choose the Huber loss 51 as our training objective and optimize it for load forecasting:

L(Y, Ŷ) = ½(Y − Ŷ)² if |Y − Ŷ| ≤ δ, and δ|Y − Ŷ| − ½δ² otherwise,
where Y and Ŷ are the real and predicted load values, respectively, and δ is a hyperparameter that controls the sensitivity to the squared error loss.
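The piecewise Huber objective can be sketched as (NumPy; mean reduction over all elements is our choice):

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    # Quadratic for small errors (|e| <= delta), linear beyond, so large
    # outliers are penalized less aggressively than with squared error.
    err = np.abs(y - y_hat)
    quad = 0.5 * err ** 2
    lin = delta * err - 0.5 * delta ** 2
    return np.where(err <= delta, quad, lin).mean()
```

At |e| = δ the two branches meet with matching value and slope, which keeps the loss smooth enough for gradient-based training.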

| Datasets
We verify the performance of STGNet on several real-world datasets released on OpenEI 52 by Lin et al. The datasets contain residential load records from various areas of the United States. The historical load measurements were collected hourly throughout 2012, giving 8760 data points per residential house. In the following experiments, we adopt the same data preprocessing procedures as in the work of Lin et al. 6 We also randomly select 15 houses each from Los Angeles (LA), New York (NY), and Texas (TX) to showcase the effectiveness of STGNet.
Besides individual residential load forecasting, aggregated load forecasting is of great significance as well. It is defined as the total hourly power consumption of all houses in a certain region. Computationally, we first predict the individual loads in a given area and then obtain the aggregated load as the sum of all individual loads.

FIGURE 9 The architecture of BiGRU.

| Baseline methods
To evaluate the performance of our model, we compare STGNet with several popular baselines.
• FNN: Feed-forward neural network with two hidden layers and L2 regularization.

| Experimental settings
During the experiments, we apply standard normalization for data preprocessing and use 60% of the data for training, 20% for validation, and 20% for testing. All the baselines, as well as our STGNet, are implemented in the PyTorch framework. We choose the Huber loss as the loss function and optimize the model with the Adam optimizer at a learning rate of 0.001 for a maximum of 100 epochs. Early stopping with a window size of 10 is used; that is, we stop training if the validation loss does not decrease for 10 consecutive epochs. We set the number of hidden units to 64 and the batch size to 128. The search length in the DTW algorithm is 72, and the embedding dimension is set to 10. Learning rate decay is used during training. Through parameter tuning on the validation set, we choose the best parameters for all models. All evaluated models are run on a server with one Intel(R) Core(TM) i9-10900X CPU @ 3.30 GHz and one NVIDIA GeForce GTX 3090AD GPU card.
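The early-stopping rule described above (stop when the validation loss has not improved for 10 consecutive epochs) can be sketched framework-agnostically; the callback names are our illustration:

```python
def train_with_early_stopping(train_step, val_loss_fn,
                              max_epochs=100, patience=10):
    # train_step(epoch) runs one epoch; val_loss_fn(epoch) returns the
    # validation loss. Stops after `patience` epochs without improvement.
    best, wait, best_epoch = float("inf"), 0, -1
    for epoch in range(max_epochs):
        train_step(epoch)
        v = val_loss_fn(epoch)
        if v < best:
            best, wait, best_epoch = v, 0, epoch
        else:
            wait += 1
            if wait >= patience:
                break   # validation loss plateaued
    return best_epoch, best
```

In practice the model weights at `best_epoch` would be checkpointed and restored before evaluation.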
To fully verify the performance of our model, we use two widely adopted metrics for evaluation: the mean absolute error (MAE) and the mean absolute percentage error (MAPE).
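The two metrics can be sketched as below; the small `eps` guard against zero loads in the MAPE denominator is our addition:

```python
import numpy as np

def mae(y, y_hat):
    # Mean absolute error, in the same units as the load.
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat, eps=1e-8):
    # Mean absolute percentage error, in percent.
    return np.mean(np.abs((y - y_hat) / (np.abs(y) + eps))) * 100.0
```

MAE reflects the absolute deviation in kWh-scale units, while MAPE normalizes by the true load and so is comparable across houses with different consumption levels.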

| Experiment results and analysis
We compare our proposed model with the six baselines on the LA, NY, and TX datasets. The results of individual load forecasting and aggregated load forecasting over the next 1 h are shown in Tables 1 and 2, respectively. We use two widely adopted metrics, MAPE and MAE, to measure performance. We attempted to reproduce the baseline models and tuned them to their optimal state under our existing conditions; however, because some settings and parameters are not disclosed, there are slight differences between our experimental results and the originally reported ones. We have repeatedly verified the models to ensure their accuracy. Following the definitions used by most studies, performance on individual load forecasting is evaluated as the average over all houses in a specific area, and aggregated load performance is computed on the aggregated load of that area. The results show that our STGNet consistently outperforms the baseline models on all datasets.
From Tables 1 and 2 we observe that: (1) the poor performance of FNN, SVR, LSTM, and LSTNet indicates the limitation of employing only temporal correlations; (2) Ada-GWN and our STGNet, which are non-Euclidean models, perform better than CNN-GRU, which is a Euclidean-based model, demonstrating that the residential network is better modeled as a graph than as a two-dimensional grid; (3) STGNet achieves state-of-the-art forecasting performance on all three datasets in terms of all evaluation metrics. Our model constructs a novel graph with a data-driven method and then introduces the node-specific patterns learning GCN to capture fine-grained spatial dependency among the load series. Furthermore, combining the information from adaptive neighbors and temporal-similarity neighbors, we design the AGGF mechanism to capture spatial correlations more accurately and comprehensively.

| Ablation study
To verify the effectiveness of each module in STGNet, we conduct ablation experiments, removing a specific module from our proposed STGNet to explore its impact on the final results. A significant decrease in performance after removing a module indicates that the module is essential; otherwise, the module does not contribute to the whole model. Our proposed STGNet model comprises four main parts: the Ã_adp-NSGCN module, the Ã_ts-NSGCN module, the NSGCN module, and the BiGRU module. As shown in Table 3, we design three variants, named TS-NSGCN, A-NSGCN, and TSA-GCN, to verify the effectiveness of each module. We construct TS-NSGCN and A-NSGCN by removing the Ã_adp-NSGCN module and the Ã_ts-NSGCN module, respectively. To verify the effect of the NSGCN module, we further design TSA-GCN by replacing the NSGCN module with a basic GCN. To verify the efficiency of GCN, we choose GRU as the baseline.
Tables 4 and 5 show the performance of individual and aggregated load forecasting, respectively, when the number of houses is 15. To visualize the results, we plot them as histograms in Figures 10 and 11. From the results, we observe that: (1) GRU performs worse than all variants on the three datasets, demonstrating the efficiency of GCN; (2) our proposed STGNet outperforms TSA-GCN, which shows the necessity of capturing node-specific features; (3) STGNet beats TS-NSGCN and A-NSGCN, which demonstrates that the AGGF module can further improve prediction performance. The AGGF module not only captures spatial correlations more comprehensively but also compensates for the deviation caused by artificial modeling. Especially on large-scale graphs, the performance improvement is relatively obvious because nodes can obtain richer neighbor information.

| Model robustness and prediction visualization
The electricity consumption curve is affected by many factors, such as the season, outdoor weather, and holidays. To verify the stability of our model under different conditions, we randomly select a 1-week snapshot from the LA testing dataset. The individual and aggregated load forecasting visualizations are shown in Figure 12. Although the load characteristics differ, our predictions, for both individual and aggregated loads, are relatively stable and reliable. Note: STGNet achieves the best results on all three datasets (a smaller value means better performance).
TABLE 3 Ablation study on different configurations of modules.

Noise is unavoidable during the data collection process, 55 including sampling error, instrument error, human error, and errors in data transmission, storage, and processing. These errors can be reduced by increasing the sampling frequency, using high-quality measuring equipment, improving the reliability of the data transmission and storage systems, and continuously monitoring and regularly evaluating the data collection process. 56-58 To address environmental noise interference and insufficient input information, Ma et al. 59 proposed measurement-error prediction using an improved local outlier factor and kernel support vector regression. Kong et al. 60 proposed a remote prediction method for smart-meter errors; the method introduces a dimension-reduction estimation model to solve the model's insolvability, and damped recursive least squares is used to exploit new data and narrow the range of changes in the estimators, improving the robustness and accuracy of estimation. Even so, it is impossible to completely remove the errors generated during data collection and transmission. To verify the accuracy and reliability of our proposed model, we conduct noise-immunity experiments.
We add random noise with Gaussian distribution N(0, σ) to the data during the experiments. The experimental results on different evaluation metrics are shown in Figures 13 and 14: Figure 13 shows the individual load forecasting results after adding Gaussian noise, and Figure 14 shows the aggregated load forecasting results. The metrics change little, even when the load data are disturbed to different degrees. Thus, our proposed model is robust and suitable for processing data contaminated with noise.
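The perturbation and evaluation procedure above can be sketched as follows. The series, σ value, and helper names below are illustrative stand-ins, not the paper's actual data or code:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error (%), guarded against zero loads."""
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / (y_true + eps))))

def add_gaussian_noise(load, sigma, seed=0):
    """Perturb a load series with zero-mean Gaussian noise N(0, sigma)."""
    rng = np.random.default_rng(seed)
    return load + rng.normal(0.0, sigma, size=load.shape)

# toy check: a small sigma perturbs the metrics only slightly
clean = np.linspace(1.0, 2.0, 96)   # one synthetic day of 15-min loads (kW)
noisy = add_gaussian_noise(clean, sigma=0.05)
print(round(mae(clean, noisy), 3), round(mape(clean, noisy), 2))
```

Sweeping `sigma` over a grid and re-evaluating the model on each perturbed copy reproduces the kind of robustness curves shown in Figures 13 and 14.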

| Computation cost
To evaluate computational efficiency, we compare the training time and inference time of STGNet with those of LSTNet, CNN-GRU, and Ada-GWN on the LA dataset in Table 6. All evaluated models are implemented on a server with one Intel(R) Core(TM) i9-10900X CPU @ 3.30 GHz and one NVIDIA GeForce RTX 3090 GPU card. In terms of training time, STGNet runs slower than LSTNet and Ada-GWN as the cost of learning node-specific features. Although its training is slower, STGNet outperforms the other methods. The embedding dimension can be reduced to shorten the training time, but at the cost of prediction accuracy. In terms of inference time, there is no significant difference, since all of these models make one-step predictions. For offline training scenarios, the performance and computation cost of STGNet are acceptable.
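A rough sketch of how such per-call timings can be measured is given below; `forecast_step` is a hypothetical stand-in for a model's one-step forecast, not any model from the paper:

```python
import time

def avg_wall_time(fn, repeats=20):
    """Average wall-clock time of fn() over several repeats, in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# hypothetical stand-in for a model's one-step forecast
def forecast_step():
    return sum(i * 0.5 for i in range(10_000))

t = avg_wall_time(forecast_step)
print(f"avg inference time: {t * 1e3:.3f} ms")
```

Averaging over many repeats smooths out scheduler jitter, which matters when the per-call difference between models is small, as it is here for one-step inference.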

| Effect analysis on adjacency matrices
The adjacency matrix is crucial, as it determines how target nodes aggregate information from their neighbors in the graph convolution. For residential load forecasting, the correlations among load series are difficult to predetermine quantitatively, since users' electricity consumption is random. In our model, we use shape-based similarity to effectively measure the similarity of different load series. We first use standard DTW (dynamic time warping) to calculate the similarity between sequences and achieve strong prediction results. However, standard DTW has high time complexity, so we also examine several improved DTW algorithms: FastDTW (fast dynamic time warping), DDTW (derivative dynamic time warping), and WDTW (weighted dynamic time warping).
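As a concrete illustration, a minimal O(nm) dynamic-programming implementation of standard DTW is sketched below on synthetic series; this is the textbook algorithm, not the paper's data pipeline:

```python
import numpy as np

def dtw_distance(x, y):
    """Standard DTW distance between two 1-D series via dynamic programming."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # best of insertion, deletion, and match on the warping path
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# series with the same shape but a phase shift stay close under DTW,
# while an unrelated series is far away
a = np.sin(np.linspace(0, 2 * np.pi, 50))
b = np.sin(np.linspace(0, 2 * np.pi, 50) + 0.3)
c = np.random.default_rng(0).normal(size=50)
print(dtw_distance(a, b) < dtw_distance(a, c))   # True
```

The pairwise DTW distances can then be mapped to edge weights, for example with a Gaussian kernel exp(-d²/σ²); that kernel is one common choice, not necessarily the one used in STGNet.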
FastDTW is a fast algorithm for calculating the distance or similarity between two time series. It significantly reduces computational complexity by adopting an approximation strategy and a down-sampling technique, providing good approximate results while maintaining high computational efficiency. DDTW is designed to solve the "singularities" problem of standard DTW, in which a point on one time series maps onto multiple points of another. The core idea of DDTW is to obtain higher-level, shape-related features by computing the first derivative of the time series. WDTW also addresses the singularities of standard DTW: it adds a weight when calculating the Euclidean distance between two points, where the weight depends on the phase difference between the points along the X-axis. WDTW is a general framework that subsumes both the Euclidean distance and the traditional DTW distance. The heat maps of the adjacency matrices generated by standard DTW, FastDTW, DDTW, and WDTW are shown in Figure 15. Furthermore, we apply the generated adjacency matrices in the STGNet model for STLF and name the corresponding models STGNet, STGNet_FastDTW, STGNet_DDTW, and STGNet_WDTW, respectively. The forecasting results are presented in Table 7.
Figure 16 visualizes these results. In terms of computational complexity, standard DTW, DDTW, and WDTW are comparable and significantly more expensive than FastDTW. In terms of forecasting performance, STGNet_DDTW is the most stable, outperforming STGNet on all three datasets; however, on the NY and TX datasets its performance is inferior to that of STGNet_WDTW. The accuracy of WDTW depends on the choice of the weight parameters, which makes the performance of STGNet_WDTW relatively unstable: it achieves the worst prediction accuracy on the LA dataset but the best on the NY and TX datasets. STGNet_FastDTW generally has lower prediction accuracy than the other variants but the shortest computing time, and on the LA dataset it even outperforms STGNet_WDTW. Notably, the overall prediction performance of STGNet_FastDTW is still better than that of all baseline models.
Considering the reduction in computation cost, the performance of STGNet_FastDTW remains good. In the future, we will further optimize the parameter settings and network structure. Electric load is affected by many factors, such as rainfall, snowfall, temperature, wind speed, cloud cover, and humidity. As a next step, we will explore the influence of these factors to further improve prediction accuracy and reliability.

F I G U R E 4 Examples of load series with diverse patterns. The loads of House 3 and House 4 follow a similar pattern; in contrast, the loads of Houses 1, 2, and 4 show obviously different patterns.

F I G U R E 10 Ablation study on individual load forecasting when the number of houses is 15. (A) MAE of STGNet variants on three datasets; (B) MAPE of STGNet variants on three datasets.
F I G U R E 11 Ablation study on aggregated load forecasting when the number of houses is 15. (A) MAE of STGNet variants on three datasets; (B) MAPE of STGNet variants on three datasets.
F I G U R E 13 Noise immunity analysis on individual load forecasting. The horizontal axis represents σ, the vertical axis represents prediction results, and different colors denote different datasets. (A) MAE after adding Gaussian perturbation. (B) MAPE (%) after adding Gaussian perturbation.
F I G U R E 14 Noise immunity analysis on aggregated load forecasting. The horizontal axis represents σ, the vertical axis represents prediction results, and different colors denote different datasets. (A) MAE after adding Gaussian perturbation. (B) MAPE (%) after adding Gaussian perturbation.
• SVR: Support vector regression model, which can capture the pairwise relationships among time series.
• LSTM: LSTM network, a recurrent neural network with fully connected LSTM hidden units.
• LSTNet: Long- and short-term time-series network, which uses a CNN and an RNN to extract short-term local dependency patterns among variables and to discover long-term patterns in time series trends. 53
• CNN-GRU: CNN-GRU model, which combines a CNN and a multilayered GRU to capture spatial-temporal dependence. 54
• Ada-GWN: Spatial-temporal residential load forecasting model based on adaptive Graph WaveNet. 6

T A B L E 1 Performance comparison of individual load forecasting on the Los Angeles (LA), New York (NY), and Texas (TX) datasets.
Note: STGNet achieves the best results on all three datasets (smaller value means better performance).
T A B L E 2 Performance comparison of aggregated load forecasting on the Los Angeles (LA), New York (NY), and Texas (TX) datasets.
T A B L E 4 Ablation study on individual load forecasting when the number of houses is 15.
T A B L E 5 Ablation study on aggregated load forecasting when the number of houses is 15.
T A B L E 6 The computation time on the LA dataset.
F I G U R E 15 Adjacency matrices of the LA, NY, and TX datasets when the number of houses is 15.
In this paper, we propose STGNet, a novel spatial-temporal model for short-term residential load forecasting of both individual and aggregated loads. Our model constructs a novel graph by a data-driven method and then applies the AGGF module to fuse the information from adaptive neighbors and temporal similarity neighbors. By integrating the AGGF module and BiGRU, STGNet can adaptively capture accurate spatial and temporal dependencies in the load series without a predefined graph. Extensive experiments on three real-world datasets demonstrate that STGNet achieves consistently strong performance. Furthermore, our proposed model is not limited to load forecasting and can also be applied to other spatial-temporal applications.

T A B L E 7 Performance comparison of different adjacency matrices generated by standard DTW, FastDTW, DDTW, and WDTW.
F I G U R E 16 Performance comparison of STGNet, STGNet_FastDTW, STGNet_DDTW, and STGNet_WDTW.