A per-unit curve rotated decoupling method for CNN-TCN based day-ahead load forecasting

The existing load forecasting method based on per-unit curve static decoupling (PCSD) easily leads to deviation and translation of the forecasting results. To tackle this challenge, a per-unit curve rotated decoupling (PCRD) method is proposed for day-ahead load forecasting within a convolutional neural network and temporal convolutional network (CNN-TCN) framework. The PCRD method decomposes the load into three parts: the rotated per-unit load curve, the 0 AM load, and the daily average load. The shape feature of the load curve is extracted by the CNN, while the temporal features of the 0 AM load and daily average load are extracted by the TCN. The rotation operation rotates the per-unit load curve about the midpoint of the curve until the first load point is aligned to the same point, in order to improve the similarity of per-unit load curves and to alleviate the deflection of the forecasting results. The 0 AM load can verify the accuracy of the daily average load, which alleviates the translation of the forecasting results. Several experiments show that the proposed method has higher accuracy and stability than the existing PCSD method. Repeated experiments on multiple data sets also verify the generalization ability of the model.


INTRODUCTION
Load forecasting plays an important role in alleviating power supply and demand imbalance, improving the operation efficiency of power stations, and maintaining the safe operation of the power grid. Even a slight increase in load forecasting error can cause great economic loss [1]. In terms of time scale, load forecasting can be divided into ultrashort-term, short-term, medium-term, and long-term load forecasting. In order to improve the effectiveness of day-ahead scheduling, this paper focuses on day-ahead load forecasting, which is a kind of short-term load forecasting. Machine learning algorithms can compensate for the shortcomings of traditional algorithms in extracting non-linear features. Machine learning algorithms used for load forecasting mainly include support vector regression (SVR) [7,8], fuzzy inference systems (FIS) [9,10], and artificial neural networks (ANN) [11,12].
The ANN has the advantage that the larger the data volume is, the more accurate the forecast is. With the increase of available data and the enhancement of current computing ability, the ANN stands out from many machine learning algorithms and has been studied further by many researchers. The back propagation neural network (BPNN) was the first neural network algorithm applied to load forecasting. In [12], a BPNN based on a rough set was proposed to deal with short-term load forecasting with dynamic and non-linear factors, improving the accuracy of forecasting. In [13], a deep belief network (DBN) based on K-nearest neighbours was proposed to capture uncertainty and reflect the range of electrical load fluctuation. The temporal convolutional network (TCN) has since been applied to time-series tasks such as load forecasting [27,28], solar irradiance prediction [29], and heat load prediction [30]. In [27], the prediction effect of a multi-temporal-spatial-scale TCN in short-term load forecasting was verified. Compared with LSTM, the TCN reduces the mean absolute error (MAE) of the forecasting results. This further demonstrates that the TCN has a stronger ability to extract temporal features in load forecasting. It suggests that combining CNN and TCN to extract shape and temporal features, respectively, may bring better performance for day-ahead load forecasting.
Only by adopting an appropriate combination method can the advantages of both neural networks be fully exploited. In day-ahead load forecasting, the power load is often decomposed into the per-unit curve and the daily average load, because the factors that affect the two parts are different, or the same factors affect them in different ways. This decoupling method is hereinafter referred to as the per-unit curve static decoupling (PCSD) method. In [31], a short-term load forecasting method based on the decoupling mechanism was proposed for a small regional power grid, and theoretical support that PCSD can reduce the forecasting error was analysed. However, on the one hand, once the error of the daily average load is large, all predicted values deviate as a whole from the actual values, which causes the translation problem of the forecasting results. On the other hand, the per-unit load curves are similar in shape, but there always exist angle deviations between them due to differences in the overall trend. This makes it difficult to extract the shape features of the load curves, which causes the deflection problem of the forecasting results. Similar to the PCSD method, another decoupling method decomposes the power load into its variation trend and maximum [25]. These decoupling methods are essentially the same, and both have the above problems.
Here, the per-unit curve rotated decoupling (PCRD) method is proposed to avoid the above shortcomings of the existing PCSD method. The power load is decoupled by PCRD into three parts: the rotated per-unit load curve, the 0 AM load, and the daily average load. The deflection problem of the forecasting results based on PCSD can be solved by rotating and aligning the per-unit load curves of similar days. The rotation operation makes the per-unit load curves of similar days coincide, so that the curve shape feature can be extracted without being affected by the overall trend. By fusing the temporal features of the 0 AM load and the daily average load, the overall trend of the load curve can be predicted. That is, two points determine a curve with a fixed shape, which is more stable than determining the load curve by the daily average load alone. The translation problem of the forecasting results based on PCSD can be solved by forecasting the 0 AM load, which can verify the accuracy of the daily average load on another time scale and thus prevent an excessively large error in the daily average load. By solving the above problems, the PCRD method further improves the accuracy of day-ahead load forecasting.
The rest of the paper is organized as follows. The PCRD method and its principle are formulated in detail in Section 2. The CNN-TCN based day-ahead load forecasting model using the PCRD method and its related modules are developed in Section 3. Based on practical real data, in Section 4, the proposed model is applied to day-ahead load forecasting and is compared with the CNN-LSTM model based on PCRD, the CNN-TCN model based on PCSD, the TCN model based on the time series (TS), the LSTM model based on TS, the BPNN model based on similar days (SD), and SVR. Finally, the conclusion is drawn in Section 5.

THE ESTABLISHMENT OF PCRD
For day-ahead load forecasting, some researchers use rolling forecasts [19]. This causes the forecast error to grow the later the forecast point is. Thus, the drawback of rolling forecasts may be mild when there are 24 sampling points per day, but it introduces a larger forecast error when there are 96 sampling points per day. By taking full advantage of the daily periodicity of the power load, PCSD decomposes the power load into two parts, the daily average load and the per-unit load curve, thus eliminating the drawbacks caused by rolling forecasts. The difficulty after decoupling the load data lies not in forecasting the daily average load, but in forecasting the shape of the load curve. In order to extract the shape features of the load curve, the power load can be further decomposed by PCRD into three parts for forecasting: the rotated per-unit load curve, the 0 AM load, and the daily average load. These three parts are considered comprehensively in the proposed PCRD method, where P_t is the actual value of the power load at time t; P*_rev,t is the value of the rotated per-unit load curve at time t; P_zero is the 0 AM load; P_ave is the daily average load; and M is the number of load sampling points per day. The rotation operation rotates the per-unit load curve about the midpoint of the curve until the first load point is aligned to the same point, which does not change the average value of the curve. Figure 1 is the comparison diagram before and after the rotation of Monday's load of Guangzhou City from May to June 2019. In Figure 1a, the per-unit load values of the curves at the same time differ greatly, and the overall change trend and the amplitude do not coincide completely.
In Figure 1b, after rotating the per-unit load curve, all curves basically coincide and the overall change trend is relatively consistent, which facilitates the extraction of the shape features of the per-unit load curve. After the per-unit load curve is rotated, it can be decoupled into the rotated per-unit load curve and the amount of rotation. The amount of rotation can be expressed by the 0 AM load and the daily average load. Thus, the power load is decoupled into the rotated per-unit load curve, the 0 AM load, and the daily average load.
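As a sketch of the rotation operation (our reconstruction, assuming the rotation adds a linear ramp about the midpoint t = (M+1)/2 of the day and aligns the first point to a common constant c; the alignment constant c is an assumption):

```latex
P^{*}_{t} = \frac{P_{t}}{P_{\mathrm{ave}}}, \qquad
P^{*}_{\mathrm{rev},t} = P^{*}_{t} + k\Bigl(t - \tfrac{M+1}{2}\Bigr), \qquad
k = \frac{c - P^{*}_{1}}{1 - \tfrac{M+1}{2}}
```

Because \(\sum_{t=1}^{M}\bigl(t-\tfrac{M+1}{2}\bigr)=0\), the ramp leaves the daily average unchanged, and \(P^{*}_{1}=P_{\mathrm{zero}}/P_{\mathrm{ave}}\) ties the amount of rotation to the 0 AM load and the daily average load, as described above.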
To explore the influence of the rotation operation on the shape similarity of load curves, the MAE between the similar day and the predicted day at different time intervals is calculated, and all MAEs are averaged as the total MAE to assess the similarity between the curves. When the time interval is g days, the total MAE is calculated as

MAE_total(g) = (1 / (N·M)) · Σ_{n=1}^{N} Σ_{t=1}^{M} | P*_{t,n} − P*_{t,n−g} |,

where M is the number of load sampling points per day; N is the number of test samples, and 1000 test samples are selected for each time interval in this paper; P*_{t,n} is the per-unit load value at time t on day n.
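The total-MAE similarity measure above can be sketched as follows; the data layout (a list of daily per-unit curves, one list of M values per day) is our assumption, not the paper's data format:

```python
def total_mae(curves, g):
    """Average MAE between each day's per-unit curve and the curve g days earlier.

    curves: list of days, each a list of M per-unit load values.
    g: time interval in days.
    """
    # Pair each day with the day g days before it.
    pairs = [(curves[n], curves[n - g]) for n in range(g, len(curves))]
    # Per-day MAE between the two curves, then average over all pairs.
    per_day = [sum(abs(a - b) for a, b in zip(day, prev)) / len(day)
               for day, prev in pairs]
    return sum(per_day) / len(per_day)
```

A smaller total MAE indicates that the curve shapes at that interval are more similar, which is how the troughs at multiples of 7 days are identified.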
In Figure 2, the time interval is the abscissa and the total MAE is the ordinate. The blue line represents the total MAE of curves before rotation; the orange line represents the total MAE of curves after rotation. Both lines show obvious periodicity. When the time interval is a multiple of 7 days, there is a trough on the total MAE curve, because many factors have weekly periodicity. The total MAE is smallest when the time interval is 7 days. This indicates that the load curve shape of the predicted day is closest to that of the day 1 week before. There is also a deep trough on the total MAE curve when the time interval is 364 days, because many factors have annual periodicity: 364 days is a multiple of 7 days and is closest to a full year. Therefore, the proposed PCRD method takes the per-unit load curves of 7, 14, 21, 28, and 364 days before the predicted day as the input signals to predict the shape of the load curve. In Figure 3, under any number of samples, the rotation operation effectively improves the similarity between load curves, which facilitates the CNN in capturing shape features. The larger the sample size, the more obvious the weekly and annual periodicity.
By combining Figures 1 and 2, it can be seen that the rotation operation makes the per-unit load curves of similar days coincide, which facilitates the extraction of curve shape features. In addition, since the 0 AM load value on the predicted day is the load value closest to the known data set, it is the easiest to predict. The 0 AM load can verify the accuracy of the daily average load on another time scale. Therefore, the forecasting accuracy can be improved by decoupling the power load into the rotated per-unit load curve, the 0 AM load, and the daily average load.

The framework of the CNN-TCN model based on PCRD method
The framework of the CNN-TCN model based on the PCRD method is shown in Figure 4. In the data preprocessing stage, the data set is cleaned first. Then the per-unit load curve, the daily average load, and the load of the previous day are separated from the power load data, in order to extract the features of the rotated per-unit load curve, the daily average load, and the 0 AM load on the predicted day separately. Next, the daily average load, the load curve of the previous day, and the external data are normalized, and the per-unit load curve is rotated and aligned. Here, min-max normalization is used to convert these data into dimensionless data. The conversion formula is expressed as follows [27]:

x_nom,ij = (x_ij − x_min,j) / (x_max,j − x_min,j),

where x_ij and x_nom,ij represent the original and normalized data of variable j of sample i, respectively; x_max,j and x_min,j represent the original maximum and minimum values of variable j, respectively.

The ability of the CNN to extract image shape features has long been widely recognized. The shape of the load curve can be seen as a one-dimensional (1D) picture. Therefore, a 1D CNN module is adopted to extract the shape features of the per-unit load curve. In addition, the TCN module is adopted to extract temporal features of the daily average load and the 0 AM load. The outputs of these three modules are flattened and combined with the external data into a matrix, and this matrix is fed into a fully connected layer consisting of three linear layers. After feature fusion in the fully connected layer, the predicted values of the daily load curve are finally output.
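The min-max normalization described above can be sketched as a short, dependency-free function (operating on one variable's column of values):

```python
def min_max_normalize(column):
    """Min-max normalization: x_nom = (x - x_min) / (x_max - x_min).

    Maps the column into [0, 1]; assumes the column is not constant.
    """
    x_min, x_max = min(column), max(column)
    return [(x - x_min) / (x_max - x_min) for x in column]
```

In practice the minimum and maximum would be computed on the training set and reused to transform validation and test data, so that no information leaks from the evaluation sets.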
The hyperparameters of the fully connected layers are shown in Table 1. The hyperparameter tuning for the different layer types is described below.

1. For the linear layers: The input size of the first linear layer is 1300, which is determined by the number of external data and the outputs of the CNN and TCN. The output size of the last linear layer is 96, equal to the number of sampling points on the predicted day. Except for the first linear layer, the input size of each linear layer is equal to the output size of the previous linear layer. For the number of layers and the input/output sizes of the linear layers, we adjusted the hyperparameters from two perspectives. First, following the design of AlexNet in [32], there are three linear layers in our model, and the input/output sizes of the first two linear layers are close to the input size of the first layer. Second, we respectively tested input/output sizes of 130, 1300, and 13,000 to assess model performance and to provide hyperparameter guidance. We found that too small an input/output size (e.g. 130) leads to underfitting and convergence to a large loss value, while too large a size (e.g. 13,000) leads to overfitting and long training times. Thus, we finally set the input/output size to 1300 except for the first linear layer. With this design, the proposed model shows no obvious overfitting or underfitting (see Figure 5 in Section 4), which indicates that the input/output sizes of the first two linear layers are within a reasonable range.

2. For the batch normalization layers: A batch normalization layer after a linear layer can effectively alleviate overfitting. The batch normalization layer does not change the size of the input data, but only normalizes the data within the same batch. Therefore, the input and output sizes of the batch normalization layer are equal to the output size of the previous layer, that is, 1300.

3. For the activation function layers: The activation function is the ReLU function, the most common activation function at present. Compared with the Sigmoid and Tanh functions, the ReLU function is easy to compute and fast to converge.

4. For the dropout layers: The dropout rate is 0.5, the most common value. In [33], Srivastava et al. argued that the optimal dropout rate is 0.5, because a dropout rate of 0.5 generates the largest number of possible randomly thinned network structures.
In the training and validation stage, four-fold cross-validation is adopted [34]. At the beginning of each iteration, the whole data set is randomly divided into four subsamples. One of them is retained as the validation set and the remaining three form the training set, and then the model is trained and validated. This process is repeated four times until each subsample has served as the validation set once. At the end of the last iteration, the model with the lowest average loss on the validation sets is saved as the final result.
In Table 2, the proposed CNN-TCN model is trained for 1500 iterations with a learning rate of 0.02, the Adamax optimizer, and the MSEloss loss function. The design of these hyperparameters is described as follows. Since the final saved model is the one with the lowest average validation loss, rather than the model at the end of the last iteration, a satisfactory final model can be obtained by setting a sufficiently large number of iterations. Through practice and a literature survey, it was found that 1500 iterations are sufficient for the proposed model, which is also verified by the loss analyses of the proposed model on the training, validation, and test sets (see Figure 5 in Section 4). Since the selection of the learning rate, optimizer, and loss function greatly influences the convergence speed and prediction performance of the model, a grid search [35] is conducted for these three key hyperparameters. In the grid search, the model is trained for every hyperparameter combination within the search space, and the combination with the lowest average validation loss is finally selected. The search spaces of the three hyperparameters are given in the second column of Table 2; the search space of the learning rate follows [27] and that of the optimizer follows [26]. The optimizer is selected from adaptive moment estimation (Adam), Adamax (a variant of the Adam algorithm), stochastic gradient descent (SGD), root mean square propagation (RMSProp), and adaptive gradient (Adagrad), which are all common optimizers. Since the finally selected optimizer, Adamax, embeds an adaptive learning rate algorithm, fine-tuning of the learning rate can be omitted. The loss function is selected from the mean absolute error loss (MAEloss) and the mean square error loss (MSEloss), which are commonly used in regression problems. Compared with the MAEloss, the MSEloss requires less computation and converges faster. Other hyperparameters of the proposed CNN-TCN model, such as the convolution kernel size, activation function, and dropout rate, are described in Tables 1, 3, and 4.
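The grid search over the three key hyperparameters can be sketched as follows; the `train_and_validate` callback and the learning-rate search space shown here are assumptions for illustration (the paper's actual search spaces are given in Table 2):

```python
from itertools import product

def grid_search(train_and_validate,
                learning_rates=(0.2, 0.02, 0.002),          # assumed space
                optimizers=("Adam", "Adamax", "SGD", "RMSProp", "Adagrad"),
                losses=("MAEloss", "MSEloss")):
    """Train on every hyperparameter combination; return the combination
    with the lowest average validation loss."""
    best = None
    for lr, opt, loss_fn in product(learning_rates, optimizers, losses):
        val_loss = train_and_validate(lr, opt, loss_fn)  # avg validation loss
        if best is None or val_loss < best[0]:
            best = (val_loss, {"lr": lr, "optimizer": opt, "loss": loss_fn})
    return best[1]
```

Exhaustive search is affordable here because only 3 × 5 × 2 = 30 combinations need to be trained.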
To evaluate the effectiveness of the proposed method, the evaluation indexes including the MAE, mean absolute percentage error (MAPE), and root mean square error (RMSE) are calculated. These indexes are used to evaluate the performance of the CNN-TCN load forecasting model based on PCRD, and are calculated as follows [30]:

MAE = (1/M) · Σ_{t=1}^{M} | P_t − P̂_t |,
MAPE = (100%/M) · Σ_{t=1}^{M} | P_t − P̂_t | / P_t,
RMSE = sqrt( (1/M) · Σ_{t=1}^{M} ( P_t − P̂_t )² ),

where P_t is the actual value of the power load at time t; P̂_t is the predicted value of the power load at time t; M is the number of load sampling points per day.
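A minimal, dependency-free sketch of the three evaluation indexes (standard definitions; `actual` and `pred` are same-length sequences of load values):

```python
import math

def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean absolute percentage error, in percent (assumes actual values are non-zero)."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the three indexes are reported together.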

CNN module
The CNN is an ANN widely used in tasks with high local correlation of the data, such as visual images, video prediction, and text classification. To extract the features of the shape of the load curve, the per-unit curve is regarded as a 1D image. According to Figure 2, five similar days are selected here, with date intervals of 7, 14, 21, 28, and 364 days before the predicted day. The day 364×2 days before the predicted day is not selected because it would make the sample set too small: 2 years of data would need to be reserved as similar days. Since the selection of similar days mainly serves to capture the shape of the load curve, the selection standard considers only whether the shapes of the load curves are similar. Although many researchers have proposed schemes to select similar days according to external factors such as meteorological data, in the actual load forecasting process the external factors on the predicted day, such as meteorological data, are themselves predicted by other methods. Because the external data in our data set have a low sampling frequency within each day, choosing similar days based on the similarity of meteorological data would make it harder for the model to learn the characteristics of the curve shape. After rotation, the per-unit load curves of the similar days are fed into the 1D CNN as different channels. The relationships between the points of the per-unit load curve are extracted step by step by five 1D CNN cells. Each 1D CNN cell consists of a 1D convolution layer, a batch normalization layer, an activation function layer, and a 1D max-pooling layer.
In Table 3, the hyperparameters of the proposed CNN module are shown. The principles behind each hyperparameter setting are described as follows. The number of input channels in the first layer is set to 5 to match the number of similar days; that is, each input channel corresponds to the curve shape of one similar day. Except for the first layer, the number of input channels in each layer is equal to the number of output channels in the previous layer. The numbers of output channels of the five layers are set to 16, 32, 64, 128, and 256, respectively, following the common practice that the number of output channels should be a multiple of 8, as in AlexNet [32]. Setting the number of output channels to a multiple of 8 makes it easier for the GPU to partition the matrices and weights, which improves learning efficiency. The number of output channels in the five-layer convolutional network increases gradually to offset the information loss caused by the shrinking feature maps. The convolution kernel size should be as small as possible to reduce computational complexity; therefore, all convolution kernel sizes are 2, except for the first convolution layer. If the kernel size of the first convolution layer were 2, the feature maps output by the first convolution layer would have length 95. Since 95 is odd, the pooling layer would ignore the last element of the feature maps, losing valid information. Therefore, the kernel size of the first convolution layer is 3. The design of the kernel size (column 4 of Table 3) and stride (column 5 of Table 3) of the pooling layers follows the design of LeNet-5 in [36]. As the feature maps alternate through the convolution and pooling layers, their length decreases from 96 to 94, 47, 46, 23, 22, 11, 10, 5, 4, and finally 2. At this point, the feature maps cannot be convolved again, which is why only five convolution layers and five pooling layers are adopted instead of more. The design of multiple convolutional layers with small kernels reduces the computational overhead and extracts shape features at different scales as much as possible.
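The length arithmetic above (stride-1 unpadded convolutions alternating with size-2, stride-2 pooling) can be checked with a short sketch:

```python
def conv1d_len(n, k):
    """Output length of a stride-1, unpadded 1D convolution with kernel size k."""
    return n - k + 1

def pool1d_len(n, k=2, s=2):
    """Output length of a 1D max-pooling layer; a trailing element of an
    odd-length input is dropped, as noted for the first convolution layer."""
    return (n - k) // s + 1

def cnn_lengths(n=96, conv_kernels=(3, 2, 2, 2, 2)):
    """Feature-map lengths after each conv/pool layer of the five CNN cells."""
    lengths = []
    for k in conv_kernels:
        n = conv1d_len(n, k)
        lengths.append(n)
        n = pool1d_len(n)
        lengths.append(n)
    return lengths
```

Running `cnn_lengths()` reproduces the sequence 94, 47, 46, 23, 22, 11, 10, 5, 4, 2 stated above, confirming why a sixth cell is impossible.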
Each 1D convolution layer performs an affine transformation (the CNN module mainly follows [37]):

z_j = Σ_{m=1}^{M} h_m * f_j + b_j,

where z_j represents the feature maps on output channel j; h_m represents the feature maps on input channel m and is the input of this CNN cell; f_j represents the convolution kernels on output channel j; b_j represents the bias on output channel j; M represents the number of input channels; and * represents the convolution operation.
To prevent gradient explosion and gradient vanishing, each convolutional layer is followed by a batch normalization layer. Each batch normalization layer performs an affine transformation:

u = γ · z_bn + β,

where z_bn represents the feature maps normalized using the mean and standard deviation of the batch samples; γ and β are trainable parameters tuned during optimization; and u is the output of the batch normalization layer. The activation function is the rectified linear unit (ReLU), which effectively alleviates the vanishing gradient problem:

a = max(0, u),

where a is the output of the activation function layer.
To alleviate the oversensitivity of the convolutional layer to location, a 1D max-pooling layer is added at the end:

o_{j,s} = max_{0 ≤ i < k} a_{j, s·r + i},

where o_{j,s} represents element s on output channel j of the 1D max-pooling layer and is the output of this CNN cell; a_{j,s} represents element s on output channel j of the activation function layer; k is the size of the pooling window; and r is the stride of the max-pooling layer.

TCN module
The TCN is a kind of CNN that captures temporal features. Compared with the RNN, the TCN has the following advantages: (1) a flexible receptive field; (2) a more stable gradient; (3) lower memory usage. The core idea of the TCN is to combine 1D convolution and causal convolution to form a time-directed one-way structure, in which the value at a certain moment is affected only by the values at and before that moment in the layer below. In short-term load forecasting, the load value is only affected by past load values, so the causal convolution structure is more in line with the inherent law of the load. Another feature of this structure is that each hidden layer has the same length as the input layer; this is achieved by zero-padding the hidden layers.
To capture more long-term temporal features, a CNN generally increases the number of convolutional layers or the size of the convolution kernel, which leads to vanishing gradients and overfitting. To avoid these problems, the TCN introduces dilated convolution to enlarge the receptive field. For a 1D input sequence x ∈ R^n and a convolution kernel f: {0, …, k−1} → R, the dilated causal convolution operation F on element s of the sequence is defined as follows (the TCN module mainly follows [26]):

F(s) = Σ_{i=0}^{k−1} f(i) · x_{s − d·i},

where d is the dilation coefficient, k is the size of the convolution kernel, and s − d·i accounts for the direction of the past. Figure 6 is a schematic diagram of 1D dilated causal convolution. In Figure 6, as the number of layers increases, the receptive field of each layer increases gradually. 1D dilated causal convolution can dynamically adjust the receptive field by adjusting the dilation factor d and the filter size k.
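A minimal pure-Python sketch of the dilated causal convolution F above, with zero padding for indices before the start of the sequence (so every output position only sees the past):

```python
def dilated_causal_conv(x, f, d):
    """F(s) = sum_i f[i] * x[s - d*i], with x zero-padded for s - d*i < 0."""
    k = len(f)
    return [sum(f[i] * (x[s - d * i] if s - d * i >= 0 else 0.0)
                for i in range(k))
            for s in range(len(x))]
```

Note that the output has the same length as the input, matching the zero-padded hidden layers described above, and that increasing d widens the receptive field without adding parameters.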
To avoid network degradation when there are too many network layers, the features of the shallow layer can be transferred to the deep layer, that is, the residual connection can be adopted. Figure 7 is a schematic diagram of the TCN residual block. Each TCN residual block consists of two dilated causal convolution cells and a 1*1 convolution layer. Each 1D dilated causal convolution cell consists of a 1D convolution layer, a weight normalization layer, an activation function layer, and a dropout layer.
The TCN module used to predict the daily average load is connected by a five-layer residual block. The TCN module used to predict the 0 AM load is connected by a seven-layer residual block. The input data of the two TCN modules are the daily average load of the 4 weeks before the predicted day and the actual load of the day before the predicted day, respectively. Therefore, the sequence lengths of the two modules are 28 and 96, respectively.
In Table 4, the hyperparameters of the two proposed TCN modules are shown. Five-layer and seven-layer residual blocks are chosen because too many residual blocks would cause the dilation rate to exceed the sequence length. For example, if the sequence length in Figure 6 were 8, the blue square in the output layer would be affected only by the blue square below it, because the yellow squares outside the sequence length are automatically filled with 0. At that point, the 1D dilated causal convolution degenerates into a linear layer.
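Assuming a kernel size of 2, two dilated convolutions per residual block, and a dilation that doubles per block (d = 2^i) — all assumptions on our part, since Table 4 is not reproduced here — the receptive field of a stack of residual blocks can be estimated as:

```python
def tcn_receptive_field(num_blocks, k=2):
    """Receptive field of stacked TCN residual blocks, assuming two dilated
    causal convolutions per block and dilation d = 2^i in block i:
    RF = 1 + sum_i 2 * (k - 1) * 2^i."""
    return 1 + sum(2 * (k - 1) * 2 ** i for i in range(num_blocks))
```

Under these assumptions, five blocks give a receptive field of 63, covering the 28-point daily-average sequence, and seven blocks give 255, covering the 96-point previous-day sequence, which is consistent with the five- and seven-block choices while keeping the dilation from far outrunning the sequence length.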
Each 1D dilated causal convolution layer performs an affine transformation:

z_j = Σ_{m=1}^{M} h_m *_d f_j + b_j,

where z_j represents the feature maps on output channel j; h_m represents the feature maps on input channel m and is the input of this TCN cell; f_j represents the convolution kernels on output channel j; b_j represents the bias on output channel j; M represents the number of input channels; and *_d represents the dilated causal convolution operation. To prevent gradient explosion and gradient vanishing, each convolutional layer is followed by a weight normalization layer, which performs an affine transformation:

u = γ · z_wn,

where z_wn represents the feature maps normalized using the Euclidean norm of the neuron weights; γ is a trainable parameter tuned during optimization; and u is the output of the weight normalization layer. The activation function is the ReLU function, which effectively alleviates the vanishing gradient problem:

a = max(0, u),

where a is the output of the activation function layer.
To alleviate overfitting, a dropout layer is placed after the activation function. During model training, the dropout layer randomly inactivates a portion of the neurons, which makes the model less dependent on local features and improves its generalization ability:

o_{j,s} = a_{j,s} · v_{j,s},

where o_{j,s} represents element s on output channel j of the dropout layer and is the output of this TCN cell; a_{j,s} represents element s on output channel j of the activation function layer; and v_{j,s} is a random variable following a Bernoulli distribution with parameter p. During model testing, instead of inactivating neurons, the weight of each neuron in the dropout layer is multiplied by p.
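The training/testing behaviour of dropout described above can be sketched as follows (p is the keep probability, matching the Bernoulli parameter above; the fixed seed is our addition for reproducibility):

```python
import random

def dropout_train(a, p, seed=0):
    """Training: keep each activation with probability p (Bernoulli draw),
    set it to zero otherwise."""
    rng = random.Random(seed)
    return [x if rng.random() < p else 0.0 for x in a]

def dropout_test(a, p):
    """Testing: no inactivation; every activation is scaled by p instead,
    so the expected magnitude matches training."""
    return [x * p for x in a]
```

Scaling by p at test time keeps the expected output of each unit the same as during training, which is why no neurons need to be dropped at inference.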

Data sources and preprocessing
The experimental data comprise about 10 years of power load data and meteorological data, from January 2008 to June 2019, in Guangzhou (see Table 5 for details). The load data sampling interval is 15 min, that is, 96 sampling points per day, giving 367,968 load sampling points in total. All meteorological data are sampled every 6 h, that is, 4 sampling points per day, giving 15,332 external data sampling points. From the original data set, 3833 daily samples were obtained.
To improve the accuracy and convergence speed of the proposed method, the input and output data are normalized and the per-unit load curve is rotated and aligned. The experimental environment was built on Windows 10, with CUDA 11.0 for GPU acceleration, using the PyTorch deep learning framework and an NVIDIA GTX 1050 graphics card.

Comparative analysis of forecasting results
To verify the forecasting effect of the proposed PCRD method, seven different models are trained in this paper: the CNN-TCN model based on PCRD, the CNN-LSTM model based on PCRD, the CNN-TCN model based on PCSD, the TCN model based on TS, the LSTM model based on TS, the BPNN model based on SD, and SVR. These models share a learning rate of 0.02, 1500 iterations, the Adamax optimizer, and the MSEloss loss function. The forecasting results of the seven models are compared. The data from 24-30 June 2019 are selected as the test sets, and the 2920 consecutive days of data before the test set are selected as the corresponding training sets. Figure 8 shows the comparison of the predicted load values, actual load values, and corresponding relative errors among the seven models on 24 June 2019. The TCN model based on TS and the LSTM model based on TS are poor at capturing the shape features of the daily load curve, leading to large forecasting errors. Compared with the CNN-TCN model based on PCSD, SVR, and the BPNN based on SD, the CNN-TCN model based on PCRD has a better forecasting effect in the early morning of the predicted day. This is because the PCRD introduces the forecasting of the 0 AM load on the predicted day into the model framework, which is one of the advantages of PCRD over conventional PCSD. Compared with the CNN-LSTM model based on PCRD, the forecasting error of the CNN-TCN model based on PCRD is significantly smaller at noon. Among the seven methods, the CNN-TCN model based on PCRD has the best forecasting effect on this sample. Tables 6-8 compare the evaluation indexes of the seven models; the results further show that the rotation operation and the 0 AM load forecasting are effective improvements over PCSD. Therefore, PCRD significantly improves the forecasting accuracy.

Generalization ability verification
To verify the generalization ability of the CNN-TCN model based on PCRD, three data sets, each containing seven consecutive days, are selected as new test sets, and the 2920 consecutive days' data before these test sets are selected as the corresponding training sets. As shown in Figure 9, for most of the predicted dates, the forecasting accuracy of the CNN-TCN model based on PCRD is better than that of the CNN-TCN model based on PCSD. As shown in Figure 10, the MAPE of the CNN-TCN model based on PCRD ranges from 0.74% to 4.32%, while that of the CNN-TCN model based on PCSD ranges from 0.93% to 7.32%, indicating that the CNN-TCN model based on PCRD is more stable.
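For reference, the MAPE used throughout these comparisons is the mean absolute percentage error, MAPE = (100/n) Σ |y − ŷ| / |y|, which can be computed as:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)
```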
The performance of load forecasting can also be illustrated by the residual distribution, that is, the distribution of the differences between predicted and actual load values over a period of time. The residual distributions of the two forecasting methods (PCRD and PCSD) have different means and variances. Using MATLAB tools, and taking Figure 9b as an example, the normality tests of the residual distributions of PCRD and PCSD are shown in Figure 11. Figure 11a is the statistical histogram of the residuals: compared with the traditional PCSD method, the proposed PCRD method brings the residual mean closer to 0 and narrows the residual range, so the predicted values are closer to the true values. Figure 11b is the quantile-quantile (QQ) plot: the residual trajectories of both methods are close to straight lines, indicating that the residual distributions are close to normal; moreover, the trajectory of the proposed PCRD method has a smaller slope and passes closer to the origin, that is, the residuals of PCRD have a smaller variance and a mean closer to 0. Figure 11c shows the empirical cumulative distribution functions: the trajectories of both methods are close to the normal cumulative distribution function, which verifies the normality of the residuals from another perspective. Figure 11d is the normal probability plot: the residual trajectories of both methods are again close to straight lines, and the trajectory of the proposed PCRD method has a larger slope, which in this plot corresponds to a smaller residual variance.
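The QQ-plot reasoning above can be reproduced numerically: for near-normal residuals, regressing the sorted residuals on standard normal quantiles gives a slope that estimates the residual standard deviation. The sketch below uses synthetic residual samples (assumed, not the paper's data) whose means and spreads mimic the PCRD/PCSD contrast.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)
# Illustrative residual samples (assumptions, not the paper's data):
# PCRD-like residuals have mean nearer 0 and smaller spread than PCSD-like ones.
res_pcrd = rng.normal(loc=0.0, scale=0.5, size=2000)
res_pcsd = rng.normal(loc=0.5, scale=1.5, size=2000)

def qq_slope(res):
    """Slope of the QQ plot: ordered residuals regressed on standard
    normal quantiles; for normal data it estimates the residual std."""
    n = len(res)
    theo = np.array([NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])
    slope, _ = np.polyfit(theo, np.sort(res), 1)
    return slope
```

A smaller QQ slope and a sample mean closer to 0 correspond exactly to the "smaller variance, mean closer to 0" reading of Figure 11b.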
As with the above analysis of Figure 9b, the residual distributions of the remaining cases in Figure 9a-d also approximately satisfy the normal distribution.
In addition, we tested the normality of the residuals of Figure 9a-d with the Jarque-Bera test: at a significance level of 0.05, the residual distributions of Figure 9a-d obey the normal distribution. Table 9 lists the average MAPE of the CNN-TCN models based on PCRD and on PCSD under different test sets. The average MAPE of the CNN-TCN model based on PCRD is 1.51% lower than that of the CNN-TCN model based on PCSD. Table 9 also shows that the forecasting effect of the CNN-TCN model based on PCSD is closely tied to the test samples, with larger forecasting errors than the CNN-TCN model based on PCRD, and that its performance differs greatly across test sets, indicating weak generalization ability. There are two reasons for this. First, for PCSD the difficulty of predicting the daily average load varies from one test set to another, since PCSD has no 0 AM load to check the accuracy of the daily average load. Second, the forecasting results of PCSD are influenced by the overall trends of the per-unit load curves on similar days: if the overall trends of the per-unit load curves in the sample are similar, the results will be better; if not, it is difficult to extract the shape features of the load curve. Thus, the CNN-TCN model based on PCRD yields more accurate and stable forecasting results and has stronger generalization ability.

As shown in Figure 5, the loss of the proposed model on the validation set reaches its minimum at iteration 1344, and the model at this iteration is saved as the best model. At this point, the losses on the training, validation, and test sets are 0.00056, 0.00027, and 0.00121, respectively.
The loss on the validation set is smaller than that on the training set because the dropout layers drop out some neurons during training but not during validation. The MAPE on the training set and test set is 1.86% and 2.39%, respectively. The test-set error is slightly larger than the training-set error but of the same order of magnitude, which indicates that the proposed model has no obvious overfitting and, from another perspective, demonstrates its generalization ability.
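The train/validation asymmetry of dropout can be illustrated with a minimal NumPy sketch of inverted dropout (the scheme used by PyTorch's `nn.Dropout`): during training, activations are zeroed with probability p and survivors are scaled by 1/(1 − p); at evaluation time the layer is the identity, so validation losses are computed on the full network.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(x, p, training):
    """Inverted dropout: zero each activation with probability p during
    training and scale survivors by 1/(1-p); identity at evaluation time."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
train_out = dropout(x, p=0.5, training=True)   # noisy, some activations zeroed
eval_out = dropout(x, p=0.5, training=False)   # unchanged
```

The scaling keeps the expected activation equal between the two modes, which is why training and validation losses remain comparable despite the dropped neurons.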
Cross-validation can prevent overfitting, and selecting an appropriate K value can avoid overfitting while ensuring prediction accuracy and saving computation time. Figure 12 shows the actual values and the predicted results under 4-fold and 10-fold cross-validation. In terms of forecasting accuracy, the MAPE of 4-fold cross-validation is 2.39%, while that of 10-fold cross-validation is 3.95%. In terms of training time, over 1500 training iterations, 4-fold cross-validation needs 2.33 s per iteration on average, while 10-fold cross-validation needs 8.49 s per iteration on average. This indicates that 4-fold cross-validation is more suitable for this model than 10-fold cross-validation. These results do not imply that 4-fold cross-validation is universally superior to 10-fold cross-validation; in other application scenarios, the optimal K value may differ.
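The K-fold split itself can be sketched as an index generator; the function name is illustrative, and note that for sequential data such as day-ahead load, chronologically ordered or forward-chaining splits are often preferred over random shuffling.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation:
    the data are split into k contiguous folds, and each fold is used
    once as the validation set while the rest form the training set."""
    idx = np.arange(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val
```

A larger k means more (and larger) training runs per experiment, which is why 10-fold cross-validation costs several times more time per iteration than 4-fold in the measurements above.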

CONCLUSION
To alleviate the deflection and translation of forecasting errors in the conventional PCSD method, a PCRD method is proposed for CNN-TCN based day-ahead load forecasting. The forecasting of power load is decomposed into the forecasting of the rotated per-unit load curve, the 0 AM load, and the daily average load, and different deep learning algorithms are used to extract different features: the CNN captures the shape features of the rotated per-unit load curve, while the TCN captures the temporal features of the 0 AM load and the daily average load. Six other models are established as comparisons to prove the validity of the proposed method. The experimental results show that, compared with the conventional PCSD, the proposed PCRD method reduces the MAPE of day-ahead load forecasting by 1.51% on average. The CNN-TCN model based on PCRD noticeably improves forecasting accuracy and shows good generalization ability in practical day-ahead forecasting. Therefore, the PCRD approach provides a new and effective way to fuse deep learning algorithms for day-ahead load prediction. In future work, the proposed method can be applied to day-ahead electricity price forecasting, day-ahead electricity consumption forecasting, and other related fields. In addition, PCRD can also be combined with other machine learning algorithms, which may further improve forecasting accuracy. Here, external factors are simply concatenated with the outputs of the three modules into a new matrix before feature fusion; future work can focus on designing dedicated external-factor inputs for the three prediction modules.