Time series prediction model based on autoregression weight network

We propose an autoregressive weighted network (ARWNet) time series forecasting model inspired by the idea of ensemble learning. The model adopts classical autoregressive analysis to optimize weak learners, and a combined weight optimization method is used to construct an efficient strong learner. On these methodological foundations, the scalability of the framework is greatly enhanced, since other learners can be plugged in to assist decision-making. Machine learning can provide great utility at acceptable cost in forecasting for electricity transformers, and many research papers on time series prediction have been reported over the years. This work focuses on analysis using the potential properties of such series: long-term dependence, continuity, periodicity, and delay. In our experiments, the ETT-small dataset is used to compare the prediction accuracy of ARWNet with other mainstream models. All results suggest that the proposed ARWNet model demonstrates strong generalization ability and high prediction accuracy on series with delay characteristics, outperforming currently popular time-series prediction methods.

Its sensitivity to data allows it to react quickly and adapt to large fluctuations. However, its limitation is that the model is strongly affected by gradual trends, which makes its long-range forecasts poor or even ineffective. The classical time series model ARIMA, proposed by Box et al.,2 decomposes the trend term of a time series and analyses it from a mathematical point of view. The model has a simple structure and strong interpretability and is still widely used in many fields. Its disadvantage is that it requires the series to be stationary, it can in essence capture only linear relationships, and it is not sensitive to nonlinear relationships. Network models based on deep learning mainly include LSTM, Prophet, TCN, DeepAR, Informer, and so on. Sean J. Taylor et al.3 put forward the Prophet model with trend, season, and holiday terms as its main components. The method does not require an obviously regular series, and fitting is fast. It is worth mentioning that the method has highly interpretable parameters and does not require a particularly deep theoretical foundation. The limitation of the Prophet model is that its results are somewhat simplistic and it cannot produce rolling forecasts, which limits forecasting accuracy to some extent. DeepAR, proposed by David Salinas et al.,4 generates accurate probabilistic predictions by training an autoregressive recurrent network on a large number of correlated time series. The model requires the recursive generation of predicted values for future periods by repeated sampling: because DeepAR outputs a probability distribution, sampling yields only one realization, and repeated sampling is required if accurate expected values are desired. The most related works5-7 all attempt to apply a Transformer to time-series data and fail in long sequence time-series forecasting (LSTF) because they use the vanilla Transformer.
Targeting the characteristics of long time series, Zhou et al.8 proposed the Informer model based on a ProbSparse self-attention mechanism and a distillation operation, improving the performance of the encoder and decoder. This has greatly advanced research in the field of long sequences.9,10 The TS-RP method proposed by Zhang et al.11 outperforms most MF-based rating prediction methods in terms of recommendation accuracy and computational complexity. Zhang et al.12 propose a method called DeepRisk that fuses firm demographic data and financing behavior data to predict the credit risk of SMEs in supply chain finance (SCF): a multimodal learning strategy fuses the two data sources, and the concatenated vectors obtained by data fusion are fed into a feedforward neural network to predict credit risk.
The electricity distribution problem is the task of managing the distribution of electricity to different customer areas according to sequentially changing demand. This is a major decision problem that affects people's livelihoods. If future demand changes in a specific region can be accurately predicted and the electricity supply adjusted in time, a certain amount of power and equipment wastage can be avoided. Therefore, electrical power demand prediction is a basic but important problem.
Sequence prediction is the problem of using historical data to predict the next value or values in a sequence. Electrical power demand prediction is a typical sequence prediction problem.
However, it is very difficult to predict the future demand of a specific supply area, because the demand for electrical power is affected by factors such as weekdays, holidays, seasons, weather, and temperature, resulting in the potential properties of the sequences: long-term dependence, continuity, periodicity, and delay, as shown in Figure 1 (see Table 1 for the meaning of each legend). Figure 1 suggests that electrical supply temporal sequences have obvious autoregressive characteristics.13 These characteristics dictate that we cannot use conventional cross-validation to prove the validity and robustness of the proposed model, because randomly shuffling time series data causes temporal discontinuities and loses some of the latency we want the model to capture. Inspired by the classic autoregression model, we turn to autoregressive analysis to capture such temporal sequence features.
The major challenge in electrical power demand forecasting is to enhance prediction capacity to meet increasingly long sequences with periodic, nonlinear, and delay effects, which requires extraordinary long-range delay alignment ability.
Current forecasting methods cannot satisfy high-precision long-term forecasting of electrical power demand with delay effects. To avoid under-supply, decision makers usually raise the threshold for power supply, resulting in power waste and unbalanced power distribution.8,13 Given the characteristics of power transmission data, an efficient prediction model is the key to breaking this bottleneck.
In this paper, inspired by ensemble learning, we propose a time series prediction model with delay characteristics, the autoregression weight network (ARWNet), which is based on the autoregressive process and ensemble learning, as shown in Figure 2. In this model, three weak learners (LSTM, TCN, and XGBoost) model and predict the same input in parallel, since neural network models are recognized as the most effective methods for long-sequence, periodic, nonlinear time series prediction.
Meanwhile, inspired by the principle and framework of ensemble learning, and to overcome the larger errors of any single weak learner, prediction accuracy is enhanced through ensemble learning. The innovation of this article lies in integrating three currently outstanding time series machine learning models under the ensemble learning framework while considering the delay characteristics of electrical supply sequences. Our proposed ARWNet combines the advantage of neural networks, which can capture long time-series features, with autoregressive methods, which can capture series delay patterns. We set the initial weight values of the strong learner according to the weight-balancing algorithm; from these initial weights, the final optimal weight values are obtained using the grid search method. The experimental results show that, compared with recent state-of-the-art models such as DeepAR, Prophet, Informer, TCN,14 and XGBoost, the proposed model effectively improves the prediction accuracy of time series with delay, and that it is suitable for scenarios with many influencing factors and a high correlation between the series and its historical data.

TUNING OF HYPERPARAMETERS

The subject of this experiment is univariate prediction. Considering the characteristics of univariate fluctuations over a large time range, the sub-learners are chosen around neural network structures.
To produce good and accurate results, learning models require many parameters that must be defined before training.15,16 Different hyperparameters should be assigned according to different network structures and datasets. Setting hyperparameters manually is time-consuming and inefficient. Grid search is a hyperparameter tuning technique that can automatically find the best model configuration.17 Most of the hyperparameters required in this experiment are obtained using the grid search method.
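To make the procedure concrete, here is a minimal sketch of exhaustive grid search; the parameter names, ranges, and scoring function below are illustrative placeholders, not the paper's actual configuration:

```python
from itertools import product

def grid_search(train_fn, score_fn, grid):
    """Exhaustively evaluate every hyperparameter combination.

    grid: dict mapping parameter name -> list of candidate values.
    Returns (best_params, best_score); lower scores are better.
    """
    best_params, best_score = None, float("inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = score_fn(train_fn(params))   # train a model, then score it
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy example: "training" just returns the params, and the score is
# minimized at learning_rate=0.001, hidden_size=32 (hypothetical optimum).
grid = {"learning_rate": [0.0001, 0.001, 0.01], "hidden_size": [16, 32, 64]}
best, score = grid_search(
    train_fn=lambda p: p,
    score_fn=lambda p: abs(p["learning_rate"] - 0.001) + abs(p["hidden_size"] - 32),
    grid=grid,
)
```

In practice `train_fn` would fit a weak learner (LSTM, TCN, or XGBoost) and `score_fn` would return its validation MAE.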

Window size selection
According to the general experimental setup, we specify a window length of 24. If the window is too large, some algorithmic models will fail; conversely, if it is too small, the model will not achieve the expected performance. This experiment targets time series with delay-term effects, and prediction performance can be maximized by analyzing and processing the delay terms.
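As an illustration of how a window length of 24 turns a series into supervised (input, target) training pairs (a generic sketch, not the paper's exact pipeline):

```python
def make_windows(series, window=24):
    """Slice a 1-D series into (input window, next value) training pairs."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # the 24 past observations
        y.append(series[i + window])     # the value to predict
    return X, y

series = list(range(100))                # toy hourly series
X, y = make_windows(series, window=24)
```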

ARWNET
In this section, we will briefly introduce some methods and theoretical background related to the proposed ARWNet framework.

Long short-term memory
The long short-term memory network, commonly known as LSTM, is a special kind of recurrent neural network (RNN) that learns long-term dependencies. LSTM can effectively avoid vanishing and exploding gradients. The network has the same chain of repeating modules as an RNN, but the repeating module of LSTM has a different structure.
Each memory cell's internal architecture guarantees constant error flow within its constant error carrousel (CEC), provided that truncated backpropagation cuts off error flow trying to leak out of memory cells. This is the basis for bridging very long time lags.18 Two gate units learn to open and close access to error flow within each memory cell's CEC. The multiplicative input gate protects the CEC from perturbation by irrelevant inputs; likewise, the multiplicative output gate protects other units from perturbation by currently irrelevant memory contents.19 LSTM is an important representative of recurrent neural networks. Because time series data are always related to their history, we can capture the hidden information in a time series through LSTM, which is an important reason why LSTM plays a dominant role in time series prediction. In addition, LSTM can process long time series and has long-term memory ability.

Temporal convolutional network
Recursive (recurrent) neural network structures dominated sequence research for a long time because of their superior performance, and convolutional structures were relatively weak at processing time series until the introduction of architectural elements such as dilated convolution and residual connections. The results of Bai et al.14 show that with these elements, a simple convolutional architecture is more effective than a recursive architecture such as LSTM across different sequence modeling tasks. Unlike the recursive architecture of the LSTM, the temporal convolutional network (TCN) uses a convolutional architecture to capture features. The convolutions in the architecture are causal, that is, no information "leaks" from the future into the past. TCN also emphasizes the combination of very deep networks (augmented by residual layers) and dilated convolution, which builds a very long effective history (i.e., the ability of the network to look into the very distant past to make predictions).
The TCN is a convolutional neural network structure, different from the recurrent structure, and it analyses the time series from another perspective. We use LSTM and TCN because they are neural networks with different structures: their combination lets them learn from each other and take full advantage of their respective strengths.

eXtreme gradient boosting
XGBoost is a scalable machine learning system for tree boosting.20,21 It is often used in machine learning and data mining problems and runs more than ten times faster than existing popular solutions on a single machine. Ensemble methods of this kind are a de facto choice in applied machine learning and have been used in challenges such as the Netflix prize. Here, we incorporate XGBoost into the proposed ARWNet framework because the XGBoost algorithm offers an excellent combination of predictive performance and processing time compared with other algorithms.

Auto regression analysis
The autoregressive (AR) model is a statistical method for time series. It uses X_1, …, X_{t-1} to predict X_t, assuming a linear relationship among them. It developed from linear regression in regression analysis: instead of using X to predict Y, it uses X to predict X, which is why it is called autoregression. The method measures the autocorrelation coefficient as the degree of correlation between two different periods of the same event, that is, a measure of the effect of past behavior on the present.
Equation (1) is the general expression of AR. The parameter p is the order of the autoregressive term in the time series data, and φ is the autocorrelation coefficient. The order p is the value of the lag at which the PACF curve crosses the upper limit of the confidence interval for the first time. These p lags are characteristic of the AR time series.
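For reference, the general AR(p) expression referred to as Equation (1) takes the standard textbook form (the intercept c and noise term ε_t follow the usual convention and are not given explicitly in this excerpt):

```latex
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t \qquad (1)
```

where φ_1, …, φ_p are the autoregressive coefficients and ε_t is white noise.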
In summary, the autocorrelation function (ACF) describes the autocorrelation between two observations, including both direct and indirect correlation information. Unlike the ACF, the partial autocorrelation function (PACF) does not look for the correlation between the lag term and the current term but rather for the correlation between the residual and the next lag value. The PACF highlights the direct relationship between an observation (y_t) and its lag term (y_{t-k}) and removes the effect of the intervening short lag terms (y_{t-1}, y_{t-2}, …, y_{t-k+1}). The reason for using the PACF in the AR model is that it eliminates these redundant effects and explicitly obtains the lags associated with the residuals. Correlation and lags are prominent features of time series and the main source of error reduction when autoregressive ideas are used for forecasting; proper use of series features can further improve experimental results. Equation (2) shows the calculation of the partial autocorrelation function.
We apply the AR model to determine the order of the series, find the parameter p, and eliminate the interference of the lag term in the series with a delayer. In short, the prediction sequence obtained by model learning is advanced by p time units. In this process, the order p is very important, as it reflects the correlation between the current value and the first p lagged terms.
As discussed in the introduction, the major challenge in electrical power demand forecasting is to enhance prediction capacity to meet increasingly long sequences with delay effects, which requires extraordinary long-range delay alignment ability. By applying autoregression analysis, we can greatly reduce the prediction error of the model: the AR model meets this challenge because it enhances the prediction capacity to handle increasing long-sequence fluctuation and delay effects.
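A sketch of the order-selection and delayer steps described above, assuming the standard Durbin-Levinson recursion for the PACF and the common 95% confidence bound of 1.96/√N; the paper's exact implementation is not shown, so this is illustrative only:

```python
import numpy as np

def pacf(x, nlags):
    """Partial autocorrelation via the Durbin-Levinson recursion."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # Sample autocorrelation function (biased estimator).
    acf = np.array([np.dot(x[:n - k], x[k:]) / np.dot(x, x)
                    for k in range(nlags + 1)])
    phi = np.zeros((nlags + 1, nlags + 1))
    pac = np.zeros(nlags + 1)
    pac[0] = 1.0
    for k in range(1, nlags + 1):
        if k == 1:
            phi[1, 1] = acf[1]
        else:
            num = acf[k] - np.dot(phi[k - 1, 1:k], acf[1:k][::-1])
            den = 1.0 - np.dot(phi[k - 1, 1:k], acf[1:k])
            phi[k, k] = num / den
            for j in range(1, k):
                phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
        pac[k] = phi[k, k]
    return pac

def select_order(x, max_lag=20):
    """First lag whose PACF crosses the 95% confidence bound (the text's rule)."""
    bound = 1.96 / np.sqrt(len(x))
    pac = pacf(x, max_lag)
    for k in range(1, max_lag + 1):
        if abs(pac[k]) > bound:
            return k
    return 0

def delay(pred, p):
    """The 'delayer': shift the predicted sequence forward by p time units."""
    return pred[p:]

# Toy AR(1) series with coefficient 0.8 to exercise the pipeline.
rng = np.random.default_rng(0)
e = rng.normal(size=5000)
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = 0.8 * x[t - 1] + e[t]
p = select_order(x)
```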

Weight balance in ensemble learning
In many applications that call for automated decision-making, it is normal to receive data obtained from different sources that may provide complementary information. A suitable combination of such information is known as data or information fusion, and it can lead to better classification accuracy than a decision based on any of the individual data sources alone. Ensemble learning22 often achieves significantly better prediction performance than a single learner by combining multiple learners: the strengths of different learners complement each other to yield a learner with more comprehensive capabilities.
Based on this idea, we combine learners with different structures: LSTM with a recursive structure, TCN with a convolutional structure, and XGBoost, a boosted-tree algorithm, thereby going beyond a narrow ensemble learning approach.
The idea of the boosting algorithm is to feed the original training set to a weak learner and change the sample distribution based on the performance of that learner,23 so that poorly handled samples receive more attention later, improving the performance of the integrated learner. This process is repeated until all weak learners have learned, and finally all weak learners are weighted and combined.24 Based on the core idea of ensemble learning, we adopt the weighted average method as the combination rule of ARWNet and assign the weights of the weak learners according to the mean absolute error (MAE) of their evaluation metrics. The initial weights of each learner are calculated as in Equation (3),
where σ_MAE_i is the mean absolute error value of weak learner i. The initial weight of each learner is calculated according to Equation (3), and the individual optimal weights are searched by the grid search algorithm. After all the weak learners' weights are determined, the predictive value of the final strong learner is calculated according to Equation (4).
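Since Equations (3) and (4) are not reproduced in this excerpt, the sketch below assumes the common inverse-error weighting scheme consistent with the description: each weak learner's initial weight is proportional to 1/MAE_i, normalized to sum to one, and the strong learner's prediction is the weighted average of the weak predictions. The MAE values and predictions are hypothetical:

```python
def initial_weights(maes):
    """Initial weights inversely proportional to each learner's MAE
    (assumed form of Equation 3)."""
    inv = [1.0 / m for m in maes]
    s = sum(inv)
    return [v / s for v in inv]

def ensemble_predict(predictions, weights):
    """Weighted average of the weak learners' predictions
    (assumed form of Equation 4)."""
    return [sum(w * p[t] for w, p in zip(weights, predictions))
            for t in range(len(predictions[0]))]

maes = [0.20, 0.25, 0.50]     # hypothetical MAEs for LSTM, TCN, XGBoost
w = initial_weights(maes)     # the most accurate learner gets the largest weight
preds = [[10.0, 11.0], [10.4, 11.2], [9.8, 10.6]]
combined = ensemble_predict(preds, w)
```

In ARWNet these initial weights would then be refined by grid search before the final combination.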
By weight balancing, we obtain the final prediction that combines the advantages of various weak learners, thus minimizing the prediction bias.
The three learners above (LSTM, TCN, XGBoost) have different characteristics. By assembling the advantages of convolutional networks, recurrent networks, and boosted trees, we can form a more powerful time-series prediction framework, namely ARWNet.

Evaluation criteria
In the experiment, the mean square error (MSE) and mean absolute error (MAE) were selected as the error evaluation indexes of the model's prediction performance,
where N is the sample size, and y_i and y'_i are the predicted value and the true value of sample i, respectively.
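The two metrics in their standard form, consistent with the definitions above:

```python
def mse(y_true, y_pred):
    """Mean square error: average of squared residuals."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
```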

Dataset
We use ETT-small8 as the experimental dataset, which provides a baseline for the experimental results. We use it to predict the oil temperature of power transformers and to study their ultimate load capacity. Here, we choose the oil temperature data as the main prediction variable, because oil temperature reflects the condition of a power transformer. One of the most effective strategies is to predict how safe the oil temperature of power transformers can be, so as to avoid unnecessary waste while meeting the power demand of the supply areas. The data include load and oil temperature from two sites in two different regions of the same province in China. ETT-small is a popular test set that has been used to compare many high-performance prediction models in the field of time series forecasting. Specifically, the dataset contains short-period patterns, long-period patterns, long-term trends, and a large number of irregular patterns. Table 1 gives a sample of the data and the detailed meaning of each column. To show the presence of long-term and short-term repeating patterns in the data, we plot in Figure 1 the autocorrelation of all variables in the ETT-small-h1 dataset; the top blue curve is the target variable "oil temperature", which maintains some short-term local continuity, while the other curves correspond to the various load variables. ETTh1 and ETTh2 are recorded at hourly granularity and ETTm1 at 15-min granularity. The time span is from 00:00, July 1, 2016, to 19:00, June 26, 2018.

Hyper-parameter optimization
We use the grid search method to optimize the hyperparameters of each weak learner of the integrated model; the weight assignment is also based on grid search. We determine the approximate range of the main parameters of the model using cross-validation. For LSTM, the learning rate is chosen from (0.0001 ∼ 0.01) and the hidden state size from (16 ∼ 64). For TCN, the kernel size is chosen from (2 ∼ 10). For XGBoost, the max depth is chosen from (3 ∼ 9) and the min child weight from (3 ∼ 9). Only the optimization ranges of some hyperparameters are introduced here.

Ideas before the experiment
In our experiments, we split the data into training and test sets in a 7:3 ratio. We experimented with both methods described below and list the relevant model evaluation metrics in the tables, using the mean square error (MSE) and mean absolute error (MAE) as evaluation metrics.
To extract the sequence with delay characteristics, we perform an autoregressive analysis on the prediction sequence; the lag order of the series is determined by autoregressive analysis as formulated in Equations (1) and (2). In our proposed ARWNet, autoregressive analysis serves as a delayer to eliminate the delayed-fluctuation effect of the power demand sequence. The working principle of ARWNet is that after all the weak learners are processed by the delayer, we construct an integrated strong learner. Before the model is assembled, however, we have two options: Method 1 delays processing before integration, and Method 2 integrates first and then delays. To test which option has the lower error, we perform the following comparison experiments; the results are listed in Tables 2 and 3. As the tables show, Method 1 outperforms Method 2: for the ETTh1 dataset, the prediction error (MAE) is reduced from 0.399 to 0.153.
TABLE 2 Integration followed by autoregressive analysis and processing (Method 2).

TABLE 3 Parallel autoregressive analysis and processing before integration (Method 1).

FIGURE 3 Prediction result curve of ETTh1 slice data.
Considering the operation order of autoregressive processing and ensemble learning, we performed many experiments, and the results show that Method 2 is not effective and may even fail. The comparison curves of the last 100 data points are shown in Figure 3.
Since the oil temperature time series demonstrates significant delay characteristics, in terms of the idea of ensemble learning, "delay processing before integration" can extract more delay information. In other words, each individual learner's prediction error should be minimized before assembly in order to improve the prediction accuracy of the ensemble learner. This explains why Method 1 is better than Method 2. We therefore chose Method 1 for the following experiments: after all the weak learners (LSTM, TCN, XGBoost) are processed by autoregressive analysis, we construct an integrated strong learner by the weighted average method.

Experiment details
In the following experimental section, following Method 1 (i.e., delay processing before integration), we performed an autoregressive analysis of the predictions of the different weak learners and processed them with a delayer to remove the delayed relationship in the prediction process. Based on the idea of ensemble learning, we then balance the weights of the processed sub-sequences, which further improves the prediction accuracy of the model. Through pre-experiments, we determined a 7:3 training/test split, on which the learners obtain better prediction results, and all predictions strictly followed this ratio. Considering the specificity of time series data, our prediction strategy is adjusted accordingly: in the iterative process of validating model performance, we abandon conventional cross-validation and adopt the rolling-origin evaluation method based on the Monte Carlo idea, combining the core idea of ensemble learning with the theoretically well-supported ability of neural networks to capture long-term time-series features.
Based on the idea of the test-bench strategy,25 we need to perform iterative simulations on the dataset to verify the stability of the model. On the other hand, considering that the time series dataset has a certain continuity and periodicity and, most importantly, delayed characteristics in time, conventional cross-validation is not applicable here: shuffling the dataset would make it impossible to capture the delay in the data.
We therefore apply an alternative iteration idea based on the Monte Carlo strategy; a common approach is to perform multiple training/test splits and then average the errors over these splits. We use the forward-chaining technique, also known in the literature as rolling-origin evaluation (Tashman,26 2000) and rolling-origin-recalibration evaluation (Bergmeir & Benítez,27 2012).
After using the grid search method to optimize all the hyperparameters of the weak learners (LSTM, TCN, XGBoost), and in order to extract the delay term, autoregressive analysis is applied to the weak learners; we then construct an integrated strong learner by a weighted average method based on Equations (3) and (4).
Specifically, in this experiment, we divided the dataset into 10 equal segments. The first iteration uses one-tenth of the data for training and testing; the second iteration uses two-tenths and repeats the process. Thereafter, each iteration adds one more segment and outputs its results until the entire dataset is covered. Each iteration is divided into training and test sets strictly in the ratio 7:3. For example, the first iteration takes the first time segment (2016/7/1 0:00:00 to 2016/9/11 13:45:00, 6968 records in total) as the data source to train the model and outputs the result of one experiment; the next iteration covers 2016/7/1 0:00 to 2016/11/23 3:45, 13,936 records in total. The experimental results are shown in Table 4.
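The iteration scheme just described can be sketched as follows; index-based splits on a toy series stand in for the actual timestamped data:

```python
def monte_carlo_splits(n, segments=10, train_frac=0.7):
    """Expanding evaluation: iteration i uses the first i of `segments`
    equal slices of the data, split 70/30 into train/test indices
    (a sketch of the paper's Monte Carlo rolling-origin scheme)."""
    seg = n // segments
    splits = []
    for i in range(1, segments + 1):
        end = seg * i                     # iteration i covers the first i segments
        cut = int(end * train_frac)       # strict 7:3 split within the covered span
        splits.append((list(range(0, cut)), list(range(cut, end))))
    return splits

splits = monte_carlo_splits(n=100, segments=10)
```

Note that, unlike shuffled cross-validation, every test block here follows its training block in time, preserving the delay structure.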
Following the ARWNet model framework of Figure 2, Figures 4-7 visualize the specific steps. Here we briefly describe the experimental procedure of the ARWNet model using plots of the ETTm1 dataset as an example. Finally, the effects of different algorithmic models on the same evaluation metrics of the ETT dataset are compared. We compare our proposed model with up-to-date prediction algorithms: the proposed method outperforms Informer, LogTrans, and DeepAR, reducing MSE by 46.6%, 75.4%, and 82.4% on ETTm1, respectively.
As shown in Figures 4 and 5, the raw data are fed into the parallel weak learners to obtain three different prediction sequences. The aim is to take advantage of the different learners to improve the prediction accuracy on the raw data.
Based on the statistical test of the partial autocorrelation function, we find that the delay terms of the three learners are different. Figure 6 shows the results of the autoregressive analysis of the LSTM prediction sequence. With this analysis, the effect of the lag term in the sequence can be eliminated to the maximum extent by the delayer operation, which is a central part of reducing the model error. Specifically, the delayers shift the predicted sequence forward by p time units according to the order p of the delay term given by the partial autocorrelation function. The delayers have proven effective in reducing the error of the model's predictions.
After eliminating the effect of most of the lagging terms, the next step is to construct a strong ensemble learner. As discussed above, the idea of ensemble learning is used to assign an initial weight value to each weak learner, and the grid search method is then used to find the final best weight values. Ensemble learning integrates the better predictive values of the learners, eliminating the influence of individual model bias.
The evaluation metrics of the model during the experiments are shown in Table 5, and the experimental results for the different prediction methods in Table 6.
To further test the predictive performance of our proposed ARWNet, we applied it to multivariate prediction scenarios, enlarging the ETT dataset with six other dependent variables (HUFL, HULL, MUFL, MULL, LUFL, LULL). The prediction results are listed in Table 7, with the best results marked in bold. The results suggest that the prediction performance of the proposed ARWNet surpasses state-of-the-art prediction methods such as Informer, LogTrans, Reformer, and LSTMa.

Analysis of experimental results
Tables 6 and 7 show that ARWNet outperforms up-to-date prediction methods; the best results are marked in bold.
Tables 5-7 suggest that the prediction accuracy of each weak learner alone is not better than that of current methods: as a stand-alone module, a sub-learner's prediction performance is not outstanding in all aspects and cannot meet the demand for long-term high-precision prediction. When these methods are integrated into the ARWNet framework, the final prediction performance is substantially improved and stable on data with different time granularities, for example, h1 and m1.

FIGURE 7 The predictions in the ETTm1 dataset.

TABLE 5 Model effect optimization.

We can observe that: (1) the proposed ARWNet model significantly reduces the prediction error compared with the other methods, as shown in Tables 6 and 7, and the Monte Carlo test verifies that our proposed model is robust; the results in Tables 5-7 illustrate the success of ARWNet on the long-term high-precision time-series electricity prediction problem. (2) The ARWNet model outperforms up-to-date mainstream methods in prediction at both hourly and minute levels, thanks to the good combination of the autoregression model and machine learning algorithms: specifically, the neural networks capture long time-series features, and the autoregressive methods capture series delay patterns.

TABLE 6

The strong learner is better than the existing methods: the ARWNet model can greatly improve prediction accuracy. This is because, in the idea of ensemble learning, individual learners should be "good but different" to achieve a good integration effect; that is, each learner should have a certain accuracy, while there should also be diversity, or differences, among the learners.

CONCLUSIONS
In this work, we study the time series forecasting problem with delay and propose the ARWNet model. To improve prediction accuracy, we apply the idea of ensemble learning to combine diverse weak learners; specifically, the autoregression method is used to eliminate the effect of delay. In short, our proposed ARWNet model is a modular, replaceable prediction framework in which the choice of sub-learners can vary depending on the dataset and other factors. Given the specificity of the data, we use the rolling-origin evaluation method to evaluate model performance, which preserves the continuity, periodicity, and delay of the data. The grid search method is used to obtain the optimal weight of each weak learner. The effectiveness of our ARWNet approach has been verified on the real-world ETT datasets, and the prediction results show that our model stands out among many up-to-date methods. The good performance of the ARWNet model can help managers make the right decisions on power distribution issues.

FIGURE 4 Oil temperature fluctuation (in months).
FIGURE 5 Prediction results for the last 1000 slices of the test set with different learners.

FIGURE 6 Partial autocorrelation function of the LSTM prediction sequence.

FIGURE 1 Autoregressive presentation of electrical supply temporal sequence features.

TABLE 1 ETT data samples and variables interpretation.

Field | Description
Date  | The recorded date
HUFL  | High use full load
HULL  | High use less load
MUFL  | Middle use full load
MULL  | Middle use less load
LUFL  | Low use full load
LULL  | Low use less load
OT    | Oil temperature (target)
FIGURE 2 Flow chart of the ARWNet.

TABLE 4 Iterative results based on Monte Carlo strategy.
TABLES 6 AND 7 Comparison of model evaluation indexes of different prediction methods.