A deep network with analogous self-attention for short-term traffic flow prediction

Funding information National Natural Science Foundation of China, Grant/Award Number: 61973265 Abstract Short-term traffic flow prediction plays a crucial role in the research and application of intelligent transportation systems. Neural network algorithms can use big data for training and have more advantages over other prediction models in traffic feature extraction. However, it remains a challenge to extract the spatiotemporal features of traffic flow in a simple and sufficient way so as to improve prediction accuracy. In this paper, a double-branch deep residual gated convolutional neural network (RGCNN) is proposed to extract features from both time and space based on three-dimensional traffic data, and the scaled exponential linear unit (Selu) is used as the activation function to enhance the convergence of network training. In order to increase the ability of the network to fit the traffic data, an analogous self-attention (ASA) module is designed, which retains the advantages of attention while hardly increasing training costs. Simulation experiments are carried out on real traffic data sets. The results of traffic flow prediction tasks over different prediction horizons show that the prediction performance of the proposed model (ASA-RGCNN) is superior to that of other common prediction models and that the proposed model can be applied to prediction tasks under different traffic conditions. By visualising ASA weights at different traffic flow levels, the impact of space-time traffic data on the prediction task can also be identified.


INTRODUCTION
With the popularity of private cars and the increase in travel rates, the consequent traffic congestion problems in various countries have become more and more serious. In recent years, the focus of alleviating traffic congestion has shifted from the increase and expansion of infrastructure to sustainable intelligent traffic management, such as the application of intelligent transportation systems (ITS). The continuous development of ITS and the continuous improvement of traffic information acquisition technology not only make the acquisition means and traffic information sources more abundant, but also improve the quality and accuracy of data acquisition [1]. As a key part of ITS, short-term traffic forecasting can help road users to make better travel decisions, improve traffic operation efficiency, reduce carbon emissions and ease traffic congestion [2]. The task of short-term traffic prediction is to predict traffic parameters of a certain road section in the next time interval (usually 5-30 min). However, the actual traffic system is a complex nonlinear system with many random interference factors, and traffic data presents various complex and uncertain features, which brings great challenges to fast and accurate traffic prediction in the short term. For decades, a large number of scholars have studied this task from the perspective of prediction models. With the improvement of traffic information collection and storage methods, the quality and quantity of acquired data have greatly improved, which also makes data-driven machine learning methods more accurate in traffic flow prediction tasks than traditional statistical models. Recently, models based on neural networks [3][4][5][6][7][8] have become the main prediction models in the traffic prediction task due to their excellent nonlinear function approximation ability.
In particular, neural network models that can simultaneously extract and utilise the spatiotemporal correlation features of traffic data [9][10][11] have better adaptability and prediction accuracy. Nevertheless, there is still room for improvement in the accuracy of the prediction model, and the computation time of the forecast operation should also be a measure of a good forecast model. Therefore, this paper designs a novel dual-branch deep space-time network model to predict short-term traffic flow. The main contributions include three aspects. Firstly, two one-dimensional residual gated convolution networks (RGCNN) are used to extract the spatio-temporal correlation features hidden in the traffic data from two branches. The advantage of this design is to ensure that the spatial correlation features of different time points and the temporal correlation features of different spatial points extracted from any sample have different effects on the predicted value, which has a positive effect on the improvement of prediction accuracy. Secondly, an analogous self-attention module is designed to weight the spatio-temporal features extracted by the RGCNN. Compared with the attention modules in most papers, not only does it have the same effect of enhancing useful features and suppressing useless features to make the prediction results more accurate, but its calculation time is also significantly reduced. Thirdly, the Selu function is used in the designed model instead of the batch normalisation (BN) operation used in most traffic prediction deep learning models, which not only improves the convergence during training, but also reduces the calculation time of the prediction model.
The organisation of the rest of the paper is as follows. Section 2 summarises the literature on traffic prediction and deep learning. Section 3 presents the methodology of this paper, mainly involving convolution operations, data structure, model structure and the training optimisation method. Section 4 gives the simulation experimental design and result analysis. Finally, Section 5 concludes the paper.

LITERATURE REVIEW
On account of the significance of the traffic prediction task, it has been studied since the 1980s [12]. Initially, statistical methods were used to make predictions, such as the Markov chain prediction model [13,14], the Kalman filtering prediction model [15,16], the auto-regressive integrated moving average (ARIMA) prediction model [17][18][19], the partial least squares prediction model [20,21] and so on. However, traffic data shows complex nonlinear characteristics and is neither stable nor orderly, so these statistical methods based on traditional mathematics tend to produce large errors [22]. In recent years, machine learning algorithms have been widely used in traffic prediction due to their good nonlinear fitting ability. For instance, [23] proposed an online support vector regression algorithm to predict short-term traffic flow under both typical and atypical conditions, and showed the best performance in dealing with atypical events. [24] proposed a K-nearest neighbour model to predict short-term traffic flow, which was analysed with two nonparametric regression variants, average and weighted. [25] proposed an improved Bayesian combination method to predict traffic flow, which is more sensitive to the disturbance performance of component predictors and can adjust its integration more quickly, thus producing better predictions.
With the development of advanced traffic-aware infrastructure, it has become more convenient to collect large amounts of historical traffic data. However, these machine learning algorithms have limited feature extraction ability in the context of big data, which leads to a decline in model generalisation ability. Therefore, it is particularly important to design a prediction model with stronger feature extraction ability and higher accuracy in traffic flow prediction research. As a branch of machine learning, deep learning methods based on neural networks (NN) have strong nonlinear processing ability and the ability to fit the uncertain characteristics of traffic big data. As a result, studies based on this kind of approach are beginning to emerge. [3] used different artificial neural network (ANN) models to predict the hourly traffic flow on the second tolled Bosphorus bridge in Turkey, and proved the prediction quite accurate. [4] proposed a novel NN training method that employs the hybrid exponential smoothing method and the Levenberg-Marquardt algorithm, which aims to improve the generalisation capabilities of previously used methods for training NNs for short-term traffic flow forecasting. [5] used a deep belief network to extract the effective features of traffic flow without supervision. [6] applied a model mixing stacked auto-encoders and a deep neural network (DNN) to bus rapid transit (BRT) passenger flow prediction, and verified that the proposed method can provide a more accurate and universal passenger flow prediction model for different BRT stations with different passenger flow profiles. [7] proposed a deep polynomial neural network combined with a seasonal ARIMA model to predict traffic flow, which has high prediction accuracy and high clarity on the spatiotemporal relations in its deep structure.
[8] proposed a novel data-driven vehicle speed prediction method based on back propagation-long short-term memory (BP-LSTM) algorithms for long-term individual vehicle speed prediction along the planned route and the model has a good prediction accuracy.
On the other hand, in order to further improve the model prediction accuracy, some scholars have applied spatiotemporal characteristics to the traffic flow prediction. Convolutional neural network (CNN) is a commonly used image processing algorithm in computer vision. Since CNN can capture the local dependence of traffic data and is less sensitive to noise, it is applied to traffic prediction. For instance, [26] transformed the spatiotemporal traffic dynamics into a two-dimensional spatiotemporal matrix to describe the spatiotemporal relationship of traffic flow, and then applied CNN for prediction. It proved that CNN can train the model in a reasonable time, which is suitable for large traffic networks. [27] transformed the geographical data into images, and used a residual CNN learning framework to the taxi pickup prediction in New York, and verified that its accuracy was higher than other methods. [28] presented a method for data grouping of urban short-term traffic flow prediction based on CNN, which considered the spatial relationship between traffic locations and used the information to predict its validity in real sets. [29] used the convolution-based residual network (DeepST) to model the spatial dependence near and far between any two regions in the city, while ensuring that the deep structure of the neural network does not affect the prediction accuracy of the model. Moreover, the superiority of the model was evaluated through Beijing taxi trajectory data and meteorological data as well as New York City bike.
However, CNN can only extract the spatial features of traffic data and performs poorly at temporal feature extraction. As a result, some scholars started to combine CNN with recurrent neural networks (RNN), such as LSTM, which can extract the temporal characteristics of traffic data, to build a prediction model. [9] used a one-dimensional CNN to obtain the spatial features of traffic flow and an LSTM to mine the short-term variability and periodicity of traffic flow, and then carried out spatiotemporal feature fusion to realise short-term traffic flow prediction. [10] proposed a grid representation method to preserve the fine-scale structure of the traffic network. Traffic velocity of the whole network is converted into a series of static images to feed into a spatiotemporal recursive convolution network for traffic prediction, which inherits the advantages of the deep convolutional neural network (DCNN) and LSTM. [11] proposed a new hybrid convolutional LSTM neural network model (ConvLSTM) based on critical road sections, and took the traffic speed of the critical road sections as input to the ConvLSTM to predict the future traffic state of the entire network. Although these models can achieve high accuracy, RNN is too complex compared with the CNN structure and cannot be parallelised. When the number of prediction model layers increases or the data samples are too large, it requires too much computing hardware and cost. In order to reduce the computational cost of the prediction model while still effectively extracting the spatiotemporal features of the input, [30] and [31] introduced a gated convolutional network (GCN) that obtains higher-level and more abstract spatial and temporal features of the input through stacks of gated linear units, and succeeded in natural language processing research.
The attention module is an important part of deep learning theory which can greatly improve the fitting ability of a prediction model. In [32], it was first proposed to use an attention model to realise natural language processing tasks, which can enhance the output of effective features and suppress the output of unnecessary features. Then [33] added alignment to the attention mechanism and proposed two models of global attention and local attention. These attention modules are mainly applied on the basis of RNN. In recent years, some scholars have applied the attention mechanism to CNN to improve its fitting ability. The main results include the squeeze-and-excitation network [34] and the convolutional block attention module [35], which both obtain the attention-weight matrix by compressing the dimensions of features. In the study of traffic prediction, the attention module has also become a useful tool to extract the features of traffic data. For instance, [36] proposed a multi-component attention method for traffic flow prediction, which used a CNN to capture local trend characteristics of residuals and a bidirectional LSTM to capture trends and seasonal time adjustments, while through the introduction of the attention mechanism, highly related historical information can be connected for multi-component flow data in the final prediction. [37] proposed a traffic flow prediction model based on a DNN attention module (DNN-BTF), which used a CNN to mine the spatial features of traffic flow and an LSTM to mine the temporal features, and also visually demonstrated how the DNN-BTF model understands traffic flow data. [38] presented an attention module based on CNN and GCN (AGCNN) and used a three-dimensional data matrix constructed from traffic flow, speed and occupancy to predict traffic speed. Although the use of an attention module can increase the prediction accuracy of the model, the number of network layers increases, which also increases the calculation time.
Accordingly, it is a major challenge in traffic prediction research to establish a fast and accurate prediction model. To this end, this paper focuses on using advanced deep learning algorithms to achieve the short-term multi-location traffic flow prediction task, further improving model prediction accuracy while reducing the calculation time of the prediction model. In the prediction model designed in this paper, a deep residual gated convolutional neural network with an analogous self-attention model (ASA-RGCNN) is proposed to predict traffic flow, in which the double-branch deep residual gated convolutional network can reasonably extract and integrate the spatial and temporal features of historical traffic data to improve the accuracy of the prediction model. The ASA-RGCNN model also combines a novel analogous self-attention module to weight the extracted deep temporal and spatial features, which not only achieves a further improvement in prediction accuracy compared with the traditional attention module, but also reduces the calculation time. In addition, throughout the network construction process, the Selu function [39] is used to replace the BN operation [40], which ensures the convergence of the model during training and saves considerable calculation time. After the model is established, it is shown to have more advantages than existing advanced models through training and testing on real traffic data.

METHODOLOGY
In traffic flow theory, there are three main parameters that can reflect the characteristics of traffic information: traffic flow, traffic occupancy and traffic speed. The goal of the prediction model in this paper is to predict the traffic flow of consecutive locations over the prediction horizon. It is noted that traffic data has obvious spatio-temporal dependence, that is, traffic data at a spatial location in a certain period is affected not only by the traffic data of the previous period, but also by the traffic data of upstream and downstream locations. Therefore, a novel spatiotemporal deep learning network is proposed in this paper to predict traffic flow, as shown in Figure 1.
The dual-branch residual gated convolutional neural network (RGCNN) is used to extract deep temporal features and spatial features, and then the analogous self-attention (ASA) module is used to weight hidden features. Finally, the traffic flow is predicted by the regression module by using the fusion features of the two branches.

Multi-channel convolution
Because of its features of local connectivity, weight sharing and parallel computing, CNN is the most widely used model in image analysis and has achieved remarkable results. The traffic data samples in this paper are similar to those in the image processing category, so using CNN to extract features is also applicable. The multi-channel convolution shown in Figure 2 is the core operation of a CNN. Suppose the input and output of the convolution operation are x ∈ R^(H×W×C) and h ∈ R^(M×N×C′), respectively. Since the input has multiple channels, each output channel c′ is assigned a set of convolution kernels with dimension K × K, one per input channel, and the convolution stride is 1 in every dimension; these convolution kernels form w_c′ ∈ R^(K×K×C). Thus the multi-channel convolution operation is described as follows:

h_(m,n,c′) = Σ_(i=1..K) Σ_(j=1..K) Σ_(c=1..C) w_(c′,i,j,c) · x_(m+i−1, n+j−1, c) + b_c′
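To make the operation above concrete, here is a minimal NumPy sketch of multi-channel convolution with stride 1 and no padding; the function name and the explicit loops are our own illustration, not the paper's implementation:

```python
import numpy as np

def multi_channel_conv(x, w, b):
    """Multi-channel convolution, stride 1, 'valid' padding.

    x: input feature maps, shape (H, W, C)
    w: kernels, shape (K, K, C, C_out) -- one K x K x C kernel per output channel
    b: biases, shape (C_out,)
    Returns h with shape (H-K+1, W-K+1, C_out).
    """
    H, W, C = x.shape
    K, _, _, C_out = w.shape
    M, N = H - K + 1, W - K + 1
    h = np.zeros((M, N, C_out))
    for i in range(M):
        for j in range(N):
            patch = x[i:i+K, j:j+K, :]  # K x K x C receptive field
            for c in range(C_out):
                # Sum over kernel height, width and all input channels
                h[i, j, c] = np.sum(patch * w[..., c]) + b[c]
    return h
```

A deep learning library would of course perform this with optimised kernels; the loops only spell out the index arithmetic of the formula.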

Data structure
Assume the traffic flow of p consecutive locations {f_i}, i = 1, …, p, will be predicted in (t, t + 1, …, t + τ), where τ is the prediction horizon. The historical data of the previous n time intervals is used as the input to generate the prediction points for the prediction horizon, where f is the traffic flow, s is the traffic speed, o is the traffic occupancy, and n is the history data time horizon. Historical data of the three parameters mentioned above is used to build a 3D matrix as the input of the prediction model, as shown in Figure 3.
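As a sketch of how one such 3D input sample could be assembled from per-location time series (the function and array names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def build_sample(flow, speed, occ, t, n):
    """Stack the last n time steps of the three traffic parameters into one
    3-D input sample of shape (n, p, 3), ending at time index t (exclusive).

    flow, speed, occ: arrays of shape (T, p) -- time steps x p locations.
    """
    window = slice(t - n, t)
    # Last axis holds the three traffic parameters: flow, speed, occupancy
    return np.stack([flow[window], speed[window], occ[window]], axis=-1)
```

With n = 12 history steps and p = 12 locations this yields the (12, 12, 3) samples used later in the experiments.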
The 0-1 normalisation is performed to remove the mathematical problems caused by the convolution operation of the three parameters of traffic data due to their different orders of magnitude. Take traffic flow f for example:

f̂ = (f − f_min) / (f_max − f_min)

where f̂ is the result of the normalisation operation of f, and f_max and f_min represent the maximum and minimum values of f in all samples, respectively. In the same way, ŝ and ô can be obtained, and then a new sample x̂ = {f̂, ŝ, ô}, i = 1, …, p, is formed. The input data matrix is described as follows:
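The 0-1 normalisation and the de-normalisation needed to recover real-valued predictions can be sketched as follows (helper names are our own):

```python
import numpy as np

def min_max_normalise(f):
    """0-1 normalisation over all samples; also returns the (min, max)
    pair needed to de-normalise predictions afterwards."""
    f_min, f_max = f.min(), f.max()
    return (f - f_min) / (f_max - f_min), (f_min, f_max)

def de_normalise(f_hat, f_min, f_max):
    """Invert the 0-1 normalisation to obtain real traffic flow values."""
    return f_hat * (f_max - f_min) + f_min
```

The same pair of calls would be applied independently to flow, speed and occupancy before stacking them into the input matrix.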

Deep features extraction
As the main structure of the prediction model, the performance of the traffic data deep feature extraction module is directly related to the results of the prediction model. This paper uses a dual-branch residual gated convolutional neural network to extract the temporal and spatial features of the input data. The external structure of the two branches is the same, as shown in Figure 4. Unlike traditional convolutional neural networks (CNN), the gated convolutional neural network (GCNN) consists of two parts, a convolution unit and a gated unit, and obtains the output hidden feature maps through the element-wise multiplication of the convolution activation values and the gated values. Compared with LSTM, the GCNN structure supports parallel operation and requires less calculation time; compared with CNN, it adds a gating mechanism, which makes the convolution result selective and gives a better fitting effect. The gated convolution operation of the l-th layer is formulated as:

H_(l,(t|s)) = C_(1,l,(t|s))(Z_(l,(t|s))) ⊗ S(C_(2,l,(t|s))(Z_(l,(t|s))))

where ⊗ is the element-wise product, Z_(l,(t|s)) and H_(l,(t|s)) are the input and output of the l-th layer of gated convolution, respectively, and C_(1,l,(t|s)), C_(2,l,(t|s)) represent two different convolution operations with the same kernel dimensions. In the branch of deep temporal feature extraction, the gated convolution kernel has k1 = 1 and k2 > 1, and in the branch of deep spatial feature extraction, k1 > 1 and k2 = 1. The advantage of this design is to ensure that the spatial correlation features of different time points and the temporal correlation features of different spatial points extracted from any sample have different effects on the predicted value, which is more in line with the actual law.
S(⋅) is the sigmoid function used to obtain the gated values. In order to prevent the degradation problem caused by an overly deep network, every two layers of gated convolution, together with a shortcut connection added before the nonlinear activation, constitute a residual module [41]; if the dimensions of the input and output of the shortcut connection are inconsistent, a unit convolution operation (k1 = 1 and k2 = 1) is needed to make them consistent. Therefore, the nonlinear activation part of the gated convolution module at different layers is calculated as:

Z_(l+1,(t|s)) = U(H_(l,(t|s)) ⊕ J_(l,(t|s))),  J_(l,(t|s)) = Z_(l−1,(t|s)) if D(Z_(l−1,(t|s))) = D(H_(l,(t|s))), otherwise J_(l,(t|s)) = C_(3,l,(t|s))(Z_(l−1,(t|s)))

where ⊕ is the element-wise summation, J_(l,(t|s)) is the output of the shortcut connection in the l-th layer (l must be an even number), C_(3,l,(t|s)) is a unit convolution operation, U is the nonlinear activation function of the gated convolution operation and D(⋅) is the dimension size. As seen from Figure 4, all the nonlinear activation functions are Selu functions, defined as follows.
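A minimal Keras sketch of one residual gated-convolution module as described above; filter counts, kernel sizes and function names are our own assumptions, not the paper's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def gated_conv(z, filters, k1, k2):
    """One gated convolution: a conv output modulated element-wise by a
    sigmoid gate computed from a second conv with the same kernel size."""
    conv = layers.Conv2D(filters, (k1, k2), padding="same")(z)
    gate = layers.Conv2D(filters, (k1, k2), padding="same",
                         activation="sigmoid")(z)
    return layers.Multiply()([conv, gate])

def residual_gated_block(z, filters, k1, k2):
    """Two gated-conv layers plus a shortcut, with Selu activation at the
    end; a 1x1 (unit) convolution aligns the shortcut when channel counts
    differ, as in the residual module of the paper."""
    h = gated_conv(z, filters, k1, k2)
    h = gated_conv(h, filters, k1, k2)
    shortcut = z
    if z.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, (1, 1), padding="same")(z)
    out = layers.Add()([h, shortcut])
    return layers.Activation("selu")(out)
```

Setting (k1, k2) = (1, 3) gives a temporal-branch block, and (3, 1) a spatial-branch block, matching the kernel constraints stated above.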
U(x) = λx if x > 0, and U(x) = λα(e^x − 1) if x ≤ 0

where λ ≈ 1.0507 and α ≈ 1.6733 are known parameters in Selu. Its key role is to obtain the same effect as batch normalisation with very little computational time, making the prediction model easier to train.
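The Selu activation is straightforward to state directly (a plain NumPy sketch, using the standard constants quoted above):

```python
import numpy as np

SELU_LAMBDA = 1.0507
SELU_ALPHA = 1.6733

def selu(x):
    """Scaled exponential linear unit: the self-normalising activation used
    in place of batch normalisation in the proposed model."""
    x = np.asarray(x, dtype=float)
    return SELU_LAMBDA * np.where(x > 0, x, SELU_ALPHA * (np.exp(x) - 1.0))
```

For large negative inputs the output saturates near −λα ≈ −1.758, which is what keeps activations self-normalising through deep stacks.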

Analogous self-attention
Through the dual-branch deep RGCNN, the temporal correlation feature maps O_t and the spatial correlation feature maps O_s can be extracted, respectively, but different points in the same channel or different channels in these two feature maps have different effects on the prediction task. Therefore, a feature attention mechanism is usually used to make the prediction model controllable, enhancing useful features and suppressing redundant features. However, if the attention weights are computed from the feature maps being weighted, it is equivalent to increasing the number of layers of the prediction model, which increases its computational time. To alleviate this problem, inspired by self-attention [42], in which the attended feature maps and the attention weights are derived from the same input, the attention weights of map points and channels here are generated directly from the input X of the model, as shown in Figure 5.
The map points attention structure is very simple. A convolution operation is performed on the input sample X to obtain a single-channel feature map with the same size as each channel map in O_t or O_s, then the sigmoid activation function maps the value of each point to a weight between 0 and 1, and finally the weight map and the feature maps O_t or O_s are combined by element-wise product to obtain the weighted feature maps MPA_t or MPA_s. The whole process is calculated by:

MPA_(t|s) = O_(t|s) ⊗ S(C_(a,(t|s))(X))

where C_(a,t), C_(a,s) are normal convolution operations (k_mp1 = k_mp2 > 1). The channels attention is performed after the map points attention is completed, so the feature maps to be attended are MPA_t or MPA_s, but the weight maps are still obtained from the input sample X. First, X is flattened into a one-dimensional map; then, through a fully connected (FC) layer, the number of channels in the map is adjusted to match that of MPA_t or MPA_s; next, the sigmoid function maps the value of each channel to a weight between 0 and 1; finally, the weight maps and the feature maps MPA_t or MPA_s are combined by element-wise product to obtain the weighted feature maps CA_t or CA_s. The whole process is calculated by:

CA_(t|s) = MPA_(t|s) ⊗ S(W_(t|s) · Fla(X))

where Fla is the flatten operation, and W_t and W_s are the fully connected weights in the channels attention of the temporal branch and the spatial branch, respectively.
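Given precomputed point and channel logits (in the model these come from a convolution and an FC layer on the raw input X), the ASA weighting itself reduces to two broadcast multiplications. A NumPy sketch with illustrative names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def analogous_self_attention(feat, point_logits, channel_logits):
    """Weight feature maps by point and channel attention whose weights are
    derived from the raw input rather than from the features themselves.

    feat:           feature maps O_t or O_s, shape (H, W, C)
    point_logits:   single-channel map from a conv on input X, shape (H, W)
    channel_logits: per-channel scores from an FC layer on Fla(X), shape (C,)
    """
    # Map-points attention: one sigmoid weight per spatial position
    mpa = feat * sigmoid(point_logits)[..., None]
    # Channels attention: one sigmoid weight per feature channel
    ca = mpa * sigmoid(channel_logits)[None, None, :]
    return ca
```

Because the weights never pass back through the deep feature stack, the extra cost over the plain RGCNN is just one convolution, one FC layer and two element-wise products.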

Multi-location flow prediction
To obtain predictors of multi-location traffic flow through a single calculation model, it is necessary to fuse the extracted deep temporal features CA_t and spatial features CA_s to perform regression prediction, as shown in Figure 6. The regression structure first fuses the spatio-temporal feature maps, then expands them into a one-dimensional sequence through the flatten operation, and finally calculates the final predictors through an FC layer. Based on the fused feature H_st, the multi-location traffic flow predictor ŷ is obtained as:

ŷ = S(W_r · Fla(H_st))

where W_r denotes the fully connected weights of the regression layer. Since the samples of the model are all normalised, the final regression function selects the sigmoid. After the prediction model is built based on the above description, a large number of samples are needed for training before the model can be used. In the process of model training, the mini-batch adaptive moment estimation (Adam) [43] algorithm is used as the training optimiser. The loss function on the output layer is the mean square error (MSE), defined as follows:

MSE(θ) = (1/M) Σ_(m=1..M) (y_m − ŷ_m)²
where θ represents all learnable parameters, M is the number of training samples in each batch, y is the true value and ŷ is the predicted value. In the mini-batch Adam algorithm, a learning rate with exponential decay, defined in the following form, is adopted to enhance the convergence effect.
η = η_0 × dr^(gs/ds)    (14)

where η_0 is the initial learning rate, dr is the decay rate, gs is the global iteration step and ds is the decay iteration steps. The training algorithm of the prediction model essentially adjusts all learnable parameters to their most appropriate values, and is summarised in Table 1.
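Equation (14) is a standard staircase-free exponential decay schedule; a one-line sketch (variable names follow the equation):

```python
def decayed_learning_rate(eta0, dr, gs, ds):
    """Exponentially decayed learning rate per Equation (14):
    eta = eta0 * dr ** (gs / ds), where eta0 is the initial rate,
    dr the decay rate, gs the global step and ds the decay steps."""
    return eta0 * dr ** (gs / ds)
```

With the experimental settings used later (initial rate 0.05), a decay rate of 0.5 would halve the learning rate every ds iterations.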
After the model is trained, the traffic flow prediction task of multiple traffic locations in different prediction horizons can be realised and the predicted value of future traffic flow can be obtained through the de-normalisation operation.

EXPERIMENTAL VERIFICATION
The simulation experiments are conducted on a computer with a Core i9-9900K 3.6 GHz CPU, an NVIDIA GeForce RTX 2080Ti 11 GB GPU and 16 GB of DDR4 memory. The libraries in Keras and TensorFlow are used to build the prediction model in a Python environment, together with the CUDA parallel computing framework.

Data setting
The traffic data from the Caltrans performance measurement system (PeMS) [44] are collected for the experimental verification. The minimum time interval of traffic data for each day is 5 min, so there are 288 time points for each spatial location. Figure 8 shows the aggregated average traffic flow at the 288 time points of a day, the aggregated average traffic flow of the 12 consecutive locations and the aggregated average daily traffic flow over a week. From Figure 8, it is obvious that the daily traffic flow shows troughs at both ends and peaks in the middle, that there is little difference in the average traffic flow between different locations, and that traffic flow is higher on weekdays than on weekends. At the same time, it is noted that the traffic data from the freeway has obvious daily periodicity, which should be considered in the design of the prediction model [45]. In this regard, the daily periodic components of the road section are extracted through a trigonometric polynomial function [45,46], and the residual components with the periodic components removed are used as the input of the prediction model.
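One way to extract such a daily periodic component is to least-squares fit a truncated Fourier series (a trigonometric polynomial) over the 288-point daily cycle; this sketch is our own illustration with an assumed harmonic count, not the exact fit of [45,46]:

```python
import numpy as np

def daily_periodic_component(flow, order=3, points_per_day=288):
    """Fit a trigonometric polynomial to the daily cycle of a 1-D traffic
    series and return (periodic, residual); the residual is what the
    prediction model would take as input. `order` (harmonic count) is an
    assumed choice."""
    t = np.arange(len(flow)) % points_per_day
    phase = 2 * np.pi * t / points_per_day
    # Design matrix: [1, cos(k*phase), sin(k*phase)] for k = 1..order
    cols = [np.ones_like(phase)]
    for k in range(1, order + 1):
        cols += [np.cos(k * phase), np.sin(k * phase)]
    A = np.stack(cols, axis=1)
    coef, *_ = np.linalg.lstsq(A, flow, rcond=None)
    periodic = A @ coef
    return periodic, flow - periodic
```

The fitted periodic curve is added back after prediction, alongside the de-normalisation step.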

Simulation results and analysis
Since in many studies the prediction performance of deep learning models is better than that of traditional statistical models and machine learning models, five prediction models based on deep learning, including ANN [3], LSTM, DeepST [29], DNN-BTF [37] and AGCNN [38], are compared in the experiment to verify the superiority of the proposed ASA-RGCNN prediction model. The size of each input sample is (12, 12, 3). In order to ensure the fairness of each prediction model in the simulation, the mini-batch Adam algorithm and the exponentially decaying learning rate are adopted for training. The batch size is 128 and the initial learning rate is 0.05. Meanwhile, according to experience and repeated trial runs, some of the compared prediction models are slightly modified. Specifically, since the ANN model has the characteristics of fast calculation speed and poor fitting ability, the number of neurons in each layer is relatively large. The LSTM module in the LSTM model or the DNN-BTF model has a slow calculation speed, and over-fitting will occur when there are too many layers, so the number of layers is relatively small. In order to prevent over-fitting and under-fitting, the size of the convolution kernel of each model with a CNN module is also carefully selected. The structure of each prediction model is described as follows: (1) The ANN model adopts a network structure of seven fully connected layers, with 256 neurons in each hidden layer. In all the above models, the normalisation operation is replaced by the Selu function and the padding in the convolution operations uses "SAME".
Three evaluation metrics are first used to evaluate the predictive performance of the above models: the mean absolute error (MAE), the root mean square error (RMSE) and the mean relative error (MRE), which are defined as follows:

MAE = (1/N) Σ_(n=1..N) |y_n − ŷ_n|
RMSE = √[(1/N) Σ_(n=1..N) (y_n − ŷ_n)²]
MRE = (1/N) Σ_(n=1..N) |y_n − ŷ_n| / y_n
where N is the size of the test set, and y_n and ŷ_n are the true value and the predicted value, respectively. The MAE is the average of the absolute deviation of each predicted value from the actual value, which accurately reflects the size of the actual prediction error. The RMSE is the square root of the average squared deviation of the predicted values from the actual values, which is more affected by outliers than the MAE. The MRE is the average of the absolute deviation of each predicted value from the true value as a proportion of the true value. These metrics measure the errors between the real values and the predicted values from different aspects, and the prediction models can be judged according to these errors. This paper uses 14-day test samples in the simulation experiment. Table 2 describes the prediction performance of the proposed model and the five compared models over the entire data samples. The testing contains the error calculation results for prediction horizons from 5 to 30 min. Table 2 shows that all prediction models follow the same rule: as the prediction horizon increases, the prediction error metrics generally show an upward trend, that is, the prediction accuracy decreases. In each prediction horizon, the ANN and LSTM models cannot fully extract the temporal and spatial features of the traffic data, which makes their prediction errors relatively large, while the DeepST, DNN-BTF, AGCNN and ASA-RGCNN models extract the temporal and spatial features of traffic data through different branches, so their prediction errors are small. The proposed ASA-RGCNN model has the smallest error, which means the highest accuracy.
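The three metrics can be computed directly from the prediction and ground-truth arrays (a NumPy sketch; function names are ours):

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root mean square error: square root of the mean squared deviation."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mre(y, y_hat):
    """Mean relative error: absolute deviation as a proportion of the
    true value (true values must be nonzero)."""
    return np.mean(np.abs(y - y_hat) / y)
```

Note that MRE divides by the true value, which is why in the later analysis it behaves differently from MAE/RMSE at high-flow moments.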
In order to verify the effectiveness of the proposed ASA-RGCNN model in more detail, the daily traffic flow data is first divided into three levels by the aggregated average traffic flow: moments of low traffic flow [0, 150), moments of medium traffic flow [150, 500), and moments of high traffic flow [500, +∞). The moments of high traffic flow are the peak time periods and the others are off-peak time periods. Figure 9 shows the evaluation metrics of different prediction models at the three levels, where the prediction horizon is 5 min. It can be seen from Figure 9 that the MAE, RMSE and MRE of the ASA-RGCNN model are lower than those of other prediction models at any level of traffic flow. The MAE and RMSE of high traffic flow moments are higher than those of low traffic flow moments but lower than those of medium traffic flow moments, and the MRE of high traffic flow moments is lower than those of both low and medium traffic flow moments. Therefore, it can be judged that the ASA-RGCNN model can maintain good prediction accuracy in peak time periods. Compared with other models, the prediction error of the ASA-RGCNN model changes the least over time, which means that its generalisation ability is the best. Next, the prediction accuracy of different models is evaluated in terms of weekdays and weekends, where the prediction horizon is 5 min, as shown in Figure 10. It can be found from Figure 10 that the MAE and RMSE of each prediction model are higher on weekdays than on weekends, but the MRE is lower. In addition, the MAE, RMSE and MRE of the ASA-RGCNN model are the lowest both on weekdays and weekends, indicating that the prediction performance of the ASA-RGCNN model is better on both weekdays and weekends.
Since prediction tasks are performed at multiple spatial locations and temporal points, in addition to the error metrics, the distance and similarity of vectors are also important performance indicators. The average Euclidean distance and the average cosine similarity of the time vectors and the space vectors are shown in Table 3 when the prediction horizon is 5 min.
From Table 3, it can be seen that the average Euclidean distance between the temporal vectors and the spatial vectors of the ASA-RGCNN model's predicted values and the true values is smaller than that of the other models, and the average cosine similarity is larger. That is to say, the difference in both magnitude and direction between the predicted values of the ASA-RGCNN model and the true values is smaller than that of the other prediction models. In addition, considering that abnormal traffic flow data can appear due to the inherent uncertainty of the traffic system, the effect of the proposed prediction model in such abnormal situations is analysed. Figure 11 shows the comparison between the true values and the predicted values at the 10th spatial location on April 20, 2019 and at the first spatial location on April 26, 2019 when the ASA-RGCNN model is used. Obviously, even when the traffic flow data is abnormal, the ASA-RGCNN model still achieves a good prediction effect.
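The two vector-level indicators can be sketched as follows, assuming the true and predicted values are stacked row-wise (one temporal or spatial vector per row); the function names are illustrative, not the paper's:

```python
import numpy as np

def avg_euclidean(Y_true, Y_pred):
    """Average Euclidean distance between corresponding row vectors."""
    return np.mean(np.linalg.norm(Y_true - Y_pred, axis=1))

def avg_cosine(Y_true, Y_pred):
    """Average cosine similarity between corresponding row vectors."""
    num = np.sum(Y_true * Y_pred, axis=1)
    den = np.linalg.norm(Y_true, axis=1) * np.linalg.norm(Y_pred, axis=1)
    return np.mean(num / den)
```

A smaller average distance together with a larger average cosine similarity indicates that the predicted vectors agree with the true ones in both magnitude and direction.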
The calculation time (training time plus testing time) of a model represents its calculation cost, and its value directly affects the feasibility of the model. Table 4 shows the calculation time results of the different prediction models. Among them, the ANN and DeepST models require less time, because each layer of the ANN model is a simple fully connected structure and DeepST is a simple convolutional structure, and neither of them has other complex operations such as an attention module. The calculation time required by the LSTM and DNN-BTF models is relatively long, mainly because the LSTM structure is not capable of parallel operation compared with fully connected or convolutional structures. Although the ASA-RGCNN model has a longer calculation time than the ANN and DeepST models, it has a shorter calculation time than the other prediction models, which means that the computing efficiency of the gated convolution operation is lower than that of the fully connected operation and the ordinary convolution operation, but higher than that of the LSTM operation. The Selu function used in this paper, as an alternative to the batch normalisation (BN) operation, plays a key role in training the prediction model. In the ASA-RGCNN prediction model, we use the Selu function and the BN operation in turn as the layer normalisation tool of the gated convolution, and then compare them from two aspects: the iteration time and the convergence ability. When the number of iterations of these two models is 1000, the model using the Selu function requires 30.362 s, while the model using the BN operation requires 38.021 s, which means that the calculation cost of the Selu function is less than that of the BN operation. The MSE of the two models changes with the number of iterations as shown in Figure 12.
We can find that the model using Selu converges faster than the model using the BN operation, and its oscillation amplitude during the iteration process is smaller, which means its convergence effect is better.
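For reference, the Selu activation compared above has a standard closed form with fixed self-normalising constants (from Klambauer et al.'s "Self-Normalizing Neural Networks"); a minimal NumPy sketch:

```python
import numpy as np

# Standard SELU constants derived for self-normalisation.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    """Scaled exponential linear unit: SCALE * x for x > 0,
    SCALE * ALPHA * (exp(x) - 1) otherwise. Acts as a self-normalising
    activation, avoiding a separate batch-normalisation step."""
    return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))
```

Because the element-wise Selu needs no batch statistics, it avoids the extra mean/variance computation of BN, which is consistent with the shorter iteration time reported above.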
We also tested the proposed model without the attention module and with an attention module whose weights are taken from the previous layer. The metric results of the model without attention (WOA), with attention (WA) and with analogous self-attention (ASA) are shown in Table 5, which shows that the prediction error of the model with the attention module is much smaller than that of the model without it, but the calculation time is longer. Compared with the model with the ordinary attention module, the model with the analogous self-attention module has little difference in prediction error but a shorter calculation time, which means that the prediction accuracy is preserved while the calculation time is reduced. On the one hand, a neural network with an attention mechanism can learn the attention weights autonomously; on the other hand, the attention mechanism can in turn help us understand the neural network. The most intuitive way to understand the internal behaviour of a prediction model is to visualise it. We select the analogous self-attention weights at three predicted moments for visualisation, including 02:00 on April 17, 2019 (low traffic flow moment), 08:05 on April 18, 2019 (medium traffic flow moment) and 05:20 on April 19, 2019 (high traffic flow moment).
The weights of the map points attention in the space-time dual branch are shown in Figures 13 and 14, respectively. Regardless of the level of traffic flow to be predicted, the weight of the map points attention on the time branch is larger at adjacent time points, and the weight of the map points attention on the space branch is larger at upstream spatial locations. In other words, the traffic flow predicted by the model is more influenced by the traffic flow at adjacent time points and upstream spatial locations.
The weights of the channels attention in the space-time dual branch are shown in Figure 15. Compared with the weights of the map points attention, the channels attention weights at each level of traffic flow have no obvious regularity. The function of the channels attention module is to enhance the influence of the useful channels of the feature maps on the predicted value and to suppress the influence of the useless channels. However, each channel is obtained through a gated convolution operation, in which the channels of the convolution kernels are randomly ordered, which makes it difficult to find any regularity in the distribution of the channels attention.
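One way such attention heatmaps might be produced is sketched below. This is an illustrative assumption, not the paper's exact formulation: raw attention scores are normalised row-wise with a softmax before being rendered (e.g. with matplotlib's imshow), with rows as time steps and columns as spatial locations:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_heatmap(scores):
    """Normalise a 2-D array of raw attention scores row-wise so each
    row sums to 1 and can be plotted as a heatmap."""
    return np.apply_along_axis(softmax, 1, scores)
```

Each normalised row then shows which time points (or spatial locations) receive the largest weight for one predicted moment.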

CONCLUSION
Short-term traffic flow prediction can provide advanced management and control strategies for the traffic system, so a fast and accurate prediction method is particularly critical. In this paper, we propose a dual-branch residual gated convolutional network with analogous self-attention to predict traffic flow. The network extracts deep temporal and spatial features using a dual-branch deep residual gated convolution module, and weights them using the analogous self-attention module. Meanwhile, a regression layer is used to obtain the multi-location traffic flow. Compared with other prediction models in the simulation tests, the proposed prediction model not only has higher prediction accuracy, but also requires relatively less calculation time, and its adaptability to different traffic conditions is also better. Certainly, traffic flow prediction combined with deep learning still needs further research, including sample treatment and network structure improvement. Meanwhile, the single-section forecasting task can be extended to a road network, which would require the network structure to deal with more realistic tasks.