Spatial-temporal attention wavenet: A deep learning framework for traffic prediction considering spatial-temporal dependencies

Traffic prediction on road networks is highly challenging due to the complexity of traffic systems, and it is a crucial task in successful intelligent traffic system applications. Existing approaches mostly capture static spatial dependency, relying on prior knowledge of the graph structure. However, the spatial dependency can be dynamic, and sometimes the physical structure may not reflect the genuine relationship between roads. To better capture the complex spatial-temporal dependencies and forecast traffic conditions on road networks, a multi-step prediction model named Spatial-Temporal Attention Wavenet (STAWnet) is proposed. Temporal convolution is applied to handle long time sequences, and the dynamic spatial dependencies between different nodes are captured using a self-attention network. Unlike existing models, STAWnet does not need prior knowledge of the graph, as it develops a self-learned node embedding. These components are integrated into an end-to-end framework. Experimental results on three public traffic prediction datasets (METR-LA, PEMS-BAY, and PEMS07) demonstrate its effectiveness. In particular, for 1 h ahead prediction, STAWnet outperforms state-of-the-art methods with no prior knowledge of the network.


FIGURE 1
A simple example where distance cannot entail the genuine dependency. Detectors A and B are close but less related because they face contrary directions. Detectors C and D are the same distance from detector A, but they have different effects on it because one is upstream and the other is downstream.

The coefficients of existing graph-based models are often computed from spatial information (e.g. the distance between sensors), but sometimes these coefficients cannot entail the genuine dependency relationships. As the example in Figure 1 shows, a close distance between nodes (or detectors) does not indicate strong spatial dependencies, and nodes at similar distances can have different impacts.
• There are circumstances in which connections cannot entail the relationship, or the connections are missing altogether: spatial information can be unavailable due to secrecy or privacy, and the relations are more complicated than mere connectivity, considering various attributes such as the number of lanes, the surrounding environment, and infrastructure conditions.
• Fixed coefficients may fail to model the dynamic spatial dependencies and result in inaccuracy, because different nodes have different impacts, and even the same location has varying influence over time in terms of traffic volume, density, and relevant emergent events [2]. Such models fail to simultaneously capture the spatial-temporal features and the dynamic correlations of traffic data.
To address these challenges, we propose a novel deep learning framework named Spatial-Temporal Attention Wavenet (STAWnet). Specifically, we integrate the convolutional neural network [9] and the attention mechanism [10] into an end-to-end framework to extract spatial-temporal dependencies. By developing a self-adaptive node embedding, STAWnet can capture the hidden spatial relationships in the data without knowing the graph structure. We evaluate STAWnet on three public traffic network datasets, METR-LA, PEMS-BAY, and PEMS07. STAWnet achieves satisfactory performance without prior knowledge of the network as an input, which means our method can be flexibly applied to other tasks. The main contributions of this work are as follows:
• Compared to existing models, STAWnet uses a self-learned node embedding to learn the latent spatial relationships instead of extracting adjacency relationships from prior knowledge of the graph. This brings high flexibility, and the model can be easily extended to other spatial-temporal forecasting tasks.
• We design a dynamic attention mechanism that can adjust the coefficients of different nodes based on traffic conditions and spatial information.
• STAWnet overcomes the difficulty of multi-step prediction considering complex dynamic spatial-temporal dependencies and provides a certain degree of explainability. The results on real-world datasets indicate that STAWnet yields leading prediction performance in terms of various prediction error measures.
The rest of the paper is organized as follows. In Section 2, we give a literature review of related works. In Section 3, we formalize the traffic prediction problem and introduce the overall framework of STAWnet. In Section 4, experiments are conducted on three datasets to compare against other models. We then analyze the components of the model in detail in Section 5. Finally, we conclude our work and discuss future directions.

Traffic forecasting
Traffic forecasting has been studied for decades, and various emerging methods have been constantly proposed to model traffic characteristics. Lint and Hinsbergen divided these methods into three categories, that is, naive methods, parametric methods, and non-parametric methods [11]. Parametric methods often require a wealth of prior knowledge based on queuing theory and traffic flow theory, and they cannot handle unpredictable or complex factors. With the rapid development of real-time traffic data collection, non-parametric (or data-driven) approaches, which use massive historical data to capture similar traffic patterns, have prevailed in recent years. Further, Zhang et al. divided data-driven methods into three representative subcategories, that is, statistical models, shallow machine learning models, and deep learning models [8]. Given historical observations, many traffic prediction studies only consider temporal dependencies using time-series models. The autoregressive integrated moving average (ARIMA) and Kalman filtering have been widely applied [12,13]. These methods have difficulty achieving high accuracy because they ignore spatial dependencies and only consider the dynamic change of traffic conditions under the stationarity assumption of time sequences. However, this assumption is usually unsatisfied given traffic dynamics. Machine learning methods such as KNN [14] and SVM [15] have also been applied to model complex traffic data and yield satisfactory results. Guo et al. built a feature extraction model and applied the k-means method to divide stations into different types. They then proposed a hybrid prediction model based on kernel ridge regression and Gaussian process regression to predict short-term passenger flow of urban rail transit and verified it on Automatic Fare Collection System data [16].
However, the performance of traditional machine learning models heavily depends on manual feature engineering and selection, and they are not suitable for large-scale traffic forecasting.

Deep learning on spatial-temporal prediction
In the past decade, deep learning methods have become prevalent and have achieved high accuracy and efficiency in transportation studies. Ma et al. used the long short-term memory (LSTM) neural network to capture non-linear temporal dynamics effectively [17]. Yang et al. proposed an enhanced long-term-feature-based LSTM model, which takes full advantage of LSTM in processing time series and overcomes its insufficient learning of long temporal dependencies due to time lag, to predict origin-destination flow in the next hour [18]. However, recurrent neural network models treat the traffic sequences of different roads as independent data streams. Subsequent studies further explored the utility of spatial information. A series of studies applied convolutional networks to extract spatial patterns by treating inputs as image pixels [19,20]. Considering spatial-temporal interactions, Sun et al. converted spatial-temporal traffic dynamics to images and applied convolutional neural networks (CNNs) and recurrent neural networks [21]. Chu et al. used a multi-scale convolutional long short-term memory network to handle travel demand prediction [22]. Yao et al. further learned spatial-temporal dependencies simultaneously by integrating LSTM, local CNN, and semantic network embedding [23]. Bao et al. developed a hybrid deep learning neural network to predict the short-term demand of free-floating bike sharing [24]. However, the models with deep architectures above do not distinguish spatial variables across topological adjacency. In other words, the traffic network has a non-Euclidean structure, and models based on a Euclidean structure may compromise the capture of spatial correlations.
Typical deep learning structures like convolutional neural networks cannot be directly used on non-Euclidean and directional structures. To overcome this problem, graph neural network [25] based techniques have become popular for spatial relationship modeling by aggregating neighboring nodes' information into features. Davis et al. showed that a graph-based model offers competitive performance against a grid-based model at lower computational complexity across three real-world large-scale taxi demand-supply datasets, by representing Voronoi spatial partitions as nodes on an arbitrarily structured graph [26]. As one of the most widely used families of models, graph convolution network (GCN) [27] methods define graph convolutions by introducing filters from the perspective of graph signal processing, which is based on graph spectral theory and has been applied in related areas. For instance, Yu et al. proposed Spatio-Temporal Graph Convolutional Networks, using a GCN with Chebyshev polynomial approximation and a gated temporal CNN to capture spatial and temporal correlations correspondingly [4]. Li et al. presented the diffusion convolution recurrent neural network [6], which combines diffusion convolution and recurrent neural networks. Wu et al. also adapted diffusion convolution for spatial modelling [5]; their model considers both connected and unconnected nodes in the modelling process and uses dilated convolution to learn long sequences of data. Zhang et al. used a multi-graph GCN to capture patterns of passenger inflow and outflow at different granularities. They combined a GCN and a three-dimensional convolutional neural network to predict short-term passenger flow in urban rail transit and achieved leading performance [28]. Jin et al. transferred a hybrid GCN model from station-based scenes to grid-based scenes by modelling adjacency matrices, and fused graph-level and pixel-level representations to obtain a joint representation for ride-hailing demand prediction [29].
However, current GCN-based models have some shortcomings. They restrict the graph degree and require an identical graph structure shared among inputs. Also, they are incapable of learning from the topological structure, because the graph structure is fixed and not trained. Meanwhile, attention-based models have been widely applied in the deep learning community. Attention helps neural networks learn which parts of the input are more relevant. Based on that, graph attention networks employ attention mechanisms that assign larger weights to the more important nodes [10]. This is a promising approach for capturing the correlations between inputs and outputs while improving the interpretability of deep learning models. Do et al. applied spatial and temporal attention to exploit the spatial dependencies between road segments and the temporal dependencies between time steps, respectively, which showed promising results and helped to understand spatial-temporal correlations [30]. Guo et al. applied attention-based spatial-temporal graph convolutional networks to effectively capture the dynamic spatial-temporal correlations in traffic data [2]. Other studies took daily and weekly periodic patterns into consideration and used an attention-based periodic-temporal neural network, which captures the spatial, temporal, and periodical correlations [31,32].
Although these recent works show satisfactory results, graph convolution and graph attention based methods are highly dependent on an adjacency matrix whose coefficients are computed from spatial or context information. Sometimes these coefficients are not given or cannot entail the genuine dependency relationships. Also, the fixed coefficients may fail to model the dynamic dependencies and result in inaccuracy.

METHODOLOGY
In this section, we first give the mathematical definition of the problem. Next, we describe the overall structure and main building blocks of our framework, the temporal convolution layer and the dynamic attention layer. They work together to capture the spatial-temporal dependencies.

Problem definition
We define X(t) ∈ ℝ^{C×N} as the traffic observations at time step t, where C is the number of traffic conditions of interest (e.g. speeds, volumes) and N is the number of detectors in the network. The objective is to learn a function f(⋅) that maps T′ historical observations to T future observations:

[X(t−T′+1), …, X(t)]  →f  [X(t+1), …, X(t+T)]

FIGURE 2
The STAWnet consists of multiple ST-blocks. Each ST-block contains a gated TCN and a DAN, into which the node embedding is integrated. Layer normalization is utilized within every block to prevent over-fitting. Moreover, both residual and skip connections are used throughout the network to speed up convergence. In the end, the skip outputs from the gated TCNs in different ST-blocks are added up. Finally, the sum goes through output layers to compute the predictions.

To clarify, our problem definition is different from that of most existing relevant studies [2][3][4][5][6][7][33]. For others, an input graph must be defined, for example a graph 𝒢 = (𝒱, ℰ, A). Here, 𝒱 is a set of nodes with |𝒱| = N, ℰ is a set of edges, and A ∈ ℝ^{N×N} represents the adjacency matrix. The learned function f(⋅) then maps T′ historical observations, together with the graph, to T future observations:

[X(t−T′+1), …, X(t); 𝒢]  →f  [X(t+1), …, X(t+T)]

The difference is clear: our task does not need the graph structure information as an input. This simplification brings high flexibility, because sometimes the graph information is unknown or hard to define. Thus, this model needs less prior knowledge and can be easily extended to other spatial-temporal prediction tasks.
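The input/output shapes of this problem definition can be sketched as follows (a toy with a naive placeholder for f; the 12-step horizon matches the experimental setting later in the paper, and the 207-sensor count is the standard METR-LA size):

```python
import numpy as np

# C traffic conditions (e.g. speed), N detectors, T' input steps, T output steps.
C, N, T_in, T_out = 1, 207, 12, 12  # 12 steps of 5-min data = 1 h

# Stack T' consecutive observations X(t) ∈ R^{C x N} into the model input.
history = np.stack([np.random.rand(C, N) for _ in range(T_in)], axis=-1)  # (C, N, T')

def f(x):
    """Placeholder for the learned mapping: T' historical steps -> T future steps.
    Here a naive 'repeat the last observation' forecast stands in for STAWnet."""
    return np.repeat(x[..., -1:], T_out, axis=-1)

prediction = f(history)
assert prediction.shape == (C, N, T_out)  # no graph structure needed as input
```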

Framework of the STAWnet
In this section, we elaborate on the proposed architecture of STAWnet, as shown in Figure 2. It consists of multiple stacked spatial-temporal blocks (ST-blocks) and output layers. An ST-block is constructed from a gated temporal convolution network (TCN) and a dynamic attention network (DAN), which are designed to capture the temporal and spatial dependencies, respectively. By stacking these ST-blocks, the model is able to handle spatial-temporal dependencies at different temporal levels. The details of each module are described in the following sections.

Gated TCN for extracting temporal dependencies
Although RNN-based approaches are prevalent in time-series analysis, they suffer from time-consuming iteration and gradient explosion/vanishing when capturing long-range sequences in practice. CNN-based approaches enjoy the advantages of parallel computing, stable gradients, and a simple structure. Inspired by [34,35], we adopt the dilated CNN, which allows an exponentially large receptive field by increasing the layer depth, to capture temporal dependencies. A dilated convolution is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is more efficient [35]. As a special case, a dilated convolution with dilation 1 yields the standard convolution. In addition, a gated mechanism is used to learn complex temporal dependencies. The computation in the gated TCN is written as

ℋ^i_out = tanh(W_f * ℋ) ⊙ σ(W_g * ℋ)

where ℋ^i_out ∈ ℝ^{C_i×N×T_i} is the output of the i-th ST-block, C_i is the number of channels of the input data in the i-th ST-block, and T_i is the length of the temporal dimension in the i-th ST-block. σ(⋅) is the sigmoid activation function, ⊙ denotes the element-wise multiplication operator, * denotes the convolution operator, f and g denote the filter and gate, and W is the learnable convolution filter.
The inputs from the last layer are three-dimensional tensors of size [C_{l−1}, N, T_{l−1}]. After the gated TCN, the outputs become three-dimensional tensors of size [C_l, N, T_l] with T_l = T_{l−1} − d_l, where d_l is the dilation size in the l-th ST-block. Thus, the length of the temporal dimension of the tensors gets shorter after going through gated TCN layers. The input of the first ST-block is

ℋ^0 = Conv_{1×1}(X)

where Conv_{1×1} is the 1×1 convolution, which is used to increase dimensionality.
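The gated dilated convolution and the shrinking temporal dimension can be sketched in NumPy (a toy with kernel size 2 and random weights; the real model learns the filters W_f and W_g during training):

```python
import numpy as np

def dilated_conv(x, w, d):
    """Dilated convolution with a size-2 kernel.
    x: (C_in, N, T); w: (C_out, C_in, 2); d: dilation.
    Output length shrinks by d, matching T_l = T_{l-1} - d_l in the text."""
    T_out = x.shape[2] - d
    out = np.zeros((w.shape[0], x.shape[1], T_out))
    for t in range(T_out):
        pair = np.stack([x[:, :, t], x[:, :, t + d]], axis=-1)  # (C_in, N, 2)
        out[:, :, t] = np.einsum('ock,cnk->on', w, pair)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_tcn(x, w_f, w_g, d):
    """tanh(W_f * x) ⊙ sigmoid(W_g * x): the gated TCN of one ST-block."""
    return np.tanh(dilated_conv(x, w_f, d)) * sigmoid(dilated_conv(x, w_g, d))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 207, 12))          # (channels, nodes, time)
w_f = rng.standard_normal((32, 32, 2)) * 0.1    # random stand-ins for trained filters
w_g = rng.standard_normal((32, 32, 2)) * 0.1
out = gated_tcn(x, w_f, w_g, d=2)
assert out.shape == (32, 207, 12 - 2)           # temporal dimension shrank by d
```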

Attention mechanism for extracting dynamic spatial dependencies
The impacts from other roads can be dynamic under different traffic conditions. For example, the impact of one road on other roads can be stronger when the road is congested than when it carries low volume [36]. To better model the dynamic spatial dependencies, a self-attention network is employed on the graph-structured data to extract patterns in our model. Attention mechanisms have been widely applied in the deep learning community due to their high efficiency and flexibility in modelling dependencies, and have achieved good results in tasks such as computer vision [37], natural language processing [38], and graph learning [10].
The key idea of attention is to dynamically assign different weights to different nodes, as shown in Figure 4. For node i, we compute a weighted sum of all other nodes' information in the network:

h′_{i,t} = Σ_{j=1}^{N} α_{i,j} h_{j,t}

where α_{i,j} is the attention score indicating the importance of node j to node i, with Σ_j α_{i,j} = 1.

From real-world experience and traffic flow studies [39], both the traffic network structure and the traffic conditions help to predict future conditions. Motivated by this intuition, we incorporate both the intrinsic network information and the traffic conditions into the prediction model: the self-learned node embedding is concatenated with the hidden state, and the scaled dot-product approach is adopted to compute the attention. Considering the aforementioned complex factors affecting the relationships between nodes, we propose to learn a node embedding that captures a hidden representation of every node in the network. A node embedding is a mapping of a discrete node ID to a vector of continuous numbers; that is, embeddings are low-dimensional, learned continuous vector representations of discrete variables (node IDs). In other words, it projects the nodes into vectors carrying latent information, like the famous Word2Vec model [40]. In practice, the embedding is randomly initialized and gradually trained. Well-trained embeddings are representations of nodes in which similar nodes are closer to one another. Neural network embeddings are useful because they reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space. We then compute

α_{i,j} = softmax_j ( ⟨W_q (h^l_{i,t} ‖ e_i), W_k (h^l_{j,t} ‖ e_j)⟩ / √d′ )

where ‖ represents the concatenation operation, ⟨⋅, ⋅⟩ denotes the inner product, e_i is the node embedding of node i, W_k, W_q ∈ ℝ^{d′×d_c} are the key and query matrices in the DAN, and d_c is the dimension of h^l_{i,t} ‖ e_i. The node embedding e_i, the key matrix W_k, and the query matrix W_q are all learnable parameters.

FIGURE 4
Framework of the DAN
After the attention scores are obtained, the hidden state can be updated through the weighted sum above. The shape of the output is the same as that of the input, with ℋ^l_S[:, i, t] = h′^l_{i,t}. This operation is efficient because it is parallelizable across node pairs.
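One DAN update at a single time step can be sketched as follows (a NumPy toy with random weights and embeddings; in the actual model e, W_q, and W_k are trained end-to-end, and the update runs over every time step in every ST-block):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dan_step(h, e, W_q, W_k):
    """Dynamic attention at one time step.
    h: (N, C) hidden states from the gated TCN; e: (N, d_emb) node embeddings.
    Concatenates h_i ‖ e_i, scores node pairs by scaled dot-product,
    and returns h'_i = sum_j alpha_ij * h_j along with the attention matrix."""
    z = np.concatenate([h, e], axis=-1)          # h_i ‖ e_i, shape (N, d_c)
    q, k = z @ W_q.T, z @ W_k.T                  # queries and keys, (N, d')
    scores = (q @ k.T) / np.sqrt(q.shape[-1])    # scaled dot-product, (N, N)
    alpha = softmax(scores, axis=-1)             # importance of node j to node i
    return alpha @ h, alpha

rng = np.random.default_rng(1)
N, C, d_emb, d_proj = 207, 32, 16, 16            # sizes from the experimental settings
h = rng.standard_normal((N, C))
e = rng.standard_normal((N, d_emb))              # random stand-in for trained embedding
W_q = rng.standard_normal((d_proj, C + d_emb)) * 0.1
W_k = rng.standard_normal((d_proj, C + d_emb)) * 0.1
h_new, alpha = dan_step(h, e, W_q, W_k)
assert h_new.shape == (N, C)
assert np.allclose(alpha.sum(axis=-1), 1.0)      # each row of attention sums to 1
```

This operation is a dense N×N attention, which is what lets the model attend to distant, unconnected nodes without any adjacency matrix.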

The output layer
The ResNet mechanism is applied in the framework to prevent gradient vanishing [41]. The output of the l-th ST-block is computed as

ℋ^l = F(ℋ^{l−1}) + ℋ^{l−1}

where F denotes the gated TCN followed by the DAN. In every ST-block, the gated TCN also produces a skip output, as shown in Figure 2. The skip connections are added up as

ℋ_skip = Σ_{i≤N_st} ℋ^i_skip

where N_st is the number of ST-blocks. Then, two layers of non-linear transformation with ReLU [42] as the activation function are used to compute the final output. During training, the goal is to minimize the error between the real traffic observations on the roads and the predicted values. The loss function is the mean absolute (L1) error

L = (1/(T N C)) Σ_{t,i,c} | X̂(t)_{c,i} − X(t)_{c,i} |

where X̂ = f(⋅) is the output of the prediction model.
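A minimal sketch of the skip-sum and output layers (hypothetical channel sizes and random weights standing in for the trained 1×1 convolutions; the L1 training objective comes from the experimental settings):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(2)

# Hypothetical skip outputs from N_st = 8 ST-blocks, aligned to the same shape.
skips = [rng.standard_normal((256, 207)) for _ in range(8)]  # (skip_channels, nodes)
s = np.sum(skips, axis=0)                                    # add up the skip connections

# Two non-linear output layers mapping skip channels to T = 12 predicted steps.
W1 = rng.standard_normal((256, 256)) * 0.05
W2 = rng.standard_normal((12, 256)) * 0.05
pred = W2 @ relu(W1 @ relu(s))                               # (T, N) predictions

# Training objective: L1 loss (mean absolute error) against the ground truth.
truth = rng.standard_normal((12, 207))
loss = np.abs(pred - truth).mean()
assert pred.shape == (12, 207) and loss >= 0.0
```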

Datasets
We evaluate the performance of our model and the baseline models on three widely used traffic prediction datasets with different road network scales:
• METR-LA: traffic speed data collected from 207 loop detectors on the highways of Los Angeles County.
• PEMS-BAY: traffic speed data collected from 325 sensors in the Bay Area by the California Transportation Agencies Performance Measurement System (PeMS).
• PEMS07: a larger PeMS dataset covering 883 detectors.

Benchmarks
To demonstrate the effectiveness of the proposed model, we compare STAWnet with the following models:
• HA: Historical average, a naive method that models traffic flow as a periodic process and uses the weighted average of previous periods as the prediction.
• DCRNN: Diffusion Convolutional Recurrent Neural Network [6], which combines recurrent neural networks with diffusion convolution modelling both inflow and outflow relationships.
• STGCN: Spatial-Temporal Graph Convolution Network [4], which applies purely convolutional structures to extract spatial-temporal features simultaneously from graph-structured time series.
• GaAN: Gated Attention Networks [44], which uses a multi-head attention-based network with a convolutional sub-network to control each attention head's importance.
• Graph WaveNet: A convolutional network architecture [5], which introduces a self-adaptive graph to capture the hidden spatial dependency and uses dilated convolution to capture the temporal dependency.
• APTN: Attention-based Periodic-Temporal neural Network [31], an end-to-end solution for traffic forecasting that captures spatial, short-term, and long-term periodical dependencies.
• ST-GRAT: Spatio-Temporal GRaph ATtention [33], which uses spatial attention, temporal attention, and spatial sentinel vectors to capture the spatio-temporal dynamics in road networks.
For all the above approaches, the inputs are unified, and the hyperparameters are tuned to perform best on the validation set.

Experimental settings
Following previous works [4][5][6], we use T = T′ = 12 historical and prediction time steps (1 h). The dataset is split into three parts, with 70% of the data used for training, 20% for testing, and 10% for validation. We train the model using the Adam optimizer [45] with a learning rate of 0.001, a batch size of 64, and 100 epochs. The training objective is the L1 loss. The dimension of each node embedding vector is 16, and the hidden dimension is 32. The number of ST-blocks is 8. Our experiments run on a server with eight CPUs (Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz), 256 GB of RAM, and two GPUs (NVIDIA GeForce RTX 2080 Ti, 11 GB memory).

Experimental results
In the experiments, we measure the accuracy of the models using the mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE), computed as

MAE = (1/n) Σ_i |ŷ_i − y_i|,  RMSE = √((1/n) Σ_i (ŷ_i − y_i)²),  MAPE = (1/n) Σ_i |ŷ_i − y_i| / |y_i|

Table 1 shows the metrics of STAWnet and the baseline algorithms for 5, 15, 30, and 60 min ahead forecasting on the test datasets. The time-series analysis method is not ideal, sometimes even worse than the historical average method in multi-step prediction, indicating its limited ability to model non-linear and complex spatial-temporal data; LSTM, as a deep learning method, obtains better predictions than ARIMA. The models that consider both the spatial and temporal relationships, including DCRNN, STGCN, Graph WaveNet, and ST-GRAT, achieve satisfactory results. Although the accuracy of STAWnet is slightly lower than that of Graph WaveNet and ST-GRAT for shorter-than-30-min prediction on the first two datasets, it drops much more slowly as the prediction sequence gets longer, and STAWnet is more accurate in long-time prediction. Considering the average metrics over all horizons, STAWnet achieves the most accurate performance, as shown in Table 2.
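The three metrics can be computed as follows (a small NumPy example with made-up speed values; in practice, missing or zero observations are usually masked before computing MAPE):

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error."""
    return np.abs(pred - true).mean()

def rmse(pred, true):
    """Root mean squared error."""
    return np.sqrt(((pred - true) ** 2).mean())

def mape(pred, true):
    """Mean absolute percentage error (assumes no zeros in `true`)."""
    return (np.abs(pred - true) / np.abs(true)).mean() * 100.0

# Made-up example: observed vs predicted speeds (mph) at three sensors.
true = np.array([60.0, 50.0, 40.0])
pred = np.array([57.0, 52.0, 44.0])
assert np.isclose(mae(pred, true), 3.0)                       # (3 + 2 + 4) / 3
assert np.isclose(rmse(pred, true), np.sqrt((9 + 4 + 16) / 3))
assert np.isclose(mape(pred, true), (3/60 + 2/50 + 4/40) / 3 * 100)
```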
For the multi-step prediction task, it is worth noting that the accuracy of long-time prediction is of high importance both in theory and in practice, because long-time prediction is usually the bottleneck of model accuracy given error propagation and the forgetting of historical knowledge. A more accurate long-time prediction can give reliable guidance for travel planning, departure time scheduling, and traffic regulation.
Because the performance gap on these two datasets is minor, a more challenging dataset, PEMS07 with 883 detectors, is added. On PEMS07, STAWnet outperforms the other benchmarks in terms of almost all metrics. For 60-min prediction, STAWnet achieves approximately a 5.3% improvement in MAE, 2.8% in RMSE, and 7.0% in MAPE, showing its effectiveness in capturing complex spatio-temporal dependencies. It is worth noting that STAWnet achieves these results without being given information about the detector network, such as distance, functional similarity, and connectivity. Without this information as input, the performance of other models drops dramatically (e.g. Graph WaveNet, ST-GRAT) or the model does not work at all (e.g. DCRNN, STGCN). Thus, the takeaway from the experiments is that STAWnet is more flexible and accurate than existing models. STAWnet is slightly worse at short-time prediction, because the information from adjacent nodes is more useful there and STAWnet does not use the underlying spatial structure. Instead, STAWnet has more strength in long-time prediction and is more flexible than the baselines, and on average its overall accuracy is the best.

Computation speed
In this section, we report the computation costs of the models with leading performance on the METR-LA dataset, as shown in Table 3. Comparing STAWnet to the baselines, we find it is about four times faster than GaAN. STAWnet is also faster than DCRNN, ST-GRAT, and APTN. Graph WaveNet performs best because it is a non-autoregressive model. Overall, STAWnet is the second-best model in terms of both training and inference time cost. On the other hand, Graph WaveNet, DCRNN, and ST-GRAT need graph information, and APTN needs periodic observations as input; correspondingly, extra data preprocessing is needed, whereas STAWnet needs no auxiliary information. Overall, STAWnet is fast and can be flexibly applied to similar tasks without much data preprocessing.

Self-learned node embedding
We further investigate the relationships between the nodes in the METR-LA experiment. As shown in the heatmap of Figure 5(a), the real adjacency relationship is based on the distance between nodes, and the heatmap value W^d_{ij} is computed as

W^d_{ij} = exp(−dist(v_i, v_j)² / σ²) if dist(v_i, v_j) ≤ κ_d, otherwise 0

where W^d_{ij} represents the edge weight between sensors v_i and v_j, dist(v_i, v_j) denotes the Euclidean distance between sensors v_i and v_j, σ is the standard deviation of the distances, and κ_d is the distance threshold [6]. As shown in Figure 5(b), the heatmap value of the self-learned adjacency W^s_{ij} is based on the cosine similarity between node embeddings, computed as

W^s_{ij} = cos(e_i, e_j) = ⟨e_i, e_j⟩ / (‖e_i‖ ‖e_j‖) if cos(e_i, e_j) ≥ κ_s, otherwise 0

where κ_s is the similarity threshold. It should be mentioned that we choose two different metrics to evaluate these two adjacencies because v_i, which is a coordinate, and e_i, which is a node embedding, are different types of data with different dimensions and properties, and it is more suitable to choose an appropriate metric for each.
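The two adjacency heatmaps can be reproduced with a short NumPy sketch (hypothetical coordinates, embeddings, and thresholds; the paper's heatmaps use the real METR-LA sensor locations and the trained embeddings):

```python
import numpy as np

def distance_adjacency(coords, kappa_d):
    """Thresholded Gaussian kernel over pairwise Euclidean distances (style of [6])."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))          # (N, N) pairwise distances
    sigma = dist.std()                           # std of the distances
    W = np.exp(-(dist ** 2) / (sigma ** 2))
    W[dist > kappa_d] = 0.0                      # drop pairs beyond the threshold
    return W

def embedding_adjacency(E, kappa_s):
    """Cosine similarity between node embeddings, thresholded at kappa_s."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    W = En @ En.T                                # cos(e_i, e_j) for all pairs
    W[W < kappa_s] = 0.0
    return W

rng = np.random.default_rng(3)
coords = rng.uniform(0, 10, size=(50, 2))        # hypothetical sensor coordinates
E = rng.standard_normal((50, 16))                # hypothetical trained embeddings
W_d = distance_adjacency(coords, kappa_d=5.0)
W_s = embedding_adjacency(E, kappa_s=0.2)
assert W_d.shape == (50, 50) and W_s.shape == (50, 50)
assert np.allclose(np.diag(W_d), 1.0)            # zero distance to itself -> weight 1
```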
Qualitatively, although no information about the sensor positions is given, the relationship heatmap of the self-learned node embedding shows some patterns similar to the real-world adjacency heatmap, which are labelled with red rectangles. On the other hand, the difference between the two heatmaps is that the self-learned relationship has more blue points in the figure, indicating that the self-learned adjacency heatmap reveals more hidden relationships with other nodes, not just limited to close neighbors. The effectiveness of the self-learned node embedding in improving prediction accuracy is quantitatively discussed in the following.

Ablation studies
To quantitatively verify the effectiveness of the model design, we conduct experiments with different configurations. The "DAN w/o" configuration does not include the dynamic attention network. The "dynamic w/o" configuration only uses the node embedding for the attention computation, without concatenating the output from the TCN, which can be seen as a fixed spatial dependency. The "node embedding w/o" configuration only uses the output from the TCN for attention, without the node embedding, which can be seen as only considering the similarity of dynamic historical observations between nodes without learned knowledge of the network structure. Table 4 shows the average MAE, RMSE, and MAPE over 1-h prediction. When implementing the attention, the node embedding and the output from the gated TCN are concatenated. From the results, it can be concluded that the self-learned node embedding greatly helps to improve accuracy, and the dynamic attention mechanism helps to achieve superior results over the baselines. The outputs from the gated TCN, concatenated with the self-learned node embedding, are able to assign dynamic weights to different neighbors and better depict the spatial-temporal dependencies by considering both the network structure and the traffic conditions. In comparison, GCN-based models rely on an adjacency matrix, which can be seen as a static relationship, and sometimes a static relationship can be misleading. For example, for relatively long-time prediction, the observations from distant nodes can be more useful than those from adjacent nodes.

FIGURE 5
The comparison between the real adjacency relationship and the self-learned relationship on the METR-LA dataset. The self-learned adjacency matrix has similar patterns to the real adjacency matrix.
Without the constraints of the graph, the DAN with spatial attention can selectively focus on certain nodes rather than treating all adjacent nodes equally, and this data-driven attention mechanism captures the dynamic correlations as well. The nodes with different weights can reconstruct a useful relationship over the entire graph. Also, considering all the nodes brings a regularization effect to the model and avoids over-fitting between nodes.

Attention interpretation and visualization
The attention mechanism can also give some explanation of how the model learns the spatial dependencies from other nodes. To better understand the contribution of the attention weights, we visualize representative attention weights in Figure 6(a), along with their physical locations on the real map, to show how the proposed model handles complex traffic situations.
To illustrate, we randomly select a traffic sensor node (sensor id: 769373) in the METR-LA dataset. The darkness of the colors represents the magnitude of attention. In Figure 6(a), most nodes with the darkest colors are located in the regions close to sensor 769373. This is reasonable because the traffic conditions of close neighbors have the most influence when predicting future traffic speed. Another area with higher attention lies between the downstream nodes and the freeway intersection in the south, which leads to downtown Los Angeles. The model also gives relatively high attention to the north-west and south-east corners, which are both traffic-busy areas in the city. Zooming in to the details in Figure 6(b), it is interesting to see that the nodes on the same side, with the same vehicular direction, have more importance in making predictions than nodes with the contrary vehicular direction, which strongly supports the description of Figure 1.
The attention over nodes is also dynamic with respect to different traffic conditions. Take sensor node 717583 as another example. As shown in Figure 7, the model at first gives more attention to downstream nodes during the morning peak hour. After 30 min, the attention shifts to both upstream and downstream nodes. After 1 h, the model focuses more on the upstream nodes; some of the nodes near node 717583 with light colors are in the contrary direction, so they are given less importance. It can be concluded that the self-learned node embedding provides the latent information of the nodes, while in the prediction process the attention part captures the dynamic spatial-temporal dependencies. Hence, STAWnet not only achieves state-of-the-art forecasting performance but also shows an interpretability advantage to a certain extent.

CONCLUSIONS
To summarize, the dynamic change of traffic conditions on a road network exhibits spatial and temporal dependencies. This paper presents STAWnet to capture spatial-temporal dependencies efficiently by combining temporal convolution with an attention mechanism. The design of the self-learned node embedding gives the insight that prior knowledge of the graph may not be necessary for learning the spatial-temporal dependencies in traffic prediction tasks, and that the dynamic dependencies help to improve prediction accuracy. Experiments on three real-world datasets show that STAWnet achieves state-of-the-art results with fewer inputs than related studies. In addition, the self-learned node embedding and attention weights help to identify the influential nodes, indicating the interpretability of the proposed model. Our source code is available at https://github.com/CYBruce/STAWnet. For future work, spatial-temporal patterns on road networks should be analyzed in more detail to find explainable patterns, and we will further explore the utility of external context data, such as venue types, weather conditions, and event information, as a multi-view graph for other spatial-temporal prediction tasks. Also, the self-learned node embedding can be transferred to other related tasks such as data imputation, clustering analysis, and relationship identification.