High-performance traffic speed forecasting based on spatiotemporal clustering of road segments

Traffic speed prediction is an indispensable element of intelligent transportation systems. Numerous studies have been devoted to high-precision prediction models. However, most existing methods use either link-wise or network-wide input. The former is time-consuming, especially for large-scale applications, while the latter may incur underfitting owing to the heterogeneous traffic states within the entire network. Herein, we propose a novel prediction scheme based on spatiotemporal traffic pattern clustering. Firstly, road segments are partitioned into several groups via the developed clustering approach, which considers both the observed data sequence and the spatial topology structure. Subsequently, a sequence-to-sequence learning architecture is employed for each group to generate predictions for the entire traffic network. Validated on a real-world dataset from Beijing, our proposed paradigm offers a significant improvement over other well-known benchmarks for various prediction intervals in terms of prediction accuracy and computational efficiency.


INTRODUCTION
Short-term traffic prediction is becoming increasingly essential for providing high-quality transportation services [1]. Reliable information on future traffic states is conducive to formulating effective management measures (e.g. signal timing optimisation [2] and ramp control [3]) and to assisting travellers' route planning and departure time scheduling [4]. However, it remains a challenging task owing to the requirements of both prediction precision and timely calculation. With the rapid development of artificial intelligence technology, new prediction models with deeper structures have gradually emerged. These models offer significant improvements in accuracy but impose a greater computational burden. Therefore, it is crucial to maintain the quality of these methods while significantly reducing the computational load. Hence, we herein introduce a novel strategy that applies a deep-learning framework to classified groups of road segments, where the classification is conducted to reduce the scale of learning.
The main contributions of this research can be summarised as follows:
1. A spatiotemporal clustering method of road segments is proposed, which considers both traffic speed patterns and spatial topology information.
2. Based on the clustering results, we developed a novel prediction scheme. The sequence-to-sequence learning (Seq2Seq) algorithm is separately deployed cluster by cluster in parallel.
3. We conducted numerical tests on a large-scale real-world dataset provided by AMAP, a leading mobile navigation application in China. The results indicate that our approach performs better than other state-of-the-art baselines in terms of both prediction accuracy and computational efficiency.
The remainder of this study is organised as follows. Section 2 presents recent relevant studies. Section 3 describes our proposed method. Section 4 presents the experimental dataset and analysis result. Finally, Section 5 concludes the study and outlines future research.

LITERATURE REVIEW
In this section, we review traffic prediction models and traffic prediction with clustering technology, which are related to the current study.

Traffic prediction models
In the past decade, substantial efforts have been expended to model the temporal and spatial correlations of traffic state evolution under the topic of forecasting, the summary of which is presented as follows.
Statistical methods such as auto-regressive integrated moving average, Kalman filtering, and their variants [5][6][7] can predict future values based on previous observations by successive time-series analysis. However, statistical models typically rely on the stationary assumption, and they cannot accurately capture the time-varying and non-linear characteristics of traffic flow. The massive accumulation of traffic data provided opportunities for the realisation of data-driven approaches. The fuzzy theory [8], artificial neural network (ANN) [9], support vector machines [10], and K-nearest neighbour model (KNN) [11] have also been exploited for short-term traffic prediction. Furthermore, deep-learning methods have reported better performances. Recurrent neural networks (RNNs) [12], stacked auto-encoders [13], and deep belief networks [14] are powerful tools for learning temporal dependencies from data sequence.
To construct a robust prediction framework, the temporal features must be extracted and the spatial information of road networks must be incorporated. Conventional vector autoregressive models consider the effects of upstream and downstream links [15]. Cross-correlation coefficient [16] and convolutional neural networks (CNNs) [17,18] have been utilised to analyse spatial relevance. Subsequently, graph convolution networks (GCNs) have been proposed to address non-Euclidean structured data, which have been investigated in the transportation domain. Temporal convolution networks [19], long short-term memory networks (LSTMs) [20], and Seq2Seq [21] were fused with GCNs to depict spatiotemporal dependencies in traffic prediction.
To summarise, remarkable achievements of various models have been reported in previous studies. Unlike the aforementioned studies, we propose an efficient and general prediction paradigm based on the clustering of road segments.

Traffic prediction with clustering technology
In [22], a non-parametric clustering-based technique that provides accurate traffic forecasting was established through the exploitation of traffic data dynamics, whereas in [23], an urban traffic flow prediction system using a multifactor pattern recognition model was established, which combined a Gaussian mixture model clustered with an ANN. Tang et al. constructed a fuzzy neural network to forecast travel speed, which was trained by the K-means method to partition inputs into different clusters and a weighted recursive least-squares estimator to optimise the parameters [24]. Subsequently, Wang et al. developed a traffic threshold identification module based on an improved K-means clustering algorithm and implemented a traffic flow prediction framework based on adaptive speed thresholds [25]. Shen et al. presented a novel traffic speed prediction method based on temporal clustering analysis and a deformable convolution neural network that could discriminate traffic environments [26]. The aforementioned clustering targets were the observed data samples, where the spatial information of road networks was not incorporated.
Song et al. measured the similarities between road segments based on adjacency, connectivity, and congestion; subsequently, they generated topological maps by road segment clustering [27]. However, they did not address how to proceed with traffic prediction. The spatiotemporal performance trends at the network level and for individual links were mined through K-means clustering [28], which yielded new prediction error measures rather than enhancing the prediction performance. In [29], a clustering procedure was used to identify time-variant clusters of road segments with high spatial correlations; subsequently, they were fed into an ANN-based traffic forecasting model, which is the model most closely related to ours. Nevertheless, the simple multilayer perceptron (MLP) structure impairs the capacity to capture temporal dependencies, especially for multistep prediction. We further propose obtaining the optimal number of clusters through a pre-experiment on a validation set, and we provide a comprehensive comparison of prediction performance between link-, cluster-, and network-level inputs.

METHODOLOGY
In this section, we first interpret the notations of the variables used herein; subsequently, we illustrate the devised method in detail.

Preliminaries
We model the network as a directed graph whose node set comprises the road segments (also called links) and whose edge set contains the directed edges representing the connections among the road segments. $D$ is the distance matrix of the road segments, and the entry $D_{ij}$ denotes the length of the shortest path from road segment $x_i$ to $x_j$ considering the driving direction.
We denote $v_i^t$ as the speed of road segment $x_i$ (for every road segment in the node set) at the $t$th time slot, where time is divided into slots of fixed duration (e.g. 5 min), and the 'speed' is defined as the average speed of floating cars travelling through the road segment during this time interval. $V_t \in \mathbb{R}^N$, where $N$ is the total number of road segments, is the speed vector of the underlying road network at the $t$th time slot, whose $i$th element is $(V_t)_i = v_i^t$. Given the historical observations $\{V_{t-s} \mid s = 0, 1, \ldots, m\}$ ($m$ is the look-back time window) and the network topology, the traffic speed forecasting problem is to seek a trustworthy approximation $\hat{V}_{t+n}$ of the future value $V_{t+n}$, where $n$ is the required prediction step.
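The problem statement above can be illustrated with a minimal Python sketch that slices a speed series into (history, future) training pairs. The function name `make_windows` and the choice to return all steps from 1 to n (rather than only step n) are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Tuple

# One sample: (history V_{t-m}..V_t, future V_{t+1}..V_{t+n}).
Window = Tuple[List[List[float]], List[List[float]]]


def make_windows(speeds: List[List[float]], m: int, n: int) -> List[Window]:
    """Build supervised samples from a speed series.

    speeds[t] is the speed vector V_t (one value per road segment).
    Each history holds m + 1 slots and each future holds n slots.
    """
    samples: List[Window] = []
    for t in range(m, len(speeds) - n):
        history = speeds[t - m : t + 1]      # V_{t-m}, ..., V_t
        future = speeds[t + 1 : t + n + 1]   # V_{t+1}, ..., V_{t+n}
        samples.append((history, future))
    return samples
```

With 5-min slots, `m = 11` gives 12 historical slots (60 min), and `n = 2` corresponds to a 10-min horizon.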

Road segment clustering
The traffic conditions of the entire road network vary significantly over both time and space [30], which makes traffic prediction challenging. Within a local part of the network, however, the traffic states are relatively stable and closely dependent [31]. Hence, our major objective is to partition the given road network into clusters, each containing a group of strongly correlated road segments. Based on the above notations, we define $P_{ij}$ in Equation (1) to characterise the similarity in traffic patterns between road segments $x_i$ and $x_j$. Equation (1) is evaluated over a historical observed dataset that covers a sufficient period (e.g. one week) and involves the cardinality of that dataset.
However, taking only the observed data into account may yield ambiguous results because of coincidences in numerical values (e.g. two irrelevant road segments may possess similar speeds). As the traffic flows on downstream and upstream roads affect each other through transfer and feedback effects [32], it is necessary to incorporate the topology information into the clustering procedure. Combined with the spatial distance, the spatiotemporal similarity $S_{ij}$ between road segments $i$ and $j$ is quantified in Equation (2) from the normalised values of $P_{ij}$ and $D_{ij}$, each of which lies in $[0, 1]$, weighted by two coefficients that adjust the effects of the corresponding components. Once the similarity matrix of road segments is derived, our next goal is to partition the road segments into a set of disjoint clusters such that the total dependencies between road segments within the same cluster are maximised (Equation (3)), where the $k$th cluster contains $q$ road segments and $M$ is the total number of clusters. This problem can be solved via agglomerative clustering, which generates clusters in a bottom-up way to produce a binary tree with each data point as either a leaf node or root node [33].
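As a rough illustration of how a pattern term and a distance term could be combined into one similarity matrix, the following Python sketch makes several assumptions that are not confirmed by the text: the pattern term is taken as the mean absolute speed difference over the dataset, both terms are min-max normalised, and the similarity is a weighted sum of one minus each normalised term. The names `alpha` and `beta` for the two coefficients are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def min_max(matrix: Matrix) -> Matrix:
    """Scale all entries of a matrix into [0, 1] (assumed normalisation)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


def pattern_difference(speeds: Matrix) -> Matrix:
    """ASSUMED form of the pattern term: mean absolute speed difference.

    speeds[t][i] is the speed of segment i at slot t.
    """
    slots, n = len(speeds), len(speeds[0])
    return [
        [sum(abs(row[i] - row[j]) for row in speeds) / slots for j in range(n)]
        for i in range(n)
    ]


def spatiotemporal_similarity(
    speeds: Matrix, dist: Matrix, alpha: float = 0.5, beta: float = 0.5
) -> Matrix:
    """ASSUMED combination: larger values mean more similar segments."""
    p = min_max(pattern_difference(speeds))
    d = min_max(dist)
    n = len(dist)
    return [
        [alpha * (1.0 - p[i][j]) + beta * (1.0 - d[i][j]) for j in range(n)]
        for i in range(n)
    ]
```

Under these assumptions, two segments that are close on the network and have similar speed profiles receive a similarity near 1, whereas a pair that is distant or has dissimilar speeds scores lower.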
The algorithm is initialised by treating each road segment as a separate cluster. The second step is to calculate a distance matrix embodying the distance between any two clusters. In this study, we used the average-linkage criterion [34], which takes the average similarity between all road segments of the two clusters as the distance for clustering (Equation (4)). In the third step, the two clusters with the shortest distance are identified and merged into a new cluster. Subsequently, steps 2 and 3 are repeated until the number of clusters reaches the given number.
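The three steps above can be sketched in plain Python as follows. This is a minimal, unoptimised illustration rather than the authors' implementation; it assumes that "shortest distance" corresponds to the highest average pairwise similarity, and the function name `average_linkage_clusters` is hypothetical.

```python
from typing import List


def average_linkage_clusters(
    sim: List[List[float]], num_clusters: int
) -> List[List[int]]:
    """Bottom-up clustering with the average-linkage criterion.

    sim[i][j] is the similarity between segments i and j. At each round the
    pair of clusters with the highest average pairwise similarity is merged
    (assumed to be the pair with the shortest distance).
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    # Step 1: every road segment starts as its own cluster.
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > num_clusters:
        best = None
        # Step 2: average similarity between every pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        # Step 3: merge the closest pair into a new cluster.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

For larger networks, a library routine for hierarchical clustering would normally replace the nested loops; the explicit version is kept here only to mirror the description step by step.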
Owing to the clustering process, the road segments in the underlying network are classified into several groups. In the following sections, we denote $\{x_{k_1}, x_{k_2}, \ldots, x_{k_q}\}$ as the road segments in the $k$th cluster, and $V_t^{(k)}$ as the corresponding speed observations at the $t$th time slot.

Traffic prediction method
As road segments within the same cluster are strongly spatially correlated, the primary objective of the traffic prediction model is to capture the temporal evolution of the traffic state.
We leverage the Seq2Seq architecture [35] to model the temporal dependencies, which unites two RNN modules to overcome the disadvantage of the fixed output timestamp of a single RNN. As shown in Figure 1, the encoder embeds the input sequence $\{V_t\}$, and its final hidden state is fed into the decoder, which learns to predict the future traffic speed $\{V_{t+n}\}$. It is noteworthy that the data stream of our Seq2Seq model is the cluster-level $V_t^{(k)}$, which is vastly different from the link-level $v_i^t$ and the network-level $V_t$ in previous studies [36]. The benefits of this approach will be emphasised in Section 4.
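The cluster-level data stream can be illustrated by a short Python sketch that slices the network-level series into per-cluster series, runs one model per cluster in parallel, and reassembles a network-wide forecast. The per-cluster model here is only a placeholder (it repeats the last observation) standing in for the Seq2Seq model, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def cluster_view(speeds: List[List[float]], members: List[int]) -> List[List[float]]:
    """Extract the cluster-level series V_t^(k) from the network-level V_t."""
    return [[row[i] for i in members] for row in speeds]


def fit_and_predict(cluster_speeds: List[List[float]]) -> List[float]:
    """PLACEHOLDER model: repeats the last observation for each segment.

    A trained per-cluster Seq2Seq model would be called here instead.
    """
    return list(cluster_speeds[-1])


def predict_network(speeds: List[List[float]], clusters: List[List[int]]) -> List[float]:
    """Run one model per cluster in parallel and merge the forecasts."""
    forecast = [0.0] * len(speeds[0])
    with ThreadPoolExecutor() as pool:
        results = pool.map(
            lambda members: fit_and_predict(cluster_view(speeds, members)),
            clusters,
        )
        # pool.map yields results in the same order as the input clusters.
        for members, values in zip(clusters, results):
            for segment, value in zip(members, values):
                forecast[segment] = value
    return forecast
```

Because the clusters are disjoint, each worker writes to a different set of segment indices, so the merged forecast covers the entire network exactly once.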
During training, historical ground truth series are fed to the decoder (see Figure 1(a)), whereas during testing the ground truth observations are replaced by predictions generated by the model itself (see Figure 1(b)). The model performance may degrade owing to the discrepancy between the input distributions of the training and testing stages. Hence, scheduled sampling [37] is integrated into the model, where the decoder is fed with either the ground truth observation with probability $\epsilon_i$ or the prediction by the model with probability $1 - \epsilon_i$ at the $i$th iteration. In this study, we set $\epsilon_i$ according to Equation (5), where $\Lambda$ is the predefined threshold. Throughout the model operation, the hidden state $h_{t-1}$, which is passed to the next timestamp $t$, together with the input $V_t^{(k)}$ generates a new hidden state $h_t$ at each iteration. Both the encoder and decoder adopt the LSTM structure [38], and the calculation formulas are shown in Equations (6)-(11), where $f_t$, $i_t$, $o_t$, and $c_t$ represent the forget gate, input gate, output gate, and memory cell vector, respectively; $[\cdot \| \cdot]$ concatenates two tensors along the same dimensions; and $\sigma(y) = 1/(1 + e^{-y})$ is the sigmoid activation of the gates.
To jointly reduce the errors in the multiple-step prediction, the loss function is defined as the mean absolute error between the predicted and observed cluster-level speeds $V_t^{(k)}$ over the prediction steps, where $\|\cdot\|_1$ returns the $L_1$ norm of the vector. All the parameters are updated by minimising the loss function through the stochastic mini-batch gradient descent (BGD) algorithm in the training stage. The training steps of the proposed model are elaborated in Algorithm 1.
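The scheduled-sampling step can be sketched in Python as follows. The decay schedule used in the paper is not recoverable from the text, so this sketch assumes the inverse sigmoid decay commonly used with scheduled sampling, with the threshold playing the role of the decay constant; the function names are hypothetical.

```python
import math
import random
from typing import List


def ground_truth_probability(iteration: int, threshold: float) -> float:
    """ASSUMED schedule: inverse sigmoid decay of the ground-truth probability.

    Starts close to 1 and decreases towards 0 as training proceeds.
    """
    return threshold / (threshold + math.exp(iteration / threshold))


def next_step_input(
    ground_truth: List[float],
    prediction: List[float],
    iteration: int,
    threshold: float,
    rng: random.Random,
) -> List[float]:
    """Pick the ground truth with probability eps_i, else the model output."""
    eps = ground_truth_probability(iteration, threshold)
    return ground_truth if rng.random() < eps else prediction
```

Early in training the model is mostly fed ground truth observations, and it is gradually exposed to its own predictions, which narrows the gap between the training and testing input distributions.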

Dataset description
A real-world traffic dataset was collected from anonymous navigation users of AMAP (https://www.amap.com/). We selected the inner fourth ring road in Beijing, which encircles the downtown area, as the study site. As shown in Figure 2, the 65-km fourth ring road is partitioned into 162 road segments of length 400 m. Subsequently, we calculated the 5-min average speed for each road segment from the instantaneous states of the trajectory points. Missing values were filled with the nearest non-missing values using MATLAB [39]. The average traffic speeds during the morning and evening peak hours on weekdays of October 2016 are shown in Figures 3(a) and (b), respectively. The entire dataset, ranging from October 1, 2016, to November 27, 2016, was separated into three independent subsets, the details of which are provided in Table 1.
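The nearest-value completion of missing speeds can be illustrated with a small Python sketch. It is not the MATLAB routine used in the study; in particular, resolving ties in favour of the earlier time slot is an assumption of this sketch, and the function name `fill_nearest` is hypothetical.

```python
from typing import List, Optional


def fill_nearest(series: List[Optional[float]]) -> List[float]:
    """Replace each missing value (None) with the nearest observed value.

    Ties between an earlier and a later neighbour are resolved in favour of
    the earlier one (an assumption of this sketch).
    """
    known = [i for i, value in enumerate(series) if value is not None]
    if not known:
        raise ValueError("series contains no observed values")
    filled: List[float] = []
    for i, value in enumerate(series):
        if value is None:
            nearest = min(known, key=lambda k: (abs(k - i), k))
            filled.append(series[nearest])
        else:
            filled.append(value)
    return filled
```

For example, `fill_nearest([None, 30.0, None, None, 42.0, None])` returns `[30.0, 30.0, 30.0, 42.0, 42.0, 42.0]`.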

Experimental setting
In this subsection, we compare our prediction scheme with several prevailing methods:
4. LSTM: In the LSTM [38], the inputs are reshaped into a matrix with one axis as the time step and the other axis as the number of road segments, which can address temporal correlations well.
5. Graph attention network (GAT): The GAT [42] leverages masked self-attentional layers to assign different weights to different neighbourhoods, and thus considers spatial information.
6. Seq2Seq: It has the same structure as our prediction model except for the cluster-level input.
7. Traffic graph convolutional (TGC)-LSTM: The TGC-LSTM [20] unifies the traffic graph convolution operation and the LSTM structure to capture spatiotemporal correlations in the transportation system.
It is noteworthy that KNN, LSTM, GAT, and Seq2Seq are applicable for both link-wise modelling and network-wide inputs. The performances of these models under the two scenarios were tested in the following experiments.
Speed records in the past 60 min (i.e. $m = 11$) were employed as inputs for all models (excluding HA) to forecast the traffic states in the next 10, 20, and 30 min. The prediction accuracy over all road segments was assessed via three typical metrics, that is, mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE). These metrics are expressed in terms of $v_i^t$ and $\hat{v}_i^t$, which are the ground truth and prediction values at the $i$th road segment at the $t$th time slot, respectively, and are computed over the testing set.
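The three metrics follow their standard definitions and can be sketched in Python as below. The inputs are assumed to be flattened lists of ground truth and predicted speeds over all road segments and time slots of the testing set, and MAPE is reported as a percentage, which is an assumption about the reporting convention.

```python
import math
from typing import List


def mae(truth: List[float], pred: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def mape(truth: List[float], pred: List[float]) -> float:
    """Mean absolute percentage error, in percent (truth must be non-zero)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(truth, pred)) / len(truth)


def rmse(truth: List[float], pred: List[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))
```

Because RMSE squares each error before averaging, it penalises occasional large errors more heavily than MAE does, which is why the two are usually reported together.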
To ensure fairness in both accuracy and efficiency comparisons, the same training hyperparameters were retained in the neural-network-based methods (except TGC-LSTM). In detail, the batch size was 512, the learning rate was $10^{-2}$, and the number of training steps was 2500 with the Adam optimiser. We adapted the parameters of TGC-LSTM to the underlying dataset on the basis of the suggestions in [20]. The LSTM and the encoder and decoder of Seq2Seq each comprised one hidden layer, with eight hidden units for the link-level input and 160 hidden units for the network-level input. The threshold of scheduled sampling for the Seq2Seq model was $\Lambda = 1700$; 1-hop and 2-hop neighbours were considered separately in the GAT. The above settings were determined through a pre-experiment on the validation set to achieve a balance between prediction accuracy and computational efficiency.
For our prediction paradigm, we set both weighting coefficients in Equation (2) to 0.5 and used the historical observations from October 10 to 16 for clustering (i.e. we calculated $P_{ij}$ using Equation (1)). We used one week of data for clustering so that traffic patterns on both weekdays and weekends are appropriately considered. The optimal number of clusters was selected as 12 from the candidates (4, 8, 12, 16, 20) via a pre-experiment on the validation set that compared the corresponding prediction performances (the details will be provided in Section 4.4). The model settings remained the same as those mentioned above except that 16 hidden units were used.

Prediction performance
Tables 2-4 present the comparisons of the proposed model and benchmark algorithms for 10-, 20-, and 30-min-ahead forecasting on the testing dataset. The following conclusions were obtained from the results:
(i) Our prediction strategy, that is, implementing the Seq2Seq model cluster by cluster, achieved the best prediction results for all three evaluation metrics under various steps and exhibited superior computational efficiency. Although the specialised spatiotemporal deep-learning framework, TGC-LSTM, performs better than the other baselines, it is still inferior to our approach.
(ii) Figure 4 compares the execution time and MAE of the KNN, LSTM, GAT, and Seq2Seq fed with link- and network-level inputs under a 20-min prediction step. As expected, link-wise modelling requires a lengthy calculation time while affording better precision, whereas the network-level input performs worse because the heterogeneous traffic states of the entire network cause the underfitting problem.
(iii) Figure 5 shows the execution time and MAE of the Seq2Seq model with link-, network-, and cluster-level inputs for prediction steps of 10-30 min. The clustering pre-process clearly improved both accuracy and efficiency. Compared with the link-level input, the cluster-level input incorporates spatial features and reduces the required number of training rounds from the number of links to the number of clusters (e.g. for a road network with 160 links, the link-level input needs 160 training rounds, link by link; when the links are divided into 20 clusters, only 20 training rounds are required). Unlike the network-level input, the clustering algorithm partitions the network into highly correlated groups, which enables the spatiotemporal dependencies to be learned more easily. It is noteworthy that the running time of the Seq2Seq model increases rapidly with the input dimension.
The cluster-level input decreases the input dimension at each iteration, and the parallel computing for each cluster contributes to reduced time consumption. These results strongly support the motivation of this study.
Figure 6 shows the change of MAE with the number of clusters in (4, 8, 12, 16, 20) on the validation and testing sets under a 20-min prediction step. The variation in errors exhibits the same trend for validation and testing, indicating that it is appropriate to select the optimal number of clusters through a pre-experiment on the validation set. Meanwhile, the results are insensitive to the number of clusters, which verifies the robustness of our prediction scheme.
Figure 7 presents the partition results of road segments with 12 clusters; for example, the road segments 6-13 shown in Figure 2 are assigned to cluster 2 (shown in orange) in Figure 7(a). Comparing Figures 7(a) and (b), the distribution of clusters without spatial information is disorderly (e.g. road segments in the north and south fourth ring road are both classified into cluster 7, although they are unlikely to interact physically), which is clearly not a desirable partition for traffic prediction.

Discussion of cluster-level input
To further validate the effect of our prediction strategy, an additional experiment was conducted in which road segments were drawn randomly to form the same number of groups as produced by our clustering process. As shown in Table 5, feeding the model with random sets of road segments does not improve the prediction accuracy, which supports the necessity of our designed clustering algorithm. The results of this experiment show that the improvements from the cluster-level input are owing to the highly relevant spatial information within each cluster. Simply modifying the input dimension by feeding random link sets cannot boost the model performance. Figure 8 depicts the performances of the proposed prediction scheme with different values of the two weighting coefficients in Equation (2). The results indicate that clustering with a balanced trade-off between pattern similarity ($P_{ij}$) and spatial distance ($D_{ij}$) yields better prediction accuracy. Partitioning the road segments without considering pattern similarity (its coefficient set to 0) or spatial distance (its coefficient set to 0) compromises the prediction performance. Moreover, Table 6 assesses the effects on prediction performance of clustering with our spatiotemporal similarity measurement $S_{ij}$ and with the Pearson product-moment correlation coefficient (PPMCC) [45,46]. The results indicate that the devised spatiotemporal similarity measurement is more beneficial for traffic prediction accuracy.
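The random-grouping control experiment can be sketched in Python as follows. The text states only that the number of groups matches the clustering result; reusing the cluster sizes for the random groups is an assumption of this sketch, and the function name `random_groups` is hypothetical.

```python
import random
from typing import List


def random_groups(num_segments: int, group_sizes: List[int], seed: int) -> List[List[int]]:
    """Randomly partition segment indices into groups of the given sizes.

    ASSUMPTION: the random groups reuse the sizes of the learned clusters.
    """
    if sum(group_sizes) != num_segments:
        raise ValueError("group sizes must sum to the number of segments")
    rng = random.Random(seed)  # seeded for reproducibility
    indices = list(range(num_segments))
    rng.shuffle(indices)
    groups: List[List[int]] = []
    start = 0
    for size in group_sizes:
        groups.append(sorted(indices[start : start + size]))
        start += size
    return groups
```

Each random group can then be fed to the same per-group model as a learned cluster, so that any difference in accuracy is attributable to how the groups were formed rather than to the input dimension.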

CONCLUSION
A novel high-performance traffic prediction paradigm was developed in this study. In the first step, road segments in the underlying road network were partitioned into several groups through a spatiotemporal clustering algorithm, so that strongly correlated and mutually influencing road segments were assigned to the same cluster. Next, the Seq2Seq model was employed cluster by cluster to perform traffic prediction. Using a real-world traffic dataset, our approach demonstrated significant improvements over other benchmarks in terms of both prediction accuracy and computational efficiency. Furthermore, we analysed the differences between link- and network-level inputs, the influence of the cluster number, the effects of randomly selected road segment groups, and the designed spatiotemporal similarity measurement. In future studies, we may extend our prediction scheme to large-scale urban road networks, where the influence of traffic lights may be incorporated into the clustering process.