An unsupervised approach for traffic motion patterns extraction

Correspondence: Asadollah Shahbahrami, Department of Computer Engineering, Faculty of Engineering, University of Guilan, Rasht, Iran. Email: shahbahrami@guilan.ac.ir

Abstract: Automatic analysis, understanding typical activities, and identifying vehicle behaviour in crowded traffic scenes are fundamental and challenging tasks for traffic video surveillance. Recent studies have used machine learning approaches to extract meaningful patterns occurring in a traffic scene, for example, at an intersection. In this regard, we convert visual patterns and features into visual words using dense and sparse optical flow and learn traffic motion patterns with the group sparse topical coding (GSTC) algorithm. In the first step of the proposed algorithm, the input traffic video is divided into non-overlapping clips. After that, motion vectors are extracted using dual TV-L1 as a dense optical flow and Lucas-Kanade as a sparse optical flow, and converted to flow words. For learning traffic motion patterns, the GSTC algorithm, a non-probabilistic topic model (TM), is applied. These patterns represent priors on observable motion, which can be used to describe a scene and answer behaviour questions such as 'What are the motion patterns in a traffic scene?' and 'What is going on?'. The experimental results, obtained on a real dataset, QMUL, show that the combination of GSTC + dual TV-L1 extracts more traffic motion patterns than GSTC + Lucas-Kanade and previous studies.

machine vision systems. Extracting the general traffic motion patterns and features from surveillance cameras is the first step towards a state-of-the-art machine vision system for traffic control. General traffic motion patterns show the common activities of a scene, which can be used for other important tasks such as rule mining and violation detection.
Discovering traffic motion patterns results in a meaningful model of the scene and facilitates scene analysis. To extract traffic motion patterns and understand behaviour, researchers have developed new analysis approaches based on machine learning [3]. In traffic scenes such as intersections and roundabouts, some of the usual activities, called motion scene models, happen regularly and periodically. By observing motion over time, the typical motion patterns can be learned and used as a priori knowledge for prediction. The advantage of these techniques is that they are generally applicable and do not require manual retraining for new scenes or scenarios. There are two main methods for automatic traffic behaviour understanding in this field: (1) trajectory clustering and (2) topic modelling. In the first approach, moving objects such as vehicles are observed over a long period. Trajectories are extracted from the video during observation to build a training database. The training trajectories are then clustered into groups of similar trajectories. These clusters represent the prototypical patterns of motion encountered in a scene. However, the quality of these methods is highly dependent on robust tracking of vehicles, which is inherently difficult due to noise, changing lighting conditions, shadows, and occlusion. In the second method, low-level features such as pixel movements are computed with optical flow algorithms and combined with topic models (TMs) and sparse coding [4,5] to reach a general perspective of the scene [3,6-8]. This study focuses on the second approach for automatic traffic behaviour understanding.
For automatic traffic behaviour understanding, the video taken from the surveillance cameras is first divided into a sequence of non-overlapping clips of equal duration. Motion vectors in each clip are extracted using an optical flow algorithm and converted into visual words called flow words. To extract latent patterns that represent the common motion distribution in the scene, the group sparse topical coding (GSTC) [9] framework, a non-probabilistic TM, is applied. To compare the performance of optical flow algorithms, we use two different algorithms to extract motion vectors: Lucas-Kanade [10] as a sparse optical flow and dual TV-L1 [11] as a dense optical flow. Our main goal is to test the combined model of the TM and optical flow in extracting the common traffic motion patterns at an intersection. In other words, we are looking for an answer to the question 'How many of the traffic motion patterns occurring at an intersection can be extracted by an unsupervised approach?'. We have combined two different optical flow algorithms with the original GSTC and compared the patterns extracted by these two methods with the visible visual patterns in the dataset. Our experimental results show that GSTC + dual TV-L1 extracts more traffic motion patterns than GSTC + Lucas-Kanade. In other words, the combination of GSTC and dense optical flow extracts more traffic motion patterns than the sparse variant and than other related work. This gain is not free, because dense optical flow is more computationally intensive than sparse optical flow: in dense optical flow, all pixels in the frames are processed, while in sparse optical flow only a selected subset of pixels is processed. Our study is inspired by [1,3,12], with the main difference as follows.
In our study, traffic motion patterns of vehicles are extracted using the original group sparse topical coding with two different optical flow algorithms, one dense and one sparse. In the improved GSTC, a new regularisation term has been added to the GSTC to obtain sparse word representations. We have not applied that sparsity in our TM.
The remainder of the study is organised as follows: Section 2 provides an overview of other methods for traffic and behaviour analysis and background on the TM and optical flow. Section 3 gives a brief description of the problem. Section 4 describes traffic motion pattern detection using the TM and optical flow. In Section 5, the implementation and results are described. Finally, the conclusion is outlined in Section 6.

TM
In order to make it easy to follow the content of the study, we have provided all used abbreviations in Table 1. The TM was first developed in the context of text mining and natural language processing (NLP) to extract the latent topics that occur in a collection of documents. TMs such as probabilistic latent semantic analysis (PLSA) [13], latent Dirichlet allocation (LDA) [14], the hierarchical Dirichlet process (HDP) [15], and sparse topical coding (STC) [16] recognise actions by exploring the co-occurrence of features at different hierarchical levels. The GSTC has been proposed as a new non-probabilistic TM to learn the sparse latent representations of large collections of text documents [9].
Suppose a collection D of M documents is given, containing words from a vocabulary V of size N; the whole collection is D ∈ ℤ^(N×M), in which d_{n,m} denotes the number of appearances of the nth word (n ∈ N) in the mth document (m ∈ M). Let β ∈ ℝ^(K×N) be a dictionary with K bases, where each base is assumed to be a topic base, that is, a unigram distribution over V. For documents D, GSTC projects D into a semantic space spanned by a set of automatically learned topic bases and directly obtains the unnormalised word code S ∈ ℝ^(M×N×K) for each individual word in documents D [9]. Table 2 presents the notation used in the GSTC algorithm.
The GSTC solves the optimisation problem in Equation (1). The first part of Equation (1) is equivalent to minimising an unnormalised KL-divergence between the observed word counts d_{n,m} and their reconstructions S_{m,n,:} β_{:,n}. The variable S_{m,n,:} denotes the nth row of the word code of the mth document (clip), and β_{:,n} denotes the nth column of β. The second part is a group lasso, that is, a mixed ℓ1/ℓ2 norm [9]. Equation (2) shows the linear-algebra form of Equation (1), where S_m ∈ ℝ^(N×K) is the word code of the mth document, D_{:,m} is the mth document, and ||·||_{2,1} is the mixed ℓ1/ℓ2 norm.
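Since the equation bodies did not survive the text extraction, the objective can be written out explicitly; the following reconstruction is based on the GSTC formulation in [9], with β the dictionary, S the word codes, and λ the group-lasso weight:

```latex
% Equation (1): unnormalised KL-divergence loss plus a group lasso over topics
\min_{S \geq 0,\; \beta} \;
  \sum_{m=1}^{M} \sum_{n=1}^{N}
  \Big( S_{m,n,:}\,\beta_{:,n} \;-\; d_{n,m}\,\log\!\big(S_{m,n,:}\,\beta_{:,n}\big) \Big)
  \;+\; \lambda \sum_{m=1}^{M} \sum_{k=1}^{K} \big\lVert S_{m,:,k} \big\rVert_2

% Equation (2): the same objective in linear-algebra form, where the diagonal
% entry (S_m \beta)_{nn} reconstructs word n of clip m and
% \lVert S_m \rVert_{2,1} = \sum_{k=1}^{K} \lVert S_{m,:,k} \rVert_2
\min_{S_m \geq 0,\; \beta} \;
  \sum_{m=1}^{M} \Big[ \sum_{n=1}^{N}
  \Big( (S_m \beta)_{nn} - D_{n,m}\,\log (S_m \beta)_{nn} \Big)
  + \lambda \,\lVert S_m \rVert_{2,1} \Big]
```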

Optical flow
A motion vector is a 2D vector that provides an offset from the coordinates in the current frame to the coordinates in a reference frame. The motion vector shows pixel or object displacement across consecutive frames of a video. Optical flow is one way to compute the motion vector. In other words, optical flow is the motion of pixels or objects between consecutive frames in a video, caused by the relative movement between the object and the camera. Optical flow algorithms are divided into two groups, dense and sparse. The goal of both is the same, that is, computing pixel displacements, but they differ fundamentally in their mathematical formulation and hypotheses. In general, an algorithm that calculates inter-frame displacements for a subset of pixels is called sparse optical flow, and an algorithm that calculates inter-frame displacements for all pixels is called dense optical flow. The sparse optical flow algorithm has three main inputs: the current frame, a subset of pixels in the current frame, and the next frame. The dense optical flow algorithm has two main inputs: the current and the next frames [10,11,17,18].
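As a concrete illustration of the sparse approach, the core Lucas-Kanade step solves a small least-squares system over a window around a point. The following is a minimal pure-Python sketch, not the pyramidal implementation used in practice (e.g. by OpenCV); the window size and the synthetic frames are chosen purely for illustration:

```python
def lucas_kanade_at(frame1, frame2, cx, cy, win=2):
    """Estimate the displacement (u, v) at (cx, cy) by solving the 2x2
    normal equations of the brightness-constancy constraint
    Ix*u + Iy*v + It = 0 over a (2*win+1)^2 window."""
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - win, cy + win + 1):
        for x in range(cx - win, cx + win + 1):
            ix = (frame1[y][x + 1] - frame1[y][x - 1]) / 2.0  # spatial gradient x
            iy = (frame1[y + 1][x] - frame1[y - 1][x]) / 2.0  # spatial gradient y
            it = frame2[y][x] - frame1[y][x]                  # temporal gradient
            sxx += ix * ix; sxy += ix * iy; syy += iy * iy
            sxt += ix * it; syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return None  # aperture problem: gradient directions do not constrain (u, v)
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

# Synthetic check: frame2 is frame1 shifted one pixel to the right.
I1 = [[x * x + y * y for x in range(20)] for y in range(20)]
I2 = [[(x - 1) ** 2 + y * y for x in range(20)] for y in range(20)]
u, v = lucas_kanade_at(I1, I2, 10, 7)  # close to the true shift (1, 0)
```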

Related studies
Most existing methods for analysing traffic can be divided into two categories. In the first category, called 'trajectory clustering', an object such as a vehicle or pedestrian is detected and observed over a long period. Trajectories are extracted from the video during observation to build a training database. The training trajectories are then clustered into groups of similar trajectories. These clusters represent the prototypical patterns of motion encountered in a scene. However, the quality and reliability of these methods rely heavily on detection and tracking algorithms that are error-prone due to noise, lighting conditions, climate changes, and occlusion. In addition, clustering trajectories requires comparing the similarity of all samples, which can be computationally expensive [19].
To address these difficulties, some researchers [20,21] directly use low-level motion or appearance features such as optical flow. In the second category, the motion in video frames is extracted without exact detection and tracking algorithms. A model of motions and activities is then constructed directly from these extracted properties [3]. In [21], a statistical model of optical flow features was employed to obtain a representation of the motion patterns. In [22,23], typical activities were characterised by hierarchical Bayesian models such as LDA and HDP. A dynamic dual-HDP non-parametric Bayesian model was used in [24] to automatically model activity categories and semantic regions without specifying the number of topics and with online updates of the model. In [25], a two-level LDA TM was learned, first for single-agent motion, whose output is input to a second-level LDA for multi-agent interactions. Fast rank-1 robust PCA was used for foreground detection, with counts of pixels in blocks used as input for DPMM learning, enabling incremental learning and inference [26]. In [1,12,27], LDA and sparse coding were used for efficient learning, scene understanding, and abnormal event detection. A two-stage cascaded LDA model was formulated in [28] for automatic discovery and learning of behavioural context for video-based complex behaviour recognition and anomaly detection. The study in [29] relied on PLSA to identify abnormal activities and repetitive cycles, while [30] used an LDA to detect usual and unusual activities in the traffic scene. A topic-related STC was proposed for abnormality detection in traffic patterns [27]. In [31], a semi-supervised method based on a sparse TM was proposed to detect anomalies in video surveillance. Table 3 shows some important work on traffic analysis with TMs.

TRAFFIC MOTION PATTERNS
At intersections, some of the usual activities, called traffic motion patterns, happen regularly and periodically. Actions such as turning left, turning right, and passing through an intersection occur at almost all intersections. Twelve usual traffic motion patterns and their descriptions are depicted in Figure 1 and Table 4.

AUTOMATIC TRAFFIC MOTION PATTERNS DETECTION
In traffic video, there are visual features such as vehicle motion and motion patterns in different directions. To extract traffic motion patterns, the occurrences of visual features and activities in clips should be converted to a predefined vocabulary. The general block diagram of the proposed approach is depicted in Figure 2. There are three main steps. In the first step, the input traffic video is divided into a set of non-overlapping clips. In the second step, motion vectors of each clip are calculated with dense and sparse optical flow for each pair of consecutive frames and converted into flow words. Finally, the histograms of flow words from the second step are converted to a matrix called bag-of-words and given to the GSTC model for the extraction of traffic motion patterns. In the following subsections, each step is explained in detail.

FIGURE 2
The general block diagram of the proposed approach for traffic motion patterns extraction

Converting to short clips
In the proposed approach, the TM has been used. A TM treats a web page or a single article as a document in NLP and text mining, while our data is video. Therefore, the input video should be considered as a collection of documents. There are different words in each document, while in video frames and clips there are different visual features. Based on this, there is a mapping strategy from video to document representation. The whole video is divided into non-overlapping subclips, and each subclip is treated as a document.
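This mapping can be sketched in a few lines; `clip_len` is the number of frames per document (75 or 150 in the experiments reported later), and dropping an incomplete trailing clip is an assumption, since the study does not state how a remainder is handled:

```python
def split_into_clips(frames, clip_len):
    """Divide a frame sequence into non-overlapping clips, each of which
    plays the role of one 'document' for the topic model."""
    clips = [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]
    if clips and len(clips[-1]) < clip_len:
        clips.pop()  # assumed: discard the incomplete tail clip
    return clips

# Example: 10 frames with 3-frame clips yield 3 complete clips.
clips = split_into_clips(list(range(10)), 3)  # [[0,1,2], [3,4,5], [6,7,8]]
```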

Modelling traffic flows
Pixel displacement is a key feature for detecting general activities between video frames. Hence, this property is considered a key feature to detect traffic motion patterns automatically. Pixel displacement is raw data and should be converted into a fixed-size vector for use in the learning algorithm. The left-hand side of Figure 3 depicts the eight directions, north, northeast, northwest, south, southeast, southwest, east, and west, for each pixel displacement between two consecutive frames. These different directions are converted into a fixed vector as depicted in Figure 3. All pixel displacements located in cells or blocks are converted into a fixed vector (ℝ^N) of size N = number of cells × 8. Each entry in the vector shows the number of occurrences of a specific direction in that block. In the following subsections, the modelling of traffic flows with dense and sparse optical flow is described.
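Quantising one displacement into the eight directions reduces to binning its angle. Here is a minimal sketch, assuming bin 0 is centred on the +u axis and bins increase counter-clockwise in the mathematical (y-up) convention; the exact bin layout of the original implementation is not specified, and in image coordinates (y pointing down) the compass labels would be mirrored:

```python
import math

def quantise_direction(u, v, bins=8):
    """Map a displacement (u, v) to one of `bins` direction indices.
    Bin 0 is centred on the +u axis (assumed layout)."""
    angle = math.atan2(v, u) % (2 * math.pi)  # angle in [0, 2*pi)
    width = 2 * math.pi / bins                # angular width of one bin
    return int(((angle + width / 2) % (2 * math.pi)) // width)

quantise_direction(1, 0)    # 0  (+u axis)
quantise_direction(1, 1)    # 1  (diagonal)
quantise_direction(0, 1)    # 2  (+v axis)
quantise_direction(-1, 0)   # 4  (-u axis)
quantise_direction(0, -1)   # 6  (-v axis)
```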

Modelling traffic flows with dense optical flow
The dense optical flow provides a motion vector per pixel between two consecutive frames. Figure 4 depicts the flowchart of flow word extraction with the dual TV-L1 algorithm. We first give each pair of consecutive frames to the dual TV-L1 algorithm.
The motion-vector matrix is divided into C_x × C_y square cells, each of which contains p × p pixels. A threshold is applied to the motion vectors to eliminate noise and keep reliable flows. To generate the flow words, we sample the remaining motion vectors, s_i = (x, y, u, v), where (x, y) lies on a grid with p-pixel spacing. Then the components (u, v) of the sampled motion vectors are quantised into eight directions.
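The sampling and thresholding steps above can be sketched as follows; the flow field is given as two per-pixel component arrays, and the grid spacing p, the magnitude threshold, and the cell grid are illustrative values, not the ones tuned in the paper:

```python
import math

def extract_flow_words(flow_u, flow_v, p=8, min_mag=1.0, cells_x=10, cells_y=10):
    """From a dense flow field (one (u, v) per pixel), keep reliable vectors
    via a magnitude threshold, sample them on a grid with p-pixel spacing,
    and emit flow-word indices combining cell and direction."""
    h, w = len(flow_u), len(flow_u[0])
    cell_w, cell_h = w / cells_x, h / cells_y
    words = []
    for y in range(0, h, p):
        for x in range(0, w, p):
            u, v = flow_u[y][x], flow_v[y][x]
            if math.hypot(u, v) < min_mag:
                continue  # discard noisy, near-zero flow
            angle = math.atan2(v, u) % (2 * math.pi)
            direction = int(((angle + math.pi / 8) % (2 * math.pi)) // (math.pi / 4))
            cell = int(y // cell_h) * cells_x + int(x // cell_w)
            words.append(cell * 8 + direction)  # word index = cell * 8 + direction
    return words
```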

Modelling traffic flows with sparse optical flow
The sparse optical flow provides a motion vector for a set of predetermined pixels that have distinctive features such as edges, corners, and textures. The flowchart of flow word extraction with the Lucas-Kanade algorithm is illustrated in Figure 5. For every two consecutive frames, we first find key points using the Shi-Tomasi corner detector [32]. Then, frames are divided into C_x × C_y square cells, each of which contains p × p pixels. The motion vector of each key point is computed with the Lucas-Kanade algorithm. Again, a threshold is applied to the motion vectors to eliminate noise and keep reliable flows. Motion vectors are denoted by (x, y, θ). The positions (x, y) are quantised to the nearest position on a grid of p-pixel spacing, and the angles of the motion vectors, θ, are quantised into eight directions.
Finally, a fixed vocabulary V = {1, …, N}, N = C_x × C_y × 8, is formed. Each word encodes two aspects of the content: location and direction. The flow words of a video clip are collected over all its frames. A video clip is then represented as a vector d_m = (d_{m,1}, …, d_{m,N})^T, in which each entry specifies the number of occurrences of flow word n in the clip. In this study, we define a motion pattern as a spatial distribution of flow words. Motion patterns, denoted as β, correspond to the latent topics in a TM. Each row of β is a topic basis, which is a distribution over the vocabulary V, that is, β_{k,:} ∈ P, where P is an (N − 1)-simplex.
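Turning the flow words of each clip into the count vector d_m, and stacking the clips as columns of the bag-of-words matrix D, can be sketched as:

```python
from collections import Counter

def clip_to_bow(flow_words, vocab_size):
    """Represent one clip as an N-dimensional count vector d_m over the
    flow-word vocabulary (here N = C_x * C_y * 8)."""
    counts = Counter(flow_words)
    return [counts.get(n, 0) for n in range(vocab_size)]

def build_bow_matrix(clips_words, vocab_size):
    """Stack per-clip count vectors as columns of the N x M bag-of-words matrix D."""
    cols = [clip_to_bow(words, vocab_size) for words in clips_words]
    return [[col[n] for col in cols] for n in range(vocab_size)]

# Example with a toy vocabulary of 5 words and two clips.
d_m = clip_to_bow([0, 0, 3], 5)          # [2, 0, 0, 1, 0]
D = build_bow_matrix([[0, 0], [1]], 3)   # columns: [2,0,0] and [0,1,0]
```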

Learning traffic motion patterns
The learning step aims to cluster co-occurring flow words into traffic motion patterns. First, all the flow words extracted in the previous step are transformed into a matrix called bag-of-words, in which each column holds the flow words of a single clip. This matrix is the input to the learning algorithm. As can be seen in the last part of Figure 2, learning traffic motion patterns, flow word codes are extracted from the bag-of-words matrix, and the extracted flow word codes are then clustered into motion patterns.

Learning algorithm formulation
A typical solution to the objective function in Equation (2) is alternating optimisation, since the objective is bi-convex in the word codes and the dictionary. In Equation (3), as in the GSTC formulation in [12], the ℓ2 norm of the reconstruction error is minimised.

Learning flow word code
This step aims to find the portion of the extracted flow words of a clip in each motion pattern. Cube S in Figure 2 depicts the flow word codes of the clips. As mentioned, the objective function is bi-convex, so the algorithm optimises S while the dictionary β is fixed. Since the word codes are grouped into topics and each group is separable, the block coordinate descent (BCD) method is adopted for this optimisation in [9,12], solving the problem for each S_{nk} alternately, as formulated in Equation (4). Instead of calculating each S_{nk} separately, the whole matrix S is obtained by writing the optimisation problem as Equation (5), subject to S > 0, where T is a diagonal matrix of document D_{:,m} and R = diag(1/(||S_{1,:}||_2 + ε), …, 1/(||S_{K,:}||_2 + ε)). The encoding algorithm is depicted in Figure 7.
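Since Equations (4)-(5) did not survive extraction intact, the flavour of this update can be conveyed with a simplified projected-gradient step on the per-clip sub-problem. This is a stand-in for the exact BCD update of [9], not the paper's implementation; λ, the learning rate, and the clamping at zero are illustrative choices:

```python
import math

def word_code_step(S, beta, d, lam=0.1, lr=0.01, eps=1e-8):
    """One projected-gradient step on the per-clip GSTC sub-problem:
    minimise sum_n [recon_n - d_n*log(recon_n)] + lam * sum_k ||S[:,k]||_2
    over S >= 0, where recon_n = S[n,:] . beta[:,n].
    S: N x K word codes, beta: K x N topic bases, d: length-N counts."""
    N, K = len(S), len(S[0])
    col_norms = [math.sqrt(sum(S[n][k] ** 2 for n in range(N))) + eps
                 for k in range(K)]
    new_S = [row[:] for row in S]
    for n in range(N):
        recon = sum(S[n][k] * beta[k][n] for k in range(K)) + eps
        for k in range(K):
            # gradient of the KL-style loss plus the group-lasso subgradient
            grad = beta[k][n] * (1.0 - d[n] / recon) + lam * S[n][k] / col_norms[k]
            new_S[n][k] = max(0.0, S[n][k] - lr * grad)  # project onto S >= 0
    return new_S

def objective(S, beta, d, lam=0.1, eps=1e-8):
    """The sub-problem objective, useful for checking that steps make progress."""
    N, K = len(S), len(S[0])
    loss = 0.0
    for n in range(N):
        recon = sum(S[n][k] * beta[k][n] for k in range(K)) + eps
        loss += recon - d[n] * math.log(recon)
    loss += lam * sum(math.sqrt(sum(S[n][k] ** 2 for n in range(N)))
                      for k in range(K))
    return loss
```

On a toy instance, repeated steps should drive the objective down toward the point where the reconstructions match the observed counts.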

Learning motion patterns
In this step, all the extracted flow word codes of the clips are clustered into K groups in the matrix β. Each row of β is a unigram distribution over flow words that shows a general traffic motion pattern of the scene. The matrix β is updated by solving the optimisation problem in Equation (6). Instead of computing each β_{kn} separately, Equation (6) is transformed into Equation (7) to solve for the entire matrix β, where β^T and S_m^T are the transposed matrices and T is a diagonal matrix of document D_{:,m}. The encoding algorithm is shown in Figure 8.
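The dictionary step can likewise be sketched. Because each row of β must remain a unigram distribution, the sketch below takes a gradient step and then renormalises each row, which is a simplification of the exact simplex projection implied by Equation (7); the learning rate and the clamping at zero are illustrative:

```python
def update_dictionary(beta, S_list, D_cols, lr=0.01, eps=1e-8):
    """One gradient step on beta followed by renormalising each row onto the
    probability simplex (a simplified stand-in for the exact update).
    beta: K x N topic bases, S_list: per-clip N x K word codes,
    D_cols: per-clip length-N count vectors."""
    K, N = len(beta), len(beta[0])
    new_beta = [row[:] for row in beta]
    for k in range(K):
        for n in range(N):
            grad = 0.0
            for S, d in zip(S_list, D_cols):
                recon = sum(S[n][j] * beta[j][n] for j in range(K)) + eps
                grad += S[n][k] * (1.0 - d[n] / recon)
            new_beta[k][n] = max(0.0, beta[k][n] - lr * grad)
    for k in range(K):
        # renormalise each topic so it stays a unigram distribution over V
        s = sum(new_beta[k]) + eps
        new_beta[k] = [x / s for x in new_beta[k]]
    return new_beta
```

Alternating this step with the word-code update realises the bi-convex optimisation described above.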

Environmental setup and dataset
The proposed approach was implemented in the Visual Studio C++ programming environment. OpenCV [34] functions were used to compute the Lucas-Kanade and dual TV-L1 optical flows. The linear algebra C++ library Armadillo [35] was used to implement the learning algorithm. The hardware was a PC with an Intel Core i7-4790 CPU and 16 GB of RAM. Performance evaluation of the proposed approach was carried out on the QMUL junction video dataset [36]. Figure 9 depicts one sample frame of this dataset, which was captured at 25 frames per second with 360 × 288 resolution at an intersection in London. In this dataset, which contains 90,000 frames, the density of vehicles is relatively high and the motions are therefore quite complex. Moreover, this video dataset has become a favourite for automatic traffic analysis. The twelve traffic motion patterns and their descriptions have already been presented in Table 4 and Figure 1.
Nine of these patterns exist in the QMUL dataset, as shown in Table 5. It should be noted that the angle of the camera and its height above the ground play an important role in extracting correct and exact motion vectors. We used 200 clips of 75 frames each for the GSTC + dual TV-L1 method and 100 clips of 150 frames each for the GSTC + Lucas-Kanade method.

Parameter settings
The whole video was divided into 3-s subclips for GSTC + dual TV-L1 and 6-s subclips for GSTC + Lucas-Kanade; as a result, the two methods have different numbers of clips. For both optical flow algorithms, the scene was divided into 10 × 10 cells. Table 6 presents the results obtained with both the GSTC + dual TV-L1 and GSTC + Lucas-Kanade methods. Both algorithms extract the nine visual patterns existing in the dataset, while GSTC + dual TV-L1, which uses a dense algorithm, extracts more visual patterns than the other. At an intersection, each frame and clip may contain a combination of different visual features, because vehicles pass in different directions simultaneously from the fixed camera's point of view.

Experimental results
In the dense algorithm, all pixels are considered in the calculation, so more combinations of traffic patterns are captured; this behaviour is not seen in the sparse algorithm. In GSTC + dual TV-L1, 17 of the 20 clusters contain one, two, or three motion patterns, while only 12 such clusters occur with the second algorithm. In other words, the first algorithm extracts more traffic motion patterns than the second algorithm, which uses sparse optical flow.
Some extracted traffic motion patterns using GSTC + dual TV-L1 and GSTC + Lucas-Kanade are depicted in Figures 10 and 11, respectively. As mentioned, dense optical flow provides motion vectors for all pixels, which yields more flow words per clip than sparse optical flow. Therefore, more meaningful motion patterns were detected.
A comparison of our results using the GSTC + dual TV-L1 and GSTC + Lucas-Kanade algorithms with the improved STC with dual TV-L1 approach [1] and the improved GSTC with Lucas-Kanade optical flow [12] is presented in Table 7. Our approach extracts more meaningful traffic motion patterns. For example, motion pattern number 11 in Figure 10(d) has been detected, while it is not detected in Figure 12 using STC and dual TV-L1 [1]. As another example, more motion patterns, such as #3 and #6 in Figures 11(a) and (b), have been extracted, compared to just four motion patterns detected in Figure 13 using improved GSTC and Lucas-Kanade [12]. Moreover, by extracting meaningful traffic motion patterns we try to answer the question 'What is going on?' at an intersection, whereas in [1,12] motion patterns were used for abnormality detection and a clip was classified as normal or abnormal. In fact, Table 7 compares the visible motion patterns in the dataset (Table 5) and the motion patterns extracted with both methods (Figures 10 and 11) with respect to the defined common patterns (Figure 1 and Table 4). In other words, Figure 1 shows the possible patterns in the dataset, while Table 5 indicates the number of visible motion patterns. The studies in [1,12] used these motion patterns for abnormality detection: they classified a clip as normal or abnormal, and if a clip cannot be described with the obtained patterns, it is considered an anomaly.

Discussion
Although the combination of the TM and dense optical flow yields more motion patterns, because the motion vector of every pixel is computed, obtaining motion vectors for all available pixels imposes a high computational cost. On the other hand, the combination of the TM and sparse optical flow has a low computational cost, but it extracts fewer motion patterns, and selecting the pixels for which to calculate the motion vector is a challenging task. In other words, pixel displacement between frames is used as the feature and the TM as the learning algorithm to extract traffic motion patterns, with no need for object detection and tracking algorithms. To increase performance, the vehicles could be detected in the frames, for example using the YOLO framework, and the motion vectors of the detected objects computed. It is then possible to create a new bag-of-words matrix in which each column belongs to a single object, in contrast to the common method in which each column belongs to a clip. Combining object detection and sparse optical flow with clustering is our future study.

FIGURE 13
Some extracted motion patterns using improved GSTC and Lucas-Kanade [12]

CONCLUSIONS
In urban traffic, intersections play an essential role. The traditional method of video surveillance is not suitable, owing to the huge amount of video data and the unreliability of human operators; a system that automatically captures traffic motion patterns is required. In this study, an unsupervised approach was presented to extract traffic motion patterns from surveillance video by converting them into visual words. First, the whole traffic video dataset was divided into clips. The local motion vectors were extracted by dual TV-L1 as a dense optical flow for all pixels and by Lucas-Kanade as a sparse optical flow for a set of specific pixels, and were then converted to flow words. Given that at intersections motions such as turning left, turning right, and passing through the intersection are template traffic patterns, we used the GSTC algorithm to learn traffic motion patterns. The experimental results show that GSTC + dual TV-L1 extracts more traffic motion patterns than the GSTC + Lucas-Kanade algorithm and some previous studies.