CNN-based estimation of heading direction of vehicle using automotive radar sensor

Modern autonomous vehicles are equipped with various automotive sensors to perform specialised functions. In particular, it is important to predict the heading direction of the front vehicle to adjust the speed of the ego-vehicle and select appropriate actions. Here, we propose a method for estimating the instantaneous heading direction of a vehicle using automotive radar sensor data. First, using a frequency-modulated continuous wave (FMCW) radar in the 77 GHz band, we accumulate automotive radar sensor data for different movements of the front vehicle (e.g., stopping, going ahead, reversing, turning left, and turning right). To distinguish the different movements of the vehicle, we use a convolutional neural network (CNN) and train it using the acquired radar sensor data. Because the CNN algorithm usually uses image data as input, it is essential to convert the radar sensor data into image data. Therefore, we apply a high-resolution angle estimation algorithm to the obtained radar data and convert it into a two-dimensional range-angle map. After the CNN model is trained with the obtained radar sensor data, the various movements of the front vehicle can be classified.


| INTRODUCTION
In recent years, people's interest in autonomous driving has been rapidly increasing. A key element for realising autonomous driving is the development of automotive sensors. Automotive sensors such as cameras, lidars, and radars each play a unique role in assisting autonomous driving. Autonomous driving using a single type of sensor is not possible; data received from various sensors must be appropriately combined to respond to diverse driving conditions. Among these sensors, the radar sensor is considered essential because it performs stably even under adverse weather conditions and has a wide detectable range compared to other sensors [1]. Radar systems for automobiles modulate their waveforms in various ways to estimate target information such as relative distance or velocity. The most widely used radar system is the frequency-modulated continuous wave (FMCW) radar. In an FMCW radar, the linear frequency modulation (LFM) [2] method, which uses a signal with a linearly increasing frequency, is commonly employed. This modulation technique is efficient for target detection because it enables the joint estimation of distance and velocity at a high resolution while consuming low power. When the LFM waveform is used, the distance and velocity of the target can be estimated by analysing the received signal in the frequency domain [3].
Recently, automotive radars are required to perform various functions beyond detection and tracking. For example, modern autonomous driving functions require an image of the front scene, which can be realised by obtaining high-resolution radar data. To this end, radar systems using multiple-input multiple-output (MIMO) antenna systems are gaining significant attention. The MIMO radar can provide more sophisticated target detection results because it enables the precise estimation of the target's angle information [4]. In addition, by applying machine learning algorithms to the received data, automotive radars can be used for a wide variety of applications. For example, they can be used to identify the type of detected target [5], recognise the driving environment [6,7], or image the detected target in three dimensions [8].
Among the various autonomous driving functions, it is important to predict the movement of the front vehicle. The information about the front vehicle's movement can be used to automatically adjust the speed of the ego-vehicle or to select appropriate actions. In this regard, the authors in [9] used a mono camera to estimate the heading direction of the front vehicle. Also, in [10], the heading information obtained from a lidar sensor was used for tracking multiple targets. However, to the best of our knowledge, no research has been performed to identify the motion of a front vehicle using a radar sensor alone. Therefore, in this paper, we propose a method for estimating the instantaneous heading direction of the front vehicle by applying a machine learning algorithm to automotive radar sensor data. We use an FMCW MIMO radar system and collect data by measuring radar signals for various movements of the front vehicle. Specifically, we accumulate radar sensor data when the vehicle in front is stationary, going straight, going backward, turning left, and turning right. Then, we use a convolutional neural network (CNN) [11] to classify these five different movements. The CNN is a widely used deep learning algorithm that has recently been applied to radar sensor data. For example, the authors in [12] attempted to monitor parking spaces by applying a CNN to radar-image data. In addition, in [13], various hand gestures were classified by applying a three-dimensional (3D) CNN to the processed radar data. Because the CNN takes images as input data, it is essential to convert the radar signal received by the MIMO antenna into an image format. Therefore, we transform the 3D radar data cube composed of distance, velocity, and angle information into an intuitive 2D range-angle map and use it as the input to the network. The CNN structure was determined by testing the performance of the network for various combinations of the number of blocks and filters.
Using the trained model, our proposed algorithm was able to estimate the heading direction of the vehicle with an accuracy higher than 94%.
The remainder of this paper is organised as follows. In Section 2, we describe the basic signal processing of the FMCW MIMO radar system. Then, in Section 3, we introduce how to convert radar data into an image format suitable for CNN input. In addition, the framework of the CNN is discussed. Next, the classification performance when the proposed network is trained with the acquired radar data is presented in Section 4. Finally, we conclude this paper in Section 5.

| Basic principles of FMCW radar
In an FMCW radar system, a voltage-controlled oscillator (VCO) creates an LFM signal whose frequency increases linearly as a function of time [14]. The time-frequency slopes of the transmitted and received LFM signals are shown in Figure 1. The transmitted waveform $S(t)$ can be expressed as

$$S(t) = A_S \cos\left(2\pi\left(f_c t + \frac{K}{2}t^2\right)\right),$$

where $A_S$ and $f_c$ denote the amplitude and carrier frequency of the modulated waveform. In addition, $K = B/T_{sw}$ indicates the time-frequency slope, where $B$ and $T_{sw}$ denote the operating bandwidth and sweep time, respectively.
When this transmitted waveform $S(t)$ is reflected from $M$ targets, the received signal $R(t)$ can be expressed as

$$R(t) = \sum_{m=1}^{M} A_{R,m} \cos\left(2\pi\left(f_c (t - t_{d,m}) + \frac{K}{2}(t - t_{d,m})^2\right)\right),$$

where $A_{R,m}$ and $t_{d,m} = 2 r_m / c$ denote the amplitude and round-trip time delay of the signal reflected from the $m$-th target at range $r_m$. Then, as $R(t)$ passes through the frequency mixer, it is multiplied with the transmitted waveform $S(t)$ generated by the VCO. The output of the mixer passes through a low-pass filter (LPF) to remove the high-frequency component, and a baseband signal is obtained. The operation can be expressed as

$$X(t) = \mathcal{L}\left(S(t)\,R(t)\right),$$

where $\mathcal{L}(\cdot)$ indicates the LPF output signal. By using quadrature modulation, we can obtain the in-phase and quadrature components of $X(t)$, which can be expressed as

$$X_I(t) = \sum_{m=1}^{M} \frac{A_S A_{R,m}}{2} \cos\left(2\pi\left(K t_{d,m} t + f_c t_{d,m}\right)\right)$$

and

$$X_Q(t) = \sum_{m=1}^{M} \frac{A_S A_{R,m}}{2} \sin\left(2\pi\left(K t_{d,m} t + f_c t_{d,m}\right)\right).$$

Here, $X_I(t)$ and $X_Q(t)$ represent the in-phase and quadrature components, respectively.
When multiple LFM signals are repeatedly transmitted at high speed, $r_m$ in Equations (4) and (5) can be replaced by $r_m + v_m p T_{sw}$ [15], where $p$ is the index of each chirp. Similarly, the time delay $t_{d,m}$ can be replaced by $2(r_m + v_m p T_{sw})/c$. By considering this velocity term and sampling the signals at $t = n T_s$, Equations (4) and (5) are modified as follows:

$$X_I[n,p] = \sum_{m=1}^{M} \frac{A_S A_{R,m}}{2} \cos\left(2\pi\left(\frac{2 K r_m}{c} n T_s + \frac{2 f_c v_m T_{sw}}{c} p + \frac{2 f_c r_m}{c}\right)\right)$$

and

$$X_Q[n,p] = \sum_{m=1}^{M} \frac{A_S A_{R,m}}{2} \sin\left(2\pi\left(\frac{2 K r_m}{c} n T_s + \frac{2 f_c v_m T_{sw}}{c} p + \frac{2 f_c r_m}{c}\right)\right),$$

where $T_s$ is the sampling period. Finally, the baseband signal is given as

$$X[n,p] = X_I[n,p] + j X_Q[n,p].$$

This signal $X[n, p]$ is commonly referred to as the beat signal. If we apply the 2D fast Fourier transform (FFT) to $X[n, p]$, two peak values can be extracted at

$$\hat{f}_b = \frac{2 K r_m}{c} \quad \text{and} \quad \hat{f}_d = \frac{2 f_c v_m}{c}.$$

Thus, using these frequencies extracted from the 2D-FFT, we can estimate $r_m$ and $v_m$ as follows:

$$\hat{r}_m = \frac{c \hat{f}_b}{2K} \quad \text{and} \quad \hat{v}_m = \frac{c \hat{f}_d}{2 f_c}.$$

Here, $K$, $f_c$, and $c$ are fixed radar parameters.
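The range-velocity estimation described above can be sketched numerically. The following minimal simulation (not the paper's code; all parameter values are assumptions chosen for illustration) synthesises the beat signal for one target and recovers its range and velocity from the 2D-FFT peak:

```python
import numpy as np

# Illustrative FMCW simulation; parameter values are assumptions, not the paper's.
c = 3e8            # speed of light [m/s]
fc = 77e9          # carrier frequency [Hz]
B = 1e9            # sweep bandwidth [Hz]
Tsw = 50e-6        # sweep (chirp) time [s]
K = B / Tsw        # time-frequency slope [Hz/s]
N, P = 256, 128    # samples per chirp, number of chirps
Ts = Tsw / N       # sampling period [s]

r, v = 12.0, 4.0   # true range [m] and radial velocity [m/s] of the target
n = np.arange(N)[:, None]
p = np.arange(P)[None, :]
# Beat-signal phase: range term along the n-axis, Doppler term along the p-axis
phase = 2 * np.pi * ((2 * K * r / c) * n * Ts + (2 * fc * v * Tsw / c) * p)
X = np.exp(1j * phase)

# 2D FFT: the peak location gives the beat and Doppler frequencies
spec = np.abs(np.fft.fft2(X))
ni, pi = np.unravel_index(np.argmax(spec), spec.shape)
fb = ni / (N * Ts)           # estimated beat frequency [Hz]
fd = pi / (P * Tsw)          # estimated Doppler frequency [Hz]
r_hat = c * fb / (2 * K)     # range estimate
v_hat = c * fd / (2 * fc)    # velocity estimate
```

With these numbers the range falls exactly on an FFT bin, so `r_hat` is exact, while `v_hat` is quantised by the Doppler bin width of roughly 0.3 m/s.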

| Signal processing of MIMO radar
To estimate the angle as well as the range and velocity of the target, automotive radar sensors are being equipped with array antenna systems [3]. Furthermore, to improve the angular resolution of the array antenna system, MIMO antenna systems capable of generating virtual receiving channels are becoming widely used in the automotive industry [4]. When $N_T$ transmit antenna elements and $N_R$ receiving antenna elements are used, a total of $N_T \times N_R$ receiving channels are generated in a MIMO antenna system. The position of each virtual antenna element is given by the sum of the positions of the corresponding transmit and receiving elements; in particular, when $d_T = N_R d_R$, the virtual elements form a uniform linear array with spacing

$$d = d_R,$$

where $d_T$ is the spacing between transmit antenna elements and $d_R$ is the spacing between receiving antenna elements.
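The virtual-channel construction can be illustrated with a small sketch. The antenna counts and spacings below are assumptions (a 2 TX, 4 RX configuration with $d_T = N_R d_R$, which yields the eight virtual elements used later in the paper):

```python
# Sketch: virtual receive channels of a MIMO array (values are assumptions).
# With d_T = N_R * d_R, the N_T * N_R virtual elements form a uniform
# linear array with spacing d_R.
N_T, N_R = 2, 4          # assumed 2 TX and 4 RX antenna elements
d_R = 0.5                # RX spacing in wavelengths
d_T = N_R * d_R          # TX spacing chosen to yield a uniform virtual array

tx = [i * d_T for i in range(N_T)]
rx = [j * d_R for j in range(N_R)]
# Each virtual element position is the sum of a TX and an RX position
virtual = sorted(t + r for t in tx for r in rx)
print(virtual)  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
```

The eight resulting positions are uniformly spaced by `d_R`, so the virtual array behaves like a single 8-element receive array with double the physical aperture.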
Moreover, the time delay of the $m$-th target at the $l$-th virtual antenna element can be written as

$$t_{d,m}^{l} = \frac{2(r_m + v_m p T_{sw})}{c} + \frac{l d \sin\theta_m}{c},$$

where $l$ ($l = 0, 1, \ldots, N_T N_R - 1$) is the index of the receiving channel and $\theta_m$ is the angle of the $m$-th target. Using the above equation, the beat signal $X[n, p]$ is modified as

$$X[n,p,l] = \sum_{m=1}^{M} \frac{A_S A_{R,m}}{2} \exp\left(j 2\pi\left(\frac{2 K r_m}{c} n T_s + \frac{2 f_c v_m T_{sw}}{c} p + \frac{f_c d \sin\theta_m}{c} l + \frac{2 f_c r_m}{c}\right)\right),$$

which is a 3D signal having an $n$-axis, $p$-axis, and $l$-axis for the range, velocity, and angle information, respectively. This signal is commonly referred to as a data cube in the radar literature, and we will use this signal to estimate the instantaneous heading direction of the vehicle.

| Radar signal measurement scenarios
In our experiment, we used the AWR1642BOOST [16] automotive radar sensor evaluation kit manufactured by Texas Instruments. The radar kit is connected to the DCA1000EVM module to capture the data, as shown in Figure 2. The radar system parameters are shown in Table 1. We installed the radar 1 m above the ground in an outdoor parking lot. As shown in Figure 3b, the area in front of the radar kit is divided into nine zones. The size of each zone is set as 4 m × 4 m by considering the width of vehicle lanes and the size of the vehicle, and safety cones are placed at the vertices of each zone. The size of the zone did not affect the outcome provided that the full image of a front vehicle was obtained. To distinguish the different movements of the vehicle, the experiments are divided into five different scenarios, as shown in Figure 4. In scenario A, the vehicle is stationary; in scenarios B and C, the vehicle is driving forwards and backwards; and in scenarios D and E, the vehicle is turning left and right, respectively. In all measurements, the vehicle speed was in the range of 0–15 km/h. The total number of measurements for the five scenarios was 60. A single measurement consists of 200 data cubes, so we collected a total of 12,000 data cubes.

| Generation of input data for CNN
Because the CNN uses image data as input, it is important to convert the radar signal into an image format suitable for learning.
In this study, we use a 2D range-angle map as the network input. Therefore, it is important to obtain the range-angle information at a high resolution. We can express the radar signal in Equation (13) as a 3D data cube along the $n$, $p$, and $l$ axes, as shown in Figure 5. As described in Equations (9) and (10), the range and velocity information of the target can be estimated by applying the FFT to the $n$ and $p$ axes. Similarly, by applying the FFT to the $l$-axis, the angle of the target can be estimated as

$$\hat{\theta}_m = \sin^{-1}\left(\frac{c \hat{f}_a}{f_c d}\right),$$

where $\hat{f}_a$ is the estimated frequency along the $l$-axis.
However, when the angle of the target is estimated through the FFT, the angular resolution is limited by the number of antenna elements. Each antenna element is regarded as a sampled value, and because the number of antenna elements is generally smaller than the number of samples per chirp or the number of chirps, the angular resolution is considerably worse than the range or velocity resolution. For example, the number of virtual antenna elements is 8 in our MIMO system, whereas the number of chirps is 128 and the number of samples per chirp is 256. Therefore, instead of applying the FFT to the $l$-axis, we use the MUSIC [17] algorithm, a high-resolution frequency estimation algorithm based on an eigenspace method. The MUSIC algorithm first computes a covariance matrix from the received signal samples. Then, by performing an eigendecomposition of the covariance matrix, the eigenvectors corresponding to the signal and noise subspaces are obtained. The MUSIC algorithm uses the idea that any signal vector belonging to the signal subspace is orthogonal to the noise subspace. Using this orthogonality condition, a MUSIC pseudo-spectrum is formed such that a peak occurs at the target's angle.
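As a sketch of the MUSIC steps just described (covariance, eigendecomposition, orthogonality-based pseudo-spectrum), the following toy example estimates the angle of a single simulated source on an 8-element half-wavelength virtual array; the SNR, snapshot count, and angle grid are illustrative assumptions:

```python
import numpy as np

# Toy MUSIC example for an 8-element uniform virtual array (half-wavelength
# spacing); all simulation values are assumptions, not measured data.
rng = np.random.default_rng(0)
L = 8                        # number of virtual antenna elements
theta_true = 20.0            # true source angle [deg]
snapshots = 200

l = np.arange(L)
a = np.exp(1j * np.pi * l * np.sin(np.deg2rad(theta_true)))  # steering vector
s = rng.standard_normal(snapshots) + 1j * rng.standard_normal(snapshots)
noise = 0.1 * (rng.standard_normal((L, snapshots))
               + 1j * rng.standard_normal((L, snapshots)))
X = np.outer(a, s) + noise   # received snapshots across the array

R = X @ X.conj().T / snapshots        # sample covariance matrix
eigval, eigvec = np.linalg.eigh(R)    # eigenvalues in ascending order
En = eigvec[:, :-1]                   # noise subspace (one source assumed)

angles = np.linspace(-90, 90, 721)
A = np.exp(1j * np.pi * np.outer(l, np.sin(np.deg2rad(angles))))
# Pseudo-spectrum: large where steering vectors are orthogonal to En
P = 1.0 / np.linalg.norm(En.conj().T @ A, axis=0) ** 2
theta_hat = angles[np.argmax(P)]
```

Even with only eight elements, the pseudo-spectrum peaks sharply at the source angle, which is the property the range-angle map construction relies on.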
The overall procedure of our proposed method is summarised in Figure 6. First, we apply the FFT to the $n$-axis to extract the range information of the target. If the number of FFT points is $N_n$, the size of the data after the FFT becomes $(N_T N_R) \times N_n$. Then, we apply the MUSIC algorithm to the $l$-axis for every range index. Through this process, the frequencies along the $n$-axis and $l$-axis are extracted, and we can create a high-resolution range-angle map. Figure 7 shows the range-angle map and the converted x-y range map when the front vehicle turns left. The value of each pixel represents the amplitude of the signal. As shown in the figure, the signals are strongly reflected by the wheels [18,19], and we can identify the instantaneous heading direction of the vehicle through this image. Therefore, we use this single high-resolution range-angle map as the input to the CNN and train the network.

| Framework of CNN
The structure of the CNN used in our work is shown in Figure 8. The range-angle map data shown in Figure 7a is used as the input to the network. First, normalisation is applied to the input data to centre each sample at the origin. Then, the input data is passed through multiple blocks of convolutional and pooling layers. In the convolutional layer, the input data is convolved with filters to extract features of the image. In addition, zero padding is applied to keep the dimension of the output equal to that of the input. The size of each filter is $3 \times 3$, and the number of filters is $2^n$. The appropriate numbers of blocks and filters will be discussed in Section 4. Next, the output of the convolutional layer passes through a rectified linear unit (ReLU) layer to provide nonlinearity to the network. Then, max pooling is performed, which is a downsampling technique that helps prevent overfitting. Through this process, the size of the input data is reduced and features are extracted from the image data.
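Under the assumption that each block's max pooling uses a 2 × 2 window with stride 2 (the exact pooling parameters are not stated above), the shrinkage of the 127 × 51 input through the blocks can be traced as follows:

```python
# Sketch of how the 127 x 51 range-angle map shrinks through the network:
# each block applies a zero-padded 3x3 convolution (size-preserving) followed
# by 2x2 max pooling (floor-halving, an assumed configuration).
def block_output_size(h, w, n_blocks, pool=2):
    for _ in range(n_blocks):
        # zero-padded convolution keeps (h, w); pooling floor-halves both
        h, w = h // pool, w // pool
    return h, w

h, w = block_output_size(127, 51, n_blocks=2)
n_filters = 32
flat = h * w * n_filters   # features entering the first fully connected layer
print(h, w, flat)
```

With two blocks the map shrinks from 127 × 51 to 31 × 12, so the flattened feature vector feeding the 1024-node fully connected layer has 31 × 12 × 32 = 11,904 entries under these assumptions.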
Next, a classification process is performed through fully connected layers and a softmax layer. We set the number of nodes as 1024 in the first fully connected layer and five in the second fully connected layer, representing the five different scenarios. Then, the real-valued outputs of the second fully connected layer are converted into probability values through the softmax layer. These probability values are used to predict the instantaneous heading direction of the vehicle at the output layer.
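The softmax conversion can be written out directly; the five logit values below are made up for illustration:

```python
import numpy as np

# Sketch: the softmax layer maps the five fully connected outputs (logits)
# to class probabilities. The logit values are illustrative, not measured.
def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.1, -0.3, 0.4, -1.0, 0.2])  # one value per scenario A-E
probs = softmax(logits)
pred = int(np.argmax(probs))    # index of the predicted heading scenario
```

The probabilities sum to one, and the output layer simply reports the scenario with the largest probability.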
When the gradient of the loss function is calculated by backward propagation, the weight parameters are updated in the direction that minimises the loss function. Through this iterative update process, the optimal weight parameters, at which the loss function is minimal, are found. In our work, the loss function is the cross entropy, and the gradient descent algorithm is used to minimise it.
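A minimal sketch of this update rule uses cross entropy on softmax outputs and plain gradient descent, applied here directly to a logit vector as a stand-in for the full network's weights; all values (logits, label, learning rate, iteration count) are illustrative assumptions:

```python
import numpy as np

# Toy gradient-descent loop: cross-entropy loss on softmax outputs.
def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def cross_entropy(probs, label):
    return -np.log(probs[label])

z = np.array([0.5, 0.1, -0.2, 0.3, 0.0])  # logits for one sample (made up)
label = 3                                  # assumed true class (scenario D)
lr = 0.1                                   # illustrative learning rate

loss0 = cross_entropy(softmax(z), label)
for _ in range(50):
    p = softmax(z)
    grad = p.copy()
    grad[label] -= 1.0   # d(cross entropy)/d(logits) = softmax - one-hot
    z -= lr * grad       # gradient-descent update
loss_final = cross_entropy(softmax(z), label)
```

Each step moves the logits so that the probability of the true class grows, which is exactly the direction that decreases the cross-entropy loss.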

| CLASSIFICATION RESULTS
In our work, we collected 12,000 data cubes from experiments conducted over two days. Among these, we extracted 11,748 data cubes by removing the ones that do not belong to one of the five cases. A high-resolution range-angle map was obtained by applying the FFT to the n-axis of the data cube and then applying the MUSIC algorithm to the l-axis, as explained in Section 3. We used the CA-CFAR [20] algorithm to detect the location of the target and extracted the range-angle map around the target by applying a rectangular window function, as shown in Figure 7a. The size of the windowed image was 127 × 51, which was used as input to the CNN. Then, we randomly extracted 70% of the total 11,748 data and used it as the training set, 15% as the validation set, and 15% as the test set. The training set was equally extracted from each of the five cases so that the training data is not biased to a specific case. The training set was divided into multiple batches, each consisting of 300 data samples. A single batch represents the set of samples used for one update of the weight parameters. The learning rate was set as 0.001 and the number of epochs was set as 40. To reduce the variance of the estimated classification accuracy, this training process was repeated 10 times using the Monte Carlo method [21]. We trained the network with 10 independent data sets and derived the final classification accuracy by averaging the results. In our work, we used the MATLAB software for training the network.
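The stratified 70/15/15 split described above can be sketched as follows; the per-class counts are made-up values that merely sum to 11,748, not the paper's actual class distribution:

```python
import random

# Sketch: split each class separately so the training set is not biased
# to a specific scenario. Class sizes below are illustrative assumptions.
def stratified_split(labels, train=0.70, val=0.15, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train_idx, val_idx, test_idx = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_tr = int(len(idxs) * train)
        n_va = int(len(idxs) * val)
        train_idx += idxs[:n_tr]
        val_idx += idxs[n_tr:n_tr + n_va]
        test_idx += idxs[n_tr + n_va:]   # remainder becomes the test set
    return train_idx, val_idx, test_idx

labels = ['A'] * 2400 + ['B'] * 2300 + ['C'] * 2350 + ['D'] * 2348 + ['E'] * 2350
tr, va, te = stratified_split(labels)
```

Because each class is shuffled and sliced independently, every scenario contributes about 70% of its samples to training, which is the balancing property the paragraph above requires.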
First, to find the appropriate network parameters, we varied the number of blocks and filters and compared the performance of the network. The computational complexity of the CNN algorithm is affected by the number of blocks and filters. Therefore, these parameters should be set appropriately to lower the computational complexity while keeping the classification accuracy high. The number of filters was increased from 2 to 128 in powers of two ($2^n$), and the number of blocks was increased from 1 to 4. The classification results are shown in Figure 9. When the number of filters was 2, the classification accuracy was the lowest, and it tended to decrease as the number of blocks increased. In contrast, when the number of filters was higher than 2, the network showed the best performance when the number of blocks was 2. Therefore, we set the number of blocks as 2. In addition, the classification accuracy increased when more filters were used, but the increase was not significant beyond 32 filters. As a result, we set the number of filters as 32 considering the computational complexity.
Next, we examined the classification accuracy of the training and validation sets when the number of blocks and filters were set as 2 and 32, respectively. The classification accuracy and loss function according to the number of iterations are shown in Figure 10. We set the maximum number of iterations as 1,080, and validation was performed every 20 iterations. As can be seen from the figure, the network was fully trained at the 780th iteration, and training stopped at this point. Both graphs show that the training and validation sets have similar training curves. This indicates that the proposed network was well trained and over-fitting was prevented.
In addition, we analysed the performance of the proposed network by using the t-distributed stochastic neighbour embedding (t-SNE) [22] algorithm, which is a non-linear dimensionality reduction method that visualises high-dimensional data in a low-dimensional subspace. The t-SNE representation of the raw data is shown in Figure 11a, and that of the output of the second fully connected layer is shown in Figure 11b. It is evident that the boundaries between clusters become clear when the output data from the trained network is used.
Moreover, Table 2 shows the confusion matrix indicating the classification results of the proposed network. Scenarios A, D, and E showed classification accuracies higher than 94%, whereas scenarios B and C showed lower classification accuracies. Scenario A showed high classification accuracy because the vehicle is stationary and there are few fluctuations in the received radar signal. In addition, scenarios D and E resulted in high classification accuracy because the vehicle is rotating and more parts of the vehicle are illuminated by the radar. A high-intensity signal is reflected by the wheel, and the vehicle's motion can be easily detected. In contrast, when the vehicle is moving forwards or backwards, the area of illumination is narrow compared to when the vehicle is rotating, and there is little signal reflected by the wheel. As a result, the vehicle's movement was not easily recognisable, and these cases showed low classification accuracy. The overall classification accuracy over the five scenarios was 94.44%. Therefore, we believe that our proposed algorithm can effectively estimate the instantaneous heading direction of the front vehicle.

| CONCLUSION
In this paper, we proposed a CNN-based method for estimating the instantaneous heading direction of the front vehicle using an automotive radar sensor. The 77 GHz FMCW radar was installed in an outdoor parking lot, and experiments were conducted for five different scenarios by changing the heading direction of the front vehicle. We converted the received data into a 2D range-angle map by applying the FFT and MUSIC algorithms along the range and angle axes, respectively. Then, we used the CA-CFAR algorithm to estimate the location of the target and extracted the range-angle map around the target by applying a rectangular window function. This windowed image was used as an input to the CNN. The structure of the CNN was determined by finding the number of blocks and filters that result in high classification accuracy while maintaining low computational complexity. The classification results showed that our proposed method can effectively estimate the instantaneous heading direction of the front vehicle with high accuracy. For future work, a series of images can be used as the input to improve the estimation accuracy, or a tracking algorithm can be used to obtain more information about how the front vehicle is moving. Furthermore, the estimation accuracy can be improved by using the velocity information as well as the range-angle information.