L4Net: An anchor‐free generic object detector with attention mechanism for autonomous driving

National Natural Science Foundation of China, Grant/Award Number: 61872032; Beijing Municipal Natural Science Foundation, Grant/Award Numbers: 4202058, 9192008

Abstract: Generic object detection is a crucial task for autonomous driving. To devise a safe and efficient object detector, the following aspects must be considered: high accuracy, real‐time inference speed and small model size. Herein, a simple yet effective anchor‐free object detector named L4Net is proposed, which incorporates a keypoint detection backbone and a co‐attention scheme into a unified framework, and achieves lower computation cost with higher detection accuracy than prior art across a wide spectrum of resource constraints. Specifically, the backbone utilizes a Multi‐scale Receptive‐fields Enhancement (MRE) module to capture context‐wise information, where features of object scale and shape invariance are considered simultaneously. The co‐attention scheme integrates the strengths of both Class‐agnostic Attention (CA) and Semantic Attention (SA), and explores valuable features from low level to high level to generate more accurate prediction boxes. Compared with previous feature fusion strategies, multi‐scale features are selectively integrated by fully exploiting the different characteristics of low‐level and high‐level features, which leads to a small model size and faster inference speed. Extensive experiments on four well‐known datasets demonstrate the effectiveness of our method. For instance, L4Net achieves 71.68% mAP on the KITTI test set with a 13.7 M model size, at a speed of 149 FPS on an NVIDIA TX and 30.7 FPS on a Qualcomm‐based device, respectively, which is 4x smaller and 2x faster than the baseline model.


| INTRODUCTION
In recent years, autonomous driving has become a hot topic in object detection tasks. However, limited by the computational constraints of embedded devices [1,2], building a real-time generic object detector with high accuracy on low-power devices is a critical and challenging task for autonomous driving [3]. Generally, existing real-time object detectors can be broadly divided into two categories [4]: anchor-based methods and anchor-free methods. Most successful anchor-based object detectors [5][6][7][8][9][10][11][12][13][14] enumerate large numbers of potential object locations and then categorize each proposal into a corresponding class label. However, such a strategy not only increases the computation cost of calculating the IoU, but also introduces many hyperparameters, such as box size and aspect ratio [15]. Therefore, detectors using anchor boxes cannot be deployed in autonomous driving due to their high time cost. Meanwhile, the large model size of this kind of model is unsuitable for embedded systems [16][17][18][19].
To overcome the drawbacks of the above anchor-based methods, anchor-free detection methods [20][21][22][23][24][25] have been proposed, which bypass the requirement of anchor boxes, generate predictions in a point(s)-to-box style, and have drawn a lot of attention in the computer vision community. Among these methods, one of the most representative is CenterNet, proposed by Zhou et al. [20], which models an object as the centre point of its bounding box and regresses all other object properties (size and orientation) at each position. The centre-point-based approach is end-to-end differentiable and does not require any post-processing, hence it is simpler and faster. However, such per-pixel prediction relies heavily on the accuracy of keypoint selection and the edge details of the object. For example, in Figure 1(c), object parts that are hard to distinguish (the front wheels and the reflectors) tend to be ignored by CenterNet [20], which leads to inaccurate prediction box localization and further hinders efficient and safe driving.
To alleviate the above-mentioned issues, we propose L4Net, a single unified network composed of a keypoint detection backbone and a co-attention scheme. For the keypoint detection backbone, we basically follow CenterNet [20] but modify it to take advantage of multi-scale receptive fields for extracting object scale and shape invariance, since efficiently handling these complicated variations is crucial for improving the localization ability of object detectors based on keypoint estimation. In addition, the co-attention scheme incorporates class-agnostic attention and semantic attention to jointly fuse low-level and high-level features on top of the modified CenterNet, which is the key contribution of our method. Previous works [5,6,8,26] normally ignore the interaction between these two kinds of features and directly integrate multi-level features without distinction. In our model design, by contrast, class-agnostic attention focuses on shallow features such as contours, texture and colours under the class-agnostic setting, while semantic attention emphasizes context-aware information to assign objects their correct labels [27]. Figure 1(d) demonstrates the effectiveness of our proposed method in considering the fusion of shallow and abstract semantic features. Since the co-attention scheme can focus on more discriminative features while filtering out redundant information, our method can employ fewer convolution layers and dense connections to learn valuable features, which allows our model to achieve a speed-accuracy trade-off with a small model size. In short, the contributions are summarized as follows: 1. We propose L4Net, a completely anchor-free framework for object detection. L4Net jointly trains the keypoint-based

| Real-time object detectors
Real-time generic object detectors can be broadly divided into two categories: anchor-based and anchor-free methods. Anchor-based detectors [2,3,[5][6][7][8][9][10][11][12][13][14][15][16][17][18][19] are commonly classified into two-stage and one-stage detectors. In two-stage detectors, the first stage generates large numbers of object proposals, and the second stage classifies the proposals and refines their coordinates with the anchor boxes. The effectiveness of this pipeline was first demonstrated by R-CNN [13], and it is widely used in later two-stage detectors [16]. For instance, Qin et al. [16] present a lightweight backbone and adopt an extremely efficient design in the detection head and Region Proposal Network (RPN), achieving superior detection accuracy. Compared with two-stage methods, one-stage detectors skip object proposal generation and predict bounding boxes and class scores in one evaluation, such as SSD [6], YOLO [4,7,8] and RefineDet [11]. Coupled with a small backbone network, the lightweight detector Tiny-DSOD [12] introduces two innovative blocks: depthwise dense blocks (DDB) and a depthwise feature pyramid network (D-FPN). However, these methods require large numbers of hyper-parameters to define the anchor boxes. Besides, the exhaustive computation cost of IoU calculation can lead to large model size and slow inference speed. Recently, to overcome the drawbacks mentioned above, anchor-free object detection methods [20][21][22][23][24][25] have drawn a lot of attention; here the prediction box is generated in a point(s)-to-box form. Anchor-free object detectors can be roughly categorized into two groups. One group is based on anchor-point detection, which encodes and decodes object bounding boxes as anchor points with corresponding point-to-boundary distances, like FCOS [21], FoveaBox [22] and GA-RetinaNet [23].
In FCOS, all anchor points in a ground-truth bounding box are used to predict bounding boxes, and low-quality detected boxes are suppressed by the proposed 'centre-ness' branch. The other group is based on keypoint detection, which predicts the locations of several keypoints (e.g. corners, centre or extreme points) of the bounding box and groups those keypoints to form a box, such as CornerNet [24], ExtremeNet [25] and CenterNet [20]. The key step of CornerNet is to recognize which keypoints belong to the same instance and group them correctly. Herein, we mainly build on CenterNet, proposed by Zhou et al. [20], because its principles are relatively easy to understand and it is very efficient: it models an instance as a single point, without any grouping scheme to separate different instances.

| Attention Mechanism
Attention mechanisms [27][28][29][30][31][32][33][34][35] have been successfully applied to various tasks such as machine translation, neural image/video captioning and object detection, since they can assign greater weights to highly responsive feature maps. Bahdanau et al. [28] propose a network with attention that correlates the degree of association of each word in the output sequence with a particular word in the input sequence for machine translation. Xu et al. [29] propose the first visual attention method in image captioning and use both 'hard' and 'soft' attention mechanisms to focus on different parts of the image for each word in the output sequence. For object detection, Hu et al. [30] propose a 'Squeeze-and-Excitation' block that adaptively weights channel-wise features according to their responses. Perreault et al. [31] use segmentation maps as a self-attention mechanism to weight the feature map used to produce the bounding boxes, decreasing the signal of non-relevant areas. Also, Zhou et al. [34] propose a semantic self-attention module to effectively recognize foreground regions and suppress backgrounds.
Compared with existing methods [9,12,16,34] that directly utilize multi-level features without distinction, which leads to slower inference speed and sub-optimal results due to the distraction of redundant details, we propose an efficient co-attention scheme that obtains more accurate prediction boxes and faster inference speed by selectively integrating high-level and low-level features.

| Overview
The overall architecture of our proposed L4Net is shown in Figure 2; it consists of the keypoint detection backbone, the co-attention scheme and the detection head. In the keypoint detection backbone, we enhance the representation ability of shallow features via a stem block and utilize multi-scale receptive fields to extract object-level shape-invariant features for accurately locating keypoints. The co-attention scheme contains two modules, class-agnostic attention and semantic attention, which capture object boundary details and global context-aware information from low-level and high-level features, respectively. Through the fusion of these two attention modules, our model can filter out the distraction of background information and generate more accurate prediction boxes. For the detection head, we basically follow CenterNet [20].

| Backbone design for keypoint detection
To improve the inference speed of the object detector and match the input resolution to the backbone network, L4Net uses a small input resolution of 512 × 512 pixels. We observe that object spatial details tend to be ignored in existing methods [16,20], which leads to inaccurate prediction box localization. Therefore, we put more emphasis on absorbing shallow features into our backbone network to improve the recognition of object details such as edges and textures. In addition, we utilize multi-scale receptive fields to capture more context-wise information and encode long-range relationships among pixels, which is vital for the localization subtask, especially for large objects.
Based on the above insights, we build a lightweight backbone based on residual blocks [10] to effectively detect keypoints. The network structure is shown in Table 1. Motivated by Pelee [3] and Inception-v4 [26], we first add a cost-efficient stem block after the input layer; its structure is shown in Figure 3 (Left). The stem block consists of a convolution layer with a 7 × 7 kernel at stride two, to retain as much information as possible about the original image, and a concatenation of three parallel branches, to obtain complex feature maps at different scales while reducing computation cost: a 3 × 3 max-pool at stride two; three consecutive convolution layers with 1 × 1 × 32, 3 × 3 × 32 (at stride two) and 1 × 1 × 64 kernels; and a 3 × 3 × 64 convolution layer at stride two. Following the stem block, multiple residual blocks are stacked to enhance the feature representation ability. The number of residual blocks stacked at each resolution is set according to the size distribution of objects in the dataset. For example, there are a total of 34,856 object boxes in the KITTI dataset, among which 17.7% are small objects (area < 32²), 53.7% are medium objects (32² < area < 96²) and 28.6% are large objects (area > 96²). We can therefore increase the number of residual blocks on small feature maps for medium and large object predictions.
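The stem block just described can be sketched in PyTorch as follows. The layer sizes stated in the text are kept; the BatchNorm/ReLU placement and the max-pool branch keeping 64 channels are assumptions, so treat this as an illustrative sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class StemBlock(nn.Module):
    """Sketch of the stem block: 7x7/2 conv, then three parallel
    stride-2 branches whose outputs are concatenated."""
    def __init__(self, in_ch=3):
        super().__init__()
        # 7x7 convolution at stride 2 retains coarse spatial information
        self.conv7 = nn.Sequential(
            nn.Conv2d(in_ch, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        # Branch 1: 3x3 max-pool at stride 2 (assumed to keep 64 channels)
        self.branch_pool = nn.MaxPool2d(3, stride=2, padding=1)
        # Branch 2: 1x1x32 -> 3x3x32 (stride 2) -> 1x1x64
        self.branch_a = nn.Sequential(
            nn.Conv2d(64, 32, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 1), nn.ReLU(inplace=True))
        # Branch 3: a single 3x3x64 convolution at stride 2
        self.branch_b = nn.Sequential(
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.conv7(x)
        return torch.cat([self.branch_pool(x),
                          self.branch_a(x),
                          self.branch_b(x)], dim=1)  # 64 + 64 + 64 channels

x = torch.randn(1, 3, 512, 512)       # the 512 x 512 input used by L4Net
y = StemBlock()(x)                    # spatial size reduced by 4x overall
```

With a 512 × 512 input, the two stride-2 stages reduce the spatial resolution to 128 × 128, matching the 4x early downsampling common in keypoint backbones.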
In addition, local invariance is very significant for the detection task. However, existing models generally stack convolution and pooling layers directly in a bottom-up style, which may not handle complex scenes effectively. Here, inspired by the Scale-Invariant Feature Transform (SIFT) [36], we devise a multi-scale receptive-fields enhancement module to obtain feature representations with scale, position and shape invariance. SIFT is a vision algorithm that searches for extreme points in scale-space and extracts their location, rotation and scale invariants; it is usually used to extract local features from images.
SIFT introduced the difference-of-Gaussians (DoG) scale-space and the image Gaussian pyramid [36]. The DoG scale-space is produced by processing the image with several different Gaussian kernels at the same resolution, while the Gaussian pyramid is generated by Gaussian blurring and downsampling the image to different resolutions. Similar to DoG, we adopt dilated convolutions with various dilation rates (3, 5, 7) to acquire feature maps of the same scale but different receptive fields. Afterwards, the feature maps of the different dilated convolutions and a 1 × 1 cross-channel dimension-reduction feature are concatenated for each layer. Finally, we change the output channels of the MRE with a 1 × 1 convolution to fit the input channels of the next layer. Similar to the image Gaussian pyramid, we take Layer1, Layer2 and Layer3 of the L4Net backbone to obtain multi-scale feature maps. The multi-scale receptive-fields enhancement module is shown in Figure 3 (Right).

[Figure 2: The overall architecture of L4Net. The backbone is based on residual blocks and the multi-scale receptive-fields enhancement module to obtain scale- and shape-invariant feature representations for more accurate keypoint localization. The channel self-attention module captures global semantic information from high-level features; the spatial self-attention module captures object boundary details from low-level features. The fusion of the two attention modules not only improves prediction box accuracy but also reduces computation overhead.]
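A minimal PyTorch sketch of the MRE module follows. Only the dilation rates (3, 5, 7), the 1 × 1 reduction branch, the concatenation and the final 1 × 1 projection come from the text; equal branch widths and the per-branch ReLU are assumptions.

```python
import torch
import torch.nn as nn

class MRE(nn.Module):
    """Multi-scale Receptive-fields Enhancement sketch: parallel dilated
    convolutions (rates 3, 5, 7) plus a 1x1 reduction branch, concatenated
    and projected back to the desired channel count."""
    def __init__(self, in_ch, out_ch, branch_ch=None):
        super().__init__()
        branch_ch = branch_ch or in_ch // 4   # assumed branch width
        def dilated(rate):
            # padding == dilation keeps 3x3 dilated convs at 'same' size
            return nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 3, padding=rate, dilation=rate),
                nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True))]
            + [dilated(r) for r in (3, 5, 7)])
        # 1x1 projection fits the output to the next layer's input channels
        self.project = nn.Conv2d(4 * branch_ch, out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

f = torch.randn(1, 128, 64, 64)
out = MRE(128, 128)(f)   # resolution unchanged, channels refit to 128
```

The key point mirrored from DoG is that all branches share one resolution and differ only in receptive field, so concatenation needs no resampling.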

| Attention mechanism
In this subsection, we elaborate on the co-attention scheme, which significantly filters out distracting background information and focuses more on valuable features. The co-attention scheme is composed of semantic attention and class-agnostic attention, which capture context-aware information and object boundary details from high-level and low-level features, respectively.

| Semantic attention
We devise semantic attention after the multi-scale receptive-fields enhancement modules (MREs) to assign greater weights to the channels of the high-level features that respond more actively to the semantics of interest. Since each convolutional layer performs as a pattern detector, each channel of the feature map generated by a convolutional layer is a response activation of the corresponding semantic. Applying semantic attention in a channel-wise manner can therefore be regarded as a process of selecting the object with the correct label. For semantic attention (SA), we define the high-level features as f^h ∈ R^(W×H×C), where W and H represent the width and height of the feature map and C represents the channel number. First, to exploit correlations between channels, we apply global average pooling to obtain the channel statistics vector v^h ∈ R^C, where each component v^h_i is generated by shrinking f^h_i ∈ R^(W×H) over its spatial dimensions W × H:

v^h_i = (1 / (W × H)) ∑_{x=1}^{W} ∑_{y=1}^{H} f^h_i(x, y).

Then we use two fully connected layers to extract correlations between channels, as shown in Figure 4. Motivated by Hu et al. [30], to aid generalization and reduce computation cost, the first FC layer reduces the feature dimension to 1/16 of the input; it is activated with ReLU, and the second FC layer recovers the original dimension. Finally, the normalized channel weight vector, mapped to [0, 1], is obtained by a sigmoid operation:

s = σ₁(fc₂(δ(fc₁(v^h; W₁)); W₂)),

where W₁ and W₂ refer to the parameters of the semantic attention, σ₁ refers to the sigmoid operation, fc₁ and fc₂ represent the fully connected layers, and δ is the ReLU activation function.
The refined high-level feature f̃^h is then obtained by weighting f^h with the normalized channel vector s, i.e. f̃^h = s ⊗ f^h, where ⊗ denotes the outer product of the weighted channel vector and the high-level features.
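Putting the steps above together, the semantic attention branch is essentially an SE-style block. A minimal PyTorch sketch follows (the reduction ratio of 16 is stated in the text; bias handling and other details are assumptions):

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """SE-style channel attention: global average pooling, an FC
    bottleneck with reduction 16, ReLU, FC back to C, then a sigmoid
    whose output reweights every channel of the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, f_h):                   # f_h: (B, C, H, W)
        v = f_h.mean(dim=(2, 3))              # channel statistics v^h
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))  # weights in [0, 1]
        return f_h * s[:, :, None, None]      # reweight each channel

f_h = torch.randn(2, 256, 16, 16)
out = SemanticAttention(256)(f_h)             # same shape, channel-reweighted
```

Because the sigmoid weights lie in [0, 1], the block can only attenuate channels, never amplify them, which matches the "selecting" interpretation given above.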

| Class-agnostic attention
Considering all spatial positions equally results in sub-optimal performance due to the distraction of background details, so we design class-agnostic attention to focus more on object boundary details and generate more accurate prediction boxes.
For class-agnostic attention (CA), we define the low-level feature maps as f^l ∈ R^(W×H×C). The set R = {(x, y) | x = 1, …, W; y = 1, …, H} represents the spatial positions, where p = (x, y) denotes a spatial coordinate of the low-level features. As shown in Figure 5, class-agnostic attention is generated through the following steps. First, to reduce computation cost and obtain rich global information, we approximate the k × k kernel by '1 × k' + 'k × 1' and by 'k × 1' + '1 × k', respectively, to capture different spatial concerns C₁ ∈ R^(W×H×1) and C₂ ∈ R^(W×H×1) for the low-level features.
Then, the class-agnostic attention feature CA ∈ R^(W×H×1) is generated by a sigmoid operation normalizing the sum of C₁ and C₂:

CA = σ₂(C₁ + C₂),

[Figure 4: The semantic attention module allocates greater weight to highly responsive channels.]
where W₁ and W₂ represent the parameters of the class-agnostic attention, σ₂ is the sigmoid function, and conv1 and conv2 refer to the 1 × k and k × 1 convolution layers, respectively. Here k is a hyper-parameter, which is set to 9 in our model. The final output f̃^l ∈ R^(W×H×C) of the block is obtained by weighting f^l with CA.
That is, f̃^l = CA ⊕ f^l, where ⊕ represents the outer product of the class-agnostic attention feature and the low-level features.
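A hedged PyTorch sketch of the class-agnostic attention branch. The 1 × k / k × 1 factorization, k = 9 and the sigmoid-normalized sum come from the text; the intermediate channel width is an assumption:

```python
import torch
import torch.nn as nn

class ClassAgnosticAttention(nn.Module):
    """Spatial attention sketch: a k x k kernel is factorized into
    1 x k followed by k x 1 (and k x 1 followed by 1 x k); the two
    single-channel maps are summed, passed through a sigmoid, and the
    result reweights every spatial position of the low-level feature."""
    def __init__(self, channels, k=9):
        super().__init__()
        p = k // 2   # 'same' padding for the factorized kernels
        self.path1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, (1, k), padding=(0, p)),
            nn.Conv2d(channels // 2, 1, (k, 1), padding=(p, 0)))
        self.path2 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, (k, 1), padding=(p, 0)),
            nn.Conv2d(channels // 2, 1, (1, k), padding=(0, p)))

    def forward(self, f_l):                   # f_l: (B, C, H, W)
        ca = torch.sigmoid(self.path1(f_l) + self.path2(f_l))  # (B, 1, H, W)
        return f_l * ca                       # broadcast over all channels

f_l = torch.randn(2, 64, 64, 64)
out = ClassAgnosticAttention(64)(f_l)         # same shape, spatially reweighted
```

Factorizing the 9 × 9 kernel this way cuts the per-pixel cost from 81 multiplies to 18 while keeping the same 9 × 9 spatial support.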

| Implementation details
To evaluate the performance of our proposed method, we use 15 state-of-the-art object detectors for comparative studies, and four widely used real-time object detection metrics (mean average precision [mAP], average recall [AR], speed and model size) are utilized to evaluate each compared method.
Here, all detection metrics are averaged over 10 runs. In addition, we conduct a comprehensive ablation study and compare our proposed L4Net with its degenerated variants to demonstrate the effectiveness of each component of our model. Most training strategies follow CenterNet, including data augmentation, input size, the classification and regression prediction layers, the loss functions (focal loss [17] for classification and L1 loss for localization) and optimizing the overall objective with Adam from scratch. Finally, the experiments are conducted on an NVIDIA TITAN X with CUDA 8.0 and deployed on Qualcomm-based low-power embedded devices via the Snapdragon Neural Processing Engine (SNPE) [37]. Detailed experimental results are presented in the following subsections.
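For reference, the classification loss borrowed from CenterNet is the penalty-reduced pixel-wise focal loss over the centre-point heatmap. The sketch below is a generic PyTorch rendering; the defaults α = 2 and β = 4 are CenterNet's and are not stated in this paper:

```python
import torch

def centernet_focal_loss(pred, gt, alpha=2, beta=4):
    """Penalty-reduced pixel-wise focal loss for centre-point heatmaps.
    pred and gt share a shape and lie in [0, 1]; gt == 1 marks a centre."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    eps = 1e-6  # guards the logs against 0
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    # (1 - gt)^beta down-weights negatives near a ground-truth centre
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

pred = torch.full((4, 4), 0.5)         # uniform, uncertain heatmap
gt = torch.zeros(4, 4); gt[1, 1] = 1.  # one ground-truth centre
loss = centernet_focal_loss(pred, gt)
```

The L1 localization loss mentioned above is the standard `torch.nn.functional.l1_loss` on the regressed box sizes and offsets, so no custom code is needed there.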

| The results on KITTI benchmark
First, our model is trained from scratch on the KITTI dataset, which contains 7481 training images collected from real-world autonomous driving scenarios [38]. Half of the 7481 images are randomly selected as the training set, and the rest serve as the validation set; the detection metrics are measured on the validation set. During training, we set a mini-batch size of 64. The initial learning rate is set to 5 × 10⁻⁴ and divided by a factor of 10 at epochs 90 and 120, respectively; training stops at 200 epochs. Moreover, since our training dataset is small, we use heavy data augmentation, including random flip, random scaling, cropping and colour jitter, to prevent overfitting and improve robustness. In the inference phase, however, data augmentation would add too much latency and is therefore not used.
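The schedule above maps directly onto PyTorch's `MultiStepLR`. The following sketch uses a stand-in model simply to show the decay points (5 × 10⁻⁴, divided by 10 at epochs 90 and 120, stopping at 200):

```python
import torch

# Stand-in model; the real training loop would use the L4Net detector.
model = torch.nn.Linear(8, 4)
opt = torch.optim.Adam(model.parameters(), lr=5e-4)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[90, 120], gamma=0.1)

lrs = []
for epoch in range(200):
    # ... one training epoch over the KITTI split would run here ...
    opt.step()           # placeholder step so the scheduler can advance
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
# lrs holds 5e-4 until epoch 90, 5e-5 until epoch 120, then 5e-6
```

The same pattern covers the VOC schedule later in the paper by changing the milestones to [60, 100] and the gamma to 0.2.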
The detection results of the proposed method and other methods on the validation set are reported in Table 2. The detection speed of L4Net is 49 fps faster than that of CenterNet [20], which had the fastest inference speed among previous object detectors, while the mAP of L4Net outperforms that of CenterNet [20] by more than 6.5%. Furthermore, the KITTI dataset consists of unusually wide 1242 × 375 images. To avoid losing small objects, the input of our model is resized to 1200 × 300. The mAP of L4Net with the large 1200 × 300 input is slightly better than that of L4Net (original), but the inference speed of L4Net-large is 53 fps slower. We observe that large input images can indeed detect more small objects. In practice, however, the autonomous driving task generally needs to accurately detect objects within 10 m around the vehicle. Therefore, it is unnecessary to spend large input images and extra computational overhead on detecting small, irrelevant objects in the distance.
In addition, although the mAP of L4Net is 16.34% lower than that of FCOS [21], which has the highest accuracy among previous anchor-free detection models, the inference speed of the proposed method is 9x faster than FCOS [21]. Our method achieves a trade-off between accuracy and speed, whereas FCOS spends too much computational cost (i.e. large input size and multi-scale prediction) to improve accuracy, which is unsafe for autonomous driving applications due to the high latency. Table 3 shows the inference speed and model size of L4Net and other methods on a real Qualcomm 820 device. The speed is calculated as the average time the benchmark tool takes to process 50 images. According to Table 3, L4Net runs approximately 2x faster with a 4x smaller model size than CenterNet [20] on the real device, while L4Net and CenterNet have similar speed on the TITAN X GPU. As shown in the second column of Table 3, the model size of FCOS [21] is 2x and 9x larger than those of CenterNet and L4Net, respectively, which is infeasible for embedded-system deployment, although FCOS has higher accuracy (see Table 2). In addition, the model size of Tiny-DSOD [12] is 4.2 M, 3x smaller than L4Net, yet Tiny-DSOD and L4Net have similar inference speed. This is because L4Net generates only one prediction box per object, without complicated Non-Maximum Suppression (NMS) [5].

[Figure 5: The class-agnostic attention module pays more attention to foreground regions, extracting valuable features for object contours.]

| The results on PASCAL VOC2007 benchmark
Next, our detector is trained on the combination of the VOC2007 and VOC2012 trainval sets and tested on the VOC2007 test set [39]. A mini-batch size of 128 is used. The initial learning rate is set to 5 × 10⁻⁵, since we observed that the loss oscillates with a larger learning rate when training from scratch; it is divided by a factor of 5 at epochs 60 and 100, respectively. Training stops at 160 epochs, as the loss declines only slowly thereafter. Other settings are the same as on the KITTI dataset. Table 4 summarizes the detection metrics on the VOC2007 test set. The upper part of the table lists state-of-the-art anchor-based detectors, and the lower part anchor-free detectors. In terms of mAP, our L4Net achieves 73.1%, which outperforms most anchor-based detection models except RetinaNet; however, L4Net is 18x faster in fps than RetinaNet. When comparing L4Net with the state-of-the-art anchor-free detectors, the accuracy drops, but L4Net demands much less computation and has faster inference speed. For instance, FoveaBox [22] is 8.4% more accurate than L4Net, but with more than 16x slower inference speed. These comparisons reveal that L4Net offers a much better trade-off between inference speed and detection accuracy, which is of great use for resource-restricted devices.

| The results on MS COCO benchmark
We also evaluate the performance of our proposed model on the MS COCO test-dev set; MS COCO is composed of 118k training images, 5k validation images and 20k test-dev images [20,40]. The training batch size is set to 32, and the model is trained with a 5 × 10⁻⁴ learning rate for the first 90 epochs, then 5 × 10⁻⁵ and 5 × 10⁻⁶ for another 30 and 60 epochs, respectively. Table 5 reports the detection results on the COCO test-dev set. L4Net achieves 31.2% mAP, which is better than the anchor-based RefineDet [11] and Pelee [2], and even outperforms the lightweight detector Tiny-DSOD [12]. Compared with EfficientDet-D0 [19] and SNIPER [14], L4Net achieves similar accuracy with up to 2x faster inference speed and 4x smaller input size. Besides, L4Net has a significant speed advantage over all the anchor-free methods listed in Table 5; for example, it is 19x faster than CornerNet [24]. These comparisons verify that L4Net is the first real-time yet accurate anchor-free detector for resource-restricted object detection applications. Meanwhile, we show some qualitative results on KITTI, VOC 2007 and the COCO test-dev in Figure 6. Our method copes well with occlusions, inter-class interference and cluttered backgrounds.

| The results on BDD benchmark
Finally, we evaluate the performance of the proposed model and other models on the BDD test set; BDD is the latest published autonomous driving dataset with the largest scale and most diverse content. The BDD dataset consists of ten classes: bike, bus, car, motor, person, rider, traffic light, traffic sign, train and truck, and the ratio of training, validation and test sets is 7:1:2. For L4Net training, the batch size is 64 and the initial learning rate is 5 × 10⁻⁴. Table 6 shows the experimental results on the BDD test set. L4Net is 149 fps faster than CenterNet [20], which had the fastest operation speed among previous studies, while the accuracy of L4Net outperforms that of CenterNet [20] by 0.3 mAP. In addition, compared with FoveaBox [22], which has the highest accuracy among previous real-time detectors, L4Net with a 1024 × 800 input resolution and a ResNeXt backbone (last row of Table 6) shows a 2.2 better mAP and an 8.0 fps faster operation speed; consequently, L4Net outperforms FoveaBox [22] in terms of both accuracy and detection speed.

| Ablation study on KITTI
Keypoint detection backbone improves the localization of object centre points. All variants in the ablation studies are based on a compressed CenterNet with 63.21% mAP (Table 7(a)) as our baseline, which reduces the convolution kernel channels relative to the official CenterNet to obtain a small model size. We first apply the stem block after the input layer of CenterNet; the result is reported in Table 7(b). Evidently, the stem block offers a significant improvement (up to 0.96% mAP) over the baseline without adding too many parameters. With the Multi-scale receptive-fields enhancement (MRE) module, the accuracy reaches 66.43% mAP (Table 7(c)). Furthermore, we compute the overlap ratio (distance threshold ≤ √2) between predicted and ground-truth object centre points: the proposed method reaches 42.6%, a 3.7% improvement over CenterNet. It follows that the accuracy of object centre point localization can be improved by capturing features of object scale and shape invariance through the MRE. The co-attention scheme regresses more accurate prediction boxes. Next, we apply the co-attention scheme on top of the keypoint detection backbone, so that the proposed model can focus on valuable features at different levels. Since the co-attention scheme consists of semantic attention (SA) and class-agnostic attention (CA), we report their combined effect in Table 7(f), where the co-attention improves mAP by 5.25% with a negligible increase in parameters over Table 7(c). With our co-attention scheme, the inaccuracy of prediction boxes at object edge details is effectively alleviated, and its effectiveness is further confirmed by the error analysis protocol of Lin et al. [40] in Figure 7. Each sub-image depicts a series of precision-recall curves with different settings, where the area under each curve is displayed (in brackets) in the images. A larger IoU reveals that the coordinates of the prediction boxes are closer to the ground truth.
The overall AP and the medium-object AP at IoU = 0.75 of L4Net improve by 4.4% and 11.4%, respectively, compared with CenterNet, which indicates that our model has better prediction box localization ability. We also visualize detection results in Figure 6; these figures clearly show that the results are consistent with our initial motivation for designing the co-attention scheme for different-level features.

| CONCLUSION AND FUTURE WORK
We have proposed an anchor-free generic object detector named L4Net that achieves the best speed-accuracy trade-off with a small model size for autonomous driving. In the keypoint detection backbone, we utilized multi-scale receptive fields to capture object scale and shape invariance. The co-attention scheme consists of class-agnostic attention and semantic attention, which capture object spatial details and context-aware information from low-level and high-level features, respectively. Compared with the baseline, the proposed L4Net improves the mAP by 6.51, 2.9, 1.4 and 0.3 points on the KITTI, PASCAL VOC, MS COCO and BDD datasets, respectively, with a 13.7 M model size at a speed of 149 FPS on an NVIDIA TX and 30.7 FPS on a Qualcomm-based device. As a result, the proposed method significantly improves the performance of object detection systems for autonomous driving applications.
Furthermore, although our model achieves a satisfactory trade-off between accuracy and speed for autonomous driving, some details are easily ignored while maintaining the fast speed, and our model performs poorly under adverse weather (rain, fog), difficult lighting (entering and exiting tunnels) and large occlusions. In future work, L4Net will therefore be improved by introducing effective network structure designs, such as geometric graph-based multi-view learning.