Interest point detection from multi-beam light detection and ranging point cloud using unsupervised convolutional neural network

Interest point detection plays an important role in many computer vision applications. This work is motivated by the light detection and ranging odometry task in autonomous driving. Existing methods are not capable of detecting enough interest points in unstructured scenarios where there are few buildings or trees around, and consequently light detection and ranging odometry fails to provide continuous localisation. An interest point detector is proposed for detecting interest points from multi-beam light detection and ranging point clouds using an unsupervised convolutional neural network. The point cloud is projected into two-dimensional structured data according to the scanning geometry. Then, convolutional neural network filters trained in an unsupervised manner are used to generate a local feature map from the two-dimensional structured data. Finally, interest points are obtained by extracting the grids that differ significantly from their neighbouring grids. Experiments on an odometry benchmark show that the proposed interest point detector captures more local details, which contributes to a more than 16% error decrease in point cloud registration in highway scenes.


INTRODUCTION
Interest point detection in light detection and ranging (LiDAR) point clouds is an essential part of many computer vision tasks, such as point cloud registration [1,2], visual odometry [3], simultaneous localisation and mapping [4], indoor/outdoor mapping [5], etc. However, existing methods mostly focus on interest point detection in ordinary scenarios. To the best of our knowledge, no existing method is suitable for both ordinary scenarios and unstructured scenarios with few surrounding objects, such as the highway scenes in KITTI's LiDAR odometry dataset [6].
Technically speaking, the trade-offs between different scales and between different point densities are the main sources of this difficulty. To overcome it, we address the task of detecting interest points from multi-beam LiDAR point clouds by projecting the point cloud into two-dimensional (2D) structured data and picking out grids that differ significantly from their neighbouring grids, with the help of 2D convolutional neural network (CNN) filters. The CNN filters are trained in an unsupervised manner using an auto-encoder network structure. The contributions of the method are as follows:
• An unsupervised CNN-based detector for detecting interest points from multi-beam LiDAR point clouds.
• A more than 16% error decrease in point cloud registration in unstructured scenarios such as highways.

RELATED WORK
Our method differs from existing methods in both data structure and approach. Furthermore, the idea behind our method was inspired by methods developed for 2D images. Therefore, this section analyses the applicability of different data structures for LiDAR point cloud processing from the perspective of interest point detection, and introduces related methods developed for both 2D images and 3D point clouds.

Light detection and ranging point cloud representation methods
There are mainly three categories of data representation for LiDAR point clouds: unordered points (raw point clouds), 2D grids, and 3D grids. Beyond the handcrafted algorithms on raw point clouds, deep learning based PointNet [10] and PointNet++ [11] were proposed to take unordered points as input. PointNet-like networks have since been applied to object classification, hierarchical feature learning, etc. In recent years, many PointNet-based networks have emerged, such as 3DFeat-Net [1], USIP [2], L-Net [12], VoxelNet [13], and DeepICP [14], which detect interest points and extract features from unordered points. Taking unordered points as input maximises the preservation of the information in the original data. However, because the PointNet meta-structure limits the number of input points [15], downsampling of the raw point cloud is usually applied in advance. This leads to inevitable information loss and at the same time makes interest point detection in highway scenes more difficult.
2D grid representations have gained the most attention due to their proximity to common 2D image processing [15]. Projecting points into a 2D grid along a 3D direction is the most general approach, e.g. Watertight [16], the front view of the LiDAR point cloud [17], and the bird's eye view [18,19]. However, when points overlap along the projection direction, information is lost. Moreover, projecting along one specific direction is not a compact way of representing multi-beam LiDAR point clouds. Considering this, some works project the point cloud into a 2D grid based on the scanning geometry, using cylindrical [20] or spherical [21] projections. With this approach, only a small number of grids are empty, resulting in a very compact representation. Furthermore, by adding channels such as intensity or range values, the representation becomes a multi-channel 2D structured grid [20,22-25]. Its effectiveness and suitability for 2D CNNs have been proved by many applications in vehicle detection [20], object segmentation [26], semantic segmentation [22], ground segmentation [24], and end-to-end LiDAR point cloud matching [23,25].
Compared to 2D grid representations of 3D data, 3D grid representations preserve the original shape of the input point cloud and have a fixed real-world scale in each grid's neighbouring area. In [20], a 3D CNN is applied to a 3D grid model in order to detect vehicles. However, it is time-consuming to run the full computation on the whole 3D grid to detect interest points. Moreover, because most 3D grids are empty, the computation and memory cost of 3D grids is higher than that of 2D grids. Therefore, this data structure is more appropriate for extracting features [27,28] when a set of interest points is already given, but it is not suitable for detecting interest points from the whole given 3D data.

Interest point detection methods
Handcrafted interest point detectors such as ISS [9], Harris 3D [8], NARF [29], SHOT [30], and a clustering-based method [16] have made impressive achievements. In recent years, many deep learning based operators [31] and detectors have emerged. We divide these methods into three categories: supervised methods [32-35], weakly supervised methods [1,36,37], and unsupervised methods. Supervised methods use traditional classifiers or end-to-end networks [38] to detect interest points. Weakly supervised methods, such as 3DFeat-Net, train networks to detect interest points using ground-truth pose constraints or feature points produced by structure-from-motion algorithms [39]. To our knowledge, USIP achieves state-of-the-art performance on interest point detection in an unsupervised way. However, it is still not capable of detecting enough interest points in some highway scenes. Our work was inspired by the work in [40], which re-localises interest points using L2 norms of CNN filters in 2D images. In our work, however, the image processing techniques are applied after generating the local feature map based on CNN filters. The rest of the paper describes the proposed method in detail.

THE PROPOSED INTEREST POINT DETECTOR
The core idea of the proposed detector lies in how to obtain the local feature map using CNN filters and how to obtain the saliency map from it.

2D representation of multi-beam light detection and ranging point cloud
Based on the scanning geometry, a multi-beam LiDAR point cloud can be projected into 2D structured data using a spherical projection. Here, this data structure is called a spherical ring. Due to its compactness and 2D structure, it has few empty grids that would bring redundant computation, and it is well suited to 2D convolutional operations.
In some works, such as [20,22,26], only the front part of the LiDAR point cloud is used. In this work, all points from the multi-beam LiDAR point cloud are projected into 2D grids, aiming at less information loss. Assume p = (x, y, z) is one point in the LiDAR point cloud. The projection functions are then

$$
c = (H - 1) - \left\lfloor \frac{\arcsin\!\left(z/\sqrt{x^{2}+y^{2}+z^{2}}\right) - \theta_{\min}}{\Delta_{el}} \right\rfloor,
\qquad
r = \left\lfloor \frac{\operatorname{arctan2}(y, x) + \pi}{\Delta_{az}} \right\rfloor,
$$

where c and r denote the row index and the column index of the grid cell in the spherical ring model, respectively, H is the height of the projection model, θ_min is the lowest vertical angle of the field of view, and Δ_az and Δ_el are the azimuth and vertical angular resolutions. More channels, such as the intensity of the laser beams or the range value, can also be projected into the grids. However, to keep the method concise and universal, we use only the coordinate values (x, y, z) to generate a spherical ring with three channels, although intensity is available in the dataset used for our experiments. Eventually, the unordered multi-beam LiDAR point cloud is transformed into a 2D structured spherical ring of size H × W × C, where W denotes the width of the model and C denotes the number of channels.
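For concreteness, the following is a minimal NumPy sketch of this projection. The function name, the default grid size, the row ordering (here: lowest beam at row 0), and the handling of out-of-range rows are illustrative assumptions; the concrete angular resolutions used in our experiments are given later in the experiments section.

```python
import numpy as np

def spherical_ring_projection(points, H=64, W=1800,
                              d_az=np.deg2rad(0.2), d_el=np.deg2rad(0.4),
                              el_min=np.deg2rad(-24.8)):
    """Project an (N, 3) point cloud into an H x W x 3 spherical ring.

    Returns the ring and a validity mask marking non-empty grids.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    dist = np.maximum(np.linalg.norm(points, axis=1), 1e-6)

    az = np.arctan2(y, x)          # azimuth angle in [-pi, pi]
    el = np.arcsin(z / dist)       # vertical (elevation) angle

    # Column index from azimuth, row index from elevation.
    cols = ((az + np.pi) / d_az).astype(np.int64) % W
    rows = ((el - el_min) / d_el).astype(np.int64)

    ring = np.zeros((H, W, 3), dtype=np.float32)  # channels: x, y, z
    mask = np.zeros((H, W), dtype=bool)

    keep = (rows >= 0) & (rows < H)
    # If several points fall into one grid, later points overwrite earlier ones.
    ring[rows[keep], cols[keep]] = points[keep].astype(np.float32)
    mask[rows[keep], cols[keep]] = True
    return ring, mask
```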

Local feature description using convolutional filters
With the 2D structured spherical ring as input, we address the task of detecting interest points as finding the grids that have salient feature differences with their neighbouring grids. To quantify the feature difference, we use CNN filters to generate a local feature map, which describes the local feature based on the neighbouring grids. The saliency of the local features can then be obtained using succinct image processing techniques. Figure 2 depicts the idea of local feature description using convolutional filters. After the training of the CNN, the first and second layers (shown as yellow grids) form the local feature descriptor. The output of the descriptor for one grid is a feature vector, and the 3 × 3 neighbouring area is the receptive field of the features. To compress the feature vectors into fewer dimensions, we set 1 × 1 as the convolutional kernel size of the second layer and set N1 > N2.
In order to train the CNN filters in an unsupervised way, a convolutional auto-encoder (CAE) [41] is applied, with the source spherical ring serving as both the network input and the reconstruction target. After the unsupervised training in this auto-encoder manner, the front layers are taken as the local feature response network, with the whole spherical ring as input and the local feature map as output. Hence, this method has the following advantages: (1) the CNN filters are trained in an unsupervised end-to-end manner; (2) the network is very light due to the shared parameters; (3) the network is easily adjusted. The pipeline of training and inference is shown in Figure 3.
Auto-encoders have a typical bottleneck structure: the output sizes of the layers generally decrease first and then increase, while the number of channels usually changes in the reverse way. This design forces a CAE to adjust its parameters to achieve feature compression. In our network, the second layer has fewer output channels than the first layer, in order to reduce the number of output dimensions, lower the amount of computation, and simultaneously increase the non-linearity of the feature description. Figure 2 is the meta-structure of the response network in Figure 3; the second convolutional layer in Figure 2 is depicted as the third layer in Figure 3. This convolutional layer uses a 1 × 1 kernel. Therefore, the receptive field of the response network equals the kernel size of the first convolutional layer, which is 3 × 3. As mentioned in the previous part, the size of the input spherical ring is H × W × C. We denote the local feature map generated by the response network as the response image R, of size H × W × N2, where N2 is the number of output channels and also the dimension of the local features. The following succinct image processing procedures are then carried out to obtain the feature saliency map and the interest points.
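As an illustration, the response network can be sketched in PyTorch as the two convolutional layers below; the channel counts N1 = 32 and N2 = 16 are hypothetical placeholders, since the text only requires N1 > N2.

```python
import torch
import torch.nn as nn

C, N1, N2 = 3, 32, 16   # input channels and hypothetical filter counts

response_net = nn.Sequential(
    nn.Conv2d(C, N1, kernel_size=3, padding=1),  # 3 x 3 receptive field
    nn.ReLU(),
    nn.Conv2d(N1, N2, kernel_size=1),            # 1 x 1 compression layer
    nn.ReLU(),
)

# One spherical ring of shape (1, C, H, W) yields a response image R of
# shape (1, N2, H, W): an N2-dimensional local feature for every grid.
ring = torch.randn(1, C, 69, 1800)
R = response_net(ring)
```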

Obtaining saliency map and interest points
We address the generation of the saliency map by comparing each grid with its neighbouring grids. Assume that each grid has a (2h + 1) × (2h + 1) neighbourhood; then the local feature of each grid is compared with ((2h + 1)² − 1) neighbour grids. First, for each grid, all the response differences with its neighbour grids are recorded in a feature difference map D. Second, for each grid, based on the validity mask generated during the spherical ring projection, the smallest feature difference among all the valid neighbour grids is taken as the saliency score. Third, from the saliency map S, the interest points are picked out as the grids with high saliency scores.
The feature difference map D is calculated as

$$
D(r, c, i, j) = R(r, c, :) - R(r + i - h,\; c + j - h,\; :), \qquad 0 \le i, j \le 2h,
$$

where the size of D is H × W × (2h + 1) × (2h + 1) × N2. In order to quantify the difference between neighbouring features, the difference norm map D_N is generated by computing the L2-norms of D:

$$
D_N(r, c, i, j) = \mathcal{N}\big(D(r, c, i, j)\big),
$$

where 𝒩 denotes a function that returns the L2-norm of the input matrix or vector. Thus, all the feature difference values of the neighbour grids of each grid are stored in D_N, which has a size of H × W × (2h + 1) × (2h + 1). The saliency score of each grid is obtained by picking out the smallest difference among its valid neighbouring grids. The calculation of the saliency map S is

$$
S(r, c) = \min\big(D_N(r, c, :, :),\; M\big),
$$

where min denotes a function that outputs the smallest valid value in the input difference norm map D_N(r, c, :, :) under its corresponding mask M. Finally, a saliency map of size H × W is obtained, indicating the score of each grid in the spherical ring for being an interest point.
To avoid getting meaningless points, mediocre grids with lower saliency scores are filtered out by a saliency threshold, and the total number of detected interest points is limited by a parameter N. In order not to pick interest points too close to the sensor, we also filter out interest points within a minimum distance to the LiDAR.
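A compact NumPy sketch of this whole procedure is given below. It assumes the response image R and the validity mask come from the previous steps; the circular shift along the column axis is natural for a 360° scan, whereas the wrap-around along the row axis is a simplification at the top and bottom borders. The parameter names tau, n_max, and r_min are placeholders for the thresholds described above, and the same array operations map directly onto the CuPy module used in our implementation.

```python
import numpy as np

def saliency_map(R, mask, h=1):
    """Saliency score per grid: the smallest L2 feature difference between a
    grid and its valid (2h+1)^2 - 1 neighbours in the response image R."""
    S = np.full(R.shape[:2], np.inf)
    for i in range(-h, h + 1):
        for j in range(-h, h + 1):
            if i == 0 and j == 0:
                continue
            shifted = np.roll(R, (i, j), axis=(0, 1))
            shifted_mask = np.roll(mask, (i, j), axis=(0, 1))
            d = np.linalg.norm(R - shifted, axis=2)  # L2 norm over channels
            d[~shifted_mask] = np.inf                # skip invalid neighbours
            S = np.minimum(S, d)
    S[np.isinf(S)] = -np.inf   # grids with no valid neighbour are excluded
    S[~mask] = -np.inf         # empty grids can never be interest points
    return S

def select_interest_points(ring, S, tau=0.2, n_max=1024, r_min=10.0):
    """Keep at most n_max grids with saliency above tau and range above r_min;
    the default values mirror those reported in the experiments section."""
    rng = np.linalg.norm(ring, axis=2)
    rows, cols = np.where((S > tau) & (rng > r_min))
    order = np.argsort(-S[rows, cols])               # highest saliency first
    return rows[order[:n_max]], cols[order[:n_max]]
```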

EXPERIMENTS AND ANALYSIS
In the following, we first describe the details of the experiments; then the comparison with other methods in both highway and ordinary scenarios, based on the KITTI dataset, is presented.

Dataset
We use the KITTI odometry dataset for our experiments. It is one of the most widely used benchmarks for evaluating computer vision tasks. The data is gathered by a 64-beam LiDAR, a Velodyne HDL-64E. The 22 sequences in the dataset cover most types of driving scenes, such as city streets, highways, etc.
Half of the data contains ground-truth poses for the LiDAR point clouds. The ground-truth poses are used to evaluate the matching accuracy based on the detected interest points and thereby to indicate the matchability of the interest points.
Additionally, according to [42], there is an intrinsic parameter error of 0.22°. We applied this 0.22° correction to our point cloud, and also to the interest points detected by the other methods.

Spherical ring projection
The key scanning geometry parameters of the Velodyne HDL-64E for the spherical ring projection are: 64 laser beams covering a vertical FOV of [−24.8°, 2°] with a vertical angular resolution of ∼0.4°, and a 360° horizontal FOV with a 0.18° angular resolution (azimuth). There are around 0.13 million points in each frame of the point cloud.
Because the projected spherical rings have more empty grids than expected, due to the movement of the vehicle during data collection, we set the angular resolutions of the projection coarser than the physical ones: in our experiments, the azimuth resolution is 0.2° and the vertical resolution is 0.4254°. Operations such as denoising are not carried out, in order to reduce the amount of handcrafting and to keep the whole method easy to adjust. Even so, not all hollow grids are avoided; this problem is left to the network, which handles it automatically through CNN training.
The actual point cloud may exceed the expected vertical FOV because of platform motion during the rotation of the LiDAR sensor. In order to keep as many points as possible, we set the height of the spherical ring to 69, which is larger than 64. Additionally, if more than one point falls into the same grid, the earlier points are overwritten by the last one.

Network details

Table 1 shows the network details. Only convolutional layers (Conv.), max-pooling layers (Max.), and up-pooling layers (Up.) are used in the CNN auto-encoder network. Filter numbers and channel numbers are shown in bold font. We use the ReLU activation function for all layers except the last one, because the last layer outputs 3D coordinates, which may contain negative values. Moreover, we use mean squared error as the loss function and Adam as the optimisation method, with the default parameters in [43]. The data of all 22 sequences in the KITTI dataset is used for network training. The training was carried out on two NVIDIA 1080 Ti graphics cards.
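The following PyTorch sketch illustrates this training setup under the stated loss and optimiser. The exact layer widths of Table 1 are not reproduced here, so the channel counts and the single pooling stage are assumptions; the front "response" layers correspond to the descriptor kept after training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAE(nn.Module):
    """Convolutional auto-encoder; the front 'response' layers are kept
    after training as the local feature response network."""
    def __init__(self, C=3, N1=32, N2=16, N3=8):
        super().__init__()
        self.response = nn.Sequential(               # kept after training
            nn.Conv2d(C, N1, 3, padding=1), nn.ReLU(),
            nn.Conv2d(N1, N2, 1), nn.ReLU(),
        )
        self.pool = nn.MaxPool2d(2, ceil_mode=True)  # Max. layer
        self.enc = nn.Sequential(nn.Conv2d(N2, N3, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(N3, C, 3, padding=1)    # no ReLU: coordinates
                                                     # may be negative
    def forward(self, x):
        z = self.enc(self.pool(self.response(x)))
        z = F.interpolate(z, size=x.shape[2:])       # Up. back to input size
        return self.out(z)

model = CAE()
optimiser = torch.optim.Adam(model.parameters())     # default Adam parameters
loss_fn = nn.MSELoss()

ring = torch.randn(4, 3, 69, 1800)                   # dummy batch of spherical rings
loss = loss_fn(model(ring), ring)                    # reconstruct the input itself
optimiser.zero_grad()
loss.backward()
optimiser.step()
```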

Processing after getting feature map
Based on the parameter definitions in the previous section, we set the minimum distance threshold to 10 m and the saliency threshold to 0.2. The recommended value for N is between 512 and 1024. For fast computation, we implement all the operations of this part using the Python module CuPy, which uses the GPU to accelerate matrix computation. Processing one frame of data takes less than 0.02 s on one NVIDIA 1080 Ti graphics card. The source code is released at https://github.com/SRainGit/CAE-LO.

Figure 4 shows a comparison of the interest point detection results in a highway scenario for six methods, including ours. All methods are set to detect 1024 interest points from the given point cloud. The first three methods are handcrafted: ISS, Harris, and SIFT; the implementations are from the authors of USIP. These three methods clearly detect too many meaningless points on the ground and few matchable points on the road sides. The last three methods are 3DFeat-Net, USIP, and our method. 3DFeat-Net uses an attention mechanism to avoid detecting points from meaningless candidates, which shows its effectiveness in the comparison, but it is still not enough compared to our method. USIP's interest points gather around the crossing lines between the road sides and the ground rather than on matchable local details. In contrast to the unsatisfactory results of the other methods, our method is fully capable of detecting interest points on guardrails and signposts; the colour pictures are shown in Figure 5. This characteristic gives our method an advantage in the matching case in Figure 6.

Figure 6 shows two pairs of matching results using USIP interest points and our interest points, both with the same USIP descriptor, in a highway scenario. Due to the lack of surrounding buildings and trees, the only matchable interest points lie mostly on the guardrails and signposts. USIP clearly fails in this scenario, whereas our method captures the interest points located on both guardrails and signposts, thereby significantly reducing the matching error. Figure 7 uses the same experimental setting but in an ordinary scenario. Both methods successfully match the point clouds. However, although both methods are set to detect 1024 interest points, our method appears to detect more points in both Figures 6 and 7; on closer inspection, the USIP interest points cluster around the same areas or even the same points. Moreover, our interest points are scattered farther from the LiDAR sensor. According to [42], interest points detected close to the sensor are not very effective in constraining the rotation in point cloud matching tasks. From this point of view, our interest points have a further advantage in the registration problem.

Comparison with other methods
To further evaluate the performance of the interest points and to illustrate the advantages of our method, we carry out frame-to-frame feature-based matching on the KITTI dataset to compare with other methods. RANSAC is used for the feature-based matching. Similar experiments can be found in our previous preprint [44]. We use the USIP descriptor to extract features from the interest points detected by all methods. The inlier threshold is set to 1.0 m, and the number of RANSAC iterations is limited to 10,000. All of the first 11 KITTI sequences, which have ground-truth poses, are used for the matching experiment. We use the relative translation error (RTE) and relative rotation error (RRE) [45] to evaluate the matching accuracy based on the detected interest points, and the matching success rate to assess the matchability of the detected interest points across all scenarios in the dataset. We consider a matching successful when RTE < 0.5 m and RRE < 1°. Table 2 strongly supports the advantage of our method in highway scenarios: compared to the state of the art, the RTE and RRE based on our method are decreased by over 37% and 16%, respectively, and the success rate is increased from 94.73% to 98.45%. From Table 3 (the best values are marked in bold font), it is easy to observe that our interest points achieve the lowest RTE and the highest success rate in general scenes, while our method has a slightly higher RRE than USIP. In other words, our method performs better in highway scenes, while showing little difference from the state of the art elsewhere. This can be explained in two respects.
(i) Benefiting from the local feature description capability trained by the CNN, our method is more capable of detecting local details, especially in unstructured scenarios such as highways. USIP, in contrast, detects interest points based on point clusters generated by point grouping algorithms, which makes it much less sensitive to local details with sparse points. In the USIP work, experiments with 128, 256, and 512 interest points were carried out [2]; we test USIP with 1024 interest points here. We found that although USIP keeps decreasing the matching error as the number of detected interest points increases, it does not pay more attention to local details but simply concentrates on the same areas or the same places. In contrast, our method can capture more local details in areas such as guardrails and signposts. Hence, our method achieves a lower RTE as well as a higher success rate. (ii) USIP estimates the locations of its interest points instead of picking existing points from the point cloud. The points are gathered by LiDAR sensors with fixed angular resolutions, which means that interest points detected by methods such as ours and 3DFeat-Net inherently carry a certain angular error. From this perspective, USIP can achieve a smaller RRE.
In summary, our interest points show significant advantages in highway scenarios due to their capability of capturing local details, while the proposed detector shows almost equivalent accuracy to the state of the art in ordinary scenarios.
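For reference, a small sketch of how the RTE and RRE metrics used above can be computed, assuming 4 × 4 homogeneous pose matrices:

```python
import numpy as np

def rte_rre(T_est, T_gt):
    """Relative translation error (m) and relative rotation error (deg)
    between an estimated 4 x 4 pose and the ground-truth pose."""
    delta = np.linalg.inv(T_gt) @ T_est
    rte = np.linalg.norm(delta[:3, 3])
    # Rotation angle of the residual rotation matrix.
    cos_a = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rre = np.degrees(np.arccos(cos_a))
    return rte, rre

# A matching is counted as successful when rte < 0.5 and rre < 1.0.
```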

CONCLUSION
Motivated by improving the accuracy and reliability of localisation in autonomous driving, this paper proposes an unsupervised CNN-based detector for detecting interest points in multi-beam LiDAR point clouds. Our method utilises the compact structure of a multi-channel spherical ring to project the 3D point cloud into 2D grid data, which makes 2D CNN methods applicable. We also use an auto-encoder network architecture to train the CNN filters in an unsupervised way. Compared with state-of-the-art works, our method decreases the point cloud registration error by more than 16% in unstructured scenarios. Based on the same spherical ring representation, further studies can be conducted on tasks such as segmentation and object recognition. Moreover, the proposed interest point detector based on local feature difference saliency can also be applied to image-based registration.