DAN-Conv: Depth-aware non-local convolution for LiDAR depth completion

Sparse LiDAR depth completion is a beneficial task for many robotic applications. It generates a dense depth prediction from a sparse depth map and its corresponding aligned RGB image. This image-guided depth completion task poses two main challenges: sparse data processing and multi-modality data fusion. In this letter, we address them with two novel solutions. (1) To efficiently process sparse depth input, a Depth Aware Non-local Convolution (DAN-Conv) is proposed. It augments the spatial sampling locations of a convolution operation. Specifically, DAN-Conv constructs a non-local neighbourhood relationship by exploiting the intrinsic correlation among observable depth pixels. Moreover, it can readily replace standard convolution without introducing additional network parameters. (2) A Symmetric Co-attention Module (SCM) is proposed to fuse and enhance features from the depth and image domains. It estimates the importance of complementary features via the co-attention mechanism. Finally, a neural network built on DAN-Conv and SCM is proposed. It achieves competitive performance on the challenging KITTI depth completion benchmark. Compared to approaches with similar accuracy, this lightweight network requires significantly fewer learnable parameters.

Introduction: Owing to the limitations of the LiDAR (Light Detection and Ranging) sensor, it produces an accurate but sparse 3D point cloud description of the surroundings. After projecting the point cloud onto the image plane, the observable pixels on the depth map are fairly sparse (e.g. less than 6% carry depth information in KITTI [1]). The goal of completion is to predict the depth value of the non-observable pixels. Recent studies have shown that Convolutional Neural Networks (CNNs) can achieve impressive performance on this task. These approaches fall into two types: (1) Depth-only approaches [1,2,17] utilize the continuity of depth values to interpolate missing depth from the sparse input only. These works demonstrate the strong correlation among adjacent depth points. (2) Image-guided approaches [3,4,15] exploit depth cues from the RGB image as auxiliary guidance. The rich semantic cues and texture information from the image domain help these methods achieve relatively higher completion accuracy.
In CNN-based completion methods, the fixed receptive field of a standard 2D convolution inevitably includes irrelevant and even blank neighbours from the sparse depth map. To overcome this shortcoming of the sampling strategy, SI-Conv [1] normalizes the convolution output with a binary observation mask. However, the number of valid feature points contributing to each output pixel remains non-uniform. Xu et al. [6] utilize Deformable ConvNets [5] to process the depth map, employing an auxiliary network to predict non-local neighbours. By visualizing the positions of neighbours predicted by Deform-Conv, they observed that adjacent points with similar depth values are highly correlated, while drastic changes of depth value always appear at object boundaries.
Based on the above observations, we conjecture that depth-value differences alone are sufficient to guide the selection of non-local neighbours, and we propose DAN-Conv. Specifically, DAN-Conv draws its sampling points from a fixed number of adjacent observable points with similar depth values. Unlike Deform-Conv, which predicts neighbour offsets with several neural network layers, DAN-Conv requires no extra learnable components and can be integrated into an existing network seamlessly. As shown in Figure 1, we select several depth points and illustrate the neighbours generated for them by DAN-Conv. Evidently, DAN-Conv aggregates more homogeneous feature points than a standard convolution. According to Zhong et al. [7], there is an inherent relationship between the depth map and the RGB image: although completion only happens at the non-observable pixels, it is still necessary to exploit the complementarity between the two modalities so that they enhance each other. Nevertheless, the widely used channel-wise concatenation cannot effectively exploit this prior. Zhong et al. [7] also note that statistical techniques, for example correlation analysis, can be used to learn the shared subspace between two modalities. Based on these arguments, we adopt the co-attention mechanism [9] to jointly reason about the complementarity between depth and image features. Specifically, we propose a Symmetric Co-attention Module (SCM) that fuses and then enhances point-wise features by adaptively estimating the importance of the feature from the other modality.
Built on DAN-Conv and SCM, a novel encoder-decoder structured CNN is proposed for this completion task. In the encoder, DAN-Conv processes point-wise data gathered from the depth map and the RGB image; the non-local neighbourhood relationship extracted from the sparse depth map is replicated to the image domain. In the decoder, feature maps from the two modalities are fused and upscaled in a cascaded manner. Thanks to the effectiveness and simplicity of this design, our model achieves state-of-the-art performance while maintaining a lightweight network structure.
In summary, the major contributions of this letter are: (1) a non-local convolution operation that improves the efficiency of sparse depth map processing; (2) a multi-modality fusion module that exploits the complementarity between depth and image data via the co-attention mechanism; (3) experimental results demonstrating that our model achieves competitive performance with significantly fewer network parameters.
Related work: Depth-only methods use sparse depth measurements alone as input to recover the full-resolution depth map. Despite the scarcity of observable depth points, these approaches demonstrate that the non-observable depth areas can be interpolated from the sparse input only. Ku et al. [2] apply several image processing techniques to the inverted depth map, including hole filling and blurring to round off edges and remove outlier pixels. The main disadvantage of [2] is its insufficient adaptability to various scenes under different sparsity patterns. Recently, more deep learning based methods have been proposed to solve this problem. Uhrig et al. [1] propose a sparse convolution based deep neural network; their sparsity-invariant convolution explicitly accounts for data sparsity by utilizing a binary observation mask. Eldesokey et al. [22] propagate confidences together with the sparse input through a novel normalized convolution; their network is trained to minimize the data error and maximize the output confidence simultaneously.
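For context, the sparsity-invariant convolution of [1] can be sketched in a few lines of PyTorch. This is our paraphrase of the idea, not the authors' code; the class name and the use of a single shared mask channel are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparsityInvariantConv2d(nn.Module):
    """Sketch of the sparsity-invariant convolution idea from [1]:
    the response is normalized by the number of valid (observed)
    pixels inside each receptive field."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.pool = nn.MaxPool2d(kernel_size, stride=1,
                                 padding=kernel_size // 2)

    def forward(self, x, mask):
        # x: [B, C, H, W]; mask: [B, 1, H, W] float binary observation mask.
        y = self.conv(x * mask)                      # ignore unobserved pixels
        # Count the valid pixels inside each window for normalization.
        valid = F.avg_pool2d(mask, self.k, stride=1,
                             padding=self.k // 2) * self.k * self.k
        y = y / (valid + 1e-8) + self.bias.view(1, -1, 1, 1)
        # Propagate the mask: a window is valid if any input pixel was.
        return y, self.pool(mask)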
It is a non-trivial task for depth-only methods to accurately infer the missing depth values when the span between observable depth points is too wide. Thus, recently proposed methods use an auxiliary RGB image to guide the completion procedure. These image-guided methods require the extra RGB image to be pixel-aligned with the sparse depth map. Ma et al. [15] propose a self-supervised framework to train an encoder-decoder network; without using semi-dense annotation, they exploit the photometric consistency of nearby RGB frames as the supervision signal. DeepLiDAR [4] and Xu et al. [3] further boost the completion accuracy on LiDAR depth data by using surface normals to constrain depth generation; a synthetic dataset with pre-generated surface normals is used to train their networks. Van Gansbeke et al. [10] predict dense depth through two pathways: a global pathway uses the concatenated RGB image and sparse depth map to downscale the feature map and extract guidance information, while a local pathway performs completion with the help of that guidance. Jaritz et al. [17] propose a modified NASNet [20] that handles depth completion and semantic segmentation simultaneously; their sparse training strategy enables the model to handle various sparsity patterns. These works demonstrate that accurate depth completion requires auxiliary information such as depth cues from the image domain. Nevertheless, a small displacement between the RGB image and the depth map is inevitable, and how to effectively fuse data from multiple modalities remains an open problem.
Approach: Depth Aware Non-local Convolution (DAN-Conv) generates non-local neighbours for each observable point and then aggregates the collected features as a standard convolution does. Instead of applying filters with a regular structure, for example a 3×3 grid, DAN-Conv avoids aggregating non-observable and irrelevant local neighbours to a certain extent. Figure 2 illustrates the process of DAN-Conv. The channel width of the input feature map is $C_{in}$. We first gather $N$ point-wise features from the input map at the positions where depth measurements are available. Then, the receptive field of the convolution is constructed from the indices of the $K$ neighbours generated by Algorithm 1. After that, the organized neighbourhood features are concatenated with the gathered features before the multiplication with the weights. The convolution weight has a shape of $[(K+1) \cdot C_{in}, C_{out}]$. Finally, the $C_{out}$-width point-wise features are scattered back to a blank feature map with the help of a binary observation mask $M$.
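To make this data flow concrete, the following PyTorch sketch mirrors our reading of Figure 2. The class name, tensor layouts, and the per-sample loop are illustrative assumptions, not the released implementation:

import torch
import torch.nn as nn

class DANConv(nn.Module):
    """Sketch of DAN-Conv: each observable point is aggregated with its
    K precomputed non-local neighbours via one shared linear layer."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        # Weight shape [(K + 1) * C_in, C_out], as stated in the text.
        self.linear = nn.Linear((k + 1) * c_in, c_out)

    def forward(self, fmap, mask, nbr_idx):
        # fmap: [B, C_in, H, W]; mask: [B, 1, H, W] binary observation mask
        # nbr_idx: per-sample [N, K] indices into the N observable points,
        # produced in advance by Algorithm 1.
        b, c, h, w = fmap.shape
        flat = fmap.view(b, c, h * w).transpose(1, 2)        # [B, H*W, C_in]
        obs = mask.view(b, h * w).bool()                     # [B, H*W]
        out = torch.zeros(b, h * w, self.linear.out_features,
                          device=fmap.device)
        for i in range(b):                                   # per-sample for clarity
            pts = flat[i][obs[i]]                            # [N, C_in] gathered features
            nbrs = pts[nbr_idx[i]]                           # [N, K, C_in] neighbour features
            feat = torch.cat([pts, nbrs.flatten(1)], dim=1)  # [N, (K+1)*C_in]
            out[i][obs[i]] = self.linear(feat)               # scatter back via the mask
        return out.transpose(1, 2).view(b, -1, h, w)         # [B, C_out, H, W]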
The core of DAN-Conv is the generation of $K$ neighbours for each observable pixel. As described in Algorithm 1, for each observable point we first retrieve the indices of its $2K$ nearest neighbours according to their spatial locations. Then, the $K$ output neighbours, that is the set $\mathcal{N}_i$, are selected from these candidates as the ones with the most similar depth values. The output neighbourhood relationship is thus represented as a set of feature indices.
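Algorithm 1 can be rendered compactly as follows. This sketch builds a full pairwise distance matrix, which assumes all observable points of one frame fit in memory:

import torch

def generate_neighbours(coords, depths, k):
    """Sketch of Algorithm 1: for each observable point, take its 2K
    spatially nearest observable points, then keep the K of them with
    the most similar depth values. coords: [N, 2] pixel locations,
    depths: [N] depth values; returns [N, K] neighbour indices."""
    dist = torch.cdist(coords.float(), coords.float())   # [N, N] spatial distances
    dist.fill_diagonal_(float('inf'))                    # exclude the point itself
    cand = dist.topk(2 * k, largest=False).indices       # [N, 2K] nearest in image space
    ddiff = (depths[cand] - depths.unsqueeze(1)).abs()   # [N, 2K] depth differences
    pick = ddiff.topk(k, largest=False).indices          # [N, K] most similar depths
    return torch.gather(cand, 1, pick)                   # [N, K]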
We propose a Symmetric Co-attention Module (SCM) to fuse and enhance features from the two modalities by learning the complementarity between them. Inspired by the co-attention mechanism [9], the joint representation of the depth and image features is used to generate two weighting maps for attentional fusion. Specifically, we pass it through two FC+Sigmoid layer combinations separately. Finally, the feature from one modality, for example $F_I$, is enhanced by its complement, that is $F_d$, with a weight map $w_d$ in a summation manner: $\hat{F}_I = F_I + w_d \odot F_d$.
Fig. 3: The process of the Symmetric Co-attention Module.
Fig. 4: The process of the DAN-MOD module; the symbols ⊖, ⊕, and ⊗ represent element-wise subtraction, summation, and multiplication, respectively.
The DAN-MOD module performs feature fusion and feature map downscaling. It consists of DAN-Conv, SCM, and standard 2D convolution layers; its structure is shown in Figure 4. Besides the feature map $F_{in}$ and its corresponding observation mask $M$, DAN-MOD takes the point-wise features from the other modality as input. It first applies a standard convolution to the feature map to generate $F_{2d}$; the stride parameter $S$ decides whether downscaling is performed. Then, DAN-Conv processes the point-wise features of each observable point. The neighbourhood relationship required by DAN-Conv is computed in advance from the depth map at the corresponding scale. The point-wise feature $F_s$ output by SCM covers only the observable pixels; the non-observable positions are filled by the scheme proposed in [8]. The intermediate dense feature map is thus generated as $F = F_s + (1 - M) \odot F_{2d}$. Finally, a skip connection [19] from input to output is applied to avoid gradient vanishing via residual learning.
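The SCM step above can be sketched as follows. The concatenation used to form the joint representation is our assumption; the letter only specifies the two FC+Sigmoid branches and the summation-style enhancement:

import torch
import torch.nn as nn

class SCM(nn.Module):
    """Sketch of the Symmetric Co-attention Module: two FC+Sigmoid
    branches produce one weighting map per modality, and each feature
    is enhanced by its weighted complement."""
    def __init__(self, c):
        super().__init__()
        self.fc_d = nn.Sequential(nn.Linear(2 * c, c), nn.Sigmoid())
        self.fc_i = nn.Sequential(nn.Linear(2 * c, c), nn.Sigmoid())

    def forward(self, f_d, f_i):
        # f_d, f_i: [N, C] point-wise features from the depth / image branch.
        joint = torch.cat([f_d, f_i], dim=1)         # assumed joint representation
        w_d, w_i = self.fc_d(joint), self.fc_i(joint)
        # Summation-manner enhancement, symmetric for both modalities.
        return f_d + w_i * f_i, f_i + w_d * f_d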
The architecture of the proposed encoder-decoder network is depicted in Figure 5.
The encoder consists of a depth branch and an image branch, built upon multiple DAN-MOD modules. For feature maps at a specific scale, a pair of DAN-MODs is employed first; a second pair then performs both feature fusion and feature map downscaling. We use the sampling method proposed by Li et al. [11] to generate downscaled depth maps; this strategy minimizes the breaking of intrinsic structure caused by max pooling. The extracted neighbourhood relationship is shared by the two branches. The decoder is built from several Symmetric Gated Fusion Modules (SGFM) [18], which have a symmetric structure for fusing multi-modal contextual representations. Finally, three dense depth maps, that is $\hat{D}$, $\hat{D}_d$, and $\hat{D}_I$, are predicted for the fusion, depth, and image branches, respectively; $\hat{D}$ is the final output.
Similar to [18], the training process is mainly supervised by the MSE reconstruction loss between the three predictions and the ground truth. In addition, we use a smoothness loss to encourage local smoothness of the dense prediction. This setting resolves the gridding effects caused by transposed convolution [21].
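A hedged sketch of this objective is given below. The weighting `lam` and the first-order gradient penalty standing in for the smoothness term are our assumptions; the letter's exact formulation follows [18] and [21] and may differ:

import torch

def completion_loss(pred, pred_d, pred_i, gt, lam=0.1):
    """Sketch of the training objective: MSE between each of the three
    predictions and the semi-dense ground truth (on valid pixels only),
    plus a smoothness penalty on the fused prediction."""
    valid = gt > 0                                   # semi-dense ground truth mask
    mse = sum(((p - gt)[valid] ** 2).mean()
              for p in (pred, pred_d, pred_i))
    # Penalize large local depth gradients to suppress gridding artefacts.
    dx = (pred[..., :, 1:] - pred[..., :, :-1]).abs().mean()
    dy = (pred[..., 1:, :] - pred[..., :-1, :]).abs().mean()
    return mse + lam * (dx + dy)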
Experiments: We compare our model against state-of-the-art methods on the KITTI depth completion benchmark [1]. KITTI provides pairs of aligned RGB images and sparse LiDAR depth maps. The quantitative results are evaluated by four metrics: root mean squared error (RMSE), mean absolute error (MAE), root mean squared error of the inverse depth (iRMSE), and mean absolute error of the inverse depth (iMAE). The model used in this comparison is trained for 50 epochs. The performance of our proposed network is compared with other state-of-the-art works in Table 1. It achieves competitive performance and even outperforms some approaches that use additional data or labels.
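For reference, the four benchmark metrics can be computed as follows, assuming depths in metres and the leaderboard's units (mm for RMSE/MAE, 1/km for iRMSE/iMAE), with strictly positive predictions:

import torch

def kitti_metrics(pred, gt):
    """The four KITTI depth-completion metrics on valid ground-truth pixels."""
    v = gt > 0
    err = (pred[v] - gt[v]) * 1000.0                 # metres -> millimetres
    ierr = (1.0 / pred[v] - 1.0 / gt[v]) * 1000.0    # 1/m -> 1/km
    return {
        'RMSE':  err.pow(2).mean().sqrt().item(),
        'MAE':   err.abs().mean().item(),
        'iRMSE': ierr.pow(2).mean().sqrt().item(),
        'iMAE':  ierr.abs().mean().item(),
    }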
The ablation models are trained for 30 epochs to evaluate the effectiveness of the SCM and the number of neighbours $K$ in DAN-Conv. The results in Table 2 demonstrate the improvement brought by SCM: although its impact on the scale and complexity of our network is negligible, it plays an important role in determining prediction accuracy.

Conclusion:
In this letter, we focus on solving two challenges of the image-guided sparse depth completion task: how to process sparse depth data and how to fuse features from different modalities. For the first, we propose a novel depth-aware non-local convolution to replace standard convolution; it aggregates non-local contextual information adaptively by selecting adjacent observable points with similar depth values. For the second, a co-attention based symmetric module is proposed to exploit the complementarity between depth and image features. Finally, a network built upon these two solutions achieves state-of-the-art performance on the challenging KITTI benchmark with only 1.46M parameters. The source code is available at https://github.com/godspeed1989/DANConv.