Point cloud super-resolution based on geometric constraints

Among all digital representations we have for real physical objects, three-dimensional (3D) representation is arguably the most expressive encoding. However, due to the limitations of 3D scanning equipment, point clouds often become sparse or partially missing. A point cloud super-resolution (PCSR) method based on geometric constraints is proposed to solve the sparsity problem: it generates dense point clouds from sparse point clouds. The method is based on a conditional generative adversarial network whose generator and discriminator are redesigned specifically for point cloud data. Moreover, the method maintains the shape of the dense point cloud by adding geometric constraints. The contributions of our work are as follows: (1) a PCSR method based on geometric constraints is proposed; (2) a module for obtaining point cloud neighbourhood information, called the K-nn operation module, is added to the generator; and (3) feature aggregation is performed using weighted pooling to process the neighbourhood information obtained by the K-nn operation module. Extensive experimental results demonstrate the effectiveness of the proposed method.


| INTRODUCTION
Among all digital representations we have for real physical objects, three-dimensional (3D) representation is arguably the most expressive encoding. 3D representations allow storage and manipulation of high-level information (e.g. semantics, affordances, function) as well as low-level features (e.g. appearance, materials) about the object. However, unlike the 2D image, there is no unified output representation for 3D geometry, and the choice of representation is critical for learning a good generative model of 3D shapes. Voxel representations are a straightforward generalization of pixels to the 3D case, but their shortcomings are also obvious: voxels lead to data sparsity and to the high computational cost of 3D convolution. Some methods use meshes to represent the discrete high-dimensional space, so that objects can be represented naturally. However, these methods require a template mesh of the same topology and do not allow arbitrary changes. A point cloud is simply a set of points, a representation much simpler than a mesh. Although there are no connections between the points in a point cloud, the point cloud is the representation closest to raw sensor data.
3D point cloud understanding is a long-standing problem. Typical tasks include 3D object classification [1], 3D object detection [2][3][4][5], 3D semantic segmentation [6][7][8] and 3D object reconstruction [9,10]. Recently, the PointNet [6] architecture directly operates on point clouds instead of 3D voxel grids or meshes. It not only accelerates computation but also notably improves segmentation performance. However, PointNet does not capture local structures induced by the metric space the points live in, limiting its ability to recognize fine-grained patterns and its generalizability to complex scenes. To address this problem, PointNet++ [7] leverages neighbourhoods at multiple scales to achieve both robustness and detail capture. DGCNN [11] proposes a new neural network module dubbed EdgeConv suitable for CNN-based high-level tasks. It can not only incorporate local neighbourhood information but also be stacked or applied recurrently to learn global shape properties.
The sparse point cloud can describe the outline of an object, but the lack of detail will affect its classification and recognition. In the field of two-dimensional images, single image super-resolution (SISR) is a classic computer vision problem, which aims to recover a high-resolution (HR) image from a low-resolution (LR) image. The popular approach is to use deep convolutional neural networks and generative adversarial networks to restore a higher-quality image. Since SISR can restore high-frequency information, it is widely used in applications such as medical imaging [12], satellite imaging [13] and security and surveillance [14], where high-frequency details are greatly desired. Motivated by these methods, we can now use deep neural networks such as PU-Net [15], EC-Net [16] and PU-GAN [17] to upsample point clouds. However, these networks cannot produce reliable results for extremely sparse and uneven low-quality inputs. To obtain excellent results, these works also require a large amount of label information for the dataset.
Unlike two-dimensional images, point clouds are three-dimensional, irregular and unordered. It is therefore necessary to design a network different from general neural networks to adapt to the particularity of the point cloud format. To address this problem, the Point Cloud Super-Resolution (PCSR) method is proposed in this work. The method uses a conditional generative adversarial network, which is divided into a generator and a discriminator. A low-resolution point cloud is inputted into the generator to generate a high-resolution point cloud. The point cloud pairs are fed into the discriminator to determine whether the high-resolution point cloud is real or not. Our method can learn local and global features at the same time: global features guide the generation of point clouds, and local features improve the details and textures of the results. Some techniques [7,18] for learning local features treat points independently at a local scale to maintain permutation invariance. This independence, however, neglects the geometric relationships among points, a fundamental limitation that leads to missing local features. Instead of generating point features directly from the point cloud itself, our method generates edge features that describe the relationships between a point and its neighbours. The method can not only capture local geometric structures but also maintain permutation invariance. Moreover, the robustness of the network is improved by aggregating the edge features with weighted pooling. In the PCSR method, low-resolution point clouds are input to the generator, and high-resolution point clouds are obtained by fusing local and global features. The discriminator determines whether the generated high-resolution point cloud is correct and thereby improves the generator's results. Extensive experimental results demonstrate the effectiveness of the proposed algorithm.
The rest of this study is organized as follows: Section 2 reviews some traditional approaches to point cloud processing and image super-resolution. Section 3 gives a detailed explanation of PCSR, and the experimental results are shown in Section 4. Finally, this work is concluded in Section 5.

| Point cloud
Various 3D representations [1,6,7] have been explored recently for deep learning on 3D data. Among them, the point cloud representation is becoming increasingly popular due to its memory efficiency and intuitiveness.
A point cloud represents a geometric shape, typically its surface, as a set of 3D locations in a Euclidean coordinate frame. In 3D space, these locations are defined by their x, y, z coordinates. Thus, the point cloud representation of an object or scene is an N × 3 matrix, where N is the number of points; this matrix is referred to as the point cloud.
Some latest works [6,7,11,19] directly take raw point clouds as input without converting them to other formats. PointNet [6] pioneered the use of point clouds as a representation for deep learning tasks. It is a unified architecture that learns both global and local point features, providing a simple, efficient and effective approach for a number of 3D recognition tasks, such as classification, part segmentation and semantic segmentation. PointNet proposed to use shared multi-layer perceptrons and max-pooling layers to obtain the feature of a point cloud. It achieves permutation invariance by applying a fully connected neural network to each point independently, followed by a max-pooling operation. This approach approximates a general function defined on a point set by applying a symmetric function to the unordered points in the set:

$$f(\{x_1, \ldots, x_n\}) \approx g(h(x_1), \ldots, h(x_n)),$$

where $h$ is a multi-layer perceptron network and $g$ is a composition of a single-variable function and a max-pooling function; each point $x_i$ contains 3D coordinates. Because the max-pooling layer is applied across all the points in the point cloud, it is difficult to capture local features, limiting its ability to recognize fine-grained patterns and its generalizability to complex scenes. PointNet++ [7] improved the network in PointNet by adding a hierarchical structure. The hierarchical structure is similar to that of CNNs on images, which extract features starting from small local regions and gradually extend to larger regions. By applying this structure, PointNet++ is able to learn local features with increasing contextual scales. PointCNN [19] is a generalization of CNNs to leveraging spatially local correlation from data represented as point clouds. The core of PointCNN is the X-Conv operator, which weights and permutes input points and features before they are processed by a typical convolution. However, PointCNN is unable to achieve the permutation invariance that is necessary for point clouds.
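PointNet's symmetric aggregation described above — a shared per-point MLP $h$ followed by channel-wise max pooling $g$ — can be sketched in a few lines. The weights below are random stand-ins, and the final check illustrates the permutation invariance of the construction:

```python
import numpy as np

def shared_mlp(points, w, b):
    # h: the same per-point transform applied to every point, (n, 3) -> (n, d)
    return np.maximum(points @ w + b, 0.0)  # linear layer + ReLU

def pointnet_global_feature(points, w, b):
    # g: channel-wise max pooling, a symmetric function over the point dimension
    return shared_mlp(points, w, b).max(axis=0)

rng = np.random.default_rng(0)
pts = rng.standard_normal((1024, 3))
w, b = rng.standard_normal((3, 64)), rng.standard_normal(64)

feat = pointnet_global_feature(pts, w, b)
feat_shuffled = pointnet_global_feature(rng.permutation(pts), w, b)
assert np.allclose(feat, feat_shuffled)  # invariant to point ordering
```

Because the max is taken over all points at once, only the strongest activation per channel survives — which is exactly why fine local structure is lost.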
Dynamic graph CNN [11] proposed a method that can dynamically update the graph. This approach is inspired by PointNet and convolution operations. Instead of working on individual points like PointNet, however, graph neural networks exploit local geometric structures by constructing a local neighbourhood graph and applying convolution-like operations on the edges connecting neighbouring pairs of points. Different from traditional graph CNNs, the graph in dynamic graph CNN is dynamically updated after each layer of the network. The proximity in feature space differs from the proximity in the input, and the dynamic update leads to nonlocal diffusion of information throughout the point cloud. Wu et al. [20] extend the dynamic filter to a new convolution operation, named PointConv. PointConv can be applied to point clouds to build deep convolutional networks. This approach treats convolution kernels as nonlinear functions of the local coordinates of 3D points, comprising weight and density functions. With respect to a given point, the weight functions are learnt with multi-layer perceptron networks and the density functions are learnt through kernel density estimation.

| Super-resolution
Many image super-resolution methods have been developed in the computer vision community. The pioneering work is the Super-Resolution Convolutional Neural Network (SRCNN) proposed by Dong et al. [21]. SRCNN first upscales low-resolution images to the desired size using bicubic interpolation, which is the only pre-processing SRCNN performs. SRCNN's network structure is very simple, using only three convolution layers. However, its high computational cost still hinders practical usage that demands real-time performance. FSRCNN [22] aims at accelerating SRCNN and proposes a compact hourglass-shaped CNN structure for faster and better SR, redesigning the SRCNN structure mainly in three aspects. Subsequently, a deep network with 20 layers was proposed in VDSR [23] to improve the reconstruction accuracy of CNNs. VDSR uses a very deep convolutional network inspired by the VGG-net used for ImageNet classification [24]. Training a very deep network is hard due to a slow convergence rate; the residuals between the HR images and the interpolated LR images were used in [23] to speed up convergence during training and also to improve the reconstruction performance. Tai et al. [25] propose the Deep Recursive Residual Network (DRRN) for SISR. In DRRN, an enhanced residual unit structure is learnt recursively in a recursive block, and several recursive blocks are stacked to learn the residual image between the HR and LR images. The residual image is then added to the input LR image from a global identity branch to estimate the HR image. In SRDenseNet [26], the feature maps of each layer are propagated into all subsequent layers, providing an effective way to combine low-level and high-level features to boost the reconstruction performance. Ledig et al. [27] propose a super-resolution generative adversarial network (SRGAN), which employs a deep residual network (ResNet) with skip connections and diverges from MSE as the sole optimization target.
Different from previous works, SRGAN proposes a novel perceptual loss using high-level feature maps of the VGG network combined with a discriminator that encourages solutions perceptually hard to distinguish from the HR reference images.
Different from image super-resolution, Yu et al. [15] introduced a deep neural network, PU-Net, to upsample point sets based on the PointNet++ architecture. The core idea is to learn multi-level features of each point and implicitly expand the point set through a multi-branch convolution unit in feature space. The expanded feature is divided into multiple features, which are reconstructed into an upsampled point set. PU-Net is applied at a patch level and has a joint loss function that encourages the upsampled points to remain uniformly distributed on the underlying surface. However, PU-Net cannot detect edges on the point clouds of regular objects, which leads to uneven edges and corners and the defect of irregular undulation. Later, the same authors introduced EC-Net [16] for edge-aware point cloud upsampling, formulating an edge-aware joint loss function to minimize the point-to-edge distances. EC-Net processes points grouped in local patches and is trained to learn and help consolidate points, deliberately for edges. Through this method, EC-Net can attend to the detected sharp edges and enable more accurate 3D reconstructions during upsampling. However, these thresholds and edges are in fact manually marked on the point cloud in advance, so the data needs to be manually labelled. PU-GAN [17] proposes a generative adversarial network to upsample point clouds. To realize a working GAN, PU-GAN constructs an up-down-up expansion unit in the generator for upsampling point features with error feedback and self-correction, and formulates a self-attention unit to enhance feature integration. Further, it designs a compound loss with adversarial, uniform and reconstruction terms to encourage the discriminator to learn more latent patterns and to enhance the uniformity of the output point distribution.
However, PU-GAN still uses the PointNet++ architecture, so its feature extraction and fusion in feature space remain insufficient.

| METHOD
We propose a PCSR method based on geometric constraints using a conditional generative adversarial network. The method is divided into a generator and a discriminator. Consider a low-resolution (LR) point cloud with 1024 points and a ground-truth high-resolution (HR) point cloud with 4096 points. The ground truth is downsampled to 1024 points and then super-resolved back to 4096 points, which allows the super-resolution result to be evaluated directly against the ground truth. We still refer to this downsampled input as the "low-resolution" point cloud, even though the super-resolved output has the same number of points as the high-resolution point cloud.
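The evaluation pipeline above can be sketched as follows. The sampling scheme, plain uniform random subsampling, is an assumption for illustration; the text only states that the ground truth is downsampled uniformly:

```python
import numpy as np

def downsample(points, n_out, seed=0):
    # Uniformly subsample an HR cloud to produce the LR network input.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=n_out, replace=False)
    return points[idx]

# stand-in HR ground truth with 4096 points
hr = np.random.default_rng(1).standard_normal((4096, 3))
lr = downsample(hr, 1024)   # LR input fed to the generator
assert lr.shape == (1024, 3)
```

The generator then maps the 1024-point input back to 4096 points, and the result is compared with `hr`.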
LR point cloud is inputted into generator G to generate HR point clouds. The point cloud pairs are fed into the discriminator D to determine whether the HR point cloud is real or not. An overview of PCSR generative adversarial network is shown in Figure 1.

| The architecture of generator
The generator architecture is visualized in Figure 2. The generator has four key blocks: a joint alignment network block, named the spatial transform network, that aligns the input points; a K-nn operation block that retrieves the K nearest neighbours of each point; a weighted pooling block that aggregates the features of the K neighbours by weighted averaging; and a local and global information combination block.
We discuss the reason behind these design choices in separate paragraphs below.
Spatial Transform Network Block: The semantic labelling of a point cloud needs to be invariant if the point cloud undergoes certain geometric transformations, such as rigid transformations. It is therefore expected that the learnt representation of the point set is invariant to these transformations. Inspired by Jaderberg et al. [28], we predict an affine transformation matrix with a spatial transform network and directly apply this transformation to the coordinates of the input points. The spatial transform network resembles the traditional network and is composed of basic modules of point-independent feature extraction, max pooling and fully connected layers.

K-nn Operation Block: Consider a point set $\{x_1, \ldots, x_n\} \subseteq \mathbb{R}^F$. In the simplest setting, F = 3; it is also possible to include additional coordinates representing colour, surface normals and so on. We use the k-nearest-neighbour algorithm to find the points $x_{j_1}, x_{j_2}, \ldots, x_{j_k}$ that are closest to $x_i$. We then calculate $x_j - x_i$; this operation encodes local information. However, if the PCSR method only encoded local information, it would treat the shape as a collection of small patches and lose the global shape structure. So we finally combine both the global shape structure (captured by the coordinates of the patch centres $x_i$) and the local neighbourhood information (captured by $x_j - x_i$).
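A brute-force sketch of the K-nn operation and edge-feature construction, assuming edge features of the form $(x_i, x_j - x_i)$ — centre coordinates concatenated with local offsets, as described above:

```python
import numpy as np

def knn_edge_features(points, k):
    """For each point x_i, find its k nearest neighbours x_j and build
    edge features (x_i, x_j - x_i): global position plus local offsets."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                 # exclude the point itself
    idx = np.argsort(d2, axis=1)[:, :k]          # indices of the k nearest neighbours
    offsets = points[idx] - points[:, None, :]   # x_j - x_i, shape (n, k, 3)
    centres = np.broadcast_to(points[:, None, :], offsets.shape)
    return np.concatenate([centres, offsets], axis=-1)  # shape (n, k, 6)

pts = np.random.default_rng(2).standard_normal((256, 3))
edges = knn_edge_features(pts, k=20)
assert edges.shape == (256, 20, 6)
```

A production implementation would use a spatial index (or GPU batching) rather than the O(n²) distance matrix, but the feature layout is the same.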
Weighted Pooling Block: DGCNN [11] applies a channel-wise symmetric aggregation operation (max pooling) on the edge features associated with all the edges emanating from each vertex. Different from [11], we consider that neighbours at different distances have different effects on the central point, because using max pooling as the symmetric aggregation operation only retrieves information from the farthest neighbours. We therefore design a weighted pooling operation to solve this problem. After searching for the neighbour points $x_{j_1}, x_{j_2}, \ldots, x_{j_k}$ of the centre point $x_c$, the distances $d_{j_1}, d_{j_2}, \ldots, d_{j_k}$ between the neighbour points and the centre point are calculated. The weight $\alpha_{j_m}$ of each neighbour is computed from its distance $d_{j_m}$ and normalized so that $\sum_{m=1}^{k} \alpha_{j_m} = 1$, with nearer neighbours receiving larger weights. The weighted pooling value $\beta$ is then the weighted average of the neighbour features $f_{j_m}$:

$$\beta = \sum_{m=1}^{k} \alpha_{j_m} f_{j_m}.$$

Local and Global Information Aggregation: To make the model invariant to input permutation, we use a simple symmetric function to aggregate the information from each point. Here, a symmetric function takes all n vectors as input and outputs a new vector that is invariant to the input order. The output of the symmetric function is a global signature of the input set. PCSR requires a combination of local and global knowledge; our solution can be seen in Figure 2. After the spatial transform network block and the K-nn operation block, there are three Conv blocks. The first Conv block uses two multi-layer perceptron layers (64, 64), the second uses two multi-layer perceptron layers (128, 128) and the third uses one multi-layer perceptron layer (256). After obtaining the local features, we use one Conv block with a multi-layer perceptron layer (1024) to extract the global feature. Finally, we use a Conv block with five multi-layer perceptron layers (512, 256, 128, 128, 3) to obtain the high-resolution point cloud.
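A sketch of the weighted pooling step. The exact weight formula is not reproduced here, so the code assumes normalized inverse-distance weights; the only property taken from the text is that nearer neighbours receive larger weights that sum to one:

```python
import numpy as np

def weighted_pooling(neighbour_feats, distances, eps=1e-8):
    """Aggregate k neighbour features per centre point with distance-based
    weights (assumed form: normalized inverse distance, so nearer
    neighbours count more)."""
    inv = 1.0 / (distances + eps)                  # (n, k)
    alpha = inv / inv.sum(axis=1, keepdims=True)   # weights sum to 1 per centre
    return (alpha[..., None] * neighbour_feats).sum(axis=1)  # beta: (n, d)

rng = np.random.default_rng(3)
feats = rng.standard_normal((128, 20, 64))   # features of 20 neighbours per point
dists = rng.uniform(0.1, 1.0, (128, 20))     # centre-to-neighbour distances
beta = weighted_pooling(feats, dists)
assert beta.shape == (128, 64)
```

Because the weights form a convex combination, the pooled feature stays within the range of the neighbour features, unlike max pooling, which keeps only the per-channel extreme.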

| The architecture of discriminator
Inspired by [29], the proposed method guides the data generation process by adding condition constraints to the discriminator. The discriminator architecture is visualized in Figure 3. The low-resolution point cloud and the high-resolution point cloud are inputted into the discriminator at the same time. This allows the discriminator to discern both whether the generated output is real and whether it matches the generator's input. As a result, the output can maintain the same distribution as the original input.
Same as the generator, the discriminator uses the spatial transform network to remain invariant if the point cloud undergoes certain geometric transformations. This idea can be further extended to the alignment of the feature space.
Another alignment network, named the feature transformation network, is inserted on the point feature map and predicts a feature transformation matrix to align features from different input point clouds.
After getting the global feature by symmetric function, the multi-layer perceptron classifier can be trained on the global features for classification.
The discriminator consists of three Conv blocks. The first Conv block uses two multi-layer perceptron layers (64, 64), the second uses three multi-layer perceptron layers (128, 512, 1024) and the third uses three multi-layer perceptron layers (512, 256, 1).

| Objective function
The objective of the Point Cloud Super-Resolution network can be expressed as

$$L_{cGAN}(D, G) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))],$$

where the generator G tries to minimize this objective against an adversarial discriminator D that tries to maximize it.
A critical challenge is to design a good loss function for comparing the predicted point cloud and the ground truth. To be plugged into a neural network, a suitable distance must satisfy at least three conditions: (1) it is differentiable with respect to point locations; (2) it is efficient to compute, as data will be forwarded and back-propagated many times; and (3) it is robust against a small number of outlier points in the sets. We therefore use the Chamfer distance (CD) as a loss function: for each point, CD finds the nearest neighbour in the other set and sums the squared distances,

$$L_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2.$$
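The Chamfer distance is straightforward to compute directly; a minimal sketch with squared Euclidean nearest-neighbour distances summed over both directions:

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance: for each point, the squared distance to
    its nearest neighbour in the other set, summed over both sets."""
    d2 = ((s1[:, None, :] - s2[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
# a -> b contributes 0 + 1, b -> a contributes 0 + 1
print(chamfer_distance(a, b))  # -> 2.0
```

In a training framework the same expression is written with differentiable tensor ops, so gradients flow back to the predicted point locations.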
Therefore, our final objective is

$$G^* = \arg\min_G \max_D L_{cGAN}(D, G) + \lambda \, L_{CD}(G).$$

Our loss function thus contains $L_{cGAN}(D, G)$ and $L_{CD}(G)$; the parameter λ adjusts the proportion of $L_{CD}(G)$.

| Implementation details
Generative adversarial networks (GANs) [30] have been enjoying considerable success as a framework of generative models in recent years, and they have been applied to numerous types of tasks and datasets [31,32]. A persisting challenge in the training of GANs is the performance control of the discriminator. In high-dimensional spaces, the density estimation by the discriminator is often inaccurate and unstable during training, and the generator fails to learn the multimodal structure of the target distribution. To solve this problem, we draw on the experience of SNGAN [33] to stabilize the training of the discriminator network. The Lipschitz constant is the only hyper-parameter to be tuned in SNGAN, and the algorithm does not require intensive tuning of this hyper-parameter for satisfactory performance. Compared with other methods, its implementation is simple and the additional computational cost is small.
Batch normalization (BN) is a milestone technique in the development of deep learning. BN normalizes features by the mean and variance computed within a (mini-)batch, which has been shown in many practices to ease optimization and enable very deep networks to converge. However, normalizing along the batch dimension introduces problems: BN's error increases rapidly when the batch size becomes smaller, owing to inaccurate batch statistics estimation. Due to memory limitations, only a small batch size can be used for PCSR. To avoid this problem, we use group normalization (GN) [34] as a simple alternative to BN. The authors of [34] note that many classical features, such as SIFT [35] and HOG [36], are group-wise features and involve group-wise normalization; for example, an HOG vector is the outcome of several spatial cells, where each cell is represented by a normalized orientation histogram. Analogously, GN divides the channels into groups and computes the mean and variance for normalization within each group. GN's computation is independent of the batch size, and its accuracy is stable across a wide range of batch sizes.
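A minimal numpy version of the group normalization computation (omitting the learnable per-channel scale and shift that the full layer adds):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize (N, C, ...) features per sample within channel groups;
    the statistics never involve the batch dimension."""
    n, c = x.shape[:2]
    g = x.reshape(n, num_groups, -1)          # split C channels into groups
    mean = g.mean(axis=2, keepdims=True)      # one mean per (sample, group)
    var = g.var(axis=2, keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(x.shape)

x = np.random.default_rng(5).standard_normal((2, 32, 1024))  # batch of point features
y = group_norm(x, num_groups=8)
# each (sample, group) slice is ~zero-mean, unit-variance
assert abs(y.reshape(2, 8, -1)[0, 0].mean()) < 1e-6
```

Because every statistic is computed within a single sample, shrinking the batch to 1 leaves the result unchanged — exactly the property needed for memory-constrained PCSR training.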

| Datasets and preprocessing
We use the ShapeNetCore subset of ShapeNet [37] as our dataset. ShapeNet is a richly annotated, large-scale dataset of 3D shapes. ShapeNetCore is a densely annotated subset of ShapeNet covering 55 common object categories with about 51,300 unique 3D models. To satisfy the network input format, we convert the OBJ files in the dataset into the point cloud format and then perform uniform sampling to obtain the low-resolution and high-resolution point clouds, as shown in Figure 4.
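Uniform sampling of points on a triangle mesh is commonly done by picking faces with probability proportional to their area and then drawing uniform barycentric coordinates per face. The paper does not specify its exact sampling procedure, so the sketch below is an assumption about this pre-processing step:

```python
import numpy as np

def sample_mesh_uniform(vertices, faces, n_points, seed=0):
    """Uniformly sample n_points on a triangle mesh: choose faces with
    probability proportional to area, then sample uniform barycentric
    coordinates inside each chosen triangle (square-root trick)."""
    rng = np.random.default_rng(seed)
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    r1, r2 = rng.random(n_points), rng.random(n_points)
    u = 1.0 - np.sqrt(r1)            # barycentric weights that are
    v = np.sqrt(r1) * (1.0 - r2)     # uniform over the triangle
    w = np.sqrt(r1) * r2
    a, b, c = v0[face_idx], v1[face_idx], v2[face_idx]
    return u[:, None] * a + v[:, None] * b + w[:, None] * c

# a single unit right triangle as a stand-in mesh
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tris = np.array([[0, 1, 2]])
cloud = sample_mesh_uniform(verts, tris, 4096)
assert cloud.shape == (4096, 3)
```

A real OBJ file would first be parsed into the `vertices`/`faces` arrays; the HR cloud is then sampled at 4096 points and the LR cloud at 1024.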

| Experimental results of point cloud super-resolution
The network is trained on a subset of ShapeNetCore in ShapeNet. Some generated samples are visualized in Figure 5. The point cloud in the left column is the low-resolution point cloud that is inputted to the generator. The point cloud in the middle column is the super-resolution point cloud that is outputted from the generator. The point cloud in the right column is the ground truth.

| Comparison with other methods
To verify the rationality of our structure, we design some baseline methods for comparison. The experiment is conducted based on the proposed PCSR method and the original PointNet [6], PointNet++ [7], PointCNN [19], DGCNN [11] and SAWNet [38] architectures. Our model is also compared with PU-Net [15] and PU-GAN [17]; for these, we used their public code and retrained their networks using our training data. The results are listed in Figure 6.
To quantitatively evaluate the quality of the output point sets, we use two metrics that measure the deviation between the output points $S_1$ and the ground truth $S_2$, as well as the distribution uniformity of the output points.
For the Chamfer distance (CD) in Equation (7), the algorithm finds, for each point, the nearest neighbour in the other set and sums up the squared distances.
The second metric is the Earth Mover's distance (EMD) [39]:

$$L_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \|x - \phi(x)\|_2,$$

where $\phi: S_1 \to S_2$ is a bijection. The EMD solves an optimization problem, namely the assignment problem. For all but a zero-measure subset of point set pairs, the optimal bijection φ is unique and invariant under infinitesimal movement of the points.

FIGURE 4 Data pre-processing. The first row: original dataset format. The second row: high-resolution point cloud. The third row: low-resolution point cloud.

FIGURE 5 The results of point cloud super-resolution. The left column is the input, the middle column is the output of the model and the right column is the ground truth.

Tables 1 and 2 list the two kinds of quantitative comparison results on the ShapeNet dataset, respectively. It can be seen that the original PointNet cannot generate the point cloud; this is because PointNet has no local information to guide the generation of the point cloud. PointNet++ and PointCNN solve this problem by obtaining the information of neighbouring points, but they do not fuse the point cloud features well. The super-resolution point cloud generated by DGCNN does not constrain the distribution of the points very well; this is because DGCNN chooses the maximum aggregation mode over neighbours and does not obtain local information very accurately. SAWNet combines PointNet and DGCNN, increasing the complexity of the network, but this leads to over-fitting and increases the training time. Compared with the baseline methods, the results of PU-Net and PU-GAN are greatly improved. However, PU-Net and PU-GAN are based on first extracting patches, so their networks cannot fully extract features and reconstruct data with low mesh quality, such as ShapeNet. In contrast, the PCSR model generates the super-resolution point cloud very well; this is because the PCSR model obtains local information through the K-nn operation and aggregates the local information by weighted pooling.
This method can accurately learn the distribution of point clouds.
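For intuition, the EMD between equal-size point sets is the cost of the cheapest bijection between them. A brute-force version, only feasible for tiny sets (practical evaluations use approximate assignment solvers), is:

```python
import numpy as np
from itertools import permutations

def earth_movers_distance(s1, s2):
    """EMD for equal-size sets: minimum total distance over all bijections
    phi: s1 -> s2 (exhaustive search, exponential in the set size)."""
    assert len(s1) == len(s2)
    best = np.inf
    for perm in permutations(range(len(s2))):
        cost = sum(np.linalg.norm(s1[i] - s2[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(earth_movers_distance(a, b))  # -> 0.0 (identical sets up to ordering)
```

Unlike CD, the EMD charges every point exactly once through the bijection, so it also penalizes clustering and non-uniform coverage.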

| Comparison with different aggregation mode
In this section, we compare three aggregation modes, namely max pooling, mean pooling and weighted pooling, on the three categories. The results are shown in Figure 7. From the experiment, it can be concluded that the results generated using max pooling and mean pooling cannot express the shape of the object well. This is because max pooling retrieves the farthest point in the neighbourhood, while mean pooling retrieves the centre of gravity of the neighbourhood; neither captures the neighbour information around the centre point very well, so the experimental results are not good enough. In contrast, the weighted pooling proposed in this paper better maintains the shape of the object, which verifies our earlier conjecture in Section 3.1.

| Comparison of different neighbourhood numbers
In order to find out whether the number of neighbours affects the generated point cloud, we set K to 10, 20 and 30 and selected three categories in the ShapeNetCore dataset for the experiment. The experimental results are shown in Figure 8; it can be seen that the best result is obtained when K = 20. This is because when K = 10, the neighbourhood information is not enough to express the local information completely, while when K = 30, too much useless neighbour information is added, so that the features contain a great deal of interference. Moreover, the greater the K, the longer the training time. So K = 20 is used as the number of neighbours.

| CONCLUSION
We propose a conditional adversarial network with point cloud feature extraction for PCSR. The generator and discriminator of the network are redesigned to be suitable for the point cloud.