Deep Learning for Scene Flow Estimation on Point Clouds: A Survey and Prospective Trends

Aiming at obtaining structural information and 3D motion of dynamic scenes, scene flow estimation has long been a research interest in computer vision and computer graphics. It is also a fundamental task for applications such as autonomous driving. Compared to previous methods that rely on image representations, much recent research builds upon the power of deep analysis and focuses on point cloud representations to conduct 3D flow estimation. This paper comprehensively reviews the pioneering literature on scene flow estimation from point clouds. It delves into the details of learning paradigms and presents insightful comparisons between state-of-the-art deep learning methods for scene flow estimation. Furthermore, this paper investigates various higher-level scene understanding tasks, including object tracking and motion segmentation, and concludes with an overview of foreseeable research trends for scene flow estimation.


Introduction
A large body of research has emerged from autonomous driving (AD) to support an advanced transportation sector. In this context, understanding the complex environment is vital for automated vehicles to drive safely. A scene flow estimator can intuitively discriminate the different motion patterns of moving agents, for example pedestrians, cyclists, and cars, from on-board sensor data. As shown in Figure 1, scene flow represents the motion field of individual objects in a 3D scene [VBR*99]. Scenes can be represented by depth images and point clouds. Methods based on images extract depth, disparity, and optical flow information separately to learn the flow vector. However, image-based methods usually rely on standard variational formulations and energy minimization [HR20], which yield limited accuracy and suffer from long runtimes. The advent of affordable 3D sensors, for example LiDARs and RGB-D cameras, simplifies the process of acquiring large-scale 3D point clouds. With flourishing demand from industry, leveraging point clouds as scene representations has become a hotspot in recent years. Deep learning (DL) is a branch of machine learning that usually utilizes deep neural networks to solve machine learning problems. It extracts features automatically and emphasizes learning high-level abstract representations of data [GHH*21]. The learning process can be fully supervised, weakly supervised, or self-supervised. A plethora of deep learning techniques on point clouds have emerged to solve classical computer vision tasks such as 3D shape classification [MWB21, GLMH55], object detection [ZCL20, QCLG20], object tracking [SHHX18], semantic scene segmentation [ZZtZX20, HYX*20], and instance segmentation [JYC*20], to name a few. With the rise of deep learning techniques for scene understanding tasks, deploying deep neural networks for scene flow estimation has attracted increasing research attention.
Thanks to the introduction of the large-scale synthetic dataset FlyingThings3D [MIH*16] with ground-truth flow annotations, many supervised methods can learn deep hierarchical features of point clouds and fuse these features to estimate scene flow. This supervised training strategy outperforms traditional registration algorithms, for example ICP [BM92], and shows great potential for application in real scenarios. To this end, datasets such as KITTI [MG15], NuScenes [CBL*], and Argoverse [CLS*19] were created, which contain real scenes scanned from various actual environments. However, datasets collected by LiDAR do not provide reliable correspondences between consecutive scans. Therefore, many DL models exhibit a performance gap between synthetic and real datasets. In addition, there are many unexpected occlusions in real scenarios, which affect the overall accuracy. In spite of recent attempts to exploit the advantages of DL models, unleashing the full power of deep neural networks on 3D point cloud understanding is still in its infancy.
We summarize and categorize current challenges in scene flow estimation into data challenges and DL model challenges, which are introduced in the following.

Data challenges.
• Noise. The point cloud, one of the most popular formats of three-dimensional data, is unstructured and noisy. Noise is inevitable in the scanning and reconstruction process. It hinders feature extraction and misguides the search for corresponding points in the neighbourhood.
• Difference in point density. A LiDAR system identifies the position of the light energy returned from a target to the LiDAR sensor. This inherent attribute of the LiDAR sensor leads to unevenly distributed points on an underlying surface. The density decreases dramatically as the distance from the sensor increases. How to address such diversified point density is still an open problem.
• Big data challenge. A scene represented by point clouds contains millions of points. For example, in the Argoverse dataset, each point cloud produced by the LiDAR sensor has 107k points at 10 Hz. Such an amount of data increases the processing burden.
• Diversified motion fields. Background motion and foreground motion co-exist in a scene. Likewise, large and small motions, close and far objects, and rigid and non-rigid objects co-exist in dynamic scenes. The diversity of motion scales poses a great challenge for discriminating different motion fields.
• Occlusions. Scene points captured at time t may be occluded in subsequent time steps. Consequently, a few objects will disappear due to occlusions. The presence of occlusions significantly influences flow estimation accuracy.
• LiDAR challenge. Environmental interference is a challenge for data collection with LiDAR. Although LiDAR is not sensitive to variations in lighting, it still struggles with reflective surfaces and bad weather (e.g. heavy fog, rain, and snow). The consequence of these imperfections is the loss of object motion and structure information.
Challenges from DL models.
• Generalization ability. Existing wisdom aims to improve performance on a specific dataset but fails to generalize to other datasets, especially from simulated to real scenes.
• Transformation challenge. Multiple transformations (e.g. rotations, translations) exist in real dynamic scenes, which are challenging for DL models to handle effectively. Some objects will be distorted in consecutive frames if their transformations are not strictly aligned.
• Accuracy challenge. It is impossible to obtain 100% accurate ground-truth scene flow from real scenarios. Due to limited annotations for real scenes, it is challenging for DL algorithms to achieve satisfactory accuracy.
• Efficiency challenge. Real-time processing ability is imperative for AD entities. However, the computing power and memory space allocated for processing massive 3D data on vehicles are limited. Currently, an efficient DL model that can deliver real-time large-scene perception is still under-explored.
There are a few surveys [YX16, XAZX17] that thoroughly analyse methods for traditional optical flow estimation and depth estimation. Xiang et al. [XAZX17] reviewed scene flow applications, including image segmentation, image matching, and feature extraction. However, they do not provide sufficient quantitative comparisons between different methods and lack a review of learning-based methods. Recently, Liu et al. [LLW*20] and Zhai et al. [ZXLK21] have presented some learning-based scene flow estimation literature and compared performance on various datasets. Unlike Liu et al. [LLW*20], who only outlined image-based scene flow estimation methods, Zhai et al. [ZXLK21] cover both the optical flow (2D) and scene flow (3D) estimation literature and categorize it into knowledge-driven, data-driven, and hybrid-driven methods; they introduce scene flow estimation approaches according to the dimension of the data representation: 2.5D (image-based) and 3D (point-based). This survey aims to narrow the gap in this topic. Therefore, we comprehensively review up-to-date compelling DL models applied in point cloud-based scene flow estimation. The main contributions of this paper are summarized as follows:
• Comprehensive review. For the first time, we investigate DL methods for point cloud-based scene flow estimation. We provide a comprehensive comparison and insightful analysis of recent deep learning methods (2019-2023), including supervised, weakly-supervised, and self-supervised scene flow estimation methods.
• Review of open challenges. We provide an overview of existing challenges in scene flow estimation, categorized into data challenges and DL challenges.
• Applications and research directions. We present how the estimated scene flow can benefit higher-level scene understanding tasks. Several promising research directions in 3D scene flow estimation are discussed.

Problem Statement and Taxonomy
Let P = {p_i ∈ R^3}, i = 1, ..., N, be the source point cloud captured at time t and Q = {q_j ∈ R^3}, j = 1, ..., M, be the target point cloud captured at time t+1. Scene flow estimation aims to predict a per-point motion field D = {d_i ∈ R^3}, i = 1, ..., N, such that each warped point p_i + d_i matches the position of the corresponding surface point in the second frame.

Evaluation metrics. There are four main metrics to evaluate the predicted scene flow. More detailed equations for the following terms can be found in [WHWW21].
• 3D End Point Error (EPE3D): the average absolute distance (L2 distance), in metres, between the predicted flow vector and the ground-truth flow vector.
• Acc3DS: the percentage of flow vectors whose EPE3D < 0.05 m or whose relative error < 5%.
• Acc3DR: the percentage of flow vectors whose EPE3D < 0.1 m or whose relative error < 10%.
• Outliers: a point is considered an outlier if its EPE3D > 0.3 m or its relative error > 10%; this metric reports the percentage of such points.
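These four metrics can be computed directly from the predicted and ground-truth flow vectors. The NumPy sketch below uses the threshold values quoted above; it is an illustration, not code from any cited benchmark.

```python
import numpy as np

def scene_flow_metrics(pred, gt):
    """Compute EPE3D, Acc3DS, Acc3DR, and Outliers for (N, 3) flow arrays in metres."""
    l2 = np.linalg.norm(pred - gt, axis=1)            # per-point end point error
    gt_norm = np.linalg.norm(gt, axis=1)
    rel = l2 / np.maximum(gt_norm, 1e-10)             # relative error

    epe3d = l2.mean()
    acc3ds = np.mean((l2 < 0.05) | (rel < 0.05))      # strict accuracy
    acc3dr = np.mean((l2 < 0.1) | (rel < 0.1))        # relaxed accuracy
    outliers = np.mean((l2 > 0.3) | (rel > 0.1))      # outlier ratio
    return epe3d, acc3ds, acc3dr, outliers
```

A perfect prediction yields EPE3D = 0, both accuracies equal to 1, and an outlier ratio of 0.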

Building Blocks in Scene Flow Estimation
This section summarizes some basic building blocks for scene flow estimation that comprise the DL pipeline. Learning-based frameworks for scene flow estimation from point clouds usually consist of three stages: (1) feature extraction; (2) feature fusion and matching; and (3) flow generation and refinement.

Feature extraction paradigms
Traditional convolutions are not suitable for irregular point sets. To enable effective use of geometric domain knowledge on point clouds, point feature learning is an essential step. This section introduces the dominant feature extraction blocks leveraged by scene flow estimation methods.

Set conv layer
The set conv layer was first proposed in PointNet++ for point cloud classification and segmentation [QYSG17]. Each point feature is independently computed via an MLP (multi-layer perceptron) and then accumulated by max pooling. A set conv layer takes N points p_i = {x_i, f_i}, with XYZ coordinates x_i ∈ R^3 and features f_i ∈ R^c (i = 1, ..., N), as input. The outputs are a sub-sampled point cloud with N' points p'_j = {x'_j, f'_j} and its point-wise features. For each sub-sampled region (centred at point x'_j) defined by a ball neighbourhood of radius r, the updated local feature is computed by the symmetric function

f'_j = MAX_{ ||x_i − x'_j|| ≤ r } h( concat( f_i, x_i − x'_j ) ),

where h(·) is a non-linear function (an MLP layer) taking the concatenated f_i and point difference x_i − x'_j as input [QYSG17], and MAX is the element-wise max pooling operator.
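A minimal NumPy sketch of this layer follows, with the shared MLP h(·) passed in as a callable and the centre sampling (e.g. farthest point sampling) omitted; it illustrates the gather, concatenate, and max-pool steps rather than any particular implementation.

```python
import numpy as np

def set_conv(xyz, feats, centers, radius, mlp):
    """Sketch of a PointNet++-style set conv layer.

    xyz:     (N, 3) input point coordinates
    feats:   (N, C) input point features
    centers: (M, 3) sub-sampled centre coordinates
    mlp:     callable mapping (K, C + 3) -> (K, C'), standing in for h(.)
    Returns (M, C') max-pooled local features.
    """
    out = []
    for xj in centers:
        d = np.linalg.norm(xyz - xj, axis=1)
        idx = np.where(d < radius)[0]
        if idx.size == 0:
            idx = np.array([np.argmin(d)])     # fall back to the nearest point
        # concatenate features with local coordinate offsets, as in h(f_i, x_i - x_j)
        local = np.concatenate([feats[idx], xyz[idx] - xj], axis=1)
        out.append(mlp(local).max(axis=0))     # symmetric max pooling
    return np.stack(out)
```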

PointConv feature pyramid
The PointConv layer is proposed to learn point features hierarchically. PointConv takes the positions of points as input and trains an MLP to estimate a weight function; an inverse density scale is applied to the learned weights to compensate for non-uniform sampling. It has been leveraged by many scene flow estimation works. PointPWC-Net generates multiple levels of feature representations, with each level computed through convolution on the previous level. The PointConv operation is defined as

PointConv(S, W, F)_{xyz} = Σ_{(δx, δy, δz) ∈ G} S(δx, δy, δz) W(δx, δy, δz) F(x + δx, y + δy, z + δz),

where S(δx, δy, δz) denotes the inverse density at a point (δx, δy, δz). The weight function W(δx, δy, δz) is approximated by MLPs from the 3D coordinates (δx, δy, δz) and the inverse density S(δx, δy, δz). F(x + δx, y + δy, z + δz) is the feature of a point in the local region G centred around point p = (x, y, z). After point convolution, the feature in a local region is updated.
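For a single local region, the PointConv operation reduces to a density-weighted sum over the neighbourhood offsets. The hypothetical sketch below abstracts the learned weight and density networks as callables; it mirrors the structure of the operation rather than reproducing the original implementation.

```python
import numpy as np

def pointconv(deltas, feats, weight_fn, density_fn):
    """One local region of a PointConv-style operation.

    deltas:     (K, 3) offsets (dx, dy, dz) from the region centre
    feats:      (K, C) features F at the offset positions
    weight_fn:  callable (K, 3) -> (K, C), the MLP-approximated weights W
    density_fn: callable (K, 3) -> (K,), the inverse density scale S
    Returns the (C,) convolved feature for the region centre.
    """
    S = density_fn(deltas)                  # inverse density per neighbour
    W = weight_fn(deltas)                   # learned per-neighbour weights
    return (S[:, None] * W * feats).sum(axis=0)
```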

Flow embedding layer
This layer learns to aggregate both feature similarities and spatial relationships of points to yield embedded features for point motions [LQG19]. As illustrated in Figure 2, this layer fuses features from the source point cloud P with those from the target point cloud Q. For a source point p_i = {x_i, f_i} and neighbouring target points q_j = {y_j, g_j}, the flow embedding is computed by

e_i = MAX_{ ||y_j − x_i|| ≤ r } h( concat( f_i, g_j, y_j − x_i ) ),

where h(·) is a non-linear function realized by an MLP and MAX is element-wise max pooling. An improved version of this embedding layer is proposed by Wang et al. [WWLW21]: a weighted embedding strategy that samples neighbouring points in the second frame for each source point. Embedding motion in a patch-to-patch manner enlarges the receptive field of each point.
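A FlowNet3D-style flow embedding can be sketched as follows, again with the MLP abstracted as a callable; the nearest-neighbour fallback for empty neighbourhoods is an assumption for robustness, not a detail from the paper.

```python
import numpy as np

def flow_embedding(p_xyz, p_feat, q_xyz, q_feat, radius, mlp):
    """Mix each source point's feature with nearby target features, then max-pool.

    p_xyz, p_feat: (N, 3) / (N, C) source points and features
    q_xyz, q_feat: (M, 3) / (M, C) target points and features
    mlp: callable (K, 2C + 3) -> (K, D), standing in for h(.)
    Returns (N, D) motion embeddings.
    """
    emb = []
    for xi, fi in zip(p_xyz, p_feat):
        d = np.linalg.norm(q_xyz - xi, axis=1)
        idx = np.where(d < radius)[0]
        if idx.size == 0:
            idx = np.array([np.argmin(d)])    # fall back to the nearest target point
        cat = np.concatenate(
            [np.tile(fi, (idx.size, 1)), q_feat[idx], q_xyz[idx] - xi], axis=1)
        emb.append(mlp(cat).max(axis=0))      # max-pool over the neighbourhood
    return np.stack(emb)
```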

Correlation matrix
Another stream of works attempts to find soft correspondences between the source point cloud P and the target point cloud Q. Inspired by optimal transport theory [Vil09], building an optimal transport plan helps establish one-to-one matching between P and Q [PBM20, LLX21]. Recently, SCTN [LZGG22] and FlowStep3D [KER21] adopted a correlation matrix to estimate point correspondences. The correlation matrix measures pairwise feature similarity, typically via a (normalized) inner product

C(i, j) = F_θ(p^t_i)ᵀ F_θ(q^{t+1}_j),

where p^t_i and q^{t+1}_j represent points from the source and target point clouds, respectively, and F_θ(·) is the point feature extraction function. After obtaining this correlation matrix, scene flow can be predicted by the Sinkhorn algorithm [PBM20].
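Given a cost matrix (e.g. a negated correlation), the Sinkhorn algorithm alternately rescales rows and columns until the transport plan satisfies the prescribed marginals. The sketch below assumes uniform marginals and fixed hyperparameters, which are illustrative choices only.

```python
import numpy as np

def sinkhorn(C, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    C: (N, M) cost matrix; eps: regularization strength.
    Returns a soft transport plan whose rows/columns sum to uniform marginals.
    """
    K = np.exp(-C / eps)                       # Gibbs kernel
    a = np.full(C.shape[0], 1.0 / C.shape[0])  # uniform source marginal
    b = np.full(C.shape[1], 1.0 / C.shape[1])  # uniform target marginal
    u = np.ones_like(a)
    for _ in range(n_iters):                   # alternate scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]
```

Soft correspondences (and hence flow) can then be read off the transport plan, e.g. as a weighted average of target positions per row.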

Cost volume
Cost volumes are widely used in stereo matching to encode the relation between two consecutive frames. In the 2D image field, a cost volume is often represented by a 3D tensor. Constructing a cost volume on 3D point clouds is more difficult than in the 2D domain since point clouds are unordered and possess varying sampling densities. To reduce the computational complexity, Wu et al. [WWL*20] introduce a discretization operation on the cost volume. The matching cost between point p_i and point q_j is defined as

cost(q_j, p_i) = MLP( concat( f_i, g_j, q_j − p_i ) ),

where concat(·) abbreviates concatenation and f_i, g_j are the features corresponding to points p_i, q_j. PointPWC-Net [WWL*20] uses a multi-layer perceptron (MLP) to capture the nonlinear relationship between two points, with an additional direction vector (q_j − p_i). Based on Equation (5), the cost volume for an individual point p_c is formulated as

CV(p_c) = Σ_{p_i ∈ N_P(p_c)} W_P(p_i, p_c) Σ_{q_j ∈ N_Q(p_i)} W_Q(q_j, p_i) cost(q_j, p_i),

where W_P and W_Q are convolutional weights that aggregate the costs from patches in point cloud P to those in point cloud Q, N_P(p_c) represents the neighbourhood of point p_c in P, and N_Q(p_i) represents the neighbourhood of point p_i in point cloud Q. The cost volume is thus aggregated in a patch-to-patch matching manner. The pipeline of constructing a cost volume is depicted in Figure 3.
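The two-level, patch-to-patch aggregation can be sketched as below. For brevity the learned weights W_P and W_Q are replaced by uniform averaging, and k-nearest neighbours stand in for the neighbourhood definitions; both are assumptions, not PointPWC-Net's actual choices.

```python
import numpy as np

def cost_volume(p_xyz, p_feat, q_xyz, q_feat, k, cost_mlp):
    """Patch-to-patch cost volume sketch.

    k: neighbourhood size for both N_Q(p_i) and N_P(p_c)
    cost_mlp: callable (K, 2C + 3) -> (K, D), the matching-cost MLP
    Returns (N, D) aggregated costs, one per source point.
    """
    # point-to-patch: cost of each p_i against its k nearest points in Q
    costs = []
    for xi, fi in zip(p_xyz, p_feat):
        idx = np.argsort(np.linalg.norm(q_xyz - xi, axis=1))[:k]
        cat = np.concatenate(
            [np.tile(fi, (k, 1)), q_feat[idx], q_xyz[idx] - xi], axis=1)
        costs.append(cost_mlp(cat).mean(axis=0))
    costs = np.stack(costs)
    # patch-to-patch: average the costs over each point's neighbourhood in P
    out = []
    for xc in p_xyz:
        idx = np.argsort(np.linalg.norm(p_xyz - xc, axis=1))[:k]
        out.append(costs[idx].mean(axis=0))
    return np.stack(out)
```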

Set upconv layer
In the upsampling step for flow refinement, the set upconv layer propagates the input set of points onto a set of target point coordinates by aggregating the neighbouring point features of the input points. It shares the same structure as set conv layers, and it is flexible and trainable to propagate/summarize features from one point cloud to another. We refer readers to FlowNet3D [LQG19] for more details on this layer.

Gated recurrent unit
A recurrent updating mechanism is widely used in scene flow estimation methods [GTY*22, KER21, DZL*22]. The updated scene flow vector is produced by a Gated Recurrent Unit (GRU) with a few set upconv layers. As presented in FlowStep3D [KER21], the hidden state h_k follows the standard GRU formulation

z_k = σ( Conv_z([h_{k−1}, x_k]) ),
r_k = σ( Conv_r([h_{k−1}, x_k]) ),
h̃_k = tanh( Conv_h([r_k ⊙ h_{k−1}, x_k]) ),
h_k = (1 − z_k) ⊙ h_{k−1} + z_k ⊙ h̃_k,

where ⊙ represents the Hadamard product and σ(·) is the sigmoid activation function. The initial state h_0 is calculated by two set conv layers from the features of the source point cloud.
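One GRU update step can be written out explicitly. This bias-free NumPy sketch follows the standard GRU equations, with plain matrices standing in for the point-convolution layers used in the actual networks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, Wz, Wr, Wh):
    """One GRU update on the concatenated [h_prev, x] (bias-free sketch)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                              # update gate
    r = sigmoid(Wr @ hx)                              # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde           # gated interpolation
```

With all-zero weights, both gates evaluate to 0.5 and the candidate state to 0, so the update halves the previous hidden state.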

Datasets
In this section, we concentrate on point cloud datasets employed in scene flow estimation. A taxonomic study is presented in terms of the source of the data, as elaborated in Figure 4.

Data preprocessing
LiDAR raw data are usually scanned at large scale and unevenly distributed, with many irregularly shaped contents. As mentioned in Section 1, noise and outliers are inevitably introduced when collecting LiDAR data. Therefore, a pre-processing step is necessary to deal with noise, errors, and outliers. Pre-processing of point clouds (e.g. ground point removal, down-sampling) is also a significant step before estimating scene flow. Removing ground points with inconspicuous features enables more efficient learning on point clouds. The simplest method is thresholding on the height axis, as in HPLFlowNet [GWW*19]. However, this approach is somewhat aggressive and leads to the loss of important information on some objects. In practice, ground points usually constitute a flat plane with less significant visual cues. Two ground segmentation algorithms, RANSAC and GroundSegNet, have been proposed to improve the effectiveness of ground point removal. RANSAC (Random Sample Consensus) fits a plane to a set of points and classifies points close to the plane as ground points [LQG19]. GroundSegNet originates from the segmentation branch of PointNet [QSMG17], which is trained to classify points into ground and non-ground parts [LQG19]. Both algorithms generate accurate segmentation results on KITTI2015 [MHG15, MG15].
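A toy version of RANSAC-based ground removal might look as follows. The distance threshold and iteration count are assumed values chosen for illustration, not parameters reported by the cited works.

```python
import numpy as np

def ransac_ground(points, n_iters=200, dist_thresh=0.2, seed=0):
    """Fit a ground plane by RANSAC and split points into non-ground and ground.

    Repeatedly fits a plane through 3 random points and keeps the plane
    with the most inliers (points within dist_thresh metres of the plane).
    """
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)        # plane normal from the sample
        norm = np.linalg.norm(n)
        if norm < 1e-9:                       # degenerate (collinear) sample
            continue
        n /= norm
        dist = np.abs((points - p0) @ n)      # point-to-plane distances
        mask = dist < dist_thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return points[~best_mask], best_mask      # non-ground points, ground mask
```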

Methodology
This section reviews existing methods from the perspective of supervision and analyses how state-of-the-art methods deal with the challenges in scene flow estimation. We roughly categorize them into supervised, weakly-supervised, and self-supervised methods, and we refer readers to Figure 5 for a summary of the state-of-the-art learning wisdom on scene flow estimation in recent years.

Supervised methods
Early methods [BMWR19, ZHZ*19] project the point clouds onto 2D cylindrical maps and apply traditional CNNs to train their flow estimation models. Starting from methods that tackle large amounts of data, we can identify a core set of the most innovative work on supervised learning approaches for scene flow estimation. Many supervised learning approaches rely on ground-truth scene flow labels. The deep networks are initially trained on synthetic datasets and then fine-tuned on real data.

HALFNet. Wang et al. [WWLW21] proposed a hierarchical attention learning network with two different attentions in each flow embedding. In particular, a hierarchical attentive flow refinement module is designed to propagate and refine scene flow layer by layer. HALFNet [WWLW21] adopts a more-for-less strategy, meaning the number of input points is greater than the number of output points in scene flow estimation. HALFNet has proven effective at capturing precise structural information of the scene while reducing GPU memory consumption. It is also noteworthy that HALFNet uses multiple Euclidean cues, which allows the flow to be attentively embedded in a patch-to-patch manner. Generally, HALFNet demonstrates better generalization ability than FlowNet3 [ISKB18] on 2D metrics (e.g. optical flow) and achieves reasonable accuracy compared with existing supervised methods. However, HALFNet is not trained on a large real-world dataset, which limits its performance on such data.

To facilitate an inductive summary of the above methods, we group these scene flow models by their building blocks, as listed in Table 2. We also systematically investigate the advantages and disadvantages of the different methods.

Weakly/Self-supervised methods
Many supervised methods are trained on a synthetic dataset and fine-tuned on a small set of real data. However, this training scheme leads to a domain gap between the synthetic dataset and the real-scanned dataset, which makes the trained models perform poorly in real-world scenes. A handful of works [MOH20, KER21, JLA*22] have been proposed to handle the performance gap between different datasets by devising self-supervised architectures. According to the backbone used by these self-supervised methods, we divide them into flow embedding-based, correspondence-based, and correspondence-free methods. Table 3 summarizes the advantages, deficiencies, and training datasets of these methods.

Just-Go. Mittal et al. [MOH20] utilize a nearest neighbour loss and a cycle consistency loss based on the framework of FlowNet3D [LQG19]. The nearest neighbour loss is formulated as the average Euclidean distance from each transformed point to its nearest neighbour in the second point cloud, regularizing the initial flow to be as close as possible to the correct scene flow. The cycle consistency loss is the absolute Euclidean distance between the point transformed back by the reverse flow and the original point. The combination of these two self-supervised losses enables training on large unlabelled autonomous driving datasets containing sequential point cloud data. However, it ignores the local geometric properties of point clouds.
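The two self-supervised losses described above can be sketched as follows. The backward flow is assumed here to be predicted at the warped point positions, which is an interpretation of the description rather than the exact formulation in the paper.

```python
import numpy as np

def nn_loss(p_warped, q):
    """Nearest-neighbour loss: mean distance from each warped source point
    (p + predicted flow) to its closest point in the target cloud q."""
    d = np.linalg.norm(p_warped[:, None, :] - q[None, :, :], axis=2)
    return d.min(axis=1).mean()

def cycle_loss(p, flow_fwd, flow_bwd):
    """Cycle consistency: warping forward then backward should return to p."""
    return np.linalg.norm((p + flow_fwd + flow_bwd) - p, axis=1).mean()
```

When the warped points land exactly on target points and the backward flow reverses the forward flow, both losses vanish.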

Adversarial-SFE.
Victor et al. [ZvVBM20] proposed a metric learning approach for self-supervised scene flow estimation. Unlike previous self-supervised methods that rely on fine-tuning and on finding correspondences in the input data to search for nearest neighbours, Adversarial-SFE [ZvVBM20] utilizes an adversarial learning loss. Hence, Adversarial-SFE does not suffer from the domain shift between synthetic and real data. Moreover, Adversarial-SFE takes advantage of the permutation-invariant nature of the point cloud. It proposes a triplet loss over sampled points together with a cycle consistency loss, and computes the distance between a pair of point clouds in a latent space. The proposed adversarial metric learning consists of four components: (1) a triplet loss with anchor and positive sampling, (2) a cycle consistency loss, (3) multi-scale triplets for global and local consistency, and (4) adversarial optimization.

Self-Point-Flow. Note that each point not only possesses a spatial position (x, y, z) but also potentially carries attribute vectors, such as the normal, colour, or material reflectance. Self-Point-Flow [LLX21] uses global mass constraints with multiple descriptors to formulate one-to-one matching, with the 3D point coordinates, colour, and surface normal as measures. In the optimal transport module, the sum of these three individual costs forms the final transport cost in the entropic regularization term, which is solved by the Sinkhorn algorithm. This enables the generation of pseudo labels for real data from the assignment matrix. However, conflicting results in local regions lead to incomplete pseudo-label generation. To address this issue, Self-Point-Flow builds a graph through random walk theory that integrates local consistency to refine the pseudo labels. The algorithm is executed on a fully-connected undirected subgraph and refined with several random walk steps. It then propagates to a directed subgraph without initial pseudo labels and infers new pseudo labels based on the affinity matrix, which describes the nearness between each point in the undirected subgraph (labelled node set) and the directed subgraph (unlabelled node set).

FlowStep3D. Inspired by RAFT [TD20], FlowStep3D [KER21] introduces a recurrent structure that unrolls the scene flow estimation model with a refinement operation. In FlowStep3D, the initial flow vector is estimated by a global correlation matrix; the rest of the flow sequence is then updated based on local correlations in a gated recurrent unit. FlowStep3D adopts several basic layers from FlowNet3D [LQG19], for example the set conv layer (Section 3.1.1) and the flow embedding layer (Section 3.3.1). Two regularization loss weights are proposed to adjust the regularization, which contributes to the updating of scene flow during iterations.

Noisy-Pseudo. The distance between a pseudo label and its nearest point in the second point cloud indicates the reliability of that pseudo label, so inaccurate, noisy labels are assigned low confidence to reduce their negative effect on network training. To refine the confidence scores of pseudo labels, Noisy-Pseudo updates each score via a local geometry-aware weighted confidence over all neighbouring pseudo labels. Additionally, combining 2D and 3D information contributes to the self-supervised learning and leads to good performance on both synthetic data and real-world LiDAR data. This method highlights the effectiveness of multi-sensor data in scene flow estimation.

In the supervoxel-based pipeline, the initial point mapping and rigid transformation are first calculated. The rigid transformation and pseudo labels for each supervoxel are then updated accordingly by solving a least-squares problem, which computes the rotation matrix and translation vector aligning each independent rigid body from source to target. After several iterations, the optimal pseudo rigid scene flows from all supervoxels are combined to form the complete pseudo scene flow.
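The per-supervoxel least-squares problem is the classic rigid alignment (Kabsch) problem; a self-contained sketch, offered as an illustration of that sub-step rather than the cited pipeline itself:

```python
import numpy as np

def rigid_fit(src, dst):
    """Least-squares rigid alignment: find R, t minimising ||R @ src_i + t - dst_i||.

    src, dst: (N, 3) corresponding point sets (one supervoxel and its match).
    Returns a rotation matrix R (3, 3) and translation vector t (3,).
    """
    cs, cd = src.mean(axis=0), dst.mean(axis=0)       # centroids
    H = (src - cs).T @ (dst - cd)                     # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # reflection correction
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t
```

The pseudo rigid flow for each point of the supervoxel is then simply (R @ p + t) − p.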

DCA-SRSFE
Pseudo-LiDAR [JWMW22]. This work accurately perceives 3D dynamics in 2D images by utilizing a pseudo-LiDAR point cloud as a bridge to compensate for the limitations of estimating 3D scene flow from LiDAR point clouds. Points that do not contribute to the scene flow predictions are filtered out. In addition, a disparity consistency loss is proposed to boost the self-supervised training.

OGC. OGC [SY22] focuses on making use of inherent object dynamics to assist object segmentation. To extract per-point features and generate object masks, an object segmentation network is first applied to a single point cloud. Then, a self-supervised network is utilized to estimate per-point motions from a pair of point clouds. Due to the challenging moving patterns of different objects, fully utilizing object dynamics to assist object segmentation is tricky. To tackle this problem, OGC introduces three loss terms to yield effective segmentation supervision. The geometry consistency over dynamic object transformations allows high-quality masks to be learned from the given flows. The geometry smoothness regularization ensures that flow vectors in a local area remain consistent with the central point. The geometry invariance loss drives the estimated object masks to be invariant across different views of the point clouds.

SCOOP. SCOOP [LAC*22] consists of a self-supervised neural network and an optimization module that work hybridly to estimate scene flow. In the initialization step, SCOOP extracts point features to obtain soft correspondences, using cosine similarity to compute the matching cost. In the flow refinement step, two optimization objectives are deployed to reduce the error and increase the consistency of the scene flow field. According to the results, SCOOP reduces errors by over 50% compared to feed-forward models and provides 10 times faster inference than the Neural Prior work [LKPL21], which relies solely on optimization. Additionally, SCOOP allows a flexible trade-off between runtime and performance.
Rigid3DSF. To ease the heavy demand for supervision in the scene flow estimation problem, Gojcic et al. proposed Rigid3DSF.

Quantitative analysis
Results of recent deep learning-based methods on different datasets, along with non-learning methods, are tabulated in Table 4 and Table 5. It is hard to declare one approach the winner over the others, as the outcome depends on the datasets and the specific training scheme used. We focus on results generated from the same training dataset and make the following observations.

Applications
Scene flow is one of the most fundamental visual cues in the hierarchy of dynamic scene perception. It provides applicable information for higher-level tasks, and progress in scene flow estimation will improve the performance of other scene understanding tasks [GLW*21].

Point cloud densification
Point clouds are unevenly distributed and sparse. Therefore, some small objects cannot be well represented by a limited number of points. To increase the quality of point clouds, a number of point cloud upsampling methods [YLF*18, QAL*21] have been proposed to generate dense and complete point clouds from an original sparse point cloud. Recent work (e.g. SFPC [PHL20]) discovered that the predicted scene flow can be used to densify point clouds. SFPC [PHL20] uses five adjacent frames from an Argoverse scene in each direction to densify the current frame. A visual comparison of the non-rigid densification proposed by SFPC [PHL20] against the original sparse point cloud and ICP is shown in Figure 6, and indicates that SFPC recovers more detailed object geometry than ICP.

Motion segmentation
From pedestrians walking at a constant speed to high-speed vehicles, the issue of detecting objects of interest can be addressed by segmenting the underlying motions. This improvement also validates the capability of scene flow to boost the performance of the LiDAR odometry task.

Object tracking
Self-driving can be divided into four separate parts: detection, object tracking, motion forecasting, and motion planning [LYU18]. The objective of object tracking is to identify and locate multiple objects of interest and keep track of their trajectories simultaneously. To enhance the robustness of motion

Potential Research Directions
To address the issue caused by diversified motion fields, there have been quite a few attempts, for example Rigid3DSF. Here we provide an overview of promising directions for further research.

Multi-source and multi-modality data fusion
2D images contain fine-grained appearance information, while 3D point clouds provide more geometric detail. LiDARs and cameras (e.g. RGB-D cameras, monocular cameras) are the most common sensors for multi-modal perception in the literature [FHSR*20]. Although interest in scene understanding via multi-modality data fusion is growing, only a few papers [LYY21, JWMW22] utilize multi-modality data in scene flow. The effectiveness of data fusion algorithms is restricted by the representation of spatio-temporal information and the learning ability of CNNs. This is a complex issue that deserves further exploration.

Multitask learning
An important avenue for future work is to deploy end-to-end multitask learning (MTL) pipelines. In the field of visual computing, labels are very limited across real datasets, and there is still a long way to go to train a robust and accurate learner. MTL, which learns task relations from data automatically, helps reduce the manual labelling cost of each learning task. A popular example is joint semantic segmentation and depth estimation [ZY22]. From this perspective, extracting the commonalities of several related tasks for joint learning is a promising direction to boost performance. As scene flow is inherently a low-level visual cue, it can be integrated with other visual components, such as object locations, for higher-level scene understanding tasks. Such joint learning enables the model to better cope with complex scene data and improves its self-evolution and self-adaptation with multi-task knowledge. Moreover, a multi-task learning strategy can even outperform separate models trained independently on each task [CGK18a] and further improve robustness [MTL22]. As shown in recent works [TWZ*18, CGK18b, HH19, JKBC20], there are considerable attempts to integrate multiple tasks in a unified architecture.

Domain adaptation
Most current deep learning networks are data-driven, and many state-of-the-art DL models have achieved impressive results. However, those DL models are fine-tuned on a fixed task set, and adapting current DL models to different domains is still in its early stages. Since 3D annotations usually depend on annotations obtained from the image domain, it is hard to achieve equal accuracy on a larger dataset. One possibility is transfer learning: transfer the knowledge gained from solving one problem (e.g. depth estimation) and apply it to a different but related problem, such as scene flow estimation. From a broader perspective of self driving and robotic

Semi-supervised learning scheme
Semi-supervised learning is a branch of machine learning that leverages unlabelled data to reduce the need for manual annotations. Jiang et al. [JSJ*19] introduced a compact network (SENSE) that shares common encoder features among optical flow, disparity, occlusion, and semantic segmentation. SENSE [JSJ*19] handles partially labelled image data very well. To ameliorate the issue of sparse ground-truth annotations of scene flow, SENSE adds a distillation loss and a self-supervised loss to the supervised losses, which together form its semi-supervised loss. The success of semi-supervised learning in optical flow estimation shows that it has the potential to fill the gap between unsupervised learning and supervised learning.
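To make the idea concrete, a minimal sketch of such a composite objective is shown below, assuming a per-point labelled mask. The supervised term applies only where annotations exist, while a simple self-supervised surrogate acts on all points. This is an illustrative stand-in, not SENSE's exact losses, and the function name is hypothetical:

```python
import numpy as np

def semi_supervised_flow_loss(pred, gt, labeled_mask, lam_self=0.5):
    """Combine a supervised end-point-error term on labelled points with a
    simple self-supervised surrogate on all points.

    pred, gt: (N, 3) flow vectors; labeled_mask: (N,) bool.
    Illustrative stand-in only, not the exact SENSE formulation.
    """
    # Supervised term: mean end-point error over labelled points only.
    sup = 0.0
    if labeled_mask.any():
        sup = np.linalg.norm(pred[labeled_mask] - gt[labeled_mask], axis=1).mean()
    # Self-supervised surrogate: penalize deviation from the mean flow,
    # a crude smoothness prior that needs no labels at all.
    self_sup = np.linalg.norm(pred - pred.mean(axis=0), axis=1).mean()
    return sup + lam_self * self_sup
```

Real semi-supervised pipelines replace the surrogate with photometric, cycle-consistency, or distillation terms, but the structure of the combined loss is the same.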

Knowledge distillation
Collecting large-scale dynamic scene data requires complex calibration. In addition, the cost of transforming the original data into a trainable format is high. As a consequence, labelled datasets for scene flow estimation are very rare. Therefore, applying knowledge distillation to train small scene flow estimation networks would be a possible remedy for data-hungry networks. In machine learning, knowledge distillation (KD) is the process of compressing the knowledge in a large model into a smaller one. As shown in Figure 7, the traditional knowledge distillation setup consists of a teacher model and a student model. Many proposed deep learning models carry heavy parameter counts. Although it is commonly accepted that integrating multiple models and introducing more parameters improves the accuracy of a model, it also incurs high computational costs [MFL*20]. KD allows training smaller models with minimal loss in performance. The main innovation of KD is that the student network is trained not only with the information provided by true labels but also by observing how the teacher network processes the data. To the best of our knowledge, DCA-SRSFE [JLA*22] is the only method that applies KD to point-based scene flow estimation so far.
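The classic teacher-student objective can be sketched as below. This is the generic Hinton-style distillation loss (cross-entropy on true labels plus a temperature-softened KL term against the teacher), given here only to illustrate the mechanism; it is not the specific loss used by DCA-SRSFE:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style KD: alpha-weighted cross-entropy with true labels plus
    KL divergence between temperature-softened teacher/student outputs."""
    p_s = softmax(student_logits)
    hard = -np.log(p_s[np.arange(len(labels)), labels] + 1e-12).mean()
    ps_T = softmax(student_logits, T)
    pt_T = softmax(teacher_logits, T)
    # KL(teacher || student), scaled by T^2 as in the original formulation
    soft = (pt_T * (np.log(pt_T + 1e-12) - np.log(ps_T + 1e-12))).sum(axis=-1).mean()
    return alpha * hard + (1 - alpha) * (T ** 2) * soft
```

When the student matches the teacher exactly, the soft term vanishes and only the hard-label term remains.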

Efficiency
Deep learning requires expensive GPUs and many machines, whereas memory and computation resources on board are limited. When it comes to processing large-scale point clouds captured from outdoor scenes, this limitation makes accurate scene flow estimation more difficult. The design of convolutional kernels and feature descriptors is the key to balancing the efficiency and accuracy of processing 3D data. In spite of the significant improvements of DL models in 3D point cloud learning [QYSG17, WSL*19, ZFF*21], DL models that achieve real-time perception of surrounding dynamics for AVs remain under-explored.

Conclusions
This paper reviews the state-of-the-art approaches for scene flow estimation on point clouds within the scope of deep learning paradigms. A comprehensive overview of the challenges in this field is given. Extensive analyses of supervised, weakly-supervised, and self-supervised scene flow estimation methods are presented, covering the merits and demerits of these methods. Moreover, this paper introduces several higher-level scene understanding tasks from the perspective of scene flow estimation and discusses promising research directions. We hope this survey will inspire more research in this field.

Figure 1 :
Figure 1: Visualization of scene flow for a KITTI example scene. The source point cloud is shown in blue and the target point cloud in green. The deformed points, shown in PaleVioletRed, are obtained by adding the scene flow vectors (red arrows) to the source points.

FlowNet3D.
Liu et al. [LQG19] proposed FlowNet3D, which extracts point features from point clouds directly. It has three main layers for point cloud processing and uses PointNet++ as its backbone for feature learning. As shown in Figure 2, the flow embedding layer aggregates point similarities to encode scene flow; in effect, FlowNet3D finds soft correspondences between point clouds in two consecutive frames. The set upconv layer (Section 3.4.1) is used for flow refinement. The model has shown good results on synthetic datasets but has not achieved equivalent performance in real-world settings, due to the difficulty of obtaining point-level supervision from real-world data.
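A toy version of the flow-embedding idea can be sketched as below. The real layer learns its aggregation with a shared MLP and max-pooling; here inverse-distance weighting stands in for the learned part, so this is only a conceptual illustration with hypothetical names:

```python
import numpy as np

def flow_embedding(p1, p2, f2, radius=1.0):
    """Toy flow-embedding layer in the spirit of FlowNet3D: for each point
    in frame 1, aggregate displacement and feature cues from nearby frame-2
    points (soft correspondence).  FlowNet3D learns this aggregation with an
    MLP + max-pool; inverse-distance weights are a hand-crafted stand-in.

    p1: (N,3), p2: (M,3) coordinates; f2: (M,C) frame-2 features.
    Returns an (N, 3+C) embedding per frame-1 point.
    """
    out = np.zeros((len(p1), 3 + f2.shape[1]))
    for i, p in enumerate(p1):
        d = np.linalg.norm(p2 - p, axis=1)
        nbr = d < radius
        if not nbr.any():           # no neighbour found: keep zero embedding
            continue
        w = 1.0 / (d[nbr] + 1e-8)
        w = w / w.sum()
        disp = ((p2[nbr] - p) * w[:, None]).sum(axis=0)   # soft displacement
        feat = (f2[nbr] * w[:, None]).sum(axis=0)         # soft target feature
        out[i] = np.concatenate([disp, feat])
    return out
```

The soft displacement already resembles a crude flow estimate; the network refines it through subsequent set-conv and set-upconv layers.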

Figure 5 :
Figure 5: Chronological overview of the most relevant work on deep learning-based scene flow estimation on 3D Point Clouds.

SFGAN.
3D point clouds capture the continuous motion of objects in real scenarios. Based on this insight, Wang et al. [WJS*22] utilize generative adversarial networks (GANs) to learn scene flow. SFGAN [WJS*22] presents a novel strategy of discriminating between generated point clouds and real point clouds. The predicted scene flow and the source point cloud are combined to generate a fake point cloud intended to be identical to the target point cloud. The discriminator then assesses the consistency between the real scene and the synthesized 3D scene (fake point cloud) to enhance the performance of the scene flow generator. Adversarial training of the generator and discriminator enables SFGAN to enforce scene consistency over a period of time.

DCA-SRSFE.
Jin et al. [JLA*22] proposed a mean-teacher framework for unsupervised domain adaptation from synthetic data to real data. DCA-SRSFE [JLA*22] consists of a student model supervised by ground-truth scene flow labels and a teacher model updated as the Exponential Moving Average (EMA) of the student model weights. A deformation regularization module and a correspondence refinement module are introduced to produce high-quality pseudo labels. In the deformation regularization module, a rigid motion between the first point cloud and the warped point cloud is predicted via the Kabsch algorithm [Kab76]. This module encourages shape-distortion awareness in the student model and promotes adaptive deformations for the target domain. The flow vectors are later improved with surface correspondences by refining local geometry. DCA-SRSFE is supervised by ground-truth flow labels in the source domain and trained with a consistency loss on the target domain. The proposed GTA-SF is a large-scale synthetic dataset with accurate scene flow labels. According to the experiments, DCA-SRSFE narrows the performance gap between synthetic datasets and real-world scenarios.
RCP. RCP [GTY*22] decomposes scene flow estimation into two interlaced steps: the first optimizes 3D flow point-wisely, and a recurrent network then optimizes 3D flow globally. In the point-wise optimization module, an auxiliary flow vector is calculated by concatenating the point feature and a positional encoding. In the second step, RCP leverages a GRU to update the hidden state for the estimation of residual flow vectors. RCP is trained in both a fully-supervised and a self-supervised manner. RCP also conducts experiments on point cloud registration, where 6-DoF poses are generated from point-to-point costs. The results on scene flow estimation and point cloud registration are on par with state-of-the-art methods.
Ego-motion. Inspired by HPLFlowNet [GWW*19], Ego-motion
[TLOP20] uses DownBCL and CorrBCL as building blocks to regress relative poses from a pair of point clouds. It estimates non-rigid flow and ego-motion jointly, with an iterative update module to refine the rigid transformation. Ego-motion also compares fully-supervised, hybrid, and self-supervised training strategies, showing that the hybrid training scheme performs better on FlyingThings3D [MIH*16] and KITTI2015 [MHG15, MG15].
RigidFlow. RigidFlow [LZL*22] introduces a local rigidity prior into self-supervised scene flow learning. Based on the assumption that a scene is composed of several rigidly moving parts, RigidFlow decomposes the source point cloud into a collection of local rigid regions. Different from recent self-supervised works [BEM*21, PHL20] that use local rigidity as regularization terms, RigidFlow enhances the pseudo-label generation module by integrating local rigidity into region-wise scene flow estimation. With a pre-trained predicted flow [LLX21]
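The Kabsch algorithm [Kab76] used for the rigid-motion fits above admits a compact closed-form solution via SVD. A minimal NumPy sketch, assuming noise-free points with known correspondences:

```python
import numpy as np

def kabsch(P, Q):
    """Kabsch algorithm [Kab76]: least-squares rigid transform (R, t)
    aligning point set P onto Q (both (N,3), rows in correspondence)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                  # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```

Methods such as DCA-SRSFE and RC-SFE apply exactly this kind of closed-form fit per region or per abstraction to obtain rigid motions from soft correspondences.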

©
2023 The Authors. Computer Graphics Forum published by Eurographics - The European Association for Computer Graphics and John Wiley & Sons Ltd.
Rigid3DSF. Rigid3DSF [GLW*21] is a data-driven method that integrates flow into a higher-level scene abstraction represented by multi-rigid-body motion. Rigid3DSF connects point-wise flow with other higher-level scene understanding tasks through an object-level deep network. In detail, Rigid3DSF divides the scene into foreground, background, and abstract rigid objects as scene components. As such, scene flow in the background is attributed to the ego-motion of the sensor, and motion in the foreground can be reasoned about at the level of individual objects. To exploit the geometry of the rigid entities, Rigid3DSF introduces an inductive bias, and it also proposes a test-time optimization to refine the flow predictions. For training on real data under weak supervision, Rigid3DSF uses SemanticKITTI [BGM*19] without dense scene flow annotations.
RC-SFE. RC-SFE [DZL*22] is a weakly-supervised scene flow learning framework based on a GRU recurrent network. Apart from the source point cloud and the target point cloud, RC-SFE also takes as input a set of abstraction masks of the source point cloud generated by a pre-trained segmentation network [GLW*21]. To convert the initial point correspondences and the pre-warped scene flow, RC-SFE applies the Kabsch algorithm [Kab76] to obtain a transformation for each segmented abstraction, so the rigid flow is calculated from the abstraction transformations and abstraction masks. During the updating stage, a GRU-based error-aware optimization refines the prediction. Compared to previous work that folds indirect constraints into iterative optimization, RC-SFE introduces direct multi-body rigidity constraints to alleviate structure distortion. After several recurrent updates, an optimal mix of scene flow and rigid flow forms the final hybrid scene flow. However, RC-SFE cannot handle scenes with many non-rigid parts. Like Rigid3DSF [GLW*21], RC-SFE relies on the segmentation of the background to generate accurate estimation. Dealing with non-rigid motions and occlusions is worthy of further exploration in the future.
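Given per-abstraction rigid transformations, such as those produced by a Kabsch fit, composing them into a point-wise rigid flow field is straightforward. A minimal sketch with hypothetical mask/transform inputs (flow_i = R_k p_i + t_k - p_i for the abstraction k containing point i):

```python
import numpy as np

def rigid_flow_from_abstractions(points, masks, transforms):
    """Compose per-abstraction rigid motions into a point-wise flow field,
    in the spirit of Rigid3DSF / RC-SFE.

    points: (N,3); masks: (K,N) bool, one row per rigid abstraction;
    transforms: list of K (R, t) pairs with R (3,3) and t (3,).
    """
    flow = np.zeros_like(points)
    for mask, (R, t) in zip(masks, transforms):
        p = points[mask]
        flow[mask] = p @ R.T + t - p    # rigid displacement of this segment
    return flow
```

The final hybrid flow in RC-SFE mixes such a rigid field with the freely predicted per-point flow; this sketch covers only the rigid part.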

Figure 6 :
Figure 6: Application of a scene flow approach to point cloud densification, from [PHL20]. The left image is projected from the original sparse point cloud collected in an Argoverse scene. The middle image shows the frame densified via the Iterative Closest Point (ICP) algorithm. The right image is the densification from SFPC [PHL20].

Figure 7 :
Figure 7: The architecture of Knowledge Distillation.
P ∈ R^{N1×3} denotes the point cloud at time t with N1 points, and Q ∈ R^{N2×3} represents the point cloud at time t + 1 with N2 points. Scene flow estimation aims at recovering the 3D motion from point cloud P captured at the first frame to point cloud Q at the next frame. Therefore, the target of scene flow estimation is that each point p_i ∈ P should be as near as possible to the corresponding point q_i ∈ Q after scene flow recovery. It is noteworthy that, due to the sparsity and unstructured nature of point clouds, the source point cloud and the target point cloud do not necessarily have the same number of points or hard correspondences. Many methods estimate scene flow vectors for the points in the first point cloud. With this prior setting, per-object transformation parameters can be predicted [GLW*21]. Most prominent methods only use point coordinates to estimate the motion vector; there is also an attempt [LLX21] that makes use of colour and surface normals as additional cues to find point correspondences.
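Under this formulation, the basic operation, warping the source cloud by an estimated flow and measuring nearest-neighbour residuals against a target cloud of possibly different size, can be sketched as below (an illustrative helper, not any particular method's loss):

```python
import numpy as np

def warp_and_nn_residual(P, F, Q):
    """Warp source cloud P (N1,3) by estimated flow F (N1,3) and measure,
    for each warped point, the distance to its nearest neighbour in the
    target cloud Q (N2,3).  N1 and N2 may differ: no hard correspondence
    is assumed, mirroring the problem statement above."""
    warped = P + F
    # (N1, N2) pairwise distance matrix; fine for small toy clouds
    d = np.linalg.norm(warped[:, None, :] - Q[None, :, :], axis=2)
    return warped, d.min(axis=1)
```

Self-supervised methods build Chamfer-style losses from exactly this kind of residual, summed in both directions between the warped source and the target.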

Table 1 :
Open real scene datasets. Avg points per frame is the number of points from all LiDAR returns computed on the released data. Trains and tests represent the number of training and testing samples in the dataset. Scenes represents the number of scenes captured in the dataset. Resolution is the corresponding image size of each captured scene. Day&night means the dataset covers data collected both day and night.
• NuScenes. LiDAR sparsity yields a distribution shift between KITTI and NuScenes. NuScenes has recorded diverse data from Boston and Singapore. However, NuScenes does not provide scene flow annotations, which poses a great challenge for deep learning based methods to predict accurate scene flow.
• Waymo. The Waymo dataset [SKD*20] includes a large number of 3D ground-truth bounding boxes for LiDAR data and 2D tightly fitting bounding boxes for camera images, all of which are high quality and manually annotated. It contains 158,081 training and 39,987 validation frames of point clouds with LiDAR labels [JLA*22], such as vehicles, pedestrians, signs, and cyclists. However, scene flow labels are not included.
SCTN. SCTN [WWLW21] uses a correlation matrix to estimate soft correspondences by combining features from both the sparse convolution and the transformer module. Additionally, SCTN proposes a feature-aware spatial consistency loss to improve its ability to distinguish different motion fields.
FH-Net. FH-Net [DDX*22] deals with multi-scale flows from different layers at a much faster speed. To this end, FH-Net extracts keypoint features via a hierarchical Trans-flow layer. The computed sparse flow is then used to obtain hierarchical flows at different resolutions through an inverse Trans-up layer. FH-Net also introduces a new data augmentation strategy to enhance the accuracy of predicted flow, particularly on complex dynamic objects. This work sets new performance standards on the KITTI and Waymo datasets.
SAFIT. SAFIT [SM22] introduces relation reasoning between object-level and point-level relations. The relation module captures relational features between objects, which diversifies the feature palette of 3D point clouds and can be combined with other features to boost the performance of scene flow. This differs from other methods that only extract geometry or location features for individual objects. As presented in SAFIT, the supervised training scheme outperforms FLOT by 3.8% and 22.58% on the preprocessed FlyingThings3D and KITTI datasets [GWW*19]. Besides, SAFIT achieves 10.90% and 21.82% accuracy improvements over FLOT on FlyingThings3D and KITTI where occluded points are not removed [LQG19].
Bi-PointFlowNet. Built upon successful bidirectional learning in time-series tasks and 2D optical flow estimation, Bi-PointFlowNet [CK22] develops the first bidirectional model for 3D scene flow estimation. Bi-PointFlowNet targets estimating the optimal non-rigid transformation that best aligns the source to the target frame. The previous standard procedure (i.e. grouping -> concatenation -> MLP -> max-pooling) usually leads to redundant computations. To address this issue, Bi-PointFlowNet
decomposes the MLP weights in the bidirectional flow embedding layer into three sub-weights. In this way, the local coordinates, the propagated feature, and the replicated feature of the two point clouds can be transformed to produce a new fused feature vector. The following upsampling and warping layers are the same as in PointPWC-Net. Compared to PointPWC-Net [WWL*20], Bi-PointFlowNet reduces the total operations by 44% and accelerates inference by 33%.
Est&Pro builds a local-adaptive cost volume, which addresses the dissimilarity in local structure caused by sparse depth sensor (LiDAR) sampling. For occluded points, Est&Pro proposes an uncertainty-truncated propagation network to propagate the flows from non-occluded points to the occluded ones. Intuitively, the flow estimator is responsible for the non-occluded points, while the flow propagation network focuses on the motion flows of the occluded points.
RMS-FlowNet. RMS-FlowNet [BSMS22] employs a feature extraction module consisting of a top-down pathway and a bottom-up pathway. Starting from the first level, local feature aggregation and down-sampling are applied to process features at each level; up-sampling and transposed convolution then propagate point features. Unlike previous hierarchical structures [WWLW21], RMS-FlowNet proposes a Patch-to-Dilated-Patch flow embedding strategy, which re-computes features generated from previous steps with new attention scores. This design speeds up the model without sacrificing accuracy. RMS-FlowNet uses a fully supervised loss function similar to PointPWC-Net. This work brings great improvements over recent efforts on faster predictions for large, consecutive point clouds containing over 250K points.
FESTA. Many methods, for example FlowNet3D [LQG19] and MeteorNet [LYB19], apply Farthest Point Sampling (FPS) to extract point features. However, FPS usually leads to different downsampled results from two point clouds that represent the same manifold [WPL*21]. Hence, it is intractable to estimate accurate scene flow with the unstable features extracted by FPS. FESTA [WPL*21] addresses this issue via a spatial abstraction with attention (SA^2) layer and a temporal abstraction with attention layer. In the SA^2 layer, FESTA utilizes a trainable Aggregate Pooling module based on the shifted positions of points defining the attended regions.
PointPWC-Net. Wu et al. proposed PointPWC-Net [WWL*20], which predicts scene flow by constructing a cost volume at each feature pyramid level. To capture large motions, PointPWC-Net adopts a coarse-to-fine strategy that concatenates the feature at level L with the upsampled feature from level L + 1. The scene flows are refined using features generated from the cost volume, the upsampled flow, and the source point clouds. However, PointPWC-Net has some limitations on the KITTI dataset [MG15]. Firstly, it fails to perform well when the object is a straight line or a plane. In addition, it is hard to obtain effective correspondences from two consecutive frames under strong deformation of local shapes. Lastly, PointPWC-Net retains the ground points.
Res3DSF. Based on the observation that humans are good at perceiving surrounding dynamic movement, Res3DSF [WHWW21] includes a context-aware point feature pyramid module together with a residual flow refinement layer for scene flow estimation. Many previous methods ignored the discrimination of repetitive patterns in dynamic scenes. Res3DSF incorporates contextual structure learning into its 3D spatial feature extraction layer and learns soft aggregation weights. Res3DSF adopts an attentive cost volume to learn flow embeddings from the context-aware feature pyramid module. These
flow embeddings are then refined by three-nearest-neighbour interpolation and multiple MLP layers to acquire the final complete scene flow. The evaluation results in Table 4 indicate the effectiveness of the framework proposed by Res3DSF [WHWW21]. Res3DSF addresses the diversity of motion fields well, so it can estimate long-distance motion.
FLOT. Several studies in graph matching, such as [MGCF19, NMV17], utilize optimal transport to find correspondences between two different graphs. Inspired by these works, FLOT [PBM20] casts scene flow estimation as finding soft correspondences on a pair of point clouds by solving an optimal transport problem. FLOT extracts point features through several convolution layers; the transport cost is then measured by the cosine similarity of these point features. To circumvent the absence of correspondences for some points, FLOT [PBM20] proposes a mass regularisation to ensure that mass is uniformly distributed over all points.
SCTN uses a combination of sparse convolution for feature extraction and a transformer module for accurate scene flow prediction. It is the first work to incorporate the transformer with sparse convolution, which allows it to learn relation-based contextual information on point clouds. Furthermore, the continuous CRFs in HCRF-Flow ensure the spatial smoothness and local rigidity of the scene flow predictions; rigid motion is thus well-considered in HCRF-Flow under the constraints of both point-level and region-level consistency.
PV-RAFT. As mentioned before, PointPWC-Net [WWL*20] utilizes a coarse-to-fine strategy to find point correspondences. However, it suffers from error accumulation [WWR*21]. PV-RAFT [WWR*21] is an innovative approach that builds correlation volumes to address the limitations of previous cost-volume based methods, inspired by the recurrent all-pairs field transform used in 2D optical flow [TD20]. With voxel correlation features that encode long-range point clouds and point-based features that aggregate fine-grained
local information, PV-RAFT efficiently captures both short-range and long-range correlations in consecutive point clouds. PV-RAFT utilizes a Gated Recurrent Unit (GRU) to iteratively update the predicted scene flow with context features as auxiliary information. Besides, PV-RAFT also develops a truncation operation and a refinement module to further increase accuracy.
WhatMatters. To capture reliable match candidates from point clouds even at long distances, WhatMatters proposes a novel all-to-all point mixture module with backward reliability validation. A comprehensive analysis of point similarity calculation, designs of the scene flow predictor, input elements of the scene flow predictor, and flow refinement level design showcases what matters in a 3D scene flow network.
Dynamic3DSA. To facilitate the analysis of point cloud sequences, four different tasks are integrated into a complete multi-frame 4D scene analysis approach. Huang et al. [HGH*22] comprehensively study point cloud registration, motion segmentation, instance segmentation, and piece-wise rigid scene flow estimation. To this end, it is necessary to separate individual moving objects from the static background and infer their temporal and spatial properties. Dynamic3DSA accumulates 3D points across multiple frames while representing the scene as a collection of rigid moving agents, followed by reasoning about the motion of each agent.
Est&Pro. Est&Pro [WS22] employs a subnet to predict the occlusion mask, which guides the flow predictor to focus on estimating the motion flows of non-occluded points. In this way, more valid matching costs can be calculated. Est&Pro designs a local-adaptive cost volume.
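As a side note on the sampling issue raised for FESTA above: greedy Farthest Point Sampling is simple to implement, and running it independently per frame makes its cross-frame instability easy to observe. A minimal sketch:

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the set chosen
    so far, and return the indices of the k sampled points.  Running FPS
    independently on two frames of the same surface generally selects
    different points, which destabilises correspondence-based features."""
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())            # farthest from all chosen points
        chosen.append(idx)
        # distance to the nearest chosen point, updated incrementally
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```

The quadratic-in-N incremental update shown here is the standard formulation; production implementations batch it on the GPU.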

Table 2 :
Summarization of fully supervised DL architectures for scene flow estimation. FLY3D is the abbreviation of FlyingThings3D. denotes methods with open-sourced code.
Pros: Simple and efficient; addressed the transformation challenge. Cons: Annotation-hungry; poor performance on occluded points.
Pros: Pioneer in using sparse convolution and a transformer to exploit coherent motions and model point correlations; spatial feature-aware. Cons: Annotation-hungry. KITTI2018, FLY3D.
PV-RAFT [WWR*21] Pros: Pioneer in integrating point and voxel correlations in a recurrent all-pairs field to estimate scene flow; GRU-based iterative method. Cons: Structure distortion; high time consumption.
Pros: Coarse-to-fine strategy; supervised and self-supervised training fashion. Cons: Some objects are out of view; error accumulation in the early steps.
Pros: Feature-based attention module; improved re-weighting mechanism in calculating convolutional weights. Cons: Poor performance on occlusions.
Pros: Efficient; addressed the density-difference and big-data challenges. Cons: Lack of evaluation on the large-scale real dataset NuScenes.

Table 3 :
Summarization of self-supervised/weakly supervised DL architectures for scene flow estimation based on point clouds. FLY3D is the abbreviation of FlyingThings3D. denotes methods with open-sourced code.
SLIM. SLIM [BEM*21] removes the annotation requirement on realistic data by integrating self-supervised scene flow estimation with a motion segmentation framework. SLIM shows that the motion segmentation signal can be generated by detecting the discrepancy between raw flow predictions and rigid ego-motion. Compared to existing methods [MOH20, WWL*20], SLIM handles arbitrary point densities and does not rely on one-to-one correspondences. SLIM is built upon RAFT [TD20].
Pros: State-of-the-art weakly supervised; good generalization ability; addressed the transformation challenge. Cons: Sensitive to the accuracy of background masks; relies on the rigidity assumption; suffers from occlusions.
Pros: Enhanced local rigidity in scene flow estimation; good generalization ability. Cons: Fails on non-rigid motion; suffers from occlusions.
Pros: Recurrent architecture for non-rigid scene flow; all-to-all correlation learning; addressed the big-data and annotation challenges. Cons: Manually set iteration parameters; suffers from the occlusion challenge.
DCA-SRSFE [JLA*22] Pros: Reduced the domain gap between synthetic and real datasets; avoided shape deformations; addressed the transformation challenge. Cons: Predictions on non-rigid objects are not accurate.
PillarML. An investigation shows that a self-driving vehicle generates abundant data but only 5% of the data is usable. Therefore, PillarML utilizes multiple sensors as sources of data and exploits free signals from them.
Noisy-Pseudo. Noisy-Pseudo [LZLG22] is a novel multi-modality framework that utilizes both RGB images and point clouds to generate pseudo labels for training scene flow networks. The selection of pseudo labels depends on the geometric information of the point clouds.

Table 4 :
The quantitative evaluation results on FlyingThings3D [MIH*16]. Self/full indicates the training strategy on FlyingThings3D. Lower values are better for the error metrics (EPE3D and Outliers); higher values are better for the accuracy metrics (Acc3DS and Acc3DR). All results are based on the quantitative results reported in the original papers.
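For reference, the metrics in Table 4 are commonly computed as below. The thresholds follow the convention popularized by the FlowNet3D/HPLFlowNet evaluations; treat the exact cutoffs as the community convention rather than something fixed by this survey:

```python
import numpy as np

def scene_flow_metrics(pred, gt):
    """Standard point-based scene flow metrics (FlyingThings3D convention):
    EPE3D, Acc3DS (strict), Acc3DR (relaxed), Outliers.
    pred, gt: (N,3) flow vectors."""
    err = np.linalg.norm(pred - gt, axis=1)     # per-point end-point error
    mag = np.linalg.norm(gt, axis=1) + 1e-12    # avoid division by zero
    rel = err / mag                             # relative error
    return {
        "EPE3D":    err.mean(),
        "Acc3DS":   np.mean((err < 0.05) | (rel < 0.05)),
        "Acc3DR":   np.mean((err < 0.10) | (rel < 0.10)),
        "Outliers": np.mean((err > 0.30) | (rel > 0.10)),
    }
```

EPE3D is in metres; the three ratio metrics report the fraction of points satisfying (or violating) the absolute/relative thresholds.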

Table 5 :
The quantitative evaluation results on three versions of the KITTI scene flow dataset. Self/full denotes the self-supervised and fully-supervised learning approaches.
prediction, FlowMot [ZKC*20] suggests using the estimated scene flow to compute object-level movement. While most tracking methods adopt a "track-by-detection" approach and utilize the Kalman filter to avoid having to adjust hyperparameters, FlowMot uses scene flow estimation to obtain consistent 3D motion information. Recently, Yang et al. [YJY*22] proposed a novel scene flow based point cloud feature fusion module that leverages temporal information in dynamic 3D point cloud sequences to improve 3D object tracking. These works demonstrate the potential of scene flow to address the challenges faced by current object tracking methods, which lack generalization across different datasets.
[LZGG22] and SLIM [BEM*21] learn background and foreground motions separately. For the occlusion challenge, Occlusion-G [OR21b], FESTA [WPL*21], and Est&Pro [WS22] explored different masking operations to reduce the interference of the occluded points. In terms of accuracy, the state-of-the-art supervised method (WhatMatters [WHL*22]) improves the accuracy from 41.3% to 92.9% on the FlyingThings3D dataset. Also, several architectures, such as SLIM [BEM*21] and SCTN [LZGG22], still cannot afford the burden of processing a large number of points; the training time drastically increases as the point cloud size grows.