T-ESVO: Improved Event-Based Stereo Visual Odometry via Adaptive Time-Surface and Truncated Signed Distance Function

spatial-temporal adaptive TS that can deal with different camera motions in various environments. The mapping unit introduces the TSDF to describe the 3D representation of environments and achieves depth estimation based on the global historical depth information contained in the environmental TSDF description. The tracking unit achieves the 6-DoF pose estimation through a 3D-2D registration method based on the left/right TS selection mechanism and the depth point selection mechanism. The effectiveness and robustness of the proposed system are evaluated on various datasets, and the experimental results show that T-ESVO achieves good performance in both accuracy and robustness when compared with other state-of-the-art event-based stereo VO systems.

We introduce the Truncated Signed Distance Function (TSDF) for event-based mapping: on one hand, to obtain a denser global environmental map through the accumulation of local event depths and continuous updates of the TSDF; on the other hand, to achieve more accurate depth estimation based on the global historical depth information contained in the TSDF. Moreover, tracking is achieved based on left/right TS selection, depth point selection, and 3D-2D registration to improve the accuracy and robustness of the system. We implement our proposed system based on ESVO [3] and evaluate it qualitatively and quantitatively. The results show that T-ESVO achieves better performance in both accuracy and robustness in different challenging scenarios (please refer to Figure 1 for a quick glance at our proposed T-ESVO's performance). To conclude, our contributions can be summarized as follows: 1) We present a novel spatial-temporal adaptive TS method that enables the system to work normally under different camera motions and to output stable event processing results for various camera moving speeds; 2) We introduce TSDF into our event-based VO system to make full use of global historical depth information and improve the accuracy of depth estimation via TSDF. A denser global 3D environmental map can also be obtained with our proposed TSDF-based mapping method; 3) We propose a 3D-2D registration tracking method based on the left/right TS selection mechanism and the depth point selection mechanism. This tracking method can effectively improve the accuracy and robustness of camera motion estimation; and 4) We implement the proposed T-ESVO in C++ and evaluate it on three public datasets. [21,23,24] The results demonstrate that our system achieves good accuracy and robust performance when compared with the state of the art.
The rest of the article is organized as follows. Section 2 gives a brief review of related works. The system overview of T-ESVO is presented in Section 3. The detailed methods are introduced in Sections 4-6. Section 7 provides experimental evaluations and results. Finally, our conclusion is drawn in Section 8.

Figure 1. The performance of our proposed T-ESVO on the public upenn dataset (upenn_indoor_flying1 sequence). [23] a) The estimated inverse depth frame (left) and the surface inverse depth frame via TSDF (right); b) the reconstructed semidense map and the estimated trajectory (black).

Related Works
This section first reviews the related works on event processing methods, and then gives a brief literature review of event-based VO/SLAM methods.

Event Processing
Event processing methods can be categorized into two paradigms: event-by-event methods and group-of-events methods. [1] An event-by-event method handles input events one by one, relying on the availability of additional information, such as other sensors' information or prior knowledge of previous events. This kind of method has been adopted for many visual tasks, including feature tracking, [25] SLAM, [26,27] image reconstruction, [28] etc. A group-of-events method handles a group of events together based on a fixed event number or a fixed time window; some of these methods accumulate events into frames, [31][32] while others process groups of events as tensors, including TS, [3,21,33] event histogram, [34] event map, [22] etc. As a group-of-events method, the TS representation does not need additional information and can be interpreted as an anisotropic distance field for 3D-2D registration. [3] Most existing TS methods can hardly deal with event streams under different camera moving speeds. Manderscheid et al. [35] proposed a speed-invariant TS to cope with different object speeds in event-based corner point detection. Liu and Delbruck [36] proposed an adaptive time-slice to handle different relative camera motion speeds in event-based optical flow estimation. However, the speed-invariant TS can only ensure the temporal consistency of the event representation in a local patch, and the adaptive time-slice is a kind of event map representation, which is not as suitable as TS for 3D-2D registration. [3] In addition, neither of these two methods has so far been adopted in a VO system. In this article, we present a novel spatial-temporal adaptive TS method to tackle different camera motions for the event-based VO system.

Event-Based VO and SLAM
Event-based VO/SLAM methods can be divided into monocular ones and stereo ones. Mueggler et al. [37] presented an onboard event-based monocular system for 6-DoF localization under high speed and fast rotation. Kim et al. [26] performed real-time 3D reconstruction, 6-DoF localization, and scene intensity estimation with only a monocular event camera. Kueng et al. [25] implemented a monocular VO system based on event-based feature detection. Rebecq et al. [22] proposed EVO, which combines an image-alignment tracker with semidense mapping. Bryner et al. [38] introduced a maximum-likelihood framework to estimate the camera motion. Nguyen et al. [39] used a stacked spatial LSTM network to estimate camera poses with event images as input. However, monocular methods always face the scale consistency problem, and stereo methods perform better in this respect thanks to the known depth information.
Among event-based stereo VO/SLAM methods, Zhou et al. [3] proposed a stereo event-based VO system called ESVO. The system achieves semidense 3D reconstruction [21] and 3D-2D registration tracking [40] via inverse depth frames and TS. Jiao et al. [24] analyzed and compared the performance of ESVO under different event representation methods.
In this article, we present an event-based stereo VO method called T-ESVO. We propose a spatial-temporal adaptive TS for event processing and introduce TSDF for event-based mapping. Tracking is achieved through a 3D-2D registration method based on the left/right TS selection mechanism and the depth point selection mechanism. The proposed T-ESVO achieves good performance in terms of accuracy and robustness.

System Overview
The system overview of T-ESVO is shown in Figure 2. The proposed system can be divided into three components: the event processing unit, the mapping unit, and the tracking unit. The event processing unit is responsible for converting event stream data into TS images based on the spatial-temporal adaptive TS method (called "A-TS", see Section 4). The length of the event stream is decided by the temporal and spatial characteristics of the current events to avoid too much or too little information in the corresponding A-TS images.
The mapping unit is responsible for depth estimation and semidense map construction. First, this unit estimates the latest events' depths by stereo event matching based on the A-TS images from the event processing unit and the camera poses from the tracking unit. Second, it fuses these asynchronous event depth points into a new inverse depth frame. Third, it uses this inverse depth frame to create/update the environmental TSDF description and obtains the global semidense map of the environment based on the TSDF. Finally, it re-estimates the depth information in the inverse depth frame based on the global historical depth information contained in the TSDF.
The tracking unit is responsible for camera motion estimation. First, it chooses a suitable current A-TS by the brightness and contrast similarity between the reference A-TS image and the current left/right A-TS images. Then, it selects a depth point subset that conforms to the depth point distribution in the reference inverse depth frame for 3D-2D registration. Finally, it estimates the pose transformation between the reference camera frame and the current camera frame based on the 3D-2D registration method.

Spatial-Temporal Adaptive Time-Surface
The output of the event camera is a stream of asynchronous events. We use $e_k = (u_k, v_k, t_k, p_k)$ to represent a single event, containing the pixel coordinate $\mathbf{x}_k(u_k, v_k)$, timestamp $t_k$, and polarity $p_k$. The traditional TS [3] is a 2D camera-size map where each pixel stores time information. At time $t$, it can be defined as

$$\mathcal{T}(\mathbf{x}, t) = \exp\left(-\frac{t - t_{\mathrm{last}}(\mathbf{x})}{\varphi}\right) \qquad (1)$$

where $t_{\mathrm{last}}(\mathbf{x})$ is the timestamp of the latest event at the pixel $\mathbf{x}(u, v)$, and $\varphi$ is the constant decay rate parameter (e.g., 30 ms in the study of Zhou et al. [3]).
Based on Equation (1), we can convert a stream of events into a TS image $\mathcal{I}(\cdot, t)$ whose pixel intensities are the values of the corresponding pixels in the TS $\mathcal{T}(\cdot, t)$ rescaled from $[0, 1]$ to the range $[0, 255]$.
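As a concrete illustration, the following minimal C++ sketch converts a per-pixel latest-timestamp map into a TS image according to Equation (1). The OpenCV types and the function name are illustrative assumptions, not taken from the ESVO codebase.

```cpp
#include <cmath>
#include <opencv2/core.hpp>

// A minimal sketch of Equation (1): converting the per-pixel latest-event
// timestamp map t_last into an 8-bit TS image.
cv::Mat makeTimeSurface(const cv::Mat& t_last,  // CV_64FC1, latest event time per pixel [s]
                        double t,               // query time [s]
                        double phi)             // decay rate, e.g., 0.030 s as in ESVO
{
    cv::Mat ts(t_last.size(), CV_8UC1);
    for (int v = 0; v < t_last.rows; ++v)
        for (int u = 0; u < t_last.cols; ++u) {
            double val = std::exp(-(t - t_last.at<double>(v, u)) / phi); // T(x, t) in [0,1]
            ts.at<uchar>(v, u) = static_cast<uchar>(255.0 * val);        // rescale to [0,255]
        }
    return ts;
}
```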
The traditional TS can be considered a kind of fixed-time-window event processing method. The length of the time window is determined by the TS image generation frequency, and the average brightness (intensity) of the TS image is determined by $\varphi$. The traditional TS sets $\varphi$ to a fixed value, so that only events that occur within the time window are clearly reflected in the TS image (their corresponding pixels receive certain intensity values), while events outside the time window cannot show sufficient pixel intensity in the TS image because they occurred too early.
However, setting $\varphi$ to a fixed value makes the traditional TS unable to handle different occurrence frequencies of events. When the camera is moving normally or rapidly (high occurrence frequency and enough events in the time window), the traditional TS can generate TS images with sufficient intensity information for depth estimation. In contrast, when the camera is moving slowly or is stationary (low occurrence frequency and few events in the time window), it cannot generate good TS images (low average intensity). The fixed-event-number event processing method faces a similar problem: it cannot generate good TS images under fast camera motion. A related discussion can also be found in the study of Liu and Delbruck. [36]

To tackle this problem, we propose a novel spatial-temporal adaptive TS, called "A-TS", which can cope with different camera motions and output stable event processing results. Based on the traditional TS, we count the number of events in the latest time window as $n$ to judge the camera motion and define the A-TS value as

$$\mathcal{T}_A(\mathbf{x}, t) = \begin{cases} \exp\left(-\dfrac{t - t_{\mathrm{last}}(\mathbf{x})}{\varphi}\right), & n \ge N_\tau \\[2mm] \exp\left(-\dfrac{t - t_{\mathrm{last}}(\mathbf{x})}{\varphi_{N_\tau}}\right), & n < N_\tau \end{cases} \qquad (2)$$

where $N_\tau$ depends on the number of events used for depth estimation in the mapping unit (we set $N_\tau$ to $N$ from Section 5.1), which ensures that there is enough intensity information in the A-TS image for depth estimation. $\varphi_{N_\tau}$ is defined as

$$\varphi_{N_\tau} = \frac{t - t_{N_\tau}}{t_w}\,\varphi \qquad (3)$$

where $t_{N_\tau}$ is the timestamp of the latest $N_\tau$-th event, and $t_w$ is the length of the time window of the A-TS. Like the traditional TS image, Equation (2) needs to be rescaled to the range $[0, 255]$ to create the corresponding A-TS image. From Equations (2) and (3), we can find that: 1) When events occur at a high frequency (fast or normal camera motion, $n \ge N_\tau$), the A-TS works as the traditional TS, and there are more than $N_\tau$ events that can provide enough intensity information for the A-TS image. The top half of Figure 3 shows the processes of the traditional TS and the A-TS under fast or normal camera motion in an office scene; 2) When events occur at a low frequency (slow or no camera motion, $n < N_\tau$), at least $N_\tau$ events can still provide enough intensity information for the A-TS image by adaptively setting the value of $\varphi_{N_\tau}$. The bottom half of Figure 3 shows the processes of the traditional TS and the A-TS under slow or no camera motion in an office scene.
The proposed A-TS method is summarized in Algorithm 1. The inputs of the algorithm are: the timestamp $t$ of the A-TS; the timestamp $t_{\mathrm{last}}(\cdot)$ of the latest event at each pixel in the image plane at time $t$; the number of events $n$ in the latest time window at time $t$; the number of events $N_\tau$ used for depth estimation in the mapping unit; the decay rate parameter $\varphi$; the timestamp $t_{N_\tau}$ of the latest $N_\tau$-th event before time $t$; and the length of the time window $t_w$. The output of the algorithm is the A-TS image $\mathcal{I}(\cdot, t)$ at time $t$. First, the algorithm judges the camera motion based on $n$ and adaptively chooses a TS mode accordingly. Second, when the camera is in slow motion or stationary, the algorithm adaptively sets the value of $\varphi_{N_\tau}$ according to $t_{N_\tau}$ so that the output A-TS image contains sufficient intensity information.
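The following C++ sketch mirrors Algorithm 1 under our reconstruction of Equations (2) and (3). The adaptive decay formula is an assumption derived from the stated goal of keeping the latest $N_\tau$ events visible; the sketch reuses makeTimeSurface from the previous snippet.

```cpp
// Sketch of Algorithm 1. adaptiveDecay() is our reading of Equation (3)
// (an assumption): it stretches the decay rate so that the N_tau-th latest
// event still maps to a visible intensity.
double adaptiveDecay(double t, double t_Ntau, double t_w, double phi)
{
    return phi * (t - t_Ntau) / t_w;
}

cv::Mat makeAdaptiveTimeSurface(const cv::Mat& t_last, double t,
                                int n,          // events in the latest time window
                                int N_tau,      // events needed by the mapping unit
                                double t_Ntau,  // timestamp of the N_tau-th latest event
                                double t_w,     // length of the time window
                                double phi)     // default decay rate
{
    // Fast/normal motion: enough recent events, behave like the traditional TS.
    // Slow/no motion: enlarge the decay rate so >= N_tau events stay visible.
    const double decay = (n >= N_tau) ? phi : adaptiveDecay(t, t_Ntau, t_w, phi);
    return makeTimeSurface(t_last, t, decay);
}
```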

TSDF-Based Mapping Unit
Due to the sparsity of event data and the possibly large amount of noise in event streams, it is difficult to obtain dense and accurate depth estimation results from event data alone. To this end, for the first time, we introduce the TSDF into the event-based depth estimation process and present a TSDF-based mapping unit for T-ESVO. Utilizing the characteristics of the TSDF, namely its global historical depth information and fast update speed, the proposed mapping unit can, on the one hand, update and optimize the global environmental map based on the TSDF in real time, and on the other hand, make full use of the global historical depth information contained in the TSDF to reduce the depth estimation errors caused by the noise in local event data.
The mapping unit of our T-ESVO is shown in Figure 4 and is divided into two parts: local depth estimation (Figure 4a) and mapping and depth re-estimation via TSDF (Figure 4b). The first part is responsible for estimating the latest events' depths and fusing an inverse depth frame, which contains local historical depth information, based on the depth estimation method proposed in ESVO. [3] The second part is responsible for creating and continually updating the global TSDF description of the environment and re-estimating the depth based on the TSDF, which contains the global historical depth information. In this section, we give a detailed introduction to these two parts.

Local Depth Estimation
When stereo A-TS images are fed into the mapping unit, T-ESVO follows the depth estimation method in ESVO [3] to estimate events' depths and fuse an inverse depth frame with the A-TS images' timestamp. At time $t$, the depth of event $e_{t-\varepsilon} = (\mathbf{x}_{t-\varepsilon}, t-\varepsilon, p)$ (with $\varepsilon \in [0, \delta t]$) can be estimated based on the event co-occurrence principle and the epipolar constraint. The depth estimation objective function is defined as

$$\rho^\star = \arg\min_{\rho} \sum_i r_i^2(\rho) \qquad (4)$$

where $\rho$ and $\rho^\star$ denote the (inverse) depth of the event, and the depth residual $r_i$ is defined as

$$r_i(\rho) = \mathcal{I}_{\mathrm{left}}(\mathbf{x}_{1,i}, t) - \mathcal{I}_{\mathrm{right}}(\mathbf{x}_{2,i}, t) \qquad (5)$$

where $\mathcal{I}_{\mathrm{left}}(\cdot, t)$ and $\mathcal{I}_{\mathrm{right}}(\cdot, t)$ are the left and right A-TS images, respectively, $\mathbf{x}_{1,i}$ and $\mathbf{x}_{2,i}$ are pixels in the event's corresponding patches $W_1$ and $W_2$ of the left and right A-TS images, respectively, and the correspondence depends on $\rho$ and the camera transformation matrix $T_{t-\delta t:t}$ between $t - \delta t$ and $t$ obtained from the tracking unit. Based on Equation (4), we can estimate the depths of the latest $N$ events at time $t$. We set $N$ to 1,500 (upenn dataset [23]) and 1,000 (rpg dataset [21] and esim dataset [24]), following the corresponding settings in the study of Zhou et al. [3] for different sensor resolutions, to obtain a trade-off between accuracy and time cost. These asynchronous event depth points are then fused into an inverse depth frame based on the probabilistic model of estimated inverse depth proposed in the study of Zhou et al. [3] In T-ESVO, we usually fuse the inverse depth frame at time $t$ based on $3N$ to $5N$ event depth points, where $N$ event depth points are estimated at time $t$ and the remaining ones are estimated at the most recent several moments before time $t$. Therefore, through local depth estimation, T-ESVO can obtain an inverse depth frame and a local map containing local historical depth information, as shown in Figure 4a.
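As a sketch of Equations (4) and (5), the residual sum below compares patch intensities between the left and right A-TS images for one inverse depth hypothesis. Computing the corresponding pixels from $\rho$ and $T_{t-\delta t:t}$ (the epipolar warping) is abstracted away, and all names are illustrative.

```cpp
#include <vector>
#include <opencv2/core.hpp>

// Intensity lookup with nearest-pixel rounding; a real implementation would
// interpolate bilinearly.
static double intensityAt(const cv::Mat& I, const cv::Point2d& p)
{
    return static_cast<double>(
        I.at<uchar>(static_cast<int>(p.y + 0.5), static_cast<int>(p.x + 0.5)));
}

// Sum of squared residuals of Equations (4)-(5) for one inverse depth
// hypothesis. W1/W2 hold the already-warped patch pixel coordinates for this
// hypothesis; the minimizing rho* is found by searching over hypotheses.
double depthResidualSum(const cv::Mat& I_left, const cv::Mat& I_right,
                        const std::vector<cv::Point2d>& W1,
                        const std::vector<cv::Point2d>& W2)
{
    double sum = 0.0;
    for (size_t i = 0; i < W1.size(); ++i) {
        const double r = intensityAt(I_left, W1[i]) - intensityAt(I_right, W2[i]); // r_i
        sum += r * r;
    }
    return sum;
}
```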

Mapping and Depth Re-Estimation via TSDF
Although the local depth estimation presented in Section 5.1 can fuse an inverse depth frame containing local historical depth information, due to the high temporal resolution of events and the real-time requirement of the VO system, this inverse depth frame can only contain historical depth information from a short period of time. As a result, the local depth estimation method can hardly avoid errors caused by noise in the event streams. To tackle this problem, we introduce the TSDF [41] to provide global historical depth information for event-based mapping and depth estimation. Based on every fused inverse depth frame obtained as in Section 5.1, T-ESVO creates or continuously updates a TSDF description of the environment containing global historical depth information in real time, and then obtains the global environmental map and re-estimates the depth information in the inverse depth frame based on the TSDF, as shown in Figure 4b.
The TSDF [41] constructs a global environmental surface representation from consecutive (inverse) depth frames and the associated camera poses. The value of a voxel corresponds to the signed distance to the closest surface interface, taking positive (negative) and increasing (decreasing) values when moving from the visible (nonvisible) side of the surface into free space, up to the truncation distance. [42] In our T-ESVO, we construct and store the TSDF based on the voxel hashing method [43] to enable the system to run in real time on CPU only.
To integrate a new inverse depth frame into the TSDF, we ray-cast from the camera origin to the 3D point corresponding to every depth point of the inverse depth frame and update the TSDF distance and weight values of the voxels along this ray. When a voxel (with center position $C \in \mathbb{R}^3$) is passed by a ray between a 3D point (position $P \in \mathbb{R}^3$) and the camera origin (position $O \in \mathbb{R}^3$), we calculate the new distance update value $d$ and the new weight update value $w$, where $d$ is the signed distance from the voxel center to the 3D point. Then, based on $d$ and $w$, the TSDF distance value $D$ and the TSDF weight value $W$ of this voxel are updated as follows

$$D \leftarrow \frac{W \cdot D + w \cdot d}{W + w}, \qquad W \leftarrow W + w, \qquad w = w_d \cdot w_u \qquad (6)$$

where $w_d$ and $w_u$ are the depth and uncertainty weights, respectively.
We follow [44] to define the depth weight $w_d$ with a behind-surface drop-off as

$$w_d(d) = \begin{cases} 1, & d \ge -\varepsilon \\[1mm] \dfrac{d + d_t}{d_t - \varepsilon}, & -d_t \le d < -\varepsilon \\[1mm] 0, & d < -d_t \end{cases} \qquad (7)$$

where $d$ is the signed distance of Equation (6), determined by the depth $\rho$ of the voxel along the ray, $d_t$ is the truncation distance of the TSDF, and $\varepsilon$ is the voxel size. We set $d_t = 4\varepsilon$ in our case.
For the uncertainty weight $w_u$, we regard the standard deviation $\sigma_p$ of the depth point (from the probabilistic model of estimated inverse depth used in Section 5.1) as its depth estimation uncertainty and use an S-curve to reduce the impact of depth points with high depth estimation uncertainty on the TSDF update. We define $w_u$ as

$$w_u = \frac{1}{1 + e^{\,b\,\sigma_p/\sigma_\tau - a}} \qquad (8)$$

where $\sigma_\tau$ is the uncertainty threshold and $a$ and $b$ are the parameters of the S-curve. $\sigma_\tau$ is set to 0.15 for the upenn dataset and 0.015 for the rpg and esim datasets to account for the different scene sizes. $a$ and $b$ together determine the coordinates of the central symmetry point of the S-curve in Equation (8), namely $(\frac{a}{b}, \frac{1}{2})$, and their magnitudes also determine the "slope" of the S-curve. To reduce the impact of depth points with high uncertainty and increase the impact of depth points with low uncertainty on the TSDF update, we set $a = 20$ and $b = 40$ in our mapping unit.
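Putting Equations (6)-(8) together, a per-voxel update could look like the sketch below. The Voxel struct and the exact weight formulas follow our reconstruction above and are assumptions, not the authors' implementation.

```cpp
#include <cmath>

struct Voxel { float D = 0.f; float W = 0.f; };  // TSDF distance and weight

// Behind-surface drop-off of Equation (7): full weight in front of the
// surface, linear decay behind it, zero beyond the truncation distance d_t.
float depthWeight(float d, float d_t, float eps)
{
    if (d >= -eps) return 1.f;
    if (d >= -d_t) return (d + d_t) / (d_t - eps);
    return 0.f;
}

// S-curve of Equation (8): weight near 1 for low-uncertainty points, near 0
// above the threshold; midpoint at sigma_p/sigma_tau = a/b.
float uncertaintyWeight(float sigma_p, float sigma_tau, float a = 20.f, float b = 40.f)
{
    return 1.f / (1.f + std::exp(b * sigma_p / sigma_tau - a));
}

// Running weighted average of Equation (6) for one voxel passed by the ray.
void updateVoxel(Voxel& v, float d, float sigma_p,
                 float d_t, float eps, float sigma_tau)
{
    const float w = depthWeight(d, d_t, eps) * uncertaintyWeight(sigma_p, sigma_tau);
    if (w <= 0.f) return;
    v.D = (v.W * v.D + w * d) / (v.W + w);
    v.W += w;
}
```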
After the TSDF creation and update, T-ESVO obtains the updated global environmental map via the surface in the updated TSDF, as shown in Figure 4b. Then, T-ESVO re-estimates the depth information in the inverse depth frame obtained in Section 5.1 based on the updated TSDF, following Algorithm 2. For a depth point $\mathbf{x}(u, v) \in \mathbb{R}^2$ in the inverse depth frame, we first ray-cast from the camera origin $O(x_o, y_o, z_o) \in \mathbb{R}^3$ to the corresponding 3D point $X(x, y, z) \in \mathbb{R}^3$ of $\mathbf{x}$ and build the voxel set $V = \{v_1, v_2, \ldots, v_n\}$ containing all voxels along the ray. Then we traverse $V$ to filter out the voxel $v_i \in V$ closest to the surface according to its TSDF distance value $D_{v_i}$ and weight value $W_{v_i}$. Finally, we calculate the projection point $X'(x', y', z') \in \mathbb{R}^3$ of this voxel's center onto the ray, as shown in Figure 5.

If the distance between $X'$ and $X$ is greater than the re-estimation distance threshold $d_\tau$, we replace $X$ with $X'$ and re-estimate the depth of $\mathbf{x}$ from $X'$. In Algorithm 2, we set $D_\tau$ and $W_\tau$ as thresholds for judging whether a voxel is close to the surface.

To speed up the search for the voxel closest to the surface and to reduce the errors caused by low-weight voxels in the depth re-estimation results, we set $D_\tau$ and $W_\tau$ to $\varepsilon$ and 0.05, respectively, where $\varepsilon$ is the voxel size. In addition, the size of $d_\tau$ is related to the voxel scale and is set to $0.8\varepsilon$ in T-ESVO to obtain a trade-off between depth re-estimation accuracy and efficiency.
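A sketch of Algorithm 2 as we read it from the description above: traverse the voxels along the ray, pick the most surface-like one ($|D|$ below $D_\tau$, $W$ above $W_\tau$), and re-estimate the depth from the projection of its center onto the ray. It reuses the Voxel struct from the previous sketch; everything else is illustrative.

```cpp
#include <cmath>
#include <optional>
#include <vector>
#include <Eigen/Dense>

// Returns the re-estimated 3D point X' if a reliable surface voxel is found
// and X' deviates from X by more than d_tau; otherwise the original X is kept.
std::optional<Eigen::Vector3d> reestimatePoint(
    const Eigen::Vector3d& O,                       // camera origin
    const Eigen::Vector3d& X,                       // 3D point of depth pixel x
    const std::vector<Voxel>& V,                    // voxels along the ray O -> X
    const std::vector<Eigen::Vector3d>& centers,    // their center positions
    double D_tau, double W_tau, double d_tau)
{
    const Eigen::Vector3d dir = (X - O).normalized();
    int best = -1;
    double bestAbsD = D_tau;                        // accept only |D| < D_tau
    for (size_t i = 0; i < V.size(); ++i)
        if (V[i].W > W_tau && std::abs(V[i].D) < bestAbsD) {
            bestAbsD = std::abs(V[i].D);
            best = static_cast<int>(i);
        }
    if (best < 0) return std::nullopt;              // no reliable surface voxel
    // X' is the projection of the chosen voxel center onto the ray (Figure 5).
    const Eigen::Vector3d Xp = O + dir * dir.dot(centers[best] - O);
    if ((Xp - X).norm() <= d_tau) return std::nullopt;  // change too small, keep X
    return Xp;
}
```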

The Tracking Unit
Tracking is the core part of a VO system. ESVO [3] proposed a tracking method based on 3D-2D registration for the event-based VO system, which is inspired by the edge-alignment method for RGB-D cameras using distance fields [40] and takes full advantage of TS. However, this tracking approach has two problems: 1) it does not take into account the negative impact of event camera hardware limitations (e.g., the flickering effect) on tracking; 2) due to the real-time requirements of the VO system, it randomly selects depth points from the inverse depth frame for 3D-2D registration, resulting in randomness and uncertainty in the tracking results.
To tackle the aforementioned problems, on the basis of 3D-2D registration, we propose a tracking unit for T-ESVO based on the left/right TS selection mechanism and the depth point selection mechanism. The tracking unit of T-ESVO is shown in Figure 6 and consists of three parts: left/right TS selection, depth point selection, and 3D-2D registration. The first part is responsible for using image similarity to select the current A-TS that is less affected by the flickering effect, so as to reduce the negative impact of the flickering effect on tracking. The second part is responsible for selecting a depth point subset that better matches the spatial distribution of depth points in the reference inverse depth frame, so as to reduce the randomness and uncertainty of the tracking results. The third part is responsible for estimating the camera motion between the reference camera frame and the current camera frame based on the selected reference depth point subset and the selected current A-TS. In the following, we introduce these three parts in detail.

Left/Right TS Selection
Due to the hardware limitations of event cameras, some event streams (e.g., some sequences in the public rpg dataset [21]) exhibit the flickering effect. The flickering effect causes a large number of noise events in the event stream within a short period of time, so that the A-TS image generated at that time contains many noise pixels with high intensity values, which affects the accuracy and robustness of camera pose estimation. Figure 7 shows the flickering effect on the rpg_bin_edited sequence at two timestamps (t = 8.6 s and t = 12.3 s).
To solve this problem, we introduce a left/right TS selection mechanism into the tracking unit of T-ESVO. As shown in Figure 6 (left/right TS selection part), the tracking unit compares the current left/right A-TS images with the one used in the latest 3D-2D registration separately and then, based on the similarity scores, chooses the more similar one (left or right) as the current A-TS for 3D-2D registration. Considering the characteristics of the flickering effect and the real-time requirement, we choose the luminance and contrast similarity terms from the structural similarity [45] for the comparison:

$$S(P, C) = \frac{2\mu_P\mu_C + C_1}{\mu_P^2 + \mu_C^2 + C_1} \cdot \frac{2\sigma_P\sigma_C + C_2}{\sigma_P^2 + \sigma_C^2 + C_2} \qquad (9)$$

where $P$ and $C$ denote the previous and current A-TS images, respectively (we compute the similarity of the previous A-TS image with the current left and right A-TS images separately), $\mu$ is the mean intensity of an image, $\sigma$ is the intensity standard deviation of an image, and $C_1$ and $C_2$ are set to $(0.01 \times 255)^2$ and $(0.03 \times 255)^2$, respectively, referring to the study of Wang et al. [45] Based on Equation (9), T-ESVO can select the current A-TS that is less affected by the flickering effect and apply it to the 3D-2D registration to reduce the negative impact of the flickering effect on tracking. The experimental results in Section 7.4 demonstrate the effectiveness of the proposed left/right TS selection mechanism.
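For illustration, Equation (9) can be computed globally over two A-TS images as in the sketch below (illustrative names; cv::meanStdDev supplies $\mu$ and $\sigma$). The tracking unit would evaluate it twice, once against the current left and once against the current right A-TS image, and keep the higher-scoring one.

```cpp
#include <opencv2/core.hpp>

// Luminance-contrast similarity of Equation (9) over two 8-bit A-TS images.
double luminanceContrastSimilarity(const cv::Mat& P, const cv::Mat& C)
{
    cv::Scalar muP, sigP, muC, sigC;
    cv::meanStdDev(P, muP, sigP);
    cv::meanStdDev(C, muC, sigC);
    const double C1 = (0.01 * 255.0) * (0.01 * 255.0);
    const double C2 = (0.03 * 255.0) * (0.03 * 255.0);
    const double lum = (2.0 * muP[0] * muC[0] + C1) /
                       (muP[0] * muP[0] + muC[0] * muC[0] + C1);
    const double con = (2.0 * sigP[0] * sigC[0] + C2) /
                       (sigP[0] * sigP[0] + sigC[0] * sigC[0] + C2);
    return lum * con;  // in [0,1]; higher means more similar
}
```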

Depth Point Selection
The inverse depth frame generated by the mapping unit (Section 5) often contains a large number of depth points (depending on the event camera resolution used in each dataset, each inverse depth frame contains from 3,000 to 10,000 depth points). Using all the depth points in the inverse depth frame during tracking would bring a high computational overhead. To reduce the time cost, ESVO [3] randomly selects a subset of depth points from the reference inverse depth frame for 3D-2D registration. However, this may introduce randomness and uncertainty into the tracking accuracy.
In T-ESVO, we propose a depth point selection mechanism to tackle this problem. On the one hand, this mechanism focuses on selecting depth points from the "edges" of the A-TS image, since events usually occur at the edges of objects or at places with obvious texture, to reduce the negative impact of noise events on tracking. On the other hand, it selects a subset of depth points for 3D-2D registration according to the spatial distribution of depth points in the reference inverse depth frame. As illustrated in Figure 6 (depth point selection part) and Algorithm 3, the depth point selection is achieved by the four steps below (a code sketch of the final sampling step follows this list).

Step 1: Project all depth points' corresponding 3D points of the reference inverse depth frame $M$, which can be regarded as the set of depth points, onto the reference A-TS image $\mathcal{I}(\cdot, t_{\mathrm{ref}})$ (selected to participate in 3D-2D registration at the reference timestamp by the left/right TS selection mechanism) to enhance the intensity information in $\mathcal{I}(\cdot, t_{\mathrm{ref}})$.

Step 2: Use EDLines [46] to quickly extract line segments in the enhanced A-TS image $\mathcal{I}(\cdot, t_{\mathrm{ref}})$ to obtain the edge map $\mathcal{E}$.

Step 3: Use the contour retrieval algorithm [47] to extract the connected regions in the edge map $\mathcal{E}$ as the set $C$. Each connected region $C_i \in C$ represents a part of the image plane and contains several edges from $\mathcal{E}$, as shown in Figure 6 (depth point selection part, step 3). According to the connected regions and the corresponding 3D points' projected coordinates in the A-TS image, cluster the depth points into depth point subsets $M_i \subset M$ corresponding to $C_i \in C$.

Step 4: According to the proportion of the number of depth points $\mathrm{number}(M_i)$ in each connected region $C_i$ to the total number of depth points $\mathrm{number}(M)$, and the number of depth points $n_{\mathrm{depth}}$ required for 3D-2D registration, determine the number of points $n_i$ to select from each connected region. Then, select this number of depth points from each connected region by stochastic sampling to form the depth point subset $M_{\mathrm{depth}}$ for 3D-2D registration. In T-ESVO, we set $n_{\mathrm{depth}}$ to 3,000 (upenn dataset [23]) and 2,000 (rpg dataset [21] and esim dataset [24]), following the corresponding settings in the study of Zhou et al. [3] for different sensor resolutions.
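The following sketch illustrates Step 4's proportional stochastic sampling, assuming Steps 1-3 have already clustered the depth points into per-region subsets $M_i$; the DepthPoint type and the function names are illustrative assumptions.

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct DepthPoint { double u, v, inv_depth; };  // illustrative depth point type

// Sample roughly n_depth points overall, each region contributing in
// proportion to its share of the total number of depth points (Step 4).
std::vector<DepthPoint> selectDepthPoints(
    const std::vector<std::vector<DepthPoint>>& subsets,  // M_i per region C_i
    std::size_t n_depth,                                  // e.g., 3,000 or 2,000
    std::mt19937& rng)
{
    std::size_t total = 0;
    for (const auto& Mi : subsets) total += Mi.size();

    std::vector<DepthPoint> selected;
    for (auto Mi : subsets) {                                 // copy: shuffled in place
        if (Mi.empty() || total == 0) continue;
        const std::size_t n_i = n_depth * Mi.size() / total;  // proportional quota n_i
        std::shuffle(Mi.begin(), Mi.end(), rng);              // stochastic sampling
        selected.insert(selected.end(), Mi.begin(),
                        Mi.begin() + std::min(n_i, Mi.size()));
    }
    return selected;
}
```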

3D-2D Registration
After the aforementioned two procedures, T-ESVO has the proper current A-TS and the selected reference depth point subset. T-ESVO then carries out the 3D-2D registration proposed in ESVO [3] to estimate the camera motion from the reference camera frame to the current camera frame, as shown in Figure 6 (3D-2D registration part). Note that the timestamp of the current A-TS (current camera frame) is slightly later than that of the reference inverse depth frame (reference camera frame). In fact, what we obtain in this subsection is the camera motion between the reference left camera frame and the current left/right camera frame, because the inverse depth frame is constructed in the left camera coordinate system, while the current A-TS used in the 3D-2D registration is determined by the left/right TS selection mechanism (proposed in Section 6.1). First, T-ESVO creates the negative A-TS image $\overline{\mathcal{I}}(\cdot, t)$ of the corresponding A-TS $\mathcal{T}(\cdot, t)$ by

$$\overline{\mathcal{I}}(\cdot, t) = 1 - \mathcal{T}(\cdot, t) \qquad (10)$$

where, again, we rescale the pixel values from $[0, 1]$ to the range $[0, 255]$.
The negative A-TS image can be interpreted as a kind of anisotropic distance field, as commonly used in edge-based VO systems. [40] Then, T-ESVO finds a transformation $T(\boldsymbol{\theta})$ that aligns the dark regions of the current negative A-TS image with the template image $\mathcal{M}$ projected from the inverse depth frame $M_{\mathrm{depth}}$ (composed of the depth point subset selected in Section 6.2) by $T(\boldsymbol{\theta})$, as shown in Figure 6 (3D-2D registration part). These alignments are based on the following warping function

$$W(\mathbf{x}, \rho; \boldsymbol{\theta}) = \pi_{\mathrm{cur}}\left(T(\boldsymbol{\theta}) \cdot \pi_{\mathrm{ref,left}}^{-1}(\mathbf{x}, \rho)\right) \qquad (11)$$

where $\boldsymbol{\theta}$ are the motion parameters combining the Cayley parameters [48] for the orientation with the translation parameters, $T(\boldsymbol{\theta})$ converts $\boldsymbol{\theta}$ into a transformation matrix from the reference camera frame to the current camera frame, $\mathbf{x}$ is the 2D coordinate of a valid pixel on $\mathcal{M}$ and $\rho$ is the corresponding depth of this pixel in $M_{\mathrm{depth}}$, $\pi_{\mathrm{cur}}$ is the projection function of the current left/right camera frame, and $\pi_{\mathrm{ref,left}}^{-1}$ is the back-projection function of the reference left camera frame.
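For illustration, with distortion-free pinhole intrinsics for both cameras and $\rho$ taken as inverse depth, the warping function of Equation (11) can be sketched as below; the matrix-based projection and all names are our assumptions.

```cpp
#include <Eigen/Dense>

// Warping function of Equation (11): back-project pixel x with inverse depth
// rho in the reference left frame, transform by T(theta), and project into
// the current (left or right) frame.
Eigen::Vector2d warp(const Eigen::Vector2d& x, double rho,
                     const Eigen::Matrix4d& T_theta,   // reference -> current
                     const Eigen::Matrix3d& K_ref,     // reference left intrinsics
                     const Eigen::Matrix3d& K_cur)     // current camera intrinsics
{
    // pi_ref_left^{-1}(x, rho): a ray through x, scaled to depth 1/rho.
    const Eigen::Vector3d ray = K_ref.inverse() * Eigen::Vector3d(x.x(), x.y(), 1.0);
    Eigen::Vector4d P_ref = Eigen::Vector4d::Ones();
    P_ref.head<3>() = ray / rho;
    // T(theta), then pi_cur: project into the current image plane.
    const Eigen::Vector3d P_cur = (T_theta * P_ref).head<3>();
    const Eigen::Vector3d p = K_cur * P_cur;
    return p.head<2>() / p.z();
}
```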

Experimental Evaluation
We implement the T-ESVO system in C++, referring to the implementation of the ESVO system, [3] on top of the robot operating system (ROS). [50] The T-ESVO system contains three independent threads, namely the TIME_SURFACE thread, the MAPPING thread, and the TRACKING thread. They run concurrently to achieve event processing, depth estimation, and camera motion estimation: the TIME_SURFACE thread implements the event processing unit (Figure 2), the MAPPING thread implements the mapping unit (Figure 2), and the TRACKING thread implements the tracking unit (Figure 2). Each thread runs at a different rate to ensure the reliable operation and real-time performance of the whole system. Regardless of the motion state of the camera, the TIME_SURFACE thread executes the A-TS method (Section 4) to generate A-TS images at a fixed frequency.
The MAPPING thread and the TRACKING thread operate in an interleaved fashion, estimating the depth and the camera motion, respectively, based on the A-TS images and the latest events, as shown in Figure 2.
To evaluate the proposed T-ESVO system, we perform quantitative and qualitative experimental evaluations on the public rpg dataset [21] and upenn dataset. [23] Note that the event data in the rpg and upenn datasets were recorded at a low streaming rate (30 Hz), which cannot meet the high-frequency output of the A-TS (≈77 Hz). Therefore, we modified the event recording streaming rate of the rpg and upenn datasets to 1,000 Hz; these edited sequences are named with the suffix "edited" in the following evaluations. We also evaluate T-ESVO on the esim dataset, which was generated by Jiao et al. [24] using the event camera simulator ESIM. [51] First, we show the advantage of the proposed A-TS. Second, we compare T-ESVO with state-of-the-art methods in terms of tracking performance. Third, we evaluate the depth estimation and mapping performance of T-ESVO. Fourth, we validate the advantage of our proposed tracking method. Fifth, we test T-ESVO under conditions that are difficult for standard cameras. Finally, we analyze the real-time performance of our system. Considering the randomness of the algorithm results, [24] the quantitative results (motion estimation, depth estimation, and time cost) are all averaged over 10 trials. Additionally, considering the possible scale error in the adopted comparison methods, we align the estimated trajectories with the ground truth using Sim(3) Umeyama alignment [52] in all evaluations. All experiments run on a computer equipped with an Intel Core i9-10900K CPU @ 3.70 GHz and the Ubuntu 20.04 LTS operating system.

Performance of the A-TS
To prove the effectiveness of our proposed A-TS method, we first compare T-ESVO with the traditional TS method against T-ESVO with the A-TS method (namely T-ESVO) on the rpg and upenn datasets, which were collected in the real world. Table 1 shows the mean absolute trajectory error (ATE) of the absolute translational error on six sequences for T-ESVO with the traditional TS method and with the A-TS method. As shown in Table 1, T-ESVO with the A-TS method demonstrates better camera motion estimation accuracy, especially on the upenn dataset, since it contains more diverse motion states, such as slow camera motion on the upenn_indoor_flying1_edited sequence (8-10 s) and a stationary phase on the upenn_indoor_flying3_edited sequence (t = 9.3 s).
Second, we choose the upenn_indoor_flying3_edited sequence for a more intuitive evaluation. This sequence was collected by a stereo DAVIS346 event camera mounted on a flying drone. In this sequence, the drone flies at different speeds, and there is little event data when it hovers. As shown in Figure 8, the drone hovers for a while when flying near a bucket (t = 9.3 s). With the traditional TS method, there is little useful information in the TS image and the negative TS image for depth estimation and 3D-2D registration, respectively. With our A-TS method, the generated A-TS image and negative A-TS image still retain enough intensity information for depth estimation and 3D-2D registration. We also show the trajectories estimated by T-ESVO with these two TS methods in Figure 8. Note that, except for the event processing unit, the tracking unit and the mapping unit are the same for the two compared methods.
The results in this subsection demonstrate that our A-TS method can assist T-ESVO in coping with different camera motions and achieving better performance.

Quantitative Evaluation of the Tracking Performance
To show the performance of T-ESVO, we compare T-ESVO with ESVO and with variants of ESVO using different event processing methods (referring to [24]) on the esim, rpg, and upenn datasets. Table 2 shows the mean ATE of the absolute translational error on 12 sequences for T-ESVO, ESVO, and the variants of ESVO. The subscript of ESVO indicates the event processing method: TS represents time-surface (ESVO_TS equals the original ESVO); EM2000, EM3000, EM4000, and EM5000 represent event maps with 2,000, 3,000, 4,000, and 5,000 events per map; and EMTS represents the combination of TS and event map (EM) proposed in the study of Jiao et al., [24] which switches the event representation between TS and EM according to a degeneracy factor. Except for the event processing methods, the mapping and tracking units of these ESVO variants are the same as in the original ESVO. We refer to the study of Jiao et al. [24] for the results of ESVO and its variants. As shown in Table 2, T-ESVO demonstrates better camera motion estimation accuracy. We also plot the translational and rotational errors over time for T-ESVO and ESVO on the sequences listed in Table 2. As shown in Figure 9, T-ESVO achieves lower translational and rotational errors than ESVO, demonstrating better motion estimation accuracy. However, for both ESVO and T-ESVO, the noise in the event stream, the low resolution of the camera, and even specific camera motion patterns [3] cause errors in the motion estimation results. Furthermore, we find that motion estimation based on 3D-2D registration does not cope well with frequent rotation (e.g., the rotational errors on the esim and rpg datasets in Figure 9). For a more detailed 6-DoF pose estimation comparison between T-ESVO and ESVO on the esim, rpg, and upenn datasets, please refer to Appendix A.

Figure 8. The performance demonstration of our proposed A-TS method on the upenn_indoor_flying3_edited sequence. The first column is the intensity frame of the scene when the drone is hovering (t = 9.3 s). The second column is the TS/A-TS images produced by the traditional TS method and our A-TS method, respectively. The third column is the 3D-2D registration results (warped color-coded inverse depth frame overlaid on the negative TS/A-TS image). The fourth column is the corresponding estimated trajectories.
In addition, to evaluate the robustness of the proposed system, we also extracted two sequences from the upenn dataset based on the upenn_indoor_flying1 and upenn_indoor_flying2 sequences. One is called upenn_indoor_flying1_edited_long (a long sequence), and the other is called upenn_indoor_flying2_edited (fewer events generated). As shown in Table 3, our T-ESVO succeeds on upenn_indoor_flying1_edited_long, while ESVO fails. When facing environments with fewer events, our T-ESVO also demonstrates better performance.
Finally, we compare T-ESVO with a frame-based SLAM/VO method to demonstrate the superiority of event-based methods. Referring to the study of Zhou et al., [3] we choose a state-of-the-art frame-based stereo SLAM pipeline (ORB-SLAM 2 [53]) and ESVO [3] for comparison. Table 4 shows the root mean square (RMS) ATE of the absolute translational error on six sequences for ORB-SLAM 2, ESVO, and T-ESVO. We refer to the study of Zhou et al. [3] for the results of ORB-SLAM 2 and ESVO. Note that, for a fair comparison, the global bundle adjustment (BA) of ORB-SLAM 2 is disabled; nevertheless, the results with global BA enabled are also reported in brackets in Table 4. [3] As shown in Table 4, T-ESVO is slightly less accurate than ORB-SLAM 2 and comparable to ESVO on the rpg dataset, but shows significantly better performance than ORB-SLAM 2 and ESVO on the upenn dataset. As mentioned earlier, this is because motion estimation based on 3D-2D registration does not cope well with frequent rotation, such as in the sequences of the rpg dataset. In summary, the results in Table 4 demonstrate the superiority of the event-based method over the frame-based method.

Performance of the TSDF-Based Mapping Unit
In this subsection, we present the quantitative and qualitative experimental evaluations of our proposed T-ESVO in terms of depth estimation and mapping.
First, we compare the depth estimation results of T-ESVO and T-ESVO without TSDF (using the mapping unit from ESVO [3]) in Table 5 on the upenn sequences, which provide ground-truth depth information. Note that, except for the mapping unit, the tracking unit and the event processing unit are the same for the two compared methods. We use the evaluation method from [54] to quantify the depth estimation performance of T-ESVO and T-ESVO without TSDF. In detail, we adopt the upenn_indoor_flying1_edited, upenn_indoor_flying2_edited, and upenn_indoor_flying3_edited sequences for the evaluation. As shown in Table 5, T-ESVO outperforms T-ESVO without TSDF in all evaluation criteria, including the absolute relative difference, squared relative difference, RMSE (linear), RMSE (log), and the percentage of depth points under different maximum relative error thresholds (a1 = 1.25, a2 = 1.25², a3 = 1.25³).
We also plot the depth estimation results in Figure 10, which shows the estimated inverse depth frames and the surface inverse depth frames based on the TSDF. As shown in Figure 10, the TSDF helps to obtain denser inverse depth frames with global historical depth information. Besides, we show the qualitative comparison of mapping results between T-ESVO and T-ESVO without TSDF in Figure 11, where we mark the differences in the mapping results and the estimated trajectories between T-ESVO and T-ESVO without TSDF: the red boxes indicate the mapping differences, and the blue circles indicate the trajectory differences. From Figure 11, we can see that T-ESVO constructs a richer semidense map with a smoother trajectory.

Table 2. Accuracy performance with different event-based stereo VO systems. The public rpg, upenn, and esim datasets are adopted for comparison. The mean ATE (cm) of the absolute translational error is used as the metric. The best result is highlighted in bold.

Performance of the Proposed Tracking Unit
To validate the effectiveness of our proposed tracking unit for T-ESVO, we present the quantitative and qualitative experimental evaluations in this subsection.
First, we give the translational and rotational results of T-ESVO with ESVO's tracking unit and T-ESVO with the proposed tracking unit (introduced in Section 6) on the rpg and upenn datasets. Note that, except for the tracking unit, the mapping unit and the event processing unit are the same for the two compared methods.
The boxplots of the absolute translational and rotational error statistics are shown in Figures 12 and 13, respectively. As shown in Figure 12, for the absolute translational error, T-ESVO with the proposed tracking unit performs better than T-ESVO with ESVO's tracking unit in both the error median and the error distribution on most sequences. As to the rotational error, as shown in Figure 13, T-ESVO with the proposed tracking unit also performs better than T-ESVO with ESVO's tracking unit on all sequences. The results in Figures 12 and 13 demonstrate that the proposed tracking unit can effectively improve the accuracy and robustness of camera motion estimation.
Second, we choose the rpg_bin_edited sequence to validate our proposed left/right TS selection mechanism (proposed in Section 6.1). Compared to the other datasets, the event streams in the rpg dataset are more seriously affected by the flickering effect, and the rpg_bin_edited sequence is one of the most affected sequences. Figure 14 shows the corresponding results. The top half of Figure 14 displays the difference of similarity scores (ΔS) between the current left A-TS image and the current right A-TS image. When ΔS ≥ 0, our tracking unit chooses the current left A-TS for 3D-2D registration; when ΔS < 0, it chooses the current right A-TS. The bottom half of Figure 14 lists the left/right A-TS images (one of them affected by the flickering effect) and the A-TS selection results (in red boxes) at three timestamps (t = 5.8 s, t = 11.8 s, t = 14.6 s). As shown in Figure 14, the proposed TS selection mechanism correctly selects the A-TS that is less affected by the flickering effect, thereby reducing the negative impact of the flickering effect on camera motion estimation.

Evaluations under Challenging Illumination Conditions
In this subsection, we test our proposed T-ESVO under conditions that are challenging for standard cameras. We collect two sequences in a dark office with a stereo event-camera rig, which consists of two DAVIS346 event cameras with a 346 × 260 pixel resolution. One sequence is collected with a lamp off to simulate a dark condition; the other is collected with the lamp on to simulate a high dynamic range (HDR) condition. We then run T-ESVO on these two sequences, and the results are shown in Figure 15. Under these two challenging illumination conditions, the standard camera sensor of the DAVIS346 (with a 55 dB dynamic range) cannot see anything without the lighting of the lamp, which would lead to the failure of VO systems based on standard cameras. By contrast, our proposed T-ESVO works robustly under these challenging illumination conditions thanks to the HDR property of event cameras (the DAVIS346 event camera has a dynamic range of 120 dB).

Real-Time Performance Analysis
In this section, we analyze the real-time performance of our proposed T-ESVO system. First, we discuss the time cost brought by the addition of the TSDF to the system. We choose the voxel hashing method [43] for TSDF construction and storage, and our method can run in real time on CPU. Table 6 shows the real-time performance of the TSDF function (including TSDF creation and update and depth re-estimation via TSDF, as shown in Figure 2) and of the whole mapping unit with different voxel sizes. The upenn_indoor_flying1_edited sequence is adopted for this test. As shown in Table 6, to some extent: 1) the addition of the TSDF improves the accuracy of motion estimation; 2) decreasing the voxel size improves the motion estimation accuracy but at the same time results in a higher time cost; 3) for the semidense depth estimation in T-ESVO, when the voxel size is very small (e.g., a voxel size of 0.01 m in Table 6), the accuracy of motion estimation declines due to the lack of depth information (many voxels are marked as outliers). In our case, setting the voxel size to 0.05 m (upenn dataset) and 0.01 m (rpg and esim datasets, with smaller scenes and fewer events) gives a good trade-off, which effectively improves the accuracy of depth and motion estimation without adding too much time overhead.
Second, we show the real-time performance of T-ESVO and compare it with ESVO. Table 7 lists the detailed execution times of the different units of T-ESVO and ESVO on the upenn_indoor_flying1_edited sequence. As shown in Table 7, for T-ESVO, the event processing unit takes about 12 ms to create an A-TS image; the mapping unit takes about 52 ms (≈19 Hz) to estimate 1,500 events' depths and fuse 6,000 depth points into the inverse depth frame; and the tracking unit takes about 13 ms (≈77 Hz; left/right TS selection takes about 1.25 ms and depth point selection about 2.04 ms) to perform the 3D-2D registration based on 3,000 depth points. For ESVO, the time overheads of the event processing unit, the mapping unit, and the tracking unit (with the same parameter settings as T-ESVO) are about 10 ms, 47 ms (≈21 Hz), and 10 ms (≈100 Hz), respectively.
Specifically, as shown in Table 7, compared with ESVO, the computational load of T-ESVO's event processing unit increases by about 22%, that of the mapping unit by about 12%, and that of the tracking unit by about 35%. Overall, T-ESVO can still perform depth estimation at about 19 Hz and motion estimation at about 77 Hz.
To summarize, compared to ESVO, T-ESVO can still run at a real-time frequency while achieving better accuracy and robustness (see Tables 2, 3, and 7, and Figures 9 and 16).

Conclusion and Discussion
This article proposes an event-based stereo visual odometry system (T-ESVO) based on an adaptive TS and the TSDF. For the event processing unit, we present a novel spatial-temporal adaptive TS method to deal with different camera motions. For the mapping unit, T-ESVO introduces the TSDF to reconstruct 3D environments and re-estimate depth. For the tracking unit, T-ESVO achieves the 6-DoF pose estimation through a 3D-2D registration method based on the left/right TS selection mechanism and the depth point selection mechanism. The experimental results show that our T-ESVO achieves good performance when compared with other event-based stereo systems, and at the same time, with no compromise to real-time performance.

However, the performance of T-ESVO is still limited by the hardware of event cameras (such as noise and dynamic effects, low spatial resolution, etc.). On the one hand, this can be improved with the development of event cameras; on the other hand, we would like to introduce a low-cost IMU as a compensation sensor to enhance the performance. Besides, more effective event representation methods and fast tracking methods like [55] also need to be developed.

Appendix A: A More Detailed Comparison between T-ESVO and ESVO
We plot the estimated 6-DoF poses of T-ESVO and ESVO on the sequences listed in Table 2. As shown in Figure 16, T-ESVO achieves more accurate 6-DoF pose estimation than ESVO. From Figure 16, we can also observe more clearly that motion estimation based on 3D-2D registration does not cope well with frequent rotation; see, for example, the roll and pitch rotation estimation errors on the sequences from the esim and rpg datasets.

Figure 2. System overview of T-ESVO. The event streams are first processed and represented as A-TS images. Then the mapping unit estimates the depth and constructs a global map based on the A-TS images and camera poses via the TSDF; at the same time, the tracking unit estimates 6-DoF poses based on the inverse depth frame and the A-TS images.

Figure 3. Traditional TS versus A-TS under different camera motions in an office scene. When the camera is in fast or normal motion, the A-TS works as the traditional TS (top part). When the camera is in slow motion or stationary, the A-TS adaptively changes the decay rate parameter based on $t_{N_\tau}$ (bottom part). The left column shows the intensity frames of the scene from the DAVIS346 event camera. The middle three columns show the processes of the different TS methods: the first shows the determination of $t_{N_\tau}$ and the events that can provide enough intensity information for TS images (in the yellow boxes); the second and third show how events are converted into pixels of different intensities under the fixed $\varphi$ (traditional TS) and the adaptive $\varphi_{N_\tau}$ (A-TS). The right column shows the output TS images.

Figure 4. The process of the TSDF-based mapping unit includes two parts: a) local depth estimation is responsible for estimating events' depths and fusing inverse depth frames based on the event data in local time; b) mapping and depth re-estimation via TSDF is responsible for TSDF creation and update and for depth re-estimation based on the TSDF, which contains the global historical depth information.

Figure 5. Depth re-estimation via TSDF. The re-estimated 3D point X′ of depth point x is the projection of C_i onto the ray. C_i is the center of the voxel closest to the surface along the ray.

Figure 6. The process of the tracking unit includes three parts: left/right TS selection (top left) is responsible for selecting the proper current A-TS for 3D-2D registration; depth point selection (bottom) is responsible for selecting a depth point subset that matches the distribution of depth points; 3D-2D registration (top right) is responsible for estimating the camera motion.

Figure 7. The flickering effect on the rpg_bin_edited sequence. At t = 8.6 s, the left A-TS image is affected by the flickering effect, and there are many noisy pixels with high intensity values. At t = 12.3 s, the right A-TS image is affected by the flickering effect, and there is a noisy pixel band with high intensity values.

Figure 9. Plots of the translational and rotational errors of the different methods on the esim, rpg, and upenn datasets. T-Error denotes the translational error; R-Error denotes the rotational error.


Figure 10. Depth estimation results of T-ESVO on the rpg and upenn datasets. The first row shows the intensity frames from the DAVIS camera, the second row shows the estimated inverse depth frames, and the third row shows the surface inverse depth frames via TSDF. Inverse depth frames are color-coded from red (close) to blue (far) over a black background. The left four columns (rpg sequences) range from 0.5 m to 5 m, and the right three columns (upenn sequences) range from 1 m to 6.25 m.

Figure 11. Qualitative comparison of mapping results on the upenn sequences using T-ESVO without TSDF and T-ESVO. The first column shows intensity frames of the scene. The second and third columns show the intensity semidense maps with the trajectories estimated by T-ESVO without TSDF and by T-ESVO, respectively. The red boxes indicate the differences in the constructed semidense maps, and the blue circles indicate the differences in the estimated trajectories.

Figure 12. Boxplot of the absolute translational error statistics for T-ESVO with ESVO's tracking unit and with the proposed tracking unit. The middle box spans the first and third quartiles, while the whiskers are the upper and lower limits.

Figure 13. Boxplot of the absolute rotational error statistics for T-ESVO with ESVO's tracking unit and with the proposed tracking unit. The middle box spans the first and third quartiles, while the whiskers are the upper and lower limits.

Figure 14. Performance demonstration of our proposed left/right TS selection method on the rpg_bin_edited sequence. Top part: the difference of similarity scores (ΔS) between the current left A-TS image and the current right A-TS image. Bottom part: the left/right A-TS images and the A-TS selection results (in red boxes) at three timestamps (t = 5.8 s, t = 11.8 s, t = 14.6 s).

Figure 15. Results in dark and HDR conditions. Top part: results in the dark condition; bottom part: results in the HDR condition. The first column shows the intensity semidense maps with the trajectories estimated by T-ESVO. The second, third, and fourth columns show the intensity frames from the DAVIS camera, the (left) A-TS images, and the inverse depth estimation maps, respectively.


Table 1. Quantitative comparison of the accuracy performance of T-ESVO with the traditional TS method and with the A-TS method on the rpg and upenn datasets. The mean ATE (cm) of the absolute translational error is used for the evaluation. The best result is highlighted in bold.

Table 3. Robustness of T-ESVO and ESVO on challenging upenn sequences. The mean ATE (cm) of the absolute translational error is used for the evaluation.

Table 4. Quantitative comparison of the accuracy performance of ORB-SLAM 2, ESVO, and T-ESVO on the rpg and upenn datasets. The RMS ATE (cm) of the absolute translational error is used for the evaluation. The best result is highlighted in bold.

Table 5. Quantitative comparison of the depth estimation of T-ESVO and T-ESVO without TSDF on the upenn sequences. The best result is highlighted in bold.

Table 6. Time cost of the TSDF function and of the whole mapping unit under the upenn_indoor_flying1_edited sequence.

Table 7. Mean execution times of the different units of T-ESVO and ESVO under the upenn_indoor_flying1_edited sequence.