A Fast Geometric Regularizer to Mitigate Event Collapse in the Contrast Maximization Framework

Event cameras are emerging vision sensors whose advantages make them suitable for various applications, such as autonomous robots. Contrast maximization (CMax), which provides state-of-the-art accuracy in motion estimation using events, may suffer from an overfitting problem called event collapse. Prior works are computationally expensive or cannot alleviate the overfitting, which undermines the benefits of the CMax framework. A novel, computationally efficient regularizer based on geometric principles is proposed to mitigate event collapse. The experiments show that the proposed regularizer achieves state-of-the-art accuracy, while its reduced computational complexity makes it two to four times faster than previous approaches. To the best of our knowledge, this regularizer is the only effective solution for event collapse without trading off runtime. It is hoped that this work opens the door to future applications that unlock the advantages of event cameras. Project page: https://github.com/tub-rip/event_collapse

Contrast maximization (CMax) [22] is a powerful event processing framework that achieves state-of-the-art accuracy in various motion estimation tasks. On the other hand, it may suffer from an overfitting problem called event collapse, where the optimizer converges to an undesired global optimum [23]. Prior works have tackled event collapse in several ways, such as whitening the data [24], reformulating the task (e.g., by providing additional depth information [24]), or adding a regularizer to improve the optimization landscape [23]. However, these proposals present shortcomings: (i) assuming known depth data is task-specific and requires an additional sensor such as a LiDAR or a stereo setup, and (ii) the above regularizing techniques may not be effective [24] or require considerable extra computation [23]. Towards more practical applications of event cameras, it is paramount to effectively alleviate event collapse in a computationally efficient manner.
This paper proposes a novel, computationally efficient regularizer to mitigate event collapse in the CMax framework (Figure 1). From a theoretical point of view, the regularizer is designed based on geometric principles of motion field deformation (measuring the area rate of change along point trajectories). In contrast to previous methods, the regularizer does not depend on the event data; it only depends on the motion hypothesis (i.e., the warp). This is desirable because events may not be equally distributed over the space-time image domain, and motion hypotheses provide information even in homogeneous brightness regions (where events are scarce). From a practical point of view, the proposed regularizer drastically reduces computational complexity, being two to four times faster than previous solutions, while achieving state-of-the-art results in mitigating event collapse.

Fig. 2: Method overview. The proposed regularizer (blue line) is based on geometric principles and solely relies on the motion parameters θ, while previous approaches (dashed line) are built from warped events [23]. Adapted with permission from Ref. [25], 2019, Gallego et al.
The above contributions open the door to efficient and interpretable regularizers for motion estimation problems with geometrically-meaningful parametrizations.
Event collapse was originally analyzed in detail by [23]. It may or may not occur depending on the task, the space of motion hypotheses, and the data. Hence, previous works tackling event collapse can be categorized by task. For example, optical flow estimation has a high number of degrees of freedom (DOFs) (2N_p, where N_p is the number of pixels), i.e., motion parameters, and is a collapse-enabled problem. A common approach to mitigate collapse for this task is to add a strong regularizer, such as the classical Charbonnier loss [37], to encourage smoothness of the flow [14], [32]. Another approach consists of increasing the well-posedness of the problem by using a tile-based motion field and a multi-reference focus loss [16]. However, event collapse may still appear in some parts of the image plane.
Ego-motion estimation problems, which are the main focus of this work, parametrize motion over the image space with relatively fewer DOFs. The well-posedness of the problem is affected not only by the number of DOFs but also by the geometric meaning of the motion. By reformulating the problem to reduce the number of DOFs, [38] and [33] increase the well-posedness of the task. Whitening of warped events is proposed in [24] to mitigate event collapse, while [23] designs effective regularizers based on the divergence or the area deformation of the motion field, at the expense of increased computational cost. Initialization close to the solution (in the basin of attraction of the desired local optimum) can also play an important role in evading event collapse [22].
Our work is most related to [23] because we focus on low-DOF motion estimation tasks and seek a principled regularizer to gauge and penalize event collapse. Our theoretical analysis provides a formula for homographic motions (8 DOFs), which can be particularized to: 1 DOF (zoom-in/out motion), 2 DOFs (feature flow), 3 DOFs (rotational camera motion), 4 DOFs (planar similarity motion), and 6 DOFs (planar affine motion). While both [23] and our proposal are interpretable and grounded on geometric principles of motion trajectories, the most important theoretical difference is that our regularizer does not depend on the event data. As a result, the regularizer drastically improves computational complexity while achieving on-par or better motion estimation results on publicly available datasets.

METHODOLOGY
This section first reviews how an event camera works (Section 3.1) as well as the regularized CMax framework (Section 3.2). Then, we propose the new regularizer and explain its geometrical meaning and implications (Section 3.3).

Event Camera
Instead of acquiring brightness images at fixed time intervals (i.e., frames), event cameras record brightness differences asynchronously, called "events" [1], [26]. An event e_k ≐ (x_k, t_k, p_k) represents a brightness change, and it is triggered as soon as the logarithmic brightness at the pixel x_k ≐ (x_k, y_k) exceeds a preset threshold. Here, t_k is the timestamp of the event with µs resolution, and the polarity p_k ∈ {+1, −1} is the sign of the brightness change. Figure 2 shows the input events (red and blue dots (x_k, t_k) in space-time, with color representing polarity).
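To make the event representation concrete, here is a minimal NumPy sketch of a batch of events stored as flat arrays, the common layout for event-camera processing (the variable names and the random data are illustrative, not from the paper's code):

```python
import numpy as np

# A batch of N_e synthetic events e_k = (x_k, t_k, p_k), stored column-wise.
N_e = 5
rng = np.random.default_rng(0)

x = rng.integers(0, 346, size=N_e)        # pixel x-coordinate
y = rng.integers(0, 260, size=N_e)        # pixel y-coordinate
t = np.sort(rng.uniform(0.0, 0.05, N_e))  # timestamps in seconds (us resolution in hardware)
p = rng.choice([-1, +1], size=N_e)        # polarity: sign of the log-brightness change

events = {"x": x, "y": y, "t": t, "p": p}
```

Timestamps are kept sorted, as event cameras deliver events in temporal order.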

Regularized Contrast Maximization
The regularized CMax framework [23] aims at finding the motion parameters θ that optimize the objective function

θ* ≐ arg max_θ G(θ) − λ R(θ).   (1)

Here, G is the data fidelity term and R is a regularizer with weight λ > 0. The overall steps of the framework are described in Figure 2, where black lines indicate the previous approaches, and blue lines indicate our proposal.
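The effect of the weighted regularizer on the optimization landscape can be sketched with toy stand-ins (assuming the common "data term minus weighted regularizer" form; G_toy and R_toy below are illustrative placeholders, not the paper's actual terms):

```python
import numpy as np

def G_toy(theta):
    """Toy data-fidelity term: a smooth contrast-like bump peaking at theta = 0.8."""
    return np.exp(-(theta - 0.8) ** 2)

def R_toy(theta):
    """Toy collapse penalty: a barrier that grows as theta approaches 1."""
    return -2.0 * np.log(1.0 - np.minimum(theta, 0.999))

def objective(theta, lam):
    """Composite objective: data fidelity minus weighted regularizer."""
    return G_toy(theta) - lam * R_toy(theta)

# Grid search over a 1-D parameter, with and without regularization.
thetas = np.linspace(-1.0, 0.99, 2001)
best_unreg = thetas[np.argmax(objective(thetas, 0.0))]
best_reg = thetas[np.argmax(objective(thetas, 0.1))]
```

With λ > 0 the optimum is pulled away from the collapse-prone region near θ = 1, which is exactly the role the regularizer plays in the composite objective.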

Data fidelity term
The term G(θ) measures the alignment of the events with respect to the candidate motion θ (Figure 2, black solid line). The original events E ≐ {e_k}_{k=1}^{N_e} are transformed according to the motion hypothesis into a set of warped events E′ ≐ {e′_k}_{k=1}^{N_e}:

e_k ≐ (x_k, t_k, p_k)  ↦  e′_k ≐ (x′_k, t_ref, p_k).   (2)

The warp function x′_k = W(x_k, t_k; θ) transports every event along its motion trajectory until a reference time t_ref is reached. Point trajectories are parametrized by θ, which consists of motion or scene unknowns (e.g., scene depth, moving objects).
Powerful objective functions are designed based on the count of warped events [25]. The representation of the event count as an image is defined by the image of warped events (IWE):

I(x; θ) ≐ Σ_{k=1}^{N_e} b_k δ(x − x′_k).   (3)

Each pixel of the IWE counts how many events e′_k are warped into pixel x. The event polarity can be used by setting b_k = p_k, and is not used if b_k = 1. The Dirac delta δ is approximated by a Gaussian: δ(x − µ) ≈ N(x; µ, ε² Id), with ε = 1 pixel.
Finally, the objective function, such as the IWE variance, is calculated:

G(θ) ≐ Var(I(x; θ)) ≐ (1/|Ω|) ∫_Ω (I(x; θ) − µ_I)² dx,   (4)

with mean µ_I ≐ (1/|Ω|) ∫_Ω I(x; θ) dx and image domain Ω. The interpretation of (4) is as follows: the larger the IWE variance (contrast of the IWE), the better the alignment of the warped events E′. Contrast is related to sharpness and focus [25].
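The IWE accumulation and the variance objective can be sketched in a few lines of NumPy (function names are ours; the Gaussian blur approximates the Dirac delta with ε = 1 pixel, as described in the text):

```python
import numpy as np

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian filter approximating the Dirac delta (eps = 1 px)."""
    r = max(1, int(3 * sigma))
    xs = np.arange(-r, r + 1)
    k = np.exp(-xs**2 / (2 * sigma**2))
    k /= k.sum()
    img = np.apply_along_axis(np.convolve, 0, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 1, img, k, mode="same")

def iwe(x_w, y_w, shape, b=None, sigma=1.0):
    """Image of warped events: accumulate b_k at each warped pixel, then blur."""
    h, w = shape
    img = np.zeros(shape)
    xi = np.clip(np.round(x_w).astype(int), 0, w - 1)
    yi = np.clip(np.round(y_w).astype(int), 0, h - 1)
    np.add.at(img, (yi, xi), 1.0 if b is None else b)  # unbuffered scatter-add
    return gaussian_blur(img, sigma)

def variance_objective(img):
    """Data-fidelity term G: the variance (contrast) of the IWE."""
    return float(np.mean((img - img.mean()) ** 2))
```

A warp that aligns events concentrates them into fewer pixels and hence raises the variance, which is what the optimizer exploits.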

Previous regularizers
The regularizer term R in (1) penalizes event collapse in certain types of warps. When event collapse happens, the warped events are accumulated into too few pixels or lines, resulting in an undesired global optimum of G (overfitting). Two regularizers are proposed in [23] by averaging collapse quantities attached to each event (Figure 2, dashed line). Specifically, they use the divergence of the flow D(E, θ) and the area-based deformation of the warp A(E, θ), built from the per-event quantities

∇ · f(x_k, t_k; θ)  and  det J(x_k, t_k; θ),   (5)

where the flow f ≐ ∂W(x, t; θ)/∂t and the Jacobian of the warp J(x, t; θ) ≐ ∂W(x, t; θ)/∂x are the space-time derivatives of W. Just like (3), D and A are used to create images of average divergence and area deformation per pixel. Finally, [23] computes R as the trimmed mean of such images.

Fig. 3: Rate of change of area deformation. The warp W defines point trajectories γ(t) = (x(t), t) in the space-time image domain. We define the regularizer R based on the differential area deformation along γ(t). The rate of change of area is given by the derivative of the Jacobian J_{t,t+∆t}.

Proposed Motion-based Regularizer
Although [23] successfully mitigates overfitting, it comes at a computational cost. The complexity of these regularizers is O(N_e + N_p), because (5) depends linearly on the number of events N_e and the resulting average images have N_p pixels. This extra complexity makes the whole pipeline more than twice as slow as the original (unregularized) CMax framework, whose complexity is also O(N_e + N_p) [22]. Not only is the computational complexity a burden, but also the fact that the quantities in (5) are measured relative to a single reference time. For example, A(E, θ) increases as t_k increases, since it measures the area deformation from t_k to t_ref = t_1. This scaling problem is undesirable because (i) events far from t_ref contribute more to R than events closer to t_ref, and (ii) this effect could be amplified depending on the temporal distribution of the events.
Intuitively, motion fields are well-posed or not (i.e., collapse-enabled) by design, regardless of the event data. Hence, an ideal regularizer should not depend on the events, but solely on the warp parameters (Figure 2, blue line). The main idea of the proposed regularizer is to aggregate differential deformations rather than relative ones. Figure 3 shows the geometric interpretation: R is obtained as the integral of the rate of change of the area element deformation along the space-time point trajectories (x(t), t) defined by the motion.

Collapse-enabled warp with 1 DOF
To illustrate our approach, consider the simplest example: a 1-DOF motion that approximates the translation of the camera along its optical axis Z. This is a simplified zoom-in/out motion without knowledge of scene depth, as used in [34]. The warp W is given by

x′_k = W(x_k, t_k; θ) ≐ (1 − h_z t_k) x_k,   (6)

where θ ≡ h_z, and the coordinate frame is at the center of the image plane. For simplicity, t_ref = 0 and the timestamps are normalized to the unit interval. Assuming an area element attached to each point of the motion trajectory γ(t) = (x(t), t) (Figure 3), the change of area (i.e., area deformation) from t to t + ∆t is given by:

J_{t,t+∆t} ≐ det( ∂x(t + ∆t) / ∂x(t) ) = ( (1 − h_z t) / (1 − h_z (t + ∆t)) )².   (7)

The Taylor series expansion of (7) at ∆t = 0 is

J_{t,t+∆t} ≈ 1 + ( 2 h_z / (1 − h_z t) ) ∆t + O(∆t²).   (8)

Since the first term is always 1 (i.e., trivial), we focus on the second term, which conveys the meaning of "speed" of area deformation. The derivative of (7) at ∆t = 0 conveys the rate of change or differential amplification factor of the area:

dJ_{t,t+∆t}/d∆t |_{∆t=0} = 2 h_z / (1 − h_z t).   (9)

Finally, the total rate of change of the deformation along the observation time window is

R(θ) ≐ ∫_0^1 dJ_{t,t+∆t}/d∆t |_{∆t=0} dt = −2 ln(1 − h_z).   (10)

The regularizer (10) is plotted in Figure 4. It solely depends on θ ≡ h_z and has computational complexity O(1). In addition, it is developed from geometric principles, and it is interpretable: h_z = 0 (identity warp) gives R = 0; h_z ∈ (0, 1) (contraction; collapsing warp) gives large R > 0; and h_z < 0 (expansion warp) gives R < 0. Moreover, notice that R behaves like a barrier function, approaching infinity (i.e., a large penalty) for values close to h_z = 1 (the maximum allowed contraction before events flip side with respect to the image center).

Fig. 4: Regularizer R for the 1-DOF warp, (10).
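A numerical sketch of the 1-DOF regularizer follows. The closed form −2 ln(1 − h_z) is our reconstruction from the derivation above and is consistent with the barrier behavior described in the text (zero at the identity, positive for contraction, negative for expansion, diverging as h_z → 1); treat the exact constant as an assumption:

```python
import numpy as np

def regularizer_1dof(h_z):
    """Rate-of-change-of-area regularizer for the 1-DOF zoom warp
    W(x, t) = (1 - h_z * t) x, integrated over a unit time window.
    Closed form (a sketch; requires h_z < 1): R(h_z) = -2 * ln(1 - h_z)."""
    h_z = np.asarray(h_z, dtype=float)
    return -2.0 * np.log(1.0 - h_z)
```

The O(1) cost is visible here: evaluating R needs only the scalar motion parameter, never the events.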

Well-posed warp with 2 DOFs
The 2-DOF translational motion (feature flow) in image space is a well-posed warp, since collapse never happens because the motion lines are parallel. The warp is given by

x′_k = W(x_k, t_k; θ) ≐ x_k − θ t_k,   (11)

i.e., a constant velocity θ for all pixels. Translations do not change the area element, i.e., J_{t,t+∆t} = 1, hence R ≡ ∫_0^1 (dJ_{t,t+∆t}/d∆t)|_{∆t=0} dt = 0. Since the resulting regularizer vanishes, it does not affect the landscape of the composite objective function (1), as expected.

Well-posed warp with 3 DOFs

Rotational camera motion is parametrized by the angular velocity θ ≡ ω, which defines point trajectories through rotation homographies in calibrated homogeneous coordinates. The incremental rotation between t and t + ∆t yields

x_h(t + ∆t) ∼ R(ω∆t) x_h(t).   (12)

Hence, the area element at x(t) deforms according to:

J_{t,t+∆t} = 1 / ( r_3 x_h(t) )³,   (13)

where r_3 is the third row of R(ω∆t) (see Section A.1). The derivative of (13) at ∆t = 0 is given by:

dJ_{t,t+∆t}/d∆t |_{∆t=0} = 3 ( ω_y x(t) − ω_x y(t) ).   (14)

Finally, the integral of (14) over the point trajectory (parametrized by the initial point x(0)) is given by:

R_{x(0)} ≐ ∫_0^1 3 ( ω_y x(t) − ω_x y(t) ) dt.   (15)

The integrals in (15) have units of absement. To obtain the regularizer R, we threshold R_{x(0)} at −0.2 and compute its mean, which allows small amounts of natural deformation caused by rotation. Similar to (10), (15) does not depend on the events. However, in contrast to (10), (15) is spatially varying, providing an aggregated deformation map: it is smaller in the center of the image and larger (in absolute value) in the periphery. The computational complexity of R is O(N_p), which can be further reduced if only a subset of the pixels is used.
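A sketch of the spatially varying deformation map and its thresholded aggregation follows. The small-angle expression used for the per-pixel rate of change, the sign convention, and the use of the instantaneous rate as a stand-in for the trajectory integral are our assumptions, hedged accordingly:

```python
import numpy as np

def rotation_deformation_map(omega, width, height, focal=1.0):
    """Per-pixel rate of change of the area element under a small incremental
    rotation R(omega * dt), in calibrated coordinates centered at the image
    center. Uses the small-angle form -3 * (w_x * y - w_y * x); the constant
    and sign follow one choice of convention and are illustrative."""
    wx, wy, wz = omega  # wz does not appear: Z-axis rotation preserves area
    xs = (np.arange(width) - width / 2) / focal
    ys = (np.arange(height) - height / 2) / focal
    X, Y = np.meshgrid(xs, ys)
    return -3.0 * (wx * Y - wy * X)

def rotation_regularizer(defo_map, threshold=-0.2):
    """Thresholded aggregation: average only the values below the threshold,
    leaving a margin for small natural deformations caused by rotation."""
    collapse = defo_map[defo_map < threshold]
    return float(collapse.mean()) if collapse.size else 0.0
```

Note how a pure rotation about the optical axis Z produces a zero map, matching the expected area-preserving behavior, while rotations about X or Y deform the periphery most.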
Although 3-DOF rotations involve some deformation, the values of (14) are considerably smaller than those of collapse-enabled warps like (6), and R does not affect the accuracy of the angular velocity estimation (as Section 4.4 will show). Also, pure rotations around the Z axis, ω = (0, 0, ω_z), do not change the area, as expected, resulting in R = 0.

Collapse-enabled warp with 4 DOFs
The 1-DOF warp (Section 3.3.1) is a particular case of the 4-DOF warp in [24], [34], which approximates a freely moving camera (6 DOFs) by means of a similarity transformation on the image plane. The scaling parameter h_z of the similarity transformation controls the amount of zoom in/out, i.e., the amount of contraction/expansion of the warp. Hence, we use (10) to penalize the amount of contraction. A mathematical justification is given in Appendix A. Since the purpose of our regularizer is to discourage the collapse while allowing a small amount of natural collapse, we use (10) with a small margin (threshold), so that deformations below the margin are not penalized.
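This thresholding can be sketched as a hinge on the 1-DOF penalty (the hinge form, the margin value, and the closed form −2 ln(1 − h_z) are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def margin_regularizer(h_z, margin=0.1):
    """Penalize contraction only beyond a small margin, so the natural
    deformation of a genuine forward motion is not punished. The closed
    form -2*log(1 - h_z) and the hinge/margin are illustrative assumptions."""
    r = -2.0 * np.log(1.0 - h_z)
    return max(0.0, r - margin)
```

Small contractions fall inside the margin and incur no penalty; strong contractions are penalized as before.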

EXPERIMENTS
We assess the performance of our regularizer by first showing its effectiveness on collapse-enabled warps, which naturally appear in driving sequences. Second, a runtime analysis proves that our proposal is faster than prior work. Finally, we demonstrate the effect on rotational sequences, in order to show that the proposed regularizer does not harm well-posed warps. There is no need to test feature flow, since we have proved analytically that the regularizer vanishes for such a warp.
The MVSEC dataset [43] consists of events, grayscale frames, and IMU data from an event camera (mDAVIS346, 346 × 260 pixels [45]), camera poses, LiDAR data, and ground truth optical flow provided by [41]. We use the outdoor sequences, where the event camera is mounted on a car. Following previous work [23], we select several excerpts from the outdoor day1 sequence that have a dominant forward motion, which is reasonably well approximated by collapse-enabled warps such as the 1-DOF and 4-DOF cases. In total, we evaluate on 3.2 million events spanning 10 s. The DSEC dataset [42] is another recent dataset of driving sequences. It includes more complex scenes (e.g., moving objects, higher dynamic range) recorded with a higher resolution event camera (Prophesee Gen3, 640 × 480 pixels). Ground truth optical flow is computed as the motion field, using the scene depth from a LiDAR [44]. In total, we evaluate on 380 million events spanning 40 s from the zurich city 11 sequence.
The ECD dataset [46] is widely used to assess camera ego-motion [8]-[11], [23], [27], [28], [47]. Each sequence provides events, frames, calibration information, and IMU data from a DAVIS240C camera (240 × 180 pixels [48]), as well as ground truth camera poses from a motion capture system (at 200 Hz). We use the boxes rotation and dynamic rotation sequences for 3-DOF rotational motion estimation, to consistently compare with previous work. In total, we evaluate on 43 million events (10 s) of the boxes sequence, and on 15 million events (11 s) of the dynamic sequence.

Metrics
Optical flow accuracy (for the MVSEC and DSEC experiments) is given by the Average Endpoint Error (AEE) and the percentage of pixels with an endpoint error greater than N pixels ("N PE"), for N = {3, 10, 20}. They are calculated only on pixels with valid ground truth. We also adopt the FWL metric [49], which is defined as the variance of the IWE relative to that of the identity warp. The FWL measures IWE sharpness: FWL < 1 means that the estimation is worse than the zero-flow baseline, while FWL > 1 implies that the resulting IWE is sharper than the baseline. Rotational motion accuracy is assessed as the RMS error of angular velocity estimation, following previous works [24], [25], [27]. The angular velocity is assumed to be constant during the time window of events, and is compared with the IMU's gyroscope value (ground truth) at the midpoint: (t_1 + t_{N_e})/2. We also use the FWL metric to measure IWE sharpness.
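The three metrics are straightforward to compute; a minimal sketch (function names are ours, flows stored as (H, W, 2) arrays):

```python
import numpy as np

def aee(flow_est, flow_gt, valid=None):
    """Average Endpoint Error over pixels with valid ground truth."""
    err = np.linalg.norm(flow_est - flow_gt, axis=-1)
    if valid is not None:
        err = err[valid]
    return float(err.mean())

def n_pe(flow_est, flow_gt, n, valid=None):
    """Percentage of pixels whose endpoint error exceeds n pixels."""
    err = np.linalg.norm(flow_est - flow_gt, axis=-1)
    if valid is not None:
        err = err[valid]
    return float(100.0 * np.mean(err > n))

def fwl(iwe_warped, iwe_identity):
    """FWL: variance of the warped IWE relative to the identity-warp IWE.
    FWL > 1 means the result is sharper than the zero-flow baseline."""
    return float(iwe_warped.var() / iwe_identity.var())
```

The optional `valid` mask restricts AEE and N PE to pixels where the LiDAR-derived ground truth exists, as in the text.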
The estimation time window spans dt = 4 grayscale frames (at ≈ 45 Hz) for MVSEC (standard in the MVSEC benchmark), 500k events for DSEC, and 30k events for the ECD dataset. For the runtime comparison we use a fixed number of events (30k for MVSEC and 500k for DSEC), because the runtime depends on the number of events (e.g., O(N_e + N_p)).

Results on Collapse-Enabled Warps
Tables 1 and 2 report the results of the collapse-enabled warp experiments (1 and 4 DOFs) on MVSEC and DSEC, respectively. They report the flow AEE, N PE, and FWL for the two data terms: the image variance and the squared magnitude of the IWE gradient ("Gradient Magnitude"). Throughout the experiments, the error metrics are considerably high in the results of the original CMax ("No regularizer") and whitening [24] methods. This indicates that the warp overfits to the events (event collapse). On the other hand, our regularizer produces better AEE values and FWL moderately higher than 1. These results clearly show that our regularizer successfully discourages event overfitting while producing sharper IWEs than the identity warp. Our results are competitive with those of [23], and we do not find significant accuracy differences between them.
Qualitative results are shown in Figure 5 (MVSEC and DSEC rows). Our regularizer (last column) provides the best IWEs, which reveal the sharp edges of the scene, whereas the IWEs without a regularizer exhibit event collapse (second column). Although R in (10) is a scalar, we visualize it as an image to compare it with the area deformation map from [23]. Notice that the area deformation map [23] shows collapse only at pixels with warped events, while our regularizer provides dense maps (even at pixels with no events, corresponding to homogeneous brightness regions) because it is purely geometric, based on the motion parameters.

Runtime Comparison
Table 3 reports the runtime comparison of the methods, notably with respect to the original CMax ("No regularizer"). We use Python (3.9.12) on a CPU (Mac M1 2020, 8 cores), and average the runtime over 400 trials. The whitening technique [24] is slower than the original CMax. The runtime difference is due to an extra SVD step on the events, which is more noticeable (2× slower) on the DSEC dataset than on MVSEC because it uses more events. The "Deformation" regularizer in [23] is also two to three times slower than the original CMax. When both regularizers in [23] are combined ("Div. + Def."), the runtime becomes even larger. Finally, our regularized approach has almost the same runtime as the original CMax, since the regularizer's complexity is O(1); thus it is two to four times faster than competing methods.
Figure 6 visualizes the accuracy and runtime of the methods (on DSEC data).Runtime is reported relative to the "No regularizer" case.It clearly shows that the proposed regularizer is the only effective approach against event collapse that does not compromise the speed of the CMax framework.

Results on Well-Posed Warps
To confirm that the proposed regularizer does not harm well-posed warps (e.g., 3-DOF rotational motion), we report results on the ECD dataset in Table 4. We use the variance as the data fidelity term and the Adam optimizer. In both rotational sequences, the results of "No regularizer" and ours produce very similar RMS and FWL values. This is because the R values of well-posed warps are small, so the regularizer barely changes the objective landscape.


Application: Time-to-contact
The parametrization of collapse-enabled warps has useful implications for future applications on intelligent vehicles, such as advanced driver-assistance systems (ADAS). Let us introduce another interpretation of the parameter h_z. For a freely moving camera with linear and angular velocities V and ω, respectively, the apparent velocity v(x) on the image plane of a 3D point X = (x, y, Z(x)) (at depth Z(x) with respect to the camera) can be computed using the 2 × 6 feature sensitivity matrix B(x) [51]:

v(x) = B(x) (Vᵀ, ωᵀ)ᵀ,   (16)

which can be used to warp events:

x′_k = x_k − v(x_k) t_k.   (17)

Assuming a vehicle with body-frame velocity v_z, i.e., V ≡ (0, 0, v_z)ᵀ, ω ≡ (0, 0, 0)ᵀ, the motion field (16) becomes v(x) = (v_z / Z(x)) x, and substituting in (17) gives x′_k = (1 − (v_z / Z(x_k)) t_k) x_k. Comparing this expression to (6), and assuming Z(x) is spatially invariant, we identify

h_z = v_z / Z,   (18)

i.e., the parameter h_z is the inverse of the time-to-contact or time-to-collision (TTC) [52].
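Under this identification, the TTC is simply the reciprocal of the estimated contraction rate, and (18) relates depth and vehicle speed; a trivial sketch (assuming h_z is expressed in 1/s):

```python
def time_to_contact(h_z):
    """TTC = Z / v_z = 1 / h_z, assuming a spatially invariant scene depth."""
    return 1.0 / h_z

def depth_from_speed(h_z, v_z):
    """Scene depth from vehicle speed via h_z = v_z / Z, i.e., Z = v_z / h_z."""
    return v_z / h_z
```

For example, an estimated h_z of 0.5 1/s means the camera would reach the dominant scene depth in 2 s; at a known speed of 10 m/s, that depth is 20 m.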
Figure 7 shows two examples of TTC from the MVSEC dataset. It is remarkable that this 1-DOF warp model can be related to such a popular concept in ADAS, and, given its runtime, our regularizer plays an important role toward real-time computation of the TTC. Also note that (18) establishes a relation between the TTC, the vehicle speed, and the scene depth: the TTC can be used to estimate the scene depth given the vehicle speed, or, vice versa, the vehicle speed given the scene depth. We hope this connection helps future implementations of event-camera applications in collision avoidance systems.

LIMITATIONS
As in many regularized problems, the regularizer weight λ is empirically set.It depends on the paired objective function and on the data, i.e., on the scene.The proposed method also has a heuristically selected threshold (margin) that allows us to account for small natural deformations of the motion field.It would be desirable to develop automatic strategies to balance both data fidelity and regularizing terms (including the threshold) for best optimization convergence and results.
While we have obtained formulas to measure the rate of change of area deformation along point trajectories for low-DOF motions (up to 8-DOF homographies in Appendix A), the regularizer requires aggregation in the form of integrals (e.g., (15)), which are evaluated via numerical integration.Extending the ideas in this work to higher DOFs (e.g., optical flow) and developing efficient numerical approaches is an important research direction.
Our current implementation of the regularized CMax framework is not real time.This is especially relevant as the spatial resolution of event cameras increases (VGA size [42] and 1Mpixel event cameras [2]), which also increases the number of events to be processed.Therefore, in the future it will be important to speed up the method (both the data fidelity term and the regularizer) to enable interactive applications.
Finally, while in the time-to-contact application (Section 4.5) we established the connection between scene depth and vehicle speed, the example warp is 1-DOF (assumes a single depth for the whole scene), which may be an oversimplification.The warp does not consider where events happen from a user perspective (e.g., on the road or on the side of the road).A more elaborate model would consider the location of the events and decide whether there will be an actual contact along the predicted vehicle trajectory.

CONCLUSION
We proposed a novel regularizer to mitigate event collapse in the CMax framework, based on aggregating differential deformations of the motion field. The experimental results show its efficacy, achieving accuracy on par with the state of the art in low-DOF motion estimation problems. Furthermore, the proposed regularizer is the only effective approach to date against event collapse that does not compromise the runtime of the CMax framework. Since low-DOF motion estimation forms a foundation for more complex motion models, it would be important to analyze more complex warps, such as dense optical flow, to alleviate event collapse effectively and efficiently. We hope this work encourages future research toward this paramount application of the CMax framework, which leverages the advantages of event cameras.

APPENDIX A
A.1 Homographies

In homogeneous coordinates, a homographic warp W is given by [22]

x′_h ∼ H(θ) x_h,

and the point trajectories are represented by

x_h(t) ∼ H(t; θ) x_h(0).

Hence, the differential transformation from t to t + ∆t is given also by a homography H_{t,t+∆t} ≐ H(t + ∆t; θ) H⁻¹(t; θ):

x_h(t + ∆t) ∼ H(t + ∆t; θ) H⁻¹(t; θ) x_h(t) = H_{t,t+∆t} x_h(t).   (21)
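The incremental-homography identity can be checked numerically: composing H_{t,t+∆t} with the trajectory at time t reproduces the trajectory at t + ∆t, up to projective scale. The homography path H(t) below is an arbitrary illustrative choice, not the paper's parametrization:

```python
import numpy as np

def H(t):
    """Illustrative smooth homography path with H(0) = I."""
    return np.array([[1.0 + 0.1 * t, 0.02 * t, 0.3 * t],
                     [0.01 * t, 1.0 - 0.05 * t, -0.2 * t],
                     [0.001 * t, 0.002 * t, 1.0]])

def dehom(xh):
    """Homogeneous -> inhomogeneous image coordinates."""
    return xh[:2] / xh[2]

t, dt = 0.4, 0.25
x0 = np.array([0.7, -0.3, 1.0])          # x_h(0) in homogeneous coordinates
x_t = H(t) @ x0                          # x_h(t) ~ H(t; theta) x_h(0)
H_inc = H(t + dt) @ np.linalg.inv(H(t))  # incremental homography H_{t,t+dt}
x_next = H_inc @ x_t                     # should match x_h(t + dt)
```

The check is exact by construction (H_inc H(t) = H(t + dt)), which is what makes the differential transformation between consecutive times itself a homography.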
A.2 Similarity Transformation on the Image Plane, Sim(2)

The 4-DOF transformation in [34] has a similar geometric meaning but a different parametrization. Hence, we use the above result and penalize collapse by means of the corresponding scaling parameter in [34].

A.3 Affine Transformation on the Image Plane, Aff(2)
A planar affine transformation has 6 DOFs in θ. Letting

H_A(t; θ) ≐ [ A(t)  b(t) ; 0ᵀ  1 ],

and using (21) gives H_{t,t+∆t} = H_A(t + ∆t; θ) (H_A)⁻¹(t; θ). Affinities also form a matrix Lie group, hence H_{t,t+∆t} is an affinity. Moreover, its third row is also e_3ᵀ H_{t,t+∆t} = (0, 0, 1), and following similar steps as those in (27),

J_{t,t+∆t} = det(A(t + ∆t)) / det(A(t)).

Notice that the 2 × 2 matrix A includes not only a scaling parameter but also a shear, which affects the area deformation.

Fig. 1 :
Fig. 1: Sample application of event cameras. Top: the advantages of event cameras are beneficial for robotics applications, such as autonomous driving. Bottom: the proposed regularizer discourages event collapse (left) and reveals the sharp edges of the scene in a computationally efficient manner (right). (Top image licensed from Stock Photo ID 1102269152.)

Fig. 5 :
Fig. 5: Qualitative results. (a) Original events. (b)-(d) Results without regularization: 1-DOF motion results (MVSEC [43] and DSEC [42]) are trapped in global optima of event collapse, as shown in the IWEs (b). The regularizer values in such collapse cases (c)-(d) are very large compared with the well-posed warp cases (boxes rot and dynamic rot rows). (e) Results with the proposed regularizer: it mitigates collapse for the MVSEC and DSEC scenes while not harming the ECD scenes. Best viewed in the electronic version.
We set λ in (1) as follows: if the data term is the IWE variance, λ = 1.0 for the MVSEC and ECD experiments, and λ = 5.0 for the DSEC experiments; if the data term is the squared magnitude of the IWE gradient, λ = 0.2 for MVSEC and λ = 1.0 for DSEC. The optimization algorithm is the Tree-structured Parzen Estimator (TPE) sampler [50].

Fig. 6 :
Fig. 6: Runtime comparison for the DSEC experiment. Runtime is relative to that of the original CMax ("No regularizer"). Our method has desirable properties: small AEE and small runtime.

Fig. 7 :
Fig. 7: Time to Contact application example.The parametrization with h z in the 1-DOF warp can be used to approximate the TTC for the dominant depth of the scene represented by the events (e.g., the trees).