**3. Method**

Let us present our approach to measuring and mitigating event collapse. First, we review how event cameras work (Section 3.1) and the CMax framework (Section 3.2), which was informally introduced in Section 2.1. Then, Section 3.3 builds our intuition on event collapse by analyzing a simple example. Section 3.4 presents our proposed metrics for event collapse, based on 1-DOF and 2-DOF warps. Section 3.5 extends them to higher DOFs, and Section 3.6 presents the regularized objective function.

#### *3.1. How Event Cameras Work*

Event cameras, such as the Dynamic Vision Sensor (DVS) [2,3,32], are bio-inspired sensors that capture pixel-wise intensity changes, called events, instead of intensity images. An event $e_k \doteq (\mathbf{x}_k, t_k, p_k)$ is triggered as soon as the logarithmic intensity $L$ at a pixel exceeds a contrast sensitivity $C > 0$,

$$L(\mathbf{x}_k, t_k) - L(\mathbf{x}_k, t_k - \Delta t_k) = p_k\, C, \tag{1}$$

where $\mathbf{x}_k \doteq (x_k, y_k)$, $t_k$ (with μs resolution) and polarity $p_k \in \{+1, -1\}$ are the spatio-temporal coordinates and sign of the intensity change, respectively, and $t_k - \Delta t_k$ is the time of the previous event at the same pixel $\mathbf{x}_k$. Hence, each pixel has its own sampling rate, which depends on the visual input.
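The per-pixel sampling behavior of (1) can be illustrated with a minimal NumPy sketch for a single pixel; the function name `generate_events` and the sampled-signal interface are hypothetical, chosen only for this illustration:

```python
import numpy as np

def generate_events(log_intensity, times, C=0.2):
    """Simulate DVS events at one pixel from a sampled log-intensity signal,
    following the contrast condition of Equation (1).

    log_intensity, times: 1-D arrays of samples L(t) and their timestamps.
    C: contrast sensitivity (> 0).
    Returns a list of (t_k, p_k) events.
    """
    events = []
    L_ref = log_intensity[0]                 # log intensity at the last event
    for L_now, t_now in zip(log_intensity[1:], times[1:]):
        while abs(L_now - L_ref) >= C:       # contrast threshold crossed
            p = 1 if L_now > L_ref else -1   # polarity = sign of the change
            L_ref += p * C                   # step the reference by one contrast
            events.append((t_now, p))
    return events

# A brightening ramp of total log-intensity change 0.55 with C = 0.2
# triggers positive events at the 0.2 and 0.4 threshold crossings.
t = np.linspace(0.0, 1.0, 101)
L = 0.55 * t
ev = generate_events(L, t, C=0.2)
```

Note how the event rate follows the slope of $L$: a faster brightness change would produce more events in the same interval, which is the data-driven sampling mentioned above.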

#### *3.2. Mathematical Description of the CMax Framework*

The CMax framework [12] geometrically transforms the events in a set $\mathcal{E} \doteq \{e_k\}_{k=1}^{N_e}$,

$$e_k \doteq (\mathbf{x}_k, t_k, p_k) \;\; \stackrel{\mathbf{W}}{\longmapsto} \;\; e'_k \doteq (\mathbf{x}'_k, t_{\text{ref}}, p_k), \tag{2}$$

according to a motion model $\mathbf{W}$, producing a set of warped events $\mathcal{E}' \doteq \{e'_k\}_{k=1}^{N_e}$. The warp $\mathbf{x}'_k = \mathbf{W}(\mathbf{x}_k, t_k; \boldsymbol{\theta})$ transports each event along the point trajectory that passes through it (Figure 2, left) until the reference time $t_{\text{ref}}$ is reached. The point trajectories are parametrized by $\boldsymbol{\theta}$, which contains the motion and/or scene unknowns. Then, an objective function [10,13] measures the alignment of the warped events $\mathcal{E}'$. Many objective functions are given in terms of the count of events along the point trajectories, which is called the image of warped events (IWE):

$$I(\mathbf{x}; \boldsymbol{\theta}) \doteq \sum_{k=1}^{N_e} b_k \, \delta(\mathbf{x} - \mathbf{x}'_k(\boldsymbol{\theta})). \tag{3}$$

Each IWE pixel $\mathbf{x}$ sums the values of the warped events $\mathbf{x}'_k$ that fall within it: $b_k = p_k$ if polarity is used or $b_k = 1$ if polarity is not used. The Dirac delta $\delta$ is in practice replaced by a smooth approximation [33], such as a Gaussian, $\delta(\mathbf{x} - \boldsymbol{\mu}) \approx \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \sigma^2 \mathrm{Id})$ with $\sigma = 1$ pixel. A popular objective function $G(\boldsymbol{\theta})$ is the visual contrast of the IWE (3), given by the variance
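A minimal sketch of the Gaussian-smoothed IWE of (3) is given below; the function name `iwe` and the event-list interface are assumptions for this illustration, and the dense per-event Gaussian evaluation is written for clarity rather than efficiency:

```python
import numpy as np

def iwe(warped_xy, b, height, width, sigma=1.0):
    """Build the image of warped events of Equation (3), replacing the
    Dirac delta by an isotropic Gaussian with sigma = 1 pixel.

    warped_xy: list of warped event locations (x'_k, y'_k).
    b: per-event values b_k (polarity, or 1 if polarity is not used).
    """
    img = np.zeros((height, width))
    ys, xs = np.mgrid[0:height, 0:width]          # pixel grid coordinates
    for (x, y), bk in zip(warped_xy, b):
        # Each event spreads a Gaussian bump centered at its warped location.
        img += bk * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return img / (2 * np.pi * sigma ** 2)         # Gaussian normalization

# Two events warped to the same location reinforce a single sharp peak.
I = iwe([(4.0, 4.0), (4.0, 4.0)], b=[1, 1], height=9, width=9)
```

The reinforcement of coincident warped events into strong IWE peaks is exactly what the contrast objective below rewards.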

$$G(\boldsymbol{\theta}) \doteq \operatorname{Var}\big(I(\mathbf{x}; \boldsymbol{\theta})\big) \doteq \frac{1}{|\Omega|} \int_{\Omega} \big(I(\mathbf{x}; \boldsymbol{\theta}) - \mu_I\big)^2 \, d\mathbf{x}, \tag{4}$$

with mean $\mu_I \doteq \frac{1}{|\Omega|} \int_{\Omega} I(\mathbf{x}; \boldsymbol{\theta}) \, d\mathbf{x}$ and image domain $\Omega$. Hence, the alignment of the transformed events $\mathcal{E}'$ (i.e., the candidate "corresponding events", triggered by the same scene edge) is measured by the strength of the edges of the IWE. Finally, an optimization algorithm iterates the above steps until the best parameters are found:

$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} G(\boldsymbol{\theta}). \tag{5}$$

#### *3.3. Simplest Example of Event Collapse: 1 DOF*

To analyze event collapse in the simplest case, let us consider an approximation to a translational motion of the camera along its optical axis $Z$ (1-DOF warp). In theory, translational motions also require knowledge of the scene depth. Here, inspired by the 4-DOF in-plane warp in [20] that approximates a 6-DOF camera motion, we consider a simplified warp that does not require knowledge of the scene depth. In terms of data, let us consider events from one of the driving sequences of the standard MVSEC dataset [34] (Figure 1).

For further simplicity, let us normalize the timestamps of $\mathcal{E}$ to the unit interval, $t \in [t_1, t_{N_e}] \to \tilde{t} \in [0, 1]$, and assume a coordinate frame at the center of the image plane; then the warp $\mathbf{W}$ is given by

$$\mathbf{x}'_k = \left(1 - \tilde{t}_k\, h_z\right) \mathbf{x}_k, \tag{6}$$

where $\theta \equiv h_z$. Hence, events are transformed along the radial direction from the image center, which acts as a virtual focus of expansion (FOE) (cf. the true FOE is given by the data).

Letting the scaling factor in (6) be $s_k \doteq 1 - \tilde{t}_k h_z$, we observe the following: (i) $s_k$ cannot be negative, since that would imply that at least one event has flipped the side on which it lies with respect to the image center; (ii) if $s_k > 1$, the warped event moves away from the image center ("expansion" or "zoom-in"); and (iii) if $s_k \in [0, 1)$, the warped event moves closer to the image center ("contraction" or "zoom-out"). The equivalent conditions in terms of $h_z$ are: (i) $h_z \leq 1$, (ii) $h_z < 0$ is an expansion, and (iii) $0 < h_z \leq 1$ is a contraction.
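The expansion and contraction regimes of the scaling factor $s_k$ can be checked numerically with a small sketch of the 1-DOF warp (6); the function name `warp_1dof` and the example events are assumptions, with event locations expressed in image-centered coordinates:

```python
import numpy as np

def warp_1dof(x, t_norm, h_z):
    """1-DOF warp of Equation (6): scale each event location about the
    image center (the origin) by s_k = 1 - t~_k * h_z."""
    s = 1.0 - t_norm * h_z       # per-event scaling factors s_k
    return s[:, None] * x

x = np.array([[10.0, 0.0], [0.0, 5.0]])   # event locations (centered coords)
t = np.array([0.5, 1.0])                   # normalized timestamps in [0, 1]

# 0 < h_z <= 1 gives s_k in [0, 1): events move toward the center.
x_contracted = warp_1dof(x, t, h_z=0.5)
# h_z < 0 gives s_k > 1: events move away from the center.
x_expanded = warp_1dof(x, t, h_z=-0.5)
```

Comparing the radial distances before and after the warp confirms the two regimes: all warped radii shrink for the contraction and grow for the expansion.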

Intuitively, event collapse occurs if the contraction is large ($0 < s_k \ll 1$) (see Figures 1C and 3a). This phenomenon is not specific to the image variance; other objective functions lead to the same result. As we see, the objective function has only a local maximum at the desired motion parameters (Figure 1B), whereas optimization over the entire parameter space converges to a global optimum that corresponds to event collapse.
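The pathology can be reproduced on purely random events, which demonstrates that the inflated contrast comes from the contraction itself rather than from any genuine edge alignment. The sketch below uses a histogram in place of the Gaussian-smoothed IWE, and the function `contrast` and the chosen values of $h_z$ are assumptions for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random events: uniform locations (image-centered coords) and timestamps.
x = rng.uniform(-10.0, 10.0, size=(2000, 2))
t = rng.uniform(0.0, 1.0, 2000)

def contrast(h_z, bins=20):
    """Variance of a histogram IWE after the 1-DOF warp of Equation (6)."""
    x_warped = (1.0 - t * h_z)[:, None] * x
    I, _, _ = np.histogram2d(x_warped[:, 0], x_warped[:, 1], bins=bins,
                             range=[[-10, 10], [-10, 10]])
    return I.var()

# A strong contraction piles events up near the image center, inflating
# the contrast even though no scene edge is being aligned.
g_identity = contrast(0.0)   # no warp: events stay uniformly spread
g_collapse = contrast(0.95)  # large contraction: events accumulate centrally
```

Since the collapsed configuration scores higher than the undeformed one, an unregularized maximization of (4) is drawn toward it, which motivates the metrics and regularizer introduced in the following subsections.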

**Figure 3.** *Point trajectories* (streamlines) defined on *x* − *y* − *t* image space by various warps. (**a**) Zoom in/out warp from image center (1 DOF). (**b**) Constant image velocity warp (2 DOF). (**c**) Rotational warp around *X* axis (3 DOF).
