**1. Introduction**

Event cameras, as a kind of bio-inspired sensor, trigger events at each pixel independently and asynchronously in response to changes in scene brightness. Compared with standard cameras, they output an event flow formed by the coordinates on the image plane and the time when an event happens, namely *x*, *y*, and *t* (at the μs or ns level). Another characteristic of event flow is polarity, which indicates whether the brightness increased or decreased. Whether polarity information is available depends on the manufacturer. For example, the DAVIS 240C from iniVation provides polarity information, while the IMX636 from Sony/Prophesee does not.
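
For concreteness, a single event can be modeled as a small record. The following minimal Python sketch illustrates the tuple described above; the field names are our own illustration, not a vendor API:

```python
from dataclasses import dataclass

@dataclass
class Event:
    x: int    # column on the image plane
    y: int    # row on the image plane
    t: int    # timestamp (microsecond- or nanosecond-level resolution)
    p: int    # polarity: +1 brightness increase, -1 decrease, 0 if unavailable

# Example: an ON event at pixel (120, 45), one second into the stream (in us)
e = Event(x=120, y=45, t=1_000_000, p=+1)
```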

Event cameras are characterized by low latency, high dynamic range, and low power consumption. Because their characteristics differ from those of traditional cameras, event cameras open up a new paradigm for a series of tasks such as VO (Visual Odometry), SLAM (Simultaneous Localization and Mapping), and SfM (Structure from Motion). Feature tracking on event cameras is a fundamental step toward the maturity of these practical applications and has attracted the interest of a wide range of researchers [1,2].

Although many asynchronous feature-tracking methods have been proposed, template matching on event frames or event patches is still a major way to process event information, especially for high-level tasks [3]. By accumulating a certain number of events or calculating the significance of each incoming event, the contours of objects in the scene are formed in an image-like frame, which is closely related to the motion of the carrier.

Practical feature tracking on event frames still faces challenges in efficiency and accuracy. Efficiency is tied to the event rate, which depends on carrier motion, the scene (e.g., dynamic objects), texture, etc. Event rates may vary significantly, so the frequency of event frames may be very high, increasing the computational difficulty of keeping a sufficient number of features tracked.

Event flow is sparse, asynchronous, and noisy, and it only represents brightness changes; thus, event-to-frame transformation suffers from a low signal-to-noise ratio and "texture" loss. Therefore, feature tracking that relies purely on event flow may drift, deteriorating the accuracy and robustness of high-level tasks.

Relying purely on event flow leads to another problem: when the carrier moves slowly, the time interval between event frames may be larger than that between intensity images, and the frequency of positioning or mapping solutions would then fail to meet users' requirements. For example, when a drone is slowly performing autonomous exploration in an unknown environment, timely positioning results are still needed for motion and path planning. Intensity images and inertial information can therefore still provide timely updates as basic support for high-level tasks.

From the perspective of bionics, research on how animals process external information shows that different parts of the brain handle different senses with different levels of attention, jointly supporting decisions on action and judgment. Visual information is no exception: it is highly related to events, as events only occur when brightness changes, yet global views of the scene and inertial information are still sensed and processed by the brain latently. Therefore, the fusion of event, intensity, and inertial information is supported by bionic research [4,5].

Several event cameras, such as the DAVIS 346 and CeleX-4, now provide normal intensity images as well as angular velocity and acceleration from an embedded IMU (Inertial Measurement Unit), making it feasible to use complementary information for feature tracking.

The advantages of multi-sensor fusion bring the potential to overcome the accuracy and efficiency problems of purely event-based methods. Therefore, a new feature-tracking method on event frames is proposed in this paper. Its novelty can be summarized as follows:


The rest of the paper is organized as follows: Section 2 reviews feature-detection and feature-tracking methods on event cameras. Section 3 first illustrates the data stream that the proposed method deals with and gives a brief description of WF-MHT-BP; Section 3.1 then presents the inertial-based rotation prediction, which acts as the prior for the subsequent feature-tracking steps. The method for generating event frames is introduced in Section 3.2. After that, MHT-BP, the tracking method relying purely on event frames, is introduced in Section 3.3. Section 3.4 presents the weighted integration of the feature-tracking solutions from event frames and intensity frames. In Section 4, WF-MHT-BP is compared, in terms of accuracy and efficiency, with two methods implemented in MATLAB and with EKLT (Event-based Lucas–Kanade Tracker), implemented in C++, and with EKLT in terms of feature-tracking length. Section 5 concludes the paper and gives directions for future work.

#### **2. Related Work**

Feature tracking is an active research field in which a number of algorithms have been proposed. Traditional feature tracking on intensity images can be divided into feature-matching-based methods and template-based tracking methods [6], with SIFT (Scale-Invariant Feature Transform) [7] and KLT [8,9] as representative methods, respectively. Recently, many deep-learning-based algorithms, such as SuperPoint [10] and D2Net [11], have been proposed to improve the number and robustness of feature matches. However, the efficiency of deep-learning-based methods is an obstacle for practical applications, especially on mobile devices.

Due to the different characteristics of event cameras, feature tracking on event flow follows different paradigms. A practical way is to convert event flow into event frames. Usually, events are collected within a temporal window to form event frames, and then traditional feature-tracking paradigms can be applied [12,13]. To improve efficiency and accuracy, different event-to-frame transformations and feature-tracking methods have been proposed [14].

Event-to-frame transformation is the first step of event-frame-based feature tracking. The time surface (TS) is a global 2D surface that uses an exponential decay kernel [15] to emphasize recently triggered events. Another global method is the Event Map, proposed by Zhu et al. [12], which projects the events in a selected spatio-temporal window directly onto frames. The surface of active events (SAE) is a local representation of the 3D spatio-temporal domain that keeps the most recent event at each pixel [16]. Normally, feature detection on TS or SAE is more accurate than on direct projections, as recent events produce a larger response. However, the computational complexity of direct methods is much lower than that of TS or SAE. Moreover, TS and SAE need more memory, as each pixel of the event frame must store at least a floating-point value.
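
To make these representations concrete, the sketch below builds an SAE-style buffer of per-pixel latest timestamps and turns it into a time surface with an exponential decay kernel. It is a minimal Python illustration; the decay constant `tau` and the sensor size are assumptions for this sketch, not values from [15,16]:

```python
import numpy as np

def time_surface(events, t_now, height, width, tau=50e3):
    """Exponential-decay time surface at query time t_now (timestamps in us).

    events: iterable of (x, y, t) with t <= t_now.
    tau:    decay constant in microseconds (illustrative choice).
    """
    # SAE-style buffer: keep only the most recent event timestamp per pixel
    last_t = np.full((height, width), -np.inf)
    for x, y, t in events:
        last_t[y, x] = max(last_t[y, x], t)

    # Recent events decay less and therefore respond more strongly
    surface = np.exp((last_t - t_now) / tau)
    surface[np.isneginf(last_t)] = 0.0  # pixels that never fired
    return surface
```

A direct projection in the Event Map style would instead simply accumulate a per-pixel event count, which is cheaper to compute and can be stored as integers, matching the complexity and memory trade-off noted above.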

A number of event-camera-based feature-detection and feature-tracking algorithms focusing on improving accuracy and efficiency have been proposed. Li et al. [17] proposed the SAE-based FA-Harris corner-detection algorithm, which operates directly on asynchronous events instead of event frames. Alzugaray et al. [18] proposed the Arc\* detector based on modified SAE filtering and subsequently proposed HASTE (multi-Hypothesis Asynchronous Speeded-up Tracking of Events), which tracks features purely on an asynchronous patch using multiple hypotheses [19]. Tedaldi et al. [13] detected Harris features in intensity images and then used the ICP (Iterative Closest Point) method to establish correspondences. Zhu et al. [12] proposed an affine-transformation-based Expectation-Maximization (EM) algorithm to align two patches in consecutive event frames.

The fusion of event frames and intensity images benefits feature tracking. Gehrig et al. [20] proposed an event-camera-based tracker that optimizes brightness-increment differences between intensity images and event flow. Dong et al. [21] proposed a template-based feature-tracking method to improve robustness, predicting feature-tracking solutions with events and using intensity to correct them. The computational burden of these methods cannot be ignored due to their high algorithmic complexity; it is observed that only a few features are tracked in these algorithms to ensure real-time performance, which limits their applicability to high-level tasks.

Some high-level applications achieve feature tracking by reconstructing 3D geometry. Usually, the 3D coordinates of features act as prior information for feature tracking, as they can be projected onto the image plane with predicted poses. Zhou et al. [3,22] tracked the pose of a stereo event camera and reconstructed 3D environments by minimizing a spatio-temporal energy. Liu et al. [23] proposed a spatio-temporal registration algorithm as part of an event-camera-based pose-estimation method. However, the performance of feature tracking and the quality of high-level tasks (e.g., pose estimation, scene reconstruction) are closely coupled, multiple factors affect feature-tracking accuracy, and the computational burden grows with algorithm complexity.

In summary, feature tracking on event frames has efficiency and accuracy problems: (1) tracking efficiency is low due to the characteristics of event cameras and of the designed algorithms, and the number of trackable feature points is small, which affects the stability of high-level tasks; (2) tracking features purely on events easily causes accuracy problems. Although multi-sensor-fusion-based feature tracking has been proposed, the efficiency and accuracy problems still need to be further explored with all available information.

#### **3. Methodology**

The incoming data stream of an event-camera-based localization or mapping system is illustrated in Figure 1 as the basic input assumption of the proposed method. Event flow, intensity images, and inertial information are received asynchronously. Note that the polarity information of the event flow is not required by the proposed method. Normally, intensity images and inertial information arrive at equal time intervals. If there are no dynamic objects in the scene, the event flow mainly reflects carrier motion, and the "frequency" of event frames is not uniform if a constant number of events is collected per frame. Therefore, the proposed method cannot predict the source of the next input.

**Figure 1.** Different frequencies of intensity images, event frames, and inertial updates.
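
To make the uneven event-frame rate concrete, the following sketch slices a stream into constant-count groups; their temporal spans, and hence the resulting frame rate, vary with motion and texture. The event count `n_events` is an assumed parameter for illustration:

```python
def slice_by_count(event_stream, n_events=5000):
    """Yield (events, t_start, t_end) groups containing a constant number of events.

    Under fast motion the span t_end - t_start shrinks (high frame rate);
    under slow motion it stretches, so event-frame timing is unpredictable.
    """
    buf = []
    for ev in event_stream:      # ev = (x, y, t), as in Figure 1's event flow
        buf.append(ev)
        if len(buf) == n_events:
            yield buf, buf[0][2], buf[-1][2]
            buf = []
```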

Another characteristic of the data stream is that features are only detected on intensity images. Since each intensity frame is paired with an event frame (see Section 3.2), the features to be tracked are projected directly onto the event frame at the same positions.

The implementation of the proposed method is illustrated in Figure 2. Firstly, the IMU provides the angular velocity *wt* for feature-rotation prediction, which acts as the prior for KLT tracking and MHT-BP. Secondly, event flow is accumulated to generate the event frame *Ft*. The feature-detection module provides the positions of features for initialization, and features are re-detected whenever the number or the distribution score [24] of features falls below a threshold. Thirdly, with the assistance of the rotation prediction from the IMU, the method applies MHT-BP to the event frame *Ft* and KLT tracking to the intensity image *It*; the weighted fusion mechanism then fuses their solutions. Depending on the application, the tracking solution is output either by MHT-BP or by the weighted fusion module. After the feature-tracking process is over, the method automatically adds newly detected features if the number or the distribution score [24] of features is not large enough. The Shi–Tomasi corner [25] was chosen as the feature to be tracked; note that the proposed method is not limited to this feature type and can be extended to other features, such as FAST [26].

**Figure 2.** Flowchart of the data stream and the proposed method, WF-MHT-BP.
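
The flow in Figure 2 can be summarized as a dispatch over the three asynchronous input types. The sketch below is schematic only; every helper is a stub standing in for the corresponding module of WF-MHT-BP, not actual implementation code, and the threshold value is an assumption:

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class TrackerState:
    R: np.ndarray = field(default_factory=lambda: np.eye(3))  # rotation prior
    tracks: list = field(default_factory=list)                # feature tracks

# Stubs standing in for the modules of Figure 2 (hypothetical names).
def integrate_gyro(R, w, dt): return R          # Section 3.1 (stub)
def build_event_frame(events): return events    # Section 3.2 (stub)
def mht_bp(frame, tracks, R): return tracks     # Section 3.3 (stub)
def klt_track(image, tracks, R): return tracks  # KLT on intensity (stub)
def weighted_fusion(a, b): return a             # Section 3.4 (stub)
def detect_shi_tomasi(image): return []         # feature (re-)detection (stub)

MIN_NUM = 50  # assumed minimum feature count before re-detection

def process_input(kind, payload, state: TrackerState) -> TrackerState:
    """Schematic dispatch for IMU, event, and intensity inputs (cf. Figure 2)."""
    if kind == "imu":
        state.R = integrate_gyro(state.R, payload["w"], payload["dt"])
    elif kind == "events":
        frame = build_event_frame(payload["events"])
        state.tracks = mht_bp(frame, state.tracks, state.R)
    elif kind == "image":
        klt_tracks = klt_track(payload["image"], state.tracks, state.R)
        state.tracks = weighted_fusion(state.tracks, klt_tracks)
        if len(state.tracks) < MIN_NUM:  # too few features: re-detect corners
            state.tracks += detect_shi_tomasi(payload["image"])
    return state
```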

#### *3.1. Inertial-Aided Rotation Prediction*

According to the work of [27,28], the relationship between tracked features in two consecutive images can be approximated by Equation (1) when the translation is small relative to the scene depth.

$$\mathbf{u}_{pred} = \mathbf{K}\mathbf{R}\mathbf{K}^{-1}\mathbf{u}_{last} \tag{1}$$

where $\mathbf{K}$ is the camera intrinsic matrix, and $\mathbf{u}_{last}$ and $\mathbf{u}_{pred}$ are the 2D (homogeneous) positions of the last tracked feature and the predicted feature, respectively. The rotation matrix $\mathbf{R}$ is integrated from the angular velocity *wt* [27], as shown in Figure 2. $\mathbf{u}_{pred}$ acts as the predicted feature position for tracking.
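
A minimal sketch of Equation (1) follows: $\mathbf{R}$ is integrated from gyroscope samples (here with a simple Rodrigues/exponential-map step, one possible choice rather than necessarily the integration scheme of [27]), and the feature position is warped through the intrinsics. The intrinsic matrix and all numeric values are illustrative assumptions:

```python
import numpy as np

def so3_exp(w, dt):
    """Rotation matrix from angular velocity w (rad/s) over dt (s), via Rodrigues."""
    theta = np.linalg.norm(w) * dt
    if theta < 1e-12:
        return np.eye(3)
    k = w / np.linalg.norm(w)                     # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])              # skew-symmetric matrix of k
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def predict_feature(u_last, K_intr, R):
    """u_pred = K R K^{-1} u_last (Equation (1)), in homogeneous pixel coordinates."""
    u_h = np.array([u_last[0], u_last[1], 1.0])
    u_pred = K_intr @ R @ np.linalg.inv(K_intr) @ u_h
    return u_pred[:2] / u_pred[2]                 # back to 2D pixel coordinates

# Illustrative usage: integrate several gyro samples between two frames,
# then predict where a feature at (120, 45) should appear.
K_intr = np.array([[200.0, 0.0, 173.0],
                   [0.0, 200.0, 130.0],
                   [0.0, 0.0, 1.0]])              # assumed intrinsics
R = np.eye(3)
for w, dt in [(np.array([0.0, 0.0, 0.5]), 0.005)] * 4:   # 20 ms of gyro data
    R = R @ so3_exp(w, dt)
print(predict_feature((120.0, 45.0), K_intr, R))
```

The predicted position then seeds both KLT tracking on the intensity image and MHT-BP on the event frame, as described in the overview above.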
