#### *3.4. Weighted Fusion Using Event/Intensity Information*

MHT-BP is essentially a template-matching method that uses event information only, and it therefore suffers from drift. Because event frames arrive at a much higher rate than intensity images, the mature KLT tracking results between intensity frames can guide the establishment of correspondences between event frames. A constant-velocity model is used to re-predict the tracking solutions, and a weighted fusion method is proposed to correct the drift and thus improve accuracy. This design supports two modes: real-time processing and post-processing. For real-time processing, features can be obtained directly from MHT-BP. For post-processing tasks, such as bundle adjustment (BA) and SfM, poses and structure computed from the features tracked on event frames can still be obtained in a delayed manner.

As illustrated in Figure 4, the first row is an event-frame sequence and the second row is an intensity-image sequence. The black dots and blue dots are detected features that share the same position. MHT-BP provides tracking solutions on each event frame. Due to the blurring effects of event frames, the tracking solutions from MHT-BP suffer from drift, and this drift (represented by the uncertainties shown in Figure 4) increases with time. However, since KLT tracking on the intensity sequence uses texture information, its drift grows more slowly than that of MHT-BP in our tests. Therefore, the weighted fusion mechanism can reduce the drift error.

Firstly, the stochastic models of MHT-BP and KLT are expressed by σ*F* and σ*I*, respectively, as in Equations (4) and (5). The uncertainty is related to time and tracking quality.

$$
\sigma_F^2 = \beta \, \Delta t \tag{4}
$$

$$
\sigma_I^2 = \delta \, \Delta t \tag{5}
$$

where Δ*t* is the time elapsed since the timestamp of the first intensity image, and β and δ are drift rates in pixels per second that adjust the weights for fusing the two solutions. These parameters need to be defined reasonably and empirically.
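As a minimal sketch, the stochastic model of Equations (4) and (5) can be written as a single helper; the drift-rate values below are placeholders for illustration, not the settings from Table 1.

```python
# Hedged sketch of Equations (4) and (5): the drift variance grows linearly with
# the time elapsed since the first intensity image. beta/delta are placeholders.
def drift_variances(dt, beta=2.0, delta=0.5):
    sigma_f_sq = beta * dt    # variance of the MHT-BP (event-frame) solution, Eq. (4)
    sigma_i_sq = delta * dt   # variance of the KLT (intensity) solution, Eq. (5)
    return sigma_f_sq, sigma_i_sq
```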

Secondly, the optical-flow velocity from KLT tracking is assumed to be constant, following the small-motion assumption. If event frames exist between two intensity images, the feature-tracking solution can be linearly estimated on virtual intensity frames such as I*t*−Δ*t*1 and I*t*−Δ*t*2 (see the dotted borders in the second row of Figure 4) using Equations (6) and (7).

$$x_{KLT}^{t_i} = \Delta x \cdot \frac{t_i}{\Delta t} \tag{6}$$

$$y_{KLT}^{t_i} = \Delta y \cdot \frac{t_i}{\Delta t} \tag{7}$$

where Δx and Δy represent the feature displacements between I*t*−1 and I*t* along the x and y axes, respectively, Δt is the time period between I*t*−1 and I*t*, and *ti* is the time elapsed since *t* − 1.
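Equations (6) and (7) give the scaled displacement; adding it to the feature's position in I*t*−1 yields the KLT prediction used in the fusion step. A minimal sketch, with illustrative names, is:

```python
# Constant-velocity prediction per Equations (6) and (7): the KLT displacement
# between I_{t-1} and I_t is scaled linearly to an intermediate event-frame time.
def klt_prediction(x_prev, y_prev, dx, dy, t_i, dt):
    """x_prev, y_prev: feature position in I_{t-1};
    dx, dy: KLT displacement from I_{t-1} to I_t (pixels);
    t_i: time elapsed since I_{t-1}; dt: time between I_{t-1} and I_t."""
    ratio = t_i / dt
    return x_prev + dx * ratio, y_prev + dy * ratio
```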

Thirdly, with the above-mentioned stochastic model and constant-velocity model, the weighted fusion is conducted using Equations (8) and (9). As shown in Figure 4, the red dots in the last row show the result of the weighted fusion.

$$x_{Fuse}^{t_i} = \frac{\sigma_I^2 \cdot x_{Event}^{t_i} + \sigma_F^2 \cdot x_{KLT}^{t_i}}{\sigma_I^2 + \sigma_F^2} \tag{8}$$

$$y_{Fuse}^{t_i} = \frac{\sigma_I^2 \cdot y_{Event}^{t_i} + \sigma_F^2 \cdot y_{KLT}^{t_i}}{\sigma_I^2 + \sigma_F^2} \tag{9}$$
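A minimal sketch of this fusion step, assuming standard inverse-variance weighting with the variances of Equations (4) and (5); all names are illustrative:

```python
# Weighted fusion per Equations (8) and (9): each solution is weighted by the
# other's drift variance, so the less uncertain estimate dominates.
def fuse_position(x_event, y_event, x_klt, y_klt, sigma_f_sq, sigma_i_sq):
    s = sigma_f_sq + sigma_i_sq
    x_fused = (sigma_i_sq * x_event + sigma_f_sq * x_klt) / s
    y_fused = (sigma_i_sq * y_event + sigma_f_sq * y_klt) / s
    return x_fused, y_fused
```

For example, with σ*F*² = 4 and σ*I*² = 1, the fused position lies four times closer to the KLT prediction than to the MHT-BP solution.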

The weighted fusion of the MHT-BP and KLT solutions is named WF-MHT-BP. Like a typical feature-tracking algorithm, it detects new features after processing an event frame whenever the number of tracked features or their distribution score falls below the corresponding threshold.

**Figure 4.** Illustration of the weighted fusion method for MHT-BP and KLT.

The proposed WF-MHT-BP is summarized in Algorithm 1. Given the data-stream structure shown in Figure 1, the inertial information is first used to predict the rotation of tracked features. Then, when an event frame arrives, MHT-BP uses only event information to generate tracking solutions; when an intensity image arrives, WF-MHT-BP provides tracking solutions by fusing the KLT and MHT-BP solutions. It should be noted that both MHT-BP and the proposed algorithm can output tracking solutions whenever they are needed.

**Algorithm 1:** Feature-tracking method based on integration of event, intensity and inertial information (WF-MHT-BP)
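As a rough, non-authoritative illustration of this control flow, the sketch below dispatches a merged, time-ordered stream of IMU samples, event frames, and intensity images; every class and function name is a placeholder for the corresponding step described above, not an identifier from the released implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Measurement:
    kind: str       # "imu", "event_frame", or "intensity"
    t: float        # timestamp in seconds
    data: object = None

def predict_with_imu(features, m):    # placeholder: rotate predictions using gyroscope data
    return features

def mht_bp_track(features, m):        # placeholder: template matching on the event frame
    return features

def klt_track_and_fuse(features, m):  # placeholder: KLT tracking + weighted fusion, Eqs. (4)-(9)
    return features

def redetect_if_needed(features, m, min_num=40, min_score=0.15):
    return features                   # placeholder: Shi-Tomasi redetection when tracks run low

def wf_mht_bp(stream: Iterable[Measurement], features: List):
    for m in sorted(stream, key=lambda s: s.t):
        if m.kind == "imu":
            features = predict_with_imu(features, m)
        elif m.kind == "event_frame":
            features = mht_bp_track(features, m)
        elif m.kind == "intensity":
            features = klt_track_and_fuse(features, m)
            features = redetect_if_needed(features, m)
        yield m.t, features           # solutions can be queried whenever they are needed
```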


#### **4. Experiments**

The experiments use 16 datasets from the event-camera dataset publicly provided by the University of Zurich [29], which was recorded with a DAVIS 240C from Inivation, Zurich, Switzerland. It provides event streams, intensity images, and IMU measurements. The resolution of the intensity images is 240 × 180. WF-MHT-BP runs on the MATLAB platform on a computer with an Intel Core i5-10400F CPU and 16 GB of memory.

The input parameters are summarized in Table 1. τ*FixNum* is set to 3000, i.e., an event frame is formed once 3000 events have arrived. α, θ, Δx, and Δy define the search range of the affine transformation illustrated in Equation (3). β and δ are the drift rates in pixels per second used in Equations (4) and (5). Θ and Θ*sco* are set to 40 and 0.15, respectively: if the number of tracked features falls below 40 or their distribution score falls below 0.15, Shi-Tomasi detection is performed to add newly detected features.
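As an illustration of the τ*FixNum* setting, a fixed-count event-frame builder might look like the following sketch; the per-pixel count rendering is an assumption for illustration, since the text only specifies the fixed number of events per frame.

```python
import numpy as np

TAU_FIX_NUM = 3000          # events per frame, as in Table 1
WIDTH, HEIGHT = 240, 180    # DAVIS 240C resolution

def event_frames(events):
    """events: iterable of (t, x, y, polarity) tuples in time order."""
    buf = []
    for ev in events:
        buf.append(ev)
        if len(buf) == TAU_FIX_NUM:
            frame = np.zeros((HEIGHT, WIDTH), dtype=np.uint16)
            for _, x, y, _ in buf:
                frame[int(y), int(x)] += 1   # accumulate event counts per pixel
            yield buf[-1][0], frame          # timestamp of the last event + the frame
            buf = []
```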

**Table 1.** Input parameters for the proposed method.


The state-of-the-art baselines are chosen from open-source event-frame-based feature-tracking methods. One is the probabilistic data-association-based tracking (PDAT) method proposed by Zhu et al. [12]; its publicly available implementation is used. The other is an ICP-based feature-tracking algorithm (High Temporal Resolution Tracking, HTRT) based on the work of Tedaldi et al. [13]. Since no original implementation is provided by the authors, a third-party implementation is used (https://github.com/thomasjlew/davis_tracker, accessed on 29 February 2022). Two modifications are made to improve its performance for this comparison: the maximum number of ICP iterations is set to 3, and the feature detector is changed to Shi-Tomasi, the same as in WF-MHT-BP.

In addition to the above methods, which use event information only, EKLT (C++ implementation), which integrates event and intensity information, is also compared in terms of accuracy, efficiency, and feature-tracking length. Since the proposed algorithm is implemented in MATLAB, which is normally slower than C++, the performance of EKLT is listed as a reference.

Firstly, MHT-BP, the internal component of WF-MHT-BP, is compared with an open-source template-matching method to show its improved efficiency and comparable accuracy. Then, WF-MHT-BP is compared with PDAT, HTRT, and EKLT in tracking accuracy and efficiency. Finally, the feature-tracking length is compared between EKLT and WF-MHT-BP.

#### *4.1. Feature Tracking Accuracy and Comparison between MHT-BP and FasT-Match*

The goal of feature matching on event frames is to find the affine transformation, within small ranges, between consecutive frames. FasT-Match [30] was chosen as the baseline for comparing efficiency and patch-matching error because it pursues the same goal as MHT-BP and is a template-patch-matching method with a similar control flow. The differences between MHT-BP and FasT-Match are: (1) MHT-BP uses a simplified affine transformation model, whereas FasT-Match uses a more complex six-parameter transformation model; (2) batch processing is used in MHT-BP, whereas FasT-Match has no such mechanism; and (3) FasT-Match uses a branch-and-bound search strategy to find the transformation parameters, whereas MHT-BP does not use strict termination conditions, in order to improve efficiency. FasT-Match is implemented in MATLAB, the same as MHT-BP. The "shapes_rotation" sequence was chosen for the efficiency and template-matching-error comparison.
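To make the contrast with FasT-Match's six-parameter model concrete, a brute-force search over a simplified four-parameter model (scale α, rotation θ, translation Δx, Δy) could look like the sketch below; the grid ranges, step sizes, and sum-of-absolute-differences score are illustrative assumptions, not the actual settings of MHT-BP.

```python
import numpy as np
import cv2

def match_patch(template, frame, cx, cy,
                alphas=np.linspace(0.95, 1.05, 3),
                thetas=np.linspace(-5.0, 5.0, 5),   # rotation candidates in degrees
                shifts=range(-3, 4)):               # translation candidates in pixels
    """Align a patch from the previous event frame to `frame` near (cx, cy)."""
    h, w = template.shape
    best_err, best_params = np.inf, None
    for a in alphas:
        for th in thetas:
            # scale + rotation about the patch centre (2x3 affine matrix)
            M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), th, a)
            warped = cv2.warpAffine(template, M, (w, h))
            for dx in shifts:
                for dy in shifts:
                    x0 = int(round(cx + dx - w / 2.0))
                    y0 = int(round(cy + dy - h / 2.0))
                    if x0 < 0 or y0 < 0:
                        continue   # candidate window falls outside the frame
                    target = frame[y0:y0 + h, x0:x0 + w]
                    if target.shape != warped.shape:
                        continue
                    err = np.mean(np.abs(warped.astype(float) - target.astype(float)))
                    if err < best_err:
                        best_err, best_params = err, (a, th, dx, dy)
    return best_params, best_err
```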

A smaller average normalized patch error (NPE) indicates higher similarity between two patches. Figure 5 shows the cumulative distribution functions (CDFs) of the NPE and of the consumed time. The NPE curves of MHT-BP and FasT-Match are very similar: the average NPE of FasT-Match and MHT-BP is 0.041 and 0.049, respectively, so both matching errors are very small and the difference is negligible relative to the error range (0, 1). The mean consumed time of MHT-BP and FasT-Match is 3.53 ms and 80.40 ms, respectively, so MHT-BP is more than an order of magnitude faster than FasT-Match. The acceptable error and the reduced computational complexity stem from MHT-BP's batch processing and its four-parameter affine transformation. Although the majority of normalized patch errors of FasT-Match are slightly lower than those of MHT-BP, and the matching error of MHT-BP is around 19% higher than that of FasT-Match (0.049 vs. 0.041), this error is corrected in a timely manner by KLT tracking in the subsequent step of WF-MHT-BP. Therefore, MHT-BP is chosen as the internal component of WF-MHT-BP.
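For reference, one common way to obtain a patch error in the (0, 1) range is to rescale both patches to [0, 1] and take the mean absolute difference; the exact normalization used here may differ, so the sketch below only illustrates the scale of the metric.

```python
import numpy as np

def normalized_patch_error(patch_a, patch_b):
    a = patch_a.astype(float)
    b = patch_b.astype(float)
    a = (a - a.min()) / (np.ptp(a) + 1e-9)   # rescale patch A to [0, 1]
    b = (b - b.min()) / (np.ptp(b) + 1e-9)   # rescale patch B to [0, 1]
    return float(np.mean(np.abs(a - b)))     # mean absolute difference in (0, 1)
```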

**Figure 5.** CDF of normalized patch error and consumed time for MHT-BP and FasT-Match.

#### *4.2. Feature-Tracking Accuracy Comparison*

Figure 6 shows feature-tracking solutions from the proposed method. The traces of the tracked features are projected onto the first intensity image to show the matching results. Feature-tracking accuracy is compared on the 16 datasets, which cover different lighting conditions, objects, and motions.

**Figure 6.** Illustration of feature-tracking solutions from WF-MHT-BP (each subgraph shows the trajectories of tracked features projected onto the first intensity image).

Both MHT-BP and WF-MHT-BP generate tracking solutions. The solutions between frames are filtered by fundamental-matrix-based RANSAC (RANdom SAmple Consensus), with the Sampson distance threshold set to 0.1. The inlier ratio is used as the indicator of tracking accuracy. PDAT and HTRT are evaluated in the same way. However, EKLT corrects its event-based tracking solutions whenever an intensity frame arrives, so for EKLT the inlier ratio between intensity frames is used.
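As a sketch of this filtering step, the inlier ratio can be obtained from OpenCV's fundamental-matrix RANSAC; the 0.1 Sampson threshold follows the text, while the confidence level and minimum match count below are illustrative choices.

```python
import numpy as np
import cv2

def inlier_ratio(pts_prev, pts_curr, threshold=0.1, confidence=0.99):
    """pts_prev, pts_curr: Nx2 arrays of matched feature positions in two frames."""
    pts_prev = np.asarray(pts_prev, dtype=np.float32)
    pts_curr = np.asarray(pts_curr, dtype=np.float32)
    if len(pts_prev) < 8:
        return 0.0                       # too few correspondences for the model
    _, mask = cv2.findFundamentalMat(pts_prev, pts_curr,
                                     cv2.FM_RANSAC, threshold, confidence)
    if mask is None:
        return 0.0                       # RANSAC failed to find a model
    return float(mask.sum()) / len(pts_prev)
```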

For each dataset, the average inlier ratio is calculated as shown in Table 2. Note that the inliers from RANSAC are not fed back into the next feature-tracking step. For each method, scenes with different lighting, object, and motion settings yield different average inlier ratios. It is interesting that the inlier ratio of HTRT is around 50% for all datasets. The inlier ratio of PDAT ranges from 38.19% in the "hdr_poster" scenario to 80.80% in the "shapes_translation" scenario. In the "boxes"-, "hdr"-, and "poster"-related scenarios, the inlier ratio of PDAT drops sharply, showing the difficulty posed by these environmental factors.


**Table 2.** Inlier ratios of PDAT/HTRT/EKLT/WF-MHT-BP methods.

The inlier ratio of EKLT is much better than those of PDAT and HTRT, reaching 88.26% on average. WF-MHT-BP achieves the highest inlier ratio on all datasets. The main differences between EKLT and WF-MHT-BP are: (1) EKLT uses gradients to track features on event frames without batch processing, whereas MHT-BP in the proposed method tracks features with a four-parameter affine transformation in a batch-processing manner; (2) WF-MHT-BP uses inertial information to predict feature positions, while EKLT does not; and (3) in EKLT, KLT only corrects drift when intensity images arrive, whereas WF-MHT-BP uses a simple and efficient fusion mechanism to correct the tracking solutions from MHT-BP and the current positions of tracked features. The main reason for the improvement is therefore that WF-MHT-BP integrates all available information for feature tracking, making the tracking solutions more accurate.
