#### *2.2. Motion Model of the Camera–IMU System*

A 9-dimensional state was constructed to represent the motion of the camera–IMU system. The state vector is defined as follows:

$$\mathbf{x} = \begin{bmatrix} {}^{\rm I}_{\rm SI}\boldsymbol{\rho}^{\rm T} & {}^{\rm SI}\mathbf{p}^{\rm T} & {}^{\rm SI}\mathbf{v}^{\rm T} \end{bmatrix}^{\rm T} \tag{1}$$

where ${}^{\rm I}_{\rm SI}\boldsymbol{\rho}$ is the modified Rodrigues parameter (MRP) representing the rotation of the ICS relative to the SICS; ${}^{\rm SI}\mathbf{p}$ is the position of the origin of the ICS in the SICS; and ${}^{\rm SI}\mathbf{v}$ is the velocity of the origin of the ICS relative to the SICS. A discrete motion model was constructed according to the six-degree-of-freedom kinematic model of a rigid body as follows:

$$\mathbf{x}_{k} = \mathbf{F}\mathbf{x}_{k-1} + \mathbf{G}(\mathbf{x}_{k-1})\mathbf{u}_{k-1} + \mathbf{C} \tag{2}$$

where $\mathbf{u}_{k-1} = \begin{bmatrix} {}^{\rm I}\boldsymbol{\omega}_{k-1}^{\rm T} & {}^{\rm I}\mathbf{a}_{k-1}^{\rm T} \end{bmatrix}^{\rm T}$ is the input vector of the navigation system; ${}^{\rm I}\boldsymbol{\omega}_{k-1}$ and ${}^{\rm I}\mathbf{a}_{k-1}$ are the measurements of the gyroscope and the accelerometer at time $k-1$, respectively; $\mathbf{C} = \begin{bmatrix} \mathbf{0}_{1\times 6} & \Delta t\,{}^{\rm SI}\mathbf{g}^{\rm T} \end{bmatrix}^{\rm T}$ is a constant vector; $\Delta t$ is the sampling period; ${}^{\rm SI}\mathbf{g} = {}^{\rm SI}_{\rm E}\mathbf{R}\,{}^{\rm E}\mathbf{g}$ is the gravity acceleration vector in the SICS; ${}^{\rm SI}_{\rm E}\mathbf{R}$ is the transformation matrix from the earth-fixed coordinate system to the SICS, whose initial value is determined by a simple attitude and heading reference system (AHRS) [32]; and $\mathbf{F}$ and $\mathbf{G}(\mathbf{x}_{k-1})$ are the state-transition matrix and the input matrix, respectively, formulated as follows:

$$\mathbf{F} = \begin{bmatrix} \mathbf{I}_3 & \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{I}_3 & \Delta t\,\mathbf{I}_3 \\ \mathbf{0}_3 & \mathbf{0}_3 & \mathbf{I}_3 \end{bmatrix}, \qquad \mathbf{G}(\mathbf{x}_{k-1}) = \begin{bmatrix} \Delta t\,\mathbf{A}\!\left({}^{\rm I}_{\rm SI}\boldsymbol{\rho}_{k-1}\right) & \mathbf{0}_3 \\ \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_3 & \Delta t\,{}^{\rm I}_{\rm SI}\mathbf{R}\!\left({}^{\rm I}_{\rm SI}\boldsymbol{\rho}_{k-1}\right)^{\rm T} \end{bmatrix} \tag{3}$$

where $\mathbf{I}_3$ is the three-dimensional identity matrix and $\mathbf{0}_3$ is the three-dimensional zero matrix. $\mathbf{A}\!\left({}^{\rm I}_{\rm SI}\boldsymbol{\rho}_{k-1}\right)$ and ${}^{\rm I}_{\rm SI}\mathbf{R}\!\left({}^{\rm I}_{\rm SI}\boldsymbol{\rho}_{k-1}\right)$ are formulated as follows; the subscript $k-1$ and the prescripts 'SI' and 'I' on the right-hand sides are omitted to simplify the notation:

$$\mathbf{A}\!\left({}^{\rm I}_{\rm SI}\boldsymbol{\rho}_{k-1}\right) = \frac{1}{4}\begin{bmatrix} \rho_1^2 - \rho_2^2 - \rho_3^2 + 1 & 2(\rho_1\rho_2 - \rho_3) & 2(\rho_1\rho_3 + \rho_2) \\ 2(\rho_1\rho_2 + \rho_3) & \rho_2^2 - \rho_1^2 - \rho_3^2 + 1 & 2(\rho_2\rho_3 - \rho_1) \\ 2(\rho_1\rho_3 - \rho_2) & 2(\rho_2\rho_3 + \rho_1) & \rho_3^2 - \rho_1^2 - \rho_2^2 + 1 \end{bmatrix} \tag{4}$$

$${}^{\rm I}_{\rm SI}\mathbf{R}\!\left({}^{\rm I}_{\rm SI}\boldsymbol{\rho}_{k-1}\right) = \begin{bmatrix} 1 - \dfrac{8\left(\rho_2^2 + \rho_3^2\right)}{\left(|\boldsymbol{\rho}|^2 + 1\right)^2} & -\dfrac{4\left(|\boldsymbol{\rho}|^2\rho_3 - 2\rho_1\rho_2 - \rho_3\right)}{\left(|\boldsymbol{\rho}|^2 + 1\right)^2} & \dfrac{4\left(|\boldsymbol{\rho}|^2\rho_2 + 2\rho_1\rho_3 - \rho_2\right)}{\left(|\boldsymbol{\rho}|^2 + 1\right)^2} \\ \dfrac{4\left(|\boldsymbol{\rho}|^2\rho_3 + 2\rho_1\rho_2 - \rho_3\right)}{\left(|\boldsymbol{\rho}|^2 + 1\right)^2} & 1 - \dfrac{8\left(\rho_1^2 + \rho_3^2\right)}{\left(|\boldsymbol{\rho}|^2 + 1\right)^2} & -\dfrac{4\left(|\boldsymbol{\rho}|^2\rho_1 - 2\rho_2\rho_3 - \rho_1\right)}{\left(|\boldsymbol{\rho}|^2 + 1\right)^2} \\ -\dfrac{4\left(|\boldsymbol{\rho}|^2\rho_2 - 2\rho_1\rho_3 - \rho_2\right)}{\left(|\boldsymbol{\rho}|^2 + 1\right)^2} & \dfrac{4\left(|\boldsymbol{\rho}|^2\rho_1 + 2\rho_2\rho_3 - \rho_1\right)}{\left(|\boldsymbol{\rho}|^2 + 1\right)^2} & 1 - \dfrac{8\left(\rho_1^2 + \rho_2^2\right)}{\left(|\boldsymbol{\rho}|^2 + 1\right)^2} \end{bmatrix} \tag{5}$$

In fact, ${}^{\rm I}_{\rm SI}\mathbf{R}\!\left({}^{\rm I}_{\rm SI}\boldsymbol{\rho}_{k-1}\right)$ is the direction cosine matrix representing the rotation of the IMU coordinate system relative to the static IMU coordinate system.
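The relations in Equations (2)–(5) can be sketched in code. The following is a minimal illustration, not the authors' implementation: `mrp_kinematics_matrix` evaluates Equation (4), `mrp_to_dcm` evaluates Equation (5) through the equivalent cross-product form, and `propagate` performs one step of the discrete model in Equations (2)–(3). All function names are mine.

```python
import numpy as np

def mrp_kinematics_matrix(rho):
    """A(rho) of Eq. (4): A = 1/4 [(1-|rho|^2) I + 2 [rho]_x + 2 rho rho^T]."""
    r1, r2, r3 = rho
    return 0.25 * np.array([
        [r1**2 - r2**2 - r3**2 + 1, 2*(r1*r2 - r3),            2*(r1*r3 + r2)],
        [2*(r1*r2 + r3),            r2**2 - r1**2 - r3**2 + 1, 2*(r2*r3 - r1)],
        [2*(r1*r3 - r2),            2*(r2*r3 + r1),            r3**2 - r1**2 - r2**2 + 1],
    ])

def mrp_to_dcm(rho):
    """Direction cosine matrix of Eq. (5), written compactly as
    R = I - 4(1-|rho|^2)/(1+|rho|^2)^2 [rho]_x + 8/(1+|rho|^2)^2 [rho]_x^2,
    which expands term-by-term to the entries of Eq. (5)."""
    rho = np.asarray(rho, dtype=float)
    s = rho @ rho                       # |rho|^2
    rx = np.array([[0.0, -rho[2], rho[1]],
                   [rho[2], 0.0, -rho[0]],
                   [-rho[1], rho[0], 0.0]])
    return np.eye(3) - 4.0*(1.0 - s)/(1.0 + s)**2 * rx + 8.0/(1.0 + s)**2 * (rx @ rx)

def propagate(x, omega, acc, dt, g_si):
    """One step of the discrete model, Eqs. (2)-(3), with x = [rho, p, v]."""
    rho, p, v = x[0:3], x[3:6], x[6:9]
    rho_new = rho + dt * mrp_kinematics_matrix(rho) @ omega   # attitude kinematics
    p_new = p + dt * v                                        # position integration
    v_new = v + dt * (mrp_to_dcm(rho).T @ acc + g_si)         # specific force + gravity
    return np.concatenate([rho_new, p_new, v_new])
```

For a stationary IMU, the accelerometer measures $-{}^{\rm SI}\mathbf{g}$, so propagating the zero state with that input leaves the state unchanged, which is a quick sanity check of the sign conventions.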

#### *2.3. Keyframe Management*

A keyframe, consisting of an RGB image and a depth image, serves as the reference for the egomotion estimation. The very first RGB and depth images are used to set up the first keyframe. A point set is sampled from the images of the keyframe together with the intensity. Unlike indirect VO, direct VO requires no extraction or matching of distinct features; as a result, direct VO is generally faster than indirect VO. However, accuracy and robustness are hard to ensure without carefully designed features. To ensure accurate estimation, the point sampling in direct VO should follow two basic criteria: (1) the sampled points are distributed as evenly as possible over the image; and (2) the absolute intensity gradient at each sampled point is significantly higher than that of its neighbors. The absolute intensity gradient $g$ at point $j$ with pixel coordinates $\left(u_{{\rm p}_j}, v_{{\rm p}_j}\right)$ can be evaluated with intensity differences as follows:

$$g = \sqrt{\left(\frac{I\left(u_{{\rm p}_j} + 1, v_{{\rm p}_j}\right) - I\left(u_{{\rm p}_j} - 1, v_{{\rm p}_j}\right)}{2}\right)^2 + \left(\frac{I\left(u_{{\rm p}_j}, v_{{\rm p}_j} + 1\right) - I\left(u_{{\rm p}_j}, v_{{\rm p}_j} - 1\right)}{2}\right)^2} \tag{6}$$

where $I(\cdot)$ is the intensity at the corresponding pixel coordinate. For even sampling, the RGB image in the keyframe is divided into patches of size $b \times b$. The pixel with the highest absolute intensity gradient among the pixels with valid depth measurements in each patch is chosen as a candidate. If the absolute intensity gradient of a candidate is no less than a predesigned threshold $g_{\rm thres}$, the 3D coordinate of the candidate pixel is constructed. We used a simple pin-hole camera model for the 3D construction:

$${}^{\rm SC}\mathbf{p}_{{\rm p}_j} = \begin{bmatrix} {}^{\rm SC}x_{{\rm p}_j} \\ {}^{\rm SC}y_{{\rm p}_j} \\ {}^{\rm SC}z_{{\rm p}_j} \end{bmatrix} = d\left(u_{{\rm p}_j}, v_{{\rm p}_j}\right) \begin{bmatrix} \dfrac{u_{{\rm p}_j} - c_x}{f_x} \\ \dfrac{v_{{\rm p}_j} - c_y}{f_y} \\ 1 \end{bmatrix} \tag{7}$$

where $d\left(u_{{\rm p}_j}, v_{{\rm p}_j}\right)$ is the depth measurement of the corresponding point; and $f_x$, $f_y$, $c_x$, and $c_y$ are the intrinsic parameters of the RGB camera. To simplify the measurement function, the constructed points were transformed into the static IMU coordinate system, and the transformed 3-dimensional coordinates were added into the point set of the keyframe:

$${}^{\rm SI}\mathbf{p}_{{\rm p}_j} = {}^{\rm C}_{\rm I}\mathbf{R}^{\rm T}\left({}^{\rm SC}\mathbf{p}_{{\rm p}_j} - {}^{\rm C}_{\rm I}\mathbf{p}\right) \tag{8}$$

where ${}^{\rm C}_{\rm I}\mathbf{R}$ and ${}^{\rm C}_{\rm I}\mathbf{p}$ represent the transformation between the ICS and the CCS. We assumed that the two sensors were mounted on a rigid body so that this transformation did not change over time. The camera–IMU transformation and the intrinsic parameters were calibrated in advance; the calibration methods are not included in this paper.
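Equations (7) and (8) amount to a depth-scaled pin-hole back-projection followed by a rigid-body transform. A minimal sketch follows; the function names and the extrinsic convention $\mathbf{p}_{\rm C} = \mathbf{R}\,\mathbf{p}_{\rm I} + \mathbf{p}$ are my assumptions, not taken from the paper:

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pin-hole back-projection, Eq. (7): pixel + depth -> 3D point (camera frame)."""
    return np.array([(u - cx) / fx * depth,
                     (v - cy) / fy * depth,
                     depth])

def camera_to_imu(p_cam, R_ci, p_ci):
    """Eq. (8): re-express a camera-frame point in the (static) IMU frame,
    assuming the extrinsics satisfy p_cam = R_ci @ p_imu + p_ci."""
    return R_ci.T @ (p_cam - p_ci)
```

At the principal point the back-projected ray is the optical axis, so `backproject(cx, cy, d, ...)` returns `[0, 0, d]`; the IMU-frame transform is exactly invertible through the assumed extrinsic convention.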

The absolute intensity gradient threshold $g_{\rm thres}$ was designed according to the intensity characteristics of each patch and evaluated adaptively online. We treated $g_{\rm thres}$ as the sum of two parts:

$$g_{\rm thres} = \overline{g} + g_{\rm b} \tag{9}$$

where $\overline{g}$ is the average absolute intensity gradient of the image patch and $g_{\rm b}$ is a positive value that ensures the absolute intensity gradient of the candidate is sufficiently high. The candidate's absolute intensity gradient must exceed that of more pixels in a larger patch, which means that $g_{\rm thres}$ is higher for a larger patch and $g_{\rm b}$ is positively correlated with the patch size $b$. We used a simple linear function to represent this positive correlation and formulated the lower-bound constraint on the absolute intensity gradient as follows:

$$g \ge \overline{g} + \lambda(b - 1), \quad \lambda > 0 \tag{10}$$

where $\lambda$ is a positive constant. In the particular case of $b = 1$, i.e., only one pixel exists in each patch, the above lower-bound constraint degrades into $g \ge \overline{g}$; since the patch then contains a single pixel, $\overline{g} = g$ and the constraint always holds, so all points with valid depth measurements are constructed. Figure 1 shows an example of point sampling on a 640 × 480 image. In the example, the above method was able to sample points not only from the edges of the objects, but also from regions with weak intensity variations such as the computer screen and the floor. The points were distributed more densely with a smaller patch. The numbers of sampled points with different patch sizes in this example are shown in Table 1.

**Table 1.** The number of sampled points with different patch sizes.


**Figure 1.** Example of point sampling on a 640 × 480 image with different patch sizes. The green points represent the sampled points. (**a**) The original image; (**b**) the result of point sampling with *b* = 2; (**c**) the result with *b* = 4; (**d**) the result with *b* = 8; (**e**) the result with *b* = 16; and (**f**) the result with *b* = 32.
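The sampling procedure of Equations (6), (9), and (10) can be sketched as follows. This is an illustrative implementation under my own assumptions (invalid depth encoded as 0, gradient left at zero on image borders, patch grid truncated at the borders), not the authors' code:

```python
import numpy as np

def sample_points(gray, depth, b, lam):
    """Patch-wise point sampling with the adaptive threshold of Eqs. (6), (9), (10).
    gray: float intensity image; depth: same-shape array with 0 meaning invalid;
    b: patch size; lam: slope lambda of the linear bound. Returns (row, col) tuples."""
    h, w = gray.shape
    # Central-difference gradient magnitude, Eq. (6); borders stay zero.
    g = np.zeros_like(gray)
    g[1:-1, 1:-1] = np.sqrt(((gray[1:-1, 2:] - gray[1:-1, :-2]) / 2.0) ** 2 +
                            ((gray[2:, 1:-1] - gray[:-2, 1:-1]) / 2.0) ** 2)
    points = []
    for r0 in range(0, h - b + 1, b):
        for c0 in range(0, w - b + 1, b):
            patch = g[r0:r0 + b, c0:c0 + b]
            valid = depth[r0:r0 + b, c0:c0 + b] > 0
            if not valid.any():
                continue
            # Candidate: highest-gradient pixel with a valid depth in this patch.
            masked = np.where(valid, patch, -np.inf)
            idx = np.unravel_index(np.argmax(masked), masked.shape)
            g_thres = patch.mean() + lam * (b - 1)   # Eqs. (9)-(10)
            if masked[idx] >= g_thres:
                points.append((r0 + idx[0], c0 + idx[1]))
    return points
```

On a perfectly flat patch the gradient is zero while the threshold is $\lambda(b-1) > 0$, so no point is sampled there; on a patch containing a step edge, the edge pixel's gradient comfortably exceeds the patch average plus the linear term, which reproduces the behavior described above.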

The above point sampling method adjusts the number of points by tuning $b$ and guarantees evenly distributed points. With a small $b$, the method approximates dense visual odometry; with a large $b$, it approximates sparse visual odometry. The keyframe may not be sufficient to estimate the egomotion as the camera moves, so we used two judging criteria to decide when to change the keyframe:


A new keyframe is constructed from the current RGB-D images if either of the two criteria is satisfied. After a new keyframe is constructed, the previous keyframe is replaced by the new one, and the reference coordinate systems change in turn. The estimates of ${}^{\rm SI}\mathbf{v}$ and ${}^{\rm SI}_{\rm E}\mathbf{R}$ are transformed with ${}^{\rm I}_{\rm SI}\boldsymbol{\rho}$. Then, the estimates of ${}^{\rm I}_{\rm SI}\boldsymbol{\rho}$ and ${}^{\rm SI}\mathbf{p}$ are reset to zero.
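The state reset on keyframe replacement can be sketched as below. The direction of the velocity transform (left-multiplication by $\mathbf{R}(\boldsymbol{\rho})$, chosen to match the convention of Equation (2), where $\mathbf{R}^{\rm T}$ maps IMU-frame vectors into the SICS) and the function names are my assumptions:

```python
import numpy as np

def reset_on_new_keyframe(x, R_e_si, mrp_to_dcm):
    """On keyframe replacement: re-express velocity and the earth-to-SICS rotation
    in the new static frame (the current IMU frame), then zero the relative pose.
    mrp_to_dcm: callable converting an MRP to its direction cosine matrix, Eq. (5)."""
    rho, v = x[0:3], x[6:9]
    R = mrp_to_dcm(rho)          # rotation of the ICS relative to the old SICS
    v_new = R @ v                # velocity expressed in the new SICS
    R_e_new = R @ R_e_si         # earth-fixed frame -> new SICS
    x_new = np.zeros(9)          # rho and p reset to zero
    x_new[6:9] = v_new           # velocity carried over
    return x_new, R_e_new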

The above keyframe and point sampling method largely follows Engel's DSO method, except for several simplifications as follows:

