*Article* **Improved Point–Line Visual–Inertial Odometry System Using Helmert Variance Component Estimation**

#### **Bo Xu <sup>1</sup>, Yu Chen <sup>1,\*</sup>, Shoujian Zhang <sup>1</sup> and Jingrong Wang <sup>2</sup>**


Received: 11 August 2020; Accepted: 5 September 2020; Published: 7 September 2020

**Abstract:** Mobile platform visual image sequences inevitably contain large areas of various types of weak texture, which affect the acquisition of accurate poses as the platform moves. Visual–inertial odometry (VIO) that uses both point features and line features as visual information performs well in weak-texture environments and can solve these problems to a certain extent. However, the extraction and matching of line features are time consuming, and reasonable weights between the point and line features are hard to estimate, which makes it difficult to accurately track the pose of the platform in real time. To overcome these deficiencies, an improved, efficient point–line visual–inertial odometry system is proposed in this paper, which makes use of the geometric information of line features and combines it with the pixel correlation coefficient to match the line features. Furthermore, the system uses the Helmert variance component estimation method to adjust the weights between point features and line features. Comprehensive experimental results on the two datasets of EuRoc MAV and PennCOSYVIO demonstrate that the point–line visual–inertial odometry system developed in this paper achieves significant improvements in both localization accuracy and efficiency compared with several state-of-the-art VIO systems.

**Keywords:** visual–inertial odometry; Helmert variance component estimation; line feature matching method; correlation coefficient; point and line features

#### **1. Introduction**

Simultaneous localization and mapping (SLAM) has become a key technology in autonomous driving and autonomous robot navigation, and has attracted widespread attention from academia and industry [1]. Visual SLAM technology, which uses an optical lens as its sensor, has the advantages of low power consumption and small size, and is widely used in indoor positioning and navigation. However, visual SLAM places higher requirements on observation conditions: when the movement speed is fast or the illumination conditions are poor, the tracked point features are easily lost, resulting in larger positioning errors. Fusing inertial navigation data into the visual SLAM system can significantly improve its positioning accuracy and reliability, which has become a research hotspot.

Visual–inertial odometry (VIO) uses visual and inertial navigation data for integrated navigation, which has broad application prospects and is studied worldwide [2,3]. The earliest VIO systems were mainly based on filtering technology [4,5], using the integral of inertial measurement unit (IMU) measurements to predict the state variables of the moving carrier and then updating the state variables with visual information, thereby realizing a tightly coupled fusion of visual and IMU information. However, with the linearization point of the nonlinear measurement model and the state transition model fixed in the filtering process, the linearization may introduce large errors when the initial value is unreasonable. Thus, most scholars adopt graph optimization [6,7] and use iterative methods to achieve higher-precision parameter estimation [8]. For example, the OKVIS [9] system uses a tightly coupled approach to optimize the visual constraints of feature points and the preintegration constraints of the IMU, and adopts a keyframe-based optimization strategy with a "first-in first-out" sliding window, marginalizing the measurements from the oldest state. The VINS [10] system is a monocular visual–inertial SLAM scheme, which uses a sliding-window-based approach to construct the tightly coupled optimization of IMU preintegration and visual measurement information. In the sliding window, the oldest frame or the second newest frame is selectively marginalized to maintain the optimized state variables and achieve a good optimization effect.

At present, mainstream VIO systems generally use point features as visual observations. For example, the VINS system is designed to detect Shi–Tomasi corner points [11], which it tracks with the Kanade–Lucas–Tomasi (KLT) sparse optical flow method [12]. The S-MSCKF system [13] is designed to detect features from accelerated segment test (FAST) corner points [14], using the KLT sparse optical flow method [12] for tracking. The OKVIS [9] system is designed to detect Harris [15] corner points, and uses binary robust invariant scalable keypoints (BRISK) [16] to match and track feature points. In most scenarios, the number of corner points is large and stable, which can ensure positioning performance. However, in weak-texture environments and scenes where the illumination changes significantly, the point features always provide less visual measurement information or have large measurement errors [17,18]. To mitigate the insufficient performance of point features, line features that can provide structured information are introduced into the VIO system [19]. The simplest way is to use the two endpoints of the line to represent the 3D spatial line [20,21]. The 3D spatial line represented by the endpoints requires six parameters, while the 3D spatial line only has four degrees of freedom (DoFs); thus this representation further causes a rank-deficiency problem in the equations and adds computational burden. Bartoli and Sturm [22] proposed an orthogonal representation of line features using four parameters to represent the 3D spatial line, in which the three-dimensional vector is related to the rotation of the line around three axes, and the last parameter represents the vertical distance from the origin to the spatial line [23]. This representation method has good numerical stability. Based on the line feature representation, He et al. [24] proposed a tightly coupled monocular point–line visual–inertial odometry (PL–VIO), which uses point and line measurement information and IMU measurement information to continuously estimate the state of the moving platform; the state variables are optimized by the sliding window method, which ensures accuracy while guaranteeing an appropriate number of optimization variables, thereby improving the efficiency of optimization. Wen et al. [25] proposed a tightly coupled stereo point–line visual–inertial odometry (PLS–VIO), which uses stereo point–line features and IMU measurement information for tightly coupled optimization. Compared with the monocular VIO system, the stereo VIO system has higher stability and accuracy, while the time consumption is greatly increased.

In a VIO system that uses point features and line features at the same time, the traditional line feature matching method using the line band descriptor (LBD) [26] is time consuming, which reduces the real-time performance of the entire VIO system. At the same time, in the VIO system, it is difficult to provide reasonable and reliable weights for point and line features. These two issues are the key to achieving good performance of a point–line coupled VIO system.

Line features have geometric information and good pixel-level information; by using these two kinds of information, line features can be matched. Helmert variance component estimation (HVCE) [27] can determine the weights of different types of observations, and has been applied in many fields, including inertial navigation system (INS) and global navigation satellite system (GNSS) fusion positioning [28,29] and global positioning system (GPS) and BeiDou navigation satellite system (BDS) pseudorange differential positioning [30], which demonstrates the effectiveness of Helmert variance component estimation.

Based on this discussion, at the front end of the point–line VIO system, the line feature matching speed is slow; at the back end, when performing tightly coupled optimization of IMU observations, point feature observations, and line feature observations, it is difficult to determine a more reasonable point–line weight. The contributions of this article are as follows:


The organization of this paper is as follows. After a comprehensive introduction in Section 1, the mathematical model is introduced in Section 2. The numerical experiments are conducted in Section 3 and the results are discussed in Section 4. Finally, conclusions and recommendations are given in Section 5.

#### **2. Mathematical Formulation**


In general, the VIO system is divided into two modules: the front end and the back end. The front end is designed for the processing of visual measurement information and the preintegration of IMU measurement information [8], and calculates the initial poses. The back end is designed for data fusion and optimization. The front end of PL–VIO [24] adds line feature measurement information in addition to the original point feature measurement information, which improves the robustness of the algorithm. On the basis of PL–VIO, in order to reduce the front-end running time, the matching algorithm of the line features is improved. In order to improve the accuracy of visual information in the overall optimization, we adopt the Helmert variance component estimation method to better determine the prior weights of point and line information.

Figure 1 shows the algorithm pipeline. At the front end, we improved the line feature matching algorithm, as shown in the red box. Simultaneously, as shown again in the red box at the back end, before entering the sliding window optimization, we use the Helmert variance component estimation algorithm to estimate the weights of point features and line features. Finally, we add visual information and IMU measurement information to the sliding window for optimization.

**Figure 1.** Overview of our improved point–line visual–inertial odometry (IPL–VIO) system.


#### *2.1. Notations*


Figure 2 [24] shows the basic principle of the point–line coupled visual–inertial odometry and stipulates the following notation. The visual–inertial odometry uses the extracted point features and line features as visual observation values, and couples IMU measurement information for integrated navigation; $c_i$ and $b_i$ represent the camera frame and the IMU body frame at time $t = i$; $\mathbf{f}_j$ and $\mathbf{L}_j$ represent a point feature and a line feature in the world coordinate system. The variable $\mathbf{z}^{c_i}_{\mathbf{f}_j}$ is the $j$th point feature observed by the $i$th camera frame, and $\mathbf{z}^{c_i}_{\mathbf{L}_j}$ is the $j$th line feature observed by the $i$th camera frame; together they compose the visual observations. $\mathbf{z}_{b_i b_j}$ represents a preintegrated IMU measurement between two keyframes; $\mathbf{q}_{bc}$ and $\mathbf{p}_{bc}$ are the extrinsic parameters between the camera frame and the body frame.

**Figure 2.** An illustration of visual–inertial odometry. Visual observations: point observations and line measurements. IMU observations: inertial measurement unit measurements.

#### *2.2. Improved Line Feature Matching Algorithm*

In general, most line feature matching algorithms use the LBD [26] to match line features, which requires describing the line features; the matching of the descriptors takes a certain amount of time, hugely increasing the computational burden.

Since the line features contain rich geometric and texture characteristics, we comprehensively use the angle, position, and pixel properties of the line features to match the line features. The algorithm can increase the matching speed of the line features. The specific algorithm follows:

(1) According to the midpoint coordinates of the line features, narrow the matching range by extracting line features of the left and right images, where the two endpoints of the line features are extracted by the line segment detector (LSD) algorithm [33]. Then the left image is divided into m × n grids and the line features extracted from the left image are mapped into different grids according to their midpoint coordinates, as shown in Figure 3. When the midpoint coordinates of the line features in the right image fall into the corresponding grid of the left image, all line features of the left image and the right image in the same grid are obtained as candidate line features. We denote candidate line features in the left image as {*P*1, *P*2, . . . *Pn*} and in the right image as {*Q*1, *Q*2, . . . *Qn*}.


**Figure 3.** Using geometric information to match line features. The initial matching is performed by the pixel coordinates of the line features. The red box represents the grid divided in the image, and the yellow lines represent the line features that fall into the same grid of the two matching images.

(2) According to the correlation coefficient of the pixels in the surrounding area of the line features, the matching line features are determined. We match the candidate line feature *P<sup>i</sup>* in {*P*1, *P*2, . . . *Pn*} with the line features {*Q*1, *Q*2, . . . *Qn*}, and the correlation coefficient of a single pixel in the matching line is calculated using Formula (1) [34]. The respective correlation coefficients between *P<sup>i</sup>* and {*Q*1, *Q*2, . . . *Qn*} are calculated by averaging the correlation coefficient of each pixel in the corresponding line features, and the correlation coefficients of the line features are sorted. If the correlation coefficient between *P<sup>i</sup>* and *Q<sup>j</sup>* is the largest, the respective correlation coefficients between *Q<sup>j</sup>* and {*P*1, *P*2, . . . *Pn*} are calculated as well. If the correlation coefficient between *Q<sup>j</sup>* and *P<sup>i</sup>* is also the largest, *P<sup>i</sup>* and *Q<sup>j</sup>* are considered to be a pair of matching lines.

$$\rho(c,r,c',r') = \frac{\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} g_{i+r,j+c}\, g'_{i+r',j+c'} - \frac{1}{m \cdot n}\left(\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} g_{i+r,j+c}\right)\left(\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} g'_{i+r',j+c'}\right)}{\sqrt{\left[\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} g^2_{i+r,j+c} - \frac{1}{m \cdot n}\left(\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} g_{i+r,j+c}\right)^2\right]\left[\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} g'^2_{i+r',j+c'} - \frac{1}{m \cdot n}\left(\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} g'_{i+r',j+c'}\right)^2\right]}} \tag{1}$$

where $c, r \in l_1$ and $c', r' \in l_2$; $(c, r)$ are the pixel coordinates of line $l_1$ on the left image; $(c', r')$ are the pixel coordinates of line $l_2$ on the right image; $m$ and $n$ define the matching window size; $g_{i,j}$ is the gray value at $(i, j)$ on the left image; $g'_{i,j}$ is the gray value at $(i, j)$ on the right image; and $\rho(c, r, c', r')$ is the correlation coefficient.

(3) According to the rotation consistency of the line feature angles of the matched images, mismatches are eliminated. If the matching images are rotated, the angle changes of all matching line features should be consistent, which means the line feature rotation angles of the matching images have global consistency. If a rotation angle is obviously inconsistent with the rotation angles of the other matching line features, the matching pair may be seen as a mismatch and should be eliminated. This paper establishes a statistical histogram from 0 to 360 degrees with a bin width of 1 degree. Through the histogram, the angle changes of matching line features are counted and the bin with the largest count is retained. Line feature matching pairs that fall into other bins are considered to be mismatches and are eliminated.
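The three steps above can be sketched as follows in NumPy. Here `corr_coeff` implements the correlation coefficient of Formula (1) over two gray-value windows; the grid binning of step (1) and the per-pixel window extraction are assumed to have been done already, and all function and variable names are our own illustration, not the paper's code:

```python
import numpy as np

def corr_coeff(g, gp):
    """Correlation coefficient of Formula (1) between two equally sized
    gray-value windows g (left image) and gp (right image)."""
    g = np.asarray(g, dtype=float).ravel()
    gp = np.asarray(gp, dtype=float).ravel()
    n = g.size
    num = g @ gp - g.sum() * gp.sum() / n
    den = np.sqrt((g @ g - g.sum() ** 2 / n) * (gp @ gp - gp.sum() ** 2 / n))
    return num / den if den > 0 else 0.0

def mutual_best_matches(score):
    """Step (2): score[i, j] holds the mean correlation between left line
    P_i and right line Q_j; keep (i, j) only if each is the other's best."""
    best_q = score.argmax(axis=1)   # best right candidate for each P_i
    best_p = score.argmax(axis=0)   # best left candidate for each Q_j
    return [(i, int(j)) for i, j in enumerate(best_q) if best_p[j] == i]

def rotation_consistent(matches, angles_l, angles_r, bin_deg=1.0):
    """Step (3): keep only pairs whose rotation angle falls in the most
    populated histogram bin (line angles given in radians)."""
    d = np.degrees(np.asarray(angles_r) - np.asarray(angles_l)) % 360.0
    bins = (d // bin_deg).astype(int)
    keep = bins == np.bincount(bins).argmax()
    return [m for m, k in zip(matches, keep) if k]
```

Note that `corr_coeff` returns 1.0 for windows related by any linear gray-value change, which is what makes the measure robust to exposure differences between the left and right images.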

#### *2.3. Tightly Coupled VIO System*

The VIO system in this paper uses point features, line features, and IMU measurement information to optimize in the sliding window. In the optimization process, reasonable weights for the different measurement information need to be given. Generally, the IMU measurements adopt the form of preintegration to construct the observation constraints, and the weight matrix of the IMU observations is recursively obtained, while the point features and the line features are assigned prior weight matrices. Since the point feature and the line feature express different visual measurement information, the given prior weight matrices may be unreasonable to a certain extent. We use the Helmert variance component estimation method to obtain the post-test estimation of the prior weight matrices to better determine the contribution of visual measurement information to the overall optimization.
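As a sketch of the idea, the simplified Helmert estimator rescales each observation group's weight by its a posteriori variance factor $\hat{\sigma}^2_k = \mathbf{v}_k^T \mathbf{P}_k \mathbf{v}_k / r_k$, with group redundancy $r_k = n_k - \mathrm{tr}(\mathbf{N}^{-1}\mathbf{N}_k)$. The toy linear least-squares problem below is our own construction, not the paper's implementation; it shows two observation groups whose relative variance is recovered by iteration:

```python
import numpy as np

def helmert_vce(groups, iters=10):
    """Simplified Helmert variance component estimation for a linear
    least-squares problem with several observation groups.

    groups: list of (A, l, P) tuples -- design matrix, observations, and
    prior weight matrix of each group.  Returns the estimate x and the
    variance factors relative to the first (reference) group.
    """
    sig = np.ones(len(groups))                    # variance components
    for _ in range(iters):
        # normal equations with current group weights P_k / sigma_k^2
        Ns = [A.T @ P @ A / s for (A, l, P), s in zip(groups, sig)]
        bs = [A.T @ P @ l / s for (A, l, P), s in zip(groups, sig)]
        N = sum(Ns)
        x = np.linalg.solve(N, sum(bs))
        Ninv = np.linalg.inv(N)
        for k, (A, l, P) in enumerate(groups):
            v = A @ x - l                         # residuals of group k
            r = A.shape[0] - np.trace(Ninv @ Ns[k])   # group redundancy
            sig[k] *= (v @ P @ v / sig[k]) / r    # a posteriori factor
        sig /= sig[0]                             # group 0 is reference
    return x, sig
```

Feeding two groups of observations with noise standard deviations of 1 and 3 but equal prior weights, the iteration drives the second group's variance factor toward roughly 9, i.e., its weight is automatically reduced by the ratio of the true variances.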

In order to better explain the improved algorithm in this article, the basic principles of tight coupling in the VIO system are introduced in the following sections, covering the IMU error model, the point feature error model, the line feature error model, and Helmert variance component estimation.

#### 2.3.1. Basic Principles of Tightly Coupled VIO System

In order to ensure accuracy and take into account efficiency at the same time, the sliding window algorithm is used to optimize state variables at the back end of the VIO system. Define the variable optimized in the sliding window at time *t* as [24]:

$$\begin{aligned} X &= [\mathbf{x}_n, \mathbf{x}_{n+1}, \dots, \mathbf{x}_{n+N}, \lambda_m, \lambda_{m+1}, \dots, \lambda_{m+M}, \mathbf{l}_k, \mathbf{l}_{k+1}, \dots, \mathbf{l}_{k+K}] \\ \mathbf{x}_i &= [\mathbf{p}_{wb_i}, \mathbf{q}_{wb_i}, \mathbf{v}^w_i, \mathbf{b}^{b_i}_a, \mathbf{b}^{b_i}_g]^T, \quad i \in [n, n+N] \end{aligned} \tag{2}$$

where $\mathbf{x}_i$ describes the $i$th IMU body state; $\mathbf{p}_{wb_i}$, $\mathbf{v}^w_i$, $\mathbf{q}_{wb_i}$ describe the position, velocity, and orientation of the IMU body in the world frame; $\mathbf{b}^{b_i}_a$, $\mathbf{b}^{b_i}_g$ describe the acceleration bias and angular velocity bias. We use only one variable, the inverse depth $\lambda_k$, to parameterize the $k$th point landmark from its first observed keyframe. The variable $\mathbf{l}_s$ is the orthonormal representation of the $s$th line feature in the world frame. Subscripts $n$, $m$, and $k$ are the start indexes of the body states, point landmarks, and line landmarks, respectively. $N$ is the number of keyframes in the sliding window. $M$ and $K$ are the numbers of point landmarks and line landmarks observed by all keyframes in the sliding window.
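As a small consistency check on this parameterization (our own illustration, not from the paper): each raw body state carries 16 parameters (3 + 4 + 3 + 3 + 3 for position, quaternion, velocity, and the two biases), each point landmark a single inverse depth, and each line landmark four orthonormal parameters:

```python
def state_size(N, M, K):
    """Raw parameter count of the sliding-window state of Eq. (2):
    body states x_n .. x_{n+N}, inverse depths lambda_m .. lambda_{m+M},
    and orthonormal line parameters l_k .. l_{k+K}."""
    body = 3 + 4 + 3 + 3 + 3    # p, q (quaternion), v, b_a, b_g
    return (N + 1) * body + (M + 1) * 1 + (K + 1) * 4
```

This count is what makes the inverse-depth and orthonormal-line choices attractive: a point costs one parameter instead of three, and a line four instead of six.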

We optimize all the state variables in the sliding window by minimizing the sum of cost terms from all the measurement residuals [24]:

$$\min_{X} \; \rho\!\left(\left\|\mathbf{r}_p - \mathbf{J}_p X\right\|^2_{\Sigma_p}\right) + \sum_{i \in B} \rho\!\left(\left\|\mathbf{r}_b(\mathbf{z}_{b_i b_{i+1}}, X)\right\|^2_{\Sigma_{b_i b_{i+1}}}\right) + \sum_{(i,j) \in F} \rho\!\left(\left\|\mathbf{r}_f(\mathbf{z}^{c_i}_{\mathbf{f}_j}, X)\right\|^2_{\Sigma^{c_i}_{\mathbf{f}_j}}\right) + \sum_{(i,l) \in L} \rho\!\left(\left\|\mathbf{r}_l(\mathbf{z}^{c_i}_{\mathbf{L}_l}, X)\right\|^2_{\Sigma^{c_i}_{\mathbf{L}_l}}\right) \tag{3}$$

where $\{\mathbf{r}_p, \mathbf{J}_p\}$ is the prior information after marginalizing out one frame in the sliding window, and $\mathbf{J}_p$ is the prior Jacobian matrix from the resulting Hessian matrix after the previous optimization. The variable $\mathbf{r}_b(\mathbf{z}_{b_i b_{i+1}}, X)$ is an IMU measurement residual between the body states $\mathbf{x}_i$ and $\mathbf{x}_{i+1}$; $B$ is the set of all preintegrated IMU measurements in the sliding window; $\mathbf{r}_f(\mathbf{z}^{c_i}_{\mathbf{f}_j}, X)$ and $\mathbf{r}_l(\mathbf{z}^{c_i}_{\mathbf{L}_l}, X)$ are the point feature reprojection residual and the line feature reprojection residual, respectively. $F$ and $L$ are the sets of point features and line features observed by the camera frames. The Cauchy robust function $\rho$ is used to suppress outliers.
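The Cauchy kernel replaces a squared residual $s$ with a term that grows only logarithmically, so a gross outlier cannot dominate the sum in Equation (3). A minimal sketch (the scale $c$ is an assumed parameter and the function names are ours):

```python
import numpy as np

def cauchy_rho(s, c=1.0):
    """Cauchy robust kernel applied to a squared residual s = ||r||^2."""
    return c * c * np.log1p(s / (c * c))

def cauchy_weight(s, c=1.0):
    """Derivative rho'(s): the factor by which iteratively reweighted
    least squares down-weights a residual of squared norm s."""
    return 1.0 / (1.0 + s / (c * c))
```

For small residuals $\rho(s) \approx s$, so inliers are nearly unaffected, while a gross outlier with $s = 100$ contributes only about 4.6 for $c = 1$.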

We express the abovementioned nonlinear optimization process in the form of a factor graph [35]. As shown in Figure 4, the nodes represent the variables to be optimized; in the VIO system they are the visual features and the state variables of the IMU body. The edges represent the visual constraints, IMU preintegration constraints, and prior constraints. Through the constraint information of the edges, the state variables of the nodes are optimized.


**Figure 4.** Optimization of a factor graph in the VIO system. Pink squares represent visual factors, purple squares represent prior factors, red squares represent IMU preintegration factors, blue nodes represent visual feature state variables to be optimized, and green nodes represent IMU body state variables to be optimized.

#### 2.3.2. IMU Measurement Model

The IMU raw observation values are preintegrated between two consecutive camera observation frames $b_i$ and $b_j$, and an IMU measurement error model is constructed through the preintegration [24]:

$$\mathbf{r}_b(\mathbf{z}_{b_i b_j}, X) = \begin{bmatrix} \mathbf{r}_p \\ \mathbf{r}_\theta \\ \mathbf{r}_v \\ \mathbf{r}_{ba} \\ \mathbf{r}_{bg} \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{b_i w}(\mathbf{p}_{wb_j} - \mathbf{p}_{wb_i} - \mathbf{v}^w_i \Delta t + \frac{1}{2}\mathbf{g}^w \Delta t^2) - \hat{\boldsymbol{\alpha}}_{b_i b_j} \\ 2\left[\hat{\mathbf{q}}^{-1}_{b_i b_j} \otimes (\mathbf{q}_{b_i w} \otimes \mathbf{q}_{wb_j})\right]_{xyz} \\ \mathbf{R}_{b_i w}(\mathbf{v}^w_j - \mathbf{v}^w_i + \mathbf{g}^w \Delta t) - \hat{\boldsymbol{\beta}}_{b_i b_j} \\ \mathbf{b}^{b_j}_a - \mathbf{b}^{b_i}_a \\ \mathbf{b}^{b_j}_g - \mathbf{b}^{b_i}_g \end{bmatrix} \tag{4}$$

where $\mathbf{z}_{b_ib_j}=[\hat{\boldsymbol{\alpha}}_{b_ib_j}, \hat{\boldsymbol{\beta}}_{b_ib_j}, \hat{\mathbf{q}}_{b_ib_j}]$ is the preintegrated measurement value of the IMU [8]; $[\cdot]_{xyz}$ extracts the vector (imaginary) part of a quaternion, which is used to approximate the three-dimensional rotation error.
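As a concrete illustration, the residual in Equation (4) can be evaluated directly from two body states and the preintegrated terms. The following is a minimal numpy sketch (quaternions in [w, x, y, z] order; all helper names are ours, not the authors' implementation):

```python
import numpy as np

def quat_mul(q1, q2):
    # Hamilton product of quaternions [w, x, y, z]
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_conj(q):
    # conjugate equals inverse for a unit quaternion
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_to_rot(q):
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def imu_residual(p_i, v_i, q_wbi, ba_i, bg_i,
                 p_j, v_j, q_wbj, ba_j, bg_j,
                 alpha_hat, beta_hat, q_hat, g_w, dt):
    """15-dim IMU preintegration residual of Equation (4)."""
    R_biw = quat_to_rot(q_wbi).T                  # world -> body_i rotation
    q_biw = quat_conj(q_wbi)
    r_p = R_biw @ (p_j - p_i - v_i*dt + 0.5*g_w*dt*dt) - alpha_hat
    # 2 * vector part of q_hat^{-1} (x) (q_{b_i w} (x) q_{w b_j})
    r_th = 2.0 * quat_mul(quat_conj(q_hat), quat_mul(q_biw, q_wbj))[1:]
    r_v = R_biw @ (v_j - v_i + g_w*dt) - beta_hat
    return np.concatenate([r_p, r_th, r_v, ba_j - ba_i, bg_j - bg_i])
```

For a stationary body with a consistent preintegrated measurement, the residual is zero, which is a quick sanity check on the sign conventions.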

#### 2.3.3. Point Feature Measurement Model

For a point feature, the distance from the projected point to the observed point, that is, the reprojection error, is used to construct the point feature error model. Let the observation of the *k*th point feature on the normalized image plane of the *c<sup>j</sup>* th frame be $\mathbf{z}_{f_k}^{c_j} = [u_{f_k}^{c_j}, v_{f_k}^{c_j}, 1]^T$; the reprojection error is defined as [24]:

$$\mathbf{r}_{f}(\mathbf{z}_{f_k}^{c_j},\mathbf{X}) = \begin{bmatrix} \dfrac{x_{f_k}^{c_j}}{z_{f_k}^{c_j}} - u_{f_k}^{c_j}\\[6pt] \dfrac{y_{f_k}^{c_j}}{z_{f_k}^{c_j}} - v_{f_k}^{c_j} \end{bmatrix} \tag{5}$$

where $[u_{f_k}^{c_j}, v_{f_k}^{c_j}]$ is the observed position of the point on the normalized image plane of camera frame *c<sup>j</sup>* and $[x_{f_k}^{c_j}, y_{f_k}^{c_j}, z_{f_k}^{c_j}]$ is the same 3D point transformed into camera frame *c<sup>j</sup>* .
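Equation (5) is simply the difference between the perspective division of the transformed 3D point and the observation. A minimal sketch (the function name is ours):

```python
import numpy as np

def point_residual(P_cj, z_cj):
    """Point reprojection error of Equation (5).

    P_cj: 3D point [x, y, z] expressed in camera frame c_j.
    z_cj: observation [u, v] on the normalized image plane.
    """
    x, y, z = P_cj
    return np.array([x / z - z_cj[0], y / z - z_cj[1]])
```

A point at depth 4 with normalized coordinates (0.25, 0.5) that is observed exactly there yields a zero residual.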

#### 2.3.4. Line Feature Measurement Model

The reprojection error of a line feature is defined as the distance from the endpoints to the projection line. For a pinhole camera model, a 3D spatial line *L* = [**n**, **d**] <sup>T</sup> is projected onto the camera image plane by the following formula [24]:

$$\mathbf{l} = \begin{bmatrix} l_1 \\ l_2 \\ l_3 \end{bmatrix} = \mathbf{K} \mathbf{n}^c = \begin{bmatrix} f_y & 0 & 0 \\ 0 & f_x & 0 \\ -f_y c_x & -f_x c_y & f_x f_y \end{bmatrix} \mathbf{n}^c \tag{6}$$

where the 3D spatial line is represented by its normal vector **n** and direction vector **d**, and **K** is the projection matrix for line features. According to Equation (6), only the normal vector of the 3D spatial line determines its projection onto the normalized plane, which is the projection line of the 3D spatial line.

The reprojection error of the line feature in camera frame *c<sup>i</sup>* is defined as (7) [24]:

$$\mathbf{r}\_l(\mathbf{z}\_{L\_l}^{c\_i}, \mathbf{X}) = \begin{bmatrix} d(\mathbf{s}\_l^{c\_i}, \mathbf{l}\_l^{c\_i}) \\ d(\mathbf{e}\_l^{c\_i}, \mathbf{l}\_l^{c\_i}) \end{bmatrix} \tag{7}$$

where *d*(**s**, **l**) is the distance function from endpoint **s** to the projection line **l**, and $\mathbf{s}_l^{c_i}$ and $\mathbf{e}_l^{c_i}$ are the two endpoints of the matched line segment observed in frame *c<sup>i</sup>* .
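Equations (6) and (7) together amount to a few lines of linear algebra: project the normal vector through the line projection matrix, then evaluate the signed point-to-line distance at both endpoints. A minimal sketch (function names are ours; we assume the endpoints are expressed in the same image coordinates as the projected line):

```python
import numpy as np

def line_projection_matrix(fx, fy, cx, cy):
    # Projection matrix K for line features, Equation (6)
    return np.array([
        [fy,      0.0,     0.0],
        [0.0,     fx,      0.0],
        [-fy*cx, -fx*cy,   fx*fy],
    ])

def line_residual(n_c, s, e, K_line):
    """Line reprojection error of Equation (7).

    n_c:  normal vector of the 3D line in the camera frame.
    s, e: observed line segment endpoints [u, v] on the image plane.
    """
    l = K_line @ n_c                       # projected line l = [l1, l2, l3]
    norm = np.hypot(l[0], l[1])
    dist = lambda p: (l[0]*p[0] + l[1]*p[1] + l[2]) / norm
    return np.array([dist(s), dist(e)])
```

With unit focal lengths and zero principal point, **K** is the identity, so a normal vector [0, 1, -0.5] projects to the horizontal line v = 0.5, and endpoints lying on that line give a zero residual.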

#### *2.4. Basic Principle of Helmert Variance Component Estimation*

Performing a first-order Taylor expansion of the point feature error Formula (5) and the line feature error Formula (7), we obtain:

$$\mathbf{r}_f(\mathbf{z}_{f_k}^{c_i}, \mathbf{X}) \approx \mathbf{r}_f(\mathbf{z}_{f_k}^{c_i}, \mathbf{X}_0) + \mathbf{J}_f \Delta \mathbf{x} \tag{8}$$

$$\mathbf{r}_l(\mathbf{z}_{L_l}^{c_i}, \mathbf{X}) \approx \mathbf{r}_l(\mathbf{z}_{L_l}^{c_i}, \mathbf{X}_0) + \mathbf{J}_l \Delta \mathbf{x} \tag{9}$$

where **r***<sup>f</sup>* (*z ci* **f***k* , *X*0) and **r***<sup>l</sup>* (*z ci Ll* , *X*0) are the values of the point feature error model and the line feature error model at the state variable *X*0, respectively, **J***<sup>f</sup>* and **J***<sup>l</sup>* are the corresponding Jacobian matrices.

The constructed least squares optimization is:

$$\mathbf{H} \Delta \mathbf{x} = \mathbf{b} \tag{10}$$

$$\mathbf{H} = \mathbf{J}\_f^T \mathbf{P}\_f \mathbf{J}\_f + \mathbf{J}\_l^T \mathbf{P}\_l \mathbf{J}\_l = \mathbf{H}\_f + \mathbf{H}\_l \tag{11}$$

$$\mathbf{b} = -\mathbf{J}\_f^T \mathbf{P}\_f \mathbf{r}\_f - \mathbf{J}\_l^T \mathbf{P}\_l \mathbf{r}\_l = \mathbf{b}\_f + \mathbf{b}\_l \tag{12}$$

where **P***<sup>f</sup>* and **P***<sup>l</sup>* are the weight matrices corresponding to the point feature observations and the line feature observations, respectively.
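Equations (10)-(12) are the standard weighted normal equations; assembling them from stacked residual vectors and Jacobians is a one-liner each. A minimal sketch (names are ours):

```python
import numpy as np

def assemble_normal_equations(J_f, r_f, P_f, J_l, r_l, P_l):
    """Build H and b of Equations (10)-(12) from stacked
    point (f) and line (l) residuals, Jacobians, and weights."""
    H = J_f.T @ P_f @ J_f + J_l.T @ P_l @ J_l   # Equation (11)
    b = -J_f.T @ P_f @ r_f - J_l.T @ P_l @ r_l  # Equation (12)
    return H, b

def solve_step(J_f, r_f, P_f, J_l, r_l, P_l):
    # Gauss-Newton update: H dx = b, Equation (10)
    H, b = assemble_normal_equations(J_f, r_f, P_f, J_l, r_l, P_l)
    return np.linalg.solve(H, b)
```

For a one-parameter toy problem with residuals 2 and 4, both with unit Jacobian and unit weight, the minimizer of (2 + Δx)² + (4 + Δx)² is Δx = -3, which the sketch reproduces.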

In general, during the first optimization, the weights assigned to the point feature observations and the line feature observations are inappropriate; that is, the corresponding unit weight variances are not equal. Let the unit weight variances of the point feature and line feature observations be $\sigma_f^2$ and $\sigma_l^2$; the relationship between the covariance matrices and the weight matrices is:

$$\mathbf{\Sigma}_f = \sigma_f^2 \mathbf{P}_f^{-1} \tag{13}$$

$$\mathbf{\Sigma}_l = \sigma_l^2 \mathbf{P}_l^{-1} \tag{14}$$

where **Σ***<sup>f</sup>* and **Σ***<sup>l</sup>* are the covariance matrices of the point and line features.

Using the rigorous formula of Helmert variance component estimation, we get:

$$\begin{array}{ll} E(\mathbf{r}\_f^T \mathbf{P}\_f \mathbf{r}\_f) &= \sigma\_f^2 \left\{ \text{tr}(\mathbf{H}^{-1} \mathbf{H}\_f \mathbf{H}^{-1} \mathbf{H}\_f) - 2 \text{tr}(\mathbf{H}^{-1} \mathbf{H}\_f) + \mathbf{n}\_1 \right\} \\ &+ \sigma\_l^2 \left\{ \text{tr}(\mathbf{H}^{-1} \mathbf{H}\_f \mathbf{H}^{-1} \mathbf{H}\_l) \right\} \end{array} \tag{15}$$

$$\begin{array}{ll}E(\mathbf{r}\_l^T \mathbf{P}\_l \mathbf{r}\_l) = & \sigma\_l^2 \Big\{ \text{tr}(\mathbf{H}^{-1} \mathbf{H}\_l \mathbf{H}^{-1} \mathbf{H}\_l) - 2 \text{tr}(\mathbf{H}^{-1} \mathbf{H}\_l) + \mathbf{n}\_2 \Big\} \\ & + \sigma\_f^2 \operatorname{tr}(\mathbf{H}^{-1} \mathbf{H}\_f \mathbf{H}^{-1} \mathbf{H}\_l) \end{array} \tag{16}$$

where $n_1$ and $n_2$ are the numbers of point feature and line feature observations, respectively.

Combining Formulas (15) and (16) into matrix form, we get:

$$\mathbf{S} = \begin{bmatrix} \text{tr}(\mathbf{H}^{-1}\mathbf{H}\_{f}\mathbf{H}^{-1}\mathbf{H}\_{f}) - 2\text{tr}(\mathbf{H}^{-1}\mathbf{H}\_{f}) + \mathbf{n}\_{1} & \text{tr}(\mathbf{H}^{-1}\mathbf{H}\_{f}\mathbf{H}^{-1}\mathbf{H}\_{l})\\ \text{tr}(\mathbf{H}^{-1}\mathbf{H}\_{f}\mathbf{H}^{-1}\mathbf{H}\_{l}) & \text{tr}(\mathbf{H}^{-1}\mathbf{H}\_{l}\mathbf{H}^{-1}\mathbf{H}\_{l}) - 2\text{tr}(\mathbf{H}^{-1}\mathbf{H}\_{l}) + \mathbf{n}\_{2} \end{bmatrix} \tag{17}$$

$$\mathbf{W} = \begin{bmatrix} \mathbf{r}_f^T \mathbf{P}_f \mathbf{r}_f \\ \mathbf{r}_l^T \mathbf{P}_l \mathbf{r}_l \end{bmatrix}, \quad \hat{\boldsymbol{\theta}} = \begin{bmatrix} \hat{\sigma}_f^2 \\ \hat{\sigma}_l^2 \end{bmatrix} \tag{18}$$

$$\mathbf{S}\hat{\boldsymbol{\theta}} = \mathbf{W} \tag{19}$$

$$\hat{\boldsymbol{\theta}} = \mathbf{S}^{-1} \mathbf{W} \tag{20}$$

We take the a posteriori unit weight variance $\hat{\sigma}_f^2$ of the point feature as the common unit weight variance; the a posteriori weights of the point feature and the line feature are then:

$$\hat{\mathbf{P}}_f = \mathbf{P}_f \tag{21}$$

$$
\hat{\mathbf{P}}\_l = \frac{\hat{\sigma}\_f^2}{\hat{\sigma}\_l^2} \mathbf{P}\_l \tag{22}
$$

In the sliding window optimization, in order to improve efficiency, we ignore the trace terms in the coefficient matrix **S**; the matrix then reduces to $\mathbf{S} = \mathrm{diag}(n_1, n_2)$.
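One Helmert VCE iteration, Equations (15)-(22), can be sketched as follows; with `simplified=True` the trace terms are dropped, so **S** reduces to diag(n1, n2) and the variance estimates become simple quadratic-form averages (a numpy sketch under our own naming, not the authors' code):

```python
import numpy as np

def helmert_reweight(J_f, r_f, P_f, J_l, r_l, P_l, simplified=True):
    """One Helmert variance component estimation step.

    Returns the (unchanged) point weight, the rescaled line
    weight, and the estimated unit weight variances."""
    H_f = J_f.T @ P_f @ J_f
    H_l = J_l.T @ P_l @ J_l
    H_inv = np.linalg.inv(H_f + H_l)
    n1, n2 = len(r_f), len(r_l)
    # W of Equation (18): weighted residual quadratic forms
    W = np.array([r_f @ P_f @ r_f, r_l @ P_l @ r_l])
    if simplified:
        # dropping the trace terms leaves S = diag(n1, n2)
        S = np.diag([float(n1), float(n2)])
    else:
        # rigorous S of Equation (17)
        t_ff = np.trace(H_inv @ H_f @ H_inv @ H_f)
        t_ll = np.trace(H_inv @ H_l @ H_inv @ H_l)
        t_fl = np.trace(H_inv @ H_f @ H_inv @ H_l)
        S = np.array([
            [t_ff - 2*np.trace(H_inv @ H_f) + n1, t_fl],
            [t_fl, t_ll - 2*np.trace(H_inv @ H_l) + n2],
        ])
    sf2, sl2 = np.linalg.solve(S, W)       # theta = S^{-1} W, Equation (20)
    # keep P_f, rescale P_l by the variance ratio, Equations (21)-(22)
    return P_f, (sf2 / sl2) * P_l, (sf2, sl2)
```

When both observation groups already have equal unit weight variance, the estimated ratio is 1 and the line weight matrix is left unchanged, which is the expected fixed point of the iteration.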

#### **3. Experimental Results**

We performed two improvements to the IPL–VIO system: the front-end line feature matching method and the back-end Helmert variance component estimation. In order to evaluate the performance of the algorithm in this paper, we used the EuRoc MAV [31] and PennCOSYVIO [32] datasets for verification.

We compared the IPL–VIO proposed in this paper with OKVIS–Mono [9], VINS–Mono [10], and PL–VIO [24] to verify the effectiveness of the method. OKVIS is a VIO system which can work with monocular or stereo modes. It uses a sliding window optimization algorithm to tightly couple visual point features and IMU measurements. VINS–Mono is a monocular visual inertial SLAM system that uses visual point features to assist in optimizing the IMU state. It uses a sliding window method for tightly coupling optimization and has closed-loop detection. PL–VIO is a monocular VIO system that uses a sliding window algorithm to tightly couple and optimize visual points, line features, and IMU measurement. Since the IPL–VIO in this article is a monocular VIO system, we compared it with the OKVIS in monocular mode and VINS–Mono without loop closure.

All the experiments were performed under Ubuntu 16.04 on an Intel Core i7-9750H CPU at 2.60 GHz with 8 GB RAM, using ROS Kinetic [36].

#### *3.1. Experimental Data Introduction*

The EuRoc microaerial vehicle (MAV) datasets were collected by an MAV containing two scenes, a machine hall at ETH Zürich and an ordinary room, as shown in Figure 5. The datasets contain stereo images from a global shutter camera at 20 FPS and synchronized IMU measurements at 200 Hz [31]. Each dataset provides a groundtruth trajectory given by the VICON motion capture system. The datasets also provide all the extrinsic and intrinsic parameters. In our experiments, we only used the images from the left camera.

The PennCOSYVIO dataset contains images and synchronized IMU measurements that are collected with handheld equipment, including indoor and outdoor scenes of a glass building, as shown in Figure 5 [32]. Challenging factors include illumination changes, rapid rotations, and repetitive structures. The dataset also contains all the intrinsic and extrinsic parameters as well as the groundtruth trajectory.

We used the open source accuracy evaluation tool evo (https://michaelgrupp.github.io/evo/) to evaluate the accuracy of the EuRoc MAV datasets. We used absolute pose error (APE) as the error evaluation standard. For better comparison and analysis, we compared the rotation and translation parts of the trajectory and the groundtruth, respectively. Meanwhile, the tool provides a visualization of the comparison results, thereby the accuracy of the results can be analyzed more intuitively.

*Remote Sens.* **2020**, *12*, x FOR PEER REVIEW 10 of 19


**Figure 5.** EuRoc MAV datasets and PennCOSYVIO dataset scenes. Images (**a**,**d**) are the room scenes of the V1\_01\_easy dataset in the EuRoc MAV datasets. Images (**b**,**e**) are the machine hall scenes of the MH\_05\_difficult dataset in the EuRoc MAV datasets. Images (**c**,**f**) are the indoor and outdoor scenes of the PennCOSYVIO dataset.

The PennCOSYVIO dataset is equipped with accuracy assessment tools (https://daniilidis-group.github.io/penncosyvio/). We used absolute pose error (APE) and relative pose error (RPE) as the evaluation criteria. RPE expresses the errors as percentages by dividing the values by the path length [32]. The creators of PennCOSYVIO carefully selected the evaluation parameters, so their tool is well suited to evaluating VIO approaches on this dataset. Therefore, we adopted this evaluation tool in our experiments.

#### *3.2. Experimental Analysis of the Improved Line Feature Matching Algorithm*

We compared the proposed line feature matching method with the LBD descriptor matching method. Figure 6 shows the line feature matching effect of the LBD descriptor matching method and the method proposed in this paper. Figure 7 shows the trajectory errors of the two methods on EuRoc MAV's MH\_02\_easy and V1\_03\_difficult datasets. We comprehensively used geometric information, such as the position and angle of the line features, as well as the pixel gray information around the line features, to match corresponding line features. It can be seen that the accuracy of the improved algorithm is equivalent to that of the descriptor matching method.



**Figure 6.** Comparison of the matching effect between the line binary descriptor (LBD) matching method and the improved matching method on the MH\_02\_easy dataset. Images (**a**,**b**) are LBD descriptor matching scenes; lines of the same color are corresponding matching lines. Images (**c**,**d**) are matching scenes of the improved matching method; lines of the same color are corresponding matching lines.

**Figure 7.** Comparison of the improved matching method and the LBD descriptor matching method: (**a**,**c**) compare translation errors on the MH\_02\_easy and V1\_03\_difficult datasets; (**b**,**d**) compare rotation errors on the MH\_02\_easy and V1\_03\_difficult datasets.

We counted the trajectory errors and running times of the two methods on the MH\_02\_easy and V1\_03\_difficult datasets of EuRoc MAV. The root mean square error (RMSE) of APE is used to evaluate the translation error and the rotation error, and the time is the average time of the different algorithms running on the datasets, as shown in Table 1. On the MH\_02\_easy dataset, the LBD descriptor matching algorithm yields translation and rotation errors of 0.13057 m and 1.73778 degrees, while the matching algorithm proposed in this paper yields 0.13253 m and 1.73950 degrees; the accuracy decreases, but only slightly. On the V1\_03\_difficult dataset, the LBD descriptor matching algorithm yields translation and rotation errors of 0.19490 m and 3.31055 degrees, while the proposed matching algorithm yields 0.19792 m and 3.27675 degrees; the translation accuracy decreases slightly, the rotation accuracy increases slightly, and the overall accuracy is equivalent.


**Table 1.** Comparison of LBD descriptor matching method and the matching method we proposed.

Using the improved line feature matching method and the LBD descriptor matching method, the final trajectory accuracy is equivalent. However, in terms of running time, on the MH\_02\_easy dataset LBD descriptor matching takes an average of 74 ms per frame while the method described in this paper takes 15 ms, i.e., 20% of the LBD running time; on the V1\_03\_difficult dataset, LBD descriptor matching takes an average of 37 ms per frame while our method takes 10 ms, i.e., 27% of the LBD running time. The proposed method therefore effectively speeds up line feature matching.
