*3.2. Inter-Frame Feature Constraint Matching*

Unlike the merging of neighboring line segments within the same frame during feature extraction, the least-squares line feature constraint matching described here operates on pairs of the same line feature whose angle and distance change between two consecutive frames. Considering the rotation and translation that a line feature pair undergoes during carrier motion, a sparse-matrix minimization model can be constructed so that the total error of matching the line features extracted in the previous and subsequent frames is minimized.

Given a line $l^W = \left[ n^{W\mathsf{T}}, v^{W\mathsf{T}} \right]^{\mathsf{T}} \in \mathbb{R}^6$ extracted in the world coordinate system, where $n^W, v^W \in \mathbb{R}^3$ are its normal vector and direction vector, respectively, and letting the transformation matrix from the world frame to the camera frame be $T_C^W = \left[ R_C^W, t_C^W \right]$, with $R_C^W$ and $t_C^W$ denoting the rotation and translation, respectively, then $l^W$ can be expressed in Plücker coordinates in the camera frame as:

$$l^C = \begin{bmatrix} n^C \\ v^C \end{bmatrix} = T_C^W l^W = \begin{bmatrix} R_C^W & \left[ t_C^W \right]_{\times} R_C^W \\ 0 & R_C^W \end{bmatrix} \begin{bmatrix} n^W \\ v^W \end{bmatrix} \in \mathbb{R}^6 \tag{2}$$
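To make Equation (2) concrete, the following minimal sketch applies the Plücker transformation to a toy line. The helper names (`skew`, `transform_plucker`) and the example pose are illustrative assumptions rather than identifiers from this work.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def transform_plucker(n_w, v_w, R, t):
    """Map l^W = [n^W; v^W] into the camera frame per Eq. (2)."""
    n_c = R @ n_w + skew(t) @ (R @ v_w)   # upper block: R n^W + [t]x R v^W
    v_c = R @ v_w                          # lower block: R v^W
    return np.hstack([n_c, v_c])           # l^C in R^6

# Toy usage: identity rotation, 0.1 m translation along x.
n_w = np.array([0.0, 0.0, 1.0])            # normal vector of the line's plane
v_w = np.array([1.0, 0.0, 0.0])            # direction vector of the line
print(transform_plucker(n_w, v_w, np.eye(3), np.array([0.1, 0.0, 0.0])))
```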

It can be seen that matching line feature pairs in the camera frame is a six-degree-of-freedom (6-DOF) parametric matching problem. To improve accuracy and simplify the problem, it can be reduced to a 4-DOF parameter-matching optimization. Let all the line feature pairs obtained by matching between two consecutive frames in the camera frame be:

$$F_{ij} = \left\{ \left( l_i, l_j \right) \mid j \in [1, n] \right\} \tag{3}$$

where $l_i$ and $l_j$ are line features extracted in the previous and subsequent frames, respectively, and $n$ is the total number of line features in the subsequent frame.

According to the variation in the inter-frame line features shown in Figure 3, the parameter vector can be set as $e_{ij} = \left[ \theta_{ij}, \mu_{ij}, \rho_{ij}, d_{ij} \right]^{\mathsf{T}}$, where $\theta_{ij}$ and $d_{ij}$ are the included angle and translation distance between the two consecutive frames, respectively, and $\mu_{ij}$ and $\rho_{ij}$ are the projection ratio and length ratio of the line features between the previous and subsequent frames. Stacking these parameter vectors yields a linear constraint matrix $A_i = \left[ e_{i1}, \ldots, e_{ij}, \ldots, e_{in} \right]$ of the subsequent keyframe for $l_i$. The target vector of the matching decision for $l_i$ is $m_i = \left[ m_{i1}, \ldots, m_{ij}, \ldots, m_{in} \right]^{\mathsf{T}}$, where each component is determined by the matching result: 1 for a match and 0 otherwise. If $\sum_j m_{ij} = 1$, the linear constraint $A_i m_i = t$ is satisfied. Therefore, the line feature matching problem can be cast as a constrained matching equation based on least squares:

$$\min_{m_i} \lambda \left\| m_i \right\|_1 + \frac{1}{2} \left\| A_i m_i - t \right\|_2 \tag{4}$$

where $\lambda$ is the weight coefficient and $t = [0, 1, 1, 0]^{\mathsf{T}}$ is the constraint target vector.
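As a hedged illustration, Equation (4) can be treated as an $\ell_1$-regularized least-squares (Lasso-style) problem and solved with plain iterative soft-thresholding (ISTA). The numbers in the toy candidate matrix below are invented for demonstration; only the optimization pattern follows the text.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def match_line_feature(A, t, lam=0.1, iters=500):
    """Sparse match indicator m_i for one line of the previous frame (ISTA)."""
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)    # 1/L with L = ||A||_2^2
    m = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ m - t)                         # gradient of 0.5*||A m - t||^2
        m = soft_threshold(m - step * grad, step * lam)  # proximal step for lam*||m||_1
    return m

# Toy example: four candidate lines in the subsequent frame; each listed row is one
# parameter vector e_ij = [theta, mu, rho, d], and .T places them as columns of A_i.
A_i = np.array([[0.50, 0.02, 0.01, 0.40],
                [0.30, 0.95, 0.98, 0.20],
                [0.40, 0.60, 0.70, 0.90],
                [0.05, 0.99, 1.00, 0.03]]).T
t = np.array([0.0, 1.0, 1.0, 0.0])                       # constraint target of Eq. (4)
m_i = match_line_feature(A_i, t)
print("best match index:", int(np.argmax(m_i)), m_i.round(3))
```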

**Figure 3.** Deviation of a line feature during the movement of the carrier. (**a**) Parallel offset; (**b**) angular offset.

#### *3.3. LiDAR-Aided Depth Correlation of Visual Features*

LiDAR-aided depth correlation of visual features can effectively mitigate the scale ambiguity of a monocular camera. Since the LiDAR resolution is much lower than that of the camera, using only a single frame of sparse point cloud for depth correlation leaves a large number of visual features without depth [30]. Therefore, this study proposes a strategy of superimposing multiple frames of sparse point clouds to obtain a denser depth point cloud, which is then used to establish the depth correlation with the visual features.
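A minimal sketch of the superposition step, under the assumption that the relative poses of the older scans with respect to the current LiDAR frame are available from odometry; the function and variable names are illustrative only:

```python
import numpy as np

def accumulate_scans(scans, poses_to_current):
    """Stack several sparse scans in the current LiDAR frame.

    scans: list of (N_i, 3) point arrays; poses_to_current: list of (R, t) pairs
    mapping each scan's frame into the current frame."""
    stacked = [pts @ R.T + t for pts, (R, t) in zip(scans, poses_to_current)]
    return np.vstack(stacked)
```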

As shown in Figure 4, $f_1^V$ is a feature point in the visual frame $\{V\}$, and $d_1^L, \ldots, d_m^L$ is a group of depth points in the LiDAR frame $\{L\}$. Each $d_n^L$ is projected onto a unit sphere $V_g$ with $f_1^V$ as the spherical center to obtain a projection point $d_n^{V_g}$:

$$d_n^{V_g} = R_L^{V_g} d_n^L + p_L^{V_g}, \quad n \in [1, m] \tag{5}$$

where $R_L^{V_g}$ and $p_L^{V_g}$ are the rotation matrix and translation of the extrinsic transformation from $\{L\}$ to $V_g$, respectively. Taking $f_1^V$ as the root node, a KD-tree is built to search for the three depth points $d_1$, $d_2$, $d_3$ closest to it on the sphere. Then, connecting $f_1^V$ with the camera center $O$ and intersecting the ray $Od$ with the plane of $\triangle d_1 d_2 d_3$, the feature depth of $f_1^V$ is obtained as the length $|Od|$.

**Figure 4.** Association of visual feature depth.
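The association step can be sketched as follows. For simplicity, the sketch takes the unit-sphere projection about the camera center $O$ and assumes the feature bearing is already a unit vector in $\{V\}$; the function name and these simplifications are assumptions made for illustration, not details taken from this work.

```python
import numpy as np
from scipy.spatial import cKDTree

def associate_depth(f_bearing, lidar_pts, R_vl, p_vl):
    """Estimate the depth of one visual feature from a sparse LiDAR point cloud.

    f_bearing: unit bearing vector of the feature in {V};
    lidar_pts: (m, 3) points in {L}; R_vl, p_vl: extrinsics from {L} to {V}."""
    pts_v = lidar_pts @ R_vl.T + p_vl                             # Eq. (5): points in {V}
    dirs = pts_v / np.linalg.norm(pts_v, axis=1, keepdims=True)   # unit-sphere projection
    _, idx = cKDTree(dirs).query(f_bearing, k=3)                  # three nearest projections
    d1, d2, d3 = pts_v[idx]
    n = np.cross(d2 - d1, d3 - d1)                                # normal of the plane d1 d2 d3
    depth = np.dot(d1, n) / np.dot(f_bearing, n)                  # ray-plane intersection, |Od|
    return depth
```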

#### **4. Back-End: LVIO-GNSS Fusion Framework Based on Factor Graph**

*4.1. Construction of Factor Graph Optimization Framework*

The factor graph optimization framework based on a Bayesian network proposed in this study is shown in Figure 5. The state vector in the world frame, constructed according to the constraint factors shown in the figure, is:

$$\mathcal{X} = \left[ x_1, x_2, \ldots, x_i, \lambda_1, \lambda_2, \ldots, \lambda_{p+1}, z_1, \ldots, z_l, d_1^{e}, d_2^{e}, \ldots, d_k^{e}, d_1^{p}, d_2^{p}, \ldots, d_k^{p} \right] \tag{6}$$

where $x_n = \left[ p_n, q_n, v_n, b_a, b_g \right]$ represents the IMU state at the $n$-th time instant, which includes the carrier position $p_n$, the rotation quaternion $q_n$, and the velocity $v_n$ obtained by IMU pre-integration in the world frame, while $b_a$ and $b_g$ denote the accelerometer bias and gyroscope bias in the IMU body frame, respectively; $\lambda_p$ represents the inverse depth of a visual point feature, parameterized from the frame of its first observation; $z_l$ represents the orthonormal (orthogonal) parameterization of the $l$-th visual line feature; and $d_k^{e}$ and $d_k^{p}$ denote the distances between LiDAR feature points and their corresponding edge or plane feature point clouds, respectively.

**Figure 5.** Factor graph optimization framework of our system. The factor graph constraints on keyframe maintenance include three local constraints and two global constraints.
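The grouping of the state vector in Equation (6) can be visualized with a schematic container; the field names and container types below are assumptions made for illustration and do not prescribe this work's implementation.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class ImuState:                      # x_n in Eq. (6)
    p: np.ndarray                    # carrier position in the world frame
    q: np.ndarray                    # rotation quaternion
    v: np.ndarray                    # velocity from IMU pre-integration
    ba: np.ndarray                   # accelerometer bias (body frame)
    bg: np.ndarray                   # gyroscope bias (body frame)

@dataclass
class SlidingWindowState:            # X in Eq. (6)
    imu_states: List[ImuState] = field(default_factory=list)
    point_inv_depths: List[float] = field(default_factory=list)     # lambda: inverse depths
    line_features: List[np.ndarray] = field(default_factory=list)   # z: orthonormal line params
    edge_dists: List[float] = field(default_factory=list)           # d^e: point-to-edge distances
    plane_dists: List[float] = field(default_factory=list)          # d^p: point-to-plane distances
```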

Therefore, the Gauss–Newton method can be used to minimize all cost functions, constructing a maximum a posteriori estimation problem that performs nonlinear optimization on the state vector in the sliding window:

$$\begin{aligned} \min_{\mathcal{X}} \Bigg\{ \left\| r_p - J_p \mathcal{X} \right\|^2 + \sum_{k \in B} \left\| r_B\left( \hat{z}_{k+1}^{k}, \mathcal{X} \right) \right\|_{p_i}^2 + \sum_{(i,j) \in F} \rho\left( \left\| r_f\left( \hat{z}_{i}^{j}, \mathcal{X} \right) \right\|_{p_c}^2 \right) \\ + \sum_{(i,j) \in L} \rho\left( \left\| r_l\left( \hat{z}_{i}^{j}, \mathcal{X} \right) \right\|_{p_c}^2 \right) + \sum d_k^{e} + \sum d_k^{p} \Bigg\} \end{aligned} \tag{7}$$

where $r_p$ and $J_p$ are the prior residual and the corresponding Jacobian obtained from marginalization in the sliding window; $r_B\left( \hat{z}_{k+1}^{k}, \mathcal{X} \right)$ represents the IMU residuals and $p_i$ is the IMU covariance matrix; $r_f\left( \hat{z}_{i}^{j}, \mathcal{X} \right)$ and $r_l\left( \hat{z}_{i}^{j}, \mathcal{X} \right)$ represent the re-projection errors of the visual point and line features, respectively, $p_c$ is the visual covariance matrix, and $\rho$ denotes the Huber norm, defined as follows:

$$\rho(e(s)) = \begin{cases} \frac{1}{2} \left| e_1(s) \right|^2 & e(s) = e_1(s),\ \left| e_1(s) \right| \le \delta \\ \delta \left| e_2(s) \right| - \frac{1}{2} \delta^2 & e(s) = e_2(s),\ \left| e_2(s) \right| > \delta \end{cases} \tag{8}$$
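For reference, the sketch below implements a Huber kernel of the form in Equation (8) and the standard way such a kernel reweights residuals inside a robust Gauss–Newton step; the threshold value and the reweighting scheme are generic choices, not parameters taken from this work.

```python
import numpy as np

def huber(e, delta=1.345):
    """rho(e): quadratic for |e| <= delta, linear beyond delta (shape of Eq. (8))."""
    a = np.abs(e)
    return np.where(a <= delta, 0.5 * e ** 2, delta * a - 0.5 * delta ** 2)

def huber_weight(e, delta=1.345):
    """w(e) = rho'(e)/e, used to down-weight large residuals."""
    a = np.abs(e)
    return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

def robust_gn_step(J, r, delta=1.345):
    """One robust Gauss-Newton step: solve (J^T W J) dx = -J^T W r with W = diag(w)."""
    w = huber_weight(r, delta)
    H = J.T @ (w[:, None] * J)
    g = J.T @ (w * r)
    return np.linalg.solve(H, -g)
```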

The specific meaning of each sensor cost function in Formula (7) is as follows.
