**1. Introduction**

Multi-sensor fusion localization technology based on Simultaneous Localization and Mapping (SLAM) is fundamental to the high-precision localization of mobile carriers [1]. A SLAM-based multi-sensor fusion system for mobile carriers can be divided into two core parts: the front-end and the back-end. The front-end analyzes the environmental fingerprint information collected by the sensors to estimate the pose of the mobile carrier in time and to reconstruct how the surrounding environment changes as the carrier moves. The back-end obtains the final positioning results by iteratively optimizing the pose estimates produced by the front-end. Depending on the sensors used in the front-end, these systems can be divided into methods mainly based on LiDAR and methods mainly based on vision [2,3]. Engineers and researchers in related fields have conducted extensive research in both directions and produced a series of noteworthy results.

The main vision-based SLAM approach, namely visual odometry (VO), has long dominated the SLAM field owing to the much lower cost of cameras compared with
LiDAR. However, pure monocular visual SLAM systems cannot recover metric scale. There is therefore a growing trend to use low-cost inertial measurement units (IMUs) to assist monocular vision systems, an approach known as visual-inertial odometry (VIO). Monocular VIO provides high-quality ego-motion estimation from monocular camera and IMU measurements, with significant advantages in size, cost, and power consumption. According to the method of feature association, visual SLAM can be classified into the feature point method and the direct method. Feature point-based VIO establishes inter-frame feature constraints by extracting and matching image feature points [4–6]. It therefore requires rich environmental texture to reach the threshold number of effective feature points needed for feature tracking. In weakly textured environments such as parking lots and tunnels, feature tracking is prone to failure, which in turn degrades localization accuracy and real-time performance. The theoretical basis of direct method-based VIO is the grayscale constancy assumption [7,8], written out after this paragraph: constraints are established directly from changes in image grayscale, which gives the direct method better real-time performance. Nevertheless, its tracking accuracy is strongly affected by changes in environmental illumination. Stable and rich line feature models therefore need to be introduced into the front-end to provide stable and accurate feature constraints for back-end visual state estimation. In 2018, He et al. proposed PL-VIO based on point-line feature fusion, but its many optimization factors greatly limited real-time performance in practical tests [9]. In 2020, Wen et al. proposed PLS-VIO, which optimizes the 6-DOF pose by minimizing an objective function and improves the line feature matching and filtering strategy to reduce the probability of mismatching [10]. Although VIO based on point-line features does increase the number of usable features [11,12], it still cannot solve the scale uncertainty problem of monocular cameras, so VIO alone retains certain limitations in practical applications.
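
For concreteness, the grayscale constancy assumption can be written as follows. This is the standard formulation from the direct-method literature [7,8], not a new result of the systems cited above:

```latex
% Grayscale constancy: a scene point keeps its intensity across frames.
I(\mathbf{u},\, t) = I(\mathbf{u} + \Delta\mathbf{u},\; t + \Delta t)

% Direct methods then estimate the relative pose T by minimizing the
% photometric error, where \pi projects a 3D point into the image and
% d_i is the depth of pixel \mathbf{u}_i:
E_{\mathrm{photo}}(T) = \sum_{i}
    \left\| I_{k}(\mathbf{u}_i)
    - I_{k+1}\!\left(\pi\!\left(T\,\pi^{-1}(\mathbf{u}_i, d_i)\right)\right)
    \right\|^2
```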

As another important branch of SLAM-based localization technology, SLAM mainly based on LiDAR is also widely used in industry for its high resolution, high accuracy, and high utilization of spatial features. In 2016, Google proposed Cartographer, a 2D LiDAR SLAM system based on particle filtering and graph optimization. In 2017, Zhang et al. proposed LOAM, which uses the curvature of the LiDAR point cloud to register effective point cloud features as planar points and edge points [13]; a sketch of this classification follows this paragraph. In 2018, Shan et al. proposed LeGO-LOAM based on LOAM, which uses ground plane feature points to further filter outliers from the scanned point cloud and improves on the LOAM framework [2]. In 2020, Shan et al. further introduced the LIO-SAM algorithm based on their previous work, which uses IMU pre-integrated measurements to provide the initial pose estimate for the laser odometry [14]. In addition, LIO-SAM proposes a Bayesian network-based factor graph optimization framework, in which the global position is constrained by adding GPS factors and an incrementally smoothed global voxel map is established. These schemes demonstrate the technical feasibility of high-precision positioning by fusing LiDAR with other sensors.
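
As a concrete illustration of the curvature-based registration, the minimal Python sketch below classifies points on one scan line using the LOAM-style smoothness measure; the function name, neighborhood size, and thresholds are illustrative assumptions rather than values from the cited papers.

```python
import numpy as np

def classify_scan_points(scan_line, k=5, edge_thresh=0.2, planar_thresh=0.02):
    """Classify points on one LiDAR scan line as edge or planar features.

    Implements the LOAM-style smoothness measure: for each point, sum the
    difference vectors to its k neighbors on either side along the scan
    line; a large norm indicates a sharp (edge) point, a small norm a flat
    (planar) one. scan_line is an (n, 3) array of points in the LiDAR frame.
    """
    n = len(scan_line)
    edges, planars = [], []
    for i in range(k, n - k):
        p = scan_line[i]
        # Sum of difference vectors to the 2k neighbors along the scan line.
        diff = sum((p - scan_line[j]) for j in range(i - k, i + k + 1) if j != i)
        c = np.linalg.norm(diff) / (2 * k * np.linalg.norm(p))  # smoothness
        if c > edge_thresh:
            edges.append(i)
        elif c < planar_thresh:
            planars.append(i)
    return edges, planars
```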

However, due to the inherent shortcomings of the main sensing modalities, such as the limited scanning field of view of LiDAR and the sensitivity of mainly vision-based methods to illumination variations, these methods can hardly achieve excellent robustness in real-world applications. To further improve localization performance, LiDAR-Visual-Inertial Odometry (LVIO), a multi-sensor fusion localization method, has become a research focus of SLAM thanks to the heterogeneity and complementarity of its sensors.

The existing LVIO multi-sensor fusion strategies can be described from the front-end and back-end perspectives. First, the front-end fusion strategy of LVIO is introduced. Generally, LiDAR acts as a feature depth provider for monocular VO, as a way to resolve the scale ambiguity of visual features; meanwhile, VO performs state estimation from the extracted visual features, which serves as the initial state for LiDAR scan matching. The quantity and quality of visual features are therefore closely related to the precision of the fusion system's state estimation. In existing fusion systems, the features extracted by the camera are mainly point features [15,16]. Xiang et al. proposed a combination of a fisheye camera and LiDAR based on a semantic segmentation model, which improved the confidence of visual feature depths in the driving environment of unmanned vehicles [15]. Chen et al. proposed a method to construct a loop-closure constraint for LiDAR-visual odometry by using a distributed bag-of-words (DBoW) model in the visual subsystem, though without introducing an IMU to assist the initial pose estimation [16]. In 2021, Lin et al. proposed R2LIVE, which incorporates an IMU into the fused localization system and uses the LiDAR odometry to establish depth constraints for the VIO [17]. Although the above-mentioned algorithms outperform VIO based on point features alone, it is still difficult to extract rich and effective features in weakly textured environments, which leads to failures in LiDAR scan matching. Therefore, line features, which are more robust to variations in environmental texture and illumination, need to be added as additional feature constraints for the LiDAR. Visual SLAM based on point-line features has been studied in recent years but not widely applied to LVIO systems [18,19]. In 2020, Huang et al. first proposed an LVIO based on a robust point and line depth extraction method, which greatly reduces the three-dimensional ambiguity of features [18]. Zhou et al. introduced line features into direct method-based VIO to establish data association [19]. The above-mentioned algorithms demonstrate the technical feasibility of LVIO based on point-line features; a minimal sketch of the depth-provider idea follows this paragraph.
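
To make the depth-provider idea concrete, the following Python sketch assigns LiDAR depths to monocular features by projecting the cloud into the image plane. It is a minimal illustration under assumed names and thresholds; real systems typically interpolate among several neighboring points and check depth consistency rather than taking the single nearest projection.

```python
import numpy as np
from scipy.spatial import cKDTree

def associate_feature_depth(lidar_pts, K, T_cam_lidar, features_uv, max_px=3.0):
    """Assign LiDAR depth to monocular point features (illustrative sketch).

    lidar_pts:   (N, 3) points in the LiDAR frame
    K:           (3, 3) camera intrinsic matrix
    T_cam_lidar: (4, 4) extrinsic transform, LiDAR frame -> camera frame
    features_uv: (M, 2) tracked feature pixel coordinates
    Returns one depth per feature, or NaN when no LiDAR point projects nearby.
    """
    # Transform the cloud into the camera frame; keep points in front of it.
    pts_h = np.hstack([lidar_pts, np.ones((len(lidar_pts), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # Project into the image plane with the pinhole model.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # For each visual feature, take the depth of the nearest projected point.
    tree = cKDTree(uv)
    dist, idx = tree.query(features_uv)
    return np.where(dist < max_px, pts_cam[idx, 2], np.nan)
```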

From the perspective of the back-end fusion strategy, LVIO can be classified into two categories according to the optimization algorithm: filter-based methods and factor graph methods. Although filtering is the traditional technology for multi-sensor fusion, its principal defect, namely that the filter must be rebuilt whenever sensors are added or removed, limits its application in LVIO [20]. As an emerging method in recent years, the factor graph method can effectively improve the robustness of a SLAM system when a single sensor fails because of its plug-and-play characteristics; it is therefore widely applied to such heterogeneous, aperiodic data fusion problems [21]. In addition, since LVIO operates in a local frame, it suffers from inherent defects such as accumulated error, so GNSS measurements need to be introduced for global correction [22–24] to realize locally accurate and globally drift-free position estimation, making full use of their complementarity [24]. Research on adding GNSS global constraints to the local sensor fusion framework includes the following: Lin et al. modified the extended Kalman filter to realize a loose coupling between GPS measurements and LiDAR state estimation, but a large single-linearization error remains to be solved [17]. In 2019, Qin et al. proposed VINS-Fusion, which uses a nonlinear optimization strategy to support camera, IMU, and GNSS [25], but it assumes that GNSS is continuous and globally convergent, which is inconsistent with reality. In any case, the strategies presented above provide numerous reliable ideas.

Generally speaking, we can conclude that existing LVIO fusion systems have two problems that deserve further exploration. First, on the premise of ensuring real-time performance, richer feature constraints are needed to improve the pose estimation accuracy of LVIO. Second, global constraints are needed to globally optimize the local pose estimates of LVIO. To address these issues, this study presents a LiDAR-Visual-Inertial Odometry based on optimized visual point-line features. First, an improved line feature extraction in scale space and a constraint matching strategy based on the least squares method are proposed, which provide richer visual features for the front-end of LVIO. Second, multi-frame LiDAR point clouds are projected into the visual frame for feature depth correlation, which improves the confidence of monocular visual depth estimation; at the same time, the initial visual state estimate can be used to optimize the scan matching of LiDAR. Finally, a factor graph based on a Bayesian network is used to build the LVIO fusion system, in which a GNSS factor and a loop factor are introduced to constrain LVIO globally, achieving locally accurate and globally drift-free position estimation in complex environments.

#### **2. System Overview**

The general framework of the LiDAR-Visual-Inertial Odometry based on optimized visual point-line features proposed in this study is shown in Figure 1. The system consists of a tightly coupled LiDAR-Visual-Inertial odometry front-end and a factor graph optimization back-end.

**Figure 1.** Overall algorithm framework. System inputs include the IMU, camera, LiDAR, and optional GNSS. The IMU provides initial state correction for the VIO subsystem and the LiDAR-inertial odometry (LIO) subsystem; the VIO and LIO subsystems use each other's information to improve positioning accuracy, and GNSS signals can optionally be added to the back-end to provide global constraints.

In the front-end of our algorithm, the visual odometry not only extracts point features but also extracts line features in the improved scale space and performs geometric constraint matching on them, which increases the number of features available in weakly textured environments; a baseline sketch of multi-scale line extraction follows this paragraph. Then, the feature depths provided by the LiDAR point clouds are used to correlate the depths of the monocular visual features. After time alignment with the camera, IMU pre-integration provides all the initial values necessary to complete the initial state estimation, including attitude, velocity, accelerometer bias, gyroscope bias, and three-dimensional feature positions. If VIO initialization fails, the IMU pre-integration value is used as the initial guess to improve the robustness of the fusion system in texture-free environments.
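
The sketch below illustrates baseline multi-scale line extraction using OpenCV's FastLineDetector (from opencv-contrib) as a stand-in; it does not reproduce the improved scale-space extraction or the geometric constraint matching proposed in this study, and its parameters are illustrative.

```python
import cv2
import numpy as np

def detect_lines_multiscale(gray, n_levels=3, min_length=30.0):
    """Detect line segments over an image pyramid (baseline sketch).

    Detects on each pyramid level with OpenCV's FastLineDetector, then maps
    segment endpoints back to full resolution and drops short segments.
    """
    fld = cv2.ximgproc.createFastLineDetector()
    all_lines = []
    level_img = gray
    for level in range(n_levels):
        scale = 2.0 ** level
        lines = fld.detect(level_img)  # (K, 1, 4): x1, y1, x2, y2 per segment
        if lines is not None:
            segs = lines.reshape(-1, 4) * scale  # back to base resolution
            lengths = np.hypot(segs[:, 2] - segs[:, 0], segs[:, 3] - segs[:, 1])
            all_lines.append(segs[lengths >= min_length])
        level_img = cv2.pyrDown(level_img)  # next (coarser) pyramid level
    return np.vstack(all_lines) if all_lines else np.empty((0, 4))
```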

After the front-end initialization succeeds, the back-end optimizes the factor graph using the estimated residual of each sensor's state. The IMU pre-integration, visual, and LiDAR residuals are added to the factor graph as local state factors for maximum a posteriori estimation. To further correct the accumulated error of the local state estimation, the residuals of GNSS single-point positioning measurements are added to the factor graph as global positioning factors. In addition, when the system detects a path loop, a loop factor is added to the factor graph to participate in the nonlinear optimization and obtain the optimal global pose estimate.
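
The following minimal sketch shows how local and global factors coexist in one graph, assuming GTSAM (the factor graph library popularized by LIO-SAM) rather than the implementation used in this study; the noise values and graph structure are illustrative.

```python
import numpy as np
import gtsam

# Local odometry (from the LiDAR/visual/IMU front-end) enters as between
# factors, a GNSS single-point fix as a global GPS factor; a detected loop
# closure would enter as one more between factor (omitted here).
graph = gtsam.NonlinearFactorGraph()
initial = gtsam.Values()

odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.01] * 3 + [0.05] * 3))
gnss_noise = gtsam.noiseModel.Isotropic.Sigma(3, 1.0)  # ~1 m single-point fix

x0, x1 = gtsam.symbol('x', 0), gtsam.symbol('x', 1)
graph.add(gtsam.PriorFactorPose3(x0, gtsam.Pose3(), odom_noise))
initial.insert(x0, gtsam.Pose3())

# Local factor: relative pose reported by the front-end odometry.
delta = gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(1.0, 0.0, 0.0))
graph.add(gtsam.BetweenFactorPose3(x0, x1, delta, odom_noise))
initial.insert(x1, delta)

# Global factor: a GNSS position measurement anchors x1 in the world frame.
graph.add(gtsam.GPSFactor(x1, gtsam.Point3(1.0, 0.0, 0.0), gnss_noise))

# iSAM2 performs incremental maximum a posteriori estimation.
isam = gtsam.ISAM2()
isam.update(graph, initial)
pose = isam.calculateEstimate().atPose3(x1)  # globally corrected pose
```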
