**1. Introduction**

For autonomous unmanned systems, SLAM technology can observe the surrounding stationary environment and build a 3D map of the environment through sensors such as cameras and LiDARs installed on the robot [1]. For dynamic scenes, SLAM-based 4D reconstruction technology can reconstruct 4D (3D+time) dynamic scenes with rigid moving objects [2–4]. However, the complexity of the actual scene means the SLAM system, with only positioning and mapping functions, is unable to meet the needs of many scenarios such as military operations, emergency rescue, and disaster relief. Besides, autonomous unmanned systems are often required to obtain positioning and environmental maps at the same time. It is necessary to develop an intelligent multifunctional perception system with self-positioning, mapping, target tracking, and positioning functions to search for objects of interest within the field of view and obtain the target location.

In order to improve the state estimation accuracy of SLAM systems, a large number of multi-sensor fusion methods have been used, such as the fusion of camera and IMU [5–8], LiDAR and IMU [9–11], and a combination of all of them [12–14]. The sensors used in these methods can be divided into local pose estimation sensors such as camera and IMU and global pose estimation sensors such as GPS and magnetometer. However, they all have their advantages and disadvantages, so the single use of a certain type of sensor limits the SLAM system in practical application [15–17]. The short-term results are more credible for local pose estimation sensors, but they have two shortcomings. One is that their pose estimation results do not have a global coordinate system, so the method is not reusable. The second is that when the system runs for a long time, there will be an inevitable cumulative drift. Although the loop closure detection method can correct the accumulated error in the SLAM system, problems such as difficult matching, a large amount of data,

**Citation:** Chen, C.; Ma, Y.; Lv, J.; Zhao, X.; Li, L.; Liu, Y.; Gao, W. OL-SLAM: A Robust and Versatile System of Object Localization and SLAM. *Sensors* **2023**, *23*, 801. https:// doi.org/10.3390/s23020801

Academic Editor: Anastasios Doulamis

Received: 21 October 2022 Revised: 26 December 2022 Accepted: 26 December 2022 Published: 10 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

and limited application scenarios still exist. The frequency of global pose estimation sensors is not high, so they cannot provide much continuous observation information. Furthermore, their measurement noise is relatively large; therefore, they cannot be directly used for pose estimation. However, they have global observation coordinates and are not affected by time accumulation. Therefore, fusing different sensors is an important method to enhance pose estimation accuracy.

For targeting, YOLOv4 [18] provides a high-speed and accurate target detection network model. The optical flow method and Kalman filtering are often used for target tracking. In terms of application, J.A. [19] proposes an automatic expert system, based on image segmentation procedures, that assists in safe landing through recognition and the relative orientation of the UAV and platform. Dr. Krishna [20] specified detection and tracking algorithms in terms of extracting the features of images and videos for security and scrutiny applications. We tend to fuse other sensors such as LiDAR or Radar to locate the target because monocular cameras lack depth information. Yifang [21] proposes the use of RADAR and Infrared sensor (IR) information for tracking and estimating target state dynamics. To project image-based object detection results and LiDAR-SLAM results onto a 3D probability map, Gong et al. [22] combine visual and range information into a frustum-based probabilistic framework.

For the above reasons, this paper proposes an online positioning, mapping, and targettracking and location system based on camera, IMU, solid-state LiDAR, and GPS. Specifically, our SLAM system consists of the following four parts: a LiDAR-inertial subsystem (LIS), Visual-inertial subsystem (VIS), GPS-inertial subsystem (GIS), and global pose graph optimization (PGO). The LIS and VIS are tightly coupled, and there exists loose coupling between them. The combination of them can improve the accuracy and robustness of the whole system. Finally, the LIS and VIS results are sent to the PGO system for global pose graph optimization to eliminate accumulated drifts. Due to our distributed structure design of tightly coupled internal and loose coupled external subsystems, our system dramatically improved its robustness even in cases where one of the subsystems fails. In addition, a LiDAR–Camera fusion localization method is proposed based on conventional target detection and tracking. The global position of the target is obtained in real-time based on our SLAM.

To test the effectiveness of our system, we built the necessary sensor equipment and collected many scene-rich datasets, including high-altitude UAV aerial photography datasets, ground vehicle datasets, and ground handheld datasets. Considering that there are relatively few datasets that include the sensors we use, we open source all collected datasets for other researchers to use. Finally, we conduct extensive experiments on our dataset to test our system. Experiments show that our system can perform the expected function well with good accuracy and robustness.

The main contributions of this paper are as follows:


#### **2. Method**

Here, we first introduce the block diagram of our system and then introduce our SLAM subsystem and target-tracking and localization subsystem in detail, respectively. Specifically, we first introduce our three subsystems, namely VIS, LIS, and GIS. Then, we describe how to alleviate cumulative drift using global pose graph optimization. Finally, we introduce the object tracking and localization subsystem and demonstrate how to use SLAM system results to obtain the global position of the object.

#### *2.1. The Overview of Our System*

An overview of our system is shown in Figure 1, which includes a multi-sensor fusion SLAM system and an object tracking and location subsystem. The SLAM system is divided into the following three parts: data preprocessing, three internal subsystems running in parallel, and the final pose-graph optimization. The data preprocessing step preprocesses the input image, IMU, and LiDAR data, including image feature extraction, IMU pre-integration, and LiDAR plane-feature extraction. Then, it will send the results to the three subsystems, i.e., VIS, LIS, and GIS. There is an interaction between the VIS and LIS subsystems. That is, they both provide each other with the current estimated state, which can improve the accuracy and robustness of the whole system. Specifically, for VIS, we refer to the practice of sliding-window-based nonlinear optimization in VINS-Mono [6]. Since the depth of visual feature points of VIS usually has a large uncertainty, inspired by [12], we register the LiDAR point cloud to the image to assist image depth extraction, which significantly improves the accuracy of VIS for feature point depth estimation. For LIS, the large number of LiDAR point clouds leads to significant challenges in the computing performance, so we refer to the approach of ES-IEKF in FAST-LIO2 [10] and use the fast Kalman filter algorithm to accelerate the calculation. For GIS, we use the IMU data for state propagation and the GPS observation data to correct the IMU results to obtain a high-frequency GPS signal equal to the IMU frequency. Finally, we fuse the results of VIS, LIS, and GIS for pose graph optimization to correct the cumulative drift.

Target tracking and localization rely on local point cloud maps and poses of keyframes provided by the SLAM system. First, we detect targets on the image, and track them between consecutive frames using a Kalman filter. Subsequently, we filter the point clouds that fall in the detection frame based on the external parameters of the camera and LiDAR, using Euclidean clustering to filter the portion of the closest target as the target point cloud. Laser points may not occupy the ground target within the camera's field of view because the laser point cloud becomes sparse with increasing distance. Therefore, we consider using a local point cloud map instead of a single frame of laser points as our input to compensate for the sparsity of the laser point cloud. Finally, the real-time global position of the tracked target is calculated based on the key poses.

## *2.2. Visual Inertial Subsystem*

The pipeline of VIS is similar to VINS-Mono [6], and the system block diagram is shown in Figure 1. For VIS, we define the world coordinate system as {*W*}, and the state variables of the IMU coordinate system are represented in {*W*} as

$$\mathbf{x}\_{i} = \begin{bmatrix} \mathbf{R}\_{b\_{i}}^{\mathsf{IV}} & \mathbf{p}\_{b\_{i}}^{\mathsf{IV}} & \mathbf{v}\_{b\_{i}}^{\mathsf{IV}} & \mathbf{b}\_{\omega\_{i}} & \mathbf{b}\_{\mathbf{a}\_{i}} \end{bmatrix}^{T} \tag{1}$$

where **R***<sup>W</sup> bi* <sup>∈</sup> *SO*(3) is the rotation matrix, **<sup>p</sup>***<sup>W</sup> bi* <sup>∈</sup> <sup>R</sup><sup>3</sup> is the position, **<sup>v</sup>***<sup>W</sup> bi* is the velocity, **b***ω<sup>i</sup>* and **ba***<sup>i</sup>* are the IMU biases. The IMU motion model is as follows:

$$\begin{split} \mathbf{p}\_{b\_{k+1}}^{\mathcal{W}} &= \mathbf{p}\_{b\_{k}}^{\mathcal{W}} + \mathbf{v}\_{b\_{k}}^{\mathcal{W}} \Delta t\_{k} \\ &+ \int\_{t \in [t\_{k}, t\_{k+1}]} \Big( \mathbf{R}\_{t}^{\mathcal{W}} (\mathbf{a}\_{t} - \mathbf{b}\_{a\_{l}} - \mathbf{n}\_{a}) - \mathbf{g}^{\mathcal{W}} \Big) dt^{2}, \\ \mathbf{v}\_{b\_{k+1}}^{\mathcal{W}} &= \mathbf{v}\_{b\_{k}}^{\mathcal{W}} + \int\_{t \in [t\_{k}, t\_{k+1}]} \Big( \mathbf{R}\_{t}^{\mathcal{W}} (\mathbf{a}\_{t} - \mathbf{b}\_{a\_{l}} - \mathbf{n}\_{a}) - \mathbf{g}^{\mathcal{W}} \Big) dt, \\ \mathbf{q}\_{b\_{k+1}}^{\mathcal{W}} &= \mathbf{q}\_{b\_{k}}^{\mathcal{W}} \odot \int\_{t \in [t\_{k}, t\_{k+1}]} \frac{1}{2} \boldsymbol{\Omega} (\omega\_{t} - \mathbf{b}\_{\omega\_{l}} - \mathbf{n}\_{\omega}) \mathbf{q}\_{l}^{b\_{k}} dt, \end{split} \tag{2}$$

where

$$\boldsymbol{\Omega}(\boldsymbol{\omega}) = \begin{bmatrix} -\lfloor \boldsymbol{\omega} \rfloor\_{\times} & \boldsymbol{\omega} \\ -\boldsymbol{\omega}^T & 0 \end{bmatrix}, \begin{bmatrix} \boldsymbol{\omega} \end{bmatrix}\_{\times} = \begin{bmatrix} 0 & -\boldsymbol{\omega}\_z & \boldsymbol{\omega}\_y \\ \boldsymbol{\omega}\_z & 0 & -\boldsymbol{\omega}\_x \\ -\boldsymbol{\omega}\_y & \boldsymbol{\omega}\_x & 0 \end{bmatrix},$$

**q***<sup>W</sup> bk* is the quaternion represent of **<sup>R</sup>***<sup>W</sup> <sup>t</sup>* , **a***<sup>t</sup>* is the acceleration reading from IMU, **n***<sup>a</sup>* is the acceleration noise of IMU, *ω<sup>t</sup>* is the angular velocity reading from IMU, **n***<sup>ω</sup>* is the angular velocity noise of IMU, Δ*tk* is the duration between the time interval [*tk*, *tk*<sup>+</sup>1].

For the input image, VIS first detects the Fast-corner [23] and then uses the KLT algorithm [24] for optical flow tracking. Since the inverse depth of visual features optimized by VIS has great uncertainty, we accumulate the LiDAR point clouds of recent frames and then project them onto the image to assist depth estimation. Specifically, as shown in Figure 2, we use the transformation between LiDAR and camera coordinate system **T***C <sup>L</sup>* = [**R***<sup>C</sup> <sup>L</sup>* | **<sup>p</sup>***<sup>C</sup> <sup>L</sup>* ] to project the LiDAR point cloud onto the camera image. Then we find the nearest three projected LiDAR points on the image plane for a visual feature by searching a two-dimensional K-D tree. At last, we use these three points to fit a plane and back-project the visual feature onto this plane as its 3D point, which is shown in Figure 3. It can be seen that for most visual feature points, we can accurately estimate their corresponding 3D points. It is very beneficial in improving the accuracy of the VIS.

**Figure 2.** Visual feature depth registratoin with LiDAR points.

**Figure 3.** (**a**) The visual features on the image. (**b**) The estimated 3D points of visual feature. It's clear that lots of visual features can get accurate estimates of their 3D points.

In the back-end sliding window optimization, the camera pose and the depth of feature points will be optimized as state variables. Unlike the almost completely independent design of LIS and VIS in [12], we use the current state value **x***<sup>L</sup>* predicted by LIS as the initial state of VIS. The factor is added to the back-end optimization process of VIS, as shown in Figure 4. It is well known that the robustness and accuracy of LIS are higher than VIS, so this design can improve the performance of VIS, thereby making a more accurate estimate of the initial pose for the next LIS.

**Figure 4.** The factor graph of our visual inertial subsystem.

#### *2.3. LiDAR Inertial Subsystem*

Our LIS is modified from [10]. Specifically, as shown in Figure 1, we use Iterate Kalman Filter based on error-state to fuse IMU and LiDAR observations. Benefiting from the improvement of Kalman Gain *K* in [10], this algorithm can run in real-time without causing excessive computational burden with the increase in LiDAR observation points. The iteratively optimized Kalman filter algorithm has been proved by [25] to have the same results as the least squares algorithm using Gauss-Newton, so our LIS also guarantees the accuracy of the algorithm.

When receiving a scan from LiDAR, we first extract the plane feature points and then use the poses obtained from inertial integration to remove the motion distortion of the point cloud. We use the IMU state propagation Equation (2) to obtain an up-to-date estimate of the current LiDAR pose. However, unlike [10], thanks to the existence of our VIS, we can continue to use IMU data to estimate the current LiDAR pose based on the latest pose estimated by VIS. This method can improve the accuracy of our LIS.

After obtaining the pose estimation of current scan, we need to calculate the distance from the extracted plane feature points to the fitted plane, which is same as in [10]. However, in practical applications, the LIS has a significant drift in height, i.e., the *z* axis. So we add a ground constraint to solve this problem and can flexibly choose whether to use this constraint for different scenes. Our ground detection algorithm is simple but effective. Expressly, we assume that the ground is an almost horizontal plane, which is almost always satisfied indoors and holds outdoors in the vast majority of cases. Then we accumulate the last few frames of LiDAR point clouds in the LiDAR coordinate system {*L*}. We assume that the distance *h* between LiDAR installation height and the ground is unchanged. For cars and handheld devices, this assumption is generally valid. Then we filter out all point clouds whose height is in [*h* − *δ*, *h* + *δ*], and ground points are almost included. As shown in Figure 5 shows the ground point cloud detected by our algorithm. It can be seen that our algorithm can detect the ground well.

**Figure 5.** (**a**) The visual image view. (**b**) The ground point clouds(white) detected by our LiDAR inertial subsystem. It's clear that the ground point clouds can be segmented successfully.

Then we use the RANSAC algorithm to fit ground plane in these point clouds and get the equation of the ground equation in {*L*}. We can get

$$(\mathbf{x}\_d^L - \mathbf{p}\_d^L) \cdot \mathbf{n}\_d^L = 0,\tag{3}$$

where **x***<sup>L</sup> <sup>d</sup>* is point on ground plane, **<sup>p</sup>***<sup>L</sup> <sup>d</sup>* is point on the plane and **<sup>n</sup>***<sup>L</sup> <sup>d</sup>* is the plane normal vector in the {*L*}. Since world coordinate system {*W*} is aligned with gravity **g** = [0, 0, −*g*] *<sup>T</sup>*, we can easily get the actual ground equation in our world coordinates as

$$(\mathbf{x}\_t^{\mathcal{W}} - \mathbf{p}\_t^{\mathcal{W}}) \cdot \mathbf{n}\_t^{\mathcal{W}} = 0,\tag{4}$$

where **p***<sup>W</sup> <sup>t</sup>* = [0, 0, −*h*0] and **<sup>n</sup>***<sup>W</sup> <sup>t</sup>* = [0, 0, 1] *<sup>T</sup>*, and *h*<sup>0</sup> is the distance from the origin of our {*W*} coordinate system to the ground.

In the same way, we use the current LiDAR pose **T***<sup>W</sup> <sup>L</sup>* to convert this actual ground equation into the current frame LiDAR coordinate system {*L*}. We can get

$$(\mathbf{x}\_t^L - \mathbf{p}\_t^L) \cdot \mathbf{n}\_t^L = 0. \tag{5}$$

Next, we define the detected plane [**n***<sup>L</sup> <sup>d</sup>* ; **<sup>p</sup>***<sup>L</sup> <sup>d</sup>* ], the real plane [**n***<sup>L</sup> <sup>t</sup>* ; **p***<sup>L</sup> <sup>t</sup>* ], and the error between them is added to the optimization of LIS to alleviate the *z* axis drift. First we use a rotation matrix **<sup>R</sup>** ∈ *SO*(3) to rotate **<sup>n</sup>***<sup>L</sup> <sup>t</sup>* to *x* axis, that is

$$\begin{bmatrix} \mathbf{1} & \mathbf{0} & \mathbf{0} \end{bmatrix}^\top = \mathbf{R} \mathbf{n}\_t^L. \tag{6}$$

Then we use this rotation matrix to rotate **n***<sup>L</sup> <sup>d</sup>* , that is

$$\begin{bmatrix} x\_{d'}^L & y\_{d'}^L & z\_{d'}^L \end{bmatrix}^\top = \mathbf{R} \mathbf{n}\_d^L. \tag{7}$$

We define the variables to measure the orientation error between the detected ground equation and the actual ground equation as

$$\alpha = \arctan \frac{y\_{d'}^{L}}{x\_{d'}^{L}}, \quad \beta = \arctan \frac{z\_{d'}^{L}}{\sqrt{x\_{d'}^{L}}^2 + y\_{d'}^{L}}.\tag{8}$$

Define the variables to measure the translation error of these two planes as

$$\gamma = \mathbf{p}\_t^L \cdot \mathbf{n}\_t^L - \mathbf{p}\_d^L \cdot \mathbf{n}\_d^L. \tag{9}$$

We define **e** = [*α*, *β*, *γ*] *<sup>T</sup>* as the final ground residual, which can well constrain the error of our system in the directions of *roll*, *pitch*, and *z*. This improvement on the final LIS can be seen in our later experiments.

#### *2.4. GPS Inertial Subsystem*

We add a GPS-inertial subsystem to better prepare for back-end pose graph optimization. In this subsystem, we use the Error-state Kalman Filter to integrate the state obtained by IMU and the state of GPS observations. This fusion can obtain a high-frequency pose output at IMU frequency. The reason for this design is that the frequency of GPS data is relatively low, and it cannot accurately time matched with the output of the VIS or LIS [9,15] both use the assumption of uniform motion to interpolate GPS data. However, in the case of high-speed and non-uniform motion, this assumption will cause the interpolation results to contain large noise, which reduces the accuracy of back-end pose graph optimization. Specifically, we use Equation (2) to predict the IMU state and define the error of the state **x** prediction as

$$
\delta \mathbf{x} = \begin{bmatrix}
\delta \mathbf{R}\_b^W & \delta \mathbf{p}\_b^W & \delta \mathbf{v}\_b^W & \delta \mathbf{b}\_\omega & \delta \mathbf{b}\_\mathbf{a}
\end{bmatrix}^\top. \tag{10}
$$

For GPS data, it is defined in WGS84 coordinate system. Same as [15], we first convert the data to ENU coordinate system as our state. Assuming that the position of our GPS sensor in the IMU coordinate system {*b*} is **<sup>p</sup>***<sup>b</sup> GPS*, then we can get GPS data to observe the origin of IMU coordinate system **p***<sup>W</sup> GPS* as

$$\mathbf{p}\_{\rm GPS}^{\rm W} = \mathbf{p}\_{b}^{\rm W} + \mathbf{R}\_{b}^{\rm W} \mathbf{p}\_{\rm GPS}^{b}. \tag{11}$$

Using Equation (11) we can get the Jacobian matrix of GPS observations for the error state *δ***x** as

$$\mathbf{x}\_{GPS} = \begin{bmatrix} \mathbf{I} & \mathbf{0} & -\mathbf{R}\_b^W \lfloor \mathbf{p}\_{GPS}^b \rfloor\_\times & \mathbf{0} & \mathbf{0} \end{bmatrix}^\top. \tag{12}$$

Referring to the error-state Kalman filter equation in [26], we can get the result of the fusion of GPS and IMU, which is more accurate than the interpolation of GPS data using the assumption of uniform motion.

#### *2.5. Pose Graph Optimization*

After all three subsystems complete their estimation tasks, their results are sent to the final pose graph optimization system for processing. In the pose graph optimization system, we select keyframes for the input of VIS and LIS and use iSAM2 [27] to optimize the pose graph. Precisely, we will count the pose changes between the latest keyframes in the relative pose map of the current input frame. If the rotation or translation part of the pose transformation exceeds the threshold we set, then we will use it as an optimized keyframe. Thanks to our GIS, it uses GPS and IMU to perform Error-state Kalman Filter to get high-frequency GPS observations, which makes a GPS observation constraint almost available for each keyframe. However, the GPS signal usually has a large error when occluded, which reflected in our GIS system is the output covariance *PGPS* is relatively large. We filter the results of *PGPS* < *θ<sup>P</sup>* to add to pose graph. As shown in Figure 6, it is the block diagram of pose graph optimization system.

**Figure 6.** The block diagram of pose graph optimization.

#### *2.6. Targets Detection and Tracking*

We use YOLOv4 to detect the targets and track them with the optical flow method and Kalman filter. We first use the optical flow method to eliminate the deviation of the pixel coordinates caused by the camera movement of targets and then use the Kalman filter to predict their position.

#### 2.6.1. L-K Optical Flow

First, the L-K optical flow assumes that the gray value of the same point in space is constant across images; then, the weighted least squares method is used to estimate the optical flow field, where the grayscale value of the point *a* = (*x*, *y*) at time *t* is assumed to be *I* = (*x*, *y*, *t*), and the optical flow constraint equation can be derived based on the following assumptions:

$$
\nabla I \cdot \mathbf{V}\_a + I\_t = \mathbf{0},\tag{13}
$$

where ∇*I* = (*Ix*, *Iy*) denotes the gradient of the image at point *a*; and *Va* = (*u*, *v*) is the optical flow at point a. Assuming that the optical flow is the same at each point in a local neighborhood centered at point a, search for the displacement that minimizes the matching error in this block, i.e., define Equation (14) for this neighborhood, and minimize its function value as follows:

$$F(\mathbf{x}, \mathbf{y}) = \sum\_{(\mathbf{x}, \mathbf{y}) \in \Omega} \mathcal{W}^2(\mathbf{x}, \mathbf{y}) \quad [\nabla I \cdot \mathbf{V}\_a + I\_\mathbf{f}]\_\prime \tag{14}$$

where Ω denotes the local neighborhood of point a and *W*(*x*, *y*) denotes the weight function. The optimal solution of equation Equation (14) is obtained as follows:

$$\begin{aligned} \mathcal{A} &= [\nabla I(\mathbf{x}\_1), \nabla I(\mathbf{x}\_2), \dots, \nabla I(\mathbf{x}\_n)], \\ \mathcal{W} &= \text{diag}[W(\mathbf{x}\_1), W(\mathbf{x}\_2), \dots, W(\mathbf{x}\_n)], \\ \mathbf{b} &= -[I\_l(\mathbf{x}\_1), I\_l(\mathbf{x}\_2), \dots, I\_l(\mathbf{x}\_n)]. \end{aligned} \tag{15}$$

The final equation can be solved:

$$\mathbf{V} = \left[\mathbf{A}^{\mathrm{T}} \mathbf{W}^{2} \mathbf{A}\right]^{-1} \mathbf{A}^{\mathrm{T}} \mathbf{W}^{2} \mathbf{b}. \tag{16}$$

The simple L-K optical flow method cannot manage a situation where the UAV is moving quickly at a high altitude. Moreover, it will generate significant computational errors due to the large motion, which will not only affect the algorithm's accuracy but also reduce the overall computing speed. In this paper, we employ the pyramid-based L-K optical flow method, whose principle is described as follows: First, the optical flow and affine transformation matrices are calculated for the image of the highest layer. The result of the calculation of the previous layer initializes the calculation of next layer. The optical flow and affine transformation matrices are calculated based on the initialization. This process is repeated until the original image layer is reached. The final result will be computed depending on this coarse-to-fine filtering process.

#### 2.6.2. Kalman Filter Update Objective

In the case of a video that needs to be tracked, the state vector can be expressed as follows:

$$\mathbf{X} = \begin{bmatrix} \mathbf{c}\_x & \mathbf{c}\_y & \mathbf{w} & \mathbf{h} & \mathbf{v}\_x & \mathbf{v}\_y & \mathbf{v}\_w & \mathbf{v}\_h \end{bmatrix} \tag{17}$$

where **c***x*, **c***<sup>y</sup>* are centers, **w**, **h** are the width and height of the bounding box, and **v***x*, **v***y*, **v***w*, **v***<sup>h</sup>* are the speed of their change. Note that **c***x*, **c***y* are corrected in Section 2.6.1. We can predict **X***<sup>t</sup>* based on **X***t*−1:

$$\mathbf{x}' = \mathbf{F}\mathbf{x} + \boldsymbol{\mu},\tag{18}$$

$$\mathbf{P'} = \mathbf{F}\mathbf{P}\mathbf{F'}^\top + \mathbf{Q}\_{\prime} \tag{19}$$

where **F** is called the state transfer matrix, **P** is the covariance of tracking at the moment *t* − 1, and **Q** is the noise matrix of the system. The observed and predicted values are considered to apply to the same target if they satisfy the following two conditions:


Finally the tracker is updated with predicted (**x** ) and observed values, as follows (**y**):

$$\mathbf{y} = \mathbf{z} - \mathbf{H}\mathbf{x}',\tag{20}$$

$$\mathbf{S} = \mathbf{H} \mathbf{P}' \mathbf{H}^\top + \mathbf{R},\tag{21}$$

$$\mathbf{K} = \mathbf{P'} \mathbf{H}^{\top} \mathbf{S}^{-1},\tag{22}$$

$$\mathbf{x} = \mathbf{x}' + \mathbf{K} \mathbf{y},\tag{23}$$

$$\mathbf{P} = (\mathbf{I} - \mathbf{K}\mathbf{H})\mathbf{P}',\tag{24}$$

where the observation matrix **Z** = [ **c***<sup>x</sup>* **c***<sup>y</sup>* **w h** ].

#### 2.6.3. LiDAR Vision Fusion Targeting

To calculate the point cloud falling in the bounding box, we first calculate the pixel coordinates of the laser point in the image as follows:

$$\mathbf{T}\begin{bmatrix} \mathbf{c}\_i^{cam} \\ \mathbf{1} \end{bmatrix} = \mathbf{T}\_b^c (\mathbf{T}\_b^w)^\top \begin{bmatrix} \mathbf{c}\_i^{map} \\ \mathbf{1} \end{bmatrix}, (i = 1, 2, 3, \dots, N\_A), \tag{25}$$

$$\mathbf{p}\_{i}^{cam} = \begin{bmatrix} u\_{i} \\ v\_{i} \\ 1 \end{bmatrix} = \frac{\mathbf{I}^{cam}}{w\_{i}^{\prime}} \mathbf{c}\_{i}^{cam} = \frac{1}{w\_{i}^{\prime}} \begin{bmatrix} u\_{i}^{\prime} \\ v\_{i}^{\prime} \\ w\_{i}^{\prime} \end{bmatrix}, (i = 1, 2, 3, \dots, N\_{A}), \tag{26}$$

where **c** *map <sup>i</sup>* = [*xi*, *yi*, *zi*] is the point in the local point cloud map and *NA* is the amount of point clouds, **T***<sup>w</sup> <sup>b</sup>* = [**R***<sup>w</sup> <sup>b</sup>* | **<sup>p</sup>***<sup>w</sup> <sup>b</sup>* ], **<sup>T</sup>***<sup>c</sup> <sup>b</sup>* = [**R***<sup>c</sup> <sup>b</sup>* | **<sup>p</sup>***<sup>c</sup> <sup>b</sup>*], **<sup>p</sup>***cam <sup>i</sup>* is the pixel coordinate of the point cloud, and **<sup>I</sup>***cam* <sup>∈</sup> <sup>R</sup><sup>3</sup> <sup>×</sup> 3 is the intrinsic matrix of the camera.

Suppose there are *n* trackers in the current image. We filter the point clouds **<sup>C</sup>***cam* = {**c***cam <sup>i</sup>* |*<sup>i</sup>* = 1, 2, 3, ... , *NA*},**C***map* = {**<sup>c</sup>** *map <sup>i</sup>* |*i* = 1, 2, 3, ... , *NA*} in the occupancy detection frame and use Euclidean clustering to classify the points in the map point cloud (**C***map*) into *<sup>n</sup>* classes. Moreover, the center of the point cloud **<sup>O</sup>** = {**o***i*|*<sup>i</sup>* = 1, 2, 3, ... , *<sup>n</sup>*} is the target object's location. The largest rectangle that can wrap the point cloud represents the outline of the target.

#### **3. Physical Experiment Analysis**

Considering that there is no open dataset that can satisfy all the sensors used by our algorithm simultaneously, we built our hardware equipment and collected a large number of scene-rich datasets to verify our algorithm. First, we will introduce our hardware equipment and the collected datasets. Then, based on the collected dataset, we experimentally verify the improvements made in our system relative to [6,9]. Specifically, our experiments included the improvement of VIS by registering feature depth with LiDAR and improvements related to leading ground constraints into LIS. Then, we tested the improvement of the positioning and mapping accuracy of the SLAM system by adding a GPS to pose graph optimization on datasets of different scales.

We set up two experiments for the target localization system on the ground and in the air, respectively. The ground experiment mainly verifies the Ray-vision fusion's relative localization effect without providing the global localization error. After that, we conducted aerial experiments while measuring the global positioning error at different distances based on our SLAM. All experiments were performed on the same system with an Intel® Core™ i7-9700 CPU @ 3.00GHz × 8 and Nvidia GTX 1080ti.

### *3.1. Hardware and Dataset of Our System*

The hardware of our system is shown in Figure 7, which includes a global shutter camera, a LiVox AVIA LiDAR (FoV of 70.4◦ × 77.2◦), a GNSS-INS module, a power supply unit, and an onboard computation platform (equipped with an Intel i5-8400 CPU and 16 GB RAM).

**Figure 7.** (**a**) The front view of our hardware. (**b**) The back view of our hardware. The total weight of our device is below 3 kg.

We collected various datasets with rich scenes. Specifically, we used two large-scale datasets collected by drones at an altitude of about 100 m, which we call HZ-odom and HZ-map. In the ground scene, we fixed the device to the electric bicycle and collected several datasets, including two large-scale datasets, which we named ZJG-gym and ZJGnsh. Three medium-sized datasets, which we name ZJG-lib, YQ-odom, and YQ-map. We also collected two hand-held datasets of small-scale scenarios on the ground and named them CSC-build and CSC-road. The specific information of each dataset is in Table 1, which includes the duration of the dataset, the length of the trajectory, whether they include a return to the origin, and the difficulty level.


**Table 1.** Dataset detail.

#### *3.2. Feature Depth Registration of VIS*

In this experiment, we focus on the improvement achieved by including the depth registration of visual feature points using LiDAR point cloud in our VIS system. Since our VIS system is adapted from VINS-Mono [6], we mainly compare the results of our VIS system for depth registration of visual feature points with [6]. During the experiment, our LIS subsystem only runs the point cloud preprocessing part. It does not use its running results to add priors to VIS. The block diagram of our VIS in the experiment is shown in Figure 8.

**Figure 8.** The block diagram of our VIS experiment.

We conducted experiments on YQ-map ground dataset. In this dataset, the vehicle runs at a constant speed most of the time; therefore, the IMU is close to degenerating. Moreover, the vehicle has many 90° turns, which means the visual constraints of the VIO may easily fail, and this feature can lead to scale drift problems. For fairness of the experiment, we turned off the loop closure detection thread of VINS-Mono and only compared the accuracy of the odometry. Furthermore, we set VINS-Mono and our VIS system to be identical in terms of front-end feature extraction, back-end sliding-window keyframes, and optimization time. Since the YQ-map dataset returns to the origin, we use the distance from the endpoint to the start point to judge the accuracy. The result is shown in Figure 9. Due to the depth registration of visual feature points, we can see that our system has better scale consistency and higher accuracy than VINS-Mono. This result is easy to explain: our VIS outperforms VINS-Mono due to the extra scale gained by adding LiDAR point clouds to the depth registration of visual feature points.

In addition, to test the improvement of the absolute accuracy of VIS by performing depth registration, we also conduct experiments on large-scale dataset ZJG-gym and use GPS trajectories as ground truth. The comparison results of our VIS and VINS-Mono are shown in Figure 10. We can see that our VIS and GPS trajectories are in better alignment. We use the root mean square error (RMSE) results to measure the result accuracy. The RMSE of our VIS is 5.03 m, while VINS-Mono is 7.92 m.

**Figure 9.** The result of our VIS and VINS-Mono on YQ-map dataset. It goes back to origin.

**Figure 10.** (**a**) The trajectory alignment of our VIS and GPS on ZJG-gym dataset. (**b**) The trajectory alignment of VINS-Mono and GPS on ZJG-gym dataset. It is clear that our VIS has a better result.

#### *3.3. Ground Constraint of Our LIS*

In this experiment, we focus on the improvement by adding ground constraints to our LIS subsystem. Since our LIS system is adapted from FAST-LIO2 [10], we mainly compare the results of our LIS system with ground constraints and [10]. Specifically, in the experiments, we run our LIS alone without VIS's prediction.

We conducted our experiments on ZJG-lib dataset, which has many horizontal grounds. Specifically, our LIS system uses the same parameter configuration as FAST-LIO2, including the point cloud downsampling density, the number of iterative Kalman filtering, etc. Then

in the experiment, we focus on the *z* axis drift. The result is shown in Figure 11. From the figure, we can see that due to the ground constraints we added, our LIS system has almost no drift in the height direction, while FAST-LIO2 shows a significant drift in height. This result is easy to explain. For ZJG-lib dataset, the LiDAR installation location is close to the ground with a height of 0.8 m. Therefore, the incident angle of the LiDAR scanning distant ground points is small, which reduces the accuracy of the point cloud scanned by LiDAR and increases the drift in altitude of FAST-LIO2.

**Figure 11.** (**a**) The map result of FAST-LIO2 on ZJG-lib dataset. (**b**) The map result of our LIS with ground constraint on ZJG-lib dataset. (**c**) The path result of FAST-LIO2 and our LIS. It is clear that due to the ground constraint, our LIS has less drift than FAST-LIO2.

#### *3.4. Pose Graph Optimization of Our System*

In this experiment, we focus on the improvement of our entire SLAM system due to the addition of global pose graph optimization. To demonstrate the cumulative trajectory drift suppression that occurs when incorporating a GPS, we conducted experiments on large-scale datasets in the air and on the ground. Specifically, we used multiple large-scale datasets to examine the improvement of our system by fusing GPS data.

First, we conduct experiments on an aerial dataset with HZ-map dataset, in which the drone carries equipment to collect data at an altitude of 100 m. Because of the poor weather conditions when collecting the data, the aircraft in the air is highly unstable, so this dataset presents a considerable limitation in terms of the accuracy and robustness of the SLAM system. Here, we compare the localization and mapping of our system with VINS-Mono and FAST-LIO2 systems, and the results are shown in Figure 12. Since our system fuses GPS data, there is better consistency in positioning and mapping results in large-scale scenes. However, due to the turbulence of the drone at a high altitude in this dataset, VINS-Mono fails and does not provide meaningful results.

In order to better test the accuracy and robustness of our algorithm, we also conducted experiments on many other datasets and used the GPS trajectory as the ground truth. We show the root mean square error (RMSE) results in Table 2 and some trajectory results in Figure 13. The compared algorithms are the separate VIO system VINS-Mono [6], the separate LIO system FAST-LIO2 [10], and the state-of-the-art LIVO systems R2LIVE [13] and R3LIVE [28], which are most similar to our system. It is clear that our system achieves the best accuracy and the most robust performance. Interestingly, after R2LIVE [13] and R3LIVE [28] are integrated with cameras, some data sequences have a lower precision than FAST-LIO2 [10]. This is expected because incorporating visual information in situations that are not conducive to camera work may reduce accuracy. Therefore, visual information is usually incorporated in LIO systems mainly to increase the robustness of the whole system. In addition, R2LIVE [13] and R3LIVE [28] failed on both HZ-map and HZ-Odom datasets, which are run in harsh high-altitude environments. Their two subsystems, VIO and LIO, are tightly coupled. Once a subsystem state is incorrectly estimated, it will have

a devastating impact on the entire system. Because our VIO and LIO subsystems are loosely coupled, the collapse of one subsystem will not affect the regular operation of the whole system.

**Figure 12.** (**a**) The image view of our HZ-map dataset. (**b**) The trajectory of GNSS, FAST-LIO2 and our system. VINS-Mono failed due to the aggressive motion. (**c**) The map built by our system aligned with Google Earth. As a result of the GNSS fusion in our system, we obtain a highly accurate result.

**Table 2.** RMSE translation error w.r.t GPS.


**Figure 13.** (**a**) The trajectory results on ZJG-nsh dataset. (**b**) The trajectory results on YQ-map dataset. (**c**) The trajectory results on YQ-odom dataset.

#### *3.5. Robustness Evaluation of SLAM System*

First, we evaluate the robustness of our system when subject to severe motion. HZmap is a dataset of drones flying in the air. During automatic flight, the plane experiences sudden stops, turns, and other actions. As shown in Figure 14, the plane has a large number of drastic pitch and yaw angle changes. It can also be seen from the previous experiments that VINS-Mono, R2LIVE, and R3LIVE all failed. Our VIS subsystem also failed. However, the other two subsystems still usually work, which showes the excellent robustness of our system.

**Figure 14.** The Euler Angles of HZ-map dataset, which has an aggressive motion.

Then, we tested the performance of our system when the sensor failed on YQ-map dataset. As shown in Figure 15a, we selected three locations uniformly throughout the trajectory and disabled the camera, LiDAR, and GPS to detect the impact on the system. The experimental results are shown in Figure 15b. When one or two sensors fail, our system can operate normally and obtain relatively accurate trajectory results. Running only the VIO system leads to poor trajectory results only when both the GPS and LiDAR fail. This experiment proves that our system has good robustness.

## *3.6. LiDAR-Vision Fusion Relative Localization*

One person was arranged in an open outdoor scene as the target to be located in the experiment. The person holds RTK, as shown in Figure 7, facing a dynamic target walking arbitrarily within 10–40 m of the device's field of view (as shown in Figure 16). We randomly select some locations as checkpoints and use the laser to measure the distance between the target and the sensor module to compare with the localization results and evaluate the relative localization accuracy of the algorithm. The experimental results are shown in Table 3.

**Figure 16.** Two test screenshots are shown. (**a**) The results of image detection and tracking. The target is marked with a green box, and the point cloud in (**b**) is also marked with a green box to indicate the spatial position relative to the perception module.


**Table 3.** Relative Positioning error.

As seen from the results in Table 3, the proposed LiDAR-Camera fusion positioning method in this paper has high measurement accuracy: the relative positioning error is provided in centimeters when the target is less than 40 m away from the measurement unit. As GPS is usually only accurate to the meter level; therefore, it is essential first to ensure that the relative positioning accuracy is as high as possible to maintain the lowest possible error when converting the object's position to the world coordinate system. The following describes the global positioning experiment for the target.

#### *3.7. SLAM Based Global Localization*

We chose to perform this experiment in the air, keeping with the system's actual application scenario. Again, we chose people as targets to evaluate our algorithm. As shown in Figure 17, our UAV flies to a distance of 10 m, 30 m, 60 m, and 90 m from the target for ground reconnaissance, while on the other hand, the target on the ground carries a GNSS receiver that moves at least ten meters along a given trajectory. At this point, the UAV is stationary or follows the target in motion, keeping the target in view. Once the separation distance exceeds 60 m, it is difficult to observe the target in the image with the naked eye. Therefore, we retrained the detection model using the Visdrone Dataset [29] and our small air-to-ground target dataset (Table 4).

**Table 4.** Comparison of Visdrone and our Dataset.


On the other hand, the laser point cloud at high altitudes would be sparse in terms of measuring the distance to the target. Therefore we use the local point cloud map provided by our SLAM system for the calculation instead of the single frame. In addition, our SLAM system also provides the UAV pose corresponding to the local point cloud, which can be used to calculate the global positions of the targets. It is worth mentioning that our framework locates all targets in the field of view in real time. However, to facilitate the evaluation of the results, we only count the localization error of one of the targets (Figure 18). The RTK conversion on our target path point to the coordinates under the takeoff point ENU coordinate system is used as ground truth. The distance between the localization result and its closest ground truth is used as the error to evaluate the algorithm.

**Figure 17.** Schematic diagram of the aerial reconnaissance experiment. The image on the left is the actual aerial image (about 30 m apart), the red box is the target's position, and the image on the right indicates the relative position relationship between the UAV and the target.

**Figure 18.** The display image of the tracking effect. (**a**) shows the real-time tracking result, and after a target is selected, the target is indicated by a red ball in (**b**). The green in (**b**) represents ground truth, and the white color is the aircraft's trajectory calculated by our SLAM.

We collected data using a UAV and verified our algorithm offline. The target position was inferred in 71.5 ms in one round of the experiment.The experimental results are shown in Table 5. Even at high altitudes where the sensors moved violently, our algorithm tracked the target stably and maintained high positioning accuracy. Especially at medium and long distances of 90 m, where the target occupied only a dozen pixel values, our system maintained an error of about one meter.

**Table 5.** Global Positioning error.


#### **4. Discussion Conclusions**

In the case of GPS denial, UAV self-positioning and target detection technology can play a very influential role in the military and rescue fields that require reconstructing the target area's ground scene quickly, obtaining the corresponding GPS position, and marking the category and real-time position of some critical targets.

This paper proposes a robust, versatile self-localization mapping and target-tracking localization system. Our SLAM system fuses multiple local and global sensors, including camera, LiDAR, IMU, and GPS, and thus has the advantages of high local accuracy and no global drift. Our SLAM system consists of three subsystems, VIS, LIS, and GIS. The three subsystems are tightly coupled, and the subsystems are loosely coupled through information sharing. This system architecture not only ensures the system's accuracy but also improves the system's robustness. We built our experimental equipment, collected many air and ground datasets, and conducted detailed experimental verification and analysis. The experimental results prove that our SLAM system has higher accuracy and better robustness than the current SOTA system. We also introduce a LiDAR-Camera Fusion object tracking and localization algorithm. We first used the retrained YOLOv4 to detect the target's position on the image and used LK optical flow and the Kalman filter to track the targets. We used LiDAR to recover the depth Information of the Target. Our target-tracking and localization system can effectively detect the target of interest and perform global localization of the target based on the results of the SLAM system. We conducted extensive experiments on our collected datasets, and the results show that our system performs the expected functions well. In the future, we will research real-time online tightly coupled GPS data and the high-altitude detection of small targets.

**Author Contributions:** Methodology, C.C. and Y.M.; data curation, L.L.; writing—original draft preparation, C.C. and Y.M.; writing—review and editing, J.L. and X.Z.; supervision, Y.L.; project administration, W.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.
