1. Introduction
Using only its own sensors, simultaneous localization and mapping (SLAM) tackles both the localization and the map-construction problem. When neither an environment map nor the robot's location is available in advance, SLAM can be applied to a wide variety of tasks [1]. Visual SLAM (VSLAM), which uses cameras as its primary sensors, offers many advantages, such as cheaper hardware, direct object detection and tracking, and rich visual and semantic information [2], making it a hot topic in robotics at present.
Over the years, VSLAM has attracted tremendous attention [3], and a number of mature VSLAM algorithms have emerged, such as MonoSLAM [4], ORB-SLAM2 [5], ORB-SLAM3 [6], PTAM [7], RGB-D SLAM [8], and large-scale direct SLAM (LSD-SLAM) [9]. Among them, the ORB-SLAM3 system proposed by Campos et al. [6] operates robustly in real time in scenes both large and small, indoors and outdoors. It supports visual, visual-inertial, and multi-map operation and runs on monocular, stereo, and RGB-D cameras with pinhole or fisheye models. Nevertheless, all of these algorithms are predicated on a static environment, or at most on objects moving at a very slow pace. In reality, dynamic objects such as vehicles and animals are unavoidable. Most SLAM algorithms rely on feature points for localization, and these are generally extracted from regions with rich texture [5,7,10,11]. When a moving object carries substantial texture, the algorithms therefore collect a large number of erroneous feature points, which degrades pose-estimation accuracy, inflates trajectory error, and can even leave double shadows (ghosting) in the map. How to limit the interference of moving objects and improve positioning precision and robustness in dynamic scenes thus remains a central challenge in VSLAM.
First proposed by Smith, Self, and Cheeseman [12] in 1986, the SLAM technique has developed over more than three decades through three main stages. The first is the classical stage, in which the SLAM problem was formulated and cast as a state-estimation problem. LiDAR served as the main sensor source, and the problem was typically tackled with extended Kalman filters, particle filters, and maximum-likelihood estimation. The drawback of this stage is that the convergence properties of the map were ignored and localization was treated separately from map construction. The second is the algorithm-analysis stage. A. J. Davison presented MonoSLAM [4] in 2007, regarded as the first real-time monocular-vision SLAM system; before it, localization and map building could only be performed offline. Klein et al. [7] later proposed PTAM, which adopted nonlinear optimization as the back end and parallelized tracking and map building. However, it is applicable only to small scenes and is prone to tracking loss. Mur-Artal et al. [13] proposed ORB-SLAM in 2015 on the basis of PTAM. It centers on ORB features for pose estimation and counteracts accumulated error with a bag-of-words model, but frames are still lost during rotation. The third is the robustness-and-prediction stage. Alongside the broad development of AI technology, deep learning has gradually been integrated into VSLAM to achieve robustness, high-level scene understanding, and efficient use of computational resources. Deep-learning-based VSLAM systems obtain semantic labels of features by applying CNN architectures such as YOLO [14], SSD [15], SegNet [16], and Mask R-CNN [17] whenever a new frame arrives. Bescos et al. [18] proposed a dynamic SLAM system named DynaSLAM, which incorporates the Mask R-CNN instance-segmentation algorithm and a multiview-geometry module into ORB-SLAM2 to eliminate dynamic feature points. In Detect-SLAM, Zhong et al. [19] employed the comparatively fast SSD object-detection model to identify objects in image frames and then propagated motion probability through feature matching to enlarge the influence region and suppress interference in dynamic environments. J. Vincent et al. [20] used instance segmentation and multiview geometry to generate masks of dynamic objects so that the masked image regions are skipped during optimization, minimizing the impact of dynamic visual features on positioning and mapping. Yu et al. [21] enhanced ORB-SLAM2 and designed DS-SLAM, which adopts a SegNet neural network for semantic segmentation and filters out dynamic feature points based on point-cloud motion analysis. Wu et al. [22] used an upgraded lightweight object-detection network [23] to generate semantic information for the system and proposed a new geometric method to filter dynamic features in the detection region. Theodorou et al. [24] enhanced ORB-SLAM3 with the YOLOR object-detection network and optical-flow techniques to detect and eliminate dynamic objects in key frames, validating the system on customized train-station datasets and raising accuracy by 89%. These deep-learning-based algorithms filter out the influence of dynamic factors and significantly improve localization accuracy compared with those of the first two stages, but their heavy computational cost compromises online performance.
To address the poor real-time performance of such VSLAM systems, graphics processing units (GPUs) have gradually entered the scope of research. Several works have reported performance results for the parallel execution of computer-vision algorithms on GPUs and compared them against CPU implementations. Chaple et al. [25] compared image-convolution implementations on GPUs, FPGAs, and CPUs. Russo et al. [26] compared image-convolution processing on GPUs and FPGAs. References [27,28] addressed a K-means clustering method, a two-dimensional filter, and stereo-vision problems on quad-core CPUs and GPU processors, respectively. These investigations showed that a GPU can process all image pixels independently, demonstrating its superiority for such workloads.
To improve the positioning accuracy of VSLAM algorithms in dynamic indoor scenes while meeting real-time requirements, this paper refines a framework based on ORB-SLAM3 in the following respects:
(1) An independent GPU-accelerated dynamic-object-detection thread is created. It recognizes and labels dynamic objects of different levels in complex environments, thereby enhancing the positioning precision and robustness of ORB-SLAM3 in dynamic scenes. By deploying YOLOv5 object detection on the GPU, the identification speed for image frames is increased substantially, preserving the system's real-time performance (an illustrative sketch of GPU-side detection follows this list).
(2) A dynamic-feature-point-removal step is added to the tracking thread. It combines the Lucas-Kanade (LK) optical-flow algorithm with dynamic prior information. Because pixels on static and dynamic objects move with different instantaneous velocities, the optical-flow approach is employed to track and delete dynamic feature points. The object-detection prior is used to categorize detected objects, helping the system eliminate potentially dynamic factors while retaining feature points on static and some low-dynamic objects for positioning and environmental map construction (a sketch of flow-based filtering also follows this list).
(3) YG-SLAM is validated on the public TUM RGB-D dataset and further verified on the KITTI dataset. Compared with ORB-SLAM3, it achieves substantial gains in accuracy and robustness on dynamic-environment sequences, and it delivers better localization accuracy and higher speed on numerous sequences than other dynamic-scene SLAM algorithms.
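For concreteness, the following is a minimal sketch of how contribution (1) can be realized: per-frame YOLOv5 inference on the GPU through the public ultralytics/yolov5 PyTorch Hub interface. The dynamic-class list and confidence threshold here are illustrative assumptions, not the exact configuration of YG-SLAM.

```python
# Minimal sketch of GPU-side YOLOv5 detection for each incoming frame.
# Assumes the public ultralytics/yolov5 PyTorch Hub model; DYNAMIC_CLASSES
# and the confidence threshold are illustrative, not the paper's settings.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.hub.load("ultralytics/yolov5", "yolov5s").to(device)
model.conf = 0.5  # detection confidence threshold (illustrative)

# Hypothetical prior: object classes treated as potentially dynamic.
DYNAMIC_CLASSES = {"person", "car", "bicycle", "dog", "cat"}

def detect_dynamic_boxes(frame_bgr):
    """Return (x1, y1, x2, y2) boxes of potentially dynamic objects."""
    rgb = frame_bgr[:, :, ::-1].copy()  # the model expects RGB input
    results = model(rgb)
    boxes = []
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        if results.names[int(cls)] in DYNAMIC_CLASSES:
            boxes.append(tuple(int(v) for v in xyxy))
    return boxes
```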
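Similarly, contribution (2) can be sketched with OpenCV's pyramidal LK tracker (cv2.calcOpticalFlowPyrLK). The median-residual test and the motion_thresh parameter below are assumptions made for illustration; the actual criterion in YG-SLAM combines the flow measurements with the detection prior above.

```python
# Minimal sketch of LK-optical-flow filtering of dynamic feature points.
# The median-residual test and motion_thresh are illustrative assumptions.
import cv2
import numpy as np

def filter_dynamic_points(prev_gray, curr_gray, points, motion_thresh=2.0):
    """Track feature points between consecutive grayscale frames and keep
    only those whose motion is consistent with the dominant (camera) flow."""
    pts = np.float32(points).reshape(-1, 1, 2)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    status = status.reshape(-1).astype(bool)
    disp = np.linalg.norm((nxt - pts).reshape(-1, 2), axis=1)
    # Pixels on the static background share the ego-motion-induced flow;
    # use the median displacement as a robust proxy for that flow magnitude.
    residual = np.abs(disp - np.median(disp[status]))
    keep = status & (residual < motion_thresh)
    return [p for p, k in zip(points, keep) if k]
```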
The remainder of this paper is organized as follows: Section 2 introduces the system architecture, the optical-flow method, the object-detection module, and the dynamic-point-elimination procedure. Section 3 presents the results and numerical analysis of the dataset experiments. Finally, we conclude the paper and discuss its limitations and directions for future study.