Article

A Dynamic Scene Vision SLAM Method Incorporating Object Detection and Object Characterization

Hongliang Guan, Chengyuan Qian, Tingsong Wu, Xiaoming Hu, Fuzhou Duan and Xinyi Ye

1 Engineering Research Center of Spatial Information Technology, MOE, Capital Normal University, 105 West Third Ring North Road, Haidian District, Beijing 100048, China
2 China Centre of Resources Satellite Data and Application, Beijing 100094, China
3 China Siwei Surveying and Mapping Technology Co., Ltd., Beijing 100190, China
4 Beijing Jumper Science Co., Ltd., Beijing 100083, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(4), 3048; https://doi.org/10.3390/su15043048
Submission received: 1 December 2022 / Revised: 21 January 2023 / Accepted: 30 January 2023 / Published: 8 February 2023
(This article belongs to the Special Issue Intelligent Transportation System in the New Normal Era)

Abstract

Simultaneous localization and mapping (SLAM) based on RGB-D cameras has been widely used for robot localization and navigation in unknown environments. Most current SLAM methods are constrained by static-environment assumptions and perform poorly in real-world dynamic scenarios. To improve the robustness and performance of SLAM systems in dynamic environments, this paper proposes a new RGB-D SLAM method for indoor dynamic scenes based on object detection. The method builds on the ORB-SLAM3 framework. First, we designed an object detection module based on YOLO v5 and used it to improve the tracking module of ORB-SLAM3 and its localization accuracy in dynamic environments. A dense point cloud mapping module was also added, which excludes dynamic objects from the environment map to create a static point cloud map with high readability and reusability. Comparison experiments against the original ORB-SLAM3 and two representative semantic SLAM methods on the TUM RGB-D dataset show that the proposed method runs at more than 30 fps, improves localization accuracy over ORB-SLAM3 to varying degrees in all four image sequences, and improves absolute trajectory accuracy by up to 91.10%. Its localization accuracy is comparable to that of DS-SLAM, DynaSLAM, and two recent object-detection-based SLAM algorithms, but it runs faster. The proposed RGB-D SLAM method, which combines a state-of-the-art object detector with a visual SLAM framework, outperforms other methods in terms of localization accuracy and map construction in dynamic indoor environments and provides a useful reference for navigation, localization, and 3D reconstruction.

1. Introduction

Simultaneous localization and mapping (SLAM) is widely used in mobile robots, autonomous driving, and augmented reality (AR) because it allows a machine to rely on its sensors to achieve autonomous positioning, 3D mapping, and path planning. Depending on the sensors used, SLAM can be classified as laser SLAM or visual SLAM. Compared with the former, visual SLAM benefits from small camera size, low cost, and the abundance of texture information in images, which gives visual algorithms more options for motion estimation. With advances in camera technology and computational performance, the use of camera sensors in SLAM has received increasing research attention [1].
However, although some visual SLAM frameworks, such as ORB-SLAM3 [2], SVO [3], and DSO [4], are mature, most visual SLAM research takes the static-environment assumption as the core basis of SLAM operation. Specifically, SLAM assumes that the environment contains only rigid, motionless objects and that the geometric relations of the scene are fixed. In the real world, however, there are numerous dynamic objects, such as pedestrians and vehicles. Their presence not only affects the accuracy of camera pose estimation, but also causes significant deviations in environmental map construction, which can lead to system crashes or failures in localization and map building [5]. Furthermore, the motion estimation of visual sensors in dynamic environments and the calculation of spatial 3D points is an important research topic in Structure from Motion (SfM), and some researchers address SfM 3D reconstruction in dynamic environments by studying non-rigid objects [6,7]. As a result, improving the robustness and reliability of visual SLAM systems in dynamic environments is a growing focus of current research.
Visual SLAM still requires improved robustness and stability for real-world applications. Researchers have used several types of a priori information, such as motion consistency and semantic information, to address the motion estimation failures caused by dynamic objects. With respect to motion consistency, the motions of dynamic and static objects are independent of each other relative to the camera sensor, and some approaches [8,9,10] use this property to identify feature points on dynamic objects. However, because the camera motion is computed from feature points under a static-environment assumption, this approach requires a relatively accurate model of the camera's own motion, which is difficult to guarantee. It also leads to a "chicken-and-egg" problem [11]: the motion model computed from static points in the environment is used to judge which objects in the environment are static. If these two aspects are not well balanced, an accurate result is difficult to ensure. Furthermore, deep learning approaches can learn semantic information from a training set as a priori information, and the semantic information extracted by different image processing methods affects dynamic scene problems differently. For example, semantic segmentation can be used to separate potentially moving objects [12,13], which greatly reduces dynamic interference in the scene and improves the precision of SLAM localization and mapping, but its per-pixel segmentation is computationally expensive, making real-time operation difficult. Object detection methods based on detection boxes are significantly more efficient than pixel-level semantic segmentation, and when combined with strategies for eliminating dynamic objects [14,15], they can effectively remove movable objects in dynamic scenes. Nonetheless, this approach requires a good trade-off between real-time performance and accuracy. As a result, although researchers have tried a variety of approaches to reduce the interference of moving objects in dynamic scenes, the problem has not yet been fully solved.
To address these issues, this paper incorporates YOLOv5 [16] object detection as a parallel module into ORB-SLAM3 and processes the different detected objects separately, aiming to balance real-time performance and precision in object-detection-based visual SLAM for dynamic scenes. First, we develop a low-complexity strategy for determining and eliminating dynamic feature points based on the geometric position relationship between dynamic and static objects, which recovers static feature points inside dynamic detection boxes while validly rejecting dynamic feature points. The motion of semi-static objects is then checked using the remaining static feature points. The proposed algorithm improves the localization accuracy of the ORB-SLAM3 framework in complex dynamic scenes while running in real time. In addition, its localization accuracy is comparable to semantic SLAM and two recent object-detection-based SLAM algorithms, but it has a significant time advantage. Finally, to address the issue that the sparse point cloud map in ORB-SLAM3 is unreadable and difficult to reuse, we eliminate the interference of dynamic objects using the semantic information derived from object detection and construct a purely static dense point cloud map of the scene.

2. Related Work

For the dynamic environment visual SLAM problem, most approaches reported in the literature determine and remove dynamic regions in the image and use only static points to calculate the relative pose between two frames before tracking and map building. As a result, the precise segmentation of dynamic objects in images is critical for solving the SLAM problem in dynamic environments [17]. The relevant methods are summarized below.

2.1. Self-Motion-Based Methods

Most dynamic objects in an image do not conform to the motion model used by the visual SLAM framework. Sun et al. [18] used this property to compute the homography matrix between two frames with the RANSAC algorithm and to segment dynamic objects based on their inconsistent motion. However, when dynamic objects are dominant or the scene is occluded by large dynamic objects, the RANSAC method removes static points as outliers, resulting in large localization errors. Lu et al. [19] therefore proposed a DLRSAC (distribution and local-based RANSAC) algorithm that builds a grid distribution model to efficiently discriminate dynamic feature points. This method, which assumes that static objects are widely distributed, effectively compensates for the shortcomings of the RANSAC algorithm. Sun et al. [20] constructed a foreground model based on the relative motion between two frames and combined it with RGB-D frame information to segment dynamic and static feature points, but when combined with DVO SLAM [21], the algorithm fails to run in real time.

2.2. Semantic Segmentation-Based Methods

In recent years, several researchers have used deep learning algorithms to process dynamic objects as a whole, which reduces reliance on motion models, reduces algorithm complexity, and improves the accuracy of dynamic object determination. For example, Yu et al. [22] proposed DS-SLAM, which integrates semantic segmentation and motion consistency detection to eliminate moving objects; however, the algorithm relies heavily on the outcome of motion consistency detection and is only suitable for handling certain highly dynamic objects, such as people. Bescos et al. [23] proposed DynaSLAM, which detects dynamic objects using semantic segmentation and multi-view geometry, and can discriminate more complex dynamic scenes.

2.3. Object-Detection-Based Methods

Object detection methods based on detection boxes have a much higher detection efficiency than pixel-level semantic segmentation methods, and they have been developing rapidly in recent years, with increasingly accurate results. When combined with strategies for eliminating dynamic objects, localization accuracy close to the semantic SLAM level can be reached in dynamic scenes. For example, Zhao et al. [24] proposed DO-SLAM, which embeds YOLOv5 into the front end of ORB-SLAM2 and relies on the epipolar geometric constraint between two frames. However, that algorithm first estimates the relative transformation between two frames using Lucas–Kanade optical flow tracking and RANSAC alone on all feature points, and only then combines this with the dynamic detection box. This does not guarantee the accuracy of the ego-motion estimate, the accuracy of dynamic feature point determination is unstable, and it does not solve the problem of large positioning errors when dynamic objects dominate the image. Zhang et al. [25] also combined YOLOv5 with an ORB-SLAM2 front end, but only rejected potential dynamic points inside the dynamic box, based on the relative transformation from the optical flow method and the epipolar constraints. Although these two methods can effectively remove the interference of dynamic objects to some extent, neither makes full use of the object detection results, and both perform redundant calculations on some feature points, which reduces the operational efficiency of the SLAM system.

3. Methods

The ORB-SLAM3 algorithm, currently one of the best-performing visual SLAM algorithms, has demonstrated promising performance in a variety of scenes. It is divided into three threads: tracking, local mapping, and loop closing. As shown in Figure 1, our algorithm uses ORB-SLAM3 as the core framework and modifies it for dynamic scenes, adding an object detection module and dense point cloud map construction. Object detection is implemented as a module running in parallel with the ORB-SLAM3 tracking module and is used to remove interference from dynamic scenes.
First, we perform object detection on the incoming image to classify the objects in the scene, and then pass the category and detection box coordinates of the detected objects to the tracking module. Second, after ORB feature points are extracted for the current frame in the tracking module, the feature points in the dynamic boxes are judged and eliminated according to the object detection results, and the semi-static feature points are then checked with the epipolar geometric constraint, using a fundamental matrix estimated by RANSAC from the remaining purely static feature points. Finally, the stable static feature points are used in the subsequent tracking and map construction process (the detailed processing of ORB-SLAM3 is described in [2]).

3.1. Object Detection

The algorithms in this paper are primarily intended for real-world scenarios, which place relatively high demands on both the accuracy and the real-time performance of object detection. There are numerous object detection algorithms available, and the YOLOv5 series is one of the best in terms of combined accuracy and speed. Among its variants, YOLOv5m achieves a mean average precision (mAP) of 45.4% on the COCO dataset and a speed of about 122 frames per second (FPS) on a Tesla V100, as officially reported [16]. As a result, YOLOv5m is chosen as our object detection module based on these speed and accuracy requirements; its specific algorithm flow is described in [16].
After detecting various objects with YOLOv5m in complex visual SLAM scenes, it is necessary to handle each kind of object differently. Most algorithms extract only typical dynamic objects, but some objects, such as chairs and books, are easily moved by people and can therefore be either static or dynamic in SLAM. Simply extracting dynamic objects from the detected objects without considering these potentially dynamic objects will affect the precision of the relative pose estimation in the subsequent SLAM system. To retain as much static object information as possible while eliminating as much interference from dynamic objects as possible, we divide the objects in the scene into three categories, as shown in Table 1, for the subsequent design of the dynamic feature point determination and elimination strategy. The three classes of objects detected in the current frame are then sent to the tracking module along with the coordinates of the corresponding detection boxes.
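To make the three-way classification concrete, the following minimal sketch groups YOLOv5 detections according to Table 1. It assumes the model is loaded through torch.hub as released by Ultralytics [16]; the class lists, the function name classify_detections, and the weight/image paths are illustrative choices, not the authors' exact configuration.

```python
import torch

# Illustrative category lists following Table 1; the exact mapping of COCO
# class names to the three categories is an assumption, not the authors' list.
PURE_STATIC = {"tv", "refrigerator"}
SEMI_STATIC = {"chair", "mouse", "keyboard", "cup", "book"}
DYNAMIC = {"person", "car", "dog", "cat"}

def classify_detections(results):
    """Group YOLOv5 detections into the three categories of Table 1.

    `results` is the object returned by calling a torch.hub YOLOv5 model on an
    image; results.pandas().xyxy[0] holds one row per detection with the box
    corners (xmin, ymin, xmax, ymax) and the class name.
    """
    boxes = {"static": [], "semi_static": [], "dynamic": []}
    for _, det in results.pandas().xyxy[0].iterrows():
        box = (det["xmin"], det["ymin"], det["xmax"], det["ymax"])
        if det["name"] in DYNAMIC:
            boxes["dynamic"].append(box)
        elif det["name"] in SEMI_STATIC:
            boxes["semi_static"].append(box)
        elif det["name"] in PURE_STATIC:
            boxes["static"].append(box)
    return boxes

# Usage sketch (weights and image path are placeholders):
# model = torch.hub.load("ultralytics/yolov5", "yolov5m")
# boxes = classify_detections(model("frame.png"))
```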

3.2. Determination and Elimination of Feature Points in the Dynamic Box

To solve the dynamic interference problem, we remove dynamic feature points in the tracking module by combining the results of the object detection module. Unlike other algorithms, the algorithm in this paper evaluates dynamic feature points using the geometric position relationship between the different detected objects, which reduces reliance on the ego-motion model and keeps the algorithm complexity low.
As illustrated in Figure 2, the dynamic detection box produced by the object detection algorithm frequently contains a large number of pixels that do not belong to the dynamic object, so deleting all feature points inside the dynamic object box would inevitably discard many high-quality static feature points, leading to inaccurate pose estimation and even tracking failure. Therefore, after extracting the ORB feature points of the current frame in the tracking stage, we use the information provided by the object detection module (object class and detection box coordinates) to identify the feature points belonging to pure static objects and to dynamic objects. We then classify all feature points that lie inside a dynamic object box but not inside a static object box as valid dynamic feature points and eliminate them. All feature points in the static background outside the three kinds of object boxes are kept. This strategy is summarized in Table 2.
This method not only accurately removes dynamic feature points from the scene, but also recovers some high-quality static feature points within the dynamic object box, ensuring that enough feature points remain for the subsequent SLAM modules.
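A minimal sketch of this decision rule is given below. It assumes keypoints are plain (u, v) pixel tuples and that `boxes` comes from the classification step in Section 3.1; the helper names are illustrative, not the authors' implementation.

```python
def in_box(pt, box):
    """True if pixel coordinate pt = (u, v) lies inside box = (xmin, ymin, xmax, ymax)."""
    u, v = pt
    xmin, ymin, xmax, ymax = box
    return xmin <= u <= xmax and ymin <= v <= ymax

def is_valid_dynamic(pt, dynamic_boxes, static_boxes):
    """Table 2 rule: a point is a valid dynamic feature point only if it lies in
    some dynamic box and in no static box; points covered by a static box are
    recovered as static even when they fall inside a dynamic box."""
    in_dynamic = any(in_box(pt, b) for b in dynamic_boxes)
    in_static = any(in_box(pt, b) for b in static_boxes)
    return in_dynamic and not in_static

def split_keypoints(keypoints, boxes):
    """Partition ORB keypoints into points to keep, points inside semi-static
    boxes (checked later with the epipolar constraint of Section 3.3), and
    valid dynamic points to discard."""
    kept, semi_static, discarded = [], [], []
    for kp in keypoints:
        if is_valid_dynamic(kp, boxes["dynamic"], boxes["static"]):
            discarded.append(kp)
        elif any(in_box(kp, b) for b in boxes["semi_static"]):
            semi_static.append(kp)
        else:
            kept.append(kp)
    return kept, semi_static, discarded
```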

3.3. The Semi-Static Object Motion Check

For semi-static objects, we use RANSAC [26] to determine and eliminate dynamic feature points, because these objects are typically small and few in real-world scenes. After the valid dynamic feature points have been removed, RANSAC uses the remaining stable static feature points to compute the fundamental matrix F between the current and previous frames, and the epipolar geometric constraint is then used to check the feature points within the semi-static objects. Compared with the traditional RANSAC procedure, which uses all feature points, this processing ensures more precisely that dynamic feature points are treated as outliers.
As shown in Equation (1), P1 and P2 are the homogeneous pixel coordinates of a matched pair of semi-static feature points in the previous and current frames, respectively.
P_1 = [u_1, v_1, 1]^{T}, \quad P_2 = [u_2, v_2, 1]^{T} \qquad (1)
According to the definition of epipolar geometry [27], the epipolar line L in the current frame is:
L = [X, Y, Z]^{T} = F P_1 = F [u_1, v_1, 1]^{T} \qquad (2)
The distance d between the current point P2 and the epipolar line L is calculated as follows:
d = \frac{\left| P_2^{T} F P_1 \right|}{\sqrt{X^2 + Y^2}} \qquad (3)
According to the epipolar geometric constraint, when d is greater than the threshold ε, P2 is judged to be a dynamic feature point and is rejected. Finally, the processed stable feature points are used for the subsequent tasks, and the semantic information from object detection is retained for the later dense map construction.
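The check in Equations (1)–(3) can be sketched as follows using OpenCV's RANSAC fundamental matrix estimator on the stable static matches. The threshold value and the function name are assumptions made for illustration; they are not the settings used in the paper.

```python
import cv2
import numpy as np

def semi_static_check(static_prev, static_curr, semi_prev, semi_curr, eps=1.0):
    """Keep semi-static points that satisfy the epipolar constraint.

    static_prev, static_curr: Nx2 arrays of matched stable static points used
    to estimate F with RANSAC; semi_prev, semi_curr: Mx2 arrays of matched
    points inside semi-static boxes. eps is the pixel distance threshold
    (its value here is an illustrative assumption).
    """
    F, _ = cv2.findFundamentalMat(
        np.asarray(static_prev, dtype=np.float64),
        np.asarray(static_curr, dtype=np.float64),
        cv2.FM_RANSAC,
    )
    kept = []
    for p1, p2 in zip(semi_prev, semi_curr):
        P1 = np.array([p1[0], p1[1], 1.0])           # Eq. (1)
        P2 = np.array([p2[0], p2[1], 1.0])
        X, Y, _ = F @ P1                             # epipolar line L = F P1, Eq. (2)
        d = abs(P2 @ F @ P1) / np.sqrt(X**2 + Y**2)  # point-to-line distance, Eq. (3)
        if d <= eps:                                 # consistent with static motion
            kept.append(tuple(p2))
    return kept, F
```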

3.4. Creation of Dense Point Cloud Map

ORB-SLAM3 sparse point cloud maps only have simple geometric information, which makes it difficult for robots to reuse sparse point cloud maps for more advanced tasks, such as obstacle avoidance and human–robot interactions. There are numerous solutions to this problem, such as creating dense point cloud maps [28]. However, most existing dense point cloud map constructions are not easily adaptable to complex dynamic environments. Because dynamic objects are repeatedly observed in different frames and eventually exhibit a “ghosting” phenomenon, we created a purely static dense point cloud map based on object detection semantic information.
With the interference of dynamic objects removed, the ORB-SLAM3 algorithm can obtain a more accurate camera pose for the selected keyframes by tracking and optimizing the stable feature points. Once the camera pose is obtained, the pixels in the dynamic regions of the keyframe's color and depth images are first removed based on the detection boxes and object classes extracted by the object detection module. The three-dimensional world coordinates of all remaining points, together with their RGB information, can then be obtained using Equation (4), based on the transformation between the world coordinates Pw of a spatial point P and its pixel coordinates Puv (K and T denote the intrinsic matrix of the camera and the transformation matrix of the current frame, respectively). Finally, the point cloud map is downsampled with a 9:1 voxel filtering operation to avoid the redundancy of map points introduced by the ORB-SLAM3 algorithm.
Z P_{uv} = Z [u, v, 1]^{T} = K T P_w = K T [X_w, Y_w, Z_w, 1]^{T} \qquad (4)
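The back-projection of Equation (4) and the subsequent voxel filtering can be sketched as below. The depth scale follows the TUM RGB-D convention, T_cw is taken as the 4x4 world-to-camera transform from tracking, and the voxel size and function names are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def keyframe_to_world_points(rgb, depth, K, T_cw, dynamic_boxes, depth_scale=5000.0):
    """Back-project one keyframe into the world frame (Eq. (4)) while masking
    out pixels that fall inside dynamic detection boxes."""
    mask = depth > 0
    for (xmin, ymin, xmax, ymax) in dynamic_boxes:       # drop dynamic pixels
        mask[int(ymin):int(ymax), int(xmin):int(xmax)] = False

    v, u = np.nonzero(mask)
    z = depth[v, u] / depth_scale                        # TUM depth scale assumption
    x = (u - K[0, 2]) * z / K[0, 0]                      # pixel -> camera: P_c = Z K^-1 [u, v, 1]^T
    y = (v - K[1, 2]) * z / K[1, 1]
    P_c = np.stack([x, y, z, np.ones_like(z)], axis=1)
    P_w = (np.linalg.inv(T_cw) @ P_c.T).T[:, :3]         # camera -> world
    return P_w, rgb[v, u]

def voxel_downsample(points, colors, voxel=0.01):
    """Keep one representative point per voxel cell; the 1 cm voxel size is an
    illustrative choice, not the 9:1 setting reported in the paper."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[idx], colors[idx]
```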

4. Experimental Results and Analysis

4.1. Dynamic Feature Point Elimination

In this paper, we use the TUM dataset to evaluate the accuracy and robustness of the SLAM algorithm when dealing with dynamic indoor scenes. To meet the needs of various scenes, the TUM dataset contains three types of data sequences. Among them, fr3 is primarily intended for use in dynamic environments. Because the algorithm in this paper is designed for dynamic environments, the fr3 class dataset is used as the primary test object. As shown in Table 3, four sets of image sequences are chosen for experimental validation.
Trajectory accuracy is primarily assessed using two metrics: absolute pose error (APE) and relative pose error (RPE) [29]. Because the estimated camera trajectory and the ground-truth trajectory are expressed in slightly different initial frames, a transformation between them must first be estimated by least squares, and the accuracy is evaluated after the trajectories are aligned. Because the algorithm in this paper is based on ORB-SLAM3, its performance is first compared with the original ORB-SLAM3 algorithm. It is then compared with the current high-performing semantic SLAM algorithms for dynamic scenes, DS-SLAM and DynaSLAM, as well as two recent object-detection-based SLAM algorithms, DO-SLAM and [25]. The comparison between the algorithm in this paper and ORB-SLAM3 is presented first.
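As a rough illustration of this evaluation step, the following sketch aligns an estimated trajectory to the ground truth with a least-squares rigid fit and reports the APE RMSE. It is a simplified stand-in for the evo tool used here, and it assumes the estimated and ground-truth positions have already been associated by timestamp.

```python
import numpy as np

def ape_rmse_after_alignment(est, gt):
    """Rigidly align (rotation + translation, no scale) the estimated positions
    to the ground truth by least squares and return the APE RMSE.

    est, gt: Nx3 arrays of associated camera positions (same timestamps).
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, _, Vt = np.linalg.svd(E.T @ G)               # SVD of the cross-covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t                         # aligned estimated trajectory
    errors = np.linalg.norm(gt - aligned, axis=1)   # per-pose absolute error
    return np.sqrt(np.mean(errors ** 2))            # APE RMSE in metres
```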
Table 4 and Table 5 show the root mean square error (RMSE) of the APE and the RPE for ORB-SLAM3 and the algorithm in this paper. To assess the system's performance, each method was run five times on each dataset and the median was taken. Quantitative evaluation of the test data and comparison of the trajectory visualizations below, produced with the evo tool, show that the algorithm in this paper clearly outperforms ORB-SLAM3 in a dynamic environment; the improvement reaches up to 91.10%.
In Table 4, the accuracy improvement of this paper’s method is significantly greater in the first two image sequences than in the last two image sequences. This is because, in the first two image sequences, the motion of dynamic objects is greater in magnitude and the interference in SLAM localization is also stronger, allowing this paper’s method to more clearly optimize the dynamic interference problem. As a result, this paper concluded that, given a sufficient static background, the more dynamic objects and the greater the motion amplitude in complex dynamic scenes, the greater the advantage of this paper’s method.
Figure 3 depicts a comparison of translation error, rotation error, and APE over time among the trajectories estimated in this study, those estimated by the ORB-SLAM3 method, and the ground-truth trajectories in the four image sequences. Overall, the method outperforms ORB-SLAM3 in terms of translation error along the three coordinate axes, as well as rotation error in the roll, pitch, and yaw angles.
As shown in Figure 4, due to the interference of a large number of dynamic feature points, especially when dynamic objects occupy most of the image, the trajectory estimated by ORB-SLAM3 deviates considerably from the ground-truth trajectory, and short tracking losses can even occur because correct feature matches cannot be found. The method used in this study effectively remedies this phenomenon.
Table 6 compares the APE results of this paper’s method to DS-SLAM, DynaSLAM, DO-SLAM and [25]. The accuracy performance of this paper’s algorithm in the fr3_walking_xyz data sequence is superior to DS-SLAM and DynaSLAM, while in the fr3_walking_halfsphere data sequence, it is slightly inferior to those two algorithms. Our method performs slightly worse than the DS-SLAM and DynaSLAM methods in fr3_sitting_xyz and fr3_sitting_halfsphere data sequences. Overall, the trajectory estimation accuracy of this paper’s algorithm is sufficient in most experimental scenarios and is comparable with the current better-performing semantic segmentation-based dynamic scene SLAM.
Table 6 also shows that the accuracy of this paper's algorithm is higher than that of DO-SLAM in the fr3_walking_xyz data sequence, roughly equal to [25], and slightly inferior to both algorithms in the fr3_walking_halfsphere data sequence. The accuracy of this paper's method is slightly lower than DO-SLAM in the fr3_sitting_xyz and fr3_sitting_halfsphere data sequences. Overall, the trajectory estimation accuracy of this paper's algorithm is close to that of the two recent object-detection-based dynamic scene SLAM algorithms.
Table 7 compares the running frame rates of DS-SLAM, DynaSLAM, DO-SLAM, [25], the ORB-SLAM3 algorithm, and the algorithm in this paper. Because the device affects the running frame rate, we ran the open-source algorithms, such as ORB-SLAM3, DS-SLAM, and DynaSLAM, on the same device. For algorithms without open-source code, such as DO-SLAM and [25], we estimated the maximum running frame rate from the device and experimental results reported in the original papers. The algorithm in this paper outperforms DS-SLAM and DynaSLAM in terms of speed. The primary reason is that DS-SLAM acquires semantic information via the SegNet [30] semantic segmentation algorithm, whereas DynaSLAM acquires it via the Mask-RCNN [31] instance segmentation algorithm, both of which are pixel-level segmentation algorithms with high computational costs. The algorithm in this paper, on the other hand, acquires semantic information from an object detection algorithm that outputs results at 133 fps, and the semantic acquisition thread and the ORB-SLAM3 tracking thread run in parallel, so that the final processing speed meets the requirements of real-time operation. The table also shows that the algorithm in this paper outperforms DO-SLAM and the algorithm in [25] in terms of running frame rate, because these two methods perform more complex processing when estimating ego-motion. For example, [25] first applies the pyramid optical flow method, performing optical flow calculations on different pyramid layers under the assumptions of brightness constancy, temporal continuity, and spatial consistency between adjacent frames; the feature points in the image are then tracked with the pyramidal optical flow to establish their correspondences; finally, dynamic feature points are detected and rejected using the fundamental matrix F computed by the eight-point method and the epipolar constraints. Although the APE results of these two methods are relatively good, they take a long time to run. In this paper, by contrast, the object detection results are mainly used to reduce the number of feature points involved in dynamic feature point rejection, specifically by simplifying the judgment of some dynamic feature points based on the reclassification of the detection results and the processing strategies for the different classes.

4.2. Static Dense Map Construction

As shown in Figure 5, the sparse point cloud map of ORB-SLAM3 itself only stores the location information of feature points in the scene, which only possesses simple geometric information with a poor readability, making it difficult for the robot to reuse the sparse point cloud map to accomplish more advanced tasks, such as obstacle avoidance and human–machine interaction.
However, neither conventional dense point cloud maps nor octree maps adapt particularly well to dynamic environments. When there are dynamic objects in the environment, their movement eventually produces a "ghosting" phenomenon, as shown in Figure 6, resulting in poor readability of the dense point cloud map and difficulties in its reuse.
For the problem of dynamic interference, we use the object detection method to remove dynamic objects from the scene and then apply 9:1 voxel filtering to create a static dense point cloud map.

5. Conclusions

A visual SLAM approach for complex dynamic environments is developed in this paper. This system is based on the ORB-SLAM3 framework and includes an object detection module based on YOLO v5, an improved tracking module, and a dense point cloud map module. Based on tests on the TUM RGB-D dataset, the method presented in this paper outperforms ORB-SLAM3 in terms of localization accuracy in dynamic environments. In comparison to other advanced dynamic SLAM methods, the method in this paper improves the SLAM system’s operation speed while ensuring localization accuracy, and ultimately constructs a dense point cloud map with a higher readability and reusability.
The following is a summary of the main findings of this paper:
(1)
YOLO v5 was added to the ORB-SLAM3 framework as a new object detection module. We improved the original ORB-SLAM3 tracking module by using the dynamic information of the detected objects to reject dynamic feature points, and by computing the pose transformation with purely static feature points, the localization accuracy of ORB-SLAM3 in dynamic environments was improved.
(2)
An improved dense point cloud building module was added to ORB-SLAM3, which uses the dynamic object information extracted by the object detection module to remove dynamic objects and creates a static point cloud map of the scene. Because the tracking module removes dynamic objects to obtain a more accurate camera trajectory, and the accuracy of the constructed map is largely determined by trajectory precision, the dense point cloud map constructed by the algorithm in this paper has a high readability and reusability.
(3)
The comparison of this paper's algorithm with the original ORB-SLAM3, two semantic SLAM methods that perform well in dynamic environments (DS-SLAM, DynaSLAM), and two recent object-detection-based dynamic SLAM methods (DO-SLAM, [25]) on the TUM RGB-D dynamic dataset shows that this paper's method can run at more than 30 fps. In all four image sequences, the localization accuracy improves to varying degrees over ORB-SLAM3, and the absolute trajectory accuracy is improved by up to 91.10%. The localization accuracy of this paper's method is comparable to DS-SLAM, DynaSLAM, DO-SLAM, and [25], but it runs significantly faster.
(4)
Under some rotating shots, the axis-aligned detection box encloses a large amount of non-dynamic background, which can leave no usable static information for SLAM and, in severe cases, cause tracking loss. In the future, we can use rotated detection boxes, which are commonly used in remote sensing object detection, to mark dynamic objects with greater precision while retaining as much static background as possible.

Author Contributions

Conceptualization, H.G., C.Q. and F.D.; Methodology, H.G., C.Q., T.W. and F.D.; Software, C.Q.; Formal analysis, C.Q. and T.W.; Investigation, C.Q.; Resources, H.G. and X.H.; Data curation, C.Q. and T.W.; Writing—original draft, C.Q.; Writing—review & editing, T.W. and X.Y.; Supervision, H.G. and F.D.; Project administration, H.G., X.H. and F.D.; Funding acquisition, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (No. 2018YFC1800904); Capacity Building for Sci-Tech Innovation—Fundamental Scientific Research Funds (No. 20530290078).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, [Duan, F], upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Macario Barros, A.; Michel, M.; Moline, Y.; Corre, G.; Carrel, F. A Comprehensive Survey of Visual SLAM Algorithms. Robotics 2022, 11, 24. [Google Scholar] [CrossRef]
  2. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  3. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22. [Google Scholar]
  4. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
  5. Cui, L.; Ma, C. SOF-SLAM: A Semantic Visual SLAM for Dynamic Environments. IEEE Access 2019, 7, 166528–166539. [Google Scholar] [CrossRef]
  6. Badias, A.; Alfaro, I.; Gonzalez, D.; Chinesta, F.; Cueto, E. MORPH-DSLAM: Model Order Reduction for Physics-Based Deformable SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7764–7777. [Google Scholar] [CrossRef] [PubMed]
  7. Parashar, S.; Pizarro, D.; Bartoli, A. Robust Isometric Non-Rigid Structure-From-Motion. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 6409–6423. [Google Scholar] [CrossRef] [PubMed]
  8. Lee, S.; Son, C.Y.; Kim, H.J. Robust Real-time RGB-D Visual Odometry in Dynamic Environments via Rigid Motion Model. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 6891–6898. [Google Scholar]
  9. Palazzolo, E.; Behley, J.; Lottes, P.; Giguere, P.; Stachniss, C. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 7855–7862. [Google Scholar]
  10. Wang, R.; Wan, W.; Wang, Y.; Di, K. A New RGB-D SLAM Method with Moving Object Detection for Dynamic Indoor Scenes. Remote Sens. 2019, 11, 1143. [Google Scholar] [CrossRef]
  11. Saputra, M.R.U.; Markham, A.; Trigoni, N. Visual SLAM and Structure from Motion in Dynamic Environments. ACM Comput. Surv. 2019, 51, 1–36. [Google Scholar] [CrossRef]
  12. Brasch, N.; Bozic, A.; Lallemand, J.; Tombari, F. Semantic Monocular SLAM for Highly Dynamic Environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 393–400. [Google Scholar]
  13. Runz, M.; Buffier, M.; Agapito, L. MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 16–20 October 2018; pp. 10–20. [Google Scholar]
  14. Wang, H.; Zhang, A. RGB-D SLAM Method Based on Object Detection and K-Means. In Proceedings of the 2022 14th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Hangzhou, China, 20–21 August 2022; pp. 94–98. [Google Scholar]
  15. Chang, J.; Dong, N.; Li, D. A Real-Time Dynamic Object Segmentation Framework for SLAM System in Dynamic Scenes. IEEE Trans. Instrum. Meas. 2021, 70, 1–9. [Google Scholar] [CrossRef]
  16. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; NanoCode012; TaoXie; Kwon, Y.; Michael, K.; Changyu, L.; Fang, J.; et al. Ultralytics/yolov5: v6.0—YOLOv5n ‘Nano’ Models, Roboflow Integration, TensorFlow Export, OpenCV DNN Support. Zenodo Tech. Rep. 2021. [Google Scholar] [CrossRef]
  17. Panchpor, A.A.; Shue, S.; Conrad, J.M. A survey of methods for mobile robot localization and mapping in dynamic indoor environments. In Proceedings of the 2018 Conference on Signal Processing And Communication Engineering Systems (SPACES), Vijayawada, India, 4–5 January 2018; pp. 138–144. [Google Scholar]
  18. Sun, Y.; Liu, M.; Meng, M.Q.H. Improving RGB-D SLAM in dynamic environments: A motion removal approach. Robot. Auton. Syst. 2017, 89, 110–122. [Google Scholar] [CrossRef]
  19. Lu, X.; Wang, H.; Tang, S.; Huang, H.; Li, C. DM-SLAM: Monocular SLAM in Dynamic Environments. Appl. Sci. 2020, 10, 4252. [Google Scholar] [CrossRef]
  20. Sun, Y.; Liu, M.; Meng, M.Q.H. Motion removal for reliable RGB-D SLAM in dynamic environments. Robot. Auton. Syst. 2018, 108, 115–128. [Google Scholar] [CrossRef]
  21. Kerl, C.; Sturm, J.; Cremers, D. Dense visual SLAM for RGB-D cameras. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 2100–2106. [Google Scholar]
  22. Yu, C.; Liu, Z.; Liu, X.-J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
  23. Bescos, B.; Facil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  24. Zhao, X.; Ye, L. Object Detection-based Visual SLAM for Dynamic Scenes. In Proceedings of the 2022 IEEE International Conference on Mechatronics and Automation (ICMA), Guilin, China, 7–10 August 2022; pp. 1153–1158. [Google Scholar]
  25. Zhang, X.; Zhang, R.; Wang, X. Visual SLAM Mapping Based on YOLOv5 in Dynamic Scenes. Appl. Sci. 2022, 12, 1548. [Google Scholar] [CrossRef]
  26. Fischler, M.A.; Bolles, R.C. Random sample consensus. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  27. Chum, O.; Matas, J.; Kittler, J. Locally Optimized RANSAC; Springer: Berlin/Heidelberg, Germany, 2003; pp. 236–243. [Google Scholar]
  28. Matsuki, H.; Scona, R.; Czarnowski, J.; Davison, A.J. CodeMapping: Real-Time Dense Mapping for Sparse SLAM using Compact Scene Representations. IEEE Robot. Autom. Lett. 2021, 6, 7105–7112. [Google Scholar] [CrossRef]
  29. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
  30. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  31. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Figure 1. The overall framework of the method in this paper.
Figure 2. Image feature point distribution diagram.
Figure 3. Comparison of trajectory translation error, rotation error, and APE.
Figure 4. Improving partial tracking loss.
Figure 5. Sparse point cloud map of ORB-SLAM3.
Figure 6. (a) Ghosting in the dense point cloud map in a dynamic scene. (b) Static environment map for removing dynamic objects.
Table 1. Classification of dynamic properties of objects in the scene.

| Object Property | Object |
|---|---|
| Pure static objects | Monitor, cabinet, refrigerator, etc. |
| Semi-static objects | Chair, mouse, keyboard, cup, etc. |
| Dynamic objects | Human, vehicle, animal, etc. |

Table 2. Dynamic object determination strategy table.

| In the Dynamic Object Box | In the Static Object Box | Determined as Valid Dynamic Feature Point |
|---|---|---|
| True | True | False |
| True | False | True |
| False | True | False |
| False | False | False |

Table 3. Image sequences of the data set.

| Sequence Name | Sequence Size (frames) | Resolution | Frame Rate (fps) |
|---|---|---|---|
| fr3_walking_xyz | 827 | 640 × 480 | 30 |
| fr3_walking_halfsphere | 1021 | 640 × 480 | 30 |
| fr3_sitting_xyz | 1261 | 640 × 480 | 30 |
| fr3_sitting_halfsphere | 1110 | 640 × 480 | 30 |

Table 4. Comparison of ORB-SLAM3 and the method in this paper on APE (m).

| Sequence of Data Sets | ORB-SLAM3 | Method in This Paper | Improvement |
|---|---|---|---|
| fr3_walking_xyz | 0.162312 | 0.014443 | 91.10% |
| fr3_walking_halfsphere | 0.188741 | 0.055487 | 70.60% |
| fr3_sitting_xyz | 0.026101 | 0.016832 | 35.51% |
| fr3_sitting_halfsphere | 0.230181 | 0.175883 | 23.59% |

Table 5. Comparison of ORB-SLAM3 and the method in this paper on RPE (m).

| Sequence of Data Sets | ORB-SLAM3 | Method in This Paper | Improvement |
|---|---|---|---|
| fr3_walking_xyz | 0.034260 | 0.012871 | 62.43% |
| fr3_walking_halfsphere | 0.022132 | 0.019239 | 13.07% |
| fr3_sitting_xyz | 0.012549 | 0.008167 | 34.92% |
| fr3_sitting_halfsphere | 0.015779 | 0.017147 | −8.67% |
Table 6. Comparison of DS-SLAM, DynaSLAM, DO-SLAM, [25], and this paper's method on APE in meters (three decimal places).

| Sequence of Data Sets | DS-SLAM | DynaSLAM | DO-SLAM | [25] | Method in This Paper |
|---|---|---|---|---|---|
| fr3_walking_xyz | 0.015 | 0.015 | 0.026 | 0.014 | 0.014 |
| fr3_walking_halfsphere | 0.022 | 0.025 | 0.032 | 0.016 | 0.055 |
| fr3_sitting_xyz | 0.014 | 0.015 | 0.013 | - | 0.018 |
| fr3_sitting_halfsphere | 0.013 | 0.017 | 0.017 | - | 0.177 |
Table 7. Comparison of DS-SLAM, DynaSLAM, DO-SLAM, [25], ORB-SLAM3, and this paper's method on running frame rate.

| Sequence of Data | DS-SLAM | DynaSLAM | DO-SLAM | [25] | ORB-SLAM3 | Method in This Paper |
|---|---|---|---|---|---|---|
| TUM-RGBD-fr3 | 5.5 fps | 1.3 fps | <10.0 fps | <20.0 fps | 32.0 fps | 30.2 fps |