A Semantics-Guided Visual Simultaneous Localization and Mapping with U-Net for Complex Dynamic Indoor Environments
Abstract
1. Introduction
The main contributions of this work are summarized as follows:
- (1) Building upon the original ORB-SLAM2 framework, we develop a dynamic-object-aware visual SLAM algorithm designed for highly dynamic indoor environments. The algorithm integrates a semantic segmentation network into ORB-SLAM2, effectively improving the accuracy and robustness of the system.
- (2) Exploiting the semantic characteristics of dynamic objects, we implement a U-Net-based semantic segmentation algorithm that provides pixel-wise segmentation of potentially movable objects. The influence of moving objects on camera pose estimation is then alleviated by filtering out feature points located on dynamic objects (a minimal sketch of this filtering step follows the list).
- (3) Quantitative and qualitative experiments validate the effectiveness and robustness of the proposed method, using both the TUM public dataset and a real-scenario dataset captured with a Kinect 2.0 sensor.
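To make the filtering step in (2) concrete, the sketch below shows one way to mask out feature points that fall on potentially movable objects. It assumes a trained U-Net-style model `unet` returning per-pixel class logits; the class ids in `DYNAMIC_CLASS_IDS`, the dilation radius, and the OpenCV-based ORB extraction are illustrative assumptions rather than the authors' exact implementation (ORB-SLAM2 itself extracts ORB features in C++).

```python
# Hypothetical sketch: discard ORB keypoints that fall on pixels the
# segmentation network labels as potentially movable (e.g., "pedestrian").
# `unet` is assumed to be a trained torch.nn.Module returning per-pixel logits.
import cv2
import numpy as np
import torch

DYNAMIC_CLASS_IDS = {1}   # assumed label id(s) of movable classes
MASK_DILATION_PX = 9      # grow the mask to also cover object boundaries

def dynamic_mask(unet: torch.nn.Module, rgb: np.ndarray) -> np.ndarray:
    """Return a boolean H x W mask that is True on (dilated) dynamic pixels."""
    with torch.no_grad():
        x = torch.from_numpy(rgb).float().permute(2, 0, 1).unsqueeze(0) / 255.0
        labels = unet(x).argmax(dim=1).squeeze(0).cpu().numpy()
    mask = np.isin(labels, list(DYNAMIC_CLASS_IDS)).astype(np.uint8)
    kernel = np.ones((MASK_DILATION_PX, MASK_DILATION_PX), np.uint8)
    return cv2.dilate(mask, kernel).astype(bool)

def filter_keypoints(gray: np.ndarray, mask: np.ndarray):
    """Detect ORB features and keep only those outside the dynamic mask."""
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    if descriptors is None:
        return [], np.empty((0, 32), np.uint8)
    kept_kp, kept_desc = [], []
    for kp, desc in zip(keypoints, descriptors):
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if not mask[v, u]:            # keep features on the static scene only
            kept_kp.append(kp)
            kept_desc.append(desc)
    return kept_kp, np.array(kept_desc)
```

Dilating the mask before filtering is a common precaution so that keypoints detected near object boundaries, where segmentation is least reliable, are also rejected.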
2. Related Works
3. Methodology
3.1. System Overview
3.2. Semantic Segmentation Module
3.3. Tracking and Mapping
4. Experimentation and Analysis
4.1. Experimental Dataset
4.2. Experimental Details
4.3. Semantic Segmentation Results
4.4. TUM Public Datasets
4.5. Real Scenario Datasets
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| SLAM | simultaneous localization and mapping |
| TUM | Technical University of Munich |
| GNSS | global navigation satellite system |
| LiDAR | light detection and ranging |
| RGB-D | red green blue-depth |
| IMU | inertial measurement unit |
| ORB-SLAM2 | oriented FAST and rotated BRIEF SLAM 2 |
| LSD-SLAM | large-scale direct SLAM |
| DSO | direct sparse odometry |
| SLAMMOT | simultaneous localization, mapping, and moving object tracking |
| MonoSLAM | monocular SLAM |
| DATMO | detection and tracking of moving objects |
| MHT | multi-hypothesis tracking |
| IMM | interacting multiple model |
| EKF | extended Kalman filter |
| DynaSLAM | dynamic SLAM |
| DS-SLAM | a semantic visual SLAM towards dynamic environments |
| BA | bundle adjustment |
| AG | attention gate |
| CNNs | convolutional neural networks |
| ReLU | rectified linear unit |
| PASCAL | pattern analysis, statistical modelling and computational learning |
| VOC | visual object classes |
| PTAM | parallel tracking and mapping |
| IoU | intersection over union |
| RMSE | root-mean-square error |
| ATE | absolute trajectory error |
| RAE | relative attitude error |
| SD | standard deviation |
| DNNs | deep neural networks |
References
Semantic segmentation results (per-class IoU and accuracy):

| Category | Class IoU | Class Accuracy |
|---|---|---|
| Pedestrian | 83.02% | 90.73% |
| Chair | 33.85% | 45.01% |
| Table | 60.09% | 65.66% |
| TV | 70.59% | 80.06% |
| Background | 93.51% | 96.53% |
| Average | 75.11% | 86.46% |
| Overall accuracy | 94.17% | |
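For reference, the per-class figures above follow the standard definitions IoU = TP / (TP + FP + FN) and class accuracy = TP / (TP + FN), with overall accuracy the fraction of correctly labeled pixels. The snippet below is a generic illustration of computing these quantities from a confusion matrix; it is not the authors' evaluation code, and the reported "Average" row may use a weighting convention different from the unweighted per-class values returned here.

```python
# Generic per-class IoU / accuracy computation from a confusion matrix C,
# where C[i, j] counts pixels of true class i predicted as class j.
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    tp = np.diag(conf).astype(float)     # correctly labeled pixels per class
    fn = conf.sum(axis=1) - tp           # pixels of the class labeled as something else
    fp = conf.sum(axis=0) - tp           # pixels of other classes labeled as this class
    iou = tp / (tp + fp + fn)            # per-class intersection over union
    class_acc = tp / (tp + fn)           # per-class accuracy (recall)
    overall_acc = tp.sum() / conf.sum()  # fraction of all pixels labeled correctly
    return iou, class_acc, overall_acc
```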
Absolute trajectory error (in meters) for ORB-SLAM2 and our system:

| Scene | Sequence | System | RMSE | S.D. | MAX | MIN | MEAN | MEDIAN |
|---|---|---|---|---|---|---|---|---|
| Low-dynamic scene | Sitting static | ORB-SLAM2 | 0.0242 | 0.0129 | 0.0962 | 0.0020 | 0.0205 | 0.0183 |
| | | Our system | 0.0226 | 0.0118 | 0.0850 | 0.0017 | 0.0193 | 0.0172 |
| | | Improvement | 6.61% | 8.53% | 11.64% | 15.00% | 5.85% | 6.01% |
| | Sitting half | ORB-SLAM2 | 0.0084 | 0.0039 | 0.0355 | 0.0003 | 0.0075 | 0.0068 |
| | | Our system | 0.0077 | 0.0038 | 0.0275 | 0.0010 | 0.0067 | 0.0060 |
| | | Improvement | 8.33% | 2.56% | 22.54% | −233.33% | 10.67% | 11.76% |
| High-dynamic scene | Walking half | ORB-SLAM2 | 0.4175 | 0.2160 | 1.4238 | 0.0257 | 0.3573 | 0.2957 |
| | | Our system | 0.3838 | 0.1324 | 0.7351 | 0.0779 | 0.3602 | 0.3488 |
| | | Improvement | 8.07% | 38.70% | 48.37% | −203.11% | −0.81% | −17.96% |
| | Walking rpy | ORB-SLAM2 | 1.0034 | 0.5387 | 2.1407 | 0.0351 | 0.8466 | 0.8244 |
| | | Our system | 0.5539 | 0.1819 | 1.0368 | 0.0529 | 0.5231 | 0.5225 |
| | | Improvement | 44.80% | 66.23% | 51.57% | −50.71% | 38.21% | 36.62% |
| | Walking static | ORB-SLAM2 | 0.4201 | 0.1710 | 0.6821 | 0.0603 | 0.3838 | 0.3388 |
| | | Our system | 0.3292 | 0.0999 | 0.5657 | 0.0475 | 0.3136 | 0.2861 |
| | | Improvement | 21.64% | 41.58% | 17.06% | 21.23% | 18.29% | 15.55% |
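The statistics above follow the usual TUM RGB-D evaluation convention: after the estimated trajectory is associated with and aligned to the ground truth, the per-frame translational error is summarized by its RMSE, standard deviation, extrema, mean, and median, and each "Improvement" row is (baseline − ours) / baseline. A minimal sketch, assuming the two trajectories are already matched frame-by-frame and aligned (the TUM benchmark tools handle timestamp association and alignment), is given below.

```python
# Hedged sketch of the ATE summary statistics used in the table above.
# est and gt are N x 3 arrays of already-associated, already-aligned
# camera positions (one row per frame).
import numpy as np

def ate_statistics(est: np.ndarray, gt: np.ndarray) -> dict:
    errors = np.linalg.norm(est - gt, axis=1)   # per-frame translational error
    return {
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "S.D.": float(np.std(errors)),
        "MAX": float(np.max(errors)),
        "MIN": float(np.min(errors)),
        "MEAN": float(np.mean(errors)),
        "MEDIAN": float(np.median(errors)),
    }

def improvement(baseline: float, ours: float) -> float:
    """Relative improvement in percent, as used in the 'Improvement' rows."""
    return 100.0 * (baseline - ours) / baseline
```

For example, `improvement(0.0242, 0.0226)` evaluates to roughly 6.61%, matching the RMSE improvement reported for the Sitting static sequence.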
RMSE of the absolute trajectory error (in meters) compared with VO-SF and Co-Fusion:

| Scene | Sequence | ORB-SLAM2 | VO-SF | Co-Fusion | Our System |
|---|---|---|---|---|---|
| Low-dynamic scene | Sitting static | 0.0242 | 0.0290 | 0.0110 | 0.0226 |
| | Sitting half | 0.0084 | 0.1800 | 0.0360 | 0.0077 |
| High-dynamic scene | Walking half | 0.4175 | 0.7390 | 0.8030 | 0.3838 |
| | Walking rpy | 1.0034 | 0.8740 | 0.6960 | 0.5539 |
| | Walking static | 0.4201 | 0.3270 | 0.5510 | 0.3292 |