Review

A Comparative Review on Enhancing Visual Simultaneous Localization and Mapping with Deep Semantic Segmentation

1 Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources, Shenzhen 518034, China
2 School of Smart City, Chongqing Jiaotong University, Chongqing 400074, China
3 College of Traffic & Transportation, Chongqing Jiaotong University, Chongqing 400074, China
4 Chongqing Digital City Technology Co., Ltd., Chongqing 400074, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(11), 3388; https://doi.org/10.3390/s24113388
Submission received: 11 April 2024 / Revised: 16 May 2024 / Accepted: 21 May 2024 / Published: 24 May 2024
(This article belongs to the Section Navigation and Positioning)

Abstract:
Visual simultaneous localization and mapping (VSLAM) enhances the navigation of autonomous agents in unfamiliar environments by progressively constructing maps and estimating poses. However, conventional VSLAM pipelines often exhibit degraded performance in dynamic environments featuring mobile objects. Recent research in deep learning has led to notable progress in semantic segmentation, which involves assigning semantic labels to image pixels. The integration of semantic segmentation into VSLAM can effectively differentiate between static and dynamic elements in intricate scenes. This paper provides a comprehensive comparative review on leveraging semantic segmentation to improve major components of VSLAM, including visual odometry, loop closure detection, and environmental mapping. Key principles and methods for both traditional VSLAM and deep semantic segmentation are introduced. This paper presents an overview and comparative analysis of the technical implementations of semantic integration across various modules of the VSLAM pipeline. Furthermore, it examines the features and potential use cases associated with the fusion of VSLAM and semantics. It is found that existing semantic VSLAM models continue to face challenges related to computational complexity. Promising future research directions are identified, including efficient model design, multimodal fusion, online adaptation, dynamic scene reconstruction, and end-to-end joint optimization. This review sheds light on the emerging paradigm of semantic VSLAM and how deep learning-enabled semantic reasoning can unlock new capabilities for autonomous intelligent systems to operate reliably in the real world.

1. Introduction

The utilization of visual simultaneous localization and mapping (VSLAM) enabled autonomous agents to operate in unfamiliar environments with greater efficiency. This was achieved through the progressive construction of maps and the estimation of poses. However, the efficacy of traditional VSLAM systems tended to decline in dynamic environments with movable objects. The integration of semantic segmentation into VSLAM enabled the effective differentiation between static and dynamic elements in intricate scenes. The matching and localization in most existing VSLAM systems were dependent on geometric features including points, lines, or planes. These fundamental features often lacked meaningful interpretations of the 3D environment, leading to limitations in distinguishing between static and dynamic elements within the environment.
In recent years, with the development of deep learning, semantic segmentation technologies made significant advancements and found applications in various fields such as autonomous driving, augmented reality, and image editing [1]. Semantic segmentation involved dividing a digital image into multiple segments and assigning each pixel to a specific class, such as person, car, road, sidewalk, vegetation, or building [2]. It offered an efficient method for extracting semantic information from visual data. Many researchers investigated the integration of semantic segmentation into conventional VSLAM systems, showcasing improved performance, particularly in highly dynamic environments. This innovative approach was commonly referred to as semantic VSLAM or semantic SLAM [3]. The semantic segmentation model was primarily utilized in visual odometry, loop closure detection, and environment mapping. Semantic segmentation aided visual odometry in identifying dynamic objects by providing a pixel-level understanding of semantics. Changes in viewpoint, lighting conditions, environmental dynamics, and perceptual aliasing could compromise the flexibility and accuracy of loop closure detection when using original visual features. However, integrating semantic segmentation emphasized stable background structure and provided a powerful clue for enhancing loop closure detection under different conditions. By integrating pixel-level semantic labels, semantic segmentation distinguished various categories such as walls, furniture, objects, and people. This approach explicitly represented the environment rather than just depicting surfaces or geometric primitives, leading to a more comprehensive understanding of the environment and enhancing several aspects. Ultimately, this method improved the intelligent interaction and decision-making ability of environment mapping [4].
However, current research on semantic VSLAM remains constrained in its scope and scale. There is a notable absence of a systematic review and comparative analysis within this evolving domain. The majority of published studies concentrate on isolated algorithm development and experimental verification using limited datasets. The interrelations between semantic segmentation and various aspects of VSLAM have not been exhaustively explored. Furthermore, a comprehensive performance benchmark and evaluation across diverse solutions is lacking. Therefore, it is important to undertake a review of the cutting-edge semantic segmentation technologies implemented in VSLAM to illuminate potential avenues for future research.
The main objective of this study is to present a comprehensive overview of the development and applications of VSLAM. It summarizes and compares the various applications of semantic segmentation in the key components of VSLAM, outlining their roles, benefits, and limitations. Additionally, research gaps in the current literature are identified, and promising future directions in this field are highlighted. The subsequent sections of this paper are structured as follows: Section 2 introduces the fundamental technologies and influential factors in the workflow of traditional VSLAM systems. Section 3 outlines the general principles and classification of mainstream semantic segmentation approaches, with a particular focus on deep learning-based methods. In Section 4, diverse applications of semantic segmentation in the key components of VSLAM are elaborated upon. Finally, Section 5 concludes the paper by summarizing the key findings and discussing potential future research directions. The architecture of this paper is shown in Figure 1.

2. Key Technologies and Influencing Factors of VSLAM

2.1. Workflow of VSLAM

The technologies of VSLAM were categorized into four primary components: front-end visual odometry, back-end optimization, loop closure detection, and mapping. The front-end visual odometry involved extracting features from sequences of images and matching them between frames to calculate the incremental motion of the camera pose, enabling real-time localization but being susceptible to drift accumulation over time [5]. The back-end optimization refined the poses by minimizing the discrepancies between predicted and observed feature locations within a temporal window, thereby mitigating accumulated drift [6]. Loop closure detection identified previously visited locations upon revisiting them, establishing constraints between current and past poses to constrain drift. The mapping module combined visual data and optimized poses to progressively construct a map of unfamiliar environments [7]. The typical workflow and interconnection among these modules are illustrated in Figure 2.

2.2. Front-End Visual Odometry

Two primary approaches were employed for visual odometry: feature-based methods, which involved matching sparse features across frames, and direct methods, which analyzed the intensities of all pixels. Feature-based methods, including ORB-SLAM [8], were renowned for their robustness but could encounter difficulties in textureless regions. In contrast, direct methods, exemplified by LSD-SLAM [9], bypassed explicit data association but exhibited lower resilience in dynamic scenes. Semi-direct methods aimed to strike a balance between the two by tracking sparse features and minimizing photometric errors. Challenges encountered in visual odometry included illumination variations, motion blur, occlusions, and the presence of dynamic objects in the surroundings.
Numerous approaches have been put forward to enhance the robustness of visual odometry. Optimization of feature extraction algorithms including scale-invariant feature transform (SIFT) and oriented FAST and rotated BRIEF (ORB) was pursued, and the integration of visual data with inertial measurements had demonstrated improved accuracy in challenging scenarios. For instance, Ortiz, et al. [10] integrated SIFT into position- and scale-invariant feature transform (PSIFT), refining SIFT to 48 bytes, with PSIFT exhibiting comparable performance to SIFT while achieving enhanced accuracy and efficiency, outperforming most contemporary binary descriptors. Additionally, Wang, et al. [11] combined self-motion estimation with sequence-based learning using deep neural networks. Specifically, they utilized Convolutional Neural Networks (CNNs) to estimate camera motion in optical flow and modeled motion dynamics using Recurrent Neural Networks (RNNs). However, in complex or dynamic environments, traditional front-end visual odometry positioning was often not accurate enough, resulting in an escalating error over time.
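As a concrete illustration of the feature-based pipeline described above, the following is a minimal sketch of frame-to-frame motion estimation using OpenCV's ORB features, brute-force matching, and RANSAC-based essential matrix estimation; the intrinsic matrix K and image pair are assumed given, and a full VSLAM front end would add keyframe selection and local mapping on top of this step.

```python
# Minimal feature-based visual odometry step (illustrative sketch, not a full system).
import cv2
import numpy as np

def estimate_relative_pose(img_prev, img_curr, K):
    """Estimate camera rotation R and unit-scale translation t between two frames."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)

    # Brute-force Hamming matching with cross-check for ORB binary descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # RANSAC-based essential matrix estimation rejects outlier correspondences.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # translation is recoverable only up to scale for a monocular camera
```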

2.3. Back-End Optimization

Two primary categories of approaches were employed for back-end optimization: filtering methods and smoothing methods. Filtering methods, including the Extended Kalman Filter (EKF), iteratively updated pose estimates by incorporating motion dynamics and observations [12]. On the other hand, smoothing methods, like bundle adjustment, optimized poses within a sliding window by minimizing the reprojection errors of all features. While filtering methods were computationally efficient, smoothing methods traded off increased accuracy for higher computational costs.
In addition to standard optimization techniques, various enhancements were introduced, including the integration of geometric constraints and the utilization of geometric constraint-based joint optical flow for identifying dynamic feature points. Zhao and Vela [13] introduced the maximum logarithm of determinant (Max-logDet) metric to guide feature selection in least-squares pose optimization, demonstrating through experiments that optimized least-squares algorithms could achieve effective feature selection, thereby significantly improving pose tracking accuracy. Furthermore, Zhao and Vela [14] proposed a method that extracted the most informative segments from each 3D line through appropriate line segment segmentation for pose optimization formulations. The results illustrated that precise line segmentation could enhance pose estimation accuracy, although the limitation lay in the incapacity of low-level geometric features to semantically handle dynamic environments. In environments that lacked texture, traditional loop closure detection methods might have struggled to function effectively and could have been sensitive to variations in lighting, potentially resulting in erroneous or overlooked detections.
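To make the smoothing idea concrete, the following is a minimal sketch of refining a single 6-DoF camera pose by minimizing reprojection errors with SciPy's least-squares solver; the 3D landmark positions, their 2D observations, and the intrinsic matrix K are assumed known, whereas full bundle adjustment jointly optimizes many poses and the landmarks themselves.

```python
# Minimal pose refinement by reprojection-error minimization (illustrative sketch).
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(pose, points_3d, observations, K):
    """pose = [rx, ry, rz, tx, ty, tz] (axis-angle rotation + translation)."""
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    t = pose[3:]
    cam_points = points_3d @ R.T + t           # world frame -> camera frame
    proj = cam_points @ K.T
    proj = proj[:, :2] / proj[:, 2:3]          # perspective division to pixels
    return (proj - observations).ravel()       # per-feature pixel residuals

def refine_pose(initial_pose, points_3d, observations, K):
    # Levenberg-Marquardt least squares over the 6-DoF camera pose.
    result = least_squares(reprojection_residuals, initial_pose,
                           args=(points_3d, observations, K), method="lm")
    return result.x
```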

2.4. Loop Closure Detection

Loop closure detection served to determine whether the camera revisited a previously mapped region, offering constraints to minimize drift [15]. The main approaches included appearance-based methods utilizing image retrieval techniques and learning-based methods that leveraged CNNs.
Appearance-based methods commonly utilized the bag-of-words model to create global image descriptors. Several studies enhanced VSLAM algorithms by refining the bag-of-words model [16]. Shen, et al. [17] proposed a loop closure detection algorithm based on an enhanced real-time updating bag-of-words model. By extracting feature descriptors from online images and integrating them with preloaded offline words, a customized bag-of-words specific to the mobile robot’s operational environment was generated. This tailored bag-of-words adapted to the robot’s specific application scene, thereby enhancing the system’s resilience. Additionally, Xi, et al. [18] proposed a slam-dunk loop closure detection algorithm that optimized the bag-of-words model. They enhanced the clustering algorithm to create the offline vocabulary tree and utilized an improved K-Means algorithm for vocabulary tree construction. Furthermore, they continually updated the vocabulary tree with image feature data from the real-world scene to enhance its representational capabilities. Although these approaches were successful in location recognition, they heavily depended on viewpoint invariance and robust feature extraction [19].
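The following is a minimal, illustrative sketch of the bag-of-words idea discussed above: a visual vocabulary is built offline by clustering feature descriptors, each image is summarized as a normalized word histogram, and histogram similarity nominates loop closure candidates. Treating binary ORB descriptors as float vectors for k-means is a simplification; practical systems such as DBoW2 add vocabulary trees, inverted indices, and temporal consistency checks.

```python
# Minimal bag-of-words loop closure scoring (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, num_words=500):
    # Cluster a large sample of descriptors (binary descriptors treated as floats here).
    return KMeans(n_clusters=num_words, n_init=10).fit(all_descriptors.astype(np.float32))

def bow_histogram(descriptors, vocabulary):
    # Quantize each descriptor to its nearest visual word and normalize the histogram.
    words = vocabulary.predict(descriptors.astype(np.float32))
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-12)

def loop_score(hist_query, hist_candidate):
    # Cosine similarity between normalized word histograms; higher means more similar.
    return float(np.dot(hist_query, hist_candidate))
```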

2.5. Mapping

Based on different front-end visual odometry methods, the resulting maps were categorized into sparse, semi-dense, and dense maps [20]. Sparse maps, generated from feature-based odometry, consisted of point landmarks. Semi-dense maps incorporated certain surface elements computed through direct methods. Dense maps estimated depths for all pixels using stereo or monocular depth estimation techniques. While denser maps offered more detailed information about structures and textures, they required more intensive computational resources.
The quality of map construction in VSLAM was heavily influenced by environmental features. In environments with rich textures and distinctive features, VSLAM algorithms typically excelled in map building and localization estimation. Conversely, performance might have deteriorated in environments with limited textures or sparse features. The conventional ORB-SLAM was a feature-based approach that constructed sparse maps from point clouds but might have lacked the necessary detail to accurately identify specific objects [8]. Consequently, Sunderhauf et al. [21] introduced an object-oriented semantic mapping method that integrated a single-shot multi-box detector (SSD) and ORB-SLAM2. This approach dynamically maintained and updated point clouds for each object class; however, the individual management and updating of each object in the map presented challenges for robots in distinguishing objects of the same category in practical scenarios [22].
Metrics used to assess mapping accuracy encompassed absolute trajectory error for global consistency, relative pose error for drift assessment, and completeness to gauge the extent of the true environment coverage [23]. The fidelity of reconstructed maps was significantly influenced by the richness of features present in the perceptual environments.

2.6. Evaluation Metrics

Various metrics were proposed to assess and compare the performance of different VSLAM systems. The most frequently utilized metrics included absolute trajectory error (ATE), relative pose error (RPE), and map completeness [24].

2.6.1. Absolute Trajectory Error (ATE)

The ATE metric quantified the global consistency between the estimated trajectory and the ground truth trajectory. It was computed as the root-mean-squared error (RMSE) between the positions of estimated poses and ground truth poses at each timestamp.
ATE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left\| \mathrm{trans}\!\left( T_{\mathrm{gt},i}^{-1}\, T_{\mathrm{est},i} \right) \right\|_2^2}
where T_est,i and T_gt,i are the estimated and ground truth poses at time i, trans(·) extracts the translational component, and N is the total number of poses. ATE evaluates the overall localization accuracy of the VSLAM system. Lower ATE indicates better global consistency. However, ATE does not reflect the drift or stability of the system.
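The ATE computation can be summarized by the short sketch below, assuming the estimated and ground truth trajectories are time-aligned lists of 4x4 homogeneous pose matrices.

```python
# Minimal ATE (RMSE of translational error) computation, illustrative sketch.
import numpy as np

def absolute_trajectory_error(poses_gt, poses_est):
    errors = []
    for T_gt, T_est in zip(poses_gt, poses_est):
        delta = np.linalg.inv(T_gt) @ T_est          # relative transform gt -> estimate
        errors.append(np.linalg.norm(delta[:3, 3]))  # translational part only
    return float(np.sqrt(np.mean(np.square(errors))))  # RMSE over all timestamps
```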

2.6.2. Relative Pose Error (RPE)

RPE quantified the error in the relative pose changes between two time steps over a fixed time interval Δt:
RPE_{\mathrm{trans}} = \sqrt{\frac{1}{N-\Delta t}\sum_{i=1}^{N-\Delta t} \left\| \mathrm{trans}\!\left( \left( T_{\mathrm{gt},i}^{-1}\, T_{\mathrm{gt},i+\Delta t} \right)^{-1} \left( T_{\mathrm{est},i}^{-1}\, T_{\mathrm{est},i+\Delta t} \right) \right) \right\|_2^2}
where trans(·) represents the translational part of the transformation inside the parentheses; T_est,i and T_gt,i are the estimated and ground truth poses at time i; Δt indicates the time interval; and N is the total number of poses, so that N − Δt relative pose changes are compared [25]. RPE effectively evaluates the local drift of VSLAM algorithms over a given time interval. Lower RPE indicates more stable odometry output and smaller drift. However, RPE does not quantify global consistency or absolute accuracy.
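A corresponding sketch of the translational RPE, using the same 4x4 pose convention as the ATE example above, is given below.

```python
# Minimal translational RPE over a fixed interval dt, illustrative sketch.
import numpy as np

def relative_pose_error(poses_gt, poses_est, dt=1):
    errors = []
    for i in range(len(poses_gt) - dt):
        rel_gt = np.linalg.inv(poses_gt[i]) @ poses_gt[i + dt]
        rel_est = np.linalg.inv(poses_est[i]) @ poses_est[i + dt]
        delta = np.linalg.inv(rel_gt) @ rel_est       # error in the relative motion
        errors.append(np.linalg.norm(delta[:3, 3]))   # translational drift per interval
    return float(np.sqrt(np.mean(np.square(errors))))
```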

2.6.3. Map Completeness

In addition to localization accuracy, the completeness of reconstructed maps was another crucial metric, particularly for robot navigation and planning [26]. More comprehensive maps offered greater detail about objects, structures, and textures within the environments. However, quantifying map completeness was more challenging compared to numerical pose errors. Currently, some researchers examined the integrity evaluation of maps based on volume, surface, and semantic classification, as outlined in Table 1.
While completeness metrics based on volumetric, surface, or semantic classifications have been proposed, standardized benchmarks are still lacking.
In summary, ATE and RPE were the two most widely adopted metrics for evaluating VSLAM systems in terms of localization accuracy and drift. Map completeness was also significant but lacked standardized quantitative benchmarks. Employing these complementary metrics provided a more comprehensive understanding of the performance of VSLAM algorithms and systems.

2.7. Influencing Factors

The performance of VSLAM systems could be influenced by numerous factors pertaining to the environments, sensors, algorithms, and system capabilities. These factors determined the accuracy, robustness, and efficiency with which a VSLAM system could operate in real-world conditions. The primary influencing factors are summarized below:
Environments: The perceptual environments imposed fundamental constraints on the quality of visual observations and features that could be extracted [27], directly impacting the performance of VSLAM algorithms. Environments characterized by rich textures and stable lighting conditions were more conducive to optimal performance, whereas low-textured areas, repetitive patterns, reflective surfaces, and varying illuminations could diminish visual processing and data association capabilities. Highly dynamic environments with numerous moving objects could also adversely affect motion estimation and map construction [28]. Field environments typically present more challenges compared to indoor settings. A consistent solution to mitigate the interference caused by dynamic objects in SLAM systems involves employing object detection and image segmentation algorithms to filter out dynamic regions in the images before visual odometry. Subsequently, the camera’s approximate position is computed using static environmental points, and a map containing semantic information is generated.
Sensing modalities: The types and characteristics of visual sensors determined the visual information perceived by the VSLAM system [29]. Monocular cameras imposed fewer constraints on motion estimates compared to stereo or red, green, blue, and depth (RGB-D) cameras. Range sensors including lasers and depth cameras provided geometric structures more directly but lacked color and texture details. Parameters including field of view, resolution, frame rate, exposure time, and other intrinsic factors also influenced the amount of usable information captured from the surroundings. Multi-sensor fusion involving inertial, global positioning system (GPS), or other exteroceptive data could potentially compensate for the limitations of individual sensors. The Frustum PointNets model integrated both RGB cameras and light detection and ranging (LiDAR) sensors to enhance scene understanding accuracy [30].
Algorithms: The feature extraction, data association, motion estimation, and map optimization components within the VSLAM system collectively contributed to its overall performance, robustness, and efficiency. Utilizing more repeatable feature detectors and descriptors enhanced data association for tracking and loop closure. The choice between filtering and smoothing techniques introduced a trade-off between efficiency and accuracy. Moreover, map representations impacted interpretability, storage requirements, and computational complexity [31]. The development of you only look at coefficients with dynamic convolutions (YOLACT-Dyna) aimed to eliminate potential moving objects in the scene and provide an approximate camera pose estimation. Subsequently, leveraging the camera pose and epipolar constraints, the algorithm calculated motion probabilities for each potential moving object. Finally, motion feature points were filtered out, and the pose was computed using static feature points.
Existence of loops: Detected loops presented crucial opportunities to diminish accumulated drift and strengthen global map consistency. Smaller loops with shorter intervals aided in constraining pose errors more frequently [32]. However, repetitive environments with ambiguous appearance information could result in false loop detections. The quantity, size, and frequency of loops in the trajectory fundamentally influenced performance [33].
Motion dynamics: Highly dynamic movements characterized by high speed, sudden rotations, and aggressiveness could have a detrimental impact on visual processing in VSLAM. Motion blur and rolling shutter effects could degrade visual feature extraction and matching [34]. Moreover, complex maneuvers further complicated motion estimation, potentially violating the assumptions of typical VSLAM algorithms. Wen, et al. [35] introduced a novel visual SLAM approach named DP-SLAM, which relied on sparse feature tracking and integrated the concept of motion probability. A propagation model was utilized for dynamic keypoint identification. This methodology was integrated into the front end of the ORB-SLAM2 system, serving as a preprocessing stage to filter out keypoints associated with moving objects. Furthermore, the backgrounds of frames occluded by identified dynamic objects were painted over, offering benefits for applications including virtual reality and augmented reality.
Hardware capabilities: The computational capacity of onboard processors and co-processors determined the feasibility of deploying computationally intensive algorithms for real-time VSLAM. Additionally, power consumption and heat dissipation presented constraints on embedded system design [36]. The available onboard memories limited the duration and resolution of mapping sessions, while communication bandwidth impacted multi-robot collaborative mapping capabilities. Yanik, et al. [37] conducted an evaluation and comparison of three visual SLAM methods—ORB-SLAM2, direct sparse odometry (DSO), and DSO with loop closure (LDSO)—in terms of energy consumption and resource usage, as shown in Figure 3.
In summary, a multitude of interrelated factors encompassing environments, sensors, algorithms, hardware, loops, and motion impacted the accuracy, robustness, and efficiency of VSLAM systems. A comprehensive understanding of these influential factors offered valuable insights into the trade-offs and constraints associated with various design decisions. This understanding propelled research efforts towards the development of more comprehensive and resilient solutions for VSLAM in challenging real-world scenarios.

3. Principles and Methods of Semantic Segmentation

Semantic segmentation involved assigning semantic labels including person, car, road, etc., to each pixel in an image, offering a powerful method to extract high-level understanding from visual data. With the resurgence of deep learning in recent years, CNNs emerged as the predominant approach, demonstrating remarkable success in semantic segmentation [38]. This section begins by presenting the definition and evaluation metrics of semantic segmentation. Subsequently, common network architectures are discussed, with a focus on encoder–decoder-based frameworks. Finally, factors influencing the performance of semantic segmentation models are summarized.

3.1. Definition and Evaluation Metrics

Semantic segmentation aimed to divide an image into non-overlapping regions with the same semantics, essentially framing it as a pixel-level classification challenge. Given an input image (I) comprising N pixels, semantic segmentation assigned a semantic label (li) to each pixel (i), with li drawn from a predefined label set (L) containing K potential classes. The output was a label map with dimensions matching those of the input image.
Two commonly used evaluation metrics for semantic segmentation were pixel accuracy and mean intersection over union (mIoU) [39]. Pixel accuracy quantified the percentage of correctly classified pixels, while mIoU assessed the overlap between predicted and ground truth masks for each class before averaging the results. Higher values indicated superior segmentation performance. Additional metrics like frequency-weighted intersection over union (IoU) could accentuate the importance of rare classes.
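The following sketch shows how pixel accuracy and mIoU can be computed from a confusion matrix, assuming integer label maps of identical shape with classes indexed from 0 to K−1.

```python
# Minimal pixel accuracy and mIoU computation, illustrative sketch.
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    # Confusion matrix: rows are ground truth classes, columns are predicted classes.
    conf = np.bincount(gt.ravel() * num_classes + pred.ravel(),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    pixel_acc = np.diag(conf).sum() / conf.sum()
    # Per-class IoU = TP / (TP + FP + FN); classes absent from both maps are ignored.
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = np.divide(tp, union, out=np.full_like(tp, np.nan), where=union > 0)
    return pixel_acc, np.nanmean(iou)
```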

3.2. Encoder–Decoder Architectures

Most state-of-the-art semantic segmentation models leveraged convolutional encoder–decoder architectures [40]. The encoder progressively reduced spatial resolution and extracted visual features through a sequence of convolutional and pooling layers. Conversely, the decoder symmetrically restored object details and spatial dimensions via upsampling and convolutions, enabling the integration of semantic insights from deep layers with fine-grained details from earlier layers.
Earlier designs converted CNN classifiers into fully convolutional networks (FCNs) and introduced upsampling layers to recover spatial intricacies [41]. U-Net innovatively introduced skip connections linking corresponding encoder and decoder layers to incorporate multi-scale semantic information [42]. Numerous researchers worldwide refined semantic segmentation models tailored to diverse scenarios. The semantic segmentation network (SegNet) model, proposed by Badrinarayanan et al. [43], stood out as a representative encoder–decoder algorithm for road and vehicle segmentation. Compared to the architecture of FCNs, SegNet boasted a significantly reduced size. This reduction was primarily attributed to the utilization of positional information in SegNet from recorded pooling operations instead of direct deconvolution processes. In SegNet, the pooling layers not only retained maximum values but also stored the spatial positions of these maxima in the original image. This approach facilitated accurate mapping of relevant values to their respective positions during upsampling, thereby enhancing the accuracy of the reconstructed image.
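The encoder–decoder structure with a skip connection can be illustrated by the minimal PyTorch sketch below; real segmentation networks use much deeper backbones and several scales, so this is only a schematic of the U-Net/SegNet-style design, assuming inputs with even height and width.

```python
# Minimal encoder-decoder with one U-Net-style skip connection, illustrative sketch.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=21):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)   # restore resolution
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)           # per-pixel class scores

    def forward(self, x):
        f1 = self.enc1(x)                                # full-resolution features
        f2 = self.enc2(f1)                               # half-resolution features
        up = self.up(f2)
        fused = self.dec(torch.cat([up, f1], dim=1))     # skip connection fuses scales
        return self.head(fused)                          # logits: (N, num_classes, H, W)
```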
Some researchers also focused on optimizing models from a structural standpoint. For instance, the refinement network (RefineNet) employed a multi-path refinement architecture to fuse features from various levels, and DeepLabv3 enhanced the encoder with atrous convolutions to encode multi-scale context [44]. The pyramid scene parsing network (PSPNet) integrated a pyramid pooling module before the decoder to aggregate contextual information [45]. In recent years, new models were proposed to address the degradation problem in order to achieve higher accuracy with a large number of convolutional layers. While some of these models demonstrated improved results [46], achieving an appropriate trade-off between efficiency and accuracy remained a challenging task for large datasets. Kazerouni et al. [47] proposed the Ghost-Unet model as an asymmetric encoder–decoder structure for high-precision semantic segmentation, considering a reasonable number of convolutional layers, which received positive feedback in terms of accuracy and efficiency.

3.3. Context Modeling Modules

To incorporate broader contextual information beyond local receptive fields, various context modeling modules were introduced. Atrous Spatial Pyramid Pooling (ASPP) scanned an image with filters at multiple sampling rates to capture multi-scale information [48]. Non-local networks utilized non-local operations to capture long-range dependencies. Algorithms that integrated contextual information aimed to enhance the accuracy and robustness of semantic segmentation by leveraging local and global information, as well as features from different scales. These algorithms were better equipped to address challenges including complex scenes and unclear boundaries, leading to improved semantic segmentation outcomes. Noteworthy algorithms included conditional random fields (CRFs) [49], dilated convolutions [50], and multi-scale predictions [51].
CRFs further refined predictions by smoothing using pixel affinities. Attention mechanisms dynamically aggregated contextual information surrounding regions of interest. These modules enhanced contextual representations and relationships to improve segmentation accuracy. The seminal model for conditional random fields was the DeepLabv1 model, introduced by Chen et al. [52]. It incorporated a fully connected CRF model as an independent back-end processing step to optimize segmentation results. Each pixel was treated as a node within the region, and the association between two pixels, regardless of distance, influenced pixel label classification. This strategy aided in recovering local details that may be lost due to the spatial invariance of CNNs. Although the fully connected model was computationally intensive, the DeepLabv1 model employed approximate algorithms to significantly reduce computational costs. However, these methods overlooked the specificity of class weights in the classification layer. Zhu et al. [29] observed that class weights for neighboring boundary pixels often lacked discrimination, thereby hindering performance. To address this issue, a novel approach called embedded conditional random fields (E-CRF) was proposed. E-CRF seamlessly integrated CRFs into the CNNs to achieve more efficient end-to-end optimization. It employed CRFs to facilitate message passing between pixels in high-level features and refined the feature representation of boundary pixels by utilizing internal pixels belonging to the same object.
Dilated convolutions, also known as atrous convolutions, were initially introduced as a signal processing technique. In CNNs, dilated convolutions could significantly expand the receptive field without introducing additional parameters. Consequently, large-scale pooling operations were unnecessary to enlarge the receptive field, preventing the loss of fine-grained information associated with pooling. Dilated convolutions were often coupled with multi-scale predictions. Zheng, et al. [53] established a spatial pyramid (ASPP) by connecting multiple dilated convolutions in parallel. ASPP employed dilated convolutions with varying dilation rates to conduct diverse convolution operations on the feature map. It achieved a larger receptive field than the size of its convolution kernel without increasing the number of parameters. Notably, the size of the feature map remained unchanged post dilated convolution. However, dilated convolutions encountered the gridding issue, where zeros were inserted between two sampled pixels of the convolution kernel. Excessive dilation rates could lead to overly sparse convolutions, resulting in inadequate information capture due to sparse input sampling and hindering effective model learning.
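The parallel dilated-convolution design of ASPP can be sketched as follows in PyTorch; the dilation rates are illustrative, and the module preserves the spatial resolution of its input while enlarging the receptive field.

```python
# Minimal ASPP-style module: parallel dilated convolutions, illustrative sketch.
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    def __init__(self, in_channels, out_channels, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      padding=r, dilation=r)            # same output size for every rate
            for r in rates
        ])
        self.project = nn.Conv2d(out_channels * len(rates), out_channels, kernel_size=1)

    def forward(self, x):
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.project(multi_scale)                 # fused multi-scale context features
```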
The core principle behind multi-scale predictions was to expand the receptive field across multiple resolutions and effectively enrich feature information of the target task by merging features from various scales, thereby enhancing segmentation accuracy. Integrating multi-scale predictions with CRFs could further enhance segmentation accuracy. Such networks effectively harnessed both local and global contextual information, spanning from the entire scene to each pixel, to carry out pixel-level label estimation. For example, Ding et al. [53,54] proposed a segmentation network called CGBNet, which improved segmentation performance through context encoding and multipath decoding. This network first generated context-contrasted local features through a context encoding module, making use of informative context and discriminative local information. This context encoding module greatly improved segmentation performance, especially for inconspicuous objects.

3.4. Training Strategies

In many instances, manual annotation was required for semantic segmentation tasks as training data. Semantic segmentation models were designed to conduct pixel-level image segmentation [55], where each pixel was assigned a semantic label to delineate the boundaries and categories of various objects and regions within the image. This typically necessitated detailed pixel-level annotations that specified the semantic category to which each pixel belonged. To alleviate the demand for manual annotation, weakly supervised semantic segmentation (WSSS) [56] based on image-level labels garnered attention due to its reduced annotation cost.
Existing methods often leveraged class activation maps (CAMs) to assess the correlation between image pixels and classifier weights. However, classifiers tended to focus solely on discriminative regions, disregarding other valuable information in each image, leading to incomplete localization maps. To tackle this challenge, Chen et al. [3] introduced a self-supervised methodology known as Self-supervised image-specific prototype exploration (SIPE), comprising image-specific prototype exploration (IPE) and general specific consistency (GSC) losses. Specifically, IPE tailored prototypes for each image to capture comprehensive regions, forming image-specific CAMs (IS-CAMs). GSC was implemented to ensure alignment between the general CAMs and the specific IS-CAMs, thereby refining feature representation and strengthening the self-correcting capability of prototype exploration. Additionally, the model benefited from data augmentation techniques including scaling, cropping, flipping, and rotation to prevent overfitting and enhance robustness.
Liu, et al. [57] proposed a novel concept termed projection onto orthogonal prototypes (POP), which updated features to recognize new classes without impacting the base classes. A collection of orthogonal prototypes was established in POP, with each prototype representing a semantic class, and each class was predicted by the features projected onto its respective prototype. Uniform class sampling was employed to ensure equal contribution from each class during training [58]. Online hard example mining focused on instances that were misclassified, thereby enhancing performance in challenging scenarios. Joint training on multiple datasets with diverse data distributions bolstered the model’s generalization capability.
Nevertheless, traditional distillation methods had struggled with LiDAR-based semantic segmentation due to the complexities posed by the sparsity, randomness, and density variations of point clouds. Hou, et al. [59] proposed the utilization of output distillation for point-wise and voxel-wise information to complement sparse supervision signals. The complete point cloud was partitioned into multiple super-voxels, and a difficulty-aware sampling strategy was devised to more frequently sample super-voxels containing low-frequency classes and distant objects. Point-wise and voxel-wise affinity distillation was implemented on these super-voxels, leveraging similarity information between points and super-voxels to aid the model in capturing structural details about the surrounding environment more effectively. In order to address the issue of data imbalance in semantic segmentation, Zhu et al. [60] proposed a central sampling strategy to evenly select training samples from each class in each epoch. A rapid training program was also introduced to reduce the computational burden. This allowed the researchers to explore the use of a large number of pseudo-labels. The network structure of the domain-adaptive transformer (DAFormer) proposed by Hoyer et al. [58] consisted of a Transformer encoder and a multi-level context-aware feature fusion decoder. It represented a significant advancement for unsupervised domain adaptation.

3.5. Factors Affecting Performance

The performance of semantic segmentation models was influenced by a variety of factors including network architecture, training data, optimization strategies, model regularization, contextual modeling, and hardware resources. The intricate interplay of these factors ultimately dictated the accuracy, speed, and robustness of the models.
Network Architecture Design: The overall capacity, depth, receptive field size, and path aggregation strategies within network architectures significantly impacted feature learning and the capabilities for multi-scale contextual modeling. Larger and deeper backbones including ResNet [61] facilitated the encoding of more robust features. Expanding receptive fields through atrous convolutions or spatial pyramid pooling enabled the capture of broader context. Advanced decoder modules featuring skip connections facilitated the fusion of hierarchical features from varying scales. Many existing methods compromised spatial resolution to achieve real-time inference speed, resulting in diminished performance. Yu, et al. [62] devised a small stride spatial path to preserve spatial details and generate high-resolution features. Concurrently, they employed a context path with a rapid downsampling approach to acquire ample receptive field coverage. Introducing a new feature fusion module on top of these two paths effectively amalgamated features, leading to an mIoU of 68.4% on the Cityscapes test dataset [63].
Training Data: The quantity, quality, and diversity of annotated training data fundamentally determined the performance boundaries of models. Larger datasets with diverse annotation variations contributed to enhanced generalization. High-quality labels featuring precise segmentation boundaries enabled models to grasp finer details. Class balancing [64] and data augmentation [65] further bolstered robustness. Kenjic, et al. [66] introduced outlier removal for re-labeling, class-driven balancing of validation and training datasets, and targeted image processing for underrepresented classes. Evaluation outcomes demonstrated enhanced inference accuracy compared to utilizing common open-source datasets, achieving an mIoU of 68.4%.
Loss Functions: The selection of pixel-wise loss functions influenced model training behavior [67]. Weighted loss functions addressed class imbalance challenges, while bootstrapped loss focused on challenging examples. Lovász loss directly optimized the IoU metric. Combining losses with distinct focuses contributed to overall performance improvement. The choice of loss function typically hinged on the characteristics of the training dataset, including its distribution, skewness, and boundaries. Focal-based loss functions were beneficial for highly imbalanced segmentation tasks. Binary cross-entropy was suitable for balanced datasets, whereas mildly skewed datasets could benefit from smooth or generalized dice coefficient.
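Two of the loss choices mentioned above can be sketched as follows, assuming PyTorch logits of shape (N, K, H, W) and integer targets of shape (N, H, W); the weights and smoothing epsilon are illustrative.

```python
# Minimal sketches of class-weighted cross-entropy and soft Dice loss, illustrative only.
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, target, class_weights):
    # class_weights is a length-K tensor; rare classes receive larger weights.
    return F.cross_entropy(logits, target, weight=class_weights)

def soft_dice_loss(logits, target, eps=1e-6):
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    intersection = (probs * one_hot).sum(dim=(0, 2, 3))
    cardinality = (probs + one_hot).sum(dim=(0, 2, 3))
    dice = (2 * intersection + eps) / (cardinality + eps)   # per-class soft Dice score
    return 1.0 - dice.mean()                                # lower is better overlap loss
```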
Optimization Schemes: Optimization hyperparameters including batch size, learning rates, and schedules played a pivotal role in model convergence and training efficiency [68]. Larger batch sizes leveraged batch normalization but necessitated increased GPU memory. Well-calibrated learning rate schedules and warm-up strategies expedited convergence. The online retraining approach empowered the segmentation network to effectively learn from confident regions biased towards accurate labels.
Context Modeling: Multi-scale context aggregation modules heightened localization and recognition capabilities by capturing broader contextual information. Atrous convolutions, pyramid pooling, and non-local operations offered complementary contextual representations [69]. Conditional random fields further refined predictions based on pixel affinities. Li, et al. [70] proposed a novel context-based cascaded network, CTNet, which delved into spatial and channel contextual information to unveil semantic contexts for semantic segmentation. The spatial context module leveraged pixel-class correlations to unveil spatial contextual dependencies among pixels [71]. Simultaneously, the channel context module modeled long-term semantic relationships between channels to learn semantic features encompassing semantic feature maps and class-specific features. By utilizing the acquired semantic features as prior knowledge to guide network learning, CTNet captured more precise long-range spatial dependencies.
Regularization: Various regularization techniques were employed to mitigate overfitting to the training data [72]. Weight decay restricted weight norms, while data augmentation techniques like flipping, scaling, cropping, and elastic deformations enhanced training sample diversity. Dropout randomly deactivated units during training to prevent co-adaptation. Yuan, et al. [73] introduced a novel form of batch normalization known as distribution-specific batch normalization (DSBN) to address this issue. They underscored the significance of robust augmentation methods for semantic segmentation and achieved state-of-the-art outcomes in the semi-supervised context using urban landscape and Pascal VOC datasets.
Hardware Resources: Dedicated GPUs and TPUs expedited training and inference by efficiently parallelizing operations. Swifter hardware facilitated training larger models, handling bigger batches, and extending training schedules. Model optimization through pruning, quantization, and distillation tailored intricate models for deployment [74].

3.6. Representative Models

While earlier studies utilized FCN-based models including DeconvNet [75] and SegNet, which offered limited context modeling capabilities, recent approaches have adopted more sophisticated models like Mask R-CNN [76], DeepLabv3+ [77], and PSPNet [45]. These advanced models provided enhanced multi-scale contextual reasoning, resulting in more precise segmentations. Additionally, lightweight models like ERFNet [78] were employed for efficient inference on embedded devices. The selection of a semantic segmentation model directly influenced the quality of semantic information. Table 2 lists the prevailing semantic segmentation models in the field.
When selecting a semantic segmentation model, the initial consideration was the choice of network architecture. Different network architectures, including CNNs, RNNs, or variational autoencoders (VAEs), offered distinct advantages for various semantic segmentation tasks. For instance, CNNs excelled in processing data with spatial continuity, including images and videos, while RNNs were more adept at handling data with temporal continuity, like speech and text. Secondly, the curation and generation of datasets played a crucial role. A diverse and realistic dataset could enhance the model’s generalization capability, enabling it to make accurate predictions even with unseen data. Moreover, judicious data augmentation strategies, including rotation, scaling, cropping, etc., could assist the model in adapting to diverse scenarios and enhancing its robustness. During the model training phase, the choice of regularization method was equally vital. Techniques like L1 and L2 regularization could manage the model’s complexity and prevent overfitting, where the model performed well on training data but poorly on test data. Additionally, methods like early stopping could be employed to mitigate overfitting. The selection of hardware resources also impacted the model training speed. Leveraging high-performance GPUs for model training could significantly accelerate the training process, thereby enhancing the model’s iteration speed. Lastly, advancements in neural network architectures, training methodologies, and accelerated computing offered substantial support for the advancement of semantic segmentation models. In the future, researchers could leverage deep learning and artificial intelligence techniques to drive further progress in the development of semantic segmentation models.

4. Applications of Semantic Segmentation in VSLAM

Traditional VSLAM systems relied on low-level geometric features for localization and mapping [81]. In contrast, semantic segmentation offered a high-level understanding of environments by assigning semantic labels to image pixels. This capability enabled the differentiation between static and dynamic components in complex scenes. By integrating semantic segmentation, the accuracy and robustness of VSLAM could be enhanced in highly dynamic real-world environments [82]. In this section, various applications of semantic segmentation in key components of the VSLAM pipeline are outlined. Subsequently, a comparative analysis of representative semantic segmentation models utilized in cutting-edge VSLAM systems is conducted.

4.1. Visual Odometry

Visual odometry played a pivotal role in the VSLAM pipeline, aiming to estimate the incremental camera motion between consecutive frames. Traditional visual odometry relied on tracking and matching low-level geometric features, including points, lines, or keypoints, across frames to compute changes in camera pose [83]. However, these low-level features lacked semantic understanding of the 3D environment, hindering their ability to distinguish between static and dynamic elements in complex environments encompassing both background structures and independently moving objects. Figure 4 illustrates the optimization process of semantic segmentation for visual odometry. Features originating from moving objects could compromise the accuracy and robustness of visual odometry.
Semantic segmentation offered a viable solution to overcome this limitation by providing pixel-level semantic comprehension to recognize dynamic objects. The segmented masks could be utilized to filter out features related to moving objects during the estimation of ego-motion between frames, consequently enhancing odometry accuracy in dynamic environments. Notable works that integrated semantic segmentation for robust visual odometry included:
MaskFusion [84]: This approach utilized the CNNs to segment RGB images into static scenes and moving objects, subsequently eliminating masks corresponding to moving objects during feature tracking and matching across frames. This methodology significantly enhanced the efficacy of visual odometry in dynamic environments by mitigating the impact of dynamic objects on camera pose estimation. Such an approach could effectively boost the performance of visual odometry, particularly in intricate environments characterized by numerous dynamic objects. Figure 5 illustrates an example of a real-life scenario.
VDO-SLAM [85]: Utilizing Mask R-CNN for instance segmentation and object tracking, this method excluded objects exhibiting independent motion from ego-motion estimation between frames. VDO-SLAM integrated optical flow or alternative object tracking techniques to monitor objects across consecutive frames. This strategy facilitated the tracking of dynamic objects during camera pose estimation, thereby enhancing the efficacy of visual odometry in dynamic environments. This approach effectively elevated the performance of visual odometry, particularly in intricate environments with a significant presence of dynamic objects. Figure 6 shows the dynamic tracking and recognition of vehicles.
ORB-SLAM3 [86]: ORB-SLAM3, a widely used open-source VSLAM system, incorporated semantic segmentation functionalities into its framework. This system was capable of identifying and tracking semantic objects within the environment, thereby bolstering the stability and scene comprehension capabilities of SLAM. Leveraging the outcomes of semantic segmentation, ORB-SLAM3 could detect and track dynamic objects in the scene. Subsequently, this technique was employed for dynamic object filtering, enabling the filter to mitigate the impact of dynamic objects, thereby enhancing the stability of visual odometry.
SLAM-Net [87]: SLAM-Net was a deep learning model that integrated visual SLAM with semantic segmentation, offering applications in intelligent navigation, autonomous driving, and robotics to enhance environmental perception and path planning capabilities. Through predominantly employing end-to-end training techniques, SLAM-Net enhanced visual odometry. In specific indoor settings, SLAM-Net demonstrated superior performance compared to conventional learned visual odometry methods.
Experiments conducted using these approaches showcased improved accuracy and robustness in contrast to conventional geometry-based visual odometry, particularly evident in highly dynamic outdoor driving datasets. Semantic segmentation offered crucial perception capabilities to segregate complicating elements from odometry estimation in intricate scenes. This breakthrough unleashed new possibilities for VSLAM to function dependably in real-world scenarios [88].
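The masking strategy shared by the systems above can be illustrated with the following sketch: keypoints that fall on pixels labeled as potentially dynamic classes are discarded before relative pose estimation. The class IDs, the safety margin, and the OpenCV-style keypoint interface are assumptions for illustration only.

```python
# Minimal sketch: remove keypoints on dynamic semantic classes before pose estimation.
import numpy as np

DYNAMIC_CLASS_IDS = {11, 12, 13}   # hypothetical IDs for person, rider, car

def filter_dynamic_keypoints(keypoints, descriptors, seg_map, margin=3):
    """Keep only keypoints lying on static classes, with a small safety margin.
    keypoints follow the OpenCV KeyPoint convention (kp.pt = (u, v))."""
    keep = []
    h, w = seg_map.shape
    for idx, kp in enumerate(keypoints):
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        patch = seg_map[max(0, v - margin):min(h, v + margin + 1),
                        max(0, u - margin):min(w, u + margin + 1)]
        if not np.isin(patch, list(DYNAMIC_CLASS_IDS)).any():
            keep.append(idx)
    # The surviving keypoints and descriptors feed the usual matching and pose solver.
    return [keypoints[i] for i in keep], descriptors[keep]
```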

4.2. Loop Closure Detection

Loop closure detection was a crucial component in VSLAM systems designed to address drift and enhance global consistency. It involved identifying situations where the camera revisited a previously visited location and establishing pose constraints between the current and previous positions. Conventional methods relied on image retrieval and feature matching guided by visual appearance cues [89,90]. However, variations in viewpoint, lighting conditions, environmental dynamics, and perceptual aliasing could undermine the robustness and accuracy of loop closure detection using raw visual features. Figure 7 illustrates the process of enhancing loop closure detection through semantic segmentation.
The integration of semantic segmentation furnished robust cues to facilitate loop closure under diverse conditions by emphasizing stable background structures [91,92]. Semantic maps filtered out variable foreground objects and offered invariant scene representations for reliable matching against stored map data. Noteworthy works that leveraged semantic segmentation for enhanced loop closure detection included:
SegMap [93]: SegMap boosted loop closure detection performance by amalgamating semantic information with maps. This system adeptly identified and flagged loop closures to augment localization accuracy. Figure 8 presents how quickly descriptors extracted from incrementally grown segments contained relevant information that could be used for localization. The x-axis represents the growing status of a segment until all its measurements have been accumulated (here termed complete). The logarithmically scaled y-axis represents the number of neighbors in the target map that needed to be taken into account to include the correct target segment (lower values indicated better performance). The SegMap descriptor offered one order of magnitude better retrieval performance for over 40% of the growing process.
SIIS-SLAM [91,94]: SIIS-SLAM was an enhanced system derived from ORB-SLAM3 that incorporated semantic segmentation to enhance loop closure detection performance. This system could identify and track semantic targets, improving the effectiveness of SLAM. Additionally, the absolute trajectory RMSE was evaluated using a publicly available dataset, demonstrating superior performance compared to both the original ORB-SLAM3 and DynaSLAM results. According to the experimental results, the method outlined in this paper was found to be more suitable for indoor environments.
DeepSLAM [95]: DeepSLAM was a VSLAM system driven by deep learning, utilizing CNNs to process semantic segmentation data and integrating it into loop closure detection. This approach enhanced the accuracy and robustness of loop closure detection. Under typical weather conditions, the trajectory results obtained by DeepSLAM closely aligned with those from GPS/INS, as shown in Figure 9. However, experimental findings revealed that when faced with challenges including rain, nighttime conditions, and white balance variations, traditional LSD-SLAM and ORB-SLAM exhibited minimal efficacy, whereas DeepSLAM effectively leveraged prior knowledge acquired through training to perform well. DeepSLAM operated as a form of supervised learning.
These methods demonstrated enhanced loop closure detection capabilities compared to traditional approaches relying on manually crafted features including SIFT or ORB, especially in demanding perceptual scenarios. Semantic segmentation provided a robust high-level comprehension of the surroundings, enabling drift-free relocalization within VSLAM systems.
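A simple illustration of how semantics can supply a viewpoint- and illumination-tolerant loop closure cue is sketched below: per-keyframe label histograms are compared after suppressing dynamic classes, and high similarity nominates candidates for subsequent geometric verification. The class IDs are hypothetical, and real systems combine such cues with appearance descriptors.

```python
# Minimal semantic loop-closure cue: compare static-class label histograms (sketch).
import numpy as np

DYNAMIC_CLASS_IDS = np.array([11, 12, 13])   # hypothetical person/rider/car IDs

def semantic_signature(seg_map, num_classes):
    hist = np.bincount(seg_map.ravel(), minlength=num_classes).astype(np.float64)
    hist[DYNAMIC_CLASS_IDS] = 0.0             # suppress variable foreground objects
    return hist / (hist.sum() + 1e-12)        # normalized static-class distribution

def semantic_similarity(sig_a, sig_b):
    # Histogram intersection in [0, 1]; high values nominate loop closure candidates.
    return float(np.minimum(sig_a, sig_b).sum())
```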

4.3. Environment Mapping

The map representation constituted another crucial component in VSLAM systems. Traditional maps typically consisted of geometric representations that primarily encoded low-level metric or topological information [95,96]. In contrast, semantic maps, which incorporated high-level context, provided a more comprehensive understanding of the environment, thereby enhancing intelligent interactions and decision-making capabilities.
Semantic segmentation facilitated the enhancement of traditional maps by integrating pixel-level semantic labels to differentiate between various classes including walls, furniture, objects, and people [71]. Figure 10 illustrates the process of enhancing environment mapping with semantic segmentation models. This approach enabled unambiguous representations of environments as opposed to solely depicting surfaces or geometric primitives [97]. Noteworthy VSLAM systems that integrated semantics for improved mapping included the following.
Semantic Fusion SLAM [98]: Semantic Fusion integrated semantic segmentation from various perspectives with maps generated by Elastic Fusion. This approach enabled the creation of an effective semantic 3D map and enhanced the precision of single-frame semantic annotation. It established long-term dense correspondence between frames in indoor RGB-D videos, even during complex scanning trajectories. These correspondences facilitated the probabilistic fusion of semantic predictions from multiple viewpoints using CNNs into a map. This not only yielded a valuable semantic 3D map but also demonstrated, using the NYUv2 dataset, that merging multiple predictions led to an enhancement in 2D semantic labeling compared to baseline single-frame predictions.
Panoptic Fusion [99]: Panoptic Fusion combined semantic segmentation and instance segmentation to generate 3D panoptic maps labeling each object instance and background classes. Additionally, researchers constructed a fully connected CRF model with respect to panoptic labels and performed online inference using a novel unary potential approximation and map division strategy, further improving recognition performance. This not only provided an effective semantic 3D map but also enhanced the accuracy of single-frame semantic annotation. Effective validation was achieved in AR scenarios, as shown in Figure 11.
LIO-SAM [100]: LIO-SAM was a VSLAM system based on semantic segmentation. It employed a semantic segmentation model to distinguish various objects and landmarks within the environment, consequently improving the precision of localization and map generation. For example, SAC-SLAM could identify buildings, roads, trees, and traffic signs, leading to the creation of more informative maps.
MapLite [75]: MapLite integrated semantic pixel labels predicted by ERFNet into pose graph optimization for lifelong semantic mapping. These approaches yielded semantic maps with more detailed representations of environments by integrating class-level and instance-level semantic segmentations. This enhanced the system’s capability for scene comprehension, facilitating intelligent interactions and behaviors. The researchers conducted experiments on real-world roads, demonstrating that MapLite outperformed traditional visual odometry methods in trajectory estimation accuracy, as shown in Figure 12.
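The multi-view label fusion underlying these systems can be sketched, under simplified assumptions, as per-class evidence accumulation on each map element (voxel or surfel); the data structure below is illustrative and omits the geometric projection and association steps.

```python
# Minimal probabilistic semantic label fusion for one map element, illustrative sketch.
import numpy as np

class SemanticVoxel:
    def __init__(self, num_classes):
        self.log_probs = np.zeros(num_classes)        # accumulated log-evidence per class

    def fuse(self, class_probs, eps=1e-6):
        # Bayesian-style fusion: multiply per-frame class probabilities (add in log space).
        self.log_probs += np.log(np.asarray(class_probs) + eps)

    def label(self):
        return int(np.argmax(self.log_probs))         # current most likely semantic class

# Usage idea: for every frame, project each voxel into the image, read the segmentation
# network's softmax vector at that pixel, and call voxel.fuse(probs); labels refine over time.
```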

4.4. Model Action Mechanism

The traditional VSLAM algorithm assumed that objects in the environment were static or exhibited little motion. However, the presence of dynamic objects, such as cars, could introduce erroneous observations into the VSLAM system, thereby reducing its accuracy and robustness. Since image data alone might not reveal which objects are dynamic, some of the aforementioned systems applied a semantic segmentation algorithm to filter out dynamic regions in the image and then used the remaining static environmental points to estimate the camera pose and construct a map containing semantic information. Figure 13 shows a classic structure. Although the influence of dynamic objects might not have been eliminated completely, the robustness of the system was significantly enhanced.
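A minimal sketch of this filtering step is shown below, assuming an OpenCV ORB front end and a per-pixel label map produced by the segmentation network; the dynamic class IDs and dilation size are illustrative choices rather than values used by any specific system.

```python
import cv2
import numpy as np

DYNAMIC_CLASSES = [11, 12, 13]  # illustrative IDs, e.g., person, car, bicycle

def static_orb_features(gray, label_map, dilate_px=5):
    """Detect ORB features only on regions not labeled as dynamic classes."""
    dynamic = np.isin(label_map, DYNAMIC_CLASSES).astype(np.uint8)
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    dynamic = cv2.dilate(dynamic, kernel)                       # pad object boundaries
    static_mask = np.where(dynamic == 0, 255, 0).astype(np.uint8)  # nonzero = detection allowed

    orb = cv2.ORB_create(nfeatures=2000)
    keypoints, descriptors = orb.detectAndCompute(gray, static_mask)
    return keypoints, descriptors  # passed on to matching and pose estimation
```

Only the surviving static features are then used for frame-to-frame pose estimation, which is the mechanism illustrated in Figure 13.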
Our analysis indicates that a semantic segmentation model can provide a more accurate understanding of the environment and thereby improve the positioning accuracy of VSLAM: by identifying semantic objects in the scene, VSLAM can better estimate the camera pose. Because it can recognize objects in the environment, a semantic segmentation model also helps VSLAM perform smarter tasks in complex environments, such as avoiding obstacles or choosing the best path. Table 3 summarizes typical usage scenarios and characteristics of VSLAM combined with semantic segmentation models.

4.5. Experimental Comparison

Previous research suggested that a semantic segmentation model could help VSLAM mitigate the impact of dynamic objects on localization and mapping. Subsequently, comparative experiments were conducted on the KITTI dataset, with the results summarized in Table 4. The findings illustrated that VSLAM integrated with a semantic segmentation model demonstrated superior performance in dynamic environments.
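For context on the metric in Table 4, the absolute trajectory error (ATE) RMSE is typically computed by rigidly aligning the estimated trajectory to the ground truth and then taking the root mean square of the position differences. The following minimal sketch, assuming time-synchronized position sequences, illustrates one common way to compute it; it is not the exact evaluation script used for these experiments.

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) after rigid alignment (Horn/Kabsch).

    est, gt: (N, 3) arrays of estimated and ground-truth camera positions,
             assumed to be time-synchronized and in corresponding order.
    """
    est_mean, gt_mean = est.mean(axis=0), gt.mean(axis=0)
    est_c, gt_c = est - est_mean, gt - gt_mean
    # Best-fit rotation from the SVD of the 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(est_c.T @ gt_c)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = gt_mean - R @ est_mean
    aligned = est @ R.T + t
    errors = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))
```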

4.6. Discussion

The integration of semantic segmentation proved to be a valuable strategy for enhancing traditional geometry-based VSLAM systems. By providing pixel-level semantic understanding of environments, semantic segmentation enabled the differentiation between stable background structures and dynamic foreground objects [101]. This enhancement benefited crucial components of the VSLAM pipeline, including visual odometry, loop closure detection, and mapping.
In the realm of visual odometry, semantic segmentation facilitated the identification and exclusion of features associated with movable objects during ego-motion estimation between frames. Compared with traditional methods, some VSLAM systems showed improved accuracy and robustness in dynamic environments. For loop closure detection, semantics assisted in retrieving stable structural scene elements for reliable relocalization despite variations in viewpoint and lighting conditions. In terms of mapping, semantic labels complemented geometric representations with class-level understanding to construct more detailed contextual maps of environments.
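As a toy illustration of the semantics-assisted loop closure retrieval described above, each keyframe can be summarized by a normalized histogram over static semantic classes and compared with earlier keyframes; candidates that score highly would still undergo geometric verification in a real system. The class IDs and similarity threshold below are illustrative assumptions.

```python
import numpy as np

STATIC_CLASSES = [0, 1, 2, 8]  # illustrative IDs, e.g., road, building, wall, vegetation

def semantic_signature(label_map):
    """Normalized histogram of static-class pixels for one keyframe."""
    hist = np.array([(label_map == c).sum() for c in STATIC_CLASSES], dtype=float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def loop_candidates(query_sig, keyframe_sigs, threshold=0.95):
    """Indices of previous keyframes whose signature resembles the query."""
    return [i for i, sig in enumerate(keyframe_sigs)
            if float(query_sig @ sig) >= threshold]
```

Because the signature is built only from static classes, it tends to be less sensitive to moving objects and moderate viewpoint changes than purely appearance-based descriptors.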
However, challenges persisted in the integration of semantics into VSLAM systems. Semantic segmentation models heavily relied on large, annotated training datasets, which were still limited in many robotics domains. These models also needed to be lightweight and efficient for real-time inference on resource-constrained platforms [102]. Numerous research issues remained unresolved, including optimal network architectures, accelerated model deployment, automated data annotation, online adaptation, and seamless integration with various VSLAM components.
Overall, semantic segmentation unlocked novel perception capabilities that paved the way for next-generation VSLAM systems characterized by intelligence, context awareness, and resilience to environmental dynamics. This emerging field held immense potential and presented numerous open challenges for future exploration. Advanced semantic segmentation techniques would serve as fundamental building blocks for achieving reliable VSLAM in complex real-world scenarios.

5. Conclusions

This study provided a comprehensive overview of integrating semantic segmentation into VSLAM systems. Fundamental technologies for traditional VSLAM systems were initially introduced, and the basic principles of semantic segmentation, with a focus on deep learning-based approaches, were outlined. Subsequently, the applications of semantic segmentation in key components of the VSLAM pipeline were delineated, and representative solutions were compared. Finally, the challenges and future directions in this emerging field were discussed. The key conclusions can be drawn as follows:
(1) Semantic segmentation assisted in distinguishing between static and dynamic elements in complex environments, thereby enhancing VSLAM accuracy and robustness. However, issues such as sensor noise, environmental variations, and algorithmic constraints could still lead to the accumulation of positioning errors over prolonged operation, and semantic segmentation alone provided only partial information about the environment. Further integration of semantic segmentation results with other sensor data, such as inertial measurement units and LiDAR, was therefore identified as a worthwhile research direction for gaining a more comprehensive understanding of the environment and further enhancing the performance of VSLAM systems.
(2) Significant applications included the identification of movable objects for visual odometry, the retrieval of structural features for loop closure detection, and the incorporation of semantic labels for contextual mapping. However, due to the imbalanced distribution of pixels among different categories in real-world settings, semantic segmentation models could exhibit suboptimal performance on minority classes, potentially impacting mapping outcomes.
(3) Promising research directions for the future involved dynamically adjusting the weights of semantic segmentation models based on scene dynamics to better adapt to varying environments. Additionally, considering the time consistency of semantic information could reduce the positioning errors caused by discontinuity. For example, utilizing semantic consistency across frames could help address issues in closed-loop detection. Optimizing semantic segmentation algorithms for real-time VSLAM systems to reduce computational resource consumption emerged as a critical area of interest.
(4) In highly dynamic scenes, the positioning accuracy of semantic VSLAM was significantly improved. However, in environments with untextured areas or repetitive patterns, semantic VSLAM might not function effectively. In these cases, employing target detectors to identify potential moving objects and eliminate their regions could mitigate the impact of dynamic objects on pose estimation.
In summary, semantic segmentation provided essential perception capabilities, paving the way for next-generation intelligent VSLAM systems. With the continuous advancement of deep neural networks and accelerated computing, semantically enriched VSLAM was poised to unleash the full potential of autonomous robots and agents operating in real-world scenarios. The insights offered by this review are intended to guide future research endeavors in this domain.

6. Future Trends and Prospects

The integration of semantic segmentation into VSLAM systems represented an emerging field with numerous open challenges and exciting opportunities. Based on this review, several promising directions for future research are identified:
Data-Driven Approaches: The availability of larger, high-quality labeled datasets tailored to robotics domains could significantly advance semantic segmentation for VSLAM. Techniques including weakly supervised [103], semi-supervised [104], and unsupervised methods [105] like self-training could mitigate annotation requirements. Additionally, synthetic data generation and domain adaptation methods could enhance model generalization.
Lightweight Architectures: The development of compact and efficient model designs was essential to enable real-time semantic segmentation on embedded systems with limited computing resources. Techniques including network pruning [106], knowledge distillation, quantization, and other optimization methods could tailor complex models for efficient deployment.
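As a minimal illustration of the pruning technique mentioned above, the sketch below zeroes out the smallest-magnitude weights of a single layer; the sparsity level is an illustrative assumption, and in practice the pruned network is fine-tuned afterwards to recover accuracy.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights of one layer.

    weights:  weight array of a convolutional or linear layer
    sparsity: fraction of weights to set to zero (0.5 prunes half)
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    # The mask can be reapplied during fine-tuning to keep pruned weights at zero.
    return weights * mask, mask
```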
Online Adaptation: Online learning [107] and domain adaptation algorithms [108] could progressively enhance segmentation models by adapting to changing environments encountered during operation, thereby improving reliability for lifelong VSLAM.
Multimodal Fusion: The fusion of complementary modalities including RGB, depth, thermal, LiDAR, and radar [106] at both input and feature levels could enhance segmentation accuracy, robustness, and consistency, thereby benefiting various components of VSLAM.
Dynamic Reconstruction: The combination of semantics with geometry could enable precise reconstruction and tracking of dynamic objects and interaction hotspots, contributing to safer navigation and smarter decision-making [35].
System Integration: Deeper integration between semantic segmentation modules and other VSLAM components could lead to end-to-end joint optimization and enhanced overall performance [109].
By advancing these research directions, semantically enriched VSLAM systems could unlock smarter interactions, robust long-term autonomy, and reliable performance in unstructured dynamic environments. It was anticipated that semantics would become indispensable to all perception-driven robots and intelligent agents operating in the real world.

Author Contributions

Conceptualization, X.L. (Xiwen Liu) and Y.H.; methodology, X.L. (Xiwen Liu); validation, X.L. (Xiwen Liu) and J.L.; formal analysis, Y.H. and X.L. (Xiaoyu Li); investigation, Y.H., R.Y., H.H. and X.L. (Xiwen Liu); resources, Y.H.; data curation, X.L. (Xiwen Liu) and X.L. (Xiaoyu Li); writing—original draft preparation, X.L. (Xiwen Liu) and Y.H.; writing—review and editing, X.L. (Xiwen Liu), R.Y., H.H. and Y.H.; visualization, R.Y., X.L. (Xiaoyu Li), H.H. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources (KF202207012).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

Author Hui Huang was employed by Chongqing Digital City Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Guo, Y.; Liu, Y.; Georgiou, T.; Lew, M.S. A review of semantic segmentation using deep neural networks. Int. J. Multimed. Inf. Retr. 2018, 7, 87–93. [Google Scholar] [CrossRef]
  2. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: New York, NY, USA, 2018; pp. 1451–1460. [Google Scholar]
  3. Chen, W.; Shang, G.; Ji, A.; Zhou, C.; Wang, X.; Xu, C.; Li, Z.; Hu, K. An overview on visual slam: From tradition to semantic. Remote. Sens. 2022, 14, 3010. [Google Scholar] [CrossRef]
  4. Wang, Y.; Zhang, Y.; Hu, L.; Wang, W.; Ge, G.; Tan, S. A Semantic Topology Graph to Detect Re-Localization and Loop Closure of the Visual Simultaneous Localization and Mapping System in a Dynamic Environment. Sensors 2023, 23, 8445. [Google Scholar] [CrossRef] [PubMed]
  5. Mo, J.; Islam, M.J.; Sattar, J. Fast direct stereo visual SLAM. IEEE Robot. Autom. Lett. 2021, 7, 778–785. [Google Scholar] [CrossRef]
  6. Moreno, F.-A.; Blanco, J.-L.; Gonzalez-Jimenez, J. A constant-time SLAM back-end in the continuum between global mapping and submapping: Application to visual stereo SLAM. Int. J. Robot. Res. 2016, 35, 1036–1056. [Google Scholar] [CrossRef]
  7. Chen, S.; Zhou, B.; Jiang, C.; Xue, W.; Li, Q. A lidar/visual slam backend with loop closure detection and graph optimization. Remote Sens. 2021, 13, 2720. [Google Scholar] [CrossRef]
  8. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  9. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-Scale Direct Monocular SLAM. In European Conference on Computer Vision—ECCV 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 834–849. [Google Scholar]
  10. Ortiz, L.E.; Cabrera, E.V.; Gonçalves, L.M. Depth data error modeling of the ZED 3D vision sensor from stereolabs. ELCVIA Electron. Lett. Comput. Vis. Image Anal. 2018, 17, 0001–15. [Google Scholar] [CrossRef]
  11. Wang, K.; Ma, S.; Chen, J.; Ren, F.; Lu, J. Approaches, challenges, and applications for deep visual odometry: Toward complicated and emerging areas. IEEE Trans. Cogn. Dev. Syst. 2020, 14, 35–49. [Google Scholar] [CrossRef]
  12. Bailey, T.; Nieto, J.; Guivant, J.; Stevens, M.; Nebot, E. Consistency of the EKF-SLAM algorithm. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, 9–15 October 2006; IEEE: New York, NY, USA, 2006; pp. 3562–3568. [Google Scholar]
  13. Zhao, Y.; Vela, P.A. Good feature selection for least squares pose optimization in VO/VSLAM. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: New York, NY, USA, 2018; pp. 1183–1189. [Google Scholar]
  14. Zhao, Y.; Vela, P.A. Good line cutting: Towards accurate pose tracking of line-assisted VO/VSLAM. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 516–531. [Google Scholar]
  15. Nüchter, A.; Lingemann, K.; Hertzberg, J.; Surmann, H. 6D SLAM—3D mapping outdoor environments. J. Field Robot. 2007, 24, 699–722. [Google Scholar] [CrossRef]
  16. Kejriwal, N.; Kumar, S.; Shibata, T. High performance loop closure detection using bag of word pairs. Robot. Auton. Syst. 2016, 77, 55–65. [Google Scholar] [CrossRef]
  17. Shen, X.; Chen, L.; Hu, Z.; Fu, Y.; Qi, H.; Xiang, Y.; Wu, J. A Closed-loop Detection Algorithm for Online Updating of Bag-Of-Words Model. In Proceedings of the 2023 9th International Conference on Computing and Data Engineering, Association for Computing Machinery, Haikou, China, 6–8 January 2023; pp. 34–40. [Google Scholar]
  18. Xi, K.; He, J.; Hao, S.; Luo, L. SLAM Loop Detection Algorithm Based on Improved Bag-of-Words Model. In Proceedings of the 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; IEEE: New York, NY, USA, 2022; pp. 683–689. [Google Scholar]
  19. Xu, L.; Feng, C.; Kamat, V.R.; Menassa, C. An occupancy grid mapping enhanced visual SLAM for real-time locating applications in indoor GPS-denied environments. Autom. Constr. 2019, 104, 230–245. [Google Scholar] [CrossRef]
  20. Blochliger, F.; Fehr, M.; Dymczyk, M.; Schneider, T.; Siegwart, R. Topomap: Topological mapping and navigation based on visual slam maps. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; IEEE: New York, NY, USA, 2018; pp. 3818–3825. [Google Scholar]
  21. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  22. Sünderhauf, N.; Pham, T.T.; Latif, Y.; Milford, M.; Reid, I. Meaningful maps with object-oriented semantic mapping. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: New York, NY, USA, 2017; pp. 5079–5085. [Google Scholar]
  23. Safarova, L.; Abbyasov, B.; Tsoy, T.; Li, H.; Magid, E. Comparison of Monocular ROS-Based Visual SLAM Methods. In Proceedings of the International Conference on Interactive Collaborative Robotics, Fuzhou, China, 16–18 December 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 81–92. [Google Scholar]
  24. Deshmukh-Taskar, P.R.; Nicklas, T.A.; O’Neil, C.E.; Keast, D.R.; Radcliffe, J.D.; Cho, S. The relationship of breakfast skipping and type of breakfast consumption with nutrient intake and weight status in children and adolescents: The National Health and Nutrition Examination Survey 1999–2006. J. Am. Diet. Assoc. 2010, 110, 869–878. [Google Scholar] [CrossRef] [PubMed]
  25. Taketomi, T.; Uchiyama, H.; Ikeda, S. Visual SLAM algorithms: A survey from 2010 to 2016. IPSJ Trans. Comput. Vis. Appl. 2017, 9, 1–11. [Google Scholar]
  26. Ben Ali, A.J.; Kouroshli, M.; Semenova, S.; Hashemifar, Z.S.; Ko, S.Y.; Dantu, K. Edge-SLAM: Edge-assisted visual simultaneous localization and mapping. ACM Trans. Embed. Comput. Syst. 2022, 22, 1–31. [Google Scholar] [CrossRef]
  27. Gao, F.; Moltu, S.B.; Vollan, E.R.; Shen, S.; Ludvigsen, M. Increased Autonomy and Situation Awareness for ROV Operations. In Proceedings of the Global Oceans 2020: Singapore–US Gulf Coast, Virtual, 5–14 October 2020; IEEE: New York, NY, USA, 2020; pp. 1–8. [Google Scholar]
  28. Vincent, J.; Labbé, M.; Lauzon, J.-S.; Grondin, F.; Comtois-Rivet, P.-M.; Michaud, F. Dynamic object tracking and masking for visual SLAM. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, Nevada, USA, 25–29 October 2020; IEEE: New York, NY, USA, 2020; pp. 4974–4979. [Google Scholar]
  29. Zhu, J.; Huang, H.; Li, B.; Wang, L. E-CRF: Embedded Conditional Random Field for Boundary-caused Class Weights Confusion in Semantic Segmentation. arXiv 2021, arXiv:2112.07106. [Google Scholar]
  30. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927. [Google Scholar]
  31. Sun, C.-Z.; Zhang, B.; Wang, J.-K.; Zhang, C.-S. A review of visual SLAM based on unmanned systems. In Proceedings of the 2021 2nd International Conference on Artificial Intelligence and Education (ICAIE), Dali, China, 18–20 June 2021; IEEE: New York, NY, USA, 2021; pp. 226–234. [Google Scholar]
  32. Chang, J.; Dong, N.; Li, D.; Qin, M. Triplet loss based metric learning for closed loop detection in VSLAM system. Expert Syst. Appl. 2021, 185, 115646. [Google Scholar] [CrossRef]
  33. Wang, Z.; Peng, Z.; Guan, Y.; Wu, L. Manifold regularization graph structure auto-encoder to detect loop closure for visual SLAM. IEEE Access 2019, 7, 59524–59538. [Google Scholar] [CrossRef]
  34. Saputra, M.R.U.; Markham, A.; Trigoni, N. Visual SLAM and structure from motion in dynamic environments: A survey. ACM Comput. Surv. 2018, 51, 1–36. [Google Scholar] [CrossRef]
  35. Wen, S.; Li, P.; Zhao, Y.; Zhang, H.; Sun, F.; Wang, Z. Semantic visual SLAM in dynamic environment. Auton. Robot. 2021, 45, 493–504. [Google Scholar] [CrossRef]
  36. Mingachev, E.; Lavrenov, R.; Tsoy, T.; Matsuno, F.; Svinin, M.; Suthakorn, J.; Magid, E. Comparison of ros-based monocular visual slam methods: Dso, ldso, orb-slam2 and dynaslam. In Proceedings of the International Conference on Interactive Collaborative Robotics, St. Petersburg, Russia, 7–9 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 222–233. [Google Scholar]
  37. Yanik, Ö.F.; Ilgin, H.A. A comprehensive computational cost analysis for state-of-the-art visual SLAM methods for autonomous mapping. Commun. Fac. Sci. Univ. Ank. Ser. A2-A3 Phys. Sci. Eng. 2023, 65, 1–15. [Google Scholar]
  38. Chua, L.O.; Roska, T. The CNN paradigm. IEEE Trans. Circuits Syst. I Fundam. Theory Appl. 1993, 40, 147–156. [Google Scholar] [CrossRef]
  39. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  40. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
  41. Karim, F.; Majumdar, S.; Darabi, H.; Harford, S. Multivariate LSTM-FCNs for time series classification. Neural Netw. 2019, 116, 237–245. [Google Scholar] [CrossRef] [PubMed]
  42. Li, X.; Chen, H.; Qi, X.; Dou, Q.; Fu, C.-W.; Heng, P.-A. H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Trans. Med. Imaging 2018, 37, 2663–2674. [Google Scholar] [CrossRef] [PubMed]
  43. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  44. Lin, G.; Milan, A.; Shen, C.; Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934. [Google Scholar]
  45. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  46. Baheti, B.; Innani, S.; Gajre, S.; Talbar, S. Eff-unet: A novel architecture for semantic segmentation in unstructured environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 358–359. [Google Scholar]
  47. Kazerouni, I.A.; Dooly, G.; Toal, D. Ghost-UNet: An asymmetric encoder-decoder architecture for semantic segmentation from scratch. IEEE Access 2021, 9, 97457–97465. [Google Scholar] [CrossRef]
  48. Liu, R.; Tao, F.; Liu, X.; Na, J.; Leng, H.; Wu, J.; Zhou, T. RAANet: A residual ASPP with attention framework for semantic segmentation of high-resolution remote sensing images. Remote Sens. 2022, 14, 3109. [Google Scholar] [CrossRef]
  49. Lafferty, J.; McCallum, A.; Pereira, F.C. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williamstown, MA, USA, 28 June–1 July 2001. [Google Scholar]
  50. Combes, J.-M.; Grossmann, A.; Tchamitchian, P. Wavelets: Time-Frequency Methods and Phase Space. In Proceedings of the International Conference, Marseille, France, 14–18 December 1987; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  51. Deng, Z.; Wang, B.; Xu, Y.; Xu, T.; Liu, C.; Zhu, Z. Multi-scale convolutional neural network with time-cognition for multi-step short-term load forecasting. IEEE Access 2019, 7, 88058–88071. [Google Scholar] [CrossRef]
  52. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  53. Zheng, S.; Lin, X.; Zhang, W.; He, B.; Jia, S.; Wang, P.; Jiang, H.; Shi, J.; Jia, F. MDCC-Net: Multiscale double-channel convolution U-Net framework for colorectal tumor segmentation. Comput. Biol. Med. 2021, 130, 104183. [Google Scholar] [CrossRef] [PubMed]
  54. Gangopadhyay, S.; Zhai, A. CGBNet: A Deep Learning Framework for Compost Classification. IEEE Access 2022, 10, 90068–90078. [Google Scholar] [CrossRef]
  55. Mo, Y.; Wu, Y.; Yang, X.; Liu, F.; Liao, Y. Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing 2022, 493, 626–646. [Google Scholar] [CrossRef]
  56. Lee, M.; Kim, D.; Shim, H. Threshold matters in wsss: Manipulating the activation for the robust and accurate segmentation model against thresholds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4330–4339. [Google Scholar]
  57. Liu, S.-A.; Zhang, Y.; Qiu, Z.; Xie, H.; Zhang, Y.; Yao, T. Learning orthogonal prototypes for generalized few-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11319–11328. [Google Scholar]
  58. Hoyer, L.; Dai, D.; Van Gool, L. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9924–9935. [Google Scholar]
  59. Hou, Y.; Zhu, X.; Ma, Y.; Loy, C.C.; Li, Y. Point-to-voxel knowledge distillation for lidar semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8479–8488. [Google Scholar]
  60. Zhu, Y.; Zhang, Z.; Wu, C.; Zhang, Z.; He, T.; Zhang, H.; Manmatha, R.; Li, M.; Smola, A. Improving semantic segmentation via efficient self-training. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 46, 1589–1602. [Google Scholar] [CrossRef] [PubMed]
  61. Targ, S.; Almeida, D.; Lyman, K. Resnet in resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar]
  62. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  63. Huang, Y.; Tang, Z.; Chen, D.; Su, K.; Chen, C. Batching soft IoU for training semantic segmentation networks. IEEE Signal Process. Lett. 2019, 27, 66–70. [Google Scholar] [CrossRef]
  64. Yan, S.; Zhou, J.; Xie, J.; Zhang, S.; He, X. An EM framework for online incremental learning of semantic segmentation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 3052–3060. [Google Scholar]
  65. Luo, Y.; Wang, Z.; Huang, Z.; Yang, Y.; Zhao, C. Coarse-to-fine annotation enrichment for semantic segmentation learning. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018; pp. 237–246. [Google Scholar]
  66. Kenjic, D.; Baba, F.; Samardzija, D.; Kaprocki, Z. Utilization of the open source datasets for semantic segmentation in automotive vision. In Proceedings of the 2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin), Berlin, Germany, 8–11 September 2019; IEEE: New York, NY, USA, 2019; pp. 420–423. [Google Scholar]
  67. Jadon, S. A survey of loss functions for semantic segmentation. In Proceedings of the 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Via del Mar, Chile, 27–29 October 2020; IEEE: New York, NY, USA, 2020; pp. 1–7. [Google Scholar]
  68. Ke, T.-W.; Hwang, J.-J.; Liu, Z.; Yu, S.X. Adaptive affinity fields for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 587–602. [Google Scholar]
  69. Jiang, W.; Xie, Z.; Li, Y.; Liu, C.; Lu, H. Lrnnet: A light-weighted network with efficient reduced non-local operation for real-time semantic segmentation. In Proceedings of the 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), London, UK, 6–10 July 2020; IEEE: New York, NY, USA, 2020; pp. 1–6. [Google Scholar]
  70. Li, Z.; Sun, Y.; Zhang, L.; Tang, J. CTNet: Context-based tandem network for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9904–9917. [Google Scholar] [CrossRef] [PubMed]
  71. Zhao, Z.; Mao, Y.; Ding, Y.; Ren, P.; Zheng, N. Visual-based semantic SLAM with landmarks for large-scale outdoor environment. In Proceedings of the 2019 2nd China Symposium on Cognitive Computing and Hybrid Intelligence (CCHI), Xi’an, China, 21–22 September 2019; IEEE: New York, NY, USA, 2019; pp. 149–154. [Google Scholar]
  72. Qiao, S.; Wang, H.; Liu, C.; Shen, W.; Yuille, A. Micro-batch training with batch-channel normalization and weight standardization. arXiv 2019, arXiv:1903.10520. [Google Scholar]
  73. Yuan, J.; Liu, Y.; Shen, C.; Wang, Z.; Li, H. A simple baseline for semi-supervised semantic segmentation with strong data augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 8229–8238. [Google Scholar]
  74. Holder, C.J.; Shafique, M.J. On efficient real-time semantic segmentation: A survey. arXiv 2022, arXiv:2206.08605. [Google Scholar]
  75. Mukherjee, A.; Chakraborty, S.; Saha, S. Detection of loop closure in SLAM: A DeconvNet based approach. Appl. Soft Comput. 2019, 80, 650–656. [Google Scholar] [CrossRef]
  76. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  77. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  78. Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272. [Google Scholar] [CrossRef]
  79. Siddique, N.; Paheding, S.; Elkin, C.P.; Devabhaktuni, V. U-net and its variants for medical image segmentation: A review of theory and applications. IEEE Access 2021, 9, 82031–82057. [Google Scholar] [CrossRef]
  80. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  81. Mollica, G.; Legittimo, M.; Dionigi, A.; Costante, G.; Valigi, P. Integrating Sparse Learning-Based Feature Detectors into Simultaneous Localization and Mapping—A Benchmark Study. Sensors 2023, 23, 2286. [Google Scholar] [CrossRef] [PubMed]
  82. Esparza, D.; Flores, G. The STDyn-SLAM: A stereo vision and semantic segmentation approach for VSLAM in dynamic outdoor environments. IEEE Access 2022, 10, 18201–18209. [Google Scholar] [CrossRef]
  83. Zhao, Y.; Vela, P.A. Good feature matching: Toward accurate, robust vo/vslam with low latency. IEEE Trans. Robot. 2020, 36, 657–675. [Google Scholar] [CrossRef]
  84. Runz, M.; Buffier, M.; Agapito, L. Maskfusion: Real-time recognition, tracking and reconstruction of multiple moving objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 16–20 October 2018; IEEE: New York, NY, USA, 2018; pp. 10–20. [Google Scholar]
  85. Zhang, J.; Henein, M.; Mahony, R.; Ila, V. VDO-SLAM: A visual dynamic object-aware SLAM system. arXiv 2020, arXiv:2005.11052. [Google Scholar]
  86. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  87. Karkus, P.; Cai, S.; Hsu, D. Differentiable slam-net: Learning particle slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 2815–2825. [Google Scholar]
  88. Cai, Y.; Ou, Y.; Qin, T. Improving SLAM techniques with integrated multi-sensor fusion for 3D reconstruction. Sensors 2024, 24, 2033. [Google Scholar] [CrossRef] [PubMed]
  89. Hou, J.; Yu, L.; Li, C.; Fei, S. Handheld 3D reconstruction based on closed-loop detection and nonlinear optimization. Meas. Sci. Technol. 2019, 31, 025401. [Google Scholar] [CrossRef]
  90. Lomas-Barrie, V.; Suarez-Espinoza, M.; Hernandez-Chavez, G.; Neme, A. A New Method for Classifying Scenes for Simultaneous Localization and Mapping Using the Boundary Object Function Descriptor on RGB-D Points. Sensors 2023, 23, 8836. [Google Scholar] [CrossRef] [PubMed]
  91. Yang, K.; Wang, K.; Bergasa, L.M.; Romera, E.; Hu, W.; Sun, D.; Sun, J.; Cheng, R.; Chen, T.; López, E. Unifying terrain awareness for the visually impaired through real-time semantic segmentation. Sensors 2018, 18, 1506. [Google Scholar] [CrossRef] [PubMed]
  92. Lin, H.-Y.; Liu, T.-A.; Lin, W.-Y. InertialNet: Inertial Measurement Learning for Simultaneous Localization and Mapping. Sensors 2023, 23, 9812. [Google Scholar] [CrossRef]
  93. Dubé, R.; Cramariuc, A.; Dugas, D.; Nieto, J.; Siegwart, R.; Cadena, C. SegMap: 3d segment mapping using data-driven descriptors. arXiv 2018, arXiv:1804.09557. [Google Scholar]
  94. Lv, K.; Zhang, Y.; Yu, Y.; Wang, Z.; Min, J. SIIS-SLAM: A vision SLAM based on sequential image instance segmentation. IEEE Access 2022, 11, 17430–17440. [Google Scholar] [CrossRef]
  95. Yu, S.; Fu, C.; Gostar, A.K.; Hu, M. A review on map-merging methods for typical map types in multiple-ground-robot SLAM solutions. Sensors 2020, 20, 6988. [Google Scholar] [CrossRef] [PubMed]
  96. Zhang, Q.; Yu, W.; Liu, W.; Xu, H.; He, Y. A Lightweight Visual Simultaneous Localization and Mapping Method with a High Precision in Dynamic Scenes. Sensors 2023, 23, 9274. [Google Scholar] [CrossRef] [PubMed]
  97. Lee, Y.; Kim, M.; Ahn, J.; Park, J. Accurate Visual Simultaneous Localization and Mapping (SLAM) against Around View Monitor (AVM) Distortion Error Using Weighted Generalized Iterative Closest Point (GICP). Sensors 2023, 23, 7947. [Google Scholar] [CrossRef] [PubMed]
  98. McCormac, J.; Handa, A.; Davison, A.; Leutenegger, S. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: New York, NY, USA, 2017; pp. 4628–4635. [Google Scholar]
  99. Narita, G.; Seno, T.; Ishikawa, T.; Kaji, Y. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: New York, NY, USA, 2019; pp. 4205–4212. [Google Scholar]
  100. Li, C.; Kang, Z.; Yang, J.; Li, F.; Wang, Y. Research on semantic-assisted SLAM in complex dynamic indoor environment. ISPRS Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2020, 43, 353–359. [Google Scholar] [CrossRef]
  101. Lai, T. A Review on Visual-SLAM: Advancements from Geometric Modelling to Learning-Based Semantic Scene Understanding Using Multi-Modal Sensor Fusion. Sensors 2022, 22, 7265. [Google Scholar] [CrossRef] [PubMed]
  102. Liu, Y.; Huang, K.; Li, J.; Li, X.; Zeng, Z.; Chang, L.; Zhou, J. AdaSG: A Lightweight Feature Point Matching Method Using Adaptive Descriptor with GNN for VSLAM. Sensors 2022, 22, 5992. [Google Scholar] [CrossRef]
  103. Yan, Y.; Hang, Y.; Hu, T.; Yu, H.; Lai, F. Visual SLAM in Long-Range Autonomous Parking Application Based on Instance-Aware Semantic Segmentation via Multi-Task Network Cascades and Metric Learning Scheme. SAE Int. J. Adv. Curr. Pract. Mobil. 2021, 3, 1357–1368. [Google Scholar] [CrossRef]
  104. Zarringhalam, A.; Ghidary, S.S.; Khorasani, A.M. Semi-supervised Vector-Quantization in Visual SLAM using HGCN. Int. J. Intell. Syst. 2024, 2024, 9992159. [Google Scholar] [CrossRef]
  105. Shen, T.; Luo, Z.; Zhou, L.; Deng, H.; Zhang, R.; Fang, T.; Quan, L. Beyond photometric loss for self-supervised ego-motion estimation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: New York, NY, USA, 2019; pp. 6359–6365. [Google Scholar]
  106. Liu, R.; Zhang, J.; Chen, S.; Yang, T.; Arth, C. Real-time visual SLAM combining building models and GPS for mobile robot. J. Real-Time Image Process. 2021, 18, 419–429. [Google Scholar] [CrossRef]
  107. Xu, S.; Xiong, H.; Wu, Q.; Yao, T.; Wang, Z.; Wang, Z. Online Visual SLAM Adaptation against Catastrophic Forgetting with Cycle-Consistent Contrastive Learning. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 6196–6202. [Google Scholar]
  108. Loo, S.Y.; Shakeri, M.; Tang, S.H.; Mashohor, S.; Zhang, H. Online mutual adaptation of deep depth prediction and visual slam. arXiv 2021, arXiv:2111.0409. [Google Scholar]
  109. Vargas, E.; Scona, R.; Willners, J.S.; Luczynski, T.; Cao, Y.; Wang, S.; Petillot, Y.R. Robust underwater visual SLAM fusing acoustic sensing. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xian, China, 30 May–5 June 2021; IEEE: New York, NY, USA, 2021; pp. 2140–2146. [Google Scholar]
Figure 1. General discussion framework.
Figure 2. A basic framework for simultaneous localization and mapping.
Figure 3. Comparison of energy consumption between classic simultaneous localization and mapping frameworks [37].
Figure 4. Optimization process of visual odometry based on semantic segmentation.
Figure 5. The model succeeded in segmenting the fruit [84].
Figure 6. VDO-SLAM can track vehicles and estimate their speeds [85].
Figure 7. Semantic segmentation to improve the loop closure detection process.
Figure 8. Impact of segment growth on descriptor relevance for localization [93].
Figure 9. Testing on a dataset using a low-cost ZED camera.
Figure 10. Semantic segmentation model improving the principle of environment mapping.
Figure 11. An example of an AR application using a 3D panoptic map generated by the Panoptic Fusion system [99].
Figure 12. A comparison between the trajectory autonomously driven by MapLite (blue) and the path estimated by odometry (green). Note: the shaded area is the ground-truth road surface [75].
Figure 13. The mechanism of the semantic segmentation model in VSLAM.
Table 1. The advantages and disadvantages of some main regularization methods at present.

Method | Standard Discussion
Volumetric | Volume integrity: consider the volume information in the map, including the volume of a building, room, or other object; this can be assessed by comparing the volume in the map to the volume in the actual scene. Volume consistency: check that the volume of different areas in the map is consistent; if the volume in the map changes too much, it may indicate that the map is incomplete.
Surface | Surface integrity: focus on surface information in the map, including walls, floors, and ceilings; a complete map should accurately capture the geometry of these surfaces. Surface consistency: check that surfaces in different areas of the map are consistent; if the surface shape in the map is inconsistent, it may indicate that there is a problem with the map.
Semantic classifications | Semantic integrity: consider the semantic information in the map, including the categories of objects (e.g., chairs, tables, doors); a complete map should be able to mark these objects correctly. Semantic consistency: check whether semantic labels in different areas of the map are consistent; if the semantic labels in the map are inconsistent, it may indicate that the map is incomplete or contains errors.
Table 2. The main semantic segmentation models and their applications and characteristics.

Model | Features | Application Scenarios
U-Net [79] | Simple, efficient, easy to build. | U-Net can classify image pixels into different categories, including lanes, stop lines, speed bumps, and obstacles.
Mask R-CNN [76] | Powerful image-based instance-level segmentation algorithm. | Mask R-CNN can segment instances of different semantic objects at the pixel level, which is suitable for dynamic environments.
Pyramid Scene Parsing Network [45] | Considers the context relationship matching problem, showing a good segmentation effect. | PSPNet performs well in complex environments and can extract semantic information efficiently.
Fully Convolutional Networks [80] | The traditional convolutional neural network is transformed into a fully convolutional structure for pixel-level semantic segmentation. | FCN is widely used in semantic segmentation tasks and can effectively extract semantic information from images.
ERFNet [78] | Real-time segmentation with low computational cost while maintaining high accuracy. | ERFNet is particularly suitable for scenarios that require real-time performance, including autonomous driving and lane detection.
Table 3. Some common VSLAM usage scenarios and features.

Model | Semantic Segmentation Model | Features | Application Scenario
MaskFusion [84] | Mask R-CNN | Object-level RGB-D SLAM system for dynamic environments; runs in real time and can track multiple moving objects while performing dense reconstruction. | Autonomous driving, online positioning at the vehicle end.
VDO-SLAM [85] | FCN | Emphasizes dynamic object perception without requiring an a priori model of the object; motion estimation of rigid objects is realized using semantic information. | Deployment in real-world applications involving highly dynamic and unstructured environments.
ORB-SLAM3 [86] | Mask R-CNN | Real-time calculation of camera position and generation of sparse 3D reconstructed maps. | Mobile robots, mobile phones, drones.
SegMap [93] | Mask R-CNN | A purely static semantic octree map is constructed using semantic information. | Construction, navigation.
Semantic Fusion SLAM [98] | ResNet | Real-time operation: the system typically needs to operate in a real-time environment and therefore requires high frame rates and low latency. | High-precision map construction for autonomous vehicles.
Table 4. Comparison of the ATE [m] RMSE.

Sequence | ORB-SLAM | DynaSLAM | SLAM-Net | Semantic-Assisted SLAM | ORB-SLAM3 | SIIS-SLAM
KITTI 01 | 5.19 | 7.44 | 5.06 | 5.12 | 4.76 | 7.23
KITTI 02 | 23.45 | 26.53 | 22.36 | 22.63 | 19.68 | 23.36
KITTI 03 | 1.49 | 1.79 | 1.43 | 1.53 | 1.55 | 1.78
KITTI 04 | 1.58 | 0.99 | 1.56 | 1.49 | 1.45 | 0.93
KITTI 05 | 4.79 | 4.53 | 4.66 | 4.23 | 3.99 | 4.55
KITTI 06 | 13.01 | 14.79 | 12.36 | 11.36 | 10.77 | 12.88
KITTI 07 | 2.30 | 2.26 | 2.12 | 2.19 | 2.08 | 2.26
KITTI 08 | 47.69 | 41.23 | 46.23 | 45.26 | 43.16 | 39.31
KITTI 09 | 6.53 | 3.22 | 5.99 | 5.36 | 4.23 | 2.96