Article

DyGS-SLAM: Realistic Map Reconstruction in Dynamic Scenes Based on Double-Constrained Visual SLAM

1 Department of Automation, University of Science and Technology of China, Hefei 230031, China
2 Department of Mathematics, University of Science and Technology of China, Hefei 230026, China
3 Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
4 State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing 100044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(4), 625; https://doi.org/10.3390/rs17040625
Submission received: 4 January 2025 / Revised: 9 February 2025 / Accepted: 10 February 2025 / Published: 12 February 2025
(This article belongs to the Special Issue 3D Scene Reconstruction, Modeling and Analysis Using Remote Sensing)

Abstract: Visual SLAM is widely applied in robotics and remote sensing. The fusion of Gaussian radiance fields and Visual SLAM has demonstrated astonishing efficacy in constructing high-quality dense maps. While existing methods perform well in static scenes, they are prone to the influence of dynamic objects in real-world dynamic environments, thus making robust tracking and mapping challenging. We introduce DyGS-SLAM, a Visual SLAM system that employs dual constraints to achieve high-fidelity static map reconstruction in dynamic environments. We extract ORB features within the scene, and use open-world semantic segmentation models and multi-view geometry to construct dual constraints, forming a zero-shot dynamic information elimination module while recovering backgrounds occluded by dynamic objects. Furthermore, we select high-quality keyframes and use them for loop closure detection and global optimization, constructing a foundational Gaussian map through a set of determined point clouds and poses and integrating repaired frames for rendering new viewpoints and optimizing 3D scenes. Experimental results on the TUM RGB-D, Bonn, and Replica datasets, as well as real scenes, demonstrate that our method has excellent localization accuracy and mapping quality in dynamic scenes.

1. Introduction

Simultaneous Localization and Mapping (SLAM) is a technique in which an intelligent agent uses sensor data to determine its own pose and build environmental maps. It plays a critical role in various applications including remote sensing, autonomous driving, and robotics [1,2,3,4,5]. With the continuous advancement of computer vision technology, the low cost and rich information of visual sensor hardware have facilitated the substantial development of Visual SLAM technology using cameras as the primary sensors.
Traditional Visual SLAM systems [6,7,8] have made remarkable progress in tracking and sparse map reconstruction. However, this low-resolution and discontinuous surface reconstruction with holes is insufficient to meet the demands of current Augmented Reality (AR), autonomous driving simulation, and other scenarios [9,10,11]. The emergence of high-quality image synthesis based on neural radiance fields (NeRFs) [12,13,14] has introduced new perspectives for map construction in Visual SLAM [15,16,17,18]. These NeRF-based SLAM systems utilize global maps and image reconstruction loss functions, capturing dense depth information through differentiable rendering, thereby achieving high-fidelity dense maps. However, scenes reconstructed using multi-layer perceptrons (MLPs), an implicit neural representation, are often challenging to edit, and there are concerns regarding low computational efficiency and catastrophic forgetting [19].
In this context, our exploration of map representation turns to 3D Gaussian radiance fields [20] as an explicit representation. Gaussian Splatting inherits the advantages of NeRF in generating high-quality images and scenes while achieving a faster rendering speed and allowing gradients to flow directly to each parameter, which is more favorable for optimization than NeRF's indirect gradient flow. The 3D Gaussian representation can generate explicit maps with clear spatial extents, providing more possibilities for downstream applications such as mixed reality and autonomous driving [21,22]. Recent studies [23,24,25,26] have also demonstrated the possibility of integrating 3D Gaussians with SLAM systems, achieving good dense map reconstruction in static scenes by building 3D Gaussian maps. Traditional dynamic visual SLAM methods can provide accurate pose estimation for the construction of dense Gaussian maps. However, they either leave the input images unprocessed, so dynamic objects remain in them, or segment the dynamic objects out, leaving large blank regions; neither kind of image is acceptable as input for Gaussian map optimization.
To address these challenges, we propose DyGS-SLAM, which resolves the issues of “traditional dynamic SLAM’s difficulty in achieving high-quality dense reconstruction” and “3D Gaussian-based SLAM’s difficulty in handling dynamic scenes” [19,27]. It fully filters dynamic feature points in the scene by combining semantic segmentation with multi-view geometry. Subsequently, we simultaneously conduct inpainting of the static background and camera pose tracking based on the ORB-SLAM3 [7] visual odometry. Finally, we utilize the inpainted static background images and a set of determined keyframe poses to construct a dense Gaussian map and optimize the 3D scene. This approach achieves more robust tracking in dynamic scenes while obtaining a high-quality static map. In summary, the main contributions of our work can be summarized as follows:
  • We introduce DyGS-SLAM, a double-constrained dynamic Visual SLAM system based on 3D Gaussians, capable of accurately predicting camera poses, reconstructing high-quality dense maps, and synthesizing novel view images in dynamic scenes.
  • A zero-shot dynamic feature point removal method is proposed. We enhance the Segment Anything Model (SAM) and combine it with multi-view geometry to remove dynamic feature points, giving our method a wider range of applications than other methods. A background inpainting module is also provided to repair static structures occluded by dynamic objects.
  • A novel 3D Gaussian-based scene representation and map optimization method is proposed. Underlying Gaussian map construction and differentiable rendering are performed on the resulting camera pose and point cloud, while the 3D scene is optimized based on the inpainted frames. This method addresses the interference of dynamic information in Gaussian map optimization, enabling the synthesis of novel view images and the generation of more realistic 3D scenes.
  • We conducted extensive experiments on multiple datasets, demonstrating that our method achieves state-of-the-art (SOTA) performance in pose estimation and mapping.

2. Related Works

2.1. Traditional Dynamic Visual SLAM

For most VSLAM systems [6,7,8,28], eliminating the influence of dynamic objects, which are prevalent in the real world, is of great significance for localization and mapping [29]. Early research [6,7,30] primarily concentrated on using probabilistic, geometric, and optical flow methods to remove dynamic feature points, but often struggled to accurately segment subtly moving targets and to handle potentially moving objects. In recent years, with the advancement of deep learning technology [31,32,33], object detection and semantic segmentation methods have been applied in dynamic visual SLAM systems. DynaSLAM [34] combines semantic and geometric methods to recognize dynamic objects and inpaints images by projecting previous keyframes. DP-SLAM and RS-SLAM [35,36] employ probabilistic techniques to refine semantic information, obtaining precise dynamic masks. OVD-SLAM [37] combines optical flow, semantic information, and depth data to differentiate foreground from background. However, traditional dynamic visual SLAM methods face two significant challenges. First, they struggle to overcome the limitations of discrete surface representations, such as point clouds and voxels, which hinders the accurate reconstruction of fine scene details. Second, traditional methods frequently depend on pre-trained object detection and semantic segmentation models, making them inadequate for handling unseen categories in complex dynamic environments. Our approach employs a zero-shot segmentation model to extract semantic information and integrates it with geometric information to construct dual constraints, thus achieving broader applicability in open-world scenarios.

2.2. Dynamic Visual SLAM with Inpainting

In dynamic visual SLAM, segmentation-based methods often leave holes in the image after removing dynamic objects. Some researchers therefore integrate inpainting techniques with dynamic visual SLAM to fill these gaps, ensuring map completeness and consistency. There are two main approaches to background recovery. B. Bescos [34], A. Li [35], and J. Chang [38] project previous keyframes into the dynamic regions of the current frame; however, for regions that have never been observed before, this projection-based approach cannot fill the holes. Meanwhile, Refs. [39,40] employ learning-based approaches to predict and restore static backgrounds, though they may suffer from runtime drawbacks.

2.3. Neural Radiance Field-Based SLAM

NeRF [12,13,14] has demonstrated capabilities in continuous surface modeling, hole filling, and scene repairing in 3D scene representation, offering new insights into dense map construction in SLAM. iMAP [15] initially employs neural implicit representations for tracking and mapping, providing memory-efficient map representations using a single MLP. NICE-SLAM [16] and Vox-Fusion [18] enhance map scalability using layered multi-feature grids and expanded octrees, respectively. Other studies [17,41] have introduced more efficient scene representations, offering three-dimensional maps suitable for robots. In dynamic scene reconstruction, NID-SLAM [42] utilizes depth-guided semantic segmentation to extract dynamic information and reconstructs dense static scenes using multi-resolution geometric feature grids. DN-SLAM [19], based on ORB-SLAM3, achieves robust pose tracking and dense reconstruction in dynamic scenes by extracting dynamic objects through coarse and fine segmentation, combined with instant-NGP [13]. DDN-SLAM [43] employs semantic information and conditional probabilities for dynamic segmentation, combined with multi-resolution hash encoding, to construct high-quality static maps. However, these NeRF-based methods are still constrained by the computational speed of training and inference processes and implicit neural network representations.

2.4. Gaussian Radiance Field-Based SLAM

B. Kerbl et al. [20] utilized 3D Gaussians, an explicit representation method, along with fast differentiable rendering for modeling three-dimensional scenes. This approach not only inherits NeRF’s capability for new view synthesis but also significantly enhances rendering speed and scene interpretability. As a result, it has seen rapid applications in various domains such as virtual driving simulation and urban reconstruction [22,44].
The advantages of 3D Gaussian Splatting make it a promising research direction when combined with SLAM. Recent studies [23,24,25,26] have demonstrated the potential of this paradigm. GS-SLAM [24] utilizes 3D Gaussian distributions, opacity, and spherical harmonics to encapsulate scene geometry and appearance, achieving improvements in map optimization and rendering speed. SplaTAM [26] employs Gaussians as the underlying representation to achieve map scalability while introducing a silhouette-guided optimization method, resulting in high-quality RGB and depth map rendering. Photo-SLAM [25] utilizes ORB features for localization and introduces a Gaussian-Pyramid-based training method to achieve realistic view synthesis. MonoGS [23] adopts 3D Gaussians as the sole representation and introduces geometric validation and regularization to address blurriness in dense 3D reconstruction, making it capable of reconstructing transparent or small objects. However, these SLAM approaches based on 3D Gaussian Splatting do not handle dynamic environments well, as dynamic objects significantly affect tracking accuracy and reconstruction quality. Our DyGS-SLAM overcomes these issues and achieves accurate camera pose estimation and dense, high-quality map reconstruction in dynamic scenes.

3. Method

Our proposed DyGS-SLAM is a dynamic visual SLAM method based on 3D Gaussian Splatting. Figure 1 illustrates the overall framework of DyGS-SLAM. Benefiting from ORB-SLAM3’s visual odometry (VO), our tracking module supports mono, stereo, and RGB-D inputs, delivering the camera’s initial pose. First, in the tracking thread, we combine semantic information and geometric information using a specialized method to filter out dynamic feature points. Next, we simultaneously perform tracking and background inpainting on the current frame, selecting suitable frames as keyframes. Finally, based on a set of determined keyframe poses and point clouds, the mapping thread constructs a dense Gaussian map and performs differentiable rendering and Gaussian map optimization.

3.1. Dynamic Object Elimination with Double-Constrained

3.1.1. Open-World Semantic Segmentation

Many dynamic SLAM methods based on semantic segmentation [34,45] are trained only for a few fixed objects (such as humans, cars, animals), which poses significant limitations. Our aim is to provide a system for dynamic object detection and segmentation that can be easily adapted to various dynamic scenes without the need for retraining. The Segment Anything Model [46] serves as a powerful foundational model for interactive object segmentation, demonstrating zero-shot generalization capability. However, SAM encounters two challenges in SLAM applications: firstly, SAM’s processing speed is slow, resulting in poor real-time performance. Secondly, SAM can only perform complete segmentation without semantic information or work through interactive points and boxes, which is unsuitable for SLAM systems. To tackle the first challenge, we opted for SAM’s accelerated version, EfficientSAM [47], as the core of our segmentation module. Addressing the second challenge, we introduced an additional real-time open-vocabulary detection module, YOLO-World [48]. This module annotates detection boxes in the image, which SAM then uses as prompts for dynamic object segmentation. The specific workflow is illustrated in Figure 2.
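As an illustration, the detection-prompted segmentation described above can be organized as in the following sketch. The `detector` and `segmenter` objects and their `detect`/`segment` methods are hypothetical stand-ins for YOLO-World and EfficientSAM wrappers, and the default vocabulary is only an example; none of these names come from the actual libraries.

```python
import numpy as np

class DynamicObjectSegmenter:
    """Detection-prompted zero-shot segmentation of a-priori dynamic objects (sketch)."""

    def __init__(self, detector, segmenter, dynamic_vocab=("person", "animal", "vehicle")):
        self.detector = detector        # open-vocabulary detector (YOLO-World stand-in)
        self.segmenter = segmenter      # box-promptable segmenter (EfficientSAM stand-in)
        self.dynamic_vocab = list(dynamic_vocab)

    def __call__(self, rgb):
        """Return a boolean mask (H, W) covering detected a-priori dynamic objects."""
        # 1. Open-vocabulary detection: boxes for the user-defined dynamic vocabulary.
        boxes = self.detector.detect(rgb, classes=self.dynamic_vocab)  # [[x1, y1, x2, y2], ...]
        # 2. Each detection box is used as a prompt for segmentation.
        mask = np.zeros(rgb.shape[:2], dtype=bool)
        for box in boxes:
            mask |= self.segmenter.segment(rgb, box_prompt=box)
        return mask
```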

3.1.2. Multiple View Geometry Constraint

Through the use of a semantic segmentation module, we can relatively easily segment most dynamic objects, ensuring that feature points in these areas are not used for tracking and mapping. However, many objects that are semantically static but actually in motion are not detected by this method. For instance, in Figure 3, the person on the left side of the image is seen dragging a chair with their hand. This section will elaborate on the solution to this issue.
The multi-view geometry approach, as shown in Figure 4, involves selecting an arbitrary feature point p from a designated keyframe, whose corresponding point P in 3D space is known. As the camera pose changes, the projection of P onto the current frame is computed as p'. If the disparity angle α between the back-projection rays of p and p' exceeds a threshold τ_α, the point is deemed occluded and is therefore disregarded rather than treated as a static point. For the remaining feature points, we then compute the depth reprojection error Δd = d_proj − d, where d_proj is the depth of P reprojected into the current frame and d is the depth measured at p' in the current frame. If Δd surpasses a threshold τ_d, the feature point p' is classified as a dynamic point. As recommended in [34,49], for each current frame we select the five keyframes with the highest overlap as reference frames, where overlap is computed from the relative translation and rotation between the current frame and each keyframe.
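Below is a minimal sketch of this double geometric test, assuming camera-to-world poses given as 4 × 4 matrices, a pinhole intrinsic matrix K, and a metric depth image for the current frame; the threshold values are illustrative placeholders rather than the ones used in our implementation.

```python
import numpy as np

def classify_point(P_w, kf_pose, cur_pose, depth_cur, K,
                   tau_alpha=np.deg2rad(30.0), tau_d=0.1):
    """Classify a keyframe feature point as 'occluded', 'dynamic', or 'static'.

    P_w       : 3D point in world coordinates, shape (3,)
    kf_pose   : 4x4 camera-to-world pose of the keyframe
    cur_pose  : 4x4 camera-to-world pose of the current frame
    depth_cur : depth image of the current frame (H, W), in meters
    K         : 3x3 pinhole intrinsic matrix
    """
    O_kf, O_cur = kf_pose[:3, 3], cur_pose[:3, 3]        # optical centers O and O'
    v1, v2 = P_w - O_kf, P_w - O_cur                     # back-projection rays
    cos_a = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    alpha = np.arccos(np.clip(cos_a, -1.0, 1.0))
    if alpha > tau_alpha:
        return "occluded"                                # parallax too large: discard the point

    # Reproject P into the current frame and compare projected vs. measured depth.
    P_c = np.linalg.inv(cur_pose)[:3] @ np.append(P_w, 1.0)
    d_proj = P_c[2]
    u, v = (K @ (P_c / d_proj))[:2]
    d = depth_cur[int(round(v)), int(round(u))]          # measured depth at p'
    return "dynamic" if abs(d_proj - d) > tau_d else "static"
```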

3.1.3. Background Inpainting

Inpainting the static background occluded by dynamic objects is essential for the subsequent construction of a realistic static-scene Gaussian map, and we believe it is particularly useful for applications such as virtual and augmented reality. We employ the learning-based method proposed in ProPainter [50], a state-of-the-art method for video inpainting, to perform background inpainting. To conserve computational resources, we do not treat the entire image sequence as a single entity for inpainting. Instead, for each current frame, we select the Δn keyframes immediately preceding it for collective inpainting: for the n-th current frame, we concatenate it with the preceding Δn keyframes and pass them, along with their corresponding masks, to the background inpainting module. This module employs a highly efficient recurrent flow-completion network to complete the corrupted flow fields.
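The batching scheme can be sketched as follows; `inpaint_video` is a hypothetical wrapper around a ProPainter-style flow-guided video inpainting call, and Δn = 5 is only an illustrative default.

```python
def inpaint_current_frame(cur_rgb, cur_mask, keyframes, inpaint_video, delta_n=5):
    """Restore the static background of the current frame using the last delta_n keyframes.

    keyframes     : ordered list of (rgb, dynamic_mask) pairs
    inpaint_video : callable taking (frames, masks) and returning inpainted frames
    """
    context = keyframes[-delta_n:]                      # the Δn most recent keyframes
    frames = [rgb for rgb, _ in context] + [cur_rgb]    # concatenate along the time axis
    masks = [m for _, m in context] + [cur_mask]
    restored = inpaint_video(frames, masks)             # flow-guided video inpainting
    return restored[-1]                                 # repaired current frame
```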
Figure 5 illustrates the comparison between the traditional projection-based method [34] and our approach in background restoration. It can be observed that the projection method struggles to restore many blank areas, whereas the learning-based method can effectively reconstruct these scenes. We deem this aspect as pivotal for the subsequent construction of a complete static scene.

3.2. High-Quality Keyframe Selection

In the previous section, we mentioned that both multi-view geometry and background restoration operate on keyframes. We therefore adopt a high-quality keyframe selection strategy to choose an appropriate combination of images and poses for the subsequent dense Gaussian scene reconstruction. For a set of n input frames F_1, …, F_i, …, F_n, selecting the current frame F_c as a keyframe relies mainly on three criteria:
1. Calculate the overlap ratio R_ol between F_c and the previous keyframe, similarly to Section 3.1.2, but here choose frames whose overlap ratio is lower than the threshold τ_ol, i.e., R_ol < τ_ol;
2. Calculate the proportion R_dy of dynamic regions in F_c relative to the entire image, and prefer frames with a lower proportion of dynamic areas, i.e., R_dy < τ_dy;
3. Calculate the time t since the previous keyframe and set a maximum threshold t_max. If t ≥ t_max, a long time has passed since the last keyframe insertion, and a keyframe should be inserted promptly.
The first condition regulates the overlap rate, avoiding information redundancy and ensuring that the keyframe set contains a richer scene representation. The second condition ensures that keyframes provide more reliable information for background restoration and camera tracking. The third condition bounds the maximum time interval between keyframe insertions. F_c is inserted as a keyframe when any two of these conditions are met.
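A minimal sketch of this two-out-of-three test is given below; the threshold values are illustrative placeholders, not the ones used in our implementation.

```python
def is_keyframe(R_ol, R_dy, t_since_last_kf, tau_ol=0.9, tau_dy=0.3, t_max=1.0):
    """Insert the current frame as a keyframe when at least two criteria hold."""
    criteria = [
        R_ol < tau_ol,             # 1. low overlap with the previous keyframe
        R_dy < tau_dy,             # 2. small proportion of dynamic regions
        t_since_last_kf >= t_max,  # 3. too long since the last keyframe insertion
    ]
    return sum(criteria) >= 2
```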

3.3. Scene Representation and Rendering

3.3.1. Gaussian Map Representation

Following the approach proposed by B. Kerbl et al. [20], we represent the scene using a set of 3D Gaussians. To simplify computations, we have omitted spherical harmonics and only utilize view-independent colors. Therefore, each Gaussian comprises four parameters: the center of the Gaussian function $\mu = [x, y, z]^T \in \mathbb{R}^3$; the covariance matrix $\Sigma$ representing its shape; the opacity $\delta_\alpha \in [0, 1]$; and its RGB color parameters c. According to the definition of a three-dimensional Gaussian function, without normalization and under the assumption of opacity weighting, the influence of a 3D Gaussian on any point $P \in \mathbb{R}^3$ in space is given by the following:
$$f_i(P) = \delta_{\alpha_i} \exp\!\left(-\frac{\lVert P - \mu_i \rVert^2}{2\Sigma_i}\right) \tag{1}$$
This approach enables us to describe every aspect of the scene using discrete sets of 3D Gaussians.
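For concreteness, Equation (1) can be evaluated numerically as in the following sketch, where the quadratic term is read in the usual Mahalanobis sense with the full 3 × 3 covariance matrix; the example values are arbitrary.

```python
import numpy as np

def gaussian_influence(P, mu, Sigma, delta_alpha):
    """f_i(P) = delta_alpha * exp(-0.5 * (P - mu)^T Sigma^{-1} (P - mu))."""
    d = P - mu
    return delta_alpha * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))

# Example: an axis-aligned Gaussian centered at the origin, evaluated 0.1 m away.
mu = np.zeros(3)
Sigma = np.diag([0.02, 0.02, 0.05])   # covariance (shape) of the Gaussian
print(gaussian_influence(np.array([0.1, 0.0, 0.0]), mu, Sigma, delta_alpha=0.8))
```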
The original Gaussian Splatting utilizes point clouds and camera poses provided by Structure from Motion (SFM) for initialization, which is time-consuming and susceptible to disturbances in dynamic environments [51]. However, ORB-SLAM3 exhibits good localization capabilities, and our improved tracking thread demonstrates excellent adaptability in dynamic environments. Therefore, we utilize the visual odometry of ORB-SLAM3 for tracking and conduct global optimization to provide a set of accurate sparse point clouds and camera poses for initializing the Gaussian map. Moreover, during the Gaussian map optimization process, we only optimize the parameters of each 3D Gaussian, without further optimizing the camera poses.
To summarize, our map construction has two stages. First, as a SLAM system, we incrementally construct a sparse point cloud map during tracking. Second, after tracking, we perform global optimization to refine poses and the sparse point cloud. We then construct a dense Gaussian map, optimizing it with rendered and background-corrected images to enhance realism. Below, we will describe how our method optimizes Gaussian maps.

3.3.2. Image Formation Model and Differentiable Rendering

For a given set of 3D Gaussians and camera poses, how to render them is crucial for novel view synthesis. We aim to render high-fidelity colors into any possible camera reference frame. Firstly, divide the 2D screen into 16 × 16 tiles, then select the 3D Gaussians within the view frustum for each tile, sorting all Gaussian distributions from front to back. Splat the sorted Gaussians onto the corresponding tile from nearest to farthest. Finally, stack the splats left by the Gaussians on each tile until the opacity saturates for all pixels ($\delta_\alpha = 1$). In this process, the image formation model of Gaussian Splatting is similar to NeRF, where for a single pixel $p = (u, v)$ in the image, its rendering color equation is
$$C(p) = \sum_{i \in N} c_i\, f_i^{2D}(p) \prod_{j=1}^{i-1} \left(1 - f_j^{2D}(p)\right) \tag{2}$$
where $f_i^{2D}(p)$ is the two-dimensional form of Equation (1):
$$\mu^{2D} = \begin{bmatrix} \mu_x^{2D} \\ \mu_y^{2D} \end{bmatrix} = \begin{bmatrix} u/z \\ v/z \end{bmatrix} = \frac{K W \mu}{(W \mu)_z} \tag{3}$$
$$\Sigma^{2D} = J W \Sigma W^{T} J^{T} \tag{4}$$
$$J = \frac{\partial \mu^{2D}}{\partial \mu} \tag{5}$$
Here, K represents the intrinsic matrix of the camera, and W represents the extrinsic matrix of the camera's pose in the t-th frame. The superscript 2D denotes the projection of $\mu$ and $\Sigma$ onto the two-dimensional image plane, while the subscript z indicates normalization by the depth component.
Similarly, we present the rendering equation for depth:
$$D(p) = \sum_{i \in N} d_i\, f_i^{2D}(p) \prod_{j=1}^{i-1} \left(1 - f_j^{2D}(p)\right) \tag{6}$$
The entire process is differentiable, enabling direct computation of the gradient of camera parameter errors between the rendered and provided images. Subsequently, Gaussian parameters are updated to minimize this error, thereby fitting an accurate three-dimensional scene.
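The per-pixel compositing in Equations (2) and (6) can be sketched as follows, assuming the 2D influence values, colors, and depths of the Gaussians covering pixel p have already been computed and sorted from front to back (the actual renderer performs this per 16 × 16 tile on the GPU).

```python
import numpy as np

def composite_pixel(f2d, colors, depths):
    """Return the rendered color C(p) and depth D(p) for one pixel.

    f2d    : list of 2D influence values f_i^2D(p), sorted nearest to farthest
    colors : list of RGB colors c_i (each shape (3,))
    depths : list of Gaussian depths d_i
    """
    C = np.zeros(3)
    D = 0.0
    transmittance = 1.0                        # running product of (1 - f_j^2D)
    for f, c, d in zip(f2d, colors, depths):
        w = f * transmittance                  # contribution weight of this Gaussian
        C += w * np.asarray(c)
        D += w * d
        transmittance *= (1.0 - f)
        if transmittance < 1e-4:               # opacity saturated: early termination
            break
    return C, D
```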

3.3.3. Gaussian Map Optimization

To minimize artifacts in rendering and enhance scene realism, the parameters of every 3D Gaussian in the map must be optimized. Parameter optimization proceeds by iterative stochastic gradient descent to accurately fit all components. In each iteration, images are rendered and compared with real training views. The loss function combines an $L_1$ term and a D-SSIM term (based on the Structural Similarity Index) between the input and currently rendered images:
$$L = (1 - \lambda) L_1 + \lambda L_{D\text{-}SSIM} \tag{7}$$
where $L_{D\text{-}SSIM} = \frac{1 - \mathrm{SSIM}}{2}$, and $\mathrm{SSIM} \in [-1, 1]$ is the structural similarity index between the two images.
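A minimal PyTorch sketch of Equation (7) is shown below; `ssim` is assumed to be an existing SSIM implementation (for example, the one shipped with the reference 3DGS code), and λ = 0.2 is the weight used by the original Gaussian Splatting work, adopted here only as an illustrative default.

```python
import torch

def photometric_loss(rendered, target, ssim, lam=0.2):
    """L = (1 - lambda) * L1 + lambda * (1 - SSIM) / 2 between rendered and target images."""
    l1 = torch.abs(rendered - target).mean()          # L1 term
    d_ssim = (1.0 - ssim(rendered, target)) / 2.0     # structural dissimilarity term
    return (1.0 - lam) * l1 + lam * d_ssim
```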
At the beginning of Gaussian map optimization, we make two important modifications to the original Gaussian Splatting. First, the optimization does not begin from scratch with the full image sequence; only keyframes are utilized. Second, to maintain consistency of the static scene structure, the comparison is made between the background-repaired image and the currently rendered image, not the original input image or the image with blank areas left by the segmentation module. Throughout the iteration process, if the Gaussian model exhibits under-reconstruction or over-reconstruction, the corresponding Gaussians are cloned or split. Finally, as in [20], useless Gaussians whose opacity is close to 0 or whose extent is excessively large are eliminated.

4. Experimental Results

4.1. Experimental Setup

4.1.1. Implementation Details

To validate the effectiveness of DyGS-SLAM, we conducted a series of experiments, and the results are presented in this section. All experiments were carried out on a 24 GB NVIDIA GeForce RTX 4090 GPU and an AMD EPYC 7402 processor. Our odometry code was built on ORB-SLAM3, and rasterization and gradient computation relied on the original 3DGS implementation.

4.1.2. Datasets

We evaluate the performance of our method on four datasets. The TUM and Bonn datasets are used to assess tracking and mapping in dynamic scenes. The Replica dataset is utilized to verify the reconstruction quality of the proposed method. Additionally, we run our method on a real-world dataset to validate the feasibility of the method in practical scenarios.

4.1.3. Baselines

To showcase the advancement of our method, we compare it with several other SOTA systems. For fairness, we selected systems from traditional VSLAM (ORB-SLAM3 [7]), traditional dynamic visual SLAM (DynaSLAM [34], LC-SLAM [52]), and radiance field-based SLAM (NICE-SLAM [16], SplaTAM [26], and DN-SLAM [19]). Some results are taken from the corresponding original papers.

4.1.4. Metrics

We quantitatively evaluate the performance of the DyGS-SLAM system using the commonly used absolute trajectory error (ATE) metric to assess camera poses. Given the increased uncertainty caused by dynamic objects, we ran each sequence 10 times, as suggested by C. Campos and B. Bescos [34], and report the median results. We also follow standard rendering-performance evaluation practice, using PSNR, SSIM, and LPIPS to quantitatively evaluate reconstruction quality. In the tables, the best value is shown in bold and the second-best value is underlined. We use “↑” to indicate that a higher value of the corresponding metric is better, and “↓” to indicate the opposite.
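For reference, the ATE RMSE used below can be computed as in the following sketch, assuming the estimated and ground-truth trajectories have already been time-associated and aligned (e.g., with a Umeyama fit); the alignment step itself is omitted here.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Root-mean-square translational error between aligned trajectories of shape (N, 3)."""
    errors = np.linalg.norm(est_xyz - gt_xyz, axis=1)   # per-frame translational error
    return float(np.sqrt(np.mean(errors ** 2)))
```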

4.2. Evaluation on the TUM Dataset

The TUM RGB-D dataset [53], widely used in computer vision, was captured by the Computer Vision Group at the Technical University of Munich using Microsoft Kinect sensors at a frequency of 30 Hz. Each sequence contains about 1000 RGB images, depth images, and ground-truth camera trajectories at a resolution of 640 × 480 pixels. The sequences are categorized as high-dynamic or low-dynamic according to the amount of motion. The ‘walking’/‘w’ sequences, categorized as high-dynamic, predominantly show two individuals walking in an office. The ‘sitting’/‘s’ sequences, classified as low-dynamic, show two individuals sitting at a desk, conversing and gesturing. To ensure a fair comparison with other state-of-the-art SLAM systems, four high-dynamic sequences and two low-dynamic sequences were chosen.

4.2.1. Evaluation of Trajectory

Figure 6 presents a comparison between the camera trajectories estimated by our system and ORB-SLAM3 on selected sequences and the ground truth trajectories. The system’s accuracy is depicted by the closeness between the estimated trajectory (blue line) and the ground truth trajectory (black line), with the red line representing the error between them. It is evident that our system estimates trajectories closer to the ground truth trajectories compared to ORB-SLAM3 in these sequences.
To quantitatively evaluate the positioning performance of the DyGS-SLAM system, we present a comparison of DyGS-SLAM with other SOTA methods on the TUM RGB-D dynamic sequences, as shown in Table 1. In the table, bold numbers represent the best performance of the sequence, while underlined numbers denote the second-best performance of the sequence. Our method significantly outperforms radiance field-based SLAM methods in localization performance and achieves results comparable to traditional dynamic SLAM methods, demonstrating that the use of large-scale segmentation models does not compromise performance. Compared to DN-SLAM [19], a radiance field-based SLAM method for dynamic scenes, our approach achieves superior accuracy. This improvement is attributed to DN-SLAM relying solely on semantic constraints, whereas our dual-constraint approach delivers more precise pose estimation.

4.2.2. Evaluation of Reconstruction Quality

To demonstrate the mapping capability of DyGS-SLAM in dynamic scenes, we ran our method and compared it with NICE-SLAM and SplaTAM on the walking_xyz sequence; the results are illustrated in Figure 7. It is evident that, compared to NICE-SLAM and SplaTAM, our method effectively eliminates the interference of dynamic objects and constructs a static scene map.

4.3. Evaluation on the Bonn Dataset

The Bonn dataset [54] is also an RGB-D dynamic dataset. It comprises 24 dynamic sequences where individuals are engaged in various activities, such as walking, balloon throwing, and box moving. Each scene includes camera ground truth trajectories captured using the Optitrack Prime 13 motion capture system, provided in the same format as the TUM RGB-D dataset. Compared to the TUM RGB-D dataset, the Bonn dataset is more diverse, with a wider variety of dynamic objects and motions, as well as significant occlusions, making it more challenging. We selected nine representative sequences for evaluating our approach.

4.3.1. Evaluation of Trajectory

Table 2 presents the experimental results on the Bonn dataset, showing that our method continues to exhibit competitive tracking performance.

4.3.2. Evaluation of Reconstruction Quality

Figure 8 illustrates a detailed comparison, from different viewpoints, between the reference reconstruction provided with the original Bonn dataset [54], obtained with a traditional method, and the model obtained by DyGS-SLAM. This highlights the advantages of our method in reconstructing texture details and completing unknown areas.
Moreover, Figure 9 compares the 3D scene reconstruction between our method and SplaTAM. Our method removes dynamic objects, recovers the static 3D scene, and maintains high reconstruction quality.

4.4. Evaluation on the Replica Dataset

The presence of moving objects in dynamic scenes makes it challenging to accurately evaluate the reconstruction quality of the proposed method. Therefore, the method was evaluated on the Replica dataset [55] using the standard evaluation methods provided by NICE-SLAM [16]. The Replica dataset is a static indoor dataset containing 18 highly realistic room and building scenes, specifically designed for robotics, AR, and VR tasks.
In Table 3, the proposed method shows a significant improvement over NeRF-based methods, thanks to the superior capability of 3D Gaussians in high-quality map reconstruction. It also improves on SplaTAM [26], which is likewise based on 3D Gaussians, owing to the more accurate poses and point clouds provided by the front-end visual odometry. Figure 10 shows the improvement in scene reconstruction more intuitively: our method is superior to the other methods in terms of reconstruction accuracy and scene detail.

4.5. Evaluation in Real Environment

To further validate the practicality of our method, we captured real-world scene data using the Intel RealSense L515 RGB-D camera (made by Intel, Santa Clara, CA, United States) at 30 fps and a resolution of 640 × 480. Figure 11 illustrates the details of our approach at each stage. In this scene, individuals are engaged in random movements, as depicted in Figure 11a. Figure 11b shows the mask provided by the segmentation module, while Figure 11c displays the results of our background restoration, revealing the complete restoration of the black chair obscured in the image. Figure 11d demonstrates the successful recovery of regions occluded by dynamic objects in the synthesized novel views, illustrating how our approach mitigates the influence of dynamic objects and constructs a realistic static scene. Meanwhile, real-time performance is very important for SLAM systems. Therefore, we evaluated the runtime required by our method for each task in a real scenario, as shown in Table 4.

4.6. Ablation Experiment

The functionality of our approach primarily relies on learning-based segmentation methods, the multi-view geometry constraint, and background restoration. The primary challenge for multi-view geometry lies in initialization, as this method necessitates multiple views. However, learning-based methods can effectively detect and segment dynamic objects using a single image, thus circumventing the initialization issue. The primary limitation of learning methods is their inability to identify objects that are semantically static but are actually in motion. However, this issue can be resolved by multi-view geometry. Table 5 illustrates the impact of using learning methods, geometry methods, or their combination on performance in the TUM dataset. The results clearly indicate that the combined approach of learning and geometry outperforms using these methods individually, demonstrating the highest accuracy.
Figure 12 illustrates the difference in our method with and without background restoration. It can be observed that when the image without background restoration is used as the input frame for Gaussian map optimization, the resulting map exhibits noticeable white artifacts. However, after background restoration, our method obtains a complete static map.

5. Conclusions

We introduce DyGS-SLAM, a dynamic SLAM framework that leverages 3D Gaussians as the foundational map representation. By constructing semantic and geometric double constraints, the method effectively eliminates dynamic feature points and ensures precise camera pose tracking. Using accurate poses and repaired static scene images, we construct explicit, high-fidelity, dense static scene maps with 3D Gaussians, which is highly beneficial for virtual reality applications. On the TUM RGB-D, Bonn, and Replica datasets, our method achieves accuracy comparable to, or even better than, state-of-the-art traditional dynamic Visual SLAM approaches, while obtaining maps of similarly high fidelity to SLAM systems based on NeRF and Gaussian radiance fields.
Nonetheless, like other Gaussian-based methods, our approach still faces real-time performance issues. Hence, we recommend using this approach in offline scenarios. Future extensions of this work may include improving real-time performance and capturing/reconstructing dynamic objects.

Author Contributions

Conceptualization, F.Z.; formal analysis, Z.C.; funding acquisition, H.Z.; methodology, F.Z.; project administration, H.Z.; software, Y.Z.; supervision, H.Z.; validation, Z.C.; writing—original draft, F.Z.; writing—review and editing, C.J. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the HFIPS Director’s Fund, Grant No.YZJJKX202401.

Data Availability Statement

Some data presented in this study are available on request from the corresponding author. The data are not publicly available due to the confidentiality clause of some projects.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, Y.; Xie, F.; Zhao, J.; Gao, Z.; Chen, J.; Zhao, F.; Liu, X. ULG-SLAM: A Novel Unsupervised Learning and Geometric Feature-Based Visual SLAM Algorithm for Robot Localizability Estimation. Remote Sens. 2024, 16, 1968. [Google Scholar] [CrossRef]
  2. Wang, W.; Wang, C.; Liu, J.; Su, X.; Luo, B.; Zhang, C. HVL-SLAM: Hybrid Vision and LiDAR Fusion for SLAM. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5706514. [Google Scholar] [CrossRef]
  3. Yang, D.; Bi, S.; Wang, W.; Yuan, C.; Qi, X.; Cai, Y. DRE-SLAM: Dynamic RGB-D encoder SLAM for a differential-drive robot. Remote Sens. 2019, 11, 380. [Google Scholar] [CrossRef]
  4. Chen, Z.; Zhu, H.; Yu, B.; Jiang, C.; Hua, C.; Fu, X.; Kuang, X. IGE-LIO: Intensity Gradient Enhanced Tightly-Coupled LiDAR-Inertial Odometry. IEEE Trans. Instrum. Meas. 2024, 73, 8506411. [Google Scholar] [CrossRef]
  5. Wu, H.; Liu, Y.; Wang, C.; Wei, Y. An Effective 3D Instance Map Reconstruction Method Based on RGBD Images for Indoor Scene. Remote Sens. 2025, 17, 139. [Google Scholar] [CrossRef]
  6. Mur-Artal, R.; Tardos, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  7. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. Orb-slam3: An accurate open-source library for visual, visual-inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  8. Qin, T.; Li, P.L.; Shen, S.J. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  9. Xu, X.; Zhang, L.; Yang, J.; Cao, C.; Wang, W.; Ran, Y.; Tan, Z.; Luo, M. A review of multi-sensor fusion slam systems based on 3D LIDAR. Remote Sens. 2022, 14, 2835. [Google Scholar] [CrossRef]
  10. Yan, L.; Hu, X.; Zhao, L.; Chen, Y.; Wei, P.; Xie, H. Dgs-slam: A fast and robust rgbd slam in dynamic environments combined by geometric and semantic information. Remote Sens. 2022, 14, 795. [Google Scholar] [CrossRef]
  11. Yuan, C.; Xu, Y.; Zhou, Q. PLDS-SLAM: Point and line features SLAM in dynamic environment. Remote Sens. 2023, 15, 1893. [Google Scholar] [CrossRef]
  12. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2022, 65, 99–106. [Google Scholar] [CrossRef]
  13. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 2022, 41, 1–15. [Google Scholar] [CrossRef]
  14. Xu, Q.G.; Xu, Z.X.; Philip, J.; Bi, S.; Shu, Z.X.; Sunkavalli, K.; Neumann, U. Point-nerf: Point-based neural radiance fields. In Proceedings of the 2022 IEEE/Cvf Conference on Computer Vision and Pattern Recognition (Cvpr 2022), New Orleans, LA, USA, 18–24 June 2022; pp. 5428–5438. [Google Scholar]
  15. Sucar, E.; Liu, S.K.; Ortiz, J.; Davison, A.J. Imap: Implicit mapping and positioning in real-time. In Proceedings of the 2021 IEEE/Cvf International Conference on Computer Vision (ICCV 2021), Virtual Conference, 11–17 October 2021; pp. 6209–6218. [Google Scholar]
  16. Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE Computer Society. Volume 2022, pp. 12776–12786. [Google Scholar]
  17. Sandström, E.; Li, Y.; Gool, L.V.; Oswald, M.R. Point-slam: Dense neural point cloud-based slam. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 18433–18444. [Google Scholar]
  18. Yang, X.; Li, H.; Zhai, H.; Ming, Y.; Liu, Y.; Zhang, G. Vox-fusion: Dense tracking and mapping with voxel-based neural implicit representation. In Proceedings of the 21st IEEE International Symposium on Mixed and Augmented Reality, ISMAR 2022, Singapore, 17–21 October 2022; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2022; pp. 499–507. [Google Scholar]
  19. Ruan, C.Y.; Zang, Q.Y.; Zhang, K.H.; Huang, K. Dn-slam: A visual slam with orb features and nerf mapping in dynamic environments. IEEE Sens. J. 2024, 24, 5279–5287. [Google Scholar] [CrossRef]
  20. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  21. Zhu, H.; Kuang, X.; Su, T.; Chen, Z.; Yu, B.; Li, B. Dual-Constraint Registration LiDAR SLAM Based on Grid Maps Enhancement in Off-Road Environment. Remote Sens. 2022, 14, 5705. [Google Scholar] [CrossRef]
  22. Wu, C.; Duan, Y.; Zhang, X.; Sheng, Y.; Ji, J.; Zhang, Y. Mm-gaussian: 3D gaussian-based multi-modal fusion for localization and reconstruction in unbounded scenes. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS) 2024, 12287–12293. [Google Scholar]
  23. Matsuki, H.; Murai, R.; Kelly, P.H.; Davison, A.J. Gaussian splatting slam. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 18039–18048. [Google Scholar]
  24. Yan, C.; Qu, D.; Xu, D.; Zhao, B.; Wang, Z.; Wang, D.; Li, X. Gs-slam: Dense visual slam with 3d gaussian splatting. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19595–19604. [Google Scholar]
  25. Huang, H.; Li, L.; Cheng, H.; Yeung, S.K. Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21584–21593. [Google Scholar]
  26. Keetha, N.; Karhade, J.; Jatavallabhula, K.M.; Yang, G.; Scherer, S.; Ramanan, D.; Luiten, J. SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21357–21366. [Google Scholar]
  27. Tosi, F.; Zhang, Y.; Gong, Z.; Sandström, E.; Mattoccia, S.; Oswald, M.R.; Poggi, M. How nerfs and 3d gaussian splatting are reshaping slam: A survey. arXiv 2024, arXiv:2402.13255. [Google Scholar]
  28. Forster, C.; Pizzoli, M.; Scaramuzza, D. Svo: Fast semi-direct monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; IEEE: New York, NY, USA, 2014; pp. 15–22. [Google Scholar]
  29. Chen, W.; Zhou, C.; Shang, G.; Wang, X.; Li, Z.; Xu, C.; Hu, K. SLAM overview: From single sensor to heterogeneous fusion. Remote Sens. 2022, 14, 6033. [Google Scholar] [CrossRef]
  30. Klein, G.; Murray, D. Parallel tracking and mapping for small ar workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 225–234. [Google Scholar]
  31. Zhang, C.; Zhang, R.; Jin, S.; Yi, X. PFD-SLAM: A new RGB-D SLAM for dynamic indoor environments based on non-prior semantic segmentation. Remote Sens. 2022, 14, 2445. [Google Scholar] [CrossRef]
  32. Yu, H.; Wang, Q.; Yan, C.; Feng, Y.; Sun, Y.; Li, L. DLD-SLAM: RGB-D Visual Simultaneous Localisation and Mapping in Indoor Dynamic Environments Based on Deep Learning. Remote Sens. 2024, 16, 246. [Google Scholar] [CrossRef]
  33. Cheng, S.H.; Sun, C.H.; Zhang, S.J.; Zhang, D.F. Sg-slam: A real-time rgb-d visual slam toward dynamic scenes with semantic and geometric information. IEEE Trans. Instrum. Meas. 2023, 72, 7501012. [Google Scholar] [CrossRef]
  34. Bescos, B.; Facil, J.M.; Civera, J.; Neira, J. Dynaslam: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  35. Li, A.; Wang, J.; Xu, M.; Chen, Z. Dp-slam: A visual slam with moving probability towards dynamic environments. Inf. Sci. 2021, 556, 128–142. [Google Scholar] [CrossRef]
  36. Ran, T.; Yuan, L.; Zhang, J.; Tang, D.; He, L. Rs-slam: A robust semantic slam in dynamic environments based on rgb-d sensor. IEEE Sens. J. 2021, 21, 20657–20664. [Google Scholar] [CrossRef]
  37. He, J.M.; Li, M.R.; Wang, Y.Y.; Wang, H.Y. Ovd-slam: An online visual slam for dynamic environments. IEEE Sens. J. 2023, 23, 13210–13219. [Google Scholar] [CrossRef]
  38. Chang, J.; Dong, N.; Li, D. A real-time dynamic object segmentation framework for SLAM system in dynamic scenes. IEEE Trans. Instrum. Meas. 2021, 70, 1–9. [Google Scholar] [CrossRef]
  39. Trombley, C.M.; Das, S.K.; Popa, D.O. Dynamic-gan: Learning spatial-temporal attention for dynamic object removal in feature dense environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: New York, NY, USA, 2022; pp. 12189–12195. [Google Scholar]
  40. Bescos, B.; Cadena, C.; Neira, J. Empty cities: A dynamic-object-invariant space for visual slam. IEEE Trans. Robot. 2021, 37, 433–451. [Google Scholar] [CrossRef]
  41. Johari, M.M.; Carta, C.; Fleuret, F. Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. In Proceedings of the 2023 IEEE/Cvf Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 17408–17419. [Google Scholar]
  42. Xu, Z.; Niu, J.; Li, Q.; Ren, T.; Chen, C. Nid-slam: Neural implicit representation-based rgb-d slam in dynamic environments. arXiv 2024, arXiv:2401.01189. [Google Scholar]
  43. Li, M.; He, J.; Jiang, G.; Wang, H. Ddn-slam: Real-time dense dynamic neural implicit slam with joint semantic encoding. arXiv 2024, arXiv:2401.01545. [Google Scholar]
  44. Lin, J.; Li, Z.; Tang, X.; Liu, J.; Liu, S.; Liu, J.; Lu, Y.; Wu, X.; Xu, S.; Yan, Y.; et al. Vastgaussian: Vast 3d gaussians for large scene reconstruction. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 5166–5175. [Google Scholar]
  45. Zhong, Y.; Hu, S.; Huang, G.; Bai, L.; Li, Q. Wf-slam: A robust vslam for dynamic scenarios via weighted features. IEEE Sens. J. 2022, 22, 10818–10827. [Google Scholar] [CrossRef]
  46. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  47. Xiong, Y.; Varadarajan, B.; Wu, L.; Xiang, X.; Xiao, F.; Zhu, C.; Dai, X.; Wang, D.; Sun, F.; Iandola, F.; et al. Efficientsam: Leveraged masked image pretraining for efficient segment anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16111–16121. [Google Scholar]
  48. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16901–16911. [Google Scholar]
  49. Tan, W.; Liu, H.M.; Dong, Z.L.; Zhang, G.F.; Bao, H.J. Robust monocular slam in dynamic environments. In Proceedings of the 2013 IEEE International Symposium on Mixed and Augmented Reality (Ismar)—Science and Technology, Adelaide, Australia, 1–4 October 2013; pp. 209–218. [Google Scholar]
  50. Zhou, S.; Li, C.; Chan, K.C.; Loy, C.C. Propainter: Improving propagation and transformer for video inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10477–10486. [Google Scholar]
  51. Liu, Y.L.; Gao, C.; Meuleman, A.; Tseng, H.Y.; Saraf, A.; Kim, C.; Chuang, Y.Y.; Kopf, J.; Huang, J.B. Robust dynamic radiance fields. In Proceedings of the 2023 IEEE/Cvf Conference on Computer Vision and Pattern Recognition, CVPR, Vancouver, BC, Canada, 17–24 June 2023; pp. 13–23. [Google Scholar]
  52. Du, Z.J.; Huang, S.S.; Mu, T.J.; Zhao, Q.; Martin, R.R.; Xu, K. Accurate dynamic slam using crf-based long-term consistency. IEEE Trans. Vis. Comput. Graph. 2022, 28, 1745–1757. [Google Scholar] [CrossRef]
  53. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of rgb-d slam systems. In Proceedings of the 2012 IEEE/Rsj International Conference on Intelligent Robots and Systems (IROS), Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
  54. Palazzolo, E.; Behley, J.; Lottes, P.; Giguère, P.; Stachniss, C. Refusion: 3D reconstruction in dynamic environments for rgb-d cameras exploiting residuals. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 7855–7862. [Google Scholar]
  55. Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J.J.; Mur-Artal, R.; Ren, C.; Verma, S.; et al. The Replica dataset: A digital replica of indoor spaces. arXiv 2019, arXiv:1906.05797. [Google Scholar]
Figure 1. System framework of DyGS-SLAM. The tracking thread conducts dynamic object removal and background inpainting. The mapping thread reconstructs the Gaussian map and performs differentiable rendering using a set of determined poses and point clouds. Lastly, the 3D scene is optimized based on the repaired frames and rendered frames.
Figure 2. Open-world semantic segmentation model.
Figure 3. RGB images from the TUM RGB-D dataset. (a) Frame 690. (b) Frame 765. The red boxes indicate the chair being moved, an object that is semantically static but actually moving.
Figure 4. The feature point p on the keyframe is projected onto the current frame as p', and O and O' are the camera optical centers of the two frames, respectively. (a) Feature point p' is static (d = d_proj). (b) Feature point p' is dynamic (d ≠ d_proj).
Figure 5. Comparison of image frames repaired by Dyna-SLAM and DyGS-SLAM (Ours) on the walking_halfsphere sequence of the TUM RGB-D dataset. The red boxes compare the results of the two methods on the same frame.
Figure 6. Camera trajectory estimated by ORB-SLAM3 and DyGS-SLAM (Ours) on the TUM dataset, and the differences with ground truth values.
Figure 7. Comparison of mapping effects between NICE-SLAM, SplaTAM, and DyGS-SLAM (Ours) on walking_xyz sequence.
Figure 8. Detailed comparison between the original reconstructed scene provided by Bonn and the scene reconstructed by our method. The red boxes highlight details of the reconstructed scenes. (a) Original reconstructed scene provided by Bonn. (b–d) Details of the scene reconstructed by our method.
Figure 9. Comparison of reconstruction performance between SplaTAM and DyGS-SLAM (Ours) on Bonn dataset. Our method demonstrates better reconstruction quality. (a) SplaTAM. (b) DyGS-SLAM.
Figure 10. Comparison of mapping effects between NICE-SLAM, SplaTAM, and DyGS-SLAM on Replica dataset. The red boxes indicate the details of the different methods to reconstruct the scene. Our method also has excellent reconstruction quality in static scenes. (a) NICE-SLAM. (b) SplaTAM. (c) DyGS-SLAM. (d) GT.
Figure 11. Experimental results in a real scenario. The red boxes indicate the recovery of the static background during reconstruction. (a) Input image. (b) Segmentation. (c) Background repair. (d) Novel view synthesis.
Figure 12. Effect of background inpainting on DyGS-SLAM scene reconstruction. The red boxes indicate the reconstruction results of the two variants. (a) Reconstruction w/o background inpainting. (b) Reconstruction w/ background inpainting.
Table 1. Camera tracking results on TUM RGB-D. ATE RMSE (m) (↓) is used as the evaluation metric.
| Sequence | ORB-SLAM3 | DynaSLAM | LC-SLAM | NICE-SLAM | SplaTAM | DN-SLAM | DyGS-SLAM |
|---|---|---|---|---|---|---|---|
| w_xyz | 0.358 | 0.015 | 0.022 | 0.826 | 1.292 | 0.015 | 0.014 |
| w_rpy | 0.752 | 0.085 | 0.053 | 1.275 | 1.442 | 0.032 | 0.045 |
| w_static | 0.309 | 0.009 | 0.022 | 0.327 | 0.778 | 0.008 | 0.008 |
| w_halfsphere | 0.445 | 0.040 | 0.036 | 0.795 | 0.997 | 0.026 | 0.035 |
| s_xyz | 0.010 | 0.013 | 0.012 | 0.211 | 0.016 | 0.011 | 0.009 |
| s_halfsphere | 0.029 | 0.023 | 0.024 | 0.542 | 0.139 | 0.014 | 0.013 |
Table 2. Camera tracking results on Bonn. ATE RMSE (m) (↓) is used as evaluation metric.
| Sequence | ORB-SLAM3 | DynaSLAM | LC-SLAM | NICE-SLAM | SplaTAM | DN-SLAM | DyGS-SLAM |
|---|---|---|---|---|---|---|---|
| balloon | 0.052 | 0.031 | 0.032 | 2.853 | 0.357 | 0.030 | 0.030 |
| balloon2 | 0.211 | 0.034 | 0.028 | 1.946 | 0.372 | 0.025 | 0.025 |
| crowd | 0.386 | 0.022 | 0.023 | 1.757 | 1.767 | 0.025 | 0.020 |
| crowd2 | 1.224 | 0.030 | 0.067 | 3.887 | 4.731 | 0.028 | 0.027 |
| crowd3 | 0.897 | 0.039 | 0.037 | 1.430 | 1.924 | 0.026 | 0.036 |
| move_no_b | 0.176 | 0.023 | 0.025 | 0.178 | 0.058 | 0.026 | 0.022 |
| move_o_b2 | 0.693 | 0.286 | 0.297 | 0.832 | 0.600 | 0.120 | 0.208 |
| person1 | 0.074 | 0.064 | 0.050 | 0.398 | 1.374 | 0.038 | 0.048 |
| person2 | 1.051 | 0.114 | 0.055 | 0.843 | 0.918 | 0.042 | 0.040 |
Table 3. Performance comparison of different methods. PSNR (↑), SSIM (↑), and LPIPS (↓) are the evaluation metrics.
| Methods | Metrics | Room0 | Room1 | Room2 | Office0 | Office1 | Office2 | Office3 | Office4 |
|---|---|---|---|---|---|---|---|---|---|
| NICE-SLAM | PSNR ↑ | 22.12 | 22.47 | 24.52 | 29.07 | 30.34 | 19.66 | 22.23 | 24.94 |
| | SSIM ↑ | 0.69 | 0.76 | 0.81 | 0.87 | 0.89 | 0.80 | 0.80 | 0.86 |
| | LPIPS ↓ | 0.33 | 0.27 | 0.21 | 0.23 | 0.18 | 0.24 | 0.21 | 0.20 |
| SplaTAM | PSNR ↑ | 32.86 | 33.89 | 35.25 | 38.26 | 39.17 | 31.97 | 29.70 | 31.81 |
| | SSIM ↑ | 0.98 | 0.97 | 0.98 | 0.98 | 0.98 | 0.97 | 0.95 | 0.95 |
| | LPIPS ↓ | 0.07 | 0.10 | 0.08 | 0.09 | 0.09 | 0.10 | 0.12 | 0.15 |
| DyGS-SLAM | PSNR ↑ | 33.51 | 34.25 | 35.57 | 37.54 | 39.87 | 32.59 | 31.48 | 34.19 |
| | SSIM ↑ | 0.97 | 0.97 | 0.98 | 0.99 | 0.98 | 0.97 | 0.96 | 0.98 |
| | LPIPS ↓ | 0.08 | 0.09 | 0.08 | 0.07 | 0.08 | 0.10 | 0.11 | 0.13 |
Table 4. Results of average processing time per image for different tasks.
| Tasks | Runtime/Frame |
|---|---|
| Feature Extraction | 0.01 s |
| Dynamic Object Elimination | 0.18 s |
| Background Inpainting | 0.09 s |
| Rendering | 0.13 s |
Table 5. Results of ATE RMSE (m) (↓) of different methods for DyGS-SLAM.
| Sequence | Learning | Geometry | Learning + Geometry |
|---|---|---|---|
| walking_xyz | 0.015 | 0.016 | 0.014 |
| walking_rpy | 0.052 | 0.105 | 0.045 |
| walking_static | 0.007 | 0.008 | 0.008 |
| walking_halfsphere | 0.035 | 0.116 | 0.035 |
| sitting_xyz | 0.015 | 0.010 | 0.009 |
| sitting_halfsphere | 0.022 | 0.019 | 0.013 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
