
RD-SLAM: Real-Time Dense SLAM Using Gaussian Splatting

by Chaoyang Guo, Chunyan Gao *, Yiyang Bai and Xiaoling Lv
School of Mechanical Engineering, Hebei University of Technology, Tianjin 300401, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7767; https://doi.org/10.3390/app14177767
Submission received: 5 August 2024 / Revised: 27 August 2024 / Accepted: 28 August 2024 / Published: 3 September 2024

Abstract

Simultaneous localization and mapping (SLAM) is fundamental for intelligent mobile units to perform diverse tasks. Recent work has shown that integrating neural rendering into SLAM yields promising results in photorealistic environment reconstruction. However, existing methods estimate pose by minimizing the error between rendered and input images, which is time-consuming and cannot run in real time, deviating from the original intention of SLAM. In this paper, we propose a dense RGB-D SLAM system based on 3D Gaussian splatting (3DGS) that employs generalized iterative closest point (G-ICP) for pose estimation. We actively utilize 3D point cloud information to improve the tracking accuracy and operating speed of the system. At the same time, we propose a dual keyframe selection strategy and a corresponding densification method, which effectively reconstruct newly observed scenes and improve the quality of previously constructed maps. In addition, we introduce a regularization loss to address scale explosion of the 3D Gaussians and over-elongation along the camera viewing direction. Experiments on the Replica, TUM-RGBD, and ScanNet datasets show that our method achieves state-of-the-art tracking accuracy and runtime while being competitive in rendering quality.

1. Introduction

Visual simultaneous localization and mapping (SLAM) is a technology in which robots rely on their vision sensors to enable real-time positioning and mapping in unknown environments [1,2]. After decades of development, SLAM has been widely used in fields such as autonomous driving, virtual reality, and augmented reality (AR). Traditional SLAM methods have achieved precise tracking accuracy in a variety of environments while using point clouds [3,4,5,6], meshes [7,8], voxels [9,10,11], or truncated signed distance functions (TSDF) [12,13,14] to represent the scene. However, to adapt to increasingly complex downstream tasks such as autonomous goal navigation and human-computer interaction, SLAM systems need to make breakthroughs in map quality and density so that the constructed maps attain real-world realism.
Neural Radiance Fields (NeRF) [15] have been used in SLAM systems, demonstrating strong novel view inference and high-fidelity map representation capabilities. NeRF is a novel view synthesis method with an implicit scene representation that offers the advantages of realism and continuity. SLAM systems using NeRF typically represent the scene as a multilayer perceptron (MLP), estimating the pose and optimizing the map parameters through differentiable volumetric rendering. However, NeRF-based SLAM requires severely time-consuming ray-based volume rendering to obtain high-resolution images. Moreover, because the scene information is hidden in the MLP, tasks such as scene editing cannot be performed easily [16,17].
3D Gaussian splatting (3DGS) [18] has more potential for dense map representation due to its high-fidelity rendering quality and rendering speeds hundreds of times faster than NeRF. In addition to faster rendering, 3DGS-based scene representation enables direct editing of the scene, which is essential for numerous downstream operations. 3D Gaussians are differentiable volumetric representations that are simultaneously unstructured and explicit, and they can be rasterized very efficiently by projecting them to 2D. With these advantages, 3DGS is well suited to online SLAM systems with real-time requirements. Nevertheless, most existing 3D Gaussian-based SLAM systems [19,20,21,22] determine the optimal pose by iteratively minimizing the photometric error between the rendered and input images, which significantly constrains real-time performance. Based on this, we use the generalized iterative closest point (G-ICP) [23] to replace the photometric error technique in pose estimation and use a simple and efficient 3D point cloud fast tracking method to improve the system's operation speed and tracking accuracy.
In this paper, we present RD-SLAM, a dense RGB-D SLAM system using 3D Gaussians as the only scene representation. We actively utilize the point cloud information acquired by the depth camera to estimate the pose using G-ICP. A dual keyframe selection strategy and a corresponding densification method are proposed to match the characteristics of 3DGS and the mapping requirements of the system, and a regularization loss is added to ensure geometric consistency, achieving a good balance between speed and accuracy. An example of the high-fidelity rendering output of RD-SLAM is shown in Figure 1. Overall, our contributions include the following:
  • We propose a 3DGS-based dense RGB-D SLAM system that utilizes fast rendering techniques to improve the speed of map optimization and achieve realistic map construction.
  • The system uses G-ICP for pose estimation, significantly reducing the time required for tracking while improving positioning accuracy.
  • We propose a dual keyframe selection strategy and corresponding densification method, and add regularization loss to the loss term to ensure tracking accuracy while improving rendering quality.
  • The system was extensively evaluated on various RGB-D datasets, achieving competitive tracking, mapping, and runtime performance.
This paper is organized into six main sections. Section 1 is the introduction, which provides an overview of relevant research and highlights the innovativeness of the proposed methodology. Section 2 reviews the dense visual SLAM research lineage and provides an overview of recent research advances. Section 3 details the theoretical derivation and the concrete implementation of the proposed method, which principally includes 3D Gaussian scene representation and SLAM system construction. Section 4 describes the experimental validation part of the system, including the dataset used and the analysis of the SLAM performance metrics. Section 5 discusses the challenges encountered by RD-SLAM and subsequent improvement plans. Finally, in Section 6, we summarize the work in this paper and look ahead to future work.

2. Related Work

2.1. Dense Visual SLAM

Dense visual SLAM is considered a pivotal approach to solving problems related to scene understanding and autonomous goal navigation. DTAM [24] and KinectFusion [12] were the first systems to realize dense scene reconstruction and have been highly influential for subsequent research. Unlike systems that rely on feature extraction, DTAM implements camera tracking directly through whole-image alignment, using multi-view stereo constraints to update the global point cloud map. KinectFusion uses ICP for pose estimation and aligned depth maps to incrementally update the TSDF and estimate normal maps. Some recent works [25,26,27] have used deep learning networks to improve pose estimation and scene reconstruction accuracy, but the scene representation and overall framework still follow traditional SLAM.

2.2. NeRF-Based SLAM

With the advent of NeRF, dense visual SLAM has made excellent progress in high-fidelity environment reconstruction. iMAP [28] uses an implicit neural radiance field for scene representation, jointly optimizing the map and camera pose through the loss between volume-rendered images and input images. NICE-SLAM [29] enables detailed environment reconstruction of large indoor scenes by introducing a hierarchical scene representation that fuses multi-level local information. ESLAM [30] represents the scene as multi-scale axis-aligned perpendicular feature planes with shallow decoders that decode interpolated features into TSDF and color values for each point in continuous space. Co-SLAM [31] utilizes a multi-resolution hash-grid, exploiting its high convergence speed and ability to represent high-frequency local features. Point-SLAM [32] introduces a neural anchor point representation of the scene that dynamically adjusts the density of points based on the input image information, reducing optimization time and memory usage in less detailed regions. Despite these advantages, NeRF-based SLAM tends to lose scene details due to local over-smoothing and suffers from catastrophic forgetting.

2.3. 3DGS-Based SLAM

Compared with NeRF-based SLAM methods, 3DGS-based SLAM combines the advantages of explicit and implicit representations. SplaTAM [19] simplifies 3DGS by removing view-dependent appearance and using isotropic Gaussians. The system uses silhouette-guided differentiable rendering for incremental 3D map construction. GS-SLAM [20] designs a coarse-to-fine tracking technique for pose estimation, selecting reliable 3D Gaussians to optimize the camera pose. An adaptive expansion strategy is also proposed to effectively add new or remove noisy 3D Gaussians. Gaussian-SLAM [21] analyzes the limitations of 3DGS and proposes an online learning method for 3D Gaussians that achieves good rendering quality on both real-world and synthetic datasets. MonoGS [22] derives the analytic Jacobian of the camera pose with respect to a 3D Gaussian map, allowing the pose to be optimized together with the scene geometry. The system operates at 3 FPS and achieves high-quality tracking and map construction with monocular or RGB-D inputs. However, existing methods rely exclusively on 3D Gaussians for tracking and mapping, which requires a significant amount of time for multiple optimization iterations, preventing real-time operation.

3. Materials and Methods

In this section, we describe the complete SLAM framework in detail; the system overview is shown in Figure 2. Section 3.1 introduces the 3D Gaussian scene representation and the differentiable splatting rasterized rendering method. Section 3.2 presents the G-ICP-based tracking thread and the keyframe selection strategy. Section 3.3 presents the 3D Gaussian adaptive expansion strategy, the pruning strategy, and the loss function construction.

3.1. 3D Gaussian Scene Representation

We represent the underlying map of the environment as a set of 3D Gaussians, and its overall overview is shown in Figure 3. Each 3D Gaussian initialized by the point cloud contains a center $\mu \in \mathbb{R}^3$, a spatial covariance $\Sigma \in \mathbb{R}^{3 \times 3}$, an opacity $\alpha \in \mathbb{R}$, and a color $c$, represented by spherical harmonics for view-dependent radiance. All properties are learnable and optimized by means of back-propagation:

$$G = \{G_i : (\mu_i, \Sigma_i, \alpha_i, c_i)\}, \quad i = 1, \dots, N,$$

where $\Sigma_i = R_i S_i S_i^T R_i^T$, $S \in \mathbb{R}^3$ is a 3D scale vector, and $R \in \mathbb{R}^{3 \times 3}$ is the rotation matrix, which participates in computation as a 4D quaternion. The influence of a single 3D Gaussian on a point $x \in \mathbb{R}^3$ in 3D space is calculated as:

$$f(x) = \alpha \exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$
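For concreteness, the following minimal NumPy sketch (our own illustration; function names such as `gaussian_influence` are not from the paper) builds the covariance $\Sigma = R S S^T R^T$ from a unit quaternion and a scale vector and evaluates the influence function of Equation (2) at a 3D point.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_influence(x, mu, quat, scale, alpha):
    """Evaluate f(x) = alpha * exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)),
    with Sigma = R S S^T R^T as in Equations (1) and (2)."""
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    Sigma = R @ S @ S.T @ R.T
    d = x - mu
    return alpha * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))

# Example: at the Gaussian center the exponent is zero, so f equals the opacity.
f_center = gaussian_influence(np.zeros(3), np.zeros(3),
                              np.array([1.0, 0.0, 0.0, 0.0]),
                              np.array([0.1, 0.1, 0.1]), alpha=0.8)
print(f_center)  # 0.8
```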
To obtain the rendered RGB-D image, we compare it with the input image and optimize the Gaussian parameters. The 3D Gaussians are projected onto the 2D image plane as follows:

$$\mu_{2D} = K\,\frac{E\mu}{d}, \quad \Sigma_{2D} = J E \Sigma E^T J^T, \quad d = (E\mu)_z,$$

where $K$ is the camera intrinsic matrix, $E$ is the extrinsic matrix capturing the rotation and translation of the camera, $J$ is the Jacobian of the affine approximation of the projective transformation, and $d$ is the depth of the 3D Gaussian center obtained by projecting onto the z-axis of the camera coordinate system.
The 3D Gaussians projected onto the image plane are sorted in depth order, and front-to-back blending is performed to synthesize a pixel color $C_p$:

$$C_p = \sum_{i \in N} c_i f_i \prod_{j=1}^{i-1} (1 - f_j).$$

Similarly, the depth $D_p$ is rendered as:

$$D_p = \sum_{i \in N} d_i f_i \prod_{j=1}^{i-1} (1 - f_j).$$
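The blending in Equations (4) and (5) can be illustrated by the short sketch below, which accumulates color and depth for a single pixel over its depth-sorted Gaussians. This is an illustrative NumPy re-implementation, not the CUDA rasterizer used by the system.

```python
import numpy as np

def blend_pixel(colors, depths, f):
    """Front-to-back alpha blending for one pixel (Equations (4) and (5)).
    colors: (N, 3) RGB of the depth-sorted Gaussians covering the pixel,
    depths: (N,) their center depths d_i, f: (N,) their per-pixel opacities f_i."""
    # transmittance[i] = prod_{j < i} (1 - f_j), with transmittance[0] = 1
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - f[:-1])))
    weights = f * transmittance
    C_p = (weights[:, None] * colors).sum(axis=0)  # blended color, Eq. (4)
    D_p = (weights * depths).sum()                 # blended depth, Eq. (5)
    return C_p, D_p
```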

3.2. Tracking

In the tracking thread, we use G-ICP to estimate the camera pose. G-ICP introduces probabilistic information on top of ICP; while maintaining the speed and simplicity of ICP, it accommodates outlier terms, measurement noise, and other probabilistic techniques, significantly improving the algorithm's robustness.
For the first frame of the RGB-D camera input, the tracking step is skipped, and the camera pose is set to the identity. The obtained point cloud is added to G-ICP as the target point cloud, and the pose is estimated from the second input frame onward. G-ICP assumes the existence of underlying point sets $\hat{A} = \{\hat{a}_i\}$ and $\hat{B} = \{\hat{b}_i\}$ that generate the measured point clouds $A = \{a_i\}$ and $B = \{b_i\}$ according to the Gaussian models $a_i \sim \mathcal{N}(\hat{a}_i, C_i^A)$ and $b_i \sim \mathcal{N}(\hat{b}_i, C_i^B)$, where $\hat{a}_i$ and $\hat{b}_i$ are corresponding points, $a_i$ and $b_i$ are the actual measurements of the point locations, and $C_i^A$ and $C_i^B$ are the covariance matrices associated with the measured points. The covariance of a 3D point is computed from its k-nearest neighbors. G-ICP aims to find a transformation that maximally aligns the current frame point cloud $\{a_i\}$ with the map point cloud $\{b_i\}$. For the corresponding points $\hat{a}_i$ and $\hat{b}_i$, there exists a ground-truth transformation $T^* = [R^* \mid t^*]$ such that $\hat{b}_i = T^* \hat{a}_i$. The error term of an arbitrary rigid transformation $T$ is:
$$d_i(T) = b_i - T a_i.$$
Since $a_i$ and $b_i$ are independent and both Gaussian-distributed, $d_i(T^*)$ also follows a Gaussian distribution:

$$d_i(T^*) \sim \mathcal{N}\!\left(\hat{b}_i - T^* \hat{a}_i,\; C_i^B + T^* C_i^A (T^*)^T\right) = \mathcal{N}\!\left(0,\; C_i^B + T^* C_i^A (T^*)^T\right).$$
$T$ can be viewed as the parameter to be estimated in the probability distribution of $d_i(T)$. We use maximum likelihood estimation (MLE) to iteratively compute $T^*$:

$$T^* = \arg\max_T \prod_i p\big(d_i(T)\big) = \arg\max_T \sum_i \log p\big(d_i(T)\big).$$

The above equation can be simplified to:

$$T^* = \arg\min_T \sum_i d_i(T)^T \left(C_i^B + T C_i^A T^T\right)^{-1} d_i(T),$$

and $T^* = [R^* \mid t^*]$ is the camera pose estimated by the system.
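For illustration, the sketch below evaluates the covariance-weighted objective of Equation (9) for a fixed set of correspondences. A full G-ICP solver alternates correspondence search with non-linear minimization of this cost, and the per-point covariances come from the k-nearest-neighbor computation described above; the function and variable names here are our own.

```python
import numpy as np

def gicp_cost(T, src, tgt, cov_src, cov_tgt):
    """Covariance-weighted G-ICP cost of Equation (9) for known correspondences.
    T: 4x4 rigid transform; src, tgt: (N, 3) matched points a_i, b_i;
    cov_src, cov_tgt: (N, 3, 3) per-point covariances C_i^A, C_i^B."""
    R, t = T[:3, :3], T[:3, 3]
    d = tgt - (src @ R.T + t)          # residuals d_i(T) = b_i - T a_i
    M = cov_tgt + R @ cov_src @ R.T    # C_i^B + R C_i^A R^T (translation leaves covariance unchanged)
    # Sum of Mahalanobis terms d_i^T M_i^{-1} d_i over all correspondences
    return sum(di @ np.linalg.solve(Mi, di) for di, Mi in zip(d, M))
```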
Since 3D Gaussians need good observation coverage, and considering the densification strategy, we divide keyframes into tracking and mapping keyframes, as shown in Figure 4. We use G-ICP to calculate the match ratio between the current frame point cloud and the map point cloud. If the match ratio is below a threshold, the frame is selected as a tracking keyframe. If the relative rotation between the current frame and the latest keyframe is above a threshold, the frame is selected as a mapping keyframe. The relative rotation is expressed using Euler angles; to ensure stable results, we sum the Euler angles over the three axes and compare the sum against a single threshold. In tracking keyframes, only 3D Gaussians that do not overlap with the existing map are added to the optimization; conversely, in mapping keyframes, only 3D Gaussians that overlap with the existing map are added to the optimization.
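A minimal sketch of this dual keyframe decision logic is given below; the threshold values are illustrative placeholders, not the tuned values used in our experiments.

```python
def select_keyframe(match_ratio, rel_euler_deg,
                    match_thresh=0.9, rot_thresh_deg=10.0):
    """Dual keyframe selection (Section 3.2).
    match_ratio: fraction of current-frame points matched to the map by G-ICP.
    rel_euler_deg: (roll, pitch, yaw) relative to the latest keyframe, in degrees.
    Thresholds are illustrative placeholders."""
    if match_ratio < match_thresh:
        return "tracking"   # new scene content: add non-overlapping Gaussians
    if sum(abs(a) for a in rel_euler_deg) > rot_thresh_deg:
        return "mapping"    # new viewpoint: refine overlapping Gaussians
    return None             # not a keyframe
```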

3.3. Mapping

RD-SLAM aims to create a coherent, high-fidelity 3D map. The 3D Gaussian scene representation is updated and optimized at each selected keyframe for stable mapping. In the mapping thread, we first use an adaptive densification strategy to add new 3D Gaussians to the scene representation, which is then rendered differentiably to obtain color and depth images. The updated 3D Gaussian scene representation is then optimized by minimizing the color, depth, and regularization losses between the input and rendered images. Finally, anomalous 3D Gaussians are removed using the pruning strategy.
The Gaussian densification process is closely related to the keyframe selection strategy. We use all pixels of the first frame to initialize new 3D Gaussians: for each pixel, we add a new 3D Gaussian with the color of that pixel, centered at the spatial position of that pixel, with an opacity of 0.1. When the tracking thread uses G-ICP for pose estimation, the system simultaneously obtains the matching relationship between the current frame's spatial points and the map's spatial points. If the number of matched spatial points is below a threshold, the current frame is added as a tracking keyframe, and the unmatched spatial points are used to initialize new Gaussians. When the current frame is detected as a mapping keyframe by the dual keyframe selection strategy, the spatial points of the current frame that match the map are used to initialize new 3D Gaussians.
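The sketch below illustrates this initialization for one RGB-D frame: pixels are back-projected with the camera intrinsics and pose, and each valid pixel spawns a Gaussian centered at its 3D point with the pixel color and an opacity of 0.1. The function name and the omission of scale/rotation initialization are our simplifications, not the paper's exact implementation.

```python
import numpy as np

def backproject_and_init(depth, rgb, K, T_wc, init_opacity=0.1):
    """Back-project an RGB-D frame and initialize one Gaussian per valid pixel.
    depth: (H, W) metric depth, rgb: (H, W, 3) colors in [0, 1],
    K: 3x3 intrinsics, T_wc: 4x4 camera-to-world pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    valid = depth > 0                               # skip missing depth
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=-1)
    pts_world = pts_cam @ T_wc[:3, :3].T + T_wc[:3, 3]
    return {
        "mu": pts_world,                            # Gaussian centers
        "color": rgb[valid],                        # per-pixel colors
        "opacity": np.full(len(pts_world), init_opacity),
    }
```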
After the 3D Gaussians are splatted to the image plane, the color loss between the input color image and the rendered image is calculated. We compute the color loss as a weighted combination of $L_1$ and structural similarity index measure (SSIM) [33] losses:

$$L_{color} = (1 - \lambda) \cdot \|\hat{I} - I\|_1 + \lambda \left(1 - SSIM(\hat{I}, I)\right),$$
where $I$ is the original color image, and $\hat{I}$ is the rendered color image. Depth optimization uses an $L_1$ loss:

$$L_{depth} = \|\hat{D} - D\|_1,$$

where $D$ is the original depth image, and $\hat{D}$ is the rendered depth image.
To prevent scale explosion of the 3D Gaussians and over-elongation along the camera viewing direction, we add a scale regularization loss $L_{reg}$. It effectively compensates for the 3D Gaussian scale distortion caused by insufficient observation viewpoints during SLAM. We use an $L_1$ loss for scale optimization:

$$L_{reg} = \|\hat{S} - \tilde{S}\|_1,$$

where $\tilde{S}$ is the average scale, and $\hat{S}$ is the scale of the 3D Gaussian before optimization. Finally, the depth, color, and regularization losses are optimized together:

$$L = \alpha L_{depth} + \beta L_{color} + \gamma L_{reg},$$

where $\alpha$, $\beta$, and $\gamma$ are predefined hyperparameters weighting the depth, color, and scale regularization losses.
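A PyTorch sketch of the combined loss in Equations (10)-(13) is shown below. The SSIM term is assumed to be provided by an external implementation (`ssim_fn`), the loss weights are placeholders, and the regularization term reflects one reading of Equation (12) (pulling each scale toward the mean scale) rather than the exact implementation.

```python
import torch

def total_loss(render_rgb, gt_rgb, render_depth, gt_depth, scales, ssim_fn,
               lam=0.2, alpha=1.0, beta=1.0, gamma=1.0):
    """Combined mapping loss L = alpha*L_depth + beta*L_color + gamma*L_reg
    (Equations (10)-(13)). ssim_fn is an external SSIM implementation; the
    weights and lambda are placeholders, not the paper's tuned values."""
    l_color = (1 - lam) * (render_rgb - gt_rgb).abs().mean() \
              + lam * (1 - ssim_fn(render_rgb, gt_rgb))
    l_depth = (render_depth - gt_depth).abs().mean()
    # One reading of Eq. (12): penalize each Gaussian scale's deviation from
    # the mean scale, discouraging over-elongated Gaussians.
    l_reg = (scales - scales.mean().detach()).abs().mean()
    return alpha * l_depth + beta * l_color + gamma * l_reg
```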
For each keyframe, we perform a fixed number of iterations. First, we iterate on the current keyframe to fully optimize the 3D Gaussians for the current viewpoint. Then, the first n keyframes in the keyframe list that overlap with the current keyframe are selected, and one of them is chosen randomly in each iteration, ensuring that the whole map is fully optimized while avoiding local minima. After a certain number of map iterations, some anomalous Gaussians appear due to the instability of the adaptive Gaussian control. We follow the pruning strategy in [18] while also removing 3D Gaussians whose centers are not near the scene surface.
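The pruning step can be sketched as a simple mask over the per-Gaussian parameters, as below. The opacity, scale, and surface-distance thresholds are illustrative, and the surface-distance test is our stand-in for "not near the scene surface", not the paper's exact criterion.

```python
import torch

def prune_gaussians(params, surface_dist, opacity_min=0.005,
                    scale_max=0.5, surface_max=0.1):
    """Remove anomalous Gaussians after a round of map optimization.
    params: dict of per-Gaussian tensors ('mu', 'scale', 'opacity', ...);
    surface_dist: (N,) distance of each Gaussian center to the nearest
    observed surface point. Thresholds are illustrative placeholders."""
    keep = (params["opacity"] > opacity_min) \
         & (params["scale"].max(dim=-1).values < scale_max) \
         & (surface_dist < surface_max)
    return {k: v[keep] for k, v in params.items()}
```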

4. Results

4.1. Experimental Setup

To evaluate the performance of RD-SLAM, we conduct experiments on the Replica [34] (eight sequences), TUM-RGBD [35] (three sequences), and ScanNet [36] (six sequences) datasets. The Replica dataset consists of synthetic scenes with high-quality RGB and depth images. The TUM-RGBD and ScanNet datasets contain real-world images captured with older, low-quality cameras and exhibit significant noise and blur; in particular, the depth images are sparse, with many regions suffering from information loss. We validate the effectiveness of our method by evaluating it on both the virtual synthetic dataset and the real-world datasets.
We compare RD-SLAM with existing state-of-the-art (SOTA) NeRF-based and 3DGS-based dense visual SLAM systems: NICE-SLAM, Point-SLAM, GS-SLAM, and SplaTAM. When comparing tracking accuracy on the TUM-RGBD dataset, we also compare our approach with three traditional SLAM systems: Kintinuous [37], ElasticFusion [38], and ORB-SLAM2 [5].
We evaluate camera tracking accuracy using the root mean square error of the absolute trajectory error (ATE RMSE) and map quality using the peak signal-to-noise ratio (PSNR), SSIM, and learned perceptual image patch similarity (LPIPS) [39]. For the experimental results, we report the average of three runs. The assessment tables mark the best results in bold and underline the second-best results. RD-SLAM is implemented in Python (v.3.9) using the PyTorch framework, incorporating CUDA code for 3DGS. We run our SLAM on a laptop with an Intel Core i7-10750H (2.60 GHz) and an NVIDIA Quadro RTX 3000.
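For reference, ATE RMSE over an already-aligned trajectory reduces to the few lines below; trajectory alignment itself (e.g., via the Horn/Umeyama method) is assumed to have been performed beforehand, as is standard practice for this metric.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """ATE RMSE between aligned estimated and ground-truth camera positions.
    Both arrays are (N, 3); alignment is assumed to have been applied."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)  # per-frame position error
    return float(np.sqrt((err ** 2).mean()))
```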

4.2. Quantitative Evaluation

4.2.1. Tracking Performance

Table 1 compares the tracking accuracy of RD-SLAM and other SLAM systems on the TUM-RGBD dataset. Our approach achieves SOTA performance among both NeRF-based and 3DGS-based SLAM systems, with an average ATE RMSE of 2.11 cm at real-time operating speeds, 0.93 cm lower than that of the second-best neural method, Point-SLAM. However, due to the many voids in the depth images and the extreme blurring of the RGB images, well-designed feature-based conventional SLAM methods still outperform the neural SLAM methods.
Table 2 shows our tracking performance on selected ScanNet sequences. All neural SLAM methods struggle because the ScanNet depth images are sparse and the RGB images are blurry. Our approach achieves competitive tracking performance.
In Table 3, we compare the tracking accuracy of RD-SLAM with other SOTA methods on the Replica dataset. Our approach achieves the best performance in all eight scenes, with a 59% reduction in trajectory error compared to SplaTAM, which is also a 3DGS-based SLAM system. Compared with other methods, our system exhibits better tracking performance on real-world and virtual synthetic datasets. This is because we actively use the 3D point cloud information for accurate pose estimation directly through G-ICP.

4.2.2. Rendering Performance

Table 4 and Table 5 show the rendering performance of our approach on the real-world datasets. RD-SLAM achieves competitive results averaged over all sequences from the TUM-RGBD and ScanNet datasets. Our method achieves better environment reconstruction in challenging environments where depth images are severely missing because the dual keyframe selection strategy fully adds and optimizes 3D Gaussians. Notably, the RD-SLAM system runs at 10 FPS, more than 60 times faster than SplaTAM. Figure 5 shows the rendering results of the system.
Table 6 compares the rendering performance of RD-SLAM with other neural SLAM methods on the Replica dataset. Our method performs best on most of the evaluated sequences and second best on many of the remainder. RD-SLAM outperforms the second-ranked method, Point-SLAM, by 2.96 dB in PSNR. This excellent rendering and real-time performance allows RD-SLAM to be applied more easily to autonomous driving, virtual reality, AR, and other fields.
The visualization results in Figure 6 show that RD-SLAM generates higher-quality and more realistic images than previous SOTA methods, avoiding ghosting and local reconstruction errors. In the rendered results, Point-SLAM is overly smooth and fails to reconstruct fine details, while SplaTAM produces more apparent holes in some areas, as shown in the second, fourth, and sixth rows of Figure 6.

4.2.3. Runtime Analysis

We show the runtime comparison of RD-SLAM with other systems in Table 7. Owing to the efficiency of 3D Gaussian rasterization and the G-ICP front-end, RD-SLAM runs 55 times faster than Point-SLAM and 60 times faster than SplaTAM, achieving the best tracking accuracy and rendering quality at the fastest run speed.

4.3. Ablation Study

We perform ablation experiments with RD-SLAM on the room0 sequence of the Replica dataset to evaluate the effectiveness of the dual keyframe selection strategy and scale regularization. As shown in Figure 7, we exclude the mapping keyframes and the scale regularization, respectively, to verify their effects on the system. Without mapping keyframes, the rendered images contain voids at the edges because the densification strategy cannot be applied effectively. With scale regularization removed, the scale of the 3D Gaussians is not effectively constrained, and the system cannot render the fine details of the scene.
Table 8 shows the quantitative ablation results for the dual keyframe selection strategy and scale regularization. With only tracking keyframes, the rendering quality of our method degrades drastically because 3D Gaussians representing the scene cannot be added effectively. In contrast, our proposed dual keyframe selection strategy improves the PSNR by 10.81 dB by adding more accurate 3D Gaussians. Compared with supervising using only color and depth losses, the scale regularization loss significantly improves mapping performance through precise shape constraints, decreasing the LPIPS by 0.188. Our complete implementation yields higher-quality and more detailed rendering results.

5. Discussion

During our study, we found that the performance of RD-SLAM depends on the quality of the RGB-D images, and the system's effectiveness may be compromised with a poor camera or in environments where high-quality images are unavailable. Methods such as [40,41] can quickly reconstruct high-quality images with their lightweight and efficient model architectures. We plan to construct image pyramids in which the 3D Gaussians are first optimized with the low-resolution images captured by the camera and then supervised with reconstructed high-resolution images as ground truth. We hope to use these methods in future work to reduce the effect of low-quality input images and improve the robustness of the system.
In addition, we plan to extend RD-SLAM to make it more adaptable to large-scale scenes. The two main challenges that we will face are the large number of 3D Gaussians that need to be optimized and the enormous amount of memory required to store the parameters. To address the problem of optimizing too many 3D Gaussians, we propose using a fixed number of keyframes to compose the sub-map. The efficiency and real-time performance of the system are ensured by expanding and optimizing only one active sub-map at any given time. To address the problem of excessive memory consumption, we plan to implement more accurate initialization and pruning strategies so that a single Gaussian better fits the object surface without the need for multiple overlapping Gaussians, thus reducing memory costs. These improvements can adapt the system to increasingly complex downstream tasks.

6. Conclusions

This paper proposes a new dense visual SLAM system, RD-SLAM, which utilizes 3D Gaussians as the underlying map representation while using G-ICP for fast tracking. The proposed dual keyframe selection strategy and scale regularization enable RD-SLAM to reconstruct realistic, dense maps dynamically. The experimental results on the Replica dataset show that the proposed system runs more than 60 times faster than SplaTAM, which is also a 3DGS-based SLAM system. We demonstrate through extensive experiments that our method achieves SOTA performance in terms of localization, reconstruction, and system speed. Overall, RD-SLAM provides a promising solution for dense real-time visual SLAM by effectively combining 3DGS, G-ICP, a dual keyframe selection strategy, and a scale regularization loss.

Author Contributions

Conceptualization, C.G. (Chaoyang Guo), C.G. (Chunyan Gao) and X.L.; methodology, C.G. (Chunyan Gao) and X.L.; software, C.G. (Chaoyang Guo) and Y.B.; validation, C.G. (Chaoyang Guo), C.G. (Chunyan Gao) and Y.B.; formal analysis, X.L.; investigation, C.G. (Chaoyang Guo) and Y.B.; resources, C.G. (Chunyan Gao) and X.L.; data curation, C.G. (Chunyan Gao); writing—original draft preparation, C.G. (Chaoyang Guo) and C.G. (Chunyan Gao); writing—review and editing, C.G. (Chunyan Gao); visualization, Y.B.; supervision, X.L.; project administration, Y.B. and X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under grant No. 2022YFB4701100 and by the National Natural Science Foundation of China under grant No. U1913211.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping: Towards the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  2. Abaspur Kazerouni, I.; Fitzgerald, L.; Dooly, G.; Toal, D. A Survey of State-of-the-Art on Visual SLAM. Expert Syst. Appl. 2022, 205, 117734. [Google Scholar] [CrossRef]
  3. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
  4. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  5. Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  6. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  7. Ruetz, F.; Hernández, E.; Pfeiffer, M.; Oleynikova, H.; Cox, M.; Lowe, T.; Borges, P. OVPC Mesh: 3D Free-Space Representation for Local Ground Vehicle Navigation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8648–8654. [Google Scholar]
  8. Schöps, T.; Sattler, T.; Pollefeys, M. SurfelMeshing: Online Surfel-Based Mesh Reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2494–2507. [Google Scholar] [CrossRef] [PubMed]
  9. Nießner, M.; Zollhöfer, M.; Izadi, S.; Stamminger, M. Real-Time 3D Reconstruction at Scale Using Voxel Hashing. ACM Trans. Graph. 2013, 32, 1–11. [Google Scholar] [CrossRef]
  10. Kahler, O.; Prisacariu, V.; Valentin, J.; Murray, D. Hierarchical Voxel Block Hashing for Efficient Integration of Depth Images. IEEE Robot. Autom. Lett. 2016, 1, 192–197. [Google Scholar] [CrossRef]
  11. Dai, A.; Nießner, M.; Zollhöfer, M.; Izadi, S.; Theobalt, C. BundleFusion: Real-Time Globally Consistent 3D Reconstruction Using On-the-Fly Surface Reintegration. ACM Trans. Graph. 2017, 36, 76a:1. [Google Scholar] [CrossRef]
  12. Newcombe, R.A.; Fitzgibbon, A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; pp. 127–136. [Google Scholar]
  13. Whelan, T.; Johannsson, H.; Kaess, M.; Leonard, J.J.; McDonald, J. Robust Real-Time Visual Odometry for Dense RGB-D Mapping. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 6–10 May 2013; pp. 5724–5731. [Google Scholar]
  14. Weder, S.; Schonberger, J.L.; Pollefeys, M.; Oswald, M.R. NeuralFusion: Online Depth Fusion in Latent Space. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 3162–3172. [Google Scholar]
  15. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In Proceedings of the 2020 European Conference on Computer Vision(ECCV), Online, 23–28 August 2020; pp. 405–421. [Google Scholar]
  16. Chen, G.; Wang, W. A Survey on 3D Gaussian Splatting. arXiv 2024, arXiv:2401.03890. [Google Scholar]
  17. Tosi, F.; Zhang, Y.; Gong, Z.; Sandström, E.; Mattoccia, S.; Oswald, M.R.; Poggi, M. How NeRFs and 3D Gaussian Splatting are Reshaping SLAM: A Survey. arXiv 2024, arXiv:2402.13255. [Google Scholar]
  18. Kerbl, B.; Kopanas, G.; Leimkuehler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  19. Keetha, N.; Karhade, J.; Jatavallabhula, K.M.; Yang, G.; Scherer, S.; Ramanan, D.; Luiten, J. SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 21357–21366. [Google Scholar]
  20. Yan, C.; Qu, D.; Wang, D.; Xu, D.; Wang, Z.; Zhao, B.; Li, X. GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting. arXiv 2023, arXiv:2311.11700. [Google Scholar]
  21. Yugay, V.; Li, Y.; Gevers, T.; Oswald, M.R. Gaussian-SLAM: Photo-realistic Dense SLAM with Gaussian Splatting. arXiv 2023, arXiv:2312.10070. [Google Scholar]
  22. Matsuki, H.; Murai, R.; Kelly, P.H.J.; Davison, A.J. Gaussian Splatting SLAM. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 18039–18048. [Google Scholar]
  23. Segal, A.; Haehnel, D.; Thrun, S. Generalized-ICP. In Proceedings of the Robotics: Science and Systems, Seattle, WA, USA, 28 June–1 July 2009. [Google Scholar]
  24. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense Tracking and Mapping in Real-Time. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2320–2327. [Google Scholar]
  25. Bloesch, M.; Czarnowski, J.; Clark, R.; Leutenegger, S.; Davison, A.J. CodeSLAM—Learning a Compact, Optimisable Representation for Dense Visual SLAM. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2560–2568. [Google Scholar]
  26. Li, R.; Wang, S.; Gu, D. DeepSLAM: A Robust Monocular SLAM System With Unsupervised Deep Learning. IEEE Trans. Ind. Electron. 2021, 68, 3577–3587. [Google Scholar] [CrossRef]
  27. Teed, Z.; Deng, J. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In Proceedings of the 2021 International Conference on Neural Information Processing Systems, Online, 6–14 December 2021; pp. 16558–16569. [Google Scholar]
  28. Sucar, E.; Liu, S.; Ortiz, J.; Davison, A.J. iMAP: Implicit Mapping and Positioning in Real-Time. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6209–6218. [Google Scholar]
  29. Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. NICE-SLAM: Neural Implicit Scalable Encoding for SLAM. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12776–12786. [Google Scholar]
  30. Johari, M.M.; Carta, C.; Fleuret, F. ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 17408–17419. [Google Scholar]
  31. Wang, H.; Wang, J.; Agapito, L. Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13293–13302. [Google Scholar]
  32. Sandström, E.; Li, Y.; Van Gool, L.; Oswald, M.R. Point-SLAM: Dense Neural Point Cloud-Based SLAM. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 18387–18398. [Google Scholar]
  33. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  34. Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J.J.; Mur-Artal, R.; Ren, C.; Verma, S.; et al. The Replica Dataset: A Digital Replica of Indoor Spaces. arXiv 2019, arXiv:1906.05797. [Google Scholar]
  35. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A Benchmark for the Evaluation of RGB-D SLAM Systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
  36. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2432–2443. [Google Scholar]
  37. Whelan, T.; Kaess, M.; Johannsson, H.; Fallon, M.; Leonard, J.J.; McDonald, J. Real-time Large Scale Dense RGB-D SLAM with Volumetric Fusion. Int. J. Robot. Res. 2015, 34, 598–626. [Google Scholar] [CrossRef]
  38. Whelan, T.; Leutenegger, S.; Salas Moreno, R.; Glocker, B.; Davison, A. ElasticFusion: Dense SLAM without a Pose Graph. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015. [Google Scholar]
  39. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  40. Li, Z.; Liu, Y.; Chen, X.; Cai, H.; Gu, J.; Qiao, Y.; Dong, C. Blueprint Separable Residual Network for Efficient Image Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 832–842. [Google Scholar]
  41. Mardieva, S.; Ahmad, S.; Umirzakova, S.; Rasool, M.A.; Whangbo, T.K. Lightweight Image Super-Resolution for IoT Devices Using Deep Residual Feature Distillation Network. Knowl.-Based Syst. 2024, 285, 111343. [Google Scholar] [CrossRef]
Figure 1. Our method reconstructs scene details more accurately than existing systems such as Point-SLAM and SplaTAM, and at the same time, it operates at speeds up to 10 FPS. The second row shows zoomed-in details of the black squares.
Figure 2. Overview of RD-SLAM. Our approach takes RGB-D frames as input and initializes the point cloud by reprojecting the RGB-D image, and then performs an alignment algorithm using G-ICP to obtain the current frame pose. The dual keyframe selection strategy determines whether the current frame is a tracking or mapping keyframe, which initializes new 3D Gaussians to perform the densification operation. The Gaussian parameters are optimized by applying depth, color, and regularization loss to the rendered RGB-D frames, and a pruning operation is performed after a fixed number of iterations to remove anomalous Gaussians. Finally, a dense 3D environment map is obtained.
Figure 3. 3D Gaussian operation flow process.
Figure 4. Dual keyframe selection strategy. By dividing the keyframes into tracking and mapping keyframes, RD-SLAM can accomplish the densification strategy while guaranteeing tracking accuracy and, simultaneously, allowing the 3D Gaussian to be fully trained.
Figure 5. RD-SLAM rendering visualization results on the TUM-RGBD dataset.
Figure 6. Comparison of the rendering visualization results of RD-SLAM proposed in this paper with other SOTA methods on three sequences in the Replica dataset. The second, fourth, and sixth rows show zoomed-in details of the colored squares.
Figure 7. Dual keyframe selection strategy and scale regularization for ablation experiments. The second row shows zoomed-in details of the colored squares.
Table 1. Camera tracking results on the TUM-RGBD dataset (ATE RMSE ↓ [cm]).

Methods | fr1/desk | fr2/xyz | fr3/office | Avg.
Kintinuous [37] | 3.70 | 2.90 | 3.00 | 3.20
ElasticFusion [38] | 2.53 | 1.17 | 2.52 | 1.50
ORB-SLAM2 [5] | 1.60 | 0.40 | 1.00 | 1.00
NICE-SLAM [29] | 4.26 | 31.73 | 3.87 | 15.87
Point-SLAM [32] | 4.34 | 1.31 | 3.48 | 3.04
GS-SLAM [20] | 3.30 | 1.30 | 6.60 | 3.73
SplaTAM [19] | 3.34 | 1.34 | 5.18 | 3.29
Ours | 2.29 | 1.80 | 2.24 | 2.11
Table 2. Camera tracking results on the ScanNet dataset (ATE RMSE ↓ [cm]).

Methods | 0000 | 0059 | 0106 | 0169 | 0181 | 0207 | Avg.
NICE-SLAM [29] | 12.00 | 14.00 | 7.90 | 10.90 | 13.40 | 6.20 | 10.70
Point-SLAM [32] | 10.24 | 7.81 | 8.65 | 22.16 | 14.77 | 9.54 | 12.19
SplaTAM [19] | 11.53 | 9.99 | 17.86 | 12.49 | 9.61 | 7.81 | 11.55
Ours | 10.73 | 12.56 | 9.34 | 11.62 | 10.67 | 11.69 | 11.10
Table 3. Camera tracking results on the Replica dataset (ATE RMSE ↓ [cm]).

Methods | Room0 | Room1 | Room2 | Office0 | Office1 | Office2 | Office3 | Office4 | Avg.
NICE-SLAM [29] | 0.97 | 1.31 | 1.07 | 0.88 | 1.00 | 1.06 | 1.10 | 1.13 | 1.06
Point-SLAM [32] | 0.61 | 0.41 | 0.37 | 0.38 | 0.48 | 0.54 | 0.69 | 0.72 | 0.52
GS-SLAM [20] | 0.48 | 0.53 | 0.33 | 0.52 | 0.41 | 0.59 | 0.46 | 0.70 | 0.50
SplaTAM [19] | 0.27 | 0.41 | 0.32 | 0.55 | 0.23 | 0.36 | 0.32 | 0.53 | 0.37
Ours | 0.15 | 0.16 | 0.10 | 0.18 | 0.12 | 0.16 | 0.16 | 0.20 | 0.15
Table 4. Map quality on TUM-RGBD.

Methods | Metrics | fr1/desk | fr2/xyz | fr3/office | Avg.
Point-SLAM [32] | PSNR ↑ | 13.79 | 17.62 | 18.29 | 16.57
 | SSIM ↑ | 0.625 | 0.710 | 0.749 | 0.695
 | LPIPS ↓ | 0.545 | 0.584 | 0.452 | 0.527
SplaTAM [19] | PSNR ↑ | 21.49 | 25.06 | 21.17 | 22.57
 | SSIM ↑ | 0.839 | 0.950 | 0.861 | 0.883
 | LPIPS ↓ | 0.255 | 0.099 | 0.221 | 0.192
Ours | PSNR ↑ | 20.79 | 23.17 | 21.39 | 21.78
 | SSIM ↑ | 0.758 | 0.829 | 0.785 | 0.791
 | LPIPS ↓ | 0.227 | 0.143 | 0.198 | 0.189
Table 5. Map quality on ScanNet.

Methods | Metrics | 0000 | 0059 | 0106 | 0169 | 0181 | 0207 | Avg.
Point-SLAM [32] | PSNR ↑ | 21.13 | 19.57 | 17.01 | 18.29 | 22.36 | 20.72 | 19.85
 | SSIM ↑ | 0.801 | 0.765 | 0.684 | 0.686 | 0.824 | 0.751 | 0.752
 | LPIPS ↓ | 0.488 | 0.498 | 0.537 | 0.539 | 0.471 | 0.542 | 0.510
SplaTAM [19] | PSNR ↑ | 19.69 | 19.38 | 19.18 | 23.33 | 16.42 | 19.90 | 19.65
 | SSIM ↑ | 0.748 | 0.793 | 0.743 | 0.804 | 0.644 | 0.690 | 0.737
 | LPIPS ↓ | 0.297 | 0.276 | 0.319 | 0.271 | 0.434 | 0.354 | 0.325
Ours | PSNR ↑ | 19.41 | 19.15 | 21.18 | 22.82 | 19.93 | 20.15 | 20.44
 | SSIM ↑ | 0.735 | 0.726 | 0.795 | 0.821 | 0.742 | 0.745 | 0.761
 | LPIPS ↓ | 0.251 | 0.253 | 0.209 | 0.149 | 0.247 | 0.244 | 0.226
Table 6. Map quality on Replica.

Methods | Metrics | Room0 | Room1 | Room2 | Office0 | Office1 | Office2 | Office3 | Office4 | Avg.
NICE-SLAM [29] | PSNR ↑ | 22.12 | 22.47 | 24.52 | 29.07 | 30.34 | 19.66 | 22.23 | 24.94 | 24.42
 | SSIM ↑ | 0.689 | 0.757 | 0.814 | 0.874 | 0.886 | 0.797 | 0.801 | 0.856 | 0.809
 | LPIPS ↓ | 0.330 | 0.271 | 0.208 | 0.229 | 0.181 | 0.235 | 0.209 | 0.198 | 0.233
Point-SLAM [32] | PSNR ↑ | 32.40 | 34.08 | 35.50 | 38.26 | 39.16 | 33.99 | 33.48 | 33.49 | 35.17
 | SSIM ↑ | 0.974 | 0.977 | 0.982 | 0.983 | 0.986 | 0.960 | 0.960 | 0.979 | 0.975
 | LPIPS ↓ | 0.113 | 0.116 | 0.111 | 0.100 | 0.118 | 0.156 | 0.132 | 0.142 | 0.124
GS-SLAM [20] | PSNR ↑ | 31.56 | 32.86 | 32.59 | 38.70 | 41.17 | 32.36 | 32.03 | 32.92 | 34.27
 | SSIM ↑ | 0.968 | 0.973 | 0.971 | 0.986 | 0.993 | 0.978 | 0.970 | 0.968 | 0.975
 | LPIPS ↓ | 0.094 | 0.075 | 0.093 | 0.050 | 0.033 | 0.094 | 0.110 | 0.112 | 0.082
SplaTAM [19] | PSNR ↑ | 32.65 | 33.68 | 35.11 | 37.95 | 38.82 | 32.08 | 29.99 | 31.77 | 34.01
 | SSIM ↑ | 0.977 | 0.969 | 0.984 | 0.981 | 0.981 | 0.967 | 0.950 | 0.947 | 0.970
 | LPIPS ↓ | 0.070 | 0.102 | 0.075 | 0.087 | 0.097 | 0.098 | 0.119 | 0.156 | 0.101
Ours | PSNR ↑ | 34.55 | 37.08 | 37.92 | 42.62 | 42.66 | 36.19 | 35.93 | 38.08 | 38.13
 | SSIM ↑ | 0.956 | 0.969 | 0.973 | 0.984 | 0.982 | 0.971 | 0.963 | 0.970 | 0.971
 | LPIPS ↓ | 0.057 | 0.045 | 0.050 | 0.028 | 0.037 | 0.048 | 0.049 | 0.048 | 0.045
Table 7. Runtime on Replica room0.

Methods | System FPS ↑ | ATE RMSE [cm] ↓ | PSNR [dB] ↑
Point-SLAM [32] | 0.19 | 0.61 | 32.40
SplaTAM [19] | 0.17 | 0.27 | 32.65
Ours | 10.57 | 0.15 | 34.55
Table 8. Quantitative results of ablation study.

Keyframes | Regularization | PSNR [dB] ↑ | SSIM ↑ | LPIPS ↓
✗ | ✓ | 23.74 | 0.829 | 0.195
✓ | ✗ | 29.06 | 0.840 | 0.245
✓ | ✓ | 34.55 | 0.956 | 0.057

