
RD-SLAM: Real-Time Dense SLAM Using Gaussian Splatting

by Chaoyang Guo, Chunyan Gao *, Yiyang Bai and Xiaoling Lv
School of Mechanical Engineering, Hebei University of Technology, Tianjin 300401, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7767; https://doi.org/10.3390/app14177767
Submission received: 5 August 2024 / Revised: 27 August 2024 / Accepted: 28 August 2024 / Published: 3 September 2024

Abstract

Simultaneous localization and mapping (SLAM) is fundamental for intelligent mobile units to perform diverse tasks. Recent work has shown that integrating neural rendering into SLAM yields promising results in photorealistic environment reconstruction. However, existing methods estimate pose by minimizing the error between rendered and input images, which is time-consuming and cannot run in real time, deviating from the original intention of SLAM. In this paper, we propose a dense RGB-D SLAM system based on 3D Gaussian splatting (3DGS) that employs generalized iterative closest point (G-ICP) for pose estimation. We actively utilize 3D point cloud information to improve the tracking accuracy and operating speed of the system. At the same time, we propose a dual keyframe selection strategy and a corresponding densification method, which effectively reconstruct newly observed scenes and improve the quality of previously constructed maps. In addition, we introduce a regularization loss to address scale explosion of the 3D Gaussians and over-elongation along the camera viewing direction. Experiments on the Replica, TUM-RGBD, and ScanNet datasets show that our method achieves state-of-the-art tracking accuracy and runtime while being competitive in rendering quality.

1. Introduction

Visual simultaneous localization and mapping (SLAM) is a technology in which robots rely on their vision sensors to enable real-time positioning and mapping in unknown environments [1,2]. After decades of development, SLAM has been widely used in fields such as autonomous driving, virtual reality, and augmented reality (AR). Traditional SLAM methods have achieved precise tracking accuracy in a variety of environments while using point clouds [3,4,5,6], meshes [7,8], voxels [9,10,11], or truncated signed distance functions (TSDF) [12,13,14] to represent the scene. However, to adapt to increasingly complex downstream tasks such as autonomous goal navigation and human-computer interaction, SLAM systems need to make breakthroughs in map quality and density so that the constructed maps attain real-world realism.
Neural Radiance Fields (NeRF) [15] have been used in SLAM systems, demonstrating strong novel view inference and high-fidelity map representation capabilities. NeRF is a novel view synthesis method with an implicit scene representation that offers the advantages of realism and continuity. SLAM systems using NeRF typically represent the scene as a multilayer perceptron (MLP), estimating the pose and optimizing the map parameters through differentiable volumetric rendering. However, NeRF-based SLAM requires severely time-consuming ray-based volume rendering to obtain high-resolution images. Moreover, because the scene information is hidden in the MLP, tasks such as scene editing cannot be performed easily [16,17].
3D Gaussian splatting (3DGS) [18] has more potential for dense map representation due to its high-fidelity rendering quality and rendering speeds hundreds of times faster than NeRF. In addition to faster rendering, 3DGS-based scene representation enables direct editing of the scene, which is essential for numerous downstream operations. 3D Gaussians are differentiable volumetric representations that are simultaneously unstructured and explicit, and they can be rasterized very efficiently by projecting them to 2D. With these advantages, 3DGS is well suited to online SLAM systems with real-time requirements. Nevertheless, most existing 3D Gaussian-based SLAM systems [19,20,21,22] determine the optimal pose by iteratively minimizing the photometric error between the rendered and input images, which significantly constrains real-time performance. Based on this, we use the generalized iterative closest point (G-ICP) [23] to replace the photometric error technique in pose estimation and use a simple and efficient 3D point cloud fast tracking method to improve the system's operation speed and tracking accuracy.
In this paper, we present RD-SLAM, a dense RGB-D SLAM system using 3D Gaussians as the only scene representation. We actively utilize the point cloud information acquired by the depth camera to estimate the pose using G-ICP. A dual keyframe selection strategy and a corresponding densification method are proposed to match the characteristics of 3DGS and the mapping requirements of the system, and a regularization loss is added to ensure geometric consistency, achieving a good balance between speed and accuracy. An example of the high-fidelity rendering output of RD-SLAM is shown in Figure 1. Overall, our contributions include the following:
  • We propose a 3DGS-based dense RGB-D SLAM system that utilizes fast rendering techniques to improve the speed of map optimization and achieve realistic map construction.
  • The system uses G-ICP for pose estimation, significantly reducing the time required for tracking while improving positioning accuracy.
  • We propose a dual keyframe selection strategy and corresponding densification method, and add regularization loss to the loss term to ensure tracking accuracy while improving rendering quality.
  • The system was extensively evaluated on various RGB-D datasets, achieving competitive tracking, mapping, and runtime performance.
This paper is organized into six main sections. Section 1 is the introduction, which provides an overview of relevant research and highlights the innovativeness of the proposed methodology. Section 2 reviews the dense visual SLAM research lineage and provides an overview of recent research advances. Section 3 details the theoretical derivation and the concrete implementation of the proposed method, which principally includes 3D Gaussian scene representation and SLAM system construction. Section 4 describes the experimental validation part of the system, including the dataset used and the analysis of the SLAM performance metrics. Section 5 discusses the challenges encountered by RD-SLAM and subsequent improvement plans. Finally, in Section 6, we summarize the work in this paper and look ahead to future work.

2. Related Work

2.1. Dense Visual SLAM

Dense visual SLAM is considered a pivotal approach to solving problems related to scene understanding and autonomous goal navigation. DTAM [24] and KinectFusion [12] were the first systems to realize dense scene reconstruction and have been highly influential for subsequent research. Unlike systems that rely on feature extraction, DTAM implements camera tracking directly through whole-image alignment, using multi-view stereo constraints to update the global point cloud map. KinectFusion uses ICP for pose estimation and aligned depth maps to incrementally update the TSDF and estimate normal maps. Some recent works [25,26,27] have used deep learning networks to improve pose estimation and scene reconstruction accuracy, but the scene representation and overall framework still follow traditional SLAM.

2.2. NeRF-Based SLAM

With the advent of NeRF, dense visual SLAM has made excellent progress in high-fidelity environment reconstruction. iMAP [28] uses an implicit neural radiance field for scene representation, jointly optimizing the map and camera pose through the loss between volume-rendered images and input images. NICE-SLAM [29] enables detailed environment reconstruction of large indoor scenes by introducing a hierarchical scene representation that fuses multi-level local information. ESLAM [30] represents the scene as multi-scale axis-aligned perpendicular feature planes with shallow decoders that decode interpolated features into TSDF and color values for each point in continuous space. Co-SLAM [31] utilizes a multi-resolution hash-grid, exploiting its high convergence speed and ability to represent high-frequency local features. Point-SLAM [32] introduces a neural anchor point representation of the scene that dynamically adjusts the density of points based on the input image information, reducing optimization time and memory usage in less detailed regions. Despite these advantages, NeRF-based SLAM tends to lose scene details due to local over-smoothing and suffers from catastrophic forgetting.

2.3. 3DGS-Based SLAM

Compared with NeRF-based SLAM methods, 3DGS-based SLAM combines the advantages of explicit and implicit representations. SplaTAM [19] simplifies 3DGS by removing view-dependent appearance and using isotropic Gaussians. The system uses silhouette-guided differentiable rendering for incremental 3D map construction. GS-SLAM [20] designs a coarse-to-fine tracking technique for pose estimation, selecting reliable 3D Gaussians to optimize the camera pose. An adaptive expansion strategy is also proposed to effectively add new or remove noisy 3D Gaussians. Gaussian-SLAM [21] analyzes the limitations of 3DGS and proposes an online learning method for 3D Gaussians that achieves good rendering quality on both real-world and synthetic datasets. MonoGS [22] derives the analytic Jacobian of the camera pose with respect to a 3D Gaussian map, allowing the pose to be optimized together with the scene geometry. The system operates at 3 FPS and achieves high-quality tracking and map construction with monocular or RGB-D inputs. However, existing methods rely exclusively on 3D Gaussians for tracking and mapping, which requires a significant amount of time for multiple optimization iterations, preventing real-time operation.

3. Materials and Methods

In this section, we describe the complete SLAM framework in detail; the system overview is shown in Figure 2. Section 3.1 introduces the 3D Gaussian scene representation and the differentiable splatting rasterized rendering method. Section 3.2 presents the G-ICP-based tracking thread and the keyframe selection strategy. Section 3.3 presents the 3D Gaussian adaptive expansion strategy, the pruning strategy, and the loss function construction.

3.1. 3D Gaussian Scene Representation

We represent the underlying map of the environment as a set of 3D Gaussians, and its overall overview is shown in Figure 3. Each 3D Gaussian initialized by the point cloud contains a center $\mu \in \mathbb{R}^3$, a spatial covariance $\Sigma \in \mathbb{R}^{3 \times 3}$, an opacity $\alpha \in \mathbb{R}$, and a color $c$, represented by spherical harmonics for view-dependent radiance. All properties are learnable and optimized by means of back-propagation:

$$G = \{G_i : (\mu_i, \Sigma_i, \alpha_i, c_i)\}, \quad i = 1, \dots, N,$$

where $\Sigma_i = R_i S_i S_i^T R_i^T$, $S \in \mathbb{R}^3$ is a 3D scale vector, and $R \in \mathbb{R}^{3 \times 3}$ is the rotation matrix, which participates in computation as a 4D quaternion. The influence of a single 3D Gaussian on a point $x \in \mathbb{R}^3$ in 3D space is calculated as:

$$f(x) = \alpha \exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$
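For concreteness, the following minimal NumPy sketch (our own illustration; function names such as `gaussian_influence` are not from the paper) builds the covariance $\Sigma = R S S^T R^T$ from a unit quaternion and a scale vector and evaluates the influence function of Equation (2) at a 3D point.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_influence(x, mu, quat, scale, alpha):
    """Evaluate f(x) = alpha * exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)),
    with Sigma = R S S^T R^T as in Equations (1) and (2)."""
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    Sigma = R @ S @ S.T @ R.T
    d = x - mu
    return alpha * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))

# Example: at the Gaussian center the exponent is zero, so f equals the opacity.
f_center = gaussian_influence(np.zeros(3), np.zeros(3),
                              np.array([1.0, 0.0, 0.0, 0.0]),
                              np.array([0.1, 0.1, 0.1]), alpha=0.8)
print(f_center)  # 0.8
```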
To obtain the rendered RGB-D image, we compare it with the input image and optimize the Gaussian parameters. The 3D Gaussians are projected onto the 2D image plane as follows:

$$\mu_{2D} = K\,\frac{E\mu}{d}, \quad \Sigma_{2D} = J E \Sigma E^T J^T, \quad d = (E\mu)_z,$$

where $K$ is the camera intrinsic matrix, $E$ is the extrinsic matrix capturing the rotation and translation of the camera, $J$ is the Jacobian of the affine approximation of the projective transformation, and $d$ is the depth of the 3D Gaussian center obtained by projecting onto the z-axis of the camera coordinate system.
The 3D Gaussians projected onto the image plane are sorted in depth order, and front-to-back blending is performed to synthesize a pixel color $C_p$:

$$C_p = \sum_{i \in N} c_i f_i \prod_{j=1}^{i-1} (1 - f_j).$$

Similarly, the depth $D_p$ is rendered as:

$$D_p = \sum_{i \in N} d_i f_i \prod_{j=1}^{i-1} (1 - f_j).$$
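The blending in Equations (4) and (5) can be illustrated by the short sketch below, which accumulates color and depth for a single pixel over its depth-sorted Gaussians. This is an illustrative NumPy re-implementation, not the CUDA rasterizer used by the system.

```python
import numpy as np

def blend_pixel(colors, depths, f):
    """Front-to-back alpha blending for one pixel (Equations (4) and (5)).
    colors: (N, 3) RGB of the depth-sorted Gaussians covering the pixel,
    depths: (N,) their center depths d_i, f: (N,) their per-pixel opacities f_i."""
    # transmittance[i] = prod_{j < i} (1 - f_j), with transmittance[0] = 1
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - f[:-1])))
    weights = f * transmittance
    C_p = (weights[:, None] * colors).sum(axis=0)  # blended color, Eq. (4)
    D_p = (weights * depths).sum()                 # blended depth, Eq. (5)
    return C_p, D_p
```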

3.2. Tracking

In the tracking thread, we use G-ICP to estimate the camera pose. G-ICP introduces probabilistic information on top of ICP; while maintaining the speed and simplicity of ICP, it accommodates outlier terms, measurement noise, and other probabilistic techniques, significantly improving the algorithm's robustness.
For the first frame of the RGB-D camera input, the tracking step is skipped, and the camera pose is set to the identity. The obtained point cloud is added to G-ICP as the target point cloud, and the pose is estimated from the second input frame onward. G-ICP assumes the existence of underlying point sets $\hat{A} = \{\hat{a}_i\}$ and $\hat{B} = \{\hat{b}_i\}$ that generate the measured point clouds $A = \{a_i\}$ and $B = \{b_i\}$ according to the Gaussian models $a_i \sim \mathcal{N}(\hat{a}_i, C_i^A)$ and $b_i \sim \mathcal{N}(\hat{b}_i, C_i^B)$, where $\hat{a}_i$ and $\hat{b}_i$ are corresponding points, $a_i$ and $b_i$ are the actual measurements of the point locations, and $C_i^A$ and $C_i^B$ are the covariance matrices associated with the measured points. The covariance of a 3D point is computed from its k-nearest neighbors. G-ICP aims to find a transformation that maximally aligns the current frame point cloud $\{a_i\}$ with the map point cloud $\{b_i\}$. For the corresponding points $\hat{a}_i$ and $\hat{b}_i$, there exists a ground-truth transformation $T^* = [R^* \mid t^*]$ such that $\hat{b}_i = T^* \hat{a}_i$. The error term of an arbitrary rigid transformation $T$ is:
$$d_i(T) = b_i - T a_i.$$
Since $a_i$ and $b_i$ are independent and both Gaussian-distributed, $d_i(T^*)$ also follows a Gaussian distribution:

$$d_i(T^*) \sim \mathcal{N}\!\left(\hat{b}_i - T^* \hat{a}_i,\; C_i^B + T^* C_i^A (T^*)^T\right) = \mathcal{N}\!\left(0,\; C_i^B + T^* C_i^A (T^*)^T\right).$$
$T$ can be viewed as the parameter to be estimated in the probability distribution of $d_i(T)$. We use maximum likelihood estimation (MLE) to iteratively compute $T^*$:

$$T^* = \arg\max_T \prod_i p\big(d_i(T)\big) = \arg\max_T \sum_i \log p\big(d_i(T)\big).$$

The above equation can be simplified to:

$$T^* = \arg\min_T \sum_i d_i(T)^T \left(C_i^B + T C_i^A T^T\right)^{-1} d_i(T),$$

and $T^* = [R^* \mid t^*]$ is the camera pose estimated by the system.
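For illustration, the sketch below evaluates the covariance-weighted objective of Equation (9) for a fixed set of correspondences. A full G-ICP solver alternates correspondence search with non-linear minimization of this cost, and the per-point covariances come from the k-nearest-neighbor computation described above; the function and variable names here are our own.

```python
import numpy as np

def gicp_cost(T, src, tgt, cov_src, cov_tgt):
    """Covariance-weighted G-ICP cost of Equation (9) for known correspondences.
    T: 4x4 rigid transform; src, tgt: (N, 3) matched points a_i, b_i;
    cov_src, cov_tgt: (N, 3, 3) per-point covariances C_i^A, C_i^B."""
    R, t = T[:3, :3], T[:3, 3]
    d = tgt - (src @ R.T + t)          # residuals d_i(T) = b_i - T a_i
    M = cov_tgt + R @ cov_src @ R.T    # C_i^B + R C_i^A R^T (translation leaves covariance unchanged)
    # Sum of Mahalanobis terms d_i^T M_i^{-1} d_i over all correspondences
    return sum(di @ np.linalg.solve(Mi, di) for di, Mi in zip(d, M))
```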
Since 3D Gaussians need good observation coverage, and considering the densification strategy, we divide keyframes into tracking and mapping keyframes, as shown in Figure 4. We use G-ICP to calculate the match ratio between the current frame point cloud and the map point cloud. If the match ratio is below a threshold, the frame is selected as a tracking keyframe. If the relative rotation between the current frame and the latest keyframe is above a threshold, the frame is selected as a mapping keyframe. The relative rotation is expressed using Euler angles; to ensure stable results, we sum the Euler angles over the three axes and compare the sum against a single threshold. In tracking keyframes, only 3D Gaussians that do not overlap with the existing map are added to the optimization; conversely, in mapping keyframes, only 3D Gaussians that overlap with the existing map are added to the optimization.
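A minimal sketch of this dual keyframe decision logic is given below; the threshold values are illustrative placeholders, not the tuned values used in our experiments.

```python
def select_keyframe(match_ratio, rel_euler_deg,
                    match_thresh=0.9, rot_thresh_deg=10.0):
    """Dual keyframe selection (Section 3.2).
    match_ratio: fraction of current-frame points matched to the map by G-ICP.
    rel_euler_deg: (roll, pitch, yaw) relative to the latest keyframe, in degrees.
    Thresholds are illustrative placeholders."""
    if match_ratio < match_thresh:
        return "tracking"   # new scene content: add non-overlapping Gaussians
    if sum(abs(a) for a in rel_euler_deg) > rot_thresh_deg:
        return "mapping"    # new viewpoint: refine overlapping Gaussians
    return None             # not a keyframe
```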

3.3. Mapping

RD-SLAM aims to create a coherent, high-fidelity 3D map. The 3D Gaussian scene representation is updated and optimized at each selected keyframe for stable mapping. In the mapping thread, we first use an adaptive densification strategy to add new 3D Gaussians to the scene representation, which is then rendered differentiably to obtain color and depth images. The updated 3D Gaussian scene representation is then optimized by minimizing the color, depth, and regularization losses between the input and rendered images. Finally, anomalous 3D Gaussians are removed using the pruning strategy.
The Gaussian densification process is closely related to the keyframe selection strategy. We use all pixels of the first frame to initialize new 3D Gaussians: for each pixel, we add a new 3D Gaussian with the color of that pixel, centered at the spatial position of that pixel, with an opacity of 0.1. When the tracking thread uses G-ICP for pose estimation, the system simultaneously obtains the matching relationship between the current frame's spatial points and the map's spatial points. If the number of matched spatial points is below a threshold, the current frame is added as a tracking keyframe, and the unmatched spatial points are used to initialize new Gaussians. When the current frame is detected as a mapping keyframe by the dual keyframe selection strategy, the spatial points of the current frame that match the map are used to initialize new 3D Gaussians.
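The sketch below illustrates this initialization for one RGB-D frame: pixels are back-projected with the camera intrinsics and pose, and each valid pixel spawns a Gaussian centered at its 3D point with the pixel color and an opacity of 0.1. The function name and the omission of scale/rotation initialization are our simplifications, not the paper's exact implementation.

```python
import numpy as np

def backproject_and_init(depth, rgb, K, T_wc, init_opacity=0.1):
    """Back-project an RGB-D frame and initialize one Gaussian per valid pixel.
    depth: (H, W) metric depth, rgb: (H, W, 3) colors in [0, 1],
    K: 3x3 intrinsics, T_wc: 4x4 camera-to-world pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    valid = depth > 0                               # skip missing depth
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=-1)
    pts_world = pts_cam @ T_wc[:3, :3].T + T_wc[:3, 3]
    return {
        "mu": pts_world,                            # Gaussian centers
        "color": rgb[valid],                        # per-pixel colors
        "opacity": np.full(len(pts_world), init_opacity),
    }
```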
After the 3D Gaussians are splatted to the image plane, the color loss between the input color image and the rendered image is calculated. We compute the color loss as a weighted combination of $L_1$ and structural similarity index measure (SSIM) [33] losses:

$$L_{color} = (1 - \lambda) \cdot \|\hat{I} - I\|_1 + \lambda \left(1 - SSIM(\hat{I}, I)\right),$$
where $I$ is the original color image, and $\hat{I}$ is the rendered color image. Depth optimization uses an $L_1$ loss:

$$L_{depth} = \|\hat{D} - D\|_1,$$

where $D$ is the original depth image, and $\hat{D}$ is the rendered depth image.
To prevent scale explosion of the 3D Gaussians and over-elongation along the camera viewing direction, we add a scale regularization loss $L_{reg}$. It effectively compensates for the 3D Gaussian scale distortion caused by insufficient observation viewpoints during SLAM. We use an $L_1$ loss for scale optimization:

$$L_{reg} = \|\hat{S} - \tilde{S}\|_1,$$

where $\tilde{S}$ is the average scale, and $\hat{S}$ is the scale of the 3D Gaussian before optimization. Finally, the depth, color, and regularization losses are optimized together:

$$L = \alpha L_{depth} + \beta L_{color} + \gamma L_{reg},$$

where $\alpha$, $\beta$, and $\gamma$ are predefined hyperparameters weighting the depth, color, and scale regularization losses.
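A PyTorch sketch of the combined loss in Equations (10)-(13) is shown below. The SSIM term is assumed to be provided by an external implementation (`ssim_fn`), the loss weights are placeholders, and the regularization term reflects one reading of Equation (12) (pulling each scale toward the mean scale) rather than the exact implementation.

```python
import torch

def total_loss(render_rgb, gt_rgb, render_depth, gt_depth, scales, ssim_fn,
               lam=0.2, alpha=1.0, beta=1.0, gamma=1.0):
    """Combined mapping loss L = alpha*L_depth + beta*L_color + gamma*L_reg
    (Equations (10)-(13)). ssim_fn is an external SSIM implementation; the
    weights and lambda are placeholders, not the paper's tuned values."""
    l_color = (1 - lam) * (render_rgb - gt_rgb).abs().mean() \
              + lam * (1 - ssim_fn(render_rgb, gt_rgb))
    l_depth = (render_depth - gt_depth).abs().mean()
    # One reading of Eq. (12): penalize each Gaussian scale's deviation from
    # the mean scale, discouraging over-elongated Gaussians.
    l_reg = (scales - scales.mean().detach()).abs().mean()
    return alpha * l_depth + beta * l_color + gamma * l_reg
```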
For each keyframe, we perform a fixed number of iterations. First, we iterate on the current keyframe to fully optimize the 3D Gaussians for the current viewpoint. Then, the first n keyframes in the keyframe list that overlap with the current keyframe are selected, and one of them is chosen randomly in each iteration, ensuring that the whole map is fully optimized while avoiding local minima. After a certain number of map iterations, some anomalous Gaussians appear due to the instability of the adaptive Gaussian control. We follow the pruning strategy in [18] while also removing 3D Gaussians whose centers are not near the scene surface.
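The pruning step can be sketched as a simple mask over the per-Gaussian parameters, as below. The opacity, scale, and surface-distance thresholds are illustrative, and the surface-distance test is our stand-in for "not near the scene surface", not the paper's exact criterion.

```python
import torch

def prune_gaussians(params, surface_dist, opacity_min=0.005,
                    scale_max=0.5, surface_max=0.1):
    """Remove anomalous Gaussians after a round of map optimization.
    params: dict of per-Gaussian tensors ('mu', 'scale', 'opacity', ...);
    surface_dist: (N,) distance of each Gaussian center to the nearest
    observed surface point. Thresholds are illustrative placeholders."""
    keep = (params["opacity"] > opacity_min) \
         & (params["scale"].max(dim=-1).values < scale_max) \
         & (surface_dist < surface_max)
    return {k: v[keep] for k, v in params.items()}
```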

4. Results

4.1. Experimental Setup

To evaluate the performance of RD-SLAM, we conduct experiments on the Replica [34] (eight sequences), TUM-RGBD [35] (three sequences), and ScanNet [36] (six sequences) datasets. The Replica dataset consists of synthetic scenes with high-quality RGB and depth images. The TUM-RGBD and ScanNet datasets contain real-world images captured with older, low-quality cameras and exhibit significant noise and blur; in particular, the depth images are sparse, with many regions suffering from information loss. We validate the effectiveness of our method by evaluating it on both the virtual synthetic dataset and the real-world datasets.
We compare RD-SLAM with existing state-of-the-art (SOTA) NeRF-based and 3DGS-based dense visual SLAM systems: NICE-SLAM, Point-SLAM, GS-SLAM, and SplaTAM. When comparing tracking accuracy on the TUM-RGBD dataset, we also compare our approach with three traditional SLAM systems: Kintinuous [37], ElasticFusion [38], and ORB-SLAM2 [5].
We evaluate camera tracking accuracy using the root mean square error of the absolute trajectory error (ATE RMSE) and map quality using the peak signal-to-noise ratio (PSNR), SSIM, and learned perceptual image patch similarity (LPIPS) [39]. For the experimental results, we report the average of three runs. The assessment tables mark the best results in bold and underline the second-best results. RD-SLAM is implemented in Python (v.3.9) using the PyTorch framework, incorporating CUDA code for 3DGS. We run our SLAM on a laptop with an Intel Core i7-10750H (2.60 GHz) and an NVIDIA Quadro RTX 3000.
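For reference, ATE RMSE over an already-aligned trajectory reduces to the few lines below; trajectory alignment itself (e.g., via the Horn/Umeyama method) is assumed to have been performed beforehand, as is standard practice for this metric.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """ATE RMSE between aligned estimated and ground-truth camera positions.
    Both arrays are (N, 3); alignment is assumed to have been applied."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)  # per-frame position error
    return float(np.sqrt((err ** 2).mean()))
```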

4.2. Quantitative Evaluation

4.2.1. Tracking Performance

Table 1 compares the tracking accuracy of RD-SLAM and other SLAM systems on the TUM-RGBD dataset. Our approach achieves SOTA performance among both NeRF-based and 3DGS-based SLAM systems, with an average ATE RMSE of 2.11 cm at real-time operating speeds, 0.93 cm lower than that of the second-best neural method, Point-SLAM. However, due to the many voids in the depth images and the extreme blurring of the RGB images, well-designed feature-based conventional SLAM methods still outperform the neural SLAM methods.
Table 2 shows our tracking performance on selected ScanNet sequences. All neural SLAM methods struggle because the ScanNet depth images are sparse and the RGB images are blurry. Our approach achieves competitive tracking performance.
In Table 3, we compare the tracking accuracy of RD-SLAM with other SOTA methods on the Replica dataset. Our approach achieves the best performance in all eight scenes, with a 59% reduction in trajectory error compared to SplaTAM, which is also a 3DGS-based SLAM system. Compared with other methods, our system exhibits better tracking performance on real-world and virtual synthetic datasets. This is because we actively use the 3D point cloud information for accurate pose estimation directly through G-ICP.

4.2.2. Rendering Performance

Table 4 and Table 5 show the rendering performance of our approach on the real-world datasets. RD-SLAM achieves competitive results averaged over all sequences from the TUM-RGBD and ScanNet datasets. Our method achieves better environment reconstruction in challenging environments where depth images are severely missing because the dual keyframe selection strategy fully adds and optimizes 3D Gaussians. Notably, the RD-SLAM system runs at 10 FPS, more than 60 times faster than SplaTAM. Figure 5 shows the rendering results of the system.
Table 6 compares the rendering performance of RD-SLAM with other neural SLAM methods on the Replica dataset. Our method performs best on most of the evaluated sequences and second best on many of the remainder. RD-SLAM outperforms the second-ranked method, Point-SLAM, by 2.96 dB in PSNR. This excellent rendering and real-time performance allows RD-SLAM to be applied more easily to autonomous driving, virtual reality, AR, and other fields.
The visualization results in Figure 6 show that RD-SLAM generates higher-quality and more realistic images than previous SOTA methods, avoiding ghosting and local reconstruction errors. In the rendered results, Point-SLAM is overly smooth and fails to reconstruct fine details, while SplaTAM produces more apparent holes in some areas, as shown in the second, fourth, and sixth rows of Figure 6.

4.2.3. Runtime Analysis

We show the runtime comparison of RD-SLAM with other systems in Table 7. Owing to the efficiency of 3D Gaussian rasterization and the G-ICP front-end, RD-SLAM runs 55 times faster than Point-SLAM and 60 times faster than SplaTAM, achieving the best tracking accuracy and rendering quality at the fastest run speed.

4.3. Ablation Study

We perform ablation experiments with RD-SLAM on the room0 sequence of the Replica dataset to evaluate the effectiveness of the dual keyframe selection strategy and scale regularization. As shown in Figure 7, we exclude the mapping keyframes and the scale regularization, respectively, to verify their effects on the system. Without mapping keyframes, the rendered images contain voids at the edges because the densification strategy cannot be applied effectively. With scale regularization removed, the scale of the 3D Gaussians is not effectively constrained, and the system cannot render the fine details of the scene.
Table 8 shows the quantitative ablation results for the dual keyframe selection strategy and scale regularization. With only tracking keyframes, the rendering quality of our method degrades drastically because 3D Gaussians representing the scene cannot be added effectively. In contrast, our proposed dual keyframe selection strategy improves the PSNR by 10.81 dB by adding more accurate 3D Gaussians. Compared with supervising using only color and depth losses, the scale regularization loss significantly improves mapping performance through precise shape constraints, decreasing the LPIPS by 0.188. Our complete implementation yields higher-quality and more detailed rendering results.

5. Discussion

During our study, we found that the performance of RD-SLAM depends on the quality of the RGB-D images, and the system's effectiveness may be compromised with a poor camera or in environments where high-quality images are unavailable. Methods such as [40,41] can quickly reconstruct high-quality images with their lightweight and efficient model architectures. We plan to construct image pyramids in which the 3D Gaussians are first optimized with the low-resolution images captured by the camera and then supervised with reconstructed high-resolution images as ground truth. We hope to use these methods in future work to reduce the effect of low-quality input images and improve the robustness of the system.
In addition, we plan to extend RD-SLAM to make it more adaptable to large-scale scenes. The two main challenges that we will face are the large number of 3D Gaussians that need to be optimized and the enormous amount of memory required to store the parameters. To address the problem of optimizing too many 3D Gaussians, we propose using a fixed number of keyframes to compose the sub-map. The efficiency and real-time performance of the system are ensured by expanding and optimizing only one active sub-map at any given time. To address the problem of excessive memory consumption, we plan to implement more accurate initialization and pruning strategies so that a single Gaussian better fits the object surface without the need for multiple overlapping Gaussians, thus reducing memory costs. These improvements can adapt the system to increasingly complex downstream tasks.

6. Conclusions

This paper proposes a new dense visual SLAM system, RD-SLAM, which utilizes 3D Gaussians as the underlying map representation while using G-ICP for fast tracking. The proposed dual keyframe selection strategy and scale regularization enable RD-SLAM to reconstruct realistic, dense maps dynamically. The experimental results on the Replica dataset show that the proposed system runs more than 60 times faster than SplaTAM, which is also a 3DGS-based SLAM system. We demonstrate through extensive experiments that our method achieves SOTA performance in terms of localization, reconstruction, and system speed. Overall, RD-SLAM provides a promising solution for dense real-time visual SLAM by effectively combining 3DGS, G-ICP, a dual keyframe selection strategy, and a scale regularization loss.

Author Contributions

Conceptualization, C.G. (Chaoyang Guo), C.G. (Chunyan Gao) and X.L.; methodology, C.G. (Chunyan Gao) and X.L.; software, C.G. (Chaoyang Guo) and Y.B.; validation, C.G. (Chaoyang Guo), C.G. (Chunyan Gao) and Y.B.; formal analysis, X.L.; investigation, C.G. (Chaoyang Guo) and Y.B.; resources, C.G. (Chunyan Gao) and X.L.; data curation, C.G. (Chunyan Gao); writing—original draft preparation, C.G. (Chaoyang Guo) and C.G. (Chunyan Gao); writing—review and editing, C.G. (Chunyan Gao); visualization, Y.B.; supervision, X.L.; project administration, Y.B. and X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under grant No. 2022YFB4701100 and by the National Natural Science Foundation of China under grant No. U1913211.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping: Towards the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  2. Abaspur Kazerouni, I.; Fitzgerald, L.; Dooly, G.; Toal, D. A Survey of State-of-the-Art on Visual SLAM. Expert Syst. Appl. 2022, 205, 117734. [Google Scholar] [CrossRef]
  3. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
  4. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  5. Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  6. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  7. Ruetz, F.; Hernández, E.; Pfeiffer, M.; Oleynikova, H.; Cox, M.; Lowe, T.; Borges, P. OVPC Mesh: 3D Free-Space Representation for Local Ground Vehicle Navigation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8648–8654. [Google Scholar]
  8. Schöps, T.; Sattler, T.; Pollefeys, M. SurfelMeshing: Online Surfel-Based Mesh Reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2494–2507. [Google Scholar] [CrossRef] [PubMed]
  9. Nießner, M.; Zollhöfer, M.; Izadi, S.; Stamminger, M. Real-Time 3D Reconstruction at Scale Using Voxel Hashing. ACM Trans. Graph. 2013, 32, 1–11. [Google Scholar] [CrossRef]
  10. Kahler, O.; Prisacariu, V.; Valentin, J.; Murray, D. Hierarchical Voxel Block Hashing for Efficient Integration of Depth Images. IEEE Robot. Autom. Lett. 2016, 1, 192–197. [Google Scholar] [CrossRef]
  11. Dai, A.; Nießner, M.; Zollhöfer, M.; Izadi, S.; Theobalt, C. BundleFusion: Real-Time Globally Consistent 3D Reconstruction Using On-the-Fly Surface Reintegration. ACM Trans. Graph. 2017, 36, 76a:1. [Google Scholar] [CrossRef]
  12. Newcombe, R.A.; Fitzgibbon, A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; pp. 127–136. [Google Scholar]
  13. Whelan, T.; Johannsson, H.; Kaess, M.; Leonard, J.J.; McDonald, J. Robust Real-Time Visual Odometry for Dense RGB-D Mapping. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 6–10 May 2013; pp. 5724–5731. [Google Scholar]
  14. Weder, S.; Schonberger, J.L.; Pollefeys, M.; Oswald, M.R. NeuralFusion: Online Depth Fusion in Latent Space. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 3162–3172. [Google Scholar]
  15. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In Proceedings of the 2020 European Conference on Computer Vision(ECCV), Online, 23–28 August 2020; pp. 405–421. [Google Scholar]
  16. Chen, G.; Wang, W. A Survey on 3D Gaussian Splatting. arXiv 2024, arXiv:2401.03890. [Google Scholar]
  17. Tosi, F.; Zhang, Y.; Gong, Z.; Sandström, E.; Mattoccia, S.; Oswald, M.R.; Poggi, M. How NeRFs and 3D Gaussian Splatting are Reshaping SLAM: A Survey. arXiv 2024, arXiv:2402.13255. [Google Scholar]
  18. Kerbl, B.; Kopanas, G.; Leimkuehler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  19. Keetha, N.; Karhade, J.; Jatavallabhula, K.M.; Yang, G.; Scherer, S.; Ramanan, D.; Luiten, J. SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 21357–21366. [Google Scholar]
  20. Yan, C.; Qu, D.; Wang, D.; Xu, D.; Wang, Z.; Zhao, B.; Li, X. GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting. arXiv 2023, arXiv:2311.11700. [Google Scholar]
  21. Yugay, V.; Li, Y.; Gevers, T.; Oswald, M.R. Gaussian-SLAM: Photo-realistic Dense SLAM with Gaussian Splatting. arXiv 2023, arXiv:2312.10070. [Google Scholar]
  22. Matsuki, H.; Murai, R.; Kelly, P.H.J.; Davison, A.J. Gaussian Splatting SLAM. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 18039–18048. [Google Scholar]
  23. Segal, A.; Haehnel, D.; Thrun, S. Generalized-ICP. In Proceedings of the Robotics: Science and Systems, Seattle, WA, USA, 28 June–1 July 2009. [Google Scholar]
  24. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense Tracking and Mapping in Real-Time. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2320–2327. [Google Scholar]
  25. Bloesch, M.; Czarnowski, J.; Clark, R.; Leutenegger, S.; Davison, A.J. CodeSLAM—Learning a Compact, Optimisable Representation for Dense Visual SLAM. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2560–2568. [Google Scholar]
  26. Li, R.; Wang, S.; Gu, D. DeepSLAM: A Robust Monocular SLAM System With Unsupervised Deep Learning. IEEE Trans. Ind. Electron. 2021, 68, 3577–3587. [Google Scholar] [CrossRef]
  27. Teed, Z.; Deng, J. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In Proceedings of the 2021 International Conference on Neural Information Processing Systems, Online, 6–14 December 2021; pp. 16558–16569. [Google Scholar]
  28. Sucar, E.; Liu, S.; Ortiz, J.; Davison, A.J. iMAP: Implicit Mapping and Positioning in Real-Time. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6209–6218. [Google Scholar]
  29. Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. NICE-SLAM: Neural Implicit Scalable Encoding for SLAM. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12776–12786. [Google Scholar]
  30. Johari, M.M.; Carta, C.; Fleuret, F. ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 17408–17419. [Google Scholar]
  31. Wang, H.; Wang, J.; Agapito, L. Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13293–13302. [Google Scholar]
  32. Sandström, E.; Li, Y.; Van Gool, L.; Oswald, M.R. Point-SLAM: Dense Neural Point Cloud-Based SLAM. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 18387–18398. [Google Scholar]
  33. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  34. Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J.J.; Mur-Artal, R.; Ren, C.; Verma, S.; et al. The Replica Dataset: A Digital Replica of Indoor Spaces. arXiv 2019, arXiv:1906.05797. [Google Scholar]
  35. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A Benchmark for the Evaluation of RGB-D SLAM Systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
  36. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2432–2443. [Google Scholar]
  37. Whelan, T.; Kaess, M.; Johannsson, H.; Fallon, M.; Leonard, J.J.; McDonald, J. Real-time Large Scale Dense RGB-D SLAM with Volumetric Fusion. Int. J. Robot. Res. 2015, 34, 598–626. [Google Scholar] [CrossRef]
  38. Whelan, T.; Leutenegger, S.; Salas Moreno, R.; Glocker, B.; Davison, A. ElasticFusion: Dense SLAM without a Pose Graph. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015. [Google Scholar]
  39. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  40. Li, Z.; Liu, Y.; Chen, X.; Cai, H.; Gu, J.; Qiao, Y.; Dong, C. Blueprint Separable Residual Network for Efficient Image Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 832–842. [Google Scholar]
  41. Mardieva, S.; Ahmad, S.; Umirzakova, S.; Rasool, M.A.; Whangbo, T.K. Lightweight Image Super-Resolution for IoT Devices Using Deep Residual Feature Distillation Network. Knowl.-Based Syst. 2024, 285, 111343. [Google Scholar] [CrossRef]
Figure 1. Our method reconstructs scene details more accurately than existing systems such as Point-SLAM and SplaTAM, and at the same time, it operates at speeds up to 10 FPS. The second row shows zoomed-in details of the black squares.
Figure 2. Overview of RD-SLAM. Our approach takes RGB-D frames as input and initializes the point cloud by reprojecting the RGB-D image, and then performs an alignment algorithm using G-ICP to obtain the current frame pose. The dual keyframe selection strategy determines whether the current frame is a tracking or mapping keyframe, which initializes new 3D Gaussians to perform the densification operation. The Gaussian parameters are optimized by applying depth, color, and regularization loss to the rendered RGB-D frames, and a pruning operation is performed after a fixed number of iterations to remove anomalous Gaussians. Finally, a dense 3D environment map is obtained.
Figure 3. 3D Gaussian operation flow process.
Figure 4. Dual keyframe selection strategy. By dividing the keyframes into tracking and mapping keyframes, RD-SLAM can accomplish the densification strategy while guaranteeing tracking accuracy and, simultaneously, allowing the 3D Gaussian to be fully trained.
Figure 5. RD-SLAM rendering visualization results on the TUM-RGBD dataset.
Figure 6. Comparison of the rendering visualization results of RD-SLAM proposed in this paper with other SOTA methods on three sequences in the Replica dataset. The second, fourth, and sixth rows show zoomed-in details of the colored squares.
Figure 7. Dual keyframe selection strategy and scale regularization for ablation experiments. The second row shows zoomed-in details of the colored squares.
Table 1. Camera tracking results on the TUM-RGBD dataset (ATE RMSE ↓ [cm]).

Methods | fr1/desk | fr2/xyz | fr3/office | Avg.
Kintinuous [37] | 3.70 | 2.90 | 3.00 | 3.20
ElasticFusion [38] | 2.53 | 1.17 | 2.52 | 1.50
ORB-SLAM2 [5] | 1.60 | 0.40 | 1.00 | 1.00
NICE-SLAM [29] | 4.26 | 31.73 | 3.87 | 15.87
Point-SLAM [32] | 4.34 | 1.31 | 3.48 | 3.04
GS-SLAM [20] | 3.30 | 1.30 | 6.60 | 3.73
SplaTAM [19] | 3.34 | 1.34 | 5.18 | 3.29
Ours | 2.29 | 1.80 | 2.24 | 2.11
Table 2. Camera tracking results on the ScanNet dataset (ATE RMSE ↓ [cm]).

Methods | 0000 | 0059 | 0106 | 0169 | 0181 | 0207 | Avg.
NICE-SLAM [29] | 12.00 | 14.00 | 7.90 | 10.90 | 13.40 | 6.20 | 10.70
Point-SLAM [32] | 10.24 | 7.81 | 8.65 | 22.16 | 14.77 | 9.54 | 12.19
SplaTAM [19] | 11.53 | 9.99 | 17.86 | 12.49 | 9.61 | 7.81 | 11.55
Ours | 10.73 | 12.56 | 9.34 | 11.62 | 10.67 | 11.69 | 11.10
Table 3. Camera tracking results on the Replica dataset (ATE RMSE ↓ [cm]).

Methods | Room0 | Room1 | Room2 | Office0 | Office1 | Office2 | Office3 | Office4 | Avg.
NICE-SLAM [29] | 0.97 | 1.31 | 1.07 | 0.88 | 1.00 | 1.06 | 1.10 | 1.13 | 1.06
Point-SLAM [32] | 0.61 | 0.41 | 0.37 | 0.38 | 0.48 | 0.54 | 0.69 | 0.72 | 0.52
GS-SLAM [20] | 0.48 | 0.53 | 0.33 | 0.52 | 0.41 | 0.59 | 0.46 | 0.70 | 0.50
SplaTAM [19] | 0.27 | 0.41 | 0.32 | 0.55 | 0.23 | 0.36 | 0.32 | 0.53 | 0.37
Ours | 0.15 | 0.16 | 0.10 | 0.18 | 0.12 | 0.16 | 0.16 | 0.20 | 0.15
Table 4. Map quality on TUM-RGBD.

Methods | Metrics | fr1/desk | fr2/xyz | fr3/office | Avg.
Point-SLAM [32] | PSNR ↑ | 13.79 | 17.62 | 18.29 | 16.57
 | SSIM ↑ | 0.625 | 0.710 | 0.749 | 0.695
 | LPIPS ↓ | 0.545 | 0.584 | 0.452 | 0.527
SplaTAM [19] | PSNR ↑ | 21.49 | 25.06 | 21.17 | 22.57
 | SSIM ↑ | 0.839 | 0.950 | 0.861 | 0.883
 | LPIPS ↓ | 0.255 | 0.099 | 0.221 | 0.192
Ours | PSNR ↑ | 20.79 | 23.17 | 21.39 | 21.78
 | SSIM ↑ | 0.758 | 0.829 | 0.785 | 0.791
 | LPIPS ↓ | 0.227 | 0.143 | 0.198 | 0.189
Table 5. Map quality on ScanNet.

Methods | Metrics | 0000 | 0059 | 0106 | 0169 | 0181 | 0207 | Avg.
Point-SLAM [32] | PSNR ↑ | 21.13 | 19.57 | 17.01 | 18.29 | 22.36 | 20.72 | 19.85
 | SSIM ↑ | 0.801 | 0.765 | 0.684 | 0.686 | 0.824 | 0.751 | 0.752
 | LPIPS ↓ | 0.488 | 0.498 | 0.537 | 0.539 | 0.471 | 0.542 | 0.510
SplaTAM [19] | PSNR ↑ | 19.69 | 19.38 | 19.18 | 23.33 | 16.42 | 19.90 | 19.65
 | SSIM ↑ | 0.748 | 0.793 | 0.743 | 0.804 | 0.644 | 0.690 | 0.737
 | LPIPS ↓ | 0.297 | 0.276 | 0.319 | 0.271 | 0.434 | 0.354 | 0.325
Ours | PSNR ↑ | 19.41 | 19.15 | 21.18 | 22.82 | 19.93 | 20.15 | 20.44
 | SSIM ↑ | 0.735 | 0.726 | 0.795 | 0.821 | 0.742 | 0.745 | 0.761
 | LPIPS ↓ | 0.251 | 0.253 | 0.209 | 0.149 | 0.247 | 0.244 | 0.226
Table 6. Map quality on Replica.

Methods | Metrics | Room0 | Room1 | Room2 | Office0 | Office1 | Office2 | Office3 | Office4 | Avg.
NICE-SLAM [29] | PSNR ↑ | 22.12 | 22.47 | 24.52 | 29.07 | 30.34 | 19.66 | 22.23 | 24.94 | 24.42
 | SSIM ↑ | 0.689 | 0.757 | 0.814 | 0.874 | 0.886 | 0.797 | 0.801 | 0.856 | 0.809
 | LPIPS ↓ | 0.330 | 0.271 | 0.208 | 0.229 | 0.181 | 0.235 | 0.209 | 0.198 | 0.233
Point-SLAM [32] | PSNR ↑ | 32.40 | 34.08 | 35.50 | 38.26 | 39.16 | 33.99 | 33.48 | 33.49 | 35.17
 | SSIM ↑ | 0.974 | 0.977 | 0.982 | 0.983 | 0.986 | 0.960 | 0.960 | 0.979 | 0.975
 | LPIPS ↓ | 0.113 | 0.116 | 0.111 | 0.100 | 0.118 | 0.156 | 0.132 | 0.142 | 0.124
GS-SLAM [20] | PSNR ↑ | 31.56 | 32.86 | 32.59 | 38.70 | 41.17 | 32.36 | 32.03 | 32.92 | 34.27
 | SSIM ↑ | 0.968 | 0.973 | 0.971 | 0.986 | 0.993 | 0.978 | 0.970 | 0.968 | 0.975
 | LPIPS ↓ | 0.094 | 0.075 | 0.093 | 0.050 | 0.033 | 0.094 | 0.110 | 0.112 | 0.082
SplaTAM [19] | PSNR ↑ | 32.65 | 33.68 | 35.11 | 37.95 | 38.82 | 32.08 | 29.99 | 31.77 | 34.01
 | SSIM ↑ | 0.977 | 0.969 | 0.984 | 0.981 | 0.981 | 0.967 | 0.950 | 0.947 | 0.970
 | LPIPS ↓ | 0.070 | 0.102 | 0.075 | 0.087 | 0.097 | 0.098 | 0.119 | 0.156 | 0.101
Ours | PSNR ↑ | 34.55 | 37.08 | 37.92 | 42.62 | 42.66 | 36.19 | 35.93 | 38.08 | 38.13
 | SSIM ↑ | 0.956 | 0.969 | 0.973 | 0.984 | 0.982 | 0.971 | 0.963 | 0.970 | 0.971
 | LPIPS ↓ | 0.057 | 0.045 | 0.050 | 0.028 | 0.037 | 0.048 | 0.049 | 0.048 | 0.045
Table 7. Runtime on Replica room0.

Methods | System FPS ↑ | ATE RMSE [cm] ↓ | PSNR [dB] ↑
Point-SLAM [32] | 0.19 | 0.61 | 32.40
SplaTAM [19] | 0.17 | 0.27 | 32.65
Ours | 10.57 | 0.15 | 34.55
Table 8. Quantitative results of ablation study.

Keyframes | Regularization | PSNR [dB] ↑ | SSIM ↑ | LPIPS ↓
✗ | ✓ | 23.74 | 0.829 | 0.195
✓ | ✗ | 29.06 | 0.840 | 0.245
✓ | ✓ | 34.55 | 0.956 | 0.057

