To evaluate the performance of the improved ORB-SLAM3 algorithm, experiments were conducted on a high-performance system equipped with an Intel Core i9-14900HX CPU (Lenovo, Beijing, China; 24 cores, 32 threads, maximum turbo frequency of 5.8 GHz), 32 GB of DDR5-5600 MHz RAM (dual-channel, 2 × 16 GB), and an NVIDIA GeForce RTX 4060 Laptop GPU (NVIDIA, Santa Clara, CA, USA). The system ran Ubuntu 18.04, and the SLAM program was developed in C++ (https://visualstudio.microsoft.com, accessed on 21 June 2024). To ensure fairness and reproducibility, all experiments were performed under consistent conditions, with the same hardware platform, software environment, and parameter configurations used for both the proposed algorithm and the comparison methods.
4.1. Experiment on the Improved YOLOv8 Object Detection Algorithm
To evaluate the effectiveness of the YOLOv8 algorithm enhanced by the MSF convolution module, this experiment utilizes the VOC2007 dataset [
23], a standard benchmark for object detection, accessible at
http://host.robots.ox.ac.uk/pascal/VOC/voc2007/ (accessed on 21 June 2024). The dataset contains 20 categories, including person, car, sheep, and others. It is split into training, validation, and test sets in a 6:2:2 ratio. The training set consists of 5977 images, the validation set includes 1993 images, and the test set also comprises 1993 images.
Object detection algorithms are commonly evaluated using the mean Average Precision (mAP) metric. A higher mAP value indicates better model performance. The mAP is calculated as follows:
$$\mathrm{mAP}=\frac{1}{c}\sum_{j=1}^{c} AP_{j} \tag{16}$$
In Equation (16), $AP_{j}$ represents the average precision for a specific class, where $c$ denotes the total number of sample categories and $j$ indexes the current class.
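As a minimal worked example of Equation (16), the short Python sketch below averages per-class AP values; the class names and AP numbers are illustrative placeholders, not results from our experiments.

```python
# Worked example of Equation (16): mAP is the mean of the per-class AP values.
# The AP values below are placeholders for illustration only.
average_precision = {"person": 0.81, "car": 0.74, "sheep": 0.62}

c = len(average_precision)                 # total number of classes
mAP = sum(average_precision.values()) / c  # (1/c) * sum_j AP_j
print(f"mAP over {c} classes: {mAP:.3f}")  # 0.723
```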
The detection results obtained using the improved YOLOv8 are presented in
Figure 4. As shown in
Figure 4a,b, even small and distant objects, which are barely visible, are accurately recognized.
Building on the preliminary results presented in
Table 3, this section provides a detailed analysis of the performance of the proposed YOLOv8n+MSFConv model. The evaluation focuses on standard metrics, including precision (P), recall (R), mean Average Precision at IoU = 0.5 (mAP@50), and mAP across IoU thresholds from 0.5 to 0.95 (mAP@50:95), to quantify detection accuracy. Additionally, we examine the computational complexity through parameters (Params), GFLOPs, and inference speed (Frames Per Second, FPS) to assess the trade-offs associated with the improved detection performance. All experiments were conducted on an NVIDIA GeForce RTX 4060 Laptop GPU, ensuring a consistent hardware environment for fair comparison.
Table 3 highlights that YOLOv8n+MSFConv achieves a precision of 74.2%, marking a 6.31% improvement over the baseline YOLOv8n’s 69.8%. The mAP@50 increases from 64.0% to 65.1% (a 1.72% improvement), and the mAP@50:95 rises from 42.9% to 43.5% (a 1.40% improvement). These gains indicate that the multi-scale feature extraction capability of MSFConv effectively captures diverse object features, leading to more accurate detection in dynamic environments. However, this enhanced performance comes with a modest increase in computational complexity. The parameter count of YOLOv8n+MSFConv is 3.19 million, a 6.0% increase compared to YOLOv8n’s 3.01 million, and the GFLOPs rise from 8.1 to 8.9, reflecting a 9.9% increase in computational demand. Consequently, the inference speed decreases, with the FPS dropping from 275.1 for YOLOv8n to 233.1 for YOLOv8n+MSFConv, a 15.3% reduction. Despite this trade-off, the FPS of 233.1 remains well above the threshold of 30 FPS typically required for real-time SLAM applications, ensuring practical applicability in dynamic scenarios.
The superior performance of YOLOv8n+MSFConv can be attributed to its multi-scale feature consolidation mechanism, which leverages parallel convolutional branches (3 × 3, 5 × 5, and 7 × 7) to extract features at varying scales. This approach enhances the model’s ability to detect objects of different sizes and shapes, a critical requirement in dynamic indoor scenes where objects like people or furniture may vary significantly in scale. The slight decrease in recall (from 58.9% to 57.8%) suggests that while MSFConv improves precision by reducing false positives, it may occasionally miss some true positives, possibly due to the increased model complexity affecting feature generalization. Future optimizations could focus on balancing precision and recall to further improve overall performance.
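To make the parallel-branch idea concrete, the following PyTorch sketch shows one plausible layout of such a block, with 3 × 3, 5 × 5, and 7 × 7 depthwise separable branches fused by a 1 × 1 convolution. It is an illustrative approximation, not the exact MSFConv implementation; the channel split, normalization, and fusion choices are our own assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Illustrative multi-scale block: parallel 3x3/5x5/7x7 depthwise separable
    branches whose outputs are concatenated and fused by a 1x1 convolution.
    A sketch of the general idea, not the paper's exact MSFConv module."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # Depthwise convolution at one receptive-field scale.
                nn.Conv2d(in_channels, in_channels, k, padding=k // 2,
                          groups=in_channels, bias=False),
                # Pointwise convolution completes the separable pair.
                nn.Conv2d(in_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.SiLU(inplace=True),
            )
            for k in (3, 5, 7)
        ])
        # 1x1 fusion of the concatenated branch outputs (placeholder choice).
        self.fuse = nn.Conv2d(3 * out_channels, out_channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))


if __name__ == "__main__":
    block = MultiScaleBlock(64, 64)
    y = block(torch.randn(1, 64, 80, 80))
    print(y.shape)  # torch.Size([1, 64, 80, 80])
```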
To contextualize the proposed Multi-Scale Feature Consolidation (MSFConv) module, we compare it with established multi-scale feature extraction methods—Feature Pyramid Networks (FPNs) [
21], HRNet [
22], and Wu’s SLAM-specific approach [
17]—focusing on detection accuracy, computational efficiency, and real-time suitability for SLAM.
Table 4 summarizes the results. YOLOv8n+MSFConv achieves a precision of 74.2%, mAP@50 of 65.1%, and mAP@50:95 of 43.5%, with 3.19 M parameters, 8.9 GFLOPs, and 233.1 FPS. YOLOv8n+FPN, with hierarchical feature fusion, improves precision to 76.1% and mAP@50 to 66.8%, but its added complexity reduces throughput to 198.4 FPS. YOLOv8n+HRNet, which maintains high-resolution features, excels with 77.3% precision and 67.5% mAP@50, yet its 10.2 M parameters and 12.5 GFLOPs limit it to 165.2 FPS. Wu’s method [
17], tailored for dynamic SLAM, scores 73.5% precision and 64.8% mAP@50, with 3.45 M parameters and 9.8 GFLOPs, resulting in 210.6 FPS.
MSFConv’s parallel multi-scale branches and depthwise separable convolutions balance accuracy and efficiency, outperforming Wu’s method in both detection accuracy and speed while remaining lighter than FPN and HRNet. Its 233.1 FPS ensures real-time SLAM applicability, unlike the heavier alternatives. Although FPN and HRNet offer superior accuracy, their computational cost limits their practicality in dynamic SLAM. MSFConv thus provides an effective trade-off, with potential for future hybrid enhancements.
4.2. Pose Estimation Error Analysis Experiment
- (1)
TUM Dataset
To evaluate the performance of the proposed algorithm, we utilized the TUM RGB-D dataset [
24], a widely recognized benchmark for visual SLAM systems, accessible at
https://vision.in.tum.de/data/datasets/rgbd-dataset (accessed on 21 June 2024). This dataset provides synchronized RGB and depth image pairs captured using a Microsoft Kinect sensor at a resolution of 640 × 480 pixels and a frame rate of 30 Hz, alongside ground truth camera poses for accuracy assessment. The dataset includes a variety of indoor sequences designed to test SLAM algorithms under different conditions, such as static scenes, slow motions, and highly dynamic environments. In our experiments, we selected sequences from the “fr3” category, which are captured in a typical office environment and divided into two primary scene types: “sitting” and “walking”. The “sitting” sequences feature slow actions, such as gesturing and conversing at a desk with minimal object motion, while the “walking” sequences involve significant indoor movements, including people walking, which introduce dynamic elements challenging for SLAM systems.
Each scene type includes three distinct camera motion patterns: “halfsphere”, where the camera moves in a hemispherical trajectory approximately 1 m in diameter; “static”, where the camera remains stationary; and “XYZ”, where the camera translates along the x, y, and z axes while maintaining a consistent orientation. These motion patterns test the algorithm’s robustness across varying degrees of camera dynamics. Specifically, we evaluated the following sequences:
- (a)
sitting_static: Stationary camera with minor gestures.
- (b)
sitting_xyz: Translational camera motion in a low-dynamic setting.
- (c)
walking_half: Hemispherical camera path with multiple moving people.
- (d)
walking_static: Stationary camera observing walking individuals.
- (e)
walking_xyz: Translational camera motion with significant dynamic activity.
Our evaluation used all available frames in each sequence, as provided in the TUM RGB-D dataset, to ensure comprehensive testing across motion types. The “sitting” sequences (totaling roughly 1700 frames across the static and XYZ variants) represent low-dynamic conditions, while the “walking” sequences (totaling roughly 3100 frames across the halfsphere, static, and XYZ variants) provide high-dynamic challenges, allowing us to assess MSF-SLAM’s performance in diverse scenarios.
The ground truth information in the TUM RGB-D dataset was collected using a high-precision motion capture system consisting of eight infrared cameras operating at 100 Hz. This system tracks reflective markers attached to the Kinect sensor, providing accurate 6-DoF (degrees of freedom) pose estimates with sub-millimeter precision. The recorded RGB-D frames were temporally aligned with the motion capture data using timestamps, ensuring that each frame has a corresponding ground truth pose. This setup enables reliable evaluation of SLAM algorithms by comparing estimated trajectories against these externally measured poses, particularly valuable in dynamic scenes where internal sensor drift could otherwise skew results.
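For readers reproducing this evaluation, the sketch below illustrates the usual nearest-timestamp association between estimated and ground-truth poses in the TUM text format; the file layout and the 20 ms tolerance follow the benchmark’s conventions, while the function names are our own placeholders.

```python
# Sketch: associate estimated poses with TUM ground-truth poses by timestamp.
# Assumed file format: "timestamp tx ty tz qx qy qz qw" per line, '#' for comments.
def load_trajectory(path):
    poses = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            values = [float(v) for v in line.split()]
            poses[values[0]] = values[1:]  # timestamp -> [tx, ty, tz, qx, qy, qz, qw]
    return poses

def associate(estimated, ground_truth, max_diff=0.02):
    """Match each estimated timestamp to the closest ground-truth timestamp
    within max_diff seconds (20 ms, the benchmark tools' usual tolerance)."""
    gt_times = sorted(ground_truth)
    matches = []
    for t_est in sorted(estimated):
        t_gt = min(gt_times, key=lambda t: abs(t - t_est))
        if abs(t_gt - t_est) <= max_diff:
            matches.append((t_est, t_gt))
    return matches
```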
To ensure a fair comparison, all algorithms (including ORB-SLAM3 and other baselines) were tested on the same sequences—e.g., “walking_half” and “walking_xyz”—using identical parameter configurations, such as feature extraction thresholds and optimization settings. This consistency allows us to isolate the impact of MSF-SLAM’s enhancements on pose estimation accuracy across the specified camera motion types.
- (2)
Experimental Results Analysis
To assess the experimental results, we employed two primary metrics: absolute trajectory error (ATE) and relative pose error (RPE), which quantify the accuracy of pose estimation relative to the ground truth. For each metric, we report the root mean square error (RMSE), standard deviation (STD), mean error, and median error to provide a thorough evaluation of the algorithm’s performance.
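For reference, the following sketch shows how these ATE statistics can be computed from associated position pairs; it assumes the trajectories have already been timestamp-associated and rigidly aligned (e.g., via a Horn/Umeyama fit), which is omitted here.

```python
import numpy as np

def ate_statistics(estimated_xyz: np.ndarray, ground_truth_xyz: np.ndarray) -> dict:
    """Absolute trajectory error statistics over associated, aligned positions.
    Both inputs are (N, 3) arrays of camera translations in the same frame."""
    errors = np.linalg.norm(estimated_xyz - ground_truth_xyz, axis=1)
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mean": float(np.mean(errors)),
        "median": float(np.median(errors)),
        "std": float(np.std(errors)),
    }
```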
Table 5,
Table 6 and
Table 7 present the ATE, RPE, and absolute error comparisons between the proposed method and other algorithms.
As shown in
Table 5 and
Table 6, the improved algorithm incorporates dynamic feature point rejection into the traditional ORB-SLAM framework, achieving significant enhancements over ORB-SLAM3 in highly dynamic scenarios (e.g., the walking_half and walking_xyz datasets). Specifically, the ATE RMSE is reduced by 93.34% and 94.43%, while the RPE RMSE is reduced by 45.8% and 68.31%, effectively mitigating the impact of dynamic objects on global trajectory estimation and local pose estimation. These improvements can be attributed to the integration of the multi-scale convolutional module (MSFConv), which enhances the detection of fast-moving objects, and the LK optical flow method, which accurately filters dynamic feature points.
In low-dynamic scenarios, such as sitting_static and sitting_xyz, where dynamic objects are almost stationary, ORB-SLAM3 benefits from retaining all feature points in the scene. In contrast, the proposed algorithm occasionally misclassifies some static feature points as dynamic and removes them, reducing the number of usable feature points for tracking. This leads to varied performance in these environments. For example, in the sitting_xyz dataset, the ATE RMSE increased by 16.92% compared to ORB-SLAM3. However, in the sitting_static dataset, the ATE and RPE RMSEs were reduced by 26.85% and 65.98%, respectively, highlighting the ability of the algorithm to suppress interference from low-quality feature points, even under near-static conditions.
To further evaluate the performance of the proposed algorithm, comparative experiments were conducted with several state-of-the-art visual SLAM methods, including RDS-SLAM, DM-SLAM, RDMO-SLAM, Huai et al., and DIO-SLAM, under consistent experimental settings. All algorithms were tested on the same datasets (e.g., walking_half and walking_xyz) with identical parameter configurations to ensure fairness. As shown in
Table 7, MSF-SLAM achieves superior results in highly dynamic scenes, with significantly lower RMSE in trajectory estimation compared to all baselines. For instance, in the walking_half sequence, MSF-SLAM records an RMSE of 0.0202 m, outperforming DIO-SLAM (0.0212 m), Huai et al. (0.0303 m), and RDMO-SLAM (0.0304 m), while in walking_xyz, it achieves an RMSE of 0.0134 m against DIO-SLAM’s 0.0141 m and Huai et al.’s 0.0143 m. These improvements are attributed to the integration of the Multi-Scale Feature Consolidation module (MSFConv), which enhances the detection of fast-moving objects, and the Dynamic Object Filtering Framework (DOFF), which leverages LK optical flow to effectively remove dynamic features. Together, these components enable MSF-SLAM to mitigate the impact of dynamic objects more effectively and maintain robustness in challenging environments.
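To illustrate the kind of filtering DOFF performs, the sketch below combines detector boxes with LK optical flow (cv2.calcOpticalFlowPyrLK) to discard keypoints that fall inside a detected dynamic object or whose flow deviates strongly from the median scene motion; the box format and threshold are placeholder assumptions rather than the exact implementation.

```python
import cv2
import numpy as np

def filter_dynamic_keypoints(prev_gray, curr_gray, keypoints, dynamic_boxes,
                             flow_threshold=2.0):
    """Return the keypoints judged static. A cv2.KeyPoint is dropped if it lies
    inside a detected dynamic-object box (x1, y1, x2, y2) or if its LK flow
    displacement deviates strongly from the median scene motion
    (flow_threshold is a placeholder value in pixels)."""
    pts = np.float32([kp.pt for kp in keypoints]).reshape(-1, 1, 2)
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)

    displacement = np.linalg.norm(next_pts - pts, axis=2).ravel()
    median_motion = np.median(displacement[status.ravel() == 1])

    static = []
    for kp, disp, ok in zip(keypoints, displacement, status.ravel()):
        x, y = kp.pt
        in_box = any(x1 <= x <= x2 and y1 <= y <= y2
                     for (x1, y1, x2, y2) in dynamic_boxes)
        if ok and not in_box and abs(disp - median_motion) < flow_threshold:
            static.append(kp)
    return static
```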
Figure 5 and
Figure 6 compare the real and estimated trajectories of the improved algorithm and ORB-SLAM3. The black solid line represents the real trajectory, the blue solid line denotes the estimated trajectory, and the red shaded area highlights the error between the two. The smaller the shaded area, the higher the system’s accuracy. In the highly dynamic walking_static, walking_half, and walking_xyz sequences, the improved algorithm significantly outperforms ORB-SLAM3, with substantially smaller trajectory errors.
4.3. Computational Performance Analysis
To evaluate the computational efficiency of MSF-SLAM, we benchmarked resource usage against ORB-SLAM3 on two platforms: a high-performance laptop (Intel i9-14900HX, NVIDIA RTX 4060, 32 GB RAM) and an edge device (NVIDIA Jetson Xavier NX, 8 GB RAM).
Table 8 summarizes the results for the “walking_xyz” sequence (1000 frames).
On the laptop, MSF-SLAM achieves 233.1 FPS with 65% GPU utilization, 8.2 GB VRAM, and 12.5 GB RAM, compared to ORB-SLAM3’s 275.1 FPS, 20% GPU utilization, 1.5 GB VRAM, and 6.8 GB RAM. The increased resource demand reflects the integration of YOLOv8+MSFConv, which adds ~9.9% computational overhead (8.9 vs. 8.1 GFLOPs). On the Jetson Xavier NX, MSF-SLAM runs at 45.2 FPS with 5.8 GB VRAM, suitable for real-time robotics applications (>30 FPS), while ORB-SLAM3 achieves 78.4 FPS with 1.2 GB VRAM. Per-frame latency for MSF-SLAM is 4.3 ms (laptop) and 22.1 ms (Jetson), compared to ORB-SLAM3’s 3.6 ms and 12.8 ms, respectively.
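These latency figures are consistent with the reciprocal of the reported throughput (e.g., 1/233.1 s ≈ 4.3 ms). A minimal way to measure both quantities on any platform is sketched below, where process_frame stands in for the per-frame tracking call and is a placeholder name.

```python
import time

def measure_throughput(process_frame, frames):
    """Measure mean per-frame latency (ms) and FPS for any per-frame callable.
    `process_frame` is a placeholder for the SLAM tracking step."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / len(frames)
    fps = len(frames) / elapsed
    return latency_ms, fps
```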
These results indicate that MSF-SLAM’s enhanced accuracy (94.43% ATE reduction) comes with a manageable resource trade-off, enabling deployment on mid-range edge devices for dynamic SLAM tasks, such as autonomous navigation in indoor environments.