Next Article in Journal
Analysis of Voice, Speech, and Language Biomarkers of Parkinson’s Disease Collected in a Mixed Reality Setting
Previous Article in Journal
Taylor Series Interpolation-Based Direct Digital Frequency Synthesizer with High Memory Compression Ratio
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Monocular Initialization for Real-Time Feature-Based SLAM in Dynamic Environments with Multiple Frames

Space Control and Inertial Technology Research Center, School of Astronautics, Harbin Institute of Technology, Harbin 150001, China
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(8), 2404; https://doi.org/10.3390/s25082404
Submission received: 5 March 2025 / Revised: 7 April 2025 / Accepted: 9 April 2025 / Published: 10 April 2025
(This article belongs to the Section Navigation and Positioning)

Abstract

:
Two-view epipolar initialization for feature-based monocular SLAM with the RANSAC approach is challenging in dynamic environments. This paper presents a universal and practical method for improving the automatic estimation of initial poses and landmarks across multiple frames in real time. Image features corresponding to the same spatial points are matched and tracked across consecutive frames, and those that belong to stationary points are identified using ST-RANSAC, an algorithm designed to detect inliers based on both spatial and temporal consistency. Two-view epipolar computations are then performed in parallel among frames and corresponding features to select the most reliable initialization. The proposed method is integrated with ORB-SLAM3 and evaluated on dynamic datasets for comparative analysis with the baseline. The experimental results demonstrate that the proposed method improves the accuracy of initial pose estimations with the construction of static landmarks while significantly reducing feature extraction scale and computational cost.

1. Introduction

Simultaneous localization and mapping (SLAM) is a technique that concurrently estimates poses and generates a map through sensory perception, playing a crucial role in the guidance, navigation, and control (GNC) of autonomous robots operating in unknown environments. Visual-based SLAM has been extensively studied and applied due to the lightweight and cost-effective nature of cameras. The initialization of system states in visual SLAM, involving an initial guess of poses and landmarks, can be achieved through integration with other sensors (IMU, LiDAR, etc.), auxiliary cameras (depth, stereo, etc.), or preset landmarks. Meanwhile, for monocular feature-based visual SLAM without prior information, the initialization is typically performed using structure from motion (SfM), which requires sufficient and correct image feature matching between frames and appropriate camera displacement.
To address measurement errors, mismatching, and features on moving points, modern SLAM systems generally extract excess features for robust initialization. Inlier features consistent with an anticipated motion model are identified, often using statistical methods such as Random Sample Consensus (RANSAC), for multiple-view structure computation. However, in some dynamic environments, features from slow-moving objects may be incorrectly treated as static, leading to erroneous state estimation [1].
To ensure robust monocular initialization, some SLAM algorithms significantly increase feature extraction during bootstrapping to enhance the quantity of stationary features, albeit at the cost of additional computational overheads. Other methods focus on achieving consistent solutions across multiple frames using computationally intensive approaches, which often involve conditional assumptions [2] and require substantial processing time.
In this work, a universal monocular initialization method for dynamic scenes is proposed, which automatically identifies stationary feature pairs and estimates initial pose and landmarks across several frames with a manageable computational burden. The main contributions are as follows:
  • The RANSAC algorithm is extended to the discrete time domain, enabling the identification of inlier features with both spatial and temporal consistency. The features corresponding to stationary points are then distinguished from moving ones across consecutive image frames.
  • A pipeline is developed to perform automatic initialization using aforementioned features and frames. Two-view epipolar computations between frame pairs are performed in parallel, seeking the most credible estimation of initial poses and map points.
  • The proposed initialization method is implemented and integrated into the open-source ORB-SLAM3 framework. Experimental evaluation on dynamic datasets demonstrate that the proposed method achieve superior accuracy in initial pose estimation with map points constructed from static objects while reducing computational time due to the need for fewer extracted features.
This paper is structured as follows. Related works on the initialization of feature-based visual SLAM are briefly surveyed in Section 2. Reviewing conventional two-view estimation, epipolar consistency across multiple frames is discussed, referring to the Spatio-Temporal RANSAC algorithm in Section 3. In Section 4, a pipeline is detailed to estimate initial poses and construct the stationary environment in dynamic scenes with the aforementioned algorithms. To validate the proposed method, dataset experiments are conducted in Section 5, and the experimental results are analyzed in Section 6. The contribution of this paper is concluded in Section 7, with a discussion of future work.

2. Related Works

While some systems employ random initial states, most monocular SLAM algorithms rely on 3D perception derived from 2D image information, primarily through the depth estimation of pixels projected onto images [3,4]. Two-view epipolar-based initialization has been extensively applied in many classical SLAM systems [5]. This approach involves estimating the fundamental or homography matrix from matched image features, decomposing the matrix for relative pose with chirality, and triangulating pixel correspondences for landmarks.
With the assumption that matched features are projected from a static environment, this initialization method is challenging in dynamic scenes, where features on moving objects may lead to inconsistency in the statistical epipolar estimation [6]. One solution is to reduce the impact of dynamic features. Regarded as one of state-of-the-art methods with the best performance in dynamic scenes [7], ORB-SLAM3 [8] greatly expands the feature extraction scale during initialization in the codes, improving the robustness of epipolar estimation with an increased number of stationary features. However, the excess feature extraction introduces a heavy computational burden, potentially compromising real-time performance. Additionally, some dynamic features are still misidentified and mixed with stationary ones after RANSAC, resulting in deviated poses and the creation of non-static landmarks.
Another solution is to distinguish and eliminate features on dynamic objects. Semantic-based approaches have gained traction with advancements in data-driven image segmentation. For instance, DynaSLAM [9] provide an influential framework where features on movable objects are double-checked by means of multi-view geometry after CNN-based segmentation, inspiring a number of subsequent studies [10]. However, the instance segmentation efficacy depends on the quality of the pretrained model. Furthermore, achieving real-time performance in semantic segmentation remains a challenge without GPU acceleration [11].
As for geometry-based methods, one of the fundamental theories is multibody epipolar constraint in motion tracking under noise-free conditions. Initialization in actual systems can be deployed using either the iterative computation of multiple hypotheses [12] or prior knowledge of the ego-motion kinematics model [13]. However, the assumption of rigid-body motion for dynamic objects limits their general applicability. Aiming at highly non-grid environments with deformation, the first step of initialization in NR-SLAM [14] assumes that most image innovation comes from camera motion rather than from changing backgrounds, and the estimation is further refined with deformable bundle adjustments in the next step.
Additionally, for most SLAM systems, the primary goal during initialization is motion awareness and exclusion rather than explicit motion tracking, suggesting that these methods could benefit from further optimization for efficiency. For instance, under the assumption of motion consistency around local patches, the dynamic area can be identified and discarded via neighbor-based RANSAC [15] during initialization. The stationary image regions for initialization can also be determined with high confidence in the probabilistic field on S i m ( 3 ) manifold [16] via random sampling.

3. Stationary Feature Sorting with Spatio-Temporal Consensus

3.1. Two-View Epipolar Constraint in Dynamic Environment

As illustrated in Figure 1a, stationary spatial points P = { P i } and moving spatial points Q = { Q i } , with world coordinates { P i } and { Q i ( t ) } , are observed by a moving camera with camera centers C = { C ( t 0 ) , C ( t 1 ) , , C ( t k ) , } : = { C k } at discrete time instances. The corresponding camera poses are represented by the extrinsics { C k } . The observed points are projected on the image frame F k of camera center C k , where they are represented as image features P k = { p i k } and Q k = { q i k } with homogeneous coordinates { p i k } and { q j k } , respectively.
To estimate the initial states of the SLAM system, a RANSAC-based scheme can be applied between camera centers C a and C b using matched features X a = { x i a } = P a Q a and X b = { x i b } = P b Q b , seeking epipolar consistency in accordance with either the fundamental matrix F b a , satisfying
x i b F b a x i a = 0 ,
or the homography matrix H b a [17], satisfying
x i b = H b a x i a .
The matrix with the maximum number of consistent matchings and minimal error is estimated using the RANSAC algorithm. The corresponding features in { x i a } and { x i b } satisfying the epipolar constraint are regarded as inliers, thus are considered spatially consistent between camera centers C a and C b . From this, the relative camera extrinsics C b a = [ R b a | t b a ] of C b with regard to C a are computed as the initial pose, and inlier features are triangulated to form initial landmarks.
The feature pairs corresponding to moving points Q , along with mismatched pairs, are expected to be recognized as outliers during RANSAC due to the violation of the epipolar constraint. However, when the displacement between { Q i ( t a ) } and { Q i ( t b ) } is subtle (e.g., due to a similar motion direction, moderate movement, or non-rigid motion), the corresponding features Q a and Q b may be misidentified as spatially consistent, as illustrated in Figure 1b.
An increase in such misidentified features can lead to significant deviation during the estimation of F b a or H b a , resulting in ill-initialized states with drifted pose and some non-stationary landmarks. Therefore, the accuracy of initialization can be improved by eliminating non-stationary features for epipolar matrix estimation.

3.2. Temporal Consistency Across Multiple Views

To better distinguish stationary features from moving ones, the epipolar constraint can be checked between multiple pairs of consecutive frames F = { F k } .
According to the epipolar constraint of Equations (1) and (2), given F b a or H b a , spatial consistency can be evaluated with reference to the symmetric transfer error [18], denoted as
S ( x i a , x i b ) = max ( d ( x i a , x i b F b a ) 2 , d ( x i b , F b a x i a ) 2 ) , if fundamental max ( x i b H b a x i a , x i a H b a 1 x i b ) , if homography
where d ( x , l ) is the distance from a point to a line.
Considering all frame pairs { ( F a , F b ) | F a , F b F , a b } within F , the features that are spatially consistent across each pair are referred to as temporally consistent, which is represented as
S ( x i a , x i b ) = 0 , F a , F b F .
All features corresponding to stationary spatial point P are expected to be temporally consistent, and they are represented as
S ( p i a , p i b ) < ϵ , F a , F b F , P i P ,
where ϵ is a threshold considering the measurement error.
For features corresponding to moving spatial points Q , there may be pairs that are misidentified as spatially consistent via RANSAC within the threshold, as discussed in Section 3.1. However, except for possible temporarily fixed points on moving objects, nearly all non-stationary points inherently exhibit spatial inconsistency between certain frame pairs, which is represented as
Q i Q , F a , F b F , s . t . S ( q i a , q i b ) > ϵ .
Compared to moving points Q , stationary points P are more likely to be considered temporally consistent via the traversal of the RANSAC-based spatial consistency check. Consequently, rather than being simply spatially consistent in a single two-view RANSAC process, a spatial point has a much higher likelihood of being stationary if it exhibits temporal consistency across multiple frames.

3.3. Spatio-Temporal RANSAC Algorithm

To sort stationary features for improved initialization, temporal consistency can be checked across multiple frames. Similarly to the two-view RANSAC algorithm, numerous random sets of matched features could be checked for epipolar consistency across all or randomly selected pairs of frames. The set of epipolar matrices with the maximum consensus of features would yield the best result for initialization. However, if the algorithm fails to find a consistent solution within the current set of frames, a thorough recalculation is required when a new frame arrives, which can be computationally expensive for real-time SLAM systems.
In this paper, an iterative method is presented to check and record the spatial consistency of features. The temporal consistency of a point is evaluated using the spatial consistency ratio of corresponding features, which is continuously updated as new frames are processed. This method is referred to as Spatio-Temporal RANSAC (ST-RANSAC), and it is detailed in Algorithm 1.
Algorithm 1 Spatio-Temporal RANSAC at t N
    Input: M Matched features { x i j } on N + 1 frames { F j } , with previous counts of being inliers { m i } and outliers { n i }
    Output: Stationary features P N on frame F N
1:
for each j [ 0 , N 1 ]  do
2:
   Estimate F ^ j N or H ^ j N with { x i j } , { x i N } by RANSAC
3:
   for each i [ 1 , M ]  do
4:
      Calculate e = S ( x i j , x i N ) with F ^ j N or H ^ j N
5:
      if  e < ϵ  then
6:
         m i m i + 1 ;
7:
      else
8:
         n i n i + 1 ;
9:
      end if
10:
    end for
11:
end for
12:
for each i [ 1 , M ]  do
13:
   Calculate r i = m i / ( m i + n i )
14:
   if  r i > r th  then
15:
      P N P N { p i N }
16:
   end if
17:
end for
18:
return  P N
Considering the incoming frame F N at t N , a spatial consistency check of matched features X N is performed N times between F N and all preceding frames { F 0 , F 1 , , F N 1 } using RANSAC-estimated F ^ j N or H ^ j N , as detailed in Algorithm 2, for each pair. The counts of inliers and outliers for each feature x i are accumulated as m i and n i , respectively, which are inherited from the previous iteration at t N 1 . Consequently, the ratio r i = m i / ( m i + n i ) represents the temporal consistency of feature x i across the frame pairs.
If the ratio r i exceeds a threshold r th , feature x i is considered temporally consistent and thus stationary up to t N . Once more than s th stationary points are recognized at t N by frame F N , initialization can be conducted using corresponding features. If the number of stationary points is insufficient, the algorithm will proceed to the next iteration with the incoming frame F N + 1 at t N + 1 .
Algorithm 2 Epipolar estimation between F j and F N at t N within K iterations
    Input: M Matched pairs D N j between features { x i N } on frame { F N } and { x i j } on frame { F j }
    Output: Initial guess of fundamental matrix F ^ j N or homography matrix H ^ j N
1:
for each k [ 1 , K ]  do
2:
   Randomly sample 8 pairs of matched feature between { x i N } and { x i j } into set D k
3:
   (1) Estimate F ^ k with D k using 8-Point algorithm
4:
   for each l [ 1 , M 8 ]  do
5:
     Compute fundamental error e l of pair ( x l N , x l j ) D N j D k using F ^ k
6:
     if  e l < ϵ th  then
7:
        Accumulate ϵ th e l to score f k
8:
     end if
9:
   end for
10:
   if  f k > f max  then
11:
      f max = f k , F ^ max = F ^ k
12:
   end if
13:
   (2) Estimate H ^ k with D k using DLT algorithm
14:
   for each l [ 1 , M 8 ]  do
15:
     Compute homography error of pair ( x l N , x l j ) D N j D k using H ^ k
16:
     if  e l < ϵ th  then
17:
        Accumulate ϵ th e l to score h k
18:
     end if
19:
   end for
20:
   if  h k > h max  then
21:
      h max = h k , H ^ max = H ^ k
22:
   end if
23:
end for
24:
if  f max > h max
25:
   then return  F ^ j N = F ^ max
26:
else
27:
   return  H ^ j N = H ^ max
28:
end if
Compared to conventional two-frame-based initialization, the proposed algorithm leverages historical information from all previous frames through the accumulated values of { m i } and { n i } . Regarding algorithmic efficiency, the spatial consistency checks between each frame pair (from Line 1 to Line 11 in Algorithm 1) are entirely independent, making it possible to conserve processing time via multithreading or parallel computing. Similarly in Algorithm 2, the estimation of F ^ k (Line 3 to 12) and H ^ k (Line 13 to 22) is performed in parallel due to their independence.

4. SLAM Initialization with Multiple Frames

With temporally consistent features, two-view or multi-view structure computation can be performed across historical frames. This paper introduces an initialization pipeline that seeks the most credible two-view reconstruction.

4.1. Feature Matching Across Multiple Frames

The aforementioned sorting algorithm relies on features being properly matched across multiple frames. However, most extraction algorithms tend to be semi-random during corner detection [19], making it difficult to ensure that features corresponding to the same spatial point appear consistently in successive frames. To effectively look up previous matching, corresponding features across frames can be linked using a data structure referred to as object point, as depicted in Figure 2a.
Similarly to the map points in many SLAM systems, the object point’s structure can be regarded as a candidate landmark with undetermined coordinates. It allows for the efficient association and retrieval of features through their corresponding frames and vise versa. For an incoming frame at t N , the feature matching across frames, along with object point maintenance, is detailed in Algorithm 3.
Algorithm 3 Feature matching across frames at t N
    Input: M features X N = { x i N } on frame F N , with object points O within N historical frames { F j }
    Output: Object points O within N + 1 frames
1:
for each j [ 0 , N 1 ] in reversed order do
2:
   for each i [ 1 , M ]  do
3:
     if  x i N has been matched then
4:
        continue
5:
     else
6:
        Try to find a match for x i N with x * j in X j
7:
        if  x i N cannot find a match in X j  then
8:
          continue
9:
        else
10:
          if  x * j had been matched and related to O *  then
11:
             Relate x i N to O *
12:
          else
13:
             Create O * and relate x i N , x * j to O *
14:
              O O { O * }
15:
          end if
16:
        end if
17:
     end if
18:
   end for
19:
end for
20:
return  O
This algorithm ensures that matched features are properly associated via the introduction of object points while avoiding circular matching across frames. As a result, the feature pairs required by Line 2 in Algorithm 1 can be effectively retrieved by means of the management of the object point database O . Additionally, the consistency evaluation of features is transferred to object points for convenience. Concerning the possible computational cost of feature matching across excessive frames, the size of historical frames { F j } can be restricted using a sliding window.

4.2. Structure Computing in Parallel

With sufficient stationary points observed at t N , two-view structuring can be attempted between F N and multiple preceding frames, aiming for the best initialization with the maximum number of initial landmarks. Frame candidates are determined based on covisibility relationships, as illustrated in Figure 2b. For each stationary object point in F N , the preceding frames with the corresponding feature are counted. The frames observing the most stationary object points in common with F N are selected as candidates, with their number adjustable based on the multithreading capability of the processor.
Rather than relying on previously estimated F ^ i N or H ^ i N during ST-RANSAC, structure computing for each frame candidate F i is performed using a re-estimated epipolar matrix based solely on sorted stationary object points. Poses and landmarks are then computed for each frame candidate, and the set with the maximum number of triangulated landmarks is selected as the most credible initialization. Similarly to ST-RANSAC, the epipolar re-estimation and structure computation for each candidate frame are completely independent and can be parallelized to enhance computational efficiency.
Compared to conventional methods that search for the best candidate relative to a fixed frame, this pipeline ensures that both frames in the optimal pair are chosen from all historical images. Furthermore, since epipolar estimation using exclusively stationary features is expected to yield greater precision, it becomes feasible to avoid excess feature extraction during initialization, significantly mitigating the computational burden. Consequently, the proposed method has the potential to enhance initialization not only in accuracy but also in efficiency by conserving the scale of feature extraction.

4.3. Implementation and Comparison with ORB-SLAM3

The entire initialization pipeline is depicted in Figure 3a and implemented within ORB-SLAM3 [8], a state-of-the-art feature-based SLAM system. After features are extracted from an image, the frame is matched with historical frames to update the object point database. ST-RANSAC is then executed concurrently across multiple threads to distinguish stationary object points in the current frame. Once frame candidates with the highest covisibility are identified, structure computations are performed in parallel. The result with the highest number of triangulated landmarks is accepted as the SLAM system’s initialization, with the current frame referred to as the construction frame and the other as the initial frame.
For comparison, the original monocular initialization procedure in ORB-SLAM3 is shown in Figure 3b. Unlike conventional two-frame initialization, the proposed method fully utilizes historical frame information and conducts multiple structure computations to realize the optimal initial states. The efficiency of both ST-RANSAC and structure computation is ensured through multithreading.
In the original ORB-SLAM2 [20] and DynaSLAM codes, the number of features extracted per uninitialized frame is set to twice that of normal frames, yet initialization failures persist in dynamic scenes. ORB-SLAM3 adopts the same initialization pipeline as ORB-SLAM2 but increases this factor to 5 for robustness in codes, as depicted in Figure 3d. Moreover, this approach extracts additional features in frames following initialization (within 1 s), consuming significant computational resources and compromising real-time performance.
In contrast, the proposed method maintains a consistent number of features across all frames, including uninitialized ones, as depicted in Figure 3c. This consistency ensures steady image processing time for all incoming frames, significantly reducing overall computational time. Additionally, the major extra computation introduced by the proposed method occurs during the parallelized ST-RANSAC stage, which consumes far fewer resources than feature extraction and constitutes only a minor portion of the total processing time.

5. Experiments

5.1. Dataset Experiments

Dataset experiments were conducted to compare the initialization efficacy of the baseline ORB-SLAM3 and the proposed method. The popular semantic-masking method designated for dynamic environments, DynaSLAM, was also compared for reference despite its inefficiency. Selected monocular sequences from the TUM [21] and Bonn [22] datasets, which include dynamic scenes, were used for evaluation.
The experiments were performed on an Intel Core i7-12700K (Intel, Santa Clara, CA, USA) desktop computer without GPU acceleration. The base number of extracted features per frame was set to 1000 for all methods according to the resolution. The first 100 images of each sequence were processed by the SLAM systems, attempting to compute initial poses and map points. In the proposed method, frame tracking after initialization follows the same pipeline as ORB-SLAM3 to ensure a fair comparison. Keeping pace with ORB-SLAM3, the multiplier of features during initialization in DynaSLAM was augmented from the original 2 to 5 to prevent failure. For some sequences, several images containing only stationary backgrounds were skipped at the beginning to focus on dynamic scenes.
The accuracy of the estimated keyframe poses immediately following the initialization was assessed. Since scale ambiguity exists, the  Sim ( 3 ) transformation was applied using the Umeyama algorithm to align the keyframes with ground truth. The translations of the root mean square error of the absolute trajectory (RMS ATE) and relative poses (RMS RPE) over the initialization were calculated to compare the accuracy of initial poses. For each sequence, multiple experiments were conducted with both methods, and the initialization metrics featuring median RMS ATE are listed in Table 1. The peak time during initialization and the mean tracking time for all frames were recorded to evaluate computational efficiency. Additionally, the frame pair of the initial frame and the construction frame, automatically selected for two-view reconstructions during initialization, is reported for reference.
In the proposed method, the error threshold ϵ introduced in Section 3.2 was set based on the χ 2 test at 95%. The temporal consistency threshold r th introduced in Section 3.3 was set to 0.6 . The threshold for the number of stationary object point observations s th introduced in the same section was set to 50.

5.2. Practical Experiments

To further evaluate the proposed method, practical experiments were conducted to compare the initialization efficacy in an outdoor environment. An unmanned ground vehicle (UGV) was remote-controlled to pass through the Science Park campus of Harbin Institute of Technology, capturing monocular image sequences containing moving people and vehicles with an integrated Intel D435i camera (Intel, Santa Clara, CA, USA) at 640 × 480 @ 30 fps . The ComNav M100 system offered RTK GNSS ground truth at 10 Hz , with three-dimensional positioning precision within 2 cm .
In accordance with dataset experiments, the first 100 images in each of the five dynamic scenes were processed on the aforementioned computer using the SLAM systems repeatedly, attempting to estimate initial poses and map points. The base number of extracted features per frame was set to 1500 for all methods, and the features during initialization were doubled in the proposed method to adapt to the outdoor environment. The other hyperparameters were set to the same values mentioned in the dataset experiments. The metrics of experimental results featuring median RMS ATE are listed in Table 2.

6. Discussion

6.1. Stationariness of Initial Map Points

Pruning dynamic features, the proposed method successfully constructed static parts of the environment into initial map points.
The examples of candidate features for initialization, along with the resulting initial map points, are illustrated in Figure 4, and they are sampled from the widely compared sequence freiburg3_walking_xyz in the TUM dataset.
In ORB-SLAM3, the candidate matches in the 18th construction frame are shown in Figure 4a, which contain a significant number of features on the right, indicating a moving person. Many non-stationary points are subsequently triangulated from these features, resulting in a deviated initial pose and misleading the subsequent localization, as illustrated in Figure 4b.
In the proposed method, all observed object points in the 18th frame are marked in Figure 4d, with the spatial consistency ratio represented in diverse colors and observation counts indicated by different point sizes. The features on the right indicating a moving person exhibit noticeably lower spatial consistency ratios regardless of the observation frequency. The object points sorted with the spatial consistency ratio r th are shown in Figure 4e as candidate features, where blue points represent the matches in preceding frames. Subsequently, later on in the 25th construction frame of the proposed method, similar stationary features are used to triangulate the initial map points, which are shown in Figure 4f.
The 16th construction frame in DynaSLAM is shown in Figure 4c for reference. Compared with semantic-based masking, the proposed method can achieve comparable mapping efficacy in eliminating dynamic points while tremendously conserving computational resources and processing time.
Further comparisons of initial map points in other sequences are shown in Figure 5, which are sampled from the respective construction frames. These examples demonstrate that the proposed method effectively separates stationary features from moving ones in dynamic scenes, prompting the construction of map points on fixed objects for better initialization.

6.2. Efficacy of Initial Pose Estimation

The proposed method tracks initial poses with better accuracy while conserving processing time by requiring the extraction of fewer features.
The experimental results in Table 1 show that the proposed method consistently [“achieves” or “brings”] lower localization errors compared to the baseline across nearly all sequences, with up to 65.15% less RMS ATE and 67.19% less RMS RPE in sequence freiburg3_walking_xyz. One of the exceptions occurs in the sequence Balloon2, where the RMS ATEs of the two methods are closely comparable with a disparity of 3.17%, while the proposed method still achieves a 29.86% reduction in RMS RPE. The other exception occurs in Scene 4 of the practical experiment, where the baseline method covers a minor part of the trajectory. As for DynaSLAM, the proposed method generally achieves comparable localization accuracy and outperforms it in many sequences.
In additional to accuracy, the proposed method also improves efficiency. The mean tracking time of all 100 frames around the initialization is reduced across almost all sequences. The peak processing time, which occurs at the respective construction frame to perform two-view reconstruction, is generally lowered in most sequences and up to 58.50% in sequence freiburg3_walking_halfsphere.
Plotted by EVO [23], the key frame trajectories of some sequences are shown in Figure 6 as examples, alongside the processing time per frame and absolute pose error (APE) distribution. In the highly misguiding sequence freiburg3_walking_xyz, where initialization is procrastinated in all methods, the proposed method achieves impressive trajectory accuracies with reduced processing time. In the less misguiding sequence freiburg3_walking_halfsphere, where initialization is expeditious in all methods, the proposed method achieves a noticeably reduced processing time, together with the promotion of trajectory accuracy.
In sequence moving_obstructing_box with dynamics that step away from the camera, all methods realized good initialization, implying the limited promotion of the proposed method in such cases.
It is indicated that, with the help of stationary feature sorting, the proposed method realizes more accurate initial poses in dynamic scenes while reducing computational time.

6.3. Robustness of Initialization

While requiring the extraction of fewer features, the proposed initialization method is more robust than the others.
After repeated attempts, ORB-SLAM3 or DynaSLAM fails to perform initialization in dataset sequences Crowd, Crowd2, Crowd3, and Person_tracking2. The baseline ORB-SLAM3 is also challenged in Scene 2, Scene 3, and Scene 4 of the outdoor dynamic environment, as shown in Figure 5 and Figure 6. Meanwhile, the proposed method completes initialization in all sequences and scenes within multiple trials. In addition to stationary feature sorting, the robustness of the proposed method is speculated to be ensured by parallel structuring, which automatically compares and selects the frame pair with the most credible matches to perform the initialization.
The success rates of initialization in different datasets and practical experiments are compared in Table 3. The results demonstrate that the proposed method achieves consistently high success rates across a range of scenarios involving various dynamic objects. However, it is noteworthy that rapid ego-motion involving violent rotations remains a significant challenge for the proposed method. It is indicated that increasing the number of features extracted during initialization contributes to the generation of more map points, which in turn enhances the accuracy and robustness of the initialization process. Therefore, increasing the number of extracted features can further improve the performance of the proposed method, albeit at the expense of increased computational overhead.

7. Conclusions

This paper introduces a geometry-based method for automatically computing initial poses and landmarks for real-time monocular SLAM in dynamic scenes. With the proposed ST-RANSAC algorithm, stationary features are distinguished from moving ones for more accurate initialization. Through the data structure termed the object point, the parallel processing of epipolar estimation and triangulation is performed to ensure the timeliness of the initialization pipeline. Our experimental results demonstrate that the proposed method surpasses the baseline in both initial pose estimation accuracy and frame processing efficiency.
A key advantage of the proposed method is its ability to maintain steady computational costs for image processing by avoiding the need to extract additional features during initialization. This feature is particularly significant for SLAM algorithm designs in hardware-constrained platforms or specialized chips. While optimized for dynamic environments, this method also shows promise for improving initialization in static scenes. By automatically selecting frame pairs with the highest covisibility, it proves to be potentially advantageous in challenging scenarios, such as sequences with featureless, blurred, or distorted images or low-parallax viewpoints.
Furthermore, during initialization, the identification of temporal consistency for object points could guide feature extraction in subsequent frames, concentrating on stationary areas. This approach might enhance the density and quality of map points constructed on static objects, which is a potential direction for further research.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/s25082404/s1. Video S1: Short demo.

Author Contributions

Conceptualization, H.D. and C.W.; methodology, H.D.; software, H.D.; validation, H.D.; investigation, H.D., B.L. and Y.J.; writing—original draft preparation, H.D.; writing—review and editing, H.D., B.L., Y.J. and C.W.; visualization, H.D.; supervision, C.W.; project administration, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Touyan Innovation Program of Heilongjiang Province, China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article or Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
APEAbsolute pose error;
ATEAbsolute trajectory error;
DLTDirect linear transform;
GNCGuidance, navigation, and control;
GNSSGlobal navigation satellite system;
IMUInertial measurement unit;
LiDARLight detection and ranging;
RANSACRandom Sample Consensus;
RMSRoot mean square;
RPERelative pose error;
RTKReal-time kinematic positioning;
SLAMSimultaneous localization and mapping;
ST-RANSACSpatio-Temporal Random Sample Consensus;
UGVUnmanned ground vehicle.

References

  1. Saputra, M.R.U.; Markham, A.; Trigoni, N. Visual SLAM and Structure from Motion in Dynamic Environments: A Survey. ACM Comput. Surv. 2018, 51, 1–36. [Google Scholar] [CrossRef]
  2. Gadipudi, N.; Elamvazuthi, I.; Izhar, L.I.; Tiwari, L.; Hebbalaguppe, R.; Lu, C.K.; Doss, A.S.A. A review on monocular tracking and mapping: From model-based to data-driven methods. Vis. Comput. 2022, 39, 5897–5924. [Google Scholar] [CrossRef]
  3. Azzam, R.; Taha, T.; Huang, S.; Zweiri, Y. Feature-based visual simultaneous localization and mapping: A survey. SN Appl. Sci. 2020, 2, 224. [Google Scholar] [CrossRef]
  4. Herrera-Granda, E.P.; Torres-Cantero, J.C.; Peluffo-Ordóñez, D.H. Monocular visual SLAM, visual odometry, and structure from motion methods applied to 3D reconstruction: A comprehensive survey. Heliyon 2024, 10, e37356. [Google Scholar] [CrossRef] [PubMed]
  5. He, M.; Zhu, C.; Huang, Q.; Ren, B.; Liu, J. A review of monocular visual odometry. Vis. Comput. 2020, 36, 1053–1065. [Google Scholar] [CrossRef]
  6. Tourani, A.; Bavle, H.; Sanchez-Lopez, J.L.; Voos, H. Visual SLAM: What Are the Current Trends and What to Expect? Sensors 2022, 22, 9297. [Google Scholar] [CrossRef] [PubMed]
  7. Beghdadi, A.; Mallem, M. A comprehensive overview of dynamic visual SLAM and deep learning: Concepts, methods and challenges. Mach. Vis. Appl. 2022, 33, 54. [Google Scholar] [CrossRef]
  8. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  9. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  10. Hu, X.; Lang, J. DOE-SLAM: Dynamic Object Enhanced Visual SLAM. Sensors 2021, 21, 3091. [Google Scholar] [CrossRef] [PubMed]
  11. Pu, H.; Luo, J.; Wang, G.; Huang, T.; Liu, H.; Luo, J. Visual SLAM Integration with Semantic Segmentation and Deep Learning: A Review. IEEE Sensors J. 2023, 23, 22119–22138. [Google Scholar] [CrossRef]
  12. Gonzalez, H.; Balakrishnan, A.; Elouardi, A. Enhanced Initialization for Track-before-Detection based Multibody Motion Segmentation from a Moving Camera. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 2040–2045. [Google Scholar] [CrossRef]
  13. Sabzevari, R.; Scaramuzza, D. Multi-body Motion Estimation from Monocular Vehicle-Mounted Cameras. IEEE Trans. Robot. 2016, 32, 638–651. [Google Scholar] [CrossRef]
  14. Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. NR-SLAM: Nonrigid Monocular SLAM. IEEE Trans. Robot. 2024, 40, 4252–4264. [Google Scholar] [CrossRef]
  15. Wei, H.; Zhang, T.; Zhang, L.; Zhong, L. Point-State Segmentation for Dynamic Scenes via Probability Model and Orientation Consistency. IEEE Trans. Ind. Inform. 2024, 20, 8428–8440. [Google Scholar] [CrossRef]
  16. Gao, X.; Liu, X.; Cao, Z.; Tan, M.; Yu, J. Dynamic Rigid Bodies Mining and Motion Estimation Based on Monocular Camera. IEEE Trans. Cybern. 2023, 53, 5655–5666. [Google Scholar] [CrossRef] [PubMed]
  17. Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  18. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar] [CrossRef]
  19. Huang, Q.; Guo, X.; Wang, Y.; Sun, H.; Yang, L. A survey of feature matching methods. IET Image Process. 2024, 18, 1385–1410. [Google Scholar] [CrossRef]
  20. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  21. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; IEEE: New York, NY, USA, 2012; pp. 573–580. [Google Scholar] [CrossRef]
  22. Palazzolo, E.; Behley, J.; Lottes, P.; Giguère, P.; Stachniss, C. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019. [Google Scholar]
  23. Grupp, M. EVO: Python Package for the Evaluation of Odometry and SLAM. 2017. Available online: https://github.com/MichaelGrupp/evo (accessed on 6 August 2023).
Figure 1. Visual SLAM in dynamic environments. (a) Spatial points are observed by consecutive camera centers, where the positions of stationary points { P i } remain fixed, while those of moving points { Q i } vary over time. (b) Misidentification of moving points between two views can lead to incorrect epipolar estimation, potentially resulting in ill-initialized states.
Figure 1. Visual SLAM in dynamic environments. (a) Spatial points are observed by consecutive camera centers, where the positions of stationary points { P i } remain fixed, while those of moving points { Q i } vary over time. (b) Misidentification of moving points between two views can lead to incorrect epipolar estimation, potentially resulting in ill-initialized states.
Sensors 25 02404 g001
Figure 2. Feature management across multiple frames via object points. (a) Matched features corresponding to spatial point X i are linked to object point O i , enabling efficient retrieval and association across frames. (b) Covisibility between frames is evaluated using object points O N observed on F N . Frames sharing the highest number of common observations are selected as candidates for structure computation.
Figure 2. Feature management across multiple frames via object points. (a) Matched features corresponding to spatial point X i are linked to object point O i , enabling efficient retrieval and association across frames. (b) Covisibility between frames is evaluated using object points O N observed on F N . Frames sharing the highest number of common observations are selected as candidates for structure computation.
Sensors 25 02404 g002
Figure 3. The monocular initialization procedure of (a) the proposed method and (b) ORB-SLAM3, with the indication of extracted features per frame of (c) the proposed method and (d) ORB-SLAM3.
Figure 3. The monocular initialization procedure of (a) the proposed method and (b) ORB-SLAM3, with the indication of extracted features per frame of (c) the proposed method and (d) ORB-SLAM3.
Sensors 25 02404 g003
Figure 4. Features and map points during monocular initialization in sampled frames from the sequence freiburg3_walking_xyz. In the original ORB-SLAM3 (extracting 5 × 1000 features per frame), (a) candidate features for initialization and (b) initial map points on the 18th construction frame are shown. The initial map points on the 16th construction frame of DynaSLAM (extracting 5 × 1000 features per frame) are shown in (c) for reference. Meanwhile, for the proposed method (extracting 1000 features per frame) of the same frame, (d) the spatial consistency ratio is calculated with ST-RANSAC after matching and updating object points, and (e) the stationary features are then sorted as candidates for initialization. (f) The initial map points on the 22nd construction frame in the proposed method are shown for comparison. Images are adapted from the TUM dataset.
Figure 4. Features and map points during monocular initialization in sampled frames from the sequence freiburg3_walking_xyz. In the original ORB-SLAM3 (extracting 5 × 1000 features per frame), (a) candidate features for initialization and (b) initial map points on the 18th construction frame are shown. The initial map points on the 16th construction frame of DynaSLAM (extracting 5 × 1000 features per frame) are shown in (c) for reference. Meanwhile, for the proposed method (extracting 1000 features per frame) of the same frame, (d) the spatial consistency ratio is calculated with ST-RANSAC after matching and updating object points, and (e) the stationary features are then sorted as candidates for initialization. (f) The initial map points on the 22nd construction frame in the proposed method are shown for comparison. Images are adapted from the TUM dataset.
Sensors 25 02404 g004
Figure 5. Comparison of initial map points on the construction frames of respective methods in sequence (a) Freiburg2_desk_with_person and (b) Freiburg3_walking_halfsphere from the TUM dataset; (c) Balloon, (d) Balloon2, (e) Crowd, (f) Crowd2, (g) Crowd3, and (h) Moving_obstructing_box2 from the Bonn dataset; and (i) Scene 1, (j) Scene 2, (k) Scene 3, and (l) Scene 4 in practical experiments, with ORB-SLAM3 on the top and the proposed method on the bottom.
Figure 5. Comparison of initial map points on the construction frames of respective methods in sequence (a) Freiburg2_desk_with_person and (b) Freiburg3_walking_halfsphere from the TUM dataset; (c) Balloon, (d) Balloon2, (e) Crowd, (f) Crowd2, (g) Crowd3, and (h) Moving_obstructing_box2 from the Bonn dataset; and (i) Scene 1, (j) Scene 2, (k) Scene 3, and (l) Scene 4 in practical experiments, with ORB-SLAM3 on the top and the proposed method on the bottom.
Sensors 25 02404 g005aSensors 25 02404 g005b
Figure 6. Comparison of metrics in ORB-SLAM3, DynaSLAM, and our method. (a) Processing time of frames, (b) trajectories, and (c) absolute pose error distributions over initialization in the sequence freiburg3_walking_xyz are shown, together with (df) in sequence freiburg3_walking_halfsphere, (gi) in sequence moving_obstructing_box, and (jl) in Scene 3 and (mo) in Scene 4 of the practical experiments.
Figure 6. Comparison of metrics in ORB-SLAM3, DynaSLAM, and our method. (a) Processing time of frames, (b) trajectories, and (c) absolute pose error distributions over initialization in the sequence freiburg3_walking_xyz are shown, together with (df) in sequence freiburg3_walking_halfsphere, (gi) in sequence moving_obstructing_box, and (jl) in Scene 3 and (mo) in Scene 4 of the practical experiments.
Sensors 25 02404 g006aSensors 25 02404 g006b
Table 1. Dataset experimental results and comparison.
Table 1. Dataset experimental results and comparison.
DatasetSequenceSkipped
Images
MethodInit.
Frame
Pair
Peak
Init.
t. (ms)
Mean
Track
t. (ms)
RMS
ATE
(cm)
RMS
RPE
(cm)
TUMfr2 desk
with person
1060ORB3 *25th, 33rd46.01014.2150.4520.729
MOD *7th, 9th40.03810.9780.3970.567
Dyna *1st, 13th869.127842.3610.3410.637
fr3 walking
xyz
-ORB314th, 18th51.66314.8113.2655.193
MOD5th, 22nd41.19614.1531.1381.704
Dyna1st, 16th936.505866.9351.1501.735
fr3 walking
rpy
-ORB321st, 30th55.96516.1521.3611.414
MOD23th, 33th50.96318.6781.3071.253
Dyna21st, 30th927.457855.3521.4271.713
fr3 walking
halfsphere
-ORB31st, 4th50.71215.2193.6502.484
MOD2nd, 5th21.0488.6923.2281.732
Dyna1st, 4th869.516837.7862.9652.614
BonnBalloon-ORB31st, 7th43.03814.4492.8107.189
MOD16th, 24th38.90411.7231.4673.586
Dyna1st, 35th861.660843.2333.9908.511
Balloon2-ORB31st, 9th39.41814.1792.93112.086
MOD4th, 10th29.4738.5723.0278.477
Dyna1st, 32th881.597843.6392.22319.521
Balloon
tracking
280ORB31st, 16th40.39511.92021.85714.458
MOD10th, 23rd51.35811.4958.39611.765
Dyna1st, 26th864.395843.1886.95113.419
Balloon
tracking2
200ORB341st, 45th22.1819.0234.20711.710
MOD44th, 51st47.04914.4933.28911.191
Dyna67th, 70th870.908838.2812.05112.102
Crowd230ORB31st, 6th58.72417.7142.7522.844
MOD1st, 16th32.4689.6182.3142.407
DynaFailed
Crowd2 -ORB31st, 13th40.51012.4791.1109.976
MOD6th, 10th36.44910.3461.0119.373
DynaFailed
Crowd330ORB31st, 4th48.94312.3751.2913.987
MOD4th, 10th27.1378.4120.7543.050
DynaFailed
BonnMoving
nonobstructing
box
250ORB31st, 13th41.73813.0512.9747.201
MOD8th, 14th40.5439.2302.6828.717
Dyna1st, 33th884.943857.9581.71217.789
Moving
nonobstructing
box2
470ORB31st, 6th36.21211.9343.6155.744
MOD23rd, 37th47.30811.4502.7175.523
Dyna1st, 13th877.393842.3292.7699.729
Moving
obstructing
box
300ORB31st, 26th38.40112.6021.60813.985
MOD8th, 22th48.95810.9821.32612.076
Dyna37th, 70th880.056841.2871.02525.772
Moving
obstructing
box2
400ORB31st, 3rd55.76312.5764.3506.197
MOD7th, 15th39.01312.5312.2594.518
Dyna1st, 5th869.905832.8192.4769.121
Person
tracking 
10ORB31st, 10th55.08413.5283.4223.589
MOD25th, 37th55.42112.8891.7273.612
Dyna1st, 13th916.646860.4980.9386.470
Person
tracking2
40ORB3Failed
MOD22nd, 24th47.4959.8753.1962.457
DynaFailed
* ORB3: Original ORB-SLAM3 system. MOD: The modified system with the proposed initialization method. Dyna: DynaSLAM system. The feature extraction number during initialization is doubled to 2000 in the modified system. Tracking failure occurs after around the 50th frame in DynaSLAM, and thus, the corresponding RMSE is not comparable.
Table 2. Practical experimental results and comparison.
Table 2. Practical experimental results and comparison.
SceneMethodInit.
Frame
Pair
Peak
Init.
t. (ms)
Mean
Track
t. (ms)
RMS
ATE
(cm)
1ORB31st, 20th89.96917.7004.202
MOD4th, 21st77.77315.7294.011
Dyna1st, 20th905.595859.1084.026
2ORB3Failed
MOD12th, 14th57.86813.6927.114
Dyna1st, 35th898.462861.5649.001
3ORB3Failed
MOD8th, 23rd77.08315.4784.615
Dyna1st, 7th909.335857.5314.799
4ORB382nd, 84th111.02753.3916.185
MOD24th, 25th88.04618.80413.729
Dyna91st, 95th930.409881.3957.073
5ORB31st, 26th74.86518.4436.558
MOD3rd, 20th70.30314.9515.429
Dyna1st, 43rd894.833864.2545.599
Table 3. Success rate in experiments.
Table 3. Success rate in experiments.
ScenarioNumber of
Sequences
MethodTrialsSuccess
Rate
Challenging
Sequence
TUM
dataset
4ORB32095%Walking RPY
MOD2095%Walking RPY
Dyna20100%
Bonn
dataset
13ORB36592.31%Person Tracking 2
MOD6595.38%Balloon Tracking 2, Crowd 2
Dyna6569.23%Crowd 1–3, Person Tracking 2
Practical
experiment
5ORB32560%Scene 2, 3
MOD2592%Scene 2
Dyna2588%Scene 2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Dou, H.; Liu, B.; Jia, Y.; Wang, C. Monocular Initialization for Real-Time Feature-Based SLAM in Dynamic Environments with Multiple Frames. Sensors 2025, 25, 2404. https://doi.org/10.3390/s25082404

AMA Style

Dou H, Liu B, Jia Y, Wang C. Monocular Initialization for Real-Time Feature-Based SLAM in Dynamic Environments with Multiple Frames. Sensors. 2025; 25(8):2404. https://doi.org/10.3390/s25082404

Chicago/Turabian Style

Dou, Hexuan, Bo Liu, Yinghao Jia, and Changhong Wang. 2025. "Monocular Initialization for Real-Time Feature-Based SLAM in Dynamic Environments with Multiple Frames" Sensors 25, no. 8: 2404. https://doi.org/10.3390/s25082404

APA Style

Dou, H., Liu, B., Jia, Y., & Wang, C. (2025). Monocular Initialization for Real-Time Feature-Based SLAM in Dynamic Environments with Multiple Frames. Sensors, 25(8), 2404. https://doi.org/10.3390/s25082404

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop