An Improved Visual SLAM Based on Map Point Reliability under Dynamic Environments

Ni, Jianjun; Wang, Li; Wang, Xiaotian; Tang, Guangyi

doi:10.3390/app13042712

Open AccessArticle

An Improved Visual SLAM Based on Map Point Reliability under Dynamic Environments

¹

College of Internet of Things Engineering, Hohai University, Changzhou 213022, China

²

School of Artificial Intelligence, Hohai University, Changzhou 213022, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(4), 2712; https://doi.org/10.3390/app13042712

Submission received: 16 January 2023 / Revised: 16 February 2023 / Accepted: 17 February 2023 / Published: 20 February 2023

(This article belongs to the Section Robotics and Automation)

Download

Browse Figures

Versions Notes

Abstract

:

The visual simultaneous localization and mapping (SLAM) method under dynamic environments is a hot and challenging issue in the robotic field. The oriented FAST and Rotated BRIEF (ORB) SLAM algorithm is one of the most effective methods. However, the traditional ORB-SLAM algorithm cannot perform well in dynamic environments due to the feature points of dynamic map points at different timestamps being incorrectly matched. To deal with this problem, an improved visual SLAM method built on ORB-SLAM3 is proposed in this paper. In the proposed method, an improved new map points screening strategy and the repeated exiting map points elimination strategy are presented and combined to identify obvious dynamic map points. Then, a concept of map point reliability is introduced in the ORB-SLAM3 framework. Based on the proposed reliability calculation of the map points, a multi-period check strategy is used to identify the unobvious dynamic map points, which can further deal with the dynamic problem in visual SLAM, for those unobvious dynamic objects. Finally, various experiments are conducted on the challenging dynamic sequences of the TUM RGB-D dataset to evaluate the performance of our visual SLAM method. The experimental results demonstrate that our SLAM method can run at an average time of 17.51 ms per frame. Compared with ORB-SLAM3, the average RMSE of the absolute trajectory error (ATE) of the proposed method in nine dynamic sequences of the TUM RGB-D dataset can be reduced by 63.31%. Compared with the real-time dynamic SLAM methods, the proposed method can obtain state-of-the-art performance. The results prove that the proposed method is a real-time visual SLAM, which is effective in dynamic environments.

Keywords:

visual SLAM; dynamic environment; map point reliability; geometric constraint

1. Introduction

SLAM is a technology for self-positioning and environment map construction in unknown environments with sensors on machines [1,2,3]. In the early research of SLAM, researchers described the problem of SLAM as a state estimation problem of spatial uncertainty, and used an extended Kalman filter (EKF), a particle filter (PF) and maximum likelihood estimation (MLE) to eliminate the uncertainty in the process of positioning and mapping [4,5,6]. In 2009, Lourakis and Argyros [7] realized the sparsity and symmetric structure of the Hessian matrix in bundle adjustment (BA) and published their research results, which enabled BA to be accelerated. Since then, BA as the core graph optimization algorithm began to replace the filtering algorithm to become the mainstream method to eliminate uncertainty.

Visual SLAM (V-SLAM) uses the camera as the main sensor and only obtains input information from the image [8,9]. Compared with other LiDAR or inertial sensors, it has the advantages of rich information, low cost, lightweight and small volume. At present, V-SLAM is widely used in autonomous driving, AR (augmented reality)/VR (virtual reality), and other fields [10,11,12]. The framework of V-SLAM is mainly divided into two parts: visual front-end and back-end optimization. The visual front-end is also commonly referred to visual odometry (VO) [13,14]. The main function of VO is to calculate the rough poses of the camera and generate new map points. The back-end optimization uses general graph optimization (

g^{2} o

) [15] to optimize the data passed from the front-end to eliminate random errors. There are two kinds of VO: the feature-based method and the direct method. The feature-based method extracts representative points from the frames and then uses them to estimate the camera poses and build the map. The feature-based method has high robustness because it can still identify the same points when the scene and camera perspective slightly change. However, for the feature point with higher robustness, its calculation consumes higher resources. The representative SLAM of the feature-based method is the ORB-SLAM series algorithms [16,17]. The direct method SLAM estimates the pose and builds the map according to the brightness information of the pixels, which consumes fewer resources. The direct SLAM can generate a dense or semi-dense map. However, it is sensitive to the image quality [18,19,20]. Direct sparse odometry (DSO) [21] is the representative SLAM of the direct method SLAM. At present, the feature-based method is the mainstream method of VO in V-SLAM.

At present, most of the high robustness and high efficiency of V-SLAM are achieved in a static environment. However, the real world is complex and dynamic most of the time. The movement of people, animals and vehicles may cause V-SLAM to fail in feature matching and loop closure detection. To deal with these problems, researchers have come up with various processing methods to find moving objects in images [22,23]. One method is based on deep learning, and the other is based on consistency detection.

Deep-learning-based methods use deep neural networks to identify objects that are thought to move with high probability, such as Mask R-CNN [24], SegNet [25] and you only look once (YOLO) [26,27] and then track and maintain these objects or cover them with a mask [28,29]. To reduce the impact of the neural network on the real-time performance of V-SLAM, some researchers run the neural network in an independent thread [23,30] or just use the neural network in some specifically selected frames and spread the detection results to other frames by feature points matching [31,32,33]. More recent methods based on deep learning tend to go further by not just removing the dynamic foreground, but also inpainting or reconstructing the static background that is occluded by moving targets [34,35,36]. Benefiting from the rapid advances in deep learning and easy embedding into V-SLAM, this kind of dynamic SLAM works well in most scenarios. However, low running rate and inability to deal with objects that the neural network has not yet trained are two unavoidable disadvantages that need more research. Therefore, this paper focuses on the real-time V-SLAM method, which is not based on deep learning.

For the method based on consistency detection, researchers search for consistency between static backgrounds or dynamic objects, such as geometric or physical consistency and so on. Compared with learning-based dynamic V-SLAM, this kind of SLAM usually has better real-time performance. Some researchers try to find the dynamic objects in a single scene. Alcantarilla et al. [37] used scene flow to distinguish moving objects from the static scene, and the calculation of the scene flow was based on a result estimated by VO. Li and Lee [38] presented an RGB-D SLAM based on static point weighting (SPWSLAM), which extracted the feature points at the depth edges of the RGB-D frames. In addition, some studies have tried to use multi-scene contrast to find dynamic objects. For example, Kerl et al. [39] proposed dense visual odometry and SLAM (DVO), which found the dynamic region by minimizing the photometric error between two consecutive RGB-D frames and used a robust error function to reduce the influence of large residuals. Kim and Kim [40] presented an effective background model-based DVO, which obtained the static region in images by computing the depth differences between consecutive RGB-D frames. Sun et al. [41] proposed a novel motion removal SLAM (MotionRemove), which computed the camera ego-motion using RANSAC-based homography estimation with two consecutive RGB-D images. Then, the moving-object motions are roughly detected by subtracting the current RGB-D frame with the ego-motion-compensated last RGB-D frame. Dai et al. [42] proposed a dynamic SLAM (DSLAM), which used Delaunay triangulation to connect mapped points to surrounding points and then treated the two points that remain connected in the next several frames as two points on the same rigid object, finally assuming the largest set of points that remain connected as a static background.

There are some other improved V-SLAM methods based on consistency detection. For example, Zou and Tan [43] used two moving cameras to identify dynamic objects and considered the objects tracked successfully by one camera but unsuccessfully by the other camera in successive frames as moving objects but did not consider the occlusion problem under different viewing angles. Zhang et al. [44] used PWC-net to obtain optical flow from RGB-D images. The error of optical flow was also judged by the initial relative poses of the front and back frames, and the optical flow with a large error was regarded as a dynamic object. Based on the KinectFusion dense SLAM system, Palazzolo et al. [45] calculated the residual of each pixel and then obtained the approximate dynamic region by adaptive threshold segmentation, finally obtaining the final dynamic region and the static background map by morphological processing. However, it still needs a lot of research on parameter judgment and threshold setting. It is worth noting that many researchers are trying to use both deep learning and consistency detection in dynamic SLAM.

As introduced above, the dynamic SLAM using consistency detection can usually run at high speed, but it is usually not guaranteed to completely screen out the small dynamic region in the image, and it is still disturbed in some scenes where objects are moving slowly. Thus, we propose an improved V-SLAM to reduce the impact of moving objects in a dynamic environment. In the proposed method, a concept of map point reliability is presented, which is used to describe the map points with high observation times as reliable map points. In addition, a concept of unreliable map points is presented, which is used to define the map points with fewer observation times. Based on the two concepts of the map points, we deal with those obvious and unobvious dynamic objects in V-SLAM.

The main contributions of this paper are as follows: (1) An improved V-SLAM method based on ORB-SLAM3 is proposed, which can not only retain its original performance but also work well in dynamic environments. (2) A mixed strategy combining new map points screening with repeated exiting map points elimination is proposed to identify obviously dynamic map points. (3) The strategy utilizes map point reliability to identify unobviously dynamic map points. (4) Extensive experiments demonstrate the benefits of the processing strategies for dynamic objects in the V-SLAM method. The proposed method achieves accuracy that outperforms the state-of-the-art real-time V-SLAM methods on the TUM RGB-D dataset.

The rest of this paper is structured as follows: Section 2 introduces the framework of our proposed V-SLAM method in detail. Section 3 reports the experiment results evaluated on the TUM dataset. The conclusions and future work are given in Section 4.

2. Proposed Method

In this section, we introduce the overall framework of our proposed SLAM system.Then, we introduce the concept of map points reliability and our two main improvements using this concept in a dynamic environment.

2.1. Framework of Our Method

Our method is based on ORB-SLAM3, which has been proven to be efficient in static environments [16,46]. ORB-SLAM3 is the first open-source SLAM system capable of performing vision, visual inertia, and multiple maps with monocular, binocular, and RGB-D cameras using pinhole or fisheye camera models. ORB-SLAM3 includes four parts: atlas, tracking thread, local mapping thread and loop and map merging thread. Its biggest advantage is that it allows matching and using observations made before BA, enabling data association in the short, medium and long term. However, ORB-SLAM3 is designed to run in a static environment or a very slightly dynamic environment. ORB-SLAM3 calculates the pose of the current frame through feature matching and

g^{2} o

algorithm in the tracking thread. When it comes to a dynamic environment, the number of correct matching points between two dynamic feature points significantly increases, exceeding the range that the

g^{2} o

algorithm can optimize. Because of this, ORB-SLAM3 cannot perform well in a dynamic environment.

The framework of our proposed algorithm is shown in Figure 1, which is introduced as follows. (1) ATLAS: It is a multi-map representation composed of a set of disconnected maps. Only the activated map is used by the tracking thread and optimized by the local mapping thread. (2) TRACKING: This thread processes sensor information and computes the pose of the current frame according to the active map in real-time by reliable map points and temporary reliable map points, minimizing the reprojection error of the matched map features, checking the reliability of unreliable map points, selecting the new keyframe and handling the new map points. (3) Local mapping: This thread adds keyframes and points to the active map, removes the redundant ones and refines the map using visual or visual-inertial bundle adjustment and operating in a local window of keyframes close to the current frame. New map points generated in this thread will be processed in the same way as new map points in the tracking thread. (4) Loop and map merging: This thread detects common regions between the active map and the whole ATLAS at the keyframe rate. (5) Loop correction: This thread will be performed if the common area belongs to the active map. Otherwise, SLAM merges non-activated items into the active map, then runs a full BA in an independent thread.

The basic workflow of the proposed method is the same as ORB-SLAM3. However, there are some differences between the proposed method and the traditional ORB-SLAM3. In the proposed method, the map points are divided into three categories, namely the reliable map points (R-MPs), the unreliable map points (U-MPs) and the temporary reliable map points (TR-MPs). Based on this classification of map points, some novel map points processing strategies are added to the framework of ORB-SLAM3, including a map point selection operation, an unreliable map point check operation, etc. In this section, only the different parts of the proposed method are introduced in detail.

2.2. New Map Points Processing Strategy

The algorithm in this section mainly deals with new map points. In the general ORB-SLAM3 algorithm, new map points may have similar descriptors to existing map points, or they may be influenced by a mismatch in the dynamic environment or two frames that are too close together in triangulation, which will lead to generating a wrong coordinate. To deal with this problem, we propose a new map points processing (called NMPP) strategy. In this strategy, for each new map point, we perform the following two operations, namely a distance check and a repeated existing map points elimination operation.

2.2.1. Distance Check Operation

While ORB-SLAM3 runs in a dynamic environment, if the dynamic objects occupy most of the field of view, the new map point will be generated at an error coordinate caused by a mismatch or wrong triangulation. They are clustered around the camera’s photo center, as shown in Figure 2.

Therefore, after generating new map points, we need to perform a distance check operation (called DC) to calculate the Euclidean distance between the map point and the camera, which is described as follows:

D_{p - c} = {∥ P_{C} - P_{x y z} ∥}_{2}

(1)

where

D_{p - c}

is the distance between map point and camera,

P_{C}

is the coordinate of the map point, and

P_{x y z}

is the translation of the camera at this time. Then, the map points whose distance is lower than the threshold are eliminated.

2.2.2. Repeated Existing Map Points Elimination Operation

When SLAM runs in a dynamic environment, dynamic landmarks may leave repeated map points in the map point set, whereas the original ORB-SLAM3 has no mechanism to deal with this problem. So, we add a repeated existing map points elimination operation (called REMPE) to the ORB-SLAM. As shown in Figure 3, if the moving map points are not removed in time, they will leave a large number of outdated map points in the map point set of the SLAM system, which represent the objects that have passed through these places. Therefore, before a new map point is added to the map point set, it is necessary to eliminate the existing similar map points.

Directly traversing all the map points in the map point set one by one is time-consuming work, which seriously affects the real-time performance of the SLAM system. As the map has more points, it will take a longer time. The ORB-SLAM3 system uses DBoW2 [47] to accelerate loopback detection, reference keyframe tracking and relocation tracking. In this paper, we use the existing DBoW2 bag of words model in ORB-SLAM3 to calculate the NodeId of each map point when each map point is generated by the SLAM system. Then, each of these new map points is compared with the existing map points that have the same NodeId. A similar score is described as:

s i m (d, d_{i}) = {∥ d - d_{i} ∥}_{2}

(2)

where d is the description of the current map point,

d_{i}

is the description of one map point which has the same NodeId as the current map point, and

s i m (d, d_{i})

is the score function which is in integer form.

The specific steps of the proposed repeated existing map points elimination operation are: the SLAM system finds the remaining map points whose similarity with the descriptor is within the threshold range (which is set as 0.5 in this study) and then removes them from the map point atlas. The pseudocode of the proposed operation is shown in Algorithm 1.

2.3. Unreliable Map Points Processing Strategy

The algorithm in this section mainly deals with map points left in SLAM by landmarks that are not moving fast enough. As we know, the traditional V-SLAM uses the PnP (perspective-n-point) method to calculate the pose of a frame, which is as follows:

p = Φ \cdot T \cdot P

(3)

where the p is the feature point extracted from the current frame and is a projection of a landmark in the real world,

Φ

is the camera calibration and distortion parameters, P is a map point of the same landmark which is generated in the previous frame, and T is the pose of the current frame. In a static environment, P can always be treated as a landmark. However, when it comes to dynamic environments, P can only represent the map point of the landmark at the previous time. Although p and P can be matched through the description, this matched pair cannot be used for pose calculation.

Algorithm 1 Repeated map points elimination operation

Input:P: New map points in current frame;

M

: A set of all map points in the active map;

1:: for each $p \in P$ do
2:: %Compute the node serial number (denoted by $N o d e I d$ ) of P;
3:: $p . N o d e I d = f_{D B o W 2} (p)$ ;
4:: % $f_{D B o W 2}$ is the algorithm of DBoW2;
5:: $S = M (p . N o d e I d)$ ;
6:: %Get the map points set in $M$ at the same $N o d e I d$ as p;
7:: for each $s \in S$ do
8:: if $s i m (p . d, s . d) ⩽ 0.5$ then
9:: % $s i m (\cdot)$ is the score function; $p . d$ is the description of the current map point p;
10:: if $s . i s B a d = = 1$ then
11:: % $s . i s B a d$ is the flag to denote whether s has been checked before;
12:: Break;
13:: else
14:: $s . i s B a d = 1$ ;
15:: end if
16:: end if
17:: end for
18:: end for
19:: for each $p \in P$ do
20:: $S = M (p . N o d e I d)$ ;
21:: $S . i n s e r t (p)$ ;
22:: %Insert p into the end of $S$ ;
23:: end for

To deal with this problem, we propose an unreliable map points processing strategy (called UMPP). As we know, in feature-based visual SLAM, matching pairs of map points and projection points are used to calculate the pose of the camera in each frame. There are correct matches and mismatches in these matching pairs, but not all matching pairs in the current frame are needed. The proposed UMPP strategy can be described as follows:

\begin{matrix} p & = Φ \cdot T \cdot P^{r} \end{matrix}

(4)

\begin{matrix} P^{r} & = UMPP (T, P^{u}) \end{matrix}

(5)

where

P^{r}

is the reliable part of map points, and

P^{u}

is the unreliable part of map points. As shown in Figure 4, the idea of the proposed UMPP strategy is that we can calculate the pose of the camera with reliable map points (regardless of whether they are mismatches) and then calculate the reprojection of the rest of the map points with unknown reliability at this pose.

In the proposed V-SLAM method, we set the map points which have not been checked three times as unreliable map points (called U-MPs). If the error between the reprojection coordinate and the corresponding feature point coordinate is within the threshold range, the current unknown reliability point is considered to pass the check at this time. Otherwise, the current map point is considered a dynamic map point (called D-MP), which needs to be eliminated from the current frame. Finally, the U-MPs whose reliability exceeds the threshold are set as the reliable map points (called R-MPs).

It is noted that within an early period of time when SLAM is running, we cannot find any reliable map points, so the map points matched by these frames can be set as R-MPs only by passing the test described in Section 2.2.

2.3.1. Map Points Selection Operation

In the stage of pose calculation, we use R-MPs to calculate the pose. However, in the actual processing, we need a certain number of matching point pairs to achieve enough redundancy. When the number of these R-MPs is insufficient, we need to choose some temporary reliable points (called TR-MPs) from the map points for pose calculation. So, we propose a map point selection operation (MPS). The basic idea of MPS is introduced as follows.

We think that a point has a higher probability of being reliable if most of the points around it are reliable. Our method is to triangulate all feature points in the current frame by the Delaunay triangulation. Triangulation is the process of obtaining a set of triangles from a set of planar points. Delaunay triangulation is one of the best triangulations with a lot of advantages, including maximizing the minimum angles of the triangles and forming a triangle with the closest three points. With the Delaunay triangulation, each point is connected to its surrounding points by edges. We rank the map points from greatest to smallest in terms of their overall reliability with the surrounding points, and the top 20% of the map points are set as TR-MPs to calculate the frame pose. The details of the proposed MPS operation are shown as Algorithm 2.

Algorithm 2 The proposed MPS operation

Input:P: Map points in current frame;

Output:

P_{t e}

: Temporary reliable map points;

1:: $P = P . f e a t u r e ()$ ;
2:: % $P . f e a t u r e ()$ is the function used to get the set of feature points of P;
3:: $T = D e l a_T (P)$ ;
4:: % $D e l a_T (\cdot)$ is the function used to convert a point set to a triangle set;
5:: for each $p \in P$ do
6:: $ϱ$ = { }; %Set up a sub triangle set;
7:: for each $t \in T$ do
8:: if $p \in t$ then
9:: $ϱ . i n s e r t (t)$ ; %Insert t into the end of $ϱ$ ;
10:: end if
11:: end for
12:: $ϑ$ = Get_U( $ϱ$ );
13:: %Get_U(·) is the function used to get all unique points in $ϱ$ ;
14:: $υ = 0$ ;
15:: for each $ς \in ϑ$ do
16:: $γ = R e l i a b i l t y (ς)$
17:: %Get the reliability of the map point with the feature point $ς$ ;
18:: if $ς . R e l i a b i t y = = 1$ then
19:: % $ς . R e l i a b i t y$ is the flag to denote whether $ς$ belongs to a reliable map point;
20:: $υ = υ + 3 γ$
21:: else
22:: $υ = υ + γ$
23:: end if
24:: end for
25:: $p . v a l u e = υ$ ; % $p . v a l u e$ denotes the reliability weight of the feature point p;
26:: end for
27:: P = Sort(P);
28:: %Sort the map points according to the reliability weight of their feature point;
29:: $P_{t e}$ = Select(P);
30:: %Select the top 20 percent of P as $P_{t e}$ ;

The examples of the proposed MPS operation based on the Delaunay triangulation are shown in Figure 5. The green points are the feature points of R-MPs, which are used for pose calculation. The blue and red points are the feature points of TR-MPs and U-MPs, respectively. Finally, it should be noted that there is no R-MP in the early stage of SLAM. So, we set all the U-MPs as TR-MPs in this stage.

2.3.2. Unreliable Map Points Check Operation

After calculating the pose of the current frame, we present an unreliable map points check (UMPC) operation. The details of the proposed UMPC operation are introduced as follows.

First, we calculate the reprojection points of the U-MPs by the pose of the current frame. Then, we calculate the Euclidean distance between each projection point and its corresponding feature point. Map points whose distance exceeds the threshold will be removed from the current frame.

The proposed SLAM can work in three modes: RGB-D, monocular and stereo. For each frame, the coordinate of the unreliable map point is

P^{u} = {[X, Y, Z]}^{T}

; the observation point of the map point in the RGB-D image is

p = {[u, v, d]}^{T}

, and the observation point of map point in mono image and left or right camera in stereo is

p = {[u, v]}^{T}

.

Thus, for the map point observed by the RGB-D camera, the reprojection

\tilde{p}

of

P^{u}

in the normalized plane is calculated by

\tilde{p} = {[x, y, z]}^{T} = {(T \cdot P^{u})}_{1 : 3},

(6)

and for the right camera in stereo, the equation is

\tilde{p} = {[x, y, z]}^{T} = {(T_{m t r l} \cdot T \cdot P^{u})}_{1 : 3}

(7)

where

T_{m t r l}

is the right-to-left transformation matrix.

Then, the coordinate of projection point

p^{'}

of

\tilde{p}

in the image from the RGB-D camera is shown as follows:

p^{'} = {[f_{x} \cdot \frac{x}{z} + c_{x}, f_{y} \cdot \frac{y}{z} + c_{y}, \frac{f_{x} \cdot x - b f}{z} + c_{x}]}^{T}

(8)

while for the point in a mono camera and a left or right stereo camera, the

p^{'}

is calculated by:

p^{'} = {[f_{x} \cdot \frac{x}{z} + c_{x}, f_{y} \cdot \frac{y}{z} + c_{y}]}^{T}

(9)

where

f_{x}, c_{x}, f_{y}, c_{y}, b f

are the internal parameters of the camera.

The error between the reprojection point

p^{'}

and the observation point p is shown as follows:

E r r o r = μ \times ∥ (p^{'} - p) ∥_{2}

(10)

where

μ

is the effective coefficient, and in this paper, it is the scaling coefficient of the pyramid layer where the ORB feature point is located.

To test whether the error exceeds the threshold ▵, we use a chi-square distribution, as described by

▵ (x; k) = \{\begin{matrix} \frac{1}{2^{\frac{k}{2}} Γ (\frac{k}{2})} x^{\frac{k}{2} - 1} e^{\frac{- x}{2}}, & x > 0 \\ 0, & o t h e r w i s e \end{matrix}

(11)

Γ (k) = (k - 1)!

(12)

where k is the degree of freedom, and x is the degree of confidence. Since the dimension of the matching point of RGB-D is 3, and the 3 dimensions are independent of each other, we take the degree of freedom

k = 3

, and the degree of confidence

x = 0.05

. For mono and stereo images, we set

k = 2

, and

x = 0.05

. If

E r r o r

exceeds the chi-square statistic, it is considered that the reprojection of the map point under the current pose is inconsistent with the original matching point, and it will be moved.

The workflow of the proposed map points processing in the proposed method is shown in Figure 6.

3. Experiments

3.1. Datasets and Evaluation Criteria

In this section, we test the performance of our proposed algorithm in dynamic environments. All experiments were run on Ubuntu 20.04 LTS using an Inter(R) Core(TM) [email protected] CPU and 32GB@2400MHz RAM. The test dataset we used is the TUM RGB-D dataset [48,49], which is widely used for dynamic SLAM testing. The TUM RGB-D dataset, published by TUM Computer Vision Group in 2012, consists of 39 sequences recorded at 30 frames per second using a Microsoft Kinect sensor in different indoor scenes. It also comes with ground truth and some performance evaluation tools.

In these experiments, we use dynamic object sequences to test the accuracy and real-time performance of our proposed algorithm. In these sequences of the TUM RGB-D dataset, the camera is moved at five different modes. (1) The “static” mode: the camera is static with slight shaking. (2) The “xyz” mode: the camera is moved approximately along the X, Y and Z axis with few rotational components. (3) The “rpy” mode: the camera is mostly only rotated around the principal axes with few translational motions. (4) The “halfsphere” mode: The camera moves back and forth along the latitude and longitude of the hemisphere, with the lens pointed at the center of the ball. (5) The “desk-person” mode: a person is sitting at one of the desks and continuously moves several objects around. Here, we select 5 sequences of slightly dynamic scenes and 4 sequences of highly dynamic scenes. In the slightly dynamic scene, two people sit on a chair, only moving their hands or faces and keeping their bodies static, while the camera moves in different ways. The slightly dynamic sequences we used include fr2/desk-person (fr2/d/person), fr3/sitting-static (fr3/s/static), fr3/sitting-xyz (fr3/s/xyz), fr3/sitting-rpy (fr3/s/rpy) and fr3/sitting-halfsphere (fr3/s/half). The highly dynamic scenes are where two people move freely, stand or sit at will, and the camera moves in the same way as the types above. The highly dynamic sequences we used are fr3/walking-static (fr3/w/static), fr3/walking-xyz (fr3/w/xyz), fr3/walking-rpy (fr3/w/rpy) and fr3/walking-halfsphere (fr3/w/half).

We evaluate the performance of our method on these sequences by evaluating the quality of the estimated trajectories from the ground truth trajectories with two kinds of evaluation metrics, relative pose error (RPE) and absolute trajectory error (ATE) [50].

For a sequence of poses from the estimated trajectories

P_{1}, \dots, P_{n} \in S E (3)

and from the ground truth trajectories

Q_{1}, \dots, Q_{n} \in S E (3)

, the RPE measures the local accuracy of the trajectories over a fixed time interval

Δ

. The RPE at time step i is described as:

E_{i} = {(Q_{i}^{- 1} Q_{i + Δ})}^{- 1} (P_{i}^{- 1} P_{i + Δ})

(13)

From a sequence of n camera poses, we obtain individual RPE along the sequence in this way:

m = n - Δ

. Since each

E_{i}

consists of a translation part and a rotation part, we need to calculate them separately. Because the TUM RGB-D dataset was recorded with

30 Hz

, we set

Δ = 30

. From these errors, the root mean squared error (RMSE) over all time indices of the translational and rotational components is calculated as:

Trans . RMSE = {(\frac{1}{m} \sum_{i = 1}^{m} ∥ t r a n s (E_{i}) ∥^{2})}^{1 / 2}

(14)

Rot . RMSE = {(\frac{1}{m} \sum_{i = 1}^{m} ∥ r o t (E_{i}) ∥^{2})}^{1 / 2}

(15)

where

t r a n s (E_{i})

and

r o t (E_{i})

refers to the translational and rotational component of the RPE

E_{i}

, respectively.

The ATE is used to evaluate global consistency. The ATE at time step i can be computed as:

F_{i} = Q_{i}^{- 1} S P_{i}

(16)

where

S

is the rigid-body transformation corresponding to the least-squares solution,

P_{i}

is the estimated trajectory, and

Q_{i}

is the ground truth trajectory.

For ATE, we just evaluate its translation part. The RMSE over all time indices of the translational components of ATE is calculated as:

ATE . RMSE = {(\frac{1}{n} \sum_{i = 1}^{n} ∥ t r a n s (F_{i}) ∥^{2})}^{1 / 2}

(17)

3.2. Comparison with the State of the Art

Firstly, we will introduce the robustness and real-time performance of our SLAM on this dataset and compare it with the other state-of-the-art real-time SLAM methods based on the consistency principle. The V-SLAM methods based on the consistency principle used in the comparison experiments are DVO [39], StaticFusion [51], SPWSLAM [38], MotionRemove [41] and DSLAM [42]. The results are shown in Table 1, Table 2 and Table 3.

The results in Table 1, Table 2 and Table 3 show that our SLAM achieves the best performance in the whole highly dynamic sequences compared with the state-of-the-art methods. In the five slightly dynamic sequences, the proposed method also achieves the best performance in four sequences, except in fr3/s/xyz, the ATE.RMSE of the proposed method is higher than that of DSLAM. However, the average ATE.RMSE of the proposed method reduces to 0.0255 m, which is about 37.35% less than DSLAM (relative value). In terms of the Rot.RMSE, DSLAM can obtain the second-best performance. However, the average Rot.RMSE of DSLAM increases by about 111.62% (relative value) compared with the proposed method. This result shows that DSLAM cannot handle camera rotation well.

In addition, compared with ORB-SLAM3, our method can outperform ORB-SLAM3. The average Rot.RMSE, Trans.RMSE and ATE.RMSE of the proposed method reduce by about 75.63%, 48.81% and 63.31% (relative value), respectively. The main reason is that by screening each new map point using the distance threshold, our SLAM can significantly reduce the number of erroneous map points generated due to mismatches in a purely rotational environment. In addition, as the camera keeps still, this method can also eliminate the wrong map points generated by triangulation due to two frames being too close to each other. The comparison results of the trajectory between the proposed method and ORB-SLAM3 are shown in Figure 7 and Figure 8. Some examples of the working process of the proposed method and ORB-SLAM3 in fr3/w/xyz are shown in Figure 9.

In Figure 7 and Figure 8, the blue line in the figure above is the trajectory of the V-SLAM system, the black line is the ground truth trajectory, and the red line segments are used to connect the same timestamp points at two trajectories. So, the longer the red segment is, the larger the error trajectory is at that timestamp. The results in Figure 7 and Figure 8 show that the error trajectory of our proposed method is less than that of ORB-SLAM3, especially in the sequences fr3/s/rpy, fr3/w/static, fr3/w/half, and fr3/w/xyz, where the improvements of the proposed method are 60% higher than ORB-SLAM3 (see Table 3). The results in Figure 9 show that the proposed method can deal with different types of map points, which are marked in different colors and sizes. However, ORB-SLAM3 uses all map points indiscriminately, ignoring possible potential errors between map points and their real-world landmarks. This makes it perform poorly in dynamic environments.

In order to show the performance of the proposed method in dynamic environments, we also compare our proposed method with the state-of-the-art V-SLAM methods based on deep learning, which are not suited for real-time applications. These methods include Co-fusion [52], MaskFusion [34], DynaSLAM [28], EM-Fusion [29], DS-SLAM [30] and RDS-SLAM [23]. To make the paper more concise, here only the ATE.RMSE of these methods is given. The results are listed in Table 4.

In the comparison results of SLAM based on deep learning, except for DynaSLAM, our SLAM performs better than SLAM in the other nine dynamic sequences, as shown in Table 4. However, our SLAM obtains the best performance in all five slightly dynamic sequences (in the sequences fr2/s/static and fr3/s/xyz. The results of our method are almost the same as those of DS-SLAM and DynaSLAM, where the differences of the ATE.RMSE are 0.1 mm). The main reason is that the proposed method has some additional operations to deal with the unobvious dynamic map points in unreliable map points.

Furthermore, the proposed method obtains the best performance in fr3/w/half in the comparison with the deep-learning-based method. This is because, on the one hand, DynaSLAM is a SLAM that combines deep learning and consistency detection, which consumes more computing resources to deal with dynamic environments. On the other hand, the nature of deep learning determines that it cannot completely identify all the dynamic objects in each frame of the highly dynamic sequences. When a person walks behind a desk with only one head exposed, or when the person stands at the edge of the visual field with only his or her hands moving in the visual field, deep-learning SLAM tends to ignore the targets of these movements. Secondly, when some objects that are considered static, such as the keyboard and mouse, move, SLAM based on deep learning also ignores these movements. Therefore, our SLAM can also achieve good results compared with most SLAM based on deep learning. However, as we know, the deep-learning-based methods are not suitable for real-time applications, which will be discussed in the next part of this paper.

3.3. Discussion about the Real-Time Performance

To discuss the real-time performance of the proposed V-SLAM method, some comparisons are given, which are listed in Table 5. Here only the methods which have provided the values of time consumption are listed in this comparison.

In Table 5, the time in our SLAM versus ORB-SLAM3 is the average mean time in Table 6. As we can gather from Table 5, our average elapsed time per frame is only 0.23 ms more than ORB-SLAM3. Among the compared SLAM, SLAM based on deep learning cannot run at the rate of 30 frames per second. Only DSLAM and our SLAM can run at that rate with CPU in the dynamic environments. On average, our SLAM consumes 13.14 ms less per frame than DSLAM. DynaSLAM has a better performance than our SLAM in highly dynamic sequences, however, it takes 736.48 ms per frame, far more than our SLAM. This shows that our SLAM has more plasticity than other SLAM, which can be further improved based on our SLAM at 30 frames per second.

To further analyze the real time of the proposed method and ORB-SLAM3, the details of the time consumption per frame in the nine sequences are given in Table 6. As we can see from the table, the average mean time consumption per frame in our SLAM system is higher by about 0.23 ms than that of ORB-SLAM3, but the average median time consumption per frame in our SLAM system is less by about 0.1 ms compared to ORB-SLAM3. So, the overall real-time performance of the proposed method is very near to ORB-SLAM3. However, the accuracy of our method is obviously better than ORB-SLAM3.

3.4. Ablation Experiments

To demonstrate the effectiveness of our improvements in the proposed method, we also perform two ablation experiments: one is the ORB-SLAM3 with only the proposed UMPP strategy as described in Section 2.3 (called ORB-UMPP), and the other is the ORB-SLAM3 with only the proposed NMPP strategy as described in Section 2.2 (called ORB-NMPP). The results of the ablation experiments are shown in Table 7.

From Table 7, we can see that the performance improvement of ORB-NMPP and ORB-UMPP is positive when compared with ORB-SLAM3 in highly dynamic sequences. This means that the proposed strategies of NMPP and UMPP can effectively reduce the number of dynamic map points in highly dynamic environments and increase the robustness of V-SLAM. When it comes to slightly dynamic sequences ‘rpy’ and ‘static’, ORB-UMPP and ORB-NMPP can also significantly reduce the number of wrong map points generated by the mismatch in a pure rotation environment and wrong triangulation. However, because the U-MPs in ORB-UMPP cannot participate in the pose calculation, this prevents a lot of unreliable but static map points from participating in the pose calculation in the slightly dynamic environments, which makes the performance of ORB-UMPP worse than ORB-SLAM3 in the other three slightly dynamic sequences. As for ORB-NMPP, although it rarely finds obvious dynamic map points in these three slightly dynamic sequences, it will not prevent the static map points from participating in the pose calculation. So, its performance is close to ORB-SLAM3.

In addition, the results in Table 7 show that our full method has a greater improvement in the sequences, where only one of the two methods (ORB-NMPP and ORB-UMPP) has slightly positive performance improvement, while the other method has a worse performance. These results indicate that these two methods promote each other, and the combination of the two methods can yield a better performance. The reasons are as follows: Firstly, the strategy in new map points processing can effectively reduce the number of wrong map points and repeated map points. Secondly, the strategy in unreliable map points processing can screen out the map points of slowly moving landmarks. The results of the two ablation experiments show that both strategies for new map points processing and unreliable map points processing are important to improve the performance of V-SLAM under dynamic environments.

4. Conclusions

In this paper, the real-time V-SLAM based on consistency detection under dynamic environments was studied, and an improved map point reliability-based V-SLAM method was proposed, which is based on ORB-SLAM3. In the proposed method, a map point reliability concept is presented. Based on the map point reliability, the problem of high dynamic objects and the slight dynamic objects can both be dealt with efficiently. In this study, various experiments were conducted. The results show that the proposed method can obtain the best accuracy compared with the state-of-the-art real-time V-SLAM methods. Compared with other non-real-time V-SLAM, the proposed method can also obtain competitive accuracy performance. These experimental results show that the proposed V-SLAM is an efficient real-time dynamic V-SLAM. In future work, there are some problems that need to be further studied. For example, the issue of how to improve the performance of SLAM in a pure rotation environment needs to be addressed. Furthermore, how to fuse the data of two different sensors and how to find reliable map points through mutual verification of the fused data are valuable research directions.

Author Contributions

Funding acquisition, J.N.; project administration, J.N.; writing—original draft, L.W.; writing—review and editing, X.W. and G.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the National Natural Science Foundation of China (61873086), and the Science and Technology Support Program of Changzhou (CE20215022).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://vision.in.tum.de/data/datasets/rgbd-dataset/download accessed on 10 July 2022.

Conflicts of Interest

The authors declare no conflict of interest.

References

Sharma, K. Improved visual SLAM: A novel approach to mapping and localization using visual landmarks in consecutive frames. Multimed. Tools Appl. 2018, 77, 7955–7976. [Google Scholar] [CrossRef]
Fuentes-Pacheco, J.; Ruiz-Ascencio, J.; Rendón-Mancha, J.M. Visual simultaneous localization and mapping: A survey. Artif. Intell. Rev. 2015, 43, 55–81. [Google Scholar] [CrossRef]
Ni, J.; Chen, Y.; Wang, K.; Yang, S.X. An improved vision-based SLAM approach inspired from animal spatial cognition. Int. J. Robot. Autom. 2019, 34, 491–502. [Google Scholar] [CrossRef]
Oh, S.; Hahn, M.; Kim, J. Dynamic EKF-based SLAM for autonomous mobile convergence platforms. Multimed. Tools Appl. 2015, 74, 6413–6430. [Google Scholar] [CrossRef]
Ni, J.; Wang, C.; Fan, X.; Yang, S.X. A bioinspired neural model based extended Kalman filter for robot SLAM. Math. Probl. Eng. 2014, 2014, 905826. [Google Scholar] [CrossRef] [Green Version]
Wang, Y.; Wang, X. Research on SLAM Road Sign Observation Based on Particle Filter. Comput. Intell. Neurosci. 2022, 2022, 4478978. [Google Scholar] [CrossRef]
Lourakis, M.I.; Argyros, A.A. SBA: A software package for generic sparse bundle adjustment. ACM Trans. Math. Softw. (TOMS) 2009, 36, 1–30. [Google Scholar] [CrossRef]
Taketomi, T.; Uchiyama, H.; Ikeda, S. Visual SLAM algorithms: A survey from 2010 to 2016. IPSJ Trans. Comput. Vis. Appl. 2017, 9, 16. [Google Scholar] [CrossRef] [Green Version]
Li, J.; Wang, L.; Li, Y.; Zhang, J.; Li, D.; Zhang, M. Local optimized and scalable frame-to-model SLAM. Multimed. Tools Appl. 2016, 75, 8675–8694. [Google Scholar] [CrossRef]
Yoo, J.K.; Kim, J.H. Gaze Control-Based Navigation Architecture With a Situation-Specific Preference Approach for Humanoid Robots. IEEE/ASME Trans. Mechatron. 2015, 20, 2425–2436. [Google Scholar] [CrossRef]
Ni, J.; Chen, Y.; Chen, Y.; Zhu, J.; Ali, D.; Cao, W. A Survey on Theories and Applications for Self-Driving Cars Based on Deep Learning Methods. Appl. Sci. 2020, 10, 2749. [Google Scholar] [CrossRef] [Green Version]
Kam, J.W.; Kim, H.S.; Lee, S.J.; Hwang, S.S. Robust and fast collaborative augmented reality framework based on monocular SLAM. IEIE Trans. Smart Process. Comput. 2020, 9, 325–335. [Google Scholar] [CrossRef]
Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Trans. Robot. 2016, 33, 249–265. [Google Scholar] [CrossRef] [Green Version]
Kenye, L.; Kala, R. An Ensemble of Spatial Clustering and Temporal Error Profile Based Dynamic Point Removal for visual Odometry. Multimed. Tools Appl. 2022, 81, 23259–23288. [Google Scholar] [CrossRef]
Kümmerle, R.; Grisetti, G.; Strasdat, H.; Konolige, K.; Burgard, W. G²o: A general framework for graph optimization. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3607–3613. [Google Scholar]
Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
Ni, J.; Gong, T.; Gu, Y.; Zhu, J.; Fan, X. An Improved Deep Residual Network-Based Semantic Simultaneous Localization and Mapping Method for Monocular Vision Robot. Comput. Intell. Neurosci. 2020, 2020, 7490840. [Google Scholar] [CrossRef]
Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the 13th European Conference on Computer Vision, ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 834–849. [Google Scholar]
Gao, X.; Wang, R.; Demmel, N.; Cremers, D. LDSO: Direct sparse odometry with loop closure. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2018, Madrid, Spain, 1–5 October 2018; pp. 2198–2204. [Google Scholar]
Chen, Y.; Ni, J.; Mutabazi, E.; Cao, W.; Yang, S.X. A Variable Radius Side Window Direct SLAM Method Based on Semantic Information. Comput. Intell. Neurosci. 2022, 2022, 4075910. [Google Scholar] [CrossRef]
Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef]
Wei, H.; Zhang, T.; Zhang, L. GMSK-SLAM: A new RGB-D SLAM method with dynamic areas detection towards dynamic environments. Multimed. Tools Appl. 2021, 80, 31729–31751. [Google Scholar] [CrossRef]
Liu, Y.; Miura, J. RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods. IEEE Access 2021, 9, 23772–23785. [Google Scholar] [CrossRef]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
Ni, J.; Shen, K.; Chen, Y.; Cao, W.; Yang, S.X. An Improved Deep Network-Based Scene Classification Method for Self-Driving Cars. IEEE Trans. Instrum. Meas. 2022, 71, 5001614. [Google Scholar] [CrossRef]
Kim, J.H.; Kim, N.; Park, Y.W.; Won, C.S. Object Detection and Classification Based on YOLO-V5 with Improved Maritime Dataset. J. Mar. Sci. Eng. 2022, 10, 377. [Google Scholar] [CrossRef]
Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef] [Green Version]
Strecke, M.; Stueckler, J. EM-Fusion: Dynamic Object-Level SLAM With Probabilistic Data Association. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5864–5873. [Google Scholar]
Yu, C.; Liu, Z.; Liu, X.J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
Zhong, F.; Wang, S.; Zhang, Z.; Chen, C.; Wang, Y. Detect-SLAM: Making object detection and SLAM mutually beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1001–1010. [Google Scholar]
Huang, J.; Yang, S.; Mu, T.J.; Hu, S.M. ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2165–2174. [Google Scholar]
Wu, W.; Guo, L.; Gao, H.; You, Z.; Liu, Y.; Chen, Z. YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint. Neural Comput. Appl. 2022, 34, 6011–6026. [Google Scholar] [CrossRef]
Runz, M.; Buffier, M.; Agapito, L. MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 16–20 October 2018; pp. 10–20. [Google Scholar]
Xu, B.; Li, W.; Tzoumanikas, D.; Bloesch, M.; Davison, A.; Leutenegger, S. MID-Fusion: Octree-based Object-Level Multi-Instance Dynamic SLAM. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5231–5237. [Google Scholar]
Yang, S.; Scherer, S. CubeSLAM: Monocular 3-D Object SLAM. IEEE Trans. Robot. 2019, 35, 925–938. [Google Scholar] [CrossRef] [Green Version]
Alcantarilla, P.F.; Yebes, J.J.; Almazán, J.; Bergasa, L.M. On combining visual SLAM and dense scene flow to increase the robustness of localization and mapping in dynamic environments. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, Saint Paul, MN, USA, 14–18 May 2012; pp. 1290–1297. [Google Scholar]
Li, S.; Lee, D. RGB-D SLAM in Dynamic Environments Using Static Point Weighting. IEEE Robot. Autom. Lett. 2017, 2, 2263–2270. [Google Scholar] [CrossRef]
Kerl, C.; Sturm, J.; Cremers, D. Robust odometry estimation for RGB-D cameras. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 6–10 May 2013; pp. 3748–3754. [Google Scholar]
Kim, D.H.; Kim, J.H. Effective Background Model-Based RGB-D Dense Visual Odometry in a Dynamic Environment. IEEE Trans. Robot. 2016, 32, 1565–1573. [Google Scholar] [CrossRef]
Sun, Y.; Liu, M.; Meng, M.Q.H. Improving RGB-D SLAM in dynamic environments: A motion removal approach. Robot. Auton. Syst. 2017, 89, 110–122. [Google Scholar] [CrossRef]
Dai, W.; Zhang, Y.; Li, P.; Fang, Z.; Scherer, S. RGB-D SLAM in Dynamic Environments Using Point Correlations. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 373–389. [Google Scholar] [CrossRef]
Zou, D.; Tan, P. CoSLAM: Collaborative Visual SLAM in Dynamic Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 354–366. [Google Scholar] [CrossRef]
Zhang, T.; Zhang, H.; Li, Y.; Nakamura, Y.; Zhang, L. FlowFusion: Dynamic Dense RGB-D SLAM Based on Optical Flow. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 7322–7328. [Google Scholar]
Palazzolo, E.; Behley, J.; Lottes, P.; Giguère, P.; Stachniss, C. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 7855–7862. [Google Scholar]
Wei, H.; Tang, F.; Xu, Z.; Zhang, C.; Wu, Y. A Point-Line VIO System with Novel Feature Hybrids and with Novel Line Predicting-Matching. IEEE Robot. Autom. Lett. 2021, 6, 8681–8688. [Google Scholar] [CrossRef]
Galvez-López, D.; Tardos, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
Sun, H.; Wang, P.; Ni, C.; Li, J.m. Loop closure detection based on image semantic feature and bag-of-words. Multimed. Tools Appl. 2022. [Google Scholar] [CrossRef]
Afifi, A.; Takada, C.; Yoshimura, Y.; Nakaguchi, T. Real-time expanded field-of-view for minimally invasive surgery using multi-camera visual simultaneous localization and mapping. Sensors 2021, 21, 2106. [Google Scholar] [CrossRef]
Scona, R.; Jaimez, M.; Petillot, Y.R.; Fallon, M.; Cremers, D. StaticFusion: Background Reconstruction for Dense RGB-D SLAM in Dynamic Environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 3849–3856. [Google Scholar]
Rünz, M.; Agapito, L. Co-fusion: Real-time segmentation, tracking and fusion of multiple objects. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4471–4478. [Google Scholar]

Figure 1. Framework overview of our proposed SLAM. The reliable map points are called R-MPs; the unreliable map points are called U-MPs; the temporary reliable map points are called TR-MPs; DC represents the distance check operation; REMPE represents the repeated existing map points elimination operation.

Figure 2. Some map points are generated at error coordinates by ORB-SLAM3.

Figure 3. The difference of the generated map between ORB-SLAM3 and our SLAM: (a) the result of ORB-SLAM3; (b) the result of our method. There are some dynamic map points left after dynamic objects are gone in Figure 3a. Compared with ORB-SLAM3, the results in Figure 3b indicate that our algorithm can deal with these map points effectively.

Figure 4. The difference between a static map point and a dynamic map point. For a static map point, the error between its reprojection and its feature point can always keep within a reasonable range. However, for a dynamic map point, its position in SLAM is just the position at timestamp K. With time going on, its true position will change, and the reprojection error will become larger. Finally, in timestamp

K + 2

, the error exceeds the threshold and is detected.

Figure 4. The difference between a static map point and a dynamic map point. For a static map point, the error between its reprojection and its feature point can always keep within a reasonable range. However, for a dynamic map point, its position in SLAM is just the position at timestamp K. With time going on, its true position will change, and the reprojection error will become larger. Finally, in timestamp

K + 2

, the error exceeds the threshold and is detected.

Figure 5. Examples of Delaunay triangulation. The green points are the feature points of R-MPs. The larger the point, the greater the reliability is. The red points are the feature points of U-MPs, and the blue points are the feature points of TR-MPs.

Figure 6. The workflow of the proposed map points treatment in the proposed method.

Figure 7. Comparison of ATE results between our method and ORB-SLAM3 in slightly dynamic sequences.

Figure 8. Comparison of ATE results between our method and ORB-SLAM3 in highly dynamic sequences.

Figure 9. Some examples of the working process of the proposed method and ORB-SLAM3 in fr3/w/xyz: (a) environment with people moving at low speed; (b) environment with people moving at high speed. In our method, the red feature point means it connects to a D-MP; the green feature point means it connects to an R-MP or a U-MP. The larger the green feature point is, the more reliable the map point is. In our method, there are significantly fewer feature points on the people.

Table 1. Comparison of the Rot.RMSE of the RPE on the TUM benchmark.

Sequence		Rot.RMSE of Trajectory Alignment (o/s)
Sequence		DVO [39]	StaticFusion [51]	SPWSLAM [38]	MotionRemove [41]	DSLAM [42]	ORB-SLAM3	Our Method
slightly	fr2/d/person	1.5368	-	0.8213	0.7341	1.3951	0.4346	0.4302
	fr3/s/static	0.6084	0.4300	0.7228	-	0.3786	0.3263	0.2752
	fr3/s/xyz	1.4980	0.9200	0.8466	0.9828	0.5792	0.5466	0.5149
	fr3/s/rpy	6.0164	-	5.6258	-	0.9047	5.7060	0.8230
	fr3/s/half	4.6490	2.1100	1.8836	2.3748	0.8699	0.6222	0.6131
highly	fr3/w/static	6.3502	0.3800	0.8085	2.0487	0.3293	1.8518	0.2813
	fr3/w/xyz	7.6669	2.6600	1.6442	3.2346	2.7413	3.1113	0.6408
	fr3/w/rpy	7.0662	-	5.6902	4.3755	4.6327	9.7841	1.7318
	fr3/w/half	5.2179	5.0400	2.4048	5.0108	0.9854	2.4706	0.7460
Average		4.5122	1.9233	2.2720	2.6802	1.4240	2.7615	0.6729

Note: The best results are shown in bold. Not all papers provide results for all sequences.

Table 2. Comparison of the Trans.RMSE of the RPE on the TUM benchmark.

Sequence		Trans.RMSE of Trajectory Alignment (m/s)
Sequence		DVO [39]	StaticFusion [51]	SPWSLAM [38]	MotionRemove [41]	DSLAM [42]	ORB-SLAM3	Our Method
slightly	fr2/d/person	0.0354	-	0.0173	0.0172	0.0362	0.0078	0.0072
	fr3/s/static	0.0157	0.0110	0.0231	-	0.0138	0.0111	0.0083
	fr3/s/xyz	0.0453	0.0280	0.0219	0.033	0.0134	0.0142	0.0132
	fr3/s/rpy	0.1735	-	0.0843	-	0.032	0.0427	0.0263
	fr3/s/half	0.1005	0.0300	0.0389	0.0458	0.0354	0.0213	0.0172
highly	fr3/w/static	0.3818	0.0130	0.0327	0.0842	0.0141	0.0132	0.0106
	fr3/w/xyz	0.4360	0.1210	0.0651	0.1214	0.1266	0.0880	0.0245
	fr3/w/rpy	0.4038	-	0.2252	0.1751	0.2299	0.1296	0.1006
	fr3/w/half	0.2628	0.2070	0.0527	0.1672	0.0517	0.1256	0.0241
Average		0.2061	0.0683	0.0624	0.0920	0.0615	0.0504	0.0258

Note: The best results are shown in bold. Not all papers provide results for all sequences.

Table 3. Comparison of the ATE.RMSE on the TUM benchmark.

Sequence		ATE.RMSE of Trajectory Alignment (m)
Sequence		DVO [39]	StaticFusion [51]	SPWSLAM [38]	MotionRemove [41]	DSLAM [42]	ORB-SLAM3	Our Method
slightly	fr2/d/person	0.1037	-	0.0484	0.0596	0.0075	0.0081	0.0074
	fr3/s/static	0.0119	0.0130	-	-	0.0096	0.0106	0.0066
	fr3/s/xyz	0.2420	0.0400	0.0397	0.0482	0.0091	0.0198	0.0151
	fr3/s/rpy	0.1756	-	-	-	0.0225	0.0592	0.0219
	fr3/s/half	0.2198	0.0400	0.0432	0.047	0.0235	0.0224	0.0161
highly	fr3/w/static	0.7515	0.0140	0.0261	0.0656	0.0108	0.0249	0.0079
	fr3/w/xyz	1.3830	0.1270	0.0601	0.0932	0.0874	0.1493	0.0243
	fr3/w/rpy	1.2922	-	0.1791	0.1333	0.1608	0.2257	0.1121
	fr3/w/half	1.0136	0.3910	0.0489	0.1252	0.0354	0.1057	0.0181
Average		0.5770	0.1042	0.0636	0.0817	0.0407	0.0695	0.0255

Note: The best results are shown in bold. Not all papers provide results for all sequences.

Table 4. The ATE.RMSE on the TUM benchmark, compared with the V-SLAM methods based on deep learning.

Sequence		ATE.RMSE of Trajectory Alignment (m)
Sequence		Co-Fusion [52]	MaskFusion [34]	DynaSLAM [28]	EM-Fusion [29]	DS-SLAM [30]	RDS-SLAM [23]	Our Method
slightly	fr2/d/person	-	-	-	-	-	-	0.0074
	fr2/s/static	0.011	0.021	-	0.090	0.0065	0.0084	0.0066
	fr3/s/xyz	0.027	0.031	0.015	0.370	-	-	0.0151
	fr3/s/rpy	-	-	-	-	-	-	0.0219
	fr3/s/half	0.036	0.052	0.017	0.032	-	-	0.0161
highly	fr3/w/static	0.551	0.035	0.006	0.014	0.0081	0.0206	0.0079
	fr3/w/xyz	0.696	0.104	0.015	0.066	0.0247	0.0571	0.0243
	fr3/w/rpy	-	-	0.035	-	0.4442	0.1604	0.1121
	fr3/w/half	0.803	0.106	0.025	0.051	0.0303	0.0807	0.0181
Average		0.3540	0.0582	0.0188	0.1038	0.1028	0.0654	0.0255

Note: The best results are shown in bold. Not all papers provide results for all sequences, and some methods just provided the three decimal places of data.

Table 5. Comparison of the real-time performance in dynamic environments.

System	Average Time Consumption of per Frame (ms)	Hardware Platform
Co-Fusion [52]	83.3	1 CPU, 1 GPU
DS-SLAM [30]	76.46	Intel Core i7 CPU, P4000 GPU
DynaSLAM [28]	738.46	Titan X GPU
RDS-SLAM [23]	206.13	GeForce RTX 2080Ti GPU
MaskFusion [34]	200.00	Intel Core CPU, 2 Nvidia GTX Titan X GPU
EM-Fusion [29]	106–257	Intel Xeon Silver 4112 CPU, Nvidia GTX 1080 Ti GPU
DSLAM [42]	30.65	Intel Core i5-3470 CPU
SPWSLAM [38]	45	Intel Core i7-4790K CPU
StaticFusion [51]	30	Intel Core i7 3770 CPU, GeForce GTX 1070 GPU
DVO [39]	717.48	Intel Core i5 670 CPU
ORB-SLAM3	17.28	Intel Core i5 10400F CPU
Our method	17.51	Intel Core i5 10400F CPU

Note: All data of state-of-the-art SLAM methods are obtained from their original papers. The running times of other SLAM are obtained from the corresponding papers, except the time of ORB-SLAM3 is obtained with the same computer that we use. Co-fusion does not provide specific information about the hardware used in the original paper, but only the number of CPU and GPU.

Table 6. Comparison of the real-time performance in dynamic environments.

Sequence		Our Method		ORB-SLAM3
Sequence		Median (ms)	Mean (ms)	Median (ms)	Mean (ms)
Slightly	fr2/d/person	17.84	19.74	18.50	19.40
	fr3/s/static	17.45	18.44	16.61	17.01
	fr3/s/xyz	16.61	17.13	16.48	17.23
	fr3/s/rpy	16.75	17.10	16.10	16.48
	fr3/s/half	17.45	18.44	17.75	18.58
Highly	fr3/w/static	14.89	15.72	15.15	16.71
	fr3/w/xyz	15.21	16.79	15.82	16.29
	fr3/w/rpy	16.29	16.67	15.40	15.98
	fr3/w/half	15.60	17.55	17.13	17.84
Average		16.45	17.51	16.55	17.28

Table 7. The results of ablation experiments.

Sequence		ATE.RMSE of Trajectory Alignment (m)
Sequence		ORB-SLAM3	ORB-UMPP	ORB-NMPP	Our Method
Slightly	fr2/d/person	0.0081	0.0110	0.0079	0.0074
	fr3/s/static	0.0106	0.0079	0.0076	0.0066
	fr3/s/xyz	0.0198	0.0256	0.0195	0.0151
	fr3/s/rpy	0.0592	0.0278	0.0235	0.0219
	fr3/s/half	0.0224	0.0448	0.0217	0.0161
Highly	fr3/w/static	0.0249	0.0227	0.0202	0.0079
	fr3/w/xyz	0.1493	0.0499	0.1120	0.0243
	fr3/w/rpy	0.2257	0.1756	0.1621	0.1121
	fr3/w/half	0.1057	0.0793	0.0659	0.0181
Average		0.0695	0.0494	0.0489	0.0255

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ni, J.; Wang, L.; Wang, X.; Tang, G. An Improved Visual SLAM Based on Map Point Reliability under Dynamic Environments. Appl. Sci. 2023, 13, 2712. https://doi.org/10.3390/app13042712

AMA Style

Ni J, Wang L, Wang X, Tang G. An Improved Visual SLAM Based on Map Point Reliability under Dynamic Environments. Applied Sciences. 2023; 13(4):2712. https://doi.org/10.3390/app13042712

Chicago/Turabian Style

Ni, Jianjun, Li Wang, Xiaotian Wang, and Guangyi Tang. 2023. "An Improved Visual SLAM Based on Map Point Reliability under Dynamic Environments" Applied Sciences 13, no. 4: 2712. https://doi.org/10.3390/app13042712

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

An Improved Visual SLAM Based on Map Point Reliability under Dynamic Environments

Abstract

1. Introduction

2. Proposed Method

2.1. Framework of Our Method

2.2. New Map Points Processing Strategy

2.2.1. Distance Check Operation

2.2.2. Repeated Existing Map Points Elimination Operation

2.3. Unreliable Map Points Processing Strategy

2.3.1. Map Points Selection Operation

2.3.2. Unreliable Map Points Check Operation

3. Experiments

3.1. Datasets and Evaluation Criteria

3.2. Comparison with the State of the Art

3.3. Discussion about the Real-Time Performance

3.4. Ablation Experiments

4. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI