Article

PGMF-VINS: Perpendicular-Based 3D Gaussian–Uniform Mixture Filter

Wenqing Deng, Zhe Yan, Bo Hu, Zhiyan Dong and Lihua Zhang
1 Academy for Engineering and Technology, Fudan University, Shanghai 200433, China
2 Engineering Research Center of AI and Robotics, Ministry of Education, Shanghai 200433, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(19), 6482; https://doi.org/10.3390/s24196482
Submission received: 3 September 2024 / Revised: 30 September 2024 / Accepted: 6 October 2024 / Published: 8 October 2024
(This article belongs to the Section Navigation and Positioning)

Abstract: Visual–Inertial SLAM (VI-SLAM) has a wide range of applications spanning robotics, autonomous driving, AR, and VR due to its low-cost and high-precision characteristics. VI-SLAM is divided into localization and mapping tasks. However, researchers focus more on the localization task, while the robustness of the mapping task is often ignored. To address this, we propose a map-point convergence strategy which explicitly estimates the position, the uncertainty, and the stability of each map point (SoM). As a result, the proposed method can effectively improve the quality of the whole map while ensuring state-of-the-art localization accuracy. The convergence strategy mainly consists of a perpendicular-based triangulation and a 3D Gaussian–uniform mixture filter, and we name it PGMF (perpendicular-based 3D Gaussian–uniform mixture filter). The algorithm is extensively tested on open-source datasets, which shows that the RVM (Ratio of Valid Map points) of our algorithm exhibits an average increase of 0.1471 compared to VINS-mono, with a variance reduction of 68.8%.

1. Introduction

Visual–Inertial SLAM (VI-SLAM) has a wide range of applications spanning robotics, autonomous driving, AR, and VR. Compared to LiDAR SLAM, visual SLAM systems have gained significant interest in recent years owing to their smaller size, lower cost, and convenience. Monocular Visual–Inertial SLAM systems are low-cost, high-precision perception solutions that provide state estimation for navigation tasks in motion, and they have great potential due to their small size and light weight.
In current monocular VI-SLAM, researchers focus more on the localization task, primarily on how to determine the pose and achieve more accurate localization, while the mapping task is often overlooked. Mapping is frequently treated as complete after a simple triangulation performed alongside the localization task, which is insufficient and unreliable. Therefore, we propose a convergence strategy for map points.
Each map point is observed multiple times in a series of camera measurements, so it is crucial to calculate the position in the world coordinate system with multiple observations. Many algorithms directly triangulate multiple observations [1,2]; however, the drawback of this approach is that it does not consider the stability of map points (SoM). In other words, these algorithms triangulate all features observed, even if they are noisy points, mismatched points, or dynamic points. This leads to maps filled with noise and a decrease in pose accuracy.
Therefore, we propose a convergence strategy for map points. It estimates the position and the uncertainty in $\mathbb{R}^3$ for each map point through a perpendicular-based triangulation. Additionally, we propose a 3D Gaussian–uniform mixture filter based on Bayesian theory, which iteratively fuses multiple observations and explicitly calculates the SoM. These three parameters are updated each time new measurements arrive, and the uncertainty and the SoM together determine the convergence of a map point. Specifically, when the uncertainty is sufficiently small and the SoM is sufficiently high, we consider the position of the map point to be accurate.
The proposed method is called PGMF-VINS, and it makes the following contributions:
1. PGMF-VINS includes a convergence strategy for map points, which explicitly estimates the position, the uncertainty, and the stability of map points (SoM). This can effectively improve the quality of the whole map while ensuring state-of-the-art accuracy of pose estimation.
2. The perpendicular-based triangulation and the 3D Gaussian–uniform mixture filter are novel theoretical contributions with great potential in SLAM and 3D reconstruction tasks.
3. The algorithm is extensively tested on open-source datasets and shows good performance.

2. Related Works

2.1. Visual SLAM

The concept of SLAM was first introduced in the mid-1980s, primarily using range-finding sensors like sonar and laser. Visual SLAM emerged later, leveraging advancements in computer vision and camera technology. One of the pioneering works in vSLAM was MonoSLAM [3], which demonstrated real-time 3D mapping and pose estimation using a single camera. This system laid the groundwork for subsequent developments by showcasing the feasibility of using monocular vision for SLAM tasks.
Feature-based methods have been prominent in the development of vSLAM systems. These methods rely on extracting and matching key points (features) from images. Notable algorithms include PTAM [4], which separates tracking and mapping into parallel threads to improve performance. ORB-SLAM2 [5], proposed in 2017, is another significant milestone, utilizing ORB features to achieve real-time performance on monocular, stereo, and RGB-D cameras.
In contrast to feature-based methods, direct methods use the intensity values of pixels directly, avoiding the need for feature extraction and matching. LSD-SLAM [6] is a prominent example, where semi-dense maps are created using direct image alignment techniques. Direct Sparse Odometry (DSO) [7] further advanced this approach by improving accuracy and robustness in challenging environments.
It is important to introduce SVO [8], an algorithm designed for efficient and robust visual odometry. SVO stands out by combining the strengths of both direct and feature-based methods, aiming to achieve high-speed and accurate pose estimation. SVO constructs a sparse map of the environment by selecting key points that are well distributed and have high gradient values through a novel depth filter, and the core idea of this article is inspired by it.

2.2. Visual–Inertial SLAM

Vision-only systems have poor performance in extreme situations; Visual–Inertial SLAM (VI-SLAM) is an advanced technique that combines visual and inertial measurements to estimate the pose of a moving device while simultaneously constructing a map of the environment.
In VI-SLAM, filter-based approaches play an important role. MSCKF [9] is a pioneering method that shows significant performance gains in robustness and accuracy. It uses an Extended Kalman Filter (EKF) to fuse visual and inertial data. ROVIO [10] incorporates photometric error terms in the EKF framework, improving performance in environments with low-texture or challenging lighting conditions. EKF-based VI-SLAM methods are computationally efficient and suitable for real-time applications. However, they may struggle with long-term accuracy due to linearization errors and limited ability to relocalize.
Optimization-based approaches have become more and more popular in recent years. OKVIS [11] is a tightly coupled method, leveraging keyframe-based nonlinear optimization techniques to refine pose estimates and map features. It performs well in various environments, including those with low texture or dynamic changes. VINS-Mono [1] is a state-of-the-art optimization-based system that has been widely adopted in academia and industry. It features real-time performance on mobile devices and provides robust localization and mapping capabilities in both indoor and outdoor environments. ORB-SLAM3 [2] is an extension of its earlier versions, enhancing their capabilities to provide robust performance in various scenarios. It uses ORB features for reliable key-point detection and matching and employs a keyframe-based strategy that optimizes computational efficiency while maintaining high accuracy in pose estimation and map construction. ORB-SLAM3 has the capability to handle multiple maps simultaneously. SVO2.0 [12] represents a significant advancement in visual odometry, combining the strengths of direct and feature-based methods to achieve robust, real-time performance. Its versatility and efficiency make it a valuable tool for various applications requiring accurate motion estimation and environment mapping. In general, optimization-based methods typically achieve higher accuracy and robustness, particularly in challenging environments with dynamic changes.
The field of VI-SLAM has seen significant advancements over the past two decades, driven by innovations in sensor fusion, algorithm design, and computational techniques. The development of robust, real-time systems has paved the way for widespread application in areas such as robotics, augmented reality, and autonomous navigation.

3. System Overview

A block diagram illustrating the pipeline of the proposed system is shown in Figure 1. The system firstly performs measurements’ preprocessing after receiving data from the camera and IMU, in which features are detected and tracked, and IMU measurements of two image frames are preintegrated. Visual–inertial joint initialization is only executed once at the beginning of the entire system to obtain the necessary parameters for subsequent processes. PGMF is the most important and core module. It maintains a sliding window of image frames and the position of map points through three crucial steps: nonlinear optimization, perpendicular-based triangulation, and a 3D Gaussian–uniform mixture filter. Finally, a clean sparse map and accurate up-to-date pose are output in real time.
Compared to state-of-the-art algorithms like VINS [1] and ORB-SLAM [2], which mainly focus on localization tasks and pose accuracy, the proposed algorithm endows the system with the additional capability to obtain stable map points during mapping tasks while maintaining high precision in localization.
It is important to define the notations used throughout this paper. $(\cdot)^w$ denotes the world coordinates. $(\cdot)^b$ denotes the body coordinates, which coincide with the IMU coordinates. $(\cdot)^c$ refers to the camera coordinates. $(\cdot)_b^w$ represents the transformation from the body coordinates to the world coordinates. Rotation matrices $R$ and Hamilton quaternions $q$ equally represent rotation, and $P$ represents translation. $b_k$ and $c_k$ denote the body frame and the camera frame while taking the $k$th image, respectively. $g^w = [0, 0, g]^T$ is the gravity vector in the world coordinates.

3.1. Measurements’ Preprocessing

This section preprocesses raw sensor data from the IMU and camera. It is noteworthy that it only tracks data in two consecutive frames for these two types of measurements, aiming for smaller latency and less time consumption.
For visual measurements, when a new image arrives, we first track existing features from the previous frame using the KLT sparse optical flow algorithm [13]; then, we supplement new features up to a threshold (150–300) by detecting Shi–Tomasi corners [14]. Finally, outlier matches are rejected by RANSAC based on the fundamental matrix.
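As an illustration of this step, the snippet below sketches the track-and-replenish loop with OpenCV; the feature-count target, corner spacing, and RANSAC parameters are illustrative assumptions, and `prev_pts` is assumed to be an N×1×2 float32 array as OpenCV expects.

```python
import cv2
import numpy as np

MAX_FEATURES = 200          # target feature count, inside the paper's 150-300 range (assumed value)
MIN_FEATURE_DISTANCE = 30   # illustrative minimum spacing between corners, in pixels

def track_and_replenish(prev_img, curr_img, prev_pts):
    """Track features with KLT optical flow, reject outliers with a
    fundamental-matrix RANSAC, then top up with Shi-Tomasi corners."""
    # 1. KLT sparse optical flow between the two consecutive frames.
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_img, curr_img, prev_pts, None)
    good_prev = prev_pts[status.ravel() == 1]
    good_curr = curr_pts[status.ravel() == 1]

    # 2. Outlier rejection based on the fundamental matrix (RANSAC).
    if len(good_curr) >= 8:
        _, inlier_mask = cv2.findFundamentalMat(good_prev, good_curr,
                                                cv2.FM_RANSAC, 1.0, 0.99)
        if inlier_mask is not None:
            good_curr = good_curr[inlier_mask.ravel() == 1]

    # 3. Replenish with new Shi-Tomasi corners up to the target count.
    n_missing = MAX_FEATURES - len(good_curr)
    if n_missing > 0:
        new_pts = cv2.goodFeaturesToTrack(curr_img, maxCorners=n_missing,
                                          qualityLevel=0.01,
                                          minDistance=MIN_FEATURE_DISTANCE)
        if new_pts is not None:
            good_curr = np.vstack([good_curr,
                                   new_pts.reshape(-1, 1, 2).astype(np.float32)])
    return good_curr
```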
For IMU measurements, we use a preintegration algorithm. Preintegration processes the IMU data between two consecutive image frames into a pose increment and a covariance matrix, which does not change with changes in the starting and ending states of the time period. This helps to reduce the computational load. Detailed derivations can be found in [1,15], so we only provide a brief conclusion here:
\[
\begin{aligned}
\Delta P_{i,j} &= \sum_{k=i}^{j-1} \left[ \Delta V_{i,k}\,\delta t + \tfrac{1}{2}\,\Delta R_{i,k}\left(a_k - b_{a_k}\right)\delta t^2 \right] \\
\Delta V_{i,j} &= \sum_{k=i}^{j-1} \Delta R_{i,k}\left(a_k - b_{a_k}\right)\delta t \\
\Delta R_{i,j} &= \prod_{k=i}^{j-1} \mathrm{Exp}\!\left(\left(\omega_k - b_{g_k}\right)\delta t\right)
\end{aligned}
\tag{1}
\]
where $\Delta P_{i,j}$, $\Delta V_{i,j}$, and $\Delta R_{i,j}$ represent the preintegrations of translation, velocity, and rotation between $i$ and $j$, respectively. $\delta t$ is the time interval between two IMU measurements. $a_k$ and $\omega_k$ are the accelerometer and gyroscope measurements at moment $k$, and $b_{a_k}$, $b_{g_k}$ are the acceleration bias and gyroscope bias at moment $k$. $\mathrm{Exp}(\cdot)$ denotes the transformation from an axis-angle vector to a rotation matrix. The initial value of the covariance $P_{k+1}$ is set to 0, and it is updated according to the following formula for each IMU measurement $k$ between $i$ and $j$:
\[
\begin{aligned}
P_{k+1} &= \left( I + F_k \delta t \right) P_k \left( I + F_k \delta t \right)^T + G_k Q G_k^T \delta t^2 \\
F_k &= \begin{bmatrix}
0 & I & 0 & 0 & 0 \\
0 & 0 & -\Delta R_{i,k}\left[ a_k - b_{a_k} \right]_\times & -\Delta R_{i,k} & 0 \\
0 & 0 & -\left[ \omega_k - b_{g_k} \right]_\times & 0 & -I \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}
\qquad
G_k = \begin{bmatrix}
0 & 0 & 0 & 0 \\
\Delta R_{i,k} & 0 & 0 & 0 \\
0 & I & 0 & 0 \\
0 & 0 & I & 0 \\
0 & 0 & 0 & I
\end{bmatrix}
\end{aligned}
\]
where $\Delta R_{i,k}$ represents the rotation preintegration between $i$ and $k$, and $I$ denotes the $3 \times 3$ identity matrix. $Q$ is the noise covariance matrix $Q = \mathrm{diag}(\sigma_a^2, \sigma_g^2, \sigma_{b_a}^2, \sigma_{b_g}^2)$, where $\sigma_a$ and $\sigma_g$ represent the noise of the accelerometer and gyroscope measurements, respectively, and $\sigma_{b_a}$ and $\sigma_{b_g}$ are the random walk noise of the acceleration bias and gyroscope bias. $[\cdot]_\times$ denotes the skew-symmetric matrix of a vector.
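To make the accumulation in Equation (1) concrete, the sketch below propagates $\Delta P$, $\Delta V$, and $\Delta R$ over the IMU samples between two image frames; it assumes constant biases over the interval and omits the covariance propagation shown above.

```python
import numpy as np

def so3_exp(phi):
    """Exp(): axis-angle vector -> rotation matrix (Rodrigues formula)."""
    theta = np.linalg.norm(phi)
    if theta < 1e-9:
        return np.eye(3)
    a = phi / theta
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def preintegrate(imu_samples, b_a, b_g, dt):
    """Accumulate Delta P, Delta V, Delta R between two image frames
    (mean propagation of Equation (1); covariance propagation omitted)."""
    dP, dV, dR = np.zeros(3), np.zeros(3), np.eye(3)
    for acc, gyro in imu_samples:           # one (a_k, w_k) pair per IMU tick
        a_corr = acc - b_a                  # bias-corrected accelerometer
        w_corr = gyro - b_g                 # bias-corrected gyroscope
        dP = dP + dV * dt + 0.5 * (dR @ a_corr) * dt ** 2
        dV = dV + (dR @ a_corr) * dt
        dR = dR @ so3_exp(w_corr * dt)
    return dP, dV, dR
```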

3.2. Visual–Inertial Joint Initialization

The initialization includes gravity alignment, global scale recovery, and determining the initial pose states and map-point positions. The process is divided into two steps: first, a visual-only SFM is carried out with a set of initial frames, which provides relative poses of these frames in an unknown scale. Then, the first step results are aligned with IMU measurements. This alignment recovers the global scale and solves the gravity vector, ultimately determining the map-point positions. Detailed calculations and derivations can be found in [16].

3.3. Sliding Window and Nonlinear Optimization

The illustration of the sliding window is shown in Figure 2. The sliding window includes a sequence of frames. The state vector of these frames is defined as
\[
\begin{aligned}
X &= \left[ X_0, X_1, \ldots, X_n, X_c^b, P_0^w, P_1^w, \ldots, P_m^w \right] \\
X_k &= \left[ P_{b_k}^w, V_{b_k}^w, R_{b_k}^w, b_{a_k}, b_{g_k} \right], \quad k \in [0, n] \\
X_c^b &= \left[ P_c^b, R_c^b \right]
\end{aligned}
\]
where $X_k$ is the IMU state while taking the $k$th image, including position, velocity, orientation, acceleration bias, and gyroscope bias. $n$ is the total number of frames. $X_c^b$ is the extrinsic parameter from the camera to the IMU, including translation and rotation. $P_i^w$ ($i \in [0, m]$) is the position of a feature in world coordinates, and $m$ is the total number of map points.
More importantly, the sliding window contains a relationship that maps the visual constraints of the camera from reprojection errors, and the corresponding inertial constraints of the IMU from preintegration errors. The tightly coupled monocular VIO is solved by optimizing these constraints, and the final aim is to minimize the sum of all measurement residuals to obtain a maximum posteriori estimation as
\[
\min_{X} \left\{ \sum_{k \in A} \left\| r_A\!\left( \hat{z}_{b_{k+1}}^{b_k}, X \right) \right\|^2 + \sum_{i \in B} \rho\!\left( \left\| r_B\!\left( \hat{z}_i, X \right) \right\|^2 \right) + \sum_{j \in C} \rho\!\left( \left\| r_C\!\left( \hat{z}_j, X \right) \right\|^2 \right) \right\}
\]
where $A$ is the set of all IMU measurements, and $B$ and $C$ are the sets of visual measurements from converged and unconverged map points, respectively. For unconverged map points, we construct the reprojection error and simultaneously optimize their positions in world coordinates (a vector $[x, y, z]$). For converged map points, however, we do not optimize the positions in world coordinates, as we consider their positions reliable and no longer in need of optimization. The advantage of this approach is that it significantly reduces the dimensionality of the state vector during nonlinear optimization, because the number of map points requiring optimization is reduced. Therefore, we separate the measurements into sets $B$ and $C$ to distinguish between them. $r_A(\hat{z}_{b_{k+1}}^{b_k}, X)$ is the IMU residual, and $r_B(\hat{z}_i, X)$ and $r_C(\hat{z}_j, X)$ are the visual residuals of the corresponding sets. $\rho(\cdot)$ is the Huber loss function, and $\|\cdot\|$ denotes the Mahalanobis norm.
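To illustrate how this separation keeps the state vector small, the sketch below assembles the objective with only unconverged landmarks in the optimizable state; the `imu_factors`/`visual_factors` objects and the `converged` flag are hypothetical placeholders, not the actual PGMF-VINS interfaces.

```python
import numpy as np

def huber(sq_norm, delta=1.0):
    """Huber loss rho(.) applied to a squared (Mahalanobis) residual norm."""
    return sq_norm if sq_norm <= delta ** 2 else 2.0 * delta * np.sqrt(sq_norm) - delta ** 2

def assemble_problem(map_points, imu_factors, visual_factors):
    """Assemble the sliding-window objective. Only unconverged map points are
    appended to the optimizable state; converged ones are read as constants."""
    free_landmarks = [mp for mp in map_points if not mp.converged]  # set C landmarks

    def total_cost(state):
        cost = 0.0
        for f in imu_factors:            # set A: preintegration residuals
            r = f.residual(state)
            cost += r @ r
        for f in visual_factors:         # sets B and C: reprojection residuals
            r = f.residual(state)        # converged landmark positions enter as constants
            cost += huber(r @ r)
        return cost

    return free_landmarks, total_cost
```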
IMU measurement residuals are computed from preintegration, building upon Equation (1):
\[
\begin{aligned}
r_A\!\left(\hat{z}_{b_{k+1}}^{b_k}, X\right) &=
\begin{bmatrix}
P_j - P_i - \Delta P_{i,j} \\
V_j - V_i - \Delta V_{i,j} \\
\mathrm{Log}\!\left( R_i^T R_j \,\Delta R_{i,j}^T \right) \\
b_{a_j} - b_{a_i} \\
b_{g_j} - b_{g_i}
\end{bmatrix} \\
&=
\begin{bmatrix}
R_{b_k}^{w\,T}\!\left( P_{b_{k+1}}^w - P_{b_k}^w - V_{b_k}^w \Delta t + \tfrac{1}{2} g^w \Delta t^2 \right) - \Delta P_{k,k+1} \\
R_{b_k}^{w\,T}\!\left( V_{b_{k+1}}^w - V_{b_k}^w + g^w \Delta t \right) - \Delta V_{k,k+1} \\
\mathrm{Log}\!\left( R_{b_k}^{w\,T} R_{b_{k+1}}^w \,\Delta R_{k,k+1}^T \right) \\
b_{a_{k+1}} - b_{a_k} \\
b_{g_{k+1}} - b_{g_k}
\end{bmatrix}
\end{aligned}
\]
where $\Delta t$ is the time interval between image moments $k$ and $k+1$, and $\mathrm{Log}(\cdot)$ denotes the transformation from a rotation matrix to an axis-angle vector.
Visual measurement residuals are divided into two parts based on whether the map points have converged. Their biggest difference lies in the optimization strategy: the positions of unconverged map points continue to be estimated in the nonlinear optimization, while converged map points are fixed. However, the residual is computed in the same way in both cases, by projecting the map point onto the normalized plane and then subtracting the observation of the $i$th feature $p_i$:
\[
r_{B,C}\!\left(\hat{z}_i, X\right) = H\!\left( R_c^{b\,T}\!\left( R_{b_k}^{w\,T}\!\left( P_i^w - P_{b_k}^w \right) - P_c^b \right) \right) - p_i
\tag{5}
\]
where $H$ denotes the homogeneous projection, $\{R_{b_k}^w, P_{b_k}^w\}$ is the $k$th pose of the IMU, and $k$ belongs to the frames that can observe feature $p_i$. $P_i^w$ is the position of feature $p_i$ in world coordinates. Using Equation (5), we can derive the Jacobian matrix of the visual residuals with respect to the corresponding variables:
\[
\begin{aligned}
\frac{\partial r_{B,C}}{\partial R_{b_k}^w} &= J_c\, R_c^{b\,T} \left[ R_{b_k}^{w\,T}\!\left( P_i^w - P_{b_k}^w \right) \right]_\times \\
\frac{\partial r_{B,C}}{\partial P_{b_k}^w} &= -J_c\, R_c^{b\,T} R_{b_k}^{w\,T} \\
\frac{\partial r_{B,C}}{\partial R_c^b} &= J_c \left[ R_c^{b\,T}\!\left( R_{b_k}^{w\,T}\!\left( P_i^w - P_{b_k}^w \right) - P_c^b \right) \right]_\times \\
\frac{\partial r_{B,C}}{\partial P_c^b} &= -J_c\, R_c^{b\,T} \\
\frac{\partial r_{B,C}}{\partial P_i^w} &= J_c\, R_c^{b\,T} R_{b_k}^{w\,T}
\end{aligned}
\tag{6}
\]
where $J_c$ is the Jacobian matrix of $H$.
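For reference, the following is a small numerical sketch of the residual in Equation (5) and the Jacobians in Equation (6); the signs assume a right (local) perturbation of the rotations, and all names are illustrative.

```python
import numpy as np

def skew(v):
    """[v]_x : skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def visual_residual_and_jacobians(R_wb, P_wb, R_bc, P_bc, P_w, p_obs):
    """Reprojection residual (Equation (5)) and Jacobians (Equation (6)).
    R_wb, P_wb: body pose in world; R_bc, P_bc: camera-to-body extrinsics;
    P_w: landmark in world; p_obs: observation on the normalized plane."""
    q_b = R_wb.T @ (P_w - P_wb)          # landmark in body frame
    q_c = R_bc.T @ (q_b - P_bc)          # landmark in camera frame
    proj = q_c[:2] / q_c[2]              # homogeneous projection H(.)
    residual = proj - p_obs

    # Jacobian of the projection w.r.t. the point in camera coordinates (J_c).
    x, y, z = q_c
    J_c = np.array([[1.0 / z, 0.0, -x / z ** 2],
                    [0.0, 1.0 / z, -y / z ** 2]])

    J_P_w  =  J_c @ R_bc.T @ R_wb.T      # d r / d P_i^w
    J_P_wb = -J_c @ R_bc.T @ R_wb.T      # d r / d P_bk^w
    J_R_wb =  J_c @ R_bc.T @ skew(q_b)   # d r / d R_bk^w
    J_P_bc = -J_c @ R_bc.T               # d r / d P_c^b
    J_R_bc =  J_c @ skew(q_c)            # d r / d R_c^b
    return residual, (J_R_wb, J_P_wb, J_R_bc, J_P_bc, J_P_w)
```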
The sliding window update handles keyframes and non-keyframes differently. The selection of keyframes follows two criteria: first, the average parallax of tracked features between consecutive frames exceeds a certain threshold, indicating significant movement of the body. Second, to avoid tracking loss in extreme cases, when the number of tracked features falls below a critical value, we treat the new frame as a keyframe. In other situations, the frame is considered as a non-keyframe. Keyframes are retained during updates, while the oldest frame is discarded. For non-keyframes, they are discarded after passing IMU measurements to the next frame.
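A minimal sketch of this keyframe test is given below; the two thresholds are illustrative placeholders, since the exact values are not stated in the text.

```python
PARALLAX_THRESHOLD_PX = 10.0   # illustrative average-parallax threshold (pixels)
MIN_TRACKED_FEATURES = 20      # illustrative lower bound on tracked features

def is_keyframe(avg_parallax_px, num_tracked_features):
    """Keyframe test following the two criteria described in Section 3.3."""
    if avg_parallax_px > PARALLAX_THRESHOLD_PX:       # significant body motion
        return True
    if num_tracked_features < MIN_TRACKED_FEATURES:   # guard against tracking loss
        return True
    return False
```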

3.4. Convergence Strategy of Map Points

The convergence strategy of map points involves an explicit estimation of the position, the uncertainty, and the SoM of each map point in $\mathbb{R}^3$. These three parameters are updated with each new measurement. The uncertainty and the SoM together determine the convergence of the map point. When the uncertainty is sufficiently small and the SoM is sufficiently high, we consider that an accurate position has been obtained.
The convergence strategy of map points is primarily divided sequentially into a perpendicular-based triangulation and a 3D Gaussian–uniform mixture filter. The perpendicular-based triangulation handles the observation of every feature in the new image. It calculates a position and a variance by combining the latest and the first observations. The specific calculation principles can be found in Section 4. The 3D Gaussian–uniform mixture filter, based on Bayesian theory, stores a state for each map point, including position, variance, and SoM. When receiving the computation result of the previous step, it updates the state of the map point in an iterative manner until the map point converges. The detailed update principles can be found in Section 5.
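To make the bookkeeping concrete, one possible per-map-point state and the convergence/outlier decision of Section 5 can be sketched as follows; the Beta prior values and the thresholds are illustrative assumptions within the ranges reported later.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MapPointState:
    """Per-map-point state maintained by the filter (see Section 5)."""
    mu: np.ndarray        # position estimate in world coordinates, shape (3,)
    sigma: np.ndarray     # 3x3 covariance (uncertainty)
    a: float = 10.0       # Beta parameter for inliers (illustrative prior)
    b: float = 10.0       # Beta parameter for outliers (illustrative prior)

    @property
    def som(self):
        """Stability of the map point: pi = a / (a + b)."""
        return self.a / (self.a + self.b)

    def status(self, var_thresh=1e-3, som_converged=0.5, som_outlier=0.3):
        """Small uncertainty AND high SoM -> converged; very low SoM -> outlier;
        otherwise keep estimating. Thresholds are illustrative values inside
        the ranges reported in Section 5."""
        max_singular_value = np.linalg.svd(self.sigma, compute_uv=False).max()
        if max_singular_value < var_thresh and self.som > som_converged:
            return "converged"
        if self.som < som_outlier:
            return "outlier"
        return "estimating"
```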

4. Perpendicular-Based Triangulation

Whenever a new image frame arrives, we update the state of each feature in this image. If a feature ID is observed for the first time, we create a new map-point state. Each time a new image frame arrives, we perform a perpendicular-based triangulation between the current observation and the oldest observation, which ensures that intermediate image frames are not ignored.
Traditional methods have a prerequisite that assumes two observation rays intersect at a definite point in 3D space. However, in reality, we can only obtain an optimal estimate of the camera pose, not the ground truth, which leads to the observation rays intersecting in space in a skewed manner. Our proposed model (perpendicular-based triangulation) addresses this issue and ultimately yields a position and variance in world coordinates. The variance directly represents the error of the current triangulation and on the other hand, implies the uncertainty resulting from an inaccurate pose estimation.
Figure 3 shows the spatial model of the perpendicular-based triangulation. We observe $x_1$ and $x_2$ on the two image frames. The poses of the image frames are known after the nonlinear optimization of the sliding window; let us denote them as $\{P_1^w, R_1^w\}$ and $\{P_2^w, R_2^w\}$. This spatial model follows two assumptions:
1. The observation rays $O_1M_1$ and $O_2M_2$ are skew lines, because imprecise poses prevent them from intersecting exactly.
2. $M_1M_2$ is the common perpendicular of the two observation rays, and we take the ground truth of $M$ to be the midpoint of $M_1M_2$.
Through the spatial model, we know $O_1x_1 \perp M_1M_2 \perp O_2x_2$, that is,
\[
\frac{R_1^w K^{-1} x_1}{\left\| K^{-1} x_1 \right\|} \cdot V_{M_1M_2} = \frac{R_2^w K^{-1} x_2}{\left\| K^{-1} x_2 \right\|} \cdot V_{M_1M_2} = 0
\;\;\Longrightarrow\;\;
\bar{x}_1 \cdot V_{M_1M_2} = \bar{x}_2 \cdot V_{M_1M_2} = 0
\tag{7}
\]
where $K$ represents the camera intrinsic matrix and $V_{M_1M_2}$ is the unit direction vector of $M_1M_2$. $\bar{x}_1$ and $\bar{x}_2$ are shorthand for $\frac{R_1^w K^{-1} x_1}{\| K^{-1} x_1 \|}$ and $\frac{R_2^w K^{-1} x_2}{\| K^{-1} x_2 \|}$. If the depths along $O_1M_1$ and $O_2M_2$ are defined as $s_1$ and $s_2$, we can express $\left( M_2^w - M_1^w \right) \parallel M_1M_2$ as
\[
\left( s_2 \bar{x}_2 + P_2^w \right) - \left( s_1 \bar{x}_1 + P_1^w \right) = c\, V_{M_1M_2}
\;\;\Longrightarrow\;\;
s_2 \bar{x}_2 - s_1 \bar{x}_1 + P_2^w - P_1^w = c\, V_{M_1M_2}
\tag{8}
\]
where $c$ is an arbitrary constant. Multiplying both sides of the equation by $\bar{x}_1$ and substituting Equation (7) into Equation (8) gives
\[
s_2\, \bar{x}_2 \cdot \bar{x}_1 - s_1\, \bar{x}_1 \cdot \bar{x}_1 + \left( P_2^w - P_1^w \right) \cdot \bar{x}_1 = c\, V_{M_1M_2} \cdot \bar{x}_1 = 0
\tag{9}
\]
Similarly, multiplying both sides by $\bar{x}_2$ and substituting Equation (7) into Equation (8) gives
\[
s_2\, \bar{x}_2 \cdot \bar{x}_2 - s_1\, \bar{x}_1 \cdot \bar{x}_2 + \left( P_2^w - P_1^w \right) \cdot \bar{x}_2 = c\, V_{M_1M_2} \cdot \bar{x}_2 = 0
\tag{10}
\]
We combine Equations (9) and (10) to form a system of equations:
\[
\begin{aligned}
\left( \bar{x}_1 \cdot \bar{x}_1 \right) s_1 - \left( \bar{x}_2 \cdot \bar{x}_1 \right) s_2 &= \left( P_2^w - P_1^w \right) \cdot \bar{x}_1 \\
\left( \bar{x}_1 \cdot \bar{x}_2 \right) s_1 - \left( \bar{x}_2 \cdot \bar{x}_2 \right) s_2 &= \left( P_2^w - P_1^w \right) \cdot \bar{x}_2
\end{aligned}
\;\Longleftrightarrow\;
\begin{cases}
a_{11} s_1 - a_{12} s_2 = b_1 \\
a_{21} s_1 - a_{22} s_2 = b_2
\end{cases}
\;\Longleftrightarrow\;
\begin{bmatrix} a_{11} & -a_{12} \\ a_{21} & -a_{22} \end{bmatrix}
\begin{bmatrix} s_1 \\ s_2 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
\tag{11}
\]
Equation (11) has the form $Ax = b$, where the rank of $A$ implies different spatial relations. Since $a_{11}$ and $a_{22}$ both have magnitude 1, the rank of $A$ cannot be 0. A rank of 1 indicates a special case in which the observation rays $O_1M_1$ and $O_2M_2$ are parallel, resulting in infinitely many solutions for $x$; it is worth noting that this usually implies the body is stationary. A rank of 2 is the most common case, in which Cramer's rule yields the unique solution for $x$. Once we obtain $s_1$ and $s_2$, we can determine the position of $M$:
\[
M = \tfrac{1}{2}\left( M_1^w + M_2^w \right) = \tfrac{1}{2}\left( s_1 \bar{x}_1 + P_1^w + s_2 \bar{x}_2 + P_2^w \right)
\tag{12}
\]
The variance of $M$ is decomposed into $\tau_a$ and $\tau_b$. $\tau_a$ represents the error of the current measurement; it is obtained by substituting the solution of Equation (11) into Equation (8), i.e., it is the magnitude of $M_1M_2$. $\tau_b$ represents the components in the other directions, for which we take one pixel on the image frame as the error. The specific calculation can be found in [17]. The difference is that the proposed algorithm operates in three dimensions, whereas [17] operates in a two-dimensional plane, so we additionally use the $M_1M_2$ direction in the calculation. Finally, the total variance is $\tau^2 = \mathrm{diag}(\tau_b^2, \tau_a^2, \tau_b^2)$.
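The whole procedure of Equations (7)–(12) can be sketched as follows, assuming undistorted pixel observations and known frame poses; the one-pixel back-projection used for tau_b is only a rough stand-in for the calculation in [17].

```python
import numpy as np

def perpendicular_triangulation(x1_px, x2_px, K, R1, P1, R2, P2, pixel_noise=1.0):
    """Sketch of the perpendicular-based triangulation (Equations (7)-(12)).
    x1_px, x2_px: pixel observations; K: intrinsics; (R1, P1), (R2, P2): frame
    poses in world coordinates. Returns the midpoint M and diag(tau_b^2, tau_a^2, tau_b^2)."""
    K_inv = np.linalg.inv(K)

    def ray(R, x_px):
        d = R @ (K_inv @ np.array([x_px[0], x_px[1], 1.0]))
        return d / np.linalg.norm(d)      # unit bearing vector in the world frame

    x1, x2 = ray(R1, x1_px), ray(R2, x2_px)

    # System of Equation (11): a11*s1 - a12*s2 = b1, a21*s1 - a22*s2 = b2.
    dP = P2 - P1
    A = np.array([[x1 @ x1, -(x2 @ x1)],
                  [x1 @ x2, -(x2 @ x2)]])
    b = np.array([dP @ x1, dP @ x2])
    if abs(np.linalg.det(A)) < 1e-8:      # rank 1: parallel rays (e.g., stationary body)
        return None, None
    s1, s2 = np.linalg.solve(A, b)        # unique solution (Cramer's rule)

    M1 = s1 * x1 + P1
    M2 = s2 * x2 + P2
    M = 0.5 * (M1 + M2)                   # Equation (12)

    # tau_a: length of the common perpendicular; tau_b: one pixel back-projected
    # at the estimated depth (rough assumption standing in for [17]).
    tau_a = np.linalg.norm(M2 - M1)
    tau_b = pixel_noise * s1 / K[0, 0]
    return M, np.diag([tau_b ** 2, tau_a ** 2, tau_b ** 2])
```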

5. Three-Dimensional Gaussian–Uniform Mixture Filter

Each map point is observed multiple times in a series of camera measurements. Therefore, it is crucial to calculate the position of this point in world coordinates from multiple observations. The 3D Gaussian–uniform mixture filter, based on Bayesian theory, continuously integrates new observations in an iterative manner. This method is based on a core assumption: inlier measurements are subject to camera observation noise, modeled by a three-dimensional Gaussian distribution, while outliers are represented as white noise following a uniform distribution, and we represent the joint probability as a Gaussian–uniform mixture distribution, with one key parameter being the SoM. The larger the SoM is, the more likely the point is an inlier; conversely, the smaller the SoM is, the more likely it is an outlier. Ref. [17] discusses the feasibility of this joint probability distribution, and SVO [8] also adopts similar assumptions in its depth filter. We refer readers to these works for further details.
The 3D Gaussian–uniform mixture distribution can be represented as
\[
P\!\left( X_n \mid Z, \pi \right) = \pi\, N\!\left( X_n \mid Z, \tau_n \right) + \left( 1 - \pi \right) U\!\left( X_n \mid S_b \right)
\tag{13}
\]
where $Z$ denotes the ground truth of the position, and $\pi$ represents the probability of being an inlier. $X_n$ and $\tau_n$ denote the new measurement values (position and variance), which are the results of the perpendicular-based triangulation. $N$ and $U$ represent Gaussian and uniform distributions in 3D, respectively, and $S_b$ represents the boundary space of the uniform distribution. We can obtain the recursive form of the probability distribution through Bayesian methods; the specific derivation can be found in [17]. Here, we provide the conclusion:
\[
P\!\left( Z, \pi \mid a_n, b_n, \mu_n, \Sigma_n \right) = P\!\left( Z, \pi \mid a_{n-1}, b_{n-1}, \mu_{n-1}, \Sigma_{n-1} \right) \cdot P\!\left( X_n \mid Z, \pi \right)
\tag{14}
\]
where $P(X_n \mid Z, \pi)$ refers to Equation (13), and $P(Z, \pi \mid a_n, b_n, \mu_n, \Sigma_n) = N(Z \mid \mu_n, \Sigma_n) \cdot B(\pi \mid a_n, b_n)$; that is, it is composed of the product of a Gaussian distribution and a Beta distribution. $a_n$ and $b_n$ in $B(\pi \mid a_n, b_n)$ represent the probabilities for inliers and outliers, respectively, and they jointly determine the calculation of the SoM ($\pi$):
\[
\pi = \frac{a_n}{a_n + b_n}
\tag{15}
\]
The issue with Equation (14) is that the product of the two probability distributions on the right-hand side does not strictly equal the probability distribution on the left-hand side. Therefore, we match the first and second moments of the probability distributions on both sides. The moments of the probability distribution on the left-hand side with respect to Z and π are
\[
\begin{aligned}
M_1(Z) &= \int Z \cdot P\!\left( Z, \pi \mid a_n, b_n, \mu_n, \Sigma_n \right) dZ = \mu_n \\
M_2(Z) &= \int Z Z^T \cdot P\!\left( Z, \pi \mid a_n, b_n, \mu_n, \Sigma_n \right) dZ = \Sigma_n + \mu_n \mu_n^T \\
M_1(\pi) &= \int \pi \cdot P\!\left( Z, \pi \mid a_n, b_n, \mu_n, \Sigma_n \right) d\pi = \frac{a_n}{a_n + b_n} \\
M_2(\pi) &= \int \pi^2 \cdot P\!\left( Z, \pi \mid a_n, b_n, \mu_n, \Sigma_n \right) d\pi = \frac{a_n \left( a_n + 1 \right)}{\left( a_n + b_n \right)\left( a_n + b_n + 1 \right)}
\end{aligned}
\tag{16}
\]
For the right-hand side of Equation (14), we first simplify using Equations (13) and (15)
\[
\begin{aligned}
&P\!\left( Z, \pi \mid a_{n-1}, b_{n-1}, \mu_{n-1}, \Sigma_{n-1} \right) \cdot P\!\left( X_n \mid Z, \pi \right) \\
&\quad = N\!\left( Z \mid \mu_{n-1}, \Sigma_{n-1} \right) B\!\left( \pi \mid a_{n-1}, b_{n-1} \right) \left[ \pi\, N\!\left( X_n \mid Z, \tau_n \right) + \left( 1 - \pi \right) U\!\left( X_n \mid S_b \right) \right] \\
&\quad = \frac{a_n}{a_n + b_n} N\!\left( Z \mid X_n, \tau_n \right) N\!\left( Z \mid \mu_{n-1}, \Sigma_{n-1} \right) B\!\left( \pi \mid a_{n-1} + 1, b_{n-1} \right) \\
&\qquad + \frac{b_n}{a_n + b_n} U\!\left( X_n \mid S_b \right) N\!\left( Z \mid \mu_{n-1}, \Sigma_{n-1} \right) B\!\left( \pi \mid a_{n-1}, b_{n-1} + 1 \right)
\end{aligned}
\tag{17}
\]
We present the product of two Gaussian distributions in three-dimensional space, which serves as a fundamental mathematical result introduced in the derivation process:
\[
\begin{aligned}
N\!\left( X \mid \mu_1, \Sigma_1 \right) N\!\left( X \mid \mu_2, \Sigma_2 \right) &= N\!\left( \mu_1 \mid \mu_2, \Sigma_1 + \Sigma_2 \right) N\!\left( X \mid \mu, \Sigma \right) \\
\Sigma &= \left( \Sigma_1^{-1} + \Sigma_2^{-1} \right)^{-1}, \qquad \mu = \Sigma \left( \Sigma_1^{-1} \mu_1 + \Sigma_2^{-1} \mu_2 \right)
\end{aligned}
\tag{18}
\]
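For convenience, a direct transcription of the fusion part of Equation (18) (the scale factor $N(\mu_1 \mid \mu_2, \Sigma_1 + \Sigma_2)$ is omitted):

```python
import numpy as np

def gaussian_product(mu1, sigma1, mu2, sigma2):
    """Fuse two 3D Gaussians per Equation (18): returns (mu, Sigma) of the
    product N(X|mu1,Sigma1)*N(X|mu2,Sigma2), ignoring the constant scale factor."""
    sigma = np.linalg.inv(np.linalg.inv(sigma1) + np.linalg.inv(sigma2))
    mu = sigma @ (np.linalg.inv(sigma1) @ mu1 + np.linalg.inv(sigma2) @ mu2)
    return mu, sigma
```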
By substituting Equation (18) into Equation (17), we obtain
\[
\begin{aligned}
&P\!\left( Z, \pi \mid a_{n-1}, b_{n-1}, \mu_{n-1}, \Sigma_{n-1} \right) \cdot P\!\left( X_n \mid Z, \pi \right) \\
&\quad = \frac{a_n}{a_n + b_n} N\!\left( X_n \mid \mu_{n-1}, \tau_n + \Sigma_{n-1} \right) N\!\left( Z \mid \mu, \Sigma \right) B\!\left( \pi \mid a_{n-1} + 1, b_{n-1} \right) \\
&\qquad + \frac{b_n}{a_n + b_n} U\!\left( X_n \mid S_b \right) N\!\left( Z \mid \mu_{n-1}, \Sigma_{n-1} \right) B\!\left( \pi \mid a_{n-1}, b_{n-1} + 1 \right) \\
&\quad = C_1\, N\!\left( Z \mid \mu, \Sigma \right) B\!\left( \pi \mid a_{n-1} + 1, b_{n-1} \right) + C_2\, N\!\left( Z \mid \mu_{n-1}, \Sigma_{n-1} \right) B\!\left( \pi \mid a_{n-1}, b_{n-1} + 1 \right)
\end{aligned}
\tag{19}
\]
where $\Sigma = \left( \tau_n^{-1} + \Sigma_{n-1}^{-1} \right)^{-1}$ and $\mu = \Sigma \left( \tau_n^{-1} X_n + \Sigma_{n-1}^{-1} \mu_{n-1} \right)$. $C_1$ is $\frac{a_n}{a_n + b_n} N\!\left( X_n \mid \mu_{n-1}, \tau_n + \Sigma_{n-1} \right)$, and $C_2$ is $\frac{b_n}{a_n + b_n} U\!\left( X_n \mid S_b \right)$. Since the integral of the probability must be 1, $C_1$ and $C_2$ need to be normalized, that is, $C_1 \leftarrow \frac{C_1}{C_1 + C_2}$, $C_2 \leftarrow \frac{C_2}{C_1 + C_2}$. Then, we compute the moments for $Z$ and $\pi$ in Equation (19):
\[
\begin{aligned}
N_1(Z) &= \int Z \cdot P\!\left( Z, \pi \mid a_{n-1}, b_{n-1}, \mu_{n-1}, \Sigma_{n-1} \right) P\!\left( X_n \mid Z, \pi \right) dZ = C_1 \mu + C_2 \mu_{n-1} \\
N_2(Z) &= \int Z Z^T \cdot P\!\left( Z, \pi \mid a_{n-1}, b_{n-1}, \mu_{n-1}, \Sigma_{n-1} \right) P\!\left( X_n \mid Z, \pi \right) dZ = C_1 \left( \Sigma + \mu \mu^T \right) + C_2 \left( \Sigma_{n-1} + \mu_{n-1} \mu_{n-1}^T \right) \\
N_1(\pi) &= \int \pi \cdot P\!\left( Z, \pi \mid a_{n-1}, b_{n-1}, \mu_{n-1}, \Sigma_{n-1} \right) P\!\left( X_n \mid Z, \pi \right) d\pi = C_1 \frac{a_{n-1} + 1}{a_{n-1} + b_{n-1} + 1} + C_2 \frac{a_{n-1}}{a_{n-1} + b_{n-1} + 1} \\
N_2(\pi) &= \int \pi^2 \cdot P\!\left( Z, \pi \mid a_{n-1}, b_{n-1}, \mu_{n-1}, \Sigma_{n-1} \right) P\!\left( X_n \mid Z, \pi \right) d\pi = C_1 \frac{\left( a_{n-1} + 1 \right)\left( a_{n-1} + 2 \right)}{\left( a_{n-1} + b_{n-1} + 1 \right)\left( a_{n-1} + b_{n-1} + 2 \right)} + C_2 \frac{a_{n-1}\left( a_{n-1} + 1 \right)}{\left( a_{n-1} + b_{n-1} + 1 \right)\left( a_{n-1} + b_{n-1} + 2 \right)}
\end{aligned}
\tag{20}
\]
By equating the moments of $Z$ and $\pi$ in Equations (16) and (20), we can derive the update formulas for the four parameters $a_n$, $b_n$, $\mu_n$, $\Sigma_n$:
\[
\begin{aligned}
a_n &= \frac{N_2(\pi) - N_1(\pi)}{N_1(\pi) - N_2(\pi)/N_1(\pi)} \\
b_n &= \frac{\left( 1 - N_1(\pi) \right)\left( N_2(\pi) - N_1(\pi) \right)}{N_1^2(\pi) - N_2(\pi)} \\
\mu_n &= N_1(Z) \\
\Sigma_n &= N_2(Z) - \mu_n \mu_n^T
\end{aligned}
\tag{21}
\]
The update formulas for the four parameters in Equation (21) replace the recursive formula of Equation (14). When the system is running, the parameters are updated using explicit analytical expressions, effectively reducing computational load and memory usage, making this a highly efficient method. The convergence judgment is executed after updating a map point, and there are two termination conditions. First, the covariance is decomposed using an SVD, and the resulting singular values need to be smaller than a certain threshold ($1\times10^{-4}$ to $1\times10^{-2}$). Second, the SoM needs to be greater than a certain threshold (0.40–0.65). When both conditions are satisfied, we consider this map point an inlier. It should be noted that if the SoM falls below a certain threshold (0.25–0.40), we consider this map point an outlier and remove it from the map. This situation may arise due to noise, lighting changes, occlusions, overlaps, and extreme environments.
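Putting Equations (19)–(21) together, a single filter update can be sketched as follows, reusing the map-point state sketched in Section 3.4; `uniform_density` stands for $U(X_n \mid S_b) = 1/|S_b|$ and is an assumed constant supplied by the caller.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pgmf_update(state, X_n, tau_n, uniform_density):
    """One iteration of the 3D Gaussian-uniform mixture filter
    (Equations (19)-(21)). `state` carries (mu, sigma, a, b); X_n, tau_n are
    the position and covariance from the perpendicular-based triangulation."""
    mu0, sigma0, a0, b0 = state.mu, state.sigma, state.a, state.b

    # Posterior Gaussian of the inlier branch (Equation (18)).
    sigma = np.linalg.inv(np.linalg.inv(tau_n) + np.linalg.inv(sigma0))
    mu = sigma @ (np.linalg.inv(tau_n) @ X_n + np.linalg.inv(sigma0) @ mu0)

    # Mixture weights C1 (inlier branch) and C2 (outlier branch), then normalize.
    C1 = a0 / (a0 + b0) * multivariate_normal.pdf(X_n, mean=mu0, cov=tau_n + sigma0)
    C2 = b0 / (a0 + b0) * uniform_density
    C1, C2 = C1 / (C1 + C2), C2 / (C1 + C2)

    # First and second moments of Z and pi (Equation (20)).
    N1_Z = C1 * mu + C2 * mu0
    N2_Z = C1 * (sigma + np.outer(mu, mu)) + C2 * (sigma0 + np.outer(mu0, mu0))
    N1_pi = C1 * (a0 + 1) / (a0 + b0 + 1) + C2 * a0 / (a0 + b0 + 1)
    N2_pi = (C1 * (a0 + 1) * (a0 + 2) + C2 * a0 * (a0 + 1)) / ((a0 + b0 + 1) * (a0 + b0 + 2))

    # Moment matching (Equation (21)).
    state.a = (N2_pi - N1_pi) / (N1_pi - N2_pi / N1_pi)
    state.b = (1.0 - N1_pi) * (N2_pi - N1_pi) / (N1_pi ** 2 - N2_pi)
    state.mu = N1_Z
    state.sigma = N2_Z - np.outer(N1_Z, N1_Z)
    return state
```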

6. Experimental Results

To evaluate the effectiveness and robustness of the VIO system of our proposed method, we conducted a series of experiments. In terms of map quality improvement, we collected data on the recall rate of convergent points and the divergence rate of outliers. Regarding localization accuracy, we tested on the publicly available EUROC MAV dataset [18] and compared the performance with state-of-the-art algorithms. It is worth noting that for the sake of a lightweight system and a clearer evaluation of the proposed method, we did not utilize marginalization, loop detection, or pose relocalization modules.
The experiments were conducted on a desktop PC with an Intel® Core™ i5-12600 CPU @ 3.70 GHz × 8 and 16 GB memory.

6.1. Localization Accuracy

We used the EVO tool to process the experimental result data and used images from the left camera. Figure 4 shows the pose comparison of trajectories in the EUROC MH01 sequence. It is evident that PGMF-VINS tracked well in the x, y, and z dimensions. In the comparison of orientation data, pitch and yaw were also tracked well, while tracking in the roll direction was only moderate. This was caused by the aggressive movements of the drone in the dataset, which is one of the most difficult challenges; nevertheless, our method achieved performance competitive with VINS-mono under aggressive movements. Overall, PGMF-VINS performed well in the pose tracking of the localization task.
In Table 1 and Table 2, the RMSEs (Root-Mean-Square Errors) of the RPE (Relative Pose Error) and APE (Absolute Pose Error) for the EUROC dataset are provided. We conducted three tests and took the average of the RPE. Red indicates the best measurement for the sequence, while green indicates a better average value. According to the experimental results, PGMF-VINS achieved the best values for most sequences. In the V1 scene sequences and the difficult sequences of the MH scene, PGMF-VINS performed exceptionally well, indicating its significant potential in more challenging scenarios.

6.2. Map Quality Improvement

The core objective of the proposed method was to improve map quality. To evaluate this effect, we collected relevant data on map points and analyzed the recall rate of convergent points and the divergence rate of outliers. We evaluated the proposed algorithm using the Ratio of Valid map points (RVM) metric and compared it with other algorithms to verify the improvement in map quality. The RVM represents the ratio of map points used for pose estimation to the total number of map points. A higher RVM indicates more effective information and better map quality. As shown in Figure 5, the RVM over time clearly demonstrated that the overall performance of the proposed algorithm (PGMF-VINS) surpassed that of VINS-mono. Table 3 presents specific data, showing that the RVM of PGMF-VINS exhibited an average increase of 0.1471 compared to VINS-mono, with a variance reduction of 68.8%.
Figure 6 shows the number of converged and estimated map points within the sliding window. In the first 0∼20 s, the MAV is flying within a small range, so we can see the number of converged map points increasing. From 20∼40 s, the MAV is stationary, so the number of converged map points remains almost unchanged, while the number of estimated map points decreases. From 40∼180 s, the number of estimated map points exceeds the number of converged map points because the MAV is moving forward, meaning that most map points are not sufficiently observed, so the number of converged map points is smaller. During that phase, the number of converged map points fluctuates between 0 and 50.
Figure 7 shows the recall rate per hundred points. We can see the recall rate of convergence, estimation, divergence, and disappearance states for every hundred points over the index of map points. The estimate state was the most common state, indicating that the positions of most points were continuously being estimated. This state fluctuated between 0.4 and 0.8. The percentage of convergence, divergence, and disappearance states validated the effectiveness of the proposed method, as it improved map quality by achieving the following: (1) converging the map points which were observed to be stable; (2) excluding outlier points caused by noise, lighting changes, occlusions, and extreme environments; and (3) filtering out random noise points that appeared occasionally.
Figure 8 illustrates the observation times and SoM at convergence. It can be observed that the majority of points had an SoM value between 0.45 and 0.70 at convergence, indicating they were reliable and stable map points. In our experiments, the SoM of outliers typically fell below 0.4. Additionally, the observation times of convergent map points were mainly distributed in the range [2, 20]. With the image publication frequency manually set to 10 Hz, this implies that the time required for these map points to converge from the estimating state was within 2 s.

7. Conclusions and Future Work

In this paper, we proposed a monocular VI-SLAM method using a novel convergence strategy of map points that could improve the quality of the map. Meanwhile, the proposed method boasted a localization accuracy on par with state-of-the-art algorithms. The convergence strategy was primarily composed of a perpendicular-based triangulation and a 3D Gaussian–uniform mixture filter, and we provided a detailed derivation process for them. Finally, we validated the performance on an open-source dataset by comparing against other state-of-the-art approaches.
We also observe some limitations of the proposed method. In the future, we will differentiate the mapping and localization tasks more clearly and run them in a parallel and tightly coupled manner. Furthermore, the majority of existing SLAM methods assume equal weights for visual measurements during optimization, which is actually unreasonable. Our proposed method provides a way to implement variable weights for visual measurements in the optimization graph, which can be defined from the uncertainty. However, deeper research is still necessary to further improve the accuracy and robustness of the system.

Author Contributions

Methodology, W.D.; resources, Z.D. and L.Z.; software, W.D. and Z.Y.; validation, B.H.; writing—original draft, W.D. All authors have read and agreed to the published version of the manuscript.

Funding

This project was funded by the National Natural Science Foundation of China (82090052), Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0103, No. 2023SHZDZX02), Ningbo Science and Technology Bureau, Technology and Equipment of Intelligent and All-terrain Unmanned Aerial System for Magmatic Exploration (No. 2020Z073), and is partly supported by the National Key R&D Program of China (2021ZD0113503).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
2. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890.
3. Davison, A.J.; Reid, I.D.; Molton, N.D.; Stasse, O. MonoSLAM: Real-time single camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1052–1067.
4. Klein, G.; Murray, D. Parallel tracking and mapping for small AR workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 225–234.
5. Mur-Artal, R.; Tardós, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
6. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 834–849.
7. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625.
8. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 15–22.
9. Mourikis, A.I.; Roumeliotis, S.I. A multi-state constraint Kalman filter for vision-aided inertial navigation. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation, Rome, Italy, 10–14 April 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 3565–3572.
10. Bloesch, M.; Omari, S.; Hutter, M.; Siegwart, R. Robust visual inertial odometry using a direct EKF-based approach. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 298–304.
11. Leutenegger, S.; Furgale, P.; Rabaud, V.; Chli, M.; Konolige, K.; Siegwart, R. Keyframe-Based Visual-Inertial Slam Using Nonlinear Optimization. In Proceedings of the Robotics: Science and Systems IX Conference, Berlin, Germany, 24–28 June 2013; Available online: https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/236658/1/RSS2013paper.pdf (accessed on 22 August 2024).
12. Huang, A.S.; Bachrach, A.; Henry, P.; Krainin, M.; Maturana, D.; Fox, D.; Roy, N. Visual odometry and mapping for autonomous flight using an RGB-D camera. In Proceedings of the Robotics Research: The 15th International Symposium ISRR, Flagstaff, AZ, USA, 9–12 December 2011; Springer: Berlin/Heidelberg, Germany, 2017; pp. 235–252.
13. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the IJCAI'81: 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; Volume 2, pp. 674–679.
14. Shi, J. Good features to track. In Proceedings of the 1994 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 21–23 June 1994; IEEE: Piscataway, NJ, USA, 1994; pp. 593–600.
15. Forster, C.; Carlone, L.; Dellaert, F.; Scaramuzza, D. On-manifold preintegration for real-time visual–inertial odometry. IEEE Trans. Robot. 2016, 33, 1–21.
16. Qin, T.; Shen, S. Robust initialization of monocular visual-inertial estimation on aerial robots. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 4225–4232.
17. Vogiatzis, G.; Hernández, C. Video-based, real-time multi-view stereo. Image Vis. Comput. 2011, 29, 434–441.
18. Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163.
Figure 1. Block diagram illustrating the pipeline of the proposed system. This system consists of three modules: measurements’ preprocessing, visual–inertial joint initialization, and PGMF, with PGMF being the core module.
Figure 2. Illustration of the sliding window. The sliding window includes a sequence of frames. A tightly coupled monocular VIO is solved to estimate the state vector of frames. Visual measurements are computed as a reprojection residual, and IMU measurements are computed as a preintegration residual. Non-keyframes are discarded after the optimization.
Figure 3. Schematic diagram illustrating the spatial model of perpendicular-based triangulation. O 1 and O 2 are the optical centers of image frame 1 and image frame 2, respectively. x 1 and x 2 represent the measurements in image coordinates. The observation rays O 1 M 1 and O 2 M 2 are skew lines and intersect their perpendicular line at M 1 and M 2 . We consider the real position M to be at the midpoint of M 1 M 2 .
Figure 4. Pose comparison of trajectories. Left: position comparison for x, y, z. Right: orientation comparison for roll, pitch, and yaw.
Figure 5. The comparison of the RVM over time.
Figure 6. The number of converged and estimated map points within the sliding window.
Figure 7. The recall rate per hundred points. The recall rate of convergence, estimation, divergence, and disappearance states for every hundred points.
Figure 8. The observation times and SoM at convergence.
Table 1. RMSE of the RPE for the EUROC dataset (full transformation for delta = 1.0 m with SE3 Umeyama alignment).

| Sequence | PGMF-VINS Times 1 | Times 2 | Times 3 | Average | VINS-Mono Times 1 | Times 2 | Times 3 | Average |
|---|---|---|---|---|---|---|---|---|
| MH_01_easy | 0.119650 | 0.163103 | 0.103939 | 0.128897 | 0.119660 | 0.119658 | 0.119670 | 0.119663 |
| MH_02_easy | 0.140835 | 0.112661 | 0.180567 | 0.144688 | 0.128412 | 0.128410 | 0.128402 | 0.128408 |
| MH_03_medium | 0.189257 | 0.163397 | 0.231937 | 0.194864 | 0.311137 | 0.311178 | 0.311134 | 0.311150 |
| MH_04_difficult | 0.246830 | 0.207699 | 0.197012 | 0.217180 | 0.341868 | 0.360032 | 0.321771 | 0.341224 |
| MH_05_difficult | 0.190091 | 0.211126 | 0.241917 | 0.214378 | 0.295795 | 0.330839 | 0.295793 | 0.307476 |
| V1_01_easy | 0.157847 | 0.414040 | 0.214192 | 0.262026 | 0.311190 | 0.311189 | 0.311203 | 0.311194 |
| V1_02_medium | 0.194999 | 0.262441 | / | 0.228720 | 0.298319 | 0.298369 | 0.298319 | 0.298336 |
| V1_03_difficult | 0.690327 | / | 0.747513 | 0.718920 | 0.623400 | / | 0.623400 | 0.623400 |
| V2_01_easy | 1.220313 | 0.415527 | 0.745244 | 0.793695 | 0.171696 | 0.171696 | 0.171688 | 0.171693 |
| V2_02_medium | 0.704125 | 0.491096 | 0.138500 | 0.444574 | 0.172795 | 0.192532 | 0.192529 | 0.185952 |
| V2_03_difficult | / | 0.611915 | 0.868672 | 0.740294 | / | 0.461134 | 0.470645 | 0.465890 |

Note: Red indicates the best measurement for the sequence, while green indicates a better average value.
Table 2. RMSE of the APE for the EUROC dataset.

| Sequence | VINS-Mono | MSCKF | SVO | PGMF-VINS (Ours) |
|---|---|---|---|---|
| MH_01 | 0.234 | 0.420 | 0.170 | 0.194 |
| MH_02 | 0.285 | 0.450 | 0.270 | 0.279 |
| MH_03 | 0.867 | 0.230 | 0.420 | 0.371 |
| MH_04 | 0.666 | 0.370 | 1.000 | 0.530 |
| MH_05 | 0.614 | 0.480 | 0.600 | 0.519 |
Table 3. Average and standard deviation of the RVM.

| Method | Avg. | Std. |
|---|---|---|
| VINS-mono | 0.7492 | 0.0077 |
| PGMF-VINS (ours) | 0.8963 | 0.0024 |