4.1. Analysis of VINS Odometry Optimization
This paper analyzes the performance optimization of visual–inertial odometry based on a VINS system in which the ZUPT is applied to the VIMU data [7]. VINS is a mature visual–inertial fusion algorithm that estimates pose and constructs maps by fusing visual point clouds with pre-integrated information from inertial sensors.
Figure 6 shows the flowchart of the localization and mapping process of VINS.
The pre-integration method is a widely employed approach for processing data collected by IMUs. By integrating the angular velocity and acceleration signals, this method can provide information on the system’s relative changes in attitude and velocity with respect to the previous moment [19].
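A commonly used form of these pre-integrated increments, written here in generic notation chosen to match the symbol definitions below (the exact discrete formulation in the original may differ), is

$$
\Delta v_{k,k+1} = \int_{t_k}^{t_{k+1}} \left( \hat{a}_t - b_a - n_a \right) dt,
\qquad
\Delta p_{k,k+1} = \int_{t_k}^{t_{k+1}} \left( \hat{v}_t - b_v - n_v \right) dt .
$$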
In this model, $\Delta v$ and $\Delta p$ represent the changes in velocity and displacement between two specific moments in time. The variable $\hat{a}$ is the accelerometer measurement value, while $b_a$ and $n_a$ represent the accelerometer’s bias and noise, respectively. The variable $\hat{v}$ denotes the calculated velocity value, and $b_v$ and $n_v$ refer to the bias and noise of the calculated velocity value.
During the pre-integration process, the presence of bias errors in the IMU measurements can result in accumulated integration errors, leading to a decrease in the accuracy and stability of visual–inertial odometry. To mitigate this issue, the ZUPT is applied to the foot-mounted VIMU constructed using the method described above. This correction eliminates the bias errors in the VIMU and thereby reduces the accumulation of pre-integration errors. As a result, it enhances the accuracy and stability of visual–inertial odometry [20].
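For context, stance phases on a foot-mounted sensor are typically detected with simple threshold tests on the accelerometer and gyroscope magnitudes. The following Python sketch illustrates one common detector (the thresholds, window length, and function name are assumptions for illustration, not values taken from the paper); its output marks the intervals over which the ZUPT correction is applied:

```python
import numpy as np

def detect_zero_velocity(acc, gyro, g=9.81, acc_tol=0.3, gyro_tol=0.2, window=20):
    """Threshold-based stance detector for a foot-mounted inertial sensor.

    acc, gyro: (N, 3) arrays of accelerometer [m/s^2] and gyroscope [rad/s] samples.
    Returns a boolean array marking samples treated as stationary.
    Thresholds and window length are illustrative, not the paper's values.
    """
    acc_dev = np.abs(np.linalg.norm(acc, axis=1) - g)  # deviation from gravity magnitude
    gyro_mag = np.linalg.norm(gyro, axis=1)            # angular-rate magnitude
    still = (acc_dev < acc_tol) & (gyro_mag < gyro_tol)
    # Require the condition to persist over a short window to reject transients.
    counts = np.convolve(still.astype(float), np.ones(window), mode="same")
    return counts >= 0.9 * window
```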
The ZUPT values for acceleration and velocity are represented by $a_{\mathrm{ZUPT}}$ and $v_{\mathrm{ZUPT}}$, respectively. The pre-integrated measurement model for the foot-mounted VIMU after the ZUPT can be defined as follows:
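A sketch of such a corrected model, under the assumption that the ZUPT values absorb the respective bias terms estimated during stance phases (the original formulation may differ in detail), is

$$
\Delta \hat{v}_{k,k+1} = \int_{t_k}^{t_{k+1}} \left( \hat{a}_t - a_{\mathrm{ZUPT}} - n_a \right) dt,
\qquad
\Delta \hat{p}_{k,k+1} = \int_{t_k}^{t_{k+1}} \left( \hat{v}_t - v_{\mathrm{ZUPT}} - n_v \right) dt .
$$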
Considering the measurements of the foot-mounted VIMU within a sliding window comprising consecutive frames $b_k$ and $b_{k+1}$, the residual of the pre-integrated foot-mounted VIMU measurement can be formulated in accordance with the defined VIMU measurement model:
The system uses an iterative nonlinear least squares optimization to minimize the cost function and update the estimated camera pose and the IMU and VIMU states; the detailed optimization process and related formulas are presented in the literature [21]. Adding ZUPT constraints to the cost function enforces zero velocity during stationary periods, which enhances the overall pose estimation accuracy and improves the smoothness of the visual–inertial odometry. The ZUPT error terms for the velocity and camera pose during stationary periods can be formulated as follows:
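A plausible form consistent with the variable descriptions that follow (the exact expression in the original may differ) is

$$
e_v = \lambda_v \left\| \hat{v}_k - v_{\mathrm{ZUPT}} \right\|^2,
\qquad
e_p = \lambda_p \left\| \hat{p}_k - p_{\mathrm{ZUPT}} \right\|^2,
$$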
where $\hat{v}_k$ and $\hat{p}_k$ are the predicted velocity and position from IMU integration, $v_{\mathrm{ZUPT}}$ and $p_{\mathrm{ZUPT}}$ are the observed velocity and position from the ZUPT constraint, and $\lambda_v$ and $\lambda_p$ are the gain factors that control the influence of the ZUPT constraint on the velocity and camera pose estimation.
The ZUPT error terms for the velocity and camera pose during stationary periods can be included as additional error terms in the cost function for the optimization step. The additional ZUPT constraint can be formulated as follows:
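A sketch consistent with the description below (a weighted sum of the two ZUPT error terms) is

$$
E_{\mathrm{ZUPT}} = \mu_v \, e_v + \mu_p \, e_p ,
$$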
where $\mu_v$ and $\mu_p$ are regularization parameters that control the trade-off between the other measurements and the ZUPT constraints, and $e_v$ and $e_p$ are the ZUPT error terms for velocity and camera pose, respectively.
The incorporation of zero-velocity constraints provides additional information to the optimization process, helping to improve the accuracy of the estimated camera pose and the IMU and VIMU states during stationary periods, when visual measurements may be ambiguous or noisy. By further constraining the system’s motion, it also improves the consistency of the estimated trajectory, reducing drift and enhancing long-term stability.
This article describes the adjustments made to the original state estimation framework of VINS: the original IMU is replaced with a ZUPT-corrected VIMU, and ZUPT constraints are incorporated. After the adjustment, the framework still supports multiple sensor configurations, such as stereo cameras, a monocular camera with a VIMU, and stereo cameras with a VIMU. Each sensor is treated as a general factor, and factors that share common state variables are summed together to build the optimization problem. The fusion of multi-sensor information factors is shown in Figure 7.
The maximum a posteriori estimate is obtained by minimizing the sum of the prior and the Mahalanobis norms of all measurement residuals through visual–inertial bundle adjustment, as follows:
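Following the VINS-Fusion bundle-adjustment formulation with the ZUPT term appended (here $\{r_p, H_p\}$ denotes the prior from marginalization and $\rho(\cdot)$ a robust loss; the notation is generic and may differ from the original), the optimization can be written as

$$
\min_{\mathcal{X}} \left\{
\left\| r_p - H_p \mathcal{X} \right\|^2
+ \sum_{k \in \mathcal{B}} \left\| r_{\mathcal{B}}\!\left( \hat{z}_{b_{k+1}}^{b_k}, \mathcal{X} \right) \right\|_{P_{b_{k+1}}^{b_k}}^{2}
+ \sum_{(l,j) \in \mathcal{C}} \rho\!\left( \left\| r_{\mathcal{C}}\!\left( \hat{z}_{l}^{c_j}, \mathcal{X} \right) \right\|_{P_{l}^{c_j}}^{2} \right)
+ E_{\mathrm{ZUPT}}
\right\},
$$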
where $r_{\mathcal{C}}$ represents the residual of the visual measurement, $r_{\mathcal{B}}$ represents the residual of the VIMU measurement, and $E_{\mathrm{ZUPT}}$ represents the zero-velocity update constraint.
Specifically, the residual of the visual measurement and the residuals of the IMU, the VIMU (with ZUPT), and the zero-velocity constraint are all included in the cost function. Although the cost function differs slightly after adding the ZUPT for the VIMU, the dimension of the VIMU residual remains the same as that of the IMU residual, and the dimension of the state to be solved is unchanged. By adding the zero-velocity constraint, the positioning accuracy and smoothness of the visual–inertial odometry are improved in complex environments.
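To make the role of the added constraint concrete, the following minimal Python sketch (not the authors’ implementation; the weights and measurement values are illustrative assumptions) solves a toy two-frame least squares problem in which a pre-integrated velocity factor and a zero-velocity factor share the same state:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy two-frame problem: estimate the velocities of consecutive frames k and
# k+1 from (i) a pre-integrated VIMU velocity increment and (ii) a zero-velocity
# pseudo-measurement on frame k+1, which is assumed to be stationary.
delta_v_meas = np.array([0.02, -0.01, 0.00])  # pre-integrated velocity increment (illustrative)
w_vimu = 10.0    # square-root information weight of the VIMU factor (assumed)
lam_zupt = 50.0  # gain of the zero-velocity constraint (assumed)

def residuals(x):
    v_k, v_k1 = x[:3], x[3:]                          # stacked state: [v_k, v_{k+1}]
    r_vimu = w_vimu * ((v_k1 - v_k) - delta_v_meas)   # pre-integration residual
    r_zupt = lam_zupt * v_k1                          # ZUPT residual: v_{k+1} should be zero
    return np.concatenate([r_vimu, r_zupt])

sol = least_squares(residuals, x0=np.zeros(6))        # iterative nonlinear least squares
print("v_k   =", sol.x[:3])
print("v_k+1 =", sol.x[3:])
```

Because both factors act on the stationary frame’s velocity, increasing the ZUPT gain pulls that velocity toward zero while the pre-integration factor redistributes the increment to the previous frame, mirroring how the constraint suppresses drift in the full optimization.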
4.3. Localization and Mapping Experiments in Degenerated Environments
To assess the efficacy of the proposed visual–inertial SLAM system aided by the VIMU and VINS-Fusion, we conducted experiments using a JetBot 2.0 quadruped robot fitted with Realsense D435/435i cameras and GPS. GPS was used as the baseline path for our experiments. The collected data were processed and stored on a computer running ROS Melodic on Ubuntu 20.04. The experimental setting comprised both indoor and outdoor environments and involved trajectories with effective loop closures, designed to comprehensively evaluate and validate the proposed methodology [21].
In Experiment 1, we conducted a benchmark straight-line motion experiment in an indoor environment. The hallway was chosen as the test environment due to its rectangular shape and simple scene with only one feature, which can be considered a single-degree-of-freedom degenerate scene, as illustrated in Figure 8a.
During the experiment, the JetBot robot was manually driven at an approximately constant speed along a straight path using the walking gait, and the visual–inertial SLAM system based on the VIMU and VINS-Fusion was used to estimate the robot’s position and orientation. The experiment was repeated with different levels of camera shake, induced by attaching weights of different magnitudes to the robot’s camera system.
The results were compared with those obtained from the VINS-Fusion system and are shown in Table 5 and Figure 8b.
To evaluate the accuracy of the SLAM system on the dataset, we used the open-source tool evo, and the absolute trajectory error (ATE) was selected as the evaluation metric to measure the error between the estimated and ground-truth trajectories after running the dataset sequence. The obtained results demonstrate that the proposed system achieved a significant improvement in accuracy. Specifically, the maximum error (Max) decreased on average by 74.3%, the mean error (Mean) decreased on average by 63.6%, the median error (Median) decreased on average by 17.8%, the root mean square error (RMSE) decreased on average by 76.9%, and the standard deviation error (Std) decreased on average by 78.1%. These results indicate the effectiveness of the proposed approach in improving the accuracy of SLAM systems [22].
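For reference, the same ATE statistics (Max, Mean, Median, RMSE, Std) can be reproduced with evo’s Python API. The sketch below is illustrative: the trajectory file names are hypothetical, and the authors may equally have used the evo_ape command-line tool:

```python
from evo.core import metrics, sync
from evo.tools import file_interface

# Load the ground-truth and estimated trajectories (TUM format assumed).
traj_ref = file_interface.read_tum_trajectory_file("ground_truth.txt")
traj_est = file_interface.read_tum_trajectory_file("vimu_vins_estimate.txt")

# Associate poses by timestamp and align the estimate to the reference (SE(3)).
traj_ref, traj_est = sync.associate_trajectories(traj_ref, traj_est)
traj_est.align(traj_ref, correct_scale=False)

# Absolute trajectory error on the translation part.
ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((traj_ref, traj_est))
stats = ape.get_all_statistics()  # dict containing max, mean, median, rmse, std, ...
for name in ("max", "mean", "median", "rmse", "std"):
    print(f"{name:>6}: {stats[name]:.4f} m")
```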
Figure 8c,d show that the VINS-Fusion system performs well in indoor environments, but its positioning error increases significantly when the camera experiences large movements. In comparison to VINS-Fusion, the system proposed in this paper exhibits greater stability and smaller trajectory deviations in dynamic scenarios, as evidenced by the trajectories being closer to the ground-truth straight line. This is because the proposed system can avoid highly distorted trajectories and effectively suppress filter drift by leveraging both the body IMU and VIMU data.
In Experiment 2, the selected outdoor environment provided a more complex setting to test the proposed SLAM system. The ground truth is shown in Figure 9a–c. The road and parking lot that circled the building created a challenging closed-loop path, requiring the system to handle dynamic changes in the environment, such as moving vehicles and pedestrians. The filtering-based mechanism of VINS-Fusion proved to have limitations in this kind of environment, as deviations from the reference path occurred after loop closure. In contrast, our system demonstrated improved robustness and accuracy in handling dynamic changes, including correcting for data distortion caused by large camera shakes through the virtual inertial navigation component.
In addition, the significant path errors that can occur when the camera angle changes sharply during turns were also addressed in this experiment. As shown in Figure 9d, the proposed SLAM system was able to correct for these errors, resulting in more accurate and stable trajectory estimation.
To evaluate the performance of the SLAM systems in the outdoor environment, the same evo tool was used, and the results are shown in Figure 9e,f and Table 6. Compared to VINS-Fusion, the proposed system demonstrated better performance, reducing the maximum error (Max) by an average of 14.9%, the mean error (Mean) by an average of 3.21%, the median error (Median) by an average of 4.73%, the root mean square error (RMSE) by an average of 7.04%, and the standard deviation error (Std) by an average of 22.5%. These results confirm the effectiveness of the proposed system in handling complex outdoor environments and improving trajectory estimation accuracy.
In Experiment 3, we selected a closed-loop path around a building, as shown in Figure 10a. The ground truth is shown in Figure 10a,b. During data collection, we deliberately increased the camera’s motion speed and amplitude to simulate a scenario in which VINS-Fusion may struggle to track the camera’s motion. We also introduced errors in the initial pose information to evaluate the robustness of the two SLAM systems. When the camera’s motion speed and amplitude were too high, tracking became difficult, and the tracking performance of VINS-Fusion degraded. Additionally, VINS-Fusion has a high demand for accurate initial pose information, and its absence causes errors to accumulate gradually, leading to significant distortion during initialization. In contrast, the visual–inertial SLAM aided by the VIMU can use foot sensor data to provide additional information and adjust the weight ratio of each data source to achieve better tracking and curve-fitting results, while reducing the demand for accurate initial pose information [23], as shown in Figure 10d,e.
The results of the experiment showed that our visual–inertial SLAM aided by VIMU outperformed the VINS-Fusion system. The VIMU-based SLAM system was able to handle the camera’s fast and large motion while maintaining accurate tracking, whereas VINS-Fusion struggled and suffered from a decrease in tracking performance. Moreover, the VIMU-based system showed better performance in initialization, achieving more accurate pose estimation than VINS-Fusion, even when the initial pose information was inaccurate.
Through experimental comparisons, we found that the visual–inertial SLAM aided by the VIMU performed better than the VINS-Fusion method in these scenarios. Our system can cope with excessively fast and large camera motion during data collection and exhibits better initialization performance. Additionally, our system adapts better to inaccurate initial pose information.