Visual–Inertial Navigation System Based on Virtual Inertial Sensors

Cai, Yunpiao; Qian, Weixing; Zhao, Jiaqi; Dong, Jiayi; Shen, Tianxiao

doi:10.3390/app13127248

Open AccessArticle

Visual–Inertial Navigation System Based on Virtual Inertial Sensors

by

Yunpiao Cai

,

Weixing Qian

^*

,

Jiaqi Zhao

,

Jiayi Dong

and

Tianxiao Shen

School of Electrical and Automation Engineering, Nanjing Normal University, Nanjing 210023, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(12), 7248; https://doi.org/10.3390/app13127248

Submission received: 14 May 2023 / Revised: 9 June 2023 / Accepted: 15 June 2023 / Published: 17 June 2023

(This article belongs to the Special Issue New Insights into Positioning and Navigation Technologies)

Download

Browse Figures

Versions Notes

Abstract

:

In this paper, we propose a novel visual–inertial simultaneous localization and mapping (SLAM) method for intelligent navigation systems that aims to overcome the challenges posed by dynamic or large-scale outdoor environments. Our approach constructs a visual–inertial navigation system by utilizing virtual inertial sensor components that are mapped to the torso IMU under different gait patterns through gait classification. We apply a zero-velocity update (ZUPT) to initialize the system with the original visual–inertial information. The pose information is then iteratively updated through nonlinear least squares optimization, incorporating additional constraints from the ZUPT to improve the accuracy of the system’s positioning and mapping capabilities in degenerate environments. Finally, the corrected pose information is fed into the solution. We evaluate the performance of our proposed SLAM method in three typical environments, demonstrating its applicability and high precision across various scenarios. Our method represents a significant advancement in the field of intelligent navigation systems and offers a promising solution to the challenges posed by degenerate environments.

Keywords:

virtual IMU construction; simultaneous mapping and localization technology; attention-based convolutional; neural network; resnet-gated recurrent unit neural network; visual–inertial navigation

1. Introduction

Simultaneous localization and mapping (SLAM) is critical technology in mobile navigation, enabling robots to create environmental maps and achieve autonomous localization by utilizing sensors mounted on them [1]. In recent years, SLAM systems with complementarity have been developed to meet the higher demands of mobile platform application scenarios. Among them, the most representative system is the fusion of laser–inertial [2] and visual–inertial sensors [3].

However, when laser radar is used as the primary sensor, it may lack certain directional constraints in scenes with sparse structural features. This is due to its inability to distinguish geometrically continuous repeated scenes, which can result in unreliable motion estimation and significant errors [4]. In contrast, visual–inertial SLAM systems utilizing visual sensors have the advantages of lower cost and smaller size [5]. Furthermore, the fusion of inertial and other sensors in visual–inertial SLAM systems has been demonstrated to improve their accuracy and robustness [6], making them the mainstream combination method in the field of visual SLAM.

Visual–inertial SLAM systems can be classified into two forms, loosely coupled and tightly coupled, as discussed in the literature [7]. The loosely coupled approach, exemplified by Weiss et al.’s 2012 strategy based on optical flow and inertial constraints, addresses certain mapping challenges. However, it may lead to decreased mapping accuracy due to visual measurement errors [8].

Recently, more progress has been made in the research of deep learning-based visual–inertial fusion methods in the field of visual–inertial navigation. HVIONet (a deep learning-based hybrid visual–inertial odometry approach) is a low-cost visual–inertial fusion method that integrates inertial and image data using deep learning frameworks to improve the speed and accuracy through feature fusion. However, it requires ground-truth data and has high computational costs, limiting its practical application [9]. VIIONet (a visual–inertial image-odometry network) utilizes a deep learning method to fuse data from both sources, converting features in the fully connected layer to poses through Gaussian regression and enhancing stability using Savitzky–Golay filters for pose estimation. While this method achieves more accurate pose estimation, real-time synchronization of pose estimation with UAV motion remains an area for improvement. Another emerging deep learning-based fusion method is EMA-VIO, which incorporates external memory attention for efficient and effective pose estimation. However, this method’s model requires GPU training, restricting its use in micro-navigation, and further stability verification is needed in open environments [10,11]. We have developed a new sensor through deep learning to enhance data acquisition and improve accuracy. By pre-collecting data through virtual inertial sensors, we can efficiently process real-time data and perform experiments on GPU platforms. Additionally, in scenarios with low texture, lighting variations, fast motion, or large-scale environments, the visual–inertial SLAM system may encounter tracking failures, leading to errors in state estimation. Furthermore, external factors such as weather conditions and occlusions can also affect the mapping accuracy of the system. When deployed on mobile platforms, the reliability of IMU observations can be compromised due to complex motion behavior. Incorporating multiple IMUs to assist in measurements increases power consumption, complexity, and potential data loss. Therefore, establishing the relationship between the inertial sensor measurement on the carrier through deep learning and constructing virtual inertial sensors to replace the physical IMU on the carrier presents an effective solution to simplify the system.

This paper proposes a visual–inertial navigation SLAM system that utilizes a DeepConvGRU neural network model to classify IMU gait data fand map the virtual IMU (VIMU) data through a DenseNet-GRU hybrid model. The proposed method eliminates accumulated errors before pre-integration and obviates the need for additional IMU hardware. The system performs an external fusion ZUPT followed by tight coupling optimization within the sliding window to update the pose graph, reduce localization and mapping drift, and enhance motion attitude estimation and the construction of the scene’s three-dimensional structure. The experimental results show that the proposed system effectively solves the common accumulated error problem in VIO while accurately estimating motion attitude and constructing the scene’s three-dimensional structure.

2. System Overview

This paper proposes a VIMU-aided visual–inertial SLAM system to address the high hardware requirements and poor robustness in dynamic environments of existing approaches, such as VINS-Fusion. The main workflow is illustrated in Figure 1.

This method begins by classifying the inertial sensor information into gait phases using a DeepConvGRU neural network model. The DenseNet-GRU hybrid model is used to extract the nonlinear mapping relationship between the torso and foot inertial sensors, thereby constructing the VIMU. The pose information is estimated using the loosely coupled INS algorithm, and cumulative errors are corrected using the ZUPT algorithm. This method does not require the introduction of additional IMU hardware as a data source. Instead, data are still collected by a camera and a single IMU. The IMU data are mapped to the VIMU data using a deep learning model, which is then corrected for zero velocity to obtain a virtual inertial navigation sensor. After pre-integration processing, the VIMU data are used together with the pre-integration results of the camera and IMU for initialization.

The visual part uses the KLT sparse optical flow tracking method to obtain the feature points of the latest image, while the IMU part performs the pre-integration calculations between the adjacent keyframes. The pose information is updated by iteratively minimizing the nonlinear least squares cost function, with additional optimization information provided by the ZUPT constraint included in the cost function. The calibration of physical quantities is completed by combining the attitude obtained by visual SFM with the IMU pre-integration. Subsequently, back-end sliding window optimization is performed, with visual constraints, IMU constraints, and loop closure constraints included in a large objective function for nonlinear optimization. Marginalization is introduced to convert the related constraints into priors when removing poses, with the Marg residual part, IMU residual part, and visual reprojection error part handled separately. The entire local mapping of VINS is completed by the VIO fusion module, which uses IMU and camera observations as constraints for BA optimization.

Next, loop detection is performed. When multiple loops occur within the sliding window, all corresponding features retrieved from the loop frames are used to optimize the state estimation results. This guides the subsequent nonlinear optimization of VIO, and performs tightly coupled optimization of feature observations and IMU pre-integration values through nonlinear optimization within the sliding window, while also having the ability to perform re-localization. Finally, pose graph optimization is performed, utilizing the re-localization data to load the past pose graph into the global map for optimization and reuse, thereby eliminating localization and mapping drift.

3. Virtual Inertial Navigation System Based on Deep Learning

3.1. Construction of Virtual Inertial Sensors Based on Deep Learning

The construction of virtual inertial sensors involves synchronously collecting the inertial motion information of a quadruped robot using two distributed inertial sensors, and using a deep learning network model to classify the torso inertial sensor information based on gait, thereby constructing a robot foot motion model [12], as illustrated in Figure 2.

In this paper, we use the DeepConvGRU neural network model to classify the gait information from the inertial sensors. This model combines convolutional neural networks and gated recurrent units (GRU), replacing LSTM as the recurrent layer [13,14], and preserves important features through gates to prevent feature loss during long-term propagation. This type of model can effectively reduce the difficulty of model training without compromising model accuracy [15], as shown in Figure 3.

Based on the gait classification, a DenseNet-GRU hybrid model was used to model the nonlinear mapping relationship between the torso and foot inertial sensors. Corresponding virtual inertial sensor models were trained for different gaits to reduce the complexity of fitting the inertial sensor mapping relationship, thereby further improving the regression accuracy of the hybrid model. This achieved more accurate gait recognition and behavior monitoring, as shown in Figure 4.

The DenseNet-GRU model extracts motion feature information from the inertial sensor data [16], effectively utilizing the feature information through dense connections, and learns and extracts hidden feature information from the input data. The GRU neural network further explores the temporal characteristics of inertial data [13] and inputs the extracted features into the fully connected layer to establish a nonlinear mapping relationship between the torso and foot inertial sensor information [17].

3.2. Zero-Velocity Update-Based Optimization of Virtual Inertial Sensor Data

This paper improves the accuracy of the pose estimation system of the virtual inertial sensor by incorporating the ZUPT method to reduce the cumulative errors in the inertial navigation estimation process.

To reduce the cumulative errors in the inertial navigation estimation process, the interval zero-velocity search algorithm is used for the error update. This algorithm uses the acceleration data to detect the motion state of the robot when it is still and segments the searching units by filtering the angular velocity. The angular velocity is smoothed by the sliding window average method. When the system detects zero velocity during movement, the Kalman filter is triggered to estimate and correct the output error of the system, and the rotation matrix is updated accordingly to eliminate the cumulative errors [18].

Because the distribution of zero-velocity intervals varies in different motion modes of the quadruped robot, it can affect the accuracy of the ZUPT. The system reduces the misjudgment rate of the zero-velocity interval by adding gait classification constraints and adjusts the filtering parameters according to the motion mode to improve the accuracy of the ZUPT, as shown in Figure 5.

3.3. Performance Verification of Zero-Velocity Update of VIMU

In this study, we initially evaluated the accuracy of the gait classification. Table 1 shows that the DeepConvGRU model achieves a gait classification rate of approximately 97.6%. Considering the fault tolerance capability of the DenseNet-GRU hybrid model, the achieved accuracy in the experiment is sufficient for the subsequent development of virtual inertial sensors.

We additionally gathered torso and foot inertial sensor data to validate three common gaits of the quadruped robot: walking, running, and uphill/downhill walking. To compare the constructed virtual inertial sensors using the DenseNet-GRU hybrid network model with the actual foot inertial sensor data, we refrained from presenting the actual fitting curve. This decision was made based on the observation that during the motion process, the amplitude of the motion information significantly exceeds the fitting error’s magnitude, rendering the difference between the two negligible and indistinguishable. Table 2 shows the fitting output error of the virtual inertial sensor. From the results, it can be seen that the fitting output error of the virtual inertial sensor is relatively small when the quadruped robot moves relatively vigorously on slopes such as uphill and downhill, meeting the system accuracy requirements.

Where a and w, respectively, represent accelerometers and gyroscopes, M is the output deviation of the IMU, and S is the output noise of the IMU.

Under the same conditions, the ZUPT method is applied to process the fitting output results of the VIMU. Table 3 presents the error of the ZUPT output of the VIMU, which shows that the ZUPT output error of the VIMU is relatively small during the typical motion of the quadruped robot, for which improvement enhances the system’s accuracy for positioning.

Based on the comparison results, it can be seen that the fitting curves of the VIMU have similar identification accuracy and a small curve error to the actual foot inertial sensor in the case of relatively intense movements such as uphill and downhill of the quadruped robot, as in the previous study, which meets the positioning accuracy requirements of the inertial navigation system [12].

4. Analysis and Experiments of Visual–Inertial Odometry Optimization

4.1. Analysis of VINS Odometry Optimization

This paper analyzes the performance optimization of visual–inertial odometry based on a VINS system with the inclusion of the ZUPT on the VIMU data [7]. VINS belongs to the category of mature visual–inertial fusion algorithms, where the system estimates the pose information and constructs maps by fusing visual point clouds with pre-integrated information from inertial sensors.

Figure 6 shows the flowchart of the localization and mapping process of VINS. It illustrates how the system estimates pose information and constructs maps by fusing visual point clouds with pre-integrated information from inertial sensors.

The pre-integration method is a widely employed approach for processing data collected by IMUs. By integrating the angular velocity and acceleration signals, this method can provide information on the system’s relative changes in attitude and velocity with respect to the previous moment [19].

\begin{array}{l} Δ v_{b} = \int a_{b} (τ) d τ = \int (a_{b} (τ) - b_{a} (τ) - n_{a} (τ)) d τ \\ Δ p_{b} = \int v_{b} (τ) d τ = \int (v_{b} (τ) - b_{v} (τ) - n_{v} (τ)) d τ \end{array}

(1)

In this equation,

Δ v_{b}

and

Δ p_{b}

represent the change in velocity and displacement between two specific moments in time. The variable

a_{b}

is the accelerometer measurement value, while

b_{a}

and

n_{a}

represent the accelerometer’s bias and noise, respectively. The variable

v_{b}

denotes the calculated velocity value, and finally,

b_{v}

and

n_{v}

refer to the bias and noise of the calculated velocity value.

During the pre-integration process, the presence of bias errors in the IMU measurements can result in accumulated integration errors, leading to a decrease in the accuracy and stability of visual–inertial odometry. To mitigate this issue, the ZUPT is applied to the foot-mounted VIMU constructed using the method described above. This correction eliminates the bias errors in the VIMU and thereby reduces the accumulation of pre-integration errors. As a result, it enhances the accuracy and stability of visual–inertial odometry [20].

\begin{array}{l} a_{b} (τ) = a_{b} (τ) - b_{a} (τ) - n_{a} (τ) - Δ b_{a} \\ v_{b} (τ) = v_{b} (τ) - b_{v} (τ) - n_{v} (τ) - Δ b_{v} \end{array}

(2)

The ZUPT values for acceleration and velocity are represented by

Δ b_{a}

and

Δ b_{v}

, respectively. The pre-integrated measurement model for the foot-mounted VIMU after the ZUPT can be defined as follows:

[\begin{matrix} {\hat{α}}_{b_{k + 1}}^{b_{k}} \\ {\hat{β}}_{b_{k + 1}}^{b_{k}} \\ {\hat{γ}}_{b_{k + 1}}^{b_{k}} \\ 0 \\ 0 \end{matrix}] = [\begin{matrix} R_{ω}^{b_{k}} (P_{b_{k + 1}}^{ω} - P_{b_{k}}^{ω} + \frac{1}{2} g^{ω} Δ t_{k}^{2} - v_{b_{k}}^{ω} Δ t_{k}) \\ R_{ω}^{b_{k}} (v_{b_{k + 1}}^{ω} + g^{ω} Δ t_{k} - v_{b_{k}}^{ω}) \\ q_{b_{k}}^{ω}^{- 1} \otimes q_{b_{k + 1}}^{ω} \\ b_{a}_{b_{k + 1}} - b_{a}_{b_{k}} \\ b_{ω}_{b_{k + 1}} - b_{ω}_{b_{k}} \end{matrix}]

(3)

Considering the measurements of the foot-mounted VIMU within a sliding window comprising consecutive frames

b_{k}

and

b_{k + 1}

, the residual of the pre-integrated foot-mounted VIMU measurement can be formulated in accordance with the defined VIMU measurement model:

r_{V B} ({\hat{z}}_{b_{k + 1}}^{b_{k}}, X) = [\begin{matrix} δ α_{b_{k + 1}}^{b_{k}} \\ δ β_{b_{k + 1}}^{b_{k}} \\ δ θ_{b_{k + 1}}^{b_{k}} \\ δ b_{a} \\ δ b_{g} \end{matrix}] = [\begin{matrix} R_{ω}^{b_{k}} (P_{b_{k + 1}}^{ω} - P_{b_{k}}^{ω} + \frac{1}{2} g^{ω} Δ t_{k}^{2} - v_{b_{k}}^{ω} Δ t_{k}) - {\hat{α}}_{b_{k + 1}}^{b_{k}} \\ R_{ω}^{b_{k}} (v_{b_{k + 1}}^{ω} + g^{ω} Δ t_{k} - v_{b_{k}}^{ω}) - {\hat{β}}_{b_{k + 1}}^{b_{k}} \\ 2 {[q_{b_{k}}^{ω}^{- 1} \otimes q_{b_{k + 1}}^{ω} \otimes {({\hat{γ}}_{b_{k + 1}}^{b_{k}})}^{- 1}]}_{x y z} \\ b_{a}_{b_{k + 1}} - b_{a}_{b_{k}} \\ b_{ω}_{b_{k + 1}} - b_{ω}_{b_{k}} \end{matrix}]

(4)

The system utilizes an iterative nonlinear least squares optimization method to minimize the cost function and update the estimated camera pose, the IMU and VIMU states. The detailed optimization process and related formulas are presented in the literature [21]. By adding ZUPT constraints into the cost function, the enforcement of zero velocity during stationary periods is facilitated, leading to enhanced overall pose estimation accuracy and improved smoothness of visual–inertial odometry. The ZUPT error terms for the velocity and camera pose during stationary periods can be formulated as follows:

\begin{array}{l} e r r o r_{ZUPT}^{v} = K_{v} \cdot (v_{p r e d} - v_{o b s}) \\ e r r o r_{ZUPT}^{p} = K_{p} \cdot (p_{p r e d} - p_{o b s}) \end{array}

(5)

where

v_{p r e d}

and

p_{p r e d}

are the predicted velocity and position from IMU integration,

v_{o b s}

and

p_{o b s}

are the observed velocity and position from the ZUPT constraint, and

K_{v}

and

K_{p}

are the gain factor that controls the influence of the ZUPT constraint on the velocity and camera pose estimation.

The ZUPT error terms for the velocity and camera pose during stationary periods can be included as additional error terms in the cost function for the optimization step. The additional ZUPT constraint can be formulated as

e r r o r_{ZUPT} = λ_{v} e r r o r_{ZUPT}^{v} + λ_{p} e r r o r_{ZUPT}^{p}

(6)

where

λ_{v}

and

λ_{p}

are regularization parameters that control the trade-off between the other measurements and the ZUPT constraints, and

e r r o r_{ZUPT}^{v}

and

e r r o r_{ZUPT}^{p}

are the ZUPT error terms for velocity and camera pose, respectively.

The incorporation of zero-velocity constraints provides additional information to the optimization process, helping to improve the accuracy of estimating the camera pose and the IMU and VIMU states during stationary periods when visual measurements may be ambiguous or noisy. Additionally, by providing additional constraints on the system’s motion, it helps improve the consistency of the estimated trajectory, reducing drift and improving the long-term stability of the estimated trajectory.

This article shows the adjustments made to the original state estimation framework of VINS, replacing the original IMU with a ZUPT-modified VIMU and incorporating ZUPT constraints. After adjustment, the framework still supports multiple sensor choices, such as stereo cameras, a monocular camera with a VIMU, and stereo cameras with a VIMU. Each sensor is treated as a general factor. Factors which share common state variables are summed together to build the optimization problem. The fusion of multi-sensor information factors is shown in Figure 7.

The maximum a posteriori estimation result is obtained by minimizing the prior sum of all measurement residuals, and the Mahalanobis norm through visual–inertial bundle adjustment is as follows:

\begin{matrix} \min \\ X \end{matrix} \{\sum_{(l, j) \in C} ρ ({‖r_{C} ({\hat{z}}_{l}^{C_{j}}, X)‖}_{P_{l}^{C_{j}}}^{2}) + \sum_{k \in B} {‖r_{B} ({\hat{z}}_{b_{k + 1}}^{b_{k}}, X)‖}_{P_{b_{k + 1}}^{b_{k}}}^{2} + \sum_{k \in V B} {‖r_{V B} ({\hat{z}}_{b_{k + 1}}^{b_{k}}, X)‖}_{P_{b_{k + 1}}^{b_{k}}}^{2} + e r r o r_{ZUPT}\}

(7)

where

r_{C} ({\hat{z}}_{l}^{C_{j}}, X)

represents the residual of the visual measurement,

r_{V B} ({\hat{z}}_{b_{k + 1}}^{b_{k}}, X)

represents the residual of the VIMU measurement, and

e r r o r_{ZUPT}

represents the zero-velocity update constraint.

Specifically, the residual of the visual measurement and the residual of the IMU, VIMU (with ZUPT), and zero-velocity constraint are all included in the cost function. Although the cost function slightly differs after adding the ZUPT for the VIMU, the dimension of the VIMU residual remains the same as that of the IMU residual, and the dimension of the state to be solved remains unchanged. By adding the zero-velocity constraint, the positioning accuracy and smoothness of the visual–inertial odometer are improved in complex environments.

4.2. KITTI Dataset Test Experiment

As shown in the table, the official evaluation metrics of the KITTI dataset are used to validate the quantitative results of our method, which are the average translation RMSE drift rate (%) and the average rotation RMSE drift rate (°/20 m). The table shows that our proposed method outperforms VINS-Fusion in both translation and rotation. Compared with VINS-Fusion, our method significantly reduces the translation drift by 36% and the rotation drift by 9%. Due to the less accurate calibration and time synchronization of the camera and IMU on the KITTI dataset, the integration of visual and inertial features becomes more effective after external memory attention, which is beneficial for deep mining and the utilization of features between data. Details can be found in Table 4. The deep learning model adaptively adjusts the feature weights for different positions based on the training data. Meanwhile, the algorithm continuously processes and optimizes the pose, IMU data, and odometer data through polling and updates these three data based on virtual IMU data, improving the calibration of the temporal information and achieving lower drift.

4.3. Localization and Mapping Experiments in Degenerated Environments

To assess the efficacy of the proposed visual–inertial SLAM system aided by the VIMU and VINS-Fusion, we conducted experiments using a JetBot 2.0 quadruped robot fitted with Realsense D435/435i cameras and GPS. GPS was used as the baseline path for our experiments. The collected data were processed and stored on a computer running the Ubuntu 20.04 ROS Melodic system. The experimental setting comprised both indoor and outdoor environments and involved trajectories with effective loop closures, which were designed to comprehensively evaluate and validate the proposed methodology [21].

In Experiment 1, we conducted a benchmark straight-line motion experiment in an indoor environment. The hallway was chosen as the test environment due to its rectangular shape and simple scene with only one feature, which can be considered as a single-degree-of-freedom degenerate scene, as illustrated in Figure 8a.

During the experiment, the JetBot robot was manually driven to approach a constant speed along a straight path using the walking gait, and the visual–inertial SLAM system using the VIMU and VINS-Fusion system was used to estimate the robot’s position and orientation. The experiment was repeated with different levels of camera shake, induced by attaching weights of different magnitudes to the robot’s camera system.

The results were compared to those obtained from the VINS-Fusion system, which serve as Table 5 and Figure 8b.

In order to evaluate the accuracy of the SLAM system on the dataset, we used the open-source tool evo, and the absolute trajectory error (ATE) was selected as the evaluation standard to measure the error between the estimated and ground-truth trajectories after running the dataset sequence. The obtained results demonstrate that the proposed system achieved a significant improvement in accuracy. Specifically, the maximum error (Max) decreased on average by 74.3%, the mean error (Mean) decreased on average by 63.6%, the median error (Median) decreased on average by 17.8%, the root means square error (RMSE) decreased on average by 76.9%, and the standard deviation error (Std) decreased on average by 78.1%. These results indicate the effectiveness of the proposed approach in improving the accuracy of SLAM systems [22].

Figure 8c,d show that the VINS-Fusion system performs well in indoor environments, but its positioning error increases significantly when the camera experiences large movements. In comparison to VINS-Fusion, the system proposed in this paper exhibits greater stability and smaller trajectory deviations in dynamic scenarios, as evidenced by the trajectories being closer to the ground-truth straight line. This is because the proposed system can avoid highly distorted trajectories and effectively suppress filter drift by leveraging both the body IMU and VIMU data.

In Experiment 2, the selected outdoor environment provided a more complex setting to test the proposed SLAM system. The ground truth is shown in Figure 9a–c. The road and parking lot that circled the building created a challenging closed-loop path, requiring the system to handle dynamic changes in the environment, such as moving vehicles and pedestrians. The filtering-based mechanism of VINS-Fusion proved to have limitations in this kind of environment, as deviations from the reference path occurred after loop closure. In contrast, our system demonstrated improved robustness and accuracy in handling dynamic changes, including correcting for data distortion caused by large camera shakes through a virtual inertial navigation component.

In addition, the significant path errors that can occur when the camera angle changes sharply during turns were also addressed in this experiment. As shown in Figure 9d, the proposed SLAM system was able to correct for these errors, resulting in more accurate and stable trajectory estimation.

To evaluate the performance of the SLAM systems in the outdoor environment, the same evo tool was used, and the results are shown in Figure 9e,f and Table 6. Compared to VINS-Fusion, the proposed system demonstrated better performance in terms of reducing the maximum error (Max) by an average of 14.9%, the mean error (Mean) by an average of 3.21%, the median error (Medium) by an average of 4.73%, the root means square error (RMSE) by an average of 7.04%, and the standard deviation error (Std) by an average of 22.5%. These results confirm the effectiveness of the proposed system in handling complex outdoor environments and improving the trajectory estimation accuracy.

In Experiment 3, we selected a building as a closed loop, as shown in Figure 10a. The ground truth is shown in Figure 10a,b. During the data collection process, we deliberately increased the camera’s motion speed and amplitude to simulate a scenario where VINS-Fusion may struggle to track the camera’s motion. We also introduced errors in the initial pose information to evaluate the robustness of the two SLAM systems. When the camera’s motion speed and amplitude were too high, it was difficult to track, leading to a decrease in the tracking performance of VINS-Fusion. Additionally, VINS-Fusion has a high demand for initial pose information, and the lack of accurate initial pose information can cause errors to accumulate gradually, leading to significant distortion during initialization. In contrast, the visual–inertial SLAM aided by VIMU can utilize foot sensor data to provide more information, and adjust the weight ratios of each data to achieve better tracking and curve fitting results, while reducing the demand for accurate initial pose information [23], as shown in Figure 10d,e.

The results of the experiment showed that our visual–inertial SLAM aided by VIMU outperformed the VINS-Fusion system. The VIMU-based SLAM system was able to handle the camera’s fast and large motion while maintaining accurate tracking, whereas VINS-Fusion struggled and suffered from a decrease in tracking performance. Moreover, the VIMU-based system showed better performance in initialization, achieving more accurate pose estimation than VINS-Fusion, even when the initial pose information was inaccurate.

Through experimental comparisons, we found that the visual–inertial SLAM aided by VIMU performed better than the VINS-Fusion method in these scenarios. Our system can solve the problem of too fast and too large camera motion during data collection and has better initialization performance. Additionally, our system can better adapt to the inaccuracy of initial pose information.

5. Conclusions

The system presented in this paper has demonstrated a significant improvement in localization and mapping accuracy in complex environments, characterized by challenging backgrounds and strong external light interference. To address this challenge, we propose a novel VIMU-aided visual–inertial SLAM system for walking scenarios. The proposed system leverages both visual and inertial measurements to improve the performance of the SLAM system. Furthermore, it incorporates a virtual inertial navigation module to compensate for distorted visual measurements caused by camera motion. Our experimental results demonstrate that the proposed system outperforms the VINS-Fusion system in terms of tracking accuracy and robustness, particularly in challenging scenarios that feature fast and large-scale camera motion, as well as inaccurate initial pose estimation. The proposed system also achieves a significant reduction in trajectory errors, as measured by the absolute trajectory error (ATE), compared to the VINS-Fusion system.

Our findings suggest that the proposed VIMU-aided visual–inertial SLAM system has considerable potential for practical applications in walking scenarios for both indoor and outdoor navigation. However, the incorporation of an additional sensor as an information source unavoidably leads to increased complexity of the fusion method network structure and computational cost. Therefore, to expand the application scenarios of this fusion effect, it is imperative to investigate more lightweight solutions that enable better performance on devices with limited computing power as a future research direction.

Author Contributions

Conceptualization, Y.C., J.Z. and J.D.; Methodology, Y.C.; Software, Y.C.; Validation, Y.C., J.Z. and J.D.; Formal analysis, J.D.; Investigation, J.Z.; Resources, Y.C.; Data curation, J.Z.; Writing—original draft preparation, Y.C. and J.Z.; Writing—review and editing, J.D. and T.S.; Visualization, Y.C.; Supervision, W.Q.; Project administration, W.Q.; Funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Zhang, C.; Lei, L.; Ma, X.; Zhou, R.; Shi, Z.; Guo, Z. Map Construction Based on LiDAR Vision Inertial Multi-Sensor Fusion. World Electr. Veh. J. 2021, 12, 261. [Google Scholar] [CrossRef]
Huang, B.; Zhao, J.; Liu, J. A survey of simultaneous localization and mapping with an envision in 6g wireless networks. arXiv 2019, arXiv:1909.05214. [Google Scholar]
Azzam, R.; Taha, T.; Huang, S.; Zweiri, Y. Feature-based visual simultaneous localization and mapping: A survey. SN Appl. Sci. 2020, 2, 1–24. [Google Scholar] [CrossRef] [Green Version]
Luo, W.; Xiong, Z.; Xing, L.; Duan, S.; Liu, J.; Yu, Y. An IMU/Visual Odometry Integrated Navigation Method Based on Measurement Model optimization. In Proceedings of the IEEE CSAA Guidance, Navigation and Control Conference (CGNCC), Xiamen, China, 10–12 August 2018; pp. 1–5. [Google Scholar]
Weiss, S.M. Vision Based Navigation for Micro Helicopters. Ph.D. Thesis, ETH Zurich, Zurich, Switzerland, 2012. [Google Scholar]
Mourikis, A.I.; Roumeliotis, S.I. A multi-state constraint Kalman filter for vision-aided inertial navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, Roma, Italy, 10–14 April 2007; pp. 3565–3572. [Google Scholar]
Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef] [Green Version]
Qin, T.; Cao, S.; Pan, J.; Shen, S. A general optimization-based framework for global pose estimation with multiple sensors. arXiv 2019, arXiv:1901.03642. [Google Scholar]
Tu, Z.; Chen, C.; Pan, X.; Liu, R.; Cui, J.; Mao, J. EMA-VIO: Deep Visual–Inertial Odometry With External Memory Attention. IEEE Sens. J. 2022, 22, 20877–20885. [Google Scholar] [CrossRef]
Aslan, M.F.; Durdu, A.; Sabanci, K. Visual-Inertial Image-Odometry Network (VIIONet): A Gaussian process regression-based deep architecture proposal for UAV pose estimation. Meas. J. Int. Meas. Confed. 2022, 194, 111030. [Google Scholar] [CrossRef]
Aslan, M.F.; Durdu, A.; Yusefi, A.; Yilmaz, A. HVIOnet: A deep learning based hybrid visual–inertial odometry approach for unmanned aerial system position estimation. Meas. Neural Netw. 2022, 155, 461–474. [Google Scholar] [CrossRef] [PubMed]
Qian, W.; Zhu, Y.; Jin, Y.; Yang, J.; Qi, P.; Wang, Y.; Ma, Y.; Ji, H. A Pedestrian Navigation Method Based on Construction of Adapted Virtual Inertial Measurement Unit Assisted by Gait Type Classification. IEEE Sens. J. 2021, 21, 15258–15268. [Google Scholar] [CrossRef]
Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. 2014. Available online: http://arxiv.org/abs/1409.1259 (accessed on 20 January 2023).
Ordóñez, F.J.; Roggen, D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 2016, 16, 115. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Wang, J.; Hu, X. Convolutional Neural Networks With Gated Recurrent Connections. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3421–3435. [Google Scholar] [CrossRef] [PubMed]
Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. 2016. Available online: http://arxiv.org/abs/1608.06993 (accessed on 20 January 2023).
Xiong, L.; Zhang, L.; Huang, X.; Yang, X.; Huang, W.; Zeng, H.; Tang, H. DCAST: A Spatiotemporal Model with DenseNet and GRU Based on Attention Mechanism. Math. Probl. Eng. 2021, 2021, 8867776. [Google Scholar] [CrossRef]
Liu, J.Y. Theory and Application of Navigation Systeem; Northwestern Polytechnical University Press: Xi’an, China, 2010. [Google Scholar]
Luo, Y.; Liu, Y.; Guo, C.; Liu, J. The Unified Mathematical Framework for IMU Preintegration in Inertial-Aided Navigation System. 2021. Available online: http://arxiv.org/abs/2111.09100 (accessed on 20 January 2023).
Qin, T.; Pan, J.; Cao, S.; Shen, S. A General Optimization-based Framework for Local Odometry Estimation with Multiple Sensors. 2019. Available online: http://arxiv.org/abs/1901.03638 (accessed on 20 January 2023).
Fodor, K.; Viktor, R. Validation of ORB-SLAM3 and VINS-Mono with Low-Cost Sensor Setup in Outdoor Environment. In Proceedings of the IEEE 21st World Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl’any, Slovakia, 19–21 January 2023; pp. 27–32. [Google Scholar] [CrossRef]
Sun, J.; Song, F.; Ji, L. VINS-Mask: A ROI-mask Feature Tracker for Monocular Visual-inertial System. In Proceedings of the 2022 International Conference on Automation, Robotics and Computer Engineering (ICARCE), Wuhan, China, 16–17 December 2022; pp. 1–5. [Google Scholar] [CrossRef]
He, M.; Rajkumar, R.R. Extended VINS-Mono: A Systematic Approach for Absolute and Relative Vehicle Localization in Large-Scale Outdoor Environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 4861–4868. [Google Scholar] [CrossRef]

Figure 1. Detailed implementation flowchart of visual–inertial SLAM based on virtual inertial navigation assistance.

Figure 2. Flowchart of constructing virtual inertial sensors.

Figure 3. DeepConvGRU neural network model structure diagram.

Figure 4. DenseNet-GRU hybrid model structure diagram.

Figure 5. Flowchart of Zero-Velocity Update System.

Figure 6. Flowchart of localization and mapping process of VINS.

Figure 7. Flowchart for Multi-Sensor Information Fusion.

Figure 8. (a) Real scene of the indoor experimental environment. The red line is the experimental trajectory; (b) comparison of the localization performance between VINS and VINS-VIMU; (c) localization performance of VINS-VIMU; (d) localization performance of VINS-Fusion.

Figure 9. (a,b) Skewed image of Ground Conditions in Experiment 2. (c) Real scene of the outdoor experimental environment; (d) comparison of the localization performance between VINS-VIMU and VINS-Fusion, using GPS data collected simultaneously as the actual route benchmark; (e) localization performance of VINS-VIMU; (f) localization performance of VINS-Fusion.

Figure 10. (a) Skewed image of ground conditions in Experiment 3; (b) real scene of the outdoor experimental environment; (c) comparison of the localization performance between VINS-VIMU and VINS-Fusion, using GPS data collected simultaneously as the actual route benchmark; (d) localization performance of VINS-VIMU; (e) localization performance of VINS-Fusion.

Table 1. Classification accuracy of DeepConvGRU model gait types.

	Walking	Running	Uphill	Downhill	Classification Rate (%)
Walking	343	2	2	1	98.0%
Running	4	341	1	1	97.4%
Uphill	1	3	340	1	97.1%
Downhill	2	1	1	342	97.7%

Table 2. Fitting output error of VIMU.

	Mw (°/s)	Sw (°/s)	Ma (m/s²)	Sa (m/s²)
Walking	0.0033	0.0046	0.076	0.132
Running	0.0061	0.0072	0.102	0.183
Uphill	0.0043	0.0049	0.079	0.137
Downhill	0.0051	0.0050	0.082	0.141

Table 3. Fitting output error of zero-velocity update of VIMU.

	Mw (°/s)	Sw (°/s)	Ma (m/s²)	Sa (m/s²)
Walking	0.0009	0.0010	0.046	0.088
Running	0.0021	0.0037	0.072	0.121
Uphill	0.0012	0.0016	0.045	0.083
Downhill	0.0011	0.0019	0.050	0.092

Table 4. Comparison of localization results on the KITTI dataset.

Seq-	VINS-Fusion		VINS-Fusion		Error
Seq-	t_rel (%)	r_rel (°)	t_rel (%)	r_rel (°)	t_rel (%)	r_rel (°)
09	40.39	2.15	10.27	1.94	0.75	0.10
10	18.62	2.73	8.58	2.51	0.54	0.08
average	29.51	2.44	18.85	2.23	0.36	0.09

Table 5. Error comparison between VINS-VIMU and VINS-Fusion based on GPS baseline.

Error	VINS-VIMU	VINS
Max	0.0009	0.0010
Mean	0.0021	0.0037
Median	0.0011	0.0019
RMSE	0.007654	0.033183
std	0.00702	0.03211

Table 6. Error comparison between VINS-VIMU and VINS-Fusion based on GPS baseline.

Error	VINS-VIMU	VINS
Max	2.228963	2.619871
Mean	1.273015	1.315248
Median	1.457812	1.530268
RMSE	1.381425	1.486064
std	0.536441	0.691744

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cai, Y.; Qian, W.; Zhao, J.; Dong, J.; Shen, T. Visual–Inertial Navigation System Based on Virtual Inertial Sensors. Appl. Sci. 2023, 13, 7248. https://doi.org/10.3390/app13127248

AMA Style

Cai Y, Qian W, Zhao J, Dong J, Shen T. Visual–Inertial Navigation System Based on Virtual Inertial Sensors. Applied Sciences. 2023; 13(12):7248. https://doi.org/10.3390/app13127248

Chicago/Turabian Style

Cai, Yunpiao, Weixing Qian, Jiaqi Zhao, Jiayi Dong, and Tianxiao Shen. 2023. "Visual–Inertial Navigation System Based on Virtual Inertial Sensors" Applied Sciences 13, no. 12: 7248. https://doi.org/10.3390/app13127248

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Visual–Inertial Navigation System Based on Virtual Inertial Sensors

Abstract

1. Introduction

2. System Overview

3. Virtual Inertial Navigation System Based on Deep Learning

3.1. Construction of Virtual Inertial Sensors Based on Deep Learning

3.2. Zero-Velocity Update-Based Optimization of Virtual Inertial Sensor Data

3.3. Performance Verification of Zero-Velocity Update of VIMU

4. Analysis and Experiments of Visual–Inertial Odometry Optimization

4.1. Analysis of VINS Odometry Optimization

4.2. KITTI Dataset Test Experiment

4.3. Localization and Mapping Experiments in Degenerated Environments

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI