1. Introduction
To address the inherent limitations of individual sensors, researchers have shifted their focus towards the development of multi-sensor fusion localization systems. These systems capitalize on the strengths of diverse sensors to improve the precision and robustness of localization. Consequently, there is a growing emphasis on applying sensor fusion methods to determine vehicle positions. Currently, simultaneous localization and mapping (SLAM) stands as the prevailing solution for autonomous vehicle localization, fusing GNSS/RTK, LiDAR, LiDAR-based prior maps, and IMU measurements for high-quality positioning [1,2]. However, the widespread adoption of LiDAR sensors in autonomous vehicles is impeded by their high cost.
In contrast, visual odometry (VO) and visual SLAM have garnered significant attention due to their advantages, including low cost, compact size, and straightforward hardware configuration. Nevertheless, purely visual approaches lack robustness in the presence of sparsely textured areas, motion blur, sharp turns, and lighting variations. Therefore, an inertial measurement unit (IMU) is integrated to provide short-term motion constraints and the absolute scale of motion. Systems tightly coupling visual observations and IMU measurements are denoted as visual–inertial navigation systems (VINSs) and can estimate the six-degree-of-freedom (DOF) pose of a vehicle. Even under rapid motion or substantial changes in lighting, such systems consistently achieve high-precision and robust localization results [3,4,5,6,7].
The most direct approach to fusing visual and inertial measurements is a filter-based loosely coupled framework [8,9]. However, this straightforward strategy overlooks the correlations between different sensor data, leading to suboptimal localization accuracy. Consequently, tightly coupled fusion methods have been introduced; the tight coupling of visual and inertial measurements falls into filter-based and optimization-based categories. Filter-based tightly coupled fusion methods, such as [3,6,10], concurrently optimize the states of the camera and IMU. These methods typically restrict the number of landmarks, preserving only the most recently detected features in the state vector to keep the problem complexity manageable. Nevertheless, they commonly face a shared challenge: visual–inertial navigation systems (VINSs) are nonlinear systems, necessitating the linearization of nonlinear measurements before processing, which may introduce significant errors [11]. In contrast, optimization-based methods convert sensor fusion into a graph-based nonlinear least squares problem, delivering superior accuracy compared to filter-based methods, at the cost of increased computational time. Consequently, effective IMU pre-integration techniques are widely employed in optimization-based tightly coupled visual–inertial odometry (VIO) methods [4,5,12], as well as in [7,13,14], where manifolds are used instead of Euler angles to parameterize the rotation group for enhanced computational efficiency. Optimization-based methods typically optimize the most recent states within a limited-size sliding window while marginalizing past states and measurements [5]. Ref. [5] stands out as one of the most popular open-source VIO systems, exhibiting excellent performance on the KITTI dataset. However, relying solely on local relative pose estimation is insufficient for autonomous vehicles, prompting the need for an absolute localization method that maps local state estimates into a global coordinate system.
Given that the Global Navigation Satellite System (GNSS) offers absolute positioning information in the Earth coordinate system, a natural approach is to anchor local positioning results to the absolute position data from GNSS for precise global localization. Notably, ref. [15] devised a tightly coupled visual–inertial odometry (VIO) system augmented by intermittent GNSS measurements, yielding consistent global localization results; this approach concurrently addresses spatiotemporal sensor calibration and state initialization. Additionally, ref. [16] introduced an innovative filter-based estimator that amalgamates GNSS measurements with visual–inertial data and concurrently estimates the extrinsic rotation between GNSS and VIO results online, achieving robust global localization. Beyond filter-based methods, ref. [17] proposed a sliding window optimization-based strategy that positions the vehicle in the global coordinate system by fusing long-range stereo vision, inertial integration, and limited GNSS information. Ref. [18] introduced a tightly coupled framework that integrates visual–inertial odometry with global positioning measurements. Recent advancements, as demonstrated by the methods in [19,20], tightly couple visual–inertial SLAM with raw GNSS measurements, yielding globally consistent localization information. However, tightly coupled methods suffer from complex initialization and limited scalability: integrating new sensors necessitates algorithm redesign, making system expansion difficult. Ref. [21] presents a loosely coupled sensor fusion framework capable of attaining locally accurate and globally drift-free attitude estimation, showcasing a commendable performance in practical scenarios. However, it is noteworthy that this system does not explicitly initialize the coordinate transformation from East North Up (ENU) to visual–inertial odometry (VIO), and it overlooks the potential issues that may arise from the failure of a sensor.
This study introduces the integration of Global Navigation Satellite System (GNSS) signals to enhance the performance of visual–inertial navigation systems (VINSs) for achieving locally accurate and globally drift-free attitude estimation. The proposed approach is easy to extend and involves a two-stage initialization method to acquire initial poses at both local and global levels. Additionally, an adaptive fusion mechanism is introduced to ensure effective global pose estimation even in the event of VIO or GNSS failure. Algorithm validation is conducted in simulation environments and challenging urban road scenarios. The primary contributions of this work include the following:
Two-Stage Initialization Method: The introduction of a two-stage initialization method leveraging inertial, camera, and asynchronous GPS measurement data. This method facilitates a coarse estimation of visual–inertial odometry (VIO) initialization parameters and the transformation matrix between the VIO coordinate system and the global East North Up (ENU) system when the vehicle is stationary. Further optimization of relevant initialization parameters occurs as the vehicle is in motion.
Adaptive Fusion Mechanism: The implementation of an adaptive fusion mechanism that incorporates anomaly detection for VIO and GPS information. In the event of VIO or GPS failure, this mechanism enables the system to continue estimating the global pose of the vehicle, ensuring robust and continuous vehicle localization.
Algorithm Validation in Simulations and Challenging Urban Roads: The validation of the algorithm’s effectiveness in both simulation environments and challenging urban roads. The validation process includes qualitative and quantitative analyses. Through these experiments, the study demonstrates the satisfactory performance of the proposed method across different scenarios, highlighting its robustness and accuracy.
2. System Overview
The structure of our proposed system is illustrated in Figure 1. The system takes images, inertial measurements, and GNSS data as inputs. High-frequency IMU measurements between two consecutive images are pre-integrated. GNSS measurements arrive at a low and unstable frequency; when new GNSS data are obtained, their positioning status is checked first, and anomalous fixes as well as GNSS data far from the image acquisition time are discarded to avoid large positioning errors. At the initial moment, when the vehicle is stationary, IMU data are utilized to estimate rough accelerometer and gyroscope biases, the direction of gravity, and the coordinate transformation from ENU to VIO. The transition from a static to a dynamic state is identified by analyzing the disparity between consecutive images. Upon detecting vehicle motion, the initialization parameters estimated during the stationary phase are optimized to conclude the VIO initialization. Following initialization, the VIO local estimator generates local pose estimates. These local pose estimates, in conjunction with GNSS information, serve as inputs to the global estimator. A nonlinear optimization process then produces the final six-degree-of-freedom (DOF) global pose results.
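As a concrete illustration, the GNSS gating step described above might look as follows. This is a minimal sketch: the `GnssFix` fields, the function name, and the threshold values are illustrative assumptions, not taken from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class GnssFix:
    t: float          # timestamp (s)
    enu: tuple        # position in the ENU frame (m)
    status: int       # 0 = invalid fix
    cov_trace: float  # trace of the position covariance (m^2)

def gate_gnss(fix, image_time, max_dt=0.05, max_cov_trace=4.0):
    """Discard anomalous fixes and fixes far from the image acquisition time.

    Threshold values are illustrative; the paper does not specify them.
    """
    if fix.status == 0:                   # invalid positioning state
        return False
    if fix.cov_trace > max_cov_trace:     # overly uncertain fix
        return False
    if abs(fix.t - image_time) > max_dt:  # too far from the image timestamp
        return False
    return True
```

A fix that passes all three checks is forwarded to the global estimator; everything else is dropped before optimization.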
3. Methods
This section begins with the definitions of the coordinate systems. The coordinate frames involved in this paper are shown in Figure 2, including the sensor frames: the camera frame {C}, the IMU frame {B} (vehicle frame), and the GNSS frame {A}. The phase center of the GNSS antenna serves as the coordinate origin of {A}. {W} is the reference frame for local odometry, with the gravity vector aligned with its z-axis and its origin set at the vehicle's starting point. {G} represents the global East-North-Up (ENU) frame, serving as the reference frame for global poses, with an origin identical to that of the local reference frame {W}. The following notation is used. We define $(\cdot)^G$ and $(\cdot)^W$ to represent matrices (or vectors) expressed in {G} and {W}, respectively. $C_k$, $B_k$, and $A_k$ are the camera, IMU, and GNSS data frames when capturing the $k$th image. $\mathbf{R}_A^B$ is the rotation matrix from frame {A} to frame {B}, with the corresponding Hamilton quaternion denoted as $\mathbf{q}_A^B$. $\mathbf{p}_A^B$ and $\mathbf{v}_A^B$ represent the position and velocity vectors of {A} in {B}, and $\mathbf{T}_A^B$ is the homogeneous expression of the transformation matrix from {A} to {B}. Extrinsic parameters between sensors are employed to transform measurement data from different frames into a unified coordinate system. The camera-IMU extrinsic parameters are $\mathbf{R}_C^B$ and $\mathbf{p}_C^B$. The GNSS-IMU extrinsic parameter $\mathbf{p}_A^B$ is known as the lever arm; all involved extrinsic parameters are calibrated offline. $\times$ denotes the vector cross-product, $\otimes$ denotes quaternion multiplication, and finally, $\hat{(\cdot)}$ is used to represent measurement data containing noise.
3.1. System Preliminaries
3.1.1. IMU Measurements and Pre-Integration Theory
IMU sensors typically consist of a 3-axis gyroscope and a 3-axis accelerometer, allowing for the measurement of angular velocity and acceleration of the inertial sensor (i.e., body frame) with respect to the inertial frame. IMU measurements combine the force countering gravity and the platform dynamics, subject to accelerometer bias, gyroscope bias, and additive noise. The raw measurements from the gyroscope and accelerometer, $\hat{\boldsymbol{\omega}}_t$ and $\hat{\mathbf{a}}_t$, are given by Equation (1):

$$\hat{\mathbf{a}}_t = \mathbf{a}_t + \mathbf{b}_{a_t} + \mathbf{R}_W^{B_t}\,\mathbf{g}^W + \mathbf{n}_a, \qquad \hat{\boldsymbol{\omega}}_t = \boldsymbol{\omega}_t + \mathbf{b}_{\omega_t} + \mathbf{n}_\omega \tag{1}$$

Assuming that the additive noise in the accelerometer and gyroscope measurements is Gaussian white noise, $\mathbf{n}_a \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\sigma}_a^2)$, $\mathbf{n}_\omega \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\sigma}_\omega^2)$. The accelerometer bias and gyroscope bias are modeled as random walks by Equation (2), with their derivatives being Gaussian white noise, $\mathbf{n}_{b_a} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\sigma}_{b_a}^2)$, $\mathbf{n}_{b_\omega} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\sigma}_{b_\omega}^2)$:

$$\dot{\mathbf{b}}_{a_t} = \mathbf{n}_{b_a}, \qquad \dot{\mathbf{b}}_{\omega_t} = \mathbf{n}_{b_\omega} \tag{2}$$
To avoid recomputing integrals when linearization points change, we adhere to the approach outlined in [5]. Given the biases, we compute the relative motion increment between two consecutive keyframes using Equation (3). Importantly, this increment remains independent of the attitude and velocity at time $t_k$:

$$\boldsymbol{\alpha}_{B_{k+1}}^{B_k} = \iint_{t\in[t_k,t_{k+1}]} \mathbf{R}_t^{B_k}\left(\hat{\mathbf{a}}_t - \mathbf{b}_{a_t}\right)dt^2, \quad \boldsymbol{\beta}_{B_{k+1}}^{B_k} = \int_{t\in[t_k,t_{k+1}]} \mathbf{R}_t^{B_k}\left(\hat{\mathbf{a}}_t - \mathbf{b}_{a_t}\right)dt, \quad \boldsymbol{\gamma}_{B_{k+1}}^{B_k} = \int_{t\in[t_k,t_{k+1}]} \frac{1}{2}\boldsymbol{\Omega}\left(\hat{\boldsymbol{\omega}}_t - \mathbf{b}_{\omega_t}\right)\boldsymbol{\gamma}_t^{B_k}\,dt \tag{3}$$

where $\boldsymbol{\Omega}(\boldsymbol{\omega})$ is defined by Equation (4):

$$\boldsymbol{\Omega}(\boldsymbol{\omega}) = \begin{bmatrix} -\lfloor\boldsymbol{\omega}\rfloor_\times & \boldsymbol{\omega} \\ -\boldsymbol{\omega}^T & 0 \end{bmatrix} \tag{4}$$
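The pre-integration of Equation (3) can be approximated numerically by accumulating bias-corrected samples in the first body frame. The following is a minimal Euler-integration sketch; the function name and the use of a rotation matrix in place of the quaternion increment $\boldsymbol{\gamma}$ are our simplifications, not the paper's implementation.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix of a 3-vector (cross-product matrix)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def preintegrate(acc, gyr, dt, ba, bg):
    """Accumulate the relative motion increments of Equation (3).

    acc, gyr: (N, 3) raw accelerometer / gyroscope samples between keyframes.
    Returns the position increment (alpha), velocity increment (beta), and
    relative rotation (here a matrix R rather than the quaternion gamma),
    all expressed in the first body frame, independent of the global state.
    """
    alpha = np.zeros(3)   # position increment
    beta = np.zeros(3)    # velocity increment
    R = np.eye(3)         # rotation from the current body frame to the first
    for a, w in zip(acc, gyr):
        a_corr = a - ba                                # remove accel bias
        w_corr = w - bg                                # remove gyro bias
        alpha = alpha + beta * dt + 0.5 * (R @ a_corr) * dt**2
        beta = beta + (R @ a_corr) * dt
        # first-order rotation update (exponential map approximation)
        R = R @ (np.eye(3) + skew(w_corr * dt))
    return alpha, beta, R
```

Because all quantities are expressed in the first body frame, changing the linearization point of the keyframe states does not require re-running this loop, which is precisely the motivation for pre-integration.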
3.1.2. Monocular Vision
Applying the monocular pinhole camera model, upon receiving a new image, features are detected using Harris corner detection [22]. Subsequently, the KLT sparse optical flow algorithm [23] is employed to track these newly detected features. The mapping relationship between landmark points and features is expressed through Equation (5):

$$\hat{\mathbf{p}}_k = \mathbf{K}\,\mathbf{T}_W^{C_k}\,\mathbf{P}^W \tag{5}$$

where $\mathbf{K}$ denotes the camera intrinsic calibration matrix, $\mathbf{T}_W^{C_k}$ represents the inverse of the camera pose, $\hat{\mathbf{p}}_k$ represents the projection of a landmark in the three-dimensional world onto the camera plane, and $\mathbf{P}^W$ denotes the three-dimensional position of the landmark in the local reference frame {W}.
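Equation (5) amounts to a standard pinhole projection. A minimal sketch, with illustrative intrinsic parameters:

```python
import numpy as np

def project(K, R_cw, t_cw, P_w):
    """Project a 3-D landmark P_w (frame {W}) onto the image plane (Equation (5)).

    R_cw, t_cw: the inverse of the camera pose, i.e. the transform {W} -> {C}.
    Returns pixel coordinates (u, v).
    """
    P_c = R_cw @ P_w + t_cw   # landmark in the camera frame
    uvw = K @ P_c             # apply the intrinsic matrix
    return uvw[:2] / uvw[2]   # perspective division

# illustrative intrinsics: focal length 500 px, principal point (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# camera at the {W} origin looking along +z: identity rotation, zero translation
u, v = project(K, np.eye(3), np.zeros(3), np.array([0.0, 0.0, 5.0]))
# a point on the optical axis lands at the principal point (320, 240)
```

The visual residual used later compares such a predicted projection against the tracked feature location.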
3.1.3. GNSS Measurements
Taking the first GNSS position measurement as the origin of the global reference frame {G}, the GNSS observations at time
are represented by a general Equation (6):
where
is the function connecting the IMU frame {B} and GNSS measurements, and
represents the measurement noise. In fact, GNSS observations can be expressed using Equation (7):
where
is the position of the GNSS antenna phase center in the global reference frame {G} at time
.
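The noise-free part of Equation (7) simply offsets the body position by the rotated lever arm. A small sketch (names and numeric values are illustrative):

```python
import numpy as np

def antenna_position(p_b_g, R_b_g, lever_arm):
    """Predicted GNSS antenna position in {G} (noise-free part of Equation (7)).

    p_b_g:     IMU (body) position in the global frame {G}
    R_b_g:     rotation from the body frame {B} to {G}
    lever_arm: antenna phase-center position in {B} (the GNSS-IMU extrinsic)
    """
    return p_b_g + R_b_g @ lever_arm

# body yawed 90 degrees about z: the body x-axis points along global y
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
p_ant = antenna_position(np.array([10.0, 0.0, 0.0]), R, np.array([1.0, 0.0, 0.5]))
# the 1 m forward lever arm now points along global y: antenna at (10, 1, 0.5)
```

The difference between this prediction and the measured antenna position is exactly the GNSS residual used in the global estimator.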
3.1.4. Nonlinear Optimization
Vehicle localization can be viewed as a state estimation problem, which can be transformed into a maximum likelihood estimation (MLE) problem. The MLE is composed of the joint probability distribution of vehicle states over a certain period. Under the assumption that all measurements are independent, this problem is typically formulated as Equation (8):

$$\hat{\mathcal{X}} = \arg\max_{\mathcal{X}} \prod_{k} p\!\left(\mathbf{z}_k \mid \mathcal{X}\right) \tag{8}$$

where $\mathbf{z}_k$ represents measurements from the camera, IMU, or other sensors. Assuming that sensor measurements follow a Gaussian distribution, $p(\mathbf{z}_k \mid \mathcal{X}) \sim \mathcal{N}\!\left(h_k(\mathcal{X}), \boldsymbol{\Sigma}_k\right)$, the negative log-likelihood of Equation (8) can be expressed as Equation (9):

$$\hat{\mathcal{X}} = \arg\min_{\mathcal{X}} \sum_{k} \left\|\mathbf{z}_k - h_k(\mathcal{X})\right\|_{\boldsymbol{\Sigma}_k}^2 \tag{9}$$

The Mahalanobis norm is defined as $\|\mathbf{r}\|_{\boldsymbol{\Sigma}}^2 = \mathbf{r}^T\boldsymbol{\Sigma}^{-1}\mathbf{r}$, and the sensor model $h_k(\cdot)$ is defined by Equations (1), (5) and (6). The state estimation is then transformed into an iteratively optimized nonlinear least squares problem, where vertices represent the variables to be optimized and edges denote error terms. A graph corresponding to any nonlinear least squares problem can be constructed.
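To make the reduction from Equation (8) to Equation (9) concrete, a toy Gauss-Newton iteration can be sketched as follows. This is an illustration only, assuming unit measurement covariances; the actual system is solved with a full nonlinear solver.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=10):
    """Minimize sum ||r_k(x)||^2 by repeated linearization (cf. Equation (9))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        # normal equations: (J^T J) dx = -J^T r
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
    return x

# toy problem: estimate a 2-D position from three direct measurements
z = np.array([[1.1, 2.0], [0.9, 2.1], [1.0, 1.9]])
residual = lambda x: (x - z).ravel()            # r_k = x - z_k
jacobian = lambda x: np.tile(np.eye(2), (3, 1))
x_hat = gauss_newton(residual, jacobian, [0.0, 0.0])
# for this linear problem the MLE is the mean of the measurements, (1.0, 2.0)
```

With unit covariances, maximizing the Gaussian likelihood is equivalent to this least-squares fit; non-unit covariances would enter as the Mahalanobis weighting of each residual block.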
3.2. Initialization
To achieve an improved localization performance, initializing the system is essential. A two-stage initialization method is proposed to fully exploit measurement data from the vehicle’s startup to the commencement of motion. In the stationary phase, initialization involves estimating gravity direction, gyroscope biases, accelerometer biases, and a rough estimate of the absolute pose. As the vehicle begins to move, the parameters estimated during the stationary phase are rapidly optimized, and scale is restored, thus completing the initialization process.
3.2.1. Rough Estimation of Static Parameters
Initially, stationary-state data are used to estimate the IMU biases; during this phase, the vehicle's initial velocity and position are both zero. If the parallax between consecutive images is less than a threshold, the vehicle is considered to be stationary. The gravity acceleration measured in the {B} frame is obtained from Equation (10):

$$\mathbf{g}^B = \frac{1}{n}\sum_{k=1}^{n}\hat{\mathbf{a}}_k \tag{10}$$

where $\hat{\mathbf{a}}_k$ is the $k$th IMU accelerometer observation and $n$ is the total number of IMU observations obtained while the vehicle is stationary. The projection of the x-axis direction vector of the world frame in the IMU frame is obtained using Equation (11), and the projection of the y-axis direction vector of the world frame in the IMU frame is obtained using Equation (12). Therefore, the rotation matrix from the VIO frame {W} to the IMU frame {B} is obtained using Equation (13), and the biases of the accelerometer and gyroscope are calculated using Equation (14). Based on double-vector attitude determination [24], a rough estimate of the coordinate transformation from ENU to VIO is obtained, completing the static initialization phase.
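Under the stationary assumption, the averaging in Equation (10) and rough bias estimates can be sketched as below. Note that the accelerometer-bias formula here is only a plausible stand-in (deviation from the nominal gravity magnitude), since the detailed form of Equation (14) is not reproduced here.

```python
import numpy as np

GRAVITY = 9.81  # assumed nominal gravity magnitude (m/s^2)

def static_init(acc_samples, gyr_samples):
    """Rough stationary-phase estimates (cf. Equations (10)-(14)).

    With the vehicle at rest, the averaged accelerometer reading measures
    gravity in {B}, and the averaged gyroscope reading is the gyroscope bias.
    """
    g_b = acc_samples.mean(axis=0)   # gravity in the body frame, Equation (10)
    bg = gyr_samples.mean(axis=0)    # gyroscope bias estimate
    # accelerometer bias sketch: deviation from the nominal gravity magnitude
    ba = g_b - GRAVITY * g_b / np.linalg.norm(g_b)
    return g_b, ba, bg
```

The estimated gravity direction in {B}, together with a second reference vector, feeds the double-vector attitude determination that yields the rough ENU-to-VIO transformation.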
3.2.2. Dynamic Optimization
When there is stable feature tracking and sufficient disparity (exceeding 20 pixels) between the latest image and the previously stored images in the sliding window, the method proposed in [5] is employed. Initially, a visual reconstruction is conducted, followed by visual–inertial alignment. The first camera frame, denoted as {C$_0$}, is set as the reference. Subsequently, the rough estimates obtained from Equations (10), (13), and (14) are utilized as initial values for optimization. The optimization process begins with the refinement of gyroscope biases, followed by the initialization of the velocity, gravity vector, and scale factor. In this process, accelerometer biases are simultaneously considered, defining the system state as Equation (15):

$$\mathcal{X}_I = \left[\mathbf{v}_{B_0}, \mathbf{v}_{B_1}, \ldots, \mathbf{v}_{B_n}, \mathbf{g}^{C_0}, s\right] \tag{15}$$

where $\mathbf{v}_{B_k}$ represents the velocity of {B} at the time of capturing the $k$th image, $\mathbf{g}^{C_0}$ is the gravity vector in the reference frame, and $s$ scales the normalized reconstruction to metric units. By solving the system described in Equation (16), the velocities of {B} within the window, the gravity vector in the visual reference frame {C$_0$}, and the scale parameter are obtained.

After further refinement, the gravity vector $\mathbf{g}^{C_0}$ is rotated to align with the z-axis of the {W} frame, resulting in the calculation of $\mathbf{R}_{C_0}^W$. All variables are then adjusted to the {W} frame to complete the initialization.
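The final alignment step, rotating the refined gravity vector onto the z-axis of {W}, can be sketched with the Rodrigues formula. This is a generic construction for aligning two vectors, not necessarily the paper's exact parameterization.

```python
import numpy as np

def align_gravity_to_z(g_c0):
    """Rotation that maps the estimated gravity direction onto the z-axis.

    After the gravity vector in {C0} is refined, the {W} frame is defined so
    that gravity lies along its z-axis; this returns that aligning rotation.
    """
    u = g_c0 / np.linalg.norm(g_c0)       # unit gravity direction in {C0}
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(u, z)                     # rotation axis (unnormalized)
    s, c = np.linalg.norm(v), np.dot(u, z)
    if s < 1e-12:                          # already aligned or anti-parallel
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    # Rodrigues formula for the rotation taking u onto z
    return np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)

g = np.array([0.2, -0.1, 9.8])
R = align_gravity_to_z(g)
# R @ g now points along +z with the same magnitude
```

Applying this rotation to all poses, velocities, and landmarks expresses the whole reconstruction in the gravity-aligned frame {W}.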
3.3. VIO Local Estimator
For local pose estimation, an existing visual–inertial odometry (VIO) algorithm is employed. Many excellent open-source VIO algorithms are available, and this paper utilizes the algorithm from [5]. Within the sliding window, the algorithm estimates the poses of several IMU frames along with the depth of visual features. The state is defined as Equation (17):

$$\mathcal{X} = \left[\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_n, \lambda_0, \lambda_1, \ldots, \lambda_m\right], \qquad \mathbf{x}_k = \left[\mathbf{p}_{B_k}^W, \mathbf{v}_{B_k}^W, \mathbf{q}_{B_k}^W, \mathbf{b}_a, \mathbf{b}_\omega\right] \tag{17}$$

where the $k$th IMU state $\mathbf{x}_k$ is composed of the position $\mathbf{p}_{B_k}^W$, velocity $\mathbf{v}_{B_k}^W$, orientation $\mathbf{q}_{B_k}^W$, gyroscope bias $\mathbf{b}_\omega$, and accelerometer bias $\mathbf{b}_a$, representing the state of the IMU center relative to the local reference frame {W}. The orientation is represented using a quaternion.

The first IMU pose serves as the reference frame. When a feature is first observed in a camera frame, it is parameterized using its inverse depth $\lambda$. The state estimation can then be represented as Equation (18):

$$\min_{\mathcal{X}}\left\{\left\|\mathbf{r}_p - \mathbf{H}_p\mathcal{X}\right\|^2 + \sum_{k}\left\|\mathbf{r}_B\!\left(\hat{\mathbf{z}}_{B_{k+1}}^{B_k}, \mathcal{X}\right)\right\|_{\boldsymbol{\Sigma}_{B_{k+1}}^{B_k}}^2 + \sum_{(l,j)}\rho\!\left(\left\|\mathbf{r}_C\!\left(\hat{\mathbf{z}}_l^{C_j}, \mathcal{X}\right)\right\|_{\boldsymbol{\Sigma}_l^{C_j}}^2\right)\right\} \tag{18}$$

where $\mathbf{r}_B(\cdot)$ and $\mathbf{r}_C(\cdot)$ represent the inertial and visual residuals, respectively. The prior $\{\mathbf{r}_p, \mathbf{H}_p\}$ contains information about the marginalized state history. The term $\rho(\cdot)$ denotes a robust kernel function [25], which effectively suppresses the influence of outliers. For a detailed explanation, refer to [5]. The VIO local estimator achieves precise real-time 6-DoF local pose estimation.
3.4. GNSS/VIO Global Estimator
3.4.1. Pose Graph Structure
The global pose estimation problem can be represented using the pose graph in
Figure 3. Each VIO node represents the estimated local six-degree-of-freedom pose. Given the high accuracy of visual–inertial odometry in the short term, we introduce the relative pose between these two nodes as a constraint to the pose graph. In instances where a node is associated with GNSS measurements, GNSS constraints are also included as global constraints in the pose graph. This approach proves effective in mitigating the impact of cumulative errors, considering the non-accumulative nature of GNSS measurements.
3.4.2. Adaptive Fusion Mechanism
Based on nonlinear optimization theory, the global state is defined as Equation (19):

$$\mathcal{X}^G = \left[\mathbf{x}_0^G, \mathbf{x}_1^G, \ldots, \mathbf{x}_n^G\right], \qquad \mathbf{x}_k^G = \left[\mathbf{p}_{B_k}^G, \mathbf{q}_{B_k}^G\right] \tag{19}$$

where $\mathbf{p}_{B_k}^G$ and $\mathbf{q}_{B_k}^G$ are the position and orientation in the global reference frame {G}, respectively.

Due to the intricacies of autonomous driving scenarios, the local visual–inertial odometry (VIO) system may occasionally encounter tracking failures and restart, and GNSS signals may at times be limited. Therefore, an adaptive mechanism is employed. When the local VIO system fails, reliance is placed on GNSS signals; similarly, when GNSS signals are restricted, local VIO measurements are utilized. Upon recovery of the local VIO system, VIO factors are reintroduced into the pose graph, and when GNSS signals recover, global factors are reintroduced into the pose graph. The adaptive fusion optimization problem for GNSS and VIO factors can therefore be represented as Equation (20):

$$\min_{\mathcal{X}^G}\left\{a\sum_{k}\left\|\mathbf{r}_V\!\left(\hat{\mathbf{z}}_{k+1}^{k}, \mathcal{X}^G\right)\right\|_{\boldsymbol{\Sigma}_V}^2 + b\sum_{k}\left\|\mathbf{r}_A\!\left(\hat{\mathbf{z}}_k^A, \mathcal{X}^G\right)\right\|_{\boldsymbol{\Sigma}_A}^2\right\} \tag{20}$$

where $a$ is the VIO adaptive coefficient, with a value of 1 when VIO is normal and a value of 0 otherwise. Similarly, $b$ is the GNSS adaptive coefficient, with a value of 1 in the normal state and a value of 0 otherwise.
The detection of local visual–inertial odometry (VIO) failure is carried out by measuring the relative translation and rotation between two consecutive frames. If the change between these frames exceeds a predefined threshold, VIO is considered to have failed, and the VIO adaptive coefficient is set to 0; when VIO is normal, it is set to 1.
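A sketch of this consecutive-frame check follows; the threshold values are illustrative, since the paper does not report the ones it uses.

```python
import numpy as np

def vio_failed(p_prev, q_prev, p_curr, q_curr,
               max_trans=2.0, max_rot_deg=30.0):
    """Flag VIO failure from the relative motion between consecutive frames.

    Quaternions are unit norm, in (w, x, y, z) order. Thresholds are
    illustrative placeholders.
    """
    if np.linalg.norm(p_curr - p_prev) > max_trans:
        return True  # implausible translation jump between frames
    # relative rotation angle from the quaternion inner product
    dot = abs(float(np.dot(q_prev, q_curr)))
    angle = 2.0 * np.degrees(np.arccos(min(dot, 1.0)))
    return bool(angle > max_rot_deg)

# a 5 m jump between consecutive frames is treated as a tracking failure
jumped = vio_failed(np.zeros(3), np.array([1.0, 0.0, 0.0, 0.0]),
                    np.array([5.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0, 0.0]))
```

When this check fires, the VIO adaptive coefficient is zeroed and the global estimator temporarily relies on GNSS constraints alone.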
The anomaly detection of GNSS primarily relies on the positioning state and the covariance matrix of its measurement information. The detection of abnormal states involves two steps. Firstly, the positioning state of GNSS must satisfy Equation (21); otherwise, GNSS is directly considered to be in an abnormal state:

$$s \neq 0 \quad \text{and} \quad \operatorname{tr}\!\left(\boldsymbol{\Sigma}_A\right) < \epsilon \tag{21}$$

where $s$ is the positioning state, with 0 indicating an invalid position; $\boldsymbol{\Sigma}_A = \operatorname{diag}\!\left(\sigma_{\mathrm{lon}}^2, \sigma_{\mathrm{lat}}^2, \sigma_{\mathrm{alt}}^2\right)$ is the covariance of the positioning results, whose entries $\sigma_{\mathrm{lon}}^2$, $\sigma_{\mathrm{lat}}^2$, and $\sigma_{\mathrm{alt}}^2$ are the covariances of the longitude, latitude, and altitude measurements; and $\epsilon$ is the positioning covariance threshold.
Secondly, GNSS state determination is carried out using a chi-squared test. The covariances of the longitude, latitude, and altitude measurements are employed to compute the information matrix of the GNSS residuals. The covariance matrix and information matrix for the $k$th frame of GNSS measurements are given by Equation (22):

$$\boldsymbol{\Sigma}_k = \operatorname{diag}\!\left(\sigma_{\mathrm{lon}}^2, \sigma_{\mathrm{lat}}^2, \sigma_{\mathrm{alt}}^2\right), \qquad \boldsymbol{\Lambda}_k = \boldsymbol{\Sigma}_k^{-1} \tag{22}$$

When GNSS measurements follow a normal distribution, the corresponding GNSS residuals $\mathbf{r}_k$ follow a Gaussian distribution with a mean of zero and a variance of $\boldsymbol{\Sigma}_k$. Supposing that there are $p$ frames of GNSS measurement data in the optimized sliding window, so that the length of the data window is $p$ and the dimension of the chi-squared test is 3, all GNSS data in this window follow a chi-squared distribution with $3p$ degrees of freedom. The anomaly detection function can be designed as Equation (23):

$$\chi^2 = \sum_{k=1}^{p}\mathbf{r}_k^T\,\boldsymbol{\Lambda}_k\,\mathbf{r}_k \tag{23}$$

Choosing a 95% confidence level with $3p$ degrees of freedom, the threshold can be obtained from the chi-squared table. When $\chi^2$ is less than the threshold, GNSS is considered to be in a normal state and the GNSS adaptive coefficient is set to 1; otherwise, it is set to 0.
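The chi-squared check over the sliding window can be sketched as follows; the critical value is passed in externally (e.g. from a chi-squared table, which is also how the paper obtains it).

```python
import numpy as np

def gnss_anomalous(residuals, covariances, threshold):
    """Chi-squared test over the p GNSS residuals in the window (cf. Eq. (23)).

    residuals:   list of 3-vectors (one GNSS residual per frame)
    covariances: list of 3x3 covariance matrices (Equation (22))
    threshold:   chi-squared critical value for 3p degrees of freedom at the
                 chosen confidence level (looked up from a table)
    """
    stat = 0.0
    for r, cov in zip(residuals, covariances):
        info = np.linalg.inv(cov)        # information matrix of the residual
        stat += float(r @ info @ r)      # Mahalanobis contribution of frame k
    return stat > threshold              # above the threshold: abnormal state

# p = 2 frames, 6 degrees of freedom; 12.59 is the 95% chi-squared
# critical value for 6 degrees of freedom
res = [np.array([0.1, 0.0, 0.1]), np.array([0.0, 0.2, 0.0])]
cov = [np.eye(3) * 0.04] * 2
normal = not gnss_anomalous(res, cov, threshold=12.59)
```

Small, covariance-consistent residuals keep the statistic far below the critical value, so the GNSS adaptive coefficient stays at 1.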
Local VIO factors can come from any local state estimator. Considering two consecutive measurement frames, the visual–inertial factor is defined as Equation (24):

$$\mathbf{r}_V\!\left(\hat{\mathbf{z}}_{k+1}^{k}, \mathcal{X}^G\right) = \begin{bmatrix} w_p\left(\hat{\mathbf{p}}_{B_{k+1}}^{B_k} - \left(\mathbf{R}_{B_k}^G\right)^{T}\!\left(\mathbf{p}_{B_{k+1}}^G - \mathbf{p}_{B_k}^G\right)\right) \\ w_q\left(\hat{\mathbf{q}}_{B_{k+1}}^{B_k} \ominus \left(\mathbf{q}_{B_k}^G\right)^{-1} \otimes \mathbf{q}_{B_{k+1}}^G\right) \end{bmatrix} \tag{24}$$

where $\ominus$ represents quaternion subtraction and $w_p$ and $w_q$ are the weighting factors for translation and rotation, respectively.

By setting the first GNSS measurement as the origin, we obtain GNSS measurements in the global ENU (East North Up) coordinate system. Since the origins of the global reference frame {G} and the local reference frame {W} coincide, the GNSS factor can be defined as Equation (25):

$$\mathbf{r}_A\!\left(\hat{\mathbf{z}}_k^A, \mathcal{X}^G\right) = \hat{\mathbf{p}}_{A_k}^G - \left(\mathbf{p}_{B_k}^G + \mathbf{R}_{B_k}^G\,\mathbf{p}_A^B\right) \tag{25}$$
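A sketch of the relative-pose residual behind the visual–inertial factor, with quaternion "subtraction" realized as the vector part of the error quaternion (one common minimal parameterization); the translation and rotation weighting factors are omitted here for clarity.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_conj(q):
    """Conjugate (inverse for unit quaternions)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_to_rot(q):
    """Rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def vio_factor(p_i, q_i, p_j, q_j, dp_meas, dq_meas):
    """Relative-pose residual between two consecutive global states (cf. Eq. (24)).

    dp_meas, dq_meas: relative translation/rotation measured by the local VIO.
    Returns the translation residual and the vector part of the rotation error
    quaternion; both vanish when the global states agree with the measurement.
    """
    R_i_inv = quat_to_rot(q_i).T                    # rotation {G} -> body i
    r_p = R_i_inv @ (p_j - p_i) - dp_meas           # translation residual
    dq_est = quat_mul(quat_conj(q_i), q_j)          # estimated relative rotation
    r_q = quat_mul(quat_conj(dq_meas), dq_est)[1:]  # rotation error, vector part
    return r_p, r_q
```

When the optimized global poses exactly reproduce the VIO-measured relative motion, both residual components are zero, which is what the global pose graph drives them toward.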
At this stage, the system can discern whether the various factors are anomalous. Finally, the global pose is optimized and solved using Ceres Solver [26].
5. Conclusions
This paper introduces a relative and absolute positioning system for ground vehicles. Starting from initialization, the system employs a two-stage initialization method to fully utilize measurement data from inertial sensors, cameras, and asynchronous GNSS: during vehicle stationary periods, it roughly estimates the initialization parameters of visual–inertial odometry (VIO) and the transformation matrix between the VIO coordinate system and ENU, and once the vehicle is in motion, it further optimizes the relevant initialization parameters. Additionally, an adaptive fusion mechanism is utilized to ensure smooth system operation even when a single sensor signal is unavailable. Finally, experiments are conducted in both the Carla simulation environment and a dataset from complex real-world scenarios, demonstrating the robustness and accuracy of the proposed system. Since GNSS and the other sensors are not hardware time-synchronized and the time offset between sensors changes over time, incorporating online time-offset estimation is left for future work.