Extrinsic calibration refers to the estimation of a rigid (Euclidean) transformation that maps three-dimensional points from one coordinate system to another, for instance, from the world or LiDAR coordinate system into the camera coordinate system. This process determines the position and orientation of a sensor relative to an external reference frame by estimating its translation along, and rotation about, the three orthogonal axes of three-dimensional space. The resulting extrinsic parameters are typically expressed as a 3 × 4 matrix comprising a 3 × 3 rotation matrix and a 3 × 1 translation vector.
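For reference, with R and t denoting the extrinsic rotation and translation, the mapping and its homogeneous form can be written as follows (generic notation, common to all methods surveyed below):

```latex
\mathbf{p}_{c} = \mathbf{R}\,\mathbf{p}_{w} + \mathbf{t},
\qquad
\begin{bmatrix} \mathbf{p}_{c} \\ 1 \end{bmatrix}
=
\begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{p}_{w} \\ 1 \end{bmatrix},
\qquad
\mathbf{R} \in SO(3),\ \mathbf{t} \in \mathbb{R}^{3},
```

where p_w is a point in the source (e.g., world or LiDAR) frame and p_c its coordinates in the camera frame; the 3 × 4 extrinsic matrix is simply [R | t].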
3.1. Camera–IMU Extrinsic Calibration
3.1.1. Decoupling-Based Methods
In early studies, the extrinsic calibration between a camera and an IMU was typically performed using specialized apparatus. Lobo and Dias [
39] mounted the system on a rotation platform to estimate the direction of gravity under static conditions. The calibration pattern was then placed on a horizontal surface, and accelerometer readings were taken at various camera poses [
40]. By aligning the vertical direction in the camera coordinate system with that measured by the accelerometer in the body frame, they estimated the rotation between the IMU and the camera [
41]. The system was subsequently rotated about the IMU’s center, where the IMU experiences zero linear acceleration, enabling estimation of the translation parameters. In practical experiments, they achieved a rotational error of 0.69° and a translational error of 5 mm using an IMU paired with a low-cost camera. This method requires precise mounting of the camera–IMU system on a dedicated rotation platform; however, accurately locating the IMU center and aligning it with the camera’s optical center is challenging, and any residual misalignment degrades overall calibration accuracy. Hol [
42] adopted the approach proposed by Lobo and Dias [
39] for camera–IMU calibration and sensor fusion. In his method, only the rotation parameters were estimated, while the translation parameters were derived from mechanical design specifications.
Alves [
43] performed calibration of a low-cost camera–IMU system using a pendulum-based setup. The pendulum, fitted with an encoded axis, was used to estimate the bias, scale factors, and axis alignment parameters of the inertial sensor. At the same time, the camera determined the vertical direction by detecting the vanishing point of the pendulum. By combining the vertical direction measured by the IMU accelerometer with that observed by the camera, the rotation matrix between the two coordinate frames was estimated. However, this method did not account for the translation parameters.
The aforementioned camera–IMU calibration methods, relying on specialized auxiliary equipment, face significant limitations. Their dependence on such devices leads to high acquisition costs and complex system setup, while unavoidable installation and adjustment errors directly compromise calibration accuracy. Additionally, the parameter estimation process often treats rotation and translation independently, ignoring their potential coupling. This separation causes errors in rotation estimation to propagate into translation parameters, further reducing overall calibration precision. These technical challenges and practical constraints considerably limit the engineering applicability of these methods.
3.1.2. Filter-Based Methods
Filter-based calibration methods fuse camera and IMU data through recursive filtering, enabling estimation and continuous updating of the calibration parameters. Mirzaei and Roumeliotis [
44] first introduced an iterative extended Kalman filter-based camera–IMU calibration approach that tracks corner points on a planar calibration target to estimate the relative poses and the bias of the IMU. However, the computational complexity of this method increases cubically with the number of environmental features, posing challenges for real-time performance in certain scenarios. Li and Mourikis [
45] improved upon this by incorporating camera poses at different time steps into the state vector rather than including individual feature points, which greatly reduced the computational burden.
However, Mirzaei’s model did not account for the influence of the gravity vector. Chu and Yang [
46] and Kelly and Sukhatme [
47] addressed this by incorporating the gravity vector into the state estimation process, and Kelly further validated the necessity of estimating the gravity vector through simulation experiments. Unlike prior work, Kelly applied the unscented Kalman filter (UKF), which handles highly nonlinear systems better than the EKF, whose performance is reliable only when the system nonlinearity is relatively mild [
48]. Under a Gaussian assumption, the UKF achieves third-order approximation accuracy in nonlinear state estimation. Zachariah and Jansson also applied the UKF in [
49] and additionally estimated intrinsic IMU parameters, including accelerometer and gyroscope misalignments, scale factors, and bias. Later, Hartzer and Saripalli [
50] applied an EKF to a multi-camera–IMU system and designed a two-stage filter to calibrate both temporal and spatial misalignment errors. This method effectively reduced noise and demonstrated high robustness in spatiotemporal calibration.
Within the filter-based framework, several researchers have proposed methods for calibrating rolling shutter cameras and IMUs. Li et al. [
51] proposed an error approximation method that replaces traditional motion assumptions, such as constant velocity, to address motion distortion introduced by row-wise exposure. By incorporating the intrinsic parameters of the IMU and camera, along with time delay, into the system state, they utilized an EKF to perform full online self-calibration of both the IMU and the rolling shutter camera. Lee et al. [
52] employed the Cayley transformation model together with IMU measurements to compute the rolling shutter transformation. They introduced a staged gray-box Kalman filter calibration approach to estimate calibration parameters under non-fixed noise density conditions. The main idea is to estimate IMU noise characteristics using nonlinear optimization, then refine the calibration parameters based on the updated noise model. This staged framework achieves a balance between computational efficiency and estimation accuracy, leading to faster convergence compared to single-shot methods. However, the gray-box model is structurally complex and relies on limited measurement information, which makes it difficult to accurately identify and predict the underlying nonlinear system behavior. Yang et al. [
53] proposed an online self-calibration method for visual-inertial navigation systems based on EKF and conducted an observability analysis, which demonstrated that a fully parameterized calibration system has four unobservable directions, corresponding to global yaw and translation. They also identified specific motion patterns, such as single-axis rotation and constant acceleration, that cause failures in calibrating the intrinsic parameters of both the IMU and the camera.
Filter-based camera–IMU calibration methods have steadily advanced, allowing for the reliable estimation of additional key parameters, including gravity, IMU intrinsic parameters, and spatiotemporal offsets. Despite these improvements, such methods depend on accurate assumptions regarding noise distribution. In certain real-time calibration scenarios, they may struggle to effectively mitigate the systematic accumulation of linearization errors.
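As a schematic illustration of the filter-based formulation (generic notation rather than the exact model of any one work above), the IMU state is augmented with the camera–IMU extrinsics and corrected by feature reprojections:

```latex
\mathbf{x} = \big[\, \mathbf{q}_{WI},\ \mathbf{p}_{WI},\ \mathbf{v}_{WI},\ \mathbf{b}_{g},\ \mathbf{b}_{a},\ \mathbf{q}_{IC},\ \mathbf{p}_{IC} \,\big]^{\top},
\qquad
\mathbf{z} = \pi\!\left( \mathbf{R}_{IC}^{\top}\big( \mathbf{R}_{WI}^{\top}(\mathbf{p}_{f}^{W} - \mathbf{p}_{WI}) - \mathbf{p}_{IC} \big) \right) + \mathbf{n},
```

where (q_IC, p_IC) are the camera–IMU extrinsics, π(·) the camera projection function, p_f^W a tracked landmark, and n the measurement noise; the filter propagates x with the IMU kinematic model and updates it with each observation z.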
Table 2 summarizes the characteristics and contributions of selected filter-based camera–IMU calibration methods.
3.1.3. Optimization-Based Methods
Unlike filter-based methods, optimization-based approaches do not propagate the system state and covariance throughout the entire measurement sequence. Instead, they formulate a mathematical model of the sensor trajectory and obtain the optimal calibration parameters by minimizing the measurement residuals.
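In generic form (a template rather than the cost function of any specific method below), such approaches solve

```latex
\min_{\;\mathbf{T}_{IC},\ \{\mathbf{x}_{k}\},\ \mathbf{b}}\;
\sum_{k,j} \big\| \mathbf{r}_{\mathrm{cam}}\!\left(\mathbf{T}_{IC}, \mathbf{x}_{k}, \mathbf{p}_{j}\right) \big\|^{2}_{\boldsymbol{\Sigma}_{c}}
+ \sum_{k} \big\| \mathbf{r}_{\mathrm{imu}}\!\left(\mathbf{x}_{k}, \mathbf{x}_{k+1}, \mathbf{b}\right) \big\|^{2}_{\boldsymbol{\Sigma}_{i}}
+ \big\| \mathbf{r}_{\mathrm{bias}}(\mathbf{b}) \big\|^{2}_{\boldsymbol{\Sigma}_{b}},
```

where T_IC is the camera–IMU extrinsic transform, x_k the sensor states along the trajectory, b the IMU biases, r_cam the reprojection residuals of landmarks p_j, and r_imu the inertial residuals; the weighted sum is minimized with Gauss–Newton or Levenberg–Marquardt iterations.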
Dong-Si and Mourikis [
54] were the first to propose an algorithm that directly computes system observability from sensor measurements. They initially estimated the rotation and translation between the camera and the IMU using a convex optimization method, and subsequently refined the results by solving a nonlinear least-squares problem with the Levenberg–Marquardt algorithm. Although this approach proved effective, it did not dynamically estimate the bias of the IMU’s gyroscope and accelerometer. As a result, accumulated errors over time degraded the calibration accuracy. Building on this work, Yang and Shen [
55] introduced a sliding window framework to constrain the temporal accumulation of IMU errors during optimization. However, similar to the previous method, it did not account for IMU bias.
For estimating IMU bias, Mur-Artal and Tardos [
56] proposed a stepwise method to separately estimate the bias of the accelerometer and gyroscope. Their approach achieved high accuracy within a short time frame but required prior knowledge of the extrinsic parameters between the camera and IMU. Qiu et al. [
57] introduced a high-precision calibration method for camera–IMU systems based on adaptive constraints derived from multiple error equations. In their framework, IMU errors are treated as adaptive constraints embedded into error compensation equations, and the Newton method is employed to iteratively solve for the optimal model parameters. By comprehensively accounting for IMU bias and lens distortion, this method supports accurate online calibration.
Qin et al. [
58] proposed a general monocular visual-inertial state estimation system, VINS-Mono. This system employs structure from motion (SfM) and IMU pre-integration to perform online estimation of gyroscope bias, gravity vector, scale parameter, and camera–IMU extrinsic parameters. It overcomes the limitation of traditional VINS, which necessitates static initialization or slow movement. Similarly, Huang and Liu [
59] proposed an online calibration method that does not rely on any prior information. They employed a three-stage strategy to progressively optimize the calibration parameters and to automatically identify their convergence.
However, the aforementioned methods did not account for the temporal offset between camera and IMU data. Fleps et al. [
60] were among the first to model camera–IMU extrinsic calibration as a nonlinear batch optimization problem by using B-spline curves to parameterize the IMU trajectory. They incorporated time delay into the measurement model and solved the problem using a sequential quadratic programming (SQP) algorithm. Due to the batch-processing nature of this method, it is inherently limited in real-time applications. Furgale et al. [
33] later introduced the well-known Kalibr toolbox, which formulates a continuous-time joint calibration model based on maximum likelihood estimation. This method employs B-splines to parameterize the time-varying IMU states and directly embeds the temporal offset as an optimization variable within the measurement model. Thus, it avoids the estimation difficulties caused by timestamp misalignments in discrete-time models. Finally, a least-squares optimization framework integrating image reprojection residuals, IMU measurement residuals, and bias residuals is constructed to solve for the spatiotemporal parameters of the camera–IMU system. Building on this work, Li and Mourikis [
37] demonstrated that the camera–IMU time offset can be effectively estimated by modeling it as an additional state variable, thus providing a theoretical basis for addressing sensor time asynchrony issues. Rehder et al. [
61] extended the Kalibr framework to multi-IMU systems, and Huai et al. [
62] adapted it for use with rolling-shutter cameras. More recently, Nikolic et al. [
63] proposed a non-parametric joint calibration approach that differs from Kalibr by eliminating the need to predefine temporal basis functions or motion noise parameters. Their method incorporates the discrepancy between continuous IMU integration and discrete sampling into the maximum likelihood framework, enabling joint estimation of trajectory and calibration parameters without relying on strict prior assumptions.
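A sketch of the continuous-time idea (generic notation; individual toolboxes differ in spline order and residual definitions): the IMU pose is a B-spline over control poses, and the time offset enters the reprojection residual directly,

```latex
\mathbf{T}_{WI}(t) = \mathrm{BSpline}\big(t;\ \mathbf{T}_{1}, \dots, \mathbf{T}_{M}\big),
\qquad
\mathbf{r}_{k,j} = \mathbf{z}_{k,j} - \pi\!\left( \mathbf{T}_{IC}^{-1}\, \mathbf{T}_{WI}(t_{k} + t_{d})^{-1}\, \mathbf{p}_{j}^{W} \right),
```

where T_1, …, T_M are spline control poses, z_{k,j} the observation of landmark j in image k, and t_d the camera–IMU time offset estimated jointly with T_IC; gyroscope and accelerometer residuals are obtained from the first and second time derivatives of the spline.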
Optimization-based camera–IMU calibration methods directly solve nonlinear cost functions using nonlinear optimization algorithms such as Levenberg–Marquardt or Gauss–Newton, thereby avoiding the inherent errors introduced by linearization in filtering-based approaches. These methods can simultaneously process a large amount of measurement data, resulting in higher accuracy and robustness. However, their computational cost is relatively high, making them less suitable for real-time applications compared to filter-based methods.
Table 3 summarizes the characteristics and contributions of selected optimization-based camera–IMU calibration methods.
3.1.4. Learning-Based Methods
Traditional calibration approaches typically involve carefully designed pipelines, encompassing feature extraction, matching, and parameter optimization, all of which rely heavily on manually engineered rules and algorithms. Deep learning provides a promising alternative for camera–IMU calibration by leveraging large-scale data to learn complex feature representations and mapping relationships. Deep learning-based extrinsic calibration methods can be broadly categorized into two main strategies. The first involves substituting individual modules within the conventional pipeline, such as feature extraction or matching. The second seeks to model and optimize the nonlinear parameters of the sensor to improve the calibration accuracy.
Guo et al. [
64] retained the traditional feature extraction and matching process in their calibration framework. They used the rotation matrices of adjacent images along with 10 sequences of IMU measurements as input to a backpropagation neural network. The network was trained by minimizing the mean squared error using the Levenberg–Marquardt algorithm, enabling direct mapping from IMU data to camera pose. Some work focuses on enhancing calibration accuracy by refining IMU intrinsic parameters. For instance, Liu et al. [
65] proposed a dual-branch dilated convolutional network that directly estimates the scale factors of the gyroscope and accelerometer, as well as accelerometer correction terms, including bias and noise. However, their approach depends on ground-truth position and orientation data as supervision. To address the challenge of limited supervision, Hosseini et al. [
66] employed a loosely coupled error-state extended Kalman filter (ESKF) [
67] to fuse IMU and camera data, thereby generating reliable training targets. A convolutional neural network (CNN) was then trained as a front-end calibration module to produce bias- and noise-compensated IMU data, which was subsequently refined in real time by the ESKF.
In recent years, deep learning-based research on IMU intrinsic calibration has achieved notable advancements. Brossard et al. [
68] proposed the use of deep neural networks (DNNs) to model IMU noise characteristics and to dynamically adjust filter parameters, thereby enhancing estimation accuracy. Zhang et al. [
69] employed DNNs to compute observable IMU integration terms, improving both robustness and precision. Gao et al. [
70] proposed Gyro-Net to estimate and compensate for random noise in gyroscope measurements. Buchanan et al. [
71] conducted a comparative study of long short-term memory (LSTM) and transformer-based architectures for IMU bias compensation. A lightweight convolutional neural network, Calib-Net, was developed by Li et al. [
72], utilizing dilated convolutions to extract spatiotemporal features from IMU measurements and output dynamic compensation terms for gyroscope readings. This framework enables high-precision calibration for low-cost IMUs.
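To make the idea concrete, the sketch below shows a dilated 1-D convolutional network that maps a window of raw IMU samples to a per-sample gyroscope correction, in the spirit of Calib-Net [72]; the layer sizes, window length, and six-channel input are our own assumptions rather than the published architecture.

```python
# Illustrative sketch only: a dilated 1-D CNN that maps a window of raw IMU
# samples to a gyroscope correction. Layer sizes and window length are
# assumptions, not the published Calib-Net architecture.
import torch
import torch.nn as nn

class GyroCompensationNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # input: (batch, 6, window) -- 3-axis gyroscope + 3-axis accelerometer
            nn.Conv1d(6, 32, kernel_size=7, dilation=1, padding=3), nn.GELU(),
            nn.Conv1d(32, 64, kernel_size=7, dilation=4, padding=12), nn.GELU(),
            nn.Conv1d(64, 64, kernel_size=7, dilation=16, padding=48), nn.GELU(),
        )
        self.head = nn.Conv1d(64, 3, kernel_size=1)  # per-sample gyro correction

    def forward(self, imu_window: torch.Tensor) -> torch.Tensor:
        # returns a (batch, 3, window) additive correction to the raw gyro signal
        return self.head(self.features(imu_window))

if __name__ == "__main__":
    net = GyroCompensationNet()
    raw_imu = torch.randn(8, 6, 200)          # hypothetical batch of IMU windows
    corrected_gyro = raw_imu[:, :3] + net(raw_imu)
    print(corrected_gyro.shape)               # torch.Size([8, 3, 200])
```

A typical training signal compares orientations integrated from the corrected gyroscope stream against reference orientations, although the exact loss varies across works.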
Some studies omit the extrinsic calibration step altogether. In the area of end-to-end visual-inertial odometry (VIO), Clark et al. [
73] introduced VINet, a framework that directly estimates sensor poses by processing raw visual and inertial data through a neural network, without requiring manual extrinsic calibration or clock synchronization. However, this system depends on ground-truth poses from high-precision reference platforms, which limits its practical deployment. To overcome the scale ambiguity of monocular vision and the scarcity of supervised ground-truth data, Han et al. [
74] used stereo image pairs to generate three-dimensional geometric constraints that provide absolute scale supervision. Their self-supervised system, DeepVIO, operates using only monocular images and IMU data, eliminating the need for extrinsic calibration, and has demonstrated robustness under challenging conditions, including inaccurate camera–IMU calibration, unsynchronized data, and data loss. Learning-based methods show considerable potential; however, several unresolved issues, including reliance on labeled data, limited global consistency, and poor interpretability, hinder their ability to replace traditional approaches in the near term.
3.1.5. Discussion
The four categories of camera–IMU calibration methods exhibit distinct trade-offs in terms of accuracy, computational efficiency, and adaptability to various environments. Methods based on decoupled models reduce computational complexity by separating the calibration of rotation and translation; however, their dependence on specialized equipment and disregard for coupling effects may lead to significant accuracy degradation. Filter-based methods employ recursive estimation techniques that support both online and offline calibration and demonstrate strong adaptability to dynamic environments. Nevertheless, inaccurate assumptions about the noise model can cause the estimation to diverge. Optimization-based methods jointly solve visual-inertial constraints to achieve high calibration accuracy, although they are often computationally intensive. Deep learning methods automate calibration by leveraging large-scale datasets, but their performance is highly dependent on labeled data and tends to degrade when applied to previously unseen scenarios.
Table 4 summarizes the principles, advantages, and limitations of each approach, along with relevant references.
To further evaluate camera–IMU calibration performance, we compared several camera–IMU methods in
Table 5. Studies that did not explicitly report camera–IMU extrinsic parameter errors were excluded, as their experiments may not have been specifically designed for camera–IMU extrinsic calibration or may have validated their methods through other types of experiments. In general, methods with smaller calibration errors are considered superior, although exceptions occur. For example, Lee’s [
52] method yielded smaller calibration errors than that of Qiu [
57], yet Qiu [
57] experimentally demonstrated superior overall performance. This discrepancy arises from the lack of standardization in datasets and hardware across studies. Factors such as the level of motion excitation during data collection, the density of environmental features, camera resolution, IMU accuracy, and even the number of algorithm iterations can all influence calibration outcomes. Optimization-based methods, exemplified by Kalibr [
33], typically achieve high precision, with the best results reporting an average translation error of less than 1 mm and an average rotation error of less than 0.01°, although this comes at the cost of real-time performance. The trade-off between accuracy and computational efficiency remains an important topic for future investigation.
Following the evaluation approach in [
75], we also assessed the robustness of these algorithms from four perspectives:
(A) Whether experiments tested the method on multiple dataset types or real-time data. Methods tested on two or more types received two stars (☆☆).
(B) Whether tests included degenerate or challenging scenarios. If so, one star was awarded (☆).
(C) Whether the range of initial errors in the test data was sufficiently large. If the initial average translation exceeded 50 mm or the initial average rotation exceeded 10°, one star was awarded (☆).
(D) Whether anti-noise testing was conducted. If so, one star was awarded (☆).
Under this scheme, the maximum possible score was five stars. Some methods achieved four stars, indicating strong robustness.
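For clarity, the scoring rule can be expressed as a small tally function (our own summary of criteria (A)–(D), assuming that fewer than two dataset types earn no star under (A)):

```python
def robustness_stars(dataset_types: int, degenerate_tested: bool,
                     init_trans_mm: float, init_rot_deg: float,
                     noise_tested: bool) -> int:
    """Tally the robustness stars (maximum 5) for criteria (A)-(D)."""
    stars = 2 if dataset_types >= 2 else 0                           # (A)
    stars += 1 if degenerate_tested else 0                           # (B)
    stars += 1 if (init_trans_mm > 50 or init_rot_deg > 10) else 0   # (C)
    stars += 1 if noise_tested else 0                                # (D)
    return stars

# Example: two dataset types, degenerate scenes, large initial errors, no noise test
print(robustness_stars(2, True, 80.0, 5.0, False))  # prints 4
```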
Recent advancements in camera–IMU calibration have shifted research focus toward multi-sensor systems and task-specific adaptations. In the field of multi-camera and IMU calibration, Eckenhoff et al. [
76] proposed an online approach based on the MSCKF framework, which aligns asynchronous timestamps and jointly optimizes the geometric constraints of multiple cameras. This method eliminates the dependence on overlapping FoVs and improves robustness in dynamic environments. Fu et al. [
77] developed the Mu-CI system, which minimizes a multi-objective cost function comprising inertial residuals, reprojection errors, and AprilTag association errors.
Calibration methods for specific tasks are often tailored to the operational environment. To address the instability of checkerboard corner detection in underwater welding scenarios, Chi [
78] proposed a prediction–detection corner algorithm and employed joint optimization of intrinsic and extrinsic parameters to enhance underwater calibration accuracy. In unmanned aerial vehicle (UAV) applications, Yang et al. [
79] analyzed the observability conditions of camera boresight misalignment angles using the observability Gramian matrix and designed an online trajectory optimization algorithm with physical constraints by optimizing the relative geometry between the UAV and the target. For autonomous driving vehicles, Xiao et al. [
80] proposed an extrinsic calibration quality monitoring algorithm based on online residual analysis. In multi-sensor fusion calibration, researchers have extended calibration techniques to include additional modalities such as LiDAR, GNSS [
81,
82,
83], wheel encoders [
84], and ultra-wideband (UWB) systems [
85]. Calibration strategies involving LiDAR will be discussed in
Section 3.4.
3.2. LiDAR–IMU Extrinsic Calibration
To calibrate the extrinsic parameters between a rigidly connected LiDAR and IMU, a commonly adopted strategy is to align their respective trajectories. Geiger et al. [
86] proposed a motion-based method that estimates the transformation through hand–eye calibration [
87]. Schneider et al. [
88] utilized an unscented Kalman filter to compute the transformation between two odometry-based sensors. Xia et al. [
89] estimated the extrinsic parameters by minimizing the Euclidean distance between LiDAR point clouds transformed using IMU data and those aligned through the iterative closest point (ICP) algorithm. Despite their effectiveness, these approaches often overlook the accumulated error and rotational drift introduced by the IMU under real-world conditions.
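Formally, the trajectory-alignment strategy reduces to the classic hand–eye formulation [87]: if A_k and B_k denote the relative motions of the LiDAR and the IMU over the same time interval, the constant extrinsic transform X satisfies

```latex
\mathbf{A}_{k}\,\mathbf{X} = \mathbf{X}\,\mathbf{B}_{k},
\qquad
\mathbf{A}_{k} = \big(\mathbf{T}^{L}_{k}\big)^{-1}\mathbf{T}^{L}_{k+1},
\quad
\mathbf{B}_{k} = \big(\mathbf{T}^{I}_{k}\big)^{-1}\mathbf{T}^{I}_{k+1},
```

where T^L_k and T^I_k are the LiDAR and IMU poses estimated by each sensor's own odometry; the rotation is usually solved first from the rotational part of the equation, after which the translation follows from a linear system.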
Gentil et al. [
90] modeled LiDAR–IMU calibration as a factor graph optimization problem. They used Gaussian process regression [
91] to upsample IMU measurements and correct motion distortion in LiDAR scans. The upsampled IMU data were pre-integrated, and the system jointly minimized point-to-plane residuals and IMU errors to estimate the extrinsics. Li et al. [
92] refined this approach by introducing an improved time offset model in the pre-integration process, computing initial rotation through a closed-form solution and using dynamic IMU initialization to estimate gravity and velocity. However, this method requires a dense prior point cloud map to extract planar features, which may limit its applicability in real-world scenarios. Zhu et al. [
93] proposed a real-time initialization framework, LI-Init, which first estimates a coarse temporal offset by maximizing the correlation of angular velocity magnitudes between LiDAR and IMU. It then aligns LiDAR odometry with IMU measurements for further refinement. The method also evaluates motion excitation by analyzing the singular values of the Jacobian matrix. Yang et al. [
94] further improved calibration accuracy by applying a point-based update strategy [
95] to correct distortion in the LiDAR scan during motion.
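The coarse time-offset step used in LI-Init [93] can be illustrated with a short sketch (our simplification; it assumes the angular-velocity magnitude sequences from the LiDAR odometry and the IMU have already been resampled to a common rate, and the signal names are hypothetical):

```python
# Minimal sketch: coarse LiDAR-IMU time offset via cross-correlation of
# angular-velocity magnitudes. Assumes both signals are resampled to the
# same rate fs (Hz); names and data are illustrative.
import numpy as np

def coarse_time_offset(gyro_norm_imu: np.ndarray,
                       gyro_norm_lidar: np.ndarray,
                       fs: float) -> float:
    """Offset (s) of the LiDAR sequence relative to the IMU sequence (negative = LiDAR lags)."""
    a = gyro_norm_imu - gyro_norm_imu.mean()
    b = gyro_norm_lidar - gyro_norm_lidar.mean()
    corr = np.correlate(a, b, mode="full")        # correlation at every integer lag
    lag = int(np.argmax(corr)) - (len(b) - 1)     # lag in samples
    return lag / fs

# Synthetic check: delay the "LiDAR" signal by 10 samples (0.05 s at 200 Hz)
fs = 200.0
t = np.arange(0.0, 10.0, 1.0 / fs)
imu_rate = np.abs(np.sin(np.pi * t)) + 0.5 * np.abs(np.sin(0.37 * np.pi * t))
lidar_rate = np.roll(imu_rate, 10)
print(coarse_time_offset(imu_rate, lidar_rate, fs))  # approximately -0.05
```

The resolution of this estimate is limited to one sample period; finer offsets are then recovered by the subsequent odometry-alignment refinement.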
Mishra et al. [
96] observed that, although [
90] modeled IMU data using Gaussian processes, the factor graph performed optimization only at discrete time steps, which may reduce accuracy. To address this issue, they proposed an EKF-based calibration method integrated into the OpenVINS [
97] visual-inertial framework. This method does not rely on calibration targets or structured environments. However, the EKF is sensitive to the initial state, and inaccurate initial rotation estimates can propagate through the filter and degrade final calibration performance.
Continuous-time batch optimization, based on temporal basis functions, has also been widely adopted. Furgale and Rehder conducted a series of foundational studies in spatiotemporal calibration. In [
61], they implemented a complete SLAM system using B-spline basis functions for camera–IMU calibration. This framework was later extended to support both temporal and spatial calibration [
98], and further generalized to camera–LiDAR–IMU systems by integrating point-to-plane constraints [
99] into the batch optimization. A key limitation of this approach is its reliance on highly accurate visual-inertial trajectories, which restricts its applicability to LiDAR–IMU calibration. Lv et al. [
100] modeled IMU states as continuous-time splines, enabling accurate pose estimation at each LiDAR scan time. Unlike [
90], their method exploited all planar surfaces in the environment by projecting LiDAR points onto corresponding planes and incorporating these projections as constraints in a nonlinear least-squares optimization. Although this approach does not require artificial targets, it depends on environments rich in planar features. Their later work, OA-Calib [
101], improved computational efficiency by selecting high-information trajectory segments and introduced an observability-aware update mechanism to ensure robustness under degenerate motion. Wu et al. [
102] extended this by dynamically adjusting frame length in LiDAR odometry to maintain near-linear motion and scene stability, thereby reducing nonlinearity and environmental variation.
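The plane constraints in these continuous-time formulations typically take a point-to-plane form (generic notation):

```latex
r_{j} = \mathbf{n}_{k}^{\top}\left( \mathbf{T}_{WI}(t_{j})\,\mathbf{T}_{IL}\,\mathbf{p}_{j} - \mathbf{q}_{k} \right),
```

where p_j is a LiDAR point captured at time t_j, T_IL the LiDAR–IMU extrinsic transform, T_WI(t_j) the spline-interpolated IMU pose at the exact sampling time, and (n_k, q_k) the normal vector and a reference point of the associated plane; the extrinsics (and, where modeled, a time offset) are obtained by minimizing the sum of squared residuals over all point–plane associations.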
Some recent methods are based on dedicated equipment. Liu et al. [
103] designed a calibration system using cone cylinder structures. They estimated point cloud poses relative to these shapes and formulated an optimization problem with geometric constraints to solve for extrinsics. Subsequently, they proposed using features such as points, lines, spheres, cylinders, and planes extracted from LiDAR scans in natural scenes [
104]. While this method avoids specially designed targets, its effectiveness is limited in scenes where such features are scarce or unreliable.
Online LiDAR–IMU calibration has also been integrated into SLAM systems. The methods in [
105,
106] fused LiDAR, IMU, and camera data using lightweight EKF frameworks, where extrinsic parameters are estimated online. Ye et al. [
107] proposed a tightly coupled odometry system that jointly minimizes residuals from LiDAR points and IMU pre-integration for real-time extrinsic estimation. Xu et al. [
108] introduced FAST-LIO2, which performs iterative EKF-based calibration during tightly coupled LiDAR–IMU odometry. These online methods typically assume reasonable initial estimates for convergence. Since extrinsic parameters are included in the state vector [
105,
109], unobservable directions may emerge under certain motions. To address this, Kim et al. [
110] proposed GRIL-Calib, a method tailored for ground robots with planar motion. It introduced geometric constraints, such as enforcing the orthogonality between the ground normal vector in the LiDAR frame and the ground
z-axis, to constrain roll and pitch. It also assumed a constant vertical offset between the IMU and LiDAR to constrain z-translation.
Similarly, we compared the calibration performance of several LiDAR–IMU methods in
Table 6. By contrasting these results with those from camera–IMU calibration, it is evident that LiDAR–IMU calibration generally achieves slightly lower accuracy. This difference arises because image features, such as corner points, are highly dense, providing over-constrained conditions for camera pose estimation and enabling highly precise optimization. In contrast, point cloud features, such as planes and edges, are relatively sparse, with only a few dozen valid geometric features per frame. Consequently, the corresponding registration error models lack sufficient constraints, making them more susceptible to local minima. For instance, Xia [
89] adopted a trajectory alignment approach, in which the algorithm relies on the iterative closest point (ICP) method to establish point cloud correspondences. When initial errors are large, the algorithm can converge to local minima, resulting in absolute translation and rotation errors of 17.2 mm and 1.2°, respectively.
Another noteworthy approach is GRIL-Calib [
110], proposed by Kim et al., whose calibration results are not directly comparable to those of other methods. GRIL-Calib was evaluated in a ground-robot planar-motion scenario, where the absence of excitation along the vertical (
z-axis) inherently renders certain parameters, such as roll and pitch rotations, as well as the
z-axis translation, unobservable. In contrast, other methods were tested under full-axis motion excitation, which provides superior parameter observability. To address this unobservability in planar motion, GRIL-Calib introduces the ground planar motion (GPM) constraint. However, this method relies heavily on the geometric regularity of a flat ground surface. In non-planar environments, such as slopes or steps, reduced accuracy in ground segmentation can cause the geometric relationships defined by the GPM constraint to break down, thereby degrading calibration performance.
Based on experimental results reported in other studies, we have also summarized the following three important conclusions:
(1) Regardless of whether the LiDAR operates in 128-, 64-, 32-, or 16-channel mode, the algorithm converges to comparable calibration parameter estimates, indicating insensitivity to LiDAR point cloud density [
96].
(2) Experimental comparisons show that, without point cloud de-skewing, calibration results fail to converge and exhibit large fluctuations. When de-skewing is performed using IMU state prediction, scan matching accuracy improves significantly, and calibration parameters converge stably [
96].
(3) During data collection, all rotational and translational degrees of freedom of the sensor suite must be sufficiently excited to ensure full observability of the extrinsic calibration parameters. Experiments demonstrate that, with adequate motion excitation, both translation and rotation parameters converge rapidly; insufficient excitation may result in biased estimates [
96,
110].
LiDAR–IMU calibration technology was previously overlooked due to the high computational complexity involved in processing LiDAR point clouds. However, with the increasing demand for LiDAR–IMU fusion in recent years, research in this area has grown substantially. Nevertheless, the calibration accuracy of LiDAR–IMU systems remains lower than that of camera–IMU systems, leaving room for further improvement.
Table 7 presents several representative calibration approaches, along with their main contributions.
3.3. Camera–LiDAR Extrinsic Calibration
The calibration between LiDAR and cameras has been extensively reviewed by numerous scholars in recent years. Readers seeking a more comprehensive and in-depth understanding of camera–LiDAR calibration are encouraged to consult the survey papers listed in
Table 8. Liu et al. [
111] were the first to systematically summarize the development of camera–LiDAR calibration, outlining the complete calibration process and describing the principles and algorithms for estimating both intrinsic and extrinsic parameters. Li et al. [
112] reviewed the progress of targetless calibration methods and categorized existing approaches into four main groups: information-theoretic methods, feature-based methods, ego motion-based methods, and learning-based methods. This study analyzed the theoretical foundations, advantages, limitations, and application scenarios of each category. Qiu et al. [
113] provided an overview of extrinsic calibration methods for imaging sensors, including cameras, LiDAR, and millimeter-wave radar. Liao et al. [
8] conducted a systematic review of the application of deep learning in camera calibration, which includes LiDAR–camera calibration. Tan et al. [
75] focused specifically on deep learning-based calibration methods, classifying them into accurate extrinsic parameter estimation (AEPE) and relative extrinsic parameter prediction (REPP), and established a structured knowledge system in this area. Zhang et al. [
114] offered a detailed summary of 2D–3D, 3D–3D, and 2D–2D matching techniques for camera–LiDAR calibration, covering sensor modeling, error analysis, and key methodologies. The review included both traditional and deep learning-based approaches. Based on these survey studies, we categorize existing camera–LiDAR calibration techniques into four types, as summarized in
Table 9, and compare their respective advantages and disadvantages.
3.4. Camera–LiDAR–IMU Extrinsic Calibration
Traditional calibration of camera–LiDAR–IMU (LVI) systems typically follows a pairwise, chain-based strategy in which the extrinsic parameters are estimated sequentially between sensor pairs to derive the complete spatial and temporal relationships among the three sensors. However, this step-by-step approach introduces operational redundancy and allows errors to accumulate at each stage, since every sensor pair requires a customized calibration process. To address these limitations, joint calibration methods have been developed to estimate the spatiotemporal parameters of all three sensors within a unified framework. The general procedure of such methods can be divided into three stages: preprocessing, parameter initialization, and nonlinear optimization, as illustrated in
Figure 5. After the data from each individual sensor is preprocessed, initial estimates of rotation and temporal offset are obtained through pose estimation, point cloud alignment, and hand–eye calibration. Finally, multiple cross-modal constraints, including visual reprojection errors, LiDAR point-to-plane correspondences, and IMU kinematic residuals, are jointly optimized using iterative nonlinear techniques to accurately estimate the extrinsic parameters.
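Schematically, the final optimization stage solves a joint problem of the form (a generic template; the exact residuals, weights, and parameterizations vary across the methods discussed below)

```latex
\min_{\;\mathbf{T}_{IC},\ \mathbf{T}_{IL},\ t^{c}_{d},\ t^{l}_{d},\ \boldsymbol{\Theta}}\;
\sum \big\| \mathbf{r}_{\mathrm{proj}} \big\|^{2}_{\boldsymbol{\Sigma}_{c}}
+ \sum \big\| \mathbf{r}_{\mathrm{p2p}} \big\|^{2}_{\boldsymbol{\Sigma}_{l}}
+ \sum \big\| \mathbf{r}_{\mathrm{imu}} \big\|^{2}_{\boldsymbol{\Sigma}_{i}},
```

where T_IC and T_IL are the camera–IMU and LiDAR–IMU extrinsics, t_d^c and t_d^l the corresponding time offsets, Θ the trajectory and bias parameters, r_proj the visual reprojection residuals, r_p2p the LiDAR point-to-plane residuals, and r_imu the IMU kinematic residuals.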
Several researchers have introduced target-based calibration approaches. Hou et al. [
125] extracted line and plane features from a checkerboard to establish geometric constraints between the camera and LiDAR. These constraints were then translated into constraints on the camera–IMU and LiDAR–IMU extrinsics, which were jointly optimized with IMU measurement residuals, enabling simultaneous estimation of both sets of extrinsic parameters. Zhi et al. [
126] introduced a method compatible with multiple cameras, LiDAR sensors, and IMUs, even in scenarios without overlapping FoVs. Their approach utilizes multiple planar calibration targets with AprilTags. By incorporating LiDAR point-to-plane constraints, IMU measurements, and corner reprojection errors, the system jointly estimates all extrinsic parameters without relying on any prior spatial initialization.
More recently, targetless spatiotemporal calibration methods based on continuous-time frameworks have gained attention. Liu et al. [
127] employed an IMU-centric approach, using correlation analysis and hand–eye calibration to obtain the initial estimates of the time offset and extrinsic rotation. They then calibrated the extrinsic parameters between the IMU and LiDAR, optimized the IMU trajectory, and subsequently calibrated the extrinsic parameters between the IMU and camera.
Wang et al. [
128] integrated multiple constraints into a unified optimization framework, including point-to-surfel constraints, visual reprojection errors, a visual point-to-LiDAR-surfel constraint, and the error constraint between IMU measurements and trajectory derivatives. However, this method requires an accurate initial estimate of the camera–IMU extrinsic parameters to construct the visual-inertial odometry (VIO) system. In addition, owing to its computational complexity and scale, each calibration takes approximately 200 s, which limits its suitability for real-time applications. In contrast, in [
129], the LiDAR point-to-plane, gyroscope, and accelerometer factors are jointly optimized to estimate the LiDAR–IMU extrinsic parameters, time offset, and the poses of control points. Then, by incorporating the visual reprojection factor, the control points in the spline trajectory are fixed to calibrate the camera–IMU extrinsic parameters and time offset. Li and Chen et al. [
130] combined LiDAR point-to-plane and visual reprojection factors into a single optimization process to solve for both sets of extrinsic parameters simultaneously. Their approach also supports rolling shutter cameras. Wang et al. [
131] proposed a parallel calibration method for LiDAR–IMU and camera–IMU extrinsics, including convergence criteria and excitation evaluation metrics. They developed a user-friendly, targetless online calibration framework for camera–LiDAR–IMU systems, demonstrating notable improvements in both accuracy and efficiency compared to existing methods.
In contrast, Wang and Ma [
132] proposed a non-continuous-time approach with significantly improved efficiency. In the first stage, they estimated the camera–IMU extrinsics, time offset, and IMU bias to obtain reliable VIO results. In the second stage, the VIO trajectory was aligned with LiDAR odometry obtained using the iterative closest point (ICP) method to calibrate the LiDAR–IMU extrinsics. Their method, evaluated on the EuRoC dataset, achieved superior performance compared to VINS-Mono. The total calibration time, including both VIO and LiDAR–IMU initialization, ranged from 10 to 20 s, although the accuracy was slightly lower than that of offline methods.
Despite advances in multi-sensor fusion technologies, research on the joint calibration of LiDAR, cameras, and IMU systems remains limited.
Table 10 summarizes the characteristics and contributions of representative LVI calibration methods.
Table 11 presents the calibration performance of several camera–LiDAR–IMU methods. Compared with the dual-sensor calibration results (camera–IMU and LiDAR–IMU) presented in the other two tables, the accuracy of three-sensor configurations is generally comparable, but their convergence time is markedly longer. Robustness evaluations further reveal that current three-sensor calibration methods have not been adequately validated in challenging environments, such as low-texture scenes, dynamic occlusions, or adverse lighting conditions. Nevertheless, by integrating the visual detail of cameras, the three-dimensional structural perception of LiDAR, and the motion continuity of IMUs, three-sensor systems inherently provide a more comprehensive environmental perception capability, which in principle offers a natural advantage for calibration in complex scenarios. Future research directions include developing calibration methods tailored to specific challenging conditions (e.g., tunnels, rainy nights, or dynamic crowd interference), advancing online real-time calibration frameworks, and addressing generalized calibration across heterogeneous devices (e.g., combinations of consumer-grade and industrial-grade sensors) to fully exploit the potential of multimodal fusion.