Article

Online Calibration Method of LiDAR and Camera Based on Fusion of Multi-Scale Cost Volume

1 School of Automation, Wuhan University of Technology, Wuhan 430070, China
2 School of Automotive Engineering, Wuhan University of Technology, Wuhan 430070, China
3 School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Information 2025, 16(3), 223; https://doi.org/10.3390/info16030223
Submission received: 9 February 2025 / Revised: 6 March 2025 / Accepted: 12 March 2025 / Published: 13 March 2025

Abstract

Online calibration of the camera and LiDAR underpins multi-sensor fusion and is of great significance for autonomous driving perception. Existing online calibration algorithms struggle to achieve real-time performance and accuracy at the same time: high-precision algorithms place heavy demands on hardware, while lightweight algorithms rarely meet the accuracy requirements. Moreover, sensor noise, vibration, and changes in environmental conditions can degrade calibration accuracy. In addition, because of the large domain differences between public datasets, existing online calibration algorithms behave inconsistently across datasets and lack robustness. To address these problems, we propose an online calibration algorithm based on the fusion of multi-scale cost volumes. First, a multi-layer convolutional network downsamples and concatenates the camera RGB data and the LiDAR point cloud data to obtain feature maps at three scales. These feature maps then undergo feature concatenation and group-wise correlation to generate three sets of cost volumes at different scales. All cost volumes are subsequently concatenated and fed into the pose estimation module, and after post-processing, the translation and rotation between the camera and LiDAR coordinate systems are obtained. We tested and verified the method on the KITTI Odometry dataset, measuring an average translation error of 0.278 cm, an average rotation error of 0.020°, and a single-frame runtime of approximately 23 ms, which is competitive with state-of-the-art methods.

Graphical Abstract

1. Introduction

With the continuous development of machine vision technology, multi-sensor data fusion is increasingly used in the field of intelligent perception, such as autonomous driving, embodied intelligence, and drone navigation [1,2]. The necessary prerequisite for data fusion between sensors is accurate external coordinate calibration [3]. In the field of autonomous driving, environmental perception solutions based on the fusion of three-dimensional LiDAR and camera are widely used [4]. The camera provides rich color and texture information, but cannot directly obtain pixel depth. LiDAR can provide accurate spatial scale and depth information, but point cloud data are sparse due to their rotational scanning principle [5,6]. The data from the two sensors are highly complementary, and through fusion, more comprehensive target observation information can be obtained, thereby significantly improving the accuracy and reliability of environmental perception [7]. However, the sensor drifts and jitters during vehicle driving, causing the external parameter matrix to change, thereby reducing the performance of the perception algorithm [8]. Therefore, external calibration of the camera and LiDAR is required to ensure the spatial alignment between the two and the accuracy of data fusion [9].
The external parameter calibration methods of the camera and LiDAR can be divided into two modes, offline calibration and online calibration, which suit the needs of static scenes and dynamic application scenarios, respectively [10,11]. Offline calibration computes the external parameters of the camera and LiDAR once in a static environment. It relies on a specific calibration board or calibration tool, and the calibration result remains fixed [12]. Online calibration updates the external parameters in real time within a dynamic environment. It can adapt to changes in the relative positions of sensors and is suitable for complex application scenarios [13]. Compared to offline calibration, online calibration is more suitable for unmanned driving scenarios because it corrects the external parameter errors between sensors in real time, ensuring the stability and accuracy of the perception system [14]. With the advancement of unmanned driving technology, online calibration algorithms have evolved from traditional geometric feature-based matching methods to optimization-based iterative methods and have gradually shifted toward leveraging deep learning for feature extraction and external parameter prediction [15,16].
However, current deep learning-based camera–LiDAR online calibration algorithms still face a trade-off between real-time performance and accuracy, making it difficult to simultaneously meet the efficiency and accuracy requirements of unmanned driving scenarios [17]. Building on the above research, we adopt a more powerful backbone network to learn richer feature representations from sparse depth maps. Our model employs an online joint calibration method that fuses cost volumes [18], which not only adapts easily to specific datasets but also demonstrates strong generalization ability. The main contributions of this paper are summarized as follows:
  • We propose an end-to-end LiDAR–camera extrinsic calibration network that incorporates multi-scale feature extraction, multi-scale cost volume feature matching, and cost volume aggregation modules. A multi-scale cost volume cascade approach predicts the 6-DoF transformation parameters, thereby enhancing calibration accuracy.
  • By employing multi-scale feature extraction and a 3D Hourglass network for optimized feature fusion, the backbone network extracts camera and LiDAR features and utilizes a feature pyramid to process multi-scale information, enhancing adaptability to targets of varying scales. Meanwhile, the 3D Hourglass network processes the fused cost volume, preserving high-level semantic information and enhancing calibration accuracy.
  • The model’s calibration accuracy is evaluated on the KITTI dataset, demonstrating strong generalization ability. Through multiple iterations of training, the model effectively handles a larger error range.

2. Related Work

The purpose of camera–LiDAR calibration is to determine their spatial relationship, including relative translation and rotation, to accurately align the LiDAR 3D point cloud with the camera’s 2D image [19]. Existing multi-sensor extrinsic calibration systems can be categorized into three main types: target-based calibration, target-free calibration, and deep learning-based calibration [20]. In this section, we provide a detailed analysis and summary.

2.1. Calibration Methods Based on Specific Targets

Target-based methods require researchers to prepare calibration targets, which are typically planar and serve specific functions [21].
Zhang et al. [22] first proposed a method for calibrating a 2D LiDAR and a camera using a chessboard, establishing a benchmark for further research on 3D LiDAR–camera calibration. Unnikrishnan and Pandey et al. [23,24] extended the original 2D LiDAR–camera calibration method to 3D LiDAR. Kwak et al. [25] proposed a robust weighted extrinsic calibration algorithm that estimates extrinsic parameters by minimizing the distance between features projected onto the image plane. Park et al. [26] proposed a 2D-3D correspondence method using a polygonal calibration plate for multi-line LiDAR. The method estimates the plate’s vertices based on known adjacent edge lengths, enabling high-precision calibration of low-resolution 3D LiDAR and cameras. Since sensors can better distinguish diverse shapes when detecting targets, Dhall et al. [27] designed a hollow matrix calibration plate to obtain a total of eight inner and outer corner points for joint calibration. Veľas et al. [28] designed four calibration plates with circular holes as targets and introduced a new calibration method.
In summary, target-based calibration methods use a known geometric image or plane to design the calibration target’s geometric shape. Although these methods have proven effective in the past, they perform poorly under external conditions and are challenging to apply in highly dynamic environments or real-time applications.

2.2. Calibration Methods Without Specific Targets

With the increasing demand for autonomous driving applications, traditional target-based calibration methods are no longer applicable. As the vehicle moves, the relative position between the 3D LiDAR and the camera inevitably changes, leading to misjudgments. To address this limitation, an online LiDAR–camera calibration method is urgently needed. Scaramuzza et al. [29] proposed one of the earliest targetless calibration methods, leveraging natural scenes instead of chessboards for online calibration and employing the PnP algorithm to estimate extrinsic parameters. Pandey et al. [30] proposed that a relationship exists between laser point cloud reflections and image pixels. They modeled them as two random variables and optimized extrinsic parameters using a targetless calibration algorithm based on mutual information (MI). Taylor et al. [31] established a gradient relationship between the image and the point cloud by measuring gradient direction and estimated extrinsic parameters by extracting features from targets in a continuous frame sequence. Jiang et al. [32] utilized parallel line features in autonomous driving scenes to estimate extrinsic rotation parameters and determined extrinsic translation parameters using a point cloud-image fusion method. Li et al. [33] used bundle adjustment (BA) with plane constraints for optimization, integrating visual BA with LiDAR point cloud plane constraints into a unified optimization problem. They optimized the camera’s intrinsic and extrinsic parameters simultaneously and improved the accuracy of intrinsic calibration through depth constraints.
Traditional online calibration methods typically estimate extrinsic parameters using easily detectable and commonly occurring features in the scene. The accuracy of such calibration methods largely depends on the chosen target features. Additionally, these methods require careful hyperparameter selection and tuning, which limits the model’s generalization ability.

2.3. Calibration Method Based on Deep Learning

With the rapid advancement of computer vision technology, deep learning-based methods have gradually emerged for LiDAR–camera online calibration. RegNet, proposed by Schneider et al. [34], is the first convolutional neural network (CNN)-based calibration method to perform tasks such as feature extraction, feature matching, and global regression. It executes multiple CNNs in a single iteration but does not account for potential geometric constraints in the extrinsic calibration problem. CalibNet, proposed by Iyer et al. [35], uses photometric error and point cloud distance error as a joint loss function and does not directly regress calibration parameters during training. As a deep network with geometric supervision, it is trained to predict calibration parameters to address point cloud geometry issues. Cattaneo et al. [36] proposed CMRNet, a real-time method that employs PWC-Net as the correlation layer and processes each image frame independently without a tracking program. It matches features from different angles to predict extrinsic parameters. Most multi-sensor data matching algorithms perform well on a single dataset but fail to generalize to others. To address this, Shen et al. [37] proposed CFNet, a network based on cascade and fusion cost volume to enhance the robustness of stereo matching networks. To enhance the real-time performance and usability of camera–LiDAR automatic calibration algorithms, Lv et al. [38] proposed LCCNet, an end-to-end trainable online calibration network that predicts calibration parameters in real time and achieves excellent performance on the KITTI dataset.
Traditional deep learning-based methods reduce scene restrictions but lack generalization and require extra training for different datasets. To address this, Luo et al. [39] proposed CalibAnything, which leverages the SAM model to achieve zero additional training and optimizes point cloud consistency in image masks for accurate extrinsic calibration.
However, traditional methods often struggle with sparse feature maps, weak cross-modal associations, and inaccurate calibration parameters. To solve these issues, Xiao et al. [40] introduced CalibFormer, an end-to-end network for automatic LiDAR–camera calibration. It employs a multi-head association module to enhance feature correlation and utilizes a transformer architecture for precise parameter estimation.
Although the Transformer-based network has powerful feature modeling capabilities, its computational complexity is high, which affects real-time performance. In addition, the Calib-Anything method uses a large pre-trained model for segmentation. Although it does not require additional training, it has computational bottlenecks in resource-constrained environments and performs poorly in specific scenarios or unseen environments.
To further enhance the accuracy and generalization of camera–LiDAR online calibration, we propose an end-to-end online calibration network that integrates multi-scale cost volumes, aiming to achieve a better balance between computational efficiency, scene adaptability, and calibration accuracy.

3. Our Method

We input the raw camera and LiDAR data into a feature pyramid network to extract features, obtaining multi-scale group-wise (GW) and concatenated (CF) features. The GW features of corresponding camera and LiDAR scales are used to construct a multi-scale GWC volume, while the CF features form a multi-scale concatenated volume. Next, the GWC and concatenated volumes of the same scale are combined into a multi-scale cost volume. The hourglass module then fuses all volumes into a refined volume, which is decoupled through post-processing to obtain the rotation and translation matrices. The overall block diagram of the model is shown in Figure 1.

3.1. Multi-Scale Feature Extraction

In LiDAR–camera extrinsic calibration, feature extraction is a key step. Its goal is to extract highly discriminative and stable feature points from the data obtained by both sensors, ensuring accurate alignment of their spatial relationship, including translation and rotation. Since real-world scenes contain objects of varying sizes and depths, multi-scale feature extraction effectively handles these variations. By extracting features at multiple scales, the feature pyramid enhances network calibration across diverse environments and object sizes, reducing calibration errors caused by missing features at certain scales or insufficient information.
As shown in Figure 2, we employ a three-layer feature pyramid to extract features from the RGB images captured by the camera and the point cloud data collected by the LiDAR. The resulting feature maps, at 1/8, 1/16, and 1/32 of the original image size, are then forwarded to the subsequent network for further processing. Low-resolution feature maps capture global information, while high-resolution feature maps emphasize local details, facilitating correlation analysis between LiDAR and camera data. The detailed structure of the feature extraction process is shown in Figure 3. The backbone downsampling process follows the ResNet multi-layer convolutional structure, with the Mish activation function incorporated for enhancement:
f(x) = x \cdot \tanh(\ln(1 + e^{x})),
This function improves the performance and generalization of multi-layer feature extraction networks by smoothing nonlinear mappings and preserving negative information, thereby enhancing gradient flow and feature representation capabilities. Additionally, batch normalization (Batch Norm) is applied to normalize the output of the 2D convolutional layer, accelerating model convergence, stabilizing training, and mitigating gradient vanishing or explosion by standardizing feature distribution.
The number of input channels is determined by the sensor type. For the camera’s RGB data, the Feature_Extraction module uses three input channels to fully leverage color information for object and scene recognition. For LiDAR point cloud data, a single input channel is used, as LiDAR provides only 3D coordinate information without color. The coordinate data effectively represents an object’s 3D position and shape.
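For illustration, the following is a minimal PyTorch sketch of the kind of Conv–BatchNorm–Mish block and three-level pyramid described above; the module names, channel widths, and strides are our own assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class ConvBNMish(nn.Module):
    """Hypothetical building block: 2D convolution -> BatchNorm -> Mish activation."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Mish()  # f(x) = x * tanh(ln(1 + e^x))

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class FeatureExtraction(nn.Module):
    """Sketch of a three-level pyramid producing 1/8, 1/16, and 1/32 feature maps."""
    def __init__(self, in_ch):
        super().__init__()
        self.stem = nn.Sequential(ConvBNMish(in_ch, 32, 2), ConvBNMish(32, 32, 2), ConvBNMish(32, 64, 2))  # 1/8
        self.down16 = ConvBNMish(64, 128, 2)   # 1/16
        self.down32 = ConvBNMish(128, 192, 2)  # 1/32

    def forward(self, x):
        f8 = self.stem(x)
        f16 = self.down16(f8)
        f32 = self.down32(f16)
        return f8, f16, f32

# RGB images use 3 input channels, LiDAR depth maps use 1.
rgb_net, lidar_net = FeatureExtraction(3), FeatureExtraction(1)
f8_rgb, f16_rgb, f32_rgb = rgb_net(torch.randn(1, 3, 256, 512))
f8_lid, f16_lid, f32_lid = lidar_net(torch.randn(1, 1, 256, 512))
```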
Figure 2. Feature pyramid network structure.
Figure 3. Feature extraction.

3.2. Constructing Multi-Scale Cost Volumes

In camera–LiDAR feature matching, the cost volume stores the matching costs between the two datasets. Constructing the cost volume involves comparing image and point cloud features, evaluating matching quality, and optimizing results to achieve precise data alignment and depth estimation.
As shown in Figure 4, we constructed a multi-scale cost volume to capture features at different scales, enabling the extraction of both global information and fine details. This approach enhances matching accuracy and robustness, particularly in complex scenes, by better handling diverse depth information. Here, the cost volume consists of feature concatenation and group-wise correlation. The former primarily compresses and adjusts the number of channels in the feature map, while the latter preserves and increases the number of channels to enhance feature extraction and maintain high expressiveness. The feature concatenation cost volume expression is as follows:
V_{concat}^{i}(d_i, x, y, f) = f_{L}^{i}(x, y) \,\|\, f_{R}^{i}(x - d_i, y),
In the above formula, d_i represents the disparity value, \| represents the vector concatenation operation, f^i represents the feature extracted in the i-th stage, i = 0 corresponds to the original input image, and (x, y) represents the pixel coordinates. We use different disparity values for feature maps of different scales. The initial maximum disparity is set to 256: for the 1/8 feature map, d_i = 256/8; for the 1/16 feature map, d_i = 256/16; and for the 1/32 feature map, d_i = 256/32. The group-wise correlation calculation formula is as follows:
V_{gwc}^{i}(d_i, x, y, g) = \frac{1}{N_{c}^{i} / N_g} \left\langle f_{l}^{ig}(x, y), f_{r}^{ig}(x - d_i, y) \right\rangle,
where N_c represents the number of feature channels, N_g is the number of groups, and ⟨·,·⟩ represents the vector inner product. To fuse feature information from different data sources and expand the model’s capability in scene depth estimation, the GWC and CF cost volumes of the same scale must be concatenated. The combination formula is given by
V_{combine}^{i} = V_{concat}^{i} \,\|\, V_{gwc}^{i},
Here, the concat operation concatenates the three sets of GWC and concat volumes at corresponding scales along the channel dimension. Batch-normalized 3D convolution is then applied to generate three cost volumes at different scales:
Y = \mathrm{BN}(\mathrm{Conv3D}(X, w)),
where Conv3D represents 3D convolution, BN represents the batch normalization operation, X ∈ ℝ^{C_in × D × H × W} is the input tensor, C_in is the number of input channels, D × H × W is the spatial dimension of the input, w ∈ ℝ^{C_out × C_in × k_d × k_h × k_w} is the convolution kernel weight, C_out is the number of output channels, and k_d × k_h × k_w is the size of the convolution kernel.
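The construction described above can be sketched in PyTorch as follows; this is a simplified, dense stereo-style implementation under the stated disparity settings, and the helper names and channel sizes are illustrative assumptions.

```python
import torch

def concat_volume(f_l, f_r, max_disp):
    """Feature-concatenation cost volume: for each disparity d, concatenate
    one feature map with the other feature map shifted by d pixels."""
    b, c, h, w = f_l.shape
    vol = f_l.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            vol[:, :c, d] = f_l
            vol[:, c:, d] = f_r
        else:
            vol[:, :c, d, :, d:] = f_l[:, :, :, d:]
            vol[:, c:, d, :, d:] = f_r[:, :, :, :-d]
    return vol

def groupwise_volume(f_l, f_r, max_disp, num_groups):
    """Group-wise correlation cost volume: split channels into groups and
    take the normalized inner product of corresponding groups."""
    b, c, h, w = f_l.shape
    ch_per_group = c // num_groups
    vol = f_l.new_zeros(b, num_groups, max_disp, h, w)
    for d in range(max_disp):
        l = f_l[:, :, :, d:] if d > 0 else f_l
        r = f_r[:, :, :, :-d] if d > 0 else f_r
        corr = (l * r).view(b, num_groups, ch_per_group, h, -1).mean(dim=2)
        vol[:, :, d, :, d:] = corr
    return vol

# Combine both volumes along the channel dimension (one scale shown; the same
# construction is repeated for the 1/8, 1/16, and 1/32 feature maps).
f_cam, f_lid = torch.randn(1, 64, 32, 64), torch.randn(1, 64, 32, 64)
combined = torch.cat([concat_volume(f_cam, f_lid, 256 // 8),
                      groupwise_volume(f_cam, f_lid, 256 // 8, num_groups=8)], dim=1)
```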
Figure 4. Constructing multi-scale cost volumes.

3.3. Multi-Scale Cost Volume Aggregation

To further standardize and refine the cost volume, the 3D Hourglass network integrates multi-scale features from different image regions. The 3D Hourglass network is an encoder–decoder architecture based on 3D convolution. It alternates between top-down and bottom-up processing through a series of convolutional layers, effectively integrating global information and local details—particularly in tasks requiring multi-scale feature extraction. During encoding, higher-level features are progressively abstracted through downsampling, while during decoding, the spatial resolution of the feature map is restored through upsampling. The structure of the two-layer 3D Hourglass module we used is shown in Figure 5.
The Hourglass-1 module takes as input the cost volume at three different scales, concatenated with downsampled features via the concat operation to enhance high-level feature utilization. The Hourglass-2 module processes the output of the previous layer, performing multi-scale feature extraction and fusion separately on this tensor. Additionally, the FMish activation function is introduced during the upsampling process:
f(x) = x \cdot \tanh(\ln(1 + e^{\beta x})),
where β is a scaling parameter used to control the strength of nonlinearity. The FMish activation function integrates nonlinearity and differentiability, adapts to input amplitudes, and enhances training stability and deep network performance through smooth gradients.
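A minimal sketch of this activation, assuming a fixed (non-learnable) β, is shown below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FMish(nn.Module):
    """FMish as described in the text: f(x) = x * tanh(ln(1 + exp(beta * x))),
    where beta scales the strength of the nonlinearity (fixed here by assumption)."""
    def __init__(self, beta=1.0):
        super().__init__()
        self.beta = beta

    def forward(self, x):
        # softplus(beta * x) = ln(1 + exp(beta * x)), computed in a numerically stable way
        return x * torch.tanh(F.softplus(self.beta * x))
```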
The traditional Hourglass module has high computing resource requirements and limited feature expression capability. To improve model performance, spatial attention and channel attention mechanisms are inserted into the Hourglass module so that the model can focus more effectively on important spatial locations and feature channels, adaptively attend to useful information, and suppress redundant features, thereby improving its feature expression capability. The spatial attention mechanism can be expressed as follows:
M_S(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7 \times 7}([F_{avg}^{s}; F_{max}^{s}])\big),
where F represents the input feature map with shape C × H × W, denoting the number of channels, the height, and the width, respectively, and the sigmoid activation function σ compresses the spatial weights into the range [0, 1]. The formula of the channel attention mechanism is as follows:
M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F_{avg}^{c})) + W_1(W_0(F_{max}^{c}))\big),
In the above formula, F a v g c and F m a x c represent the global average pooling feature and the maximum pooling feature, respectively. To further reduce the consumption of computing resources, SCConv is used to reduce the spatial and channel redundancy between feature maps while promoting representative feature learning. SCConv consists of two units: the spatial reconstruction unit (SRU) and channel reconstruction unit (CRU). The SRU uses the separate-and-reconstruct method to suppress spatial redundancy, while the CRU uses the split-transform-and-fuse strategy to reduce channel redundancy. The specific structure of SCConv is shown in Figure 6.
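As an illustration of the two attention formulas above, the following CBAM-style sketch shows a channel attention module and a 2D spatial attention module; a 3D variant acting on the cost volume would use Conv3d instead, and the reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_C(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))) with a shared two-layer MLP."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        avg = self.mlp(x.flatten(2).mean(dim=2))        # global average pooling branch
        mx = self.mlp(x.flatten(2).max(dim=2).values)   # global max pooling branch
        w = torch.sigmoid(avg + mx).view(b, c, *([1] * (x.dim() - 2)))
        return x * w

class SpatialAttention(nn.Module):
    """M_S(F) = sigmoid(conv7x7([channel-wise AvgPool(F); MaxPool(F)]))."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.max(dim=1, keepdim=True).values
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
```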

3.4. Pose Estimation Module

We use the camera coordinate system as the reference to calibrate the LiDAR extrinsic parameters. The extrinsic calibration parameter calculation formula is as follows:
T_{LC} = T_{pred}^{-1} \cdot T_{init},
where T_pred represents the prediction result of the online calibration network, and T_init represents the initial calibration parameters. During pose estimation, the tensor and its convolution result are concatenated along the channel dimension, and this operation is repeated across five layers. This increases the number of channels and enriches the feature representation. It also improves gradient flow, particularly during training, helping to prevent gradient vanishing and enhancing both training speed and model generalization. The structure of the pose estimation module is shown in Figure 7.
Finally, the rotation and translation matrices are estimated. The feature vector from the previous step is fused, and dropout is applied to randomly set some neuron outputs to zero with a certain probability, preventing overfitting to the training data. The feature vector then undergoes forward propagation through a fully connected layer activated by the LeakyReLU function. Finally, the feature map is passed to two separate fully connected layers, normalized, and used to obtain the translation and rotation matrices.
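A hedged sketch of such a regression head is given below; the layer sizes and dropout rate are illustrative assumptions, not the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseHead(nn.Module):
    """Hypothetical regression head: a fused feature vector passes through dropout
    and a LeakyReLU-activated fully connected layer, then splits into two branches
    that output a translation vector and a normalized unit quaternion."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Dropout(p=0.3),
            nn.Linear(in_dim, hidden),
            nn.LeakyReLU(0.1),
        )
        self.fc_transl = nn.Linear(hidden, 3)   # x, y, z translation
        self.fc_rot = nn.Linear(hidden, 4)      # quaternion (w, x, y, z)

    def forward(self, feat):
        h = self.fc(feat)
        t = self.fc_transl(h)
        q = F.normalize(self.fc_rot(h), dim=-1)  # normalize to a valid unit quaternion
        return t, q
```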

3.5. Loss Function and Training Strategy

We use a hybrid loss function to measure the difference between the model’s predictions and the ground truth: a Smooth L1 loss is applied to the translation, while a quaternion angular loss is used for the rotation. The Smooth L1 function computes the translation loss:
L(x, y) = \begin{cases} 0.5 \cdot (x - y)^2, & \text{if } |x - y| < 1 \\ |x - y| - 0.5, & \text{otherwise} \end{cases}
In the above formula, x and y represent the predicted value and the true value, and |x − y| represents the difference between the two. When |x − y| < 1, the squared error is used; otherwise, the linear error is used, which makes the Smooth L1 function respond more smoothly to outliers. Since the rotation parameters are represented by quaternions, the angular distance is used to calculate the difference between the true value and the predicted value:
D_a(q_g, q_p) = 2 \cdot \arccos(|q_g \cdot q_p|)
where q_g and q_p represent the true value and the predicted value of the rotation parameter, respectively. The final total loss function is as follows:
L_{total} = K \cdot (R_t \times L_{transl} + R_r \times L_{rot})
Here, K is the point cloud weight coefficient, R_t is the translation scaling coefficient, and R_r is the rotation scaling coefficient.
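The hybrid loss can be sketched as follows; the weight values are placeholders.

```python
import torch
import torch.nn.functional as F

def calibration_loss(t_pred, t_gt, q_pred, q_gt, k=1.0, r_t=1.0, r_r=1.0):
    """Hybrid loss sketch: Smooth L1 on translation plus quaternion angular distance
    on rotation, weighted by the scaling coefficients (placeholder values)."""
    # F.smooth_l1_loss with the default beta=1.0 matches the piecewise definition above
    l_transl = F.smooth_l1_loss(t_pred, t_gt)
    # angular distance D_a = 2 * arccos(|q_g . q_p|), clamped for numerical safety
    dot = torch.sum(q_pred * q_gt, dim=-1).abs().clamp(max=1.0 - 1e-7)
    l_rot = (2.0 * torch.acos(dot)).mean()
    return k * (r_t * l_transl + r_r * l_rot)
```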
We draw inspiration from LCCNet and CalibNet and train multiple networks on different miscalibration ranges to enhance calibration accuracy. The translation ranges (±1.5 m to ±0.1 m) and rotation ranges (±20° to ±1°) are chosen. First, the network with the largest range (±1.5 m, ±20°) is used for preliminary predictions. The results are then fed as input to networks trained with progressively smaller ranges (e.g., ±1.0 m, ±10°), applied iteratively. In each iteration, the LiDAR point cloud is reprojected, generating a new depth image for the next prediction.
T_{LC} = (T_0 \cdot T_1 \cdots T_5)^{-1} \cdot T_{init}
In the above formula, T_i is the external parameter matrix obtained by training with different error ranges, and T_init is the initial external parameter before calibration.
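The coarse-to-fine inference loop can be sketched as follows; project_to_depth and the network interface are hypothetical helpers standing in for the reprojection and prediction steps.

```python
import torch

def iterative_calibration(networks, project_to_depth, point_cloud, image, t_init):
    """Sketch of the coarse-to-fine inference loop. Each network (trained on a
    progressively smaller miscalibration range) predicts a 4x4 correction T_i from
    the image and the depth map reprojected with the current extrinsic estimate."""
    t_current = t_init.clone()
    for net in networks:                      # ordered from ±1.5 m/±20° down to ±0.1 m/±1°
        depth = project_to_depth(point_cloud, t_current, image.shape)  # reproject LiDAR
        t_pred = net(image, depth)                                     # predicted correction T_i
        t_current = torch.inverse(t_pred) @ t_current                  # apply the correction
    # after all stages, t_current equals (T_0 · T_1 · ... · T_n)^(-1) · T_init
    return t_current
```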

4. Experiments and Analysis

4.1. Dataset

To ensure the experiment’s fairness and reproducibility and to evaluate the accuracy of online extrinsic calibration in a controlled but challenging environment, we use the KITTI Odometry dataset instead of collecting camera and LiDAR data ourselves. The KITTI Odometry dataset is widely used for visual positioning and odometry estimation, particularly in computer vision and autonomous driving. It provides extensive data collected from a vehicle’s perspective in urban road scenes, including RGB images, LiDAR point clouds, camera poses, timestamps, and precise sensor calibration (intrinsic and extrinsic parameters of the camera and LiDAR). The dataset covers diverse conditions, such as varying weather, road types, traffic, and lighting. It includes 21 sequences: sequences 01–20 are used for training and validation, while sequence 00 is reserved for testing. The test set is spatially independent of the training set, with minimal overlap.
To expand the training data, we introduce a random deviation ΔT within a reasonable range into the LiDAR–camera extrinsic calibration matrix, generating a large number of training samples. The initial extrinsic parameters with the added random deviation are given by
T_{init} = \Delta T \cdot T_{LC}
The external parameter T_LC is defined as the Euclidean transformation from LiDAR to camera, and T_init is used for training and validation.
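A sketch of this sample-generation step is shown below, assuming uniform sampling of the translation offsets and Euler angles; the authors' exact sampling scheme may differ.

```python
import math
import torch

def random_miscalibration(t_lc, max_t=1.5, max_r_deg=20.0):
    """Generate a training sample by perturbing the ground-truth extrinsics:
    T_init = dT @ T_LC, where dT has a uniform random translation within ±max_t
    metres and uniform random Euler angles within ±max_r_deg degrees."""
    roll, pitch, yaw = [math.radians((2 * torch.rand(1).item() - 1) * max_r_deg) for _ in range(3)]
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    rz = torch.tensor([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    ry = torch.tensor([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rx = torch.tensor([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    delta = torch.eye(4)
    delta[:3, :3] = rz @ ry @ rx                     # random rotation offset
    delta[:3, 3] = (torch.rand(3) * 2 - 1) * max_t   # random translation offset
    return delta @ t_lc                              # T_init used for training/validation
```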

4.2. Evaluation Metrics

The calibrated translation vectors of the camera and the LiDAR are denoted as the 3D vectors t_camera and t_LiDAR. The Euclidean distance between the two vectors evaluates the translation error, calculated as follows:
E_t = \| t_{camera} - t_{LiDAR} \|_2
where ‖·‖_2 represents the 2-norm of the vector. The translation error is computed along the X, Y, and Z directions. The rotation error of the extrinsic calibration is measured using the quaternion angular distance D_a(q_g, q_p). To assess angular errors in three degrees of freedom, the extrinsic rotation matrix is converted into Euler angles, and the errors in Roll, Pitch, and Yaw are calculated.
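The two metrics can be computed as follows.

```python
import torch

def translation_error(t_camera, t_lidar):
    """E_t = ||t_camera - t_lidar||_2, plus per-axis absolute errors along X, Y, Z."""
    e_t = torch.norm(t_camera - t_lidar, p=2)
    per_axis = (t_camera - t_lidar).abs()
    return e_t, per_axis

def rotation_error(q_gt, q_pred):
    """Quaternion angular distance D_a = 2 * arccos(|q_g . q_p|), returned in degrees."""
    dot = torch.abs(torch.dot(q_gt, q_pred)).clamp(max=1.0)
    return torch.rad2deg(2.0 * torch.acos(dot))
```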

4.3. Experimental Setup

In this experiment, the training process runs for 120 iterations with a base learning rate of 3 × 10^{-4}. The maximum translation and rotation error combinations (max_t, max_r) are set to (0.1 m, 1°), (1.5 m, 20°), (1 m, 10°), (0.5 m, 5°), and (0.2 m, 2°). Training on a single GPU uses a batch size of 10. The test system consists of an Intel Xeon W-2154 CPU and an RTX 2080 Ti GPU, with PyTorch 1.10 and CUDA 10.2 installed.

4.4. Results and Discussion

In this experiment, the model was trained using the left camera and left LiDAR data from the KITTI dataset and tested on the right camera and right LiDAR data. “Mean”, “median”, and “standard deviation (std)” describe the dataset’s distribution and dispersion. The results demonstrate that the trained model outperforms similar methods on the test set.

4.4.1. Quantitative Results

The model performance is tested on data with an initial error of (±20°, ±1.5 m). We train multiple networks on different miscalibration ranges, where the translation error ranges are {±1.5 m, ±1 m, ±0.5 m, ±0.2 m, ±0.1 m} and the rotation error ranges are {±20°, ±10°, ±5°, ±2°, ±1°}. Test sets with different initial error ranges are then evaluated, and the results are shown in Table 1.
Analysis of the data in the table shows that, with iterative training, both the rotation and translation errors gradually decrease. The model trained with (±1°, ±0.2 m) yields the best performance, with the average translation error kept within 1 cm and the average rotation error on the order of 0.1°. Additionally, the superiority of this method is demonstrated by comparing it with other deep learning-based online calibration networks on the multi-frame calibration task. The multi-frame detection method analyzes the calibration predictions across multiple frames and uses their median as the final estimate of the external parameters, which effectively suppresses random errors and noise in single-frame predictions.
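A minimal sketch of this multi-frame aggregation is shown below, assuming a component-wise median over the per-frame translations and quaternions followed by re-normalization.

```python
import torch
import torch.nn.functional as F

def multi_frame_estimate(per_frame_translations, per_frame_quaternions):
    """Aggregate per-frame predictions by taking the element-wise median as the
    final extrinsic estimate, which suppresses random errors in individual frames."""
    t_final = torch.median(torch.stack(per_frame_translations), dim=0).values
    q_final = torch.median(torch.stack(per_frame_quaternions), dim=0).values
    return t_final, F.normalize(q_final, dim=-1)  # renormalize the median quaternion
```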
As shown in Table 2, for an initial error of (±20°, ±1.5 m), the mean translation error of this model is an order of magnitude smaller than that of traditional learning-based calibration methods such as RegNet and CalibNet, and it outperforms LCCNet. The mean rotation error also demonstrates superior performance, with the best accuracy in the pitch direction. Compared with the Transformer-based CalibFormer method, our scheme has slightly lower accuracy in translation and rotation error; however, considering the single-frame processing time, our scheme performs better.

4.4.2. Ablation Study

To verify the contribution of each key module to the calibration accuracy and analyze the impact of the different components, we design an ablation experiment to demonstrate the effectiveness and necessity of the proposed method. Performance is evaluated under a deviation of (±0.5 m / ±5°). The results of the ablation experiments are shown in Table 3.
According to the results, removing the multi-scale features increases the translation error to 1.058 cm and the rotation error to 0.084°, indicating that multi-scale features contribute to calibration accuracy. Removing the multi-scale cost volume has a greater impact, raising the translation error to 1.213 cm and the rotation error to 0.099°, indicating that this module is crucial for optimizing the calibration parameters. Removing cost volume aggregation or hierarchical feature fusion increases the errors only slightly, so their impact is relatively minor. The results without the hybrid loss function show the rotation error rising to 0.074°, indicating that this loss contributes to stability. From the perspective of the trade-off between computational efficiency and accuracy, removing the multi-scale cost volume has the most significant impact: although it saves some computational resources, the resulting error increase confirms the module’s importance for calibration accuracy. In contrast, removing hierarchical feature fusion significantly reduces the amount of computation while increasing the error only slightly, so this module can be appropriately trimmed in computationally limited scenarios.

4.4.3. Visualizing the Calibration Effect

The original parameters, calibrated true values, and prediction results of this network for three different scenes in the KITTI Odometry dataset are presented in Figure 8.
The first column shows the state before calibration, where a significant deviation in the alignment between the point cloud and the image is evident. The point cloud projection onto the image fails to align correctly, indicating large errors in the calibration parameters. The second column shows the true calibration values, obtained through high-precision offline or manual calibration methods. In this column, the point cloud and image are perfectly aligned, representing the ideal calibration state. The third column displays the online calibration results predicted by the model. Comparing these results with the true values reveals that the model’s online calibration closely matches the true calibration. The point cloud and the image are well aligned in most areas, with only minor deviations in certain details.

5. Conclusions

We propose a novel method for the online calibration of external parameters between camera and LiDAR based on deep learning. The calibration network consists primarily of feature extraction, multi-scale cost volume construction and splicing, and pose estimation. To enhance the adaptability of the calibration network, we utilize multi-scale, low-resolution cost volumes to cover diverse receptive fields, guiding the network to focus on image regions of varying scales. This approach enables the network to handle different datasets and road scenes effectively. Comparative experiments demonstrate that our method performs well on the KITTI dataset while also achieving higher efficiency. In future work, we plan to incorporate an attention mechanism to further improve both the feature extraction and pose estimation networks, while reducing parameter count and computational time.

Author Contributions

Conceptualization, J.L. and Y.W.; methodology, X.H.; software, X.H.; validation, X.W., X.H. and Y.W.; formal analysis, Y.W.; investigation, J.L.; resources, J.L.; data curation, X.H.; writing—original draft preparation, X.H.; writing—review and editing, X.W.; visualization, X.W.; supervision, J.L.; project administration, Y.W.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author. The KITTI Odometry dataset used in this study is available for download from https://www.cvlibs.net/datasets/kitti/eval_odometry.php (accessed on 8 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yeong, D.J.; Velasco-Hernandez, G.; Barry, J.; Walsh, J. Sensor and sensor fusion technology in autonomous vehicles: A review. Sensors 2021, 21, 2140. [Google Scholar] [CrossRef] [PubMed]
  2. Duan, J.; Yu, S.; Tan, H.L.; Zhu, H.; Tan, C. A survey of embodied ai: From simulators to research tasks. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 6, 230–244. [Google Scholar] [CrossRef]
  3. Zhuang, Y.; Sun, X.; Li, Y.; Huai, J.; Hua, L.; Yang, X.; Cao, X.; Zhang, P.; Cao, Y.; Qi, L.; et al. Multi-sensor integrated navigation/positioning systems using data fusion: From analytics-based to learning-based approaches. Inf. Fusion 2023, 95, 62–90. [Google Scholar] [CrossRef]
  4. Zhong, H.; Wang, H.; Wu, Z.; Zhang, C.; Zheng, Y.; Tang, T. A survey of LiDAR and camera fusion enhancement. Procedia Comput. Sci. 2021, 183, 579–588. [Google Scholar] [CrossRef]
  5. Wang, P. Research on comparison of LiDAR and camera in autonomous driving. J. Phys. Conf. Ser. 2021, 2093, 012032. [Google Scholar] [CrossRef]
  6. Grammatikopoulos, L.; Papanagnou, A.; Venianakis, A.; Kalisperakis, I.; Stentoumis, C. An effective camera-to-LiDAR spatiotemporal calibration based on a simple calibration target. Sensors 2022, 22, 5576. [Google Scholar] [CrossRef]
  7. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. Deepfusion: LiDAR-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17182–17191. [Google Scholar]
  8. Li, X.; Xiao, Y.; Wang, B.; Ren, H.; Zhang, Y.; Ji, J. Automatic targetless LiDAR–camera calibration: A survey. Artif. Intell. Rev. 2023, 56, 9949–9987. [Google Scholar] [CrossRef]
  9. Wang, L.; Huang, Y. LiDAR–camera fusion for road detection using a recurrent conditional random field model. Sci. Rep. 2022, 12, 11320. [Google Scholar] [CrossRef]
  10. Roriz, R.; Cabral, J.; Gomes, T. Automotive LiDAR technology: A survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 6282–6297. [Google Scholar] [CrossRef]
  11. Yan, G.; He, F.; Shi, C.; Wei, P.; Cai, X.; Li, Y. Joint camera intrinsic and LiDAR-camera extrinsic calibration. In Proceedings of the 2023 IEEE International Conference On Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 11446–11452. [Google Scholar]
  12. Zhu, J.; Xue, J.; Zhang, P. Calibdepth: Unifying depth map representation for iterative LiDAR-camera online calibration. In Proceedings of the 2023 IEEE International Conference On Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 726–733. [Google Scholar]
  13. Zhu, J.; Li, H.; Zhang, T. Camera, LiDAR, and imu based multi-sensor fusion slam: A survey. Tsinghua Sci. Technol. 2023, 29, 415–429. [Google Scholar] [CrossRef]
  14. Liu, Z.; Chen, Z.; Wei, X.; Chen, W.; Wang, Y. External Extrinsic Calibration of Multi-modal Imaging Sensors: A Review. IEEE Access 2023, 11, 110417–110441. [Google Scholar] [CrossRef]
  15. Guo, Z.; Xiao, Z. Research on online calibration of LiDAR and camera for intelligent connected vehicles based on depth-edge matching. Nonlinear Eng. 2021, 10, 469–476. [Google Scholar] [CrossRef]
  16. Yan, G.; Liu, Z.; Wang, C.; Shi, C.; Wei, P.; Cai, X.; Ma, T.; Liu, Z.; Zhong, Z.; Liu, Y.; et al. Opencalib: A multi-sensor calibration toolbox for autonomous driving. Softw. Impacts 2022, 14, 100393. [Google Scholar] [CrossRef]
  17. Ye, C.; Pan, H.; Gao, H. Keypoint-based LiDAR-camera online calibration with robust geometric network. IEEE Trans. Instrum. Meas. 2021, 71, 1–11. [Google Scholar] [CrossRef]
  18. Shen, Z.; Dai, Y.; Song, X.; Rao, Z.; Zhou, D.; Zhang, L. Pcw-net: Pyramid combination and warping cost volume for stereo matching. In Proceedings of the European Conference On Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 280–297. [Google Scholar]
  19. Beltrán, J.; Guindel, C.; De La Escalera, A.; García, F. Automatic extrinsic calibration method for LiDAR and camera sensor setups. IEEE Trans. Intell. Transp. Syst. 2022, 23, 17677–17689. [Google Scholar] [CrossRef]
  20. Ou, J.; Huang, P.; Zhou, J.; Zhao, Y.; Lin, L. Automatic extrinsic calibration of 3D LiDAR and multi-cameras based on graph optimization. Sensors 2022, 22, 2221. [Google Scholar] [CrossRef]
  21. Huang, H.; Zhang, M.; Li, L.; Hu, J.; Wang, H. GTSCalib: Generalized Target Segmentation for Target-Based Extrinsic Calibration of Non-Repetitive Scanning LiDAR and Camera. IEEE Trans. Autom. Sci. Eng. 2024, 22, 3648–3660. [Google Scholar] [CrossRef]
  22. Zhang, Q.; Pless, R. Extrinsic calibration of a camera and laser range finder (improves camera calibration). In Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sendai, Japan, 28 September–2 October 2004; (IEEE Cat. No. 04CH37566). IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 2301–2306. [Google Scholar]
  23. Unnikrishnan, R.; Hebert, M. Fast Extrinsic Calibration of a Laser Rangefinder to a Camera; Tech. Rep. CMU-RI-TR-05-09; Robotics Institute: Pittsburgh, PA, USA, 2005. [Google Scholar]
  24. Pandey, G.; McBride, J.; Savarese, S.; Eustice, R. Extrinsic calibration of a 3d laser scanner and an omnidirectional camera. IFAC Proc. Vol. 2010, 43, 336–341. [Google Scholar] [CrossRef]
  25. Kwak, K.; Huber, D.F.; Badino, H.; Kanade, T. Extrinsic calibration of a single line scanning LiDAR and a camera. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, Francisco, CA, USA, 25–30 September 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 3283–3289. [Google Scholar]
  26. Park, Y.; Yun, S.; Won, C.S.; Cho, K.; Um, K.; Sim, S. Calibration between color camera and 3d LiDAR instruments with a polygonal planar board. Sensors 2014, 14, 5333–5353. [Google Scholar] [CrossRef]
  27. Dhall, A.; Chelani, K.; Radhakrishnan, V.; Krishna, K. LiDAR-camera calibration using 3d-3d point correspondences. arXiv 2017, arXiv:1705.09785. [Google Scholar]
  28. Velas, M.; Španěl, M.; Materna, Z.; Herout, A. Calibration of Rgb Camera with Velodyne LiDAR. 2014. Available online: https://www.fit.vut.cz/research/publication-file/10578/Calibration_of_RGB_Camera_With_Velodyne_LiDAR.pdf (accessed on 8 February 2025).
  29. Scaramuzza, D.; Harati, A.; Siegwart, R. Extrinsic self calibration of a camera and a 3d laser range finder from natural scenes. In Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, 29 October–2 November 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 4164–4169. [Google Scholar]
  30. Pandey, G.; McBride, J.; Savarese, S.; Eustice, R. Automatic targetless extrinsic calibration of a 3d LiDAR and camera by maximizing mutual information. In Proceedings of the AAAI Conference on Artificial Intelligence, Toronto, ON, Canada, 22–23 July 2012; Volume 26, pp. 2053–2059. [Google Scholar]
  31. Taylor, Z.; Nieto, J. Motion-based calibration of multimodal sensor extrinsics and timing offset estimation. IEEE Trans. Robot. 2016, 32, 1215–1229. [Google Scholar] [CrossRef]
  32. Jiang, J.; Xue, P.; Chen, S.; Liu, Z.; Zhang, X.; Zheng, N. Line feature based extrinsic calibration of LiDAR and camera. In Proceedings of the 2018 IEEE International Conference on Vehicular Electronics and Safety (ICVES), Madrid, Spain, 12–14 September 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
  33. Li, L.; Li, H.; Liu, X.; He, D.; Miao, Z.; Kong, F.; Li, R.; Liu, Z.; Zhang, F. Joint intrinsic and extrinsic lidar-camera calibration in targetless environments using plane-constrained bundle adjustment. arXiv 2023, arXiv:2308.12629. [Google Scholar]
  34. Schneider, N.; Piewak, F.; Stiller, C.; Franke, U. Regnet: Multimodal sensor registration using deep neural networks. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1803–1810. [Google Scholar]
  35. Iyer, G.; Ram, R.K.; Murthy, J.K.; Krishna, K.M. Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1110–1117. [Google Scholar]
  36. Cattaneo, D.; Vaghi, M.; Ballardini, A.L.; Fontana, S.; Sorrenti, D.G.; Burgard, W. Cmrnet: Camera to LiDAR-map registration. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1283–1289. [Google Scholar]
  37. Shen, Z.; Dai, Y.; Rao, Z. Cfnet: Cascade and fused cost volume for robust stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13906–13915. [Google Scholar]
  38. Lv, X.; Wang, B.; Dou, Z.; Ye, D.; Wang, S. Lccnet: LiDAR and camera self-calibration using cost volume network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 2894–2901. [Google Scholar]
  39. Luo, Z.; Yan, G.; Li, Y. Calib-anything: Zero-training lidar-camera extrinsic calibration method using segment anything. arXiv 2023, arXiv:2306.02656. [Google Scholar]
  40. Xiao, Y.; Li, Y.; Meng, C.; Li, X.; Ji, J.; Zhang, Y. Calibformer: A transformer-based automatic lidar-camera calibration network. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 16714–16720. [Google Scholar]
Figure 1. Overall architecture. Camera and LiDAR data undergo feature extraction and cost volume construction to generate multi-scale volumes, which are then fused using the hourglass module and post-processed to obtain the translation and rotation matrices.
Figure 5. Three-dimensional Hourglass module combined with attention.
Figure 6. SCConv module.
Figure 7. Hierarchical feature fusion network.
Figure 8. Online calibration effect visualization. (a) Initial state, (b) ground truth, (c) calibration result.
Table 1. Multi-scale iterative training test results when the initial error is (±20°, ±1.5 m).
Results of Multi-Range     Metric   Et (cm)   X (cm)   Y (cm)   Z (cm)   ER (°)   Roll (°)   Pitch (°)   Yaw (°)
After 20°/1.5 m network    Mean     18.617    3.327    6.423    4.916    1.455    0.175      0.215       0.349
                           Median   15.081    2.517    5.326    4.150    0.963    0.138      0.288       0.239
                           std      15.079    2.671    2.888    3.847    1.847    0.157      0.173       0.226
After 10°/1.0 m network    Mean     4.092     1.110    1.186    1.089    0.611    0.165      0.140       0.069
                           Median   3.518     0.453    0.539    0.526    0.441    0.111      0.119       0.031
                           std      2.743     1.161    1.298    1.339    1.311    0.098      0.041       0.066
After 5°/0.5 m network     Mean     2.145     0.503    0.554    1.350    0.264    0.045      0.067       0.079
                           Median   1.674     0.506    0.770    1.819    0.137    0.030      0.025       0.088
                           std      1.798     0.177    0.414    0.975    1.330    0.053      0.080       0.030
After 2°/0.2 m network     Mean     1.234     0.361    0.496    0.334    0.191    0.064      0.040       0.045
                           Median   0.886     0.394    0.486    0.331    0.075    0.082      0.021       0.044
                           std      1.285     0.205    0.092    0.075    1.340    0.034      0.045       0.014
After 1°/0.1 m network     Mean     0.841     0.235    0.326    0.274    0.178    0.030      0.008       0.033
                           Median   0.542     0.188    0.294    0.244    0.063    0.026      0.007       0.024
                           std      1.077     0.130    0.058    0.057    1.343    0.019      0.004       0.033
Table 2. Comparison of learning-based calibration networks. Translation absolute errors are in cm, rotation absolute errors in degrees.
Method           Miscalibrated Range   Mean (cm)   X (cm)   Y (cm)   Z (cm)   Mean (°)   Roll (°)   Pitch (°)   Yaw (°)   Latency/ms
CalibNet         ±0.2 m/±10°           4.340       4.200    1.600    7.220    0.410      0.180      0.900       0.150     29.66
CalibFormer      ±0.25 m/±10°          1.188       1.101    0.902    1.561    0.141      0.076      0.259       0.087     31.47
Calib-Anything   ±0.2 m/±10°           1.027       1.027    0.742    1.313    0.136      0.079      0.229       0.099     32.15
Ours             ±0.25 m/±10°          1.101       1.044    0.817    1.443    0.138      0.081      0.232       0.101     23.54
RegNet           ±1.5 m/±20°           6           7        7        4        0.280      0.240      0.250       0.360     30.65
LCCNet           ±1.5 m/±20°           0.297       0.262    0.271    0.357    0.017      0.020      0.012       0.019     24.53
PCBA [33]        ±1.5 m/±20°           0.305       0.249    0.323    0.343    0.019      0.022      0.012       0.023     29.79
Ours             ±1.5 m/±20°           0.278       0.235    0.326    0.274    0.020      0.030      0.008       0.022     23.91
Table 3. Ablation experiments on the KITTI Odometry dataset.
Network Architecture              Params/M   FLOPs/G   Inference Time/ms   Translation Error/cm   Rotation Error/°
w/o multi-scale feature           18.7       62.5      17.5                1.058                  0.084
w/o multi-scale cost volume       19.1       63.8      18.8                1.213                  0.099
w/o cost volume aggregation       19.4       70.5      19.5                1.151                  0.093
w/o hierarchical feature fusion   18.8       61.1      18.4                0.930                  0.078
w/o hybrid loss function          20.2       77.4      22.6                0.997                  0.074
w/ all modules                    20.5       78.3      23.3                0.864                  0.068
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Han, X.; Luo, J.; Wei, X.; Wang, Y. Online Calibration Method of LiDAR and Camera Based on Fusion of Multi-Scale Cost Volume. Information 2025, 16, 223. https://doi.org/10.3390/info16030223

AMA Style

Han X, Luo J, Wei X, Wang Y. Online Calibration Method of LiDAR and Camera Based on Fusion of Multi-Scale Cost Volume. Information. 2025; 16(3):223. https://doi.org/10.3390/info16030223

Chicago/Turabian Style

Han, Xiaobo, Jie Luo, Xiaoxu Wei, and Yongsheng Wang. 2025. "Online Calibration Method of LiDAR and Camera Based on Fusion of Multi-Scale Cost Volume" Information 16, no. 3: 223. https://doi.org/10.3390/info16030223

APA Style

Han, X., Luo, J., Wei, X., & Wang, Y. (2025). Online Calibration Method of LiDAR and Camera Based on Fusion of Multi-Scale Cost Volume. Information, 16(3), 223. https://doi.org/10.3390/info16030223
