Article

Online Extrinsic Calibration on LiDAR-Camera System with LiDAR Intensity Attention and Structural Consistency Loss

1 School of Electrical and Information Engineering, Wuhan Institute of Technology, Wuhan 430072, China
2 School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430072, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2022, 14(11), 2525; https://doi.org/10.3390/rs14112525
Submission received: 19 April 2022 / Revised: 16 May 2022 / Accepted: 23 May 2022 / Published: 25 May 2022

Abstract

Extrinsic calibration of a LiDAR-camera system is an essential task for advanced perception applications in intelligent vehicles. In the offline setting, calibration object based methods can estimate the extrinsic parameters with high precision. However, during long-term operation of the LiDAR-camera system in real scenarios, the relative pose of the LiDAR and the camera undergoes small but accumulating drift, so the offline calibration result becomes inaccurate. To correct the extrinsic parameters conveniently, we present a deep learning based online extrinsic calibration method in this paper. From the Lambertian reflection model, we find that an object with higher LiDAR intensity is more likely to exhibit salient RGB features. Based on this observation, we present a LiDAR intensity attention based backbone network (LIA-Net) to extract significant co-observed calibration features from the LiDAR data and the RGB image. In the later stage of training, the loss on the extrinsic parameters changes slowly, which risks vanishing gradients and limits training efficiency. To deal with this issue, we present a structural consistency (SC) loss that minimizes the difference between the projected LiDAR images (i.e., the LiDAR depth image and the LiDAR intensity image) and their ground truth (GT) counterparts, so that LiDAR points and RGB pixels are accurately aligned. With LIA-Net and the SC loss, we present the convolutional neural network (CNN) based calibration network LIA-SC-Net. Comparison experiments on the KITTI dataset demonstrate that LIA-SC-Net achieves more accurate calibration results than state-of-the-art learning based methods while running in real time. Ablation studies also show the effectiveness of the proposed modules.

Graphical Abstract

1. Introduction

The system consisting of Light Detection and Ranging (LiDAR) and an optical camera has played an important role in recent years [1]. This system, called a LiDAR-camera system, is the core multi-sensor system for intelligent vehicles [2]. Its main advantage is that it provides both structural (3D LiDAR point cloud) and textural (2D RGB image) information about the surroundings [3], which can be used to improve the performance of advanced perception tasks (e.g., 3D object detection [4], 3D point cloud semantic segmentation [5]) for autonomous driving.
To utilize the LiDAR point cloud and the RGB image efficiently [6], it is essential to find the alignment of a point p in the LiDAR point cloud with its corresponding pixel I in the RGB image, such that p and I satisfy the projection constraint [7]. Theoretically, this alignment problem can be solved by point cloud and image registration [8]. In the context of autonomous driving, the mechanical structure of the LiDAR and the camera is fixed, and under ideal conditions their relative pose is constant. This means the registration problem can be simplified and solved by calibrating the LiDAR-camera system, i.e., estimating all of its parameters: (i) the intrinsic parameters of the camera, (ii) the intrinsic parameters of the LiDAR, and (iii) the extrinsic parameters of the LiDAR-camera system. First, the intrinsic parameters of the camera consist of the intrinsic matrix K and the lens distortion parameters D. In the offline situation, using a rectangular board with chessboard patterns as a calibration object, the user can place the camera at different positions to observe the calibration object and use nonlinear optimization to estimate K and D by minimizing the reprojection errors of the corner points extracted from the chessboard patterns [7]. K and D remain essentially constant during online operation. Second, the intrinsic parameters of the LiDAR are more complex than those of the camera because they are related to the mechanical structure of the LiDAR; they are generally calibrated carefully by the LiDAR manufacturer and are also constant in practice. Third, the extrinsic parameters of the LiDAR-camera system describe the rigid transformation between the LiDAR coordinate system and the camera coordinate system, represented by the rotation matrix R and the translation vector T. Thanks to the rapid development of computer vision and LiDAR manufacturing techniques, the intrinsic parameters of the LiDAR and the camera can be calibrated with high precision. However, due to the limited angular resolution of the LiDAR, the LiDAR point cloud is sparse, which makes extrinsic calibration of the LiDAR-camera system a challenging problem: co-observed features are difficult to find between the sparse LiDAR point cloud and the dense RGB image [9,10]. Therefore, this paper mainly discusses extrinsic calibration of the LiDAR-camera system.
The key to extrinsic calibration of a LiDAR-camera system is to find co-observed constraints in both the LiDAR point cloud and the RGB image [11]. To cope with the sparsity of the LiDAR point cloud, a simple and effective approach is to design a specific target, or calibration object [12]. The calibration object has a salient structure, and its contour and edge features can be clearly found in both the sparse LiDAR point cloud and the RGB image; these features serve as the co-observed constraints for calibration. This kind of method is called a calibration object based calibration method. A man-made calibration object is easy to provide [13] in the offline case. However, in the online situation (i.e., when the intelligent vehicle drives in an open scene), there is usually no specific calibration object, so calibration object based extrinsic calibration cannot work. Due to the complexity of object contours and the sparsity of the LiDAR point cloud, finding co-observed constraints in the open scene is challenging [10]. Therefore, it is difficult to calibrate the extrinsic parameters of the LiDAR-camera system online.
Given the background of intelligent driving, one may wonder whether it is necessary to calibrate the LiDAR-camera system online. Refs. [14,15] point out that, during long-term operation of the LiDAR-camera system, the extrinsic parameters undergo small but accumulating drift (represented by ΔR and ΔT) due to mechanical vibrations or temperature changes. As a result, the offline extrinsic calibration results R_0 and T_0 are no longer accurate. Online extrinsic calibration aims to estimate ΔR and ΔT and obtain the accurate extrinsic parameters R and T as:
$$\begin{bmatrix} R & T \\ \mathbf{0}^{T} & 1 \end{bmatrix} = \begin{bmatrix} R_{0} & T_{0} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \cdot \begin{bmatrix} \Delta R & \Delta T \\ \mathbf{0}^{T} & 1 \end{bmatrix} \tag{1}$$
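As a small illustration of Equation (1), the following sketch composes the offline extrinsics and the estimated drift as 4 × 4 homogeneous transforms (a minimal NumPy example; the helper names are ours and not part of any released implementation):

import numpy as np

def to_homogeneous(R, T):
    # Stack a 3x3 rotation and a 3-vector translation into a 4x4 transform.
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = np.asarray(T).ravel()
    return H

def update_extrinsics(R0, T0, dR, dT):
    # Apply the estimated drift (dR, dT) to the offline result (R0, T0), as in Equation (1).
    H = to_homogeneous(R0, T0) @ to_homogeneous(dR, dT)
    return H[:3, :3], H[:3, 3]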
In this paper, we present a novel convolutional neural network (CNN) based calibration network with a LiDAR intensity attention based backbone network (LIA-Net) and a structural consistency (SC) loss. The proposed calibration network, called LIA-SC-Net, addresses the problem of online extrinsic calibration of the LiDAR-camera system. From the Lambertian reflection model [16], we find that an object with high LiDAR intensity is more likely to exhibit salient texture features in the RGB image, and thus carries a potential co-observed constraint. Since objects in the open scene have complex contours, extracting robust hand-crafted co-observed features is difficult [17]. We therefore exploit deep learning [18] and learn the co-observed calibration (CoC) feature with a deep CNN architecture. As objects with higher LiDAR intensity carry more potential calibration information, we draw on the visual attention mechanism [19] and propose LIA-Net to learn salient CoC features. In the later stage of training, the regression loss of the extrinsic parameters tends to become stable and close to zero. On the one hand, this risks vanishing gradients [20], which limits the efficiency of supervised learning. On the other hand, according to the camera pinhole model [7], even if the rotation error is smaller than 1 deg, the alignment error between a projected LiDAR point and its corresponding RGB pixel can exceed 10.0 pixels. To deal with this issue, we present the SC loss, which reduces the alignment error by minimizing the difference between a projected LiDAR image (i.e., the LiDAR depth image and the LiDAR intensity image) and its GT counterpart. Combining the advantages of LIA-Net and the SC loss, we propose LIA-SC-Net to estimate the extrinsic parameters. To evaluate the proposed method, extensive experiments are conducted on the public KITTI dataset [2], which was collected with a purpose-built LiDAR-camera system. Ablation studies are also provided to show the effectiveness of LIA-Net and the SC loss. We believe that the proposed method benefits advanced applications of the LiDAR-camera system.
In conclusion, three main contributions are provided in this paper:
(i)
Considering that objects with higher LiDAR intensity provide more salient co-observed constraints, LIA-Net is proposed to use LiDAR intensity as an attention feature map for generating salient CoC features;
(ii)
To reduce the risk of vanishing gradients in the later training stage, the SC loss is presented, which approximately reduces the alignment error between the LiDAR point cloud and the RGB image by minimizing the difference between the projected LiDAR images and their GT counterparts;
(iii)
Taking advantage of both LIA-Net and the SC loss, the deep learning based extrinsic calibration method LIA-SC-Net is presented to estimate accurate extrinsic parameters of a LiDAR-camera system.
The remainder of this paper is organized as follows. Related works on extrinsic calibration of LiDAR-camera systems are reviewed in Section 2. Next, the proposed online extrinsic calibration method is described in Section 3. Experimental results on the public dataset are reported in Section 4. The proposed method and its performance in practical applications are discussed in Section 5. Finally, the work is concluded in Section 6.

2. Related Works

In terms of calibration technique, current extrinsic calibration methods can be classified into three categories: (i) calibration object based extrinsic calibration methods, (ii) information fusion based extrinsic calibration methods, and (iii) deep learning based extrinsic calibration methods.

2.1. Calibration Object Based Extrinsic Calibration Method

The key to calibration object based methods is to design a specific target with a significant structure. The corner points of the calibration object can be measured from both the LiDAR point cloud and the RGB image, and 3D-2D and 3D-3D point correspondence constraints can be established from them [9]. From 3D-2D point correspondences, the perspective-n-point (PnP) algorithm [21,22], the direct linear transformation (DLT) method [23], and the bundle adjustment (BA) algorithm [24] can be used to estimate R and T. From 3D-3D point correspondences, the iterative closest point (ICP) method [25,26] can be exploited to calibrate the extrinsic parameters. Common calibration objects can be classified into three groups: (i) 2D calibration objects, (ii) 3D calibration objects, and (iii) combined calibration objects. 2D calibration objects include the polygonal planar board [11,12], the planar board with circular holes [27], and chessboard patterns [28,29]. Chessboard patterns are useful for extrinsic calibration of the LiDAR-camera system because their corner points are easily extracted from the RGB image [28]. As the LiDAR intensity values of chessboard patterns change regularly, this fact can also be utilized to extract the corner points from the LiDAR point cloud [28,30]. If the relative position of the chessboard patterns and the planar board is known, the corner points can also provide 3D-2D point correspondence constraints [31]. 3D calibration objects include the trihedron [32], the spherical ball [33], the cubic box [13,34], and the cubic box with patterns [35]. Corner points of the trihedron and cube, or the center point of the sphere, can be used to establish point correspondence constraints. To extract more calibration constraints, researchers have combined 2D and 3D calibration objects into combined calibration objects. An et al. [9] proposed a combined calibration object consisting of a main and an auxiliary calibration object. Kümmerle and Kühner [36] combined a spherical ball and a planar board with patterns. Over time, the structure of calibration objects has tended to become more complex. From the viewpoint of practical application, balancing calibration accuracy against the production cost of the calibration object remains an interesting question.

2.2. Information Fusion Based Extrinsic Calibration Method

As discussed in Section 1, in the context of intelligent vehicles it is essential to calibrate the LiDAR-camera system in target-less open scenes [14,15]. Before the rise of deep learning in computer vision, information fusion based extrinsic calibration methods were presented to automatically find co-observed calibration features from the camera image and the LiDAR data. Such co-observed calibration features can be (i) corresponding edge features [14], (ii) corresponding point clouds [37], and (iii) corresponding odometry [38]. To make use of corresponding edge features, researchers convert extrinsic calibration into a multi-source image registration problem [14]. With the initial extrinsic parameters R_0 and T_0, the LiDAR point cloud can be projected onto the image plane as a projected LiDAR depth image D_0 and a projected LiDAR intensity image I_0; ΔR and ΔT are then estimated by solving the registration of D_0 or I_0 against the RGB image R. Mutual information (MI) is a useful tool for measuring the similarity of two multi-source images and has been successfully used in remote sensing [39]. With an MI based calibration loss, Pandey et al. [14] established a nonlinear optimization problem to estimate ΔR and ΔT in an iterative scheme. To obtain robust registration results, Wolcott and Eustice [40] exploited normalized MI, and Irie et al. [41] proposed bagged least-squares MI (BLSMI) as the calibration loss. Zhu et al. [42] considered that edge features in the RGB image are sensitive to color variations and noise, and instead used the semantic segmentation map predicted from the RGB image for registration. Xiao et al. [43] obtained ΔR and ΔT by minimizing the image feature errors of LiDAR points across different frames. To make use of corresponding point clouds, researchers convert extrinsic calibration into a 3D point cloud registration problem [37]. The key of this approach is to generate a point cloud P_c from RGB images; R and T are then estimated by registering P_c against the LiDAR point cloud P_l. For a LiDAR-stereo system, the stereo camera can generate a depth image via the disparity principle, and P_c is obtained by back-projecting the depth image into 3D space [44]. For a common LiDAR-camera system, using a sequence of RGB images, a 3D point cloud P_c can be reconstructed via structure from motion (SFM) [45] or simultaneous localization and mapping (SLAM) [46]. Nagy and Benedek [3] exploited semantic segmentation on the point clouds P_c and P_l to increase the accuracy of the ICP algorithm [25]. To make use of corresponding odometry, researchers convert extrinsic calibration into the classic hand-eye calibration problem [47]. This approach requires moving the LiDAR-camera system. With the ICP algorithm [25], the odometry of the LiDAR is estimated as O_l; using visual odometry (VO) [48], SFM, or SLAM, the odometry of the camera is estimated as O_c. A hand-eye calibration constraint is established from O_l and O_c, from which the extrinsic parameters are estimated [49]. Ishikawa et al. [38] presented a two-stage odometry based extrinsic calibration method: in the first stage, they obtain the initial results R_0 and T_0 from the odometry information via hand-eye calibration [47]; in the second stage, they exploit the MI based method [14] to obtain ΔR and ΔT.
The extrinsic parameters are finally estimated via Equation (1). Unlike calibration object based methods, information fusion based methods tend to convert extrinsic calibration into known and classical problems, such as multi-source image registration [14], point cloud registration [37], and hand-eye calibration [38].

2.3. Deep Learning Based Extrinsic Calibration Method

Information fusion based extrinsic calibration requires extra techniques, such as MI computation [39], 3D reconstruction [45,46], and odometry estimation [25,48]. Its calibration procedure is more complex than that of calibration object based methods, and its precision depends on the performance of these techniques. With the success of deep learning in robotic vision [50], some researchers have attempted to design deep learning based extrinsic calibration methods that extract adaptive co-observed features to regress the extrinsic parameters [15]. The key problem is how to align structural features from the LiDAR point cloud P_l and the RGB image R. As P_l and R have different data representations, a natural approach is to project P_l onto the image plane as D_0 and I_0 using R_0 and T_0, and to extract the co-observed feature from D_0, I_0, and R. Most researchers extract the CoC feature from D_0 and R [51,52,53] because most edge features are caused by depth discontinuities [54], which means that D_0 and R share common structural features. RegNet [51] is the first deep learning based calibration approach; it uses the network-in-network (NIN) module [55] to extract CoC features from D_0 and R, and fully connected (FC) layers to regress ΔR and ΔT. Iyer et al. [52] proposed the calibration network CalibNet, in which ResNet [56] is used as the backbone to extract feature maps from D_0 and R; they exploited the geometric and photometric consistency of P_l and R as the supervision loss. Based on deep learning based optical flow [57], Cattaneo et al. [53] extract the CoC feature from the difference between D_0 and R. Recently, Wu et al. [15] proposed the coarse-to-fine method CalibRank, in which an image retrieval algorithm is used to estimate R_0 and T_0. Yuan et al. [58] considered Riemannian geometry and presented a tolerance regularizer module for extrinsic calibration. Ye et al. [59] designed a point weighting layer to predict sparse keypoint correspondences with weighted scores, and exploited the PnP algorithm [21] to estimate ΔR and ΔT from these correspondences. Deep learning based extrinsic calibration is still developing, and increasing its generalization ability to various online calibration scenes remains challenging. Further discussion of this kind of method is presented below.

2.4. Analysis of the Deep Learning Based Extrinsic Calibration Method

For online extrinsic calibration of the LiDAR-camera system, the deep learning based method is an end-to-end approach compared with the information fusion based method: it does not require extra techniques and is convenient to use in practice. The general pipeline of deep learning based extrinsic calibration is shown in Figure 1 [51,52,53]. It consists of three modules: (i) backbone networks, (ii) co-observed feature extraction, and (iii) parameter regression. To increase calibration accuracy, two issues need to be addressed. First, learning salient CoC features from multi-source images is an important problem; to solve it, we exploit a visual attention mechanism based on LiDAR intensity and present the novel backbone network LIA-Net. Second, it is essential to increase the learning efficiency of the calibration network; to this end, the SC loss is proposed to accurately align LiDAR points and RGB pixels. The proposed LIA-Net and SC loss are discussed in detail in the next section.

3. Proposed Method

3.1. Problem Statement and Method Overview

Extrinsic calibration of the LiDAR-camera system aims to estimate the rigid transformation between the LiDAR coordinate system O_L-X_LY_LZ_L and the camera coordinate system O_C-X_CY_CZ_C, represented by R and T, as shown in Figure 2a. As illustrated in Section 1, with the known initial R_0 and T_0, online extrinsic calibration aims to estimate ΔR and ΔT. Comparing Figure 2b and Figure 2c, the alignment of the LiDAR point cloud and the RGB image is more accurate after online extrinsic calibration.
In this paper, we propose an efficient deep learning based online extrinsic calibration method, LIA-SC-Net. The pipeline is presented in Figure 3. Its inputs are the LiDAR point cloud P_l, the RGB image R, and the initial extrinsic parameters R_0 and T_0; its outputs are ΔR and ΔT. Using pinhole projection [7], the initial LiDAR depth image D_0 and LiDAR intensity image I_0 are obtained from P_l, K, R_0, and T_0 (projection details are given in Appendix A.1). R, D_0, and I_0 are [W, H, 3], [W, H, 1], and [W, H, 1] tensors, where W and H are the width and height of the image. The architecture of LIA-SC-Net extends the general pipeline in Figure 1. Unlike the common pipeline, LIA-Net utilizes the natural attention mechanism provided by LiDAR intensity, and the SC loss focuses on the alignment of the LiDAR point cloud and the RGB image, thus increasing the extrinsic calibration accuracy.

3.2. LiDAR Intensity Attention Based Backbone Network

LIA-Net is presented to generate a salient CoC feature F_CoC from R, D_0, and I_0. It has two procedures: (i) dense operation and (ii) CoC feature generation. First, the dense operation is discussed. The number of points generated by a mechanical LiDAR (~10^5) is far smaller than the number of image pixels W·H (~10^6), so D_0 and I_0 are much sparser than R (see Figure 4a,b). From the computation theory of CNN layers [18,56], a sparse feature map risks destroying the structural information of the original data. Thus, it is essential to densify D_0 and I_0. Inspired by the work of [60] and based on classical image processing techniques, a simple and fast pipeline is presented as:
$$X^{d} = \mathrm{inv}\Big(F_{\mathrm{median}}\big(F_{\mathrm{hole}}\big(F_{\mathrm{dilate}}(\mathrm{inv}(X), k)\big), k\big)\Big), \quad X \in \{D_{0}, I_{0}\}, \; X^{d} \in \{D_{0}^{d}, I_{0}^{d}\} \tag{2}$$
$$\mathrm{inv}(X) = \max(X) \cdot \mathbf{1} - X \tag{3}$$
The operators F_median(·, k) and F_dilate(·, k) are a median filter and a dilation operation with a kernel size of k × k. If k is set too large, the densified results D_0^d and I_0^d become distorted and unrealistic. An exact estimation of k requires the intrinsic parameters of both the LiDAR and the camera, as discussed in Appendix A.2. F_hole(·) is an operation that fills the remaining empty depth values using dilation with a large kernel size [60]. inv(·) is the inversion operation for a single-channel image, and 1 in Equation (3) is an all-ones tensor with the same size as X. Results of the dense operation are shown in Figure 4c.
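A minimal sketch of the dense operation in Equations (2) and (3) with OpenCV morphology is given below. The handling of empty (zero) pixels and the size of the hole-filling kernel are our assumptions; the released implementation may differ.

import cv2
import numpy as np

def densify(sparse_map, k=5, hole_kernel=9):
    # Densify a sparse LiDAR depth/intensity image following Equations (2)-(3).
    # Empty pixels are assumed to be zero; only valid pixels are inverted so that
    # dilation propagates nearby values over the gaps.
    x = sparse_map.astype(np.float32)
    max_val = float(x.max())
    inv = np.where(x > 0, max_val - x, 0.0).astype(np.float32)             # inv(X), Eq. (3)
    inv = cv2.dilate(inv, np.ones((k, k), np.uint8))                       # F_dilate with k x k kernel
    hole = cv2.dilate(inv, np.ones((hole_kernel, hole_kernel), np.uint8))  # F_hole: large-kernel dilation
    inv = np.where(inv > 0, inv, hole).astype(np.float32)
    inv = cv2.medianBlur(inv, k)                                           # F_median (k must be 3 or 5 for float32)
    return np.where(inv > 0, max_val - inv, 0.0)                           # invert back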
Next, CoC feature generation is discussed. From the Lambertian reflection model [16], an object with higher LiDAR intensity is more likely to have salient structural features in both the LiDAR point cloud and the RGB image, which helps generate a significant CoC feature. This is explained as follows. As most object surfaces in the open scene are rough [61], the Lambertian reflection model [16] is used to characterize the LiDAR intensity r_l:
$$r_{l} \propto \rho_{l} \cdot \frac{\mathbf{n}_{l}^{T}\,\mathbf{n}_{\mathrm{LiDAR}}}{D_{l}^{2}} \cdot P_{\mathrm{LiDAR}} \tag{4}$$
The LiDAR intensity is determined by the material reflectance coefficient ρ_l, the object surface orientation n_l, the LiDAR beam orientation n_LiDAR, the object distance D_l, and the laser power P_LiDAR. As the laser is an active light source, the received intensity scales with P_LiDAR / D_l^2. If the LiDAR intensity of an object is large, it is likely that n_l^T n_LiDAR ≈ 1 and D_l is small, which means that the object is close to the LiDAR and its surface directly faces the LiDAR. According to the mechanism of the mechanical LiDAR, such an object commonly receives dense LiDAR points, so its structural feature in the LiDAR point cloud is clear. Note that, for a LiDAR-camera system mounted on an intelligent vehicle, ‖T‖ ≪ D_l, so this object is also close to the camera and is likely to have salient textural features in the RGB image. In conclusion, an object with high LiDAR intensity is likely to have salient features in both the LiDAR point cloud and the RGB image (shown in Figure 5), which is helpful for generating a representative CoC feature.
Based on the above analysis, visual attention is derived from the dense LiDAR intensity I_0^d. With the efficient encoder ResNet [56], the attention feature map A is obtained as in Equation (5). E_I(·) is the encoder consisting of the first four convolution stages of ResNet-18; ResNet-18 is selected because it is lightweight and efficient for extracting image features. Conv2d(·) is a single-channel 2D convolution layer whose activation function is Softmax, ensuring the attention values in A lie in [0, 1]. A is a [W/8, H/8, 1] tensor. With A, the attentional LiDAR feature map F_LiDAR is extracted as in Equation (6). E_D(·) is the encoder of the LiDAR depth, with the same architecture as E_I(·). F_LiDAR is a [W/8, H/8, 256] tensor:
$$A = \mathrm{Conv2d}\big(E_{I}(I_{0}^{d})\big) \tag{5}$$
$$F_{\mathrm{LiDAR}} = A \cdot E_{D}(D_{0}^{d}) \tag{6}$$
Referring to the co-observed feature extraction in CMR-Net [53], the CoC feature F_CoC is extracted as:
$$F_{\mathrm{CoC}} = \mathrm{CVC}(F_{\mathrm{LiDAR}}, F_{\mathrm{Camera}}) \tag{7}$$
$$F_{\mathrm{Camera}} = E_{R}(\mathbf{R}) \tag{8}$$
CVC(·) is the cost volume computation (CVC) module [53,57], discussed in Appendix A.3. E_R(·) is the encoder of the RGB image, with the same architecture as E_I(·). F_Camera is a [W/8, H/8, 256] tensor, and F_CoC is a [d^2, W/8, H/8] tensor, where d is the pixel search range, set to the default value in the literature [53].
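The sketch below shows one possible PyTorch realization of Equations (5)-(8). The truncated ResNet-18 encoders, the 1 × 1 projections to 256 channels, and the spatial axis of the Softmax are our assumptions made to match the stated feature shapes; the cost-volume step that produces F_CoC is deferred to Appendix A.3.

import torch
import torch.nn as nn
import torchvision

def resnet18_stem(in_channels):
    # Early stages of ResNet-18, producing a 1/8-resolution feature map with 128 channels.
    net = torchvision.models.resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool, net.layer1, net.layer2)

class LIANet(nn.Module):
    # LiDAR intensity attention backbone (sketch of Eqs. (5)-(8)).
    def __init__(self):
        super().__init__()
        self.enc_intensity = resnet18_stem(1)                  # E_I
        self.enc_depth = resnet18_stem(1)                      # E_D
        self.enc_rgb = resnet18_stem(3)                        # E_R
        self.att_conv = nn.Conv2d(128, 1, kernel_size=1)       # single-channel Conv2d(.)
        self.proj_depth = nn.Conv2d(128, 256, kernel_size=1)   # project to the stated 256 channels
        self.proj_rgb = nn.Conv2d(128, 256, kernel_size=1)

    def forward(self, rgb, depth_dense, intensity_dense):
        f_i = self.att_conv(self.enc_intensity(intensity_dense))          # [B, 1, H/8, W/8]
        b, _, h, w = f_i.shape
        a = torch.softmax(f_i.view(b, 1, -1), dim=-1).view(b, 1, h, w)    # attention map A in [0, 1], Eq. (5)
        f_lidar = a * self.proj_depth(self.enc_depth(depth_dense))        # F_LiDAR, Eq. (6)
        f_camera = self.proj_rgb(self.enc_rgb(rgb))                       # F_Camera, Eq. (8)
        return f_lidar, f_camera     # passed to the CVC module (Appendix A.3) to obtain F_CoC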

3.3. Parameters Regression

The parameter regression module estimates ΔR and ΔT. Following the literature [51,53], the proposed network regresses the unit quaternion ΔQ instead of ΔR, because regressing ΔR requires two constraints, (i) ΔR^T ΔR = I and (ii) det(ΔR) = 1, whereas ΔQ only requires one: ‖ΔQ‖_2 = 1. ΔQ and ΔT are obtained as:
$$f_{\mathrm{CoC}} = \mathrm{FC}_{f}\big(\mathrm{Squeeze}(F_{\mathrm{CoC}})\big) \tag{9}$$
$$\Delta Q = \mathrm{FC}_{Q}(f_{\mathrm{CoC}}), \quad \Delta T = \mathrm{FC}_{T}(f_{\mathrm{CoC}}) \tag{10}$$
Squeeze(·) converts the feature map into a 1D vector. FC_f consists of two FC layers with 1024 and 512 neurons, FC_Q of two FC layers with 256 and 4 neurons, and FC_T of two FC layers with 256 and 3 neurons. As ΔQ must be a unit quaternion, it is normalized as in Equation (11). Finally, ΔR is recovered from ΔQ in Equation (12), where Qua2Rot(·) converts a unit quaternion into a rotation matrix, as discussed in Appendix A.4:
$$\Delta Q = \frac{\Delta Q}{\lVert \Delta Q \rVert_{2}} \tag{11}$$
$$\Delta R = \mathrm{Qua2Rot}(\Delta Q) \tag{12}$$
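A compact PyTorch sketch of the regression head in Equations (9)-(12) follows. The ReLU activations between the FC layers and the flattened input dimension are our assumptions; the layer widths are those stated above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegressionHead(nn.Module):
    # Regress the unit quaternion dQ and translation dT from the squeezed CoC feature.
    def __init__(self, coc_dim):
        super().__init__()
        self.fc_f = nn.Sequential(nn.Linear(coc_dim, 1024), nn.ReLU(),
                                  nn.Linear(1024, 512), nn.ReLU())                     # FC_f
        self.fc_q = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 4))   # FC_Q
        self.fc_t = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 3))   # FC_T

    def forward(self, f_coc):
        f = self.fc_f(f_coc.flatten(1))               # Squeeze(.) then FC_f, Eq. (9)
        dq = F.normalize(self.fc_q(f), p=2, dim=1)    # unit quaternion, Eqs. (10)-(11)
        dt = self.fc_t(f)                             # translation dT, Eq. (10)
        return dq, dt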

3.4. Structural Consistency Loss

The regular calibration loss [51,53] is described first. The ground-truth extrinsic parameters are denoted ΔR_gt and ΔT_gt, and ΔQ_gt is the unit quaternion corresponding to ΔR_gt. The regular loss L_reg is given in Equation (13). L_smooth-L1(·) is the smooth-L1 loss [50]. Let Θ denote all learnable parameters of the proposed calibration network; ‖Θ‖_2 is the L2 norm of all learnable parameters, and γ > 0 is the L2 regularization weight, set to the default value in the literature [53]:
$$L_{\mathrm{reg}} = L_{\mathrm{smooth\text{-}L1}}(\Delta T, \Delta T_{\mathrm{gt}}) + L_{\mathrm{smooth\text{-}L1}}(\Delta Q, \Delta Q_{\mathrm{gt}}) + \gamma \lVert \Theta \rVert_{2} \tag{13}$$
The limitation of L_reg is then discussed. In the late period of training, although the regression results ΔR and ΔT may be close to the ground truth, this does not guarantee that the alignment error between the LiDAR point cloud and the RGB image is small enough. As an example, suppose the focal length of the camera is f_c = 800 pixels and ‖T‖ ≪ D_l. If the rotation angle error is close to e_r = 1.0 deg, the re-projection error of a LiDAR point on the image plane is e_p ≈ (π/180) f_c e_r ≈ 13.95 pixels [7], which exceeds the calibration tolerance of 8.0 pixels [9,10]. Therefore, a new loss term is needed for the alignment of the LiDAR point cloud and the RGB image.
A naive loss term directly minimizing the alignment error is given in Equation (14). P_{l,i} is the coordinate of the i-th LiDAR point in the LiDAR coordinate system. S_FOV is the index set of LiDAR points that fall into the FOV of the camera, and 1(i ∈ S_FOV) is the indicator function, which outputs 1 if the i-th point falls into the FOV of the camera and 0 otherwise. π(·) is the pinhole camera projection, discussed in Appendix A.1. I_{i,GT} is the image pixel corresponding to the i-th LiDAR point:
$$L_{\mathrm{align}} = \sum_{i=1}^{N} \lVert \mathbf{I}_{i,\mathrm{GT}} - \mathbf{I}_{i} \rVert_{2}^{2} \cdot \mathbf{1}(i \in S_{\mathrm{FOV}}), \quad \mathbf{I}_{i} = \pi(\mathbf{P}'_{l,i}; \Delta R, \Delta T), \quad \mathbf{P}'_{l,i} = R_{0}\mathbf{P}_{l,i} + T_{0} \tag{14}$$
However, L_align cannot be used in practice for two reasons: (i) it is difficult to find the corresponding pixel I_i for an arbitrary LiDAR point; and (ii) it is also difficult to determine S_FOV with the initial, inaccurate R_0 and T_0. Thus, L_align needs to be approximated for practical use, which is why the loss approximation appears in Figure 3. With ΔR_gt and ΔT_gt, the ground-truth LiDAR depth D_gt and LiDAR intensity I_gt are obtained by the same procedure as in Section 3.1; with ΔR and ΔT, D(ΔR, ΔT) and I(ΔR, ΔT) are obtained likewise. Note that the depths and intensities of I_{i,GT} and I_i are close if ‖I_{i,GT} − I_i‖_2 is small enough. Therefore, we aim to maximize the structural consistency between D_gt and D(ΔR, ΔT), and between I_gt and I(ΔR, ΔT). Inspired by the photometric loss in CalibNet [52] and using the L2 norm, the SC loss is designed as:
$$L_{\mathrm{SC,raw}} = \lVert D_{\mathrm{gt}} - D(\Delta R, \Delta T) \rVert_{2}^{2} + \lVert I_{\mathrm{gt}} - I(\Delta R, \Delta T) \rVert_{2}^{2} \tag{15}$$
Due to the sparsity of the LiDAR point cloud, minimizing a difference of sparse feature maps would easily make the SC loss discontinuous. The dense operation of Section 3.2 is therefore applied to densify both the ground-truth and the estimated LiDAR projection maps. The SC loss is revised as Equation (16), and the overall calibration loss of LIA-SC-Net is given in Equation (17):
$$L_{\mathrm{SC}} = \lVert D_{\mathrm{gt}}^{d} - D^{d}(\Delta R, \Delta T) \rVert_{2}^{2} + \lVert I_{\mathrm{gt}}^{d} - I^{d}(\Delta R, \Delta T) \rVert_{2}^{2} \tag{16}$$
$$L_{\mathrm{calib}} = L_{\mathrm{reg}} + L_{\mathrm{SC}} \tag{17}$$
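A sketch of the total loss in Equations (13), (16), and (17) is given below in PyTorch. The reduction of the squared differences (mean over pixels) is our choice, and the dense projected maps fed to the SC term must come from a differentiable projection for the gradients to reach ΔR and ΔT.

import torch
import torch.nn.functional as F

def calibration_loss(dq, dt, dq_gt, dt_gt,
                     depth_d, depth_gt_d, inten_d, inten_gt_d,
                     params, gamma=0.004):
    # L_reg, Eq. (13): smooth-L1 on the quaternion/translation plus L2 regularization on Theta.
    l2_theta = torch.sqrt(sum(p.pow(2).sum() for p in params))
    l_reg = F.smooth_l1_loss(dt, dt_gt) + F.smooth_l1_loss(dq, dq_gt) + gamma * l2_theta
    # L_SC, Eq. (16): structural consistency between densified projected maps and their GT counterparts.
    l_sc = (depth_gt_d - depth_d).pow(2).mean() + (inten_gt_d - inten_d).pow(2).mean()
    return l_reg + l_sc                                # L_calib, Eq. (17)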

3.5. Iterative Inference Scheme

Following [52,53], an iterative inference scheme is exploited to increase the accuracy of LIA-SC-Net. Let m denote the iteration index (m = 1, ..., M), where M is the maximum number of iterations. The m-th outputs are ΔR_m and ΔT_m. R_0 and T_0 are then updated with ΔR_m and ΔT_m via Equation (1) and used as the initial parameters for the (m+1)-th iteration. After the M-th iteration, the updated R_0 and T_0 are output as the calibration result. The selection of M is discussed in the experimental section.
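The iterative scheme can be summarized by the short loop below. The function net_infer stands for one forward pass of the trained calibration network and is only a placeholder for its (unspecified) interface; the composition step follows Equation (1).

import numpy as np

def iterative_calibration(net_infer, point_cloud, rgb, K, R0, T0, M=5):
    # Refine the extrinsics M times, re-projecting with the updated parameters each round.
    R, T = np.array(R0, dtype=float), np.array(T0, dtype=float)
    for _ in range(M):
        dR, dT = net_infer(point_cloud, rgb, K, R, T)   # predicted drift for the current initialization
        R, T = R @ dR, R @ np.asarray(dT) + T           # Eq. (1): R <- R dR, T <- R dT + T
    return R, T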

4. Experimental Results

4.1. Experiment Configuration

4.1.1. Dataset and Preparations

The performance of the proposed method is evaluated on the KITTI autonomous driving dataset [2]. In the KITTI outdoor dataset, the LiDAR-camera system consists of a Velodyne HDL-64E LiDAR (10 Hz, 64 laser beams) and a Point Grey Flea 2 RGB video camera (10 Hz) [2]. The RGB images and LiDAR point clouds are used to prepare our dataset, which contains 7481 samples for training and 7518 samples for validation; the training and validation samples are randomly selected from all 28 different outdoor scenes. As the image size in the KITTI dataset is [1242, 375], all images are padded to [1280, 384] to meet the CNN architecture requirement (width and height multiples of 64). Although the KITTI dataset was established in 2012, it is still the main public benchmark for mainstream autonomous driving applications [17]. To simulate the online extrinsic calibration situation, we follow the current deep learning based extrinsic calibration methods [51,52,53] to generate ΔR and ΔT. ΔR is represented in yaw–pitch–roll Euler form with three angles (Δθ_y, Δθ_p, Δθ_r)^T, which are randomly sampled from a uniform distribution over [−θ, θ] (unit: degree). ΔT has three elements (Δx, Δy, Δz)^T, randomly sampled from a uniform distribution over [−D, D] (unit: meter).
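For reference, the sketch below draws one random perturbation (ΔR, ΔT) in the way described above, using SciPy; the exact Euler-axis convention ("zyx" for yaw-pitch-roll here) is our assumption.

import numpy as np
from scipy.spatial.transform import Rotation

def sample_miscalibration(theta_deg, d_m, rng=None):
    # Euler angles ~ U[-theta, theta] (deg) and translation ~ U[-D, D] (m), as in Section 4.1.1.
    rng = rng or np.random.default_rng()
    angles = rng.uniform(-theta_deg, theta_deg, size=3)                  # (yaw, pitch, roll)
    dR = Rotation.from_euler("zyx", angles, degrees=True).as_matrix()
    dT = rng.uniform(-d_m, d_m, size=3)
    return dR, dT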

4.1.2. Implementations

To verify the deep learning based methods in different situations, six noise levels are defined in Table 1. According to the classical learning based method [15], for practical applications of the LiDAR-camera system, the drifted rotation error is smaller than 20 deg and the drifted translation error is smaller than 1.50 m. For a fair comparison, all learning based methods are trained on the level-3 calibration situation and then tested on the level-0 to level-5 situations; level-0 is used to test the stability of the learning based methods. Based on the open-source optical flow network PWC-Net [57], CMR-Net [53] and LIA-SC-Net are implemented with the PyTorch library, while the open-source CalibNet [52] is implemented with the TensorFlow library. LiDAR depth and reflected intensity are normalized to the range [0, 1], γ = 0.004, and the batch size is set to 1 for training. LIA-SC-Net, CMR-Net, and CalibNet are trained for 40 epochs using an SGD optimizer on a single Nvidia GTX 1080 Ti.

4.1.3. Evaluation Metrics

The GT rotation matrix ΔR_gt is represented in yaw–pitch–roll Euler form with three angles (Δθ_{y,gt}, Δθ_{p,gt}, Δθ_{r,gt})^T, and the GT translation vector ΔT_gt is denoted (Δt_{x,gt}, Δt_{y,gt}, Δt_{z,gt})^T. The calibration metrics used for evaluation are given in Equations (18) and (19); the average errors over the validation dataset are used to evaluate the performance of the learning based calibration methods:
$$E_{\theta} = \frac{|\Delta\theta_{y,gt} - \Delta\theta_{y}| + |\Delta\theta_{p,gt} - \Delta\theta_{p}| + |\Delta\theta_{r,gt} - \Delta\theta_{r}|}{3} \tag{18}$$
$$E_{t} = \frac{|\Delta t_{x,gt} - \Delta t_{x}| + |\Delta t_{y,gt} - \Delta t_{y}| + |\Delta t_{z,gt} - \Delta t_{z}|}{3} \tag{19}$$
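A direct NumPy implementation of these metrics is shown below (the Euler angles are assumed to be given in degrees and the translations in meters):

import numpy as np

def calibration_errors(euler_gt, euler_pred, t_gt, t_pred):
    # Mean absolute rotation error E_theta (Eq. (18)) and translation error E_t (Eq. (19)).
    e_theta = np.mean(np.abs(np.asarray(euler_gt) - np.asarray(euler_pred)))
    e_t = np.mean(np.abs(np.asarray(t_gt) - np.asarray(t_pred)))
    return e_theta, e_t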

4.2. Verification of the Proposed Method

To evaluate the LIA-Net backbone network and the SC loss in LIA-SC-Net, five baselines are designed in Table 2. "Encoder" means that the calibration network only uses the encoders E_D and E_R to extract CoC features. Calibration errors are presented in Figure 6, where the error curves of baseline-1 to baseline-5 are marked in dark blue, light blue, green, yellow, and pink, respectively. The average and standard deviation errors of all baseline methods are reported in Table 3.

4.2.1. Verification on Dense Operation

This experiment investigates the performance of the dense operation on the LiDAR depth. From Table 3, comparing baseline-1 and baseline-2, the rotation and translation errors decrease by 0.057 deg and 0.45 cm with the dense operation. As presented in Figure 4, a dense LiDAR map carries more salient object structural information than the sparse LiDAR map, so a more representative CoC feature can be generated from the dense LiDAR map and the RGB image. We conclude that the dense operation is effective for online extrinsic calibration with sparse LiDAR point clouds.

4.2.2. Verification on LiDAR Intensity Attention

This experiment investigates the performance of the LiDAR intensity attention. From Figure 6 and Table 3, comparing baseline-2 and baseline-3, baseline-3 has smaller calibration errors from level 1 to level 5: the average rotation and translation errors decrease by 0.026 deg and 0.47 cm, and the standard deviations also decrease. Compared with the plain encoder, LIA-Net exploits LiDAR intensity as visual attention, which adaptively selects areas of the LiDAR depth for salient calibration feature extraction. This shows that LiDAR intensity attention is helpful for CoC feature generation.

4.2.3. Verification on Structural Calibration Loss

This experiment investigates the performance of the proposed SC loss. From Figure 6 and Table 3, comparing baseline-3, baseline-4, and baseline-5, the translation errors decrease significantly across the calibration levels. Compared with baseline-1, the rotation and translation errors of baseline-5 drop by 15.71% and 55.03%. Visualizations of baseline-1 and baseline-5 are shown in Figure 7. This indicates that combining the regular loss with the proposed SC loss improves the training efficiency of the calibration network.

4.2.4. Verification on Iterative Inference and Time-Consuming Test

This experiment investigates the performance of iterative inference and the runtime. With the iteration count M increasing from 1 to 5, the calibration performance is presented in Table 4. Calibration accuracy increases with M. When M ≥ 4, the gains in calibration error are not obvious, because the calibration network has difficulty generating more significant calibration features, so LIA-SC-Net treats the current result as accurate enough. We also investigate the time efficiency of the proposed method (CPU-GPU transfer time is not considered); the time consumption of each module is presented in Table 5. The frame rate of the proposed method in the non-iterative inference mode is nearly 34 FPS, higher than the frame rates of the mechanical LiDAR (about 10 Hz) and the camera (about 30 Hz). With M = 5, the proposed method runs in 146 ms (nearly 6.8 FPS). In conclusion, LIA-SC-Net works for online extrinsic calibration, and the iterative inference strategy is useful in practical applications.

4.3. Comparison Results

4.3.1. Comparisons with State-of-the-Art Methods

For the calibration performance evaluation, our method is compared with the state-of-the-art methods CalibNet [52] and CMR-Net [53]. As discussed in Section 4.1.1, these methods are all trained on the same training dataset, with the same number of epochs and the same optimizer. For a fair comparison, the iterative inference strategy is used to obtain the best calibration results for all methods. The rotation and translation errors of each calibration level are presented in Figure 8. All methods have nearly the same accuracy for levels 0 to 2, whereas for large initial calibration errors LIA-SC-Net outperforms the other methods. The average calibration errors over all levels are provided in Table 6; LIA-SC-Net has smaller average rotation and translation errors than the other methods. In addition, the runtime comparison in the non-iterative inference mode is shown in Table 7. Although the inference time of LIA-SC-Net is not the fastest, it satisfies the requirement of real-time online calibration. Therefore, the proposed method meets the need for accurate and real-time extrinsic calibration of the LiDAR-camera system.

4.3.2. Calibration Visualization

Visualization results of LIA-SC-Net and CMR-Net [53] on cases with large initial calibration errors are provided in Figure 9. For better visualization, the RGB image and the dense LiDAR depth are blended, and the red dotted lines mark the borderline of the LiDAR depth. LIA-SC-Net performs better than CMR-Net [53] in these scenes, and with the iterative inference strategy its calibration results are close to the ground truth. However, in the first scene, the result of LIA-SC-Net with M = 5 still shows a relatively large rotation error. The reason might be that few salient objects with low depth and high LiDAR reflected intensity are found in this scene. Therefore, for precise calibration, our method benefits from regions close to the LiDAR-camera system with high LiDAR reflected intensity.

5. Discussion

The proposed method LIA-SC-Net has large potential advantages for online extrinsic calibration of the LiDAR-camera system. Compared with calibration object based methods [11,12], it does not require any specific calibration object. Among target-less methods, the proposed method does not need to compute odometry from the LiDAR point cloud and the RGB image [38] or reconstruct a 3D point cloud from RGB images [37]. Due to the sparsity of the LiDAR point cloud, the LiDAR depth and reflected intensity are also sparse (shown in Figure 4), which degrades the accuracy of MI features [40,41]. Compared with other deep learning based calibration methods [51,52,53], the proposed method exploits the relation between co-observed features and LiDAR intensity, and proposes LIA-Net as a novel backbone network to generate more salient CoC features. In future work, we will attempt to refine LIA-Net with both spatial and channel attention [62] to achieve more accurate calibration results.
LIA-SC-Net provides an aligned LiDAR point cloud and RGB image for advanced applications of the LiDAR-camera system, such as depth completion [61] and 3D object detection [63]. As the LiDAR depth is sparse (Figure 4b), depth completion can use the texture features of the aligned RGB image as guidance [64] to generate a dense depth map. With the LiDAR point cloud and the RGB image aligned, multi-sensor fusion features can be generated to regress the accurate 3D localization of targeted objects. This means that LIA-SC-Net has wide applications on the LiDAR-camera system.
Experiments demonstrate that LIA-SC-Net has stable calibration performance and outperforms current methods, which shows that the proposed LIA-Net module and SC loss help LIA-SC-Net produce robust and precise calibration results. A limitation of our method is the trade-off between real-time and accurate performance: from the analysis in Section 4.2.4, M ≥ 4 achieves better results, but the frame rate is then lower than 10 FPS. Therefore, we plan to speed up the proposed method in future work.

6. Conclusions

In this paper, to solve the problem of accumulated extrinsic parameter errors of the LiDAR-camera system, we present a novel and efficient extrinsic calibration network, LIA-SC-Net. From the Lambertian reflection model, we find that objects with higher LiDAR intensity are likely to provide salient co-observed features. LIA-Net is presented to exploit LiDAR intensity as an attention feature map for generating representative CoC features. To increase the training efficiency of the calibration network, the SC loss is proposed to minimize the alignment error between the LiDAR point cloud and the RGB image. With LIA-Net and the SC loss, the proposed method achieves accurate and robust performance in the dataset experiments, demonstrating the effectiveness of LIA-SC-Net.

Author Contributions

Methodology and writing—original draft preparation, P.A.; software and validation, Y.G.; investigation, L.W.; formal analysis, Y.C.; supervision, resources, and funding acquisition, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (U1913602, 61991412) and the Equipment Pre-Research Project (41415020202, 41415020404, 305050203).

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found at http://www.cvlibs.net/datasets/kitti/ (accessed on 18 April 2022).

Acknowledgments

The authors thank Siying Ke, Bin Fang, Junfeng Ding, Zaipeng Duan, and Zhenbiao Tan from Huazhong University of Science and Technology, and anonymous reviewers for providing many valuable suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LiDAR     Light detection and ranging
MI        Mutual information
CNN       Convolution neural network
LIA-Net   LiDAR intensity attention based backbone network
SC        Structural consistency
FOV       Field of view
FPS       Frames per second
GT        Ground truth
Net       Network
CoC       Co-observed calibration
PnP       Perspective-n-point
DLT       Direct linear transformation
BA        Bundle adjustment
ICP       Iterative closest point
BLSMI     Bagged least-squares mutual information
VO        Visual odometry
FC        Fully connected
CVC       Cost volume computation

Appendix A

Appendix A.1. Projection Details

Let P_l = (x_l, y_l, z_l)^T ∈ P_l denote the LiDAR coordinates of one LiDAR laser point, and let r_l be its LiDAR reflected intensity. With the extrinsic parameters R and T, P_l can be projected onto the image plane via the camera pinhole model [7]:
$$d_{l} \cdot \mathbf{I}_{l} = K \cdot (R\,\mathbf{P}_{l} + T) \tag{A1}$$
$$K = \begin{bmatrix} f_{u} & f_{s} & u_{0} \\ 0 & f_{v} & v_{0} \\ 0 & 0 & 1 \end{bmatrix} \tag{A2}$$
K is the camera intrinsic matrix, (u_0, v_0, 1)^T is the pixel coordinate of the principal point, f_u and f_v are the focal lengths along the horizontal and vertical axes of the image plane, and f_s is the skew ratio. I_l = (u, v, 1)^T is the pixel coordinate of the laser point P_l. If this laser point falls into the camera field of view (FOV), it has positive depth d_l > 0. By projecting all LiDAR points that fall into the camera FOV via Equation (A1) and collecting I_l, d_l, and r_l, the LiDAR depth image D and the LiDAR intensity image I are generated; they are [W, H, 1] tensors, where W and H are the width and height of the image. For simplicity, Equation (A1) can be rewritten as:
$$\mathbf{I}_{l} = \pi(\mathbf{P}_{l}; R, T) \tag{A3}$$
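A NumPy sketch of Equations (A1)-(A3) that rasterizes the projected points into depth and intensity images is given below; occlusion handling (z-buffering of points that land on the same pixel) is omitted for brevity.

import numpy as np

def project_lidar(points, intensities, K, R, T, width, height):
    # Project LiDAR points into the image plane (Eq. (A1)) and collect depth/intensity per pixel.
    cam = (R @ points.T + T.reshape(3, 1)).T             # points in the camera frame
    keep = cam[:, 2] > 0                                 # keep points with positive depth d_l
    cam, inten = cam[keep], np.asarray(intensities)[keep]
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    in_fov = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.zeros((height, width), np.float32)
    intensity = np.zeros((height, width), np.float32)
    depth[v[in_fov], u[in_fov]] = cam[in_fov, 2]
    intensity[v[in_fov], u[in_fov]] = inten[in_fov]
    return depth, intensity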
Table A1. Intrinsic parameters of the LiDAR-camera system in the KITTI outdoor dataset [2].

Parameter | Value
f_u | 721.537 pixel
f_v | 721.537 pixel
f_s | 0.0 pixel
u_0 | 609.559 pixel
v_0 | 216.379 pixel
θ_h | 0.08 deg
θ_v | 0.40 deg

Appendix A.2. Selection of k in the Dense Operation

The kernel size k of the dilation is selected so that each dilated pixel overlaps with its neighboring dilated pixels. Let θ_h and θ_v be the horizontal and vertical resolution angles of the LiDAR. From Equation (A3), k should approximately satisfy the following constraints:
$$\frac{\pi}{180}\theta_{h} \le \frac{k}{f_{u}} \le \frac{2\pi}{180}\theta_{h}, \quad \frac{\pi}{180}\theta_{v} \le \frac{k}{f_{v}} \le \frac{2\pi}{180}\theta_{v} \tag{A4}$$
$$\frac{\pi}{180}\cdot\frac{1}{2}\big(\theta_{v} f_{u} + \theta_{h} f_{v}\big) \le k \le \frac{\pi}{180}\cdot\big(\theta_{v} f_{u} + \theta_{h} f_{v}\big) \tag{A5}$$
For a mechanical LiDAR, θ_v is far larger than θ_h, so the compromise selection scheme of Equation (A5) is exploited. As the experiments are conducted on the KITTI dataset [2], according to the parameters of the LiDAR-camera system in the KITTI dataset (Table A1), k is set to 5.
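Plugging the Table A1 values into Equation (A5) gives the admissible range of k (a quick numerical check, not part of the original text):

import numpy as np

f_u, f_v = 721.537, 721.537        # focal lengths (pixel), Table A1
theta_h, theta_v = 0.08, 0.40      # LiDAR resolution angles (deg), Table A1
upper = np.pi / 180.0 * (theta_v * f_u + theta_h * f_v)
print(f"{0.5 * upper:.2f} <= k <= {upper:.2f}")   # about 3.0 <= k <= 6.0, so k = 5 is admissible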

Appendix A.3. Details of the Cost Volume Computation Module

To extract co-observed features from F LiDAR and F Camera for calibration, the matching cost computation module [53,57] is exploited. The CoC feature F CoC is computed as:
$$F_{\mathrm{CoC}}(\mathbf{p}_{1}, \mathbf{p}_{2}) = \frac{1}{c}\big(F_{\mathrm{LiDAR}}(\mathbf{p}_{1}) - \bar{F}_{\mathrm{LiDAR}}\big)^{T}\big(F_{\mathrm{Camera}}(\mathbf{p}_{2}) - \bar{F}_{\mathrm{Camera}}\big) \tag{A6}$$
$$\bar{F}_{\mathrm{LiDAR}} = \mathrm{Mean}(F_{\mathrm{LiDAR}}), \quad \bar{F}_{\mathrm{Camera}} = \mathrm{Mean}(F_{\mathrm{Camera}}) \tag{A7}$$
p_i = (u_i, v_i) is a pixel index of the 2D feature map. F_LiDAR(p_1) and F_Camera(p_2) are both [1, c] vectors, where c = 256 in this paper (see Section 3.2). To extract a robust CoC feature from images of different sources, normalization is used: Mean(·) generates the average feature of a given 2D feature map, and F̄_LiDAR and F̄_Camera are both [1, c] vectors. The superscript T denotes vector transposition. For computational efficiency [57], we only compute a partial cost volume within a limited range of d pixels (‖p_1 − p_2‖_∞ ≤ d/2). As F_LiDAR and F_Camera are [W/8, H/8, 256] tensors, F_CoC is a [d^2, W/8, H/8] tensor.
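A straightforward PyTorch sketch of the partial cost volume in Equations (A6) and (A7) follows; the window size d is left as an argument, and the loop over offsets is written for clarity rather than speed.

import torch
import torch.nn.functional as F

def cost_volume(f_lidar, f_camera, d=9):
    # Partial cost volume between mean-normalized feature maps within a d x d search window.
    b, c, h, w = f_lidar.shape
    f_l = f_lidar - f_lidar.mean(dim=(2, 3), keepdim=True)      # subtract Mean(F_LiDAR), Eq. (A7)
    f_c = f_camera - f_camera.mean(dim=(2, 3), keepdim=True)    # subtract Mean(F_Camera), Eq. (A7)
    pad = d // 2
    f_c = F.pad(f_c, (pad, pad, pad, pad))
    costs = []
    for dv in range(d):
        for du in range(d):
            shifted = f_c[:, :, dv:dv + h, du:du + w]           # camera features at offset (du, dv)
            costs.append((f_l * shifted).sum(dim=1, keepdim=True) / c)   # Eq. (A6)
    return torch.cat(costs, dim=1)                              # [B, d*d, H/8, W/8]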

Appendix A.4. Relation of the Quaternion and Rotation Matrix

Let R = (r_ij)_{3×3} be a rotation matrix and Q = (q_0, q_1, q_2, q_3)^T be a unit quaternion, i.e., a 4 × 1 unit vector. R can be obtained from Q as:
$$R = \begin{bmatrix} 1 - 2q_{2}^{2} - 2q_{3}^{2} & 2q_{1}q_{2} + 2q_{0}q_{3} & 2q_{1}q_{3} - 2q_{0}q_{2} \\ 2q_{1}q_{2} - 2q_{0}q_{3} & 1 - 2q_{1}^{2} - 2q_{3}^{2} & 2q_{2}q_{3} + 2q_{0}q_{1} \\ 2q_{1}q_{3} + 2q_{0}q_{2} & 2q_{2}q_{3} - 2q_{0}q_{1} & 1 - 2q_{1}^{2} - 2q_{2}^{2} \end{bmatrix} \tag{A8}$$
Q can be converted from R as:
$$q_{0} = \frac{\sqrt{\mathrm{tr}(R) + 1}}{2}, \quad q_{1} = \frac{r_{23} - r_{32}}{4q_{0}}, \quad q_{2} = \frac{r_{31} - r_{13}}{4q_{0}}, \quad q_{3} = \frac{r_{12} - r_{21}}{4q_{0}}$$
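The two conversions can be written compactly in NumPy as below, using the sign convention implied by the formulas above (the closed-form inverse assumes tr(R) > -1, i.e., q_0 ≠ 0):

import numpy as np

def qua2rot(q):
    # Rotation matrix from a unit quaternion q = (q0, q1, q2, q3), Eq. (A8).
    q0, q1, q2, q3 = q
    return np.array([
        [1 - 2*q2**2 - 2*q3**2, 2*q1*q2 + 2*q0*q3,     2*q1*q3 - 2*q0*q2],
        [2*q1*q2 - 2*q0*q3,     1 - 2*q1**2 - 2*q3**2, 2*q2*q3 + 2*q0*q1],
        [2*q1*q3 + 2*q0*q2,     2*q2*q3 - 2*q0*q1,     1 - 2*q1**2 - 2*q2**2]])

def rot2qua(R):
    # Unit quaternion from a rotation matrix, following the inverse formulas above.
    q0 = np.sqrt(np.trace(R) + 1.0) / 2.0
    return np.array([q0,
                     (R[1, 2] - R[2, 1]) / (4.0 * q0),
                     (R[2, 0] - R[0, 2]) / (4.0 * q0),
                     (R[0, 1] - R[1, 0]) / (4.0 * q0)])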

References

  1. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection From RGB-D Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  2. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  3. Nagy, B.; Benedek, C. On-the-Fly Camera and Lidar Calibration. Remote Sens. 2020, 12, 1137. [Google Scholar] [CrossRef] [Green Version]
  4. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
  5. Zhuang, Z.; Li, R.; Jia, K.; Wang, Q.; Li, Y.; Tan, M. Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation. In Proceedings of the IEEE Conference on International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 16260–16270. [Google Scholar]
  6. Yin, L.; Luo, B.; Wang, W.; Yu, H.; Wang, C.; Li, C. CoMask: Corresponding Mask-Based End-to-End Extrinsic Calibration of the Camera and LiDAR. Remote Sens. 2020, 12, 1925. [Google Scholar] [CrossRef]
  7. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef] [Green Version]
  8. Liu, W.; Wang, C.; Chen, S.; Bian, X.; Lai, B.; Shen, X.; Cheng, M.; Lai, S.; Weng, D.; Li, J. Y-Net: Learning Domain Robust Feature Representation for ground camera image and large-scale image-based point cloud registration. Inf. Sci. 2021, 581, 655–677. [Google Scholar] [CrossRef]
  9. An, P.; Ma, T.; Yu, K.; Fang, B.; Zhang, J.; Fu, W.; Ma, J. Geometric calibration for LiDAR-camera system fusing 3D-2D and 3D-3D point correspondences. Opt. Express 2020, 28, 2122–2141. [Google Scholar] [CrossRef]
  10. An, P.; Gao, Y.; Ma, T.; Yu, K.; Fang, B.; Zhang, J.; Ma, J. LiDAR-camera system extrinsic calibration by establishing virtual point correspondences from pseudo calibration objects. Opt. Express 2020, 28, 18261–18282. [Google Scholar] [CrossRef]
  11. Park, Y.; Yun, S.; Won, C.; Cho, K.; Um, K.; Sim, S. Calibration between color camera and 3D LIDAR instruments with a polygonal planar board. Sensors 2014, 14, 5333–5353. [Google Scholar] [CrossRef] [Green Version]
  12. Dhall, A.; Chelani, K.; Radhakrishnan, V.; Krishna, K.M. LiDAR-Camera Calibration using 3D-3D Point correspondences. arXiv 2017, arXiv:1705.09785. [Google Scholar]
  13. Pusztai, Z.; Hajder, L. Accurate Calibration of LiDAR-Camera Systems using Ordinary Boxes. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 394–402. [Google Scholar]
  14. Pandey, G.; Mcbride, J.R.; Savarese, S.; Eustice, R.M. Automatic Targetless Extrinsic Calibration of a 3D Lidar and Camera by Maximizing Mutual Information. In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence, Toronto, ON, Canada, 22–26 July 2012; pp. 1–7. [Google Scholar]
  15. Wu, X.; Zhang, C.; Liu, Y. Calibrank: Effective Lidar-Camera Extrinsic Calibration by Multi-Modal Learning to Rank. In Proceedings of the IEEE International Conference on Image Processing, Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 3189–3193. [Google Scholar]
  16. Oren, M.; Nayar, S.K. Generalization of the Lambertian model and implications for machine vision. Int. J. Comput. Vis. 1995, 14, 227–251. [Google Scholar] [CrossRef]
  17. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020; pp. 1–16. [Google Scholar]
  18. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the ICAIS, Klagenfurt, Austria, 6–8 September 2011; pp. 315–323. [Google Scholar]
  19. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. In Proceedings of the NeurIPS, Montreal, QC, Canada, 8–13 December 2014; pp. 2204–2212. [Google Scholar]
  20. Ribeiro, A.H.; Tiels, K.; Aguirre, L.A.; Schön, T.B. Beyond exploding and vanishing gradients: Analysing RNN training using attractors and smoothness. In Proceedings of the AISTATS, virtually, 18 September 2020; Volume 108, pp. 2370–2380. [Google Scholar]
  21. Lepetit, V.; Noguer, F.M.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef] [Green Version]
  22. Gao, X.; Hou, X.; Tang, J.; Cheng, H. Complete Solution Classification for the Perspective-Three-Point Problem. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 930–943. [Google Scholar]
  23. Abdel-Aziz, Y.I.; Karara, H.M. Direct linear transformation into object space coordinates in close-range photogrammetry. In Proceedings of the Symposium on Close-Range Photogrammetry, Urbana, IL, USA, 26–29 January 1971; pp. 1–18. [Google Scholar]
  24. Triggs, B.; Mclauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle Adjustment—A Modern Synthesis. In Proceedings of the Workshop on Vision Algorithms, Corfu, Greece, 21–22 September 2000; pp. 298–372. [Google Scholar]
  25. Ge, Y.; Maurer, C.R.; Fitzpatrick, J.M. Surface-based 3D image registration using the iterative closest point algorithm with a closest point transform. Proc. SPIE- Int. Soc. Opt. Eng. 1996, 2710, 358–367. [Google Scholar]
  26. Horn, B.K.P.; Hilden, H.M.; Negahdaripour, S. Closed-form Solution of Absolute Orientation Using Orthonormal Matrices. J. Opt. Soc. Am. A 1988, 5, 1127–1135. [Google Scholar] [CrossRef] [Green Version]
  27. Guindel, C.; Beltrán, J.; Martín, D.; Garcia, F. Automatic Extrinsic Calibration for Lidar-Stereo Vehicle Sensor Setups. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 1–6. [Google Scholar]
  28. Zhou, L.; Deng, Z. Extrinsic calibration of a camera and a lidar based on decoupling the rotation from the translation. In Proceedings of the IEEE Intelligent Vehicles Symposium, Madrid, Spain, 3–7 June 2012; pp. 642–648. [Google Scholar]
  29. Wang, W.; Sakurada, K.; Kawaguchi, N. Reflectance Intensity Assisted Automatic and Accurate Extrinsic Calibration of 3D LiDAR and Panoramic Camera Using a Printed Chessboard. Remote Sens. 2017, 9, 851. [Google Scholar] [CrossRef] [Green Version]
  30. Cui, J.; Niu, J.; Ouyang, Z.; He, Y.; Liu, D. ACSC: Automatic Calibration for Non-repetitive Scanning Solid-State LiDAR and Camera Systems. arXiv 2020, arXiv:2011.08516. [Google Scholar]
  31. Ou, J.; Huang, P.; Zhou, J.; Zhao, Y.; Lin, L. Automatic Extrinsic Calibration of 3D LIDAR and Multi-Cameras Based on Graph Optimization. Sensors 2022, 22, 2221. [Google Scholar] [CrossRef]
  32. Gong, X.; Lin, Y.; Liu, J. 3D LIDAR-Camera Extrinsic Calibration Using an Arbitrary Trihedron. Sensors 2013, 13, 1902–1918. [Google Scholar] [CrossRef] [Green Version]
  33. Lee, G.; Lee, J.; Park, S. Calibration of VLP-16 Lidar and multi-view cameras using a ball for 360 degree 3D color map acquisition. In Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, Daegu, Korea, 16–18 November 2017; pp. 64–69. [Google Scholar]
  34. Pusztai, Z.; Eichhardt, I.; Hajder, L. Accurate Calibration of Multi-LiDAR-Multi-Camera Systems. Sensors 2018, 18, 2139. [Google Scholar] [CrossRef] [Green Version]
  35. Chai, Z.; Sun, Y.; Xiong, Z. A Novel Method for LiDAR Camera Calibration by Plane Fitting. In Proceedings of the IEEE International Conference on Advanced Intelligent Mechatronics, Auckland, New Zealand, 9–12 July 2018; pp. 286–291. [Google Scholar]
  36. Kümmerle, J.; Kühner, T. Unified Intrinsic and Extrinsic Camera and LiDAR Calibration under Uncertainties. In Proceedings of the IEEE International Conference on Robotics and Automation, Paris, France, 31 May–31 August 2020; pp. 6028–6034. [Google Scholar]
  37. Caselitz, T.; Steder, B.; Ruhnke, M.; Burgard, W. Monocular camera localization in 3d lidar maps. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Daejeon, Korea, 9–14 October 2016; pp. 1926–1931. [Google Scholar]
  38. Ishikawa, R.; Oishi, T.; Ikeuchi, K. LiDAR and Camera Calibration using Motion Estimated by Sensor Fusion Odometry. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 7342–7349. [Google Scholar]
  39. Chen, S.; Li, X.; Zhao, L. Multi-source remote sensing image registration based on sift and optimization of local self-similarity mutual information. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Beijing, China, 10–15 July 2016; pp. 2548–2551. [Google Scholar]
  40. Wolcott, R.W.; Eustice, R.M. Visual localization within LIDAR maps for automated urban driving. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014; pp. 176–183. [Google Scholar]
  41. Irie, K.; Sugiyama, M.; Tomono, M. Target-less camera-LiDAR extrinsic calibration using a bagged dependence estimator. In Proceedings of the IEEE International Conference on Automation Science and Engineering, Fort Worth, TX, USA, 21–25 August 2016; pp. 1340–1347. [Google Scholar]
  42. Zhu, Y.; Li, C.; Zhang, Y. Online Camera-LiDAR Calibration with Sensor Semantic Information. In Proceedings of the International Conference on Robotics and Automation, Paris, France, 31 May–31 August 2020; pp. 4970–4976. [Google Scholar]
  43. Xiao, Z.; Li, H.; Zhou, D.; Dai, Y.; Dai, B. Accurate extrinsic calibration between monocular camera and sparse 3D Lidar points without markers. In Proceedings of the Intelligent Vehicles Symposium, Los Angeles, CA, USA, 11–14 June 2017; pp. 424–429. [Google Scholar]
  44. John, V.; Long, Q.; Liu, Z.; Mita, S. Automatic calibration and registration of lidar and stereo camera without calibration objects. In Proceedings of the IEEE International Conference on Vehicular Electronics and Safety, Yokohama, Japan, 5–7 November 2015; pp. 231–237. [Google Scholar]
  45. Schönberger, J.L.; Frahm, J. Structure-from-Motion Revisited. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar]
  46. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef] [Green Version]
  47. Fassi, I.; Legnani, G. Hand to sensor calibration: A geometrical interpretation of the matrix equation AX = XB. J. Field Robot. 2005, 22, 497–506. [Google Scholar] [CrossRef]
  48. Ban, X.; Wang, H.; Chen, T.; Wang, Y.; Xiao, Y. Monocular Visual Odometry Based on Depth and Optical Flow Using Deep Learning. IEEE Trans. Instrum. Meas. 2021, 70, 1–19. [Google Scholar] [CrossRef]
  49. Chien, H.; Klette, R.; Schneider, N.; Franke, U. Visual odometry driven online calibration for monocular LiDAR-camera systems. In Proceedings of the IEEE International Conference on Pattern Recognition, Cancun, Mexico, 4–8 December 2016; pp. 2848–2853. [Google Scholar]
  50. Girshick, R.B. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  51. Schneider, N.; Piewak, F.; Stiller, C.; Franke, U. RegNet: Multimodal sensor registration using deep neural networks. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017. [Google Scholar]
  52. Iyer, G.; Ram, R.K.; Murthy, J.K.; Krishna, K.M. CalibNet: Self-Supervised Extrinsic Calibration using 3D Spatial Transformer Networks. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar]
  53. Cattaneo, D.; Vaghi, M.; Ballardini, A.L.; Fontana, S.; Sorrenti, D.G.; Burgard, W. CMRNet: Camera to LiDAR-Map Registration. In Proceedings of the IEEE International Conference on Intelligent Transportation Systems (ITSC), Auckland, New Zealand, 27–30 October 2019. [Google Scholar]
  54. Neubert, P.; Schubert, S.; Protzel, P. Sampling-based methods for visual navigation in 3D maps by synthesizing depth images. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Vancouver, BC, Canada, 24–28 September 2017; pp. 2492–2498. [Google Scholar]
  55. Lin, M.; Chen, Q.; Yan, S. Network In Network. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014; pp. 1–10. [Google Scholar]
  56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  57. Sun, D.; Yang, X.; Liu, M.; Kautz, J. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8934–8943. [Google Scholar]
  58. Yuan, K.; Guo, Z.; Wang, Z.J. RGGNet: Tolerance Aware LiDAR-Camera Online Calibration With Geometric Deep Learning and Generative Model. IEEE Robot. Autom. Lett. 2020, 5, 6956–6963. [Google Scholar] [CrossRef]
  59. Ye, C.; Pan, H.; Gao, H. Keypoint-Based LiDAR-Camera Online Calibration With Robust Geometric Network. IEEE Trans. Instrum. Meas. 2022, 71, 1–11. [Google Scholar] [CrossRef]
  60. Ku, J.; Harakeh, A.; Waslander, S.L. In defense of classical image processing: Fast depth completion on the CPU. In Proceedings of the IEEE Conference on Computer and Robot Vision, Toronto, ON, Canada, 8–10 May 2018; pp. 16–22. [Google Scholar]
  61. An, P.; Fu, W.; Gao, Y.; Ma, J.; Zhang, J.; Yu, K.; Fang, B. Lambertian Model-Based Normal Guided Depth Completion for LiDAR-Camera System. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  62. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  63. An, P.; Liang, J.; Yu, K.; Fang, B.; Ma, J. Deep structural information fusion for 3D object detection on LiDAR-camera system. Comput. Vis. Image Underst. 2022, 214, 103295. [Google Scholar] [CrossRef]
  64. Mukherjee, S.; Guddeti, R.M.R. A Hybrid Algorithm for Disparity Calculation From Sparse Disparity Estimates Based on Stereo Vision. In Proceedings of the International Conference on Signal Processing and Communications, Bangalore, India, 22–25 July 2014; pp. 1–6. [Google Scholar]
Figure 1. The general pipeline of the deep learning based extrinsic calibration method. Inputs are D0 and the RGB image R. D0 is obtained by projecting the LiDAR point cloud Pl into the image plane with K, R0, and T0. Outputs are ΔR and ΔT. Squeeze is the operation that converts a 2D feature map into a 1D feature vector. The function Qua2Rot converts the quaternion ΔQ into the rotation matrix ΔR. Conv. and FC denote convolution and fully connected layers, respectively.
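To make the two projection-related operations in this pipeline concrete, the following is a minimal NumPy sketch of (i) converting a predicted quaternion ΔQ into a rotation matrix (the Qua2Rot step) and (ii) projecting the LiDAR point cloud Pl into the image plane with K, R0, and T0 to form D0. The function names and the simple nearest-pixel rasterization are our own illustration, not the paper's implementation.

```python
import numpy as np

def qua2rot(q):
    """Convert a quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)          # normalize to a unit quaternion
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def project_lidar(P_l, K, R0, T0, H, W):
    """Project LiDAR points P_l (N, 3) into the image plane to build a sparse depth map D0."""
    P_c = P_l @ R0.T + T0                       # LiDAR frame -> camera frame
    P_c = P_c[P_c[:, 2] > 0.0]                  # keep points in front of the camera
    uvw = P_c @ K.T                             # pinhole projection (homogeneous pixel coords)
    uv = uvw[:, :2] / uvw[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    D0 = np.zeros((H, W), dtype=np.float32)
    D0[v[valid], u[valid]] = P_c[valid, 2]      # store camera-frame depth (z-buffering omitted)
    return D0
```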
Figure 2. The task and significance of online extrinsic calibration on a LiDAR-camera system. (a) The model of the LiDAR-camera system; (b) inaccurate alignment of the LiDAR point cloud and RGB image with R0 and T0; (c) LIA-SC-Net predicts accurate ΔR and ΔT, thus obtaining a precise alignment result. The projected LiDAR point cloud, pseudo-colored by LiDAR intensity, is used for visualization. After calibration, the RGB pixels and LiDAR points of the vehicle in the red box are correctly aligned.
Figure 3. The pipeline of LIA-SC-Net. It mainly consists of LIA-Net and the SC loss. The abbreviations Op. and approx. stand for operation and approximation, respectively. The SC loss reduces the alignment error by minimizing the re-projection error of the transformed point P′ = R0P + T0.
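As a reading aid for the SC loss described in this caption, the sketch below compares the LiDAR depth and intensity images projected with the current extrinsic estimate against those projected with the ground-truth extrinsics. The L1 penalty and the validity mask are our assumptions; the paper may use a different formulation.

```python
import numpy as np

def sc_loss(depth_pred, inten_pred, depth_gt, inten_gt, eps=1e-6):
    """Structural consistency loss as we read it: penalize the difference between the
    LiDAR depth/intensity images projected with the predicted extrinsics and those
    projected with the ground-truth extrinsics, over pixels carrying a LiDAR return."""
    mask = ((depth_pred > 0) | (depth_gt > 0)).astype(np.float32)
    diff = np.abs(depth_pred - depth_gt) + np.abs(inten_pred - inten_gt)
    return float((diff * mask).sum() / (mask.sum() + eps))
```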
Figure 4. Visualization of the dense operation. (a) RGB image; (b) LiDAR depth and intensity; (c) dense LiDAR depth and intensity after the dense operation. The edge structure of the vehicle inside the red box is clearer in the dense LiDAR feature map than in the raw one, which benefits the extraction of significant calibration features.
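The dense operation itself is not detailed in this caption; the sketch below is a crude stand-in based on a single morphological dilation. The fast CPU depth completion of [60], which the dense operation likely follows, applies a longer sequence of morphological steps and dilates inverted depth so that closer returns dominate; those refinements are omitted here for brevity.

```python
import numpy as np
from scipy.ndimage import grey_dilation

def densify(sparse_map, kernel=5):
    """Fill empty pixels of a sparse LiDAR depth/intensity map by morphological
    dilation; pixels that already carry a LiDAR return are kept unchanged."""
    dilated = grey_dilation(sparse_map, size=(kernel, kernel))
    return np.where(sparse_map > 0, sparse_map, dilated)
```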
Figure 5. Visualization of (a) the RGB image and (b) the LiDAR intensity. As shown by the red bounding boxes, objects with higher LiDAR intensity have clearer structural features in both the RGB image and the LiDAR point cloud.
Figure 6. Calibration performance of five baseline methods in the validation dataset.
Figure 7. Comparison of baseline-1 (left) and baseline-5 (right) in the validation dataset for various outdoor scenes (a–c). Comparing the alignment of LiDAR points and RGB pixels of the objects in the red boxes, baseline-5 yields smaller alignment errors. The projected LiDAR point cloud, pseudo-colored by LiDAR intensity, is used for visualization.
Figure 8. Calibration performance of deep learning based methods in the validation dataset.
Figure 9. Calibration visualization of different learning based calibration methods in various scenes (a–c). Overlays of the LiDAR depth image and the RGB image are used for visualization.
Table 1. Different calibration error levels used to verify the deep learning based methods.
Levels      0      1      2      3      4      5
θ/deg       0.0    4.0    8.0    12.0   16.0   20.0
D/m         0.0    0.30   0.60   0.90   1.20   1.50
Training    ×      ×      ×      🗸      ×      ×
Testing     🗸      🗸      🗸      🗸      🗸      🗸
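For readers reproducing this evaluation protocol, the sketch below shows one plausible way to draw a random miscalibration bounded by a level of Table 1. The exact sampling scheme (uniform versus fixed magnitude, per-axis versus total offset) is not specified by the table, so those details are assumptions.

```python
import numpy as np

# Maximum rotation (deg) and translation (m) offsets for each level in Table 1.
LEVELS = {0: (0.0, 0.00), 1: (4.0, 0.30), 2: (8.0, 0.60),
          3: (12.0, 0.90), 4: (16.0, 1.20), 5: (20.0, 1.50)}

def sample_miscalibration(level, rng=None):
    """Draw a random axis-angle rotation and translation offset bounded by the level."""
    rng = rng or np.random.default_rng()
    max_deg, max_m = LEVELS[level]
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis) + 1e-12           # random unit rotation axis
    angle = np.deg2rad(rng.uniform(0.0, max_deg))  # rotation magnitude within the bound
    t = rng.uniform(-max_m, max_m, size=3)         # translation offset within the bound
    return axis * angle, t                         # axis-angle vector (rad), translation (m)
```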
Table 2. Five baselines for the verification of LIA-SC-Net.
Methods                     Dense Operation    Encoder    LIA-Net    L_reg    L_SC
baseline-1                  ×                  🗸          ×          🗸        ×
baseline-2                  🗸                  🗸          ×          🗸        ×
baseline-3 (LIA+reg)        🗸                  ×          🗸          🗸        ×
baseline-4 (LIA+SC)         🗸                  ×          🗸          ×        🗸
baseline-5 (LIA-SC-Net)     🗸                  ×          🗸          🗸        🗸
Table 3. Average and standard deviation (reported as avg/std) of the rotation and translation errors for the five baseline methods in the validation dataset. The gain of the average calibration error is also provided.
Methods       Rotation Error/deg    Translation Error/m
Baseline-1    0.869/0.062           0.0693/0.0087
Baseline-2    0.812/0.054           0.0648/0.0074
Baseline-3    0.786/0.052           0.0601/0.0068
Baseline-4    0.768/0.047           0.0538/0.0064
Baseline-5    0.751/0.045           0.0447/0.0061
Gain          15.71%                55.03%
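The listed gains are consistent with measuring the improvement of baseline-5 over baseline-1 relative to baseline-5 (our reading; the table does not state the reference explicitly):

\[
\mathrm{Gain}_{\mathrm{rot}} = \frac{0.869 - 0.751}{0.751} \approx 15.71\%, \qquad
\mathrm{Gain}_{\mathrm{trans}} = \frac{0.0693 - 0.0447}{0.0447} \approx 55.03\%.
\]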
Table 4. Performance of iterative inference of LIA-SC-Net in the validation dataset.
M         1        2        3        4        5
Er/deg    0.751    0.654    0.538    0.525    0.525
Et/m      0.0447   0.0402   0.0396   0.0396   0.0396
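Table 4 suggests that the calibration errors saturate after about three passes. The sketch below shows how such iterative inference can be organized: the point cloud is re-projected with the latest extrinsic estimate before each pass, and the predicted correction is composed with that estimate. The composition convention (correction applied in the camera frame) is our assumption, and project_lidar refers to the earlier sketch.

```python
def iterative_inference(net, rgb, P_l, K, R0, T0, M=3):
    """Refine the extrinsic estimate by running the calibration network M times."""
    R, T = R0.copy(), T0.copy()
    H, W = rgb.shape[:2]
    for _ in range(M):
        D = project_lidar(P_l, K, R, T, H, W)   # re-project with the latest estimate
        dR, dT = net(D, rgb)                    # network predicts a residual correction
        R = dR @ R                              # compose the correction with the estimate
        T = dR @ T + dT
    return R, T
```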
Table 5. Runtime of LIA-SC-Net in the validation dataset (Unit: ms).
Module    Preparation    LIA-Net    Regression    Total
Time      7.6            13.5       8.1           29.2
Table 6. Average calibration errors of deep learning based methods in the validation dataset.
Methods               Rotation Error/deg    Translation Error/m
CalibNet              0.621                 0.0495
CMR-Net               0.557                 0.0437
LIA-SC-Net (Ours)     0.525                 0.0396
Table 7. Runtime comparison with deep learning based methods in the validation dataset (Unit: ms).
Methods    CalibNet    CMR-Net    LIA-SC-Net
Time       36.8        14.7       29.2
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
