**1. Introduction**

Multi-sensor fusion is employed in many fields, such as autonomous driving and robotics. A single sensor cannot guarantee reliable recognition in complex and varied scenarios [1], which makes it difficult to cope with the full range of autonomous driving situations using only one sensor. In contrast, fusing two or more sensors supports reliable environmental perception around the vehicle: in multi-sensor fusion, one sensor compensates for the shortcomings of another [2]. In addition, multi-sensor fusion expands the detection range and improves the measurement density compared with using a single sensor [3]. Studies based on multi-sensor fusion include 3D object detection, road detection, semantic segmentation, object tracking, visual odometry, and mapping [4–9]. Moreover, most large datasets built for autonomous driving research [10–13] provide data measured by at least two different sensors. Importantly, multi-sensor fusion is strongly affected by the calibration accuracy of the sensors used. While the vehicle is driving, the poses or positions of the sensors mounted on it may change for various reasons. For multi-sensor fusion, it is therefore essential to calibrate the sensors online so that changes in their poses or positions can be recognized accurately.

Extensive academic research on multi-sensor calibration has been performed [2,3]. Many traditional calibration methods [14–17] use artificial markers, such as checkerboards, as calibration targets. These target-based calibration algorithms are not suitable for autonomous driving because they involve processes that require manual intervention. Some calibration methods currently in use therefore focus on fully automatic, targetless online self-calibration [3,18–24]. However, most online calibration methods perform calibration only when certain conditions are met, and their calibration accuracy is not as high as that of the target-based offline methods [1].

The latest online calibration methods [1,2,25–27], which are based on deep learning, use optimization through gradient descent, large-scale datasets, and CNNs to overcome the limitations of the earlier online methods. In particular, recent CNN-based research has shown promising results: compared with previous methods, CNN-based online self-calibration methods do not require strict conditions and provide excellent calibration accuracy when run online.

Many CNN-based LiDAR-camera calibration methods use an image for calibration. In this case, the LiDAR point cloud is projected onto the image, and 2D convolution kernels are then used to extract features from the inputs.

In this study, we propose a CNN-based multi-sensor online self-calibration method. The method estimates the six parameters that describe the rotation and translation between two sensors capable of measuring 3D space. The sensor combinations subject to calibration in our method are a LiDAR and a stereo camera, and a pair of LiDARs. One of the two sensors is set as the reference sensor and the other as the target sensor; in the LiDAR-stereo camera combination, the stereo camera is set as the reference sensor.

The CNN we propose operates on voxels instead of image features. We therefore convert the stereo image into 3D points, called pseudo-LiDAR points, before feeding them into the network. The pseudo-LiDAR points and the actual LiDAR points are expressed in voxel spaces through voxelization. Then, 3D convolution kernels are applied to the voxels to generate features that can be used for calibration parameter regression. In particular, the attention mechanism [28] included in our network captures the correlation between the input information of the two sensors. Voxel representations are used in diverse research fields, including shape completion, semantic segmentation, multi-view stereo vision, and object detection [29–32].
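
As a concrete illustration of these two preprocessing steps, the following minimal sketch back-projects a disparity map into pseudo-LiDAR points and maps a point set into a binary occupancy voxel grid. It assumes a rectified stereo pair with known focal length, baseline, and principal point; the function names and the binary occupancy representation are illustrative choices for this example, not a description of our exact implementation.

```python
import numpy as np

def disparity_to_pseudo_lidar(disparity, f, baseline, cx, cy):
    """Back-project a stereo disparity map (pixels) into 3D points.

    Assumes a rectified pair with focal length f (pixels), baseline
    (meters), and principal point (cx, cy)."""
    v, u = np.indices(disparity.shape)
    valid = disparity > 0
    z = f * baseline / disparity[valid]   # depth from disparity
    x = (u[valid] - cx) * z / f           # back-project into the camera frame
    y = (v[valid] - cy) * z / f
    return np.stack([x, y, z], axis=1)    # (N, 3) pseudo-LiDAR points

def voxelize(points, voxel_size, lo, hi):
    """Map (N, 3) points into a binary occupancy voxel grid over [lo, hi)."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    inside = np.all((points >= lo) & (points < hi), axis=1)
    idx = ((points[inside] - lo) / voxel_size).astype(np.int64)
    dims = np.ceil((hi - lo) / voxel_size).astype(np.int64)
    grid = np.zeros(dims, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # mark occupied cells
    return grid
```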

The amount of data in public datasets is insufficient for online self-calibration. Existing studies have therefore assigned random deviations to the values of the known parameters and evaluated online self-calibration performance by how accurately the proposed algorithm predicts each deviation; this protocol is commonly referred to as miscalibration [1,2,25]. To sample the random deviation, we choose the rotation range and translation range as ±20° and ±1.5 m, respectively, as in [1]. In this study, we train five networks on a wide range of miscalibrations, and we apply iterative refinement to the outputs of the five networks as well as temporal filtering over time to increase the calibration accuracy. The KITTI dataset [10] and the Oxford dataset [12] are used to evaluate the proposed method: the KITTI dataset for online LiDAR-stereo camera calibration and the Oxford dataset for online LiDAR-LiDAR calibration.
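
For illustration, the sketch below draws one random deviation within these ranges and composes it into a 4 × 4 rigid transform that can be applied to the known extrinsics. The Euler-angle parameterization, the uniform sampling, and the function name are assumptions made for this example rather than a specification of our training pipeline.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_miscalibration(max_rot_deg=20.0, max_trans_m=1.5, seed=None):
    """Draw a random 6-DoF deviation within +/- max_rot_deg degrees and
    +/- max_trans_m meters, returned as a 4x4 rigid transform."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(-max_rot_deg, max_rot_deg, size=3)  # roll, pitch, yaw
    trans = rng.uniform(-max_trans_m, max_trans_m, size=3)   # x, y, z offsets
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
    T[:3, 3] = trans
    return T
```

A training sample can then be built by perturbing the known extrinsics, e.g., T_init = T_dev @ T_gt, with the network asked to recover T_dev (or, equivalently, its inverse).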

The rest of this paper is organized as follows. Section 2 provides an overview of existing calibration studies. Section 3 describes the proposed method. Section 4 presents the experimental results for the proposed method, and Section 5 draws conclusions.

#### **2. Related Work**

This section provides a brief overview of traditional calibration methods. In addition, we introduce how CNN-based calibration methods have been improved. Specifically, the calibrations covered in this section are LiDAR-camera calibration and LiDAR-LiDAR calibration.

#### *2.1. Traditional Methods*

Traditional target-based calibration methods mainly use artificial markers. One example is the LiDAR-camera calibration method that uses a polygonal planar board [14]. This method first finds the vertices of the planar board in the image and in the LiDAR point cloud. Then, the corresponding points between the vertices in the image and the vertices in the point cloud are matched, and a linear system is formulated. Finally, the method solves this system using singular value decomposition to obtain the calibration parameters. Another example is the LiDAR-camera calibration method that uses a planar chessboard [15]. The edge information in an image and the Perspective-n-Point algorithm are used to find the plane of the chessboard that appears in the image. Then, distance filtering and a random sample consensus (RANSAC) algorithm are used to obtain the chessboard plane information from the LiDAR point cloud. After obtaining the plane information, this method aligns the normal vectors of the planes to obtain the rotation parameters between the LiDAR and the camera. The translation parameters are then calculated by minimizing the distance between the plane extracted from the LiDAR points and the rotated plane extracted from the image. Similarly, the algorithm in [16] is an automatic LiDAR-LiDAR calibration method that uses three planar surfaces obtained with a RANSAC algorithm. The LiDAR-LiDAR calibration is formulated as a nonlinear optimization problem that minimizes the distance between corresponding planes, and the Levenberg-Marquardt algorithm is adopted for the nonlinear optimization. Another LiDAR-LiDAR calibration method [17] uses two poles covered with retro-reflective tape so that they can be easily identified in the point cloud. This method first applies a reflectance threshold to find the points on the poles. Each detected point set is then expressed by the linear equation of a line, and the points and line equations are used to solve a least-squares problem whose solution yields the calibration parameters.
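
To make the plane normal alignment step used in [15,16] concrete, the following sketch recovers the rotation that best aligns corresponding unit plane normals via an SVD (the standard Kabsch/Wahba solution). It is a simplified illustration of the underlying idea, not the exact procedure of those papers.

```python
import numpy as np

def rotation_from_normals(normals_src, normals_dst):
    """Rotation R minimizing sum ||R n_src - n_dst||^2 over corresponding
    unit plane normals given as the rows of the two (K, 3) inputs."""
    H = normals_src.T @ normals_dst            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against a reflection
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
```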

Among the targetless online LiDAR-camera calibration methods, some use edge information [18,19]. In these methods, when the LiDAR-camera calibration is correct, the edges of the depth map obtained by projecting the LiDAR points onto the image align naturally with the edges of the image. Other methods use the correlation between sensor data, as in [20,21]. The method in [20] uses the reflectance information from the LiDAR and the intensity information from the camera: the correlation between reflectance and intensity is maximized when the LiDAR-camera calibration is accurate. The method in [21] uses the surface normals of the LiDAR points and the intensities of the image pixels for calibration. There is also a method based on the hand-eye calibration framework [22], which estimates the motion of the LiDAR and the camera separately and uses this information for calibration. A targetless online LiDAR-LiDAR calibration method is introduced in [23]. This method first performs a rough calibration from an arbitrary initial pose; the calibration parameters are then corrected with an iterative closest point algorithm and further optimized using an octree-based method. Other LiDAR-LiDAR calibration methods [3,24] are likewise based on the hand-eye calibration framework: the motion of each LiDAR is estimated and used for calibration.
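
As a simplified illustration of the reflectance-intensity idea in [20], the sketch below scores a candidate extrinsic matrix by the Pearson correlation between LiDAR reflectance and grayscale image intensity at the projected pixel locations. The original method maximizes a statistical dependence measure through an optimizer, so the plain correlation score and all names here are illustrative assumptions.

```python
import numpy as np

def reflectance_intensity_score(points, reflectance, image, K, T):
    """Correlate LiDAR reflectance with image intensity at the pixels where
    the (N, 3) points land under candidate extrinsics T (4x4) and camera
    intrinsics K (3x3); higher scores indicate better calibration."""
    cam = (T @ np.c_[points, np.ones(len(points))].T)[:3]  # camera frame
    front = cam[2] > 0.1                                   # keep points ahead
    uv = K @ cam[:, front]
    uv = np.round(uv[:2] / uv[2]).astype(int)              # pixel coordinates
    h, w = image.shape
    ok = (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    r = reflectance[front][ok]
    i = image[uv[1, ok], uv[0, ok]]
    return np.corrcoef(r, i)[0, 1]
```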

#### *2.2. CNN-Based Methods*

RegNet [25], the first CNN-based online LiDAR-camera self-calibration method, adopted a three-stage convolutional pipeline consisting of feature extraction, feature matching, and global regression. RegNet trains its CNN on decalibrations of a given point cloud and uses five networks to predict the six-degree-of-freedom (6-DoF) extrinsic parameters over five different decalibration ranges.

CalibNet [2], another CNN-based online LiDAR-camera self-calibration method, is a geometrically supervised deep network capable of automatically predicting the 6-DoF extrinsic parameters. Its end-to-end training maximizes photometric and geometric consistency. Photometric consistency is measured between two depth maps constructed by projecting a given point cloud onto the input image with the 6-DoF parameters predicted by the network and with the ground-truth 6-DoF parameters. Similarly, geometric consistency is calculated between the two sets of 3D points obtained by transforming the point cloud into 3D space with the predicted and the ground-truth 6-DoF parameters.
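
The geometric-consistency term can be illustrated in a few lines: transform the same point cloud with the predicted and the ground-truth extrinsics and penalize the displacement between the two results. The mean Euclidean norm below is an assumption for illustration; the loss actually used in CalibNet may weight and combine its terms differently.

```python
import numpy as np

def geometric_consistency(points, T_pred, T_gt):
    """Mean distance between an (N, 3) point cloud transformed with the
    predicted and with the ground-truth 4x4 extrinsic parameters."""
    pts = np.c_[points, np.ones(len(points))].T       # 4xN homogeneous points
    diff = (T_pred @ pts)[:3] - (T_gt @ pts)[:3]      # per-point displacement
    return np.linalg.norm(diff, axis=0).mean()
```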

LCCNet [1], which represents a significant improvement over previous CNN-based methods, is a CNN-based online LiDAR-camera self-calibration method that considers the correlation between RGB image features and the depth image projected from point clouds. Additional CNN-based online LiDAR-camera self-calibration methods are presented in [26,27]. They utilize semantic information extracted by a CNN to perform calibration that is more robust to changes in lighting and to noise [26].

To the best of our knowledge, no deep learning-based LiDAR-LiDAR or LiDAR-stereo camera online self-calibration method has yet been reported. In this paper, we propose, for the first time, a deep learning-based method that is capable of online self-calibration between such sensor combinations.
