*3.1. Preprocessing*

Several preprocessing steps, including data preparation, must be carried out before online self-calibration can be performed with the network we designed. This section describes these steps. We assume that the sensors targeted for online calibration are capable of 3D measurement, and we therefore work with the point clouds they generate. The LiDAR-LiDAR combination satisfies this premise directly; the LiDAR-stereo camera combination does not, so we obtain a 3D point cloud from the stereo images. The conversion of the stereo depth map to 3D points and the removal of redundant 3D points, covered in the next two subsections, are not required for the LiDAR-LiDAR combination.

#### 3.1.1. Conversion of Stereo Depth Map to 3D Points

A depth map is built from stereo images through stereo matching. In this paper, we obtain the depth map using the method in [33], which implements semi-global matching [34]. This depth map, composed of disparities, is converted to 3D points, called pseudo-LiDAR points, as follows:

$$P = \begin{bmatrix} X & Y & Z \end{bmatrix} = \begin{bmatrix} \frac{\text{base} \cdot (u - c\_u)}{\text{disp}} & \frac{\text{base} \cdot (v - c\_v)}{\text{disp}} & \frac{f\_u \cdot \text{base}}{\text{disp}} \end{bmatrix} \tag{1}$$

where *c<sub>u</sub>*, *c<sub>v</sub>*, and *f<sub>u</sub>* are the camera intrinsic parameters, *u* and *v* are pixel coordinates, *base* is the baseline distance between the cameras, and *disp* is the disparity obtained from stereo matching.
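
As a concrete illustration, the following minimal sketch applies Equation (1) to a dense disparity map with NumPy. The function name and the treatment of invalid (zero) disparities are our own assumptions, not part of the original pipeline.

```python
import numpy as np

def disparity_to_pseudo_lidar(disp, f_u, c_u, c_v, base):
    """Convert a disparity map (H x W) to pseudo-LiDAR points via Equation (1)."""
    h, w = disp.shape
    v, u = np.mgrid[0:h, 0:w]             # pixel coordinates (v: row, u: column)
    valid = disp > 0                      # assumption: zero disparity marks invalid pixels
    d = disp[valid]
    X = base * (u[valid] - c_u) / d
    Y = base * (v[valid] - c_v) / d
    Z = f_u * base / d
    return np.stack([X, Y, Z], axis=1)    # (N, 3) pseudo-LiDAR points
```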

#### 3.1.2. Removal of Pseudo-LiDAR Points

The number of pseudo-LiDAR points is much larger than the number of points measured by a LiDAR. Therefore, we reduce the quantity of pseudo-LiDAR points through a spherical projection, implemented using the method presented in [35] as follows:

$$\begin{bmatrix} p \\ q \end{bmatrix} = \begin{bmatrix} \frac{1}{2} \left( 1 - \arctan(-X, Z)/\pi \right) \cdot w \\ \left( 1 - \left( \arcsin\left( -Y \cdot r^{-1} \right) + f\_{up} \right) \cdot f^{-1} \right) \cdot h \end{bmatrix} \tag{2}$$

where (*X*, *Y*, *Z*) are the 3D coordinates of a pseudo-LiDAR point, (*p*, *q*) are the angular coordinates, (*h*, *w*) are the height and width of the desired projected 2D map, *r* is the range of each point, and *f* = *f<sub>up</sub>* + *f<sub>down</sub>* is the vertical field of view (FOV) of the sensor. We set *f<sub>up</sub>* to 3° and *f<sub>down</sub>* to −25°. The range 3° to −25° is the vertical FOV of the LiDAR used to build the KITTI benchmarks [10]. This spherical projection turns the pseudo-LiDAR points into a 2D image. When multiple pseudo-LiDAR points are projected onto the same pixel of the 2D map, only the last projected point is kept, and the previously projected points are removed.
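
A minimal sketch of the projection in Equation (2) and the keep-only-the-last-point rule is given below. The 64 × 1024 map size, the use of the magnitude of *f<sub>down</sub>* when forming *f*, and the function name are our assumptions.

```python
import numpy as np

def spherical_downsample(points, h=64, w=1024, f_up=np.radians(3.0), f_down=np.radians(25.0)):
    """Reduce pseudo-LiDAR points via the spherical projection of Equation (2).
    Only the last point projected onto each (q, p) pixel is kept."""
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    r = np.maximum(np.linalg.norm(points, axis=1), 1e-6)   # range of each point
    f = f_up + f_down                                       # vertical FOV (magnitudes assumed)
    p = 0.5 * (1.0 - np.arctan2(-X, Z) / np.pi) * w         # horizontal map coordinate
    q = (1.0 - (np.arcsin(-Y / r) + f_up) / f) * h          # vertical map coordinate
    p = np.clip(np.floor(p), 0, w - 1).astype(int)
    q = np.clip(np.floor(q), 0, h - 1).astype(int)
    last = {}                                               # (q, p) pixel -> index of last projected point
    for i, (qi, pi) in enumerate(zip(q, p)):
        last[(qi, pi)] = i
    return points[sorted(last.values())]
```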

#### 3.1.3. Setting of Region of Interest

Because the FOVs of the sensors are usually different, we determine a region of interest (ROI) for each sensor and perform calibration only with data belonging to this ROI. The ROI cannot be derived theoretically; it can only be determined experimentally, so we set the ROI of each sensor by examining the distribution of the data it acquires.

We provide an example of setting the ROI using the KITTI [10] and Oxford [12] datasets. For the KITTI dataset, which was built using a stereo camera and a LiDAR, the ROI of the stereo camera is set to [Horizon: −10 m to 10 m, Vertical: −2 m to 1 m, Depth: 0 m to 50 m], and the ROI of the LiDAR is set to the same values. For the Oxford dataset, which was built using two LiDARs, the ROI of each LiDAR is set to [Horizon: −30 m to 30 m, Vertical: −2 m to 1 m, Depth: −30 m to 30 m].
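
For illustration, the ROI crop can be implemented as a simple axis-aligned mask. The sketch below uses the KITTI values quoted above and assumes the (horizon, vertical, depth) order maps to the (x, y, z) axes of the sensor frame.

```python
import numpy as np

def crop_to_roi(points, horizon=(-10.0, 10.0), vertical=(-2.0, 1.0), depth=(0.0, 50.0)):
    """Keep only points inside the axis-aligned ROI (assumed x = horizon, y = vertical, z = depth)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    mask = ((horizon[0] <= x) & (x <= horizon[1]) &
            (vertical[0] <= y) & (y <= vertical[1]) &
            (depth[0] <= z) & (z <= depth[1]))
    return points[mask]
```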

#### 3.1.4. Transformation of Point Cloud of Target Sensor

In this paper, the miscalibration method used in previous studies [1,2,25] is used to calibrate the stereo camera-LiDAR and LiDAR-LiDAR combinations. The KITTI [10] and Oxford [12] datasets we use provide the values of the six extrinsic parameters between the two heterogeneous sensors as well as the 3D point clouds generated by them. We can therefore transform the 3D point cloud created by one sensor into a new 3D point cloud using these six parameter values. By adding arbitrary deviations to these parameters, we can further transform the already-transformed point cloud into another pose. If a calibration algorithm then accurately recovers the deviations we assigned at random, the deviated point cloud can be moved back to its position before the deviation was applied.

In order to apply the aforementioned approach to our proposed online self-calibration method, a 3D point *P* = [*x*, *y*, *z*] ∈ R<sup>3</sup> measured by the target sensor is transformed by Equation (3) as follows:

$$\widehat{P}^{\prime T} = RT\_{\rm mis}\, RT\_{\rm init}\, \widehat{P}^{T} = \begin{bmatrix} R\_{\rm mis} & & & T\_{\rm mis} \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R\_{\rm init} & & & T\_{\rm init} \\ 0 & 0 & 0 & 1 \end{bmatrix} \widehat{P}^{T} \tag{3}$$

$$RT\_{gt} = RT\_{\rm mis}^{-1} \tag{4}$$

$$R\_{\rm init} = \begin{bmatrix} \cos(R\_z) & -\sin(R\_z) & 0\\ \sin(R\_z) & \cos(R\_z) & 0\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos(R\_y) & 0 & \sin(R\_y) \\ 0 & 1 & 0\\ -\sin(R\_y) & 0 & \cos(R\_y) \end{bmatrix} \begin{bmatrix} 1 & 0 & 0\\ 0 & \cos(R\_x) & -\sin(R\_x)\\ 0 & \sin(R\_x) & \cos(R\_x) \end{bmatrix} \tag{5}$$

$$T\_{\rm init} = \begin{bmatrix} T\_x & T\_y & T\_z \end{bmatrix}^T \tag{6}$$

$$R\_{\rm mis} = \begin{bmatrix} \cos(\theta\_z) & -\sin(\theta\_z) & 0\\ \sin(\theta\_z) & \cos(\theta\_z) & 0\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos(\theta\_y) & 0 & \sin(\theta\_y) \\ 0 & 1 & 0\\ -\sin(\theta\_y) & 0 & \cos(\theta\_y) \end{bmatrix} \begin{bmatrix} 1 & 0 & 0\\ 0 & \cos(\theta\_x) & -\sin(\theta\_x)\\ 0 & \sin(\theta\_x) & \cos(\theta\_x) \end{bmatrix} \tag{7}$$

$$T\_{\rm mis} = \begin{bmatrix} \tau\_{\rm x} & \tau\_{\rm y} & \tau\_{\rm z} \end{bmatrix}^T \tag{8}$$

where *P*′ is the transformed point of *P*, and superscript *T* represents the transpose. *P̂* and *P̂*′ in Equation (3) are *P* and *P*′ expressed in homogeneous coordinates. *RT<sub>gt</sub>*, described in Equation (4), is the transformation matrix we want to predict with our proposed method, and it is used as the ground truth when the training loss is calculated. In Equation (5), the parameters *R<sub>x</sub>*, *R<sub>y</sub>*, and *R<sub>z</sub>* describe the rotation angles about the *x*-, *y*-, and *z*-axes between the two sensors. In Equation (6), *T<sub>x</sub>*, *T<sub>y</sub>*, and *T<sub>z</sub>* describe the relative displacement between the two sensors along the *x*-, *y*-, and *z*-axes. In this study, we assume that the values of the six parameters *R<sub>x</sub>*, *R<sub>y</sub>*, *R<sub>z</sub>*, *T<sub>x</sub>*, *T<sub>y</sub>*, and *T<sub>z</sub>* are given. In Equations (7) and (8), the parameters *θ<sub>x</sub>*, *θ<sub>y</sub>*, *θ<sub>z</sub>*, *τ<sub>x</sub>*, *τ<sub>y</sub>*, and *τ<sub>z</sub>* represent the random deviations for *R<sub>x</sub>*, *R<sub>y</sub>*, *R<sub>z</sub>*, *T<sub>x</sub>*, *T<sub>y</sub>*, and *T<sub>z</sub>*, respectively. Each of these six deviations is sampled uniformly at random within a predefined deviation range described below. In Equations (5)–(8), *R<sub>init</sub>* and *R<sub>mis</sub>* are rotation matrices, and *T<sub>init</sub>* and *T<sub>mis</sub>* are translation vectors. The transformation in Equation (3) is performed only on points belonging to the predetermined ROI of the target sensor.
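
The transformation of Equations (3)–(8) can be sketched as follows. The function names are ours, angles are taken in radians, and the homogeneous-matrix helper is an assumption about how *RT<sub>init</sub>* and *RT<sub>mis</sub>* are assembled.

```python
import numpy as np

def euler_to_rotation(rx, ry, rz):
    """Rotation matrix of Equations (5) and (7): Rz(rz) @ Ry(ry) @ Rx(rx), angles in radians."""
    cz, sz = np.cos(rz), np.sin(rz)
    cy, sy = np.cos(ry), np.sin(ry)
    cx, sx = np.cos(rx), np.sin(rx)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

def to_homogeneous(R, T):
    """Assemble a 4 x 4 transform from a rotation matrix and a translation vector."""
    RT = np.eye(4)
    RT[:3, :3] = R
    RT[:3, 3] = T
    return RT

def miscalibrate(points, RT_init, RT_mis):
    """Equation (3): apply RT_init and then the random miscalibration RT_mis to ROI points.
    Returns the deviated points and the ground-truth correction RT_gt of Equation (4)."""
    P_h = np.hstack([points, np.ones((points.shape[0], 1))])   # homogeneous coordinates
    P_new = (RT_mis @ RT_init @ P_h.T).T[:, :3]
    RT_gt = np.linalg.inv(RT_mis)
    return P_new, RT_gt
```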

We set the random sampling ranges for *θ<sub>x</sub>*, *θ<sub>y</sub>*, *θ<sub>z</sub>*, *τ<sub>x</sub>*, *τ<sub>y</sub>*, and *τ<sub>z</sub>* to the same values as in previous studies [1,25], written as (rotational deviation: −θ to θ, translation deviation: −τ to τ): Rg1 = {θ: ±20°, τ: ±1.5 m}, Rg2 = {θ: ±10°, τ: ±1.0 m}, Rg3 = {θ: ±5°, τ: ±0.5 m}, Rg4 = {θ: ±2°, τ: ±0.2 m}, and Rg5 = {θ: ±1°, τ: ±0.1 m}. Each of the ranges Rg1–Rg5 is used to train one of the five networks Net1–Net5, respectively; one deviation range is assigned to the training of one network. Training for calibration starts with Net1, which is assigned Rg1, and continues with the networks assigned to progressively smaller deviation ranges. The network mentioned here is described in Section 3.2.
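
A sketch of the uniform sampling of the six deviations over the five ranges is given below; the dictionary layout and function names are illustrative only.

```python
import numpy as np

# Deviation ranges Rg1-Rg5, each assigned to the training of one network (Net1-Net5).
DEVIATION_RANGES = {          # network: (max rotation [deg], max translation [m])
    "Net1": (20.0, 1.5),
    "Net2": (10.0, 1.0),
    "Net3": (5.0, 0.5),
    "Net4": (2.0, 0.2),
    "Net5": (1.0, 0.1),
}

def sample_deviations(net_name, rng=None):
    """Sample (theta_x, theta_y, theta_z) in radians and (tau_x, tau_y, tau_z) in meters,
    each drawn uniformly within the range assigned to the given network."""
    if rng is None:
        rng = np.random.default_rng()
    max_deg, max_m = DEVIATION_RANGES[net_name]
    theta = np.radians(rng.uniform(-max_deg, max_deg, size=3))
    tau = rng.uniform(-max_m, max_m, size=3)
    return theta, tau
```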
