**4. Experiments**

Training and evaluating the proposed calibration system involves several tasks, beginning with data preparation. The KITTI dataset provides images captured with a stereo camera and point clouds acquired using a LiDAR; it consists of 21 sequences (00 to 20) from different scenarios. The Oxford dataset provides point clouds acquired using two LiDARs. In addition, both datasets provide initial calibration parameters and visual odometry information.

We used the KITTI dataset for LiDAR-stereo camera calibration. Following the method proposed by Lv et al. [1], we used the 00 sequence (4541 frames) for testing and the remaining sequences (39,011 frames) for training. We used the Oxford dataset for LiDAR-LiDAR calibration. Of the many sequences in the Oxford dataset, we used the 2019-01-10-12-32-52 sequence for training and the 2019-01-17-12-32-52 sequence for evaluation. The two LiDARs used to build the Oxford dataset were not synchronized, so we used visual odometry information to synchronize the frames. Frames that could not be synchronized were discarded, leaving 43,130 frames for training and 35,989 frames for evaluation.
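The synchronization step above pairs frames from the two sensor streams and discards frames without a counterpart. A minimal sketch of one plausible pairing strategy is nearest-timestamp matching within a tolerance; the paper's actual procedure relied on visual odometry information, and the function name, timestamp representation, and tolerance value here are illustrative assumptions.

```python
# Sketch: pair frames from two unsynchronized sensors by matching each
# reference-frame timestamp to the nearest target-frame timestamp within a
# tolerance, discarding frames that have no match. The tolerance (in
# seconds) is an assumed value, not one taken from the dataset.
from bisect import bisect_left

def synchronize(ref_ts, tgt_ts, tol=0.05):
    """Return (ref_idx, tgt_idx) pairs whose timestamps differ by < tol.

    tgt_ts must be sorted in ascending order.
    """
    pairs = []
    for i, t in enumerate(ref_ts):
        j = bisect_left(tgt_ts, t)
        # Candidates: the neighbours around the insertion point.
        best = min((k for k in (j - 1, j) if 0 <= k < len(tgt_ts)),
                   key=lambda k: abs(tgt_ts[k] - t))
        if abs(tgt_ts[best] - t) < tol:
            pairs.append((i, best))
    return pairs
```

Unmatched frames (the third reference timestamp below, for example) are simply dropped, mirroring how the unsynchronized frames were deleted from the dataset.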

We did not apply the same hyper-parameter values to all five networks (Net1 to Net5) because the allowable deviation ranges for rotation and translation differ greatly between Rg1 and Rg5. Because Net5 is trained with Rg5, which has the smallest deviation range, and is applied last when determining the calibration matrix, we trained Net5 with hyper-parameter values different from those of the other networks. These hyper-parameters are S, the side length of a voxel; *Vr* and *Vt*, the numbers of voxels containing data in the voxel spaces of the reference and target sensors, respectively; G, the number of output nodes of FC2 and FC3 in the AM; λ1 and λ2, the weights of the loss functions *Lrot* and *Ltrs*, respectively; and *B*, the batch size.
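These per-network settings can be collected in a small configuration table. The Net1 and Net5 values below are the ones reported for the KITTI experiments (footnotes of Table 1); assigning Net1's values to Net2-Net4 is an assumption for illustration, since the text only states that Net5 differs from the other networks.

```python
# Per-network hyper-parameters for the KITTI experiments, taken from the
# settings reported in Section 4.1 (Table 1 footnotes). Net2-Net4 reusing
# Net1's values is an assumption made for this sketch.
KITTI_HPARAMS = {
    "Net1": dict(S=5.0, Vr=96,  Vt=160, G=1024, lam_rot=1.0, lam_trs=2.0, B=8),
    "Net5": dict(S=2.5, Vr=384, Vt=416, G=128,  lam_rot=0.5, lam_trs=5.0, B=4),
}
for net in ("Net2", "Net3", "Net4"):       # assumed to reuse Net1's values
    KITTI_HPARAMS[net] = dict(KITTI_HPARAMS["Net1"])
```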

Through the experiments with the Oxford dataset, we observed that data screening is required to enhance the calibration accuracy. The dataset was built with two LiDARs mounted at the left and right corners in front of the roof of a platform vehicle. Figure 2 shows a point cloud for one frame in the Oxford dataset. This point cloud contains points generated when the LiDARs scan the surface of the platform vehicle itself. We confirmed that calibration performed on point clouds containing these points is less accurate. Therefore, to perform calibration after excluding these points, we set a point removal area of [Horizontal: −5 to 5 m, Vertical: −2 to 1 m, Depth: −5 to 5 m] for the target sensor and [Horizontal: −1.5 to 1.5 m, Vertical: −2 to 1 m, Depth: −2.5 to 1.5 m] for the reference sensor. Experimental results with respect to this region cropping are provided in Section 4.3.1.
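The screening step above amounts to deleting all points inside an axis-aligned box around the platform vehicle. A minimal sketch follows; the axis ordering (horizontal, vertical, depth) and the helper name are illustrative assumptions, while the default bounds are the target-sensor values quoted above.

```python
import numpy as np

# Sketch: remove points inside an axis-aligned box around the platform
# vehicle before calibration. points is an (N, 3) array assumed to be
# ordered (horizontal, vertical, depth); default bounds are the
# target-sensor removal area from the text.
def remove_box(points, h=(-5.0, 5.0), v=(-2.0, 1.0), d=(-5.0, 5.0)):
    inside = ((points[:, 0] >= h[0]) & (points[:, 0] <= h[1]) &
              (points[:, 1] >= v[0]) & (points[:, 1] <= v[1]) &
              (points[:, 2] >= d[0]) & (points[:, 2] <= d[1]))
    return points[~inside]   # keep only points outside the box
```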

We trained the networks for a total of 60 epochs. We initially set the learning rate to 0.0005, halved it at the 30th epoch, and halved it again at the 40th epoch. The batch size *B* was set to the largest value allowed by the memory of the equipment used. We used one NVIDIA GeForce RTX 2080 Ti graphics card for all our experiments. Adam [41] was used for model optimization, with hyper-parameters *β*1 = 0.9 and *β*2 = 0.999.
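The learning-rate schedule described above is piecewise constant and can be sketched directly; the function name is illustrative, and the rates and milestones are the ones stated in the text.

```python
# Sketch of the learning-rate schedule from the text: initial rate 5e-4,
# halved once at the 30th epoch and again at the 40th, over 60 epochs.
def learning_rate(epoch, base_lr=5e-4):
    """Piecewise-constant schedule matching the training setup described."""
    if epoch < 30:
        return base_lr
    if epoch < 40:
        return base_lr / 2
    return base_lr / 4
```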

**Figure 2.** Point cloud constituting one frame in the Oxford dataset. The green dots represent points obtained by the right LiDAR, and the red dots represent the points obtained by the left LiDAR.

#### *4.1. Evaluation Using the KITTI Dataset*

Figure 3 shows a visual representation of the results of performing calibration on the KITTI dataset using the proposed five networks. In this experiment, we transform a point cloud using the calibration matrix inferred by the proposed networks and using the ground-truth parameters given in the dataset, to show how consistent the two transformation results are. Figure 3a,b show the transformation of a point cloud by randomly sampled deviations from Rg1 and by the calibrated parameters given in the KITTI dataset, respectively. The left side of Figure 3c shows the transformation of the point cloud by *RT*1 predicted by the trained Net1. This result looks suitable, but as shown on the right of Figure 3c, the points measured on a thin column were projected to positions that deviate from the column. This is where iterative refinement takes effect: calibration does not end at Net1 but continues through Net5. Figure 3d shows the transformation of the point cloud by *RTonline* obtained after performing calibration up to Net5. Comparing Figure 3d with Figure 3c, we can see that the calibration accuracy is improved, with suitable alignment even on the thin column.
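The refinement loop above can be sketched as composing the per-network corrections into one calibration estimate. This is a sketch under assumptions: the paper does not spell out the composition order, so the left-multiplication of each predicted 4×4 transform, and the `refine` and network-callable interfaces, are illustrative.

```python
import numpy as np

# Sketch of iterative refinement: each network predicts a corrective 4x4
# transform given the current alignment, and the corrections are composed
# into the final estimate. `networks` stands in for the trained Net1-Net5;
# composing by left-multiplication is an assumption of this sketch.
def refine(cloud_tgt, cloud_ref, networks):
    RT = np.eye(4)                         # running calibration estimate
    for net in networks:                   # Net1 (coarse) ... Net5 (fine)
        delta = net(cloud_tgt, cloud_ref, RT)  # predicted 4x4 correction
        RT = delta @ RT                    # fold correction into estimate
    return RT                              # RT_online after five stages
```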

Table 1 presents the average performance of calibrations performed without temporal filtering on the 4541 test frames of the KITTI dataset. From the results shown in Table 1, we can see the effect of iterative refinement: from Net1 to Net5, the improvements are progressive. Our method achieves an average rotation error of [Roll: 0.024°, Pitch: 0.018°, Yaw: 0.060°] and an average translation error of [X: 0.472 cm, Y: 0.272 cm, Z: 0.448 cm].

**Table 1.** Quantitative results of calibration performed on the KITTI dataset without temporal filtering. See footnotes 1,2 for hyper-parameter settings.


1 S = 5, (*Vr*, *Vt*) = (96, 160), G = 1024, (*λ*1, *λ*2) = (1, 2), *B* = 8. 2 S = 2.5, (*Vr*, *Vt*) = (384, 416), G = 128, (*λ*1, *λ*2) = (0.5, 5), *B* = 4.

**Figure 3.** Results of applying the proposed method to a test frame of the KITTI dataset. (**a**) Transformation by randomly sampled deviations. (**b**) Transformation by given calibrated parameters. (**c**) Transformation by *RT*1 inferred from Net1. (**d**) Transformation by *RTonline* obtained from iterative refinement by five networks.

Figure 4 shows two examples of error distribution for individual components by means of boxplots. From these experiments, we confirmed that temporal filtering provides suitable calibration results regardless of the amount of arbitrary deviation. The dots shown in Figure 4a,b are both obtained by transforming the same point cloud of the target sensor by randomly sampled deviations from Rg1, but the sampled deviations are different. As can be seen from the boxplots in Figure 4e–h, the distribution of calibration errors was similar despite the large difference in sampled deviations.

Table 2 shows the calibration results for our method and for existing CNN-based online calibration methods. These results show that our method achieves the best performance. In addition, comparing these results with those shown in Table 1, it can be concluded that our method achieves a significant performance improvement through temporal filtering. CalibNet [2] did not specify a frame bundle size.

**Table 2.** Comparison of calibration performance between our method and other CNN-based methods.


**Figure 4.** Calibration results and error distribution when temporal filtering was applied. (**a**) Transformation by randomly sampled deviation from Rg1. (**b**) Transformation by randomly sampled deviation from Rg1. (**c**) Calibration results from random deviations shown in (**a**). (**d**) Calibration results from random deviations shown in (**b**). (**e**) Rotation error for the results shown in (**c**). (**f**) Rotation error for the results shown in (**d**). (**g**) Translation error for the results shown in (**c**). (**h**) Translation error for the results shown in (**d**).

Figure 5 shows the changes in the losses calculated by Equations (10) and (11) while training the proposed networks on the KITTI dataset. In this figure, the green graph shows the results of training with randomly sampled deviations from Rg1, and the pink graph shows the results of training with randomly sampled deviations from Rg5. The horizontal and vertical axes represent epochs and loss, respectively. From these graphs, we can observe that the rate of loss reduction decreases from approximately the 30th epoch. This was observed consistently, regardless of the deviation range the network was trained on or the hyper-parameters used, and the trends in loss reduction were similar for rotation and translation. Accordingly, we halved the initial learning rate at the 30th epoch, trained for 10 more epochs at the reduced rate, and halved it again at the 40th epoch. Training continued until the 60th epoch, and the checkpoint with the smallest training error among the results from the 45th to the 60th epoch was selected as the training result. When Net1 was trained, the hyper-parameters were set as S = 5, (*Vr*, *Vt*) = (96, 160), G = 1024, (*λ*1, *λ*2) = (1, 2), and *B* = 8. When Net5 was trained, they were set as S = 2.5, (*Vr*, *Vt*) = (384, 416), G = 128, (*λ*1, *λ*2) = (0.5, 5), and *B* = 4. In Figure 5, the results before the 10th epoch are not shown because the loss was too large.

**Figure 5.** Changes in loss calculated during the training of Net1 and Net5 on the KITTI dataset. (**a**) *Lrot* calculated using Equation (10). (**b**) *Ltrs* calculated using Equation (11).

#### *4.2. Evaluation Using the Oxford Dataset*

Figures 6 and 7 show the results of performing calibration on the Oxford dataset using the proposed five networks. In these figures, the green dots represent the points obtained by the right LiDAR, which serves as the target sensor, and the red dots represent the points obtained by the left LiDAR. Figure 6a,b show the transformation of a point cloud from the target sensor by randomly sampled deviations from Rg1 and by the calibrated parameters given in the Oxford dataset, respectively. Figure 6c shows the transformation of the point cloud by *RT*1 inferred from the trained Net1, and Figure 6d shows the transformation by *RTonline* obtained after performing calibration up to Net5. As with the KITTI dataset, the result of Net1 looks suitable on its own but falls short when compared with the result shown in Figure 6d: the photo on the right side of Figure 6c shows that the green and red dots indicated by an arrow are misaligned, whereas the photo on the right side of Figure 6d shows them well aligned. This comparison shows that calibration accuracy can be improved by the iterative refinement of five networks even without temporal filtering.

**Figure 6.** Results of applying the proposed method to a test frame of the Oxford dataset. (**a**) Transformation by randomly sampled deviations. (**b**) Transformation by given calibrated parameters. (**c**) Transformation by *RT*1 inferred from Net1. (**d**) Transformation by *RTonline* obtained from iterative refinement by five networks.

**Figure 7.** Calibration results and error distribution when temporal filtering was applied to the Oxford dataset. (**a**) Transformation by randomly sampled deviations from Rg1. (**b**) Transformation by randomly sampled deviations from Rg1. (**c**) Calibration results from random deviations shown in (**a**). (**d**) Calibration results from random deviations shown in (**b**). (**e**) Rotation error for the results shown in (**c**). (**f**) Rotation error for the results shown in (**d**). (**g**) Translation error for the results shown in (**c**). (**h**) Translation error for the results shown in (**d**).

Table 3 presents the average performance of calibrations performed without temporal filtering on the 35,989 test frames of the Oxford dataset. Our method achieves an average rotation error of [Roll: 0.056°, Pitch: 0.029°, Yaw: 0.082°] and an average translation error of [X: 0.520 cm, Y: 0.628 cm, Z: 0.350 cm]. In this experiment, we applied the same hyper-parameters to all five networks: S = 5, (*Vr*, *Vt*) = (224, 288), G = 1024, (*λ*1, *λ*2) = (1, 2), and B = 8.


**Table 3.** Quantitative results of calibration performed on the Oxford dataset without temporal filtering.

Figure 7 shows two examples of the error distribution of individual components by means of boxplots, as in Figure 4. From these experiments, we can see that temporal filtering provides suitable calibration results regardless of the amount of arbitrary deviation, even for LiDAR-LiDAR calibration. The green dots shown in Figure 7a,b are both obtained by transforming the same point cloud of the target sensor with randomly sampled deviations from Rg1, but the sampled deviations are different. As shown in Figure 7e–h, the distribution of calibration errors is similar despite the large difference in sampled deviations. In these experiments, the size of the frame bundle used in the temporal filtering was 100.

Table 4 shows the calibration performance of the proposed method with temporal filtering. Our method achieves a rotation error of less than 0.1° and a translation error of less than 1 cm. By comparing Tables 3 and 4, it can be seen that temporal filtering achieves a significant improvement in performance.

**Table 4.** Quantitative results of calibration on the Oxford dataset with temporal filtering.


Figure 8 shows the changes in the losses calculated by Equations (10) and (11) while training the proposed networks on the Oxford dataset. The results of this experiment were very similar to those achieved with the KITTI dataset in Figure 5; therefore, we applied the same training strategy to both datasets. However, the hyper-parameter values applied to the networks were different. When Net1 was trained, the hyper-parameters were set as S = 5, (*Vr*, *Vt*) = (224, 288), G = 1024, (*λ*1, *λ*2) = (1, 2), and *B* = 8. When Net5 was trained, they were set as S = 5, (*Vr*, *Vt*) = (224, 288), G = 1024, (*λ*1, *λ*2) = (0.5, 5), and *B* = 4.


**Figure 8.** Changes in calculated losses during Net1 and Net5 training on the Oxford dataset. (**a**) Rotation loss *Lrot* calculated using Equation (10). (**b**) Translation loss *Ltrs* calculated using Equation (11).

#### *4.3. Ablation Studies*

#### 4.3.1. Performance According to the Cropped Area of the Oxford Dataset

At the beginning of Section 4, we mentioned the need to eliminate some points in the Oxford dataset that degraded calibration performance. To support this observation, we present in Table 5 the results of experiments with and without the removal of those points. Although the calibration performance varies with the size of the removed area, it is difficult to determine the size of the area to be cropped theoretically; Table 5 therefore shows the results of experiments with the cropped area set in two different ways. Through these experiments, we found that calibration performed after removing the points that caused the performance degradation generally produced better results than calibration performed without removing them. These experiments were performed with the trained Net5, with hyper-parameters S = 5, (*Vr*, *Vt*) = (224, 288), G = 1024, (*λ*1, *λ*2) = (1, 2), and B = 8.

**Table 5.** Comparison of calibration performance according to the cropped area on the Oxford dataset.


#### 4.3.2. Performance According to the Length of a Voxel Side, S

We conducted experiments to check how the calibration performance changes according to S; Tables 6 and 7 show the results for the KITTI and Oxford datasets, respectively. We performed the evaluation according to S with a combination of Rg1 and Net1 and a combination of Rg5 and Net5. These experiments showed that the calibration performance generally improved as S became smaller. However, a smaller S also increased the computational cost, and in some cases the performance deteriorated. We tried to keep the hyper-parameters other than S fixed, but as S decreased, *Vr* and *Vt* naturally increased rapidly. This burdened the memory, making it difficult to keep the batch size B at the same value. Therefore, when S was 2.5, B was 4 in the experiment performed on the KITTI dataset and 2 in the experiment performed on the Oxford dataset; for S greater than 2.5, B was fixed at 8. In addition, there were cases where the performance deteriorated when S was very small, such as 2.5, which we attribute to a small receptive field in the FEN. In the experiments performed on the Oxford dataset, when S was 2.5 in Net1, the training loss diverged near the 5th epoch, so the experiment could not be completed. For training on the KITTI dataset, S was set to 2.5 in Net5 and to 5 in Net1 to Net4; for training on the Oxford dataset, S was set to 5 for both Net1 and Net5.
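The memory pressure above follows from how *Vr* and *Vt* are defined: shrinking the voxel side S spreads the same points over more voxels. A minimal sketch of counting occupied voxels for a given S (the helper name and axis conventions are illustrative):

```python
import numpy as np

# Sketch: count voxels with data for a point cloud at voxel side length S.
# As S shrinks, this count (V_r or V_t in the text) grows, which is why a
# smaller batch size was needed at S = 2.5.
def occupied_voxels(points, S):
    """points: (N, 3) array in metres; returns number of voxels with data."""
    idx = np.floor(points / S).astype(np.int64)  # voxel index per point
    return len(np.unique(idx, axis=0))           # distinct occupied voxels
```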


**Table 6.** Comparison of calibration performance according to S on the KITTI dataset.

**Table 7.** Comparison of calibration performance according to S on the Oxford dataset.


#### 4.3.3. Performance According to the Bundle Size of Frames

We conducted experiments to observe how the calibration performance changes according to the frame bundle size used for temporal filtering. Tables 8 and 9 show the results for the KITTI and Oxford datasets, respectively. We performed the experiments as described in Section 3.4.3. Because 100 runs had to be performed, the position of the starting frame for each run was predetermined. For each run, we took the median of the values of each of the six parameters associated with rotation and translation inferred from the frames in the bundle, and we calculated the absolute difference between this median and the deviation randomly sampled from Rg1. The error of each parameter shown in Tables 8 and 9 was obtained by averaging the corresponding errors over all runs. Through these experiments, we found that temporal filtering with many frames improves the overall calibration performance, although, looking carefully at the two tables, the improvement does not appear for every parameter. Considering this observation and the processing time, the frame bundle size was set to 100.
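The per-run computation above can be sketched directly: a per-parameter median over the bundle, then an absolute difference against the sampled deviation. The function name and array shapes are illustrative assumptions.

```python
import numpy as np

# Sketch of one temporal-filtering run: for each of the six calibration
# parameters (three rotations, three translations), take the median of the
# per-frame estimates over a bundle of consecutive frames, then compute the
# absolute error against the known sampled deviation.
def temporal_filter(per_frame_params, true_deviation):
    """per_frame_params: (bundle_size, 6) array; returns (filtered, abs_error)."""
    filtered = np.median(per_frame_params, axis=0)   # per-parameter median
    abs_error = np.abs(filtered - true_deviation)    # error per parameter
    return filtered, abs_error
```

Averaging `abs_error` over the 100 runs yields the per-parameter errors reported in Tables 8 and 9.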


**Table 8.** Comparison of calibration performance according to the bundle size of frames for temporal filtering on the KITTI dataset.

**Table 9.** Comparison of calibration performance according to the bundle size of frames for temporal filtering on the Oxford dataset.

