3.2.1. FEN

Starting from the initial input voxel maps *Fx*, *Fy*, and *Fz*, the FEN extracts features for predicting the calibration parameters by applying 3D convolution over 20 layers. The number of layers, the kernel size, the number of kernels per layer, and the stride of each layer were determined experimentally. The kernel size is 3 × 3 × 3. Two stride values, 1 and 2, are used selectively per layer. The number of kernels used in each layer is indicated at the bottom of Figure 1; this number equals the number of feature volumes produced by that layer, called channels in deep learning terminology. Convolution is performed differently depending on the stride of each layer: when the stride is 1, submanifold convolution [38] is performed, and when the stride is 2, general convolution is performed. General convolution is computed at every voxel, whether or not it holds a value, whereas submanifold convolution is computed only when a voxel with a value lies at the central cell of the kernel. In addition, batch normalization (BN) [39] and a rectified linear unit (ReLU) activation function are applied sequentially after each convolution in the FEN.
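The difference between the two convolution types can be illustrated with a minimal dense NumPy sketch (the actual network operates on sparse tensors; the toy functions and single-voxel input below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def conv3d_dense(vol, kernel):
    """General 3D convolution (stride 1, 'same' padding): an output
    is computed at every voxel, so a single occupied input voxel
    activates its entire 3x3x3 neighbourhood in the output."""
    d, h, w = vol.shape
    k = kernel.shape[0]
    padded = np.pad(vol, k // 2)
    out = np.zeros_like(vol, dtype=float)
    for z in range(d):
        for y in range(h):
            for x in range(w):
                out[z, y, x] = np.sum(padded[z:z+k, y:y+k, x:x+k] * kernel)
    return out

def conv3d_submanifold(vol, kernel):
    """Submanifold 3D convolution: the kernel is evaluated only where
    the *centre* voxel is occupied, so the set of active voxels never
    grows and the sparsity pattern is preserved."""
    dense = conv3d_dense(vol, kernel)
    return np.where(vol != 0, dense, 0.0)

# A single occupied voxel in a 5x5x5 grid:
vol = np.zeros((5, 5, 5))
vol[2, 2, 2] = 1.0
kernel = np.ones((3, 3, 3))

dense_out = conv3d_dense(vol, kernel)         # 27 active voxels
sub_out = conv3d_submanifold(vol, kernel)     # still 1 active voxel
```

This is why stride-1 layers use the submanifold variant: repeated general convolutions would rapidly densify the sparse voxel map, while the submanifold variant keeps it sparse.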

**Figure 1.** Overall structure of the proposed network. In the attention module, T within a circle represents the transpose of a matrix; @ within a circle represents matrix multiplication; S within a circle represents the softmax function; C within a circle represents concatenation. In the inference network, Trs and Rot represent the translation and rotation parameters predicted by the network, respectively.

We want the proposed network to perform robust calibration under large rotational and translational deviations between the two sensors, which requires a large receptive field. Therefore, we included seven layers with a stride of 2 in the FEN.
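The effect of the stride-2 layers on the receptive field can be checked with the standard receptive-field recurrence. The interleaving of stride-1 and stride-2 layers below is an assumption for illustration only; the paper specifies the actual layer layout in Figure 1.

```python
def receptive_field(strides, kernel=3):
    """Receptive field (per axis) of a stack of conv layers,
    using the standard recurrence:
        rf += (kernel - 1) * jump;  jump *= stride
    """
    rf, jump = 1, 1
    for s in strides:
        rf += (kernel - 1) * jump
        jump *= s
    return rf

# 20 layers with 3x3x3 kernels, seven of them with stride 2
# (assumed ordering -- see Figure 1 for the real layout):
strides = [1, 1, 2] * 6 + [1, 2]
assert len(strides) == 20 and strides.count(2) == 7
rf = receptive_field(strides)
```

Under this assumed ordering the receptive field spans 635 voxels per axis, versus only 41 (1 + 20 · 2) if all 20 layers used stride 1, which is why the stride-2 layers are essential for tolerating large deviations.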

The final output of the FEN is 1024 feature volumes. The number of cells in each feature volume depends on the voxel size; let V denote this number. Because each feature volume can be reshaped into a V-dimensional column vector, the 1024 feature volumes can be represented as a matrix F of dimension V × 1024. The outputs of the FENs for the reference and target sensors are denoted *Fr* and *Ft*, respectively.
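The construction of F can be expressed in a few lines of NumPy. The 4 × 4 × 4 spatial size below is an assumed example; the real size depends on the chosen voxel resolution.

```python
import numpy as np

# 1024 feature volumes from the last FEN layer, stored
# channels-first: (channels, D, H, W). The 4x4x4 spatial size
# is an assumed example, not a value from the paper.
channels, D, H, W = 1024, 4, 4, 4
volumes = np.random.rand(channels, D, H, W)

V = D * H * W                       # number of cells per feature volume
F = volumes.reshape(channels, V).T  # matrix F, shape (V, 1024)
```

Each column of F is one feature volume flattened into a V-dimensional vector, matching the description above.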
