3.2.2. AM

Matching the features extracted by the FEN through convolutions is difficult because the LiDAR and the stereo camera generate point clouds in different ways. Even in the LiDAR-LiDAR combination, if the FOVs of the two LiDARs differ significantly, matching the convolutional features extracted by the FEN is likewise difficult. Moreover, because the deviation ranges of rotation and translation are set large to estimate the calibration parameters, it becomes difficult to assess the similarity between the point cloud of the target sensor and the point cloud of the reference sensor.

Inspired by the attention mechanism proposed by Vaswani et al. [28], we address these problems by designing an AM that implements the attention mechanism, as shown in Figure 1. The AM computes an attention value for each voxel of the reference sensor input using the following procedure.

The AM has four fully connected layers (FCs): FC1, FC2, FC3, and FC4. A feature is input into these FCs, and a transformed feature is output. We denote the outputs of FC1, FC2, FC3, and FC4 as matrices M1, M2, M3, and M4, respectively. Each FC has 1024 input nodes, where 1024 is the number of feature volumes extracted by the FEN. FC1 and FC4 have G/2 output nodes, and FC2 and FC3 have G output nodes; thus, these FCs transform 1024 features into G or G/2 features, where G is a hyper-parameter. If the sum of the elements in a row of matrix F, the output of the FEN, is 0, that row vector is not input to the FCs. We apply layer normalization (LN) [40] and the ReLU function to the output of these FCs so that the final output is nonlinear. The output M2 of FC2 is a matrix of dimension *Vt* × G, and the output M3 of FC3 is a matrix of dimension *Vr* × G.

Here, *Vr* and *Vt* are the numbers of rows of *Fr* and *Ft*, respectively, that contain at least one non-zero feature value. Therefore, *Vr* and *Vt* can differ for each input. However, we fix the values of *Vr* and *Vt* because the number of input nodes of the multi-layer perceptron (MLP) of the IN following the AM cannot change from input to input. To fix these values, we feed all the data used in the experiments into the network, take the largest observed *Vr* and *Vt*, and make them multiples of 8; in this sense, *Vr* and *Vt* are also hyper-parameters. If the actual *Vr* and *Vt* are smaller than the predetermined values, the remaining elements of the FC output matrices are filled with zeros. The output M1 of FC1 is a matrix of dimension *Vt* × G/2, and the output M4 of FC4 is a matrix of dimension *Vr* × G/2.
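The per-branch FC transform described above (skip empty voxel rows, apply FC + LN + ReLU, then zero-pad to the fixed row count) can be sketched in NumPy. This is a minimal illustration under assumed shapes; the function name `fc_ln_relu` and the toy dimensions are not from the paper.

```python
import numpy as np

def fc_ln_relu(F, W, b, V_fixed):
    """Hypothetical sketch of one AM branch: apply an FC (weights W, bias b)
    to the non-empty rows of feature matrix F, then layer-normalize, apply
    ReLU, and zero-pad the result to V_fixed rows."""
    # Keep only rows of F with at least one non-zero feature value.
    nonzero = ~np.all(F == 0, axis=1)
    X = F[nonzero] @ W + b                      # (V_actual, G_out)
    # Layer normalization over the feature dimension of each row.
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    X = (X - mu) / (sigma + 1e-5)
    X = np.maximum(X, 0.0)                      # ReLU nonlinearity
    # Zero-pad up to the predetermined (fixed) row count.
    out = np.zeros((V_fixed, W.shape[1]))
    out[:X.shape[0]] = X
    return out

# Toy usage: 5 voxel rows, one of them empty; pad the output to 8 rows.
rng = np.random.default_rng(0)
F = rng.standard_normal((5, 1024))
F[2] = 0.0                                      # an empty voxel row
W, b = rng.standard_normal((1024, 8)), np.zeros(8)
out = fc_ln_relu(F, W, b, V_fixed=8)            # shape (8, 8)
```

The padding step mirrors the zero-filling used when the actual *Vr* or *Vt* is smaller than the predetermined value.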

• Computation of attention score by dot product

An attention score is obtained from the dot product of a row vector of M3 and a column vector of M2^T. This score corresponds to a cosine similarity. The matrix obtained through the dot products of all row vectors of M3 and all column vectors of M2^T is called the attention score matrix AS. The dimension of the matrix AS (AS = M3·M2^T) is *Vr* × *Vt*.
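The score computation is a single matrix product. A minimal NumPy sketch with toy shapes (the values of G, *Vr*, and *Vt* here are illustrative assumptions):

```python
import numpy as np

# Toy shapes: G feature channels, Vr reference voxels, Vt target voxels.
G, Vr, Vt = 8, 4, 6
rng = np.random.default_rng(0)
M2 = rng.standard_normal((Vt, G))   # stand-in for the FC2 output (target)
M3 = rng.standard_normal((Vr, G))   # stand-in for the FC3 output (reference)

# Attention score matrix: dot product of every row of M3 with every
# column of M2^T, i.e. AS = M3 · M2^T, of dimension (Vr, Vt).
AS = M3 @ M2.T
```

Entry AS[i, j] scores how similar reference-voxel feature i is to target-voxel feature j.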

• Generation of attention distribution by softmax

We apply the softmax function to each row vector of AS to obtain the attention distribution. The softmax function converts each element of the input vector into a probability. We call this probability an attention weight, and the matrix obtained by this process is the attention weight matrix AW of dimension *Vr* × *Vt*.
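The row-wise softmax can be sketched as follows; subtracting the row maximum before exponentiating is a standard numerical-stability trick, not something stated in the paper:

```python
import numpy as np

def row_softmax(AS):
    """Softmax over each row of the score matrix; shifting by the row
    maximum keeps exp() numerically stable without changing the result."""
    e = np.exp(AS - AS.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy score matrix: each row becomes a probability distribution.
AS = np.array([[1.0, 2.0, 3.0],
               [0.0, 0.0, 0.0]])
AW = row_softmax(AS)   # every row of AW sums to 1
```

Each row of AW is the attention distribution of one reference voxel over all target voxels.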

• Computation of attention value by dot product

An attention value is obtained from the dot product of a row vector of AW and a column vector of the matrix M1. The matrix obtained through the dot products of all row vectors of AW and all column vectors of M1 is called the attention value matrix AV. The dimension of the matrix AV (AV = AW·M1) is *Vr* × G/2.
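Because each row of AW is a probability distribution, each row of AV = AW·M1 is a convex combination (a weighted average) of the rows of M1. A toy NumPy sketch, with random stand-in matrices:

```python
import numpy as np

Vr, Vt, G = 4, 6, 8
rng = np.random.default_rng(1)

# Stand-in attention weights: each row normalized to sum to 1.
raw = rng.random((Vr, Vt))
AW = raw / raw.sum(axis=1, keepdims=True)
M1 = rng.standard_normal((Vt, G // 2))   # stand-in for the FC1 output

# Attention value matrix: each row is a weighted average of M1's rows.
AV = AW @ M1                             # dimension (Vr, G/2)
```

Since every AV row is a convex combination of M1's rows, each AV entry stays within the column-wise range of M1.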

Finally, we concatenate the attention value matrix AV and the matrix M4. The resulting matrix is denoted AC (AC = [AV M4]) and has dimension *Vr* × G; this matrix becomes the input to the IN. We set the output dimensions of FC1 and FC4 to G/2 instead of G to save memory and reduce processing time.
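Putting the steps together, the whole AM forward pass can be sketched end-to-end with toy matrices. The shapes and random values are illustrative assumptions standing in for the actual FC1–FC4 outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
Vr, Vt, G = 4, 6, 8

# Stand-ins for the FC outputs M1..M4 (after LN + ReLU).
M1 = rng.random((Vt, G // 2))
M2 = rng.random((Vt, G))
M3 = rng.random((Vr, G))
M4 = rng.random((Vr, G // 2))

# 1) Attention scores, 2) row-wise softmax, 3) attention values.
AS = M3 @ M2.T                                  # (Vr, Vt)
e = np.exp(AS - AS.max(axis=1, keepdims=True))
AW = e / e.sum(axis=1, keepdims=True)           # rows sum to 1
AV = AW @ M1                                    # (Vr, G/2)

# 4) Concatenate: AC = [AV M4], of dimension (Vr, G), fed to the IN.
AC = np.concatenate([AV, M4], axis=1)
```

Halving the FC1/FC4 widths means the concatenated AC still has G columns, which is why the paper's memory/time saving costs nothing in the output dimension.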
