3.2.3. IN

The IN infers rotation and translation parameters. It consists of an MLP and two fully connected layers, FC5 and FC6. The MLP is composed of an input layer, a single hidden layer, and an output layer. The input layer has *Vr* × G nodes, and the hidden and output layers each have 1024 nodes. Therefore, when we feed AC, the output of the AM, into the MLP, we first flatten AC into a vector. This MLP has no bias input, and it uses ReLU as its activation function. Moreover, LN is applied to the weighted sums entering the nodes of the hidden and output layers, and ReLU is then applied to the normalized result to obtain the outputs of these nodes. The output of the MLP becomes the input to FC5 and FC6. The MLP plays the role of reducing the dimension of the input vector.

We do not apply normalization or an activation function to FC5 and FC6. FC5 produces the three translation-related parameters *τpx* , *τpy* , and *τpz* , and FC6 produces the four rotation-related quaternion values *q*0, *q*1, *q*2, and *q*3.
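The forward pass described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the sizes *Vr* = 16 and G = 32 are placeholders (they are not given in this excerpt), the LN here has no learned affine parameters, and FC5/FC6 are taken as plain linear maps without bias, matching the statement that no normalization or activation is applied to them.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    # Normalize a weighted-sum vector over its features (no learned gain/bias in this sketch).
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def in_forward(ac, w_hid, w_out, w_fc5, w_fc6):
    """Sketch of the IN: flatten AC, run the bias-free MLP (LN + ReLU), then FC5/FC6."""
    x = ac.reshape(-1)                          # flatten AC into a Vr * G vector
    h = np.maximum(layer_norm(x @ w_hid), 0.0)  # hidden layer: LN, then ReLU
    f = np.maximum(layer_norm(h @ w_out), 0.0)  # output layer: LN, then ReLU
    translation = f @ w_fc5                     # FC5 -> (tau_px, tau_py, tau_pz)
    quaternion = f @ w_fc6                      # FC6 -> (q0, q1, q2, q3)
    return translation, quaternion

# Hypothetical sizes: Vr = 16, G = 32 (not specified here); 1024-node layers per the text.
rng = np.random.default_rng(0)
Vr, G, H = 16, 32, 1024
w_hid = rng.standard_normal((Vr * G, H)) * 0.05
w_out = rng.standard_normal((H, H)) * 0.05
w_fc5 = rng.standard_normal((H, 3)) * 0.05
w_fc6 = rng.standard_normal((H, 4)) * 0.05
t, q = in_forward(rng.standard_normal((Vr, G)), w_hid, w_out, w_fc5, w_fc6)
```

The two heads share the same 1024-dimensional MLP feature, so the rotation and translation branches diverge only at the final linear layer.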

#### *3.3. Loss Function*

To train the proposed network, we use a loss function as follows:

$$L = \lambda\_1 L\_{\rm rot} + \lambda\_2 L\_{\rm trs} \tag{9}$$

where *Lrot* is a regression loss related to rotation, *Ltrs* is a regression loss related to translation, and the hyper-parameters *λ*1 and *λ*2 are their respective weights. We use the quaternion distance to regress the rotation. The quaternion distance is defined as:
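Equation (9) is a plain weighted sum, which a short snippet makes concrete. The weight and loss values below are placeholders for illustration; the actual *λ*1 and *λ*2 are tuned hyper-parameters not specified in this excerpt.

```python
# Eq. (9): total loss as a weighted sum of the rotation and translation losses.
lambda_1, lambda_2 = 1.0, 1.0   # hypothetical weights; real values are hyper-parameters
l_rot, l_trs = 0.2, 0.05        # placeholder per-batch loss values for illustration
total_loss = lambda_1 * l_rot + lambda_2 * l_trs
```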

$$L\_{\rm rot} = \arccos\left(2\left(\frac{q\_p}{|q\_p|} \cdot \frac{q\_{gt}}{|q\_{gt}|}\right)^2 - 1\right) \tag{10}$$

where · represents the dot product, |·| indicates the norm, and *qp* and *qgt* indicate the vector of quaternion parameters predicted by the network and the ground-truth vector of quaternion parameters, respectively. From *RTgt* of Equation (4), we obtain the four ground-truth quaternion values used for the rotation regression.
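A direct NumPy transcription of Equation (10) is below; the clipping is an added numerical safeguard (not in the equation) against rounding pushing the arccos argument just outside [−1, 1].

```python
import numpy as np

def quaternion_distance(q_p, q_gt):
    # Eq. (10): arccos(2 * (q_p/|q_p| . q_gt/|q_gt|)^2 - 1)
    q_p = np.asarray(q_p, dtype=float)
    q_gt = np.asarray(q_gt, dtype=float)
    dot = np.dot(q_p / np.linalg.norm(q_p), q_gt / np.linalg.norm(q_gt))
    # Clip guards against floating-point values marginally outside [-1, 1].
    return np.arccos(np.clip(2.0 * dot**2 - 1.0, -1.0, 1.0))
```

Squaring the dot product makes the distance invariant to the quaternion double cover: *q* and −*q* describe the same rotation and yield a distance of zero, while a 90° rotation against the identity yields π/2.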

For the regression of the translation vector, the smooth *L*1 loss is applied. The loss *Ltrs* is defined as follows:

$$\begin{aligned} L\_{\rm trs} &= \frac{1}{3}\Big( \operatorname{smooth}\_{L1}\big(\tau\_x^{p} - \tau\_x^{gt}\big) + \operatorname{smooth}\_{L1}\big(\tau\_y^{p} - \tau\_y^{gt}\big) + \operatorname{smooth}\_{L1}\big(\tau\_z^{p} - \tau\_z^{gt}\big) \Big) \\ \operatorname{smooth}\_{L1}(x) &= \begin{cases} \frac{x^{2}}{2\beta} & \text{if } |x| < \beta \\ |x| - \frac{\beta}{2} & \text{otherwise} \end{cases} \end{aligned} \tag{11}$$

where the superscripts *p* and *gt* represent prediction and ground truth, respectively, *β* is a hyper-parameter and is usually taken to be 1, and |·| represents an absolute value. The parameters *τ<sup>p</sup><sub>x</sub>*, *τ<sup>p</sup><sub>y</sub>*, and *τ<sup>p</sup><sub>z</sub>* are inferred by the network, and *τ<sup>gt</sup><sub>x</sub>*, *τ<sup>gt</sup><sub>y</sub>*, and *τ<sup>gt</sup><sub>z</sub>* are obtained from *RTgt* of Equation (4).
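Equation (11) can be sketched as follows; the piecewise form is quadratic near zero (gradient-friendly for small errors) and linear beyond *β* (robust to outliers), and *Ltrs* averages it over the three translation components.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    # Eq. (11): x^2 / (2*beta) inside |x| < beta, |x| - beta/2 outside.
    x = np.abs(x)
    return np.where(x < beta, x**2 / (2.0 * beta), x - beta / 2.0)

def l_trs(tau_p, tau_gt, beta=1.0):
    # Mean of the smooth-L1 losses over the three translation components.
    diff = np.asarray(tau_p, dtype=float) - np.asarray(tau_gt, dtype=float)
    return float(np.mean(smooth_l1(diff, beta)))
```

With *β* = 1, an error of 0.5 falls in the quadratic branch (0.5²/2 = 0.125) and an error of 2 in the linear branch (2 − 0.5 = 1.5), so the two branches join continuously at |x| = *β*.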
