#### *3.1.5. Voxelization*

We first perform a voxel partition by dividing the 3D points obtained by the sensors into equally spaced 3D voxels, as in [36]. This partition requires restricting the points acquired by a sensor to a bounded region, which we call the voxel space. Each voxel is a cube whose side length, denoted S and expressed in cm in this paper, is a hyper-parameter. A voxel can contain multiple points, of which up to three are randomly kept and the rest discarded; this limit of three points per voxel is an empirical choice. Following the method in [37], the average coordinates along the x-, y-, and z-axes of the points in each voxel are then calculated. We build three initial voxel maps, *Fx*, *Fy*, and *Fz*, using the average coordinates for each axis. For each sensor, these initial voxel maps become the input to our proposed network, described in Section 3.2.
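The voxelization step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the fixed random seed, and the use of dense NumPy arrays for the voxel maps are assumptions, and coordinates and S are taken to be in the same unit.

```python
import numpy as np
from collections import defaultdict

def voxelize(points, space_min, space_max, S, max_points=3, seed=0):
    """Build three voxel maps (Fx, Fy, Fz) holding the mean x-, y-, and
    z-coordinates of at most `max_points` randomly kept points per voxel.
    `points` is an (N, 3) array; `space_min`/`space_max` bound the voxel
    space; S is the voxel side length (same unit as the coordinates)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(space_min, float), np.asarray(space_max, float)

    # Discard points outside the voxel space.
    mask = np.all((points >= lo) & (points < hi), axis=1)
    pts = points[mask]

    grid_shape = np.ceil((hi - lo) / S).astype(int)
    Fx, Fy, Fz = (np.zeros(grid_shape) for _ in range(3))

    # Group points by the voxel they fall into.
    buckets = defaultdict(list)
    for p, i in zip(pts, np.floor((pts - lo) / S).astype(int)):
        buckets[tuple(i)].append(p)

    for key, plist in buckets.items():
        plist = np.array(plist)
        if len(plist) > max_points:  # randomly keep at most max_points points
            plist = plist[rng.choice(len(plist), max_points, replace=False)]
        Fx[key], Fy[key], Fz[key] = plist.mean(axis=0)
    return Fx, Fy, Fz
```

Storing the per-axis means in three separate dense grids mirrors the construction of *Fx*, *Fy*, and *Fz* described above; empty voxels simply stay zero.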

In this paper, we set the voxel space slightly larger than the predetermined ROI of each sensor to account for the expected range of deviation. For example, in the case of the KITTI dataset, the voxel space of the stereo input is set to [horizontal: −15–15 m, vertical: −15–15 m, depth: 0–55 m], and the voxel space of the LiDAR input is set to the same size. In contrast, the voxel space of the 3D points generated by the two LiDARs in the Oxford dataset is set to [width: −40–40 m, height: −15–15 m, depth: −40–40 m]. Points outside the voxel space are discarded.
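The resulting voxel-grid dimensions follow directly from the voxel-space bounds and S. A small sketch, using the KITTI stereo bounds quoted above and a hypothetical S of 25 cm (the paper treats S as a hyper-parameter and does not fix it here):

```python
import math

def grid_dims(space_min, space_max, S_cm):
    """Number of voxels along each axis for a voxel space given in meters
    and a voxel side length S given in cm, as in the paper's convention."""
    S_m = S_cm / 100.0
    return tuple(math.ceil((hi - lo) / S_m)
                 for lo, hi in zip(space_min, space_max))

# KITTI stereo voxel space: [-15, 15] x [-15, 15] x [0, 55] m, S = 25 cm
dims = grid_dims((-15, -15, 0), (15, 15, 55), S_cm=25)  # → (120, 120, 220)
```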

#### *3.2. Network Architecture*

We propose a network consisting of three parts: a feature extraction network (FEN), an attention module (AM), and an inference network (IN). The overall structure of the proposed network is shown in Figure 1. The input to this network is the *Fx*, *Fy*, and *Fz* maps built from voxelization for each sensor, and the output is seven numbers: three translation-related parameter values and four rotation-related quaternion values. Because every step is differentiable, the network can be trained end-to-end.
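The 7-dimensional output described above splits into a translation vector and a quaternion. A minimal sketch of that split, assuming a (w, x, y, z) quaternion ordering and an explicit normalization step, neither of which is stated in the text:

```python
import numpy as np

def split_pose_output(raw):
    """Split a raw 7-dimensional network output into a translation vector
    and a unit quaternion. The (w, x, y, z) ordering and the explicit
    normalization are assumptions for illustration only."""
    raw = np.asarray(raw, dtype=float)
    t = raw[:3]                    # three translation-related parameters
    q = raw[3:]                    # four rotation-related quaternion values
    q = q / np.linalg.norm(q)      # project onto the unit sphere so q is a valid rotation
    return t, q

t, q = split_pose_output([0.1, -0.2, 0.3, 2.0, 0.0, 0.0, 0.0])
```

Normalizing the quaternion is a common way to keep the regressed rotation valid while leaving every step differentiable, consistent with the end-to-end training noted above.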
