**4. Experiments**

In this section, we first describe the experimental setup and then introduce the adopted datasets, on which both quantitative and qualitative results show that our proposed method outperforms existing methods.

#### *4.1. Experimental Setup*

All the network models are implemented in PyTorch and trained on an NVIDIA P100 GPU. The RMSprop optimizer [31] is adopted for both the fusion network and the denoising network, with an initial learning rate of $10^{-4}$ and a momentum of 0.9. The networks are trained sequentially, that is, the fusion network is pre-trained before the denoising network is trained. 10K frames sampled from the ShapeNet dataset [32] are used for training.
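For concreteness, the optimizer setup can be written as the following minimal PyTorch sketch; the two `nn.Sequential` models are placeholders of ours, standing in for the fusion and denoising architectures described earlier.

```python
import torch
import torch.nn as nn

# Placeholder networks: the actual fusion and denoising architectures are
# those of Section 3; tiny stand-ins keep this sketch self-contained.
fusion_net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(16, 1, 3, padding=1))
denoising_net = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                              nn.Conv3d(8, 1, 3, padding=1))

# RMSprop with the reported initial learning rate and momentum.
fusion_opt = torch.optim.RMSprop(fusion_net.parameters(), lr=1e-4, momentum=0.9)
denoise_opt = torch.optim.RMSprop(denoising_net.parameters(), lr=1e-4, momentum=0.9)

# Sequential training: after the fusion network is pre-trained, it is frozen
# while the denoising network is trained on its outputs.
for p in fusion_net.parameters():
    p.requires_grad = False
```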

#### *4.2. Dataset and Noise Simulation*

The **ShapeNet** dataset [32] contains a large collection of synthetic 3D shapes, such as planes, sofas and cars. Ground-truth data, including depth maps, camera intrinsics and camera poses, can be obtained from the 3D shapes. Similar to RoutedFusion [4], we use the ShapeNet dataset to train the networks. To simulate realistic noisy conditions, random noise is added to both the depth maps and the camera poses during training.
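As a small illustration of the depth-map perturbation, the sketch below adds zero-mean Gaussian noise to a depth map; the noise scale `sigma` is an assumed value for illustration only, and the actual depth noise model is the one defined in Section 3.2 (the pose noise parameters are derived below).

```python
import torch

def add_depth_noise(depth: torch.Tensor, sigma: float = 0.005) -> torch.Tensor:
    """Perturb a depth map (in meters) with zero-mean Gaussian noise;
    `sigma` is illustrative, not a value taken from the paper."""
    noisy = depth + torch.randn_like(depth) * sigma
    return noisy.clamp(min=0.0)  # keep depths physically valid
```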

The **CoRBS** dataset [33], a comprehensive RGB-D benchmark for SLAM, provides (i) real depth data and (ii) real color data, both captured with a Kinect v2, (iii) a ground-truth trajectory of the camera obtained with an external motion capture system, and (iv) a ground-truth 3D model of the scene generated with an external 3D scanner. In total, the dataset contains 20 image sequences of 4 different scenes.

**Noise Simulation**. As introduced in Section 3.2, we need the parameters $\mu_t$, $\sigma_t$, $\mu_r$, and $\sigma_r$ to mimic real sensor noise. Since the CoRBS dataset provides both real-scene data and ground-truth data, we use it to obtain realistic pose noise for the simulation. To measure the pose noise, we follow the calculation of the commonly used relative pose error (RPE) [34], which is defined as the drift of the trajectory over a fixed time interval $\Delta$. For a sequence of $n$ frames, the relative pose error at time step $i$ is first calculated as

$$E_i = \left(I_i^{-1} I_{i+\Delta}\right)^{-1} \left(J_i^{-1} J_{i+\Delta}\right), \tag{15}$$

where $I$ is the ground-truth trajectory and $J$ is the estimated trajectory. In this way, $m = n - \Delta$ individual relative pose error matrices are obtained along the sequence. The RPE is generally decomposed into two components, i.e., the translational part $\operatorname{trans}(E_i)$ and the rotational part $R = \operatorname{rot}(E_i)$. We use the following formulas to obtain the $\mu$ and $\sigma$ parameters of the normal distributions:

$$\mu_t = \frac{1}{m} \sum_{i=1}^{m} \lVert \operatorname{trans}(E_i) \rVert \tag{16}$$

$$\sigma_t = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( \lVert \operatorname{trans}(E_i) \rVert - \mu_t \right)^2} \tag{17}$$

$$\mu_r = \frac{1}{m} \sum_{i=1}^{m} \angle \operatorname{rot}(E_i) \tag{18}$$

$$\sigma_r = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( \angle \operatorname{rot}(E_i) - \mu_r \right)^2}, \tag{19}$$

where $\angle \operatorname{rot}(E_i) = \arccos\left(\frac{\operatorname{Tr}(R)-1}{2}\right)$ and $\operatorname{Tr}(R)$ denotes the trace, i.e., the sum of the diagonal elements of the rotation matrix $R$.
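A minimal NumPy sketch of Equations (15)–(19), assuming both trajectories are given as lists of 4 × 4 homogeneous pose matrices:

```python
import numpy as np

def rpe_matrices(gt, est, delta=1):
    """Relative pose error matrices E_i (Eq. (15)); gt and est are lists
    of 4x4 homogeneous pose matrices over the same frame sequence."""
    m = len(gt) - delta
    return [np.linalg.inv(np.linalg.inv(gt[i]) @ gt[i + delta])
            @ (np.linalg.inv(est[i]) @ est[i + delta]) for i in range(m)]

def rpe_statistics(E):
    """mu_t, sigma_t, mu_r, sigma_r of the RPE (Eqs. (16)-(19))."""
    trans = np.array([np.linalg.norm(e[:3, 3]) for e in E])
    # angle of rot(E_i): arccos((Tr(R) - 1) / 2), clipped against round-off
    rot = np.array([np.arccos(np.clip((np.trace(e[:3, :3]) - 1) / 2, -1.0, 1.0))
                    for e in E])
    return trans.mean(), trans.std(), rot.mean(), rot.std()
```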

For the translation error, $\mu_t$ is 0.006 and $\sigma_t$ is 0.004; for the rotation error, $\mu_r$ is 0.094 and $\sigma_r$ is 0.068. These values are used in the noise simulation for our experiments. They are also suitable for training the DFusion model for practical use; if stronger sensor noise is expected, they can be increased moderately, preferably keeping $\mu_t$ and $\sigma_t$ no larger than 0.02 and $\mu_r$ and $\sigma_r$ no larger than 0.2, within which the DFusion model still gives good fusion results.
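One straightforward way to realize these statistics in simulation is to draw a translation magnitude from $N(\mu_t, \sigma_t)$ along a random direction and a rotation angle from $N(\mu_r, \sigma_r)$ about a random axis, as sketched below via Rodrigues' formula; how the resulting perturbation is composed with a ground-truth pose (left versus right multiplication) is our assumption.

```python
import numpy as np

def sample_pose_noise(mu_t=0.006, sigma_t=0.004, mu_r=0.094, sigma_r=0.068,
                      rng=None):
    """Draw one 4x4 pose perturbation matching the measured RPE statistics."""
    rng = rng or np.random.default_rng()
    d = rng.standard_normal(3); d /= np.linalg.norm(d)  # translation direction
    a = rng.standard_normal(3); a /= np.linalg.norm(a)  # rotation axis
    theta = rng.normal(mu_r, sigma_r)                   # rotation angle (rad)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    T = np.eye(4)
    T[:3, :3] = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    T[:3, 3] = rng.normal(mu_t, sigma_t) * d            # translation offset
    return T
```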

#### *4.3. Evaluation Results*

The experiments are conducted on the ShapeNet and CoRBS datasets. For the ShapeNet dataset, which contains synthetic data, we evaluate under two settings: with depth noise only, and with both depth noise and pose noise. The results are shown in Tables 1 and 2. To compare with state-of-the-art methods, our method is evaluated with four metrics: the mean squared error (MSE), the mean absolute distance (MAD), the intersection over union (IoU) and the accuracy (ACC). MSE and MAD measure the distance between the estimated TSDF and the ground truth, while IoU and ACC quantify the occupancy of the estimation.

According to the results, our method outperforms the state-of-the-art methods on all metrics in both scenarios. The advantage is especially significant when both depth noise and pose noise are present. When only depth noise exists, RoutedFusion and the proposed DFusion perform similarly, with DFusion showing a slight advantage owing to the post-processing of the Denoising Module. Figures 5 and 6 illustrate the fusion results on the ShapeNet dataset with depth noise and with pose noise added, respectively, and show the advantages of DFusion more intuitively. Consistent with the metric results, DFusion produces clean and precise fusions for all these objects. Thanks to their learned models, both RoutedFusion and DFusion give satisfactory outputs when depth noise is added, as shown in Figure 5. However, when pose noise exists (Figure 6), the fusion results of RoutedFusion deteriorate considerably, while DFusion still produces precise output.
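For reference, a minimal sketch of the four metrics on TSDF volumes is given below; thresholding the signed distance at zero to obtain occupancy is our assumed convention.

```python
import numpy as np

def tsdf_metrics(est, gt):
    """MSE / MAD on TSDF values; IoU / ACC on the occupancy grids obtained
    by thresholding the signed distance at zero (assumed convention)."""
    mse = np.mean((est - gt) ** 2)
    mad = np.mean(np.abs(est - gt))
    occ_est, occ_gt = est < 0, gt < 0  # voxels inside the surface
    inter = np.logical_and(occ_est, occ_gt).sum()
    union = np.logical_or(occ_est, occ_gt).sum()
    iou = inter / max(union, 1)
    acc = (occ_est == occ_gt).mean()
    return {"MSE": mse, "MAD": mad, "IoU": iou, "ACC": acc}
```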

**Table 1.** Comparison results on ShapeNet (with only depth noise).


**Table 2.** Comparison results on ShapeNet (with depth noise and pose noise).



**Figure 5.** Fusion results on the ShapeNet dataset with depth noise added.


**Figure 6.** Fusion results on the ShapeNet dataset with pose noise added.

For the CoRBS dataset, we choose four real scenes to compare with the KinectFusion and RoutedFusion methods. Unlike for the synthetic data, the pose information has to be estimated before the depth maps are fused. KinectFusion already includes such a step, namely the iterative closest point (ICP) algorithm [36]. Hence, to generate the TSDF volume, we also use the ICP algorithm to obtain the pose information for RoutedFusion and DFusion (see the sketch below), and then compare the results on the MAD metric. The results are shown in Table 3. Our method achieves the best result for all scenes. We also show visualization results in Figure 7, which demonstrate that our method can denoise the TSDF volume effectively and obtain more complete and smooth object models (note the cabinet edges, the desk legs, and the arms of the human model).
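As an illustration of this pose-estimation step, the sketch below runs point-to-point ICP with Open3D; the correspondence threshold `max_dist` is an assumed value, and KinectFusion's own projective point-to-plane ICP variant differs in its details.

```python
import numpy as np
import open3d as o3d

def icp_pose(src_pts, tgt_pts, init=np.eye(4), max_dist=0.05):
    """Estimate the rigid transform aligning src_pts (Nx3) to tgt_pts (Mx3)."""
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(src_pts))
    tgt = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(tgt_pts))
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # 4x4 homogeneous matrix
```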

**Figure 7.** Fusion results on the CoRBS dataset. The ICP algorithm [36] is used to obtain the sensor trajectory for RoutedFusion and DFusion.



#### *4.4. Ablation Study*

To verify the effectiveness of the proposed loss function, we perform an ablation study that compares the default loss with three variants: the loss without the object loss, the loss without the surface loss, and the loss without both the object and surface losses. The default setting involves the space loss, the object loss and the surface loss. All variants are trained and evaluated on the ShapeNet dataset with both depth noise and pose noise added. The results are shown in Table 4, and a sketch of the ablated loss is given after the table. The default setting achieves the best performance on all metrics, which demonstrates the effectiveness of the proposed loss terms.

**Table 4.** Variants of the proposed method (with depth noise and pose noise).
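For illustration, a minimal PyTorch sketch of the ablated loss follows. The three terms are written here as masked L1 penalties over hypothetical `space`/`object`/`surface` voxel masks; the exact definitions of the terms are those of Section 3, not this simplification.

```python
import torch

def ablation_loss(pred, gt, masks, use_object=True, use_surface=True):
    """Total loss with switchable terms; `masks` maps the (hypothetical)
    keys "space", "object", "surface" to boolean voxel masks."""
    l1 = torch.abs(pred - gt)
    loss = l1[masks["space"]].mean()               # space loss (always kept)
    if use_object:
        loss = loss + l1[masks["object"]].mean()   # dropped in "w/o object"
    if use_surface:
        loss = loss + l1[masks["surface"]].mean()  # dropped in "w/o surface"
    return loss
```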

