**3. Methodology**

#### *3.1. TSDF Fusion*

Standard TSDF fusion, proposed by Curless and Levoy [1], integrates a depth map *Di*, together with the camera pose and camera intrinsics, into a signed distance function *Vi* ∈ R<sup>*X*×*Y*×*Z*</sup> and a weight function *Wi* ∈ R<sup>*X*×*Y*×*Z*</sup>. For a location *x*, the integration process can be expressed as follows:

$$V\_i(\mathbf{x}) = \frac{W\_{i-1}(\mathbf{x})V\_{i-1}(\mathbf{x}) + w\_i(\mathbf{x})v\_i(\mathbf{x})}{W\_{i-1}(\mathbf{x}) + w\_i(\mathbf{x})} \tag{1}$$

$$W\_{i}(\mathbf{x}) = W\_{i-1}(\mathbf{x}) + w\_{i}(\mathbf{x}) \tag{2}$$

It is an incremental process, and *V*0 and *W*0 are initialized as zero volumes. At each time step *i*, the signed distance *vi* and its weight *wi* are estimated from the current depth map along each ray, and are then integrated into the cumulative signed distance function *Vi*(*x*) and the cumulative weight *Wi*(*x*).
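
Equations (1) and (2) amount to a per-voxel weighted running average. A minimal NumPy sketch of one fusion step is given below; the function and variable names are ours for illustration and are not taken from [1,4]:

```python
import numpy as np

def tsdf_update(V_prev, W_prev, v_i, w_i):
    """One incremental TSDF fusion step (Eqs. (1)-(2)).

    V_prev, W_prev: cumulative signed-distance and weight volumes, shape (X, Y, Z).
    v_i, w_i: per-voxel signed-distance estimate and weight derived from the
              current depth map, already projected into the same voxel grid.
    """
    W_new = W_prev + w_i                                   # Eq. (2)
    # Weighted running average (Eq. (1)); untouched voxels keep their old value.
    V_new = np.where(W_new > 0,
                     (W_prev * V_prev + w_i * v_i) / np.maximum(W_new, 1e-8),
                     V_prev)
    return V_new, W_new
```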

However, in the traditional approach the fusion parameters are tuned manually, which is laborious and makes it difficult to suppress artifacts while maintaining high performance. In RoutedFusion [4], the TSDF fusion process is carried out by a convolutional network, named the depth fusion network, which is trained to tune the parameters automatically. The inputs of the fusion network are depth maps, camera intrinsics, and camera poses. Each depth map is fused incrementally into the previous TSDF volume using the camera intrinsics and camera pose. The main purpose of RoutedFusion is to deal with the noise in the TSDF volume caused by noise in the depth maps. To remove the depth noise, the authors first train with depth maps corrupted by random noise, and then use a routing network to denoise the depth maps before fusing them with the fusion network.

In real applications, however, pose noise is also inevitable. Therefore, in our method, the inputs include both noisy depth maps and noisy camera poses.

#### *3.2. Network Architecture*

The proposed DFusion method mainly includes two parts: a Fusion Module for fusing depth maps and a Denoising Module for removing the depth noises and pose noises. These two modules are trained independently, with different loss functions.

**Fusion Module.** The Fusion Module follows the design of the fusion network proposed in RoutedFusion [4]. It fuses depth maps incrementally with a learned TSDF updating function, using the camera intrinsics and camera poses. The TSDF updates are then integrated to form a TSDF volume for the whole scene. The process of the Fusion Module is illustrated in the upper part of Figure 3. Although RoutedFusion can remove depth noise, it does so in a pre-processing network (the routing network mentioned in Section 3.1) applied before fusion, rather than after the Fusion Module as in our method. Moreover, unlike RoutedFusion, we consider not only the depth noise but also the pose noise, the latter of which is much more apparent after fusion is finished than before or during fusion. Therefore, we add a post-processing module to deal with both types of noise.

**Figure 3.** The DFusion model.

**Denoising Module.** After obtaining the TSDF volume, the Denoising Module is designed to remove the noise in the TSDF volume. The input of the Denoising Module, which is also the output of the Fusion Module, is a TSDF volume corrupted by depth and pose noise. Since it operates on a 3D volume, we adopt 3D convolutional layers instead of 2D convolutional layers, aiming to capture more 3D features for denoising (3D convolutional layers are a natural choice for tasks such as 3D reconstruction [30], and recognizing 3D shifts is extremely difficult for 2D convolutions). As shown in Figure 3, the Denoising Module is implemented as a UNet-like network, which downsamples the features in the encoder part and upsamples them back to the original size in the decoder part. Skip connections are added between encoder layers and decoder layers.
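
As a concrete illustration, such a UNet-like Denoising Module with 3D convolutions can be sketched in PyTorch as follows; the number of levels and the channel widths here are illustrative assumptions, not the exact configuration of DFusion:

```python
import torch
import torch.nn as nn

class Denoising3DUNet(nn.Module):
    """Minimal 3D UNet-like sketch of a TSDF denoising network.

    Assumes even spatial dimensions; depth/width are illustrative only.
    """
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(1, ch, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv3d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv3d(ch * 2, ch * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose3d(ch * 2, ch, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv3d(ch * 2, ch, 3, padding=1), nn.ReLU())
        self.out = nn.Conv3d(ch, 1, 1)

    def forward(self, tsdf):                        # tsdf: (B, 1, X, Y, Z)
        e1 = self.enc1(tsdf)                        # encoder, full resolution
        e2 = self.enc2(e1)                          # encoder, downsampled by 2
        b = self.bottleneck(e2)
        d1 = self.up(b)                             # decoder, upsampled back
        d1 = self.dec1(torch.cat([d1, e1], dim=1))  # skip connection
        return self.out(d1)                         # denoised TSDF volume
```

The skip connection concatenates the full-resolution encoder features with the upsampled decoder features, so surface details lost during downsampling can be recovered in the output.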

In the training phase, to mimic the noise of real-world applications, we add random noise to the ground-truth depth maps and camera poses of the dataset. Therefore, the output of the Fusion Module, which is also the input of the Denoising Module, is noisy and needs to be corrected. For the depth noise, we add noise *Bd*, drawn from a normal distribution, to all pixels *P* of the depth maps (following the solutions in [4,23]). This process can be represented as

$$P' := P + B\_d, \tag{3}$$

and

$$B\_d \sim \mathcal{N}[0, \sigma\_d],\tag{4}$$

where *σd* is a pre-defined scale parameter. This parameter should be set to reflect the actual noise level of the application. We set *σd* = 0.005, following [4,23].
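
A minimal sketch of this perturbation (assuming depth maps stored as NumPy arrays in meters):

```python
import numpy as np

def add_depth_noise(depth, sigma_d=0.005, rng=None):
    """Add zero-mean Gaussian noise to every pixel of a depth map (Eqs. (3)-(4))."""
    rng = np.random.default_rng() if rng is None else rng
    return depth + rng.normal(loc=0.0, scale=sigma_d, size=depth.shape)
```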

As for pose noise, we add noise to the translation matrix *T* and the rotation matrix *R*, respectively. Given a random translation error *Bt*, a random rotation error *Br*, and two random unit vectors *nt* = (*n*1, *n*2, *n*3) and *nr* = (*n*4, *n*5, *n*6) (for the translation and rotation errors, respectively), the noised translation matrix and rotation matrix are calculated as follows.

$$\begin{aligned} T' &:= T + n\_t \cdot B\_t \\ R' &:= R \cdot \text{Rodri}(n\_r, B\_r) \end{aligned} \tag{5}$$

where Rodri(*nr*, *Br*) follows Rodrigues's rotation formula and it can be represented as:

$$\begin{pmatrix} n\_4^2(1-\cos B\_r) + \cos B\_r & n\_4 n\_5 (1-\cos B\_r) - n\_6 \sin B\_r & n\_4 n\_6 (1-\cos B\_r) + n\_5 \sin B\_r \\ n\_4 n\_5 (1-\cos B\_r) + n\_6 \sin B\_r & n\_5^2 (1-\cos B\_r) + \cos B\_r & n\_5 n\_6 (1-\cos B\_r) - n\_4 \sin B\_r \\ n\_4 n\_6 (1-\cos B\_r) - n\_5 \sin B\_r & n\_5 n\_6 (1-\cos B\_r) + n\_4 \sin B\_r & n\_6^2 (1-\cos B\_r) + \cos B\_r \end{pmatrix} \tag{6}$$

In addition, *Bt* and *Br* also follow the normal distribution.

$$\begin{aligned} B\_t &\sim \mathcal{N}[\mu\_t, \sigma\_t] \\ B\_r &\sim \mathcal{N}[\mu\_r, \sigma\_r] \end{aligned} \tag{7}$$

Since there is no existing method that adds artificial pose noise to improve denoising performance, the values of *μ* and *σ* are decided based on a real-scene dataset. More details are given in Section 4.2.
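
A sketch of this pose perturbation is given below. The `rodrigues` helper implements Eq. (6) in its equivalent cross-product-matrix form, and composing the perturbation by right-multiplying *R* follows our reading of Eq. (5); the noise scales are assumed to come from the real-scene statistics discussed in Section 4.2.

```python
import numpy as np

def rodrigues(n, angle):
    """Rotation matrix about unit axis n = (n4, n5, n6) by `angle` (Eq. (6))."""
    K = np.array([[0.0, -n[2], n[1]],
                  [n[2], 0.0, -n[0]],
                  [-n[1], n[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def add_pose_noise(T, R, mu_t, sigma_t, mu_r, sigma_r, rng=None):
    """Perturb translation T (3-vector) and rotation R (3x3) as in Eqs. (5) and (7)."""
    rng = np.random.default_rng() if rng is None else rng
    n_t = rng.normal(size=3); n_t /= np.linalg.norm(n_t)   # random unit direction
    n_r = rng.normal(size=3); n_r /= np.linalg.norm(n_r)
    B_t = rng.normal(mu_t, sigma_t)                        # Eq. (7)
    B_r = rng.normal(mu_r, sigma_r)
    T_noisy = T + n_t * B_t
    R_noisy = R @ rodrigues(n_r, B_r)                      # compose small rotation error
    return T_noisy, R_noisy
```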

#### *3.3. Loss Functions*

Since there are two modules in the network, i.e., the Fusion Module and the Denoising Module, the total loss function consists of two parts, as follows.

**Fusion Loss.** The loss function of the Fusion Module is expressed as follows:

$$L\_F = \sum\_a \left[ \lambda\_1^F L\_1(V\_{\text{local},a}, V'\_{\text{local},a}) + \lambda\_2^F L\_C(V\_{\text{local},a}, V'\_{\text{local},a}) \right], \tag{8}$$

where *Vlocal,a* and *V'local,a* are the two local volumes along ray *a*, respectively from the network output and from the ground truth. *L*1 is the L1 loss and can be represented as

$$L\_1(V, V') = \frac{\sum\_{v\_m \in V, v\_m' \in V'} |v\_m - v\_m'|}{|V|} \tag{9}$$

In addition, we use the cosine distance loss *LC* (on the signs of the output volume and ground-truth volume) to ensure the fusion accuracy of the surface, following the setting in [4], which can be represented as

$$L\_{\mathbb{C}}(V, V') = 1 - \cos(\operatorname{sign}(V), \operatorname{sign}(V')),\tag{10}$$

where *sign*() takes the signs of the inputs and *cos*() computes the cosine of the angle between the input vectors.

In addition, *λF*1 and *λF*2 are the weights for the loss terms and are empirically set to 1 and 0.1 [4], respectively.
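
Putting Eqs. (8)–(10) together, the fusion loss over the local volumes extracted along each ray can be sketched as follows; this is a PyTorch sketch under our own tensor-layout assumptions, with `v_pred` and `v_gt` holding the flattened local TSDF values per ray:

```python
import torch
import torch.nn.functional as F

def fusion_loss(v_pred, v_gt, lam1=1.0, lam2=0.1):
    """Sketch of the Fusion loss (Eqs. (8)-(10)) on local ray volumes of shape (B, N)."""
    l1 = torch.mean(torch.abs(v_pred - v_gt))                              # Eq. (9)
    cos = F.cosine_similarity(torch.sign(v_pred), torch.sign(v_gt), dim=-1)
    l_cos = torch.mean(1.0 - cos)                                          # Eq. (10)
    return lam1 * l1 + lam2 * l_cos                                        # Eq. (8)
```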

**Denoising Loss.** The Denoising Module is also trained in a supervised manner, considering the fusion accuracy on the whole scene, objects, and surface regions. The loss function is defined as follows:

$$L\_D = \lambda\_1^D L\_{SPACE} + \lambda\_2^D L\_{OBJECT} + \lambda\_3^D L\_{SURFACE} \tag{11}$$

where *LSPACE*, *LOBJECT*, and *LSURFACE* are the losses for the whole scene, the objects, and the surface regions, respectively (as shown in Figure 4). *λD*1, *λD*2, and *λD*3 are the weights to adjust their relative importance.

*LSPACE* is defined as

$$L\_{SPACE} = L\_1(V, V'),\tag{12}$$

where *V* is the predicted scene volume and *V'* is the ground-truth volume. Let *VOBJECT* ⊆ *V*, and for each *vm* ∈ *VOBJECT*, the corresponding ground-truth value satisfies *v'm* ≤ 0; then

$$L\_{OBJECT} = L\_1(V\_{OBJECT}, V'\_{OBJECT}) \tag{13}$$

Similarly, let *VSURFACE* ⊆ *V*, and for each *vm* ∈ *VSURFACE*, the corresponding ground-truth value satisfies −*S* ≤ *v'm* ≤ *S*, where *S* is a threshold on the surface range (we set *S* to 0.02); then

$$L\_{SURFACE} = L\_1(V\_{SURFACE}, V'\_{SURFACE}) \tag{14}$$

We set the values of the hyperparameters *λD*1, *λD*2, and *λD*3 to 0.5, 0.25, and 0.25, respectively. The effects of the object loss and surface loss are explored in the ablation study.
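
The three terms can be sketched as masked L1 losses over the full volume. In the sketch below, the object and surface masks are derived from the ground-truth TSDF values, matching the definitions above (an assumption on our part); tensor names are illustrative:

```python
import torch

def denoising_loss(v_pred, v_gt, S=0.02, lam=(0.5, 0.25, 0.25)):
    """Sketch of the Denoising loss (Eqs. (11)-(14)) with region masks."""
    def masked_l1(mask):
        # Mean absolute error over the masked voxels (zero if the mask is empty).
        return torch.abs(v_pred[mask] - v_gt[mask]).mean() if mask.any() else v_pred.sum() * 0.0
    l_space = torch.abs(v_pred - v_gt).mean()       # Eq. (12): whole scene
    l_object = masked_l1(v_gt <= 0)                 # Eq. (13): inside objects
    l_surface = masked_l1(v_gt.abs() <= S)          # Eq. (14): near-surface band
    return lam[0] * l_space + lam[1] * l_object + lam[2] * l_surface
```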

**Figure 4.** The focus regions of the loss functions (green masks for the focus regions). (**a**) The illustration of the example scene, where one object exists. (**b**) The scene loss. (**c**) The object loss. (**d**) The surface loss.
