**1. Introduction**

Depth fusion is of great importance for many applications, such as augmented reality and autonomous driving. Many methods have been proposed in this area, and the truncated signed distance function (TSDF) [1] is one of the most widely used. However, TSDF fusion requires manual adjustment of its parameters, which can lead to thick artifacts. To address this problem, depth fusion methods with improved performance have emerged. Methods such as [2,3] use surfel-based or probabilistic approaches to generate 3D representations, which may be a voxel grid, a mesh, or a point cloud. Compared with these classical methods, convolutional neural network (CNN) based methods have shown advantages in fusion performance. However, their results still suffer from noisy input, which leads to missing surface details and incomplete geometry [4].

The data acquired by depth cameras inevitably contain a significant amount of noise. Although researchers have proposed many denoising methods, most works focus only on the noise in the depth maps and neglect the noise in the camera poses (pose noise for simplicity). Figure 1 illustrates the two types of noise. Figure 1a shows the noise-free case, in which a plane lies in the camera's field of view. Depth noise may take the form of outliers or missing data, as shown in Figure 1b, which leads to noisy TSDF volumes. As for pose noise, Figure 1c shows the camera with translation and rotation errors compared with Figure 1a, which causes trouble when integrating the TSDF updates because the extrinsic parameters are inaccurate. Both types of noise can degrade depth fusion results. However, only a few works focus on removing noise for TSDF fusion, even though both types of noise are unavoidable.
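A small illustrative sketch (not from the paper; the 2° rotation and 1 cm translation errors are assumed values) of why pose noise matters: even a slight error in the camera extrinsics shifts where a 3D point lands in the camera frame, so TSDF updates from that view are integrated at the wrong voxels.

```python
import math

def world_to_camera(point, yaw, translation):
    """Transform a 3D world point into a camera frame whose pose has the
    given yaw rotation (radians, about the z-axis) and translation."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    xr = c * x - s * y
    yr = s * x + c * y
    return (xr - translation[0], yr - translation[1], z - translation[2])

point = (1.0, 0.0, 2.0)                                   # a point on the observed plane
clean = world_to_camera(point, 0.0, (0.0, 0.0, 0.0))      # ground-truth pose
# Assumed pose noise: 2 degrees of rotation error, 1 cm of translation error
noisy = world_to_camera(point, math.radians(2.0), (0.01, 0.0, 0.0))

shift = math.dist(clean, noisy)                           # displacement in the camera frame
print(f"surface shift caused by pose noise: {shift * 100:.2f} cm")
```

Even these small errors displace the point by a few centimeters, which is large relative to a typical truncation band.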

**Citation:** Niu, Z.; Fujimoto, Y.; Kanbara, M.; Sawabe, T.; Kato, H. DFusion: Denoised TSDF Fusion of Multiple Depth Maps with Sensor Pose Noises. *Sensors* **2022**, *22*, 1631. https://doi.org/10.3390/s22041631

Academic Editor: Jing Tian

Received: 4 January 2022; Accepted: 2 February 2022; Published: 19 February 2022



**Figure 1.** Illustration of the sensor noises. (**a**) Sensor without noises. (**b**) Depth noises. (**c**) Sensor pose noises.

The RoutedFusion method [4], as an example, considers depth noise and aims to obtain a TSDF volume that is robust against different levels of depth noise. It takes depth maps from synthetic datasets and injects random noise into them. However, in the fusion process it uses the ground-truth camera poses from the synthetic dataset, so the results are robust only against depth noise, not against pose noise. In this paper, we propose a method named DFusion that considers not only depth noise but also pose noise, as shown in Figure 2. To the best of our knowledge, this is one of the earliest studies that tries to avoid the performance drop caused by pose noise.

**Figure 2.** DFusion can minimize the influence of both types of noises.

Generally, depth fusion is conducted with 2D convolutional models. However, when considering pose noise, it is better to remove the noise in the 3D representation, because the resulting surface shifts are difficult to recognize and remove in 2D space. Therefore, we first adopt a Fusion Module, the first part of DFusion, with the same settings as the fusion network in the RoutedFusion method, to fuse the depth maps and camera poses into a TSDF volume. After obtaining the integrated TSDF volume, we design a Denoising Module, a UNet-like neural network, as the second part of DFusion to denoise the TSDF volume. Since the input of the Denoising Module is a 3D volume, 3D convolutional layers are used to extract 3D features. Skip connections are used to avoid the vanishing gradient problem, which is prone to occur due to the small values in the TSDF volume.
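A toy sketch (not the paper's architecture) of why skip connections help when the signal, like a truncated TSDF, is already small: a stack of layers that each attenuate their input drives the value, and likewise its gradient, toward zero, while adding the input back at each layer keeps the signal on the order of the input.

```python
def layer(x, gain=0.1):
    """Stand-in for one learned layer that attenuates its input."""
    return gain * x

def stack_plain(x, depth=5):
    """Plain stack: each layer's output feeds the next directly."""
    for _ in range(depth):
        x = layer(x)
    return x

def stack_skip(x, depth=5):
    """Skip-connected stack: output = input + residual at every layer."""
    for _ in range(depth):
        x = x + layer(x)
    return x

tsdf_value = 0.05                    # a small truncated signed distance
plain = stack_plain(tsdf_value)      # shrinks toward zero
skipped = stack_skip(tsdf_value)     # stays on the order of the input
print(plain, skipped)
```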

For training the networks, we utilize a synthetic dataset that provides ground-truth depth maps and camera poses. The model is trained in a supervised manner. In addition to the commonly used fusion loss, several specially designed loss functions are proposed, including an *L*1 loss over all voxels in the whole scene and *L*1 losses over the object and surface regions for better fusion performance in these regions.
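A hedged sketch of this loss design: an L1 term over every voxel plus L1 terms restricted to object and near-surface voxels. The masks, the truncation distance, and the equal weighting below are illustrative assumptions, not the paper's exact values.

```python
def masked_l1(pred, target, mask):
    """Mean absolute error over voxels where mask is True."""
    errs = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0

# Flattened toy TSDF volumes (prediction vs. ground truth) and per-voxel labels.
pred    = [0.09, 0.01, -0.02, 0.10, -0.08]
gt      = [0.10, 0.00, -0.01, 0.10, -0.10]
is_obj  = [False, True,  True, False, True]    # voxels belonging to objects (assumed)
trunc   = 0.05                                 # truncation distance (assumed)
is_surf = [abs(t) < trunc for t in gt]         # near-surface voxels

scene_loss   = masked_l1(pred, gt, [True] * len(gt))   # L1 over the whole scene
object_loss  = masked_l1(pred, gt, is_obj)             # L1 over object voxels
surface_loss = masked_l1(pred, gt, is_surf)            # L1 over surface voxels
total = scene_loss + object_loss + surface_loss        # weights assumed equal
```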

In sum, the contributions of this work are as follows:


- Experiments demonstrate the denoising effects of the proposed method, and the ablation study proves the effectiveness of the proposed loss functions.

#### **2. Related Works**

#### *2.1. Depth Fusion and Reconstruction*

#### 2.1.1. Classical Methods

The TSDF fusion method [1] is one of the most important classical fusion methods. It fuses depth maps, together with the camera intrinsics and the corresponding viewpoints (i.e., camera poses), into a discretized signed distance function and weight function, thereby obtaining a volumetric representation. It has been adopted as the foundation of most depth-map-fusion-based 3D reconstruction methods, including KinectFusion [5], BundleFusion [6], and voxel hashing [7,8]. However, depth maps always contain noise, and all these methods deal with it by updating a wider band around the surface; as a result, noise artifacts, especially outlier blobs and thickened surfaces, appear in the results.
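The per-voxel update at the core of classical TSDF fusion [1] is a weighted running average of truncated signed distances. A minimal sketch for a single voxel (the truncation distance and per-observation weights are illustrative):

```python
def tsdf_update(D, W, d_new, w_new, trunc=0.1):
    """Integrate one signed-distance observation into a voxel's
    stored distance D and weight W (Curless-Levoy style update)."""
    d_new = max(-trunc, min(trunc, d_new))       # truncate the signed distance
    D = (W * D + w_new * d_new) / (W + w_new)    # weighted running average
    W = W + w_new
    return D, W

D, W = 0.0, 0.0
for obs in [0.04, 0.05, 0.03]:                   # three depth-map observations
    D, W = tsdf_update(D, W, obs, w_new=1.0)
print(D, W)                                       # ≈ 0.04, 3.0
```

Because every observation inside the truncation band is averaged in, noise widens the band that gets updated, which is exactly the source of the thickening artifacts noted above.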

In contrast to voxel-based methods, some reconstruction approaches update the results in different ways. For example, Zienkiewicz et al. [9] introduce a scalable method that fuses depth maps into a multi-resolution mesh instead of a voxel grid. Keller et al. [10] design a flat, point-based representation [2] that uses the depth sensor input directly without converting representations, thereby saving memory and increasing speed. In addition, the surfel-based approach, which approximates the surface with local points, has been adopted for reconstruction [2,11]. This approach can build unstructured neighborhood relationships, although it usually misses connectivity information among surfels. MRSMap [12], as an example, integrates depth maps into a multi-resolution surfel map for objects and indoor scenes.

Some researchers also regard the depth map fusion process as a probabilistic density problem [3,12–14], considering various ray directions. Yong et al. [15] estimate the probability density function from the original point cloud instead of the depth map and use a mathematical-expectation method to reduce the computational complexity. In [16], the marginal distribution of each voxel's occupancy and appearance is computed by a Markov random field along the camera rays. However, all these classical methods struggle to balance reconstruction quality, scene assumptions, speed, and spatial scale, owing to large, complex computations but limited memory.
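An illustrative sketch of the probabilistic view (not the exact formulation of [15,16]): each ray observation updates a voxel's occupancy probability, here in the standard log-odds form, which turns Bayesian updates under an independence assumption into additions.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def fuse_occupancy(prior, observations):
    """Fuse per-ray occupancy likelihoods into one voxel probability."""
    l = logit(prior)
    for p in observations:
        l += logit(p)          # independent observations add in log-odds space
    return 1.0 / (1.0 + math.exp(-l))

# Three rays suggest the voxel is occupied; one disagrees.
p = fuse_occupancy(0.5, [0.7, 0.8, 0.7, 0.4])
print(f"fused occupancy probability: {p:.3f}")
```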

#### 2.1.2. Learning-Based Methods

Along with the development of deep learning, many methods have been proposed that improve on classical 3D reconstruction [17]. For example, the ScanComplete [18] method completes and refines 3D scans with a CNN model, handling large-scale input and producing high-resolution output. RayNet [19], which combines a CNN model with Markov random fields, considers both local and global information in multi-view images; it can cope with large surfaces and handle occlusion. Based on Mask R-CNN [20], Mesh R-CNN [21] detects objects in an image, builds meshes with a mesh prediction model, and refines them with a mesh refinement model.

Specifically, TSDF fusion remains an important step in many learning-based approaches [22]. OctNetFusion [23] fuses the depth maps with TSDF fusion and subsequently uses a 3D CNN model to handle occluded regions and refine the surfaces. Leroy et al. [24] propose a deep-learning-based method for multi-view photoconsistency, which matches features across viewpoints to obtain depth information; the depth maps are again finally fused by TSDF fusion. RoutedFusion [4] also fuses depth maps based on standard TSDF fusion, but differs from other methods in that it reproduces TSDF fusion with a CNN model that predicts the volume and weight parameters, so the volumetric representation can be updated with new volume and weight values sequentially.

Compared with classical methods, deep-learning-based methods show advantages in handling thickening artifacts and in increasing diversity and efficiency. However, existing methods pay little attention to the noise problem during the fusion process. Our method first adopts part of the RoutedFusion models to fuse the 3D volume, then applies a specially designed neural network to remove the noise, thereby improving the performance of depth fusion.

#### *2.2. Denoising/Noise Reduction*

Most works treat the noise as depth noise and try to remove it at the beginning of the fusion process. The authors of [3,25] adopt Gaussian noise to mimic the real depth noise produced by depth sensors, then perform scene reconstruction. Cherabier et al. [26] also remove regions of random shape, such as circles and triangles, to simulate missing data. In RoutedFusion [4], the authors add random noise to the depth maps and propose a routing network that removes it, then use a fusion network to fuse the denoised depth maps into a TSDF volume. Their experiments show that the routing network significantly improves accuracy.
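A sketch of the kind of noise simulation described above (the noise level and drop probability are assumed values, not taken from [3,25,26]): additive Gaussian noise mimics sensor error, and randomly zeroed measurements stand in for missing data.

```python
import random

random.seed(0)

def add_depth_noise(depth, sigma=0.01, drop_prob=0.05):
    """Corrupt a flat list of depth values (meters) with Gaussian jitter
    plus randomly dropped (zeroed) measurements simulating holes."""
    noisy = []
    for d in depth:
        if random.random() < drop_prob:
            noisy.append(0.0)                    # missing measurement
        else:
            noisy.append(d + random.gauss(0.0, sigma))
    return noisy

clean = [2.0] * 100                              # a flat wall 2 m away
noisy = add_depth_noise(clean)
holes = sum(1 for d in noisy if d == 0.0)
print(f"{holes} dropped pixels out of {len(noisy)}")
```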

Another way to cope with noise is to refine the 3D representation directly. NPD [27] trains a network using a reference plane from the noiseless point cloud as well as the normal vector of each point, while PointCleanNet [28] first removes outliers and then denoises the remaining points by estimating normal vectors. Han et al. [29] propose a local 3D network to refine patch-level surfaces, but it needs to obtain the global structure from the depth images first, which is inconvenient and time-consuming. Zollhöfer et al. [25] propose a method that uses the details, such as shading cues, of the color image to refine the fused TSDF volume, since the color image typically has a higher resolution. The 3D-CFCN model [30], a cascaded fully convolutional network, combines features from low-resolution and high-resolution input TSDF volumes to remove noise and refine the surface. However, all these methods consider only either the outliers of the 3D representation or the noise caused by depth maps. In our method, we design a denoising network with 3D convolutional layers that removes noise from the TSDF volume without any additional information. In addition, we take the noise of both depth maps and camera poses into account; thus, the network is robust against not only depth noise but also pose noise.
