Article

Stage-Aware Interaction Network for Point Cloud Completion

by Hang Wu and Yubin Miao *
School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3296; https://doi.org/10.3390/electronics13163296
Submission received: 5 July 2024 / Revised: 13 August 2024 / Accepted: 14 August 2024 / Published: 20 August 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Point cloud completion aims to restore the full shapes of objects from partial scans, and a typical network pipeline is the AutoEncoder with coarse-to-fine refinement modules. Although existing approaches using this kind of architecture achieve promising results, they usually neglect the shallow geometry features of partial inputs and the fusion of multi-stage features in the upsampling process, which prevents network performance from improving further. Therefore, in this paper, we propose a new method with dense interactions between different encoding and decoding steps. First, we introduce the Decoupled Multi-head Transformer (DMT), which implements semantic prediction and resolution upsampling in a unified network module and serves as a primary ingredient in our pipeline. Second, we propose an Encoding-aware Coarse Decoder (ECD) that compactly makes the top–down shape-decoding process interact with the bottom–up feature-encoding process to utilize both shallow and deep features of partial inputs for coarse point cloud generation. Third, we design a Stage-aware Refinement Group (SRG), which comprehensively understands local semantics from densely connected features across different decoding stages and gradually upsamples point clouds based on them. In general, the key contributions of our method are the DMT for joint semantic-resolution generation, the ECD for multi-scale feature fusion-based shape decoding, and the SRG for stage-aware shape refinement. Evaluations on two synthetic and three real-world datasets illustrate that our method achieves competitive performance compared with existing approaches.

1. Introduction

Point clouds are sets of points in 3D Cartesian coordinates. They are the typical outputs of 3D scanners such as depth cameras or LiDARs in intelligent robots and are usually used to describe the positions and discretized shapes of object surfaces. As concise and effective shape representations of objects and scenes, point clouds are widely used in various visual perception tasks [1,2]. However, real-world raw point clouds acquired by scanners are usually incomplete due to view angles, occlusions, surface materials, etc. Incomplete point clouds usually cause structural losses of target object shapes and might lead to ambiguous results in downstream tasks such as semantic segmentation, object detection, path planning, robot grasping, etc. Therefore, with the help of recent advances in deep learning, many point cloud completion approaches elaborate networks to restore complete shapes from the learned features of partial scans [3,4,5,6,7,8,9,10,11,12].
Point cloud completion approaches usually use 2048 [13,14] or 16,384 [3,8] (i.e., multiples of 2048) points to represent each object; in this paper, we choose to generate a dense point cloud that contains 16,384 points to represent object shapes at high resolution. A typical dense point cloud completion architecture is the AutoEncoder-based coarse-to-fine pipeline [3], which first generates low-resolution shape skeletons and then gradually upsamples points. For coarse shape generators, some approaches use a Multilayer Perceptron (MLP) [3,4,6] to predict flattened point coordinates from global features, while other methods [10] use transformers to derive shapes from point-wise features of sampled partial points to avoid the capacity limitation of global features when describing local details. For upsamplers, folding-based networks [3,7,11] try to let canonical grids fit the local topologies of coarse points, while some transformer-based methods [8,10] learn attention scores to aggregate point feature vectors and use them to estimate new points.
Although existing approaches achieve promising results, there is still some room for improvement. First, their coarse shapes are usually generated from the deep features of inputs [3,8,10], while the shallow features are neglected. Specifically, their decoders are directly placed behind the last layer of the encoder, where the features are obtained from either global pooling or several sampling and encoding modules. Although shallow features have proven useful in semantic segmentation since they include more fine-grained local shape information [15,16], such information is not comprehensively used in point cloud completion approaches. Under such circumstances, the decoder still faces a loss of local details in its early generation steps, which can hardly be fully compensated by the refinement modules (i.e., upsamplers). Second, their upsamplers are typically cascaded in one chain, where each upsampler (e.g., in the third stage) takes the features and shapes from the upsampler directly in front of it (e.g., in the second stage) to generate new points. Under such circumstances, features with potential noise or errors in early stages are less representative and informative, which might influence all subsequent stages of the feedforward network [17]. A possible solution is a feedback mechanism [17], but it may require several time steps to execute the network and increases the computational complexity.
Before introducing our method, we need to revisit two targets in point cloud completion: predicting new semantics that are lost or incomplete in partial scans, and upsampling the density of sparse points to refine local details. Semantics can be defined as the meanings of components in objects, and they are used to precisely depict specific object shapes. Thus, ‘new semantics’ refers to the semantics that are unavailable or incomplete in partial scans but should be included in complete shapes. Taking the chair in Figure 1 as an example, its complete semantics are as follows: a chair back that is thicker on the top, four legs with the same thickness, a cushion covering four legs, and four continuous beams that connect neighboring legs. In a partial scan, we can find that ‘thicker on the top’, ‘covering four legs’, and ‘continuous’ are to some extent lost in the chair back, cushion and one beam, respectively, so they are the semantics that need to be predicted and reflected in the output shape.
Our method introduces three main modules to address the problems in existing approaches and realize the two targets above. First, we introduce the Decoupled Multi-head Transformer (DMT), which can jointly predict new semantics and upsample local feature resolutions; DMT acts as the basic ingredient of our shape-generation modules. Second, we propose an Encoding-aware Coarse Decoder (ECD), which makes the top–down shape skeleton-decoding process interact compactly with the bottom–up feature-encoding process. In this way, it can predict missing semantics using the deep features that describe the observed semantics of partial inputs and further decode points after introducing the shallow features that include more detailed local information, which also facilitates the follow-up refinement process. Third, we design a Stage-aware Refinement Group (SRG) that consists of a group of upsamplers with dense connections across one another, which makes each upsampling module aware of the features and points from multiple decoding stages with different resolutions and scales. In this way, potential errors or noises in one stage can be amended after fusing the features of other stages and have less influence on the following stages.
In general, the main contributions of our method can be summarized as follows:
  • A Decoupled Multi-head Transformer (DMT), which integrates semantic generation and point upsampling in one module.
  • An Encoding-aware Coarse Decoder (ECD), which parallels the top–down decoding process with the bottom–up encoding process to generate shape skeletons based on both shallow and deep features of partial inputs.
  • A Stage-aware Refinement Group (SRG), which defines a coarse-to-fine pipeline with dense feature fusion across different point decoding and upsampling stages.
Our method is evaluated on two synthetic datasets MVP [6] and PCN [3], as well as three real-world datasets, Matterport3D [18], ScanNet [2] and KITTI [1]. Both quantitative and qualitative results illustrate the competitive performances of our method compared with existing representative approaches.

2. Related Work

In this section, we mainly discuss recent advances in learning-based generative models for 3D shapes and scenes.

2.1. Shape Generation

Three-dimensional objects can be described by different data structures, such as voxels [19], point clouds [20], and fields [21]. We will mainly discuss the last two representations. For points, pioneering approaches use MLPs [20] to estimate point coordinates or folding-based networks [22] to parameterize object surfaces. Recent approaches improve generation quality with more diversified techniques. For example, some folding-based methods substitute the parameterization priors with a sphere [23] or cube [24] and use adversarial training to obtain more realistic outputs. Flow-based methods model the distribution transformations from standard Gaussian priors to points with two or more normalizing flows [25,26] conditioned on shape latents. Diffusion-based methods learn the inverse diffusion process to let noise points form desired shapes, which can be applied in point cloud auto-encoding [27], image-conditioned generation [28], and shape editing [29]. For fields, the pioneering method [30] utilizes 3D convolutions in its AutoEncoder pipeline to predict distance values in voxels, which are used for mesh extraction. Recent neural field-based approaches tend to let the network estimate the occupancy value of any spatial position. Such a neural representation can be derived from a global latent [31], multiscale latent grids [32], sampled points with features [21], or sets of vectors [33].

2.2. Point Cloud Completion

Point clouds are sets of points that represent discrete object surfaces, and point cloud completion methods typically design AutoEncoders to restore shapes from the extracted features of partial scans. The pioneering method [3] utilizes stacked PointNet [34] layers as the encoder, an MLP as the Coarse Decoder, and folding-based layers as the upsampler. This kind of pipeline has been widely adopted by subsequent methods. For encoders, some methods [4,6,8] extract global features from inputs, some methods [7,10,11] extract local features of downsampled points to achieve better results, and other methods [12] extract features from rendered images of inputs to facilitate shape generation. For decoders, coarse shapes can be restored by an MLP [3,4,6] or a Transformer [7,10], and shape refinement can be realized by folding layers [7,11] or Transformers [8,10]. Besides generating whole shapes, some methods mainly focus on the missing parts [7,11,35]. In addition, another branch of methods tries to train networks without partial-complete point cloud pairs. These methods typically establish mappings between the distributions of partial and complete shapes, which can be achieved by latent transformation [36], cycle GAN [13], GAN inversion [14], energy functions [37], knowledge transfer [38], and symmetry preserving [39]. Adversarial learning is widely adopted in these unsupervised methods to improve generation quality when specific ground truths are not available.

2.3. Scene-Level Completion

Besides object-level completion, scene-level completion aims to restore whole partially scanned scenes with multiple instances. One typical topic is Semantic Scene Completion (SSC) [40], which represents indoor [40] or outdoor [41] scenes using semantic occupancy voxels and predicts them from depth images or point clouds. Some effective SSC methods include sketch-aware feature embedding [42], empty voxel removal and point–voxel fusion [43], local deep implicit functions [44], knowledge distillation with a noise-free model [45], the Dual-path Transformer [46], knowledge distillation from multi-frame to single-frame models [47], etc. Another recent topic in scene-level completion is Point Scene Understanding [48,49], which reconstructs certain categories of objects in scenes simultaneously. This branch of methods usually integrates object detection or instance segmentation with shape completion, and the restored objects are usually represented as meshes.

3. Method

A shape-completion example of our method is illustrated in Figure 1, and the specific completion pipeline of our method is illustrated in Figure 2. In Section 3.1, we will first introduce the Decoupled Multi-head Transformer (DMT), which acts as a basic ingredient of our network. In Section 3.2, we will discuss the Encoding-aware Coarse Decoder (ECD) that generates shape skeletons (i.e., $Y_c$ in Figure 2). Taking the first three blocks in Figure 1 as an example, when estimating points and semantics on one side of the chair, ECD queries the related points and areas (i.e., neighbors and similar shapes on the other side) from different encoding and downsampling stages. In Section 3.3, we will describe the Stage-aware Refinement Group (SRG) (i.e., the upsampling process in Figure 2). Taking the last upsampler in Figure 1 as an example, with dense connections, SRG simultaneously analyzes the shape structures represented by earlier low-resolution outputs and the more detailed local regions provided by later high-resolution outputs.

3.1. Decoupled Multi-Head Transformer (DMT)

Given input points $X = \{x_i\}_{i=1}^{n}$ (with point-wise features $Q = \{q_i\}_{i=1}^{n}$), DMT aims to find their relationships with a reference point cloud $Y = \{y_i\}_{i=1}^{m}$ (with features $R = \{r_i\}_{i=1}^{m}$) and fuse their features. Here, X and Y can be certain middle outputs of the network or partial inputs, which will be specified in Section 3.2 and Section 3.3. The output features of DMT, denoted Z, can have different resolutions and are used to either generate new shape parts (in the Coarse Decoder) or refine current shapes (in the upsamplers).
Inspired by [16], DMT uses vector attention to fuse the features of input and reference points according to their embeddings and relative locations. For each point x i X , the original vector attention is described as
$$z_i = \sum_{y_j \in \mathcal{N}(x_i)} A\big(H(q_i) - K(r_j) + P(x_i - y_j)\big) \odot \big(V(r_j) + P(x_i - y_j)\big) \tag{1}$$
where $\mathcal{N}(x_i)$ denotes the k-Nearest Neighbors (kNN) of $x_i$ in the reference point cloud Y. Function H maps the feature $q_i$ of input point $x_i$ to the query embedding $H(q_i)$, K maps the feature $r_j$ of the neighboring point $y_j \in \mathcal{N}(x_i)$ to the key embedding $K(r_j)$, and V is applied to $r_j$ to obtain the value embedding $V(r_j)$. The above embeddings are based on the absolute positions of points (i.e., Cartesian coordinates), so function P directly maps the relative position $x_i - y_j$ to a positional embedding. Then, function A measures the relevance of $y_j$ to $x_i$ by analyzing the joint effects of the embedding difference on absolute positions and the embedding on relative positions and maps such relations to point-wise attention scores. With the Hadamard product $\odot$, these scores are assigned as weights to the value and positional embeddings of $y_j$, and the summation $\sum$ aggregates these embeddings. In this way, vector attention learns the features of a local area that lies in the reference point cloud Y and is centered at the query point $x_i$. All these functions are implemented by networks. Typically, there is also a residual link that adds the aggregated embeddings to the input features.
However, the original vector attention learns a weighted average of local embeddings for each input point, which can hardly predict the missing parts or generate new points. To this end, as illustrated in Figure 3, we apply two independent groups of attention heads on different stages of DMT to decouple semantic prediction and resolution upsampling.
Specifically, functions H, K, V and P all map input features or coordinates to new embedding spaces, so different groups of these functions are able to estimate different related semantics for each input point; we define such a group as a semantic head. After that, function A calculates the weights of local embeddings for each input. If these weights are normalized via softmax, they serve as interpolations inside the local embeddings, and different groups of A return different interpolations conditioned on specific semantics; we define these groups as resolution heads. If the weights are not normalized, they can also generate new shape patterns [10], like the semantic heads.
In this way, it is convenient to control the shape generation and point upsampling process by adjusting the numbers of semantic heads $u_s$ and resolution heads $u_r$, so the final upsampling rate of a DMT module is $u_s \times u_r$. We define the function of DMT as
$$Z = \mathrm{DMT}(X, Q, Y, R; u_s, u_r) \tag{2}$$
Now, we summarize the inputs, parameters and outputs of the function DMT according to Equation (2). X is the input point cloud; it can either be a partial scan that requires semantic prediction or a middle output of the network that requires upsampling. Y and R are the reference point cloud and its point-wise features, which provide information (i.e., relative positions and local features) for X. DMT then derives the new feature set Z from the point-wise features Q of X. Z is used to generate new points, and its resolution is controlled by the parameters $u_s$ and $u_r$: $u_s$ mainly controls the generation of new semantics (e.g., the third block in Figure 1), and $u_r$ mainly controls the upsampling within the input point cloud (e.g., the seventh block in Figure 1).
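To make the decoupling of heads concrete, the following is a minimal PyTorch sketch of a DMT-like module, assuming one (H, K, V, P) group per semantic head and one attention MLP A per resolution head, together with a brute-force kNN helper. The layer sizes, the kNN routine, and the output projection are illustrative choices for a self-contained example, not the exact implementation described in Section 4.1.

```python
import torch
import torch.nn as nn

def knn_indices(x, y, k):
    # For each point in x (B, n, 3), return indices of its k nearest neighbors in y (B, m, 3).
    dists = torch.cdist(x, y)                                   # (B, n, m)
    return dists.topk(k, dim=-1, largest=False).indices         # (B, n, k)

class DMTSketch(nn.Module):
    """Simplified Decoupled Multi-head Transformer: u_s semantic heads x u_r resolution heads."""
    def __init__(self, dim, u_s, u_r, k=16):
        super().__init__()
        self.u_s, self.u_r, self.k = u_s, u_r, k
        # One (H, K, V, P) group per semantic head.
        self.H = nn.ModuleList([nn.Linear(dim, 64) for _ in range(u_s)])
        self.K = nn.ModuleList([nn.Linear(dim, 64) for _ in range(u_s)])
        self.V = nn.ModuleList([nn.Linear(dim, 64) for _ in range(u_s)])
        self.P = nn.ModuleList([nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))
                                for _ in range(u_s)])
        # One attention MLP A per resolution head (shared across semantic heads for brevity).
        self.A = nn.ModuleList([nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
                                for _ in range(u_r)])
        self.out = nn.Linear(64, dim)

    def forward(self, X, Q, Y, R):
        # X: (B, n, 3) inputs, Q: (B, n, d) features, Y: (B, m, 3) references, R: (B, m, d) features.
        idx = knn_indices(X, Y, self.k)                          # (B, n, k)
        gather = lambda t: torch.gather(
            t.unsqueeze(1).expand(-1, X.shape[1], -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, t.shape[-1]))   # (B, n, k, c)
        Yn, Rn = gather(Y), gather(R)                            # neighbor coordinates / features
        rel = X.unsqueeze(2) - Yn                                # (B, n, k, 3) relative positions
        outs = []
        for s in range(self.u_s):
            pos = self.P[s](rel)
            diff = self.H[s](Q).unsqueeze(2) - self.K[s](Rn) + pos   # relation term of Eq. (1)
            val = self.V[s](Rn) + pos
            for r in range(self.u_r):
                w = torch.softmax(self.A[r](diff), dim=2)        # per-channel attention over neighbors
                outs.append(self.out((w * val).sum(dim=2)))      # aggregate the local area
        # Every input point yields u_s * u_r feature vectors.
        return torch.stack(outs, dim=2).reshape(X.shape[0], -1, Q.shape[-1])

dmt = DMTSketch(dim=128, u_s=2, u_r=1, k=16)
X, Q = torch.rand(2, 128, 3), torch.rand(2, 128, 128)
print(dmt(X, Q, X, Q).shape)    # torch.Size([2, 256, 128]) -> upsampling rate u_s * u_r = 2
```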

3.2. Encoding-Aware Coarse Decoder (ECD)

We now introduce the Coarse Decoder, which takes the encoded features of the partial point cloud $X_p$ as inputs and generates a complete skeleton $Y_c$. Inspired by the design of U-Net [50], we compactly stack the top–down shape-generation process (i.e., the Decoder) with the bottom–up feature-extraction process (i.e., the Encoder) to make full use of features at different scales.
Specifically, as illustrated in Figure 4, we borrow the encoder of [10], which leverages two Set Abstraction (SA) modules [15] to gradually downsample $X_p$ to $X_p^1 = \{x_{p,i}^1\}_{i=1}^{n_1}$ and $X_p^2 = \{x_{p,i}^2\}_{i=1}^{n_2}$ and encodes their corresponding point-wise features $Z_p^1 = \{z_{p,i}^1\}_{i=1}^{n_1}$ and $Z_p^2 = \{z_{p,i}^2\}_{i=1}^{n_2}$ using vector attention, respectively. After that, the global feature vector $Z_p^3$ is obtained by applying an MLP and maxpooling on $Z_p^2$. This bottom–up encoder extracts features that represent local shape geometries at different scales at different stages.
Our decoder contains two DMTs that are in parallel with the SA modules. The first DMT takes the point cloud $X_p^2$ with its features as both inputs and references:
$$Z_c^2 = F_s \circ \mathrm{DMT}^d\big(X_p^2, F_d(Z_p^2, Z_p^3), X_p^2, F_d(Z_p^2, Z_p^3); u_s^d, u_r^d\big) \tag{3}$$
where the superscript d of DMT indicates that it analyzes the deeper encoded features. Because the AutoEncoder structure of the point-cloud-completion pipeline can implicitly cluster the global features of partial shapes according to their categories [3], we wish to derive new features from $Z_p^2$ conditioned on the global feature $Z_p^3$. Therefore, $F_d$ is a single mapping layer that takes the concatenation of $Z_p^2$ and $Z_p^3$ as input and outputs fused point-wise features on $X_p^2$. According to the definition in Equation (2), the output of $\mathrm{DMT}^d$ is $Z_c^2 = \{z_{c,i}^2\}_{i=1}^{n_2 \times u_s^d \times u_r^d}$, which contains $n_2 \times u_s^d \times u_r^d$ embeddings; then, we use another single layer $F_s$ to fuse it with the global feature $Z_p^3$ again and let it have the same number of dimensions as $Z_p^1$. Given that DMT assigns the same number of heads to each input query point and feature, each point $x_{p,i}^2 \in X_p^2$ equally derives $u_s^d \times u_r^d$ embeddings in $Z_c^2$. Therefore, we can consider that $Z_c^2$ is associated with a point cloud $X_p^{2\prime}$ that repeats $X_p^2$ by $u_s^d \times u_r^d$ times. On that basis, the second DMT further fuses $Z_c^2$ with $Z_p^1$:
$$Z_c^1 = \mathrm{DMT}^s(X_p^{2\prime}, Z_c^2, X_p^1, Z_p^1; 1, 1) \tag{4}$$
We set the numbers of heads to 1 to ensure that the output embedding set $Z_c^1$ has the same resolution as $Z_c^2$ (i.e., $Z_c^1 = \{z_{c,i}^1\}_{i=1}^{n_2 \times u_s^d \times u_r^d}$). Finally, we concatenate $Z_c^1$ with $Z_c^2$ and fuse them via a shared MLP to generate the coarse point cloud features $Z_c$; then, another MLP is used to decode $Z_c$ into the point cloud $Y_c$ with $n_c$ points, where $n_c = n_2 \times u_s^d \times u_r^d$.
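The resolution bookkeeping of the ECD can be traced with a small, runnable sketch. The linear layers below are placeholder stand-ins for the SA modules, DMTs, and fusion layers (they only reproduce the tensor shapes, not the attention computations), and the hyperparameters follow Section 4.1 ($n = 2048$, $n_1 = 512$, $n_2 = 128$, $u_s^d = 2$, $u_r^d = 1$).

```python
import torch
import torch.nn as nn

B, n, n1, n2, us_d, ur_d = 2, 2048, 512, 128, 2, 1

def fps_stub(points, m):
    # Placeholder for Furthest Point Sampling: simply keep the first m points.
    return points[:, :m]

X_p  = torch.rand(B, n, 3)                                   # partial input
X_p1 = fps_stub(X_p, n1);  Z_p1 = torch.rand(B, n1, 128)     # shallow SA stage
X_p2 = fps_stub(X_p1, n2); Z_p2 = torch.rand(B, n2, 256)     # deep SA stage
Z_p3 = torch.rand(B, 512)                                    # maxpooled global feature

F_d       = nn.Linear(256 + 512, 256)   # fuse point features with the global feature
dmt_d_out = nn.Linear(256, 256)         # stand-in for DMT^d (u_s^d x u_r^d heads)
F_s       = nn.Linear(256, 128)
dmt_s_out = nn.Linear(128, 128)         # stand-in for DMT^s, which would also attend to (X_p1, Z_p1)
mlp_fuse  = nn.Linear(128 + 128, 128)
mlp_xyz   = nn.Linear(128, 3)

fused = F_d(torch.cat([Z_p2, Z_p3.unsqueeze(1).expand(-1, n2, -1)], dim=-1))   # (B, 128, 256)
# In the real DMT^d, the heads produce distinct embeddings; here they are copied for bookkeeping only.
Z_c2  = F_s(dmt_d_out(fused)).repeat_interleave(us_d * ur_d, dim=1)            # (B, 256, 128)
X_p2r = X_p2.repeat_interleave(us_d * ur_d, dim=1)           # X_p^2' paired point-wise with Z_c2
Z_c1  = dmt_s_out(Z_c2)                                      # (B, 256, 128)
Z_c   = mlp_fuse(torch.cat([Z_c1, Z_c2], dim=-1))            # (B, 256, 128)
Y_c   = mlp_xyz(Z_c)                                         # (B, 256, 3) coarse skeleton
print(Y_c.shape)  # torch.Size([2, 256, 3]) -> n_c = n2 * u_s^d * u_r^d
```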

3.3. Stage-Aware Refinement Group (SRG)

The generated coarse point clouds require further refinement and upsampling for better quality and higher resolution. Therefore, we introduce SRG to gradually upsample points based on their coordinates and features at different decoding stages. In order to realize this purpose, as illustrated in Figure 2, SRG adopts a DMT-based architecture with dense connections across different refinement stages.
In general, a typical SRG contains three upsamplers, $U_1$, $U_2$, and $U_3$, with upsampling rates $r_1$, $r_2$, and $r_3$, respectively. In addition to Figure 2, a summary of the input and output point clouds and the inner upsampling process (taking the first upsampler as an example) is given in Figure 5.
$U_1$: Given the coarse point cloud $Y_c = \{y_{c,i}\}_{i=1}^{n_c}$ with point-wise features $Z_c = \{z_{c,i}\}_{i=1}^{n_c}$, we merge $Y_c$ with the input point cloud $X_p$ and downsample it using Furthest Point Sampling (FPS). The features of points in $X_p$ are obtained by interpolation [15] within $Z_c$. In this way, we obtain the point cloud $Y_m^1 = \{y_{m,i}^1\}_{i=1}^{m_1}$ with features $Z_m^1 = \{z_{m,i}^1\}_{i=1}^{m_1}$. In addition, given that partial and complete point clouds usually have similar local semantics that can facilitate shape refinement (e.g., the seen leg of a partial chair might provide templates for the unseen leg), we encode the local patterns of both $Y_m^1$ and $X_p$ using one MLP and obtain features $Z_m^a = \{z_{m,i}^a\}_{i=1}^{m_1}$ and $Z_x^a = \{z_{x,i}^a\}_{i=1}^{n}$:
$$z_{m,i}^a = \mathrm{maxpool}\{\mathrm{MLP}(y_{c,i} - \mathcal{N}(y_{c,i}))\}, \qquad z_{x,i}^a = \mathrm{maxpool}\{\mathrm{MLP}(x_i - \mathcal{N}(x_i))\} \tag{5}$$
where $\mathcal{N}$ means collecting the neighbors of points using a ball query. In order to ensure that the semantics of similar local shapes in the partial point cloud $X_p$ and the coarse point cloud $Y_c$ are also similar, the MLPs for $Y_m^1$ and $X_p$ share weights.
After that, the features of $Y_m^1$ are further fused based on reference points from three sources: the partial input $X_p$, the coarse points $Y_c$, and $Y_m^1$ itself:
$$Z_{m,1}^1 = \mathrm{DMT}_{u,1}^1(Y_m^1, Z_m^a, X_p, Z_x^a; 1, 1)$$
$$Z_{m,2}^1 = \mathrm{DMT}_{u,2}^1(Y_m^1, Z_m^1, Y_c, Z_c; 1, 1)$$
$$Z_{m,3}^1 = \mathrm{DMT}_{u,3}^1(Y_m^1, Z_m^1, Y_m^1, Z_m^1; 1, 1) \tag{6}$$
Unlike the other DMTs, $\mathrm{DMT}_{u,1}^1$ tries to find local areas in the partial shape that have semantics similar to the points in $Y_m^1$, so it queries neighbors based on the L2 distance between point features instead of coordinates. The output features $Z_{m,1}^1$, $Z_{m,2}^1$, and $Z_{m,3}^1$ all contain $m_1$ vectors, and they are generated from point clouds and features at different stages with different shapes and resolutions. In this way, our network can comprehensively use such information by simply concatenating these features along the feature channel:
$$Z_t^1 = \mathrm{cat}(Z_{m,1}^1, Z_{m,2}^1, Z_{m,3}^1) \tag{7}$$
and upsample them using another DMT with shared MLP:
$$Z_u^1 = \mathrm{MLP} \circ \mathrm{DMT}_{u,4}^1(Y_m^1, Z_t^1, Y_m^1, Z_t^1; u_s^1, u_r^1) \tag{8}$$
In this way, we obtain the upsampled features $Z_u^1 = \{z_{u,i}\}_{i=1}^{u_1}$ and fuse them with the global feature $Z_p^3$ to generate the upsampled points $Y_u^1 = \{y_{u,i}\}_{i=1}^{u_1}$ using a shared MLP, where $u_1 = u_s^1 \times u_r^1 \times m_1$. The order and inputs of the MLP and DMT in Equation (8) are visualized in the first block of Figure 5.
$U_2$: Similar to $U_1$, we first fuse $Y_u^1$ with $X_p$ and downsample it to $Y_m^2$ (with features $Z_m^2$) at resolution $m_2$, and then we further fuse features based on three references, $Y_c$, $Y_m^1$, and $Y_m^2$ itself:
$$Z_{m,1}^2 = \mathrm{DMT}_{u,1}^2(Y_m^2, Z_m^2, Y_c, Z_c; 1, 1)$$
$$Z_{m,2}^2 = \mathrm{DMT}_{u,2}^2(Y_m^2, Z_m^2, Y_m^1, Z_m^1; 1, 1)$$
$$Z_{m,3}^2 = \mathrm{DMT}_{u,3}^2(Y_m^2, Z_m^2, Y_m^2, Z_m^2; 1, 1) \tag{9}$$
The new features $Z_{m,1}^2$, $Z_{m,2}^2$, and $Z_{m,3}^2$ are all derived from the point cloud and features at the current stage (i.e., $Y_m^2$ and $Z_m^2$), while their reference point clouds and features come from different generation stages (i.e., two previous stages and the current stage). In order to use them comprehensively, as in $U_1$, we concatenate the acquired features as $Z_t^2$:
$$Z_t^2 = \mathrm{cat}(Z_{m,1}^2, Z_{m,2}^2, Z_{m,3}^2) \tag{10}$$
We then use another DMT to upsample the resolution of $Z_t^2$ and use a shared MLP to predict the point cloud conditioned on the global feature $Z_p^3$, similar to Equation (8) in $U_1$:
$$Z_u^2 = \mathrm{MLP} \circ \mathrm{DMT}_{u,4}^2(Y_m^2, Z_t^2, Y_m^2, Z_t^2; u_s^2, u_r^2) \tag{11}$$
The upsampled points $Y_u^2 = \{y_{u,i}\}_{i=1}^{u_2}$ are generated from the features $Z_u^2 = \{z_{u,i}\}_{i=1}^{u_2}$ using a shared MLP, where $u_2 = u_s^2 \times u_r^2 \times m_2$.
$U_3$: Similar to the above upsamplers, we obtain the merged points $Y_m^3$ (with features $Z_m^3$) at resolution $m_3$; fuse $Z_m^3$ with $Y_m^1$, $Y_m^2$, and $Y_m^3$ as references (using $\mathrm{DMT}_{u,1}^3$, $\mathrm{DMT}_{u,2}^3$, and $\mathrm{DMT}_{u,3}^3$); use $\mathrm{DMT}_{u,4}^3$ to obtain the upsampled features $Z_u^3$ at resolution $u_3 = u_s^3 \times u_r^3 \times m_3$; and generate the final point cloud $Y = \{y_i\}_{i=1}^{u_3}$.
For all three upsamplers in SRG, the raw points generated from the upsampled features are actually displacements, which are added to the input points they are derived from. This operation is commonly applied in coarse-to-fine networks [3,8,10].
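A minimal sketch of this residual upsampling step, with linear layers standing in for the DMT and shared MLP and an assumed upsampling rate of 8 (as in $U_3$ of Section 4.1), might look as follows; each child point is the sum of its repeated parent point and a predicted displacement.

```python
import torch
import torch.nn as nn

B, m, d, rate = 2, 2048, 128, 8            # rate = u_s * u_r of the stage

Y_m = torch.rand(B, m, 3)                  # merged/downsampled points at this stage
Z_t = torch.rand(B, m, 3 * d)              # concatenated features from three reference stages (Eq. 7/10)

upsample_feat = nn.Linear(3 * d, rate * d) # stand-in for DMT_{u,4} with u_s * u_r = rate heads
decode_disp   = nn.Linear(d, 3)            # stand-in for the shared MLP predicting displacements

Z_u   = upsample_feat(Z_t).reshape(B, m * rate, d)      # (B, m*rate, d) upsampled features
delta = decode_disp(Z_u)                                # (B, m*rate, 3) raw outputs = displacements
Y_u   = Y_m.repeat_interleave(rate, dim=1) + delta      # child points offset from their parent points
print(Y_u.shape)  # torch.Size([2, 16384, 3])
```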

3.4. Losses

Following previous methods [3,8,10], we use the Chamfer Distance (CD) as both the training loss and the evaluation metric of our network; it measures the similarity between two point clouds by averaging the distances from each point to its nearest point in the other cloud:
$$\mathcal{L}(X, Y) = \frac{1}{|X|}\sum_{x \in X}\min_{y \in Y}\|x - y\|_2 + \frac{1}{|Y|}\sum_{y \in Y}\min_{x \in X}\|y - x\|_2 \tag{12}$$
The form of CD is borrowed from PCN [3] and VAPCNet [51], where $\|y - x\|_2$ denotes the L2 norm of $y - x$. The loss $\mathcal{L}$ of our network is the sum of the CDs between the generated outputs and the ground truth G at different stages:
$$\mathcal{L} = \mathcal{L}(Y_c, G) + \mathcal{L}(Y_u^1, G) + \mathcal{L}(Y_u^2, G) + \mathcal{L}(Y, G) \tag{13}$$
In the first three terms of Equation (13), G is downsampled by FPS to match the resolutions of $Y_c$, $Y_u^1$, and $Y_u^2$, respectively.
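For reference, a straightforward (non-optimized) implementation of the Chamfer Distance in Equation (12) could look like the sketch below; the `squared` flag switches between the CD-$\ell_1$ and CD-$\ell_2$ variants reported in Section 4, and dedicated CUDA kernels are normally used for 16,384-point clouds in practice.

```python
import torch

def chamfer_distance(X, Y, squared=False):
    """Chamfer Distance of Equation (12). X: (B, n, 3), Y: (B, m, 3).
    squared=True gives the CD-l2 variant (MVP); False gives CD-l1 (PCN).
    Naive O(n*m) version based on the full pairwise distance matrix."""
    d = torch.cdist(X, Y)                  # (B, n, m) pairwise Euclidean distances
    if squared:
        d = d ** 2
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)  # (B,)

out, gt = torch.rand(2, 2048, 3), torch.rand(2, 2048, 3)
print(chamfer_distance(out, gt, squared=True).mean())   # CD-l2 averaged over the batch
```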

4. Experiments

In this section, we will illustrate the performances of our method on several datasets. We first specify the details and hyperparameters of the network in Section 4.1, then we implement tests on synthetic and real-world objects in Section 4.2 and Section 4.3, respectively. Finally, we discuss the effect of different modules in the network with ablation studies in Section 4.4.

4.1. Datasets and Network Details

Datasets. We train and evaluate our network on two synthetic datasets created from ShapeNet [52]: PCN [3] and MVP [6]. The PCN dataset was introduced by the pioneering approach [3] and contains eight categories of objects; it generates partial point clouds by virtually rendering each mesh from eight (training set) or one (test set) view angles, and complete point clouds are uniformly sampled from the meshes. MVP is another dataset that contains 16 categories of objects and 26 view angles for both the training and test sets. Each complete point cloud in these two datasets contains 16,384 points.
Besides synthetic datasets, we also evaluated methods on the chairs and tables of the real-world datasets Matterport3D [18] and ScanNet [2]. Because the objects in these two datasets are usually reconstructed from multiple view angles, the raw partial point clouds are quite dense; therefore, their resolutions are downsampled to 2048 points using FPS before entering the Encoder of the network, and, for a fair comparison, we apply the same settings to existing methods. In addition, we test network performance on the sparse outdoor cars from KITTI [1], which are preprocessed by [3].
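For clarity, a simple (unbatched, O(Nm)) reference implementation of the FPS downsampling used here might look as follows; practical pipelines use batched CUDA implementations instead.

```python
import torch

def farthest_point_sampling(points, m):
    """Downsample a point cloud (N, 3) to m points with Furthest Point Sampling,
    as used to reduce the dense Matterport3D/ScanNet scans to 2048 points."""
    N = points.shape[0]
    selected = torch.zeros(m, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    farthest = torch.randint(0, N, (1,)).item()       # arbitrary starting point
    for i in range(m):
        selected[i] = farthest
        d = ((points - points[farthest]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)                 # distance to the nearest selected point
        farthest = int(dist.argmax())                 # pick the point farthest from the selected set
    return points[selected]

partial = torch.rand(50000, 3)                        # e.g., a dense real-world scan
inputs = farthest_point_sampling(partial, 2048)       # network input resolution
```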
Network details. The details of each module are listed as follows: For DMT in Section 3.1, functions H, K, and V are each a single layer that takes d-dimensional features as input and outputs 64-dimensional embeddings. The positional and attention embedding functions P and A are both MLPs, with output layers {64, 64} and {256, 64}, respectively. Finally, another single layer maps the embedding dimension from 64 back to d. The weights of these functions can differ across semantic and resolution heads.
For the encoder in Section 3.2, the first SA encodes the input $X_p$ to $X_p^1$ with $n_1 = 512$ points and 128-dimensional point features $Z_p^1 \in \mathbb{R}^{n_1 \times 128}$ using an MLP with layers {64, 128}; then, vector attention aggregates local features for each point from its 20 nearest neighbors. The second SA encodes $X_p^1$ to $X_p^2$ with $n_2 = 128$ points and 256-dimensional features $Z_p^2 \in \mathbb{R}^{n_2 \times 256}$ using layers {128, 256}; then, vector attention aggregates features from 16 neighbors. Another MLP encodes $Z_p^2$ to the 512-dimensional global feature $Z_p^3$ using layers {512, 512}. For the Coarse Decoder, the single-layer network $F_d$ transfers the $n_2$ point feature vectors to 256 dimensions; based on them, $\mathrm{DMT}^d$ generates embeddings $Z_c^2 \in \mathbb{R}^{2n_2 \times 256}$ with semantic heads $u_s^d = 2$ and resolution heads $u_r^d = 1$, and $F_s$ then decreases the dimension to 128. After $Z_c^1 \in \mathbb{R}^{2n_2 \times 128}$ is obtained from $\mathrm{DMT}^s$ and concatenated with $Z_c^2$, three cascaded MLP blocks with output layers {512, 256}, {128, 128}, {256, 128} and residual links generate the feature $Z_c \in \mathbb{R}^{2n_2 \times 128}$, and $Z_c$ is then decoded to the point cloud $Y_c$ using another MLP with layers {64, 3}.
For SRG in Section 3.3, given the coarse point cloud $Y_c$ with 256 points, we merge it with the input partial point cloud and downsample it to 512 points (with corresponding features), so the upsampling rates for the 16,384 output points are set to 2, 2, and 8. Correspondingly, the heads in the three upsamplers $\mathrm{DMT}_{u,4}^1$, $\mathrm{DMT}_{u,4}^2$, and $\mathrm{DMT}_{u,4}^3$ are $\{u_s^1, u_r^1\} = \{1, 2\}$, $\{u_s^2, u_r^2\} = \{1, 2\}$, and $\{u_s^3, u_r^3\} = \{2, 4\}$, respectively. Inside SRG, after merging the upsampled point clouds (i.e., $Y_u^1$ and $Y_u^2$) with the partial input $X_p$, we downsample them to the same resolutions as the generated ones. Points are predicted using the same MLP structures as in the Coarse Decoder above, and the ball query radius used in Equation (5) is 0.1 for the MVP dataset and 0.05 for the PCN dataset.
Our network is trained on one Nvidia A6000 GPU using PyTorch 1.8 with a batch size of 24, and Adam [53] is used as the optimizer. For the PCN dataset, the initial learning rate is 0.001 with 10 warm-up steps, and it decays exponentially by a factor of 0.1 every 150 epochs. Following [5,10], each epoch trains on only one of the eight partial scans of each object, so we train our network for 400 epochs. For the MVP dataset, the initial learning rate is 0.0005, it decays exponentially by a factor of 0.1 every 50 epochs, and we train the network for 40 epochs.
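As an illustration of the schedule on the PCN dataset, the following runnable PyTorch sketch combines Adam with a linear warm-up and a step-wise exponential decay; the interpretation of "10 warm-up steps" as 10 warm-up epochs and the dummy parameters are assumptions made only to keep the example self-contained.

```python
import torch

dummy = torch.nn.Linear(3, 3)                              # placeholder parameters
optimizer = torch.optim.Adam(dummy.parameters(), lr=1e-3)  # initial lr 0.001

def lr_lambda(epoch, warmup=10, decay_every=150, gamma=0.1):
    if epoch < warmup:
        return (epoch + 1) / warmup                        # linear warm-up
    return gamma ** ((epoch - warmup) // decay_every)      # decay by 0.1 every 150 epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for epoch in range(400):
    # ... one pass over the training set (one of the eight partial scans per object) ...
    scheduler.step()
print(scheduler.get_last_lr())                             # learning rate after 400 epochs
```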

4.2. Evaluation on Synthetic Datasets

We first evaluate our network on the MVP and PCN datasets and compare its performance with several representative existing approaches. Following the mainstream settings, we report CD-$\ell_2$ (computed on squared point distances) and the F1 score [54] on the MVP benchmark in Table 1 and Table 2, as well as these two metrics at different output resolutions in Table 3. For the PCN benchmark, we report the completion quality using CD-$\ell_1$ (computed on unsquared point distances) in Table 4. Our method achieves competitive results on both datasets and reaches the best performance in most cases.
Specifically, for quantitative evaluation, our method achieves the best average CD and F1 on these datasets at the full point cloud resolution (16,384 points). The overall best CD means that the point clouds completed by our method are the closest to the ground truths. The overall best F1 means that our outputs effectively balance shape completeness and fidelity; in other words, our method is less likely to miss required parts of objects or excessively generate unnecessary parts. For the category-level comparison, our method also reaches the best results in the most categories on both datasets. Besides the full point cloud resolution, Table 3 illustrates that our method reaches the best results in terms of both CD and F1 when generating 8192 points and in terms of one metric when generating 2048 and 4096 points. The advantage of our method is therefore more obvious when the completed point clouds are denser.
For qualitative evaluation, several examples of completion results on the PCN dataset are visualized in Figure 6, and we have two main findings from these examples. First, our method can reasonably infer the missing parts of objects based on their partial observations. For example, for an airplane, our method clearly restores the unobserved fuel tank, which is similar to the observed one on the other side, while other methods either fail to generate the fuel tank or introduce noise. In the first lamp, only our method restores the unobserved stairs with shapes and resolutions comparable to the observed ones. Second, our method generates smoother shapes with less noise. For example, the points of our car are more uniformly distributed; the vertical beams in our chair back are more complete and distinguishable; our second lamp, sofa and tables contain less noise and fewer unnecessary parts; and on the top and front sides of our submarine, the boundaries between the observed and unobserved parts are merged more smoothly.
Besides reporting average results, Figure 7 tries to describe the variances and estimate the confidence intervals. Specifically, we collect all the completion results in terms of CD on the PCN dataset and plot their variances with respect to category-level means using histograms. The bars on the left side of zero record the numbers of outputs with CDs that are less than their means. In addition, we use different colors to represent different percentiles. Green bars represent the number of completion results whose percentiles in CDs are less than 50%, and gray bars mean those between 50% and 80%. We can define the confidence intervals around [−0.004, 0.002] since most CDs are within this range.

4.3. Evaluation on Real-World Datasets

Our method was also tested on chairs and tables from Matterport3D [18] and ScanNet [2] and compared with existing methods [5,10,12]. We use the network weights trained on the PCN dataset in this evaluation and directly take those of the baseline methods [5,10,12] from their released pre-trained models. Because there are no ground truths for these partial objects, we follow previous approaches and use Fidelity and the Minimal Matching Distance (MMD) for quantitative assessment. Fidelity applies the Unidirectional Chamfer Distance (UCD) to calculate the average distance from input points to their nearest points in the output, while MMD assesses the quality of each output point cloud by calculating the lowest CD between it and the instances in the ShapeNet test set. The quantitative results are reported in Table 5. In fact, it is usually not easy to achieve optimal performance in terms of Fidelity and MMD at the same time: real-world partial shapes might contain noise or introduce errors when normalizing scales and poses, so high fidelity to partial inputs may reproduce such errors and challenge the quality of the outputs assessed by MMD. However, our method still achieves the best performance in terms of both Fidelity and MMD on ScanNet and the best Fidelity on Matterport3D, which indicates that it is suitable for real-world completion.
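A compact sketch of these two metrics is given below; the Chamfer Distance helper, tensor shapes, and the random reference set are illustrative only, whereas the actual evaluation uses category-wise ShapeNet test shapes as references for MMD.

```python
import torch

def chamfer(X, Y):
    d = torch.cdist(X, Y)
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)

def fidelity(partial, output):
    # Unidirectional Chamfer Distance: average distance from input points to nearest output points.
    return torch.cdist(partial, output).min(dim=2).values.mean(dim=1)

def mmd(output, reference_set):
    # Lowest CD between the output and any reference shape (e.g., ShapeNet test instances).
    return min(chamfer(output, ref.unsqueeze(0)).item() for ref in reference_set)

out = torch.rand(1, 16384, 3)
refs = [torch.rand(16384, 3) for _ in range(5)]
print(fidelity(torch.rand(1, 2048, 3), out), mmd(out, refs))
```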
The qualitative results are visualized in Figure 8; we can find that the two findings summarized in Section 4.2 are also valid here. Specifically, our method is able to reasonably recover the unobserved object parts. For example, our method better completes the missing right side of the second table and generates a right arm that is comparable to the left arm in the first chair. In addition, our outputs usually include less noise or unnecessary parts, such as the bottoms of the first table and two chairs.
For the KITTI dataset, given that the cars scanned by LiDAR are usually sparse, we substituted the kNN query of the SA modules in our encoder with a ball query, with radii of 0.05 and 0.1 in the two SAs, respectively, while the other network parameters, including the weights of neurons (pretrained on the PCN dataset), were kept unchanged. Then, we fine-tuned and augmented the network on the cars of ShapeNet with the new query method. We also used Fidelity and MMD to evaluate our method and existing ones. The quantitative and qualitative results are illustrated in Table 6 and Figure 9, respectively. According to the quantitative results in Table 6, our method achieved the best performance in terms of MMD, which means our generated cars are realistic and close to the shape distributions of the dataset. On the other hand, some methods, such as [7,11], achieve 0 on Fidelity, mainly because they only generate the missing parts of objects, while our method estimates the whole shapes and does not directly take the input points as parts of the outputs. Under this circumstance, our method still achieves competitive performance in terms of Fidelity, which is close to theirs. The qualitative results visualized in Figure 9 also show that our method can restore cars with higher quality, even when the input partial scans are sparse. Specifically, we can easily distinguish each component of our cars, while the other cars seem ambiguous.
In addition, a possible pipeline for dealing with over-sparse inputs is to first use ECD to 'pre-complete' the partial scans and then take the estimated coarse points as inputs and use the whole network to complete them. We find that this pipeline works well on partial inputs with high sparsity but may degrade the performance on others with normal sparsity. Therefore, a future improvement could be to automatically determine the sparsity of partial scans and use different pipelines for completion.

4.4. Ablation Studies

In order to experimentally prove the effects of the designs in our method, we retrained and tested ablated versions of it on the MVP dataset. The results in terms of CD and F1 are reported in Table 7. For ECD, we removed the fusion of shallow features with $Z_c^2$ and only generated shapes from the deep features (i.e., V1 in Table 7). For SRG, we removed the connections with previous stages in each of the three upsamplers in turn and kept the other two upsamplers unchanged. Specifically, V2, V3, and V4 in Table 7 represent the network with only $\mathrm{DMT}_{u,3}^1$, $\mathrm{DMT}_{u,3}^2$, and $\mathrm{DMT}_{u,3}^3$ in $U_1$, $U_2$, and $U_3$, respectively (i.e., the colored links in Figure 2 that skip the last module are removed). Our pipeline with all these components achieves the best CD. Although version V4 achieves a slightly higher F1, the discussion below shows that it is quite unstable.
In addition, as discussed in Section 1, applying dense connections to the decoding and upsampling process may alleviate potential error accumulation compared with the cascaded pipeline. In order to experimentally verify this assumption, we take the second and third upsampling modules (i.e., $U_2$ and $U_3$ in Section 3.3) as examples and add Gaussian noise to the points they need to upsample. We change the noise level by adjusting the standard deviation from 0.0025 to 0.03 and then record the performance of the networks with and without dense connections (i.e., Ours and V3/V4 in Table 7). As illustrated in Figure 10, dense connections effectively alleviate network degradation in terms of both CD and F1, and this phenomenon is more obvious when the noise level is higher.
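The perturbation itself is simple; the following sketch illustrates the assumed procedure of adding zero-mean Gaussian noise with the stated standard deviations to the points entering an upsampler before re-evaluating CD and F1.

```python
import torch

def perturb(points, sigma):
    # Add zero-mean Gaussian noise with standard deviation sigma to every coordinate.
    return points + sigma * torch.randn_like(points)

for sigma in [0.0025, 0.005, 0.01, 0.02, 0.03]:
    noisy = perturb(torch.rand(1, 2048, 3), sigma)
    # ... feed `noisy` into U2 / U3 and record CD and F1 for each noise level ...
```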

5. Conclusions

In this paper, we propose a new method for point cloud completion that mainly consists of three designs: DMT, ECD and SRG. DMT is designed to estimate new points in both coarse shape generation and dense shape refinement, so it decouples semantic prediction and resolution upsampling in a unified Transformer-based architecture. ECD aims to estimate coarse shapes based on both the semantics and the local patterns of partial inputs, so it compactly parallels the top–down decoding network with the bottom–up feature-encoding network. SRG refines points while monitoring previous upsampling stages, so it replaces the conventional cascaded modules with a densely connected pipeline to make full use of features at different scales in different stages. Experiments on both synthetic and real-world datasets illustrate the competitive performance of our method compared with existing approaches. We therefore hope that our proposed method can contribute to future high-level 3D perception tasks such as scene reconstruction and robot navigation.

Author Contributions

Conceptualization, H.W.; methodology, H.W.; software, H.W.; validation, H.W. and Y.M.; writing—original draft preparation, H.W.; writing—review and editing, H.W. and Y.M.; visualization, H.W.; supervision, Y.M.; funding acquisition, Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China grant number 51975361.

Data Availability Statement

The data presented in this study are openly available in [3,6].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  2. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  3. Yuan, W.; Khot, T.; Held, D.; Mertz, C.; Hebert, M. PCN: Point Completion Network. In Proceedings of the International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018. [Google Scholar]
  4. Wang, X.; Ang, M.H.; Lee, G.H. Cascaded Refinement Network for Point Cloud Completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  5. Xie, H.; Yao, H.; Zhou, S.; Mao, J.; Zhang, S.; Sun, W. GRNet: Gridding Residual Network for Dense Point Cloud Completion. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020. [Google Scholar]
  6. Pan, L.; Chen, X.; Cai, Z.; Zhang, J.; Zhao, H.; Yi, S.; Liu, Z. Variational Relational Point Completion Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 20–25 June 2021. [Google Scholar]
  7. Yu, X.; Rao, Y.; Wang, Z.; Liu, Z.; Lu, J.; Zhou, J. PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 10–17 October 2021. [Google Scholar]
  8. Xiang, P.; Wen, X.; Liu, Y.S.; Cao, Y.P.; Wan, P.; Zheng, W.; Han, Z. SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 10–17 October 2021. [Google Scholar]
  9. Tang, J.; Gong, Z.; Yi, R.; Xie, Y.; Ma, L. LAKe-Net: Topology-Aware Point Cloud Completion by Localizing Aligned Keypoints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  10. Zhou, H.; Cao, Y.; Chu, W.; Zhu, J.; Lu, T.; Tai, Y.; Wang, C. SeedFormer: Patch Seeds Based Point Cloud Completion with Upsample Transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel-Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 416–432. [Google Scholar]
  11. Li, S.; Gao, P.; Tan, X.; Wei, M. ProxyFormer: Proxy Alignment Assisted Point Cloud Completion with Missing Part Sensitive Transformer. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9466–9475. [Google Scholar] [CrossRef]
  12. Zhu, Z.; Chen, H.; He, X.; Wang, W.; Qin, J.; Wei, M. SVDFormer: Complementing Point Cloud via Self-view Augmentation and Self-structure Dual-generator. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14462–14472. [Google Scholar] [CrossRef]
  13. Wen, X.; Han, Z.; Cao, Y.P.; Wan, P.; Zheng, W.; Liu, Y.S. Cycle4Completion: Unpaired Point Cloud Completion using Cycle Transformation with Missing Region Coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 20–25 June 2021. [Google Scholar]
  14. Zhang, J.; Chen, X.; Cai, Z.; Pan, L.; Zhao, H.; Yi, S.; Yeo, C.K.; Dai, B.; Loy, C.C. Unsupervised 3D Shape Completion through GAN Inversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 20–25 June 2021. [Google Scholar]
  15. Charles, R.Q.; Li, Y.; Hao, S.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  16. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 10–17 October 2021. [Google Scholar]
  17. Yan, X.; Yan, H.; Wang, J.; Du, H.; Wu, Z.; Xie, D.; Pu, S.; Lu, L. FBNet: Feedback Network for Point Cloud Completion. In Proceedings of the European Conference on Computer Vision (ECCV), Tel-Aviv, Israel, 23–27 October 2022. [Google Scholar]
  18. Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niebner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y. Matterport3D: Learning from RGB-D Data in Indoor Environments. In Proceedings of the International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017. [Google Scholar]
  19. Choy, C.B.; Xu, D.; Gwak, J.; Chen, K.; Savarese, S. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 628–644. [Google Scholar]
  20. Achlioptas, P.; Diamanti, O.; Mitliagkas, I.; Guibas, L. Learning Representations and Generative Models for 3D Point Clouds. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  21. Zhang, B.; Nießner, M.; Wonka, P. 3DILG: Irregular Latent Grids for 3D Generative Modeling. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  22. Yang, Y.; Feng, C.; Shen, Y.; Tian, D. FoldingNet: Point Cloud Auto-Encoder via Deep Grid Deformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  23. Li, R.; Li, X.; Hui, K.H.; Fu, C.W. SP-GAN: Sphere-Guided 3D Shape Generation and Manipulation. ACM Trans. Graph. 2021, 40, 151. [Google Scholar] [CrossRef]
  24. Tang, Y.; Qian, Y.; Zhang, Q.; Zeng, Y.; Hou, J.; Zhe, X. WarpingGAN: Warping Multiple Uniform Priors for Adversarial 3D Point Cloud Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  25. Yang, G.; Huang, X.; Hao, Z.; Liu, M.Y.; Belongie, S.; Hariharan, B. PointFlow: 3D Point Cloud Generation With Continuous Normalizing Flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  26. Postels, J.; Liu, M.; Spezialetti, R.; Van Gool, L.; Tombari, F. Go with the Flows: Mixtures of Normalizing Flows for Point Cloud Generation and Reconstruction. In Proceedings of the International Conference on 3D Vision (3DV), Virtual, 1–3 December 2021. [Google Scholar]
  27. Luo, S.; Hu, W. Diffusion Probabilistic Models for 3D Point Cloud Generation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 20–25 June 2021; pp. 2836–2844. [Google Scholar]
  28. Zhou, L.; Du, Y.; Wu, J. 3D Shape Generation and Completion through Point-Voxel Diffusion. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 10–17 October 2021; pp. 5806–5815. [Google Scholar]
  29. Nakayama, G.K.; Angelina Uy, M.; Huang, J.; Hu, S.M.; Li, K.; Guibas, L. DiffFacto: Controllable Part-Based 3D Point Cloud Generation with Cross Diffusion. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14211–14221. [Google Scholar]
  30. Dai, A.; Qi, C.R.; Nießner, M. Shape Completion Using 3D-Encoder-Predictor CNNs and Shape Synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Hawaii, HI, USA, 21–26 July 2017. [Google Scholar]
  31. Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; Geiger, A. Occupancy Networks: Learning 3D Reconstruction in Function Space. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4455–4465. [Google Scholar]
  32. Chibane, J.; Alldieck, T.; Pons-Moll, G. Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6968–6979. [Google Scholar]
  33. Zhang, B.; Tang, J.; Nießner, M.; Wonka, P. 3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models. ACM Trans. Graph. 2023, 42, 92. [Google Scholar] [CrossRef]
  34. Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Hawaii, HI, USA, 21–26 July 2017. [Google Scholar]
  35. Huang, Z.; Yu, Y.; Xu, J.; Ni, F.; Le, X. PF-Net: Point Fractal Network for 3D Point Cloud Completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  36. Chen, X.; Chen, B.; Mitra, N.J. Unpaired Point Cloud Completion on Real Scans using Adversarial Training. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 26 April–1 May 2020. [Google Scholar]
  37. Cui, R.; Qiu, S.; Anwar, S.; Zhang, J.; Barnes, N. Energy-Based Residual Latent Transport for Unsupervised Point Cloud Completion. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 21–24 November 2022. [Google Scholar]
  38. Cao, Z.; Zhang, W.; Wen, X.; Dong, Z.; Liu, Y.S.; Xiao, X.; Yang, B. KTNet: Knowledge Transfer for Unpaired 3D Shape Completion. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 January 2023. [Google Scholar]
  39. Ma, C.; Chen, Y.; Guo, P.; Guo, J.; Wang, C.; Guo, Y. Symmetric Shape-Preserving Autoencoder for Unsupervised Real Scene Point Cloud Completion. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  40. Song, S.; Yu, F.; Zeng, A.; Chang, A.X.; Savva, M.; Funkhouser, T. Semantic Scene Completion from a Single Depth Image. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Hawaii, HI, USA, 21–26 July 2017; pp. 190–198. [Google Scholar]
  41. Yang, X.; Zou, H.; Kong, X.; Huang, T.; Liu, Y.; Li, W.; Wen, F.; Zhang, H. Semantic Segmentation-assisted Scene Completion for LiDAR Point Clouds. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021. [Google Scholar]
  42. Chen, X.; Lin, K.Y.; Qian, C.; Zeng, G.; Li, H. 3D Sketch-Aware Semantic Scene Completion via Semi-Supervised Structure Prior. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4192–4201. [Google Scholar]
  43. Tang, J.; Chen, X.; Wang, J.; Zeng, G. Not All Voxels Are Equal: Semantic Scene Completion from the Point-Voxel Perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2020; pp. 4192–4201. [Google Scholar]
  44. Rist, C.B.; Emmerichs, D.; Enzweiler, M.; Gavrila, D.M. Semantic Scene Completion Using Local Deep Implicit Functions on LiDAR Data. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7205–7218. [Google Scholar] [CrossRef] [PubMed]
  45. Wang, F.; Zhang, D.; Zhang, H.; Tang, J.; Sun, Q. Semantic Scene Completion with Cleaner Self. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 867–877. [Google Scholar]
  46. Zhang, Y.; Zhu, Z.; Du, D. OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 9399–9409. [Google Scholar]
  47. Xia, Z.; Liu, Y.; Li, X.; Zhu, X.; Ma, Y.; Li, Y.; Hou, Y.; Qiao, Y. SCPNet: Semantic Scene Completion on Point Cloud. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 17642–17651. [Google Scholar]
  48. Nie, Y.; Hou, J.; Han, X.; Nießner, M. RfD-Net: Point Scene Understanding by Semantic Instance Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 20–25 June 2021. [Google Scholar]
  49. Tang, J.; Chen, X.; Wang, J.; Zeng, G. Point Scene Understanding via Disentangled Instance Mesh Reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), Tel-Aviv, Israel, 23–27 October 2022. [Google Scholar]
  50. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  51. Fu, Z.; Wang, L.; Xu, L.; Wang, Z.; Laga, H.; Guo, Y.; Boussaid, F.; Bennamoun, M. VAPCNet: Viewpoint-Aware 3D Point Cloud Completion. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
  52. Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015, arXiv:1512.03012. [Google Scholar]
  53. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980v9. [Google Scholar]
  54. Tatarchenko, M.; Richter, S.R.; Ranftl, R.; Li, Z.; Koltun, V.; Brox, T. What Do Single-View 3D Reconstruction Networks Learn? In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
  55. Tchapmi, L.P.; Kosaraju, V.; Rezatofighi, H.; Reid, I.; Savarese, S. TopNet: Structural Point Cloud Decoder. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  56. Liu, M.; Sheng, L.; Yang, S.; Shao, J.; Hu, S.M. Morphing and Sampling Network for Dense Point Cloud Completion. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020. [Google Scholar]
  57. Pan, L. ECG: Edge-aware Point Cloud Completion with Graph Convolution. IEEE Robot. Autom. Lett. 2020, 5, 4392–4398. [Google Scholar] [CrossRef]
  58. Zhang, W.; Yan, Q.; Xiao, C. Detail Preserved Point Cloud Completion via Separated Feature Aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020. [Google Scholar]
  59. Xu, M.; Wang, Y.; Liu, Y.; He, T.; Qiao, Y. CP3: Unifying Point Cloud Completion by Pretrain-Prompt-Predict Paradigm. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9583–9594. [Google Scholar] [CrossRef] [PubMed]
  60. Lyu, Z.; Kong, Z.; Xu, X.; Pan, L.; Lin, D. A Conditional Point Diffusion-Refinement Paradigm for 3D Point Cloud Completion. arXiv 2021, arXiv:2112.03530. [Google Scholar]
  61. Nie, Y.; Lin, Y.; Han, X.; Guo, S.; Chang, J.; Cui, S.; Zhang, J.J. Skeleton-Bridged Point Completion: From Global Inference to Local Adjustment. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020. [Google Scholar]
  62. Wen, X.; Xiang, P.; Han, Z.; Cao, Y.P.; Wan, P.; Zheng, W.; Liu, Y.S. PMP-Net: Point Cloud Completion by Learning Multi-step Point Moving Paths. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 20–25 June 2021. [Google Scholar]
  63. Wen, X.; Xiang, P.; Han, Z.; Cao, Y.P.; Wan, P.; Zheng, W.; Liu, Y.S. PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-Step Point Moving Paths. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 852–867. [Google Scholar] [CrossRef]
  64. Zhang, M.; Li, Y.; Chen, R.; Pan, Y.; Wang, J.; Wang, Y.; Xiang, R. WalkFormer: Point Cloud Completion via Guided Walks. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2024. [Google Scholar]
Figure 1. An example of the overall shape-completion process with dense stage connections. It consists of two main processes: the ECD, which fuses multi-scale features in the first two blocks and generates shapes in the third block, and the SRG, which upsamples and refines shapes based on the shapes from its own stage and the two previous stages. In shape skeleton decoding (i.e., the fourth chair), the object parts (e.g., the three red points in the third blue block) are generated based on their related areas/points from both downsampling stages. In the coarse-to-fine upsampling process, taking the third upsampler as an example, it generates new points (e.g., the red and blue points in the fourth green block) based on their related areas in several previous stages with different scales.
Figure 2. The whole coarse-to-fine completion pipeline. Y with sub- or superscripts denotes the intermediate output point clouds, the DMT modules fuse features from different stages, and the upsamplers U_1–U_3 increase the resolution of the fused features and generate refined points. The ECD in the grey box contains the Set Abstraction and Coarse Decoder modules, which form an AutoEncoder that generates coarse shape skeletons from partial inputs. After that, a group of upsamplers compactly fuse the features of different generation stages and gradually refine the complete point clouds to high resolution. Connections applied in the same upsampling stage share the same color.
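For readers who prefer code to block diagrams, the following is a minimal, runnable PyTorch sketch of the data flow in Figure 2. Every module is a toy stand-in (naive strided subsampling instead of farthest point sampling, point duplication plus predicted offsets instead of the DMT-based upsamplers), and the strides and upsampling ratios are assumptions chosen only so that the tensor shapes connect a 2048-point input to a 16,384-point output; it is not the paper's implementation.

```python
import torch
import torch.nn as nn

class ToyEncoderStage(nn.Module):
    """Stand-in for one Set Abstraction stage: subsample points, lift features."""
    def __init__(self, in_dim, out_dim, stride):
        super().__init__()
        self.stride = stride
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, xyz, feat):
        idx = torch.arange(0, xyz.shape[1], self.stride)  # naive subsampling (real code would use FPS)
        return xyz[:, idx], self.mlp(feat[:, idx])

class ToyUpsampler(nn.Module):
    """Stand-in for one SRG stage (DMT + MLP): duplicate points, predict offsets."""
    def __init__(self, dim, ratio):
        super().__init__()
        self.ratio = ratio
        self.offset = nn.Sequential(nn.Linear(dim + 3, dim), nn.ReLU(), nn.Linear(dim, 3))
        self.feat_out = nn.Linear(dim + 3, dim)

    def forward(self, xyz, feat):
        xyz_r = xyz.repeat_interleave(self.ratio, dim=1)    # (B, N*ratio, 3)
        feat_r = feat.repeat_interleave(self.ratio, dim=1)
        h = torch.cat([xyz_r, feat_r], dim=-1)
        return xyz_r + self.offset(h), self.feat_out(h)     # refined points and features

B, N, D = 2, 2048, 64
partial = torch.randn(B, N, 3)
enc1, enc2 = ToyEncoderStage(3, D, stride=4), ToyEncoderStage(D, D, stride=4)
coarse_head = nn.Linear(D, 3)                               # stand-in for the coarse decoder of the ECD
u1, u2, u3 = ToyUpsampler(D, 4), ToyUpsampler(D, 8), ToyUpsampler(D, 4)

x1, z1 = enc1(partial, partial)       # shallow encoding stage: 512 points
x2, z2 = enc2(x1, z1)                 # deep encoding stage: 128 points
coarse = x2 + coarse_head(z2)         # coarse skeleton (the real ECD also reuses x1, z1)
y1, f1 = u1(coarse, z2)               # 512 points
y2, f2 = u2(y1, f1)                   # 4096 points
y3, _ = u3(y2, f2)                    # 16,384 points
print(coarse.shape, y1.shape, y2.shape, y3.shape)
```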
Figure 3. Network structure of the DMT (with two semantic and two resolution heads as an example). Given points and features as inputs, the semantic heads transfer them into different embedding spaces (i.e., different background colors) using different groups of functions H, P, K, and V, where functions with the same superscripts belong to the same group. Inside each embedding space, the resolution head uses another group of attention weights A to upsample the feature resolution (i.e., blocks with the same color share weights).
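The caption above can be read as code. Below is a hedged PyTorch sketch of a block with decoupled semantic and resolution heads: each semantic head owns its key/value/positional projections (its own embedding space, with the H embedding folded into those projections for brevity), and each resolution head attends in every semantic space to emit one extra feature vector per input point, so R resolution heads upsample N features to N×R. The layer names and the exact attention form are illustrative assumptions, not the released DMT code.

```python
import torch
import torch.nn as nn

class DecoupledMultiHeadBlock(nn.Module):
    def __init__(self, dim, n_semantic=2, n_resolution=2):
        super().__init__()
        self.S, self.R, self.dim = n_semantic, n_resolution, dim
        # One key/value/positional projection per semantic head (separate embedding spaces).
        self.k = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_semantic)])
        self.v = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_semantic)])
        self.pos = nn.ModuleList([nn.Linear(3, dim) for _ in range(n_semantic)])
        # One query projection per resolution head, reused in every semantic space.
        self.q = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_resolution)])
        self.out = nn.Linear(n_semantic * dim, dim)

    def forward(self, xyz, feat):
        # xyz: (B, N, 3) coordinates, feat: (B, N, dim) point features.
        B, N, _ = feat.shape
        per_res = []
        for r in range(self.R):                      # each resolution head ...
            q_r = self.q[r](feat)
            per_sem = []
            for s in range(self.S):                  # ... attends in every semantic space
                k = self.k[s](feat) + self.pos[s](xyz)
                v = self.v[s](feat)
                attn = torch.softmax(q_r @ k.transpose(1, 2) / self.dim ** 0.5, dim=-1)
                per_sem.append(attn @ v)             # (B, N, dim)
            per_res.append(self.out(torch.cat(per_sem, dim=-1)))
        # Interleave the R outputs of every point -> (B, N * R, dim): feature upsampling.
        return torch.stack(per_res, dim=2).reshape(B, N * self.R, self.dim)

# Example: two semantic heads; two resolution heads double the number of feature vectors.
block = DecoupledMultiHeadBlock(dim=64)
xyz, feat = torch.randn(2, 128, 3), torch.randn(2, 128, 64)
print(block(xyz, feat).shape)   # torch.Size([2, 256, 64])
```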
Figure 4. Shape skeleton generation. It combines a two-stage feature-decoding process with the two Set Abstraction encoding modules. Specifically, DMT_l first maps the points X_p2 and features Z_p2 at the deep encoding stage to Z_c2, and DMT_s upsamples the resolution of Z_c2 based on the points X_p1 and features Z_p1 at the shallow stage. In this way, we can make full use of features that represent different local scales in the partial inputs.
Figure 5. Summary of the input and output point clouds of the three upsamplers and the inner resolution-upsampling process (taking U_1 as an example). In U_1, a DMT fuses and upsamples the multi-stage features, and an MLP takes the refined features as inputs and generates new points conditioned on the global feature Z_p3. U_2 and U_3 use the same ordering of DMT and MLP as U_1.
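A hedged sketch of the point-generation step described above: features already fused and upsampled by a DMT are concatenated with a tiled global code and passed through an MLP that predicts coordinates, written here as offsets from parent points. The variable names and the offset formulation are illustrative assumptions rather than the exact operations of U_1.

```python
import torch
import torch.nn as nn

dim, B, M = 64, 2, 512
fused = torch.randn(B, M, dim)          # features fused/upsampled by a DMT
z_global = torch.randn(B, dim)          # global partial-input feature (Z_p3 in the figure)
parent_xyz = torch.randn(B, M, 3)       # coordinates the new points are grown from

mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 3))
cond = torch.cat([fused, z_global.unsqueeze(1).expand(-1, M, -1)], dim=-1)
new_xyz = parent_xyz + mlp(cond)        # refined, higher-resolution points
print(new_xyz.shape)                    # torch.Size([2, 512, 3])
```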
Figure 6. Visualization of some completion results on the PCN dataset. Our method better infers the missing parts of objects (e.g., the left side of the airplane) and reconstructs shapes with smoother surfaces (e.g., the car) and less noise (e.g., the first table). Some comparisons are highlighted with squares.
Figure 7. Histograms of category-level CD variance on the PCN dataset. Different colors represent percentiles: green bars denote results below the 50th percentile, and gray bars denote those between the 50th and 80th percentiles.
Figure 8. Visualization of some real-world completion results. Our method better generates the missing parts (e.g., the first chair) and reconstructs shapes with less noise (e.g., the first table). Some comparisons are highlighted with squares.
Figure 9. Visualization of real-world completion results. Our method generates shapes with distinguishable parts even from sparse inputs.
Figure 10. Performances on the MVP dataset in terms of CD (×10^4, left) and F1 (×10^2, right) with different levels of Gaussian noise added to the upsampling process in (a) U_2 and (b) U_3.
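The robustness probe behind Figure 10 can be reproduced in spirit with a few lines: zero-mean Gaussian noise of a chosen standard deviation is injected into an intermediate point set (e.g., the one entering U_2 or U_3) before the remaining upsampling steps, and CD/F1 are then recomputed. The noise levels in this sketch are placeholders, not the values used in the experiments.

```python
import torch

def inject_gaussian_noise(points, sigma):
    """points: (B, N, 3); returns a copy perturbed with zero-mean Gaussian noise."""
    return points + sigma * torch.randn_like(points)

intermediate = torch.rand(4, 4096, 3)        # e.g., the point set entering U_3
for sigma in (0.0, 0.01, 0.02, 0.05):        # assumed noise levels, not the paper's
    noisy = inject_gaussian_noise(intermediate, sigma)
    # ... run the remaining upsampler(s) on `noisy` and recompute CD / F1 here
    print(sigma, tuple(noisy.shape))
```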
Table 1. Comparison of network performances on the MVP dataset (16,384 points) in terms of CD-l2 (×10^4); lower is better. Bold values indicate the best results.
Method  Air  Cab  Car  Cha  Lam  Sof  Tab  Wat  Bed  Ben  Boo  Bus  Gui  Mot  Pis  Ska  Avg
PCN [3]  2.95  4.13  3.04  7.07  14.93  5.56  7.06  6.08  12.72  5.73  6.91  2.46  1.02  3.53  3.28  2.99  6.02
TopNet [55]  2.72  4.25  3.40  7.95  17.01  6.04  7.42  6.04  11.56  5.62  8.22  2.37  1.37  3.90  3.97  2.09  6.36
MSN [56]  2.07  3.82  2.76  6.21  12.72  4.74  5.32  4.80  9.93  3.89  5.85  2.12  0.69  2.48  2.91  1.58  4.90
CRN [4]  1.59  3.64  2.60  5.24  9.02  4.42  5.45  4.26  9.56  3.67  5.34  2.23  0.79  2.23  2.86  2.13  4.30
ECG [57]  1.41  3.44  2.36  4.58  6.95  3.81  4.27  3.38  7.46  3.10  4.82  1.99  0.59  2.05  2.31  1.66  3.58
GRNet [5]  1.61  4.66  3.10  4.72  5.66  4.61  4.85  3.53  7.82  2.96  4.58  2.97  1.28  2.24  2.11  1.61  3.87
NSFA [58]  1.51  4.24  2.75  4.68  6.04  4.29  4.84  3.02  7.93  3.87  5.99  2.21  0.78  1.73  2.04  2.14  3.77
VRCNet [6]  1.15  3.20  2.14  3.58  5.57  3.58  4.17  2.47  6.90  2.76  3.45  1.78  0.59  1.52  1.83  1.57  3.12
Snowflake [8]  0.96  3.19  2.27  3.30  4.10  3.11  3.43  2.29  5.93  2.29  3.34  1.81  0.50  1.72  1.54  2.13  2.73
FBNet [17]  0.81  2.97  2.18  2.83  2.77  2.86  2.84  1.94  4.81  1.94  2.91  1.67  0.40  1.53  1.29  1.09  2.29
VAPCNet [51]  0.78  3.19  2.10  3.05  3.16  3.14  3.26  2.15  5.36  1.92  3.08  1.68  0.33  1.39  1.34  0.95  2.40
CP3 [59]  0.74  2.94  2.25  2.78  2.54  2.87  2.84  2.00  5.24  1.98  2.87  1.67  0.45  1.45  1.23  0.92  2.27
Ours  0.79  3.06  1.98  2.60  2.90  2.80  2.73  1.93  5.02  1.91  2.96  1.56  0.32  1.28  1.32  0.71  2.23
Table 2. Comparison of network performances on the MVP dataset (16,384 points) in terms of F1 (×10^2); higher is better. Bold values indicate the best results.
Method  Air  Cab  Car  Cha  Lam  Sof  Tab  Wat  Bed  Ben  Boo  Bus  Gui  Mot  Pis  Ska  Avg
PCN [3]  86.1  64.1  68.6  51.7  45.5  55.2  64.6  62.8  45.2  69.4  54.6  77.9  90.6  66.5  77.4  86.1  63.8
TopNet [55]  79.8  62.1  61.2  44.3  38.7  50.6  63.9  60.9  40.5  68.0  52.4  76.6  86.8  61.9  72.6  83.7  60.1
MSN [56]  87.9  69.2  69.3  59.9  60.4  62.7  73.0  69.6  56.9  79.7  63.7  80.6  93.5  72.8  80.9  88.5  71.0
CRN [4]  89.8  68.8  72.5  67.0  68.1  64.1  74.8  74.2  60.0  79.7  65.9  80.2  93.1  77.2  84.3  90.2  74.0
ECG [57]  90.6  68.0  71.6  68.3  73.4  65.1  76.6  75.3  64.0  82.2  70.6  80.4  94.5  78.0  83.5  89.7  75.3
GRNet [5]  86.1  64.1  68.6  51.7  45.5  55.2  64.6  62.8  45.2  69.4  54.6  77.9  90.6  66.5  77.4  86.1  63.8
NSFA [58]  90.3  69.4  72.1  73.7  78.3  70.5  81.7  79.9  68.7  84.5  74.7  81.5  93.2  81.5  85.8  89.4  78.3
VRCNet [6]  92.8  72.1  75.6  74.3  78.9  69.6  81.3  80.0  67.4  86.3  75.5  83.2  96.0  83.4  88.7  93.0  79.6
Snowflake [8]  92.0  70.0  73.0  73.0  78.0  70.0  80.0  79.0  68.0  85.0  73.0  82.0  95.0  80.0  87.0  92.0  78.2
FBNet [17]  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  82.2
VAPCNet [51]  94.2  76.2  75.8  78.6  84.4  75.9  84.5  82.4  73.6  89.2  80.8  85.4  97.8  84.5  90.2  94.4  82.9
CP3 [59]  94.0  74.0  75.0  77.0  84.0  74.0  82.0  82.0  72.0  87.0  77.0  84.0  97.0  85.0  90.0  93.0  81.4
Ours  94.9  75.6  76.7  80.0  85.5  75.1  85.3  83.3  74.2  89.6  80.5  84.6  98.0  85.4  90.8  94.4  83.3
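Tables 1 and 2 (and the tables that follow) report the Chamfer Distance and the F1 score. For reference, the sketch below shows one common way these metrics are computed in NumPy; the exact averaging convention of CD-l2 and the F1 threshold used by the MVP and PCN benchmarks may differ slightly, so treat these functions as an illustration rather than the evaluation code of this paper.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """a: (N, 3), b: (M, 3) -> (N, M) matrix of squared Euclidean distances."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)

def chamfer_l2(pred, gt):
    # Symmetric squared-distance Chamfer Distance (one common convention).
    d = pairwise_sq_dists(pred, gt)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def f_score(pred, gt, tau=0.01):
    # F-score at distance threshold tau (often 1% of the object scale).
    d = np.sqrt(pairwise_sq_dists(pred, gt))
    precision = (d.min(axis=1) < tau).mean()   # predicted points close to the GT
    recall = (d.min(axis=0) < tau).mean()      # GT points covered by the prediction
    return 2 * precision * recall / (precision + recall + 1e-8)

pred, gt = np.random.rand(1024, 3), np.random.rand(1024, 3)
print(chamfer_l2(pred, gt) * 1e4, f_score(pred, gt) * 1e2)  # scales used in the tables
```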
Table 3. Network performances at different resolutions on the MVP dataset in terms of CD-l2 (×10^4) and F1 score (×10^2). Bold values indicate the best results.
Method  2048 points (CD / F1)  4096 points (CD / F1)  8192 points (CD / F1)  16,384 points (CD / F1)
PCN [3]  9.77 / 32.0  7.96 / 45.8  6.99 / 56.3  6.02 / 63.8
TopNet [55]  10.11 / 30.8  8.20 / 44.0  7.00 / 53.3  6.36 / 60.1
MSN [56]  7.90 / 43.2  6.17 / 58.5  5.42 / 65.9  4.90 / 71.0
CRN [4]  7.25 / 43.4  5.83 / 56.9  4.90 / 68.0  4.30 / 74.0
ECG [57]  6.64 / 47.6  5.41 / 58.5  4.18 / 69.0  3.58 / 75.3
GRNet [5]  7.61 / 35.3  5.73 / 49.3  4.51 / 61.6  3.54 / 70.0
VRCNet [6]  5.96 / 49.9  4.70 / 63.6  3.64 / 72.7  3.12 / 79.1
PoinTr [7]  5.79 / 49.9  4.29 / 63.8  3.52 / 72.5  2.95 / 78.3
PDR [60]  5.66 / 49.9  4.26 / 64.9  3.35 / 75.4  2.61 / 81.7
Snowflake [8]  5.71 / 50.3  4.45 / 64.8  3.48 / 74.3  2.69 / 79.6
FBNet [17]  5.06 / 53.2  3.88 / 67.1  2.99 / 76.6  2.29 / 82.2
VAPCNet [51]  5.40 / 52.1  3.96 / 65.8  3.02 / 76.3  2.40 / 82.9
CP3 [59]  5.10 / 52.6  3.49 / 68.2  3.14 / 75.6  2.27 / 81.4
Ours  5.03 / 52.6  3.66 / 68.6  2.85 / 77.9  2.23 / 83.3
Table 4. Network performances on the PCN dataset (16,384 points) in terms of CD-l1 (×10^3); lower is better. Bold values indicate the best results.
Method  Air  Cab  Car  Cha  Lam  Sof  Tab  Wat  Avg
FoldingNet [22]  9.49  15.80  12.61  15.55  16.41  15.97  13.65  14.99  14.31
TopNet [55]  7.61  13.31  10.90  13.82  14.44  14.78  11.22  11.12  12.15
AtlasNet [3]  6.37  11.94  10.10  12.06  12.37  12.99  10.33  10.61  10.85
PCN [3]  5.50  22.70  10.63  8.70  11.00  10.34  11.68  8.59  9.64
GRNet [5]  6.45  10.37  9.45  9.41  7.96  10.51  8.44  8.04  8.83
SK-PCN [61]  5.09  9.98  8.22  9.29  8.39  10.80  7.84  8.02  8.49
CRN [4]  4.79  9.97  8.31  9.49  8.94  10.69  7.81  8.05  8.51
PMP-Net [62]  5.65  11.24  9.64  9.51  6.95  10.83  8.72  7.25  8.73
ECG [57]  5.23  10.12  8.36  9.43  8.53  10.94  7.98  8.16  8.63
NSFA [58]  5.03  10.51  9.11  9.16  7.45  10.46  7.56  7.28  8.32
PoinTr [7]  4.75  10.47  8.68  9.39  7.75  10.39  7.78  7.29  8.38
SnowflakeNet [8]  4.29  9.16  8.08  7.89  6.07  9.23  6.55  6.40  7.21
PMP-Net++ [63]  4.39  9.96  8.53  8.09  6.06  9.82  7.17  6.52  7.56
FBNet [17]  3.99  9.05  7.90  7.38  5.82  8.85  6.35  6.18  6.94
SeedFormer [10]  3.85  9.05  8.06  7.06  5.21  8.85  6.05  5.85  6.74
ProxyFormer [11]  4.01  9.01  7.88  7.11  5.35  8.77  6.03  5.98  6.77
VAPCNet [51]  4.10  9.28  8.15  7.51  5.55  9.18  6.28  6.10  7.02
CP3 [59]  4.34  9.02  7.90  7.41  6.35  8.52  6.32  6.26  7.02
SVDFormer [12]  3.62  8.79  7.46  6.91  5.33  8.49  5.90  5.83  6.54
WalkFormer [64]  3.73  9.17  8.26  7.28  5.35  8.69  6.12  5.74  6.79
Ours  3.69  8.74  7.61  6.88  5.10  8.60  5.86  5.74  6.53
Table 5. Network performances on the Matterport3D and ScanNet datasets in terms of Fidelity (Fid., ×10^4) and MMD (×10^3). Bold values indicate the best results.
Method  Matterport3D Chair (Fid. / MMD)  Matterport3D Table (Fid. / MMD)  ScanNet Chair (Fid. / MMD)  ScanNet Table (Fid. / MMD)
GRNet [5]  0.356 / 1.541  0.341 / 1.916  0.355 / 1.633  0.353 / 1.586
SeedFormer [10]  0.150 / 1.721  0.129 / 1.702  0.158 / 1.662  0.121 / 1.734
SVDFormer [12]  0.236 / 1.780  0.205 / 1.795  0.253 / 1.575  0.209 / 1.506
Ours  0.113 / 1.758  0.071 / 1.707  0.148 / 1.573  0.111 / 1.478
Table 6. Network performances on the KITTI dataset in terms of Fidelity (×10^3) and MMD (×10^3). Bold values indicate the best results.
Metric  PCN [3]  TopNet [55]  NSFA [58]  CRN [4]  GRNet [5]  PoinTr [7]  Seed. [10]  Proxy. [11]  Walk. [64]  Ours
Fidelity  2.235  5.354  1.281  1.023  0.816  0.000  0.151  0.000  0.093  0.135
MMD  1.366  0.636  0.891  0.872  0.568  0.526  0.516  0.508  0.504  0.496
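Tables 5 and 6 rely on Fidelity and MMD because the real scans have no complete ground truth. The sketch below follows their common definitions in the completion literature: Fidelity is the average distance from each input point to its nearest output point (how well the observed partial shape is preserved), and MMD is the Chamfer Distance between the output and its best-matching complete reference shape (how plausible the output is). The construction of the reference set is dataset-specific and omitted here, so this is an assumption-laden illustration, not the paper's evaluation script.

```python
import numpy as np

def nn_dist(a, b):
    """Mean distance from each point of a (N, 3) to its nearest neighbour in b (M, 3)."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    return d.min(axis=1).mean()

def fidelity(partial_in, completed):
    # How well the observed partial input is preserved in the completed output.
    return nn_dist(partial_in, completed)

def mmd(completed, reference_shapes):
    # Chamfer Distance to the best-matching complete reference shape.
    return min(nn_dist(completed, ref) + nn_dist(ref, completed) for ref in reference_shapes)

partial_in = np.random.rand(512, 3)
completed = np.random.rand(2048, 3)
refs = [np.random.rand(2048, 3) for _ in range(3)]   # placeholder reference shapes
print(fidelity(partial_in, completed), mmd(completed, refs))
```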
Table 7. Ablated versions of our method and their performances on the MVP dataset in terms of CD-l2 (×10^4) and F1 (×10^2).
Version  ECD  Con.1  Con.2  Con.3  CD  F1
V1  2.315  82.40
V2  2.268  82.74
V3  2.266  82.55
V4  2.293  83.61
Ours  2.233  83.33
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
