The overall framework of FuNet is shown in Figure 1; it is an encoder–decoder network. The encoder extracts a feature $F_{point}$ by point-based processing and a feature $F_{conv}$ by convolution-based processing, from which two coarse point clouds, $P_{point}$ and $P_{conv}$, are generated. The decoder then fuses the two features in the attention module to obtain the global feature $F_{global}$, which is used to generate a complete point cloud $P_{complete}$. The notations for the different point clouds are listed in Table 1.
2.1. Encoder
The encoder separately extracts local structure information from the point cloud by point-based processing, and global contour information by convolution-based processing.
Point-based processing. As a simple and effective network for point cloud shape classification and part segmentation, Point-PN [22] extracts point cloud features using a series of non-parametric components and linear layers, stacking them into multiple stages to build a pyramid hierarchy. The extended version of Point-PN designed in this paper therefore inherits the original structure and extracts features for point cloud completion.
First, the dimensions of the input point cloud are extended by a shared MLP, and the result is fed into a multi-stage hierarchy. Each stage applies Farthest Point Sampling (FPS), $k$-Nearest Neighbors ($k$-NN), trigonometric functions, and pooling operations to progressively aggregate the local geometric structure and generate the high-dimensional feature $F_{point}$ obtained from point-based processing.
At each stage of the multi-stage hierarchy, the $N$-point input point cloud is denoted as $P = \{p_i\}_{i=1}^{N}$, where $p_i \in \mathbb{R}^{3}$ represents the coordinates of a point. The number of points is first downsampled from $N$ to $N_s$ by FPS. Then, $k$-NN divides the points into neighborhoods around each center to form local 3D regions; the value of $k$ is 8 in our network. The combination of FPS and $k$-NN is commonly used to extract the sets of local neighborhood points and their features. After FPS and $k$-NN, trigonometric functions are used to reveal the local geometry in a simple, non-parametric way. Specifically, for each centroid $p_c$, with feature $f_c$, and its neighborhood $\{p_j\}$, with features $\{f_j\}$, Local Geometry Aggregation (LGA) is applied to implement feature extraction. The specific process of LGA is as follows. First, $f_c$ and $f_j$ are concatenated along the feature dimension to assign a large receptive field to each point feature and expand the feature. Second, the trigonometric encoding of the relative coordinates $p_j - p_c$, which plays the role of position encoding in the Transformer, effectively encodes the relative position information; the expanded feature is combined with it so that it carries the local geometry information. Finally, pooling operations are used to aggregate the expanded feature. After the multi-stage hierarchy, both max and average pooling are performed to aggregate the local structure feature $F_{point}$.
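To make the procedure concrete, the following PyTorch sketch implements one simplified stage of this pipeline: FPS selects the centers, $k$-NN gathers neighborhoods, center and neighbor features are concatenated, a trigonometric encoding of the relative coordinates is added, and max plus average pooling aggregates each neighborhood. It only illustrates the idea, not the exact Point-PN layer; the helper names, frequency schedule, and channel handling are assumptions.

```python
import torch

def farthest_point_sampling(xyz, m):
    # xyz: (N, 3); greedily selects m indices that are mutually far apart
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    for i in range(1, m):
        dist = torch.minimum(dist, ((xyz - xyz[idx[i - 1]]) ** 2).sum(-1))
        idx[i] = torch.argmax(dist)
    return idx

def trig_pos_encoding(rel_xyz, dim, beta=100.0):
    # Sinusoidal encoding of relative coordinates (a stand-in for Point-PN's
    # non-parametric position encoding; the frequency schedule is an assumption).
    assert dim % 6 == 0
    n_freq = dim // 6
    freqs = beta ** (torch.arange(n_freq, dtype=torch.float32) / n_freq)
    angles = rel_xyz.unsqueeze(-1) * freqs                    # (..., 3, n_freq)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(-2)                                    # (..., dim)

def local_geometry_aggregation(xyz, feat, m, k=8):
    """One simplified LGA stage: FPS -> k-NN -> concatenate centre and neighbour
    features -> add trigonometric encoding of relative positions -> pooling."""
    centre_idx = farthest_point_sampling(xyz, m)
    centres, centre_feat = xyz[centre_idx], feat[centre_idx]         # (m, 3), (m, C)
    knn_idx = torch.cdist(centres, xyz).topk(k, largest=False).indices  # (m, k)
    nbr_feat = feat[knn_idx]                                         # (m, k, C)
    rel = xyz[knn_idx] - centres.unsqueeze(1)                        # (m, k, 3)
    # expand each neighbour feature with its centre feature (large receptive field)
    expanded = torch.cat([centre_feat.unsqueeze(1).expand(-1, k, -1), nbr_feat], dim=-1)
    expanded = expanded + trig_pos_encoding(rel, dim=expanded.shape[-1])
    # aggregate each neighbourhood with both max and average pooling
    return centres, expanded.max(dim=1).values + expanded.mean(dim=1)

# e.g. xyz: (2048, 3), feat: (2048, 24) -> 512 centres with 48-dim features:
# centres, f = local_geometry_aggregation(xyz, feat, m=512, k=8)
```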
Convolution-based processing. Drawing on the idea of point cloud gridding [8] that has developed in recent years, we grid the input point cloud to extract its global contour features. The point cloud is regularized using a 3D grid as an intermediate representation, whereby the unordered and irregular point cloud is converted into a regular 3D grid $G = \langle V, W \rangle$. This conversion preserves the spatial layout of the point cloud: each point $p_i$ is assigned to vertices in the vertex set $V$, and the corresponding values are stored in the set $W$. As illustrated in Figure 2, a cell is defined as a cube composed of eight vertices, and the value $w_i \in W$ of a vertex $v_i \in V$ is determined by the points lying in the eight cells adjacent to that vertex.
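The sketch below illustrates this gridding step: each point distributes a trilinear-style weight to the eight vertices of the cell that contains it, which, seen from a vertex, accumulates contributions from points in its eight adjacent cells. It is a simplified stand-in for the gridding of [8]; the normalization, resolution, and weighting function are assumptions.

```python
import torch

def point_cloud_to_grid(points, resolution=32):
    """Scatter a point cloud (N, 3), assumed normalised to [-1, 1]^3, onto a
    regular grid of vertex values. Each point contributes to the eight vertices
    of the cell that contains it, with trilinear-style weights."""
    grid = torch.zeros(resolution + 1, resolution + 1, resolution + 1)
    # map coordinates from [-1, 1] to [0, resolution]
    coords = (points.clamp(-1.0, 1.0) + 1.0) * 0.5 * resolution
    base = coords.floor().long().clamp(min=0, max=resolution - 1)  # lower vertex of the cell
    frac = coords - base.float()                                   # position inside the cell
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((frac[:, 0] if dx else 1 - frac[:, 0]) *
                     (frac[:, 1] if dy else 1 - frac[:, 1]) *
                     (frac[:, 2] if dz else 1 - frac[:, 2]))
                vx, vy, vz = base[:, 0] + dx, base[:, 1] + dy, base[:, 2] + dz
                grid.index_put_((vx, vy, vz), w, accumulate=True)
    return grid  # (resolution + 1)^3 vertex values
```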
Next, a 3D Convolutional Neural Network (3D CNN) with skip connections extracts the global contour information from the 3D grid. The 3D CNN consists of four 3D convolutional layers, each followed by a batch normalization layer, an activation function, and a max pooling layer. Finally, a shared MLP outputs the global contour feature $F_{conv}$ obtained from convolution-based processing.
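A minimal PyTorch version of such an extractor might look as follows; the channel widths, activation, output dimension, and the omission of the skip connections are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ContourCNN3D(nn.Module):
    """Minimal 3D CNN for extracting a global contour feature from the grid."""

    def __init__(self, feat_dim=512):
        super().__init__()
        chans = [1, 32, 64, 128, 256]
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [
                nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm3d(cout),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(2),                  # halves the grid resolution
            ]
        self.encoder = nn.Sequential(*blocks)
        # shared MLP implemented as 1x1x1 convolutions
        self.mlp = nn.Sequential(
            nn.Conv3d(chans[-1], feat_dim, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, grid):                      # grid: (B, 1, R, R, R)
        x = self.mlp(self.encoder(grid))          # (B, feat_dim, r, r, r)
        return x.flatten(2).max(dim=-1).values    # global contour feature (B, feat_dim)
```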
2.2. Decoder
From the encoder we obtain $F_{point}$ and $F_{conv}$, whose sizes are $k_1 \times C$ and $k_2 \times C$, respectively, where $k_1$ and $k_2$ are weight coefficients. In the attention module, we first concatenate the two features along the feature dimension and extend the dimensions of the concatenated feature in order to enlarge the receptive field and increase the representational capability. Second, the extended feature is fed into a max-pooling MLP pipeline and an average-pooling MLP pipeline, respectively, to obtain the weighted point cloud features. Then, based on the weight values, the features with the highest weights are used to represent the global feature $F_{global}$ of the input point cloud. The experimental results show that, although the structure of the attention module is simple, it improves the completion results significantly.
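One possible form of this fusion attention module is sketched below, assuming $F_{point}$ and $F_{conv}$ are sets of $k_1$ and $k_2$ feature vectors: the vectors are concatenated and expanded, a max-pooling branch and an average-pooling branch produce per-vector weights, and the highest-weighted vectors form the global feature. The layer sizes and the scoring scheme are illustrative assumptions, not the exact module.

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """Sketch of the attention-based fusion of the two encoder features."""

    def __init__(self, in_dim=512, expand_dim=1024, n_keep=256):
        super().__init__()
        self.n_keep = n_keep
        self.expand = nn.Sequential(nn.Linear(in_dim, expand_dim), nn.ReLU(inplace=True))
        self.mlp_max = nn.Sequential(nn.Linear(expand_dim, expand_dim // 4),
                                     nn.ReLU(inplace=True), nn.Linear(expand_dim // 4, 1))
        self.mlp_avg = nn.Sequential(nn.Linear(expand_dim, expand_dim // 4),
                                     nn.ReLU(inplace=True), nn.Linear(expand_dim // 4, 1))

    def forward(self, f_point, f_conv):
        # f_point: (B, k1, C), f_conv: (B, k2, C)
        fused = self.expand(torch.cat([f_point, f_conv], dim=1))        # (B, k1+k2, D)
        # max-pooling and average-pooling MLP branches produce per-vector weights
        ctx_max = fused.max(dim=1, keepdim=True).values                 # (B, 1, D)
        ctx_avg = fused.mean(dim=1, keepdim=True)                       # (B, 1, D)
        weights = torch.sigmoid(self.mlp_max(fused * ctx_max) + self.mlp_avg(fused * ctx_avg))
        weighted = fused * weights                                      # (B, k1+k2, D)
        # keep the n_keep highest-weighted vectors as the global feature
        top = weights.squeeze(-1).topk(self.n_keep, dim=1).indices      # (B, n_keep)
        return torch.gather(weighted, 1, top.unsqueeze(-1).expand(-1, -1, weighted.size(-1)))
```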
Next, we generate the complete, dense point cloud $P_{complete}$ from the global feature $F_{global}$. In the first step, a coarse point cloud is generated by passing $F_{global}$ through an MLP and reshaping the output into a matrix of 3D coordinates. In the second step, for each point $q_i$ in the coarse point cloud, a patch of $u^2$ points in local coordinates centered at $q_i$ is generated by the folding operation. These points are then transformed into global coordinates by adding $q_i$ to the output, where $u$ represents the side length of the 2D grid. Combining all patches yields a complete point cloud with $u^2$ times as many points as the coarse cloud. This two-step process enables FuNet to generate a complete point cloud with fewer parameters than a fully connected decoder, while also offering greater flexibility than a purely folding-based decoder.
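The two-step generation can be sketched as follows in PyTorch; the layer sizes, the grid scale, and the folding MLP are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class TwoStepDecoder(nn.Module):
    """Sketch of coarse generation by an MLP followed by a folding refinement."""

    def __init__(self, feat_dim=1024, n_coarse=512, grid_side=4, grid_scale=0.05):
        super().__init__()
        self.n_coarse, self.u = n_coarse, grid_side
        self.coarse_mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, n_coarse * 3),
        )
        # fixed 2D grid folded into a local patch around every coarse point
        lin = torch.linspace(-grid_scale, grid_scale, grid_side)
        gy, gx = torch.meshgrid(lin, lin, indexing="ij")
        self.register_buffer("grid", torch.stack([gx, gy], dim=-1).reshape(-1, 2))  # (u*u, 2)
        self.fold_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3 + 2, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 3),
        )

    def forward(self, global_feat):                       # (B, feat_dim)
        B = global_feat.shape[0]
        coarse = self.coarse_mlp(global_feat).view(B, self.n_coarse, 3)
        t = self.u * self.u
        centre = coarse.unsqueeze(2).expand(-1, -1, t, -1)                  # (B, n, t, 3)
        grid = self.grid.view(1, 1, t, 2).expand(B, self.n_coarse, -1, -1)  # (B, n, t, 2)
        feat = global_feat.view(B, 1, 1, -1).expand(-1, self.n_coarse, t, -1)
        local = self.fold_mlp(torch.cat([feat, centre, grid], dim=-1))      # local offsets
        fine = (centre + local).reshape(B, self.n_coarse * t, 3)            # global coordinates
        return coarse, fine
```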
2.3. Loss Function
The loss function is used to evaluate the disparity between the ground truth point cloud and the output point cloud. Given the unordered nature of point clouds, the loss function must be permutation-invariant. Common choices for point cloud completion losses are the Chamfer Distance (CD) [20] and the Earth Mover's Distance (EMD) [20]. Because EMD has high memory requirements and requires the reconstructed point cloud to contain the same number of points as the ground truth point cloud, CD is chosen in our experiments. In addition, the Uniform Loss [23] is incorporated to enhance the uniformity of the output point cloud.
Chamfer Distance: By definition, the Chamfer Distance is the sum of the average closest distance from the points of the output point cloud $S_1$ to the ground truth point cloud $S_2$ and the average closest distance from the points of $S_2$ to $S_1$:
$$\mathcal{L}_{CD}(S_1,S_2)=\underbrace{\frac{1}{N_1}\sum_{x\in S_1}\min_{y\in S_2}\lVert x-y\rVert_2}_{d_{S_1\rightarrow S_2}}+\underbrace{\frac{1}{N_2}\sum_{y\in S_2}\min_{x\in S_1}\lVert y-x\rVert_2}_{d_{S_2\rightarrow S_1}},$$
where $d_{S_1\rightarrow S_2}$ denotes the average distance from the points of $S_1$ to the closest point of $S_2$, $d_{S_2\rightarrow S_1}$ denotes the average distance from the points of $S_2$ to the closest point of $S_1$, and $N_1$ and $N_2$ are the numbers of points in $S_1$ and $S_2$, respectively.
In general, the CD loss has two forms, $\mathcal{L}_{CD\text{-}\ell_1}$ and $\mathcal{L}_{CD\text{-}\ell_2}$, which are defined as follows:
$$\mathcal{L}_{CD\text{-}\ell_1}(S_1,S_2)=\frac{1}{2}\left(\frac{1}{N_1}\sum_{x\in S_1}\min_{y\in S_2}\lVert x-y\rVert_2+\frac{1}{N_2}\sum_{y\in S_2}\min_{x\in S_1}\lVert y-x\rVert_2\right),$$
$$\mathcal{L}_{CD\text{-}\ell_2}(S_1,S_2)=\frac{1}{N_1}\sum_{x\in S_1}\min_{y\in S_2}\lVert x-y\rVert_2^2+\frac{1}{N_2}\sum_{y\in S_2}\min_{x\in S_1}\lVert y-x\rVert_2^2.$$
Both forms are used in the loss function of the network.
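Both variants are straightforward to compute with a dense pairwise-distance matrix, as in the following sketch.

```python
import torch

def chamfer_distances(s1, s2):
    """Both CD variants for point clouds s1: (N1, 3) and s2: (N2, 3), computed
    with a dense pairwise-distance matrix (fine for a few thousand points)."""
    d = torch.cdist(s1, s2)              # (N1, N2) pairwise Euclidean distances
    d1 = d.min(dim=1).values             # nearest-neighbour distance for each point of s1
    d2 = d.min(dim=0).values             # nearest-neighbour distance for each point of s2
    cd_l1 = 0.5 * (d1.mean() + d2.mean())
    cd_l2 = (d1 ** 2).mean() + (d2 ** 2).mean()
    return cd_l1, cd_l2
```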
Uniformity Loss: Uniformity is usually used to evaluate the homogeneity of the complete point cloud distribution, and it is expressed as:
$$\mathcal{L}_{uni}=\sum_{j=1}^{M}U_{imbalance}(S_j)\cdot U_{clutter}(S_j),$$
where $S_j$ ($j=1,\dots,M$) is a subset of points obtained by cropping the output point cloud using farthest point sampling and a ball query with radius $r_d$. Here, $U_{clutter}$ considers the local distribution uniformity, while $U_{imbalance}$ considers the non-local uniformity to encourage better point coverage.
$$U_{imbalance}(S_j)=\frac{\left(|S_j|-\hat{n}\right)^{2}}{\hat{n}},$$
where $\hat{n}$ is the expected number of points in $S_j$, and the chi-square test is employed to quantify the bias of $|S_j|$ from $\hat{n}$.
$$U_{clutter}(S_j)=\sum_{k=1}^{|S_j|}\frac{\left(d_{j,k}-\hat{d}\right)^{2}}{\hat{d}},$$
where $d_{j,k}$ represents the distance to the nearest neighbor for the $k$-th point in $S_j$, and $\hat{d}$ is approximately calculated as $\sqrt{2\pi r_d^{2}/(|S_j|\sqrt{3})}$ (assuming that $S_j$ has a uniform distribution). The chi-square test is employed once again to quantify the bias of $d_{j,k}$ from $\hat{d}$.
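A direct, unoptimized sketch of this term is given below; it reuses the farthest_point_sampling helper from the point-based sketch in Section 2.1, and the seed count, radius, and expected-fraction heuristic are assumptions.

```python
import math
import torch

def uniform_loss(points, n_seeds=32, radius=0.1, expected_frac=None):
    """Sketch of the uniformity term of [23] for a point cloud (N, 3), assumed
    to be normalised to a unit sphere."""
    n = points.shape[0]
    if expected_frac is None:
        expected_frac = radius ** 2            # rough expected fraction of points per ball
    n_hat = expected_frac * n
    seeds = points[farthest_point_sampling(points, n_seeds)]
    dists = torch.cdist(seeds, points)         # (n_seeds, N)
    loss = points.new_zeros(())
    for j in range(n_seeds):
        subset = points[dists[j] < radius]     # ball query around seed j
        if subset.shape[0] < 2:
            continue
        u_imbalance = (subset.shape[0] - n_hat) ** 2 / n_hat
        # nearest-neighbour distance of every point inside the subset
        d_nn = torch.cdist(subset, subset).fill_diagonal_(float("inf")).min(dim=1).values
        d_hat = math.sqrt(2 * math.pi * radius ** 2 / (subset.shape[0] * math.sqrt(3)))
        u_clutter = ((d_nn - d_hat) ** 2 / d_hat).sum()
        loss = loss + u_imbalance * u_clutter
    return loss
```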
The loss function $\mathcal{L}$ that we propose is as follows, where $\alpha$, $\beta$, and $\gamma$ are weight coefficients:
$$\mathcal{L}=\alpha\,\mathcal{L}_{CD}(P_{point},P_{gt})+\beta\,\mathcal{L}_{CD}(P_{conv},P_{gt})+\mathcal{L}_{CD}(P_{complete},P_{gt})+\gamma\,\mathcal{L}_{uni}(P_{complete}).$$
Here, the first term evaluates the loss between the coarse point cloud generated from the point-based feature and the ground truth point cloud; similarly, the second term evaluates the loss between the coarse point cloud generated from the convolution-based feature and the ground truth point cloud; the third term evaluates the loss between the complete point cloud and the ground truth point cloud; and the last term evaluates the uniformity of the complete point cloud $P_{complete}$.
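Under the assumptions of the sketches above (and arbitrarily using the $\ell_1$ CD form for every term), the full objective could be assembled as follows; the argument names and default weights are placeholders, not the values used in our experiments.

```python
def funet_loss(coarse_point, coarse_conv, complete, gt, alpha=0.5, beta=0.5, gamma=0.01):
    """Hypothetical assembly of the objective; reuses chamfer_distances and
    uniform_loss defined in the sketches above."""
    l_point, _ = chamfer_distances(coarse_point, gt)   # coarse cloud from the point branch
    l_conv, _ = chamfer_distances(coarse_conv, gt)     # coarse cloud from the convolution branch
    l_complete, _ = chamfer_distances(complete, gt)    # dense, complete output
    return alpha * l_point + beta * l_conv + l_complete + gamma * uniform_loss(complete)
```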