Article

Sparse 3D Point Cloud Parallel Multi-Scale Feature Extraction and Dense Reconstruction with Multi-Headed Attentional Upsampling

1 School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China
2 The Second Topographic Surveying Brigade of MNR, Xi’an 710054, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(19), 3157; https://doi.org/10.3390/electronics11193157
Submission received: 9 August 2022 / Revised: 21 September 2022 / Accepted: 28 September 2022 / Published: 1 October 2022

Abstract

Three-dimensional (3D) point clouds have a wide range of applications in the field of 3D vision, and the quality of the acquired point cloud data considerably impacts subsequent point cloud processing. Because point cloud data are sparse and irregular, processing them has always been challenging. Existing deep learning-based dense reconstruction methods suffer from over-smoothed results and excessive outliers, largely because they cannot extract local and global features at different scales or assign different levels of attention to different regions in order to capture the long-distance dependencies needed for dense reconstruction. In this paper, we process sparse and irregular point cloud data with a parallel multi-scale feature extraction module based on graph convolution and an upsampling method augmented with a multi-head attention mechanism in order to obtain an expanded point cloud. Specifically, a point cloud training patch with 256 points is input to the network. The feature extraction stage uses three residually connected PMS modules, each consisting of three parallel DenseGCN blocks with different convolution kernel sizes and different average pooling sizes, which efficiently extract local and global feature information over an enlarged receptive field; scale information is obtained from the differently sized average pooling operations. The upsampling stage uses an upsampling rate of r = 4. Self-attention features that focus on different regions of the point cloud are fused with different weights, making the feature representation more diverse; this avoids the bias of a single attention head, with each head extracting valuable fine-grained feature information. Finally, the coordinate reconstruction module outputs 1024 dense points. Experiments show that the proposed method achieves good evaluation metrics and better visual quality: the problems of over-smoothing and excessive outliers are effectively mitigated, and the reconstructed point cloud is denser.

1. Introduction

Point cloud data play an important role in the field of 3D vision and are involved in almost all related areas, including perception and localization for autonomous vehicles, SLAM, 3D scene reconstruction, AR/VR, SFM, pose estimation, 3D recognition, structured light, stereovision, 3D measurement, visual guidance, and virtual displays of cultural relics. Point cloud data acquisition generally relies on 3D sensors such as LiDAR and depth cameras. However, the raw point cloud data acquired by LiDAR and other devices are usually sparse, noisy, uneven, or even partially missing. Therefore, it is necessary to repair the raw sampled data in order to generate complete, dense, and uniform point cloud data. This work has generally relied on early traditional methods based on geometric feature optimization [1,2,3,4,5,6], which in turn depend on preconditioning assumptions or additional attributes.
Compared to traditional methods, deep learning methods make it possible to learn features directly from point clouds. The first deep learning framework able to directly process raw 3D point cloud data was PointNet [7]. Taking point clouds directly as input and implementing permutation invariance with symmetric functions, it learns per-point features independently using several MLP layers and extracts global features using a max pooling layer; however, the relationships between points in the point cloud are not considered. Many deep learning methods for point cloud upsampling have been proposed recently, devoted to converting sparse, incomplete, and noisy point cloud data into compact, complete, and clean point cloud data. These include PU-Net [8], EC-Net [9], 3PU-Net [10], PU-GAN [11], and PU-GCN [12]. PU-Net [8] expands the feature space by learning multi-level features of each point using different convolutional branches; the expanded features are then decomposed and reconstructed as a set of upsampled points. With this approach, the fine-grained features of the point cloud data are largely lost. EC-Net [9] implements edge-aware point cloud upsampling, improving the quality of point cloud surface reconstruction; however, the input point cloud model needs to be annotated with contour information, which increases the workload. 3PU-Net [10] performs upsampling patch by patch, which helps preserve local detail. This process can be understood as progressive upsampling, repeating three upsampling stages from coarse to fine to achieve the final result, similar to super-resolution methods in image processing; however, this approach is less efficient. PU-GCN [12] extracts 3D point cloud features with the Inception–DenseGCN feature extraction module built from GCN and DenseGCN blocks, upsamples via GCN modules such as NodeShuffle, and finally obtains a dense point cloud by coordinate reconstruction. This method encodes local features, generates new points under better conditions, and does not require additional attributes such as annotations and normals. PU-GCN achieves the best dense reconstruction results among existing methods; however, it loses a certain amount of global point cloud structure information [13] and generates more noise points in feature-rich regions [14]. Existing deep learning methods all suffer from over-smoothing and an excessive number of outliers in their dense reconstruction results, as shown in Figure 1.
Unlike well-structured 2D data, point cloud data have the inherent properties of irregularity and sparsity, which pose significant challenges for advanced vision tasks such as point cloud classification, segmentation, and target detection. Point cloud data consist of a series of sparse 3D spatial points that are invariant to ordering. Traditional CNNs (convolutional neural networks) are better suited to processing data with regular spatial arrangements [15,16,17], whereas network structures based on GCNs (graph convolutional networks) [18,19,20,21,22] are better suited to non-Euclidean structured data such as point clouds.
A general upsampling network usually consists of three parts: a feature extraction module, an upsampling module, and a coordinate reconstruction module. Traditional deep learning repair methods suffer from too many outliers and over-smoothing. Here, we design two modules that play a crucial role in the point cloud upsampling method, which performs end-to-end dense reconstruction by upsampling. Other attempts at dense reconstruction tend to lose fine-grained features of the point cloud [23], are complex to operate [9], or may not pay enough attention to the global structure [12] and to feature-rich regions [13,14]. Attention should be paid to localizing features at different scales, extracting global features, and giving different levels of attention to different regions. Therefore, building on the latest upsampling methods, we use parallel modules to fuse feature information at different scales during feature extraction; receptive-field regions of different scales are combined, and the receptive field is extended by adding average pooling. In the upsampling stage, the extracted features are upsampled and input into the multi-head attention module (MHA), where attention heads with different weights and different foci extract more valuable information. These operations achieve a good dense reconstruction effect while reducing over-smoothing and excessive outliers. Our main contributions combine graph convolution methods with upsampling; specifically, we propose the following:
  • A parallel multi-scale feature extraction module combined with a pyramid pooling module. Based on the original DenseGCN, pooling operations at different scales are added and multiple modules are arranged in parallel, effectively extracting global and local feature information and fusing feature information across different scales and pooling sizes;
  • An upsampling module with a multi-head attention mechanism. The attention focus of each attention head is weighted by a learned total weight matrix, allowing a more diverse feature representation and avoiding the bias of a single attention head;
  • Feature information is effectively extracted from irregular and sparse point cloud data, the features obtained from the feature extraction module are upsampled, and the results are reconstructed in coordinates to obtain a densely reconstructed point cloud. The point cloud data obtained by upsampling with our model are more uniform and dense, with fewer outliers, more distinct contour details, and better high-fidelity detail, and the over-smoothing problem is alleviated (see Figure 1).

2. Related Work

2.1. Reasons for Missing Point Clouds

During data acquisition, the 3D laser scanner is affected by the characteristics of the measured object, the measurement method, and the environment, which inevitably leads to missing point clouds [1]. For example, the stability of the 3D scanner during scanning has a considerable impact on the scanned point cloud. The support materials, mechanical structures, and continuous rotation of the scanner inevitably lead to mechanical jitter, which affects the echoes and causes deviations between the collected point cloud locations and the actual object being measured. After data collection, the point cloud must undergo a series of processing operations, such as de-noising, smoothing, alignment, and fusion. These operations can greatly aggravate missing point clouds, affecting data integrity, leading to topological errors, and degrading the quality of point cloud reconstruction, 3D model reconstruction, local spatial information extraction, and subsequent processing.
In this paper, we propose an end-to-end upsampling method for dense reconstruction. The method is simple and fast when applied to the dense reconstruction of missing and sparse point cloud data acquired during data acquisition. It reduces the loss of global point cloud structural information and generates fewer outliers, producing a point cloud model with rich structural information and ensuring that the point cloud data remain usable in subsequent operations. Alternatively, after data collection is complete, the point cloud data can be further processed for dense reconstruction to reduce quality problems in point cloud reconstruction, 3D model reconstruction, and other subsequent processing. Our approach thus has a positive impact on applications in the field of 3D vision.

2.2. Graph Convolution Method

Graph-based neural networks were first proposed by Joan Bruna [24] in 2014 to construct convolution operations from a spatial perspective [25] and a spectral-domain perspective. Graph convolutional networks were first proposed by Thomas Kipf [18] in 2017. A GCN [26] is a neural network architecture that uses a graph structure to aggregate vertex feature information from the neighborhood in a convolutional manner, allowing effective convolutional processing of irregular discrete data. GCNs have gradually become a research hotspot in 3D disordered discrete point cloud processing and are usually used for feature encoding in dense and denoised point cloud reconstruction; they can be used to upsample point cloud data at arbitrary upsampling rates. Because PointNet [7] cannot extract local feature information from the neighborhood, many graph convolutional networks use edge convolution to learn the feature relationships between neighboring points. Graph convolutional networks are widely used due to their excellent properties for processing non-Euclidean data. For example, Xu [27] developed a BOIQA framework based on viewport-oriented graph convolutional networks to address interactions between different viewports. Fu [28] achieved impressive performance in no-reference 360-degree image quality assessment (NR 360IQA) using graph convolutional networks (GCNs) to model interactions between viewports via graphs. Fu [29] proposed a dual graph convolutional network for single-image rain streak removal, addressing the shortcoming that CNN-based deraining methods model only local relationships and rarely consider long-range context; two graphs are designed to model and reason about global relations, exploiting the strength of graph convolution in handling global relationships. EdgeConv from DGCNN [22] is used as the default GCN layer in our model.
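For reference, the following is a minimal PyTorch-style sketch of an EdgeConv-type layer of the kind used in DGCNN: a k-NN graph is built over the points, an MLP is applied to (center, neighbor − center) pairs, and the result is max-aggregated over the neighborhood. The layer sizes, the value of k, and the overall structure are illustrative assumptions rather than the exact configuration of DGCNN or of our network.

```python
import torch
import torch.nn as nn

def knn_graph(x, k):
    # x: (B, N, C) point features; returns indices of the k nearest neighbors, (B, N, k)
    dist = torch.cdist(x, x)                                   # pairwise Euclidean distances
    return dist.topk(k + 1, largest=False).indices[..., 1:]    # drop the point itself

class EdgeConv(nn.Module):
    """EdgeConv-style layer: MLP over (x_i, x_j - x_i), max-aggregated over neighbors."""
    def __init__(self, in_ch, out_ch, k=20):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_ch, out_ch), nn.ReLU())

    def forward(self, x):                                      # x: (B, N, C)
        idx = knn_graph(x, self.k)                             # (B, N, k)
        neighbors = torch.gather(
            x.unsqueeze(1).expand(-1, x.size(1), -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1)))  # (B, N, k, C)
        center = x.unsqueeze(2).expand_as(neighbors)           # broadcast the center point
        edge_feat = torch.cat([center, neighbors - center], dim=-1)  # (B, N, k, 2C)
        return self.mlp(edge_feat).max(dim=2).values           # max aggregation over k neighbors
```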

2.3. Attention Mechanism

The attention mechanism first appeared in [30] and is widely used today, as it can introduce long-range context dependencies and enhance feature integration through self-attention units [11].
The attention module in SAGAN [31] is used to construct two feature spaces using two convolution functions:
$$f(x) = W_f x, \quad g(x) = W_g x$$
In the above equation, $W_f$ and $W_g$ (and $W_h$, $W_v$ below) denote learned weight matrices. After transposing the result of $f(x)$ and multiplying it by $g(x)$, the matrix $s$ is obtained. Softmax normalization is then applied to $s$ to obtain the matrix $\beta$, where $\beta_{j,i}$ denotes the degree of attention the model pays to the $i$th position when synthesizing the $j$th pixel, i.e., the following attention map:
$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \quad s_{ij} = f(x_i)^T g(x_j)$$
The obtained attention map is then applied to the feature map output by $h(x)$. Each $h(x_i)$ that influences the generated $j$th pixel is multiplied by its corresponding degree of influence $\beta_{j,i}$, and the results are summed:
$$o_j = v\!\left(\sum_{i=1}^{N} \beta_{j,i}\, h(x_i)\right), \quad h(x_i) = W_h x_i, \quad v(x_i) = W_v x_i$$
In turn, each pixel $o_j$ is generated according to these degrees of influence, and the result can be passed through another convolution layer to obtain the feature maps with added self-attention, $o$. This type of dependency reduces the amount of computation by acquiring long-range dependencies in a single layer:
$$o = (o_1, o_2, \ldots, o_j, \ldots, o_N)$$
Expanding the receptive field by instead choosing large convolution kernels or building a very deep network requires many calculations and parameters, and is consequently inefficient [32].
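To make the computation above concrete, the following is a minimal PyTorch sketch of a SAGAN-style self-attention layer implementing the equations above; the 1 × 1 convolutions, the channel reduction to C/8, and the learned residual weight γ follow the common SAGAN formulation and are assumptions here rather than details taken from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over a (B, C, H, W) feature map."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // 8, 1)   # W_f
        self.g = nn.Conv2d(channels, channels // 8, 1)   # W_g
        self.h = nn.Conv2d(channels, channels, 1)        # W_h
        self.v = nn.Conv2d(channels, channels, 1)        # W_v
        self.gamma = nn.Parameter(torch.zeros(1))        # learned residual weight

    def forward(self, x):
        B, C, H, W = x.shape
        f = self.f(x).flatten(2)                         # (B, C', N) with N = H*W
        g = self.g(x).flatten(2)                         # (B, C', N)
        h = self.h(x).flatten(2)                         # (B, C,  N)
        s = torch.bmm(f.transpose(1, 2), g)              # s_ij = f(x_i)^T g(x_j), (B, N, N)
        beta = F.softmax(s, dim=1)                       # normalize over positions i for each j
        o = torch.bmm(h, beta)                           # o_j = sum_i beta_{j,i} h(x_i), (B, C, N)
        o = self.v(o.view(B, C, H, W))                   # final 1x1 convolution v(.)
        return self.gamma * o + x                        # add attention output to the input
```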
In the Transformer attention mechanism [33], the encoder has two operations per layer: self-attention and feed-forward. The decoder has three operations per layer: self-attention, encoder–decoder attention, and feed-forward. Both self-attention and encoder–decoder attention use the Multi-Head Attention mechanism. Attention can generally be described as mapping a query (Q) and a set of key-value pairs to an output. The query, keys, and values are all vectors, and the output is a weighted sum of the values in V, with the weights computed from the query and each key; the computed similarities are normalized to obtain the attention weights. There are various ways to calculate similarity; here, the dot product is used. Multi-head attention carries out Scaled Dot-Product Attention [33] H times and then merges the output features. The Transformer has been demonstrated to stabilize the learning process in various tasks via its multi-headed attention mechanism [34], guaranteeing a sizable receptive field without sacrificing computational efficiency [35].
DGCNN [22] learns neighborhood features by edge convolution and uses the same symmetric function (max pooling) as PointNet for aggregation. Using symmetric functions for aggregation is simple and resolves the ordering invariance of the point cloud, although a great deal of information is lost. In AGNet [36], feature aggregation is handled by different attention-scoring mechanisms: the attention pooling method scores different point cloud regions, and regions with higher scores receive higher weights and more attention during aggregation.
In this paper, different regions receive different levels of attention, and different attention heads have different foci; the obtained features are merged to capture long-range dependencies while avoiding the loss of feature information. We propose an attention strategy that adds multiple attention heads with different attention scores at the end of the upsampling stage.

2.4. Multi-Scale Feature Extraction

Classical neural network frameworks such as Network in Network [37], LeNet [38], AlexNet [39], and VGG [40] share a common characteristic: the whole network is built by stacking different neural modules in series. This style of construction suffers from overfitting and large numbers of model parameters. These issues led to the earliest multi-scale module, GoogLeNet [41], whose Inception module was the first classical model to use parallel network structures; many of its structures were designed to reduce network depth while improving performance. That work showed that global average pooling significantly improves network effectiveness by reducing parameters and saving computation, and that max pooling trades a loss of spatial information for better feature accuracy. For the convolution kernels, sizes of 1, 3, and 5 are used for easy alignment, with 1 × 1 convolution kernels used for dimensionality reduction and for reducing the number of channels. Convolution kernels of different sizes can capture information at different scales in the image, allowing convolution and pooling features of different scales to be fused.
The size of the receptive field determines, to a certain extent, the degree to which global context is used. It contains information at different scales and between different sub-regions, and the loss of contextual information between sub-regions can be further reduced. The pyramid pooling module [42] is an effective global-context prior. It contains features at four scales: the feature map is divided into sub-regions at each scale to form information representations of different regions. Given an input feature map of size H × W × C, pooling operations produce feature maps at four different scales (1 × 1, 2 × 2, 3 × 3, 6 × 6). After pooling, a 1 × 1 convolution adjusts the number of channels to 1/4 of the input channels. The features at the different levels are then bilinearly upsampled to the size of the input feature map and concatenated with the original feature map to form a global prior representation for subsequent use. The scale information comes from operations at different levels, contextual information from more receptive-field regions is combined across scales, and 1 × 1 convolutions incorporate the channel information, resulting in a complete representation that combines local and global features.
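As a reference implementation of the pyramid pooling idea described above, the following PyTorch sketch pools a 2D feature map at bin sizes 1, 2, 3, and 6, reduces each branch to 1/4 of the input channels with a 1 × 1 convolution, bilinearly upsamples, and concatenates with the original map; the module and parameter names are illustrative, not taken from the cited implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pyramid pooling: pool at several scales, reduce channels, upsample, concatenate."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // 4                               # each branch keeps 1/4 of the channels
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, 1), nn.ReLU())
            for b in bins)

    def forward(self, x):                                  # x: (B, C, H, W)
        size = x.shape[-2:]
        feats = [x]
        for branch in self.branches:
            y = branch(x)                                  # pooled to b x b bins, channels reduced
            feats.append(F.interpolate(y, size=size, mode='bilinear', align_corners=False))
        return torch.cat(feats, dim=1)                     # global prior: original + all scales
```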
Many excellent feature extraction methods have been proposed for point cloud data [43,44,45], although most are subjectively designed based on prior knowledge. Because the acquisition process affects the point cloud data, it is not easy to express specific point cloud features with subjectively designed features [46].
This paper pays attention to localizing features at different scales, extracting global features, and fusing convolutional and pooled feature information of different scales while maintaining model runtime performance. A parallel module for extracting feature information at different scales is proposed.

3. Method

The effectiveness of the deep learning-based point cloud upsampling method relies heavily on the design of two modules in the model, namely, the feature extraction module (feature encoding) and the upsampling module (feature expansion) [1].

3.1. Parallel Multi-Scale Feature Extraction Module

To further encode multi-scale features for point clouds, a parallel multi-scale feature extraction module is designed by fusing the native Inception module [41] and the pyramid pooling module [47], as shown in Figure 2. To better extract the fine-grained spatial feature information of the point cloud, DenseGCN blocks [26] with different convolution kernel sizes are arranged side by side, allowing point cloud spatial feature information to be extracted at different scale levels. Pooling operations of different sizes are used to acquire scale information comprehensively, which effectively handles the varying sizes and resolutions of the point cloud data collected by the instrument. As shown in Figure 2, the input feature information is first passed through a set of multilayer perceptrons (MLPs) that map the multiple feature channels into a single feature for subsequent multi-scale feature extraction. The obtained feature information is then passed through three parallel DenseGCN blocks, each consisting of three layers of densely connected dilated graph convolution. At the end of each DenseGCN block, we add average pooling operations of different sizes (1 × 1, 3 × 3, and 5 × 5, respectively) to expand the receptive field and fuse feature information at multiple scales. Contextual information from receptive-field regions of different scales is combined to obtain multi-scale features, and a global pooling operation provides global context. Finally, the features at the different levels are concatenated with the globally pooled features to form a global prior.
The graph structure is constructed using KNN in the first layer of the parallel multi-scale feature extraction module. Dilated-KNN [26] is used in each subsequent DenseGCN to expand the receptive field while maintaining the resolution of the data [12]. Through multiple sets of experiments, we found that the model performed best on the quantitative metrics with three sets of DenseGCNs and Dilated-KNN dilation rates of 1, 2, and 3. We define a DenseGCN as (k, d, c), where k is the kernel size, d is the dilation rate, and c is the number of channels.
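The sketch below illustrates, under simplifying assumptions, the two operations described above: dilated k-NN neighbor selection (keeping every d-th of the k·d nearest neighbors) and a schematic PMS block that concatenates the outputs of three parallel branches with a globally pooled feature. The branch modules, channel sizes, and the use of global average pooling are placeholders, not the exact implementation.

```python
import torch
import torch.nn as nn

def dilated_knn(x, k, d):
    """Dilated k-NN (DeepGCNs style): take the k*d nearest neighbors of each point,
    then keep every d-th one. x: (B, N, C); returns neighbor indices of shape (B, N, k)."""
    dist = torch.cdist(x, x)                                     # (B, N, N) pairwise distances
    idx = dist.topk(k * d + 1, largest=False).indices[..., 1:]   # drop self, keep k*d candidates
    return idx[..., ::d]                                         # dilation: every d-th neighbor

class PMSBlock(nn.Module):
    """Schematic PMS block: three parallel branches (placeholders for DenseGCNs with
    dilation rates 1, 2, 3 and different pooling sizes), fused with a global feature."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(channels, channels), nn.ReLU()) for _ in range(3))

    def forward(self, x):                                        # x: (B, N, C)
        outs = [branch(x) for branch in self.branches]           # multi-scale local features
        global_feat = x.mean(dim=1, keepdim=True).expand_as(x)   # global (average-pooled) context
        return torch.cat(outs + [global_feat], dim=-1)           # (B, N, 4C): local + global fusion
```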

3.2. Multi-Headed Self-Attention Structure

Inspired by self-attention [11] and the Transformer [48], and using the basic structural idea of SAGAN [31], we designed a self-attention module within the upsampling module; the network framework is shown in Figure 3. The input features are first converted into F and G by two independent sets of MLPs (using Equation (1)) in order to obtain the attention weights W:

$$W = G^T F$$

Features H are extracted from the input by another set of MLPs, and the attention weights W are applied to H:
$$\mathrm{Attention}(G, F, H) = f_{\mathrm{softmax}}\!\left(\frac{G^T F}{\sqrt{d_G}}\right) H$$
We now compute the weighted attention feature, i.e., single self-attention (SSA). In detail, the result of G is transposed and multiplied by F, then divided by the scaling factor $\sqrt{d_G}$. After normalizing the result with the softmax function, the obtained degree of attention is applied to the features H. The weighted features and the input features are then summed to obtain the attention map output of the single self-attention unit. Here, F, G, and H are obtained by multiplying the input by the parameter matrices W, $f_{\mathrm{softmax}}$ is the softmax function, and $d_G$ is the dimensionality of G. The distribution of $G^T F$ is related to $d_G$, and dividing by the scaling factor $\sqrt{d_G}$ decouples the softmax input from $d_G$, which keeps the gradient values stable during training.
A multi-head strategy is adopted in the attention mechanism, i.e., multiple sets of SSA features (Figure 4) are combined to obtain the output features of the attention module. One attention head has only one learning space, while multiple attention heads have multiple learning spaces. We adopt this strategy in the multi-headed self-attention module, i.e.,
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
Three attention heads (SSA) are used, i.e., $h = 3$; F, G, and H are partitioned among multiple SSA modules with different weight matrices $W_i^F$, $W_i^G$, and $W_i^H$. Each SSA map has its own focus area. Finally, the SSA maps computed by each attention head are merged, with the total weight matrix $W^O$ determining the degree of attention given to each head. By mapping F, G, and H to different spaces, the different SSAs learn and optimize different parts of the features. Balancing the possible biases of a single SSA through this operation gives the features a more diverse expression. In the ablation study (Section 4.4), the multi-head attention mechanism has a significant impact on the quality of the recovered point cloud.
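A minimal PyTorch sketch of the SSA unit and its multi-head combination is given below, assuming per-point features of shape (B, N, C). The head dimension, the placement of the residual connection after the W^O merge (a simplification relative to Figure 4, where it sits inside the SSA unit), and the use of the $\sqrt{d_G}$ scaling follow the equations above, while the layer types and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class SSA(nn.Module):
    """Single self-attention head: softmax(G^T F / sqrt(d_G)) applied to H."""
    def __init__(self, in_ch, d_head):
        super().__init__()
        self.to_f = nn.Linear(in_ch, d_head)    # W_i^F
        self.to_g = nn.Linear(in_ch, d_head)    # W_i^G
        self.to_h = nn.Linear(in_ch, d_head)    # W_i^H

    def forward(self, x):                       # x: (B, N, C) per-point features
        f, g, h = self.to_f(x), self.to_g(x), self.to_h(x)
        attn = Fn.softmax(g @ f.transpose(1, 2) / g.size(-1) ** 0.5, dim=-1)  # (B, N, N)
        return attn @ h                         # weighted features, (B, N, d_head)

class MultiHeadSSA(nn.Module):
    """Multi-head variant: h independent SSA heads merged by the total weight matrix W^O."""
    def __init__(self, in_ch, heads=3):
        super().__init__()
        d_head = in_ch // heads
        self.heads = nn.ModuleList(SSA(in_ch, d_head) for _ in range(heads))
        self.w_o = nn.Linear(d_head * heads, in_ch)              # W^O

    def forward(self, x):
        merged = torch.cat([head(x) for head in self.heads], dim=-1)
        return x + self.w_o(merged)                              # residual + merged attention
```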

3.3. Network Framework

Given sparse or even partially missing point cloud data, the goal is to obtain denser and more complete point cloud data. Our network framework (see Figure 5) consists of three parts: a feature extraction module, an upsampling module, and a coordinate reconstruction module. First, the sparse input point cloud (N × 3) is processed by our multi-scale feature extraction module and expanded into point features of size rN × C; the dense rN × 3 point cloud output is then obtained by the standard coordinate reconstruction module, where r is the upsampling rate, C is the feature channel dimension, and N is the number of points in a training patch. Three residually connected parallel multi-scale feature extraction (PMS) modules are used in the feature extraction module.
In the feature extraction module, a GCN layer and a DenseGCN [10] layer are used to extract the high-level graph feature information of the point cloud. Then, the multi-level and multi-scale point cloud feature information is extracted by multiple sets of our proposed parallel multi-scale feature extraction modules. Each parallel multi-scale feature extraction module is densely connected.
The upsampling module consists of three components: one set of MLPs, the upsampling core component, and two sets of MLPs for feature compression. Specifically, the point cloud features extracted by the feature extraction module are first compressed by a set of MLPs. The compressed point cloud features are then expanded by the upsampling core component, after which the expanded features are compressed to rN × C by the two further sets of MLPs.
For coordinate reconstruction, two groups of MLPs are used [12]. The 3D coordinates of the output point cloud are reconstructed from the expanded features, yielding dense rN × 3 point cloud data.
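To make the shape flow through the three stages explicit, the following schematic uses plain MLP placeholders standing in for the PMS and MHA blocks; it only demonstrates how an N × 3 patch becomes N × C features, is expanded to rN × C, and is reconstructed to rN × 3 coordinates. All layer widths and the simple reshape-based expansion are assumptions made for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

class UpsamplingNet(nn.Module):
    """Shape-level schematic: feature extraction -> x r feature expansion -> coordinates."""
    def __init__(self, r=4, c=128):
        super().__init__()
        self.r = r
        self.extract = nn.Sequential(nn.Linear(3, c), nn.ReLU(), nn.Linear(c, c))          # placeholder for PMS stack
        self.expand = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, r * c))       # placeholder for upsampling core + MHA
        self.reconstruct = nn.Sequential(nn.Linear(c, 64), nn.ReLU(), nn.Linear(64, 3))    # coordinate reconstruction MLPs

    def forward(self, pts):                        # pts: (B, N, 3) sparse input patch
        feat = self.extract(pts)                   # (B, N, C) per-point features
        expanded = self.expand(feat)               # (B, N, r*C)
        expanded = expanded.view(pts.size(0), -1, feat.size(-1))   # reshape to (B, r*N, C)
        return self.reconstruct(expanded)          # (B, r*N, 3) dense output coordinates

# Example: a 256-point patch upsampled at r = 4 yields 1024 points.
dense = UpsamplingNet()(torch.randn(2, 256, 3))
print(dense.shape)                                 # torch.Size([2, 1024, 3])
```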

4. Experiments

4.1. Dataset and Implementation Details

We use the PU1k dataset from PU-GCN, which covers many 3D models. It contains 1147 models: 1020 for training (120 training samples from the PU-GAN dataset and 900 further training samples from ShapeNetCore) and 127 for testing (27 test models from PU-GAN and 100 models from ShapeNetCore).
Each model is cropped into 50 training patches, for a total of 51,000 training patches. Each patch consists of 256 points as the low-resolution training input; these are upsampled at a rate of r = 4, with 1024 points used as the ground truth.
For the test data, we generate pairs of input point clouds (2048 points) and ground truths (8192 points). In the test phase, we follow PU-GCN in using farthest point sampling to obtain the seed points, with a patch of 256 points extracted per seed.
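For reference, a minimal sketch of farthest point sampling of the kind used to select patch seeds is shown below; it is the standard greedy algorithm and is not taken from the PU-GCN code.

```python
import torch

def farthest_point_sampling(pts, n_seeds):
    """Greedy farthest point sampling. pts: (N, 3); returns indices of n_seeds points."""
    N = pts.size(0)
    chosen = torch.zeros(n_seeds, dtype=torch.long)
    dist = torch.full((N,), float('inf'))
    chosen[0] = torch.randint(N, (1,)).item()              # arbitrary first seed
    for i in range(1, n_seeds):
        # update each point's distance to the nearest already-chosen seed
        dist = torch.minimum(dist, (pts - pts[chosen[i - 1]]).pow(2).sum(-1))
        chosen[i] = dist.argmax()                           # pick the point farthest from all seeds
    return chosen
```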

4.2. Loss Function and Evaluation Metrics

The loss function is
$$CD(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \|p - q\|_2^2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \|p - q\|_2^2$$
We use the mutual (symmetric) Chamfer distance as the loss function. It measures the average distance between each sampled point in one point cloud model and its closest point in the other, where P and Q are the two point cloud models and p and q are the 3D coordinates of the sampled points. The symbol $\|\cdot\|_2^2$ denotes the squared Euclidean norm. A smaller Chamfer distance indicates better model performance.
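A brute-force sketch of this symmetric Chamfer distance, matching the equation above, might look as follows (batched point sets, squared Euclidean distances); an efficient implementation would typically use a dedicated CUDA kernel instead.

```python
import torch

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point sets P (B, N, 3) and Q (B, M, 3)."""
    d = torch.cdist(P, Q).pow(2)                  # (B, N, M) pairwise squared distances
    # mean over P of nearest-Q distances, plus mean over Q of nearest-P distances
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)   # (B,)
```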
The evaluation metrics are the Chamfer distance (CD), the Hausdorff distance (HD), and the point-to-face distance (P2F) with respect to the ground truth mesh; in all cases, a smaller value indicates better performance. All models were trained on a Quadro RTX 5000 GPU with an AMD EPYC 7302 CPU.

4.3. Qualitative and Quantitative Results

Qualitative Results

The proposed model is compared qualitatively with the input, MPU, PU-Net, PU-GAN, PU-GCN, and the ground truth (GT) on the PU1k dataset. The proposed model produces fewer outliers and more distinct model profile features. As can be clearly observed from the close-ups of the leg of the bird (first row) and the hand-held part of the handbag (second row) in Figure 6, the proposed model generates fewer outliers than the other models and more faithfully reproduces the original features of the model. This effect is primarily due to the use of Dilated-KNN [26] in each DenseGCN layer of the network, which recalculates the edges between vertices to further increase the receptive field and creates dynamic edges [34]. By helping the representations of nodes in the same connected component of the input graph avoid converging to the same value, this effectively alleviates the over-smoothing problem [22].
Moreover, the model's detailed features and geometric information are better preserved, as can be seen from the close-up region of the mask (third row) in Figure 6. The proposed model better preserves the raised circular region of the ear on the mask compared to the recent and more advanced PU-GCN, and the detailed features of the model are not overly modified by dense reconstruction. This effect is primarily due to the proposed attention mechanism, which allows a more diverse representation of features and effectively avoids the bias that a single attention mechanism would introduce into the feature representation.

Quantitative Results

In the quantitative comparison, the proposed model is compared with four upsampling networks, MPU, PU-Net, PU-GAN, and PU-GCN, on three evaluation metrics.
The quantitative results in Table 1 show that the proposed approach quantifies better on most metrics than the other four models. Although the proposed model is slightly worse than PU-GCN on the CD metric, it shows a significant improvement on HD and P2Favg. In the qualitative results in Figure 6, the proposed model shows higher quality and fewer outliers than the other four models in the close-up regions of the bird's legs (first row), the handle of the handbag (second row), and the ear of the mask (third row). These differences suggest that the CD metric alone does not play a decisive role in model evaluation. This part of the experimental results further illustrates the contribution of the proposed feature extraction and upsampling modules to model performance.

Generalization Study

The performance of the proposed model is further validated on the PU-GAN dataset in order to verify its generalization ability. The proposed model is compared with the existing state-of-the-art methods MPU, PU-GAN, and PU-GCN. The quantitative results in Table 2 show that the proposed model remains optimal on all performance metrics. The qualitative results in Figure 7 show that the proposed model encodes local information better, preserves the fine-grained features of the model, and produces fewer outliers. It is clear from the close-up regions of the duck beak (first row), elephant leg (second row), and horse leg (third row) that the proposed model produces fewer outliers and expresses the contour features of the model more clearly.

4.4. Ablation Studies

The ablation experiments quantitatively evaluate the contribution of each proposed module to the overall network: the parallel multi-scale feature extraction module (PMS) and the multi-headed self-attention module (MHA), the core component of the upsampling module. The base model is PU-GCN [12]; PU-GCN uses no multi-headed self-attention (MHA) or other attention modules, and its feature extraction module is the Inception DenseGCN module. Because PU-GCN is an existing model with excellent dense upsampling reconstruction results, it is used as the baseline for the ablation experiments, with the proposed PMS and MHA modules replaced/added on top of it.
In the ablation experiment on the parallel multi-scale feature extraction module (PMS), the multi-headed self-attention (MHA) module is removed and only the PMS module is added. The performance of the proposed model improves significantly on the other metrics, and the overall performance remains good despite a decline on the CD metric, as shown in Table 3. In the visualization results in Figure 8, the close-up regions of the telephone line (first row) and the chair backrest (third row) show that the proposed multi-scale feature extraction produces fewer outliers than the advanced methods, while the close-up area of the hole puncher (second row) shows that our module responds well to the contour features of the point cloud model.
The ablation experiment on the multi-headed self-attention module (MHA) removes the parallel multi-scale feature extraction module (PMS) and adds only the MHA module. The proposed model significantly reduces the HD and P2Favg metrics, and its performance is better than the latest models, as shown in Table 4. The visualization results in Figure 9 show that our model has fewer outliers and more specific contour information in the close-up regions of the point cloud models.
A further ablation experiment on the number of attention heads in the multi-headed self-attention module (MHA) verifies whether model performance benefits as the number of attention heads increases. With the other modules unchanged, the number of attention heads is set to h = 2, h = 3, and h = 4. As shown in Table 5, the quantitative performance improves as the number of heads increases up to h = 3, where it is optimal, but does not keep improving beyond that point. The CD metric continues to improve while the HD and P2Favg metrics do not, further indicating that the CD metric alone should not occupy a decisive role in evaluation. With h = 4, the quantitative performance does not improve further; these results indicate that adding more attention heads cannot further improve the model's performance.
The visualization results are shown in Figure 10, as is evident from the close-up areas of the barrel (first row) and the shoulder bag (second row). At h = 2, the barrel (first row) shows dense outliers accompanied by blurred contours. At h = 3, although some outliers remain, the contour information is closer to the ground truth (e). At h = 4, the outliers are reduced compared to h = 2, although a few prominent outliers still affect the description of the model contours. The close-up area of the shelf (third row) shows that at h = 3 the contour features of the point cloud model are described most clearly, with the fewest outliers and more specific detail, which better preserves the contour information of the shelf. At h = 2 and h = 4 there are more outliers, with h = 2 producing fewer outliers than h = 4.
As shown in Figure 10, the visualization results for the close-up areas of the barrel (first row), shoulder bag (second row), and shelf (third row) indicate that as h increases, the detail features and contour information of the point cloud model are retained and visualization properties such as outliers are little affected. Table 5 shows that the HD and P2Favg evaluation metrics do not improve further with an increased number of attention heads. As shown in Table 6, the average inference time of the point cloud model continues to increase with the number of attention heads. While model performance and evaluation quality are maintained, adding more attention heads increases the number of parameters, leading to longer average inference times with little benefit to model performance. For the proposed model, the optimal setup is therefore used, i.e., one multi-headed attention module with h = 3 attention heads.

5. Conclusions

To address the problems of sparse, noisy, inhomogeneous, and partially missing raw point cloud data obtained by scanning devices, where existing dense reconstruction methods produce excessive smoothing, unclear geometric contour information, and excessive outliers, this paper proposes a parallel multi-scale feature extraction module and a multi-headed self-attention module. In the feature extraction stage, the parallel multi-scale feature extraction module fuses feature information at three different scales from different receptive fields through multiple groups of pooling operations; a global pooling operation is then concatenated along specific feature dimensions to ensure that the global information representation combines local and global features, facilitating feature expansion in the upsampling stage. In the upsampling module with added multi-head self-attention, after the upsampled features are obtained, the attention foci of the different SSA heads are obtained as well, enhancing the feature set through different attention-scoring mechanisms.
We experimentally demonstrate that the proposed method is superior to other methods in terms of both quantitative results and visualization quality. The contour features of densely reconstructed point cloud data are improved, helping the contour representation converge toward the true shape. Over-smoothing is alleviated compared to other methods, and there are fewer outliers.

Author Contributions

Conceptualization and methodology, M.W. and H.J.; Validation, H.J.; Writing—original draft preparation, H.J.; Writing—review and editing, H.J. and M.W.; Supervision, M.W.; Funding acquisition and resources, M.W.; Visualization, H.J.; Data formatting, J.N. All authors have read and agreed to the published version of this manuscript.

Funding

This work was supported by National Natural Science Foundation of China (No. 61701388) and Natural Science Foundation of Shaanxi Province of China (2018JM6080).

Data Availability Statement

The PU1k dataset presented in this study is openly available on the website. Available online: https://drive.google.com/file/d/1oTAx34YNbL6GDwHYL2qqvjmYtTVWcELg/view (accessed on 20 September 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, C.; Wei, M.; Guo, Y. A Review of 3D Point Cloud Restoration Techniques Based on Deep Learning. Available online: https://www.cnki.com.cn/Article/CJFDTotal-JSJF202112016.htm (accessed on 2 August 2022).
  2. Lipman, Y.; Cohen-Or, D.; Levin, D.; Tal-Ezer, H. Parameterization-Free Projection for Geometry Reconstruction. ACM Trans. Graph. 2007, 26, 22. [Google Scholar] [CrossRef]
  3. Huang, H.; Li, D.; Zhang, H.; Ascher, U.; Cohen-Or, D. Consolidation of Unorganized Point Clouds for Surface Reconstruction. ACM Trans. Graph. (TOG) 2009, 28, 1–7. [Google Scholar] [CrossRef] [Green Version]
  4. Preiner, R.; Mattausch, O.; Arikan, M.; Pajarola, R.; Wimmer, M. Continuous Projection for Fast L1 Reconstruction. ACM Trans. Graph. (TOG) 2014, 33, 47.1–47.13. [Google Scholar] [CrossRef] [Green Version]
  5. Huang, H.; Shihao, W.U.; Gong, M.; Cohen-Or, D.; Ascher, U.; Zhang, H.R. Edge-Aware Point Set Resampling. ACM Trans. Graph. 2013, 32, 1–12. [Google Scholar] [CrossRef]
  6. Alexa, M.; Behr, J.; Cohen-Or, D.; Fleishman, S.; Levin, D.; Silva, C.T. Computing and Rendering Point Set Surfaces. IEEE Trans. Visual. Comput. Graph. 2003, 9, 3–15. [Google Scholar] [CrossRef] [Green Version]
  7. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  8. Yu, L.; Li, X.; Fu, C.-W.; Cohen-Or, D.; Heng, P.-A. Pu-net: Point cloud upsampling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2790–2799. [Google Scholar]
  9. Yu, L.; Li, X.; Fu, C.-W.; Cohen-Or, D.; Heng, P.-A. Ec-net: An edge-aware point set consolidation network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 386–402. [Google Scholar]
  10. Yifan, W.; Wu, S.; Huang, H.; Cohen-Or, D.; Sorkine-Hornung, O. Patch-based progressive 3d point set upsampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5958–5967. [Google Scholar]
  11. Li, R.; Li, X.; Fu, C.-W.; Cohen-Or, D.; Heng, P.-A. Pu-gan: A point cloud upsampling adversarial network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7203–7212. [Google Scholar]
  12. Qian, G.; Abualshour, A.; Li, G.; Thabet, A.; Ghanem, B. Pu-gcn: Point cloud upsampling using graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11683–11692. [Google Scholar] [CrossRef]
  13. Han, B.; Zhang, X.; Ren, S. PU-GACNet: Graph Attention Convolution Network for Point Cloud Upsampling. Image Vis. Comput. 2022, 118, 104371. [Google Scholar] [CrossRef]
  14. Feng, W.; Li, J.; Cai, H.; Luo, X.; Zhang, J. Neural Points: Point Cloud Representation with Neural Fields. arXiv 2021, arXiv:2112.04148. [Google Scholar]
  15. Masci, J.; Meier, U.; Cireşan, D.; Schmidhuber, J. Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction. In Proceedings of the Artificial Neural Networks and Machine Learning—ICANN 2011; Honkela, T., Duch, W., Girolami, M., Kaski, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 52–59. [Google Scholar]
  16. Tschannen, M.; Bachem, O.; Lucic, M. Recent advances in autoencoder-based representation learning. arXiv 2018, arXiv:1812.05069. [Google Scholar] [CrossRef]
  17. Karn, U. An Intuitive Explanation of Convolutional Neural Networks. The Data Science Blog 2016. Available online: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/ (accessed on 20 September 2022).
  18. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  19. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017; pp. 1024–1034. [Google Scholar]
  20. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar] [CrossRef]
  21. Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Column networks for collective classification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  22. Simonovsky, M.; Komodakis, N. Dynamic Edge-Conditioned Filters in Convolutional Neural Networks on Graphs. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  23. Huang, H.; Chen, H.; Li, J. Deep Neural Network for 3D Point Cloud Completion with Multistage Loss Function. In Proceedings of the 2019 Chinese Control And Decision Conference (CCDC), Nanchang, China, 3–5 June 2019; pp. 4604–4609. [Google Scholar]
  24. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar]
  25. Valsesia, D.; Fracastoro, G.; Magli, E. Learning Localized Generative Models for 3d Point Clouds via Graph Convolution. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  26. Li, G.; Müller, M.; Qian, G.; Delgadillo, I.C.; Abualshour, A.; Thabet, A.; Ghanem, B. DeepGCNs: Making GCNs Go as Deep as CNNs. arXiv 2019, arXiv:1910.06849. [Google Scholar] [CrossRef]
  27. Xu, J.; Zhou, W.; Chen, Z. Blind Omnidirectional Image Quality Assessment with Viewport Oriented Graph Convolutional Networks. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 1724–1737. [Google Scholar] [CrossRef]
  28. Fu, J.; Hou, C.; Zhou, W.; Xu, J.; Chen, Z. Adaptive Hypergraph Convolutional Network for No-Reference 360-Degree Image Quality Assessment. arXiv 2021, arXiv:2105.09143. [Google Scholar]
  29. Fu, X.; Qi, Q.; Zha, Z.-J.; Zhu, Y.; Ding, X. Rain Streak Removal via Dual Graph Convolutional Network. In Proceedings of the AAAI Conference on Artificial Intelligence, virtual, 2–9 February 2021; 2021; Volume 35, pp. 1352–1360. [Google Scholar] [CrossRef]
  30. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  31. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-Attention Generative Adversarial Networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
  32. Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric Non-Local Neural Networks for Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 593–602. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Nice, France, 2017; Volume 30. [Google Scholar]
  34. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph Neural Networks: A Review of Methods and Applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
  35. Wang, Z.; She, Q.; Ward, T.E. Generative Adversarial Networks in Computer Vision: A Survey and Taxonomy. ACM Comput. Surv. 2021, 54, 37:1–37:38. [Google Scholar] [CrossRef]
  36. Jing, W.; Zhang, W.; Li, L.; Di, D.; Chen, G.; Wang, J. AGNet: An Attention-Based Graph Network for Point Cloud Classification and Segmentation. Remote Sens. 2022, 14, 1036. [Google Scholar] [CrossRef]
  37. Lin, M.; Chen, Q.; Yan, S. Network in Network. arXiv 2013, arXiv:1312.4400. [Google Scholar]
  38. LeCun, Y.; Boser, B.; Denker, J.; Henderson, D.; Howard, R.; Hubbard, W.; Jackel, L. Handwritten Digit Recognition with a Back-Propagation Network. In Advances in Neural Information Processing Systems; Morgan-Kaufmann: Burlington, MA, USA, 1989; Volume 2. [Google Scholar]
  39. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar] [CrossRef] [Green Version]
  40. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  41. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar] [CrossRef] [Green Version]
  42. Fang, H.; Lafarge, F. Pyramid scene parsing network in 3D: Improving semantic segmentation of point clouds with multi-scale contextual information. ISPRS J. Photogramm. Remote Sens. 2019, 154, 246–258. [Google Scholar] [CrossRef] [Green Version]
  43. Weinmann, M.; Schmidt, A.; Mallet, C.; Hinz, S.; Rottensteiner, F.; Jutzi, B. Contextual Classification of Point Cloud Data by Exploiting Individual 3d Neigbourhoods. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2015, II-3, 271–278. [Google Scholar] [CrossRef] [Green Version]
  44. Richter, R.; Behrens, M.; Döllner, J. Object Class Segmentation of Massive 3D Point Clouds of Urban Areas Using Point Cloud Topology. Int. J. Remote Sens. 2013, 34, 8408–8424. [Google Scholar] [CrossRef]
  45. Niemeyer, J.; Rottensteiner, F.; Soergel, U. Contextual Classification of Lidar Data and Building Object Detection in Urban Areas. ISPRS J. Photogramm. Remote Sens. 2014, 87, 152–165. [Google Scholar] [CrossRef]
  46. Wang, L.; Huang, Y.; Shan, J.; He, L. MSNet: Multi-Scale Convolutional Network for Point Cloud Classification. Remote Sens. 2018, 10, 612. [Google Scholar] [CrossRef] [Green Version]
  47. Su, Y.; Jiang, L.; Cao, J. Point Cloud Semantic Segmentation Using Multi Scale Sparse Convolution Neural Network. arXiv 2022. [Google Scholar] [CrossRef]
  48. Qiu, S.; Anwar, S.; Barnes, N. Pu-transformer: Point cloud upsampling transformer. arXiv 2021, arXiv:2111.12242. [Google Scholar]
Figure 1. The point cloud 4× upsampling effect obtained from a sparse point cloud using: (a) MPU, (b) PU-Net, (c) PU-GAN, (d) PU-GCN, and (e) our proposed network.
Figure 2. Parallel Multiscale Feature Extraction Module (PMS). ⋈ denotes connecting feature tensors along the last dimension; ⊕ denotes feature stitching; 1 × 1, 3 × 3, 5 × 5 denote different pooled feature map sizes, respectively; AVE denotes average pooling.
Figure 3. Multihead Self-Attention Module (MHA) in the core component of the upsampling module.
Figure 4. Single self-attention (SSA). ⊗ denotes the dot product operation; ⊕ denotes the summation of features.
Figure 5. Proposed network architecture, consisting of three parts: a multi-scale feature extraction module, an upsampling module, and a coordinate reconstruction module. ⊕ denotes feature stitching; PMS denotes the parallel multi-scale feature extraction module; MHA denotes the multi-headed attention mechanism.
Figure 6. Dense reconstruction results for 4× point clouds sampled with different network architectures (b–g), 2048-point input point cloud (a), and 8192-point ground truth point cloud (e).
Figure 7. Surface reconstruction results of 4× point clouds sampled on the PU-GAN dataset with different network models (b–f), 2048-point input point cloud (a), and 8192-point ground truth point cloud (e).
Figure 8. Qualitative results of the ablation study of the parallel multi-scale feature extraction module (PMS).
Figure 9. Qualitative results of the ablation study of the multihead self-attention module (MHA) in the upsampling core component.
Figure 10. Qualitative results of an ablation study of different numbers of attention heads in the multi-headed self-attention module (MHA).
Table 1. Quantitative comparison with four state-of-the-art network models on the PU1k dataset. The best results are highlighted in bold. Lower is better.

Network   CD (10⁻³)   HD (10⁻³)   P2Favg (10⁻³)
MPU       0.935       13.327      3.551
PU-Net    0.935       13.327      3.548
PU-GAN    0.950       13.540      3.682
PU-GCN    0.585       7.577       2.499
Ours      0.612       5.580       1.405
Table 2. A quantitative comparison on the PU-GAN dataset of our model and three advanced networks proposed on different datasets. The best results are shown in bold.

Network   CD (10⁻³)   HD (10⁻³)   P2Favg (10⁻³)
MPU       0.496       5.870       2.037
PU-GAN    0.489       6.283       1.955
PU-GCN    0.470       4.765       1.756
Ours      0.467       4.600       1.744
Table 3. Ablation study of the parallel multi-scale feature extraction module (PMS). The best results are highlighted in bold. Lower is better.

Network   CD (10⁻³)   HD (10⁻³)   P2Favg (10⁻³)
PU-GCN    0.585       7.577       2.499
Ours      0.712       6.004       1.505
Table 4. Ablation study of the multi-headed self-attention module (MHA). The best results are highlighted in bold. Lower is better.

Network   CD (10⁻³)   HD (10⁻³)   P2Favg (10⁻³)
PU-GCN    0.647       9.894       2.672
Ours      0.631       7.570       1.557
Table 5. An ablation study of different numbers of attention heads in the multi-headed self-attention module (MHA). The best results are highlighted in bold. Lower is better.

Number of Attention Heads   CD (10⁻³)   HD (10⁻³)   P2Favg (10⁻³)
h = 2                       0.637       9.784       2.552
h = 3                       0.631       7.570       1.557
h = 4                       0.606       9.249       2.450
Table 6. Average inference time of the point cloud model with different numbers of attention heads.

Number of Attention Heads     h = 2    h = 3    h = 4
Average Inference Time (ms)   7.588    10.071   12.315
