Article

Research on Multi-Modal Point Cloud Completion Algorithm Guided by Image Rotation Attention

1 College of Electronic Science and Technology, National University of Defense Technology, Changsha 410005, China
2 College of Meteorology and Oceanography, National University of Defense Technology, Changsha 410005, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(8), 1448; https://doi.org/10.3390/rs17081448
Submission received: 13 February 2025 / Revised: 6 April 2025 / Accepted: 16 April 2025 / Published: 18 April 2025
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

Abstract

This paper proposes a novel point cloud multi-scale completion algorithm guided by image rotation attention mechanisms to address the challenges of severe information loss and suboptimal fusion effects in multi-modal feature extraction and integration during point cloud shape completion. The proposed network employs an encoder–decoder structure, integrating a Rotating Channel Attention (RCA) module for enhanced image feature extraction and a multi-scale feature extraction method for point clouds to improve both local and global feature information. The network also utilizes multi-level self-attention mechanisms to achieve effective multi-modal feature fusion. The decoder employs a multi-branch completion method, guided by Chamfer distance loss, to accomplish the point cloud completion task. Extensive experiments on the ShapeNet-ViPC and ModelNet40ViPC datasets demonstrate the effectiveness of the proposed algorithm. Compared to eight related algorithms, the proposed method achieves superior performance in terms of completion accuracy and efficiency. Specifically, compared to the state-of-the-art XMFnet, the average Chamfer distance (CD) value is reduced by 11.71%. The algorithm also shows significant improvements in visual comparisons, with more distinct structural details and a more uniform density distribution in the completed point clouds. The ablation studies further validate the effectiveness of the RCA module and the multi-scale module, highlighting their complementary nature in enhancing point cloud completion accuracy. Future work will focus on improving the network’s performance and exploring its application in more complex 3D vision tasks.

1. Introduction

In recent years, the pursuit of accurate and efficient methods for 3D scene understanding has gained significant attention in various fields, including autonomous vehicles, robotics, and augmented reality. A key technology in 3D vision, Light Detection and Ranging (LiDAR), provides precise measurements and preserves the original geometric information of objects in 3D space [1,2]. LiDAR point clouds, which are collections of 3D points representing the surfaces of objects, have been widely used in numerous applications. For instance, in autonomous vehicles, LiDAR point clouds are used for obstacle detection and path planning. In robotics, they are employed for environment mapping and object manipulation. In augmented reality, point clouds help in the integration of virtual elements with the real world. Additionally, in industrial inspection and cultural heritage preservation, point clouds are used for detailed object scanning and analysis.
Despite the numerous applications, the challenge of working with incomplete LiDAR point cloud data has significantly impeded the progress in 3D scene understanding. Point cloud data are often sparse or incomplete due to various factors such as object occlusion, surface reflection, and sensor limitations. For example, in outdoor environments, objects may be partially obscured by other objects, leading to missing data in the point cloud. In indoor settings, reflective surfaces can cause LiDAR sensors to miss certain areas. Moreover, the limited scanning range and resolution of LiDAR sensors can result in sparse point cloud data, especially in large-scale scenes. These issues undermine the performance of subsequent 3D vision tasks, including object classification, detection, and segmentation, which are fundamental to many applications.
To address these challenges, point cloud completion has become a crucial task in 3D vision. The goal of point cloud completion is to reconstruct the missing parts of a point cloud, thereby providing a complete and accurate representation of the object or scene. Traditional point cloud completion methods often rely on geometric assumptions or manual feature engineering, which are limited in their ability to capture complex structures and details. For example, some methods use primitive shapes such as spheres and cylinders to fill in missing regions, but these approaches may fail to capture the intricate details of real-world objects. With the advancement of deep learning, researchers have developed various deep learning-based point cloud completion methods. These methods typically use autoencoder structures to learn the underlying features of point clouds and generate complete point clouds. For instance, PointNet [3] and PointNet++ [4] are pioneering works that use multilayer perceptrons (MLPs) and hierarchical architectures to process point clouds. However, these methods often struggle to balance local and global feature information, leading to suboptimal completion results. Moreover, they primarily focus on single-modal point cloud data and do not leverage the complementary information from other modalities, such as images. In this context, multi-modal point cloud completion methods have emerged as a promising direction. These methods integrate data from different sources, such as LiDAR and images, to enhance the completion process. Images provide rich texture and structural information that can guide the completion of sparse or incomplete point clouds. For example, the ViPC network [5] and CSDN network [6] are early attempts to combine point cloud and image data for completion tasks. However, these methods still face challenges in effectively fusing multi-modal information and maintaining the structural integrity of the completed point clouds.
The motivation behind our research is to develop a more effective point cloud completion algorithm that can better utilize the complementary strengths of point clouds and images. Our proposed method focuses on enhancing feature extraction from both modalities and improving the fusion of multi-modal features. Specifically, we introduce a LiDAR point cloud multi-scale completion algorithm guided by image rotation attention mechanisms. This algorithm employs an encoder–decoder structure, where the image feature extractor utilizes rotation attention mechanisms to enhance image feature extraction, and the point cloud feature extractor employs multi-scale methods to improve the global and local information of point clouds. The network also uses multi-level self-attention mechanisms to achieve effective multi-modal feature fusion.
Our research aims to address the limitations of existing point cloud completion methods by proposing a novel network architecture that leverages the strengths of both point cloud and image data. Through extensive experiments, we demonstrate the effectiveness of our approach, showing superior performance compared to state-of-the-art methods in terms of completion accuracy and efficiency. We believe that our work provides a significant advancement in the field of 3D vision and offers new possibilities for applications requiring accurate and reliable point cloud data. The code for our method is publicly available at https://github.com/CaineGu/CMFN, accessed on 2 February 2025.

2. Related Works

In the realm of three-dimensional (3D) vision technology, the completion of point cloud data has been approached through various methodologies, which can be broadly categorized into single-modal and multi-modal point cloud completion methods. Single-modal methods utilize only point cloud data for shape completion, while multi-modal methods integrate data from different sources, such as LiDAR and images, to enhance the completion process.
Single-modal point cloud completion methods utilize only object point cloud data for shape completion. Qi Charles et al.’s PointNet [3], as a multilayer perceptron (MLP) method for points, directly inspired the network design of many subsequent scholars. PointNet++ [4] and TopNet [7] incorporate hierarchical architectures to consider the local structural features of point clouds. To mitigate the structural loss brought by MLP, AtlasNet [8] and MSN [9] reconstruct complete outputs by evaluating a set of surface parameters, from which a complete point cloud can be generated. The encoder–decoder architecture pioneered by PCN [10] requires no structural assumptions or annotation information about the underlying shape. PoinTr [11] adapts the Transformer module for 3D completion tasks by creating a geometry-aware module to model local geometric relationships. PointAttN [12] enhances point cloud completion by employing attention mechanisms to capture detailed point relationships and improve feature extraction. Additionally, SnowflakeNet [13] generates child points by learning the transition of specific regions to gradually split parent points, enabling the network to predict highly detailed shape geometries. Yang et al. [14] proposed FoldingNet, which uses a grid deformation strategy to reconstruct point clouds, demonstrating the potential of auto-encoder architectures for point cloud processing. Pan [15] introduced ECG, which employs graph convolution to achieve edge-aware point cloud completion, focusing on preserving sharp edges and fine details. Pan et al. [16] further developed a variational relational point completion network, which enhances the robustness of 3D classification by incorporating relational reasoning into the completion process.
In single-modal point cloud completion, the limited geometric information available results in high uncertainty when inferring the missing regions of a point cloud. Furthermore, the low scanning resolution of 3D sensors typically yields sparsely captured data, making it difficult to determine whether missing regions stem from incompleteness or from inherent sparsity. Leveraging the complementary information between modalities is therefore essential for improving completion quality. The cross-modal multi-scale point cloud feature fusion shape completion network (CMFN) proposed in this paper fuses information from different data sources to enrich the point cloud representation. Because point clouds are unordered and unstructured, they lack the regular structure that makes feature extraction from images straightforward, so incorporating image information during completion can improve accuracy. In image-guided point cloud shape completion, extracting dual-modality information and fusing it effectively is the key to the task, and this paper focuses on feature extraction for both modalities.
Multi-modal point cloud completion methods address these limitations by leveraging the complementary information from different modalities. Zhang et al. [5] proposed the ViPC network, which takes an incomplete point cloud and an object image as input, generates a coarse point cloud from the image information, and then fuses it with the incomplete point cloud. Building on this, Zhu et al. [6] proposed the CSDN network, which differs from ViPC in that it uses an encoder–decoder structure to first recover the overall but coarse shape of the point cloud and then refine local details. Aiello et al. [17] proposed XMFnet, which outperforms its predecessors in completion performance and is characterized by the introduction of cross-attention and self-attention mechanisms to supervise the structural generation of point clouds. Although these methods achieve better completion results by exploiting multi-modal information, they also have shortcomings in practice: the ViPC network consumes a significant amount of memory and computation to generate coarse point clouds from image information, with suboptimal results; the CSDN network has excessive computational demands, and its underutilization of multi-modal information leads to imprecise completion details; and the XMFnet network's insufficient feature extraction capabilities for point cloud and image data result in non-refined local structures in the decoded point cloud. More generally, recent studies document several challenges for multi-modal completion: fusing data from different sensors, such as LiDAR and cameras, requires effective techniques for aligning heterogeneous features; processing and fusing multi-modal data incurs high computational and memory costs; and weak feature extraction or underused cross-modal information degrades the local detail of the completed point clouds. These issues underscore the need for advanced algorithms that can take full advantage of multi-modal data in point cloud completion tasks.
Despite these advancements, multi-modal methods face challenges such as the heterogeneity between modalities, computational complexity, and the need for high memory capacity [18]. Additionally, the underutilization of multi-modal information can lead to imprecise completion details [19]. To address these shortcomings, our proposed algorithm employs a LiDAR point cloud multi-scale completion algorithm guided by image rotation attention mechanisms, focusing on enhancing feature extraction from both point clouds and images.

3. Methodology

Our proposed network, designed to address the challenges of balancing local and global feature information in point clouds and mitigating the loss of structural information in images, leverages an encoder–decoder architecture to facilitate the completion of LiDAR point clouds with the guidance of image data. The network is divided into four main components: the point cloud multi-scale feature extractor, the image feature extractor, the cross-modal feature fusion (CMF) module, and the decoder. To address the shortcomings of current multi-modal point cloud completion algorithms, this paper proposes a LiDAR point cloud multi-scale completion algorithm based on the image rotation attention mechanism. The specific contributions are (1) the introduction of the Rotating Channel Attention (RCA) module, which enhances the extraction capability of image features by adding a Rotating Channel Attention mechanism to existing image feature extractors and (2) the optimization of the point cloud feature extractor, which employs a multi-scale method for hierarchical feature extraction, both locally and globally, to enhance the capability of point cloud feature extraction. Each component plays a critical role in the pipeline, from feature extraction to the final point cloud completion task.

3.1. Overall Framework of Network

Figure 1 shows the overall framework of the cross-modal multi-scale feature fusion network. The input to the point cloud feature extractor is two point clouds of different scales (P_in^1 and P_in^2), which are then fed into their respective feature extractors to obtain multi-scale feature vectors (F_P^1 and F_P^2). The image feature extractor extracts features from the image data, resulting in the feature vector F_I. The following two subsections of this paper will focus on the image feature extractor and the point cloud multi-scale feature extractor.
The cross-modal feature fusion (CMF) module utilizes cross-attention and self-attention mechanisms [20] to integrate feature information from both modalities. As shown in Figure 1, the image features F_I are fused with the point cloud features F_P^1 and F_P^2 to obtain the fused features F_1 and F_2. The features F_2 are then concatenated with F_1 in parallel and fed into a multilayer perceptron (MLP) layer to obtain the final fused features F_f. This paper employs multi-head attention mechanisms, first distinguishing between the two modality features: the point cloud features F_P^1 and F_P^2 serve as the query matrices Q, while the image features F_I provide the key matrix K and the value matrix V. Three rounds of cross-attention and self-attention are applied, allowing the image features and point cloud features to be gradually fused into the feature vector F_f, which is then fed into the decoder.
The decoder (D) is an important component of the network, aimed at accurately estimating the locations of the missing parts of the point cloud while maintaining both local and global structures. The paper employs a joint feature embedding method to complete the full point cloud. First, the original input data P_in^1 with N points undergoes a second round of farthest point sampling to obtain the sampled point cloud P_fps = FPS_[N/2](P_in^1). The decoder uses k branches to generate the point cloud, with each branch using multiple MLP layers to generate points, which are then combined into a new point cloud and finally concatenated with the farthest-point-sampled points to form the final completed point cloud. The CMFN network proposed in this paper uses the Chamfer distance (CD) loss function [21] to evaluate the similarity between the completed point cloud P_cp and the real point cloud P_gt, guiding the network training.
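To make the decoding step concrete, the following is a minimal PyTorch sketch of a k-branch decoder of the kind described above. It is a sketch under stated assumptions, not the paper's exact implementation: the feature dimension, the number of branches, and the pooling of the fused feature into a per-sample vector are illustrative choices, while the 128 points per branch and the concatenation with the farthest-point-sampled input follow the description in the text.

```python
import torch
import torch.nn as nn

class MultiBranchDecoder(nn.Module):
    """Sketch of a k-branch decoder: each branch maps the fused feature to a
    small patch of 3D points; the patches are concatenated with the
    farthest-point-sampled input to form the completed cloud."""
    def __init__(self, feat_dim=256, num_branches=8, pts_per_branch=128):
        super().__init__()
        self.pts_per_branch = pts_per_branch
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Linear(feat_dim, 512), nn.ReLU(),
                nn.Linear(512, 256), nn.ReLU(),
                nn.Linear(256, pts_per_branch * 3),
            ) for _ in range(num_branches)
        ])

    def forward(self, fused_feat, fps_points):
        # fused_feat: [B, feat_dim] pooled fused feature (assumption)
        # fps_points: [B, M, 3] farthest-point-sampled subset of the input
        patches = [b(fused_feat).view(-1, self.pts_per_branch, 3)
                   for b in self.branches]
        generated = torch.cat(patches, dim=1)             # [B, k*pts_per_branch, 3]
        return torch.cat([generated, fps_points], dim=1)  # completed point cloud
```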

3.2. Point Cloud Multi-Scale Extractor

The point cloud multi-scale feature extractor (MSFE) is a vital component of our proposed network, which is designed to effectively handle point clouds at different scales. This extractor takes as input two point clouds of varying resolutions, P_in^1 and P_in^2, where the number of points is defined as P_in = [N, N/4] = [2048, 512]. Here, N represents the total number of points in the original point cloud, while the second-scale point cloud P_in^2 is derived through farthest point sampling (FPS) [7], which is a downsampling technique that selects points from the original point cloud to preserve its spatial distribution. The FPS algorithm begins by randomly selecting an initial point from the point cloud. Subsequently, it iteratively selects the point that is farthest from the already chosen points. This approach ensures that the selected points are well distributed across the point cloud, maintaining the overall shape and structure.
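A minimal sketch of this sampling procedure is given below (plain PyTorch, single cloud, squared distances for the farthest-point test; the function and variable names are illustrative):

```python
import torch

def farthest_point_sampling(points: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Minimal FPS: points is [N, 3]; returns indices of num_samples points
    chosen so that each new point is farthest from those already selected."""
    n = points.shape[0]
    selected = torch.zeros(num_samples, dtype=torch.long)
    dist = torch.full((n,), float("inf"))        # distance to the selected set
    selected[0] = torch.randint(n, (1,))         # random initial point
    for i in range(1, num_samples):
        last = points[selected[i - 1]]           # most recently added point
        dist = torch.minimum(dist, torch.sum((points - last) ** 2, dim=1))
        selected[i] = torch.argmax(dist)         # farthest from the selected set
    return selected

# e.g., downsample a 2048-point cloud to 512 points:
# idx = farthest_point_sampling(cloud, 512); cloud_small = cloud[idx]
```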
This method effectively captures the geometric characteristics of the original point cloud while reducing its size, facilitating more efficient processing in subsequent stages. The input data of different scales are processed through the Feature Embedding Layer (FEL), resulting in point cloud feature vectors F_P^1 and F_P^2 of the same dimensionality. The FEL employs Dynamic Graph Convolution (DGCNN) [22] to extract features from the point cloud data.
The purpose of employing multi-scale feature extraction is to capture both local and global features of the point cloud effectively. The original point cloud P_in^1 retains fine details such as edges and surfaces, while the downsampled point cloud P_in^2 captures broader structural characteristics. By processing the point cloud at these two scales, we can extract features that represent varying levels of detail. This multi-scale approach ensures a comprehensive representation of the point cloud, which is crucial for accurate point cloud completion. The combination of features extracted from both scales enhances the network's ability to reconstruct missing parts of the point cloud, leading to improved performance in subsequent tasks.
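The FEL mentioned above builds on DGCNN-style dynamic graph convolution. The following is a simplified single EdgeConv block illustrating that idea rather than the authors' exact layer; the neighbourhood size k = 20 and the one-layer MLP are assumptions.

```python
import torch
import torch.nn as nn

def knn_indices(x: torch.Tensor, k: int) -> torch.Tensor:
    # x: [B, N, C]; returns [B, N, k] indices of the k nearest neighbours
    dist = torch.cdist(x, x)                                  # pairwise distances
    return dist.topk(k + 1, largest=False).indices[..., 1:]   # drop self

class EdgeConv(nn.Module):
    """Single EdgeConv block in the spirit of DGCNN: build a kNN graph, form
    edge features [x_i, x_j - x_i], apply a shared MLP, max-pool over neighbours."""
    def __init__(self, in_dim, out_dim, k=20):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x):                         # x: [B, N, C]
        idx = knn_indices(x, self.k)              # [B, N, k]
        neighbours = torch.gather(
            x.unsqueeze(1).expand(-1, x.shape[1], -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, x.shape[-1]))   # [B, N, k, C]
        center = x.unsqueeze(2).expand_as(neighbours)
        edge = torch.cat([center, neighbours - center], dim=-1)  # [B, N, k, 2C]
        return self.mlp(edge).max(dim=2).values   # [B, N, out_dim]
```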

3.3. Image Feature Extractor

To better utilize the texture and structural information of images to assist in the completion of incomplete point clouds with higher accuracy, this paper proposes an image feature extractor based on the Rotating Channel Attention (RCA) mechanism.
As shown in Figure 2, the Rotating Channel Attention (RCA) module consists of three parallel branches. The first and second branches involve rotation and interaction between the channel dimension C and the spatial dimensions H and W (rotating 90° clockwise around the z-axis and y-axis, respectively), achieving information exchange between different dimensions. The last branch constructs spatial attention using the spatial dimensions H and W (rotating 90° clockwise around the x-axis), focusing on local information. The outputs of the three branches are then aggregated by averaging to produce the final output.
The calculation process is as follows:
y_1 = \mathrm{Conv}[R_x(I_{in})] \cdot I_{in} = W_0 \cdot I_{in}
y_2 = \mathrm{Conv}[R_y(I_{in})] \cdot I_{in} = W_1 \cdot I_{in}
y_3 = \mathrm{Conv}[R_z(I_{in})] \cdot I_{in} = W_2 \cdot I_{in}
y = \mathrm{Aggr}\{y_1, y_2, y_3\}
where R_x(·), R_y(·), and R_z(·) represent the rotation operations around the x-axis, y-axis, and z-axis, respectively, and Conv(·) denotes the convolution operation. The coefficient matrices obtained after rotation and convolution are represented by W_0, W_1, and W_2. The Aggr(·) operation denotes the aggregation of the three branch outputs to generate the final result y.
The image input I_in is processed through the RCA operation before being fed into the ResNet18 [23] network, ultimately yielding the image feature F_I. The RCA module shares information between different channels of the image, enriching the image information in each channel and further enhancing the cross-modal feature fusion effect.
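The following is a simplified sketch of how such a three-branch rotation attention could be implemented in PyTorch. It is an approximation of the description above, not the authors' implementation: the Z-pool compression and the 7 × 7 convolution are borrowed from triplet-style attention as assumptions, while the gating of the input by each rotated branch and the averaging mirror y = Aggr{y1, y2, y3}.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    # Compress the leading (rotated-in) dimension to 2 channels: max and mean.
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class RCABranch(nn.Module):
    """One branch: permute (C, H, W) so a chosen pair of dimensions interacts,
    build an attention map with a small convolution, permute back, gate the input."""
    def __init__(self, perm):
        super().__init__()
        self.perm = perm
        self.inv = [perm.index(i) for i in range(4)]
        self.attn = nn.Sequential(ZPool(),
                                  nn.Conv2d(2, 1, kernel_size=7, padding=3),
                                  nn.Sigmoid())

    def forward(self, x):                  # x: [B, C, H, W]
        rotated = x.permute(*self.perm)    # interaction between two dimensions
        weight = self.attn(rotated)        # [B, 1, *, *] attention map
        return x * weight.permute(*self.inv)

class RCA(nn.Module):
    """Sketch of Rotating Channel Attention: three rotated branches whose
    gated outputs are averaged."""
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList([
            RCABranch((0, 2, 1, 3)),       # interaction between C and H
            RCABranch((0, 3, 2, 1)),       # interaction between C and W
            RCABranch((0, 1, 2, 3)),       # plain spatial attention over H, W
        ])

    def forward(self, x):
        return sum(b(x) for b in self.branches) / len(self.branches)
```

In this sketch the RCA output has the same shape as its input, so it can be placed in front of a standard ResNet18 backbone as described above.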

3.4. Cross-Modal Feature Fusion (CMF)

The Cross-Modal Feature Fusion (CMF) module integrates features from the two modalities using cross-attention and self-attention mechanisms. As shown in Figure 3, the multi-scale point cloud data P_in^1 and P_in^2 and the image data I_in are processed by their respective feature extractors to obtain the point cloud features F_P^1 and F_P^2 and the image features F_I. In the CMF module, F_I is fused with F_P^1 and F_P^2 sequentially to obtain the fused features F_1 and F_2. These are then concatenated and passed through an MLP layer to produce the final fused feature F_f.
Multi-head attention is used to enhance feature representation and network robustness. The image features F_I serve as the key (K) and value (V) matrices, while the point cloud features F_P^1 and F_P^2 act as the query matrices Q. Taking the F_P^1 branch as an example,
(Q, K, V) = (F_P^1, F_I, F_I) \times (W_q, W_k, W_v)
Q, K \in \mathbb{R}^{N \times d_a}, \quad V \in \mathbb{R}^{N \times d_v}
Here, W_q, W_k, and W_v are learnable linear weights. The attention weight coefficient matrix \tilde{a}_{ij} is computed using Q and K and then normalized via softmax:
\tilde{W} = \tilde{a}_{ij} = Q \cdot K^T
\bar{a}_{ij} = \mathrm{softmax}(\tilde{a}_{ij}) = \frac{\exp(\tilde{a}_{ij})}{\sum_k \exp(\tilde{a}_{kj})}
W = a_{ij} = \frac{\bar{a}_{ij}}{\sum_k \bar{a}_{ik}}
F_1 = W \cdot V
Using image features as K and V matrices, and point cloud features as Q matrices allows the rich texture and structural information from images to guide the completion of incomplete point clouds. Multiple cross-attention and self-attention operations enable gradual fusion of image and point cloud features.
The fused features F_1 and F_2 from the two scales are concatenated and fed into an MLP layer to obtain the final encoded feature F_f:
F_f = \mathrm{MLP}(\mathrm{Cat}(F_1, F_2))
where Cat(·) represents the concatenation of the two feature matrices. F_1 and F_2 each have dimensions [B, 1, N, F]; their concatenation has dimensions [B, 2, N, F] and is mapped to F_f with dimensions [B, N, F], which contains rich global and local feature information.
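A condensed PyTorch sketch of this fusion scheme is given below. It uses the standard nn.MultiheadAttention module with point cloud tokens as queries and image tokens as keys/values; the feature width of 128, four heads, three rounds, and the way the two scales are merged are assumptions consistent with the shapes reported in Section 4.1, not an exact reproduction of the paper's code.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """One CMF stage: cross-attention with point-cloud features as queries and
    image features as keys/values, followed by self-attention."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_point, f_image):
        # f_point: [B, Np, dim] point-cloud tokens; f_image: [B, Ni, dim] image tokens
        fused, _ = self.cross(query=f_point, key=f_image, value=f_image)
        fused, _ = self.self_attn(fused, fused, fused)
        return fused

class CMF(nn.Module):
    """Fuse both point-cloud scales with the image features, then merge them,
    mirroring F_f = MLP(Cat(F_1, F_2)). Both scales are assumed to carry the
    same number of tokens, as stated in Section 3.2."""
    def __init__(self, dim=128, heads=4, rounds=3):
        super().__init__()
        self.stages1 = nn.ModuleList([CrossModalFusion(dim, heads) for _ in range(rounds)])
        self.stages2 = nn.ModuleList([CrossModalFusion(dim, heads) for _ in range(rounds)])
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f_p1, f_p2, f_img):
        f1, f2 = f_p1, f_p2
        for s1, s2 in zip(self.stages1, self.stages2):
            f1, f2 = s1(f1, f_img), s2(f2, f_img)   # gradual cross-modal fusion
        return self.mlp(torch.cat([f1, f2], dim=-1))  # [B, N, dim] fused feature F_f
```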

3.5. Point Cloud Similarity Evaluation Metrics

The network model evaluation metrics employed in this study are the Chamfer distance (CD) and the F-Score [4]. The CD loss function is used in this work to guide the network training. The similarity evaluation criterion between the completed point cloud and the real point cloud is based on the Chamfer distance, which is defined as follows:
L_{CD}(Y, \hat{Y}) = \frac{1}{2N} \left( \sum_{y \in Y} \min_{\hat{y} \in \hat{Y}} \| y - \hat{y} \| + \sum_{\hat{y} \in \hat{Y}} \min_{y \in Y} \| \hat{y} - y \| \right)
where Y represents the real point cloud, y a point in the real point cloud, \hat{Y} the completed point cloud, \hat{y} a point in the completed point cloud, and N the number of points in the point cloud. When training with the CD loss function, the goal of the optimization algorithm is to minimize the CD loss, thereby allowing the completed point cloud to better match the real point cloud.
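A compact PyTorch version of this loss is sketched below. It uses unsquared Euclidean distances as in the formula and averages each direction over its own point count, which coincides with the 1/(2N) normalization when both clouds contain N points; many implementations use squared distances instead.

```python
import torch

def chamfer_distance(y: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between a real cloud y [B, N, 3] and a
    completed cloud y_hat [B, M, 3]."""
    dist = torch.cdist(y, y_hat)                   # [B, N, M] pairwise distances
    forward = dist.min(dim=2).values.mean(dim=1)   # y -> nearest point in y_hat
    backward = dist.min(dim=1).values.mean(dim=1)  # y_hat -> nearest point in y
    return 0.5 * (forward + backward).mean()       # average over the batch
```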
The F-Score is a shape-matching-based point cloud similarity evaluation metric used to detect any shape deviations between two point clouds. The calculation steps for the F-Score are as follows:
  • For each point p_real in the real point cloud, find the closest point p_completed in the completed point cloud;
  • If the Euclidean distance between p_real and its closest point p_completed is less than a threshold (in the experiments, the threshold is set to 0.001 times the diameter of the point cloud), then p_real is considered to be successfully matched; otherwise, it is considered a false match;
  • Calculate the number of true matches (TP), false matches (FP), and the number of points not successfully matched (FN);
  • Based on the definitions of precision and recall, compute the F-Score:
\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}
\mathrm{F\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
where precision represents the proportion of matches that are correct, and recall represents the proportion of real points that are successfully matched. The F-Score takes both precision and recall into account, providing a comprehensive measure of the similarity between two point clouds.
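For reference, a minimal implementation of the standard reconstruction F-Score is sketched below; it counts a point as matched when its nearest neighbour in the other cloud lies within the threshold, which corresponds to the precision/recall definitions above with false matches counted from the completed cloud. The function name and the per-cloud layout are illustrative.

```python
import torch

def f_score(real: torch.Tensor, completed: torch.Tensor, threshold: float) -> float:
    """F-Score between two clouds real [N, 3] and completed [M, 3]; the
    threshold is, e.g., 0.001 times the point-cloud diameter as in the experiments."""
    dist = torch.cdist(real, completed)                               # [N, M]
    recall = (dist.min(dim=1).values < threshold).float().mean()      # real -> completed
    precision = (dist.min(dim=0).values < threshold).float().mean()   # completed -> real
    if precision + recall == 0:
        return 0.0
    return (2 * precision * recall / (precision + recall)).item()
```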
In point cloud completion tasks, Chamfer distance (CD) and F-Score are commonly used evaluation metrics that assess model performance from different angles. CD primarily measures the geometric distance between the generated and real point clouds, focusing on geometric accuracy and detail preservation. It calculates the average of the nearest neighbor distances between points in the two point clouds. A smaller CD value indicates better geometric matching and richer details. F-Score, on the other hand, evaluates the overall distribution and integrity of the completed point cloud by considering both precision and recall. It assesses the percentage of correctly reconstructed points or surfaces within a certain distance threshold. A higher F-Score means the generated point cloud has a more reasonable distribution, with fewer outliers or holes.
In practice, these two metrics are often used together because they provide complementary information about the model’s performance, allowing for a comprehensive assessment of the effectiveness of point cloud completion algorithms.

4. Results and Discussion

To test the effectiveness of the network algorithm, all experiments in this paper are conducted on the public datasets ShapeNet-ViPC and ModelNet40ViPC, the latter of which is built from ModelNet40 [24]. Additionally, this paper's algorithm is compared with recent deep learning methods for point cloud completion, and the effectiveness of each module within the network is verified through ablation experiments.

4.1. Datasets and Experimental Configuration

The network algorithm proposed in this paper is trained and tested on the public ShapeNet-ViPC and ModelNet40ViPC datasets to evaluate its effectiveness. The ShapeNet-ViPC dataset comprises 13 object categories with a total of 38,328 objects, each featuring 24 incomplete point clouds with different occlusion perspectives and 24 corresponding real point clouds; for each of the 24 viewpoints there is a corresponding image in PNG format. The real point clouds contain 3500 points each, the incomplete point clouds contain 2048 points each, and the images are 224 × 224 pixels. Following convention, 80% of the data is used for training and 20% for testing. The ModelNet40ViPC dataset is based on the CAD models from ModelNet40 [24] and includes 12,311 objects from 40 different categories. Each object in this dataset contains 32 views of incomplete point clouds.
Experimental Conditions: The experiments in this paper are conducted on the Ubuntu operating system using the PyTorch 1.10.2 framework with the corresponding CUDA version; the GPU used is an NVIDIA V100 provided by the AutoDL computing platform. The experimental parameter settings are shown in Table 1.
The input point cloud data is given by P_in = [N, N/4] = [2048, 512]. After passing through the point cloud multi-scale feature extractor, the extracted features have dimensions [128, 256, 128]. The image feature extractor uses a ResNet18 network embedded with the RCA module, with image input I_in = [128, 3, 224 × 224] and output F_I = [128, 256, 14 × 14]. In the cross-modal feature fusion (CMF) module, the number of attention heads is set to 4; with inputs F_scale1 and F_I, fused features F_fused1 and F_fused2 are produced at different scales, both with dimensions [128, 256, 128]. In the decoder, the original point cloud undergoes farthest point sampling to obtain the sampled points P_fps. The decoder has k branches, each producing 128 points, and the final completed point cloud \hat{Y} with 2048 points is formed by combining the points from all branches with the sampled points.
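As an illustration of these settings, a training loop following Table 1 might be configured as below; model, train_loader, and chamfer_distance are placeholders standing in for the CMFN network, the ViPC-style data pipeline, and the CD loss of Section 3.5, and are not part of the paper's released code.

```python
import torch

def train(model, train_loader, chamfer_distance, epochs=200, device="cuda"):
    """Training configuration per Table 1: Adam, initial LR 0.001, decayed to
    1/10 at epochs 25 and 125; the batch size (128) is set in the data loader."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[25, 125], gamma=0.1)
    for epoch in range(epochs):
        for partial, image, gt in train_loader:   # incomplete cloud, view image, ground truth
            partial, image, gt = partial.to(device), image.to(device), gt.to(device)
            optimizer.zero_grad()
            completed = model(partial, image)     # completed point cloud
            loss = chamfer_distance(gt, completed)
            loss.backward()
            optimizer.step()
        scheduler.step()
```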

4.2. Experiment on ShapeNetViPC

This paper utilizes the Chamfer distance (CD) loss to guide the network training and test the effectiveness of the network. As previously mentioned, the principle of calculating the CD loss function is such that a smaller CD distance indicates greater similarity between two point clouds. Table 2 presents a comparison of the completion results between our CMFN algorithm and other completion algorithms. The bolded numbers indicate the best completion results for that category.
From Table 2, it can be observed that on the ShapeNet-ViPC dataset, multi-modal point cloud completion methods outperform single-modal point cloud completion methods, and our proposed CMFN network has achieved state-of-the-art (SOTA) performance among both single-modal and multi-modal algorithms. The completion effects for objects in eight categories show a certain improvement over the latest multi-modal completion algorithm, XMFnet, especially in the category of lamps where the CD evaluation metric decreased by 41.38%, and the average metric across the eight categories decreased by 11.71%.
Table 3 presents a comparison of the results from nine point cloud completion algorithms on the ShapeNet-ViPC dataset, using the F-Score as the evaluation metric. The bolded numbers indicate the best completion results for each category.
It can be observed from Table 3 that the CMFN network proposed in this paper achieves the best F-Score in seven categories, with an average improvement of 3.1% over the XMFnet network and a significant enhancement of 10.98% in the lamp category. It is noteworthy that the gain under the F-Score metric is smaller than that under the CD metric. This is because our network is trained and tested with the CD loss function, which tends to produce less uniform point distributions; consequently, networks trained with the CD loss may score lower under the shape-matching F-Score metric. Nonetheless, our algorithm still demonstrates an improvement in this metric.

4.3. Experiment on ModelNet40ViPC

The ModelNet40ViPC dataset [25] is based on the CAD models from ModelNet40 [24] and includes 12,311 objects from 40 different categories. The 40 categories are airplane, bathtub, bed, bench, bookshelf, bottle, car, chair, bowl, cone, cup, curtain, desk, door, dresser, flower_pot, glass_box, guitar, keyboard, lamp, laptop, mantel, monitor, night_stand, person, piano, plant, radio, range_hood, sink, sofa, stairs, stool, table, tent, toilet, tv_stand, vase, wardrobe, and xbox.
The point cloud incompleteness is primarily due to self-occlusion from the viewpoint, which is consistent with real-world point cloud sampling. The 32 different viewpoints are uniformly distributed on a spherical surface around the object model, as shown in Figure 4. The number of points in the incomplete point clouds is variable; to verify the robustness of the network algorithm, the point cloud incompleteness rate is set between 60% and 90%. The image data are consistent with the point cloud data: 32 images are rendered from the same 32 viewpoints, with a pixel size of 224 × 224 and in PNG format, as shown in Figure 5. The real point clouds in this dataset are obtained through Poisson sampling from the CAD models in ModelNet40, with point cloud sizes of 3500 and 2045, as shown in Figure 6. During training, the network uses point clouds with 3500 points, which are downsampled to 2048 points using farthest point sampling, focusing on the distribution of each category's point cloud data. During testing, the real point clouds use 2048 points to more accurately test the completion performance of each algorithm. The ModelNet40ViPC dataset contains a total of 393,952 incomplete point cloud and image samples and 24,622 real point cloud samples. In this paper, 50% of the samples are used for training and 50% for testing, using a unified training method to verify the completion performance of the network algorithm.
Table 4 presents the CD results for various algorithms on the ModelNet40ViPC dataset. The proposed CMFN network achieved the best performance in most categories, demonstrating superior completion accuracy compared to other methods. For instance, in the laptop category, CMFN achieved a CD value of 1.288, which is significantly lower than ViPC (10.552), CSDN (2.960), and XMFnet (1.531). This indicates that CMFN is more effective in reconstructing the fine details and overall structure of the laptop, resulting in a more accurate and complete point cloud. In the table category, CMFN achieved a CD value of 1.840, outperforming ViPC (9.296), CSDN (3.419), and XMFnet (2.107). This suggests that CMFN is better at capturing the structural details of the table, such as the tabletop and legs, leading to a more precise reconstruction. In the chair category, CMFN achieved a CD value of 2.580, which is better than ViPC (9.742), CSDN (4.792), and XMFnet (3.004). This demonstrates CMFN's ability to accurately reconstruct the chair's structure, including the seat, backrest, and legs. Overall, the CMFN network achieved the lowest average CD value of 2.669, compared to ViPC (9.990), CSDN (5.2528), and XMFnet (3.080). This indicates that the proposed network is more effective in reconstructing the missing parts of point clouds, resulting in a more accurate and complete representation of the objects.
Table 5 presents the F-Score results for various algorithms on the ModelNet40ViPC dataset. The CMFN network achieved the highest F-Score in most categories, demonstrating superior performance in terms of point cloud density and distribution. For example, in the laptop category, CMFN achieved an F-Score of 0.929, which is higher than ViPC (0.407), CSDN (0.578), and XMFnet (0.696). This indicates that CMFN not only reconstructs the missing parts accurately but also maintains a uniform density distribution in the completed point cloud, resulting in a more realistic and detailed representation of the laptop. In the table category, CMFN achieved an F-Score of 0.835, outperforming ViPC (0.467), CSDN (0.593), and XMFnet (0.710). This suggests that CMFN is better at preserving the density and distribution of points in the table's structure, leading to a more accurate and complete reconstruction. In the chair category, CMFN achieved an F-Score of 0.879, which is better than ViPC (0.426), CSDN (0.432), and XMFnet (0.706). This demonstrates CMFN's ability to maintain a uniform density distribution in the completed point cloud, resulting in a more accurate and detailed representation of the chair. Overall, the CMFN network achieved the highest average F-Score of 0.929, compared to ViPC (0.551), CSDN (0.518), and XMFnet (0.901). This indicates that the proposed network not only reconstructs the missing parts accurately but also maintains a uniform density distribution in the completed point clouds, leading to a more realistic and detailed representation of the objects.
The CMFN network’s superior performance is attributed to its multi-scale feature extraction, which processes point clouds at different scales (2048 and 512 points) to capture both local and global features, ensuring a comprehensive representation. The use of farthest point sampling (FPS) preserves the spatial distribution and overall structure of the point cloud. The RCA module enhances image feature extraction by sharing information between channels, enriching the feature representation and providing more accurate guidance for completion. The multi-level self-attention mechanisms in the CMF module gradually fuse image and point cloud features, leveraging complementary information from both modalities. The decoder’s multi-branch completion method, combined with Chamfer distance loss during training, ensures accurate reconstruction by minimizing the distance between the completed and real point clouds. Additionally, the network demonstrates robustness and generalization across various categories, effectively handling different types of incomplete point clouds and maintaining local and global structures during reconstruction.

4.4. Visualization

The visual comparison between the network algorithm proposed in this paper, CMFN, and the benchmark algorithm XMFnet is shown in Figure 7. The figure presents the incomplete point clouds for six categories, the completion results from the XMFnet network, the completion results from the CMFN network, and the visualization of the real point clouds. By comparing the visual results from both networks with the real point clouds, it can be observed that the completion effect of the CMFN network is generally superior to that of the XMFnet network, with more distinct structural details and a more uniform density distribution in the point clouds. In the airplane category completion, the CMFN network provides more precise details compared to XMFnet, with the lower engine contour details being more similar to the real point cloud. The XMFnet network failed to complete the shape of the lower wing of the airplane, while the CMFN network captured the details of the lower wing, and the comparison of the nose and tail completion also reflects the advantages of the network proposed in this paper. In the table lamp category, the CMFN network’s completion result is closer to the real point cloud, whereas the XMFnet network’s completion result has more outliers in the lampshade and fails to complete the details of the lamp column and base contour. The CMFN network not only completes the structural details of the missing parts but also outperforms the former in terms of uniform point cloud density distribution. Additionally, in the table category, it can be observed that the network proposed in this paper significantly outperforms the XMFnet network in completing the contour of the missing tabletop area, which is more similar to the real point cloud contour, and the point cloud density distribution on the tabletop is more uniform.

4.5. Ablation

The ablation studies were conducted to validate the effectiveness of the multi-scale module and the Rotating Channel Attention (RCA) module in the CMFN network. The experiments were performed on the ShapeNet-ViPC dataset, and the results were compared with the benchmark network XMFnet. Four network configurations were evaluated: (1) the complete CMFN model, with both the RCA module and the multi-scale module; (2) a model with only the RCA module for image feature extraction; (3) a model with only the multi-scale module for point cloud feature extraction; and (4) XMFnet, the benchmark network used for comparison.
The experimental results are compared in Table 6. It is worth noting that while validating the effectiveness of the multi-scale module, this paper also verified the impact of different point cloud numbers at the second scale on the experimental results.
The first-scale point cloud input is fixed at 2048 points, which helps preserve the original structural information during feature extraction, while the second-scale input is compared at 512 and 1024 points. After multiple graph convolutions, the 512-point resolution better captures the global features of the point cloud: the ablation experiments show that a second-scale input of 512 points outperforms 1024 points, indicating that 512 points are more effective for extracting global feature information.
The RCA module enhances image feature extraction by sharing information between channels, which significantly improves the network’s ability to capture structural information from images. This is evident in the results for the lamp category, where the RCA module achieved a CD value of 1.490, compared to 1.810 for XMFnet. The improvement can be attributed to the RCA module’s ability to effectively fuse multi-channel information, enriching the feature representation of the image data. This enhanced feature representation provides more accurate guidance for the point cloud completion task, leading to better performance.
The multi-scale module, using DGCNN with KNN clustering, effectively extracts both local and global point cloud features. The first scale, set to 2048 points, preserves the original point cloud’s structural information, while the second scale, set to 512 points, captures global features after multiple graph convolutions. The ablation study shows that the 512-point configuration outperforms the 1024-point configuration, confirming its higher global feature extraction capability. For example, in the lamp category, the multi-scale module with 512 points achieved a CD value of 1.327, compared to 1.452 for 1024 points. This indicates that the 512-point configuration is more effective in capturing global features, which are crucial for accurate point cloud completion.
The complete CMFN model, combining both the RCA module and the multi-scale module, achieved the best performance across all categories. For instance, in the lamp category, the CMFN model achieved a CD value of 1.061, which was significantly better than the RCA module alone (1.490), the multi-scale module alone (1.327), and XMFnet (1.810). Similarly, in the watercraft category, the CMFN model achieved a CD value of 0.788, compared to 0.823 for the RCA module, 0.828 for the multi-scale module, and 0.945 for XMFnet. In the cabinet category, the CMFN model achieved a CD value of 1.796, compared to 1.936 for the RCA module, 1.877 for the multi-scale module, and 1.980 for XMFnet.
The combined effect of the RCA and multi-scale modules can be attributed to their complementary nature. The RCA module enhances image feature extraction, providing rich structural information that guides the point cloud completion process. The multi-scale module, on the other hand, ensures that both local and global features of the point cloud are effectively captured and utilized. This synergy between the two modules leads to a more comprehensive and accurate representation of the point cloud, resulting in superior completion performance. The ablation studies clearly demonstrate the effectiveness of both the RCA module and the multi-scale module in the CMFN network. The RCA module enhances image feature extraction, while the multi-scale module improves point cloud feature extraction. The combination of these two modules in the complete CMFN model results in the best performance, indicating that they complement each other and significantly enhance point cloud completion accuracy.

5. Conclusions

This paper presents a novel point cloud shape completion network that enhances data feature extraction capabilities by employing a multi-modal feature fusion approach within an encoder–decoder structure. An RCA (Rotating Channel Attention) module is integrated into the image feature extraction process to improve the extraction of image features, leveraging the textural and structural information of images to assist in the extraction of point cloud features. In the point cloud feature extraction process, a multi-scale method is utilized to enhance the extraction capabilities, not only increasing the inter-point relatedness within each sample’s point cloud but also integrating the correlation between local and global feature information, thereby generating point clouds with more accurate structural information.
This paper employs a multilayer cross-attention mechanism to gradually fuse image features and point cloud features, achieving a multi-modal information feature-level fusion effect. The proposed algorithm compares favorably with both single-modal and multi-modal point cloud completion algorithms, achieving the best results. Specifically, when compared with the most recent multi-modal point cloud completion algorithm XMFnet, the average CD value was reduced by 11.71%, with a 41.38% reduction in the CD value for the lamp category. It is noteworthy that the proposed algorithm exhibited issues of local point cloud density inconsistency and a higher number of outliers during point cloud completion. Analysis indicates that the inconsistency in completion point cloud density is due to the use of the CD loss function to guide network training, while the excess of outliers may be attributed to insufficient local structural detail constraints during the point cloud generation process. Therefore, designing a more effective point cloud similarity evaluation method and a decoder with multi-modal feature constraints will be one of the future research directions in point cloud completion.

Author Contributions

Conceptualization, S.G. and J.W.; methodology, S.G. and K.X.; software, S.G. and B.H.; validation, B.H. and Y.M.; formal analysis, S.G.; resources, J.W.; data curation, B.H.; writing—original draft preparation, B.H.; writing—review and editing, S.G., B.H., and Y.M.; visualization, B.H.; supervision, J.W.; funding acquisition, K.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Project: Research on High-Quality Perception of Targets in Dynamically Complex Scenes), grant number U24B20138.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep Learning for 3D Point Clouds: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4338–4364. [Google Scholar] [CrossRef] [PubMed]
  2. Mitra, N.J.; Pauly, M.; Wand, M.; Ceylan, D. Symmetry in 3D Geometry: Extraction and Applications. Comput. Graph. Forum 2013, 32, 1–23. [Google Scholar] [CrossRef]
  3. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 77–85. [Google Scholar] [CrossRef]
  4. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; pp. 5099–5108. [Google Scholar]
  5. Zhang, X.; Feng, Y.; Li, S.; Zou, C.; Wan, H.; Zhao, X.; Guo, Y.; Gao, Y. View-Guided Point Cloud Completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, 19–25 June 2021; Computer Vision Foundation/IEEE: Piscataway, NJ, USA, 2021; pp. 15890–15899. [Google Scholar] [CrossRef]
  6. Zhu, Z.; Nan, L.; Xie, H.; Chen, H.; Wang, J.; Wei, M.; Qin, J. CSDN: Cross-Modal Shape-Transfer Dual-Refinement Network for Point Cloud Completion. IEEE Trans. Vis. Comput. Graph. 2024, 30, 3545–3563. [Google Scholar] [CrossRef] [PubMed]
  7. Tchapmi, L.P.; Kosaraju, V.; Rezatofighi, H.; Reid, I.D.; Savarese, S. TopNet: Structural Point Cloud Decoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; Computer Vision Foundation/IEEE: Piscataway, NJ, USA, 2019; pp. 383–392. [Google Scholar] [CrossRef]
  8. Schnabel, R.; Degener, P.; Klein, R. Completion and Reconstruction with Primitive Shapes. Comput. Graph. Forum 2009, 28, 503–512. [Google Scholar] [CrossRef]
  9. Liu, M.; Sheng, L.; Yang, S.; Shao, J.; Hu, S. Morphing and Sampling Network for Dense Point Cloud Completion. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, 7–12 February 2020; AAAI Press: Washington, DC, USA, 2020; pp. 11596–11603. [Google Scholar] [CrossRef]
  10. Yuan, W.; Khot, T.; Held, D.; Mertz, C.; Hebert, M. PCN: Point Completion Network. In Proceedings of the 2018 International Conference on 3D Vision, 3DV 2018, Verona, Italy, 5–8 September 2018; IEEE Computer Society: Washington, DC, USA, 2018; pp. 728–737. [Google Scholar] [CrossRef]
  11. Yu, X.; Rao, Y.; Wang, Z.; Liu, Z.; Lu, J.; Zhou, J. PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 12478–12487. [Google Scholar] [CrossRef]
  12. Wang, J.; Cui, Y.; Guo, D.; Li, J.; Liu, Q.; Shen, C. PointAttN: You Only Need Attention for Point Cloud Completion. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M.J., Dy, J.G., Natarajan, S., Eds.; AAAI Press: Washington, DC, USA, 2024; pp. 5472–5480. [Google Scholar] [CrossRef]
  13. Xiang, P.; Wen, X.; Liu, Y.; Cao, Y.; Wan, P.; Zheng, W.; Han, Z. SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5479–5489. [Google Scholar] [CrossRef]
  14. Yang, Y.; Feng, C.; Shen, Y.; Tian, D. FoldingNet: Point Cloud Auto-Encoder via Deep Grid Deformation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; Computer Vision Foundation/IEEE Computer Society: Piscataway, NJ, USA, 2018; pp. 206–215. [Google Scholar] [CrossRef]
  15. Pan, L. ECG: Edge-aware Point Cloud Completion with Graph Convolution. IEEE Robot. Autom. Lett. 2020, 5, 4392–4398. [Google Scholar] [CrossRef]
  16. Pan, L.; Chen, X.; Cai, Z.; Zhang, J.; Zhao, H.; Yi, S.; Liu, Z. Variational Relational Point Completion Network for Robust 3D Classification. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11340–11351. [Google Scholar] [CrossRef] [PubMed]
  17. Aiello, E.; Valsesia, D.; Magli, E. Cross-modal Learning for Image-Guided Point Cloud Shape Completion. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; [Google Scholar]
  18. Gharineiat, Z.; Kurdi, F.T.; Campbell, G. Review of Automatic Processing of Topography and Surface Feature Identification LiDAR Data Using Machine Learning Techniques. Remote Sens. 2022, 14, 4685. [Google Scholar] [CrossRef]
  19. Michalowska, M.; Rapinski, J. A Review of Tree Species Classification Based on Airborne LiDAR Data and Applied Classifiers. Remote Sens. 2021, 13, 353. [Google Scholar] [CrossRef]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; pp. 5998–6008. [Google Scholar]
  21. Fan, H.; Su, H.; Guibas, L.J. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Piscataway, NJ, USA, 2017; pp. 2463–2471. [Google Scholar] [CrossRef]
  22. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 146:1–146:12. [Google Scholar] [CrossRef]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the Computer Vision—ECCV 2016—14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV. Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2016; Volume 9908, pp. 630–645. [Google Scholar] [CrossRef]
  24. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015; IEEE Computer Society: Piscataway, NJ, USA, 2015; pp. 1912–1920. [Google Scholar] [CrossRef]
  25. Liu, X.; Hou, B.; Wang, H.; Xu, K.; Wan, J.; Guo, Y. DuInNet: Dual-Modality Feature Interaction for Point Cloud Completion. arXiv 2024, arXiv:2407.07374. [Google Scholar]
Figure 1. Overall frame diagram of cross-modal multi-scale feature fusion.
Figure 2. Schematic diagram of rotating channel attention.
Figure 3. Cross-Modal Feature Fusion network architecture.
Figure 4. The 32 different viewpoints used to complete the point cloud.
Figure 5. Image data sampled from 32 different viewpoints.
Figure 6. Incomplete point clouds sampled with Poisson sampling from CAD models.
Figure 7. Visualization results comparison chart of CMFN with different methods.
Table 1. Experimental parameter settings.
Parameter | Value/Description
Optimizer | Adaptive Moment Estimation (Adam) [23]
Initial Learning Rate | 0.001
Learning Rate Decay | Reduced to 1/10 of its value at the 25th and 125th epochs
Batch Size | 128
Total Epochs | 200
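For readers who want to reproduce the schedule in Table 1, the following minimal PyTorch sketch wires the same hyperparameters together (Adam, learning rate 0.001, 1/10 decay at epochs 25 and 125, batch size 128, 200 epochs). The network, data, and loss are hypothetical stand-ins (a tiny MLP, random tensors, and an MSE loss); the actual CMFN model, ShapeNet-ViPC loader, and Chamfer loss are not reproduced here.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins for the completion network and the training data.
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
partial = torch.randn(1024, 3)   # fake "partial" points
target = torch.randn(1024, 3)    # fake "complete" points
loader = DataLoader(TensorDataset(partial, target), batch_size=128, shuffle=True)

optimizer = optim.Adam(model.parameters(), lr=1e-3)                  # initial LR 0.001
scheduler = MultiStepLR(optimizer, milestones=[25, 125], gamma=0.1)  # decay to 1/10

for epoch in range(200):                                             # 200 epochs in total
    for x, y in loader:
        optimizer.zero_grad()
        # Placeholder objective; the paper optimizes a Chamfer-distance loss.
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                                                 # step once per epoch
```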
Table 2. Comparison of CD results on the ShapeNet-ViPC dataset (lower is better).
Methods | Avg | Airplane | Cabinet | Car | Chair | Lamp | Sofa | Table | Watercraft
Single-modal
AtlasNet [8] | 6.062 | 5.032 | 6.414 | 4.868 | 8.161 | 7.182 | 6.023 | 6.561 | 4.261
FoldingNet [14] | 6.271 | 5.242 | 6.958 | 5.307 | 8.823 | 6.504 | 6.368 | 7.080 | 3.882
PCN [10] | 5.619 | 4.246 | 6.409 | 4.840 | 7.441 | 6.331 | 5.668 | 6.508 | 3.510
TopNet [7] | 4.976 | 3.710 | 5.629 | 4.530 | 6.391 | 5.547 | 5.281 | 5.381 | 3.350
ECG [15] | 4.957 | 2.952 | 6.721 | 5.243 | 5.867 | 4.602 | 6.813 | 4.332 | 3.127
PoinTr [11] | 8.382 | 4.753 | 10.472 | 8.682 | 9.392 | 7.754 | 10.933 | 7.784 | 7.291
PointAttN [12] | 6.634 | 3.282 | 10.775 | 6.132 | 7.141 | 5.921 | 9.727 | 6.164 | 3.592
VRC-Net [16] | 4.598 | 2.813 | 6.108 | 4.932 | 5.342 | 4.103 | 6.614 | 3.953 | 2.925
Multi-modal
ViPC [5] | 3.308 | 1.760 | 4.558 | 3.183 | 2.476 | 2.867 | 4.481 | 4.990 | 2.197
CSDN [6] | 1.653 | 1.873 | 3.245 | 1.943 | 1.885 | 2.096 | 3.417 | 4.009 | 2.236
XMFnet [17] | 1.443 | 0.572 | 1.980 | 1.754 | 1.403 | 1.810 | 1.702 | 1.386 | 0.945
Ours | 1.274 | 0.561 | 1.796 | 1.686 | 1.376 | 1.061 | 1.582 | 1.342 | 0.788
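The CD values in Table 2 are Chamfer distances between the completed point cloud and the ground truth. The exact variant used in the paper (squared vs. unsquared terms, any scaling factor) is not restated here, so the snippet below is only an illustrative squared-L2 formulation of the symmetric Chamfer distance.

```python
import torch

def chamfer_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Symmetric squared-L2 Chamfer distance between point sets p (N, 3) and q (M, 3)."""
    d = torch.cdist(p, q)                          # (N, M) pairwise Euclidean distances
    return (d.min(dim=1).values.pow(2).mean()      # each point in p to its nearest point in q
            + d.min(dim=0).values.pow(2).mean())   # each point in q to its nearest point in p

# Illustrative call with random clouds standing in for prediction and ground truth.
pred, gt = torch.rand(2048, 3), torch.rand(2048, 3)
print(chamfer_distance(pred, gt).item())
```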
Table 3. Comparison of F1@0.001 results on the ShapeNet-ViPC dataset (higher is better).
Methods | Avg | Airplane | Cabinet | Car | Chair | Lamp | Sofa | Table | Watercraft
Single-modal
AtlasNet [8] | 0.410 | 0.509 | 0.304 | 0.379 | 0.326 | 0.426 | 0.318 | 0.469 | 0.551
FoldingNet [14] | 0.331 | 0.432 | 0.237 | 0.300 | 0.204 | 0.360 | 0.249 | 0.351 | 0.518
PCN [10] | 0.407 | 0.578 | 0.270 | 0.331 | 0.323 | 0.456 | 0.293 | 0.431 | 0.577
TopNet [7] | 0.467 | 0.593 | 0.358 | 0.405 | 0.388 | 0.491 | 0.361 | 0.528 | 0.615
ECG [15] | 0.704 | 0.880 | 0.542 | 0.713 | 0.671 | 0.689 | 0.534 | 0.792 | 0.810
PoinTr [11] | 0.635 | 0.462 | 0.873 | 0.703 | 0.693 | 0.655 | 0.592 | 0.787 | 0.792
PointAttN [12] | 0.732 | 0.892 | 0.592 | 0.732 | 0.771 | 0.544 | 0.534 | 0.796 | 0.820
VRC-Net [16] | 0.764 | 0.902 | 0.621 | 0.753 | 0.722 | 0.823 | 0.654 | 0.810 | 0.832
Multi-modal
ViPC [5] | 0.591 | 0.803 | 0.451 | 0.512 | 0.529 | 0.706 | 0.434 | 0.594 | 0.730
CSDN [6] | 0.672 | 0.563 | 0.524 | 0.443 | 0.485 | 0.826 | 0.417 | 0.523 | 0.836
XMFnet [17] | 0.796 | 0.961 | 0.662 | 0.691 | 0.809 | 0.792 | 0.723 | 0.830 | 0.901
Ours | 0.822 | 0.968 | 0.696 | 0.710 | 0.812 | 0.879 | 0.748 | 0.835 | 0.929
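F1@0.001 in Table 3 is the F-score of the completion at a distance threshold of 0.001: precision counts predicted points that lie close enough to the ground truth, recall counts ground-truth points that are covered by the prediction, and the score is their harmonic mean. Whether the threshold is applied to raw or squared nearest-neighbour distances is not restated here; the sketch below assumes raw Euclidean distances.

```python
import torch

def f_score(pred: torch.Tensor, gt: torch.Tensor, threshold: float = 0.001) -> torch.Tensor:
    """F1 between point clouds: a point counts as correct if its nearest neighbour
    in the other cloud lies within `threshold` (Euclidean distance assumed)."""
    d = torch.cdist(pred, gt)                                      # (N, M) pairwise distances
    precision = (d.min(dim=1).values < threshold).float().mean()   # pred points matched to gt
    recall = (d.min(dim=0).values < threshold).float().mean()      # gt points matched to pred
    return 2 * precision * recall / (precision + recall + 1e-8)    # harmonic mean, guarded

pred, gt = torch.rand(2048, 3), torch.rand(2048, 3)
print(f_score(pred, gt).item())   # random clouds score near zero at this tight threshold
```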
Table 4. Comparison of experimental results on the ModelNet40ViPC dataset (CD × 10³).
Models | ViPC [5] | CSDN [6] | XMFnet [17] | Ours (RCA) | Ours (Multi-Scale 512) | Ours (Multi-Scale 1024) | Ours (RCA + Multi-Scale)
glass_box | 10.833 | 4.614 | 2.751 | 2.768 | 3.206 | 3.363 | 2.770
range_hood | 11.464 | 5.079 | 3.086 | 3.153 | 3.290 | 3.530 | 2.799
laptop | 10.552 | 2.960 | 1.531 | 1.579 | 1.505 | 1.640 | 1.288
table | 9.296 | 3.419 | 2.107 | 2.103 | 1.974 | 2.238 | 1.840
bed | 7.684 | 3.850 | 2.714 | 2.746 | 2.552 | 2.711 | 2.387
chair | 9.742 | 4.792 | 3.004 | 3.036 | 2.778 | 3.023 | 2.580
bookshelf | 8.948 | 4.494 | 3.196 | 3.235 | 3.196 | 3.257 | 2.974
piano | 10.026 | 5.766 | 3.763 | 3.760 | 3.730 | 4.011 | 3.404
sink | 12.049 | 5.727 | 3.512 | 3.511 | 3.144 | 3.435 | 3.038
airplane | 4.274 | 1.758 | 1.011 | 1.027 | 1.047 | 1.141 | 0.894
dresser | 11.066 | 4.845 | 2.869 | 2.917 | 2.965 | 3.110 | 2.592
sofa | 7.680 | 4.102 | 2.814 | 2.853 | 2.804 | 2.938 | 2.560
bottle | 6.483 | 2.448 | 1.560 | 1.591 | 1.511 | 1.541 | 1.414
monitor | 7.275 | 3.756 | 2.489 | 2.537 | 2.611 | 2.708 | 2.310
tv_stand | 10.493 | 5.487 | 3.804 | 3.620 | 3.588 | 3.737 | 3.501
toilet | 11.202 | 5.942 | 3.826 | 3.884 | 3.848 | 4.066 | 3.554
stool | 12.755 | 6.831 | 4.464 | 3.519 | 3.179 | 3.404 | 3.267
xbox | 9.982 | 4.206 | 2.503 | 2.508 | 2.511 | 2.521 | 2.180
door | 6.777 | 2.031 | 1.038 | 1.077 | 1.114 | 1.181 | 0.988
night_stand | 12.136 | 5.580 | 3.666 | 3.693 | 3.449 | 3.651 | 3.230
bench | 9.044 | 4.353 | 2.549 | 2.611 | 2.232 | 2.448 | 2.094
vase | 11.445 | 6.675 | 3.906 | 3.702 | 3.583 | 3.674 | 3.419
tent | 12.585 | 6.719 | 3.971 | 3.679 | 3.347 | 3.679 | 3.163
desk | 12.517 | 6.321 | 4.003 | 4.018 | 3.710 | 4.001 | 3.507
car | 7.589 | 4.044 | 3.051 | 3.131 | 3.875 | 3.018 | 2.741
radio | 11.233 | 5.817 | 2.983 | 3.054 | 2.901 | 3.041 | 2.581
stairs | 9.898 | 8.704 | 4.625 | 4.364 | 3.564 | 3.644 | 3.462
guitar | 3.521 | 1.416 | 0.506 | 0.502 | 0.455 | 0.486 | 0.403
mantel | 11.906 | 4.292 | 2.327 | 2.354 | 2.685 | 2.708 | 2.219
cup | 13.504 | 7.348 | 5.288 | 5.105 | 4.692 | 4.857 | 4.708
plant | 9.102 | 8.076 | 5.929 | 5.315 | 4.794 | 4.806 | 4.721
curtain | 7.049 | 2.899 | 1.283 | 1.318 | 1.078 | 1.190 | 0.983
lamp | 19.825 | 12.622 | 4.986 | 5.073 | 3.313 | 3.546 | 3.931
flower_pot | 11.209 | 7.622 | 5.301 | 5.078 | 4.575 | 4.753 | 4.544
cone | 10.655 | 3.806 | 2.197 | 2.147 | 2.149 | 2.359 | 1.954
keyboard | 4.174 | 1.781 | 0.824 | 0.850 | 0.843 | 0.876 | 0.760
bathtub | 10.798 | 5.010 | 3.371 | 3.247 | 3.121 | 3.515 | 3.019
wardrobe | 9.831 | 4.181 | 2.543 | 2.603 | 2.622 | 2.781 | 2.279
bowl | 16.715 | 15.116 | 5.304 | 5.042 | 2.208 | 5.040 | 4.676
person | 6.267 | 5.623 | 2.541 | 2.468 | 2.124 | 2.179 | 2.050
all | 9.990 | 5.252 | 3.080 | 3.019 | 2.781 | 2.995 | 2.669
Table 5. Comparison of experimental results on the ModelNet40ViPC dataset (F1@0.001).
Models | ViPC | CSDN | XMFnet | Ours (RCA) | Ours (Multi-Scale 512) | Ours (Multi-Scale 1024) | Ours (RCA + Multi-Scale)
glass_box | 0.207 | 0.473 | 0.507 | 0.554 | 0.516 | 0.508 | 0.561
range_hood | 0.265 | 0.492 | 0.578 | 0.579 | 0.563 | 0.559 | 0.595
laptop | 0.416 | 0.662 | 0.711 | 0.717 | 0.713 | 0.701 | 0.743
table | 0.469 | 0.697 | 0.759 | 0.754 | 0.761 | 0.746 | 0.779
bed | 0.324 | 0.527 | 0.588 | 0.584 | 0.588 | 0.578 | 0.607
chair | 0.433 | 0.603 | 0.659 | 0.659 | 0.663 | 0.655 | 0.679
bookshelf | 0.325 | 0.528 | 0.564 | 0.580 | 0.568 | 0.564 | 0.597
piano | 0.288 | 0.476 | 0.526 | 0.544 | 0.537 | 0.527 | 0.560
sink | 0.345 | 0.579 | 0.644 | 0.640 | 0.652 | 0.632 | 0.667
airplane | 0.630 | 0.776 | 0.888 | 0.886 | 0.872 | 0.860 | 0.896
dresser | 0.223 | 0.478 | 0.575 | 0.572 | 0.544 | 0.526 | 0.582
sofa | 0.315 | 0.488 | 0.554 | 0.549 | 0.545 | 0.541 | 0.569
bottle | 0.455 | 0.739 | 0.822 | 0.820 | 0.816 | 0.793 | 0.839
monitor | 0.391 | 0.573 | 0.637 | 0.633 | 0.617 | 0.613 | 0.649
tv_stand | 0.276 | 0.484 | 0.547 | 0.544 | 0.542 | 0.531 | 0.563
toilet | 0.276 | 0.453 | 0.504 | 0.514 | 0.503 | 0.498 | 0.523
stool | 0.413 | 0.593 | 0.632 | 0.640 | 0.676 | 0.667 | 0.678
xbox | 0.263 | 0.538 | 0.630 | 0.624 | 0.603 | 0.587 | 0.645
door | 0.542 | 0.781 | 0.803 | 0.847 | 0.830 | 0.820 | 0.858
night_stand | 0.242 | 0.464 | 0.543 | 0.541 | 0.539 | 0.524 | 0.558
bench | 0.471 | 0.654 | 0.711 | 0.707 | 0.721 | 0.710 | 0.737
vase | 0.291 | 0.507 | 0.585 | 0.583 | 0.586 | 0.572 | 0.604
tent | 0.287 | 0.527 | 0.593 | 0.587 | 0.605 | 0.587 | 0.626
desk | 0.344 | 0.537 | 0.588 | 0.589 | 0.602 | 0.588 | 0.619
car | 0.318 | 0.499 | 0.543 | 0.544 | 0.544 | 0.533 | 0.562
radio | 0.302 | 0.531 | 0.613 | 0.619 | 0.617 | 0.602 | 0.645
stairs | 0.452 | 0.567 | 0.634 | 0.631 | 0.664 | 0.663 | 0.678
guitar | 0.741 | 0.887 | 0.955 | 0.950 | 0.960 | 0.954 | 0.965
mantel | 0.285 | 0.535 | 0.644 | 0.642 | 0.620 | 0.609 | 0.653
cup | 0.217 | 0.420 | 0.474 | 0.471 | 0.484 | 0.473 | 0.494
plant | 0.365 | 0.453 | 0.516 | 0.516 | 0.531 | 0.534 | 0.538
curtain | 0.533 | 0.759 | 0.838 | 0.834 | 0.844 | 0.830 | 0.860
lamp | 0.383 | 0.575 | 0.667 | 0.667 | 0.701 | 0.691 | 0.702
flower_pot | 0.270 | 0.413 | 0.476 | 0.481 | 0.482 | 0.476 | 0.493
cone | 0.364 | 0.678 | 0.715 | 0.775 | 0.757 | 0.773 | 0.798
keyboard | 0.635 | 0.781 | 0.857 | 0.870 | 0.865 | 0.857 | 0.889
bathtub | 0.296 | 0.523 | 0.577 | 0.573 | 0.578 | 0.564 | 0.598
wardrobe | 0.270 | 0.536 | 0.640 | 0.631 | 0.608 | 0.586 | 0.652
bowl | 0.209 | 0.453 | 0.523 | 0.519 | 0.524 | 0.517 | 0.536
person | 0.521 | 0.629 | 0.707 | 0.707 | 0.720 | 0.717 | 0.733
all | 0.366 | 0.571 | 0.638 | 0.641 | 0.642 | 0.632 | 0.663
Table 6. Comparison of ablation results.
Category | RCA | Multi-Scale (512) | Multi-Scale (1024) | XMFnet | CMFN (RCA + Multi-Scale)
Lamp | 1.490 | 1.327 | 1.452 | 1.810 | 1.061
Watercraft | 0.823 | 0.828 | 0.838 | 0.945 | 0.788
Cabinet | 1.936 | 1.877 | 1.915 | 1.980 | 1.796