Article

RAAFNet: Reverse Attention Adaptive Fusion Network for Large-Scale Point Cloud Semantic Segmentation

1 School of Electronics and Information, Xi’an Polytechnic University, Xi’an 710048, China
2 School of Automation, Northwestern Polytechnical University, Xi’an 710129, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(16), 2485; https://doi.org/10.3390/math12162485
Submission received: 4 June 2024 / Revised: 23 July 2024 / Accepted: 8 August 2024 / Published: 12 August 2024
(This article belongs to the Topic Big Data Intelligence: Methodologies and Applications)

Abstract:
Point cloud semantic segmentation is essential for comprehending and analyzing scenes. However, performing semantic segmentation on large-scale point clouds presents challenges, including high memory demands, a lack of structured data, and the absence of topological information. This paper presents a novel method based on the Reverse Attention Adaptive Fusion network (RAAFNet) for segmenting large-scale point clouds. RAAFNet consists of a reverse attention encoder–decoder module, an adaptive fusion module, and a local feature aggregation module. The reverse attention encoder–decoder module is applied to extract point cloud features at different scales. The adaptive fusion module enhances fine-grained representation within multi-resolution feature maps. Furthermore, a local aggregation classifier is introduced, which aggregates the features of neighboring points to the center point in order to leverage contextual information and enhance the classifier’s perceptual capability. Finally, the predicted labels are generated. Notably, our method excels at extracting point cloud features across different dimensions and produces highly accurate segmentation results. On the Semantic3D dataset, the method achieved an overall accuracy of 89.9% and an mIoU of 74.4%.
MSC:
68T01; 68T07; 68T45

1. Introduction

With the rapid advancement of 3D acquisition technology, the process of acquiring point clouds has become increasingly accessible. Point clouds possess the unique capability of preserving the original geometric information of objects with arbitrary shapes. They are widely used in autonomous driving [1], virtual reality, remote sensing, medical treatment, robot navigation, and scene segmentation [2]. Point cloud semantic segmentation plays a crucial role in facilitating point cloud scene understanding and analysis. Due to heavy memory requirements and an inherent lack of structured data and topological information, it is exceedingly challenging to achieve semantic segmentation and understanding of large-scale point clouds.
Existing approaches for point cloud segmentation can be categorized into projection-oriented and point-based methods. Projection representation methods involve projecting 3D point clouds onto multiple 2D views [3]. SqueezeSeg [4] and SqueezeSegV2 [5] use spherical projection to preserve more point cloud information. However, these methods are computationally complex and not suitable for real-time applications. PointSIFT [6] takes a point-based approach and processes each point individually using a shared point-wise MLP. While accurate, it demands significant computational resources for large-scale point cloud data, and its aggregation modules may result in information loss or errors with unevenly distributed point clouds. The LFA (local feature aggregation) module proposed by RandLA-Net [9] addresses these challenges by aggregating local and global features, which improves both accuracy and efficiency. However, these methods still require extensive computational resources for large-scale point clouds and may struggle with uneven point cloud density and complex structures.
In this paper, the RAAFNet semantic segmentation method is proposed for large-scale point clouds. The proposed method utilizes reverse attention fusion and aims to effectively handle local features and process extensive point cloud data. The main contributions of this paper are as follows:
(1)
A reverse attention adaptive fusion network is introduced for precise segmentation of large-scale point clouds. This network is suitable for unstructured, disordered, and large-scale point cloud data with irregular domains.
(2)
An adaptive fusion module is designed to learn the adaptive weights of each feature representation. This enables the effective fusion of features extracted at different scales, resulting in improved accuracy in point cloud semantic segmentation. The adaptive fusion module enhances the model’s ability to capture multi-scale information and handle variations in point cloud density.
(3)
Extensive experiments to evaluate the proposed RAAFNet are conducted. The results demonstrate that RAAFNet outperforms state-of-the-art models in point cloud segmentation tasks. RAAFNet achieves significant improvements in benchmark performance while maintaining high efficiency.
In the remainder of this article, Section 2 describes the relevant work, Section 3 gives the details of the methods, Section 4 presents the experiments, and Section 5 provides the conclusions.

2. Related Work

2.1. Methods Based on Multi-Layer Perceptrons (MLPs)

Many methods directly use point-wise MLPs to process unstructured point clouds. PointNet [7] is a pioneering work that uses shared MLPs to encode each point individually and uses global pooling to aggregate the features of all points. However, it lacks the ability to capture the local 3D structure, ignoring the relationships between points and the extraction of local features. Subsequent studies have addressed this problem by adopting hierarchical multi-scale or weighted feature aggregation schemes to combine local features. Other methods use graphs to represent point clouds [8] and perform local graph operations to aggregate point features, aiming to capture local point relationships. However, they all use shared MLPs to transform point features, which limits the models’ ability to capture spatial variation information. To better adapt to large-scale point cloud scenarios, RandLA-Net [9] has been proposed. It first introduces a local spatial encoding unit to compensate for the loss of local structural information caused by random sampling, and then uses attention pooling [10] and neighborhood point feature aggregation methods to enhance important information. It also uses dilated residual blocks [11] to increase the receptive field of each point and extract richer features. This method has excellent storage and computational efficiency, and it can retain more detailed geometric structural information, making it suitable for handling large-scale point cloud data. However, due to the influence of data acquisition noise, it is currently limited in improving segmentation accuracy [12]. Using MLPs for point cloud segmentation often leads to high memory consumption [13], and attention-based aggregation [14] can improve the ability to capture local features and spatial structures between points. However, these methods need to be further improved in terms of adaptability and flexibility under complex scenarios, especially for large-scale point cloud datasets and variable structures.

2.2. Point Convolution Methods

The inherent disorder and uneven density of point cloud data often lead to the results of direct convolution operations being sensitive to the input order of points. To address this issue, PointCNN [15] employs a χ-transformation matrix to re-weight and reorder the feature matrix, effectively mitigating the impact of disorder. This approach successfully reduces the time and space complexity. In theory, the algorithm can effectively solve the problem of shape information loss, and the segmentation performance is less affected by the input order of points, but the actual results have not met the expected performance. To better utilize the local topological structure information features, Hua et al. [16] used point-wise convolution to capture the local information of points for semantic segmentation, but the fixed convolution operator reduces the flexibility of the network and leads to high computational costs. To address the high computational burden of point-wise convolution, Thomas [17] proposed a kernel point convolution method, including rigid convolution and deformable convolution. Similarly, ConvPoint [18], A-CNN [19], and other methods have constructed new convolution operations for 3D point cloud segmentation, which can effectively capture the structural information of point clouds. ConvPoint [18] introduced a new non-uniform 3D data convolution method, where the convolution kernel is a non-linear function composed of a weight function and a density function based on the local coordinates of the 3D points. For a given point, the density function is determined by kernel density estimation, and the weight function is re-weighted using an MLP. This method effectively mitigates the adverse effects of the uneven point cloud distribution and significantly improves the segmentation performance. However, it lacks sufficient neighborhood feature extraction and has relatively high time and space complexity. Inspired by the convolution kernel construction in KPConv [17] and ConvPoint [18], Xu [20] proposed a position-adaptive convolution method based on dynamic kernel assembly. This algorithm first dynamically combines the basic weight matrices in the weight library to construct the convolution kernel and then adopts the dynamic kernel assembly strategy of the ScoreNet to reduce the memory and computational burden. Compared to ConvPoint [18] and KPConv [17], PAConv [20] has lower memory usage and higher segmentation efficiency. However, it still suffers from high computational complexity, limited generalization ability, and sensitivity to the size of the receptive field. Unlike standard point convolution, MKConv [21] can transform the point feature representation from vectors to multi-dimensional matrices. By employing convolution kernels of multiple scales, MKConv [21] can capture features at different scales, which enhances the model’s perception of multi-scale features and improves its generalization ability. However, MKConv [21] mainly focuses on feature extraction at the point level and lacks the utilization of contextual information, so its performance may be limited in scenarios with a large amount of detail or complex structures.

2.3. RNN-Based Approach

To capture the structural features of neighboring points in the point cloud, some researchers have proposed applying recurrent neural networks (RNNs) [22] to 3D point cloud semantic segmentation models. Francis et al. [23] transformed the point blocks in PointNet [7] into multi-scale blocks or grid blocks to obtain context at the input level, and then merged or recursively merged the per-block features extracted by PointNet [7] to obtain context at the output level. The recursive merging can preserve information about the scene well and improve learning efficiency, but the locally learned features are insufficient. To better solve the problems of inadequate extraction of local geometric features and insufficient acquisition of spatial relationship information between neighboring points, 3P-RNN [24] uses point-wise pyramid pooling to capture local context information at different scales and uses a bidirectional hierarchical RNN to fuse spatial correlation data over a larger range. This method has achieved good results on both indoor and outdoor datasets, with strong generalization ability. However, it has limited ability to distinguish between similar semantic classes, such as doors and walls. The adaptive fusion module [25] effectively integrates features from different levels, which can better capture the multi-scale information in point cloud data. It adopts a reverse attention mechanism to dynamically adjust the weights of low-level and high-level features, improving the quality of feature representation. The adaptive fusion network has exhibited excellent performance on various tasks such as point cloud segmentation and 3D object detection, demonstrating strong generalization capabilities. However, the adaptive fusion approach involves multiple connection modules [26], leading to a relatively complex overall network structure, which can make it more difficult to understand and deploy. Due to the network depth and large number of parameters, the training process of the adaptive fusion network may require a significant amount of data and computational resources.

3. Methodology

In this paper, RAAFNet is proposed to segment disordered point clouds. The overall architecture of RAAFNet is shown in Figure 1. RAAFNet consists of three main parts: a reverse attention encoder–decoder module, an adaptive fusion module, and a local feature aggregation module.

3.1. Spatial Random Sampling

Large-scale point clouds often need downsampling to improve processing efficiency by reducing the number of points. Random sampling is chosen for downsampling after a comprehensive analysis. A sampling rate of 0.5 (50%) is selected, retaining half of the points by randomly selecting them from the original point cloud data. The number of points to be sampled is calculated based on the specified rate and the original point cloud size. Randomly chosen points are preserved with their position and attribute information, while the rest is removed.
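As a concrete illustration of this step, the following minimal NumPy sketch performs 50% random downsampling while retaining each kept point's position and attribute information. The array layout (one row per point, xyz plus RGB) and the function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def random_downsample(points, rate=0.5, seed=None):
    """Randomly keep a fraction of the points (positions + attributes).

    points : (N, D) array, e.g. D = 6 for x, y, z, R, G, B.
    rate   : fraction of points to retain (0.5 keeps half, as in the paper).
    """
    rng = np.random.default_rng(seed)
    n_keep = int(points.shape[0] * rate)          # number of points to sample
    idx = rng.choice(points.shape[0], size=n_keep, replace=False)
    return points[idx]                            # sampled points keep all attributes

# Example: a toy cloud of 10,000 points with xyz + RGB attributes
cloud = np.random.rand(10000, 6)
half = random_downsample(cloud, rate=0.5, seed=0)
print(half.shape)  # (5000, 6)
```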

3.2. Inverse Attention Encoder–Decoder

Given a point cloud $P = \{p_n\}_{n=1}^{N}$ containing $N$ points, each point carries positional information (x, y, z) and color information (RGB).
Firstly, the original point cloud is processed by the inverse attention encoder–decoder and converted to point-wise features $F = \{f_n\}_{n=1}^{N}$, which can be expressed as follows:
$$F = \sigma_{BAFED}(P)$$
where $\sigma_{BAFED}$ [27] represents the inverse attention encoder–decoder.
For the inverse attention encoder–decoder, the encoder employs random sampling to reduce the number of points by a factor of four at each layer, and farthest-point sampling further decreases the number of points. Sampling proceeds as follows: $N \rightarrow \tfrac{N}{4} \rightarrow \tfrac{N}{16} \rightarrow \tfrac{N}{64} \rightarrow \tfrac{N}{256}$.
Secondly, local feature aggregation (LFA) [28] is utilized to expand the feature dimension of each point layer by layer across the five encoder layers, as illustrated in Figure 2.
The encoder can be expressed as follows:
$$F^{l+1} = \begin{cases} \mathcal{M}(P), & l = 0 \\ RS\big(LFA(F^{l})\big), & \text{otherwise} \end{cases}$$
where $l$ denotes the layer index, $\mathcal{M}$ represents a Multi-Layer Perceptron (MLP) with different weights in each structural layer, $RS$ denotes the random sampling operation, and $F^{l} \in \mathbb{R}^{N_l \times D_l}$, $F^{l+1} \in \mathbb{R}^{(N_l/4) \times D_{l+1}}$.
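The following sketch illustrates the data flow of this five-layer encoder under the reading above: an initial per-point MLP, followed by four stages of local feature aggregation (LFA) and random subsampling (RS), shrinking the point count by a factor of four per stage. The LFA block is approximated here by a simple per-point MLP, and the layer widths and random weights are placeholders rather than the network's actual parameters.

```python
import numpy as np

def mlp(x, out_dim, rng):
    """Stand-in per-point MLP: one random linear layer plus ReLU (weights are placeholders)."""
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.1
    return np.maximum(x @ w, 0.0)

def lfa(features, out_dim, rng):
    """Placeholder for the local feature aggregation (LFA) block; approximated
    here by a per-point MLP just to show the data flow."""
    return mlp(features, out_dim, rng)

def random_subsample(features, rng, factor=4):
    """RS step: randomly keep N/factor of the points."""
    n_keep = max(features.shape[0] // factor, 1)
    idx = rng.choice(features.shape[0], size=n_keep, replace=False)
    return features[idx]

def encoder(points, dims=(8, 32, 64, 128, 256), seed=0):
    """Five-level encoder sketch: N -> N/4 -> N/16 -> N/64 -> N/256 points."""
    rng = np.random.default_rng(seed)
    feats = mlp(points, dims[0], rng)                      # l = 0: M(P)
    pyramid = [feats]
    for d in dims[1:]:
        feats = random_subsample(lfa(feats, d, rng), rng)  # RS(LFA(F^l))
        pyramid.append(feats)
    return pyramid

pyramid = encoder(np.random.rand(1024, 6))
print([f.shape[0] for f in pyramid])  # [1024, 256, 64, 16, 4]
```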
The decoder increases the number of points by upsampling operations [29] and then compresses the point features into five sequential layers using a Multi-Layer Perceptron (MLP).
The inverse attention fusion module is used to enhance the feature discrimination ability of the decoder with respect to the multi-layer features from the encoder. For convenience, the decoder feature corresponding to the encoder feature $F_e^l \in \mathbb{R}^{N_l \times D_l}$ is denoted as $F_d^l \in \mathbb{R}^{N_l \times D_l}$, where $l \in \{0, 1, 2, 3, 4, 5\}$ ($F_d^0$ corresponds to $P$, the original input point cloud). Relative to the encoder, the decoder-layer features are therefore indexed in the opposite order. Enhancement uses $F_e^{l+1}$ and $F_e^l$ in the encoder to enhance $F_d^l$ in the decoder: $F_e^{l+1}$ acts as inverse attention to adjust the semantic information of $F_e^l$. To do so, the feature derived from $F_e^{l+1}$ must match $F_e^l$ in both the number of points and the feature dimension. A shared MLP is therefore first applied to $F_e^{l+1}$, compressing its dimension to $D_l$; then, the number of points in the intermediate feature is upsampled to $N_l$ via nearest-neighbor interpolation.
In addition, an MLP and the Sigmoid activation function are used to generate attention weights. To better fuse these two features, a residual connection [30] is applied to retain more low-level information. The operation can be expressed as follows:
$$F_a^l = F_e^l + F_e^l \otimes US\big(\mathcal{M}(F_e^{l+1})\big)$$
where $F_a^l$ is the modulated feature passed from the $l$-th encoder layer to the corresponding decoder layer, $\mathcal{M}(\cdot)$ represents the shared MLP and Sigmoid operation that produces the attention weights, $US(\cdot)$ denotes nearest-neighbor upsampling, and $\otimes$ denotes element-wise multiplication. Inverse attention can capture multi-scale information and maintain low-level details via the residual connection. To further weaken the influence of spatial indistinguishability, a Feature Enhancement (FE) unit, implemented as a simple shared MLP, operates on the intermediate feature. Finally, the modulated encoder feature is fused with the decoder feature.
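A minimal NumPy sketch of this modulation step, under the reading above (shared MLP plus Sigmoid to produce attention weights, nearest-neighbor upsampling to match the point count, element-wise multiplication, and a residual connection), is given below. The random weights and the helper names (`shared_mlp_sigmoid`, `nearest_upsample`) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_mlp_sigmoid(feats, out_dim, rng):
    """M(.): shared MLP compressing the channel dimension to out_dim, followed by
    a Sigmoid that turns the result into attention weights in (0, 1)."""
    w = rng.standard_normal((feats.shape[-1], out_dim)) * 0.1   # placeholder weights
    return sigmoid(feats @ w)

def nearest_upsample(coarse_feats, coarse_xyz, fine_xyz):
    """US(.): nearest-neighbour interpolation from the coarse level up to the fine level."""
    d2 = ((fine_xyz[:, None, :] - coarse_xyz[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)                 # nearest coarse point for every fine point
    return coarse_feats[nn]

def reverse_attention_fuse(fe_l, fe_next, xyz_l, xyz_next, seed=0):
    """F_a^l = F_e^l + F_e^l * US(M(F_e^{l+1})): attention modulation plus residual."""
    rng = np.random.default_rng(seed)
    attn = shared_mlp_sigmoid(fe_next, fe_l.shape[-1], rng)   # compress D_{l+1} -> D_l
    attn = nearest_upsample(attn, xyz_next, xyz_l)            # upsample to N_l points
    return fe_l + fe_l * attn                                 # element-wise product + residual

# Toy example: fine level with 256 points (D_l = 32), coarse level with 64 points (D_{l+1} = 64)
rng = np.random.default_rng(0)
xyz_l, xyz_next = rng.random((256, 3)), rng.random((64, 3))
fe_l, fe_next = rng.random((256, 32)), rng.random((64, 64))
print(reverse_attention_fuse(fe_l, fe_next, xyz_l, xyz_next).shape)  # (256, 32)
```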

3.3. Adaptive Fusion Module

To analyze large-scale 3D scenes with numerous points, the point cloud is explored at reduced resolutions. However, this exploration results in implicit and abstract output features. To address this limitation, it is important to reconstruct a feature map that preserves the original number of points and fully interprets the encoding information of each point. The adaptive fusion module is utilized to achieve a detailed representation within the multi-resolution feature map. The adaptive fusion module is shown in Figure 3.
Assume that the $M$ lower-resolution ($M = 5$) point clouds output by the inverse encoder–decoder form a set of multi-resolution feature maps $\{S_1, S_2, S_3, S_4, S_5\}$ containing $N_1, N_2, N_3, N_4, N_5$ points, respectively. Each extracted feature map $S_m$ is upsampled to generate a full-size representation of all $N$ points, yielding the complete-size feature maps $\{\tilde{S}_1, \tilde{S}_2, \tilde{S}_3, \tilde{S}_4, \tilde{S}_5\}$, where $\tilde{S}_m \in \mathbb{R}^{N \times 32}$.
To integrate information and find a useful context for semantic segmentation, the complete-size feature maps are adaptively fused at the point level. An MLP is used to summarize the point-level information $\Phi_m$ of each complete-size feature map during the upsampling process:
$$\Phi_m = FC(\tilde{S}_m)$$
where $FC$ represents a fully connected layer. $\Phi_1, \Phi_2, \Phi_3, \Phi_4, \Phi_5$ are concatenated and normalized using the SoftMax function:
$$\sigma = \mathrm{softmax}\big(\mathrm{concat}(\Phi_1, \Phi_2, \Phi_3, \Phi_4, \Phi_5)\big), \quad \sigma \in \mathbb{R}^{N \times 32}$$
$\sigma$ is separated channel-wise to obtain the separation parameters $\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5$, and the adaptive fusion feature is calculated as follows:
$$S_{\mathrm{out}} = \sigma_1 \times \tilde{S}_1 + \sigma_2 \times \tilde{S}_2 + \sigma_3 \times \tilde{S}_3 + \sigma_4 \times \tilde{S}_4 + \sigma_5 \times \tilde{S}_5$$
where $S_{\mathrm{out}} \in \mathbb{R}^{N \times 32}$ is the resulting fusion feature, which is fed into the local aggregation classifier.
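The sketch below reproduces the data flow of this adaptive fusion under one plausible reading, in which the SoftMax normalizes the per-point weights across the five full-size maps so that they sum to one before the weighted sum. The fully connected layers use random placeholder weights.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(full_size_maps, seed=0):
    """Point-wise adaptive fusion of M full-size feature maps S~_1..S~_M (each N x 32).

    A fully connected layer (random placeholder weights) summarises each map into
    Phi_m; the Phi's are normalised with SoftMax across the M maps; the resulting
    weights sigma_m blend the maps into S_out (N x 32)."""
    rng = np.random.default_rng(seed)
    _, c = full_size_maps[0].shape
    fc = [rng.standard_normal((c, c)) * 0.1 for _ in full_size_maps]   # one FC per map
    phi = np.stack([s @ w for s, w in zip(full_size_maps, fc)])        # (M, N, C)
    sigma = softmax(phi, axis=0)                                       # weights sum to 1 over maps
    return (sigma * np.stack(full_size_maps)).sum(axis=0)              # weighted sum -> (N, C)

maps = [np.random.rand(1000, 32) for _ in range(5)]   # S~_1 ... S~_5
print(adaptive_fusion(maps).shape)                     # (1000, 32)
```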

3.4. Local Aggregation Classifier

The point cloud entering the local aggregation classifier can be represented as $\tilde{S}_{\mathrm{out}} \in \mathbb{R}^{N \times 32}$. Before performing local aggregation, a graph must be constructed. Consider the point cloud graph structure $(V, E)$, where $V$ denotes the vertices and $E$ denotes the edges, with $V = \{1, 2, \ldots, N\}$ and $E \subseteq V \times V$; $N_i$ denotes the set of points adjacent to vertex $i$, and $\tilde{F} = \{\tilde{f}_n\}_{n=1}^{N}$ denotes the vertex features.
The graph is constructed using a simple KNN algorithm [30]. For a vertex feature $\tilde{f}_i$, the $K$ nearest vertex features $\{\tilde{f}_{i_k}\}_{k=1}^{K}$ are connected to it via edges. Edge features are calculated as follows:
$$e_{ij} = \mathrm{relu}(W \tilde{f}_{ij})$$
where $W$ denotes learnable weights, $\mathrm{relu}$ is the ReLU activation function, and $e_{ij}$ is the edge feature between the $i$-th vertex and its $j$-th adjacent vertex.
Then, the edge features are aggregated to the vertices using a method similar to PointNet:
$$\bar{f}_i = \max_{j:(i,j) \in E} e_{ij}$$
where $\max$ represents the max pooling operation.
The max pooling operation causes the aggregated feature to depend only on neighboring features, leading to a loss of the original attributes of the point. To avoid this problem, a residual connection [16] is used:
$$\hat{f}_i = \bar{f}_i + \tilde{f}_i$$
where $\tilde{f}_i$ represents the original vertex feature and $\bar{f}_i$ represents the edge features aggregated to the vertex.
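A compact NumPy sketch of the local aggregation classifier's feature path (brute-force KNN graph, ReLU edge features, max pooling over neighbors, and the residual connection) is shown below. Reading $\tilde{f}_{ij}$ as the neighboring vertex feature and using a random weight matrix are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def knn_indices(xyz, k):
    """Indices of the k nearest neighbours of every point (brute force, excludes self)."""
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)
    return d2.argsort(axis=1)[:, 1:k + 1]

def local_aggregation(xyz, feats, k=16, seed=0):
    """Edge features e_ij = relu(W f_ij), max-pooled over the k neighbours,
    plus a residual connection back to the original vertex feature."""
    rng = np.random.default_rng(seed)
    _, c = feats.shape
    w = rng.standard_normal((c, c)) * 0.1          # learnable W (random placeholder)
    idx = knn_indices(xyz, k)                      # (N, k) neighbour indices
    edge = np.maximum(feats[idx] @ w, 0.0)         # (N, k, C); f_ij taken as the neighbour feature
    pooled = edge.max(axis=1)                      # f_bar_i: max pooling over neighbours
    return pooled + feats                          # f_hat_i = f_bar_i + f_tilde_i

xyz = np.random.rand(500, 3)
feats = np.random.rand(500, 32)
print(local_aggregation(xyz, feats, k=16).shape)   # (500, 32)
```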

4. Experiments

4.1. Experimental Platform

Experiments were performed with TensorFlow on a server equipped with an NVIDIA RTX 3090 GPU (AMAX, Fremont, CA, USA) with 24 GB of VRAM and 500 GB of system memory for training and testing. The Adam algorithm with default parameters was used as the optimizer. The initial learning rate was set at 0.01 and decreased by 10% after each iteration. Due to hardware limitations, the batch size was set at 3 for training with the Semantic3D dataset [31] and 5 for S3DIS. The number of epochs was set at 100, and the number of nearest neighbors K in the local aggregation module was set at 16. We sampled a fixed number of points from each point cloud as the input in the training stage, and the raw point cloud was fed into the network for inference during testing.
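For reference, the training schedule described above can be expressed as in the following TensorFlow 2.x sketch; using `ExponentialDecay` stepped once per iteration is one way to realize the reported 10% decay and is an assumption, not the authors' training script.

```python
import tensorflow as tf

initial_lr = 0.01
epochs = 100
batch_size = 3            # Semantic3D (a batch size of 5 was used for S3DIS)
knn_k = 16                # neighbours in the local aggregation module

# Learning rate decreased by 10% after each iteration: lr_t = 0.01 * 0.9 ** t
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=initial_lr,
    decay_steps=1,        # decay applied once per optimizer step
    decay_rate=0.9,
    staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)  # Adam, otherwise default parameters
```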

4.2. Experimental Dataset

In this paper, the Semantic3D [31] dataset was used for testing. Semantic3D is a large-scale annotated 3D point cloud dataset of natural scenes, as shown in Figure 4. It contains over 4 billion points collected from scenes such as churches, streets, railway tracks, squares, villages, soccer fields, and castles, annotated with 8 semantic classes. The dataset includes 15 scenes for training, 2 scenes for validation, and 4 scenes for the reduced-8 test set. It was acquired with advanced static scanning devices and therefore contains very detailed information.
In this work, the performance of semantic segmentation was evaluated using overall accuracy (OA), mean class accuracy (mAcc), and mean intersection over union (mIoU) [32].
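For completeness, the sketch below shows how these three metrics can be computed from per-point predictions and ground-truth labels via a confusion matrix; it is a generic reference implementation, not the authors' evaluation code.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Overall accuracy (OA), mean class accuracy (mAcc) and mean IoU (mIoU)
    from per-point predictions and ground-truth labels."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred, gt):                      # confusion matrix: rows = ground truth, cols = prediction
        conf[g, p] += 1
    tp = np.diag(conf).astype(float)
    oa = tp.sum() / conf.sum()                      # overall accuracy
    acc_per_class = tp / np.maximum(conf.sum(axis=1), 1)
    macc = acc_per_class.mean()                     # mean class accuracy
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    miou = (tp / np.maximum(union, 1)).mean()       # mean intersection over union
    return oa, macc, miou

pred = np.array([0, 1, 1, 2, 2, 2])
gt = np.array([0, 1, 2, 2, 2, 1])
print(segmentation_metrics(pred, gt, 3))   # OA ≈ 0.667; per-class accuracy and IoU averaged for mAcc and mIoU
```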

4.3. Experiment Result Analysis

RAAFNet was compared to several state-of-the-art methods, including TMLC-MS [33], MSDVN [34], SnapNet [35], SEGCloud [36], PointNet++ [37], RF_MSSF [38], EdgeConv [39], and MSDeepVox [34]. The results are shown in Table 1. Table 1 indicates that our method achieves a slightly lower OA than the RF_MSSF [38] model. However, in terms of mIoU, RAAFNet outperforms all other methods, including MSDeepVox [34], demonstrating its effectiveness and robustness.
The visual results of the proposed method on the Semantic3D [31] dataset are shown in Figure 4. From Figure 4, it can be observed that RAAFNet achieves accurate segmentation of buildings, man-made structures, tall vegetation, and natural-terrain objects.
In order to better display the visual segmentation results, details of Figure 4b,f are magnified in Figure 4d and Figure 4h respectively. From Figure 4d,h, it can be seen that RAAFNet achieves precise segmentation of scanning artifacts, railings (man-made structures), and nearby cars in the scene model. The prediction results of Scene3 in Figure 4j demonstrate significant segmentation performance for buildings and tall vegetation. Moreover, from the magnified prediction result image in Figure 4l, it can be observed that the segmentation boundaries of different categories are clear in the segmentation of low vegetation, man-made structures, and natural landscapes, indicating a satisfactory visual effect.
Per-category mIoU on the Semantic3D dataset [31] was used to evaluate semantic segmentation performance for each class. The results were compared with TMLC-MS [33], MSDVN [34], SnapNet [35], SEGCloud [36], PointNet++ [37], RF_MSSF [38], EdgeConv [39], and MSDeepVox [34]. The comparison results are shown in Table 2.
From Table 2, it can be observed that the gap in segmentation accuracy between the proposed method and previous methods is larger for the categories of low vegetation and artificial landscapes than for other categories. This is because the point cloud targets in some scenes of these two categories have very similar shapes, making them difficult to distinguish. Among the eight categories of the Semantic3D dataset, the proposed method achieved the best performance in the categories of low vegetation, artificial landscapes, and scan artifacts. In particular, the proposed method outperformed PointNet++ [37] by 6.9 percentage points in the artificial landscape category, and its mIoU for scan artifacts was 22.2 percentage points higher than that of MSDeepVox [34]. Through the visualization of segmentation results and quantitative evaluation, it can be concluded that the proposed method can effectively segment large-scale and complex point clouds, demonstrating leading performance in specific categories.
S3DIS [40] is composed of six large-scale indoor areas comprising 272 rooms from three buildings. Each point cloud in S3DIS is a medium-sized single room with dense 3D points, and each point belongs to one of 13 semantic categories. Following previous methods, we use 6-fold cross-validation in the experiments. In addition, the proposed method takes 40,960 sampled points as the input in the training phase, and the raw point cloud is fed into the network during testing, following RandLA-Net [9].
The quantitative results of the proposed method on S3DIS data are presented in Table 3. Compared with the state-of-the-art methods following RandLA-Net [9], including PointNet [7], RSNet [41], 3P-RNN [24], SPG [42], PointCNN [15], PointWeb [43], ShellNet [44], and KPConv [17], RAAFNet achieves the best performance in terms of OA and mIoU. The S3DIS dataset covers a variety of different types of indoor scenes, including offices, classrooms, and corridors, and the dataset scale is also quite considerable. This indicates that RAAFNet is able to more accurately recognize and segment various objects and structures in indoor scenes. It has strong robustness and adaptability in processing complex indoor environment data and has good generalization ability.

5. Conclusions

This paper presents RAAFNet for segmenting large-scale point clouds by leveraging reverse attention adaptive fusion. The method utilizes a reverse attention fusion encoder–decoder and an adaptive fusion module to extract and capture fine-grained features at various scales. Through extensive experiments on publicly available datasets, RAAFNet was evaluated in terms of its effectiveness and accuracy. The experimental results demonstrate RAAFNet’s superiority over existing state-of-the-art models, showcasing substantial enhancements in segmentation performance without compromising efficiency. In future work, we will optimize the method to handle large-scale sparse point cloud data and extend its core structures to more 3D tasks, such as object detection and instance segmentation.

Author Contributions

K.W.: Conceptualization, methodology, software, validation, formal analysis, writing—original draft, writing—review and editing. H.Z.: Writing—review and editing, supervision, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Research and Development Program of Shaanxi Province under Grant 2024GX-YBXM-231, in part supported by the Xi’an Beilin District Science and Technology Plan Project under Grant GX2311, in part by the Preferential Funding for Post Doctoral Research Program in ZheJiang Province under Grant ZJ2022154, in part by the Post-Doctoral Research Program in Shaanxi Province, in part by the Science and Technology Foundation of Xi’an for Program of University Science and Technology Scholar Serving Enterprise under Grant 22GXFW0033, in part by The Youth Innovation Team of Shaanxi Universities, and in part by Innovation Capability Support Program of Shaanxi under Grant 2021TD-29.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Saez-Perez, J.; Wang, Q.; Calero, J.M.A.; Garcia-Rodriguez, J. Enhancing point cloud resolution for autonomous driving with deep learning AI models. In Proceedings of the 2024 IEEE International Conference on Pervasive Computing and Communications Workshops and Other Affiliated Events (PerCom Workshops), Biarritz, France, 11–15 March 2024; pp. 599–604. [Google Scholar]
  2. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  3. Lawin, F.J.; Danelljan, M.; Tosteberg, P.; Bhat, G.; Khan, F.S.; Felsberg, M. Deep projective 3D semantic segmentation. In Computer Analysis of Images and Patterns, Proceedings of the 17th International Conference, CAIP 2017, Ystad, Sweden, 22–24 August 2017; Proceedings, Part I 17; Springer: Cham, Switzerland, 2017; pp. 95–107. [Google Scholar]
  4. Wu, B.; Wan, A.; Yue, X.; Keutzer, K. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3D lidar point cloud. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 1887–1893. [Google Scholar]
  5. Wu, B.; Zhou, X.; Zhao, S.; Yue, X.; Keutzer, K. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4376–4382. [Google Scholar]
  6. Jiang, M.; Wu, Y.; Zhao, T.; Zhao, Z.; Lu, C. Pointsift: A sift-like network module for 3D point cloud semantic segmentation. arXiv 2018, arXiv:1807.00652. [Google Scholar]
  7. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  8. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
  9. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
  10. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  12. Rakotosaona, M.J.; La Barbera, V.; Guerrero, P.; Mitra, N.J.; Ovsjanikov, M. Pointcleannet: Learning to denoise and remove outliers from dense point clouds. Comput. Graph. Forum 2020, 39, 185–203. [Google Scholar] [CrossRef]
  13. Xiu, H.; Liu, X.; Wang, W.; Kim, K.S.; Shinohara, T.; Chang, Q.; Matsuoka, M. Diffusion unit: Interpretable edge enhancement and suppression learning for 3D point cloud segmentation. Neurocomputing 2023, 559, 126780. [Google Scholar] [CrossRef]
  14. Han, J.; Liu, K.; Li, W.; Chen, G.; Wang, W.; Zhang, F. A Large-Scale Network Construction and Lightweighting Method for Point Cloud Semantic Segmentation. IEEE Trans. Image Process. 2024, 33, 2004–2017. [Google Scholar] [CrossRef] [PubMed]
  15. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; Volume 31. [Google Scholar]
  16. Hua, B.-S.; Tran, M.-K.; Yeung, S.-K. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 984–993. [Google Scholar]
  17. Thomas, H.; Qi, C.R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  18. Boulch, A. ConvPoint: Continuous convolutions for point cloud processing. Comput. Graph. 2020, 88, 24–34. [Google Scholar] [CrossRef]
  19. Komarichev, A.; Zhong, Z.; Hua, J. A-cnn: Annularly convolutional neural networks on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7421–7430. [Google Scholar]
  20. Xu, M.; Ding, R.; Zhao, H.; Qi, X. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3173–3182. [Google Scholar]
  21. Woo, S.; Lee, D.; Hwang, S.; Kim, W.J.; Lee, S. MKConv: Multidimensional feature representation for point cloud analysis. Pattern Recognit. 2023, 143, 109800. [Google Scholar] [CrossRef]
  22. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27. [Google Scholar]
  23. Engelmann, F.; Kontogianni, T.; Hermans, A.; Leibe, B. Exploring spatial context for 3D semantic segmentation of point clouds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 716–724. [Google Scholar]
  24. Ye, X.; Li, J.; Huang, H.; Du, L.; Zhang, X. 3D recurrent neural networks with context fusion for point cloud semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8 September 2018; pp. 403–417. [Google Scholar]
  25. Qiu, S.; Anwar, S.; Barnes, N. Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1757–1767. [Google Scholar]
  26. Li, H.; Guan, H.; Ma, L.; Lei, X.; Yu, Y.; Wang, H.; Delavar, M.R.; Li, J. MVPNet: A multi-scale voxel-point adaptive fusion network for point cloud semantic segmentation in urban scenes. Int. J. Appl. Earth Obs. Geoinf. 2023, 122, 103391. [Google Scholar] [CrossRef]
  27. Shuai, H.; Xu, X.; Liu, Q. Backward Attentive Fusing Network with Local Aggregation Classifier for 3D Point Cloud Semantic Segmentation. IEEE Trans. Image Process. 2021, 30, 4973–4984. [Google Scholar] [CrossRef] [PubMed]
  28. Babenko, A.; Lempitsky, V. Aggregating local deep features for image retrieval. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1269–1277. [Google Scholar]
  29. Wang, H. Research on Point Cloud Upsampling Algorithms Based on Deep Learning. J. Image Signal Process. 2023, 12, 21–31. [Google Scholar] [CrossRef]
  30. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  31. Hackel, T.; Savinov, N.; Ladicky, L.; Wegner, J.D.; Schindler, K.; Pollefeys, M. Semantic3d. net: A new large-scale point cloud classification benchmark. arXiv 2017, arXiv:1704.03847. [Google Scholar]
  32. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  33. Hackel, T.; Wegner, J.D.; Schindler, K. Fast semantic segmentation of 3D point clouds with strongly varying density. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 177–184. [Google Scholar] [CrossRef]
  34. Roynard, X.; Deschaud, J.-E.; Goulette, F. Classification of point cloud scenes with multiscale voxel deep network. arXiv 2018, arXiv:1804.03583. [Google Scholar]
  35. Boulch, A.; Le Saux, B.; Audebert, N. Unstructured point cloud semantic labeling using deep segmentation networks. In Proceedings of the 3Dor ‘17: Proceedings of the Workshop on 3D Object Retrieval, Lyon, France, 23–24 April 2017; Volume 3, pp. 17–24. [Google Scholar]
  36. Tchapmi, L.; Choy, C.; Armeni, I.; Gwak, J.; Savarese, S. Segcloud: Semantic segmentation of 3D point clouds. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 537–547. [Google Scholar]
  37. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  38. Thomas, H.; Goulette, F.; Deschaud, J.-E.; Marcotegui, B.; LeGall, Y. Semantic classification of 3D point clouds with multiscale spherical neighborhoods. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 390–398. [Google Scholar]
  39. Contreras, J.; Denzler, J. Edge-convolution point net for semantic segmentation of large-scale point clouds. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 5236–5239. [Google Scholar]
  40. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1534–1543. [Google Scholar]
  41. Huang, Q.; Wang, W.; Neumann, U. Recurrent slice networks for 3D segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2626–2635. [Google Scholar]
  42. Landrieu, L.; Simonovsky, M. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4558–4567. [Google Scholar]
  43. Zhao, H.; Jiang, L.; Fu, C.-W.; Jia, J. Pointweb: Enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5565–5573. [Google Scholar]
  44. Zhang, Z.; Hua, B.-S.; Yeung, S.-K. Shellnet: Efficient point cloud convolutional neural networks using concentric shells statistics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1607–1616. [Google Scholar]
Figure 1. The diagram of the RAAFNet method.
Figure 2. Backward attention fusion module.
Figure 3. Adaptive fusion module.
Figure 4. Results of different methods on the Semantic3D Dataset.
Table 1. Experimental results of different methods on the Semantic3D Dataset.
Method            OA (%)   mIoU (%)
TMLC-MS [33]      85.0     49.4
MSDVN [34]        84.8     57.1
SnapNet [35]      88.6     59.1
SEGCloud [36]     88.1     61.3
PointNet++ [37]   85.7     63.1
RF_MSSF [38]      90.3     63.9
EdgeConv [39]     89.6     64.4
MSDeepVox [34]    88.4     65.3
RAAFNet           89.9     74.4
Table 2. Quantitative comparison of semantic segmentation results for various categories in the Semantic3D [31] dataset.
Method            Artificial Terrain   Natural Terrain   High Vegetation   Low Vegetation   Building   Arti   Scan   Car
TMLC-MS [33]      91.1                 69.5              32.8              21.6             87.6       25.9   11.3   55.3
MSDVN [34]        82.7                 53.1              83.8              28.7             89.9       23.6   29.8   65.0
SnapNet [35]      82.0                 77.3              79.7              22.9             91.1       18.4   37.3   64.4
SEGCloud [36]     83.9                 66.0              86.0              40.5             91.1       30.9   27.5   64.3
PointNet++ [37]   81.9                 78.1              64.3              51.7             75.9       36.4   43.7   72.6
RF_MSSF [38]      87.6                 80.3              81.8              36.4             92.2       24.1   42.6   56.6
EdgeConv [39]     91.1                 69.5              65.0              56.0             89.7       30.0   43.8   69.7
MSDeepVox [34]    83.0                 67.2              83.9              36.7             92.4       31.3   50.0   78.2
RAAFNet           86.8                 74.2              86.0              62.8             89.2       44.2   77.2   74.8
Table 3. Quantitative results of different methods on S3DIS (6-fold cross-validation).
Method           mIoU (%)  OA (%)  mAcc (%)  Ceil.  Floor  Wall  Beam  Col.  Wind.  Door  Table  Chair  Sofa  Book.  Board  Clut.
PointNet [7]     47.6      78.6    66.2      88.0   88.7   69.3  42.4  23.1  47.5   51.6  54.1   42.0   9.6   38.2   29.4   35.2
RSNet [41]       56.5      -       66.5      92.5   92.8   78.6  32.8  34.4  51.6   68.1  59.7   60.1   16.4  50.2   44.9   52.0
3P-RNN [24]      56.3      86.9    -         92.9   93.8   73.1  42.5  25.9  47.6   59.2  60.4   66.7   24.8  57.0   36.7   51.6
SPG [42]         62.1      85.5    73.0      89.9   95.1   76.4  62.8  47.1  55.3   68.4  73.5   69.2   63.2  45.9   8.7    52.9
PointCNN [15]    65.4      88.1    75.6      94.8   97.3   75.8  63.3  51.7  58.4   57.2  71.6   69.1   39.1  61.2   52.2   58.6
PointWeb [43]    66.7      87.3    76.2      93.5   94.2   80.8  52.4  41.3  64.9   68.1  71.4   67.1   50.3  62.7   62.2   58.5
ShellNet [44]    66.8      87.1    -         90.2   93.6   79.9  60.4  44.1  64.9   52.9  71.6   84.7   53.8  64.6   48.6   59.4
KPConv [17]      70.6      -       79.1      93.6   92.4   83.1  63.9  54.3  66.1   76.6  57.8   64.0   69.3  74.9   61.3   60.3
RandLA-Net [9]   70.0      88.0    82.0      93.1   96.1   80.6  62.4  48.0  64.4   69.4  69.4   76.4   60.0  64.2   65.9   60.1
RAAFNet          72.3      88.3    81.8      96.3   95.1   78.9  55.6  52.9  79.3   85.7  67.3   77.9   51.9  59.9   71.0   67.9