Article

MSIDA-Net: Point Cloud Semantic Segmentation via Multi-Spatial Information and Dual Adaptive Blocks

1 Guangxi Key Laboratory of Intelligent Control and Maintenance of Power Equipment, School of Electrical Engineering, Guangxi University, Nanning 530004, China
2 Artificial Intelligence Key Laboratory of Sichuan Province, Yibin 644000, China
3 Beijing Advanced Innovation Center for Imaging Theory and Technology, Key Lab of 3D Information Acquisition and Application, College of Resource Environment and Tourism, Capital Normal University, Beijing 100048, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2022, 14(9), 2187; https://doi.org/10.3390/rs14092187
Submission received: 5 March 2022 / Revised: 27 April 2022 / Accepted: 28 April 2022 / Published: 3 May 2022
(This article belongs to the Special Issue Point Cloud Processing in Remote Sensing Technology)

Abstract

Large-scale 3D point clouds are rich in geometric shape and scale information, but they are also scattered, disordered and unevenly distributed. These characteristics lead to difficulties in learning point cloud semantic segmentation. Although many works have performed well in this task, most of them lack research on spatial information, which limits the ability to learn and understand the complex geometric structure of point cloud scenes. To this end, we propose the multispatial information and dual adaptive (MSIDA) module, which consists of a multispatial information encoding (MSI) block and dual adaptive (DA) blocks. The MSI block transforms the relative position of each centre point and its neighbouring points into a cylindrical coordinate system and a spherical coordinate system; the spatial information among the points can then be re-represented and encoded. The DA blocks include a Coordinate System Attention Pooling Fusion (CSAPF) block and a Local Aggregated Feature Attention (LAFA) block. The CSAPF block weights and fuses the local features in the three coordinate systems to further learn local features, while the LAFA block weights the local aggregated features in the three coordinate systems to better understand the scene in the local region. To test the performance of the proposed method, we conducted experiments on the S3DIS, Semantic3D and SemanticKITTI datasets and compared the proposed method with other networks. The proposed method achieved 73%, 77.8% and 59.8% mean Intersection over Union (mIoU) on the S3DIS, Semantic3D and SemanticKITTI datasets, respectively.

1. Introduction

With the increased performance and popularization of spatial measurement devices, e.g., LiDAR and RGB-D cameras, point cloud data are higher in quality and easier to collect. Point clouds can directly reflect the spatial information of real-life scenes and have been applied in many fields, e.g., autonomous driving, robotics, virtual reality, remote sensing and medical treatment. Presently, deep learning on point clouds is receiving increased attention from scholars due to its potential to comprehend spatial information from real-life scenes. Guo et al. [1] summarised the works related to point cloud deep learning in recent years. These works are divided into studies of three main tasks: 3D shape classification, 3D object detection and 3D point cloud segmentation. In addition, the task of 3D point cloud segmentation can be subdivided into semantic segmentation, instance segmentation and part segmentation.
The study of semantic segmentation for 2D images has been very fruitful because 2D image structures are regular and well suited to convolutional neural networks (CNNs). However, 3D point clouds are irregular and disordered, so traditional CNNs must be adapted for point cloud processing. Recently, many CNN-based works have been proposed for the semantic segmentation of point clouds; they can be mainly divided into point aggregation-based methods, graph-based methods and spatial-based methods. Point aggregation-based CNNs. Lu et al. [2] proposed a novel category-guided aggregation network that first identifies whether neighbouring points belong to the same category as their input points and then aggregates these points by category for feature learning. However, this method cannot handle local features with large prediction errors. Landrieu et al. [3] proposed an SPG that divides the point cloud into multiple superpoints to learn the point cloud features. Furthermore, the method in ref. [4] divided point clouds into superpoints in a more detailed and reasonable way. However, these methods are computationally expensive and inefficient. Graph-based CNNs. To better learn point cloud features, Li et al. [5] applied Gaussian-weighted Taylor kernels to enhance the shape description capability of graph convolutions. DCG-Net [6] learns point cloud features by constructing a dynamic capsule graph. Liu et al. [7] improved graph convolution networks to dynamically aggregate points. Liang et al. [8] combined a depth-wise graph convolution with a hierarchical structure to learn the local and global features of point clouds. However, this kind of method only extracts features based on the relation between the centre point and its neighbouring points. Spatial-based CNNs. PointNet [9] directly consumes the coordinates of the input points. PointNet++ [10] extracts point features based on the coordinates of centre points and their neighbouring points. Wang et al. [11] proposed edge features for local information representation. To extract edge features that can arbitrarily expand the receptive field, ref. [12] introduced a sampling rate r to sample the neighbouring points. Qiu et al. [13] introduced offsets, but these were still essentially a further optimization of edge features. RandLA-Net [14] learns the spatial information of the input point coordinates $p_i$, the neighbouring point coordinates $p_i^k$, the relative positions $p_i - p_i^k$ and the geometric distances $\|p_i - p_i^k\|$. Fan et al. [15] used polar angle features as one of the spatial information items to be learned. However, these methods focus on the input of spatial information and lack deeper or multiple types of learning of point cloud features.
Based on the above review, most of the aforementioned CNNs have some drawbacks.
First, the input spatial information used in the vast majority of these studies is relatively simple, generally the input point coordinates, the neighbouring point coordinates and edge features. These studies did not obtain enough spatial information, which limited their ability to learn and understand the complex spatial features of point clouds. As shown in Figure 1, each coordinate system can represent the spatial information among points in a different way. In addition, the distribution of the same points varies across coordinate systems, which fully demonstrates that these points have different spatial relations in each coordinate system.
Second, many works are unable to learn the input information sufficiently. The aforementioned CNNs mainly focus on constructing graphs or on the input of spatial information and are less concerned with how to learn the features a step further.
With the analysis of these problems, we argue that physically meaningful spatial information plays a crucial role in feature learning.
The contributions of this research are as follows:
  • We propose a multiple spatial information encoding block that aims to learn more types of spatial information about the point cloud;
  • We propose a coordinate systems attentive pooling fusion (CSAPF) block to sufficiently learn local context features. The local features encoded in each of the three coordinate systems are first attention pooled. Then, the three attention scores obtained are added and averaged. Through these steps, each of the neighbouring points obtains a more reasonable attention score for feature learning;
  • A local aggregation features attention (LAFA) block is proposed. The distribution of points in each local region is different, so this block learns the features of each local overall region in different coordinate systems and adaptively weights these local aggregation features according to their contribution. In other words, the better the local aggregation features describe the local region, the more important they are. Through this block, not only can the relation among different coordinate systems be adequately learned, but the ability of our proposed method to understand local regions can also be improved.

2. Related Work

2.1. Projection-Based Methods

Lawin et al. [16] projected the 3D point cloud onto the 2D plane, and then multiple FCNs were used to learn the point features. Boulch et al. [17] obtained the depth image of the point cloud directly from several different camera positions, and then SegNet [18] was used to predict the labels. Finally, the semantic images were projected back to the point cloud. Tatarchenko et al. [19] projected the local geometric surface of the point cloud onto a virtual plane and then applied the proposed tangent plane convolution directly to the virtual plane. In general, these methods have obvious drawbacks. First, there are many tedious preparatory tasks. Second, the choice of viewpoint also has a significant impact on the prediction results. Third, the projection of the point cloud from 3D to 2D can result in the loss of structural information. Compared with the aforementioned methods, our MSIDA-Net has none of these drawbacks.

2.2. Voxel-Based Methods

Huang et al. [20] voxelised point cloud data and then used a 3D CNN for feature learning. Tchapmi et al. [21] applied 3D trilinear interpolation and conditional random fields to a voxel-level 3D CNN to learn the point cloud features. Meng et al. [22] proposed a variational autoencoder equipped with a radial basis function kernel to learn the features of subvoxels. Dai et al. [23] proposed a truncated signed distance field to represent 3D scans and store them in voxel grids for feature learning. Hu et al. [24] proposed VMNet to process point clouds using both voxels and meshes, where the voxel part focuses on learning point cloud features, while the mesh part focuses on constructing the graph structure of multiple layers. Ye et al. [25] divided the point cloud learning approach into two parts, one for learning pointwise features and the other for learning voxel features. However, the voxel-based approach also has drawbacks in that voxelization inevitably loses detailed information and the voxel resolution directly affects the training results. Compared with these voxel-based methods, the inputs of our proposed method are original point clouds instead of voxels, so less information is lost.

2.3. Point-Based Methods

Point-based methods work directly on points and capture the detailed features of a point cloud with little loss of geometric structure information.
Attention-based Methods. Ma et al. [26] applied a self-attention mechanism to a graph for global contextual reasoning. GAC-Net [27] can dynamically assign attention weights to neighbouring points. Zhang et al. [28] proposed a contextual attention graph for learning multiscale local features. Yang et al. [29] proposed a network combining self-attention with Gumbel subset sampling to solve the farthest point sampling problem. Compared with these Attention-based methods, the two attention mechanisms (the dual adaptive blocks) of our proposed method can sufficiently learn the features of point clouds.
Kernel-based Methods. Hua et al. [30] inputted point clouds into multiple pointwise convolution operators for feature learning and concatenated them together for the output. Thomas et al. [31] proposed a kernel point convolution, which is very flexible and does not require the same number of input points as kernel points. Engelmann et al. [32] proposed dilated point convolutions, which can expand the receptive field without increasing the computational effort. Lei et al. [33] proposed a spherical convolution kernel that divides the area around the input point spherically into smaller bins, with each bin having different weights. Lin et al. [34] proposed a deformable 3D kernel in which the structural features of the point cloud are learned by a 3D GCN. However, the prediction results of all these methods depend on the hyperparameters of the kernel, while our MSIDA-Net does not need manually set kernel hyperparameters.

3. Methodology

Figure 2a shows the framework of MSIDA-Net. The input point cloud is $N \times d_{in}$, where $N$ is the total number of input points and $d_{in}$ is the input feature dimension. First, each point feature is extracted through a fully connected (FC) layer, whose output dimension is 8. Then, the output features of the FC layer are input to the MSIDA module to learn the features. The detailed structure of the MSIDA module is shown in Figure 2b. The input point coordinates are encoded by a multiple spatial information encoding (MSI) block and then further weighted by the CSAPF block and the LAFA block. The overall feature is downsampled and input to the next encoding layer. The extracted feature is the coarse local feature and is defined as $f_i^k$, where $i$ indexes the $i$-th centre point and $k$ indexes the $k$-th neighbouring point. The adaptive fusion module was proposed by ref. [13] and aims to learn the global features. Briefly, the adaptive fusion module upsamples each downsampled feature map to an $N$-point feature map and adaptively weights these upsampled feature maps. Finally, three fully connected layers are applied to the weighted features for the prediction of each point.

3.1. Spatial Information Encoding Based on Multiple Coordinate Systems

In addition to the Cartesian coordinate system, the spherical and cylindrical coordinate systems can each completely describe the spatial information of points in 3D data. To learn more types of spatial information, we converted the spatial information of the Cartesian coordinate system into the spatial information of the spherical coordinate system and the cylindrical coordinate system. This block consists of three parts, namely, the Cartesian coordinate system feature encoding, the spherical coordinate system feature encoding and the cylindrical coordinate system feature encoding. First, K-nearest neighbours (KNN) is used to find the neighbouring points of each $i$-th centre point. After that, these neighbouring points are encoded in each of the three coordinate systems. The structure is shown in Figure 3.
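As a reference for the grouping step, the following is a minimal NumPy sketch of a brute-force KNN search and of the relative positions $p_i - p_i^k$ that the three encodings below operate on; the function names (knn_group, relative_positions) are illustrative and are not taken from any released MSIDA-Net implementation.

```python
import numpy as np

def knn_group(points, k=16):
    """Return the indices of the k nearest neighbours of every point.

    points: (N, 3) array of xyz coordinates.
    returns: (N, k) integer array; row i holds the neighbour indices of point i.
    """
    # Brute-force pairwise squared Euclidean distances, (N, N).
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    # Each point is its own nearest neighbour (distance 0), which keeps the
    # centre point inside its neighbourhood.
    return np.argsort(dist2, axis=-1)[:, :k]

def relative_positions(points, idx):
    """Relative positions p_i - p_i^k for every centre point, shape (N, k, 3)."""
    neighbours = points[idx]                     # (N, k, 3)
    return points[:, None, :] - neighbours       # broadcast the centre point

if __name__ == "__main__":
    pts = np.random.rand(1024, 3).astype(np.float32)
    idx = knn_group(pts, k=16)
    rel = relative_positions(pts, idx)           # used by the three encodings below
    print(idx.shape, rel.shape)                  # (1024, 16) (1024, 16, 3)
```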

3.1.1. The Spatial Information Encoding of the Cartesian Coordinate System

Cartesian coordinate system spatial information includes the input point coordinates $p_i$, the coordinates of the neighbouring points $p_i^k$, the relative position $p_i - p_i^k$ and the geometric distance $\|p_i - p_i^k\|$:
$$c_{i,Ca}^{k} = \mathrm{MLP}\big(\mathrm{CONCAT}(p_i,\ p_i^k,\ p_i - p_i^k,\ \|p_i - p_i^k\|)\big) \quad (1)$$
where $c_{i,Ca}^{k}$ represents the features of each point in the Cartesian coordinate system, $\mathrm{CONCAT}(\cdot)$ means that the spatial features are concatenated along the feature dimension and $\mathrm{MLP}(\cdot)$ is a convolution layer with a 1 × 1 kernel size.
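To make Equation (1) concrete, the following sketch assembles the Cartesian spatial information and applies a shared 1 × 1 "MLP", modelled here as a single dense layer with random placeholder weights; the variable names are ours, and in the real network the weights are learned during training.

```python
import numpy as np

def cartesian_spatial_features(points, idx):
    """CONCAT(p_i, p_i^k, p_i - p_i^k, ||p_i - p_i^k||) of Equation (1), (N, K, 10)."""
    K = idx.shape[1]
    p_i = np.repeat(points[:, None, :], K, axis=1)      # centre point, (N, K, 3)
    p_ik = points[idx]                                   # neighbouring points, (N, K, 3)
    rel = p_i - p_ik                                     # relative position
    dist = np.linalg.norm(rel, axis=-1, keepdims=True)   # geometric distance, (N, K, 1)
    return np.concatenate([p_i, p_ik, rel, dist], axis=-1)

def shared_mlp(x, out_dim, seed=0):
    """A 1x1 convolution over the last axis is a per-point dense layer + ReLU."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.1   # placeholder weights
    return np.maximum(x @ w, 0.0)

# c_ca = shared_mlp(cartesian_spatial_features(points, idx), out_dim=16)
```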

3.1.2. The Spatial Information Encoding of the Spherical Coordinate System

As shown in Figure 4a, the angular difference and geometric distance of the centre point and each of its neighbouring points are the spherical coordinate system spatial information. It should be noted that the centre of conversion is the centre point. In addition, the directions of the x, y and z-axes are not defined and the painted axes are used to understand the conversion. It is known that the equations for conversion from the Cartesian coordinate system to the spherical coordinate system are as follows:
$$r = \sqrt{x^2 + y^2 + z^2},\qquad \theta = \cos^{-1}\left(\frac{z}{r}\right),\qquad \varphi = \tan^{-1}\left(\frac{y}{x}\right) \quad (2)$$
The relative position of the centre point and each of its $k$-th neighbouring points is represented as $(x_i^k, y_i^k, z_i^k)$. Then, we applied the relative position to Equation (2) to obtain $(r_i^k, \theta_i^k, \varphi_i^k)$, where $r_i^k$ represents the geometric distance, $\theta_i^k$ is the elevation angle and $\varphi_i^k$ is the azimuth angle.
As shown in Figure 4b, we introduced the centre of mass $p_i^m$ from SCF-Net [15], which reflects the distribution of local features. The centre of mass is obtained by averaging the coordinates of the neighbouring points. The relative position of the centre point to the centre of mass is $(x_i^m, y_i^m, z_i^m)$, which is applied to Equation (2) to obtain $(r_i^m, \theta_i^m, \varphi_i^m)$.
To learn more detailed directional local features of the point cloud, the equations were defined as:
$$\theta_i^{k\prime} = \theta_i^k - \theta_i^m \quad (3)$$
$$\varphi_i^{k\prime} = \varphi_i^k - \varphi_i^m \quad (4)$$
where Equation (3) gives the detailed feature of the elevation angle and Equation (4) gives the detailed feature of the azimuth angle.
We selected $r_i^k$, $r_i^m$, $\theta_i^{k\prime}$ and $\varphi_i^{k\prime}$ as the spatial information of the spherical coordinate system and encoded them as follows:
$$c_{i,Sp}^{k} = \mathrm{MLP}\big(\mathrm{CONCAT}(r_i^k,\ r_i^m,\ \theta_i^{k\prime},\ \varphi_i^{k\prime})\big) \quad (5)$$
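A sketch of the spherical spatial information of Equations (2)-(5) is given below. It assumes `rel` holds the relative positions $p_i - p_i^k$ with shape (N, K, 3); the centre-of-mass offset is recovered as the mean over the K neighbours, since averaging $p_i - p_i^k$ gives $p_i - p_i^m$. We use np.arctan2 in place of $\tan^{-1}(y/x)$ so that the azimuth is quadrant-safe, which is an implementation choice rather than a detail stated in the paper.

```python
import numpy as np

def to_spherical(xyz, eps=1e-9):
    x, y, z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    r = np.sqrt(x * x + y * y + z * z)                       # geometric distance
    theta = np.arccos(np.clip(z / (r + eps), -1.0, 1.0))     # elevation angle
    phi = np.arctan2(y, x)                                   # azimuth angle
    return r, theta, phi

def spherical_spatial_features(rel):
    """CONCAT(r_i^k, r_i^m, theta', phi') of Equation (5), shape (N, K, 4)."""
    rel_m = rel.mean(axis=1)                                 # p_i - p_i^m, (N, 3)
    r_k, theta_k, phi_k = to_spherical(rel)                  # per neighbour, (N, K)
    r_m, theta_m, phi_m = to_spherical(rel_m)                # per centre of mass, (N,)
    d_theta = theta_k - theta_m[:, None]                     # Equation (3)
    d_phi = phi_k - phi_m[:, None]                           # Equation (4)
    r_m_rep = np.broadcast_to(r_m[:, None], r_k.shape)
    return np.stack([r_k, r_m_rep, d_theta, d_phi], axis=-1) # fed to a shared MLP
```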

3.1.3. Spatial Information Encoding of the Cylindrical Coordinate System

As shown in Figure 5a, the angular difference, radial distance and z coordinate distance of the centre point and each of its neighbouring points are the cylindrical coordinate system spatial information. Additionally, the painted x-, y-, z-axes are used to understand the conversion. Although the z coordinate does not need to be transformed, the spatial information of the z-axis is indispensable for the complete learning of the spatial features in the cylindrical coordinate system.
The equations for conversion from the Cartesian coordinate system to the cylindrical coordinate system are as follows:
$$l = \sqrt{x^2 + y^2},\qquad \alpha = \tan^{-1}\left(\frac{y}{x}\right),\qquad z = z \quad (6)$$
Here, we substituted $(x_i^k, y_i^k, z_i^k)$ and $(x_i^m, y_i^m, z_i^m)$ into Equation (6) to obtain $(l_i^k, \alpha_i^k, z_i^k)$ and $(l_i^m, \alpha_i^m, z_i^m)$, respectively. $l_i^k$, $\alpha_i^k$ and $z_i^k$ represent the radial distance, azimuth angle and relative position along the z-axis between the centre point and each of its $k$-th neighbouring points, while $l_i^m$, $\alpha_i^m$ and $z_i^m$ represent the radial distance, azimuth angle and relative position along the z-axis between the centre point and the centre of mass.
The azimuth angle was selected to learn the detailed local directional features:
$$\alpha_i^{k\prime} = \alpha_i^k - \alpha_i^m \quad (7)$$
We selected $l_i^k$, $l_i^m$, $\alpha_i^{k\prime}$, $z_i^k$ and $z_i^m$ as the spatial information of the cylindrical coordinate system and encoded them as follows:
$$c_{i,Cy}^{k} = \mathrm{MLP}\big(\mathrm{CONCAT}(l_i^k,\ l_i^m,\ \alpha_i^{k\prime},\ z_i^k,\ z_i^m)\big) \quad (8)$$
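The cylindrical information of Equations (6)-(8) can be sketched in the same way, under the same assumptions as the spherical sketch above (`rel` of shape (N, K, 3); arctan2 as a quadrant-safe $\tan^{-1}(y/x)$).

```python
import numpy as np

def cylindrical_spatial_features(rel):
    """CONCAT(l_i^k, l_i^m, alpha', z_i^k, z_i^m) of Equation (8), shape (N, K, 5)."""
    rel_m = rel.mean(axis=1)                                  # p_i - p_i^m, (N, 3)
    l_k = np.linalg.norm(rel[..., :2], axis=-1)               # radial distance, per neighbour
    a_k = np.arctan2(rel[..., 1], rel[..., 0])                # azimuth, per neighbour
    z_k = rel[..., 2]                                         # z offset, per neighbour
    l_m = np.linalg.norm(rel_m[..., :2], axis=-1)             # radial distance, centre of mass
    a_m = np.arctan2(rel_m[..., 1], rel_m[..., 0])
    z_m = rel_m[..., 2]
    d_a = a_k - a_m[:, None]                                  # Equation (7)
    l_m_rep = np.broadcast_to(l_m[:, None], l_k.shape)
    z_m_rep = np.broadcast_to(z_m[:, None], z_k.shape)
    return np.stack([l_k, l_m_rep, d_a, z_k, z_m_rep], axis=-1)  # fed to a shared MLP
```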

3.2. Coordinate Systems Attentive Pooling Fusion

Hu et al. [14] proposed attention pooling, which is powerful for automatically learning important local features. However, the ability of attention pooling is limited by the small amount of spatial information it uses. To better learn the detailed local features in different coordinate systems, we proposed the coordinate systems attention pooling fusion (CSAPF) block, as shown in Figure 6. First, the Cartesian features, the spherical features and the cylindrical features extracted by the MSI block are attention pooled to obtain the per-point attention score in the three different coordinate systems. Second, the attention scores of each point across the three coordinate systems are summed and averaged to obtain its fusion attention score. Then, the fusion attention score matrix and the local feature matrix are dot producted to obtain the coordinate fusion features. Finally, we concatenated the coordinate fusion features and all the local features as the output of this block.
This block learns the attention score of each point in different coordinate systems and gathers these attention scores in a reasonable way to automatically select important local features.

3.2.1. Calculating the Attention Scores of Neighbouring Points in Each Coordinate System

We applied attention pooling to learn the local features of each coordinate system, as in Equation (9):
$$s_i^k = \mathrm{SOFTMAX}\big(W(c_i^k)\big) \quad (9)$$
$$\mathrm{SOFTMAX}(p_i^k) = \frac{e^{f_i^k}}{\sum_{k=1}^{K} e^{f_i^k}} \quad (10)$$
Equation (10) is the calculation of SOFTMAX. In Equation (9), $p_i^k$ is the $k$-th neighbouring point of the $i$-th input point and its feature is $f_i^k$. The features $c_{i,Ca}^{k}$, $c_{i,Sp}^{k}$ and $c_{i,Cy}^{k}$ are taken into Equation (9) to obtain $s_{i,Ca}^{k}$, $s_{i,Sp}^{k}$ and $s_{i,Cy}^{k}$, which represent the attention scores of the neighbouring points of the $i$-th input point in the different coordinate systems. Clearly, each $k$-th neighbouring point contributes differently in each coordinate system, so its values of $s_{i,Ca}^{k}$, $s_{i,Sp}^{k}$ and $s_{i,Cy}^{k}$ also differ. $W$ represents the learnable weight matrix, which is in fact an MLP layer without a bias or activation function.

3.2.2. Attention Scores Fusion

As mentioned in Section 3.2.1, the values of $s_{i,Ca}^{k}$, $s_{i,Sp}^{k}$ and $s_{i,Cy}^{k}$ differ for each point. To obtain a more reasonable score that reflects the overall importance of each point, we proposed the fusion attention score. The fusion attention scores of the neighbouring points of the $i$-th input point over the different coordinate systems are calculated with Equation (11):
$$s_i^k = \lambda_{Ca} \cdot s_{i,Ca}^{k} + \lambda_{Sp} \cdot s_{i,Sp}^{k} + \lambda_{Cy} \cdot s_{i,Cy}^{k} \quad (11)$$
where $\lambda_{Ca}:\lambda_{Sp}:\lambda_{Cy}$ is the ratio, $\cdot$ is the dot product and $s_i^k$ represents the fusion attention scores, which reflect the importance of each point in a more reasonable way. The ratio $\lambda_{Ca}:\lambda_{Sp}:\lambda_{Cy}$ was set to 0.33:0.33:0.33; the ablation experiments in Section 4.3.3 explain this choice.
We concatenated the local features and weighted them as:
$$c_i^k = \mathrm{MLP}\big(\mathrm{CONCAT}(c_{i,Ca}^{k},\ c_{i,Sp}^{k},\ c_{i,Cy}^{k},\ f_i^k)\big) \quad (12)$$
$$\tilde{c}_i = \sum_{k=1}^{K} \big(c_i^k \odot s_i^k\big) \quad (13)$$
where $f_i^k$ represents the extracted feature shown in Figure 2b, $\odot$ is the dot product, $\tilde{c}_i$ is the weighted and summed local feature and $c_i$ contains all the local features.
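The CSAPF computation of Equations (9)-(13) can be summarised by the following sketch, where c_ca, c_sp and c_cy are the MSI features and f is the extracted feature (all of shape (N, K, D)), the learnable matrices W and the shared MLP are replaced by random placeholder weights, and the ratio follows the paper's 0.33:0.33:0.33 setting.

```python
import numpy as np

def softmax_over_neighbours(scores):
    scores = scores - scores.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)                   # Equation (10), sum over K

def attention_scores(c, w):
    return softmax_over_neighbours(c @ w)                     # Equation (9), no bias/activation

def csapf(c_ca, c_sp, c_cy, f, seed=0):
    rng = np.random.default_rng(seed)
    d = c_ca.shape[-1]
    w_ca, w_sp, w_cy = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    s_ca = attention_scores(c_ca, w_ca)
    s_sp = attention_scores(c_sp, w_sp)
    s_cy = attention_scores(c_cy, w_cy)
    s = 0.33 * s_ca + 0.33 * s_sp + 0.33 * s_cy               # Equation (11), fusion score
    c_k = np.concatenate([c_ca, c_sp, c_cy, f], axis=-1)      # Equation (12), before the MLP
    w_mlp = rng.standard_normal((c_k.shape[-1], d)) * 0.1
    c_k = np.maximum(c_k @ w_mlp, 0.0)                        # shared MLP with ReLU
    return np.sum(c_k * s, axis=1)                            # Equation (13), (N, D)
```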

3.3. Local Aggregation Features Attention

To improve the understanding of local regions, we proposed the LAFA block (shown in Figure 7), which learns the local aggregation features in different coordinate systems and adaptively assigns weights to them. As shown in Figure 7, first, the three kinds of features extracted by the MSI block are each summed and averaged to obtain the local aggregation features. Second, the local aggregation features from the different coordinate systems are concatenated and attention pooling is applied to obtain the attention score of each local aggregation feature. Then, we split the attention score matrix into three parts and dot product each part of the attention score vector with its corresponding local aggregation features to obtain the weighted aggregation features. Finally, the three weighted aggregation features are concatenated as the output of this block.
Here, $\bar{c}_{i,Ca}$, $\bar{c}_{i,Sp}$ and $\bar{c}_{i,Cy}$ represent the local aggregation features of the $i$-th input point in the different coordinate systems. They are concatenated and denoted as $\bar{c}_l$. The local aggregation attention score is defined as:
$$\tilde{s}_i = \mathrm{SOFTMAX}\big(A(\bar{c}_l)\big) \quad (14)$$
$$\tilde{s}_{i,Ca},\ \tilde{s}_{i,Sp},\ \tilde{s}_{i,Cy} = \mathrm{SPLIT}(\tilde{s}_i) \quad (15)$$
where $\tilde{s}_i$ is the local aggregation attention score, $A$ is a learnable weight matrix and SPLIT means dividing $\tilde{s}_i$ along the dimension of 3. It should be noted that Equation (14) weights the local aggregation features of the three coordinate systems (the dimension of 3) rather than the feature channels (the dimension of $d_{out}/2$). $\tilde{s}_{i,Ca}$, $\tilde{s}_{i,Sp}$ and $\tilde{s}_{i,Cy}$ are the local aggregation attention scores of the $i$-th input point in the different coordinate systems.
The weighted local aggregation feature is represented as:
$$\bar{\bar{c}}_l = \mathrm{CONCAT}\big((\bar{c}_{i,Ca} \odot \tilde{s}_{i,Ca}),\ (\bar{c}_{i,Sp} \odot \tilde{s}_{i,Sp}),\ (\bar{c}_{i,Cy} \odot \tilde{s}_{i,Cy})\big) \quad (16)$$
The weighted local feature $c_i$ from Section 3.2.2 is combined with $\bar{\bar{c}}_l$, and the final output is defined as:
$$o_i = \mathrm{MLP}\big(\mathrm{CONCAT}(\bar{\bar{c}}_l,\ c_i)\big) \quad (17)$$
where $o_i$ represents the aggregation feature in Figure 2b.
To summarise, the final output $o_i$ combines the detailed local features and the local aggregation features of the $i$-th point in the different coordinate systems.
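Continuing the CSAPF sketch, the LAFA weighting of Equations (14)-(17) can be written as follows; c_ca, c_sp and c_cy are again the MSI features of shape (N, K, D), c_csapf is the weighted local feature returned by the CSAPF sketch, the matrix A and the final MLP use placeholder weights, and the softmax runs over the coordinate-system axis of size 3, as stated for Equation (14).

```python
import numpy as np

def lafa(c_ca, c_sp, c_cy, c_csapf, seed=0):
    rng = np.random.default_rng(seed)
    d = c_ca.shape[-1]
    # Local aggregation features: mean over the K neighbours, one per system, (N, 3, D).
    agg = np.stack([c.mean(axis=1) for c in (c_ca, c_sp, c_cy)], axis=1)
    a = rng.standard_normal((d, d)) * 0.1                     # learnable matrix A
    scores = agg @ a                                          # (N, 3, D)
    scores = scores - scores.max(axis=1, keepdims=True)
    s = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax over the 3 axis
    weighted = agg * s                                        # Equations (15)-(16): split and weight
    c_l = weighted.reshape(weighted.shape[0], -1)             # CONCAT over the 3 systems, (N, 3D)
    out = np.concatenate([c_l, c_csapf], axis=-1)             # Equation (17), before the MLP
    w = rng.standard_normal((out.shape[-1], d)) * 0.1
    return np.maximum(out @ w, 0.0)                           # aggregation feature o_i, (N, D)
```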

3.4. Loss Function

The cross-entropy loss function is effective for prediction. Thus, we used the cross-entropy loss as our loss function:
$$pred_i = \mathrm{SOFTMAX}(pred_i) \quad (18)$$
$$\mathrm{Loss} = -\sum_{a=1}^{A} Y_i \log(pred_i) \quad (19)$$
where $pred_i$ is the predicted label, $Y_i$ is the ground truth label and $A$ is the number of classes.
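As a minimal sketch, the loss of Equations (18) and (19) for logits of shape (N, A) and integer ground-truth labels of shape (N,), averaged over the N points, can be written as:

```python
import numpy as np

def cross_entropy_loss(logits, labels, eps=1e-12):
    logits = logits - logits.max(axis=1, keepdims=True)       # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # Equation (18)
    one_hot = np.eye(logits.shape[1])[labels]                 # ground truth Y_i as one-hot
    return float(np.mean(-np.sum(one_hot * np.log(probs + eps), axis=1)))  # Equation (19)
```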

4. Experiments

To validate the performance of the proposed method, we conducted comparison experiments on S3DIS [35], Semantic3D [36] and SemanticKITTI [37]. In addition, ablation studies were also conducted to prove the advancement and effectiveness of each proposed block.

4.1. Datasets

S3DIS [35] is a large indoor dataset, and the entire dataset is divided into 13 classes. The S3DIS dataset has 272 rooms, including conference rooms, copy rooms, hallways, offices, pantries, WCs, auditoriums, storage rooms and lobbies. Each room contains anywhere from hundreds of thousands to several million points, depending mainly on the size of the room; in addition, the number of objects contained in each room varies greatly. Each point has 3D coordinates, RGB values and a semantic label.
Semantic3D [36] is a large real-scene outdoor dataset with over 4 billion points. There are 15 training sets and 15 online test sets in total. The point cloud scenes include urban and suburban areas, and the points are divided into eight classes: man-made terrain, natural terrain, high vegetation, low vegetation, buildings, hardscape, scanning artefacts and cars.
SemanticKITTI [37] is an outdoor driving dataset that consists of 22 sequences. The dataset contains a total of more than 40,000 scans, each of which contains more than 100,000 points. These points are divided into 19 categories. In general, the training set consists of 10 sequences, numbered 00 to 07 and 09 to 10; the validation set consists of 1 sequence, numbered 08; and the test set consists of 11 sequences, numbered 11 to 21.
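For reference, the split described above can be written as a plain mapping over sequence numbers (loading the scans themselves is omitted):

```python
SEMANTIC_KITTI_SPLIT = {
    "train": [f"{i:02d}" for i in list(range(0, 8)) + [9, 10]],  # 00-07 and 09-10
    "val":   ["08"],
    "test":  [f"{i:02d}" for i in range(11, 22)],                # 11-21
}
```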

4.2. Results of Semantic Segmentation

Evaluation on S3DIS. To evaluate the performance of our method, we followed the 6-fold cross-validation strategy and used three standard metrics: mean intersection-over-union (mIoU), mean class accuracy (mAcc) and overall accuracy (OA). In Table 1, we made a quantitative comparison with the state-of-the-art methods. Our proposed method performed well, and its OA, mAcc and mIoU were better than those of the state-of-the-art methods. In addition, our proposed method outperformed the other methods in several classes. Compared to SCF-Net [15], our MSIDA-Net obtained a gain of 2.2% in the IoU of the beam class. For the table class, MSIDA-Net obtained the same IoU as SPG [3]. Compared to KPConv [31], the sofa IoU value of the proposed MSIDA-Net improved by 1%. For the clutter class, the IoU of MSIDA-Net was 0.9% higher than that of BAAF-Net [13]. It should be noted that in the S3DIS dataset, the shapes of some classes (i.e., ceiling, floor, wall, window, door and board) tend to be planar, while the classes that we classified better (i.e., beam, table, sofa and clutter) all have more complex geometric structures. In particular, the clutter class contains many kinds of objects (e.g., lamps, garbage cans, toilets, sinks and small cabinets). These objects vary in shape, which makes the clutter class difficult to classify. This shows that our proposed MSI block is suitable for learning and capturing the features of objects with complex geometric structures.
Figure 8 shows the visualization of the prediction results. The first line is a conference room; we can see that BAAF-Net [13] had more mispredicted regions, e.g., bookcase points mispredicted as a column. The second line is a hallway, which clearly shows that RandLA-Net [14] and BAAF-Net [13] mislabelled some column points as walls, while our proposed method made fewer mistakes. The third line is an office. RandLA-Net [14] and BAAF-Net [13] handled the boundaries of walls and boards poorly, while our proposed method performed better.
Evaluation on Semantic3D. We used the reduced-8 validation method, and the metrics were mIoU and OA. In Table 2, we made a quantitative comparison with the state-of-the-art methods. Our mIoU was better, but our OA was slightly inferior. Our MSIDA-Net achieved the same 97.5% IoU as RGNet [43] on the man-made terrain (mainly roads) class. For the natural terrain class, our MSIDA-Net achieved a 94.9% IoU, which was 1.9% higher than that of RGNet [43]. In Figure 9, we can see that the man-made terrain and natural terrain classes share boundaries with several other classes. Notably, many classes sit on the surface of the man-made terrain class, i.e., the building, hardscape, high vegetation and low vegetation classes. This proves that our proposed MSI block can effectively learn the spatial relation among points. In addition, MSIDA-Net ranked second and third on the low vegetation and car classes, respectively, and these classes have relatively complex shape structures. This again proves that our proposed method is good at learning and describing the shape of complex objects.
Figure 9 shows the prediction results of the test sets. In the first column, we can see that some building points were mispredicted as hard-scape and high vegetation. In the second column, the shape similarity between low vegetation and high vegetation caused a false label prediction. The third column obviously shows that some high vegetation points were mislabelled as buildings. In the final column, some scanning artefact points were mispredicted as high vegetation and many low vegetation points were mispredicted as high vegetation. It should be noted that the test set labels of the Semantic3D dataset are not publicly available, so we can only show the input data and the predicted labels.
Evaluation on SemanticKITTI.
Table 3. The quantitative results (%) of SemanticKITTI (single-scan).

Method | mIoU (%) | Road | Sidewalk | Parking | Other-ground | Building | Car | Truck | Bicycle | Motorcycle | Other-vehicle | Vegetation | Trunk | Terrain | Person | Bicyclist | Motorcyclist | Fence | Pole | Traffic-sign
PointNet [9] | 14.6 | 61.6 | 35.7 | 15.8 | 1.4 | 41.4 | 46.3 | 0.1 | 1.3 | 0.3 | 0.8 | 31 | 4.6 | 17.6 | 0.2 | 0.2 | 0 | 12.9 | 2.4 | 3.7
PointNet++ [10] | 20.1 | 72 | 41.8 | 18.7 | 5.6 | 62.3 | 53.7 | 0.9 | 1.9 | 0.2 | 0.2 | 46.5 | 13.8 | 30 | 0.9 | 1 | 0 | 16.9 | 6 | 8.9
SquSegV2 [44] | 39.7 | 88.6 | 67.6 | 45.8 | 17.7 | 73.7 | 81.8 | 13.4 | 18.5 | 17.9 | 14 | 71.8 | 35.8 | 60.2 | 20.1 | 25.1 | 3.9 | 41.1 | 20.2 | 36.3
TangentConv [19] | 40.9 | 83.9 | 63.9 | 33.4 | 15.4 | 83.4 | 90.8 | 15.2 | 2.7 | 16.5 | 12.1 | 79.5 | 49.3 | 58.1 | 23 | 28.4 | 8.1 | 49 | 35.8 | 28.5
PointASNL [45] | 46.8 | 87.4 | 74.3 | 24.3 | 1.8 | 83.1 | 87.9 | 39 | 0 | 25.1 | 29.2 | 84.1 | 52.2 | 70.6 | 34.2 | 57.6 | 0 | 43.9 | 57.8 | 36.9
RandLA-Net [14] | 53.9 | 90.7 | 73.7 | 60.3 | 20.4 | 86.9 | 94.2 | 40.1 | 26 | 25.8 | 38.9 | 81.4 | 61.3 | 66.8 | 49.2 | 48.2 | 7.2 | 56.3 | 49.2 | 47.7
PolarNet [46] | 54.3 | 90.8 | 74.4 | 61.7 | 21.7 | 90 | 93.8 | 22.9 | 40.3 | 30.1 | 28.5 | 84 | 65.5 | 67.8 | 43.2 | 40.2 | 5.6 | 67.8 | 51.8 | 57.5
MinkNet42 [47] | 54.3 | 91.1 | 69.7 | 63.8 | 29.3 | 92.7 | 94.3 | 26.1 | 23.1 | 26.2 | 36.7 | 83.7 | 68.4 | 64.7 | 43.1 | 36.4 | 7.9 | 57.1 | 57.3 | 60.1
BAAF-Net [13] | 59.9 | 90.9 | 74.4 | 62.2 | 23.6 | 89.8 | 95.4 | 48.7 | 31.8 | 35.5 | 46.7 | 82.7 | 63.4 | 67.9 | 49.5 | 55.7 | 53 | 60.8 | 53.7 | 52
FusionNet [48] | 61.3 | 91.8 | 77.1 | 68.8 | 30.8 | 92.5 | 95.3 | 41.8 | 47.5 | 37.7 | 34.5 | 84.5 | 69.8 | 68.5 | 59.5 | 56.8 | 11.9 | 69.4 | 60.4 | 66.5
Ours | 59.8 | 90.7 | 74.9 | 63.1 | 27.1 | 91.1 | 95.6 | 52.3 | 35.3 | 43.3 | 46.1 | 82.1 | 64.5 | 67 | 52.6 | 57.5 | 22.7 | 64 | 54.4 | 51.6
Table 3 shows the quantitative results. Compared to the referenced methods, our MSIDA-Net ranked third with an mIoU of 59.8%, which is slightly inferior to BAAF-Net. However, our MSIDA-Net performed well on the car, truck and motorcycle classes. For the car class, the IoU of MSIDA-Net was 0.2% higher than that of BAAF-Net [13]. For the truck class, the IoU of MSIDA-Net was 3.6% higher than that of BAAF-Net [13]. Compared to FusionNet [48], the motorcycle IoU value of the proposed MSIDA-Net improved by 5.6%. In addition, our MSIDA-Net ranked second on the bicycle, other-vehicle, person and bicyclist classes. To summarise, our MSIDA-Net is good at classifying small-size classes (e.g., cars, trucks, motorcycles, bicycles, other vehicles, persons and bicyclists). This shows the effectiveness of the dual adaptive blocks, especially the LAFA block, which aggregates neighbouring points to improve the understanding of the local region.
As shown in Figure 10, although our MSIDA-Net performed well on some small-size classes, there were still many mislabelled cases. For example, the features of bicycles and motorcycles were too similar to distinguish. Likewise, the car, truck and other-vehicle classes have similar shapes, which makes them difficult to classify. Additionally, it is challenging to predict the person, bicyclist and motorcyclist classes.

4.3. Ablation Experiments

To verify the validity of our method, we conducted ablation experiments on the S3DIS dataset with the Area 5 validation strategy. The first part of the ablation experiments concerns the two adaptive blocks, CSAPF and LAFA. The second part concerns the features of the coordinate systems. The last part concerns the ratio of the fusion attention score.

4.3.1. Ablation of the CSAPF Block and LAFA Block

The ablation results of the dual adaptive blocks are shown in Table 4. As shown in Figure 11, Ours1 uses neither of the adaptive blocks, which led to a poor result. Ours2 means that only the CSAPF block was used; compared with Ours1, the mIoU and OA improved by 1.8% and 1.4%, respectively. This proves that the CSAPF block makes a substantial contribution to learning important local features. Ours3 means that only the LAFA block was used, and its mIoU and OA improved by 2% and 0.8%, respectively, which fully indicates that the LAFA block improves the understanding of the local regions. Our overall network, MSIDA-Net, combines both the CSAPF and LAFA blocks, which enhances not only the learning ability for local features but also the understanding ability for local regions; its mIoU and OA were 2.7% and 1.5% higher than those of Ours1, respectively.

4.3.2. Ablation Experiments for the Features of Coordinate Systems

To better understand the contribution of the features from the different coordinate systems, we conducted two kinds of ablation experiments. As shown in Table 5, the first ablation method was Remove. Due to the specific nature of the network architecture (CSAPF and LAFA cannot be used if the features of two coordinate systems are removed), only one coordinate system's features were removed in each of these ablation experiments. Ours4 removes the features of the Cartesian coordinate system, Ours5 removes the spherical coordinate features and Ours6 removes the cylindrical coordinate features. Compared to Ours4, Ours5 and Ours6, the mIoU of the full features improved by 1.5%, 1.6% and 1.4%, respectively. This provides ample evidence that the spatial information of each coordinate system plays a crucial role in learning the geometric structure of the point cloud.
As shown in Figure 3, the spatial information in the different coordinate systems is encoded separately. To verify whether these features need to be encoded separately for learning, we conducted ablation experiments on feature concatenation. In Figure 12, Ours7 concatenates the spatial information of the spherical coordinates with that of the cylindrical coordinates for encoding, while only the spatial information of the Cartesian coordinate system is encoded separately; Ours8 and Ours9 are defined analogously. As shown in Table 5, the mIoUs and OAs of Ours7, Ours8 and Ours9 were inferior to those of MSIDA-Net (full features), which indicates that mixing spatial information from different coordinate systems makes no contribution to local feature learning. Thus, we encoded the features from the different coordinate systems separately.

4.3.3. Ablation of the Fusion Attention Score

As mentioned in Section 3.2.2, $s_i^k$ is the fusion attention score, and $s_{i,Ca}^{k}$, $s_{i,Sp}^{k}$ and $s_{i,Cy}^{k}$ are the attention scores in each coordinate system. To explore the contribution of the attention scores, we conducted ablation experiments to learn the proper ratio ($\lambda_{Ca}:\lambda_{Sp}:\lambda_{Cy}$) of each attention score. The ablation results are shown in Table 6.
Ours10 means the ratio $\lambda_{Ca}:\lambda_{Sp}:\lambda_{Cy}$ is 0.1:0.3:0.6, Ours11 means the ratio is 0.1:0.6:0.3 and so on. Compared to Ours10, the mIoU and OA of Original ($s_i^k$) were improved by 1.2% and 0.3%, respectively. Compared to Ours11, Original ($s_i^k$) obtained gains of 1.5% and 0.6% in mIoU and OA, respectively. Compared to Ours12, the mIoU and OA of Original ($s_i^k$) were 1.7% and 0.8% superior, respectively. Compared to Ours13, Original ($s_i^k$) outperformed with 0.6% and 0.6% gains in mIoU and OA, respectively. Compared to Ours14, the mIoU and OA of Original ($s_i^k$) were boosted by 2.1% and 1.0%, respectively. Compared to Ours15, Original ($s_i^k$) gained a 1.5% and 1.1% improvement in mIoU and OA, respectively. Through these comparisons, we can see that the attention scores in the different coordinate systems made almost the same contribution to local feature learning. In addition, we conducted Ours16, which applies the Softmax to $s_i^k$ for secondary weighting:
$$s_i^{k\prime} = \mathrm{SOFTMAX}(s_i^k) \quad (20)$$
where $s_i^{k\prime}$ is the secondary weighting score. However, the mIoU and OA of Ours16 were 0.8% and 0.7% inferior to those of Original ($s_i^k$), respectively. This is mainly because the secondary weighting pays too much attention to important features, which leads to the insufficient learning of other features.

4.4. Information about Experiments and Model Size

We used a Quadro RTX 6000 server for the experiments. The experiments were implemented with Python, TensorFlow, CUDA and cuDNN. We trained with a batch size of 4 and used the Adam optimiser. The initial learning rate was 0.01, and it was decayed by 5% every epoch.
For the S3DIS dataset, our network had 40,960 input points and 5 sampling layers. The number of input points in each sampling layer was 10,240, 2560, 640, 160 and 80 and the corresponding feature dimensions of these layers were 32, 128, 256, 512 and 1024, respectively. The number of neighbouring points obtained by K-nearest neighbours (KNN) for each input point was set to 16.
For the Semantic3D dataset, the number of input points was 65,536 and the number of sampling layers was 5. The input point numbers in each sampling layer were 16,384, 4096, 1024, 256 and 128 and the corresponding feature dimensions of these sampling layers were 32, 128, 256, 512 and 1024, respectively. The number of neighbouring points was set to 16.
For the SemanticKITTI dataset, the number of input points was 45,056 and the number of sampling layers was 4. The number of input points in each sampling layer was 11,264, 2816, 704, and 176 and the corresponding feature dimensions of these sampling layers were 32, 128, 256, and 512, respectively. The number of neighbouring points was set to 16.
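The settings above can be collected into one illustrative configuration; the dictionary layout and names are ours rather than from the released code, and the learning-rate schedule is written as plain exponential decay (5% per epoch), matching the description above.

```python
CONFIG = {
    "S3DIS": {"num_points": 40960, "layer_points": [10240, 2560, 640, 160, 80],
              "feature_dims": [32, 128, 256, 512, 1024], "k": 16},
    "Semantic3D": {"num_points": 65536, "layer_points": [16384, 4096, 1024, 256, 128],
                   "feature_dims": [32, 128, 256, 512, 1024], "k": 16},
    "SemanticKITTI": {"num_points": 45056, "layer_points": [11264, 2816, 704, 176],
                      "feature_dims": [32, 128, 256, 512], "k": 16},
}
BATCH_SIZE = 4

def learning_rate(epoch, base_lr=0.01, decay=0.95):
    """Learning rate after `epoch` epochs: 5% exponential decay per epoch."""
    return base_lr * decay ** epoch
```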
Detailed information about the MSIDA-Net model is shown in Table 7. It can be seen that the proposed method maintained a high training speed. In addition, it should be noted that the test time only contains the inference time of the original point cloud.

5. Conclusions

In this paper, our main concern was how to adequately learn multiple types of spatial information. Our proposed MSIDA-Net consists of a multispatial information encoding (MSI) block, a coordinate system attention pooling fusion (CSAPF) block and a local aggregation feature attention (LAFA) block. The input points are first grouped by KNN and input to the MSI block to obtain the local spatial features in the three coordinate systems. The adaptive CSAPF block assigns attention scores to the feature of each neighbouring point in the different coordinate systems. The LAFA block aims to improve the understanding of each local region feature. We evaluated our method on three benchmarks and obtained 73%, 77.8% and 59.8% mIoU on the S3DIS, Semantic3D and SemanticKITTI datasets, respectively. Furthermore, ablation studies of the CSAPF block, the LAFA block and the encoded features in the different coordinate systems were conducted to evaluate the effectiveness of each part of MSIDA-Net.
However, the question of how to better represent the spatial relation among points has been difficult to resolve. Although our proposed method learns the spatial information in different coordinate systems and achieves great results, we think that there are other spaces that can better represent the spatial and geometric relations among points. We leave those investigations to future research.

Author Contributions

Conceptualization, Y.L.; methodology, P.L., Y.L. and F.S.; software, P.L., Y.L. and X.L.; validation, P.L., F.S. and Z.Z.; writing—original draft preparation, P.L., X.L. and Y.L.; writing—review and editing, F.S. and Z.Z.; supervision, F.S.; project administration, Y.L. and F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China (Grant 61720106009), in part by the Natural Science Foundation of Guangxi (Grant 2022GXNSFBA035661), in part by the Open Research Fund of Artificial Intelligence Key Laboratory of Sichuan Province (Grant 2021RYJ06).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in refs. [35,36,37].

Acknowledgments

The authors gratefully acknowledge language help from the other members in Guangxi Key Laboratory of Manufacturing System and Advanced Manufacturing Technology.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep Learning for 3D Point Clouds: A Survey. IEEE Trans. Pat. Anal. Mach. Intell. 2020, 43, 4338–4364. [Google Scholar] [CrossRef] [PubMed]
  2. Lu, T.; Wang, L.; Wu, G. Cga-net: Category Guided Aggregation for Point Cloud Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 11693–11702. [Google Scholar]
  3. Landrieu, L.; Simonovsky, M. Large-Scale Point Cloud Semantic Segmentation with Superpoint Graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4558–4567. [Google Scholar]
  4. Landrieu, L.; Boussaha, M. Point Cloud Oversegmentation with Graph-Structured Deep Metric Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7440–7449. [Google Scholar]
  5. Li, Y.; Ma, L.; Zhong, Z.; Cao, D.; Li, J. TGNet: Geometric Graph CNN on 3-D Point Cloud Segmentation. IEEE Trans. Geo. Rem. Sens. 2020, 58, 3588–3600. [Google Scholar] [CrossRef]
  6. Bazazian, D.; Nahata, D. DCG-net: Dynamic Capsule Graph Convolutional Network for Point Clouds. IEEE Access 2020, 8, 188056–188067. [Google Scholar] [CrossRef]
  7. Liu, J.; Ni, B.; Li, C.; Yang, J.; Tian, Q. Dynamic Points Agglomeration for Hierarchical Point Sets Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019; pp. 7546–7555. [Google Scholar]
  8. Liang, Z.; Yang, M.; Deng, L.; Wang, C.; Wang, B. Hierarchical Depthwise Graph Convolutional Neural Network for 3D Semantic Segmentation of Point Clouds. In Proceedings of the 2019 IEEE International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8152–8158. [Google Scholar]
  9. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  10. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. arXiv 2017, arXiv:1706.02413. [Google Scholar]
  11. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Grap. 2019, 38, 1–12. [Google Scholar] [CrossRef] [Green Version]
  12. Pan, L.; Chew, C.M.; Lee, G.H. PointAtrousGraph: Deep Hierarchical Encoder-Decoder with Point Atrous Convolution for Unorganized 3D Points. In Proceedings of the IEEE/CVF International Conference on Robotics and Automation (ICRA), Virtual, 31 May–31 August 2020; pp. 1113–1120. [Google Scholar]
  13. Qiu, S.; Anwar, S.; Barnes, N. Semantic Segmentation for Real Point Cloud Scenes via Bilateral Augmentation and Adaptive Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 1757–1767. [Google Scholar]
  14. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 11108–11117. [Google Scholar]
  15. Fan, S.; Dong, Q.; Zhu, F.; Lv, Y.; Ye, P.; Wang, F.Y. SCF-Net: Learning Spatial Contextual Features for Large-Scale Point Cloud Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 14504–14513. [Google Scholar]
  16. Lawin, F.J.; Danelljan, M.; Tosteberg, P.; Bhat, G.; Khan, F.S.; Felsberg, M. Deep Projective 3D Semantic Segmentation. In Proceedings of the International Conference on Computer Analysis of Images and Patterns (CAIP), Ystad, Sweden, 22–24 August 2017; pp. 95–107. [Google Scholar]
  17. Boulch, A.; Lesaux, B.; Audebert, N. Unstructured Point Cloud Semantic Labeling Using Deep Segmentation Networks. In Proceedings of the 3DOR@ Eurographics, Lyon, France, 23–24 April 2017; p. 3. [Google Scholar]
  18. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pat. Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  19. Tatarchenko, M.; Park, J.; Koltun, V.; Zhou, Q.Y. Tangent Convolutions for Dense Prediction in 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 3887–3896. [Google Scholar]
  20. Huang, J.; You, S. Point Cloud Labeling Using 3D Convolutional Neural Network. In Proceedings of the IEEE/CVF Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 16–21 May 2016; pp. 2670–2675. [Google Scholar]
  21. Tchapmi, L.P.; Choy, C.B.; Armeni, I.; Gwak, J.Y.; Savarese, S. Segcloud: Semantic Segmentation of 3D Point Clouds. In Proceedings of the IEEE/CVF Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 537–547. [Google Scholar]
  22. Meng, H.Y.; Gao, L.; Lai, Y.K.; Manocha, D. Vv-net: Voxel Vae Net with Group Convolutions for Point Cloud Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019; pp. 8500–8508. [Google Scholar]
  23. Dai, A.; Ritchie, D.; Bokeloh, M.; Reed, S.; Strum, J.; Nießner, M. Scancomplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4578–4587. [Google Scholar]
  24. Hu, Z.; Bai, X.; Shang, J.; Zhang, R.; Dong, J.; Wang, X.; Sun, G.; Fu, H.; Tai, C.L. Vmnet: Voxel-Mesh Network for Geodesic-Aware 3D Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (CVPR), Virtual, 19–25 June 2021; pp. 15488–15498. [Google Scholar]
  25. Ye, M.; Xu, S.; Cao, T.; Chen, Q. Drinet: A Dual-Representation Iterative Learning Network for Point Cloud Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 7447–7456. [Google Scholar]
  26. Ma, Y.; Guo, Y.; Liu, H.; Lei, Y.; Wen, G. Global Context Reasoning for Semantic Segmentation of 3D Point Clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 2–5 May 2020; pp. 2931–2940. [Google Scholar]
  27. Wang, L.; Huang, Y.; Hou, Y.; Zhang, S.; Shan, J. Graph Attention Convolution for Point Cloud Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10296–10305. [Google Scholar]
  28. Zhang, W.; Xiao, C. PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12436–12445. [Google Scholar]
  29. Yang, J.; Zhang, Q.; Ni, B.; Li, L.; Liu, J.; Zhou, M.; Tian, Q. Modeling Point Clouds with Self-Attention and Gumbel Subset Sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3323–3332. [Google Scholar]
  30. Hua, B.S.; Tran, M.K.; Yeung, S.K. Pointwise Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 984–993. [Google Scholar]
  31. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and Deformable Convolution for Point Clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019; pp. 6411–6420. [Google Scholar]
  32. Engelmann, F.; Kontogianni, T.; Leibe, B. Dilated Point Convolutions: On the Receptive Field Size of Point Convolutions on 3D Point Clouds. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Virtual, 31 May–31 August 2020; pp. 9463–9469. [Google Scholar]
  33. Lei, H.; Akhtar, N.; Mian, A. Spherical Kernel for Efficient Graph Convolution on 3D Point Clouds. IEEE Trans. Pat. Anal. Mach. Intell. 2020, 43, 3664–3680. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  34. Lin, Z.H.; Huang, S.Y.; Wang, Y.C.F. Learning of 3D Graph Convolution Networks for Point Cloud Analysis. IEEE Trans. Pat. Anal. Mach. Intell. 2021. Early Access. [Google Scholar] [CrossRef] [PubMed]
  35. Armeni, I.; Sax, S.; Zamir, A.R.; Savarese, S. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. arXiv 2017, arXiv:1702.01105. [Google Scholar]
  36. Hackel, T.; Savinov, N.; Ladicky, L.; Wegner, J.D.; Schindler, K.; Pollefeys, M. Semantic3d. net: A New Large-Scale Point Cloud Classification Benchmark. arXiv 2017, arXiv:1704.03847. [Google Scholar]
  37. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. Semantickitti: A Dataset for Semantic Scene Understanding of Lidar Sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019; pp. 9297–9307. [Google Scholar]
  38. Huang, Q.; Wang, W.; Neumann, U. Recurrent Slice Networks for 3D Segmentation of Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2626–2635. [Google Scholar]
  39. Ye, X.; Li, J.; Huang, H.; Du, L.; Zhang, X. 3D Recurrent Neural Networks with Context Fusion for Point Cloud Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 403–417. [Google Scholar]
  40. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on X-Transformed Points. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  41. Zhao, H.; Jiang, L.; Fu, C.W.; Jia, J. Pointweb: Enhancing Local Neighborhood Features for Point Cloud Processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5565–5573. [Google Scholar]
  42. Zhang, Z.; Hua, B.S.; Yeung, S.K. Shellnet: Efficient Point Cloud Convolutional Neural Networks Using Concentric Shells Statistics. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019; pp. 1607–1616. [Google Scholar]
  43. Truong, G.; Gilani, S.Z.; Islam, S.M.S.; Suter, D. Fast Point Cloud Registration Using Semantic Segmentation. In Proceedings of the 2019 Digital Image Computing: Techniques and Applications (DICTA), Perth, Australia, 2–4 December 2019; pp. 1–8. [Google Scholar]
  44. Wu, B.; Zhou, X.; Zhao, S.; Yue, X.; Keutzer, K. Squeezesegv2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a Lidar Point Cloud. In Proceedings of the 2019 IEEE International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4376–4382. [Google Scholar]
  45. Yan, X.; Zheng, C.; Li, Z.; Wang, S.; Cui, S. Pointasnl: Robust Point Clouds Processing Using Nonlocal Neural Networks with Adaptive Sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 5589–5598. [Google Scholar]
  46. Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. Polarnet: An Improved Grid Representation for Online Lidar Point Clouds Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 9601–9610. [Google Scholar]
  47. Choy, C.; Gwak, J.; Savarese, S. 4D Spatio-Temporal Convnets: Minkowski Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3075–3084. [Google Scholar]
  48. Zhang, F.; Fang, J.; Wah, B.; Torr, P. Deep Fusionnet for Point Cloud Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020; pp. 644–663. [Google Scholar]
Figure 1. The conversion of spatial information. (a) The spatial information of the Cartesian coordinate system (red: centre point; blue: neighbouring points). (b) The relationship between the spherical coordinates $(r_i^k, \theta_i^k, \varphi_i^k)$ and the Cartesian coordinate system. (c) The relationship between the cylindrical coordinates $(l_i^k, \alpha_i^k, z_i^k)$ and the Cartesian coordinate system. (d) The original distribution of the centre point (red point) and its neighbouring points (black points). (e) The distribution of the points (same as in (d)) in the spherical coordinate system. (f) The distribution of the points (same as in (d)) in the cylindrical coordinate system.
Figure 2. Our proposed method. (a) The framework of MSIDA-Net. (b) The detailed structure of the MSIDA module.
Figure 3. The architecture of the MSI block.
Figure 4. The conversion from the Cartesian coordinate system to the spherical coordinate system. (a) The spherical coordinate system spatial information between the centre point and each of its k t h neighbouring points. (b) The spherical coordinate system spatial information between the centre point and the centre of mass. (Red: centre point; Blue: neighbouring points; Yellow: the centre of mass).
Figure 5. The conversion from the Cartesian coordinate system to the cylindrical coordinate system. (a) The cylindrical-coordinate spatial information between the centre point and its $k$-th neighbouring point. (b) The cylindrical-coordinate spatial information between the input point and the centre of mass. (Red: centre point; blue: neighbouring points; yellow: centre of mass.)
Figure 6. The structure of the CSAPF block.
Figure 7. The architecture of the LAFA block.
Figure 8. The visualization results of some rooms (first row: conference room; second row: hallway; remaining rows: offices).
Figure 9. The visualization results of Semantic3D (reduced-8).
Figure 10. The prediction of SemanticKITTI (single-scan). Red rectangles show the main mispredicted regions.
Figure 11. The ablation structures of the MSIDA module. (a) Ours1; (b) Ours2; (c) Ours3.
Figure 12. The structure of Ours7.
Table 1. Comparisons with state-of-the-art methods (6-fold cross-validation of S3DIS).

| Methods | OA (%) | mAcc (%) | mIoU (%) | Ceil. | Floor | Wall | Beam | Col. | Wind. | Door | Table | Chair | Sofa | Book. | Board | Clut. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointNet [9] | 78.6 | 66.2 | 47.6 | 88.0 | 88.7 | 69.3 | 42.4 | 23.1 | 47.5 | 51.6 | 54.1 | 42.0 | 9.6 | 38.2 | 29.4 | 35.2 |
| RSNet [38] | - | 66.5 | 56.5 | 92.5 | 92.8 | 78.6 | 32.8 | 34.4 | 51.6 | 68.1 | 59.7 | 60.1 | 16.4 | 50.2 | 44.9 | 52.0 |
| 3P-RNN [39] | 86.9 | - | 56.3 | 92.9 | 93.8 | 73.1 | 42.5 | 25.9 | 47.6 | 59.2 | 60.4 | 66.7 | 24.8 | 57.0 | 36.7 | 51.6 |
| SPG [3] | 86.4 | 73.0 | 62.1 | 89.9 | 95.1 | 76.4 | 62.8 | 47.1 | 55.3 | 68.4 | 73.5 | 69.2 | 63.2 | 45.9 | 8.7 | 52.9 |
| PointCNN [40] | 88.1 | 75.6 | 65.4 | 94.8 | 97.3 | 75.8 | 63.3 | 51.7 | 58.4 | 57.2 | 71.6 | 69.1 | 39.1 | 61.2 | 52.2 | 58.6 |
| PointWeb [41] | 87.3 | 76.2 | 66.7 | 93.5 | 94.2 | 80.8 | 52.4 | 41.3 | 64.9 | 68.1 | 71.4 | 67.1 | 50.3 | 62.7 | 62.2 | 58.5 |
| ShellNet [42] | 87.1 | - | 66.8 | 90.2 | 93.6 | 79.9 | 60.4 | 44.1 | 64.9 | 52.9 | 71.6 | 84.7 | 53.8 | 64.6 | 48.6 | 59.4 |
| KPConv [31] | - | 79.1 | 70.6 | 93.6 | 92.4 | 83.1 | 63.9 | 54.3 | 66.1 | 76.6 | 57.8 | 64.0 | 69.3 | 74.9 | 61.3 | 60.3 |
| RandLA [14] | 88.0 | 82.0 | 70.0 | 93.1 | 96.1 | 80.6 | 62.4 | 48.0 | 64.4 | 69.4 | 69.4 | 76.4 | 60.0 | 64.2 | 65.9 | 60.1 |
| SCF-Net [15] | 88.4 | 82.7 | 71.6 | 93.3 | 96.4 | 80.9 | 64.9 | 47.4 | 64.5 | 70.1 | 71.4 | 81.6 | 67.2 | 64.4 | 67.5 | 60.9 |
| BAAF-Net [13] | 88.9 | 83.1 | 72.2 | 93.3 | 96.8 | 81.6 | 61.9 | 49.5 | 65.4 | 73.3 | 72.0 | 83.7 | 67.5 | 64.3 | 67.0 | 62.4 |
| Ours | 89.2 | 83.7 | 73.0 | 93.6 | 97.0 | 82.1 | 67.1 | 52.1 | 66.0 | 71.8 | 73.5 | 80.2 | 70.3 | 67.3 | 64.0 | 63.3 |
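As a reminder of how the headline numbers in Table 1 (and the later tables) are defined, the sketch below computes overall accuracy (OA) and mean IoU from per-point predictions and ground-truth labels via a confusion matrix. These are the standard metric definitions, not the authors' evaluation code; the function name is illustrative.

```python
import numpy as np

def oa_and_miou(pred, gt, num_classes):
    """OA and mIoU from 1-D integer arrays of predicted and ground-truth labels."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)               # confusion matrix: rows = GT, cols = prediction

    oa = np.trace(conf) / conf.sum()             # fraction of correctly labelled points

    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp                   # predicted as class c but labelled otherwise
    fn = conf.sum(axis=1) - tp                   # labelled as class c but predicted otherwise
    valid = (tp + fp + fn) > 0                   # ignore classes absent from both pred and GT
    iou = tp[valid] / (tp + fp + fn)[valid]      # per-class intersection over union
    return oa, iou.mean()
```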
Table 2. The semantic segmentation results of Semantic3D (reduced-8).

| Methods | mIoU (%) | OA (%) | Man-made. | Natural. | High Veg. | Low Veg. | Buildings | Hard Scape | Scanning Art. | Cars |
|---|---|---|---|---|---|---|---|---|---|---|
| SnapNet [17] | 59.1 | 88.6 | 82.0 | 77.3 | 79.7 | 22.9 | 91.1 | 18.4 | 37.3 | 64.4 |
| SEGCloud [21] | 61.3 | 88.1 | 83.9 | 66.0 | 86.0 | 40.5 | 91.1 | 30.9 | 27.5 | 64.3 |
| ShellNet [42] | 69.3 | 93.2 | 96.3 | 90.4 | 83.9 | 41.0 | 94.2 | 34.7 | 43.9 | 70.2 |
| GACNet [27] | 70.8 | 91.9 | 86.4 | 77.7 | 88.5 | 60.6 | 94.2 | 37.3 | 43.5 | 77.8 |
| SPG [3] | 73.2 | 94.0 | 97.4 | 92.6 | 87.9 | 44.0 | 93.2 | 31.0 | 63.5 | 76.2 |
| KPConv [31] | 74.6 | 92.9 | 90.9 | 82.2 | 84.2 | 47.9 | 94.9 | 40.0 | 77.3 | 79.7 |
| RGNet [43] | 74.7 | 94.5 | 97.5 | 93.0 | 88.1 | 48.1 | 94.6 | 36.2 | 72.0 | 68.0 |
| RandLA-Net [14] | 77.4 | 94.8 | 95.6 | 91.4 | 86.6 | 51.5 | 95.7 | 51.5 | 69.8 | 76.8 |
| SCF-Net [15] | 77.6 | 94.7 | 97.1 | 91.8 | 86.3 | 51.2 | 95.3 | 50.5 | 67.9 | 80.7 |
| Ours | 77.8 | 94.6 | 97.5 | 94.9 | 87.0 | 54.9 | 94.2 | 42.8 | 72.0 | 78.8 |
Table 4. The ablation results of the CSAPF block and the LAFA block (see Figure 11 for the structures of Ours1–Ours3).

| Method | CSAPF Block | LAFA Block | mIoU (%) | OA (%) |
|---|---|---|---|---|
| MSIDA-Net (Ours1) |  |  | 64.2 | 87.8 |
| MSIDA-Net (Ours2) |  |  | 66.0 | 89.2 |
| MSIDA-Net (Ours3) |  |  | 66.2 | 88.6 |
| MSIDA-Net (MSIDA) |  |  | 66.9 | 89.3 |
Table 5. The ablation results of the coordinate system features.

Ablation by removing one coordinate system feature:

| Method | Cartesian Feature | Spherical Feature | Cylindrical Feature | mIoU (%) | OA (%) |
|---|---|---|---|---|---|
| MSIDA-Net (Ours4) |  |  |  | 65.4 | 88.3 |
| MSIDA-Net (Ours5) |  |  |  | 65.3 | 88.7 |
| MSIDA-Net (Ours6) |  |  |  | 65.5 | 88.8 |
| MSIDA-Net (full features) |  |  |  | 66.9 | 89.3 |

Ablation by concatenating features before encoding:

| Method | Ablation Method (Concatenate) | mIoU (%) | OA (%) |
|---|---|---|---|
| MSIDA-Net (Ours7) | Encode [Spherical Feature + Cylindrical Feature] and Cartesian Feature | 66.1 | 88.9 |
| MSIDA-Net (Ours8) | Encode [Cartesian Feature + Cylindrical Feature] and Spherical Feature | 65.3 | 88.7 |
| MSIDA-Net (Ours9) | Encode [Cartesian Feature + Spherical Feature] and Cylindrical Feature | 66.5 | 89.0 |
Table 6. The ablation results of attention scores.

Fixed mixing ratios versus the learned attention scores:

| Method | Cartesian ($\lambda_{Ca}$) | Spherical ($\lambda_{Sp}$) | Cylindrical ($\lambda_{Cy}$) | mIoU (%) | OA (%) |
|---|---|---|---|---|---|
| MSIDA-Net (Ours10) | 0.10 | 0.30 | 0.60 | 65.7 | 89.0 |
| MSIDA-Net (Ours11) | 0.10 | 0.60 | 0.30 | 65.4 | 88.7 |
| MSIDA-Net (Ours12) | 0.30 | 0.10 | 0.60 | 65.2 | 88.5 |
| MSIDA-Net (Ours13) | 0.30 | 0.60 | 0.10 | 66.3 | 88.7 |
| MSIDA-Net (Ours14) | 0.60 | 0.10 | 0.30 | 64.8 | 88.3 |
| MSIDA-Net (Ours15) | 0.60 | 0.30 | 0.10 | 65.4 | 88.2 |
| Original ($s_i^k$) | 0.33 | 0.33 | 0.33 | 66.9 | 89.3 |

Secondary weighting ablation:

| Method | Ablation Method (Secondary Weighting) | mIoU (%) | OA (%) |
|---|---|---|---|
| MSIDA-Net (Ours16) | – | 66.1 | 88.6 |
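Table 6 contrasts fixed mixing ratios $\lambda_{Ca}$, $\lambda_{Sp}$, $\lambda_{Cy}$ for the Cartesian, spherical and cylindrical features with the learned attention scores $s_i^k$. The sketch below illustrates the difference between the two weighting schemes under a simplifying assumption (the fused local feature is a weighted combination of the three coordinate-system features); the class `LearnedFusion` and its single linear scoring layer are illustrative placeholders, not the CSAPF block itself.

```python
import torch
import torch.nn.functional as F

def fuse_fixed(f_ca, f_sp, f_cy, ratios=(0.33, 0.33, 0.33)):
    """Fixed-ratio fusion of Cartesian / spherical / cylindrical features (Ours10-15 style)."""
    lam_ca, lam_sp, lam_cy = ratios
    return lam_ca * f_ca + lam_sp * f_sp + lam_cy * f_cy

class LearnedFusion(torch.nn.Module):
    """Per-point learned attention over the three coordinate-system features,
    a simplified stand-in for the learned scores s_i^k."""
    def __init__(self, channels):
        super().__init__()
        self.score = torch.nn.Linear(3 * channels, 3)

    def forward(self, f_ca, f_sp, f_cy):                  # each of shape (N, K, C)
        stacked = torch.stack([f_ca, f_sp, f_cy], dim=-2)  # (N, K, 3, C)
        scores = self.score(torch.cat([f_ca, f_sp, f_cy], dim=-1))  # (N, K, 3)
        w = F.softmax(scores, dim=-1)                      # normalised weights per point
        return (w.unsqueeze(-1) * stacked).sum(dim=-2)     # weighted sum, (N, K, C)
```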
Table 7. Information of the proposed model on different datasets.

| Method | Dataset | Parameters (Millions) | Training Speed (Batch/s) | Test Time (s) | Max Inference Points (Millions) |
|---|---|---|---|---|---|
| MSIDA-Net | S3DIS | 15.98 | 1.50 | 47.1 | 0.37 |
| MSIDA-Net | Semantic3D | 15.98 | 0.85 | 91.7 | 0.36 |
| MSIDA-Net | SemanticKITTI | 3.94 | 1.39 | 433.5 | 0.40 |
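The figures in Table 7 depend on the framework, batch construction and hardware. For readers comparing against their own re-implementation, the sketch below shows one rough way to obtain a parameter count and an average per-batch inference time in PyTorch; `model_stats` and the example loop are illustrative assumptions, not the authors' measurement code.

```python
import time
import torch

def model_stats(model, example_batch, device="cuda", repeats=10):
    """Parameter count (millions) and average per-batch inference time (seconds)."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters()) / 1e6

    example_batch = example_batch.to(device)
    with torch.no_grad():
        model(example_batch)                     # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()             # make sure the GPU queue is empty
        start = time.time()
        for _ in range(repeats):
            model(example_batch)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = (time.time() - start) / repeats
    return n_params, elapsed
```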
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
