Author Contributions
Conceptualization, J.X. and F.D.; methodology, J.X.; validation, J.X.; formal analysis, J.X.; investigation, J.X.; resources, F.D.; data curation, J.X.; writing—original draft preparation, J.X.; writing—review and editing, J.X.; visualization, J.X.; supervision, J.X.; funding acquisition, F.D. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Schematic diagram of the overall network flow.
Figure 2.
Overall architecture of the proposed 3D point cloud instance segmentation method.
Figure 3.
Feature fusion module. Convolutional layers fuse the global shape features and encode the shape information.
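A minimal PyTorch sketch of this fusion step, assuming the common concatenate-then-convolve design; all module and argument names here are illustrative, not the released code:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Illustrative fusion module: the global shape feature vector is
    broadcast to every point, concatenated with the per-point features,
    and mixed by a 1x1 convolution acting as a channel-wise MLP."""

    def __init__(self, point_dim: int, shape_dim: int, out_dim: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv1d(point_dim + shape_dim, out_dim, kernel_size=1),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, point_feats, shape_feats):
        # point_feats: (B, C_p, N); shape_feats: (B, C_s) global shape vector
        shape_feats = shape_feats.unsqueeze(-1).expand(-1, -1, point_feats.size(-1))
        return self.fuse(torch.cat([point_feats, shape_feats], dim=1))
```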
Figure 4.
Global Shape Attention (GSA) module. This module mainly consists of the Shape Feature Aggregation (SFA) module and the Cross-Attention (CA) module. Its output is attention features fused with the global shape contours.
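To make the data flow concrete, the sketch below gives one plausible PyTorch reading of GSA: aggregated shape tokens (standing in for the SFA output) serve as keys and values, and point features attend to them through cross-attention with a residual connection. Names, head counts, and the residual placement are our assumptions:

```python
import torch
import torch.nn as nn

class GlobalShapeAttention(nn.Module):
    """Sketch of GSA: point features (queries) cross-attend to aggregated
    shape tokens (keys/values) produced by an SFA-like step upstream."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_feats, shape_tokens):
        # point_feats: (B, N, D); shape_tokens: (B, S, D) from the SFA step
        attended, _ = self.cross_attn(point_feats, shape_tokens, shape_tokens)
        # Residual keeps the original point features in the fused output
        return self.norm(point_feats + attended)
```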
Figure 5.
Instance Query Module. This module produces the instance query vectors (Instance Query) that represent the characteristics of each instance. Its inputs are the initialized instance query vectors and the point features that incorporate the global shape contour features; Cross-Attention and Self-Attention are then applied to obtain the final instance query vectors. The number of query vectors equals the preset number of instances in the scene.
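This refinement maps naturally onto a transformer-decoder-style layer. The sketch below is one plausible PyTorch rendering, assuming K learned query vectors; it is not the paper's code:

```python
import torch
import torch.nn as nn

class InstanceQueryModule(nn.Module):
    """One query-refinement layer: K learnable instance queries
    cross-attend to the shape-aware point features, then self-attend
    to each other, as described for Figure 5."""

    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, point_feats):
        # point_feats: (B, N, D); returns refined queries of shape (B, K, D)
        q = self.queries.unsqueeze(0).expand(point_feats.size(0), -1, -1)
        q = self.norm1(q + self.cross_attn(q, point_feats, point_feats)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return q
```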
Figure 6.
Instance Mask Module (IMM). The inputs of this module are the point features and the refined instance query vectors; the outputs are the mask label of each instance and the semantic category label of the corresponding instance.
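A common way to realize such a head in query-based methods such as Mask3D is to score every point feature against every query with a dot product and to classify each query with a linear layer; the sketch below assumes that design and is only illustrative:

```python
import torch
import torch.nn as nn

class InstanceMaskModule(nn.Module):
    """Illustrative mask/classification head: per-instance masks come
    from query-point dot products; class logits come from a linear head
    (with an extra "no object" slot)."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes + 1)

    def forward(self, point_feats, queries):
        # point_feats: (B, N, D); queries: (B, K, D)
        mask_logits = torch.einsum("bnd,bkd->bkn", point_feats, queries)
        masks = mask_logits.sigmoid() > 0.5       # (B, K, N) binary instance masks
        class_logits = self.cls_head(queries)     # (B, K, num_classes + 1)
        return masks, class_logits
```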
Figure 7.
Visualization of Area 5 instance segmentation results. The left shows the original input; the middle shows the ground truth and prediction of semantic segmentation, where different colors indicate different categories; and the right shows the ground truth and prediction of instance segmentation, where different colors indicate different instance objects.
Figure 8.
Visualization results for the ScanNet test dataset. The left shows the original RGB input; the middle shows the semantic prediction results, where different colors indicate different categories; and the right shows the instance prediction results, where different colors indicate different instance objects.
Figure 9.
Visualization of STPLS3D results. The bottom shows the original input, the middle shows the instance ground truth, and the top shows the predicted instances.
Figure 10.
Visualization of comparison results on the S3DIS dataset. The upper image shows the result of our proposed network, and the lower image shows the result of the Mask3D network. In the black boxes, different colors indicate different instances; the comparison shows that our proposed network alleviates the difficulty of distinguishing spatially distributed similar instances in a scene.
Figure 11.
Comparison results on the ScanNet dataset. The bottom shows the original input, the middle shows the Mask3D result, and the top shows the prediction of the proposed network. In the black boxes, different colors indicate different instances; the comparison shows that our proposed network alleviates the difficulty of distinguishing spatially distributed similar instances in a scene.
Figure 12.
Visualization of the learned shape contour features.
Table 1.
Quantitative analysis results for Area 5 instance segmentation. The evaluation metrics are mAP, mAP50, Prec50, and Rec50.
| Method | mAP | mAP50 | Prec50 | Rec50 |
|---|---|---|---|---|
| SGPN [4] | - | - | 36.0 | 28.7 |
| ASIS [5] | - | - | 55.3 | 42.4 |
| 3D-BoNet [3] | - | - | 57.5 | 40.2 |
| PointGroup [22] | - | 57.8 | 61.9 | 62.1 |
| MaskGroup [23] | - | 65.0 | 62.9 | 64.7 |
| SSTNet [24] | 42.7 | 59.3 | 65.5 | 64.2 |
| Mask3D [9] | 56.6 | 68.4 | 68.7 | 66.3 |
| Ours | 52.4 | 66.9 | 63.6 | 64.3 |
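For readers unfamiliar with the Prec50/Rec50 columns, the snippet below illustrates how precision and recall at an IoU threshold of 0.5 can be computed with greedy matching; the official benchmark scripts may differ in matching details:

```python
import numpy as np

def prec_rec_at_50(pred_masks, gt_masks, iou_thresh=0.5):
    """Greedy matching: a prediction is a true positive if it overlaps
    some unmatched ground-truth instance with IoU >= iou_thresh."""
    matched, tp = set(), 0
    for pred in pred_masks:                 # each mask: boolean (N,) array
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_masks):
            if j in matched:
                continue
            union = np.logical_or(pred, gt).sum()
            iou = np.logical_and(pred, gt).sum() / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thresh:
            tp += 1
            matched.add(best_j)
    precision = tp / max(len(pred_masks), 1)
    recall = tp / max(len(gt_masks), 1)
    return precision, recall
```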
Table 2.
ScanNet dataset quantitative analysis results. The evaluation metrics are mAP and mAP50 on the validation and test splits.
| Method | mAP (Val) | mAP50 (Val) | mAP (Test) | mAP50 (Test) |
|---|---|---|---|---|
| SGPN [4] | - | - | 4.9 | 14.3 |
| GSPN [1] | 19.3 | 37.8 | - | 30.6 |
| 3D-SIS [2] | - | 18.7 | 16.1 | 38.2 |
| 3D-BoNet [3] | - | - | 25.3 | 48.8 |
| MTML [25] | 20.3 | 40.2 | 28.2 | 54.9 |
| 3D-MPA [25] | 35.5 | 59.1 | 35.5 | 61.1 |
| DyCo3D [26] | 35.4 | 57.6 | 39.5 | 64.1 |
| PointGroup [22] | 34.8 | 57.6 | 40.7 | 63.6 |
| MaskGroup [23] | 42.0 | 63.3 | 43.4 | 66.4 |
| OccuSeg [27] | 44.2 | 60.7 | 48.6 | 67.2 |
| SSTNet [24] | 49.4 | 64.3 | 50.6 | 69.8 |
| HAIS [28] | 43.5 | 64.1 | 45.7 | 69.9 |
| SoftGroup [16] | 46.0 | 67.6 | 50.4 | 76.1 |
| Mask3D [9] | 55.2 | 73.7 | 50.6 | 78.0 |
| Ours | 54.5 | 70.5 | 49.8 | 77.3 |
Table 3.
Quantitative analysis results on the three STPLS3D test sets.
| Class | AP | AP50 | AP25 |
|---|---|---|---|
| Building | 0.822 | 0.905 | 0.918 |
| Low vegetation | 0.329 | 0.617 | 0.731 |
| Middle vegetation | 0.361 | 0.539 | 0.676 |
| High vegetation | 0.488 | 0.922 | 0.985 |
| Car | 0.831 | 0.912 | 0.985 |
| Trucks | 0.815 | 0.900 | 0.946 |
| Aircraft | 0.603 | 0.808 | 0.850 |
| Military vehicles | 0.816 | 0.818 | 0.886 |
| Bikes | 0.218 | 0.547 | 0.695 |
| Motorcycle | 0.604 | 0.819 | 0.912 |
| Light pole | 0.602 | 0.816 | 0.902 |
| Street sign | 0.213 | 0.431 | 0.541 |
| Clutter | 0.601 | 0.733 | 0.792 |
| Fence | 0.442 | 0.627 | 0.860 |
| Average (mAP) | 0.551 | 0.717 | 0.815 |
Table 4.
Model sizes and run times.
| Method | Model Size | Run Time (ms) |
|---|---|---|
| HAIS [28] | 30.856 M | 339 |
| SoftGroup [16] | 30.858 M | 345 |
| Mask3D [9] | 39.617 M | 339 |
| Ours | 15.7 M | 253 |
Table 5.
Effect of the initial sampling location.
| Setting | mAP | mAP50 |
|---|---|---|
| Same initial location | 50.1 | 63.4 |
| Different initial location | 50.4 | 63.9 |
Table 6.
Ablation experiments for position encoding.
| Setting | mIoU |
|---|---|
| With PE | 64.0 |
| Without PE | 63.9 |