Article

TSPconv-Net: Transformer and Sparse Convolution for 3D Instance Segmentation in Point Clouds

by
Xiaojuan Ning
1,2,*,
Yule Liu
1,
Yishu Ma
1,
Zhiwei Lu
1,
Haiyan Jin
1,2,
Zhenghao Shi
1,2 and
Yinghui Wang 
3
1
Institute of Computer Science and Engineering, Xi’an University of Technology, No. 5 South of Jinhua Road, Xi’an 710048, China
2
Shaanxi Key Laboratory of Network Computing and Security Technology, Xi’an 710048, China
3
School of Artificial Intelligence and Computer Science, Jiangnan University, 1800 of Lihu Road, Wuxi 214122, China
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(18), 2926; https://doi.org/10.3390/math12182926
Submission received: 7 August 2024 / Revised: 9 September 2024 / Accepted: 12 September 2024 / Published: 20 September 2024
(This article belongs to the Special Issue Mathematical Computation for Pattern Recognition and Computer Vision)

Abstract

Current deep learning approaches for indoor 3D instance segmentation often rely on multilayer perceptrons (MLPs) for feature extraction. However, MLPs struggle to effectively capture the complex spatial relationships inherent in 3D scene data. To address this issue, we propose a novel and efficient framework for 3D instance segmentation called TSPconv-Net. In contrast to existing methods that primarily depend on MLPs for feature extraction, our framework integrates a more robust feature extraction model comprising the offset-attention (OA) mechanism and submanifold sparse convolution (SSC). The proposed framework is an end-to-end network architecture. TSPconv-Net consists of a backbone network followed by a bounding box module. Specifically, the backbone network utilizes the OA mechanism to extract global features and employs SSC for local feature extraction. The bounding box module then conducts instance segmentation based on the extracted features. Experimental results demonstrate that our approach outperforms existing work on the S3DIS dataset while maintaining computational efficiency. TSPconv-Net achieves 68.6% mPrec, 52.2% mRec, and 60.1% mAP on the test set, surpassing 3D-BoNet by 3.0% mPrec, 4.6% mRec, and 2.6% mAP. Furthermore, it demonstrates high efficiency, completing computations in just 326 s.

1. Introduction

Understanding 3D scenes is a fundamental necessity for various applications, including autonomous driving [1], robotics technology [2], medical image processing [3], and 3D reconstruction [4]. By enabling computers to achieve a deeper understanding of 3D scenes, 3D instance segmentation drives the advancement and application of intelligent systems across diverse domains. However, repetitive structures in indoor scenes (e.g., tables, chairs, and beds) present significant challenges for 3D instance segmentation. The similarity in geometric shapes and feature distributions of these objects often leads to instance confusion and crowding in the feature space, increasing the likelihood of missed detections and false positives. Moreover, the close arrangement of objects in indoor scenes, coupled with complex occlusions, makes it difficult to accurately segment complete instances, particularly when edges are blurred or objects overlap. The limited geometric and topological variability of repetitive structures further exacerbates segmentation difficulties, restricting the model’s ability to distinguish objects of similar shapes. Additionally, inconsistencies in data annotation introduce noise into the training data, potentially biasing the learning process and degrading generalization performance. When encountering unseen repetitive structures in the test set or a new environment, the model often exhibits poor adaptability.
In current research on 3D instance segmentation, deep learning methods are predominant, typically involving the training of neural networks for scene segmentation. Among these methods, backbone networks such as MLPs are commonly used for feature extraction. However, MLPs, while simple and easy to implement, often struggle to capture spatial information effectively during feature extraction and incur high computational complexity. As a result, networks relying solely on MLPs may not perform satisfactorily in the segmentation of complex scenes.
Researchers have recently proposed various methods to overcome the limitations of traditional multilayer perceptron (MLP) architectures for 3D instance segmentation, leading to growing interest in innovative approaches. Rozenberszki et al. [5] introduced UnScene3D, a novel unsupervised method that extracts color and geometric features from RGB-D data to generate pseudo-masks without relying on labeled data; self-training is then used to refine these masks, improving segmentation performance. Traditional bottom-up strategies often involve post-processing steps, which can be computationally expensive and require domain-specific assumptions. Kolodiazhnyi et al. proposed TD3D [6], an entirely data-driven top-down approach that directly predicts instance proposals, filters them using non-maximum suppression, and performs mask segmentation for each proposal. The versatility of deep learning models for segmentation tasks has also been studied: while models can excel at specific tasks, they often struggle to generalize to others due to task-specific differences. Kolodiazhnyi et al. [7] proposed OneFormer3D, a unified framework for multitask 3D segmentation that addresses this challenge through novel query selection and matching strategies. The Transformer architecture, which has shown remarkable success in 2D image processing, has also been extended to 3D point clouds. Mask3D [8], a pioneering Transformer-based approach, directly predicts instance masks from 3D point clouds, eliminating the need for post-processing steps and improving computational efficiency.
A typical method for 3D instance segmentation, known as 3D-BoNet [9], adopts an anchor-free methodology and employs PointNet++ [10] as its backbone network. Leveraging both local and global features, 3D-BoNet directly predicts bounding boxes via a bounding box regression branch and establishes associations between ground truth and predictions using an association layer. Furthermore, 3D-BoNet improves segmentation performance by predicting instance labels for points with a mask prediction branch. However, 3D-BoNet fails to capture complex features, neglecting the spatial relationships between points.
To address the limitation of 3D-BoNet, we propose a novel 3D instance segmentation framework tailored for indoor scenes. Building upon [9], our framework enhances the backbone network by replacing PointNet++ with the offset-attention mechanism [11] and submanifold sparse convolution [12].
By doing so, the network becomes adept at capturing intricate spatial relationships among points, thus improving its ability to perceive global information. Moreover, TSPconv-Net excels at extracting finer local geometric structures within complex scenes. Notably, our framework eliminates the need for post-processing steps such as clustering, non-maximum suppression [13], or voting. As a result, TSPconv-Net exhibits outstanding efficiency in both training and inference phases.
We conduct experiments on the S3DIS dataset [14], and experimental results indicate that our method can achieve outstanding segmentation performance for repetitive structural objects within complex scenes.
In summary, the contributions of our work are as follows:
  • We propose TSPconv-Net, a novel framework for 3D instance segmentation that leverages the OA mechanism and SSC to extract features in point clouds.
  • We optimize and improve the backbone network of 3D-BoNet, enhancing the ability of TSPconv-Net to capture spatial relationships within point clouds.
  • TSPconv-Net achieves outstanding performance on S3DIS without comprehensive modifications of the model architecture or hyperparameter tuning tailored for each dataset.

2. Related Work

Three-dimensional instance segmentation annotates each point in a 3D point cloud with a semantic category label and a unique instance label. It differs from 3D semantic segmentation in that it must not only consider the semantic information of points but also distinguish between different instances that share the same semantic label. Driven by applications such as medical image processing and 3D reconstruction, 3D instance segmentation has gained significant interest [15]. Recent approaches can be broadly categorized into proposal-based and proposal-free methods.
Proposal-based methods. These methods adopt a two-stage approach: 3D object detection followed by instance label prediction. This strategy offers potentially higher accuracy by initially concentrating on promising regions likely to contain objects. During the first stage, 3D object detection identifies potential objects in the scene. Subsequently, for each identified region of interest (ROI), the method predicts a label for each point within that region, effectively segmenting the objects. Hou et al. [16] proposed 3D-SIS for RGB-D-based 3D semantic instance segmentation. This network utilizes a fully convolutional network to extract 2D color features and 3D geometric features, which are subsequently fused for 3D instance segmentation. Yi et al. [17] proposed a Generative Shape Proposal Network (GSPN) based on PointNet [18]. This network employs a generative proposal method to produce candidate point sets similar to the shape of the target points and encodes features to construct global and local structural information. Consequently, GSPN can effectively learn the shape features of the target, thereby improving segmentation accuracy. Yang et al. [9] proposed 3D-BoNet, a single-stage, anchor-free, and end-to-end trainable network. The network takes a simpler approach, using a point-wise MLP to directly predict 3D bounding boxes for each object and point-wise masks to refine the segmentation. 3D-BoNet also incorporates an optimal assignment method to handle data association effectively, balancing accuracy and speed compared to traditional methods.
In summary, proposal-based methods, exemplified by [19,20,21,22], demonstrate proficiency in handling objects with intricate shapes, achieving precise shapes and poses for each instance. Nonetheless, these methods often adopt multi-stage structures, leading to increased computational overhead and vulnerability to the choice of candidate region generation methods.
Proposal-free methods. These methods typically regard instance segmentation as a post-processing step following semantic segmentation. They operate under the assumption that points belonging to the same instance share similar features, and therefore emphasize feature learning. Wang et al. [23] first proposed SGPN, a deep learning model founded on similarity measurement and a grouping strategy. SGPN uses a deep similarity matrix to measure the similarity between each pair of points in the point cloud and employs a novel grouping strategy to assign points to their respective instances. Similarly, Liu et al. [24] proposed MASC, an instance segmentation network based on SSC and point similarity. This network utilizes a multi-scale point affinity prediction method to measure the similarity between each pair of points and reduces computational costs using sparse convolution. Given that semantic labels can be utilized for instance label prediction, some methods integrate the two tasks into a unified approach. Wang et al. [25] first proposed the ASIS framework, designed to handle instance segmentation and semantic segmentation simultaneously. ASIS employs two collaborative strategies, semantic-aware instance segmentation and instance-fused semantic segmentation, which work in conjunction with each other. Similarly, Zhao et al. proposed JSNet [26] to jointly perform semantic and instance segmentation.
Overall, proposal-free methods, such as [27,28,29,30,31,32], directly extract and classify point clouds, predicting instance labels, thereby offering lower computational costs and faster speeds. However, proposal-free methods encounter difficulties in handling objects with complex shapes and may struggle to achieve precise 3D shapes and poses for each instance.

3. Our Method

3.1. Overview

Our method consists of two primary components: a 3D backbone and a bounding box module. The backbone network takes a 3D point cloud $P \in \mathbb{R}^{N \times 3}$ as input to extract point cloud features $f \in \mathbb{R}^{K}$; we adopt the OA mechanism and SSC as our backbone. The bounding box module consists of parallel network branches dedicated to bounding box regression and instance mask prediction. The box regression branch takes the global features $F_g \in \mathbb{R}^{1 \times K}$ from the backbone and outputs bounding boxes $B \in \mathbb{R}^{H \times 2 \times 3}$, while the instance mask prediction branch takes both the global features $F_g$ and the local features $F_l \in \mathbb{R}^{N \times K}$ as inputs and transforms them into binary point masks. The bounding box regression branch uses the association index $A$ to associate ground truth bounding boxes with predicted bounding boxes and computes the cost matrix $C$. Additionally, we use the cost matrix to define a multi-criteria loss function. The overview of our method is illustrated in Figure 1.

3.2. Backbone Network

Compared to the MLP structure, attention mechanisms can capture not only distance information between points but also complex spatial information, enabling the network to better capture global dependencies and enhance focus on important features. However, attention mechanisms lead to increased network complexity and computational demands. Therefore, we introduce sparse submanifold convolution to reduce a large number of redundant computations, improving network training and inference speed. As a result, our backbone network can achieve excellent feature extraction efficiency.

3.2.1. Offset-Attention-Based Global Feature Extraction

Inspired by Guo et al. [11], we embed the input points $P$ into a new feature space $\mathbb{R}^{E}$, obtaining embedded features $F_e \in \mathbb{R}^{N \times E}$. The structure of the global feature module is shown in Figure 2. To enable the network to learn richer and more comprehensive feature representations, we stack multiple attention modules in TSPconv-Net. This progressive stacking allows each layer to abstract and extract higher-level features from the output of the previous layer, thereby enhancing the network’s learning capacity and robustness. Additionally, the concatenation operation fuses features from different levels, preserving valuable information. The parameters of each layer are learnable, allowing the model to learn the optimal combination of features at different levels and thereby capture more complex and nuanced feature representations. Moreover, fusing feature representations from different levels enhances the model’s robustness to noise and occlusion. However, directly stacking attention modules increases the complexity of the network, which can lead to exploding or vanishing gradients. Therefore, we introduce residual modules to avoid such issues and improve the performance of TSPconv-Net. We then feed $F_e$ into the stacked attention modules and obtain local features $F_l$ through the LBR layer. The LBR layer, composed of a linear layer, a batch normalization layer, and a ReLU activation function, enhances the expressive power and accelerates the convergence of TSPconv-Net. Finally, the local features $F_l$ are transformed into global features $F_g$ through a pooling layer. The specific calculation of the attention features is shown in Equation (1).
$$
\begin{aligned}
F_1 &= AT_1(F_e) \\
F_i &= AT_i(F_{i-1}), \quad i = 2, 3, 4 \\
F_l &= \mathrm{concat}(F_1, \ldots, F_4) \cdot W_o \\
F_g &= \mathrm{Pool}(\mathrm{LBR}(F_l)),
\end{aligned}
$$
where $AT_i$ represents the $i$-th attention layer (each layer has the same input and output dimensions), and $W_o$ represents the weights of the final linear layer.
The embedding feature module consists of two cascaded LBRs. Through this combination, the embedding feature module efficiently extracts data features and enhances the fitting ability of our network to nonlinear data. The specific structure is depicted in Figure 3.
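To make this concrete, the following is a minimal PyTorch sketch of a single LBR unit and of the two cascaded LBRs forming the embedding module; the embedding width E = 128 and the class names are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class LBR(nn.Module):
    """Linear -> BatchNorm -> ReLU block operating on (B, N, C) point features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                   # x: (B, N, in_dim)
        x = self.linear(x)                                   # (B, N, out_dim)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)       # BatchNorm over the channel axis
        return self.relu(x)

class EmbeddingModule(nn.Module):
    """Two cascaded LBRs lifting raw xyz coordinates into an E-dimensional feature space."""
    def __init__(self, in_dim=3, embed_dim=128):             # embed_dim E is an illustrative choice
        super().__init__()
        self.lbr1 = LBR(in_dim, embed_dim)
        self.lbr2 = LBR(embed_dim, embed_dim)

    def forward(self, points):                               # points: (B, N, 3)
        return self.lbr2(self.lbr1(points))                  # F_e: (B, N, E)
```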
After obtaining the embedded features $F_e$, further feature extraction is performed using attention mechanisms. However, the original attention mechanism used in the Transformer [33] only aggregates information from adjacent nodes, yielding the corresponding adjacency matrices, whereas the inherent structural information of point cloud data is equally crucial. Inspired by GCN [34], we replace the original adjacency matrices in the attention mechanism with Laplacian matrices, resulting in an improved attention mechanism known as the offset-attention mechanism. The specific structure of the offset-attention module is shown in Figure 4. Notably, solid lines indicate the self-attention computation process, while dotted lines indicate the offset-attention computation process. Adding the nonlinear transformation (LBR) of the offsets back to the input features enhances the ability to learn local features and preserves the original information, forming a residual structure. This improves the model's ability to capture complex geometric structures and represent features.
The specific processing of OA is as follows: First, the output $F_e$ of the embedding feature module is used as the $E$-dimensional input feature vector, and the query ($Q$), key ($K$), and value ($V$) matrices are obtained through linear transformations, as shown in Equation (2).
$$
\begin{aligned}
(Q, K, V) &= F_e \cdot (W_q, W_k, W_v), \\
Q, K &\in \mathbb{R}^{N \times d_a}, \quad V \in \mathbb{R}^{N \times E}, \\
W_q, W_k &\in \mathbb{R}^{E \times d_a}, \quad W_v \in \mathbb{R}^{E \times E},
\end{aligned}
$$
where $W_q$, $W_k$, and $W_v$ are the weight matrices of shared, learnable linear transformation layers, and $d_a$ is the dimension of $Q$ and $K$.
After obtaining $Q$, $K$, and $V$, the query and key matrices are used to calculate the attention weights $\tilde{A}$ according to Equation (3).
$$
\tilde{A} = (\tilde{a})_{i,j} = Q \cdot K^{T}
$$
The attention matrix $A$ is obtained by normalizing the attention weights $\tilde{A}$ from Equation (3) using the SoftMax operator and the $L_1$-norm, as shown in Equation (4).
$$
\begin{aligned}
\bar{a}_{i,j} &= \mathrm{SoftMax}(\tilde{a}_{i,j}) = \frac{\exp(\tilde{a}_{i,j})}{\sum_{k}\exp(\tilde{a}_{k,j})}, \\
a_{i,j} &= \frac{\bar{a}_{i,j}}{\sum_{k}\bar{a}_{i,k}}, \\
A &= (a)_{i,j}
\end{aligned}
$$
The self-attention output features $F_{sa} \in \mathbb{R}^{N \times E}$ are obtained using Equation (5). The offset between $F_e$ and $F_{sa}$ is then calculated; this offset is analogous to applying a Laplacian matrix to $F_e$. Finally, the offset, transformed by an LBR layer, is added to $F_e$ to obtain the output feature $F_{out}$.
$$
\begin{aligned}
F_{sa} &= A \cdot V \\
F_{out} &= \mathrm{OA}(F_e) = \mathrm{LBR}(F_e - F_{sa}) + F_e \\
F_e - F_{sa} &\approx L F_e,
\end{aligned}
$$
where $L$ represents the Laplacian matrix.
Offset attention uses element-wise offsets between input features and self-attention features to approximate the Laplacian operator, thereby enhancing network performance. Compared to self-attention in the original Transformer, offset attention offers stronger adaptability, higher robustness, and improved model performance.
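The following PyTorch sketch illustrates one offset-attention layer following Equations (2)–(5); the choice $d_a = E/4$ and the exact normalization axes follow common PCT-style implementations and are assumptions of this sketch, not specifications from the paper.

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    """Offset-attention layer implementing Eq. (2)-(5); input/output shape (B, N, E)."""
    def __init__(self, dim, d_a=None):
        super().__init__()
        d_a = d_a or dim // 4                     # assumption: d_a = E/4, a common PCT choice
        self.w_q = nn.Linear(dim, d_a, bias=False)
        self.w_k = nn.Linear(dim, d_a, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.linear = nn.Linear(dim, dim)         # LBR applied to the offset F_e - F_sa
        self.bn = nn.BatchNorm1d(dim)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_e):                                     # f_e: (B, N, E)
        q, k, v = self.w_q(f_e), self.w_k(f_e), self.w_v(f_e)   # Eq. (2)
        energy = torch.bmm(q, k.transpose(1, 2))                # Eq. (3): Q K^T -> (B, N, N)
        attn = torch.softmax(energy, dim=1)                     # SoftMax, Eq. (4)
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)    # L1 normalization, Eq. (4)
        f_sa = torch.bmm(attn, v)                               # F_sa = A V, Eq. (5)
        offset = f_e - f_sa                                     # offset approximates L F_e
        out = self.linear(offset)
        out = self.relu(self.bn(out.transpose(1, 2)).transpose(1, 2))
        return out + f_e                                        # F_out = LBR(F_e - F_sa) + F_e
```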
To address potential issues such as gradient vanishing or exploding during training, we integrate a residual network structure into the offset-attention module. Specifically, we introduce ResNet [35], which employs a single residual module structure as illustrated in Figure 5.
In this structure, each residual module consists of a series of convolutional layers followed by a shortcut connection that directly adds the input of the module to its output. This design helps to mitigate the gradient vanishing problem by allowing gradients to flow more easily through the network during backpropagation. Additionally, the use of the shortcut connections facilitates the learning of identity mappings, enabling the network to learn residual functions effectively.
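As a small illustration of this design, the sketch below shows one generic residual module for point features in PyTorch; the use of point-wise (kernel size 1) convolutions and identical input/output widths is an assumption of this sketch, since Figure 5 is not reproduced here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual module: a stack of convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1), nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1), nn.BatchNorm1d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                          # x: (B, C, N) point features
        # The shortcut adds the input to the conv output, letting gradients bypass the stack.
        return self.relu(self.body(x) + x)
```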
The network performs both max- and average-pooling operations on the extracted features independently. Max-pooling captures the most salient features by selecting the maximum value within each pooling region, while average pooling smooths the feature map by averaging values. These pooled features are then merged to form the final global feature $F_g$, which encapsulates both the detailed and averaged representations of the input data. This approach enhances the model's ability to generalize and improves performance across various tasks.
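As a small illustration of this pooling step, the snippet below max-pools and average-pools the point features and merges them; concatenation is assumed here, since the text only states that the two pooled features are merged.

```python
import torch

def global_feature(f_l):
    """Fuse max- and average-pooled point features into a global descriptor.
    f_l: (B, N, C) local features -> returns (B, 2*C) global feature F_g."""
    f_max = f_l.max(dim=1).values      # most salient activation per channel
    f_avg = f_l.mean(dim=1)            # smoothed activation per channel
    return torch.cat([f_max, f_avg], dim=-1)
```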

3.2.2. Sparse Submanifold Convolution-Based Local Feature Extraction

To achieve superior computational efficiency and reduce memory consumption compared to traditional voxelization methods, we employ sparse submanifold convolution for voxel aggregation. This method only performs convolution on relevant, non-empty voxels, significantly reducing unnecessary computations and memory footprint while effectively extracting local feature information from the point cloud data.
The processing steps of the local feature extraction module mainly involve voxelization of the original point cloud, local feature extraction based on sparse submanifold convolution, and devoxelization.
Voxelization of the original point cloud. To process point cloud data of different scales with convolutions, we first voxelize the point cloud and normalize the coordinates of the input points. All points are translated into a local coordinate system centered at the centroid and then normalized to fit within a unit sphere by dividing by the maximum $L_2$ norm $\max_i \|p_i\|_2$. The coordinates are then linearly mapped (scaled and shifted) from $[-1, 1]$ to $[0, 1]$, completing coordinate normalization. The normalized coordinates are denoted $\{\hat{p}_i\}$, where $\hat{p}_i = (\hat{x}_i, \hat{y}_i, \hat{z}_i)$. Given the normalized coordinates and the voxel resolution $r$, the voxel index of each point is computed. For the points falling into a voxel, the features $f_k$ associated with the coordinates $\hat{p}_k$ are average-pooled, and the averaged feature becomes that voxel's feature. The normalized point cloud $\{\hat{p}_k, f_k\}$ is thus transformed into a voxel grid $\{V_{u,v,w}\}$. The specific calculation is shown in Equation (6).
$$
V_{u,v,w,c} = \frac{1}{N_{u,v,w}} \sum_{k=1}^{n} \mathbb{I}\big[\mathrm{floor}(\hat{x}_k \times r) = u,\; \mathrm{floor}(\hat{y}_k \times r) = v,\; \mathrm{floor}(\hat{z}_k \times r) = w\big] \times f_{k,c},
$$
where $V_{u,v,w,c}$ denotes the feature value of the $c$-th channel of the voxel grid at position $(u, v, w)$, $r$ represents the voxel resolution, $\mathbb{I}[\cdot]$ is a binary indicator of whether the point $\hat{p}_k$ falls into the voxel $(u, v, w)$, $f_{k,c}$ represents the $c$-th channel feature associated with $\hat{p}_k$, and $N_{u,v,w}$ is the normalization factor, i.e., the number of points falling into that voxel.
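A minimal sketch of the average-pooling voxelization in Equation (6) is given below, using a dense r × r × r grid for clarity; a practical implementation would keep only non-empty voxels in a sparse structure, and the helper name voxelize is illustrative.

```python
import torch

def voxelize(coords, feats, r):
    """Average point features into an r x r x r voxel grid (Eq. (6)).
    coords: (N, 3) normalized to [0, 1]; feats: (N, C). Returns a dense grid (r, r, r, C)."""
    idx = torch.clamp((coords * r).floor().long(), min=0, max=r - 1)   # per-point voxel index (u, v, w)
    flat = idx[:, 0] * r * r + idx[:, 1] * r + idx[:, 2]               # flatten the index
    grid = torch.zeros(r * r * r, feats.shape[1], device=feats.device)
    count = torch.zeros(r * r * r, 1, device=feats.device)
    grid.index_add_(0, flat, feats)                                    # sum features per voxel
    count.index_add_(0, flat, torch.ones(len(flat), 1, device=feats.device))
    grid = grid / count.clamp(min=1)                                   # average (N_{u,v,w} normalizer)
    return grid.view(r, r, r, -1)
```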
Submanifold sparse convolution. To implement the convolution operation, we utilize a hash table and a matrix: each row of the matrix stores the features of an active point, and the hash table contains the tuples of all active points together with their row positions in the matrix. Additionally, we define a rule book as a collection $R = (R_i : i \in F)$ consisting of $f^d$ integer matrices, where $f$ is the filter size and $F = \{0, 1, \ldots, f-1\}^d$ enumerates the spatial offsets of the convolutional filter. We reuse the input hash table for the output and construct an appropriate rule book to implement submanifold sparse convolution. The cost of building the hash table and the rule book is $O(l)$, independent of the depth of the network, where $l$ is the number of active points in the input layer.
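This bookkeeping can be illustrated with the following pure-Python sketch, which builds the hash table of active sites and a rule book for an f × f × f filter; it is a conceptual sketch of the data structures only, not the GPU implementation used in practice.

```python
import itertools
from collections import defaultdict

def build_rulebook(active_coords, filter_size=3):
    """Sketch of submanifold sparse convolution bookkeeping.
    active_coords: list of integer (u, v, w) coordinates of non-empty voxels.
    Returns the hash table (coord -> row index) and a rule book mapping each filter
    offset i in F = {0, ..., f-1}^3 to (input_row, output_row) pairs."""
    # Hash table: every active site keeps its row in the feature matrix.
    hash_table = {coord: row for row, coord in enumerate(active_coords)}
    half = filter_size // 2
    rulebook = defaultdict(list)
    for out_coord, out_row in hash_table.items():             # outputs exist only at active sites
        for offset in itertools.product(range(filter_size), repeat=3):
            in_coord = tuple(out_coord[d] + offset[d] - half for d in range(3))
            in_row = hash_table.get(in_coord)                 # O(1) lookup
            if in_row is not None:                            # contribute only if the input is active
                rulebook[offset].append((in_row, out_row))
    return hash_table, rulebook

# The convolution then runs one gather-matmul-scatter per filter offset i:
# for each offset i, output[out_rows] += input[in_rows] @ W_i.
```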
Devoxelization. To map voxel features back to their corresponding points in the point cloud, we employ trilinear interpolation [36] for upsampling. This process recovers higher-resolution features by considering the eight nearest neighboring voxels of each point in the local feature space and interpolating their values along all three dimensions. These steps are repeated for every point requiring interpolation. The voxelized local features are thus mapped back to the original point cloud space, completing devoxelization and yielding the local features $F_l$ of the original point cloud.
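One possible way to realize trilinear devoxelization in PyTorch is sketched below using F.grid_sample, which performs trilinear interpolation on 5D volumes; the coordinate ordering and align_corners choices are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def devoxelize(voxel_feats, coords):
    """Trilinear interpolation of voxel features back onto the original points.
    voxel_feats: (C, r, r, r) voxel grid; coords: (N, 3) normalized to [0, 1].
    Returns (N, C) per-point local features F_l."""
    volume = voxel_feats.unsqueeze(0)                          # (1, C, r, r, r)
    locs = coords * 2.0 - 1.0                                  # map [0, 1] -> [-1, 1]
    # grid_sample expects sample locations ordered (x, y, z) = (W, H, D);
    # flipping (u, v, w) -> (w, v, u) makes the indices match the grid axes.
    locs = locs.flip(-1).view(1, -1, 1, 1, 3)
    sampled = F.grid_sample(volume, locs, mode='bilinear', align_corners=True)
    return sampled.view(voxel_feats.shape[0], -1).t()          # (N, C)
```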

3.3. Bounding Box Processing Module

3D-BoNet [9] consists of a backbone network and two parallel network branches, which perform bounding box regression and per-point mask prediction, respectively. This network structure is simple and easy to train and deploy. Compared to anchor-based methods, 3D-BoNet does not require a complex and time-consuming process for generating candidate bounding boxes, resulting in higher efficiency. Compared to anchor-free methods, 3D-BoNet explicitly predicts the bounding boxes of targets, resulting in instances with better precision. Therefore, we adopt the bounding box regression branch and point mask prediction branch from 3D-BoNet as the bounding box processing module in TSPconv-Net.

3.3.1. Bounding Box Regression Branch

The bounding box regression branch uses the global features $F_g$ extracted from the backbone as its input and directly regresses the bounding boxes $B$ and the corresponding bounding box scores $B_s$. A loss function is constructed from the predicted bounding boxes $B$ and the ground truth bounding boxes $\bar{B}$ via the bounding box association layer, and the overall network is optimized by minimizing this loss.
Bounding box encoding. Rather than defining a bounding box by its center position and three side lengths, this module represents each bounding box by its minimum and maximum vertices: $\{[x_{min}, y_{min}, z_{min}], [x_{max}, y_{max}, z_{max}]\}$.
Neural layer. This consists of two fully connected layers using the Leaky ReLU activation function, followed by two parallel fully connected layers that form two branches with different outputs. One branch outputs a $6H$-dimensional vector and reshapes it into an $H \times 2 \times 3$ tensor, where $H$ is the maximum number of bounding boxes the network can predict and $B \in \mathbb{R}^{H \times 2 \times 3}$ denotes the predicted bounding boxes. The other branch computes the corresponding bounding box scores $B_s$ using the sigmoid function; $B_s$ is positively correlated with the likelihood that the bounding box contains an object.
Bounding box association layer. This layer associates the predicted bounding boxes $B$ with the ground truth bounding boxes $\bar{B} \in \mathbb{R}^{T \times 2 \times 3}$, where $T$ is the number of ground truth bounding boxes, and the overall loss function of the network is then defined based on the cost between $B$ and $\bar{B}$. Since the number of predicted bounding boxes usually differs from the number of ground truth bounding boxes (assuming $H \geq T$), the network must associate a unique predicted bounding box $B_i$ with each ground truth bounding box $\bar{B}_j$ to achieve an optimal allocation. Let $C$ be the cost matrix between predicted and ground truth bounding boxes, where $C_{i,j}$ is the cost of matching the two boxes, with smaller costs indicating higher similarity. The bounding box association layer is therefore tasked with determining the allocation matrix $A$ that minimizes the overall cost, as shown in Equation (7).
$$
A = \arg\min_{A} \sum_{i=1}^{H} \sum_{j=1}^{T} C_{i,j} A_{i,j},
$$
where $A$ is a Boolean association matrix and $A_{i,j}$ indicates whether the $i$-th predicted bounding box is assigned to the $j$-th ground truth bounding box. Each column of $A$ represents the association between the $j$-th ground truth bounding box and all predicted bounding boxes, so the elements of each column sum to 1. Moreover, since each predicted bounding box is associated with at most one ground truth box, the elements of each row sum to at most 1.
Due to the sparsity and uneven distribution of 3D point clouds, it is not feasible to directly use the Euclidean distance between the predicted bounding box and the ground truth bounding box as the sole criterion. As shown in Figure 6, predicted bounding box $PB_2$ contains more valid points than predicted bounding box $PB_1$ and has a greater overlap with the ground truth bounding box $GB$, so $GB$ should be associated with $PB_2$. As a result, the bounding box regression branch not only considers the Euclidean distance between vertices and the soft intersection over union but also incorporates a cross-entropy score into the cost matrix calculation. To solve this optimal allocation problem, the bounding box regression branch uses the Hungarian algorithm [37,38].
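A compact sketch of the optimal allocation in Equation (7) is shown below using SciPy's Hungarian solver; the function name associate_boxes is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_boxes(cost):
    """Solve Eq. (7): assign each ground-truth box a unique predicted box.
    cost: (H, T) matrix C with C[i, j] = cost of matching prediction i to ground truth j.
    Returns a Boolean association matrix A of shape (H, T)."""
    pred_idx, gt_idx = linear_sum_assignment(cost)   # Hungarian algorithm, minimizes total cost
    A = np.zeros_like(cost, dtype=bool)
    A[pred_idx, gt_idx] = True                       # each column gets exactly one 1 when H >= T
    return A
```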
The costs of the three components are calculated as follows:
(1) Euclidean Distance. Defined as the Euclidean distance between the $i$-th predicted bounding box $B_i$ and the $j$-th ground truth bounding box $\bar{B}_j$, i.e., the average of the squared differences between their vertices, as given in Equation (8).
$$
C_{i,j}^{ed} = \frac{1}{6} \sum \left( B_i - \bar{B}_j \right)^2
$$
(2) Soft Intersection over Union (sIoU). Defined as the sIoU between the predicted bounding box and the ground truth bounding box, as given in Equation (9).
$$
C_{i,j}^{sIoU} = \frac{-\sum_{n=1}^{N} q_i^n \, \bar{q}_j^n}{\sum_{n=1}^{N} q_i^n + \sum_{n=1}^{N} \bar{q}_j^n - \sum_{n=1}^{N} q_i^n \, \bar{q}_j^n},
$$
where $q_i^n$ represents the probability of the $n$-th point lying inside the $i$-th predicted bounding box, and $\bar{q}_j^n$ represents the probability of the $n$-th point lying inside the $j$-th ground truth bounding box. These probabilities are calculated using Algorithm 1 (a vectorized sketch of the algorithm is given after Equation (10)).
Algorithm 1 Calculate the probability of points in the input point cloud $P$ being inside the predicted bounding boxes $B$, where $H$ is the number of predicted bounding boxes, $N$ is the number of points in the input point cloud, and $\theta_1$ and $\theta_2$ are hyperparameters set to 100 and 20, respectively, for numerical stability.
1:  for $i \leftarrow 1$ to $H$ do
2:      Obtain the $i$-th bounding box minimum vertex $B_{min}^{i} = [x_{min}^{i}, y_{min}^{i}, z_{min}^{i}]$
3:      Obtain the $i$-th bounding box maximum vertex $B_{max}^{i} = [x_{max}^{i}, y_{max}^{i}, z_{max}^{i}]$
4:      for $n \leftarrow 1$ to $N$ do
5:          Obtain the $n$-th point $P_n = (x_n, y_n, z_n)$
6:          $\Delta xyz \leftarrow \min(P_n - B_{min}^{i},\; B_{max}^{i} - P_n)$
7:          $\Delta xyz \leftarrow \max(\min(\theta_1 \Delta xyz,\, \theta_2),\, -\theta_2)$
8:          Obtain the per-axis probability $P_{xyz} = \frac{1}{1 + e^{-\Delta xyz}}$
9:          Obtain the point probability $q_i^n = \min(P_{xyz})$
10:     end for
11:     Obtain the soft-binary vector $q_i = [q_i^1, \ldots, q_i^N]$
12: end for
(3) Cross-Entropy Score. Defined as the cross-entropy score between $q_i^n$ and $\bar{q}_j^n$, reflecting the confidence that the predicted bounding box contains as many valid points as possible. The calculation is shown in Equation (10).
$$
C_{i,j}^{ces} = -\frac{\sum_{n=1}^{N} \left[ \bar{q}_j^n \log q_i^n + \left( 1 - \bar{q}_j^n \right) \log \left( 1 - q_i^n \right) \right]}{N}
$$
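As referenced above, the following is a vectorized PyTorch sketch of Algorithm 1; the per-axis margin is taken as the distance to the nearest box face (positive inside the box) so that q approaches 1 for interior points, which is our reading of lines 6–9 of the algorithm.

```python
import torch

def point_in_box_probability(points, box_min, box_max, theta1=100.0, theta2=20.0):
    """Soft probability q_i^n that each point lies inside each predicted box (Algorithm 1).
    points: (N, 3); box_min, box_max: (H, 3). Returns (H, N)."""
    # Signed distance to the nearest face along each axis (positive inside the box).
    d_min = points.unsqueeze(0) - box_min.unsqueeze(1)            # (H, N, 3)
    d_max = box_max.unsqueeze(1) - points.unsqueeze(0)            # (H, N, 3)
    delta = torch.minimum(d_min, d_max)                           # per-axis margin
    delta = torch.clamp(theta1 * delta, min=-theta2, max=theta2)  # numerical stability
    p_xyz = torch.sigmoid(delta)                                  # per-axis probability
    return p_xyz.min(dim=-1).values                               # q_i^n = min over x, y, z
```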
The bounding box prediction multi-criteria loss is computed by the bounding box association layer, which determines the optimal predicted bounding box for each ground truth bounding box by minimizing the cost. The loss function consists of three components: (1) the Euclidean distance, (2) the sIoU, and (3) the cross-entropy score, as formally expressed in Equation (11):
$$
\ell_{bbox} = \frac{1}{T} \sum_{t=1}^{T} \left( C_{t,t}^{ed} + C_{t,t}^{sIoU} + C_{t,t}^{ces} \right).
$$
The bounding box prediction score loss evaluates the validity of the predicted bounding boxes. After reordering by the association matrix $A$, the ground truth values for the first $T$ scores are "1" and for the remaining $H - T$ scores are "0". The loss is defined using cross-entropy, as in Equation (12):
$$
\ell_{bbs} = -\frac{1}{H} \left[ \sum_{t=1}^{T} \log B_s^t + \sum_{t=T+1}^{H} \log \left( 1 - B_s^t \right) \right],
$$
where $B_s^t$ represents the score of the $t$-th associated predicted bounding box.
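Once the cost matrices and the association matrix are available, the two box losses can be assembled as in the following sketch; the tensor shapes and the helper name bbox_losses are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def bbox_losses(C_ed, C_sIoU, C_ces, assoc, box_scores):
    """Assemble Eq. (11) and Eq. (12) after association.
    C_*: (H, T) cost matrices; assoc: Boolean (H, T) association matrix from Eq. (7);
    box_scores: (H,) predicted scores B_s in (0, 1)."""
    pred_idx, gt_idx = torch.nonzero(assoc, as_tuple=True)      # matched (prediction, ground truth) pairs
    l_bbox = (C_ed[pred_idx, gt_idx] + C_sIoU[pred_idx, gt_idx]
              + C_ces[pred_idx, gt_idx]).mean()                  # Eq. (11): average over the T matches
    target = torch.zeros_like(box_scores)
    target[pred_idx] = 1.0                                       # matched boxes are positives
    l_bbs = F.binary_cross_entropy(box_scores, target)           # Eq. (12): averaged over all H boxes
    return l_bbox, l_bbs
```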

3.3.2. Point Mask Prediction Branch

The point mask prediction branch feeds the global features $F_g$ and local features $F_l$ into its neural layers to refine the predicted bounding boxes $B$, as shown in Figure 7, thereby enhancing the segmentation quality within the predicted bounding boxes. A focal loss with default hyperparameters is used to optimize this branch.
Neural layers. The global and local features are first compressed into 256-dimensional vectors through fully connected layers and then further compressed into 128-dimensional fused features $\tilde{F}_l$. For the $i$-th predicted bounding box $B_i$, the predicted vertices are fused with $\tilde{F}_l$ to obtain the perception features $\hat{F}_l$. Finally, the point mask $M_i$ is obtained through shared fully connected layers.
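A minimal sketch of these neural layers is given below; the hidden width of the mask head and the inclusion of the box score alongside the six box vertices are assumptions of this sketch, not details specified in the paper.

```python
import torch
import torch.nn as nn

class PointMaskBranch(nn.Module):
    """Sketch of the point mask prediction branch: fuse global and local features,
    condition on each predicted box, and emit a per-point mask for that box."""
    def __init__(self, local_dim, global_dim):
        super().__init__()
        self.compress = nn.Sequential(
            nn.Linear(local_dim + global_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 128), nn.ReLU(inplace=True))              # fused features F~_l
        # 128 fused channels + 6 box vertices + 1 box score -> shared point-wise layers
        self.mask_head = nn.Sequential(
            nn.Linear(128 + 7, 64), nn.ReLU(inplace=True),           # 64 is an assumed width
            nn.Linear(64, 1))

    def forward(self, f_l, f_g, boxes, scores):
        # f_l: (N, C_l), f_g: (C_g,), boxes: (H, 2, 3), scores: (H,)
        n, h = f_l.shape[0], boxes.shape[0]
        fused = self.compress(torch.cat([f_l, f_g.expand(n, -1)], dim=-1))       # (N, 128)
        box_feat = torch.cat([boxes.view(h, 6), scores.unsqueeze(-1)], dim=-1)   # (H, 7)
        per_box = torch.cat([fused.unsqueeze(0).expand(h, -1, -1),
                             box_feat.unsqueeze(1).expand(-1, n, -1)], dim=-1)   # (H, N, 135)
        return torch.sigmoid(self.mask_head(per_box)).squeeze(-1)                # (H, N) point masks
```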
The loss function of the bounding box processing network consists of the loss functions from both the bounding box regression branch and the point mask prediction branch. TSPconv-Net optimizes the network through this loss function to improve network performance and segmentation effectiveness. The loss function of TSPconv-Net is defined as Equation (13).
$$
\ell_{all} = \ell_{sem} + \ell_{bbox} + \ell_{bbs} + \ell_{pmask},
$$
where $\ell_{sem}$ denotes the standard softmax cross-entropy loss used for learning the semantic label of each point, and $\ell_{pmask}$ denotes the focal loss [39] with default hyperparameters used to optimize the point mask prediction branch.

4. Experiments and Analysis

We primarily utilize the publicly available S3DIS dataset [14] provided by Stanford University. This dataset comprises 6 educational and office areas, totaling 272 rooms and over 215 million points. These rooms cover 11 different scene types, such as offices, conference rooms, and auditoriums. We implement TSPconv-Net in Python with PyTorch 1.10 and train it on a single GTX 1660 Ti. TSPconv-Net is trained with mini-batches of size 32, ensuring a manageable computational load while maintaining gradient stability. Training runs for 250 epochs, allowing the network sufficient time to converge. The initial learning rate is set to 0.01, enabling efficient exploration of the parameter space in the early stages of training, and is adjusted by a learning rate scheduler to fine-tune the model as training progresses.
The dataset comprises 13 semantic categories, and each point in each scene is labeled as one of these categories: ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, and other objects. We select Area1, Area3, Area4, Area5, and Area6 as the training set, and Area2 serves as the test set.

4.1. Visualization Results

To more clearly demonstrate the optimization of object boundary extraction in our method, we select four representative scenes for visualization. As shown in Figure 8, we can clearly observe that, with the aid of bounding box prediction, our method excels in accurately extracting and distinguishing closely connected objects in the scene.
In these scenarios, traditional segmentation methods often suffer from blurred boundaries or overlapping objects, especially when the objects are densely packed or have complex shapes. However, by introducing bounding box prediction, our method can create clearer boundaries between objects, effectively reducing mis-segmentation.
For instance, in the classroom scene, our method precisely distinguishes and extracts the densely arranged chairs and tables. The bounding box helps to better define the spatial boundaries of the objects, avoiding confusion between chairs or between chairs and tables. Similarly, in conference room scenes, bounding box prediction effectively separates closely positioned chairs and tables, ensuring that each object is accurately recognized and segmented. This approach not only enhances segmentation performance in complex scenes but also significantly improves the model’s ability to handle details, making the boundaries of each object more distinct.
Through this optimization, our method has significantly improved the extraction of individual objects, especially in scenarios where objects are densely arranged or have complex shapes. Compared to traditional segmentation methods, the bounding box prediction technique effectively reduces mis-segmentation, making the final segmentation results more visually clear and accurate. This further proves the superiority of our method in multi-object scenes.

4.2. Comparison Results Analysis

We compare TSPconv-Net with SGPN [23], ASIS [25], and 3D-BoNet [9]. The comparison results are presented in Table 1. For quantitative evaluation, we primarily assess the mean precision (mPrec), mean recall (mRec), and mean average precision (mAP). The results show that our method outperforms the others in terms of mPrec and mRec. Compared to SGPN, our network shows improvements of 30.4% and 21% in mPrec and mRec, respectively. Compared to ASIS, our method exhibits improvements of 5% and 4.7% in mPrec and mRec, respectively. Compared to 3D-BoNet [9], our method demonstrates improvements of 3% and 4.6% in mPrec and mRec, respectively.
Our results demonstrate that proposal-based techniques, exemplified by 3D-BoNet, outperform direct feature-based methods in extracting repetitive objects, achieving superior precision. This advantage likely arises from their ability to generate candidate regions likely to contain objects, focusing analysis on these key areas. Furthermore, TSPconv-Net achieves even higher precision than 3D-BoNet by leveraging offset attention (OA) and submanifold sparse convolution (SSC) as the backbone for feature extraction, thus obtaining more effective features. To objectively assess segmentation performance, we employ the mean average precision (mAP), a standard metric for evaluating detection accuracy. As shown in Table 1, our method achieves an mAP of 60.1%, a significant improvement of 16.5% over SGPN, outperforming ASIS by 4.8% and 3D-BoNet by 2.6%. Finally, we evaluate the time efficiency of our method against existing approaches (Table 1). Because the SSC in our network eliminates a significant amount of unnecessary spatial convolution, TSPconv-Net demonstrates exceptional efficiency, processing data nearly 26 times faster than SGPN; it is also faster than ASIS.
In summary, our method demonstrates excellent results in terms of average precision, average recall, mean average precision, and computation time. This indicates that using predicted bounding boxes for single-object extraction can obtain higher precision in individual object extraction.

5. Conclusions

In this paper, we propose an efficient 3D instance segmentation framework called TSPconv-Net. We introduce OA and SSC as the backbone network of TSPconv-Net, greatly enhancing the ability of TSPconv-Net to capture and represent deep features of point clouds. We feed these features into the bounding box processing module based on 3D-BoNet to obtain segmentation results. TSPconv-Net achieves better performance and comparable speed to 3D-BoNet on the S3DIS dataset. This further confirms the effectiveness of using OA and SSC for extracting point cloud features.
However, due to the introduction of the offset-attention mechanism to handle global features of the point cloud, TSPconv-Net undoubtedly increases the computational complexity of the network. Although TSPconv-Net avoids a large amount of meaningless computation through submanifold sparse convolution, its processing speed is still not ideal compared to the original 3D-BoNet. This issue warrants further investigation in future research, aiming to find ways to enhance the network’s efficiency.

Author Contributions

Conceptualization, X.N.; methodology, X.N. and Z.L.; software, Z.L.; validation, Y.M. and Y.W.; formal analysis, Z.S.; investigation, Y.W.; resources, H.J. and Z.S.; data curation, Y.M.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L.; visualization, Z.L.; supervision, X.N.; project administration, Y.W.; funding acquisition, X.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant numbers 62272383, 62172190, and 62172416; the Natural Science Basic Research Program of Shaanxi Province under Grant number 2024GX-YBXM-120; and Tianjin Key Laboratory of Rail Transit Navigation Positioning and Spatio-temporal Big Data Technology under Grant number TKL2023B11.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Badue, C.; Guidolini, R.; Carneiro, R.V.; Azevedo, P.; Cardoso, V.B.; Forechi, A.; Jesus, L.; Berriel, R.; Paixao, T.M.; Mutz, F.; et al. Self-driving cars: A survey. Expert Syst. Appl. 2021, 165, 113816. [Google Scholar] [CrossRef]
  2. Billard, A.; Kragic, D. Trends and challenges in robot manipulation. Science 2019, 364, eaat8414. [Google Scholar] [CrossRef] [PubMed]
  3. Shen, D.; Wu, G.; Suk, H.I. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248. [Google Scholar] [CrossRef] [PubMed]
  4. Han, X.F.; Laga, H.; Bennamoun, M. Image-based 3D object reconstruction: State-of-the-art and trends in the deep learning era. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1578–1604. [Google Scholar] [CrossRef] [PubMed]
  5. Rozenberszki, D.; Litany, O.; Dai, A. Unscene3D: Unsupervised 3D instance segmentation for indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 19957–19967. [Google Scholar]
  6. Kolodiazhnyi, M.; Vorontsova, A.; Konushin, A.; Rukhovich, D. Top-down beats bottom-up in 3D instance segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 3566–3574. [Google Scholar]
  7. Kolodiazhnyi, M.; Vorontsova, A.; Konushin, A.; Rukhovich, D. Oneformer3D: One transformer for unified point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 20943–20953. [Google Scholar]
  8. Schult, J.; Engelmann, F.; Hermans, A.; Litany, O.; Tang, S.; Leibe, B. Mask3D: Mask transformer for 3D semantic instance segmentation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 8216–8223. [Google Scholar] [CrossRef]
  9. Yang, B.; Wang, J.; Clark, R.; Hu, Q.; Wang, S.; Markham, A.; Trigoni, N. Learning object bounding boxes for 3D instance segmentation on point clouds. Adv. Neural Inf. Process. Syst. 2019, 32, 6737–6746. [Google Scholar]
  10. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  11. Guo, M.H.; Cai, J.X.; Liu, Z.N.; Mu, T.J.; Martin, R.R.; Hu, S.M. Pct: Point cloud transformer. Comput. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
  12. Graham, B.; Engelcke, M.; Van Der Maaten, L. 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9224–9232. [Google Scholar]
  13. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
  14. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1534–1543. [Google Scholar]
  15. Siddiqui, M.Y.; Ahn, H. Deep learning-based 3D instance and semantic segmentation: A review. J. Artif. Intell. 2022, 4, 99. [Google Scholar] [CrossRef]
  16. Hou, J.; Dai, A.; Nießner, M. 3D-sis: 3D semantic instance segmentation of rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4421–4430. [Google Scholar]
  17. Yi, L.; Zhao, W.; Wang, H.; Sung, M.; Guibas, L.J. Gspn: Generative shape proposal network for 3D instance segmentation in point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3947–3956. [Google Scholar]
  18. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  19. Narita, G.; Seno, T.; Ishikawa, T.; Kaji, Y. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4205–4212. [Google Scholar]
  20. Zhang, F.; Guan, C.; Fang, J.; Bai, S.; Yang, R.; Torr, P.H.; Prisacariu, V. Instance segmentation of lidar point clouds. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 9448–9455. [Google Scholar] [CrossRef]
  21. Liu, S.H.; Yu, S.Y.; Wu, S.C.; Chen, H.T.; Liu, T.L. Learning gaussian instance segmentation in point clouds. arXiv 2020, arXiv:2007.09860. [Google Scholar]
  22. Engelmann, F.; Bokeloh, M.; Fathi, A.; Leibe, B.; Nießner, M. 3D-mpa: Multi-proposal aggregation for 3D semantic instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9031–9040. [Google Scholar]
  23. Wang, W.; Yu, R.; Huang, Q.; Neumann, U. Sgpn: Similarity group proposal network for 3D point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2569–2578. [Google Scholar]
  24. Liu, C.; Furukawa, Y. Masc: Multi-scale affinity with sparse convolution for 3D instance segmentation. arXiv 2019, arXiv:1902.04478. [Google Scholar]
  25. Wang, X.; Liu, S.; Shen, X.; Shen, C.; Jia, J. Associatively segmenting instances and semantics in point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4096–4105. [Google Scholar]
  26. Zhao, L.; Tao, W. Jsnet: Joint instance and semantic segmentation of 3D point clouds. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12951–12958. [Google Scholar] [CrossRef]
  27. Chen, S.; Fang, J.; Zhang, Q.; Liu, W.; Wang, X. Hierarchical aggregation for 3D instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15467–15476. [Google Scholar]
  28. He, T.; Shen, C.; Van Den Hengel, A. Dyco3D: Robust instance segmentation of 3D point clouds through dynamic convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 354–363. [Google Scholar]
  29. He, T.; Yin, W.; Shen, C.; Van den Hengel, A. Pointinst3D: Segmenting 3D instances by points. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 286–302. [Google Scholar]
  30. Vu, T.; Kim, K.; Luu, T.M.; Nguyen, T.; Yoo, C.D. Softgroup for 3D instance segmentation on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2708–2717. [Google Scholar]
  31. Wu, Y.; Shi, M.; Du, S.; Lu, H.; Cao, Z.; Zhong, W. 3D instances as 1d kernels. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 235–252. [Google Scholar]
  32. Zhao, W.; Yan, Y.; Yang, C.; Ye, J.; Yang, X.; Huang, K. Divide and conquer: 3D point cloud instance segmentation with point-wise binarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 562–571. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 15. [Google Scholar] [CrossRef]
  34. Shi, X.; Chai, X.; Xie, J.; Sun, T. Mc-gcn: A multi-scale contrastive graph convolutional network for unconstrained face recognition with image sets. IEEE Trans. Image Process. 2022, 31, 3046–3055. [Google Scholar] [CrossRef] [PubMed]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  36. Fadnavis, S. Image interpolation techniques in digital image processing: An overview. Int. J. Eng. Res. Appl. 2014, 4, 70–73. [Google Scholar]
  37. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  38. Kuhn, H.W. Variants of the Hungarian method for assignment problems. Nav. Res. Logist. Q. 1956, 3, 253–258. [Google Scholar] [CrossRef]
  39. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Figure 1. The pipeline of TSPconv-Net.
Figure 2. Global feature module.
Figure 3. Feature embedding module.
Figure 4. The structure of the offset-attention module.
Figure 5. Residual structure.
Figure 6. Predicted bounding box selection.
Figure 7. The architecture of the point mask prediction branch.
Figure 8. The segmentation results of TSPconv-Net across various scenes. Different colors represent distinct instances, and instances of the same type may be depicted in different colors.
Table 1. Results of TSPconv-Net on the S3DIS dataset.

Methods       | mPrec (%) | mRec (%) | mAP (%) | Total Time (s)
SGPN [23]     | 38.2      | 31.2     | 43.6    | 8821.3
ASIS [25]     | 63.6      | 47.5     | 55.3    | 402.3
3D-BoNet [9]  | 65.6      | 47.6     | 57.5    | 294.1
TSPconv-Net   | 68.6      | 52.2     | 60.1    | 326.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
