Article

PVI-Net: Point–Voxel–Image Fusion for Semantic Segmentation of Point Clouds in Large-Scale Autonomous Driving Scenarios

School of Electrical Engineering and Information Engineering, Lanzhou University of Technology, Lanzhou 730050, China
* Author to whom correspondence should be addressed.
Information 2024, 15(3), 148; https://doi.org/10.3390/info15030148
Submission received: 26 January 2024 / Revised: 27 February 2024 / Accepted: 6 March 2024 / Published: 7 March 2024
(This article belongs to the Topic Advances in Artificial Neural Networks)

Abstract
In this study, we introduce a novel framework for the semantic segmentation of point clouds in autonomous driving scenarios, termed PVI-Net. This framework uniquely integrates three different data perspectives—point clouds, voxels, and distance maps—executing feature extraction through three parallel branches. Throughout this process, we design a point cloud–voxel cross-attention mechanism and a point–image multi-perspective feature fusion strategy. These strategies facilitate information interaction across the feature dimensions of the different perspectives, thereby optimizing the fusion of information from the various viewpoints and significantly enhancing the overall performance of the model. The network employs a U-Net structure and residual connections, effectively merging and encoding information to improve the precision and efficiency of semantic segmentation. We validated the performance of PVI-Net on the SemanticKITTI and nuScenes datasets. The results demonstrate that PVI-Net surpasses most previous methods across various performance metrics.

1. Introduction

In recent years, with the rapid development of artificial intelligence technology, 3D point cloud processing has become an important branch in the field of computer vision. Especially in outdoor scenes, such as autonomous driving, urban planning, and Geographic Information Systems (GISs), LiDAR point cloud segmentation technology plays a crucial role. For autonomous vehicles, accurate point cloud segmentation is key to safe navigation and decision making. Due to the working principle of LiDAR sensors, the collected point cloud data may have uneven density and occlusion issues. These characteristics make extracting accurate and reliable semantic information from these data a challenging task.
Recent advancements in point cloud semantic segmentation have substantially contributed to the field, particularly within large-scale autonomous driving scenarios [1,2,3,4]. These advancements predominantly revolve around the effective processing and analytical representation of voluminous point cloud data, captured through LiDAR technology. Our work introduces a novel conceptualization within this domain, where a single point cloud dataset is represented through three distinct but complementary perspectives: point-based, voxel-based, and distance map representations. This unique approach aims to enhance the model’s feature extraction capabilities by leveraging the intrinsic advantages of each representation method, thereby enriching the semantic segmentation process. Among these, voxel-based methods convert point clouds into three-dimensional grids and use 3D convolutional neural networks for processing, which is convenient for capturing spatial information but requires high resolution when dealing with sparse point clouds, increasing computational and storage burdens. Direct point-based methods retain the precision of the original structure but are computationally inefficient when dealing with unstructured data, while image-based methods accelerate processing, but they may lose three-dimensional spatial information when projecting point clouds into two-dimensional images, affecting segmentation accuracy.
Therefore, we found that, in building models for large-scene point cloud segmentation, the fusion of the point cloud, voxel, and distance map perspectives is not just a simple data overlay but a multi-dimensional information fusion strategy. Point clouds, as a high-fidelity representation of the raw data, maintain the original precision of spatial information and the integrity of microscopic details, directly reflecting the depth perception of scenes. Voxelization, though introducing some quantization error, provides an intuitive and operable geometric expression of the macro form and volumetric characteristics of the data. Distance maps, as a higher-level representation of the spatial relationships in point clouds, provide a key perspective for understanding the geometric continuity and topological structure of scenes by encoding the spatial distances between points. This multi-dimensional data representation strategy lays the foundation for the in-depth analysis and accurate segmentation of large-scale point cloud scenes, and a segmentation model that deeply integrates these different perspectives shows outstanding robustness and accuracy in processing complex, large-scale scenes.
In this study, we propose an adaptive point–voxel–distance map feature fusion framework, PVI-Net, to optimize the semantic segmentation of point clouds in outdoor scenes. This framework combines the advantages of point cloud, voxel, and distance map perspectives, providing a comprehensive perspective for processing complex large-scale data. PVI-Net uses a multi-layer feature extraction and fusion mechanism, combining multi-layer perceptron (MLP), 3D sparse convolution, and 2D convolution, implementing effective feature fusion and information encoding retention through a U-Net structure and residual connections, thereby improving the accuracy and efficiency of semantic segmentation. Specifically, the point cloud–voxel cross-attention mechanism and point–image multi-perspective feature fusion strategy effectively handle the structural differences and information fusion between different perspectives, enhancing the overall performance of the model. For computational efficiency, PVI-Net reduces the computational cost of multi-perspective fusion through optimization strategies. Voxelization processing quickly filters point clouds in the early stage of data processing, reducing the processing burden on high-density information, while the high-level spatial relationship expression provided by distance maps helps the model quickly identify scene features, reducing the need for point-by-point analysis of complex data. These strategies collectively contribute to effectively improving computational efficiency and resource management balance while maintaining high segmentation accuracy. The experimental results show that PVI-Net performs excellently in processing point cloud data of complex outdoor scenes. The evaluation results on two key datasets, SemanticKITTI and nuScenes, show that PVI-Net performs excellently in terms of point cloud semantic segmentation accuracy in large-scale autonomous driving scenarios.
Our work offers the following key contributions:
  • Proposing PVI-Net, a semantic segmentation framework for large-scale point cloud scenes, which integrates three different data perspectives—point cloud, voxel, and distance map—achieving an adaptive multi-dimensional information fusion strategy.
  • Designing point–voxel cross-attention and Multi-perspective Fusion Attention (MF-Attention) mechanisms in the network structure, effectively addressing the structural differences and information fusion issues between different perspectives.
  • Designing a multi-perspective feature post-fusion module. This module can effectively combine features from point clouds, voxels, and distance maps. In the post-fusion stage, the model integrates information from different perspectives, enhancing semantic understanding of complex outdoor scenes.

2. Related Works

2.1. Point Processing in Point Cloud Segmentation

Point-based methods [5,6,7,8] are renowned for their ability to learn global features directly from raw point clouds. However, they fall short in capturing details and local structures within point clouds. To address this deficiency, multi-scale processing methods [9] have been proposed. Such methods enhance the understanding of complex structures by analyzing point cloud features at different scales. Nevertheless, these methods often increase computational burdens. Graph-based methods [10], on the other hand, have turned to a new processing strategy, transforming point cloud data into graph structures and utilizing graph neural networks to capture complex relationships between points. This approach is particularly suitable for processing unstructured point cloud data but faces high computational costs in graph construction and processing. Overall, in the field of large-scale autonomous driving point cloud processing, point cloud data are unstructured, meaning the data points are unordered, and the number of neighbors for each point can vary. This irregularity poses significant challenges to point cloud processing.

2.2. Voxel Processing in Point Cloud Segmentation

Voxel-based point cloud segmentation [11,12,13] has garnered widespread attention in the understanding of autonomous driving scenes. Park et al. [14] proposed an Efficient Point Cloud Transformer (EPT) based on local self-attention to understand large-scale 3D scenes. EPT, due to its voxel structure, offers faster inference speeds compared with point-based work. Wang et al. [15] introduced a Dynamic Sparse Voxel Transformer (DSVT), a Voxel Transformer backbone based on a single-step window for outdoor 3D perception. This method divides a series of local regions in each window according to sparsity and then computes the features of all regions in a fully parallel manner. Although these methods use sparse voxel grids to reduce memory occupancy and employ layered and multi-scale voxel representations to capture more details, the conversion of point cloud data into voxel format faces detail loss due to voxelization. Our proposed PVI-Net bridges this gap through multiple perspectives.

2.3. Range Image Processing in Point Cloud Segmentation

Recent advancements in point cloud segmentation have highlighted the potential of range images as a complementary representation to traditional point-based and voxel-based methods. Range images, derived from point clouds through spherical projection, maintain depth information in a structured, image-like format, facilitating the application of mature 2D image-processing techniques. The transformation of point clouds into range images involves projecting 3D points onto a 2D plane based on their azimuth and elevation angles relative to a specific viewpoint, typically the sensor origin. This process preserves the spatial locality and depth information, offering a compact representation that is particularly beneficial for capturing surface geometries and contours. Several notable studies have leveraged range images for enhancing point cloud analysis. For instance, RangeNet++ [16] employs a deep neural network to segment range images semantically, exploiting their structured nature for efficient processing. Similarly, SqueezeSeg [17] and its successors demonstrate the efficacy of convolutional neural networks in interpreting range images for tasks like semantic segmentation and object detection within point clouds. Despite their advantages, range images are not without challenges. The projection process can introduce distortions, particularly at large distances or near the edges of the field of view. Therefore, considering how to further narrow the gap between 2D image processing and 3D point cloud analysis is a potential objective.

2.4. Multi-Perspective Fusion

The advantages of multi-perspective point cloud segmentation [18,19,20] are primarily manifested in its ability to provide a more comprehensive spatial understanding than a single perspective. In multi-perspective point cloud segmentation, data from different angles are fused to form more complete three-dimensional representations of target objects or environments. Chen et al. [21] explored interactive fusion between point cloud and image data, using an autoencoder structure to enhance the performance of 3D object detection through simultaneously learning features of point clouds and images. Tang et al. [8] focused on finding efficient 3D architectures. They combined sparse point and voxel convolutions, aiming to create a network that is both efficient and accurate for processing point cloud data. These methods can significantly reduce the occlusions and blind spots caused by single perspectives, especially in complex environments. Compared with previous methods, our approach proposes a point–voxel–image tri-perspective point cloud semantic segmentation framework, which enables capturing more information about shape, size, and other important features from multiple angles.

3. Methodology

In this section, we provide a comprehensive introduction to the PVI-Net framework for point cloud processing in outdoor scene segmentation. In Section 3.1, we outline the overall structure of the network and data flow. Following this, in Section 3.2, we detail the input data sources and feature extraction processes of the network’s three key branches. Further, in Section 3.3, we delve into the fusion methods of these three branches during the feature extraction stage and the key modules designed for the post-fusion stage. This section aims to offer an in-depth understanding of the details of the PVI-Net framework, showcasing its efficiency and innovation in processing complex outdoor scene point cloud data.

3.1. Overview

Figure 1 shows our newly developed PVI-Net network, a tri-branch feature fusion network for point cloud semantic segmentation. For the input point cloud data, we first map the point cloud features into voxel grid features, providing input for the voxel feature learning branch. The point cloud data are also transformed into range images through spherical projection, serving as input for the image feature learning branch. The point cloud branch employs a basic PointNet structure and several MLPs to generate multi-resolution features. The voxel and image branches utilize 3D sparse convolution and 2D convolution, respectively, and employ a U-Net structure for the encoding and decoding of each branch’s features, while simultaneously fusing the features of the three perspectives. Additionally, in the decoding stage, we apply residual connections to ensure that the information learned during the encoding stage is effectively transferred to the output. Finally, an innovative multi-perspective feature post-fusion module performs post-fusion of the features from the three branches, accurately restoring the semantic information of each point.
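To make this data flow concrete, the following is a minimal PyTorch-style sketch of the tri-branch wiring described above. It is illustrative only and not the authors' implementation: a dense 3D convolution stands in for the 3D sparse convolution, each branch is reduced to a single layer, and the attention-based fusion modules of Section 3.3 are replaced by a plain concatenation.

```python
# Illustrative sketch of the PVI-Net data flow (not the released implementation).
import torch
import torch.nn as nn

class TriBranchSketch(nn.Module):
    def __init__(self, in_ch=4, feat=32, n_cls=20):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(in_ch, feat), nn.ReLU())
        self.voxel_conv = nn.Conv3d(in_ch, feat, 3, padding=1)      # stand-in for 3D sparse conv
        self.range_conv = nn.Conv2d(in_ch + 1, feat, 3, padding=1)  # assumes an extra range channel
        self.head = nn.Linear(3 * feat, n_cls)

    def forward(self, pts_feat, vox_grid, rng_img, vox_idx, pix_idx):
        # pts_feat: (N, C) per-point features; vox_grid: (1, C, L, H, W) voxelized features;
        # rng_img: (1, C+1, Hi, Wi) range image; vox_idx: (N, 3) long voxel index per point;
        # pix_idx: (N, 2) long (u, v) pixel index per point.
        f_p = self.point_mlp(pts_feat)                       # point branch
        f_v = self.voxel_conv(vox_grid)[0]                   # voxel branch, (feat, L, H, W)
        f_i = self.range_conv(rng_img)[0]                    # image branch, (feat, Hi, Wi)
        # Gather voxel/image features back to every point via the stored indices.
        f_v_pts = f_v[:, vox_idx[:, 0], vox_idx[:, 1], vox_idx[:, 2]].t()
        f_i_pts = f_i[:, pix_idx[:, 1], pix_idx[:, 0]].t()
        fused = torch.cat([f_p, f_v_pts, f_i_pts], dim=1)    # simple concat in place of the
        return self.head(fused)                              # attention-based fusion modules
```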

3.2. Tri-Branch Feature Learning

3.2.1. Point Cloud Feature Extraction Branch

In the point cloud branch of PVI-Net, the input is an unordered set of points $P = \{p_P^i\}_{i=1}^{N}$, where each point $p_P^i \in \mathbb{R}^C$ includes the coordinates $c_P^i = [x_i, y_i, z_i]$ and the point cloud features. Using MLPs directly to extract features in the point cloud branch avoids the high computational load and memory consumption caused by searching for neighboring relationships, thereby enabling efficient processing of large-scale data and simplifying the network structure. Each point in the point cloud is processed individually with an MLP, which effectively extracts and learns the features of each point, and can be represented as follows:
$$F_P^l = \begin{cases} \mathrm{MLP}(P), & l = 1 \\ \mathrm{MLP}(F_P^{l-1}) + F_P^{l-1}, & l > 1 \end{cases}$$
where $l$ denotes the layer of the MLP, and $F_P^l$ represents the features extracted via the MLP at layer $l$. The point cloud feature extraction involves processing each point in the point cloud individually. MLP layers, including linear transformations and nonlinear activations, allow the network to learn complex patterns in the data. This process is crucial for capturing the complex geometric details of the point cloud, and these features are subsequently integrated with the voxel and range image branches through the fusion process.
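As a concrete reading of the residual per-point MLP above, the following PyTorch-style sketch (our own simplification; the channel widths and layer count are illustrative assumptions, not taken from the paper) processes every point independently and adds the previous layer's features from the second layer onward.

```python
import torch
import torch.nn as nn

class PointBranch(nn.Module):
    """Per-point MLP stack with residual connections, following the equation above.

    A sketch only: widths and depth are illustrative choices.
    """
    def __init__(self, in_ch: int = 4, feat: int = 64, n_layers: int = 3):
        super().__init__()
        self.first = nn.Sequential(nn.Linear(in_ch, feat), nn.BatchNorm1d(feat), nn.ReLU())
        self.rest = nn.ModuleList([
            nn.Sequential(nn.Linear(feat, feat), nn.BatchNorm1d(feat), nn.ReLU())
            for _ in range(n_layers - 1)
        ])

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, in_ch) -- coordinates plus extra per-point attributes.
        f = self.first(points)            # l = 1: F_P^1 = MLP(P)
        for mlp in self.rest:             # l > 1: F_P^l = MLP(F_P^{l-1}) + F_P^{l-1}
            f = mlp(f) + f
        return f                          # (N, feat) per-point features
```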

3.2.2. Voxel Feature Extraction Branch

For the input point cloud $P = \{p_P^i\}_{i=1}^{N}$, a three-dimensional voxel grid covering the entire range of the point cloud is first defined. This grid consists of many small cubes (voxels), each with a fixed size. Furthermore, the point cloud data are mapped onto the three-dimensional voxel grid to obtain voxel features with a voxel resolution of $L_V \times H_V \times W_V$, denoted as $F_V \in \mathbb{R}^{L_V \times H_V \times W_V}$. The voxel index for each point is calculated based on its coordinates in three-dimensional space. For a point $p_P^i = (x, y, z)$ and a voxel grid in which each voxel’s size is $\Delta x \times \Delta y \times \Delta z$, the voxel index $(i, j, k)$ of point $p_P^i$ can be calculated as follows:
$$i = \left\lfloor \frac{x - x_{\min}}{\Delta x \times c_k} \right\rfloor, \quad j = \left\lfloor \frac{y - y_{\min}}{\Delta y \times c_k} \right\rfloor, \quad k = \left\lfloor \frac{z - z_{\min}}{\Delta z \times c_k} \right\rfloor$$
where $x_{\min}$, $y_{\min}$, and $z_{\min}$ are the minimum coordinate values of the voxel grid in each direction, $\lfloor\cdot\rfloor$ denotes the floor function, and $c_k$ is the downsampling stride of the 3D CNN. This approach ensures that each point in the point cloud is allocated to a corresponding voxel, establishing a mutual correspondence between points and voxels and facilitating feature interaction between the point cloud and the voxels. To avoid the memory overhead caused by empty voxels, we use 3D sparse convolution to downsample and encode the voxel features:
$$F_V^l = \begin{cases} \mathrm{SConv3D}(F_V), & l = 1 \\ \mathrm{SConv3D}(F_V^{l-1}) + F_V^{l-1}, & l > 1 \end{cases}$$
where $\mathrm{SConv3D}(\cdot)$ contains a 3D sparse convolution and an activation function, and $F_V^l$ represents the voxel features extracted via 3D sparse convolution at layer $l$. The 3D sparse convolutions downsample and encode the voxel features, and the feature maps from the three downsampling stages of the voxel branch are preserved; the voxel features are then upsampled in the decoder to restore the original voxel resolution.
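The voxel index computation above reduces to a floor division of shifted coordinates, as in the short NumPy sketch below. This is a sketch only: the 5 cm voxel edge length follows Section 4.2, the stride $c_k$ is left as a parameter, and out-of-range handling is omitted.

```python
import numpy as np

def voxel_indices(points_xyz: np.ndarray, voxel_size: float = 0.05, stride: int = 1):
    """Compute the voxel index (i, j, k) of every point via the floor formula above."""
    mins = points_xyz.min(axis=0)                       # (x_min, y_min, z_min)
    idx = np.floor((points_xyz - mins) / (voxel_size * stride)).astype(np.int64)
    return idx                                          # (N, 3) integer voxel indices

# Example: points that fall into the same voxel share an index, which is the
# correspondence used later to exchange features between points and voxels.
pts = np.array([[0.01, 0.02, 0.00],
                [0.04, 0.01, 0.03],
                [0.30, 0.10, 0.00]])
print(voxel_indices(pts))   # the first two points map to voxel (0, 0, 0)
```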

3.2.3. Image Feature Extraction Branch

The point cloud data are converted into range images through spherical projection, where the position of each point is mapped onto a two-dimensional plane. Given a three-dimensional point cloud $P = \{p_P^i\}_{i=1}^{N}$ with coordinates $(x_i, y_i, z_i)$ in the three-dimensional Cartesian coordinate system, the corresponding two-dimensional coordinates $[u_i, v_i]$ in the two-dimensional image $I \in \mathbb{R}^{H_I \times W_I \times C}$, with height $H_I$, width $W_I$, and channel dimension $C$, obtained through spherical projection, can be expressed as follows:
$$\begin{bmatrix} u_i \\ v_i \end{bmatrix} = \begin{bmatrix} \frac{1}{2}\left[1 - \arctan(y_i, x_i)\,\pi^{-1}\right] W_I \\ \left[1 - \left(\arcsin(z_i\, d^{-1}) + R_d\right) R^{-1}\right] H_I \end{bmatrix}$$
where $d = \sqrt{x_i^2 + y_i^2 + z_i^2}$ is the Euclidean distance from the point to the reference origin of the LiDAR coordinate system, i.e., the straight-line distance to the projection center. $R$ represents the vertical field of view of the LiDAR sensor, and $R_d$ is the lower boundary of the vertical field of view. Spherical projection is a non-bijective process in which each point $p_P^i$ in the point cloud maps to a pixel position in the projected image; because of the nature of this mapping, multiple three-dimensional points may correspond to the same pixel in the image, leading to a many-to-one mapping relationship.
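A minimal NumPy sketch of this spherical projection is given below. The vertical field-of-view bounds are typical 64-beam sensor values assumed here for illustration, not values stated in the paper; the returned integer coordinates are clipped to the image size, and the range $d$ can be stored as an extra image channel.

```python
import numpy as np

def spherical_projection(xyz, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project 3D points to range-image pixel coordinates (u, v), following the
    spherical-projection formula above. fov_up/fov_down play the role of the
    vertical field-of-view bounds (R, R_d); the chosen values are assumptions.
    """
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    d = np.sqrt(x**2 + y**2 + z**2) + 1e-8            # range to the sensor origin
    yaw = np.arctan2(y, x)                            # azimuth angle
    pitch = np.arcsin(z / d)                          # elevation angle
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down                           # total vertical field of view
    u = 0.5 * (1.0 - yaw / np.pi) * W                 # horizontal pixel coordinate
    v = (1.0 - (pitch - fov_down) / fov) * H          # vertical pixel coordinate
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)
    return u, v, d
```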
In the image feature extraction branch, convolutional operations are used to extract features from the two-dimensional image obtained through spherical projection, which can be represented as follows:
$$F_I^l = \begin{cases} \mathrm{Conv}(I), & l = 1 \\ \mathrm{Conv}(F_I^{l-1}) + F_I^{l-1}, & l > 1 \end{cases}$$
where $\mathrm{Conv}(\cdot)$ contains a 2D convolution and an activation function, and $F_I^l$ represents the image features extracted via 2D convolution at layer $l$. As in the voxel branch, the feature maps from the three downsampling stages of the image branch are preserved for the encoding–decoding process.

3.3. Multi-Perspective Feature Fusion

In the previous section, we introduced the projection scheme, which establishes corresponding index systems among the point, voxel, and range representations, together with the feature extraction process of the three branches. In this section, we construct the interactions between the point-, voxel-, and range-based representations.
The distinct characteristics and advantages of point clouds, voxels, and depth maps necessitate different fusion strategies, based on their properties and complementarity in fusion. Point cloud data are irregular, while voxels partition space into regular grids. This structural difference makes simple addition or concatenation fusion insufficient for capturing their complex relationships.
Therefore, we designed an adaptive point–voxel cross-attention feature interaction method to handle this irregularity and structural difference better. It computes the relationship between point cloud and voxel features, enabling more flexible weighting of these features and a more effective combination of their information. As shown in Figure 2,
$$f_{PV} = \sum_{k=1}^{K} \mathrm{MLP}\left[\left((f_V + f_P^k) + \delta\right) \odot f_P^k\right] + f_V$$
where $\mathrm{MLP}(\cdot)$ denotes a feature encoding function, $\odot$ represents element-wise multiplication, and $\delta$ is the positional encoding, defined as follows:
$$\delta = \mathrm{MLP}\left(\mathrm{Concat}\left[\sigma\left(p_P^k - \mu_c\right), \sigma\left(p_P^k\right)\right]\right)$$
where $p_P^k$ denotes the 3D coordinates of the $k$-th point, $\mu_c = \frac{1}{K}\sum_{i=1}^{K} p_P^i$ is the mean of the coordinates of all points projected to the voxel, $\sigma$ is a nonlinear activation function, and $\mathrm{Concat}(\cdot)$ denotes vector concatenation. This combines both relative and absolute position information, which is passed through the nonlinear activation and then concatenated as input to the MLP, capturing the spatial relationships of points in both local and global contexts.
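The following PyTorch-style sketch illustrates one reading of the two equations above for a single voxel and the $K$ points assigned to it; the layer widths, the choice of ReLU for $\sigma$, and the two-layer MLPs are our own assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PointVoxelCrossAttention(nn.Module):
    """Sketch of the point-voxel interaction above: the K points assigned to one voxel
    modulate the voxel feature through an MLP and a positional encoding.
    """
    def __init__(self, feat: int = 64):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(6, feat), nn.ReLU(), nn.Linear(feat, feat))
        self.fuse_mlp = nn.Sequential(nn.Linear(feat, feat), nn.ReLU(), nn.Linear(feat, feat))

    def forward(self, f_v: torch.Tensor, f_p: torch.Tensor, p_xyz: torch.Tensor) -> torch.Tensor:
        # f_v: (feat,) voxel feature; f_p: (K, feat) features of the K points in this
        # voxel; p_xyz: (K, 3) their coordinates.
        mu = p_xyz.mean(dim=0, keepdim=True)                         # centroid of the K points
        delta = self.pos_mlp(torch.cat([torch.relu(p_xyz - mu),
                                        torch.relu(p_xyz)], dim=1))  # (K, feat) positional term
        modulated = ((f_v.unsqueeze(0) + f_p) + delta) * f_p         # element-wise gating
        return self.fuse_mlp(modulated).sum(dim=0) + f_v             # aggregated, residual to f_v
```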

3.3.1. MF-Attention Feature Fusion Module

We process the point cloud data by mapping them to a two-dimensional image. In this process, multiple points in the point cloud may map to the same pixel position. To consider information comprehensively from both the point and image perspectives and dynamically balance their contributions, we designed the MF-Attention feature fusion module. Suppose a set of points $\{p_P^k\}_{k=1}^{K}$ in the point cloud maps to a pixel $P_I$ in the two-dimensional image; then, each point $p_P^k$ in the set has a corresponding feature vector $f_P^k$, and the pixel $P_I$ has a feature vector $f_I$. The goal of MF-Attention fusion is to update the point features $f_P^k$ to reflect their relationship with the corresponding pixel feature $f_I$. First, we calculate the attention weights between the point cloud features and the image features:
$$A_{PI} = \mathrm{Softmax}\left(\frac{f_I W_q \times \left(f_P^k W_k\right)^{T}}{\sqrt{d_k}}\right)$$
where $W_q$ and $W_k$ are learnable weight matrices that transform the mapped features into the attention computation space, and $d_k$ is the dimension of the key vectors. Employing the scaling factor $\sqrt{d_k}$ helps preserve the numerical stability of the attention mechanism. The final MF-Attention fusion feature is then represented as:
$$f_{PI} = \mathrm{Concat}\left(A_{PI} \times f_I,\; A_{PI}^{T} \times f_P^k\right)$$
where $\mathrm{Concat}(\cdot)$ is used to concatenate features. The point–image attention fusion mechanism provides an effective way to synthesize and utilize information from point clouds and images, enabling the model to discover and leverage their inherent connections when processing multi-perspective data. This method is particularly useful in combining point cloud and image data for semantic prediction of point clouds.
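A compact sketch of this attention computation for a single pixel and the $K$ points projected onto it is shown below (PyTorch-style; the feature dimensions and the broadcasting used to return per-point fused features are illustrative assumptions, not the paper's exact layout).

```python
import torch
import torch.nn as nn

class MFAttention(nn.Module):
    """Sketch of MF-Attention: scaled dot-product attention between one pixel feature
    and the K point features that project onto it, followed by concatenation.
    """
    def __init__(self, d_img: int = 64, d_pt: int = 64, d_k: int = 64):
        super().__init__()
        self.w_q = nn.Linear(d_img, d_k, bias=False)   # projects the pixel feature (query)
        self.w_k = nn.Linear(d_pt, d_k, bias=False)    # projects the point features (keys)
        self.d_k = d_k

    def forward(self, f_i: torch.Tensor, f_p: torch.Tensor) -> torch.Tensor:
        # f_i: (1, d_img) pixel feature; f_p: (K, d_pt) features of the points on this pixel.
        attn = torch.softmax(self.w_q(f_i) @ self.w_k(f_p).t() / self.d_k ** 0.5, dim=-1)  # (1, K)
        img_side = attn @ f_p                     # (1, d_pt): attention-weighted point information
        pt_side = attn.t() * f_i                  # (K, d_img): per-point share of the pixel feature
        # Broadcast the pooled image-side term to every point and concatenate.
        return torch.cat([img_side.expand(f_p.size(0), -1), pt_side], dim=1)   # (K, d_pt + d_img)
```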

3.3.2. Multi-Perspective Feature Post-Fusion Module

We extract features from each branch and design a deep fusion method for the features of the three branches to enhance the feature representation ability of each branch, as shown in Figure 3. Furthermore, we post-fuse the final prediction results of the point cloud, voxel, and range image features to provide a richer and more comprehensive feature representation for the point cloud semantic segmentation task. For the final features obtained from the point cloud branch, $F_P \in \mathbb{R}^{N \times D}$, the voxel branch, $F_V \in \mathbb{R}^{L_V \times H_V \times W_V \times D}$, and the image branch, $F_I \in \mathbb{R}^{H_I \times W_I \times D}$, the corresponding semantic segmentation pseudo-probabilities are represented as follows:
$$O_P = \mathrm{Softmax}\left(\mathrm{MLP}(F_P)\right), \quad E_V = \mathrm{Softmax}\left(\mathrm{SConv3D}(F_V)\right), \quad E_I = \mathrm{Softmax}\left(\mathrm{Conv}(F_I)\right)$$
where $O_P \in \mathbb{R}^{N \times T}$, with $T$ representing the number of semantic categories. The voxel and image predictions $E_V \in \mathbb{R}^{L_V \times H_V \times W_V \times T}$ and $E_I \in \mathbb{R}^{H_I \times W_I \times T}$ are mapped back to the original point cloud positions according to the hash tables built during the voxelization and spherical projection processes:
$$E_V \rightarrow O_V, \quad E_I \rightarrow O_I$$
where $O_V \in \mathbb{R}^{N \times T}$ and $O_I \in \mathbb{R}^{N \times T}$. To associate global features, we weight the features of each branch globally, allowing the model to learn the key features of each perspective automatically. The weighted features of each branch are represented as follows:
$$G_P = \mathrm{MLP}\left(g\left[\mathrm{MaxPool}(F_P); \mathrm{AvgPool}(F_P)\right]\right), \quad G_V = \mathrm{MLP}\left(\mathrm{3DGAP}(F_V)\right), \quad G_I = \mathrm{MLP}\left(\mathrm{GAP}(F_I)\right)$$
where $g(\cdot)$ denotes a $(2, 1)$ linear mapping, $\mathrm{3DGAP}(\cdot)$ represents 3D global average pooling, and $\mathrm{GAP}(\cdot)$ represents global average pooling. Thus, the final fusion result is represented as follows:
$$Y = G_P O_P + G_V O_V + G_I O_I$$
Fusing the features of point clouds, voxels, and depth maps utilizes each perspective’s unique advantages to provide a more comprehensive and powerful data representation, thus achieving better performance in specific tasks.
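The post-fusion step can be sketched as follows (PyTorch-style, with illustrative sizes): each branch's pseudo-probabilities, already mapped back to the $N$ points, are weighted by a global gate computed from pooled branch features and then summed, mirroring the gating equations above. The gate layer shapes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PostFusion(nn.Module):
    """Sketch of the multi-perspective post-fusion: per-branch class scores are weighted
    by global gates learned from pooled branch features.
    """
    def __init__(self, feat: int = 64, n_cls: int = 20):
        super().__init__()
        self.gate_p = nn.Sequential(nn.Linear(2 * feat, feat), nn.ReLU(), nn.Linear(feat, 1))
        self.gate_v = nn.Sequential(nn.Linear(feat, feat), nn.ReLU(), nn.Linear(feat, 1))
        self.gate_i = nn.Sequential(nn.Linear(feat, feat), nn.ReLU(), nn.Linear(feat, 1))

    def forward(self, o_p, o_v, o_i, f_p, f_v, f_i):
        # o_p, o_v, o_i: (N, n_cls) per-point pseudo-probabilities of each branch,
        # already mapped back to the N points via the voxel / pixel index tables.
        # f_p: (N, feat) point features; f_v: (feat, L, H, W); f_i: (feat, Hi, Wi).
        g_p = self.gate_p(torch.cat([f_p.max(dim=0).values, f_p.mean(dim=0)]))  # global point gate
        g_v = self.gate_v(f_v.flatten(1).mean(dim=1))                           # 3D global average pool
        g_i = self.gate_i(f_i.flatten(1).mean(dim=1))                           # 2D global average pool
        return g_p * o_p + g_v * o_v + g_i * o_i                                # (N, n_cls) fused scores
```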

4. Experiments

In this section, we extensively explore the PVI-Net network and its application in autonomous driving. In Section 4.1, we provide a thorough introduction to the two key datasets used in our experiments—SemanticKITTI and nuScenes—elucidating their importance in network testing and evaluation. Following this, in Section 4.2, we delve into the various components of the PVI-Net architecture, detailing the key aspects and experimental settings of the network to ensure transparency and reproducibility in our experiments. In this section, to intuitively understand the impact of various indicators on network performance, we use “↓” and “↑” to denote that smaller or larger values of the indicators, respectively, lead to better network performance. Finally, in Section 4.3, we conduct a comprehensive performance evaluation of the PVI-Net model. In addition, we perform a series of ablation experiments to verify the superiority and effectiveness of the model in its key constituent steps.

4.1. Datasets

SemanticKITTI. The SemanticKITTI dataset, an extension of the KITTI Vision Benchmark Suite, is a leading dataset in the fields of autonomous driving and robotics vision. Its key feature is the provision of a large-scale, time-sequenced LiDAR scanning dataset, comprising over 4.5 billion finely annotated points distributed across more than 43,000 scans in 22 scene sequences, covering various road types and climatic conditions. The point clouds in the dataset are subdivided into 25 categories, and the training and test sets are composed of sequences 00 to 10 and 11 to 21, respectively, allowing researchers to train and evaluate their models across varied environments and verify an accurate understanding of their surroundings.
nuScenes. The nuScenes dataset, released by Aptiv Autonomous Mobility, is a widely used multi-perspective dataset in the field of autonomous driving research. It was collected in diverse urban environments in Boston and Singapore, providing rich information on roads, traffic, and climate conditions. This dataset combines data from six cameras, five radars, and one LiDAR, achieving 360-degree comprehensive environmental capture, greatly facilitating an in-depth understanding of complex scenes and supporting tasks such as object detection, tracking, and segmentation. nuScenes includes over 1 million precise 3D bounding box annotations, covering 23 different object categories, totaling 40,000 frames of high-quality data. These data are meticulously divided into 28,130 training samples, 6019 validation samples, and 6008 test samples, ensuring extensive training and evaluation coverage. Additionally, to enhance its applicability in real-world scenarios, the dataset specially optimized its category annotations, focusing on 16 primary categories for LiDAR semantic segmentation.

4.2. Implementation Details and Settings

Architecture Settings. As shown in Figure 1, we propose a multi-perspective point cloud segmentation network architecture. This architecture first converts the point cloud data into quantized voxels with a high resolution of 1600 × 1408 × 40 × 8. At the core of voxel processing, the backbone network employs 3D sparse convolution, generating feature maps of the voxel branch at four different scales with output dimensions of 32, 64, 128, and 256, respectively. Subsequently, these feature maps are upsampled by a decoder symmetric to the encoder to recover the voxel features. In our experiments, the voxel resolution is set to a 5 cm edge length for each voxel. For the image branch, when dealing with the SemanticKITTI dataset, the input range-image size is set to 64 × 2048. When handling the nuScenes dataset, the initial input range-image size of 32 × 2048 is later adjusted to 64 × 2048 to align with the dimensions of the SemanticKITTI dataset.
Training Strategies. In our experiments, we trained the model for 120 epochs using the Adam optimizer, with the initial learning rate set to 0.01. This process was conducted on a system equipped with 4× RTX 3090 GPUs, with the batch size set to 4. To prevent overfitting, we used data augmentation techniques, including GT-sampling technology and random flipping, rotation, and scaling, within the range of [0.95, 1.05]. During training, we also employed a cosine annealing strategy to adjust the learning rate and implemented global scaling and random rotation around the Z-axis as enhancement measures to increase data diversity and the model’s generalization capability.
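For reference, a minimal sketch of the optimizer and learning-rate schedule matching these settings (Adam, initial learning rate 0.01, 120 epochs, cosine annealing) is shown below; the model and data loader are placeholders, and the augmentation pipeline is assumed to live in the dataset implementation rather than here.

```python
import torch

def build_training(model: torch.nn.Module, epochs: int = 120, lr: float = 0.01):
    """Illustrative optimizer/scheduler setup for the training strategy described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# Typical epoch loop (loss computation and data loading omitted for brevity):
# for epoch in range(120):
#     for batch in train_loader:
#         ...
#     scheduler.step()
```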

4.3. Results

4.3.1. Evaluation on SemanticKITTI Dataset

In our research, we conducted comprehensive experiments on the newly proposed PVI-Net network using the SemanticKITTI dataset and compared it with some of the latest advanced methods, as shown in Table 1. The results show that PVI-Net achieved a significant improvement of over 10% in the mean intersection over union (mIoU) metric compared with previous classic single-perspective input networks (such as point-based, voxel-based, and image-based methods). In comparison with mixed-perspective methods, PVI-Net also exhibited the best mIoU performance. Notably, PVI-Net outperformed RPVNet by 0.6% in mIoU, highlighting the effectiveness and practical value of the cross-attention mechanism and the proposed MF-Attention multi-perspective fusion strategy used in our network compared with the direct averaging fusion approach of RPVNet.

4.3.2. Evaluation on nuScenes Dataset

For a comprehensive validation of our model’s robustness, we carried out a series of detailed experiments on the nuScenes dataset. As shown in Table 2, PVI-Net demonstrated exceptional performance, especially in the key metric of mIOU, where it surpassed other classic single-perspective and multi-perspective networks, achieving a leading position. This result further confirms the enormous potential of multi-perspective data fusion in the field of point cloud semantic segmentation. Notably, by combining point cloud and voxel data, our network effectively overcomes geometric distortions that may occur during point cloud projection, significantly enhancing the accuracy of point cloud segmentation. Moreover, Figure 4 presents the semantic segmentation visualization results of the PVI-Net network on the nuScenes dataset. These experimental results not only showcase the efficient performance of PVI-Net but also emphasize the importance of multi-perspective fusion in enhancing point cloud processing capabilities in complex environments.

4.4. Ablation Study

In this section, we delve into the key components of PVI-Net, conducting a series of ablation experiments to analyze the impact of each branch, the multi-perspective feature deep fusion modules, and the post-fusion module within the network. Additionally, we evaluate the computational efficiency and parameter count of PVI-Net under various branch combinations. All the aforementioned experiments are implemented on the SemanticKITTI dataset, and we showcase the test results of these methods on the validation part (sequence 08) of this dataset.

4.4.1. Impact of Different Perspectives on Network Performance

As shown in Table 3, we conducted a series of independent and interactive ablation experiments on the three different branches. Furthermore, we detailed the required parameter count and model inference speed for each ablation experiment network. For the sake of uniformity, all ablation experiments in Table 3 use the same hardware settings and batch sizes as the PVI-Net network experiments (see Section 4.2). Our experimental results clearly show that, compared with single-perspective inputs, multi-perspective inputs demonstrate better performance in segmentation tasks. Specifically, regarding the point cloud segmentation network’s interaction with multi-perspective features, we found that voxel features, as opposed to image features, provide a richer and more comprehensive feature supplement for the point cloud branch.

4.4.2. Impact of Multi-Perspective Feature Deep Fusion Modules

In Table 4, we present a series of ablation experiments on the key modules of the PVI-Net network, verifying their contributions in the process of deep feature fusion. In this table, a “✓” indicates that the corresponding module is enabled; where a module is not used, the network defaults to an averaging method for fusion. Through these experimental results, we observed that each module mentioned in the network positively impacted the model’s effectiveness.

4.4.3. Impact of Multi-Perspective Feature Post-Fusion Module

In Table 5, we specifically compare the multi-perspective feature post-fusion method used in our network with the common Addition (additive fusion) and Concatenation (concatenative fusion) methods. The experimental results show that, on the SemanticKITTI dataset, our fusion method improved the mIoU by 1.7% and 1.4% compared with the Addition and Concatenation methods, respectively. This outcome demonstrates that our fusion strategy more effectively integrates information from different sources when processing multi-perspective data, thereby enhancing the accuracy of semantic segmentation.

4.4.4. Multi-Perspective Fusion Addresses Challenges Encountered by Single-Perspective Methods

This paper enhances the understanding of complex 3D scenes by introducing a multi-view fusion approach, addressing the limitations of single-view methods that often miss crucial scene details due to occlusions, scale variations, and viewpoint dependencies. By integrating data from various perspectives, our multi-view fusion technique reconstructs obscured parts, mitigates scale discrepancies, and generates viewpoint-invariant features, leading to improved feature completeness and classification accuracy. Although our initial model, PVI-Net, does not outperform the latest state-of-the-art models in accuracy, it validates the feasibility of multi-view fusion and offers a novel perspective for 3D scene comprehension.

5. Conclusions

In conclusion, PVI-Net stands as a testament to the innovative exploration of point cloud semantic segmentation, particularly within the realm of autonomous driving. Central to our framework is the strategic intra-modal fusion of three distinct representations of a singular point cloud dataset. This fusion, achieved through parallel processing branches, underscores our commitment to extracting a richer, more nuanced feature set from point cloud data. We introduced point cloud–voxel cross-attention and point–image multi-perspective feature fusion strategies, which are innovative approaches that enable effective information interaction between different perspectives, significantly optimizing the process of information fusion between perspectives. Additionally, PVI-Net employs a U-Net architecture and residual connections. These not only enhance the precision and efficiency of semantic segmentation but also present an innovative method for multi-perspective feature post-fusion. This effectively integrates information from different data sources, thereby improving the accuracy of semantic segmentation. Extensive experiments in autonomous driving scenarios confirm that PVI-Net demonstrates outstanding performance in point cloud semantic segmentation.

Author Contributions

Funding acquisition, C.L.; resources, C.L.; validation, J.M. and Z.F.; visualization, L.X.; writing—original draft, Z.W.; writing—review and editing, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially funded by the National Natural Science Foundation of China (Grant No. 62363025), the Science and Technology Program of Lanzhou (No. 2022-2-58), and the Key R&D Plan of the Science and Technology Plan of Gansu Province—Social Development Field Project (No. 23YFFA0064).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yan, X.; Zhan, H.; Zheng, C.; Gao, J.; Zhang, R.; Cui, S.; Li, Z. Let images give you more: Point cloud cross-modal training for shape analysis. Adv. Neural Inf. Process. Syst. 2022, 35, 32398–32411. [Google Scholar]
  2. Yan, X.; Gao, J.; Zheng, C.; Zheng, C.; Zhang, R.; Cui, S.; Li, Z. 2dpass: 2d priors assisted semantic segmentation on lidar point clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 677–695. [Google Scholar]
  3. Wei, Y.; Zhao, L.; Zheng, W.; Zhu, Z.; Zhou, J.; Lu, J. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 4–6 October 2023; pp. 21729–21740. [Google Scholar]
  4. Ottonelli, S.; Spagnolo, P.; Mazzeo, P.L.; Leo, M. Improved video segmentation with color and depth using a stereo camera. In Proceedings of the IEEE International Conference on Industrial Technology 2013, Cape Town, South Africa, 25–28 February 2013; pp. 1134–1139. [Google Scholar]
  5. Zhang, Z.; Yang, B.; Wang, B.; Li, B. GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 17619–17629. [Google Scholar]
  6. Xia, Y.; Gladkova, M.; Wang, R.; Li, Q.; Stilla, U.; Henriques, J.F.; Cremers, D. CASSPR: Cross Attention Single Scan Place Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 4–6 October 2023; pp. 8461–8472. [Google Scholar]
  7. Fan, S.; Dong, Q.; Zhu, F.; Lv, Y.; Ye, P.; Wang, F.Y. SCF-Net: Learning spatial contextual features for large-scale point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 14504–14513. [Google Scholar]
  8. Li, L.; He, L.; Gao, J.; Han, X. Psnet: Fast data structuring for hierarchical deep learning on point cloud. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6835–6849. [Google Scholar] [CrossRef]
  9. Nie, D.; Lan, R.; Wang, L.; Ren, X. Pyramid architecture for multi-scale processing in point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 17284–17294. [Google Scholar]
  10. Phan, A.V.; Le Nguyen, M.; Nguyen, Y.L.H.; Bui, L.T. Dgcnn: A convolutional neural network over large-scale labeled graphs. Neural Netw. 2018, 108, 533–543. [Google Scholar] [CrossRef] [PubMed]
  11. Yuan, W.; Gu, X.; Li, H.; Dong, Z.; Zhu, S. Monocular Scene Reconstruction with 3D SDF Transformers. arXiv 2023, arXiv:2301.13510. [Google Scholar]
  12. Cui, M.; Long, J.; Feng, M.; Li, B.; Kai, H. OctFormer: Efficient octree-based transformer for point cloud compression with local enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence 2023, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 470–478. [Google Scholar]
  13. Fei, J.; Chen, W.; Heidenreich, P.; Wirges, S.; Stiller, C. SemanticVoxels: Sequential fusion for 3D pedestrian detection using LiDAR point cloud and semantic segmentation. In Proceedings of the 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Virtual, 14–16 September 2020; pp. 185–190. [Google Scholar]
  14. Park, C.; Jeong, Y.; Cho, M.; Park, J. Efficient Point Transformer for Large-Scale 3D Scene Understanding. Available online: https://openreview.net/forum?id=3SUToIxuIT3 (accessed on 1 January 2024).
  15. Wang, H.; Shi, C.; Shi, S.; Lei, M.; Wang, S.; He, D.; Schiele, B.; Wang, L. Dsvt: Dynamic sparse voxel transformer with rotated sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 13520–13529. [Google Scholar]
  16. Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. Rangenet++: Fast and accurate lidar semantic segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4213–4220. [Google Scholar]
  17. Wu, B.; Wan, A.; Yue, X.; Keutzer, K. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1887–1893. [Google Scholar]
  18. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar]
  19. Zhang, Z.; Shen, Y.; Li, H.; Zhao, X.; Yang, M.; Tan, W.; Pu, S.; Mao, H. Maff-net: Filter false positive for 3d vehicle detection with multi-modal adaptive feature fusion. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 369–376. [Google Scholar]
  20. Afham, M.; Dissanayake, I.; Dissanayake, D.; Dharmasiri, A.; Thilakarathna, K.; Rodrigo, R. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 9902–9912. [Google Scholar]
  21. Chen, A.; Zhang, K.; Zhang, R.; Wang, Z.; Lu, Y.; Guo, Y.; Zhang, S. Pimae: Point cloud and image interactive masked autoencoders for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 5291–5301. [Google Scholar]
  22. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  23. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
  24. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  25. Xu, C.; Wu, B.; Wang, Z.; Zhan, W.; Vajda, P.; Keutzer, K.; Tomizuka, M. Squeezesegv3: Spatially-adaptive convolution for efficient point-cloud segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 1–19. [Google Scholar]
  26. Cortinhal, T.; Tzelepis, G.; Erdal Aksoy, E. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds. In Proceedings of the Advances in Visual Computing: 15th International Symposium, ISVC 2020, San Diego, CA, USA, 5–7 October 2020; pp. 207–222. [Google Scholar]
  27. Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. Polarnet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 9601–9610. [Google Scholar]
  28. Choy, C.; Gwak, J.; Savarese, S. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 3075–3084. [Google Scholar]
  29. Zhou, H.; Zhu, X.; Song, X.; Ma, Y.; Wang, Z.; Li, H.; Lin, D. Cylinder3d: An effective 3d framework for driving-scene lidar semantic segmentation. arXiv 2020, arXiv:2008.01550. [Google Scholar]
  30. Cheng, R.; Razani, R.; Taghavi, E.; Li, E.; Liu, B. 2-s3net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 12547–12556. [Google Scholar]
  31. Zhang, F.; Fang, J.; Wah, B.; Torr, P. Deep fusionnet for point cloud semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 644–663. [Google Scholar]
  32. Gerdzhev, M.; Razani, R.; Taghavi, E.; Bingbing, L. Tornado-net: Multiview total variation semantic segmentation with diamond inception module. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 9543–9549. [Google Scholar]
  33. Liong, V.E.; Nguyen, T.N.T.; Widjaja, S.; Sharma, D.; Chong, Z.J. Amvnet: Assertion-based multi-view fusion network for lidar semantic segmentation. arXiv 2020, arXiv:2012.04934. [Google Scholar]
  34. Axelsson, M.; Holmberg, M.; Serra, S.; Ovren, H.; Tulldahl, M. Semantic labeling of lidar point clouds for UAV applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 4314–4321. [Google Scholar]
  35. Liu, Z.; Tang, H.; Zhao, S.; Shao, K.; Han, S. Pvnas: 3d neural architecture search with point-voxel convolution. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 8552–8568. [Google Scholar] [CrossRef] [PubMed]
  36. Xu, J.; Zhang, R.; Dou, J.; Zhu, Y.; Sun, J.; Pu, S. Rpvnet: A deep and efficient range-point-voxel fusion network for lidar point cloud segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 16024–16033. [Google Scholar]
Figure 1. PVI-Net. Point cloud–voxel–image fusion point cloud semantic segmentation network structure diagram.
Figure 2. Multi-perspective feature deep fusion structure.
Figure 3. Multi-perspective feature fusion module.
Figure 4. A visual comparison of the results from the model on the nuScenes dataset.
Table 1. Experimental results of the model on the SemanticKITTI dataset. To compare the performance of different models clearly, we divide the compared models into four groups based on the type of input data: point-based input, image-based input, voxel-based input, and mixed-view input. In the table, we specifically highlight the highest mIoU score in each category in red and the second highest score in blue.

| Methods | Data | mIoU (%) ↑ | Car | Bicycle | Motorcycle | Truck | Other-Vehicle | Person | Bicyclist | Motorcyclist | Road | Parking | Sidewalk | Other-Ground | Building | Fence | Vegetation | Trunk | Terrain | Pole | Traffic-Sign |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointNet [22] | Point | 14.6 | 46.3 | 1.3 | 0.3 | 0.1 | 0.8 | 0.2 | 0.2 | 0.0 | 61.6 | 15.8 | 35.7 | 1.4 | 41.4 | 12.9 | 31.0 | 4.6 | 17.6 | 2.4 | 3.7 |
| RandLANet [23] | Point | 53.9 | 94.2 | 26.0 | 25.8 | 40.1 | 38.9 | 49.2 | 48.2 | 7.2 | 90.2 | 60.3 | 73.7 | 20.4 | 86.9 | 56.3 | 81.4 | 61.3 | 66.8 | 49.2 | 47.7 |
| KPConv [24] | Point | 58.8 | 96.0 | 30.2 | 42.5 | 33.4 | 44.3 | 61.5 | 61.6 | 11.8 | 88.8 | 61.3 | 72.7 | 31.6 | 90.5 | 64.2 | 84.8 | 69.2 | 69.1 | 56.4 | 47.4 |
| SqueezeSegv3 [25] | Range | 55.9 | 92.5 | 38.7 | 36.5 | 29.6 | 33.0 | 45.6 | 46.2 | 20.1 | 91.7 | 63.4 | 74.8 | 26.4 | 89.0 | 59.4 | 82.0 | 58.7 | 65.4 | 49.6 | 58.9 |
| RangeNet++ [16] | Range | 52.2 | 91.4 | 25.7 | 34.4 | 25.7 | 23.0 | 38.3 | 38.8 | 4.8 | 91.8 | 65.0 | 75.2 | 27.8 | 87.4 | 58.6 | 80.5 | 55.1 | 64.6 | 47.9 | 55.9 |
| SalsaNext [26] | Range | 59.5 | 91.9 | 48.3 | 38.6 | 38.9 | 31.9 | 60.2 | 59.2 | 19.4 | 91.7 | 63.7 | 75.8 | 29.1 | 90.2 | 64.2 | 81.8 | 63.6 | 66.5 | 54.3 | 62.1 |
| PolarNet [27] | Voxel | 54.3 | 93.8 | 40.3 | 30.1 | 22.9 | 28.5 | 43.2 | 40.2 | 5.6 | 90.8 | 61.7 | 74.4 | 21.7 | 90.0 | 61.3 | 84.0 | 65.5 | 67.8 | 51.8 | 57.5 |
| MinkowskiNet [28] | Voxel | 63.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Cylinder3D [29] | Voxel | 67.8 | 97.1 | 67.6 | 64.0 | 59.0 | 58.6 | 73.9 | 67.9 | 36.0 | 91.4 | 65.1 | 75.5 | 32.3 | 91.0 | 66.5 | 85.4 | 71.8 | 68.5 | 62.6 | 65.6 |
| AF2S3 [30] | Voxel | 69.7 | 94.5 | 65.4 | 86.8 | 39.2 | 41.1 | 80.7 | 80.4 | 74.3 | 91.3 | 68.8 | 72.5 | 53.5 | 87.9 | 63.2 | 70.2 | 68.5 | 53.7 | 61.5 | 71.0 |
| FusionNet [31] | Fusion | 61.3 | 95.3 | 47.5 | 37.7 | 41.8 | 34.5 | 59.5 | 56.8 | 11.9 | 91.8 | 68.8 | 77.1 | 30.8 | 92.5 | 69.4 | 84.5 | 69.8 | 68.5 | 60.4 | 66.5 |
| TornadoNet [32] | Fusion | 63.1 | 94.2 | 55.7 | 48.1 | 40.0 | 38.2 | 63.6 | 60.1 | 34.9 | 89.7 | 66.3 | 74.5 | 28.7 | 91.3 | 65.6 | 85.6 | 67.0 | 71.5 | 58.0 | 65.9 |
| AMVNet [33] | Fusion | 65.3 | 96.2 | 59.9 | 54.2 | 48.8 | 45.7 | 71.0 | 65.7 | 11.0 | 90.1 | 71.0 | 75.8 | 32.4 | 91.4 | 69.1 | 85.6 | 67.0 | 71.5 | 58.0 | 65.9 |
| SPVCNN [34] | Fusion | 63.8 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| PVNAS [35] | Fusion | 67.0 | 97.2 | 50.6 | 50.4 | 56.6 | 58.0 | 67.4 | 67.1 | 50.3 | 90.2 | 67.6 | 75.4 | 21.8 | 91.6 | 66.9 | 86.1 | 73.4 | 71.0 | 64.3 | 67.3 |
| RPVNet [36] | Fusion | 70.3 | 97.6 | 68.4 | 68.7 | 44.2 | 61.1 | 75.9 | 74.4 | 73.4 | 93.4 | 70.3 | 80.7 | 33.3 | 93.5 | 70.2 | 86.5 | 75.1 | 71.7 | 64.8 | 61.4 |
| PVI-Net | Fusion | 70.9 | 97.4 | 67.2 | 68.9 | 43.7 | 61.5 | 76.6 | 75.0 | 73.6 | 92.3 | 71.2 | 80.1 | 32.8 | 92.6 | 70.8 | 86.9 | 74.5 | 72.5 | 64.8 | 62.5 |
Table 2. Experimental data on PVI-Net for the nuScenes dataset. We highlight the highest score in red and the second-highest score in blue.

| Methods | Data | mIoU (%) ↑ | Barrier | Bicycle | Bus | Car | Construction | Motorcycle | Pedestrian | Traffic-Cone | Trailer | Truck | Driveable | Other-Flat | Sidewalk | Terrain | Manmade | Vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RangeNet++ [16] | Range | 65.5 | 66.0 | 21.3 | 77.2 | 80.9 | 30.2 | 66.8 | 69.6 | 52.1 | 54.2 | 72.3 | 94.1 | 66.6 | 63.5 | 70.1 | 83.1 | 79.8 |
| PolarNet [27] | Voxel | 71.0 | 74.7 | 28.2 | 85.3 | 90.9 | 35.1 | 77.5 | 71.3 | 58.8 | 57.4 | 76.1 | 96.5 | 71.1 | 74.7 | 74.0 | 87.3 | 85.7 |
| SalsaNext [26] | Range | 72.2 | 74.8 | 34.1 | 85.9 | 88.4 | 42.2 | 72.4 | 72.2 | 63.1 | 61.3 | 76.5 | 96.0 | 70.8 | 71.2 | 71.5 | 86.7 | 84.4 |
| AMVNet [33] | Fusion | 76.1 | 79.8 | 32.4 | 82.2 | 86.4 | 62.5 | 81.9 | 75.3 | 72.3 | 83.5 | 65.1 | 97.4 | 67.0 | 78.8 | 74.6 | 90.8 | 87.4 |
| Cylinder3D [29] | Voxel | 76.1 | 76.4 | 40.3 | 91.2 | 92.8 | 51.3 | 78.0 | 78.9 | 64.9 | 62.1 | 84.4 | 96.8 | 71.6 | 76.4 | 75.4 | 90.5 | 87.4 |
| RPVNet [36] | Fusion | 77.6 | 78.2 | 43.4 | 92.7 | 93.2 | 49.0 | 85.7 | 80.5 | 66.0 | 66.9 | 84.0 | 96.9 | 73.5 | 75.9 | 76.0 | 90.6 | 88.9 |
| PVI-Net | Fusion | 78.1 | 78.8 | 43.8 | 93.5 | 93.1 | 48.6 | 87.0 | 80.4 | 65.9 | 67.5 | 85.1 | 97.0 | 74.5 | 75.8 | 76.4 | 90.6 | 89.0 |
Table 3. Impact of different perspectives on network performance.

| View | mIoU (%) ↑ | Params (M) ↓ | Latency (ms) ↓ |
|---|---|---|---|
| Point | 15.3 | 0.065 | 13.8 |
| Voxel | 65.5 | 23.3 | 97.6 |
| Image | 50.8 | 3.36 | 23.2 |
| Point+Voxel | 68.1 | 24.8 | 125.4 |
| Point+Image | 56.8 | 3.32 | 41.3 |
| Point+Voxel+Image | 70.9 | 28.2 | 158.7 |
Table 4. Impact of the multi-perspective feature deep fusion modules on network performance.

| PVC Attention | MF-Attention | Skip Connection | mIoU (%) ↑ |
68.8
69.6
70.9
Table 5. Impact of the multi-perspective feature post-fusion module on network performance.

| Method | mIoU (%) ↑ |
|---|---|
| Addition | 69.2 |
| Concatenation | 69.5 |
| Our fusion | 70.9 |
