Article

FARVNet: A Fast and Accurate Range-View-Based Method for Semantic Segmentation of Point Clouds

1 College of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
2 South West Institute of Technical Physics, Chengdu 610041, China
3 The Key Laboratory of Flight Techniques and Flight Safety, Civil Aviation Flight University of China, Guanghan 618307, China
4 Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610213, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(9), 2697; https://doi.org/10.3390/s25092697
Submission received: 12 March 2025 / Revised: 13 April 2025 / Accepted: 19 April 2025 / Published: 24 April 2025
(This article belongs to the Section Radar Sensors)

Abstract

Environmental perception systems provide foundational geospatial intelligence for precision mapping applications. Light Detection and Ranging (LiDAR) provides critical 3D point cloud data for environmental perception systems, yet efficiently processing unstructured point clouds while extracting semantically meaningful information remains a persistent challenge. This paper presents FARVNet, a novel real-time Range-View (RV)-based semantic segmentation framework that explicitly models the intrinsic correlation between intensity features and spatial coordinates to enhance feature representation in point cloud analysis. Our architecture introduces three key innovations: First, the Geometric Field of View Reconstruction (GFVR) module rectifies spatial distortions and compensates for structural degradation induced during the spherical projection of 3D LiDAR point clouds onto 2D range images. Second, the Intensity Reconstruction (IR) module is employed to update the “Intensity Vanishing State” for zero-intensity points, including those from LiDAR acquisition limitations, thus enhancing the learning ability and robustness of the network. Third, the Adaptive Multi-Scale Feature Fusion (AMSFF) is applied to balance high-frequency and low-frequency features, augmenting the model expressiveness and generalization ability. Experimental evaluations demonstrate that FARVNet achieves state-of-the-art performance in single-sensor real-time segmentation tasks while maintaining computational efficiency suitable for environmental perception systems. Our method ensures high performance while balancing real-time capability, making it highly promising for LiDAR-based real-time applications.

1. Introduction

With the rapid development of technologies such as autonomous driving, robot navigation, and unmanned aerial vehicle (UAV) cruising, semantic segmentation has become an indispensable core task in the field of environmental perception [1,2,3,4]. It facilitates timely processing of environmental information and rapid response. Real-time performance and high accuracy are crucial requirements in environmental perception, particularly in complex and dynamically changing scenarios [5]. Balancing fast inference with high-accuracy semantic segmentation remains a significant challenge [6,7].
As a critical component of environmental perception systems, LiDAR provides essential 3D spatial information [8,9]. The massive, unordered point cloud generated by LiDAR—which uses multi-angle laser beams to measure distances and return signal intensities from target objects—contains rich 3D scene information [10]. Nevertheless, extracting meaningful features and deeper insights from this complex dataset remains a significant challenge. The most significant challenges are threefold: (1) The complexity of the acquired data is exacerbated by variations in acquisition conditions across different devices [8,9,11], resulting in inconsistencies in data quality and significantly increasing the difficulty of data processing and analysis. (2) Real-time, efficient acquisition of discriminative features from 3D spatial data is highly challenging due to the inherent sparsity, unordered nature, and presence of indistinguishable noise [12,13], significantly increasing the complexity of feature learning. (3) Extracting surface features from the spatial relationships within the point cloud is further complicated by the loss of depth information and the complexities introduced by occlusion [14,15]. Therefore, reliable point cloud processing is crucial for practical point cloud applications [16].
Recent research in LiDAR point cloud semantic segmentation has explored a trade-off between accuracy and real-time performance [5]. Methods aiming for high accuracy often utilize more input data (multi-view images, LiDAR point clouds, and color information) and complex, deep network architectures [17,18]. Conversely, methods prioritizing real-time processing employ simpler architectures and computational optimizations to achieve faster speeds, often at the cost of some accuracy. Compared to 3D point cloud processing, 2D feature extraction techniques have made significant progress in various fields [19]. Projecting 3D point cloud data onto a 2D plane significantly enhances computational efficiency and scalability, making it well-suited for real-time applications. However, this dimensionality reduction introduces a substantial trade-off: while computationally advantageous, the resulting loss of information leads to a significant decrease in semantic segmentation accuracy compared to methods utilizing the full 3D data.
Our research aims to design a LiDAR semantic segmentation method that balances efficient inference and high-accuracy prediction, enhancing the adaptability of point cloud perception while improving the robustness of 3D point cloud data and intra-class segmentation consistency in perception scenarios. In summary, our contributions are as follows:
  • This work develops a range-view architecture for real-time point cloud semantic segmentation that achieves competitive accuracy with hardware-efficient computation, resolving latency-sensitive precision trade-offs in resource-constrained 3D perception deployments.
  • A GFVR module has been implemented to effectively address feature misalignment caused by the projection of 3D space onto a 2D plane.
  • An IR module is proposed to update the zero intensity points in the intensity vanishing state, thus mitigating the loss of accuracy due to intensity vanishing.
  • The superiority of FARVNet is demonstrated by its accuracy and efficiency on popular benchmark datasets, achieving a balance between high accuracy and real-time performance on both SemanticKITTI [8] and nuScenes [9].

2. Related Work

The widespread adoption of semantic segmentation in real-world applications has motivated the development of numerous methods [5,10]. Current mainstream methods primarily include point-based [12,13], voxel-based [20,21], multi-view fusion [22,23], and multi-modal data fusion approaches [17,18,24], all of which have shown significant progress, as shown in Table 1.

2.1. Point-Based Methods

Point-based methods that directly process unstructured point clouds face challenges in accurately modeling local geometric structures and capturing long-range contextual dependencies [26,27]. PointNet [12], a seminal contribution to this field, pioneered the application of Multi-Layer Perceptrons (MLPs) [28] to extract global features directly from points. Subsequent research achieved enhanced performance by integrating global features with local feature representations, proposing innovative architectures based on point convolution, graph convolution, and attention mechanisms. RSNet [29] leverages a lightweight Local Dependency Module (LDM) to achieve efficient modeling of local structures within point clouds. To reduce computational complexity in large-scale semantic segmentation, RandLA-Net [13] employs random sampling to decrease point cloud processing time. ECC [30] utilizes spatial graph networks to capture point cloud features and adaptively learns the edge features between neighboring points. DGCNN [31] performs dynamic edge convolution within the neighborhood of each point, and HDGCN [32] uses both depth-wise and pointwise convolutions to extract local and global features. AGCN [33] integrates Self-Attention with Graph Convolutional Networks (GCNs) to model local point relationships and introduces a global point graph to capture the relative positional information of each point. However, constrained by the sheer number of points, point-based methods are still unable to process large-scale point clouds in real time.

2.2. Voxel-Based Methods

These methods typically treat voxelized point clouds as 3D grids, applying 3D Convolution Neural Networks (CNNs) to extract features from the volumetric representation. ShapeNets [34] employs a convolutional deep belief network to model the probabilistic distribution of 3D shapes, thereby constructing a hierarchical representation in a bottom-up fashion. The computational complexity of this method scales cubically with voxel size, posing a significant challenge for processing high-resolution point clouds. Specifically, to preserve fine geometric details, smaller voxels are required, which in turn increase computational overhead, whereas larger voxels result in a loss of detail. OctNet [35] utilizes octree decomposition to effectively handle sparsely distributed point clouds, reducing both data size and computational complexity, while facilitating a rapid nearest-neighbor search. PointGrid [36] utilizes a combined point and grid representation, employing a fixed number of points to extract geometric detail features and improve model efficiency. However, these methods still demand significant computational and memory resources, rendering them unsuitable for resource-constrained edge devices.

2.3. Multi-Modal Data Fusion Methods

These methods integrate data from multiple modalities and diverse data types, yielding more comprehensive feature information than single sources. TransFusion [37] seamlessly integrates images and point clouds to fully leverage spatial structural information. UniSeg [18] fuses voxel, view, and image features to fully utilize multi-modal semantic information. 2DPASS [24] employs auxiliary modality fusion and multi-scale feature fusion to extract richer semantic structural information from multi-modal data. Real-time multi-modal data fusion faces significant challenges due to the complexity and volume of data from diverse sources [20]. These sources often have varying formats, update rates, and sensor characteristics, creating integration difficulties and leading to processing delays and computational bottlenecks. This hinders the ability to meet real-time demands in dynamic environments.

2.4. Multi-View Fusion Methods

Multi-view fusion methods leverage the complementary information obtained from different viewpoints of the same point cloud data, for example, Bird's-Eye View (BEV) and RV, to improve model accuracy and robustness in semantic segmentation [23]. MVCNN [25] leverages a multi-view CNN architecture trained on multiple rendered 2D images, employing max pooling to aggregate a global feature representation from these diverse perspectives. AMVNet [22] extracts point cloud features from the RV and BEV branches separately; for points with uncertain predictions, neighborhood features are fused to predict the final category. CFNet [23] uses Center Focusing Feature Encoding (CFFE) to mitigate viewpoint misalignment. Overall, features from different viewpoint branches require additional precise alignment and fusion, which introduces latency and hinders real-time performance.

2.5. Range-View Methods

RV is widely used in semantic segmentation and point cloud upsampling due to its effectiveness and real-time capabilities [38,39,40]. TFNet [41] leverages temporal sequences to address the “many-to-one” projection issue and integrates a voting mechanism to correct misclassified labels. However, the voting strategy struggles to distinguish boundary regions. FRNet [5], which directly utilizes raw intensity, fails to effectively identify regions with intensity vanishing. FIDNet [42] uses bilinear interpolation to fuse multi-resolution features, but it fails to distinguish complex spatial relationships.
Compared to state-of-the-art techniques, RV often exhibits suboptimal performance, primarily due to the inherent challenges in learning 3D spatial features from projected range images, which are prone to various forms of interference. (1) The 3D point cloud data are sparse and unordered [43,44]: limitations in the original range image, such as distortions, occlusions, and incomplete feature representations, directly compromise the network backbone capacity for accurate spatial feature extraction. To address this, we propose the GFVR module, which enhances image quality and compensates for performance degradation caused by point cloud misalignment. (2) Intensity measurements are susceptible to significant noise and variation: as the distance between the sensor and the target object increases, intensity generally exhibits an attenuating trend accompanied by fluctuations. This attenuation is a consequence of laser beam propagation losses, and intensity is further modulated by various factors, including target surface reflectivity and material properties. Accurate discrimination between intensity-dropout states and normal point clouds during range image projection is critical for enhancing system robustness. We address this limitation by introducing an IR module, which corrects pseudo-intensity-dropout artifacts effectively. (3) Image boundaries are blurred: RV-based methods typically generate 2D feature maps with poorly defined object boundaries, resulting in intra-class inconsistencies in the segmentation outcomes. The application of AMSFF learns the balance between latent high-level and low-level features, enhancing the system’s robustness.

3. Method

This section provides a detailed explanation of FARVNet (Section 3.1). An overview of the proposed FARVNet architecture is depicted in Figure 1.

3.1. Overview

As shown in Figure 1, the network architecture is primarily divided into three parts: feature encoding, which deeply integrates 3D spatial coordinates and intensity features via the IR and GFVR modules; feature extraction, which extracts multi-scale features; and Adaptive Multi-Scale Feature Fusion, which integrates high-dimensional and low-dimensional features.

3.1.1. Feature Encoding

It is noteworthy that, with the development of hardware devices, a growing diversity of LiDAR acquisition devices has emerged, leading to increasingly rich data [11,45]. These devices exhibit variations in key parameters such as the number of laser beams, effective range, field of view (FOV), and intensity distribution. Beyond the inherent differences in LiDAR sensor specifications, point cloud scene acquisition is also influenced by the platform on which the sensor is mounted. For instance, increasing platform altitude necessitates adjustments to the FOV and azimuth angle to effectively capture surrounding environmental information. In addition, projection accuracy is affected by positional offsets, as shown in Figure 2, which lead to the clustering of point clouds and the formation of numerous voids, as shown at the top of Figure 3. The detailed formulation can be found in Appendix A.
We employ GFVR that adjusts for the distance discrepancies between the 3D point cloud and the true origin, as shown in Equation (1):
$$\mathrm{GFVR} = \begin{cases} \Delta = \mathrm{Offset}\left(\arcsin\left(P_n^z, (P_n^d)^{-1}\right), \phi, \mathrm{Envs}\right) \\ \tilde{P}_n^d = \sqrt{(P_n^x)^2 + (P_n^y)^2 + (P_n^z + \Delta)^2} \\ \tilde{v}_n = \left[1 - \left(\arcsin\left(P_n^z + \Delta, (\tilde{P}_n^d)^{-1}\right) + \tilde{\varphi}_{down}\right)\tilde{f}^{-1}\right] H \\ u_n = \frac{1}{2}\left[1 - \arctan\left(P_n^y, P_n^x\right)\pi^{-1}\right] W \end{cases} \tag{1}$$
The $\mathrm{Offset}$ function calculates the vertical offset for each laser beam's point cloud using weighting parameters; $\phi$ denotes the FOV angle range of the LiDAR device; $\mathrm{Envs}$ denotes the LiDAR sensor's own height and its spatial position relative to the ground; and $(P_n^x, P_n^y, P_n^z)$ denotes the 3D spatial coordinates of a point. Constraints within the FOV, applied via $\tilde{\varphi}_{down}$ and $\tilde{f}^{-1}$, simultaneously prevent feature loss, as shown at the bottom of Figure 3. First, the point cloud is partitioned into H regions. Then, the overall offset for each region is calculated through the regional partitioning. To maintain the continuity and integrity of the point cloud scene, the offsets are smoothed to eliminate potential discontinuities or abrupt changes.
GFVR partitions the point cloud scene $P$ into $K$ projection spaces, $P = \{P_1, P_2, \ldots, P_K\}$, $P_k \in \mathbb{R}^{m \times c}$, with the projection performed using Equation (1). Each projection space contains a varying number of points $m$, and each point has the same feature dimensionality $c$:
$$\{P_1, P_2, \ldots, P_K\} = \mathrm{GFVR}(P) \tag{2}$$
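For concreteness, the following is a minimal NumPy sketch of the range projection with a per-beam vertical correction in the spirit of Equation (1). The `beam_offsets` argument is a hypothetical stand-in for the $\mathrm{Offset}(\cdot)$ term; its exact weighting and the regional smoothing used by FARVNet are not reproduced here.

```python
import numpy as np

def gfvr_project(points, fov_up_deg, fov_down_deg, H, W, beam_offsets=None):
    """Project an (N, 4) point cloud (x, y, z, intensity) onto an H x W range image,
    optionally adding a per-beam vertical offset as a simplified stand-in for the
    Offset(.) term of Equation (1)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    fov_down = np.deg2rad(abs(fov_down_deg))
    fov = np.deg2rad(fov_up_deg) + fov_down

    depth = np.linalg.norm(points[:, :3], axis=1) + 1e-8
    row = ((1.0 - (np.arcsin(z / depth) + fov_down) / fov) * H).astype(np.int64)
    row = np.clip(row, 0, H - 1)

    if beam_offsets is not None:               # hypothetical per-beam correction (meters)
        z = z + beam_offsets[row]
        depth = np.sqrt(x**2 + y**2 + z**2) + 1e-8
        row = ((1.0 - (np.arcsin(z / depth) + fov_down) / fov) * H).astype(np.int64)
        row = np.clip(row, 0, H - 1)

    col = (0.5 * (1.0 - np.arctan2(y, x) / np.pi) * W).astype(np.int64)
    col = np.clip(col, 0, W - 1)

    range_image = np.full((H, W), -1.0, dtype=np.float32)
    range_image[row, col] = depth              # later points overwrite earlier ones
    return range_image, row, col
```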
Point cloud intensity $I \in [0, 1]$ reflects the ability of the object surface to reflect laser pulses and is closely related to factors such as the material of the object, surface roughness, and relative orientation. However, in practice, there is a special phenomenon of intensity drop-out, represented by an intensity value of zero. This typically indicates that the LiDAR system failed to receive a reflected signal from the target object. Potential causes include the target object's material properties, surface roughness, the distance between the object and the LiDAR, the angle of incidence, LiDAR positioning errors, or even equipment malfunction, as illustrated in Figure 4.
Specifically, to prevent the neural network from overfitting to these outliers, an outlier detection mechanism is employed. The point cloud scene is partitioned into projection spaces based on GFVR, and this clustering process incorporates both distance and intensity awareness, as shown in Figure 5. Mean pooling is then applied to the intensity values of the point cloud data within each projected spatial region, as shown in Equation (3).
$$IR = \begin{cases} \dfrac{\sum_{i=0}^{n} I_i}{\sum_{i=0}^{n} \mathbf{1}\{I_i \neq 0\}} & \sum_{i=0}^{n} \mathbf{1}\{I_i \neq 0\} \geq \sum_{i=0}^{n} \mathbf{1}\{I_i = 0\} \\ \min_{\{i \leq n:\, I_i \neq 0\}} I_i & \sum_{i=0}^{n} \mathbf{1}\{I_i \neq 0\} < \sum_{i=0}^{n} \mathbf{1}\{I_i = 0\} \\ 0 & \sum_{i=0}^{n} \mathbf{1}\{I_i = 0\} = n \end{cases} \tag{3}$$
$$\{\bar{P}_1, \bar{P}_2, \ldots, \bar{P}_K\} = IR(P_1, P_2, \ldots, P_K) \tag{4}$$
where $\sum$ denotes summation, $I_i$ represents the intensity value of a single point in the projection space, and $\mathbf{1}\{\cdot\}$ is an indicator function that equals 1 when the bracketed condition holds and 0 otherwise.
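A minimal sketch of Equation (3) for a single projection space is given below. In this reading, zero-intensity ("intensity vanishing") points in the region are assigned the returned value while non-zero points keep their original intensities; that application rule is an assumption consistent with the description above rather than the authors' exact implementation.

```python
import numpy as np

def intensity_reconstruction(intensities):
    """Apply Equation (3) to the intensity values of one projection region and
    return the value used to replace zero-intensity points in that region."""
    intensities = np.asarray(intensities, dtype=np.float32)
    nonzero = intensities[intensities != 0]
    n_nonzero = nonzero.size
    n_zero = intensities.size - n_nonzero

    if n_nonzero == 0:           # every point lost its return: keep zero
        return 0.0
    if n_nonzero >= n_zero:      # valid returns dominate: use their mean
        return float(nonzero.mean())
    return float(nonzero.min())  # zeros dominate: fall back to the smallest valid value
```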
Next, global features $F_p \in \mathbb{R}^{m \times C}$ are efficiently learned from the point cloud using an MLP [28]. Feature dimensionality reduction and max pooling are then applied to each projection space, yielding projection space features $F_{ps} \in \mathbb{R}^{1 \times D}$, which are shared by all points within a projection space. These features are finally projected to 2D range image features $F_{ri} \in \mathbb{R}^{B \times D \times H \times W}$. Here, $D$ represents the output dimensionality of the MLP, and $B$ represents the batch size:
$$\begin{aligned} F_p &= \mathrm{MLP}(\{\bar{P}_1, \bar{P}_2, \ldots, \bar{P}_K\}) \\ F_{ps} &= \mathrm{MaxPooling}(F_p) \\ F_{ri} &= \mathrm{Project}(F_{ps}) \end{aligned} \tag{5}$$
The input feature dimensionality is expanded from 4 to 10, $\bar{P}_k \in \mathbb{R}^{m \times 10}$. The expanded features comprise the initial 3D coordinates and intensity of each point, $(P_n^x, P_n^y, P_n^z, P_n^I)$; the vector difference, Manhattan distance $D_d$, and intensity change between each point in the projection space and the virtual center of that space, $(P_n^x - \bar{P}_m^x, P_n^y - \bar{P}_m^y, P_n^z - \bar{P}_m^z, D_d, P_n^I - \bar{P}_m^I)$; and the depth of each point, $P_n^d$.
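The following sketch illustrates one plausible construction of the 10-dimensional per-point feature, assuming the virtual center of a projection space is the mean of its points and interpreting $D_d$ as the Manhattan distance to that center; FARVNet's exact definitions may differ.

```python
import numpy as np

def expand_features(region_points):
    """Expand an (m, 4) projection-space region (x, y, z, intensity) into the
    10-D per-point features described above (illustrative sketch)."""
    xyz, inten = region_points[:, :3], region_points[:, 3:4]
    center = region_points.mean(axis=0, keepdims=True)         # assumed virtual center
    diff = region_points - center                               # (dx, dy, dz, dI)
    manhattan = np.abs(diff[:, :3]).sum(axis=1, keepdims=True)  # Manhattan distance D_d
    depth = np.linalg.norm(xyz, axis=1, keepdims=True)          # per-point depth P_n^d
    return np.concatenate(
        [xyz, inten, diff[:, :3], manhattan, diff[:, 3:4], depth], axis=1)  # (m, 10)
```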

3.1.2. Feature Extraction

The framework utilizes cascaded residual convolution blocks to construct a pyramidal 2D backbone that hierarchically generates multi-scale feature representations. At each feature extraction stage, per-point features within the projection space are iteratively updated. These updated per-point features then update the projection space features.
Feeding $F_{ri}^{i-1}$ into a residual network yields $F_{ri}^{i}$, a high-level feature representation capturing both local and global spatial contexts within the 2D image. This network is designed to extract multi-resolution features across different scales and increase channel depth for enhanced feature diversity, ultimately improving performance on complex tasks:
$$F_{ri}^{i} = \mathrm{ResBlock}(F_{ri}^{i-1}) \tag{6}$$
The range image constitutes a geometric abstraction of the 3D environment rather than an isomorphic scene representation. The naive application of 2D convolutional features for 3D semantic inference remains intrinsically constrained by representational disparity across dimensional domains. To maintain consistency between point and projection space features, feature vectors extracted from both point cloud and image spaces are concatenated. This fused feature vector is then input to an MLP for non-linear transformation and feature fusion:
$$\tilde{F}_p^{i} = \mathrm{MLP}\left(\mathrm{Inproject}(F_{ri}^{i}), F_p^{i-1}\right) \tag{7}$$
Then, the updated point features are projected back to the range image, and image features are fused using convolution. $\mathrm{RVFusion}$ comprises three main components: first, a 2D convolutional layer with a 3 × 3 kernel and a stride of 1 for local feature extraction; second, a fully connected layer that maps the extracted features to a specified dimensional space; and finally, an activation layer that further enhances the expressiveness of the features, thereby achieving efficient feature fusion:
$$\tilde{F}_{ri}^{i} = \mathrm{RVFusion}\left(\mathrm{Project}(\tilde{F}_p^{i}), F_{ri}^{i}\right) \tag{8}$$
Finally, range image features are updated using a residual attention weighting mechanism. This residual attention mechanism effectively strengthens important features while suppressing irrelevant information, improving feature expressiveness:
$$F_{ri}^{i} = \tilde{F}_{ri}^{i} \times \mathrm{Attention}(\tilde{F}_{ri}^{i}) + F_{ri}^{i} \tag{9}$$
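To make the data flow of Equations (6)–(9) concrete, the PyTorch sketch below implements one stage for a batch size of 1, using index gather/scatter as a simplified Inproject/Project and illustrative layer choices for ResBlock, RVFusion, and the attention branch; it is not the authors' exact implementation. Here `rows` and `cols` are the per-point pixel indices produced by the range projection.

```python
import torch
import torch.nn as nn

class StageFusion(nn.Module):
    """One feature-extraction stage sketching Equations (6)-(9): ResBlock on the
    range image, point/image fusion via an MLP, RVFusion back onto the image,
    and a residual attention update. Layer sizes are illustrative."""
    def __init__(self, c):
        super().__init__()
        self.resblock = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))
        self.point_mlp = nn.Sequential(nn.Linear(2 * c, c), nn.ReLU(), nn.Linear(c, c))
        self.rv_fusion = nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU())
        self.attn = nn.Sequential(nn.Conv2d(c, c, 1), nn.Sigmoid())

    def forward(self, f_ri, f_p, rows, cols):
        # Eq. (6): residual convolution block on the range image.
        f_ri = torch.relu(self.resblock(f_ri) + f_ri)
        # Eq. (7): gather image features at each point's pixel, fuse with point features.
        img_at_pts = f_ri[0, :, rows, cols].t()          # (N, C); batch size 1 for brevity
        f_p = self.point_mlp(torch.cat([img_at_pts, f_p], dim=1))
        # Eq. (8): scatter point features back and fuse with the image features.
        proj = torch.zeros_like(f_ri)
        proj[0, :, rows, cols] = f_p.t()
        f_ri_tilde = self.rv_fusion(torch.cat([proj, f_ri], dim=1))
        # Eq. (9): residual attention weighting.
        f_ri = f_ri_tilde * self.attn(f_ri_tilde) + f_ri
        return f_ri, f_p
```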

3.1.3. Adaptive Multi-Scale Feature Fusion Module

Inspired by FreqFusion [46], which performs cross-resolution feature aggregation via frequency-aware attention mechanisms in hierarchical neural representations, our AMSFF module employs an Adaptive Low-Pass Filter (ALPF) generator, an offset generator, and an Adaptive High-Pass Filter (AHPF) generator to fuse high-level and low-level features. The ALPF generator predicts spatially varying low-pass filters to attenuate high-frequency components within objects, reducing intra-class inconsistencies during upsampling. The offset generator refines large inconsistencies and thin boundaries by replacing inconsistent features with more consistent ones through resampling. The AHPF generator enhances the high-frequency detail and boundary information lost during the downsampling process:
$$\begin{aligned} \tilde{Y}_{u,v}^{i+1} &= \mathrm{UpSampling}\left(\mathrm{ALPF}(Y_{u,v}^{i+1})\right) \\ \tilde{X}_{u,v}^{i} &= \mathrm{AHPF}(X_{u,v}^{i}) + X_{u,v}^{i} \\ Y_{u,v}^{i} &= \tilde{Y}_{u+\alpha, v+\beta}^{i+1} + \tilde{X}_{u,v}^{i} \end{aligned} \tag{10}$$
where $Y_{u,v}^{i+1} \in \mathbb{R}^{D \times h \times w}$ and $X_{u,v}^{i} \in \mathbb{R}^{D \times 2h \times 2w}$ represent the feature generated by the backbone and the fused feature at the corresponding levels, respectively. Offset values $(\alpha, \beta)$ are predicted by the offset generator for the feature located at $(u, v)$.
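As a simplified illustration of Equation (10), the sketch below replaces the learned, spatially varying filters with a fixed 3 × 3 box blur (low-pass), uses the identity-minus-blur residue as the high-pass branch, and omits the predicted offsets $(\alpha, \beta)$; the actual AMSFF module generates all of these adaptively.

```python
import torch
import torch.nn.functional as F

def amsff_step(y_high, x_low):
    """Illustrative version of Equation (10) with fixed filters: a 3x3 box blur as
    the low-pass filter, (identity - blur) as the high-pass residue, and zero offsets."""
    c = y_high.shape[1]
    box = torch.full((c, 1, 3, 3), 1.0 / 9.0, device=y_high.device)

    # ALPF + upsampling of the coarse, high-level feature Y^{i+1}.
    y_lp = F.conv2d(y_high, box, padding=1, groups=c)
    y_up = F.interpolate(y_lp, scale_factor=2, mode="bilinear", align_corners=False)

    # AHPF on the fine, low-level feature X^i: add back its high-frequency residue.
    x_lp = F.conv2d(x_low, box, padding=1, groups=c)
    x_hp = x_low + (x_low - x_lp)

    # Fusion (offsets (alpha, beta) omitted in this sketch).
    return y_up + x_hp

# Example: fuse a half-resolution feature map into the finer level.
y = torch.randn(1, 64, 32, 256)   # Y^{i+1}: D x h x w
x = torch.randn(1, 64, 64, 512)   # X^{i}:   D x 2h x 2w
fused = amsff_step(y, x)          # -> (1, 64, 64, 512)
```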

4. Experiments

This section demonstrates the robustness of FARVNet as well as the balance between accuracy and efficiency. We begin by describing the benchmark datasets and hyperparameters. Experimental results demonstrate that our method achieves an optimal balance between efficiency and accuracy, and we further validate its performance and efficiency on specialized devices. Finally, ablation studies are conducted to analyze the contribution of each component.

4.1. Datasets

We conducted comprehensive evaluations on two widely used datasets. SemanticKITTI [8] is a large-scale outdoor autonomous driving LiDAR dataset collected in Karlsruhe, Germany, using a Velodyne HDL-64E LiDAR sensor. It consists of 22 sequences, with Sequences 0–7 and 9–10 containing point clouds and labels for training, Sequence 8 for validation, and Sequences 11–21 for online testing. Each scene contains approximately 120 thousand points. The original data uses 28 classes to categorize the entire scene, with a vertical field of view of [−25, 3] degrees. nuScenes [9] is a widely used autonomous driving dataset collected using a 32-beam LiDAR sensor. This dataset contains 1000 driving scenes with dense point clouds, annotated with 32 classes and evaluated using 16 semantic classes, with a vertical field of view of [−30, 10] degrees.
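For reference, a single SemanticKITTI scan can be read as follows (the standard dataset format stores float32 (x, y, z, intensity) quadruples in the .bin file and uint32 labels whose lower 16 bits encode the semantic class); this loader is illustrative and independent of the authors' data pipeline.

```python
import numpy as np

def load_semantickitti_scan(bin_path, label_path=None):
    """Load one SemanticKITTI scan: points as (N, 4) float32 (x, y, z, intensity)
    and, if provided, semantic labels from the lower 16 bits of the uint32 labels."""
    points = np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)
    labels = None
    if label_path is not None:
        raw = np.fromfile(label_path, dtype=np.uint32)
        labels = raw & 0xFFFF          # semantic class ids (upper 16 bits: instance id)
    return points, labels
```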

4.2. Implementation Details

Our FARVNet framework is built upon the widely adopted MMDetection3D [47] platform. The architecture employs a customized ResNet34 [19] variant as its 2D backbone network, with feature map dimensions specifically configured for different datasets: 64 × 512 pixels for SemanticKITTI [8] and 32 × 480 pixels for nuScenes [9]. The optimization strategy utilizes the AdamW [48] optimizer initialized with a base learning rate of 0.001, complemented by the OneCycle [49] learning rate policy to achieve adaptive rate scheduling throughout the training phase. All experiments were conducted with a mini-batch size of 4 to balance computational efficiency and memory constraints.
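A minimal PyTorch sketch of this optimization setup is shown below; the backbone stand-in, epoch count, steps per epoch, and weight decay are placeholders rather than the paper's exact configuration.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Conv2d(10, 64, 3, padding=1)      # stand-in for the FARVNet backbone
epochs, steps_per_epoch = 50, 4000                 # assumed values, not from the paper

optimizer = AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
scheduler = OneCycleLR(optimizer, max_lr=0.001,
                       epochs=epochs, steps_per_epoch=steps_per_epoch)

for _ in range(3):                                       # per training iteration:
    loss = model(torch.randn(4, 10, 64, 512)).mean()     # mini-batch size of 4
    loss.backward()
    optimizer.step()                                     # update the weights...
    optimizer.zero_grad()
    scheduler.step()                                     # ...then advance the OneCycle schedule
```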

5. Results

5.1. Quantitative Results

Table 2 and Table 3 present a comparison of our method with existing state-of-the-art methods on the SemanticKITTI [8] and nuScenes [9] val sets. The results demonstrate that FARVNet significantly outperforms previous approaches on most metrics, exhibiting superior performance. Specifically, on the SemanticKITTI [8] val set, our method surpasses the recently proposed FRNet [5] by +1.0 mIoU. On the nuScenes [9] val set, our method outperforms the recently proposed WaffleAndRange [50] by +0.7 mIoU.
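The mIoU scores reported here follow the standard intersection-over-union definition; a minimal implementation over flattened predictions and labels is sketched below (this is the generic metric, not the authors' evaluation script, and the ignore index is an assumption).

```python
import numpy as np

def compute_miou(pred, gt, num_classes, ignore_index=255):
    """Per-class IoU and mIoU from flattened integer predictions and labels,
    computed via the confusion matrix."""
    mask = gt != ignore_index
    pred, gt = pred[mask], gt[mask]
    conf = np.bincount(num_classes * gt + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)
    return iou, float(iou[union > 0].mean())   # average over classes that appear
```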

5.2. Qualitative Results

Visualization examples of real-time semantic segmentation on SemanticKITTI [8] Sequence 08, based on RV and high-performance LiDAR, are provided. Figure 6 shows the Ground Truth (GT) and segmentation results from several methods. Our method exhibits the fewest mis-segmented regions, demonstrating higher accuracy and robustness. This further validates the effectiveness of our method in handling complex scenes, enabling better identification and distinction between object classes. Figure 7 clearly illustrates that existing real-time semantic segmentation methods struggle with sparse boundary regions, resulting in incomplete and mis-segmented areas, such as sidewalks and vegetation. In contrast, the proposed FARVNet method demonstrates the most faithful segmentation alignment with ground truth values. Furthermore, our method shows a significant advantage in handling low-intensity regions. As shown in Figure 8, our method demonstrates strong competitiveness in both long-distance and short-distance scenarios.

5.3. Runtime and Model Parameters

Runtime was also evaluated using the SemanticKITTI [8] val set. The runtime for all methods was measured on a single NVIDIA RTX 4090 GPU without any acceleration techniques, as shown in Figure 9. Our method is end-to-end and requires no additional computation time. Both the GFVR and IR have linear time complexity, ensuring efficient processing. The inference time is 38.5 ms, 1.98 times faster than SphereFormer [61] and 4 times faster than WaffleIron [62]. Meanwhile, our method adopts a more lightweight design, maintaining high performance while having fewer parameters, and the inference process requires only approximately 1500 MB of GPU memory.
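A typical way to measure single-scan latency of this kind is sketched below, synchronizing the GPU so asynchronous CUDA kernels are fully counted; it reflects common practice (and assumes a CUDA device) rather than the authors' exact timing script.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, sample, warmup=20, iters=100):
    """Return the average inference time in milliseconds for one input sample."""
    model.eval()
    for _ in range(warmup):                 # warm up kernels and allocator
        model(sample)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(sample)
    torch.cuda.synchronize()                # wait for all queued GPU work
    return (time.perf_counter() - start) / iters * 1000.0
```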

5.4. Ablation Study

Ablative analyses were conducted on the SemanticKITTI [8] and nuScenes [9] val sets with the above experimental setup. As shown in Table 4, we validated the effectiveness of the MSFF, GFVR, IR, and TTA modules by incorporating different combinations of modules and comparing quantitative metrics with the baseline model. In particular, the MSFF, GFVR, and IR modules improve mIoU by 1.93% and mAcc by 1.89% on SemanticKITTI [8], and mIoU by 1.18% and mAcc by 0.5% on nuScenes [9], compared to the baseline. The underlying reason for this performance gain lies in the higher-quality range image and stronger pseudo-noise recognition. A further study of the impact of different IR algorithms on accuracy is shown in Table 5. The mean pooling operation exhibits significant advantages in feature representation over the alternatives.

5.5. Failure Cases

Although FARVNet outperforms state-of-the-art methods in multiple metrics, its semantic segmentation accuracy remains limited in specific scenarios. As illustrated in Figure 10, the model struggles to effectively distinguish low-lying vegetation from uneven ground surfaces, resulting in misclassification between the “vegetation” and “terrain” labels.

6. Discussion

As quantitatively demonstrated through the comparative analysis in Table 2 and Table 3, the conventional projection formula assumes point clouds originating from a single coordinate system, utilizing 3D spatial coordinates to compute azimuth and elevation angles. However, since laser beams capturing scene information are not emitted from identical points, the displacement discrepancy between laser beams and physical objects generates extensive void pixels in the range image. The proposed FARVNet method utilizes the GFVR module to mitigate laser beam displacement deviations, thereby improving range image quality and enabling more comprehensive scene feature learning, as shown in Figure 6.
FRNet [5] and WaffleIron [62] achieve multi-scale feature fusion solely through simple linear interpolation upsampling operations. Lu et al. [68] found that simple linear interpolation results in over-smoothing, leading to boundary displacement and significantly impacting the quality of the inverse mapping from the range view to 3D space, as shown in the experimental results in Figure 7. In contrast, the AMSFF module balances high-frequency and low-frequency features through adaptive frequency perception, reducing intra-class inconsistencies in multi-scale fusion, effectively distinguishing intra-class features, maintaining intra-class consistency, and avoiding incomplete segmentation.
Figure 8 shows the impact of intensity on the model’s accuracy. The current mainstream methods recognize the importance of intensity, but they do not consider its instability. The intensity drop-out phenomenon presents a significant challenge for point cloud data processing, as network models struggle to accurately distinguish these anomalous intensity values from valid data. Specifically, the model does not effectively differentiate noise from meaningful signals, leading to overfitting, particularly in point cloud scenes containing substantial noise. Therefore, the IR module effectively manages these outliers and mitigates their negative impact on network training, which is essential for enhancing model performance.
In sparse scenes, severe loss of spatial geometric information caused by projection—particularly at large-area sparse junctions of terrain—results in chaotic label segmentation, as illustrated in Figure 10.

7. Considerations for Future Work

This study proposes a novel range-image-based point cloud semantic segmentation method that enhances range image quality, deeply explores the intrinsic correlation between spatial coordinates and intensity information, balances intra-class inconsistency between high-dimensional and low-dimensional features, and improves model accuracy and robustness.
A potential limitation of our method is its reliance on the spatial coordinates and intensity of the point cloud, which limits its performance under sparse point cloud conditions. Although the method shows significant improvement on 64-line and 32-line datasets, it still exhibits drawbacks in sparse regions at long distances. Future work will further investigate the network’s performance in sparse and challenging environments, with a focus on enhancing accuracy in distant sparse regions through long-range dependency modeling.

Author Contributions

Conceptualization, C.C. and W.G. (Wenyi Ge); methodology, C.C.; software, L.Z.; validation, L.Z. and W.G. (Wenwu Guo); formal analysis, C.C.; investigation, X.Y.; resources, J.H.; data curation, S.T.; writing—original draft preparation, C.C.; writing—review and editing, C.C.; visualization, Z.Y. and W.G. (Wenwu Guo); supervision, W.G. (Wenyi Ge); project administration, S.W.; funding acquisition, W.G. (Wenyi Ge) and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R&D Project of the Sichuan Provincial Department of Science and Technology—Research on Three-Dimensional Multi-Resolution Intelligent Map Construction Technology (2024YFG0009), the Intelligent Identification and Assessment for Disaster Scenes: Key Technology Research and Application Demonstration (2025YFN0008), the Project of the Sichuan Provincial Department of Science and Technology—Application and Demonstration of Intelligent Fusion Processing of Laser Imaging Radar Data (2024ZHCG0176), the "Juyuan Xingchuan" Project of Central Universities and Research Institutes in Sichuan—High-Resolution Multi-Wavelength Lidar System and Large-Scale Industry Application (2024ZHCG0190), the Sichuan Science and Technology Program, Research on Simulator Three-Dimensional View Modeling Technology and Database Matching and Upgrading Methods, and the Key Laboratory of Civil Aviation Flight Technology and Flight Safety (FZ2022KF08).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The SemanticKITTI can be found at https://semantic-kitti.org/. The nuScenes can be found at https://www.nuscenes.org/.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Appendix A

Specifically, given a LiDAR point cloud, each point $P_n \in \mathbb{R}^{1 \times 4}$ is represented by Cartesian coordinates $(P_n^x, P_n^y, P_n^z)$ and intensity $P_n^I$. Points are rasterized into a projection $R(u, v)$ of size $H \times W$ based on azimuth angles $\theta$ and elevation–depression angles $\phi$. The rasterization formula is as follows:
$$\begin{bmatrix} u_n \\ v_n \end{bmatrix} = \begin{bmatrix} \frac{1}{2}\left[1 - \arctan\left(P_n^y, P_n^x\right)\pi^{-1}\right] W \\ \left[1 - \left(\arcsin\left(P_n^z, (P_n^d)^{-1}\right) + \varphi_{down}\right) f^{-1}\right] H \end{bmatrix} \tag{A1}$$
$$\begin{aligned} P_n^d &= \sqrt{(P_n^x)^2 + (P_n^y)^2 + (P_n^z)^2} \\ \varphi_{down} &= \phi_{down}/180 \times \pi \\ \varphi_{up} &= \phi_{up}/180 \times \pi \\ f &= \varphi_{down} + \varphi_{up} \end{aligned} \tag{A2}$$
Here, $(u_n, v_n)$ represents the grid coordinates in the 2D image to which point $P_n$ is projected; $\phi_{down}$ and $\phi_{up}$ represent the depression and elevation angles of the LiDAR device, respectively; $(H, W)$ gives the range image resolution, and $H$ is aligned with the number of LiDAR laser beams.
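As a small numeric check of Equations (A1) and (A2), assume the SemanticKITTI vertical FOV of [−25, 3] degrees and a 64 × 512 range image; the chosen point is purely illustrative.

```python
import numpy as np

# Equation (A2): convert the FOV limits to radians and form the total vertical FOV.
phi_down_deg, phi_up_deg = 25.0, 3.0
varphi_down = phi_down_deg / 180.0 * np.pi      # ~0.4363 rad
varphi_up = phi_up_deg / 180.0 * np.pi          # ~0.0524 rad
f = varphi_down + varphi_up                     # ~0.4887 rad
H, W = 64, 512

# Equation (A1): project a point at (x, y, z) = (10, 0, -1) m.
x, y, z = 10.0, 0.0, -1.0
d = np.sqrt(x**2 + y**2 + z**2)                 # ~10.05 m
u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * W  # = 256.0 (point straight ahead)
v = (1.0 - (np.arcsin(z / d) + varphi_down) / f) * H   # ~19.9 (slightly below the horizon rows)
print(round(u), round(v))                       # 256 20
```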

References

  1. Yang, B.; Pfreundschuh, P.; Siegwart, R.; Hutter, M.; Moghadam, P.; Patil, V. TULIP: Transformer for Upsampling of LiDAR Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15354–15364. [Google Scholar] [CrossRef]
  2. Yang, Y.; Wu, X.; He, T.; Zhao, H.; Liu, X. Sam3d: Segment anything in 3d scenes. arXiv 2023, arXiv:2306.03908. [Google Scholar] [CrossRef]
  3. Fan, L.; Xiong, X.; Wang, F.; Wang, N.; Zhang, Z. Rangedet: In defense of range view for lidar-based 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2918–2927. [Google Scholar] [CrossRef]
  4. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar] [CrossRef]
  5. Xu, X.; Kong, L.; Shuai, H.; Liu, Q. FRNet: Frustum-range networks for scalable LiDAR segmentation. arXiv 2023, arXiv:2312.04484. [Google Scholar] [CrossRef] [PubMed]
  6. Kong, L.; Liu, Y.; Chen, R.; Ma, Y.; Zhu, X.; Li, Y.; Hou, Y.; Qiao, Y.; Liu, Z. Rethinking range view representation for lidar segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 228–240. [Google Scholar] [CrossRef]
  7. Bai, Y.; Fei, B.; Liu, Y.; Ma, T.; Hou, Y.; Shi, B.; Li, Y. Rangeperception: Taming lidar range view for efficient and accurate 3d object detection. Adv. Neural Inf. Process. Syst. 2023, 36, 79557–79569. [Google Scholar]
  8. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307. [Google Scholar] [CrossRef]
  9. Fong, W.K.; Mohan, R.; Hurtado, J.V.; Zhou, L.; Caesar, H.; Beijbom, O.; Valada, A. Panoptic nuscenes: A large-scale benchmark for lidar panoptic segmentation and tracking. IEEE Robot. Autom. Lett. 2022, 7, 3795–3802. [Google Scholar] [CrossRef]
  10. Mao, Y.; Chen, K.; Diao, W.; Sun, X.; Lu, X.; Fu, K.; Weinmann, M. Beyond single receptive field: A receptive field fusion-and-stratification network for airborne laser scanning point cloud classification. ISPRS J. Photogramm. Remote Sens. 2022, 188, 45–61. [Google Scholar] [CrossRef]
  11. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar] [CrossRef]
  12. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar] [CrossRef]
  13. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar] [CrossRef]
  14. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep learning for 3d point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4338–4364. [Google Scholar] [CrossRef]
  15. Li, Y.; Ma, L.; Zhong, Z.; Liu, F.; Chapman, M.A.; Cao, D.; Li, J. Deep Learning for LiDAR Point Clouds in Autonomous Driving: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3412–3432. [Google Scholar] [CrossRef]
  16. Uecker, M.; Fleck, T.; Pflugfelder, M.; Zöllner, J.M. Analyzing deep learning representations of point clouds for real-time in-vehicle lidar perception. arXiv 2022, arXiv:2210.14612. [Google Scholar] [CrossRef]
  17. Krispel, G.; Opitz, M.; Waltner, G.; Possegger, H.; Bischof, H. Fuseseg: Lidar point cloud segmentation fusing multi-modal data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 1874–1883. [Google Scholar] [CrossRef]
  18. Liu, Y.; Chen, R.; Li, X.; Kong, L.; Yang, Y.; Xia, Z.; Bai, Y.; Zhu, X.; Ma, Y.; Li, Y.; et al. Uniseg: A unified multi-modal lidar segmentation network and the openpcseg codebase. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 21662–21673. [Google Scholar] [CrossRef]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  20. Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-voxel cnn for efficient 3d deep learning. Adv. Neural Inf. Process. Syst. 2019, 32, 965–975. [Google Scholar] [CrossRef]
  21. Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vis. 2023, 131, 531–551. [Google Scholar] [CrossRef]
  22. Liong, V.E.; Nguyen, T.N.T.; Widjaja, S.; Sharma, D.; Chong, Z.J. Amvnet: Assertion-based multi-view fusion network for lidar semantic segmentation. arXiv 2020, arXiv:2012.04934. [Google Scholar] [CrossRef]
  23. Li, X.; Zhang, G.; Wang, B.; Hu, Y.; Yin, B. Center Focusing Network for Real-Time LiDAR Panoptic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13425–13434. [Google Scholar] [CrossRef]
  24. Yan, X.; Gao, J.; Zheng, C.; Zheng, C.; Zhang, R.; Cui, S.; Li, Z. 2dpass: 2d priors assisted semantic segmentation on lidar point clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 20 October 2022; Springer: Cham, Switzerland, 2022; pp. 677–695. [Google Scholar]
  25. Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 945–953. [Google Scholar] [CrossRef]
  26. Liu, Y.; Fan, B.; Meng, G.; Lu, J.; Xiang, S.; Pan, C. Densepoint: Learning densely contextual representation for efficient point cloud processing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5239–5248. [Google Scholar] [CrossRef]
  27. Jiang, L.; Zhao, H.; Liu, S.; Shen, X.; Fu, C.W.; Jia, J. Hierarchical point-edge interaction network for point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10433–10441. [Google Scholar] [CrossRef]
  28. Zhang, W.; Yin, Z.; Sheng, Z.; Li, Y.; Ouyang, W.; Li, X.; Tao, Y.; Yang, Z.; Cui, B. Graph attention multi-layer perceptron. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 14 August 2022; pp. 4560–4570. [Google Scholar] [CrossRef]
  29. Huang, Q.; Wang, W.; Neumann, U. Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2626–2635. [Google Scholar] [CrossRef]
  30. Simonovsky, M.; Komodakis, N. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3693–3702. [Google Scholar] [CrossRef]
  31. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. 2019, 38, 146. [Google Scholar] [CrossRef]
  32. Liang, Z.; Yang, M.; Deng, L.; Wang, C.; Wang, B. Hierarchical depthwise graph convolutional neural network for 3D semantic segmentation of point clouds. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8152–8158. [Google Scholar] [CrossRef]
  33. Xie, Z.; Chen, J.; Peng, B. Point clouds learning with attention-based graph convolution networks. Neurocomputing 2020, 402, 245–255. [Google Scholar] [CrossRef]
  34. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1912–1920. [Google Scholar] [CrossRef]
  35. Riegler, G.; Osman Ulusoy, A.; Geiger, A. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3577–3586. [Google Scholar] [CrossRef]
  36. Le, T.; Duan, Y. Pointgrid: A deep network for 3d shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9204–9214. [Google Scholar] [CrossRef]
  37. Maiti, A.; Elberink, S.O.; Vosselman, G. TransFusion: Multi-modal fusion network for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–23 June 2023; pp. 6537–6547. [Google Scholar] [CrossRef]
  38. Eskandar, G.; Sudarsan, S.; Guirguis, K.; Palaniswamy, J.; Somashekar, B.; Yang, B. Hals: A height-aware lidar super-resolution framework for autonomous driving. arXiv 2022, arXiv:2202.03901. [Google Scholar] [CrossRef]
  39. Zhang, W.; Jiang, H.; Yang, Z.; Yamakawa, S.; Shimada, K.; Kara, L.B. Data-driven upsampling of point clouds. Comput.-Aided Des. 2019, 112, 1–13. [Google Scholar] [CrossRef]
  40. Yu, L.; Li, X.; Fu, C.W.; Cohen-Or, D.; Heng, P.A. Pu-net: Point cloud upsampling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2790–2799. [Google Scholar] [CrossRef]
  41. Li, R.; Li, S.; Chen, X.; Ma, T.; Gall, J.; Liang, J. Tfnet: Exploiting temporal cues for fast and accurate lidar semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 4547–4556. [Google Scholar] [CrossRef]
  42. Zhao, Y.; Bai, L.; Huang, X. Fidnet: Lidar point cloud semantic segmentation with fully interpolation decoding. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 4453–4458. [Google Scholar] [CrossRef]
  43. Zhao, W.; Liu, X.; Zhong, Z.; Jiang, J.; Gao, W.; Li, G.; Ji, X. Self-supervised arbitrary-scale point clouds upsampling via implicit neural representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1999–2007. [Google Scholar] [CrossRef]
  44. Ye, S.; Chen, D.; Han, S.; Wan, Z.; Liao, J. Meta-PU: An arbitrary-scale upsampling network for point cloud. IEEE Trans. Vis. Comput. Graph. 2021, 28, 3206–3218. [Google Scholar] [CrossRef]
  45. Menze, M.; Geiger, A. Object Scene Flow for Autonomous Vehicles. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar] [CrossRef]
  46. Chen, L.; Fu, Y.; Gu, L.; Yan, C.; Harada, T.; Huang, G. Frequency-aware feature fusion for dense image prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10763–10780. [Google Scholar] [CrossRef]
  47. MMDetection3D: OpenMMLab Next-Generation Platform for General 3D Object Detection. 2020. Available online: https://github.com/open-mmlab/mmdetection3d (accessed on 1 May 2024).
  48. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017. [Google Scholar] [CrossRef]
  49. Smith, L.N.; Topin, N. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. In Proceedings of the Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Bellingham, WA, USA, 10 May 2019. [Google Scholar] [CrossRef]
  50. Fusaro, D.; Mosco, S.; Menegatti, E.; Pretto, A. Exploiting Local Features and Range Images for Small Data Real-Time Point Cloud Semantic Segmentation. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 4980–4987. [Google Scholar] [CrossRef]
  51. Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. Rangenet++: Fast and accurate lidar semantic segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4213–4220. [Google Scholar] [CrossRef]
  52. Wu, B.; Zhou, X.; Zhao, S.; Yue, X.; Keutzer, K. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4376–4382. [Google Scholar] [CrossRef]
  53. Xu, C.; Wu, B.; Wang, Z.; Zhan, W.; Vajda, P.; Keutzer, K.; Tomizuka, M. Squeezesegv3: Spatially-adaptive convolution for efficient point-cloud segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 1–19. [Google Scholar] [CrossRef]
  54. Aksoy, E.E.; Baci, S.; Cavdar, S. Salsanet: Fast road and vehicle segmentation in lidar point clouds for autonomous driving. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 926–932. [Google Scholar] [CrossRef]
  55. Choy, C.; Gwak, J.; Savarese, S. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3075–3084. [Google Scholar] [CrossRef]
  56. Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching efficient 3d architectures with sparse point-voxel convolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 20 October 2022; Springer: Cham, Switzerland, 2022; pp. 685–702. [Google Scholar]
  57. Zhou, H.; Zhu, X.; Song, X.; Ma, Y.; Wang, Z.; Li, H.; Lin, D. Cylinder3d: An effective 3d framework for driving-scene lidar semantic segmentation. arXiv 2020, arXiv:2008.01550. [Google Scholar] [CrossRef]
  58. Zhuang, Z.; Li, R.; Jia, K.; Wang, Q.; Li, Y.; Tan, M. Perception-aware multi-sensor fusion for 3d lidar semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 16280–16290. [Google Scholar] [CrossRef]
  59. Ando, A.; Gidaris, S.; Bursuc, A.; Puy, G.; Boulch, A.; Marlet, R. Rangevit: Towards vision transformers for 3d semantic segmentation in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5240–5250. [Google Scholar] [CrossRef]
  60. Cheng, H.X.; Han, X.F.; Xiao, G.Q. Cenet: Toward concise and efficient lidar semantic segmentation for autonomous driving. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar] [CrossRef]
  61. Lai, X.; Chen, Y.; Lu, F.; Liu, J.; Jia, J. Spherical transformer for lidar-based 3d recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 July 2022; pp. 17545–17555. [Google Scholar] [CrossRef]
  62. Puy, G.; Boulch, A.; Marlet, R. Using a waffle iron for automotive point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 3379–3389. [Google Scholar] [CrossRef]
  63. Cheng, R.; Razani, R.; Taghavi, E.; Li, E.; Liu, B. (AF)2-S3Net: Attentive Feature Fusion with Adaptive Feature Selection for Sparse Semantic Segmentation Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12547–12556. [Google Scholar] [CrossRef]
  64. Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. Polarnet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9601–9610. [Google Scholar] [CrossRef]
  65. Park, J.; Kim, C.; Kim, S.; Jo, K. PCSCNet: Fast 3D semantic segmentation of LiDAR point cloud for autonomous car using point convolution and sparse convolution network. Expert Syst. Appl. 2023, 212, 118815. [Google Scholar] [CrossRef]
  66. Zhao, L.; Xu, S.; Liu, L.; Ming, D.; Tao, W. SVASeg: Sparse voxel-based attention for 3D LiDAR point cloud semantic segmentation. Remote Sens. 2022, 14, 4471. [Google Scholar] [CrossRef]
  67. Xu, J.; Zhang, R.; Dou, J.; Zhu, Y.; Sun, J.; Pu, S. Rpvnet: A deep and efficient range-point-voxel fusion network for lidar point cloud segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 16024–16033. [Google Scholar] [CrossRef]
  68. Lu, H.; Liu, W.; Ye, Z.; Fu, H.; Liu, Y.; Cao, Z. SAPA: Similarity-aware point affiliation for feature upsampling. Adv. Neural Inf. Process. Syst. 2022, 35, 20889–20901. [Google Scholar] [CrossRef]
Figure 1. FARVNet overview. It consists of three steps: (1) The feature encoding incorporates the Geometric Field of View Reconstruction (GFVR) and Intensity Reconstruction (IR) modules. (2) The 2D backbone network extracts multi-scale features. (3) The Adaptive Multi-Scale Feature Fusion (AMSFF) of hierarchically extracted features predicts the final class label.
Figure 2. Projection offset caused by floating-point rounding during the projection process. Black represents pixel voids caused by projection offset.
Figure 3. Comparison of range projection methods: (a) typical method described in Equation (A1), (b) our GFVR-based method in Equation (1). Our approach clearly outperforms traditional projection rules.
Figure 4. Comparison of intensity distribution between the SemanticKITTI [8] scenario (red) and a domestic point cloud acquisition device (blue).
Figure 5. Pseudo-noise in the state of vanishing intensity is updated by intensity perception (left), and a generally low intensity is considered to be caused by the object (right).
Figure 6. Visualization of multiple real-time semantic segmentation methods based on RV and using the SemanticKITTI [8] val set, with gray representing correct labels and red representing incorrect labels.
Figure 7. Visual comparison of high-level semantic segmentation network model performance on SemanticKITTI [8] Sequence 08, with magnified views highlighting key differences. Ground Truth indicates the true labels.
Figure 8. Comparison of different semantic segmentation networks' ability to handle intensity drop-out points. Yellow indicates incorrect class segmentation; red indicates incorrect segmentation where the point cloud intensity = 0.
Figure 9. mIoU vs. runtime on the SemanticKITTI [8] val set. The size of the markers represents the model’s number of parameters. Runtime measurements are taken on a single NVIDIA RTX 4090 GPU.
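Runtime figures of the kind plotted in Figure 9 are typically obtained by timing repeated forward passes on the GPU. The sketch below is a generic PyTorch measurement harness, not the paper's benchmarking code; the warm-up count, single-scan batch, and explicit torch.cuda.synchronize() calls are assumptions about the protocol.

```python
import time
import torch

@torch.no_grad()
def profile(model, sample, device="cuda", warmup=10, iters=100):
    """Estimate per-scan latency (ms) and parameter count of a segmentation model.

    GPU kernels are asynchronous, so torch.cuda.synchronize() is required
    before reading the clock; warm-up iterations exclude cudnn autotuning.
    """
    model = model.to(device).eval()
    sample = sample.to(device)
    for _ in range(warmup):
        model(sample)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(sample)
    torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / iters * 1000.0
    n_params = sum(p.numel() for p in model.parameters())
    return latency_ms, n_params
```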
Figure 10. Semantic segmentation failure in complex terrain, as illustrated by the real-world segmentation result in (a) and the result from our method in (b). The similar spatial distribution and elevation features, along with point cloud sparsity, make it difficult for the model to distinguish between “vegetation” and “terrain”.
Table 1. Comparisons among different LiDAR representations.
| Representation | Input | Representative Methods |
|---|---|---|
| Raw Point | Point | PointNet [12], RandLA-Net [13] |
| Voxel | Point | PVCNN [20], PVRCNN++ [21] |
| Multi-Modal | Point + Image | FuseSeg [17], Uniseg [18] |
| Multi-View | Point | AMVNet [22], CFNet [23], MVCNN [25] |
| Range-View | Point | RangeFormer [6], FRNet [5], RangePerception [7] |
Table 2. The class-wise IoU scores of different LiDAR semantic segmentation approaches on the SemanticKITTI [8] val set. All IoU scores are given in percentage (%). The best and second-best scores for each class are marked in bold (**x**) and underlined (_x_), respectively.
| Method | mIoU | Car | Bicycle | Motorcycle | Truck | Other-vehicle | Person | Bicyclist | Motorcyclist | Road | Parking | Sidewalk | Other-ground | Building | Fence | Vegetation | Trunk | Terrain | Pole | Traffic-sign |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RandLA-Net [13] | 50.0 | 92.0 | 8.0 | 12.8 | 74.8 | 46.7 | 52.3 | 46.0 | 0.0 | 93.4 | 32.7 | 73.4 | 0.1 | 84.0 | 43.5 | 83.7 | 57.3 | 73.1 | 48.0 | 27.0 |
| RangeNet++ [51] | 51.0 | 89.4 | 26.5 | 48.4 | 33.9 | 26.7 | 54.8 | 69.4 | 0.0 | 92.9 | 37.0 | 69.9 | 0.0 | 83.4 | 51.0 | 83.3 | 54.0 | 68.1 | 49.8 | 34.0 |
| SqueezeSegV2 [52] | 40.8 | 82.7 | 15.1 | 22.7 | 25.6 | 26.9 | 22.9 | 44.5 | 0.0 | 92.7 | 39.7 | 70.7 | 0.1 | 71.6 | 37.0 | 74.6 | 35.8 | 68.1 | 21.8 | 22.2 |
| SqueezeSegV3 [53] | 53.3 | 87.1 | 34.3 | 48.6 | 47.5 | 47.1 | 58.1 | 53.8 | 0.0 | 95.3 | 43.1 | 78.2 | 0.3 | 78.9 | 53.2 | 82.3 | 55.5 | 70.4 | 46.3 | 33.2 |
| SalsaNext [54] | 59.4 | 90.5 | 44.6 | 49.6 | 86.3 | 54.6 | 74.0 | 81.4 | 0.0 | 93.4 | 40.6 | 69.1 | 0.0 | 84.6 | 53.0 | 83.6 | 64.3 | 64.2 | 54.4 | 39.8 |
| MinkowskiNet [55] | 58.5 | 95.0 | 23.9 | 50.4 | 55.3 | 45.9 | 65.6 | 82.2 | 0.0 | 94.3 | 43.7 | 76.4 | 0.0 | 87.9 | 57.6 | 87.4 | 67.7 | 71.5 | 63.5 | 43.6 |
| SPVNAS [56] | 62.3 | 96.5 | 44.8 | 63.1 | 55.9 | 64.3 | 72.0 | 86.0 | 0.0 | 93.9 | 42.4 | 75.9 | 0.0 | 88.8 | 59.1 | 88.0 | 67.5 | 73.0 | 63.5 | 44.3 |
| Cylinder3D [57] | 64.9 | 96.4 | **61.5** | _78.2_ | 66.3 | 69.8 | _80.8_ | _93.3_ | 0.0 | 94.9 | 41.5 | 78.0 | 1.4 | 87.5 | 55.0 | 86.7 | 72.2 | 68.8 | 63.0 | 42.1 |
| PMF [58] | 63.9 | 95.4 | 47.8 | 62.9 | 68.4 | **75.2** | 78.9 | 71.6 | 0.0 | **96.4** | 43.5 | 80.5 | 1.0 | 88.7 | 60.1 | _88.6_ | _72.7_ | _75.3_ | _65.5_ | 43.0 |
| RangeViT [59] | 60.9 | 94.7 | 44.1 | 61.4 | 71.9 | 37.7 | 65.3 | 75.5 | 0.0 | 95.5 | 48.4 | 83.1 | 0.0 | 88.3 | 60.0 | 86.3 | 65.3 | 72.7 | 63.1 | 42.7 |
| CENet [60] | 61.5 | 91.6 | 42.4 | 61.7 | 82.4 | 63.5 | 64.4 | 76.6 | 0.0 | 93.0 | 50.3 | 72.7 | 0.1 | 85.0 | 54.4 | 84.1 | 61.0 | 70.3 | 55.2 | 42.8 |
| RangeFormer [6] | 66.5 | 95.0 | 58.1 | 72.1 | 85.1 | 59.8 | 76.9 | 86.4 | 0.2 | 94.8 | _55.5_ | 81.7 | _13.0_ | 88.5 | 64.5 | 86.5 | 66.8 | 73.0 | 64.0 | 52.0 |
| SphereFormer [61] | 67.8 | 96.8 | 51.0 | 75.0 | **93.4** | 64.4 | 77.0 | 92.6 | _0.8_ | 94.7 | 53.2 | 52.1 | 3.7 | _90.7_ | 58.5 | **88.7** | 71.3 | **75.9** | 64.7 | **54.5** |
| FRNet [5] | 67.6 | _97.2_ | 53.3 | 72.9 | 81.5 | 72.9 | 77.2 | 90.8 | 0.2 | 95.9 | 53.7 | _83.9_ | 9.0 | 90.5 | _65.9_ | 87.0 | 66.8 | 72.6 | 64.0 | 47.9 |
| WaffleIron [62] | _68.0_ | 96.1 | _58.1_ | **79.7** | 77.4 | 59.0 | **81.1** | 92.2 | **1.3** | 95.5 | 50.2 | 83.6 | 6.0 | **92.1** | **67.5** | 87.8 | **73.8** | 73.0 | **65.7** | _52.2_ |
| FARVNet | **69.4** | **97.2** | 56.5 | 77.1 | _90.2_ | _72.9_ | 78.6 | **93.9** | 0.0 | _96.0_ | **57.8** | **84.2** | **21.4** | 90.0 | 62.4 | 87.2 | 66.5 | 73.2 | 64.8 | 48.1 |
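The per-class IoU and mIoU values in Tables 2 and 3 follow the standard definition IoU_c = TP_c / (TP_c + FP_c + FN_c), accumulated over the validation set. A minimal sketch of this computation is given below; the ignore label and the 19-class setting are assumptions matching the SemanticKITTI evaluation protocol.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes=19, ignore_index=255):
    """Accumulate a num_classes x num_classes confusion matrix (rows: gt, cols: pred)."""
    keep = gt != ignore_index
    idx = gt[keep] * num_classes + pred[keep]
    return np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)

def iou_from_confusion(conf):
    """Per-class IoU = TP / (TP + FP + FN); mIoU is their mean."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), 0.0)
    return iou, iou.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.integers(0, 19, size=100_000)
    # Synthetic predictions that agree with the ground truth 80% of the time.
    pred = np.where(rng.random(100_000) < 0.8, gt, rng.integers(0, 19, size=100_000))
    per_class, miou = iou_from_confusion(confusion_matrix(pred, gt))
    print(f"mIoU: {100 * miou:.1f}%")
```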
Table 3. The class-wise IoU scores of different LiDAR semantic segmentation approaches on the val set of nuScenes [9]. All IoU scores are given in percentage (%). The best and second-best scores for each class are marked in bold (**x**) and underlined (_x_), respectively.
| Method | mIoU | Barrier | Bicycle | Bus | Car | Construction-vehicle | Motorcycle | Pedestrian | Traffic-cone | Trailer | Truck | Driveable-surface | Other-ground | Sidewalk | Terrain | Manmade | Vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AF2S3Net [63] | 62.2 | 60.3 | 12.6 | 82.3 | 80.0 | 20.1 | 62.0 | 59.0 | 49.0 | 42.2 | 67.4 | 94.2 | 68.0 | 64.1 | 68.6 | 82.9 | 82.4 |
| RangeNet++ [51] | 65.5 | 66.0 | 21.3 | 77.2 | 80.9 | 30.2 | 66.8 | 69.6 | 52.1 | 54.2 | 72.3 | 94.1 | 66.6 | 63.5 | 70.1 | 83.1 | 79.8 |
| PolarNet [64] | 71.0 | 74.7 | 28.2 | 85.3 | 90.9 | 35.1 | 77.5 | 71.3 | 58.8 | 57.4 | 76.1 | 96.5 | 71.1 | 74.7 | 74.0 | 87.3 | 85.7 |
| PCSCNet [65] | 72.0 | 73.3 | 42.2 | 87.8 | 86.1 | 44.9 | 82.2 | 76.1 | 62.9 | 49.3 | 77.3 | 95.2 | 66.9 | 69.5 | 72.3 | 83.7 | 82.5 |
| SalsaNext [54] | 72.2 | 74.8 | 34.1 | 85.9 | 88.4 | 42.2 | 72.4 | 72.2 | 63.1 | 61.3 | 76.5 | 96.0 | 70.8 | 71.2 | 71.5 | 86.7 | 84.4 |
| SVASeg [66] | 74.7 | 73.1 | 44.5 | 88.4 | 86.6 | 48.2 | 80.5 | 77.7 | 65.6 | 57.5 | 82.1 | 96.5 | 70.5 | 74.7 | 74.6 | 87.3 | 86.9 |
| RangeViT [59] | 75.2 | 75.5 | 40.7 | 88.3 | 90.1 | 49.3 | 79.3 | 77.2 | 66.3 | 65.2 | 80.0 | 96.4 | 71.4 | 73.8 | 73.8 | 89.9 | 87.2 |
| Cylinder3D [57] | 76.1 | 76.4 | 40.3 | 91.2 | **93.8** | 51.3 | 78.0 | 78.9 | 64.9 | 62.1 | 84.4 | 96.8 | 71.6 | 76.4 | _75.4_ | 90.5 | 87.4 |
| AMVNet [22] | 76.1 | **79.8** | 32.4 | 82.2 | 86.4 | **62.5** | 81.9 | 75.3 | **72.3** | **83.5** | 65.1 | **97.4** | 67.0 | **78.8** | 74.6 | 90.8 | 87.9 |
| RPVNet [67] | 77.6 | 78.2 | 43.4 | 92.7 | 93.2 | 49.0 | _85.7_ | 80.5 | 66.0 | 66.9 | 84.0 | 96.9 | 73.5 | 75.9 | 70.6 | 90.6 | 88.9 |
| WaffleIron [62] | 77.6 | _78.7_ | **51.3** | 93.6 | 88.2 | 47.2 | 86.5 | 81.7 | 68.9 | 69.3 | 83.1 | 96.9 | 74.3 | 75.6 | 74.2 | 87.2 | 85.2 |
| RangeFormer [6] | 78.1 | 78.0 | 45.2 | 94.0 | 92.9 | 58.7 | 83.9 | 77.9 | 69.1 | 63.7 | 85.6 | 96.7 | 74.5 | 75.1 | 75.3 | 89.1 | 87.5 |
| SphereFormer [61] | **78.4** | 77.7 | 43.8 | _94.5_ | 93.1 | 52.4 | **86.9** | _81.2_ | 65.4 | 73.4 | _85.3_ | 97.0 | 73.4 | 75.4 | 75.0 | **91.0** | **89.2** |
| WaffleAndRange [50] | 77.6 | 78.5 | _49.6_ | 91.8 | 87.6 | 52.7 | 86.7 | **82.2** | _70.1_ | 67.2 | 79.7 | 97.0 | _74.7_ | _76.8_ | 74.9 | 87.5 | 85.0 |
| FARVNet | _78.3_ | 78.3 | 42.5 | **95.5** | 91.8 | _58.9_ | 84.4 | 77.7 | 67.5 | 68.9 | 83.5 | _97.0_ | **77.2** | 76.1 | **76.0** | 89.2 | 87.5 |
Table 4. Ablation study of each component of FARVNet on the val sets of SemanticKITTI [8] and nuScenes [9]. BL: baseline. MSFF: Multi-Scale Feature Fusion. GFVR: Geometric Field of View Reconstruction. IR: Intensity Reconstruction. TTA: Test-Time Augmentation. All mIoU and mAcc scores are given in percentage (%).
| BL | MSFF | GFVR | IR | TTA | SemanticKITTI mIoU | SemanticKITTI mAcc | nuScenes mIoU | nuScenes mAcc |
|---|---|---|---|---|---|---|---|---|
| | | | | | 67.3 | 74.0 | 76.1 | 83.9 |
| | | | | | 67.8 | 75.3 | 76.3 | 83.9 |
| | | | | | 68.6 | 75.4 | 77.0 | 84.4 |
| | | | | | 69.4 | 75.5 | 78.3 | 85.0 |
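Test-Time Augmentation (TTA), the final component in Table 4, averages the predictions obtained from several augmented copies of the input scan. The augmentations used for FARVNet are not specified here, so the sketch below adopts a common choice for LiDAR segmentation (global yaw rotations and an x-axis flip) purely as an assumption.

```python
import numpy as np
import torch

@torch.no_grad()
def tta_predict(model, points, yaw_angles=(0.0, np.pi / 2, np.pi, 3 * np.pi / 2),
                flip_x=(False, True)):
    """Average per-point class probabilities over rotated/flipped copies of a scan.

    `points` is an (N, 4) tensor of x, y, z, intensity; `model` is assumed to map
    it to (N, C) logits. The augmentation set is illustrative, not the paper's.
    """
    probs = None
    for yaw in yaw_angles:
        for fx in flip_x:
            p = points.clone()
            if fx:
                p[:, 0] = -p[:, 0]                       # flip along the x-axis
            c, s = np.cos(yaw), np.sin(yaw)
            rot = torch.tensor([[c, -s], [s, c]], dtype=p.dtype, device=p.device)
            p[:, :2] = p[:, :2] @ rot.T                  # global yaw rotation
            prob = torch.softmax(model(p), dim=-1)       # (N, C) probabilities
            probs = prob if probs is None else probs + prob
    return probs.argmax(dim=-1)                          # (N,) fused labels
```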
Table 5. Impact of different IR algorithms on the mIoU metric.
| IR algorithm | Mean | Max | Min |
|---|---|---|---|
| mIoU change (%) | +0.8 | +0.1 | +0.2 |