Article

SAMNet++: A Segment Anything Model for Supervised 3D Point Cloud Semantic Segmentation

Department of Civil Engineering, Faculty of Engineering and Architectural Science, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(7), 1256; https://doi.org/10.3390/rs17071256
Submission received: 24 January 2025 / Revised: 15 March 2025 / Accepted: 28 March 2025 / Published: 2 April 2025
(This article belongs to the Special Issue 3D Scene Reconstruction, Modeling and Analysis Using Remote Sensing)

Abstract

Segmentation of 3D point clouds is essential for applications such as environmental monitoring and autonomous navigation, where making accurate distinctions between different classes from high-resolution 3D datasets is critical. Segmenting 3D point clouds often requires a trade-off between preserving spatial information and achieving computational efficiency. In this paper, we present SAMNet++, a hybrid 3D segmentation model that integrates the Segment Anything Model (SAM) with an adopted PointNet++ in a sequential two-stage pipeline. First, SAM performs an initial unsupervised segmentation, which is then refined using the adopted PointNet++ to improve accuracy. The key innovations of SAMNet++ include its hybrid architecture, which combines SAM’s generalization with PointNet++’s local feature extraction, and a feature refinement strategy that enhances precision while reducing computational overhead. Additionally, SAMNet++ minimizes the reliance on extensive supervised training while maintaining high accuracy. The proposed model is tested on three urban datasets collected by an unmanned aerial vehicle (UAV). SAMNet++ demonstrates high segmentation performance, achieving accuracy, precision, recall, and F1-score values above 0.97 across all classes on our experimental datasets. Furthermore, its mean intersection over union (mIoU) of 86.93% on a public benchmark dataset signifies a more balanced and precise segmentation across all classes, surpassing previous state-of-the-art methods. In addition to its improved accuracy, SAMNet++ is computationally efficient, requiring almost half the processing time of standard PointNet++ and nearly one-sixteenth of the time needed by the original PointNet algorithm.

1. Introduction

The segmentation and identification of urban features, such as buildings, vehicles, sidewalks, and other infrastructure elements, are essential for various applications, including smart city planning, autonomous navigation, disaster management, and the optimization of telecommunication networks [1]. Traditionally, urban feature detection research, such as detecting buildings [2], has relied on 2D images. These techniques, based on image processing, have been widely explored in computer vision and photogrammetry research [3]. Methods like edge detection [4] and texture analysis [5] are commonly applied to identify urban features, yet 2D approaches lack depth information, which can limit the accuracy when distinguishing complex or overlapping features. This reliance on 2D data highlights the need for advancements that incorporate 3D information for more precise urban feature detection.
Recent advancements in 3D scanning technologies and point cloud generation have significantly expanded the use of 3D measurements for extracting urban features. Among these, light detection and ranging (LiDAR) technology is particularly valued for its precision and speed, enabling rapid data collection [6]. Building on these capabilities, unmanned aerial vehicles (UAVs) equipped with LiDAR, high-resolution cameras, inertial measurement units (IMUs), and global navigation satellite systems (GNSS) can be a mobile platform that enhances 3D point cloud acquisition. This integration allows unmanned aerial systems (UASs) to efficiently collect detailed georeferenced data, providing a more comprehensive view of urban environments and surpassing the limitations of static methods [7]. However, integrating multiple sensors increases the complexity of processing due to the diversity and volume of data. This technological leap forward presents both opportunities and challenges in urban feature extraction.
To address these challenges, some approaches attempt to convert 3D point clouds into 2D formats [8]. This, however, can lead to the loss of essential spatial and depth information, as 2D projections lack the full geometric context. Working directly with 3D point cloud data preserves the scene’s structure, allowing for more precise segmentation, particularly in applications where depth is crucial. To fully harness the potential of these high-resolution datasets, advanced computational techniques are necessary.
The earliest approaches to point cloud segmentation relied on traditional machine learning algorithms [9] and handcrafted features [10], that is, manually designed descriptors or attributes based on geometric properties that characterize data points. However, these approaches faced scalability issues and struggled to generalize to complex urban environments. With the rise of deep neural networks (DNNs), particularly convolutional neural networks (CNNs) and point-based networks, the accuracy of semantic segmentation tasks has improved significantly. DNNs have proven to be powerful tools for point cloud segmentation, offering advanced capabilities in feature extraction, classification, and segmentation of high-resolution datasets. These networks can learn complex spatial patterns and fine-grained distinctions between features, such as buildings, vegetation, and roads, making them highly effective for detailed urban feature extraction. DNNs also excel at handling the high dimensionality and irregularity of point cloud data, using architectures specifically designed to capture geometric and contextual relationships. This capability allows them to deliver higher accuracy and efficiency compared to traditional methods, even in challenging, densely populated urban environments [11,12,13].
There are two main DNN approaches for 3D segmentation: unsupervised and supervised methods. Unsupervised segmentation methods, such as [14,15], aim to segment the data without the need for labeled examples. While unsupervised methods eliminate the costly annotation process, they often struggle to achieve the same level of accuracy as supervised approaches, particularly in complex urban environments. Additionally, unsupervised segmentation can suffer from issues such as over-segmentation or poor distinction between similar features. A further limitation is that, without labeled outputs, the results lack semantic information, restricting their applicability in tasks that require precise identification and classification of objects or features.
On the other hand, supervised segmentation methods can offer high accuracy and reliability by leveraging labeled datasets. However, supervised segmentation methods for point clouds come with several limitations. They are generally less computationally efficient, requiring vast amounts of labeled data and significant processing time, which makes them costly and resource-intensive to train. These methods are also prone to overfitting, especially with limited or imbalanced datasets, which reduces their generalization to new data. Scalability is another challenge, as annotating large point cloud datasets is labor-intensive and often requires domain expertise, further increasing costs. Additionally, supervised models struggle with rare or complex classes, can be sensitive to noisy labels, and may require frequent retraining to adapt to new environments.
Among supervised methods, PointNet [16] has demonstrated effectiveness in the segmentation of various urban features. However, it struggles with complex, large-scale urban scenes, where geometric relationships are essential for accurate segmentation. PointNet++ [17] extends PointNet by hierarchically processing point sets with local neighborhood information, enhancing feature capture in large-scale environments. It improves object distinction, such as between buildings and trees, but still struggles with noisy data, occlusion, and varying point densities in dense urban scenes. Both PointNet and PointNet++ also have limitations when handling vast datasets and complex geometries. Additionally, neither natively uses color features; both rely solely on 3D spatial coordinates, which can limit their ability to differentiate between similar objects where color is an important cue. These limitations motivate further enhancements. PointNeXt [18] builds on PointNet++ by leveraging skip connections and residual blocks, ensuring both computational efficiency and high performance on large-scale datasets, like S3DIS [19] and SemanticKITTI [20]. Despite these improvements, the computational cost of processing raw point clouds remains significant, particularly for large-scale LiDAR data.
To overcome this, voxel-based methods have gained traction. The framework introduced in [21] seeks to bridge the gap between the high efficiency of voxel-based models and the detail preservation of point-based models like PointNet++. By distilling knowledge from point-based networks into voxel-based models, this approach attempts to capture finer details without sacrificing the computational efficiency needed for large-scale data processing. While voxel-based methods offer improved computational efficiency, they come with specific limitations. One key drawback is that voxelization introduces a loss of geometric precision due to grid quantization, which can result in coarser segmentation outputs. Additionally, voxel-based models struggle to preserve fine-grained details, especially in dense and complex urban environments, as the grid structure inherently reduces the spatial resolution of the original point cloud. These limitations highlight the need for alternative strategies to enrich 3D segmentation quality.
Recent research has also focused on the fusion of multiple data modalities, such as RGB images and LiDAR point clouds, to improve the accuracy of 3D segmentation [22]. The integration of RGB data helps add texture and color information to the 3D point clouds, which can be crucial for distinguishing between urban features with similar geometric properties but different visual appearances. For example, the proposed architecture in [23], adapted for 3D segmentation tasks, improves the accuracy in classifying objects by leveraging both point cloud geometry and RGB information. This hybrid approach shows significant promise, particularly in distinguishing urban features within complex urban environments. Further advancements, such as in the FuseSeg framework [24], integrate RGB and LiDAR data early in the segmentation process through feature warping, enhancing the segmentation accuracy for large-scale urban environments. Similarly, the perception-aware multi-sensor fusion (PMF) method [25] utilizes a two-stream network to fuse RGB and LiDAR features in the camera coordinate system, improving robustness and detail capture in 3D segmentation tasks. Despite the advantages of multi-modal fusion, current methods often face limitations in processing speed and scalability, particularly when dealing with large-scale urban datasets.
To overcome these challenges, a segmentation approach is needed that balances accuracy, efficiency, and flexibility in segmenting complex urban scenes. The model should efficiently process high-density 3D point clouds, manage overlapping objects, and distinguish between similar structures, like roads and sidewalks or buildings and walls. Additionally, it must capture fine-grained urban details while optimizing computational efficiency. These challenges underscore the need for more robust architectures that leverage the complementary strengths of RGB and LiDAR, inspiring the development of SAMNet++, a novel DNN architecture for 3D segmentation.
SAMNet++ combines the features of SAM LiDAR [26], an adaptation of the Segment Anything Model (SAM) [27] for LiDAR point clouds, with PointNet++ [17]. By integrating the strengths of both unsupervised and supervised approaches, SAMNet++ enhances the segmentation quality and efficiency. Unsupervised segmentation with SAM LiDAR allows the model to identify regions of interest in colorized LiDAR point clouds without requiring labeled data, which produces the initial segmentation. This step also helps to reduce noise and isolate meaningful features, providing a strong foundation for further refinement. Subsequently, by applying PointNet++, a supervised model, SAMNet++ refines these regions with higher precision, leveraging labeled data to improve classification accuracy and boundary definition.
Compared to existing methods, this hybrid approach significantly improves segmentation performance, particularly in complex urban environments, where high accuracy is crucial, while also enhancing computational efficiency. It also allows SAMNet++ to benefit from the scalability of unsupervised learning, while achieving the high detail and accuracy of supervised methods. This dual-stage process effectively balances efficiency and accuracy, establishing SAMNet++ as a robust solution for 3D segmentation tasks on high-resolution LiDAR data.
To validate the developed model, we used three distinct datasets collected using a UAV equipped with a multi-sensor payload, one of which is sourced from a publicly available dataset. Compared to state-of-the-art (SOTA) methods, SAMNet++ not only improves segmentation accuracy but also enhances computational efficiency. The main research objectives and contributions are outlined as follows:
  • Develop a novel hybrid segmentation model that leverages unsupervised and supervised segmentation techniques for improved 3D segmentation accuracy and efficiency.
  • Demonstrate the advantages of a dual-stage segmentation pipeline, where SAM LiDAR performs coarse segmentation and PointNet++ refines it.
  • Evaluate SAMNet++ on real-world UAV-collected LiDAR datasets, demonstrating its effectiveness compared to some of the SOTA methods.
The remainder of this paper is organized as follows: Section 2 details the proposed approach. Section 3 discusses the collection and processing of the datasets. Section 4 presents the results of the proposed model’s implementation. Section 5 analyzes the results. Finally, Section 6 concludes the paper.

2. Proposed Approach

The proposed technique, SAMNet++, is developed to achieve high accuracy and efficiency in semantic segmentation of 3D colorized point clouds. Our approach leverages multi-sensor integration and advanced deep learning techniques to accurately segment 3D urban features. Figure 1 illustrates the architecture for SAMNet++, outlining the sequential stages of segmenting 3D point cloud data. Datasets are first collected using a UAV equipped with the Zenmuse L1 LiDAR payload [28], which integrates a LiDAR, an RGB camera, a GNSS receiver, and an IMU, generating a georeferenced and colorized point cloud. After generating the colorized point clouds, they are annotated to provide labeled data. To simplify processing and ensure numerical stability during the segmentation task, the original geospatial coordinates of the point cloud are normalized and transformed into a local coordinate system centered around the dataset’s centroid at the preprocessing stage. This transformation involves shifting and scaling the coordinates to fit within a standardized reference frame, reducing computational complexity and enabling SAMNet++ to efficiently identify and segment urban features within the 3D colorized point cloud. In the final step, referred to as point translation, the segmented points are mapped back to their original geospatial coordinates by reversing the normalization process. This step is essential for preserving the spatial accuracy required for real-world applications.
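The normalization and point translation steps described above can be illustrated with a minimal sketch. The article states only that coordinates are shifted to the dataset centroid and scaled; scaling into a unit sphere, the function names, and the sample coordinates below are assumptions made for illustration.

```python
import numpy as np

def normalize_points(xyz):
    """Shift points to their centroid and scale them into a unit sphere.

    Returns the local coordinates plus the centroid and scale needed to undo the step.
    Unit-sphere scaling is an assumption; the article does not specify the scaling rule.
    """
    xyz = np.asarray(xyz, dtype=np.float64)
    centroid = xyz.mean(axis=0)
    centered = xyz - centroid
    scale = np.linalg.norm(centered, axis=1).max()
    if scale == 0.0:  # degenerate cloud (all points identical)
        scale = 1.0
    return centered / scale, centroid, scale

def translate_back(xyz_local, centroid, scale):
    """Point translation: map local coordinates back to the original frame."""
    return np.asarray(xyz_local) * scale + centroid

# Hypothetical projected coordinates (easting, northing, height) in meters.
geo = np.array([[630100.0, 4833200.0, 95.0],
                [630110.0, 4833215.0, 101.5],
                [630095.0, 4833190.0, 97.2]])
local, centroid, scale = normalize_points(geo)
restored = translate_back(local, centroid, scale)
```

Working in the local frame keeps coordinate magnitudes small, which avoids the precision loss that large geospatial offsets can cause in single-precision network arithmetic.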

2.1. Data Fusion and Generating Colored Point Cloud

The data from various sensors, including IMU, GNSS, and LiDAR, are fused to create a georeferenced point cloud. The generated point cloud is then colorized using the georeferenced RGB images. This technique is widely used in applications requiring high-accuracy spatial data, such as surveying, construction, and urban planning, providing a detailed and contextually rich 3D representation of an environment [29].
The workflow begins with estimating the position and orientation (POS) of the UAV, leveraging GNSS for global positioning accuracy and IMU for orientation and movement tracking. Once the POS data is estimated, raw LiDAR scans are processed to generate a dense, georeferenced 3D point cloud. This point cloud captures the structural geometry of the environment, providing a high level of detail valuable for spatial analysis. To further enrich the 3D point cloud, a colorization process is applied using georeferenced RGB images, enabling the utilization of color-based features for segmentation. In the Zenmuse L1, these images are captured concurrently with LiDAR scans and tagged with precise GPS coordinates.

2.2. SAMNet++

This subsection describes the proposed model, SAMNet++, which combines SAM LiDAR [26], an adaptation of SAM specifically developed for 3D LiDAR point clouds, with an adopted PointNet++ deep neural network. This combination enables SAMNet++ to leverage the scalability of unsupervised learning while achieving the precision and detail provided by supervised methods. This dual-stage approach effectively balances efficiency and accuracy.
The process begins with unsupervised segmentation using the SAM LiDAR, which identifies regions of interest within the 3D colorized point clouds. Next, the PointNet++ model processes these segmented regions to refine the segmentation and assign labels corresponding to specific urban features for each point in the dataset. This workflow enables SAMNet++ to achieve accurate and efficient semantic segmentation. Figure 2 illustrates the architecture of the proposed approach used in this research for 3D point cloud segmentation.

2.2.1. SAM LiDAR Segmentation

This part explores SAM LiDAR [26], a key component of our proposed approach, which performs the initial unsupervised segmentation of 3D LiDAR point cloud data. In this package, SAM [27], a model for versatile, unsupervised segmentation across various types of visual data, is integrated with the segment-geospatial package [30,31]. This integration allows SAM LiDAR to automatically identify and separate distinct regions in the 3D point cloud based on inherent color features and patterns, enabling unsupervised segmentation without predefined labels.
The segmentation process begins with the preprocessing stage, where the input is a 3D LiDAR point cloud in a standard format, such as LAS [32]. The point cloud is then converted into a 2D raster image representation, as SAM is inherently designed for processing visual data. This transformation allows SAM to leverage RGB information extracted from the LiDAR point cloud, enabling segmentation based on color and texture rather than solely on geometric properties, such as point density, surface normal orientations, spatial clustering, or radiometric properties, such as intensity. By incorporating color-based segmentation, SAM LiDAR provides a flexible and adaptable method for identifying objects in a scene. After generating the raster image, SAM utilizes this information to perform segmentation, ensuring accurate object boundaries based on both spatial and color features. In this step, multiple segmentation masks are generated based on distinct color and texture patterns within the image, where each mask represents a separate region that SAM identifies as visually coherent. Then, the segment-geospatial algorithm is employed to generate segment labels, with the option to refine the results using text-based prompts. The segmentation masks define distinct regions within the raster image, capturing variations in color and texture that correspond to different object classes.
The effectiveness of this segmentation process is highly dependent on several parameter settings, which influence the accuracy and quality of the results. The minimum mask region area determines the smallest area a segment can occupy, preventing overly fragmented segmentation. The points per side parameter controls the density of the prompt grid with which SAM samples the image, impacting the granularity of the segmented regions. Other key parameters include the predicted intersection over union (IoU) threshold, which regulates how strictly the model filters segmentation masks, and the stability score threshold, which affects the consistency of predicted masks. Additionally, fine-tuning the crop layers and downscaling factors can optimize feature extraction and improve segmentation performance.
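To make these parameters concrete, the following sketch collects them in a plain dictionary and checks that they fall in plausible ranges. The key names follow the keyword arguments of the automatic mask generator in the segment-anything package as we understand them; the values are illustrative and are not the settings used in this study, and `check_sam_kwargs` is a hypothetical helper.

```python
# Illustrative values only; they are not the settings reported in this article.
sam_kwargs = {
    "points_per_side": 32,                # 32 x 32 = 1024 prompt points over the image
    "pred_iou_thresh": 0.88,              # keep masks whose predicted IoU is at least this
    "stability_score_thresh": 0.95,       # keep masks that are stable under threshold changes
    "crop_n_layers": 0,                   # number of extra crop layers (0 = full image only)
    "crop_n_points_downscale_factor": 1,  # prompt-grid downscaling applied per crop layer
    "min_mask_region_area": 0,            # remove regions smaller than this many pixels
}

def check_sam_kwargs(kwargs):
    """Return a list of problems; an empty list means the values are plausible."""
    problems = []
    if kwargs["points_per_side"] < 1:
        problems.append("points_per_side must be at least 1")
    for name in ("pred_iou_thresh", "stability_score_thresh"):
        if not 0.0 <= kwargs[name] <= 1.0:
            problems.append(f"{name} must lie in [0, 1]")
    if kwargs["crop_n_layers"] < 0:
        problems.append("crop_n_layers must be non-negative")
    if kwargs["crop_n_points_downscale_factor"] < 1:
        problems.append("crop_n_points_downscale_factor must be at least 1")
    if kwargs["min_mask_region_area"] < 0:
        problems.append("min_mask_region_area must be non-negative")
    return problems

issues = check_sam_kwargs(sam_kwargs)
```

Raising the two thresholds yields fewer but cleaner masks, while a larger minimum region area suppresses the small fragments that unsupervised segmentation tends to produce.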
Following the segmentation process, the results are mapped back to the original 3D point cloud. Each segmented region from the 2D raster image is associated with its corresponding 3D points, ensuring that the spatial structure of the original point cloud is preserved. Through this workflow, SAM LiDAR enables unsupervised segmentation of 3D LiDAR point clouds by leveraging SAM’s segmentation capabilities in a color-based framework. However, a key drawback of this approach is the loss of spatial information during the 3D-to-2D conversion, which can lead to inaccuracies in segmentation boundaries and misalignment when reconstructing the segmented point cloud. Additionally, due to the unsupervised nature of the method, SAM LiDAR often fragments individual classes across multiple segments, making it challenging to achieve a coherent classification of objects. For further details on implementation and customization, the open-source repository of the SAM LiDAR package is available at [26].
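The 3D-to-2D conversion and the mapping of segment labels back to the points can be sketched as follows. This is a simplified top-down rasterization written for illustration, not the SAM LiDAR package's implementation; the function names, the rule that each cell keeps the color of its highest point, and the sample data are assumptions.

```python
import numpy as np

def rasterize_top_down(xyz, rgb, resolution):
    """Project points onto an XY grid; each cell keeps the color of its highest point.

    Also returns the (row, col) cell of every point so labels can be mapped back.
    Rows grow with Y here for simplicity; real rasters usually place north at the top.
    """
    xy_min = xyz[:, :2].min(axis=0)
    cols = np.floor((xyz[:, 0] - xy_min[0]) / resolution).astype(int)
    rows = np.floor((xyz[:, 1] - xy_min[1]) / resolution).astype(int)
    image = np.zeros((rows.max() + 1, cols.max() + 1, 3), dtype=np.uint8)
    # Visit points from lowest to highest so the highest point in a cell is written last.
    for i in np.argsort(xyz[:, 2]):
        image[rows[i], cols[i]] = rgb[i]
    return image, rows, cols

def labels_to_points(label_image, rows, cols):
    """Assign each 3D point the segment label of the raster cell it falls in."""
    return label_image[rows, cols]

xyz = np.array([[0.2, 0.2, 1.0], [0.3, 0.1, 5.0], [1.5, 0.4, 2.0], [0.4, 1.6, 3.0]])
rgb = np.array([[10, 10, 10], [200, 0, 0], [0, 200, 0], [0, 0, 200]], dtype=np.uint8)
image, rows, cols = rasterize_top_down(xyz, rgb, resolution=1.0)

# Hypothetical segment labels, standing in for the masks SAM would produce on `image`.
label_image = np.array([[1, 2], [3, 0]])
point_labels = labels_to_points(label_image, rows, cols)
```

The sketch also makes the stated drawback visible: the first two points share one cell, so they receive the same label regardless of their height difference.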

2.2.2. PointNet++ Segmentation

PointNet++ offers several key advantages over other point cloud segmentation methods, which contribute to its widespread use in 3D computer vision tasks. One of the primary advantages is its hierarchical learning structure, which mirrors the way CNNs process images. This hierarchy allows PointNet++ to effectively capture and model local features at multiple scales, making it capable of learning complex patterns in regions of varying density within the point cloud. This approach enables the network to focus on both fine and coarse details, significantly improving its performance in recognizing intricate structures and maintaining context over larger spatial areas. Unlike earlier models that treated point clouds as unstructured sets without local context, PointNet++ can learn neighborhood relationships through its point grouping and sampling strategy. Moreover, PointNet++ preserves the original geometric structure of point clouds by working directly on the raw data without requiring voxelization or mesh construction, which is common in other 3D segmentation methods, like voxel-based CNNs or graph-based networks. This direct approach helps avoid discretization artifacts and the loss of detail that can occur when converting 3D data into regular grid structures. As a result, PointNet++ provides higher accuracy and efficiency in processing large-scale and complex point clouds, such as those found in urban, indoor, or natural environments. It also employs a feature propagation mechanism that supports effective point-wise segmentation and up-sampling, ensuring that the learned features are transferred seamlessly across different resolutions.
In this research, we adopt the PointNet++ model [17] for supervised segmentation of urban features in 3D point clouds, customizing it specifically for detailed extraction of urban features from segmented point cloud data. The model preserves the foundational architecture of the original PointNet++ [17], which utilizes hierarchical layers to progressively capture and refine geometric features, ensuring that spatial details critical for accurate segmentation are retained. To enhance feature extraction, we introduce an additional Set Abstraction layer, which expands the feature dimension to 1024 channels, increasing the model’s capacity to capture finer details, while preserving broader contextual relationships. These enhancements improve the segmentation accuracy and robustness, allowing the model to effectively handle complex urban environments with dense structures, varying scales, and intricate boundaries.
One key modification in the adopted model is its focused feature extraction approach. Unlike the original PointNet++, which processes entire point clouds at both local and global levels, the adopted model searches for and extracts features solely within the segmented parts generated by the SAM LiDAR. This adjustment eliminates the need to process the entire point cloud, allowing the model to efficiently concentrate on relevant local and global structures within each segment. As a result, it enables precise segmentation of urban features while minimizing unnecessary computations and processing time. Additionally, we introduce a new Feature Propagation layer with an extended feature space (1536 to 512 channels), which enhances the model’s ability to transfer high-dimensional learned features to lower-resolution representations. This modification helps in refining segmentation boundaries and improving per-point classification accuracy. Finally, we modify the feature extraction pipeline by introducing a Global Reduction layer to aggregate learned representations into a fixed-size descriptor. This is followed by a Dropout layer to prevent overfitting and enhance generalization, particularly for complex urban datasets, before the final classification step.
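The focused, per-segment processing described above can be sketched as a loop that runs a classifier on each SAM LiDAR segment separately and scatters the predictions back to the full point cloud. The function names are hypothetical, and `toy_classifier` is a stand-in for the refinement network, not the adopted PointNet++.

```python
import numpy as np

def classify_by_segment(xyz, segment_ids, classify_segment):
    """Run `classify_segment` on each segment on its own and scatter the labels back."""
    point_labels = np.full(len(xyz), -1, dtype=int)
    for seg in np.unique(segment_ids):
        idx = np.flatnonzero(segment_ids == seg)
        # Only the points of this segment are passed on, never the whole cloud.
        point_labels[idx] = classify_segment(xyz[idx])
    return point_labels

def toy_classifier(segment_xyz):
    """Hypothetical stand-in for the network: label a segment by its mean height."""
    label = 1 if segment_xyz[:, 2].mean() > 2.0 else 0
    return np.full(len(segment_xyz), label)

xyz = np.array([[0.0, 0.0, 0.1], [1.0, 0.0, 0.2],
                [5.0, 5.0, 5.0], [5.0, 6.0, 6.0], [6.0, 5.0, 4.0]])
segment_ids = np.array([0, 0, 1, 1, 1])
labels = classify_by_segment(xyz, segment_ids, toy_classifier)
```

Because neighborhood searches are restricted to one segment at a time, their cost depends on segment size rather than on the size of the full point cloud.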
To better address class imbalance in the segmented data, the adopted model incorporates advanced data balancing techniques. These include rotation, jittering, and the use of the synthetic minority over-sampling technique (SMOTE) [33] for enhancing minority class representation, as well as random under-sampling for majority classes. We implement a class-aware weighting mechanism in the loss function by assigning higher weights to underrepresented classes, ensuring that the model does not become biased toward dominant categories. Additionally, Focal Loss is employed to direct the model’s attention to more difficult-to-classify samples. This method reshapes the standard cross-entropy loss by incorporating a scaling factor that diminishes the weight of the easily classifiable cases while boosting the weight of the more difficult ones. Given an input sample (x, y), where x represents the input logits, and y is the ground-truth label, the standard cross-entropy loss is defined as
$$\mathcal{L}_{CE}(x, y) = -\log(p_t),$$
where $p_t = p(y \mid x)$ is the predicted probability of the correct class. Focal Loss enhances this formulation by incorporating a modulating factor $(1 - p_t)^{\gamma}$, which dynamically down-weights well-classified examples. The Focal Loss function is expressed as
$$\mathcal{L}_{FL}(x, y) = -\alpha (1 - p_t)^{\gamma} \log(p_t),$$
where $\alpha$ is a weighting factor that balances class importance and $\gamma$ is a focusing parameter that adjusts the contribution of easy and hard examples. For a batch of $N$ training samples, the total loss is calculated as
$$\mathcal{L}_{FL}(x, y) = -\frac{1}{N} \sum_{i=1}^{N} \alpha \left(1 - p_{t,i}\right)^{\gamma} \log\left(p_{t,i}\right).$$
By appropriately setting $\alpha$ and $\gamma$, this approach mitigates class imbalance by preventing the model from being overly biased toward majority classes while improving its performance on minority classes, which are often more difficult to classify accurately.
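As a worked illustration of the batch Focal Loss, the sketch below evaluates it with NumPy on two samples, one easy and one hard. The values of α and γ and the sample logits are illustrative assumptions; the settings used in this study are not stated in this section.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Mean over the batch of -alpha * (1 - p_t)^gamma * log(p_t)."""
    probs = softmax(logits)
    p_t = probs[np.arange(len(labels)), labels]  # probability of the true class
    p_t = np.clip(p_t, 1e-12, 1.0)               # guard against log(0)
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t)))

def cross_entropy(logits, labels):
    """Focal Loss reduces to cross-entropy when alpha = 1 and gamma = 0."""
    return focal_loss(logits, labels, alpha=1.0, gamma=0.0)

# One confidently correct sample (first row) and one uncertain sample (second row).
logits = np.array([[4.0, 0.0, 0.0], [0.2, 0.1, 0.0]])
labels = np.array([0, 0])
ce = cross_entropy(logits, labels)
fl = focal_loss(logits, labels, alpha=1.0, gamma=2.0)
```

With γ = 2, the confident sample contributes almost nothing to the loss, so the gradient is dominated by the uncertain sample, which is the behavior the focusing parameter is meant to produce.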
The adopted model also features an additional layer that extends the number of feature channels to 1024, significantly increasing its capacity for complex feature representation. This added depth enables the model to capture fine-grained details, such as edges and smaller urban elements, as well as broader spatial contexts, like large structures. Furthermore, we integrate batch normalization layers after each feature propagation step to stabilize training and improve generalization. This ensures that feature distributions remain consistent across different point cloud segments. These enhancements allow the model to effectively differentiate between various urban features, including buildings, trees, roads, and vehicles, thus improving segmentation accuracy in complex urban environments.
As shown in Figure 2, the architecture of the adopted model, like the original PointNet++, is organized around two main components: Set Abstraction layers and Feature Propagation layers, which work together to establish a hierarchical understanding of point cloud structures and enable point-wise segmentation. The Set Abstraction layers form the core of the feature extraction process, capturing both local and global geometric properties within each segmented part. The first Set Abstraction layer receives the 3D coordinates (X, Y, Z) of the LiDAR points in each segment produced by SAM LiDAR. It applies 1D convolutions to learn spatial relationships, followed by batch normalization to stabilize training and rectified linear unit (ReLU) activations to introduce nonlinearity. Each layer is followed by a max-pooling operation, which extracts the most prominent features across the spatial dimension, summarizing essential geometric information at each hierarchical level.
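A single Set Abstraction step can be sketched in NumPy as sampling, grouping, a shared point-wise layer, and max-pooling. This is a simplified illustration under several assumptions: farthest point sampling and ball-query grouping follow the original PointNet++ design, the 1D convolution is taken to have kernel size 1 (equivalent to one linear layer shared by all points), batch normalization is omitted, and the weights are random rather than trained.

```python
import numpy as np

def farthest_point_sampling(xyz, n_samples):
    """Greedily pick well-spread points, starting from point 0."""
    chosen = np.zeros(n_samples, dtype=int)
    dist = np.full(len(xyz), np.inf)
    for i in range(1, n_samples):
        # Distance of every point to its nearest already-chosen point.
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[chosen[i - 1]], axis=1))
        chosen[i] = int(np.argmax(dist))
    return chosen

def set_abstraction(xyz, n_centroids, radius, weight, bias):
    """Sample centroids, group neighbors, apply a shared layer with ReLU, then max-pool."""
    centroids = xyz[farthest_point_sampling(xyz, n_centroids)]
    features = np.zeros((n_centroids, weight.shape[1]))
    for k, center in enumerate(centroids):
        in_ball = np.linalg.norm(xyz - center, axis=1) <= radius
        neighbors = xyz[in_ball] - center                    # local coordinates
        hidden = np.maximum(neighbors @ weight + bias, 0.0)  # shared layer + ReLU
        features[k] = hidden.max(axis=0)                     # max-pool over the neighborhood
    return centroids, features

rng = np.random.default_rng(0)
points = rng.random((200, 3))
weight = rng.normal(size=(3, 16))  # random weights, for shape illustration only
bias = np.zeros(16)
centroids, feats = set_abstraction(points, n_centroids=32, radius=0.3,
                                   weight=weight, bias=bias)
```

Max-pooling makes each neighborhood descriptor independent of the order of the points inside the ball, which is what allows the layer to operate on unordered point sets.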
Following the feature extraction, the Feature Propagation layers are crucial for refining and up-sampling feature maps, effectively restoring spatial resolution by merging high-level, coarse features with finer, low-level details. Each feature propagation layer concatenates feature maps from different scales, providing the model with localized details as well as broader contextual information. This up-sampling process maintains high spatial fidelity, ensuring accuracy in boundary segmentation and small feature recognition essential for urban environments. The feature propagation layers progressively reduce the feature dimensionality back to 128 channels, synthesizing and refining features for final point-wise segmentation. In the last layer, a fully connected layer maps these refined 128-dimensional features to the desired number of classes (in this case, seven for different urban features). A softmax function [34] is applied to generate a probability distribution across classes for each point, enabling detailed segmentation of diverse urban features.
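The up-sampling performed by a Feature Propagation layer can be sketched as follows. The inverse-distance-weighted interpolation over the three nearest neighbors follows the original PointNet++ design and is assumed here, since this section does not state the interpolation rule; the function name and sample data are illustrative, and the learned layers that follow the concatenation are omitted.

```python
import numpy as np

def propagate_features(dense_xyz, sparse_xyz, sparse_feats, skip_feats, k=3, eps=1e-8):
    """Interpolate coarse features onto dense points and concatenate skip features.

    Each dense point receives an inverse-squared-distance weighted average of the
    features of its k nearest coarse points.
    """
    # Pairwise distances, shape (n_dense, n_sparse).
    dists = np.linalg.norm(dense_xyz[:, None, :] - sparse_xyz[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    nearest_d = np.take_along_axis(dists, nearest, axis=1)
    weights = 1.0 / (nearest_d ** 2 + eps)
    weights /= weights.sum(axis=1, keepdims=True)
    interpolated = (sparse_feats[nearest] * weights[:, :, None]).sum(axis=1)
    # Merge coarse, high-level features with finer, low-level ones.
    return np.concatenate([interpolated, skip_feats], axis=1)

sparse_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
sparse_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dense_xyz = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.0]])
skip_feats = dense_xyz.copy()  # here the low-level features are just the coordinates
fused = propagate_features(dense_xyz, sparse_xyz, sparse_feats, skip_feats)
```

A dense point that coincides with a coarse point essentially inherits that point's features, while points in between receive a smooth blend of their neighbors.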
To optimize performance, we carefully select and tune key hyperparameters, such as batch size, dropout rate, weight decay, and learning rate. The learning rate is initially set to 0.001, balancing convergence speed with stability. A batch size of 1024 is determined through experimentation to provide an effective balance among memory usage, model accuracy, and training speed (Figure 3). Weight decay, set at $1 \times 10^{-5}$, serves as a regularization measure, preventing overly large weights and reducing overfitting risks. Additionally, a dropout rate of 0.5 in the final layers further improves generalization by helping the model avoid reliance on specific features. These tuned parameters contribute to efficient training and enhanced model performance for urban feature segmentation.
In the training of the model, we employ several techniques to improve computational efficiency, handle larger batch sizes, and ensure smooth convergence. First, gradient accumulation is used to simulate larger batch sizes without overloading the GPU memory. This method allows gradients to accumulate over multiple mini-batches, reducing memory constraints by only updating the model parameters after a defined number of accumulation steps. To enhance training efficiency, mixed precision training is implemented, where specific computations are conducted in half precision (float16) to reduce memory usage and accelerate training speed. This approach allows the model to handle larger batch sizes and process data more quickly, optimizing GPU utilization. At the same time, essential computations requiring higher numerical precision are retained in full precision (float32) to prevent any potential loss in accuracy. By balancing precision and computational efficiency, mixed precision training significantly reduces resource demands without compromising the overall performance of the model. The GradScaler [35] dynamically scales the loss to avoid underflow in half precision, while Autocast [35] automatically manages precision during forward and backward passes.
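The idea behind gradient accumulation can be sketched without any deep learning framework. The example below is not the training loop used in this work: it assumes a hypothetical one-parameter model y = w·x with a mean squared error loss, and it leaves out mixed precision entirely. It only shows that gradients summed over equally sized micro-batches, each scaled by the number of accumulation steps, reproduce the update of one large batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the one-parameter model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_step(w, xs, ys, micro_batch, lr):
    """One parameter update assembled from several equally sized micro-batches."""
    assert len(xs) % micro_batch == 0, "sketch assumes equally sized micro-batches"
    steps = len(xs) // micro_batch
    grad = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Dividing by the number of accumulation steps keeps the running sum
        # equal to the gradient of the mean loss over the whole effective batch.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return w - lr * grad  # the parameter changes only after all steps


# Hypothetical data generated by y = 2x.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [2.0 * x for x in xs]

w_full = 0.0 - 0.01 * mse_grad(0.0, xs, ys)  # one batch of 8 samples
w_accum = accumulated_step(0.0, xs, ys, micro_batch=2, lr=0.01)  # 4 batches of 2
```

Both updates agree up to floating-point rounding, which is why accumulation can stand in for a larger batch when GPU memory is limited.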
The training process leverages the Adam optimizer [36], known for its adaptive learning rate adjustments based on the first and second moments of the gradient, which accelerates convergence. Additionally, a CosineAnnealingLR [37] learning rate scheduler is used, gradually decreasing the learning rate in a cosine pattern over each training cycle. This approach enables exploration across a range of learning rates and avoids premature convergence to local minima, allowing the model to converge towards an optimal solution. These careful choices in hyperparameters and training setup ensure that the model trains effectively, achieving improved performance in the segmentation of complex urban environments.
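The shape of a cosine annealing schedule can be sketched in closed form. The initial learning rate of 0.001 is taken from the text; the cycle length `t_max=50` and the floor `eta_min=0.0` are assumptions chosen for illustration, since neither value is stated here.

```python
import math


def cosine_annealing_lr(epoch, base_lr=0.001, t_max=50, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Learning rate at every epoch of one assumed 50-epoch cycle.
schedule = [cosine_annealing_lr(epoch) for epoch in range(51)]
```

Under these assumptions the learning rate starts at 0.001, passes through half that value at the midpoint of the cycle, and approaches zero at the end, changing slowly near both ends and fastest in the middle.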

2.2.3. Training and Validation

In this subsection, we outline the methodology used to partition the segmented parts and evaluate model performance to optimize accuracy and ensure a robust assessment. The unsupervised segmented parts are split into training (80%) and test (20%) sets. To maintain a balanced representation of all classes, we apply stratified partitioning [38], ensuring that class distributions are preserved across both sets. This prevents the overrepresentation of dominant categories, while ensuring that minority classes remain adequately represented. Additionally, we ensure that all classes are present in both the training and test datasets, guaranteeing that the model learns from a diverse range of data features and generalizes effectively across different urban structures. This approach supports a more accurate and versatile model performance. Furthermore, this partitioning strategy ensures a clear distinction between model training and evaluation of unseen data, which is crucial for obtaining unbiased results in supervised learning.
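A stratified 80/20 partition of this kind can be sketched as follows. The sketch assumes that each item to be split carries a single class label; the function name `stratified_split`, the random seed, and the toy label counts are hypothetical and are not taken from the datasets.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        # Keep at least one sample of the class on each side when possible.
        n_test = min(n_test, len(members) - 1) if len(members) > 1 else 0
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical, deliberately imbalanced label list.
labels = ["Asphalt"] * 50 + ["Building"] * 30 + ["Sidewalk"] * 10 + ["Car"] * 10
train_idx, test_idx = stratified_split(labels)
```

Because the split is drawn class by class, a minority class such as Sidewalk contributes the same 20% share to the test set as the dominant classes, and every class appears on both sides.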
To further strengthen model training, we apply a K-fold cross-validation [39] approach to the 80% training set. In this process, the training data is divided into five equal parts, or “folds.” For each fold, one subset serves as the validation set, while the remaining four are used for training. This rotation continues until each fold has been used as a validation set once, yielding five separate evaluations of the model. By averaging the results from each fold, this approach minimizes overfitting risks and exposes the model to diverse subsets of the training data, enhancing its generalizability.
Figure 4 illustrates the K-fold cross-validation framework applied in this study, visually representing how we divide the dataset. Each fold takes a turn as the validation set, allowing for a comprehensive evaluation of model consistency and stability. This structured approach to data partitioning and validation supports the robustness and reliability of the supervised model in accurately capturing and generalizing features across the segmented dataset.
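The fold rotation can be sketched in a few lines. This is an illustration under simplifying assumptions: the folds are contiguous and unshuffled, the sample count of 80 is arbitrary, and the scoring function passed to `cross_validate` is a placeholder standing in for training and validating a model.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(n_samples, evaluate, k=5):
    """Average a user-supplied score over the k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores


folds = list(k_fold_indices(80))
# Placeholder scorer (hypothetical): it only reports the validation share.
mean_score, fold_scores = cross_validate(
    80, lambda train, val: len(val) / (len(train) + len(val))
)
```

In practice the placeholder would be replaced by a function that trains on `train` and returns a validation metric on `val`, and the five values would be averaged as described above.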

2.2.4. Testing the Model

To evaluate the performance of the proposed model, various quantitative metrics are employed, including precision, recall, and F1-score, as defined in [40], along with the mean intersection over union (mIoU) and overall accuracy (OA), as defined in [41]. In the context of point cloud segmentation, accuracy measures the model’s overall correctness by calculating the proportion of correctly segmented points out of the total points. Precision evaluates the model’s ability to correctly identify relevant segments by calculating the proportion of true positives among all predicted positive segments, offering insight into its specificity. Recall (or sensitivity) assesses the model’s capacity to capture true positives within the actual positive segments, which is critical for evaluating how well the model identifies all relevant instances within the point cloud. The F1-score combines precision and recall into a single metric by calculating their harmonic mean, providing a balanced assessment that considers both false positives and false negatives. In addition to these metrics, mIoU is a widely used evaluation metric in point cloud segmentation, providing a class-balanced assessment of the model’s performance across different classes. IoU measures the overlap between the predicted and ground truth segments by computing the ratio of their intersection to their union. mIoU extends this by averaging the IoU scores across all classes, so that every class contributes equally to the evaluation of segmentation quality. A higher mIoU indicates better segmentation performance, as it reflects the model’s ability to correctly classify points while minimizing misclassification errors across multiple categories.
Macro and micro averaging [40] are also utilized to provide deeper insights into the model’s performance across multiple classes. Macro precision, recall, and F1-score calculate the metrics independently for each class and then average them, treating all classes equally regardless of their size. This approach highlights how well the model performs on a class-by-class basis, even for smaller or less represented classes. On the other hand, micro precision, recall, and F1-score aggregate the contributions of all classes to calculate the metrics globally, giving greater weight to larger classes. Together, these metrics ensure a comprehensive evaluation of the model’s segmentation performance, balancing the contributions of all classes while addressing potential imbalances in class distributions within the point cloud data.
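The metrics above can all be derived from a confusion matrix, as the following sketch shows. The function name `segmentation_metrics` and the three-class matrix are hypothetical; the matrix is constructed so that one small class is poorly recalled, which makes the difference between macro and micro averaging visible.

```python
def segmentation_metrics(confusion):
    """confusion[i][j] = number of points of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = [confusion[c][c] for c in range(n)]
    fp = [sum(confusion[r][c] for r in range(n)) - tp[c] for c in range(n)]
    fn = [sum(confusion[c]) - tp[c] for c in range(n)]

    def ratio(num, den):
        return num / den if den else 0.0

    precision = [ratio(tp[c], tp[c] + fp[c]) for c in range(n)]
    recall = [ratio(tp[c], tp[c] + fn[c]) for c in range(n)]
    f1 = [ratio(2 * p * r, p + r) for p, r in zip(precision, recall)]
    iou = [ratio(tp[c], tp[c] + fp[c] + fn[c]) for c in range(n)]
    micro_p = ratio(sum(tp), sum(tp) + sum(fp))
    micro_r = ratio(sum(tp), sum(tp) + sum(fn))
    return {
        "overall_accuracy": ratio(sum(tp), total),
        "macro_precision": sum(precision) / n,  # every class weighted equally
        "macro_recall": sum(recall) / n,
        "macro_f1": sum(f1) / n,
        "micro_precision": micro_p,  # every point weighted equally
        "micro_recall": micro_r,
        "micro_f1": ratio(2 * micro_p * micro_r, micro_p + micro_r),
        "iou": iou,
        "miou": sum(iou) / n,
    }


# Hypothetical 3-class confusion matrix; the third class is small and poorly recalled.
toy = [[90, 5, 5],
       [4, 95, 1],
       [6, 2, 2]]
scores = segmentation_metrics(toy)
```

When every point has exactly one true label and one predicted label, micro precision, micro recall, and micro F1-score all coincide with overall accuracy, whereas the macro scores drop as soon as a small class is poorly segmented.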

2.3. Point Translation

The final step of SAMNet++, point translation, involves mapping the segmented points back to their original geospatial coordinates. During the preprocessing stage, the original geospatial coordinates of the point cloud are normalized and transformed into a local coordinate system centered around the dataset’s centroid. This step, involving both a shift and scaling, is performed to simplify the processing and to ensure numerical stability during the segmentation task. Point translation reverses this transformation, restoring the segmented points to their original positions within the global geospatial coordinate system. To ensure that the point translation process does not introduce errors, we verify the consistency between the translated point cloud and the original point cloud. This is achieved by computing the absolute positional difference between the original and translated points and estimating the mean squared error (MSE), which is less than 0.01 cm² for the high-density point clouds. Additionally, visual inspections and nearest-neighbor comparisons are performed to confirm spatial alignment. These verification steps ensure that the transformation is accurate and that the spatial relationships of the segmented points remain intact.
By preserving the spatial relationships after translation, the integrity of the segmentation results is maintained throughout the entire pipeline. This ensures that each segmented point remains correctly positioned within the dataset, allowing for accurate evaluation of the model’s performance. With the points restored to their initial positions, the segmentation results can be assessed in a real-world context, facilitating the identification of discrepancies and the evaluation of segmentation precision.
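The normalization and its inverse can be sketched as a round trip. The text specifies only a shift to the centroid and a scaling, so dividing by the largest absolute offset is an assumption, as are the function names and the three sample coordinates (given in metres).

```python
def normalize(points):
    """Shift points to their centroid and scale them into [-1, 1]."""
    n = len(points)
    centroid = tuple(sum(p[a] for p in points) / n for a in range(3))
    shifted = [tuple(p[a] - centroid[a] for a in range(3)) for p in points]
    # Assumed scale factor: the largest absolute offset from the centroid.
    scale = max(abs(v) for p in shifted for v in p) or 1.0
    local = [tuple(v / scale for v in p) for p in shifted]
    return local, centroid, scale


def translate_back(local, centroid, scale):
    """Invert normalize(): undo the scaling, then undo the shift."""
    return [tuple(v * scale + centroid[a] for a, v in enumerate(p)) for p in local]


def mean_squared_error(a, b):
    """Mean squared 3D distance between corresponding points."""
    return sum((u - v) ** 2 for p, q in zip(a, b) for u, v in zip(p, q)) / len(a)


# Hypothetical projected coordinates (easting, northing, height) in metres.
original = [
    (630512.41, 4834020.17, 92.35),
    (630540.02, 4834051.88, 95.10),
    (630498.77, 4834003.64, 91.02),
]
local, centroid, scale = normalize(original)
restored = translate_back(local, centroid, scale)
mse_m2 = mean_squared_error(original, restored)  # 0.01 cm^2 equals 1e-6 m^2
```

Storing the centroid and scale factor alongside the normalized points is what makes the transformation reversible; the residual error then comes only from floating-point rounding.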

3. Data Collection and Preprocessing

In this research, a UAS is utilized to collect detailed datasets from two distinct locations in Ontario, Canada. A multi-sensor data fusion approach is leveraged to simultaneously obtain both visual and topographic information, enhancing the richness of the collected data. For both datasets, a DJI Matrice 300 drone equipped with a DJI Zenmuse L1 LiDAR payload [28] is employed. The DJI Zenmuse L1 payload integrates a solid-state LiDAR sensor, a GNSS receiver, an IMU, and a 20-megapixel RGB camera. Through this integrated system, LiDAR data and high-quality visual imagery are simultaneously acquired during each flight operation. The RGB camera captures images at a resolution of 5472 × 3078 pixels, providing highly detailed visual data. Meanwhile, the LiDAR sensor is configured to offer a wide field of view (70.4° horizontally and 77.2° vertically), with a maximum range of 450 m and an accuracy of ±3 cm at a distance of 100 m, making it well-suited for high-resolution 3D mapping.
For the first dataset, a flight plan covering part of the TMU Campus is designed, consisting of seven parallel track lines over an area of approximately 76,000 square meters. This mission is completed within five minutes, with the UAS maintaining an altitude of around 60 m above ground level. This efficient operation enables the collection of approximately 42 million LiDAR points and 208 high-quality RGB images, which provide a detailed representation of the area. To ensure precise georeferencing, fifteen ground control points (GCPs) are strategically placed across the site. These GCPs are precisely located using a Trimble R8s GNSS receiver operating in Real-Time Kinematic (RTK) mode, providing high positional accuracy for the data. Additionally, real-time corrections are provided by a D-RTK 2 GNSS base station to the UAS’s onboard GNSS RTK receiver to ensure accurate execution of the flight mission.
The second dataset is collected with a lower flight altitude (25 m above ground level) to capture finer details, resulting in approximately 123 million LiDAR points and 224 high-resolution RGB images. The same GNSS RTK setup is utilized as in the first dataset, ensuring centimeter-level positional accuracy, which is essential for precise georeferencing of the collected data. Figure 5 provides a visual representation of the flight paths for both datasets.

3.1. Sensor Fusion

The sensor fusion technique used here leverages the complementary strengths of each sensor: GNSS ensures precise global positioning of the platform, the IMU tracks motion and orientation changes, the LiDAR captures high-resolution depth and structural information, and the RGB camera provides surface color and texture details. These data sets are integrated using the DJI Terra software [42]. The final output, produced entirely through DJI Terra, is a high-resolution, georeferenced, and colorized 3D point cloud that integrates the spatial accuracy of LiDAR with the detailed surface information provided by the RGB imagery. Figure 6 illustrates the georeferenced and colorized point clouds for both datasets.

3.2. Annotation

The 3D point cloud datasets are meticulously annotated using CloudCompare software [43], as illustrated in Figure 7. This process enables precise categorization of each point into predefined classes for each dataset. In the first dataset, points are labeled into six categories: 1-Asphalt, 2-Building, 3-Car, 4-Grass, 5-Tree, and 6-Sidewalk. Similarly, in the second dataset, the classes are largely the same, except that the ‘Car’ segment is replaced by ‘Tank’, which represents ‘Gas Tank’. During the annotation process, points that do not fit any of these specified classes are assigned to a separate category labeled as ‘Unclassified’. This ensures that every point in the dataset is accounted for, even if it does not correspond to the primary classes.
The annotation process involves visual inspection and segmentation based on geometric features, appearance, and contextual understanding, using CloudCompare’s tools to define and label each point accurately. The labeled point clouds serve as the ground truth, providing a reliable reference for training the supervised component of the proposed model. This ground truth dataset is crucial for the model to learn the specific characteristics of each class in the segmented parts, enhancing its ability to segment similar features in new data accurately. Additionally, the ground truth labels are used to validate the model’s performance by comparing predicted segmentation to the manually annotated data, ensuring robustness and reliability in the segmentation outcomes.

4. Results

We evaluate the proposed model on both our collected experimental datasets, as described in Section 3, and the publicly available Toronto-3D dataset [41]. The performance of our model is first analyzed on our datasets, where it is compared against several baseline methods to demonstrate its effectiveness. Additionally, we assess its generalizability by benchmarking it against SOTA models on Toronto-3D. The following subsections provide a detailed analysis of these evaluations.

4.1. Evaluation on Our Experimental Datasets

The initial unsupervised segmentation generated by the SAM LiDAR is assessed, examining its ability to identify and isolate regions of interest within the colorized point cloud data. Following this, we evaluate the refined segmentation produced by the adopted PointNet++ model, which enhances classification accuracy using labeled data. Our results include both quantitative and qualitative analyses, utilizing metrics such as precision, recall, and overall accuracy at each stage to provide a comprehensive view of the model’s performance.
We utilize the unsupervised method, named SAM LiDAR, to segment both datasets, as shown in Figure 8. For the first dataset, this method results in 172 distinct segments, while the second dataset is divided into 47 segments. These segmented areas serve as the basis for training, validation, and testing of the adopted PointNet++ model.
The implementation of the adopted PointNet++ begins with the random selection of segmented parts and their corresponding annotations for training and validation. This step ensures a diverse representation of the data for effective learning. To further strengthen the model’s performance and prevent overfitting, a 5-fold cross-validation technique is applied to the training data. After completing the training phase and validating the model through cross-validation, the model is tested using separate parts of the dataset reserved for testing. Table 1 presents the performance of the proposed model, employing various quantitative metrics. The results are compared with two well-known methods, PointNet [16] and PointNet++ [17], to verify the effectiveness of the proposed model.
The confusion matrix results for the testing phase are shown in Figure 9 and Figure 10 for PointNet++ and SAMNet++, respectively. Due to the low accuracy of PointNet on both datasets, it is excluded from the subsequent comparisons. Figure 11 and Figure 12 display the outputs of PointNet++ and SAMNet++ for the two datasets, respectively, following point translation. These visualizations depict the outputs of both models representing the adjusted points within the 3D space.
The processing time for PointNet, PointNet++, and SAMNet++ is calculated by running each algorithm over 50 epochs, with the average runtime serving as a reliable measure of the time required for training and testing on large-scale, real-world datasets. This evaluation is conducted on a Dell Precision 5820 Tower workstation [44], equipped with an NVIDIA T1000 8 GB GPU [45] and 64 GB of RAM, to ensure consistent and robust computational conditions for all models. By leveraging this hardware, the analysis highlights the computational efficiency and scalability of each model in handling complex, high-density point cloud data. The results, summarized in Table 2, provide a comparative view of the processing demands for each algorithm. The use of 50 epochs as a standardized benchmark not only ensures a fair comparison but also captures any variations in the computational load that may arise during training.

4.2. Evaluation on a Public Dataset

To further assess the generalizability of our proposed model, we evaluate its performance on the publicly available Toronto-3D dataset [41]. This dataset provides a large-scale, high-density point cloud captured using a vehicle-mounted Mobile LiDAR System (MLS) in an urban environment. It consists of approximately 78.3 million points, with each point containing ten attributes, including 3D coordinates, RGB color, intensity, GPS time, and scan angle. The dataset is divided into four sections, three of which are used for training, while the remaining section is used for testing. Additionally, Toronto-3D includes manually labeled annotations across eight semantic categories, such as roads, road markings, sidewalks, and buildings, making it a valuable benchmark for evaluating segmentation models in real-world scenarios. We compare our model’s performance against some SOTA approaches using this dataset, with quantitative results summarized in Table 3.

5. Discussion

We analyze the SAMNet++ model’s performance in 3D point cloud segmentation, examining how each stage of our integrated approach—from initial unsupervised segmentation to final supervised refinement—improves segmentation quality. The analysis highlights the contributions and effectiveness of each component in addressing the challenges of complex urban datasets.
The segmentation process begins with unsupervised segmentation to identify regions of interest within colorized 3D point clouds. This initial unsupervised step offers a fast, data-driven method to isolate segments without requiring labeled data. In addition, the unsupervised segmentation process leverages color features in segmenting the point cloud along with spatial and geometric characteristics. However, its results reveal limitations, as illustrated in Figure 8, where it often fragments individual classes into multiple segments, underscoring challenges in capturing fine-grained distinctions. This limitation points to the need for structured class guidance, especially for complex urban datasets where similar features are often grouped together or dispersed. To address these challenges, we integrate the adopted PointNet++ model to refine and correct initial segmentations because the supervised component of PointNet++ brings greater class distinction, accuracy, and precision, which is essential for detailed segmentation in multi-class urban datasets.
The outputs of PointNet++ and SAMNet++ after point translation, displayed in Figure 11 and Figure 12, along with the corresponding ground truth in Figure 7, offer a visual comparison of each model’s segmentation effectiveness. SAMNet++ consistently maintains higher segmentation quality, especially for the second dataset, even after translation, illustrating its robustness in real-world scenarios. The clear differentiation between SAMNet++ and PointNet++ further emphasizes SAMNet++’s superior capability in preserving segment accuracy and boundary clarity.
Table 1 compares the performance of SAMNet++, PointNet, and PointNet++ on both datasets. The relatively simple architecture of PointNet, while efficient, struggles to capture complex spatial relationships in large-scale urban datasets, resulting in significantly lower accuracy and F1-scores. Its limited feature extraction ability hinders accurate classification, particularly in scenes with high structural complexity [56]. PointNet++, with its hierarchical structure, shows marked improvement over PointNet, as it better captures spatial relationships within the data. However, PointNet++ still falls short of achieving the segmentation accuracy required for detailed urban mapping, indicating the need for further refinement in handling intricate or varied features within dense point clouds.
In contrast, SAMNet++ excels across all performance metrics, demonstrating its capacity to handle complex data structures effectively. This improved performance can be attributed to SAMNet++’s advanced feature extraction and hybrid architecture, which seamlessly integrates unsupervised segmentation using SAM LiDAR with supervised learning through the adoption of PointNet++. The combination of these methods enables SAMNet++ to balance computational efficiency with segmentation accuracy, allowing it to adapt to complex data patterns and identify class boundaries with higher precision.
As shown in Table 1, SAMNet++ demonstrates substantial improvements over PointNet++, particularly in terms of accuracy and precision metrics. For dataset 1, SAMNet++ achieves a 13% improvement in accuracy, micro precision, and macro precision. The recall metrics also show notable gains, with a 13% improvement in micro recall and a 28% increase in macro recall (from 70% to 98%). Similarly, the F1-scores exhibit significant enhancements, with a 17% increase in the micro F1-score and a 28% improvement in the macro F1-score (from 66% to 94%). For dataset 2, SAMNet++ also outperforms PointNet++. It achieves a 9% increase in accuracy and micro precision, a 9% rise in micro recall, a 15% improvement in macro precision (from 78% to 93%), and a 35% increase in macro recall (from 63% to 98%). The larger improvement in macro metrics compared to micro metrics highlights SAMNet++’s superior ability to classify less frequent or minority classes effectively. The significant gains in macro metrics with SAMNet++—particularly in recall and F1-score—indicate that the model has successfully addressed weaknesses in PointNet++ related to class imbalances. For example, classes with fewer instances or overlapping features (e.g., Sidewalk) are better differentiated and classified by SAMNet++, as reflected by its higher macro recall and F1-score compared to PointNet++. These improvements demonstrate SAMNet++’s ability to handle complex and imbalanced datasets effectively, ensuring robust performance across all classes. By excelling in both micro and macro metrics, SAMNet++ achieves balanced performance. While strong micro metrics ensure overall high accuracy by correctly classifying the majority of points, improved macro metrics confirm that the model does not overlook minority classes. This balanced performance enables SAMNet++ to significantly surpass PointNet++, addressing both class imbalances and overall segmentation quality. 
The consistency and magnitude of these improvements underscore its capability for more accurate and equitable classification across datasets, making it highly effective in real-world scenarios with diverse and imbalanced class distributions.
The results from the confusion matrices in Figure 9 and Figure 10 provide a comprehensive evaluation of the models’ performance. Figure 13 and Figure 14 emphasize these improvements by visualizing segmentation outputs for specific highlighted regions that correspond to the confusion matrices. For dataset 1, SAMNet++ achieves a significant boost in accuracy for challenging classes like Sidewalk, increasing from 29% (PointNet++) to 99%. This is clearly shown in Figure 14d, where PointNet++ misclassifies most of the Sidewalk as Tree, in contrast to the SAMNet++ output in Figure 14c. This improvement is attributed to its hybrid learning architecture, which combines the strengths of unsupervised and supervised models. This approach enables SAMNet++ to capture finer spatial and contextual details, particularly for occluded classes like Sidewalk, which often overlap with other urban elements such as Asphalt or Grass. Similarly, Asphalt accuracy rises from 54% to 98%; PointNet++ misclassifies it as Tree, whereas SAMNet++ does not, as shown in Figure 14g,f. Other classes, including Building, Car, Grass, and Tree, also see consistent gains; for example, Tree accuracy improves from 92% to 97%. For dataset 2, SAMNet++ achieves remarkable accuracy across all categories. PointNet++ achieves the following per-class accuracies: Unclassified (46%), Asphalt (96%), Building (96%), Tank (8%), Grass (84%), and Sidewalk (36%). In contrast, SAMNet++ delivers significantly higher accuracies: Unclassified (95%), Asphalt (98%), Building (99%), Tank (100%), Grass (98%), and Sidewalk (99%). SAMNet++ thus demonstrates substantial improvements over PointNet++ on both datasets, particularly in challenging classes. As illustrated in Figure 15, the PointNet++ output exhibits spatial label leakage, where class boundaries are not well-defined. For example, in Figure 15d,g, the Grass class boundary extends inaccurately, overlapping with regions labeled as Sidewalk and Asphalt.
SAMNet++ consistently produces results that align closely with the ground truth, effectively addressing the limitations of PointNet++. SAMNet++ overcomes these challenges, particularly for complex urban features such as Tank and Unclassified, where PointNet++ exhibits significant misclassification. These visual and quantitative analyses collectively demonstrate SAMNet++’s robustness and its ability to achieve precise segmentation for complex and overlapping urban features. By combining confusion matrix analyses with segmentation visualizations, it is evident that SAMNet++ delivers more accurate and reliable segmentation outcomes, outperforming PointNet++ across both datasets.
To rigorously evaluate the performance improvements of SAMNet++ over PointNet++, we conduct statistical significance tests on the F1-scores obtained from 10 independent runs. Specifically, we perform a Wilcoxon signed-rank test [57], a non-parametric test suitable for comparing paired samples without assuming normality. The test yields a p-value of less than 0.000002, which is well below the conventional significance threshold of 0.05. This result provides strong evidence that the improvements in F1-score achieved by SAMNet++ are statistically significant and unlikely to be due to random variation. This statistical validation reinforces the effectiveness of SAMNet++ in point cloud segmentation and demonstrates its superiority over PointNet++ with a high degree of confidence.
Processing efficiency is another critical aspect of our comparison. Table 2 summarizes the average processing time per epoch for PointNet, PointNet++, and SAMNet++ across two datasets of different sizes. In dataset 1, which contains 42 million points, SAMNet++ exhibits the shortest processing time, averaging 3565 s per epoch, compared to 7242 s for PointNet++ and 57,404 s for PointNet. This represents a reduction in processing time of approximately 51% compared to PointNet++ and 94% compared to PointNet, highlighting SAMNet++’s computational efficiency due to its hybrid learning structure, which optimizes computational demands without compromising segmentation quality. In the larger dataset 2, containing 123 million points, SAMNet++ again demonstrates superior processing efficiency, requiring only 9560 s per epoch, compared to 20,243 s for PointNet++ and 180,104 s for PointNet. This translates to a 53% improvement in processing time over PointNet++ and a 95% improvement over PointNet. One contributing factor to SAMNet++’s efficiency in dataset 2 is the segmentation approach of SAM LiDAR, which segments the data into fewer parts compared to dataset 1. This reduced segmentation complexity enables SAMNet++ to refine the segmentation more quickly, requiring less processing time per epoch.
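The relative reductions follow directly from the per-epoch times quoted above, as the short calculation below shows. The times are the values reported in the text; the dictionary layout and the function name `reduction_percent` are only illustrative.

```python
# Average seconds per epoch, as quoted in the text from Table 2.
SECONDS_PER_EPOCH = {
    "dataset 1": {"PointNet": 57404, "PointNet++": 7242, "SAMNet++": 3565},
    "dataset 2": {"PointNet": 180104, "PointNet++": 20243, "SAMNet++": 9560},
}


def reduction_percent(baseline, proposed):
    """Percentage of the baseline time saved by the proposed model."""
    return 100.0 * (baseline - proposed) / baseline


# Reduction achieved by SAMNet++ relative to each baseline, rounded to 0.1%.
reductions = {
    name: {
        model: round(reduction_percent(seconds, times["SAMNet++"]), 1)
        for model, seconds in times.items()
        if model != "SAMNet++"
    }
    for name, times in SECONDS_PER_EPOCH.items()
}
```

For dataset 1 this gives reductions of about 50.8% relative to PointNet++ and 93.8% relative to PointNet; for dataset 2 the corresponding values are about 52.8% and 94.7%.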
The performance evaluation presented in Table 3 demonstrates the effectiveness of SAMNet++ on the public dataset (Toronto-3D), achieving a strong OA of 96.90% and the highest mIoU of 86.93% among all methods. While SAMNet++ does not achieve the highest OA, with SFL-Net* (97.9%) and LACV-Net* (97.4%) reporting slightly higher values, its significantly higher mIoU reflects a more balanced and precise segmentation across all classes. This suggests that SAMNet++ provides superior per-class segmentation performance, particularly in challenging categories where other models struggle.
A detailed per-class analysis reveals that SAMNet++ excels in Nature (98.99%) and Building (99.10%), outperforming all other methods. Similarly, in Utility Line (91.26%) and Pole (92.06%), it achieves the highest IoU, surpassing the previous best results from ResDLPS-Net* (86.82% and 79.95%, respectively) and RandLA-Net* (88.06% and 77.84%, respectively). The model also demonstrates substantial improvement in the Car category with an IoU of 97.86%, surpassing methods such as EyeNet (94.02%) and LACV-Net* (93.4%). The most significant improvement is observed in the Fence category, where SAMNet++ achieves an IoU of 77.67%, a remarkable advancement over the previous best-performing model, MappingConvSeg* (44.11%). This substantial increase highlights the ability of SAMNet++ to accurately capture intricate and irregular structures that are typically difficult to segment. Additionally, a qualitative comparison in Figure 13 illustrates the accuracy of SAMNet++ in real-world segmentation tasks. The visual comparison between Figure 13b (ground truth) and Figure 13c (SAMNet++ output) demonstrates how effectively the model preserves structural details and correctly classifies fine-grained elements in the scene. This visual evidence reinforces the quantitative findings, showcasing the model’s superior segmentation capabilities, especially in complex and challenging environments.
While SAMNet++ demonstrates significant improvements in segmentation accuracy, particularly over PointNet and PointNet++, it still has some limitations that warrant further discussion. One key drawback is its dependence on RGB color information, which may restrict its applicability to datasets that lack color information, such as pure LiDAR scans or grayscale depth images. Additionally, the model’s performance is sensitive to parameter tuning, particularly in the segmentation adjustment step using SAM LiDAR, which directly impacts classification accuracy. Another notable challenge is the difficulty in segmenting certain categories, such as “Sidewalk” and “Tank,” as observed in the confusion matrices. Misclassifications occur due to similarities in spectral characteristics with other categories or structural ambiguities within the dataset. For example, in our dataset, “Sidewalk” is sometimes misclassified as “Unclassified,” likely due to partial occlusions or similar reflectance properties, while “Tank” is also subject to misclassification, suggesting difficulties in distinguishing it from surrounding structures. Beyond these category-specific challenges, SAMNet++ also exhibits variability in performance across different feature types. In particular, while it performs well overall, its segmentation accuracy for fine-scale structures like Road Markings remains an area for improvement. On the Toronto-3D dataset, SAMNet++ achieves an IoU of 53.62% for Road Markings (Table 3), which, although an improvement over KPFCNN, still lags behind SFL-Net* (70.7%) and MappingConvSeg* (67.87%). One key factor contributing to this gap is the quality of the LiDAR data used, which produces relatively sparse and discontinuous point clouds, making it difficult to accurately delineate road markings.
Similarly, while SAMNet++ achieves a competitive 95.37% IoU for the Road class, it falls slightly behind ResDLPS-Net* [55] (95.82%), LACV-Net* [53] (97.1%), and SFL-Net* [54] (97.7%), indicating room for improvement in surface segmentation.
These failure cases highlight areas where SAMNet++ could be further improved, particularly in handling low-density and fine-grained features. The model’s reliance on RGB information may also contribute to these errors, as lane markings and other small-scale features might be more distinguishable with higher-resolution geometric data. Moreover, the LiDAR used in this dataset introduces additional challenges, further complicating the segmentation of sparse structures like road markings. Addressing these challenges, either through enhanced feature extraction techniques or adaptive segmentation strategies, could further improve SAMNet++’s performance, particularly for datasets with sparse or occluded structures. Nonetheless, despite these limitations, SAMNet++ sets a new SOTA benchmark in urban point cloud segmentation, demonstrating superior accuracy across most object classes and significantly outperforming prior methods.
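To make the metrics in this discussion concrete, the following is a minimal sketch (not the authors' evaluation code) of how per-class IoU and mIoU are conventionally derived from a confusion matrix. The function names and the three-class counts are hypothetical and serve only as an illustration; the sketch assumes the standard definition IoU = TP / (TP + FP + FN) and an unweighted mean over classes.

```python
def per_class_iou(confusion):
    """Return the IoU of each class from a square confusion matrix.

    confusion[i][j] counts points whose true class is i and predicted class is j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true c, predicted as another class
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, truly another class
        union = tp + fp + fn
        ious.append(tp / union if union else float("nan"))
    return ious


def mean_iou(confusion):
    """Unweighted mean of per-class IoU; classes with no points are skipped."""
    valid = [v for v in per_class_iou(confusion) if v == v]  # NaN != NaN, so NaNs drop out
    return sum(valid) / len(valid)


# Hypothetical 3-class counts (rows: ground truth, columns: prediction).
toy = [
    [90, 5, 5],
    [10, 80, 10],
    [0, 20, 80],
]
ious = per_class_iou(toy)
print([round(v, 4) for v in ious], round(mean_iou(toy), 4))
```

Because every class carries equal weight in the mean, a single weak class (such as a sparse, fine-scale category) lowers mIoU noticeably even when overall accuracy remains high.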

6. Conclusions

In this research, SAMNet++, a novel 3D segmentation model, was developed, which integrates the Segment Anything Model (SAM) with PointNet++. SAM LiDAR was used for the initial unsupervised segmentation of 3D point clouds to identify regions of interest. This initial segmentation was then refined using an adopted PointNet++ model. The proposed model was tested on three urban datasets collected by a UAV equipped with a multisensory system. SAMNet++ achieved accuracy, precision, recall, and F1-score values exceeding 0.97 across all classes on our experimental datasets, significantly outperforming standard PointNet++ and PointNet models in segmentation accuracy. Additionally, the model attained a mean intersection over union (mIoU) of 86.93% on a public benchmark dataset, demonstrating more balanced and precise segmentation across all classes and surpassing previous SOTA methods. Beyond its superior accuracy, SAMNet++ exhibited remarkable computational efficiency, requiring nearly half the processing time of PointNet++ and approximately one-sixteenth of the time needed by the original PointNet algorithm. These results highlight SAMNet++’s effectiveness in handling datasets with complex features requiring precise segmentation.
However, SAMNet++ faced challenges in sparsely populated or occluded regions, leading to reduced segmentation reliability. While the model demonstrated faster processing, further optimization of the segmentation pipeline could enhance computational efficiency. Future work will explore multi-modal fusion techniques, which integrate RGB, LiDAR intensity, and hyperspectral data to improve classification in complex environments. Additionally, self-supervised learning and transformer-based architectures may enhance feature extraction and generalization across diverse datasets. Addressing these challenges could further improve SAMNet++’s adaptability to real-world applications, including autonomous navigation, urban planning, and environmental monitoring.

Author Contributions

Conceptualization, M.S., A.E. and A.E.-R.; methodology, M.S., A.E. and A.E.-R.; software, M.S.; validation, M.S.; formal analysis, M.S.; investigation, M.S.; resources, A.E.-R. and A.E.; dataset curation, M.S., A.E. and A.E.-R.; writing—original draft preparation, M.S.; writing—review and editing, M.S., A.E. and A.E.-R.; supervision, A.E. and A.E.-R.; visualization, M.S.; project administration, A.E.-R.; funding acquisition, A.E.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by Toronto Metropolitan University and the Natural Sciences and Engineering Research Council of Canada (NSERC) RGPIN-2022-03822.

Data Availability Statement

The original contributions featured in this study are incorporated within the article, and any additional questions can be addressed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mo, Y.; Wu, Y.; Yang, X.; Liu, F.; Liao, Y. Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing 2022, 493, 626–646.
  2. Paparoditis, N.; Cord, M.; Jordan, M.; Cocquerez, J.P. Building Detection and Reconstruction from Mid- and High-Resolution Aerial Imagery. Comput. Vis. Image Underst. 1998, 72, 122–142.
  3. Hinz, S.; Baumgartner, A. Automatic extraction of urban road networks from multi-view aerial imagery. ISPRS J. Photogramm. Remote Sens. 2003, 58, 83–98.
  4. Bhadauria, A.; Bhadauria, H.; Kumar, A. Building extraction from satellite images. IOSR J. Comput. Eng. 2013, 12, 76–81.
  5. Klonus, S.; Tomowski, D.; Ehlers, M.; Reinartz, P.; Michel, U. Combined Edge Segment Texture Analysis for the Detection of Damaged Buildings in Crisis Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 1118–1128.
  6. Lohani, B.; Ghosh, S. Airborne LiDAR Technology: A Review of Data Collection and Processing Systems. Proc. Natl. Acad. Sci. India Sect. A Phys. Sci. 2017, 87, 567–579.
  7. Elamin, A.; El-Rabbany, A. UAV-Based Multi-Sensor Data Fusion for Urban Land Cover Mapping Using a Deep Convolutional Neural Network. Remote Sens. 2022, 14, 4298.
  8. Lyu, Y.; Huang, X.; Zhang, Z. Learning to Segment 3D Point Clouds in 2D Image Space. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 12252–12261.
  9. Kuçak, R.A.; Özdemir, E.; Erol, S. The Segmentation of Point Clouds With K-Means and Ann (Artifical Neural Network). In The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences; Copernicus GmbH: Gottingen, Germany, 2017; pp. 595–598.
  10. Zhao, J.; Li, C.; Tian, L.; Zhu, J.; Zhou, J.; Verikas, A.; Radeva, P.; Nikolaev, D. FPFH-based graph matching for 3D point cloud registration. In Proceedings of the Tenth International Conference on Machine Vision (ICMV 2017), Vienna, Austria, 13–15 November 2017; SPIE: Bellingham, WA, USA, 2017.
  11. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep Learning for 3D Point Clouds: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4338–4364.
  12. Muhammad Yasir, S.; Ahn, H. Deep Learning-Based 3D Instance and Semantic Segmentation: A Review. J. Artif. Intell. 2022, 4, 99–114.
  13. Zhang, R.; Wu, Y.; Jin, W.; Meng, X. Deep-Learning-Based Point Cloud Semantic Segmentation: A Survey. Electronics 2023, 12, 3642.
  14. Zhang, Z.; Yang, B.; Wang, B.; Li, B. GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 17619–17629.
  15. Poux, F.; Mattes, C.; Kobbelt, L. Unsupervised Segmentation of Indoor 3D Point Cloud: Application to Object-Based Classification. In International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences; Copernicus GmbH: Gottingen, Germany, 2020; Volume XLIV-4/W1-2020, pp. 111–118.
  16. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 77–85.
  17. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  18. Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.A.A.K.; Elhoseiny, M.; Ghanem, B. PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022.
  19. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3D Semantic Parsing of Large-Scale Indoor Spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 1534–1543.
  20. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 9296–9306.
  21. Hou, Y.; Zhu, X.; Ma, Y.; Loy, C.C.; Li, Y. Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 8469–8478.
  22. El Madawi, K.; Rashed, H.; El Sallab, A.; Nasr, O.; Kamel, H.; Yogamani, S. RGB and LiDAR fusion based 3D Semantic Segmentation for Autonomous Driving. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; IEEE: New York, NY, USA, 2019; pp. 7–12.
  23. Strom, J.; Richardson, A.; Olson, E. Graph-based segmentation for colored 3D laser point clouds. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; IEEE: New York, NY, USA, 2010; pp. 2131–2136.
  24. Krispel, G.; Opitz, M.; Waltner, G.; Possegger, H.; Bischof, H. FuseSeg: LiDAR Point Cloud Segmentation Fusing Multi-Modal Data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; IEEE: New York, NY, USA, 2020; pp. 1863–1872.
  25. Zhuang, Z.; Li, R.; Jia, K.; Wang, Q.; Li, Y.; Tan, M. Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 16260–16270.
  26. Yarroudh, A. LiDAR Automatic Unsupervised Segmentation Using Segment-Anything Model (SAM) from Meta AI. Available online: https://github.com/Yarroudh/segment-lidar (accessed on 1 August 2024).
  27. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643.
  28. Zenmuse L1 Specification. Available online: https://enterprise.dji.com/zenmuse-l1/specs (accessed on 20 May 2021).
  29. Cristóvão, M.P.; Portugal, D.; Carvalho, A.E.; Ferreira, J.F. A LiDAR-Camera-Inertial-GNSS Apparatus for 3D Multimodal Dataset Collection in Woodland Scenarios. Sensors 2023, 23, 6676.
  30. Osco, L.P.; Wu, Q.; de Lemos, E.L.; Gonçalves, W.N.; Ramos, A.P.M.; Li, J.; Marcato, J. The Segment Anything Model (SAM) for remote sensing applications: From zero to one shot. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103540.
  31. Wu, Q.; Osco, L.P. samgeo: A Python package for segmenting geospatial data with the Segment Anything Model (SAM). J. Open Source Softw. 2023, 8, 5663.
  32. Graham, L. The LAS 1.4 Specification. Photogramm. Eng. Remote Sens. 2012, 78, 93–102.
  33. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  34. Bridle, J.S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures and Applications; Springer: Berlin/Heidelberg, Germany, 1990; pp. 227–236.
  35. Automatic Mixed Precision Package—Torch.Amp. Available online: https://pytorch.org/docs/stable/amp.html (accessed on 1 November 2024).
  36. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
  37. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983.
  38. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  39. Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. Ser. B Methodol. 1974, 36, 111–147.
  40. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
  41. Tan, W.; Qin, N.; Ma, L.; Li, Y.; Du, J.; Cai, G.; Yang, K.; Li, J. Toronto-3D: A Large-scale Mobile LiDAR Dataset for Semantic Segmentation of Urban Roadways. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 797–806.
  42. DJI Terra Software (Version 4.0.0). Available online: https://enterprise.dji.com/dji-terra (accessed on 30 April 2024).
  43. Girardeau-Montaut, D. CloudCompare: 3D Point Cloud and Mesh Processing Software, Open Source Project; Version 2.13.1; Telecom ParisTech: Paris, France, 2024. Available online: https://www.cloudcompare.org/ (accessed on 1 November 2024).
  44. Precision 5820 Tower Workstation. Available online: https://www.dell.com/en-ca/shop/workstations/precision-5820-tower-workstation/spd/precision-5820-workstation (accessed on 1 November 2024).
  45. NVIDIA® T1000, 8 GB GDDR6, full height, PCIe 3.0x16, 4 mDP Graphics Card. Available online: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/productspage/quadro/quadro-desktop/nvidia-t1000-datasheet-1987414-r4.pdf (accessed on 1 November 2024).
  46. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 146.
  47. Thomas, H.; Qi, C.R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; Guibas, L. KPConv: Flexible and Deformable Convolution for Point Clouds. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 6410–6419.
  48. Ma, L.; Li, Y.; Li, J.; Tan, W.; Yu, Y.; Chapman, M.A. Multi-Scale Point-Wise Convolutional Neural Networks for 3D Object Segmentation From LiDAR Point Clouds in Large-Scale Environments. IEEE Trans. Intell. Transp. Syst. 2021, 22, 821–836.
  49. Li, Y.; Ma, L.; Zhong, Z.; Cao, D.; Li, J. TGNet: Geometric Graph CNN on 3-D Point Cloud Segmentation. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3588–3600.
  50. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Learning Semantic Segmentation of Large-Scale Point Clouds With Random Sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 8338–8354.
  51. Yan, K.; Hu, Q.; Wang, H.; Huang, X.; Li, L.; Ji, S. Continuous Mapping Convolution for Large-Scale Point Clouds Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6502505.
  52. Yoo, S.; Jeong, Y.; Jameela, M.; Sohn, G. Human Vision Based 3D Point Cloud Semantic Segmentation of Large-Scale Outdoor Scenes. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 6577–6586.
  53. Zeng, Z.; Xu, Y.; Xie, Z.; Tang, W.; Wan, J.; Wu, W. Large-scale point cloud semantic segmentation via local perception and global descriptor vector. Expert Syst. Appl. 2024, 246, 123269.
  54. Li, X.; Zhang, Z.; Li, Y.; Huang, M.; Zhang, J. SFL-NET: Slight Filter Learning Network for Point Cloud Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5703914.
  55. Du, J.; Cai, G.; Wang, Z.; Huang, S.; Su, J.; Marcato Junior, J.; Smit, J.; Li, J. ResDLPS-Net: Joint residual-dense optimization for large-scale point cloud semantic segmentation. ISPRS J. Photogramm. Remote Sens. 2021, 182, 37–51.
  56. Nurunnabi, A.; Teferle, F.N.; Li, J.; Lindenbergh, R.C.; Parvaz, S. Investigation of Pointnet for Semantic Segmentation of Large-Scale Outdoor Point Clouds. In International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences; Copernicus GmbH: Gottingen, Germany, 2021; Volume XLVI-4/W5, pp. 397–404.
  57. Wilcoxon, F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945, 1, 80–83.
Figure 1. The Steps in this Research Framework.
Figure 2. Design of the proposed model architecture, with layer dimensions [Height, Depth] indicated below each block.
Figure 3. Effect of batch size on model performance.
Figure 4. 5-Fold cross-validation method used in training of the model. The training data (80% of the full dataset) is divided into five folds: in each iteration, four folds (thistle) are used for training, and one fold (purple) for validation. The remaining 20% of the data (light blue) is reserved as a hold-out test set.
Figure 5. UAS flight path: (a) First dataset; (b) Second dataset.
Figure 6. Georeferenced colored point clouds: (a) First dataset; (b) Second dataset.
Figure 7. Annotated point clouds: (a) First dataset; (b) Second dataset.
Figure 8. Segmented point clouds with SAM LiDAR: (a) First dataset; (b) Second dataset. Each color corresponds to a distinct segmented region within the area.
Figure 9. Confusion Matrix of PointNet++ on the Testing dataset: (a) Results for the first dataset; (b) Results for the second dataset.
Figure 10. Confusion Matrix of SAMNet++ on the Testing dataset: (a) Results for the first dataset; (b) Results for the second dataset.
Figure 11. Point translation of the output PointNet++: (a) First dataset; (b) Second dataset.
Figure 12. Point translation of the output SAMNet++: (a) First dataset; (b) Second dataset.
Figure 13. Evaluation of SAMNet++ on Toronto-3D dataset: (a) Colorized point cloud; (b) Ground truth; (c) Output of SAMNet++.
Figure 14. Comparison of ground truth and segmentation results from PointNet++ and SAMNet++ on selected regions of dataset 1: (a) Selected regions highlighted on the ground truth; (b,e) Ground truth; (c,f) Segmentation results from SAMNet++; (d,g) Segmentation results from PointNet++.
Figure 15. Comparison of ground truth and segmentation results from PointNet++ and SAMNet++ on selected regions of dataset 2: (a) Selected regions highlighted on the ground truth; (b,e,h) Ground truth; (c,f,i) Segmentation results from SAMNet++; (d,g,j) Segmentation results from PointNet++.
Table 1. Performance comparison of PointNet, PointNet++, and SAMNet++ across datasets.

| Dataset | Method | OA | Precision (Micro) | Precision (Macro) | Recall (Micro) | Recall (Macro) | F1-Score (Micro) | F1-Score (Macro) |
|---|---|---|---|---|---|---|---|---|
| Dataset 1 | PointNet | 14.96% | 14.95% | 9.86% | 14.95% | 7.47% | 14.95% | 8.50% |
| Dataset 1 | PointNet++ | 84.61% | 84.61% | 73.62% | 84.61% | 70.58% | 84.61% | 66.65% |
| Dataset 1 | SAMNet++ | 97.68% | 97.87% | 91.84% | 97.87% | 98.44% | 97.87% | 94.74% |
| Dataset 2 | PointNet | 20.41% | 20.41% | 13.01% | 20.41% | 10.20% | 20.41% | 11.44% |
| Dataset 2 | PointNet++ | 89.49% | 89.49% | 78.98% | 89.49% | 63.78% | 89.49% | 66.25% |
| Dataset 2 | SAMNet++ | 98.86% | 98.86% | 93.96% | 98.86% | 98.86% | 98.86% | 96.26% |
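The micro and macro columns of Table 1 differ in how per-class results are pooled. The following is a minimal sketch of the two averaging schemes, not the authors' evaluation code: it assumes single-label multi-class predictions, takes macro F1 as the mean of per-class F1 scores (the scikit-learn convention), and uses a hypothetical, deliberately imbalanced two-class confusion matrix.

```python
def micro_macro(confusion):
    """Micro- and macro-averaged precision, recall, and F1 from a confusion matrix.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(confusion)
    tp = [confusion[c][c] for c in range(n)]
    fp = [sum(confusion[r][c] for r in range(n)) - tp[c] for c in range(n)]
    fn = [sum(confusion[c]) - tp[c] for c in range(n)]

    def div(a, b):
        return a / b if b else 0.0

    def f1(p, r):
        return div(2 * p * r, p + r)

    # Micro: pool the counts over all classes, then compute one score.
    micro_p = div(sum(tp), sum(tp) + sum(fp))
    micro_r = div(sum(tp), sum(tp) + sum(fn))

    # Macro: compute one score per class, then take the unweighted mean.
    per_p = [div(tp[c], tp[c] + fp[c]) for c in range(n)]
    per_r = [div(tp[c], tp[c] + fn[c]) for c in range(n)]
    per_f = [f1(per_p[c], per_r[c]) for c in range(n)]

    return {
        "micro": (micro_p, micro_r, f1(micro_p, micro_r)),
        "macro": (sum(per_p) / n, sum(per_r) / n, sum(per_f) / n),
    }


# Hypothetical imbalanced counts: 100 samples of class 0, 5 samples of class 1.
toy = [
    [95, 5],
    [4, 1],
]
scores = micro_macro(toy)
accuracy = (95 + 1) / 105
print(scores, accuracy)
```

In this setting every misclassified sample counts once as a false positive and once as a false negative, so micro precision, micro recall, and micro F1 all coincide with overall accuracy, while the macro scores fall when a minority class is poorly predicted.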
Table 2. Average processing time per epoch for PointNet, PointNet++, and SAMNet++ across datasets (in minutes (min)).

| Dataset | PointNet | PointNet++ | SAMNet++ |
|---|---|---|---|
| Dataset 1 | 956.73 min | 120.70 min | 59.42 min |
| Dataset 2 | 3001.73 min | 337.38 min | 159.33 min |
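The relative processing times quoted in the text (about half of PointNet++ and about one-sixteenth of PointNet) can be checked directly from the per-epoch times in Table 2. The short sketch below is only a worked arithmetic check; the dictionary layout and the function name are illustrative choices.

```python
# Average processing time per epoch in minutes, copied from Table 2.
times_min = {
    "Dataset 1": {"PointNet": 956.73, "PointNet++": 120.70, "SAMNet++": 59.42},
    "Dataset 2": {"PointNet": 3001.73, "PointNet++": 337.38, "SAMNet++": 159.33},
}


def time_ratios(times, reference="SAMNet++"):
    """Ratio of each baseline's time to the reference model's time, per dataset."""
    return {
        dataset: {
            method: t / row[reference]
            for method, t in row.items()
            if method != reference
        }
        for dataset, row in times.items()
    }


ratios = time_ratios(times_min)
for dataset, row in ratios.items():
    print(dataset, {method: round(value, 2) for method, value in row.items()})
```

On both datasets PointNet++ takes roughly twice as long as SAMNet++, and PointNet takes roughly 16 times as long on Dataset 1 and somewhat more on Dataset 2.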
Table 3. Performance comparison on the Toronto-3D dataset (%).

| Method | OA | mIoU | Road | Road Markings | Nature | Building | Utility Line | Pole | Car | Fence |
|---|---|---|---|---|---|---|---|---|---|---|
| PointNet++ [17] | 84.88 | 41.81 | 89.27 | 0.00 | 69.0 | 54.1 | 43.7 | 23.3 | 52.0 | 3.0 |
| DGCNN [46] | 94.24 | 61.79 | 93.88 | 0.00 | 91.25 | 80.39 | 62.40 | 62.32 | 88.26 | 15.81 |
| KPFCNN [47] | 95.39 | 69.11 | 94.62 | 0.06 | 96.07 | 91.51 | 87.68 | 81.56 | 85.66 | 15.72 |
| MS-PCNN [48] | 90.03 | 65.89 | 93.84 | 3.83 | 93.46 | 82.59 | 67.80 | 71.95 | 91.12 | 22.50 |
| TGNet [49] | 94.08 | 61.34 | 93.54 | 0.00 | 90.83 | 81.57 | 65.26 | 62.98 | 88.73 | 7.85 |
| MS-TGNet [41] | 95.71 | 70.50 | 94.41 | 17.19 | 95.72 | 88.83 | 76.01 | 73.97 | 94.24 | 23.64 |
| RandLA-Net [50] | 92.95 | 77.71 | 94.61 | 42.62 | 96.89 | 93.01 | 86.51 | 78.07 | 92.85 | 37.12 |
| MappingConvSeg [51] | 93.17 | 77.57 | 95.02 | 39.27 | 96.77 | 93.32 | 86.37 | 79.11 | 89.81 | 40.89 |
| EyeNet [52] | 94.63 | 81.13 | 96.98 | 65.02 | 97.83 | 93.51 | 86.77 | 84.86 | 94.02 | 30.01 |
| LACV-Net [53] | 95.8 | 78.5 | 94.8 | 42.7 | 96.7 | 91.4 | 88.2 | 79.6 | 93.9 | 40.6 |
| SFL-Net [54] | 96.0 | 78.1 | 94.2 | 34.0 | 96.9 | 93.8 | 87.1 | 85.7 | 93.5 | 39.7 |
| RandLA-Net * [50] | 94.37 | 81.77 | 96.69 | 64.21 | 96.92 | 94.24 | 88.06 | 77.84 | 93.37 | 42.86 |
| MappingConvSeg * [51] | 94.72 | 82.89 | 97.15 | 67.87 | 97.55 | 93.75 | 86.88 | 82.12 | 93.72 | 44.11 |
| ResDLPS-Net * [55] | 96.49 | 80.27 | 95.82 | 59.80 | 96.10 | 90.96 | 86.82 | 79.95 | 89.41 | 43.31 |
| LACV-Net * [53] | 97.4 | 82.7 | 97.1 | 66.9 | 97.3 | 93.0 | 87.3 | 83.4 | 93.4 | 43.1 |
| SFL-Net * [54] | 97.9 | 81.9 | 97.7 | 70.7 | 95.8 | 91.7 | 87.4 | 78.8 | 92.3 | 40.8 |
| SAMNet++ * (ours) | 96.90 | 86.93 | 95.37 | 53.62 | 98.99 | 99.10 | 91.26 | 92.06 | 97.86 | 77.67 |

* The method uses the colorized point cloud (RGB).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Shahraki, M.; Elamin, A.; El-Rabbany, A. SAMNet++: A Segment Anything Model for Supervised 3D Point Cloud Semantic Segmentation. Remote Sens. 2025, 17, 1256. https://doi.org/10.3390/rs17071256