3.1. Point Attention Net Model Overview
As a general point cloud processing framework, PointNet++ performs well in tasks such as point cloud classification, semantic segmentation, and object detection, demonstrating wide applicability. However, small target segmentation tasks such as pavement crack extraction expose two limitations. First, PointNet++ uses a fixed receptive field, which is only gradually expanded through multiple Set Abstraction (SA) layers. Second, feature extraction relies solely on the PointNet layer, which feeds a small group of points into fully connected layers, resulting in a relatively simple encoding method with low robustness.
To address these issues, this paper proposes Point Attention Net (PAN), a network based on PointNet++. PAN directly processes unordered point sets as inputs and uses the Set Abstraction module to extract local features at different levels, capturing local structures in the point cloud. Through hierarchical subsampling and aggregation, point cloud information is captured at various scales, so the network considers both local structure and global context and is better able to process point cloud data with multi-scale characteristics. Symmetric functions make the network insensitive to the ordering of the input points, ensuring consistent outputs across different arrangements of the same point set. This improves the model’s generalization and makes it more adaptable to point clouds of various shapes and structures, boosting its robustness in practical applications. Additionally, the proposed PC-Parallel module enlarges the model’s receptive field and strengthens encoding robustness. This module improves PointNet++’s performance in small target segmentation tasks such as pavement point cloud crack detection, allowing it to adapt to the varying scales and complexities of point cloud data.
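The insensitivity to point ordering mentioned above can be illustrated with a minimal sketch. It assumes a toy per-point feature function (`point_features`, a hypothetical stand-in for the shared MLP of a PointNet layer) followed by channel-wise max pooling as the symmetric function; the function names and sample coordinates are illustrative and do not come from the paper.

```python
import random

def point_features(p):
    """Toy per-point feature map (hypothetical stand-in for a shared MLP)."""
    x, y, z = p
    return (x + y, y * z, x - z, x * x + y * y + z * z)

def global_feature(points):
    """Channel-wise max over all points: a symmetric (order-independent) function."""
    feats = [point_features(p) for p in points]
    return tuple(max(channel) for channel in zip(*feats))

points = [(0.0, 0.1, 0.2), (1.0, -0.5, 0.3), (0.4, 0.4, -0.9), (-0.2, 0.8, 0.1)]
shuffled = points[:]
random.Random(0).shuffle(shuffled)

# The pooled descriptor is identical for any ordering of the same point set.
assert global_feature(points) == global_feature(shuffled)
```

Because the maximum is taken independently per channel over the whole set, reordering the points changes nothing in the pooled descriptor, which is the property the network relies on.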
The PC-Parallel module enhances the network’s capability to capture critical features over long distances, increasing adaptability and robustness. As illustrated in Figure 1, the PC-Parallel module is inserted after the SA block and takes the PointNet layer features as input. This module combines spatial and channel attention in parallel, leveraging the strengths of both mechanisms to improve point cloud segmentation performance. By modeling the attention matrix, the spatial attention component captures the spatial relationship between any two points, which strengthens the identification of local crack features and expands the model’s perception of the crack region. This in turn improves the handling of crack shapes, filters out irrelevant and noisy points, reduces redundant information in the point cloud, and improves computational efficiency. Simultaneously, the channel attention component captures long-range context information along the channel dimension, emphasizing feature channels that strongly discriminate crack features while suppressing those that contribute little. This reduces noise interference and redundant information, improving the model’s overall performance.
Finally, aggregating the outputs of the two attention modules enhances the recognition of crack points across spatial and feature dimensions, achieving more effective multi-scale feature fusion. This process captures crack point characteristics at different sampling levels, allowing the model to better understand the global correlation among crack points. The combined use of spatial and channel attention enables the model to fully perceive and interpret the information within crack points, thereby improving its understanding of the overall structure and pattern. The resulting feature representations allow crack points to be predicted more accurately and give the model stronger modeling capability for crack point cloud segmentation. Moreover, Poly Loss is introduced to modify the form of the loss function so that it compensates for the imbalance between crack points and background points and improves the identification accuracy of crack areas.
3.2. PC-Parallel Module
The PC-Parallel module consists of two attention branches, as illustrated in Figure 2: the spatial attention branch learns the relationships between different feature points, while the channel attention branch captures long-range context information along the channel dimension. Finally, the outputs of the two branches are aggregated to obtain a more effective point-level feature representation.
Spatial Attention Branch: Spatial attention allows the module to selectively focus on local regions around crack points, enhancing local features and improving the perception of crack areas. This is crucial for crack point segmentation, as crack features are typically small and scattered. By incorporating spatial attention, the model becomes more adaptable to point clouds of varying shapes and structures and can exploit the spatial relationships between points to better capture local crack details and focus on structurally important areas. Guiding the model toward the key crack points while filtering out irrelevant and noisy points also reduces the computational cost of processing redundant information and improves overall efficiency. Furthermore, by learning spatial relationships between points, spatial attention helps the model grasp the global correlations within the point cloud and capture overall structures. This addresses the non-uniformity of point clouds, allowing the model to process crack points with varying densities and sampling patterns more accurately and improving its effectiveness on pavement crack point cloud data.
We first input the local features $A$, which are processed through two convolutional layers to produce two new feature maps $T$ and $P$, respectively. We then reshape them so that, for each batch element, $T, P \in \mathbb{R}^{c \times n}$. Next, matrix multiplication is performed between the transpose of $T$ and $P$, followed by a softmax layer, to compute the spatial attention map $S$:

$$S_{b,t,p} = \frac{\exp\left(\sum_{k=1}^{c} T_{b,k,t}\, P_{b,k,p}\right)}{\sum_{p'=1}^{n} \exp\left(\sum_{k=1}^{c} T_{b,k,t}\, P_{b,k,p'}\right)}$$

Here, $S_{b,t,p}$ is a measure of the positional relationship between $t$ and $p$; $b$ represents the batch dimension, $t$ represents the first spatial dimension, and $p$ represents the second spatial dimension. In parallel, we input the feature $A$ into another convolutional layer to generate a new feature map $V$ and reshape it to $\mathbb{R}^{c \times n}$. Then, we perform matrix multiplication between $V$ and the transpose of $S$ and reshape the result to the shape of $A$. Finally, the result is input into a convolutional layer and, after a normalization operation, is element-wise summed with the feature $A$ to obtain the final output $E$:

$$E_{b} = A_{b} + \mathrm{Norm}\left(\mathrm{Conv}\left(V_{b}\, S_{b}^{\top}\right)\right)$$
It can be seen from the formula that the final feature of each position is the weighted sum of the features of all positions, added to the original features; c is the number of channels, and n is the number of points. The output therefore has a global context view and selectively aggregates context according to the spatial attention map, helping the model focus on local areas, enhance local features, and reduce the interference of background points. Similar semantic features reinforce each other, thereby improving intra-class compactness and semantic consistency.
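The spatial attention computation described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors’ implementation: the convolutions are modeled as 1x1 convolutions without bias (plain channel-mixing matrices), the normalization is a simple per-channel standardization, and the names `V`, `S`, `W_t`, `W_p`, `W_v`, and `W_o` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(A, W_t, W_p, W_v, W_o):
    """A: (b, c, n) local features; W_*: (c, c) weights of assumed 1x1 convolutions."""
    T = np.einsum("oc,bcn->bon", W_t, A)      # feature map T
    P = np.einsum("oc,bcn->bon", W_p, A)      # feature map P
    V = np.einsum("oc,bcn->bon", W_v, A)      # third feature map
    energy = np.einsum("bct,bcp->btp", T, P)  # transpose(T) @ P, shape (b, n, n)
    S = softmax(energy, axis=-1)              # each row t sums to 1 over p
    ctx = np.einsum("bcp,btp->bct", V, S)     # V @ transpose(S): weighted sum over positions p
    out = np.einsum("oc,bcn->bon", W_o, ctx)  # output convolution
    # Assumed normalization: standardize each channel over batch and points.
    mean = out.mean(axis=(0, 2), keepdims=True)
    std = out.std(axis=(0, 2), keepdims=True)
    out = (out - mean) / (std + 1e-5)
    return A + out, S                         # residual sum with the input features

rng = np.random.default_rng(0)
b, c, n = 2, 8, 16
A = rng.standard_normal((b, c, n))
weights = [rng.standard_normal((c, c)) * 0.1 for _ in range(4)]
E, S = spatial_attention(A, *weights)
print(E.shape, S.shape)  # (2, 8, 16) (2, 16, 16)
```

Each row of `S` sums to one, so every output position is a convex combination of the features at all positions, added to its own original feature.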
Channel Attention Branch: Channel attention dynamically adjusts the weights of different feature channels in the point cloud during learning. Channels that strongly discriminate crack points are emphasized, which enhances the crack point representation, while redundant channels that contribute little to crack point cloud segmentation are suppressed. This reduces noise interference and improves the model’s overall performance and segmentation accuracy for cracks.
By applying average pooling and max pooling, we aggregate the spatial information of the feature map into two independent spatial context descriptors, named AvgPool and MaxPool, representing the average-pooled and max-pooled features, respectively. These descriptors are then fed into a shared network consisting of a multilayer perceptron (MLP) with hidden layers. To balance parameter count and effectiveness, we set the hidden activation size to $\mathbb{R}^{c/r \times 1}$, where r is the decay ratio. After applying the shared network to each descriptor and processing the merged features through the Sigmoid activation function, we obtain the channel attention map $M_c$. Generating this map only requires learning a shared network with relatively few parameters, which reduces the computational load. Channel attention is calculated as follows:

$$M_c(A) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(A)) + \mathrm{MLP}(\mathrm{MaxPool}(A))\big)$$

where $\sigma$ denotes the Sigmoid function.
This channel attention design enables the model to adaptively focus on each channel’s information based on task requirements, thereby enhancing the network’s sensitivity to point cloud features. This mechanism effectively captures inter-channel relationships in point clouds, offering a more accurate feature representation for segmentation tasks and improving the model’s performance and generalization capabilities.
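As a rough illustration of the channel attention branch, the sketch below pools over the point dimension, applies a shared two-layer MLP to both descriptors, and squashes the result with a sigmoid. It assumes that the two MLP outputs are merged by element-wise summation, that the hidden layer uses ReLU, and that biases are omitted; `W1`, `W2`, and the ratio `r = 4` are hypothetical choices for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(A, W1, W2):
    """A: (b, c, n); W1: (c // r, c); W2: (c, c // r). Returns weights of shape (b, c, 1)."""
    avg = A.mean(axis=2)  # (b, c) average-pooled descriptor
    mx = A.max(axis=2)    # (b, c) max-pooled descriptor

    def shared_mlp(d):
        hidden = np.maximum(d @ W1.T, 0.0)  # ReLU hidden layer of width c // r
        return hidden @ W2.T                # back to c channels

    M = sigmoid(shared_mlp(avg) + shared_mlp(mx))  # merge by summation (assumed)
    return M[:, :, None]

rng = np.random.default_rng(0)
b, c, n, r = 2, 8, 16, 4
A = rng.standard_normal((b, c, n))
W1 = rng.standard_normal((c // r, c)) * 0.1
W2 = rng.standard_normal((c, c // r)) * 0.1
M = channel_attention(A, W1, W2)
reweighted = A * M  # broadcast the per-channel weights over all points
print(M.shape, reweighted.shape)  # (2, 8, 1) (2, 8, 16)
```

The same `W1` and `W2` are applied to both descriptors, which is what keeps the parameter count of the branch small.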
Finally, by integrating the channel attention module and the spatial attention module through matrix concatenation, the model can learn the relationship between channel and location more comprehensively and can therefore better understand and represent both the details and the global information of pavement cracks. The synergistic effect of these two attention modules allows the model to more effectively comprehend the combination of different channels at various positions within the point cloud. This capability enables the model to adapt more efficiently to cracks of diverse shapes, sizes, and positions, ensuring stable performance across different types of cracks. Additionally, this synergy enhances the model’s ability to locate and segment cracks, improving the accuracy of segmentation tasks. By understanding the spatial and channel relationships within the point cloud, the model can deliver more reliable and consistent results in crack detection and analysis. In summary, this fusion strategy provides a more comprehensive and flexible feature learning mechanism for point cloud segmentation tasks.
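The fusion step can be sketched as a concatenation of the two branch outputs. The text does not state the concatenation axis or how the channel attention map is applied before fusion, so this sketch assumes that the map reweights the input features and that the two outputs are joined along the channel axis; the branch outputs here are random placeholders and `fuse_parallel` is a hypothetical helper name.

```python
import numpy as np

def fuse_parallel(spatial_out, channel_out):
    """Concatenate two (b, c, n) branch outputs along the channel axis -> (b, 2c, n)."""
    if spatial_out.shape != channel_out.shape:
        raise ValueError("branch outputs must have the same shape")
    return np.concatenate([spatial_out, channel_out], axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 8, 16))                   # local features (b, c, n)
spatial_out = A + 0.1 * rng.standard_normal(A.shape)  # placeholder for the spatial branch output
channel_weights = rng.uniform(size=(2, 8, 1))         # placeholder channel attention map
channel_out = A * channel_weights                     # channel-reweighted features
fused = fuse_parallel(spatial_out, channel_out)
print(fused.shape)  # (2, 16, 16)
```

Under this assumption the fused tensor has twice as many channels as either branch, so any layer that follows must accept the wider input.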
3.3. Poly Loss Function
In the original PointNet++ framework, NLL Loss (Negative Log Likelihood Loss) is used as the standard loss function. It is computed from the predicted class probability distribution of each point: the negative log-likelihood of the true label under that distribution guides the optimization of the model. However, for crack point cloud segmentation tasks, NLL Loss overlooks the local structure of cracks and the order of points, leading to a lack of global and local structural information and poor robustness against point cloud rotations or translations.
Pavement crack point cloud data are imbalanced: crack points are far fewer than background points. To address this issue in pavement crack segmentation, this study introduces Poly Loss [59]. The core idea of Poly Loss is to improve a deep learning model’s robustness and accuracy by tailoring the loss function to the task. In pavement crack segmentation, cracks usually occupy a small part of the point cloud while most of the area is normal pavement, and this imbalance prevents the original loss function from identifying crack regions effectively. By introducing weighting factors or polynomial terms, Poly Loss addresses the imbalance between crack and background points and improves crack region identification accuracy. It also makes the model more sensitive to edge and detail information, which helps preserve and recognize the fine structure of cracks. Adjusting the polynomial coefficients tunes the model’s predictions to the requirements of crack point cloud segmentation, balancing its ability to predict crack points and background points and thereby improving overall performance. Poly Loss is defined as follows:

$$L_{\text{Poly-1}} = -\log(P_t) + \epsilon_1 (1 - P_t)$$

where $\epsilon_1$ is an additional hyperparameter used to adjust the first polynomial coefficient, and $P_t$ is the predicted probability of the target class (for a crack point, the crack point category). Adding this extra term to the original cross-entropy loss adjusts the first polynomial coefficient and improves classification performance. At the same time, the softmax operation is introduced to deal with category imbalance and strengthen the learning of minority categories.
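To make the role of the first polynomial coefficient concrete: cross-entropy can be expanded as $-\log(P_t) = \sum_{j \ge 1} \frac{1}{j}(1 - P_t)^j$, so adding $\epsilon_1 (1 - P_t)$ changes only the coefficient of the first term from $1$ to $1 + \epsilon_1$. The following is a minimal per-point sketch of this Poly-1 form; the default `epsilon=1.0`, the two-class setup, and the helper names are illustrative assumptions, not values reported in the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def poly1_loss(log_probs, target, epsilon=1.0):
    """Poly-1 loss for one point.

    log_probs: per-class log-probabilities (e.g. log-softmax output).
    target:    index of the true class.
    epsilon:   amount added to the first polynomial coefficient (assumed value).
    """
    nll = -log_probs[target]           # NLL / cross-entropy term: -log(P_t)
    p_t = math.exp(log_probs[target])  # predicted probability of the true class
    return nll + epsilon * (1.0 - p_t)

# Two classes: index 0 = background, index 1 = crack (labels are illustrative).
confident = log_softmax([0.0, 3.0])  # crack predicted with high probability
uncertain = log_softmax([0.0, 0.0])  # P_t = 0.5

print(poly1_loss(uncertain, 1, epsilon=0.0))  # reduces to NLL: log(2), about 0.693
print(poly1_loss(uncertain, 1, epsilon=1.0))  # log(2) + 0.5, about 1.193
```

With `epsilon=0.0` the function is exactly the NLL loss used in the original PointNet++ framework, so the extra term can be tuned without changing the rest of the training loop.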
In this paper, we employed the Poly Loss function to train the entire network architecture. For the semantic segmentation of pavement cracks in point cloud data, Poly Loss effectively addresses category imbalance by adjusting polynomial coefficients to weight crack and background points. This approach enhances the model’s robustness against noise and outliers, significantly boosting performance. It also improves the model’s stability and reliability in complex scenarios.