1. Introduction
Cracks are widely considered an important indicator of road surface degradation. Detection of pavement cracks is a key step in assessing pavement health, conducting preventive maintenance, and ensuring driving safety. If the most common pavement distresses are not repaired in time, cracks can cause more serious damage to the pavement structure with the infiltration of water. Recently, many studies have applied deep learning to detect pavement cracks and achieved great success due to their exceptional ability to automatically extract pavement crack features from large-scale data [
1,
2,
3,
4].
In these deep learning-based models, pure CNN models (i.e., YOLOv5l [
2] and Faster-RCNN-R50 [
5]) and their improved versions are firstly employed to detect pavement cracks. Xing et al. [
3] improved the YOLOV5 deep learning model by incorporating a bidirectional feature pyramid network (BIFPN) to fuse multi-level features for pavement crack detection. Liang el al. [
4] proposed an automatic approach using multi-source image fusion (MSFSD) based on visible light and infrared sensors, with a generative adversarial network for urban pavement crack detection.
Subsequently, many attention mechanisms are added to CNN models for further capturing long-range dependencies of pavement cracks. Yang et al. [
6] proposed an attention fusion block to integrate multi-scale features and suppress interference from complex backgrounds, thereby enhancing the accuracy of pavement crack detection. Yao et al. [
7] proposed a pavement crack detection method based on YOLOv5, utilizing channel squeeze and excitation (SCSE) [
8] and a convolutional block attention module (CBAM) [
9] to optimize model performance, and investigated the impact of “how” and “where” to add attention modules, significantly enhancing detection accuracy, speed, and robustness. These methods further improve the accuracy of pavement crack detection by leveraging global attention mechanisms to learn the variable global morphology of cracks.
However, as shown in
Figure 1, both pure CNN models and networks that combine CNN with global attention mechanisms perform poorly in small-scale pavement crack detection. This is because those cracks exhibit different characteristics at different views: from the global viewpoint, small-scale pavement cracks exhibit intricate and variable morphology in an image; from the local viewpoint, pavement cracks exhibit little structure, as shown in the top image of
Figure 2. Furthermore, in the bottom image of
Figure 2, we analyze the feature extraction performance using gradient-based class activation maps (Grad-CAMs) [
10]. The results indicate that RT-DETR-R50 not only focuses on crack regions but also on background textures. This additional attention to background textures makes them prone to confusing background textures with the local structure of slight pavement cracks. Therefore, the key challenge in small-scale pavement crack detection is how to effectively extract the features of local structures while reducing background noise interference.
The backbone network extracts multi-level features by sequentially applying CNN layers that filter background noise from the output of the previous layer to extract crack features. As a result, higher-level features have larger receptive fields and richer semantic information, while lower-level features preserve more detailed information but also incorporate additional background noise [
11]. When the local structures of pavement cracks are insignificant, incorporating lower-level features within the Path Aggregation Network (PAN) [
12] not only significantly increases the computational complexity but also introduces additional noise, thereby interfering with asphalt pavement crack detection, especially for small-scale cracks. Therefore, we designed a variant of RT-DETR-R50, referred to as RT-DETR-R50’, which omits the fusion of the lowest-level features in the FPN. As shown in
Figure 2b, RT-DETR-R50’ focuses less on background textures compared to RT-DETR-R50, indicating that excluding the fusion of low-level features in the FPN can reduce background noise interference in pavement crack detection. However, this modification also results in the reduced extraction of local detail features.
In this work, we propose a novel automated pavement crack detection network named Crack-MsCGA, which excludes the fusion of low-level features in the FPN to reduce background noise interference. Then, we design a Cascaded Group Attention Fusion (MsCGA) block to capture the local structures of slight pavement cracks, thereby enhancing local detail features. MsCGA block can extract and enhance both local features and global features of pavement cracks at different scales. Specifically, local window attention is first employed to aggregate local features, capturing the local structure by learning short-range dependencies (i.e., the interrelationships between elements within the window). Subsequently, a global cascaded group attention mechanism is applied to further capture global features, capturing variable global morphology by learning long-range dependencies (i.e., the interrelationships between elements across the entire image). Finally, we propose a multi-scale attention fusion strategy based on Mixed Local Channel Attention (MLCA) to selectively integrate both local and global features.
The main contributions of this paper can be summarized as follows:
We propose Crack-MsCGA, which is a computationally efficient pavement crack detection network. Crack-MsCGA can detect large-scale cracks, medium-scale cracks, and small-scale cracks with insignificant structure and variable morphology.
We design an MsCGA block, which can effectively extract both local and global features of pavement cracks by learning short-range and long-range dependencies. Moreover, the block selectively fuses local features and global features through a multi-scale attention fusion strategy based on Mixed Local Channel Attention (MLCA).
We conduct extensive experiments on the various settings and compare Crack-MsCGA with five existing methods to show its effectiveness. The results show that Crack-MsCGA significantly outperforms these methods. In addition, Crack-MsCGA is a computationally efficient method for pavement crack detection, achieving an average reduction of 32.0% in GFLOPs and 11.2% in parameters over the state-of-the-art method.
4. Experiments
In this section, we first introduce the experimental settings in
Section 4.1. Secondly, we conduct a comparative analysis of our proposed Crack-MsCGA and five other advanced crack detection models from the perspectives of multi-scale crack detection, efficiency, and visualization in
Section 4.2. Thirdly, we conduct a comparative analysis of our proposed MsCGA block with a control group without the attention mechanism and with two classical attention mechanisms and three state-of-the-art attention mechanisms in
Section 4.3. Finally, we design some comparative to validate the rationality of the Crack MSCGA design in
Section 4.6.
4.1. Experimental Settings
4.1.1. Experimental Environment
The experimental environment was as follows: Windows11, Python 3.11.6, Pytorch 2.1.0, CUDA11.8, Intel(R) Xeon(R) Silver 4314, NVIDIA RTX A6000.
4.1.2. Datasets
We conducted experiments on a public dataset (RDD2022-China-MotorBike, China-M) and a private dataset (DH807) to validate the effectiveness of our proposed method. The China-M dataset was captured from a rear-to-front perspective, as shown in the first row of
Figure 6, while DH807 was captured from a top--down perspective, as shown in the second row of
Figure 6.
The public China-M [
47] dataset is a more common dataset for pavement distress detection, including four types of pavement damage: Longitudinal Cracks, Transverse Cracks, Block Cracks, and Potholes. We constructed a new dataset based on RDD2022-China-MotorBike to test the effectiveness of our proposed method in the task of crack detection. Specifically, we first deleted images without cracks from the dataset. Then, we removed non-crack labels and manually corrected incorrect labels. Finally, we randomly divided the dataset into a training set and a test set in a 7:3 ratio, yielding 1290 training images and 560 test images. Each image is a grayscale image with a resolution of
pixels. The first row of
Figure 6 shows two images from the dataset.
The DH807 dataset is a new dataset proposed by us. Specifically, we used a pavement image collection vehicle developed by Wuhan Optics Valley Zoyon Science and Technology to collect pavement images from a top–down perspective in Dehong Prefecture, Yunnan Province, China. This collection vehicle was equipped with advanced high-speed cameras, shooting control algorithms, and postprocessing alignment algorithms, ensuring continuous high-definition pavement images at speeds ranging from 0 to 100 km/h. Subsequently, we selected 807 images containing cracks from the collected pictures and annotated them using LabelImg. These images were then divided into a training set and a test set in a 7:3 ratio. Second row of
Figure 6 shows two images from the dataset.
4.1.3. Evaluation Metrics
In object detection, several metrics are used to evaluate the accuracy of a model. True Positives (
TP) are the correctly detected objects, i.e., the objects that are correctly identified by the model. False Positives (
FP) are the incorrectly detected objects, i.e., the objects that the model incorrectly identifies as present. False Negatives (
FN) are the objects that the model fails to detect, i.e., the objects that are present but not identified by the model. Precision (
P) is the ratio of
TP to the sum of
TP and
FP:
Recall (
R) is the ratio of
TP to the sum of
TP and
FN:
F1 score is the harmonic mean of precision and recall, primarily used to evaluate the balance between precision and recall in classification or detection tasks. Its formula is as follows:
To comprehensively analyze the balance of crack detection across different scales, this study utilizes the mean
F1 score (
mF1), which is calculated as the average of the
F1 scores for small, medium, and large scales, to evaluate the model’s balance between precision and recall:
where (
S) is the number of scales and
is the
F1 score for the (
i)-th scale. Average Precision (
AP) is the area under the precision–recall curve, summarizing the precision and recall at different threshold levels. hlAP@50 is a specific instance of
AP where the Intersection over Union (IoU) threshold is set to 0.50. This means that a predicted bounding box is considered a True Positive if its IoU with the ground truth box is greater than or equal to 0.50.
Mean Average Precision (mAP) is a commonly used metric in object detection that averages the
AP across all classes. It provides a single performance measure that accounts for both precision and recall across different object categories. The
mAP is calculated as follows:
where (
N) is the number of classes and
is the Average Precision for the (
i)-th class.
In addition to accuracy metrics, some metrics are used to evaluate the inference speed and computational complexity of the models. In this study, Frames Per Second (FPS) is utilized to assess the inference speed, while GFLOPs and parameters are employed to evaluate the computational complexity of the models.
4.1.4. Baselines
In our study, we conduct a comparative analysis of our method Crack-DETR with five existing models: YOLOv5l [
2], YOLOv5l-ST-BIFPN [
3], UM-YOLOl [
48], Faster-RCNN-R50 [
5], Deformable-DETR-R50 [
49], DDQ-DETR-R50 [
50], and RT-DETR-R50 [
39]. Among them, YOLOv5l and Faster-RCNN-R50 are purely CNN-based models, while Deformable-DETR-R50, DDQ-DETR-R50, and RT-DETR-R50 are end-to-end detection models that leverage attention mechanisms. Additionally, YOLOv5-ST-BIFPN and UM-YOLOl are improved versions of YOLOv5 and YOLOv8 tailored for pavement crack detection. To ensure a fair comparison, we implemented YOLOv5-ST-BIFPN and UM-YOLOl by following the methods outlined in the respective papers, based on the YOLOv5 and YOLOv8 variants with parameters most similar to Crack-MsCGA (i.e., YOLOv5l and YOLOv8l).
YOLOv5l: YOLOv5l is a classical single-stage object detection model known for its speed and efficiency. It is part of the YOLO (You Only Look Once) family, which emphasizes real-time detection with good accuracy.
YOLOv5l-ST-BIFPN: YOLOv5l-ST-BIFPN is an improved version of YOLOv5l, integrating the Swin Transformer block and Bidirectional Feature Pyramid Network (BIFPN) into YOLOv5. These enhancements enable it to achieve outstanding performance in UAV pavement crack detection.
UM-YOLOl: UM-YOLOl is an enhanced version of YOLOv8l. It integrates an Efficient Multiscale Attention (EMA) module into the backbone’s C2f module, employs a Bi-FPN fusion mechanism in the neck to improve feature aggregation, and leverages GSConv—a lightweight convolutional network—for more efficient convolution operations.
Faster-RCNN-R50: Faster-RCNN-R50 is a classical two-stage object detection model. It first generates region proposals and then refines these proposals for final object detection and classification.
Deformable-DETR-R50: Deformable DETR (DEtection TRansformer) with a ResNet-50 backbone integrates deformable convolutions into the DETR framework. This model improves upon the original DETR by enhancing its ability to handle small objects and varying object scales through deformable attention, leading to better performance in object detection tasks.
DDQ-DETR-R50: DDQ-DETR-R50 stands for Dense Distinct Query for end-to-end object detection with a ResNet-50 backbone. This model leverages dense distinct queries in a transformer-based framework, enhancing its ability to detect objects with varying scales and shapes.
RT-DETR-R50: RT-DETR-R50 denotes Real-Time DETR with a ResNet-50 backbone. It aims to bring the benefits of the DETR model, including its transformer-based architecture and attention mechanisms, to real-time applications. RT-DETR-R50 balances the high accuracy of transformer-based detection with the speed necessary for real-time inference.
4.1.5. Hyperparameter Settings
For a fair comparison, we carefully tune the hyperparameters of all baselines to report their best performance. The number of epochs is set to 300, the learning rate is set to 1 , the weight decay used in the local optimizer is set to 1 , the batch size for training is set to 16, and the input image size is . In the Local Window Attention of MsCGA block, the size of each window (ws) is set to 5. For the loss function, the hyperparameters , , and are set to 2, 1, and 1, respectively.
4.2. Detection Performance
In this section, we evaluate Crack-MsCGA on the China-M dataset and DH807 dataset, then compare it with seven baselines.
4.2.1. Performance on the China-M Dataset
The performance of the methods on China-M is summarized in
Table 1, with AP@50 metrics reported for small, medium, and large objects as well as the mean of all scales.
As shown in
Table 1, we first analyze the detection performance under varying scales of cracks, i.e., small, medium, and large. Specifically, for the most challenging small-scale cracks with slight structures and variable morphologies, Crack-MsCGA outperforms the baselines with an average improvement of 11.7% (i.e., improving AP@50 to 11.3%, 11.1%, 8.3%, 14.9%, 16.1%, 7.7%, and 12.2% on YOLOv5l, YOLOv5l-ST-BIFPN, UM-YOLOl, Faster-RCNN-R50, Deformable-DETR-R50, DDQ-DETR-R50, and RT-DETR-R50, respectively). In particular, compared to pure CNN models (i.e., YOLOv5l and Faster-RCNN-R50), Crack-MsCGA achieves improvements of up to 11.3% and 14.9% in detection performance, respectively. Models based on CNNs with attention mechanisms (e.g., UM-YOLOl, DDQ-DETR-R50, and RT-DETR-R50) show improved accuracy in detecting large-scale and medium-scale cracks compared to pure CNN models. However, the improvement in detecting small-scale cracks remains relatively limited. Overall, Crack-MsCGA achieved the highest accuracy across all scales. Even when compared to the closest method (i.e., DDQ-DETR-R50), Crack-MsCGA demonstrated improvements of 7.7%, 0.5%, and 1.6% in detecting small-scale, medium-scale, and large-scale cracks, respectively. In addition, the mF1 score of Crack-MsCGA is higher than that of all other models, achieving the top rank. This indicates that Crack-MsCGA has a stronger balance between precision and recall. On the one hand, these improvements are attributed to MsCGA’s ability to avoid the noise interference present in low-level features. On the other hand, they are credited to the MsCGA block, which employs local window attention to aggregate local features by learning short-range dependencies within the patch. This enhances the feature representation capability for small-scale cracks with subtle structures and variable morphologies.
Overall, the results validate the effectiveness of Crack-MsCGA in improving the detection performance of pavement cracks on the China-M dataset, making it a robust and reliable solution for automated pavement crack detection.
4.2.2. Performance on DH807 Dataset
As shown in
Table 2, the proposed Crack-MsCGA achieved the highest average AP@50 across three scales, reaching 60.3%, which is an improvement of 2.5% over the closest model, RT-DETR-R50, on the DH807 dataset. Specifically, for the most challenging slight small-scale cracks with insignificant structures and variable morphologies, Crack-MsCGA outperforms the baselines with an average improvement of 11.3% (i.e., improving AP@50 to 7.3%, 9.1%, 10.9%, 17.5%, 16.3%, 12.1%, and 6.0% on YOLOv5l, YOLOv5l-ST-BIFPN, UM-YOLOl, Faster-RCNN-R50, Deformable-DETR-R50, DDQ-DETR-R50, and RT-DETR-R50, respectively). For the medium-scale cracks, Crack-MsCGA achieved 6.7% average improvement on AP@50. For the large-scale cracks, Crack-MsCGA achieved a 5.9% average improvement on AP@50. From the perspective of the F1 score, Crack-MsCGA still achieved the best performance on the DH807 dataset, whereas DDQ-DETR-R50, which performs well on the China-M dataset, experienced a significant drop in mF1 score on the DH807 dataset. This indicates that Crack-MsCGA exhibits better robustness compared to the other models. These improvements further illustrate the effectiveness of MsCGA in enhancing the detection of small-scale pavement cracks. The MsCGA block uses local window attention to aggregate local features by learning short-range dependencies within the patch, enhancing the feature representation abilities for small-scale cracks with delicate structures and varying morphologies.
4.2.3. Analysis on Efficiency
The computational cost and number of parameters are crucial reference metrics for deploying models on portable devices with integrated sensors because computational power and memory are highly limited on such devices. Therefore, we compare the computational cost (GFLOPs) and the number of parameters of Crack-MsCGA and other models to analyze their efficiency on portable devices with integrated sensors. The experimental results, as summarized in
Table 3, clearly demonstrate the superior performance of our proposed method, Crack-MsCGA, compared to other techniques.
Firstly, in terms of computational efficiency, Crack-MsCGA requires only 44.5 GFLOPs, the lowest among the compared methods, achieving a reduction of on average 32.0%. This efficiency is crucial for real-time applications where computational resources and processing time are limited.
Secondly, Crack-MsCGA has the fewest parameters, with only 38.5 G, which not only reduces the model’s complexity but also minimizes the memory footprint. This makes it highly suitable for deployment on devices with limited memory, such as portable devices with integrated sensors.
Thirdly, to further explore the practical inference speed of Crack-MsCGA, we conducted FPS tests on different platforms and under various floating point precisions. The experimental results are shown in
Table 4. Analysis of these results reveals that the inference speed of YOLOv5l leads all other models under all conditions. Our proposed Crack-MsCGA follows closely, achieving second place in all metrics. Overall, with FP16 precision on the NVIDIA RTX 3060, Crack-MsCGA provides an average FPS improvement of 44.4% compared to other models. This indicates that the inference speed of Crack-MsCGA outperforms most real-time object detection models.
Overall, Crack-MsCGA’s parameter count and inference computational cost outperform all other models, while its inference speed is second only to YOLOv5l. It can meet real-time inference requirements in most scenarios, making it a highly effective solution for practical applications.
4.2.4. Gradient-Based Class Activation Maps Visualization
In this section, we use Grad-CAM++ to generate gradient-based class activation maps to analyze the effectiveness of the backbone network and neck in extracting pavement crack features in Crack-MsCGA and other models, as shown in
Figure 7.
From
Figure 7, we can observe that the pure CNN models YOLOv5l and YOLOv5l-ST-BIFPN not only focus on the crack regions but also pay considerable attention to the pavement background textures. In contrast, the CNN-with-attention-based models RT-DETR-R50 and DDQ-DETR-R50, thanks to the incorporation of a global attention mechanism, significantly reduce the focus on background regions compared to the pure CNN models and place greater emphasis on the overall continuity of the cracks. Furthermore, we designed a variant of RT-DETR-R50, named RT-DETR-R50’, which removes the fusion of low-level features. Compared with RT-DETR-R50, this variant further reduces attention on background regions, but it neglects some local detail features. Crack-MsCGA builds on the removal of low-level feature fusion by using MsCGA to aggregate both the local detail features and global features of cracks at multiple scales, thereby further reducing focus on background regions while effectively concentrating on the crack areas.
4.2.5. Visualization of Prediction Results
In
Figure 8, we present the prediction results of Crack-MsCGA and other comparative models for randomly selected samples (a–h) from the test set.
From
Figure 8a,b,e, it is evident that Crack-MsCGA can detect slight small-scale cracks with insignificant structure and variable morphology. This capability is attributed to MsCGA, where local window attention focuses on the interrelationships within local windows, enhancing the model’s ability to capture features of slight small-scale pavement cracks with insignificant structures and variable morphologies.
Figure 8d–f illustrate that, compared to pure CNN models, models employing global attention mechanisms—such as Deformable-DETR-R50, DDQ-DETR-R50, and RT-DETR-R50—tend to overlook slight cracks and critical local details, such as endpoints. This phenomenon occurs because the features learned by the global attention mechanism weaken local features. In contrast, Crack-MsCGA effectively addresses this issue. On one hand, Crack-MsCGA enhances both local and global features by learning short-range and long-range dependencies. On the other hand, in the detection of slight small-scale cracks, the multi-scale attention fusion strategy can selectively retain more local features.
From
Figure 8c,e,g, it is evident that pure CNN models such as YOLOv5l and Faster-RCNN-R50 often produce cluttered bounding boxes, even with the application of Non-Maximum Suppression (NMS) as a post-processing step. In contrast, the detection results from DDQ-DETR-R50, RT-DETR-R50, and Crack-MsCGA exhibit more orderly bounding boxes. This improvement is attributed to the use of global attention mechanisms in DDQ-DETR-R50, RT-DETR-R50, and Crack-MsCGA, which can capture long-range dependencies among input features.
By analyzing
Figure 8e,h, it is observed that Crack-MsCGA tends to overly focus on local details when handling large-scale and structurally complex cracks. This results in scenarios where a single crack is detected as two smaller ones, or part of a large crack is identified as a small, separate crack. This issue is even more pronounced in pure CNN models like Faster-RCNN-R50 and YOLOv5l. However, models utilizing global attention mechanisms, such as DDQ-DETR-R50 and RT-DETR-R50, alleviate this problem to some extent. This suggests that when dealing with large-scale, complex crack structures, MsCGA may face limitations in adequately capturing global features.
From the above analysis, we can conclude that Crack-MsCGA effectively utilizes local window attention mechanisms to capture short-range dependencies of pavement cracks, aggregating fine-grained local features, which enables it to detect slight small-scale pavement cracks. Additionally, Crack-MsCGA employs cascaded group attention to capture long-range dependencies of pavement cracks for extracting intricate and variable global features, ensuring the regularity and simplicity of the prediction results. This confirms the significant role of MsCGA in improving the accuracy of pavement crack detection.
4.3. Comparison with Other Attention Mechanisms
In this section, we conduct comparative experiments including a control group without attention mechanisms (WOA), two classical attention mechanisms (CBAM and DAttention), and three state-of-the-art attention mechanisms (AIFI, FLA, and LSKA). For a fair comparison, we construct comparison models by replacing the MsCGA block in Crack-MsCGA with those attention mechanisms. The results are summarized in
Table 5, which presents AP@50 for three scales of cracks—small, medium, and large—as well as the mean performance (mean). It is evident that our MsCGA attention outperforms all other attentions across all scales. Specifically, MsCGA achieves the highest AP@50 scores for small-scale cracks (84.8%), medium-scale cracks (86.8%), and large-scale cracks (96.3%), resulting in a mean AP@50 of 89.3%. This demonstrates the superior capability of MsCGA in accurately detecting all-scale pavement cracks.
In comparison, classical attention mechanisms, i.e., CBAM and DAttention, show relatively lower performance for small-scale crack detection than WOA, which operates without an attention mechanism. This phenomenon occurs because those attention mechanisms may diminish the local features extracted by CNNs when capturing global features. The state-of-the-art methods, AIFI, FLA, and LSKA, show improvements over WOA in medium-scale and large-scale crack detection. However, the improvements in small-scale crack detection are minimal. This is because these attention mechanisms can enhance global features by learning long-range dependencies, but their enhancement of local features is limited. It is worth noting that Crack-MsCGA achieves significant improvements across all scales. This benefits from the fact that MsCGA can not only enhance both local and global features by learning short-range and long-range dependencies, but also selectively fuse them.
4.4. Comparative Experiments on Where to Employ the MsCGA Block
In the neck of the Crack-MsCGA model, we only apply attention enhancement to high-level features, without enhancing low-level features. This is based on the theory that applying attention enhancement to concatenated multi-level features simultaneously is redundant because high-level features are abstractions of low-level features in generic object detection. However, there are significant differences between pavement crack detection and traditional object detection. Therefore, we set up ablation experiments to explore where employing the MsCGA block leads to the best results. As shown in
Table 6, we present the computational cost and mAP@50 of Crack-MsCGA and its three variants. The WOA variant does not use the MsCGA block to enhance features. The AllLevel variant uses the MsCGA block to enhance both high-level and low-level features. The LowLevel variant only uses the MsCGA block to enhance low-level features, while Crack-MsCGA uses the MsCGA block only to enhance high-level features.
Among all variants, Crack-MsCGA achieved the highest mAP@50, with a computational cost only slightly higher than the WOA variant, which does not use the MsCGA block. This confirms that the theory also applies to crack detection. The LowLevel variant improved mAP@50 by only 0.1% compared to the WOA variant, while the AllLevel variant improved mAP@50 by just 0.3%. This is because the information learned by the attention mechanism can easily diminish the local features in large-scale features.
4.5. Comparative Experiments on Window Size of Local Window Attention Mechanism
In
Section 3.2, we proposed a local window attention mechanism to compute short-range dependencies by aggregating the local features of pavement cracks within each window patch, with the window size set as the hyperparameter
. In this section, we design comparative experiments to investigate the impact of different window sizes on asphalt pavement crack detection.
Table 7 presents the detection accuracies on the DH807 dataset using different window sizes.
Firstly, all variants exhibit similar performance in detecting large-scale cracks. This is because each variant employs a CGA mechanism to capture intricate and variable global attention features by learning long-range contextual information across the entire image. These global attention features facilitate the detection of large-scale cracks.
Secondly, as the increases, the detection accuracy for medium-scale cracks gradually declines.
Finally, when is set to five, both small-scale and large-scale crack detection accuracies peak, and the average mAP@50 is significantly higher than that of the other variants. This indicates that a window size of 5 is optimal.
4.6. Ablation Experiment
The primary difference between Crack-MsCGA and other comparative models lies in the neck network architecture. It first employs MsCGA to enhance high-level features, and then fuses these features with low-level features. In this section, we design ablation experiments to examine the rationale behind the neck network structure in Crack-MsCGA.The experimental results are shown in
Table 8. The “Fused Features” column indicates the features used in the neck fusion process, while the “MsCGA” column specifies whether the MsCGA attention mechanism is applied.
Firstly, we design a WOA variant by removing the MsCGA block from Crack-MsCGA to analyze its contribution. Compare to the WOA variant, Crack-MsCGA exhibits improvements in crack mAP@50 across all scales, achieving gains of 3.6%, 3.8%, and 4.2% for small-, medium-, and large-scale cracks, respectively. This suggests that MsCGA plays a pivotal role in achieving high-precision pavement crack detection in the Crack-MsCGA model.
Secondly, while most common pavement crack detection models integrate multi-level features (S2, S3, and S4), Crack-MsCGA only incorporates features from S3 and S4. To validate the rationality of this design, we developed an S3 variant that, compared to Crack-MsCGA, additionally fuses the lower-level S3 feature. Experimental results indicate that, compared to Crack-MsCGA, the S3 variant shows a decline in crack detection accuracy across all scales—most notably, a 3.4% drop for small-scale cracks. On one hand, this can be attributed to the backbone network’s use of a CNN for feature extraction; although the lower-level S3 features provide more detailed information, they also introduce substantial background noise during the fusion process. On the other hand, fusing S3 dilutes the crack-specific features that the MsCGA block learned.
From the perspective of computational complexity, a comparison between Crack-MsCGA and the S2 variant reveals that reducing the fusion of low-level features () enables Crack-MsCGA to reduce to 23.9 GFLOPs. Additionally, when compared to the S2 WOA variant, it is evident that the MsCGA block requires 1.4 GFLOPs of computational cost. Overall, our unique neck design reduces the network’s computational complexity by 22.5 GFLOPs while simultaneously improving the accuracy of pavement crack detection.
4.7. Conclusions
In this study, we focus on improving the detection accuracy of slight small-scale cracks with insignificant structures and variable morphologies. We propose a Multi-scale Cascaded Group Attention (MsCGA) mechanism to learn and fuse both short-range and long-range dependencies of pavement cracks at different scales. Based on MsCGA, we build a pavement crack detection model named Crack-MsCGA. Crack-MsCGA shows its promising potential in the task of pavement crack detection, achieving the highest accuracy with minimal computational cost and the fewest parameters at all scales and all categories.
In the future, we will study pavement crack detection based on 3D point cloud data collected by LiDAR, aiming to further improve the detection effect of small-scale road cracks.