Article

Crack-MsCGA: A Deep Learning Network with Multi-Scale Attention for Pavement Crack Detection

1 College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming 650224, China
2 School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Sensors 2025, 25(8), 2446; https://doi.org/10.3390/s25082446
Submission received: 18 March 2025 / Revised: 2 April 2025 / Accepted: 9 April 2025 / Published: 12 April 2025
(This article belongs to the Section Fault Diagnosis & Sensors)

Abstract
Pavement crack detection is crucial for ensuring road safety and reducing maintenance costs. Existing methods typically use convolutional neural networks (CNNs) to extract multi-level features from pavement images and employ attention mechanisms to enhance global features. However, the fusion of low-level features introduces substantial interference, leading to low detection accuracy for small-scale cracks with subtle local structures and varying global morphologies. In this paper, we propose a computationally efficient deep learning network with CNNs and multi-scale attention for multi-scale crack detection, named Crack-MsCGA. In this network, we avoid fusing low-level features to reduce noise interference. We then propose a multi-scale attention mechanism (MsCGA) to learn local detail features and global features from high-level features, compensating for the reduced detailed information. Specifically, MsCGA first employs local window attention to learn short-range dependencies, aggregating local features within each window. Second, it applies a cascaded group attention mechanism to learn long-range dependencies, extracting global features across the entire image. Finally, it uses a multi-scale attention fusion strategy based on Mixed Local Channel Attention (MLCA) to selectively fuse local features and global features of pavement cracks. Compared with seven existing methods, it improves the AP@50 over the state-of-the-art methods by 11.3% for small-scale, 8.1% for medium-scale, and 5.9% for large-scale detection on the DH807 dataset.

1. Introduction

Cracks are widely considered an important indicator of road surface degradation. Detecting pavement cracks is a key step in assessing pavement health, conducting preventive maintenance, and ensuring driving safety. As the most common pavement distress, cracks that are not repaired in time can cause more serious damage to the pavement structure as water infiltrates. Recently, many studies have applied deep learning to detect pavement cracks and achieved great success, owing to the exceptional ability of deep networks to automatically extract pavement crack features from large-scale data [1,2,3,4].
Among these deep learning-based models, pure CNN models (i.e., YOLOv5l [2] and Faster-RCNN-R50 [5]) and their improved versions were the first to be employed for detecting pavement cracks. Xing et al. [3] improved the YOLOv5 deep learning model by incorporating a bidirectional feature pyramid network (BIFPN) to fuse multi-level features for pavement crack detection. Liang et al. [4] proposed an automatic approach using multi-source image fusion (MSFSD) based on visible light and infrared sensors, with a generative adversarial network for urban pavement crack detection.
Subsequently, many attention mechanisms were added to CNN models to further capture long-range dependencies of pavement cracks. Yang et al. [6] proposed an attention fusion block to integrate multi-scale features and suppress interference from complex backgrounds, thereby enhancing the accuracy of pavement crack detection. Yao et al. [7] proposed a pavement crack detection method based on YOLOv5, utilizing spatial and channel squeeze and excitation (SCSE) [8] and a convolutional block attention module (CBAM) [9] to optimize model performance, and investigated the impact of “how” and “where” to add attention modules, significantly enhancing detection accuracy, speed, and robustness. These methods further improve the accuracy of pavement crack detection by leveraging global attention mechanisms to learn the variable global morphology of cracks.
However, as shown in Figure 1, both pure CNN models and networks that combine CNNs with global attention mechanisms perform poorly in small-scale pavement crack detection. This is because such cracks exhibit different characteristics from different viewpoints: from the global viewpoint, small-scale pavement cracks exhibit intricate and variable morphology in an image; from the local viewpoint, they exhibit only subtle structure, as shown in the top image of Figure 2. Furthermore, in the bottom image of Figure 2, we analyze the feature extraction performance using gradient-based class activation maps (Grad-CAMs) [10]. The results indicate that RT-DETR-R50 focuses not only on crack regions but also on background textures. This additional attention to background textures makes such models prone to confusing background textures with the local structure of slight pavement cracks. Therefore, the key challenge in small-scale pavement crack detection is how to effectively extract the features of local structures while reducing background noise interference.
The backbone network extracts multi-level features by sequentially applying CNN layers that filter background noise from the output of the previous layer to extract crack features. As a result, higher-level features have larger receptive fields and richer semantic information, while lower-level features preserve more detailed information but also incorporate additional background noise [11]. When the local structures of pavement cracks are insignificant, incorporating lower-level features within the Path Aggregation Network (PAN) [12] not only significantly increases the computational complexity but also introduces additional noise, thereby interfering with asphalt pavement crack detection, especially for small-scale cracks. Therefore, we designed a variant of RT-DETR-R50, referred to as RT-DETR-R50’, which omits the fusion of the lowest-level features in the FPN. As shown in Figure 2b, RT-DETR-R50’ focuses less on background textures compared to RT-DETR-R50, indicating that excluding the fusion of low-level features in the FPN can reduce background noise interference in pavement crack detection. However, this modification also results in the reduced extraction of local detail features.
In this work, we propose a novel automated pavement crack detection network named Crack-MsCGA, which excludes the fusion of low-level features in the FPN to reduce background noise interference. We then design a Multi-scale Cascaded Group Attention Fusion (MsCGA) block to capture the local structures of slight pavement cracks, thereby enhancing local detail features. The MsCGA block can extract and enhance both local features and global features of pavement cracks at different scales. Specifically, local window attention is first employed to aggregate local features, capturing the local structure by learning short-range dependencies (i.e., the interrelationships between elements within a window). Subsequently, a global cascaded group attention mechanism is applied to further capture global features, capturing variable global morphology by learning long-range dependencies (i.e., the interrelationships between elements across the entire image). Finally, we propose a multi-scale attention fusion strategy based on Mixed Local Channel Attention (MLCA) to selectively integrate both local and global features.
The main contributions of this paper can be summarized as follows:
  • We propose Crack-MsCGA, which is a computationally efficient pavement crack detection network. Crack-MsCGA can detect large-scale cracks, medium-scale cracks, and small-scale cracks with insignificant structure and variable morphology.
  • We design an MsCGA block, which can effectively extract both local and global features of pavement cracks by learning short-range and long-range dependencies. Moreover, the block selectively fuses local features and global features through a multi-scale attention fusion strategy based on Mixed Local Channel Attention (MLCA).
  • We conduct extensive experiments under various settings and compare Crack-MsCGA with seven existing methods to show its effectiveness. The results show that Crack-MsCGA significantly outperforms these methods. In addition, Crack-MsCGA is a computationally efficient method for pavement crack detection, achieving an average reduction of 32.0% in GFLOPs and 11.2% in parameters over the state-of-the-art method.

2. Related Work

Current mainstream methods can be classified into three categories: image processing-based detection, machine learning-based detection, and deep learning-based detection.

2.1. Image Processing-Based Detection and Machine Learning-Based Detection

Initially, researchers explored automated pavement crack detection through image processing-based methods. Techniques such as thresholding [13], edge detection [14,15], and morphological operations [16,17] were utilized to segment and enhance crack regions, while local binary patterns (LBPs) [18] and Gabor filters [19] were employed for texture- and frequency-based feature extraction. Additionally, shape-based approaches [20] and tree-structured algorithms [21] aimed to capture the geometric and hierarchical characteristics of cracks. These methods often struggle with varying lighting conditions, shadows, and complex backgrounds, which can result in poor robustness in crack detection.
More recently, machine learning-based algorithms, such as support vector machines (SVMs) [22] and random forests [23], have been employed to classify and detect cracks more accurately. These algorithms can learn from labeled datasets, improving their ability to distinguish between cracks and other pavement features. However, they require extensive feature engineering and are constrained by the quality and quantity of the training data.

2.2. Deep Learning-Based Detection

The advancement of neural networks is propelling computer vision forward remarkably, significantly improving the accuracy of tasks like image classification, object detection, and semantic segmentation. These developments make the automation of pavement crack detection possible.
Leveraging the powerful feature extraction capabilities of CNNs, many pure CNN-based models have been employed for pavement crack detection. Li et al. [24] proposed a fusion model combining grid-based classification and box-based detection to enhance the accuracy and efficiency of asphalt pavement crack identification. Mayya et al. [25] proposed a triple-stage framework combining YOLO-ensemble, MobileNetV2U-net, and spectral clustering to effectively detect cracks in stone masonry, addressing the challenges of diverse crack scales and morphologies. Raushan et al. [26] leveraged the YOLO network to effectively detect damage in concrete structures, addressing challenges posed by multi-feature backgrounds. Deng et al. [27] proposed a crack-boundary refinement framework to achieve precise segmentation of high-resolution crack images, enhancing both the safety and efficiency of UAV-based bridge inspection processes. Chu et al. [28] proposed a multi-scale crack feature extraction network (MsCFEN) that embeds a strip pooling operation to balance segmentation accuracy against GPU memory consumption, a challenge when segmenting crack images with resolutions exceeding 4K. Compared to traditional image processing-based and machine learning-based methods, deep learning approaches, particularly Convolutional Neural Networks (CNNs), achieve higher detection accuracy and robustness, owing to their exceptional ability to automatically extract pavement crack features from large-scale data.
Many attention mechanisms have been added to CNN models to further capture long-range dependencies of pavement cracks and thereby improve detection accuracy. Zhang et al. [29] transformed pavement images into Bird’s Eye View images using binocular parallax information and employed a U-Net with attention mechanisms to selectively fuse deep and shallow features for crack identification; the results demonstrated high detection accuracy for complex pavement conditions. Zhu et al. [30] proposed a novel encoder–decoder architecture that integrated a new type of hybrid attention block with residual blocks (RBs), resulting in an extremely lightweight model that achieved more accurate detection of pavement crack pixels. Dong et al. [31] achieved high precision and efficiency by integrating a convolutional block attention module and architectural optimizations. Zheng et al. [32] proposed an application-oriented multi-stage crack recognition framework that sequentially employs classification model training, semi-supervised training, pixel-level crack segmentation, and crack width measurement for comprehensive crack detection. Yang et al. [6] proposed an attention fusion block to integrate multi-scale features and suppress interference from complex backgrounds, thereby enhancing the accuracy of pavement crack detection. These models achieve significant success by incorporating attention mechanisms, which capture the global features of pavement cracks by learning long-range dependencies. However, their enhancement of local features is limited, and they may even diminish the local features learned by CNNs [33,34,35].
Unlike these approaches, Crack-MsCGA reduces noise interference by reducing the fusion of low-level features. It then incorporates a multi-scale attention mechanism to learn both local and global features from high-level features, thereby improving the accuracy of pavement crack detection.

3. Methodology

In this section, we first introduce the architecture of Crack-MsCGA and then the design of the MsCGA block.

3.1. Network Architecture

Crack-MsCGA is proposed to detect large-scale cracks, medium-scale cracks, and small-scale cracks with insignificant structure and variable morphology; it consists of three parts: the backbone network, the neck, and the detection head. Figure 3 shows the overall architecture of Crack-MsCGA. First, the input layer resizes the input pavement images to a size of H × W × 3 and then feeds them into the backbone network, where H and W represent the height and width of the resized image, respectively. Second, the backbone network, which is an improved version of ResNet-50 [36], is used to extract multi-level features (i.e., high-level features and low-level features) of the resized images. Subsequently, in the neck, MsCGA is proposed to enhance high-level features and then fuse them with low-level features through a top–down and bottom–up pathway. Finally, these fused feature maps are fed into the detection head to generate the bounding boxes and classes of pavement cracks.
Figure 3. The network structure of Crack-MsCGA, where c and s in ConvBN denote the channel number and stride of the convolution, respectively. Fuse denotes the use of three consecutive fusers, whose structure is illustrated in Figure 4, to merge input features.
Figure 4. Network structure of fuser.
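To make the overall data flow concrete, the following minimal PyTorch sketch mirrors the forward pass described above. The module names and constructor arguments are our own illustrative placeholders, not the authors' released implementation.

```python
import torch.nn as nn

class CrackMsCGA(nn.Module):
    """Structural sketch: backbone -> MsCGA on S4 -> neck -> RT-DETR head.
    All submodules are assumed to be supplied; names are illustrative."""

    def __init__(self, backbone, mscga_block, neck, head):
        super().__init__()
        self.backbone = backbone  # improved ResNet-50, returns (S3, S4)
        self.mscga = mscga_block  # enhances only the high-level feature S4
        self.neck = neck          # top-down / bottom-up fusion of S3 and S4
        self.head = head          # RT-DETR detection head

    def forward(self, images):          # images: (B, 3, H, W)
        s3, s4 = self.backbone(images)  # S3 at stride 16, S4 at stride 32
        s4 = self.mscga(s4)             # multi-scale attention enhancement
        fused = self.neck(s3, s4)       # no S2-level features are fused
        return self.head(fused)         # bounding boxes and classes
```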

3.1.1. The Backbone

The backbone network consists of an input stem and four subsequent stages, which extract multi-level features of pavement images, as shown in Figure 3. Inspired by He et al. [37], three consecutive ConvBN blocks are designed to replace the 7 × 7 convolutional layer in the input stem. The ConvBN block consists of a 3 × 3 convolutional layer, a batch normalization (BN) [38] layer, and a ReLU activation layer. This combination enhances the feature extraction stability of the backbone network, because the batch normalization layer breaks the independence of each sample's loss and the ReLU activation layer improves the backbone network's non-linear representation capabilities. The ConvBN block is also used to replace each 3 × 3 convolution in each stage. The output of Stage 3 is a low-level feature map, denoted as $S_3 \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times 512}$. The output of Stage 4 is a high-level feature map, denoted as $S_4 \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times 1024}$. Finally, $S_3$ and $S_4$ are fed into the neck.
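For reference, such a ConvBN block can be sketched in PyTorch as follows; the stem channel widths and strides below are illustrative assumptions, as the paper does not list them explicitly.

```python
import torch.nn as nn

class ConvBN(nn.Module):
    """3x3 convolution + batch normalization + ReLU, as described above."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                      padding=1, bias=False),  # bias is absorbed by BN
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Three consecutive ConvBN blocks replacing the 7x7 stem convolution;
# the channel widths and strides here are assumptions for illustration.
stem = nn.Sequential(
    ConvBN(3, 32, stride=2),
    ConvBN(32, 32),
    ConvBN(32, 64),
)
```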

3.1.2. The Neck

The backbone network employs ResNet50 to extract multi-level features by sequentially applying CNN layers that filter background noise from the output of the previous layer to extract crack features. As a result, high-level features have larger receptive fields and richer semantic information, while low-level features preserve more detailed information but also incorporate additional background noise. Due to the insignificant local structures of slight asphalt pavement cracks, incorporating lower-level features within the Path Aggregation Network (PAN) [12] not only significantly increases the computational load but also introduces additional noise, thereby interfering with asphalt pavement crack detection, especially for small-scale cracks.
Therefore, as shown in Figure 3, we designed a unique neck structure in Crack-MsCGA. On one hand, we designed the MsCGA block to enhance and aggregate local features and to learn the global feature representations of high-level features for the purpose of enriching the detailed features and capturing the variable global morphology.
The details of the MsCGA block are described in Section 3.2. On the other hand, unlike conventional object detection models that integrate lower-level features (e.g., the output from Stage 2), our approach only fuses the high-level feature $S_4$ and the low-level feature $S_3$ through a top–down and bottom–up pathway [12].

3.1.3. The Detection Head

In Crack-MsCGA, we employ an RT-DETR [39] detection head to generate the bounding boxes and categories of pavement cracks. The RT-DETR detection head is an improved version of DETR [40]. It inherits DETR’s end-to-end detection characteristics while achieving a faster inference speed. This allows Crack-MsCGA to quickly detect the bounding boxes and categories of cracks while maintaining well-organized prediction boxes.

3.1.4. Loss Function

In our Crack-MsCGA model, we use Generalized Intersection over Union Loss ($L_{GIoU}$) [41] and L1 Loss ($L_{L1}$) [42] to compute the bounding box loss, and Cross-Entropy Loss ($L_{CE}$) [43] to compute the classification loss. The overall loss function is formulated as follows:
$$L = \alpha \cdot L_{GIoU} + \beta \cdot L_{L1} + \gamma \cdot L_{CE}$$
where $\alpha$, $\beta$, and $\gamma$ are hyperparameters that balance the contributions of each loss component.
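A minimal sketch of this combined loss, assuming predictions have already been matched to targets (the Hungarian matching used by DETR-style heads is omitted) and boxes are in (x1, y1, x2, y2) format; the default weights follow Section 4.1.5.

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def detection_loss(pred_boxes, gt_boxes, pred_logits, gt_labels,
                   alpha=2.0, beta=1.0, gamma=1.0):
    """Weighted sum of GIoU, L1, and cross-entropy losses (equation above)."""
    l_giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    l_l1 = F.l1_loss(pred_boxes, gt_boxes)          # bounding box regression
    l_ce = F.cross_entropy(pred_logits, gt_labels)  # classification
    return alpha * l_giou + beta * l_l1 + gamma * l_ce
```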

3.2. Multi-Scale Cascaded Group Attention Fusion Block

In this section, we elaborate on the details of the Multi-scale Cascaded Group Attention Fusion (MsCGA) block, which can effectively extract and integrate both local and global features of pavement cracks. The computation process of this block consists of three stages. In the local feature aggregation stage, the block first divides the input features into multiple windows and then uses local window attention to compute short-range dependencies within each patch, as shown in the top branch of Figure 5. In the global feature extraction stage, the block uses cascaded group attention to learn long-range dependencies across the entire image, as shown in the bottom branch of Figure 5. Finally, a multi-scale attention fusion strategy based on Mixed Local Channel Attention (MLCA) is proposed to selectively fuse the features learned by the multi-scale attention, as illustrated in Figure 5c.

3.2.1. Local Feature Aggregation

We employ a local window attention mechanism to compute short-range dependencies, aggregating the local features of pavement cracks within each window patch, as shown in the top branch of Figure 5. First, we spatially divide the input feature map $S_4$ into windows of size $w_s$, i.e., $S_4 = \{W_{1,1}, W_{1,2}, W_{1,3}, \ldots, W_{n,n-2}, W_{n,n-1}, W_{n,n}\}$, where $n = \frac{W}{w_s}$. Then, we aggregate the local features of pavement cracks by using Cascaded Group Attention (CGA) [44] to compute the dependencies of each element in each patch:
$$W_{ij}' = \mathrm{CGA}(W_{ij})$$
where $W_{ij}$ denotes the window patch at the $i$-th row and $j$-th column, and $\mathrm{CGA}(\cdot)$ denotes using Cascaded Group Attention to compute dependencies of the input patch. Finally, we combine the aggregated features within the windows to obtain the local features $F_L \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times 256}$.
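The window partition step can be sketched as below, assuming the spatial dimensions of $S_4$ are divisible by $w_s$ (for a 640 × 640 input, $S_4$ is 20 × 20 and $w_s = 5$ divides it evenly); the helper names are our own.

```python
def window_partition(x, ws):
    """Split a feature map (B, H, W, C) into non-overlapping ws x ws windows,
    returning (B * num_windows, ws * ws, C). Assumes H, W divisible by ws."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition: merge windows back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(B, H, W, -1)

# Local feature aggregation: attention runs inside each window, then the
# windows are merged back. `attn` is any module acting on (N, tokens, C):
#   tokens = window_partition(x, ws=5)
#   x = window_reverse(attn(tokens), ws=5, H=x.shape[1], W=x.shape[2])
```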
Traditional multi-head self-attention (MHSA) mechanisms suffer from high computational costs because each head computes attention across all channels. To improve the computational efficiency of MsCGA, we adopt CGA to reduce redundant computation. CGA is a variant of MHSA that divides the input feature map into multiple channel groups and assigns each group to a different attention head. This enables each attention head to process a unique channel group of the input feature map, thereby enhancing the diversity of the attention maps. Formally, the computation process of CGA can be formulated as follows:
Let $h$ be the total number of attention heads. The input feature map $F$ with $C$ channels is divided into $h$ channel groups, with each group containing $\frac{C}{h}$ channels, i.e., $F = [F_1, F_2, \ldots, F_h]$. Here, $F_k$ denotes the $k$-th split of the input feature, where $1 \leq k \leq h$. Then, we utilize $h$ attention heads to individually calculate the self-attention for each channel group. Finally, we concatenate the output features of all self-attention heads and employ a linear layer to project the result back to the dimensions of the input feature map.
$$\bar{F}_k = \mathrm{Attention}(F_k W_k^Q, F_k W_k^K, F_k W_k^V)$$
$$\mathrm{CGA}(F) = \mathrm{Concat}[\bar{F}_k]_{k=1:h} W^P$$
where $W_k^Q$, $W_k^K$, and $W_k^V$ are learnable parameters that project each group of the feature map to different subspaces, $\mathrm{Attention}(\cdot)$ denotes the calculation of self-attention [45], and $W^P$ denotes a linear layer that projects the concatenated output feature map dimensions to align with the input feature map.
As illustrated in Figure 5b, we cascade the output features of each preceding attention head into the input features of the subsequent attention head, thereby enhancing the input information for attention computation. Specifically, the computation method for the input of each attention head is as follows:
$$F_k' = F_k + \bar{F}_{k-1}, \quad 1 < k \leq h$$
where $F_k'$ denotes the input feature of the $k$-th head and $\bar{F}_{k-1}$ denotes the output feature of the $(k-1)$-th attention head.
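Putting the equations above together, a deliberately simplified CGA module might look like the following sketch; the per-head projection layout is an assumption for illustration and omits details of the original EfficientViT implementation [44].

```python
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Simplified CGA sketch: channels are split across heads, each head
    attends over its own channel group, and each head's output is added
    to the next head's input (the cascade in the equation above)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.ModuleList(
            [nn.Linear(self.d, 3 * self.d) for _ in range(num_heads)])
        self.proj = nn.Linear(dim, dim)  # W^P, restores the input width

    def forward(self, x):                   # x: (B, N, C) token sequence
        feats = x.chunk(self.h, dim=-1)     # h channel groups of C/h each
        outs, carry = [], 0
        for k in range(self.h):
            f_k = feats[k] + carry          # F'_k = F_k + output of head k-1
            q, k_, v = self.qkv[k](f_k).chunk(3, dim=-1)
            attn = (q @ k_.transpose(-2, -1)) / self.d ** 0.5
            out = attn.softmax(dim=-1) @ v  # per-head self-attention
            outs.append(out)
            carry = out                     # cascaded into the next head
        return self.proj(torch.cat(outs, dim=-1))
```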

3.2.2. Global Feature Extraction

We employ CGA to capture intricate and variable global features by learning long-range contextual information across the entire image, building on the local crack features aggregated in the previous stage. Specifically, as shown in the bottom branch of Figure 5, we first add the local feature map produced by the local window attention mechanism to $S_4$, enriching the input to the global attention:
$$\bar{S}_4 = S_4 + F_L$$
Then, we use cascaded group attention to capture the global features $F_G \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times 256}$ by learning long-range contextual information across the entire image:
$$F_G = \mathrm{CGA}(\bar{S}_4)$$

3.2.3. Multi-Scale Attention Fusion

In the previous sections, we utilized multi-scale attention to capture both short-range and long-range dependencies for extracting local and global features. In this section, we propose a multi-scale attention fusion strategy to selectively integrate these features, as shown in Figure 5c. Specifically, we first concatenate the local feature map and the global feature map:
$$F_M = \mathrm{Concat}(F_L, F_G)$$
where $\mathrm{Concat}(\cdot)$ refers to concatenating the input multi-scale attention features. Then, we use Mixed Local Channel Attention (MLCA) [46] to reweight different channels and different regions within the same channel of the concatenated features:
$$F_M' = \mathrm{MLCA}(F_M)$$
where $\mathrm{MLCA}(\cdot)$ denotes using MLCA to reweight features. Finally, we employ a CNN-based fuser to integrate them:
$$F_M'' = \mathrm{Fuse}(F_M')$$
where $\mathrm{Fuse}(\cdot)$ denotes employing the fuser to integrate features. Figure 4 shows the network structure of the fuser.
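Combining the three stages, the MsCGA block can be sketched as below, reusing the CascadedGroupAttention, window_partition, and window_reverse helpers from the earlier sketches. The channel gate is a deliberately simplified stand-in for MLCA [46], not the original module, and the layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class MsCGABlock(nn.Module):
    """Sketch of MsCGA: local window attention, global CGA on S4 + F_L,
    then concatenation, channel reweighting, and a CNN-based fuser."""

    def __init__(self, dim, num_heads=4, ws=5):
        super().__init__()
        self.ws = ws
        self.local_attn = CascadedGroupAttention(dim, num_heads)
        self.global_attn = CascadedGroupAttention(dim, num_heads)
        self.gate = nn.Sequential(          # simplified stand-in for MLCA
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * dim, 2 * dim, 1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * dim, dim, 3, padding=1)  # CNN-based fuser

    def forward(self, s4):                  # s4: (B, C, H, W)
        B, C, H, W = s4.shape
        x = s4.permute(0, 2, 3, 1)          # to (B, H, W, C)
        # Stage 1: local feature aggregation within ws x ws windows.
        f_l = window_reverse(
            self.local_attn(window_partition(x, self.ws)), self.ws, H, W)
        f_l = f_l.permute(0, 3, 1, 2)       # F_L: (B, C, H, W)
        # Stage 2: global feature extraction over the whole map.
        s4_bar = (s4 + f_l).flatten(2).transpose(1, 2)  # (B, H*W, C)
        f_g = self.global_attn(s4_bar).transpose(1, 2).view(B, C, H, W)
        # Stage 3: concatenate, reweight channels, and fuse.
        f_m = torch.cat([f_l, f_g], dim=1)  # (B, 2C, H, W)
        return self.fuse(f_m * self.gate(f_m))
```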

4. Experiments

In this section, we first introduce the experimental settings in Section 4.1. Secondly, we conduct a comparative analysis of our proposed Crack-MsCGA and seven other advanced crack detection models from the perspectives of multi-scale crack detection, efficiency, and visualization in Section 4.2. Thirdly, in Section 4.3, we compare our proposed MsCGA block with a control group without the attention mechanism, two classical attention mechanisms, and three state-of-the-art attention mechanisms. Finally, we design comparative and ablation experiments in Sections 4.4, 4.5 and 4.6 to validate the rationality of the Crack-MsCGA design.

4.1. Experimental Settings

4.1.1. Experimental Environment

The experimental environment was as follows: Windows 11, Python 3.11.6, PyTorch 2.1.0, CUDA 11.8, Intel(R) Xeon(R) Silver 4314, NVIDIA RTX A6000.

4.1.2. Datasets

We conducted experiments on a public dataset (RDD2022-China-MotorBike, China-M) and a private dataset (DH807) to validate the effectiveness of our proposed method. The China-M dataset was captured from a rear-to-front perspective, as shown in the first row of Figure 6, while DH807 was captured from a top–down perspective, as shown in the second row of Figure 6.
The public China-M [47] dataset is a commonly used dataset for pavement distress detection, covering four types of pavement damage: Longitudinal Cracks, Transverse Cracks, Block Cracks, and Potholes. We constructed a new dataset based on RDD2022-China-MotorBike to test the effectiveness of our proposed method on the crack detection task. Specifically, we first removed images without cracks from the dataset. Then, we removed non-crack labels and manually corrected incorrect labels. Finally, we randomly divided the dataset into a training set and a test set in a 7:3 ratio, yielding 1290 training images and 560 test images. Each image is a grayscale image with a resolution of 512 × 512 pixels. The first row of Figure 6 shows two images from the dataset.
The DH807 dataset is a new dataset proposed by us. Specifically, we used a pavement image collection vehicle developed by Wuhan Optics Valley Zoyon Science and Technology to collect pavement images from a top–down perspective in Dehong Prefecture, Yunnan Province, China. This collection vehicle was equipped with advanced high-speed cameras, shooting control algorithms, and postprocessing alignment algorithms, ensuring continuous high-definition pavement images at speeds ranging from 0 to 100 km/h. Subsequently, we selected 807 images containing cracks from the collected pictures and annotated them using LabelImg. These images were then divided into a training set and a test set in a 7:3 ratio. The second row of Figure 6 shows two images from the dataset.

4.1.3. Evaluation Metrics

In object detection, several metrics are used to evaluate the accuracy of a model. True Positives (TP) are the correctly detected objects, i.e., the objects that are correctly identified by the model. False Positives (FP) are the incorrectly detected objects, i.e., the objects that the model incorrectly identifies as present. False Negatives (FN) are the objects that the model fails to detect, i.e., the objects that are present but not identified by the model. Precision (P) is the ratio of TP to the sum of TP and FP:
$$P = \frac{TP}{TP + FP}$$
Recall (R) is the ratio of TP to the sum of TP and FN:
$$R = \frac{TP}{TP + FN}$$
F1 score is the harmonic mean of precision and recall, primarily used to evaluate the balance between precision and recall in classification or detection tasks. Its formula is as follows:
$$F1 = \frac{2 \times P \times R}{P + R}$$
To comprehensively analyze the balance of crack detection across different scales, this study utilizes the mean F1 score (mF1), which is calculated as the average of the F1 scores for small, medium, and large scales, to evaluate the model’s balance between precision and recall:
$$mF1 = \frac{1}{S} \sum_{i=1}^{S} F1_i$$
where $S$ is the number of scales and $F1_i$ is the F1 score for the $i$-th scale. Average Precision (AP) is the area under the precision–recall curve, summarizing the precision and recall at different threshold levels. AP@50 is a specific instance of AP where the Intersection over Union (IoU) threshold is set to 0.50. This means that a predicted bounding box is considered a True Positive if its IoU with the ground truth box is greater than or equal to 0.50. Mean Average Precision (mAP) is a commonly used metric in object detection that averages the AP across all classes. It provides a single performance measure that accounts for both precision and recall across different object categories. The mAP is calculated as follows:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ is the number of classes and $AP_i$ is the Average Precision for the $i$-th class.
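For reference, the sketch below shows how per-image TP/FP/FN counts under the AP@50 matching rule (IoU ≥ 0.50) and the resulting F1 score can be computed; a full AP computation additionally sorts predictions by confidence and integrates the precision–recall curve, which is omitted here.

```python
from torchvision.ops import box_iou

def match_at_iou50(pred_boxes, gt_boxes, thr=0.5):
    """Greedy matching: a prediction is a TP if its IoU with a
    still-unmatched ground-truth box is >= thr. Returns (tp, fp, fn)."""
    iou = box_iou(pred_boxes, gt_boxes)        # (num_pred, num_gt)
    if iou.numel() == 0:                       # no predictions or no GTs
        return 0, iou.shape[0], iou.shape[1]
    matched, tp = set(), 0
    for p in range(iou.shape[0]):
        g = int(iou[p].argmax())
        if iou[p, g] >= thr and g not in matched:
            matched.add(g)
            tp += 1
    return tp, iou.shape[0] - tp, iou.shape[1] - tp

def f1_score(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0     # precision
    r = tp / (tp + fn) if tp + fn else 0.0     # recall
    return 2 * p * r / (p + r) if p + r else 0.0

# mF1 averages the per-scale F1 scores (small, medium, large):
# m_f1 = sum(f1_per_scale) / len(f1_per_scale)
```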
In addition to accuracy metrics, some metrics are used to evaluate the inference speed and computational complexity of the models. In this study, Frames Per Second (FPS) is utilized to assess the inference speed, while GFLOPs and parameters are employed to evaluate the computational complexity of the models.

4.1.4. Baselines

In our study, we conduct a comparative analysis of our method Crack-MsCGA with seven existing models: YOLOv5l [2], YOLOv5l-ST-BIFPN [3], UM-YOLOl [48], Faster-RCNN-R50 [5], Deformable-DETR-R50 [49], DDQ-DETR-R50 [50], and RT-DETR-R50 [39]. Among them, YOLOv5l and Faster-RCNN-R50 are purely CNN-based models, while Deformable-DETR-R50, DDQ-DETR-R50, and RT-DETR-R50 are end-to-end detection models that leverage attention mechanisms. Additionally, YOLOv5l-ST-BIFPN and UM-YOLOl are improved versions of YOLOv5 and YOLOv8 tailored for pavement crack detection. To ensure a fair comparison, we implemented YOLOv5l-ST-BIFPN and UM-YOLOl by following the methods outlined in the respective papers, based on the YOLOv5 and YOLOv8 variants with parameter counts most similar to Crack-MsCGA (i.e., YOLOv5l and YOLOv8l).
YOLOv5l: YOLOv5l is a classical single-stage object detection model known for its speed and efficiency. It is part of the YOLO (You Only Look Once) family, which emphasizes real-time detection with good accuracy.
YOLOv5l-ST-BIFPN: YOLOv5l-ST-BIFPN is an improved version of YOLOv5l, integrating the Swin Transformer block and Bidirectional Feature Pyramid Network (BIFPN) into YOLOv5. These enhancements enable it to achieve outstanding performance in UAV pavement crack detection.
UM-YOLOl: UM-YOLOl is an enhanced version of YOLOv8l. It integrates an Efficient Multiscale Attention (EMA) module into the backbone’s C2f module, employs a Bi-FPN fusion mechanism in the neck to improve feature aggregation, and leverages GSConv—a lightweight convolutional network—for more efficient convolution operations.
Faster-RCNN-R50: Faster-RCNN-R50 is a classical two-stage object detection model. It first generates region proposals and then refines these proposals for final object detection and classification.
Deformable-DETR-R50: Deformable DETR (DEtection TRansformer) with a ResNet-50 backbone integrates deformable attention modules into the DETR framework. This model improves upon the original DETR by enhancing its ability to handle small objects and varying object scales through deformable attention, leading to better performance in object detection tasks.
DDQ-DETR-R50: DDQ-DETR-R50 stands for Dense Distinct Query for end-to-end object detection with a ResNet-50 backbone. This model leverages dense distinct queries in a transformer-based framework, enhancing its ability to detect objects with varying scales and shapes.
RT-DETR-R50: RT-DETR-R50 denotes Real-Time DETR with a ResNet-50 backbone. It aims to bring the benefits of the DETR model, including its transformer-based architecture and attention mechanisms, to real-time applications. RT-DETR-R50 balances the high accuracy of transformer-based detection with the speed necessary for real-time inference.

4.1.5. Hyperparameter Settings

For a fair comparison, we carefully tune the hyperparameters of all baselines and report their best performance. The number of epochs is set to 300, the learning rate is set to $1 \times 10^{-4}$, the weight decay of the optimizer is set to $1 \times 10^{-4}$, the batch size for training is set to 16, and the input image size is 640 × 640. In the local window attention of the MsCGA block, the window size $w_s$ is set to 5. For the loss function, the hyperparameters $\alpha$, $\beta$, and $\gamma$ are set to 2, 1, and 1, respectively.
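Collected in one place, these settings correspond to the following illustrative configuration; the key names are our own, not taken from the paper's code.

```python
config = {
    "epochs": 300,
    "learning_rate": 1e-4,
    "weight_decay": 1e-4,
    "batch_size": 16,
    "input_size": (640, 640),
    "window_size": 5,  # w_s in the local window attention of MsCGA
    "loss_weights": {"alpha": 2.0, "beta": 1.0, "gamma": 1.0},
}
```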

4.2. Detection Performance

In this section, we evaluate Crack-MsCGA on the China-M dataset and DH807 dataset, then compare it with seven baselines.

4.2.1. Performance on the China-M Dataset

The performance of the methods on China-M is summarized in Table 1, with AP@50 metrics reported for small, medium, and large objects as well as the mean of all scales.
As shown in Table 1, we first analyze the detection performance under varying crack scales, i.e., small, medium, and large. Specifically, for the most challenging small-scale cracks with slight structures and variable morphologies, Crack-MsCGA outperforms the baselines with an average improvement of 11.7% (i.e., improving AP@50 by 11.3%, 11.1%, 8.3%, 14.9%, 16.1%, 7.7%, and 12.2% over YOLOv5l, YOLOv5l-ST-BIFPN, UM-YOLOl, Faster-RCNN-R50, Deformable-DETR-R50, DDQ-DETR-R50, and RT-DETR-R50, respectively). In particular, compared to the pure CNN models (i.e., YOLOv5l and Faster-RCNN-R50), Crack-MsCGA achieves improvements of 11.3% and 14.9% in detection performance, respectively. Models based on CNNs with attention mechanisms (e.g., UM-YOLOl, DDQ-DETR-R50, and RT-DETR-R50) show improved accuracy in detecting large-scale and medium-scale cracks compared to pure CNN models; however, their improvement in detecting small-scale cracks remains relatively limited. Overall, Crack-MsCGA achieved the highest accuracy across all scales. Even when compared to the closest method (i.e., DDQ-DETR-R50), Crack-MsCGA demonstrated improvements of 7.7%, 0.5%, and 1.6% in detecting small-scale, medium-scale, and large-scale cracks, respectively. In addition, the mF1 score of Crack-MsCGA is higher than that of all other models, achieving the top rank. This indicates that Crack-MsCGA strikes a stronger balance between precision and recall. On the one hand, these improvements are attributed to Crack-MsCGA's design, which avoids the noise interference present in low-level features. On the other hand, they are credited to the MsCGA block, which employs local window attention to aggregate local features by learning short-range dependencies within each patch. This enhances the feature representation capability for small-scale cracks with subtle structures and variable morphologies.
Overall, the results validate the effectiveness of Crack-MsCGA in improving the detection performance of pavement cracks on the China-M dataset, making it a robust and reliable solution for automated pavement crack detection.

4.2.2. Performance on DH807 Dataset

As shown in Table 2, the proposed Crack-MsCGA achieved the highest average AP@50 across the three scales, reaching 60.3%, an improvement of 2.5% over the closest model, RT-DETR-R50, on the DH807 dataset. Specifically, for the most challenging slight small-scale cracks with insignificant structures and variable morphologies, Crack-MsCGA outperforms the baselines with an average improvement of 11.3% (i.e., improving AP@50 by 7.3%, 9.1%, 10.9%, 17.5%, 16.3%, 12.1%, and 6.0% over YOLOv5l, YOLOv5l-ST-BIFPN, UM-YOLOl, Faster-RCNN-R50, Deformable-DETR-R50, DDQ-DETR-R50, and RT-DETR-R50, respectively). For medium-scale cracks, Crack-MsCGA achieved a 6.7% average improvement in AP@50, and for large-scale cracks, a 5.9% average improvement. From the perspective of the F1 score, Crack-MsCGA still achieved the best performance on the DH807 dataset, whereas DDQ-DETR-R50, which performs well on the China-M dataset, experienced a significant drop in mF1 score on DH807. This indicates that Crack-MsCGA exhibits better robustness than the other models. These improvements further illustrate the effectiveness of MsCGA in enhancing the detection of small-scale pavement cracks: the MsCGA block uses local window attention to aggregate local features by learning short-range dependencies within each patch, enhancing the feature representation for small-scale cracks with delicate structures and varying morphologies.

4.2.3. Analysis on Efficiency

The computational cost and number of parameters are crucial reference metrics for deploying models on portable devices with integrated sensors because computational power and memory are highly limited on such devices. Therefore, we compare the computational cost (GFLOPs) and the number of parameters of Crack-MsCGA and other models to analyze their efficiency on portable devices with integrated sensors. The experimental results, as summarized in Table 3, clearly demonstrate the superior performance of our proposed method, Crack-MsCGA, compared to other techniques.
Firstly, in terms of computational efficiency, Crack-MsCGA requires only 44.5 GFLOPs, the lowest among the compared methods, achieving an average reduction of 32.0%. This efficiency is crucial for real-time applications where computational resources and processing time are limited.
Secondly, Crack-MsCGA has the fewest parameters, with only 38.5 M, which not only reduces the model's complexity but also minimizes the memory footprint. This makes it highly suitable for deployment on devices with limited memory, such as portable devices with integrated sensors.
Thirdly, to further explore the practical inference speed of Crack-MsCGA, we conducted FPS tests on different platforms and under various floating point precisions. The experimental results are shown in Table 4. Analysis of these results reveals that the inference speed of YOLOv5l leads all other models under all conditions. Our proposed Crack-MsCGA follows closely, achieving second place in all metrics. Overall, with FP16 precision on the NVIDIA RTX 3060, Crack-MsCGA provides an average FPS improvement of 44.4% compared to other models. This indicates that the inference speed of Crack-MsCGA outperforms most real-time object detection models.
Overall, Crack-MsCGA’s parameter count and inference computational cost outperform all other models, while its inference speed is second only to YOLOv5l. It can meet real-time inference requirements in most scenarios, making it a highly effective solution for practical applications.

4.2.4. Gradient-Based Class Activation Maps Visualization

In this section, we use Grad-CAM++ to generate gradient-based class activation maps to analyze the effectiveness of the backbone network and neck in extracting pavement crack features in Crack-MsCGA and other models, as shown in Figure 7.
From Figure 7, we can observe that the pure CNN models YOLOv5l and YOLOv5l-ST-BIFPN not only focus on the crack regions but also pay considerable attention to the pavement background textures. In contrast, the CNN-with-attention-based models RT-DETR-R50 and DDQ-DETR-R50, thanks to the incorporation of a global attention mechanism, significantly reduce the focus on background regions compared to the pure CNN models and place greater emphasis on the overall continuity of the cracks. Furthermore, we designed a variant of RT-DETR-R50, named RT-DETR-R50’, which removes the fusion of low-level features. Compared with RT-DETR-R50, this variant further reduces attention on background regions, but it neglects some local detail features. Crack-MsCGA builds on the removal of low-level feature fusion by using MsCGA to aggregate both the local detail features and global features of cracks at multiple scales, thereby further reducing focus on background regions while effectively concentrating on the crack areas.

4.2.5. Visualization of Prediction Results

In Figure 8, we present the prediction results of Crack-MsCGA and other comparative models for randomly selected samples (a–h) from the test set.
From Figure 8a,b,e, it is evident that Crack-MsCGA can detect slight small-scale cracks with insignificant structure and variable morphology. This capability is attributed to MsCGA, where local window attention focuses on the interrelationships within local windows, enhancing the model’s ability to capture features of slight small-scale pavement cracks with insignificant structures and variable morphologies. Figure 8d–f illustrate that, compared to pure CNN models, models employing global attention mechanisms—such as Deformable-DETR-R50, DDQ-DETR-R50, and RT-DETR-R50—tend to overlook slight cracks and critical local details, such as endpoints. This phenomenon occurs because the features learned by the global attention mechanism weaken local features. In contrast, Crack-MsCGA effectively addresses this issue. On one hand, Crack-MsCGA enhances both local and global features by learning short-range and long-range dependencies. On the other hand, in the detection of slight small-scale cracks, the multi-scale attention fusion strategy can selectively retain more local features.
From Figure 8c,e,g, it is evident that pure CNN models such as YOLOv5l and Faster-RCNN-R50 often produce cluttered bounding boxes, even with the application of Non-Maximum Suppression (NMS) as a post-processing step. In contrast, the detection results from DDQ-DETR-R50, RT-DETR-R50, and Crack-MsCGA exhibit more orderly bounding boxes. This improvement is attributed to the use of global attention mechanisms in DDQ-DETR-R50, RT-DETR-R50, and Crack-MsCGA, which can capture long-range dependencies among input features.
By analyzing Figure 8e,h, it is observed that Crack-MsCGA tends to overly focus on local details when handling large-scale and structurally complex cracks. This results in scenarios where a single crack is detected as two smaller ones, or part of a large crack is identified as a small, separate crack. This issue is even more pronounced in pure CNN models like Faster-RCNN-R50 and YOLOv5l. However, models utilizing global attention mechanisms, such as DDQ-DETR-R50 and RT-DETR-R50, alleviate this problem to some extent. This suggests that when dealing with large-scale, complex crack structures, MsCGA may face limitations in adequately capturing global features.
From the above analysis, we can conclude that Crack-MsCGA effectively utilizes local window attention mechanisms to capture short-range dependencies of pavement cracks, aggregating fine-grained local features, which enables it to detect slight small-scale pavement cracks. Additionally, Crack-MsCGA employs cascaded group attention to capture long-range dependencies of pavement cracks for extracting intricate and variable global features, ensuring the regularity and simplicity of the prediction results. This confirms the significant role of MsCGA in improving the accuracy of pavement crack detection.

4.3. Comparison with Other Attention Mechanisms

In this section, we conduct comparative experiments involving a control group without an attention mechanism (WOA), two classical attention mechanisms (CBAM and DAttention), and three state-of-the-art attention mechanisms (AIFI, FLA, and LSKA). For a fair comparison, we construct the comparison models by replacing the MsCGA block in Crack-MsCGA with these attention mechanisms. The results are summarized in Table 5, which presents AP@50 for small-, medium-, and large-scale cracks, as well as the mean performance (mean). Our MsCGA attention outperforms all other attention mechanisms across all scales. Specifically, MsCGA achieves the highest AP@50 scores for small-scale cracks (84.8%), medium-scale cracks (86.8%), and large-scale cracks (96.3%), resulting in a mean AP@50 of 89.3%. This demonstrates the superior capability of MsCGA in accurately detecting pavement cracks at all scales.
In comparison, classical attention mechanisms, i.e., CBAM and DAttention, show relatively lower performance for small-scale crack detection than WOA, which operates without an attention mechanism. This phenomenon occurs because those attention mechanisms may diminish the local features extracted by CNNs when capturing global features. The state-of-the-art methods, AIFI, FLA, and LSKA, show improvements over WOA in medium-scale and large-scale crack detection. However, the improvements in small-scale crack detection are minimal. This is because these attention mechanisms can enhance global features by learning long-range dependencies, but their enhancement of local features is limited. It is worth noting that Crack-MsCGA achieves significant improvements across all scales. This benefits from the fact that MsCGA can not only enhance both local and global features by learning short-range and long-range dependencies, but also selectively fuse them.

4.4. Comparative Experiments on Where to Employ the MsCGA Block

In the neck of the Crack-MsCGA model, we only apply attention enhancement to high-level features, without enhancing low-level features. This is based on the theory that applying attention enhancement to concatenated multi-level features simultaneously is redundant because high-level features are abstractions of low-level features in generic object detection. However, there are significant differences between pavement crack detection and traditional object detection. Therefore, we set up ablation experiments to explore where employing the MsCGA block leads to the best results. As shown in Table 6, we present the computational cost and mAP@50 of Crack-MsCGA and its three variants. The WOA variant does not use the MsCGA block to enhance features. The AllLevel variant uses the MsCGA block to enhance both high-level and low-level features. The LowLevel variant only uses the MsCGA block to enhance low-level features, while Crack-MsCGA uses the MsCGA block only to enhance high-level features.
Among all variants, Crack-MsCGA achieved the highest mAP@50, with a computational cost only slightly higher than the WOA variant, which does not use the MsCGA block. This confirms that the theory also applies to crack detection. The LowLevel variant improved mAP@50 by only 0.1% compared to the WOA variant, while the AllLevel variant improved mAP@50 by just 0.3%. This is because the information learned by the attention mechanism can easily diminish the local features in large-scale features.

4.5. Comparative Experiments on Window Size of Local Window Attention Mechanism

In Section 3.2, we proposed a local window attention mechanism that computes short-range dependencies by aggregating the local features of pavement cracks within each window patch, with the window size given by the hyperparameter $w_s$. In this section, we design comparative experiments to investigate the impact of different window sizes on asphalt pavement crack detection. Table 7 presents the detection accuracies on the DH807 dataset for different window sizes.
Firstly, all variants exhibit similar performance in detecting large-scale cracks. This is because each variant employs a CGA mechanism to capture intricate and variable global attention features by learning long-range contextual information across the entire image. These global attention features facilitate the detection of large-scale cracks.
Secondly, as the w s increases, the detection accuracy for medium-scale cracks gradually declines.
Finally, when $w_s$ is set to 5, both small-scale and large-scale crack detection accuracies peak, and the average mAP@50 is significantly higher than that of the other variants. This indicates that a window size of 5 is optimal.

4.6. Ablation Experiment

The primary difference between Crack-MsCGA and the other comparative models lies in the neck network architecture. It first employs MsCGA to enhance high-level features and then fuses these features with low-level features. In this section, we design ablation experiments to examine the rationale behind the neck network structure in Crack-MsCGA. The experimental results are shown in Table 8. The “Fused Features” column indicates the features used in the neck fusion process, while the “MsCGA” column specifies whether the MsCGA attention mechanism is applied.
Firstly, we design a WOA variant by removing the MsCGA block from Crack-MsCGA to analyze its contribution. Compared to the WOA variant, Crack-MsCGA exhibits improvements in crack mAP@50 across all scales, achieving gains of 3.6%, 3.8%, and 4.2% for small-, medium-, and large-scale cracks, respectively. This suggests that MsCGA plays a pivotal role in achieving high-precision pavement crack detection in the Crack-MsCGA model.
Secondly, while most common pavement crack detection models integrate multi-level features (S2, S3, and S4), Crack-MsCGA only incorporates features from S3 and S4. To validate the rationality of this design, we developed an S2 variant that, compared to Crack-MsCGA, additionally fuses the lower-level S2 feature. Experimental results indicate that, compared to Crack-MsCGA, the S2 variant shows a decline in crack detection accuracy across all scales, most notably a 3.4% drop for small-scale cracks. On one hand, this can be attributed to the backbone network's use of a CNN for feature extraction; although the lower-level S2 features provide more detailed information, they also introduce substantial background noise during the fusion process. On the other hand, fusing S2 dilutes the crack-specific features learned by the MsCGA block.
From the perspective of computational complexity, a comparison between Crack-MsCGA and the S2 variant reveals that omitting the fusion of the low-level feature $S_2$ saves 23.9 GFLOPs. Additionally, a comparison with the S2 WOA variant shows that the MsCGA block itself requires 1.4 GFLOPs of computational cost. Overall, our unique neck design reduces the network's computational complexity by 22.5 GFLOPs while simultaneously improving the accuracy of pavement crack detection.

5. Conclusions

In this study, we focused on improving the detection accuracy of slight small-scale cracks with insignificant structures and variable morphologies. We proposed a Multi-scale Cascaded Group Attention (MsCGA) mechanism to learn and fuse both short-range and long-range dependencies of pavement cracks at different scales. Based on MsCGA, we built a pavement crack detection model named Crack-MsCGA. Crack-MsCGA shows promising potential for pavement crack detection, achieving the highest accuracy across all scales and all categories with minimal computational cost and the fewest parameters.
In the future, we will study pavement crack detection based on 3D point cloud data collected by LiDAR, aiming to further improve the detection effect of small-scale road cracks.

Author Contributions

In this study, G.L. (Guoxi Liu) built the object detection network model, completed the experiments, and wrote the article. X.W. recorded the experimental data. G.L. (Guozhi Liu) completed the search of the dataset. L.L. finished revising the grammar of the article. F.D. and B.H. provided various guidance. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Research and Development Program of Yunnan Province, China (No. 202402AD080002-5), the National Natural Science Foundation of China (No. 62262063), the Yunnan Provincial Revitalization Talents Support Plan, China (No. XDYC-CYCX-2022-0009), the Science and Technology Youth Lift Talents of Yunnan Province, China, the Key Industry Science and Technology Projects for University Services in Yunnan Province, China (No. FWCY-ZNT2024020), the Yunnan Provincial Department of Education Science Research Fund (No. 2025Y0850), the Yunnan Fundamental Research Projects (Grant No. 202501AS070046), and the Open Fund Project of the Key Laboratory of the Ministry of Education (EIN20240004).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, B.H., upon reasonable request.

Acknowledgments

We thank the RDD2022 project for providing part of the experimental data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kheradmandi, N.; Mehranfar, V. A critical review and comparative study on image segmentation-based techniques for pavement crack detection. Constr. Build. Mater. 2022, 321, 126162. [Google Scholar] [CrossRef]
  2. Ultralytics. YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 24 May 2024).
  3. Xing, J.; Liu, Y.; Zhang, G.Z. Improved YOLOv5-based UAV pavement crack detection. IEEE Sens. J. 2023, 23, 15901–15909. [Google Scholar] [CrossRef]
  4. Liang, H.; Qiu, D.; Ding, K.L.; Zhang, Y.; Wang, Y.; Wang, X.; Liu, T.; Wan, S. Automatic Pavement Crack Detection in Multisource Fusion Images Using Similarity and Difference Features. IEEE Sens. J. 2023, 24, 5449–5465. [Google Scholar] [CrossRef]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
  6. Yang, L.; Huang, H.; Kong, S. A Deep Supervised Pavement Crack Detection Network with Multiscale Feature Fusion and Feature Learning. In Proceedings of the 2023 35th Chinese Control and Decision Conference (CCDC), Yichang, China, 20–22 May 2023; pp. 1002–1007. [Google Scholar] [CrossRef]
  7. Yao, H.; Liu, Y.; Li, X.; You, Z.; Feng, Y.; Lu, W. A detection method for pavement cracks combining object detection and attention mechanism. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22179–22189. [Google Scholar] [CrossRef]
  8. Roy, A.G.; Navab, N.; Wachinger, C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, 16–20 September 2018; Proceedings, Part I. Springer: Berlin/Heidelberg, Germany, 2018; pp. 421–429. [Google Scholar]
  9. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  10. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847. [Google Scholar]
  11. Elharrouss, O.; Akbari, Y.; Almadeed, N.; Al-Maadeed, S. Backbones-review: Feature extractor networks for deep learning and deep reinforcement learning approaches in computer vision. Comput. Sci. Rev. 2024, 53, 100645. [Google Scholar] [CrossRef]
  12. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  13. Tang, J.; Gu, Y. Automatic crack detection and segmentation using a hybrid algorithm for road distress analysis. In Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics, Manchester, UK, 13–16 October 2013; pp. 3026–3030. [Google Scholar]
  14. Lim, R.S.; La, H.M.; Shan, Z.; Sheng, W. Developing a crack inspection robot for bridge maintenance. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 6288–6293. [Google Scholar]
  15. Lim, R.S.; La, H.M.; Sheng, W. A robotic crack inspection and mapping system for bridge deck maintenance. IEEE Trans. Autom. Sci. Eng. 2014, 11, 367–378. [Google Scholar] [CrossRef]
  16. Amhaz, R.; Chambon, S.; Idier, J.; Baltazart, V. Automatic crack detection on two-dimensional pavement images: An algorithm based on minimal path selection. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2718–2729. [Google Scholar] [CrossRef]
  17. La, H.M.; Gucunski, N.; Dana, K.; Kee, S.H. Development of an autonomous bridge deck inspection robotic system. J. Field Robot. 2017, 34, 1489–1504. [Google Scholar] [CrossRef]
  18. Miraliakbari, A.; Sok, S.; Ouma, Y.; Hahn, M. Comparative evaluation of pavement crack detection using kernel-based techniques in asphalt road surfaces. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 41, 689–694. [Google Scholar] [CrossRef]
  19. Saha, P.K.; Arya, D.; Sekimoto, Y. Federated learning–based global road damage detection. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 2223–2238. [Google Scholar] [CrossRef]
  20. Dollár, P.; Appel, R.; Belongie, S.; Perona, P. Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1532–1545. [Google Scholar] [CrossRef] [PubMed]
  21. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. CrackTree: Automatic crack detection from pavement images. Pattern Recognit. Lett. 2012, 33, 227–238. [Google Scholar] [CrossRef]
  22. Fernandes, K.; Ciobanu, L. Pavement pathologies classification using graph-based features. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 793–797. [Google Scholar]
  23. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  24. Li, B.L.; Qi, Y.; Fan, J.S.; Liu, Y.F.; Liu, C. A grid-based classification and box-based detection fusion model for asphalt pavement crack. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 2279–2299. [Google Scholar] [CrossRef]
  25. Mayya, A.M.; Alkayem, N.F. Triple-stage crack detection in stone masonry using YOLO-ensemble, MobileNetV2U-net, and spectral clustering. Autom. Constr. 2025, 172, 106045. [Google Scholar] [CrossRef]
  26. Raushan, R.; Singhal, V.; Jha, R.K. Damage detection in concrete structures with multi-feature backgrounds using the YOLO network family. Autom. Constr. 2025, 170, 105887. [Google Scholar] [CrossRef]
27. Deng, L.; Yuan, H.; Long, L.; Chun, P.-J.; Chen, W.; Chu, H. Cascade refinement extraction network with active boundary loss for segmentation of concrete cracks from high-resolution images. Autom. Constr. 2024, 162, 105410. [Google Scholar] [CrossRef]
28. Chu, H.; Chun, P.-J. Fine-grained crack segmentation for high-resolution images via a multiscale cascaded network. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 575–594. [Google Scholar] [CrossRef]
  29. Zhang, J.; Chen, N.; Peng, J.; Cui, F. Pavement Crack Detection on BEV Based on Attention-Unet. In Proceedings of the 2022 IEEE 2nd International Conference on Digital Twins and Parallel Intelligence (DTPI), Boston, MA, USA, 24–28 October 2022; pp. 1–4. [Google Scholar]
  30. Zhu, G.; Liu, J.; Fan, Z.; Yuan, D.; Ma, P.; Wang, M.; Sheng, W.; Wang, K.C. A lightweight encoder–decoder network for automatic pavement crack detection. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 1743–1765. [Google Scholar] [CrossRef]
  31. Dong, Z.; Zhu, G.; Fan, Z.; Liu, J.; Li, H.; Cai, Y.; Huang, H.; Shi, Z.; Ning, W.; Wang, L. Automatic Pavement Crack Detection Based on YOLOv5-AH. In Proceedings of the 2022 12th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Changbai Mountain, China, 27–31 July 2022; pp. 426–431. [Google Scholar]
  32. Zheng, Y.; Gao, Y.; Lu, S.; Mosalam, K.M. Multistage semisupervised active learning framework for crack identification, segmentation, and measurement of bridges. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 1089–1108. [Google Scholar] [CrossRef]
33. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  34. Huang, Z.; Ben, Y.; Luo, G.; Cheng, P.; Yu, G.; Fu, B. Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv 2021, arXiv:2106.03650. [Google Scholar]
35. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 558–567. [Google Scholar]
38. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
39. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  40. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  41. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 658–666. [Google Scholar]
  42. Barron, J.T. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 4331–4339. [Google Scholar]
43. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
44. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14420–14430. [Google Scholar]
45. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  46. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 2023, 123, 106442. [Google Scholar] [CrossRef]
  47. Lin, C.; Tian, D.; Duan, X.; Zhou, J.; Zhao, D.; Cao, D. DA-RDD: Toward domain adaptive road damage detection across different countries. IEEE Trans. Intell. Transp. Syst. 2022, 24, 3091–3103. [Google Scholar] [CrossRef]
  48. Zhu, J.; Wu, Y.; Ma, T. Multi-object detection for daily road maintenance inspection with UAV based on improved YOLOv8. IEEE Trans. Intell. Transp. Syst. 2024, 25, 16548–16560. [Google Scholar] [CrossRef]
49. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  50. Zhang, S.; Wang, X.; Wang, J.; Pang, J.; Lyu, C.; Zhang, W.; Luo, P.; Chen, K. Dense distinct query for end-to-end object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7329–7338. [Google Scholar]
  51. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4803. [Google Scholar]
  52. Han, D.; Pan, X.; Han, Y.; Song, S.; Huang, G. Flatten transformer: Vision transformer using focused linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 5961–5971. [Google Scholar]
53. Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large separable kernel attention: Rethinking the large kernel attention design in CNN. Expert Syst. Appl. 2024, 236, 121352. [Google Scholar] [CrossRef]
Figure 1. Comparison of the detection results of YOLOv5l, Faster-RCNN-R50, Deformable-DETR-R50, RT-DETR-R50, DDQ-DETR-R50, and our proposed Crack-MsCGA.
Figure 2. Comparison of gradient-based class activation maps (Grad-CAMs) of YOLOv5l (a pure CNN model), RT-DETR-R50 (a network combining CNN and global attention mechanisms), and our proposed Crack-MsCGA.
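For readers who want to reproduce maps like these, the sketch below computes a plain Grad-CAM with PyTorch hooks. It is an illustrative stand-in for the Grad-CAM++ variant [10] used in the figures; the ResNet-18 backbone, the choice of target layer, and the random input are placeholder assumptions, not the paper's setup.

```python
# Minimal Grad-CAM sketch (illustrative; the figures use Grad-CAM++ [10]).
# Backbone, target layer, and input are placeholder assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
acts, grads = {}, {}
# Capture activations and their gradients at the last conv stage.
model.layer4.register_forward_hook(lambda m, i, o: acts.update(v=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()                    # logit of the top class
score.backward()                             # populates grads via the hook

w = grads["v"].mean(dim=(2, 3), keepdim=True)            # per-channel weights
cam = F.relu((w * acts["v"]).sum(dim=1, keepdim=True))   # weighted activation map
cam = F.interpolate(cam, x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8) # normalize to [0, 1]
print(cam.shape)                             # torch.Size([1, 1, 224, 224])
```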
Figure 5. Architecture of MsCGA. It consists of three parts: the top branch (a) is local window attention for local feature extraction, the bottom branch (b) is Cascaded Group Attention for global feature extraction, and (c) is the multi-scale attention fusion, where Attention denotes scaled dot-product attention.
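To make the figure concrete, here is a minimal PyTorch sketch of this three-branch structure as we read it: branch (a) as scaled dot-product attention inside non-overlapping windows, branch (b) as a cascaded group attention in the spirit of EfficientViT [44], and (c) as a lightweight per-channel gate standing in for the MLCA-based fusion [46]. The window size, head count, and gate design are assumptions for illustration, not the authors' implementation; it needs PyTorch ≥ 2.0 for `scaled_dot_product_attention`.

```python
# Illustrative sketch of the MsCGA structure in Figure 5 (assumptions noted
# above; not the released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalWindowAttention(nn.Module):
    """Branch (a): scaled dot-product attention inside k x k windows."""
    def __init__(self, dim, window=5, heads=4):
        super().__init__()
        self.k, self.heads = window, heads
        self.qkv, self.proj = nn.Linear(dim, dim * 3), nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        k = self.k
        x = F.pad(x, (0, (k - W % k) % k, 0, (k - H % k) % k))
        Hp, Wp = x.shape[-2:]
        # Partition into (B * num_windows, k*k, C) token groups.
        x = x.view(B, C, Hp // k, k, Wp // k, k)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, k * k, C)
        q, key, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(t.shape[0], -1, self.heads, C // self.heads).transpose(1, 2)
        o = F.scaled_dot_product_attention(split(q), split(key), split(v))
        o = self.proj(o.transpose(1, 2).reshape(-1, k * k, C))
        # Reverse the window partition and crop any padding.
        o = o.view(B, Hp // k, Wp // k, k, k, C).permute(0, 5, 1, 3, 2, 4)
        return o.reshape(B, C, Hp, Wp)[:, :, :H, :W]

class CascadedGroupAttention(nn.Module):
    """Branch (b): each head attends globally over a channel slice, and each
    slice is refined by adding the previous head's output (the cascade)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.d = dim // heads
        self.qkvs = nn.ModuleList(nn.Linear(self.d, self.d * 3) for _ in range(heads))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (B, HW, C)
        outs, carry = [], 0
        for feat, qkv in zip(tokens.chunk(len(self.qkvs), dim=-1), self.qkvs):
            q, key, v = qkv(feat + carry).chunk(3, dim=-1) # cascaded input
            carry = F.scaled_dot_product_attention(q, key, v)
            outs.append(carry)
        out = self.proj(torch.cat(outs, dim=-1))
        return out.transpose(1, 2).view(B, C, H, W)

class MsCGA(nn.Module):
    """(c): fuse the two branches with a per-channel gate (MLCA stand-in)."""
    def __init__(self, dim, window=5, heads=4):
        super().__init__()
        self.local = LocalWindowAttention(dim, window, heads)
        self.glob = CascadedGroupAttention(dim, heads)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(dim, dim, 1), nn.Sigmoid())

    def forward(self, x):
        l, g = self.local(x), self.glob(x)
        a = self.gate(l + g)                               # fusion weights
        return a * l + (1.0 - a) * g

feats = torch.randn(2, 64, 40, 40)                         # high-level feature map
print(MsCGA(64)(feats).shape)                              # torch.Size([2, 64, 40, 40])
```

Gating the two branches per channel, rather than simply summing them, mirrors the selective fusion the caption describes: the network can lean on local detail for thin cracks and on global context for large block cracks.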
Figure 6. Some images from the datasets and their corresponding labels: green bounding boxes indicate a Transverse Crack (TC), blue bounding boxes indicate a Longitudinal Crack (LC), and red bounding boxes indicate a Block Crack (BC).
Figure 7. The gradient-based class activation maps of the output convolutional layers of the neck of Crack-MsCGA and other models.
Figure 8. Prediction performance visualizations of Crack-MsCGA and baseline models, with samples (a–e) selected from the China-M dataset and samples (f–h) chosen from the DH807 dataset.
Table 1. Performance on the China-M dataset. AP@50 (%) ↑ is reported for small, medium, and large cracks, together with the mean AP@50 and mF1.

Methods | Small | Medium | Large | Mean | mF1
YOLOv5l | 73.5 | 83.9 | 94.9 | 84.1 | 87.4
YOLOv5l-ST-BIFPN | 73.7 | 83.0 | 93.8 | 83.5 | 88.8
UM-YOLOl | 76.5 | 84.8 | 95.0 | 85.4 | 88.6
Faster-RCNN-R50 | 69.9 | 76.0 | 93.7 | 79.9 | 83.1
Deformable-DETR-R50 | 68.7 | 78.6 | 93.2 | 80.2 | 82.3
DDQ-DETR-R50 | 77.1 | 86.3 | 94.7 | 86.0 | 91.5
RT-DETR-R50 | 72.6 | 85.3 | 95.4 | 84.5 | 85.5
Crack-MsCGA (ours) | 84.8 | 86.8 | 96.3 | 89.3 | 92.3
Table 2. Performance on the DH807 dataset. AP@50 (%) ↑ is reported for small, medium, and large cracks, together with the mean AP@50 and mF1.

Methods | Small | Medium | Large | Mean | mF1
YOLOv5l | 35.4 | 59.7 | 70.2 | 55.1 | 66.7
YOLOv5l-ST-BIFPN | 33.6 | 58.4 | 70.5 | 54.2 | 64.1
UM-YOLOl | 31.8 | 60.3 | 66.7 | 53.0 | 63.4
Faster-RCNN-R50 | 25.2 | 55.1 | 65.2 | 48.5 | 57.3
Deformable-DETR-R50 | 26.4 | 38.3 | 61.1 | 42.0 | 47.6
DDQ-DETR-R50 | 30.6 | 61.6 | 65.1 | 52.4 | 62.9
RT-DETR-R50 | 36.7 | 67.8 | 69.0 | 57.8 | 67.4
Crack-MsCGA (ours) | 42.7 | 65.4 | 72.7 | 60.3 | 69.9
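As a reminder of what the AP@50 columns in Tables 1 and 2 measure: a predicted box counts as a true positive when its intersection-over-union (IoU) with an unmatched ground-truth box is at least 0.5. A minimal sketch of that matching rule follows (illustrative only; the reported numbers come from a full evaluation with per-scale breakdowns, and the example boxes are made up):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

pred, gt = (10, 10, 60, 40), (12, 8, 58, 44)
print(f"IoU = {iou(pred, gt):.2f} -> TP at AP@50: {iou(pred, gt) >= 0.5}")
# IoU = 0.78 -> TP at AP@50: True
```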
Table 3. GFLOPs and parameters of Crack-MsCGA and other models.

Methods | GFLOPs ↓ | Parameters (M) ↓
YOLOv5l | 53.8 | 46.1
Faster-RCNN-R50 | 91.3 | 41.8
Deformable-DETR-R50 | 79.7 | 40.2
DDQ-DETR-R50 | 52.8 | 48.3
RT-DETR-R50 | 64.8 | 42.0
Crack-MsCGA (ours) | 44.5 | 38.5
Table 4. FPS of Crack-MsCGA and the comparison models on different platforms, where FP32 and FP16 denote inference speed measured under 32-bit and 16-bit floating-point precision, respectively.

Methods | RTX A6000 FP32 | RTX A6000 FP16 | RTX 3060 FP32 | RTX 3060 FP16
YOLOv5l | 182.0 | 294.2 | 68.6 | 122.3
Faster-RCNN-R50 | 96.3 | 146.5 | 38.9 | 60.5
Deformable-DETR-R50 | 86.7 | 124.1 | 32.6 | 56.8
DDQ-DETR-R50 | 115.3 | 196.4 | 46.3 | 84.6
RT-DETR-R50 | 105.4 | 189.0 | 40.7 | 78.2
Crack-MsCGA (ours) | 136.6 | 242.9 | 55.4 | 107.6
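Comparisons like Table 4 can be reproduced with a simple timing harness. The sketch below is an assumption about how such a measurement is typically done in PyTorch (warm-up iterations, GPU synchronization before and after timing, `.half()` for FP16); it is not the authors' benchmarking script, and `measure_fps` and the toy model are illustrative names.

```python
# Sketch of a typical FPS measurement harness (an assumption about the
# methodology, not the authors' script).
import time
import torch

def measure_fps(model, shape=(1, 3, 640, 640), n=200, half=False):
    model = model.cuda().eval()
    x = torch.randn(shape, device="cuda")
    if half:
        model, x = model.half(), x.half()     # 16-bit floating point (FP16)
    with torch.no_grad():
        for _ in range(20):                   # warm-up before timing
            model(x)
        torch.cuda.synchronize()              # flush pending GPU work
        t0 = time.perf_counter()
        for _ in range(n):
            model(x)
        torch.cuda.synchronize()
    return n / (time.perf_counter() - t0)

if torch.cuda.is_available():
    toy = torch.nn.Conv2d(3, 16, 3, padding=1)  # stand-in for the detector
    print(f"FP32: {measure_fps(toy):.1f} FPS")
    print(f"FP16: {measure_fps(toy, half=True):.1f} FPS")
```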
Table 5. Performance with MsCGA and other attention mechanisms (AP@50, %, ↑).

Methods | Small | Medium | Large | Mean
WOA | 81.2 | 83.0 | 92.1 | 85.4
CBAM [9] | 77.6 | 83.0 | 94.0 | 84.9
DAttention [51] | 80.9 | 84.9 | 94.0 | 86.6
AIFI [39] | 81.7 | 83.7 | 94.5 | 86.6
FLA [52] | 82.3 | 85.1 | 93.7 | 87.0
LSKA [53] | 81.3 | 84.3 | 92.9 | 86.2
MsCGA (ours) | 84.8 | 86.8 | 96.3 | 89.3
Table 6. Study on the effectiveness of applying attention enhancement at different layers (✓ marks the levels at which attention is applied).

Methods | Low | High | GFLOPs ↓ | mAP@50 (%) ↑
WOA | – | – | 43.1 | 93.0
LowLevel | ✓ | – | 48.5 | 93.1
AllLevel | ✓ | ✓ | 49.8 | 93.3
Crack-MsCGA | – | ✓ | 44.5 | 93.9
Table 7. Comparative experiments on the window size of the local window attention mechanism (AP@50, %, ↑).

Window Size | Small | Medium | Large | Mean
3 | 36.7 | 66.5 | 72.5 | 58.6
5 | 42.7 | 65.4 | 72.7 | 60.3
7 | 40.0 | 64.0 | 71.6 | 58.5
9 | 41.0 | 62.5 | 72.5 | 58.7
Table 8. Ablation study on the fused feature levels (S2, S3, S4) and the MsCGA module (✓ marks the components used; last four columns are AP@50, %, ↑).

Methods | S2 | S3 | S4 | MsCGA | GFLOPs ↓ | Small | Medium | Large | Mean
WOA | – | ✓ | ✓ | – | 43.1 | 81.2 | 83.0 | 92.1 | 85.4
S2 | ✓ | ✓ | ✓ | ✓ | 68.4 | 81.4 | 86.5 | 95.2 | 87.7
Crack-MsCGA (ours) | – | ✓ | ✓ | ✓ | 44.5 | 84.8 | 86.8 | 96.3 | 89.3