1. Introduction
Since the 1990s, a large number of social and civil infrastructure, such as bridges, tunnels and nuclear power plants, have been constructed throughout China. Owing to its low cost, versatility and extensive use, concrete has become one of the most widely used materials in various types of infrastructure. However, concrete structures inevitably face crack problems caused by creep, shrinkage and loads, which may compromise their safety [
1]. Moreover, as buildings age, they are increasingly affected by factors such as material aging, long-term loads, frequent and sudden natural disasters and human-induced damage, leading to varying degrees of crack damage in concrete structures. If not repaired promptly, these cracks may pose marked safety risks to concrete infrastructure and potentially threaten human life. Therefore, timely detection of concrete surface cracks has profound significance for preventing infrastructure damage and maintaining its safety.
Concrete crack detection methods primarily include traditional and deep learning methods. Traditional methods are based on crack detection using traditional digital image processing. The detection procedure based on image processing generally includes three stages: firstly, pre-processing crack images through denoising and filtering; secondly, binarization and morphological processing by setting thresholds; and thirdly, classification detection using classifiers. Abdel-Qader et al. [
2] analyzed crack edge information using the Fourier and Hough transform algorithms and segmented cracks using the Canny operator to extract crack edge information. Salman et al. [
3] designed an automatic crack digital image classification method using Gabor filters. Zhou et al. [
4] proposed an automatic method for detecting cracks using frequency domain filtering and contour analysis of 3D laser range data. Vivekananthan et al. [
5] combined grayscale discrimination and the Otsu method to successfully detect target cracks in different images. Zhu et al. [
6] proposed a crack detection framework based on two-dimensional digital image correlation and a displacement-based robust crack detection method for concrete cracking phenomena, which evaluated their fracture performance.
In recent years, with the rapid development of deep learning technology, crack detection algorithms based on object detection techniques have become mainstream. Object detection techniques are classified into two- and single-stage algorithms. The current mainstream two-stage algorithms include R-CNN, Fast R-CNN and U-Net, to name a few. Rosso et al. [
7] developed an artificial intelligence (AI)-based automatic road tunnel defects hierarchical classification framework to improve the efficiency of the indirect surveying method. Shahin et al. [
8] constructed a new concrete crack detection model by combining ViT models with multiple image enhancement detectors. Chun et al. [
9] proposed a method combining fully convolutional neural networks and semi-supervised learning to segment pavement crack images, but this method produced rough results that were prone to false detections. Ghosh et al. [
10] improved the convolutional layer structure of U-Net using ResNet residual blocks, validating their performance on public datasets and achieving excellent results. Kang et al. [
11] used an ensemble approach, incorporating the Faster R-CNN algorithm for faster regional convolutional neural networks to detect crack areas. Meng et al. [
12] classified cracks, coarse segmentation and fine segmentation and extracted maximum widths using lightweight classification algorithms, lightweight segmentation algorithms, high-precision segmentation algorithms and crack width measurement algorithms to achieve an automatic real-time crack detection method based on drones. Another category is single-stage detection methods, which directly separate specific categories and regress boundaries. The single-stage object detection algorithms primarily include the You Only Look Once (YOLO) series and the Single Shot MultiBox Detector (SSD). Compared with single-stage algorithms, two-stage algorithms have certain advantages in detection accuracy, but their detection speed is much slower than single-stage algorithms. Therefore, single-stage algorithms are extra concerned with concrete surface crack detection. Chen et al. [
13] introduced a deep learning framework, combining convolutional neural networks (CNNs) with naive Bayes‘ decision algorithms to detect boundary box cracks on underwater nuclear power station surfaces. Deng et al. [
14] applied YOLOv2 for automatic crack detection in concrete, evaluating the robustness of the trained detector against interference by handwritten notes and finding that YOLOv2 can automatically locate a cracked bounding box from original images, even under handwriting interference. Liu et al. [
15] proposed a bridge crack detection algorithm, R-YOLO v5, largely improving crack detection accuracy by incorporating attention mechanisms and optimizing loss functions. Wu et al. [
16] proposed an improved YOLOv4 network using pruning techniques and EvoNorm-S0 structures, which better identified concrete cracks with several misleading targets. They found that whilst maintaining high accuracy, the model could correctly classify cracks at a faster computational speed. Ye et al. [
17] proposed an improved YOLOv7 network and enhanced three different self-developed modules to better identify concrete cracks from several misleading targets, demonstrating effective detection of cracks of different sizes and robustness in validating images contaminated with various types and intensities of noise. Jiang et al. [
18] optimized the YOLO-v3 and SSD algorithms using depth-separable convolution, inverse residual networks and linear bottleneck structures, finding that optimized detection accuracy increased by 3.25% and 4.04%, respectively.
As one of the representatives of single-stage object detection algorithms, the YOLO algorithm has received widespread attention from scholars since its inception. In recent years, with continuous optimization updates, the YOLO algorithm has achieved increasingly excellent results in object detection. In 2023, the Ultralytics team proposed YOLOv8, which not only boasts of high-precision recognition but also excellent real-time performance. The team further lightweighted the entire framework, rendering it extra suitable for concrete surface crack detection. Therefore, based on YOLOv8n, this paper optimizes the model to improve the accuracy of concrete surface crack detection, proposing a YOLOv8-CD algorithm for concrete surface crack detection. Firstly, by introducing the Large Separable Kernel Attention module [
19], other abundant and fine characteristic information about cracks on concrete surfaces is captured, providing strong support for accurate crack identification. Secondly, to further enhance the feature extraction capabilities, the Ghost module [
20] is integrated into the backbone network of YOLOv8, improving detection accuracy and optimizing overall model performance. Finally, in the neck network of the model, the GSConv is used to replace traditional convolutional structures and integrate the VoV-GSCSP module [
21]. This method not only ensures model accuracy but also effectively reduces floating-point operations during feature channel fusion, further reducing model parameters and enhancing computational efficiency. Thus, the proposed algorithm in this paper can further strengthen the application of deep learning algorithms in concrete crack detection technology, offering ideas for deploying related detection algorithms to embedded rapid detection devices, improving the accuracy and efficiency of crack detection and providing technical support for the timely elimination of safety risks in concrete infrastructure.
2. Algorithms
2.1. The YOLOv8 Algorithm
The You Only Look Once (YOLO) algorithm was initially proposed by Joseph Redmon et al. in 2016. Unlike traditional object detection methods, YOLO adopts a novel approach by transforming the object detection task into a regression problem of a single neural network. It divides the image into multiple grids, predicts bounding boxes and object categories within each grid and utilizes the Non-Maximum Suppression (NMS) algorithm to eliminate overlapping bounding boxes, thus achieving real-time object detection.
The YOLO series algorithms represent notable advancements in the field of object detection. YOLOv1 [
22] was the first single-stage object detection algorithm, achieving end-to-end prediction. YOLOv2 [
23] introduced the Darknet-19 network, profoundly enhancing detection accuracy and speed. YOLOv3 [
24] employed the deeper Darknet-53 network. YOLOv4 [
25], proposed by Alexey Bochkovskiy et al., used the CSPDarknet53 network to improve performance. YOLOv5, introduced by Glenn Jocher, achieved remarkable results in many object detection tasks. YOLOv6 [
26], developed by Meituan, focused on industrial applications. YOLOv7 [
27], proposed by the YOLOv4 team in 2022, transformed object detection into a regression problem through convolutional neural networks and fully connected layers for prediction.
The YOLOv8 algorithm used in this paper was introduced by Ultralytics in 2023. It adopts a lightweight network structure whilst maintaining the efficiency and ease of use typical of the YOLO series, further enhancing algorithm performance and flexibility for tasks such as image classification, object detection and instance segmentation, supporting both CPU and GPU platforms. YOLOv8 introduces several innovations, including a new backbone network, Anchor-Free detection head and loss functions and deformable convolution DCNv3, significantly boosting performance. Additionally, the algorithm employs dual-path prediction and densely connected convolution networks, decomposing object detection into classification and localization subtasks. Then, it applies cascading and pyramid concepts to detect objects of different sizes.
Figure 1 is the network structure of YOLOv8. The backbone consists of C2f and SPPF modules with the CSP concept, whilst the neck network of the backbone adopts the ELAN structure design principles from YOLOv7, replacing the C3 module of YOLOv5 with the more gradient flow of the C2f structure. As shown in
Figure 2, the C2f structure enhances the feature fusion capability of convolutional neural networks, improves inference speed and achieves further lightweighting. Moreover, YOLOv8 uses the SPPF module from YOLOv5 architecture, as shown in
Figure 3, which is a spatial pyramid pooling layer that expands the receptive field, facilitates local and global feature fusion and enriches feature information. The head section undergoes remarkable changes compared with YOLOv5, adopting the mainstream decoupled head structure and separating classification and detection heads. YOLOv8 departs from Anchor-Base methods to embrace Anchor-Free principles; on loss functions, it employs the BCE Loss for classification and the DFL Loss + CIOU Loss for regression. In terms of sample matching, YOLOv8 uses Task-Aligned Assigner instead of IOU matching or single-sided proportional allocation.
2.2. Problems with YOLOv8
When deploying the YOLO algorithm to mobile devices, the limited performance of mobile devices needs consideration. At the same time, the requirement for real-time detection of concrete surface cracks is high. Accordingly, this paper selects YOLOv8n, the minimum weight model algorithm of YOLOv8. However, in the actual detection process, YOLOv8n also has certain defects. Firstly, the YOLOv8n algorithm uses numerous standard convolutions and C2f modules, which improves the accuracy of the algorithm but reduces the running speed and increases the parameters of the model. Secondly, the detection scene changes quickly in mobile detection, and ensuring sufficient detection accuracy is difficult. Therefore, the YOLOv8n algorithm is not ideal for the detection and processing of concrete surface cracks, as it is prone to problems such as false detection and missing detection.
3. Improvements to YOLOv8
To improve the accuracy of concrete crack detection, this paper proposes an improved detection network model, called YOLOv8-CD, based on YOLOv8n, as shown in
Figure 4. Prior to the SPPF module in the YOLOv8n backbone network, the LSKA module, incorporating visual attention networks and a large convolutional attention, was embedded to accommodate features such as brittle cracking, large span and the elongation of crack objects. Additionally, the Ghost module was employed within the backbone network to extract necessary information from original features at a lower cost, thereby enhancing feature extraction capability. In the neck network, GSConv replaced the original convolution structure, and the VoV-GSCSP module was utilized to reduce floating-point operations during feature channel fusion and decrease model parameters whilst ensuring accuracy.
Additionally, this paper conducts detailed hyperparameter tuning for the YOLOv8-CD algorithm. Specific adjustments include determining an appropriate learning rate to ensure stable model convergence; selecting a suitable batch size to effectively utilize GPU memory and maintain training stability; adjusting the number of network layers and structure, based on the characteristics of the dataset, to fully extract and represent crack features; optimizing the size and ratio of anchor boxes to improve detection accuracy and recall rate; and adjusting L2 regularization parameters to prevent overfitting. These hyperparameter adjustments significantly improved the model’s detection accuracy and generalization capability on the dataset.
3.1. The LSKA Module
Owing to the complex and variable environments where concrete surface cracks occur, enhancing the model’s ability to represent crack damage features is crucial. In the YOLOv8 backbone network, an attention mechanism was integrated to locally enhance features, allowing the network to ignore irrelevant information interference and incorporate other valuable information into the fused feature maps. Traditional attention mechanisms, such as Squeeze and Excitation (SE) and the Convolutional Block Attention Module (CBAM), have notable limitations. SE attention focuses solely on inter-channel dependencies, consequently neglecting spatial features. Meanwhile, the CBAM introduces large-scale convolution kernels to extract spatial features but overlooks long-range dependency issues.
Large Separable Kernel Attention (LSKA) is an improved module for the application of the traditional Large Kernel Attention (LKA) module in visual attention networks. Traditional LKA modules have the disadvantages of high computational complexity and memory requirements when handling large convolution kernels. To mitigate these problems, LSKA employs oversized convolution kernels in the attention modules of VAN, as shown in
Figure 5. It decomposes the two-dimensional convolution kernels of deep convolution layers into stacked horizontal and vertical one-dimensional kernels, achieving the goal of reducing computational complexity and memory usage. This improvement allows the model to maintain high performance whilst alleviating computational and memory burdens. In this paper, introducing LSKA further enhanced the detection accuracy and efficiency of the model. By optimizing network structure and attention mechanisms, the model better captured critical features in images, thereby improving the accuracy of object detection.
Figure 6 shows the block design structure of LSKA within the visual attention network.
3.2. Ghost Module
The conventional method for feature extraction involves convolving all channels of input feature maps with multiple convolution kernels. However, in deep networks, stacking numerous convolutional layers requires a substantial number of parameters and computations, resulting in many redundant feature maps. Therefore, several studies have proposed model compression methods, such as pruning, quantization and knowledge distillation [
28,
29,
30]. Some problems, such as complicated model design and difficult training, exist despite their effectiveness in reducing parameters. Other approaches focus on optimizing network structures, such as MobileNet and ShuffleNet [
31,
32]. However, a 1 × 1 convolutional layer still consumes considerable memory and Giga floating-point operations per second (GFLOPs). Hence, acquiring essential feature maps is needed without complete convolutional operations. The Ghost module (in
Figure 7) uses standard convolutions to obtain partial feature maps, linear operations to generate additional feature maps and, finally, concatenates these two sets of feature maps to a specified dimension, achieving other feature mappings with fewer parameters and computations.
The operation of generating n feature maps in any convolutional layer can be defined as follows:
where
is the input,
is the number of input channels and
and
are the height and width of the input feature map.
is the output feature map with
channels,
and
are the height and width of the output feature map,
is the convolutional filter of this layer with a kernel size of
and
is the bias term. The computation result of this convolution is as follows:
For the Ghost module, the first step is to obtain the intrinsic feature maps (
) using conventional convolution, with a computational complexity of the following:
Then, each channel of feature maps
is processed using operation
to generate the Ghost feature maps
:
Finally, the intrinsic feature maps obtained in the first step are concatenated (identity concatenate) with the Ghost feature maps obtained in the second step to obtain the final result.
If the input tensor is
(input channels, feature map height and width), after one convolution, the output tensor is
(output channels, feature map height and width). If the conventional convolution kernel size is
and the linear transformation convolution kernel size is
, after
times transformations, the computational complexity comparison between regular convolution and Ghost convolution operations is as follows:
where
is the number of output channels after the first transformation; owing to the identity mapping,
does not need to be calculated, but it is considered part of the second transformation. Therefore, the computational complexity of the Ghost module is approximately
times that of regular convolution.
Taking advantage of the Ghost module, a Ghost Bottleneck is achieved specifically for CNNs. As shown in
Figure 8, the Ghost Bottleneck resembles the basic residual blocks in ResNet [
33], incorporating multiple convolutional layers.
This study enhances feature extraction capability by leveraging the Ghost module within the backbone network, extracting the necessary information from raw features at minimal cost. It simultaneously reduces the overall model size, thereby ensuring lightweight design without compromising accuracy.
3.3. GSConv and VoV-GSCSP Modules
Real-time performance is crucial for object detection, as it requires accurate identification of objects in images or videos within extremely short timeframes, providing relevant information such as their position and size. This scenario demands algorithms not only with efficient processing capabilities but also with rational optimization and configuration of hardware devices. However, lightweight models constructed from numerous depthwise separable convolution layers often fail to achieve sufficient accuracy. Therefore, this paper introduces a novel approach, GSConv, which maintains adequate accuracy whilst reducing model complexity.
The GSConv is introduced to solve the speed problem of predictive computing in Convolutional Neural Networks (CNNs). In the backbone network of CNNs, input images typically undergo a gradual spatial-to-channel transformation process. Each spatial compression and channel expansion of feature maps leads to a partial loss of semantic information. Channel-dense convolution (SC) maximally preserves implicit connections between each channel, whereas depthwise separable convolution (DSC) completely severs these connections. GSConv aims to maintain these connections as much as possible whilst reducing time complexity. Additionally, GSConv is used to design GS Bottleneck, as shown in
Figure 9, which enhances feature non-linear expression and information reuse based on GSConv. As shown in
Figure 10, the VoV-GSCSP module is a cross-stage partial network module designed by using a one-time aggregation method, which is employed to carry out effective information fusion between feature graphs of different stages.
As seen in
Figure 11, GSConv begins by performing downsampling with regular convolution on its input. Subsequently, it applies DWConv (depthwise convolution), concatenates the results of both convolutions and, finally, performs a shuffle operation to align corresponding channel numbers from the previous two convolutions together.
As shown in
Figure 12, the new YOLOv8 neck and head networks replace Conv and C2f modules with GSConv and VoV-GSCSP modules, granting the model sufficient feature information to understand input data whilst reducing the model’s parameter number, achieving the purpose of lightweighting.
4. Dataset
This study used two datasets to train the model, including the RDD2022 dataset and the Wall Crack dataset.
RDD2022: Released by the University of Tokyo, it includes 47,420 road surface images collected from six countries: Japan, India, the Czech Republic, Norway, the United States and China. The dataset captures four types of concrete pavement crack damages, namely, D00 (longitudinal cracks), D10 (transverse cracks), D20 (mesh cracks) and D40 (potholes). Owing to the excessively large image sizes from Norway in comparison to other countries, they are deemed unsuitable for training. Therefore, in this paper, 16,648 photos from countries other than Norway are selected as the training set in the RDD2022 dataset.
Wall Crack: This dataset consists of 5882 concrete wall images collected from the Internet and literature, from which 6 different concrete surface damages can be captured, namely, D0 (exposed steel bars), D1 (weathering), D2 (cracks), D3 (delamination), D4 (spalling) and D5 (rust).
The RDD2022 dataset and the Wall Crack dataset are both used for detecting damage in concrete structures through image recognition. The RDD2022 dataset includes a large number of road crack images from different countries. The selection of the training set ensures a relatively balanced quantity of each type of damage and images from different countries. This allows the trained model to adequately represent different scenarios and types of damage, demonstrating excellent balance. The Wall Crack dataset consists of concrete wall images collected from the Internet and literature, capturing six different types of concrete surface damage. Although the number of images is relatively small, each type of damage is well-represented in the dataset, with a fairly balanced sample distribution. This ensures that the model can learn and recognize each type of damage fairly during training. Overall, these two datasets perform well in providing data support for concrete damage detection and also show excellent balance in data distribution.
6. Conclusions
Concrete surface crack detection is an important research area in structural safety, particularly in the structural assessment of roads, buildings and bridges. However, traditional crack detection methods typically rely on manual inspection, which is not only time-consuming and labor-intensive but also inefficient for meeting the demands of large-scale detection. Therefore, the automation of crack detection with deep learning algorithms has become a hotspot in related fields. Research have shown that traditional deep learning algorithms have been applied in crack detection. For example, Zhang et al. [
34] proposed a deep learning model that was able to automatically detect and identify cracks from road images, which was validated in actual road crack detection. Zou et al. [
35] proposed a DeepCrack model, which was also tested in various actual scenarios. Although these methods have been used in practice, their generalization ability and detection accuracy may be limited when dealing with the complexity and diversity of concrete surfaces.
As an advanced object detection algorithm, the YOLOv8 algorithm has shown a high potential in crack detection owing to its speed and high accuracy. However, there are numerous limitations when dealing with cracks on concrete surfaces. For instance, YOLOv8 may be disrupted by other textures, stains or shadows on concrete surfaces, leading to decreased accuracy in crack recognition. Therefore, this paper proposes an improved concrete surface crack detection algorithm based on YOLOv8n. The improved YOLOv8-CD algorithm of this paper is primarily applied to the crack detection on concrete surfaces. It can handle various types of cracks on concrete surfaces, including small and large cracks, while maintaining high detection accuracy and detection efficiency. This makes it suitable for large-scale crack detection tasks. This method is of great significance for the maintenance and repair of concrete structures. The main conclusions drawn are as follows:
- (1)
In the crack detection process of concrete surfaces, to better capture the unique characteristics of concrete surface cracks, this study innovatively combines the advantages of visual attention networks and large-scale convolutional attention, introducing a novel LSKA module. This module not only effectively extracts the entire feature information of concrete surface cracks but also focuses on the local details of cracks, particularly adapting to the characteristics of cracks that are prone to breakage, have large spans and are thin and long. Through the LSKA module, the model pays extra attention to the shape of cracks, ensuring accurate identification and localization of cracks in complex concrete textures and backgrounds, thereby improving the accuracy and efficiency of crack detection.
- (2)
The Ghost module is integrated into the backbone network of YOLOv8, aiming to efficiently extract and refine crucial information from raw features at minimal computational cost, largely enhancing the model’s feature extraction capability. The Ghost module reduces redundant computations by introducing a novel feature generation method. Accordingly, it not only effectively reduces the number of parameters and computation amount of the model, but also further improves its feature extraction capability and detection performance in complex scenarios whilst maintaining efficiency.
- (3)
In the neck network of YOLOv8, to enhance the efficiency and performance of the model, the GSConv structure is introduced to replace the traditional convolutional structure. Simultaneously, a novel VoV-GSCSP module is incorporated on the basis of the characteristics of the YOLOv8 framework. This module integrates the GSConv structure whilst maintaining the advantages of the original network, allowing the effective utilization of computational resources during feature channel fusion and reducing floating-point operations. By introducing the GSConv and VoV-GSCSP modules, significant optimization is achieved in the neck network of YOLOv8 for feature extraction and fusion. This optimization enables the model to operate efficiently and accurately when dealing with complex images and detection tasks.
- (4)
This paper achieves efficient and accurate crack detection by improving the YOLOv8 algorithm. The improved algorithm has achieved notable performance improvement with multiple datasets, thus proving its effectiveness and practicability in the field of crack detection. Compared with the existing models, this method has higher detection accuracy whilst reducing demands on platform computing and storage capabilities, and deploying on resource-limited devices becomes easy. In the future, we will continue exploring other advanced deep learning algorithms and technologies to further enhance the accuracy and efficiency of crack detection. Additionally, we aim to deploy the improved model on resource-limited embedded detection devices to refine the proposed algorithm in practical applications.