Article

Adaptive Attention-Enhanced Yolo for Wall Crack Detection

by Ying Chen, Wangyu Wu and Junxia Li

1 School of Computer Science, School of Cyber Science and Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 School of Electrical Engineering, Electronics and Computer Science, University of Liverpool, Liverpool L69 3BX, UK
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7478; https://doi.org/10.3390/app14177478
Submission received: 9 July 2024 / Revised: 19 August 2024 / Accepted: 22 August 2024 / Published: 23 August 2024

Abstract

As buildings age over their service lives, cracking of building walls has become an unavoidable phenomenon. Because manually detecting cracks is inefficient, it is especially necessary to explore intelligent detection techniques. Deep learning has recently garnered growing attention in crack detection, leading to the development of numerous feature learning methods. Although the technology in this area continues to progress, it still faces problems such as insufficient feature extraction and instability of prediction results. To address these shortcomings, this paper proposes a new Adaptive Attention-Enhanced Yolo. The method employs a Swin Transformer-based Cross-Stage Partial Bottleneck with a three-convolution structure, introduces an adaptive receptive field module in the neck network, and processes features through a multi-head attention structure during prediction. These modules substantially improve the performance of the model, thereby effectively improving the precision of crack detection.

1. Introduction

Nowadays, China’s construction industry is booming, and problems such as wall cracks are becoming more and more serious. This type of building aging problem not only poses a threat to the safety of people and property, but also becomes a major hidden danger that affects the quality of the building. Buildings are affected by a variety of factors during long-term use, such as climate change [1], material quality [2], construction technology [3], and improper maintenance [4], and deterioration phenomena such as cracks often occur, thus jeopardizing their durability and safety. In order to prevent potential risks, the timely detection and treatment of wall cracks is particularly important. However, the construction industry has not yet formed a unified standard for aging detection, and existing detection and maintenance technologies are still in the exploratory stage, which makes it challenging to monitor and assess the degree of building aging.
Previously, the detection of cracks in building walls relied mainly on workers' experience, using methods such as tapping or ultrasonic pulses. The tapping method has a limited detection range, is only suitable for close observation, and involves a high degree of subjectivity and certain safety hazards. The ultrasonic pulse method [5] requires point-by-point detection, places high technical demands on the operator, and is not suitable for large areas or tall buildings. The infrared thermal imaging method [6] is susceptible to weather, requires measurements at short distances, and is cumbersome and inefficient. Although these traditional methods can assess the condition of walls to a certain extent, the risks and inefficiencies of manual inspection make it difficult to meet the demand for rapid and accurate detection of wall cracks [7]. Therefore, intelligent detection technology is gradually becoming a research priority for wall crack detection tasks.
This study concerns the application of computer vision technology to crack detection, with the core objective of improving the model's adaptability and reliability in complex environments. To achieve this objective, we subdivided the task into feature extraction, staged prediction, and feedback optimization. We introduce three cutting-edge modules to enhance feature extraction, evaluate the overall performance of the model, and optimize it through ablation studies.
In this study, we propose an object detection model that combines the strengths of YOLOv5 and the Transformer, called Adaptive Attention-Enhanced Yolo (AAEY). The model demonstrates efficiency and accuracy when performing object detection in complex environments. AAEY first uses Yolo for initial processing of the image. The concept of the Swin Transformer is then introduced into the backbone network, where the Multi-head Self-Attention (MSA) mechanism [8] improves feature extraction. In the neck stage, the model adjusts the weights of three different receptive fields through a Fully Connected Layer (FCL) [9] to achieve dynamic fusion of information. During prediction, the features pass through an attention mechanism that combines different dimensions to achieve accurate object detection. The model in this paper makes the following main contributions.
  • AAEY effectively improves the accuracy of object detection. By incorporating Transformer’s self-attention, the model is able to more accurately recognize long-distance dependencies in the image, thus integrating the information of each component more comprehensively. This improvement effectively reduces the detection error rate and enhances the model’s performance and stability in crack detection tasks.
  • AAEY processes features in both the spatial and channel dimensions. Beyond the backbone improvements, the adaptive receptive field module dynamically reweights features extracted at multiple scales, and the attention structure in the prediction stage integrates information from each component more effectively, leading to improved performance in crack detection.
  • In this experiment, Precision, Recall, and F1-Score [10] were used to evaluate the detection ability of the model, while mean average precision (mAP) [11] was referenced as an auxiliary index. The obtained results show that the AAEY method demonstrates superior performance compared to other existing techniques in the wall crack detection task and can provide an effective solution for practical applications.

2. Related Work

2.1. Wall Crack Detection

To detect concrete cracks, researchers have developed a variety of Image Processing Technology (IPT) [12] methods that can detect cracks on many different objects, such as masonry [13], bridges [14], roads [15], and walls [16]. These techniques mainly rely on basic detection methods such as thresholding and histogram analysis to identify the presence, width, and direction of cracks [17]. In practice, however, interference from environmental noise makes edge detection difficult. To overcome these problems, researchers have introduced various computer vision techniques, such as denoising and edge enhancement. Ying et al. [18] introduced an image enhancement algorithm designed specifically for pavement damage detection, which addresses uneven illumination by calculating a correction factor to normalize background illumination variations and then accurately separates image features from the background using the beamlet transform. Wang et al. [19] proposed a threshold detection method that uses the Canny operator and a triple-thresholding scheme to process infrared images, combined with K-means clustering to generate binary images of crack locations for robust, automated visual detection. While these visual methods have advanced the accuracy of crack detection, they remain constrained by varying environmental conditions. In addition, some crack detection techniques [20] that combine extended finite element methods with genetic algorithms, although effective in specific contexts, have limitations in practical applications, resulting in insufficient accuracy of the recognition results.
Gradually, the use of ML [21] in crack detection has become more widespread, and researchers have developed a variety of related detection techniques [22]. These techniques typically use IPT to extract features and then classify those features to determine cracks. Since IPT cannot accurately extract all features in some cases, there may be some uncertainty in the final detection results. Nevertheless, this class of methods performs well when dealing with large-scale crack data. In this area, support vector machines (SVMs) are heavily used because of their effectiveness: they can accurately locate crack regions by analyzing features in concrete images [23]. Principal component analysis is often used in conjunction with SVMs because it provides a means of dimensionality reduction.
Recently, deep learning (DL) has been rapidly adopted for wall crack detection, yielding impressive results due to its superior image processing capabilities. Researchers have developed various DL-based methods to enhance the recognition and localization of cracks, even in complex and highly noisy images. Cha et al. [24] employed a deep convolutional neural network architecture to detect and recognize cracks, avoiding the complex process of manually engineering defect features. Liu et al. [25] introduced CrackFormer, an innovative network for fine-grained crack detection that uses a SegNet-like encoder–decoder architecture and a novel attention module that captures long-range contextual information while sharpening semantic features. Xu et al. [26] introduced CTNet, a Transformer-based model for road crack detection that leverages the Transformer's ability to capture global, long-range dependencies and incorporates a multi-scale local feature augmentation structure. Deng et al. [27] applied YOLOv2 to locate cracks in bridge images, dividing the cracks in each image into multiple regions; however, this method faces challenges such as difficulty in processing large amounts of data.

2.2. YOLO

Object detection is a core problem in computer vision and has been intensively researched. In recent years, DL-based object detection algorithms have achieved significant advancements, with the YOLO family among the most prevalent owing to its speed and high accuracy.
Bai et al. [28] proposed the RSG-YOLO model, which enhances feature extraction capability and reduces the loss of crack features by reparameterizing a dual-fusion feature pyramid structure. The model adopts SIoU instead of CIoU to optimize the bounding-box regression loss, improving convergence speed, and integrates the GAM Attention Mechanism (AM) to enhance responsiveness to diverse channel information. Zhang et al. [29] proposed an algorithm that integrates MobileNets and CBAM with an improved YOLOv3, optimizing performance through depthwise separable convolution and CBAM. Wang et al. [30] optimized the YOLOv5 algorithm, enhancing detection capability and decreasing complexity by adjusting transfer learning hyperparameters, experimenting with different network structures, and applying detection head search technology. Ref. [31] introduced a ship detection method based on YOLOv5 and GhostNet; adding the GhostNet module further optimized feature capture during extraction, greatly improving detection accuracy and reducing the generalized intersection-over-union error. The DGAP-YOLO algorithm proposed by Sun et al. [32] improves YOLOv8 by introducing the DGC module to enhance crack detection precision and by replacing the neck network with a lightweight lowAFPN, which strengthens multi-scale feature fusion and improves the robustness and generalization of the model.

3. Materials and Methods

3.1. Yolo Algorithm

Here, we introduce the basic framework of Yolo using Yolov5 as an example. YOLOv5 is an end-to-end DL model that integrates an input layer, backbone network, neck module, and prediction layer [33]. By using advanced techniques such as grid division, it can achieve high-precision object detection on images with different resolutions.
In the input processing stage of the YOLOv5 network, the Mosaic data augmentation method [34] randomly scales, crops, and stitches images. This stage also uses an adaptive anchor box mechanism, which calculates appropriate anchor box sizes from the preset dataset annotations and adjusts them to optimize model training. Adaptive image scaling reduces the need for image padding, enabling the training process to extract more features while significantly reducing the computational burden of forward inference.
In the backbone stage of the YOLOv5 network, image features are extracted by several modules. First, the input image is sliced and convolved by the Focus module to obtain initial feature maps, as illustrated in the sketch below. The feature maps then flow through multiple Cross-Stage Partial (CSP) modules and Convolutional Block Layers (CBLs), which work together to further extract image features. The CSP structure uses residual connections to integrate the outputs of multiple CBLs, and finally realizes multi-scale feature fusion through the spatial pyramid pooling module.
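As a concrete illustration, here is a minimal PyTorch sketch of the slice-and-concatenate idea behind the Focus module. The class and parameter names are ours, and the real YOLOv5 layer wraps its convolution with batch normalization and a SiLU activation, which we omit.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four pixel-offset sub-images, stack them on the
    channel axis, then convolve -- a sketch of YOLOv5's Focus layer."""
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 4C, H/2, W/2): every 2x2 pixel block is
        # redistributed across channels, so no information is lost.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)
```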
In the neck module of YOLOv5, image features are further refined by fusing deep and shallow information. This stage uses the Feature Pyramid Network (FPN) structure to generate a multi-scale feature pyramid, fusing feature maps from different levels through up-sampling and down-sampling operations. Upper layers mainly capture semantic information, while lower layers retain detailed location information. The neck module combines top-down semantic information from the FPN with bottom-up positional information from the path aggregation network [35], thereby optimizing the fusion of multi-level features.
In the prediction stage of YOLOv5, the main task is to localize and classify targets on the feature maps from the previous stage. The image is divided into several grids, and multiple candidate boxes are created for each grid. The final detection results are obtained by filtering out redundant overlapping boxes with non-maximum suppression [36], as illustrated below. During training, the detections generated by the model are compared with the annotated ground-truth boxes and class labels, the error is calculated, and the model parameters are updated through backpropagation. In this way, the model learns from a large number of labeled images and continuously optimizes its parameters, ultimately enabling effective target detection on unlabeled natural images.
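To make the suppression step concrete, the snippet below runs non-maximum suppression over three hypothetical candidate boxes with torchvision's nms operator; the coordinates, scores, and threshold are illustrative only.

```python
import torch
from torchvision.ops import nms

# Hypothetical raw predictions: boxes in (x1, y1, x2, y2) format plus scores.
boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],    # heavy overlap with box 0
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.90, 0.75, 0.80])

# Keep the highest-scoring box in each cluster of overlapping candidates.
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- the duplicate of box 0 is suppressed
```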
Figure 1 shows the improved model network structure and highlights the design of the three new modules. These modules include the Swin Transformer-based CSP Bottleneck with three convolutions (STC3), the Adaptive Receptive Field (ARF), and the Head Attention (HA) module. The diagram clearly shows how these components are integrated into the overall network.

3.2. Swin Transformer-Based CSP Bottleneck with Three Convolutions

In-depth analysis of image semantics and structure is an essential step in wall crack detection. In the backbone, we introduced the STC3 module, inspired by the structural principles of Swin Transformer, to improve feature extraction and better utilize input features. This module replaces the CSP Bottleneck, which originally used three convolutions.
The STC3 structure mainly contains Layer Normalization (LN), MSA, and Multilayer Perceptron (MLP). This design allows STC3 to efficiently extract and fuse image features when dealing with tasks such as object detection [37].
First, the normalized data $I_{L1}$ are generated by applying LN to the input features of each channel. This process is designed to reduce fluctuations in the internal covariates, thereby speeding up the model's training and improving its generalization ability [38]. The normalization formula is as follows.

$$I_{L1} = \frac{I - \mu_1}{\sqrt{\sigma_1^2 + \varepsilon_1}} \tag{1}$$

In Formula (1), the input feature data are represented by $I$, $\mu_1$ represents the mean of each layer, $\sigma_1^2$ refers to the variance of each layer, and $\varepsilon_1$ is a tiny constant used to avoid division-by-zero errors. The processed image data are then fed into the MSA, which captures long-range dependencies $I_{MSA}$ in the image to help extract detailed image features and background information. In this process, the normalized data $I_{L1}$ are projected through learnable parameter matrices into queries ($Q$), keys ($K$), and values ($V$) [39].
$$Q = I_{L1} M_Q, \quad K = I_{L1} M_K, \quad V = I_{L1} M_V \tag{2}$$

In Formula (2), $M_Q$, $M_K$, and $M_V$ are three trainable parameter matrices. The attention weight matrix is then calculated from $Q$, $K$, and $V$.

$$A(Q, K, V) = f_S\left(\frac{Q K^{T}}{\sqrt{e_k}}\right) V \tag{3}$$

In Formula (3), $e_k$ refers to the dimension of the key vector, and $f_S$ represents a softmax normalization function whose outputs sum to 1. The attention computation is performed once per head $m_i$ and the results are stitched together; generally, the head index $i$ ranges from 1 to 8 [40].

$$m_i = A(Q M_i^{Q}, K M_i^{K}, V M_i^{V}) \tag{4}$$

$$MH(Q, K, V) = \mathrm{Concat}(m_1, \ldots, m_h) M_O \tag{5}$$

$$I_{MSA} = MH(Q, K, V) \tag{6}$$
On this basis, the obtained feature map $I_{MSA}$ is concatenated with the original input to obtain $I_{C1}$, so that the model can take full advantage of both kinds of information for deeper feature learning. The equation is shown below.

$$I_{C1} = \mathrm{Concat}(I_{MSA}, I) \tag{7}$$

The feature map obtained in the previous step is then passed through another layer normalization, yielding a more regular feature $I_{L2}$; its calculation formula is as follows.

$$I_{L2} = \frac{I_{C1} - \mu_2}{\sqrt{\sigma_2^2 + \varepsilon_2}} \tag{8}$$

In Formula (8), $\mu_2$ represents the mean of each layer, and $\sigma_2^2$ represents the variance of each layer. After this, an MLP processes the feature map to extract crucial task-relevant information, represented by $I_{MLP}$. The calculation formula for this process is given below.

$$I_{MLP} = f_M(I_{L2}) \tag{9}$$

In Formula (9), $f_M$ represents the MLP. In this way, the STC3 network fully integrates the advantages of self-attention and the MLP to effectively extract and utilize the complex features in wall crack images. This improves the generalization ability and stability of the model and makes the crack detection results more accurate.
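To summarize the data flow of Formulas (1)–(9), the following is a minimal PyTorch sketch of the STC3 computation, not the exact implementation: operating on flattened tokens, the head count, and the MLP expansion ratio are our assumptions.

```python
import torch
import torch.nn as nn

class STC3Block(nn.Module):
    """Sketch of STC3: LayerNorm -> multi-head self-attention ->
    concatenation with the input -> LayerNorm -> MLP (Formulas (1)-(9))."""
    def __init__(self, dim: int, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)                           # Formula (1)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(2 * dim)                       # Formula (8)
        self.mlp = nn.Sequential(                              # Formula (9)
            nn.Linear(2 * dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -- a feature map flattened into tokens.
        i_l1 = self.ln1(x)
        i_msa, _ = self.msa(i_l1, i_l1, i_l1)                  # Formulas (2)-(6)
        i_c1 = torch.cat([i_msa, x], dim=-1)                   # Formula (7)
        return self.mlp(self.ln2(i_c1))                        # Formulas (8)-(9)
```

Having the MLP fold the concatenated width back to `dim` keeps the block drop-in compatible with the surrounding CSP stages.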

3.3. Adaptive Receptive Field

During the crack detection process, it is necessary to distinguish between cracks with different sizes and morphologies. The morphology of these cracks varies greatly, from fine hairline cracks to wide and irregular structural cracks. This morphological diversity requires a high degree of flexibility and accuracy in the model to ensure that the different characteristics of the cracks can be recognized, thus maintaining excellent detection performance. Introducing the ARF module into the neck stage of the Yolo model helps to better cope with the high demands placed on detection by image diversity.
The ARF module passes the feature information through receptive fields of different sizes to capture features from fine-grained to coarse-grained, and then dynamically adjusts the weights of the resulting features with three parameters generated by the FCL. The bottom three branches of the structure contain convolutional layers with kernel sizes 1, 3, and 5, respectively; passing the features through these layers yields $F_1$, $F_3$, and $F_5$. The formula is as follows.

$$F_i = C_i(I_{i1}) \tag{10}$$

In Equation (10), $C_i$ refers to the convolution kernel of size $i$ and $I_{i1}$ denotes the input features. The topmost branch contains a Linear layer that generates the parameters corresponding to the three branches below it. Its formula is shown below.

$$\lambda, \theta, \tau = f_S(f_c(I_{i1})) \tag{11}$$

In Equation (11), $f_c$ denotes an FCL. After all the features and parameters are obtained, each parameter is multiplied with the corresponding branch output, effectively fusing the features at each scale into $F$. Its calculation formula is as follows.

$$F = \lambda F_1 + \theta F_3 + \tau F_5 \tag{12}$$
With the introduction of the ARF module, the model is able to focus on both the minute details of the cracks and the overall structural features, and is able to flexibly adjust the weights of the extracted features under different receptive fields according to the recognition results. Such a structure improves the model’s capacity to extract characteristics at various scales, which helps the model become more precise.
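A compact PyTorch sketch of the ARF computation in Formulas (10)–(12) is shown below. Feeding the fully connected layer with a globally pooled summary of the input is our assumption, since the text does not specify how the FCL input is formed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARF(nn.Module):
    """Sketch of the Adaptive Receptive Field module (Formulas (10)-(12)):
    parallel convolutions with kernel sizes 1, 3 and 5, blended with softmax
    weights produced by a fully connected layer."""
    def __init__(self, channels: int):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)])
        self.fc = nn.Linear(channels, 3)      # emits lambda, theta, tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global average pooling summarises the input for the weight branch.
        w = F.softmax(self.fc(x.mean(dim=(2, 3))), dim=-1)     # Formula (11)
        feats = [conv(x) for conv in self.convs]               # F1, F3, F5
        # Formula (12): weighted sum of the three receptive-field branches.
        return sum(w[:, i].view(-1, 1, 1, 1) * f for i, f in enumerate(feats))
```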

3.4. Multi-Head Attention

Images of wall cracks contain a variety of complex features, and the model needs the ability to capture this information accurately. Adding a Multi-Head Attention (MHA) structure to the prediction part of Yolo helps capture more image information, which to some extent addresses the inability of a uniform convolutional neural network to comprehensively extract effective feature information. The introduced structure contains attention mechanisms that handle features of different dimensions, which are then combined with the original image to obtain feature information more favorable for prediction.
Here, the HA structure refers to a single head of the MHA. Viewed on its own, the HA module consists of three main branches. The SK attention mechanism [41] constitutes the upper branch. First, the image information input to this branch is processed by two convolutional kernels of different sizes to obtain two feature sets $G_1$ and $G_2$.

$$G_1 = C_3(\mathrm{split}(I_{i2})_1), \quad G_2 = C_5(\mathrm{split}(I_{i2})_2) \tag{13}$$

In Equation (13), $I_{i2}$ represents the input image information. The resulting feature sets are then used to compute two feature weights $\upsilon$ and $\omega$ [42], a process that requires Global Average Pooling (GAP) and an FCL.

$$X = f_R(f_c(f_G(G_1), f_G(G_2))) \tag{14}$$

$$\upsilon, \omega = f_S(f_c(X)) \tag{15}$$

In Equations (14) and (15), $f_G$ denotes GAP and $f_R$ denotes the ReLU activation function. Finally, the information of the two feature groups is fused by weighting them with the computed feature weights to obtain $F_{SK}$. This step realizes the adaptive adjustment of feature information in the spatial and channel dimensions and makes more rational use of the image information. The formula is as follows.

$$F_{SK} = \upsilon G_1 + \omega G_2 \tag{16}$$

The bottom branch consists of a Channel Attention Mechanism (CAM) [43]. This branch focuses on information in the channel dimension, attending to different channels according to their actual importance and continuously capturing more effective information to finally obtain a useful $F_{CAM}$.

$$F_{CAM} = f_S(f_c(f_R(f_c(f_G(I))))) \tag{17}$$

The middle part of the HA mainly retains the original feature information $F_o$ and passes it to the next step unchanged. This branch makes full use of the input data and prevents the other two branches from discarding important information during processing, which would make the detection results incomplete. After the three branches have produced their new feature information through their respective processing mechanisms, the module fuses the three outputs into the composite feature $F_{fusion}$. The computational formula for the feature fusion process is as follows, where $\oplus$ denotes the fusion operation.

$$F_{fusion} = F_{SK} \oplus F_{CAM} \oplus F_o \tag{18}$$
The HA module effectively reduces the risk of information loss and noise interference by focusing on multiple information channels simultaneously. This multi-dimensional feature processing can improve the robustness of the model in recognizing cracks, making its performance more stable when dealing with complex backgrounds and variable environments. The introduction of the MHA structure can enhance the overall performance of the model, making it perform well in the wall crack detection task.
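Pulling Formulas (13)–(18) together, one HA head can be sketched as follows. We apply both kernels to the full feature map rather than to a channel split, use an SE-style sigmoid gate in the channel branch, and fuse the three branches by element-wise addition; all three choices are our assumptions where the text leaves the operators implicit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadAttention(nn.Module):
    """Sketch of one HA head (Formulas (13)-(18)): an SK-style branch,
    a channel-attention branch, and an identity branch, fused at the end."""
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        self.c = c
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)   # G1 branch, Formula (13)
        self.conv5 = nn.Conv2d(c, c, 5, padding=2)   # G2 branch, Formula (13)
        self.fc1 = nn.Linear(2 * c, c)               # f_c in Formula (14)
        self.fc2 = nn.Linear(c, 2 * c)               # f_c in Formula (15)
        self.cam = nn.Sequential(                    # channel branch, Formula (17)
            nn.Linear(c, c // r), nn.ReLU(),
            nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g1, g2 = self.conv3(x), self.conv5(x)
        # SK branch: global average pooling, FC + ReLU, FC + softmax.
        s = torch.cat([g1.mean(dim=(2, 3)), g2.mean(dim=(2, 3))], dim=1)
        w = F.softmax(self.fc2(F.relu(self.fc1(s))).view(-1, 2, self.c), dim=1)
        f_sk = w[:, 0, :, None, None] * g1 + w[:, 1, :, None, None] * g2  # (16)
        # Channel-attention branch gates the original features.
        f_cam = self.cam(x.mean(dim=(2, 3)))[:, :, None, None] * x
        # Identity branch preserves the original information F_o.
        return f_sk + f_cam + x                                           # (18)
```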

4. Performance Evaluation

This paper uses a database of more than 800 concrete crack images, all labeled in VOC format and suitable for training the Yolo model. In the experiments, 80% of the images were set aside for training and the remaining 20% for testing. Both the initial and final learning rates were set to 0.01. Other hyperparameters include a weight decay coefficient of 0.0005, a bounding-box loss coefficient of 0.05, a classification loss gain of 0.5, a classification loss weight of 1.0, an objectness loss coefficient of 0.7, and an objectness loss weight of 1.0. All models were trained with the same dataset, parameter configuration, and training procedure.
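For reference, these settings can be collected into the kind of hyperparameter dictionary that YOLOv5 loads from its hyp.*.yaml files; the key names follow the YOLOv5 convention, and the mapping from the coefficients above to these keys is our assumption.

```python
# Training hyperparameters reported above, in YOLOv5-style key names.
hyp = {
    "lr0": 0.01,            # initial learning rate
    "lrf": 0.01,            # final learning rate setting
    "weight_decay": 0.0005, # weight attenuation coefficient
    "box": 0.05,            # bounding-box loss coefficient
    "cls": 0.5,             # classification loss gain
    "cls_pw": 1.0,          # classification loss weight
    "obj": 0.7,             # objectness loss coefficient
    "obj_pw": 1.0,          # objectness loss weight
}
```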
To address the small dataset size, the Mosaic data augmentation technique commonly used with YOLO models was adopted. This method significantly improves data diversity by combining four images into one, simulating different shooting conditions and viewing angles. Specifically, Mosaic creates richer training samples by randomly cropping, scaling, and stitching images. This data expansion strategy not only alleviates the shortage of data but also improves the model's adaptability to varied real-world scenes.
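A minimal sketch of the 4-image Mosaic idea is given below; it assumes each input image is at least size × size pixels and omits label handling, which a real implementation must remap alongside the pixels.

```python
import random
import numpy as np

def mosaic4(images, size=640):
    """Paste one image per quadrant around a random centre point.
    Assumes each image in `images` is at least (size, size, 3)."""
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)  # grey fill
    xc = random.randint(size // 4, 3 * size // 4)           # random centre x
    yc = random.randint(size // 4, 3 * size // 4)           # random centre y
    regions = [(0, 0, xc, yc), (xc, 0, size, yc),
               (0, yc, xc, size), (xc, yc, size, size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # Crop a random window of the required quadrant size from the image.
        ys = random.randint(0, img.shape[0] - h)
        xs = random.randint(0, img.shape[1] - w)
        canvas[y1:y2, x1:x2] = img[ys:ys + h, xs:xs + w]
    return canvas
```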

4.1. Evaluation Metrics

In this experiment, we primarily utilize Precision, Recall, and F1-Score to measure the performance of the model in the object detection task. Precision measures the proportion of instances classified as positive that are indeed positive [44]. Recall calculates the proportion of true positives that are correctly predicted, which indicates whether the model’s prediction is comprehensive or not. F1-Score is the harmonic mean of Precision and Recall, which evaluates the performance of the model in a comprehensive way [45]. Here are the formulas for these metrics.
$$pre = \frac{tp}{tp + fp} \tag{19}$$

$$re = \frac{tp}{tp + fn} \tag{20}$$

$$fs = \frac{2 \times pre \times re}{pre + re} \tag{21}$$

In Formulas (19)–(21), $pre$ represents Precision, $re$ represents Recall, and $fs$ represents F1-Score. $tp$, $fp$, and $fn$ denote how many positive instances were correctly predicted as positive, how many negative instances were mistakenly predicted as positive, and how many positive instances were mistakenly predicted as negative, respectively [46]. These values help us understand how the model performs on different types of predictions. To comprehensively assess the detection ability of the model, we record the Precision and Recall of each experiment and calculate the average precision (AP) as the area under the Precision–Recall curve. The AP values of all categories are averaged to form the mAP [47], which reflects the overall detection performance of the model across thresholds. The [email protected] mentioned in this article refers to the mAP calculated with the Intersection over Union threshold set to 0.5.
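For clarity, a minimal Python sketch of Formulas (19)–(21) follows; the counts in the usage line are illustrative only, not results from our experiments.

```python
def detection_prf(tp, fp, fn):
    """Precision, Recall and F1-Score per Formulas (19)-(21)."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    fs = 2 * pre * re / (pre + re) if pre + re else 0.0
    return pre, re, fs

# Hypothetical counts: 80 correct detections, 12 false alarms, 20 misses.
print(detection_prf(tp=80, fp=12, fn=20))  # approx. (0.870, 0.800, 0.833)
```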

4.2. Training Results

To explain why Yolov5 was chosen as the overall network framework for this experiment, we compared the unimproved Yolov5 with two representative newer models, Yolov8 and Yolov10. All three models were trained on an identical training set for 200 epochs, within which each reached a converged [email protected]: Yolov5 peaked at 83.60%, Yolov8 at 85.31%, and Yolov10 finished at 80.86%. The differences in average precision among the three models are therefore not large. The trend of [email protected] over epochs for the three models is shown in Figure 2.
The confusion matrices in Figure 3 also demonstrate the performance of the three models. The figure shows that all three models fulfill the crack detection task reasonably well and mostly predict correct results. However, some misdetections and omissions remain; in particular, some non-target regions are incorrectly judged as cracks. Note that the labeled categories in the dataset do not include a background class: background is used only to represent the model's classification of non-target regions in the concrete crack images, so the corresponding counts are not included in the confusion matrices.
The performance figures for the three models are given in Table 1, which shows more intuitively that Yolov5 has a non-negligible advantage in this study. Here, params denotes the number of model parameters, and floating-point operations (FLOPs) quantify model complexity.
Overall, Yolov5 and Yolov8 achieve better precision and recall than Yolov10. As for Yolov8, although its precision is slightly higher, it has far more parameters and floating-point operations, and thus much higher model complexity. The improvement method proposed in this paper is not limited to a particular Yolo version, and Yolov8 could be used where higher precision is required. However, wall crack detection is typically performed under limited resources and with real-time requirements, where a lightweight model is more practical. On balance, we therefore chose Yolov5, which offers strong performance at lower complexity, to complete this research.

4.3. Ablation Study

To demonstrate how the new modules built in this experiment improve model performance, we conducted an ablation study on the same dataset. The baseline for the ablation study was Yolov5s, and the new modules were added incrementally. We visualize the impact of STC3, ARF, and MHA on wall crack detection through mAP, a comprehensive index. The model performance comparison is shown in Table 2.
According to the table, the baseline mAP is only 83.6%. Introducing the STC3 structure alone raises the mAP to 84.4%, while introducing only the ARF or the MHA structure brings it to 84.0% in each case. Adding STC3 and ARF together increases the mAP by 1.6%; STC3 and MHA together, by 1.9%; and ARF and MHA together, by 1.1%. When all three modules are embedded in Yolov5, the mAP improves further, by 2.8%. The ablation results therefore make clear that STC3, ARF, and MHA each contribute to the crack recognition task.
Figure 4 visualizes the predictions from the ablation experiments on the crack detection task, showing how model performance changes as the STC3, ARF, and MHA structures are added in sequence. The baseline's detections are clearly the poorest, with incomplete identification of cracks. As new modules are added, detection improves to varying degrees, gradually framing more of the regions where cracks are located. With all modules integrated into Yolov5, the detection results reach a high level and the confidence improves further. This demonstrates that incorporating STC3, ARF, and MHA into YOLOv5 greatly improves the model's crack detection capability.

4.4. Comparison with State of the Art

Many models and methods with good detection results have been investigated in recent years. Chun et al. [48] developed a method that combines pixel values with geometric features and generates predictions using a light gradient boosting machine (LightGBM). The CrackTree [49] method identifies cracks through geodesic shadow removal, tensor voting, and graph models. Lee et al. [50] combined a 2D Gaussian kernel with Brownian motion processes to generate simulated crack images, and predicted both actual and simulated cracks with an image segmentation network. To recognize and segment different types of cracks, Yang et al. [51] trained a fully convolutional network at the pixel level, representing the morphological features of cracks with a single-pixel-width skeleton structure.
During our experiments, we obtained Precision, Recall, and F1-Score, among other evaluation metrics, for the AAEY model. In order to more fully demonstrate its excellent performance, we also compared AAEY with a variety of existing methods, and the detailed comparison results are shown in Table 3.
The F1-Score, as the harmonic mean of Precision and Recall, enables a comprehensive assessment of model performance, and through it we can describe the advantages of the AAEY model over existing techniques. As shown in the table, the F1-Score of the model proposed in this study is 84.22%, the highest among the compared models: 14.72% and 13.42% higher than the F1-Scores of the LightGBM and CrackTree models, respectively, and by somewhat smaller margins of 8.98% and 4.27% higher than those of CSN and FCN.
Based on the above analysis, the AAEY model clearly outperforms many existing models on the detection and recognition task, significantly improving object recognition. In the practical application of wall crack detection, the AAEY model further enhances both detection efficiency and accuracy.

4.5. Limitations

The AAEY model performed markedly better at crack detection in this study, but the experiment has some limitations. Like most DL models, it depends on substantial computational resources, and its operation can be time-consuming. In addition, performance improvements come with increased model complexity, so the model may not perform its task efficiently in scenarios with strict real-time requirements. In future work, we will weigh these limitations comprehensively, attempt to construct faster and higher-precision network structures, and optimize based on experimental feedback to further improve the performance of the model.

5. Conclusions

The appearance of wall cracks may not only lead to the degradation of the building structure, but also shorten the service life of the building and pose a threat to the safety of the residents. Therefore, the detection and repair of wall cracks is essential to ensure the safety of buildings, not only to avoid potentially major accidents, but also to promote the continued development of the construction industry. With the rapid development of science and technology, DL has been widely used in many fields. In this context, the research of this paper relies on deep learning technology, namely the AAEY method, to detect wall cracks, aiming to extract crack features and identify locations through this technology, so as to achieve an efficient wall crack detection system.
In this study, the Yolo model underwent a series of enhancements. First, incorporating the Swin Transformer's MSA mechanism into the bottleneck of the network backbone lets the model better capture long-range association information in the image. Second, the ARF module is introduced in the neck stage to enable more comprehensive feature extraction, from fine detail to global scope. At the same time, an MHA structure is introduced in the prediction stage to flexibly adjust the focus among different features. These improvements yield a new AAEY model with significantly enhanced feature extraction capabilities and improved accuracy in crack identification. The experimental results suggest that these changes enable the model to accurately capture crack characteristics under various conditions, improve the performance of YOLOv5 in crack detection, and provide an effective solution in the field of wall crack detection.

Author Contributions

Conceptualization, J.L. and Y.C.; methodology, J.L. and Y.C.; software, Y.C.; validation, W.W.; formal analysis, Y.C. and W.W.; investigation, J.L. and Y.C.; resources, J.L.; data curation, Y.C.; writing—original draft preparation, Y.C. and W.W.; writing—review and editing, J.L. and Y.C.; visualization, Y.C. and W.W.; supervision, Y.C.; project administration, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy concerns.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Basu, S.; Orr, S.A.; Aktas, Y.D. A geological perspective on climate change and building stone deterioration in London: Implications for urban stone-built heritage research and management. Atmosphere 2020, 11, 788. [Google Scholar] [CrossRef]
  2. Sitota, B.; Quezon, E.T.; Ararsa, W. Assessment on Materials Quality Control Implementation of Building Construction Projects and Workmanship: A Case Study of Ambo University. 2021. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3901438 (accessed on 7 May 2024).
  3. Othman, N.L.; Jaafar, M.; Harun, W.M.W.; Ibrahim, F. A case study on moisture problems and building defects. Procedia-Soc. Behav. Sci. 2015, 170, 27–36. [Google Scholar] [CrossRef]
  4. Yacob, S.; Ali, A.S.; Au-Yong, C.P.; Yacob, S.; Ali, A.S.; Au-Yong, C.P. An Overview and Understanding the Building Deterioration. In Managing Building Deterioration: Prediction Model for Public Schools in Developing Countries; Springer: Singapore, 2022; pp. 11–40. [Google Scholar]
  5. Andi, M.; Yohanes, G.R. Experimental study of crack depth measurement of concrete with ultrasonic pulse velocity (UPV). In Proceedings of the IOP Conference Series: Materials Science and Engineering, Bali, Indonesia, 7–8 August 2019; Volume 673, p. 012047. [Google Scholar]
  6. Singla, R.; Sharma, S.; Sharma, S.K. Infrared imaging for detection of defects in concrete structures. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Stavanger, Norway, 30 November–1 December 2023; Volume 1289, p. 012064. [Google Scholar]
  7. Kim, H.; Lee, J.; Ahn, E.; Cho, S.; Shin, M.; Sim, S.H. Concrete crack identification using a UAV incorporating hybrid image processing. Sensors 2017, 17, 2052. [Google Scholar] [CrossRef]
  8. Sridhar, S.; Sanagavarapu, S. Multi-head self-attention transformer for dogecoin price prediction. In Proceedings of the 2021 14th International Conference on Human System Interaction (HSI), Gdansk, Poland, 8–10 July 2021; pp. 1–6. [Google Scholar]
  9. Chen, Z.; Zhang, F.; Liu, H.; Wang, L.; Zhang, Q.; Guo, L. Real-time detection algorithm of helmet and reflective vest based on improved YOLOv5. J. Real-Time Image Process. 2023, 20, 4. [Google Scholar] [CrossRef]
  10. Veeranampalayam Sivakumar, A.N.; Li, J.; Scott, S.; Psota, E.; Jhala, A.J.; Luck, J.D.; Shi, Y. Comparison of object detection and patch-based classification deep learning models on mid- to late-season weed detection in UAV imagery. Remote Sens. 2020, 12, 2136. [Google Scholar] [CrossRef]
  11. Yassin, W.; Abdollah, F.; Amanah, N.; Ismail, A.; Ragam, P. Seatbelt detection in traffic system using an improved YOLOV5. J. Adv. Comput. Technol. Appl. (JACTA) 2023, 5, 1–17. [Google Scholar]
  12. Khan, M.A.M.; Kee, S.H.; Pathan, A.S.K.; Nahid, A.A. Image Processing Techniques for Concrete Crack Detection: A Scientometrics Literature Review. Remote Sens. 2023, 15, 2400. [Google Scholar] [CrossRef]
  13. Chaiyasarn, K.; Khan, W. Damage detection and localization in masonry structure using faster region convolutional networks. GEOMATE J. 2019, 17, 98–105. [Google Scholar]
  14. Bhowmick, S.; Nagarajaiah, S.; Veeraraghavan, A. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from UAV videos. Sensors 2020, 20, 6299. [Google Scholar] [CrossRef]
  15. Nhat-Duc, H.; Nguyen, Q.L.; Tran, V.D. Automatic recognition of asphalt pavement cracks using metaheuristic optimized edge detection algorithms and convolution neural network. Autom. Constr. 2018, 94, 203–213. [Google Scholar] [CrossRef]
  16. Kim, B.; Cho, S. Image-based concrete crack assessment using mask and region-based convolutional neural network. Struct. Control Health Monit. 2019, 26, e2381. [Google Scholar] [CrossRef]
  17. Dinh, T.H.; Ha, Q.P.; La, H.M. Computer vision-based method for concrete crack detection. In Proceedings of the 2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), Phuket, Thailand, 13–15 November 2016; pp. 1–6. [Google Scholar]
  18. Ying, L.; Salari, E. Beamlet transform-based technique for pavement crack detection and classification. Comput.-Aided Civ. Infrastruct. Eng. 2010, 25, 572–580. [Google Scholar] [CrossRef]
  19. Wang, G.; Peter, W.T.; Yuan, M. Automatic internal crack detection from a sequence of infrared images with a triple-threshold Canny edge detector. Meas. Sci. Technol. 2018, 29, 025403. [Google Scholar] [CrossRef]
  20. Rabinovich, D.; Givoli, D.; Vigdergauz, S. XFEM-based crack detection scheme using a genetic algorithm. Int. J. Numer. Methods Eng. 2007, 71, 1051–1080. [Google Scholar] [CrossRef]
  21. Praticò, F.G.; Fedele, R.; Naumov, V.; Sauer, T. Detection and monitoring of bottom-up cracks in road pavement using a machine-learning approach. Algorithms 2020, 13, 81. [Google Scholar] [CrossRef]
  22. Ali, L.; Alnajjar, F.; Jassmi, H.A.; Gocho, M.; Khan, W.; Serhani, M.A. Performance evaluation of deep CNN-based crack detection and localization techniques for concrete structures. Sensors 2021, 21, 1688. [Google Scholar] [CrossRef] [PubMed]
  23. Chaiyasarn, K.; Sharma, M.; Ali, L.; Khan, W.; Poovarodom, N. Crack detection in historical structures based on convolutional neural network. GEOMATE J. 2018, 15, 240–251. [Google Scholar] [CrossRef]
  24. Cha, Y.J.; Choi, W.; Büyüköztürk, O. Deep learning-based crack damage detection using convolutional neural networks. Comput.-Aided Civ. Infrastruct. Eng. 2017, 32, 361–378. [Google Scholar] [CrossRef]
  25. Liu, H.; Miao, X.; Mertz, C.; Xu, C.; Kong, H. Crackformer: Transformer network for fine-grained crack detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3783–3792. [Google Scholar]
  26. Xu, Z.; Lei, X.; Guan, H. Multi-scale local feature enhanced transformer network for pavement crack detection. J. Image Graph. 2023, 28, 1019–1028. [Google Scholar]
  27. Deng, J.; Lu, Y.; Lee, V.C.S. Imaging-based crack detection on concrete surfaces using You Only Look Once network. Struct. Health Monit. 2021, 20, 484–499. [Google Scholar] [CrossRef]
  28. Bai, T.; Lv, B.; Wang, Y.; Gao, J.; Wang, J. Crack Detection of Track Slab Based on RSG-YOLO. IEEE Access 2023, 11, 124004–124013. [Google Scholar] [CrossRef]
  29. Zhang, Y.; Huang, J.; Cai, F. On bridge surface crack detection based on an improved YOLO v3 algorithm. In Proceedings of the 21st IFAC World Congress, Berlin, Germany, 12–17 July 2020; Volume 53, pp. 8205–8210. [Google Scholar]
  30. Wang, Z.; Jin, L.; Wang, S.; Xu, H. Apple stem/calyx real-time recognition using YOLO-v5 algorithm for fruit automatic loading system. Postharvest Biol. Technol. 2022, 185, 111808. [Google Scholar] [CrossRef]
  31. Ting, L.; Baijun, Z.; Yongsheng, Z.; Shun, Y. Ship detection algorithm based on improved YOLO V5. In Proceedings of the 2021 6th International Conference on Automation, Control and Robotics Engineering (CACRE), Dalian, China, 15–17 July 2021; pp. 483–487. [Google Scholar]
  32. Sun, Z.; Liu, J.; Li, P.; Li, Y.; Li, J.; Sun, D.; Zhang, C. DGAP-YOLO: A Crack Detection Method Based on UAV Images and YOLO. In Proceedings of the International Conference on Intelligent Computing, Tianjin, China, 5–8 August 2024; pp. 482–492. [Google Scholar]
  33. Liu, S.; Zhang, N.; Yu, G. Lightweight security wear detection method based on YOLOv5. Wirel. Commun. Mob. Comput. 2022, 2022, 1319029. [Google Scholar] [CrossRef]
  34. Luo, X.; Wu, Y.; Wang, F. Target detection method of UAV aerial imagery based on improved YOLOv5. Remote Sens. 2022, 14, 5063. [Google Scholar] [CrossRef]
  35. Dadboud, F.; Patel, V.; Mehta, V.; Bolic, M.; Mantegh, I. Single-stage uav detection and classification with yolov5: Mosaic data augmentation and panet. In Proceedings of the 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Washington, DC, USA, 16–19 November 2021; pp. 1–8. [Google Scholar]
  36. Rahman, R.; Bin Azad, Z.; Bakhtiar Hasan, M. Densely-populated traffic detection using yolov5 and non-maximum suppression ensembling. In Proceedings of the International Conference on Big Data, IoT, and Machine Learning: BIM 2021, Cox’s Bazar, Bangladesh, 23–25 September 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 567–578. [Google Scholar]
  37. Wu, H.; Liu, Q.; Liu, X. A review on deep learning approaches to image classification and object segmentation. Comput. Mater. Contin. 2019, 60, 575–597. [Google Scholar] [CrossRef]
  38. Garbin, C.; Zhu, X.; Marques, O. Dropout vs. batch normalization: An empirical study of their impact to deep learning. Multimed. Tools Appl. 2020, 79, 12777–12815. [Google Scholar] [CrossRef]
  39. Pan, X.; Ge, C.; Lu, R.; Song, S.; Chen, G.; Huang, Z.; Huang, G. On the integration of self-attention and convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 815–825. [Google Scholar]
  40. Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3531–3539. [Google Scholar]
  41. Wang, S.; Hao, X. YOLO-SK: A lightweight multiscale object detection algorithm. Heliyon 2024, 10, e24143. [Google Scholar] [CrossRef]
  42. Sharaf Al-deen, H.S.; Zeng, Z.; Al-sabri, R.; Hekmat, A. An improved model for analyzing textual sentiment based on a deep neural network using multi-head attention mechanism. Appl. Syst. Innov. 2021, 4, 85. [Google Scholar] [CrossRef]
  43. Gao, C.; Cai, Q.; Ming, S. YOLOv4 object detection algorithm with efficient channel attention mechanism. In Proceedings of the 2020 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE), Harbin, China, 25–27 December 2020; pp. 1764–1770. [Google Scholar]
  44. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar]
  45. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef]
  46. Hong, C.S.; Oh, T.G. TPR-TNR plot for confusion matrix. CSAM Commun. Stat. Appl. Methods 2021, 28, 161–169. [Google Scholar] [CrossRef]
  47. He, K.; Lu, Y.; Sclaroff, S. Local descriptors optimized for average precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 596–605. [Google Scholar]
  48. Chun, P.J.; Izumi, S.; Yamane, T. Automatic detection method of cracks from concrete surface imagery using two-step light gradient boosting machine. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 61–72. [Google Scholar] [CrossRef]
  49. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. CrackTree: Automatic crack detection from pavement images. Pattern Recognit. Lett. 2012, 33, 227–238. [Google Scholar] [CrossRef]
  50. Lee, D.; Kim, J.; Lee, D. Robust concrete crack detection using deep learning-based semantic segmentation. Int. J. Aeronaut. Space Sci. 2019, 20, 287–299. [Google Scholar] [CrossRef]
  51. Yang, X.; Li, H.; Yu, Y.; Luo, X.; Huang, T.; Yang, X. Automatic pixel-level crack detection and measurement using fully convolutional network. Comput.-Aided Civ. Infrastruct. Eng. 2018, 33, 1090–1109. [Google Scholar] [CrossRef]
Figure 1. The network structure of the improved Adaptive Attention-Enhanced Yolo model, including STC3, ARF, and HA components.
Figure 2. Plot of three models' [email protected] values versus epochs.
Figure 3. Confusion matrices of Yolov5 (top left), Yolov8 (top right), and Yolov10 (bottom).
Figure 4. Recognition effects of visualization in the ablation study.
Table 1. Performance comparison of Yolov5, Yolov8, and Yolov10 on crack detection tasks.

Model   | Precision | Recall | Params (M) | FLOPs (G)
Yolov5  | 82.84%    | 78.21% | 7.1        | 16.5
Yolov8  | 84.55%    | 79.78% | 11.2       | 28.6
Yolov10 | 82.60%    | 76.89% | 7.2        | 21.6
Table 2. Ablation study results for crack detection (the √ symbol indicates the inclusion of the specific module in the model configuration).

Model      | STC3 | ARF | MHA | mAP
baseline   |      |     |     | 83.6%
STC3       | √    |     |     | 84.4%
ARF        |      | √   |     | 84.0%
MHA        |      |     | √   | 84.0%
STC3 + ARF | √    | √   |     | 85.2%
STC3 + MHA | √    |     | √   | 85.5%
ARF + MHA  |      | √   | √   | 84.7%
Ours AAEY  | √    | √   | √   | 86.4%
Table 3. Performance comparison with existing methods.

Method    | Precision | Recall | F1-Score
LightGBM  | 68.01%    | 75.78% | 69.50%
CrackTree | 73.22%    | 76.45% | 70.80%
CSN       | 83.40%    | 68.55% | 75.24%
FCN       | 81.73%    | 78.97% | 79.95%
Ours AAEY | 88.69%    | 80.18% | 84.22%