An Improved YOLOv5 Model for Concrete Bubble Detection Based on Area K-Means and ECANet

Tian, Wei; Li, Bazhou; Cao, Jingjing; Di, Feichao; Li, Yang; Liu, Jun

doi:10.3390/math12172777

Open AccessArticle

An Improved YOLOv5 Model for Concrete Bubble Detection Based on Area K-Means and ECANet

by

Wei Tian

¹,

Bazhou Li

^1,2,3,

Jingjing Cao

^4,*

,

Feichao Di

^2,3,

Yang Li

^1,2,3 and

Jun Liu

⁴

¹

CCCC Second Harbor Engineering Co., Ltd., Wuhan 430040, China

²

CCCC Wuhan Harbor Engineering Design and Research Institute Co., Ltd., Wuhan 430040, China

³

Hubei Provincial Key Laboratory of New Materials, Maintenance and Reinforcement Technology for Marine Structures, Wuhan 430040, China

⁴

School of Transportation and Logistics Engineering, Wuhan University of Technology, Wuhan 430063, China

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(17), 2777; https://doi.org/10.3390/math12172777

Submission received: 3 August 2024 / Revised: 2 September 2024 / Accepted: 5 September 2024 / Published: 8 September 2024

(This article belongs to the Special Issue Application of Machine Learning and Data Mining, 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

The appearance quality of fair-faced concrete plays a crucial role in evaluating the engineering quality, as the abundance of small-area bubbles generated during construction diminishes the surface quality of concrete. However, existing methods are plagued by sluggish detection speed and inadequate accuracy. Therefore, this paper proposes an improved method based on YOLOv5 to rapidly and accurately detect small bubble defects on the surface of fair-faced concrete. Firstly, to address the issue of YOLOv5 in generating prior boxes for imbalanced samples, we divide the image preprocessing part into small-, medium-, and large-area intervals corresponding to the number of heads. Additionally, we propose an area-based k-means clustering approach specifically tailored for the anchor boxes within each of these intervals. Moreover, we adjust the number of prior boxes generated by k-means clustering according to the training loss function to adapt to bubbles of different sizes. Then, we introduce the ECA (Efficient Channel Attention) mechanism into the neck part of the model to effectively capture inter-channel interactions and enhance feature representation. Subsequently, we incorporate feature concatenation in the neck part to facilitate the fusion of low-level and high-level features, thereby improving the accuracy and generalization ability of the network. Finally, we construct our own dataset containing 980 images of two classes: cement and bubbles. Comparative experiments are conducted on our dataset using YOLOv5s, YOLOv6s, YOLOxs, and our method. Experimental results demonstrate that the proposed method achieves the highest detection accuracy in terms of mAP0.5, mAP0.75, and mAP0.5:0.95. Compared to YOLOv5s, our method achieves a 7.1% improvement in mAP0.5, a 3.7% improvement in mAP0.75, and a 4.5% improvement in mAP0.5:0.95.

Keywords:

concrete bughole detection; efficient channel attention; prior anchor; YOLOv5

MSC:

68T07

1. Introduction

Fair-faced concrete, a common material in the construction industry, offers excellent decorative effects without the need for coatings or chemical additives, making it a pollution-free and environmentally friendly option. It can be cast in one pour, requiring no further adornment, resulting in a smooth and naturally appealing surface. However, its characteristic of irreversibility in molding demands high engineering standards. Excessive air bubbles can reduce the structural strength and corrosion resistance of fair-faced concrete, significantly impacting its appearance. The generation of air bubbles stems from various factors primarily related to construction techniques, such as improper water–cement ratios and inadequate concrete compaction. Efforts are made during construction to minimize bubble formation. Consequently, numerous methods have been developed to detect air bubbles in fair-faced concrete, broadly categorized into three directions: manual inspection, traditional image-based inspection, and deep learning-based image inspection.

In the 1970s, the primary method involved manual counting and measurement of bughole diameters by construction personnel to assess surface quality [1,2], which was both time-consuming and inaccurate, rendering it impractical [3]. To address this, the Concrete International Committee (CIB) introduced an improved method using a set of seven standard representative sizes for comparison with images. While this method reduced manual errors to some extent, it still suffered from subjective bias. Lemaire et al. [4] proposed a method for evaluating concrete based on color uniformity and surface bughole distribution, allowing for a more objective assessment by professionals. However, manual evaluation of concrete surface quality is subjective, unstable, time-consuming, and prone to errors. To overcome these drawbacks, automation or semi-automation methods such as digital image processing technology should be adopted to enhance objectivity, accuracy and efficiency in evaluation. These technologies minimize human interference, ensuring consistency and repeatability of evaluation results, thereby improving the detection and management of concrete surface quality.

Detection methods based on digital image processing technology offer high efficiency and objectivity, making them a popular technique for detecting surface bugholes in concrete [5,6]. Ozkul et al. [7] designed a device utilizing pressure differential technology to measure the level of surface bugholes in concrete, characterized by its compact structure, small size, and light weight. Yoshitake et al. [8] developed a bughole pore measurement and quantification system for assessing the quality of tunnel lining concrete surfaces. That system evaluated using the red/green/blue values of color images, demonstrating higher accuracy than threshold-based image analysis and accurately estimating bughole pores. Liu et al. [9] utilized image grayscale, contrast enhancement, and OTSU image threshold segmentation techniques to extract bughole pore characteristics, distinguishing cracks from bughole pores based on threshold values of shape feature coefficients. By establishing the relationship between CIB scale and bughole pore area ratio, and the maximum diameter parameter, the quality of concrete surfaces can be comprehensively evaluated. Silva et al. [10] used digital images processed with image analysis tools to assess concrete surface defects and developed an expert system with image analysis for classification. Zhu et al. [11] proposed an automatic evaluation method for two common surface defects (i.e., bugholes and discoloration), first localizing these defects and then retrieving defect attributes (bughole count and discolored area) to calculate the visual impact ratio (VIRs) of defects. Peterson et al. [12] proposed an iterative method for analyzing images collected from a large number of samples, identifying defects through threshold optimization, albeit with less accuracy. These methods have validated the applicability of image processing technology in practical engineering. However, image processing technology is often susceptible to external environmental interference such as lighting conditions and shadows, which can affect detection results and may require manual intervention, especially in complex situations.

Deep learning-based object detection methods offer a solution to mitigate external environmental interference, are widely applied, and achieve superior detection performance in the field of target recognition [13,14,15,16]. Deep learning eliminates the need for manual extraction of image information and can abstract image features through algorithmic capabilities, demonstrating significant advantages in both image feature extraction and training efficiency with large-scale data. Moreover, it exhibits strong generalization and robustness [17,18]. Currently, deep learning in concrete bughole detection is still in the developmental stage. Sun et al. [19] implemented an improved DeepLabv3+ to detect cracks and bugholes on concrete surfaces, enhancing detection accuracy by employing depthwise separable convolution instead of low-level normal convolution, reducing the expansion ratio in the Atrous Spatial Pyramid Pooling (ASPP) module, and increasing weight in the channel size. Wei et al. [20] utilized a network similar to AlexNet, inserting initial modules as feature extraction layers to detect surface images of concrete specimens processed through grayscale, contrast enhancement, and OTSU threshold segmentation, combining the advantages of image processing and Deep Convolutional Neural Network (DCNN). The aforementioned method demonstrated high accuracy in detection, albeit with the use of a semantic segmentation network, resulting in a slow detection speed. Therefore, addressing the aforementioned challenges and the high density of bugholes in a concrete dataset, we opted to improve these methods by using the YOLOv5 network, which maintains high detection precision while providing rapid inference speed.

Fair-faced concrete, valued for its natural beauty and the absence of additional coatings, has become a popular choice in the construction industry. However, its one-time molding characteristic demands high construction quality, and the presence of air bubbles can significantly impact the structural strength and appearance of the concrete. Early manual inspection methods were time-consuming and inaccurate and were susceptible to subjective biases. While traditional image processing techniques have improved objectivity and efficiency, they are still affected by lighting conditions and external interference, often requiring manual intervention. Recently, deep learning-based object detection methods have demonstrated significant advantages in addressing these issues. They can automatically extract image features, overcoming the limitations of traditional methods, and offer higher accuracy and speed in detection. Consequently, researchers are exploring the application of deep learning technology to concrete bubble detection to enhance both efficiency and accuracy.

This paper presents an improved method for detecting bubbles in concrete based on YOLOv5. YOLOv5 adopts advanced object detection techniques and optimized network structures, enabling high-precision object detection. Moreover, YOLOv5 can achieve real-time object detection on different hardware platforms, including CPU and GPU, making it suitable for applications requiring rapid processing of visual data. Addressing the characteristics of bubbles in concrete images, where bubbles of different sizes are unevenly distributed with smaller bubbles comprising a larger proportion, YOLOv5’s anchor box clustering method tends to generate prior boxes for smaller areas. To enhance the detection method for bubbles in concrete, we introduce an area k-means clustering operation into the network’s image preprocessing stage. We also adjust the number of prior boxes generated by k-means clustering to accommodate bubbles of varying scales. Additionally, we incorporate feature fusion and an ECA (Efficient Channel Attention) mechanism into the network structure. Compared to traditional attention mechanisms, the ECA mechanism has lower computational costs and can maintain the spatial structure of input data when calculating attention weights. It is particularly effective in tasks involving strong spatial correlations. Feature fusion facilitates the fusion of low-level and high-level features, thereby improving the extraction of image information.

Our main contributions can be summarized as follows:

We improve the prior box generation method of YOLOv5 based on the size and distribution of bughole areas in the dataset images. The prior boxes are divided into small-, medium-, and large-area intervals according to the number of heads, and k-means clustering is applied to each interval to adjust the anchor boxes. We also adjust the number of prior boxes generated by k-means clustering to adapt to bugholes of different scales.
We introduce the Efficient Channel Attention (ECA) mechanism and feature fusion operation into the neck part. The ECA mechanism effectively captures inter-channel interactions, enhancing feature representation. The feature fusion operation adds connect operations between the input and output positions of the Feature Pyramid Network (FPN) structure, enabling the fusion of high-level and low-level features, thus improving the extraction capability of image information.
We collect an image set of bughole defects in fair-faced concrete on-site, construct our own dataset, and conduct comparative experiments using YOLOv5s, YOLOv6s, YOLOvxs, and the improved YOLOv5 on our dataset. Our method achieves the highest detection accuracy in terms of mAP0.5, mAP0.75, and mAP0.5:0.95.

The rest of this paper is organized as follows: Section 2 reviews the network structure of YOLOv5, its loss function, and ECA mechanism and elaborates on the improvements made to our model. In Section 3, we conduct an experimental validation and discuss our method, including ablation experiments. Section 4 summarizes our work and highlights future research directions.

2. Materials and Methods

2.1. The Introduction of YOLOv5

YOLOv5 is a popular object detection algorithm that can simultaneously perform object detection and localization without the need for traditional sliding window and region proposal methods. It combines speed and accuracy, demonstrating excellent performance in real-time object detection tasks. YOLOv5 adopts an end-to-end training approach based on deep learning, using a single neural network model to directly predict bounding boxes and class probabilities from the entire image. Compared to previous versions, YOLOv5 has improved in both accuracy and speed, while also providing better versatility and scalability. The architecture of YOLOv5 employs a series of convolutional layers, pooling layers, and other neural network components to efficiently extract image features and perform object detection.

The YOLOv5 algorithm is available in four different sizes: s, m, l, and x. Among these, the YOLOv5s model has the simplest network structure, leading to the fastest training and detection speeds. Conversely, the YOLOv5x model has the most complex network structure, with the deepest architecture, and generally provides the best detection performance. Although the backbone, neck, and head structures are broadly similar across different versions, they differ in depth and width. Depth refers to the total number of layers in the network, while width indicates the number of channels. In the configuration file, the size of each version is controlled by the parameters’ depth multiple and width multiple, with specific parameters for each version as shown in Table 1. As the model scale increases, so do the computational and parameter requirements. The YOLOv5s model has the smallest computational and parameter load. Given the need to consider computational and storage resource limitations, as well as speed constraints, in concrete surface bubble detection in practical engineering, this study selected the YOLOv5s algorithm, which requires fewer resources and offers faster detection speeds, as the base algorithm, and then improved upon it.

YOLOv5 is an efficient object detection algorithm that adopts various data augmentation techniques to enhance model performance and robustness [21,22]. Data augmentation involves applying a series of random transformations and distortions to the original data during the training process to generate more diverse and generalizable training samples. Firstly, YOLOv5 uses random resizing and cropping techniques. This data augmentation process randomly adjusts the size of input images and performs cropping operations, enabling the model to adapt to targets of different sizes and proportions, thereby improving detection accuracy and robustness. Secondly, random horizontal flipping is another commonly used data augmentation method. By horizontally flipping images with a certain probability, the dataset size can be expanded, the model’s generalization ability can be increased, and the risk of overfitting can be reduced. In addition, YOLOv5 also employs random color distortion techniques. This method distorts the color channels of the images, enhancing the model’s ability to recognize targets under different lighting conditions and improving its performance in complex environments. Furthermore, random rotation and random affine transformation are also applied in the data augmentation process of YOLOv5. Random rotation enables the model to learn to recognize targets at different angles, enhancing the model’s adaptability to rotated targets, while random affine transformation helps improve the model’s robustness to target deformation. Finally, YOLOv5 introduces the MixUp data augmentation technique [23]. MixUp data augmentation generates new training samples by blending two images in a certain proportion, thereby enhancing the model’s generalization ability and improving its performance on unseen data.

As shown in Figure 1, the YOLOv5 model consists of three main components: backbone, neck, and head. YOLOv5 employs CSPDarknet53 [24] as its backbone. CSPDarknet53 is an improved version of Darknet53 that utilizes Cross-Stage Partial (CSP) connections to enhance feature extraction efficiency and reduce computational overhead. The CSP structure divides feature maps into two parts, processes them separately, and then merges the processed feature maps. This approach reduces computational load while maintaining the model’s representational power. By using Cross-Stage Partial connections, different hierarchical features are effectively fused, which improves the network’s representational capability. The design goal is to balance computational efficiency and feature representation ability, ensuring that YOLOv5 excels in object detection tasks. The neck module in YOLOv5 incorporates the Bottom-up Path Augmentation (BPA) structure based on Feature Pyramid Networks (FPNs) [25]. This structure facilitates the transmission and aggregation of information across different network layers, achieving multi-scale feature fusion. This enhances the flow of information within the network and reduces information loss, improving the model’s ability to detect targets at various scales. By effectively integrating features from different levels, the neck module boosts overall detection performance, accuracy, robustness, and generalization capability, allowing the model to better adapt to diverse scenarios and effectively detect targets of varying sizes. The head module in YOLOv5 includes several convolutional layers and the final output layer. It uses convolutional layers to predict the bounding box positions (center coordinates, width, height), with either a fully connected layer or additional convolutional layers completing the task. The head module outputs the probability for each class and the confidence for each bounding box. Predictions are made across different scales of feature maps, handling targets of various sizes. By default, three sizes of feature maps are used:

80 \times 80 \times c

,

40 \times 40 \times c

,

20 \times 20 \times c

. YOLOv5’s capability to support multiple prediction sizes enables it to perform exceptionally well in multi-scale object detection.

2.2. Optimization of Prior Boxes Based on K-Means

In YOLOv5, the head section outputs three different sizes of feature maps, where the larger feature map is responsible for predicting small objects and the smaller feature map is responsible for predicting large objects. In our dataset images, bugholes are small in area and numerous, making recognition difficult. We tabulated the distribution of the true anchor box areas in the dataset, as shown in Figure 2. It can be observed that the areas of the anchor boxes were primarily distributed within 10,000 pixels, with a higher number of small-area anchor boxes. Anchor boxes with areas greater than 10,000 pixels exhibited a lower number of instances corresponding to different areas on the x-axis, nearly coinciding with the x-axis in the graph, thus making it difficult to distinguish them visually. If the default k-means [26] clustering method in YOLOv5 was used to generate prior boxes, it would inevitably result in a large number of small-area prior boxes and a few large-area prior boxes, leading to a decline in the network’s detection performance for medium-sized objects. To address this imbalance, we proposed an area k-means optimization method. This method involved categorizing all real anchor boxes into small, medium, and large groups based on area size, and then separately counting the quantity of anchor boxes in each group, as shown in Table 2. For each group of anchor boxes, we used the k-means method to derive 3 prior box sizes, which were initialized for the corresponding-size feature maps. With three different sizes of feature maps, a total of 9 prior box sizes could be obtained.

The role of prior boxes mainly manifests in two aspects. Firstly, prior boxes provide the object detection model with prior information about the expected shape and position of the targets. This prior information helps the model to more accurately predict the location and size of the objects, thereby improving the detection accuracy of the model. Secondly, by sampling and matching targets of different scales and aspect ratios, prior boxes enable the model to adapt to objects of various scales and shapes, enhancing the model’s generalization ability. Therefore, we increased the number of prior boxes used in YOLOv5. By default, YOLOv5 generates N (N = 3) prior boxes for each different-sized feature map. We augmented this to N + K and determined the optimal value of K through experimentation. Increasing the number of prior boxes helped the model better capture variations in target objects, enhancing its ability to accurately localize and recognize targets. Moreover, augmenting the number of prior boxes reduced uncertainty during training, thereby improving the model’s stability and robustness. Consequently, by increasing the number of prior boxes, we effectively enhanced the performance and efficacy of the YOLOv5 object detection model.

2.3. Addition of Feature Fusion and Efficient Channel Attention Network

ECANet is a neural network architecture designed for image processing tasks [27]. The core idea of ECANet is to introduce a channel attention mechanism in convolution operations to capture relationships between different channels, thereby enhancing the feature representation capability. The goal of the channel attention mechanism is to adaptively adjust the weights of channel features, allowing the network to focus more on important features while suppressing unimportant ones. Through this mechanism, ECANet is able to effectively enhance the network’s representation capacity, generalization ability, and feature discrimination capability without significantly increasing parameters and computational costs. As shown in Figure 3, ECANet first compresses the input features through a global pooling layer, transforming the input features into a structure of

1 \times 1 \times c

. It then calculates the size of the adaptive convolution kernel, as expressed below:

k = |\frac{\log_{2} (C)}{γ} + \frac{b}{γ}|

(1)

where C is the number of input channels,

b = 1

,

γ = 2

, and a one-dimensional convolution is used to calculate the weights of the channels. Finally, the sigmoid activation function is applied to map the weights between 0 and 1. The reshaped weight values are then multiplied with the original feature maps to obtain feature maps under different weights.

In the neck part of YOLOv5, the FPN is a top-down feature pyramid that enhances high-level semantic features passed down the pyramid. However, it only enhances semantic information, and due to the long transmission path, the transmission effect is not optimal. To address this, PAN adds a bottom-up pyramid behind the FPN to supplement it, passing low-level localization features upwards. This forms a pyramid that combines both semantic and localization information. However, it may still suffer from missed detections because some bughole features are lost during feature extraction and transmission processes. For this reason, we added three feature fusion parts in the neck part of YOLOv5, as shown by the red lines in Figure 1. We performed a concatenation operation on the input and output features of three different scales of PAFPN networks, effectively utilizing both positional and semantic information of the image and preventing feature information loss. This operation enhanced connectivity between the backbone and head, thus improving the flow of graphic information in the network.

In the actual collected dataset, bugholes are often small and densely distributed, leading to low accuracy and significant missed or false detections when using traditional methods. As shown in Figure 1, we introduced the ECA attention mechanism into the top-down and bottom-up layers of the neck part, allowing the network to better focus on the desired bughole features and enhance its representation, generalization, and discriminative ability, thereby improving detection accuracy.

2.4. Loss Function

The loss function of YOLOv5 consists of three parts: classification loss, objectness loss, and localization loss. The expression is as follows:

L o s s = λ_{1} \times L_{c l s} + λ_{2} \times L_{o b j} + λ_{3} \times L_{l o c}

(2)

where

λ_{1}

,

λ_{2}

, and

λ_{3}

are the equilibrium constants. Classification loss assesses whether the model can accurately identify and classify bubbles within an image into the correct categories. Objectness loss determines whether the predicted bounding box contains an object. Localization loss measures the difference between the predicted bounding boxes and the ground-truth bounding boxes. By minimizing this loss, the network learns to predict the object’s position more precisely.

Both the classification loss and objectness loss use a commonly used loss function in handling classification problems, which is the cross-entropy loss function. Cross-entropy is mainly used to determine the closeness between the actual output and the expected output. The expression is as follows:

L_{o b j} = L_{c l s} = - \sum_{i = 1}^{n} p (x_{i}) \log (q (x_{i}))

(3)

where

p (x_{i})

is the actual value, and

q (x_{i})

is the predicted value.

The localization loss uses the CIoU loss function, which considers three geometric parameters: overlap area, center-point distance, and aspect ratio. The expression is as follows:

\begin{matrix} L_{l o c} = C I o U = 1 - I o U + (\frac{ρ^{2} (b, b^{g t})}{c^{2}} + α v) \end{matrix}

(4)

v = \frac{4}{π^{2}} {(\arctan \frac{w^{g t}}{h^{g t}} - \arctan \frac{w}{h})}^{2}

(5)

α = \frac{v}{(1 - I o U) + v}

(6)

Here,

ρ (b, b^{g t})

represents the distance between the centers of the predicted box and the ground-truth box; c is the diagonal length of the smallest box that encompasses both the predicted box and the ground truth box; w and h denote the width and height of the predicted box, respectively;

w^{g t}

and

h^{g t}

denote the width and height of the ground-truth box, respectively; and

I o U

represents the Intersection over Union (

I o U

) between the predicted box and the ground-truth box.

3. Experiment and Result Analysis

3.1. Experimental Settings

3.1.1. Experimental Platform

The experimental platform consisted of an AMD Ryzen9 5950x processor and 48 GB of RAM, paired with an NVIDIA RTX 3090. The development environment for the experiment was Python, and the detection model was built using the PyTorch framework. The optimizer used in the experiment was the stochastic gradient descent (SGD) algorithm, where weight decay and momentum are two important parameters in the optimizer that further optimize the training model. Weight decay is commonly applied to control the size of L2 regularization and was set to 0.0005, while momentum is used to stabilize the optimization direction and was set to 0.9. The number of epochs was set to 100 to ensure that the model could learn the features of the dataset adequately. YOLOv5 model training requires input images of a specific size, set to

640 \times 640

. The batch size refers to the number of samples inputted during each training session and was set to 12.

3.1.2. Dataset

To acquire the necessary image data for the experiment, we established a visual collection system to capture the dataset of air bugholes in clear water concrete, as shown in Figure 4. Figure 4a depicts the pouring tower used in bridge pier construction, Figure 4b showcases the observation window at the top of the pouring tower made of steel and acrylic panels, Figure 4c is the internal structure of the pouring tower, and Figure 4d is the acquisition window of the image capture system.

In order to prevent factors such as dust, lighting, and shadows from affecting the collected dataset, we opted for iRAYPLE industrial cameras with a frame rate of 30 FPS and an image size of

1280 \times 1024

. We collected a total of 22 min of video data, then sampled 1 image every seven frames, resulting in a total of 980 images.

3.1.3. Performance Evaluation Metrics

Average Precision (

A P

) is a commonly used evaluation metric in object detection tasks, used to measure the balance between a model’s accuracy and recall for a single class. AP is calculated by computing the area under the Precision–Recall curve (P–R curve), with values ranging from zero to one where higher values indicate better model performance.

Mean Average Precision

(m A P

) is a comprehensive metric in multi-class object detection tasks, calculated by computing the AP for each class and then taking the average. mAP considers the performance of different classes collectively, serving as a crucial metric for evaluating the overall performance of an object detection system. In practical applications, mAP is often used to compare different models or tuning results, guiding model design and improvement. By measuring AP and mAP, a more comprehensive assessment of the localization accuracy and detection precision of object detection algorithms across different classes can be achieved. Its expression is as follows:

m A P = \frac{A P}{m}

(7)

where m is the total number of target classes in the dataset.

mAP0.5 means that when calculating mAP, the performance of an object detector is evaluated using an IoU (Intersection over Union) threshold of 0.5. This implies that in computing Precision and Recall, only cases where the IoU between the detection result and the ground-truth bounding box is greater than 0.5 are considered. mAP0.75 indicates that when calculating mAP, the performance of an object detector is evaluated using an IoU threshold of 0.75. In this scenario, a detection result is only considered correct if the IoU with the ground-truth bounding box is greater than 0.75. mAP0.5:0.95 signifies that when calculating mAP, different IoU thresholds ranging from 0.5 to 0.95 are considered. The mAP values at different IoU thresholds are calculated and averaged. This approach enables a more comprehensive evaluation of the performance of an object detector across different IoU thresholds, providing additional information about detection accuracy.

3.2. Ablation Experiment

3.2.1. Ablation Experiments on Prior Boxes

The original k-means method in YOLOv5 clustered all anchor boxes, resulting in nine prior box sizes as follows: {[(6, 6), (10, 10), (16, 15)], [(22, 24), (20, 53), (54, 37)], [(127, 74), (605, 324), (634, 508)]}. However, the generated prior boxes were unevenly distributed, as shown in Figure 5. Therefore, we proposed the area k-means method, dividing all true anchor boxes into small, medium, and large groups based on their areas. For each group of anchor boxes, the k-means method was applied to obtain three prior box sizes, resulting in a total of nine prior box sizes: {[(8, 7), (15, 14), (22, 28)], [(50, 41), (93, 71), (154, 47)], [(162, 104), (605, 324), (634, 508)]}, as illustrated in Figure 5.

To verify the impact of the prior-box area method on network performance, we conducted a set of ablation experiments. The experimental results are shown in Table 3, where “k-means” represents the original method for generating prior boxes in YOLOv5, and “area k-means” represents our proposed prior-box area method. According to the experimental results, it is evident that our method achieved a slightly higher mAP0.75 than the k-means method, with a 2.1% improvement in mAP0.5, and a 1.3% improvement in mAP0.5:0.95.

To investigate the impact of the number of prior boxes on network performance, we conducted a set of ablation experiments as shown in Table 3. YOLOv5 has three different sizes of feature map outputs, with the default number of prior boxes generated for each output set at three (as the baseline). We increased the number of generated prior boxes in the hope of improving the precision of the model detection. According to the experimental results, when initializing each size with four prior boxes, both mAP0.5 and mAP0.75 were higher than the baseline, while mAP0.5:0.95 was slightly lower. When the number of initialized prior boxes was increased to five, mAP0.5 reached its peak, but mAP0.75 and mAP0.5:0.95 were lower than the baseline. Therefore, considering the overall performance, we chose to initialize the number of prior boxes at four.

To verify the impact of the number of prior boxes on network performance, we designed a set of ablation experiments, and the experimental results are shown in Table 4. In YOLOv5, there are three different sizes of feature map outputs, with the default number of prior boxes generated for each output channel set at three (as the baseline). We increased the number of generated prior boxes with the expectation of improving the precision of model detection.

When the number of prior boxes for each output channel was increased to four, the generated prior boxes were: {[(7, 6), (11, 11), (19, 17), (21, 33)], [(51, 28), (39, 51), (77, 32), (114, 63)], [(162, 104), (606, 323), (597, 466), (636, 510)]}. When the number of prior boxes for each output channel was further increased to five, the generated prior boxes were: {[(6, 5), (9, 9), (15, 14), (22, 19), (20, 34)], [(54, 27), (23, 69), (40, 43), (72, 39), (114, 63)], [(162, 104), (606, 323), (556, 478), (624, 462), (636, 510)]}.

The initialization of three prior boxes served as the baseline. According to the experimental results, when each size was initialized with four prior boxes, both mAP0.5 and mAP0.75 were higher than the baseline, with mAP0.5:0.95 slightly lower than the baseline, yet with the highest mean value. When the number of initialized prior boxes was increased to five, mAP0.5 reached its peak, yet both mAP0.75 and mAP0.5:0.95 were below the baseline, with the mean value being the lowest. Therefore, when comprehensively considering these results, our method initialized the number of prior boxes to four.

3.2.2. Ablation Experiments on Network Architecture Improvements

To verify the impact of the ECA attention mechanism and feature fusion on network performance, we designed a set of ablation experiments, and the experimental results are shown in Table 5. According to the experimental results, our method showed a 4.3% improvement in mAP0.5, a 3.2% improvement in mAP0.75, and a 4.1% improvement in mAP0.5:0.95, all significantly higher than YOLOv5s (which already incorporated the area k-means method and set the number of prior box outputs to four). It can be observed that introducing the ECA attention mechanism in the neck section and adding three channels for feature fusion allowed better focus on the bughole features we were interested in, enhancing the network’s representational capacity, generalization ability, and feature discriminative power. This approach fully leveraged both the spatial and semantic information in images, avoided loss of feature information, and improved recognition accuracy.

3.2.3. Comparative Experiment with Other Network

To validate that the performance of our proposed method surpassed that of other networks, we designed two sets of ablation experiments, comparing our improved network with several popular networks (YOLOv5s, YOLOv6s, YOLOxs). The results of the first set of ablation experiments are shown in Table 6. According to the experimental results, our improved network significantly outperformed YOLOv5s, YOLOv6s, and YOLOxs networks in terms of the number of detection errors. The number of correct detections and accuracy were slightly lower than YOLOv6s but still substantially higher than YOLOv5s and YOLOxs. Training time and detection speed were slightly less favorable. Compared to YOLOv5s, our method improved the correct number by 3.6%, reduced the wrong number by 23.5%, increased accuracy by 3.6%, increased training time by 2.2 min, and decreased detection speed by 6.4 fps.

The results of the second set of ablation experiments are shown in Table 7 According to the experimental results, our improved network significantly outperformed the YOLOv5s, YOLOv6s, and YOLOxs networks in mAP0.5, mAP0.75, and mAP0.5:0.95. Although the numbers of parameters and GFLOPs in our method were slightly higher than those of YOLOv5, they were much lower than YOLOv6s and YOLOxs. Compared to YOLOv5s, our method achieved a 7.1% improvement in mAP0.5, a 3.7% improvement in mAP0.75, and a 4.5% improvement in mAP0.5:0.95, with only a slight increase of 0.565M parameters and 0.591G GFLOPs. Overall, based on these two sets of ablation experiments, our improved network exhibited the best overall performance among the four methods.

We selected four images from the dataset and input them into the aforementioned four networks. The experimental results are shown in Figure 6. It can be observed from the results that YOLOv6s and YOLOxs exhibited a higher rate of missed detections and lower recognition rates compared to YOLOv5s and our method, with YOLOxs having the highest misidentification rate. As seen in the first row in Figure 6, our method had a lower false detection rate compared to YOLOv5s. In the second and third rows in Figure 6, YOLOv5s and our method exhibited comparable detection performance. In the fourth row in Figure 6, the recognition accuracy significantly decreased in the presence of numerous bugholes, which was unavoidable. However, our method still outperformed the other three in these scenarios.

4. Conclusions

Due to the irreversible nature of fair-faced concrete, engineering quality requirements are very high. To address bugholes on the surface of fair-faced concrete, we proposed a method that combined YOLOv5 with ECA attention mechanism and feature fusion and changed the way and quantity of prior box generation. This method was validated through experiments and ablation studies conducted on a dataset we constructed. In the YOLOv5 model, we introduced the area k-means method to cluster anchor boxes, making the generated prior boxes more suitable for our dataset. We modified the number of generated prior boxes to adapt to bugholes of different scales, thereby improving detection accuracy. The ECA attention mechanism and feature fusion were incorporated into the neck part of the model to enhance feature representation, improve generalization ability, and increase network accuracy. Compared to YOLOv5s, this method significantly improved detection accuracy with almost no change in parameters. It showed a 7.1% improvement in mAP0.5, a 3.7% improvement in mAP0.75, and a 4.5% improvement in mAP0.5:0.95. In future research, emphasis should be shifted towards lightweight model design to ensure a certain detection accuracy while reducing unnecessary resource consumption, thereby enhancing the real-time performance and efficiency of network detection.

Author Contributions

Methodology, software, writing—original draft, W.T.; funding acquisition, visualization, B.L.; conceptualization, investigation, writing—review and editing, J.C.; visualization, project administration, supervision, F.D.; visualization, supervision, Y.L.; data curation, writing—original draft, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data sharing not applicable. Since this article uses a self-built dataset, the dataset is not shared.

Conflicts of Interest

Author Wei Tian was employed by the company CCCC Second Harbor Engineering Co., Ltd. Authors Bazhou Li and Yang Li were employed by the company CCCC Second Harbor Engineering Co., Ltd. and CCCC Wuhan Harbor Engineering Design and Research Institute Co., Ltd. Author Feichao Di was employed by the CCCC Wuhan Harbor Engineering Design and Research Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Laofor, C.; Peansupap, V. Defect detection and quantification system to support subjective visual quality inspection via a digital image processing: A tiling work case study. Autom. Constr. 2012, 24, 160–174. [Google Scholar] [CrossRef]
Reading, T.J. The bughole problem. J. Proc. 1972, 69, 165–171. [Google Scholar]
Samuelsson, P. Voids in concrete surfaces. J. Proc. 1970, 67, 868–874. [Google Scholar]
Lemaire, G.; Escadeillas, G.; Ringot, E. Evaluating concrete surfaces using an image analysis process. Constr. Build. Mater. 2005, 19, 604–611. [Google Scholar] [CrossRef]
Wei, F.; Yao, G.; Yang, Y.; Sun, Y. Instance-level recognition and quantification for concrete surface bughole based on deep learning. Autom. Constr. 2019, 107, 102920. [Google Scholar] [CrossRef]
Liu, Y.F.; Cho, S.; Spencer, B., Jr.; Fan, J.S. Concrete crack assessment using digital image processing and 3D scene reconstruction. J. Comput. Civ. Eng. 2016, 30, 04014124. [Google Scholar] [CrossRef]
Ozkul, T.; Kucuk, I. Design and optimization of an instrument for measuring bughole rating of concrete surfaces. J. Frankl. Inst. 2011, 348, 1377–1392. [Google Scholar] [CrossRef]
Yoshitake, I.; Maeda, T.; Hieda, M. Image analysis for the detection and quantification of concrete bugholes in a tunnel lining. Case Stud. Constr. Mater. 2018, 8, 116–130. [Google Scholar] [CrossRef]
Liu, B.; Yang, T. Image analysis for detection of bugholes on concrete surface. Constr. Build. Mater. 2017, 137, 432–440. [Google Scholar] [CrossRef]
Da Silva, W.R.L.; Štemberk, P. Expert system applied for classifying self-compacting concrete surface finish. Adv. Eng. Softw. 2013, 64, 47–61. [Google Scholar] [CrossRef]
Zhu, Z.; Brilakis, I. Machine vision-based concrete surface quality assessment. J. Constr. Eng. Manag. 2010, 136, 210–218. [Google Scholar] [CrossRef]
Peterson, K.; Carlson, J.; Sutter, L.; Van Dam, T. Methods for threshold optimization for images collected from contrast enhanced concrete surfaces for air-void system characterization. Mater. Charact. 2009, 60, 710–715. [Google Scholar] [CrossRef]
Dan, C.; Ueli, M.; Jonathan, M.; Jürgen, S.h. Multi-column deep neural network for traffic sign classification. Neural Netw. 2012, 32, 333–338. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
Yan, H.; Zhang, H.; Shi, J.; Ma, J.; Xu, X. Inspiration transfer for intelligent design: A generative adversarial network with fashion attributes disentanglement. IEEE Trans. Consum. Electron. 2023, 69, 1152–1163. [Google Scholar] [CrossRef]
Cesa-Bianchi, N.; Conconi, A.; Gentile, C. On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory 2004, 50, 2050–2057. [Google Scholar] [CrossRef]
Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef]
Sun, Y.; Yang, Y.; Yao, G.; Wei, F.; Wong, M. Autonomous crack and bughole detection for concrete surface image based on deep learning. IEEE Access 2021, 9, 85709–85720. [Google Scholar] [CrossRef]
Wei, W.; Ding, L.; Luo, H.; Li, C.; Li, G. Automated bughole detection and quality performance assessment of concrete using image processing and deep convolutional neural networks. Constr. Build. Mater. 2021, 281, 122576. [Google Scholar] [CrossRef]
Lei, F.; Tang, F.; Li, S. Underwater target detection algorithm based on improved YOLOv5. J. Mar. Sci. Eng. 2022, 10, 310. [Google Scholar] [CrossRef]
Mathew, M.P.; Mahesh, T.Y. Leaf-based disease detection in bell pepper plant using YOLO v5. Signal Image Video Process. 2022, 16, 841–847. [Google Scholar] [CrossRef]
Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, 21 June–18 July 1965; University of California Press: Berkeley, CA, USA, 1967; Volume 1, pp. 281–297. [Google Scholar]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]

Figure 1. Improved network structure.

Figure 2. Area distribution of anchor boxes.

Figure 3. ECA network structure, where G stands for global averaging pooling,

σ

stands for sigmoid activation function, and (•) stands for position-wise dot product.

Figure 3. ECA network structure, where G stands for global averaging pooling,

σ

stands for sigmoid activation function, and (•) stands for position-wise dot product.

Figure 4. Image acquisition system. (a) depicts the pouring tower used in bridge pier construction; (b) showcases the observation window at the top of the pouring tower made of steel and acrylic panels; (c) is the internal structure of the pouring tower; (d) is the acquisition window of the image capture system.

Figure 5. The prior box distribution.

Figure 6. Comparison of detection performance among different networks.

Table 1. YOLOv5 parameter table for different versions.

YOLOv5	Depth_Multiple	Width_Multiple	Params/M	GFLOPs/G
YOLOv5s	0.33	0.50	7.03	7.9
YOLOv5m	0.67	0.75	20.9	48.0
YOLOv5l	1.0	1.0	46.1	107.9
YOLOv5x	1.33	1.25	86.2	204.2

Table 2. Dataset’s anchor frame distribution.

	Small	Medium	Large
Area	(0, $32^{2}$ ]	( $32^{2}$ , $96^{2}$ ]	( $96^{2}$ , infinity)
Number	1722	431	839

Table 3. Comparison of network performance using different methods for generating prior boxes.

Method	map0.5	map0.75	map0.5:0.95
k-Means (baseline)	0.749	0.561	0.558
Area k-means (ours)	0.765	0.562	0.565