Article

CFE-YOLOv8s: Improved YOLOv8s for Steel Surface Defect Detection

by Shuxin Yang 1, Yang Xie 1, Jianqing Wu 1,*, Weidong Huang 1,*, Hongsheng Yan 2, Jingyong Wang 2, Bi Wang 1, Xiangchun Yu 1, Qiang Wu 3 and Fei Xie 4

1 School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
2 Longnan Dingtai Electronic Technology Co., Ltd., Longnan 341700, China
3 Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu 610000, China
4 School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310000, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(14), 2771; https://doi.org/10.3390/electronics13142771
Submission received: 9 June 2024 / Revised: 9 July 2024 / Accepted: 10 July 2024 / Published: 15 July 2024
(This article belongs to the Special Issue Machine Learning and Deep Learning Based Pattern Recognition)

Abstract

Due to the low detection accuracy in steel surface defect detection and the constraints of limited hardware resources, we propose an improved model for steel surface defect detection, named CBiF-FC-EFC-YOLOv8s (CFE-YOLOv8s), comprising CBS-BiFormer (CBiF) modules, Faster-C2f (FC) modules, and EMA-Faster-C2f (EFC) modules. Firstly, because of the potential information loss that convolutional neural networks (CNN) may encounter when dealing with miniature targets, the CBiF combines CNN with Transformer to optimize local and global features. Secondly, to address the increased computational complexity caused by the extensive use of convolutional layers, the FC uses the FasterNet block to reduce redundant computations and memory access. Lastly, the EMA is incorporated into the FC to design the EFC module and enhance feature fusion capability while keeping the model lightweight. CFE-YOLOv8s achieves mAP@0.5 values of 77.8% and 69.5% on the NEU-DET and GC10-DET datasets, respectively, representing enhancements of 3.1% and 2.8% over YOLOv8s, with reductions of 22% and 18% in model parameters and FLOPs. CFE-YOLOv8s demonstrates superior overall performance and balance compared to other advanced models.

1. Introduction

Surface defects are regions on the surface of a product with uneven chemical or physical characteristics [1]. Steel plays an irreplaceable role as a crucial raw material in various industries that manufacture new products. Due to impurities in raw materials, defects in production processes, and mechanical damage, surface defects are prone to occurring in steel production. Surface defects in steel can weaken its structure and reduce its strength, thus affecting the mechanical and corrosion resistance properties of industrial products. Hence, finding surface flaws is essential to preserving the functional integrity and appearance of steel.
The two primary conventional approaches for identifying surface defects in steel are manual visual inspection and machine vision-based detection. Although both can detect surface imperfections to some extent, each has clear limitations. Manual visual inspection is tedious and slow, cannot be performed in real time, and is heavily influenced by the subjective judgment of inspectors [2]. Machine vision-based detection relies primarily on digital image analysis to identify and quantify surface defects; such methods lack robustness and adaptivity, require manual feature design and selection, and are sensitive to background noise and lighting.
Deep learning approaches are increasingly popular because conventional detection approaches cannot satisfy the accuracy and speed requirements of modern industry. Since neural networks excel at extracting features from data, deep learning approaches can learn features automatically from the input without a hand-designed feature extractor. These advantages make deep learning methods more flexible and adaptable when faced with complex surface structures and diverse defects. Additionally, compared to other detection methods, deep learning-based techniques offer notable improvements in detection speed and accuracy, better guaranteeing real-time detection and product quality.
While there has been significant development in steel surface defect detection, challenges persist. As illustrated in Figure 1, steel surface imperfections have various shapes, sizes, and ambiguous borders. The varied shapes of defect targets can cause a loss of target feature information and inadequate feature extraction during detection, which lowers accuracy. Meanwhile, because of the unclear boundaries and erratic distribution of steel surface defects, a detector risks focusing on densely populated defect areas while neglecting other defects. Industrial production also demands detection speed, so accuracy and efficiency must be balanced. Additionally, a lightweight model is essential because it reduces the risk of overfitting, improves generalization, and requires less hardware. To handle these issues better, our study makes the following three contributions:
  • We propose a new feature extraction module called CBS-BiFormer (CBiF) to address the issue of varied defect shapes and sizes and unclear defect boundaries. Unlike previous simple convolutional neural network (CNN) or Transformer structures, CBiF combines CNN and Transformer to fully utilize both local and global features in the image to enhance detection accuracy.
  • We propose a lightweight Faster-C2f (FC) module based on the primary feature extraction module of YOLOv8s, C2f, and the FasterNet block. It reduces redundant computation and memory accesses, achieving a lightweight model while ensuring detection accuracy.
  • To further enhance the detection accuracy, efficient multi-scale attention (EMA) is applied to the lightweight module FC, resulting in the proposal of the EMA-Faster-C2f (EFC) module. By introducing the attention mechanism, the model is assisted in learning the importance of different features. While ensuring the model is lightweight, EFC enhances the capability of feature fusion.

2. Related Works

Surface defect detection methods can be categorized into conventional machine learning approaches and deep learning approaches based on their different methods of feature representation.

2.1. Conventional Machine Learning Approaches

Conventional machine learning approaches are essential in laying the foundation for the later advancement of deep learning approaches for object detection. In conventional machine learning approaches, feature extraction algorithms such as local binary patterns (LBP) [3] and histogram of oriented gradients (HOG) [4] are first employed to extract features. The extracted features are then classified using algorithms such as support vector machines (SVMs) [5] and decision trees [6]. Luo et al. [7] presented a generalized completed local binary pattern framework with a nearest-neighbor classifier to classify surface defects in steel by extracting descriptive information hidden in non-uniform patterns. Zhang et al. [8] proposed a fuzzy measure-based surface defect detection method for strip steel; by combining defect localization based on pixel connectivity with statistical data and membership functions to estimate the degree of defects, it provided a practical solution for various defect types. Wang et al. [9] formulated the surface defect identification task as a low-rank and entity sparsity pursuit (ESP) problem and introduced a corresponding ESP strategy.
Although conventional machine learning methods have addressed certain drawbacks of conventional techniques in surface defect detection, they still necessitate manual feature design and heavily rely on the quality and selection of feature representation. Consequently, conventional machine learning approaches may fail to achieve the desired detection performance in intricate industrial environments, particularly those with diverse defect typologies.

2.2. Deep Learning Approaches

Due to the excellent performance of object detection models in localizing and recognizing various objects in images, most deep learning approaches are built upon these object detection models as their foundation [10]. Object detection models are currently divided into two categories: first, there are one-stage detection methods such as You Only Look Once (YOLO) [11,12,13] and Single-Shot MultiBox Detector (SSD) [14]. The second category includes two-stage detection methods such as R-CNN [15], Fast-RCNN [16], and Faster-RCNN [17]. One-stage detection methods treat object detection as a regression problem, directly predicting the location and class of objects from images. In contrast, the two-stage target detection method decomposes the task into two stages: it first generates candidate regions and then performs classification and fine-tuning of the positions of these candidate regions.
Compared to one-stage detection methods, two-stage detection methods exhibit superior accuracy. However, the detection speed is relatively slow, and the model size is relatively large. Researchers have proposed various approaches to enhance the detection speed of two-stage methods. Wang et al. [18], for instance, combined an enhanced ResNet50 with an optimized Faster R-CNN, resulting in reduced runtime and commendable detection accuracy. Shi et al. [19] applied the ConvNeXt architecture to Faster R-CNN, incorporated the CBAM attention mechanism for better feature extraction, and utilized K-means clustering to optimize anchors for surface defects, leading to enhanced detection accuracy and speed.
Moreover, some researchers are committed to further enhancing the detection accuracy of two-stage methods. By replacing a portion of the standard convolution with a deformable convolution network, Zhao et al. [20] improved Faster R-CNN, increased detection accuracy overall, and improved the ability to detect minor faults. Yang et al. [21] enhanced the Faster R-CNN by aggregating multi-level feature maps and optimizing the anchor selection scheme in the region proposal network (RPN), improving the accuracy of detecting cracks on steel plates in infrared thermal images. Despite significant advancements in two-stage methods, the large model size and slow detection speed remain the primary factors hindering their industrial application.
One-stage methods detect markedly faster than two-stage methods. Consequently, research on one-stage methods focuses more on improving their detection accuracy, primarily by enhancing feature extraction capabilities and optimizing feature fusion. To enhance feature extraction capabilities, standard approaches introduce additional feature extraction modules into the backbone of the model, such as attention mechanisms, Transformers, and residual modules. Building on YOLOv5, Wang et al. [22] introduced multi-scale exploration blocks into the detection network to enhance the detection performance. They also developed a spatial attention mechanism to focus more on defect information, achieving real-time detection while improving detection accuracy. Guo et al. [23] proposed an improved YOLOv5 based on a Transformer, where the TRANS module designed using the Transformer was inserted into the backbone network and detection head. The TRANS module effectively addresses the issue of poor detection performance for small defects in steel images with high background interference. Zhao et al. [10] used Res2Net blocks to construct the backbone branch of YOLOv5, combined with a dual-feature pyramid network (DFPN) and decoupled head design, achieving good overall performance. Zhou et al. [24] replaced the C3 module with the CSPlayer module, introduced the global attention mechanism (GAM), and utilized weighted average output along dimensions to develop an improved YOLOv5 that effectively addressed the detection of small metal surface defects.
Improving feature propagation methods and adding detection paths are effective for optimizing feature fusion. To achieve effective identification of steel surface flaws, Wang et al. [25] enhanced YOLOv7 by utilizing a de-weighted BiFPN structure, an ECA attention mechanism, and an SIoU loss function. By modifying the path aggregation network to a unique receptive field block structure and integrating attention processes into the YOLOv4 core, Li et al. [26] successfully increased the accuracy of detecting steel strip surface defects.
In addition to the approaches mentioned above, which are based on object detection models, some researchers have also proposed other methods for detecting defects on steel surfaces. Wang et al. [27] introduced the real-time detection network (RDN) to conduct steel surface defect detection. RDN utilized ResNetdcn as the fundamental convolutional architecture and incorporated both the pyramid feature fusion module (PFM) and the skip layer connection module (SCM), achieving a balance between detection speed and accuracy. Yeung et al. [28] constructed the fused-attention network (FANet) to effectively detect defects. FANet incorporated the adaptively balanced feature fusion (ABFF) method and the fused-attention module (FAM) to address multi-scale defects and achieve accurate localization and classification. Liu et al. [29] proposed the MSC-DNet module to improve the precise localization and classification of defects. In MSC-DNet, a parallel architecture of dilated convolution (PADC) was constructed to capture multi-scale information. Additionally, a feature enhancement and selection module (FESM) was employed to reduce confusion, and auxiliary image-level supervision (AIS) was utilized to accelerate the convergence speed and enhance the discriminative capability.
Although the aforementioned methods effectively enhance detection accuracy, some approaches, such as incorporating attention mechanisms, integrating Transformer modules, and adding detection paths, may increase the parameters and computational complexity, impacting detection speed and model deployment. Therefore, when proposing the new surface defect detection model CFE-YOLOv8s, we considered improving the detection accuracy and implemented lightweight improvements to the model, mitigating potential negative effects of excessive parameter count and computational complexity.

3. The Proposed Method

3.1. Overview of YOLOv8

YOLOv8 consists of five versions, namely, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Among them, YOLOv8s maintains good detection accuracy and exhibits higher detection speed. Therefore, in this article, YOLOv8s is chosen as the baseline model.
YOLOv8s comprises four parts: Input, Backbone, Neck, and Head. The primary function of the input is to resize the input image to a size suitable for the network structure and to perform mosaic data augmentation. The backbone is responsible for extracting features from the input image so that the subsequent detection head can utilize these features for object detection and recognition. C2f is the main feature extraction module of YOLOv8s, inspired by the design philosophy of ELAN [30] from YOLOv7. C2f enhances the transfer and utilization efficiency of features by introducing additional gradient paths to promote internal gradient flow. The neck combines the structures of the feature pyramid network (FPN) and path aggregation network (PAN) to achieve a multi-level fusion of features, facilitating multi-scale target prediction. After feature extraction and fusion, the head performs classification and bounding box regression calculations to obtain the final results.

3.2. The Overall Framework of CFE-YOLOv8s

While effectively improving detection accuracy, the extensive use of C2f in the backbone and neck of YOLOv8s also has some negative impacts. The C2f structure, as shown in Figure 2, consists of multiple CBS and bottleneck units, where each bottleneck contains two CBS layers connected by residual connections. The high computational complexity and large parameter count of C2f result from the stacking of numerous convolutional layers. Additionally, in multi-layer convolution operations, limitations imposed by the window size of convolutional kernels and the effects of pooling operations may lead to the loss of detailed information for small targets, thereby impacting the ability to accurately localize and classify small targets. As a result, in this study, CBiF, FC, and EFC are designed to replace C2f in the neck and backbone.
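For reference, below is a minimal PyTorch sketch of the C2f pattern just described (a 1 × 1 CBS split, chained bottlenecks, and concatenation of every intermediate output); the layer arrangement follows the public YOLOv8 implementation, while the helper and class names here are ours.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1):
    # CBS unit: Conv2d + BatchNorm2d + SiLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class Bottleneck(nn.Module):
    # Two CBS layers joined by a residual connection.
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(cbs(c, c, 3), cbs(c, c, 3))

    def forward(self, x):
        return x + self.body(x)

class C2f(nn.Module):
    # Split after a 1x1 CBS, chain n bottlenecks, and concatenate every
    # intermediate output -- the extra gradient paths described above.
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = cbs(c_in, 2 * self.c, 1)
        self.cv2 = cbs((2 + n) * self.c, c_out, 1)
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.blocks:
            y.append(m(y[-1]))
        return self.cv2(torch.cat(y, dim=1))
```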
Figure 3 shows the overall structure of CFE-YOLOv8s. The backbone employs CBiF and FC as feature extraction modules to enhance extraction efficiency while lowering computational complexity and parameter counts. The EFC is used in the neck to improve the feature fusion of the model. The precise substitutions are as follows (a backbone layout sketch follows the list):
  • The intermediate two C2f modules within the backbone are replaced with CBiF;
  • The initial and final C2f modules within the backbone are supplanted with FC;
  • All instances of C2f within the neck segment are substituted with EFC.
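To make the layout concrete, the sketch below lists a CFE-YOLOv8s backbone in ultralytics-style rows ([from, repeats, module, args]). It adapts the standard yolov8s backbone definition, assumes CBiF and FC are registered as custom modules, and the argument lists are illustrative rather than the authors' exact configuration.

```python
# Sketch of the CFE-YOLOv8s backbone (each row: [from, repeats, module, args]).
BACKBONE = [
    [-1, 1, "Conv", [64, 3, 2]],    # 0-P1/2
    [-1, 1, "Conv", [128, 3, 2]],   # 1-P2/4
    [-1, 3, "FC",   [128, True]],   # 2: initial C2f -> FC
    [-1, 1, "Conv", [256, 3, 2]],   # 3-P3/8
    [-1, 6, "CBiF", [256]],         # 4: intermediate C2f -> CBiF
    [-1, 1, "Conv", [512, 3, 2]],   # 5-P4/16
    [-1, 6, "CBiF", [512]],         # 6: intermediate C2f -> CBiF
    [-1, 1, "Conv", [1024, 3, 2]],  # 7-P5/32
    [-1, 3, "FC",   [1024, True]],  # 8: final C2f -> FC
    [-1, 1, "SPPF", [1024, 5]],     # 9
]
```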

3.3. CBiF Module

With the continuous evolution of Transformer models such as ViT [31] and Swin Transformer [32], notable advancements have been made in their application to surface defect detection. When processing image data, Transformers capture long-range dependencies through self-attention mechanisms. They excel at establishing relationships between pixels across the entire image, thereby understanding the global context. This is particularly valuable for defects with unclear boundaries, as global feature extraction helps the model comprehend the overall structure and patterns of the image. CNNs, in turn, are adept at extracting local features, capturing local patterns such as edges, corners, and textures through convolution operations. Their local perceptual ability and the translational invariance of convolution enable CNNs to recognize similar features at different positions, making them particularly suitable for detecting the detailed parts of defects. Combining CNNs and Transformers also enables the fusion of multi-scale features: CNNs extract local detail while Transformers capture global context, and together they allow a more comprehensive understanding of the shape and size variations of defects. Based on these points, this study designs a feature extraction module integrating CNNs and Transformers, named CBiF, built from CBS units and the BiFormer block.
BiFormer [33] is a variant of the Transformer model, consisting of multiple BiFormer blocks with Bi-Level Routing Attention (BRA) at their core. The core idea of BRA is as follows: first, a gather operation is performed at a coarse region level using Q (query), K (key), and V (value) to collect the most relevant key-value pairs, retaining only a small number of relevant routing regions; next, fine-grained token-to-token attention is applied within the union of these routing regions. The processing steps are as follows:
K_g = gather(K, I_r)
V_g = gather(V, I_r)
O = Attention(Q, K_g, V_g) + LCE(V)
where Q denotes the query; I_r the indices of the k most relevant regions for the i-th region, i.e., the i-th row of the routing index matrix; K_g and V_g the gathered key and value tensors, respectively; and LCE(V) the local context enhancement term. Thanks to BRA, the BiFormer block can allocate computation more flexibly, attend to only a limited number of relevant regions, avoid redundant information, and deliver excellent detection performance.
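To make the gather-then-attend flow concrete, here is a simplified, single-head PyTorch sketch of BRA. It assumes a flattened token sequence evenly partitioned into regions and omits the LCE(V) term and multi-head details; function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def bi_level_routing_attention(q, k, v, num_regions, topk):
    """q, k, v: (B, N, C) token sequences, with N divisible by num_regions."""
    B, N, C = q.shape
    rs = N // num_regions  # tokens per region
    # Region-level descriptors via mean pooling (coarse level).
    qr = q.view(B, num_regions, rs, C).mean(dim=2)            # (B, R, C)
    kr = k.view(B, num_regions, rs, C).mean(dim=2)            # (B, R, C)
    # Coarse region-to-region affinity and top-k routing indices I_r.
    affinity = qr @ kr.transpose(-2, -1)                      # (B, R, R)
    idx = affinity.topk(topk, dim=-1).indices                 # (B, R, k)
    # Gather the key/value tokens of the routed regions: K_g, V_g.
    k_tok = k.view(B, 1, num_regions, rs, C).expand(-1, num_regions, -1, -1, -1)
    v_tok = v.view(B, 1, num_regions, rs, C).expand(-1, num_regions, -1, -1, -1)
    gidx = idx[..., None, None].expand(-1, -1, -1, rs, C)     # (B, R, k, rs, C)
    kg = torch.gather(k_tok, 2, gidx).reshape(B, num_regions, topk * rs, C)
    vg = torch.gather(v_tok, 2, gidx).reshape(B, num_regions, topk * rs, C)
    # Fine-grained token-to-token attention within the routed regions.
    qf = q.view(B, num_regions, rs, C)
    attn = F.softmax(qf @ kg.transpose(-2, -1) / C ** 0.5, dim=-1)
    return (attn @ vg).reshape(B, N, C)

# e.g., tokens = torch.randn(2, 64, 32); output keeps the same shape:
# out = bi_level_routing_attention(tokens, tokens, tokens, num_regions=4, topk=2)
```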
As shown in Figure 4, CBS integrates Conv2d, BatchNorm2d, and the SiLU activation function. This construction simplifies the model structure, reduces the number of parameters, improves computational efficiency, and enhances the model's generalization capability. In CBiF, the main role of CBS is to extract features and adjust the number of channels in the feature maps to improve feature representation and model performance. The BiFormer block begins with a 3 × 3 depthwise convolutional layer to implicitly encode relative positional information. Subsequently, it sequentially applies the BRA and a two-layer multi-layer perceptron (MLP) for cross-location relation modeling and per-location embedding. CBiF comprises two branches, built from four CBS blocks and one BiFormer block. After a CBS adjusts the number of channels, the result is fed into two parallel branches: one branch continues with convolutional operations through CBS, while the other performs global feature extraction using the BiFormer block before down-sampling. After the parallel processing, the feature maps from the two branches are concatenated.
The concatenation process, illustrated in Figure 5, assumes that the feature map outputs from both branches are 40 × 40 × 256. Upon concatenation in the channel dimension, a 40 × 40 × 512 feature map is obtained as the output. Finally, a 1 × 1 convolution is applied to linearly combine the channel information of the feature map, enabling CBiF to better capture feature information from the two branches and adjust the output channel numbers.
The parallel structure of CBiF with two branches allows for simultaneous processing of different features, aiding in capturing diverse information. Introducing CBiF into the backbone can effectively integrate global and local features, providing richer information for the subsequent detection stage and enhancing detection accuracy.
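A structural sketch of CBiF consistent with this description follows. The `biformer_block` argument stands in for an actual BiFormer block implementation, and the branch depths and kernel sizes are our assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1):
    # CBS unit: Conv2d + BatchNorm2d + SiLU (see Figure 4)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class CBiF(nn.Module):
    # Parallel local (CBS) and global (BiFormer) branches, concatenated on the
    # channel axis and fused by a 1x1 convolution, as in Figures 4 and 5.
    def __init__(self, c_in, c_out, biformer_block):
        super().__init__()
        self.stem = cbs(c_in, c_out)           # channel adjustment
        self.local = nn.Sequential(cbs(c_out, c_out, 3), cbs(c_out, c_out, 3))
        self.global_branch = biformer_block    # global features via BRA
        self.fuse = cbs(2 * c_out, c_out, 1)   # 1x1 fusion over concatenated maps

    def forward(self, x):
        x = self.stem(x)
        local_feats = self.local(x)
        global_feats = self.global_branch(x)   # must preserve (C, H, W)
        # e.g., two 40x40x256 maps -> 40x40x512 -> fused back to c_out channels
        return self.fuse(torch.cat([local_feats, global_feats], dim=1))
```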

3.4. FC Module

Parameters and computational complexity are critical metrics for measuring model complexity, which is crucial in improving training efficiency, reducing hardware dependency, and enhancing model deployment flexibility. Based on the above points, this article introduces a lightweight module FC by incorporating the FasterNet block into the C2f architecture.
The FasterNet block, proposed in FasterNet [34], consists of a partial convolution (PConv) layer followed by two point-wise convolution (PWConv) layers, as illustrated in Figure 6, with a batch normalization layer (BN) and an activation layer (ReLU) placed between the PWConv layers. Here, H and W stand for the height and width of the input feature map, C for its channel count, P for the number of channels the PConv layer uses for feature extraction, and s for the width and height of the filter. PConv is a straightforward, fast, and effective convolution technique that applies a conventional convolution to only a subset of the input channels while leaving the remaining channels untouched. Through this approach, PConv effectively reduces memory access and redundant computation. The computational load q and memory access load n of PConv are, respectively:
q = H × W × s² × P²
n = H × W × 2P + s² × P²
The computational load q₁ and memory access load n₁ of a regular convolution (Conv) are, correspondingly:
q₁ = H × W × s² × C²
n₁ = H × W × 2C + s² × C²
Compared to Conv, PConv has a lower computational and memory access load; with the partial ratio P = C/4 typically used in FasterNet, PConv needs only (P/C)² = 1/16 of the FLOPs of a regular convolution. Additionally, to fully and effectively utilize the information from all channels, the FasterNet block pairs PConv with the PWConv layers: after PConv convolves part of the channels, the subsequent point-wise convolutions mix information across all channels, so the channels left untouched by PConv still contribute to the final output.
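A minimal PyTorch sketch of PConv and the FasterNet block as characterized above; the partial ratio, kernel size, and expansion factor follow the FasterNet paper's defaults, and the class names are ours.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    # Partial convolution: a regular k x k conv over only the first
    # P = C/ratio channels; the remaining channels pass through untouched.
    # With P = C/4 and s = 3, this costs (P/C)^2 = 1/16 of a full conv's FLOPs.
    def __init__(self, c, ratio=4, k=3):
        super().__init__()
        self.c_p = c // ratio
        self.conv = nn.Conv2d(self.c_p, self.c_p, k, 1, k // 2, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.c_p], x[:, self.c_p:]
        return torch.cat([self.conv(x1), x2], dim=1)

class FasterNetBlock(nn.Module):
    # PConv followed by two point-wise (1x1) convs with BN + ReLU between them,
    # wrapped in a residual connection, as in Figure 6.
    def __init__(self, c, expand=2):
        super().__init__()
        self.pconv = PConv(c)
        self.pwconv = nn.Sequential(
            nn.Conv2d(c, expand * c, 1, bias=False),
            nn.BatchNorm2d(expand * c),
            nn.ReLU(inplace=True),
            nn.Conv2d(expand * c, c, 1, bias=False),
        )

    def forward(self, x):
        return x + self.pwconv(self.pconv(x))
```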
The structure of FC is shown in Figure 7. The feature map is first adjusted for channel number by a CBS and then split into two parts. One part undergoes feature extraction through three parallel FasterNet blocks, while the other is concatenated with the results after processing through the three FasterNet blocks. Finally, the output channel number is adjusted to obtain the final result. FC replaces the bottleneck originally composed entirely of CBS with FasterNet blocks, successfully overcoming the high computational complexity and significant parameter count issues caused by convolutional layers used extensively in C2f.

3.5. EFC Module

Attention mechanisms are widely used in image recognition tasks. Adding an attention mechanism lets a model concentrate on the targets to be detected, improving detection accuracy to some extent. For defects with diverse shapes and sizes, the attention mechanism can dynamically adjust weights, adaptively extract the important features in an image, and suppress irrelevant ones, enabling the model to handle defects of different shapes and sizes flexibly. For defects with unclear boundaries, the attention mechanism focuses on the boundaries of the defect area and increases the weight of boundary information, allowing the model to identify and distinguish defect boundaries more clearly and to process blurry areas, thereby improving detection accuracy and reliability. Therefore, by introducing EMA [35] into FC, we propose the lightweight attention module EFC.
The EMA structure is illustrated in Figure 8. First, EMA groups the input feature map along the channel dimension to better capture diverse semantic information. Following that, EMA captures multi-scale spatial information using parallel sub-networks. To enhance the extraction of attention weight descriptors, EMA incorporates three parallel pathways: the two left pathways are referred to as the 1 × 1 branch, while the rightmost pathway is the 3 × 3 branch. Within the two 1 × 1 branches are a 1D horizontal global pooling (X Avg Pool) and a 1D vertical global pooling (Y Avg Pool). These two global average pooling operations can effectively obtain feature information in the width and height directions. In the 3 × 3 branch, multi-scale feature representations are obtained using a 3 × 3 convolutional layer.
In cross-spatial learning, global spatial information is extracted from the output of the 1 × 1 branch using 2D global average pooling (Avg Pool). Simultaneously, the Softmax function is employed to ensure computational efficiency. Subsequently, the output of the 3 × 3 branch is subjected to matrix dot product operations with the outputs of the two parallel 1 × 1 branches to obtain the first spatial attention map. Similarly, the same operation is applied to the 3 × 3 branch as in the 1 × 1 branch to obtain the second spatial attention map. Finally, the two spatial attention maps are combined to obtain the final output. Because EMA processes features at different scales in parallel, its parallel structure improves accuracy and speeds up model training.
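The following sketch condenses these steps into code; it follows the structure of the published EMA reference implementation, with the number of groups chosen for illustration.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    # Efficient multi-scale attention: channel grouping, two 1x1 branches with
    # 1D horizontal/vertical pooling, a 3x3 branch, and cross-spatial learning.
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        cg = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # X Avg Pool
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # Y Avg Pool
        self.agp = nn.AdaptiveAvgPool2d((1, 1))        # 2D global average pool
        self.gn = nn.GroupNorm(cg, cg)
        self.conv1x1 = nn.Conv2d(cg, cg, 1)
        self.conv3x3 = nn.Conv2d(cg, cg, 3, padding=1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        b, c, h, w = x.shape
        cg = c // self.g
        gx = x.reshape(b * self.g, cg, h, w)           # group along channels
        # 1x1 branches: directional pooling, shared 1x1 conv, sigmoid gating.
        xh = self.pool_h(gx)                           # (bg, cg, h, 1)
        xw = self.pool_w(gx).permute(0, 1, 3, 2)       # (bg, cg, w, 1)
        hw = self.conv1x1(torch.cat([xh, xw], dim=2))
        xh, xw = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(gx * xh.sigmoid() * xw.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: multi-scale local features.
        x2 = self.conv3x3(gx)
        # Cross-spatial learning: two dot-product attention maps, then fuse.
        a1 = self.softmax(self.agp(x1).reshape(b * self.g, cg, 1).permute(0, 2, 1))
        a2 = self.softmax(self.agp(x2).reshape(b * self.g, cg, 1).permute(0, 2, 1))
        m1 = torch.matmul(a1, x2.reshape(b * self.g, cg, -1))
        m2 = torch.matmul(a2, x1.reshape(b * self.g, cg, -1))
        weights = (m1 + m2).reshape(b * self.g, 1, h, w)
        return (gx * weights.sigmoid()).reshape(b, c, h, w)

# In the EMA-Faster block, an EMA layer like this would follow the second
# PWConv of the FasterNet block sketched earlier.
```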
The structure of EFC is fundamentally similar to that of FC shown in Figure 7, the only difference being that the bottleneck is replaced with the EMA-Faster block rather than the FasterNet block. The architecture of the EMA-Faster block, illustrated on the left side of Figure 8, adds an EMA layer after the second PWConv layer of the FasterNet block. Although EFC has slightly more parameters and higher computational complexity than FC, both remain lower than those of C2f. Applying EFC in the neck can improve detection accuracy while keeping the model lightweight.

4. Experiments

4.1. Experimental Setup

Our experiments were executed on a 64-bit Ubuntu 20.04 Linux system. The hardware comprised an Intel Xeon Platinum 8358P CPU and a GeForce RTX 3060 GPU. On the software side, we implemented the experimental models with the PyTorch deep learning library, using PyTorch 1.11.0 and CUDA 11.3. To maintain consistency and reproducibility across all experimental runs, we strictly followed the parameter settings in Table 1.
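For reference, the Table 1 settings map onto an ultralytics-style training call roughly as follows; the model and dataset YAML filenames are placeholders, and the custom CBiF/FC/EFC modules would need to be registered with the framework beforehand.

```python
from ultralytics import YOLO

# Hypothetical training run mirroring Table 1 (filenames are stand-ins).
model = YOLO("cfe-yolov8s.yaml")      # model definition with CBiF/FC/EFC
model.train(
    data="neu-det.yaml",              # dataset config for NEU-DET
    epochs=200,                       # total epochs
    batch=16,                         # batch size
    optimizer="SGD",
    lr0=0.02,                         # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
    cos_lr=True,                      # cosine decay strategy
    close_mosaic=10,                  # disable mosaic for the last 10 epochs
)
```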

4.2. Dataset

This work used two steel datasets, NEU-DET and GC10-DET, to fully validate the efficacy and generalizability of CFE-YOLOv8s. The specific information is shown in Table 2.
  • NEU-DET [36]: This public dataset, produced by Northeastern University, includes 1800 images covering six common surface defects in hot-rolled steel strips: rolled-in scale, pitted surface, inclusion, patches, crazing, and scratches.
  • GC10-DET [37]: This publicly available dataset includes surface flaws on steel plates gathered in industrial settings. It includes ten defects: crescent gap, rolling pit, waist folding, crease, water spot, oil spot, inclusion, welding line, punched hole, and silk spot.

4.3. Experimental Metrics

In order to better evaluate the various performance aspects of the model, we used mean average precision (mAP) to assess detection accuracy, frames per second (FPS) to measure detection speed, and the number of floating-point operations (FLOPs) to reflect the model's computational complexity. mAP@0.5 is calculated from precision, recall, and average precision (AP). The specific calculation process is as follows:
Precision = TP / (TP + FP) × 100%
Recall = TP / (TP + FN) × 100%
AP = ∫₀¹ P(r) dr
mAP = (1/m) Σ_{i=1}^{m} AP_i
where TP represents the number of correctly detected targets, FP the number of falsely detected targets, and FN the number of missed targets. AP is the area under the P-R curve, and P(r) is the precision at recall r. The variable m is the number of categories, and AP_i is the detection accuracy of category i. mAP@0.5 denotes the mean average precision at an IoU threshold of 0.5, meaning a detection is counted as correct only if its overlap with the ground truth label exceeds 0.5.
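As a worked example of these formulas, the NumPy sketch below computes AP by all-points interpolation of the P-R curve and averages per-class APs into mAP; the interpolation scheme is one common choice, and evaluation toolkits differ in this detail.

```python
import numpy as np

def average_precision(recall, precision):
    # AP = area under the P-R curve (all-points interpolation).
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum P(r) * delta_r at the points where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    # mAP = (1/m) * sum of AP_i over the m defect categories.
    return sum(ap_per_class) / len(ap_per_class)

# e.g., one class's curve at increasing recall points:
# average_precision([0.2, 0.5, 1.0], [1.0, 0.8, 0.5])
```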

4.4. Result and Analysis

4.4.1. Experimental Analysis of CBiF Module

We validated the effectiveness of CBiF in improving detection accuracy by applying it to widely used models such as YOLOv5s, YOLOv6s, YOLOv7-tiny, and YOLOv8s. These models' backbones contain multiple feature extraction modules, and we replaced the second and third feature extraction modules in the backbone with CBiF. We used NEU-DET for the experiments. In Figure 9, the mAP@0.5 comparison of each model is represented by two bars, with the blue bar showing the original model and the red bar showing the accuracy after adding CBiF. Upon integrating CBiF, the mAP@0.5 improvements for YOLOv5s, YOLOv6s, YOLOv7-tiny, and YOLOv8s were 1.9%, 2.5%, 2.3%, and 2.2%, respectively. The experimental results indicate that applying CBiF in the backbone as a feature extraction module can effectively enhance the detection accuracy of the model. Furthermore, CBiF demonstrated good applicability in one-stage detection models.

4.4.2. Ablation Study

We conducted four sets of experiments on the NEU-DET dataset to explore the contributions of CBiF, FC, and EFC to CFE-YOLOv8s.
  • Experiment one serves as the baseline, as shown in the first row of Table 3, with a mAP@0.5 of 74.7%, 11.1 M parameters, 28.4 G FLOPs, and an FPS of 89.6.
  • Experiment two introduces CBiF into the backbone of YOLOv8s, as depicted in the second row of Table 3. The mAP@0.5 increases by 2.2%, the parameter count decreases by 0.3 M, and the FLOPs decrease by 1.3 G. The improvement in mAP@0.5 suggests that CBiF, constructed by combining CNN and Transformer, can effectively enhance the feature extraction capability of the backbone without increasing the parameter count or computational complexity. However, the Transformer layer adopted in CBiF, which uses self-attention for global feature extraction and fully connected layers to process the self-attention outputs, decreases the FPS by 6.4.
  • Experiment three involves the simultaneous use of CBiF and FC in the backbone, as shown in the third row of Table 3. Despite a marginal decrease of 0.2% in mAP@0.5, the parameter count and FLOPs decrease by 2.3 M and 4.4 G, respectively, and the FPS increases by 5.9. The reduced parameter count and computational complexity, along with the higher FPS, validate that incorporating FasterNet blocks in C2f effectively lightens the model and eliminates the adverse impact of CBiF on detection speed.
  • Experiment four replaces all C2f modules in the neck with EFC, building on experiment three. As indicated in the fourth row of Table 3, the mAP@0.5 increases by 1.1%, while the FPS only decreases by 1.6. The improvement in mAP@0.5 suggests that enhancing feature fusion by adding attention mechanisms in the neck can improve detection accuracy without significantly slowing detection. Although the parameter count and computational complexity rise slightly after introducing EFC, they remain 2.5 M and 5.3 G below the baseline, respectively.

4.4.3. Comparisons with Other Methods on NEU-DET

We performed comparative experiments with various advanced methods on NEU-DET to verify the superiority of CFE-YOLOv8s. The comparison included the one-stage methods YOLOv5s, YOLOv6s, YOLOv7-tiny, YOLOv8s, and SSD, as well as the two-stage methods Faster R-CNN, Libra R-CNN, and Cascade R-CNN. The experimental outcomes are displayed in Figure 10 and Figure 11. In Figure 10, circles of different colors represent different models, where the size of each circle reflects the model's parameter count and the horizontal and vertical axes represent mAP@0.5 and FPS, respectively. Figure 11 displays the FLOPs of the models as a bar graph. In Figure 10, CFE-YOLOv8s is positioned in the top right corner, achieving the highest mAP@0.5 of 77.8%, a 3.1% improvement over YOLOv8s. Among these models, CFE-YOLOv8s ranked second in FPS at 87.5, only 2.1 lower than the highest, YOLOv8s. Combining Figure 10 and Figure 11, although YOLOv7-tiny had the lowest parameter count and FLOPs, CFE-YOLOv8s outperformed it by 9.2% in mAP@0.5 and 1.9 in FPS. CFE-YOLOv8s reduced the parameter count and FLOPs by 2.5 M and 5.3 G, respectively, compared to YOLOv8s. In conclusion, CFE-YOLOv8s exhibited outstanding detection accuracy and maintained detection speed without significant compromise while remaining lightweight.

4.4.4. Comparisons with Other Methods on GC10-DET

To further confirm the superiority of CFE-YOLOv8s in detecting surface defects on steel, we performed comparison tests on GC10-DET mirroring those on NEU-DET. Figure 12 displays the results. Comparing CFE-YOLOv8s to YOLOv8s, the mAP@0.5 increased from 66.7% to 69.5%, the highest detection accuracy among the compared models, consistent with the NEU-DET results. Among the one-stage detection methods (YOLOv5s, YOLOv6s, YOLOv7-tiny, and YOLOv8s), CFE-YOLOv8s achieved an FPS of 84.3, comparable to YOLOv7-tiny and marginally lower than the highest, YOLOv8s, at 86.5. With fewer parameters and lower computational complexity, CFE-YOLOv8s also markedly improved detection speed and accuracy over Faster R-CNN, Libra R-CNN, and Cascade R-CNN. Overall, CFE-YOLOv8s demonstrated excellent detection accuracy while maintaining low model complexity, validating its superiority in steel surface defect detection.

4.4.5. Comparison and Analysis of Detection Performance

Heatmaps provide a visual representation of the areas on which the model focuses. Figure 13 shows the heat maps of YOLOv8s and CFE-YOLOv8s. It can be observed from Figure 13 that CFE-YOLOv8s more accurately focused on defect locations in the first and third images compared to YOLOv8s. Moreover, in the first and second images, YOLOv8s exhibited more attention to normal areas than CFE-YOLOv8s. The above results indicate that CFE-YOLOv8s was better at focusing on defect regions and suppressing attention on normal areas.
To compare the detection performance of the two models, the same five images were processed by YOLOv8s and CFE-YOLOv8s. In Figure 14, the first row shows the detection results of YOLOv8s, while the second row shows those of CFE-YOLOv8s. The detection results consist of defect categories and confidence scores, where the defect categories help to identify false positives and the confidence scores reflect the reliability of the detection results. Neither YOLOv8s nor CFE-YOLOv8s produced false positives. However, in the first defective image, YOLOv8s missed a defect, whereas CFE-YOLOv8s detected all defects. In terms of confidence scores, CFE-YOLOv8s consistently outperformed YOLOv8s on the "silk spot (6_siban)" defect. Overall, CFE-YOLOv8s demonstrated a more comprehensive and reliable detection performance than YOLOv8s.

5. Conclusions

This study presents an improved defect detection model, CFE-YOLOv8s, to enhance steel surface defect detection. The model boosts detection accuracy and emphasizes lightweight design, rendering it more suitable for industrial applications. To tackle the diverse sizes and ambiguous boundaries of steel surface defects, we integrated the strengths of CNN and Transformer to formulate CBiF, serving as the primary feature extraction module of the backbone and augmenting the feature extraction capability. By substituting the bottleneck in C2f, which comprised multiple convolutional layers, with the FasterNet block, we devised FC, effectively reducing parameter and computational complexity to achieve swifter detection speed and easier deployment. By embedding EMA into FC, we designed EFC, further enhancing detection accuracy without compromising on lightweight design. To validate the speed and accuracy of CFE-YOLOv8s, we conducted comparative experiments with other state-of-the-art models on the NEU-DET and GC10-DET datasets. The experiments demonstrate that CFE-YOLOv8s is competitive among advanced models. While the model proposed in this paper effectively addresses the issue of surface defect detection on steel, it still has limitations when applied to other defect detection tasks. In future work, we will continue to explore models for surface defect detection that are more robust and exhibit better generalization capabilities.

Author Contributions

Conceptualization, S.Y.; methodology, Y.X.; validation, W.H.; formal analysis, J.W. (Jianqing Wu) and F.X.; investigation, Q.W.; data curation, X.Y. and B.W.; writing—original draft preparation, Y.X.; writing—review and editing, J.W. (Jianqing Wu) and W.H.; visualization, H.Y. and J.W. (Jingyong Wang); supervision, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported in part by the Jiangxi Provincial Natural Science Foundation under Grant 20224BAB212013 and in part by the Science and Technology Research Project of Jiangxi Provincial Department of Education under Grants GJJ2200830 and GJJ2200868.

Data Availability Statement

The experiments in this article used publicly available datasets NEU-DET and GC10-DET.

Conflicts of Interest

Authors Hongsheng Yan and Jingyong Wang were employed by the company Longnan Dingtai Electronic Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Park, J.-K.; Kwon, B.-K.; Park, J.-H.; Kang, D.-J. Machine learning-based imaging system for surface defect inspection. Int. J. Precis. Eng. Manuf.-Green Technol. 2016, 3, 303–310. [Google Scholar] [CrossRef]
  2. Zhang, D.; Hao, X.; Wang, D.; Qin, C.; Zhao, B.; Liang, L.; Liu, W. An efficient lightweight convolutional neural network for industrial surface defect detection. Artif. Intell. Rev. 2023, 56, 10651–10677. [Google Scholar] [CrossRef]
  3. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  4. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 25 July 2005; pp. 886–893. [Google Scholar]
  5. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
  6. Smadja, D.; Touboul, D.; Cohen, A.; Doveh, E.; Santhiago, M.R.; Mello, G.R.; Krueger, R.R.; Colin, J. Detection of subclinical keratoconus using an automated decision tree classification. Am. J. Ophthalmol. 2013, 156, 237–246. [Google Scholar] [CrossRef]
  7. Luo, Q.; Sun, Y.; Li, P.; Simpson, O.; Tian, L.; He, Y. Generalized completed local binary patterns for time-efficient steel surface defect classification. IEEE Trans. Instrum. Meas. 2018, 68, 667–679. [Google Scholar] [CrossRef]
  8. Zhang, J.W.; Wang, H.Y.; Tian, Y.; Liu, K. An accurate fuzzy measure-based detection method for various types of defects on strip steel surfaces. Comput. Ind. 2020, 122, 12. [Google Scholar] [CrossRef]
  9. Wang, J.; Li, Q.; Gan, J.; Yu, H.; Yang, X. Surface defect detection via entity sparsity pursuit with intrinsic priors. IEEE Trans. Ind. Inform. 2019, 16, 141–150. [Google Scholar] [CrossRef]
  10. Zhao, C.; Shu, X.; Yan, X.; Zuo, X.; Zhu, F. RDD-YOLO: A modified YOLO for detection of steel surface defects. Measurement 2023, 214, 112776. [Google Scholar] [CrossRef]
  11. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  12. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  15. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  16. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  18. Wang, S.; Xia, X.; Ye, L.; Yang, B. Automatic detection and classification of steel surface defect using deep convolutional neural networks. Metals 2021, 11, 388. [Google Scholar] [CrossRef]
  19. Shi, X.; Zhou, S.; Tai, Y.; Wang, J.; Wu, S.; Liu, J.; Xu, K.; Peng, T.; Zhang, Z. An improved faster R-CNN for steel surface defect detection. In Proceedings of the 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), Shanghai, China, 26–28 September 2022; pp. 1–5. [Google Scholar]
  20. Zhao, W.; Chen, F.; Huang, H.; Li, D.; Cheng, W. A new steel defect detection algorithm based on deep learning. Comput. Intell. Neurosci. 2021, 2021, 5592878. [Google Scholar] [CrossRef]
  21. Yang, J.; Wang, W.; Lin, G.; Li, Q.; Sun, Y.; Sun, Y. Infrared thermal imaging-based crack detection using deep learning. IEEE Access 2019, 7, 182060–182077. [Google Scholar] [CrossRef]
  22. Wang, L.; Liu, X.; Ma, J.; Su, W.; Li, H. Real-time steel surface defect detection with improved multi-scale YOLO-v5. Processes 2023, 11, 1357. [Google Scholar] [CrossRef]
  23. Guo, Z.; Wang, C.; Yang, G.; Huang, Z.; Li, G. Msft-yolo: Improved yolov5 based on transformer for detecting defects of steel surface. Sensors 2022, 22, 3467. [Google Scholar] [CrossRef]
  24. Zhou, C.; Lu, Z.; Lv, Z.; Meng, M.; Tan, Y.; Xia, K.; Liu, K.; Zuo, H. Metal surface defect detection based on improved YOLOv5. Sci. Rep. 2023, 13, 20803. [Google Scholar] [CrossRef]
  25. Wang, Y.; Wang, H.; Xin, Z. Efficient detection model of steel strip surface defects based on YOLO-V7. IEEE Access 2022, 10, 133936–133944. [Google Scholar] [CrossRef]
  26. Li, M.; Wang, H.; Wan, Z. Surface defect detection of steel strips based on improved YOLOv4. Comput. Electr. Eng. 2022, 102, 108208. [Google Scholar] [CrossRef]
  27. Wang, W.; Mi, C.; Wu, Z.; Lu, K.; Long, H.; Pan, B.; Li, D.; Zhang, J.; Chen, P.; Wang, B. A Real-Time Steel Surface Defect Detection Approach with High Accuracy. IEEE Trans. Instrum. Meas. 2022, 71, 5005610. [Google Scholar] [CrossRef]
  28. Yeung, C.-C.; Lam, K.-M. Efficient fused-attention model for steel surface defect detection. IEEE Trans. Instrum. Meas. 2022, 71, 2510011. [Google Scholar] [CrossRef]
  29. Liu, R.; Huang, M.; Gao, Z.; Cao, Z.; Cao, P. MSC-DNet: An efficient detector with multi-scale context for defect detection on strip steel surface. Measurement 2023, 209, 112467. [Google Scholar] [CrossRef]
  30. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 649–667. [Google Scholar]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  32. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
  33. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  34. Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  35. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  36. He, Y.; Song, K.; Meng, Q.; Yan, Y. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans. Instrum. Meas. 2019, 69, 1493–1504. [Google Scholar] [CrossRef]
  37. Lv, X.; Duan, F.; Jiang, J.-J.; Fu, X.; Gan, L. Deep metallic surface defect detection: The new benchmark and detection network. Sensors 2020, 20, 1562. [Google Scholar] [CrossRef]
Figure 1. Six types of steel defects: (a) Inclusion. (b) Crazing. (c) Scratches. (d) Rolled-in scale. (e) Pitted surface. (f) Patches.
Figure 2. The structure of the C2f module.
Figure 3. The overall framework of CFE-YOLOv8s.
Figure 4. The structure of the CBiF module.
Figure 5. The concatenation of the CBiF module.
Figure 6. The structure of the FasterNet block.
Figure 7. The structure of the FC module.
Figure 8. The structure of the EMA-Faster block in EFC.
Figure 9. The experimental results of the CBiF module.
Figure 10. Defect detection results on NEU-DET.
Figure 11. FLOPs of the different models.
Figure 12. Defect detection results on GC10-DET.
Figure 13. Heatmaps of YOLOv8s and CFE-YOLOv8s.
Figure 14. Detection results of YOLOv8s and CFE-YOLOv8s: (a) detection results of YOLOv8s; (b) detection results of CFE-YOLOv8s.
Table 1. Experimental parameters.

Parameter         Value
Learning rate     0.02
Decay strategy    Cosine
Optimizer         SGD
Momentum          0.937
Weight decay      0.0005
Total epochs      200
Close mosaic      10
Batch size        16
Table 2. NEU-DET and GC10-DET datasets.

Dataset     Total Images    Defect Types    Image Size     Train Images    Validation Images
NEU-DET     1800            6               200 × 200      1440            360
GC10-DET    2280            10              2048 × 1000    2052            228
Table 3. The ablation experiments on NEU-DET.

Experiment          Method                mAP@0.5 (%)    Parameters/M    FLOPs/G        FPS
Experiment one      Baseline (YOLOv8s)    74.7           11.1            28.4           89.6
Experiment two      +CBiF                 76.9 (+2.2)    10.8 (−0.3)     27.1 (−1.3)    83.2 (−6.4)
Experiment three    +CBiF, FC             76.7 (−0.2)    8.5 (−2.3)      22.7 (−4.4)    89.1 (+5.9)
Experiment four     +CBiF, FC, EFC        77.8 (+1.1)    8.6 (+0.1)      23.1 (+0.4)    87.5 (−1.6)
