1. Introduction
Metal surface defect detection has long been one of the most important links in industrial production [1]. With the development of modern industry, metal plays an indispensable role in construction [2], machinery manufacturing, transportation and other industries, and its quality is directly related to the performance and safety of products. However, metal surfaces are inevitably prone to defects such as scratches, cracks, and pits during large-scale automated production [3]. These defects not only impair the appearance of products, but also significantly compromise the intrinsic properties of the metal, reducing its fatigue strength, wear resistance and corrosion resistance [4]. In addition, they can cause serious problems such as metal strip breakage and accumulation, which increase material loss and reduce production efficiency [5]. The industry therefore attaches great importance to research on metal surface defect detection technology, which is regarded as the key to improving product quality [6]. Detecting defects in metal materials with precise and efficient algorithms before the materials are put into service is thus of great significance.
Machine vision methods can be divided into those based on traditional machine learning and those based on deep learning. In recent years, machine vision has quickly gained the favor of researchers at home and abroad owing to its safety and reliability, wide application range and high detection efficiency [7]. Optical devices collect images, image processing algorithms extract image information, and defects are identified from the resulting features. Machine learning techniques rely mainly on a large amount of prior knowledge and the design experience of experts [8]. Hand-crafted rules are designed to extract features for specific defect types; such methods suit simple scenarios but struggle to capture high-level semantic features. In contrast, deep learning algorithms integrate pattern recognition into model building, so that the model is verified and optimized while it learns features. This mitigates the incompleteness of hand-designed features and yields strong generalization and robustness in complex environments [9]. Owing to these advantages, deep learning methods have increasingly replaced traditional machine learning techniques in industrial fields such as metal surface defect detection [10].
At present, mainstream object detection algorithms include two-stage detectors represented by Faster R-CNN [11], single-stage detectors such as SSD [12] and the YOLO series [13,14,15,16,17,18,19,20,21,22,23], and the Transformer-based RT-DETR [24]. A two-stage algorithm first generates candidate boxes, then extracts features from the content within these boxes, and finally performs object regression on the region content [25]. This kind of algorithm generally achieves high detection accuracy, but it sacrifices some detection speed because the candidate boxes must be screened first. A single-stage algorithm, by contrast, is a regression-based algorithm that combines the localization and classification tasks [26], so it is faster and has a clear advantage in real-time detection. Because single-stage detectors are more efficient than two-stage detectors in industrial defect detection tasks, many researchers at home and abroad have studied them extensively. For example, Liu et al. [
27] combined attention mechanism with multi-feature fusion network, and proposed RAF-SSD model for steel surface defect detection task. The detection mAP is 1.6% and 1.8% higher than that of SSD and YOLOv3, respectively. Aiming at the complexity and diversity of steel surface defects, Li et al. [
28] proposed a deep learning model based on Multiscale Feature Extraction (MSFE) module and Efficient Feature Fusion (EFF) module. Compared with the baseline model YOLOv5s, the mAP is improved by 2.2% on the premise of only increasing the number of parameters by 2.5 M. To enhance the attention to defect areas, Liang et al. [
29] combined the C3 module with the Efficient Channel Attention (ECA) mechanism, and they introduced the CSPNet with SPP Block (SPPCSPC) module of YOLOv7 to extract multi-scale features of aluminum surface defects. The detection accuracy of the proposed YOLOv5-ESP algorithm is better than that of classical algorithms such as YOLOv5. Gao et al. [
30] proposed the SRN-YOLO model to detect tiny and blurred defects on steel surfaces. A segmentation residual convolutional network was used to capture gradient feature information, and a feature fusion pyramid network was constructed to minimize feature loss. The detection accuracy is greatly improved over the baseline model YOLOv7, but the increased number of model parameters makes the model unsuitable for industrial deployment. Yuan et al. [
31] improved YOLOv8n, combined adaptive receptive field and Channel Attention (CA) mechanism to reduce the extraction of redundant features, and used Ghost convolution to reduce model parameters. Compared with the baseline model, the improved model GDCP-YOLO increased the mAP by 1.4% while reducing the number of parameters by 0.2 M. Huang et al. [
32] proposed the WFRE-YOLOv8s model because traditional methods struggle to meet the steel industry’s requirements for detection accuracy and efficiency. The original C2f module in the backbone network was replaced with a newly designed CFN structure to reduce model parameters, and the Efficient Multi-scale Attention (EMA) mechanism was added to enhance the extraction of valuable features. The mAP is 4.7% higher than that of YOLOv8s.
Defects in industrial production often have two characteristics: defects of the same type differ greatly in shape, and defects of different types share similar feature information [33]. Such defects are referred to as fine-grained defects. Most of the above models are designed to alleviate the low detection accuracy that results from this, and they typically do so by adding attention mechanisms to various parts of the model. In contrast, our design focuses on rectifying inherent deficiencies in the backbone network of the baseline model, thereby strengthening its capability to extract, integrate, and represent fine-grained features for improved detection accuracy and versatility. To this end, this study proposes an improved YOLOv8 algorithm that aims to improve the accuracy of metal surface defect detection while ensuring real-time performance. The main contributions of our work are as follows:
We propose a novel spatial pyramid pooling module, ASPPFCSPC, which enables the model to process local features, surrounding context and global context at the same time and to fuse this information effectively. This enhances the model’s ability to represent fine-grained features against complex backgrounds on metal surfaces, thereby improving detection accuracy and versatility. This module improves the mAP by about 1.5% on the NEU-DET dataset.
We improve the C2f module by combining it with Deformable ConvNets v2 (DCNv2) to enhance adaptability to target shapes and local structures. The expanded receptive field strengthens the ability of the underlying network to extract multi-scale defect features on metal surfaces, thereby improving the detection accuracy and robustness of the model. Building on the previous step, the mAP grows by a further 0.9% approximately.
We add an SPD layer after each Conv in the backbone network, so that the feature representation can be fine-tuned by learnable parameters. This avoids the loss of fine-grained information about small targets on metal surfaces during down-sampling and reduces the dependence on nonlinear activation, thereby improving the detection accuracy and speed of the model. The mAP gains another 1.1%, for a total increase of 3.5%, and the FPS reaches 140.4.
We conduct extensive experiments on the steel strip surface defect dataset NEU-DET, and we also verify the proposed metal surface defect detection algorithm, YOLO-ADS, on the industrial steel surface defect dataset GC10-DET, the industrial aluminum surface defect dataset APSPC and the larger public benchmark dataset VOC2012. The results indicate that YOLO-ADS is competitive with advanced detectors in accuracy and has good versatility and robustness. The mAP on the different metal surface defect datasets is improved by 3.7%, 3.4% and 4.3%, respectively.
The remainder of this paper is organized as follows. Section 2 introduces the related work of this study. Section 3 details YOLOv8 and its improved algorithm, introducing the newly proposed spatial pyramid pooling module ASPPFCSPC, the improved feature extraction module C2f_SimDCNv2, and the newly introduced SPD layer. Section 4 systematically analyzes the primary dataset to clarify the categories of target defects and the directions for algorithmic improvement, and it conducts a comprehensive benchmark evaluation of YOLO-ADS. Section 5 discusses the study’s findings, limitations, and future work. Finally, Section 6 concludes this study.
2. Related Work
Current industrial methods for detecting defects on metal surfaces fall into two main groups: traditional machine learning techniques and State-Of-The-Art (SOTA) deep learning algorithms.
2.1. Traditional Machine Learning Techniques
Traditional machine learning techniques rely mainly on a large amount of prior knowledge and the design experience of experts, and many organizations still use them successfully in metal surface defect detection tasks.
Xu et al. [
34] applied a two-dimensional Haar wavelet transform to all blocks in the model to identify surface defects in steel strips. The proposed method outperformed the other three comparable models by 1%. Zhao et al. [
35] utilized the vector-valued regularized kernel function approximation and the support vector machine to localize defects, effectively extracting wrinkle features from images of steel strips. Zhang et al. [
36] posited that the background of steel strips follows a Gaussian distribution and defined a membership function to evaluate the gray level of each defect. The detection rate of this fuzzy sets-based method was up to 96.8%. Ji et al. [
37] integrated the genetic algorithm with the generalized linear regression, enabling the proposed data-driven algorithm to balance prediction accuracy and the interpretability of the resulting model for hot-rolled steel strips.
However, traditional machine learning techniques impose higher demands on external conditions and require timely, targeted retuning or redesign when the characteristics of defects change significantly. Consequently, they may only be suitable for certain specific stable scenarios.
2.2. Deep Learning Algorithms
It is well known that single-stage object detection algorithms are lighter and faster than two-stage algorithms. Many studies have attempted to apply the current SOTA algorithm YOLOv8 to the industrial field. YOLOv8 has three main flaws in detecting metal surface defects, and many researchers have proposed improved methods and made valuable contributions.
Firstly, the Spatial Pyramid Pooling-Fast (SPPF) module in YOLOv8 has a limited receptive field due to the blind use of Max pooling operations, which leads to poor ability to perceive complex backgrounds and fuse multi-scale features. To overcome this limitation, Zhang et al. [
38] proposed a variant of YOLOv8n for defect detection on steel strip surfaces, in which the Large Separable Kernel Attention (LSKA) design is integrated with the SPPF module to pool feature maps at different scales. This step improved the detection accuracy by 2.3%. Kong et al. [
39] fused the Convolutional Block Attention Module (CBAM) mechanism into the SPPF module of YOLOv8x to avoid the loss of key information caused by a single pooling layer. This step yielded a 3.0% improvement on the steel defect dataset and a 2.6% improvement on the aluminum surface industrial defect detection dataset.
Secondly, the original C2f module in YOLOv8 performs poorly in extracting multi-scale defect features. To address this problem, Zhang et al. [
40] added the Dilation-wise Residual (DWR) block to the C2f module of YOLOv8n, which splits the extraction of multi-scale context information into two steps to reduce the difficulty of multi-scale defect extraction on steel surfaces. This step improved the detection accuracy by 2.5%. Xie et al. [
41] proposed a multi-scale feature extraction block based on group convolution to improve the detection performance of multi-scale defects by replacing the bottleneck of the C2f module. Compared to the baseline algorithm YOLOv8l, this step improved the mAP by 1.2% and 2.8% on two steel strip datasets, respectively.
Additionally, it is very easy for YOLOv8 to lose fine-grained features during down-sampling. To address this issue, Yuan et al. [
42] replaced the original neck network of YOLOv8s with a simplified version of Bidirectional Feature Pyramid Network (BiFPN) to improve the sensitivity to fuse minor features and enhance the ability to recognize fine-grained target defects on steel surfaces. This step improved the detection accuracy by 1.1%. Wang et al. [
43] proposed a YOLOv8n-DSDM by adding a small target detection layer in the current neck network, which improved the mAP by 2.3% on an industrial aluminum sheet surface defect dataset through this step.
In response to the above problems in YOLOv8, most related work evidently does not solve them at their root. Instead, it typically adds attention mechanisms or enhances the fusion ability of the neck network. Although object detection networks have become increasingly robust in terms of architectural design and training strategies, there seems to be no specific evolution in the detection strategies for objects with extreme scales. Recent studies still rely on superior neck network designs, but these cannot fundamentally address the inadequate feature extraction capability of the underlying network.
Inspired by the inherent deficiencies of YOLOv8, this study proposes a novel metal surface defect detection algorithm, YOLO-ADS, which is based on a deep and large backbone network. The network is designed to capture and integrate dense information across various spatial scales and levels of potential semantics. This is all attributed to three targeted strategies for the backbone network in this study: introducing the average pooling operation to the spatial pyramid pooling module, which can instantly expand the receptive field for fusion; combining the original C2f module with the deformable network, which can markedly enhance adaptability to varying scales; and adding an SPD layer before the convolution operation, which can directly shorten the stride, thereby reducing the loss of fine-grained information during down-sampling.
Compared to YOLOv8n, YOLO-ADS significantly improves detection accuracy and identifies a wider range of metal surface defects, offering a new paradigm of the heavy-backbone design in the field of defect detection.
3. Methodology
3.1. YOLOv8 Algorithm
YOLOv8 is an upgraded version of the previous generation of the YOLO series, consisting primarily of three components: Backbone, Neck and Head. The network architecture of YOLOv8 is illustrated in Figure 1.
The Backbone part is mainly used for feature extraction. Building upon the C3 module from YOLOv5, YOLOv8 incorporates the idea of the Efficient Layer Aggregation Network (ELAN) module in YOLOv7 to form the C2f module. The new module has more residual connections, and the number of channels is changed according to the scaling coefficient by splitting and concatenation operations, which keeps the module lightweight while providing richer gradient flow information. At the tail of the Backbone, the serial SPPF module from YOLOv5 is used to gain speed while keeping the same receptive field.
The Neck part is mainly used for feature fusion. Inspired by the Path Aggregation Network (PAN) [44], it adopts a dual feature pyramid structure that combines a top-down Feature Pyramid Network (FPN) [45] with a bottom-up PAN, helping the shallow information output from three different levels of the Backbone to aggregate better into deep features.
The Head part is mainly used for feature classification and localization. The popular decoupled head structure separates the classification and detection heads to alleviate the conflicts between the two tasks. The Anchor-Free method, which learns fully autonomously, avoids the negative impact on detection accuracy of excessive dependence on prior boxes, which offers significant advantages for objects with irregular length and width. Distribution Focal Loss [46] combined with Complete Intersection over Union (CIoU) Loss is used as the loss function of the regression branch, and Varifocal Loss [47] is used as the loss function of the classification branch. At the same time, the dynamic Task-Aligned Assigner sample allocation strategy is adopted to achieve a high degree of alignment between the classification and regression tasks.
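As a rough illustration of the CIoU term used in the regression branch, the following minimal Python sketch computes CIoU for two axis-aligned boxes. It follows the commonly published CIoU formulation (IoU minus a normalized centre-distance penalty and an aspect-ratio penalty) rather than the YOLOv8 source code. The function names `ciou` and `ciou_loss`, the `(x1, y1, x2, y2)` box format and the `eps` constant are assumptions made for this example, and the Distribution Focal Loss term is not included.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU of two boxes given as (x1, y1, x2, y2). Illustrative only."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU: intersection area over union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    union = wa * ha + wb * hb - inter + eps
    iou = inter / union

    # Squared distance between the two box centres.
    rho2 = ((ax1 + ax2) - (bx1 + bx2)) ** 2 / 4 + ((ay1 + ay2) - (by1 + by2)) ** 2 / 4

    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v


def ciou_loss(pred_box, target_box):
    """Regression loss: 0 for a perfect match, larger for worse predictions."""
    return 1.0 - ciou(pred_box, target_box)


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes
    print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))   # disjoint boxes
```

Unlike plain IoU, the centre-distance term still provides a useful signal when the boxes do not overlap, which is one reason this family of losses is popular for box regression.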
However, general object detection algorithms are usually applied to regular natural images, whereas fine-grained defects on metal surfaces in industrial production environments are often affected by complex backgrounds and variable scales. Therefore, basic object detection technology alone is not well suited to defect detection, and the model structure needs to be improved and optimized.
3.2. Improved YOLOv8 Algorithm
YOLOv8 is available in five versions, n, s, m, l and x, with the number of model parameters increasing in that order. Because the metal surface defect detection task in industrial scenarios must balance detection accuracy and real-time performance, the smallest version, YOLOv8n, is selected for improvement in this study.
To handle fine-grained defects on metal surfaces that are susceptible to background interference, we improve the SPPCSPC module into ASPPFCSPC, use it to replace the SPPF module, and add SPD layers. In the ASPPFCSPC module, we introduce an adaptive Average pooling layer in parallel with the Max pooling layers to enhance the fusion and representation of local features and global background information. At the same time, we borrow the idea of serial Max pooling from the SPPF module and use a simpler activation function, the Rectified Linear Unit (ReLU), to improve the inference speed of the model.
In view of the variable scale of metal surface defects, we simply combine the first Conv in the bottleneck with the deformable convolutional network, and then replace the last three C2f modules in the Backbone part with the improved module C2f_SimDCNv2 to improve the multi-scale feature extraction ability of the model.
Addressing these shortcomings of YOLOv8n significantly improves the detection accuracy and versatility of the model. The network architecture of the improved algorithm YOLO-ADS is illustrated in Figure 2.
3.2.1. A Novel Spatial Pyramid Pooling Module ASPPFCSPC
The Spatial Pyramid Pooling (SPP) [48] module was proposed by K. He in 2015. It mainly solves two key problems: it avoids the image distortion caused by cropping and scaling image regions, and it avoids the repeated extraction of related features by convolutional neural networks.
The SPPCSPC module in the original YOLOv7 algorithm is one of the latest variants of the SPP structure. It not only inherits the two advantages of SPP, but also increases the receptive field, enabling adaptation to images with varying resolutions. As shown in the SPPCSPC module structure in Figure 3, the features extracted by the Backbone undergo Max pooling operations of sizes 5 × 5, 9 × 9 and 13 × 13 in parallel to obtain receptive fields of different sizes. This enables the algorithm to discriminate effectively between large and small targets. However, it also limits the attention paid to global background information and considerably increases the amount of calculation, so that neither detection accuracy nor speed reaches its optimum.
To solve these problems, this study redesigns the SPPCSPC module and proposes a novel spatial pyramid pooling module ASPPFCSPC. The structure is shown in
Figure 4.
SPPF is derived from SPP and is much faster. Therefore, we first borrow the idea of SPPF: the output of the upper branch is obtained by applying three 5 × 5 Max pooling operations to the features in sequence, and the output of each Max pooling layer is concatenated with the others. This effectively reduces the computational cost of pooling at multiple scales, so the detection speed increases while the receptive field remains the same as before the improvement. Secondly, to fully reflect the semantic relationship between global and local information, we introduce an adaptive Average pooling layer in the lower branch, in parallel with the Max pooling layers. The Max pooling operation extracts feature texture, while the Average pooling operation preserves background information. The resulting scheme makes it easy to obtain local features, surrounding context, and global background information at the image level, making the extraction, fusion and representation of fine-grained features smoother and more accurate. Finally, for the activation function of RConv in ASPPFCSPC, we abandon more complex mathematical operations and use ReLU to optimize computational efficiency and accelerate detection. ReLU has its own gating mechanism: it zeroes negative inputs and passes positive values through linearly, which can accelerate the convergence of Stochastic Gradient Descent (SGD), so that the network is trained more efficiently. Therefore, the redesigned ASPPFCSPC module can make effective use of the hardware’s computing power, significantly reduce latency and enhance the model’s ability to extract, fuse and represent fine-grained features, thereby greatly improving the detection accuracy and versatility of the model.
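The claim that three serial 5 × 5 Max pooling operations keep the same receptive fields as parallel 5 × 5, 9 × 9 and 13 × 13 pooling can be checked with a small pure-Python sketch. It is not the ASPPFCSPC implementation: it operates on a single-channel map stored as nested lists, assumes stride 1 with windows clipped at the border (equivalent to padding with negative infinity), and the helper names `max_pool2d`, `serial_pools` and `global_avg` are hypothetical. The global average stands in for an adaptive Average pooling layer with a 1 × 1 output, which is an assumption about the output size.

```python
import random


def max_pool2d(x, k):
    """Stride-1 max pooling that keeps the map size; windows are clipped at the border."""
    r = k // 2
    h, w = len(x), len(x[0])
    return [[max(x[a][b]
                 for a in range(max(0, i - r), min(h, i + r + 1))
                 for b in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]


def serial_pools(x):
    """SPPF-style branch: each 5 x 5 pool consumes the previous pool's output."""
    p1 = max_pool2d(x, 5)
    p2 = max_pool2d(p1, 5)
    p3 = max_pool2d(p2, 5)
    return p1, p2, p3


def global_avg(x):
    """Global average of the map (assumed 1 x 1 adaptive average pooling)."""
    values = [v for row in x for v in row]
    return sum(values) / len(values)


random.seed(0)
feat = [[random.random() for _ in range(16)] for _ in range(16)]

p1, p2, p3 = serial_pools(feat)
serial_matches_parallel = (
    p1 == max_pool2d(feat, 5)
    and p2 == max_pool2d(feat, 9)
    and p3 == max_pool2d(feat, 13)
)
background = global_avg(feat)

if __name__ == "__main__":
    print("serial == parallel:", serial_matches_parallel)
    print("global average:", background)
```

The serial form visits 3 × 25 values per output position instead of 25 + 81 + 169 for the parallel form, which is where the reduction in pooling cost comes from.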
3.2.2. Newly Improved Module C2f_SimDCNv2
In the original YOLOv8 model, the network uses conventional convolution, which is calculated as follows:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) \tag{1}$$
Here, $R$ represents the size of the local feature-map region covered by the convolution kernel, $w(p_n)$ represents the weight corresponding to the $n$-th sampling point in the sampling area, and $x(p_0 + p_n)$ represents the pixel value of the $n$-th sampling point. The size of the convolution kernel in a conventional convolution layer is fixed, whereas the size of the object to be detected varies. A fixed-size convolution kernel can hardly match the shape and size of the object accurately, which impairs the extraction of the object’s boundary information and leads to poor robustness and poor generality.
Deformable convolution [49] differs from conventional convolution in that its sampling positions are not fixed and can be adjusted adaptively. It adds a trainable offset parameter $\Delta p_n$ to conventional convolution, and its output is formulated as follows:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \tag{2}$$
Since $\Delta p_n$ is generally fractional, we can use bilinear interpolation to compute $x(p)$:
$$x(p) = \sum_{q} G(q, p) \cdot x(q) \tag{3}$$
where $p = p_0 + p_n + \Delta p_n$ represents any position in the region, $x(q)$ is the pixel value corresponding to the integer sampling point $q$, and $G(\cdot, \cdot)$ represents the two-dimensional bilinear interpolation kernel.
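The bilinear interpolation step can be sketched in a few lines of pure Python. The sketch assumes the separable kernel of the original deformable convolution formulation, $G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y)$ with $g(a, b) = \max(0, 1 - |a - b|)$, so only the four integer neighbours of a fractional position contribute. The function name `bilinear_sample`, the nested-list feature map and the choice to treat positions outside the map as zero are assumptions made for this illustration.

```python
import math


def bilinear_sample(x, py, px):
    """Value of the single-channel map x at the fractional position (py, px).

    Sums G(q, p) * x(q) over the four integer neighbours q of p;
    neighbours that fall outside the map contribute zero.
    """
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                g = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                value += g * x[qy][qx]
    return value


grid = [[0.0, 1.0],
        [2.0, 3.0]]

center = bilinear_sample(grid, 0.5, 0.5)   # equal weights on all four pixels
on_grid = bilinear_sample(grid, 1.0, 0.0)  # integer position returns the pixel itself

if __name__ == "__main__":
    print(center, on_grid)
```

Because the kernel is piecewise linear in the sampling position, the sampled value is differentiable with respect to the offsets almost everywhere, which is what allows the offsets to be trained by gradient descent.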
Deformable convolution can strengthen the feature extraction ability of the underlying network for the target due to its unique adaptive ability, but the introduction of offset also leads to the possibility of covering irrelevant areas in the process of scale adjustment, thus interfering with the overall performance of the model. To solve this problem, an upgraded version of DCNv2 [
50] is used in this study. The DCNv2 network architecture is shown in
Figure 5. DCNv2 introduces a modulation mechanism that learns the weights of sampling points through a modulation parameter $\Delta m_n$. For regions of no interest, the weight coefficient $\Delta m_n$ is assigned a small value, so the above interference can be avoided. The output formula of DCNv2 is as follows:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \cdot \Delta m_n \tag{4}$$
In this study, the concept of DCNv2 is integrated into the C2f module of YOLOv8 by simply replacing the first Conv in the bottleneck with DCNv2, thus forming C2f_SimDCNv2. The improved Bottleneck_DCNv2 lets C2f inherit the advantages of deformable convolutional networks. The primary responsibility of the backbone network is to extract shallow features and global information from the original image, which are essential for comprehending the overall context of the image. Hence, we replace the last three C2f modules in the original YOLOv8 backbone network with C2f_SimDCNv2 modules. The verification experiments on this strategy show that, compared with replacing all Convs or only the last Conv, replacing only the first Conv of the bottleneck in the C2f module with DCNv2 gives the best model performance. The introduction of DCNv2 significantly expands the receptive field of the C2f module, enabling the model to sample richer and more effective gradient flow information. Consequently, the model can more accurately capture boundary and detail information of target defects against complex backgrounds. With its superior adaptive ability, C2f_SimDCNv2 further strengthens the model’s detection accuracy and robustness, which makes it especially suitable for the metal surface defect detection task with variable target shapes in this study. The structure of the improved module C2f_SimDCNv2 is shown in Figure 6.
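To make the difference between conventional, deformable and modulated deformable convolution concrete, the sketch below evaluates one output location of a 3 × 3 kernel on a single-channel map in pure Python. It is a simplified illustration under stated assumptions, not the C2f_SimDCNv2 module: in DCNv2 the offsets and modulation scalars are predicted by an extra convolution (with a sigmoid on the modulation), whereas here they are passed in by hand, and the names `conv_at`, `offsets` and `masks` are hypothetical.

```python
import math


def bilinear_sample(x, py, px):
    """Bilinear interpolation of map x at fractional (py, px); outside the map counts as zero."""
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                value += (max(0.0, 1 - abs(qy - py))
                          * max(0.0, 1 - abs(qx - px)) * x[qy][qx])
    return value


# Regular 3 x 3 sampling grid R, as (row, column) displacements p_n.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def conv_at(x, weights, p0, offsets=None, masks=None):
    """Output y(p0) of one 3 x 3 kernel.

    offsets[n] is the offset (dy, dx) of the n-th sampling point
    (None gives conventional convolution); masks[n] is its modulation
    scalar (None gives unmodulated deformable convolution).
    """
    total = 0.0
    for n, (dy, dx) in enumerate(R):
        oy, ox = offsets[n] if offsets else (0.0, 0.0)
        m = masks[n] if masks else 1.0
        total += weights[n] * bilinear_sample(x, p0[0] + dy + oy, p0[1] + dx + ox) * m
    return total


# A 5 x 5 ramp image and an averaging kernel make the results easy to check by hand.
image = [[float(5 * i + j) for j in range(5)] for i in range(5)]
kernel = [1.0 / 9.0] * 9

regular = conv_at(image, kernel, (2, 2))
shifted = conv_at(image, kernel, (2, 2), offsets=[(0.5, 0.0)] * 9)
damped = conv_at(image, kernel, (2, 2), masks=[0.5] * 9)

if __name__ == "__main__":
    print(regular, shifted, damped)
```

Setting a modulation scalar close to zero removes the corresponding sampling point from the sum, which is how irrelevant regions reached by a large offset can be suppressed.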
3.2.3. The Small Target Killer: SPD Layer
To further improve the detection performance of the model for fine-grained defects on metal surfaces, this study also introduces the SPD layer of SPD-Conv [
51] into the underlying backbone network. The SPD layer adopts the frame-cycle super-resolution conversion technique [
52] to down-sample the feature map on the premise of preserving the feature information as much as possible, so it is very suitable for small targets and low-resolution images.
Figure 7 gives an example when $scale = 2$.
Given a feature map $X$ of size $S \times S \times C_1$, the SPD layer divides it into the sub-feature maps in Equation (5) by sampling every $scale$-th pixel in every row and column:
$$f_{x,y} = X[x:S:scale,\ y:S:scale], \quad x, y \in \{0, 1, \ldots, scale - 1\} \tag{5}$$
In Equation (5), $X$ is the input feature map, $f_{x,y}$ is the sub-feature map, and $scale$ is the scale factor of the sub-feature map. After these sub-feature maps are obtained, they are concatenated along the channel dimension to reduce the spatial dimension and increase the channel dimension; that is, a feature map with input size $S \times S \times C_1$ has output size $\frac{S}{scale} \times \frac{S}{scale} \times scale^2 C_1$. As can be seen from the figure, the SPD layer is equivalent to down-sampling by a factor of 2 to obtain four sub-feature maps $f_{0,0}$, $f_{1,0}$, $f_{0,1}$, $f_{1,1}$ of size $\frac{S}{2} \times \frac{S}{2} \times C_1$, which are then concatenated to obtain $X'$ of size $\frac{S}{2} \times \frac{S}{2} \times 4C_1$. This design integrates the feature information along the width $W$ and height $H$ into the channel dimension and expands the number of channels by four times; because this down-sampling loses no information, the network retains more feature information during feature extraction. As a result, the small-target features of fine-grained defects are retained as far as possible. Compared with ordinary convolution alone, the parameter-free SPD layer recombines spatial and depth information, which can reduce the redundancy between convolution operations, thereby reducing the dependence on nonlinear activation and, to a certain extent, speeding up gradient back-propagation and inference. Meanwhile, it also expands the receptive field and improves the overall detection performance of the model.
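The space-to-depth rearrangement performed by the SPD layer can be sketched in pure Python as follows. The sketch assumes a feature map stored as nested lists in height × width × channel order and concatenates the sub-feature maps in the order $f_{0,0}, f_{1,0}, f_{0,1}, f_{1,1}$ for a scale of 2; both the storage layout and the function name `space_to_depth` are assumptions made for this illustration, and the convolution that follows the SPD layer is omitted.

```python
def space_to_depth(x, scale=2):
    """Rearrange an H x W x C map into (H/scale) x (W/scale) x (scale^2 * C).

    Sub-feature map f[a][b] takes rows a, a + scale, ... and columns
    b, b + scale, ...; all sub-maps are concatenated along the channel axis.
    """
    h, w = len(x), len(x[0])
    if h % scale or w % scale:
        raise ValueError("height and width must be divisible by scale")
    out = []
    for i in range(0, h, scale):
        row = []
        for j in range(0, w, scale):
            channels = []
            for col_off in range(scale):        # second index of f
                for row_off in range(scale):    # first index of f
                    channels.extend(x[i + row_off][j + col_off])
            row.append(channels)
        out.append(row)
    return out


# A 4 x 4 map with one channel whose value encodes its position.
feature = [[[4 * i + j] for j in range(4)] for i in range(4)]
packed = space_to_depth(feature, scale=2)

if __name__ == "__main__":
    print(len(packed), len(packed[0]), len(packed[0][0]))  # output height, width, channels
    print(packed[0][0])
```

Every input value appears exactly once in the output, so the spatial resolution is halved without discarding information, in contrast to strided convolution or pooling.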
4. Experiments and Analysis
4.1. Experimental Dataset
4.1.1. Causes of Defects
This study mainly focuses on hot rolled steel strips, including the following six common metal surface defects:
Crazing [
53]. During hot rolling, internal stresses may form in the strip due to temperature gradients, deformation stresses and other factors. When the internal stress reaches the bearing limit of the material, it will lead to the formation of crazing. Surface crazing usually appears as linear or depressed defects extending along the surface of steel strips.
Patches [
53]. The raw material of the steel strip may contain impurities such as non-metallic sulfides, which are prone to form patches during the hot rolling process. In addition, when the steel strip is cooled, the internal tissue structure is unstable due to the uneven local temperature gradient, which also forms patches. Patches usually appear as small round or irregular dark patches on the surface of steel strips, which vary in size and distribution density from tiny point to large sheet.
Rolled-in scale [
53]. The steel strip is in contact with air and reacts with oxygen at high temperatures. Contaminants may be present on the steel strip surface if it is not cleaned adequately before hot rolling. These pollutants can promote oxidation reactions at high temperatures, leading to the formation of the rolled-in scale. The rolled-in scale usually appears as a rust-colored or dark brown covering, which makes the surface rough and uneven.
Scratches [
53]. During production, transportation and storage, contact between the steel strip and hard or sharp objects may produce scratches. During processing, improper tools or methods used by the operator may also cause scratches. Scratches vary in shape and size, usually appear on the surface of steel strips, and can be observed directly.
Inclusion [
53]. In the steelmaking process, if non-metallic substances such as impurities, oxides and sulfides are present in the molten steel, these substances may be carried to the billet during the hot rolling process, forming the inclusion. At the same time, chemical reactions in the smelting process produce solid substances, such as carbides and nitrides. If these reaction products are not completely removed or processed, they may also become the inclusion. Inclusion appears in different shapes, sizes, such as spherical, sheet, and filamentous, and is usually unevenly distributed.
Pitted surface [
53]. In the hot rolling process, when liquid metal or impurities exist on the surface of the billet, the impact force at high temperature makes the droplets splash. These droplets solidify and deposit on the surface of steel strips after cooling, forming the pitted surface. If impurities are present in the molten steel, they may also be carried to the surface, forming the pitted surface during hot rolling. The pitted surface usually appears as small round or irregularly shaped dark spots on the surface of steel strips, which vary in size and distribution density.
All of the above defects have a negative impact on the aesthetics and functionality of steel strips. Because their causes differ, defects of the same type show different characteristics, while the distinction between different types of defects is not obvious enough, so it is extremely difficult to identify metal surface defects accurately.
4.1.2. Defect Dataset
In an actual steel strip production line, especially a new line with a good environment, defects occur infrequently and with a certain randomness, so it often takes a lot of time to collect enough samples. Because production line environments differ between steel mills, the images vary greatly and the quality of the collected images is uneven. However, a model trained on a small number of samples will have poor generalization ability and insufficient feature representation ability. Therefore, this study mainly focuses on the steel strip surface defect dataset (NEU-DET) provided by Song’s team at Northeastern University (NEU) [54,55].
This dataset collects six typical surface defects of hot-rolled steel strips: crazing, patches, rolled-in scale, scratches, inclusion, and pitted surface. It includes 1800 grayscale images, each with a resolution of 200 × 200 pixels, with 300 samples for each defect. The NEU-DET dataset covers different types and degrees of steel strip surface defects, which helps evaluate and compare the performance of different defect detection algorithms in practical applications, and it has become one of the most widely used datasets in the field of metal surface defect detection. Typical samples of the NEU-DET dataset are shown in
Figure 8.
It is worth noting that this defect dataset poses two problems. First, fine-grained defects (crazing, patches, and rolled-in scale) are easily affected by the background. Illumination and material changes alter the gray values of defect images within a class, and dust in the environment leads to low contrast with the background, which reduces the representation ability of the defect. Second, the scale of fine-grained defects is variable. For example, within the crazing class, a defect may be a connected medium-sized object or a small object, and a small object itself has low resolution, so the context information available for the model to learn is limited. Therefore, the detection of fine-grained defects in this study is more challenging.
4.1.3. Statistics and Analysis
As the NEU-DET samples above show, the number of features, the defect size and the background differ considerably between defect sample images, and these factors affect the selection of the detection model and the corresponding improvement strategy. To help the detection model achieve better results, this study conducted detailed statistics and analysis of the defect samples in the NEU-DET dataset.
Figure 9 shows the distribution of the number of defects in each category of the dataset. The inclusion category has the largest number of defects, more than 1000, followed by the patches category with about 900 and the crazing category with about 700, while the pitted surface category has the fewest, only about 400. The number of defects therefore differs considerably between categories.
For defects with relatively obvious characteristics, such as scratches and pitted surface, four or five hundred samples are enough to support the training of the detection model. For fine-grained defects such as crazing and rolled-in scale, which are easily affected by the background and have variable feature scales within the class, six or seven hundred training samples are still far from enough. Therefore, it is necessary to improve the model specifically for such fine-grained defects.
Figure 10 shows the distribution of the proportion of the image area occupied by each defect in the dataset. First, 44.8% of the defects occupy less than 1/10 of the sample image area and 26.2% occupy between 1/10 and 2/10, so 71% of the defects fall in the range 0–2/10. Small and medium-sized defects are therefore the majority, and the design and analysis of the model should favor small and medium targets. Second, defects of the pitted surface category in the 0–2/10 range account for only 1.4%, so this category can be temporarily set aside in this study. Finally, it is obvious from
Figure 8 that the image backgrounds of the scratches and inclusion classes are relatively plain and their defect features are comparatively obvious. Moreover, this study aims to solve the problem that fine-grained defects on metal surfaces are susceptible to background interference and have variable feature scales. Therefore, the design and analysis of the model in this study mainly focus on the remaining three types of defects: crazing, patches, and rolled-in scale.
In summary, the statistics and analysis of strip surface defects clarify the target task of this study: to improve the model structure for fine-grained defects such as crazing, patches, and rolled-in scale, which are susceptible to background influence and vary in feature scale, so as to improve both the detection accuracy for such defects and the versatility of the model for metal surface defect detection tasks.
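The area-proportion statistic discussed above can be sketched as follows. This is only an illustrative sketch: the box format `(xmin, ymin, xmax, ymax)` in pixels, the function names and the example boxes are assumptions and are not taken from the NEU-DET annotations.

```python
from collections import Counter

IMAGE_AREA = 200 * 200  # NEU-DET images are 200 x 200 pixels


def box_area(box):
    """Pixel area of one (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = box
    return max(0, xmax - xmin) * max(0, ymax - ymin)


def ratio_histogram(boxes, image_area=IMAGE_AREA, n_bins=10):
    """Share of boxes per area-proportion bin.

    Bin 0 covers [0, 1/10) of the image area, bin 1 covers [1/10, 2/10), etc.
    Integer arithmetic avoids floating-point errors at bin boundaries.
    """
    counts = Counter(
        min(box_area(b) * n_bins // image_area, n_bins - 1) for b in boxes
    )
    return {k: counts[k] / len(boxes) for k in sorted(counts)}


# Hypothetical boxes for illustration only.
boxes = [(10, 10, 50, 50), (0, 0, 100, 60), (20, 30, 60, 40), (0, 0, 200, 200)]
hist = ratio_histogram(boxes)
small_and_medium_share = hist.get(0, 0) + hist.get(1, 0)  # share in 0-2/10
```

Summing bins 0 and 1 gives the share of defects in the 0–2/10 range, which is the quantity reported as 71% for the real dataset.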
4.2. Experimental Settings
The experiments were run on the Windows 11 operating system with 32 GB of memory. Training used a single GPU (NVIDIA GeForce RTX 4060Ti 16GB), and the software environment was Python 3.8.19 + torch 2.0.0 + CUDA 11.7. The input image size is set to 640 × 640, and data augmentation is performed with the Mosaic method. The initial learning rate is 0.01, the batch size is 16, the SGD optimizer is used, and the model is trained for 150 epochs. Default values are used for all remaining hyperparameters, and this setup is used for all experiments in this paper unless otherwise stated.
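For reference, the stated settings can be collected in one place. The sketch below assumes Ultralytics-style argument names (`imgsz`, `epochs`, `batch`, `lr0`, `optimizer`, `mosaic`); the paper does not name its training framework, and the Mosaic probability of 1.0 is an assumed value, not a reported one.

```python
# Hypothetical mapping of the stated settings onto Ultralytics-style
# training arguments; the framework itself is an assumption.
TRAIN_SETTINGS = {
    "imgsz": 640,        # input size 640 x 640
    "epochs": 150,       # number of training epochs
    "batch": 16,         # batch size
    "lr0": 0.01,         # initial learning rate
    "optimizer": "SGD",
    "mosaic": 1.0,       # assumed: Mosaic augmentation applied with probability 1.0
}


def as_overrides(settings):
    """Render the settings as 'key=value' strings, one per argument."""
    return [f"{key}={value}" for key, value in settings.items()]


overrides = as_overrides(TRAIN_SETTINGS)
```

Keeping the settings in a single dictionary makes it easier to reuse the same configuration across the ablation and comparison experiments.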
In this study, the steel strip surface defect dataset NEU-DET analyzed in
Section 4.1 was primarily used for training and testing. It was partitioned into training, validation, and test sets at a ratio of 8:1:1, so the training set consisted of 1440 images, while the validation and test sets contained 180 images each.
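A minimal sketch of such an 8:1:1 partition is given below. The random seed, the plain (non-stratified) shuffle and the image identifiers are assumptions; the paper does not state how images were assigned to each subset.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical identifiers standing in for the 1800 NEU-DET images.
image_ids = [f"img_{i:04d}" for i in range(1800)]
train, val, test = split_dataset(image_ids)
```

With 1800 images this reproduces the subset sizes stated above (1440, 180 and 180).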
4.3. Evaluation Metrics
In this study, mAP, Precision (P), Recall (R) and FPS are mainly used as the criteria to evaluate each model. P and R are calculated as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
TP (True Positive), FP (False Positive), and FN (False Negative) respectively represent the number of defect targets correctly identified, incorrectly identified, and missed. In the context of metal surface defect detection, Precision and Recall play a crucial role and typically exhibit an inverse relationship.
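The standard definitions of Precision and Recall in terms of TP, FP and FN can be sketched as follows. The counts in the example are hypothetical and only illustrate the calculation.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of detections that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of real defects that are found
    return precision, recall


# Hypothetical counts: 80 correct detections, 20 false alarms, 40 missed defects.
p, r = precision_recall(tp=80, fp=20, fn=40)
```

Raising the confidence threshold typically removes false alarms (higher Precision) but also drops some true detections (lower Recall), which is the inverse relationship mentioned above.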
mAP is the mean of the Average Precision (AP) over all defect categories, and AP is the area under the Precision-Recall (P-R) curve. The higher the mAP, the better the overall detection performance of the model across all categories. AP and mAP are calculated as follows:
$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where N is the number of defect categories and AP_i is the AP of the i-th category.
IoU is the ratio of the intersection to the union of the ground-truth bounding box and the bounding box predicted by the model. All mAP values in this study are computed over all defect targets in the test set, where a prediction counts as correctly localized when its IoU exceeds 0.5 and as incorrect otherwise. A higher mAP signifies better detection performance across the different defect categories.
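The IoU criterion and the area under the P-R curve can be sketched as follows. This sketch assumes boxes in `(x1, y1, x2, y2)` format and uses all-point interpolation of the P-R curve, which is one common convention and not necessarily the one used by the evaluation code in this study. All numbers in the example are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def average_precision(recalls, precisions):
    """Area under the P-R curve; recalls must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# A prediction is counted as correctly localized when IoU exceeds 0.5.
is_match = iou((0, 0, 10, 10), (2, 0, 12, 10)) > 0.5
ap = average_precision(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
m_ap = mean_average_precision({"class_a": 0.5, "class_b": 0.7})
```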
FPS reflects the number of images processed by the model per unit time, and a higher FPS value indicates better real-time performance. It is calculated as follows:
$$FPS = \frac{frameNum}{elapsedTime}$$
where frameNum is the total number of frames processed and elapsedTime is the time used.
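A minimal timing sketch of this measurement is shown below. The `process_frame` callable is a hypothetical stand-in for model inference, and the example timing is illustrative rather than measured.

```python
import time


def fps(frame_num, elapsed_time):
    """Frames per second: frames processed divided by the time used (seconds)."""
    return frame_num / elapsed_time


def measure_fps(process_frame, frames):
    """Time a callable over a list of frames and return the resulting FPS."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return fps(len(frames), elapsed) if elapsed > 0 else float("inf")


# Hypothetical example: 180 test images processed in 1.5 s.
example_fps = fps(frame_num=180, elapsed_time=1.5)

# Hypothetical stand-in for inference, used only to exercise the timer.
measured = measure_fps(lambda frame: sum(frame), [[1, 2, 3]] * 100)
```

In practice, a few warm-up iterations are usually excluded from the timing so that one-off initialization costs do not distort the result.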
In the improved strategies verification experiments and the comparison experiments with mainstream object detection algorithms, the number of parameters (Param) is used as a key indicator of model size and memory footprint: the fewer the parameters, the lighter the model. These experiments verify that the parameter count of the proposed algorithm YOLO-ADS has certain advantages and meets the memory requirements of industrial deployment devices.
4.4. Improved Strategies Verification Experiments
To verify that the strategies for improving the ASPPFCSPC module and the C2f_SimDCNv2 module are correct and effective, two sets of comparative experiments are set up on the NEU-DET dataset, and the experimental results are shown in
Table 1 and
Table 2.
Table 1 shows that the mAP of the model using the SPPCSPC module is 0.6% higher than that of the baseline model using the SPPF module. This is because the SPPCSPC module integrates the Cross Stage Partial Network (CSPNet) structure, which concatenates along the channel dimension the output of the layer preceding SPP and the output of the small residual structure after it passes through the convolution module, so that more useful feature information can be extracted. However, the number of parameters of the model using the SPPCSPC module is 1.5 times that of the baseline model, which shows that CSPNet is the main reason for the increase in parameters and the decrease in FPS of the improved model. The mAP of the model using our newly proposed spatial pyramid module ASPPFCSPC improves by a further 0.9% and is the highest, 1.5% above the baseline model using the SPPF module, and its Precision improves by 7.2%, which is also the highest. This is because we introduce an adaptive average pooling layer to enhance the semantic relationship between local features, surrounding context and global background information, which makes multi-scale feature extraction, fusion and representation smoother and more accurate. At the same time, we draw on the idea of serially connected max pooling layers in SPPF to reduce the computational cost of multi-scale pooling, and we use the simpler and faster activation function ReLU, which effectively alleviates the slowdown in detection speed caused by the increase in model parameters. Therefore, while the number of parameters and the FPS still meet the real-time requirements of industrial deployment, the ASPPFCSPC module improves the mAP and Precision of the model, which verifies that the ASPPFCSPC improvement strategy is correct and effective.
From
Table 2, it is obvious that the C2f_SimDCNv2 scheme, which replaces only the first Conv in the bottleneck with DCNv2, performs best in the comparison, ahead of the C2f_DCNv2_all scheme, which replaces all the Conv layers in the bottleneck with DCNv2, and the C2f_DCNv2_2nd scheme, which replaces only the last Conv layer in the bottleneck with DCNv2. The mAP of the model using the improved C2f_SimDCNv2 module is the highest, 2.2% above that of the baseline model using the C2f module, and its Precision increases by 6.2%, which is also the highest. While the number of parameters and the FPS still meet the real-time requirements of industrial deployment, the C2f_SimDCNv2 module improves the mAP and Precision of the model, which verifies that the C2f_SimDCNv2 improvement strategy is correct and is the best of the three schemes.
4.5. Ablation Experiments
In this study, YOLOv8n is used as the baseline, and its backbone network is improved. The spatial pyramid pooling module is replaced with the newly proposed ASPPFCSPC module, which fully reflects the semantic relationship between local features and global background information and enhances the fusion and representation ability of the model for fine-grained defects, thereby improving the detection accuracy and versatility of the model. Meanwhile, DCNv2 replaces the first convolution in the bottleneck to improve the model’s ability to extract multi-scale features; the resulting C2f_SimDCNv2 module enhances the detection accuracy and robustness of the model by virtue of its superior adaptive ability. In addition, to further improve the detection performance for small target defects, the SPD layer is introduced to expand the receptive field, so that the network significantly reduces the loss of fine-grained features and improves detection accuracy. The parameter-free SPD layer reduces the dependence on nonlinear activation and, to a certain extent, speeds up both inference and gradient backpropagation, thereby enhancing the overall detection performance of the model. To validate the effectiveness of these improvements over the baseline model, ablation experiments are conducted on the NEU-DET dataset from two perspectives:
Based on the baseline model, only one improvement module is introduced at a time to verify the impact of individual modules on mAP, Precision, Recall and detection speed.
Starting from the final model (YOLO-ADS), only one improvement module is removed at a time (the ASPPFCSPC module is never removed) to verify the impact of individual improvement modules on the performance of the final model. The experimental results are shown in
Table 3.
According to the ablation experimental results in
Table 3, model 1 uses the ASPPFCSPC module proposed in this study in the backbone network of the baseline model YOLOv8n, and both mAP and Precision improve significantly: mAP increases by 1.5% and Precision by 7.2%. The AP of the target defect categories of this study (crazing, patches and rolled-in scale) improves by 2.5%, 2.7% and 4.9%, respectively. The ASPPFCSPC module introduces an average pooling layer in the lower branch of the spatial pyramid pooling structure, which fully reflects the semantic relationship between local features and global background information and makes the extraction, fusion and representation of fine-grained defects smoother. Therefore, serving as the feature concatenation and fusion stage of the backbone network, the ASPPFCSPC module can effectively improve the detection accuracy of the model. For model 2, the improved C2f_SimDCNv2 module also performs well by virtue of the superior adaptability of deformable convolution to multi-scale features: mAP increases by 2.2% and Precision by 6.2%. The AP of crazing, patches and rolled-in scale improves by 5.3%, 5.5% and 3.8%, respectively, which verifies the effectiveness of the C2f_SimDCNv2 module for multi-scale feature extraction. For model 3, which introduces the SPD layer alone, a 2× down-sampled feature map without loss of feature information is obtained by expanding the number of channels, which enlarges the receptive field. In this way, the dependence on nonlinear activation is reduced, and the inference speed and gradient backpropagation speed of the model are improved. The mAP of this model is the same as that of the baseline model, but its Precision increases by 4.3%. The detection accuracy for crazing, patches and rolled-in scale improves, and the FPS reaches the highest value of 177.9.
For model 4, which introduces the ASPPFCSPC module and the SPD layer together, mAP increases by 1.3%, Precision rises to the highest value of 82.0%, and Recall is 2.5% higher than that of model 3, which uses the SPD layer alone. The detection accuracy for crazing, patches and rolled-in scale improves, and the AP of rolled-in scale defects increases by 7.2%. For model 5, which introduces the ASPPFCSPC module and the C2f_SimDCNv2 module together, mAP reaches 80.3% and Recall reaches 76.0%. However, its Precision is poorer than that of model 1, which introduces the ASPPFCSPC module alone, and is only 0.1% higher than that of the baseline model. The detection accuracy for crazing, patches and rolled-in scale improves, and the AP of rolled-in scale defects increases by 11.5% to the highest value of 73.0%. The final model proposed in this paper introduces the ASPPFCSPC module, the C2f_SimDCNv2 module and the SPD layer together, and its mAP is the highest at 81.4%. The AP of crazing, patches and rolled-in scale improves by 5.5%, 7.2% and 10.3%, respectively, and the AP of crazing and patches defects reaches the highest values of 59.6% and 94.9%, respectively. In addition, although Precision and Recall usually trade off against each other, the final model YOLO-ADS improves Precision by 6.1% to 79.7% while keeping Recall 0.8% higher than that of the baseline model. At the same time, the FPS reaches 140.4, which fully meets the real-time requirements of industrial deployment.
In summary, the metal surface defect detection algorithm YOLO-ADS proposed in this study can effectively solve the problem that fine-grained defects on the metal surface are susceptible to background interference and feature scale variation. It not only improves the mAP of the baseline model YOLOv8n, but also maintains good Precision, Recall and real-time performance.
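The parameter-free SPD (space-to-depth) operation referred to in the ablation study can be illustrated with a small pure-Python sketch. It assumes a scale factor of 2 and a feature map stored as nested lists of shape channels × height × width; the ordering of the output channels is an arbitrary choice of this sketch and may differ from the implementation used in the model.

```python
def space_to_depth(feature_map, scale=2):
    """Rearrange a C x H x W map into (C * scale**2) x (H / scale) x (W / scale).

    Each output channel is a strided sub-sampling of one input channel, so every
    input value is kept and no learnable parameters are involved.
    """
    out = []
    for dy in range(scale):
        for dx in range(scale):
            for channel in feature_map:
                out.append([row[dx::scale] for row in channel[dy::scale]])
    return out


# One 4 x 4 channel holding the values 0..15.
x = [[[r * 4 + c for c in range(4)] for r in range(4)]]
y = space_to_depth(x)
# y has 4 channels of size 2 x 2, for example y[0] == [[0, 2], [8, 10]].
```

The spatial resolution is halved while the channel count is multiplied by four, which is why the down-sampling loses no feature information.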
4.6. Visual Qualitative Analysis Experiments
The P-R curves in
Figure 11 and
Figure 12 show the experimental results of the baseline model YOLOv8n and the improved YOLO-ADS on the test set under the same experimental settings as in
Section 4.2. The figures show the AP of each defect category and the overall mAP, so the optimization trend is easy to see. The improved YOLO-ADS algorithm raises mAP from 77.9% to 81.4%, a significant increase of 3.5%. It is worth noting that with the YOLOv8n algorithm the AP of the target defect categories of this study (crazing, patches and rolled-in scale) is only 54.1%, 87.7% and 61.5%, respectively, while with the improved YOLO-ADS algorithm it reaches 59.6%, 94.9% and 71.8%. Compared with the baseline algorithm, the improvement in fine-grained defect detection is very significant.
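The gains quoted above are absolute differences in percentage points, which can be checked directly from the reported values. The sketch below uses only the numbers stated in this section; the relative-gain calculation is added for illustration and is not a metric reported by the study.

```python
# AP and mAP values (in %) as reported above for the NEU-DET test set.
baseline = {"crazing": 54.1, "patches": 87.7, "rolled-in scale": 61.5, "mAP": 77.9}
improved = {"crazing": 59.6, "patches": 94.9, "rolled-in scale": 71.8, "mAP": 81.4}

# Absolute gain in percentage points (the convention used in the text).
point_gain = {k: round(improved[k] - baseline[k], 1) for k in baseline}

# Relative gain with respect to the baseline value, for comparison.
relative_gain = {k: (improved[k] - baseline[k]) / baseline[k] for k in baseline}
```

Here `point_gain` evaluates to 5.5, 7.2 and 10.3 points for crazing, patches and rolled-in scale, and 3.5 points for mAP.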
To more intuitively compare the detection performance of the proposed YOLO-ADS with the baseline algorithm in detecting metal surface defects, YOLOv8n and YOLO-ADS are respectively used for further qualitative analysis on the test set of NEU-DET dataset, and the qualitative results are shown in
Figure 13. The first row shows the detection results of the baseline model, and the second row shows those of YOLO-ADS. A column-by-column comparison shows that the proposed YOLO-ADS detects fine-grained defects of varying scales that the baseline model misses, and that it also captures missed fine-grained defects within large target defects such as pitted surface. In addition, it is encouraging that, although YOLO-ADS performs relatively poorly on the evaluation metrics in training for the non-target defects inclusion and pitted surface, its confidence scores in actual detection are all higher than those of the baseline algorithm.
In summary, the results of qualitative analysis experiments on NEU-DET dataset show that the proposed YOLO-ADS has obvious advantages over the baseline algorithm YOLOv8n in the detection of fine-grained metal surface defects with varying scales under such complex backgrounds.
4.7. Versatility and Robustness Verification Experiments
To further verify the versatility and robustness of the proposed algorithm YOLO-ADS, the baseline algorithm and YOLO-ADS are tested on the public GC10-DET [
56], APSPC and VOC2012 datasets [
57]. The GC10-DET dataset contains 2294 real industrial steel surface defect images covering 10 defect types, namely punching, crescent gap, waist folding, rolled pit, water spot, silk spot, inclusion, oil spot, crease and weld line. The APSPC dataset contains 1885 real industrial aluminum surface defect images covering 10 defect types, namely non-conductive, scratches, corner leakage, orange peel, leakage, jet, paint blister, pit, variegated color and dirty spots. The VOC2012 dataset is a publicly available benchmark that offers a standardized evaluation system for detection algorithms and learning performance. It encompasses 17,125 images featuring 20 distinct object categories. All experimental settings remain the same as in
Section 4.2. The experimental results are shown in
Table 4,
Table 5 and
Table 6.
According to the experimental results in
Table 4 and
Table 5, the proposed algorithm YOLO-ADS shows obvious advantages in steel and aluminum surface defect detection tasks. Compared with the baseline model, mAP increases by 3.7% and 3.4%, Precision by 2.9% and 2.8%, and Recall by 0.9% and 3.5% on GC10-DET and APSPC, respectively. This indicates that YOLO-ADS can be applied to different types of metal surface defects and has strong versatility and robustness. It is worth noting that the performance of YOLO-ADS on both datasets meets the real-time requirements of industrial deployment. Moreover, the FPS of YOLO-ADS on the aluminum surface defect detection task reaches 113.1, which is 15.9 higher than that of the baseline model, suggesting that YOLO-ADS may be particularly well suited to defects on metal surfaces such as aluminum.
According to the experimental results in
Table 6, on the larger and more diverse VOC2012 dataset, the proposed YOLO-ADS improves mAP by 4.3%, Precision by 2.6%, and Recall by 3.8% compared with the baseline model. This shows that the YOLO-ADS algorithm can better learn and capture the characteristics of the targets and has strong robustness, and it further verifies the versatility of the algorithm. At the same time, the FPS reaches 135.3, which also meets the real-time requirements of edge device deployment.
In summary, the experimental results on the industrial steel dataset GC10-DET, the industrial aluminum dataset APSPC and the larger public benchmark dataset VOC2012 show that the YOLO-ADS algorithm proposed in this study has strong versatility and robustness.
4.8. Comparison Experiments with Current Mainstream Algorithms
Finally, to demonstrate the advanced performance of the proposed algorithm in metal surface defect detection, this study compares YOLO-ADS on the NEU-DET dataset with four groups of mainstream object detection algorithms: Faster R-CNN, RT-DETR-r18, SSD300 and representative algorithms of the YOLO series. The results are shown in
Table 7 and
Table 8.
Table 7 shows that, compared with RT-DETR-r18, SSD300 and the representative YOLO-series algorithms, Faster R-CNN has the highest accuracy in strip surface defect detection thanks to its two-stage detection network, but its shortcomings are also obvious: the number of model parameters is large and the FPS is poor, only 25.6, which does not meet the real-time demands of industrial deployment. In addition, the table clearly shows that the proposed YOLO-ADS achieves the highest mAP and Precision among the current mainstream algorithms, 81.4% and 79.7%, respectively, and the Recall is second only to YOLOv9tiny with 74.5%. At the same time, the number of model parameters is only 5.0 M and the FPS reaches 140.4, which places its real-time performance at the upper-middle level among the current mainstream algorithms. Compared with RT-DETR-r18, the mAP_0.5 increases by 8.4%, with far fewer model parameters. The mAP of YOLO-ADS is 3.2%, 8.0% and 3.5% higher than that of the newly released YOLOv9s and YOLOv10s and of the baseline model YOLOv8n, which balances accuracy and real-time performance, and its Precision is 7.8%, 8.5% and 6.1% higher, respectively.
More specifically, it can be seen from
Table 8 that the proposed algorithm YOLO-ADS performs best among the mainstream object detection algorithms on the target defect categories of crazing, patches and rolled-in scale, with AP gains of 17.8%, 6.7% and 10.6%, respectively, over the newly released lightweight model YOLOv10n. This is in line with the goal of this study: YOLO-ADS performs well on fine-grained metal surface defects, which are susceptible to background interference and variable in feature scale.
In summary, the YOLO-ADS proposed in this study not only maintains an advanced level of detection accuracy but also has a small number of parameters, and its FPS meets the real-time requirements of industrial deployment, so that the detection task can be completed accurately and quickly in a resource-limited environment. It thus provides a new alternative scheme for metal surface defect detection.
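To relate the parameter count to memory consumption, the size of the stored weights can be estimated as the number of parameters times the bytes per parameter. The sketch below assumes 32-bit floating-point weights by default; the numeric precision used for deployment is not stated in this study, and the estimate covers the weights only, not activations or runtime overhead.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(num_params, dtype="fp32"):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# 5.0 M parameters, as reported for YOLO-ADS.
size_fp32 = weight_size_mb(5.0e6)          # assumed FP32 storage
size_fp16 = weight_size_mb(5.0e6, "fp16")  # assumed FP16 storage
```

Under these assumptions the weights occupy about 20 MB in FP32 and about 10 MB in FP16.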
Finally, it should be recalled that in the experimental settings of
Section 4.2, this study used the default YOLOv8n hyperparameters without retuning, which implies that the model is likely to perform better after dedicated hyperparameter tuning. It should also be noted that all non-YOLO baselines were trained using transfer learning and thus benefited from high-quality images, whereas the YOLO-ADS algorithm was not.
6. Conclusions
The YOLO-ADS algorithm proposed in this study has shown excellent performance and wide applicability in metal surface defect detection tasks. By improving the backbone network of the baseline model YOLOv8n, the ASPPFCSPC, C2f_SimDCNv2 and SPD modules enhance, respectively, the model’s multi-feature fusion and representation, multi-scale feature extraction, and small target feature learning capabilities for fine-grained defects in complex backgrounds. The ablation experiments confirmed the effectiveness of these improvements, reflected in increases of 3.5% in mAP and 6.1% in Precision for the improved model. The visual qualitative experiments further show that, in actual detection, the improved model achieves higher confidence scores and fewer missed detections.
In addition, the improved model shows excellent versatility and robustness not only on the steel strip dataset NEU-DET but also on the industrial steel dataset GC10-DET, the industrial aluminum dataset APSPC and the larger public benchmark dataset VOC2012, on which mAP improves by 3.7%, 3.4%, and 4.3%, respectively. At the same time, in the comparison experiments with the current mainstream algorithms, its detection accuracy, number of parameters and detection speed are all at a good level.
In summary, the YOLO-ADS algorithm proposed in this study has achieved significant performance improvement in metal surface defect detection tasks. It not only possesses excellent accuracy and versatility, but also meets the practical requirements for parameter volume in the deployment of deep learning edge devices. Moreover, its detection speed reaches up to 140.4 FPS, which is suitable for real-time industrial scenarios with limited resources. Thus, it provides an efficient and feasible solution for the field of metal surface defect detection.
For YOLO-ADS, fine-grained defects still present a challenge, with lower detection accuracy than other categories. There is also still room to reduce the number of model parameters. Future work will focus on the following aspects:
Explore diffusion models to generate natural and realistic metal surface defect training samples, thereby improving the detection accuracy and generalization ability of the model to all categories of defects.
Investigate incremental learning methods to enable the model to automatically continue learning from new data samples without having to retrain the entire model.
Combine knowledge distillation with model pruning techniques to avoid a loss of accuracy when the model is made even more lightweight. Moreover, deploy the model locally on deep learning edge devices such as Jetson Nano for practical application in industrial scenarios with limited computing resources.