1. Introduction
Surface defect detection of industrial products [1] is a key link in ensuring product quality and safety. It involves using advanced technology to identify and classify surface defects [2], such as cracks and scratches, which not only affect the appearance of products but may also reduce their durability and performance. Therefore, steel surface defect detection has become one of the main tasks for metallurgical enterprises seeking to improve steel quality. Traditional inspection methods, such as manual inspection, nondestructive inspection, and laser scanning inspection [3], are time-consuming, inefficient, and inaccurate, and they struggle to classify defects.
In the detection of surface defects on metal products [4], one-stage detection methods, exemplified by the YOLO [5] series and SSD [6] algorithms, predict object classes and bounding boxes directly, offering speed at the cost of relatively lower accuracy. Two-stage detection methods, represented by Fast R-CNN [7] and Mask R-CNN [8], generate candidate regions through a Region Proposal Network (RPN) and then classify these regions and regress the bounding boxes.
For example, Tang et al. [9] systematically analyzed steel surface defect detection methods, bringing a new perspective to the study of deep-learning-based steel defect detection. Cui et al. [10] proposed a fast and accurate surface defect detection network, designing a skip-connection module that propagates fine-grained features to improve defect detection accuracy. Li et al. [11] put forward a steel surface defect detection model based on the YOLO algorithm, designing a novel attention-embedding backbone network that improves attention to defect features and strengthens their perception. Gao et al. [12] proposed a transformer structure for surface defect detection with a hybrid backbone network; although it improved defect detection, the huge number of parameters it introduced also deserves attention. You et al. [13] used cross-attention to calculate the weight of each channel, which helped extract specific feature areas, and designed a semantic perception module to enhance deep defect feature extraction and further improve detection. Du et al. [14] designed a feature extraction network using Mobile Inverted Residual Bottleneck Blocks to enhance the perception of PCB surface defects. However, the detection of surface defects on industrial products still faces multiple challenges: it must accurately identify various defects while showing strong adaptability and stability.
Although the above-mentioned deep-learning-based detection methods have achieved certain advantages, they face many challenges in steel defect detection: (1) Complexity of the defect background: it is difficult for a general detector to capture the global context information in an image, making defects hard to detect accurately against complex backgrounds. (2) Small-target defect detection: the semantic information of small-target defects is weak, which makes it difficult for the detector to find them and for the loss function to localize them accurately. (3) Multi-scale defects: the scale of the detected objects changes dramatically, which makes it difficult for the feature fusion network to fully capture the characteristics of deep and shallow defects, resulting in poor detection results. In fact, more attention should be paid to the extraction of local and global contextual information during steel defect detection, while the accurate identification of defects with weak semantic information and the effective detection of defects with significant scale changes should also be improved.
In view of these problems, this study introduces a detection network for steel surface defects that leverages global attention perception and cross-layer interaction fusion (GCF-Net), aiming to improve the identification and localization of steel surface defects in industrial scenes. To address the identified issues, this paper first develops an Interactive Feature Extraction Network (IFE-Net) that incorporates both a local modeling feature extraction module and a global attention perception module. Secondly, a Cross-Layer Interactive Fusion Network (CIF-Net) is proposed to improve the recognition and localization of targets. Finally, an Interactive Fusion Module (IFM) is proposed, which uses a dynamic feature selection process to adjust the importance of each spliced feature through attention, so as to make efficient use of each feature’s information.
The main contributions of this method are as follows:
- (1)
In this paper, an IFE-Net is designed to improve the extraction of local detail features. The local modeling feature extraction module (LMF) is designed to fully extract the local details of steel surface defects, and the local feature extraction module (LS Block) is used to enhance the local feature extraction, while the residual feature extraction module (RG Block) is used to retain and utilize the spatial structure information of the image, so as to improve the defect detection performance. In addition, the Global Attention Interaction Module (AIM) is introduced to capture the global contextual information in the image, so as to improve the detection ability of the model for defects under various background noises and interferences.
- (2)
Aiming at the problem of low detection accuracy caused by the large changes in defect size, this paper proposes a CIF-Net. By merging the features of adjacent layers, the network supplements the lost detailed information in the process of feature refinement, giving full play to the advantages of detailed features and deep semantic information. Finally, the interaction between different-sized features is strengthened by cross-layer fusion technology.
- (3)
To comprehensively adjust the importance of each feature during splicing, an IFM is proposed, which uses a dynamic feature selection process and adjusts the features to be fused from the perspectives of channel and space to improve the recognition of defects in complex backgrounds.
- (4)
To meet the challenge of low detection accuracy for small targets, an enhanced small-target-sensitive loss, Q_IOU, is designed in this study, aiming to improve the model’s ability to identify complex small defects on steel surfaces.
The rest of this paper is organized as follows:
Section 2 introduces the literature review of related work, focusing on the analysis of defect detection networks based on deep learning.
Section 3 introduces the method proposed in this paper and the function of each module in detail.
Section 4 introduces the experiment and analysis of this paper, including ablation and contrast experiments.
Section 5 presents the discussion.
Section 6 provides the conclusions and future prospects.
3. Method
As shown in Figure 1, the GCF-Net network structure proposed in this paper consists of an Interactive Feature Extraction Network (IFE-Net), a Cross-Layer Interactive Fusion Network (CIF-Net), an Interactive Fusion Module (IFM), and three detection heads of different sizes. Firstly, local and global contextual defect features are fully extracted from the input feature map by the IFE-Net. The IFE-Net consists of five feature extraction modules, each of which downsamples its input feature map by a factor of 2. Secondly, the features at all levels extracted by the IFE-Net pass through the CIF-Net, where the fine-grained information lost during the gradual refinement of features is compensated by the fusion of adjacent layers, improving the recognition of targets at different scales. Then, the output of the CIF-Net is input to the IFM, which uses a dynamic feature selection process to adjust the features to be fused from the perspectives of channel and space, improving the recognition of defects against complex backgrounds. Finally, the features fused by the IFM are input into three detection heads of different sizes to enhance the network’s attention to the key areas around the target object and output the defect detection results.
The IFE-Net constructed in this paper is shown in Figure 2. M1 is composed of a convolution unit, and M2–M4 are each composed of a convolution unit and a local modeling feature extraction module (LMF). The LMF module fully extracts the local details of steel surface defects: its local feature extraction module (LS Block) enhances local feature extraction, while its residual feature extraction module (RG Block) retains and utilizes the spatial structure information of the image. M5 is the Global Attention Interaction Module (AIM) designed in this paper. The local features extracted by the LMF modules are input into the AIM at the end of the network, which captures the global contextual information of the image. The features in different channel groups are deeply fused by point-by-point convolution to realize information interaction and feature enhancement between channels and to improve the detection ability of the model under various background noises and interferences.
3.1. Local Modeling Feature Extraction Module
To thoroughly capture the local characteristics of steel surface defects and more precisely detect and differentiate between the various types of defects, this study introduces a local modeling feature extraction module (LMF), as illustrated in Figure 3. A local feature extraction module (LS Block) is used to enhance the extraction of local features. Standard convolution mixes cross-channel information when extracting features, so it cannot effectively extract local information within individual channels, and it also brings a large number of parameters. Therefore, for a given input feature, the LS Block first applies a depth-separable convolution, which effectively extracts the local spatial information of each input channel without mixing channel information while reducing the computational cost and the number of parameters. It then applies batch normalization and finally adjusts the number of channels through a 1 × 1 convolution to extract fine-grained features.
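A minimal PyTorch sketch of this operator order is given below. The 3 × 3 kernel size and the absence of an activation are our assumptions, since the text only specifies the sequence depthwise convolution → batch normalization → 1 × 1 channel-adjusting convolution:

```python
import torch
import torch.nn as nn

class LSBlock(nn.Module):
    """Sketch of the LS Block: depthwise conv -> BN -> 1x1 pointwise conv.
    Kernel size (3x3) is assumed; the paper does not state it explicitly."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Depthwise convolution: one filter per channel, no cross-channel mixing.
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                            groups=in_ch, bias=False)
        self.bn = nn.BatchNorm2d(in_ch)
        # 1x1 (pointwise) convolution adjusts the channel count
        # and extracts fine-grained features.
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.bn(self.dw(x)))
```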
To prevent gradient degradation in deep neural networks and to enhance the reusability and expressive ability of features, this paper uses the residual feature extraction module (RG Block) to retain and utilize the spatial structural information of images and to reuse the input features. Specifically, the channels are first mixed through a 1 × 1 convolution layer; one branch then passes through another 1 × 1 convolution layer, followed by a 3 × 3 depth-separable convolution to extract spatial features; finally, the feature channels are further mixed through a 1 × 1 convolution layer. The whole module is designed to propagate gradients more effectively through the residual connection at a lower computational cost, while retaining and utilizing the spatial structural information of the image, thus improving the performance of the model. The RG Block can be expressed by Formulae (1) and (2):
where F is the input feature map, F′ is the output feature of the residual branch in the right half of the RG Block module, and F″ is the output feature of the RG Block module.
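The following PyTorch sketch illustrates one plausible wiring of this description. Where exactly the identity shortcut joins (here, before the final 1 × 1 mixing) is our assumption, since Formulae (1) and (2) are not reproduced above:

```python
import torch
import torch.nn as nn

class RGBlock(nn.Module):
    """Sketch of the RG Block: 1x1 channel mixing, a 1x1 + 3x3 depthwise
    residual branch, and a final 1x1 mixing. The shortcut placement is
    assumed; Formulae (1)-(2) define the exact wiring."""
    def __init__(self, ch: int):
        super().__init__()
        self.mix_in = nn.Conv2d(ch, ch, 1, bias=False)   # initial channel mixing
        self.pw = nn.Conv2d(ch, ch, 1, bias=False)       # branch 1x1 conv
        self.dw = nn.Conv2d(ch, ch, 3, padding=1,
                            groups=ch, bias=False)        # 3x3 depthwise, spatial features
        self.mix_out = nn.Conv2d(ch, ch, 1, bias=False)  # final channel mixing

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        x = self.mix_in(f)
        f_prime = self.dw(self.pw(x))      # F': residual-branch output
        return self.mix_out(x + f_prime)   # F'': residual connection eases gradient flow
```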
3.2. Global Attention Interaction Module
To make the network pay attention to local detail textures within global features and to solve the problem of inaccurate defect classification caused by the weak semantic information of the steel surface, this paper designs a Global Attention Interaction Module (AIM) to enhance the semantic information of defect surface features, as shown in Figure 4. The module is composed of two distinct pathways: a multi-branch attention mechanism and a 3 × 3 convolution. Initially, the incoming feature map is partitioned into m channel groups using group convolution. Afterwards, adaptive pooling layers consolidate the feature map across channels. Subsequently, these aggregated features are passed to a depth-separable convolution layer, which compresses the spatial dimensions of the feature map. The generated attention score is multiplied with the original feature map to highlight the key information in each channel group. Finally, the features in the channel groups are deeply fused by point-by-point convolution to realize information interaction and feature enhancement between channels. This process not only enhances the ability of feature representation but also gives the model fine-grained control over different channel features through the attention mechanism.
This process can be expressed by Formulae (3) to (7).
Firstly, the input features f are divided into m groups by group convolution:
where F1 is the feature after adaptive maximum pooling and depthwise convolution, F2 is the feature after adaptive average pooling and depthwise convolution, and F3 is the feature after fusing F1 and F2.
Then, the obtained groups of attention features are subjected to 1 × 1 point-by-point convolution and 3 × 3 depth-separable convolution, and the features in the channel groups are deeply fused to realize information interaction and feature enhancement between channels.
F4 is the feature after multi-branch attention operation and depthwise convolution.
Finally, the features of the two branches are fused to realize the intra-block feature aggregation between different dimensions and improve the characterization of fine-grained defect features.
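A compact PyTorch sketch of this two-pathway design is shown below, under our assumptions about the group count, the 1 × 1-resolution adaptive pooling, and a sigmoid gate for the attention score; Formulae (3)–(7) fix the exact operations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AIM(nn.Module):
    """Sketch of the Global Attention Interaction Module (AIM).
    The group count, pooling output size, and sigmoid gate are assumptions."""
    def __init__(self, ch: int, groups: int = 4):
        super().__init__()
        self.group_conv = nn.Conv2d(ch, ch, 1, groups=groups, bias=False)
        self.dw_max = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)
        self.dw_avg = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)
        self.pw = nn.Conv2d(ch, ch, 1, bias=False)      # point-wise fusion across groups
        self.dw_out = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)  # plain 3x3 pathway

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        g = self.group_conv(f)                             # split into channel groups
        f1 = self.dw_max(F.adaptive_max_pool2d(g, 1))      # F1: max-pool branch
        f2 = self.dw_avg(F.adaptive_avg_pool2d(g, 1))      # F2: avg-pool branch
        f3 = torch.sigmoid(f1 + f2)                        # F3: fused attention score
        attended = g * f3                                  # highlight key info per group
        f4 = self.dw_out(self.pw(attended))                # F4: point-wise + depthwise fusion
        return f4 + self.conv3(f)                          # fuse the two pathways
```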
3.3. Cross-Layer Interactive Fusion Network
Figure 5 shows several common feature fusion networks. Among them, FPN loses detailed information in the process of feature fusion; PANet adds a bottom–up feature transfer path, but its parameter count is large and its efficiency is low; and BiFPN is insufficient in integrating the spatial features captured from different levels of the backbone network.
To solve these problems, this paper proposes a Cross-Layer Interactive Fusion Network (CIF-Net). Specifically, unlike past approaches, PConv is used in place of ordinary convolution, and the fusion pathway is reduced layer by layer to cut the model parameters. Secondly, the loss of detailed information due to progressive feature refinement is mitigated through the integration of nearby layers. Finally, because an integration process that focuses excessively on transfer and aggregation among adjacent feature layers overlooks the interaction between features across non-adjacent layers, cross-layer fusion is employed to enhance the interaction between different scales. This allows the network to concurrently take into account both the fine details and the broader context of the image, thereby enhancing its capacity to recognize targets across a range of scales.
The comparison between the fusion network proposed in this paper and FPN, PANet, and BiFPN is shown in Figure 5. For example, P3 in the figure can be expressed by Formulae (8) to (10):
where Down2× represents twofold downsampling, Down4× represents fourfold downsampling, Concat represents the splicing operation of different feature maps, ε is a very small positive number that prevents unstable training, and ω represents a learnable weight coefficient. P3, P2, and P1 are the new feature layers after cross-layer feature fusion.
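As a concrete illustration, the sketch below implements one such weighted cross-layer fusion node in PyTorch. The ReLU-normalized weights, max-pooling as the downsampling operator, and equal channel counts are our assumptions, not the paper's exact Formulae (8)–(10):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusion(nn.Module):
    """Sketch of one cross-layer fusion node (e.g., producing P3) with
    learnable, epsilon-stabilized weights as described in the text."""
    def __init__(self, ch: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))  # one learnable weight per input level
        self.eps = eps
        self.fuse = nn.Conv2d(3 * ch, ch, 1, bias=False)  # mix the spliced features

    def forward(self, c1, c2, c3):
        # Normalized learnable weights; eps keeps the division stable.
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)
        # Downsample shallower layers to c3's resolution: 2x for c2, 4x for c1.
        d2 = F.max_pool2d(c2, kernel_size=2)
        d4 = F.max_pool2d(c1, kernel_size=4)
        # Weight, splice along channels, and fuse.
        out = torch.cat([w[0] * d4, w[1] * d2, w[2] * c3], dim=1)
        return self.fuse(out)
```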
3.4. Interactive Fusion Module
In existing methods of fusing features of the same size, the features are often fused in an undifferentiated way. However, this kind of fusion, which ignores priority, can easily lead to the loss of important information and interference from useless information. In this section, an IFM is designed that adjusts the importance of each spliced feature through attention to make efficient use of all feature information; it is then embedded into the enhanced feature pyramid network to improve the detection of defects in complex backgrounds.
To comprehensively adjust the importance of each feature during splicing, the IFM simultaneously adjusts the features to be spliced from the perspectives of channel and space. In Figure 6, the Interactive Fusion Module (IFM)’s architecture is illustrated with a simplified representation of two input features for clarity. The module accepts the features F1 and F2 to be concatenated. It first employs global average pooling (GAP) and global maximum pooling (GMP) to extract global features from each input and then sums the extracted features.
This process can be represented by Equations (11) and (12):
where GAP is the global average pooling, GMP is the global maximum pooling, and + is an element-by-element addition operation.
The spatial features obtained from the respective inputs are spliced in the channel direction. This process can be represented by Equations (13) and (14):
where Concat stands for splicing, AVGPool stands for average pooling, and MAXPool stands for maximum pooling.
To comprehensively evaluate the importance of the spliced features, the IFM fuses the channel and spatial features obtained from the input features F1 and F2 and then uses this information to adjust the element values of F1 and F2. The channel features are spliced, the spliced comprehensive features are input into two 1 × 1 convolution layers, and the results are converted into adjustable weights by the sigmoid function. This process can be expressed by Equation (15):
where f1×1 represents 1 × 1 convolution, Cat represents the splicing operation, and sigmoid is the activation function.
Similarly, the spatial features are spliced in the channel direction, the spliced features are input into two 1 × 1 convolutions, and the sigmoid function is used to convert them into adjustment values. This process can be expressed by Equation (16):
Finally, the number of channels is adjusted by a 1 × 1 convolution. This process can be expressed by Equation (17):
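The following PyTorch sketch shows one way to realize this module for two equal-sized inputs. The reduction ratio in the 1 × 1 convolutions and the way the channel and spatial weights are applied to the spliced features before the final adjustment are our assumptions, since Equations (11)–(17) are not reproduced above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IFM(nn.Module):
    """Sketch of the Interactive Fusion Module for two inputs: channel weights
    from GAP+GMP descriptors, spatial weights from per-pixel avg/max maps.
    Reduction ratio (4) and weight application are assumptions."""
    def __init__(self, ch: int):
        super().__init__()
        # Channel path: spliced global descriptors -> two 1x1 convs -> sigmoid weights.
        self.ch_fc = nn.Sequential(
            nn.Conv2d(2 * ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, 2 * ch, 1), nn.Sigmoid())
        # Spatial path: spliced avg/max maps -> two 1x1 convs -> sigmoid weights.
        self.sp_fc = nn.Sequential(
            nn.Conv2d(4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(2, 1, 1), nn.Sigmoid())
        self.adjust = nn.Conv2d(2 * ch, ch, 1, bias=False)  # final channel adjustment

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # Eqs. (11)-(12): global average + max pooling, summed element-wise.
        g1 = F.adaptive_avg_pool2d(f1, 1) + F.adaptive_max_pool2d(f1, 1)
        g2 = F.adaptive_avg_pool2d(f2, 1) + F.adaptive_max_pool2d(f2, 1)
        ch_w = self.ch_fc(torch.cat([g1, g2], dim=1))        # Eq. (15): channel weights
        # Eqs. (13)-(14): per-pixel average/max over channels, spliced along channels.
        s1 = torch.cat([f1.mean(1, keepdim=True), f1.amax(1, keepdim=True)], dim=1)
        s2 = torch.cat([f2.mean(1, keepdim=True), f2.amax(1, keepdim=True)], dim=1)
        sp_w = self.sp_fc(torch.cat([s1, s2], dim=1))        # Eq. (16): spatial weights
        fused = torch.cat([f1, f2], dim=1) * ch_w * sp_w     # reweight spliced features
        return self.adjust(fused)                            # Eq. (17)
```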
3.5. Loss Function
The loss function [35] in one-stage target detection consists of classification loss, confidence loss, and bounding-box regression loss. The Binary Cross-Entropy (BCE) loss [36] is utilized for both the confidence and classification losses, whereas the bounding-box regression loss is determined by the CIoU loss [37]. The CIoU loss incorporates the intersection over union, the distance between the centers of the predicted and ground-truth bounding boxes, and the aspect ratio. A high CIoU loss value typically indicates a large discrepancy in center-point positioning between the predicted and real bounding boxes, so during loss computation the loss for large targets is considerably greater than that for small targets, which skews the loss calculation against small targets and makes the model inaccurate in detecting small surface defects. Hence, in light of the challenges associated with pinpointing small targets, this study refines the penalty term to obtain Q_IOU, a loss with enhanced sensitivity to small targets, which improves the perception of complex small defects in steel. Q_IOU can be expressed by Formulae (18) to (20):
where IoU is calculated as the proportion of the overlapping area between the predicted and real bounding boxes to their combined area; ρ(b, b^gt) represents the Euclidean distance between the center points of the predicted box and the real box; b and b^gt are the center points of the predicted box and the real box, respectively; w^gt and h^gt are the width and height of the real box, respectively; and w and h are the width and height of the predicted box, respectively.
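For reference, the sketch below computes the standard CIoU loss that Q_IOU refines; the small-target penalty term is intentionally not reproduced here, since its exact form is defined by Formulae (18)–(20):

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """Standard CIoU loss (the baseline Q_IOU modifies).
    Boxes are (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # Intersection over union.
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Normalized center-distance penalty: rho^2 / c^2.
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term.
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps))
                              - torch.atan(w_p / (h_p + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```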
5. Discussion
The purpose of this paper was to design a high-precision defect detection network to address the problems faced in steel defect detection, such as complex backgrounds and noise interference, difficulty in accurately detecting complex small targets, and large scale variations among defects.
To solve the above problems, we designed an Interactive Feature Extraction Network (IFE-Net), a Cross-Layer Interactive Fusion Network (CIF-Net), and an Interactive Fusion Module (IFM), and we strengthened the small-target-sensitive loss Q_IOU to improve the detection of different types of defects in steel. A large number of ablation experiments (see Table 1 and Table 2) and contrast experiments (see Table 3) were carried out on the NEU-DET dataset to verify the effectiveness of this model. Compared with the current mainstream one-stage, two-stage, and hybrid detectors, this model was superior to all of the contrast models. Comparative experiments on the PCB and Steel datasets (see Table 6 and Table 7) verified the generalization of this model. In addition, the ablation experiments (see Table 4 and Table 5) proved the effectiveness of the different modules proposed in this paper, as well as the robustness of the model. The model proposed in this paper shows a significant competitive advantage when detection accuracy and speed are considered together, which is very important for the defect detection of industrial products.
6. Conclusions
In this paper, a steel surface defect detection network based on global attention perception and cross-layer interactive fusion was proposed to improve the accuracy of defect detection. First of all, we designed the IFE-Net, whose global attention perception module captures the global contextual information in the image, improving the model’s ability to detect defects under various background noises and interferences. Then, we introduced the CIF-Net, which supplements the fine-grained information lost in the process of feature refinement through cross-layer fusion, enhancing the recognition of multi-scale defect targets. In addition, we developed the IFM, which adjusts the importance of different spliced features through the attention mechanism, realizes the efficient fusion of features of different scales, and further improves the detection of defects in complex backgrounds. Ultimately, we introduced Q_IOU to boost the sensitivity of the loss function to small objects, leading to a substantial enhancement in detecting intricate, minor targets on steel surfaces.
Experiments on the NEU-DET dataset proved the effectiveness of different modules, and experiments on the PCB and Steel datasets verified the generalization of this model. In addition, the PCB dataset contains a large number of complex, small target defects, which were used to verify the positioning performance of the Q_IOU proposed in this paper. The Steel dataset is disturbed by complex background noise. Experiments on this dataset verified the effectiveness of the IFM in detecting defects in complex backgrounds. Therefore, the steel surface defect detection network proposed in this paper achieved remarkable results in improving the detection accuracy and reducing the false detection rate, and it proved its universality, providing strong technical support for the quality control of the steel industry.
Although the method proposed in this paper improved steel defect detection, as a technology relying on supervised learning it faces challenges regarding dataset size and preprocessing requirements, which usually demand manual intervention, and its potential cannot be fully tapped when data are insufficient. Therefore, our future research will focus on exploring semi-supervised or unsupervised learning strategies to reduce the dependence on large amounts of labeled data, providing new technical support for industrial product defect detection.