4.1. Comparative Analysis of Backbone Networks
All experiments, including the training, validation, and testing phases, were conducted on an NVIDIA GeForce RTX 2070 GPU. For fine-tuning-based few-shot object detection, our study explored four distinct architectures for defect detection: ResNet50, ResNet101, and VGG16 [40] within the Faster R-CNN framework, as well as YOLOv4 [41], whose backbone is CSPDarknet53. Each network was trained using stochastic gradient descent (SGD) with a batch size of 4, a momentum of 0.9, and a weight decay coefficient of 0.0001. In the first stage, training on the base classes, the learning rate was set to 0.02; in the second stage, training on the novel classes, it was reduced to 0.001. It is important to recognize that SGD converges rapidly at first, but its convergence slows as it approaches the optimum, where it may settle into local minima; momentum helps the optimizer move past such minima. Careful tuning of the learning rate and weight decay is likewise essential to prevent overfitting and keep training stable.
In this research, we performed fine-tuning-based few-shot object detection following the methodology of Wang et al. [42]. This approach builds on the widely used two-stage detector Faster R-CNN [43], which comprises a backbone network (e.g., ResNet [44] or VGG16) serving as the proposal-level feature extractor, a region proposal network (RPN), and two fully connected sub-networks: a classifier for object classification and a regressor for bounding box coordinate prediction. The authors report new benchmarks on the PASCAL VOC, COCO, and LVIS datasets, achieving stable accuracy estimates and outperforming previous meta-learning methods by notable margins. This simple yet powerful approach establishes new state-of-the-art results, with substantial gains in average precision (AP) on rare classes and minimal impact on frequent classes.
During the initial phase of training, Faster R-CNN was trained intensively on a large number of base-class samples, as in other object detection frameworks. The loss function used in this stage is given in Equation (5):

L = L_rpn + L_cls + L_loc,  (5)

where L_rpn is applied to the RPN's output to discriminate between the foreground and background and to refine the anchors, L_cls is the cross-entropy loss of the bounding box classifier, used for object classification, and L_loc is the smooth L1 loss of the bounding box regressor, used for bounding box coordinate prediction. The smooth L1 loss is a fusion of the L1 and L2 losses: the L1 loss is not differentiable at zero, while the L2 loss can cause gradient explosion when the prediction deviates strongly from the target. The smooth L1 loss incorporates the advantages of each and thereby avoids both drawbacks. The first piecewise branch corresponds to the L2 loss and the second to the L1 loss, as shown in Equation (6):

smooth_L1(x) = 0.5 x^2 if |x| < 1;  |x| - 0.5 otherwise,  (6)

where x is the difference between the true value and the predicted value.
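A direct transcription of the smooth L1 loss as a sketch (not the authors' code): it is quadratic (L2-like) for |x| < 1 and linear (L1-like) elsewhere, so it is differentiable at zero yet its gradient magnitude stays bounded for large residuals.

```python
def smooth_l1(x: float) -> float:
    """Smooth L1 loss of the residual x = true value - predicted value."""
    if abs(x) < 1.0:
        return 0.5 * x * x      # L2 branch: differentiable at x = 0
    return abs(x) - 0.5         # L1 branch: gradient magnitude capped at 1
```

The two branches meet at |x| = 1 with matching value (0.5) and matching slope (1), so the loss is continuous and continuously differentiable everywhere.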
Transitioning to the second phase, a curated dataset containing a small number of samples from both the base classes and the novel classes was introduced for fine-tuning. Weights for the novel classes were randomly initialized, and fine-tuning was confined to the bounding box classifier and regressor while the feature extractor remained frozen.
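The freezing scheme described above can be sketched in miniature (illustrative only; the parameter names and values are hypothetical): during fine-tuning, gradient updates are applied solely to the classifier and regressor head, while backbone parameters are skipped.

```python
# Toy parameter store: backbone weights come from base-class training,
# head weights are randomly initialized for the novel classes.
model = {
    "backbone.conv1": 0.40,    # frozen during fine-tuning
    "head.cls_score": 0.10,    # updated during fine-tuning
    "head.bbox_pred": -0.20,   # updated during fine-tuning
}
frozen = {name for name in model if name.startswith("backbone.")}

def finetune_step(model, grads, lr=0.001):
    """Apply one SGD step, skipping every frozen parameter."""
    for name, g in grads.items():
        if name not in frozen:
            model[name] -= lr * g
    return model

grads = {name: 1.0 for name in model}   # dummy gradients for illustration
finetune_step(model, grads)
# backbone.conv1 is unchanged; only the head parameters move
```

In a real framework the same effect is typically achieved by disabling gradient computation on the backbone and passing only the head parameters to the optimizer.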
In the context of few-shot learning, the traditional softmax classifier is sub-optimal for learning classification features. In light of this, Wang et al. [42] employed a classifier built on cosine similarity, which facilitates effective learning of classification features in the few-shot setting, reduces intra-class variance, and thereby improves classification accuracy.
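The cosine-similarity classifier can be sketched as follows (a sketch consistent with Wang et al. [42], not their released code): the score for each class is a scaled cosine between the instance feature and that class's weight vector, which normalizes away feature magnitude. The scale factor alpha and all vectors here are hypothetical values for illustration.

```python
import math

def cosine_scores(x, class_weights, alpha=20.0):
    """Return one scaled-cosine score per class for feature vector x."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    def norm(a):
        return math.sqrt(dot(a, a))
    # Small epsilon guards against division by zero for degenerate vectors.
    return [alpha * dot(x, w) / (norm(x) * norm(w) + 1e-8)
            for w in class_weights]

feat = [0.6, 0.8]                    # hypothetical instance feature
weights = [[0.6, 0.8], [1.0, 0.0]]   # one weight vector per class
scores = cosine_scores(feat, weights)
```

Because every score is bounded by alpha regardless of feature norm, instances of the same class cluster more tightly in score space than under an unnormalized dot-product (softmax) classifier.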
In this investigation, fine-tuned few-shot steel defect detection was conducted on a small dataset using ResNet50, ResNet101, and VGG16 within the Faster R-CNN framework, alongside YOLOv4. YOLOv4 was selected for its demonstrated accuracy and versatility across diverse settings compared with successors such as YOLOv5. A range of shot quantities (1-shot, 3-shot, 5-shot, 7-shot, and 10-shot) was assessed within these frameworks, each evaluated over multiple random selections of samples per category; the pertinent outcomes are illustrated in Figure 4. The analysis covered three aspects: the effect of sample quantity on model performance, a comparison of different backbones, and the performance of 7-shot relative to 10-shot. The choice of training sample size was guided by the theoretical proxy learning curve of the Bayes classifier [45]. This curve estimates the required sample size by describing how the classifier's performance improves as the number of samples increases, and it helps determine the minimum sample size needed to achieve a specified classification error probability. This is particularly important for applications involving small datasets, as a reasonable estimate can improve classification performance and reduce wasted resources. In this study, an accurate sample size estimate optimized the training process for few-shot object detection, leading to higher detection accuracy and efficiency in practical applications.
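The sample-size estimation described above can be illustrated with a common learning-curve model. This is a hedged sketch, not the method of [45]: we assume test error follows an inverse power law e(n) = a·n^(−b) + c (the coefficients below are hypothetical, not fitted to this paper's data) and solve for the smallest per-class sample count whose predicted error falls below a target.

```python
import math

def min_samples(target_error, a=0.5, b=0.7, c=0.05):
    """Smallest n with a * n**(-b) + c <= target_error.

    a, b, c are assumed learning-curve coefficients: a scales the initial
    error, b is the decay rate, and c is the asymptotic error floor.
    """
    if target_error <= c:
        raise ValueError("target is below the asymptotic error floor c")
    # Invert the power law: n >= (a / (target_error - c)) ** (1 / b)
    n = (a / (target_error - c)) ** (1.0 / b)
    return math.ceil(n)
```

Tightening the target error drives the required sample count up steeply, which is why estimating the curve before collecting data helps avoid both under-sampling (poor accuracy) and over-sampling (wasted resources).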
(a) Effect of sample quantity on model performance: As the number of samples per class increased, the model's mean Average Precision (mAP) improved. This matches our expectation, since additional samples provide more information for training, enabling the model to detect targets more accurately. For instance, with the ResNet101 backbone, mAP increased progressively from 1-shot through 7-shot, with values of 38.57%, 51.81%, 62.97%, and 72.28%, respectively. This strongly underscores the influence of sample quantity on model performance.
(b) Comparison of different backbones: For a given sample quantity, ResNet101 achieved the highest mAP in every setting except 3-shot, where it ranked second. This indicates that, in this task, ResNet101 exhibited relatively stronger feature extraction capabilities than the other backbones (VGG16, CSPDarknet53, and ResNet50), contributing to enhanced model performance.
(c) Performance of 7-shot relative to 10-shot: With the ResNet101 backbone, the mAP of 7-shot slightly exceeded that of 10-shot. This suggests that, in this task, going beyond 7 samples per class may not significantly enhance model performance: with 7 samples, the model has already acquired sufficient features for accurate target detection, and adding further samples yields no substantial benefit.
From the comparative experiments, we drew two key conclusions:
(1) Regardless of the frameworks and backbones employed, the number of shots per class significantly influenced the mean Average Precision (mAP). The mAP was observed to be at its lowest with a single sample (1-shot). As the shot quantity increased, there was a corresponding rise in mAP. Notably, there was little to no improvement in mAP values between 7-shot and 10-shot, with 7-shot occasionally outperforming 10-shot. This suggested that beyond 7 shots, model accuracy did not improve significantly, and thus, 7-shot was more aligned with the principles of few-shot object detection.
(2) Through a comparative analysis of the various frameworks and backbones at 7-shot, we found the model with the Faster R-CNN framework and ResNet101 backbone to be optimal. Its mAP surpassed those of the models using VGG16 and YOLOv4 by 16.56% and 17.46%, respectively, and was 5.36% higher than that of ResNet50. Therefore, in the context of this work, the Faster R-CNN framework with the ResNet101 backbone proved the most effective option, achieving the highest mAP of 72.28%. Future work will be devoted to optimizing and building upon this framework and backbone.
4.2. Performance Evaluation of Enhanced Network on Single Defect Detection
Single defect detection refers to detecting a specific type of defect among the four present in the dataset (pit, crack, scratch, and oxide scale). This research focused on optimizing defect detection, and we validated the efficacy of the enhanced network in an experiment using a 7-shot scenario, in which each novel class contained seven samples. The results are presented in Figure 5 and Table 6.
Figure 5 illustrates the changes in detection accuracy for the four types of defects before and after network refinement. Significant improvements are evident across all defect categories. Particularly noteworthy is the substantial gain for pit defects, whose accuracy increased by 5.34%. This can be attributed to the fact that pit defects typically have smaller scales and more pronounced shape features; the refinement made the model more sensitive to these characteristics, yielding a notable performance boost.
Crack defects likewise showed a notable accuracy gain in the refined model, improving by 4.78% and demonstrating the effectiveness of the network refinement for small-scale yet crucial defects.
Additionally, for larger-scale defects such as scratches and oxide scale, the network refinement produced substantial improvements of 2.77% and 4.31%, respectively, indicating the refined model's ability to capture larger-scale defects and consequently reducing false detections and misclassifications.
In all, Figure 5 shows a noticeable improvement in detection accuracy for every defect category following network refinement. The most noteworthy gain was for pit defects: as these defects are characterized by a smaller scale, the modified model markedly improved their detection accuracy compared with its original version. The refined model also reduced false detections and misclassifications for larger-scale defects, further improving detection accuracy. These results imply that the proposed model can detect small defects effectively in more complex scenarios.
4.4. Performance of Different Models during the Training Process
Figure 6 presents the loss curves of the various models in the 7-shot and 10-shot configurations over the course of training.
During training, aggregate loss decreased noticeably across all four models, with models trained on smaller sample sizes converging more quickly. In the initial phase, total loss was high, which can be attributed to parameter initialization, when the models had not yet learned useful information. In the mid-phase, the loss of all models dropped notably, indicating that they were learning useful information from the data; at this stage, Faster R-CNN with a ResNet101 backbone pulled ahead, its sophisticated architecture proving proficient at capturing the features of the data. As training approached its conclusion, loss continued to decrease, though gradually; this slowdown reflects the models' gradual approach to their maximum potential, making further improvements increasingly difficult. At this point, the ResNet101-based Faster R-CNN maintained its superiority with the lowest cumulative loss among the models. Under both the 7-shot and 10-shot conditions, Faster R-CNN with the ResNet101 backbone consistently exhibited the lowest losses, particularly in the 7-shot setting, where it achieved the optimal loss. This underlines that the highest accuracy was achieved by the 7-shot configuration combined with ResNet101-based Faster R-CNN.
Additionally, we observed some other trends in the entire training process. As training progressed, the performance of all models gradually stabilized, indicating that they were gradually finding a suitable state to detect targets accurately. Particularly in the case of Faster R-CNN (ResNet101), due to its advanced network architecture, the model excelled in capturing data features, maintaining its lead throughout the training process.
However, in later stages, the performance differences among models began to narrow gradually. This can be attributed to the fact that they were all approaching their best performance on current training data. This also implies that further improvements will become increasingly challenging, requiring more refined tuning.
In summary, these observations provide a deeper understanding of the dynamics of the training process and underscore the superiority of ResNet101 combined with the 7-shot configuration, which bodes well for practical applications.