4.1. Comparative Analysis of Backbone Networks
All experiments, including the training, validation, and testing phases, were conducted on an NVIDIA GeForce RTX 2070 GPU. For fine-tuning-based few-shot object detection, our study explored four distinct architectures for defect detection: ResNet50, ResNet101, and VGG16 [40] within the Faster R-CNN framework, as well as YOLOv4 [41], whose backbone is CSPDarknet53. Each network was trained using stochastic gradient descent (SGD) with a batch size of 4, a momentum of 0.9, and a weight decay coefficient of 0.0001. In the first stage, training on the base classes, the learning rate was set to 0.02; in the second stage, training on the novel classes, it was reduced to 0.001. It is important to recognize that SGD converges rapidly at first, but its convergence slows as it approaches the optimum, where it may settle into local minima; momentum helps the optimizer move past such minima. Careful tuning of the learning rate and weight decay is likewise essential to prevent overfitting and keep training stable.
In this research, we performed fine-tuning-based few-shot object detection following the methodology of Wang et al. [42]. This approach builds on the widely used two-stage detector Faster R-CNN [43], which comprises a backbone network (e.g., ResNet [44] or VGG16) serving as the proposal-level feature extractor, a region proposal network (RPN), and two fully connected sub-networks: a classifier for object classification and a regressor for bounding box coordinate prediction. The authors report new benchmarks on the PASCAL VOC, COCO, and LVIS datasets, achieving stable accuracy estimates and outperforming previous meta-learning methods by notable margins. This simple yet powerful approach establishes new state-of-the-art results, with substantial gains in average precision (AP) on rare classes and minimal impact on frequent classes.
During the initial phase of training, Faster R-CNN was trained intensively on a large number of base-class samples, as in other object detection frameworks. The loss function used in this stage is given in Equation (5):

L = L_rpn + L_cls + L_loc,  (5)

where L_rpn is applied to the RPN's output to discriminate between the foreground and background and to refine the anchors, L_cls is the cross-entropy loss of the bounding box classifier, used for object classification, and L_loc is the smooth L1 loss of the bounding box regressor, used for bounding box coordinate prediction. The smooth L1 loss is a fusion of the L1 and L2 losses: the L1 loss is not differentiable at zero, while the L2 loss can cause gradient explosion when the prediction deviates strongly from the target. The smooth L1 loss incorporates the advantages of each and thereby avoids both drawbacks. The first piecewise branch corresponds to the L2 loss and the second to the L1 loss, as shown in Equation (6):

smooth_L1(x) = 0.5 x^2 if |x| < 1;  |x| - 0.5 otherwise,  (6)

where x is the difference between the true value and the predicted value.
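A direct transcription of the smooth L1 loss as a sketch (not the authors' code): it is quadratic (L2-like) for |x| < 1 and linear (L1-like) elsewhere, so it is differentiable at zero yet its gradient magnitude stays bounded for large residuals.

```python
def smooth_l1(x: float) -> float:
    """Smooth L1 loss of the residual x = true value - predicted value."""
    if abs(x) < 1.0:
        return 0.5 * x * x      # L2 branch: differentiable at x = 0
    return abs(x) - 0.5         # L1 branch: gradient magnitude capped at 1
```

The two branches meet at |x| = 1 with matching value (0.5) and matching slope (1), so the loss is continuous and continuously differentiable everywhere.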
Transitioning to the second phase, a curated dataset containing a small number of samples from both the base classes and the novel classes was introduced for fine-tuning. Weights for the novel classes were randomly initialized, and fine-tuning was confined to the bounding box classifier and regressor while the feature extractor remained frozen.
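The freezing scheme described above can be sketched in miniature (illustrative only; the parameter names and values are hypothetical): during fine-tuning, gradient updates are applied solely to the classifier and regressor head, while backbone parameters are skipped.

```python
# Toy parameter store: backbone weights come from base-class training,
# head weights are randomly initialized for the novel classes.
model = {
    "backbone.conv1": 0.40,    # frozen during fine-tuning
    "head.cls_score": 0.10,    # updated during fine-tuning
    "head.bbox_pred": -0.20,   # updated during fine-tuning
}
frozen = {name for name in model if name.startswith("backbone.")}

def finetune_step(model, grads, lr=0.001):
    """Apply one SGD step, skipping every frozen parameter."""
    for name, g in grads.items():
        if name not in frozen:
            model[name] -= lr * g
    return model

grads = {name: 1.0 for name in model}   # dummy gradients for illustration
finetune_step(model, grads)
# backbone.conv1 is unchanged; only the head parameters move
```

In a real framework the same effect is typically achieved by disabling gradient computation on the backbone and passing only the head parameters to the optimizer.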
In the context of few-shot learning, the traditional softmax classifier is sub-optimal for learning classification features. In light of this, Wang et al. [42] employed a classifier built on cosine similarity, which facilitates effective learning of classification features in the few-shot setting, reduces intra-class variance, and thereby improves classification accuracy.
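The cosine-similarity classifier can be sketched as follows (a sketch consistent with Wang et al. [42], not their released code): the score for each class is a scaled cosine between the instance feature and that class's weight vector, which normalizes away feature magnitude. The scale factor alpha and all vectors here are hypothetical values for illustration.

```python
import math

def cosine_scores(x, class_weights, alpha=20.0):
    """Return one scaled-cosine score per class for feature vector x."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    def norm(a):
        return math.sqrt(dot(a, a))
    # Small epsilon guards against division by zero for degenerate vectors.
    return [alpha * dot(x, w) / (norm(x) * norm(w) + 1e-8)
            for w in class_weights]

feat = [0.6, 0.8]                    # hypothetical instance feature
weights = [[0.6, 0.8], [1.0, 0.0]]   # one weight vector per class
scores = cosine_scores(feat, weights)
```

Because every score is bounded by alpha regardless of feature norm, instances of the same class cluster more tightly in score space than under an unnormalized dot-product (softmax) classifier.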
In this investigation, fine-tuned few-shot steel defect detection was conducted on a small dataset using ResNet50, ResNet101, and VGG16 within the Faster R-CNN framework, alongside YOLOv4. YOLOv4 was selected for its demonstrated accuracy and versatility across diverse settings compared with successors such as YOLOv5. A range of shot quantities (1-shot, 3-shot, 5-shot, 7-shot, and 10-shot) was assessed within these frameworks, each evaluated over multiple random selections of samples per category; the pertinent outcomes are illustrated in Figure 4. The analysis covered three aspects: the effect of sample quantity on model performance, a comparison of different backbones, and the performance of 7-shot relative to 10-shot. The choice of training sample size was guided by the theoretical proxy learning curve of the Bayes classifier [45]. This curve estimates the required sample size by describing how the classifier's performance improves as the number of samples increases, and it helps determine the minimum sample size needed to achieve a specified classification error probability. This is particularly important for applications involving small datasets, as a reasonable estimate can improve classification performance and reduce wasted resources. In this study, an accurate sample size estimate optimized the training process for few-shot object detection, leading to higher detection accuracy and efficiency in practical applications.
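The sample-size estimation described above can be illustrated with a common learning-curve model. This is a hedged sketch, not the method of [45]: we assume test error follows an inverse power law e(n) = a·n^(−b) + c (the coefficients below are hypothetical, not fitted to this paper's data) and solve for the smallest per-class sample count whose predicted error falls below a target.

```python
import math

def min_samples(target_error, a=0.5, b=0.7, c=0.05):
    """Smallest n with a * n**(-b) + c <= target_error.

    a, b, c are assumed learning-curve coefficients: a scales the initial
    error, b is the decay rate, and c is the asymptotic error floor.
    """
    if target_error <= c:
        raise ValueError("target is below the asymptotic error floor c")
    # Invert the power law: n >= (a / (target_error - c)) ** (1 / b)
    n = (a / (target_error - c)) ** (1.0 / b)
    return math.ceil(n)
```

Tightening the target error drives the required sample count up steeply, which is why estimating the curve before collecting data helps avoid both under-sampling (poor accuracy) and over-sampling (wasted resources).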
(a) Effect of sample quantity on model performance: As the number of samples per class increased, the model's mean Average Precision (mAP) improved. This matches our expectation, since additional samples provide more information for training, enabling the model to detect targets more accurately. For instance, with the ResNet101 backbone, mAP increased progressively from 1-shot through 7-shot, with values of 38.57%, 51.81%, 62.97%, and 72.28%, respectively. This strongly underscores the influence of sample quantity on model performance.
(b) Comparison of different backbones: For a given sample quantity, ResNet101 achieved the highest mAP in every setting except 3-shot, where it ranked second. This indicates that, in this task, ResNet101 exhibited relatively stronger feature extraction capabilities than the other backbones (VGG16, CSPDarknet53, and ResNet50), contributing to enhanced model performance.
(c) Performance of 7-shot relative to 10-shot: With the ResNet101 backbone, the mAP of 7-shot slightly exceeded that of 10-shot. This suggests that, in this task, going beyond 7 samples per class may not significantly enhance model performance: with 7 samples, the model has already acquired sufficient features for accurate target detection, and adding further samples yields no substantial benefit.
From the comparative experiments, we drew two key conclusions:
(1) Regardless of the frameworks and backbones employed, the number of shots per class significantly influenced the mean Average Precision (mAP). The mAP was observed to be at its lowest with a single sample (1-shot). As the shot quantity increased, there was a corresponding rise in mAP. Notably, there was little to no improvement in mAP values between 7-shot and 10-shot, with 7-shot occasionally outperforming 10-shot. This suggested that beyond 7 shots, model accuracy did not improve significantly, and thus, 7-shot was more aligned with the principles of few-shot object detection.
(2) Through a comparative analysis of the various frameworks and backbones at 7-shot, we found the model with the Faster R-CNN framework and ResNet101 backbone to be optimal. Its mAP surpassed those of the models using VGG16 and YOLOv4 by 16.56% and 17.46%, respectively, and was 5.36% higher than that of ResNet50. Therefore, in the context of this work, the Faster R-CNN framework with the ResNet101 backbone proved the most effective option, achieving the highest mAP of 72.28%. Future work will be devoted to optimizing and building upon this framework and backbone.
4.2. Performance Evaluation of Enhanced Network on Single Defect Detection
Single defect detection refers to detecting a specific type of defect among the four present in the dataset (pit, crack, scratch, and oxide scale). This research focused on optimizing defect detection, and we validated the efficacy of the enhanced network in an experiment using a 7-shot scenario, in which each novel class contained seven samples. The results are presented in Figure 5 and Table 6.
Figure 5 illustrates the changes in detection accuracy for the four types of defects before and after network refinement. Significant improvements are evident across all defect categories. Particularly noteworthy is the substantial gain for pit defects, whose accuracy increased by 5.34%. This can be attributed to the fact that pit defects typically have smaller scales and more pronounced shape features; the refinement made the model more sensitive to these characteristics, yielding a notable performance boost.
Crack defects likewise showed a notable accuracy gain in the refined model, improving by 4.78% and demonstrating the effectiveness of the network refinement for small-scale yet crucial defects.
Additionally, for larger-scale defects such as scratches and oxide scale, the network refinement produced substantial improvements of 2.77% and 4.31%, respectively, indicating the refined model's ability to capture larger-scale defects and consequently reducing false detections and misclassifications.
In all, Figure 5 shows a noticeable improvement in detection accuracy for every defect category following network refinement. The most noteworthy gain was for pit defects: as these defects are characterized by a smaller scale, the modified model markedly improved their detection accuracy compared with its original version. The refined model also reduced false detections and misclassifications for larger-scale defects, further improving detection accuracy. These results imply that the proposed model can detect small defects effectively in more complex scenarios.
4.4. Performance of Different Models during the Training Process
Figure 6 presents the loss curves of the various models in the 7-shot and 10-shot configurations over the course of training.
During training, aggregate loss decreased noticeably across all four models, with models trained on smaller sample sizes converging more quickly. In the initial phase, total loss was high, which can be attributed to parameter initialization, when the models had not yet learned useful information. In the mid-phase, the loss of all models dropped notably, indicating that they were learning useful information from the data; at this stage, Faster R-CNN with a ResNet101 backbone pulled ahead, its sophisticated architecture proving proficient at capturing the features of the data. As training approached its conclusion, loss continued to decrease, though gradually; this slowdown reflects the models' gradual approach to their maximum potential, making further improvements increasingly difficult. At this point, the ResNet101-based Faster R-CNN maintained its superiority with the lowest cumulative loss among the models. Under both the 7-shot and 10-shot conditions, Faster R-CNN with the ResNet101 backbone consistently exhibited the lowest losses, particularly in the 7-shot setting, where it achieved the optimal loss. This underlines that the highest accuracy was achieved by the 7-shot configuration combined with ResNet101-based Faster R-CNN.
Additionally, we observed some other trends in the entire training process. As training progressed, the performance of all models gradually stabilized, indicating that they were gradually finding a suitable state to detect targets accurately. Particularly in the case of Faster R-CNN (ResNet101), due to its advanced network architecture, the model excelled in capturing data features, maintaining its lead throughout the training process.
However, in later stages, the performance differences among models began to narrow gradually. This can be attributed to the fact that they were all approaching their best performance on current training data. This also implies that further improvements will become increasingly challenging, requiring more refined tuning.
In summary, these observations provide a deeper understanding of the dynamics of the training process and underscore the superiority of ResNet101 combined with the 7-shot configuration, which bodes well for practical applications.