4.3.1. Stage 1: Defect Detection
This study constructed a binary classification model for defect detection in stage 1, utilizing transfer learning to classify steel images as defective or non-defective. ResNet-50, ResNet-101, and ResNet-152, all pre-trained on the ImageNet database, served as base models for the comparative experiments, with the data splits constructed to preserve the defect distribution of the original dataset.
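As an illustrative sketch only (not the authors’ released code), this transfer-learning setup can be expressed in PyTorch by loading an ImageNet-pre-trained backbone and replacing its classification head with a single sigmoid output; the exact head configuration is our assumption:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pre-trained on ImageNet and replace its final
# fully connected layer with a single-logit head for the binary
# defect / no-defect task; the same pattern applies to ResNet-101
# and ResNet-152.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 1),  # one output unit
    nn.Sigmoid(),                        # probability, compatible with BCELoss
)
```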
Table 4 displays the hyperparameters for each model used in the experiments.
Figure 8 presents the training and validation loss curves over 30 epochs for model1, one of the three binary classification models referenced in Table 4. BCELoss was employed as the loss function for the binary classification task, and the model weights were updated using the Adam optimizer with a learning rate of 0.0001. A StepLR scheduler reduced the learning rate by a factor of 0.2 every 10 epochs. The batch size varied between 16 and 32 across experiments, although this change did not significantly affect the results. The training loss curve of model1 declined quickly in the initial 10 epochs, then stabilized and converged to nearly zero. The validation loss curve also decreased consistently, indicating effective performance on both the training and validation datasets.
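A minimal PyTorch sketch of this reported configuration, assuming the `model` defined above and a `train_loader` yielding image batches with 0/1 labels, might look as follows:

```python
import torch

# Reported hyperparameters: BCELoss, Adam with lr = 1e-4, and a StepLR
# scheduler that multiplies the learning rate by 0.2 every 10 epochs.
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.2)

for epoch in range(30):
    for images, labels in train_loader:        # batch size 16 or 32
        optimizer.zero_grad()
        outputs = model(images).squeeze(1)     # (B,) probabilities
        loss = criterion(outputs, labels.float())
        loss.backward()
        optimizer.step()
    scheduler.step()                           # step once per epoch
```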
Figure 9 illustrates the training and validation loss curves for model2 over 30 epochs. Both the training and validation losses decrease sharply during the first 5 epochs, approaching zero, but then spike suddenly, indicating potential optimization issues caused by overfitting or unstable learning on specific data patterns.
Figure 10 displays the training and validation loss curves for model3. Although model3 converged more slowly than the other models in the initial epochs, it stabilized and converged rapidly after a certain point. Examination of all three models’ loss curves reveals that each model’s training loss approaches near-zero values after a certain number of epochs. However, except for ResNet-50, the models exhibited unstable loss fluctuations during training, reflecting differences in their generalization abilities. Confusion matrices were used to provide a more intuitive visualization of each model’s performance on defect and non-defect data.
Figure 11 presents the confusion matrices for the three binary classification models described in Table 4. These matrices provide a clear visualization of each model’s performance on the test dataset of 720 samples. Each matrix displays results for two classes, “No Defect” and “Defect”; the rows indicate the true labels, while the columns show the predicted labels. All three models effectively differentiated between defect and non-defect samples. However, ResNet-101 and ResNet-152 each misclassified one or two samples, whereas ResNet-50 predicted every sample correctly, demonstrating its superior performance and potential for defect detection in binary classification tasks.
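Such matrices can be computed directly from thresholded model outputs; the sketch below uses scikit-learn, with placeholder arrays standing in for the actual 720-sample test-set predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder values; in practice y_true and probs come from evaluating
# the trained model on the 720-sample test set.
y_true = np.array([0, 0, 1, 1])          # 0 = "No Defect", 1 = "Defect"
probs  = np.array([0.1, 0.4, 0.8, 0.9])  # sigmoid outputs
y_pred = (probs >= 0.5).astype(int)      # threshold at 0.5

# Rows are true labels, columns are predicted labels,
# matching the layout of Figure 11.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
```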
4.3.2. Stage 2: Surface Pixel Defect Mapping
Figure 12 illustrates the results of the Faster R-CNN object detection model for the six defect types. The NMS threshold was set to 0.5, and only bounding boxes with a confidence score above 0.7 were displayed. In each pair of images, the left column shows the ground-truth bounding-box labels and the right column the model’s predictions. Labels (a) to (f) denote the defect types: crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches. For (a), the difference in pixel appearance between defective and normal regions is minimal, making it the most challenging type to predict; the model also produced the least stable bounding-box coordinates for this type, confirming the difficulty of distinguishing this defect. In the case of (b), the model identified defects that were not originally labeled, suggesting that the original data may have been inaccurately labeled while also showing the model’s ability to detect even subtle defects. For the remaining defect types, the model displayed stable detection performance, especially for types (c) and (f), where it accurately identified multiple objects, demonstrating strong detection capabilities across various defect scenarios.
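The inference-time filtering described above (NMS at 0.5, confidence above 0.7) can be sketched with torchvision’s Faster R-CNN implementation; the COCO weights and random input below are placeholders, not the model trained in this study:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Detector with the NMS IoU threshold set to 0.5, as in the experiments.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT", box_nms_thresh=0.5)
model.eval()

image = torch.rand(3, 200, 200)  # placeholder for a 200x200 steel image
with torch.no_grad():
    output = model([image])[0]

# Keep only boxes whose confidence score exceeds 0.7 for display.
keep = output["scores"] > 0.7
boxes, labels = output["boxes"][keep], output["labels"][keep]
```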
In the surface pixel defect mapping process, the performance of Faster R-CNN combined with various backbone networks was assessed using precision, recall, AP, and mAP.
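For reference, these metrics follow their standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, p(r) is the precision-recall curve, and N is the number of defect classes:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{AP} = \int_0^1 p(r)\,dr, \qquad
\mathrm{mAP} = \frac{1}{N} \sum_{c=1}^{N} \mathrm{AP}_c
```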
Table 5 summarizes the defect detection performance of Faster R-CNN with three different backbone networks: VGG-16, Inception-V2, and ResNet-50. The table lists the AP for each defect type, and the last column gives the mAP of each model. Inception-V2 exhibited slightly higher AP values for the patches and pitted-surface defects, but ResNet-50 achieved the highest AP across most classes and the highest mAP, excelling in particular at detecting rolled-in scale and scratch defects. With a peak mAP of 0.766, the ResNet-50-based model is the most reliable backbone network for defect detection.
Based on the results presented in Table 6, Faster R-CNN, SSD, and YOLOv4 are compared on the NEU-DET dataset in terms of detection performance (mAP), speed (FPS), and computational complexity (GFLOPs). The analysis highlights the trade-off between accuracy and efficiency across one-stage and two-stage models. The one-stage models, SSD and YOLOv4, offer higher processing speeds and lower computational costs than the two-stage Faster R-CNN, but their detection performance is significantly inferior. This trade-off between speed and accuracy is often a key factor in choosing a model for real-time applications. Faster R-CNN is better suited to tasks in which the accuracy of the bounding box determines the segmentation quality, as it excels at accurately identifying and localizing objects. This capability is particularly important in scenarios requiring detailed analysis of object boundaries, where even slight inaccuracies in bounding-box placement can lead to significant performance degradation.
Table 7 compares precision, recall, and AP for steel surface defects using the Faster R-CNN model with the ResNet-50 backbone, which produced the best results among the three models. The AP for the crazing defect was the lowest at 0.6528, reflecting the difficulty of identifying this defect, which often closely resembles normal pixels; consequently, the model struggles to distinguish it clearly, lowering both precision and recall. Future research could improve detection by applying augmentation techniques that enhance defect patterns or by using multi-scale approaches to better discern the shape, size, and brightness of defects.
The bounding boxes detected by Faster R-CNN were refined for weakly supervised segmentation: certain regions within the boxes were excluded, and the GrabCut algorithm was applied. GrabCut is an image segmentation technique that separates foreground from background based on minimal user input. To create an initial mask, the bounding box is first shrunk to 80% of its original size about its center, bringing its boundary closer to the object. GrabCut then generates a binary mask by marking the interior of the shrunken bounding box as foreground and the exterior as background. The resulting initial mask, shown in Figure 13c, is then fed into the DeepLabv3+ model for inference. The DeepLabv3+ model used in this process was pre-trained on large-scale datasets designed for semantic segmentation, such as PASCAL VOC and Cityscapes.
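The initial-mask construction can be sketched with OpenCV’s GrabCut implementation; the 80% shrink factor follows the text, while the iteration count is our assumption:

```python
import cv2
import numpy as np

def initial_mask_from_box(image, box, shrink=0.8, iters=5):
    """Shrink the bounding box to 80% of its size about its center,
    then run GrabCut with the shrunken box as the foreground rectangle."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * shrink, (y2 - y1) * shrink
    rect = (int(cx - w / 2), int(cy - h / 2), int(w), int(h))  # (x, y, w, h)

    mask = np.zeros(image.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image, mask, rect, bgd, fgd, iters, cv2.GC_INIT_WITH_RECT)

    # Binary mask: definite or probable foreground becomes 1.
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return fg.astype(np.uint8)
```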
Subsequently, the initial mask is input into the DeepLabv3+ model, which uses ResNet-101 as its backbone network. The segmentation model was trained recursively over 10 rounds, with 5 epochs per round, on the original images together with the masks derived from the shrunken bounding boxes. From the second round onward, the predicted mask is compared with the ground truth from the previous round; if the predicted mask covers a larger area, the previous ground truth is retained. In the first round, the mask from the shrunken bounding box serves as the initial ground truth. This scheme ensures that the model progressively refines its predictions and prevents over-prediction by keeping incorrect predictions from accumulating.
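Read as code, this update rule might look like the sketch below; the branch that adopts the new prediction when it does not enlarge the mask is our reading of the text, and `train_one_round` and `predict_masks` are hypothetical helpers:

```python
def update_ground_truth(prev_gt, pred_mask):
    # If the predicted mask covers a larger area than the previous
    # round's ground truth, retain the previous ground truth to guard
    # against over-prediction; otherwise adopt the prediction (assumed).
    return prev_gt if pred_mask.sum() > prev_gt.sum() else pred_mask

gt_masks = initial_grabcut_masks              # round 1 ground truth
for round_idx in range(10):                   # 10 rounds in total
    train_one_round(model, images, gt_masks, epochs=5)   # hypothetical
    preds = predict_masks(model, images)                 # hypothetical
    gt_masks = [update_ground_truth(g, p) for g, p in zip(gt_masks, preds)]
```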
Figure 13 illustrates the step-by-step outcomes from bounding box to segmentation: (a) shows the original image from the dataset, (b) the ground-truth bounding-box label, (c) the initial GrabCut mask obtained with the shrunken bounding box, (d) the result after 5 rounds of training, and (e) the result after 10 rounds. The visualization demonstrates that, starting from the initial bounding-box-based mask, the defect boundaries become progressively more refined through recursive learning. A comparison of images (d) and (e) in the second row indicates a reduction in noise over the rounds. Although the final round did not capture every defect pixel, the segmentation mask delineates the defect’s shape and extent more accurately than the bounding-box annotation.