1. Introduction
Remanufacturing is gaining attention for its significant role in establishing a sustainable circular economy [1]. This interest stems mainly from its capacity for resource efficiency and waste reduction. A critical challenge in remanufacturing is aligning production planning and control with remanufacturing processes. Guide [2] extensively discussed the research needs in this domain, highlighting the gap between industry practice and theoretical development. Additionally, the development of a remanufacturing supply chain management system, as exemplified in the case study by Zhu and Tian [3], demonstrates the practical applications and benefits of remanufacturing in industry. The evolution of remanufacturing processes is further emphasized by Tolio et al. [4] and Caterino et al. [5], who highlighted the integration of cutting-edge systems such as cloud technologies in remanufacturing.
One crucial step in remanufacturing is inspecting end-of-life (EOL) products, which is mostly carried out manually by skilled workers [6]. However, with promising achievements in computer vision applications, researchers are now focusing on automating this step by adopting intelligent defect detection approaches that recognize faulty regions on EOL parts requiring restoration [7,8,9].
Defect detection on metal surfaces [10] is an important research area that is gaining attention from the scientific community as industry moves toward more sustainable practices. Defects on metal surfaces occur in several types, including cracks, scratches, inclusions, corrosion, spots, and patches [11]. Such defects can affect the quality and performance of the product. Chen et al. [12] conducted an extensive exploration of traditional and deep learning-based methods for surface defect detection, underscoring the evolution from basic feature analysis to sophisticated neural architectures capable of handling complex defect patterns. Furthermore, Jin et al. [13] extended the discussion to the application of machine learning in solid mechanics, suggesting a broader context where these computational models contribute to predictive maintenance and quality assurance in manufacturing processes. Thus, there have been many efforts to develop and adopt different models for efficient and accurate detection of such defects.
Some researchers have focused on developing novel deep learning architectures targeting a specific environment and dataset [14,15,16,17,18]. Others adopted and fine-tuned existing models developed and trained for general object detection [7,8,9,19,20,21,22]. A more recent approach improves a well-performing existing model using hybrid methods [23,24,25,26]. A behavioral study of a state-of-the-art object detection model on metal surface defects is therefore important for decision support in selecting the repair process, and this paper investigates such a model for the defect detection task.
Many novel approaches are based on convolutional neural networks (CNNs). CNNs are robust networks for extracting embedded features from images, such as corners and edges, which makes them widely adopted in object detection [27]. Tao et al. [14] developed a detection pipeline based on a cascaded autoencoder (CASAE) and a CNN. This architecture uses the CASAE to localize and segment the defect, whereas the CNN classifies the defect type. Moreover, Parvez et al. [15] and Han et al. [16] designed similar networks that used a CNN for feature extraction and a fully connected neural network (FCNN) for the classification step. Both focused on detecting defects in additively manufactured parts, including cracks, porosity, and lack of fusion. In addition, Xu et al. [17] developed a novel self-supervised efficient defect detector (SEDD) that aims to eliminate the data annotation step by using a homographic enhancement method. They also designed a custom detector based on depth-wise convolution layers and an attention module to enhance performance and accuracy.
Another approach is to apply state-of-the-art object detectors to defect detection. This approach is known as transfer learning, where a model initialized with weights learned on one task is trained on a new dataset to perform a related task [28]. Zheng et al. [7,19] adopted the mask region-based CNN (Mask R-CNN) architecture for detecting and segmenting the damaged area on metal parts for further repair applications. In addition, Zheng et al. [20] investigated the performance of the Faster R-CNN, YOLOv3, and RetinaNet architectures on rail crack detection using transfer learning from the COCO dataset [29]. Furthermore, Konovalenko et al. [21] and Litvintseva et al. [22] based their defect detection models on the U-Net architecture. In [8,9], Imam et al. applied transfer learning (from the COCO dataset) to Faster R-CNN, one of the best-known two-stage models, to detect and localize wear on steel parts for a remanufacturing task.
A recent work by Li et al. [
23] proposed a new method that leverages and enhances CSPDarknet53 architecture by integrating a multi-head self-attention block and using it as the backbone of their detector. They also adopted some simple yet effective techniques to enhance the model’s performance, such as augmentation and grayscale filtering. Moreover, Pan et al. [
24] integrated a dual attention module with DeepLabv3+ architecture to detect and segment metal defects. Furthermore, Wang et al. [
25] developed a new module that follows a similar pipeline to ResNet based on Transformers and CNNs to retain both local and global information. Moreover, Gao et al. [
26] adopted the Swin Transformer with a variant shift window called Cas-VSwin Transformer as the backbone network for better performance on the defect detection task. In addition, they used the feature pyramid network (FPN) as the neck, whereas the Cascade Mask network is the head of the architecture.
A defining feature of YOLO detectors is real-time inference, which has earned them considerable attention in object detection. Given the speed requirements of defect inspection on production lines, many researchers have investigated and adopted YOLO for defect detection. In addition, the YOLO series provides the lowest processing time among state-of-the-art detectors. However, higher detection accuracy is achieved with enhanced versions of the YOLO series that integrate other networks, such as self-attention mechanisms. Li et al. [30] developed an improved version of the YOLO network using only CNN layers. This method performed well; however, no comparison with the standard YOLO network was conducted.
Furthermore, Kou et al. [
31] proposed a new detector based on YOLOv3 architecture with an anchor-free feature selection method and custom dense convolution blocks. This improves the training process and inference accuracy by about 26.7% on the NEU-DET dataset. In addition, Xu et al. [
32] modified the YOLOv3 network to improve accuracy by focusing on extracting more features of small defects using a new scale feature layer. With this, an improvement of 4.6% in precision is gained on the Tianchi dataset. On the other hand, Guo et al. [
33] introduced an improved architecture of YOLOv5 with a Transformer encoder as the backbone of the network to integrate more global information into the model. This model was tested with the NEU-DET dataset and achieved 75.2% mean average precision (mAP) on inference. Saiz et al. [
34] also proposed combining YOLOv5 and DeepLabV3+ models, which provides stable and reliable defect detection in components, overcoming the limitations of traditional methods. This ensemble approach, tailored to specific use cases, achieved full accuracy in overall performance testing, proving highly effective for practical applications in the manufacturing industry.
It is essential to acknowledge the complexities introduced by the diverse range of defect types, such as scratches, dents, and corrosion, each varying significantly in shape, size, and visibility [
14]. Traditional computer vision techniques, while less computationally demanding, often fall short in flexibility and adaptability, struggling with high noise levels and variability in defect manifestations, which are common in industrial settings [
35]. Conversely, deep learning models, despite their superior performance in learning from large datasets and generalizing across different defects, require extensive computational resources and substantial amounts of labeled data, which complicates their application in real-life industrial scenarios [
36]. One of the most promising solutions for industries is the YOLOv7 object detection model, which aims to bridge these gaps by enhancing detection accuracy while meeting the operational demands of real-time industrial applications [
37].
YOLOv7 is the current state-of-the-art for real-time object detection tasks in terms of both speed and accuracy [
37]. However, this benchmark is based on the Microsoft COCO dataset. A benchmark of YOLOv7 on the defect datasets is essential to observe and investigate its performance on metal defect detection tasks. This paper focuses on adopting the state-of-the-art model, YOLOv7, for detecting defects and damages on metal surfaces and investigating the model’s behavior on some public datasets of metal defects. This can also be used as a reference for future comparisons of the enhanced models based on YOLOv7 with the standard. Furthermore, a comparison of the existing defect detection models with the YOLOv7 is conducted and studied. A case study validation is also conducted on synthetic data inspired by an industrial setting, demonstrating the real-world application and efficacy of the YOLOv7 model in identifying and classifying defects.
2. YOLOv7 Architecture
YOLOv7 is the latest improved model of the YOLO series. YOLO models, including YOLOv7, are single-stage frameworks with three main components: backbone, neck, and head [38,39,40,41,42,43,44], as shown in Figure 1. The backbone extracts feature maps from an image and passes them to the neck layers, where they are combined, fused, and passed to the next layers. The head network then predicts the objects’ bounding boxes and their classes. Unlike two-stage models, the single-stage YOLO series reformulates object detection as a regression problem instead of a classification problem, which is the key feature of real-time detection algorithms [45].
The efficiency of the CNNs in the backbone network is important for the inference process. Thus, the authors of YOLOv7 proposed an enhanced method based on the efficient layer aggregation network (ELAN), named Extended-ELAN (E-ELAN). This method improves the ELAN architecture to boost the learning ability of a scaled network without disturbing or changing the original gradient propagation path, giving E-ELAN a stronger ability to learn varied features.
Furthermore, YOLOv7 comes with a new method of scaling for concatenation-based models. The proposed method, named corresponding compound model scaling, addresses the issue of a larger width output of the computational block when depth scaling is performed on the architecture. With the proposed method, the depth of the concatenation-based model is scaled directly; however, the width of transition layers is scaled with a corresponding factor calculated from the change of the output width of the block.
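The arithmetic behind this scaling rule can be illustrated with a minimal sketch. It assumes a simplified concatenation-based block whose output width equals the number of concatenated branches times a fixed per-branch channel count. The function names, the rounding rule, and the example numbers are hypothetical and are not taken from the YOLOv7 implementation.

```python
def block_output_width(branch_channels: int, num_branches: int) -> int:
    """Output width of a block that concatenates `num_branches` feature maps."""
    return branch_channels * num_branches


def scale_concat_block(branch_channels: int, base_branches: int,
                       transition_width: int, depth_factor: float):
    """Scale depth directly, then scale the transition layer by the
    resulting change in the block's output width (assumed simplification)."""
    # Depth scaling: more stacked layers means more concatenated branches.
    scaled_branches = int(round(base_branches * depth_factor))

    old_width = block_output_width(branch_channels, base_branches)
    new_width = block_output_width(branch_channels, scaled_branches)

    # The transition layer is widened by the same ratio as the block output.
    width_ratio = new_width / old_width
    scaled_transition = int(round(transition_width * width_ratio))
    return scaled_branches, new_width, scaled_transition


# Hypothetical example: 4 branches of 64 channels, transition width 256,
# depth scaled by 1.5.
result = scale_concat_block(branch_channels=64, base_branches=4,
                            transition_width=256, depth_factor=1.5)
print(result)
```

In this toy setting, scaling the depth alone would leave the transition layer receiving a wider input than it was designed for; tying its width to the measured change in block output keeps the two consistent.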
Moreover, several techniques have been introduced in YOLOv7 to improve inference accuracy while maintaining a low training cost. These strategies are called a bag of freebies (BoF) and include planned re-parameterization and dynamic label assignment. After a thorough investigation of how re-parameterized convolution behaves when combined with various networks, the authors showed an increase in the model’s accuracy when using RepConv without the identity connection (RepConvN). Further, in deep supervision techniques, the head that produces the final prediction of the model is called the lead head, whereas the auxiliary head is the head used to assist the lead head during training. Previously, both heads were independent of each other, and each used its own predictions and the ground truth to generate soft labels for label assignment. YOLOv7 instead proposed lead-dependent label assignment, with two types of label assigners. The first is lead head guided, where the soft label is generated mainly from the lead head’s prediction and the ground truth. The second is coarse-to-fine lead head guided, where two different soft labels are produced: a fine label and a coarse label. The fine label is the same as the one generated by the lead head guided assigner, whereas the coarse label is generated with relaxed rules in the positive sample assignment process.
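The idea behind re-parameterized convolution can be sketched for a single channel. RepConv combines a 3×3 convolution, a 1×1 convolution, and an identity connection in one layer; because convolution is linear, such parallel branches can be merged into one 3×3 kernel for inference. The sketch below is an illustration under simplifying assumptions (one channel, stride 1, zero padding, no batch normalization or bias), and all helper names are hypothetical. Calling merge_branches with with_identity=False corresponds to the RepConvN variant, which omits the identity branch.

```python
def conv2d_same(image, kernel):
    """Single-channel correlation with zero padding; kernel is k x k, k odd."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized feature maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def merge_branches(kernel_3x3, kernel_1x1, with_identity=False):
    """Fold the 1x1 branch (and optionally identity) into the 3x3 kernel."""
    merged = [row[:] for row in kernel_3x3]
    merged[1][1] += kernel_1x1[0][0]   # a 1x1 kernel acts on the centre tap
    if with_identity:
        merged[1][1] += 1.0            # identity is a 1x1 kernel of value 1
    return merged


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.5, 0.0, -0.5], [1.0, 0.25, -1.0], [0.5, 0.0, -0.5]]
k1 = [[2.0]]

# Training-time form without identity: two parallel branches, summed.
two_branch = add_maps(conv2d_same(image, k3), conv2d_same(image, k1))
# Inference-time form: a single merged 3x3 convolution.
merged_out = conv2d_same(image, merge_branches(k3, k1))

# Same check with the identity branch included.
three_branch = add_maps(two_branch, image)
merged_id_out = conv2d_same(image, merge_branches(k3, k1, with_identity=True))

print(max_abs_diff(two_branch, merged_out))
print(max_abs_diff(three_branch, merged_id_out))
```

The merged kernel reproduces the multi-branch output, so the extra branches add no cost at inference time.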
Other BoF techniques are also adopted in YOLOv7, such as batch normalization, implicit knowledge, and the exponential moving average (EMA) model. Batch normalization, whose mean and variance are integrated into the bias and weight of the convolutional layer, has been shown to directly benefit training by allowing a higher learning rate and faster convergence [46]. Another technique is implicit knowledge, adopted from YOLOR [44], which is computed as a vector at the inference stage of YOLOv7 and improved prediction accuracy in previous versions. Lastly, adopting the EMA model as the final inference model in YOLOv7 has improved inference accuracy.
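Two of these techniques reduce to short formulas, sketched below for a single scalar channel. The first function shows the standard way of folding batch-normalization statistics into a preceding layer's weight and bias; whether this matches the exact procedure used in YOLOv7 is an assumption here. The second shows a plain exponential moving average of model weights. The function names, the eps value, and the decay value are illustrative choices, not taken from the YOLOv7 code.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta into the
    preceding linear layer y = weight * x + bias (scalar, per channel)."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


def ema_update(ema_weights, model_weights, decay=0.999):
    """One EMA step: keep `decay` of the running average and blend in the
    current weights with the remaining (1 - decay)."""
    return [decay * e + (1.0 - decay) * m
            for e, m in zip(ema_weights, model_weights)]


# Hypothetical numbers for one channel.
w, b = 0.5, 0.1
gamma, beta, mean, var = 2.0, -0.3, 0.4, 0.25
x = 2.0

# Layer followed by batch normalization (two steps).
y = w * x + b
bn_output = gamma * (y - mean) / math.sqrt(var + 1e-5) + beta

# Equivalent single layer after folding.
w_folded, b_folded = fold_batchnorm(w, b, gamma, beta, mean, var)
folded_output = w_folded * x + b_folded

ema = ema_update([1.0], [0.0], decay=0.9)
print(abs(bn_output - folded_output), ema)
```

Folding removes the separate normalization step at inference, while the EMA copy of the weights smooths out step-to-step fluctuations from training.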
At the time of its release, YOLOv7 was reported to outperform existing models in the object detection task in terms of both speed and accuracy [37]. According to the authors, the focus of YOLOv7 has been to optimize the training process for enhanced detection accuracy and speed, as well as to improve the inference process. This optimization includes reducing the number of model parameters and enhancing the learning process.
In contrast to the evaluation performed in [
37], which establishes the general superiority of YOLOv7 in object detection, this paper uniquely contributes to the field by explicitly investigating the performance of YOLOv7 in detecting defects on metal surfaces. While previous benchmarks, such as the one on the Microsoft COCO dataset, provide a broad understanding of YOLOv7’s capabilities, this work offers a specialized assessment of its applicability and efficiency in the context of metal surface defect detection. This evaluation is essential, as the characteristics and requirements of defect detection on metal surfaces differ significantly from the more general object detection tasks. Therefore, this analysis reinforces the findings regarding YOLOv7’s overall performance and extends its utility to the specific domain of defect detection, providing valuable insights for industrial applications.