1. Introduction
Castings are metal parts created using various casting methods. Traditional casting, also known as the liquid forming method, involves pouring liquid metal into a mold and allowing the part to solidify as it cools. Currently, to meet precision requirements, specialized casting technologies are commonly utilized, such as metal mold casting, pressure casting, vacuum suction casting, and other rapid forming methods [1]. Owing to the complexity and diversity of the casting process, surface defects in castings are unavoidable. These defects can take various forms, such as cracks, blowholes, porosities, and sand holes, and they can have severe consequences for subsequent processing, shortening service life and posing risks to both the machine and its users. Therefore, surface-defect detection on castings has been an active research focus [2].
Traditional defect-detection methods rely on human visual inspection or on machine vision devices, such as industrial cameras and lenses, for identification and detection. These devices use light sources and image-acquisition equipment to collect images and determine whether defects are present. The process involves image preprocessing, feature extraction, template matching and positioning, and positive and negative classification [3]. Currently, intelligent equipment manufacturing companies typically use Cognex’s machine vision software, VisionPro (9.0), owing to its flexibility, power, and large collection of image-processing algorithms [4]. However, traditional defect-detection methods are mainly based on manually designed features and conventional machine learning, which limits them in complex scenes compared to deep learning. Furthermore, they rely heavily on manual work, which is influenced by the inspector’s experience, technical skill, and proficiency, resulting in low efficiency and accuracy. Additionally, long working hours may cause visual fatigue among inspectors, leading to missed detections and false alarms.
In recent years, with the continuous development of deep learning, deep learning-based defect detection methods have gradually replaced traditional detection methods. Currently, there are two main types of deep learning-based detection algorithms: one-stage algorithms, such as SSD [5] and YOLO [6,7]; and two-stage algorithms, such as Mask R-CNN [8]. Two-stage algorithms have been used for detecting long-distance small targets [9], and one-stage algorithms have been used for fast identification and detection [10], as well as wire-insulator fault detection and foreign-object detection [11]. Lan et al. [12] proposed a new lightweight model named Swin-T YOLOX, which combines the advanced detection network YOLOX [13] with a robust Swin Transformer backbone [14]. Hurtik et al. [15] presented a new version of YOLO called Poly-YOLO, which builds on YOLOv3 and addresses its issues by aggregating features from a lightweight SE-Darknet-53 backbone using a hypercolumn technique, employing stair-step upsampling, and generating a single high-resolution output scale. Compared to YOLOv3, Poly-YOLO has only 60% of its trainable parameters yet improves the mean average precision by 40%.
In conclusion, casting surface-defect detection technology has developed over a long period, and traditional detection methods have gradually been replaced by deep learning-based methods. Although researchers have studied key issues such as high accuracy and lightweight design extensively, achieving high precision and a lightweight model simultaneously remains challenging.
This study proposes a casting surface-defect detection method based on the YOLOv8 algorithm, named SLGA-YOLO, which improves on the YOLOv8 base model. Firstly, the neck module is optimized using SlimNeck, which significantly reduces the number of parameters while maintaining sufficient accuracy. Secondly, SimAM and LSKA are fused in the backbone to strengthen the attention mechanism and improve the algorithm’s focus on the focal region. Then, to achieve high accuracy while still meeting the lightweight requirement, the YOLO-P2 structure replaces the original model to enable detection with four coexisting detection heads, and on this basis a novel GCML module is designed to replace part of the base convolutional block (CBS) so as to make full use of the extracted feature information. Finally, the Alpha-EIoU loss function is constructed to maintain sufficient flexibility and strong generalization. The main improvements are summarized as follows:
SlimNeck is used to optimize the model neck module, reducing model complexity whilst increasing accuracy.
The fusion of SimAM and LSKA strengthens the attention mechanism, enhancing the model’s ability to extract three-dimensional attention weights as well as its multiscale feature extraction capability.
YOLO-P2 replaces the original model structure to improve the efficiency of small-target detection. At the same time, some of the basic convolutional blocks (CBSs) are replaced with the custom-designed GCML module to improve the model’s convergence speed and enhance its feature extraction capability.
The Alpha-EIoU loss function is constructed to accelerate the regression fitting process of the real frame and prediction frame, thus maintaining sufficient flexibility and strong generalization.
3. Design for SLGA-YOLO
In order to meet the requirements of fast speed and high precision for casting surface-defect detection, the improved algorithm model (SLGA-YOLO), as shown in Figure 1, includes three major parts: the backbone, the neck, and the prediction head.
Firstly, by employing the SlimNeck constructed from GSConv and VoV-GSCSP to replace the neck of the model, i.e., connecting the standard backbone to the SlimNeck, we aimed to enhance the model’s running speed and reduce its computational complexity, enabling it to detect defects promptly. Secondly, because the collected casting surface-defect dataset contains mixed defects, the base model’s ability to extract key feature information is weak. To solve this problem, we integrated SimAM and LSKA to strengthen the attention mechanism: SimAM was incorporated into the backbone network so that the network focuses on key information during training, and embedding LSKA in the SPPF improved the model’s perceptual range over input features. In addition, on top of the added P2 detection layer, we propose a lightweight module called GCML to replace part of the base convolutional blocks (CBS) in order to reduce the redundancy of the feature maps in the neural network and better exploit the feature information. Finally, to accelerate model convergence while providing enough flexibility to improve model accuracy, we designed the Alpha-EIoU loss function. In comparison to YOLOv8, this method provides an effective balance between a lightweight model and model accuracy.
3.1. Improved YOLOv8 Neck Module Based on SlimNeck
In actual defect detection, high detection speed is required of the algorithm. The neck sits between the backbone and the prediction head and is employed primarily for feature fusion and enhancement. SlimNeck, by contrast, can achieve an optimal balance between feature fusion and enhancement, accuracy, and speed. To improve detection speed without reducing accuracy, an improved method based on SlimNeck [25] is proposed.
The YOLOv8 model utilizes the CBS standard convolutional block and C2F module in the neck as the feature information processing module. However, this module fails to balance the accuracy and speed of the model. To improve the accuracy and reduce the number of model parameters without reducing the feature fusion capability, the GSConv convolution block and VoV-GSCSP module from SlimNeck were introduced to replace the CBS and C2F in the original model. GSConv is a lightweight convolution that is computationally inexpensive and has a high feature fusion capability compared to CBS.
In GSConv, the input undergoes a standard convolution followed by a depth-wise convolution. The results of the two convolutions are then concatenated, and a shuffle operation matches the corresponding channels before the output is produced. VoV-GSCSP utilizes a cross-layer network structure with fewer parameters, maintaining accuracy while reducing computational complexity and simplifying the network structure. The structures of VoV-GSCSP, GSConv, and DWConv can be seen in Figure 2.
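The GSConv operation described above (standard convolution, cheap depth-wise convolution, concatenation, channel shuffle) can be sketched in PyTorch. This is a minimal illustration based on the description in the text, not the authors’ exact implementation; the 5×5 depth-wise kernel and the SiLU activation are assumed choices.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of GSConv: a standard conv produces half the output channels,
    a cheap depth-wise conv refines them, the two halves are concatenated,
    and a channel shuffle mixes information across the two branches."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(  # depth-wise conv (groups == channels)
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y1 = self.dense(x)
        y2 = self.cheap(y1)
        y = torch.cat((y1, y2), dim=1)
        # channel shuffle: interleave the channels of the two branches
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 32, 16, 16)
out = GSConv(32, 64)(x)
print(out.shape)  # torch.Size([1, 64, 16, 16])
```

The shuffle step is what lets the cheap depth-wise branch’s information penetrate the densely convolved channels without an extra convolution.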
In conclusion, we propose using the GSConv-based SlimNeck to replace the neck of the model. The GSConv operation reduces redundant and duplicated information without requiring compression, shortening inference time while maintaining accuracy. Replacing the neck with VoV-GSCSP conveys deep semantic features from the top down, reduces the number of module parameters, and lowers the computational burden.
3.2. Attention Module
The basic YOLOv8 model belongs to the one-stage detection algorithms and offers improved detection speed; its backbone network links multiple CBS and C2F structures to increase the model’s ability to extract graphical feature information. However, the model’s ability to extract input features is weak: it struggles to dig deep into details and therefore tends to ignore deep feature information, which reduces its sensitivity to small targets and leaves the extraction of effective feature information incomplete. At the same time, it extracts a large amount of redundant information, increasing the computational burden; it cannot effectively capture global relationships in the data; and its resistance to noise interference is weak. The backbone network is the basis of the model and is responsible for extracting features from the input image, and these features are the prerequisite for target detection in the subsequent network layers. To solve these problems, we propose a novel fused attention enhancement mechanism that uses LSKA and SimAM jointly to further improve performance, suppress interfering information, continuously focus on key region information, and improve the detection ability of the algorithm. By fusing LSKA and SimAM to strengthen the attention mechanism, SLGA-YOLO can pay more attention to the key features in defect images.
3.2.1. LSKA Attentional Mechanism
LSKA (Large Separable Kernel Attention) [26] is an improvement of the LKA (Large Kernel Attention) [27] module. The LKA module has been shown to perform well in multi-class target detection tasks; however, its depth-wise convolution layer leads to computational inflation. To address this issue, LSKA decomposes the 2D convolution kernels of the depth-wise convolution layers into cascaded 1D convolution kernels, which are then used in the attention module in place of the large-kernel depth-wise convolution layer. This approach significantly reduces the amount of computation required without impacting performance. Experimental results show that the LSKA attention mechanism markedly reduces computational complexity and memory footprint as the kernel size grows and outperforms the LKA module in areas such as detection and recognition. The structure is illustrated in Figure 3a.
The output of the LSKA attention mechanism is formulated as shown in Equations (1)–(4):

$$\overline{Z}^{C} = W_{(2d-1)\times 1}^{C} * \left( W_{1\times(2d-1)}^{C} * F^{C} \right) \tag{1}$$

$$Z^{C} = W_{\lceil k/d \rceil \times 1}^{C} * \left( W_{1\times \lceil k/d \rceil}^{C} * \overline{Z}^{C} \right) \tag{2}$$

$$A^{C} = W_{1\times 1} * Z^{C} \tag{3}$$

$$\overline{F}^{C} = A^{C} \otimes F^{C} \tag{4}$$

where $F \in \mathbb{R}^{C\times H\times W}$ is a given input feature map; $H$ and $W$ are the height and width of the feature map, respectively; and $C$ is the number of input channels. Moreover, $*$ and $\otimes$ represent convolution and the Hadamard product, respectively; $Z$ is the output of the depth-wise convolution; $A$ is the attention map; $d$ is the dilation rate; and $k$ is the maximum receptive field.
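The kernel decomposition used by LSKA can be sketched in PyTorch: each large 2D depth-wise kernel becomes a horizontal 1D kernel followed by a vertical one, first at the local scale and then at a dilated scale, with a final 1×1 convolution producing the attention map. This is a minimal sketch; the kernel size k = 23 and dilation d = 3 are assumed defaults, not values stated in the text.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Sketch of LSKA: cascaded 1-D depth-wise convs replace the 2-D
    large-kernel depth-wise convs of LKA; the resulting attention map
    gates the input via the Hadamard product."""
    def __init__(self, dim, k=23, d=3):
        super().__init__()
        # local depth-wise part: (2d-1) kernel split into 1x(2d-1) and (2d-1)x1
        self.h0 = nn.Conv2d(dim, dim, (1, 2 * d - 1), padding=(0, d - 1), groups=dim)
        self.v0 = nn.Conv2d(dim, dim, (2 * d - 1, 1), padding=(d - 1, 0), groups=dim)
        # dilated depth-wise part covering the remaining receptive field
        m = k // d  # assumed split; floor vs ceil varies with configuration
        pad = (m // 2) * d
        self.h1 = nn.Conv2d(dim, dim, (1, m), padding=(0, pad), dilation=d, groups=dim)
        self.v1 = nn.Conv2d(dim, dim, (m, 1), padding=(pad, 0), dilation=d, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)  # 1x1 conv -> attention map

    def forward(self, x):
        a = self.v0(self.h0(x))
        a = self.v1(self.h1(a))
        return self.pw(a) * x  # Hadamard product with the input

x = torch.randn(2, 8, 20, 20)
out = LSKA(8)(x)
```

Replacing a k×k depth-wise kernel with two 1D kernels drops the per-layer parameter count from O(k²) to O(k), which is the source of the computational savings described above.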
SPPF is an improvement on SPP [28] and is a simpler, less computationally intensive model than the original structure. The SPPF module is used in YOLOv8 to capture multi-scale information by performing pooling operations of various degrees on the input feature maps. The LSKA attention mechanism enables the model to focus on important parts of the input features and improves overall performance and efficiency. To enhance the model’s performance and capture the global feature relationships within the feature map, we introduce LSKA after the Concat operation, so that the LSKA attention mechanism is applied after SPPF completes its operation. This improves the model’s perceptual range and its ability to model input features. The structure is shown in Figure 3b.
3.2.2. SimAM Attention Mechanism
SimAM [29], or Simple Attention Mechanism, is a lightweight, parameter-free attention mechanism based on neuroscience theory. It is designed around an energy function, and its output is given in Equation (5):

$$\widetilde{X} = \mathrm{sigmoid}\!\left(\frac{1}{E}\right) \otimes X \tag{5}$$

where $E$ is the energy function grouped across channel and spatial dimensions, and $X$ is the input feature layer.
The attention mechanism embedded in the model does not introduce any additional parameters. It directly estimates the weights of the three-dimensional features, resulting in faster inference than attention modules such as CBAM [30]. Additionally, it improves baseline model performance with stable results; a visual representation of this mechanism is shown in Figure 4.
The features of casting surface defects are not easily discernible, making it challenging to extract crucial information. SimAM can adaptively adjust the feature-mapping weights and pay more attention to local areas, which improves the feature representation and classification ability of the model. Because the neck performs Concat operations on layers 4, 6, and 9 of the backbone, introducing SimAM after these layers improves the model’s feature extraction and fusion abilities without adding computational burden. With SimAM, SLGA-YOLO pays more attention to the key features related to casting surface defects in the image and better captures fine-grained details, thereby improving detection accuracy. This enhancement enables the model to better handle the difficulties encountered in the casting surface-defect detection task, such as changes in environmental conditions, lighting, and viewing angles.
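Because SimAM is parameter-free, its weighting can be written as a short NumPy function. The sketch below follows the commonly published SimAM formulation (per-channel mean and variance feed a minimal-energy estimate whose inverse is sigmoid-gated); the regularization coefficient `lam` is an assumed default, not a value from the text.

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free SimAM weighting sketch for x of shape (B, C, H, W):
    per-channel statistics give each pixel an inverse-energy score,
    which is passed through a sigmoid and used to rescale the input."""
    b, c, h, w = x.shape
    n = h * w - 1
    mu = x.mean(axis=(2, 3), keepdims=True)          # channel mean
    d = (x - mu) ** 2                                # squared deviation
    var = d.sum(axis=(2, 3), keepdims=True) / n      # channel variance
    # inverse minimal energy: larger values mark more salient pixels
    inv_e = d / (4 * (var + lam)) + 0.5
    return x / (1 + np.exp(-inv_e))                  # sigmoid gate * x

x = np.random.randn(2, 3, 8, 8)
out = simam(x)
```

Since the sigmoid gate lies strictly in (0, 1), the output never exceeds the input in magnitude; the mechanism only attenuates less salient activations rather than amplifying anything.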
3.3. Optimization Model Based on GCM Module
To enhance detection accuracy, we incorporated the YOLOv8-P2 detection layer. However, the additional detection head increases computational complexity. To balance accuracy and lightweight design, we developed the GCML module based on GhostNet. The GhostConv module first aggregates information features between channels and then employs grouped convolution to generate new feature maps, producing more feature maps with less computation. To ensure that the generated feature maps retain key feature information, we insert the LSKA attention mechanism before the concatenation operation, enabling the module to focus on the key information in the input feature maps while ignoring most interfering factors. As shown in Figure 5, part of the CBS module is replaced by the lightweight GCML (GhostConv-Mish-LSKA) module, which consists of the GC (GhostConv) [31] module, built on an improved CBM convolutional block, and the LSKA attention module.
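The GhostConv half of GCML can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors’ implementation: the 5×5 cheap kernel is an assumed choice, Mish is used per the CBM block described below, and the LSKA insertion before concatenation (part of the full GCML) is noted only in a comment.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of GhostConv with CBM-style blocks (Conv-BN-Mish):
    a primary conv makes half the output channels, then a cheap
    depth-wise (grouped) conv generates the remaining "ghost" maps,
    and the two halves are concatenated."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.Mish())
        self.cheap = nn.Sequential(  # grouped conv: one filter per channel
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.Mish())

    def forward(self, x):
        y = self.primary(x)
        # in the full GCML, LSKA attention would be applied here,
        # before the concatenation, to retain key feature information
        return torch.cat((y, self.cheap(y)), dim=1)

x = torch.randn(1, 32, 16, 16)
out = GhostConv(32, 64)(x)
```

The cheap depth-wise branch costs a small fraction of a full convolution, which is why GhostConv generates more feature maps for less computation.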
Activation functions are a crucial component of neural networks in deep learning. They apply a nonlinear transformation to a neuron’s output, which enhances the network’s expressive and nonlinear modeling abilities. YOLOv8 utilizes SiLU as its activation function. Although SiLU is smoother than ReLU, the limitations of the underlying Sigmoid function mean its advantages appear only in multi-layer neural networks, and its soft saturation can give rise to problems such as vanishing gradients and slow learning. To address these issues, we adopted the Mish [32] function as the activation function for SLGA-YOLO. Mish helps alleviate vanishing gradients, generalizes strongly, and provides smoother transitions. The expression of Mish is given in Equations (6) and (7):

$$\mathrm{Mish}(x) = x \cdot \tanh(\mathrm{softplus}(x)) \tag{6}$$

$$\mathrm{softplus}(x) = \ln(1 + e^{x}) \tag{7}$$
In the algorithm for detecting surface defects in castings, the feature input passes through the BN layer and is then processed by the Mish function. The improved convolution block CBM is shown in
Figure 6.
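Equations (6) and (7) translate directly into a few lines of Python, which makes Mish’s smooth, non-monotonic shape easy to verify numerically:

```python
import math

def softplus(x):
    # Equation (7): softplus(x) = ln(1 + e^x)
    return math.log1p(math.exp(x))

def mish(x):
    # Equation (6): Mish(x) = x * tanh(softplus(x))
    return x * math.tanh(softplus(x))

print(mish(0.0))  # 0.0
```

For large positive inputs Mish approaches the identity (like ReLU), while for negative inputs it stays slightly negative rather than clamping to zero, which preserves small gradients and counters the saturation problem discussed above.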
3.4. Build Loss Function Based on Alpha-EIoU
The loss function measures the discrepancy between the predicted and actual values of the model. The bounding box regression loss function is crucial for accurate detection results. In YOLOv8, IoU is used as the standard measure of target detection accuracy. The function is given in Equation (8):

$$IoU = \frac{|A \cap B|}{|A \cup B|} \tag{8}$$

where $A$ represents the area of the prediction box, $B$ represents the area of the real box, and $IoU$ is the ratio of the intersection area to the union area of the prediction box and the real box.
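Equation (8) corresponds to the following standard computation for axis-aligned boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2):
    intersection area divided by union area, as in Equation (8)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# two 2x2 boxes overlapping in a 1x1 region: union = 4 + 4 - 1 = 7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.142857...
```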
The surface defects of castings are complex and varied, and defect areas differ in size, so the detection algorithm requires better positioning results and faster convergence. The IoU loss function cannot effectively describe the objective of bounding box regression, resulting in inaccurate results and slow convergence. EIoU [33] separates the aspect-ratio influence factors of the predicted box and the real box and penalizes their widths and heights separately, which meets these requirements. Alpha [34] is a power parameter that provides flexibility for different levels of regression accuracy. To make the loss function more robust and flexible, we reconstruct Alpha-EIoU as the model’s bounding box regression loss function. The Alpha-EIoU loss function can expedite the regression fitting between the real box and the prediction box, thereby maintaining ample flexibility and robust generalization. EIoU and Alpha-EIoU are described in Equations (9) and (10):

$$L_{EIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{C_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{C_{h}^{2}} \tag{9}$$

$$L_{\alpha\text{-}EIoU} = 1 - IoU^{\alpha} + \frac{\rho^{2\alpha}(b, b^{gt})}{c^{2\alpha}} + \frac{\rho^{2\alpha}(w, w^{gt})}{C_{w}^{2\alpha}} + \frac{\rho^{2\alpha}(h, h^{gt})}{C_{h}^{2\alpha}} \tag{10}$$

The formula has three components: the IoU loss, the distance loss, and the height-width loss. $C_{w}$ and $C_{h}$ are the width and height of the minimum bounding box enclosing the prediction box and the real box; $c$ is the diagonal length of that minimum bounding box; $\rho(\cdot)$ is the Euclidean distance between two points; $b$ and $b^{gt}$ are the centroids of the prediction box and the real box; $w$ and $h$ denote the width and height of the prediction box; $w^{gt}$ and $h^{gt}$ denote the width and height of the real box; and $\alpha$ denotes the power parameter.
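The Alpha-EIoU loss described above can be sketched for a single box pair as follows. This is an illustrative sketch rather than the paper’s implementation: the default alpha = 3 follows the common Alpha-IoU setting and is an assumption, not a value stated in the text.

```python
def alpha_eiou_loss(pred, gt, alpha=3.0):
    """Alpha-EIoU sketch for boxes (x1, y1, x2, y2): each EIoU term
    (IoU, center distance, width distance, height distance) is raised
    to the power alpha, per Equation (10)."""
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])
    # IoU term
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    iou = inter / (area(pred) + area(gt) - inter)
    # minimum enclosing box: width C_w, height C_h, squared diagonal c^2
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2
    # squared distance between box centroids b and b^gt
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) / 2) ** 2 + \
           (((pred[1] + pred[3]) - (gt[1] + gt[3])) / 2) ** 2
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    return (1 - iou ** alpha
            + (rho2 / c2) ** alpha
            + ((w - wg) ** 2 / cw ** 2) ** alpha
            + ((h - hg) ** 2 / ch ** 2) ** alpha)

# a perfectly fitted box incurs zero loss
print(alpha_eiou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0
```

Because alpha exponentiates each term, errors on poorly fitted boxes are amplified relative to nearly fitted ones, which is what accelerates the regression fitting process described above.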
5. Conclusions
This paper introduces SLGA-YOLO, a lightweight algorithm that addresses the high computational cost, large model size, and high missed-detection rate in casting surface-defect detection. The model incorporates the SlimNeck-optimized neck module to reduce model complexity and integrates the fused SimAM and LSKA attention-enhancement mechanism to sharpen attention to important information. We also propose the GCML module to deepen the model’s understanding of input features, and we reconstructed the bounding box loss function as Alpha-EIoU to provide sufficient flexibility and strong generalization. The results demonstrate that the average detection accuracy improves to 86.2%, with a drastic reduction in the number of algorithm parameters. SLGA-YOLO establishes a reliable foundation for surface-defect detection in castings. However, SLGA-YOLO still has shortcomings, such as detection accuracy that could be improved further. Future work will focus on enhancing its adaptability to different scenarios and environmental conditions, and the algorithm’s inference speed will be optimized by exploring more sophisticated deep learning techniques and incorporating the latest developments in the field.