Article

Part Defect Detection Method Based on Channel-Aware Aggregation and Re-Parameterization Asymptotic Module

School of Automobile and Traffic Engineering, Jiangsu University of Technology, Changzhou 213001, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(3), 473; https://doi.org/10.3390/electronics13030473
Submission received: 12 December 2023 / Revised: 20 January 2024 / Accepted: 22 January 2024 / Published: 23 January 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

In industrial production, the quality, reliability, and precision of parts determine the overall quality and performance of various mechanical equipment. However, existing part defect detection methods have shortcomings in terms of feature extraction and fusion, leading to issues of missed detection. To address this challenge, this manuscript proposes a defect detection algorithm for parts (CRD-YOLO) based on the improved YOLOv5. Our first aim is to increase the regional features of small targets and improve detection accuracy. In this manuscript, we design the channel-aware aggregation (CAA) module, utilizing a multi-branch convolutional segmentation structure and incorporating an attention mechanism and ConvNeXt V2 Block as bottleneck layers for feature processing. Secondly, the re-parameterization asymptotic module (RAFPN) is used to replace the original model neck structure in order to improve the interaction between shallow-detail features and deeper semantic features, and to avoid the large semantic gaps between non-neighboring layers. Then, the DO-DConv module is encapsulated within the BN layer and the LeakyReLU activation function to become the DBL module, which further processes the feature output from the backbone network and fuses neck features more comprehensively. Finally, experiments with the self-made dataset show that the model proposed in this paper improves the accuracy of detecting various types of defects. In particular, it increased the accuracy of detecting bearing scuffing defects with significant dimensional variations, with an improvement of 6%, and gear missing teeth defects with large shape differences, with an 8.3% enhancement. Additionally, the mean average precision (mAP) reached 96.7%, an increase of 5.5% and 6.4% compared to YOLOv5s and YOLOv8s, respectively.

1. Introduction

Bearing components [1] play a crucial role in equipment ranging from large trains to the rotating joints of small family cars. Similarly, gears serve as the core transmission elements in mechanical equipment. During manufacturing and operation, it is inevitable that the surfaces of these components develop defects such as scratches and grooves. These defects may have a serious impact on mechanical properties, leading to significant reductions in the efficiency of automobile rotation and mechanical equipment transmission. Therefore, using computer vision technology to accurately detect surface defects on parts and components, and thereby improve the performance of machinery and equipment as well as the level of industrial production, is highly significant.
Most traditional surface defect detection methods rely on manual inspection, and their detection performance is poor. Wei [2] proposed a bearing roller detection method based on fused single-response constrained SIFT feature matching. The method uses an adaptive algorithm to obtain a pre-selected region and then performs feature matching to remove the portions of the region that do not contain the bearing rollers being sought. However, the accuracy of object detection with this method depends on the precision of matching regional features, and the accuracy is relatively low. In addition, Hengdi Wang et al. [3] proposed a method for detecting side defects in the outer ring of bearings based on the differential image method. The method first denoises, segments, and performs edge detection on the dataset images, and then uses the differential image method for defect detection. However, the method uses a CCD camera to collect image information, which is affected by the lighting of the shooting environment and easily leads to misdetection and missed detection. In addition, Chen Shuo et al. [4] proposed a bearing collar end face defect detection method that first preprocesses the collar image and performs edge detection. The least-squares method is then used to fit the profile in order to identify shape defects, finally completing the identification and classification of defects. However, this method is not applicable to defect detection in the case of bearing stacking. Wang et al. [5] put forward a technique for identifying defects on strip steel surfaces using a straightforward guide template. However, this approach is limited in adaptability and robustness due to the use of a manual feature classifier to screen features.
Over the preceding years, as artificial intelligence has progressed at an accelerated pace, convolutional neural networks [6] have demonstrated high accuracy, automation, multi-domain applicability, and real-time operation in the field of visual inspection, achieving significant progress and widespread application. Particularly in the field of defect detection, the continuous development and application of convolutional neural networks has produced a powerful tool with which to increase production quality, reduce losses, and improve the product manufacturing process. As an example, Yao Bo et al. [7] detected surface defects on aluminum profiles based on the YOLOv3 algorithm. They first selected the optimal target candidate frames using the K-means clustering algorithm, and then adjusted the network hierarchy. Their method has higher accuracy than the Faster-RCNN and SSD methods. In contrast, Li et al. [8] used the YOLOv3 detection algorithm to identify steel plate defects. They first used wavelet-median filtering for denoising and then added an output scale to the original network to improve the detection of small-target defects. Next, they optimized the loss function. However, the target classes in this method's dataset are too homogeneous, making it less robust than detectors trained for other tasks. In addition, Xu Qian et al. [9] chose to replace the Darknet-53 network in YOLOv3 with a lightweight network and introduced dilated convolution to improve the detection capability. In order to enhance the precision of the model while upholding excellent real-time performance, the final convolutional layer was augmented with an inception [10] architecture. As a result, the number of parameters in the improved model decreased in comparison to the original model, leading to a considerable increase in the detection accuracy. However, because the network primarily focuses on being lightweight, the method is less accurate in detecting overlapping, occluded, and small targets. Li Bin and the research team [11] utilized the YOLOv4 [12] algorithm to identify defects on the surface of aero-engine components. To enhance the detection accuracy and speed, they made improvements to the PANet [13] structure, expanded the feature detection scale, and replaced the cross-entropy loss function with an optimized focal loss for classification errors. These enhancements resulted in some improvement in the detection accuracy and speed. However, the model still struggles with scenarios involving multiple overlapping targets and occlusions, potentially leading to detection gaps. Shi Zhenhua and colleagues [14] built upon the YOLOv3 model for workpiece defect detection. They focused on improving the network's feature fusion to reduce redundant candidate frames. This method demonstrated higher accuracy in recognizing defects with a single shape. However, when small targets appeared in dense shapes, the enhanced feature fusion method still had insufficient feature integration, leading to potential misjudgments in detection. In general, the field of industrial defect detection has witnessed notable advancements, which may be attributed to convolutional neural networks. Nonetheless, there persist several challenges that necessitate resolution: (1) The overall parameters of the model are too large, which leads to a lack of real-time performance.
(2) The model feature extraction and fusion methods need to be further enhanced, and the model focusing on bearing defect targets struggles to cope with small targets under overlapping and occlusion, meaning that it cannot meet the needs of practical applications.
To address the real-time and accuracy requirements in the field of industrial bearing and gear defect detection, we propose a new defect detection algorithm. This article takes the YOLOv5s [15] model and makes meticulous improvements to enhance its performance. One of the key improvements involves designing the CAA module, which replaces the C3 module in the backbone network. This module utilizes the convolutional segmentation concept and the ConvNeXt V2 module [16], combined with the SE attention mechanism [17], to extract richer features, enhance regional features of hidden small targets, and improve the detection accuracy. In order to further improve the detection accuracy, we introduce the asymptotic feature pyramid network (AFPN) [18] to the neck portion and alter its structure to incorporate the re-parameterization concept by replacing multiple convolutional layers with re-parameterization convolutions in order to form the RAFPN module. This improvement enhances the interaction between shallow-detail features and deep semantic features, avoids large semantic gaps between non-adjacent layers, and reduces computation and memory consumption while speeding up the inference speed of the model for the purpose of achieving more accurate detection and a better multi-scale feature fusion effect. Finally, the DO-DConv [19] module is introduced and the BN layer and LeakyReLU activation function are added to form the DBL layer for the purpose of improving performance. Overall, the principal contributions of this manuscript are described below:
(1)
This manuscript designs a new channel-aware aggregation feature extraction module (CAA) that uses a multi-branch convolutional segmentation structure to perform its task, with the SE attention mechanism and ConvNeXt V2 Block comprising the bottleneck layer for feature processing. We seek to improve the attention and feature richness of the model as applied to the target region.
(2)
This manuscript redesigns the structure of the YOLOv5 model’s neck, introduces an improved asymptotic feature pyramid fusion network and incorporates reparameterization convolution. This advancement allows the model to effectively integrate both shallow positional features and deep semantic features, preventing issues related to significant semantic disparities. Consequently, the proposed approach achieves precise detection in complex environments characterized by occlusion and overlapping.
(3)
The DO-DConv module is introduced and a BN layer and LeakyReLU activation function are added, combining into a DBL layer. DO-DConv allows for the addition of learnable parameters during training, and deep convolutions are collapsed into ordinary convolutions during inference, thus improving performance without increasing cost. The use of the LeakyReLU activation function allows for the improved handling of negative input values, permitting neurons to better convey gradient information and thus optimizing the learning ability of the model. Additionally, its use is less computationally intensive than the SiLU activation function.
The manuscript is divided into several sections. Section 2 presents an overview of the two primary research directions related to this field. Section 3 introduces the original model and the improved model structure, presenting a comprehensive description of the relevant modules. Section 4 details the evaluation metrics and datasets used, as well as the experimental results and analysis, to validate the improved algorithm. Finally, Section 5 summarizes the experiments and the relevant analyses.

2. Related Work

2.1. Autoencoder

An autoencoder is an unsupervised learning algorithm widely used in the field of deep learning. It can learn a low-dimensional representation of data by reconstructing input data through encoding and decoding processes. With proper design and training, an autoencoder can extract useful information from data for tasks such as feature extraction, downscaling, and data reconstruction. Such a low-dimensional representation can provide a better input for subsequent machine learning tasks.
It is possible to conceptualize autoencoders more intuitively by imagining the task of drawing a portrait of a person, where only the outline of the character's profile is required, along with the approximate positions of the eyes and mouth. Then, the artist gradually adds more detail and draws the complete portrait. Autoencoders work in a similar way: an encoder compresses the original image into a simplified outline drawing, and then a decoder restores the outline drawing to the original image. In this way, it is possible to represent complex images more concisely while still retaining important features.
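To make the encode-decode idea concrete, the following is a minimal sketch of a convolutional autoencoder in PyTorch (the framework used later in this manuscript); the layer sizes and the 28 × 28 single-channel input are illustrative assumptions, not the architecture of any method cited here.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: compress a 1x28x28 image into a small "outline" representation.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
        )
        # Decoder: restore the compressed representation to the original size.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),   # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes the reconstruction error between the input and its reconstruction.
model = TinyAutoencoder()
x = torch.rand(4, 1, 28, 28)
loss = nn.functional.mse_loss(model(x), x)
```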
Han [20] et al. proposed a method for defect detection using a stacked convolutional autoencoder that requires only non-defective data and synthetic defective data generated using defective features based on expert knowledge for training. Although this method has achieved some success in solving the problem of a lack of real defect detection data, it may be less applicable in industrial defect detection scenarios where high accuracy is a requirement. Conversely, Liu et al. [21] presented an unsupervised learning-based approach for detecting defects in suspension chain rod insulators. This method involves reconstructing segmented real samples of the insulators, treating them as foregrounds, and separating the defective regions as backgrounds by comparing the segmented image with the reconstructed image. However, this approach heavily relies on the effectiveness of foreground and background separation, limiting its applicability with regard to defect detection in complex scenes.
As the technology continues to evolve, Kaiming He et al. [22] proposed MAE, a scalable self-supervised learner for computer vision that uses an asymmetric encoder–decoder structure. The authors found that masking a large portion of the input image (e.g., 75%) produces an important and meaningful self-supervised task.
ConvNeXt [23] may also benefit from the self-supervised approach MAE. We therefore use ConvNeXt V2, a newer model that combines the supervised ConvNeXt design with the self-supervised learning technique MAE and significantly improves ConvNet performance across a variety of recognition benchmarks. The fully convolutional masked autoencoder (FCMAE) framework is a self-supervised learning method based on convolutional neural networks, in which some regions of the input image are randomly masked before the model attempts to recover the masked parts. This forces the model to learn both the global and local features of the image, enhancing its ability to generalize. The FCMAE framework possesses two advantages over the traditional masked autoencoder (MAE) framework. Firstly, it implements a fully convolutional architecture instead of utilizing fully connected layers to generate the mask and reconstruct the image. This strategy effectively reduces both the number of parameters and the computational resources needed, all while retaining the spatial information. Additionally, it introduces a multiscale masking approach as opposed to employing fixed-size masks. This technique enhances the model's capability to identify features at a variety of scales. This is the reason behind its use in this manuscript.
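As an illustration of the random-masking idea behind MAE and FCMAE, the sketch below hides a large fraction of image patches and returns the mask; the patch size of 16 and the 75% ratio follow the example above, while the function itself is a simplified stand-in, not the FCMAE implementation.

```python
import torch

def random_patch_mask(images, patch=16, mask_ratio=0.75):
    """Return images with roughly mask_ratio of their patches zeroed, plus the pixel-level mask."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    # One keep/drop decision per patch.
    keep = (torch.rand(b, 1, gh, gw) > mask_ratio).float()
    # Upsample the patch-level decisions to pixel resolution.
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return images * keep, keep

imgs = torch.rand(2, 3, 224, 224)
masked, mask = random_patch_mask(imgs)   # masked input for the encoder
# A reconstruction loss would then be computed only on the masked regions.
```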

2.2. Multi-Scale Defect Detection

Multi-scale feature fusion is a basic field of deep learning research, one which greatly improves the accuracy of object detection at various scales. Although several methods have been proposed to tackle this problem, each technique has its own limitations. For example, Lin et al. [24] introduce a feature pyramid network model to address the challenge of multiscale detection. However, their approach merely transfers multiscale features in the backbone network and neglects a deeper feature fusion design, resulting in significant semantic gaps between non-adjacent layers. Additionally, Zhao et al. [25] present a pyramid pooling model that utilizes feature details at various scales to investigate global contextual information. However, their methodology only incorporates a pyramid pooling module in the last layer of the backbone network, lacking a multi-scale feature sensing architecture. Moreover, Chen et al. [26] present an atrous spatial pyramid pooling module that captures the contextual information of the target object via multiple sampled values of feature data at different scales. Unfortunately, when handling large targets, atrous convolution can lead to an overlap in feature information, an issue with the potential to adversely affect the model's accuracy. Additionally, Zhang and colleagues [27] introduced a lightweight attention method to achieve a powerful multi-scale representation through the incorporation of an efficient pyramid-split-attention module into the bottleneck layer of ResNet. This involves replacing the 3 × 3 convolution with a PSA module. Tang et al. [28] propose a layered multiscale block to provide rich scale features through layered representation and multi-scale embedding, which is achieved by stacking hierarchical multiscale blocks. However, when the number of hierarchical stacks is high, a loss of information is inevitable. Yeung et al. [29] propose an inspection model incorporating an attention mechanism for steel surface defect detection, to improve the extraction of defects at different scales. This model performs well on both the NEU-DET and GC10-DET datasets. Nonetheless, its feature extraction depends on stacking multiple deformable convolutions, which inevitably results in significant computation. In response to this issue, Su et al. [30] present a multi-branch feature fusion network that extracts features from various layers and adjusts feature maps to the same size using different approaches (e.g., pooling or up-sampling). However, their methodology does not consider feature fusion between non-adjacent layers.
Almost all the above studies on multiscale defect detection have focused on improving the backbone network’s ability to extract multiscale features, but they have not paid attention to the design of multiscale defect detection models. Given the constraints of the aforementioned approaches, this study utilizes AFPN to comprehensively capture the feature details of each scale of the feature map and accomplish their integration across various scales. Furthermore, this research employs the RepConv [31] module to decrease the parameters and computational load. Through this framework, our model can notably enhance the identification of defects across multiple scales.

3. Methodology

3.1. CRD-YOLO Architecture

The YOLOv5 model has the advantages of high accuracy, high speed, being lightweight, and possessing high scalability compared with previous versions. However, it can produce missed detections in scenarios with occlusion and overlapping interference and may not be accurate enough to detect defects in datasets with large differences in bearings and gears. Therefore, this manuscript proposes an improved model based on YOLOv5 called CRD-YOLO. The CRD-YOLO model structure is shown in Figure 1 and consists of three parts: backbone, neck, and detector. The backbone structure uses the CAA module designed in this manuscript to replace the C3 module. The neck adopts this manuscript’s RAFPN structure to replace the PANet structure. Finally, the DBL module proposed in this manuscript is used to further process features after the three-layer output of the backbone, producing far more comprehensive feature fusion in the neck.

3.2. Channel-Aware Aggregation Module

In a conventional YOLOv5s backbone network, the CBS module and the C3 module are employed for down-sampling and feature extraction. The C3 module serves as a building block for the feature extraction layer, which directly connects the feature maps processed in the bottleneck layer to the original input feature maps. However, this feature fusion approach may not adequately capture different feature maps of defects, resulting in insufficiently rich extracted features. In addition, the bottleneck layer structure of the C3 module consists of a stack of convolutional layers, which may hinder the model’s processing of multi-scale defective targets. For this reason, the CAA structure is proposed in this manuscript, as shown in Figure 2d.
Firstly, the backbone network has insufficient feature extraction, which leads to low accuracy in detecting small-target defects on parts. This manuscript draws on the ConvNeXt V2 Block approach used in ConvNeXt V2. The ConvNeXt V2 Block can better understand the high-level structure of the data and extract a more meaningful representation of the features, which in turn improves the model's performance in image detection tasks. In short, it consists of depth-wise convolution and point-wise convolution. The principal advantage of depth-wise convolution is that it reduces the number of parameters and the required quantity of computation, increasing model efficiency. Meanwhile, point-wise convolution retains and integrates the feature maps from depth-wise convolution and improves the model's capacity to capture the interplay among diverse channels.
Simultaneously, to further enhance the network's channel representation capability, the SE attention mechanism is integrated with the ConvNeXt V2 Block, resulting in the ConvNeXtV2SE module shown in Figure 2b. The SE module performs two primary operations, squeeze and excitation, as shown in Equations (1) and (2). F_sq denotes the squeeze operation, which globally average-pools the feature map to generate a 1 × 1 × c vector, where u is the feature map and u_c(i, j) is the value at position (i, j) of the input u in the c-th channel. After Equation (1) performs global average pooling on the feature maps, Equation (2) applies the weights W, which are obtained through learning. The resulting weighting information can adaptively weight the features in each channel, making the network better able to adapt to complex tasks and scenarios, as well as improving the feature representation capability. The structure of the SE attention mechanism is shown in Figure 3.
z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)
s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \delta(W_1 z))
where H denotes the height of the feature map, W denotes the width, z denotes the generated channel descriptor vector, δ denotes the ReLU function, W_1 and W_2 denote the two fully connected layers, and σ denotes the sigmoid function.
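As a concrete illustration, the following is a minimal PyTorch sketch of the SE attention operation described by Equations (1) and (2): a global average pooling (squeeze) step, two fully connected layers, and a sigmoid gate that re-weights each channel. The reduction ratio r = 16 is a common default and an assumption here, not a value taken from the original model.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)      # W1 in Equation (2)
        self.fc2 = nn.Linear(channels // r, channels)      # W2 in Equation (2)

    def forward(self, u):
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                              # Equation (1): squeeze to a 1 x 1 x c vector
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # Equation (2): excitation weights
        return u * s.view(b, c, 1, 1)                       # channel-wise re-weighting of u

out = SEBlock(64)(torch.rand(2, 64, 32, 32))                # output has the same shape as the input
```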
Second, to address the problem of the insufficient fusion of defective information, the input x is processed using the first convolutional layer CBS1, and the result is split into two parts with equal numbers of channels. These two parts are then taken as a list, y. The last element in list y, y2, is processed in sequence through the ConvNeXtV2SE modules. Each ConvNeXtV2SE block consists of a 7 × 7 depth-wise convolutional layer, a layer-norm layer, and two linear layers (pwconv1 and pwconv2). Global response normalization (GRN) and squeeze-and-excitation (SE) layers are also included in the ConvNeXtV2SE block for enhanced feature extraction and feature importance determination, and the output of each module is appended to list y. In the end, all elements in list y, including both the original segments and those processed by the ConvNeXtV2SE modules, are concatenated together, and the concatenated result is then passed through the second convolutional layer CBS2 in order to produce the final output.
The CAA module consists of two principal convolutional layers (CBS1 and CBS2). The CBS1 convolutional layer has dimensions of 1 × 1 × (2c), and the ConvNeXtV2SE block has a depth-wise convolutional layer with dimensions of 7 × 7 × (c). The dimensions of the linear layers (pwconv1 and pwconv2) of the ConvNeXtV2SE block are 1 × 1 × (4c) and 1 × 1 × c, and the dimensions of the CBS2 convolutional layer are 1 × 1 × c.
The design of CAA effectively utilizes the advantages of ConvNeXtV2SE and the multi-branch convolutional segmentation structure to better capture features at different scales and semantic layers. The multi-branch convolutional segmentation structure uses a bottleneck layer for feature extraction. In this manuscript, the ConvNeXtV2SE module is used to replace the original bottleneck layer structure. This enables the effective extraction and integration of feature maps, as well as the enhancement of the model’s ability to capture the correlation between different channels.
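The following is a hedged sketch of the CAA data flow described above, written in PyTorch under our own simplifying assumptions: the channel widths, the number of ConvNeXtV2SE blocks, and the compact SE gate are illustrative, and details such as DropPath and initialization are omitted; it is not the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global response normalization from the ConvNeXt V2 block (channels-last input)."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):                                   # x: (B, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)   # global feature aggregation
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)    # divisive normalization across channels
        return self.gamma * (x * nx) + self.beta + x

class SEGate(nn.Module):
    """Compact SE gate, as in the sketch of Equations (1) and (2) above."""
    def __init__(self, c, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, max(c // r, 1)), nn.ReLU(),
                                nn.Linear(max(c // r, 1), c), nn.Sigmoid())

    def forward(self, u):
        return u * self.fc(u.mean(dim=(2, 3))).view(u.shape[0], -1, 1, 1)

class ConvNeXtV2SE(nn.Module):
    """7x7 depth-wise conv, LayerNorm, point-wise expansion/projection, GRN, and an SE gate."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.grn = GRN(4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.se = SEGate(dim)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)               # to channels-last for the linear layers
        x = self.pwconv2(self.grn(nn.functional.gelu(self.pwconv1(self.norm(x)))))
        return self.se(x.permute(0, 3, 1, 2) + shortcut)     # back to channels-first, residual, SE gate

class CAA(nn.Module):
    """Split the CBS1 output, pass one half through stacked ConvNeXtV2SE blocks, concatenate all."""
    def __init__(self, c_in, c_out, n_blocks=2):
        super().__init__()
        half = c_out // 2
        self.cbs1 = nn.Sequential(nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out), nn.SiLU())
        self.blocks = nn.ModuleList(ConvNeXtV2SE(half) for _ in range(n_blocks))
        self.cbs2 = nn.Sequential(nn.Conv2d(half * (n_blocks + 2), c_out, 1),
                                  nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        y = list(self.cbs1(x).chunk(2, dim=1))                # split into two equal-channel halves
        for block in self.blocks:
            y.append(block(y[-1]))                            # each block processes the latest output
        return self.cbs2(torch.cat(y, dim=1))                 # concatenate originals and processed maps

out = CAA(64, 64)(torch.rand(1, 64, 80, 80))                  # same spatial size, c_out channels
```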
In this section, we report the use of a visualization method to verify the effectiveness of the CAA module in performing the feature extraction task. The C3 module of layer 2 in the model was replaced with the CAA module for feature map visualization. We conducted a comparison between the feature map obtained from the CAA module and the feature map generated by the original model’s C3 module, as illustrated in Figure 4. The experimental results show that the improved CAA module can extract features more efficiently and can more effectively concentrate on capturing shape characteristics, thus improving the accuracy and precision of defect detection. This is mainly due to the CAA module, which assists the model in learning key feature information more accurately and filtering out unimportant information. These measures improve the quality and expressiveness of feature extraction.

3.3. Reparameterization Asymptotic Feature Pyramid Networks

The neck network of YOLOv5 is an intermediate feature extraction network added on top of the backbone network and is principally used to enhance the feature expression ability and receptive field of the model to further improve the detection performance. It adopts two different neck structures: PANet and SPPF. PANet requires bottom-up and top-down feature map up-sampling and down-sampling operations, and there is no direct feature interaction between non-neighboring layers, which can result in the loss or degradation of feature information. Additionally, lower-level features may be over-suppressed during the feature fusion process, which may affect the model's small-target detection ability. SPPF is a spatial pyramid pooling structure, shown in Figure 5. It performs pooling operations on features at different scales in order to capture objects of different sizes as well as the global contextual information of the image. This helps to improve the model's ability to perceive scale changes and object context. Therefore, the principal role of SPPF in YOLOv5 is to fuse multi-scale features and stitch together features of different proportions of the same feature map.
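For reference, the following is a brief sketch of the SPPF idea in Figure 5: three successive 5 × 5 max-pooling operations on the same feature map are concatenated with the input, which approximates pooling with 5 × 5, 9 × 9, and 13 × 13 windows at lower cost. The channel widths and the omission of BN/activation layers are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)                       # reduce channels before pooling
        y1 = self.pool(x)                     # effective 5x5 receptive field
        y2 = self.pool(y1)                    # effective 9x9
        y3 = self.pool(y2)                    # effective 13x13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))   # stitch the scales together

out = SPPF(512, 512)(torch.rand(1, 512, 20, 20))              # spatial size preserved, scales fused
```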
Inspired by the idea of SPPF, the neck structure of YOLOv5 was improved by using the AFPN (asymptotic feature pyramid network) to replace PANet. The AFPN module accepts feature maps of three different levels with different spatial dimensions and numbers of channels. Up-sampling or down-sampling operations are performed on the three input feature maps so that their spatial dimensions match those of the current level. If up-sampling is performed, the feature maps are up-sampled to the target size using bilinear interpolation. If down-sampling is performed, a 1 × 1 convolution is applied to perform the operation. Finally, a feature fusion operation is performed on the three adjusted feature maps to produce the fused feature map. The three layers of feature fusion (large, medium, and small) are used to improve the model's ability to understand and detect objects and scenes at different scales. The use of three-layer fusion reduces the computational complexity while maintaining a relatively high performance. The AFPN can simultaneously fuse feature maps at different layers, thereby reducing the risk of information loss and improving the detection of small targets. To further reduce the computational complexity of the model, the RepConv module is used in this manuscript to process the output of the feature fusion. The RepConv module can increase the reusability of important features and mitigate the effects of changes in the resolution of the features, which further improves the detection performance of the model at different scales and reduces the number of parameters. As shown in Figure 6, we use Conv_3^{N \times C_2 \times H_2 \times W_2} as the output and Conv_1^{N \times C_1 \times H_1 \times W_1} as the input, where C_1 = C_2, H_1 = H_2, and W_1 = W_2. The RepConv consists of three parts, namely a BN layer, a 1 × 1 convolution + BN layer, and a 3 × 3 convolution + BN layer, where the BN layer is the batch normalization layer. After passing through the three branches, the feature maps are summed element-wise to obtain the final feature map. This structure can significantly enhance the multiscale representation capability of the model to tackle the multiscale problem of defects in complex industrial environments. In the inference stage, the above three-branch structure can be equated to a single 3 × 3 convolutional structure with batch normalization parameters, and the equivalence process is shown in the following equations. The three-part structure used during training can be formulated as follows:
y = BN(x \times Conv_3, \bar{x}_3, \sigma_3, \gamma_3, \beta_3) + BN(x \times Conv_1, \bar{x}_1, \sigma_1, \gamma_1, \beta_1) + BN(x, \bar{x}_0, \sigma_0, \gamma_0, \beta_0)
For example, take 3 × 3 convolutional layers + batch normalization as shown in the following equation:
BN(x \times Conv_3, \bar{x}_3, \sigma_3, \gamma_3, \beta_3) = x \times \frac{\gamma_3}{\sigma_3} Conv_3 - \frac{\bar{x}_3 \gamma_3}{\sigma_3} + \beta_3
where x̄_3, σ_3, γ_3, and β_3 are parameters in batch normalization.
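The following numerical sketch illustrates the equivalence expressed by the two equations above: each branch's batch normalization statistics are folded into its kernel, the 1 × 1 and identity branches are expressed as 3 × 3 kernels, and the three kernels are summed into the single 3 × 3 convolution used at inference. The channel count and the random weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv_weight, bn):
    """Fold BN statistics (mean, std, gamma, beta) into the preceding kernel, as in the equation above."""
    std = torch.sqrt(bn.running_var + bn.eps)
    w = conv_weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

c = 8
conv3, bn3 = nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c)
conv1, bn1 = nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c)
bn0 = nn.BatchNorm2d(c)                                    # identity (BN-only) branch
for m in (conv3, bn3, conv1, bn1, bn0):
    m.eval()                                               # use running statistics, as at inference

with torch.no_grad():
    w3, b3 = fuse_conv_bn(conv3.weight, bn3)
    w1, b1 = fuse_conv_bn(conv1.weight, bn1)
    w1 = nn.functional.pad(w1, [1, 1, 1, 1])               # place the 1x1 kernel at the centre of a 3x3 kernel
    # The identity branch as a 1x1 kernel with a single one per channel, then padded to 3x3.
    identity = torch.zeros(c, c, 1, 1)
    for i in range(c):
        identity[i, i, 0, 0] = 1.0
    w0, b0 = fuse_conv_bn(identity, bn0)
    w0 = nn.functional.pad(w0, [1, 1, 1, 1])

    fused = nn.Conv2d(c, c, 3, padding=1)                  # single inference-time convolution
    fused.weight.copy_(w3 + w1 + w0)
    fused.bias.copy_(b3 + b1 + b0)

x = torch.rand(1, c, 16, 16)
y_branches = bn3(conv3(x)) + bn1(conv1(x)) + bn0(x)        # training-time three-branch form
print(torch.allclose(y_branches, fused(x), atol=1e-5))     # True: the fused conv is equivalent
```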
By integrating AFPN with RepConv, we created the RAFPN architecture. This effectively enhances the precision of defective target detection at different scales in the dataset, as depicted in Figure 7.

3.4. DBL Module

In the homemade dataset, the defective targets are mostly small targets with overlapping, occlusion, and multi-scale problems. In order to enable the model to process the features more adequately before RAFPN feature fusion and to enrich the fusion process in order to locate more regions favorable for target detection, the DBL module is added after the three-layer feature output of the backbone, as illustrated in Figure 8.
DO-DConv (depthwise over-parameterized convolution) is adopted; it uses an additional depthwise convolution to augment the conventional convolution, which increases the number of learnable parameters during training. The calculation formula is:
O = (D^T \circ W^T)^T P
where ∘ denotes the depthwise convolution operator and P denotes the input patch tensor.
Here, the two parameter matrices D and W represent the weights of the two convolutions, and both are learned during the training process. Firstly, D ∘ W must be computed, which is achieved by performing a weighted summation of the products of D and W. Then, D ∘ W is used to replace the weight matrix in the traditional convolution and applied to the input feature map. Finally, the convolved feature map is the output. In this manuscript, DO-DConv, a BN layer, and the LeakyReLU activation function are encapsulated as a convolutional layer and added to the neural network. DO-DConv not only accelerates the training speed of the model, but also achieves better results than a traditional convolutional layer in detection tasks. The BN layer normalizes the inputs of each layer and mitigates the problems of vanishing and exploding gradients, thus improving the convergence speed. A small negative slope α is introduced via the LeakyReLU activation function, as shown in Equation (6), in order to solve the problem of neuron death, better handle negative inputs and convey gradient information, and optimize the learning performance of the model.
f(x) = LeakyReLU(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{if } x \leq 0 \end{cases}
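A minimal sketch of the DBL layer is given below; a plain convolution stands in here for DO-DConv (into which the over-parameterized kernels collapse at inference), followed by batch normalization and LeakyReLU as in Equation (6). The kernel size and the negative slope α = 0.1 are assumptions for illustration, not values taken from the implementation.

```python
import torch
import torch.nn as nn

class DBL(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1, alpha=0.1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)  # stand-in for DO-DConv
        self.bn = nn.BatchNorm2d(c_out)          # normalizes activations and speeds convergence
        self.act = nn.LeakyReLU(alpha)           # passes a small gradient for negative inputs

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

out = DBL(256, 256)(torch.rand(1, 256, 40, 40))  # applied to each backbone output before RAFPN fusion
```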

4. Experiments

4.1. Datasets

The dataset used in this experiment contains bearing and gear images. We collected images of three common types of defects on the bearing surface, namely grooves, wear, and scoring, totaling 1530 images. Gear images were used for the experiment, utilizing the gear dataset provided by Guizhou University [32]. We used an image size of 800 × 600. The images included three types of defects, namely broken teeth, missing teeth, and tooth surface abrasion, totaling 3000 images, for 4530 images overall for the two components. We randomly selected the training set and validation set at a ratio of 8:2, where the training set included 3624 images and the validation set included 906 images. The images of the various defects are shown in Figure 9.
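For reproducibility, the following small sketch illustrates the 8:2 random split described above; the directory layout, file extension, and random seed are assumptions rather than the actual dataset organization.

```python
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("dataset/images").glob("*.jpg"))    # hypothetical image directory
random.shuffle(images)
split = int(0.8 * len(images))
train_set, val_set = images[:split], images[split:]      # e.g., 3624 / 906 images for 4530 in total
```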

4.2. Experimental Environment and Parameter Settings

In this manuscript, the PyTorch deep learning framework is used to implement the defect detection of bearings and gears. The experimental hardware environment includes an Intel(R) Xeon(R) Platinum 8352V CPU, an Nvidia GeForce RTX 4090 GPU, and the Ubuntu 20.04 operating system. The code is written in Python and calls CUDA, cuDNN, and other required libraries for acceleration.
In this manuscript, we use the stochastic gradient descent algorithm to train the YOLOv5s model. During the training process, the training batch size is set to 8, the momentum is set to 0.937, the weight decay is set to 0.0005, and the initial learning rate is set to 0.01. In addition, the training curves show that all the evaluation metrics converge towards stability after 250–300 epochs, as Figure 10 shows, so the model is trained for 300 epochs.
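The following is a schematic sketch of the training configuration listed above (SGD with momentum 0.937, weight decay 0.0005, initial learning rate 0.01, batch size 8, 300 epochs); the model and train_loader objects and the loss call are placeholders, not the actual training script.

```python
import torch

def train(model, train_loader, epochs=300, device="cuda"):
    # SGD with the hyperparameters reported above.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.937, weight_decay=0.0005)
    model.to(device).train()
    for epoch in range(epochs):
        for images, targets in train_loader:              # loader built with batch size 8
            loss = model(images.to(device), targets)      # schematic: model returns its training loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```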

4.3. Performance Evaluation Indicators

To assess the overall performance of the model, this experiment employs several evaluation metrics, namely precision (P), recall (R), average precision (AP), mean average precision (mAP), FPS, giga floating-point operations (GFLOPs), parameters, and model volume. AP represents the area bounded by the precision-recall curve. P, R, and F1 are computed as follows:
Precision = \frac{TP}{TP + FP}
Recall = \frac{TP}{TP + FN}
F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \times 100\%
In the provided equations, true positive (TP) refers to the number of positive samples correctly identified as positive, false positive (FP) indicates the number of negative samples incorrectly classified as positive, and false negative (FN) represents the number of positive samples mistakenly classified as negative.
AP denotes the average precision achieved at various levels of recall. mAP represents the average AP across different categories, and its formula is as follows:
AP = \int_{0}^{1} P(R) \, dR
mAP = \frac{\sum_{j=1}^{S} AP(j)}{S}
where S indicates the overall count of categories. Moreover, the number of frames the model can detect per second is represented by FPS, with a higher value indicating a faster rate of detection. GFLOPs signifies the number of floating-point operations the model performs in a forward pass, where a greater GFLOPs value indicates higher computational complexity. The parameters denote the complete number of trainable parameters in the model, determined by the individual weights of each layer and the number of network layers. The model volume refers to the weight file's total size and is determined by the model's data type and the total number of trainable parameters, expressed in MB.
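For clarity, the following sketch computes these metrics from per-class counts, assuming the TP/FP/FN statistics and the per-class precision-recall points have already been gathered; it is an illustration of the formulas above rather than the evaluation code used in the experiments.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                             # fraction of predictions that are correct
    recall = tp / (tp + fn)                                # fraction of true defects that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(recall_pts, precision_pts):
    """Approximate AP, the area under the precision-recall curve, by the trapezoidal rule."""
    r, p = np.asarray(recall_pts, float), np.asarray(precision_pts, float)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(per_class_ap):
    return sum(per_class_ap) / len(per_class_ap)           # average AP over the S defect categories

print(precision_recall_f1(tp=90, fp=5, fn=10))             # illustrative counts only
```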

4.4. Comparative Experiments of Feature Extraction Modules

In order to further validate the multi-branch convolutional segmentation structure designed in this manuscript and the training effect of applying the ConvNeXt V2 module at the bottleneck layer, the ConvNeXt V2 module, which does not use the SE attention mechanism, is deployed as the bottleneck layer to process the feature information, which is named CNX here. Additionally, a comparison experiment is carried out with the original model, and the results are shown in Table 1.
Table 1 shows that using the CNX module achieves a better detection performance than the original model. YOLOv5-CNX improves the detection accuracy by 2.2%, recall by 1.6%, and mAP by 2.7% compared to YOLOv5. This is all due to the processing of the input features via CNX, which improves the detection accuracy.

4.5. CAA Module Comparison Experiment

In order to verify that the combination of the SE attention mechanism and ConvNeXt V2 module in the bottleneck layer can enhance feature extraction and the judgment of feature importance, the CAA module is used in this section to conduct comparative experiments with the CNX model in Section 4.4, and the experimental results are shown in Table 2.
As can be seen from the data in Table 2, the recall rate is improved by adding the SE attention mechanism to the bottleneck layer, the precision rate is reduced slightly, but the overall mAP value is improved. This may be due to the fact that the SE attention mechanism is able to better focus on the feature channels that are more important for the current task and inhibit those channels that are less important for the model decision, thus improving the recall of the model. However, equally, this weight adjustment also affects the model’s precision rate to some degree. In summary, the SE attention mechanism can enhance the performance of the model by adaptively learning the attention weights of the feature channels, and this is an effective attention mechanism worth using.
In addition, to verify the effectiveness of choosing the SE attention mechanism, this section also carries out a cross-sectional comparison experiment of the attention mechanism, using CA, SimAM, and an SE attention mechanism added to the bottleneck layer for experimental comparison, respectively. The experimental results are shown in Table 3.
The outcomes obtained from the experiments presented in Table 3 demonstrate that employing the SE attention mechanism yields the most impactful results in terms of detection. This may be due to the fact that SE can effectively focus on the feature channel weights that are more important to the task at hand for the purpose of enhancing the expressiveness of the network in the feature extraction process. CA and SimAM may require more human settings or a priori knowledge.

4.6. RAFPN Module Comparison Experiments

In order to determine the training efficacy of introducing the RAFPN module, this section uses the YOLOv5-CAA-RAFPN model to complete the experiment in comparison with the YOLOv5-CAA model, and the results of the module comparison experiment are shown in Table 4.
It can be seen from the experimental results in Table 4 that the model's recall and mAP values improve significantly again after adding RAFPN. RAFPN enhances the interaction between shallow-detail features and deep semantic features, avoiding the larger semantic gap between non-adjacent layers, and it is less likely to lose detail features when dealing with defective images, so the features are fused better; this is the reason for the improvement in the metrics.
In this section, a separate feature pyramid comparison experiment is also conducted, and PANet, BiFPN, and RAFPN are selected for comparison to verify that the RAFPN detection performance of the improved model in this manuscript is superior, and the experimental results are shown in Table 5.
Table 5 shows that the YOLOv5-RAFPN model performs relatively well in terms of precision, recall, and mAP metrics, reaching 94%, 90.9%, and 93.6%. Indeed, it has a better performance than the other two models. This indicates that RAFPN can be used as a model neck to handle features at different scales for the better detection of defective targets of varying sizes. This structure enhances the model’s localization accuracy and ability to perceive the target object, thus improving the detection performance.

4.7. Comparison Experiments of DBL Modules

The above experimental results show that the YOLOv5-CAA-RAFPN model achieves enhanced feature extraction and feature fusion capabilities. Defects can be detected more effectively, allowing further improvements to the model performance. This section uses the model with the added DBL module to conduct comparison experiments on the basis of Section 4.6. The aim is to enable the model to process the features fully before RAFPN feature fusion. This in turn enriches the fusion process, allowing researchers to locate more regions favorable for target detection. The experimental results are shown in Table 6.
As can be seen from the data in Table 6, the YOLOv5-CAA-RAFPN-DBL model performs relatively well in terms of precision, recall, and mAP metrics, reaching 95.1%, 94%, and 96.7%. Compared to the YOLOv5-CAA-RAFPN model, the YOLOv5-CAA-RAFPN-DBL model improved in all metrics.
This may be because the YOLOv5-CAA-RAFPN-DBL model introduces the DBL module and the added learnable parameters to optimize the model learning performance.

4.8. Ablation Experiment

In order to verify the effectiveness of the three-point improvement method in this manuscript for the problems of part occlusion, overlapping, and multiple scales in the dataset, ablation experiments are carried out and the results are shown in Table 7.
From the experimental data in Table 7, it can be seen that the CAA, RAFPN, and DBL modules can all improve the model performance to some extent. When combining all three modules, the model performance sees greater improvement. Specifically, precision, recall, and mAP values as high as 95.1%, 94%, and 96.7% are registered when using the three modules. This marks improvements of 3.9%, 4.3%, and 5.5%, respectively, compared to the original model. It is worth noting that the CAA and DBL modules bring no additional parameters and computation, while the RAFPN module slightly increases the number of parameters and computation. At this point, the final model parameter count increases from 26.8 M to 29.2 M, an increase of just 2.4 M model parameters. This shows that using the CAA module for feature extraction and fusing the neck features with the DBL module and RAFPN module can effectively improve the accuracy and mAP of the detection results and minimize the increase in the number of parameters. A decreased parameter count is achievable due to the utilization of the DBL module and the RepConv module within the neck, which enhances the performance while minimizing the parameter count to the fullest extent.
Therefore, in this manuscript, by applying CAA, RAFPN, and DBL modules to the YOLOv5s model, the optimal model is obtained, possessing the improved accuracy and mAP of small target detection.

4.9. Qualitative Analysis of the Model

The experimental detection maps are shown in Figure 11, and it can be clearly seen that the proposed model is able to detect each category of defects accurately and without misdetection or omission, proving the effectiveness of the proposed model.
Finally, we chose a typical image with three classes of defects in bearings as an example. The feature maps correspond to the three detection layers examined under different improved methods and the detection results are shown in Figure 12, from which we can further analyze the essential reasons as to why the model improved in its detection performance.
When there is interference such as overlapping and occlusion in the original image, the baseline model is unable to filter out this interfering information, resulting in missed detection. For the CAA module, as shown in Figure 12(II), the detection model improves its focus on the feature information during the detection process, producing more complete detection. When using the CAA and RAFPN modules, the detection model focuses on the information of the target area on the feature map at all levels, reducing the impact of interfering information, as shown in Figure 12(III).
The final improved model leverages three improved methods to better balance the importance of the features at all levels of the detection process, reducing the interfering information while retaining the necessary features, and thus significantly improving the detection performance. In particular, improved high-level feature mapping has richer global information, which can help to improve classification accuracy, while the low-level features are mainly distributed throughout the target region, which provides more accurate information about the region and the target due to richer detail information. This further suggests that the three improvement strategies assist the model in better balancing the importance of features in the detection process. Ultimately, by combining rich contextual information and detailed features, the detection accuracy of the small targets is notably enhanced.
In summary, the CRD-YOLO model, utilizing the three modules CAA, RAFPN, and DBL, is able to better balance the importance of high-level and low-level features in the detection process, as well as reduce the interfering information. This significantly improves the performance of small-target detection in the case of occlusion and overlap. These improvement strategies enable the model to better fuse contextual information and detail features, effectively improving the accuracy of detection and its richness of detail.

4.10. Comparison Results with Other Models

In this section, we select several representative high-speed and high-accuracy detectors to discuss the progress of the improved models, including SSD [33], Faster-RCNN [34], RetinaNet [35], YOLOv3 [36], YOLOv5, YOLOv6 [37], YOLOv7 [38], and YOLOv8 [39] network models. All the models are tested on the self-made bearing gear dataset and show the average accuracy of each type of label. The experimental data are shown in Table 8.
As shown in Table 8, the proposed model outperforms several mainstream models in detecting various defects, offering notable advancements over existing methods. Not only does the model demonstrate a 5.5% increase in mAP compared to the original YOLOv5, but it also achieves higher average accuracies across all label types, with the most substantial improvements observed in the detection of bearing wear and gear missing teeth. This can be attributed to the model’s superior feature representation. Although there is a slight decrease seen in FPS, the model still supports real-time defect detection, which is essential for industrial applications.
Compared to both mainstream YOLO series models and other commonly used two-stage detection algorithms, our enhanced YOLOv5s-based algorithm, as proposed in this research article, proves to be well-rounded in balancing both high detection rates and fast processing speeds, producing real-time detection capabilities. These comprehensive experimental analyses validate the claim that our algorithm is particularly apt for fulfilling the rigorous demands of industrial detection tasks due to its superior accuracy and efficiency. Hence, this model demonstrates strong practicality and holds great potential for deployment in industrial production environments.

4.11. Analysis of Model Detection Effect

To further validate the effectiveness of the enhanced algorithm proposed in this manuscript for detecting small targets on the surfaces of bearings and gears, we perform the weighted gradient activation heat map visualization operation on the original and improved models. This operation shows the specific regions of the input image responsible for the final detection result. As shown in Figure 13, regions that the model focuses on appear hotter, while regions it does not focus on appear cooler.
Based on the heat map visualization results, it is evident that the original YOLOv5s model's predictive accuracy for surface defects on bearings and gears is inadequate, resulting in scattered attention. In contrast, the enhanced model proposed in this manuscript exhibits excellent detection capabilities and can accurately detect all types of surface defects. It directs attention towards defects, effectively improving the model's focus on multi-scale defect targets and enabling it to locate bearing and gear surface defects within the regions it attends to. These findings establish that the improved algorithm put forward in this manuscript can significantly enhance the model detection accuracy, particularly concerning the detection and identification of small-scale targets.
Furthermore, we conducted feature mapping visualizations of the third down-sampled CBS module, CAA module, RAFPN module, and C3 module at the corresponding positions in the original YOLOv5, and the outcomes are depicted in Figure 14.
By observing (III) and (IV), it is evident that the CAA module proposed in this research article is capable of filtering out irrelevant texture information, highlighting graphical information, and retaining more detailed information compared to the C3 module. The examination of (V) reveals that the proposed RAFPN module can efficiently emphasize the positional information of images, particularly for bearing grooves, bearing scratches, and other defect types that overlap and obscure complex background interference. In the original design of the feature fusion pathway, the C3 module forwards the feature mapping for further network processing, which is unsuitable for processing small target defects on the surfaces of bearings and gears, ultimately yielding undesirable results during feature fusion. Therefore, we achieve significant results by constructing a DBL- and RAFPN-based neck network that simultaneously extracts graphical and positional information using small, medium, and large features.
In summary, the proposed model outlined in this manuscript demonstrates a substantial improvement in the detection accuracy of different defects on the surfaces of bearings and gears, satisfies the requirements for real-time detection, and offers promising prospects for widespread application in industrial production environments. Additionally, the heat map visualization and feature map visualization analysis methods offer us a more intuitive means with which to comprehend the operational mechanisms of the model, optimize detection efficacy, and apply it to practical scenarios.

5. Conclusions

In this manuscript, we proposed an improved method to address the challenges associated with the bearing and gear dataset, such as occlusion, overlap, and other factors that contribute to low detection accuracy, especially with dense and small targets. Firstly, we used YOLOv5s as the base model and incorporated a newly designed CAA module into the backbone network structure to enhance feature extraction. The CAA module was pivotal in capturing intricate shallow and deep features, thus improving the model’s ability to detect fine-grained patterns and textures that are typically obscured in complex images.
Next, the DBL module was employed in the neck to further refine the output features. The DBL module enabled the features extracted from the backbone to be used to learn a greater number of parameters, providing the neck with richer feature representations for subsequent fusion. Afterwards, the RAFPN module was utilized to synergistically blend shallow, medium, and deep features, thereby reducing the semantic discrepancy among them. This thorough fusion led to improved detection precision, allowing researchers to effectively leverage the complementary attributes of features at varying scales.
By integrating these modules, our enhanced model demonstrated superior detection results during the testing process, verifying that our approach offers considerable improvements over conventional methods, especially in complex situations involving small and densely clustered targets.
The experimental findings demonstrate that the enhanced CRD-YOLO model outperforms the original YOLOv5s model in terms of accuracy and mean average precision (mAP) in target detection tasks, with only a slight increase in the parameter count. Moreover, the detection speed satisfies real-time requirements. Although this manuscript’s method attains performance improvement, further research is necessary to examine the model’s detection outcomes in the presence of other complexities encountered in industrial production scenarios. These complexities may include an expanded set of detection categories and the absence of a defect detection dataset. Additionally, cross-domain defect detection deserves attention in order to enhance the model’s generalization capability and compensate for the limited availability of datasets.
To summarize, the proposed improved method effectively enhances the detection accuracy for surface defects on bearings and gears, thus offering significant value for applications in industrial production scenarios. However, additional investigations are required to further enhance the model’s performance and generalization ability.

Author Contributions

Conceptualization, E.B. and M.Y.; methodology, E.B., M.Y. and Q.G.; software, E.B. and S.F.; validation, E.B., M.Y. and Q.G.; formal analysis, E.B., Y.L. and Q.G.; investigation, E.B., Y.L. and Q.G.; resources, E.B., Y.L. and Q.G.; data curation, E.B., Y.L. and Q.G.; writing—original draft preparation, E.B., M.Y. and Q.G.; writing—review and editing, E.B., M.Y. and Q.G.; visualization, E.B.; supervision, M.Y.; project administration, M.Y.; funding acquisition, M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62103192), the Natural Science Research Program for Higher Education Institutions in the Jiangsu Province (20KJB520015), and the Changzhou Applied Basic Research Program Project (medium subsidy) (CJ20200039).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Y.; Ding, Y.; Zhao, F.; Zhang, E.; Wu, Z.; Shao, L. Surface defect detection methods for industrial products: A review. Appl. Sci. 2021, 11, 7657. [Google Scholar] [CrossRef]
  2. Wei, L.-S.; Ding, K.; Duan, Z.-D. Bearing roller detection by incorporating single-response constrained SIFT feature matching. J. Electron. Meas. Instrum. 2019, 33, 107–113. [Google Scholar] [CrossRef]
  3. Hengdi, W.; Sha, L.; Siji, D. Research on visual detection algorithm for lateral defects of bearing outer ring. Mech. Des. Manuf. 2017, 169–172. [Google Scholar] [CrossRef]
  4. Chen, S.; Lin, Z.; Wu, Y. Research and implementation of online visual inspection of bearing collar endface defects. Bearing 2022, 48–54. [Google Scholar] [CrossRef]
  5. Wang, H.; Zhang, J.; Tian, Y.; Chen, H.; Sun, H.; Liu, K. A simple guidance template-based defect detection method for strip steel surfaces. IEEE Trans. Ind. Inform. 2018, 15, 2798–2809. [Google Scholar] [CrossRef]
  6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  7. Yao, B.; Wen, X.; Jiao, L. Improved YOLOv3 algorithm for surface defect detection of aluminium profiles. J. Metrol. 2022, 43, 1256–1261. [Google Scholar] [CrossRef]
  8. Qingdang, L.; Tielin, L. Steel plate defect detection based on improved YOLOv3 algorithm. Electron. Meas. Technol. 2021, 44, 104–108. [Google Scholar]
  9. Qian, X.; Hongjin, Z.; Honghui, F. Improved YOLOv3 network for surface defect detection on steel plates. Comput. Eng. Appl. 2020, 56, 265–272. [Google Scholar] [CrossRef]
  10. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  11. Li, B.; Wang, C.; Ding, X.Y.; Ju, H.J.; Guo, Z.P.; Li, J.Y. Improved surface defect detection algorithm for YOLOv4. J. Beijing Univ. Aeronaut. Astronaut. 2023, 49, 710–717. [Google Scholar]
  12. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  13. Liu, S.; Qi, L.; Qin, H.F.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  14. Shi, Z.; Chen, J. Yolo V3 algorithm in the application of workpiece defect detection. J. Mech. Des. Manuf. 2021, 4, 62–65+69. [Google Scholar]
  15. Jocher, G.; Nishimura, K.; Mineeva, T. YOLOv5. Code Repository. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 23 May 2023).
  16. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
  17. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  18. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. Afpn: Asymptotic feature pyramid network for object detection. arXiv 2023, arXiv:2306.15988. [Google Scholar]
  19. Cao, J.; Li, Y.; Sun, M.; Chen, Y.; Lischinski, D.; Cohen-Or, D.; Chen, B.; Tu, C. Do-conv: Depthwise over-parameterized convolutional layer. IEEE Trans. Image Process. 2022, 31, 3726–3736. [Google Scholar] [CrossRef] [PubMed]
  20. Han, Y.J.; Yu, H.J. Fabric defect detection system using stacked convolutional denoising auto-encoders trained with synthetic defect data. Appl. Sci. 2020, 10, 2511. [Google Scholar] [CrossRef]
  21. Liu, W.; Liu, Z.; Wang, H.; Han, Z. An automated defect detection approach for catenary rod-insulator textured surfaces using unsupervised learning. IEEE Trans. Instrum. Meas. 2020, 69, 8411–8423. [Google Scholar] [CrossRef]
  22. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the 2022 IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  23. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  24. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  25. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  26. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  27. Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An efficient pyramid squeeze attention block on convolutional neural network. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 1161–1177. [Google Scholar]
  28. Tang, R.; Liu, Z.; Song, Y.; Duan, G.; Tan, J. Hierarchical multi-scale network for cross-scale visual defect detection. J. Intell. Manuf. 2023, 1–17. [Google Scholar] [CrossRef]
  29. Yeung, C.; Lam, K.M. Efficient fused-attention model for steel surface defect detection. IEEE Trans. Instrum. Meas. 2022, 71, 1–11. [Google Scholar] [CrossRef]
  30. Su, H.; Lin, B.; Huang, X.; Li, J.; Jiang, K.; Duan, X. MBFFNet: Multi-branch feature fusion network for colonoscopy. Front. Bioeng. Biotechnol. 2021, 9, 696251. [Google Scholar] [CrossRef]
  31. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  32. Yu, L.; Wang, Z.; Duan, Z. Detecting gear surface defects using background-weakening method and convolutional neural network. J. Sens. 2019, 2019, 3140980. [Google Scholar] [CrossRef]
  33. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I. Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  34. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 39, 1137–1149. [Google Scholar] [CrossRef]
  35. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
  36. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  37. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  38. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  39. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 23 May 2023).
Figure 1. CRD-YOLO architecture structure: (a) CAA module; (b) DBL module. Green arrows indicate down-sampling operations and blue arrows indicate up-sampling operations.
Figure 2. Schematic diagram of CAA modules: (a) ConvNeXt V2 Block; (b) ConvNeXtV2SE Block; (c) C3 module; (d) CAA module.
Figure 3. Structure of the SE attention mechanism.
Figure 4. CAA module feature map visualization: (I) Original image; (II) C3 module result; (III) CAA module result; (a) groove; (b) abrasion; (c) blemish; (d) break; (e) lack; (f) scratch.
Figure 5. SPPF spatial pyramid pooling.
Figure 6. RepConv module.
Figure 7. RAFPN module.
Figure 8. DBL module.
Figure 9. Sample of defects in the datasets.
Figure 10. Evolution of the evaluation metrics for the model training process based on the self-constructed datasets.
Figure 11. Detection results of the improved model for various types of defective images.
Figure 12. Eight-channel predicted images and visual feature maps of the posterior three detection layers of the neck after different approaches: (I) original model; (II) CAA; (III) ConSE and RAFPN; (IV) final improved model.
Figure 13. Heatmap visualization: (a) grooves; (b) wear; (c) scratches; (d) broken teeth; (e) missing teeth; (f) scratches; (I) original; (II) YOLOv5s; (III) modified YOLOv5s.
Figure 14. Feature map visualization: (I) original image; (II) P3 layer CBS module results; (III) C3 module results; (IV) CAA module results; (V) RAFPN module results; (a) groove; (b) abrasion; (c) blemish; (d) break; (e) lack; (f) scratch.
Table 1. Comparison results of feature extraction modules.
Model | Precision (%) | Recall (%) | mAP@0.5 (%)
YOLOv5 | 91.2 | 89.7 | 91.2
YOLOv5-CNX | 93.4 | 91.3 | 93.9
Table 2. Comparison results between CAA feature extraction module and CNX feature extraction module.
Model | Precision (%) | Recall (%) | mAP@0.5 (%)
YOLOv5-CNX | 93.4 | 91.3 | 93.9
YOLOv5-CAA | 93.2 | 91.9 | 94.1
Table 3. Results of the cross-sectional comparison experiment on attention mechanisms.
Model | Precision (%) | Recall (%) | mAP@0.5 (%)
YOLOv5-ConvCA | 93 | 91.6 | 93.8
YOLOv5-ConvSimAM | 93.2 | 91.5 | 93.4
YOLOv5-CAA | 93.2 | 91.9 | 94.1
Table 4. RAFPN module comparison experiment results.
Model | Precision (%) | Recall (%) | mAP@0.5 (%)
YOLOv5-CAA | 93.2 | 91.9 | 94.1
YOLOv5-CAA-RAFPN | 93.2 | 92.4 | 95.1
Table 5. Results of feature pyramid comparison experiments.
Model | Precision (%) | Recall (%) | mAP@0.5 (%)
YOLOv5-PANet | 91.2 | 89.7 | 91.2
YOLOv5-BiFPN | 92.6 | 89.4 | 92.2
YOLOv5-RAFPN | 94 | 90.9 | 93.6
Table 6. Comparative experimental results of DBL modules.
Model | Precision (%) | Recall (%) | mAP@0.5 (%)
YOLOv5-CAA-RAFPN | 93.2 | 92.4 | 95.1
YOLOv5-CAA-RAFPN-DBL | 95.1 | 94 | 96.7
Table 7. Results of ablation experiments.
CAA | RAFPN | DBL | P (%) | R (%) | mAP@0.5 (%) | Parameters/M | GFLOPS
- | - | - | 91.2 | 89.7 | 91.2 | 26.8 | 15.8
✓ | - | - | 93.2 | 91.9 | 94.1 (+2.9) | 27.2 | 16.1
- | ✓ | - | 94 | 90.9 | 93.6 (+2.4) | 28.1 | 16.1
- | - | ✓ | 92.8 | 91.2 | 93.3 (+2.1) | 26.8 | 15.8
✓ | ✓ | ✓ | 95.1 | 94 | 96.7 (+5.5) | 29.2 | 17.5
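The Parameters/M and GFLOPS columns of Table 7 are standard model-size and compute metrics. The sketch below is not the profiling code used in this work, only a minimal illustration of how such figures are commonly obtained for a PyTorch model; the model handle and the thop-based FLOPs estimate mentioned in the comments are assumptions.

```python
# Minimal sketch: reporting model size in the units used by Table 7.
# Parameter count is exact for any torch.nn.Module; a GFLOPs figure additionally
# needs a profiler (e.g. the third-party `thop` package), so it is only indicated here.
import torch

def params_in_millions(model: torch.nn.Module) -> float:
    # Total number of learnable parameters, expressed in millions (Parameters/M).
    return sum(p.numel() for p in model.parameters()) / 1e6

# Hypothetical usage, assuming `model` is the loaded CRD-YOLO / YOLOv5 network:
# print(f"Parameters/M: {params_in_millions(model):.1f}")
# GFLOPs at a 640x640 input could be estimated with, e.g.,
# thop.profile(model, inputs=(torch.randn(1, 3, 640, 640),)).
```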
Table 8. Comparison of different algorithms.
Types | SSD | Faster-RCNN | RetinaNet | YOLOv3 | YOLOv5 | YOLOv6 | YOLOv7 | YOLOv8 | Ours
AP (%), Groove | 73.6 | 80.38 | 79.4 | 85.4 | 94.1 | 81.2 | 83.9 | 91.7 | 98.5
AP (%), Abrasion | 84.3 | 83.5 | 92.9 | 79 | 93.2 | 92.7 | 85.8 | 93.4 | 99.2
AP (%), Blemish | 83.6 | 84.5 | 91.8 | 78.3 | 91.9 | 90.5 | 81.7 | 87.6 | 95.3
AP (%), Break | 85.1 | 57.6 | 91.3 | 88.9 | 96.4 | 92.1 | 95.2 | 96.7 | 98.9
AP (%), Lack | 79.6 | 64.1 | 89.9 | 75.9 | 84.5 | 90.3 | 84.7 | 87.5 | 92.8
AP (%), Scratch | 82.8 | 84.5 | 87.3 | 79.9 | 87.3 | 88.9 | 83.4 | 85.1 | 92.2
mAP@0.5 (%) | 81.5 | 77.9 | 88.8 | 81.2 | 91.2 | 89.7 | 85.8 | 90.3 | 96.7
FPS (f/s) | 65.6 | 10.87 | 27.2 | 49.8 | 128.2 | 95.7 | 76.9 | 117.6 | 75.2
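As a sanity check on Table 8, the mAP@0.5 row is broadly consistent with the arithmetic mean of the six per-class APs (small deviations, e.g. in the last column, are plausibly due to rounding of the tabulated values). A minimal sketch of that check using the YOLOv5 column is shown below; the numbers are copied from the table rather than recomputed from detections.

```python
# Minimal check: mean of per-class AP@0.5 values versus the reported mAP@0.5.
# The per-class APs below are the YOLOv5 column of Table 8; rounding in the table
# means the recomputed mean may differ slightly from the published figure.
ap_per_class = {
    "groove": 94.1, "abrasion": 93.2, "blemish": 91.9,
    "break": 96.4, "lack": 84.5, "scratch": 87.3,
}
map50 = sum(ap_per_class.values()) / len(ap_per_class)
print(f"recomputed mAP@0.5 = {map50:.1f}%")  # ~91.2%, matching the reported 91.2%
```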