1. Introduction
Bearings are core components of mechanical equipment: they support rotating bodies and reduce the coefficient of friction. The efficiency and service life of mechanical equipment are significantly affected by the quality and performance of bearings, so quality inspection must be strengthened during bearing production and manufacturing to ensure that the qualification rate meets requirements. Detecting scratches on the bearing surface [1] is therefore crucial, and the automated detection of bearing scratches has high research value.
Traditional methods for detecting scratches rely on manual feature extraction. However, because scratches vary widely in shape and direction, the accuracy of traditional methods is low, making it difficult to meet the requirements of on-site industrial inspection. In recent years, machine vision technology has been increasingly applied in industrial production and manufacturing: it performs detection or identification of the products under inspection, improving production efficiency and reducing production costs. Its application in the field of defect detection is becoming increasingly prevalent. Currently, there are two principal categories of machine vision-based defect detection methods [2].
One approach uses traditional image processing algorithms, such as the histogram of oriented gradients (HOG) [3] and the deformable parts model (DPM) [4,5,6], typically involving three steps: region selection, feature extraction, and classification regression. However, manually extracted target features have limitations for detecting small targets such as scratches.
The other uses deep learning techniques. With the development of deep learning, target detection algorithms based on convolutional neural networks have generally evolved into two categories: two-stage and single-stage detection algorithms.
Region convolutional neural networks (RCNNs) [7], as the starting point of deep learning-based target detection, are classical two-stage target detection algorithms with great importance and reference value. RCNN performs target detection through region proposals generated by the selective search (SS) algorithm [8], followed by feature extraction, object classification, bounding box refinement, and proposal filtering. Fast RCNN [9] and Faster RCNN [10] were proposed on this basis. However, the two-stage design leads to high computational complexity and slow processing speed, making these networks unsuitable for real-time target detection tasks.
Single-stage target detection networks use convolutional neural networks (CNNs) [11] to extract image features and directly predict the location and class of the target. They are fast but were initially slightly less accurate; after several generations of networks, however, their accuracy has greatly improved. The YOLO family [12,13,14,15,16,17,18,19] is the most representative of single-stage object detection. You Only Look Once (YOLOV1) performs object detection by dividing the image into a grid and predicting bounding boxes and class probabilities directly from grid cells. Researchers have continued to improve the YOLO family of target detection networks and have developed a number of versions.
Research on bearing defect detection has also evolved from traditional image algorithms to deep learning methods. Zhengyan Gu et al. [20] proposed a machine vision-based automatic detection and identification method for bearing surface defects. This method improves and combines the Otsu algorithm and the Canny algorithm to enhance the completeness and accuracy of bearing surface defect segmentation. Dan L et al. [21] proposed a deep learning method for bearing face detection based on data enhancement and an improved Fast RCNN, using a semi-supervised data enhancement approach based on local flaw features, an improved RA strategy, and the mosaic algorithm to enhance the initial bearing sample data. This method effectively achieves accurate and fast bearing face flaw detection. Dahai L et al. [22] proposed a non-destructive testing (NDT) method for detecting surface defects on cylindrical rollers of silicon nitride bearings. The method is based on an optimized convolutional neural network that combines a semantic segmentation sub-network and a decision-making sub-network, achieving high speed and accuracy. Liu B et al. [23] designed a machine vision system that uses a multi-angle light source to capture clear images of surface defects on bearings and achieved automatic detection of small defects through contour sub-area fitting and an improved Nobuyuki Otsu method. Zheng Z et al. [24] employed a deep learning approach to detect surface defects on bearing caps, improving the YOLOV3 network structure by proposing the BNA-Net feature extraction network, an attention prediction sub-network, and a defect localization sub-network; the final AP value for defect detection reached 69.74%. It can be seen that current research on bearing defect detection, despite incorporating deep learning technology, has not achieved high detection accuracy and requires efficiency improvements to meet industrial demand.
This study focuses on bearing surface scratch detection. A scratch detection model named YOLOV5-CDG is proposed, based on an improved YOLOV5, to achieve efficient and high-accuracy detection of scratches on bearing surfaces; a self-made dataset is used for experimental validation. Our contributions are summarized as follows:
- (1)
We designed a machine vision-based bearing surface scratch sensor system that meets the needs of bearing surface scratch inspection. By utilizing the system, a novel dataset of bearing surface scratches was produced, providing high-quality bearing surface scratch images to support our experiments.
- (2)
Based on the YOLOV5 model, we proposed a novel model that improves the performance of YOLOV5 by adding the Coordinate Attention (CA) mechanism module, incorporating Deformable Convolutional Network (DCN) feature extraction and employing the GhostNet network, and we named it YOLOV5-CDG. The model achieved efficient and high-accuracy detection of scratches on bearing surfaces.
This paper is organized as follows: Section 2 introduces the modules used in our proposed YOLOV5-CDG model. Section 3 describes the sensor system we established and the bearing surface scratch dataset we produced, and presents a series of experiments testing the performance of the YOLOV5-CDG model. Section 4 analyzes the experimental results and verifies the performance of the YOLOV5-CDG model. Finally, Section 5 concludes this study.
2. Materials and Methods
2.1. Overview of YOLOV5
Compared with earlier versions, YOLOV5 significantly improves both the accuracy and speed of target detection, and it can be easily deployed on embedded or CPU devices. Therefore, this study carries out bearing surface scratch detection based on the YOLOV5 target detection network.
The YOLOV5 network architecture is composed of four main parts: input, backbone, neck, and prediction. The network divides the image into grids and predicts the presence of a target, along with its category and location information, for each grid. YOLOV5 performs a single forward computation to obtain the target detection results, making it faster than two-stage target detection networks. This study introduces and improves upon the YOLOV5 model, specifically YOLOV5s-v6.1. All subsequent references to YOLOV5 refer to YOLOV5s-v6.1.
Figure 1 illustrates the network structure of YOLOV5, with the SPPF module highlighted in a dashed rectangular box. The convolution module is labelled with kernel size, stride, and padding.
The YOLOV5 backbone is CSPDarknet53, which builds on the Darknet53 network initially proposed in YOLOV3. The network comprises 53 convolutional layers, each convolution module consisting of a convolutional layer, a BN layer, and the SiLU activation function. The BN layer normalizes the data to alleviate gradient explosion and gradient vanishing and speeds up network convergence. The SiLU activation function is the Sigmoid function weighted by its input, i.e., SiLU(x) = x · Sigmoid(x).
Figure 2 displays the structure of the C3 module in YOLOV5. The first convolution module employs a kernel size of 6 × 6. The C3 module comprises three convolution modules and a BottleNeck module. CSPDarknet53 introduces the CSP [25] structure on top of Darknet53. Compared with the traditional convolution structure, the CSP structure divides the shallow feature map into two parts along the channel dimension: one part passes through the subsequent convolutions, while the other is spliced directly with the result without additional processing. This approach significantly reduces the reuse of gradient information, cutting network parameters by 10% to 20% while maintaining network accuracy. In the backbone, the BottleNeck part of the C3 module uses the BottleNeck1 structure, while the C3 modules elsewhere use the BottleNeck2 structure.
The structure of the neck part is based on the SPPF module and CSP-PAN [26], with improvements made to the SPPF module. The SPP module implements spatial pyramid pooling by adopting the idea of SPPNet [27]: the input feature maps pass through pooling layers with different kernel sizes in parallel, achieving multi-scale feature fusion to some extent. The SPPF instead passes the input feature layer sequentially through pooling layers with a kernel size of 5 × 5 and concatenates the output of each pooling layer. Because multiple serial pooling operations with the same kernel size are equivalent to a single pooling operation with a larger kernel size, SPPF produces the same output as SPP with improved network efficiency. The YOLOV5 PAN is based on the FPN [28] network. The Feature Pyramid Network (FPN) fuses low-level and high-level features, providing rich semantic information at all scales and effectively addressing the multi-scale problem of image features. FPN integrates semantic features into the low-level feature map but does not consider location information. PAN adds a bottom-up pyramid structure on top of FPN, which transfers strong localization information from low to high levels. YOLOV5 also uses the CSP structure in PAN, which improves localization accuracy when high-level features are used for target recognition.
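The SPP/SPPF equivalence described above can be sketched in a few lines of PyTorch (a minimal illustration, not the official YOLOV5 implementation, which also wraps the pooling in convolution modules):

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Serial 5x5 max pooling; three chained 5x5 pools cover the same
    receptive fields as the parallel 5/9/13 pools of the original SPP."""
    def __init__(self, k: int = 5):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.pool(x)    # equivalent to one 5x5 pool
        y2 = self.pool(y1)   # equivalent to one 9x9 pool
        y3 = self.pool(y2)   # equivalent to one 13x13 pool
        return torch.cat([x, y1, y2, y3], dim=1)

x = torch.randn(1, 256, 20, 20)
out = SPPF()(x)
print(out.shape)  # torch.Size([1, 1024, 20, 20])
```

Because max pooling with -inf padding is associative over nested windows, two serial 5 × 5 pools produce exactly the output of a single 9 × 9 pool, which is why SPPF matches SPP's result at lower cost.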
The prediction section includes three detection heads operating on different predicted feature maps. The detection heads are responsible for predicting large, medium, and small targets according to the size of the predicted feature maps. With an input image size of 640 × 640, for example, the feature maps of the three prediction heads are 80 × 80, 40 × 40, and 20 × 20, used to predict small, medium, and large targets, respectively. Each predicted feature map is followed by a 1 × 1 convolution that outputs the prediction parameters: target category, anchor box position, and confidence score. For each target category, the confidence score indicates the probability that the anchor box contains a target of that category. Finally, the anchor boxes whose confidence scores exceed the threshold are output to complete the detection of the target object.
2.2. Coordinate Attention
The attention mechanism (AM) [29] is a technique that deep learning models use to mimic the way humans allocate attention when processing information, enabling the model to focus on the information most important for the current task. In target detection, the attention mechanism can help the model identify and locate targets more accurately, allowing it to focus on regions that may contain targets and thereby enhancing detection accuracy and efficiency.
Attention mechanisms can be classified into spatial domain, channel domain, and hybrid domain attention mechanisms. The SE [30] attention mechanism only considers the importance between channels and ignores spatial coordinate information. The Coordinate Attention (CA) [31] mechanism, proposed by Hou Q et al. in 2021, combines channel attention with location information to enhance the performance of the network model. It generates attention weights, assigning smaller weights to non-target regions, and embeds location information within the channels. As shown in Figure 3, the CA attention mechanism introduces two one-dimensional feature encoders that extract perceptual attention feature maps in the horizontal and vertical directions, respectively. Let the input feature be C × H × W, where C, H, and W represent the number of channels, the height, and the width of the input feature map. Separating spatial and channel processing in advance effectively preserves spatial information while retaining channel information for subsequent combination. The input features are first average-pooled along the horizontal and vertical directions, producing feature maps of dimensions C × H × 1 and C × 1 × W, which are then concatenated. Attention maps for the horizontal and vertical directions are obtained through slicing and normalization. Finally, these directional attention maps are fused with the original input feature maps to generate the feature maps with attention weights.
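The pooling–concatenate–slice–fuse pipeline can be sketched as a PyTorch module (layer widths and the Hardswish activation are illustrative assumptions; the authors' exact configuration may differ):

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Sketch of Coordinate Attention (Hou et al., 2021)."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # 1D average pooling along each direction: C x H x 1 and C x 1 x W
        x_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)          # concatenate, shared transform
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)  # slice back into directions
        a_h = torch.sigmoid(self.conv_h(y_h))                          # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))      # (n, c, 1, w)
        return x * a_h * a_w                      # fuse weights with the input
```

A usage check: `CoordAtt(64)(torch.randn(2, 64, 32, 32))` returns a tensor of the same shape as its input, with each position rescaled by its horizontal and vertical attention weights.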
The CA attention mechanism improves the model's ability to locate and recognize targets, particularly in tasks involving spatial relationships. It reduces the impact of irrelevant information and noise, improving the model's robustness and generalization performance, and in target detection tasks it improves localization ability and detection accuracy. Moreover, the CA mechanism is computationally efficient: it does not significantly increase the number of network parameters yet still improves model performance, making it feasible for real-world applications and beneficial to a variety of visual tasks.
2.3. Deformable Convolutional Networks
Traditional convolution samples input features within the receptive field in a fixed pattern, which may not adapt well to the deformation of the target object. The Deformable Convolutional Network (DCN) [32], proposed by Dai J et al. in 2017, is a convolutional neural network module for computer vision tasks; it adds local deformation capability to the convolution operation by introducing a deformable convolution kernel. This kernel learns a set of offsets, allowing it to adaptively sample input features based on their local context and thus better adapt to the deformation of the target object. Therefore, this study fuses deformable convolution into the YOLOV5 feature extraction module. As shown in Figure 4, the blue dots represent the 3 × 3 convolution kernel sampling positions, and the green dots represent the sampling positions after the convolution kernel offset.
Using a 3 × 3 convolution kernel as an example, the convolution sampling grid can be represented by the set $R$:

$$R = \{(-1,-1),\ (-1,0),\ \ldots,\ (0,1),\ (1,1)\}$$

Assuming $x$ is the input and $y$ is the output, the convolution operation for the current pixel point $p_0$ can be expressed as:

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) \tag{9}$$

where $p_n$ is the (integer) offset of each sampling position relative to $p_0$ within the receptive field set $R$ and $w$ is the sampling weight. In deformable convolution, an additional offset $\Delta p_n$ is added to Equation (9):

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \tag{10}$$
Because $\Delta p_n$ is produced by another convolution, it is typically not an integer, so $x(p_0 + p_n + \Delta p_n)$ does not correspond to an actual integer pixel point in the image. Bilinear interpolation is therefore used to compute it:

$$x(p) = \sum_{q} G(q, p) \cdot x(q), \quad G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y), \quad g(a, b) = \max(0,\ 1 - |a - b|)$$

where $p = p_0 + p_n + \Delta p_n$ is the fractional sampling position, $q$ enumerates the integer pixel positions on the feature map, $x(q)$ is the value at the integer pixel point $q$, and $G(\cdot, \cdot)$ denotes the bilinear interpolation kernel. The weight coefficient $g(a, b)$ is computed from the coordinates, decreases with the distance between the two coordinate points, and has a maximum value of 1. Physically, only the integer points $q$ within one pixel of $p$ in both the horizontal and vertical coordinates participate in the operation, yielding the value at the point $p$.
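The bilinear interpolation formula can be checked with a small pure-Python sketch (the feature map values are illustrative):

```python
def g(a: float, b: float) -> float:
    """1D bilinear kernel g(a, b) = max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))

def bilinear_sample(x, py: float, px: float) -> float:
    """x(p) = sum_q G(q, p) * x(q) over a 2D feature map x (list of rows);
    only integer points within one pixel of p get nonzero weight."""
    total = 0.0
    for qy, row in enumerate(x):
        for qx, val in enumerate(row):
            total += g(qy, py) * g(qx, px) * val  # G(q, p) = g(qy, py) * g(qx, px)
    return total

fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
print(bilinear_sample(fmap, 1.5, 2.5))  # 8.5, the mean of neighbors 6, 7, 10, 11
```

At an integer position the kernel weight collapses to 1 on that pixel, so `bilinear_sample(fmap, 2, 3)` returns the exact value 11.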
The process described above does not constrain the size of the offset; when the offset is too large, the convolution kernel may deviate from the target region. To address this issue, DCNv2 [33] adds a modulation mechanism to control the degree of offset change. Equation (10) is modified to include a weight coefficient for each sampling position:

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \cdot \Delta m_n$$

where the modulation scalar $\Delta m_n$ lies in the range 0 to 1 and is learned by a separate convolution.
Figure 5 illustrates the deformable convolution process: a convolution over the input generates an offset field, and the offsets are applied to the sampling positions of the original convolution kernel, yielding the deformable convolution.
The scratches on bearings are typically elongated. To extract scratch features more efficiently, deformable convolution can be added in the feature extraction stage. This avoids the extraction of redundant information by the traditional fixed convolution kernel. Additionally, it improves target detection accuracy. The DCN module captures deformation information and precise positional alignment of the target, improving target location and identification while reducing missed and false detections.
2.4. GhostNet
Traditional deep learning models employ a large number of convolutional parameters, resulting in long inference times, which makes them challenging to deploy in industrial settings for real-time detection. A potential solution is to create lightweight deep convolutional models. GhostNet [34] is a lightweight network that reduces the number of parameters by generating redundant feature maps through a concise computational approach. This is achieved by the Ghost module, which consists of two parts: a main branch (MainNet) that extracts the primary feature representation and a lighter-weight branch (GhostNet) that performs cheaper computations to extract additional features, which are then fused with those of the main branch.
Figure 6 illustrates that the Ghost module comprises two operation parts. The first step uses ordinary convolution to generate a real feature map with a small number of channels from the input image or feature map. The second step performs simple linear operations on the feature map obtained in the first step to produce the Ghost feature layer. The real feature layer is then concatenated with the Ghost feature layer, yielding a final output feature map with the same number of channels as a standard convolution operation.
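The two-step structure can be sketched as a PyTorch module (following the paper's description with a depthwise convolution as the cheap linear operation; not a line-for-line copy of the official code):

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of the Ghost module: ordinary conv for 1/s of the channels,
    cheap depthwise conv for the rest, then concatenation."""
    def __init__(self, in_ch: int, out_ch: int, s: int = 2, k: int = 1, d: int = 3):
        super().__init__()
        primary_ch = out_ch // s          # 1/s of channels from ordinary conv
        ghost_ch = out_ch - primary_ch    # the rest from cheap linear ops
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True))
        # depthwise d x d convolution as the "cheap linear operation"
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, ghost_ch, d, padding=d // 2,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)                          # real feature maps
        return torch.cat([y, self.cheap(y)], dim=1)  # real + ghost features

m = GhostModule(16, 32)
y = m(torch.randn(1, 16, 8, 8))
print(y.shape)  # torch.Size([1, 32, 8, 8])
```

The output channel count matches a standard convolution, but only `out_ch // s` channels pay the full dense-convolution cost.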
Let the input feature map have size $h \times w \times c$, where $h$ is the height, $w$ the width, and $c$ the number of channels; let the output feature map have size $h' \times w' \times c'$ with $n = c'$ output channels; and let the conventional convolution kernel size be $k \times k$. The computational cost of the conventional convolution is then:

$$n \cdot h' \cdot w' \cdot c \cdot k \cdot k$$
The $n$ output channels are divided into $s$ equal parts, with the ordinary convolution producing $1/s$ of the output channels. The Ghost module's linear operation is implemented as depthwise convolution with a kernel size of $d \times d$, producing the remaining $(s - 1)/s$ of the output channels.
The speedup ratio of ordinary convolution to the Ghost module is:

$$r = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + \frac{(s-1) \cdot n}{s} \cdot h' \cdot w' \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s$$

Since $d \times d$ and $k \times k$ are both convolution kernel sizes, their values can be considered approximately equal; and since $s$ is much smaller than $c$, the ratio of operations can be approximated as $s$. This analysis shows that the Ghost module can significantly reduce the computational and parametric cost of a model compared with the traditional convolution module. It has also demonstrated comparable or even superior performance to larger, more complex models in a variety of experiments on image classification and target detection tasks. Thus, this study employs the Ghost module to substitute for certain conventional convolution modules, thereby enhancing the model's overall performance.
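A quick numeric check of this ratio (pure Python; the layer dimensions are illustrative, not taken from the model):

```python
# FLOPs of ordinary convolution vs. the Ghost module for one layer.
# Illustrative values: 256 input/output channels, 40x40 output, 3x3 kernels, s = 2.
c, h_out, w_out, n, k, d, s = 256, 40, 40, 256, 3, 3, 2

ordinary = n * h_out * w_out * c * k * k
ghost = (n // s) * h_out * w_out * c * k * k \
      + ((s - 1) * n // s) * h_out * w_out * d * d
ratio = ordinary / ghost
print(round(ratio, 3))  # 1.992, i.e. close to s = 2 (exactly s*c/(c + s - 1) when d == k)
```

As the channel count $c$ grows, the ratio approaches $s$, matching the approximation above.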
2.5. YOLOV5-CDG
The CA attention mechanism is added to the last layer of the YOLOV5 feature extraction network so that, during training, the network pays more attention to the target region and assigns less weight to irrelevant regions. Deformable convolution replaces some of the convolutions in the feature extraction network, shifting the sampling positions of the convolution kernel so that they follow the shape of the scratch more closely. The Ghost module reduces the number of network parameters by generating Ghost layers; therefore, the convolution module of the C3 module in YOLOV5 is replaced with the Ghost module. The resulting network structure is shown in Figure 7. We name the improved network YOLOV5-CDG; the highlighted borders mark the enhancements relative to YOLOV5.
3. Experiments
Section 3.1 describes the selection of hardware and the construction of the sensor system, Section 3.2 introduces the self-made experimental dataset, and Section 3.3 details the experimental methodology employed in this study.
3.1. Machine Vision-Based Bearing Surface Scratch Sensor System
The core hardware of the sensor system comprises an industrial camera, a lens, and a light source. To ensure the accuracy of the system, the hardware selection must be analyzed carefully. In this section, we examine these three types of hardware, weighing cost, function, and other factors to determine the hardware models suitable for this study, and describe the bearing surface defect detection sensor system we established with them.
- (1)
Industrial camera selection: Given the task of imaging bearings, the bearing size, working distance, and camera suitability were of paramount importance. We ultimately selected the MV-CA050-12UC industrial camera (Hikvision, Hangzhou, China), a 5-megapixel area-array camera. The detailed parameters of the camera are presented in Table 1.
- (2)
Lens selection: Considering the subject, the camera, and other pertinent factors, the OPTO double telecentric lens TC23036 was selected. The lens's detailed parameters are presented in Table 2.
- (3)
Light source selection: We opted for the OPT-CO80-B coaxial light source; its specific parameters are presented in Table 3. It is a blue coaxial light source with a luminous surface of 73 × 73 mm. By positioning it above the bearing and capturing an image of the bearing surface, scratches on the surface can be highlighted.
The sensor system's bearing surface defect detection function was implemented on this hardware in Python (version 3.8.16). The system built in the laboratory is shown in Figure 8, where the bearing is placed on the carrier table with the coaxial light source above it. The MVS 3.4.1 (Hikvision, Hangzhou, China) machine vision industrial camera client is installed on the computer to connect and communicate with the camera. The coaxial light source is turned on when images of the bearing surface are collected.
3.2. Experimental Dataset
The dataset comprises data from both a company and our own collection. In total, 1809 images of bearing surface scratches were collected with the camera, covering both qualified and defective bearings. Figure 9 shows a bearing image before and after annotation. The dataset is randomly divided into training, validation, and test sets: the training set includes 1206 images, the validation set 206 images, and the test set 397 images. The basic information of the dataset is shown in Table 4.
3.3. Experimental Method
Table 5 shows the experimental parameter configuration information.
Because the dataset in this experiment is small, data enhancement methods are used to expand the training set when training the model. During training, one or more enhancement methods, such as random scaling, random panning, and random horizontal flipping, are randomly selected to process the training data. Figure 10 demonstrates the effect of these operations on two images.
The training process utilizes the Warmup strategy: a small learning rate is used at the beginning of training and gradually increased to the set learning rate, which then decays as the number of training epochs increases. Warmup mitigates the training instability caused by a too-large initial learning rate. Figure 11 illustrates the change in learning rate during training. Images are uniformly scaled to 640 × 640 during training, and the pre-trained weights YOLOV5s.pt are loaded to help the network converge faster. An epoch is one complete pass of the model over all data in the training set; the experiments in this section train for 150 epochs.
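A warmup schedule of this shape can be sketched in a few lines (linear warmup followed by cosine decay; all hyperparameter values here are illustrative, not the study's exact settings):

```python
import math

def warmup_cosine_lr(epoch: int, total_epochs: int = 150,
                     warmup_epochs: int = 3, base_lr: float = 0.01,
                     warmup_start: float = 1e-4, final_lr: float = 1e-4) -> float:
    """Linear warmup to base_lr, then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        # ramp up linearly from warmup_start to base_lr
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    # cosine decay over the remaining epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t))

lrs = [warmup_cosine_lr(e) for e in range(150)]
# starts low, peaks at base_lr after warmup, decays toward final_lr
print(round(lrs[0], 6), round(max(lrs), 4), round(lrs[-1], 6))
```

The small initial rate keeps early gradient updates stable while the randomly initialized heads settle; the subsequent decay refines the weights.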
Defect detection in industry involves two steps: first identifying defective products and then labelling the location of the defect. The scratch detection experiments above were conducted solely on the scratch dataset, because including target-free images when training a target detection network hinders its ability to efficiently extract target features. Therefore, all training and test data consisted of bearing images with scratches, allowing the network to extract scratch features more efficiently. Finally, the trained weights were used to test the network's classification accuracy for both scratch-free qualified bearings and scratch-defective bearings.
5. Discussion
An improved YOLOV5 network for bearing surface scratch defect detection, named YOLOV5-CDG, is proposed in this study. The CA attention mechanism is incorporated into the feature extraction network, and some of the convolutional layers in the feature extraction network are fused into deformable convolutions; these additions increase the number of parameters while improving network accuracy. The traditional convolution module of the C3 module is replaced with the Ghost module, reducing the network's parameters and computations and significantly improving inference speed. A self-made dataset is used to train the network, and its performance is evaluated with multiple metrics. The experimental results show an AP value of 97% for scratch detection, an accuracy of 99.46% for distinguishing defective from qualified products, and an average detection time of 263.4 ms per image on CPU devices and 12.2 ms per image on GPU devices. Furthermore, a comparative analysis of different models validates that the proposed method achieves higher speed and accuracy than the original YOLOV5 network, effectively meeting the requirements of bearing surface scratch detection in industrial sites.
Currently, few studies and industrial datasets are available for bearing scratch detection. Noise and other conditions arising in actual industrial production must also be considered. In the future, we plan to conduct research in more practical environments to optimize the algorithm and further improve the model's detection performance.