1. Introduction
At present, many mechanical devices in service experience successive failures, primarily due to wear and aging [1,2]. To monitor the wear status of such mechanical devices without causing downtime or operational disruption, tribological analysis relies on signals generated by the equipment itself and on substances produced during operation, reconstructing the internal working state of the equipment through methods such as fault analysis of collected vibration and thermal information. One technique that can be utilized for this purpose is ferrographic analysis, one of the research methods developed within tribological informatics. This technique involves collecting wear particles that have settled in used lubricating oil and analyzing their morphology, size, color, and other characteristics. Through this form of analysis, it becomes possible to diagnose both the lubrication condition and the degree of wear of the equipment. Furthermore, if there is any indication of a potential failure trend, this technique may also enable early inference of both the location and the cause of the impending issue, thus facilitating proactive troubleshooting [3,4,5,6]. While ferrographic analysis offers significant advantages in this context, its widespread application is constrained by the substantial human resources and time costs associated with acquiring and interpreting ferrographic images. The development of artificial intelligence has provided new insights for tribological informatics by enabling computers to autonomously learn relationships within the output data of tribological systems. This not only saves time and effort but also allows for integration across various data types, resulting in timely and accurate monitoring, prediction, and optimization capabilities [7,8]. To realize intelligent recognition of wear particle images in ferrographic analysis and to provide software support for online oil monitoring devices, this paper proposes a YOLOv8-based algorithm for recognizing ferrographic images.
Since intelligent image recognition emerged as a research hotspot, scholars in the field of tribology have conducted extensive investigations on this topic [9]. First, Wang et al. integrated the traditional watershed algorithm with a gray clustering algorithm, achieving significantly better segmentation outcomes than previous algorithms [10]. Moreover, clustering algorithms serve as an essential foundation for various convolutional neural network algorithms. Subsequently, Peng et al. constructed a search tree model using support vector data description (SVDD), the K-means clustering algorithm, and support vector machine (SVM) techniques, achieving notable success in classifying cutting wear particles, red and black oxides, and spherical particles [11]. However, challenges persisted in classifying similar wear particles; furthermore, data collection and processing had to be performed manually before parameters could be input into the algorithm, which limited the model's level of intelligent recognition.
Once convolutional neural networks (CNNs) began to revolutionize the field of computer vision, their impact attracted attention from experts across various disciplines. Peng et al. proposed a CNN-based model called FECNN for ferrographic image identification, focusing primarily on fatigue wear particles, oxidation wear particles, spherical wear particles, and cases where two or more types appear simultaneously [12]. Their experimental results demonstrated the model's high accuracy; however, its image processing speed was relatively slow. In the same year, Peng et al. validated an algorithm in a rolling bearing wear particle monitoring system by combining a Gaussian mixture background model with a spot detection algorithm and attempted to extract three-dimensional features of wear particles from an online monitoring system [13]. In another study, Wang S. aimed to enhance model training accuracy by extracting three-dimensional information from ferrographic images and proposed a two-level classification tree that combined a backpropagation (BP) neural network with CNNs for classifying single-target wear particle images; nevertheless, owing to the relatively simple construction of the CNNs, the model performed poorly in distinguishing similar wear particles [14,15,16]. Subsequently, Wang S. addressed this issue by handling simple wear particle types with fuzzy selection methods and by analyzing and training similar-texture sample sets through principal component analysis and BP neural networks on top of the existing classification structure [17]. Furthermore, Peng Y.P. advanced this work by proposing convolutional neural networks for detection together with support vector machines for classification, achieving suitable accuracy on a self-made four-class dataset [18]. However, the method still requires improvement in its ability to identify overlapping wear particles. Fan S.L. introduced FFWR-Net, which made significant advances in wear particle classification but unfortunately lacked detection capability [19]. Vivek J. employed AlexNet for feature extraction from ferrograms and utilized 15 classifiers for classification; their results showed that instance-based K-nearest-neighbor classification yielded the best performance, achieving a 95% accuracy rate [20].
Previous studies have predominantly employed shallow neural networks for the simple classification and detection of wear particles. In recent years, however, researchers have grown increasingly interested in utilizing deep convolutional neural networks (DCNNs) for the intelligent detection of wear particles. Notably, the DCNN field primarily encompasses the YOLO series, the R-CNN series, and DINO-type algorithms. Compared with earlier approaches, these algorithms greatly simplify preprocessing and integrate detection and classification to achieve object detection; some even offer video-stream tracking capabilities. One of the most significant contributions is the YOLO (You Only Look Once) framework, first introduced in 2016 and subsequently refined through numerous revisions. YOLOv1 established the fundamental concept of treating object detection as a regression problem, involving the following key steps: ① dividing the image into uniform regions and predicting objects within them; ② generating multiple bounding boxes, confidence scores, and class probabilities for these regions; ③ computing the loss over bounding boxes, confidence scores, and class probabilities to speed bounding-box convergence and improve class probabilities during training; ④ retaining the most probable bounding boxes containing target objects through non-maximum suppression (NMS); and ⑤ validating results and iteratively refining training until accuracy stabilizes. The distinctive features of the YOLO framework include end-to-end training, multi-scale prediction, and rapid detection [21]. These factors have captured the attention of scholars. Jia et al. utilized YOLOv3 to test a self-built dataset comprising six types of wear particles and demonstrated excellent performance in detecting cutting wear particles, spherical wear particles, and severe sliding wear particles [22]. Nevertheless, this model suffers from challenges relating to its large size and difficult deployment. To address this issue, He et al. employed an optimized, smaller version of the YOLOv5 model while incorporating attention mechanisms; this modification yielded promising experimental results without compromising processing speed [23]. Fan H.W. employed a fusion of the YOLOv3 and DarkNet53 models and, after two rounds of transfer learning, achieved a high classification recognition rate on a custom gearbox wear dataset [24]. Shi X.F. conducted experiments using YOLOv5s on a highly intricate custom ferrographic dataset containing 23 types of wear particles, achieving an accuracy rate of 40.5% [25].
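As a concrete illustration of step ④ above, the following is a minimal sketch of greedy non-maximum suppression; the corner-format boxes and the 0.5 IoU threshold are illustrative assumptions rather than settings taken from the cited works.

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over boxes given as (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]          # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))                 # keep the most confident remaining box
        if order.size == 1:
            break
        # Intersection of the kept box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter + 1e-9)
        # Discard boxes that overlap the kept box too strongly
        order = order[1:][iou <= iou_thresh]
    return keep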
However, the detection of wear particles still poses challenges in identifying complex-shaped wear particles, overlapping wear particles, and edge-blurred and small-target wear particles. In this paper, we propose an enhanced YOLOv8 intelligent model for wear particle identification based on point sampling in ferrographic images. The main contributions and innovations of our work are outlined below:
2. Improved YOLOv8 Model Architecture
2.1. Introduction to YOLOv8
The Ultralytics team, known for its development of YOLO models from version 3 to version 5, further enhanced YOLOv8 by incorporating new technologies into the v5 framework [30,31,32]. Similar to its predecessor YOLOv5, YOLOv8 offers models at several scales (N/S/M/L/X). In the present study, to meet the practical requirements of online detection, we opted for the nano-scale model.
As shown in Figure 1, compared with the ELAN module of YOLOv7, YOLOv8 incorporates C2f modules in both the backbone and the neck network, introducing a more diverse range of gradient flows [33]. The head network adopts the now-prevalent decoupled structure, separating the classification and detection heads, and transitions from an anchor-based to an anchor-free approach. Furthermore, the loss calculation adopts the Task-Aligned Assigner positive-sample assignment strategy and introduces distribution focal loss. Through this series of structural reconstructions, YOLOv8 achieves significant improvements in detection accuracy, effectively addressing the limitations of single-stage algorithms.
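For reference, the nano-scale model can be instantiated and trained through the Ultralytics Python API. The following is a minimal sketch; the dataset configuration file wear_particles.yaml, the image size, the epoch count, and the batch size are illustrative assumptions rather than the exact settings of this study.

from ultralytics import YOLO

# Load the YOLOv8 nano model (COCO-pretrained weights as a starting point)
model = YOLO("yolov8n.pt")

# Train on a custom ferrographic dataset described by a YOLO-format YAML file
# ("wear_particles.yaml" is a hypothetical path)
model.train(data="wear_particles.yaml", imgsz=640, epochs=200, batch=16)

# Run inference on a ferrogram image and inspect the predicted boxes and classes
results = model.predict("ferrogram_sample.jpg", conf=0.25)
print(results[0].boxes.xyxy, results[0].boxes.cls)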
2.2. DCNv3
Deformable convolutional networks (DCNs) are dynamic sparse convolution operators that differ from traditional CNNs. They enhance a model's ability to perceive target object shapes by introducing offsets (compensation values) after the convolution and pooling layers. These offsets, determined by the shape of the target object, are updated in size and direction through learning to reduce interference from irrelevant factors. Consequently, DCNs can extract more flexible and accurate features related to wear particle size, contour, and texture. Unlike conventional CNN convolution operations, DCNs exhibit good generalization ability without requiring additional training data such as data augmentation. To address variable features in wear particle images, such as size and shape, we employed the third-generation operator, DCNv3, in the present study because of its higher computational efficiency, more stable gradient propagation, and richer spatial features.
The formula for DCNv3 is as follows:

$$\mathbf{y}(p_{0})=\sum_{g=1}^{G}\sum_{k=1}^{K}\mathbf{w}_{g}\,m_{gk}\,\mathbf{x}_{g}\!\left(p_{0}+p_{k}+\Delta p_{gk}\right) \qquad (1)$$

where $G$ represents the total number of aggregation groups. For the $g$-th aggregation group, $\mathbf{w}_{g}\in\mathbb{R}^{C\times C'}$ represents the projection weight, which carries no information about the location, and $C'=C/G$ is the dimension of the group; $m_{gk}\in\mathbb{R}$ is the modulation scalar for the $k$-th sampling point in the $g$-th aggregation group, normalized by the softmax function along the dimension $K$; $\mathbf{x}_{g}$ represents the sliced input feature map; and $\Delta p_{gk}$ represents the offset corresponding to the grid sampling location $p_{k}$ in the $g$-th group.
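To illustrate the learned-offset and modulation idea behind deformable convolution, the sketch below uses torchvision's deform_conv2d, which implements a DCNv2-style modulated deformable convolution; the full DCNv3 operator additionally introduces group-wise aggregation and softmax-normalized modulation scalars, so this is an approximation rather than the operator used in the model.

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConvSketch(nn.Module):
    """Modulated deformable convolution (DCNv2-style), used here only to illustrate
    the learned-offset and modulation idea behind Equation (1)."""

    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.padding = padding
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # Offsets (2 values per sampling point) and modulation scalars (1 per point)
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=padding)
        self.mask_conv = nn.Conv2d(in_ch, k * k, k, padding=padding)
        nn.init.zeros_(self.offset_conv.weight); nn.init.zeros_(self.offset_conv.bias)
        nn.init.zeros_(self.mask_conv.weight); nn.init.zeros_(self.mask_conv.bias)

    def forward(self, x):
        offset = self.offset_conv(x)             # learned sampling offsets
        mask = torch.sigmoid(self.mask_conv(x))  # per-point modulation in (0, 1)
        return deform_conv2d(x, offset, self.weight, padding=self.padding, mask=mask)

x = torch.randn(1, 64, 40, 40)
print(DeformableConvSketch(64, 64)(x).shape)     # torch.Size([1, 64, 40, 40])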
2.3. Dysample
Dysample is primarily utilized in the upsampling module of the neck network. YOLOv8 adopts nearest-neighbor interpolation as its upsampling strategy; while fast to compute, the quality of its output remains a concern. To address this issue, scholars have proposed operators such as CARAFE, FADE, and SAPA to enhance output quality; however, these operators sacrifice efficiency and raise the threshold for practical application. In contrast, Dysample is implemented solely through PyTorch's built-in functions, with lower computational resource consumption and a larger accuracy improvement than the aforementioned operators. Its working principle is to add learned offsets (compensation values) to the sampling points during upsampling and to resample the feature map at these dynamically generated locations via bilinear interpolation.
The process of feature map processing by Dysample is shown in
Figure 2:
The process of the flowchart can be explained using the following formula:

$$\mathcal{X}'=\mathrm{grid\_sample}\left(\mathcal{X},\;\mathcal{G}+\mathcal{O}\right),\qquad \mathcal{O}=\mathrm{PS}\left(0.25\cdot\mathrm{linear}(\mathcal{X})\right) \qquad (2)$$

In this equation, $\mathcal{X}$ represents the input feature map, $\mathcal{X}'$ represents the output feature map, and $\mathcal{G}$ is the original sampling grid. The input $\mathcal{X}$ is processed by a linear layer with a static scope factor of 0.25 and pixel-rearranged ($\mathrm{PS}$) to obtain the compensation values $\mathcal{O}$, which are added to the original sampling grid; the feature map is then resampled on the resulting grid by bilinear interpolation to perform dynamic upsampling.
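A minimal sketch of this dynamic upsampling scheme is given below, assuming the static 0.25 scope factor described above; the initialization and grouping details of the actual Dysample operator are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Sketch of Dysample-style dynamic upsampling: point-wise offsets from a 1x1
    (linear) layer are pixel-shuffled to the output resolution, added to the
    regular grid, and the input is resampled bilinearly."""

    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)  # 2 offsets per output position

    def forward(self, x):
        b, _, h, w = x.shape
        offset = F.pixel_shuffle(0.25 * self.offset(x), self.scale)          # O: (B, 2, sH, sW)
        sh, sw = h * self.scale, w * self.scale
        ys, xs = torch.meshgrid(torch.arange(sh, dtype=x.dtype),
                                torch.arange(sw, dtype=x.dtype), indexing="ij")
        grid = torch.stack((xs, ys)) / self.scale                            # G in input-pixel coordinates
        coords = grid.unsqueeze(0) + offset                                  # S = G + O
        coords_x = 2 * coords[:, 0] / max(w - 1, 1) - 1                      # normalize to [-1, 1]
        coords_y = 2 * coords[:, 1] / max(h - 1, 1) - 1
        sample_grid = torch.stack((coords_x, coords_y), dim=-1)              # (B, sH, sW, 2)
        return F.grid_sample(x, sample_grid, mode="bilinear", align_corners=True)

x = torch.randn(1, 64, 20, 20)
print(DySampleSketch(64)(x).shape)               # torch.Size([1, 64, 40, 40])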
2.4. Efficienthead
In YOLOv8, the computation of the detection head accounts for approximately 40% of the entire algorithm. Considering the potential future deployment of the wear particle recognition algorithm on mobile or online devices, it is imperative to minimize reliance on device computing power. Examining the detection head of YOLOv8 shows that it comprises two branches, each utilizing two 3 × 3 convolutions and one 1 × 1 2D convolution for information extraction. These branches are used to calculate the bounding-box loss function and the classification loss function, respectively, and traversing each channel incurs substantial computational overhead. As shown in Figure 3, to mitigate this computational burden, we propose a shared-parameter method in the present study, which simplifies the three convolution kernels of the two branches into one kernel within a single branch while retaining only two 3 × 3 convolution kernels for calculating the two loss functions. This approach not only reduces computational complexity but also demonstrates improved precision in experimental validation.
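The shared-parameter idea can be sketched as follows; the channel widths, the number of shared layers, and the assumption that all pyramid levels share the same channel count are illustrative simplifications rather than the exact Efficienthead configuration.

import torch
import torch.nn as nn

class SharedDetectHeadSketch(nn.Module):
    """Sketch of a shared-parameter detection head: one stack of 3x3 convolutions
    extracts features once, and lightweight 1x1 heads produce the box-regression
    and classification outputs for every feature level."""

    def __init__(self, channels, num_classes, reg_max=16):
        super().__init__()
        self.shared = nn.Sequential(                           # shared feature branch
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
        )
        self.box_head = nn.Conv2d(channels, 4 * reg_max, 1)    # box distribution output
        self.cls_head = nn.Conv2d(channels, num_classes, 1)    # class logits

    def forward(self, feats):
        # The same shared weights are applied to every pyramid level (e.g., P3/P4/P5)
        outs = []
        for f in feats:
            h = self.shared(f)
            outs.append((self.box_head(h), self.cls_head(h)))
        return outs

p3, p4, p5 = (torch.randn(1, 64, s, s) for s in (80, 40, 20))
for box, cls in SharedDetectHeadSketch(64, num_classes=6)([p3, p4, p5]):
    print(box.shape, cls.shape)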
2.5. WISEPiou
Bounding box regression (BBR) is one of the two crucial tasks in object detection, effectively guiding the algorithm to focus on the target task. In YOLOv8, the bounding box loss comprises an intersection-over-union (IoU) loss and a distribution focal loss, and the present study primarily focuses on improving the IoU loss. YOLOv8 employs the CIoU loss function, represented by the following formula:

$$\mathcal{L}_{CIoU}=1-IoU+\frac{d^{2}}{c^{2}}+\alpha v \qquad (3)$$

Equation (3) consists of three parts, which, respectively, consider the overlapping area, the center distance, and the aspect ratio in bounding box regression. In this formula, $d$ represents the distance between the centers of the predicted box and the true box, $c$ represents the diagonal length of the smallest enclosing rectangle, $\alpha$ is a weight coefficient, and $v$ is a correction factor used to measure the consistency between the shapes of the predicted box and the true box.
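For reference, Equation (3) can be implemented directly as follows (boxes in (x1, y1, x2, y2) format); this is a standard CIoU formulation rather than code extracted from YOLOv8 itself.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (x1, y1, x2, y2), following Equation (3)."""
    # Intersection and union -> IoU
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # d^2: squared distance between the box centers
    d2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2 +
          (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4

    # c^2: squared diagonal of the smallest enclosing rectangle
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # v measures aspect-ratio consistency; alpha is its trade-off weight
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + d2 / c2 + alpha * v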
In the present study, we replaced the CIoU loss with WISEPiou, a combination of WISE-IoU and POWERFUL-IoU; the loss functions of WISE-IoU and POWERFUL-IoU are given by Equations (5) and (6), respectively.
Equation (5) multiplies the loss function by a penalty factor (denoted $\mathcal{R}_{WIoU}$ in WISE-IoU), which promptly corrects the deviation from the original loss function.
Equation (7) adds a penalty factor to the loss function, as shown in Equation (8). In this formula, $d_{w1}$, $d_{w2}$, $d_{h1}$, and $d_{h2}$ represent the corresponding edge distances between the predicted box and the target box, and the denominators are the width and height of the target box.
Therefore, combining the characteristics of the two IoU loss functions, the WISEPiou loss function used in this study is given by Equation (9).
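For illustration, the penalty factor described around Equation (8) can be computed as below. How it enters the combined WISEPiou loss is defined by Equation (9) in the original text, so the exponential penalty term in the second function is only an assumption based on the standard POWERFUL-IoU formulation.

import torch

def piou_penalty(pred, target, eps=1e-7):
    """Penalty factor of Equation (8): mean of the four edge distances between the
    predicted and target boxes, each normalized by the target width or height.
    Boxes are given as (x1, y1, x2, y2)."""
    wt = target[:, 2] - target[:, 0] + eps   # target width
    ht = target[:, 3] - target[:, 1] + eps   # target height
    dw1 = (pred[:, 0] - target[:, 0]).abs()  # left-edge distance
    dw2 = (pred[:, 2] - target[:, 2]).abs()  # right-edge distance
    dh1 = (pred[:, 1] - target[:, 1]).abs()  # top-edge distance
    dh2 = (pred[:, 3] - target[:, 3]).abs()  # bottom-edge distance
    return (dw1 / wt + dw2 / wt + dh1 / ht + dh2 / ht) / 4

# Standard POWERFUL-IoU adds the penalty through f(P) = 1 - exp(-P^2)
# (an assumption here; the paper's own combination is defined by Equation (9)).
def piou_loss(iou, p):
    return 1 - iou + 1 - torch.exp(-p ** 2)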
Based on the above improvements, the model structure used in the present study is shown in
Figure 4.
5. Comprehensive Results Analysis
5.1. Ablation Experiment
To better verify the impact of the improved modules and the combination of improved modules on the original model’s performance, an ablation experiment was conducted. The results are presented in
Table 7.
According to the experimental results presented in Table 7, each module, applied individually or in combination, produced a measurable enhancement of the original model. When Dysample, DCNv3, and Efficienthead are used separately, each contributes a 3% performance improvement over the original model; however, their use lowers the FPS due to increased inference time. In terms of floating-point operations, these three modules are built on simple underlying operations and maintain a computational complexity comparable to that of the original model.
Combining Dysample and Efficienthead significantly improves the recall rate and effectively mitigates the recall decline caused by other module enhancements in the subsequent ablation experiments. After incorporating DCNv3 and WISEPiou, the improved model achieves its best accuracy of 74.2% and an mAP of 76.4%. Although the loss-function improvement results in a slight decrease in recall relative to the original model, the improved model still outperforms the original model overall.
The decrease in FPS is not severe, as the model still meets real-time detection requirements at 111.6 FPS. Furthermore, floating-point operations are reduced by 0.9 GFLOPs compared with the original model, further alleviating the reliance on computational power.
5.2. Comparative Experiments
As shown in Table 8, in terms of accuracy, recall, and mAP@0.5, the enhanced model demonstrates superior performance, exhibiting a 2.6% increase in accuracy over the second-best improved YOLOv5n, a 1.1% improvement in recall compared with the second-best YOLOv9-S, and a 3.8% enhancement in mAP@0.5 relative to the second-best YOLOv9-S. Furthermore, the enhanced model achieves 111.6 FPS, meeting online monitoring requirements, with an 18.4% advantage over YOLOX-Tiny within the same series and performance comparable to the other models in the experiment, notably outperforming models running below 30 FPS. In terms of FLOPs, although the model is not minimal at 8.0 GFLOPs compared with YOLOX's 7.578 GFLOPs, it is only about 0.5 GFLOPs higher, which is a significant advantage when contrasted with non-YOLO models requiring over 100 GFLOPs.
Through comparative experiments, it is possible to verify that this improved model achieves optimal detection accuracy with a relatively small size and meets high frame-rate video detection requirements.
5.3. Visualization Chart
To better demonstrate the advantages of the proposed model, we selected three types of images—complex-shaped wear particles, overlapping wear particles, and edge-target wear particles—and used the GradCAMPlusPlus technique for heatmap visualization analysis.
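Heatmaps of this kind can be generated with the pytorch-grad-cam package, which provides a GradCAMPlusPlus class. The sketch below is a generic illustration on a small classifier backbone, since applying class-activation mapping to a full detector requires an additional output wrapper; the backbone choice, target layer, and random stand-in image are illustrative assumptions rather than the exact visualization setup of this study.

import numpy as np
import torch
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.image import show_cam_on_image
from torchvision.models import resnet18

# Stand-in backbone kept untrained so the sketch is self-contained;
# a trained detector would need a wrapper exposing a scalar score per image.
model = resnet18(weights=None).eval()
target_layers = [model.layer4[-1]]                      # last convolutional block

rgb = np.random.rand(224, 224, 3).astype(np.float32)   # stand-in for a ferrogram crop in [0, 1]
tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=tensor)[0]                   # (H, W) class-activation map in [0, 1]
overlay = show_cam_on_image(rgb, heatmap, use_rgb=True)
print(overlay.shape, overlay.dtype)                     # (224, 224, 3) uint8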
Figure 5 illustrates a heatmap comparison among the algorithms: the first row depicts the original images, the second row the improved YOLOv5 heatmaps, the third row the YOLOv8 heatmaps, the fourth row the YOLOv9 heatmaps, and the fifth row the heatmaps of the improved YOLOv8 proposed in this study. The first image in each row is an atypical severe sliding abrasive grain with obvious scratches on the grain but not on the texture-rich grain in the image, representing a complex scenario. The second image portrays two overlapping abrasive grains to validate the model's capability to handle overlapping relationships. The third image shows abrasive grains of varying sizes and positions to assess the model's perception ability from both scale and positional perspectives.
It can be inferred from the heatmap of the first column that the other models exhibit either a conservative or radical interpretation of the edge of the abrasive grain profile, potentially leading to gradual error accumulation in feature extraction and a subsequent decline in detection capability. In contrast, the high heatmap value (red area) generated by our proposed model accurately characterizes the abrasive grain profile without significant overshooting, effectively disregarding atypical textures of complex abrasive grains to mitigate potential model fragility resulting from excessive focus on special cases. This aspect is primarily attributed to dynamic downsampling and upsampling, which enhance feature selection rather than incorporating all features into calculations.
Observing the heatmap in the second column, it can be seen that the original model cannot accurately perceive the overlapping area when two worn particles overlap, which affects the extraction of nearby features. It can be observed that in positions close to the overlapping area, the original model presents a relatively dim heatmap. In contrast, our proposed improved model effectively suppresses this trend and provides more comprehensive and accurate perception results. Even in the presence of overlapping areas, it is not significantly affected.
The conclusion drawn from the heatmaps in the third column is that the original model can only perceive worn particles located at the center of the image with high integrity; it does not perform well in perceiving worn particles located at the edges or present in small numbers. In contrast, our improved model adopts the DCNv3 module and has a wide dynamic receptive field. The model detects more worn particles at the image edges and possesses a richer and more accurate perception of small worn particles near the edges. This improvement therefore effectively enhances detection accuracy and recall.
6. Conclusions
In this paper, an improved YOLOv8 model based on dynamic point sampling is proposed to address the challenges of complex-shaped, overlapping, and edge-target wear particles in the intelligent classification and online detection of ferrogram wear particles. Through training and validation on a series of original wear particle images, the following conclusions can be drawn:
(1) Based on the principle of dynamic point sampling, the improved YOLOv8 model effectively mitigates detection difficulties such as complex-shaped wear particles, overlapping wear particles, and edge-target wear particles, even with a limited dataset and no additional data augmentation. These improvements reduce the occurrence of undetected objects, missed detections, and false alarms. The lightweight modules and improved detection head reduce the model's floating-point computations, while the optimized loss function accelerates model convergence and improves detection-box accuracy.
(2) Through repeated experiments on a limited dataset, the proposed model achieves 76.4% mAP@0.5 and demonstrates high-speed detection at 111.6 FPS, surpassing recently developed algorithms such as TOOD, YOLOX-Tiny, and DINO.
(3) The current algorithm achieves real-time detection speeds above 30 FPS but is limited by dataset size and quality. After training on higher-quality datasets, its accuracy is expected to improve further, with significant application prospects in fields such as aviation engine diagnostics [42], industrial product defect detection [43], and medical imaging [44].