Article

Lightweight Algorithm for Rail Fastener Status Detection Based on YOLOv8n

Xingsheng Zhang, Benlan Shen, Jincheng Li and Jiuhong Ruan

1 School of Rail Transportation, Shandong Jiaotong University, Jinan 250357, China
2 Key Laboratory of Rail Transit Safety Technology and Equipment, Shandong Province Transportation Industry, Jinan 250357, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3399; https://doi.org/10.3390/electronics13173399
Submission received: 21 July 2024 / Revised: 23 August 2024 / Accepted: 25 August 2024 / Published: 27 August 2024

Abstract

To improve the accuracy of rail fastener detection and deploy deep learning models on mobile platforms for fast real-time inference, this paper proposes a defect detection model for rail fasteners based on an improved YOLOv8n. Considering the significant aspect ratio differences of rail fasteners, we designed EIOU+ as the regression box loss function. The model is compressed and trained using an improved channel-wise knowledge distillation (CWD+) approach to address the challenge of accurately recognizing minor defects in rail fasteners. We introduced a feature extraction module to design a feature extraction network as the distillation teacher model (YOLOv8n-T) and a lightweight cross-stage partial bottleneck with two convolutions and a fusion module (C2f) to improve the YOLOv8n backbone network as the distillation student model (YOLOv8n-S). Experiments conducted on data collected from actual rail lines demonstrate that after CWD+ distillation training, the model's mean average precision (IOU = 0.5) reached 96.3%, an improvement of 2.7% over the original YOLOv8n algorithm. The recall rate increased by 4.5%, the precision by 2.7%, the number of floating-point operations decreased by 13%, and the detection frame rate increased by 6.1 frames per second (FPS). Compared with other one-stage object detection algorithms, the CWD+ distilled model achieves precise real-time detection of rail fastener conditions.

1. Introduction

Rail fasteners are critical components that connect and secure the rails to the sleepers, ensuring a stable bond between the rails and the track. However, due to the continuous exposure to vehicle loads and harsh weather conditions, rail fasteners can experience issues such as clip breakage, clip displacement, and clip loss, which can seriously compromise train safety. Traditional rail fastener inspections predominantly rely on manual inspections, where maintenance personnel walk along the track to visually inspect for abnormalities. When an issue is detected, it is manually marked for repair. This method, which requires extensive track closure periods, is inefficient, prone to subjective errors, and can result in missed or incorrect detections. As a result, this approach is no longer sufficient to meet the demands of daily railway maintenance [1].
With the rapid development of computer vision, traditional image processing and machine learning techniques have been applied to rail fastener detection [2]. In traditional image processing methods, the geometric information between fasteners is used to locate the target positions, followed by preprocessing operations such as denoising and cropping the fastener images [3]. Key features are then extracted from the images, and the state of the fasteners is identified through classification techniques. In contrast, machine learning-based fastener detection analyzes top-view images of fasteners to detect their condition. This approach first calculates the edge density map of the rail fasteners and utilizes the random sample consensus (RANSAC) algorithm to locate the fasteners. Subsequently, the histogram of oriented gradient (HOG) algorithm is employed to extract fastener features, and the resulting images are classified using support vector machines (SVM) or clustering techniques to detect defects in the rail fasteners [4]. Wei [5] proposed an innovative method combining image processing techniques with deep learning networks. First, they improved the traditional fastener localization method based on image processing and introduced a Dense-SIFT feature-based approach for fastener defect detection, which outperformed existing methods. Subsequently, VGG16 was employed to train for fastener defect detection and recognition, demonstrating the feasibility of using CNNs for this task. Finally, Faster R-CNN was utilized to further enhance the detection accuracy and efficiency. Liu et al. [6] proposed a deep detection network based on multi-scale features (MSF-DDN) to locate fastener regions in railway images. Then, a regional classification network was used to identify key subregion types within the localized fastener regions. Finally, a decision tree was constructed to analyze the subregion identification results, achieving fastener detection.
Deep learning-based detection methods primarily use Convolutional Neural Networks (CNNs) to learn target feature information from large datasets, enabling fast and accurate target recognition. Compared to traditional image processing methods, these approaches eliminate the drawbacks of manually designing feature information, resulting in better generalization and robustness. The YOLO series algorithms proposed by Redmon et al. [7,8,9] have demonstrated significant advantages in the detection speed and accuracy, making them widely used in rail fastener defect detection. Chandran et al. [10] combined image processing techniques with neural networks for fastener status detection; however, the simplicity of the designed neural network led to a low detection speed and accuracy, failing to meet practical demands. Qi et al. [11] improved the YOLOv3 network by incorporating deep convolution and point convolution, introducing a new detection network architecture, MYOLOv3-Tiny, which significantly enhanced the detection speed compared to Faster R-CNN. Guo et al. [12] proposed a real-time and cost-effective framework based on computer vision, YOLOv4-hybrid, which achieved a mAP of 94.4% in rail fastener detection. Wang et al. [13] and Cai et al. [14] further improved the YOLOv5 algorithm, enabling detection under complex railway acceleration conditions. Despite the accuracy of these methods in detecting fastener defects, they perform poorly in detecting displacement, loss, and breakage, resulting in limited generalization and robustness, making them difficult to apply effectively in real-world engineering scenarios.
Yang et al. [15] employed a knowledge distillation method based on fine-grained feature simulation to compress an improved YOLOv5 model for detecting surface defects on a conveyor-belt surface. Lei et al. [16] utilized knowledge distillation to compress the YOLOv5s model for tooth detection. Zhou et al. [17] enhanced the global attention mechanism (GAM) algorithm to learn global feature representations in images, optimizing the attention mechanism to minimize information loss during feature processing. Although knowledge distillation is widely used in deep learning model training, it has not been practically applied in the field of railway fastener defect detection.
This paper addresses the challenges of deploying rail fastener detection algorithms on mobile devices and the issues of poor real-time performance. It focuses on research into lightweight models while ensuring the accuracy of rail fastener detection. Using YOLOv8n as the base model, we design a teacher model (YOLOv8n-T) and a student model (YOLOv8n-S), employing an improved CWD+ knowledge distillation method to compress the model, thereby achieving a lightweight network while enhancing the detection accuracy. The main contributions are as follows:
  • Based on the characteristics of rail fasteners, an aspect ratio loss term is added to the EIOU loss function.
  • Introducing ConvNeXt V2 modules, reparameterized convolution based on channel shuffle with one-shot aggregation (RCS-OSA) modules, and Efficient Rep networks to reconstruct the YOLOv8 backbone network, serving as the teacher model.
  • Designing a lightweight C2f module to improve the YOLOv8 backbone network, serving as the student model.
  • Improving the CWD feature knowledge distillation algorithm by designing an L2 loss function at the output end of the distillation to calculate the mean squared error loss between the student model and the teacher model outputs, enabling the student model to better learn the class features of the teacher model and enhance the model performance.
Data collection and preprocessing include data cleaning and augmentation to construct a rail fastener defect dataset. Using YOLOv8n-T as the teacher model and YOLOv8n-S as the student model, we apply CWD+ distillation for knowledge distillation. Ablation comparison experiments validate the effectiveness of the distilled model.

2. Methods

2.1. Regression Frame Loss Function Design

YOLOv8 employs the CIOU [18] loss function as the localization loss function. CIOU enhances the accuracy of loss calculation by considering factors such as center point distance and aspect ratios. However, it demands high precision in data annotation and can be overly sensitive in cases with small overlap areas, leading to lower localization accuracy when detecting objects with significant shape differences. This can result in slow and unstable convergence. To address these issues, we introduce the EIOU loss function [19] for localization. EIOU calculates the overlap between the target and predicted boxes using more refined metrics, considering not only the area overlap, but also positional differences between the boxes. The inference process diagram is shown in Figure 1 (with the blue box as the predicted box and the red box as the ground truth box).
However, the EIOU loss function considers the intersection over union (IOU) [20] of the predicted and ground truth boxes, together with their separate width and height differences, without directly accounting for their aspect ratios. Given the significant differences in the length and width of rail fasteners, we designed a relative aspect ratio loss term to improve the loss function. By penalizing the difference in aspect ratios between the predicted and ground truth boxes, the model becomes more sensitive to changes in aspect ratio. The improved loss function, termed EIOU+, is calculated as follows:
$L_{IOU} = 1 - IOU$ (1)

$L_{dis} = \frac{\rho^{2}(b,\, b^{gt})}{(w^{c})^{2} + (h^{c})^{2}}$ (2)

$L_{asp} = \frac{\rho^{2}(w,\, w^{gt})}{(w^{c})^{2}} + \frac{\rho^{2}(h,\, h^{gt})}{(h^{c})^{2}}$ (3)

$L_{RAR} = \left| \frac{w}{h + \varepsilon} - \frac{w^{gt}}{h^{gt} + \varepsilon} \right|$ (4)

$L_{EIOU+} = L_{IOU} + L_{dis} + L_{asp} + L_{RAR}$ (5)
where w^c and h^c are the width and height of the smallest enclosing rectangle of the predicted and ground truth boxes, ρ is the Euclidean distance between two points, and ε is a small positive constant that prevents division by zero. L_IOU measures the overlap between the predicted and ground truth boxes in the detection task (IOU loss), L_dis is the center-distance loss term, L_asp is the width and height loss term, L_RAR is the aspect ratio loss term, and L_EIOU+ is the total localization loss function. By introducing the aspect ratio loss term, the EIOU+ loss function better fits the predicted box to the ground truth box, improving detection accuracy.
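To make the composition of Formula (5) concrete, the following is a minimal PyTorch sketch of EIOU+ under stated assumptions: boxes are given in (x1, y1, x2, y2) corner format, and the function name, tensor shapes, and epsilon value are illustrative rather than taken from the paper's code.

```python
import torch

def eiou_plus_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes in (x1, y1, x2, y2) format."""
    # IOU term (Formula (1))
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    l_iou = 1 - inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Smallest enclosing box (w^c, h^c)
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # Center-distance term (Formula (2))
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    l_dis = (dx ** 2 + dy ** 2) / (cw ** 2 + ch ** 2 + eps)

    # Width/height term (Formula (3)) and aspect-ratio term (Formula (4))
    l_asp = (w1 - w2) ** 2 / (cw ** 2 + eps) + (h1 - h2) ** 2 / (ch ** 2 + eps)
    l_rar = torch.abs(w1 / (h1 + eps) - w2 / (h2 + eps))

    # Total loss (Formula (5)), averaged over the batch
    return (l_iou + l_dis + l_asp + l_rar).mean()
```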

2.2. Teacher Model Design

In the task of rail fastener detection, the backbone network primarily extracts features from input images, which are the foundation for subsequent detection heads to perform object classification and localization. The C2f module connects stages where there are significant semantic differences between deep and shallow feature maps, leading to an incomplete information transfer. In complex and highly similar backgrounds, the image features of rail fasteners are affected by the complex environment, resulting in insufficient feature extraction by the network model. To address this, we introduce the ConvNeXt V2 module to improve the backbone network head. To ensure efficient transfer of information from the backbone to the neck network and prevent missed and false detections, we introduce the RCS-OSA module for efficient feature information transfer. To prevent the loss of large-scale semantic information at the end of the backbone network, we introduce the Efficient Rep network to improve the terminal C2f module. We adopt the improved EIOU+ loss function as the regression box localization loss function. The improved YOLOv8n-T network structure is shown in Figure 2.

2.2.1. ConvNeXt V2 Module

The ConvNeXt V2 module [21] employs a residual connection structure, as shown in Figure 3. In the main branch network, a 7 × 7 convolution kernel is used for depth-wise convolution, performing convolution operations channel-by-channel on the input features. A 1 × 1 convolution layer then increases the number of channels to four times the original, extracting richer feature information. The features are activated using the Gaussian error linear unit (GELU) activation function and global response normalization (GRN), expanding the model’s receptive field. Finally, another 1 × 1 convolution layer scales the number of channels back to the original, and a regularization method (Drop Path) is introduced to randomly drop some connections, enhancing the model’s generalization ability.
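For readers who want to map the description above onto code, here is a minimal PyTorch sketch of the ConvNeXt V2 block following the published design [21]; Drop Path is omitted for brevity, and the class and variable names are our own.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global response normalization (channels-last input)."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):                                  # x: (N, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)  # per-channel global response
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)   # divisive normalization
        return self.gamma * (x * nx) + self.beta + x

class ConvNeXtV2Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # 7x7 depth-wise conv
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv: expand channels 4x
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)   # 1x1 conv: project back to dim

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)   # to channels-last for LN/Linear
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        return shortcut + x.permute(0, 3, 1, 2)  # residual connection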

2.2.2. RCS-OSA Module

The RCS module [22] combines channel shuffle strategies with reparameterization convolution to enhance feature extraction capabilities. During training, a multi-branch structure is used to learn rich feature representations. In the inference stage, structural reparameterization simplifies it into a single branch, reducing memory usage. The OSA module aggregates multiple features in one go and, combined with the RCS module, forms the RCS-OSA module [23], as shown in Figure 4. The input to the RCS-OSA module is divided into two parts: one part passes directly, and the other is processed through stacked RCS modules. The processed features are merged with the directly passed features after channel shuffling. This structure, by stacking RCS modules, ensures feature reuse and enhances information flow between different layers while reducing the network’s computational burden and improving computational efficiency.
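The data flow just described can be summarized in a brief, simplified sketch; the RCS unit is reduced here to a plain Conv-BN-SiLU placeholder (its reparameterized multi-branch detail is omitted), an even channel count is assumed, and all names are illustrative.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels across groups to mix information between branches."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class RCSOSA(nn.Module):
    def __init__(self, channels, n=3):
        super().__init__()
        half = channels // 2
        # Placeholder for the reparameterizable RCS unit
        self.rcs_stack = nn.ModuleList(
            nn.Sequential(nn.Conv2d(half, half, 3, padding=1, bias=False),
                          nn.BatchNorm2d(half), nn.SiLU())
            for _ in range(n))
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        identity, branch = x.chunk(2, dim=1)   # one half passes through directly
        for rcs in self.rcs_stack:             # the other half: stacked RCS units
            branch = rcs(branch)
        merged = channel_shuffle(torch.cat([identity, branch], dim=1))
        return self.fuse(merged)               # aggregate the merged features
```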

2.2.3. Efficient Rep Network

The Efficient Rep network mainly comprises RepVGG-style [24] convolutional layers, decoupling model training and inference through structural reparameterization. A multi-branch network structure is employed during training; because of its complexity, this multi-branch structure is converted into a single-branch structure (Rep Conv) through structural reparameterization during inference, as illustrated in Figure 5.
Connecting two Rep Convs in series with an identity residual structure yields a Bep unit, and connecting multiple Bep units in series yields a Rep Block [25], whose structure is shown in Figure 6. The Rep Block fully exploits the advantages of the Bep unit's residual structure, retaining the strong representational ability of the multi-branch model while simplifying the network, so that the model can more effectively capture and learn complex features in the input data.
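The key step of structural reparameterization, folding the training-time 3x3, 1x1, and identity branches into one 3x3 kernel, can be expressed compactly. The sketch below assumes that batch-norm statistics have already been fused into each branch's weights and that input and output channel counts match; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def reparameterize(w3x3, b3x3, w1x1, b1x1, channels):
    """Fold 3x3, 1x1 and identity branches into one equivalent 3x3 (weight, bias)."""
    # Pad the 1x1 kernel to 3x3 so it can be summed with the 3x3 kernel
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])
    # The identity branch equals a 3x3 kernel with a 1 at the center of its own channel
    w_id = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    return w3x3 + w1x1_padded + w_id, b3x3 + b1x1

# Sanity check: the fused conv reproduces the sum of the three branches
x = torch.randn(1, 8, 16, 16)
w3, b3 = torch.randn(8, 8, 3, 3), torch.randn(8)
w1, b1 = torch.randn(8, 8, 1, 1), torch.randn(8)
ref = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1) + x
w, b = reparameterize(w3, b3, w1, b1, channels=8)
assert torch.allclose(F.conv2d(x, w, b, padding=1), ref, atol=1e-5)
```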

2.3. Student Model Design

In YOLOv8, the C2f module’s cross-stage connection approach leads to additional computational costs, especially when handling large-size input images, increasing the network’s computational complexity and memory consumption, and causing incomplete information transfer. Thus, the C2f-Het module is designed to replace the C2f module in the backbone network, enhancing information flow between different layers while maintaining performance and reducing computational complexity. The improved YOLOv8n-S network structure, utilizing the enhanced EIOU+ loss function as the regression box localization loss function, is shown in Figure 7.
The Het Conv filter [26] shown in Figure 8 employs both 3 × 3 and 1 × 1 convolution kernels, applied to different channel sets. Combining different types of convolution kernels allows for more efficient and flexible feature extraction.
To ensure smoother information flow between different layers of the backbone network, improving feature extraction capabilities while maintaining low computational costs, the C2f-Het module designed with Het Conv filters, as shown in Figure 9, replaces the C2f module in the student model’s backbone network.
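A simplified HetConv-style filter [26], the building block of C2f-Het, can be written as follows; the fixed channel split used here (the first 1/p input channels get 3x3 kernels, the rest 1x1) is a simplification of the paper's shifted per-filter arrangement, and p = 4 is an illustrative default.

```python
import torch
import torch.nn as nn

class HetConv(nn.Module):
    """Heterogeneous convolution: 3x3 kernels on 1/p of the input channels,
    1x1 kernels on the rest, with the two partial results summed."""
    def __init__(self, in_ch, out_ch, p=4):
        super().__init__()
        self.k3_ch = in_ch // p                        # channels given 3x3 kernels
        self.conv3 = nn.Conv2d(self.k3_ch, out_ch, 3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(in_ch - self.k3_ch, out_ch, 1, bias=False)

    def forward(self, x):
        x3, x1 = x[:, :self.k3_ch], x[:, self.k3_ch:]
        return self.conv3(x3) + self.conv1(x1)         # fuse both kernel types

# Usage: a cheaper stand-in for a standard 3x3 convolution
layer = HetConv(64, 128, p=4)
y = layer(torch.randn(1, 64, 52, 52))   # -> (1, 128, 52, 52)
```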

2.4. Improved CWD Knowledge Distillation Algorithm

Knowledge Distillation is a model compression technique used to transfer knowledge from a complex, high-performance model (teacher model) to a smaller, computationally efficient model (student model). The primary objective of this technique is to reduce model complexity and computational resource requirements while maintaining performance. To ensure the detection performance of the lightweight model, YOLOv8n-S, this study employs Channel-wise Knowledge Distillation (CWD) [27], where knowledge is transferred from the teacher model to the student model through each channel of the feature maps, thereby optimizing the student model.
During the knowledge distillation process, the feature maps of both the teacher and student networks undergo channel-level softmax normalization to generate corresponding probability distributions. The Kullback–Leibler (KL) divergence between these distributions is then computed as the loss function. This method effectively aligns the feature representations of the student network with those of the teacher network in prominent regions, enabling the student network to better mimic the teacher network in these key areas, thereby achieving precise detection in rail fastener defect detection tasks.
The channel feature activation distributions of the teacher network and the student network are denoted as y_c^T and y_c^S, respectively. The distillation loss is calculated as follows:
$\phi(y_{c}) = \frac{\exp\left( y_{c,i} / \tau \right)}{\sum_{i=1}^{W \cdot H} \exp\left( y_{c,i} / \tau \right)}$ (6)

$L_{distill} = \frac{\tau^{2}}{C} \sum_{c=1}^{C} \sum_{i=1}^{W \cdot H} \phi\left( y_{c,i}^{T} \right) \log\left[ \frac{\phi\left( y_{c,i}^{T} \right)}{\phi\left( y_{c,i}^{S} \right)} \right]$ (7)
where φ(y_c) denotes the conversion of channel c to a probability distribution, c = 1, 2, …, C indexes the channels, i = 1, 2, …, W·H indexes the spatial positions within a channel, y_{c,i} is the activation at position i of channel c, τ is the temperature hyperparameter, and L_distill is the distillation loss. Formula (6) converts the output logit distribution into a probability distribution, eliminating the influence of the amplitude scale between the teacher and student networks.
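Formulas (6) and (7) translate into a few lines of PyTorch; in this hedged sketch, the feature maps are assumed to be (N, C, H, W) tensors with matching channel counts, and the mean also averages over the batch dimension.

```python
import torch
import torch.nn.functional as F

def cwd_loss(feat_s, feat_t, tau=1.0):
    """Channel-wise distillation loss; feat_s, feat_t: (N, C, H, W)."""
    n, c, h, w = feat_s.shape
    # Softmax over the W*H spatial positions of each channel (Formula (6))
    log_p_s = F.log_softmax(feat_s.view(n, c, -1) / tau, dim=-1)
    p_t = F.softmax(feat_t.view(n, c, -1) / tau, dim=-1)
    # KL divergence per channel, scaled by tau^2 (Formula (7))
    kl = (p_t * (p_t.clamp_min(1e-12).log() - log_p_s)).sum(dim=-1)
    return (tau ** 2) * kl.mean()
```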
Because the visual difference between a normal fastener and a shifted fastener is small, false detections can occur. To address this, an L2 loss is designed in this paper to further optimize the CWD distillation algorithm, as shown in the red dashed box in Figure 10. The mean squared error loss of layer i is calculated by Formula (8), and the final loss in Formula (9) is obtained by summing the losses of all layers.
$L_{i} = \frac{1}{N \times D} \sum_{j=1}^{N} \sum_{k=1}^{D} \left( p_{i}(j,k) - t_{i}(j,k) \right)^{2}$ (8)

$L_{cls} = \alpha \times \sum_{i=1}^{L} \frac{1}{N \times D} \sum_{j=1}^{N} \sum_{k=1}^{D} \left( p_{i}(j,k) - t_{i}(j,k) \right)^{2}$ (9)
where p = {p_1, p_2, …, p_L} and t = {t_1, t_2, …, t_L} are the lists of output tensors of the student model and the teacher model, respectively, with a total of L layers. The output shape of each layer is N × D, where N is the batch size and D is the output dimension of the layer. p_i(j,k) and t_i(j,k) are the values of the student model and the teacher model at the i-th layer, the j-th sample, and the k-th output unit. α is a scaling factor, set to α = 0.25 in the experiments. Since the logit values of a normal fastener and a shifted fastener are very close, the L2 loss is well suited to reflecting this closeness. Formula (9) therefore compares the logit outputs of the teacher and student models directly to capture subtle differences, allowing the student model to learn the class information of the teacher model more accurately. The improved CWD algorithm is denoted CWD+, and the distillation workflow is shown in Figure 10.
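The added output-distillation term of Formulas (8) and (9) is likewise straightforward; this sketch assumes the student and teacher logits are provided as lists of L tensors of shape (N, D), and uses the paper's α = 0.25.

```python
import torch

def logit_l2_loss(student_outs, teacher_outs, alpha=0.25):
    """student_outs, teacher_outs: lists of L tensors, each of shape (N, D)."""
    loss = 0.0
    for p_i, t_i in zip(student_outs, teacher_outs):
        loss = loss + ((p_i - t_i) ** 2).mean()  # 1/(N*D) * sum of squared differences
    return alpha * loss                          # alpha = 0.25 as in the experiments
```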
Since the number of channels in the two models’ feature maps differs, direct loss computation is not feasible. Therefore, a 1 × 1 convolution layer is used to match the student model’s feature map channels to those of the teacher model. Additionally, batch normalization is applied to the teacher model’s feature maps to normalize them, stabilizing the training process and mitigating internal covariate shift, ensuring these feature maps are on the same scale for direct comparison.
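A minimal sketch of this alignment step follows; the channel counts (64 for the student, 128 for the teacher) are placeholders, not values from the paper.

```python
import torch.nn as nn

align = nn.Conv2d(64, 128, kernel_size=1)   # 1x1 conv matches student channels to teacher
bn_teacher = nn.BatchNorm2d(128)            # normalize teacher features to the same scale

# During distillation:
#   loss = cwd_loss(align(student_feat), bn_teacher(teacher_feat), tau=1.0)
```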

3. Experiment and Analysis

3.1. Experimental Data Set

The experimental data in this study consist of two parts. One part is collected from rail grinding operations during domestic railway maintenance. To enhance the generalization ability of the model, the other part is gathered from the internet and includes four categories: normal fasteners, shifted fasteners, damaged fasteners, and missing fasteners. The damaged fasteners category includes broken elastic clips and deformed elastic clips. A total of 700 images were collected, each with a resolution of 416 × 416 pixels, as shown in Figure 11.
To ensure the model learns a diverse set of features and generalizes to various scenarios, defect samples underwent random augmentation, including rotation, exposure adjustment, hue adjustment, and noise addition [28]. After augmentation, a total of 3500 images were obtained and split into training, validation, and test sets in an 8:1:1 ratio.
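As an illustration of such an augmentation pipeline, the following torchvision sketch covers the listed operations; the parameter ranges are placeholders, since the paper does not state them, and in a real detection pipeline the bounding boxes would need to be transformed alongside the images.

```python
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),              # random rotation
    transforms.ColorJitter(brightness=0.3, hue=0.05),   # exposure and hue adjustment
    transforms.ToTensor(),
    # Additive Gaussian noise, clipped back to the valid [0, 1] range
    transforms.Lambda(lambda t: (t + 0.02 * torch.randn_like(t)).clamp(0.0, 1.0)),
])
```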

3.2. Experimental Environment and Parameters

The experiments were conducted on a Windows 10 operating system with an RTX 3070Ti GPU, 12th Gen Intel Core i5-12400 2.50 GHz processor, 8 GB RAM, and 8 GB VRAM, utilizing the Pytorch 1.13.1 deep learning framework, CUDA 11.7, and a Python 3.7 software environment. The number of training iterations was set to 200, with a batch size of 16, a learning rate of 0.0001, a momentum parameter of 0.937, and a weight decay of 0.0005. Image augmentation employed mosaic enhancement, with a warm-up learning phase of three epochs, followed by automatic learning rate adjustment using the cosine annealing algorithm [29].
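For reference, the stated hyperparameters map naturally onto the Ultralytics training API; the model and dataset YAML paths below are placeholders, and we assume the improved architectures would be supplied as custom YAML files.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")            # placeholder: the improved model YAML would go here
model.train(
    data="rail_fasteners.yaml",         # placeholder dataset config (4 fastener classes)
    epochs=200, batch=16, imgsz=416,
    lr0=0.0001, momentum=0.937, weight_decay=0.0005,
    warmup_epochs=3,                    # three-epoch warm-up phase
    cos_lr=True,                        # cosine-annealing learning-rate schedule
    mosaic=1.0,                         # mosaic augmentation
)
```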

3.3. Evaluation Index

The object detection task involves both classifying and localizing targets in an image, so model evaluation depends mainly on detection accuracy and detection speed. Accuracy metrics generally include precision (P), recall (R) [30], and mean average precision (mAP) [31]. The calculation formulas are as follows.
$P = \frac{TP}{TP + FP}$ (10)

$R = \frac{TP}{TP + FN}$ (11)

$AP = \int_{0}^{1} P(R)\, dR$ (12)

$mAP = \frac{\sum_{i=1}^{N} AP_{i}}{N}$ (13)

$mAP@0.5 = \frac{\sum_{i=1}^{N} AP_{i}}{N}, \quad IOU_{th} = 0.5$ (14)
where TP is the number of fasteners whose status is correctly detected by the model, FP is the number of incorrect detections (false positives), and FN is the number of fastener statuses the model fails to detect (missed detections). AP is the detection accuracy of a single category, obtained by integrating precision over recall; mAP is the mean of the per-category AP values, where N is the number of classes and APi is the average precision of class i. mAP@0.5 is the mAP computed with the IOU threshold set to 0.5.
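A small numeric illustration of Formulas (10)-(12) may help; the sketch below computes AP for one class from confidence-sorted detections using a simple rectangular integration, and the toy input is invented for demonstration.

```python
import numpy as np

def average_precision(tp_flags, num_gt):
    """AP for one class; tp_flags: confidence-sorted detections, 1 = TP, 0 = FP."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    precision = tp / (tp + fp)            # P = TP / (TP + FP)
    recall = tp / num_gt                  # R = TP / (TP + FN), FN = num_gt - TP
    # Rectangular approximation of the integral of P over R (Formula (12))
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

ap = average_precision([1, 1, 0, 1, 0], num_gt=4)   # toy input -> AP = 0.6875
# mAP@0.5 averages such per-class AP values at an IOU threshold of 0.5
```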
Additionally, giga floating-point operations (GFLOPs) and frames per second (FPS) were introduced to evaluate the model size and computational efficiency. GFLOPs measure the computational complexity of the model, while FPS represents its detection speed.

3.4. Comparative Experiment of Different Loss Functions

To validate the effectiveness of the improved EIOU+ loss function, it was compared with the EIOU, WIOU [32], and MPDIOU [33] loss functions, as shown in Table 1. The YOLOv8n model using the EIOU+ loss function achieved mAP@50 and mAP@50–95 increases of 0.9% and 2.6%, respectively, over the CIOU baseline in fastener detection.
The training process visualization of the EIOU+ and CIOU loss functions in Figure 12 shows that the EIOU+ loss function converged faster in the first 50 epochs on the training set, eventually stabilizing at 180 epochs. On the validation set, the EIOU+ and CIOU loss functions showed minor differences in the first 100 epochs, with the EIOU+ loss function converging more significantly and achieving a lower loss value after 100 epochs.

3.5. YOLOv8n-T Network Ablation Experiment

To verify the effectiveness of the improved YOLOv8n-T network modules in rail fastener detection, 10 ablation experiments were conducted, with results shown in Table 2. Introducing the ConvNeXt V2 module to improve the backbone network head slightly reduced the parameter count and increased the average detection accuracy, thanks to ConvNeXt V2's ability to extract and fuse a broader range of feature information at the head of the backbone network. The RCS-OSA module aggregates multiple features of different scales, capturing multi-scale information and passing it to the Neck network; this enhances the diversity of the network's features but significantly increases the computational complexity. The Efficient Rep network at the end of the backbone efficiently extracts large-scale information from rail fasteners, leveraging its decoupled training and inference structure; it increases the model's computational complexity by 0.1 GFLOPS but significantly improves the detection speed. Optimizing the original network with the EIOU+ localization loss function improved the detection accuracy and significantly increased FPS. By reconstructing the backbone network with the ConvNeXt V2 module, RCS-OSA module, and Efficient Rep network, and further optimizing the network with the improved EIOU+ loss function, the modified model's precision and recall improved by 2.2% and 2.1%, respectively, with mAP@50 and mAP@50–95 increasing by 2% and 3.1%. However, the improved model's structural complexity also increased, raising GFLOPS by 2.8 and decreasing FPS by 4.8 frames per second.

3.6. YOLOv8n-S Network Ablation Experiment

To verify the effectiveness of the lightweight backbone network YOLOv8n-S and explore the lightweight performance of the C2f-Het module, four ablation experiments were conducted. The data in Table 3 show that after introducing the C2f-Het module to improve the backbone network, the GFLOPS value decreased by 0.8. Further optimizing the model with the EIOU+ loss function improved mAP@50 and mAP@50–95 by 1.6% and 1.9%, respectively, and increased FPS by five frames per second compared to the original algorithm.

3.7. Knowledge Distillation Contrast Experiment

Using the improved feature extraction network YOLOv8n-T as the teacher model and the lightweight backbone network YOLOv8n-S as the student model, the backbone networks of the teacher and student models were selected for knowledge distillation. To study the effect of the distillation temperature on the detection accuracy, experiments were conducted with τ = 1, τ = 2, and τ = 3, with the results shown in Table 4.
Using τ = 1 for feature distillation, a comparative experiment was conducted between the CWD distillation algorithm before and after improvement. Table 5 shows that the model's precision and recall improved by 2.3% and 0.7%, respectively, after CWD+ knowledge distillation, while mAP@50 and mAP@50–95 increased by 1% and 1.1%, respectively, and FPS increased by 0.2 frames per second, demonstrating that the improved CWD+ algorithm outperforms the original CWD algorithm.

3.8. Comparison Experiment of Different Models

To further verify the superiority of the proposed algorithm, the YOLOv8n-CWD+ algorithm was compared with the YOLOv3-tiny and YOLOv5s algorithms on the same custom dataset. Table 6 shows that the YOLOv8n-CWD+ algorithm has the lowest computational complexity, with the lowest GFLOPS among the compared algorithms, clearly outperforming YOLOv3-tiny, YOLOv5s, and YOLOv8n in lightweight performance. In terms of detection accuracy, the YOLOv8n-CWD+ algorithm has a clear advantage, with precision and recall of 91.8% and 91.1%, respectively, and mAP@50 and mAP@50–95 values of 96.3% and 69.4%, respectively, the highest among the compared methods. Regarding detection speed, the YOLOv8n-CWD+ algorithm achieved an FPS of 42.7, improving by 6.1 frames per second over the original algorithm, 15.4 frames per second over YOLOv5s, and 0.2 frames per second over YOLOv3-tiny. The comparative experimental analysis confirms that the proposed YOLOv8n-CWD+ algorithm effectively compresses the model and accelerates inference.

3.9. Comparison of Detection Effects before and after Improvement

The improved model’s comparative experimental results prompted the selection of YOLOv3-tiny, YOLOv5s, YOLOv8n, and the distilled YOLOv8n-CWD+ models for a visual analysis of track fastener detection. Random samples of track fasteners from various angles and states were chosen for this analysis. A confidence threshold of 0.5 and an IOU threshold of 0.7 were set for comparative validation. The detection results for images taken from different angles are shown in Figure 13. From the detection results of YOLOv3-tiny in Figure 13(1), it is evident that the displaced fastener in image (b) was missed. While the fastener’s condition in image (d) was correctly identified, the fastener was not fully localized. In image (f), a missing fastener was incorrectly detected as a normal one, with a relatively low confidence score. As indicated by the results in Figure 13(2), the YOLOv5s model was able to correctly identify the fastener’s condition, with an improvement in confidence scores compared to YOLOv3-tiny. Figure 13(3) presents the detection results of YOLOv8n, which correctly identified the track fastener’s condition with higher confidence scores than those of YOLOv3-tiny and YOLOv5s. The detection results for YOLOv8n-CWD+ are illustrated in Figure 13(4), where the algorithm accurately detected the true condition of the fasteners, with significantly higher confidence scores compared to the other three models. The improved distillation algorithm achieves high detection rates and high accuracy in track fastener condition detection.

4. Conclusions

Developing a lightweight rail inspection system based on computer vision is crucial for improving rail detection efficiency. This paper designed the rail fastener localization loss function EIOU+ and proposed an efficient feature extraction network as the teacher model (YOLOv8n-T) and a lightweight student model (YOLOv8n-S). The improved CWD+ knowledge distillation method was applied to distill the teacher and student models, resulting in the YOLOv8n-CWD+ model for real-time rail fastener status detection. Through ablation and comparative experiments, the following conclusions were drawn:
  • An EIOU+ localization loss function was designed based on the aspect ratio differences of railway fasteners, which accelerates the model’s convergence and enhances detection precision for more accurate results.
  • The feature extraction network YOLOv8n-T was designed by integrating the ConvNeXt V2 module, RCS-OSA module, and Efficient Rep network to improve the original YOLOv8 backbone. This enhances the network’s feature extraction capability in complex environments, ensuring the better transmission of deep feature information to the Neck part and improving detection accuracy.
  • A lightweight YOLOv8n-S model was designed, incorporating the C2f-Het module to improve the C2f module in the backbone network, achieving a lightweight design while maintaining the detection speed.
  • An L2 loss function was designed to further optimize CWD knowledge distillation, resulting in the CWD+ knowledge distillation method. In this process, YOLOv8n-T was employed as the teacher model and YOLOv8n-S as the student model for model compression, leading to the training of a lightweight model.
The experimental results indicate that the model distilled with the CWD+ algorithm achieved 7.3 GFLOPS, a 13% reduction compared to the original algorithm. The detection speed increased by 6.1 frames per second (FPS), and mAP@50 improved by 2.7%. Comparative detection experiments show that in complex scenarios with varying detection angles, the distilled model can quickly and accurately detect the status of track fasteners, significantly outperforming YOLOv5s and YOLOv8n. This demonstrates that the improved CWD+ distilled model offers superior overall performance and enables more precise detection of track fastener conditions.
However, our work has some limitations. The model proposed in this paper is only applicable to the detection of A-type clip track fasteners and is not suitable for other types of track fasteners. Therefore, future work will focus on further data collection and dataset expansion to increase data diversity. Additionally, the model will be deployed on mobile platforms to enable the engineering application of track fastener defect detection.

Author Contributions

Conceptualization, X.Z., B.S. and J.L.; methodology, X.Z.; software, X.Z.; resources, X.Z., J.L. and J.R.; writing—original draft preparation, X.Z.; writing—review and editing, B.S., J.L. and J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of the Shandong Province, China (Grant No. ZR2022QF107).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wei, X.; Wei, D.; Suo, D.; Jia, L.; Li, Y. Multi-target defect identification for railway track line based on image processing and improved YOLOv3 model. IEEE Access 2020, 8, 61973–61988. [Google Scholar] [CrossRef]
  2. Zhuang, L.; Qi, H.Y.; Wang, T.G.; Zhang, Z. A Deep-learning powered Near-real-time Detection of Railway Track Major Components: A Two-stage Computer-vision-based Method. IEEE Internet Things J. 2022, 9, 18806–18816. [Google Scholar] [CrossRef]
  3. Ma, A.; Lv, Z.; Chen, X.; Li, L.; Qi, Y.; Zheng, S.; Chai, X. Pandrol track fastener defect detection based on local convolutional neural networks. Proc. Inst. Mech. Eng. Part I J. Syst. Control. Eng. 2021, 235, 1906–1915. [Google Scholar] [CrossRef]
  4. Prasongpongchai, T.; Chalidabhongse, T.H.; Leelhapantu, S. A vision-based method for the detection of missing rail fasteners. In Proceedings of the 2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), Kuching, Malaysia, 12–14 September 2017; IEEE: Piscataway, NJ, USA, 2017; Volume 12–14, pp. 419–424. [Google Scholar]
  5. Wei, X.; Yang, Z.; Liu, Y.; Wei, D.; Jia, L.; Li, Y. Railway track fastener defect detection based on image processing and deep learning techniques: A comparative study. Eng. Appl. Artif. Intell. 2019, 80, 66–81. [Google Scholar] [CrossRef]
  6. Liu, J.; Teng, Y.; Shi, B.; Ni, X.; Xiao, W.; Wang, C.; Liu, H. A hierarchical learning approach for railway fastener detection using imbalanced samples. Measurement 2021, 186, 110240. [Google Scholar] [CrossRef]
  7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  8. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  9. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  10. Chandran, P.; Asber, J.; Thiery, F.; Kumar, A.; Harsha, S.P. An Investigation of Railway Fastener Detection Using Image Processing and Augmented Deep Learning. Sustainability 2021, 13, 12051. [Google Scholar] [CrossRef]
  11. Qi, H.; Xu, T.; Wang, G.; Cheng, Y.; Chen, C. MYOLOv3-Tiny: A New Convolutional Neural Network Architecture for Real-Time Detection of Track Fasteners. Comput. Ind. 2020, 123, 103303. [Google Scholar] [CrossRef]
  12. Guo, F.; Qian, Y.; Shi, Y. Real-Time Railroad Track Components Inspection Based on the Improved YOLOv4 Framework. Autom. Constr. 2021, 125, 103596. [Google Scholar] [CrossRef]
  13. Wang, L.; Zang, Q.; Zhang, K.; Wu, L. A rail fastener defect detection algorithm based on improved YOLOv5. Proc. Inst. Mech. Eng. Part F J. Rail Rapid Transit. 2024, 238, 745–756. [Google Scholar] [CrossRef]
  14. Cai, Y.; He, M.; Tao, Q.; Xia, J.; Zhong, F.; Zhou, H. Fast Rail Fastener Screw Detection for Vision-Based Fastener Screw Maintenance Robot Using Deep Learning. Appl. Sci. 2024, 14, 3716. [Google Scholar] [CrossRef]
  15. Yang, Q.; Li, F.; Tian, H.; Li, H.; Xu, S.; Fei, J.; Wu, Z.; Feng, Q.; Lu, C. A new knowledge-distillation-based method for detecting conveyor belt defects. Appl. Sci. 2022, 12, 10051. [Google Scholar] [CrossRef]
  16. Lei, Y.; Chen, X.; Wang, Y.; Tang, R.; Zhang, B. A Lightweight Knowledge-Distillation-Based Model for the Detection and Classification of Impacted Mandibular Third Molars. Appl. Sci. 2023, 13, 9970. [Google Scholar] [CrossRef]
  17. Zhou, P.; Aysa, A.; Ubul, K. Research on knowledge distillation algorithm based on Yolov5 attention mechanism. Expert Syst. Appl. 2024, 240, 122553. [Google Scholar]
  18. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. arXiv 2020, arXiv:1911.08287. [Google Scholar] [CrossRef]
  19. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  20. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar]
  21. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
  22. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  23. Kang, M.; Ting, C.M.; Ting, F.F.; Greenspan, H.; Madabhushi, A.; Mousavi, P.; Salcudean, S.; Duncan, J.; Syeda-Mahmood, T.; Taylor, R. RCS-YOLO: A fast and high-accuracy object detector for brain tumor detection. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; Springer Nature: Cham, Switzerland, 2023; pp. 600–610. [Google Scholar]
  24. Weng, K.; Chu, X.; Xu, X.; Huang, J.; Wei, X. Efficientrep: An efficient Repvgg-style convnets with hardware-aware neural network design. arXiv 2023, arXiv:2302.00386. [Google Scholar]
  25. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  26. Singh, P.; Verma, V.K.; Rai, P.; Namboodiri, V.P. Hetconv: Heterogeneous kernel-based convolutions for deep cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4835–4844. [Google Scholar]
  27. Shu, C.; Liu, Y.; Gao, J.; Yan, Z.; Shen, C. Channel-wise knowledge distillation for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 5311–5320. [Google Scholar]
  28. Li, T.; Wang, J.; Yao, K. Visibility enhancement of underwater images based on active polarized illumination and average filtering technology. Alex. Eng. J. 2022, 61, 701–708. [Google Scholar] [CrossRef]
  29. Rizk-Allah, R.M.; Hassanien, A.E. A comprehensive survey on the sine–cosine optimization algorithm. Artif. Intell. Rev. 2023, 56, 4801–4858. [Google Scholar] [CrossRef]
  30. Padilla, R.; Netto, S.L.; Da Silva, E.A. A Survey on Performance Metrics for Object-Detection Algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; pp. 237–242. [Google Scholar]
  31. Shah, T. Measuring Object Detection Models—mAP—What Is Mean Average Precision? Tarang Shah Blog, 2018. [Google Scholar]
  32. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  33. Siliang, M.; Yong, X. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
Figure 1. EIOU loss function inference diagram.
Figure 2. Network structure of YOLOv8n-T.
Figure 3. ConvNeXt V2 module structure diagram.
Figure 4. Structure of RCS-OSA module. n represents the number of stacked RCS modules.
Figure 5. Structure reparameterization.
Figure 6. Rep Block based on Bep Unit.
Figure 7. Network structure of YOLOv8n-S.
Figure 8. HetConv filter.
Figure 9. C2f-Het module.
Figure 10. Schematic diagram of CWD+ distillation.
Figure 11. Data set sample. (a) Normal fastener image; (b) Shifted fastener image; (c) Broken fastener image; (d) Deformed fastener image; (e) Missing fastener image.
Figure 12. Comparison of convergence curves of EIOU+ and CIOU. (a) Train/Box_loss curve; (b) Val/Box_loss curve.
Figure 13. Comparison of detection effects of different models.
Table 1. Results of loss function comparison.

| Loss Function | P/%  | R/%  | mAP@50/% | mAP@50–95/% |
|---------------|------|------|----------|-------------|
| CIOU          | 88.7 | 86.6 | 93.6     | 66.5        |
| EIOU          | 91.2 | 88.2 | 93.3     | 67.3        |
| WIOU          | 90.5 | 88.4 | 93.8     | 68.8        |
| MPDIOU        | 91.2 | 90.0 | 94.2     | 68.6        |
| EIOU+         | 88.6 | 92.3 | 94.5     | 69.1        |
Table 2. YOLOv8n-T ablation experiment (module combinations per row follow the GFLOPS values and the discussion in Section 3.5).

| ConvNeXt V2 | RCS-OSA | Efficient Rep | EIoU+ | P/%  | R/%  | mAP@50/% | mAP@50–95/% | GFLOPS | FPS  |
|-------------|---------|---------------|-------|------|------|----------|-------------|--------|------|
|             |         |               |       | 88.7 | 88.6 | 93.6     | 66.5        | 8.4    | 36.5 |
| ✓           |         |               |       | 87.5 | 92.5 | 94.8     | 68.4        | 8.3    | 35.6 |
|             | ✓       |               |       | 91.2 | 89.3 | 94.9     | 68.4        | 11.1   | 37.8 |
|             |         | ✓             |       | 91.9 | 88.9 | 93.7     | 68.3        | 8.5    | 43.7 |
| ✓           | ✓       |               |       | 89.8 | 92.0 | 94.9     | 68.7        | 11.0   | 31.9 |
| ✓           |         | ✓             |       | 90.9 | 90.5 | 94.5     | 68.6        | 8.5    | 35.9 |
|             | ✓       | ✓             |       | 91.4 | 90.5 | 94.6     | 67.7        | 11.2   | 37.7 |
| ✓           | ✓       | ✓             |       | 89.8 | 90.8 | 94.2     | 67.7        | 11.2   | 31.8 |
|             |         |               | ✓     | 88.6 | 92.3 | 94.5     | 69.1        | 8.4    | 43.9 |
| ✓           | ✓       | ✓             | ✓     | 90.9 | 90.7 | 95.6     | 69.6        | 11.2   | 31.8 |
Table 3. YOLOv8n-S ablation experiment.

| C2f-Het | EIoU+ | P/%  | R/%  | mAP@50/% | mAP@50–95/% | GFLOPS | FPS  |
|---------|-------|------|------|----------|-------------|--------|------|
|         |       | 88.7 | 88.6 | 93.6     | 66.5        | 8.4    | 36.6 |
| ✓       |       | 92.5 | 86.6 | 94.6     | 67.9        | 7.6    | 43.5 |
|         | ✓     | 88.6 | 92.3 | 94.5     | 69.1        | 8.4    | 43.9 |
| ✓       | ✓     | 90.6 | 89.5 | 95.2     | 68.4        | 7.6    | 41.6 |
Table 4. Influence of distillation temperature τ on accuracy.

| Temperature | P/%  | R/%  | mAP@50/% | mAP@50–95/% |
|-------------|------|------|----------|-------------|
| τ = 1.0     | 91.8 | 91.1 | 96.3     | 69.4        |
| τ = 2.0     | 91.5 | 90.6 | 95.6     | 68.9        |
| τ = 3.0     | 91.5 | 90.6 | 95.6     | 68.9        |
Table 5. Comparison experiment of knowledge distillation algorithm.

| Distillation (YOLOv8n-T + YOLOv8n-S) | P/%  | R/%  | mAP@50/% | mAP@50–95/% | GFLOPS | FPS  |
|--------------------------------------|------|------|----------|-------------|--------|------|
| CWD                                  | 89.5 | 90.4 | 95.3     | 68.3        | 7.3    | 42.5 |
| CWD+                                 | 91.8 | 91.1 | 96.3     | 69.4        | 7.3    | 42.7 |
Table 6. Comparative experiments of different models.

| Model        | P/%  | R/%  | mAP@50/% | mAP@50–95/% | GFLOPS | FPS  |
|--------------|------|------|----------|-------------|--------|------|
| YOLOv3-tiny  | 88.4 | 84.8 | 91.4     | 60.8        | 13.0   | 42.5 |
| YOLOv5s      | 92.9 | 86.7 | 93.4     | 64.4        | 16.0   | 27.3 |
| YOLOv8n      | 88.7 | 88.6 | 93.6     | 66.5        | 8.4    | 36.6 |
| YOLOv8n-T    | 90.9 | 90.7 | 95.6     | 69.6        | 11.2   | 31.8 |
| YOLOv8n-S    | 90.6 | 89.5 | 95.2     | 68.4        | 7.6    | 41.6 |
| YOLOv8n-CWD  | 89.5 | 90.4 | 95.3     | 68.3        | 7.3    | 42.5 |
| YOLOv8n-CWD+ | 91.8 | 91.1 | 96.3     | 69.4        | 7.3    | 42.7 |
