1. Introduction
Railway infrastructure is the backbone of modern transportation systems, enabling the long-distance movement of goods and passengers. Track quality is crucial for the safe operation of trains, and the rails themselves are central to the efficiency and safety of the railway network. Manual inspection and contact measurement technologies are still widely used for rail fault detection [1]. However, these methods require a large number of engineers and are time-consuming and labor-intensive. Their slow detection speed also prevents faults from being identified promptly [2]. Surface defects on rails often result from the repetitive motion of trains on the tracks and the friction induced by defective train wheels [3]. Failing to identify and manage these surface defects promptly can increase maintenance costs and lead to grave consequences [4]. Therefore, technology that can inspect rails automatically, reduce operational time, and curb maintenance expenses is imperative for enhancing the safety of railway transportation [5,6].
Heavy train loads and prolonged exposure to the environment can produce numerous imperfections on the rail surface, such as flaking, short-wave irregularities, squats, fissures in the rails, and deficiencies in fastening components [7,8,9,10]. For example, one common problem affecting the steel rail surface is deep squats, which are rail defects along the running band of the rails. Flaking, on the other hand, refers to the progressive horizontal separation of the rail’s running surface near the gauge corner, often causing the scaling or chipping of small slivers. These imperfections can substantially affect the operational safety of railway transportation. Failure to detect them swiftly could jeopardize the safety of train operations and the life and property of passengers. Consequently, to keep the rails in good operational condition, the rail system must be examined periodically. The rapid and intelligent detection of surface imperfections on rails has therefore become a vital aspect of railway transportation.
There is a growing realization of the necessity for advanced inspection and monitoring technologies to ensure the integrity and reliability of railway tracks. As a result, non-destructive testing methods such as ultrasonic detection, eddy current detection, laser detection, and acoustic emission have been proposed. These methods are physical detection techniques, rooted in the principles of physics with hardware as their core. They exploit physical phenomena induced by acoustic, optical, electromagnetic, and similar fields to evaluate the quality of rails through the detection of physical manifestations on the rails [11]. Among the various options available, ultrasonic and eddy current detection methods are commonly favored for their noncontact characteristics. The ultrasonic detection method uses the properties of sound waves to identify internal defects and rail fractures. While it excels at pinpointing tiny cracks with high precision, it may encounter limitations in detecting medium to large cracks [12]. Electromagnetic acoustic emission (EMAE) technology can effectively avoid the interference of wheel-rail rolling noise (WRRN) and performs well in rail crack detection, but the detection of small cracks is still limited at high train speeds [13]. Eddy current testing, utilizing eddy current technique (ECT) probes and post-processing systems, can detect railhead cracks [9,14]. However, this method has difficulty distinguishing rail faults of different severity levels. In railway systems, laser detection is employed for two-dimensional contour measurements, which allows a detailed assessment of the condition of the railway track surface [15]. Some scholars have collected track inspection images and ultrasonic B-scan images using the camera and ultrasonic equipment of a rail inspection vehicle, applied algorithms to segment the track surface images captured by the camera and to filter the ultrasonic data, and finally tested the images using the trained model [16].
In recent years, artificial intelligence (AI) has led to the widespread application of detection technologies in various fields, such as 3D modeling from unmanned aerial vehicle (UAV) point clouds [17]. This is particularly evident in the automatic identification of defects in civil infrastructure, such as automatic pavement texture recognition, tunnel crack detection, rail track identification, and unwanted object detection [18,19,20]. In comparison with traditional manual inspection methods, AI models offer superior computational performance, faster detection speeds, and a more cost-effective and secure approach. Therefore, the use of deep learning models for the automated identification and assessment of defects and service conditions in railway tracks is feasible. The effective combination of digital twins (DTs) and sixth-generation (6G) vehicle-to-everything (V2X) communications can enhance the analysis of driver behavior and enable fast and accurate diagnosis of vehicle operating conditions to help in vehicle decision-making tasks [21]. Kim B et al. proposed a transformer-based hybrid model that can simultaneously utilize temporal and spatial features to identify anomalies in railway heating, ventilating, and air conditioning (HVAC) systems, achieving good performance [22]. In a related study, Kim B et al. likewise used temporal and spatial features jointly to identify anomalies in railway HVAC systems [23]. Feng et al. proposed a method employing the probability structure topic model (STM) for the classification and defect detection of automatic fasteners in railway inspection systems [24]. The SeMA-UNet model optimizes deep learning models in data-constrained scenarios, conducts comprehensive feature extraction on railway images, and can achieve better results than traditional models [25]. T. Ye and colleagues devised an automated object detection system capable of discerning obstacles on railway tracks, including curved sections [26]. Cai et al. proposed the PPNN network by combining a probability neural network with a particle swarm optimization algorithm; they achieved satisfactory results in recognizing subway vehicle noise using this network [27]. Aydin et al. devised a fusion model that combines deep convolutional neural network (CNN) features with a support vector machine (SVM) for the classification of rail defects [28]. Aytekin et al. analyzed a fusion methodology for images acquired by a high-speed 3D laser rangefinder, based on pixel and histogram similarities. This approach has low computational complexity, which makes it applicable to real-time surveillance [29].
To overcome the need of most CNN models for a large number of training samples, some researchers have used ensemble models for detecting rail surface defects. The detection results indicate that the ensemble algorithm outperforms single detection architectures [30]. Luo H et al. proposed a method based on an improved YOLOv5s, which uses data augmentation, a global attention mechanism, and an optimized loss function to enhance the detection accuracy of rail surface defects [31]. Kim I et al. applied the RAG-PaDiM algorithm to railway track defect segmentation. By utilizing a residual attention-guided U-Net algorithm to generate embedding vectors and by using residual connections and attention gates to provide regional weights, the algorithm achieves accurate pixel-level area curve segmentation and improves task performance [32]. Zheng et al. employed a deep data-driven model and a transfer learning approach. They adapted pre-trained YOLOv3 and RetinaNet models to detect and assess cracks on the railway surface. This model exhibited commendable recall and precision when applied to a constrained dataset [3]. The FS-RSDD model uses prototype learning to estimate the defect probability of test samples, overcoming the limitations of supervised learning algorithms to achieve high-precision defect detection and localization with limited training samples [33]. Zhou et al. processed images acquired from the railway to extract precise positional information about the areas of rail that require polishing, employing a machine vision system to identify regions requiring track grinding [34].
Although machine vision and neural networks have been widely employed in industrial inspection tasks, there is still relatively little research on detecting rail surface defects characterized by intricate noise patterns and diverse sample sets. To further improve the accuracy of surface defect detection on railway tracks, this paper proposes a lightweight improved network, MobileNet-YOLOv7. The primary contributions are as follows:
- (1) The network structure is simplified by adopting MobileNetV3 as the feature extraction network, which reduces the number of model parameters, enlarges the receptive field, extracts local features of samples more effectively, and improves computational efficiency while maintaining detection accuracy.
- (2) To address slow convergence and overfitting in the model, we use the k-means++ clustering algorithm to adjust the anchors for object detection, improving the alignment between anchor boxes and real samples. The results show that this method can effectively accelerate network convergence and improve detection accuracy while mitigating sample imbalance concerns.
- (3) To tackle oscillation and slow convergence of the loss function during training, we replace the conventional loss function with the EIOU function. With the EIOU function, the algorithm retains the essential features of the loss while minimizing the difference between the width and height of the target and anchors, consequently enhancing localization performance.
The method proposed in this paper is shown in Figure 1.
2. Methodology
YOLOv7 [35] is one of the latest versions in the YOLO series. The YOLOv7 network structure mainly includes a backbone layer, a feature pyramid network (FPN) layer, and a head layer [36]. The backbone is the primary feature extraction network and processes input images to produce a set of features, referred to as feature layers. The FPN is an enhanced feature extraction network that fuses three key feature layers obtained from the backbone to combine feature information across scales. The head layer is the classifier and regressor: it evaluates each feature point to determine whether an object corresponds to one of the prior boxes at that point.
The YOLO family is a widely used class of one-stage object detectors, and in recent years many scholars have used YOLO models to detect surface defects on the track. The YOLO model itself is also being updated at a very fast rate. Although YOLOv7 performs well in terms of speed and accuracy, its use of multiple convolutional operations for feature extraction results in a multilayered network with many parameters, so there is still room to improve detection speed and accuracy. Furthermore, more and more engineering departments now use edge devices for onsite detection. YOLOv7 employs a large number of stacked structures which, while making the network easy to optimize, also lead to excessive model parameters and high hardware demands. This makes it unsuitable for deployment, and in particular for real-time detection, on mobile devices or hardware with limited computing power.
The improvements to the algorithm fall into three parts. First, the YOLOv7 backbone network is replaced with the lightweight MobileNetV3 network to establish a symmetrical network structure. MobileNetV3 applies 3 × 3 depthwise convolutions to the individual input channels, followed by 1 × 1 pointwise convolutions, to reduce the model size. Second, because proper anchor boxes play a crucial role in obtaining an excellent detection model, this paper clusters the rail surface defect dataset using the k-means++ algorithm, and the anchor boxes obtained from the new clustering results are incorporated into the improved model, increasing the robustness of the algorithm during training. Combined with the fast data processing of YOLOv7, this accelerates training and supports real-time data processing. Third, because sample quality in the dataset is imbalanced, training on low-quality samples can cause large fluctuations in loss values, resulting in oscillations in the convergence curve of the loss function. To solve this problem, this paper replaces the original bounding box loss function CIOU with the EIOU, allowing the bounding box regression to proceed more effectively and stably.
2.1. Improved YOLOv7 Network
Traditional CNNs have large memory requirements and computational costs, making them unsuitable to run on mobile and embedded devices. To address this issue, the MobileNet [37] network was developed. The MobileNet series is widely used in object detection for its fast and accurate detection capabilities and has become a representative of lightweight networks. The architecture of MobileNetV3 largely follows the design of MobileNetV2, incorporating lightweight depthwise convolution (DWC) and residual blocks. As a lightweight network, MobileNetV3 stands out for using a neural architecture search (NAS) to build the network and for integrating the depthwise separable convolution from MobileNetV1 with the inverted residual structure with linear bottlenecks from MobileNetV2. The former reduces the model’s computational load, while the latter increases the model’s representational capacity. Additionally, it introduces the SE lightweight attention mechanism in network construction. MobileNetV3 has shown excellent performance in tasks such as mobile image classification, object detection, and semantic segmentation. Figure 2 displays the layout of the MobileNetV3 block.
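DWC is a commonly used convolution operation in lightweight networks. Assuming the input feature map has size $H \times W \times M$ and the layer produces $N$ output channels, DWC applies $M$ convolution kernels of size $D \times D$, one per input channel, to generate $M$ feature maps, which a 1 × 1 pointwise convolution then combines into the $N$ output channels.
The computational cost of the standard convolution is:
$$C_{std} = H \cdot W \cdot M \cdot N \cdot D \cdot D \tag{1}$$
The computational cost of DWC is:
$$C_{DWC} = H \cdot W \cdot M \cdot D \cdot D + H \cdot W \cdot M \cdot N \tag{2}$$
The comparison of computational cost between DWC and standard convolution is:
$$\frac{C_{DWC}}{C_{std}} = \frac{1}{N} + \frac{1}{D^{2}} \tag{3}$$
For the commonly used 3 × 3 convolution kernel, the ratio approaches $1/9$ when $N$ is large, so DWC can reduce computational costs by approximately 90%, significantly lowering the computational burden.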
Figure 3 illustrates the implementation process.
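The cost comparison above can be checked numerically. The following is a minimal sketch, assuming stride 1 and padding that keeps the output the same spatial size as the input; the function names and the example layer shape (80 × 80 × 64 input, 128 output channels) are illustrative choices, not values taken from this paper.

```python
def standard_conv_macs(h, w, m, n, d):
    """Multiply-accumulate count of a standard d x d convolution
    mapping m input channels to n output channels on an h x w map."""
    return h * w * m * n * d * d


def depthwise_separable_macs(h, w, m, n, d):
    """Depthwise d x d convolution (one kernel per input channel)
    followed by a 1 x 1 pointwise convolution to n output channels."""
    depthwise = h * w * m * d * d
    pointwise = h * w * m * n
    return depthwise + pointwise


def cost_ratio(n, d):
    """Closed-form ratio of depthwise separable cost to standard cost."""
    return 1.0 / n + 1.0 / (d * d)


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    h, w, m, n, d = 80, 80, 64, 128, 3
    std = standard_conv_macs(h, w, m, n, d)
    dws = depthwise_separable_macs(h, w, m, n, d)
    print(f"standard: {std} MACs, depthwise separable: {dws} MACs")
    print(f"ratio: {dws / std:.4f}, saving: {1 - dws / std:.1%}")
```

For this hypothetical layer the saving is a little under 90%, because the pointwise term $1/N$ adds to the $1/D^2$ term; the saving approaches $1 - 1/9$ only as the number of output channels grows.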
To maintain real-time performance while improving accuracy, the SE lightweight attention mechanism is introduced into the network. First, global descriptive features are obtained through the squeeze operation; then, the excitation operation generates a weight for each channel; lastly, the output weights of the excitation operation are used as importance indices of the feature channels in the reweight operation, which scales the previous features and recalibrates each channel. Additionally, because the sigmoid function in the Swish activation function is computationally expensive, which matters in real-time applications, MobileNetV3 uses the h-swish activation function, a piecewise-linear approximation of Swish, to reduce the computational cost and speed up detection. The h-swish activation function is given by Equation (4):
$$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6} \tag{4}$$
Substituting the traditional backbone network with the lightweight MobileNetV3 network enables a significant reduction in computational expenses and model size while preserving a satisfactory level of accuracy. This makes YOLOv7 more suitable for use in resource-constrained scenarios. The structure of the network used by the algorithm is shown in Figure 4.
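The squeeze, excitation, and reweight steps and the h-swish activation described in this section can be illustrated with a small pure-Python sketch. It is not the implementation used in this paper: the function names, the list-based tensor layout, and the weight matrices `w1` and `w2` are hypothetical, and the hard-sigmoid gate in the excitation step is assumed to follow the usual MobileNetV3 design.

```python
def relu6(x):
    """ReLU clipped to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def h_sigmoid(x):
    """Piecewise-linear approximation of the sigmoid."""
    return relu6(x + 3.0) / 6.0


def h_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6."""
    return x * h_sigmoid(x)


def se_reweight(feature_maps, w1, w2):
    """Squeeze-and-excitation on a list of C channels, each a flat list.

    w1 is an R x C matrix and w2 a C x R matrix (hypothetical weights,
    biases omitted for brevity).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then a gated output layer.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    scale = [h_sigmoid(sum(w * hi for w, hi in zip(row, hidden))) for row in w2]
    # Reweight: scale every value of a channel by that channel's weight.
    return [[s * v for v in ch] for s, ch in zip(scale, feature_maps)]


if __name__ == "__main__":
    maps = [[1.0, 3.0], [2.0, 2.0]]   # two channels, two values each
    w1 = [[1.0, 0.0]]                 # C = 2 -> R = 1
    w2 = [[3.0], [-3.0]]              # R = 1 -> C = 2
    print(se_reweight(maps, w1, w2))  # first channel kept, second suppressed
    print(h_swish(1.0))
```

Because h-swish and the hard sigmoid use only additions, clipping, and a division by a constant, they avoid the exponential in the standard sigmoid, which is the source of the cost reduction mentioned above.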
2.2. The EIoU Loss Function
YOLOv7’s loss function has three components: localization loss, object confidence loss, and classification loss. The localization loss is the complete intersection over union (CIOU) loss:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v$$
$$\alpha = \frac{v}{(1 - IOU) + v}$$
$$v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}$$
where $b$ and $b^{gt}$ denote the centers of the predicted box and the ground truth box, respectively, $\rho$ represents the Euclidean distance between the two centers, $c$ denotes the diagonal length of the smallest rectangle that encloses both the predicted box and the ground truth box, $\alpha$ is the weight function, $v$ is the factor that measures the consistency of the aspect ratio, $w$ and $w^{gt}$ are the widths of the predicted box and the ground truth box, and $h$ and $h^{gt}$ are the heights of the predicted box and the ground truth box.
However, CIOU considers only the relative proportion of width and height and does not take into account the actual differences in width and height. Therefore, when the predicted box and the ground truth box have the same aspect ratio, the aspect ratio penalty term vanishes even if their sizes differ, which is not conducive to model optimization. This paper introduces the efficient IOU (EIOU) loss function to replace the CIOU. The EIOU loss function splits the aspect ratio term of the CIOU and penalizes the differences in width and height directly, which accelerates convergence and improves regression accuracy. The loss function consists of three parts: overlap area loss $L_{IOU}$, center point distance loss $L_{dis}$, and width–height loss $L_{asp}$, following Equations (9) and (10).
$$L_{EIOU} = L_{IOU} + L_{dis} + L_{asp} \tag{9}$$
$$L_{EIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{C_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{C_{h}^{2}} \tag{10}$$
where $C_{w}$ and $C_{h}$ are the width and height of the smallest rectangle that contains both the predicted and the ground truth boxes.
The conventional IOU loss function can encounter issues when dealing with small objects because the IOU values for small objects are usually low. The EIOU loss function can better handle small objects by incorporating center point and width–height information, improving small object detection performance. Additionally, the EIOU loss function is more sensitive to changes in the position and shape of the target box, making it adaptable to targets of different scales and shapes. This enhances the robustness of the model, enabling it to detect targets effectively in various scenarios.
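The difference between the two losses can be made concrete with a short sketch. It uses the standard published forms of the CIOU and EIOU losses for two axis-aligned boxes given as (x1, y1, x2, y2); the function name, the `eps` stabilizer, and the example boxes are assumptions made here for illustration and do not come from this paper’s code.

```python
import math


def box_losses(pred, gt, eps=1e-9):
    """Return (iou, ciou_loss, eiou_loss) for two boxes (x1, y1, x2, y2).

    Both boxes are assumed to have positive width and height.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    w, h = px2 - px1, py2 - py1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1

    # Overlap area and IOU.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    iou = inter / (w * h + w_gt * h_gt - inter + eps)

    # Smallest enclosing rectangle: width, height, squared diagonal.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared distance between the two box centers.
    rho2 = ((px1 + px2 - gx1 - gx2) / 2) ** 2 + ((py1 + py2 - gy1 - gy2) / 2) ** 2

    # CIOU: aspect-ratio consistency term v weighted by alpha.
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    ciou = 1 - iou + rho2 / c2 + alpha * v

    # EIOU: separate penalties on the width and height differences.
    eiou = (1 - iou + rho2 / c2
            + (w - w_gt) ** 2 / (cw ** 2 + eps)
            + (h - h_gt) ** 2 / (ch ** 2 + eps))
    return iou, ciou, eiou


if __name__ == "__main__":
    # Concentric boxes with the same aspect ratio but different sizes.
    iou, ciou, eiou = box_losses((2, 2, 8, 8), (0, 0, 10, 10))
    print(f"IOU={iou:.2f}  CIOU loss={ciou:.2f}  EIOU loss={eiou:.2f}")
```

In this example the two boxes share a center and an aspect ratio, so the CIOU penalty terms vanish and the CIOU loss reduces to 1 − IOU, whereas the EIOU loss still carries a width–height penalty that pushes the predicted box toward the correct size.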
2.3. Unit Clustering Based on the k-means++ Algorithm
The default initial anchor box sizes in the YOLOv7 algorithm are derived from the COCO training set using the k-means algorithm, whose result is influenced by the choice of initial cluster centers. Using these default sizes directly for training not only reduces the final model’s accuracy but can also prolong training and hinder convergence. Therefore, it is essential to recluster the dataset to obtain suitable anchor box sizes, which makes the learning process of deep convolutional neural networks smoother and enables better predictions.
We improve the accuracy of the proposed object detection network in predicting object positions by using the k-means++ clustering algorithm [38] instead of the original k-means clustering algorithm. Compared with k-means, the k-means++ clustering algorithm improves the quality of the clustering results by optimizing the selection of initial points. Clustering the dataset with k-means++ reduces the bias caused by the random selection of initial cluster centers and therefore generates more accurate and representative anchor boxes.
When using the k-means++ clustering algorithm to select candidate boxes for rail surface defect samples, the distance between a sample $box$ and the cluster center $centroid$ is given by Equation (11), where $box$ is drawn from the widths and heights of all labeled rails and rail defects:
$$d(box, centroid) = 1 - IOU(box, centroid) \tag{11}$$
where $IOU(box, centroid)$ represents the intersection over union between $box$ and $centroid$.
Once the cluster centers are determined, the following Equation (12) is shown as follows:
When the input image size is 640 × 640, the algorithm will generate three different sizes of feature map outputs: 80 × 80, 40 × 40, and 20 × 20. Among them, the 80 × 80 map represents the shallow feature map, suitable for detecting small objects; the 20 × 20 map represents the deep feature map for capturing contour and structural information; and the 40 × 40 feature map is used for detecting medium-sized objects between the other two scales. Each feature map has 3 types of anchors, totaling 9 anchor boxes.
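The relationship between the input size, the three feature map sizes, and the number of anchors can be reproduced with a few lines of arithmetic. This sketch assumes downsampling strides of 8, 16, and 32, which are consistent with the 80 × 80, 40 × 40, and 20 × 20 maps quoted above; the function name is hypothetical.

```python
def detection_grid(img_size=640, strides=(8, 16, 32), anchors_per_scale=3):
    """Return the grid size per scale and the total number of anchor
    predictions for a square input image."""
    grids = [img_size // s for s in strides]
    total = sum(g * g * anchors_per_scale for g in grids)
    return grids, total


if __name__ == "__main__":
    grids, total = detection_grid()
    print(grids)   # grid size at each of the three scales
    print(total)   # number of candidate boxes before filtering
```

Under these assumptions, a 640 × 640 input yields (80² + 40² + 20²) × 3 = 25,200 candidate boxes in total.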
In this research, applying k-means++ clustering to our training dataset yielded nine anchor box sizes: (117, 62), (123, 99), (162, 79), (618, 35), (164, 147), (614, 41), (43, 621), (626, 59), and (85, 619). In each pair, the first number is the width of the anchor box and the second number is its height. The calculation results are shown in Figure 5, where the x-axis and y-axis represent the width and height of the ground truth bounding box, respectively. Each color represents one cluster, and “★” marks the centroid of each cluster after clustering. Because there are 9 sets of anchors in the algorithm, there are 9 centroid points in the figure.
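The anchor clustering procedure described in this section can be sketched as follows. This is an illustrative pure-Python version, not the code used in this paper: it assumes the common 1 − IOU distance between width–height pairs, k-means++ seeding with probability proportional to the squared distance, and a mean update of the cluster centers; the function names and the toy box list are hypothetical.

```python
import random


def iou_wh(a, b):
    """IOU of two boxes given as (width, height), aligned at one corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeanspp_anchors(boxes, k, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = random.Random(seed)
    # k-means++ seeding: first center uniformly at random, later centers
    # with probability proportional to the squared distance to the
    # nearest center chosen so far.
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        d2 = [min(1 - iou_wh(b, c) for c in centers) ** 2 for b in boxes]
        if sum(d2) == 0:            # all boxes coincide with a center
            centers.append(rng.choice(boxes))
        else:
            centers.append(rng.choices(boxes, weights=d2, k=1)[0])
    # Standard iterations: assign each box to the nearest center under
    # the 1 - IOU distance, then move each center to its cluster mean.
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for b in boxes:
            j = min(range(k), key=lambda i: 1 - iou_wh(b, centers[i]))
            groups[j].append(b)
        new_centers = [
            (sum(b[0] for b in g) / len(g), sum(b[1] for b in g) / len(g))
            if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


if __name__ == "__main__":
    # Toy data: three small boxes and three large, wide boxes.
    toy = [(10, 10), (11, 10), (10, 11), (100, 50), (101, 50), (100, 51)]
    print(kmeanspp_anchors(toy, k=2))
```

Sorting the resulting anchors by area is a convenient way to assign them to the detection scales in groups of three, from the shallow feature map to the deep one.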
5. Conclusions
In this paper, the original YOLOv7 algorithm was improved to obtain a more effective rail surface defect detection algorithm. MobileNetV3 replaced the original backbone network in the feature extraction part. This lightweight feature extraction network significantly reduced the model size, leading to potential performance enhancements in detecting rail fasteners. Furthermore, the EIoU loss function was adopted and the k-means++ algorithm was used for anchor clustering, which mitigated the instability arising from different initial cluster center selections. Compared with the original algorithm, clustering accuracy and robustness were improved, which in turn increased convergence speed and detection accuracy. Experimental validation confirmed that the proposed algorithm achieves rapid and high-precision positioning, with a precision of 94.9%, a recall of 90.6%, and a mean average precision (mAP) of 95.2%. This level of accuracy makes the algorithm well suited to rail surface defect detection tasks.
While this paper has improved the mean average precision (mAP), the dataset used in this study comes from specific railway lines, and further verification is needed for detecting surface defects on rails in different environments. Additionally, because MobileNetV3 extracts more detailed features of the rail surface condition, the algorithm may mistakenly detect areas affected by lighting and environmental pollution as defects, potentially increasing the false positive rate; this requires further improvement. In future research, we will consider expanding the proposed framework and making the network even more lightweight to optimize the model for mobile embedded devices and to address more complex scenarios in railway surface health analysis.