#### *4.1. Image Acquisition*

Following the technical route of the proposed method, a track inspection field test was carried out. As shown in Figure 7, the intelligent track inspection vehicle used in the test was developed by Beijing Yinglu Technology Co., Ltd. (Beijing, China). The vehicle is composed of two parts: an electric inspection vehicle and a track state inspection system. The electric inspection vehicle contains a car body, track wheels and seats; the track state inspection system is composed of a host and a high-definition linear image scanning module. In this test, a 15 km section of track on the Beijing–Shanghai high-speed rail line was chosen as the test section. The travel speed of the inspection vehicle was 20 km/h, and the image resolution was 2048 × 2048.

**Figure 7.** Intelligent track inspection vehicle.

The specific collection equipment data is shown in Table 1.



The specific configuration of the algorithm environment used in the test is shown in Table 2.

**Table 2.** Test environment.


One thousand rail images collected in the field test were chosen for rail surface defect detection, of which 900 were randomly selected as the training dataset and 100 as the test dataset. Before the improved YOLOv4 can be applied, the images must be annotated to establish a feature database for the dataset. In this paper, LabelImg (version 1.0) was used for image annotation. LabelImg is an image annotation tool written in Python that uses Qt for its graphical interface. The rail surface in the image is regarded as the target detection area, as shown in Figure 8.

After annotation, the coordinates of the rail surface defect area are obtained, and both model training and the defect detection test are performed on the coordinate dataset generated by the image annotation.
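The random 900/100 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the file names, the `split_dataset` helper and the fixed seed are assumptions introduced here so that the split is reproducible.

```python
import random


def split_dataset(image_ids, n_test, seed=0):
    """Randomly split image identifiers into (train, test) lists."""
    if not 0 <= n_test <= len(image_ids):
        raise ValueError("n_test must be between 0 and the dataset size")
    shuffled = list(image_ids)             # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a reproducible split
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for the 1000 field images.
image_ids = [f"rail_{i:04d}.jpg" for i in range(1000)]
train_ids, test_ids = split_dataset(image_ids, n_test=100)
```

Shuffling once and slicing guarantees that the training and test sets are disjoint and together cover every image.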

**Figure 8.** Image annotation.
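As an illustration of how the annotated coordinates might be read back, the sketch below parses one annotation file. It assumes the annotations were saved in Pascal VOC XML, which is one of the formats LabelImg can export; the paper does not state the format used, and the sample file content, the `defect` label and the `read_boxes` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_boxes(xml_text):
    """Return a list of (label, xmin, ymin, xmax, ymax) from Pascal VOC XML text."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bnd = obj.find("bndbox")
        coords = tuple(
            int(float(bnd.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, *coords))
    return boxes


# Hypothetical annotation for one 2048 x 2048 image.
SAMPLE_XML = """
<annotation>
  <filename>rail_0001.jpg</filename>
  <size><width>2048</width><height>2048</height><depth>3</depth></size>
  <object>
    <name>defect</name>
    <bndbox><xmin>812</xmin><ymin>640</ymin><xmax>1010</xmax><ymax>790</ymax></bndbox>
  </object>
</annotation>
"""
boxes = read_boxes(SAMPLE_XML)
```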

#### *4.2. Establish a Detection Model for Rail Surface Defects*

In order to verify the effectiveness of the method proposed in the study, 5% and 10% Gaussian noise are added to the original dataset, respectively, as shown in Figure 9.

**Figure 9.** Gaussian noise processing diagram.
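The noise injection can be sketched as below. The paper does not define what "5%" and "10%" refer to, so this sketch assumes they denote the standard deviation of zero-mean Gaussian noise relative to the full 8-bit intensity range; the `add_gaussian_noise` helper and the tiny constant test image are illustrative only.

```python
import random


def add_gaussian_noise(image, fraction, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image (rows of 0-255 ints).

    ASSUMPTION: `fraction` is the noise standard deviation relative to the
    8-bit range, so 0.05 gives sigma = 12.75. The paper does not specify this.
    """
    rng = random.Random(seed)
    sigma = fraction * 255.0
    noisy = []
    for row in image:
        # Perturb each pixel, round to an integer and clip to the valid range.
        noisy.append([min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row])
    return noisy


image = [[128] * 8 for _ in range(8)]   # small uniform gray patch
noisy_5 = add_gaussian_noise(image, 0.05)
noisy_10 = add_gaussian_noise(image, 0.10)
```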

The improved YOLOv4 uses MobileNetV3 as the backbone network for feature extraction and, at the same time, uses depthwise separable convolution to replace the traditional convolution in PANet. The rail defect detection model, shown in Figure 10, works in three steps: (1) resize the input image, (2) apply the improved YOLOv4 network to the resized image and (3) output the detection target.

**Figure 10.** Rail defect detection model: (1) resize the input image; (2) apply the improved YOLOv4 network to the resized image; (3) output the detection target.
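To see why replacing a standard convolution with a depthwise convolution followed by a 1 × 1 pointwise convolution reduces the model size, the weight counts of the two layer types can be compared directly. The kernel size and channel counts below are illustrative and are not taken from the paper; bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


k, c_in, c_out = 3, 256, 512   # illustrative layer sizes
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = sep / std              # algebraically equal to 1/c_out + 1/k**2
```

For a 3 × 3 kernel the ratio approaches 1/9 as the number of output channels grows, so each replaced layer keeps roughly one ninth of its original weights.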

The *CIOU* calculation method in YOLOv4 makes the regression of the target frame more stable. It takes into account the distance, overlap and scale between the target and the anchor, together with a penalty term, and avoids divergence during training. Figure 11 illustrates a surface defect of the rail: the red box is the target frame that encloses the defect, the green box is the prediction box, and the purple box is the smallest rectangle that covers both. *d* represents the distance between the center points of the target box and the predicted box, and *c* represents the diagonal length of the smallest area that covers both the prediction box and the target box.

**Figure 11.** Comparison of the parameters of the methods.

The *CIOU* is calculated as shown in Formulas (4)–(6):

$$v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^2 \tag{4}$$

$$\alpha = \frac{v}{1 - IOU + v} \tag{5}$$

$$CIOU = IOU - \frac{\rho^2 \left(b, b^{gt}\right)}{c^2} - \alpha v \tag{6}$$

where *ρ* denotes the Euclidean distance; *b*, *w* and *h* are the center coordinates, width and height of the prediction box; and $b^{gt}$, $w^{gt}$ and $h^{gt}$ are the center coordinates, width and height of the target frame.

In the study, the *CIOU* threshold was set to 0.7. The detection image can be output only when the result is greater than 0.7, which makes the bounding box more accurate.
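A minimal sketch of the *CIOU* computation and the 0.7 threshold is given below, with the intersection over union of the two boxes as the overlap term. It assumes boxes are given as (center x, center y, width, height) in the same units; the function name, the box layout and the guard for identical boxes are choices made for this illustration rather than details from the paper.

```python
import math

CIOU_THRESHOLD = 0.7   # threshold stated in the text


def ciou(pred, target):
    """CIOU of two boxes given as (cx, cy, w, h), following Formulas (4)-(6)."""
    px, py, pw, ph = pred
    tx, ty, tw, th = target
    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    t_x1, t_y1, t_x2, t_y2 = tx - tw / 2, ty - th / 2, tx + tw / 2, ty + th / 2
    # Intersection over union.
    iw = max(0.0, min(p_x2, t_x2) - max(p_x1, t_x1))
    ih = max(0.0, min(p_y2, t_y2) - max(p_y1, t_y1))
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter)
    # Squared center distance and squared diagonal of the smallest enclosing box.
    rho2 = (px - tx) ** 2 + (py - ty) ** 2
    c2 = (max(p_x2, t_x2) - min(p_x1, t_x1)) ** 2 + (max(p_y2, t_y2) - min(p_y1, t_y1)) ** 2
    # Aspect-ratio term v (Formula 4) and its weight alpha (Formula 5).
    v = 4 / math.pi ** 2 * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0   # avoids 0/0 for identical boxes
    return iou - rho2 / c2 - alpha * v            # Formula 6


score = ciou((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 2.0, 2.0))
keep = score > CIOU_THRESHOLD
```

In this example the two boxes have the same shape, so *v* is zero and the score reduces to the overlap minus the normalized center distance.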

When establishing the rail defect detection model, if the learning rate is too large, the step size of each update is too large, and the model cannot converge to the optimal value. If the learning rate is too small, convergence can be guaranteed, but the training efficiency of the model is sacrificed.

To avoid these problems, a trade-off has to be made when choosing the model parameters. An adaptive learning rate is used in the experiment to improve the optimization speed of the model, and the initial learning rate is set to 0.001. During training, the model loss and accuracy are evaluated on the training set after each epoch, and the change in the loss value is checked every other epoch. When the change is less than 0.0001, the learning rate *lr* is attenuated according to Formula (7):

$$lr^{*} = lr \times 0.1 \tag{7}$$
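The decay rule of Formula (7) can be sketched as a small scheduler. This is a simplified illustration under stated assumptions: the loss is compared with the previously recorded value at every call, the absolute change is used, and the `PlateauDecay` class name and the loss values are hypothetical. The paper's exact checking interval ("every other epoch") is not reproduced here.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` when the loss change drops below `min_delta`."""

    def __init__(self, lr=0.001, factor=0.1, min_delta=1e-4):
        self.lr = lr                # initial learning rate from the text
        self.factor = factor        # attenuation factor from Formula (7)
        self.min_delta = min_delta  # loss-change threshold from the text
        self._last_loss = None

    def step(self, loss):
        """Record the loss of the finished epoch and return the updated learning rate."""
        if self._last_loss is not None and abs(self._last_loss - loss) < self.min_delta:
            self.lr *= self.factor  # lr* = lr x 0.1
        self._last_loss = loss
        return self.lr


scheduler = PlateauDecay()
# Hypothetical training losses; only the third step changes by less than 0.0001.
history = [scheduler.step(loss) for loss in (0.90, 0.50, 0.49995, 0.30)]
```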

#### *4.3. Result Analysis*

In the study, the same dataset is applied to the Faster R-CNN, YOLOv3 and YOLOv4 methods for comparison and to verify the effectiveness of the proposed method.

Figure 12 compares the number of parameters of each method. The proposed method has the fewest parameters, about 1/20 of those of Faster R-CNN. Since YOLOv4 is an improvement on YOLOv3, the parameter counts of the two do not differ much. Building on YOLOv4, the proposed method adopts the lightweight MobileNetV3 as the backbone network and uses depthwise separable convolution in the PANet to further reduce the number of parameters. According to Table 3, the number of parameters in the proposed method is 78.04% lower than that of YOLOv4.

**Figure 12.** Comparison of the parameter quantities of the methods.


**Table 3.** Comparison of the detection results of rail defects.

| Method | Pr | Re | mAP | FPS | Volume |
|---|---|---|---|---|---|
| The proposed method | 94.24% | 82.56% | 93.21% | 44.64 | 53.6 |

To evaluate the detection results of rail defects, precision (Pr), recall (Re), mean Average Precision (mAP), Frames Per Second (FPS) and volume are introduced. Among them, mAP is a common metric for evaluating the accuracy of different target detection models; it is the mean of the average precision (*AP*) of each query. FPS refers to the number of frames processed per second, and volume refers to the size of the memory occupied by the model. The specific calculation formulas are as follows:

$$P_r = \frac{TP}{TP + FP} \times 100\% \tag{8}$$

$$R_e = \frac{TP}{TP + FN} \times 100\% \tag{9}$$

$$AP = \int_0^1 p(r)\, dr \tag{10}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{11}$$

where True Positives (*TP*) is the number of rail defects detected correctly, False Positives (*FP*) is the number of detections that do not correspond to a real defect, and False Negatives (*FN*) is the number of rail defects that were not detected. *p*(*r*) is the precision at recall *r*. *N* is the number of defects in all the rails.
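The metrics in Formulas (8)–(11) can be computed as in the sketch below. The counts and the precision-recall points are made-up illustrative values, not results from the paper, and the integral in Formula (10) is approximated here with the trapezoidal rule, which is a simplification chosen for this example (benchmark toolkits often use interpolated precision instead).

```python
def precision(tp, fp):
    """Formula (8): percentage of detections that are real defects."""
    return 100.0 * tp / (tp + fp)


def recall(tp, fn):
    """Formula (9): percentage of real defects that were detected."""
    return 100.0 * tp / (tp + fn)


def average_precision(recalls, precisions):
    """Formula (10), approximated by trapezoidal integration of p(r).

    `recalls` must be sorted in ascending order and expressed as fractions in [0, 1].
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(aps):
    """Formula (11): mean of the individual AP values."""
    return sum(aps) / len(aps)


# Illustrative counts and curve points only.
pr = precision(tp=90, fp=10)
re = recall(tp=90, fn=30)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
m_ap = mean_average_precision([ap, 0.625])
```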

From the detection results in Table 3, it can be seen that Faster R-CNN, a popular two-stage method, has a higher precision, recall and mAP than YOLOv3 but a lower detection speed and a larger model size, so it is not suitable for lightweight real-time detection. YOLOv3 has the advantage of a faster detection speed and a smaller model size, but its precision, recall and mAP are lower than those of Faster R-CNN. YOLOv4 is ahead of both Faster R-CNN and YOLOv3 in precision, recall, mAP and FPS, but its detection speed and model volume still have room for improvement. The method in this paper builds on YOLOv4. Owing to the use of the lightweight MobileNetV3 as the backbone network and depthwise separable convolution in the PANet, the model volume was 0.22 times that of YOLOv4, and the precision was improved by 1.64% compared with YOLOv4. The recall and mAP were increased by 1.16% and 2.54%, respectively. At the same time, the detection speed exceeded that of YOLOv4 by 10.36 frames per second, which better meets the requirement for rapid detection.

Because of the complex environment of the rail, the algorithm is required to have good anti-noise performance. To test the noise resistance of the proposed method, Gaussian noise was added to the dataset. Tables 4 and 5 give the detection results of rail defects with 5% and 10% Gaussian noise, respectively. It can be seen that the proposed method has a higher mAP than the other methods and performs better when noise is present. As the same models are used with slightly different test data, the FPS and volume of each method are consistent with those in Table 3. The results in Tables 3–5 show that the proposed method performs well and can be applied to the lightweight detection of steel rail surface defects.


**Table 4.** Comparison of the detection results of rail defects with 5% Gaussian noise.

**Table 5.** Comparison of the detection results of rail defects with 10% Gaussian noise.


#### **5. Conclusions**

The rapid, accurate and intelligent detection of rail surface defects is of great significance for ensuring the safe operations of railway vehicles. According to the characteristics of rail surface defect detection, a one-stage detection model based on deep learning was constructed for the detection of rail surface defects. Through experimental verification and comparative analysis, the following conclusions were drawn:

(1) To make the rail surface defect detection network lightweight, the YOLOv4 algorithm was improved. The backbone network of YOLOv4 was optimized, and the PANet layer in YOLOv4 was made lighter. These changes reduced the number of parameters, increased the detection speed and reduced the model size.


In addition to the above conclusions, with the rapid development of object detection methods, the ideas proposed in this paper can be extended to different deep learning networks. At the same time, in order to verify the effectiveness of the proposed method and to avoid introducing more variables, image preprocessing was not introduced in this paper. It can be inferred that the accuracy of the defect detection can be further improved if the image is effectively preprocessed. Finally, if sufficient railway surface defect images can be obtained to establish datasets, statistical tests can be performed to achieve a full statistical analysis of the proposed deep learning approaches.

**Author Contributions:** All authors conceived and designed the study. Conceptualization, methodology, software, validation and writing—original draft, T.B.; software and visualization, J.G.; validation and investigation, J.Y. and software and validation, D.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the General Project of Scientific Research Program of Beijing Municipal Education Commission under Grant KM202010016003, the National Natural Science Foundation of China under Grant 51975038, and the Natural Science Foundation of Beijing under Grant KZ202010016025.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data used to support the findings of this study are available from the corresponding author upon request.

**Acknowledgments:** The authors appreciate the support from the Beijing Key Laboratory of Performance Guarantee on Urban Rail Transit Vehicle for this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

