### *3.2. YOLOv4 Backbone Network Adaptability Improvement*

In practical engineering applications, the detection of rail surface defects has particular requirements in terms of detection accuracy, speed and model size. The method in this paper takes these requirements into account and adapts YOLOv4 accordingly: MobileNetV3 is used as the backbone network of YOLOv4. MobileNet is a lightweight deep neural network proposed by Google for embedded devices, and its core idea is the depthwise separable convolution. Compared with the traditional convolution used in YOLOv4, the depthwise separable convolution in MobileNetV3 further reduces the number of parameters and the amount of calculation, thus making the network lightweight.

A lightweight attention (Squeeze-and-Excitation, SE) module is used in MobileNetV3. Its advantage is that it can improve the performance of the algorithm with a negligible increase in computation. The SE module is implemented as shown in Figure 4. First, in the squeeze step, global average pooling is performed on the input features of size *C* × *H* × *W* to obtain a global receptive field feature map of size 1 × 1 × *C*. Then, in the excitation step, a fully connected neural network applies a nonlinear transformation to produce an activation value for each channel. Finally, the input features are weighted channel by channel by the activation values from the SE module.
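The squeeze, excitation and reweighting steps above can be sketched in plain Python; the layer sizes, weights and the fully connected structure below are illustrative assumptions, not values taken from the paper.

```python
import math

def se_block(x, w1, w2):
    """Sketch of an SE module forward pass on a C x H x W feature map.

    x:  C x H x W input features (nested lists) -- toy stand-in for a tensor
    w1: (C/r) x C weights of the first FC layer (channel reduction)
    w2: C x (C/r) weights of the second FC layer (channel restoration)
    """
    C = len(x)
    # Squeeze: global average pooling -> one scalar per channel (1 x 1 x C)
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: FC -> ReLU -> FC -> sigmoid, yielding one activation per channel
    h = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(row, h)))) for row in w2]
    # Scale: reweight each input channel by its activation value
    return [[[v * s[c] for v in row] for row in x[c]] for c in range(C)]
```

Because the output is just the input scaled per channel by a sigmoid activation in (0, 1), the module adds only two small fully connected layers of computation.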

**Figure 4.** The specific process of the SE module.

### *3.3. Adaptability Improvement of PANet Layer in YOLOv4*

PANet in YOLOv4 has the advantages of dynamic feature pooling, fully connected layer fusion and bottom-up path enhancement, but it also has disadvantages such as a large number of parameters and complex calculations. To resolve this problem, the convolution structure in PANet is modified: the 3 × 3 and 5 × 5 standard convolutions are replaced by depthwise separable convolutions.

Depthwise separable convolution [30] is a lightweight convolution module. It consists of the following two parts: depthwise convolution (DW) and pointwise convolution (PW). In DW, each channel of the input is convolved separately with its own convolution kernel. Then, PW applies a 1 × 1 convolution kernel to combine the output maps from DW across channels and adjust the channel dimension.
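As a concrete illustration, DW and PW can be written out in plain Python (stride 1, no padding); the toy shapes and weights below are assumptions for illustration, not settings from the paper.

```python
def depthwise_conv(x, k):
    """DW: x is M x Dz x Dz; k is M x Di x Di, one kernel per input channel."""
    M, Dz, Di = len(x), len(x[0]), len(k[0])
    out = Dz - Di + 1  # valid convolution, stride 1
    return [[[sum(k[m][a][b] * x[m][i + a][j + b]
                  for a in range(Di) for b in range(Di))
              for j in range(out)] for i in range(out)] for m in range(M)]

def pointwise_conv(x, k):
    """PW: x is M x H x W; k is K x M, a 1 x 1 kernel mixing M channels into K."""
    M, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(k[o][m] * x[m][i][j] for m in range(M))
              for j in range(W)] for i in range(H)] for o in range(len(k))]
```

Chaining `depthwise_conv` and then `pointwise_conv` approximates a standard convolution that maps *M* input channels to *K* output channels, but splits the spatial filtering and the channel mixing into two cheap stages.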

In the standard convolutional layer, assume that the size of the input feature map is *Dz* × *Dz*, the number of channels is *M*, the size of the convolution kernel is *Di* × *Di* and the number of convolution kernels is *K*. Then, the standard convolution calculation amount *C*<sub>1</sub> can be calculated by Formula (1):

$$C\_1 = D\_z \times D\_z \times M \times K \times D\_i \times D\_i \tag{1}$$

In depthwise separable convolution, DW and PW are performed separately, as shown in Figure 5. The calculation amount *C*<sub>2</sub> of the depthwise separable convolution can be calculated by Formula (2):

$$C\_2 = D\_z \times D\_z \times M \times D\_i \times D\_i + K \times M \times D\_z \times D\_z \tag{2}$$

The calculation amounts of the depthwise separable convolution and the standard convolution are compared as follows:

$$\frac{C\_2}{C\_1} = \frac{D\_z \times D\_z \times M \times D\_i \times D\_i + K \times M \times D\_z \times D\_z}{D\_z \times D\_z \times M \times K \times D\_i \times D\_i} = \frac{1}{K} + \frac{1}{D\_i^2} \tag{3}$$

In this equation, the number of convolution kernels *K* is usually much greater than 1, and the commonly used convolution kernel sizes are 3 × 3 and 5 × 5, so the result of Formula (3) is less than 1. That is, the calculation amount of the depthwise separable convolution is smaller than that of the standard convolution.
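This comparison can be checked numerically; the shape values below (a 56 × 56 feature map, 128 input channels, a 3 × 3 kernel and 256 output channels) are example assumptions, not settings from the paper.

```python
# Example shapes (assumed for illustration)
Dz, M, Di, K = 56, 128, 3, 256

c1 = Dz * Dz * M * K * Di * Di                   # standard convolution, Formula (1)
c2 = Dz * Dz * M * Di * Di + K * M * Dz * Dz     # DW + PW, Formula (2)
ratio = c2 / c1                                  # equals 1/K + 1/Di^2, Formula (3)
```

With these values the ratio is roughly 0.115, i.e., the depthwise separable convolution needs about 11.5% of the standard convolution's multiply-accumulate operations, dominated by the 1/*Di*² term.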

**Figure 5.** Classic convolution and depthwise separable convolution.

The PANet layer is improved as shown in Figure 6. The improved layer retains the advantages of PANet (dynamic feature pooling, fully connected layer fusion and bottom-up path enhancement) while reducing the computation in PANet, thereby making the network lightweight and, finally, achieving the optimization of YOLOv4.

**Figure 6.** Improved PANet layer.
