### *2.2. Improvements to Model Framework*

Three criteria need to be met when constructing the neural network model [34]. First, residual connections are used to increase the depth of the network, so that feature extraction is carried out by a deeper neural network. Second, the number of feature channels extracted at each layer is increased to obtain more feature extraction layers, get more features, and increase the width. Third, increasing the resolution of the input image enriches the features the network can learn and express, which is conducive to improving accuracy. These criteria are followed in the YOLOv4 model compression. The SwishBlock bottleneck module is established based on depthwise separable convolution and the design concept of the inverted residual structure. The capacity of the network is expanded along these three aspects simultaneously, and the SwishBlock replaces the ResBlock\_body as the core building block of the main YOLOv4 framework. Meanwhile, the SENet channel attention idea is incorporated into the network structure: different weights are assigned to the extracted feature maps so that more critical feature information is extracted without increasing the computation and storage costs of the model.
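To make the channel attention idea concrete, the following is a minimal PyTorch sketch of a squeeze-and-excitation (SE) block of the kind referenced above; the reduction ratio of 4 and the layer sizes are hypothetical choices for illustration, not values specified in the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal SE channel attention: global average pooling ('squeeze'), then a
    small bottleneck MLP that learns one weight per channel ('excitation')."""
    def __init__(self, channels: int, reduction: int = 4):  # reduction=4 is assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # global pooling -> (b, c) channel weights
        return x * w.view(b, c, 1, 1)     # reweight each channel's feature map
```

Because the weights come from a global pooling of each feature map, the block adds only a small fully connected branch and does not change the spatial resolution of the features it rescales.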

The actual architecture of the SwishBlock bottleneck module is an inverted residual structure. In an ordinary residual structure, the feature map has fewer channels in the middle and more channels at both ends, whereas in an inverted residual structure, the feature map has more channels in the middle and fewer channels at both ends. When applying depthwise separable convolution, first raising and then reducing the dimension of the feature map greatly improves the feature extraction ability of the network and also speeds up the computation. The shortcut connection used in the module ensures that the gradient is not degraded during propagation through the deep convolutional neural network, and the inverted residual structure has been shown to improve memory utilization efficiency. Before the 3 × 3 separable convolution, a 1 × 1 convolution is used to increase the dimension and improve the feature extraction of the image. Meanwhile, a channel attention structure is added after the 3 × 3 convolution: global pooling is performed first, and a small neural network is trained to produce a weight for each channel so that more important feature information is extracted. Finally, a large residual edge is added after the 1 × 1 dimensionality-reducing convolution to avoid vanishing gradients in the network. The SwishBlock module structure is shown in Figure 3.

**Figure 3.** SwishBlock module structure.
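A minimal PyTorch sketch of an inverted residual bottleneck of this shape is given below. The expansion factor of 6 and the SiLU (Swish) activation are assumptions for illustration, not values taken from the paper, and the channel-attention branch mirrors the SEBlock sketch above in convolutional form.

```python
import torch
import torch.nn as nn

class SwishBlock(nn.Module):
    """Sketch of an inverted residual bottleneck: 1x1 expansion, 3x3 depthwise
    convolution, SE channel attention, linear 1x1 projection, and a shortcut."""
    def __init__(self, in_ch: int, out_ch: int, expand: int = 6, stride: int = 1):
        super().__init__()
        mid = in_ch * expand                               # more channels in the middle
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.expand = nn.Sequential(                       # 1x1 conv raises dimension
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU())
        self.depthwise = nn.Sequential(                    # 3x3 depthwise convolution
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU())
        self.se = nn.Sequential(                           # channel attention branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, mid // 4, 1), nn.SiLU(),
            nn.Conv2d(mid // 4, mid, 1), nn.Sigmoid())
        self.project = nn.Sequential(                      # 1x1 conv reduces dimension
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.depthwise(self.expand(x))
        y = y * self.se(y)                                 # reweight channels
        y = self.project(y)
        return x + y if self.use_shortcut else y           # large residual edge
```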

In this study, retaining the advantages of the basic SPP and PANet architectures in the YOLOv4 model, the SwishBlock bottleneck replaces the ResBlock\_body structure in the main YOLOv4 framework, and the neural network model is built from SwishBlock bottleneck structures. Meanwhile, the number of modules in the original YOLOv4 network is changed and the number of layers in the whole network is reduced, so that the YOLOv4 network model is compressed. The compressed network architecture of the YOLOv4 network model is shown in Figure 4.

The improved YOLOv4 prediction network still predicts on three feature maps of different sizes to generate the location and category of each detection box. Considering the imbalance between crack and background pixels in the image, the input image size of the model is 256 × 256, and the dimensions of the last three layers are 8 × 8, 16 × 16, and 32 × 32. Among them, the 8 × 8 feature map, constructed by combining the twice-downsampled shallow network features with the deep network, is mainly used to detect large objects; the once-upsampled 16 × 16 feature map, spliced with the middle-layer feature map of the backbone network, is mainly used to detect medium objects; and the twice-upsampled 32 × 32 feature map, spliced with the middle-layer feature map of the backbone network, is mainly used to detect small objects. This feature fusion and multiscale detection method makes full use of the semantic information extracted at different scales and layers, enhances the feature expression ability of the network, and improves the accuracy of object detection, as sketched after Figure 4.

**Figure 4.** Compressed network architecture of the YOLOv4 network model.
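The following sketch makes the scale relationships concrete: it builds three hypothetical feature maps at the spatial sizes implied by a 256 × 256 input (256/32 = 8, 256/16 = 16, 256/8 = 32) and illustrates the upsample-and-splice fusion described above. The channel counts are placeholders, not the model's actual widths.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature maps for a 256 x 256 input; only the spatial sizes
# (8, 16, 32) come from the text, the channel counts are illustrative.
p_shallow = torch.randn(1, 128, 32, 32)   # stride 8  -> detects small objects
p_middle  = torch.randn(1, 256, 16, 16)   # stride 16 -> detects medium objects
p_deep    = torch.randn(1, 512,  8,  8)   # stride 32 -> detects large objects

# Top-down fusion: upsample the deeper map once, then twice, splicing
# (concatenating) it with the backbone feature map at each finer scale.
p_mid_fused = torch.cat([p_middle, F.interpolate(p_deep, scale_factor=2.0)], dim=1)
p_shl_fused = torch.cat([p_shallow, F.interpolate(p_mid_fused, scale_factor=2.0)], dim=1)

print(p_mid_fused.shape)   # torch.Size([1, 768, 16, 16])
print(p_shl_fused.shape)   # torch.Size([1, 896, 32, 32])
```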

### *2.3. Improvements to SPP Structure and PANet Module*

The SPP structure and the PANet module in the YOLOv4 network model serve to enhance the network's feature extraction. Analysis of the SPP and PANet structures shows that they contain a large number of 3 × 3 convolutional layers and blocks of five consecutive convolutions, which greatly increase the computational cost of the model.

Figure 5a shows the process of extracting image detail features with a conventional convolution, where M is the number of channels of the input feature map, N is the number of channels produced by the convolution kernels, and the convolution kernel size is 3 × 3. The dimension of the feature layer extracted by the conventional convolution is therefore N, and the number of parameters of the conventional convolution is *M* × *N* × 3 × 3. The depthwise separable convolution is composed of a depthwise convolution and a 1 × 1 pointwise convolution, as shown in Figure 5b,c. The depthwise convolution adopts a channel-by-channel strategy, and the extracted feature map is then processed by a point-by-point convolution to obtain a feature map of dimension N. With an input of dimension M and a separable convolution kernel of size 3 × 3, the number of parameters of the separable convolution operation is *M* × 3 × 3 + 1 × 1 × *M* × *N*. Compared with the conventional convolution operation, using the separable convolution to extract the texture features of the image reduces the number of parameters, as shown in Equation (1).

$$\frac{M \times 3 \times 3 + 1 \times 1 \times M \times N}{M \times N \times 3 \times 3} = \frac{1}{N} + \frac{1}{9} \tag{1}$$

**Figure 5.** Depthwise separable convolution versus ordinary convolution; (**a**) general convolution filtering; (**b**) depthwise separable convolution; (**c**) point convolution operation.
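Equation (1) can be checked numerically. The sketch below counts the parameters of both operations with PyTorch for hypothetical channel sizes M = 32 and N = 64, chosen only to illustrate the ratio.

```python
import torch.nn as nn

M, N = 32, 64   # hypothetical channel counts, chosen only to illustrate Equation (1)

conventional = nn.Conv2d(M, N, kernel_size=3, padding=1, bias=False)
depthwise    = nn.Conv2d(M, M, kernel_size=3, padding=1, groups=M, bias=False)
pointwise    = nn.Conv2d(M, N, kernel_size=1, bias=False)

def count(*mods):
    return sum(p.numel() for m in mods for p in m.parameters())

print(count(conventional))          # M*N*3*3         = 18432
print(count(depthwise, pointwise))  # M*3*3 + 1*1*M*N = 2336
print(count(depthwise, pointwise) / count(conventional))
# 0.1267... = 1/N + 1/9 = 1/64 + 1/9, matching Equation (1)
```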

The PANet module and the SPP structure contain blocks of five consecutive convolutions and three consecutive convolutions to enhance image feature extraction. Equation (1) shows that the separable convolution operation greatly reduces the number of parameters compared with the conventional convolution operation. To further compress the network model, spatial convolution is realized with separable convolutions that keep the channels separate, and the ordinary convolutions in the SPP structure and PANet module are replaced accordingly to reduce the number of model parameters and the memory dependence, as shown in Figure 6.


**Figure 6.** Depthwise separable convolution replaces ordinary convolution.
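A minimal sketch of such a drop-in replacement is given below; the batch normalization and LeakyReLU after the pointwise convolution are assumptions following common YOLOv4 convention, not details confirmed by the text.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Sketch of a drop-in replacement for an ordinary 3x3 convolution in the
    SPP/PANet neck: a channel-by-channel 3x3 depthwise convolution followed by
    a 1x1 point-by-point convolution."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1,
                                   groups=in_ch, bias=False)   # channel-by-channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # point-by-point
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)   # assumed activation; not given in the text

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```

Swapping this module in wherever the neck uses a 3 × 3 convolution preserves the input and output shapes while cutting the parameter count by the factor derived in Equation (1).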
