*2.1. The Principles of YOLOv4*

The backbone network of YOLOv4, CSPDarknet53, is the core of the algorithm and is used to extract the target features. CSPNet keeps a network lightweight while maintaining accuracy and reducing computational bottlenecks and memory cost. Drawing on CSPNet, YOLOv4 adds a CSP structure to each large residual block of Darknet53: the feature map of the base layer is divided into two parts, which are then merged through a cross-stage hierarchical structure, reducing the amount of computation while preserving accuracy. The CSPDarknet53 backbone uses the Mish function as its activation function, and the subsequent layers of the network use the Leaky ReLU function. Experiments have shown that this combination of activation functions makes object detection more accurate.

Unlike YOLOv3, which uses FPN for upsampling, YOLOv4 draws on the information flow idea of PANet. The semantic information of the high-level features is first propagated to the lower levels by upsampling and fused with the high-resolution information of the low-level features, which improves the detection of small targets. A bottom-up information path is then added, and the feature pyramid is enhanced through downsampling. Finally, the feature maps of the different layers are fused to make predictions. The overall network framework is shown in Figure 1. ResBlock\_body is the residual block of CSPDarknet53; it extracts the target features of the image while reducing the computational bottleneck and memory cost, as shown in Figure 2.
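For reference, the two activation functions named above can be written out in a few lines. The following is a minimal scalar sketch using only the Python standard library. The function names and the negative slope of 0.1 for Leaky ReLU are assumptions made for illustration (0.1 is the value commonly used in Darknet-style networks) and are not taken from this paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def leaky_relu(x: float, negative_slope: float = 0.1) -> float:
    """Leaky ReLU: identity for positive inputs, a small slope otherwise.

    negative_slope=0.1 is an assumed default, not a value from the paper.
    """
    return x if x > 0 else negative_slope * x


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={value:+.1f}  mish={mish(value):+.4f}  "
              f"leaky_relu={leaky_relu(value):+.4f}")
```

Mish is smooth and lets small negative values pass through, whereas Leaky ReLU is piecewise linear and cheaper to evaluate.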

**Figure 1.** YOLOv4 network architecture [33].
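The top-down and bottom-up fusion described in the text can be illustrated with a toy example. The sketch below is a deliberately simplified stand-in, not the YOLOv4 implementation: feature maps are one-dimensional lists, fusion is element-wise addition, upsampling is nearest-neighbor repetition, and downsampling is max pooling. All function names (`upsample2x`, `downsample2x`, `pan_fuse`) are hypothetical, and practical implementations typically fuse by concatenation followed by convolution and downsample with strided convolutions.

```python
from typing import List, Tuple

Feature = List[float]  # toy 1-D "feature map" at a single scale


def upsample2x(x: Feature) -> Feature:
    """Nearest-neighbor upsampling: repeat every element twice."""
    return [v for v in x for _ in range(2)]


def downsample2x(x: Feature) -> Feature:
    """Max pooling with window 2 and stride 2 (stand-in for a strided conv)."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]


def fuse(a: Feature, b: Feature) -> Feature:
    """Element-wise addition, used here only to keep the toy short."""
    if len(a) != len(b):
        raise ValueError("feature maps must have the same resolution")
    return [p + q for p, q in zip(a, b)]


def pan_fuse(low: Feature, mid: Feature, high: Feature
             ) -> Tuple[Feature, Feature, Feature]:
    """low is the highest-resolution map, high the most semantic one."""
    # Top-down path: push semantic information to the high-resolution levels.
    p_high = high
    p_mid = fuse(mid, upsample2x(p_high))
    p_low = fuse(low, upsample2x(p_mid))
    # Bottom-up path: push localization detail back up the pyramid.
    n_low = p_low
    n_mid = fuse(p_mid, downsample2x(n_low))
    n_high = fuse(p_high, downsample2x(n_mid))
    return n_low, n_mid, n_high


if __name__ == "__main__":
    outputs = pan_fuse([1.0] * 8, [10.0] * 4, [100.0] * 2)
    for name, out in zip(("low", "mid", "high"), outputs):
        print(name, out)
```

Each output level ends up carrying contributions from all three input levels, which is the point of adding the second, bottom-up path.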

**Figure 2.** ResBlock module structure.
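The split-and-merge idea behind the CSP residual block can be sketched as follows. This is a toy illustration under stated assumptions, not the layer layout of Figure 2: a feature map is a list of channels, the split is a plain half-and-half channel slice, and `transform` is a hypothetical placeholder for the convolutional layers inside each residual unit. The transition convolutions of the real block are omitted.

```python
from typing import Callable, List

Channel = List[float]        # one flattened channel (toy stand-in)
FeatureMap = List[Channel]   # a list of channels


def residual_unit(x: FeatureMap,
                  transform: Callable[[FeatureMap], FeatureMap]) -> FeatureMap:
    """Residual connection: y = x + F(x), applied element-wise."""
    fx = transform(x)
    return [[a + b for a, b in zip(cx, cfx)] for cx, cfx in zip(x, fx)]


def csp_block(x: FeatureMap,
              transform: Callable[[FeatureMap], FeatureMap],
              num_units: int) -> FeatureMap:
    """Split channels in two, process one half, then merge both halves."""
    half = len(x) // 2
    shortcut, main = x[:half], x[half:]
    # Only the second half passes through the stacked residual units,
    # which is where the reduction in computation comes from.
    for _ in range(num_units):
        main = residual_unit(main, transform)
    # Cross-stage merge: channel-wise concatenation of the two parts.
    return shortcut + main


if __name__ == "__main__":
    def halve(fm: FeatureMap) -> FeatureMap:
        # Hypothetical stand-in for the convolutions in a residual unit.
        return [[0.5 * v for v in channel] for channel in fm]

    features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(csp_block(features, halve, num_units=1))
```

Because the shortcut half bypasses the residual units entirely, the output keeps the original channel count while only half of the channels incur the cost of the stacked units.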
