2.4.1. YOLOv7 Algorithm

Figure 7 shows the structure of the YOLOv7 network. It is mainly composed of four parts: the input, the backbone feature extraction network, the enhanced feature extraction network and the head. The input module scales the input image to a uniform pixel size to reduce the amount of computation [28]. The backbone is composed of CBS, ELAN and MP modules. The CBS module consists of a convolutional layer (Conv), a batch normalization layer (BN) and a SiLU activation function, and it is used to extract multi-scale information from images. The ELAN module is composed of multi-branch convolutions; it improves the learning ability of the network without destroying the original gradient path. The MP module combines information from two branches, a MaxPool branch and a convolution branch, improving the feature extraction ability of the network.
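The input scaling step can be illustrated with a short sketch. The text above only states that images are scaled to a uniform pixel size; the 640 × 640 target and the aspect-ratio-preserving resize with padding used here are assumptions made for illustration, and `letterbox_shape` is a hypothetical helper name.

```python
def letterbox_shape(height, width, target=640):
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving resize.

    Assumption: the network expects a square target x target input and the
    image is padded, not stretched, to reach that size.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale so that the longer side fits exactly into the target square.
    scale = target / max(height, width)
    new_h = round(height * scale)
    new_w = round(width * scale)
    # The remaining pixels on the shorter side are filled with padding.
    pad_h = target - new_h
    pad_w = target - new_w
    return new_h, new_w, pad_h, pad_w


if __name__ == "__main__":
    # A 1080 x 1920 frame is reduced to 360 x 640 and padded vertically.
    print(letterbox_shape(1080, 1920))
```

Under these assumptions every image reaches the network with the same tensor shape, so the computational cost per image is fixed regardless of the original resolution.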

**Figure 7.** YOLOv7 network structure diagram.

The enhanced feature extraction module uses a path aggregation feature pyramid network (PAFPN) structure. By introducing a bottom-up path, it makes it easier for information from the bottom layers to be transferred to the top layers, thus achieving efficient fusion of features at different levels and improving the accuracy of localization information. In the SPPCSPC structure, the SPP module, composed of CBS and max pooling at 4 different kernel sizes, can better distinguish targets of different sizes through the different pooling sizes. The CSP module splits the features into two parts: one part undergoes conventional processing and the other is processed by the SPP module. Finally, the two parts are merged, which can reduce the amount of computation by half, making processing faster while obtaining higher precision.
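The multi-scale pooling idea behind the SPP module can be sketched in plain Python on a single-channel feature map. The kernel sizes (1, 5, 9, 13), the stride of 1 and the size-preserving border handling are assumptions for illustration, since the text only specifies four different pooling sizes; `max_pool_same` and `spp` are hypothetical helper names.

```python
def max_pool_same(feature, k):
    """Stride-1 max pooling over a 2D list, keeping the spatial size.

    The window is clipped at the borders, which is equivalent to padding
    with negative infinity. An odd kernel size k is assumed.
    """
    h, w = len(feature), len(feature[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [
                feature[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            ]
            row.append(max(window))
        out.append(row)
    return out


def spp(feature, kernel_sizes=(1, 5, 9, 13)):
    """Pool one channel at several scales and stack the results as channels."""
    return [max_pool_same(feature, k) for k in kernel_sizes]


if __name__ == "__main__":
    # A 7 x 7 map with a single activation in the centre.
    fmap = [[0] * 7 for _ in range(7)]
    fmap[3][3] = 1
    for k, channel in zip((1, 5, 9, 13), spp(fmap)):
        print(k, sum(map(sum, channel)))
```

Because every pooled map keeps the input's spatial size, the outputs can be concatenated along the channel dimension, and larger kernels spread a response over a wider receptive field, which is how the module responds to targets of different sizes.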

The head module is composed of the REP structure and the CBM structure. The REP block (RepVGG block) is composed of three parallel branches: two convolutional (Conv) branches, each followed by a batch normalization (BN) layer, and one branch containing only a BN layer. The CBM structure is composed of a convolutional layer (Conv), a batch normalization layer (BN) and a sigmoid activation function. The REP structure adjusts the number of channels of the three feature maps of different scales output by the PAFPN (P3, P4 and P5), which are then passed through the CBM structure to predict the confidence, category and anchor box.
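The shape of the three prediction maps can be worked out with a small sketch. The values used here are assumptions, not taken from the text: a 640 × 640 input, strides of 8, 16 and 32 for P3, P4 and P5, three anchors per grid cell, and five box-related values per anchor (four coordinates plus one confidence score). `head_output_shapes` is a hypothetical helper name.

```python
def head_output_shapes(input_size=640, num_classes=1, num_anchors=3,
                       strides=(8, 16, 32)):
    """Return {level: (channels, grid_h, grid_w)} for P3, P4 and P5.

    Assumption: each anchor predicts 4 box coordinates, 1 confidence
    score and one score per class.
    """
    channels = num_anchors * (5 + num_classes)
    shapes = {}
    for level, stride in zip(("P3", "P4", "P5"), strides):
        grid = input_size // stride  # side length of the prediction grid
        shapes[level] = (channels, grid, grid)
    return shapes


if __name__ == "__main__":
    for level, shape in head_output_shapes(num_classes=80).items():
        print(level, shape)
```

Under these assumptions the finest map (P3) has the most grid cells and is suited to small targets, while the coarsest map (P5) covers large targets with fewer cells.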
