#### 2.1.1. YOLOv5 Backbone

The YOLOv5 backbone employs CSPDarknet, which is built from cross-stage partial networks, for feature extraction from images. The Focus module rapidly downsamples the input image by rearranging spatial information into the channel dimension, so that no image information is lost and the image features can be extracted more completely. The backbone uses C3, C3\_F and Spatial Pyramid Pooling (SPP) modules. The C3 and C3\_F modules improve the ability to extract features from images, simplify the YOLOv5 model and make detection faster [30]. The SPP module improves scale invariance over the dataset images, effectively enlarges the receptive range of the backbone features, makes the network easier to converge and enhances accuracy [31].
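The lossless downsampling performed by the Focus module can be illustrated with a minimal sketch. The slicing pattern below (every 2×2 pixel neighbourhood moved into the channel dimension) follows the published Focus design; the function name and the NumPy implementation are illustrative, not the actual YOLOv5 code.

```python
import numpy as np

def focus_slice(x: np.ndarray) -> np.ndarray:
    """Space-to-depth slicing as in the Focus module: each 2x2 pixel
    neighbourhood is moved into the channel dimension, halving H and W
    while quadrupling C. No pixel values are discarded, so the
    downsampling loses no image information."""
    return np.concatenate([
        x[:, 0::2, 0::2],  # top-left pixel of each 2x2 block
        x[:, 1::2, 0::2],  # bottom-left pixel
        x[:, 0::2, 1::2],  # top-right pixel
        x[:, 1::2, 1::2],  # bottom-right pixel
    ], axis=0)

img = np.arange(3 * 4 * 4).reshape(3, 4, 4)  # toy (C, H, W) image
out = focus_slice(img)
print(out.shape)             # (12, 2, 2)
print(out.size == img.size)  # True: every pixel value is preserved
```

In the real network, a convolution follows this slicing step; the sketch only shows why the spatial-to-channel rearrangement itself is information-preserving.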

#### 2.1.2. YOLOv5 Neck

The YOLOv5 neck uses PANet to generate a feature pyramid that aggregates the features and passes them to the head for prediction. The neck of YOLOv5 combines feature pyramid network (FPN) and path aggregation network (PAN) structures. Deep feature maps carry stronger semantic information but weaker location information, while shallow feature maps carry stronger location information but weaker semantic information. FPN transfers semantic information from deep feature maps to shallow ones [32]; conversely, PAN transfers location information from shallow feature layers to deep ones [33]. Combining FPN and PAN aggregates the parameters of different detection layers from different backbone layers, which greatly strengthens the feature fusion ability of the network [24].
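The two-directional information flow described above can be sketched as follows. This is a simplified illustration, not the actual YOLOv5 neck: fusion is shown as elementwise addition on same-channel feature maps (the real network uses convolutions and channel concatenation), and the level names `p3`–`p5` are assumptions for the example.

```python
import numpy as np

def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling (deep -> shallow resolution)."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def downsample2x(x: np.ndarray) -> np.ndarray:
    """2x2 stride-2 max pooling (shallow -> deep resolution)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

# Three backbone levels, shallow (high-res) to deep (low-res).
p3 = np.random.rand(8, 32, 32)  # strong location, weak semantics
p4 = np.random.rand(8, 16, 16)
p5 = np.random.rand(8, 8, 8)    # strong semantics, weak location

# FPN top-down pass: semantic information flows deep -> shallow.
f4 = p4 + upsample2x(p5)
f3 = p3 + upsample2x(f4)

# PAN bottom-up pass: location information flows shallow -> deep.
n3 = f3
n4 = f4 + downsample2x(n3)
n5 = p5 + downsample2x(n4)

# Each output level now mixes semantic and location information.
print(n3.shape, n4.shape, n5.shape)  # (8, 32, 32) (8, 16, 16) (8, 8, 8)
```

The three outputs `n3`, `n4`, `n5` correspond to the fused detection layers handed to the head, each combining information from every backbone depth.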
