#### *3.4. Yolov5 Method*

Object detection is a computer vision technique for locating instances of a certain class of objects in an image. Recent object detection methods can be categorized into two main types: one-stage and two-stage. One-stage methods prioritize inference speed and include a series of YOLO detection methods [15,35,48–50], SSD [36,51], and RetinaNet [52]. Typical two-stage methods prioritize detection accuracy and include R-CNN [53], Fast R-CNN [54], Mask R-CNN [55], Cascade R-CNN [56], and others.

Yolov5 is the latest generation of one-stage object detection network models in the YOLO series, released by Ultralytics in May 2020 [57]. According to network depth and feature-map width, Yolov5 is available in four variants: Yolov5s, Yolov5m, Yolov5l, and Yolov5x [23]. Compared with two-stage detection network models, Yolov5 greatly improves running speed while maintaining detection accuracy. It therefore meets the needs of real-time detection and has the advantage of a small model size. The Yolov5 network model builds on Yolov3 with improvements such as multi-scale prediction, which allows it to detect objects of different sizes simultaneously [20]. Therefore, we propose a lightweight mummy berry disease detection network model based on Yolov5s, in which the network backbone is improved with an attention mechanism. The architecture of the improved Yolov5s-CA network model is shown in Figure 3.

#### *3.5. Improvement of Yolov5s-CA Network Model*

Figure 3 shows the structure of the improved Yolov5s-CA network model for detecting mummy berry disease. A lightweight CA module [28] is introduced into the backbone of Yolov5s to strengthen the feature representation ability of the network and to select useful information, which enhances detection performance. The network structure of Yolov5s-CA consists of four parts: input, backbone, neck, and head.

The backbone of the Yolov5s-CA network model contains Conv, C3, CA, and Spatial Pyramid Pooling Fast (SPPF) modules. Conv is the basic convolution unit, which applies two-dimensional convolution, batch normalization, and an activation function to the input. The C3 module appears in both the backbone and the neck; the version used in the backbone has a shortcut structure. It processes the input in two branches. One branch passes through a Conv module and then through multiple residual structures, which avoids degradation problems in deep networks. The other branch passes through a single Conv module, after which the outputs of the two branches are concatenated and passed through a final Conv module. As shown in Figure 4, the CA modules are integrated into the backbone after the C3 modules to highlight and select the most important disease-related visual features and to improve the representation ability of the object detection model for detecting mummy berry disease in a field environment. The last layer of the backbone, SPPF, shown in Figure 5, comprises three MaxPool layers with 5 × 5 kernels connected in series. The input passes through the MaxPool layers in turn, and the outputs are concatenated before a Conv operation is applied. The SPPF structure achieves feature extraction results similar to those of SPP but runs faster. With the help of the MaxPool layers and skip connections, the network can learn features at multiple scales and increase the representativeness of the feature map by combining global and local features.
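The relationship between serial and parallel pooling described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it omits the Conv layers and the channel dimension, and it assumes stride-1 pooling with padding of half the kernel size and SPP kernel sizes of 5, 9, and 13 (a common Yolov5 configuration, not stated in this section). The function names `max_pool2d`, `sppf_branches`, and `spp_branches` are hypothetical.

```python
NEG_INF = float("-inf")

def max_pool2d(fmap, k):
    """Stride-1 max pooling with k // 2 padding; cells outside the map are ignored."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            best = NEG_INF
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        best = max(best, fmap[y][x])
            row.append(best)
        out.append(row)
    return out

def sppf_branches(fmap, k=5):
    """Serial pooling (SPPF style): each pool reuses the previous pool's output."""
    p1 = max_pool2d(fmap, k)
    p2 = max_pool2d(p1, k)
    p3 = max_pool2d(p2, k)
    return [fmap, p1, p2, p3]  # these maps would be concatenated along channels

def spp_branches(fmap, kernels=(5, 9, 13)):
    """Parallel pooling (SPP style): each pool works on the original input."""
    return [fmap] + [max_pool2d(fmap, k) for k in kernels]

# Deterministic toy feature map (single channel, 10 x 12).
fmap = [[(7 * i + 13 * j) % 17 for j in range(12)] for i in range(10)]

# Under the assumptions above, both variants produce identical branch outputs.
assert sppf_branches(fmap) == spp_branches(fmap)
```

Two chained 5 × 5 pools cover the same window as one 9 × 9 pool, and three cover a 13 × 13 window, so the serial form reuses intermediate results instead of scanning large windows from scratch. This is one plausible reading of why SPPF is faster.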

**Figure 4.** The improved Yolov5s-CA network model structure.

The neck module is a feature aggregation layer between the head and the backbone. It collects as much information as possible from the backbone before feeding it to the head. It consists of two parts: the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN). The FPN structure transmits semantically robust features top-down, while the PAN transmits information in a bottom-up pyramid to strengthen the feature representation capabilities of the network model. In addition, C3 modules were added to enhance the network's feature extraction capability, and the C3 at the neck replaces the residual structure with multiple Conv modules.

**Figure 5.** The SPPF module.

The head outputs a vector containing the object category probabilities, the object score, and the position of the bounding box. The loss function of Yolov5s consists of three parts: the confidence loss, the classification loss, and the position loss between the target box and the prediction box. The original Yolov5s uses the GIoU loss as the bounding box regression loss function to evaluate the distance between the predicted box and the ground truth box, as expressed in Equations (5)–(7).

$$IoU = \frac{|A \cap B|}{|A \cup B|} \tag{5}$$

$$GIoU = IoU - \frac{A^c - u}{A^c} \tag{6}$$

$$L_{GIoU} = 1 - GIoU \tag{7}$$

where *A* is the predicted box, *B* is the ground truth box, *IoU* is the ratio of the area of their intersection to the area of their union, *A<sup>c</sup>* is the area of the smallest rectangle enclosing both the predicted box and the ground truth box, *u* is the area of the union of the two boxes, and *L<sub>GIoU</sub>* is the GIoU loss.
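As a worked illustration of Equations (5)–(7), the following minimal sketch computes IoU, GIoU, and the GIoU loss for axis-aligned boxes. It takes *A<sup>c</sup>* as the area of the smallest enclosing rectangle and *u* as the area of the union. The `(x1, y1, x2, y2)` corner format and the function names are assumptions made for the example, and both boxes are assumed to have positive area.

```python
def box_area(b):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two boxes with positive area."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter                    # u
    # Smallest axis-aligned rectangle enclosing both boxes.
    enclose = ((max(a[2], b[2]) - min(a[0], b[0]))
               * (max(a[3], b[3]) - min(a[1], b[1])))            # A^c
    iou = inter / union                                          # Eq. (5)
    giou = iou - (enclose - union) / enclose                     # Eq. (6)
    return iou, giou

def giou_loss(a, b):
    """GIoU loss, Eq. (7)."""
    return 1.0 - iou_and_giou(a, b)[1]

# Two non-overlapping unit boxes: IoU is 0 in both cases, but the GIoU loss
# still grows as the boxes move further apart.
near = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))
far = giou_loss((0, 0, 1, 1), (4, 0, 5, 1))
assert near < far
```

The last lines show the property that motivates GIoU: for disjoint boxes the IoU is constant at zero, whereas the GIoU loss keeps changing with the distance between the boxes.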

Compared with the IoU loss function, the GIoU loss function still provides a useful training signal when the bounding boxes do not overlap. However, when the prediction box lies entirely inside the ground truth box, GIoU reduces to IoU and cannot distinguish between prediction boxes of the same size at different positions. The CIoU loss, in contrast, also takes the aspect ratio of the bounding box into account and measures the regression from three viewpoints: (1) overlapping area, (2) center point distance, and (3) aspect ratio, which makes the prediction box regression more efficient. Therefore, in this study, we use the CIoU loss as the regression loss function, as given in Equations (8)–(10).

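$$L_{\rm loc} = 1 - IoU\left(B, B_{gt}\right) + \frac{d^2}{c^2} + \alpha v \tag{8}$$

$$\alpha = \frac{v}{1 - IoU + v} \tag{9}$$

$$v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^2 \tag{10}$$

where *B* and *B<sub>gt</sub>* are the prediction box and the ground truth box, *d* is the Euclidean distance between their center points, *c* is the diagonal length of the smallest rectangle enclosing both boxes, *α* is a trade-off weight, *v* measures the consistency of the aspect ratios, *w* and *h* are the width and height of the prediction box, and *w<sup>gt</sup>* and *h<sup>gt</sup>* are the width and height of the ground truth box, respectively.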
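The CIoU loss can be sketched in the same way. The example below is a minimal illustration under stated assumptions, not the training code used in this study: boxes are `(x1, y1, x2, y2)` tuples with positive width and height, the center distance and enclosing-box diagonal follow the usual CIoU definitions, and the trade-off weight is set to zero when its denominator vanishes (identical boxes). The function name `ciou_loss` is hypothetical.

```python
import math

def ciou_loss(box, gt):
    """CIoU loss for a predicted box and a ground truth box, (x1, y1, x2, y2)."""
    w, h = box[2] - box[0], box[3] - box[1]
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]

    # Overlap term: IoU.
    inter_w = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    inter_h = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = inter_w * inter_h
    iou = inter / (w * h + w_gt * h_gt - inter)

    # Distance term: squared center distance over squared enclosing diagonal.
    d2 = (((box[0] + box[2]) - (gt[0] + gt[2])) / 2) ** 2 \
        + (((box[1] + box[3]) - (gt[1] + gt[3])) / 2) ** 2
    c2 = (max(box[2], gt[2]) - min(box[0], gt[0])) ** 2 \
        + (max(box[3], gt[3]) - min(box[1], gt[1])) ** 2

    # Aspect-ratio term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    denom = 1 - iou + v
    alpha = v / denom if denom > 0 else 0.0  # assumption: 0 for identical boxes

    return 1 - iou + d2 / c2 + alpha * v

# Two predictions of equal size, both fully inside the ground truth box.
gt = (0, 0, 4, 4)
centered = (1, 1, 3, 3)
corner = (0, 0, 2, 2)

# Both have IoU = 0.25, so IoU and GIoU cannot separate them,
# but the center-distance term gives the off-center box a larger loss.
assert ciou_loss(centered, gt) < ciou_loss(corner, gt)
```

In this example the centered prediction has a loss of 0.75 and the corner prediction a loss of 0.8125, so the regression is pushed toward the ground truth center even when the overlap does not change.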
