*2.3. Design of Apple Grading Method Based on Improved YOLOv5*

YOLOv5 is an algorithm proposed by Glenn Jocher with high real-time performance in terms of algorithmic efficiency [15,16]. The YOLOv5 network has four main components: the input side, the backbone network (Backbone), the Neck, and the Output. The YOLO family of algorithms achieves promising results on open-source datasets, but there is no comprehensive and mature method for grading fruits in different states [17]. Therefore, this paper proposes an improved YOLOv5 model structure for apple grading based on the lightweight network YOLOv5s, shown in Figure 8. The Mish activation function is used instead of Leaky ReLU, and the distance intersection over union (DIoU\_Loss) loss function is used at the output of the model. Finally, a simple and efficient channel attention module, Squeeze-and-Excitation (SE), is introduced, which allows the model to focus on fine-grained apple features without increasing the computational effort of the model.

**Figure 8.** Diagram of the improved network structure of YOLOv5.

#### 2.3.1. Improvement of the Activation Function

The role of the activation function in a convolutional neural network structure is to combine features thoroughly. The activation functions commonly used in YOLOv5 networks are Leaky ReLU, Sigmoid, etc. Leaky ReLU (see Equation (1)) can handle the vanishing-gradient problem but suffers from neuron necrosis due to data sparsity, while Sigmoid (see Equation (2)) maps real numbers to a specified interval and its curve is smooth and easy to differentiate, but it suffers from vanishing gradients. The Mish activation function (see Equation (3)) has outperformed Leaky ReLU and other standard activation functions in many deep-learning models [18,19]. The model in this paper is deeper and the apple features are more abstract, so this study uses the Mish activation function in the backbone of the YOLOv5 model to achieve better feature extraction. The CBM module in the backbone network consists of a convolutional layer, a normalization layer, and the Mish activation function. The rest of the model still uses the Leaky ReLU activation function.

$$f\_1(x) = \begin{cases} x & x > 0 \\ ax & \text{otherwise} \end{cases} \tag{1}$$

$$f\_2(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

$$f\_3(x) = x \tanh\left(\ln\left(1 + e^{x}\right)\right) \tag{3}$$

As can be seen from Figure 9, the Mish activation function can output arbitrarily large positive values while allowing slight negative gradient values, which avoids gradient saturation due to the gradient being close to zero. The Mish function is non-monotonic and continuously differentiable, which allows the deep neural network to achieve better accuracy and generalization, and facilitates the optimization of gradient updates [20,21].

**Figure 9.** Comparison of Mish, Leaky ReLU, and Sigmoid function curves.
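The three activation functions compared above can be sketched as scalar Python functions; this is an illustrative sketch, not the paper's implementation, and the slope parameter `a` for Leaky ReLU is an assumed default:

```python
import math

def leaky_relu(x, a=0.01):
    # Equation (1): identity for positive inputs, small slope a otherwise
    return x if x > 0 else a * x

def sigmoid(x):
    # Equation (2): maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def mish(x):
    # Equation (3): x * tanh(softplus(x)); smooth, non-monotonic,
    # unbounded above and slightly negative for small negative inputs
    return x * math.tanh(math.log1p(math.exp(x)))

print(mish(0.0))        # passes through the origin
print(mish(-1.0) < 0)   # retains a small negative response
```

Note that `mish` tracks the identity for large positive inputs (avoiding saturation) while still letting slight negative values through, which matches the behavior described for Figure 9.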

#### 2.3.2. Improvement of the Loss Function

Deep learning networks adjust the weights between the layers of the network during training through optimization algorithms, reducing the loss so that the predicted frames and the actual frames overlap as much as possible. The loss function is the key to adjusting the weights [22,23]. GIoU has scale invariance: when the target is enlarged or reduced, the loss value remains of the same magnitude, and it considers both the overlapping and non-overlapping parts between the detection frame and the target frame. When IoU = 0, the distance between the bounding boxes does not affect the IoU loss value; GIoU overcomes this shortcoming and produces a loss that reflects the distance between the two bounding boxes. The GIoU expressions are as follows:

$$\begin{cases} GIoU = IoU - \dfrac{|C| - |A \cup B|}{|C|} \\ GIoU = -1 + \dfrac{|A \cup B|}{|C|}, & IoU = 0 \end{cases} \tag{4}$$

As shown in Equation (4), when predicted frame A and actual frame B intersect, convergence is slow in the horizontal and vertical directions. When one frame contains the other (so that the smallest enclosing frame C containing A and B coincides with the larger frame), GIoU degrades to IoU and no longer helps. In this paper, the apples on the flip-turnover detection conveyor are relatively dense, and the apples rotate in all directions with the sponge rollers, which makes it difficult for the prediction frame to accurately distinguish the actual region from the background region during grading. Therefore, DIoU\_Loss is chosen as the boundary loss function in the output layer instead of GIoU\_Loss to improve grading accuracy and detection speed.
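Equation (4) can be made concrete with a small sketch (illustrative only; boxes are assumed to be `(x1, y1, x2, y2)` tuples):

```python
def iou_and_giou(box_a, box_b):
    """Return (IoU, GIoU) for two axis-aligned boxes, per Equation (4)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area A ∩ B
    inter = (max(0.0, min(ax2, bx2) - max(ax1, bx1))
             * max(0.0, min(ay2, by2) - max(ay1, by1)))
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter          # |A ∪ B|
    iou = inter / union
    # Smallest enclosing box C of A and B
    area_c = ((max(ax2, bx2) - min(ax1, bx1))
              * (max(ay2, by2) - min(ay1, by1)))
    giou = iou - (area_c - union) / area_c   # Equation (4)
    return iou, giou
```

For two disjoint unit boxes separated by a gap, IoU is 0 but GIoU is negative (reflecting the distance, the second case of Equation (4)); when one box contains the other, C equals the larger box and GIoU collapses to IoU, illustrating the degradation noted above.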

DIoU inherits the advantages of GIoU and adds geometric information on the distance between centroids [24,25]. As shown in Figure 10, DIoU takes into account both the overlapping area and the distance between the two centroids, so it can provide an accurate gradient direction for the model whether the prediction frame and the actual frame intersect or overlap. The added distance penalty makes DIoU converge faster than GIoU. The equation is shown in Equation (5).

$$L\_{DIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} \tag{5}$$

**Figure 10.** DIoU schematic.

In the above equation, *b* and *b<sup>gt</sup>* represent the centroids of the prediction and target boxes, ρ(·) represents the Euclidean distance, and *c* is the diagonal length of the minimum enclosing box covering the target and prediction boxes.
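Equation (5) can be sketched directly from these definitions (an illustrative sketch, not the paper's code; boxes are assumed to be `(x1, y1, x2, y2)` tuples):

```python
def diou_loss(box_p, box_t):
    """Return L_DIoU for prediction and target boxes, per Equation (5)."""
    px1, py1, px2, py2 = box_p
    tx1, ty1, tx2, ty2 = box_t
    # IoU term
    inter = (max(0.0, min(px2, tx2) - max(px1, tx1))
             * max(0.0, min(py2, ty2) - max(py1, ty1)))
    union = ((px2 - px1) * (py2 - py1)
             + (tx2 - tx1) * (ty2 - ty1) - inter)
    iou = inter / union
    # rho^2(b, b^gt): squared distance between the box centroids
    rho2 = (((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2
            + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2)
    # c^2: squared diagonal of the minimum enclosing box
    c2 = ((max(px2, tx2) - min(px1, tx1)) ** 2
          + (max(py2, ty2) - min(py1, ty1)) ** 2)
    return 1.0 - iou + rho2 / c2
```

The centroid-distance penalty `rho2 / c2` is what remains non-zero when the boxes fail to overlap, which is why DIoU still provides a useful gradient in the dense, rotating-apple scenario described above.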

#### 2.3.3. Integration of Attentional Mechanisms

Attention is one of the most critical mechanisms in human perception. The human eye is adept at recognizing key image features in complex scenes while ignoring irrelevant information, which is where the attention mechanism excels. With the booming development of deep learning, the attention mechanism can be applied to machine vision. Apples have characteristics such as many features and small sizes, which can easily lead to wrong and missed detections, lowering the grading accuracy [26]. Introducing an attention mechanism into the convolutional layers enhances the learned representation autonomously, and the method is practical and effective [27,28]. The Backbone module in YOLOv5 adds the Focus structure, which improves computational speed by slicing the feature map but may affect the features. To improve the target feature extraction of the Backbone module, this paper introduces the channel attention mechanism Squeeze-and-Excitation (SE) [29], embedded into the last layer of the Backbone module to improve the accuracy of apple grading without increasing the model size.

The SE module can effectively capture the channel and position information of the image, which in turn can improve the grading accuracy of the model. Figure 11 shows the working principle of the SE module, which consists of two main parts, Squeeze and Excitation. The SE module first obtains a global description of the input through Squeeze, which enables a wider perceptual field of view, and then obtains the weights of each channel in the Feature Map through Excitation's two-layer fully connected bottleneck structure as input to the lower layer network.

**Figure 11.** Squeeze and excitation.

In Figure 11, the Squeeze operation first encodes the entire spatial feature of each channel as a global feature by global average pooling. The channels are then connected through two fully connected layers and a non-linear activation function (see Equation (6)), followed by a Sigmoid activation function to obtain the weight of each channel; finally, each channel is multiplied by its weight to complete the recalibration of the attention mechanism. The calculation results are shown in Equations (7) and (8). The correlation between channels is thus established through global average pooling, two fully connected layers, and a non-linear activation function.

$$z\_c = \frac{1}{H \times W} \sum\_{i=1}^{H} \sum\_{j=1}^{W} u\_c(i, j) \tag{6}$$

where *z<sub>c</sub>* represents the *c*-th element of the statistic, *H* and *W* are the spatial dimensions, and the subscripts *i*, *j* index the spatial positions. After the Squeeze operation has obtained the channel information, two fully connected layers form a gating mechanism, activated with Sigmoid. The calculation is as follows:

$$s = F\_{ex}(z, W) = \sigma(g(z, W)) = \sigma\left(W\_2\,\delta\left(W\_1 z\right)\right) \tag{7}$$

where δ is the ReLU activation function, σ is the Sigmoid function, *W<sub>1</sub>* and *W<sub>2</sub>* are the weights of the two fully connected layers, of sizes (C/r) × C and C × (C/r), respectively, *r* is the scaling parameter that limits the complexity of the model and aids its generalization, and *s* represents the set of feature-map weights obtained through the fully connected and non-linear layers. Finally, the output weights are assigned to the original features. The calculation formula is as follows.

$$\widetilde{x}\_c = s\_c \times u\_c \tag{8}$$

where $\widetilde{x}\_c$ is the *c*-th channel of the recalibrated feature map, *s<sub>c</sub>* is its weight, and *u<sub>c</sub>* is the corresponding two-dimensional channel matrix of the input.
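Equations (6)-(8) can be traced end to end with a minimal plain-Python sketch of the SE recalibration (illustrative only; the feature map is a C × H × W nested list, and the weights `w1`, `w2` are assumed rather than learned):

```python
import math

def se_recalibrate(u, w1, w2):
    """Squeeze-and-Excitation over a C x H x W feature map u.
    w1 has shape (C/r) x C, w2 has shape C x (C/r)."""
    # Squeeze: global average pooling per channel (Equation (6))
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]
    # Excitation: FC -> ReLU -> FC -> Sigmoid (Equation (7))
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale: channel-wise reweighting x~_c = s_c * u_c (Equation (8))
    return [[[s[c] * v for v in row] for row in ch]
            for c, ch in enumerate(u)]

# Toy case: C = 2, H = W = 2, reduction r = 2 (hidden size 1)
u = [[[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[0.5, 0.5]]          # (C/r) x C
w2 = [[1.0], [1.0]]        # C x (C/r)
out = se_recalibrate(u, w1, w2)
```

Each output channel is the input channel scaled by a weight in (0, 1), so the spatial layout is unchanged while informative channels are emphasized, which is the "no extra model size" behavior claimed for the SE insertion into the Backbone.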
