**3. Result and Discussion**

#### *3.1. Experimental Validation and Analysis of Results*

#### 3.1.1. Experimental Environment

The experimental models in this paper were constructed, trained, and tested on the Windows 10 x64 operating system. The programming environment was Python 3.7, with cuDNN used for GPU acceleration, and the apple grading model was trained under the PyTorch 1.7 deep learning framework. The experimental environment configuration is shown in Table 2. The number of training iterations was set to 150, the weight decay coefficient to 0.001, the learning rate to 0.917, and the maximum training batch size to eight. An IoU threshold of 0.5 was taken as the standard.

**Table 2.** Experimental environment.


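The hyperparameters listed above can be collected into a single configuration mapping. The dictionary below is an illustrative sketch only; the key names (e.g., `TRAIN_CONFIG`) are ours, not from the paper's code, while the values come from the text:

```python
# Hedged sketch of the training configuration described in Section 3.1.1.
# Key names are illustrative placeholders; values are taken from the text.
TRAIN_CONFIG = {
    "epochs": 150,           # number of training iterations
    "weight_decay": 1e-3,    # weight decay coefficient
    "learning_rate": 0.917,  # learning rate reported in the text
    "batch_size": 8,         # maximum training batch size
    "iou_threshold": 0.5,    # IoU threshold used as the evaluation standard
}
```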
In order to better assess the classification accuracy and reliability of this model, this paper selects the loss function curve (Loss), Precision, Average Precision (AP), Recall, Mean Average Precision (mAP), and frames per second (FPS) as the algorithm performance evaluation indexes [30]. The relevant evaluation indexes are calculated as shown in Equations (9)–(12).

$$Precision = \frac{TP\ (True\ Positive)}{TP + FP\ (False\ Positive)} \tag{9}$$

$$Recall = \frac{TP\ (True\ Positive)}{TP + FN\ (False\ Negative)} \tag{10}$$

$$Average\ Precision = \int_{0}^{1} P(R)\,dR \tag{11}$$

$$Mean\ Average\ Precision = \frac{\sum Average\ Precision}{n(Class)} \tag{12}$$

In the above equations, *TP* represents the number of apple samples correctly identified by the model, *FP* represents the number of apples incorrectly identified by the model, *FN* represents the number of apple samples not identified by the model, and *n* represents the number of categories.
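As a concrete illustration of Equations (9)–(12), the sketch below computes precision, recall, AP (approximating the integral in Equation (11) by trapezoidal summation over sampled P(R) points), and mAP. The function names are ours and do not come from the paper's code:

```python
import numpy as np

def precision(tp, fp):
    # Equation (9): TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):
    # Equation (10): TP / (TP + FN)
    return tp / (tp + fn)

def average_precision(recalls, precisions):
    # Equation (11): integral of P(R) dR, approximated by the
    # trapezoidal rule over sampled (recall, precision) points.
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(aps):
    # Equation (12): mean of the per-class APs over n classes
    return sum(aps) / len(aps)
```

For example, with 8 true positives, 2 false positives, and 2 false negatives, both precision and recall are 0.8; mAP is simply the arithmetic mean of the per-grade AP values.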

#### 3.1.2. Analysis of Experimental Results

#### (1) Experiments related to the improved algorithm

The loss function can visually reflect whether the model converges stably [31]. To compare the effects of the different algorithmic improvements, the following three models were selected for comparison during network training: the YOLOv5 algorithm with the backbone network optimized by the Mish activation function, denoted YOLOv5-M; the YOLOv5 algorithm with the loss function optimized by DIoU, denoted YOLOv5-D; and the algorithm using both the Mish activation function and the DIoU optimization, denoted Im-YOLOv5. The resulting loss curves after training are shown in Figure 12.
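For reference, the two building blocks combined in Im-YOLOv5 can be sketched numerically. The NumPy implementations below follow the standard definitions of Mish (x · tanh(softplus(x))) and DIoU (IoU minus the normalized squared distance between box centers); they are illustrative sketches, not the paper's actual code:

```python
import numpy as np

def mish(x):
    # Mish activation: x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x)
    x = np.asarray(x, dtype=float)
    return x * np.tanh(np.log1p(np.exp(x)))

def diou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2).
    # DIoU = IoU - d^2 / c^2, where d is the distance between box centers
    # and c is the diagonal of the smallest box enclosing both.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Squared distance between box centers
    d2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    # Squared diagonal of the enclosing box
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / c2
```

The DIoU regression loss used during training is then typically 1 − DIoU, which penalizes both poor overlap and large center offsets.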

**Figure 12.** Loss value curve changes with epochs.

As shown in Figure 12, the loss values of the four models follow the same overall trend during training: they decrease rapidly and eventually stabilize. YOLOv5-M and YOLOv5-D reach lower loss values and converge significantly faster than the original YOLOv5 algorithm, with less fluctuation, which shows that the localization accuracy and convergence rate of the model can be increased by the improved loss and activation functions [31]. For Im-YOLOv5, the loss value and convergence speed are slightly worse than those of YOLOv5-D over the first 50 iterations, but after 50 iterations its loss value and convergence speed are superior to those of the other models. This indicates that the Im-YOLOv5 algorithm improves the convergence speed and localization accuracy of the model, helping to obtain a more accurate model and proving the effectiveness of the improvement.

In order to verify the effectiveness of the improved method for apple grading, this study trained the YOLOv5 and Im-YOLOv5 models on the same dataset with the same training settings. The PR curve represents the relationship between precision and recall and can measure the model's generalization ability. The PR curves of the two trained models are shown in Figure 13. The area between the PR curve and the coordinate axes is larger for Im-YOLOv5 than for the original YOLOv5 model, which indicates that the improved model has better overall performance.

**Figure 13.** Comparison of YOLOv5 and Im-YOLOv5 PR curve. (**a**) YOLOv5; (**b**) Im-YOLOv5.

As can be seen from Figure 13, the Im-YOLOv5 model improves the grading accuracy for every apple quality grade, with AP above 95% for Grade-1 and Grade-3 apples. The AP for Grade-2 apples reached 0.755, an improvement of 9.1% over the original model. The mean average precision over all apple grades was 0.906, an increase of 3.1% compared to the original model.

The Im-YOLOv5 model and YOLOv5s model trained in this paper were used to grade apples of different qualities in an automatic apple grader.

Figure 14a shows the grading results of the YOLOv5 model before improvement, and Figure 14b shows the grading results of the Im-YOLOv5 model. The accuracy of apple grading in Figure 14a is low: the apples in the first and second images show duplicate detection boxes, and in the second image the Grade-1 and Grade-2 apples are graded incorrectly, with Grade-1 apples marked as Grade-2. The third image shows no duplicate detection boxes but incorrectly marks three Grade-1 apples, with low accuracy. In contrast, Figure 14b shows improved grading accuracy for all apple grades, with no duplicate boxes. The improved model pays more attention to apple feature information, which improves its robustness while increasing the grading accuracy. Therefore, the Im-YOLOv5 model can satisfy the requirements of apple grading in actual production environments.

In order to explore the effectiveness of the visual attention mechanism in the convolutional network and to enhance the interpretability of the apple grading model, part of the improved YOLOv5 feature extraction layer was visualized [32]. The results of feature extraction from the convolutional layers of the backbone network are shown in Figure 15. As shown in Figure 15a, the initial feature maps of the backbone's convolutional layers are large and the feature extraction is fine-grained, so the extracted apple features also contain complex background information; as the network deepens, the extracted features become gradually blurred, sparser, and more semantic. As can be seen in Figure 15b, after the SE attention module there are highlighted areas in the feature maps, and the locations of the apples are highlighted in the spatial pyramid pooling (SPP) output feature map. This indicates that, after adding the SE module, the deep layers of the Im-YOLOv5 model filter the extracted features, which helps to highlight the target apples, filter out background information in the grading stage, and improve the accuracy of the network model.
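To make the SE (squeeze-and-excitation) step described above concrete, the NumPy sketch below applies channel attention to a feature map: global average pooling (squeeze), a two-layer bottleneck with sigmoid gating (excitation), and channel-wise rescaling. The weights here are random placeholders, not the trained model's parameters:

```python
import numpy as np

def se_block(feature_map, w1, w2):
    # feature_map: (C, H, W) feature tensor from a backbone layer.
    # w1: (C//r, C) squeeze weights; w2: (C, C//r) excitation weights,
    # where r is the bottleneck reduction ratio.
    c = feature_map.shape[0]
    # Squeeze: global average pooling over spatial dimensions -> (C,)
    z = feature_map.mean(axis=(1, 2))
    # Excitation: FC -> ReLU -> FC -> sigmoid, giving per-channel
    # attention weights in (0, 1)
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))
    # Rescale: weight each channel of the input feature map
    return feature_map * s.reshape(c, 1, 1)

# Illustrative usage with random weights (reduction ratio r = 4)
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
w1 = rng.standard_normal((4, 16))
w2 = rng.standard_normal((16, 4))
out = se_block(x, w1, w2)  # same shape as x, channels rescaled
```

Because the sigmoid gate lies in (0, 1), each channel is attenuated rather than amplified, suppressing channels that mostly carry background information while preserving those responding to the target apples.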

**Figure 14.** Grading results. (**a**) YOLOv5s model; (**b**) Im-YOLOv5 model.
