#### 3.2.1. Comparison with YOLOX-s

On the COCO dataset, the superiority of YOLOX over sophisticated models such as PP-YOLO [39], YOLOv3, and EfficientDet [40] was established [28], and YOLOv4 was shown to achieve greater precision and *mAP* than Faster R-CNN and SSD 300 for detecting apple pistils [10]. Consequently, a comparison experiment was performed between the original YOLOX-s model and the modified S-YOLO-s, and the pertinent data were collected (Table 6).



<sup>1</sup> Represents the precision of the bud stage; the same applies below. <sup>2</sup> Note: The SAHI algorithm was used by default in S-YOLO but not in YOLOX.

Figure 5a shows the loss curves during training of the S-YOLO-s model. After the 100th epoch, the backbone network was unfrozen, and the losses of both the training and validation sets fell. After the 170th epoch, the validation loss was higher than the training loss and decreased more slowly, indicating that the model had started to overfit. Figure 5b shows the P–R curves at IoU = 0.3. The AP of the fully open flowers was significantly higher than that of the other three stages, and the differences among the AP values of those three stages were not significant. This phenomenon was significantly correlated with the higher pixel proportion occupied by the fully open flowers.
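The AP values read off the P–R curves in Figure 5b correspond to the area under each curve. A minimal sketch of all-point-interpolated AP, using illustrative precision/recall values rather than the paper's data:

```python
# Sketch: computing average precision (AP) from a precision-recall curve
# via all-point interpolation, as is common in COCO-style evaluation.
# The sample recall/precision values below are illustrative only.

def average_precision(recalls, precisions):
    """AP as the area under the (interpolated) precision-recall curve."""
    # Pad the curve so it starts at recall 0 and ends at recall 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Interpolate: each precision becomes the max precision at any higher recall,
    # making the curve monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangular areas over each recall step.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# Illustrative curve: high precision at low recall, tailing off.
ap = average_precision([0.1, 0.4, 0.7, 0.9], [1.0, 0.9, 0.6, 0.3])
```

A class whose curve stays near the top-right corner of the P–R plot, as for the fully open flowers, accumulates a larger area and hence a higher AP.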

Compared with YOLOX-s, the precision of S-YOLO-s was enhanced by 7.94%, 8.05%, 3.49%, and 6.96% at each flowering stage, and the different types of *mAP* by 10.00%, 9.10%, 13.10%, and 7.20%, respectively. The experimental results show that the improved model's apple flower detection precision was significantly higher than that of the original model and, by extension, higher than EfficientDet, Faster R-CNN, and SSD 300. Therefore, the model can practically and accurately detect apple flowers in high-resolution images. The detection results of YOLOX-s and S-YOLO-s on apple blossom images under four typical weather conditions of overcast, sunnier, foggy, and sunny days (Figure 6a,b) provide further evidence of the model's superiority.

**Figure 5.** Loss and P–R curve. (**a**) Loss curve of S-YOLO-s. (**b**) P–R curves with mixed datasets (IoU = 0.3).


**Figure 6.** Detection of four growth stages of apple blossoms, marked with red, green, blue, and purple boxes, using different models. (**a**) YOLOX-s detection results; (**b**) S-YOLO-s detection results; (**c**) YOLOX-l (with SAHI) detection results; (**d**) S-YOLO-l detection results.

#### 3.2.2. The Results of Different Versions of Models

The comparison experiments demonstrated that the S-YOLO model performed better in detecting apple flowers at various growth stages. An ablation experiment was designed to understand how the two improvements, adding the slicing algorithm and replacing the backbone network, each contributed to the results, and to determine how both could be combined more effectively.

Table 7 shows the results of the ablation experiments. Using YOLOX-s as the baseline, the SAHI algorithm improved the precision by 0.70%, 0.04%, 1.04%, and 3.38% for each flowering stage and the different types of *mAP* by 6.70%, 5.50%, 7.20%, and 7.70%, respectively. The results indicate that the SAHI algorithm can successfully enhance the detection performance of the model, and that the degree of enhancement was proportional to the object's pixel size. After replacing the original backbone network with Swin Transformer-tiny, the model had 35.79 MB of parameters and 95.57 G FLOPs; the precision improved by 7.24%, 8.01%, 2.45%, and 3.58%, and the *mAP* by 3.30%, 3.60%, 5.90%, and −0.50%. These results indicate that S-YOLO was more sensitive to small objects of varying length and breadth after the backbone was rebuilt. The precision gain from using Swin Transformer as the backbone was negatively correlated with object size, and the *mAP* enhancement showed an inverted U-shape with flower size, which led to negative growth of *mAP*<sub>L</sub>.

**Table 7.** Results of ablation experiments.


While replacing various backbones and detection heads enlarged the model, the detection performance exhibited a non-linear correlation with model size. Different sizes of S-YOLO were obtained by adjusting the channel depth and width multipliers of YOLOX while retaining the data augmentation provided by SAHI. The *mAP* values from S-YOLO-t to S-YOLO-l were 32.40%, 37.40%, 35.10%, and 39.00%. This phenomenon, in which the *mAP* did not grow monotonically with model size, resulted from uncoordinated channel changes between the network's backbone and neck. In the Swin-S variant, the YOLOX-s backbone was replaced with Swin Transformer-small. Although the number of parameters and FLOPs of Swin-S were higher than those of S-YOLO-l, its precision and *mAP* were lower than those of S-YOLO-s. Therefore, an appropriate ratio of structural parameters and coordinated channel variation are necessary for an S-YOLO variant to achieve better detection results.
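The variant sizes above are obtained through depth and width multipliers. A minimal sketch of this scaling scheme, using the multiplier values from the public YOLOX configs; the base channel counts and block repeats here are illustrative of a CSPDarknet-style backbone, not the exact S-YOLO layout:

```python
# Sketch: YOLOX-style model scaling via depth/width multipliers.
# Multiplier pairs follow the public YOLOX configs; base channels and
# block repeats are illustrative, not the paper's exact architecture.

VARIANTS = {            # (depth_mult, width_mult)
    "tiny": (0.33, 0.375),
    "s":    (0.33, 0.50),
    "m":    (0.67, 0.75),
    "l":    (1.00, 1.00),
}

BASE_CHANNELS = [64, 128, 256, 512, 1024]  # full-width stage channels
BASE_BLOCKS = [3, 9, 9, 3]                 # full-depth CSP block repeats

def scale(variant):
    """Return (stage channels, block repeats) for a given variant."""
    d, w = VARIANTS[variant]
    channels = [max(round(c * w), 1) for c in BASE_CHANNELS]
    blocks = [max(round(n * d), 1) for n in BASE_BLOCKS]
    return channels, blocks
```

The non-monotonic *mAP* across variants suggests that scaling backbone and neck channels with a single pair of multipliers, as above, does not guarantee that the two parts stay well matched.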

In summary, S-YOLO-s outperformed the original model in detecting each flower stage at high resolution, which resulted from the combined effect of the SAHI algorithm increasing the proportion of flower pixels while keeping the image features unchanged and of the Swin Transformer serving as the backbone network. The high-resolution local information provided by SAHI without scaling was fed to the network along with the global information of the scaled original image; this information was fused by the Swin Transformer and subsequently fully exploited by the network, prompting the model to produce state-of-the-art experimental results. Moreover, S-YOLO is exceptionally sensitive to the size of the detected object: larger objects receive a smaller boost, or even negative growth, in *mAP* compared with objects of smaller length and width. The ablation experiments revealed the superiority of S-YOLO's performance and illustrated that appropriate channel depth variation and balanced parameter scaling provide better results than arbitrarily enlarging the model. This experiment provides reference data for replacing the backbone with a larger Swin Transformer to obtain better results.
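The slicing step that feeds high-resolution local crops to the detector can be sketched as follows. This is a simplified illustration of SAHI-style sliced inference, not the library's actual API; `detect` is a hypothetical stand-in for a per-tile detector, and the merged boxes would still need cross-tile NMS in practice:

```python
# Sketch of SAHI-style sliced inference: tile an image into overlapping
# windows, run the detector on each tile, and map boxes back to
# full-image coordinates. `detect` is a hypothetical per-tile detector.

def slice_windows(img_w, img_h, tile=640, overlap=0.2):
    """Yield (x0, y0, x1, y1) tile coordinates covering the image."""
    step = int(tile * (1 - overlap))
    xs = list(range(0, max(img_w - tile, 0) + 1, step)) or [0]
    ys = list(range(0, max(img_h - tile, 0) + 1, step)) or [0]
    # Make sure the right and bottom edges are covered.
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    for y in ys:
        for x in xs:
            yield (x, y, min(x + tile, img_w), min(y + tile, img_h))

def sliced_detect(image, detect, tile=640, overlap=0.2):
    """Run `detect` on each tile and shift boxes to full-image coordinates."""
    h, w = image.shape[:2]
    boxes = []
    for x0, y0, x1, y1 in slice_windows(w, h, tile, overlap):
        for bx0, by0, bx1, by1, score, cls in detect(image[y0:y1, x0:x1]):
            boxes.append((bx0 + x0, by0 + y0, bx1 + x0, by1 + y0, score, cls))
    # In practice, overlapping tiles require NMS across the merged boxes.
    return boxes
```

Because each tile is passed to the network at full resolution, small flowers occupy a larger fraction of the input than they would in the downscaled whole image, which is consistent with the size-dependent gains reported above.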
