3.2.3. Comparing the Effects of Other Measures

The detection results on the mixed dataset (Table 8) show that, with comparable parameter counts and FLOPs, the improved S-YOLO-l was slightly less effective than YOLOX-l; however, this does not mean that the improvement of the model was unsuccessful. The dataset used for model evaluation in this experiment consisted of the 640 × 640 pixel images obtained by slicing together with the original 3000 × 3000 pixel images. In practice, however, the images to be detected in the natural environment are 3000 × 3000 pixels, so the data in the test set were replaced with the raw dataset and the experiment was repeated to obtain new results.
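The slicing step described above can be sketched as follows. This is a minimal illustration of SAHI-style tiling, not the library's exact implementation; the 20% overlap ratio is an assumption, while the 640-pixel tile size matches the slice size used in this experiment:

```python
import numpy as np

def slice_image(image, tile_size=640, overlap_ratio=0.2):
    """Split a high-resolution image into overlapping square tiles.

    Returns a list of (tile, (x0, y0)) pairs, where (x0, y0) is the
    tile's top-left corner, so that detections made on each tile can
    later be shifted back into full-image coordinates.
    """
    h, w = image.shape[:2]
    stride = max(int(tile_size * (1 - overlap_ratio)), 1)
    ys = list(range(0, max(h - tile_size, 0) + 1, stride))
    xs = list(range(0, max(w - tile_size, 0) + 1, stride))
    # Ensure the bottom and right edges are always covered by a final tile.
    if ys[-1] + tile_size < h:
        ys.append(h - tile_size)
    if xs[-1] + tile_size < w:
        xs.append(w - tile_size)
    tiles = []
    for y0 in ys:
        for x0 in xs:
            tiles.append((image[y0:y0 + tile_size, x0:x0 + tile_size], (x0, y0)))
    return tiles
```

For a 3000 × 3000 input with these settings, the function produces a 6 × 6 grid of full-size 640 × 640 tiles, with the last row and column shifted inward so no border pixels are lost.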

**Table 8.** Comparison results of YOLOX-l and S-YOLO-l with similar parameters (54.15 M and 51.37 M) on mixed and raw datasets (all using SAHI).


\* The proportion of the test set used for all datasets was 20%.

The results in Table 8 show that, after adjusting the test set proportion to 20%, S-YOLO-l significantly outperformed YOLOX-l in precision and achieved better results across all *mAP* metrics. Therefore, even when the gain from the SAHI algorithm is ignored, the improved model still outperformed the original model in the high-resolution task of detecting the growth stage of apple blossoms. Moreover, without the SAHI algorithm, YOLOX-l cannot be trained properly because of the low number of input images. Notably, the backbone network used in this investigation was Swin Transformer-tiny. Given sufficient computing resources and labeled data, and disregarding the drawbacks of a larger model, a larger S-YOLO network with appropriately scaled parameters and channel counts would be expected to produce superior detection results.

Figure 6c,d shows the detection results of YOLOX-l (with SAHI) and S-YOLO-l on apple tree flowering images under four typical weather conditions: cloudy, mostly sunny, foggy, and sunny days. The visualized results indicate that YOLOX-l with the SAHI algorithm may have performed marginally better than S-YOLO-l at several detection locations. Notably, the initial annotation volume used in the experiment was 39,980, which grew to 109,813 after SAHI slicing and mixing with the original dataset, yet remained far smaller than the COCO dataset; consequently, S-YOLO-l could not fully exploit its strengths. With the development of IoT technologies, the issue of an overly small training dataset will be resolved, and it will also become possible to use larger Swin Transformer models to obtain better detection results. From Table 8 and Figure 6, it can be inferred that the enhanced S-YOLO-l outperformed YOLOX-l under identical conditions using the SAHI method, but that the current quantity of data did not allow S-YOLO to demonstrate an overwhelming advantage.
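Since sliced inference figures centrally in this comparison, the complementary merging step, which maps tile-level detections back into full-image coordinates and suppresses duplicates from overlapping tiles, can be sketched as follows. This is a simplified illustration, not the exact SAHI implementation; the greedy NMS, the IoU threshold, and the `[x1, y1, x2, y2]` box format are assumptions:

```python
import numpy as np

def merge_tile_detections(tile_results, iou_thresh=0.5):
    """Shift per-tile boxes to full-image coordinates, then apply greedy NMS.

    tile_results: list of (boxes, scores, (x0, y0)) tuples, where boxes is
    an (N, 4) array of [x1, y1, x2, y2] in tile coordinates and (x0, y0)
    is the tile's top-left corner in the full image.
    """
    all_boxes, all_scores = [], []
    for boxes, scores, (x0, y0) in tile_results:
        # Translate tile-local boxes into full-image coordinates.
        all_boxes.append(np.asarray(boxes, dtype=float) + [x0, y0, x0, y0])
        all_scores.append(np.asarray(scores, dtype=float))
    boxes = np.concatenate(all_boxes)
    scores = np.concatenate(all_scores)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the kept box against all remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return boxes[keep], scores[keep]
```

When the same blossom is detected in two overlapping tiles, the two shifted boxes coincide in full-image coordinates and the lower-scoring duplicate is suppressed, leaving a single detection.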
