*2.3. Slicing Using SAHI*

The high-resolution apple tree images used in this study contained many small, densely clustered flowers, each occupying only a few pixels. Feeding the full-resolution photos directly into the network to extract features was too computationally costly, whereas reducing the resolution would discard the fine detail associated with the flowers.

Multiple solutions have been developed to address the problem of small, dense objects in high-resolution photographs. The traditional approach of padding and then segmenting images [6], and the copy-and-augment approach applied after oversampling [33], both require re-segmenting a large number of annotations, which alters so many features that the result becomes incompatible with the original dataset. Enlarging the target region [34] can enrich small-object features, but it adds computational cost and is difficult to reconcile with the detection-speed requirements of some agricultural applications. To preserve image detail while limiting the model's computational cost, the SAHI slicing algorithm was used to increase detection accuracy.

The SAHI [35] algorithm is a slicing-aided inference approach for object detection: the input image is cropped into overlapping slices, inference is run on each slice, and the slice-level results are merged. Its most notable benefit is that it can be combined with any object detection inference method, considerably enhancing small-target detection while increasing computation time only linearly in the number of slices. The SAHI algorithm has been shown to effectively increase the precision of YOLO-series detectors [36].
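The slicing-aided inference loop can be sketched as follows. This is a minimal illustration of the idea, not SAHI's actual implementation: `detector` is a hypothetical stand-in for any object detector (e.g., a YOLO model) that returns boxes in slice-local coordinates, and the merging step here is plain greedy NMS.

```python
# Sketch of slicing-aided inference (the idea behind SAHI).
# `detector` is a hypothetical callable: given an H x W x 3 slice, it
# returns a list of (x1, y1, x2, y2, score) boxes in slice coordinates.
import numpy as np

def slice_positions(length, slice_size, overlap_ratio):
    """Start offsets so consecutive slices overlap by overlap_ratio."""
    stride = int(slice_size * (1 - overlap_ratio))
    positions = list(range(0, max(length - slice_size, 0) + 1, stride))
    # Clamp the final slice so it ends exactly at the image border.
    if positions[-1] + slice_size < length:
        positions.append(length - slice_size)
    return positions

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2, ...) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, iou_thresh):
    """Greedy non-maximum suppression to merge duplicates created by
    the overlapping slices."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept

def sliced_inference(image, detector, slice_size=640, overlap_ratio=0.2,
                     iou_thresh=0.5):
    """Run the detector on each overlapping slice and merge results."""
    h, w = image.shape[:2]
    boxes = []
    for y in slice_positions(h, slice_size, overlap_ratio):
        for x in slice_positions(w, slice_size, overlap_ratio):
            for (x1, y1, x2, y2, s) in detector(
                    image[y:y + slice_size, x:x + slice_size]):
                # Shift slice-local boxes back to full-image coordinates.
                boxes.append((x1 + x, y1 + y, x2 + x, y2 + y, s))
    return nms(boxes, iou_thresh)
```

Because each slice is processed independently, the cost grows linearly with the slice count, which matches the behavior described above.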

Considering the efficacy of SAHI in small object detection applications, the SAHI algorithm was applied with a 20% overlap ratio to the dataset used in this study, yielding 640 × 640 pixel images (Figure 3).

**Figure 3.** The process of SAHI splitting the original image.

The SAHI algorithm divides each original 3000 × 3000 pixel image into 640 × 640 pixel slices that can be fed directly into the network, eliminating the computational overhead of processing huge images without scaling and avoiding the loss of detail that an approximately fivefold downscaling would cause. The dataset created by combining the original and sliced images thus contains both large images (3000 × 3000) with rich semantic information and small images (640 × 640) with detailed local information. Additionally, the 20% overlap between adjacent slices facilitates information fusion across slice boundaries.
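The offline tiling step described above can be sketched as follows. This is a minimal illustration under the stated settings (640 × 640 tiles, 20% overlap), not the SAHI implementation; it assumes a square image for brevity.

```python
# Sketch of the offline slicing used to build the training set:
# a 3000 x 3000 image is cut into 640 x 640 tiles with 20% overlap.
import numpy as np

def make_tiles(image, tile=640, overlap=0.2):
    stride = int(tile * (1 - overlap))        # 640 * 0.8 = 512 px step
    side = image.shape[0]                     # assumes a square image
    starts = list(range(0, side - tile + 1, stride))
    if starts[-1] != side - tile:             # clamp last tile to border
        starts.append(side - tile)
    return [image[y:y + tile, x:x + tile]
            for y in starts for x in starts]

tiles = make_tiles(np.zeros((3000, 3000, 3), dtype=np.uint8))
print(len(tiles))       # 6 x 6 = 36 tiles per image
print(3000 / 640)       # ~4.7: the downscale factor that is avoided
```

With a 512-pixel stride, six start offsets cover each 3000-pixel axis (the last tile is clamped to the border), so one original image yields 36 overlapping tiles.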
