## 2.4. S-YOLO Detection Model

### 2.4.1. Model Construction

Compared with traditional convolutional neural networks, the Swin Transformer captures global semantic information more effectively, fuses global and local information better, and extracts more powerful features. To keep the number of model parameters at almost the same level as that of the pure convolutional backbone, and thereby ensure a fair experimental comparison, Swin Transformer-tiny was substituted for the YOLOX backbone network to generate the S-YOLO model. The overall architecture of S-YOLO is shown in Figure 4.

**Figure 4.** S-YOLO network structure.

S-YOLO is provided in four model sizes, which differ in the depth and width of the neck and head channels. The corresponding (depth, width) ratios are S-YOLO-tiny (S-YOLO-t: 0.33, 0.375), S-YOLO-small (S-YOLO-s: 0.33, 0.50), S-YOLO-middle (S-YOLO-m: 1.00, 1.00), and S-YOLO-large (S-YOLO-l: 1.33, 1.20).
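The depth and width ratios above can be read as YOLO-style compound-scaling multipliers. The helper below is a hypothetical illustration of that scheme: `base_depths` and `base_channels` are placeholder values, not the actual S-YOLO configuration.

```python
# Illustrative sketch of YOLO-style depth/width scaling. The base repeat
# counts and channel widths below are placeholders, not S-YOLO's real ones.
import math

def scale_model(depth_mult, width_mult,
                base_depths=(3, 9, 9, 3),
                base_channels=(64, 128, 256, 512)):
    """Scale block repeat counts by depth_mult and channel counts by width_mult."""
    depths = [max(1, round(d * depth_mult)) for d in base_depths]
    # Channels are commonly rounded up to a multiple of 8 for hardware efficiency.
    channels = [int(math.ceil(c * width_mult / 8) * 8) for c in base_channels]
    return depths, channels

# e.g. the (0.33, 0.50) ratios of S-YOLO-s:
depths, channels = scale_model(0.33, 0.50)
```

Applying (1.00, 1.00), as in S-YOLO-m, leaves the base configuration unchanged, which is why that pair serves as the reference point.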

The backbone network consists of five parts. The first part is the patch partition; each of the remaining four stages is composed of two consecutive Swin Transformer blocks, where the patch partition plus linear embedding of the first stage plays the same downsampling role as the Patch Merging layers of the later stages. The Swin Transformer replaces the conventional multi-head self-attention (MSA) module with the window-based multi-head self-attention (W-MSA) module and the shifted-window multi-head self-attention (SW-MSA) module. Each Swin Transformer block consists of three components linked by residual connections: a LayerNorm (LN) layer, a W-MSA/SW-MSA module, and a multilayer perceptron (MLP) with two fully connected layers and a GELU nonlinearity. By alternating W-MSA and SW-MSA so that all attention is computed within local windows, the Swin Transformer block reduces the computational cost of the model.
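The two-branch residual layout of such a block can be sketched as follows. This is a minimal sketch, not the actual implementation: standard `nn.MultiheadAttention` stands in for the windowed W-MSA/SW-MSA attention, so the window partitioning and shifting themselves are omitted, and all dimensions are illustrative.

```python
# Minimal sketch of a Swin Transformer block's residual structure:
#   x = x + Attn(LN(x));  x = x + MLP(LN(x))
# nn.MultiheadAttention is a stand-in for W-MSA/SW-MSA (no window logic here).
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),                        # GELU nonlinearity, as in the text
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                     # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]         # residual around (LN -> attention)
        x = x + self.mlp(self.norm2(x))       # residual around (LN -> MLP)
        return x

x = torch.randn(2, 49, 96)                    # e.g. one 7x7 window of 96-d tokens
y = SwinBlockSketch(96)(x)
```

Because both branches are residual, the block preserves the token shape, which is what allows two consecutive blocks (W-MSA then SW-MSA) to be stacked within each stage.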

Detail information, such as color and texture, is crucial for flower detection. Retaining and extracting as much shallow information as possible is therefore a vital consideration when building the feature extraction network. The neck uses a PANet structure [37] to perform two rounds of feature sampling and to improve the network's capacity to fuse features. The CBS module, which comprises a convolutional layer (Conv), a batch normalization layer (BN), and the SiLU activation function, is the primary feature extraction module in the S-YOLO neck and head. The CSPLayer, built on the idea of residual learning, consists of two parallel CBS modules and multiple residual units in series, enabling PANet to extract and fuse image features more effectively.
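The CBS building block named above (Conv → BN → SiLU) is straightforward to write down. The sketch below assumes illustrative kernel-size and stride defaults; the actual channel configurations come from the width multipliers discussed earlier.

```python
# Sketch of the CBS block (Conv -> BatchNorm -> SiLU) described in the text.
# Kernel size and stride defaults are illustrative.
import torch
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride,
                              padding=kernel // 2, bias=False)  # BN makes bias redundant
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

feat = CBS(64, 128, stride=2)(torch.randn(2, 64, 32, 32))  # stride 2 halves H and W
```

Disabling the convolution's bias is a common choice here, since the batch normalization layer that follows it supplies its own learned shift.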

The detection head module is designed to efficiently exploit the features gathered from global and local information. The YOLOHead employs decoupled heads, which converge faster and achieve greater precision. The decoupled head is built from CBS modules with varying channel counts and is partitioned into classification and regression subnetworks. The classification subnetwork estimates the probability that a detected flower belongs to each class (Cls), while the regression subnetwork predicts the objectness (Obj) and positions (Reg) of the feature points. Combining three sets of YOLOHeads, each designed to detect flowers of a different size, produces a large number of candidate boxes, and the anchor-free SimOTA algorithm yields the final detection results.
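The split into classification and regression subnetworks can be sketched as below. This is a hedged approximation of a YOLOX-style decoupled head, not the exact S-YOLO head: branch depths and channel widths are illustrative, and the Cls/Obj/Reg outputs match the roles named in the text.

```python
# Sketch of a decoupled detection head: a shared stem feeds separate
# classification and regression branches emitting Cls, Obj, and Reg maps.
# Channel counts and branch depths are illustrative.
import torch
import torch.nn as nn

def conv_bn_silu(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class DecoupledHead(nn.Module):
    def __init__(self, in_ch, num_classes, width=128):
        super().__init__()
        self.stem = conv_bn_silu(in_ch, width)
        self.cls_branch = conv_bn_silu(width, width)
        self.reg_branch = conv_bn_silu(width, width)
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # class scores (Cls)
        self.obj_pred = nn.Conv2d(width, 1, 1)            # objectness (Obj)
        self.reg_pred = nn.Conv2d(width, 4, 1)            # box offsets (Reg)

    def forward(self, x):
        x = self.stem(x)
        c = self.cls_branch(x)
        r = self.reg_branch(x)
        return self.cls_pred(c), self.obj_pred(r), self.reg_pred(r)

cls, obj, reg = DecoupledHead(256, num_classes=3)(torch.randn(2, 256, 20, 20))
```

One such head is attached to each of the three neck output scales, which is how the three sets of YOLOHeads cover flowers of different sizes.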

The hybrid dataset generated with SAHI was augmented using the Mosaic [25] and MixUp [38] algorithms and then fed into the S-YOLO network for training and validation.
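Of the two augmentations, MixUp is the simpler to illustrate: each training sample is a convex blend of two images and their labels, with the mixing weight drawn from a Beta distribution. The sketch below is a minimal NumPy version; the `alpha` value and function name are illustrative, not the paper's settings.

```python
# Minimal MixUp sketch: blend two images and their labels with a weight
# lam ~ Beta(alpha, alpha). alpha is an illustrative choice.
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha=0.5, seed=0):
    lam = np.random.default_rng(seed).beta(alpha, alpha)  # mixing coefficient in (0, 1)
    image = lam * img_a + (1 - lam) * img_b
    label = lam * label_a + (1 - lam) * label_b
    return image, label, lam

img, lab, lam = mixup(np.ones((4, 4, 3)), np.zeros((4, 4, 3)),
                      np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Mosaic, by contrast, stitches four images into one canvas; both augmentations expose the detector to objects at unusual scales and contexts.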
