2.1. Attention Mechanism: Focusing on Selective Information
The attention mechanism has been proven to improve the precision of models [22]. In the process of exploring the application of the attention mechanism in computer vision, many excellent works have emerged, including SE [23], ECA [24], CBAM [25], and BAM [26]. GAM is a global attention mechanism proposed by Liu et al. [27] in 2021, which improves the performance of deep neural networks by reducing information diffusion and amplifying global interactive representations. In addition to introducing the convolutional spatial attention submodule, GAM also introduces a 3D-arranged channel attention with a multilayer perceptron to retain information and amplify global cross-dimensional interactions, as shown in Figure 1.
For an input feature map $F_1 \in \mathbb{R}^{C \times H \times W}$ of the network, the formulas for calculating the intermediate state $F_2$ and output $F_3$ can be expressed as follows:
$$F_2 = M_c(F_1) \otimes F_1$$
$$F_3 = M_s(F_2) \otimes F_2$$
where $F_1$ represents the input feature map, $F_2$ represents the intermediate state, $M_c$ represents the channel attention module, $F_3$ represents the output feature map, $M_s$ represents the spatial attention module, and $\otimes$ represents element-wise multiplication.
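The two-step gating above can be sketched numerically. The snippet below is a minimal toy, assuming the channel and spatial attention maps are supplied as precomputed logits rather than produced by GAM's actual MLP and convolutional submodules; it only illustrates the broadcasting semantics of $F_2 = M_c(F_1) \otimes F_1$ and $F_3 = M_s(F_2) \otimes F_2$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gam_forward(F1, channel_logits, spatial_logits):
    """Two-step GAM-style gating on a C x H x W feature map.

    Hypothetical simplification: the attention maps come from given
    logits instead of the real channel/spatial submodules.
    """
    C, H, W = F1.shape
    Mc = sigmoid(channel_logits).reshape(C, 1, 1)  # per-channel gate, broadcast over H, W
    F2 = Mc * F1                                   # intermediate state
    Ms = sigmoid(spatial_logits).reshape(1, H, W)  # per-position gate, broadcast over C
    F3 = Ms * F2                                   # output feature map
    return F2, F3

rng = np.random.default_rng(0)
F1 = rng.standard_normal((8, 4, 4))
F2, F3 = gam_forward(F1, rng.standard_normal(8), rng.standard_normal((4, 4)))
```

Because the sigmoid gates lie in (0, 1), each stage can only attenuate feature magnitudes, never amplify them.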
In the channel attention module, GAM uses a three-dimensional arrangement to preserve information across all three dimensions. Specifically, the input feature $F_1$ is first reshaped from C × W × H to W × H × C. Then, a two-layer fully connected network (MLP) is used to amplify the interdependencies between the spatial and channel dimensions. Finally, the three-dimensional arrangement is reshaped back into the original C × W × H form, as shown in Figure 2.
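The permute–MLP–permute sequence can be sketched as follows. This is an illustrative sketch, assuming the two MLP weight matrices `W1` and `W2` (with a reduction ratio r) are given; the real GAM module learns them during training.

```python
import numpy as np

def gam_channel_attention(F1, W1, W2):
    """Sketch of GAM-style channel attention on a C x W x H input.

    W1 (C x C/r) and W2 (C/r x C) are the assumed weight matrices of
    the two-layer MLP applied along the channel axis.
    """
    x = F1.transpose(1, 2, 0)             # C x W x H -> W x H x C
    hidden = np.maximum(x @ W1, 0.0)      # first FC layer + ReLU
    x = hidden @ W2                       # second FC layer restores C channels
    logits = x.transpose(2, 0, 1)         # permute back to C x W x H
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid gate, same shape as F1
```

The returned map has the same C × W × H shape as the input, so it can gate the feature map element-wise, e.g. `F2 = gam_channel_attention(F1, W1, W2) * F1`.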
In the spatial attention module, GAM uses two convolutional layers to fuse spatial information so that the model focuses on spatial features. At the same time, to avoid the loss of spatial information caused by pooling operations, no pooling is used, which further preserves the features, as shown in Figure 3.
2.3. Spatial Pyramid Pooling
In previous convolutional neural networks (CNNs), networks with predetermined structures required a fixed-size input image, so cropping, scaling, and other operations had to be performed when detecting images of various sizes, which decreased detection precision. To address this problem, many excellent spatial pyramid pooling methods have been proposed, such as SPP [31], SPPF, ASPP, and SPPCSPC, which allow the network to accept images of any size without cropping, scaling, or other operations. This effectively avoids problems such as image distortion caused by cropping and scaling, mitigates the extraction of repetitive features by convolutional neural networks, and improves the precision and speed of generating candidate boxes. SPPFCSPC [32] is an optimization of SPPCSPC based on the idea of SPPF, aimed at accelerating training and inference, as shown in Figure 5. By chaining three separate pooling operations, less computation is required, since each pooling layer with a small kernel operates on the output of the previous one. This produces results equivalent to those obtained with larger pooling kernels, achieving a speedup while keeping the receptive field unchanged. The calculation formula for the pooling part can be expressed as follows:
$$P_1 = \mathrm{MaxPool}_{5 \times 5}(R)$$
$$P_2 = \mathrm{MaxPool}_{5 \times 5}(P_1)$$
$$P_3 = \mathrm{MaxPool}_{5 \times 5}(P_2)$$
$$O = \mathrm{Concat}(R, P_1, P_2, P_3)$$
where $R$ represents the input feature layer, $P_1$ represents the pooling result for the minimum pooling kernel, $P_2$ represents the pooling result equivalent to the medium pooling kernel, $P_3$ represents the pooling result equivalent to the maximum pooling kernel, $O$ represents the final output result, and $\mathrm{Concat}$ represents the tensor concatenation.
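The equivalence behind the serial trick is easy to verify in a toy 1-D setting. The sketch below assumes stride-1 max pooling with "same" padding (windows clipped at the edges, as in the 2-D case with appropriate padding); under that assumption, two chained 5-wide pools produce exactly the result of one 9-wide pool.

```python
def maxpool1d(x, k):
    """Stride-1 max pooling with 'same' behaviour: the window is
    clipped at the sequence edges instead of padding with values."""
    r = k // 2
    return [max(x[max(0, i - r): i + r + 1]) for i in range(len(x))]

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
serial = maxpool1d(maxpool1d(x, 5), 5)  # two chained 5-wide pools
direct = maxpool1d(x, 9)                # one 9-wide pool
```

The serial form is cheaper in practice because each stage reuses the previous stage's output, and the intermediate results themselves feed the concatenation.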
2.4. The Model Structure of the YOLOv8 Network
The YOLO series models are single-stage object detection algorithms that use a single convolutional neural network (CNN) to simultaneously predict the class and location of objects in one forward pass. They therefore offer high precision and speed, making them particularly suitable for agricultural problems. With the development of agriculture, many excellent object detection algorithms have emerged, such as Faster R-CNN, SSD, and Mask R-CNN. In terms of speed, both the YOLO and SSD models are single-stage detectors, while Faster R-CNN and Mask R-CNN are two-stage algorithms, so the former group has a higher detection speed. In terms of precision, these four models have their own strengths and weaknesses depending on the application: YOLO is efficient at detecting small objects in real time, Faster R-CNN is better suited to large-image scenarios without real-time requirements, SSD is more accurate on large-object detection tasks, and Mask R-CNN performs better in complex and overlapping scenes. In terms of complexity, both SSD and YOLO have lower complexity and computational demand, making them easier to parallelize and accelerate in hardware. Because they require multiple convolutions and forward and backward passes, Faster R-CNN and Mask R-CNN have higher complexity.
YOLOv8 is the latest work in the YOLO (You Only Look Once) series, and it is also the most advanced object detection model. In 2015, YOLOv1 [33] was introduced, marking the first appearance of a single-stage detection algorithm. It effectively addressed the slow inference speed of two-stage detection networks while maintaining good detection accuracy. YOLOv2 [34] further improved upon YOLOv1 by introducing batch normalization layers after each convolutional layer and removing dropout. YOLOv3 [35] represented a significant advancement over the previous work; its key features were the residual backbone Darknet-53 and the Feature Pyramid Network (FPN) architecture, which enabled prediction at three different scales and the fusion of information across scales. Since then, YOLOv4 [36], YOLOv5 [37], and YOLOv7 [38] have added many techniques on top of version 3. Due to its leading performance, support for multiple tasks, well-developed deployment toolchain, and flexible design, the YOLOv8s variant of the latest YOLOv8 model is used in this paper; the relevant code can be found on GitHub [39]. The YOLOv8s model consists of five parts: the input layer, backbone network, neck network, head network, and loss function, as shown in Figure 6.
The input layer of YOLOv8s uses three techniques: adaptive anchoring, adaptive image scaling, and mosaic data augmentation. During training, YOLOv8s adaptively generates various prediction boxes based on the original anchor boxes and uses NMS to select the prediction boxes closest to the ground truth boxes. Because image sizes are nonuniform, adaptive scaling resizes images to a suitable standard size before inputting them into the network, avoiding issues such as a mismatch between feature tensors and fully connected layers. Mosaic is a data augmentation method that concatenates four randomly scaled, cropped, and arranged images together to enrich the data and improve the network's ability to detect small objects, as shown in Figure 7. In terms of improving small-target detection, mosaic data augmentation brings three benefits. First, small-target detection typically requires a large amount of annotated data, which can be extremely expensive to obtain; by combining multiple small images into a single large composite image, mosaic augmentation increases the size and diversity of the dataset without adding annotated data. Second, small targets typically occupy only a small region of the image and are easily occluded by the background or other targets, making detection challenging; mosaic augmentation introduces more variation and noise during training, enhancing the robustness of the model and its ability to handle different scenarios. Third, exposure to a wider variety of scenes and cases increases the generalization ability of the model.
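The core stitching step of mosaic augmentation can be sketched as below. This is a deliberately minimal version: it places four nearest-neighbour-resized images into the four quadrants of one canvas, whereas the real augmentation also applies random scaling and cropping per tile and remaps the bounding-box annotations accordingly.

```python
import numpy as np

def mosaic4(imgs, out_size):
    """Minimal mosaic sketch: four images resized (nearest-neighbour)
    into the quadrants of an out_size x out_size canvas."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size) + imgs[0].shape[2:], dtype=imgs[0].dtype)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(imgs, corners):
        ys = np.arange(half) * img.shape[0] // half  # nearest-neighbour row indices
        xs = np.arange(half) * img.shape[1] // half  # nearest-neighbour column indices
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas
```

A single call turns four independently annotated images into one composite training sample, which is where the dataset-enlarging effect described above comes from.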
The first convolutional layer of the backbone network is changed from a 6 × 6 to a 3 × 3 convolution, and the feature-map splitting idea of CSPNet is combined with the residual structure to form the C2f module, which obtains richer gradient flow information while remaining lightweight. YOLOv8 still uses the SPPF module found in architectures such as YOLOv5, which serially passes the input through multiple 5 × 5 max-pooling layers built on the SPP structure to avoid the image distortion caused by cropping and scaling image regions. Meanwhile, it mitigates the extraction of redundant features by convolutional neural networks, greatly improves the speed of generating candidate boxes, and reduces computational costs.
The neck part of YOLOv8s still adopts the PAN-FPN structure to build the feature pyramid of YOLO, enabling sufficient fusion of multi-scale information. The convolution structure in the up-sampling stage of the PAN-FPN was removed, and the features output at different stages of the backbone were directly fed into the up-sampling operation. Additionally, the C3 module was replaced by the C2f module.
The head part of YOLOv8s has undergone significant changes compared to YOLOv5. It has been replaced with the currently mainstream decoupled head structure that separates the classification and detection heads and uses different branches for computation, which is beneficial for improving detection performance. Additionally, YOLOv8s has moved from anchor-based to anchor-free, avoiding the complex calculations and related hyperparameter settings of the anchor boxes, which has a significant impact on performance.
The loss function mainly consists of two parts: classification loss and regression loss. The classification loss is the varifocal loss (VFL), and the regression loss combines the CIoU loss and the distribution focal loss (DFL). Due to the use of DFL, the number of channels in the regression head becomes 4 × reg_max, while the number of channels in the classification head equals the number of categories. The main improvement of the VFL [40] is its asymmetric weighting: positive samples are weighted by the target score q, so a positive sample with a high gt_IoU contributes more to the loss, which allows the network to focus on high-quality samples. In other words, training high-quality positive examples has a greater impact on the AP than low-quality ones. Negative samples are down-weighted by the factor $\alpha p^{\gamma}$; since the predicted p for negative samples is small, raising it to the power γ makes it even smaller, reducing the overall contribution of negative samples to the loss. The formula can be expressed as follows:
$$\mathrm{VFL}(p, q) = \begin{cases} -q\left(q \log p + (1 - q)\log(1 - p)\right), & q > 0 \\ -\alpha p^{\gamma} \log(1 - p), & q = 0 \end{cases}$$
where p represents the predicted IoU-aware classification score (IACS), and q represents the target score.
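The per-element form of the loss is small enough to write out directly. A minimal sketch, assuming the commonly used defaults α = 0.75 and γ = 2.0 (check these against the actual training configuration):

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Per-element varifocal loss.

    p: predicted IACS in (0, 1); q: target score (gt IoU for a
    positive sample, 0 for a negative one). alpha and gamma are
    assumed defaults, not necessarily the paper's settings.
    """
    if q > 0:  # positive: binary cross-entropy weighted by the target score q
        return -q * (q * math.log(p) + (1 - q) * math.log(1 - p))
    # negative: down-weighted by alpha * p**gamma
    return -alpha * (p ** gamma) * math.log(1 - p)
```

At the same prediction p, a high-quality positive (large q) incurs a larger loss than a low-quality one, which is exactly the asymmetric weighting described above.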
DFL [41] stands for distribution focal loss, which optimizes the probabilities of the two positions nearest to the target label y in a cross-entropy manner. It models the position of the box as a general distribution so that the network quickly focuses on the distribution of positions close to the target location. The formula can be expressed as follows:
$$\mathrm{DFL}(S_i, S_{i+1}) = -\left((y_{i+1} - y)\log S_i + (y - y_i)\log S_{i+1}\right)$$
where y represents the label, and $y_i$ and $y_{i+1}$ represent the two discrete positions nearest to y ($y_i \le y \le y_{i+1}$), with $S_i$ and $S_{i+1}$ their predicted probabilities.
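The formula translates directly into code. The sketch below assumes unit spacing between the discrete positions ($y_{i+1} - y_i = 1$), as in the usual integral-regression setup:

```python
import math

def dfl(y, y_i, y_i1, S_i, S_i1):
    """Distribution focal loss for one box side.

    y lies between the two nearest discrete positions
    y_i <= y <= y_i1 (assumed one unit apart), with predicted
    probabilities S_i and S_i1 for those two positions.
    """
    return -((y_i1 - y) * math.log(S_i) + (y - y_i) * math.log(S_i1))
```

Under this assumption, the loss is minimized when the probability mass linearly interpolates the target, i.e. $S_i = y_{i+1} - y$ and $S_{i+1} = y - y_i$, which is why the network concentrates mass on the two positions bracketing the label.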
The CIoU loss function [42] builds on the DIoU loss by adding a detection-box aspect-ratio term, so the prediction box better matches the ground truth box. The formula can be expressed as follows:
$$L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$
where ρ represents the Euclidean distance between the centers of the prediction and ground truth boxes, c represents the diagonal length of the minimum enclosing rectangle of the two boxes, b and $b^{gt}$ represent the centers of the prediction and ground truth boxes, respectively, $w^{gt}$ and $h^{gt}$ represent the width and height of the ground truth box, respectively, and w and h represent the width and height of the prediction box, respectively.
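The three CIoU terms (overlap, center distance, aspect-ratio consistency) can be computed for a pair of axis-aligned boxes as sketched below; boxes are given as (x1, y1, x2, y2) corner coordinates, and valid non-degenerate boxes are assumed:

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss for two axis-aligned boxes (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # intersection-over-union
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union
    # squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term v and its trade-off weight alpha
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)  # small epsilon guards division by zero
    return 1 - iou + rho2 / c2 + alpha * v
```

A perfectly matching prediction gives a loss of zero, and the loss grows as the boxes drift apart in position or diverge in aspect ratio.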
Meanwhile, YOLOv8s abandons the previous IoU matching and one-sided ratio allocation methods, instead using the Task-Aligned Assigner to select positive samples based on a weighted score combining classification and regression. The formula can be expressed as follows:
$$t = s^{\alpha} \cdot u^{\beta}$$
where α and β represent weight hyperparameters, s is the predicted score for the annotated category, and u is the IoU between the prediction box and the ground truth box; their product t measures the degree of alignment between the two tasks.
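The alignment metric itself is a one-liner. The defaults below (α = 1.0, β = 6.0) are illustrative values borrowed from task-aligned assignment literature, not necessarily YOLOv8s's exact settings:

```python
def alignment_metric(s, u, alpha=1.0, beta=6.0):
    """Task-aligned score t = s**alpha * u**beta.

    s: predicted classification score for the annotated category;
    u: IoU between prediction and ground truth box. alpha and beta
    are assumed defaults, not necessarily the model's settings.
    """
    return (s ** alpha) * (u ** beta)
```

During assignment, candidate anchors are ranked by t for each ground truth box and the top-k become positive samples, so an anchor must score well on both classification and localization to be selected.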