2.3.2. C2f-CBAM Module
In this study, the application of YOLOv8 for detecting cotton bolls against complex backgrounds was strengthened by introducing the Convolutional Block Attention Module (CBAM), replacing the original C2f module with a new C2f-CBAM structure. This improvement aims to refine the model’s ability to discriminate features, thereby enhancing detection accuracy, particularly in scenarios with complex background information or small target sizes. CBAM improves feature representation through two complementary attention mechanisms: the channel attention mechanism (CAM) and the spatial attention mechanism (SAM). The channel attention mechanism assesses and emphasizes the importance of features across different channels, enabling the model to highlight features that are crucial for target recognition. The spatial attention mechanism, in turn, sharpens the focus on the spatial distribution of features, reducing interference from background noise.
Figure 7 distinctly displays the structure of the CBAM, which includes both channel and spatial attention components. The channel attention component highlights important features by assessing the significance of each channel. The spatial attention component further improves the spatial distribution of features, effectively minimizing the interference from complex backgrounds. The introduction of this dual attention mechanism enables the network to more effectively concentrate on key targets in the image, such as cotton bolls, thus demonstrating improved performance especially in environments with complex backgrounds or small target sizes.
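To make the dual attention mechanism concrete, the following is a minimal NumPy sketch of CBAM-style channel and spatial attention. The shared-MLP weights `w1`/`w2` and the scalar mixing weights `w_avg`/`w_max` (a 1 × 1 stand-in for CBAM’s usual 7 × 7 convolution in the spatial branch) are illustrative assumptions, not the trained parameters of the model described here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    # x: (C, H, W). A shared two-layer MLP (w1: (C//r, C), w2: (C, C//r))
    # scores avg- and max-pooled channel descriptors, then gates each channel.
    avg = x.mean(axis=(1, 2))                        # (C,)
    mx = x.max(axis=(1, 2))                          # (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                   + w2 @ np.maximum(w1 @ mx, 0.0))  # (C,)
    return x * gate[:, None, None]

def spatial_attention(x, w_avg, w_max):
    # x: (C, H, W). Channel-wise avg and max maps are mixed by scalar weights
    # (a 1x1 stand-in for the usual 7x7 convolution) and gate each location.
    avg = x.mean(axis=0)                             # (H, W)
    mx = x.max(axis=0)                               # (H, W)
    gate = sigmoid(w_avg * avg + w_max * mx)         # (H, W)
    return x * gate[None, :, :]

def cbam(x, w1, w2, w_avg, w_max):
    # CBAM applies channel attention first, then spatial attention.
    return spatial_attention(channel_attention(x, w1, w2), w_avg, w_max)
```

Because both gates lie in (0, 1), the module rescales rather than overwrites features, which is why it can be inserted into C2f without changing tensor shapes.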
After implementing the C2f-CBAM replacement in the backbone of YOLOv8, a series of experiments was conducted to demonstrate the effectiveness of this structural improvement in the task of cotton boll detection. The experimental results indicate that the C2f-CBAM structure significantly enhances detection accuracy and stability, particularly exhibiting superior performance in counting and tracking cotton bolls in dynamic environments. Furthermore, the application of this module also demonstrates good generalizability, effectively maintaining high performance levels across different datasets and various environmental conditions.
From both theoretical and practical perspectives, the introduction of CBAM not only optimizes the network structure but also boosts the model’s capability to handle small targets in complex scenes. This approach of improving performance through internal adjustments to the deep learning model’s structure opens up new directions and possibilities for further exploration in the field of visual recognition. This improvement also provides an effective strategy for researchers in related fields, aiming to optimize and enhance the reliability and efficiency of models in practical applications.
2.3.3. Gather-and-Distribute Neck
In real cotton field environments, owing to the complexity of the background, cotton boll features are small and inconspicuous. The neck of the YOLO series, illustrated in Figure 8, employs the traditional FPN structure, comprising multiple branches for multi-scale feature fusion. However, it can fully integrate only features from adjacent layers; information from other layers is obtained indirectly through layer-by-layer “recursion”. This process loses a significant amount of small-scale information during computation, resulting in many missed and false detections of cotton bolls. To mitigate this information loss during transmission in the traditional FPN structure, we introduce a gather-and-distribute (GD) mechanism inspired by Gold-YOLO. Specifically, the gather-and-distribute process involves three modules: the feature alignment module (FAM), the information fusion module (IFM), and the information injection module (Inject).
The gathering process comprises two steps. First, FAM collects and aligns features from the various levels. Then, IFM merges the aligned features to generate global information. After the gathering process yields this merged global information, the injection module distributes it to each level and injects it through simple attention operations, thereby enhancing the detection capability of each branch. To improve the model’s ability to detect objects of different sizes, we developed two branches: the low gather-and-distribute branch (Low-GD) and the high gather-and-distribute branch (High-GD). As illustrated in Figure 9, the input to the neck comprises the feature maps B2, B3, B4, and B5 extracted by the backbone, where each Bi has dimensions N × C_Bi × R_Bi. The batch size is denoted by N, the number of channels by C, and the spatial dimension by R = H × W. Furthermore, the dimensions RB2, RB3, RB4, and RB5 are R/2, R/4, R/4, and R/8, respectively.
Low Gather-and-Distribute Branch (Low-GD): In this branch, the backbone’s output features B2, B3, B4, and B5 are fused to obtain high-resolution features that retain small-target information. The structure is shown in Figure 10a.
In the low-feature alignment module (Low-FAM), the input features are downsampled using average pooling (AvgPool) operations to achieve a uniform size. By resizing each feature to the smallest feature size in the group (RB4 = 1/4R), the aligned feature F_align is obtained.
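The alignment step above can be sketched as follows. This is a simplified NumPy illustration assuming integer pooling factors; `avg_pool_to` and `low_fam` are hypothetical helper names standing in for the Low-FAM operations.

```python
import numpy as np

def avg_pool_to(x, out_h, out_w):
    # Adaptive average pooling: shrink x of shape (C, H, W) to (C, out_h, out_w),
    # assuming H and W are integer multiples of the target size.
    c, h, w = x.shape
    kh, kw = h // out_h, w // out_w
    return x.reshape(c, out_h, kh, out_w, kw).mean(axis=(2, 4))

def low_fam(features):
    # Align every level to the smallest spatial size in the group,
    # then concatenate along the channel dimension to form the aligned feature.
    th = min(f.shape[1] for f in features)
    tw = min(f.shape[2] for f in features)
    return np.concatenate([avg_pool_to(f, th, tw) for f in features], axis=0)
```

Pooling to the smallest size in the group keeps the subsequent fusion cheap while preserving an averaged summary of the higher-resolution levels.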
The design of the low-information fusion module (Low-IFM) comprises multiple layers of reparameterized convolution blocks (RepBlock) and a split operation. Specifically, RepBlock accepts F_align as input and generates the fused feature F_fuse; the intermediate channel number is adjustable to accommodate different model sizes. The feature produced by RepBlock is then split along the channel dimension into F_inj_P3 and F_inj_P4, which are subsequently merged with the features from the corresponding levels.
To inject the global information more effectively into the different levels, this module uses the split results together with attention operations to merge information, as illustrated in Figure 10c. The module receives local information (the current level’s features) and global injection information (generated by IFM), denoted F_local and F_inj, respectively. It employs two distinct Conv operations on F_inj to produce F_embed and F_act, while the local embedding is calculated by applying Conv to F_local. The output features F_out are then merged through attention calculations. Since the dimensions of F_inj and F_local do not match, the module employs average pooling or bilinear interpolation to resize F_embed and F_act to the size of F_local, ensuring proper alignment. After each attention merge, a RepBlock is incorporated to further extract and integrate the information. In the low order, F_local is equivalent to Bi, and the formula is noted as follows:
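Assuming the Gold-YOLO-style injection described above, a minimal NumPy sketch might look as follows. The 1 × 1 convolutions are modeled as channel-mixing matrices and nearest-neighbour upsampling stands in for bilinear interpolation, so this is an illustration of the gating structure rather than the exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w):
    # 1x1 convolution as a channel-mixing matrix: w (C_out, C_in), x (C_in, H, W).
    return np.einsum('oc,chw->ohw', w, x)

def upsample(x, h, w):
    # Nearest-neighbour resize (a stand-in for bilinear interpolation),
    # assuming integer scale factors.
    c, xh, xw = x.shape
    return x.repeat(h // xh, axis=1).repeat(w // xw, axis=2)

def inject(f_local, f_inj, w_local, w_act, w_embed):
    # Gate the locally embedded feature with the global attention map F_act,
    # then add the globally embedded information F_embed.
    _, h, w = f_local.shape
    f_act = upsample(sigmoid(conv1x1(f_inj, w_act)), h, w)
    f_embed = upsample(conv1x1(f_inj, w_embed), h, w)
    return conv1x1(f_local, w_local) * f_act + f_embed
```

The sigmoid gate lets the global information modulate each level multiplicatively, while the additive embedding injects content directly; a RepBlock would then refine the result.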
High Gather-and-Distribute Branch (High-GD): The High-GD branch merges the features P3, P4, and P5 generated by Low-GD, as shown in Figure 10b.
The high-feature alignment module (High-FAM) uses average pooling (AvgPool) to reduce the input features to a uniform size. Specifically, AvgPool reduces each input feature to the smallest spatial dimension in the feature group (RP5 = 1/8R). Because the transformer module extracts high-level information, the pooling operation helps aggregate the data while reducing the computational demand of the subsequent transformer steps.
The high-information fusion module (High-IFM) comprises transformer blocks (detailed below) and a split operation, involving three steps: (1) F_align, obtained from High-FAM, is combined through the transformer blocks to obtain F_fuse. (2) The channel number of F_fuse is reduced to the sum of C_P4 and C_P5 through a Conv 1 × 1 operation. (3) F_fuse is split along the channel dimension into F_inj_P4 and F_inj_P5, which are subsequently fused with the current level’s features. The formulas are as follows:
The transformer fusion module consists of several stacked transformers, with the number of transformer blocks denoted by L. Each transformer block comprises a multi-head attention block, a feed-forward network (FFN), and residual connections.
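A single-head simplification of one such transformer block can be sketched in NumPy as follows; layer normalization and multi-head splitting are omitted for brevity, so this shows only the attention, FFN, and residual structure described above.

```python
import numpy as np

def softmax(s):
    # Numerically stable softmax over the last axis.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention over tokens x of shape (T, D).
    q, k, v = x @ wq, x @ wk, x @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def transformer_block(x, wq, wk, wv, w1, w2):
    # One block: attention with a residual connection, then a ReLU
    # feed-forward network with a residual connection.
    x = x + attention(x, wq, wk, wv)
    return x + np.maximum(x @ w1, 0.0) @ w2
```

Stacking L such blocks on the pooled High-FAM tokens corresponds to the transformer fusion module described above.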
The information injection module in High-GD is identical to that in Low-GD. In the high order, F_local is equal to Pi, so the formula is expressed as follows:
In this paper, we replaced the original FPN structure in the YOLOv8 neck with the gather-and-distribute (GD) mechanism. To further enhance the interconnectivity of cross-layer information, we added two feature alignment modules (FAM) in the low gather-and-distribute branch, with each module receiving three inputs. The formula can be expressed as follows:
where the output of each added FAM takes the place of the corresponding variable in Equation (6). Specifically, as shown in Figure 10c, it is used as the local input of the information injection module. Therefore, in this paper, Equation (6) is modified as follows:
Similarly, in the high gather-and-distribute branch, we also added two additional feature alignment modules and introduced the C2f structure after all of the information injection modules. The formula can be expressed as follows:
where the output of each added FAM is used as the local input in Equation (13). Therefore, in this paper, Equation (13) is modified as follows:
Through these improvements, the effectiveness of information fusion and transmission is substantially enhanced, better addressing the false positives and missed detections that arise in cotton boll detection within complex environments and improving the model’s detection accuracy.
2.3.4. Improved Loss Function
In real farmland environments, the proportion of small objects in cotton boll detection tasks is considerably high, and a well-designed loss function can significantly enhance the model’s detection performance. YOLOv8 uses DFL and CIoU to calculate the bounding-box regression loss; however, CIoU has the following drawbacks. First, CIoU does not account for the balance between hard and easy samples. Second, CIoU uses the aspect ratio as one of the penalty terms in the loss function; if the aspect ratios of the ground-truth and predicted boxes are identical but their widths and heights differ, the penalty cannot accurately reflect the true difference between the two boxes. Third, the calculation of CIoU involves an arctangent function, which increases the computational burden on the model. The CIoU calculation formula is presented in Equation (19):
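For reference, the widely used CIoU loss takes the following standard form, where b and b^gt are the centers of the predicted and ground-truth boxes, c is the diagonal length of the minimum enclosing box, and v measures the aspect-ratio difference via the arctangent term mentioned above:

```latex
L_{CIoU} = 1 - IoU + \frac{\rho^{2}\!\left(b, b^{gt}\right)}{c^{2}} + \alpha v,
\qquad
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2},
\qquad
\alpha = \frac{v}{\left(1 - IoU\right) + v}
```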
In Equation (19), the intersection over union (IoU) is defined as the ratio of the intersection to the union of the predicted box and the ground-truth box. The parameters referenced in Equation (19) are depicted in Figure 11. Here, ρ denotes the Euclidean distance between the centroids of the ground-truth box and the predicted box; h and w denote the height and width of the predicted box; h^gt and w^gt represent the height and width of the ground-truth box; and C_h and C_w denote the height and width of the minimum enclosing box that contains both the predicted box and the ground-truth box.
EIoU [25] advances beyond CIoU by incorporating the width and height directly as penalty terms, thereby addressing the differences in width and height between the ground-truth box and the predicted box and offering a more rational penalty than CIoU. The formula for calculating EIoU is presented in Equation (20):
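The standard EIoU formulation, consistent with the description above, separates the width and height penalties:

```latex
L_{EIoU} = 1 - IoU
+ \frac{\rho^{2}\!\left(b, b^{gt}\right)}{c^{2}}
+ \frac{\rho^{2}\!\left(w, w^{gt}\right)}{C_{w}^{2}}
+ \frac{\rho^{2}\!\left(h, h^{gt}\right)}{C_{h}^{2}}
```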
The parameters relevant to Equation (20) are illustrated in Figure 11: ρ_w and ρ_h denote the Euclidean distances in width and height, respectively, between the ground-truth box and the predicted box, and b^gt and b indicate the center points of the ground-truth box and the predicted box, respectively.
SIoU [26] incorporates the angle between the predicted box and the ground-truth box as a penalty term for the first time. The relationship between the angle θ and the parameter α, as shown in Figure 11, drives the predicted box to first align rapidly with the nearest axis and then regress towards the ground-truth box. SIoU thus constrains the degrees of freedom of the regression, accelerating the model’s convergence.
The mainstream loss functions discussed above employ a static focusing mechanism. In contrast, WIoU not only takes into account the area, centroid distance, and overlap region but also introduces a dynamic non-monotonic focusing mechanism. WIoU employs a reasonable gradient gain allocation strategy to evaluate the quality of anchor boxes. Tong et al. introduced three variants of WIoU: WIoU v1 adopts an attention-based bounding-box loss design, while WIoU v2 and WIoU v3 add focusing coefficients. WIoU v1 uses distance as the attention metric: when the target box and the predicted box overlap within a certain range, reducing the penalty on the geometric metrics facilitates better model generalization. The formulas for calculating WIoU v1 are presented in Equations (21) to (23).
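Following Tong et al.’s formulation, WIoU v1 is commonly written as follows, where (x, y) and (x^gt, y^gt) are the centers of the predicted and ground-truth boxes, W_g and H_g are the dimensions of the minimum enclosing box, and the superscript * denotes detachment from gradient propagation:

```latex
L_{WIoUv1} = R_{WIoU} \, L_{IoU},
\qquad
R_{WIoU} = \exp\!\left(\frac{\left(x - x^{gt}\right)^{2} + \left(y - y^{gt}\right)^{2}}{\left(W_{g}^{2} + H_{g}^{2}\right)^{*}}\right),
\qquad
L_{IoU} = 1 - IoU
```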
By constructing a monotonic focusing coefficient, WIoU v2 extends WIoU v1, effectively reducing the weight of easy examples in the loss value. However, because this coefficient decreases as the IoU loss decreases during training, which slows convergence in the later stages, the running mean of the IoU loss is introduced to normalize the coefficient. The calculation formula for WIoU v2 is presented in Equation (24).
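In Tong et al.’s formulation, WIoU v2 scales WIoU v1 by a mean-normalized monotonic focusing coefficient with exponent γ:

```latex
L_{WIoUv2} = \left(\frac{L_{IoU}^{*}}{\overline{L_{IoU}}}\right)^{\gamma} L_{WIoUv1},
\qquad \gamma > 0
```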
WIoU v3 defines an outlier degree β to assess the quality of anchor boxes and constructs a non-monotonic focusing coefficient, which is applied on top of WIoU v1. A smaller β indicates a higher-quality anchor box; such boxes receive a smaller gradient gain, so the weight of high-quality anchor boxes in the overall loss is reduced. A larger β denotes a poorer-quality anchor box, which is likewise assigned a reduced gradient gain, diminishing the harmful gradients generated by low-quality anchor boxes. In this way, WIoU v3 employs a well-calibrated gradient gain allocation strategy to dynamically adjust the weights of high- and low-quality anchor boxes in the loss, shifting the model’s focus towards average-quality samples and enhancing overall performance. The formulas for WIoU v3 are presented in Equations (25)–(27). The variables α and δ in Equation (26) are adjustable hyperparameters, enabling adaptation to various models.
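In Tong et al.’s formulation, WIoU v3 combines the outlier degree β with the non-monotonic focusing coefficient r:

```latex
L_{WIoUv3} = r \, L_{WIoUv1},
\qquad
r = \frac{\beta}{\delta \, \alpha^{\beta - \delta}},
\qquad
\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)
```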
A comparison of the aforementioned mainstream loss functions reveals that WIoU v3 has significant advantages for bounding-box regression loss, and it was therefore selected. First, WIoU v3 integrates several advantages of EIoU and SIoU, aligning with the design philosophy of a superior loss function. Second, WIoU v3 employs a dynamic non-monotonic mechanism to evaluate the quality of anchor boxes, enabling the model to focus more on anchor boxes of average quality and enhancing its object localization capability. In the task of detecting cotton bolls in complex environments, the presence of small bolls complicates detection; WIoU v3 dynamically optimizes the loss weights of small objects, thereby improving the model’s detection performance.
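As a concrete illustration, the following Python sketch computes WIoU v3 for axis-aligned boxes. The running mean of the IoU loss is passed in as `mean_liou` (in training it is maintained with momentum and detached from gradients), and the defaults α = 1.9 and δ = 3 are illustrative hyperparameter choices rather than the settings used in this study.

```python
import math

def _area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); returns intersection over union.
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    return inter / (_area(box_a) + _area(box_b) - inter)

def wiou_v3(pred, gt, mean_liou, alpha=1.9, delta=3.0):
    # WIoU v1: distance attention over the minimum enclosing box,
    # then the non-monotonic factor r = beta / (delta * alpha**(beta - delta)).
    liou = 1.0 - iou(pred, gt)
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])   # enclosing-box width
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])   # enclosing-box height
    r_wiou = math.exp(((cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2)
                      / (wg ** 2 + hg ** 2))
    wiou_v1 = r_wiou * liou
    beta = liou / mean_liou                          # outlier degree
    r = beta / (delta * alpha ** (beta - delta))
    return r * wiou_v1
```

Boxes with an IoU loss close to the running mean (β ≈ δ region) receive the largest gradient gain, which is the mechanism that shifts focus towards average-quality anchor boxes.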