COTTON-YOLO: Enhancing Cotton Boll Detection and Counting in Complex Environmental Conditions Using an Advanced YOLO Model

Lu, Ziao; Han, Bo; Dong, Luan; Zhang, Jingjing

doi:10.3390/app14156650

Open AccessArticle

COTTON-YOLO: Enhancing Cotton Boll Detection and Counting in Complex Environmental Conditions Using an Advanced YOLO Model

¹

College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China

²

Engineering Research Center of Intelligent Agriculture Ministry of Education, Urumqi 830052, China

³

Xinjiang Agricultural Informatization Engineering Technology Research Center, Urumqi 830052, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(15), 6650; https://doi.org/10.3390/app14156650

Submission received: 3 June 2024 / Revised: 22 July 2024 / Accepted: 26 July 2024 / Published: 30 July 2024

(This article belongs to the Special Issue Advanced Computational Techniques for Plant Disease Detection)

Download

Browse Figures

Versions Notes

Abstract

:

This study aims to enhance the detection accuracy and efficiency of cotton bolls in complex natural environments. Addressing the limitations of traditional methods, we developed an automated detection system based on computer vision, designed to optimize performance under variable lighting and weather conditions. We introduced COTTON-YOLO, an improved model based on YOLOv8n, incorporating specific algorithmic optimizations and data augmentation techniques. Key innovations include the C2F-CBAM module to boost feature recognition capabilities, the Gold-YOLO neck structure for enhanced information flow and feature integration, and the WIoU loss function to improve bounding box precision. These advancements significantly enhance the model’s environmental adaptability and detection precision. Comparative experiments with the baseline YOLOv8 model demonstrated substantial performance improvements with COTTON-YOLO, particularly a 10.3% increase in the AP₅₀ metric, validating its superiority in accuracy. Additionally, COTTON-YOLO showed efficient real-time processing capabilities and a low false detection rate in field tests. The model’s performance in static and dynamic counting scenarios was assessed, showing high accuracy in static cotton boll counting and effective tracking of cotton bolls in video sequences using the ByteTrack algorithm, maintaining low false detections and ID switch rates even in complex backgrounds.

Keywords:

automated cotton detection; YOLO (You Only Look Once); data augmentation techniques; real-time image processing; precision agriculture; ByteTrack

1. Introduction

Cotton is a significant agricultural crop in China, serving as a crucial strategic material and raw material for the textile industry. With its unique geographical advantages, abundant sunlight, fertile cultivated land, and other natural resources, Xinjiang has developed into the largest high-quality cotton production base in China. Furthermore, the cotton industry has emerged as a pivotal industry, boosting farmers’ incomes and ensuring border stability, and serves as a vital safeguard for the security of China’s cotton industry [1]. The number of bolls is a critical indicator for various intelligent tasks, including cotton field management and yield estimation [2]. Accurate detection and counting of cotton bolls are essential for effective prevention and control management in automated farming, enabling more informed management decisions and consequently enhancing quality and efficiency.

In recent years, due to the reduction in equipment costs and enhanced GPU computing power, the application of computer vision in agriculture has increased significantly [3]. These technologies have been extensively applied in precision agriculture for tasks such as crop detection and classification as well as disease and pest monitoring, significantly enhancing agricultural production efficiency and quality [4,5,6]. Traditionally, cotton growth period information has been gathered with manual detection, necessitating frequent observation and specialized knowledge. These issues can be addressed through automated cotton detection facilitated by computer vision technology. A cotton boll detection method that combines image processing technology with artificial intelligence can enhance efficiency and minimize errors associated with manual detection. This method has recently been widely applied in the identification of various crops, including wheat [7], tomatoes [8], and corn [9]. Zhang et al. [10] proposed a method utilizing image processing and enhanced YOLOv4 technology to detect impurities in machine-picked cotton. A multi-channel fusion segmentation algorithm and an enhanced YOLOv4 model were employed to classify and identify cotton impurities. This approach may result in the misclassification of some impurities. Xu et al. [11] utilized high-resolution UAV imagery to detect and distinguish white cotton bolls in fields; their method integrates fuzzy reasoning techniques with RGB and HSV color spaces to achieve accurate detection. However, the system’s complexity and reliance on precise color settings may affect detection consistency and accuracy under varying lighting conditions. This approach’s dependency on specific color calibrations poses challenges in maintaining uniform detection performance across different environmental conditions. Feng et al. [12] utilized UAV multi-spectral images and deep learning methods to detect and count cotton seedlings, effectively addressing the challenges of detecting cotton seedlings against complex backgrounds and enhancing the accuracy and efficiency of detection. However, the accuracy of seed detection and counting is compromised in visible light images due to their sensitivity to light changes and the strong influence of soil reflection on crop color. Lin et al. [13] proposed a method to enumerate cotton seedlings using UAV images combined with MobileNet [14] and CenterNet [15] deep learning models. This technique enhances both the efficiency and accuracy of cotton seedling enumeration through the utilization of high-precision drone imagery and sophisticated models. However, a significant limitation of this approach is the substantial size of the models, particularly the CenterNet model, which demands considerable disk space, potentially hindering deployment in resource-constrained environments. Liang et al. [16] introduced the YOLO-C algorithm, which detects mature long-staple cotton targets in high-resolution RGB images. The algorithm’s root mean square error (RMSE) is 1.88 in static counting. These results indicate that in complex environments with significant diversity in target size and shape, the frequency of missed and incorrect detections increases, thereby compromising the detection’s stability. Liu et al. [17] developed an enhanced algorithm, MRF-YOLO, based on YOLOX, specifically for detecting unopened cotton bolls. When processing images of cotton bolls affected by blade occlusion and reflected light, the model occasionally produces erroneous recognitions. Such disturbances, arising from environmental factors, can impact the model’s overall stability and reliability. Muzaddid et al. [18] proposed a multi-object tracking framework named NTrack for accurately tracking and counting cotton bolls in dynamic field environments, but the system’s complexity and high computational demands may limit its practical application in real agricultural settings. Zhang et al. [19] proposed a detection model for small target cotton bolls, utilizing space-to-depth and non-strided convolutions to enhance accuracy. The model achieved a detection accuracy of 0.874 on UAV imagery.

Although an increasing number of scholars have begun to conduct target detection research on cotton, there remains a deficiency in the precise detection and counting of cotton bolls under complex conditions, including varying weather and lighting conditions, such as sunny, cloudy, and rainy days. To address the issues of large model size and low detection accuracy in natural environments, a cotton boll detection model named COTTON-YOLO has been proposed. This method, which is based on YOLOv8n, introduces a convolutional block attention module (CBAM) [20] to improve the C2F module, thereby enhancing the model’s ability to recognize cotton boll features, especially in complex backgrounds or when the target size is small. The CBAM sequentially applies channel and spatial attention to intermediate feature maps, focusing on ‘what’ and ‘where’ to attend, thus refining the feature representation and improving detection performance. A new gather-and-distribute (GD) [21] mechanism replaces the traditional FPN structure to enhance the information flow and integration between feature layers, thereby improving the model’s detection accuracy for cotton bolls of different sizes. This mechanism ensures that features are effectively gathered from lower layers and distributed to higher layers, facilitating better detection of objects at multiple scales. The introduction of the Wise-IoU (WIoU) [22] loss function refines the positioning of cotton boll bounding boxes. This loss function uses a dynamic non-monotonic focusing mechanism to evaluate the quality of anchor boxes, assigning different gradient gains based on their quality. This strategy reduces the impact of high-quality anchor boxes and minimizes harmful gradients from low-quality examples, significantly reducing false positives and missed detections, particularly in densely growing cotton scenarios. The proposed COTTON-YOLO is capable of effectively detecting cotton bolls against complex agricultural backgrounds.

2. Materials and Methods

2.1. Construction of Cotton Boll Detection Dataset

2.1.1. Data Acquisition and Labeling

The experimental data originated from Hua Xing Farm in Daxiquan Town, Changji City, Xinjiang Province, which is situated at 87°29′ E longitude and 44°22′ N latitude and characterized by a temperate continental climate. The variety of cotton used was NH12026, featuring a growing period of approximately 119 days, an average plant height of about 80 cm, and an average boll weight of about 6.5 g. In this study, field surveys were conducted in three adequately spaced areas within a cotton farm, with three to four representative sampling points randomly selected in each area for image collection. The changing conditions of illumination, weather, and other all-day environmental factors acted as standards for image acquisition to ensure data diversity. Images were collected under three different weather conditions: sunny, cloudy, and rainy days (see Figure 1). Image data were acquired at different times throughout the day. Consequently, a dataset was compiled covering various natural environments, including sunny, cloudy, and rainy conditions, to ensure the comprehensiveness of the images and their generalizability in the real world. Image collection occurred from 9 September to 23 September 2023, with experimental personnel utilizing an iPhone 14 Pro to capture high-resolution images (4032 × 3024, approximately 12.2 MP) from 0.5 m to 1 m above the cotton canopy.

The shooting angles comprised forward, upward, and multiple angles. In total, 607 original images were gathered from various environments.

2.1.2. Data Preprocessing and Enhancement

To prevent model overfitting due to a small dataset, image data augmentation techniques were utilized, including adjusting brightness, adding noise, applying CutOut [23], rotating, cropping, translating, and mirroring the images (see Figure 2). These methods doubled the dataset size. First, adjusting brightness increases the contrast of images, thus making the features of objects more pronounced, which is crucial for object detection. Second, adding noise increases the complexity of images, thereby enhancing the model’s robustness against noisy data. CutOut represents an effective data augmentation method that involves randomly obscuring parts of an image, thus training the model to handle incomplete information. Additionally, geometric transformations such as rotation, cropping, translation, and mirroring help the model to adapt to various object poses and scenes. For instance, rotating or cropping the images can simulate different positions and angles of objects as they might appear in real-world scenarios. Finally, the dataset, which comprised 1821 images including both original and augmented images, was organized for the object detection algorithm. These images were divided into a training set, validation set, and test set in ratios of 7:1:2, comprising 1274, 182, and 365 images, respectively. Original images were used as the test dataset to demonstrate the performance of the model in this study. The specific distribution of the data set is shown in Table 1.

We analyzed the labels of the dataset, the specific analysis results are shown in Figure 3, finding a total of 9894 labels. Cotton bolls were classified based on their size relative to the original image area: those less than or equal to 2% as small targets, those greater than 2% but less than or equal to 5% as medium targets, and those greater than 5% as large targets. In our dataset, there are 6212 small target cotton bolls, 2671 medium target cotton bolls, and 1011 large target cotton bolls. Small target cotton bolls comprised about 62.8% of the total, medium target cotton bolls comprised about 27%, while large target cotton bolls accounted for approximately 10.2% of the total. A few large bolls were found in the upper part of the cotton plant or near the main stem, although the large target represents only one-tenth of the size of the original image; thus, the number of small and medium bolls constitutes the majority.

2.2. Standard YOLOv8 Model

In January 2023, YOLOv8 was launched by Ultralytics, the company responsible for developing YOLOv5. YOLOv8 provides five scaled versions, including YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, which support a variety of visual tasks such as object detection, segmentation, pose estimation, tracking, and classification.

Figure 4 illustrates the detailed architecture of YOLOv8. It employs a backbone similar to YOLOv5 but with modifications on the CSPLayer, now referred to as the C2f module. The C2f module (Cross Stage Partial Bottleneck with two convolutions) integrates high-level features with contextual information to enhance detection precision. YOLOv8 utilizes an anchor-free model and a decoupled head to independently handle object detection, classification, and regression tasks. This design enables each branch to focus on its specific function, thus enhancing overall model accuracy. In YOLOv8’s output layer, a sigmoid function serves as the activation for object scores, indicating the likelihood that a bounding box contains an object, while a softmax function represents the likelihoods of belonging to each possible class.

YOLOv8 employs CioU and DFL for bounding box loss and binary cross-entropy for classification loss. These loss functions enhance target detection performance, particularly when addressing smaller objects. YOLOv8 also introduces a semantic segmentation model, named YOLOv8-Seg. Its backbone is the CSPDarknet53 feature extractor, followed by the C2f module in place of the traditional YOLO neck architecture. Following the C2f module, two segmentation heads are configured to learn to predict semantic segmentation masks of the input images. The model includes a detection head similar to YOLOv8, comprising five detection modules and a prediction layer. The YOLOv8-Seg model delivers state-of-the-art results on various object detection and semantic segmentation benchmarks, all while maintaining high speed and efficiency.

YOLOv8 can be operated from a command-line interface (CLI) or installed as a PIP package. Additionally, it incorporates multiple integrated features for labeling, training, and deployment. When evaluated on the MS COCO dataset test-dev 2017, YOLOv8x recorded an AP of 53.9% at an image size of 640 pixels (compared to YOLOv5’s AP of 50.7% at the same input size), achieving a speed of 280 FPS on NVIDIA A100 and TensorRT.

When applying the YOLO algorithm for cotton boll detection in complex environments, the limited hardware capabilities of mobile devices must be taken into account. The YOLOv8x algorithm, although it offers the best detection performance, is not feasible due to its high demands on real-time processing in complex settings. Consequently, this paper adopts the minimal weight model of YOLOv8n. However, several issues emerge with YOLOv8n in practical detection processes. First, YOLOv8n employs a large number of standard convolutions and C2f modules, which enhance the model’s accuracy but simultaneously reduce its operational speed and increase the model’s parameters. Second, the scene in cotton boll detection changes rapidly, necessitating sufficient detection accuracy. However, YOLOv8n is not ideal for processing cotton boll detection, as it is prone to false positives and missed detections. The algorithm particularly struggles with objects that undergo significant scale changes. Since YOLOv8 divides the input image into smaller grids, with each grid predicting only one object, this configuration makes it challenging to handle objects that vary greatly in size.

2.3. COTTON-YOLO Structure

The proposed network structure of COTTON-YOLO is shown in Figure 5 below. Its design is inspired by the YOLOv8 model, and the specific implementation process is illustrated in Figure 5. The details of the improvements are described in Section 2.3.1, Section 2.3.2, Section 2.3.3 and Section 2.3.4.

2.3.1. Data Preprocessing and Enhancement

Mosaic data augmentation, originally proposed for the YOLOv4 network, is based on the principle of randomly cropping four images and merging them into one image as newly generated training data. This technique not only significantly enriches the detection dataset and enhances the model’s robustness, but also reduces the use of GPU video memory, according to Bochkovskiy and others [24]. The last ten rounds of training see the deactivation of mosaic enhancement. Employing data augmentation in the initial stages of training helps the model cope with various complex situations, whereas reducing such enhancements in the later stages aids the model in learning and generalizing more accurately to unenhanced, more natural state data. Utilizing unenhanced data in the final stages of training can help the model better understand the real-world data distribution, thereby improving detection accuracy in practical applications.

The transformations applied in our data augmentation process were developed using custom algorithms, with the OpenCV library as a foundational tool. This approach was specifically chosen to enhance our model’s ability to detect cotton bolls more effectively, aligning closely with the unique requirements of our research. Adjusting brightness, adding noise, applying cutout, rotating, cropping, translating, and mirroring images are randomly applied to enhance the model’s adaptability to various scenes and conditions. As shown in Figure 6, (a) is the original image; (b) demonstrates augmentation methods such as adding noise, cutout, and translation; and (c) illustrates augmentation methods such as reducing brightness, rotation, and cutout. These methods not only mimic the variety of changes encountered in the real world but also ensure the model maintains efficient performance when faced with changes in position, size, and environment.

2.3.2. C2f-CBAM Module

In this study, the application of YOLOv8 for detecting cotton bolls against complex backgrounds was bolstered by introducing the CBAM, which replaced the traditional C2f module with a new C2f-CBAM structure. This improvement aims to refine the model’s ability to discriminate features, thereby enhancing detection accuracy for small targets, particularly in scenarios with complex background information or small target sizes. The CBAM module improves feature representation through its unique dual attention mechanisms: the channel attention mechanism (CAM) and the spatial attention mechanism (SAM). The channel attention mechanism concentrates on assessing and emphasizing the importance of features across different channels, thus enabling the model to highlight features that are crucial for target recognition. Conversely, the spatial attention mechanism sharpens the focus on the spatial distribution of features, thereby reducing interference from background noise.

Figure 7 distinctly displays the structure of the CBAM, which includes both channel and spatial attention components. The channel attention component highlights important features by assessing the significance of each channel. The spatial attention component further improves the spatial distribution of features, effectively minimizing the interference from complex backgrounds. The introduction of this dual attention mechanism enables the network to more effectively concentrate on key targets in the image, such as cotton bolls, thus demonstrating improved performance especially in environments with complex backgrounds or small target sizes.

After implementing the C2f-CBAM replacement in the backbone of YOLOv8, a series of experiments was conducted to demonstrate the effectiveness of this structural improvement in the task of cotton boll detection. The experimental results indicate that the C2f-CBAM structure significantly enhances detection accuracy and stability, particularly exhibiting superior performance in counting and tracking cotton bolls in dynamic environments. Furthermore, the application of this module also demonstrates good generalizability, effectively maintaining high performance levels across different datasets and various environmental conditions.

From both theoretical and practical perspectives, the introduction of CBAM not only optimizes the network structure but also boosts the model’s capability to handle small targets in complex scenes. This approach of improving performance through internal adjustments to the deep learning model’s structure opens up new directions and possibilities for further exploration in the field of visual recognition. This improvement also provides an effective strategy for researchers in related fields, aiming to optimize and enhance the reliability and efficiency of models in practical applications.

2.3.3. Gather-and-Distribute Neck

In real cotton field environments, owing to the complexity of the background, the features of cotton bolls are small and less prominent. The neck structure of the YOLO series, illustrated in Figure 8, employs the traditional FPN structure, comprising multiple branches for multi-scale feature fusion. However, it can only integrate features from adjacent layers fully; for other layers’ information, it is only obtainable indirectly through “recursion”. This process leads to a significant loss of small-scale information during computation, resulting in a high number of missed detections and false detections of cotton bolls. To mitigate the problem of information loss during the information transmission process in the traditional FPN structure, we introduce a new gather-and-distribute (GD) mechanism inspired by Gold-YOLO. Specifically, the gather-and-distribute process involves three modules: the feature alignment module (FAM), the information fusion module (IFM), and the information injection module (Inject).

The gathering process comprises two steps. Initially, FAM collects and aligns features from various levels. Subsequently, IFM merges the aligned features to generate global information. After obtaining the merged global information from the gathering process, the injection module distributes this information to each level and employs simple attention operations for injection, thereby enhancing the detection capabilities of the branches. To improve the model’s ability to detect objects of different sizes, we have developed two branches: the low gather-and-distribute branch (Low-GD) and the high gather-and-distribute branch (High-GD). As illustrated in Figure 9, the input to the neck comprises feature maps B2, B3, B4, and B5 extracted by the backbone, where

B_{i} \in R^{N \times C_{B i} \times R_{B i}}

. The batch size is denoted by N, the number of channels is denoted by C, and the spatial dimensions is denoted by R = H × W. Furthermore, the dimensions of R_B2, R_B3, R_B4, and R_B5 are R/2, R/4, R/4, and R/8, respectively.

Low-Order Aggregation-Distribution Branch: In this branch, Backbone’s output features B2, B3, B4, and B5 are fused to obtain high-resolution features that retain small target information. The structure is shown in Figure 10a.

Low-Feature Alignment Module (Low-FAM)

In the low-feature alignment module (Low-FAM), input features are downsampled using average pooling (AvgPool) operations to achieve a uniform size. By resizing the feature size to the smallest feature size in the group (R_B4 = 1/4R), F_align is obtained.

Low-Information Fusion Module (Low-IFM)

The design of the low-information fusion module (Low-IFM) comprises multiple layers of reparameterized convolution blocks (RepBlock) and split operations. Specifically, RepBlock accepts

F_{a l i g n} (c h a n n e l = s u m (C_{B 2}, C_{B 3}, C_{B 4}, C_{B 5})

as input and generates

F_{f u s e} (c h a n n e l = C_{B 4} + C_{B 5})

The intermediate channels represent adjustable values to accommodate different model sizes. The features produced by RepBlock are then split along the channel dimension into

F_{i n j}_P 3

and

F_{i n j}_P 4

, which are subsequently merged with features from different levels.

F_{a l i g n} = L o w_F A M (B 2, B 3, B 4, B 5)

(1)

F_{a l i g n} = R e p B l o c k (F_{a i g n})

(2)

F_{i n j_P 3}, F_{i n j_P 4} = S p l i t (F_{f u s e})

(3)

Information Injection Module

To inject global information more effectively into different levels, this module utilizes split experience and attention operations to merge information, as illustrated in Figure 10c. The module receives local information (the current level’s features) and global injection information (generated by IFM), represented as

F_{l o c a l}

and

F_{i n j}

. It employs two distinct Conv operations with

F_{i n j}

to produce

F_{l o c a l_e m b e d}

and

F_{a c t}

F_{l o c a l_e m b e d}

is calculated using Conv with

F_{a c t}

. Subsequently, features Fout are merged through attention calculations. Since the dimensions of

F_{g l o b a l}

and

F_{l o c a l}

do not match, the module employs average pooling or bilinear interpolation to resize

F_{g l o b a l_e m b e d}

and Fact according to the size of

F_{i n j}

, ensuring proper alignment. After each attention merge, a RepBlock is incorporated to further extract and integrate information. In the low order,

F_{l o c a l}

is equivalent to

B_{i}

, and the formula is noted as follows:

F_{g l o b a l_a c t_p_{i}} = r e s i z e (S i g m o i d ({C o n v}_{a c t} (F_{i n j_p_{i}})))

(4)

F_{g l o b a l_e m b e d_p_{i}} = r e s i z e ({C o n v}_{g l o b a l_e m b e d_p_{i}} (F_{i n j_p_{i}}))

(5)

F_{a t t_f u s e_p_{i}} = {C o n v}_{g l o b a l_e m b e d_p_{i}} (B_{i}) * F_{g l o b a l_a c t_p_{i}} + F_{g l o b a l_e m b e d_p_{i}}

(6)

P_{i} = R e p B l o c k (F_{a t t_f u s e_p_{i}})

(7)

High Gather-and-Distribute Branch: The High-GD merges features P3, P4, and P5 generated by Low-GD, as shown in Figure 10b.

High-Feature Alignment Module (High-FAM)

The high-feature alignment module (High-FAM) includes average pooling (AvgPool), which reduces the dimensions of input features to a uniform size. Specifically, when the input feature dimensions are

\{R_{P 3}, R_{P 4}, R_{P 5}\}

, AvgPool reduces the feature size to the smallest dimension in the feature group, where

(R_{P 5} = 1 / 8 R)

. As the transformer module extracts high-level information, pooling operations help aggregate data while reducing the computational demand of subsequent transformer module steps.

High-Information Fusion Module (High-IFM)

The high-information fusion module (High-IFM) includes transformer blocks (detailed below) and split operations, comprising three steps: (1)

F_{a l i g n}

obtained from High-FAM, is combined through a transformer to obtain

F_{f u s e}

. (2) The channel number of

F_{f u s e}

is reduced to the sum of

C_{P 4}

and

C_{P 5}

through a Conv 1 × 1 operation. (3)

F_{f u s e}

is split along the channel dimension into

F_{i n j_N 4}

and

F_{i n j_N 5}

, which are subsequently used for fusion with the current level’s features. Formulas are as follows:

F_{a l i g n} = H i g h_F A M ([P 3, P 4, P 5])

(8)

F_{f u s e} = T r a n s f o r m e r (F_{a l i g n})

(9)

F_{i n j_N 4}, F_{i n j_N 5} = S p l i t (C o n v 1 \times 1 (F_{f u s e}))

(10)

The transformer fusion module consists of several stacked transformers, with the number of transformer blocks denoted by L. Each transformer block comprises a multi-head attention block, a feed-forward network (FFN), and residual connections.

Information Injection Module

The information injection module in High-GD is exactly the same as in Low-GD. In the high order,

F_{l o c a l}

is equal to

P_{i}

, so the formula is expressed as follows:

F_{g l o b a l_a c t_N_{i}} = r e s i z e (S i g m o i d ({C o n v}_{a c t} (F_{i n j_N_{i}})))

(11)

F_{g l o b a l_e m b e d_N_{i}} = r e s i z e ({C o n v}_{g l o b a l_e m b e d_N_{i}} (F_{i n j_N_{i}}))

(12)

F_{a t t_f u s e_N_{i}} = {C o n v}_{g l o b a l_e m b e d_N_{i}} (B_{i}) * F_{g l o b a l_a c t_N_{i}} + F_{g l o b a l_e m b e d_N_{i}}

(13)

N_{i} = R e p B l o c k (F_{a t t_f u s e_N_{i}})

(14)

In this paper, we replaced the original FPN structure in the YOLOv8 neck with an aggregation and distribution mechanism (GD). To further enhance the interconnectivity of cross-layer information, we added two feature alignment modules (FAM) in the low-frequency aggregation and distribution branch, with each module receiving three inputs. The formula can be expressed as follows:

F_{i} = L o w_F A M (B_{i - 1}, B_{i}, C o n v (B_{i + 1}))

(15)

where the variable is used as in Equation (6). Specifically, as shown in Figure 10c, the variable is used as the

x_{l o c a l}

input in the information injection module. Therefore, in this paper, Equation (6) should be modified as follows:

F_{a t t_f u s e_p_{i}} = {C o n v}_{g l o b a l_e m b e d_p_{i}} (P_{i}) * F_{g l o b a l_a c t_p_{i}} + F_{g l o b a l_e m b e d_p_{i}}

(16)

Similarly, in the high-frequency aggregation and distribution branch, we also added two additional feature alignment modules and introduced the C2f structure after all the information injection modules. The formula can be expressed as follows:

F_{i} = H i g h_F A M ([C 2 f (P_{i}), C o n v (B_{i + 1})])

(17)

where

F_{i}

is used as

B_{i}

in Equation (13). Therefore, in this paper, Equation (13) should be modified as follows:

F_{a t t_f u s e_N_{i}} = {C o n v}_{g l o b a l_e m b e d_N_{i}} (F_{i}) * F_{g l o b a l_a c t_N_{i}} + F_{g l o b a l_e m b e d_N_{i}}

(18)

Through these improvements, the effectiveness of information fusion and transmission is strategically enhanced, thereby better addressing the issues of false positives and missed detections in cotton boll detection within complex environments and improving the model’s detection accuracy.

2.3.4. Improved Loss Function

In real farmland environments, the proportion of small objects in cotton boll detection tasks is considerably high. A well-designed loss function can significantly enhance the model’s detection performance. YOLOv8 uses DFL and CIoU to calculate the regression loss of the bounding boxes; however, CIoU has the following drawbacks. First, CIoU does not account for the balance between hard and easy samples. Second, CIoU uses the aspect ratio as one of the penalties in the loss function. If the aspect ratios of the actual and predicted boxes are identical but the values of width and height differ, the penalty cannot accurately reflect the true difference between the two boxes. Third, the calculation of CIoU involves an arctangent function, increasing the computational burden on the model. The CIoU calculation formula is presented in Equation (19):

L_{C I o U} = 1 - I o U + \frac{ρ^{2} (b, b^{g t})}{{(c_{ω})}^{2} + {(c_{h})}^{2}} + \frac{4}{π^{2}} ({t a n}^{- 1} \frac{ω^{g t}}{h^{g t}} - {t a n}^{- 1} \frac{ω}{h})

(19)

In Equation (19), the intersection over union (IoU) is defined as the ratio of intersection to union between the predicted box and the actual box. Parameters referenced in Equation (19) are depicted in Figure 11. Here,

ρ (b, b^{g t})

denotes the Euclidean distance between the centroids of the true box and the predicted box;

h

and

ω

denote the height and the predicted values, respectively;

h^{g t}

and

ω^{g t}

represent the height and width of the true box; and

c_{h}

and

c_{ω}

denote the height and width of the minimum enclosing box that includes both the predicted box and the actual box.

EIoU [25] advances beyond CIoU by incorporating the length and width as penalty terms, thus addressing the differences in width and height between the actual box and the predicted box, thereby offering a more rational penalty than those employed in CIoU. The formula for calculating EIoU is presented in Equation (20):

L_{E I o U} = 1 - I o U + \frac{ρ^{2} (b, b^{g t})}{{(c_{ω})}^{2} + {(c_{h})}^{2}} + \frac{ρ^{2} (ω, ω^{g t})}{{(c_{ω})}^{2}} + \frac{ρ^{2} (h, h^{g t})}{{(c_{h})}^{2}}

(20)

Parameters relevant to Equation (20) are illustrated in Figure 11;

ρ (ω, ω^{g t})

and

ρ (h, h^{g t})

denote the Euclidean distances in width and height, respectively, between the true box and the predicted box; and

({b_{c_{x}}, b}_{c_{y}})

and

({b_{c_{y}}^{g t}, b}_{c_{y}}^{g t})

indicate the coordinates of the center points of the actual box and the predicted box, respectively.

SIoU [26] incorporates the angle between the predicted box and the actual box as a penalty factor for the first time. Initially, the relationship between the angle size θ and parameter α, as shown in Figure 11, causes the predicted box to rapidly align with the nearest axis before regressing towards the actual box. SIoU constrains the degrees of freedom in regression, thereby accelerating the model’s convergence.

The mainstream loss functions previously discussed utilize a static focusing mechanism. In contrast, WIoU not only takes into account area, centroid distance, and overlapping area but also introduces a dynamic non-monotonic focusing mechanism. WIoU employs a reasonable gradient gain distribution strategy to evaluate the quality of anchor boxes. Tong et al. introduced three variants of WIoU. WIoU v1 utilizes an attention-based prediction box damage design, while WIoU v2 and WIoU v3 incorporate focusing coefficients. WIoU v1 employs distance as a metric for attention. When the target and prediction boxes overlap within a specific range, reducing the penalty on geometric metrics facilitates better model generalization. The formulas for calculating WIoU v1 are presented in Equations (19) to (21).

L_{W I o U v 1} = R_{W I o U} \times L_{I o U}

(21)

R_{W I o U} = e x p (\frac{{(b_{c_{x}}^{g t} - b_{c_{x}})}^{2} + {(b_{c_{y}}^{g t} - b_{c_{y}})}^{2}}{({c_{w}}^{2} + {c_{h}}^{2})})

(22)

L_{I o U} = 1 - I o U

(23)

By constructing a monotonic focusing coefficient

L_{I o U}^{*}

, WIoU v1 is integrated into WIoU v2, effectively reducing the weight of simple examples in the loss value. However, given that

L_{I o U}^{*}

decreases as

L_{I o U}

decreases during the model training process, representing a factor that contributes to slower convergence rates, the average value of

L_{I o U}

is introduced to normalize

L_{I o U}^{*}

. The calculation formula for WIoU v2 is presented in Equation (24).

L_{W I o U v 2} = {(\frac{L_{I o U}^{*}}{\bar{L_{I o U}}})}^{γ} \times L_{W I o U v 1}, γ > 0

(24)

WIoU v3 defines the outlier

β

to assess the quality of anchor boxes and constructs a non-monotonic focusing factor

r

, which is incorporated into WIoU v1. The smaller the

β

value is, the higher the anchor box quality. Conversely, the smaller the

r

value is, the less significant the weight of high-quality anchor boxes within a broader loss function. Larger

β

values denote poorer anchor box quality, resulting in reduced gradient gains assigned to these, thereby diminishing the detrimental gradients generated by low-quality anchor boxes. WIoU v3 employs a well-calibrated gradient gain distribution strategy to dynamically adjust the weights of high-quality and low-quality anchor boxes in the loss, thereby shifting the model’s focus towards average quality samples and enhancing overall performance. The formulas for WIoU v3 are presented in Equations (25)–(27). The variables

β

and

r

in Equation (26) are adjustable hyperparameters, enabling adaptation to various models.

L_{W I o U v 3} = r \times L_{W I o U v 1}

(25)

r = \frac{β}{δ α^{β - δ}}

(26)

β = \frac{L_{I o U}^{*}}{\bar{L_{I o U}}} \in [0, + \infty)

(27)

Comparison of the aforementioned mainstream loss functions reveals that WIoU v3 has significant advantages in target boundary box regression loss. Consequently, WIoU v3 is chosen for introduction. First, WIoU v3 integrates some advantages of EIoU and SIoU, aligning with the design philosophy of a superior loss function. Secondly, WIoU v3 employs a dynamic non-monotonic mechanism to evaluate the quality of anchor boxes, enabling the model to focus more on anchor boxes of average quality and enhancing the model’s object localization capability. For the detection task of cotton bolls in complex environments, the presence of small cotton bolls complicates detection. WIoU v3 dynamically optimizes the loss weights of small objects, thereby improving the model’s detection performance.

2.4. ByteTrack

Multiple object tracking (MOT) is a critical computer vision task involving the identification of the movement of multiple targets within a video clip, even in complex backgrounds [27,28]. The goal is to identify, locate, and track all cotton bolls in the video, even if they are partially or completely obscured by other elements in the scene. Typically, MOT tasks comprise two steps, including object detection and object association, as shown in the Figure 12. Object detection involves identifying all potential targets in the current frame using a detector. Object association entails matching the detected targets in the current frame with trajectories from previous frames and predicting the targets’ positions in the next frame.

ByteTrack is an algorithm employed in MOT that effectively handles occlusions and interleaving of targets while maintaining efficiency [29]. This algorithm is primarily based on the YOLO series of object detection models. Through associating detected targets, it achieves continuous tracking of the same target in video sequences. The core idea of ByteTrack is to use a simple yet efficient byte tracking strategy that improves tracking stability and accuracy by minimizing ID switches and handling occlusions.

In practical applications, ByteTrack first uses deep learning-based methods to identify targets in the video and then matches them across consecutive frames to construct target trajectories. This method is particularly suitable for dynamic environments, such as cotton boll counting in agriculture, because it can effectively track and count multiple moving targets in complex backgrounds. Through integration with YOLOv8, ByteTrack not only enhances the accuracy of object detection but also ensures the continuity of targets in the video through dynamic tracking. This provides a highly practical solution for real-time video data analysis and processing, especially in dynamic agricultural environments. Furthermore, the implementation of this method holds significant importance for advancing precision agriculture and intelligent crop management. It can help agricultural workers to more accurately assess crop growth conditions and yield.

3. Results

3.1. Experimental Environment

The improved model proposed in this paper, along with all comparison models, was utilized to perform cotton boll detection on a GPU server. Table 2 presents the experimental setup, with a total of 1274 images of cotton bolls from different angles employed for model training. The optimizer chosen for the object detection algorithms in this experiment was AdamW, with a pulse value set to 0.937. The initial learning rate was set to 0.01, and the weight decay was set to 0.0005.

3.2. Evaluation Index

Precision, recall, and mean average precision (AP) metrics are utilized to evaluate the detection performance of COTTON-YOLO. Precision is the probability that a predicted positive sample is actually positive. It measures the accuracy of the predictions for positive samples. Recall is the probability that an actual positive sample is predicted to be positive, representing the model’s sensitivity.

p r e c i s i o n = \frac{T P}{T P + F P}

(28)

r e c a l l = \frac{T P}{T P + F N}

(29)

True positive (TP) refers to the correct prediction of a cotton boll target. False positive (FP) indicates a background or other feature incorrectly predicted as a cotton boll. False negative (FN) refers to a cotton boll target incorrectly classified as another feature. Setting precision and recall as the vertical and horizontal axes, respectively, produces the precision–recall curve, also known as the P–R curve. The area under the P–R curve is referred to as average precision (AP). The formulas are as shown in (30) to (31):

A P = \int_{0}^{1} P (R) d R

(30)

A P_{50 : 95} = \frac{1}{10} (A P_{50} + A P_{55} + \dots + A P_{90} + A P_{95})

(31)

To describe mean average precision (mAP), we represent the AP for a single category as the sum of the AP values for each category, given that our model focuses exclusively on cotton bolls. AP₅₀ represents the average precision at different recall levels with IoU = 0.5. IoU measures the extent of overlap between the predicted target and the actual target. AP_50:95 represents the average of the ten AP values at IoU thresholds ranging from 0.50 to 0.95 in increments of 0.05 (AP₅₀, AP₅₅, …, AP₉₅). GFLOPs indicate the number of floating-point operations per second (in billions) performed by the model on a GPU.

To evaluate the model’s performance in counting cotton bolls, we utilize the following metrics: mean absolute error (MAE), mean absolute percentage error (MAPE), coefficient of determination (R²), and root mean squared error (RMSE). MAE reflects the average difference between the actual number of cotton bolls and the counted number of cotton bolls; the smaller the MAPE is, the closer the model’s performance is to perfection. RMSE, based on mean squared error (MSE), measures the square root of the deviation between observed and actual values and is often used as a performance metric in machine learning models. The formulas are as shown in (32) to (35):

M A E = \frac{1}{n} \sum_{1}^{n} |m_{i} - c_{i}|

(32)

M A P E = \frac{1}{n} \sum_{1}^{n} |\frac{m_{i} - c_{i}}{m_{i}}| \times 100 %

(33)

R^{2} = 1 - \frac{\sum_{1}^{n} {(m_{i} - c_{i})}^{2}}{\sum_{1}^{n} {(m_{i} - {\overline{m}}_{i})}^{2}}

(34)

R M S E = \sqrt{\frac{\sum_{1}^{n} {(m_{i} - c_{i})}^{2}}{n}}

(35)

where

m_{I}

,

{\overline{m}}_{i}

, and

c_{i}

represent the actual count in the annotation file for the ith image, the average actual count for the ith image, and the predicted count for the ith image, respectively. Here,

n

denotes the number of test images.

3.3. Model Performance and Comparison

3.3.1. Comparison of Experimental Results of Different IoU Methods

In this paper, we designed a series of experiments to verify the accuracy of Section 2.3.4, comparing the effects of different IoU methods on YOLOv8n, including the standard CIoU, DIoU [30], EIoU, SIoU, and WIoU, to evaluate the performance and accuracy of the model when using these different IoU methods, The specific comparative experimental results are shown in Table 3.

3.3.2. Ablation Experiment

This study on the COTTON-YOLO model included a series of ablation experiments to evaluate the impact of various data augmentation techniques and architectural improvements on model performance. Figure 13 depicts the comparison between COTTON-YOLO and the baseline model YOLOv8 in detecting cotton bolls under various weather conditions. In clear weather conditions, both YOLOv8 and COTTON-YOLO successfully detected all the cotton bolls. Under overcast conditions, YOLOv8 failed to detect some cotton bolls at the edges of the image, whereas COTTON-YOLO successfully detected all the cotton bolls. In rainy conditions, YOLOv8 failed to detect the smaller cotton bolls. COTTON-YOLO exhibits superior confidence levels in the detection of targets, exhibiting neither misses nor false positives, thus ensuring high accuracy and reliability in its operational performance. In contrast, YOLOv8, when analyzing the same images, encountered challenges related to false positives, thereby reducing its precision. Consequently, considering the critical assessment of accuracy, reliability, and the incidence of false detections, it is evident that COTTON-YOLO substantially surpasses YOLOv8 in the detection efficacy of cotton bolls.

Table 4 details the results of these experiments, highlighting the specific contributions of different configuration combinations to precision and recall. A checkmark (√) denotes the methods or modules that were enabled. Implementing data augmentation methods, particularly padding techniques, increased AP₅₀ from 86.8% to 92.0% and AP_50:95 from 50.9% to 59.4%. This significant improvement demonstrates the effectiveness of this method in alleviating the difficulty of detecting small targets. Subsequently, by replacing the C2F bottleneck module with the CBAM module (C2F-CBAM), we further enhanced the model’s multi-scale feature extraction capability, raising AP₅₀ to 93.0% and AP_50:95 to 59.7%. Replacing the standard neck part of YOLOv8 with the Gold YOLO neck additionally improved the model’s performance, increasing AP₅₀ from 92.0% to 92.5% and AP_50:95 from 59.4% to 61.0%. Replacing the CIoU loss function in YOLOv8 with WIoU v3 resulted in a final AP₅₀ of 95.8% and AP_50:95 of 65.7%. These results indicate that WIoU v3 has excellent bounding box localization performance in object detection. With these improvements, the COTTON-YOLO model showed significant enhancements in precision, recall, and AP metrics. These results not only demonstrate the effectiveness of individual improvements but also underscore their positive impact on model performance when used in combination. The figures illustrate the detection performance of the COTTON-YOLO model on small cotton boll targets under different environmental conditions, thereby further validating its effectiveness in practical applications.

3.3.3. Comparison of Different Mainstream Models

To validate the performance of the proposed COTTON-YOLO algorithm in cotton boll detection applications, we compared it with several popular current object detection models. These models include various versions of YOLO and architectures derived from DETR. The evaluation is primarily based on key performance indicators, including precision, recall, AP₅₀, and AP_50:95, while also considering efficiency metrics such as the number of parameters and model size. The specific model comparison experimental results are shown in Table 5. YOLOv3-tiny and YOLOv5s are lightweight models; despite having fewer parameters, their performance on AP_50:95 (53.5% and 53.8%, respectively) does not reach that of COTTON-YOLO. YOLOv6s and YOLOv8m performed well in terms of precision and recall, but COTTON-YOLO excelled in AP_50:95 with a score of 65.7%, demonstrating better overall performance. The YOLOv8-DETR models, based on DETR [31], despite their innovative structures, fell short in all performance metrics compared to COTTON-YOLO, particularly in AP_50:95, with scores of 42.6% and 43.1%, respectively. YOLOv9-gelan performed excellently among the comparison models, but COTTON-YOLO still outperformed it in AP₅₀ and AP_50:95 metrics, despite YOLOv9-gelan’s larger model size. COTTON-YOLO exhibited outstanding detection capabilities, particularly in AP_50:95, achieving 65.7%, significantly higher than other models, while maintaining a low parameter count (6.1 M) and a small model size (12.4 MB). This result demonstrates that COTTON-YOLO has significant advantages in both accuracy and the ability to handle complex backgrounds when processing the specific task of cotton boll detection. The comparative analysis clearly illustrates the significant advantages of COTTON-YOLO over existing detection models in terms of accuracy, efficiency, and applicability. Its high precision, high recall rate, and excellent AP_50:95 performance make it particularly suitable for real-time cotton boll detection tasks in the agricultural field, providing stable and reliable performance in dynamic agricultural environments.

3.4. Model Static Count Assessment

In static count testing for cotton boll detection, 122 images were randomly selected from the test set, and detections were conducted using the YOLOv8 and the improved COTTON-YOLO models. The detection results displayed in Figure 14 demonstrate that the improved COTTON-YOLO model excels in detecting varying numbers of cotton bolls. The original YOLOv8 model exhibited a coefficient of determination (R²) of 0.543, which increased to 0.853 with the improved COTTON-YOLO model. This significant improvement suggests a substantial enhancement in the COTTON-YOLO model’s ability to explain variability in the data. Moreover, compared to the original YOLOv8 model, the improved COTTON-YOLO model demonstrated significant advantages in terms of root mean square error (RMSE) and mean absolute error (MAE). The MAE was reduced from 1.231 to 0.587, and the RMSE decreased from 1.779 to 1.008. These improvements in metrics, by 52.34% and 43.39%, respectively, further demonstrate the efficacy of the COTTON-YOLO model in enhancing cotton boll detection performance. Overall, the COTTON-YOLO model not only significantly improves the accuracy of cotton boll detection but also highlights its practical application value in cotton field management, thus providing important technical support for the management and monitoring of cotton growth. The results are depicted in Figure 15 and Table 6.

3.5. Dynamic Model Count Evaluation

When using dynamic videos as input for COTTON-YOLO, the extracted cotton boll information is subsequently fed into ByteTrack. Each detected cotton boll is assigned a unique tracking number, and the corresponding cotton boll data is linked across different video results. The dynamic detection results are depicted in Figure 16. In the figure, red boxes signify the bouncing cotton boll IDs, while blue boxes denote missed detections.

The video was recorded by a handheld camera moving at approximately 0.15 m/s, with a frame rate (FPS) of 30 f/s, which implies each frame appears for about 33.3 ms. The total processing time for COTTON-YOLO detection and ByteTrack for target matching and status updates amounts to 13.9 ms. Each cotton boll remains visible in the video for about 33.3 ms, exceeding the time required for target detection and tracking, thus meeting the real-time detection requirements of the input video. However, the blue boxes indicate a noticeable occurrence of missed detections.

Additionally, instances of cotton boll ID switching occur within the same image, often due to the target tracker mistakenly associating two or more different cotton bolls as the same one, causing the ID to jump from a higher to a lower number. Conversely, the similarity in appearance or movement of some cotton bolls leads to confusion and ID switches, resulting in an undercount. When targets re-enter the field of view or the tracker reinitializes after a failure, the same target might be counted multiple times. This situation typically occurs when a cotton boll briefly leaves the field of view or becomes obscured and then reappears, with the tracker unable to recognize it as the same target, thus reassigning a new ID. False cotton boll counts increase when the detector mistakenly identifies non-existent cotton bolls as actual targets, and complex backgrounds or changes in lighting may lead the detector to mistakenly identify background objects or shadows as cotton bolls, resulting in an overcount. In the final frame of the video, the highest detected cotton boll ID is 65, while manual counting averaged over three attempts yields a total of approximately 58 cotton bolls in the video, corresponding to a target detection tracking accuracy of about 87.93%. The maximum ID for cotton bolls tracked using YOLOv8 and ByteTrack is 34, with an accuracy of about 58.62%. Therefore, COTTON-YOLO combined with ByteTrack proves more effective in tracking and counting cotton bolls. The comparative results are depicted in Figure 17.

4. Discussion

The results of this study provide substantial insights into the field of cotton boll detection, enhancing our comprehension of how advanced machine learning models can be optimized to manage complex agricultural datasets. By implementing various modifications, including data augmentation, integration of the C2F-CBAM module, and introduction of the novel WIoU v3 loss function, the COTTON-YOLO model demonstrated significant improvements in detection accuracy, especially in challenging conditions marked by small targets and complex backgrounds.

The enhancements observed in the COTTON-YOLO model are consistent with findings from previous studies that have emphasized the importance of robust feature extraction and precise localization in improving object detection models. For instance, the use of attention mechanisms, as demonstrated in the CBAM module, has been previously validated for its effectiveness in focusing model learning on relevant features, thus reducing background noise and enhancing the detection of small objects. Moreover, the success of advanced loss functions like WIoU v3 in this study demonstrates ongoing research suggesting that dynamic adjustments to loss calculations can lead to more accurate and stable object detection models.

The findings suggest that even in the intricate settings of cotton boll detection, where the targets vary greatly in size and may be obscured by environmental elements, carefully designed models equipped with sophisticated architectural enhancements can achieve high levels of precision. This has significant implications for the agricultural sector, especially in automated and precision farming, where such technologies can contribute to better crop management and increased efficiency. The ability to accurately detect and count cotton bolls using the COTTON-YOLO model paves the way for more informed decisions in agricultural practices, potentially resulting in increased yields and improved resource management.

While the current modifications have led to significant improvements, future research could explore the integration of additional sensory data, including multispectral or hyperspectral imaging, to further enhance detection capabilities under varied environmental conditions. Furthermore, exploring the scalability of the model to other agricultural detection tasks may provide a broader application scope. Another potential direction could involve the development of real-time learning algorithms that adapt to changes in crop appearance throughout different growth stages, providing invaluable assistance for continuous monitoring and intervention in crop management. In conclusion, the enhancements made in the COTTON-YOLO model not only underscore the potential of machine learning in agricultural applications but also highlight the importance of continuous innovation in the fields of computer vision and object detection. By further refining these models and expanding their applications, significant advances can be achieved in the efficiency and effectiveness of agricultural practices.

Author Contributions

Z.L.: Data collection, algorithm design, experiments, and writing; J.Z.: Supervision and writing review; B.H.: Data collection and data organization; L.D.: Investigation and validation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Xinjiang Uygur Autonomous Region Major Science and Technology Project “Research on Key Technologies for Farm Digitalization and Intelligentization” [2022A02011-2]; Science and Technology Innovation 2030—“New Generation Artificial Intelligence” Major Project [2022ZD0115805].

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Acknowledgments

The authors are very grateful to the editor and reviewers for their valuable comments and suggestions to improve the paper.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Erokhin, V.; Diao, L.; Du, P. Sustainability-related implications of competitive advantages in agricultural value chains: Evidence from Central Asia—China trade and investment. Sustainability 2020, 12, 1117. [Google Scholar] [CrossRef]
Shi, G.; Du, X.; Du, M.; Li, Q.; Tian, X.; Ren, Y.; Zhang, Y.; Wang, H. Cotton yield estimation using the remotely sensed cotton boll index from UAV images. Drones 2022, 6, 254. [Google Scholar] [CrossRef]
Li, G.; Huang, Y.; Chen, Z.; Chesser, G.D., Jr.; Purswell, J.L.; Linhoss, J.; Zhao, Y. Practices and applications of convolutional neural network-based computer vision systems in animal farming: A review. Sensors 2021, 21, 1492. [Google Scholar] [CrossRef] [PubMed]
Tian, H.; Wang, T.; Liu, Y.; Qiao, X.; Li, Y. Computer vision technology in agricultural automation—A review. Inf. Process. Agric. 2020, 7, 1–19. [Google Scholar] [CrossRef]
Dhanya, V.G.; Subeesh, A.; Kushwaha, N.L.; Vishwakarma, D.K.; Kumar, T.N.; Ritika, G.; Singh, A.N. Deep learning based computer vision approaches for smart agricultural applications. Artif. Intell. Agric. 2022, 6, 211–229. [Google Scholar] [CrossRef]
Han, B.; Lu, Z.; Dong, L.; Zhang, J. Lightweight Non-Destructive Detection of Diseased Apples Based on Structural Re-Parameterization Technique. Appl. Sci. 2024, 14, 1907. [Google Scholar] [CrossRef]
Chen, J.; Hu, X.; Lu, J.; Chen, Y.; Huang, X. Efficient and Lightweight Automatic Wheat Counting Method with Observation-Centric SORT for Real-Time Unmanned Aerial Vehicle Surveillance. Agriculture 2023, 13, 2110. [Google Scholar] [CrossRef]
Spetale, F.E.; Murillo, J.; Vazquez, D.V.; Cacchiarelli, P.; Rodriguez, G.R.; Tapia, E. LocAnalyzer: A computer vision method to count locules in tomato fruits. Comput. Electron. Agric. 2020, 173, 105382. [Google Scholar] [CrossRef]
Wang, B.; Yan, Y.; Lan, Y.; Wang, M.; Bian, Z. Accurate detection and precision spraying of corn and weeds using the improved YOLOv5 model. IEEE Access 2023, 11, 29868–29882. [Google Scholar] [CrossRef]
Zhang, C.; Li, T.; Zhang, W. The detection of impurity content in machine-picked seed cotton based on image processing and improved YOLO V4. Agronomy 2021, 12, 66. [Google Scholar] [CrossRef]
Xu, Z.; Latif, M.A.; Madni, S.S.; Rafiq, A.; Alam, I.; Habib, M.A. Detecting white cotton bolls using high-resolution aerial imagery acquired through unmanned aerial system. IEEE Access 2021, 9, 169068–169081. [Google Scholar] [CrossRef]
Feng, Y.; Chen, W.; Ma, Y.; Zhang, Z.; Gao, P.; Lv, X. Cotton Seedling Detection and Counting Based on UAV Multispectral Images and Deep Learning Methods. Remote Sens. 2023, 15, 2680. [Google Scholar] [CrossRef]
Lin, Z.; Guo, W. Cotton stand counting from unmanned aerial system imagery using mobilenet and centernet deep learning models. Remote Sens. 2021, 13, 2822. [Google Scholar] [CrossRef]
Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
Liang, Z.; Cui, G.; Xiong, M.; Li, X.; Jin, X.; Lin, T. YOLO-C: An Efficient and Robust Detection Algorithm for Mature Long Staple Cotton Targets with High-Resolution RGB Images. Agronomy 2023, 13, 1988. [Google Scholar] [CrossRef]
Liu, Q.; Zhang, Y.; Yang, G. Small unopened cotton boll counting by detection with MRF-YOLO in the wild. Comput. Electron. Agric. 2023, 204, 107576. [Google Scholar] [CrossRef]
Al Muzaddid, M.A.; Beksi, W.J. NTrack: A Multiple-Object Tracker and Dataset for Infield Cotton Boll Counting. IEEE Trans. Autom. Sci. Eng. 2023. [Google Scholar] [CrossRef]
Zhang, M.; Chen, W.; Gao, P.; Li, Y.; Tan, F.; Zhang, Y.; Ruan, S.; Xing, P.; Guo, L. YOLO SSPD: A small target cotton boll detection model during the boll-spitting period based on space-to-depth convolution. Front. Plant Sci. 2024, 15, 1409194. [Google Scholar] [CrossRef]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. Adv. Neural Inf. Process. Syst. 2024, 36. [Google Scholar]
Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
Ciaparrone, G.; Sánchez, F.L.; Tabik, S.; Troiano, L.; Tagliaferri, R.; Herrera, F. Deep learning in video multi-object tracking: A survey. Neurocomputing 2020, 381, 61–88. [Google Scholar] [CrossRef]
Zheng, L.; Tang, M.; Chen, Y.; Zhu, G.; Wang, J.; Lu, H. Improving multiple object tracking with single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2453–2462. [Google Scholar]
Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933. [Google Scholar] [CrossRef]
Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]

Figure 1. Examples of cotton boll images: (a) 11:00–12:00 (sunny), (b) 11:00–12:00 (cloudy), (c) 11:00–12:00 (rainy), (d) 13:00–14:00 (sunny), (e) 13:00–14:00 (cloudy), (f) 13:00–14:00 (rainy).

Figure 2. Dataset amplification methods: (a) Raw image, (b) Brightness adjustment, (c) Noise increase, (d) CutOut enhancement, (e) Rotation by 30 degrees, (f) Rotation by 60 degrees, (g) Cropping, (h) Translation, (i) Mirroring.

Figure 3. Cotton boll size distribution of the dataset.

Figure 4. Standard YOLOv8 model structure.

Figure 5. COTTON-YOLO structure.

Figure 6. Data enhancement strategies: (a) Raw image, (b) Random hybrid enhancement, (c) Random hybrid enhancement.

Figure 7. Convolutional block attention module structure.

Figure 8. Traditional neck.

Figure 9. The architecture of the proposed gold-YOLO.

Figure 10. The gather-and-distribute structure. (a) Low-GD. (b) High-GD. (c) Information injection module.

Figure 11. Diagram of the IoU principle.

Figure 12. Proposed structure of COTTON-YOLO detection and tracking of cotton.

Figure 13. Comparison of models under different weather conditions: (a,d) are clear weather; (b,e) are cloudy; (c,f) are rainy. (a–c) use YOLOv8, while (d–f) use COTTON-YOLO.

Figure 14. Image detection contrast effect: (a) Original image, (b)YOLOv8, (c) COTTON-YOLO.

Figure 15. Cotton count detection comparison chart.

Figure 16. Video frame comparison: (a) Original image, (b) YOLOV8-ByteTrack (c) COTTON-YOLO-ByteTrack.

Figure 17. Cotton dynamic count detection comparison chart.

Table 1. Cotton image dataset.

Weather Situation	Number of Original Samples	The Number of Samples after Image Amplification
Sunny	147	441
Cloudy	401	1203
Rainy	59	177

Table 2. Experimental environment configuration.

Configuration	Parameter
CPU	14 vCPU Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
GPU	RTX 3090(24GB)
Operating system	Windows 11
Accelerated environment	CUDA 11.8
Development environment	Pycharm 2018
Libraries	PyTorch 2.0.0

Table 3. Comparison of experimental results of different IoU methods.

IoU	Precision	Recall	AP₅₀	AP_50:95
CIoU	87.3	84.8	92.0	59.4
DIoU	87.7	82.9	91.9	55.4
EIoU	85.3	83.1	91.2	54.8
SIoU	84.3	85.0	91.6	54.9
WIoU v1	86.6	84.8	91.7	54.4
WIoU v2	88.2	84.5	92.3	55.5
WIoU v3	91.2	85.1	92.2	60.7

Table 4. Ablation experiment.

Data Enhancement	C2f-CBAM	GD Neck	WIoU v3	Precision	Recall	AP₅₀	AP_50:95
				83.3	79.8	86.8	50.9
√				87.3	84.8	92.0	59.4
√	√			89.8	87.0	93.0	59.7
√		√		89.0	86.2	92.5	61.0
√			√	91.2	85.1	92.2	60.7
√	√	√		92.1	84.4	93.1	62.0
√	√	√	√	94.1	91.3	95.8	65.7

Table 5. Comparison of different mainstream models.

	AP₅₀	AP_50:95	Parameters	Weight Size
Faster-RCNN	92.6	54.0	-	315.2 MB
Cascade-RCNN	92.6	54.8	-	527.6 MB
FCOS [32]	93.1	54.7	-	244.7 MB
YOLO3-tiny	90.0	53.5	12.1 M	24.4 MB
YOLOv5s	89.1	53.8	9.1 M	18.5 MB
YOLOv6s	89.7	55.3	16.4 M	32.8 MB
YOLOv8m	89.9	56.4	25.9 M	52.0 MB
YOLOv8-DETR	78.1	42.6	12.9 M	26.2 MB
YOLOv9-gelan [33]	95.2	61.7	25.2 M	51.5 MB
COTTON-YOLO	95.8	65.7	6.1 M	12.4 MB

Table 6. Effect comparison between YOLOv8 and COTTON-YOLO.

Model	RMSE	MAE	MAPE	R²
YOLOv8	1.779	1.231	24.41%	0.543
COTTON-YOLO	1.008	0.587	14.01%	0.853

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lu, Z.; Han, B.; Dong, L.; Zhang, J. COTTON-YOLO: Enhancing Cotton Boll Detection and Counting in Complex Environmental Conditions Using an Advanced YOLO Model. Appl. Sci. 2024, 14, 6650. https://doi.org/10.3390/app14156650

AMA Style

Lu Z, Han B, Dong L, Zhang J. COTTON-YOLO: Enhancing Cotton Boll Detection and Counting in Complex Environmental Conditions Using an Advanced YOLO Model. Applied Sciences. 2024; 14(15):6650. https://doi.org/10.3390/app14156650

Chicago/Turabian Style

Lu, Ziao, Bo Han, Luan Dong, and Jingjing Zhang. 2024. "COTTON-YOLO: Enhancing Cotton Boll Detection and Counting in Complex Environmental Conditions Using an Advanced YOLO Model" Applied Sciences 14, no. 15: 6650. https://doi.org/10.3390/app14156650

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

COTTON-YOLO: Enhancing Cotton Boll Detection and Counting in Complex Environmental Conditions Using an Advanced YOLO Model

Abstract

1. Introduction

2. Materials and Methods

2.1. Construction of Cotton Boll Detection Dataset

2.1.1. Data Acquisition and Labeling

2.1.2. Data Preprocessing and Enhancement

2.2. Standard YOLOv8 Model

2.3. COTTON-YOLO Structure

2.3.1. Data Preprocessing and Enhancement

2.3.2. C2f-CBAM Module

2.3.3. Gather-and-Distribute Neck

2.3.4. Improved Loss Function

2.4. ByteTrack

3. Results

3.1. Experimental Environment

3.2. Evaluation Index

3.3. Model Performance and Comparison

3.3.1. Comparison of Experimental Results of Different IoU Methods

3.3.2. Ablation Experiment

3.3.3. Comparison of Different Mainstream Models

3.4. Model Static Count Assessment

3.5. Dynamic Model Count Evaluation

4. Discussion

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI