Article

CBLN-YOLO: An Improved YOLO11n-Seg Network for Cotton Topping in Fields

by Yufei Xie 1 and Liping Chen 1,2,3,*
1 College of Information Engineering, Tarim University, Alaer 843300, China
2 Key Laboratory of Tarim Oasis Agriculture, Tarim University, Ministry of Education, Alaer 843300, China
3 Key Laboratory of Modern Agricultural Engineering, Tarim University, Alaer 843300, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(4), 996; https://doi.org/10.3390/agronomy15040996
Submission received: 17 March 2025 / Revised: 17 April 2025 / Accepted: 19 April 2025 / Published: 21 April 2025
(This article belongs to the Collection AI, Sensors and Robotics for Smart Agriculture)

Abstract

The positioning of the top bud by the topping machine in cotton topping operations depends on the recognition algorithm. The detection results of traditional target detection algorithms contain a large amount of useless information, which is not conducive to positioning the top bud. To obtain a more efficient recognition algorithm, we propose a top bud segmentation algorithm, CBLN-YOLO, based on the YOLO11n-seg model. First, the standard convolution and multihead self-attention (MHSA) mechanisms in YOLO11n-seg are replaced by linear deformable convolution (LDConv) and the coordinate attention (CA) mechanism to reduce the parameter growth rate of the original model and better mine detailed features of the top buds. In the neck, the feature pyramid network (FPN) is reconstructed using an enhanced interlayer feature correlation (EFC) module, and the regression loss is calculated using the Inner CIoU loss function. When tested on a self-built dataset, the mAP@0.5 values of CBLN-YOLO for detection and segmentation are 98.3% and 95.8%, respectively, which are higher than those of traditional segmentation models. CBLN-YOLO also shows strong robustness under different weather conditions and time periods, and its recognition speed reaches 135 frames per second, which provides strong support for cotton top bud positioning in the field environment.

1. Introduction

As one of the most important economic crops in China, cotton has a yield and quality that directly affect the efficiency of agricultural production. According to data released by the National Bureau of Statistics of China on 25 December 2024, China's cotton production reached 6.164 million tons in 2024, ranking first in the world in total production [1]. Cotton topping is a key step in increasing cotton yield: by removing the top buds, plant height can be effectively controlled and the growth of lateral branches promoted, thereby increasing yield [2]. Traditional cotton topping mainly relies on manual operation [3], which is inefficient, costly, and makes it difficult to ensure topping accuracy. With the development of agricultural automation technology, mechanical topping based on machine vision has become a research hotspot [4,5], and accurate recognition of the cotton top bud is one of its key technologies.
In recent years, target detection algorithms based on deep learning have made significant progress in the field of cotton top bud recognition. Zhang X and Chen L [6] proposed a lightweight detection network based on YOLOv8n to improve the operational efficiency and accuracy of cotton topping robots. This network effectively enhances both the precision and speed of generating anchor boxes, achieving an FPS of 69.3 and an mAP@0.5 of 99.2% in field experiments. Li J et al. [7] trained a cotton top bud recognition model based on YOLOv3, deployed it on a computer workstation, and tested it; the average recognition accuracy was 93.07%, and the FPS was above 10. Zhang et al. [8] improved YOLOv8n with a MobileNetV3 network and introduced a content-aware upsampling function to reorganize the feature processing network so that the algorithm could better adapt to mobile devices; the detection accuracy was 86.4%, and the FPS was 6.4. These detection-based algorithms generally use anchor boxes to provide positioning guidance for topping equipment. When measuring the position of the top bud, the equipment usually samples uniformly within the anchor box and takes the average of the sampling points as the center position of the top bud. However, the pixel distribution of the top bud in the anchor box is not uniform, and the anchor box also contains a large number of background pixels or pixels belonging to other plant parts. This sampling method therefore produces significant errors, resulting in inaccurate positioning.
To address the limitations of object detection algorithms in localization tasks, some studies have adopted segmentation-based guidance mechanisms. For instance, Wang H et al. [9] proposed an improved segmentation network for segmenting and localizing tea buds. This approach can separate the features of tender buds from background regions, facilitating the extraction of bud contours and segmented areas, thereby enabling precise localization of tea bud picking points. Peng G et al. [10] proposed a segmentation-based weed localization fusion algorithm to address the localization interference caused by complex backgrounds and redundant pixels in traditional recognition methods. By employing geometric methods to calculate the central position within the segmented contours, the algorithm achieved a root mean square error of 12.48 mm in localization accuracy. Wenbo W et al. [11] identified that the guidance mechanism of the target detection algorithm has too much redundant information in the anchor box that often leads to imprecise localization. They proposed utilizing MobileViT-Seg to obtain target masks to address this issue, followed by circle fitting on the masks to determine the central position. This approach achieved a localization accuracy of 90.80%. The instance segmentation algorithm demonstrates significant advantages by precisely delineating the spatial distribution of target features and effectively constraining sampling behaviors. Specifically, it eliminates interference from redundant information within anchor boxes, making it particularly effective for cotton top bud localization.
Instance segmentation is an accurate detection approach based on convolutional neural networks (CNNs). It captures target features by learning from a large amount of image data, and the recognition results are refined by a pixel-level instance mask within the anchor box. In this process, the algorithm corrects the recognition results several times through backpropagation to reduce the regression loss. Mainstream instance segmentation algorithms are divided into single-stage and two-stage algorithms. Single-stage algorithms mainly include PANet [12], UNet [13], and the YOLO (You Only Look Once) series [14]; they are characterized by fast segmentation speed and strong real-time performance and are widely deployed on edge devices. Two-stage algorithms are represented by Mask R-CNN [15], which provides higher accuracy but slower inference, and are usually used when sufficient computing power is available.
With the continuous evolution of the YOLO series, its performance in crop detection tasks has improved significantly [16]. At the same time, because the structure of the YOLO algorithm is highly compatible with edge smart devices, agricultural automation equipment usually deploys YOLO as its vision algorithm [17,18]. For example, Qin X et al. [19] introduced the C2FET module with an EfficientViT branch into the YOLOv8-Pose algorithm and used dual-path feature extraction to accelerate the forward propagation of the model; by deploying it on an NVIDIA Jetson device, real-time and accurate detection in a cherry tomato harvesting task was achieved. Liao J et al. [20] constructed an STC module based on the Swin Transformer to enhance the expression of target features and replaced the traditional convolution in YOLOv10n with GhostConv. A stochastic gradient descent optimizer was used to shorten the inference time of the model, and a relatively lightweight model, YOLO-VM, was obtained, which provides higher robustness and efficiency in practical applications. When identifying crops, the YOLO algorithm generally relies on the color, contour, and background features of the crops. However, cotton top buds differ from most crops: they are small, their contour features are not prominent, and in the field environment their color features are difficult to distinguish due to factors such as light and weather, which makes top bud identification challenging.
Many scholars have conducted a series of studies on the accurate segmentation of small targets such as cotton top buds and have made some progress. Zhang et al. [21] improved the encoder–decoder architecture of SAM and proposed an IRSAM model for detecting infrared small targets; the objective indicators of the model are better than those of the original SAM. Dong et al. [22] proposed an end-to-end segmentation network called DenseU-Net to address the issue of small target features being obscured in remote sensing images, improving the discernibility of small targets by 6.71%. Zhang D et al. [23] developed a weed apical meristem localization algorithm based on YOLOv8n-seg that is suitable for real-time detection of apical meristems of various weed species; the mask accuracy reached 97.2%, and the anchor box mAP reached 96.9%. Reddy J et al. [24] used SAM combined with a YOLO algorithm to semantically segment UAV cotton boll images to predict cotton yield. Pawłowski J et al. [25] used YOLOv8n instance segmentation to measure the size of bean seeds and achieved a segmentation accuracy of 90.1%. Based on PANet and BiFPN, Haiying Liu et al. [26] improved the feature fusion method of the neck network in YOLOv5 so that the model could maintain high accuracy when detecting small targets in complex environments. Xiaoting Liang et al. [27] added the BiSeNet V2 network structure to the backbone of YOLOv4, and the mAP of the new network structure was 14% and 19% higher than that of DAnet and Unet, respectively, in segmentation experiments. Chang-Hwan Son [28] designed a perceptual feature extractor combined with the YOLO detection sub-network to improve the segmentation accuracy of leaf disease regions. These methods focus on optimizing the segmentation accuracy of small targets while ignoring the real-time performance of the model. To avoid a reduction in topping quality caused by delays in the topping actuator, a lightweight structure also needs to be considered when optimizing the model.
Over the last several years, a variety of improved target detection algorithms have emerged to address the difficulties of cotton top bud recognition. Zhang Jie et al. [29] integrated the GhostNetV2 network into an end-to-end neural network architecture, obtained an improved YOLOv7-tiny model, and implemented lightweight deployment on Jetson devices. Peng Song et al. [30] proposed an improved Cascade R-CNN network that uses an RGB camera combined with a depth map to dynamically obtain the position of the cotton top bud and cooperates with a mechanical claw to remove it. Chunshan Wang et al. [31] proposed a new lightweight model to recognize cotton main stem growth points, which achieved multiplexed fusion of low-level and high-level semantic features by integrating dense connection modules and enhancing the nonlinear transformations within these modules. Although the improvements in these studies have increased model accuracy to a certain extent, problems such as slow recognition speed and insufficient anchor box accuracy remain. The instance segmentation algorithm can accurately describe the contour and distribution range of the top buds within the anchor box by generating a pixel-level instance mask; uniform sampling within the mask avoids interference from irrelevant pixels and more accurately reflects the true position of the top bud.
To address the shortcomings of existing research, we propose an improved instance segmentation model, CBLN-YOLO (Cotton Bud Location Network—YOLO), based on YOLO11n-seg. The model improves segmentation accuracy for small targets by introducing the LDConv module and the CA mechanism, optimizing the feature extraction network structure, and changing the loss function. It can effectively identify and segment cotton top bud images in the field environment and solve the problems of background interference and inaccurate top bud recognition while remaining lightweight. The model realizes accurate detection of cotton top buds in the field and can improve the automation level and accuracy of cotton topping operations.
The main contributions of this study are as follows:
  • In order to reduce the redundant parameters in the model and increase the flexibility of the convolution kernel size, the LDConv module is proposed to replace the standard convolution module in the YOLO11n-seg network. In the backbone network, the CA mechanism is used to replace the MHSA mechanism of the C2PSA module to enhance the spatial and channel dimension information of the feature map, better integrate multiscale features, and retain the detailed information of the cotton top buds.
  • The feature pyramid network of the neck is optimized by using a lightweight fusion strategy that enhances interlayer feature correlation. By enhancing the small target space mapping and separation reconstruction features, the information loss and irrelevant feature extraction of cotton top buds are reduced, and the consumption of computing resources is reduced. At the same time, a small-size target segmentation head is added to the output end of the EFC module to enhance the detection accuracy of small targets.
  • This study adopts the Inner CIoU loss function to assess the model's regression loss and regulates the anchor box generation rules by modifying the ratio factor of the loss function, thereby accelerating the model's convergence.
Owing to the absence of publicly accessible datasets, this research employed a self-built cotton top bud dataset to assess the enhanced model and conducted comparisons with existing segmentation methodologies. The results show that the improved model can maintain a high real-time processing speed, more accurately identify cotton top buds in the field environment, and provide a more accurate example mask. This ability can improve the topping accuracy of cotton topping equipment and provide technical support for the automation of cotton field management.
Section 2 introduces the establishment process of the cotton top bud dataset and the improvement principle of the CBLN-YOLO model in detail. Section 3 introduces the experimental environment and experimental data and analyzes the experimental results. Section 4 summarizes the experimental conclusions and discusses the significance of this study.

2. Materials and Methods

2.1. Dataset Image Collection

The cotton top bud images used in the dataset came from the cotton field of the tenth regiment in Alaer City, Xinjiang. Images were collected from mid-June to late July 2024 using a Redmi K30 Pro smartphone (manufactured by Xiaomi in Wuxi, Jiangsu Province, China) at a resolution of 3000 × 3000 pixels. Considering the influence of natural lighting conditions on image quality, we collected images in two periods: from 9:00 to 12:00 and from 14:00 to 17:00. Because the topping machine recognizes the top buds from an overhead angle when it is working, images were captured from a top-down view. To guarantee sample diversity and strengthen the model's robustness, we collected images in different periods of sunny and dusty weather and obtained top bud images under four different conditions, as shown in Figure 1. After the initial sampling, 173 and 159 images were collected in the morning and afternoon periods of sunny weather, respectively, while 120 and 136 images were obtained during the corresponding periods of dusty weather. Due to the significant disparity in total sample size between the two weather conditions, random undersampling was applied to the sunny weather samples. Specifically, 38 images were randomly removed from each of the two sunny periods to balance the total sample quantities across both weather types. Ultimately, a total of 512 original images were retained.

2.2. Dataset Preprocessing

2.2.1. Image Annotation

We manually marked the mask position of the top buds in each image using Labelme [32] (Version 5.6.0). Labelme is an annotation tool for labeling the pixel distribution area of the object to be recognized. It generates a polygon from a set of manually selected corner points, uses this polygon as the instance mask, and records the instance label. The interface of Labelme is shown in Figure 2. The file generated after annotation records the coordinates of each polygon corner point, with the coordinates defined relative to an origin at the image's upper-left corner. By converting each coordinate value to a ratio relative to the image resolution, the annotation can be converted into the label file format required by YOLO.
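As an illustration of this conversion, the following sketch normalizes polygon corner coordinates by the image resolution and writes one YOLO-style segmentation label line per instance. It assumes a Labelme JSON export with `shapes`, `points`, `imageWidth`, and `imageHeight` fields and a single class index of 0 for the top bud; these names are assumptions for illustration, not part of the original pipeline.

```python
# Hypothetical sketch: convert one Labelme annotation to a YOLO-seg label file.
import json
from pathlib import Path

def labelme_to_yolo_seg(json_path: str, out_dir: str, class_id: int = 0) -> None:
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    w, h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        # Normalize each corner point to [0, 1] relative to the image resolution;
        # the origin is the image's upper-left corner, as in the annotation tool.
        coords = []
        for x, y in shape["points"]:
            coords += [x / w, y / h]
        lines.append(" ".join([str(class_id)] + [f"{c:.6f}" for c in coords]))
    out_file = Path(out_dir) / (Path(json_path).stem + ".txt")
    out_file.write_text("\n".join(lines), encoding="utf-8")
```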

2.2.2. Dataset Augmentation

To enhance the generalization capability of the model and avert overfitting caused by insufficient data in the training phase, we used data enhancement methods to expand the original images. As shown in Figure 3, the enhancement methods include random brightness adjustment, noise addition, tangential transformation, random rotation, mirror transformation, and graying. These methods simulate changes in the shooting environment and shooting conditions encountered in actual work, thus enhancing the adaptability of the model in different scenarios. After enhancement, the dataset was expanded to 5120 images; the partitioning of these images is described in Section 2.2.3.
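The following is a minimal sketch of such augmentations using OpenCV and NumPy; the parameter ranges are illustrative assumptions rather than the exact values used in this study, and graying would simply be an additional `cv2.cvtColor` call. Note that the geometric transformations (shear, rotation, mirroring) must also be applied to the polygon labels so that the masks stay aligned.

```python
# Illustrative augmentation sketch; parameter ranges are assumptions.
import cv2
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    h, w = img.shape[:2]
    # Random brightness adjustment
    img = cv2.convertScaleAbs(img, alpha=1.0, beta=float(rng.uniform(-40, 40)))
    # Additive Gaussian noise
    noise = rng.normal(0, 10, img.shape).astype(np.float32)
    img = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    # Tangential (shear) transformation
    shear = float(rng.uniform(-0.1, 0.1))
    img = cv2.warpAffine(img, np.float32([[1, shear, 0], [0, 1, 0]]), (w, h))
    # Random rotation about the image center
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), float(rng.uniform(-30, 30)), 1.0)
    img = cv2.warpAffine(img, rot, (w, h))
    # Mirror (horizontal flip) with probability 0.5
    if rng.random() < 0.5:
        img = cv2.flip(img, 1)
    return img
```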

2.2.3. Dataset Partitioning

After the dataset was enhanced, a total of 5120 images and corresponding label files were obtained, which were divided in a ratio of 8:1:1. When partitioning, an image and its corresponding label file were treated as one data item. The training set contained 4096 data items, and the validation set and the test set each contained 512 data items.
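A simple way to realize such an 8:1:1 split is sketched below; the directory layout and file extension are assumptions, and each image and its same-stem label file are treated as one item when copying or moving.

```python
# Illustrative 8:1:1 split of image/label pairs; paths and extensions are assumptions.
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0) -> dict[str, list[Path]]:
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)          # reproducible shuffle
    n = len(images)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }
```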

2.3. CBLN-YOLO Model for Segmentation of Cotton Top Buds

2.3.1. Using YOLO11n-Seg as the Baseline Model

In September 2024, Ultralytics released the YOLO11-seg algorithm [33]. According to network size, the algorithm is divided into five structures: n (nano), s (small), m (medium), l (large), and x (extra large), with network depth and the computational power required of the host hardware increasing progressively from n to x. The automation equipment used in cotton topping carries limited computing resources, and edge devices are usually used to provide computing support, while YOLO11n-seg can meet the requirements of both accuracy and deployment. Therefore, this study used YOLO11n-seg as the baseline model so that the improved model could inherit the accuracy and lightweight characteristics of the baseline and further improve the detection and segmentation of cotton top buds on this basis.
As shown in Figure 4, the YOLO11n-seg algorithm consists of three parts: backbone, neck, and head. The backbone extracts feature layers at three scales and outputs them to the neck. The feature pyramid network in the neck fuses and enhances high-level and low-level information and then feeds the result into the segmentation head of the corresponding scale in the head. After nonmaximum suppression, the algorithm outputs only the optimal anchor box and mask.

2.3.2. Construction of the CBLN-YOLO Model

CBLN-YOLO introduces the LDConv module into the YOLO11n-seg network to replace the standard convolution module. LDConv uses irregular convolution kernels to extract target features, which improves the model's capacity to learn the features of small targets, makes the model parameters grow linearly, and reduces the amount of computation during inference. In the C2PSA module at the end of the backbone, the CA mechanism [34] is utilized to strengthen the connections among pixel features, which helps retain the details of the top buds. In addition, the feature pyramid network of the neck is redesigned based on the lightweight fusion strategy, which reduces unnecessary computational operations and memory accesses; simultaneously, a small-target segmentation head is incorporated after the EFC module [35] to enhance the detection of small targets. Lastly, Inner CIoU is used as the loss function, and the anchor box and the target box are optimized by adjusting the scale factor; according to the regression situation, the anchor box is subjected to angle and distance penalties to accelerate the convergence of the model. The structure diagram of the CBLN-YOLO model and its general modules is shown in Figure 5.

2.3.3. LDConv Module

Affected by different growth conditions, cotton top buds show significant variability in size and shape. The standard convolution used by YOLO11n-seg has a fixed kernel size during sampling, making it difficult to fully sample the changing features of the top buds. Moreover, the parameter count of standard convolution grows quadratically with kernel size; blindly increasing the kernel size would significantly increase the computational complexity and memory overhead of the model and slow down segmentation. Therefore, this study used LDConv, proposed by Xin Zhang et al. [36], which can adjust the shape of the convolution kernel according to the target features. When detecting cotton top buds, LDConv can dynamically change the receptive field according to the geometry of the buds, thereby reducing background interference and fully mining the feature information of top buds at different scales. After replacing the standard convolution module with LDConv, the parameter count grows linearly rather than quadratically as the kernel size changes: when the kernel size increases, the parameter count increases slowly, and when it decreases, the parameter count drops quickly. This property is of great significance for keeping the model lightweight.
The overall structure of the LDConv module is shown in Figure 6. Before generating the irregular convolution kernel, initial sampling is performed to determine the convolution kernel center coordinates $P_0$. The initial sampling process is shown in Equation (1):

$\mathrm{Conv}(P_0) = \sum_{P_N \in R} w(P_N) \times x(P_0 + P_N)$  (1)

In Equation (1), R, $w$, and $x(P_0 + P_N)$ represent the coordinate set of the sampling positions covered by the convolution, the weight of the convolution unit, and the pixel value at position $(P_0 + P_N)$, respectively. Given that the input feature map has dimensions (C, H, W), the corresponding offsets are obtained by a convolution operation, with dimensions (2N, H, W). The offsets are then added to the original coordinates $(P_N + P_0)$ to generate new sampling coordinates, so that different sampling shapes are produced at different positions in the feature map. Finally, the generated irregular convolution kernel interpolates and resamples the features at the new pixel positions to obtain deeper feature information. To reduce the memory access frequency, the resampling, convolution calculation, and normalization steps are merged into one sub-module within LDConv, reducing the number of storage operations.
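As a rough illustration of this idea (not the authors' implementation), the sketch below predicts a 2N-channel offset map with a convolution, adds the offsets to a base sampling grid, resamples the input with bilinear interpolation, and fuses the N resampled maps with a 1 × 1 convolution, so the parameter count scales linearly with the number of sampling points N; all layer choices and the value of N here are assumptions.

```python
# Minimal, illustrative sketch of LDConv-style deformable sampling (assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDConvSketch(nn.Module):
    def __init__(self, c_in: int, c_out: int, num_points: int = 5):
        super().__init__()
        self.n = num_points
        # Predicts a 2N-channel offset map (delta-x, delta-y for each sample point).
        self.offset_conv = nn.Conv2d(c_in, 2 * num_points, 3, padding=1)
        # Fuses the N resampled feature maps; parameters scale linearly with N.
        self.fuse = nn.Conv2d(c_in * num_points, c_out, 1)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        offsets = self.offset_conv(x)                      # (B, 2N, H, W)
        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)               # (H, W, 2), (x, y) order
        samples = []
        for i in range(self.n):
            off = offsets[:, 2 * i:2 * i + 2].permute(0, 2, 3, 1)   # (B, H, W, 2)
            # Scale pixel offsets to the normalized grid range.
            off = off / torch.tensor([w / 2, h / 2], device=x.device)
            grid = base.unsqueeze(0) + off
            samples.append(F.grid_sample(x, grid, align_corners=True))
        out = self.fuse(torch.cat(samples, dim=1))
        return self.act(self.bn(out))
```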

2.3.4. Coordinate Attention Mechanism

At the end of the backbone, partial self-attention (PSA) [37] relies on a built-in MHSA mechanism to provide multiscale representation capability for the model. However, the computational complexity of MHSA is high, especially on high-resolution images, and its ability to process local details is limited. This study therefore replaces the multihead self-attention mechanism with the CA mechanism, enabling the model to better capture spatial location information and channel-wise dependencies and thereby handle complex background information in images. While maintaining a low parameter count, the introduction of CA allows the model to focus on more discriminative features and reduce interference in the decision-making process. The architecture of the CA mechanism is shown in Figure 7. Within the C2PSA module, one branch of the input feature map flows into the CA, where it is globally pooled along both the horizontal and vertical axes. Each channel is encoded horizontally using a pooling kernel of size (H, 1), as shown in Equation (2); the pooling kernel in the vertical direction has size (1, W), and its encoding process is shown in Equation (3):
$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$  (2)

$z_c^v(v) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, v)$  (3)
In Equations (2) and (3), $z_c^h$ represents the output vector of channel c at height h, and $z_c^v$ represents the output vector of channel c at width v. W and H are the size parameters of the pooling kernel in the two directions, and $x_c$ denotes the value of the input feature map at the corresponding position of channel c during pooling. The squeeze step for channel c can be expressed as Equation (4):
$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$  (4)
The pooling outputs two feature vectors with scales of (C, 1, W) and (C, H, 1). Through concatenation and convolution of these feature vectors, the spatial location information is extracted, and attention weights in the two directions are generated. The weights are then applied back to the feature map along the horizontal and vertical directions to enhance the direction perception ability of the model.
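For reference, a minimal PyTorch sketch of the coordinate attention block of Hou et al. [34] is given below: directional average pooling (Equations (2) and (3)), a shared 1 × 1 transform, and separate horizontal/vertical attention weights. The reduction ratio and activation choices are illustrative assumptions rather than the exact configuration used in CBLN-YOLO.

```python
# Minimal coordinate attention sketch (reduction ratio and activations assumed).
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1): pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W): pool along height
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                               # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)           # (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))              # horizontal weights (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # vertical weights (B, C, 1, W)
        return x * a_h * a_w                               # re-weight both directions
```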

2.3.5. EFC-FPN

The role of the FPN is to fuse the features of each layer of the neck network. The traditional FPN does not consider the correlation between the features of each layer when fusing them, and some spatial information is lost after fusion. In this study, the neck network was reconstructed according to the EFC lightweight fusion strategy, and the new network was named EFC-FPN. The EFC module in the network is mainly composed of a multilevel feature reconstruction (MFR) module and a grouped feature focusing (GFF) unit. The former reconstructs small-target features to reduce the loss of small-target information in deep convolution, and its final output is shown in Equation (5). The latter enhances the correlation between different feature layers so that the model can better perceive the semantic information of small objects, and its final output is shown in Equation (6):
$P_m = \mathrm{Softmax}(T(A(P_{low}))) \cdot P_{new}$  (5)

$P_f = \dfrac{(P_{g1} \oplus P_{g2} \oplus \cdots \oplus P_{gi}) - \mathrm{mean}(P_i \oplus P_{i-1})}{\mathrm{std}(P_i \oplus P_{i-1})}$  (6)
In Equation (5), $P_{low}$ represents the weak feature generated after reconstruction, T represents the convolution transformation layer, and A represents the adaptive average pooling layer; $P_{low}$ is normalized by Softmax and weighted with the total reconstructed feature $P_{new}$ to output $P_m$. In Equation (6), $P_{g1}$, $P_{g2}$, …, $P_{gi}$ represent the grouped features, ⊕ denotes feature concatenation, mean and std denote the mean and standard deviation, respectively, and $P_i$ and $P_{i-1}$ represent the feature maps input at different stages. The mean and standard deviation of the feature maps are used to normalize the aggregated grouped features to obtain $P_f$. The architecture of EFC-FPN is shown in Figure 8; a 20 × 20 target segmentation head was added to the output in the middle of the network to improve the model's perception of small targets without significantly increasing the amount of computation.

2.3.6. Improved Loss Function

The YOLO11n-seg algorithm uses the CIoU loss function to calculate the regression loss of the anchor box. The CIoU loss function is defined in Equations (7)–(9):
$v = \dfrac{4}{\pi^2} \left( \arctan\dfrac{w^{gt}}{h^{gt}} - \arctan\dfrac{w}{h} \right)^2$  (7)

$\alpha = \dfrac{v}{(1 - \mathrm{IoU}) + v}$  (8)

$L_{CIoU} = 1 - \mathrm{IoU} + \dfrac{\rho^2(b, b^{gt})}{c^2} + \alpha v$  (9)
In Equation (7), $w^{gt}$ and $h^{gt}$ represent the width and height of the target box, w and h denote the width and height of the anchor box, and v measures the consistency of the aspect ratios of the two boxes. In Equation (8), IoU is the intersection-over-union ratio of the target box and the anchor box, and α is a weight parameter. In Equation (9), ρ represents the Euclidean distance between the centers of the anchor box and the target box, and c denotes the diagonal length of the smallest rectangle enclosing both boxes. This loss function has some defects: for example, when the anchor box and the target box have the same aspect ratio, the weight parameter α fails, degrading the sensitivity of the loss function. When detecting small targets, the sizes of the anchor box and the target box are close and the difference between their aspect ratios is very small, so CIoU cannot locate small targets well. In this study, Inner CIoU, proposed by H Zhang et al. [38], was used as the loss function to overcome these defects. The principle of Inner IoU is shown in Figure 9. On the basis of CIoU, Inner CIoU introduces auxiliary bounding boxes: the inner target box and the inner anchor box, with widths $w_{inner}^{gt}$ and $w_{inner}$ and heights $h_{inner}^{gt}$ and $h_{inner}$, respectively. The intersection-over-union ratio between the auxiliary bounding boxes is denoted $\mathrm{IoU}_{inner}$. The auxiliary boxes share the geometric centers b and $b^{gt}$ with the original bounding boxes. When the weight parameter α fails, the auxiliary boxes compensate for the sensitivity loss of the loss function through scale transformation. The definition of Inner CIoU is shown in Equation (10):
$L_{Inner\text{-}CIoU} = L_{CIoU} + \mathrm{IoU} - \mathrm{IoU}_{inner}$  (10)
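A minimal sketch of the auxiliary-box computation and of Equation (10) is given below; boxes are assumed to be (xc, yc, w, h) tensors, and the standard CIoU term of Equations (7)–(9) is assumed to be available as `ciou_loss`, so this is an illustration rather than the exact loss implementation used in training.

```python
# Sketch of Inner-IoU auxiliary boxes and the Inner-CIoU combination (Equation (10)).
import torch

def inner_iou(box: torch.Tensor, box_gt: torch.Tensor, ratio: float = 1.5) -> torch.Tensor:
    xc, yc, w, h = box.unbind(-1)
    xc_gt, yc_gt, w_gt, h_gt = box_gt.unbind(-1)
    # Auxiliary boxes share the original centers but are scaled by `ratio`.
    x1, x2 = xc - w * ratio / 2, xc + w * ratio / 2
    y1, y2 = yc - h * ratio / 2, yc + h * ratio / 2
    x1g, x2g = xc_gt - w_gt * ratio / 2, xc_gt + w_gt * ratio / 2
    y1g, y2g = yc_gt - h_gt * ratio / 2, yc_gt + h_gt * ratio / 2
    inter_w = (torch.min(x2, x2g) - torch.max(x1, x1g)).clamp(min=0)
    inter_h = (torch.min(y2, y2g) - torch.max(y1, y1g)).clamp(min=0)
    inter = inter_w * inter_h
    union = w * h * ratio**2 + w_gt * h_gt * ratio**2 - inter
    return inter / (union + 1e-7)

def inner_ciou_loss(box, box_gt, iou, ciou_loss, ratio: float = 1.5):
    # Equation (10): L_Inner-CIoU = L_CIoU + IoU - IoU_inner
    return ciou_loss + iou - inner_iou(box, box_gt, ratio)
```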

2.4. Performance Metrics

The performance indicators of the segmentation model were divided into two categories: one evaluates the prediction accuracy of the model, and the other evaluates its detection efficiency. The accuracy indicators include precision (P), recall (R), and mean average precision (mAP), while the number of frames per second (FPS) is usually used to evaluate efficiency. The accuracy indicators were calculated from the regression results, as shown in Equations (11)–(13):
$P = \dfrac{TP}{TP + FP}$  (11)

$R = \dfrac{TP}{TP + FN}$  (12)

$mAP = \dfrac{\sum_{i=1}^{n} \int_{0}^{1} P(R)\,dR}{n}$  (13)
In Equations (11) and (12), true positives (TP) denote the number of samples correctly classified as top buds, false positives (FP) denote the number of samples misreported as top buds, and false negatives (FN) denote the number of missed top buds. In Equation (13), n indicates the number of sample categories, which is 1 in this study. The mAP calculated at an IoU threshold of 0.5 is recorded as mAP@0.5; a prediction is counted as a valid detection when its IoU exceeds this threshold.
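As a small worked illustration of Equations (11)–(13), the sketch below computes precision and recall from TP/FP/FN counts and approximates average precision as the area under the P(R) curve; with a single class (n = 1), the mAP equals this AP. The function names and inputs are illustrative.

```python
# Illustrative metric computation for Equations (11)-(13).
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    p = tp / (tp + fp) if tp + fp > 0 else 0.0   # Equation (11)
    r = tp / (tp + fn) if tp + fn > 0 else 0.0   # Equation (12)
    return p, r

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    # Area under the P(R) curve over recall in [0, 1] (trapezoidal approximation
    # of the integral in Equation (13)); recalls must be sorted in ascending order.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))
```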

3. Results and Analysis

3.1. Experimental Environment

The server platform used for training and testing consisted of an Intel Xeon Gold 6248R CPU and an NVIDIA A10 (24 GB) GPU. The software environment comprised the Ubuntu 20.04 operating system, CUDA 12.4, the Python 3.10 programming language, and the PyTorch 2.5.1 deep learning framework. All hyperparameter settings were kept consistent across the experiments, and the main hyperparameter settings are shown in Table 1. To prevent the model from falling into a local optimum and to avoid unnecessary resource consumption, training was initialized from the pretrained weights provided by Ultralytics.
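For orientation, a minimal sketch of launching such a training run with the Ultralytics Python API is shown below; the dataset YAML path and the hyperparameter values are placeholders and do not reproduce the exact Table 1 settings.

```python
# Hedged sketch of training from the Ultralytics pretrained weights.
from ultralytics import YOLO

model = YOLO("yolo11n-seg.pt")           # pretrained weights provided by Ultralytics
results = model.train(
    data="cotton_top_bud.yaml",          # hypothetical dataset configuration file
    epochs=300,                          # illustrative values; see Table 1 for the
    imgsz=640,                           # hyperparameters actually used
    batch=16,
    device=0,
)
metrics = model.val()                    # validation on the held-out split
```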

3.2. Model Training Results

CBLN-YOLO was trained and validated repeatedly, and the changes in its performance indicators are shown in Figure 10. During training, the four types of losses gradually converged toward 0, indicating that the model learned well on the cotton top bud dataset without abnormalities such as overfitting. During validation, the trend of the loss functions matched the training process, demonstrating that the model generalizes well after full training.
In Figure 10, (B) denotes the performance of the anchor box, i.e., the detection indicators, (M) denotes the performance of the mask, i.e., the segmentation indicators, and mAP50 denotes mAP@0.5. At an IoU threshold of 50%, the mAP@0.5 of both the anchor box and the mask converged above 90%, and precision and recall remained above 80%, indicating that the model can accurately detect and segment top bud instances in the images.

3.3. Comparative Experiment

To further verify the advantages of the CBLN-YOLO model in top bud segmentation, several mainstream segmentation models were used for comparative experiments. The comparison objects included Mask R-CNN, the YOLO series, and CBLN-YOLO. The same dataset was used to train and test these models, which were evaluated based on their accuracy and efficiency on the test set. The experimental results are shown in Table 2. To ensure fairness, the smallest network size was used for each YOLO model.
Among the existing comparison models, YOLO11n-seg, as the baseline model, performed best on the accuracy indicators of both the detection and segmentation tasks. When detecting the top bud and generating the anchor box, its precision, recall, and mAP@0.5 were 95.1%, 93.7%, and 95.1%, respectively, which were 3.0%, 7.6%, and 1.7% higher than those of the newer YOLOv12n-seg algorithm [39]. For segmentation, its precision, recall, and mAP@0.5 were 94.3%, 93.3%, and 92.0%, respectively, which were 1.6%, 8.5%, and 0.3% higher than those of YOLOv12n-seg. In terms of efficiency, YOLO11n-seg processed 125 images per second, 22 frames faster than YOLOv12n-seg, 68 frames faster than the slowest model, Mask R-CNN, and 31 frames slower than the fastest, YOLOv10n-seg. Although the baseline model was not the fastest, it still had the best comprehensive performance.

3.4. Adjustment of Loss Function

Inner CIoU uses a scaling factor, ratio, to regulate the size of the auxiliary bounding boxes, which share their centers with the anchor box and the target box. When the ratio is less than 1, the auxiliary boxes are shrunk, and when the ratio is greater than 1, they are expanded. The ratio factor generally takes values in the range [0.5, 1.5]. This study compares the changes in the convergence speed of the model for four values of the ratio, as shown in Figure 11.
As shown in Figure 11, the box loss and dfl loss represent the loss between the anchor box and the target box: the smaller the loss, the more accurate the detection result. The seg loss represents the loss of the mask coverage area: the smaller the loss, the more accurate the segmentation. The cls loss represents the classification loss: the smaller the loss, the higher the sample recall. During training, the four loss values decreased gradually with each iteration. When the ratio factor was 1.5, the loss values decreased the most and converged to the smallest values, showing that the convergence speed of the model was fastest. In addition, this study compares the effects of the four ratio values on model performance; the experimental results are shown in Table 3.
As seen in Table 3, the loss function degenerates to CIoU when the ratio is 0. The data in Table 3 show that the FPS of the model decreased after the introduction of Inner CIoU, because the new loss function increases the computation of the model. As the ratio factor increased, the computation for the auxiliary boxes increased further and the frame rate gradually decreased; at the maximum value of 1.5, the FPS was about seven frames lower than before the introduction. In contrast to the decline in FPS, as the ratio increased, the accuracy of the anchor boxes and masks increased, and the sample recall increased. When the ratio reached its maximum value of 1.5, the precision, recall, and mAP of the anchor boxes and of the masks increased by 0.4%, 2.0%, and 2.0% and by 2.4%, 0.5%, and 0.6%, respectively, which was the largest improvement. After weighing the loss values against the changes in model performance, the ratio factor of Inner CIoU was set to 1.5.

3.5. Ablation Experiment

To validate the efficacy of the introduced modules and structures in improving the performance of the model, we used the cotton top bud dataset to perform ablation experiments. The experimental results are shown in Table 4.
Table 4 contains six experimental groups. Group 1 adds nothing and is the baseline model, and group 6 is the CBLN-YOLO model. Comparing groups 1, 2, and 3 shows that after introducing the LDConv module and the CA mechanism, the model structure became lighter, the FPS was 20 frames higher than that of the baseline model, and accuracy improved while the model remained lightweight. Comparing groups 3 and 4 shows that the EFC-FPN structure increased the complexity of the model but also greatly improved accuracy: compared to the baseline model, the FPS dropped by 17 frames, while the mAP@0.5 (Mask) increased by 3.2%. The data of groups 3 and 5 show that using Inner CIoU as the loss function increases the amount of computation but also improves the mAP@0.5. The final CBLN-YOLO integrates the advantages of each structure and greatly improves both the speed and accuracy of recognition relative to YOLO11n-seg, demonstrating the effectiveness of the improvements. Figure 12 shows the recognition results of YOLO11n-seg and CBLN-YOLO on the cotton top bud dataset.
As can be seen from Figure 12, the confidence of CBLN-YOLO's predictions was higher than that of the original model when identifying the same sample. In terms of mask shape, CBLN-YOLO segmented the tip and the complete contour of the top bud, whereas YOLO11n-seg lost detailed features during segmentation and misjudged the shadow covering the top bud as part of the contour. It can therefore be concluded that the CBLN-YOLO model performs better than YOLO11n-seg on the enhanced dataset. For different states of the same sample, the confidence of CBLN-YOLO recognition exceeded 90%, showing that the model maintains good robustness even when the detection environment changes.

3.6. Statistical Significance Analysis

To evaluate whether the observed performance differences between the baseline model and CBLN-YOLO are statistically significant, we performed a paired t-test on the test-set objects in both detection and segmentation tasks. The t-statistic is calculated as shown in Equations (14)–(16):
$\bar{d} = \dfrac{\sum_{i=1}^{n} d_i}{n} \quad (n = 512)$  (14)

$s_d = \sqrt{\dfrac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}}$  (15)

$t = \dfrac{\bar{d}}{s_d / \sqrt{n}}$  (16)
where n denotes the number of paired samples, $d_i$ represents the metric difference between the two models for the same sample, and $s_d$ indicates the standard deviation of the differences. The test was conducted at a 0.05 significance level. When mAP was used as the metric, t was 6.74 (p = 3.2 × 10⁻⁶) in the object detection task and 7.15 (p = 1.8 × 10⁻⁷) in the segmentation task. In both cases, the p values were below 0.05, demonstrating that the accuracy improvement of the enhanced model is statistically significant. When FPS was used as the metric, t was 91.43, and the p value was close to 0, far below the significance threshold, indicating that the improvement in the inference speed of CBLN-YOLO is significant and not attributable to random variation. Overall, the improvements of the model are statistically significant and of practical value.
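A sketch of this paired test on per-image metric differences is shown below; `metric_base` and `metric_cbln` are hypothetical arrays of length n = 512 holding the per-image scores of the two models on the test set, with the manual statistic following Equations (14)–(16) and `scipy.stats.ttest_rel` providing the p value.

```python
# Paired t-test sketch for Equations (14)-(16); input arrays are hypothetical.
import numpy as np
from scipy import stats

def paired_t_test(metric_base: np.ndarray, metric_cbln: np.ndarray):
    d = metric_cbln - metric_base
    n = d.size
    d_mean = d.mean()                          # Equation (14)
    s_d = d.std(ddof=1)                        # Equation (15): sample std (n - 1)
    t_manual = d_mean / (s_d / np.sqrt(n))     # Equation (16)
    t_scipy, p_value = stats.ttest_rel(metric_cbln, metric_base)
    return t_manual, t_scipy, p_value
```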

3.7. Recognition Experiment of Cotton Top Buds in the Field

When cotton top buds are segmented in the field environment, the top bud mainly appears in two states: uncovered and covered. The features of covered top buds differ considerably from those of uncovered buds because of incomplete contours or surface shadows. Therefore, top buds in both states were randomly selected from the test set, and recognition experiments were performed according to the weather and time period at the time of collection. The experimental results are shown in Figure 13.
Because the contour of the covered top bud is incomplete and there is a large area of shadow on the surface, it is not easy to separate from the background information when extracting features. Therefore, the confidence of such samples in recognition is lower than that of the uncovered top bud. The samples of all categories in Figure 13 showed high confidence, more than 80%, indicating that CBLN-YOLO has good recognition ability for various forms of top buds under different weather and different time periods, and the model has strong generalization performance.

3.8. Discussion

In the comparative experiments, we compared the improved model with several classic models, which clearly demonstrates the advantages of CBLN-YOLO. Compared with the two-stage model Mask R-CNN, CBLN-YOLO's mAP@0.5 was 0.5% higher, and its FPS was 67.9 frames faster. This finding indicates that the one-stage CBLN-YOLO not only surpasses the traditional two-stage algorithm in accuracy but is also superior in segmentation speed. Compared to the baseline model, CBLN-YOLO's precision, recall, and mAP@0.5 in detection tasks increased by 0.8%, 4%, and 3.2%, respectively; in segmentation tasks, they increased by 2.5%, 1.2%, and 3.8%, respectively. These results indicate that the improved model significantly enhances the recognition of cotton top buds in the field and can accurately extract the detailed features of small targets even in complex environments. Compared to the baseline model, the FPS of CBLN-YOLO increased by 10 frames, indicating faster inference; the improved model has a more lightweight structure, consumes fewer computational resources, and offers better real-time performance when deployed on topping equipment. In comparison with other YOLO algorithms, CBLN-YOLO demonstrated a remarkable lead in segmentation accuracy, with an advantage ranging from 3.8% to 6.9%, and its segmentation speed of 135.1 frames per second was notably prominent among similar algorithms. The ablation experiment compared the effects of each component. Although the introduction of EFC-FPN and Inner CIoU increased the computational complexity of the model and reduced the recognition speed, it also improved accuracy and convergence speed, so the loss of speed is acceptable. On this basis, the integration of the LDConv module and the CA mechanism enhanced the detection precision, recall, and mAP@0.5 by 0.8%, 4%, and 3.2% and the segmentation precision, recall, and mAP@0.5 by 2.5%, 1.2%, and 3.8%, respectively, all surpassing the baseline model. Field segmentation experiments show that the model generalizes well: after segmentation tests of different forms of top buds under different weather conditions and time periods, the model proved highly stable when working under natural conditions. Through comparative experiments, ablation experiments, and field segmentation experiments, it is verified that the improved CBLN-YOLO model can accurately detect and segment top buds in the natural environment, and its performance is advantageous among similar algorithms.

4. Conclusions

In this study, cotton top buds in the field were taken as the research object, and the YOLO11n-seg segmentation model was improved to enhance segmentation accuracy and make the model more lightweight. In existing research, the traditional detection algorithms used for top bud recognition suffer from issues such as large amounts of redundant information and inaccurate recognition. Moreover, segmentation methods face problems including the difficulty of segmenting small targets against complex backgrounds, the morphological diversity of small targets, and the limited computational resources of topping equipment. We implemented several enhancements based on the baseline model, including replacing the convolution module and attention mechanism, reconstructing the feature pyramid network, and improving the loss function. Combining the experimental data, we draw the following conclusions:
(1)
By introducing the LDConv module to substitute the standard convolution module, the variable-sized convolution kernels in the LDConv module have thoroughly explored the feature information of top buds at different scales. This has effectively resolved the issue of the highly variable morphology of top buds. Additionally, the utilization of the CA mechanism in tandem with deformable convolutions to extract multiscale features can efficiently reduce the interference from complex background information, resulting in clearer segmentation contours. These improvements within the backbone network have made the model more lightweight. As a consequence, not only has a higher FPS metric been achieved, but the model also becomes more adaptable to low-computing-power devices.
(2)
Based on the EFC module, a new lightweight feature pyramid structure EFC-FPN has been established, which accelerates the feature fusion between layers and the feature reconstruction of small targets, enabling the model to retain the detailed features of top buds.
(3)
Inner CIoU has been used as the new loss function with the ratio factor set to 1.5, which achieves greater accuracy and faster convergence at the expense of a small increase in computational complexity.
Although our work has effectively enhanced the performance of the segmentation algorithm, this study still has certain limitations. For instance, the dataset samples mainly focus on cotton top buds, resulting in a single research object, and the model has a relatively high degree of dependence on computational resources. In future work, our research will focus on the following aspects: First, we will expand the scale of the dataset by adding cotton top bud images under different environmental conditions (such as illumination and weather) and include top buds at different maturation stages so as to enhance the generalization ability of the model. Second, we plan to further optimize the model structure to reduce the number of model parameters and computational complexity, making it more suitable for edge devices with limited computational resources, thus better meeting the application requirements of actual agricultural production.

Author Contributions

Conceptualization, Y.X. and L.C.; methodology, Y.X.; software, Y.X.; validation, Y.X.; formal analysis, Y.X.; investigation, Y.X.; resources, Y.X.; data curation, Y.X. and L.C.; writing—original draft preparation, Y.X.; writing—review and editing, L.C.; visualization, Y.X.; supervision, L.C.; project administration, L.C.; funding acquisition, L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (61961034), the Regional Innovation Guidance Plan of Science and Technology Bureau of Xinjiang Production and Construction Corps (2021BB012 and 2023AB040), the Modern Agricultural Engineering Key Laboratory at Universities of Education Department of Xinjiang Uygur Autonomous Region (TDNG2022106), and the Innovative Research Team Project of Tarim University President (TDZKCX202308).

Data Availability Statement

The data utilized in this study can be obtained upon request from the corresponding author.

Acknowledgments

The authors would like to thank the research team members for their contributions to this work.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Liu, Z. National Cotton Production Increased by 9.7% Year-on-Year in 2024. Technical Report. People’s Daily, 27 December 2024. [Google Scholar]
  2. Nie, J.; Li, Z.; Zhang, Y.; Zhang, D.; Xu, S.; He, N.; Zhan, Z.; Dai, J.; Li, C.; Li, W.; et al. Plant pruning affects photosynthesis and photoassimilate partitioning in relation to the yield formation of field-grown cotton. Ind. Crops Prod. 2021, 173, 114087. [Google Scholar] [CrossRef]
  3. Renou, A.; Téréta, I.; Togola, M. Manual topping decreases bollworm infestations in cotton cultivation in Mali. Crop Prot. 2011, 30, 1370–1375. [Google Scholar] [CrossRef]
  4. Qing, X.; Fanting, K.; Lei, S.; Changlin, C.; Teng, W.; Yongfei, S. Research progress on key technologies of cotton mechanical topping. J. Chin. Agric. Mech. 2023, 44, 28. [Google Scholar]
  5. Xu, Y.; Han, C.; Qiu, S.; You, J.; Zhang, J.; Luo, Y.; Hu, B. Design and Experimental Evaluation of a Minimal-Damage Cotton Topping Device. Agriculture 2024, 14, 2341. [Google Scholar] [CrossRef]
  6. Zhang, X.; Chen, L. Bud-YOLO: A Real-Time Accurate Detection Method of Cotton Top Buds in Cotton Fields. Agriculture 2024, 14, 1651. [Google Scholar] [CrossRef]
  7. Li, J.; Zhi, X.; Wang, Y.; Cao, Q. Research on Intelligent recognition system of Cotton apical Bud based on Deep Learning. J. Phys. Conf. Ser. 2021, 1820, 012134. [Google Scholar] [CrossRef]
  8. Zhang, J.; Yasenjiang, M.; Yao, J. Lightweight Model Construction for Cotton Terminal Bud Detection Based on YOLOv8. In Proceedings of the IEEE 2024 9th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 19–21 September 2024; pp. 1740–1744. [Google Scholar]
  9. Wang, H.; Gu, J.; Wang, M.; Xia, Z. Research on tea bud segmentation and picking point location based on deep learning. J. Chin. Agric. Mech. 2024, 45, 246–252. [Google Scholar] [CrossRef]
  10. Guo, P.; Li, W.; Xu, D.; Bai, X.; Wang, Z. Journal of Chinese Agricultural Mechanization. J. Beijing For. Univ. 2024, 46, 133–138. [Google Scholar]
  11. Wang, W.; Shan, Y.; Hu, T.; Gu, J.; Zhu, Y.; Gao, Y. Locating apple picking points using semantic segmentation of targetregion. Trans. Chin. Soc. Agric. Eng. 2024, 40, 172–178. [Google Scholar]
  12. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
  13. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  14. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  15. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
Figure 1. Example images of the four acquisition conditions. Images were photographed under different weather conditions and times of day: (a) sunny morning; (b) sunny noon; (c) dusty morning; (d) dusty noon.
Figure 2. Annotating a top bud image with the LabelMe tool (the red region is the mask coverage area).
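The polygon annotations produced in this way can be converted into the label format expected by YOLO-style segmentation training. The following is a minimal sketch rather than the authors' actual pipeline; the directory names and the single "top_bud" class index are assumptions for illustration.

```python
import json
from pathlib import Path

def labelme_to_yolo_seg(json_path: Path, out_dir: Path, class_map: dict) -> None:
    """Convert one LabelMe polygon annotation to a YOLO-seg label file.

    Each output line is: <class_id> x1 y1 x2 y2 ..., with coordinates
    normalized by the image width and height.
    """
    data = json.loads(json_path.read_text(encoding="utf-8"))
    w, h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        if shape.get("shape_type", "polygon") != "polygon":
            continue  # only polygon masks are converted
        cls_id = class_map[shape["label"]]
        coords = []
        for x, y in shape["points"]:
            coords += [f"{x / w:.6f}", f"{y / h:.6f}"]
        lines.append(" ".join([str(cls_id)] + coords))
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{json_path.stem}.txt").write_text("\n".join(lines), encoding="utf-8")

# Hypothetical usage: a single "top_bud" class mapped to index 0.
if __name__ == "__main__":
    for jp in Path("annotations").glob("*.json"):
        labelme_to_yolo_seg(jp, Path("labels"), {"top_bud": 0})
```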
Figure 3. Data augmentation examples: (a) random brightness adjustment; (b) noise addition; (c) tangential (shear) transformation; (d) random rotation; (e) mirror transformation; (f) graying.
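The six augmentations illustrated in Figure 3 can be reproduced with standard OpenCV/NumPy operations. The sketch below is only indicative: the parameter ranges are assumptions, "tangential transformation" is interpreted here as a shear, and the geometric transforms would also have to be applied to the polygon labels.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def random_brightness(img):
    # Scale pixel intensities by a random factor (assumed range 0.6-1.4).
    return cv2.convertScaleAbs(img, alpha=rng.uniform(0.6, 1.4), beta=0)

def add_noise(img, sigma=15):
    # Additive Gaussian noise with an assumed standard deviation.
    noise = rng.normal(0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def shear(img, factor=0.15):
    # Horizontal shear ("tangential transformation" as interpreted here).
    h, w = img.shape[:2]
    m = np.float32([[1, factor, 0], [0, 1, 0]])
    return cv2.warpAffine(img, m, (w, h))

def random_rotation(img, max_deg=30):
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-max_deg, max_deg), 1.0)
    return cv2.warpAffine(img, m, (w, h))

def mirror(img):
    return cv2.flip(img, 1)  # horizontal flip

def to_gray(img):
    g = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(g, cv2.COLOR_GRAY2BGR)  # keep 3 channels for the model input

# Note: shear, rotation and mirroring change object positions, so the
# corresponding polygon annotations must be transformed with the same matrices.
```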
Figure 4. Network structure of YOLO11n-seg.
Figure 5. Structure diagram of CBLN-YOLO and general modules. (The remaining modules or structures will be described in detail below.)
Figure 6. Structure of LDConv module.
Figure 7. The process of replacing the MHSA mechanism with the CA mechanism.
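For reference, a PyTorch sketch of the coordinate attention block as described in the original CA paper is given below. The reduction ratio of 32 is the commonly used default and is an assumption here, and the exact way the block is wired into the network in place of MHSA is not shown.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention: global pooling is factorized into two 1-D poolings
    along height and width, so the attention maps retain positional information
    in both directions."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        xh = self.pool_h(x)                        # (N, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)    # (N, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # height attention
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # width attention
        return x * ah * aw
```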
Figure 8. Structure diagram of EFC-FPN and EFC module.
Figure 9. Schematic diagram of Inner IoU.
Figure 10. Performance changes of the CBLN-YOLO model during training. Points on the blue line represent the model's performance after each epoch, and the orange points represent the fitted values obtained by smoothing the curve.
Figure 11. Changes in the model's loss values during training: (a) box loss; (b) segmentation loss; (c) classification loss; (d) distribution focal loss.
Figure 12. Segmentation results of YOLO11n-seg and CBLN-YOLO on the cotton top bud dataset (the lower-right corner of each image shows an enlarged view of the top bud): (a) YOLO11n-seg segments random brightness-enhanced images; (b) CBLN-YOLO segments random brightness-enhanced images; (c) YOLO11n-seg segments noise-processed images; (d) CBLN-YOLO segments noise-processed images; (e) YOLO11n-seg segments images after tangential transformation; (f) CBLN-YOLO segments images after tangential transformation; (g) YOLO11n-seg segments images after random rotation; (h) CBLN-YOLO segments images after random rotation; (i) YOLO11n-seg segments mirrored images; (j) CBLN-YOLO segments mirrored images; (k) YOLO11n-seg segments grayscale images; (l) CBLN-YOLO segments grayscale images.
Figure 13. Experimental results of cotton top bud segmentation in the field (the lower-right corner of each image shows an enlarged view of the top bud): (a) uncovered top buds, sunny morning; (b) uncovered top buds, sunny noon; (c) uncovered top buds, dusty morning; (d) uncovered top buds, dusty noon; (e) covered top buds, sunny morning; (f) covered top buds, sunny noon; (g) covered top buds, dusty morning; (h) covered top buds, dusty noon.
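To connect the segmentation output to top bud positioning, the centroid of each predicted mask can serve as the candidate topping point in image coordinates. The sketch below uses the generic Ultralytics predict API; the weight and image file names and the confidence threshold are assumptions, and mapping the pixel coordinate to a physical cutting position (e.g., via a depth camera) is outside its scope.

```python
import cv2
import numpy as np
from ultralytics import YOLO

# Hypothetical paths: replace with the trained segmentation weights and a field image.
model = YOLO("cbln_yolo_best.pt")
result = model("field_plant.jpg", conf=0.5)[0]

if result.masks is not None:
    for poly in result.masks.xy:  # one polygon per detected top bud, in pixel coordinates
        m = cv2.moments(poly.astype(np.float32))
        if m["m00"] > 0:
            cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
            print(f"candidate topping point: ({cx:.1f}, {cy:.1f}) px")
```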
Table 1. Values of the training hyperparameters.
Hyperparameter | Value
epochs | 200
patience | 10
batch_size | 16
img_size | 640
lr0 | 1 × 10⁻⁴
lrf | 1 × 10⁻⁴
momentum | 0.9
weight_decay | 5 × 10⁻⁴
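For orientation, the hyperparameters in Table 1 map directly onto the training arguments of the Ultralytics API (where batch_size and img_size correspond to batch and imgsz). The sketch below launches the unmodified YOLO11n-seg baseline under these settings; the dataset YAML name is an assumption, and reproducing CBLN-YOLO itself would additionally require the LDConv, CA, EFC-FPN and Inner CIoU modifications described in the paper.

```python
from ultralytics import YOLO

# Baseline YOLO11n-seg with the Table 1 hyperparameters (dataset YAML name assumed).
model = YOLO("yolo11n-seg.pt")
model.train(
    data="cotton_topbud.yaml",
    epochs=200,
    patience=10,
    batch=16,
    imgsz=640,
    lr0=1e-4,
    lrf=1e-4,
    momentum=0.9,
    weight_decay=5e-4,
)
```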
Table 2. Comparison of model performance on the detection (Box) and segmentation (Mask) tasks.
Model | P (Box) | R (Box) | mAP@0.5 (Box) | P (Mask) | R (Mask) | mAP@0.5 (Mask) | FPS
Mask R-CNN | 0.915 | 0.917 | 0.927 | 0.912 | 0.895 | 0.915 | 57.1
YOLOv8n-seg | 0.870 | 0.903 | 0.899 | 0.881 | 0.901 | 0.889 | 105.3
YOLOv9t-seg | 0.904 | 0.889 | 0.918 | 0.904 | 0.903 | 0.911 | 70.9
YOLOv10n-seg | 0.921 | 0.867 | 0.922 | 0.906 | 0.861 | 0.915 | 156.3
YOLO11n-seg | 0.951 | 0.937 | 0.951 | 0.943 | 0.933 | 0.920 | 125.0
YOLOv12n-seg | 0.921 | 0.861 | 0.934 | 0.927 | 0.848 | 0.917 | 103.8
CBLN-YOLO | 0.959 | 0.977 | 0.983 | 0.968 | 0.945 | 0.958 | 135.1
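The box/mask metrics and the FPS column in Table 2 can be obtained with the standard Ultralytics validation and prediction calls. A hedged sketch is shown below; the weight file, dataset YAML and test image folder are assumed names.

```python
import glob
from ultralytics import YOLO

model = YOLO("cbln_yolo_best.pt")  # hypothetical trained weights

# Box/mask precision, recall and mAP@0.5 on the test split.
metrics = model.val(data="cotton_topbud.yaml", split="test")
print(f"box mAP@0.5:  {metrics.box.map50:.3f}")
print(f"mask mAP@0.5: {metrics.seg.map50:.3f}")

# A rough FPS estimate from the per-image inference time reported by predict().
times = [model(p, verbose=False)[0].speed["inference"] for p in glob.glob("test_images/*.jpg")]
if times:
    print(f"approx. FPS: {1000.0 / (sum(times) / len(times)):.1f}")
```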
Table 3. Effect of different values of the ratio factor on the performance of the model.
Ratio | P (Box) | R (Box) | mAP@0.5 (Box) | P (Mask) | R (Mask) | mAP@0.5 (Mask) | FPS
0 (CIoU) | 0.955 | 0.957 | 0.963 | 0.954 | 0.940 | 0.952 | 142.6
0.5 | 0.954 | 0.960 | 0.969 | 0.955 | 0.943 | 0.954 | 140.5
1 | 0.957 | 0.965 | 0.971 | 0.961 | 0.945 | 0.957 | 138.8
1.5 | 0.959 | 0.977 | 0.983 | 0.968 | 0.945 | 0.958 | 135.1
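As a reference for how the ratio factor in Table 3 acts, the sketch below follows the general Inner-IoU formulation: auxiliary boxes that share the original box centers are scaled by ratio before the IoU is computed. It is a minimal illustration; the exact loss composition used in CBLN-YOLO may differ in detail.

```python
def inner_iou(pred, gt, ratio=1.5):
    """Inner-IoU between two boxes given as (xc, yc, w, h).

    The auxiliary boxes keep the original centers and are scaled by `ratio`:
    ratio > 1 enlarges them (emphasizes low-IoU, hard samples), ratio < 1 shrinks them.
    """
    (px, py, pw, ph), (gx, gy, gw, gh) = pred, gt
    p_l, p_r = px - ratio * pw / 2, px + ratio * pw / 2
    p_t, p_b = py - ratio * ph / 2, py + ratio * ph / 2
    g_l, g_r = gx - ratio * gw / 2, gx + ratio * gw / 2
    g_t, g_b = gy - ratio * gh / 2, gy + ratio * gh / 2
    inter = max(0.0, min(p_r, g_r) - max(p_l, g_l)) * max(0.0, min(p_b, g_b) - max(p_t, g_t))
    union = (pw * ph + gw * gh) * ratio ** 2 - inter
    return inter / (union + 1e-9)

# The regression loss then replaces the plain IoU term of CIoU:
#   L_Inner-CIoU = L_CIoU + IoU - IoU_inner
# (the "0 (CIoU)" row of Table 3 corresponds to the plain CIoU baseline).
print(inner_iou((50, 50, 20, 20), (55, 52, 22, 18), ratio=1.5))
```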
Table 4. Comparison of the effects of adding different modules.
Group | LDConv | CA | EFC-FPN | Inner CIoU | P (Box) | R (Box) | mAP@0.5 (Box) | P (Mask) | R (Mask) | mAP@0.5 (Mask) | FPS
1 |  |  |  |  | 0.951 | 0.937 | 0.951 | 0.943 | 0.933 | 0.920 | 125.0
2 |  |  |  |  | 0.950 | 0.944 | 0.954 | 0.943 | 0.936 | 0.933 | 140.2
3 |  |  |  |  | 0.947 | 0.958 | 0.961 | 0.940 | 0.945 | 0.948 | 145.5
4 |  |  |  |  | 0.955 | 0.957 | 0.963 | 0.954 | 0.940 | 0.952 | 142.6
5 |  |  |  |  | 0.950 | 0.955 | 0.964 | 0.943 | 0.946 | 0.956 | 138.7
6 |  |  |  |  | 0.959 | 0.977 | 0.983 | 0.968 | 0.945 | 0.958 | 135.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
