Real-Time Detection and Counting of Wheat Spikes Based on Improved YOLOv10

Guan, Sitong; Lin, Yiming; Lin, Guoyu; Su, Peisen; Huang, Siluo; Meng, Xianyong; Liu, Pingzeng; Yan, Jun

doi:10.3390/agronomy14091936

by Sitong Guan ¹, Yiming Lin ¹, Guoyu Lin ¹, Peisen Su ²,*, Siluo Huang ³, Xianyong Meng ¹,⁴, Pingzeng Liu ¹,⁴ and Jun Yan ¹,⁴,*

¹ College of Information Science and Engineering, Shandong Agricultural University, Tai’an 271018, China

² College of Agronomy and Agricultural Engineering, Liaocheng University, Liaocheng 252000, China

³ College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

⁴ Key Laboratory of Huang-Huai-Hai Smart Agricultural Technology, Ministry of Agriculture and Rural Affairs, Tai’an 271018, China

* Authors to whom correspondence should be addressed.

Agronomy 2024, 14(9), 1936; https://doi.org/10.3390/agronomy14091936

Submission received: 25 July 2024 / Revised: 19 August 2024 / Accepted: 26 August 2024 / Published: 28 August 2024

(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Wheat is one of the most crucial food crops globally, with its yield directly impacting global food security. The accurate detection and counting of wheat spikes is essential for monitoring wheat growth, predicting yield, and managing fields. However, the current methods face challenges, such as spike size variation, shading, weed interference, and dense distribution. Conventional machine learning approaches have partially addressed these challenges, yet they are hampered by limited detection accuracy, complexities in feature extraction, and poor robustness under complex field conditions. In this paper, we propose an improved YOLOv10 algorithm that significantly enhances the model’s feature extraction and detection capabilities. This is achieved by introducing a bidirectional feature pyramid network (BiFPN), a separated and enhancement attention module (SEAM), and a global context network (GCNet). BiFPN leverages both top-down and bottom-up bidirectional paths to achieve multi-scale feature fusion, improving performance in detecting targets of various scales. SEAM enhances feature representation quality and model performance in complex environments by separately augmenting the attention mechanism for channel and spatial features. GCNet captures long-range dependencies in the image through the global context block, enabling the model to process complex information more accurately. The experimental results demonstrate that our method achieved a precision of 93.69%, a recall of 91.70%, and a mean average precision (mAP) of 95.10% in wheat spike detection, outperforming the benchmark YOLOv10 model by 2.02% in precision, 2.92% in recall, and 1.56% in mAP. Additionally, the coefficient of determination (R²) between the detected and manually counted wheat spikes was 0.96, with a mean absolute error (MAE) of 3.57 and a root-mean-square error (RMSE) of 4.09, indicating strong correlation and high accuracy. 
The improved YOLOv10 algorithm effectively solves the difficult problem of wheat spike detection under complex field conditions, providing strong support for agricultural production and research.

Keywords: wheat spike; YOLOv10; detection and counting; BiFPN; SEAM; GCNet

1. Introduction

Wheat (Triticum aestivum L.) is one of the world’s most important food crops [1]. It is not only one of the main food sources for human beings but also plays an important role in animal husbandry and industrial production [2]. Wheat is grown all over the world, with the major production areas including China, India, the United States, Russia, and France. Its production is directly related to global food security [3,4]. The number of wheat spikes is an important indicator of wheat yield; therefore, the accurate detection and counting of wheat spikes is important for monitoring wheat growth and predicting yield [5].

Traditional methods for detecting and counting wheat spikes primarily involve manual counting and remote sensing techniques [6]. Although these methods can meet certain needs, they suffer from low efficiency, poor accuracy, and subjectivity. Manual counting is time-consuming, labor-intensive, and susceptible to human error. Remote sensing prediction can cover large areas, but its accuracy and real-time performance are inadequate, making it challenging to meet the demands for efficient and accurate detection in modern agriculture [7]. With the development and application of high-throughput plant phenotype collection technology, phenotypic data now exhibit high-dimensional, multi-source, heterogeneous, and dynamic characteristics [8], presenting new opportunities and challenges for wheat spike detection. Machine learning, an important branch of artificial intelligence, aims to learn patterns and make predictions and decisions from large datasets [9,10]. Since the 1950s, machine learning has evolved through various schools of thought, from symbolic learning and connectionism to statistical learning [11,12]. With increased computational power and data volume, machine learning techniques have matured and achieved significant results in wheat detection. Research has shown that wheat spikes can be successfully detected from the background using features such as texture, color and morphology. For instance, Yao et al. proposed a combined algorithm named “APW”, which achieved the rapid identification and counting of wheat spikes through the alternating direction multiplier method, Potts algorithm, and watershed algorithm. The results showed that in a low-planting-density scenario, the coefficient of determination (R²) of the APW algorithm was 0.89 and the root-mean-square error (RMSE) was 3.72 [13]. Gu et al. used a density clustering algorithm based on normal vectors and a region growing algorithm based on voxels to estimate the number of wheat spikes in the field. 
DBSC segmented the point cloud using normal vector difference and calculated the number of spikes using a density clustering algorithm; VBRG detected wheat spikes by voxelizing the point cloud, constructing voxel topological relationships, and using a region growing algorithm [14]. Bao et al. proposed a wheat spike counting method based on frequency domain decomposition to improve the accuracy of wheat yield estimation. This method combined a multi-scale support value filter and an improved sampled contourlet transform to perform frequency domain decomposition on the wheat spike image. Then, the spike features were extracted through morphological operations and maximum entropy threshold segmentation, and the skeleton refinement and corner detection algorithms were used for counting [15].

Although machine learning methods have greatly facilitated researchers in wheat detection, traditional machine learning methods are usually more complex and time-consuming in terms of feature extraction, etc. [16], and it is difficult to guarantee the quality of the features, which limits their applicability in real-world environments. Deep learning, a subfield of machine learning, learns layer-by-layer abstraction and representation of input data through multi-layer neural networks, thus achieving the modeling of complex data structures and nonlinear relationships [17], which has become a research hotspot in the field of pattern recognition. The development of deep learning can be traced back to the early neural network concepts in the 1940s [18], and the proposal of backpropagation algorithms and convolutional neural networks (CNNs) in the 1980s marked the rise of deep learning. In particular, the success of AlexNet at the 2012 ImageNet competition established deep learning as a mainstream technique [19]. Target detection is an important task in computer vision that aims to identify and locate targets in images. Unlike image classification, target detection requires not only identifying the class of the target, but also pinpointing the target’s location in the image [20]. CNNs have significant advantages in target detection [21], which can effectively solve these problems, mainly by the following aspects: (1) Strong feature extraction ability: CNNs can automatically extract multi-level feature representations from images through multi-layer convolution and pooling operations, capturing information such as edges, textures, and shapes of the target, thus demonstrating strong feature extraction capabilities. (2) End-to-end training: CNNs can be trained in an end-to-end manner, from the input image to the output target location and category. The entire process can be optimized using the back-propagation algorithm, simplifying feature extraction and classification. 
(3) Efficient computation: The convolutional operation features local connection and weight sharing, which provides CNNs with high computational efficiency when processing large-scale image data, enabling the faster completion of target detection tasks.

In recent years, target detection algorithms based on CNNs have emerged in an endless stream, mainly including single-stage algorithms represented by the YOLO [22] series and SSD [23] series, and two-stage algorithms represented by the R-CNN [24] series. Shen et al. used the ShuffleNetV2 lightweight convolutional neural network for optimization based on YOLOv5s and used a lightweight upsampling operator to reorganize features in the feature pyramid structure to improve spatial resolution and perception effects. The mean average precision (mAP) of the detection was 94.8% [25]. Li et al. used Faster R-CNN to process RGB images and developed an image-based wheat spike counting method with an average accuracy of 86.7%. It was also applied to genetic research [26]. Batin et al. proposed a wheat spike segmentation model called WheatSpikeNet to accurately estimate the number of wheat spikes from field images. They used Cascade Mask RCNN as the baseline algorithm, ResNet50 and deformable convolutional network as the backbone architecture for feature extraction, and a general RoI extractor for RoI pooling, achieving advanced detection and segmentation performance, with the mAP of the model border and mask reaching 0.9303 and 0.9416, respectively [27]. Li et al. introduced FasterCANet Block in YOLOv8s to improve detection speed and proposed an efficient QAFPN network structure to enhance the neck network to achieve a balance between speed and accuracy. Finally, RFB Block was introduced to improve the SPPF layer to better capture small target features. The improved algorithm achieved 94.4% in mAP [28]. Wang et al. proposed a new detection model called SCP-YOLO to solve the problem of detecting small and overlapping wheat spikes in drone images. 
The model is based on the YOLOv8n network, processes low-resolution feature layers through a space-to-depth method, and applies a context aggregation structure in the feature fusion stage to promote the aggregation and interaction of global feature map information. Compared to the baseline network, AP@0.5 and AP@0.5:0.95 were improved by 2.5% and 6.3%, respectively [29]. These studies show that, in terms of detection accuracy, both two-stage and single-stage detection algorithms are adequate for wheat spike detection, but single-stage algorithms are faster and therefore more suitable for real-time wheat spike detection and counting. At present, the single-stage YOLO series is the most widely used family of target detection algorithms in practical scenarios. YOLO was proposed by Joseph Redmon et al. in 2015; its core idea is to transform target detection into a single regression problem, predicting the location and category of targets in the image directly with a CNN. As the YOLO family has developed, each version has been optimized and improved in terms of network structure, loss function, anchor box adjustment, and input resolution [30], significantly enhancing the speed and accuracy of target detection.

Although existing research has shown excellent detection accuracy, there are still some shortcomings in practical applications. Dynamic factors, such as occlusion and weather changes in field environments, can easily lead to a decline in model performance. For example, in a densely planted field environment, wheat spikes may be blocked by other plants, resulting in missed detection or false detection [31]; changes in lighting and weather may also affect image quality, thereby reducing recognition accuracy [32]. Current data enhancement strategies and feature fusion methods still have certain limitations when dealing with variable field images. To this end, this study proposed a wheat spike detection method based on improved YOLOv10. By introducing a more advanced network structure and optimization strategy, the accuracy and generalization ability of wheat spike detection were improved, enabling it to perform better in complex field environments. Four main improvements were made in this study:

(1) In order to make the data closer to the real world, we enrich the data enhancement strategies, including Flip, Rotate, Brightness Adjustment, Add Noise, Cutout, and Mosaic methods, so that the model can perform well under a variety of changing situations.
(2) The bidirectional feature pyramid network (BiFPN) module is introduced to realize the efficient fusion of multi-scale features through top-down and bottom-up bidirectional paths, and the contribution of each input feature is dynamically adjusted through a weighted feature fusion mechanism to improve detection accuracy.
(3) The separated and enhancement attention module (SEAM) is adopted to enhance the output part of the neck layer, which improves the quality and consistency of the feature representation through the separation and enhancement of the attention mechanism and enhances the performance of the model in complex environments.
(4) The global context network (GCNet) module is incorporated to capture the long-range dependencies in the image through the global context block to enhance the feature representation, thus further improving the detection accuracy and robustness.

The experimental results show that the improved YOLOv10 model effectively addresses the shortcomings of previous studies and significantly improves the detection performance in complex field environments. The model performs well in the accuracy and robustness of wheat spike detection, which meets the demand for efficient and accurate detection in modern agriculture. These improvements provide strong support for the development of smart agriculture and demonstrate the great potential of deep learning in practical applications.

2. Materials and Methods

2.1. Dataset Processing

2.1.1. Dataset

The dataset used in this study is from the Global Wheat Head Detection (GWHD) dataset provided by the International Conference on Computer Vision 2021 [33]. Compared to the 2020 version, this dataset includes more than 270,000 additional pieces of wheat spike data from five countries: China, Japan, France, the United Kingdom, and Canada. These data encompass wheat images from various regions, varieties, and growing conditions. The issues of unbalanced test sets and labeling noise present in the previous version of the dataset have been effectively resolved.

After a thorough cleaning and inspection of the dataset, we found some missing or incorrect labels in it. For this, we used labelImg (https://github.com/tzutalin/labelImg, accessed on 1 June 2024) for manual labeling. During the labeling process, it was ensured that each wheat spike was completely enclosed within a rectangular bounding box. Eventually, we selected 2738 images with a resolution of 1024 × 1024 pixels as the initial dataset and expanded it to 10,952 images through data enhancement. Figure 1 shows some of the wheat images in the dataset. The dataset was divided into a training set, validation set, and test set at a ratio of 8:1:1, and the specific information is shown in Table 1.

2.1.2. Dataset Augmentation

In our study, we employed a variety of data enhancement methods to improve the robustness of the target detection model, including the following: (1) Flip: Flip the image and bounding box by 180 degrees to help the model adapt to different viewing angles and orientations. (2) Rotation: Perform 30–60 degree random rotation of the image and bounding box to simulate different shooting angles. (3) Brightness adjustment: Decrease or increase the brightness of the image with an adjustment range of ±30% to simulate different lighting conditions. (4) Add noise: Introduce Gaussian noise into the image with a noise intensity of 0.03 to 0.05 to reduce the effect of noise on the detection results. (5) Cutout: Randomly mask a part of the image with a masking length of 200 pixels and a number of masking holes of 1 to simulate the case of object masking. (6) Mosaic: Four images are spliced together by randomly cropping each image to 40% to 60% of its original size, randomly scaling each image to 40% to 100% of its original size, and then randomly permuting the images to generate new training samples and increase the diversity of the training data.

For each original image, we randomly selected three of the above methods for the enhancement process. These enhanced images, along with the original images, formed the final dataset. Figure 2 illustrates the data-enhanced images of wheat spikes. These data enhancement operations enable the model to maintain good performance under various conditions, significantly improving the model’s generalization ability and robustness.
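Some of the augmentation steps above can be illustrated with a minimal sketch. It assumes an image stored as a list of pixel rows and boxes given as (x_min, y_min, x_max, y_max) in pixel coordinates; the function names are hypothetical, and this is not the authors' pipeline. It shows the 180-degree flip with the matching box remapping, a brightness scaling, a single-hole cutout, and the random choice of three methods per image.

```python
import random

def flip_180(image, boxes):
    """Rotate an image (list of pixel rows) by 180 degrees and remap its boxes.

    Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    height, width = len(image), len(image[0])
    flipped = [row[::-1] for row in image[::-1]]
    new_boxes = [(width - x2, height - y2, width - x1, height - y1)
                 for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes

def adjust_brightness(image, factor):
    """Scale pixel values by `factor` (0.7 to 1.3 for +/-30%), clipped to 0..255."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]

def cutout(image, top, left, size, fill=0):
    """Mask one square hole of side `size`; the boxes are left unchanged."""
    out = [row[:] for row in image]
    for y in range(max(0, top), min(len(out), top + size)):
        for x in range(max(0, left), min(len(out[0]), left + size)):
            out[y][x] = fill
    return out

def pick_methods(rng, k=3):
    """Choose k distinct augmentation names for one original image."""
    methods = ["flip", "rotate", "brightness", "noise", "cutout", "mosaic"]
    return rng.sample(methods, k)

if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6]]
    print(flip_180(image, [(0, 0, 1, 1)]))
    print(pick_methods(random.Random(0)))
```

If each of the three selected methods yields one augmented copy, every original image contributes four images in total, which is consistent with the reported growth from 2738 to 10,952 images (4 × 2738 = 10,952).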

2.2. The Improvement of the YOLOv10 Model

YOLOv10 is the latest version of the YOLO family, focusing on real-time end-to-end target detection. Since its introduction, the YOLO family has dominated the field of real-time target detection. YOLOv10 introduces a consistent dual assignments strategy for NMS-free training and employs a holistic efficiency-accuracy driven model design strategy, significantly improving its performance and efficiency [34]. During the detection process, the image is first fed into the model, and the features are extracted through the backbone network. Next, the neck module fuses the features at different scales and passes them to the head module. The head module generates the predicted bounding boxes and categories. Because YOLOv10 is trained to work without non-maximum suppression (NMS), no NMS post-processing is needed at inference, which reduces latency.

The backbone network of YOLOv10 uses an enhanced version of the cross-stage partial network (CSPNet) [35], which effectively improves the gradient flow and reduces computational redundancy. This design allows the model to better extract image features while remaining efficient. The neck module employs path aggregation network (PANet) layers [36] to achieve an effective fusion of multi-scale features, thus improving the performance of the model when dealing with targets of different sizes. The head module is divided into a one-to-many head and a one-to-one head: the one-to-many head generates multiple predictions for each object during training to provide rich supervisory signals and improve learning accuracy; the one-to-one head generates a single best prediction for each object during inference, eliminating the need for NMS and thus reducing latency and improving efficiency. Additionally, YOLOv10 optimizes various components from both the efficiency and accuracy perspectives, including a lightweight classification head, spatial-channel decoupled downsampling, and a rank-guided block design.

Wheat spike detection is easily affected by a variety of factors, such as complex field environments, the dense distribution of wheat spikes, and morphological differences between wheat spikes at different growth stages. These factors make it difficult for traditional wheat spike detection methods to achieve satisfactory results in practical applications. To address these issues, we improved YOLOv10 by introducing the BiFPN, SEAM, and GCNet modules to enhance its performance in wheat spike detection.

2.2.1. Introduction to BiFPN

BiFPN, a feature fusion network designed for target detection to improve the efficiency and effectiveness of multi-scale feature fusion, was first proposed in the EfficientDet model. Compared to the traditional feature pyramid network (FPN) [37], which uses only a top-down path, BiFPN fuses features along both top-down and bottom-up paths, as PANet [36] does, and additionally uses learned fusion weights and a simplified topology to achieve more comprehensive multi-scale feature fusion. Its structure is shown in Figure 3 [38].

BiFPN achieves an efficient fusion of multi-scale features through bidirectional paths (top-down and bottom-up), allowing it to capture features from different layers more comprehensively. In traditional feature pyramid networks, all input features are usually treated equally: features with different resolutions are simply added together without considering their varying contributions to the output features [39]. To address this issue, BiFPN introduces a weighted feature fusion mechanism that learns a weight for each input feature and thereby dynamically adjusts its contribution in the fusion process. These weights can be scalars (per feature), vectors (per channel), or multi-dimensional tensors (per pixel); scalar weights achieve accuracy comparable to the other forms at minimal computational cost.

In order to simplify the network structure and reduce the amount of computation and number of parameters, BiFPN also removes some redundant nodes and connections. This not only improves the computational efficiency, but also reduces the complexity of the model. The bidirectional feature fusion and weighted feature fusion enable BiFPN to better capture target features at different scales, especially in small target detection, such as wheat spikes.
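The weighted fusion described above can be sketched as follows. The text does not give the fusion formula, so this sketch assumes the "fast normalized fusion" form associated with EfficientDet, O = Σ w_i · I_i / (ε + Σ w_j), with one learnable scalar weight per input. The function name is hypothetical, the inputs are assumed to have already been resized to a common resolution, and feature maps are flattened to plain lists for brevity.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with non-negative, normalized weights.

    features: list of feature maps, each a flat list of floats (same length).
    weights:  one learnable scalar per input feature map.
    """
    # ReLU keeps each weight non-negative so the normalization is well behaved.
    w = [max(0.0, wi) for wi in weights]
    total = sum(w) + eps
    length = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / total
            for k in range(length)]

if __name__ == "__main__":
    low_res = [1.0, 2.0]
    high_res = [3.0, 4.0]
    print(fast_normalized_fusion([low_res, high_res], [0.8, 0.2]))
```

With equal weights the result approaches a plain average of the inputs; when training drives a weight toward zero, that input is effectively removed from the fusion.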

2.2.2. Introduction to SEAM

SEAM is an attention mechanism designed to improve the performance of target detection models in complex scenes. By separating and enhancing attention, SEAM effectively compensates for the feature loss of the occluded portion and improves the model’s ability to detect occluded targets. Its structure mainly consists of depthwise separable convolution, residual connections, and fully connected layers with channel attention. The structure is shown in Figure 4 [40].

SEAM jointly models the channel and spatial dimensions of the feature map by weighting each channel and each location through the channel attention mechanism and the spatial attention mechanism, respectively, and captures multi-scale features in the image through patch embeddings of different sizes. Global average pooling compresses the feature map into channel vectors. The depthwise separable convolution operates independently on each channel, and the information across channels is then fused by pointwise convolution. This method not only reduces the number of parameters but also preserves the independence between channels, which improves the computational efficiency and the accuracy of feature extraction and enables SEAM to deal with detailed features in complex scenes more effectively. Finally, SEAM generates channel attention weights through a fully connected network. These weights are used to re-weight the individual channels of the input feature map, which enhances the response of the important channels and improves the accuracy of target detection. In addition, SEAM effectively mitigates the vanishing gradient problem and facilitates the training of deep networks through residual connections. The residual connection adds the input features to those processed by the depthwise separable convolution, which enhances the robustness of the feature representation and improves the stability of the model and its adaptability to targets of different scales.
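The channel re-weighting step described above can be illustrated in isolation. This sketch covers only global average pooling, a small fully connected network, and the per-channel scaling; the depthwise separable convolution, patch embedding, and residual connection are omitted. The two-layer ReLU/sigmoid form and the weight matrices `w1` and `w2` are assumptions made for illustration, not the published SEAM design.

```python
import math

def channel_attention(feature_map, w1, w2):
    """Re-weight channels using weights from a small fully connected network.

    feature_map: list of channels, each a flat list of activations.
    w1: hidden x channels matrix, w2: channels x hidden matrix (hypothetical).
    """
    # Global average pooling: one descriptor per channel.
    pooled = [sum(ch) / len(ch) for ch in feature_map]
    # Two fully connected layers: ReLU, then sigmoid.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    scores = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w2]
    # Scale every activation in a channel by that channel's weight.
    return [[s * v for v in ch] for s, ch in zip(scores, feature_map)]

if __name__ == "__main__":
    fmap = [[1.0, 3.0], [2.0, 2.0]]
    print(channel_attention(fmap, [[0.5, -0.5]], [[1.0], [-1.0]]))
```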

2.2.3. Introduction to GCNet

GCNet is an attention mechanism model designed specifically for image-processing tasks that aims to improve the understanding and utilization of global context information in deep neural networks. GCNet combines the advantages of non-local networks (NLNet) [41] and squeeze-and-excitation networks (SENet) [42], providing a concise, efficient, and powerful global context modeling method. Traditional CNNs mainly rely on local receptive fields for convolution operations when processing images. The limitation of this method is that each operation can only model a limited area of the image, which restricts the receptive field, especially when dealing with targets with long-range correlations. GCNet significantly enhances the feature representation ability of the model by capturing long-distance dependencies in images.
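The idea of capturing long-range dependencies through a shared global context can be sketched as follows. The sketch assumes the commonly described form of a global context block: each position is scored, the scores are normalized with a softmax over all positions, the attention-weighted sum of features forms one context vector, and that vector is added to every position. The bottleneck transform that usually precedes the addition is omitted, `key_weights` stands in for a learned 1 × 1 convolution, and none of this is taken from the authors' implementation.

```python
import math

def global_context(features, key_weights):
    """Add one shared global context vector to every position.

    features:    list of positions, each a list of C channel values.
    key_weights: C weights that score each position (stands in for a 1x1 conv).
    """
    # Score each position, then apply a softmax over all positions.
    scores = [sum(w * v for w, v in zip(key_weights, pos)) for pos in features]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    # The attention-weighted sum of all positions is the global context vector.
    channels = len(features[0])
    context = [sum(a * pos[c] for a, pos in zip(attention, features))
               for c in range(channels)]
    # Broadcast addition: every position receives the same context.
    return [[v + context[c] for c, v in enumerate(pos)] for pos in features]

if __name__ == "__main__":
    positions = [[1.0, 0.0], [3.0, 2.0]]
    print(global_context(positions, [0.5, 0.5]))
```

Because the context vector is computed once and shared by all positions, the cost grows linearly with the number of positions, which is the usual argument for this design over pairwise non-local attention.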

The network structure of GCNet mainly consists of two parts: a feature extractor and a classifier, as shown in Figure 5 [43]. The feature extractor captures the initial features of the input data through CNNs. The bottleneck structure reduces the computational complexity by reducing the dimension or the number of channels of the feature map, helping the network focus on the most important features and ensuring the efficiency of subsequent processing. GCNet uses the Gram matrix to build an attention mechanism module, which highlights important features by assigning different weights and ignores irrelevant information, thereby improving the performance and accuracy of the model. In multiple attention mechanism blocks, the network can focus on the most relevant parts of the input data, which is particularly important for processing large-scale data and complex scenarios. Finally, the classifier part is responsible for classifying the fused features and outputting the final prediction results. By combining these structures, GCNet can achieve efficient and accurate feature recognition and classification in complex scenarios. Its design makes it particularly suitable for tasks that require the processing of large-scale data and have global context dependencies.

2.2.4. Network Architecture Diagram of the Improved YOLOv10 Model

This study introduces several key modifications to the YOLOv10 architecture to enhance its feature extraction and detection capabilities. First, the network integrates the BiFPN module, which fuses multi-scale features efficiently through top-down and bottom-up paths. BiFPN establishes stronger connectivity between feature maps at different scales, and for wheat spikes of different sizes and shapes its multi-scale fusion mechanism significantly improves the model’s accuracy in detecting both small and large targets. Second, multiple SEAM modules are added in the head section. Through detailed processing of the spatial and channel dimensions, SEAM dynamically adjusts the weights of different positions and channels, which strengthens the network’s attention when facing complex backgrounds and occlusions. Because wheat spikes are often occluded by leaves or other wheat parts, this makes the model more robust in complex scenes and improves detection accuracy. Third, the GCNet module is introduced. Its global contextual attention mechanism lets the network dynamically adjust the weights at different locations and model the relationships between features globally, so that the model performs better on images with complex backgrounds and long-distance dependencies. By helping the model understand and exploit the global information of the image, GCNet significantly improves its expressive power.

The improved YOLOv10 model excels in wheat spike detection tasks and is able to effectively manage complex field environments. These architectural improvements are designed to provide YOLOv10 with more robust and accurate detection performance, especially for the complex task of wheat spike detection. Figure 6 illustrates the detailed structure of the improved YOLOv10 model.

3. Results

3.1. Evaluation Indicators

In this study, we used the precision, recall, F1-score, and mAP as evaluation metrics to assess the performance of the wheat spike detection algorithm. Precision measures the proportion of samples predicted as wheat spikes that are actually wheat spikes. Recall measures the proportion of actual wheat spikes correctly identified by the model. The F1-score, which is the harmonic mean of precision and recall, provides a combined evaluation of the model’s performance. mAP is the mean of the average precision (AP) values across all categories. Formulas (1)–(5) define the precision, recall, AP, mAP, and F1-score, where TP denotes true positives, FP denotes false positives, and FN denotes false negatives. P(R) denotes the precision at a recall of R, (AP)_j denotes the average precision of the j-th category, and c denotes the total number of categories.

\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%

(1)

\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%

(2)

AP = \int_{0}^{1} P(R) \, dR

(3)

mAP = \frac{\sum_{j = 1}^{c} (AP)_{j}}{c}

(4)

F1\text{-score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(5)
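Formulas (1)–(5) can be checked numerically with a short sketch. The function names are hypothetical, the counts in the example are made up for illustration, and AP is approximated with the trapezoidal rule over a handful of precision-recall points; evaluation toolkits typically use an interpolated precision curve instead, so the values will not match theirs exactly.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall (as fractions) and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Approximate AP = integral of P(R) dR with the trapezoidal rule.

    Points must be sorted by increasing recall.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area

if __name__ == "__main__":
    print(precision_recall_f1(tp=90, fp=10, fn=30))
    print(average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5]))
```

Since wheat spike detection involves a single category (c = 1), the mAP in Formula (4) reduces to the AP of that category.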

Meanwhile, we used the R², mean absolute percentage error (MAPE), mean absolute error (MAE), and RMSE as evaluation metrics to measure the counting performance of the model. R² measures the goodness of fit, indicating how well the predicted wheat spike counts agree with the actual counts. A higher R² value signifies a better model fit, reflecting the proportion of variation in the actual wheat spike counts explained by the predicted counts. MAPE is the mean of the absolute percentage errors between the predicted and actual wheat spike counts. MAE is the mean of the absolute errors between the predicted and actual wheat spike counts, indicating the average deviation. RMSE is the square root of the mean squared error (MSE), providing a measure of the error magnitude. Lower values of MAPE, MAE, and RMSE indicate higher prediction accuracy. By combining these metrics, we can comprehensively evaluate the model’s counting performance from various perspectives. These evaluation metrics are defined by Formulas (6)–(9), where t_i denotes the true value of the i-th sample, p_i denotes the predicted value of the i-th sample, \bar{t} denotes the mean of the true values of all samples, and n denotes the total number of samples.

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (t_{i} - p_{i})^{2}}{\sum_{i = 1}^{n} (t_{i} - \bar{t})^{2}}

(6)

MAPE = \frac{1}{n} \sum_{i = 1}^{n} \left| \frac{t_{i} - p_{i}}{t_{i}} \right| \times 100\%

(7)

MAE = \frac{1}{n} \sum_{i = 1}^{n} \left| t_{i} - p_{i} \right|

(8)

RMSE = \sqrt{\frac{\sum_{i = 1}^{n} (t_{i} - p_{i})^{2}}{n}}

(9)
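A minimal sketch of Formulas (6)–(9) follows. The function name is hypothetical and the counts in the example are invented for illustration; they are not data from this study. The sketch assumes every true count is non-zero (required by MAPE) and that the true counts are not all identical (required by R²).

```python
import math

def counting_metrics(true_counts, pred_counts):
    """Return R^2, MAPE (%), MAE and RMSE for paired true/predicted counts."""
    n = len(true_counts)
    mean_true = sum(true_counts) / n
    errors = [t - p for t, p in zip(true_counts, pred_counts)]
    ss_res = sum(e * e for e in errors)                        # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in true_counts)    # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mape": 100 * sum(abs(e / t) for e, t in zip(errors, true_counts)) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
    }

if __name__ == "__main__":
    manual = [10, 20, 30]      # illustrative manual counts
    detected = [12, 18, 30]    # illustrative detected counts
    print(counting_metrics(manual, detected))
```

RMSE is never smaller than MAE on the same data, so a large gap between the two points to a few images with large counting errors rather than a uniform bias.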

3.2. Model Training and Performance Analysis

The experiments were conducted under the Linux Ubuntu 20.04.6 LTS operating system environment, configured with a 20 GB system drive and a 50 GB NVMe data drive for instant storage. For the hardware, an NVIDIA GeForce RTX 3090 GPU with 24 GB of video memory was used. PyTorch 2.0.1 and CUDA 11.8 were employed for the deep learning framework. In the training phase, the model was trained for 100 epochs, and the optimization method was stochastic gradient descent (SGD). To determine the optimal hyperparameter configuration, we used a grid search method. By systematically traversing different hyperparameter combinations, we determined the optimal parameter settings: a batch size of 16, weight decay of 0.0005, an initial learning rate of 0.001, and momentum of 0.937.
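The grid search mentioned above can be sketched as an exhaustive loop over candidate combinations. The text reports only the selected values, so the candidate lists below are hypothetical, as are the parameter names and the scoring function; in practice `score_fn` would train the model with the given settings and return a validation metric such as mAP.

```python
from itertools import product

def grid_search(grid, score_fn):
    """Return the best-scoring combination from a dict of candidate lists."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values; only the selected ones are reported in the text.
GRID = {
    "batch_size": [8, 16, 32],
    "lr0": [0.01, 0.001],
    "weight_decay": [0.0005, 0.001],
    "momentum": [0.9, 0.937],
}

if __name__ == "__main__":
    # Toy stand-in for "train, then return validation mAP".
    def toy_score(p):
        return -abs(p["batch_size"] - 16) - abs(p["lr0"] - 0.001)

    print(grid_search(GRID, toy_score))
```

The number of training runs is the product of the list lengths (24 for the grid above), which is why grid search is usually restricted to a few values per hyperparameter.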

After integrating the BiFPN, SEAM, and GCNet modules, we conducted comparison experiments between the improved YOLOv10 model and the SSD, Faster R-CNN, YOLOv8, YOLOv9, and original YOLOv10 models. A total of 1096 images were used as the validation set. To avoid overfitting, we adopted several strategies. First, the diversity of the training data was increased using data augmentation, improving the generalization ability of the model. Second, pretrained YOLOv10n weights were used for initialization, so that generic features learned beforehand improved the initial performance of the model. Third, we adopted early stopping, which halts training when the performance on the validation set has not improved for ten consecutive epochs, to prevent the model from overfitting the training set. As can be seen in Figure 7 and Table 2, the precision, mAP, recall, and F1-score gradually improved as training progressed, and after about 20 epochs the fluctuation of each metric gradually stabilized. This suggests that the model does not suffer from overfitting, underfitting, or vanishing gradients. After 100 epochs of training, the original YOLOv10 algorithm achieved a mAP of 93.54%, a precision of 91.67%, a recall of 88.78%, and an F1-score of 0.90. In contrast, the improved algorithm proposed in this paper achieved a mAP of 95.10%, a precision of 93.69%, a recall of 91.70%, and an F1-score of 0.92 with the same number of training epochs. The improvements in mAP, precision, recall, and F1-score over the original YOLOv10 model were 1.56, 2.02, and 2.92 percentage points and 0.02, respectively.
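The early stopping rule described above (stop after ten epochs without improvement on the validation set) can be sketched as follows. The class name `EarlyStopping`, the `min_delta` parameter, and the synthetic validation curve are assumptions for illustration; the monitored metric in the study is not specified beyond "performance of the validation set".

```python
class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, metric):
        """Record one epoch's validation metric; return True if training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience

# Synthetic validation curve: improves for 5 epochs, then plateaus.
history = [0.50, 0.60, 0.70, 0.80, 0.85] + [0.85] * 20
stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, value in enumerate(history, start=1):
    if stopper.step(value):
        stopped_at = epoch
        break
print(stopped_at)
```

With this curve the best value is reached at epoch 5, so training stops at epoch 15, ten non-improving epochs later.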

The results show that, with the BiFPN, SEAM, and GCNet modules, the improved YOLOv10 model achieved higher accuracy and better generalization in wheat spike detection. BiFPN improved the detection accuracy through efficient, weighted multi-scale feature fusion. It passes information in both directions between feature maps at different scales, thus better capturing the multi-scale features of wheat spikes and enabling the model to detect spikes of different sizes and shapes more accurately. SEAM strengthens the network’s attention to occluded wheat spikes by processing both the spatial and channel dimensions. In the spatial dimension, it squeezes and re-weights the feature maps to focus the network on important regions; in the channel dimension, it enhances the network’s ability to discriminate between features from different channels. This dual attention mechanism enables the model to detect wheat spikes more accurately in complex field environments. GCNet introduces a global contextual attention mechanism that lets the network dynamically adjust the weights of different spike locations. By aggregating information over the entire feature map, the GCNet module improves the network’s ability to capture global information, which helps the model detect spikes more accurately against complex backgrounds and under occlusion. Together, these improvements enhance the model’s adaptability and detection performance in complex field environments.
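To make the idea of weighted feature fusion concrete, the sketch below implements the fast normalized fusion rule from the EfficientDet paper [38], in which each input feature map receives a learnable non-negative weight and the weights are normalized by their sum. Whether the improved model uses exactly this rule is an assumption; the function name, the flat-list representation of feature maps, and the example values are illustrative only.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps (flat lists) with normalized non-negative weights."""
    if len(features) != len(weights) or not features:
        raise ValueError("need one weight per feature map")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("feature maps must share a shape (resize beforehand)")
    clipped = [max(w, 0.0) for w in weights]  # ReLU keeps the weights non-negative
    total = sum(clipped) + eps                # eps avoids division by zero
    return [
        sum(w * f[k] for w, f in zip(clipped, features)) / total
        for k in range(length)
    ]

# Two toy feature maps, e.g. a top-down path and a lateral input at the same scale.
top_down = [1.0, 2.0, 3.0, 4.0]
lateral = [3.0, 2.0, 1.0, 0.0]
fused = fast_normalized_fusion([top_down, lateral], weights=[1.0, 1.0])
clipped_case = fast_normalized_fusion([top_down, lateral], weights=[1.0, -5.0])
print(fused)
print(clipped_case)
```

In a real network the weights are trained parameters and the fused map is passed through a convolution; the sketch covers only the weighting step.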

3.3. Results of Ablation Experiments

To verify the effectiveness of each improvement proposed in this paper, we tested three configurations: adding the BiFPN module, adding both the BiFPN and SEAM modules, and adding the BiFPN, SEAM, and GCNet modules. The original YOLOv10 model served as the benchmark, and all models used the same hyperparameters and training configuration as before. The experimental results are shown in Table 3.

As can be seen in Table 3, without any modules added (Model 1), the mAP, precision, and recall of YOLOv10 were 93.54%, 91.67%, and 88.78%, respectively, with 2,707,430 parameters. In Model 2, after the BiFPN module was introduced, the mAP, precision, and recall improved by 1.12, 1.40, and 2.46 percentage points, respectively, compared to the baseline YOLOv10 network, with a minimal increase in the number of parameters (to 2,707,439). In Model 3, after the BiFPN and SEAM modules were combined, the mAP, precision, and recall improved by 1.41, 1.59, and 2.61 percentage points, respectively, compared to the baseline, with the number of parameters increasing to 2,977,574. In Model 4, after the BiFPN, SEAM, and GCNet modules were combined, the mAP, precision, and recall improved by 1.56, 2.02, and 2.92 percentage points, respectively, compared to the baseline, with the number of parameters increasing to 3,012,832.
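The gains quoted above follow directly from Table 3. The short script below recomputes them from the tabulated values; the dictionary keys and the function name are labels chosen for this sketch, and the relative parameter growth is derived arithmetic rather than a figure reported in the text.

```python
# Values copied from Table 3: mAP, precision, recall (in %) and parameter count.
ablation = {
    "YOLOv10": (93.54, 91.67, 88.78, 2_707_430),
    "+BiFPN": (94.66, 93.07, 91.24, 2_707_439),
    "+BiFPN+SEAM": (94.95, 93.26, 91.39, 2_977_574),
    "+BiFPN+SEAM+GCNet": (95.10, 93.69, 91.70, 3_012_832),
}

def gains_over_baseline(table, baseline="YOLOv10"):
    """Return percentage-point gains and parameter growth for each variant."""
    base_map, base_p, base_r, base_params = table[baseline]
    report = {}
    for name, (m, p, r, params) in table.items():
        if name == baseline:
            continue
        report[name] = {
            "mAP_pp": round(m - base_map, 2),
            "precision_pp": round(p - base_p, 2),
            "recall_pp": round(r - base_r, 2),
            "extra_params": params - base_params,
            "param_growth_pct": round(100.0 * (params - base_params) / base_params, 2),
        }
    return report

report = gains_over_baseline(ablation)
for name, row in report.items():
    print(name, row)
```

For the full model this gives 305,402 additional parameters, roughly 11% more than the baseline.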

We also compared the three configurations on actual wheat spike images: YOLOv10 with the BiFPN module, with the BiFPN and SEAM modules, and with the BiFPN, SEAM, and GCNet modules. The detection results can be seen in Figure S1 of the supplementary file, where the green circles indicate undetected wheat spikes. As shown in Figure S1c, fewer spikes were missed after adding the BiFPN module than with the original YOLOv10. This indicates that BiFPN effectively fuses features at different scales through top-down and bottom-up bidirectional paths. Such multi-scale feature fusion is particularly important when detecting wheat spikes, which may vary greatly in size and shape. The BiFPN module enabled the model to better capture these differences, thereby improving the precision and recall of detection. As shown in Figure S1d, after the BiFPN and SEAM modules were combined, the SEAM module provided more detailed processing of the spatial and channel dimensions, which strengthened the network’s attention to occluded wheat spikes and further increased the mAP, precision, and recall of the model. When the BiFPN, SEAM, and GCNet modules were used jointly, the GCNet module introduced global context information, enhancing the model’s feature extraction and fusion capabilities at different levels and scales. This multi-level and multi-scale feature enhancement allows the model to perform better in complex scenarios, achieving higher detection precision and recall, as seen in Figure S1e. The ablation results validate the contribution of each module to feature extraction and fusion. Although the number of parameters in the final improved YOLOv10 model increased slightly, its detection performance improved on every metric, laying a solid foundation for efficient wheat spike detection and counting.

3.4. Results of Wheat Counts

We further analyzed the ability of the improved YOLOv10 algorithm to count wheat spikes in various complex environments by randomly selecting 30 images for model testing. The results are shown in Figure 8 and Table 4. The R² values between the number of wheat spikes detected using SSD, Faster R-CNN, YOLOv8, YOLOv9, and YOLOv10 and the manual counting results were 0.17, 0.56, 0.88, 0.89, and 0.94, respectively. The MAPE values were 15.30%, 11.85%, 5.78%, 5.67%, and 4.09%, the MAE values were 18.23, 13.83, 6.67, 6.50, and 4.60, and the RMSE values were 19.87, 14.52, 7.58, 7.26, and 5.45, respectively. In contrast, the improved YOLOv10 algorithm achieved an R² of 0.96, an MAPE of 3.14%, an MAE of 3.57, and an RMSE of 4.09 when compared to the manual counting results. The higher R² value indicates that the counts produced by the improved algorithm agreed more closely with the manual counts. The lower MAPE, MAE, and RMSE values indicate smaller and more stable prediction errors than those of SSD, Faster R-CNN, YOLOv8, YOLOv9, and the original YOLOv10.
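For a detector, the per-image spike count is simply the number of predicted boxes that are kept. A minimal sketch is given below, assuming each detection is a tuple `(x1, y1, x2, y2, confidence)` and that boxes are kept when their confidence reaches a threshold. The threshold of 0.25, the tuple layout, and the sample detections are assumptions for illustration, not settings reported in the study.

```python
def count_spikes(detections, conf_threshold=0.25):
    """Count detections whose confidence reaches the threshold.

    `detections` is a list of (x1, y1, x2, y2, confidence) tuples.
    """
    return sum(1 for *_box, conf in detections if conf >= conf_threshold)

# Made-up detections for one image (not output from the study's model).
detections = [
    (10, 12, 40, 60, 0.91),
    (55, 20, 90, 70, 0.78),
    (120, 30, 150, 80, 0.22),  # below the threshold, not counted
    (200, 15, 230, 66, 0.25),  # equal to the threshold, counted
]
count_default = count_spikes(detections)
count_strict = count_spikes(detections, conf_threshold=0.8)
print(count_default, count_strict)
```

The per-image counts obtained this way are what would be compared against manual counts when computing R², MAPE, MAE, and RMSE.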

4. Discussion

4.1. Analysis of the Improved YOLOv10 Network Structure

In this study, we first compared the performance of the SSD, Faster R-CNN, YOLOv8, YOLOv9, and YOLOv10 models in wheat spike detection and found that YOLOv10 performed best in terms of precision, recall, and mAP. Therefore, we chose YOLOv10 as the baseline model for this wheat spike detection study. However, when testing the original YOLOv10 model for spike detection, we found that the model performed poorly under adverse conditions, such as severe occlusion of spikes. We therefore decided to improve the network structure of the YOLOv10 algorithm to enhance its ability to detect wheat spikes.

Previous studies have already improved various YOLO versions. Yang et al. integrated the convolutional block attention module (CBAM) with YOLOv4 and enhanced the network’s feature extraction by adding a receptive field module, achieving a 94% accuracy rate in rapid wheat spike detection and counting [44]. Liu et al. proposed an improved YOLOv5 method for wheat spike detection, which enhanced small target detection by incorporating space-to-depth convolution (SPD-Conv) and a coordinate attention (CA) mechanism in the neck layer. They also replaced the C3 module with an efficient RepGFPN module, improving the feature extraction for small wheat spikes and achieving better detection accuracy than the original YOLOv5 [45]. Ma et al. introduced shared convolutional layers and improved the YOLOv8 detection head for wheat seed detection, reducing the number of parameters to increase the running speed and achieve a lightweight design [46]. Building on these studies, we explored adding various attention mechanisms and network structures to the original YOLOv10 to enhance its spike detection capabilities.

However, in complex environments, improvements such as CA and CBAM cannot significantly enhance the YOLOv10 algorithm’s ability to extract wheat spike features. Previous studies have confirmed the advantages of the BiFPN, SEAM, and GCNet modules in handling such complex conditions. Li et al. replaced the PANet structure in YOLOv7 with BiFPN, enhancing the feature fusion process of the head [47]. Wu et al. introduced GCNet in the YOLOv7 backbone network for better feature extraction [31]. Gui et al. added SEAM in the YOLOx detection head to suppress irrelevant information and improve detection accuracy in the presence of occlusion [48]. These studies have shown the effectiveness of the BiFPN, SEAM, and GCNet modules in improving small target detection accuracy. Therefore, this study integrated the BiFPN, SEAM, and GCNet modules into the YOLOv10 network. The experimental results show that the improved YOLOv10 algorithm, with only a slight increase in the number of parameters, improved the mAP, precision, and recall by 1.56, 2.02, and 2.92 percentage points, respectively, enhancing both detection and counting performance. Additionally, compared to the SSD, Faster R-CNN, YOLOv8, YOLOv9, and YOLOv10 models, the number of wheat spikes detected by the improved model agreed more closely with the manual counting results. Finally, we conducted ablation experiments on GWHD, using YOLOv10 as the baseline model, to evaluate the impact of introducing BiFPN, combining BiFPN and SEAM, and combining BiFPN, SEAM, and GCNet on the wheat spike detection performance, confirming the contribution of each module.

4.2. Performance of the Improved YOLOv10 Model in Complex Environments

In real-world environments, complex factors, such as weather changes, occlusion, and morphological differences of wheat spikes at different growth stages, can lead to overlapping and blurring of captured wheat images, affecting the accuracy of detection models [49]. Therefore, we aimed to improve the YOLOv10 network structure to enhance its detection and counting ability for wheat spikes.

To evaluate the practical performance of the improved YOLOv10 model in this study, we compared the detection results of SSD, Faster R-CNN, YOLOv8, YOLOv9, YOLOv10, and the improved YOLOv10 on the test set. Some detection results are shown in Figure S2 in the supplementary file, where green circles indicate issues such as missed detections and false detections. It can be seen that the SSD and Faster R-CNN algorithms performed poorly in real-world detection, missing a large number of wheat spikes. These algorithms have limitations in dealing with small and dense targets, resulting in unsatisfactory detection. In contrast, the YOLO family of algorithms far outperformed SSD and Faster R-CNN. Even without improvements, YOLOv8, YOLOv9, and YOLOv10 were able to detect the majority of wheat spikes in the image. However, they may still miss small wheat spikes at the edges of an image, and in some cases leaves are misidentified as wheat spikes. This may be because the edge region of the image contains fewer features, making it difficult for the detector to accurately identify these small targets. By introducing the BiFPN, SEAM, and GCNet modules, the improved algorithm in this study better fuses the multi-scale features of wheat spikes and identifies spikes in occluded and image-edge regions more accurately, so the missed detections and misidentifications seen with the SSD, Faster R-CNN, YOLOv8, YOLOv9, and YOLOv10 algorithms are greatly reduced. The detection results show that the improved YOLOv10 model performs well in a variety of complex environments and can adapt to different wheat distributions and field conditions.

In summary, the improved YOLOv10 algorithm shows better detection and counting performance in a variety of complex environments compared to SSD, Faster R-CNN, YOLOv8, YOLOv9, and the original YOLOv10.

5. Conclusions

Accurate detection and counting of wheat spikes not only help to assess the growth conditions and yield of wheat but also provide critical data for farm management and harvesting. To address the complex environments encountered in wheat spike detection, this study proposes an improved YOLOv10 algorithm for spike detection and counting, which enhances the feature extraction and detection capabilities of the model by integrating the BiFPN, SEAM, and GCNet modules. The experimental results show that the improved YOLOv10 model achieved a wheat spike detection precision of 93.69%, a recall of 91.70%, and a mAP of 95.10%. For wheat spike counting, the model achieved an R² of 0.96, an MAE of 3.57, and an RMSE of 4.09 when compared to the manual counting results. The detection and counting performance were superior to those of SSD, Faster R-CNN, YOLOv8, YOLOv9, and the original YOLOv10. These improvements allow the YOLOv10 algorithm to perform well at wheat spike detection tasks and to cope with complex field environments. However, fast detection still requires specific hardware configurations, and the improved model has a higher computational complexity, so it may not be suitable for resource-constrained devices.

With the continuous advancement of deep learning techniques, research on wheat spike detection is also progressing. By introducing attention mechanisms and multi-scale convolution techniques into YOLO, SSD, R-CNN, and other algorithms, researchers have effectively improved the accuracy of wheat detection. Meanwhile, through the joint efforts of agricultural departments and researchers at home and abroad, public datasets such as GWHD have been made available to wheat researchers worldwide, which not only promotes the development of the technology but also builds valuable experience in data management and standardized testing. Previous work has shown that thermal infrared images have better contrast than RGB images, providing clearer detection results in complex environments. Three-dimensional technology is also starting to be applied to wheat spike recognition, further improving the accuracy and reliability of detection. These are active areas of research and directions that could be explored in depth in the future. In addition, lightweight models can be developed, and the improved algorithms can be integrated into smart agricultural equipment to facilitate automated field management and harvesting operations and raise the automation level of agricultural production.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agronomy14091936/s1, Figure S1: Comparison of detection effects of different modules. (a) Original image. (b) Detection result using YOLOv10. (c) Detection result using YOLOv10 with BiFPN. (d) Detection result using YOLOv10 with BiFPN and SEAM. (e) Detection result using YOLOv10 with BiFPN, SEAM, and GCNet; Figure S2: Comparison of detection effects of different algorithms. (a) Original image. (b) Detection result using SSD. (c) Detection result using Faster R-CNN. (d) Detection result using YOLOv8. (e) Detection result using YOLOv9. (f) Detection result using YOLOv10. (g) Detection result using improved YOLOv10.

Author Contributions

Conceptualization, S.G. and J.Y.; methodology, S.G. and J.Y.; software, S.G.; validation, S.G., Y.L. and G.L.; formal analysis, X.M. and S.H.; investigation, X.M. and S.H.; data curation, S.G., Y.L. and G.L.; resources, J.Y., P.S. and P.L.; writing—original draft, S.G.; writing—review and editing, J.Y., P.S. and P.L.; funding acquisition, J.Y. and P.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by A Project of Shandong Province Higher Educational Program for Introduction and Cultivation of Young Innovative Talents in 2021, Natural Science Foundation of Shandong Province (ZR2022QC129), and Doctoral research start-up funds from Liaocheng University (318052018).

Data Availability Statement

The original data presented in the study are openly available in the Global Wheat Head Dataset 2021 at https://www.kaggle.com/datasets/vbookshelf/global-wheat-head-dataset-2021 or reference [33] (accessed on 21 May 2024).

Acknowledgments

We thank all the authors for their support. The authors would like to thank all the reviewers who participated in this review.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, H.; Wang, Z.; Yu, R.; Li, F.; Li, K.; Cao, H.; Yang, N.; Li, M.; Dai, J.; Zan, Y. Optimal nitrogen input for higher efficiency and lower environmental impacts of winter wheat production in China. Agric. Ecosyst. Environ. 2016, 224, 1–11.
2. Hellemans, T.; Landschoot, S.; Dewitte, K.; Van Bockstaele, F.; Vermeir, P.; Eeckhout, M.; Haesaert, G. Impact of crop husbandry practices and environmental conditions on wheat composition and quality: A review. J. Agric. Food Chem. 2018, 66, 2491–2509.
3. Glover, J.D.; Reganold, J.; Bell, L.; Borevitz, J.; Brummer, E.C.; Buckler, E.S.; Cox, C.; Cox, T.S.; Crews, T.; Culman, S. Increased food and ecosystem security via perennial grains. Science 2010, 328, 1638–1639.
4. Mujeeb-Kazi, A.; Kazi, A.G.; Dundas, I.; Rasheed, A.; Ogbonnaya, F.; Kishii, M.; Bonnett, D.; Wang, R.R.-C.; Xu, S.; Chen, P. Genetic diversity for wheat improvement as a conduit to food security. Adv. Agron. 2013, 122, 179–257.
5. Sun, J.; Yang, K.; Chen, C.; Shen, J.; Yang, Y.; Wu, X.; Norton, T. Wheat head counting in the wild by an augmented feature pyramid networks-based convolutional neural network. Comput. Electron. Agric. 2022, 193, 106705.
6. Feng, L.; Chen, S.; Zhang, C.; Zhang, Y.; He, Y. A comprehensive review on recent applications of unmanned aerial vehicle remote sensing with various sensors for high-throughput plant phenotyping. Comput. Electron. Agric. 2021, 182, 106033.
7. Fernandez-Gallego, J.A.; Kefauver, S.C.; Gutiérrez, N.A.; Nieto-Taladriz, M.T.; Araus, J.L. Wheat ear counting in-field conditions: High throughput and low-cost approach using RGB images. Plant Methods 2018, 14, 22.
8. van Dijk, A.D.J.; Kootstra, G.; Kruijer, W.; de Ridder, D. Machine learning in plant science and plant breeding. Iscience 2021, 24, 101890.
9. Esposito, S.; Carputo, D.; Cardi, T.; Tripodi, P. Applications and trends of machine learning in genomics and phenomics for next-generation breeding. Plants 2019, 9, 34.
10. Singh, A.; Ganapathysubramanian, B.; Singh, A.K.; Sarkar, S. Machine learning for high-throughput stress phenotyping in plants. Trends Plant Sci. 2016, 21, 110–124.
11. Foggia, P.; Genna, R.; Vento, M. Symbolic vs. connectionist learning: An experimental comparison in a structured domain. IEEE Trans. Knowl. Data Eng. 2001, 13, 176–195.
12. Fiser, J.; Lengyel, G. Statistical learning in vision. Annu. Rev. Vis. Sci. 2022, 8, 265–290.
13. Yao, Z.; Zhang, D.; Tian, T.; Zain, M.; Zhang, W.; Yang, T.; Song, X.; Zhu, S.; Liu, T.; Ma, H. APW: An ensemble model for efficient wheat spike counting in unmanned aerial vehicle images. Comput. Electron. Agric. 2024, 224, 109204.
14. Gu, Y.; Ai, H.; Guo, T.; Liu, P.; Wang, Y.; Zheng, H.; Cheng, T.; Zhu, Y.; Cao, W.; Yao, X. Comparison of two novel methods for counting wheat ears in the field with terrestrial LiDAR. Plant Methods 2023, 19, 134.
15. Bao, W.; Lin, Z.; Hu, G.; Liang, D.; Huang, L.; Zhang, X. Method for wheat ear counting based on frequency domain decomposition of MSVF-ISCT. Inf. Process. Agric. 2023, 10, 240–255.
16. Liu, Y.; Pu, H.; Sun, D.-W. Efficient extraction of deep image features using convolutional neural network (CNN) for applications in detecting and analysing complex food matrices. Trends Food Sci. Technol. 2021, 113, 193–204.
17. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
18. McCulloch, W.S.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133.
19. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
20. Feng, A.; Vong, C.N.; Zhou, J.; Conway, L.S.; Zhou, J.; Vories, E.D.; Sudduth, K.A.; Kitchen, N.R. Developing an image processing pipeline to improve the position accuracy of single UAV images. Comput. Electron. Agric. 2023, 206, 107650.
21. Hasan, M.M.; Chopin, J.P.; Laga, H.; Miklavcic, S.J. Detection and analysis of wheat spikes using convolutional neural networks. Plant Methods 2018, 14, 100.
22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
23. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
24. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
25. Shen, X.; Zhang, C.; Liu, K.; Mao, W.; Zhou, C.; Yao, L. A lightweight network for improving wheat ears detection and counting based on YOLOv5s. Front. Plant Sci. 2023, 14, 1289726.
26. Li, L.; Hassan, M.A.; Yang, S.; Jing, F.; Yang, M.; Rasheed, A.; Wang, J.; Xia, X.; He, Z.; Xiao, Y. Development of image-based wheat spike counter through a Faster R-CNN algorithm and application for genetic studies. Crop J. 2022, 10, 1303–1311.
27. Batin, M.; Islam, M.; Hasan, M.M.; Azad, A.; Alyami, S.A.; Hossain, M.A.; Miklavcic, S.J. WheatSpikeNet: An improved wheat spike segmentation model for accurate estimation from field imaging. Front. Plant Sci. 2023, 14, 1226190.
28. Li, R.; Meng, J.; Wu, Y.; Zhang, D.; He, Y. Wheat ear detection based on FasterCANet-YOLOv8s algorithm. N. Z. J. Crop Hortic. Sci. 2024, 1–21.
29. Wang, L.; Miao, Z.; Liu, E. UAV remote sensing detection and target recognition based on SCP-YOLO. Neural Comput. Appl. 2024, 1–16.
30. Hussain, M. Yolov1 to v8: Unveiling each variant–a comprehensive review of yolo. IEEE Access 2024, 12, 42816–42833.
31. Wu, T.; Zhong, S.; Chen, H.; Geng, X. Research on the method of counting wheat ears via video based on improved yolov7 and deepsort. Sensors 2023, 23, 4880.
32. Zhao, W.; Liu, S.; Li, X.; Han, X.; Yang, H. Fast and accurate wheat grain quality detection based on improved YOLOv5. Comput. Electron. Agric. 2022, 202, 107426.
33. David, E.; Serouart, M.; Smith, D.; Madec, S.; Velumani, K.; Liu, S.; Wang, X.; Pinto, F.; Shafiee, S.; Tahir, I.S. Global wheat head detection 2021: An improved dataset for benchmarking wheat head detection methods. Plant Phenomics 2021, 2021, 9846158.
34. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458.
35. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
36. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
37. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 2117–2125.
38. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
39. Ren, Y.; Zhang, X.; Ma, Y.; Yang, Q.; Wang, C.; Liu, H.; Qi, Q. Full Convolutional Neural Network Based on Multi-Scale Feature Fusion for the Class Imbalance Remote Sensing Image Classification. Remote Sens. 2020, 12, 3547.
40. Yu, Z.; Huang, H.; Chen, W.; Su, Y.; Liu, Y.; Wang, X. Yolo-facev2: A scale and occlusion aware face detector. Pattern Recognit. 2024, 155, 110714.
41. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
42. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
43. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980.
44. Yang, B.; Gao, Z.; Gao, Y.; Zhu, Y. Rapid detection and counting of wheat ears in the field using YOLOv4 with attention module. Agronomy 2021, 11, 1202.
45. Liu, L.; Li, P. An improved YOLOv5-based algorithm for small wheat spikes detection. Signal Image Video Process. 2023, 17, 4485–4493.
46. Ma, N.; Su, Y.; Yang, L.; Li, Z.; Yan, H. Wheat Seed Detection and Counting Method Based on Improved YOLOv8 Model. Sensors 2024, 24, 1654.
47. Li, Z.; Zhu, Y.; Sui, S.; Zhao, Y.; Liu, P.; Li, X. Real-time detection and counting of wheat ears based on improved YOLOv7. Comput. Electron. Agric. 2024, 218, 108670.
48. Gui, J.; Wu, J.; Wu, D.; Chen, J.; Tong, J. A lightweight tea buds detection model with occlusion handling. J. Food Meas. Charact. 2024, 1–17.
49. Fernandez-Gallego, J.A.; Buchaillot, M.L.; Aparicio Gutiérrez, N.; Nieto-Taladriz, M.T.; Araus, J.L.; Kefauver, S.C. Automatic Wheat Ear Counting Using Thermal Imagery. Remote Sens. 2019, 11, 751.

Figure 1. Some example images in the Global Wheat Head Detection (GWHD) dataset [33].

Figure 2. Data augmentation: (a) Original image, (b) Flipping, (c) Rotation, (d) Decreased brightness, (e) Added noise, (f) Increased brightness, (g) Cutout, (h) Mosaic. (Original image from GWHD dataset [33]).

Figure 3. Comparison of different feature pyramid networks.

Figure 4. Illustration of the structure of the separated and enhanced attention module.

Figure 5. Illustration of the structure of the global context network.

Figure 6. The overall network structure of the improved YOLOv10 model.

Figure 7. Training process curves. (a) Precision, (b) Mean average precision, (c) F1-score, (d) Recall.

Figure 8. Fitting results of detected wheat spike counts versus actual wheat spike counts. (a) SSD fitting results. (b) Faster R-CNN fitting results. (c) YOLOv8 fitting results. (d) YOLOv9 fitting results. (e) YOLOv10 fitting results. (f) Improved YOLOv10 fitting results.

Table 1. Dataset partitioning details.

Dataset	Width	Height	Number of Images
Train	1024	1024	8760
Validation	1024	1024	1096
Test	1024	1024	1096
Total	-	-	10,952

Table 2. Comparison of the results of different detection algorithms for wheat spikes.

Method	Precision	Recall	mAP@0.5	F1-Score
SSD	89.20%	55.46%	71.75%	0.68
Faster R-CNN	68.57%	70.43%	69.64%	0.69
YOLOv8	91.22%	87.26%	92.58%	0.89
YOLOv9	91.31%	88.52%	93.47%	0.89
YOLOv10	91.67%	88.78%	93.54%	0.90
Our Method	93.69%	91.70%	95.10%	0.92

Table 3. Results of ablation experiments with different modules.

Method	BiFPN	SEAM	GCNet	mAP@0.5	Precision	Recall	Parameters
1	-	-	-	93.54%	91.67%	88.78%	2,707,430
2	√	-	-	94.66%	93.07%	91.24%	2,707,439
3	√	√	-	94.95%	93.26%	91.39%	2,977,574
4	√	√	√	95.10%	93.69%	91.70%	3,012,832

Table 4. Comparison of wheat spike counting performance using different algorithms.

Method	R²	MAE	RMSE	MAPE
SSD	0.17	18.23	19.87	15.30%
Faster R-CNN	0.56	13.83	14.52	11.85%
YOLOv8	0.88	6.67	7.58	5.78%
YOLOv9	0.89	6.50	7.26	5.67%
YOLOv10	0.94	4.60	5.45	4.09%
Our Method	0.96	3.57	4.09	3.14%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


