Generalized Focal Loss WheatNet (GFLWheatNet): Accurate Application of a Wheat Ear Detection Model in Field Yield Prediction
Abstract
1. Introduction
- (1) Diversity of wheat ears: wheat ears exhibit different appearances across regions, growth stages, and varieties.
- (2) Variability in ear shapes: wheat ear shapes in images vary with camera angle, ear growth posture, and lighting conditions, resulting in overlap, occlusion, and backlighting.
- (3) Similar backgrounds: wheat ears at different growth stages closely resemble the same-colored stem and leaf background, leading to false positives and missed detections.
- (1) False positives and redundant detections where wheat ears are dense and highly overlapping;
- (2) Missed detections and false positives for large numbers of small wheat ears viewed from various angles;
- (3) Weak robustness and generalization of the detection model in unstructured field environments.
2. Materials and Methods
2.1. Dataset
- (1) Inconsistent shape and size of wheat ear targets: owing to differences in growth environment, variety, and growth stage, wheat ear targets vary greatly in shape and size, and many of them are small.
- (2) Occlusion and overlap of wheat ear targets: wheat ears are easily occluded by leaves or other plant parts, and they also grow densely with high-density overlap.
- (3) Lighting conditions and complex backgrounds: wheat images contain many disturbances, such as variable lighting during photography and weeds, some of which are similar in shape and color to wheat ears.
2.2. GFLWheatNet
2.2.1. Feature Extraction
2.2.2. Feature Reinforcement
2.2.3. Detection Head
2.2.4. Generalized Focal Loss
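The quality focal loss (QFL) branch of generalized focal loss replaces the hard one-hot classification target with a soft IoU-based quality score, so that classification confidence and localization quality are learned jointly. A minimal single-prediction sketch (illustrative only, not the authors' training code; β = 2 follows the original GFL formulation):

```python
import math

def quality_focal_loss(sigma: float, y: float, beta: float = 2.0) -> float:
    """Quality focal loss for one prediction.

    sigma: predicted classification score in (0, 1)
    y:     soft IoU quality target in [0, 1] (0 for background)
    beta:  modulating exponent (beta = 2 in the GFL paper)
    """
    eps = 1e-12  # guard against log(0)
    # soft-target cross entropy between prediction and quality label
    ce = (1 - y) * math.log(1 - sigma + eps) + y * math.log(sigma + eps)
    # down-weight examples whose score already matches their quality
    return -abs(y - sigma) ** beta * ce
```

The modulating factor |y − σ|^β vanishes when the predicted score matches the quality target, focusing training on poorly calibrated predictions.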
3. Results
3.1. Experimental Environment and Parameter Configuration
3.2. Model Training and Performance Analysis
3.3. Ablation Study
4. Conclusions
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Model | mAP (%) | AP50 (%) | AP75 (%) | APS (%) | APM (%) | APL (%) | FPS |
---|---|---|---|---|---|---|---|
*Anchor-based one-stage:* | | | | | | | |
RetinaNet | 41.3 | 91.1 | 26.7 | 21.9 | 41.1 | 47.7 | 21.4 |
GFLV2 | 41.9 | 92.6 | 27.7 | 24.2 | 41.6 | 48.3 | 24.4 |
*Anchor-based multi-stage:* | | | | | | | |
Faster R-CNN | 40.2 | 91.3 | 24.3 | 20.0 | 40.3 | 45.7 | 23.1 |
*Anchor-free one-stage:* | | | | | | | |
FCOS | 39.8 | 90.7 | 23.9 | 20.2 | 40.2 | 47.6 | 17.9 |
FoveaBox | 38.2 | 90.0 | 17.0 | 23.4 | 34.4 | 33.5 | 19.1 |
*Anchor-free key-point:* | | | | | | | |
RepPoints | 42.0 | 91.7 | 27.8 | 22.4 | 42.1 | 47.9 | 15.2 |
ATSS | 41.8 | 91.4 | 25.8 | 24.3 | 42.1 | 48.1 | 14.6 |
*Single-stage:* | | | | | | | |
YOLOv8n | 43.6 | 93.5 | 28.0 | 23.2 | 43.6 | 51.0 | 32.5 |
Ours | 43.3 | 93.7 | 30.2 | 25.6 | 43.4 | 50.7 | 26.4 |
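The AP columns above follow COCO-style evaluation: predictions are matched to ground-truth boxes by intersection over union (IoU) at varying thresholds (AP50, AP75) and object scales (small/medium/large). The matching criterion behind every column is plain box IoU, which can be sketched as follows (illustrative; the paper's exact evaluation pipeline is not reproduced here):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # intersection rectangle (clamped to zero when boxes are disjoint)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union = sum of areas minus the shared intersection
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection counts as a true positive at AP50 when its IoU with an unmatched ground-truth box is at least 0.5; AP75 tightens the threshold to 0.75, which is why AP75 values drop sharply for densely overlapping wheat ears.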
Attention Mechanism | AP50 (%) | APS (%) | FPS |
---|---|---|---|
SE | 93.42 | 24.57 | 25.48 |
GAM | 93.86 | 25.82 | 17.65 |
CBAM | 93.70 | 25.62 | 26.40 |
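The modules compared above (SE, GAM, CBAM) all reweight feature maps with learned attention; the channel branch shared by SE and CBAM squeezes each channel to a scalar, passes it through a small bottleneck, and gates the channel with a sigmoid. A numpy sketch with random weights and a hypothetical reduction ratio of 4 (shape illustration only, not the authors' implementation):

```python
import numpy as np

def channel_attention(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """SE-style channel attention on a (C, H, W) feature map.

    w1: (C//r, C) squeeze weights, w2: (C, C//r) excite weights.
    """
    c = feat.shape[0]
    squeezed = feat.reshape(c, -1).mean(axis=1)      # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)          # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))      # sigmoid channel gates in (0, 1)
    return feat * gate[:, None, None]                # rescale each channel

# usage: random weights only to illustrate shapes (C=8, reduction ratio r=4)
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))
w2 = rng.standard_normal((8, 2))
y = channel_attention(x, w1, w2)
```

CBAM additionally applies a spatial branch after the channel branch, which may explain its better speed/accuracy balance in the table compared with the heavier GAM.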
Models | CBAM | Reinforce Layer | AP50 (%) | APS (%) | FPS |
---|---|---|---|---|---|
GFLV2 | × | × | 92.6 | 24.2 | 24.4 |
GFLWheatNet | √ | × | 93.3 | 25.5 | 24.3 |
GFLWheatNet | × | √ | 91.6 | 24.0 | 26.3 |
GFLWheatNet | √ | √ | 93.7 | 25.6 | 26.4 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Guan, Y.; Pan, J.; Fan, Q.; Yang, L.; Xu, L.; Jia, W. Generalized Focal Loss WheatNet (GFLWheatNet): Accurate Application of a Wheat Ear Detection Model in Field Yield Prediction. Agriculture 2024, 14, 899. https://doi.org/10.3390/agriculture14060899