Article

Field-Based Soybean Flower and Pod Detection Using an Improved YOLOv8-VEW Method

Kunpeng Zhao, Jinyang Li, Wenqiang Shi, Liqiang Qi, Chuntao Yu and Wei Zhang
1 College of Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, China
2 Heilongjiang Province Conservation Tillage Engineering Technology Research Center, Daqing 163319, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(8), 1423; https://doi.org/10.3390/agriculture14081423
Submission received: 28 June 2024 / Revised: 11 August 2024 / Accepted: 20 August 2024 / Published: 22 August 2024
(This article belongs to the Section Digital Agriculture)

Abstract

Changes in soybean flower and pod numbers are important factors affecting soybean yields. Obtaining the numbers of flowers and pods, as well as of fallen flowers and pods, quickly and accurately is crucial for soybean variety breeding and high-quality, high-yield production, and it is especially challenging in the natural field environment. Therefore, this study proposed a field soybean flower- and pod-detection method based on an improved network model (YOLOv8-VEW). VanillaNet is used as the backbone feature-extraction network of YOLOv8, the EMA attention mechanism module is added to C2f, and the CIoU loss function is replaced with the WIoU position loss function. The results showed that the F1, mAP, and FPS (frames per second) of the YOLOv8-VEW model were 0.95, 96.9%, and 90 FPS, respectively, which were 0.05, 2.4%, and 24 FPS better than those of the YOLOv8 model. The model was used to compare soybean flower and pod counts with manual counts, and its R2 for flowers and pods was 0.98311 and 0.98926, respectively, achieving rapid detection of soybean flowers and pods in the field. This study can provide reliable technical support for detecting soybean flower and pod numbers in the field and selecting high-yielding varieties.

1. Introduction

Soybeans are an important economic crop worldwide, and the selection of suitable soybean varieties is an important way to improve their yield [1]. In the soybean-breeding process, the number of flowers and pods is an important indicator, and many scholars have explored it from different aspects. Gao et al. studied the order of flower fall in soybean [2]. Gai et al. studied the podding rate of different soybean inflorescences [3]. Su et al. and Song et al. investigated the flowering sequence of soybeans with different podding habits [4,5]. Zhao et al. found that the nodes and durations of flowering differed among varieties and that the rates of flower drop differed significantly among stages [6]. Fan et al. and Zhang et al. analyzed the pattern of flower and pod drop from a genetic point of view [7,8]. They determined the number of flowers and pods on each plant and the genomic regions related to flowering and podding through QTL (quantitative trait locus) mapping of soybean traits. In all these studies, the number of flowers and pods was counted manually. Therefore, establishing a high-throughput, automated, and high-precision method for soybean flower and pod detection in the field is of great theoretical and practical significance for soybean variety selection and high-quality, high-yield production.
In recent years, with the development of deep learning technology, many scholars have conducted extensive research on field phenotype counting using deep learning [9,10]. Xiong et al. proposed TasselNetv2 by improving TasselNet and verified it experimentally for wheat spike counting; the results showed a counting accuracy of 91.01% while reducing redundant calculations [11]. Lu et al. proposed TasselNetV3 based on TasselNetv2. TasselNetV3 achieved better results in counting corn cobs, wheat ears, and rice plants and improved the versatility of the model, but because it uses deep convolution, its counting speed was slightly lower than that of TasselNetv2 [12]. Li et al. used YOLOv4 to detect kiwifruit flowers and buds, with an mAP of 97.6%; on another dataset, the mAP reached 91.49%, demonstrating the versatility of YOLOv4 for flower detection [13]. Wu et al. proposed a lightweight YOLOv4 network model to identify apple flowers, with an mAP of 97.31% and an FPS of 72.33; however, lighting conditions affected the detection results [14]. Xiang et al. proposed YOLO POD for precise pod identification and counting, with experimental findings indicating an R2 of 0.967 between the counts from YOLO POD and the actual values, alongside low MAE, MAPE, and RMSE values of 4.18, 10.0%, and 6.48, respectively [15]. Hasan et al. introduced a wheat spike-counting model based on Faster R-CNN target detection; the model effectively identified wheat spikes, even in dynamic and intricate field conditions, achieving an average accuracy of 93.4% [16]. Lu et al. employed a deep learning algorithm combined with a generalized regression neural network for detecting pods; improvements were made to YOLOv3, and the recognition accuracy reached 90.3% after changing the IoU loss function [17]. Miao et al. aimed to detect maize and sorghum leaves; validation experiments on the same dataset compared a regression method based on convolutional neural networks with a target-detection method based on Faster R-CNN, and the results showed that the Faster R-CNN-based target-detection method had higher accuracy [18]. Therefore, deep learning-based target-detection models are well suited to soybean flower and pod recognition and counting.
Because of the intricate structure of the soybean plant, there are problems such as the occlusion of pods and flowers by leaves, mutual occlusion among flowers, mutual occlusion among pods, and flowers too small to be seen. Therefore, choosing a suitable shooting method is essential to improve the accuracy of flower and pod detection. Yue et al. used the whole-plant shooting method with an improved YOLOv5 algorithm to identify soybean flowers and pods during the flowering and podding stages in the field. The model’s accuracy was 98.4%, but the accuracy of the flower counts in the field was 80.32% and that of the pod counts was 82.7%, so the final counting results were not ideal [19]. Zhu et al. used the node-shooting method to obtain images and constructed a fusion model for soybean flower and pod recognition and counting; the precision reached 94.36% and 91%, and the coefficients of determination (R2) with the manually counted numbers of flowers and pods reached 0.965 and 0.98, respectively. They further studied the pattern of flower and pod drop during the reproductive period. However, the observations were made on potted plants, which do not reflect the real response of soybeans in the field [20]. According to these studies, the node-shooting method can effectively address the mutual occlusion of soybean flowers and pods in the field.
To address the occlusion and detection problems of soybean flowers and pods in the field, this study used node shooting to acquire field images of soybean flowers and pods and to create a dataset. A deep learning target-detection algorithm was applied to this dataset, and the YOLOv8 detection model was improved to construct the YOLOv8-VEW field soybean flower- and pod-detection model. A field image collection device was used to collect soybean flower and pod images over the whole life cycle of the Jiyu 86 variety in the field, and the YOLOv8-VEW model was used to detect the flowers and pods and analyze their changes in the field.

2. Materials and Methods

2.1. Image Acquisition and Dataset Construction

In order to construct a field soybean flower- and pod-detection model with broad applicability, different experimental locations were selected to obtain field soybean flower and pod images and construct the dataset. The first trial site was located at the Shihezi Comprehensive Experimental Station, Xinjiang, China. The main cultivar there was Jiyu 86, which has purple flowers and a sub-limited podding habit and has set a record soybean yield of 467.24 kg per mu. Jiyu 86 has continuously achieved high yields in this area and has high yield potential. The second trial site was Jianshan Farm, Heilongjiang Province, China, where the main cultivar was Longken 306, which has white flowers and an unlimited podding habit. Heilongjiang Province has the largest soybean cultivation area and highest soybean yield in China, and Longken 306 yields well in this area.
Image acquisition was performed using the node-shooting method, with an angle of 25 to 45 degrees between the camera and the main stem and a distance of 5–10 cm from the node. The method of soybean flower and pod image acquisition is shown in Figure 1.
Figure 1a shows photographs of the flowers and pods at each node on the plant. Figure 1b shows a schematic diagram of image acquisition, in which the lowest node of the soybean plant is labeled 1 and the nodes are numbered sequentially upwards. As shown in Figure 1c, the photographed images of the soybean flowers and pods at each node correspond to the respective nodes in Figure 1b. The images were acquired using a field soybean flower and pod image acquisition device, shown in Figure 2. The device was equipped with an mcd-200w-q image sensor (Mcyeys Corp., Shenzhen, China) and can adjust the shooting angle according to the growth height of the soybeans, the position of the internodes, and the morphology of the flowers and pods. In field applications, it is powered by solar energy, a wireless router provides the network connection, and acquisition software on a computer enables remote image capture at regular intervals.
A total of 2100 visible-light images were acquired, with an image size of 640 × 360 pixels in JPG format. Image processing techniques were used to resize the images in the soybean flower and pod dataset to 640 × 640 pixels. The flowers and pods in the images were manually labeled with ground-truth bounding boxes using the LabelImg 1.8.1 tool, and all annotations were generated in PASCAL VOC format. After labeling, the dataset was randomly divided into training, validation, and test sets at a ratio of 8:1:1. The division of the soybean flower and pod dataset is shown in Table 1.
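As a concrete illustration of the 8:1:1 split described above, the following Python sketch shuffles image/annotation pairs and copies them into training, validation, and test folders. The directory names and file layout are assumptions for illustration, not the authors' actual script.

```python
import random
import shutil
from pathlib import Path

# Assumed layout: images/ holds the 640x640 JPGs, annotations/ the PASCAL VOC XML files.
random.seed(0)
images = sorted(Path("images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.8 * n)],
    "val": images[int(0.8 * n): int(0.9 * n)],
    "test": images[int(0.9 * n):],
}

for split, files in splits.items():
    for sub in ("images", "annotations"):
        Path(split, sub).mkdir(parents=True, exist_ok=True)
    for img in files:
        xml = Path("annotations", img.stem + ".xml")  # matching VOC annotation
        shutil.copy(img, Path(split, "images", img.name))
        shutil.copy(xml, Path(split, "annotations", xml.name))
```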

2.2. YOLOv8-VEW Soybean Flower- and Pod-Detection Model

YOLOv8 comprises three main components in its overall structure: the backbone, neck, and head [21]. The network structure is shown in Figure 3. The backbone uses Darknet53, which mainly consists of C2f and SPPF modules. The C2f module merges the C3 architecture from YOLOv5 with the ELAN architecture from YOLOv7, which keeps the module lightweight while increasing the depth and receptive field of the network, thus obtaining higher accuracy. The widely used SPPF module is retained at the end of the backbone; it passes the input through three 5 × 5 max-pooling layers in sequence and then concatenates the outputs of each layer, which keeps this layer lightweight while maintaining accuracy when detecting objects at different scales. The neck adopts the PAN-FPN structure, which fuses and utilizes feature information at different scales, achieving feature fusion across multiple feature maps of different sizes; the C2f module is its main feature-extraction module. The head uses a decoupled structure to separate the classification and detection tasks; anchor boxes are discarded in favor of a more effective anchor-free structure, reducing the algorithm’s complexity. During training, YOLOv8 disables mosaic augmentation for the final 10 epochs to enhance the model’s accuracy.
To improve the model’s detection performance for soybean flowers and pods, this study used the lightweight VanillaNet network as the backbone feature-extraction network to reduce model complexity and improve detection speed, added the EMA attention mechanism to the C2f module to form the C2f-EMA module, and replaced the CIoU loss function with the WIoU loss function to improve detection accuracy. The resulting soybean flower- and pod-detection model, YOLOv8-VEW, was constructed to detect the targets reliably while improving both detection speed and detection accuracy.

2.2.1. VanillaNet Backbone Network

The VanillaNet neural network architecture contains only basic convolutional and pooling layers, without complex connections or skip connections, thus reducing the computational cost and parameter count of the model [22]. Figure 4 displays the network architecture of VanillaNet (using a 6-layer network structure as an example). The network consists of three stages: in stage I, the stem converts the original three-channel image into a feature map with C channels by downsampling; in stage II, the feature map is resized using a max pooling layer with a stride of 2, and the number of channels is doubled relative to the previous layer; and in stage III, the final classification results are generated by the fully connected layer. In order to preserve the feature information contained in the feature maps at the smallest possible computational cost, a 1 × 1 convolutional kernel is used in all convolutional layers, and batch normalization (BN) is added at the end of each layer to simplify the training process of the network.
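To make the stage structure concrete, the following PyTorch sketch shows a minimal VanillaNet-style stage in the spirit of the description above; the channel counts, stem stride, and activation are illustrative assumptions rather than the exact configuration used in YOLOv8-VEW.

```python
import torch
import torch.nn as nn

class VanillaStage(nn.Module):
    """One plain VanillaNet-style stage: 1x1 conv -> BN -> activation -> stride-2 max pool."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves H and W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.act(self.bn(self.conv(x))))

# Stem (stage I) downsamples the 3-channel image to C channels; each later stage
# doubles the channels while halving the resolution, with no skip connections.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=4),  # stage I: stem
    VanillaStage(64, 128),                      # stage II blocks
    VanillaStage(128, 256),
)
print(backbone(torch.randn(1, 3, 640, 640)).shape)  # torch.Size([1, 256, 40, 40])
```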

2.2.2. C2f_EMA Module Structure and Principle

The EMA attention mechanism introduced by Ouyang et al. is an efficient multi-scale attention module designed for cross-spatial learning [23]. Its main goal is to concentrate attention on specific locations of interest while retaining information across all channels. The structure of this mechanism is illustrated in Figure 5. The input feature map X is first divided along the channel dimension into G sub-features. Attention-weight descriptors of the grouped feature maps are then extracted using two branches with 1 × 1 convolutions and one branch with a 3 × 3 convolution. In the 1 × 1 branch, channels are encoded through two 1D global average pooling operations, after which a 1 × 1 convolution generates two parallel 1D feature-encoding vectors. These encoded features are passed through a Sigmoid function individually before the channel attention maps within each group are aggregated by simple multiplication to achieve varied cross-channel interaction features. In the 3 × 3 branch, a convolution with a 3 × 3 kernel captures local cross-channel interaction features and expands the feature space. In addition, two tensors are introduced to encode global spatial information. In the 1 × 1 branch, 2D global average pooling encodes global spatial information, and the output of the branch is transformed into the appropriate dimensional shape before the joint activation mechanism of the channel features is applied; a Softmax function is then used to fit the linear transformation. The first spatial attention map is obtained by combining the outputs of the two parallel branches through a matrix dot-product operation. The 3 × 3 branch follows the same principle as the 1 × 1 branch. Finally, the output feature maps within each group are summed to generate two sets of spatial attention weights, followed by a Sigmoid function and a simple multiplication, ensuring that the final output of the EMA module matches the size of X.
The C2f module integrates the EMA attention mechanism to enhance its capability in extracting small targets and capturing fuzzy features, resulting in the formation of the C2f-EMA module. The structure of the C2f-EMA module is illustrated in Figure 6.
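For illustration, the following PyTorch sketch reimplements the EMA module along the lines described above (grouped sub-features, a 1 × 1 branch with two 1D poolings, a 3 × 3 branch, and cross-spatial aggregation via softmax-weighted matrix products). It is a simplified reconstruction from the published description, not the authors' code, and the group count is an assumed default.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-scale Attention (simplified reimplementation for illustration)."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by groups"
        self.groups = groups
        c = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # 1D pooling along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # 1D pooling along height
        self.agp = nn.AdaptiveAvgPool2d((1, 1))         # 2D global average pooling
        self.gn = nn.GroupNorm(c, c)
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        g = x.reshape(b * self.groups, -1, h, w)                 # split into G sub-features
        # 1x1 branch: encode channels with two 1D poolings, then re-weight the group.
        x_h = self.pool_h(g)                                     # (bG, c/G, h, 1)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)                 # (bG, c/G, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: local cross-channel interaction.
        x2 = self.conv3x3(g)
        # Cross-spatial learning: softmax-weighted matrix products between the branches.
        w1 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        w2 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        y1 = torch.matmul(w1, x2.reshape(b * self.groups, c // self.groups, -1))
        y2 = torch.matmul(w2, x1.reshape(b * self.groups, c // self.groups, -1))
        weights = (y1 + y2).reshape(b * self.groups, 1, h, w).sigmoid()
        return (g * weights).reshape(b, c, h, w)

# Example: re-weight a C2f output feature map; the output keeps the input size.
print(EMA(64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```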

2.2.3. WIoU Position Loss Function

The YOLOv8n network employs the CIoU loss function to compute the coordinate loss of the prediction frame. A key benefit of CIoU is its improved detection-frame scaling loss over DIoU, as it accounts for the aspect-ratio loss, thereby ensuring more stable target-frame regression. However, if the aspect ratios of the prediction frame and the ground-truth frame match, the aspect-ratio penalty term remains constant at 0, potentially impeding effective model optimization. The WIoU function utilizes an IoU-based gradient allocation strategy to dynamically adjust penalties across different anchor-frame sizes, diminish penalty differences, and enhance performance when confronted with diverse geometric factors, such as distance and aspect ratio, in targets. This strengthens the model’s generalization performance [24]. The formulas are shown below.
L_{WIoU} = L_{WIoUv3} = r \cdot L_{WIoUv1}
r = \frac{\beta}{\delta \alpha^{\beta - \delta}}
\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)
L_{WIoUv1} = R_{WIoU} \cdot L_{IoU}
R_{WIoU} = \exp\left( \frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{\left( W_{g}^{2} + H_{g}^{2} \right)^{*}} \right)
where α and δ are hyperparameters, β is the constructed non-monotonic focusing coefficient, and r = 1 when β = δ; R_{WIoU} ∈ [1, e) and L_{IoU} ∈ [0, 1]. The superscript * indicates that the quantity is detached from the computational graph so that it does not contribute gradients.
In Figure 7, RWIoU is applied to increase the importance of regular-quality anchor frames, directing the model’s attention toward anchor frames with low overlap between the predicted and target frames. On the other hand, LIoU is used to diminish the significance of high-quality anchor frames, decreasing the emphasis on the proximity between the centroids of the predicted frames when there is a high overlap between the predicted and target frames.
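The following PyTorch sketch shows how the WIoU v3 loss above can be evaluated for axis-aligned boxes. The hyperparameter values and the treatment of the running mean of L_IoU as a fixed scalar are simplifying assumptions for illustration.

```python
import torch

def wiou_v3(pred, target, iou_mean, alpha=1.9, delta=3.0):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2). Returns the per-box WIoU v3 loss.
    iou_mean: running mean of L_IoU over recent steps (treated as a constant here)."""
    # Plain IoU and its loss L_IoU = 1 - IoU.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou

    # R_WIoU: squared centre distance over the enclosing-box diagonal, the
    # denominator detached so it does not produce gradients.
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    enc_wh = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    wg, hg = enc_wh[:, 0], enc_wh[:, 1]
    r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2) /
                       (wg ** 2 + hg ** 2 + 1e-7).detach())

    # Non-monotonic focusing coefficient r built from the outlier degree beta.
    beta = l_iou.detach() / iou_mean
    r = beta / (delta * alpha ** (beta - delta))
    return r * r_wiou * l_iou
```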

2.3. Image Test Platform Construction and Training Parameters

To fulfill the requirements for model training, the test platform was set up with the following specifications: an NVIDIA GeForce RTX 4090 graphics card (NVIDIA Corp., Santa Clara, CA, USA) with 12 GB of video memory, a 24-core Intel Core i9-13900K CPU (Intel Corp., Santa Clara, CA, USA), and 128 GB of RAM. The operating system was Windows 10, and the platform ran PyTorch 1.13, Python 3.8, and CUDA 11.6. The model was trained from the official YOLOv8n.pt pre-trained weights on the established soybean flower and pod dataset. The batch size was 32, the number of epochs was set to 300, and the image size was 640 × 640 pixels. Since the GPU is much faster than the CPU for model training and validation, the GPU was used for both.
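For reference, training with the parameters listed above would look roughly as follows using the Ultralytics Python API; the dataset YAML name is an assumption, and the additional code needed to register the VanillaNet backbone, C2f-EMA module, and WIoU loss is not shown.

```python
from ultralytics import YOLO

# Start from the official YOLOv8n weights and fine-tune on the soybean flower/pod dataset.
model = YOLO("yolov8n.pt")
model.train(
    data="soybean_flower_pod.yaml",  # assumed dataset config (paths + class names: flower, pod)
    epochs=300,
    imgsz=640,
    batch=32,
    device=0,          # train on the GPU
)
metrics = model.val()  # validate on the held-out split
```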

2.4. Model Performance Evaluation Metrics

In order to assess the effectiveness of the soybean flower- and pod-detection model in real-world conditions, the evaluation criteria of F1 score, mean average precision (mAP), and frames per second (FPS) were utilized. The calculation formulas for these metrics are as follows:
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
F1 = \frac{2PR}{P + R}
where TP is the number of positive samples predicted to be positive, FP is the number of negative samples predicted to be positive, and FN is the number of positive samples predicted to be negative.
AP reflects the prediction accuracy for each category, and its value is the area enclosed by the P-R curve and the horizontal and vertical axes. The formula is as follows:
AP = \int_{0}^{1} P(r) \, dr
where r is the recall, which serves as the integration variable, and P(r) is the precision at recall r; AP is therefore the area under the precision–recall curve.
mAP represents the average of the APs across all categories and serves as a measure of the overall accuracy of the model [25]. mAP@0.5 signifies the mean average precision at an IoU threshold of 0.5. The notation mAP@0.5:0.95 indicates that mAP is calculated at IoU thresholds from 0.5 to 0.95 in steps of 0.05 and averaged over this range. In this paper, mAP denotes the mean average precision calculated at an IoU threshold of 0.5 by default. The calculation formula is as follows:
mAP = \frac{\sum_{j=1}^{S} AP(j)}{S}
where S represents the total number of categories and AP(j) is the average precision of the jth category; in this study, the categories are flower and pod.
Frames per second (FPS) is a commonly used metric to evaluate the performance of graphics rendering [26]. Here, it represents the number of images the field soybean flower- and pod-detection model can process per second. A higher FPS value indicates a faster processing speed and better model performance. The formula for FPS is as follows.
FPS = \frac{Frames}{Time}
where Frames represents the total number of frames processed in a period and Time represents the period for processing frames.
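A compact sketch of how these evaluation metrics can be computed from detection counts, a precision–recall curve, and timing data (all input values below are placeholders):

```python
import numpy as np

def f1_score(tp: int, fp: int, fn: int) -> float:
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """Area under the P-R curve, integrated over recall."""
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

def mean_ap(ap_per_class: list[float]) -> float:
    return sum(ap_per_class) / len(ap_per_class)

def fps(frames: int, seconds: float) -> float:
    return frames / seconds

print(f1_score(tp=90, fp=5, fn=5))    # ~0.947
print(fps(frames=900, seconds=10.0))  # 90.0
```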

2.5. Ablation Experiment Design

In order to assess the detection capabilities of the YOLOv8-VEW model compared to the original YOLOv8 model using the soybean flower pod dataset, ablation experiments were conducted to examine the impact of each module’s enhancements. We first added one enhancement module at a time to the original YOLOv8 model to assess the impact of each module on model parameters, detection accuracy, and speed. Then, one enhancement module at a time was removed from the final model (YOLOv8-VEW) to evaluate their impact on the overall performance.

2.6. Experimental Design of Different Target-Detection Models

In order to more effectively evaluate the model’s performance, we compared the YOLOv8-VEW model with other state-of-the-art target-detection models on the soybean flower and pod dataset. These included the two-stage target-detection algorithm Faster R-CNN; the single-stage target-detection algorithm SSD; the commonly used YOLO-family detection models YOLOv5, YOLOv7, YOLOX, YOLOv8, and YOLOv9; and the most recent YOLO-family detection model, YOLOv10.

2.7. Application of the Model

After the above improvements, the field soybean flower- and pod-detection model YOLOv8-VEW was obtained. YOLOv8-VEW can recognize and count soybean flowers and pods in the field. In the application process, the flower and pod images of all nodes of a single soybean plant are collected and then fed into the YOLOv8-VEW model for recognition and counting. The number of flowers and pods at each node is then output to a CSV file, and the sums of this file give the number of flowers and pods per plant. The whole process is shown in Figure 8.
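A hedged sketch of this per-plant counting flow using the Ultralytics inference API is shown below; the weight-file name, directory layout, class-index mapping, and CSV column names are assumptions for illustration.

```python
import csv
from pathlib import Path
from ultralytics import YOLO

model = YOLO("yolov8_vew.pt")          # assumed path to the trained YOLOv8-VEW weights
CLASS_NAMES = {0: "flower", 1: "pod"}  # assumed class-index mapping

rows, totals = [], {"flower": 0, "pod": 0}
for img in sorted(Path("plant_01_nodes").glob("*.jpg")):   # one image per node
    result = model(str(img))[0]
    counts = {"flower": 0, "pod": 0}
    for cls_id in result.boxes.cls.tolist():
        counts[CLASS_NAMES[int(cls_id)]] += 1
    rows.append({"node": img.stem, **counts})
    for k in totals:
        totals[k] += counts[k]

with open("plant_01_counts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["node", "flower", "pod"])
    writer.writeheader()
    writer.writerows(rows)

print("Per-plant totals:", totals)  # summed over all nodes of the plant
```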

3. Results

3.1. Ablation Experiment Result

Table 2 shows that after adding the VanillaNet module alone to the original YOLOv8 model, the F1 and mAP were 0.90 and 94.6%, respectively, with no change in F1, a 0.1% increase in mAP, and an increase in FPS of 31. After adding the EMA attention mechanism module alone, the F1 and mAP improved to 0.93 and 95.7%, respectively, 0.03 and 1.2% higher than those of the original model; however, the FPS decreased to 58, 8 lower than the original model. After adding the WIoU loss function alone, the F1 and mAP improved to 0.92 and 95.1%, respectively, improvements of 0.02 and 0.6%, while the FPS decreased to 62, 4 lower than the original model.
After removing the WIoU loss function alone from the YOLOv8-VEW model, the F1 and mAP were 0.93 and 95.7%, respectively, 0.02 and 1.2% lower than those of the YOLOv8-VEW model, but the FPS was 91, an increase of 1. After removing the EMA attention mechanism module alone from the YOLOv8-VEW model, the F1 and mAP were 0.92 and 95.3%, respectively, decreases of 0.03 and 1.6% compared with the YOLOv8-VEW model, but the FPS was 94, an increase of 4. After removing the VanillaNet module from the YOLOv8-VEW model, the F1 and mAP were 0.94 and 96.7%, respectively, lower than those of the YOLOv8-VEW model by only 0.01 and 0.2%, but the FPS was 60, a decrease of 30.
Using the VanillaNet module improves the model’s detection speed, but the improvement in detection performance is not apparent. Adding the EMA attention mechanism module and the WIoU loss function improves the model’s detection performance, but it makes the model more complex and reduces the detection speed. In contrast, the YOLOv8-VEW model, with all three modules added simultaneously, achieved the highest F1 and mAP of 0.95 and 96.9%, respectively, which were 0.05 and 2.4% higher than those of the original model. Its detection speed reached 90 FPS, 24 higher than the original model, which meets practical requirements. Therefore, the YOLOv8-VEW model has the best overall performance.

3.2. Experimental Results of Different Target-Detection Models

As shown in Table 3, the modeling effects of the two-stage object detection algorithm using the Faster R-CNN model were first compared. The Faster R-CNN model yielded F1, mAP, and FPS values of 0.89, 93.4%, and 38, respectively, representing decreases of 0.06, 3.5%, and 52 compared to the YOLOv8-VEW model.
Next, the performance of the one-stage model SSD was analyzed. The SSD model recorded F1, mAP, and FPS values of 0.80, 85.6%, and 48, respectively, marking decreases of 0.15, 11.3%, and 42 compared to the YOLOv8-VEW model.
Lastly, a comparison was made with other models in the YOLO series. Compared to the YOLOv8-VEW model, the previously proposed YOLOv5 model showed reductions in F1, mAP, and FPS of 0.07, 3.9%, and 8, respectively. The YOLOX model showed decreases in F1, mAP, and FPS of 0.09, 5.5%, and 29, respectively, and the YOLOv7 model had lower values by 0.08, 4.7%, and 17, respectively. The YOLOv8n model exhibited F1, mAP, and FPS values that were 0.05, 2.4%, and 24 lower, respectively, and the YOLOv9 model showed reductions of 0.08, 4.3%, and 27, respectively. Compared to the latest YOLOv10 model in the YOLO series, the YOLOv8-VEW model improved F1, mAP, and FPS by 0.06, 3.4%, and 24, respectively.
Therefore, overall, the YOLOv8-VEW model demonstrated the best comprehensive performance. However, its effectiveness in actual field applications remained unknown; to avoid a scenario in which model accuracy is high but application performance is poor, it was necessary to validate the model in actual field applications and explore the feasibility of studying the change in the number of soybean flowers and pods.

3.3. Verification of the Actual Application Effect in the Field

The above study indicates that the YOLOv8-VEW model for soybean flower and pod detection has improved results in terms of F1 score, mAP, and FPS. To validate the practicality of this detection model for soybean flower and pod identification, two soybean varieties, Jiyu 86 and Longken 306, were selected as test samples. Images of the plants were processed through the model for flower and pod recognition. The count of flowers and pods per plant was analyzed and compared with manual counts. A linear regression analysis was conducted to compare the detected counts with the manual counts, resulting in a coefficient of determination R-squared of 0.98311 for soybean flowers and 0.98926 for pods. The results, displayed in Figure 9, demonstrate that the YOLOv8-VEW model performs effectively in simultaneously identifying and counting soybean flowers and pods. Therefore, the YOLOv8-VEW model can be used to study the number of changes in soybean flowers and pods in the field.
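For context, the coefficient of determination between detected and manual counts can be computed as in the following sketch; the count arrays are placeholder data, not the study's measurements.

```python
import numpy as np

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination of a least-squares line fit between the two count series."""
    slope, intercept = np.polyfit(y_true, y_pred, deg=1)
    fitted = slope * y_true + intercept
    ss_res = np.sum((y_pred - fitted) ** 2)
    ss_tot = np.sum((y_pred - np.mean(y_pred)) ** 2)
    return 1.0 - ss_res / ss_tot

manual = np.array([12, 18, 25, 31, 40])    # placeholder manual counts per plant
detected = np.array([11, 19, 24, 32, 38])  # placeholder model counts per plant
print(round(r_squared(manual, detected), 3))
```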

4. Discussion

4.1. Comparison of Different Models

Faster R-CNN is a two-stage target-detection model that is more accurate than single-stage models [27]. However, its fully connected layers occupy most of the network’s parameters, resulting in a relatively slow detection speed. SSD is a one-stage target-detection model that uses feature maps at multiple scales to achieve end-to-end detection of targets of different sizes [28]. However, its limited sensitivity to changes in target scale can lead to missed detections, and for dense targets the dense distribution of prior boxes produces many overlapping detection results. The YOLO series is the most popular family of single-stage target-detection models today. YOLOv5 adds the focus module to improve computational efficiency without loss of information, allowing for faster inference [29]. The YOLOX model uses a decoupled head to separate the regression and classification tasks into two sub-networks, improving the performance and efficiency of the model [30]. YOLOv7 incorporates the RepVGG convolutional neural network architecture, which uses structural re-parameterization to be faster, use less memory, and be more flexible [31]; using ELAN and MP structures in the backbone network allows deeper networks to learn and converge efficiently. In the YOLOv8 model, the C3 module of YOLOv5 is replaced by the lightweight C2f module, designed from the ideas of the C3 module and ELAN, so that YOLOv8 remains lightweight while obtaining richer gradient flow information. Meanwhile, YOLOv8 adopts the decoupled head structure of YOLOX, which extracts category features and location features through two parallel branches, completing the classification and localization tasks accurately and efficiently. YOLOv9 is based on the YOLOv7 architecture, introducing programmable gradient information (PGI) and an efficient GELAN backbone network [32]; multi-branch learning captures multiple levels of semantic information, and gradients are updated efficiently through an auxiliary supervision mechanism. YOLOv10 is based on the YOLOv8 structure [33]; it removes NMS post-processing and adopts a dual-assignment strategy with a consistent matching metric, which helps minimize the supervision gap between the two branches and improves overall performance. It is the first real-time end-to-end target-detection model in the YOLO series; however, its detection of small targets still has shortcomings.

4.2. Detection Results

The improved YOLOv8-VEW soybean flower- and pod-detection model based on YOLOv8 achieved good detection results in the field under high light intensity, low light intensity, small targets, and clusters of flowers and pods, as shown in Figure 10 and Figure 11. Since the VanillaNet network uses a continuous convolution–pooling structure to extract features, there is no direct connection between different blocks; the feature maps are continuously downsampled and passed to subsequent blocks through the convolution and pooling layers while avoiding branching structures, which eliminates a large number of operations. Therefore, using the lightweight VanillaNet module as the backbone feature-extraction network effectively reduces the complexity of the model and increases the inference speed. The EMA technique divides the input feature maps into G sub-features along the channel dimension to capture diverse feature information. By modeling cross-channel interactions and employing a cross-spatial aggregation method in the spatial dimensions, EMA enhances feature aggregation, improving the handling of small and ambiguous targets. To address the challenges posed by the complex field environment, the WIoU loss function balances the regression of high-quality and low-quality samples, enhancing the localization accuracy of prediction boxes. This loss function assigns smaller gradient gains to high-quality anchor boxes so that the model prioritizes ordinary anchor boxes, and lower gains to low-quality anchors to prevent adverse gradients, promoting better gradient-gain assignment and improved localization performance.

4.3. Limitations and Solutions

In this study, we captured soybean flower and pod images with a field image acquisition device and used the YOLOv8-VEW model to identify and count the flowers and pods in the images, realizing their dynamic monitoring in the field. This can provide technical means and theoretical support for exploring the pattern of change in soybean flower and pod numbers, searching for high-yield soybean planting modes, and cultivating high-yield soybean varieties. Although the YOLOv8-VEW model achieves good detection results, missed detections and false detections still occur in the dataset captured by the image acquisition device. These errors arise mainly because the device cannot automatically adjust the shooting angle as the soybeans grow, so clustered flowers are not fully exposed in the images. In addition, pods of different sizes coexist at different growth stages, and large pods occlude small pods, causing small target pods to be missed. The growth habit of soybean flowers and pods also leads to clustering and close contact between pods, so the model may recognize two pods as one, leading to counting errors. In the future, the field image acquisition device will be improved to raise image quality and provide the hardware and information technology support needed to explore the pattern of change in soybean flowers and pods.

5. Conclusions

In order to quickly detect changes in the number of soybean flowers and pods in the field, mainstream target-detection algorithms were used to detect soybean flowers and pods in a dataset. VanillaNet was used to replace the YOLOv8 backbone network to improve the detection speed. The EMA attention mechanism and WIoU localization loss function were added to improve the detection accuracy, and the soybean flower- and pod-detection model in the field, YOLOv8-VEW, was established.
The F1, mAP, and FPS of YOLOv8-VEW reached 0.95, 96.9%, and 90 FPS, respectively. In order to verify the model’s performance, a validation test was conducted for practical application in the field. Comparing the model’s counting results with manual counting results, the R2 of soybean flowers was 0.98311 and the R2 of pods was 0.98926, which indicated that the model performed well in counting soybean flowers and pods and was able to detect soybean flowers and pods quickly in the field.

Author Contributions

Writing—original draft preparation, K.Z.; writing—review and editing, W.Z.; formal analysis, W.S.; software, C.Y.; data curation, J.L.; supervision, L.Q.; investigation, K.Z. and W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China Agriculture Research System of MOF and MARA under Grant No. CARS-04-PS32 and the Heilongjiang Bayi Agricultural University Introducing Talents Research Initiation Program under Grant No. XYB202306.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available because they are also needed for future research publications.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. McCarthy, A.; Raine, S. Automated variety trial plot growth and flowering detection for maize and soybean using machine vision. Comput. Electron. Agric. 2022, 194, 106727. [Google Scholar] [CrossRef]
  2. Gao, Y.; Huang, X.; Yang, H. Preliminary study on flowers and pods abscission of soybean. Plant Physiol. Commun. 1958, 5, 9–14. [Google Scholar]
  3. Gai, J. Studies on flowering and podding characteristics of summer soybean of finite and infinite habit. J. Nanjing Agric. Univ. 1984, 4, 6–18. [Google Scholar]
  4. Su, L.; Zhang, R.S.; Song, S.; Dong, Z.; Xie, F.; Wang, X. Comparative study on the progress of flowering and pod bulging in soybeans with different podding habits. Soybean Sci. 1997, 3, 52–59. [Google Scholar]
  5. Song, S.; Dong, Z. Comparison of flowering and podding habits of different soybean varieties. Chin. Agric. Sci. 2002, 11, 1420–1423. [Google Scholar]
  6. Zhao, S.; Tang, X.; Zhao, X.; Feng, Y.; Zhang, M. An observational study on the spatial and temporal distribution of flowering and deflowering in soybean. Chin. Agric. Sci. 2013, 8, 1543–1554. [Google Scholar]
  7. Fan, S.; Li, J.; Zhang, Y.; Tian, X.; Wang, Q.; He, X.; Zhang, C.; Huang, W. On line detection of defective apples using computer vision system combined with deep learning methods. J. Food Eng. 2020, 286, 110102. [Google Scholar] [CrossRef]
  8. Zhang, D.; Cheng, H.; Wang, H.; Zhang, H.; Liu, C.; Yu, D. Identification of genomic regions determining flower and pod numbers development in soybean (Glycine max L.). J. Genet. Genom. 2010, 37, 545–556. [Google Scholar] [CrossRef]
  9. Zhu, L.; Spachos, P.; Pensini, E.; Plataniotis, K. Deep learning and machine vision for food processing: A survey. Curr. Res. Food Sci. 2021, 4, 233–249. [Google Scholar] [CrossRef]
  10. Ceyhan, M.; Kartal, Y.; Özkan, K.; Seke, E. Classification of wheat varieties with image-based deep learning. Multimed. Tools Appl. 2024, 83, 9597–9619. [Google Scholar] [CrossRef]
  11. Xiong, H.; Cao, Z.; Lu, H.; Madec, S.; Liu, L.; Shen, C. TasselNetv2: In-field counting of wheat spikes with context-augmented local regression networks. Plant Methods 2019, 15, 150. [Google Scholar] [CrossRef] [PubMed]
  12. Lu, H.; Liu, L.; Li, Y.; Zhao, X.; Wang, X.; Cao, Z. TasselNetV3: Explainable plant counting with guided upsampling and background suppression. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  13. Li, G.; Suo, R.; Zhao, G.; Gao, C.; Fu, L.; Shi, F.; Dhupia, J.; Cui, Y. Real-time detection of kiwifruit flower and bud simultaneously in orchard using YOLOv4 for robotic pollination. Comput. Electron. Agric. 2022, 193, 106641. [Google Scholar] [CrossRef]
  14. Wu, D.; Lv, S.; Jiang, M.; Song, H. Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 2020, 178, 105742. [Google Scholar] [CrossRef]
  15. Xiang, S.; Wang, S.; Xu, M.; Wang, W.; Liu, W. YOLO POD: A fast and accurate multi-task model for dense Soybean Pod counting. Plant Methods 2023, 19, 8. [Google Scholar] [CrossRef] [PubMed]
  16. Hasan, M.; Chopin, J.; Laga, H.; Miklavcic, S. Detection and analysis of wheat spikes using convolutional neural networks. Plant Methods 2018, 14, 100. [Google Scholar] [CrossRef]
  17. Lu, W.; Du, R.; Niu, P.; Xing, G.; Luo, H.; Deng, Y.; Shu, L. Soybean yield preharvest prediction based on bean pods and leaves image recognition using deep learning neural network combined with GRNN. Front. Plant Sci. 2022, 12, 791256. [Google Scholar] [CrossRef]
  18. Miao, C.; Guo, A.; Thompson, A.; Yang, J.; Ge, Y.; Schnable, J. Automation of leaf counting in maize and sorghum using deep learning. Plant Phenome J. 2021, 4, e20022. [Google Scholar] [CrossRef]
  19. Yue, Y. Research and Experimentation of Soybean Flower and Pod Identification Device. Master’s Thesis, Heilongjiang Bayi Agricultural University, Daqing, China, 2023. [Google Scholar]
  20. Zhu, R.; Wang, X.; Yan, Z.; Qiao, Y.; Tian, H.; Hu, Z.; Chen, Q. Exploring soybean flower and pod variation patterns during reproductive period based on fusion deep learning. Front. Plant Sci. 2022, 13, 922030. [Google Scholar] [CrossRef]
  21. Yang, S.; Wang, W.; Gao, S.; Deng, Z. Strawberry ripeness detection based on YOLOv8 algorithm fused with LW-Swin Transformer. Comput. Electron. Agric. 2023, 215, 108360. [Google Scholar] [CrossRef]
  22. Chen, H.; Wang, Y.; Guo, J.; Tao, D. VanillaNet: The power of minimalism in deep learning. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 36. [Google Scholar]
  23. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  24. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  25. Zhou, Y.; Tang, Y.; Zou, X.; Wu, M.; Tang, W.; Meng, F.; Kang, H. Adaptive active positioning of Camellia oleifera fruit picking points: Classical image processing and YOLOv7 fusion algorithm. Appl. Sci. 2022, 12, 12959. [Google Scholar] [CrossRef]
  26. Zeng, T.; Li, S.; Song, Q.; Zhong, F.; Wei, X. Lightweight tomato real-time detection method based on improved YOLO and mobile deployment. Comput. Electron. Agric. 2023, 205, 107625. [Google Scholar] [CrossRef]
  27. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  28. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  29. Wang, Z.; Jin, L.; Wang, S.; Xu, H. Apple stem/calyx real-time recognition using YOLO-v5 algorithm for fruit automatic loading system. Postharvest Biol. Technol. 2022, 185, 111808. [Google Scholar] [CrossRef]
  30. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  31. Yang, H.; Liu, Y.; Wang, S.; Qu, H.; Li, N.; Wu, J.; Qiu, J. Improved apple fruit target recognition method based on YOLOv7 model. Agriculture 2023, 13, 1278. [Google Scholar] [CrossRef]
  32. Wang, C.; Yeh, I.; Liao, H. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  33. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
Figure 1. A detailed diagram of the data collection scheme: (a) the sample plants; (b) a structural sketch drawn from the sample plants; (c) images taken at each node, corresponding to the structural sketch.
Figure 2. An image acquisition device for soybean flowers and pods in the field.
Figure 3. The YOLOv8 network structure diagram.
Figure 4. The VanillaNet network structure diagram.
Figure 5. The EMA attention mechanism module network structure diagram.
Figure 6. The C2f-EMA module network structure diagram.
Figure 7. Schematic diagram of the bounding box regression. (x, y) are the center coordinates of the prediction box; (xgt, ygt) are the center coordinates of the target box; and Wg and Hg are the width and length of the union of the prediction box and the target box, respectively.
Figure 8. The YOLOv8-VEW model application flowchart.
Figure 9. Plot of linear correlation between model predictions and true results: (a) linear correlation plot of predicted versus true flower number; (b) linear correlation plot of predicted versus true pod number.
Figure 10. The YOLOv8-VEW model’s detection of soybean flowers under different conditions.
Figure 11. The YOLOv8-VEW model’s detection of soybean pods under different conditions.
Table 1. The division of the soybean flower and pod dataset.
Image Tag    Training Set    Validation Set    Test Set
Pod          1680            210               210
Flower       1680            210               210
Table 2. The results of the ablation experiments.
Model         VanillaNet    EMA    WIoU    F1      mAP/%    FPS
YOLOv8        –             –      –       0.90    94.5     66
YOLOv8-V      √             –      –       0.90    94.6     97
YOLOv8-E      –             √      –       0.93    95.7     58
YOLOv8-W      –             –      √       0.92    95.1     62
YOLOv8-VE     √             √      –       0.93    95.7     91
YOLOv8-VW     √             –      √       0.92    95.3     94
YOLOv8-EW     –             √      √       0.94    96.7     60
YOLOv8-VEW    √             √      √       0.95    96.9     90
Table 3. The comparison results of the different detection models.
Model           F1      mAP/%    FPS
Faster R-CNN    0.89    93.4     38
SSD             0.80    85.6     48
YOLOv5          0.88    93.0     82
YOLOX           0.86    91.4     61
YOLOv7          0.87    92.2     73
YOLOv8          0.90    94.5     66
YOLOv9          0.87    92.6     63
YOLOv10         0.89    93.5     66
YOLOv8-VEW      0.95    96.9     90
