1. Introduction
Soybeans, a globally vital crop, serve as a crucial source of protein and vegetable oil while finding extensive applications in animal feed, industrial materials, and bioenergy [1]. In China, improving soybean production is critically important, especially amid global food market volatility and trade uncertainties, making it a national agricultural priority [2]. Phenotypic measurement technologies are essential for soybean breeding, enabling the precise analysis of gene–trait relationships and facilitating genetic improvement and cultivar optimization [3].
Pod count is a key yield-related trait essential for yield prediction, germplasm screening, and breeding strategies. However, accurate counting in field conditions is challenging due to occlusion, dense pod distribution, and variable orientations. Manual counting is labor-intensive and prone to subjective errors, potentially impacting breeding decisions [4]. Therefore, the development of efficient automated measurement systems is crucial for improving breeding accuracy, minimizing errors, and advancing modern soybean breeding practices.
Recent advancements in smart agriculture have witnessed the widespread adoption of deep learning, particularly object detection techniques, in soybean phenotyping [5,6,7]. Unlike traditional machine learning approaches, deep learning excels in analyzing large-scale image data, enabling precise object recognition in complex field environments while significantly improving detection efficiency and reducing human intervention. In automated pod counting systems, the performance of object detection algorithms is pivotal in determining measurement accuracy and operational efficiency.
Current deep learning-based detection algorithms fall into two main categories: region-based and regression-based approaches. Region-based techniques typically employ a two-stage process, first generating region proposals and then performing classification and regression. A notable example is Faster R-CNN [8], which utilizes a Region Proposal Network (RPN) to automatically generate candidate regions, enhancing both detection accuracy and computational efficiency. In contrast, regression-based methods follow a single-stage architecture, directly predicting object locations and classifications. These approaches offer superior computational efficiency, making them particularly suited for real-time applications. Among them, the YOLO (You Only Look Once) series [9,10,11,12,13,14] has emerged as a leading framework in soybean phenotyping due to its exceptional balance between detection accuracy and processing speed.
Recent advancements have significantly improved pod detection. Xiang et al. developed YOLO POD, an extension of the YOLOX framework, incorporating a dedicated pod counting module, modified loss functions for multi-task learning, and an attention mechanism, achieving notable improvements in both accuracy and efficiency [15]. Li et al. enhanced YOLOv5 by integrating the FasterNet Block and EMA module, along with the CIoU loss function, to boost recognition performance and generalization [16]. Yu et al. introduced PodNet, a lightweight network with an efficient encoder–decoder structure, effectively mitigating information loss and gradient degradation across non-adjacent layers [17]. Li et al. proposed SoybeanNet, a Transformer-based point-counting network, achieving high-precision soybean pod counting and localization using UAV images [18]. Fu et al. developed RefinePod, an enhanced instance segmentation network, delivering high-throughput and high-quality pod segmentation and precise seed-per-pod estimation [19]. Mathew et al. proposed a novel method utilizing a depth camera for real-time RGB image filtering combined with YOLOv7 for pod detection, offering an efficient and non-destructive approach for soybean yield prediction [20]. Li et al. proposed a Soybean Pod Counting Network (SPCN) based on Hybrid Dilated Convolution and attention mechanisms, demonstrating its potential for intelligent breeding and precise agricultural management [21]. However, despite these advancements, the visual similarity between mature soybean plants and soil backgrounds, combined with dense pod arrangements and frequent occlusions, continues to challenge detection robustness and accuracy. Existing approaches often struggle to balance computational efficiency and detection precision, leading to persistent issues with false positives and missed detections [22].
Thus, the development of novel detection algorithms remains crucial for overcoming these challenges, not only offering more reliable tools for soybean breeding, but also establishing a solid foundation for advancing agricultural intelligence and precision farming.
2. Materials and Methods
Soybean pod detection and counting in this experiment were performed using the SmartPod method. The experimental process is illustrated in Figure 1.
Step a: Pretrain the BeanpodNet (BN) network model using the iterative self-training framework (Figure 1a). In this phase, pseudo-labels are generated through semi-supervised learning methods, significantly reducing the reliance on manually labeled data. The pretraining process leverages unlabeled data to enhance the model’s generalization capability, enabling the network to adapt effectively to diverse scenarios.
Step b: Construct the BeanpodNet (BN) network (Figure 1b) as a subnetwork for self-training by integrating multiple advanced techniques to achieve high-accuracy pod detection. This step aims to further enhance the network’s perception of objects, particularly improving its performance in complex field environments.
Step c: Combine the weights obtained from pretraining in Step a with the optimized network architecture from Step b for end-to-end training (Figure 1c). During this phase, the network gradually optimizes its parameters, enhancing BN’s ability to accurately identify pods while effectively reducing false positives and missed detections. Through comprehensive training, the network not only adapts to various pod appearance features, but also improves its detection capability for occluded and background-complex targets.
Step d: After the model training is complete, use the best-performing network to perform pod detection and counting on the test set (Figure 1d). This phase primarily evaluates the model’s practical performance, ensuring its accuracy and robustness in real field conditions. By validating the test set data, the network is further adjusted and optimized to achieve optimal performance in real-world applications.
2.1. Data and Preprocessing
This study used the salt-tolerant soybean varieties “Qihuang 34” and “Qihuang 641”, as well as their derivatives, as experimental subjects. The experiment was conducted at the National Saline-Alkali Land Comprehensive Utilization and Technology Innovation Center in Guangrao, Dongying, Shandong, located in a mildly saline-alkali area (longitude: 118.49° E–118.65° E, latitude: 37.31° N–37.83° N). The soil pH was 8.62, with slight fluctuations depending on precipitation. The soybean planting density was controlled within 50,000 plants per hectare, with two seeds per hole and row and plant spacings of 50 cm and 13 cm, respectively. The total area of each plot was 0.01 hectares. A three-replicate experimental design was implemented to ensure data reliability, with each variety planted in separate plots.
Before imaging, the plants were carefully cut at the base to avoid damage and laid flat on the ground beneath the phenotyping platform to ensure consistent imaging conditions as much as possible. This process was designed to minimize stress on the plants and maintain the integrity of the phenotypic traits during data collection. The dataset was automatically captured by the TraitDiscover high-throughput phenotyping platform (Figure 2a) on 10 October 2024 between 1:00 and 2:00 a.m. The platform features a truss-style mobile scanning platform with three-axis automated motion capabilities and is equipped with a visible light imaging unit (Trait-RGB), with key parameters listed in Table 1. During image acquisition, soybean plants were laid flat on the ground, with the camera positioned vertically above to prevent image distortion. The dataset comprised 1500 images, which were split into training, validation, and test sets in a ratio of 7:2:1. The model was trained on the training set and evaluated on the validation and test sets. Finally, the key points of mature soybean pods in the images were annotated using the LabelMe tool (Figure 2b).
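A fixed random seed makes such a 7:2:1 partition reproducible. The following is a minimal sketch; the directory name, file extension, and seed are illustrative assumptions rather than details of the actual pipeline.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0):
    """Split images into train/val/test subsets in a 7:2:1 ratio."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return {
        "train": paths[:n_train],
        "val": paths[n_train:n_train + n_val],
        "test": paths[n_train + n_val:],
    }

splits = split_dataset("soybean_pods/images")
print({k: len(v) for k, v in splits.items()})  # e.g. {'train': 1050, 'val': 300, 'test': 150}
```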
2.2. BeanpodNet (BN)
The YOLOv8 [11] backbone uses the efficient lightweight CSPDarknet (Cross-Stage Partial Darknet) with cross-stage connections to address the vanishing gradient issue, improving computational efficiency and detection accuracy. As one of the most advanced versions in the YOLO series, YOLOv8 boasts a lightweight design, high efficiency, and enhanced feature extraction and object detection capabilities. However, when applied to soybean pod detection, YOLOv8 still faces several limitations: (1) difficulty distinguishing object features in complex backgrounds, (2) limited capability of separating occluded objects, and (3) insufficient fine-grained feature capture when processing dense targets. To address the challenges mentioned above, this study proposes BN, a soybean pod detection model based on YOLOv8:
A Super Token Sampling Vision Transformer (SViT) [23] is incorporated to strengthen global contextual feature extraction, mitigating the impact of complex backgrounds on detection performance. SViT has demonstrated exceptional effectiveness in object detection applications [24,25,26].
A Multi-Scale Spatial Enhancement Attention Module (MultiSEAM) [27] is introduced to enhance occluded object separation, improving the robustness of recognition in occlusion scenarios. The effectiveness of MultiSEAM has been validated in numerous studies [28,29,30].
An Inner Intersection over Union (Inner-IoU) loss [31], an optimized bounding box regression strategy, is employed to enhance both detection accuracy and processing speed, improving bounding box matching for dense targets.
A semi-supervised iterative self-training strategy [32] is integrated, leveraging pseudo-labels from unlabeled data to improve detector generalization, thereby reducing dependence on labeled data.
BN was introduced to the field of pod detection to address challenges such as high-density targets, occlusion, and the similarity between background and object features. As shown in Figure 3, the BN detection architecture consists of the backbone, neck, and head networks. The backbone uses CSPDarknet as the feature extraction network, with the SViT module introduced at the P9 layer to optimize feature representation and enhance global context modeling, significantly improving pod detection performance. The neck network retains the SPP and PAN structures, further enhancing feature fusion and information propagation. To improve object localization, features from the P2 layer are fused into PAN, which enhances the detection accuracy of small and occluded objects. To further mitigate the impact of occlusion, the MultiSEAM attention mechanism is incorporated into the neck network. By integrating multi-scale features, this module significantly strengthens the model’s ability to detect occluded soybean pods, effectively reducing both false positives and missed detections. Through these innovations, BN excels in pod detection tasks, ensuring higher accuracy and robustness, particularly in scenarios involving dense vegetation, occlusion, and complex field environments.
2.2.1. CNN
CNNs (Convolutional Neural Networks) have achieved remarkable success in deep learning due to their ability to automatically extract hierarchical features and recognize complex patterns. Building on the deep learning resurgence sparked by Hinton et al. in 2006 [33], CNNs combine convolutional, pooling, and fully connected layers to enhance feature learning and classification performance. With the advancement of big data and hardware, CNNs have expanded, notably after AlexNet’s victory in the 2012 ImageNet competition, which demonstrated their superiority in large-scale image recognition tasks [34]. Modern CNNs improve computational efficiency through local receptive fields, mitigate complexity and overfitting via weight sharing, and employ pooling layers to extract high-level representations. The hybrid network used here integrates 1 × 1, 3 × 3, and 5 × 5 convolutional kernels, enabling multi-scale feature extraction to enhance model expressiveness and adaptability, as sketched below. This design optimizes feature representation, allowing for more precise soybean pod detection in complex field environments, ultimately improving both accuracy and robustness.
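As an illustration only, a parallel multi-branch block of this kind can be written in PyTorch as follows; the even channel split across the three branches is an assumption, not the exact configuration used in BN.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions whose outputs are concatenated,
    extracting features at several receptive-field sizes from one input."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch_ch = out_ch // 3
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, out_ch - 2 * branch_ch, kernel_size=5, padding=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

x = torch.randn(1, 64, 80, 80)
print(MultiScaleConv(64, 96)(x).shape)  # torch.Size([1, 96, 80, 80])
```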
2.2.2. C2f
C2f (CSP bottleneck with two convolutions) is an enhanced version of the C3 module (Cross-Stage Partial with three convolutions), designed to improve residual learning efficiency while incorporating the Efficient Layer Aggregation Network (ELAN) structure from YOLOv7 [35]. By combining standard convolutions with bottleneck modules, C2f optimizes gradient propagation, effectively mitigating the vanishing gradient problem while maintaining a lightweight architecture. It extracts multi-scale features through a structured combination of convolutions and C2f modules, ensuring effective feature fusion across different layers. The C2f structure includes two key components: cross-stage connections, which establish links between shallow and deep feature maps to facilitate better information flow, and partial networks, which enhance feature refinement across layers using convolution and up-sampling operations. This design not only strengthens feature representation, but also improves adaptability to complex backgrounds. The C2f module plays a crucial role in object detection, particularly in handling targets of varying scales, significantly enhancing both detection accuracy and robustness.
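A simplified PyTorch sketch of a C2f-style block is given below, assuming the usual split–bottleneck–concatenate layout; normalization/activation details and channel widths are illustrative rather than taken from the BN implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck: two 3x3 convs with an optional shortcut."""
    def __init__(self, ch: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.SiLU())
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """CSP-style block: split features, pass one half through n bottlenecks,
    keep every intermediate output, then fuse all branches with a 1x1 conv."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 2):
        super().__init__()
        self.hidden = out_ch // 2
        self.cv1 = nn.Conv2d(in_ch, 2 * self.hidden, 1)
        self.cv2 = nn.Conv2d((2 + n) * self.hidden, out_ch, 1)
        self.blocks = nn.ModuleList(Bottleneck(self.hidden) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))  # cross-stage split
        for b in self.blocks:
            y.append(b(y[-1]))                 # aggregate every stage output
        return self.cv2(torch.cat(y, dim=1))

x = torch.randn(1, 64, 80, 80)
print(C2f(64, 128, n=2)(x).shape)  # torch.Size([1, 128, 80, 80])
```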
2.2.3. SViT
The Vision Transformer (ViT) has demonstrated remarkable success in global context modeling [36], but its effectiveness in capturing fine-grained local features is often hindered by redundant information. To mitigate this limitation, techniques such as local self-attention and early-stage convolutions are commonly employed. However, these methods can compromise long-range dependency modeling, limiting ViT’s overall performance. SViT addresses this issue by drawing inspiration from superpixel design, reducing the number of image primitives to optimize information processing. It introduces the concept of super tokens, which condense the number of tokens involved in self-attention while preserving essential global context. The Super Token Attention (STA) mechanism operates in three stages: first, super tokens are generated through sparse relational learning; second, self-attention is applied to these tokens; and finally, the processed information is mapped back to the original token space. By decomposing global self-attention into sparse relational graphs and low-dimensional attention mechanisms, STA enhances computational efficiency while significantly reducing complexity. Leveraging STA, SViT not only strengthens local feature representation, but also optimizes global context modeling, making it particularly effective for visual tasks.
This design enhances the model’s feature extraction efficiency by optimizing the image processing pipeline. Given an input image, it first undergoes initial processing through a three-stage stem structure. The image then enters the fourth-stage layer, which integrates a convolutional module and a Super Token Transformer (STT) module (Figure 4). The STT module comprises three key components: Convolutional Position Embedding (CPE), Super Token Attention (STA), and a Convolutional Feed-Forward Network (ConvFFN), defined as follows:

$$X = \mathrm{CPE}(X_{\mathrm{in}}) + X_{\mathrm{in}},$$
$$Y = \mathrm{STA}(\mathrm{LN}(X)) + X,$$
$$Z = \mathrm{ConvFFN}(\mathrm{BN}(Y)) + Y,$$

where $X_{\mathrm{in}}$ is the input feature map, $\mathrm{LN}(\cdot)$ denotes layer normalization, and $\mathrm{BN}(\cdot)$ denotes batch normalization.
The STA module comprises three key processes: Super Token Sampling (STS), Multi-Head Self-Attention (MHSA), and Token Up-Sampling (TU). To establish associations between tokens and super tokens, an attention-based method constructs the association graph, defined as follows:

$$Q = \mathrm{Softmax}\!\left(\frac{X S^{\top}}{\sqrt{d}}\right),$$

where $X$ denotes the token features, $S$ denotes the super token features, and $d$ is the channel number $C$.
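To make the three STA stages concrete, the following is a single-head, single-iteration sketch under the definitions above, omitting the learned query/key/value projections of a full MHSA; the tensor shapes and the grid-based initial sampling of super tokens are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def super_token_attention(X: torch.Tensor, S: torch.Tensor) -> torch.Tensor:
    """X: (N, C) flattened token features; S: (M, C) initial super tokens, M << N."""
    d = X.shape[-1]
    # Stage 1 (STS): association map Q between tokens and super tokens (Eq. above)
    Q = F.softmax(X @ S.T / d ** 0.5, dim=-1)            # (N, M)
    S = (Q.T @ X) / (Q.sum(dim=0).unsqueeze(-1) + 1e-6)  # association-weighted averages
    # Stage 2 (MHSA, single head here): self-attention among the few super tokens
    A = F.softmax(S @ S.T / d ** 0.5, dim=-1)            # (M, M)
    S = A @ S
    # Stage 3 (TU): map the refined super tokens back to the token space
    return Q @ S                                          # (N, C)

X = torch.randn(4096, 64)                        # a 64 x 64 feature map, flattened
print(super_token_attention(X, X[::256]).shape)  # torch.Size([4096, 64])
```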
2.2.4. Multi-SEAM
Multi-SEAM is an enhanced version of the Spatially Enhanced Attention Module (SEAM), specifically designed to address feature loss and alignment errors caused by occlusion, complex backgrounds, and environmental variations. Traditional object detection models struggle with performance degradation in scenarios involving dense targets, occlusions, and dynamic backgrounds. In contrast, Multi-SEAM effectively mitigates these issues through an innovative architecture. Its core mechanism integrates multi-head attention with depthwise separable convolutions, enhancing channel-specific feature extraction while reducing computational complexity. This design significantly improves the network’s expressiveness and robustness (Figure 5).
The key innovations of Multi-SEAM include the following:
a. Depthwise Separable Convolution: Multi-SEAM reduces computation by splitting standard convolution into channel-wise and pixel-wise convolutions, preserving feature details while minimizing redundancy for a lightweight model.
b. Pointwise Convolution: The 1 × 1 convolution bridges depthwise separable convolutions, facilitating efficient information exchange and improving object recognition accuracy by combining feature layers.
c. Fully Connected Layer: The fully connected layer aggregates channel information, enhancing the recognition of complex or occluded features and maintaining high accuracy despite occlusion.
d. Exponential Normalization: Exponential normalization at the output layer enhances adaptability to varying scales and positions, improving stability and robustness in dynamic environments.
e. Multi-Scale Feature Learning: Multi-SEAM focuses on key areas and reduces background noise through multi-scale learning, enhancing the ability to distinguish occluded from non-occluded targets and improving performance in dense environments.
These innovations allow Multi-SEAM to significantly enhance performance in complex and occluded environments, particularly for soybean pod detection. In field conditions, the dense distribution and occlusion of pods often lead to feature loss, complicating detection. Multi-SEAM effectively isolates target features from complex backgrounds, improving accuracy and robustness through precise feature capture and integration.
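As a rough illustration of how components (a)–(d) fit together, consider the following PyTorch sketch of a single branch; Multi-SEAM additionally runs such branches at several patch scales and fuses their outputs (e). The channel counts and the FC reduction ratio are assumptions, not values from the cited module.

```python
import torch
import torch.nn as nn

class SEAMBranch(nn.Module):
    """One SEAM-style branch: depthwise + pointwise convolution, channel
    aggregation through a small fully connected layer, and an exponential
    re-weighting of the input features."""
    def __init__(self, ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)  # per-channel conv (a)
        self.pointwise = nn.Conv2d(ch, ch, 1)                        # cross-channel mixing (b)
        self.fc = nn.Sequential(nn.Linear(ch, ch // 4), nn.ReLU(),
                                nn.Linear(ch // 4, ch))              # channel aggregation (c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pointwise(self.depthwise(x))
        w = self.fc(y.mean(dim=(2, 3)))       # global average pool -> FC
        w = torch.exp(w)[:, :, None, None]    # exponential normalization (d)
        return x * w                          # occlusion-aware re-weighting

x = torch.randn(2, 64, 40, 40)
print(SEAMBranch(64)(x).shape)  # torch.Size([2, 64, 40, 40])
```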
2.2.5. Inner-IoU
Intra-class occlusion, where one pod overlaps with another, often increases the false detection rate. To address this, Inner-IoU introduces a ‘repulsive force’ mechanism that reduces occlusion interference. The Inner-IoU loss function integrates auxiliary bounding boxes of varying scales, which are dynamically adjusted based on IoU values. This scale adjustment accelerates the regression process and improves detection accuracy. Moreover, the adaptability of Inner-IoU across different tasks and architectures ensures strong generalization, providing stable performance in dense or occlusion-prone environments, and significantly enhancing detection precision.
To address the limitations of traditional IoU loss, such as poor generalization and slow convergence across various detection tasks, Zhang et al. proposed a strategy that incorporates auxiliary bounding boxes to calculate the loss and accelerate regression [31]. In Inner-IoU, a scaling factor adjusts the size of the auxiliary boxes, allowing it to adapt to different datasets and detectors. As shown in Figure 6, the ground truth (GT) and anchor boxes are denoted as A and B, respectively. The “scale” variable refers to this scaling factor, which is typically in the range of [0.5, 1.5]. By fine-tuning this factor, the regression process is optimized, improving the model’s adaptability and accuracy. While Inner-IoU retains some characteristics of traditional IoU loss, it introduces unique features, with the relevant formulas provided in Equations (5)–(7):
$$\mathrm{inter} = \left(\min\left(b_r^{gt}, b_r\right) - \max\left(b_l^{gt}, b_l\right)\right)\cdot\left(\min\left(b_b^{gt}, b_b\right) - \max\left(b_t^{gt}, b_t\right)\right) \quad (5)$$
$$\mathrm{union} = w^{gt} h^{gt} (\mathrm{ratio})^{2} + w h (\mathrm{ratio})^{2} - \mathrm{inter} \quad (6)$$
$$\mathrm{IoU}^{inner} = \frac{\mathrm{inter}}{\mathrm{union}} \quad (7)$$

where $b_l^{gt}$, $b_r^{gt}$, $b_t^{gt}$, and $b_b^{gt}$ ($b_l$, $b_r$, $b_t$, and $b_b$) are the edges of the inner GT (inner anchor) box, obtained by scaling the corresponding box around its center by the factor “ratio”. The center points of the GT box and its internal GT box are denoted as $(x_c^{gt}, y_c^{gt})$, while the center points of the anchor box and its internal anchor box are denoted as $(x_c, y_c)$. The width and height of the GT box are represented by $w^{gt}$ and $h^{gt}$, while the width and height of the anchor box are denoted by $w$ and $h$.
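A direct translation of Equations (5)–(7) into code might look as follows; the (xc, yc, w, h) box encoding and the small epsilon for numerical stability are assumptions.

```python
import torch

def inner_iou(box_gt: torch.Tensor, box_pr: torch.Tensor, ratio: float = 1.0) -> torch.Tensor:
    """Boxes given as (xc, yc, w, h). Both boxes are rescaled around their
    centers by `ratio` (typically in [0.5, 1.5]) before the IoU is computed."""
    def edges(b):
        xc, yc, w, h = b.unbind(-1)
        return (xc - w * ratio / 2, xc + w * ratio / 2,
                yc - h * ratio / 2, yc + h * ratio / 2)

    l1, r1, t1, b1 = edges(box_gt)
    l2, r2, t2, b2 = edges(box_pr)
    inter = (torch.min(r1, r2) - torch.max(l1, l2)).clamp(min=0) * \
            (torch.min(b1, b2) - torch.max(t1, t2)).clamp(min=0)          # Eq. (5)
    union = (box_gt[..., 2] * box_gt[..., 3]) * ratio ** 2 + \
            (box_pr[..., 2] * box_pr[..., 3]) * ratio ** 2 - inter        # Eq. (6)
    return inter / (union + 1e-7)                                         # Eq. (7)

gt = torch.tensor([50.0, 50.0, 20.0, 30.0])
pr = torch.tensor([55.0, 52.0, 22.0, 28.0])
print(inner_iou(gt, pr, ratio=0.8))
```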
2.3. Iterative Self-Learning
Recent advancements in deep learning have led to significant breakthroughs in object detection [37,38,39,40]. However, most state-of-the-art visual models still rely on supervised learning, which requires a large number of labeled images, limiting their application to unlabeled data. As a result, the potential benefits of using unlabeled images to enhance model accuracy and robustness have not been fully realized. Recent studies demonstrate that incorporating unlabeled images can significantly improve the accuracy of state-of-the-art models, with practical benefits [41]. Fully utilizing unlabeled images not only expands the training dataset, but also enhances the model’s adaptability in complex or unknown environments, all while maintaining or even improving accuracy. This development further accelerates the application of deep learning in key point detection.
To fully utilize unlabeled image data, this study employs a semi-supervised learning method: iterative self-training. The core process is as follows: First, a preliminary teacher model is trained using labeled images. The teacher model then generates pseudo-labels for the unlabeled images, which are combined with the labeled images to train the student model. In each iteration, the updated student model becomes the new teacher model, which generates updated pseudo-labels for the unlabeled images. The student model is then retrained using these updated pseudo-labels. Algorithm 1 outlines the detailed implementation of this process. The experimental results demonstrate that the iterative self-training method significantly enhances the model’s accuracy and improves its adaptability and robustness to different data distributions, thereby boosting its performance in practical applications.
Algorithm 1: Iterative self-learning framework
Input: Labeled data and unlabeled data.
Output: Model with pretraining weights.
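The loop described above can be summarized in the following Python sketch; `train_fn` and `predict_fn` are hypothetical hooks standing in for the detector’s training and inference routines, and the confidence threshold used to filter pseudo-labels is an assumed value.

```python
def iterative_self_training(labeled, unlabeled, train_fn, predict_fn,
                            rounds=3, conf_thresh=0.5):
    """Iterative self-training (Algorithm 1 sketch).

    train_fn(samples) -> model and predict_fn(model, image) -> [(label, confidence)]
    are supplied by the surrounding detection framework.
    """
    teacher = train_fn(labeled)                     # 1) initial teacher on labeled data
    for _ in range(rounds):
        pseudo = []                                 # 2) pseudo-label the unlabeled pool
        for image in unlabeled:
            preds = [p for p, c in predict_fn(teacher, image) if c >= conf_thresh]
            if preds:                               # keep only confident pseudo-labels
                pseudo.append((image, preds))
        student = train_fn(list(labeled) + pseudo)  # 3) train student on both sets
        teacher = student                           # 4) student becomes the new teacher
    return teacher                                  # model providing pretraining weights
```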
2.4. Evaluation Standard
To validate the performance of SmartPod, this study assesses the proposed model from two perspectives: performance and complexity. Performance metrics encompass precision (P), recall (R), Average Precision (AP), and F1 score, which evaluate accuracy across various levels. Complexity is gauged by the parameter count and Giga Floating-point Operations Per Second (GFLOPs), reflecting model complexity and computational demands. Integrating these assessments offers insights into the efficiency and accuracy of the deep learning model for soybean pod detection. The formulas are presented in Equations (8)–(10), where $I$ is the input size, $K$ is the size of the convolution kernel, $O$ is the output size, $M$ is the size of the output feature map, $C_{\mathrm{in}}$ is the input channel, $k$ is the kernel size, $s$ is the stride, and $C_{\mathrm{out}}$ is the output channel.
The mean absolute error (MAE), root mean squared error (RMSE), and the correlation coefficient (R) were used as the evaluation metrics to assess the counting performance. They take the following forms:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2}}$$
$$R = \frac{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^{2}}\,\sqrt{\sum_{i=1}^{N}\left(\hat{y}_i - \bar{\hat{y}}\right)^{2}}}$$

where $N$ denotes the number of test images, $y_i$ is the ground truth count for the $i$th image, $\hat{y}_i$ is the inferred count for the $i$th image, $\bar{y}$ is the mean of $y_i$, and $\bar{\hat{y}}$ is the mean of $\hat{y}_i$.
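These three metrics can be computed in a few lines of NumPy, assuming arrays of per-image counts; the sample values below are illustrative only.

```python
import numpy as np

def counting_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """MAE, RMSE, and Pearson correlation R between true and inferred counts."""
    err = y_pred - y_true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return mae, rmse, r

y_true = np.array([42, 57, 31, 68, 50])
y_pred = np.array([40, 60, 30, 65, 52])
print(counting_metrics(y_true, y_pred))
```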
2.5. Experimental Platform
This study presents experimental results conducted using the PyTorch (1.11.0) deep learning framework and the Python (3.8) programming language. The experiments were performed on a machine running the Windows 10 64-bit operating system, equipped with a 12 vCPU Intel(R) Xeon(R) Gold 5320 CPU @ 2.20 GHz, an RTX A4000 GPU (16 GB), and 32 GB of memory. The model’s input image size was set to 640 × 640 pixels, with a batch size of 8 for training images. Training was carried out for 100 epochs, employing a learning rate of 0.01, momentum of 0.937, and weight decay coefficient of 0.0005. The SGD (stochastic gradient descent) optimization strategy was utilized during training.
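For reference, these hyperparameters map onto a YOLOv8-style training call as follows; this is a sketch assuming the Ultralytics training interface, and the architecture and dataset YAML file names are hypothetical.

```python
from ultralytics import YOLO

# "beanpodnet.yaml" and "soybean_pods.yaml" are placeholder names for the
# modified architecture definition and the 7:2:1 dataset split configuration.
model = YOLO("beanpodnet.yaml")
model.train(
    data="soybean_pods.yaml",
    imgsz=640,            # input image size 640 x 640
    batch=8,              # batch size
    epochs=100,           # training epochs
    optimizer="SGD",      # stochastic gradient descent
    lr0=0.01,             # initial learning rate
    momentum=0.937,       # SGD momentum
    weight_decay=0.0005,  # weight decay coefficient
)
```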
4. Discussion
4.1. Research Significance
Pod count is a key phenotypic trait of soybean. Accurate pod counting not only enables precise yield estimation, but also supports the identification of high-yielding and stress-resistant soybean varieties. Traditional methods are prone to inaccuracies due to factors like occlusion, background similarity, and dense distributions in complex field environments. These limitations often result in labor-intensive processes with high error rates, making them unsuitable for large-scale agricultural applications. Deep learning models such as YOLO-based approaches offer significant breakthroughs for these issues [42]. These models leverage advanced feature extraction and pattern recognition capabilities to handle complex visual tasks [43]. This study proposes an effective solution to the challenges of automatic pod detection and counting in the field by combining deep learning techniques with semi-supervised learning strategies [44]. It not only improves detection accuracy, but also demonstrates strong generalization capabilities across diverse field conditions. The method proposed in this study holds significant potential for extension to a broader range of research applications, such as evaluating the dynamic changes in both the quantity and quality of soybean seeds produced under shading conditions initiated at different crop phenological stages [45], as well as quantifying the effects of foliar boron application at various growth stages on the yield components of soybeans [46]. As research on diverse crop varieties advances and more comprehensive datasets are accumulated, this method is expected to play a pivotal role in driving the intelligent and automated development of agricultural technologies, thereby enhancing precision and efficiency in crop management and production.
4.2. Benefits of BeanpodNet (BN)
The BN model offers significant improvements by combining multiple innovative technologies, addressing key challenges in soybean pod detection. These advancements not only enhance detection accuracy, but also improve computational efficiency, making BN a practical solution for real-world agricultural applications. The SViT enhances feature representation by mitigating interference from complex backgrounds, making it highly robust in agricultural settings with variable environments. By leveraging the self-attention mechanism, SViT effectively captures long-range dependencies in the image, enabling the model to distinguish soybean pods from visually similar background elements such as leaves and soil. Furthermore, the Multi-SEAM improves detection accuracy, even in cases of occlusion, by focusing on targets across multiple spatial scales, which is critical for dense or overlapping soybean pods in the field. Lastly, the Inner-IoU strategy optimizes bounding box regression, refining the IoU calculation and boosting detection precision, particularly in dense target scenarios. In addition, SmartPod achieves a processing speed of 0.1 ms for preprocessing and 2.8 ms for postprocessing per image on a single RTX 2080 Ti (11 GB) GPU, which meets real-time processing requirements. The integration of SViT, Multi-SEAM, and Inner-IoU not only addresses the limitations of existing methods, but also sets a new benchmark for agricultural object detection. Future work could explore the deployment of BN on edge devices for real-time field applications, as well as its extension to other crops and phenotypic traits, further enhancing its impact on precision agriculture.
4.3. Advantages of Iterative Self-Training
This study explores the innovative use of iterative self-training within a semi-supervised learning framework, aiming to lay a solid pretraining foundation for future network modeling [47]. In deep learning, acquiring sufficient labeled data is often time-consuming and complex [48]. This approach not only reduces the annotation burden, but also improves model performance by incorporating diverse and representative samples from the unlabeled data pool. The main benefits of iterative self-training include its scalability, wide applicability, and flexibility. It performs well across both large-scale and small-scale datasets, regardless of data augmentation intensity [49]. For instance, in scenarios with limited labeled data, iterative self-training can generate high-quality pseudo-labels to supplement the training process, enabling the model to achieve competitive performance even with minimal supervision. Moreover, it is not confined to specific models or datasets, showcasing its versatility across various application scenarios. The results confirm that iterative self-training is a simple yet powerful algorithm that extracts valuable information from pseudo-labels, offering new insights for future network modeling. Future research could explore the integration of active learning strategies with iterative self-training to further optimize the selection of pseudo-labels and improve model performance.
4.4. Limitations and Future Work
Soybean pod detection and counting face significant challenges in complex field environments, including intricate backgrounds, dense distributions, and target occlusion [50]. This study proposes SmartPod, a novel detection method with remarkable effectiveness. Future research could explore integrating SmartPod with advanced technologies, such as attention mechanisms or graph neural networks (GNNs), to improve accuracy in dense and occluded environments. Additionally, SmartPod can be integrated with drone-based remote sensing systems for real-time pod detection across large fields, combined with multispectral or thermal imaging to enhance robustness in complex conditions. Coupled with IoT devices, SmartPod enables automated field data collection and analysis, supporting precision agriculture. Beyond soybean pod detection, SmartPod offers technical support for similar phenotyping tasks in crops like wheat and maize, driving intelligent and automated agricultural research. Furthermore, integrating deep learning with 3D imaging enables high-fidelity 3D reconstructions of plant morphology, advancing phenotyping solutions [51,52,53].
4.5. Integration into Field Management
Integrating SmartPod into daily field management practices offers significant practical benefits for farmers and agricultural managers. In addition to being deployed on phenotyping platforms, SmartPod can be implemented on portable devices such as smartphones or drones, enabling farmers to perform real-time pod detection and counting during routine field inspections. This capability allows for rapid yield potential assessment and data-driven decision making. Furthermore, SmartPod can be integrated with farm management software to automate data recording and analysis, providing farmers with actionable long-term insights. This not only reduces the labor intensity of manual pod counting, but also improves yield estimation accuracy, supporting more efficient resource allocation and increased profitability. Additionally, SmartPod’s adaptability to variable field conditions makes it a reliable tool for large-scale farms, where continuous monitoring is critical for sustainable agricultural practices. By bridging the gap between advanced technology and practical field applications, SmartPod has the potential to transform daily farm management and promote the widespread adoption of precision agriculture. Finally, SmartPod offers broad applicability, effectively supporting research involving soybean pod quantity assessment by providing high-precision pod counting results, thereby enhancing the overall efficiency, accuracy, and reproducibility of such studies.
5. Conclusions
This study presents SmartPod, a deep learning-based framework for accurate soybean pod detection in field phenotyping, addressing challenges such as occlusion, dense distributions, and background interference. By integrating SViT for enhanced feature representation, Multi-SEAM for overlapping pod detection, Inner-IoU for efficient bounding box regression, and an iterative self-training strategy, SmartPod achieves state-of-the-art performance with 94.1% AP@0.5 and a Pearson correlation of 0.9792, surpassing existing methods by 1.7–4.6%. The semi-supervised learning approach significantly reduces the dependency on labeled data, enhancing its practicality. However, a limitation of this study is the current requirement for imaging soybean plants in a laid-flat state. Future work will focus on implementing six-degrees-of-freedom imaging to enable pod detection and counting in their natural upright state.