1. Introduction
Soybeans, a globally vital crop, serve as a crucial source of protein and vegetable oil while finding extensive applications in animal feed, industrial materials, and bioenergy [1]. In China, improving soybean production is critically important, especially amid global food market volatility and trade uncertainties, making it a national agricultural priority [2]. Phenotypic measurement technologies are essential for soybean breeding, enabling the precise analysis of gene–trait relationships and facilitating genetic improvement and cultivar optimization [3].
Pod count is a key yield-related trait essential for yield prediction, germplasm screening, and breeding strategies. However, accurate counting in field conditions is challenging due to occlusion, dense pod distribution, and variable orientations. Manual counting is labor-intensive and prone to subjective errors, potentially impacting breeding decisions [4]. Therefore, the development of efficient automated measurement systems is crucial for improving breeding accuracy, minimizing errors, and advancing modern soybean breeding practices.
Recent advancements in smart agriculture have witnessed the widespread adoption of deep learning, particularly object detection techniques, in soybean phenotyping [5,6,7]. Unlike traditional machine learning approaches, deep learning excels in analyzing large-scale image data, enabling precise object recognition in complex field environments while significantly improving detection efficiency and reducing human intervention. In automated pod counting systems, the performance of object detection algorithms is pivotal in determining measurement accuracy and operational efficiency.
Current deep learning-based detection algorithms fall into two main categories: region-based and regression-based approaches. Region-based techniques typically employ a two-stage process, first generating region proposals and then performing classification and regression. A notable example is Faster R-CNN [8], which utilizes a Region Proposal Network (RPN) to automatically generate candidate regions, enhancing both detection accuracy and computational efficiency. In contrast, regression-based methods follow a single-stage architecture, directly predicting object locations and classifications. These approaches offer superior computational efficiency, making them particularly suited for real-time applications. Among them, the YOLO (You Only Look Once) series [9,10,11,12,13,14] has emerged as a leading framework in soybean phenotyping due to its exceptional balance between detection accuracy and processing speed.
Recent advancements have significantly improved pod detection. Xiang et al. developed YOLO POD, an extension of the YOLOX framework, incorporating a dedicated pod counting module, modified loss functions for multi-task learning, and an attention mechanism, achieving notable improvements in both accuracy and efficiency [15]. Li et al. enhanced YOLOv5 by integrating the FasterNet Block and EMA module, along with the CIoU loss function, to boost recognition performance and generalization [16]. Yu et al. introduced PodNet, a lightweight network with an efficient encoder–decoder structure, effectively mitigating information loss and gradient degradation across non-adjacent layers [17]. Li et al. proposed SoybeanNet, a Transformer-based point-counting network, achieving high-precision soybean pod counting and localization using UAV images [18]. Fu et al. developed RefinePod, an enhanced instance segmentation network, delivering high-throughput and high-quality pod segmentation and precise seed-per-pod estimation [19]. Mathew et al. proposed a novel method utilizing a depth camera for real-time RGB image filtering combined with YOLOv7 for pod detection, offering an efficient and non-destructive approach for soybean yield prediction [20]. Li et al. proposed a Soybean Pod Counting Network (SPCN) based on Hybrid Dilated Convolution and attention mechanisms, demonstrating its potential for intelligent breeding and precise agricultural management [21]. However, despite these advancements, the visual similarity between mature soybean plants and soil backgrounds, combined with dense pod arrangements and frequent occlusions, continues to challenge detection robustness and accuracy. Existing approaches often struggle to balance computational efficiency and detection precision, leading to persistent issues with false positives and missed detections [22].
Thus, the development of novel detection algorithms remains crucial for overcoming these challenges, not only offering more reliable tools for soybean breeding, but also establishing a solid foundation for advancing agricultural intelligence and precision farming.
2. Materials and Methods
Soybean pod detection and counting in this experiment were performed using the SmartPod method. The experimental process is illustrated in Figure 1.
Step a: Pretrain the BeanpodNet (BN) network model using the iterative self-training framework (Figure 1a). In this phase, pseudo-labels are generated through semi-supervised learning methods, significantly reducing the reliance on manually labeled data. The pretraining process leverages unlabeled data to enhance the model’s generalization capability, enabling the network to adapt effectively to diverse scenarios.
Step b: Construct the BeanpodNet (BN) network (Figure 1b) as a subnetwork for self-training by integrating multiple advanced techniques to achieve high-accuracy pod detection. This step aims to further enhance the network’s perception of objects, particularly improving its performance in complex field environments.
Step c: Combine the weights obtained from pretraining in Step a with the optimized network architecture from Step b for end-to-end training (Figure 1c). During this phase, the network gradually optimizes its parameters, enhancing BN’s ability to accurately identify pods while effectively reducing false positives and missed detections. Through comprehensive training, the network not only adapts to various pod appearance features, but also improves its detection capability for occluded and background-complex targets.
Step d: After the model training is complete, use the best-performing network to perform pod detection and counting on the test set (Figure 1d). This phase primarily evaluates the model’s practical performance, ensuring its accuracy and robustness in real field conditions. By validating the test set data, the network is further adjusted and optimized to achieve optimal performance in real-world applications.
2.1. Data and Preprocessing
This study used the salt-tolerant soybean varieties “Qihuang 34” and “Qihuang 641”, as well as their derivatives, as experimental subjects. The experiment was conducted at the National Saline-Alkali Land Comprehensive Utilization and Technology Innovation Center in Guangrao, Dongying, Shandong, located in a mildly saline-alkali area (longitude: 118.49° E–118.65° E, latitude: 37.31° N–37.83° N). The soil pH was 8.62, with slight fluctuations depending on precipitation. The soybean planting density was controlled within 50,000 plants per hectare, with two seeds per hole and row and plant spacings of 50 cm and 13 cm, respectively. The total area of each plot was 0.01 hectares. A three-replicate experimental design was implemented to ensure data reliability, with each variety planted in separate plots.
Before imaging, the plants were carefully cut at the base to avoid damage and laid flat on the ground beneath the phenotyping platform to ensure consistent imaging conditions as much as possible. This process was designed to minimize stress on the plants and maintain the integrity of the phenotypic traits during data collection. The dataset was automatically captured by the TraitDiscover high-throughput phenotyping platform (Figure 2a) on 10 October 2024 between 1:00 and 2:00 a.m. The platform features a truss-style mobile scanning platform with three-axis automated motion capabilities and is equipped with a visible light imaging unit (Trait-RGB), with key parameters listed in Table 1. During image acquisition, soybean plants were laid flat on the ground, with the camera positioned vertically above to prevent image distortion. The dataset comprised 1500 images, which were split into training, validation, and test sets in a ratio of 7:2:1. The model was trained on the training set and evaluated on the validation and test sets. Finally, the key points of mature soybean pods in the images were annotated using the LabelMe tool (Figure 2b).
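A fixed random seed makes such a 7:2:1 partition reproducible. The following is a minimal sketch; the directory name, file extension, and seed are illustrative assumptions rather than details of the actual pipeline.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0):
    """Split images into train/val/test subsets in a 7:2:1 ratio."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return {
        "train": paths[:n_train],
        "val": paths[n_train:n_train + n_val],
        "test": paths[n_train + n_val:],
    }

splits = split_dataset("soybean_pods/images")
print({k: len(v) for k, v in splits.items()})  # e.g. {'train': 1050, 'val': 300, 'test': 150}
```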
2.2. BeanpodNet (BN)
The YOLOv8 [11] backbone uses the efficient lightweight CSPDarknet (Cross-Stage Partial Darknet) with cross-stage connections to address the vanishing gradient issue, improving computational efficiency and detection accuracy. As one of the most advanced versions in the YOLO series, YOLOv8 boasts a lightweight design, high efficiency, and enhanced feature extraction and object detection capabilities. However, when applied to soybean pod detection, YOLOv8 still faces several limitations: (1) difficulty distinguishing object features in complex backgrounds, (2) limited capability of separating occluded objects, and (3) insufficient fine-grained feature capture when processing dense targets. To address the challenges mentioned above, this study proposes BN, a soybean pod detection model based on YOLOv8:
A Super Token Sampling Vision Transformer (SViT) [23] is incorporated to strengthen global contextual feature extraction, mitigating the impact of complex backgrounds on detection performance. SViT has demonstrated exceptional effectiveness in object detection applications [24,25,26].
A Multi-Scale Spatial Enhancement Attention Module (MultiSEAM) [27] is introduced to enhance occluded object separation, improving the robustness of recognition in occlusion scenarios. The effectiveness of MultiSEAM has been validated in numerous studies [28,29,30].
An Inner Intersection over Union (Inner-IoU) loss [31], an optimized bounding box regression strategy, is employed to enhance both detection accuracy and processing speed, improving bounding box matching for dense targets.
A semi-supervised iterative self-training strategy [32] is integrated, leveraging pseudo-labels from unlabeled data to improve detector generalization, thereby reducing dependence on labeled data.
BN was introduced to the field of pod detection to address challenges such as high-density targets, occlusion, and the similarity between background and object features. As shown in Figure 3, the BN detection architecture consists of the backbone, neck, and head networks. The backbone uses CSPDarknet as the feature extraction network, with the SViT module introduced at the P9 layer to optimize feature representation and enhance global context modeling, significantly improving pod detection performance. The neck network retains the SPP and PAN structures, further enhancing feature fusion and information propagation. To improve object localization, features from the P2 layer are fused into PAN, which enhances the detection accuracy of small and occluded objects. To further mitigate the impact of occlusion, the MultiSEAM attention mechanism is incorporated into the neck network. By integrating multi-scale features, this module significantly strengthens the model’s ability to detect occluded soybean pods, effectively reducing both false positives and missed detections. Through these innovations, BN excels in pod detection tasks, ensuring higher accuracy and robustness, particularly in scenarios involving dense vegetation, occlusion, and complex field environments.
2.2.1. CNN
CNNs (Convolutional Neural Networks) have achieved remarkable success in deep learning due to their ability to automatically extract hierarchical features and recognize complex patterns. Building on the deep learning resurgence sparked by Hinton et al. in 2006 [33], CNNs combine convolutional, pooling, and fully connected layers to enhance feature learning and classification performance. With the advancement of big data and hardware, CNNs have expanded, notably after AlexNet’s victory in the 2012 ImageNet competition, which demonstrated their superiority in large-scale image recognition tasks [34]. Modern CNNs improve computational efficiency through local receptive fields, mitigate complexity and overfitting via weight sharing, and employ pooling layers to extract high-level representations. The hybrid network used here integrates 1 × 1, 3 × 3, and 5 × 5 convolutional kernels, enabling multi-scale feature extraction to enhance model expressiveness and adaptability, as sketched below. This design optimizes feature representation, allowing for more precise soybean pod detection in complex field environments, ultimately improving both accuracy and robustness.
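As an illustration only, a parallel multi-branch block of this kind can be written in PyTorch as follows; the even channel split across the three branches is an assumption, not the exact configuration used in BN.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions whose outputs are concatenated,
    extracting features at several receptive-field sizes from one input."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch_ch = out_ch // 3
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, out_ch - 2 * branch_ch, kernel_size=5, padding=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

x = torch.randn(1, 64, 80, 80)
print(MultiScaleConv(64, 96)(x).shape)  # torch.Size([1, 96, 80, 80])
```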
2.2.2. C2f
C2f (CSP bottleneck with two convolutions) is an enhanced version of the C3 module (Cross-Stage Partial with three convolutions), designed to improve residual learning efficiency while incorporating the Efficient Layer Aggregation Network (ELAN) structure from YOLOv7 [35]. By combining standard convolutions with bottleneck modules, C2f optimizes gradient propagation, effectively mitigating the vanishing gradient problem while maintaining a lightweight architecture. It extracts multi-scale features through a structured combination of convolutions and C2f modules, ensuring effective feature fusion across different layers. The C2f structure includes two key components: cross-stage connections, which establish links between shallow and deep feature maps to facilitate better information flow, and partial networks, which enhance feature refinement across layers using convolution and up-sampling operations. This design not only strengthens feature representation, but also improves adaptability to complex backgrounds. The C2f module plays a crucial role in object detection, particularly in handling targets of varying scales, significantly enhancing both detection accuracy and robustness.
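A simplified PyTorch sketch of a C2f-style block is given below, assuming the usual split–bottleneck–concatenate layout; normalization/activation details and channel widths are illustrative rather than taken from the BN implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck: two 3x3 convs with an optional shortcut."""
    def __init__(self, ch: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.SiLU())
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """CSP-style block: split features, pass one half through n bottlenecks,
    keep every intermediate output, then fuse all branches with a 1x1 conv."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 2):
        super().__init__()
        self.hidden = out_ch // 2
        self.cv1 = nn.Conv2d(in_ch, 2 * self.hidden, 1)
        self.cv2 = nn.Conv2d((2 + n) * self.hidden, out_ch, 1)
        self.blocks = nn.ModuleList(Bottleneck(self.hidden) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))  # cross-stage split
        for b in self.blocks:
            y.append(b(y[-1]))                 # aggregate every stage output
        return self.cv2(torch.cat(y, dim=1))

x = torch.randn(1, 64, 80, 80)
print(C2f(64, 128, n=2)(x).shape)  # torch.Size([1, 128, 80, 80])
```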
2.2.3. SViT
The Vision Transformer (ViT) has demonstrated remarkable success in global context modeling [36], but its effectiveness in capturing fine-grained local features is often hindered by redundant information. To mitigate this limitation, techniques such as local self-attention and early-stage convolutions are commonly employed. However, these methods can compromise long-range dependency modeling, limiting ViT’s overall performance. SViT addresses this issue by drawing inspiration from superpixel design, reducing the number of image primitives to optimize information processing. It introduces the concept of super tokens, which condense the number of tokens involved in self-attention while preserving essential global context. The Super Token Attention (STA) mechanism operates in three stages: first, super tokens are generated through sparse relational learning; second, self-attention is applied to these tokens; and finally, the processed information is mapped back to the original token space. By decomposing global self-attention into sparse relational graphs and low-dimensional attention mechanisms, STA enhances computational efficiency while significantly reducing complexity. Leveraging STA, SViT not only strengthens local feature representation, but also optimizes global context modeling, making it particularly effective for visual tasks.
This design enhances the model’s feature extraction efficiency by optimizing the image processing pipeline. Given an input image, it first undergoes initial processing through a three-stage stem structure. The image then enters the fourth-stage layer, which integrates a convolutional module and a Super Token Transformer (STT) module (Figure 4). The STT module comprises three key components: Convolutional Position Embedding (CPE), Super Token Attention (STA), and a Convolutional Feed-Forward Network (ConvFFN), defined as follows:

$$X = \mathrm{CPE}(X_{\mathrm{in}}) + X_{\mathrm{in}},$$
$$Y = \mathrm{STA}(\mathrm{LN}(X)) + X,$$
$$Z = \mathrm{ConvFFN}(\mathrm{BN}(Y)) + Y,$$

where $X_{\mathrm{in}}$ is the input feature map, $\mathrm{LN}(\cdot)$ denotes layer normalization, and $\mathrm{BN}(\cdot)$ denotes batch normalization.
The STA module comprises three key processes: Super Token Sampling (STS), Multi-Head Self-Attention (MHSA), and Token Up-Sampling (TU). To establish associations between tokens and super tokens, an attention-based method constructs the association graph, defined as follows:

$$Q = \mathrm{Softmax}\!\left(\frac{X S^{\top}}{\sqrt{d}}\right),$$

where $X$ denotes the token features, $S$ denotes the super token features, and $d$ is the channel number $C$.
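To make the three STA stages concrete, the following is a single-head, single-iteration sketch under the definitions above, omitting the learned query/key/value projections of a full MHSA; the tensor shapes and the grid-based initial sampling of super tokens are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def super_token_attention(X: torch.Tensor, S: torch.Tensor) -> torch.Tensor:
    """X: (N, C) flattened token features; S: (M, C) initial super tokens, M << N."""
    d = X.shape[-1]
    # Stage 1 (STS): association map Q between tokens and super tokens (Eq. above)
    Q = F.softmax(X @ S.T / d ** 0.5, dim=-1)            # (N, M)
    S = (Q.T @ X) / (Q.sum(dim=0).unsqueeze(-1) + 1e-6)  # association-weighted averages
    # Stage 2 (MHSA, single head here): self-attention among the few super tokens
    A = F.softmax(S @ S.T / d ** 0.5, dim=-1)            # (M, M)
    S = A @ S
    # Stage 3 (TU): map the refined super tokens back to the token space
    return Q @ S                                          # (N, C)

X = torch.randn(4096, 64)                        # a 64 x 64 feature map, flattened
print(super_token_attention(X, X[::256]).shape)  # torch.Size([4096, 64])
```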
2.2.4. Multi-SEAM
Multi-SEAM is an enhanced version of the Spatially Enhanced Attention Module (SEAM), specifically designed to address feature loss and alignment errors caused by occlusion, complex backgrounds, and environmental variations. Traditional object detection models struggle with performance degradation in scenarios involving dense targets, occlusions, and dynamic backgrounds. In contrast, Multi-SEAM effectively mitigates these issues through an innovative architecture. Its core mechanism integrates multi-head attention with depthwise separable convolutions, enhancing channel-specific feature extraction while reducing computational complexity. This design significantly improves the network’s expressiveness and robustness (Figure 5).
The key innovations of Multi-SEAM include the following:
a. Depthwise Separable Convolution: Multi-SEAM reduces computation by splitting standard convolution into channel-wise and pixel-wise convolutions, preserving feature details while minimizing redundancy for a lightweight model.
b. Pointwise Convolution: The 1 × 1 convolution bridges depthwise separable convolutions, facilitating efficient information exchange and improving object recognition accuracy by combining feature layers.
c. Fully Connected Layer: The fully connected layer aggregates channel information, enhancing the recognition of complex or occluded features and maintaining high accuracy despite occlusion.
d. Exponential Normalization: Exponential normalization at the output layer enhances adaptability to varying scales and positions, improving stability and robustness in dynamic environments.
e. Multi-Scale Feature Learning: Multi-SEAM focuses on key areas and reduces background noise through multi-scale learning, enhancing the ability to distinguish occluded from non-occluded targets and improving performance in dense environments.
These innovations allow Multi-SEAM to significantly enhance performance in complex and occluded environments, particularly for soybean pod detection. In field conditions, the dense distribution and occlusion of pods often lead to feature loss, complicating detection. Multi-SEAM effectively isolates target features from complex backgrounds, improving accuracy and robustness through precise feature capture and integration.
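As a rough illustration of how components (a)–(d) fit together, consider the following PyTorch sketch of a single branch; Multi-SEAM additionally runs such branches at several patch scales and fuses their outputs (e). The channel counts and the FC reduction ratio are assumptions, not values from the cited module.

```python
import torch
import torch.nn as nn

class SEAMBranch(nn.Module):
    """One SEAM-style branch: depthwise + pointwise convolution, channel
    aggregation through a small fully connected layer, and an exponential
    re-weighting of the input features."""
    def __init__(self, ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)  # per-channel conv (a)
        self.pointwise = nn.Conv2d(ch, ch, 1)                        # cross-channel mixing (b)
        self.fc = nn.Sequential(nn.Linear(ch, ch // 4), nn.ReLU(),
                                nn.Linear(ch // 4, ch))              # channel aggregation (c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pointwise(self.depthwise(x))
        w = self.fc(y.mean(dim=(2, 3)))       # global average pool -> FC
        w = torch.exp(w)[:, :, None, None]    # exponential normalization (d)
        return x * w                          # occlusion-aware re-weighting

x = torch.randn(2, 64, 40, 40)
print(SEAMBranch(64)(x).shape)  # torch.Size([2, 64, 40, 40])
```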
2.2.5. Inner-IoU
Intra-class occlusion, where one pod overlaps with another, often increases the false detection rate. To address this, Inner-IoU introduces a ‘repulsive force’ mechanism that reduces occlusion interference. The Inner-IoU loss function integrates auxiliary bounding boxes of varying scales, which are dynamically adjusted based on IoU values. This scale adjustment accelerates the regression process and improves detection accuracy. Moreover, the adaptability of Inner-IoU across different tasks and architectures ensures strong generalization, providing stable performance in dense or occlusion-prone environments, and significantly enhancing detection precision.
To address the limitations of traditional IoU loss, such as poor generalization and slow convergence across various detection tasks, Zhang et al. proposed a strategy that incorporates auxiliary bounding boxes to calculate the loss and accelerate regression [31]. In Inner-IoU, a scaling factor adjusts the size of the auxiliary boxes, allowing it to adapt to different datasets and detectors. As shown in Figure 6, the ground truth (GT) and anchor boxes are denoted as A and B, respectively. The “scale” variable refers to this scaling factor, which is typically in the range of [0.5, 1.5]. By fine-tuning this factor, the regression process is optimized, improving the model’s adaptability and accuracy. While Inner-IoU retains some characteristics of traditional IoU loss, it introduces unique features, with the relevant formulas provided in Equations (5)–(7):
$$\mathrm{inter} = \left(\min\left(b_r^{gt}, b_r\right) - \max\left(b_l^{gt}, b_l\right)\right)\cdot\left(\min\left(b_b^{gt}, b_b\right) - \max\left(b_t^{gt}, b_t\right)\right) \quad (5)$$
$$\mathrm{union} = w^{gt} h^{gt} (\mathrm{ratio})^{2} + w h (\mathrm{ratio})^{2} - \mathrm{inter} \quad (6)$$
$$\mathrm{IoU}^{inner} = \frac{\mathrm{inter}}{\mathrm{union}} \quad (7)$$

where $b_l^{gt}$, $b_r^{gt}$, $b_t^{gt}$, and $b_b^{gt}$ ($b_l$, $b_r$, $b_t$, and $b_b$) are the edges of the inner GT (inner anchor) box, obtained by scaling the corresponding box around its center by the factor “ratio”. The center points of the GT box and its internal GT box are denoted as $(x_c^{gt}, y_c^{gt})$, while the center points of the anchor box and its internal anchor box are denoted as $(x_c, y_c)$. The width and height of the GT box are represented by $w^{gt}$ and $h^{gt}$, while the width and height of the anchor box are denoted by $w$ and $h$.
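A direct translation of Equations (5)–(7) into code might look as follows; the (xc, yc, w, h) box encoding and the small epsilon for numerical stability are assumptions.

```python
import torch

def inner_iou(box_gt: torch.Tensor, box_pr: torch.Tensor, ratio: float = 1.0) -> torch.Tensor:
    """Boxes given as (xc, yc, w, h). Both boxes are rescaled around their
    centers by `ratio` (typically in [0.5, 1.5]) before the IoU is computed."""
    def edges(b):
        xc, yc, w, h = b.unbind(-1)
        return (xc - w * ratio / 2, xc + w * ratio / 2,
                yc - h * ratio / 2, yc + h * ratio / 2)

    l1, r1, t1, b1 = edges(box_gt)
    l2, r2, t2, b2 = edges(box_pr)
    inter = (torch.min(r1, r2) - torch.max(l1, l2)).clamp(min=0) * \
            (torch.min(b1, b2) - torch.max(t1, t2)).clamp(min=0)          # Eq. (5)
    union = (box_gt[..., 2] * box_gt[..., 3]) * ratio ** 2 + \
            (box_pr[..., 2] * box_pr[..., 3]) * ratio ** 2 - inter        # Eq. (6)
    return inter / (union + 1e-7)                                         # Eq. (7)

gt = torch.tensor([50.0, 50.0, 20.0, 30.0])
pr = torch.tensor([55.0, 52.0, 22.0, 28.0])
print(inner_iou(gt, pr, ratio=0.8))
```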
2.3. Iterative Self-Learning
Recent advancements in deep learning have led to significant breakthroughs in object detection [37,38,39,40]. However, most state-of-the-art visual models still rely on supervised learning, which requires a large number of labeled images, limiting their application to unlabeled data. As a result, the potential benefits of using unlabeled images to enhance model accuracy and robustness have not been fully realized. Recent studies demonstrate that incorporating unlabeled images can significantly improve the accuracy of state-of-the-art models, with practical benefits [41]. Fully utilizing unlabeled images not only expands the training dataset, but also enhances the model’s adaptability in complex or unknown environments, all while maintaining or even improving accuracy. This development further accelerates the application of deep learning in key point detection.
To fully utilize unlabeled image data, this study employs a semi-supervised learning method: iterative self-training. The core process is as follows: First, a preliminary teacher model is trained using labeled images. The teacher model then generates pseudo-labels for the unlabeled images, which are combined with the labeled images to train the student model. In each iteration, the updated student model becomes the new teacher model, which generates updated pseudo-labels for the unlabeled images. The student model is then retrained using these updated pseudo-labels. Algorithm 1 outlines the detailed implementation of this process. The experimental results demonstrate that the iterative self-training method significantly enhances the model’s accuracy and improves its adaptability and robustness to different data distributions, thereby boosting its performance in practical applications.
Algorithm 1: Iterative self-learning framework
Input: Labeled data and unlabeled data.
Output: Model with pretraining weights.
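The loop described above can be summarized in the following Python sketch; `train_fn` and `predict_fn` are hypothetical hooks standing in for the detector’s training and inference routines, and the confidence threshold used to filter pseudo-labels is an assumed value.

```python
def iterative_self_training(labeled, unlabeled, train_fn, predict_fn,
                            rounds=3, conf_thresh=0.5):
    """Iterative self-training (Algorithm 1 sketch).

    train_fn(samples) -> model and predict_fn(model, image) -> [(label, confidence)]
    are supplied by the surrounding detection framework.
    """
    teacher = train_fn(labeled)                     # 1) initial teacher on labeled data
    for _ in range(rounds):
        pseudo = []                                 # 2) pseudo-label the unlabeled pool
        for image in unlabeled:
            preds = [p for p, c in predict_fn(teacher, image) if c >= conf_thresh]
            if preds:                               # keep only confident pseudo-labels
                pseudo.append((image, preds))
        student = train_fn(list(labeled) + pseudo)  # 3) train student on both sets
        teacher = student                           # 4) student becomes the new teacher
    return teacher                                  # model providing pretraining weights
```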
2.4. Evaluation Standard
To validate the performance of SmartPod, this study assesses the proposed model from two perspectives: performance and complexity. Performance metrics encompass precision (P), recall (R), Average Precision (AP), and F1 score, which evaluate accuracy across various levels. Complexity is gauged by the parameter count and Giga Floating-point Operations Per Second (GFLOPs), reflecting model complexity and computational demands. Integrating these assessments offers insights into the efficiency and accuracy of the deep learning model for soybean pod detection. The formulas are presented in Equations (8)–(10), where $I$ is the input size, $K$ is the size of the convolution kernel, $O$ is the output size, $M$ is the size of the output feature map, $C_{\mathrm{in}}$ is the input channel, $k$ is the kernel size, $s$ is the stride, and $C_{\mathrm{out}}$ is the output channel.
The mean absolute error (MAE), root mean squared error (RMSE), and the correlation coefficient (R) were used as the evaluation metrics to assess the counting performance. They take the following forms:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2}}$$
$$R = \frac{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^{2}}\,\sqrt{\sum_{i=1}^{N}\left(\hat{y}_i - \bar{\hat{y}}\right)^{2}}}$$

where $N$ denotes the number of test images, $y_i$ is the ground truth count for the $i$th image, $\hat{y}_i$ is the inferred count for the $i$th image, $\bar{y}$ is the mean of $y_i$, and $\bar{\hat{y}}$ is the mean of $\hat{y}_i$.
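These three metrics can be computed in a few lines of NumPy, assuming arrays of per-image counts; the sample values below are illustrative only.

```python
import numpy as np

def counting_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """MAE, RMSE, and Pearson correlation R between true and inferred counts."""
    err = y_pred - y_true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return mae, rmse, r

y_true = np.array([42, 57, 31, 68, 50])
y_pred = np.array([40, 60, 30, 65, 52])
print(counting_metrics(y_true, y_pred))
```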
2.5. Experimental Platform
This study presents experimental results conducted using the PyTorch (1.11.0) deep learning framework and the Python (3.8) programming language. The experiments were performed on a machine running the Windows 10 64-bit operating system, equipped with a 12 vCPU Intel(R) Xeon(R) Gold 5320 CPU @ 2.20 GHz, an RTX A4000 GPU (16 GB), and 32 GB of memory. The model’s input image size was set to 640 × 640 pixels, with a batch size of 8 for training images. Training was carried out for 100 epochs, employing a learning rate of 0.01, momentum of 0.937, and weight decay coefficient of 0.0005. The SGD (stochastic gradient descent) optimization strategy was utilized during training.
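For reference, these hyperparameters map onto a YOLOv8-style training call as follows; this is a sketch assuming the Ultralytics training interface, and the architecture and dataset YAML file names are hypothetical.

```python
from ultralytics import YOLO

# "beanpodnet.yaml" and "soybean_pods.yaml" are placeholder names for the
# modified architecture definition and the 7:2:1 dataset split configuration.
model = YOLO("beanpodnet.yaml")
model.train(
    data="soybean_pods.yaml",
    imgsz=640,            # input image size 640 x 640
    batch=8,              # batch size
    epochs=100,           # training epochs
    optimizer="SGD",      # stochastic gradient descent
    lr0=0.01,             # initial learning rate
    momentum=0.937,       # SGD momentum
    weight_decay=0.0005,  # weight decay coefficient
)
```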
4. Discussion
4.1. Research Significance
Pod count is a key phenotypic trait of soybean. Accurate pod counting not only enables precise yield estimation, but also supports the identification of high-yielding and stress-resistant soybean varieties. Traditional methods are prone to inaccuracies due to factors like occlusion, background similarity, and dense distributions in complex field environments. These limitations often result in labor-intensive processes with high error rates, making them unsuitable for large-scale agricultural applications. Deep learning models such as YOLO-based approaches offer significant breakthroughs for these issues [42]. These models leverage advanced feature extraction and pattern recognition capabilities to handle complex visual tasks [43]. This study proposes an effective solution to the challenges of automatic pod detection and counting in the field by combining deep learning techniques with semi-supervised learning strategies [44]. It not only improves detection accuracy, but also demonstrates strong generalization capabilities across diverse field conditions. The method proposed in this study holds significant potential for extension to a broader range of research applications, such as evaluating the dynamic changes in both the quantity and quality of soybean seeds produced under shading conditions initiated at different crop phenological stages [45], as well as quantifying the effects of foliar boron application at various growth stages on the yield components of soybeans [46]. As research on diverse crop varieties advances and more comprehensive datasets are accumulated, this method is expected to play a pivotal role in driving the intelligent and automated development of agricultural technologies, thereby enhancing precision and efficiency in crop management and production.
4.2. Benefits of BeanpodNet (BN)
The BN model offers significant improvements by combining multiple innovative technologies, addressing key challenges in soybean pod detection. These advancements not only enhance detection accuracy, but also improve computational efficiency, making BN a practical solution for real-world agricultural applications. The SViT enhances feature representation by mitigating interference from complex backgrounds, making it highly robust in agricultural settings with variable environments. By leveraging the self-attention mechanism, SViT effectively captures long-range dependencies in the image, enabling the model to distinguish soybean pods from visually similar background elements such as leaves and soil. Furthermore, the Multi-SEAM improves detection accuracy, even in cases of occlusion, by focusing on targets across multiple spatial scales, which is critical for dense or overlapping soybean pods in the field. Lastly, the Inner-IoU strategy optimizes bounding box regression, refining the IoU calculation and boosting detection precision, particularly in dense target scenarios. In addition, SmartPod achieves a processing speed of 0.1 ms for preprocessing and 2.8 ms for postprocessing per image on a single RTX 2080 Ti (11 GB) GPU, which meets real-time processing requirements. The integration of SViT, Multi-SEAM, and Inner-IoU not only addresses the limitations of existing methods, but also sets a new benchmark for agricultural object detection. Future work could explore the deployment of BN on edge devices for real-time field applications, as well as its extension to other crops and phenotypic traits, further enhancing its impact on precision agriculture.
4.3. Advantages of Iterative Self-Training
This study explores the innovative use of iterative self-training within a semi-supervised learning framework, aiming to lay a solid pretraining foundation for future network modeling [47]. In deep learning, acquiring sufficient labeled data is often time-consuming and complex [48]. This approach not only reduces the annotation burden, but also improves model performance by incorporating diverse and representative samples from the unlabeled data pool. The main benefits of iterative self-training include its scalability, wide applicability, and flexibility. It performs well across both large-scale and small-scale datasets, regardless of data augmentation intensity [49]. For instance, in scenarios with limited labeled data, iterative self-training can generate high-quality pseudo-labels to supplement the training process, enabling the model to achieve competitive performance even with minimal supervision. Moreover, it is not confined to specific models or datasets, showcasing its versatility across various application scenarios. The results confirm that iterative self-training is a simple yet powerful algorithm that extracts valuable information from pseudo-labels, offering new insights for future network modeling. Future research could explore the integration of active learning strategies with iterative self-training to further optimize the selection of pseudo-labels and improve model performance.
4.4. Limitations and Future Work
Soybean pod detection and counting face significant challenges in complex field environments, including intricate backgrounds, dense distributions, and target occlusion [50]. This study proposes SmartPod, a novel detection method with remarkable effectiveness. Future research could explore integrating SmartPod with advanced technologies, such as attention mechanisms or graph neural networks (GNNs), to improve accuracy in dense and occluded environments. Additionally, SmartPod can be integrated with drone-based remote sensing systems for real-time pod detection across large fields, combined with multispectral or thermal imaging to enhance robustness in complex conditions. Coupled with IoT devices, SmartPod enables automated field data collection and analysis, supporting precision agriculture. Beyond soybean pod detection, SmartPod offers technical support for similar phenotyping tasks in crops like wheat and maize, driving intelligent and automated agricultural research. Furthermore, integrating deep learning with 3D imaging enables high-fidelity 3D reconstructions of plant morphology, advancing phenotyping solutions [51,52,53].
4.5. Integration into Field Management
Integrating SmartPod into daily field management practices offers significant practical benefits for farmers and agricultural managers. In addition to being deployed on phenotyping platforms, SmartPod can be implemented on portable devices such as smartphones or drones, enabling farmers to perform real-time pod detection and counting during routine field inspections. This capability allows for rapid yield potential assessment and data-driven decision making. Furthermore, SmartPod can be integrated with farm management software to automate data recording and analysis, providing farmers with actionable long-term insights. This not only reduces the labor intensity of manual pod counting, but also improves yield estimation accuracy, supporting more efficient resource allocation and increased profitability. Additionally, SmartPod’s adaptability to variable field conditions makes it a reliable tool for large-scale farms, where continuous monitoring is critical for sustainable agricultural practices. By bridging the gap between advanced technology and practical field applications, SmartPod has the potential to transform daily farm management and promote the widespread adoption of precision agriculture. Finally, SmartPod offers broad applicability, effectively supporting research involving soybean pod quantity assessment by providing high-precision pod counting results, thereby enhancing the overall efficiency, accuracy, and reproducibility of such studies.
5. Conclusions
This study presents SmartPod, a deep learning-based framework for accurate soybean pod detection in field phenotyping, addressing challenges such as occlusion, dense distributions, and background interference. By integrating SViT for enhanced feature representation, Multi-SEAM for overlapping pod detection, Inner-IoU for efficient bounding box regression, and an iterative self-training strategy, SmartPod achieves state-of-the-art performance with 94.1% AP@0.5 and a Pearson correlation of 0.9792, surpassing existing methods by 1.7–4.6%. The semi-supervised learning approach significantly reduces the dependency on labeled data, enhancing its practicality. However, a limitation of this study is the current requirement for imaging soybean plants in a laid-flat state. Future work will focus on implementing six-degrees-of-freedom imaging to enable pod detection and counting in their natural upright state.