Article

A Transformer-Based Symmetric Diffusion Segmentation Network for Wheat Growth Monitoring and Yield Counting

China Agricultural University, Beijing 100083, China
*
Author to whom correspondence should be addressed.
Agriculture 2025, 15(7), 670; https://doi.org/10.3390/agriculture15070670
Submission received: 23 February 2025 / Revised: 16 March 2025 / Accepted: 18 March 2025 / Published: 21 March 2025

Abstract

A wheat growth and counting analysis model based on instance segmentation is proposed in this study to address the challenges of wheat growth monitoring and yield prediction in high-density agricultural environments. The model integrates the transformer architecture with a symmetric attention mechanism and employs a symmetric diffusion module for precise segmentation and growth measurement of wheat instances. By introducing an aggregated loss function, the model effectively optimizes both segmentation accuracy and growth measurement performance. Experimental results show that the proposed model excels across several evaluation metrics. Specifically, in the segmentation accuracy task, the wheat instance segmentation model using the symmetric attention mechanism achieved a Precision of 0.91, Recall of 0.87, Accuracy of 0.89, mAP@75 of 0.88, and F1-score of 0.89, significantly outperforming other baseline methods. For the growth measurement task, the model’s Precision reached 0.95, Recall was 0.90, Accuracy was 0.93, mAP@75 was 0.92, and F1-score was 0.92, demonstrating a marked advantage in wheat growth monitoring. Finally, this study provides a novel and effective method for precise growth monitoring and yield counting in high-density agricultural environments, offering substantial support for future intelligent agricultural decision-making systems.

1. Introduction

Wheat, as one of the most important food crops worldwide, plays a crucial role in agricultural production stability and food security [1,2,3]. Wheat growth monitoring and yield estimation are essential aspects of agricultural production management, influencing farmers’ planting decisions, grain yield forecasting, market supply regulation, and agricultural policy-making [4,5]. Traditional methods for wheat growth monitoring primarily rely on manual observation and remote sensing analysis based on vegetation indices (e.g., NDVI, EVI) [6,7]. While these methods can reflect vegetation coverage and growth health to some extent, their accuracy is often limited by data resolution and environmental factors. Although remote-sensing-based vegetation indices provide large-scale assessments of growth conditions, their spectral limitations make it difficult to accurately distinguish individual wheat plants in high-density vegetation environments. Moreover, insufficient spatial resolution constrains their application in precise wheat monitoring [8,9]. Accurate wheat yield estimation is a critical task in crop growth assessment. Traditional methods mainly rely on random sampling and statistical modeling for yield estimation [10]. However, these methods suffer from limited sampling density, high estimation errors, and difficulty adapting to variations in growth stages and field management practices. Therefore, leveraging computer vision techniques combined with deep learning to achieve efficient and precise wheat growth monitoring and yield estimation has become a significant research direction in precision agriculture.
In recent years, deep learning has achieved groundbreaking progress in computer vision, providing new technological means for agricultural image analysis [11]. Specifically, convolutional neural network (CNN)- and transformer-based methods have been widely applied in crop phenotyping, particularly in semantic and instance segmentation tasks [12,13]. Hong et al. [14] proposed a hybrid attention network (CTHNet) to count wheat spikes from RGB images, integrating local features and global contextual information. Their experimental results demonstrated mean absolute errors of 3.40 and 5.21, significantly outperforming previous studies. Guan et al. [15] proposed an improved YOLOv10 algorithm that substantially enhances the feature extraction and detection capabilities of wheat spike models by incorporating a bidirectional feature pyramid network (BiFPN), a separation and enhancement attention module (SEAM), and a global context network (GCNet). Their method achieved an Accuracy of 93.69%, a Recall of 91.70%, and a Mean Average Precision (mAP) of 95.10% in wheat spike detection, outperforming the baseline YOLOv10 model by 2.02%, 2.92%, and 1.56%, respectively. The improved YOLOv10 algorithm effectively addresses challenges in wheat spike detection under complex field conditions, providing robust support for agricultural production and research. Yao et al. [16] focused on addressing wheat spike adhesion issues and proposed a combined algorithm named “APW”, integrating the alternating direction multiplier method (ADMM), Potts model, and watershed algorithm for rapid wheat spike identification and counting. Their results indicated optimal accuracy in low-density planting scenarios, with an R² value of 0.89 and a root mean square error (RMSE) of 3.72, demonstrating APW’s capability to handle scenarios with significant background variation and low object adherence. These findings offer a novel approach for effective wheat spike counting in drone-acquired images. Sun et al. [17] developed a simulation strategy to replicate real wheat field conditions, enabling data collection in an indoor environment within a short timeframe. The results showed that YOLOv7 performed the best, achieving R² = 0.963 and RMSE = 2.463. While these methods have been preliminarily applied to agricultural object detection tasks, challenges remain, including segmentation errors in heavily overlapping objects and increased computational complexity leading to slower inference speed. Additionally, density-based counting methods have recently been integrated with deep learning approaches, such as CSRNet and Count-ception, which perform object counting through density estimation. However, these methods still suffer from significant background noise interference and cumulative counting errors.
To address the limitations of existing methods, this study proposes an instance segmentation-based wheat growth and yield analysis model, adopting a transformer-structured symmetric diffusion segmentation network (SDS-Net) to enhance wheat identification accuracy and counting precision. The key innovations of this model include the symmetric diffusion module, symmetric attention mechanism, and aggregated loss function. The symmetric diffusion module simulates wheat plant growth characteristics by enhancing boundary information through feature diffusion, improving segmentation accuracy in overlapping-object scenarios. The symmetric attention mechanism leverages multi-scale feature fusion and global context awareness to enhance wheat field object recognition, effectively reducing background noise interference. Additionally, the proposed aggregated loss function integrates instance segmentation loss, counting loss, and boundary loss, ensuring strong generalization performance in both growth analysis and precise counting tasks. The contributions of this study are as follows:
  • Transformer-based symmetric diffusion segmentation network: Unlike traditional CNN-based architectures, this approach employs a transformer as the primary feature extraction module, combined with a symmetric diffusion module to enhance the boundary recognition of wheat plants, improving segmentation accuracy in densely planted environments.
  • Symmetric diffusion module: A novel diffusion mechanism is introduced, enabling the model to effectively handle wheat plant overlap, reducing mis-segmentation and omission errors.
  • Symmetric attention mechanism: Integrating self-attention mechanisms and multi-scale feature fusion enhances segmentation robustness in complex environments while optimizing target counting accuracy.
In summary, this study proposes a deep-learning-based approach for wheat growth monitoring and yield estimation, integrating instance segmentation and a symmetric diffusion mechanism to significantly improve detection accuracy and counting stability.

2. Related Work

2.1. Semantic Segmentation

Semantic segmentation is a core computer vision task that classifies each pixel for precise target region delineation [18,19,20]. In agriculture, it is widely used for crop recognition, disease detection, and growth monitoring [21,22,23]. Compared to object detection, it provides pixel-level masks for fine-grained analysis. Traditional methods, such as region growing, clustering, and threshold-based segmentation, struggle with complex backgrounds and occlusions. Recent deep-learning-based approaches, particularly CNN- and transformer-based models, have significantly improved segmentation accuracy. CNN models extract features via convolutional layers and use encoder–decoder architectures for segmentation [24,25,26]. U-Net employs skip connections for high-resolution feature retention, while DeepLab uses dilated convolutions to expand the receptive field [27,28]. However, CNNs suffer from limited receptive fields, leading to challenges in segmenting densely grown wheat fields, causing boundary blurring and omissions. Originally designed for NLP, transformers have demonstrated superior performance in semantic segmentation [29]. The self-attention mechanism in transformers effectively captures global dependencies, enhancing segmentation in complex backgrounds. ViT was the first fully transformer-based vision model, dividing images into patches for global feature modeling [30]. However, ViT’s high computational cost makes it less efficient for large-scale agricultural applications. While transformer-based models excel in global feature extraction and dense wheat field segmentation, their computational complexity remains a challenge for real-world deployment.

2.2. Object Counting Based on Probability Density Estimation

Object counting is widely used in crowd counting, cell counting, and agricultural yield estimation [31,32]. In wheat growth analysis, accurate counting provides key insights into crop conditions and yield prediction [33]. However, dense wheat growth introduces challenges such as occlusions and varying lighting conditions, increasing the complexity of counting tasks. Traditional methods fall into detection-based, regression-based, and density-estimation-based approaches [34]. Among these, probability density estimation has gained attention for handling densely distributed targets and enabling efficient counting under weak supervision [35]. In density-estimation-based counting, the model learns a mapping from images to density maps, integrating density values to estimate the total count. This approach is particularly effective in dense wheat fields, where detection-based methods like Faster R-CNN and YOLO often fail due to overlapping objects [36]. For sparse wheat distributions, detection-based models may suffice, but for high-density scenarios, such as the heading stage, density estimation methods provide greater robustness. By modeling spatial distributions, these methods mitigate object omissions and enhance counting accuracy in agricultural environments.
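As a minimal illustration of this integration step (the density map below is a fabricated stand-in for a model’s output, not data from this study):

```python
import numpy as np

# Toy stand-in for a predicted density map (H x W); each pixel holds the
# expected fraction of an object it covers, so the map sums to the count.
rng = np.random.default_rng(0)
density_map = np.zeros((256, 256), dtype=np.float32)
for cx, cy in rng.integers(8, 248, size=(30, 2)):        # 30 hypothetical wheat spikes
    density_map[cy - 2:cy + 3, cx - 2:cx + 3] += 1.0 / 25.0  # unit mass per spike

estimated_count = float(density_map.sum())               # = 30.0 objects
```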

3. Materials and Method

3.1. Dataset Collection

The wheat image data used in this study were sourced from several locations, including field-collected images from the Science Park at the West Campus of China Agricultural University in Haidian District, Beijing, publicly available images from the internet, and the official Kaggle wheat detection dataset. The data encompass various wheat growth densities, including sparse wheat fields with larger plant spacing, moderate-density fields with medium plant spacing, and high-density fields with dense plant growth, as shown in Figure 1. Each category includes between 1600 and 1900 images, ensuring the balance and representativeness of the dataset, as shown in Table 1. The data are further divided by acquisition method: “UAV” denotes images collected using aerial drone photography, and “Camera” denotes images obtained using ground-based digital cameras. This differentiation allows the model to learn multi-scale features from both large-area wheat field perspectives and close-up views of individual plants.
Field collection was carried out between March 2023 and March 2024 at the Science Park of the West Campus of China Agricultural University in Haidian District, Beijing, where various wheat varieties were cultivated at different growth stages, including the jointing, booting, flowering, and grain filling stages. These stages fulfill the study’s requirements for growth analysis across different wheat development phases. The collection equipment included a DJI Phantom 4 multispectral drone (DJI Technology Co., Ltd., Shenzhen, China) and a Canon EOS 5D Mark IV camera (Canon Inc., Tokyo, Japan). The multispectral camera mounted on the DJI Phantom 4 captures both visible and near-infrared bands, providing important vegetation growth parameters such as the NDVI (Normalized Difference Vegetation Index), while the Canon EOS 5D Mark IV, with its high-resolution imaging capabilities, accurately captures details of wheat ears and individual leaves. The combined use of these two devices ensures the diversity of the data while maintaining high recognition accuracy under various lighting conditions. During data collection, a combination of ground-based and drone-based imagery was employed. Ground-based imaging focused on capturing close-up images of wheat plants, leaves, and ears to obtain clear instance contours suitable for segmentation tasks. Different angles, including frontal, lateral, and top-down views, were used to observe the growth condition of wheat from various perspectives, with close-up shots of critical areas like the leaves and ears to enhance feature learning by the model. Drone-based imagery was utilized to capture the large-scale field conditions, with the camera height set between 5 m and 20 m, balancing field-level growth analysis and the preservation of local details. Automatic and manual exposure modes were employed during shooting, adjusting ISO and shutter speed under cloudy or low-light conditions to ensure consistent image quality. Additionally, to mitigate the influence of lighting changes at different times of the day on image color and brightness, gray cards were placed on the ground for color calibration, and post-processing lighting normalization was performed using the multispectral images. For wheat fields of varying density, stratified random sampling was applied to ensure a balanced distribution of data across the density categories. Sparse areas were mainly located in plots with uneven sowing or in the early stages of planting. Moderate-density areas represented typical wheat fields with standard sowing patterns, while high-density areas consisted of densely planted or vigorously growing wheat plots. In addition to the field-collected data, publicly available network data and the official Kaggle wheat detection dataset were also included. The network data primarily came from agricultural research institutions, crop disease databases, and high-resolution agricultural imaging databases, all of which were carefully selected for inclusion in this study. The Kaggle wheat detection dataset contained finely annotated wheat ear segmentation data from various growing regions, providing valuable information for training the model’s segmentation capabilities. Both the network data and Kaggle dataset underwent rigorous screening and preprocessing, including noise filtering, color normalization, and geometric transformations, to ensure stylistic consistency with the field-collected data.
Distinct feature differences were observed among the categories of wheat images with varying densities. In the sparse category, the contours of individual plants were clear, with a relatively simple background where soil or weeds were prominent, making it suitable for precise individual semantic segmentation. In the moderate-density category, plant spacing was smaller, with some overlap between plants, and the background was more complex than in the sparse category. In the high-density category, wheat plants grew densely and ears and leaves were heavily occluded, making it difficult to distinguish individual plants. Accurate segmentation in this category typically relied on advanced attention mechanisms and contextual information.

3.2. Data Annotation

In computer vision tasks, dataset annotation serves as a crucial step in model training, particularly in semantic and instance segmentation. Accurate annotations significantly enhance model generalization capabilities. In agricultural image analysis, wheat growth monitoring and counting tasks necessitate precise annotation to ensure that deep learning models can correctly learn the boundaries, morphology, and key structural features of wheat plants. The primary objective of data annotation is to generate segmentation masks for each wheat plant within an image and to mark its key feature points, thereby enabling the model to effectively learn wheat structural information and ultimately improving growth assessment and yield estimation accuracy. In instance segmentation tasks, each wheat plant is considered an independent target, requiring a unique mask for every individual plant during annotation. Given an input image $I \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ represent the image height and width, respectively, and $C$ denotes the number of channels, the goal of segmentation is to learn a mapping function $f_\theta$ such that the model can predict the mask $M_i$ for the $i$-th wheat plant:
$M_i = f_\theta(I),$
where $M_i \in \{0, 1\}^{H \times W}$ represents a binary mask, with a value of 1 indicating that a pixel belongs to the $i$-th wheat plant and a value of 0 denoting background. The final segmentation result $M$ for an image containing $N$ wheat plants can be represented as the collection of all instance masks:
$M = \bigcup_{i=1}^{N} M_i$
A critical challenge in data annotation involves ensuring clear boundaries between instances to prevent mislabeling due to overlapping. Generally, plant contours are manually outlined using polygon- or lasso-based tools to generate high-precision segmentation masks. Mathematically, the closed boundary region $B_i$ of each wheat plant can be defined as follows:
$B_i = \{(x, y) \mid F_i(x, y) = 0\},$
where $F_i(x, y)$ represents the implicit function of the plant boundary. When $F_i(x, y) = 0$, the pixel $(x, y)$ lies on the contour of the wheat plant. If $F_i(x, y) < 0$, the pixel is inside the plant, whereas $F_i(x, y) > 0$ indicates that the pixel is outside the plant. In addition to segmentation masks, key feature point (keypoint) annotation is also crucial for wheat growth analysis. Keypoint annotation typically includes structural information such as the wheat plant’s root, stem, and spike. Each wheat plant $i$ is represented by $K$ keypoints $\{P_{i,j}\}_{j=1}^{K}$, where $P_{i,j} = (x_{i,j}, y_{i,j})$ denotes the coordinates of the $j$-th keypoint. The objective of keypoint detection is to learn a probability distribution $P(x, y)$, where $(x, y)$ represents the position of keypoints. The probability density function is defined as:
$P(x, y) = \sum_{j=1}^{K} G_\sigma(x - x_{i,j},\, y - y_{i,j}),$
where $G_\sigma$ is a Gaussian kernel function used to generate a smoothed keypoint heatmap:
$G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$
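A minimal NumPy sketch of this heatmap construction is shown below; the keypoint coordinates and $\sigma$ value are hypothetical, not taken from the annotation protocol:

```python
import numpy as np

def keypoint_heatmap(keypoints, height, width, sigma=2.0):
    """Sum one Gaussian kernel G_sigma per keypoint to form a smooth heatmap."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for kx, ky in keypoints:
        d2 = (xs - kx) ** 2 + (ys - ky) ** 2
        heatmap += np.exp(-d2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return heatmap

# Hypothetical root, stem, and spike keypoints of one plant in a 64 x 64 crop.
hm = keypoint_heatmap([(10, 58), (14, 32), (18, 6)], 64, 64, sigma=2.0)
```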
During annotation, Gaussian distributions are commonly employed to model keypoint locations, thereby reducing annotation errors and mitigating the impact of label noise on model training. To improve annotation efficiency, modern annotation tools such as LabelMe and CVAT provide semi-automated annotation capabilities, including automatic segmentation using GrabCut, region clustering with K-means, and Mask R-CNN-based interactive segmentation. These approaches effectively reduce the time cost of manual annotation and enhance annotation consistency. For instance, GrabCut employs an energy minimization framework using a Bayesian model to estimate the probability distribution of foreground and background pixels. The energy function E comprises a data term and a smoothness term:
$E(M) = -\sum_{x,y} \log P(I_{x,y} \mid M_{x,y}) + \lambda R(M),$
where $P(I_{x,y} \mid M_{x,y})$ represents the probability that pixel $(x, y)$ belongs to the foreground or background, $\lambda$ is a balancing parameter, and $R(M)$ is the smoothness regularization term. By minimizing $E(M)$, the optimized segmentation mask can be obtained. In agricultural applications, wheat plants exhibit dynamic growth over time, so it is necessary to consider temporal information during data annotation to construct a more temporally consistent crop growth model. By continuously annotating wheat images from different growth stages, plant growth trends can be extracted, enabling further optimization of growth assessment algorithms. For example, the plant growth rate $G(t)$ can be defined as the change in projected area between two time points $t_1$ and $t_2$:
$G(t) = \frac{A(t_2) - A(t_1)}{t_2 - t_1},$
where $A(t)$ represents the projected area of the wheat plant at time $t$, which can be computed using the segmentation mask $M_i$:
$A(t) = \sum_{x,y} M_i(x, y, t)$
By continuously sampling wheat images at different time points and performing semantic segmentation and keypoint annotation, a more comprehensive crop growth monitoring system can be developed.
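As an illustration, a minimal sketch of this growth-rate computation from two binary instance masks (the masks and time points below are fabricated for the example):

```python
import numpy as np

def projected_area(mask):
    """A(t): foreground pixel count of a binary instance mask M_i."""
    return float(mask.sum())

def growth_rate(mask_t1, mask_t2, t1, t2):
    """G(t): change in projected area per unit time between t1 and t2."""
    return (projected_area(mask_t2) - projected_area(mask_t1)) / (t2 - t1)

# Hypothetical masks ten days apart: the plant expands from 850 to 1300 pixels.
m1 = np.zeros((64, 64), dtype=np.uint8); m1[:17, :50] = 1
m2 = np.zeros((64, 64), dtype=np.uint8); m2[:26, :50] = 1
print(growth_rate(m1, m2, t1=0, t2=10))  # 45.0 pixels/day
```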

3.3. Data Preprocessing

During deep learning training, the distribution of input data significantly affects model convergence speed and stability. Since image pixel value distributions vary across different datasets, normalization is typically required to ensure that all input samples fall within a consistent range, thereby preventing instability in gradient updates. Given an input image $I$ with pixel values in the range $[0, 255]$, normalization is usually performed using either mean standardization or min–max normalization:
$I' = \frac{I - \mu}{\sigma},$
where $\mu$ represents the dataset mean and $\sigma$ denotes the standard deviation. Alternatively, min–max normalization is applied as follows:
$I' = \frac{I - I_{\min}}{I_{\max} - I_{\min}},$
where $I_{\min}$ and $I_{\max}$ represent the minimum and maximum pixel values, respectively. The application of normalization ensures a more stable input distribution for neural networks, facilitating faster model convergence. Furthermore, in agricultural image processing, variations in camera equipment may lead to significant differences in image resolution, necessitating scale transformation. Given an original image size of $(H, W)$ and a target size of $(H', W')$, the linear interpolation formula for scale transformation is defined as:
$I'(x', y') = I(x' \cdot s_x,\, y' \cdot s_y),$
where $s_x = W / W'$ and $s_y = H / H'$, with $(x', y')$ representing the target image coordinates and $(x, y)$ denoting the original image coordinates. Common interpolation techniques include bilinear interpolation, nearest-neighbor interpolation, and cubic interpolation, with the choice of method depending on the specific task requirements. Data augmentation is employed as a strategy to generate additional training samples by applying various transformations to the original dataset. In wheat growth detection tasks, where field images exhibit significant variability, appropriate data augmentation techniques enhance model robustness, improving adaptability to changes in lighting, background, and occlusion. Commonly used data augmentation methods include CutMix, Mosaic, GridMask, and synthetic data augmentation based on diffusion models, as shown in Figure 2.
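For reference, a minimal sketch of the two normalization schemes and the bilinear rescaling described above (the input image, target size, and dataset statistics are assumed values):

```python
import cv2
import numpy as np

def normalize_meanstd(img, mu, sigma):
    """I' = (I - mu) / sigma, using dataset-level statistics."""
    return (img.astype(np.float32) - mu) / sigma

def normalize_minmax(img):
    """I' = (I - I_min) / (I_max - I_min), mapping pixel values into [0, 1]."""
    img = img.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min())

# Toy stand-in for a field image; real use would load one with cv2.imread.
img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
img = cv2.resize(img, (512, 512), interpolation=cv2.INTER_LINEAR)  # bilinear rescaling
img = normalize_meanstd(img, mu=127.5, sigma=58.0)                 # assumed statistics
```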
In this study, we utilized multiple data augmentation techniques to enhance model performance. CutMix increased the diversity of training data by randomly cropping and pasting regions from different images, effectively alleviating overfitting in our experiments. Mosaic combined four distinct images into a single composite image, which assisted the model in learning in multi-object environments and was beneficial for multi-object counting tasks in wheat growth monitoring. GridMask randomly masked certain regions in the input image, forcing the model to learn more discriminative features and improving the model’s generalization ability. By integrating these preprocessing techniques, the model performance was significantly enhanced, providing reliable and efficient training for wheat growth monitoring and yield estimation tasks [37].
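Of these, CutMix is the simplest to reproduce; the sketch below follows the generic CutMix recipe rather than the exact variant used in training:

```python
import numpy as np

def cutmix(img_a, img_b, alpha=1.0, rng=None):
    """Paste a random rectangle of img_b into img_a; return the mixed image
    and lambda, the area fraction kept from img_a (used to mix the labels)."""
    rng = rng or np.random.default_rng()
    h, w = img_a.shape[:2]
    lam = rng.beta(alpha, alpha)                       # mixing ratio
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = int(rng.integers(h)), int(rng.integers(w))
    y1, y2 = max(0, cy - cut_h // 2), min(h, cy + cut_h // 2)
    x1, x2 = max(0, cx - cut_w // 2), min(w, cx + cut_w // 2)
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)        # exact area correction
    return mixed, lam

a = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
b = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
mixed, lam = cutmix(a, b)
```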

3.4. Proposed Method

3.4.1. Symmetric Diffusion Segmentation Network Based on Transformer

As shown in Figure 3, the symmetric diffusion segmentation network based on transformer proposed in this study consists of four key modules: the transformer encoder, the symmetric diffusion module, the symmetric attention mechanism, and the aggregation loss calculation module. The model adopts an end-to-end training approach, aiming to enhance the boundary segmentation capability of wheat targets and improve the stability of target counting. After data preprocessing, the input image $I \in \mathbb{R}^{H \times W \times C}$ is first partitioned into patches following the ViT structure. Each patch is mapped to a feature vector through a linear projection layer, capturing localized spatial relationships, and the position encoding $E_p$ is added to form the input token representation for subsequent transformer processing. This patch-based tokenization ensures that the model learns spatial dependencies effectively across different wheat growth regions:
$X_0 = [X_p^1 + E_p;\; X_p^2 + E_p;\; \ldots;\; X_p^N + E_p]$
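A minimal PyTorch sketch of this ViT-style tokenization is given below; the image size, patch size, and embedding width are assumed values rather than the paper’s configuration:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Partition the image into P x P patches, linearly project each patch,
    and add a learnable position encoding E_p."""

    def __init__(self, img_size=512, patch=16, in_ch=3, dim=256):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A stride-P convolution is equivalent to a per-patch linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))  # E_p

    def forward(self, x):                                 # x: (B, C, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return tokens + self.pos                          # X_0

x0 = PatchEmbedding()(torch.randn(1, 3, 512, 512))        # (1, 1024, 256)
```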
Once the initial token representation is generated, the transformer encoder extracts global features across the entire image, generating the feature representation $F_{global}$. The encoder’s attention mechanism ensures that distant wheat plants and occlusions are effectively handled, improving robustness in segmentation. In the symmetric diffusion module, $F_{global}$ is further processed to simulate the growth diffusion process of wheat plants, enhancing the detailed expression of target boundaries. This module is inspired by biological growth diffusion models, where each feature point $(x, y)$ aggregates spatial information from neighboring pixels, simulating natural expansion and ensuring feature continuity across dense wheat regions. The diffused feature $F_{diff}$ is computed using the following equation:
$F_{diff}(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} W_{ij} \cdot F_{global}(x+i,\, y+j)$
Here, $W_{ij}$ represents the adaptive diffusion weight, ensuring that the diffusion weight in high-density areas is smaller while diffusion is larger in low-density regions, thereby reducing background noise interference and enhancing the continuity of plant boundaries. The diffused feature $F_{diff}$ is then passed through the symmetric attention mechanism, which strengthens target information through multi-scale feature fusion. $F_{diff}$ is split into feature maps at different scales, $F_s$ and $F_l$, corresponding to local and global features, respectively:
$F_s = \text{Conv}_s(F_{diff}), \quad F_l = \text{Conv}_l(F_{diff})$
These are then weighted and fused using the attention mechanism:
$F_{attn} = \alpha F_s + (1 - \alpha) F_l$
where $\alpha$ is computed by the self-attention module to adaptively adjust the weight between local and global features. The final feature map $F_{attn}$ is passed through an instance segmentation head to generate the final instance mask $M$, followed by loss calculation, which integrates segmentation loss, target counting error, and boundary optimization loss:
$L_{total} = L_{seg} + \lambda_1 L_{count} + \lambda_2 L_{boundary}$
where $L_{seg}$ optimizes segmentation accuracy, $L_{count}$ ensures the stability of target counting, and $L_{boundary}$ further strengthens boundary recognition capabilities. The inclusion of an aggregation loss function ensures that the model jointly optimizes multiple learning objectives, reducing inconsistencies between segmentation and counting tasks. The entire SDS-Net fully leverages the global information capture capability of the transformer structure, combining symmetric diffusion and attention mechanisms to improve wheat segmentation accuracy while optimizing target counting stability in high-density environments. By explicitly integrating multi-scale feature fusion and detailing the ViT-based encoder structure, this framework ensures robust adaptability to different wheat growth stages and environmental conditions.
To explain the computation of diffusion weights in the symmetric diffusion module, let the input feature be $F \in \mathbb{R}^{H \times W \times C}$; the diffused feature $F_{diff}$ is obtained by a weighted sum of neighboring features, as shown in Algorithm 1:
$F_{diff}(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} W_{ij} \cdot F(x+i,\, y+j)$
where $W_{ij}$ is the adaptive diffusion weight, defined as:
$W_{ij} = \frac{\exp\left(-\|F(x, y) - F(x+i,\, y+j)\|^2\right)}{\sum_{m=-k}^{k} \sum_{n=-k}^{k} \exp\left(-\|F(x, y) - F(x+m,\, y+n)\|^2\right)}$
This weight is controlled by feature similarity, ensuring that diffusion in high-density regions is reduced while diffusion in low-density regions is broader, thereby minimizing background noise interference with object boundaries.
Algorithm 1 Symmetric Diffusion Algorithm
Require: Feature map $F \in \mathbb{R}^{H \times W \times C}$, diffusion range $k$
Ensure: Diffused feature map $F_{diff}$
1: Initialize $F_{diff} \leftarrow$ zeros of shape $(H, W, C)$
2: for $x \leftarrow 0$ to $H - 1$ do
3:   for $y \leftarrow 0$ to $W - 1$ do
4:     Extract the local neighborhood:
5:       $F_{local} \leftarrow F[\max(0, x - k) : \min(H, x + k + 1),\; \max(0, y - k) : \min(W, y + k + 1)]$
6:     Compute similarity weights:
7:       $W_{ij} \leftarrow \exp(-\|F(x, y) - F_{local}\|^2)$
8:     Normalize the weights:
9:       $W_{ij} \leftarrow W_{ij} / \sum W_{ij}$
10:    Compute the diffused feature:
11:      $F_{diff}(x, y) \leftarrow \sum W_{ij} \cdot F_{local}$
12:   end for
13: end for
14: return $F_{diff}$
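A direct NumPy translation of Algorithm 1 is provided below as a reference implementation; it mirrors the pseudocode rather than the optimized attention-based formulation of Section 3.4.2:

```python
import numpy as np

def symmetric_diffusion(F, k=1):
    """Algorithm 1: adaptive neighborhood diffusion of an (H, W, C) feature map."""
    H, W, C = F.shape
    F_diff = np.zeros_like(F)
    for x in range(H):
        for y in range(W):
            # Local neighborhood around (x, y), clipped at the borders.
            local = F[max(0, x - k):min(H, x + k + 1),
                      max(0, y - k):min(W, y + k + 1)]
            # Similarity weights W_ij = exp(-||F(x,y) - F(x+i,y+j)||^2).
            d2 = ((local - F[x, y]) ** 2).sum(axis=-1)
            w = np.exp(-d2)
            w /= w.sum()                                  # normalization
            # Weighted sum of neighboring features.
            F_diff[x, y] = (w[..., None] * local).sum(axis=(0, 1))
    return F_diff

F_diff = symmetric_diffusion(np.random.rand(32, 32, 8).astype(np.float32), k=2)
```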

3.4.2. Symmetric Diffusion Module

The core objective of the symmetric diffusion module is to simulate the growth diffusion process of wheat plants to enhance the boundary recognition capability of targets in semantic segmentation, especially in scenarios where wheat plants are densely grown and targets overlap.
As shown in Figure 4, this module effectively improves the model’s robustness. The input features are first processed through channel reduction to decrease computational complexity, allowing the diffusion operation to be performed in a more compact feature representation space. Let the input feature be $F_{in} \in \mathbb{R}^{N \times C}$, where $N$ denotes the spatial dimension (such as $H \times W$) and $C$ represents the number of channels. Channel reduction is achieved through a linear mapping $W_r \in \mathbb{R}^{C \times C'}$:
$F' = F_{in} W_r, \quad F' \in \mathbb{R}^{N \times C'}$
where $C' \ll C$, ensuring that the main feature information is retained while reducing the computational cost. In the diffusion attention calculation stage, the input features are first projected into queries ($Q$), keys ($K$), and values ($V$). The query $Q \in \mathbb{R}^{N \times 1}$ is obtained by applying average pooling to the input features followed by a linear transformation, the key $K \in \mathbb{R}^{N \times 1}$ is derived from the downsampled features, and the value $V \in \mathbb{R}^{N \times C'}$ is obtained from the dimensionality-reduced features:
$Q = W_q \cdot \text{AvgPool}(F')$
$K = W_k \cdot \text{DownSample}(F')$
$V = W_v \cdot F'$
where $W_q$, $W_k$, and $W_v$ are the respective linear transformation matrices. The attention weights are then computed to obtain the diffusion attention map $A$:
$A = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)$
where $d_k$ is the scaling factor for the key dimension, ensuring gradient stability. Subsequently, the attention weights $A$ are applied to the value $V$ to generate the diffused feature:
$F_{diff} = A V$
This feature $F_{diff}$ represents the diffusion result for the local region, which is further fused with the original feature $F'$ to retain detailed information about the target region while enhancing global coherence:
$F_{out} = F' + W_d F_{diff}$
where $W_d$ is the fusion weight parameter. The final output feature $F_{out} \in \mathbb{R}^{N \times C'}$ is mapped back to the original channel dimension using a channel recovery mapping $W_o$, forming the final diffusion-enhanced feature:
$F_{final} = F_{out} W_o$
The design advantage of the symmetric diffusion module lies in its ability to enhance boundary information. Through the diffusion operation, the semantic information of the target region is propagated to the neighboring areas, thus improving boundary accuracy during segmentation and reducing target adhesion issues. In contrast to CNNs, which rely on small receptive fields for convolution operations, this module uses attention mechanisms to obtain a larger receptive field, enabling the model to capture long-distance feature dependencies. This is particularly suitable for target recognition tasks in densely planted wheat environments. Furthermore, by adopting dimensionality reduction and local attention mapping, the computational complexity is reduced from $\mathcal{O}(N^2)$ to $\mathcal{O}(N d_k)$, making it more suitable for processing high-resolution agricultural images. In dense target environments, traditional object detection methods are prone to target omission, but this module, through adaptive diffusion calculations, enables the segmentation model to more accurately distinguish dense plants and improve the stability of target counting.
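One possible PyTorch reading of this module is sketched below. The text leaves several dimensional details open, so the pooling window, the downsampling ratio, and the nearest-neighbor upsampling of the diffused tokens are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricDiffusionModule(nn.Module):
    """Hedged sketch of the attention-based diffusion module (Section 3.4.2)."""

    def __init__(self, C=256, C_red=64, r=4):
        super().__init__()
        self.reduce = nn.Linear(C, C_red)          # W_r: channel reduction
        self.w_q = nn.Linear(C_red, C_red)         # W_q
        self.w_k = nn.Linear(C_red, C_red)         # W_k
        self.w_v = nn.Linear(C_red, C_red)         # W_v
        self.w_d = nn.Linear(C_red, C_red)         # W_d: fusion weight
        self.recover = nn.Linear(C_red, C)         # W_o: channel recovery
        self.r = r                                 # assumed pooling/downsample ratio

    def forward(self, x):                          # x: (B, N, C) token features
        f = self.reduce(x)                         # (B, N, C_red)
        # Queries from average-pooled tokens, keys/values from downsampled tokens.
        q = self.w_q(F.avg_pool1d(f.transpose(1, 2), self.r).transpose(1, 2))
        k = self.w_k(f[:, ::self.r])               # (B, N/r, C_red)
        v = self.w_v(f[:, ::self.r])
        a = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        f_diff = a @ v                             # diffused tokens (B, N/r, C_red)
        # Upsample diffused tokens back to N positions (assumed nearest-neighbor).
        f_diff = F.interpolate(f_diff.transpose(1, 2), size=f.shape[1]).transpose(1, 2)
        return self.recover(f + self.w_d(f_diff))  # residual fusion + recovery

out = SymmetricDiffusionModule()(torch.randn(2, 1024, 256))  # (2, 1024, 256)
```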

3.4.3. Symmetric Attention Mechanism

As shown in Figure 5, the symmetric attention mechanism presents a significant structural optimization compared to the traditional self-attention mechanism, particularly excelling in detection tasks involving densely packed objects.
However, in agricultural image analysis tasks, the computational complexity of the standard transformer is $\mathcal{O}(N^2)$, resulting in a high computational overhead for high-resolution image processing. Additionally, while the self-attention mechanism primarily focuses on global information, there is significant local correlation between objects in dense target scenarios. As a result, directly using global attention may cause blurred boundaries and reduced object distinguishability. To address this, the symmetric attention mechanism combines channel attention and spatial attention, establishing a balance between feature maps at different scales and ensuring sufficient interaction of information at various scales, thereby improving the segmentation accuracy of local targets. The core of the symmetric attention mechanism consists of several symmetric attention blocks, including the channel attention module (CRA) and the spatial attention module (SA). The input to the mechanism is a feature map of size $H \times W \times C$, where $H$ and $W$ represent the image height and width, respectively, and $C$ represents the number of channels. To effectively integrate multi-scale feature information, the network employs hierarchical feature extraction, where multiple encoder layers progressively refine spatial and semantic representations: the first layer has feature dimensions of $H/4 \times W/4 \times C_1$, the second layer $H/8 \times W/8 \times C_2$, the third layer $H/16 \times W/16 \times C_3$, and the fourth layer $H/32 \times W/32 \times C_4$. This hierarchical structure allows the model to capture both fine-grained local details and high-level global information, ensuring robust segmentation across different spatial scales. In the encoder part, each layer contains normalization layers, symmetric attention computation, a channel MLP, and multi-scale feature fusion, while the decoder part restores the resolution through upsampling and performs feature remapping using an MLP. The computation of the channel attention module is as follows:
$F_{avg} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F(i, j)$
$F_{max} = \max_{i,j} F(i, j)$
$W_c = \sigma(W_1(F_{avg}) + W_2(F_{max}))$
Here, $W_1$ and $W_2$ are fully connected layers and $\sigma$ represents the Sigmoid activation function. Additionally, the spatial attention computation is as follows:
$F_c = \text{Conv}(F, k=7)$
$W_s = \sigma(F_c)$
The final symmetric attention computation is expressed as:
$F_{SAM} = W_c \cdot F + W_s \cdot F$
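A compact PyTorch sketch of this channel-plus-spatial combination follows; how $\text{Conv}(F, k=7)$ collapses the channels into a single spatial map is an assumption, since the text does not specify it:

```python
import torch
import torch.nn as nn

class SymmetricAttentionBlock(nn.Module):
    """Sketch of W_c, W_s, and F_SAM = W_c*F + W_s*F from the equations above."""

    def __init__(self, C=64):
        super().__init__()
        self.w1 = nn.Linear(C, C)                          # W_1
        self.w2 = nn.Linear(C, C)                          # W_2
        # Assumed: the k=7 convolution maps C channels to one spatial map.
        self.conv7 = nn.Conv2d(C, 1, kernel_size=7, padding=3)

    def forward(self, x):                                  # x: (B, C, H, W)
        f_avg = x.mean(dim=(2, 3))                         # global average pooling
        f_max = x.amax(dim=(2, 3))                         # global max pooling
        w_c = torch.sigmoid(self.w1(f_avg) + self.w2(f_max))[:, :, None, None]
        w_s = torch.sigmoid(self.conv7(x))                 # (B, 1, H, W)
        return w_c * x + w_s * x                           # F_SAM

y = SymmetricAttentionBlock()(torch.randn(2, 64, 32, 32))  # (2, 64, 32, 32)
```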
In the SDS-Net structure, the symmetric attention mechanism mainly acts on the connection between the encoder and decoder, ensuring that during the feature extraction phase, not only is global information considered, but local features are also retained. This compensates for the lack of local information in standard transformers for segmentation tasks. On the encoder side, after the transformer extracts global features $F_{global}$, the symmetric attention mechanism enhances the local feature information through multi-scale spatial attention, as shown below:
$F'_{global} = F_{global} + F_{SAM}$
On the decoder side, during the process of progressively restoring image resolution, the symmetric attention mechanism ensures that features at different scales are fully fused, making boundary information clearer, thus optimizing the semantic segmentation accuracy:
$F'_{decoder} = \text{UpSample}(F_{decoder}) + F_{SAM}$
Furthermore, to enhance feature representation across different spatial scales, the multi-scale feature fusion mechanism integrates outputs from multiple encoder layers before passing them to the decoder. This ensures that the network learns both fine-grained and high-level contextual features, improving segmentation robustness in dense wheat fields. The inclusion of multi-scale feature fusion mitigates the issue of feature degradation in deep networks, allowing the model to maintain consistent performance across varying wheat growth stages and occlusion levels. In addition to multi-scale feature fusion, the Channel MLP module plays a crucial role in enhancing inter-channel dependencies. Located within the Symmetric Block, the Channel MLP module processes the extracted features by refining channel-wise representations, ensuring effective feature recalibration before they are passed to the decoder. Unlike conventional MLP layers, the Channel MLP module is optimized for segmentation tasks by dynamically adjusting the weights of different channels, thereby improving feature discrimination. The integration of the Channel MLP with normalization layers, symmetric attention computation, and multi-scale feature fusion provides a balanced trade-off between computational efficiency and segmentation accuracy.
By introducing the symmetric attention mechanism, SDS-Net effectively combines the ability to model global information and extract local features, allowing the network to better adapt to dense target scenarios, improving the segmentation accuracy of object boundaries and reducing interference between instances. This optimization mechanism enhances the model’s robustness and accuracy in semantic segmentation tasks in agricultural environments, especially for precise detection and counting tasks in dense wheat plant growth environments.

3.4.4. Aggregated Loss Function

In segmentation tasks, traditional loss functions typically employ binary cross-entropy (BCE) loss or Dice loss for optimization. However, each of these methods has limitations. BCE is primarily used for pixel-level classification but lacks the ability to capture boundary information, leading to target adhesion or mis-segmentation. Dice loss improves segmentation performance but struggles with background noise in densely distributed agricultural images. To address these limitations, this study proposes an aggregated loss function that jointly optimizes segmentation, target counting, and boundary refinement. The weight selection process for these loss components was determined through extensive experiments and cross-validation to ensure balanced optimization. The aggregated loss function consists of three components. The first is the segmentation loss $L_{seg}$, which optimizes the match between the segmented region and the ground truth, defined by the Dice loss as:
$L_{seg} = 1 - \frac{2 |P \cap G|}{|P| + |G|},$
where $P$ is the predicted segmentation mask and $G$ is the ground truth mask. To determine the optimal weight for $L_{seg}$, a grid search was conducted over a range of values, considering its impact on segmentation accuracy across different wheat densities. The second component is the target counting loss $L_{count}$, which ensures the stability of the predicted target count, defined using the mean squared error (MSE):
$L_{count} = \sum_{x=1}^{H} \sum_{y=1}^{W} \left( \hat{D}(x, y) - D(x, y) \right)^2,$
where $\hat{D}(x, y)$ is the predicted density map and $D(x, y)$ is the ground truth density distribution. Since wheat distribution varies across fields, the weight of $L_{count}$ was optimized using 5-fold cross-validation, ensuring the loss function remains effective under different target densities. To further enhance boundary precision, the boundary loss $L_{boundary}$ is introduced:
$L_{boundary} = \| \nabla P - \nabla G \|^2,$
where $\nabla P$ and $\nabla G$ represent the gradient information of the predicted segmentation mask and ground truth mask, respectively. This loss component was weighted lower in early training phases to prevent overemphasis on boundaries before segmentation features were sufficiently refined. To determine the optimal loss weight combination, an extensive hyperparameter search was conducted over different weight values for $\lambda_1$, $\lambda_2$, and $\lambda_3$ in the total loss function:
$L_{total} = \lambda_1 L_{seg} + \lambda_2 L_{count} + \lambda_3 L_{boundary}.$
A grid search was performed over $\lambda_1, \lambda_2, \lambda_3 \in [0.1, 1.0]$ with a step size of 0.1, evaluating segmentation accuracy (mAP@75), target counting stability (MSE), and boundary precision (Boundary IoU) on a validation set. Five-fold cross-validation was used to ensure generalization across different field conditions. The results indicated that $\lambda_1 = 0.6$, $\lambda_2 = 0.3$, and $\lambda_3 = 0.1$ provided the best balance between segmentation accuracy, counting stability, and boundary clarity. The final optimization ensures that SDS-Net effectively captures global target distributions while maintaining detailed local feature refinement. This adaptive weighting approach enhances performance across varying wheat densities, making SDS-Net more robust in real-world agricultural applications.
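A minimal PyTorch sketch of the aggregated loss with the reported weights is given below; approximating the gradient maps by finite differences is an implementation choice, not fixed by the text:

```python
import torch
import torch.nn.functional as F

def aggregated_loss(pred_mask, gt_mask, pred_density, gt_density,
                    l1=0.6, l2=0.3, l3=0.1, eps=1e-6):
    """L_total = l1*L_seg + l2*L_count + l3*L_boundary (reported weights)."""
    # Dice segmentation loss.
    inter = (pred_mask * gt_mask).sum()
    l_seg = 1 - 2 * inter / (pred_mask.sum() + gt_mask.sum() + eps)
    # MSE between predicted and ground-truth density maps.
    l_count = F.mse_loss(pred_density, gt_density, reduction="sum")
    # Boundary loss on finite-difference gradients of the masks.
    def grad(m):
        return m[..., :, 1:] - m[..., :, :-1], m[..., 1:, :] - m[..., :-1, :]
    (px, py), (gx, gy) = grad(pred_mask), grad(gt_mask)
    l_boundary = ((px - gx) ** 2).sum() + ((py - gy) ** 2).sum()
    return l1 * l_seg + l2 * l_count + l3 * l_boundary

pm = torch.rand(1, 64, 64); gm = (torch.rand(1, 64, 64) > 0.5).float()
pd = torch.rand(1, 64, 64); gd = torch.rand(1, 64, 64)
loss = aggregated_loss(pm, gm, pd, gd)
```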

3.5. Experimental Configuration

3.5.1. Hardware and Software Configuration

Experiments were carried out on a high-performance computing platform. The hardware consisted of an NVIDIA A100 GPU (NVIDIA Corporation, Santa Clara, CA, USA; Ampere architecture, 6912 CUDA cores, 432 Tensor cores, 40 GB HBM2 memory), an AMD EPYC 7742 64-core/128-thread processor, 512 GB of RAM, and NVMe SSDs for storage. The software stack used PyTorch 2.6.0 as the deep learning framework, running on Ubuntu 22.04 LTS with CUDA 12.1 and cuDNN 8.8. PyTorch Lightning managed the training workflow, while OpenCV and Albumentations were used for data preprocessing. NVIDIA Apex was utilized for mixed-precision training.

3.5.2. Dataset Partitioning and Hyperparameters

The dataset was partitioned into training, validation, and test sets at an 80%:10%:10% ratio. Five-fold cross-validation was also employed for stability. To determine the optimal optimizer, we conducted comparative experiments using Adam, SGD, RMSprop, and AdamW. The results indicated that AdamW outperformed the other optimizers in terms of convergence stability and final segmentation accuracy, particularly in dense wheat growth scenarios. The effectiveness of AdamW is attributed to its decoupled weight decay, which mitigates overfitting and enhances generalization. Based on empirical evaluation, we set the momentum hyperparameters to $\beta_1 = 0.9$ and $\beta_2 = 0.999$ and the weight decay to $\lambda = 0.01$. The learning rate was selected through a grid search, exploring initial values from $10^{-3}$ to $10^{-5}$, with different decay strategies, including exponential decay, cosine annealing, and linear decay. The experimental results demonstrated that a linear decay strategy provided the best balance between training stability and final performance. Thus, we adopted a learning rate schedule starting from $\eta_0 = 10^{-4}$ over $T = 100$ epochs, incorporating a warm-up phase in the first 10 epochs to prevent sudden weight updates at the beginning of training. The batch size was set to 16 with a gradient accumulation step of 4, ensuring efficient memory utilization without compromising training performance. To further enhance robustness, an early stopping mechanism was implemented, halting training if the validation loss did not decrease for 10 consecutive epochs. This helped prevent unnecessary computation and reduced the risk of overfitting. The combination of these hyperparameters, determined through extensive experimental evaluation, ensured the proposed model achieved optimal performance in wheat segmentation and growth measurement tasks.
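The reported schedule can be sketched in PyTorch as follows (a minimal sketch; the model stand-in and the exact shape of the warm-up ramp are assumptions):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 10)          # stand-in for SDS-Net
opt = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)

T, warmup = 100, 10                      # total epochs, warm-up epochs
def lr_lambda(epoch):
    if epoch < warmup:                   # assumed linear warm-up toward eta_0
        return (epoch + 1) / warmup
    return max(0.0, (T - epoch) / (T - warmup))  # linear decay to zero

sched = LambdaLR(opt, lr_lambda)
for epoch in range(T):
    # ... training/validation steps with gradient accumulation of 4 ...
    sched.step()
```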

3.5.3. Evaluation Metrics

To comprehensively evaluate model performance, we used Precision ($P$), Recall ($R$), Accuracy ($Acc$), F1-score ($F_1$), and Mean Average Precision (mAP@75). These metrics, common in object detection and segmentation, help assess classification accuracy, object coverage, and segmentation fidelity. Precision is the ratio of correctly predicted positive samples to all predicted positive ones. Recall is the proportion of correctly detected objects among all ground-truth objects. Accuracy gauges overall classification correctness, and the F1-score balances precision and recall. mAP@75 measures performance across different IoU thresholds, especially at 0.75. These metrics rely on True Positives (TPs), False Positives (FPs), False Negatives (FNs), and True Negatives (TNs). In segmentation, TPs are correctly detected objects, FPs are incorrect detections, and FNs are missed ones. Predictions with IoU > 0.5 or 0.75 are considered correct, i.e., they are TPs. Their formulas are:
$P = \frac{TP}{TP + FP}$
$R = \frac{TP}{TP + FN}$
$Acc = \frac{TP + TN}{TP + TN + FP + FN}$
$F_1 = \frac{2 \times P \times R}{P + R}$
$AP = \int_0^1 P(R)\, dR$
For multi-class tasks, mAP is the mean of all class-wise AP values ($mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$, where $N$ is the number of classes). In agricultural applications, mAP@75 uses a strict IoU threshold of 0.75 for accurate segmentation. Wheat growth monitoring requires precise classification of plant regions, accurate target counting, and robust boundary delineation, making these metrics crucial for assessing segmentation quality and counting stability. Precision reflects the ability to minimize false positives, ensuring that misclassified background regions are not falsely detected as wheat plants, which is essential for avoiding overestimation in yield prediction. Recall measures the proportion of correctly detected wheat targets among all ground-truth instances, ensuring the model captures all wheat plants, even in dense growth scenarios. Accuracy provides a broad evaluation of classification correctness but is less informative in highly imbalanced scenarios where background pixels dominate. The F1-score balances precision and recall, making it useful for dense wheat fields where accurate plant detection is necessary without excessive false positives or omissions. mAP@75 is particularly important for wheat growth monitoring because it evaluates segmentation performance under a strict IoU threshold of 0.75, ensuring precise plant boundary detection. In dense wheat growth environments, achieving a high mAP@75 value indicates that the model can accurately segment individual plants without merging adjacent crops, which is critical for monitoring wheat development and estimating biomass.
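For concreteness, a minimal sketch of these confusion-count formulas (the counts below are fabricated for the example):

```python
def detection_metrics(tp, fp, fn, tn=0):
    """Precision, Recall, Accuracy, and F1 from confusion counts, where a
    prediction counts as a TP when its IoU with a ground-truth instance
    exceeds the chosen threshold (0.75 for mAP@75)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, acc, f1

# Hypothetical counts from one validation image set.
print(detection_metrics(tp=87, fp=9, fn=13))
```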

3.5.4. Baseline

To evaluate our proposed wheat growth detection and counting model, we selected several state-of-the-art segmentation models as baselines: Tiny-Segformer [38], U-Net [39], Mask R-CNN [40], the Segment Anything Model (SAM) [41], and InternImage [42]. Tiny-Segformer is a lightweight transformer-based segmentation network suitable for edge devices. U-Net is a classic encoder–decoder with skip connections. Mask R-CNN combines object detection and instance segmentation. SAM uses prompt-based segmentation, and InternImage is a CNN–transformer hybrid. By comparing against these baselines, we comprehensively analyze the advantages of our proposed method for wheat growth detection and counting tasks.

4. Results and Discussion

4.1. Segmentation Accuracy Results

The objective of this experiment is to evaluate the segmentation accuracy of various semantic segmentation models in wheat growth monitoring and target counting tasks, as well as to validate the effectiveness of the proposed method in high-density target detection scenarios. By comparing Tiny-Segformer, UNet, Mask R-CNN, Segment Anything Model (SAM), InternImage, Swin Transformer, DETR, ViT, and the proposed method, the experiment primarily focuses on the models’ performance in Precision, Recall, Accuracy, mAP@75, and F1-score, aiming to comprehensively assess the adaptability of different methods in the wheat semantic segmentation task. Additionally, we include a comparison of model parameters (Parameters (M)) and computational complexity (FLOPS (G)) to evaluate the efficiency and feasibility of the proposed method for real-world deployment.
As shown in Table 2 and Figure 6, the proposed method achieved the best performance across all metrics, with Precision and Recall reaching 0.91 and 0.87, respectively, and the F1-score being 0.89, significantly outperforming other baseline models. Compared to InternImage, the proposed method’s mAP@75 was further improved to 0.88, indicating a stronger ability to identify target boundaries under high IoU thresholds. The relatively lower Recall suggests that some targets were missed, but the improvement in Precision indicates that the method reduced the likelihood of misclassifications during segmentation. This high-precision segmentation capability is critical for precise agricultural monitoring, particularly in densely growing wheat environments, where the accurate identification and counting of each plant directly impact yield predictions. Beyond segmentation accuracy, we further analyze the efficiency of the proposed method by reporting Parameters (M) and FLOPS (G). Parameters (M) represent the number of trainable parameters in the model, which directly affects memory consumption and model size. A lower parameter count is desirable for deployment on edge devices, while larger models often lead to improved performance at the cost of increased storage requirements. The proposed method achieves a balance between these factors, containing 15.2 million parameters, which is significantly smaller than Mask R-CNN (44.5 M) and SAM (641 M) but larger than Tiny-Segformer (5.2 M), ensuring high segmentation accuracy with a manageable model size.
FLOPS (G), on the other hand, quantifies the number of floating-point operations required for a single forward pass of the model. This metric is crucial for evaluating computational efficiency, as a lower FLOPS value indicates faster inference and lower power consumption. The proposed method requires 30.1 G FLOPS, which is significantly lower than SAM (2963 G) and Mask R-CNN (180 G), demonstrating that it is well-suited for deployment in agricultural monitoring applications where computational resources may be limited. Compared to InternImage, which has a FLOPS of 5G, the proposed method requires more computation but offers significantly better segmentation accuracy and robustness, making it a viable trade-off between efficiency and performance. It is important to note that Parameters (M) and FLOPS (G) are reported specifically in Table 2 because this table focuses on overall model evaluation. Other tables, such as those analyzing ablation studies or dataset-specific tests, primarily focus on comparative accuracy and robustness rather than computational efficiency. Including these metrics in Table 2 allows for a comprehensive assessment of both segmentation accuracy and computational feasibility in a single comparison. From a mathematical perspective, the performance of different models is greatly influenced by their structural design. As a lightweight transformer architecture, Tiny-Segformer benefits from computational efficiency; however, its local attention window limits its ability to model global information, resulting in slightly lower segmentation performance, with an F1-score of 0.79. UNet, utilizing a symmetric encoder–decoder structure and skip connections, preserves some high-resolution features, leading to improvements in both Precision and Recall, with an F1-score of 0.80. However, due to its reliance on CNNs, the model’s receptive field is relatively small, and boundary blurring may occur in dense target segmentation tasks. Mask R-CNN combines object detection and semantic segmentation capabilities, enhancing target region extraction with RPN, which results in an mAP@75 of 0.83. However, due to the fixed window size of RoIAlign, this method still faces information loss issues in dense target detection. SAM utilizes a prompt-based segmentation mechanism that effectively adapts to different target categories, improving Recall and achieving an F1-score of 0.85. InternImage, as a CNN-transformer hybrid model, enhances the fusion of local and global features through the Large Kernel Attention (LKA) mechanism, further improving segmentation performance. The proposed method integrates transformer-based global feature extraction with a symmetric diffusion mechanism and symmetric attention mechanism, reducing background noise interference while enhancing the segmentation accuracy of dense targets. Mathematically, the method adjusts the weight of local and global features through adaptive attention, ensuring accurate boundary segmentation even in high-density target scenarios, thereby improving Precision and mAP@75. Additionally, the proposed aggregation loss function effectively optimizes semantic segmentation loss, target counting loss, and boundary loss, increasing Accuracy to 0.89 and ensuring the model’s stability across different target densities. Computational complexity is a key factor in evaluating the deployability of deep learning models. 
Although lightweight models such as MobileNet and EfficientNet offer advantages in computational cost, their limited receptive fields and feature extraction capabilities often constrain segmentation accuracy in high-density object scenarios. The proposed method enhances feature extraction effectiveness by optimizing the attention mechanism and diffusion computation, enabling high inference efficiency despite an increase in computational cost. Additionally, mixed-precision computation (AMP) and tensor optimization strategies are employed during inference to reduce computational redundancy, further improving efficiency.

4.2. Growth Measurement Results

This experiment evaluates the accuracy of various segmentation models in wheat growth measurement and assesses the proposed method’s adaptability in complex field environments. Accurate growth measurement enhances plant counting, leaf coverage estimation, and spatial distribution analysis across growth stages.
As shown in Table 3, the proposed method achieves the highest scores on all metrics, with Precision of 0.95, Recall of 0.90, and mAP@75 of 0.92, significantly outperforming the other models. This indicates fewer false positives and improved stability in wheat segmentation. Compared with InternImage, which achieves an Accuracy of 0.91, the proposed method shows better boundary detection and feature retention. While SAM and Mask R-CNN perform well globally, they suffer from mis-segmentation in dense target environments. UNet and the lightweight Tiny-Segformer struggle with boundary details, leading to lower F1-scores (0.84 and 0.82, respectively).

From a computational perspective, Tiny-Segformer employs local window attention with $\mathcal{O}(N \cdot W)$ complexity, limiting global feature modeling and reducing Recall in dense regions. UNet's encoder–decoder structure preserves spatial information, but its localized convolutions cause boundary blurring. Mask R-CNN, trained with the loss $L = L_{cls} + L_{box} + L_{mask}$, segments objects accurately but is constrained by the fixed RoIAlign window. SAM leverages attention-based segmentation with residual connections, $Y = \mathrm{Attention}(WX) + X$, for global feature extraction, enhancing boundary segmentation. InternImage integrates Large Kernel Attention (LKA) with self-attention, improving feature extraction at a higher computational cost. The proposed method enhances boundary segmentation through symmetric diffusion,

$F_{diff}(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} A_{i,j} \, F(x+i, y+j),$

combining transformer-based global extraction with adaptive diffusion computation. The symmetric attention mechanism further stabilizes features while suppressing background noise, and the aggregated loss function jointly optimizes segmentation, counting, and boundary accuracy, ensuring stable performance across varying densities.
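To ground the diffusion formula above, here is a minimal PyTorch sketch of the neighborhood sum $F_{diff}(x,y) = \sum_{i,j} A_{i,j} F(x+i, y+j)$. Treating the window weights $A_{i,j}$ as a single learnable matrix shared across channels, and realizing the sum as a depthwise convolution, are my assumptions; the paper's full module additionally adapts the weights to feature content.

```python
# Sketch of the symmetric neighborhood diffusion, under the stated
# assumptions (shared learnable window A, depthwise-convolution form).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricDiffusionSketch(nn.Module):
    def __init__(self, k: int = 1):
        super().__init__()
        self.k = k
        win = 2 * k + 1
        # A_{i,j}: initialized as a uniform average over the window
        self.A = nn.Parameter(torch.full((win, win), 1.0 / win**2))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        c = feat.shape[1]
        # Broadcast the shared window weights into a depthwise kernel (C,1,K,K)
        kernel = self.A.unsqueeze(0).unsqueeze(0).expand(c, 1, -1, -1).contiguous()
        return F.conv2d(feat, kernel, padding=self.k, groups=c)

feat = torch.randn(2, 64, 128, 128)
out = SymmetricDiffusionSketch(k=1)(feat)
print(out.shape)  # torch.Size([2, 64, 128, 128]): spatial size is preserved
```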

4.3. Ablation Study of Different Attention Mechanisms

This experiment evaluates how different attention mechanisms impact wheat segmentation and growth measurement, comparing Standard Self-Attention, CBAM, and the proposed symmetric attention mechanism across five evaluation metrics.
As shown in Table 4, the symmetric attention mechanism consistently outperforms the other methods, with Precision reaching 0.95, Recall 0.90, and mAP@75 0.92 in the growth measurement task. CBAM achieves an F1-score of 0.85, while Standard Self-Attention lags at 0.74, highlighting its limitations in dense wheat segmentation. For semantic segmentation, the symmetric attention mechanism scores 0.89 in F1-score, outperforming CBAM (0.83) and Standard Self-Attention (0.72).

Mathematically, these performance differences stem from the mechanisms' computational structure. Standard Self-Attention models global feature interactions with $\mathcal{O}(N^2)$ complexity, making it computationally expensive and less stable on high-resolution agricultural images. CBAM enhances feature selection via channel and spatial attention,

$W_c = \sigma\big(W_1(F_{avg}) + W_2(F_{max})\big), \quad W_s = \sigma\big(\mathrm{Conv}(F, k{=}7)\big),$

with the fused output $F_{CBAM} = W_c \cdot F + W_s \cdot F$. While CBAM efficiently enhances local features, its separate processing of channel and spatial information limits global feature modeling. In contrast, the symmetric attention mechanism integrates global modeling with optimized local attention, $F_{SAM} = W_c \cdot F + W_s \cdot F$ (SAM here denoting the symmetric attention mechanism), and further stabilizes local feature interactions via multi-scale fusion. This enables adaptive weighting between global and local features, ensuring precise segmentation even in high-density wheat fields; the resulting Accuracy of 0.93 and Precision of 0.95 in growth measurement significantly outperform CBAM and Standard Self-Attention. Figure 7 presents heatmaps of the models' attention distributions during segmentation, illustrating how the symmetric attention mechanism improves boundary detection and feature clarity.
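To ground the CBAM formulas above, the sketch below gives a compact PyTorch rendering of the standard CBAM baseline being compared against. The shared MLP, reduction ratio of 16, and the 7 × 7 spatial convolution follow the published CBAM design; the paper's symmetric attention mechanism itself is not reproduced here, since its full definition is not given.

```python
# Standard CBAM baseline matching the formulas in the text (a sketch;
# the reduction ratio of 16 is the usual CBAM default, assumed here).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(            # shared W1/W2 MLP on pooled maps
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Channel attention: sigma(W1(F_avg) + W2(F_max))
        w_c = torch.sigmoid(self.mlp(f.mean((2, 3), keepdim=True))
                            + self.mlp(f.amax((2, 3), keepdim=True)))
        f = w_c * f
        # Spatial attention: sigma(Conv([avg; max], k=7)) over channel stats
        stats = torch.cat([f.mean(1, keepdim=True), f.amax(1, keepdim=True)], 1)
        w_s = torch.sigmoid(self.spatial(stats))
        return w_s * f

x = torch.randn(2, 64, 64, 64)
print(CBAM(64)(x).shape)  # torch.Size([2, 64, 64, 64])
```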

4.4. Ablation Study of Different Data Augmentation Strategies

The objective of this experiment is to evaluate the impact of different data augmentation strategies on model performance and to validate the effectiveness of the proposed method in this regard. Data augmentation is a commonly employed strategy, particularly in agricultural scenarios where wheat plants grow densely under complex environmental conditions. Challenges such as insufficient training data, occlusions, and variations in lighting conditions may affect model performance. Therefore, an appropriate data augmentation approach can effectively enhance the generalization ability of the model and improve its adaptability to complex environments. This experiment compares three classical data augmentation methods, namely CutMix, Mosaic, and GridMask, with the proposed method, analyzing their effects on Precision, Recall, Accuracy, Mean Average Precision at 75% IoU threshold (mAP@75), and F1-score.
As shown in Table 5, different data augmentation techniques affect model performance to varying degrees. CutMix exhibited relatively low Precision (0.68) and Recall (0.62). Mosaic improved on CutMix, with Precision of 0.72 and Recall of 0.69, whereas GridMask performed slightly worse, with Precision and Recall of 0.64 and 0.60, respectively. In contrast, the proposed method achieved the best performance across all metrics, with a Precision of 0.91, Recall of 0.87, mAP@75 of 0.88, and an F1-score of 0.89. These results demonstrate that the proposed augmentation strategy more effectively enhances the model's segmentation and counting capabilities, improving its generalization in agricultural environments.

From a mathematical perspective, different augmentation techniques reshape the feature space in distinct ways. CutMix increases training diversity by randomly blending regions from different images; however, because wheat plants share similar color and texture features, random blending can cause boundary confusion and degrade segmentation accuracy. Mosaic stitches four images together, improving adaptability to diverse backgrounds, but it may compromise object integrity and increase the risk of misclassification. GridMask applies structured occlusion that compels the model to focus on local features, yet in high-density object environments it may obscure critical features and lower Recall. The proposed method, designed around agricultural scene characteristics, employs adaptive regional augmentation so that the model learns key features at varying target densities, improving segmentation accuracy and stability.
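As an illustration of the region-mixing behavior discussed above, the sketch below applies CutMix jointly to a batch of images and their segmentation masks. The paper's adaptive regional augmentation is not specified in detail, so only the CutMix baseline is shown; the tensor shapes and Beta parameter are illustrative assumptions.

```python
# CutMix for a segmentation batch (baseline sketch, not the paper's method):
# a random rectangle from a permuted partner image (and its mask) is pasted in.
import torch

def cutmix(images: torch.Tensor, masks: torch.Tensor, alpha: float = 1.0):
    """images: (B,3,H,W) floats; masks: (B,H,W) class ids. Returns mixed copies."""
    b, _, h, w = images.shape
    perm = torch.randperm(b)                                  # partner for each sample
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    images, masks = images.clone(), masks.clone()
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    masks[:, y1:y2, x1:x2] = masks[perm, y1:y2, x1:x2]        # keep labels aligned
    return images, masks

imgs, msks = cutmix(torch.rand(4, 3, 256, 256), torch.zeros(4, 256, 256).long())
```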

4.5. Ablation Study on the Symmetric Diffusion Module

The objective of this experiment is to investigate the role of the symmetric diffusion module (SDM) in wheat instance segmentation and to analyze the impact of different diffusion strategies on model performance. Diffusion mechanisms are widely used in computer vision tasks to enhance feature representations, particularly in dense object detection, where an appropriate diffusion strategy can effectively improve the model’s ability to recognize object boundaries and reduce noise interference. This experiment compares three different settings: no diffusion, Gaussian diffusion, and the proposed symmetric diffusion module. The evaluation is conducted based on five key metrics.
As shown in Table 6, the model's segmentation performance is relatively poor when no diffusion is applied, with a Precision of only 0.70, a Recall of 0.63, and an mAP@75 of merely 0.67. This suggests that, in the absence of diffusion, the model struggles to capture boundary information effectively, resulting in incomplete object segmentation. The Gaussian diffusion approach partially mitigates this issue, improving Precision and Recall to 0.78 and 0.84, respectively, and raising mAP@75 to 0.80; Gaussian diffusion therefore enhances feature propagation to some extent and improves boundary continuity, but it still exhibits clear limitations. The proposed method achieves the best performance across all metrics, with a Precision of 0.91, Recall of 0.87, mAP@75 of 0.88, and F1-score of 0.89, indicating that the symmetric diffusion module optimizes object boundaries more effectively and improves segmentation stability and accuracy.

From a mathematical perspective, the performance differences stem primarily from how the diffusion is computed. Without diffusion, the model relies on self-attention for feature extraction; in dense object environments, self-attention struggles to establish stable boundary representations within a limited receptive field, yielding suboptimal Recall and mAP@75. Gaussian diffusion smooths features with fixed convolution kernels, leveraging neighborhood information to enhance local consistency; however, its diffusion weights are determined solely by spatial position and do not adapt to feature content, which can introduce boundary blurring in complex environments. The proposed symmetric diffusion module instead uses an adaptive feature diffusion strategy in which the weights are determined by feature similarity, reducing feature propagation in high-density regions and broadening diffusion in low-density regions. This minimizes boundary blurring and enables the model to capture object contours accurately even in dense target environments, explaining its superior mAP@75 and F1-score.
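To make the contrast concrete, the sketch below implements both baselines under stated assumptions: a fixed Gaussian kernel whose weights depend only on spatial position, and a similarity-weighted neighborhood average in the spirit of the adaptive strategy described above (the actual module's weighting rule is not published, so the softmax-over-similarity form is an assumption).

```python
# Fixed Gaussian diffusion vs. a similarity-adaptive variant (sketch).
import torch
import torch.nn.functional as F

def gaussian_kernel(k: int = 1, sigma: float = 1.0) -> torch.Tensor:
    ax = torch.arange(-k, k + 1, dtype=torch.float32)
    g = torch.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()  # weights depend only on position, not content

def similarity_diffusion(feat: torch.Tensor, k: int = 1) -> torch.Tensor:
    b, c, h, w = feat.shape
    win = 2 * k + 1
    # Gather each pixel's neighborhood: (B, C, win*win, H*W)
    nbr = F.unfold(feat, win, padding=k).view(b, c, win * win, h * w)
    center = feat.view(b, c, 1, h * w)
    # Weight neighbors by feature similarity (negative squared distance)
    sim = -((nbr - center) ** 2).sum(1)           # (B, win*win, H*W)
    wgt = torch.softmax(sim, dim=1).unsqueeze(1)  # normalized per pixel
    return (wgt * nbr).sum(2).view(b, c, h, w)

feat = torch.randn(1, 8, 32, 32)
g = gaussian_kernel(1, 1.0).unsqueeze(0).unsqueeze(0).expand(8, 1, -1, -1)
smoothed = F.conv2d(feat, g.contiguous(), padding=1, groups=8)  # fixed weights
adaptive = similarity_diffusion(feat, k=1)                      # content-adaptive
```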

4.6. Test on Other Dataset

Table 7 and Figure 8 present the performance of different models on the Kaggle dataset, including metrics such as Precision, Recall, Accuracy, mAP@75, and F1-score. From these experimental results, it is evident that the proposed method outperforms other baseline models across all evaluation metrics, particularly in Precision and Recall, where it achieves 0.89 and 0.85, respectively, significantly higher than other models like Tiny-Segformer and UNet.

4.7. Application Deployment

Table 8 presents the frames per second (FPS) achieved by different models on two deployment platforms, which reflects their computational efficiency and runtime speed. Comparing the proposed method with the baseline models (Tiny-Segformer, UNet, Mask R-CNN, SAM, and InternImage) on the Jetson and iPhone 14 Pro Max platforms shows that the proposed method runs fastest on both. On the Jetson platform it reaches 50 FPS, well above Tiny-Segformer (24 FPS) and UNet (20 FPS), indicating high computational efficiency on edge computing hardware. On the iPhone 14 Pro Max it achieves 39 FPS, again exceeding the baselines. These results support the cost-reduction claim made earlier: a higher FPS means more images can be processed in the same amount of time, improving throughput and lowering operational costs.
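For context on how FPS figures such as those in Table 8 are typically obtained, the sketch below times repeated forward passes after a warm-up phase. The paper does not document its exact benchmarking protocol, so the input resolution, run counts, and device handling here are assumptions.

```python
# Generic FPS benchmark sketch (the paper's actual protocol is not stated).
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 512, 512), runs=100, warmup=10):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):          # warm-up stabilizes clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # ensure queued GPU work has finished
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)

toy = torch.nn.Conv2d(3, 2, kernel_size=1)   # stand-in model for illustration
print(f"{measure_fps(toy):.1f} FPS")
```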

5. Conclusions

A wheat growth and counting analysis model based on instance segmentation is proposed in this study to address the challenges of wheat growth detection in agricultural production. With the advance of digitalization and automation in agriculture, traditional manual monitoring can no longer meet the demands for efficiency and precision, particularly in large-scale farmland applications. The innovations of this work are twofold. First, a transformer-based symmetric diffusion segmentation network is proposed that, by incorporating the symmetric diffusion module and symmetric attention mechanism, effectively addresses high-density target detection in wheat instance segmentation. Second, an aggregated loss function is introduced that further improves task accuracy by jointly optimizing the instance segmentation outputs.

The experimental results demonstrate significant advantages across multiple evaluation metrics. In wheat instance segmentation, the model with the symmetric attention mechanism achieves a Precision of 0.91, Recall of 0.87, Accuracy of 0.89, mAP@75 of 0.88, and an F1-score of 0.89, significantly outperforming the baseline methods. In growth measurement, Precision reaches 0.95, Recall 0.90, and mAP@75 0.92, validating the effectiveness of the symmetric attention mechanism and symmetric diffusion module for high-density target detection. Ablation studies on attention mechanisms further confirm that, compared with standard self-attention and CBAM, the proposed symmetric attention mechanism improves the interaction between local and global features, strengthening performance in complex scenarios. Beyond proposing a new instance segmentation method, this study provides a practical intelligent solution for wheat growth detection and demonstrates the potential of deep learning techniques in precision agriculture.
Although the proposed wheat growth and counting analysis model demonstrates significant advantages in several respects, certain limitations remain that future research can address. First, the experiments were based primarily on a single wheat dataset for training and testing; although good results were achieved, the diversity and coverage of the data remain limiting factors. Future work should incorporate wheat images from different regions, climatic conditions, and growth stages to improve the model's generalization ability and robustness. Second, while the symmetric attention mechanism and symmetric diffusion module perform well in high-density target detection, there is still room to improve computational efficiency given constraints on computational resources and training data. Future studies could optimize the architecture with more efficient or lightweight designs (such as MobileNet or EfficientNet) to increase inference speed and enable deployment on resource-constrained edge devices. Third, although the aggregated loss function improves growth measurement accuracy, the model may still be affected by noise, occlusion, and lighting variations in complex environments; more sophisticated noise handling and data cleaning during preprocessing and training could enhance its stability. We also plan to integrate additional biotic and abiotic variables, such as climate data, soil characteristics, and crop varieties, to improve robustness and predictive accuracy across diverse agricultural settings. Finally, we aim to couple segmentation and counting results with plant physiological models, exploring how morphological features affect crop growth and yield; this will enable more accurate analysis of crop dynamics and better yield prediction, further enhancing the model's practical value in precision agriculture.

Author Contributions

Conceptualization, Z.J., W.H., Y.W. and C.L.; Data curation, B.Z. and Z.S.; Formal analysis, C.J., Z.S. and S.L.; Funding acquisition, C.L.; Investigation, C.J.; Methodology, Z.J., W.H., Y.W. and C.L.; Project administration, C.L.; Resources, B.Z.; Software, Z.J., W.H. and Y.W.; Supervision, S.L.; Validation, C.J. and S.L.; Visualization, B.Z. and Z.S.; Writing—original draft, Z.J., W.H., Y.W., C.J., B.Z., Z.S., S.L. and C.L. Z.J. and W.H. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to express their sincere gratitude to the Computer Association of China Agricultural University (ECC) for their valuable technical support. Upon the acceptance of this paper, the project code and the dataset will be made publicly available to facilitate further research and development in this field.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Mottaleb, K.A.; Kruseman, G.; Snapp, S. Potential impacts of Ukraine-Russia armed conflict on global wheat food security: A quantitative exploration. Glob. Food Secur. 2022, 35, 100659.
2. Zhang, Y.; Wa, S.; Zhang, L.; Lv, C. Automatic plant disease detection based on tranvolution detection network with GAN modules using leaf images. Front. Plant Sci. 2022, 13, 875693.
3. Zhang, Y.; Wa, S.; Liu, Y.; Zhou, X.; Sun, P.; Ma, Q. High-accuracy detection of maize leaf diseases CNN based on multi-pathway activation function module. Remote Sens. 2021, 13, 4218.
4. Alkhudaydi, T.; De la Iglesia, B. Counting spikelets from infield wheat crop images using fully convolutional networks. Neural Comput. Appl. 2022, 34, 17539–17560.
5. Ye, J.; Yu, Z.; Wang, Y.; Lu, D.; Zhou, H. WheatLFANet: In-field detection and counting of wheat heads with high-real-time global regression network. Plant Methods 2023, 19, 103.
6. Shammi, S.A.; Meng, Q. Use time series NDVI and EVI to develop dynamic crop growth metrics for yield modeling. Ecol. Indic. 2021, 121, 107124.
7. Yu, J.; Zhang, S.; Zhang, Y.; Hu, R.; Lawi, A.S. Construction of a winter wheat comprehensive growth monitoring index based on a fuzzy degree comprehensive evaluation model of multispectral UAV data. Sensors 2023, 23, 8089.
8. Li, S.; Jin, G.; Lou, Z. Research on the extraction of winter wheat planting area in Henan Province based on time-series EVI. In Proceedings of the Second International Conference on Geographic Information and Remote Sensing Technology (GIRST 2023), Qingdao, China, 21–23 July 2023; Volume 12797, pp. 237–242.
9. Maheswaran, S.; Arunkumar, T.; Gomathi, R.; Rithika, P.; Shafiq, I.M.; Nandita, S. Blockchain-Based Digital Twin for Agricultural Field Management. In Blockchain-Based Digital Twins: Research Trends and Challenges; CRC Press: Boca Raton, FL, USA, 2025; p. 279.
10. Wang, D.; Zhang, D.; Yang, G.; Xu, B.; Luo, Y.; Yang, X. SSRNet: In-field counting wheat ears using multi-stage convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4403311.
11. Sun, J.; Yang, K.; Chen, C.; Shen, J.; Yang, Y.; Wu, X.; Norton, T. Wheat head counting in the wild by an augmented feature pyramid networks-based convolutional neural network. Comput. Electron. Agric. 2022, 193, 106705.
12. Li, L.; Hassan, M.A.; Yang, S.; Jing, F.; Yang, M.; Rasheed, A.; Wang, J.; Xia, X.; He, Z.; Xiao, Y. Development of image-based wheat spike counter through a Faster R-CNN algorithm and application for genetic studies. Crop J. 2022, 10, 1303–1311.
13. Yousafzai, S.N.; Nasir, I.M.; Tehsin, S.; Fitriyani, N.L.; Syafrudin, M. FLTrans-Net: Transformer-based feature learning network for wheat head detection. Comput. Electron. Agric. 2025, 229, 109706.
14. Hong, Q.; Liu, W.; Zhu, Y.; Ren, T.; Shi, C.; Lu, Z.; Yang, Y.; Deng, R.; Qian, J.; Tan, C. CTHNet: A network for wheat ear counting with local-global features fusion based on hybrid architecture. Front. Plant Sci. 2024, 15, 1425131.
15. Guan, S.; Lin, Y.; Lin, G.; Su, P.; Huang, S.; Meng, X.; Liu, P.; Yan, J. Real-time detection and counting of wheat spikes based on improved YOLOv10. Agronomy 2024, 14, 1936.
16. Yao, Z.; Zhang, D.; Tian, T.; Zain, M.; Zhang, W.; Yang, T.; Song, X.; Zhu, S.; Liu, T.; Ma, H.; et al. APW: An ensemble model for efficient wheat spike counting in unmanned aerial vehicle images. Comput. Electron. Agric. 2024, 224, 109204.
17. Sun, X.; Jiang, T.; Hu, J.; Song, Z.; Ge, Y.; Wang, Y.; Liu, X.; Bing, J.; Li, J.; Zhou, Z.; et al. Counting wheat heads using a simulation model. Comput. Electron. Agric. 2025, 228, 109633.
18. Mo, Y.; Wu, Y.; Yang, X.; Liu, F.; Liao, Y. Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing 2022, 493, 626–646.
19. Li, Q.; Zhang, Y. Confidential Federated Learning for Heterogeneous Platforms against Client-Side Privacy Leakages. In Proceedings of the ACM Turing Award Celebration Conference 2024, Changsha, China, 5–7 July 2024; pp. 239–241.
20. Li, Q.; Zhang, Y.; Ren, J.; Li, Q.; Zhang, Y. You Can Use But Cannot Recognize: Preserving Visual Privacy in Deep Neural Networks. arXiv 2024, arXiv:2404.04098.
21. Luo, Z.; Yang, W.; Yuan, Y.; Gou, R.; Li, X. Semantic segmentation of agricultural images: A survey. Inf. Process. Agric. 2023, 11, 172–186.
22. Anand, T.; Sinha, S.; Mandal, M.; Chamola, V.; Yu, F.R. AgriSegNet: Deep aerial semantic segmentation framework for IoT-assisted precision agriculture. IEEE Sens. J. 2021, 21, 17581–17590.
23. Kitzler, F.; Barta, N.; Neugschwandtner, R.W.; Gronauer, A.; Motsch, V. WE3DS: An RGB-D image dataset for semantic segmentation in agriculture. Sensors 2023, 23, 2713.
24. Peng, M.; Liu, Y.; Qadri, I.A.; Bhatti, U.A.; Ahmed, B.; Sarhan, N.M.; Awwad, E. Advanced image segmentation for precision agriculture using CNN-GAT fusion and fuzzy C-means clustering. Comput. Electron. Agric. 2024, 226, 109431.
25. Li, Q.; Ren, J.; Zhang, Y.; Song, C.; Liao, Y.; Zhang, Y. Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023; pp. 1–6.
26. Zhang, Y.; Yang, X.; Liu, Y.; Zhou, J.; Huang, Y.; Li, J.; Zhang, L.; Ma, Q. A time-series neural network for pig feeding behavior recognition and dangerous detection from videos. Comput. Electron. Agric. 2024, 218, 108710.
27. Cui, J.; Tan, F.; Bai, N.; Fu, Y. Improving U-net network for semantic segmentation of corns and weeds during corn seedling stage in field. Front. Plant Sci. 2024, 15, 1344958.
28. Mahmud, M.N.; Osman, M.K.; Ismail, A.P.; Ahmad, F.; Ahmad, K.A.; Ibrahim, A. Road image segmentation using unmanned aerial vehicle images and DeepLab V3+ semantic segmentation model. In Proceedings of the 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, 27–28 August 2021; pp. 176–181.
29. Shen, Y.; Wang, L.; Jin, Y. AAFormer: A multi-modal transformer network for aerial agricultural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1705–1711.
30. Fu, X.; Ma, Q.; Yang, F.; Zhang, C.; Zhao, X.; Chang, F.; Han, L. Crop pest image recognition based on the improved ViT method. Inf. Process. Agric. 2024, 11, 249–259.
31. Cheng, Z.Q.; Dai, Q.; Li, H.; Song, J.; Wu, X.; Hauptmann, A.G. Rethinking spatial invariance of convolutional networks for object counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19638–19648.
32. You, Z.; Yang, K.; Luo, W.; Lu, X.; Cui, L.; Le, X. Few-shot object counting with similarity-aware feature enhancement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6315–6324.
33. Khaki, S.; Safaei, N.; Pham, H.; Wang, L. WheatNet: A lightweight convolutional neural network for high-throughput image-based wheat head detection and counting. Neurocomputing 2022, 489, 78–89.
34. Wang, Y.; Hou, J.; Hou, X.; Chau, L.P. A self-training approach for point-supervised object detection and counting in crowds. IEEE Trans. Image Process. 2021, 30, 2876–2887.
35. Hallin, A.; Isaacson, J.; Kasieczka, G.; Krause, C.; Nachman, B.; Quadfasel, T.; Schlaffer, M.; Shih, D.; Sommerhalder, M. Classifying anomalies through outer density estimation. Phys. Rev. D 2022, 106, 055006.
36. Jiang, L.; Wang, Y.; Wu, C.; Wu, H. Fruit Distribution Density Estimation in YOLO-Detected Strawberry Images: A Kernel Density and Nearest Neighbor Analysis Approach. Agriculture 2024, 14, 1848.
37. Wang, Y.; Wang, P.; Tansey, K.; Liu, J.; Delaney, B.; Quan, W. An interpretable approach combining Shapley additive explanations and LightGBM based on data augmentation for improving wheat yield estimates. Comput. Electron. Agric. 2025, 229, 109758.
38. Zhang, Y.; Lv, C. TinySegformer: A lightweight visual segmentation model for real-time agricultural pest detection. Comput. Electron. Agric. 2024, 218, 108740.
39. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015.
40. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
41. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643.
42. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419.
Figure 1. Dataset samples. (A) Sparse. (B,C) Moderate. (D,E) Dense.
Figure 2. Data annotation: (a) CutMix; (b) Mosaic; (c) GridMask.
Figure 3. Overall framework of the wheat growth and counting analysis model based on semantic segmentation.
Figure 4. Diagram of the symmetric diffusion module.
Figure 5. Schematic of the symmetric attention mechanism.
Figure 6. The visual results of the method on our dataset.
Figure 7. Heatmap demonstrating the effectiveness of the symmetric attention mechanism.
Figure 8. The visual results of the method on the Kaggle dataset.
Table 1. Data quantity for different types of wheat.

| Density | Data (Total) | UAV | Camera |
|---|---|---|---|
| Dense | 1682 | 693 | 989 |
| Moderate | 1804 | 962 | 842 |
| Sparse | 1752 | 837 | 915 |
Table 2. Experimental results of wheat growth and counting models.

| Model | Precision | Recall | Accuracy | mAP@75 | F1-Score | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| Tiny-Segformer | 0.80 | 0.77 | 0.79 | 0.79 | 0.79 | 5.2 | 4.4 |
| UNet | 0.82 | 0.79 | 0.81 | 0.81 | 0.80 | 7.76 | 65 |
| Mask R-CNN | 0.84 | 0.82 | 0.83 | 0.83 | 0.82 | 44.5 | 180 |
| SAM | 0.86 | 0.84 | 0.85 | 0.85 | 0.85 | 641 | 2963 |
| Swin Transformer | 0.83 | 0.80 | 0.82 | 0.81 | 0.81 | 88 | 15.4 |
| DETR | 0.85 | 0.81 | 0.82 | 0.83 | 0.82 | 41 | 86 |
| ViT | 0.87 | 0.84 | 0.85 | 0.86 | 0.85 | 307 | 190.7 |
| InternImage | 0.89 | 0.85 | 0.87 | 0.86 | 0.86 | 30 | 5 |
| Proposed Method | 0.91 | 0.87 | 0.89 | 0.88 | 0.89 | 15.2 | 30.1 |
Table 3. Experimental results of wheat growth measurement models.

| Model | Precision | Recall | Accuracy | mAP@75 | F1-Score |
|---|---|---|---|---|---|
| Tiny-Segformer | 0.83 | 0.80 | 0.82 | 0.81 | 0.82 |
| UNet | 0.85 | 0.82 | 0.84 | 0.83 | 0.84 |
| Mask R-CNN | 0.88 | 0.85 | 0.86 | 0.85 | 0.87 |
| SAM | 0.89 | 0.87 | 0.88 | 0.87 | 0.88 |
| InternImage | 0.92 | 0.89 | 0.91 | 0.90 | 0.91 |
| Proposed Method | 0.95 | 0.90 | 0.93 | 0.92 | 0.92 |
Table 4. Ablation study of different attention mechanisms.

| Model | Precision | Recall | Accuracy | mAP@75 | F1-Score |
|---|---|---|---|---|---|
| Segmentation - Standard Self-Attention | 0.73 | 0.70 | 0.72 | 0.71 | 0.72 |
| Segmentation - CBAM | 0.85 | 0.82 | 0.83 | 0.82 | 0.83 |
| Segmentation - Symmetric Attention | 0.91 | 0.87 | 0.89 | 0.88 | 0.89 |
| Measurement - Standard Self-Attention | 0.77 | 0.72 | 0.74 | 0.73 | 0.74 |
| Measurement - CBAM | 0.88 | 0.84 | 0.86 | 0.85 | 0.85 |
| Measurement - Symmetric Attention | 0.95 | 0.90 | 0.93 | 0.92 | 0.92 |
Table 5. Ablation study of different data augmentation strategies.

| Model | Precision | Recall | Accuracy | mAP@75 | F1-Score |
|---|---|---|---|---|---|
| CutMix | 0.68 | 0.62 | 0.65 | 0.65 | 0.64 |
| Mosaic | 0.72 | 0.69 | 0.71 | 0.70 | 0.70 |
| GridMask | 0.64 | 0.60 | 0.62 | 0.61 | 0.62 |
| Proposed Method | 0.91 | 0.87 | 0.89 | 0.88 | 0.89 |
Table 6. Ablation study on the symmetric diffusion module.

| Model | Precision | Recall | Accuracy | mAP@75 | F1-Score |
|---|---|---|---|---|---|
| No diffusion | 0.70 | 0.63 | 0.68 | 0.67 | 0.64 |
| Gaussian diffusion | 0.78 | 0.84 | 0.82 | 0.80 | 0.81 |
| Proposed Method | 0.91 | 0.87 | 0.89 | 0.88 | 0.89 |
Table 7. Experimental results of wheat growth and counting models on the Kaggle dataset.

| Model | Precision | Recall | Accuracy | mAP@75 | F1-Score |
|---|---|---|---|---|---|
| Tiny-Segformer | 0.78 | 0.75 | 0.76 | 0.77 | 0.76 |
| UNet | 0.81 | 0.76 | 0.79 | 0.78 | 0.78 |
| Mask R-CNN | 0.82 | 0.80 | 0.81 | 0.80 | 0.81 |
| SAM | 0.84 | 0.81 | 0.82 | 0.83 | 0.82 |
| Swin Transformer | 0.83 | 0.81 | 0.82 | 0.82 | 0.81 |
| DETR | 0.85 | 0.82 | 0.83 | 0.84 | 0.82 |
| ViT | 0.86 | 0.83 | 0.84 | 0.84 | 0.85 |
| InternImage | 0.86 | 0.82 | 0.84 | 0.83 | 0.82 |
| Proposed Method | 0.89 | 0.85 | 0.87 | 0.86 | 0.85 |
Table 8. FPS of different models on different platforms.

| Platform | Tiny-Segformer | UNet | Mask R-CNN | SAM | InternImage | Proposed Method |
|---|---|---|---|---|---|---|
| Jetson | 24 | 20 | 3 | – | 31 | 50 |
| iPhone 14 Pro Max | 18 | 13 | 2 | – | 7 | 39 |