Article

GCF-DeepLabv3+: An Improved Segmentation Network for Maize Straw Plot Classification

1 College of Information and Technology & Smart Agriculture Research Institute, Jilin Agricultural University, Changchun 130118, China
2 College of Engineering and Technology, Jilin Agricultural University, Changchun 130118, China
3 Jilin Academy of Agricultural Sciences, Changchun 130033, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(5), 1011; https://doi.org/10.3390/agronomy15051011
Submission received: 31 March 2025 / Revised: 18 April 2025 / Accepted: 21 April 2025 / Published: 22 April 2025

Abstract

To meet the need for rapid identification of straw coverage types in conservation tillage fields, we investigated the use of unmanned aerial vehicle (UAV) low-altitude remote sensing images for accurate detection. UAVs were used to capture images of conservation tillage farmland, and an improved GCF-DeepLabv3+ model was used to detect straw coverage types. The model adopts StarNet as its backbone, reducing the parameter count and computational complexity. It further integrates a Multi-Kernel Convolution Feedforward Network with Fast Fourier Transform Convolutional Block Attention Module (MKC-FFN-FTCM) and a Gated Conv-Former Block (Gated-CFB) to improve the segmentation of fine plot details. Experimental results demonstrate that GCF-DeepLabv3+ outperforms other methods in segmentation accuracy, computational efficiency, and robustness: it has 3.19M parameters and 41.19G FLOPs (floating point operations), with a mean Intersection over Union (MIoU) of 93.97%. These findings indicate that the proposed GCF-DeepLabv3+-based rapid detection method offers robust support for straw return detection.

1. Introduction

Conservation tillage is a vital agricultural practice with significant benefits for soil health, water conservation, and carbon sequestration. By reducing soil disturbance and preserving crop residues on the soil surface, conservation tillage promotes sustainable agriculture and mitigates issues such as soil erosion, nutrient depletion, and climate change [1]. Over 1 billion tons of straw are produced annually, much of which is discarded without effective utilization, leading to substantial environmental and social challenges [2,3]. Accurate and efficient straw return detection is critical for implementing conservation tillage and fostering sustainable agricultural development.
The Ministry of Agriculture and Rural Affairs released the “Technical Guidelines for the 2021 Northeast Blackland Conservation Tillage Action Plan” to standardize conservation tillage practices. The guidelines also defined complete and partial straw covers, categorizing them into whole straw cover, root stubble cover, residue stubble cover, and crushed cover [4,5]. Based on coverage rate, straw cover is classified as either complete (≥70%) or partial (≥30% and <70%). Traditional methods for assessing large-scale straw cover in Northeast China, such as visual inspection and rope measurements, are inefficient and subject to bias.
Advances in science and technology have facilitated the development of information-based detection methods. Zhu et al. [6] used GF-1 satellite images combined with machine learning algorithms to estimate winter wheat residue cover. Memon et al. [7,8] used remote sensing satellites Sentinel-2B and Landsat-8, respectively, to estimate straw cover. Li et al. [9] employed Landsat-8 OLI data to estimate wheat residue cover. Liu et al. [10] employed traditional image processing algorithms based on multi-threshold methods to detect straw coverage. Cai et al. [11] used remote sensing images for straw detection. Riegler-Nurscher et al. [12] proposed a random decision forest algorithm to classify residues in soil. However, traditional image processing algorithms, while computationally efficient, are often sensitive to environmental variations such as lighting conditions and background noise, which can significantly affect detection accuracy. Similarly, satellite-based remote sensing technologies, while offering broad coverage, suffer from limitations such as low spatial resolution, limited revisit frequency, cloud interference, and high data acquisition costs. For instance, Sentinel-2B and Landsat-8 satellites have spatial resolutions of 10–30 m, which may not be sufficient to capture subtle differences in straw coverage at the plot level. Moreover, their fixed orbital paths limit the temporal flexibility necessary for precise and timely agricultural monitoring. Additionally, researchers have conducted studies based on deep learning, such as Liu et al. [13] and Yu et al. [14], who used unmanned aerial vehicles (UAVs) to capture images for detection. We found that using UAVs to obtain field data is not only more convenient but also more practical, especially for tasks requiring high spatial detail and timely image acquisition. UAVs can be deployed flexibly according to specific agricultural schedules and capture ultra-high-resolution images that reveal subtle differences in surface features. This makes them particularly suited for detecting variations in straw cover across different plots. In studies using RGB images of agricultural corn straw, both traditional machine vision [15,16] and deep learning methods [17,18] have been applied for straw cover detection. These methods are effective for estimating straw cover rates in crushed forms but struggle with more complex coverage types and can be computationally slow.
Traditional machine learning performs well on regularly shaped plots [10,19], but deep learning excels in handling complex straw distribution patterns in agricultural fields. Aung et al. [20] achieved a Dice coefficient of 0.81 in parcel segmentation using the modified U-Net (spatio-temporal U-Net). Shao et al. [21] achieved an average precision (AP) of 0.72 in straw coverage detection using the improved Mask R-CNN algorithm. Their approach aims to enhance robustness against varying environmental conditions, such as changes in illumination and background interference, leading to a significant improvement in segmentation accuracy. Zhou et al. [18] achieved a mean accuracy of 93.70%, a mean IU of 81.04% in image segmentation, and a mean absolute deviation of 3.56% in straw coverage detection using the ResNet18-UNet algorithm.
Recently, Convolutional Neural Networks (CNNs) have demonstrated significant advantages in target classification [22,23], object detection [24,25], and semantic segmentation [26,27]. Compared to traditional machine learning algorithms, CNNs offer automatic feature extraction, higher precision and performance, end-to-end learning, scalability, and robustness to variations in input data. Unlike traditional methods, which rely heavily on manual feature engineering and separate optimization stages, CNNs learn and optimize the entire pipeline within a single framework. This reduces the need for extensive domain expertise and is particularly effective for handling large-scale datasets and complex patterns in tasks such as image classification, object detection, and semantic segmentation.
DeepLabv3+ [26] has been recently applied to remote sensing (RS) segmentation tasks [28,29,30]. Due to its excellent capability in handling complex image segmentation tasks, DeepLabv3+ can be effectively trained on large-scale datasets, excelling in processing extensive images and segmenting simpler scenes. This makes DeepLabv3+ particularly suitable for detecting ground straw cover types, especially in conservation tillage environments. However, current deep learning models still have limitations. Many existing algorithms face issues such as overfitting, low computational efficiency, and sensitivity to variations in image quality and environmental conditions. To address these challenges, we improved DeepLabv3+ by incorporating a Multi-Kernel Convolution Feedforward Network with Fast Fourier Transform Convolutional Block Attention Module (MKC-FFN-FTCM) and a Gated Conv-Former Block (GCFB) [31,32]. This enhancement aims to improve the model’s ability to focus on relevant features, reduce computational overhead, and enhance the robustness of straw cover type detection under varying conditions.
In this study, to address the limitations of traditional machine learning algorithms, we proposed a ground straw cover type detection algorithm based on unmanned aerial vehicles (UAVs) and an improved neural network (DeepLabv3+). This algorithm offers high segmentation accuracy, fast convergence, automatic feature abstraction and classification, reduced parameters, and simplified computation, thanks to the gating mechanism’s ability to automatically select important features. Comparisons with other segmentation algorithms confirm the advantages of this approach.

2. Materials and Methods

2.1. Study Site and Data Acquisition

The study sites were chosen from three representative agricultural fields in Changchun City, Yushu City, and Dewei City in Jilin Province, China (Figure 1) (125.3893342° E, 43.8168784° N; 125.6564316° E, 44.5830458° N; and 126.2287119° E, 45.0979541° N, respectively). These fields are typical black soil regions that have actively implemented straw return policies following the Northeast Black Soil Conservation Tillage Initiative introduced in 2021. A unique feature of these sites is the comprehensive inclusion of all types of corn straw coverage, which ensures both diversity and representativeness in the data.
Imaging was conducted using a DJI Matrice 200 V2 drone equipped with a Zenmuse X5S gimbal camera, sourced from DJI, a manufacturer based in Shenzhen, China (Figure 2a). The camera was set to Shutter Priority mode (ISO-100), with a shutter speed of 1/240–1/320 s in sunny and windy conditions, and 1/80–1/120 s under favorable weather. These settings maintained the exposure value (EV) between 0 and +0.7, yielding a ground sampling distance (GSD) of 1.5–2.1 cm per pixel. To ensure image quality, data collection was primarily conducted between 8:00 AM and 4:00 PM. During flights, the drone maintained a speed of 2 m/s at an altitude of 50 or 60 m above ground, with the camera positioned vertically downward and both lateral and longitudinal overlap set to 80%. Original images were stored in JPG format at high resolution, with each photo containing GPS coordinates.

2.2. Data Processing and Dataset Production

The collected images underwent geometric correction and illumination correction to ensure accuracy and consistency. Additionally, OpenCV [33] was used for denoising and color correction to maintain the quality of the dataset. Subsequently, image stitching was performed to merge multiple images into a complete remote sensing image. The stitched images were then segmented to reduce the size of UAV-acquired images, thereby decreasing the input image size for the neural network, reducing memory consumption, and improving training efficiency. Microsoft Image Composite Editor (version 2.0.3.0) was utilized for remote sensing image stitching. Figure 3 illustrates the stitched remote sensing images and their corresponding straw coverage types. The stitched and annotated images were cropped into 512 × 512 pixel patches with a stride of 256 pixels. Images with black borders covering more than one-tenth of their total area were excluded from the dataset. The final dataset consisted of 20,160 images, which were divided into training, validation, and test sets in a 7:2:1 ratio.
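For illustration, a minimal sketch of this tiling step is given below, assuming the stitched mosaic is read with OpenCV; the function name, constants, and the black-border test (fraction of all-zero pixels) are our own simplifications rather than the exact preprocessing code used in this study.

```python
import cv2
import numpy as np

PATCH, STRIDE, MAX_BLACK_FRACTION = 512, 256, 0.1  # 512x512 patches, 256 px stride

def tile_mosaic(mosaic_path: str) -> list:
    """Crop a stitched mosaic into patches and drop those dominated by black borders."""
    image = cv2.imread(mosaic_path)
    patches = []
    height, width = image.shape[:2]
    for y in range(0, height - PATCH + 1, STRIDE):
        for x in range(0, width - PATCH + 1, STRIDE):
            patch = image[y:y + PATCH, x:x + PATCH]
            # Fraction of pixels that are pure black (stitching border fill).
            black_fraction = np.all(patch == 0, axis=-1).mean()
            if black_fraction <= MAX_BLACK_FRACTION:
                patches.append(patch)
    return patches
```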
Accurate image annotation was essential for training the segmentation model. LabelImg was used to annotate images, focusing specifically on corn straw plots. Each plot was manually labeled to distinguish different types of corn straw coverage. The labeled images are single-channel images, where different label colors represent specific straw coverage types. Specifically, label color 1 corresponds to the straw burn type, which features visible charred areas and dark patches caused by the burning of straw residues; label color 2 represents straw crush, characterized by fragmented straw scattered on the soil surface, indicative of a common eco-friendly return-to-field method; label color 3 indicates straw vertical, where the stalks remain upright in the field, reflecting a post-harvest state without mechanical interference; label color 4 denotes straw bale, where straw is mechanically collected into bundled stacks; and label color 5 signifies straw level, in which straw is flattened and evenly spread across the ground surface. Additionally, label color 6 corresponds to turn the soil, representing fully tilled land with exposed soil and minimal straw residue; label color 7 represents strip-tillage, which appears as alternating bands of tilled and untilled soil; label color 8 indicates high stubble, characterized by tall, standing stubble remaining after harvesting; and label color 9 denotes low stubble, where only the lower parts of the straw remain close to the ground (Figure 2b). These nine types of straw coverage were all treated as distinct semantic categories and directly participated in the model training and classification process. Accurate identification of each type plays a critical role in downstream agricultural applications such as tillage assessment, residue management, and precision farming strategies.

2.3. Overall Process

Our overall detection process (Figure 2c) accurately and efficiently identifies straw return field types by integrating two classification criteria: straw coverage type and straw coverage rate (SCR). First, the improved DeepLabv3+ model is employed to classify farmland into different straw coverage types. Based on their mapping relationships, straw vertical, straw level, high stubble, and strip-tillage are categorized as full straw return. In contrast, fields identified as straw burn, straw bale, turn the soil, and low stubble are classified as no straw return. When straw crush is detected, the corresponding field is extracted from the original map based on the prediction results. Additionally, an adaptive thresholding algorithm [34] is applied to calculate the SCR for each straw crush field. If the SCR is ≥70%, the field is classified as full straw return; if the SCR is ≥30% and <70%, it is categorized as partial straw return; and if the SCR is <30%, it is designated as no straw return.
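The decision rules above can be summarized in a short sketch; the category names follow the text, the thresholds are those stated above, and the function name is illustrative.

```python
from typing import Optional

FULL_RETURN_TYPES = {"straw vertical", "straw level", "high stubble", "strip-tillage"}
NO_RETURN_TYPES = {"straw burn", "straw bale", "turn the soil", "low stubble"}

def classify_plot(coverage_type: str, scr: Optional[float] = None) -> str:
    """Map a predicted coverage type (and, for straw crush, its SCR) to a return class."""
    if coverage_type in FULL_RETURN_TYPES:
        return "full straw return"
    if coverage_type in NO_RETURN_TYPES:
        return "no straw return"
    # Remaining case: straw crush, decided by the straw coverage rate (SCR).
    if scr is None:
        raise ValueError("straw crush plots require an SCR value")
    if scr >= 0.70:
        return "full straw return"
    if scr >= 0.30:
        return "partial straw return"
    return "no straw return"
```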

2.4. Network Design

Developing a precise semantic segmentation network is crucial for advancing conservation tillage practices. We propose an improved DeepLabv3+ model by introducing a Gated Conv-Former Block (GCFB), replacing the Atrous Spatial Pyramid Pooling (ASPP) module with a Multi-Kernel Convolution Feedforward Network with Fast Fourier Transform Convolutional Block Attention Module (MKC-FFN-FTCM), and refining the training strategy.

2.4.1. The Network Architecture

In this study, we used the DeepLabv3+ model for semantic segmentation. DeepLabv3+ is an advanced and widely recognized architecture from the DeepLab series by Chen et al. [35,36,37]. It enhances object boundary localization with a novel decoder module, extending the capabilities of its predecessor, DeepLabv3. The architecture includes two main components: an encoder and a decoder. The encoder consists of a backbone network and an Atrous Spatial Pyramid Pooling (ASPP) module. The backbone extracts rich features from the input image, while ASPP employs parallel atrous convolutions with varying dilation rates to capture multi-scale context and segment objects at different scales. In the decoder, low-level features from early encoder layers are fused with high-level features from the ASPP module. This enhances the model’s ability to recover fine details and delineate object boundaries accurately. However, the original DeepLabv3+ structure, which combines low-level features with high-level ASPP features, lacks precision for segmentation tasks. Thus, balancing computational efficiency and feature extraction remains a challenge in our study.
The optimized DeepLabv3+ model, named GCF-DeepLabv3+, is shown in Figure 4. The backbone was replaced with StarNet [38], which reduces inference time and enhances feature extraction efficiency. The backbone extracts four distinct features across channels. A Gated Conv-Former Block (GCFB) was introduced, utilizing a Gated Depthwise Feedforward Network (GDFN) to suppress less informative features and allow useful information to pass through. This ensures accuracy while reducing training parameters. GCFB aims to improve feature extraction and context understanding in semantic segmentation tasks. In the encoder, the ASPP module was replaced with a Multi-Kernel Convolution Feedforward Network with a Fast Fourier Transform Convolutional Block Attention Module (MKC-FFN-FTCM). The MKC-FFN-FTCM effectively extracts contextual information across varying scales of straw return imagery, allowing the model to focus on different straw coverage semantics. An FFT-CBAM was added after the fusion module within the MKC-FFN, incorporating the Fast Fourier Transform into both channel and spatial attention modules. This captures frequency domain information from feature maps and enhances global feature representation.

2.4.2. StarNet

StarNet follows a traditional hierarchical network architecture, consisting of four stages. In each stage, a convolutional layer is used to down-sample the resolution and double the number of channels. The StarNet framework is illustrated in Figure 5. The model uses multiple repeating star blocks to extract features. Each star block consists of multiple components that efficiently extract and refine features. It employs a depthwise separable convolution, multiple 1 × 1 convolutions for feature expansion and compression, along with a residual connection for better gradient flow. Within each block, the feature maps are processed through a series of convolutions, followed by activation functions and element-wise multiplication (i.e., the star operation). This enables the model to capture complex feature interactions while maintaining computational efficiency. It promotes efficient feature extraction while minimizing computational costs.
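As a rough illustration of the star operation, the PyTorch sketch below shows one simplified star block (depthwise convolution, two 1 × 1 expansion branches multiplied element-wise, 1 × 1 compression, and a residual connection); the kernel sizes, expansion factor, and the omission of normalization and DropPath are our assumptions rather than the published StarNet configuration.

```python
import torch.nn as nn

class StarBlock(nn.Module):
    """Simplified star block: act(f1(x)) * f2(x) is the "star" operation."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dw1 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # depthwise conv
        self.f1 = nn.Conv2d(dim, dim * expansion, 1)              # expansion branch 1
        self.f2 = nn.Conv2d(dim, dim * expansion, 1)              # expansion branch 2
        self.g = nn.Conv2d(dim * expansion, dim, 1)               # compression
        self.dw2 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.act = nn.ReLU6()

    def forward(self, x):
        identity = x
        x = self.dw1(x)
        x = self.act(self.f1(x)) * self.f2(x)   # element-wise "star" interaction
        x = self.dw2(self.g(x))
        return identity + x                     # residual connection
```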

2.4.3. Multi-Kernel Convolution Feedforward Network with Fast Fourier Transform Convolutional Block Attention Module (MKC-FFN-FTCM)

The original DeepLabv3+ model employs the ASPP (Atrous Spatial Pyramid Pooling) module, which utilizes dilated convolutions with varying dilation rates to expand the receptive field and capture multi-scale information. However, in straw return detection, discontinuous plots and high-resolution images pose challenges, as excessive dilation rates in ASPP produce sparse feature maps and loss of detail, reducing detection accuracy. Additionally, fixed dilation rates in ASPP restrict flexibility in multi-scale feature fusion, reducing effectiveness in detecting small, unevenly distributed straws. High computational and memory demands often become bottlenecks in large-scale data processing. To overcome these limitations, we proposed the MKC-FFN-FTCM (Figure 4b).
MKC-FFN-FTCM employs three parallel depthwise (DW) convolutions with different kernel sizes (1 × 1, 3 × 3, and 5 × 5) to perform multi-scale feature extraction and fusion. Each convolution kernel extracts features corresponding to distinct receptive fields, ensuring multi-scale extraction while reducing the number of channel parameters. These parallel structures efficiently decompose features and expand receptive fields, preserving local information at each scale and effectively fusing features across multiple scales. This enhances the model’s ability to adapt to complex scenarios. Each convolution kernel extracts context information at different scales, ensuring that the network captures both fine-grained local details and broader global features simultaneously. MKC-FFN-FTCM effectively extracts straw coverage features by parallel processing multi-scale kernels and feature fusion, thereby avoiding redundant computations and reducing network complexity. This approach maintains performance while addressing the limitations of ASPP, particularly its failure to capture local details, especially in handling small objects, complex backgrounds, and feature-rich regions.
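A minimal PyTorch sketch of the multi-kernel depthwise branch is given below; concatenation followed by a pointwise fusion convolution is one plausible realization, and the channel settings are illustrative rather than the exact MKC-FFN design.

```python
import torch
import torch.nn as nn

class MultiKernelDWBranch(nn.Module):
    """Three parallel depthwise convolutions (1x1, 3x3, 5x5) fused by a 1x1 conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.dw1 = nn.Conv2d(channels, channels, 1, groups=channels)
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.fuse = nn.Conv2d(channels * 3, channels, 1)  # pointwise multi-scale fusion

    def forward(self, x):
        multi_scale = torch.cat([self.dw1(x), self.dw3(x), self.dw5(x)], dim=1)
        return self.fuse(multi_scale)
```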
Agricultural images often contain background elements such as roads, weeds, and trees, which can interfere with straw return recognition. Background noise increases the computational burden and may bias feature extraction from key regions, reducing recognition accuracy. The model’s primary task in straw return recognition is to accurately distinguish straw-covered areas. In complex backgrounds with significant interference, the model may misclassify roads, weeds, or shadows as part of the straw or other plots, affecting recognition results. To capture salient features, the network focuses on identifying correlations and differences among straw coverage forms. The FFT-CBAM module (Figure 6) was added after the DW convolution layer in this paper. This method enhances frequency information capture by processing features in the frequency domain, optimizing the attention mechanism, reducing computational complexity, and improving sensitivity to high-frequency details. Specifically, it reduces interference from background elements such as roads and weeds, improving recognition accuracy and reliability.
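To make the frequency-domain idea concrete, the sketch below shows one plausible way to build the channel-attention half of such a module: a per-channel descriptor computed from the magnitude of the 2D FFT gates the feature map. This is an assumption about the general mechanism, not the authors' exact FFT-CBAM implementation.

```python
import torch
import torch.nn as nn

class FFTChannelAttention(nn.Module):
    """Channel attention driven by per-channel spectral energy (assumed formulation)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        spectrum = torch.fft.rfft2(x, norm="ortho")        # frequency-domain view
        descriptor = spectrum.abs().mean(dim=(2, 3))       # per-channel spectral energy
        weights = self.mlp(descriptor).unsqueeze(-1).unsqueeze(-1)
        return x * weights                                 # re-weight channels
```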

2.4.4. Gated Conv-Former Block (GCFB)

The objective of the DeepLabv3+ decoder is to fuse deep and shallow features to enhance segmentation details and accuracy. Shallow features are extracted from the initial layers of the backbone network, primarily capturing essential details such as color and edge textures. These features capture fine-grained details, yet the 1 × 1 convolution used in the original network for shallow feature extraction has limited capability in capturing semantic information and may include more noise. For the task of identifying straw-returning patterns across different fields, distinguishing features often lie within low-level characteristics like texture and color, formed by surface straw or soil. However, in the original network, shallow features undergo 1 × 1 convolution, which can lead to the loss of certain low-level information when fusing shallow and deep features. Therefore, before the 1 × 1 convolution operation, we employ a Gated Conv-Former Block (GCFB) (Figure 7a) based on a gating mechanism (Figure 7b) to enhance valuable semantic information in shallow features obtained by the backbone network.
In the gating mechanism-based improved CFB, the process begins by passing the input feature map through a Convolutional Attention Module, which calculates attention weights using convolution operations and enhances features via a gating mechanism. Next, the attention-processed feature map is connected with the input feature map through residual connections, and Drop Path regularization is applied. Then, the feature map is further transformed through a multi-layer perceptron (MLP) module consisting of two convolutional layers and an activation function. Finally, the MLP-processed feature map is added to the residual connection result and regularized again using Drop Path, yielding the final output features. This structure effectively captures important feature information, enhancing model robustness. The combination of the Gated Depthwise Feedforward Network (GDFN) and Convolutional Attention Module makes the CFB a powerful intermediate block within the network architecture, capable of finer feature aggregation and improved capacity to handle complex segmentation tasks.
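The flow described above can be sketched schematically as follows; the sigmoid gating form, kernel sizes, and MLP ratio are assumptions, and DropPath regularization is omitted for brevity.

```python
import torch
import torch.nn as nn

class GatedConvAttention(nn.Module):
    """Conv-based attention in which one depthwise branch gates the other."""
    def __init__(self, dim: int):
        super().__init__()
        self.value = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.gate = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        return self.proj(self.value(x) * torch.sigmoid(self.gate(x)))

class GatedConvFormerBlock(nn.Module):
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.attn = GatedConvAttention(dim)
        self.mlp = nn.Sequential(               # two convolutional layers + activation
            nn.Conv2d(dim, dim * mlp_ratio, 1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):
        x = x + self.attn(x)   # residual connection (DropPath omitted in this sketch)
        x = x + self.mlp(x)    # second residual connection
        return x
```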

2.4.5. Loss Function

Cross-entropy Loss (CE loss) [39] is one of the most commonly used loss functions in semantic segmentation tasks. For each pixel, Cross-Entropy Loss measures the difference between the predicted class distribution and the true class distribution. The formula is as follows:
$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log(p_{i,c})$$
where $N$ represents the total number of pixels, $C$ denotes the total number of classes, $y_{i,c}$ is the true label of pixel $i$ for class $c$, and $p_{i,c}$ is the predicted probability of pixel $i$ for class $c$.
In our dataset, the number of pixels for the strip-tillage category far exceeds those for categories such as vertical straw distribution. This class imbalance leads the model to favor the strip-tillage category. To address this issue, we introduced Dice Loss [38], which assesses the similarity between predicted segmentation maps and ground-truth segmentation maps. The Dice Loss function is more robust to class imbalance within the dataset, effectively mitigating the impact of large pixel quantity differences among classes. The formula is as follows:
$$L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} y_i p_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i}$$
where $N$ represents the total number of pixels, $y_i$ denotes the true label for pixel $i$ (typically binary, where 1 indicates belonging to the class and 0 indicates not belonging), and $p_i$ represents the predicted probability for pixel $i$. Accordingly, we use a composite loss function $L_{Loss}$, based on the aforementioned CE loss and Dice loss, to train the model. The calculation method for this composite function is as follows:
$$L_{Loss} = L_{CE} + L_{Dice}$$
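A minimal PyTorch sketch of this composite loss, assuming logits of shape (B, C, H, W) and integer label maps of shape (B, H, W), is shown below; the smoothing constant is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    """Composite loss L = L_CE + L_Dice for multi-class semantic segmentation."""
    def __init__(self, num_classes: int, eps: float = 1e-6):
        super().__init__()
        self.num_classes = num_classes
        self.eps = eps

    def forward(self, logits, target):
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        intersection = (probs * one_hot).sum(dim=(0, 2, 3))
        denominator = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1.0 - (2.0 * intersection + self.eps) / (denominator + self.eps)
        return ce + dice.mean()
```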

2.5. Experiment Platform and Parameter Settings

The computing equipment used in this study is as follows: the host computer is equipped with an Intel Core i9-12900K 32-core processor (Manufacturer: Intel, City: Santa Clara, CA, Country: USA), and an NVIDIA GeForce RTX 3080 GPU with 10 GB of memory (Manufacturer: NVIDIA, City: Santa Clara, CA, Country: USA), and runs on Windows 11. We used Python 3.10.14 and Torch 2.3.1 for software implementation. During training, we selected stochastic gradient descent (SGD) as the optimizer with a momentum parameter of 0.9. The model’s initial (maximum) learning rate was set to 7 × 10−3, with a minimum learning rate of 0.01 times the maximum learning rate. A cosine annealing schedule was used for learning rate decay, and the weight decay coefficient was set to 1 × 10−4. The input image resolution for the network was set to 512 × 512 pixels, with a down-sampling factor of 16. Through multiple training sessions, we observed that the model converged after approximately 120 epochs, so we trained the model for 120 epochs, with a batch size of 8. Additionally, to enhance processing efficiency, the number of threads was set to 16.
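These settings map directly onto standard PyTorch components, as sketched below; the placeholder model stands in for the GCF-DeepLabv3+ network and the loop body over the training set is omitted.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 9, kernel_size=1)   # placeholder for the GCF-DeepLabv3+ network
max_lr, epochs = 7e-3, 120

optimizer = torch.optim.SGD(model.parameters(), lr=max_lr,
                            momentum=0.9, weight_decay=1e-4)
# Cosine annealing decays the LR from max_lr down to 0.01 * max_lr over 120 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs, eta_min=0.01 * max_lr)

for epoch in range(epochs):
    # ... one pass over the training set (batch size 8, 512 x 512 inputs) ...
    scheduler.step()
```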

2.6. Evaluation Metrics

In this study, the dataset was divided into training, validation, and test sets in a 7:2:1 ratio to facilitate model training, tuning, and evaluation. The validation set was used during training to assess model performance, assisting in parameter selection and training strategy adjustments. The test set served as an independent dataset for final evaluation, allowing assessment of the model’s generalization ability and practical application effectiveness. Total parameters and FLOPs [40] were selected as metrics of model complexity, while Mean Intersection over Union (MIoU) [41], mean average precision (mAP) [42], precision [43], recall [43], and F1-score [43] were used to evaluate segmentation accuracy. The formulas for these metrics are as follows:
$$mIoU = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c + FN_c}$$
$$mAP = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c + TN_c}{TP_c + TN_c + FP_c + FN_c}$$
$$Precision = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c}$$
$$Recall = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FN_c}$$
$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
where $C$ represents the number of classes, and $TP_c$, $FP_c$, $TN_c$, and $FN_c$ denote the numbers of true positive, false positive, true negative, and false negative pixels for class $c$, respectively.
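For reference, these per-class metrics can be computed from a pixel-level confusion matrix as in the sketch below; this is the standard formulation rather than code released with the paper.

```python
import numpy as np

def segmentation_metrics(confusion: np.ndarray, eps: float = 1e-12) -> dict:
    """confusion[i, j]: number of pixels of true class i predicted as class j."""
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"mIoU": iou.mean(), "Precision": precision.mean(),
            "Recall": recall.mean(), "F1-score": f1.mean()}
```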

3. Results

3.1. Comparative Experiments on Different Backbone Networks

In this study, we selected Xception [44], MobileNetv2 [45], MobileNetv4 [46], Convnextv2 [47], and StarNet [38] as the backbone networks for DeepLabv3+ to investigate the impact of different backbones on the model’s performance. We evaluated each backbone’s performance on metrics including Mean Intersection over Union (MIoU), Mean Average Precision (mAP), Precision, and Total Parameters. The specific experimental results are shown in Table 1.
As shown in Table 1, when StarNet is used as the backbone, the model achieved an MIoU of 93.97%, an mAP of 95.68%, and a Precision of 98.76% on our dataset. Each evaluation metric shows an improvement over Xception: mIoU increases by 3.89 percentage points, mAP by 0.34 percentage points, and Precision by 1.87 percentage points. Notably, the total parameter count when using Xception is 16.5 times higher than that of StarNet. Additionally, StarNet exhibited significantly higher MIoU, mAP, and Precision scores than ConvNextv2, MobileNetv2, and MobileNetv4. MobileNetv2 employs a linear bottleneck and inverted residual structure, effectively balancing parameter count and accuracy. However, although MobileNetv2 performs well as a lightweight model, it still falls short in complex tasks and large-scale datasets. MobileNetv2 and MobileNetv4 also have a slightly larger parameter count than StarNet. StarNet is a four-stage architecture that uses convolutional layers for down-sampling and star blocks for feature extraction. It replaces Layer Normalization with Batch Normalization, adds depth-wise convolutions, and uses ReLU6 instead of GELU. With a consistent channel expansion factor of 4 and doubling network width at each stage, StarNet emphasizes simplicity, avoiding complex features like attention mechanisms. These optimizations make the network structure more efficient and enhance its computational speed and effectiveness. Given the demands for accurate, detailed, and rapid segmentation of various field types in this study, StarNet was selected as the backbone feature extraction network. This choice ensured both model efficiency and high feature extraction capability while keeping the model lightweight.

3.2. Ablation Experiments

To validate the effectiveness of the MKC-FFN-FTCM and Gated-CFB, we conducted ablation experiments using StarNet as the backbone feature extraction network in the base model. The experimental results are presented in Table 2. After introducing the Gated-CFB module, the MIoU, mAP, and Precision of the base model improved by 3.60%, 4.21%, and 4.58%, respectively. When the model was modified to include only the MKC-FFN module without the Gated-CFB, the MIoU, mAP, and Precision increased by 2.69%, 4.37%, and 4.61%, respectively, compared to the base model. Furthermore, when channel-spatial attention based on the Fast Fourier Transform (FFT-CBAM) was added to the MKC-FFN module, the MIoU, mAP, and Precision further improved by 1.88%, 0.45%, and 0.22%, respectively, over the version with only the MKC-FFN module. These results indicate that the addition of FFT-CBAM enables the MKC-FFN module to capture more distinctive features and generate a set of weight parameters that allows the module to more effectively identify relevant straw coverage features. In summary, incorporating the MKC-FFN-FTCM and Gated-CFB modules in the model not only improves model accuracy but also enhances its ability to identify and segment different types of straw-covered field plots.

3.3. Comparison of the Different Semantic Segmentation Models

To validate the effectiveness of the proposed model in detecting and segmenting different types of corn straw mulching plots, we compared it with the Lraspp, UNet, Segformer, Deeplabv3, Deeplabv3+, and improved Deeplabv3+ (GCF-DeepLabv3+) models. The experimental results are shown in Table 3, and the accuracy comparison of different models is illustrated in Figure 8. The FLOPs of the improved GCF-DeepLabv3+ model are 41.19G. Although Lraspp and Segformer have lower FLOPs at 4.16G and 13.6G, respectively, their accuracy is significantly lower than that of the GCF-DeepLabv3+ model. Moreover, the FLOPs of the GCF-DeepLabv3+ model are considerably lower than those of UNet, Deeplabv3+, and Deeplabv3, yet it achieves the highest accuracy among these models. Notably, the total parameter count of the GCF-DeepLabv3+ model is also lower than that of UNet, Deeplabv3+, and Deeplabv3, which not only demonstrates superior accuracy but also satisfies the requirements of a lightweight model.
In terms of segmentation accuracy, our model achieved the best results across all evaluation metrics (MIoU: 93.97%, mAP: 95.68%, Precision: 98.76%). Therefore, compared to other semantic segmentation models, our model strikes a good balance between total parameters, FLOPs, and accuracy. It maintains high accuracy while minimizing total parameter count and computational load.
To comprehensively evaluate the segmentation performance of different models on corn straw coverage, five representative test images were selected for comparative analysis, as illustrated in Figure 9. The figure includes the original RGB images and their corresponding ground truth labels, followed by the predicted segmentation results from six models: Lraspp, UNet, Segformer, DeepLabv3, DeepLabv3+, and the proposed GCF-DeepLabv3+. The results reveal significant performance variations across models in terms of accuracy, boundary delineation, and class discrimination.
Among the evaluated models, Lraspp exhibits the lowest segmentation performance. Despite employing the efficient MobileNet backbone and Lite R-ASPP module, its limited feature extraction capacity leads to coarse predictions and substantial misclassification, particularly in complex scenarios with overlapping straw types. The model fails to capture fine-scale spatial structures, especially in low stubble regions, highlighting the drawbacks of excessively lightweight architectures.
UNet performs moderately well, particularly in segmenting large, homogeneous areas. The encoder–decoder design and skip connections facilitate spatial detail retention; however, the model struggles with irregular or narrow regions. Errors near object boundaries and jagged edges are frequently observed, especially in images 2, 4, and 5. These limitations stem from insufficient global context modeling, which is essential in scenes with heterogeneous straw distribution.
Segformer, although based on an efficient transformer architecture, fails to deliver satisfactory results in this domain. The model consistently under-segments critical areas and demonstrates high misclassification rates. Its attention mechanism does not sufficiently capture the subtle spectral and textural differences among straw types, likely due to inadequate structural supervision and its general-purpose design, which limits applicability to fine-grained agricultural segmentation.
DeepLabv3 also shows unsatisfactory performance. While its ASPP module aids in capturing multi-scale context, the absence of a decoder path compromises boundary recovery. As a result, the model often identifies only the dominant class in multi-type plots, overlooking smaller or visually similar categories.
DeepLabv3+, with its encoder–decoder architecture, improves upon its predecessor by offering better boundary refinement and class separation. Nonetheless, it still presents noticeable errors in fine-detail regions, indicating that further improvements in multi-scale feature fusion and contextual learning are required for high-precision segmentation.
In contrast, the proposed GCF-DeepLabv3+ model consistently outperforms all baselines. This model introduces several architectural enhancements: the original backbone is replaced with StarNet, which improves multi-level feature extraction and reduces inference time. A novel Gated Conv-Former Block (GCFB) incorporating a Gated Depthwise Feedforward Network (GDFN) enables selective feature activation by filtering out low-importance signals. Moreover, the conventional ASPP is substituted with a Multi-Kernel Convolution Feedforward Network integrated with a Fast Fourier Transform Convolutional Block Attention Module (MKC-FFN-FTCM). This module facilitates adaptive context aggregation across varying spatial resolutions and emphasizes global semantics through an FFT-CBAM mechanism. Collectively, these components allow the model to retain intricate boundary structures, reduce misclassifications in complex scenes, and achieve more coherent segmentation masks, as particularly evident in images 3 and 5.
In summary, GCF-DeepLabv3+ demonstrates superior generalization capability and robustness across a wide range of corn straw coverage conditions. By effectively addressing the limitations of both convolution-based and transformer-based networks, it achieves state-of-the-art performance in terms of segmentation accuracy, boundary fidelity, and visual consistency. These results highlight the model’s strong potential for practical applications in precision agriculture and remote sensing-based straw return monitoring.

3.4. Performance Comparison Between GCF-Deeplabv3+ and Baseline Model

To thoroughly evaluate the classification performance of the proposed GCF-Deeplabv3+ model, we performed a detailed comparative analysis with the baseline Deeplabv3+ model across nine straw cover types: straw vertical, straw level, high stubble, strip-tillage, straw burn, straw bale, turn the soil, low stubble, and straw crush. These categories include both fully straw-returned and non-straw-returned field conditions, along with the straw crush category, which requires specialized handling due to its unique grayscale-based analysis.
The classification performance for each straw cover type was evaluated using three key metrics: precision, recall, and F1-score. The results are presented in Table 4, highlighting the comparative strengths of the GCF-Deeplabv3+ model.
As shown in Table 4 and illustrated in Figure 10, the proposed GCF-Deeplabv3+ model consistently outperforms the baseline Deeplabv3+ model across all straw coverage categories and evaluation metrics. Specifically, GCF-Deeplabv3+ achieves higher Precision, Recall, and F1-score across all nine straw coverage types, with performance improvements ranging from 3% to 5%. Notably, the most significant gains are observed in categories with complex texture patterns, such as straw vertical, straw burn, and turn the soil, as reflected in the confusion matrices in Figure 10. These categories, characterized by intricate visual details, particularly benefit from the enhanced feature extraction capabilities of the GCF-Deeplabv3+ model.
On average, the GCF-Deeplabv3+ model outperforms the baseline Deeplabv3+ model in all evaluation metrics, including Precision, Recall, and F1-score. These results, supported by the confusion matrix in Figure 10, validate the effectiveness of the Gated Conv-Former Block (GCFB) and MKC-FFN-FTCM modules in enhancing the model’s ability to recognize spatial details and distinguish subtle inter-class differences. Such improvements are crucial for accurate straw-return classification in practical agricultural monitoring scenarios.

3.5. Application Experiments

To evaluate the practicality and accuracy of the enhanced GCF-Deeplabv3+ model, we integrated it with adaptive threshold grayscale processing for field experiments aimed at rapidly detecting various straw-returned plot types. Field images for validation were acquired in October 2024 in two regions of Yitong Manchu Autonomous County (125.30° E, 43.383333° N), using a DJI M200 drone equipped with a Zenmuse X5S gimbal camera (DJI, Shenzhen, China). To assess the model’s generalization capability and detection performance, the experimental dataset was distinct from the training dataset.
Initially, the acquired images were stitched and annotated. The images were then cropped to 512 × 512 pixels, and only those suitable for experimentation were retained. The trained GCF-Deeplabv3+ model was employed to predict the images, as illustrated in Figure 11 and Figure 12. Following prediction, an adaptive thresholding method was applied to binarize images of plots with straw crushing coverage types [33], and their Straw Coverage Ratio (SCR) values were computed. Plots were classified based on SCR values: fully straw-returned (SCR ≥ 70%), partially straw-returned (30% ≤ SCR < 70%), and non-straw-returned (SCR < 30%). For comparative analysis, the trained Deeplabv3+ and UNet networks were also utilized to predict plot types.
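A minimal sketch of the SCR computation for straw-crush plots is given below, using OpenCV's adaptive mean thresholding; the neighborhood size and offset are illustrative rather than the parameters used in this study.

```python
import cv2
import numpy as np

def straw_coverage_rate(plot_bgr: np.ndarray) -> float:
    """Binarize a straw-crush plot image and return the straw-pixel fraction (SCR)."""
    gray = cv2.cvtColor(plot_bgr, cv2.COLOR_BGR2GRAY)
    # Pixels brighter than their local mean (51x51 window, offset 5) count as straw.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, 51, -5)
    return float(np.count_nonzero(binary == 255) / binary.size)
```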
As summarized in Table 5, the enhanced GCF-Deeplabv3+ model achieved a prediction accuracy of 96.37% in Region 1, representing an improvement of 3.12 percentage points over the Deeplabv3+ model (93.25%) and 2.72 percentage points over the UNet model (93.65%). In Region 2, the accuracy of GCF-Deeplabv3+ reached 95.24%, exceeding that of the Deeplabv3+ model (92.65%) by 2.59 percentage points and that of the UNet model (93.06%) by 2.18 percentage points. Moreover, the average prediction time of GCF-Deeplabv3+ for both regions was 32.54 s, representing a reduction of approximately 14.22 s compared to the Deeplabv3+ model and 36.38 s compared to the UNet model. Furthermore, the FLOPs (Floating Point Operations) of GCF-Deeplabv3+ were 41.19G, significantly lower than those of Deeplabv3+ (by 11.74G) and the UNet model (by 39.84G).
These results indicate that the enhanced GCF-Deeplabv3+ model not only improves prediction accuracy but also reduces FLOPs, leading to a significant decrease in overall prediction time. Therefore, we conclude that the enhanced GCF-Deeplabv3+ model demonstrates exceptional performance in detecting various straw coverage types.

4. Discussion

4.1. Model Performance and Limitations

In this study, to address issues such as uneven straw distribution and fragmentation in field plots, we introduced wavelet convolutions and a channel-space attention mechanism based on Fast Fourier Transform (FFT) to capture a broader range of features during training. This ensures accurate segmentation and fine boundary delineation of different regions. The resulting GCF-DeepLabv3+ network model effectively detects and segments different types of plots, laying a solid foundation for subsequent operations in the field. To validate the performance of our segmentation network, we compared it with the Lraspp, UNet, Segformer, Deeplabv3, and Deeplabv3+ deep learning models. The experimental results demonstrate that our model achieves a balance between accuracy and speed. In terms of accuracy, it shows improvements of 19.08%, 3.40%, 6.42%, 18.00%, and 2.26%, respectively, over Lraspp, UNet, Segformer, Deeplabv3, and Deeplabv3+. For mean pixel accuracy (mPA), it improves by 14.70%, 5.26%, 30.96%, 14.04%, and 2.97%, respectively. In terms of Mean Intersection over Union (mIoU), the improvements are 15.24%, 9.85%, 36.31%, 14.57%, and 8.65%.
Although significant improvements were achieved, we must acknowledge that our study still has some limitations. For instance, the resolution of the sensors limits the detail captured in images, which may hinder accurate recognition of small or distant straw features. Environmental factors can also affect image quality, impacting segmentation performance. Regarding the network model, it might be too sensitive to anomalies or sensor resolution, potentially affecting the stability of detection and segmentation. Additionally, in real-time monitoring and segmentation scenarios, the existing system’s processing speed might not meet the required demands. Moreover, the model’s portability is not ideal, as the environmental heterogeneity across different fields could affect detection and segmentation accuracy. Therefore, future work will focus on further optimizing this network model to improve detection and segmentation accuracy, enhance model portability, and address the identified limitations.

4.2. Future Work

Future research can focus on further optimizing the performance of the GCF-DeepLabv3+ network model. Specifically, more advanced convolutional methods and attention mechanisms can be explored to enhance segmentation accuracy for different types of plots. Considering the model’s sensitivity to anomalies, efforts could be made to improve its robustness, thereby reducing the impact of environmental changes and sensor resolution on performance. Furthermore, considering the model’s real-time performance and processing speed, integrating more advanced technologies such as real-time image processing and edge computing could be explored. This would allow for deploying the model to edge devices, enabling on-site real-time monitoring and segmentation. To enhance the model’s portability across different fields, attention should be given to transfer learning and the model’s ability to adapt to new environments. Research could focus on optimizing model parameters in different environmental conditions to ensure consistent performance across various field conditions. By improving these aspects, the model’s overall accuracy, robustness, and adaptability will be further enhanced, making it more practical for diverse agricultural scenarios.

5. Conclusions

This study proposes a rapid and efficient method for detecting straw coverage types in farmland using UAV low-altitude remote sensing images, coupled with an optimized GCF-DeepLabv3+ model for segmentation and detection. The experimental results confirm that the optimized GCF-DeepLabv3+ model offers significant improvements in segmentation accuracy, computational efficiency, and robustness. Specifically, the model achieves a Mean Intersection over Union (MIoU) of 93.97%, a Mean Average Precision (mAP) of 95.68%, and a Precision score of 98.76%, all of which outperform the baseline models used for comparison. Furthermore, the optimized model maintains a low computational cost, with a total parameter count of 3.31M and a Floating-Point Operations (FLOPs) value of 41.19G. These results demonstrate that the GCF-DeepLabv3+ model can not only provide accurate straw coverage detection but also function efficiently in practical applications with limited computational resources. Overall, the GCF-DeepLabv3+-based straw coverage type detection method presented in this paper offers a promising approach to improving detection efficiency in agricultural monitoring. Its performance improvements suggest it could be a valuable tool for real-time straw coverage classification, addressing practical needs in agricultural environments.

Author Contributions

Conceptualization, Y.L. (Yuanyuan Liu); Data curation, Y.R.; Formal analysis, P.S.; Funding acquisition, J.W.; Investigation, X.L.; Methodology, J.Z.; Project administration, Y.L. (Yang Luo); Resources, Y.W., X.L., P.S. and Y.R.; Software, Y.L. (Yuanyuan Liu) and J.Z.; Supervision, Y.W., Y.R. and J.W.; Validation, Y.L. (Yuanyuan Liu), J.Z. and P.S.; Visualization, Y.L. (Yang Luo); Writing—review and editing, X.L. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research on regionalized surface straw cover information detection methods in complex contexts for conservation tillage, the National Natural Science Foundation of China, grant number 42001256; the Jilin Science and Technology Development Program Project, grant number 20220402023GH; and the Jilin Science and Technology Development Program Project, grant number 20230202039NC.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, S.; Li, Q.; Zhang, X.; Wei, K.; Chen, L.; Liang, W. Effects of conservation tillage on soil aggregation and aggregate binding agents in black soil of Northeast China. Soil Tillage Res. 2012, 124, 196–202. [Google Scholar] [CrossRef]
  2. Han, L.; Yan, Q.; Liu, X.; Hu, J. Straw Resources and Their Utilization in China. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2002, 18, 87–91. [Google Scholar]
  3. Li, H.; Dai, M.; Dai, S.; Dong, X. Current status and environment impact of direct straw return in China’s cropland—A review. Ecotoxicol. Environ. Saf. 2018, 159, 293–300. [Google Scholar] [CrossRef]
  4. Zhang, J.; Sun, B.; Zhu, J.; Wang, J.; Pan, X.; Gao, T. Black Soil Protection and Utilization Based on Harmonization of Mountain-River-Forest-Farmland-Lake-Grassland-Sandy Land Ecosystems and Strategic Construction of Ecological Barrier. Bull. Chin. Acad. Sci. 2021, 36, 1155–1164. [Google Scholar] [CrossRef]
  5. Chen, B.; Mohrmann, S.; Li, H.; Gaff, M.; Lorenzo, R.; Corbi, I.; Corbi, O.; Fang, K.; Li, M. Research and Application Progress of Straw. J. Renew. Mater. 2023, 11, 599–623. [Google Scholar] [CrossRef]
  6. Zhu, Q.; Xu, X.; Sun, Z.; Liang, D.; An, X.; Chen, L.; Yang, G.; Huang, L.; Xu, S.; Yang, M. Estimation of Winter Wheat Residue Coverage Based on GF-1 Imagery and Machine Learning Algorithm. Agronomy 2022, 12, 1051. [Google Scholar] [CrossRef]
  7. Memon, M.S.; Jun, Z.; Sun, C.; Jiang, C.; Xu, W.; Hu, Q.; Yang, H.; Ji, C. Assessment of Wheat Straw Cover and Yield Performance in a Rice-Wheat Cropping System by Using Landsat Satellite Data. Sustainability 2019, 11, 5369. [Google Scholar] [CrossRef]
  8. Memon, M.S.; Chen, S.; Niu, Y.; Zhou, W.; Elsherbiny, O.; Liang, R.; Du, Z.; Guo, X. Evaluating the Efficacy of Sentinel-2B and Landsat-8 for Estimating and Mapping Wheat Straw Cover in Rice–Wheat Fields. Agronomy 2023, 13, 2691. [Google Scholar] [CrossRef]
  9. Li, Z.; Wang, C.; Pan, X.; Liu, Y.; Li, Y.; Shi, R. Estimation of wheat residue cover using simulated Landsat-8 OLI datas. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2016, 32, 145–152. [Google Scholar] [CrossRef]
  10. Liu, Y.; Wang, Y.; Yu, H.; Qin, M.; Sun, J. Detection of Straw Coverage Rate Based on Multi-threshold Image Segmentation Algorithm. Nongye Jixie Xuebao/Trans. Chin. Soc. Agric. Mach. 2018, 49, 27–35, 55. [Google Scholar] [CrossRef]
  11. Cai, W.; Zhao, S.; Wang, Y.; Peng, F.; Heo, J.; Duan, Z. Estimation of Winter Wheat Residue Coverage Using Optical and SAR Remote Sensing Images. Remote Sens. 2019, 11, 1163. [Google Scholar] [CrossRef]
  12. Riegler-Nurscher, P.; Prankl, J.; Vincze, M. Tillage Machine Control Based on a Vision System for Soil Roughness and Soil Cover Estimation. In Computer Vision Systems, Proceedings of the 12th International Conference, ICVS 2019, Thessaloniki, Greece, 23–25 September 2019; Springer: Cham, Switzerland, 2019; pp. 201–210. [Google Scholar]
  13. Liu, Y.Y.; Zhou, X.K.; Wang, Y.Y.; Yu, H.; Geng, C.; He, M. Straw coverage detection of conservation tillage farmland based on improved U-Net model. Opt. Precis. Eng. 2022, 30, 1101. [Google Scholar] [CrossRef]
  14. Yu, K.; Qiu, L.; Wang, J.; Sun, L.; Wang, Z. Winter wheat straw return monitoring by UAVs observations at different resolutions. Int. J. Remote Sens. 2016, 38, 2260–2272. [Google Scholar] [CrossRef]
  15. Shao, Y.; Guan, X.; Xuan, G.; Li, X.; Gu, F.; Ma, J.; Wu, F.; Hu, Z. Detection Method of Straw Mulching Unevenness with RGB-D Sensors. AgriEngineering 2023, 5, 12–19. [Google Scholar] [CrossRef]
  16. Ma, J.; Wu, F.; Xie, H.; Gu, F.; Yang, H.; Hu, Z. Uniformity Detection for Straws Based on Overlapping Region Analysis. Agriculture 2022, 12, 80. [Google Scholar] [CrossRef]
  17. Ma, Q.; Wan, C.; Wei, J.; Wang, W.; Wu, C. Calculation Method of Straw Coverage Based on U-Net Network and Feature Pyramid Network. Trans. Chin. Soc. Agric. Mach. 2023, 54, 224–234. [Google Scholar]
  18. Zhou, D.; Li, M.; Li, Y.; Qi, J.; Liu, K.; Cong, X.; Tian, X. Detection of ground straw coverage under conservation tillage based on deep learning. Comput. Electron. Agric. 2020, 172, 105369. [Google Scholar] [CrossRef]
  19. Liu, Y.; Sun, J.; Zhang, S.; Yu, H.; Wang, Y. Detection of straw coverage based on multi-threshold and multi-target UAV image segmentation optimization algorithm. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2020, 36, 134–143. [Google Scholar] [CrossRef]
  20. Aung, H.L.; Uzkent, B.; Burke, M.; Lobell, D.; Ermon, S. Farm Parcel Delineation Using Spatio-temporal Convolutional Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 340–349. [Google Scholar]
  21. Shao, Y.; Guan, X.; Xuan, G.; Liu, H.; Li, X.; Gu, F.; Hu, Z. Detection of Straw Coverage under Conservation Tillage Based on an Improved Mask Regional Convolutional Neural Network (Mask R-CNN). Agronomy 2024, 14, 1409. [Google Scholar] [CrossRef]
Figure 1. The three study sites are Erdao District of Changchun City, Dehui City, and Yushu City, all in Jilin Province.
Figure 2. Aerial image-based workflow for straw return classification and recognition.
Figure 3. Examples of stitched images and types of straw coverage.
Figure 4. Diagram of the GCF-DeepLabv3+ network model’s structure.
Figure 5. StarNet architecture overview.
Figure 6. (a) Overview of FFT-CBAM; (b) overall process of the fast Fourier transform-based channel attention module; (c) overall process of the fast Fourier transform-based spatial attention module.
Figure 7. (a) Overall process of the gated Conv-Former block; (b) overall process of the gating mechanism.
Figure 8. Comparison of the different models’ accuracy.
Figure 9. Comparison of the recognition and segmentation effects of the different models.
Figure 10. Comparison of confusion matrices for GCF-Deeplabv3+ and Deeplabv3+ models. (a) GCF-Deeplabv3+ confusion matrix; (b) Deeplabv3+ confusion matrix.
Figure 11. Region 1 prediction process diagram. (a) Stitched image; (b) straw cover form prediction image; (c) straw crush form plot extraction; (d) adaptive threshold segmentation detail map.
Figure 12. Region 2 prediction process diagram. (a) Stitched image; (b) straw cover form prediction image; (c) straw crush form plot extraction; (d) adaptive threshold segmentation detail map.
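Figures 11 and 12 trace the regional prediction pipeline from stitched UAV image to adaptive threshold segmentation of the extracted straw-crush plots. As a minimal illustration of that final step, the sketch below estimates a straw coverage rate for one extracted plot with OpenCV's Gaussian adaptive thresholding; the function name, block size, and offset are hypothetical tuning choices, and it assumes straw pixels appear brighter than the surrounding soil, so it is not the exact procedure used in the paper.

```python
import cv2
import numpy as np

def estimate_coverage_rate(plot_bgr, mask=None):
    """Rough straw coverage estimate for one extracted plot (illustrative only)."""
    gray = cv2.cvtColor(plot_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)
    # Pixels brighter than their local Gaussian-weighted mean are treated as straw;
    # block size 51 and offset -5 are tuning choices, not the paper's settings.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 51, -5)
    if mask is None:
        mask = np.full(gray.shape, 255, dtype=np.uint8)  # whole crop counts as the plot
    straw_pixels = cv2.countNonZero(cv2.bitwise_and(binary, binary, mask=mask))
    plot_pixels = cv2.countNonZero(mask)
    return straw_pixels / plot_pixels if plot_pixels else 0.0

# Example usage: rate = estimate_coverage_rate(cv2.imread("plot_crop.png"))
```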
Table 1. Comparative experiments on the different backbone networks.

| Network Model | Backbone | MIoU (%) | mAP (%) | Precision (%) | Total Params (M) |
|---|---|---|---|---|---|
| Deeplabv3+ | Xception | 90.08 | 95.32 | 96.89 | 54.71 |
| Deeplabv3+ | MobileNetv2 | 83.63 | 89.42 | 95.92 | 5.54 |
| Deeplabv3+ | ConvNeXtv2 | 86.05 | 94.51 | 96.33 | 8.16 |
| Deeplabv3+ | MobileNetv4 | 88.68 | 94.39 | 97.82 | 4.23 |
| Deeplabv3+ | StarNet | 93.97 | 95.68 | 98.76 | 3.31 |
Table 2. Ablation experiments on the GCF-DeepLabv3+ model.

| Network Model | Gated-CFB | MKC-FFN | MKC-FFN-FTCM | MIoU (%) | mAP (%) | Precision (%) |
|---|---|---|---|---|---|---|
| Deeplabv3+ (StarNet) | – | – | – | 85.94 | 90.64 | 91.85 |
| Deeplabv3+ (StarNet) | ✓ | – | – | 89.54 | 94.85 | 96.43 |
| Deeplabv3+ (StarNet) | – | ✓ | – | 88.63 | 95.01 | 96.46 |
| Deeplabv3+ (StarNet) | – | – | ✓ | 90.51 | 95.46 | 96.68 |
| Deeplabv3+ (StarNet) | ✓ | – | ✓ | 93.97 | 95.68 | 98.76 |
Table 3. Detection results of the different semantic segmentation models.

| Network Model | MIoU (%) | mAP (%) | Precision (%) | Total Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| Lraspp | 78.73 | 80.98 | 79.68 | 3.22 | 4.16 |
| UNet | 84.12 | 90.42 | 95.36 | 4.38 | 81.03 |
| Segformer | 57.66 | 64.72 | 92.34 | 3.71 | 13.60 |
| Deeplabv3 | 79.40 | 81.64 | 80.76 | 11.02 | 19.91 |
| Deeplabv3+ | 85.32 | 92.89 | 96.50 | 5.81 | 52.93 |
| GCF-DeepLabv3+ (ours) | 93.97 | 95.68 | 98.76 | 3.31 | 41.19 |
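MIoU in Table 3 is the mean, over all straw coverage classes, of the per-class intersection over union. A minimal NumPy sketch is given below, assuming a confusion matrix whose entry (i, j) counts pixels of true class i predicted as class j; mean_iou and the toy matrix are illustrative, not values from the paper.

```python
import numpy as np

def mean_iou(cm):
    """Mean Intersection over Union from a confusion matrix.

    Per-class IoU = TP / (TP + FP + FN); MIoU averages over classes.
    """
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp  # TP + FP + FN per class
    return float(np.mean(tp / np.maximum(union, 1e-12)))

# Toy 3-class example (made-up counts, not data from the paper):
cm = np.array([[95,  3,  2],
               [ 4, 90,  6],
               [ 1,  5, 94]])
print(round(100 * mean_iou(cm), 2))
```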
Table 4. Comparison of classification performance between GCF-Deeplabv3+ and Deeplabv3+.

| Straw Coverage Type | GCF-Deeplabv3+ Precision (%) | GCF-Deeplabv3+ Recall (%) | GCF-Deeplabv3+ F1-Score (%) | Deeplabv3+ Precision (%) | Deeplabv3+ Recall (%) | Deeplabv3+ F1-Score (%) |
|---|---|---|---|---|---|---|
| Straw Vertical | 92.35 | 91.40 | 91.87 | 88.78 | 84.10 | 86.38 |
| Straw Level | 90.20 | 92.65 | 91.41 | 85.50 | 86.47 | 85.98 |
| High Stubble | 91.05 | 89.30 | 90.17 | 85.30 | 86.18 | 85.74 |
| Strip-Tillage | 90.60 | 90.80 | 90.07 | 85.86 | 82.28 | 83.60 |
| Straw Burn | 89.45 | 91.10 | 90.26 | 83.76 | 82.74 | 83.25 |
| Straw Bale | 90.20 | 90.80 | 90.50 | 85.29 | 86.32 | 85.80 |
| Turn the Soil | 89.90 | 88.90 | 89.45 | 85.87 | 82.50 | 84.15 |
| Low Stubble | 90.05 | 89.10 | 89.57 | 85.70 | 85.25 | 85.47 |
| Straw Crush | 88.90 | 86.60 | 87.74 | 84.38 | 85.46 | 84.63 |
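The per-class precision, recall, and F1 scores in Table 4 can be read off confusion matrices such as those in Figure 10. A brief sketch follows, under the usual convention that entry (i, j) counts pixels of true class i predicted as class j; per_class_metrics and the toy matrix are illustrative only.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a confusion matrix."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)  # TP / (TP + FP)
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)      # TP / (TP + FN)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1

# Toy 2-class example (made-up counts):
cm = np.array([[92,  8],
               [ 6, 94]])
for name, values in zip(("precision", "recall", "F1"), per_class_metrics(cm)):
    print(name, np.round(100 * values, 2))
```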
Table 5. Prediction of straw coverage types in two regions using different models.

| Network Model | Region 1 Accuracy (%) | Region 2 Accuracy (%) | Mean Time (s) | FLOPs (G) |
|---|---|---|---|---|
| GCF-Deeplabv3+ | 96.37 | 95.24 | 32.54 | 41.19 |
| Deeplabv3+ | 93.25 | 92.65 | 46.76 | 52.93 |
| UNet | 93.65 | 93.06 | 68.92 | 81.03 |
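The mean times in Table 5 reflect how long each model needs to process a region. The sketch below shows one common way to measure mean per-image forward-pass time for a PyTorch segmentation model; mean_inference_time, the warm-up count, and the data loader are hypothetical, and the harness excludes stitching and post-processing, so it is not the paper's exact timing protocol.

```python
import time
import torch

@torch.no_grad()
def mean_inference_time(model, loader, device="cuda", warmup=5):
    """Mean per-image forward-pass time in seconds.

    Runs a few warm-up batches, then times each forward pass with the wall
    clock, synchronising CUDA so queued GPU work is actually counted.
    """
    model.eval().to(device)
    total_time, n_images = 0.0, 0
    for i, (images, _) in enumerate(loader):
        images = images.to(device)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.perf_counter()
        model(images)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        if i >= warmup:  # discard warm-up batches
            total_time += time.perf_counter() - start
            n_images += images.shape[0]
    return total_time / max(n_images, 1)
```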
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
