Article

Soybean Weed Detection Based on RT-DETR with Enhanced Multiscale Channel Features

1 School of Mathematics and Computer Science, Wuhan Polytechnic University, Wuhan 430040, China
2 BiSiCloud (Wuhan) Information Technology Co., Ltd., Wuhan 430024, China
3 School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073, China
4 Zhengzhou Xinsiqi Technology Co., Ltd., Zhengzhou 450046, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(9), 4812; https://doi.org/10.3390/app15094812
Submission received: 6 February 2025 / Revised: 23 April 2025 / Accepted: 24 April 2025 / Published: 26 April 2025

Abstract

To address missed and false detections by object detection models when identifying weeds that accompany soybean, this paper proposes an enhanced multi-scale channel feature model based on RT-DETR (EMCF-RTDETR). First, we designed a lightweight hybrid-channel feature extraction backbone network, in which a CGF-Block module and a FasterNet-Block module work together to reduce computation and parameter count while improving feature extraction efficiency. Second, we constructed the EA-AIFI module, which strengthens the extraction of detailed features by combining the intra-scale feature interaction module with the Efficient Additive attention mechanism. In addition, we designed an Enhanced Multiscale Feature Fusion (EMFF) network structure, which first processes the inputs of the three feature layers differently and then ensures effective flow between the original and enhanced features of each layer through two multiscale feature fusions and one diffusion step. The experimental results demonstrate that the EMCF-RTDETR model improves mAP50 and mAP50:95 by 3.2% and 2.2%, respectively, compared with the RT-DETR model, while FPS improves by 10%. Moreover, our model outperforms other mainstream detection models in terms of accuracy and speed, revealing its significant potential for soybean weed detection.

1. Introduction

Soybean is a critical grain and oil crop with significant economic importance [1]. It is rich in isoflavones, lecithin, soybean peptides and many other nutrients beneficial to human health [2], which has led to wide applications of soybean in fields such as food, feed, health care and medicine. However, soybean yield in China is lower than the global average [3]. Among the various reasons for this, one of the main factors is that weeds compete with soybeans for soil water, nutrients and sunlight during growth [4], which seriously hampers soybean development and directly lowers yields. Currently, chemical weeding is the predominant weed management approach [5]. Although it controls weeds effectively, it poses a potential threat to the farmland environment and food safety [6]. Therefore, achieving efficient, precise and environmentally friendly weed management has become a key problem to be solved.
In recent years, smart agriculture has developed rapidly [7,8], and computer vision technology has gradually demonstrated its advantages in agricultural weed management [9,10,11]. Early applications of computer vision to weed detection usually required image preprocessing, such as grayscale conversion or binarization, followed by feature extraction. The weeds were then classified and identified by recognition algorithms based on the extracted features, such as color, leaf texture and shape. The main recognition algorithms include the support vector machine (SVM), random forest and artificial neural network (ANN) [12,13,14]. Such methods place relatively low demands on processing hardware, but because they depend on hand-crafted features they face clear limitations when identifying objects in complex backgrounds.
As research on convolutional neural networks continues to deepen, computer vision has moved from classical machine learning toward deep learning [15,16]. Deep learning has a wide range of applications in daily life and industrial production, such as grain pile temperature detection [17], air quality prediction [18] and wind power prediction [19]. Object detection models [20] can be divided into two categories: those based on convolutional neural networks and those based on the transformer [21].
Among convolutional neural network models, two-stage algorithms require more computation time and resources than single-stage algorithms because they must first generate candidate regions [22]. Faster R-CNN alleviates this by integrating the region proposal network (RPN) into the two-stage pipeline, which significantly improves detection speed [23], and it has therefore been widely used in weed detection [24]. Single-stage algorithms, such as the YOLO series and SSD, offer faster inference than two-stage algorithms but generally lower accuracy. Ding et al. [25] proposed a lightweight weed detection model whose experimental results showed reduced computational complexity together with improved average accuracy. Similar work on crop pest and disease detection [26,27] improved detection speed, but the gains in average detection accuracy were modest.
The transformer was originally developed for natural language processing tasks such as machine translation, text generation and sentiment analysis [28]; its use in object detection has been proposed only in recent years. The Detection Transformer (DETR) is the first transformer-based object detection model and breaks with the tradition of convolutional neural network-based detectors [29], which require anchor generation and non-maximum suppression (NMS) [30]. DETR instead treats object detection directly as the prediction of an unordered set of outputs. By combining the transformer architecture with self-attention, the model can exploit both local and global features while simplifying the detection pipeline. Despite these advantages, DETR still suffers from slow convergence and limited ability to detect small targets. Guo et al. [31] proposed RMS-DETR for detecting weeds in rice fields, improving average accuracy by 4.4% over the original model. Li et al. [32] combined a semi-supervised object detection (SSOD) algorithm with DETR to monitor the growth of Chinese cabbage, reaching an average accuracy of 74.1%. Yang et al. [33] used an improved DETR to detect rice pests and diseases, improving average accuracy by about 29%.
To address the inaccurate detection and heavy computation of the above models in weed detection, this paper proposes a model dedicated to weed detection, the Enhanced Multiscale Channel Feature model based on RT-DETR (EMCF-RTDETR). The main contributions of this paper are as follows:
  • A lightweight convolutional gated feature extraction backbone network is constructed. A new channel mixer is fused into the FasterNet Block to build a new backbone BasicBlock, which not only improves computational efficiency but also strengthens the flexibility and robustness of feature extraction.
  • The attention-based intra-scale feature interaction module (AIFI) is an adaptive input feature integration module whose main function is to fuse feature maps of different scales effectively and improve the model's ability to detect diverse targets. We introduce the Efficient Additive Attention module into AIFI to enhance feature capture and significantly improve computational efficiency.
  • The enhanced multi-scale feature fusion (EMFF) module is designed. By implementing two cross-scale feature fusions and one diffusion strategy, it significantly improves the accuracy of feature fusion and enhances the model's adaptability to diverse targets.
The rest of the paper is organized as follows: Section 2 describes the proposed model; Section 3 analyzes the experimental results; Section 4 discusses the usefulness of the proposed model and future work; Section 5 concludes.

2. Materials and Methods

2.1. RT-DETR Model

The Real-Time DEtection Transformer (RT-DETR) is a real-time end-to-end object detector proposed by Baidu's PaddlePaddle team in April 2023 [34]. Compared with other strong object detection models, such as YOLOv5 [35], YOLOv6 [36], Deformable-DETR [37] and DETR [29], it delivers excellent speed and accuracy and resolves the trade-off between inference speed and accuracy that limits traditional real-time detectors.
The structure of RT-DETR is similar to that of traditional object detection networks, but it introduces innovations in the extraction and fusion of multilayer outputs. The architecture of the model mainly consists of a backbone network, a hybrid encoder, and a decoder with an auxiliary detection head. The backbone network uses a convolutional neural network (CNN) for feature extraction and inputs the extracted features into the encoder for efficient processing. The hybrid encoder comprises an attention-based intra-scale feature interaction module (AIFI) and a cross-scale feature fusion module (CCFF). In the decoder part, the RT-DETR uses an IoU-aware query selection module to select a certain number of image features from the encoder output as initial object queries, followed by iterative optimization to generate bounding boxes and confidence scores. The RT-DETR model structure is shown in Figure 1.
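To make this data flow concrete, the following is a minimal PyTorch sketch of the hybrid encoder stage, in which AIFI is realized as a standard transformer encoder layer applied only to the highest-level (P5) feature map. The class name, channel widths and the placeholder where CCFF would fuse the maps are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HybridEncoderSketch(nn.Module):
    """Schematic hybrid encoder: 1x1 projections, AIFI on P5, CCFF left as a placeholder."""
    def __init__(self, dims=(128, 256, 512), d_model=256, nhead=8):
        super().__init__()
        self.in_proj = nn.ModuleList([nn.Conv2d(c, d_model, 1) for c in dims])
        # AIFI: intra-scale interaction realized as a transformer encoder layer
        self.aifi = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, feats):                      # feats = [P3, P4, P5] from the backbone
        p3, p4, p5 = (proj(f) for proj, f in zip(self.in_proj, feats))
        b, c, h, w = p5.shape
        tokens = p5.flatten(2).transpose(1, 2)     # (b, h*w, c) token sequence
        p5 = self.aifi(tokens).transpose(1, 2).reshape(b, c, h, w)
        return [p3, p4, p5]                        # CCFF would fuse these maps across scales

feats = [torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)]
print([f.shape for f in HybridEncoderSketch()(feats)])
```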

2.2. EMCF-RTDETR Model

To improve accuracy and real-time performance simultaneously for soybean weed detection, this paper adopts ResNet-18 as the backbone network of RT-DETR [34]. ResNet-18 offers faster computation and a smaller model size than networks such as ResNet-L, ResNet-101 and ResNet-50 [38,39]. The EMCF model structure is illustrated in Figure 2. The CGF-Block significantly improves the backbone network by combining a hybrid channel mechanism with the FasterNet Block [40]. This innovation not only reduces redundancy and computational complexity but also enhances feature extraction through multi-channel information processing. Furthermore, the AIFI module is refined by introducing the Efficient Additive Attention mechanism [41], which makes feature interaction in AIFI more flexible and faster. Finally, the EMFF network employs differentiated processing strategies, effectively preserving the unique characteristics of each feature layer and mitigating the feature loss or information blurring that can occur during fusion.

2.2.1. CGF-Block Module

Although ResNet-18 is a classic lightweight convolutional neural network that performs well in feature extraction, its structural design has some shortcomings. As shown in Figure 1, the two 3 × 3 convolutional layers in each residual block of ResNet-18 enhance the model's learning ability but also increase computational complexity and memory access, which reduces efficiency when processing high-dimensional features. In addition, soybean weed detection requires accurately capturing fine-grained features against complex backgrounds. To address these challenges, this article builds the CGF-Block module by fusing the FasterNet Block with the Convolutional GLU (CGLU) mechanism [40,42]. The module exploits the computational efficiency of the partial convolution in the FasterNet Block and the feature selection capability of the convolutional gated linear unit, so that CGF-Block not only enriches feature extraction but also reduces redundant memory usage, making the model more efficient at detecting weeds in soybean fields. Compared with the original ResNet-18 convolution structure, CGF-Block makes better use of multi-channel information while markedly reducing parameters and computational cost. The detailed structure of the CGF-Block module is shown in Figure 3.
One of the most critical components of the CGF-Block is PConv [40]. PConv is an efficient convolution method that applies the convolution operation only to a subset of feature map channels while leaving the remaining channels untouched, thereby reducing computational redundancy and improving efficiency. For contiguous and regular memory access, the first or last consecutive channels are taken as representative of the whole feature map in the computation. With identical input and output channel counts, this selective convolution offers remarkable computational savings: when the convolved channel proportion is 1/4, the floating-point operations (FLOPs) of PConv drop to 1/16 of those of a conventional convolution, and its memory accesses drop to about 1/4. The FLOP count of PConv is given by the following equation:
$$F = h \times w \times k^{2} \times c_{p}^{2}$$
The memory access volume calculation process of PConv is shown in the following equation:
$$M = h \times w \times 2c_{p} + k^{2} \times c_{p}^{2}$$
where $h$ and $w$ are the height and width of the feature map, $c_{p}$ is the number of consecutive channels on which the convolution operates, and $k$ is the convolution kernel size.
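As a concrete illustration, the following is a minimal PyTorch sketch of a PConv-style layer under the assumptions above (convolution applied to the first $c_p$ channels, identity on the rest); the class name and default channel ratio are illustrative, not the FasterNet reference code.

```python
import torch
import torch.nn as nn

class PConvSketch(nn.Module):
    """Minimal sketch of partial convolution (PConv): a k x k convolution is applied
    only to the first c_p channels; the remaining channels pass through untouched."""
    def __init__(self, channels: int, ratio: float = 0.25, k: int = 3):
        super().__init__()
        self.cp = max(1, int(channels * ratio))          # channels that are convolved
        self.conv = nn.Conv2d(self.cp, self.cp, k, padding=k // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.cp, x.size(1) - self.cp], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)     # identity on the untouched part

# With ratio = 1/4, the convolved part costs h*w*k^2*cp^2 FLOPs, i.e. (1/4)^2 = 1/16 of a
# full convolution over all channels, matching the equations above.
x = torch.randn(1, 64, 32, 32)
assert PConvSketch(64)(x).shape == x.shape
```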
The CGLU mechanism [42] replaces the original channel-mixing convolution in FasterNet and combines a depth-wise convolution with a gating mechanism: CGLU produces a separate gating signal for each feature channel based on the local features of that channel and its spatial neighbourhood. This improves the model's perception of both local and global features while keeping computational complexity low and improving robustness.
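A minimal sketch of such a CGLU-style gate is shown below, assuming a 1 × 1 expansion into value and gate branches, a 3 × 3 depth-wise convolution on the gate branch for local context, and a 1 × 1 projection back; the layer widths and activation are assumptions and may differ from the TransNeXt implementation used in CGF-Block.

```python
import torch
import torch.nn as nn

class ConvGLUSketch(nn.Module):
    """Minimal sketch of a convolutional gated linear unit (CGLU)-style channel mixer:
    each channel is modulated by a gate computed from its local 3 x 3 neighbourhood."""
    def __init__(self, channels, hidden=None):
        super().__init__()
        hidden = hidden or channels * 2
        self.value = nn.Conv2d(channels, hidden, 1)
        self.gate = nn.Conv2d(channels, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depth-wise, local context
        self.act = nn.GELU()
        self.proj = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        # gate signal depends on the channel itself and its spatial surroundings
        return self.proj(self.value(x) * self.act(self.dw(self.gate(x))))

x = torch.randn(1, 64, 32, 32)
print(ConvGLUSketch(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```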

2.2.2. EA-AIFI

The AIFI module, an attention-based intra-scale feature interaction mechanism, plays a crucial role in RT-DETR, enabling the model to detect objects efficiently and accurately. However, the multi-head self-attention inside AIFI, although capable of capturing rich feature relationships in a sequence, suffers from high computational complexity, large memory consumption and limited scalability, particularly for long sequence inputs, which significantly affects efficiency and performance. To address these issues, this paper introduces Efficient Additive (EA) Attention [41], which replaces the expensive attention computation with an additive formulation, reduces memory requirements and improves scalability. As a result, EA markedly increases the efficiency of processing long sequences while preserving expressive capability, making it better suited to large-scale applications such as weed detection. The EA Attention structure is shown in Figure 4.
The EA Attention mechanism primarily simplifies the computational process by eliminating the expensive matrix multiplication operations in traditional self-attention mechanisms, adopting linear transformations and element-wise multiplications instead. This significantly reduces the computational complexity. Moreover, this mechanism relies solely on the interaction between queries and keys, eliminating the explicit interaction between keys and values, which allows it to be applied at all stages of the network while maintaining a high context-capturing capability. Consequently, it provides a consistent context of information captured across different resolutions.
The specific process first embeds the input into a matrix X and applies linear transformations to obtain the query matrix Q and the key matrix K. A learnable weight vector $\omega_{\alpha}$ is then applied to Q to compute the attention weights $\alpha$, and a weighted average over the queries produces the global query vector q. Finally, q is multiplied element-wise with the key matrix K to form the context representation C. The calculation process is shown in the equations below:
$$\alpha = \frac{Q \cdot \omega_{\alpha}}{\sqrt{d}}$$
$$q = \sum_{i=1}^{n} \alpha_{i} Q_{i}$$
$$C = K \odot q$$
where d is the dimension of the query, used for normalization, and ⊙ denotes element-wise multiplication.
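The following PyTorch sketch follows the three equations above, with a softmax over tokens implementing the weighted average that forms the global query; the class name is hypothetical, and the full SwiftFormer block adds further linear layers and a residual path that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientAdditiveAttentionSketch(nn.Module):
    """Minimal sketch of efficient additive attention as used in EA-AIFI: no Q.K^T matrix
    product, only linear maps, a learned per-token score and an element-wise
    query-key interaction (see the equations above)."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.w_a = nn.Parameter(torch.randn(dim, 1))          # learnable weight vector omega_alpha
        self.scale = dim ** -0.5

    def forward(self, x):                                     # x: (batch, tokens, dim)
        q = self.to_q(x)
        k = self.to_k(x)
        alpha = F.softmax(q @ self.w_a * self.scale, dim=1)   # weights for the weighted average
        g = (alpha * q).sum(dim=1, keepdim=True)              # global query vector q
        return k * g                                          # context C = K (.) q, broadcast over tokens

x = torch.randn(2, 400, 256)   # e.g. a flattened 20 x 20 feature map with 256 channels
print(EfficientAdditiveAttentionSketch(256)(x).shape)         # torch.Size([2, 400, 256])
```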

2.2.3. EMFF Network

In RT-DETR [34], the cross-scale feature fusion (CCFF) module is a core component that fuses features across scales. Our target scenario, soybean weed detection, is characterized by small targets and complex backgrounds and requires accurate detection of multiple weeds while keeping the number of parameters and the amount of computation small; this calls for further refinement of multi-scale feature fusion [43]. To address these challenges, we propose the Enhanced Multi-Scale Feature Fusion (EMFF) module, inspired by adaptive feature fusion models [44,45]. EMFF is a more flexible multi-scale feature extraction and fusion mechanism designed to improve the granularity and comprehensiveness of feature representation. By treating the P3, P4 and P5 feature layers differently through two feature layer aggregations and one diffusion process, the module ensures that the features of each layer are fully expressed and fully utilized [45].
In this process, the fusion module stands as the core design element of EMFF, with its structure illustrated in Figure 5. Within this fusion process, distinct operations are strategically applied to different feature layers: the P3 layer employs ADown downsampling to extract advanced semantic features [46]; the P4 layer utilizes 1 × 1 convolution to adjust features, thereby preserving the integrity of mid-level representations; the P5 layer enhances detailed features through upsampling. This approach optimizes each feature layer, significantly improving cross-scale feature fusion effectiveness. In the feature enhancement phase, the method introduces parallel depth-wise convolutions with multiple kernel sizes (5 × 5, 7 × 7, 9 × 9, and 11 × 11) designed to match and amplify cross-scale features [47]. Additionally, a 1 × 1 convolution layer is incorporated to refine feature representations, thereby elevating the quality and expressive capabilities of the features. Ultimately, the residual connection mechanism is retained, ensuring effective flow between original and enhanced features while preventing information loss.
Assume that the feature inputs of layers P3, P4 and P5 are $x_{1}$, $x_{2}$ and $x_{3}$, respectively, and that the final feature output is Y; the process is given by the formulas below:
$$F = \mathrm{Concat}\big(\mathrm{ADown}(x_{1}),\ \mathrm{Conv}_{1\times1}(x_{2}),\ \mathrm{Conv}_{1\times1}(\mathrm{Upsample}(x_{3}))\big)$$
$$F_{i} = \mathrm{DWConv}_{k_{i}\times k_{i}}(F)$$
where F is the feature obtained after the concatenation operation and $F_{i}$ is the output of the depth-wise separable convolution with kernel size $k_{i}\times k_{i}$, with $k_{i} \in \{5, 7, 9, 11\}$ and $i = 1, 2, 3, 4$. Applying convolution kernels of different sizes to F captures rich context under different receptive fields, enhancing the perception of targets at different scales: smaller kernels capture more local detail, while larger kernels help obtain global context, striking a balance between local and global features. The parallel structure ensures that features from different receptive fields complement one another, effectively improving detection robustness.
$$F' = \mathrm{Conv}_{1\times1}(F) + \sum_{i=1}^{4} \mathrm{Conv}_{1\times1}(F_{i})$$
$$Y = F + F'$$
where F' is the refined feature produced by the 1 × 1 convolution layers, and the final output Y is obtained through the residual connection. By aggregating and diffusing the three feature layers in different forms, the EMFF module performs better when detecting small targets in complex soybean field scenes, and the accuracy and robustness of the model are greatly improved.
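Putting the above equations together, a minimal PyTorch sketch of the fusion step might look as follows; a strided convolution stands in for the ADown block of YOLOv9, and the channel widths and interpolation mode are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class EMFFFusionSketch(nn.Module):
    """Minimal sketch of the EMFF fusion step: P3 is downsampled, P4 is adjusted with a
    1x1 convolution, P5 is upsampled, the three are concatenated, enhanced by parallel
    depth-wise convolutions (5/7/9/11) plus a 1x1 refinement, then added back residually."""
    def __init__(self, c3, c4, c5, out_ch):
        super().__init__()
        self.down3 = nn.Conv2d(c3, out_ch, 3, stride=2, padding=1)   # stand-in for ADown
        self.adj4 = nn.Conv2d(c4, out_ch, 1)
        self.up5 = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                 nn.Conv2d(c5, out_ch, 1))
        fused = 3 * out_ch
        self.dws = nn.ModuleList(
            [nn.Conv2d(fused, fused, k, padding=k // 2, groups=fused) for k in (5, 7, 9, 11)])
        self.pw = nn.ModuleList([nn.Conv2d(fused, fused, 1) for _ in range(4)])
        self.refine = nn.Conv2d(fused, fused, 1)

    def forward(self, p3, p4, p5):
        f = torch.cat([self.down3(p3), self.adj4(p4), self.up5(p5)], dim=1)   # F
        f_prime = self.refine(f) + sum(pw(dw(f)) for dw, pw in zip(self.dws, self.pw))  # F'
        return f + f_prime                                                    # residual output Y

p3, p4, p5 = torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)
print(EMFFFusionSketch(128, 256, 512, 256)(p3, p4, p5).shape)   # torch.Size([1, 768, 40, 40])
```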

2.3. Dataset

This paper uses the soybean weed dataset (dados_tcc) created by the Federal University of Rondonopolis (UFR) and published on the Roboflow website for weed detection in soybean [48]. The dataset consists of three categories: soybean crops, narrow-leaf weeds and broad-leaf weeds. Each image has a resolution of 640 × 640 pixels and has undergone preprocessing with automatic orientation of the pixel data. Data augmentation techniques, including random rotation, random brightness adjustment and random Gaussian blurring, were employed to increase data diversity and improve the model's generalization capability. Figure 6 illustrates images of the three categories.
The dataset contains 8083 images, but its original training, test and validation subsets do not follow a standard split ratio. To meet the experimental requirements, this paper re-divides the dataset in a 7:2:1 ratio, giving 5658 training images, 1617 test images and 808 validation images. The number of instances per category is shown in Table 1.
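For reproducibility, a simple way to realize such a 7:2:1 re-split is sketched below; the directory layout and file naming are hypothetical placeholders, not the structure of the published dataset.

```python
import random
from pathlib import Path

# Hypothetical re-split of the dados_tcc images into 7:2:1 train/test/val subsets.
random.seed(0)
images = sorted(Path("dados_tcc/images").glob("*.jpg"))
random.shuffle(images)
n = len(images)
n_train, n_test = int(0.7 * n), int(0.2 * n)
splits = {"train": images[:n_train],
          "test": images[n_train:n_train + n_test],
          "val": images[n_train + n_test:]}
for name, files in splits.items():
    # write one image path per line, e.g. train.txt / test.txt / val.txt
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```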

2.4. Evaluation Index

This paper uses multiple indicators to evaluate the performance of the model for soybean weed detection, namely precision, recall, average precision, number of parameters, computational cost (GFLOPs) and detection frame rate (FPS). Precision, recall and average precision are calculated as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
Here, TP denotes the number of weeds correctly detected by the model in the dados_tcc dataset, FP denotes the number of incorrect detections, and FN denotes the number of weeds the model fails to detect.
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$$
$$mAP = \frac{1}{C}\sum_{i=1}^{C} AP_{i}$$
where C is the total number of categories in the dataset and $AP_{i}$ is the detection accuracy of the i-th class. The IoU threshold for mAP50 is 0.5 in this paper; mAP50:95 is obtained by computing mAP at IoU thresholds from 0.5 to 0.95 and averaging the results. FPS is used to evaluate the real-time detection speed of the model: the higher the FPS, the better the real-time detection efficiency.
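For reference, the following NumPy sketch shows how AP (the area under the precision-recall curve) and mAP are typically computed once per-class precision-recall points are available; the matching of predictions to ground truth at a given IoU threshold, which produces TP/FP/FN, is omitted.

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolation of the area under the precision-recall curve (AP).
    `recall` and `precision` are arrays ordered by descending detection confidence."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # make precision monotonically decreasing
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class APs."""
    return float(np.mean(ap_per_class))

# toy example with three recall/precision points
print(average_precision(np.array([0.2, 0.5, 1.0]), np.array([1.0, 0.8, 0.6])))  # 0.74
```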

3. Results

3.1. Experimental Environment and Parameters

All experiments were conducted on an NVIDIA RTX 3080 GPU (NVIDIA Corporation, Santa Clara, CA, USA) using Python 3.8 and the PyTorch 1.13 deep learning framework. No pre-trained weights were used during training. The modified training parameters are listed in Table 2; all other parameters keep their default values.
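As an illustration of how the Table 2 settings map onto a plain PyTorch training loop, a hedged sketch is given below; `model` and `train_loader` are placeholders, and the assumption that the model returns its training loss directly is ours, not a property of any specific RT-DETR implementation.

```python
import torch
from torch.optim import AdamW

def train(model, train_loader, epochs=200, lr=1e-6, device="cuda"):
    """Minimal training loop using the Table 2 hyper-parameters (AdamW, lr 1e-6, 200 epochs).
    Batch size 16 is assumed to be set in the DataLoader; no pre-trained weights are loaded."""
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, targets in train_loader:
            loss = model(images.to(device), targets)   # assumes the model returns its loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```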

3.2. Comparative Experiment

To comprehensively evaluate the performance of the proposed model, current mainstream single-stage and two-stage detection models are compared with it on the dados_tcc dataset; the results are shown in Table 3. The two-stage models selected are Faster R-CNN [23], DINO [49], RT-DETR [34] and Deformable-DETR [37]. The single-stage models include the most popular YOLO series and EfficientNet [50], namely YOLOv3t [51], YOLOv5m [35], YOLOv6m [36], YOLOv9m [46] and YOLOv10m [52].
Based on the data in Table 3, our model demonstrates outstanding overall performance. Compared with YOLOv9m [46], the mAP50 and mAP50:95 of our model are 0.2% and 0.3% lower, respectively; however, its computational efficiency is significantly higher, with GFLOPs reduced by 31% and an FPS far exceeding that of YOLOv9m. Compared with YOLOv5m [35], our model is 0.6% and 0.5% lower in recall and mAP50:95, respectively, but its FPS is 11 frames higher and its parameter count is 28% smaller. YOLOv3t [51] reaches a high FPS of 214 with low GFLOPs, but its mAP50 and mAP50:95 are 7% and 16.8% lower than those of our model, respectively, which makes accurate recognition difficult in the weed detection task and limits its practical value. The other single-stage detection models also slightly underperform our approach. Our model achieves 88.4% mAP50 and 77.8% mAP50:95 on the dados_tcc dataset while reaching a solid 50.3 FPS. The two-stage detection models fall well below the proposed model in both average accuracy and FPS. In conclusion, our model remains highly competitive in detection capability compared with these mainstream detection models.

3.3. Ablation Experiment

To verify the benefit of the proposed CGF-Block module, EA-AIFI and EMFF network, this paper uses RT-DETR as the baseline and ablates each module in turn. All experiments were conducted on the dados_tcc dataset [48], and a batch size of 1 was used when testing FPS. The results are shown in Table 4.
According to Table 4, when the CGF-Block module, EA-AIFI or the newly constructed EMFF network individually replaces the corresponding module in the baseline, accuracy improves, but each configuration also has certain disadvantages relative to the baseline. For instance, the FPS of the CGF-Block and EA-AIFI configurations is noticeably lower than the baseline, and the GFLOPs of EMFF are 12% higher. When the three modules replace the corresponding baseline modules together, the GFLOPs and parameter count are slightly lower than the baseline, mAP50 and mAP50:95 increase by 3.2% and 2.2%, respectively, and the FPS increases by 4.7 frames. These ablation results confirm the effectiveness of the proposed multi-scale channel feature optimization model for soybean weed detection. To better illustrate the behaviour of our model and the baseline on the dataset, three images were randomly selected and the results of both models visualized. The visualizations in Figure 7 show that the baseline misses detections, especially for small targets, and its detection responses do not cover the whole image, which highlights the accuracy of the proposed model. The main reason is that our model fuses multi-scale features through multiple channels, fully combining local features with global context, thereby improving weed detection, particularly for small targets.

3.4. Generalization Experiment

Given the considerable diversity of weeds that accompany soybean in natural environments, generalization experiments are conducted on the Aerial Weeds dataset [53] to further validate the utility of the proposed model for soybean weed detection. The dataset was collected by unmanned aerial vehicles and contains five kinds of weeds and a variety of crops; its environment closely resembles that of soybean weeds, making it well suited to verifying the performance of our model. The experimental environment and parameter settings remain consistent with those in Section 3.1, and the results are given in Table 5.
Table 5 shows that the proposed EMCF model is equally effective on the Aerial Weeds dataset. Compared with the baseline model, mAP50 and mAP50:95 improve by 6.4% and 5.2%, respectively, while precision and recall improve by 4.9% and 2.9%. This fully demonstrates the accuracy of the model for weed detection against complex backgrounds. To observe the results visually, the two models are visualized on the Aerial Weeds dataset in Figures 8 and 9; the figures show that the proposed model produces fewer missed and false detections than the baseline and is superior in weed detection accuracy.

4. Discussion

A series of experiments demonstrated the superiority of the proposed EMCF model in soybean weed detection, and the ablation studies validated the contribution of each module. In the backbone design, we integrated a multi-channel gating mechanism with FasterNet [40], exploiting its partial convolution to operate on only part of the feature channels and thereby reduce computational complexity and parameter count, while the gating signals of the channel mechanism strengthen feature extraction. Within the AIFI framework, we replaced the traditional multi-head attention with the EA attention mechanism [41]. This not only facilitates upper-layer feature extraction but also intrinsically strengthens feature representation, significantly improving local feature extraction. Moreover, we constructed the EMFF by combining a diffusion network structure with a fusion module, enabling the model to capture local features and global information more comprehensively and thus improving both accuracy and processing speed. This strategy resonates with the multiscale receptive field optimization techniques in [54], which emphasized the importance of parallelized multiscale operations for efficient feature integration in cluttered environments. Notably, the fusion module processes multiple depth-wise separable convolutions in parallel across different scales, progressively expanding the receptive field and performing more nuanced feature fusion across channels, which provides richer information for subsequent network layers and further strengthens the model's detection capability.
We compared our model with the single-stage real-time detection models YOLOv10m [52], YOLOv9m [46], YOLOv6m [36], YOLOv5m [35], YOLOv3t [51] and EfficientNet [50]. Relative to YOLOv5m [35] and YOLOv9m [46], our model makes a slight compromise in mAP50:95, mainly because of its lightweight design, whose principal advantage is the reduction of redundant computation. For example, the FasterNet Block in CGF-Block contains a large number of 1 × 1 convolutions, which reduces the number of parameters but may weaken the modelling of spatial features. Nevertheless, the overall comparative results show that, while maintaining competitive detection accuracy, the detection speed stays at the same level or is even slightly improved. Compared with the two-stage detection models Faster R-CNN [23], DINO [49], RT-DETR [34] and Deformable-DETR [37], our model improves detection accuracy and speed markedly while significantly reducing the number of parameters and the amount of computation.
Finally, to verify the generalization ability of the model, we conducted comparative experiments against the baseline on the Aerial Weeds dataset. The results show that, despite the differences from the dados_tcc dataset, our model still achieves clear improvements in accuracy and real-time detection over the baseline. The visualized detection results contain fewer missed detections and false positives than the baseline, especially in areas where crops and weeds coexist, which is consistent with the quantitative gains in recall and mAP and supports real-time applicability in resource-limited agricultural deployments. Although our model shows good generalization ability, it cannot yet be claimed to be fully practical in every application scenario. We therefore plan to explore integration with automated phenotyping [55], spectral fusion [56] and optical measurement [57,58] technologies in agriculture to prepare for deploying future models in real-world environments.
Compared with the weed detection models proposed in [59,60,61], the EMCF-RTDETR model presented in this paper leads in average accuracy and detection speed while greatly reducing the number of parameters, demonstrating that it can provide an efficient and accurate solution for soybean weed detection.

5. Conclusions

Considering that soybean fields contain many weed species and that conventional detection methods struggle to distinguish weeds from soybean crops accurately, this paper proposes an enhanced multi-scale channel feature model based on RT-DETR [34]. The experimental results show that mAP50, mAP50:95 and FPS increase by 3.2%, 2.2% and 10%, respectively, while the number of parameters and the amount of computation are lower than those of the benchmark model. Compared with the YOLO series of real-time detection models, the detection speed remains at the same level while the detection accuracy stays in the lead; in particular, compared with the newly released YOLOv10m [52], mAP50 and mAP50:95 increase by 3.1% and 0.7%, respectively. Finally, to verify the generalization ability of the model, we compared it with the benchmark model on the Aerial Weeds dataset; the results show that its accuracy still leads and that it reduces missed and false detections.
Despite the model's improvement in detection accuracy, its real-time detection speed is not yet significantly better than that of mainstream object detection models. Future work will therefore focus on further reducing the model's parameters and computation to improve real-time detection speed, providing a more valuable reference for the intelligent management of weed control in farmland.

Author Contributions

Conceptualization, H.Y. and Y.L.; methodology, Y.L.; software, Y.L., F.J. and Y.J.; validation, T.D., H.X. and L.Y.; formal analysis, Y.J., F.J. and Z.M.; investigation, J.G., T.D. and Y.Q.; resources, H.Y., T.D. and Y.L.; data curation, L.Y., J.G. and Y.Q.; writing—original draft preparation, Y.L.; writing—review and editing, H.Y., F.J. and Y.L.; visualization, Z.M. and H.X.; supervision, H.Y.; project administration, H.Y. and Y.J.; funding acquisition, H.Y. and Y.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (U1833119); the Humanities and Social Science Research Foundation of Ministry of Education of China (22YJAZH038); the Ministry of Education Industry-University Cooperation Education Project (231106627155856); 2025 ESI (Engineering) (01003009); Hubei Provincial Natural Science Foundation (2025AFC122); the School Enterprise Cooperation Project (No.whpu-2024-kj-4582, No.whpu-2024-kj-4639).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

These datasets can be found at https://universe.roboflow.com/universidade-federal-de-rondonpolis-phpbg/dados_tcc/ (accessed on 8 June 2024) and https://data.mendeley.com/datasets/8kjcztbjz2/2 (accessed on 8 June 2024).

Conflicts of Interest

Author Yunpeng Jiang was employed by the company BiSiCloud (Wuhan) Information Technology Co., Ltd.; and author Taiyong Deng was employed by the company Zhengzhou Xinsiqi Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Medic, J.; Atkinson, C.; Hurburgh, C.R. Current Knowledge in Soybean Composition. J. Am. Oil Chem. Soc. 2014, 91, 363–384. [Google Scholar] [CrossRef]
  2. Kim, I.-S.; Yang, W.-S.; Kim, C.-H. Beneficial Effects of Soybean-Derived Bioactive Peptides. Int. J. Mol. Sci. 2021, 22, 8570. [Google Scholar] [CrossRef] [PubMed]
  3. Jamet, J.-P.; Chaumet, J.-M. Soybean in China: Adaptating to the Liberalization. OCL 2016, 23, D604. [Google Scholar] [CrossRef]
  4. Datta, A.; Ullah, H.; Tursun, N.; Pornprom, T.; Knezevic, S.Z.; Chauhan, B.S. Managing Weeds Using Crop Competition in Soybean [Glycine max (L.) Merr.]. Crop Prot. 2017, 95, 60–68. [Google Scholar] [CrossRef]
  5. Manisankar, G.; Ghosh, P.; Malik, G.C.; Banerjee, M. Recent Trends in Chemical Weed Management: A Review. Pharma Innov. J. 2022, 11, 745–753. [Google Scholar]
  6. Bond, W.; Grundy, A.C. Non-Chemical Weed Management in Organic Farming Systems. Weed Res. 2001, 41, 383–405. [Google Scholar] [CrossRef]
  7. Ray, P.P. Internet of Things for Smart Agriculture: Technologies, Practices and Future Direction. J. Ambient. Intell. Smart Environ. 2017, 9, 395–420. [Google Scholar] [CrossRef]
  8. Sinha, B.B.; Dhanalakshmi, R. Recent Advancements and Challenges of Internet of Things in Smart Agriculture: A Survey. Future Gener. Comput. Syst. 2022, 126, 169–184. [Google Scholar] [CrossRef]
  9. Patrício, D.I.; Rieder, R. Computer Vision and Artificial Intelligence in Precision Agriculture for Grain Crops: A Systematic Review. Comput. Electron. Agric. 2018, 153, 69–81. [Google Scholar] [CrossRef]
  10. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep Learning for Computer Vision: A Brief Review. Comput. Intell. Neurosci. 2018, 2018, 7068349. [Google Scholar] [CrossRef]
  11. Dhanya, V.G.; Subeesh, A.; Kushwaha, N.L.; Vishwakarma, D.K.; Nagesh Kumar, T.; Ritika, G.; Singh, A.N. Deep Learning Based Computer Vision Approaches for Smart Agricultural Applications. Artif. Intell. Agric. 2022, 6, 211–229. [Google Scholar] [CrossRef]
  12. Tao, T.; Wei, X. A Hybrid CNN–SVM Classifier for Weed Recognition in Winter Rape Field. Plant Methods 2022, 18, 29. [Google Scholar] [CrossRef]
  13. Bakhshipour, A.; Jafari, A. Evaluation of Support Vector Machine and Artificial Neural Networks in Weed Detection Using Shape Features. Comput. Electron. Agric. 2018, 145, 153–160. [Google Scholar] [CrossRef]
  14. Anubha Pearline, S.; Sathiesh Kumar, V.; Harini, S. A Study on Plant Recognition Using Conventional Image Processing and Deep Learning Approaches. J. Intell. Fuzzy Syst. 2019, 36, 1997–2004. [Google Scholar] [CrossRef]
  15. Khan, A.I.; Al-Habsi, S. Machine Learning in Computer Vision. Procedia Comput. Sci. 2020, 167, 1444–1451. [Google Scholar] [CrossRef]
  16. Sharma, N.; Sharma, R.; Jindal, N. Machine Learning and Deep Learning Applications—A Vision. Glob. Transit. Proc. 2021, 2, 24–28. [Google Scholar] [CrossRef]
  17. Yang, H.; Wu, Y.; Zhu, Y.; Deng, X.; Liu, N.; Zhao, Q. Application of Decomposition-Based Deep Learning Model in Grain Pile Temperature Prediction. J. South-Cent. Minzu Univ. (Nat. Sci. Ed.) 2023, 42, 696–701. [Google Scholar] [CrossRef]
  18. Jiang, F.; Zhu, Q.; Tian, T. An Ensemble Interval Prediction Model with Change Point Detection and Interval Perturbation-Based Adjustment Strategy: A Case Study of Air Quality. Expert Syst. Appl. 2023, 222, 119823. [Google Scholar] [CrossRef]
  19. Jiang, F.; Zhu, Q.; Yang, J.; Chen, G.; Tian, T. Clustering-Based Interval Prediction of Electric Load Using Multi-Objective Pathfinder Algorithm and Elman Neural Network. Appl. Soft Comput. 2022, 129, 109602. [Google Scholar] [CrossRef]
  20. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object Detection With Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef]
  21. Li, Y.; Miao, N.; Ma, L.; Shuang, F.; Huang, X. Transformer for Object Detection: Review and Benchmark. Eng. Appl. Artif. Intell. 2023, 126, 107021. [Google Scholar] [CrossRef]
  22. Du, L.; Zhang, R.; Wang, X. Overview of Two-Stage Object Detection Algorithms. J. Phys. Conf. Ser. 2020, 1544, 012033. [Google Scholar] [CrossRef]
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  24. Fan, X.; Chai, X.; Zhou, J.; Sun, T. Deep Learning Based Weed Detection and Target Spraying Robot System at Seedling Stage of Cotton Field. Comput. Electron. Agric. 2023, 214, 108317. [Google Scholar] [CrossRef]
  25. Ding, Y.; Jiang, C.; Song, L.; Liu, F.; Tao, Y. RVDR-YOLOv8: A Weed Target Detection Model Based on Improved YOLOv8. Electronics 2024, 13, 2182. [Google Scholar] [CrossRef]
  26. Yang, H.; Lin, D.; Zhang, G.; Zhang, H.; Wang, J.; Zhang, S. Research on Detection of Rice Pests and Diseases Based on Improved Yolov5 Algorithm. Appl. Sci. 2023, 13, 10188. [Google Scholar] [CrossRef]
  27. Yang, H.; Sheng, S.; Jiang, F.; Zhang, T.; Wang, S.; Xiao, J.; Zhang, H.; Peng, C.; Wang, Q. YOLO-SDW: A Method for Detecting Infection in Corn Leaves. Energy Rep. 2024, 12, 6102–6111. [Google Scholar] [CrossRef]
  28. Yang, H.; Zhang, S.; Shen, H.; Zhang, G.; Deng, X.; Xiong, J.; Feng, L.; Wang, J.; Zhang, H.; Sheng, S. A Multi-Layer Feature Fusion Model Based on Convolution and Attention Mechanisms for Text Classification. Appl. Sci. 2023, 13, 8550. [Google Scholar] [CrossRef]
  29. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  30. Neubeck, A.; Van Gool, L. Efficient Non-Maximum Suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855. [Google Scholar]
  31. Guo, Z.; Cai, D.; Zhou, Y.; Xu, T.; Yu, F. Identifying Rice Field Weeds from Unmanned Aerial Vehicle Remote Sensing Imagery Using Deep Learning. Plant Methods 2024, 20, 105. [Google Scholar] [CrossRef]
  32. Li, H.; Shi, F. A DETR-like Detector-Based Semi-Supervised Object Detection Method for Brassica Chinensis Growth Monitoring. Comput. Electron. Agric. 2024, 219, 108788. [Google Scholar] [CrossRef]
  33. Yang, H.; Deng, X.; Shen, H.; Lei, Q.; Zhang, S.; Liu, N. Disease Detection and Identification of Rice Leaf Based on Improved Detection Transformer. Agriculture 2023, 13, 1361. [Google Scholar] [CrossRef]
  34. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974. [Google Scholar]
  35. Jocher, G. Ultralytics YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 27 September 2024).
  36. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  37. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  39. Khan, R.U.; Zhang, X.; Kumar, R.; Aboagye, E.O. Evaluating the Performance of ResNet Model Based on Image Recognition. In Proceedings of the 2018 International Conference on Computing and Artificial Intelligence, Chengdu, China, 12–14 March 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 86–90. [Google Scholar]
  40. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 12021–12031. [Google Scholar]
  41. Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.-H.; Khan, F.S. SwiftFormer: Efficient Additive Attention for Transformer-Based Real-Time Mobile Vision Applications. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 17379–17390. [Google Scholar]
  42. Shi, D. TransNeXt: Robust Foveal Visual Perception for Vision Transformers. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 17773–17783. [Google Scholar]
  43. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  44. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10778–10787. [Google Scholar]
  45. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly Kernel Inception Network for Remote Sensing Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 27706–27716. [Google Scholar]
  46. Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 1–21. [Google Scholar]
  47. Koluguri, N.R.; Park, T.; Ginsburg, B. TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 8102–8106. [Google Scholar]
  48. Universidade Federal de Rondonpolis. Dados_tcc Dataset. 2023. Available online: https://universe.roboflow.com/universidade-federal-de-rondonpolis-phpbg/dados_tcc/ (accessed on 8 June 2024).
  49. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  50. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  51. Adarsh, P.; Rathi, P.; Kumar, M. YOLO V3-Tiny: Object Detection and Recognition Using One Stage Improved Model. In Proceedings of the 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 6–7 March 2020; pp. 687–694. [Google Scholar]
  52. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  53. Rai, N.; Mahecha, M.V.; Christensen, A.; Quanbeck, J.; Zhang, Y.; Howatt, K.; Ostlie, M.; Sun, X. Multi-Format Open-Source Weed Image Dataset for Real-Time Weed Identification in Precision Agriculture. Data Brief 2023, 51, 109691. [Google Scholar] [CrossRef]
  54. Zhao, D.; Zhou, H.; Chen, P.; Hu, Y.; Ge, W.; Dang, Y.; Liang, R. Design of Forward-Looking Sonar System for Real-Time Image Segmentation With Light Multiscale Attention Net. IEEE Trans. Instrum. Meas. 2024, 73, 4501217. [Google Scholar] [CrossRef]
  55. Zhou, Y.; Zhou, H.; Chen, Y. An Automated Phenotyping Method for Chinese Cymbidium Seedlings Based on 3D Point Cloud. Plant Methods 2024, 20, 151. [Google Scholar] [CrossRef]
  56. Han, H.; Sha, R.; Dai, J.; Wang, Z.; Mao, J.; Cai, M. Garlic Origin Traceability and Identification Based on Fusion of Multi-Source Heterogeneous Spectral Information. Foods 2024, 13, 1016. [Google Scholar] [CrossRef]
  57. Ma, X.; Wang, T.; Lu, L.; Huang, H.; Ding, J.; Zhang, F. Developing a 3D Clumping Index Model to Improve Optical Measurement Accuracy of Crop Leaf Area Index. Field Crops Res. 2022, 275, 108361. [Google Scholar] [CrossRef]
  58. Ma, X.; Liu, Y. A Modified Geometrical Optical Model of Row Crops Considering Multiple Scattering Frame. Remote Sens. 2020, 12, 3600. [Google Scholar] [CrossRef]
  59. Rehman, M.U.; Eesaar, H.; Abbas, Z.; Seneviratne, L.; Hussain, I.; Chong, K.T. Advanced Drone-Based Weed Detection Using Feature-Enriched Deep Learning Approach. Knowl.-Based Syst. 2024, 305, 112655. [Google Scholar] [CrossRef]
  60. Fan, X.; Sun, T.; Chai, X.; Zhou, J. YOLO-WDNet: A Lightweight and Accurate Model for Weeds Detection in Cotton Field. Comput. Electron. Agric. 2024, 225, 109317. [Google Scholar] [CrossRef]
  61. Pei, H.; Sun, Y.; Huang, H.; Zhang, W.; Sheng, J.; Zhang, Z. Weed Detection in Maize Fields by UAV Images Based on Crop Row Preprocessing and Improved YOLOv4. Agriculture 2022, 12, 975. [Google Scholar] [CrossRef]
Figure 1. RT-DETR model structure.
Figure 2. EMCF model structure.
Figure 3. CGF-Block module structure. ☉: element-wise multiplication of matrices; ⊕: element-wise addition of matrices.
Figure 4. EA Attention structure.
Figure 5. Fusion module structure. ⊕: element-wise addition of matrices.
Figure 6. The three categories in the dataset. (a) Soybeans; (b) Broad-leaf weeds; (c) Narrow-leaf weeds.
Figure 7. Visualization results of ablation experiments. Darker colors and broader color ranges indicate higher detection accuracy. (a) an original image; (b) baseline model visualization; (c) our model visualization.
Figure 8. Visualized detection results of the baseline model.
Figure 9. Visualized detection results of our model.
Table 1. Dataset category information table.

Dataset | Total Instances | Narrow-Leaf Weed | Broad-Leaf Weed | Soybeans
Train | 6732 | 2588 | 2417 | 1727
Val | 1922 | 754 | 664 | 504
Test | 979 | 353 | 356 | 270
Total | 9633 | 3695 | 3437 | 2501
Table 2. Model training parameters.

Type | Value | Type | Value
Epoch | 200 | Optimizer | AdamW
Batch size | 16 | Learning rate | 1 × 10−6
Table 3. Comparative experimental results.

Model | Precision (%) | Recall (%) | mAP50 (%) | mAP50:95 (%) | GFLOPs (G) | Paras (M) | FPS
YOLOv3t | 84.6 | 73.8 | 81.4 | 61.0 | 18.9 | 12.1 | 214.1
YOLOv5m | 91.6 | 82.4 | 88.2 | 78.3 | 64 | 25.0 | 39.3
YOLOv6m | 87.7 | 81.0 | 86.9 | 77.3 | 161.1 | 52.0 | 46.3
YOLOv9m | 92.6 | 80.0 | 88.6 | 78.1 | 76.5 | 20.0 | 38.3
YOLOv10m | 92.0 | 75.6 | 85.3 | 77.1 | 63.4 | 16.5 | 40.0
EfficientNet | 90.1 | 80.3 | 84.8 | 74.1 | 87.9 | 18.38 | 21.8
RT-DETR | 87.8 | 78.4 | 85.2 | 75.6 | 56.9 | 19.9 | 45.6
Faster-RCNN | 91.8 | 78.4 | 88.2 | 72.6 | -- | 41.4 | 7.3
DINO | 91.6 | 76.1 | 83.5 | 74.2 | -- | 47.5 | 4.3
Deformable-DETR | 89.2 | 68.8 | 87.4 | 71.8 | -- | 40.1 | 5.8
Ours | 92.7 | 81.8 | 88.4 | 77.8 | 52.5 | 18.0 | 50.3
Table 4. Results of ablation experiments.

Model | CGF-Block | EA-AIFI | EMFF | Precision (%) | Recall (%) | mAP50 (%) | mAP50:95 (%) | GFLOPs (G) | Paras (M) | FPS
Baseline | | | | 87.8 | 78.4 | 85.2 | 75.6 | 57 | 19.9 | 45.6
1 | ✓ | | | 91.1 | 79.2 | 87.3 | 76.4 | 46.2 | 15.4 | 39.1
2 | | ✓ | | 90.5 | 81.2 | 87.7 | 76.8 | 57.5 | 20.0 | 43.3
3 | | | ✓ | 89.9 | 78.8 | 86.9 | 76.5 | 63.7 | 21.3 | 47.5
4 | ✓ | ✓ | | 92.6 | 80.8 | 87.9 | 77.3 | 46.4 | 15.4 | 48.9
Ours | ✓ | ✓ | ✓ | 92.7 | 81.8 | 88.4 | 77.8 | 52.5 | 18.0 | 50.3
Table 5. Results of generalization experiments.

Model | Precision (%) | Recall (%) | mAP50 (%) | mAP50:95 (%) | GFLOPs (G) | Paras (M) | FPS
Baseline | 61.5 | 56.4 | 50.1 | 28.3 | 57.0 | 19.9 | 44.5
Ours | 66.4 | 59.3 | 56.5 | 33.5 | 52.5 | 18.0 | 49.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
