1. Introduction
Tea is one of the most popular beverages in the world due to its unique flavor and high nutritional value. In 2020, global tea production reached about 6.269 million tons, and China ranked first with a yield of about 2.986 million tons [
1,
2]. Amidst the annual increase in tea production, however, the increasingly serious aging of the population has gradually reduced the number of tea farmers who pick tea leaves by hand. Manual tea-picking methods have disadvantages such as high cost, low efficiency, and unavoidable subjectivity, which pose considerable challenges to picking high-quality tea [
3]. Although the efficiency of current machine tea-picking equipment has been greatly improved, a one-size-fits-all method is generally adopted, which tends to cause the damage or breakage of tea leaves [
4,
5]. At the same time, tea farmers usually use manual methods to count the number of buds, which is inefficient and time-consuming. Therefore, after accurately identifying and locating the buds, automatically counting their number can not only improve the production efficiency but also realize the estimation of tea production [
6]. However, the detection of high-quality tea buds is challenging due to various factors such as tea bud species, pose and size, and illumination diversity [
7], so accurate tea bud detection in complex environments has become a research hotspot; it is also a prerequisite for the automated and intelligent picking of high-quality tea buds.
The emergence of deep learning in the field of object detection has opened up a wide range of possibilities for accurate tea bud picking. Traditional machine vision methods rely on manually designed posture, color, and texture features of tea buds, young leaves, and old leaves, which are then used to detect and segment tea buds [
8,
9]. Region-based Convolutional Neural Networks (R-CNN) were among the first deep learning models applied to object detection, followed by variants such as Fast R-CNN and Faster R-CNN. One-stage detectors, including You Only Look Once (YOLO) and the Single Shot MultiBox Detector (SSD), have also been integrated into the agricultural sector [
10,
11]. These algorithms have demonstrated improved efficiency and accuracy in various applications such as crop yield prediction, weed detection, and livestock monitoring. At the same time, numerous agricultural experts and scholars have utilized deep learning in tea research. For instance, Li et al. [
12] combined the enhanced YOLOv5 algorithm with the Hungarian matching algorithm and Kalman filtering algorithm to achieve the real-time tracking and monitoring of tea bud targets. This method can estimate the number of tea buds in dynamic images and predict tea yield. Sun et al. [
13] used pre-segmentation to reduce the complexity of the tea bud background, mitigating the effect of complex backgrounds on detection performance, and deployed an enhanced medium-scale YOLO network model for accurate detection with an average accuracy of 84.2%. A tea YOLO algorithm based on YOLOv5 was proposed to detect tea buds and the key points for picking, together with a tea picking point positioning method; compared to the baseline model, it achieves an average accuracy improvement of 5.26% [
14]. In another study, researchers proposed a tea sprout detection method based on Faster-RCNN, achieving an average precision (AP) of 0.54 and a root mean square error (RMSE) of 3.32 when the type of tea sprout was not differentiated [
15]. Wang et al. [
16] developed a model for recognizing tea buds and picking points based on Mask R-CNN, which performed well in complex environments. Chen et al. [
17] proposed a tea bud detection method utilizing image enhancement and a fused single-stage detection network (SSDN) to improve detection accuracy. In summary, deep learning research on tea object detection mainly focuses on improved CNN-based models, such as the one-stage YOLO and SSD series and the two-stage Faster-RCNN, while there is still little research on end-to-end, Transformer-based object detection models for the fast and accurate detection of tea buds.
Because Transformers offer powerful global feature extraction and parallel computing capabilities, Transformer-based object detection algorithms have also been widely studied. In object detection tasks, compared with CNNs, Transformers obtain a larger receptive field and more detailed information, fuse contextual semantic information more effectively, and model global features better [
18]. They can also effectively alleviate the problems of occlusion and small-object detection. Recently, a Transformer-based object detection framework called the detection Transformer (DETR) was introduced, and many researchers have proposed modification strategies for it [
19,
20,
21]. However, DETR suffers from defects such as slow training convergence and limited detection accuracy. The subsequent real-time detection Transformer (RT-DETR) overcomes the slow convergence of DETR training and exceeds comparable YOLO models in detection accuracy. Unlike CNN-based detection frameworks, RT-DETR provides a fully end-to-end object detection pipeline [
22]. Moreover, it eliminates the non-maximum suppression (NMS) post-processing used in previous object detection models, avoiding the delay caused by this step and greatly simplifying the detection pipeline compared with YOLO. In addition, its remarkable performance on standard datasets has prompted researchers to explore its application in various real-world scenarios, providing a new feasible scheme for different research fields. Therefore, we use RT-DETR as the base model and modify it to achieve the rapid and feasible detection and identification of tea buds.
In addition, every object detection model needs a dataset for training, and the quality of the dataset is a key factor in determining whether the model can succeed. In applied research, it is necessary to build specialized datasets that meet the needs of specific scenarios. Existing tea bud detection research is usually based on datasets constructed from a single tea variety, yet the phenotypes of different varieties differ significantly in color and morphology. For example, the colors of Yinghong 9 and Huangyu are completely different, and the bud sizes of Jinxuan and Yinghong 9 also differ greatly. A model that detects tea sprouts accurately only for a single variety, without being evaluated on different varieties in different environments, is therefore of limited use. This study targets the actual complex scene of a tea garden to achieve the accurate and rapid detection of multiple varieties of tea buds, with the main purpose of improving the generalization and robustness of the proposed model. The specific contributions of this study are as follows:
- 1.
A multi-variety tea bud dataset in an unstructured environment is constructed, covering Jinxuan, Hongyan 12, Yinghong 9, and Huangyu. A total of 1250 images are collected for each variety, yielding 5000 tea bud images in total.
- 2.
The RT-DETR-Tea detection model, based on the Transformer framework, accurately identifies multiple tea bud varieties in natural environments and achieves satisfactory performance on the independently constructed dataset, exhibiting good generalization and robustness.
- 3.
A feature fusion mechanism (GD-Tea) that effectively fuses low-level and high-level semantic information in tea bud images is proposed to improve the accuracy of tea bud recognition.
3. Experiment and Result Analysis
3.1. Experimental Platform and Model Parameter Settings
All experiments in this study were based on the PyTorch deep learning framework, with Python as the programming language. The main configuration of the computer used in the experiments was as follows: an Intel Core i5 CPU, the Windows 11 (64-bit) operating system, and an NVIDIA GeForce RTX 4060 Ti GPU with 8 GB of memory. The main parameter settings of the model are displayed in
Table 1.
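For reference, the sketch below shows how such a training configuration is typically wired up in PyTorch. It is a minimal illustration only: the hyperparameter values are placeholders standing in for the actual settings listed in Table 1, and `build_training` is a hypothetical helper, not part of the RT-DETR-Tea code.

```python
# Minimal PyTorch training-setup sketch. All values below are placeholders
# illustrating the kind of settings summarized in Table 1; they are NOT the
# exact hyperparameters used for RT-DETR-Tea.
import torch
from torch.utils.data import DataLoader

cfg = {
    "epochs": 200,          # placeholder
    "batch_size": 8,        # placeholder
    "lr": 1e-4,             # placeholder
    "weight_decay": 1e-4,   # placeholder
    "img_size": 640,        # input resolution used in the experiments
    "device": "cuda" if torch.cuda.is_available() else "cpu",
}

def build_training(model, train_dataset):
    """Wrap a detection model and dataset with a standard AdamW schedule."""
    loader = DataLoader(train_dataset, batch_size=cfg["batch_size"],
                        shuffle=True, num_workers=4, pin_memory=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"],
                                  weight_decay=cfg["weight_decay"])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=cfg["epochs"])
    return loader, optimizer, scheduler
```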
3.2. Backbone Network Comparison
The design of the backbone network, a core model component, has a direct impact on the final performance of the model. In this paper, ResNet18, ResNet34, ResNet50, and the original RT-DETR backbone HGNetV2 were compared in terms of the number of parameters, computational complexity, and detection accuracy.
Table 2 shows a detailed comparison of structural parameters and training results.
ResNet18 is a lightweight network from the ResNet family and has a more streamlined structure and lower complexity than the other networks [
28]. From the comparison results, it can be seen that ResNet18 has only 19.8 M parameters, significantly fewer than the other networks. A smaller number of parameters means a leaner structure, which in turn reduces computational resource requirements and storage footprint while lowering the risk of overfitting. In terms of computation, ResNet18 requires only 57 GFLOPs, the lowest in the table; fewer floating-point operations correspond to less computational overhead and reduce the dependence on large computing devices. Although ResNet18 achieves a mean average precision of 77.3%, slightly lower than HGNetV2's 77.8%, its smaller parameter count and lower computational complexity make it a candidate backbone that balances performance and efficiency.
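As a quick reproducibility check, parameter counts of candidate backbones can be verified directly in PyTorch. The sketch below (assuming a recent torchvision) counts parameters of the bare torchvision backbones; note that the figures in Table 2 refer to the complete detector built on each backbone, so the absolute numbers differ (a bare ResNet18, for instance, has roughly 11.7 M parameters).

```python
# Count trainable parameters of candidate backbone networks.
# These are the bare torchvision backbones; Table 2 reports the full detector,
# so the absolute numbers will not match exactly.
import torch
from torchvision import models

def count_params(model: torch.nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

for name, ctor in [("resnet18", models.resnet18),
                   ("resnet34", models.resnet34),
                   ("resnet50", models.resnet50)]:
    print(f"{name}: {count_params(ctor(weights=None)):.1f} M parameters")
```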
3.3. Visual Recognition of the Heatmap
Deep learning models have achieved high-precision recognition results in specific detection tasks in agricultural applications, but their interpretability is relatively limited. Model interpretability helps researchers better understand and trust the proposed model. Therefore, to better explain the ability of the proposed model to learn the characteristics of tea buds, we use Grad-CAM to visualize heat maps of the detection results of RT-DETR-r18 and RT-DETR-Tea. Grad-CAM uses gradient information to generate weights and overlays a heat map on the original image according to the weight magnitude. The color ranges from blue to red, and regions with greater weight, shown in red, are more important for tea bud detection [
29].
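To make the procedure concrete, the snippet below is a minimal hook-based Grad-CAM sketch on a generic CNN feature layer. The target layer and the scalar score used for backpropagation are illustrative simplifications (a torchvision ResNet18 stands in for the detector backbone); applying Grad-CAM to a DETR-style detection head requires choosing an appropriate query score instead.

```python
# Minimal hook-based Grad-CAM sketch on a CNN feature layer.
# The target layer and the scalar "score" below are illustrative; applying
# Grad-CAM to a DETR-style detector requires choosing a suitable query score.
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, x, target_layer):
    feats, grads = {}, {}

    def fwd_hook(_, __, output):
        feats["y"] = output
    def bwd_hook(_, grad_in, grad_out):
        grads["y"] = grad_out[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    score = model(x).max()          # scalar to backpropagate (illustrative)
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["y"].mean(dim=(2, 3), keepdim=True)    # GAP of gradients
    cam = F.relu((weights * feats["y"]).sum(dim=1))         # weighted feature sum
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode="bilinear", align_corners=False).squeeze(1)
    cam -= cam.min()
    cam /= cam.max() + 1e-8                                 # normalize to [0, 1]
    return cam  # high values (rendered red) mark regions driving the score

# Usage sketch on a torchvision ResNet18 (stand-in for the detector backbone).
model = models.resnet18(weights=None).eval()
heat = grad_cam(model, torch.randn(1, 3, 640, 640), model.layer4[-1])
```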
As can be seen from
Figure 9, the original RT-DETR-r18 model focuses more on large tea bud targets but pays less attention to some small targets, leading to missed detections. This may be because the original feature fusion loses low-level features, resulting in insufficient semantic information. The proposed RT-DETR-Tea model focuses on large and small targets simultaneously. The bud areas in the image appear red on the heat map, even in the complex tea garden environment, and even when there are multiple bud targets, the bud information is still effectively attended to. This may be because the GD-Tea feature fusion module uses multi-level features (shallow and deep) and their effective representation to fuse rich semantic information. As shown in the test results in the figure, the improved model highlights the key areas in the heat map and realizes the accurate identification of tea buds. This indicates that the GD-Tea feature fusion module proposed in this study can integrate rich semantic information, attend effectively to global information, and still detect tea buds reliably in the complex environment of tea gardens.
3.4. Ablation Experiments
To verify whether the proposed RT-DETR-Tea model, built on the RT-DETR-r18 base model, exhibits a better detection effect, ablation experiments were carried out on the tea bud dataset constructed in this study.
Table 3 shows the results of the ablation experiments.
As shown in Table 3, Improvement 1 (named CGFA) replaces the MHSA mechanism in the AIFI module of the original model with cascaded group attention, raising the mAP by 0.5% over the original model. CGFA effectively optimizes the deep features and enriches the semantic information of the model, which helps locate bud positions more accurately. Improvement 2 replaces the original CCFM feature fusion mechanism with the proposed GD-Tea module, improving the mAP by 1.2%; this method effectively fuses low-level and high-level semantic information and ensures that the model can focus on tea bud targets of different sizes in complex environments. In Improvement 3, DRBC3 is used instead of RepC3, and the mAP of the model improves by 0.8%, which indicates the necessity of multi-level feature fusion. Improvement 4 employs both the CGFA and GD-Tea strategies, increasing the mAP by 1.4%. Improvement 5 is the final improvement strategy proposed in this study and yields the largest gain: the mAP is 2.4% higher than that of the original model, and the FLOPs drop to 51.4 G, indicating that the final model not only improves detection accuracy but also accelerates computation. Compared with the RT-DETR-r18 model, the proposed model has good detection ability and can meet the application requirements of tea bud detection.
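For readers unfamiliar with the mechanism swapped in by Improvement 1, the standard cascaded group attention formulation on which CGFA builds is sketched below, assuming the usual notation in which the input feature is split into h channel groups; the exact design of CGFA may differ in detail from this generic form.

```latex
% Cascaded group attention (standard formulation; sketch).
% The input feature X is split along channels into h groups X_1, ..., X_h.
\begin{aligned}
X'_j &= \begin{cases} X_j, & j = 1,\\ X_j + \widetilde{X}_{j-1}, & 1 < j \le h, \end{cases}\\
\widetilde{X}_j &= \operatorname{Attn}\!\left(X'_j W^{Q}_j,\; X'_j W^{K}_j,\; X'_j W^{V}_j\right),\\
\widetilde{X} &= \operatorname{Concat}\!\left(\widetilde{X}_1, \ldots, \widetilde{X}_h\right) W^{P},
\end{aligned}
```

where W^Q_j, W^K_j, and W^V_j project the j-th group and W^P is the output projection. Unlike MHSA, each head sees only a channel slice of the feature, and the cascade passes each head's output into the next head's input, reducing computation while enriching the per-head representation.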
3.5. Effective Receptive Field (ERF)
The main function of the DRBC3 module proposed in this study is to integrate the injected features into the feature fusion and to enlarge the receptive field of the fused features through a large convolution kernel, thereby enhancing the detection and localization of tea buds. Table 3 shows that this modification not only increases detection accuracy by 0.8% but also reduces the number of parameters in the model. However, since the increase in receptive field is not directly visible from these metrics, we randomly selected 50 images from the test set and resized them uniformly to 640 × 640. We normalized the per-pixel feature contribution of each image to the range 0 to 1 and measured the proportion of the effective pixel area contributing to the feature map.
Table 4 shows a comparison of the effective-pixel contribution ratios of RepC3 and DRBC3 when the pixel contribution threshold t is set to 20%, 30%, 50%, and 99%.
As shown above, compared with RepC3, DRBC3 increases the proportion of effective pixels at every pixel contribution threshold; when t = 99%, the proportion of effective pixels reaches 94.1%. The enlarged receptive field effectively improves the detection and localization of tea buds and explains the effectiveness of the proposed improvement.
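The sketch below illustrates one common way such an effective-pixel proportion can be measured: backpropagate from the center of the final feature map to the input, normalize the absolute input-gradient contribution to [0, 1], and report the area ratio of the smallest centered region capturing a fraction t of the total contribution. Here `model_backbone` is a hypothetical handle to the network up to the fused feature map; the paper's exact measurement protocol may differ.

```python
# Effective receptive field (ERF) sketch: contribution of each input pixel to
# the center of the final feature map, aggregated over sample images, and the
# area ratio of the smallest centered square capturing a fraction t of the
# total contribution. Illustrative only; the exact protocol may differ.
import torch

def erf_contribution(model_backbone, images):
    """images: (N, 3, 640, 640). Returns a (640, 640) contribution map in [0, 1]."""
    acc = torch.zeros(images.shape[-2:])
    for img in images:
        x = img.unsqueeze(0).requires_grad_(True)
        feat = model_backbone(x)                      # (1, C, h, w) fused feature map
        center = feat[..., feat.shape[-2] // 2, feat.shape[-1] // 2].sum()
        grad = torch.autograd.grad(center, x)[0]      # d(center) / d(input)
        acc += grad.abs().sum(dim=1).squeeze(0)       # aggregate over channels
    acc /= acc.max() + 1e-12                          # normalize to [0, 1]
    return acc

def high_contribution_area_ratio(contrib, t):
    """Area ratio of the smallest centered square holding fraction t of the mass."""
    H, W = contrib.shape
    total = contrib.sum()
    for r in range(1, max(H, W) // 2 + 1):
        cs, ce = H // 2 - r, H // 2 + r
        ws, we = W // 2 - r, W // 2 + r
        if contrib[max(cs, 0):ce, max(ws, 0):we].sum() >= t * total:
            return (min(ce, H) - max(cs, 0)) * (min(we, W) - max(ws, 0)) / (H * W)
    return 1.0
```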
3.6. Comparison Results of Multiple Models
The improved model is compared with the two-stage object detection model Faster-RCNN [31] and one-stage detection models such as SSD [30], YOLOv5s [32], and YOLOv8l [33]. The results of the analysis are shown in Table 5, which highlights the accuracy advantage of the improved model for tea bud detection over the other detectors. As shown in Figure 10, on the tea bud dataset our model achieves the best P, R, and mAP metrics, at 96.1%, 91.7%, and 79.7%, respectively.
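For reference, the P, R, and mAP metrics reported here follow their standard definitions; the block below is a sketch assuming the usual TP/FP/FN formulation with IoU-thresholded matching.

```latex
% Standard detection metrics (sketch).
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i ,
```

where TP, FP, and FN are true positives, false positives, and false negatives determined under an IoU threshold, and N is the number of classes (here, the single tea bud class).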
Although the parameter count and FLOPs of the proposed model are higher than those of YOLOv5s, its application in practical projects is not affected. The proposed model has 20 M parameters and 51.4 GFLOPs, representing reductions of 37.3% and 46.2%, respectively, compared with the RT-DETR-L model.
To better demonstrate the superiority of the proposed model, Yinghong 9 was taken as an example: six models were used to detect tea buds in their natural state, and the detection effect of each model was evaluated (as shown in Figure 10). Figure 10c,d show the results of Faster-RCNN and SSD, a two-stage and a single-stage detector, respectively. There are obvious missed and false detections in areas where tea buds are occluded, which may be due to the lack of a fusion operation between high-frequency and low-frequency information, resulting in insufficient feature information and reduced detection accuracy.
Figure 10a,b,e show the detection results of YOLOv5s, YOLOv8l, and RT-DETR-L. These models incorporate improved feature fusion, and their missed detections of buds are effectively reduced, but there is still room for improvement.
Figure 10f shows the detection results of the improved model proposed in this study. We improved the model's feature fusion to achieve the effective fusion of high-frequency semantic information and low-frequency information, improving the localization and recognition accuracy of the target, and reduced the missed detection of small targets by enlarging the receptive field. The detection performance of the proposed model is clearly much improved.
3.7. Verification of Bud Detection
To verify the generalization ability and detection performance of the proposed RT-DETR-Tea model, we independently constructed a tea bud dataset. The dataset involves multiple tea tree varieties, close-up tea bud images, and dense and complex environments, making it effective for testing detection models. The final inference results are shown in
Figure 11 and
Table 6.
In general, the original RT-DETR-r18 model has the shortcomings of missed and false detections.
Figure 11 shows the detection results of the improved model and the original model on four kinds of tea buds: Yinghong 9, Jinxuan, Huangyu, and Hongyan 12. For Yinghong 9, Jinxuan, and Hongyan 12, the RT-DETR-r18 model shows missed detections when occlusion and dense buds are encountered, especially for small bud targets. This is likely because the original model loses semantic information in high-level features during feature fusion, resulting in missed detections of small targets and false detections under occlusion. In the improved RT-DETR-Tea model, the GD-Tea module achieves an effective fusion of multi-level features, and the detection results for Yinghong 9 and Jinxuan clearly demonstrate the advantages of the improved model. The results for Huangyu show that even when the buds are similar in color to the leaves, the improved model achieves better detection than the original model. The proposed DRBC3 module not only enlarges the receptive field but also enhances the understanding of features, which effectively improves tea bud localization and recognition by the detection head and prevents false detections. Overall, the improvement strategy effectively boosts the detection performance of the model.
3.8. Results of a Comprehensive Analysis of Tea Buds Classified by Size
In practice, the size and number of tea buds in the image and the surrounding environment change as the camera moves; bud information is more complete and more easily recognized when the camera is close to the buds. As the visual system widens its detection field of view, problems such as an excessive number of buds, severe occlusion, and small, densely packed bud targets are likely to occur. In addition, the color and size of buds differ greatly among tea tree varieties, and bud color is often similar to leaf color, which undoubtedly increases the difficulty of detecting buds in complex environments. To evaluate the efficacy of the improved model in detecting tea buds of varying sizes within a tea garden environment, this study constructed datasets representing different bud sizes for detection analysis. The specific results are presented as follows:
Figure 12b has a larger field of view than Figure 12a, which makes the small size of the tea bud targets more prominent. Analyzing the detection results for tea buds of different sizes in the real tea garden environment further highlights the superiority of the proposed model in detecting small targets and indicates that the model can accurately detect both large and small targets.
3.9. Tea Bud Detection and Verification Under Different Light Intensities
Changes in light intensity influence tea bud detection. To verify the generalization and detection accuracy of the proposed model under different illumination, 50 tea bud images of different varieties were randomly selected from the dataset. Gamma transformations were applied to these images to simulate tea bud images under different light intensities, and the transformed images were input into the RT-DETR-Tea model to verify changes in model performance and analyze the detection effect.
In this study, gamma transformations were used to simulate tea bud images under four different light intensities: gamma = 0.45 simulates the tea garden light intensity from 11:00 to 13:00 in summer, gamma = 0.75 from 13:00 to 15:00, gamma = 2.0 from 15:00 to 16:00, and gamma = 3.5 from 16:00 to 18:00. The light-transformed tea bud images were input into the detection network, and the specific detection results are shown in Figure 13.
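Below is a minimal sketch of the gamma transformation described above (a standard power-law mapping of normalized pixel values, where gamma < 1 brightens the image and gamma > 1 darkens it); the image path is a hypothetical placeholder.

```python
# Gamma transformation to simulate different light intensities.
# gamma < 1 brightens the image (stronger apparent illumination),
# gamma > 1 darkens it; the gamma values follow the settings described above.
import cv2
import numpy as np

def gamma_transform(image_bgr: np.ndarray, gamma: float) -> np.ndarray:
    """Apply a power-law (gamma) mapping to an 8-bit BGR image via a lookup table."""
    lut = (np.linspace(0.0, 1.0, 256) ** gamma * 255.0).astype(np.uint8)
    return cv2.LUT(image_bgr, lut)

img = cv2.imread("tea_bud_sample.jpg")               # hypothetical sample image
for g in (0.45, 0.75, 2.0, 3.5):                     # settings used in this section
    cv2.imwrite(f"tea_bud_gamma_{g}.jpg", gamma_transform(img, g))
```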
The results showed that tea buds could be accurately detected under the different light intensities, although buds could be missed when the light intensity was too strong. In the comparison of Figure 13c,d, the confidence values of tea bud detection decreased significantly, indicating that too low a light intensity leads to a decline in tea bud detection accuracy but does not prevent the detection and positioning of the buds. Overall, the gamma transformation experiments show that the proposed model has good generalization and robustness.
4. Conclusions
To address the limitation of single-variety tea bud detection, this study uses the improved RT-DETR-Tea model to realize the accurate identification of tea buds of different varieties with different phenotypes. In this model, CGFA effectively optimizes the deep features and enriches the semantic information. GD-Tea effectively fuses low-level and high-level semantic information, ensures that the model attends to both large and small tea buds in complex environments, and reduces the probability of missed and false detections. DRBC3 further enhances the understanding of features and effectively improves the localization and recognition of regions of interest by the detection head.
In general, compared with the RT-DETR-r18 base model, the RT-DETR-Tea model exhibits greatly improved detection accuracy for multi-variety tea buds. The results show that the precision (P) and mAP of the RT-DETR-Tea model are 96.1% and 79.7%, increases of 5.2% and 2.4%, respectively. In addition, to verify the robustness and generalization of the model, we constructed a new multi-variety tea bud dataset, and the comparison results show that the proposed model reduces missed and false detections. Therefore, the proposed RT-DETR-Tea model can accurately detect multi-variety tea buds in natural environments and can provide technical support for automatic picking positioning and yield statistics in actual tea production.
There are still some limitations to our work. Firstly, due to limited resources and manpower, the data in this paper only contain samples of four varieties of tea sprouts. Although the model has good robustness and generalization, it may not generalize well to varieties not included in the training data. Therefore, a larger tea bud dataset must be constructed, which would not only improve the robustness of the model but also ensure that its application is not limited by tea variety. In addition, this study only considers detection based on RGB images; because of viewing angle and excessive density, some buds are completely occluded and cannot be detected. In future work, we will therefore consider fusing depth information to enable the rapid reconstruction of a 3D scene of tea buds and the detection of completely occluded buds in 3D space.