1. Introduction
Tea is one of the most popular beverages in the world due to its unique flavor and high nutritional value. In 2020, global tea production reached about 6.269 million tons, and China ranked first with a yield of about 2.986 million tons [
1,
2]. Amidst the annual increase in tea production, however, the increasingly serious aging of the population has gradually reduced the number of tea farmers who pick tea leaves by hand. Manual tea-picking methods have disadvantages such as high cost, low efficiency, and unavoidable subjectivity, which pose considerable challenges to picking high-quality tea [
3]. Although the efficiency of current machine tea-picking equipment has been greatly improved, a one-size-fits-all method is generally adopted, which tends to cause the damage or breakage of tea leaves [
4,
5]. At the same time, tea farmers usually use manual methods to count the number of buds, which is inefficient and time-consuming. Therefore, after accurately identifying and locating the buds, automatically counting their number can not only improve the production efficiency but also realize the estimation of tea production [
6]. However, the detection of high-quality tea buds is challenging due to various factors such as tea bud species, pose and size, and illumination diversity [
7], so accurate tea bud detection in complex environments has become a research hotspot; it is also a prerequisite for the automated and intelligent picking of high-quality tea buds.
The emergence of deep learning in the field of object detection has opened up a wide range of possibilities for accurate tea bud picking. Traditional machine vision methods rely on manually designed posture, color, and texture features of tea buds, young leaves, and old leaves, which are then used to detect and segment tea buds [
8,
9]. Region-based Convolutional Neural Networks (R-CNN) were among the first deep learning models applied to object detection, followed by variants such as Fast R-CNN and Faster R-CNN. One-stage detectors, including You Only Look Once (YOLO) and the Single Shot MultiBox Detector (SSD), have also been integrated into the agricultural sector [
10,
11]. These algorithms have demonstrated improved efficiency and accuracy in various applications such as crop yield prediction, weed detection, and livestock monitoring. At the same time, numerous agricultural experts and scholars have utilized deep learning in tea research. For instance, Li et al. [
12] combined the enhanced YOLOv5 algorithm with the Hungarian matching algorithm and Kalman filtering algorithm to achieve the real-time tracking and monitoring of tea bud targets. This method can estimate the number of tea buds in dynamic images and predict tea yield. Sun et al. [
13] used pre-segmentation to reduce the complexity of the tea bud background, mitigating the effect of complex backgrounds on detection performance, and deployed an enhanced medium-scale YOLO network model for accurate detection with an average accuracy of 84.2%. A tea YOLO algorithm based on YOLOv5 was proposed to detect tea buds and the key points for picking, together with a tea picking point positioning method; compared to the baseline model, it achieves an average accuracy improvement of 5.26% [
14]. In another study, researchers proposed a tea sprout detection method based on Faster-RCNN, achieving an average precision (AP) of 0.54 and a root mean square error (RMSE) of 3.32 when the type of tea sprout was not differentiated [
15]. Wang et al. [
16] developed a model for recognizing tea buds and picking points based on Mask R-CNN, which performed well in complex environments. Chen et al. [
17] proposed a tea bud detection method utilizing image enhancement and a fused single-stage detection network (SSDN) to improve detection accuracy. In summary, deep learning research on tea object detection mainly focuses on improved CNN-based models, such as the one-stage YOLO and SSD series and the two-stage Faster-RCNN, while there is still little research on end-to-end, Transformer-based object detection models for the fast and accurate detection of tea buds.
Because Transformers offer powerful global feature extraction and parallel computing capabilities, Transformer-based object detection algorithms have also been widely studied. In object detection tasks, compared with CNNs, Transformers obtain a larger receptive field and more detailed information, fuse contextual semantic information more effectively, and model global features better [
18]. They can also effectively alleviate the problems of occlusion and small-object detection. Recently, a Transformer-based object detection framework called the detection Transformer (DETR) was introduced, and many researchers have proposed modification strategies for it [
19,
20,
21]. However, DETR suffers from defects such as slow training convergence and limited detection accuracy. The subsequent real-time detection Transformer (RT-DETR) overcomes the slow convergence of DETR training and exceeds comparable YOLO models in detection accuracy. Unlike CNN-based detection frameworks, RT-DETR provides a fully end-to-end object detection pipeline [
22]. Moreover, it eliminates the non-maximum suppression (NMS) post-processing used in previous object detection models, avoiding the delay caused by this step and greatly simplifying the detection pipeline compared with YOLO. In addition, its remarkable performance on standard datasets has prompted researchers to explore its application in various real-world scenarios, providing a new feasible scheme for different research fields. Therefore, we use RT-DETR as the base model and modify it to achieve the rapid and feasible detection and identification of tea buds.
In addition, every object detection model needs a dataset for training, and the quality of the dataset is a key factor in determining whether the model can succeed. In applied research, it is necessary to build specialized datasets that meet the needs of specific scenarios. Existing tea bud detection research is usually based on datasets constructed from a single tea variety, yet the phenotypes of different varieties differ significantly in color and morphology. For example, the colors of Yinghong 9 and Huangyu are completely different, and the bud sizes of Jinxuan and Yinghong 9 also differ greatly. A model that detects tea sprouts accurately only for a single variety, without being evaluated on different varieties in different environments, is therefore of limited use. This study targets the actual complex scene of a tea garden to achieve the accurate and rapid detection of multiple varieties of tea buds, with the main purpose of improving the generalization and robustness of the proposed model. The specific contributions of this study are as follows:
- 1.
A multi-variety tea bud dataset in an unstructured environment is constructed, covering Jinxuan, Hongyan 12, Yinghong 9, and Huangyu. A total of 1250 images are collected for each variety, yielding 5000 tea bud images in total.
- 2.
The RT-DETR-Tea detection model, based on the Transformer framework, accurately identifies multiple tea bud varieties in natural environments and achieves satisfactory performance on the independently constructed dataset, exhibiting good generalization and robustness.
- 3.
A feature fusion mechanism (GD-Tea) that effectively fuses low-level and high-level semantic information in tea bud images is proposed to improve the accuracy of tea bud recognition.
3. Experiment and Result Analysis
3.1. Experimental Platform and Model Parameter Settings
All experiments in this study were based on the PyTorch deep learning framework, with Python as the programming language. The main configuration of the computer used in the experiments was as follows: an Intel Core i5 CPU, the Windows 11 (64-bit) operating system, and an NVIDIA GeForce RTX 4060 Ti GPU with 8 GB of memory. The main parameter settings of the model are displayed in
Table 1.
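For reference, the sketch below shows how such a training configuration is typically wired up in PyTorch. It is a minimal illustration only: the hyperparameter values are placeholders standing in for the actual settings listed in Table 1, and `build_training` is a hypothetical helper, not part of the RT-DETR-Tea code.

```python
# Minimal PyTorch training-setup sketch. All values below are placeholders
# illustrating the kind of settings summarized in Table 1; they are NOT the
# exact hyperparameters used for RT-DETR-Tea.
import torch
from torch.utils.data import DataLoader

cfg = {
    "epochs": 200,          # placeholder
    "batch_size": 8,        # placeholder
    "lr": 1e-4,             # placeholder
    "weight_decay": 1e-4,   # placeholder
    "img_size": 640,        # input resolution used in the experiments
    "device": "cuda" if torch.cuda.is_available() else "cpu",
}

def build_training(model, train_dataset):
    """Wrap a detection model and dataset with a standard AdamW schedule."""
    loader = DataLoader(train_dataset, batch_size=cfg["batch_size"],
                        shuffle=True, num_workers=4, pin_memory=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"],
                                  weight_decay=cfg["weight_decay"])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=cfg["epochs"])
    return loader, optimizer, scheduler
```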
3.2. Backbone Network Comparison
The design of the backbone network, a core model component, has a direct impact on the final performance of the model. In this paper, ResNet18, ResNet34, ResNet50, and the original RT-DETR backbone HGNetV2 were compared in terms of the number of parameters, computational complexity, and detection accuracy.
Table 2 shows a detailed comparison of structural parameters and training results.
ResNet18 is a lightweight network from the ResNet family and has a more streamlined structure and lower complexity than the other networks [
28]. From the comparison results, it can be seen that ResNet18 has only 19.8 M parameters, significantly fewer than the other networks. A smaller number of parameters means a leaner structure, which in turn reduces computational resource requirements and storage footprint while lowering the risk of overfitting. In terms of computation, ResNet18 requires only 57 GFLOPs, the lowest in the table; fewer floating-point operations correspond to less computational overhead and reduce the dependence on large computing devices. Although ResNet18 achieves a mean average precision of 77.3%, slightly lower than HGNetV2's 77.8%, its smaller parameter count and lower computational complexity make it a candidate backbone that balances performance and efficiency.
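As a quick reproducibility check, parameter counts of candidate backbones can be verified directly in PyTorch. The sketch below (assuming a recent torchvision) counts parameters of the bare torchvision backbones; note that the figures in Table 2 refer to the complete detector built on each backbone, so the absolute numbers differ (a bare ResNet18, for instance, has roughly 11.7 M parameters).

```python
# Count trainable parameters of candidate backbone networks.
# These are the bare torchvision backbones; Table 2 reports the full detector,
# so the absolute numbers will not match exactly.
import torch
from torchvision import models

def count_params(model: torch.nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

for name, ctor in [("resnet18", models.resnet18),
                   ("resnet34", models.resnet34),
                   ("resnet50", models.resnet50)]:
    print(f"{name}: {count_params(ctor(weights=None)):.1f} M parameters")
```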
3.3. Visual Recognition of the Heatmap
Deep learning models have achieved high-precision recognition results in specific detection tasks in agricultural applications, but their interpretability is relatively limited. Model interpretability helps researchers better understand and trust the proposed model. Therefore, to better explain the ability of the proposed model to learn the characteristics of tea buds, we use Grad-CAM to visualize heat maps of the detection results of RT-DETR-r18 and RT-DETR-Tea. Grad-CAM uses gradient information to generate weights and overlays a heat map on the original image according to the weight magnitude. The color ranges from blue to red, and regions with greater weight, shown in red, are more important for tea bud detection [
29].
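To make the procedure concrete, the snippet below is a minimal hook-based Grad-CAM sketch on a generic CNN feature layer. The target layer and the scalar score used for backpropagation are illustrative simplifications (a torchvision ResNet18 stands in for the detector backbone); applying Grad-CAM to a DETR-style detection head requires choosing an appropriate query score instead.

```python
# Minimal hook-based Grad-CAM sketch on a CNN feature layer.
# The target layer and the scalar "score" below are illustrative; applying
# Grad-CAM to a DETR-style detector requires choosing a suitable query score.
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, x, target_layer):
    feats, grads = {}, {}

    def fwd_hook(_, __, output):
        feats["y"] = output
    def bwd_hook(_, grad_in, grad_out):
        grads["y"] = grad_out[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    score = model(x).max()          # scalar to backpropagate (illustrative)
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["y"].mean(dim=(2, 3), keepdim=True)    # GAP of gradients
    cam = F.relu((weights * feats["y"]).sum(dim=1))         # weighted feature sum
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode="bilinear", align_corners=False).squeeze(1)
    cam -= cam.min()
    cam /= cam.max() + 1e-8                                 # normalize to [0, 1]
    return cam  # high values (rendered red) mark regions driving the score

# Usage sketch on a torchvision ResNet18 (stand-in for the detector backbone).
model = models.resnet18(weights=None).eval()
heat = grad_cam(model, torch.randn(1, 3, 640, 640), model.layer4[-1])
```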
As can be seen from
Figure 9, the original RT-DETR-r18 model focuses more on large tea bud targets but pays less attention to some small targets, leading to missed detections. This may be because the original feature fusion loses low-level features, resulting in insufficient semantic information. The proposed RT-DETR-Tea model focuses on large and small targets simultaneously. The bud areas in the image appear red on the heat map, even in the complex tea garden environment, and even when there are multiple bud targets, the bud information is still effectively attended to. This may be because the GD-Tea feature fusion module uses multi-level features (shallow and deep) and their effective representation to fuse rich semantic information. As shown in the test results in the figure, the improved model highlights the key areas in the heat map and realizes the accurate identification of tea buds. This indicates that the GD-Tea feature fusion module proposed in this study can integrate rich semantic information, attend effectively to global information, and still detect tea buds reliably in the complex environment of tea gardens.
3.4. Ablation Experiments
To verify whether the proposed RT-DETR-Tea model, built on the RT-DETR-r18 base model, exhibits a better detection effect, ablation experiments were carried out on the tea bud dataset constructed in this study.
Table 3 shows the results of the ablation experiments.
As shown in Table 3, Improvement 1 (named CGFA) replaces the MHSA mechanism in the AIFI module of the original model with cascaded group attention, raising the mAP by 0.5% over the original model. CGFA effectively optimizes the deep features and enriches the semantic information of the model, which helps locate bud positions more accurately. Improvement 2 replaces the original CCFM feature fusion mechanism with the proposed GD-Tea module, improving the mAP by 1.2%; this method effectively fuses low-level and high-level semantic information and ensures that the model can focus on tea bud targets of different sizes in complex environments. In Improvement 3, DRBC3 is used instead of RepC3, and the mAP of the model improves by 0.8%, which indicates the necessity of multi-level feature fusion. Improvement 4 employs both the CGFA and GD-Tea strategies, increasing the mAP by 1.4%. Improvement 5 is the final improvement strategy proposed in this study and yields the largest gain: the mAP is 2.4% higher than that of the original model, and the FLOPs drop to 51.4 G, indicating that the final model not only improves detection accuracy but also accelerates computation. Compared with the RT-DETR-r18 model, the proposed model has good detection ability and can meet the application requirements of tea bud detection.
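For readers unfamiliar with the mechanism swapped in by Improvement 1, the standard cascaded group attention formulation on which CGFA builds is sketched below, assuming the usual notation in which the input feature is split into h channel groups; the exact design of CGFA may differ in detail from this generic form.

```latex
% Cascaded group attention (standard formulation; sketch).
% The input feature X is split along channels into h groups X_1, ..., X_h.
\begin{aligned}
X'_j &= \begin{cases} X_j, & j = 1,\\ X_j + \widetilde{X}_{j-1}, & 1 < j \le h, \end{cases}\\
\widetilde{X}_j &= \operatorname{Attn}\!\left(X'_j W^{Q}_j,\; X'_j W^{K}_j,\; X'_j W^{V}_j\right),\\
\widetilde{X} &= \operatorname{Concat}\!\left(\widetilde{X}_1, \ldots, \widetilde{X}_h\right) W^{P},
\end{aligned}
```

where W^Q_j, W^K_j, and W^V_j project the j-th group and W^P is the output projection. Unlike MHSA, each head sees only a channel slice of the feature, and the cascade passes each head's output into the next head's input, reducing computation while enriching the per-head representation.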
3.5. Effective Receptive Field (ERF)
The main function of the DRBC3 module proposed in this study is to integrate the injected features into the feature fusion and to enlarge the receptive field of the fused features through a large convolution kernel, thereby enhancing the detection and localization of tea buds. Table 3 shows that this modification not only increases detection accuracy by 0.8% but also reduces the number of parameters in the model. However, since the increase in receptive field is not directly visible from these metrics, we randomly selected 50 images from the test set and resized them uniformly to 640 × 640. We normalized the per-pixel feature contribution of each image to the range 0 to 1 and measured the proportion of the effective pixel area contributing to the feature map.
Table 4 shows a comparison of the effective-pixel contribution ratios of RepC3 and DRBC3 when the pixel contribution threshold t is set to 20%, 30%, 50%, and 99%.
As shown above, compared with RepC3, DRBC3 increases the proportion of effective pixels at every pixel contribution threshold; when t = 99%, the proportion of effective pixels reaches 94.1%. The enlarged receptive field effectively improves the detection and localization of tea buds and explains the effectiveness of the proposed improvement.
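The sketch below illustrates one common way such an effective-pixel proportion can be measured: backpropagate from the center of the final feature map to the input, normalize the absolute input-gradient contribution to [0, 1], and report the area ratio of the smallest centered region capturing a fraction t of the total contribution. Here `model_backbone` is a hypothetical handle to the network up to the fused feature map; the paper's exact measurement protocol may differ.

```python
# Effective receptive field (ERF) sketch: contribution of each input pixel to
# the center of the final feature map, aggregated over sample images, and the
# area ratio of the smallest centered square capturing a fraction t of the
# total contribution. Illustrative only; the exact protocol may differ.
import torch

def erf_contribution(model_backbone, images):
    """images: (N, 3, 640, 640). Returns a (640, 640) contribution map in [0, 1]."""
    acc = torch.zeros(images.shape[-2:])
    for img in images:
        x = img.unsqueeze(0).requires_grad_(True)
        feat = model_backbone(x)                      # (1, C, h, w) fused feature map
        center = feat[..., feat.shape[-2] // 2, feat.shape[-1] // 2].sum()
        grad = torch.autograd.grad(center, x)[0]      # d(center) / d(input)
        acc += grad.abs().sum(dim=1).squeeze(0)       # aggregate over channels
    acc /= acc.max() + 1e-12                          # normalize to [0, 1]
    return acc

def high_contribution_area_ratio(contrib, t):
    """Area ratio of the smallest centered square holding fraction t of the mass."""
    H, W = contrib.shape
    total = contrib.sum()
    for r in range(1, max(H, W) // 2 + 1):
        cs, ce = H // 2 - r, H // 2 + r
        ws, we = W // 2 - r, W // 2 + r
        if contrib[max(cs, 0):ce, max(ws, 0):we].sum() >= t * total:
            return (min(ce, H) - max(cs, 0)) * (min(we, W) - max(ws, 0)) / (H * W)
    return 1.0
```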
3.6. Comparison Results of Multiple Models
The improved model is compared with the two-stage object detection model Faster-RCNN [31] and one-stage detection models such as SSD [30], YOLOv5s [32], and YOLOv8l [33]. The results of the analysis are shown in Table 5, which highlights the accuracy advantage of the improved model for tea bud detection over the other detectors. As shown in Figure 10, on the tea bud dataset our model achieves the best P, R, and mAP metrics, at 96.1%, 91.7%, and 79.7%, respectively.
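For reference, the P, R, and mAP metrics reported here follow their standard definitions; the block below is a sketch assuming the usual TP/FP/FN formulation with IoU-thresholded matching.

```latex
% Standard detection metrics (sketch).
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i ,
```

where TP, FP, and FN are true positives, false positives, and false negatives determined under an IoU threshold, and N is the number of classes (here, the single tea bud class).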
Although the parameter count and FLOPs of the proposed model are higher than those of YOLOv5s, its application in practical projects is not affected. The proposed model has 20 M parameters and 51.4 GFLOPs, representing reductions of 37.3% and 46.2%, respectively, compared with the RT-DETR-L model.
To better demonstrate the superiority of the proposed model, Yinghong 9 was taken as an example: six models were used to detect tea buds in their natural state, and the detection effect of each model was evaluated (as shown in Figure 10). Figure 10c,d show the results of Faster-RCNN and SSD, a two-stage and a single-stage detector, respectively. There are obvious missed and false detections in areas where tea buds are occluded, which may be due to the lack of a fusion operation between high-frequency and low-frequency information, resulting in insufficient feature information and reduced detection accuracy.
Figure 10a,b,e show the detection results of YOLOv5s, YOLOv8l, and RT-DETR-L. These models incorporate improved feature fusion, and their missed detections of buds are effectively reduced, but there is still room for improvement.
Figure 10f shows the detection results of the improved model proposed in this study. We improved the model's feature fusion to achieve the effective fusion of high-frequency semantic information and low-frequency information, improving the localization and recognition accuracy of the target, and reduced the missed detection of small targets by enlarging the receptive field. The detection performance of the proposed model is clearly much improved.
3.7. Verification of Bud Detection
To verify the generalization ability and detection performance of the proposed RT-DETR-Tea model, we independently constructed a tea bud dataset. The dataset involves multiple tea tree varieties, close-up tea bud images, and dense and complex environments, making it effective for testing detection models. The final inference results are shown in
Figure 11 and
Table 6.
In general, the original RT-DETR-r18 model has the shortcomings of missed and false detections.
Figure 11 shows the detection results of the improved model and the original model on four kinds of tea buds: Yinghong 9, Jinxuan, Huangyu, and Hongyan 12. For Yinghong 9, Jinxuan, and Hongyan 12, the RT-DETR-r18 model shows missed detections when occlusion and dense buds are encountered, especially for small bud targets. This is likely because the original model loses semantic information in high-level features during feature fusion, resulting in missed detections of small targets and false detections under occlusion. In the improved RT-DETR-Tea model, the GD-Tea module achieves an effective fusion of multi-level features, and the detection results for Yinghong 9 and Jinxuan clearly demonstrate the advantages of the improved model. The results for Huangyu show that even when the buds are similar in color to the leaves, the improved model achieves better detection than the original model. The proposed DRBC3 module not only enlarges the receptive field but also enhances the understanding of features, which effectively improves tea bud localization and recognition by the detection head and prevents false detections. Overall, the improvement strategy effectively boosts the detection performance of the model.
3.8. Results of a Comprehensive Analysis of Tea Buds Classified by Size
In practice, the size and number of tea buds in the image and the surrounding environment change as the camera moves; bud information is more complete and more easily recognized when the camera is close to the buds. As the visual system widens its detection field of view, problems such as an excessive number of buds, severe occlusion, and small, densely packed bud targets are likely to occur. In addition, the color and size of buds differ greatly among tea tree varieties, and bud color is often similar to leaf color, which undoubtedly increases the difficulty of detecting buds in complex environments. To evaluate the efficacy of the improved model in detecting tea buds of varying sizes within a tea garden environment, this study constructed datasets representing different bud sizes for detection analysis. The specific results are presented as follows:
Figure 12b has a larger field of view than Figure 12a, which makes the small size of the tea bud targets more prominent. Analyzing the detection results for tea buds of different sizes in the real tea garden environment further highlights the superiority of the proposed model in detecting small targets and indicates that the model can accurately detect both large and small targets.
3.9. Tea Bud Detection and Verification Under Different Light Intensities
Changes in light intensity influence tea bud detection. To verify the generalization and detection accuracy of the proposed model under different illumination, 50 tea bud images of different varieties were randomly selected from the dataset. Gamma transformations were applied to these images to simulate tea bud images under different light intensities, and the transformed images were input into the RT-DETR-Tea model to verify changes in model performance and analyze the detection effect.
In this study, gamma transformations were used to simulate tea bud images under four different light intensities: gamma = 0.45 simulates the tea garden light intensity from 11:00 to 13:00 in summer, gamma = 0.75 from 13:00 to 15:00, gamma = 2.0 from 15:00 to 16:00, and gamma = 3.5 from 16:00 to 18:00. The light-transformed tea bud images were input into the detection network, and the specific detection results are shown in Figure 13.
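Below is a minimal sketch of the gamma transformation described above (a standard power-law mapping of normalized pixel values, where gamma < 1 brightens the image and gamma > 1 darkens it); the image path is a hypothetical placeholder.

```python
# Gamma transformation to simulate different light intensities.
# gamma < 1 brightens the image (stronger apparent illumination),
# gamma > 1 darkens it; the gamma values follow the settings described above.
import cv2
import numpy as np

def gamma_transform(image_bgr: np.ndarray, gamma: float) -> np.ndarray:
    """Apply a power-law (gamma) mapping to an 8-bit BGR image via a lookup table."""
    lut = (np.linspace(0.0, 1.0, 256) ** gamma * 255.0).astype(np.uint8)
    return cv2.LUT(image_bgr, lut)

img = cv2.imread("tea_bud_sample.jpg")               # hypothetical sample image
for g in (0.45, 0.75, 2.0, 3.5):                     # settings used in this section
    cv2.imwrite(f"tea_bud_gamma_{g}.jpg", gamma_transform(img, g))
```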
The results showed that tea buds could be accurately detected under the different light intensities, although buds could be missed when the light intensity was too strong. In the comparison of Figure 13c,d, the confidence values of tea bud detection decreased significantly, indicating that too low a light intensity leads to a decline in tea bud detection accuracy but does not prevent the detection and positioning of the buds. Overall, the gamma transformation experiments show that the proposed model has good generalization and robustness.
4. Conclusions
To address the limitation of single-variety tea bud detection, this study uses the improved RT-DETR-Tea model to realize the accurate identification of tea buds of different varieties with different phenotypes. In this model, CGFA effectively optimizes the deep features and enriches the semantic information. GD-Tea effectively fuses low-level and high-level semantic information, ensures that the model attends to both large and small tea buds in complex environments, and reduces the probability of missed and false detections. DRBC3 further enhances the understanding of features and effectively improves the localization and recognition of regions of interest by the detection head.
In general, compared with the RT-DETR-r18 base model, the RT-DETR-Tea model exhibits greatly improved detection accuracy for multi-variety tea buds. The results show that the precision (P) and mAP of the RT-DETR-Tea model are 96.1% and 79.7%, increases of 5.2% and 2.4%, respectively. In addition, to verify the robustness and generalization of the model, we constructed a new multi-variety tea bud dataset, and the comparison results show that the proposed model reduces missed and false detections. Therefore, the proposed RT-DETR-Tea model can accurately detect multi-variety tea buds in natural environments and can provide technical support for automatic picking positioning and yield statistics in actual tea production.
There are still some limitations to our work. Firstly, due to limited resources and manpower, the data in this paper only contain samples of four varieties of tea sprouts. Although the model has good robustness and generalization, it may not generalize well to varieties not included in the training data. Therefore, a larger tea bud dataset must be constructed, which would not only improve the robustness of the model but also ensure that its application is not limited by tea variety. In addition, this study only considers detection based on RGB images; because of viewing angle and excessive density, some buds are completely occluded and cannot be detected. In future work, we will therefore consider fusing depth information to enable the rapid reconstruction of a 3D scene of tea buds and the detection of completely occluded buds in 3D space.