1. Introduction
In the context of China’s ongoing economic growth and the rising consumption levels among its populace, there has been a notable increase in consumer demand for high-quality tea; Sichuan Province, recognized as one of China’s principal tea-producing areas, has garnered significant attention regarding the evolution of its tea-market dynamics [
1]. The fundamental components of tea production encompass four key stages: plucking, withering, rolling, and drying. While automation technologies have been extensively implemented in the latter three stages, the initial process of fresh leaf harvesting in many regions of China continues to rely heavily on traditional manual techniques. The benefit of this manual plucking method lies in the ability of the pickers to choose buds of uniform size, relying on the distinct variations in color and sheen between the mature leaves and the tender buds of the tea plant. This practice ensures the consistency and high quality of the harvested tender tea buds [
2]. Nevertheless, the conventional manual plucking method encounters several constraints, such as inefficiencies in the harvesting process, high costs, and variability in the quality of the harvest [
3]. Consequently, there is an urgent need to investigate automated tea bud harvesting technology, with a primary focus on the automatic identification and detection of tea buds within their natural habitats.
In recent years, the swift advancement of computer vision technology has led to a transition in the recognition methods for tea buds, moving from traditional manual assessment to automated recognition utilizing image processing and deep learning techniques [
4]. Traditional image processing approaches primarily depend on color and shape features [
5]. In terms of color characteristics, the color information can be extracted from images using color histograms or color moments in color spaces such as RGB and HSV, facilitating the identification and segmentation of target regions [
6]. For instance, Zhang et al. [
7] achieved precise identification of tea buds through G-B component thresholding in conjunction with an enhanced Otsu algorithm, while Novikov et al. [
8] refined a growth quality assessment technique for pine seedlings by utilizing the color features of seed epidermis. Morphological characteristics are another key feature for distinguishing different categories or growth stages. For example, Zhang et al. [
9] employed K-means clustering and the Bayesian discriminant analysis of color features to detect tea buds, whereas Anurekha et al. [
10] extracted shape–texture features of mangoes using the GANFIS method, attaining a grading accuracy of 99.18%. Although traditional image processing techniques have demonstrated some efficacy in tea bud recognition, their accuracy and generalization capabilities have remained constrained by factors such as harvesting conditions, image quality, lighting variations, and tea cultivar differences, which have hindered optimal performance in real-world detection scenarios.
Deep learning methodologies have significantly propelled the extensive utilization of target detection methods across various domains. In agriculture, the integration of image recognition and analytical techniques allows for the effective monitoring of critical physiological and environmental factors influencing crop growth, ultimately enhancing both yield and quality [
11,
12,
13]. The methodology is categorized into two primary types: two-stage and single-stage detection approaches. Two-stage detection frameworks encompass R-CNN, Fast R-CNN, and Faster R-CNN, which enhance target localization accuracy through a two-step processing mechanism. Nevertheless, two-stage algorithms necessitate the classification and positional refinement of each region proposal, leading to significant computational complexity, thereby posing challenges in scenarios with stringent real-time demands [
14]. The single-stage algorithms encompass Single Shot MultiBox Detector (SSD), You Only Look Once (YOLO), and others, with the YOLO series garnering significant attention owing to its rapid detection capabilities, outstanding performance, and streamlined model design [
15]. Appe et al. [
16] introduced an enhanced YOLOV5 algorithm for tomato detection, achieving an average recognition accuracy of 88.1% to address the challenges associated with detecting overlapping and small tomatoes. Meng et al. [
17] addressed the challenge of identifying tea buds in complex scenarios where the color of the buds closely resembled the background, making recognition difficult. They proposed an improved YOLOV7 algorithm that replaced the original convolutional blocks with depthwise separable convolutional blocks and incorporated a Convolutional Block Attention Module (CBAM) along with a Coordinate Attention (CA) module. As a result, the average accuracy for recognizing four classes of one bud and one leaf reached 96.70%. Similarly, Shahriar Zaman Abid et al. [
18] employed the YOLOV8 model for the identification of leaf pests and diseases, achieving a high-performance detection outcome with a mean average precision (mAP) of 98% and an F1 score of 97%, thereby providing significant technical support for advancing food security in Bangladesh.
Despite the impressive accuracy demonstrated by deep learning networks in target detection applications, they often encounter challenges related to an excessive number of model parameters, significant computational demands, and substantial memory requirements, which hinder their implementation on embedded devices [
19,
20]. Consequently, it is essential to optimize the model while preserving detection accuracy. Amudhan et al. [
21] introduced an RFSOD model designed for deployment on embedded devices, specifically for the recognition of small-sized objects. Sun et al. [
22] addressed the challenges posed by small speckle disease on apple leaves and the difficulties in detecting speckle targets, which were further complicated by the orchard’s intricate background. They developed ResBlock, a multi-scale feature extraction module, and C4, a lightweight feature fusion module, based on YOLOV5s and YOLOV5n, respectively. These models achieved sizes of only 10.8 MB and 2.4 MB while enhancing accuracy. Additionally, Zhang et al. [
23] incorporated the MobileNetV3 network into YOLOV4 to create a lightweight model for tea bud detection, resulting in an improved model size of just 11.78 MB and an increase in detection speed by 11.68 FPS.
The challenges of inadequate environmental adaptability, excessive model complexity, and suboptimal deployment of models in tea bud recognition can be addressed through conventional image processing and deep learning methodologies. This research utilized the traditional YOLOV5 algorithm as the foundational model for investigation. To address the inadequacy of the network in extracting features from tea buds, we proposed an offline data augmentation strategy that utilized image cropping. This approach enabled the network to concentrate on detecting larger targets, thereby enhancing the model’s recognition accuracy. Additionally, we incorporated lightweight networks, specifically ShuffleNetV2 and MobileNetV3, to substitute the backbone network in both the three-scale YOLOV5s and the four-scale YOLOV5n6 configurations. This substitution effectively reduced the model’s parameters, computational load, and memory usage. Furthermore, we introduced a layer of downsampling feature extraction network between the lightweight network and YOLOV5n6 to ensure adequate feature extraction. The enhanced four-scale YOLOV5n6_MobileNetV3 model demonstrated a 5.29% improvement in accuracy and a 29.06% reduction in weight file size compared to the three-scale YOLOV5s_MobileNetV3. Consequently, the lightweight optimization of the four-scale detection model was more advantageous for deployment on embedded devices in practical applications.
2. Materials and Methods
2.1. Data Acquisition
The data pertaining to tea buds utilized in this study were obtained from the Niu Mianping Tea Plantation located in the Mingshan District of Ya’an City, Sichuan Province, during the period from late February to early March 2024. During this timeframe, the local average daily temperatures fluctuated between 9 °C and 16 °C, accompanied by minimal precipitation, which was considered the optimal harvesting period for the renowned Ya’an Famous Tea in Sichuan. To closely replicate the natural growth environment of the tea buds, photographs were taken on both sunny and overcast days, with shooting angles and backgrounds selected at random. The imaging equipment employed was a Sony A6000 (Sony Corporation, Tokyo, Japan), featuring a resolution of 6000 × 4000 pixels, and the images were stored in JPG format. This process yielded a total of 4290 original images of the tea buds, encompassing three tea varieties grown at the plantation: Sichuan No. 9, Sichuan No. 11, and Mengshan No. 4.
The harvesting periods for the three tea varieties predominantly occur from late February to mid-April. As the principal cultivar of Ya’an Famous Tea, Sichuan No. 9 tea possesses significant economic value due to its buds featuring velvety hairs and vibrant colors. The bud morphology of Sichuan No. 11 tea and Mengshan No. 4 tea closely resembles that of Sichuan No. 9 tea; however, there are minor variations in leaf shape and coloration. These distinctions contribute to the diversity of tea bud characteristics and enhance the model’s generalization capability, thereby better enabling embedded devices to identify and harvest tea buds in complex environments.
Figure 1 illustrates the three types of tea buds under the two weather conditions.
2.2. Data Construction
2.2.1. Classification Levels and Data Annotation
According to the standards for tea leaf harvesting, the tender bud samples were categorized into four grades: special-grade tea consisting of a single bud, first-grade tea with one bud and one leaf, second-grade tea with one bud and two leaves, and third-grade tea with one bud and three leaves. The special-grade tea and first-grade tea, noted for their superior appearance, were particularly favored as primary materials for the production of high-quality premium teas [
24].
Figure 2 illustrates the four grades of tender tea buds.
This study employed the LabelImg tool (version 1.8.6) for image annotation, generating YOLO-format label files of the form “category, relative coordinates of the box center, relative width of the box, relative height of the box”. Prior to annotating the images, training sessions were conducted with local tea garden staff to ensure accurate identification of each grade of tea buds. To further enhance the accuracy of the data annotation, all annotated data were reviewed by a tea-grade assessment committee composed of five experienced tea experts.
Figure 3 illustrates the annotation tool used for the tea data and the generated files.
The dataset’s quality significantly influenced the model’s training efficacy and generalization capabilities. During the data acquisition phase, it was inevitable to collect some low-quality images alongside high-quality ones.
Figure 4 illustrates several images of suboptimal quality captured during the imaging process. As illustrated in
Figure 4a,b, issues with camera focus calibration during image capture resulted in the target tea buds being misaligned or out of focus, rendering their morphological characteristics indistinct. This led to potential misdetections and omissions by the model. Furthermore,
Figure 4c depicts an image of a tea plantation with only a few scattered new buds, which did not meet the criteria for proper tea picking. These low-quality images not only posed challenges for model training but also served as critical references for understanding model misidentification and missed identification issues, ultimately enhancing the model’s robustness and accuracy in real-world applications.
2.2.2. Data Augmentation for Tea Leaf Buds
During the learning process of neural networks, the model-generated anchor boxes often failed to fully encompass the targets due to the high-resolution images not being compressed. While traditional image scaling and rectangular scaling operations could process images, they tended to introduce excessive redundant information within the image data. Furthermore, after processing through the network’s downsampling layers, the effective information retained in the anchor boxes was significantly reduced, ultimately leading to a decrease in the model’s classification recognition accuracy. To address the shortcomings encountered by the network in extracting features from tea bud images, this study proposed a data augmentation strategy centered on image cropping, which could enhance the network’s feature extraction capabilities.
During the image acquisition phase, the initial image dimensions were 6000 × 4000 pixels. Due to variations in the distance between the tea buds and the acquisition lens, the bounding box size for an individual tea bud was approximately 300 × 700 pixels. Although cropping the original image to a smaller size of 640 × 640 pixels allowed for direct input into the model for training, it resulted in a significant loss of detailed information regarding the target region of the tea buds, thereby compromising the model’s recognition accuracy. To address this issue, this study consistently employed an image size of 1280 × 1280 pixels in subsequent experiments. This dimension not only preserved more detailed information about the tea buds but also aligned more effectively with the input requirements of the network model, enhancing the stability of model training and recognition performance.
Figure 5 illustrates the sizes and positions of the bounding boxes that needed to be traversed during the data augmentation process. It showed the cropping of an image sized 1280 × 1280 pixels centered around target box A, with the center point of box B falling within the cropping range defined by the center point of box A.
The fundamental concept of the method is to traverse each target box in the label file, designating the current box as box A. It is first checked whether box A has already been saved. If not, a 1280 × 1280 pixel image segment is cropped around the center point of box A, and the relative coordinates of box A within this cropped segment are written to a new label file. Box A is then marked as saved, and the next target box in the label file is traversed and designated as box B. If the center point of box B falls within the 1280 × 1280 pixel area cropped in the preceding step, the new coordinates of box B relative to that area are computed according to nine distinct scenarios, and the resulting coordinates are appended to the label file of box A. This process continues until every target box in the original image has been saved at least once (a code sketch of the full procedure is given after the case list below).
Assume that the original label information for box A was represented as (c, x_A, y_A, w_A, h_A), where x_A and y_A denoted the relative coordinates of the center point of box A, while w_A and h_A represented the relative width and height of box A, respectively. Similarly, the original label information for box B was represented as (c, x_B, y_B, w_B, h_B). Firstly, we converted the coordinates of the two bounding boxes to absolute pixel values by multiplying the relative values by the original image width and height, where box A was represented as (X_A, Y_A, W_A, H_A) and box B as (X_B, Y_B, W_B, H_B). In the cropped image, the relative coordinates of the centers of boxes A and B became (x'_A, y'_A) = (0.5, 0.5) and (x'_B, y'_B) = ((X_B − X_A + S/2)/S, (Y_B − Y_A + S/2)/S), respectively. With the crop size S = 1280, the new width–height values for boxes A and B were (w'_A, h'_A) = (W_A/S, H_A/S) and (w'_B, h'_B) = (W_B/S, H_B/S), where w' and h' represented the relative width and height of the cropped bounding boxes A and B, respectively. The nine distinct scenarios are as follows:
- (1)
When the dimensions of bounding box B did not exceed the crop boundaries, w'_B = W_B/S and h'_B = H_B/S were set, with the center coordinates of bounding box B represented as (x'_B, y'_B) = ((X_B − X_A + S/2)/S, (Y_B − Y_A + S/2)/S);
- (2)
When the width of bounding box B did not exceed the left and right boundaries, but the height exceeded the upper boundary, the relative height of bounding box B was calculated as h'_B = (Y_B + H_B/2 − (Y_A − S/2))/S, and the y-coordinate of the center point of bounding box B was y'_B = h'_B/2;
- (3)
When the width of bounding box B did not exceed the left and right boundaries, but the height exceeded the lower boundary, the relative height of bounding box B was given by h'_B = ((Y_A + S/2) − (Y_B − H_B/2))/S, and the y-coordinate of the center point of bounding box B was y'_B = 1 − h'_B/2;
- (4)
When the height of bounding box B did not exceed the upper and lower boundaries, but the width exceeded the left boundary, the relative width of bounding box B was defined as w'_B = (X_B + W_B/2 − (X_A − S/2))/S, and the x-coordinate of the center point of bounding box B was x'_B = w'_B/2;
- (5)
When the height of bounding box B did not exceed the upper and lower boundaries, but the width exceeded the right boundary, the relative width of bounding box B was calculated as w'_B = ((X_A + S/2) − (X_B − W_B/2))/S, and the x-coordinate of the center point of bounding box B was x'_B = 1 − w'_B/2;
- (6)
When bounding box B exceeded the upper left corner, the coordinates of bounding box B referred to cases (2) and (4);
- (7)
When bounding box B exceeded the lower left corner, the coordinates of bounding box B referred to cases (3) and (4);
- (8)
When bounding box B exceeded the upper right corner, the coordinates of bounding box B referred to cases (2) and (5);
- (9)
When bounding box B exceeded the lower right corner, the coordinates of bounding box B referred to cases (3) and (5).
Given that the 1280 × 1280 cropped region may have extended beyond the limits of the original image, the surplus area was automatically filled with black pixel values to ensure a uniform input image size.
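For concreteness, the following Python sketch outlines this cropping-based augmentation under the assumptions stated in the comments (the function and variable names are illustrative and not taken from the authors' implementation). It produces one 1280 × 1280 crop per target box, pads beyond-image regions with black pixels, and recomputes the labels of every box whose center falls inside the crop; a single min/max clip at the crop boundary covers the nine cases enumerated above.

```python
import numpy as np

CROP = 1280  # crop size used in this study

def crop_around_boxes(image, labels):
    """Offline augmentation sketch: one 1280x1280 crop per unsaved target box.

    image  : H x W x 3 uint8 array (here 4000 x 6000 x 3)
    labels : list of (cls, x, y, w, h) in YOLO relative coordinates
    returns: list of (crop_image, crop_labels) pairs
    """
    H, W = image.shape[:2]
    half = CROP // 2
    saved = [False] * len(labels)
    crops = []

    for i, (_, xa, ya, wa, ha) in enumerate(labels):
        if saved[i]:
            continue
        # absolute center of box A defines the crop window
        XA, YA = xa * W, ya * H
        x0, y0 = XA - half, YA - half  # crop origin in the original image

        # black-padded canvas so crops near the image border keep a uniform size
        canvas = np.zeros((CROP, CROP, 3), dtype=image.dtype)
        sx0, sy0 = max(0, int(round(x0))), max(0, int(round(y0)))
        sx1, sy1 = min(W, int(round(x0)) + CROP), min(H, int(round(y0)) + CROP)
        dx0, dy0 = sx0 - int(round(x0)), sy0 - int(round(y0))
        canvas[dy0:dy0 + (sy1 - sy0), dx0:dx0 + (sx1 - sx0)] = image[sy0:sy1, sx0:sx1]

        new_labels = []
        for j, (cls_b, xb, yb, wb, hb) in enumerate(labels):
            XB, YB, WB, HB = xb * W, yb * H, wb * W, hb * H
            # keep box B only if its center falls inside the crop window
            if not (x0 <= XB < x0 + CROP and y0 <= YB < y0 + CROP):
                continue
            # clip box B to the crop; this min/max clip covers all nine boundary cases
            left = max(XB - WB / 2, x0)
            right = min(XB + WB / 2, x0 + CROP)
            top = max(YB - HB / 2, y0)
            bottom = min(YB + HB / 2, y0 + CROP)
            nw, nh = (right - left) / CROP, (bottom - top) / CROP
            nx = ((left + right) / 2 - x0) / CROP
            ny = ((top + bottom) / 2 - y0) / CROP
            new_labels.append((cls_b, nx, ny, nw, nh))
            saved[j] = True

        crops.append((canvas, new_labels))
    return crops
```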
Figure 6 demonstrates the data post-offline data augmentation, emphasizing cases that required boundary filling.
Through the aforementioned data augmentation techniques, the dataset of tea leaf tender buds was expanded from the original 4290 images to 19,562 images. The data were randomly divided into a training set of 15,650 images and a validation set of 3912 images, using an 8:2 ratio.
2.3. YOLOV5 Network Architecture
YOLO represents a quintessential single-stage object detection algorithm, which was initially introduced by Joseph Redmon and Ali Farhadi in 2016 [
25]. This algorithm identifies potential frames and classifies them, producing outputs through a unified regression approach. Its architecture is characterized by simplicity and rapid inference capabilities, facilitating real-time application. However, the early iterations of YOLO exhibited lower prediction accuracy compared to two-stage detection algorithms. In recent years, various iterations such as YOLOV3 [
26], YOLOV4 [
27], YOLOV5, YOLOV7 [
28], and YOLOV8 have emerged, each iteration yielding substantial enhancements in performance.
The YOLOV5 model was structured into four distinct modules: Input, Backbone Network, Neck Network, and Head Network, each corresponding to specific functions within the architecture. The input image size for YOLOV5 was set to 640 × 640 pixels. Following a comprehensive feature extraction process, the Backbone and Neck components yielded three pairs of feature maps with matching dimensions and channel counts. A Concatenation operation executed within the Neck layer merged each pair, resulting in three feature maps that were subsequently fed into the Detect prediction layer. In the large version of the model, these feature maps measured 80 × 80 × 256, 40 × 40 × 512, and 20 × 20 × 1024, respectively. The anchor boxes generated from these three scales of feature maps were designed to detect small, medium, and large targets in that specific order.
YOLOV5 represented an advancement over the YOLOV4 model, successfully minimizing the model’s size while enhancing detection efficiency. In contrast, the YOLOV7 and YOLOV8 versions primarily focused on refining the instance segmentation code [
29], with less emphasis on target detection. Furthermore, for the network model to operate smoothly and efficiently on edge devices, it was essential that the deployment tool effectively provided the fundamental operators (such as convolution, pooling, and activation functions) and that the target detection model adhered to lightweight specifications. Considering these factors, YOLOV5 was ultimately chosen as the enhanced foundational network.
Figure 7 illustrates the network architecture of YOLOV5.
2.4. Lightweight Improvement of the YOLOV5 Model
The YOLOV5 series of networks offers various sizes to accommodate different computational requirements, with models ranging from small to large: YOLOV5n, YOLOV5s, YOLOV5m, YOLOV5l, and YOLOV5x. Among these, YOLOV5n features the smallest architecture, making it suitable for deployment on devices with limited computational capacity [
30]. YOLOV5s (small) serves as the foundational model, while the other models are enhancements of YOLOV5s, achieved by widening and deepening the network. Additionally, YOLOV5n6 is a derivative model developed specifically for higher-resolution images of 1280 × 1280; this series also includes models with varying depths and widths, such as s6, m6, l6, and x6. The YOLOV5 series ultimately downsamples the input image by a factor of 32, producing three predictive feature layers, whereas the YOLOV5n6 series downsamples the input image by a factor of 64 and generates four predictive feature layers, thereby improving the detection capabilities for larger targets [
31].
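As a quick illustration of the difference between the three-scale and four-scale heads, the following snippet (illustrative only; the strides follow the standard P3–P6 convention rather than any configuration file from this study) computes the prediction-grid sizes for the two input resolutions used here.

```python
# Illustrative only: prediction-grid sizes for the three- and four-scale heads.
def grid_sizes(img_size, strides):
    return [img_size // s for s in strides]

print(grid_sizes(640, (8, 16, 32)))       # three-scale head: [80, 40, 20]
print(grid_sizes(1280, (8, 16, 32, 64)))  # four-scale head: [160, 80, 40, 20]
```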
Conventional neural network model files retain weight parameters that typically require substantial memory resources in real-world applications. To mitigate the computational complexity of neural networks while preserving detection accuracy, researchers have introduced several lightweight enhancement strategies, including the substitution of convolutional layers and the optimization of the backbone network architecture. Nevertheless, current lightweighting techniques exhibit specific constraints. For instance, while the GhostNet module demonstrated superior recognition performance, its elevated parameter count hindered model lightweighting [
32]. Similarly, although Inception enhanced certain feature extraction capabilities, it significantly escalated memory access requirements [
33]. Furthermore, the Xception module, characterized by its increased computational complexity [
34], posed challenges for effective model deployment. In contrast, the ShuffleNet and MobileNet architectures excel in lightweight design, achieving high-accuracy recognition while reducing computational complexity and memory usage. Consequently, this section compares the lightweight ShuffleNet and MobileNet series with respect to the deployment criteria for embedded systems.
2.4.1. Original YOLOV5s Network
The primary function of the YOLOV5s backbone network is to extract features while progressively refining the feature map. Its key components include the convolutional module, the C3 module, and the Spatial Pyramid Pooling-Fast (SPPF) module. The Neck network integrates the feature maps obtained from the backbone network and merges them with deeper semantic features, while the head network generates predictions from the anchor boxes for the purpose of object detection [
35].
Figure 8a illustrates the backbone network of YOLOV5s, while
Figure 8b depicts the head network of YOLOV5s. In contrast to the YOLOV5n architecture, the YOLOV5s model features a more intricate network design with an increased number of convolutional layers, enabling it to extract more comprehensive feature representations. This enhancement particularly improved its capacity to discern fine details of small objects within complex environments. Despite the added complexity of its architecture, YOLOV5s maintained a compact model size and minimized computational demands by optimizing model weights and enhancing computational efficiency. This adaptability allowed it to function effectively on embedded devices with constrained computational capabilities. Such a balance rendered YOLOV5s particularly suitable for practical applications, especially in contexts that necessitated high-precision detection while operating under limited computational resources.
2.4.2. Original YOLOV5n6 Network
YOLOV5n6 preserved the depth characteristics of YOLOV5s while halving its width compared to the original model. The SPPF module within the network employed a cascade of several small-sized pooling cores, in contrast to the single large-sized pooling core utilized in the Spatial Pyramid Pooling (SPP) module. This design choice enhanced the model’s runtime efficiency while maintaining its original functionality.
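The following PyTorch sketch shows an SPPF-style block of the kind described above (a minimal version: the channel widths are assumptions, and plain convolutions stand in for the convolution–normalization–activation blocks of the actual YOLOV5 implementation).

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Sketch of an SPPF-style block: three cascaded 5x5 max-pools whose stacked
    receptive fields match the 5/9/13 kernels of the older SPP block, at lower cost."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        # concatenate the input and the three pooled maps, then fuse with a 1x1 conv
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```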
Figure 9a illustrates the backbone network of YOLOV5n6, while
Figure 9b depicts the head network of YOLOV5n6. The increased input resolution of YOLOV5n6 resulted in substantially greater computational demands, potentially leading to slower inference times, a requirement for more computational resources during both training and inference phases, and elevated costs associated with deployment.
2.4.3. ShufflenetV2 Backbone Network
ShuffleNetV2 was recognized as a quintessential lightweight network [
36], comprising a base unit on the left and a downsampling unit on the right, with respective stride lengths of 1 and 2. Both units incorporated a layer of pointwise convolution preceding the depthwise separable convolution, which served to increase the channel count and enhance the unit’s feature extraction capabilities. In the base unit, the feature map was initially bifurcated into two branches via Channel Split, subsequently concatenated along the channel dimension using Concat, and ultimately subjected to the Channel Shuffle operation. This operation was instrumental in mixing and rearranging channel information, facilitating the fusion of feature map data and augmenting information complexity. Within the downsampled ShuffleNetV2 block, the feature map was similarly fed into two distinct branches, with a convolutional stride of 2 in both, enabling the reorganization of channel information and the fusion of feature map data. In summary, ShuffleNetV2 reduced the amount of computation and the number of parameters through efficient ShuffleNet cells and excelled in inference speed and memory footprint.
Figure 10 depicts the fundamental feature extraction block of ShuffleNetV2.
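The Channel Shuffle operation at the heart of the ShuffleNetV2 unit can be expressed in a few lines of PyTorch. The sketch below (illustrative, with two groups matching the two branches of the base unit) interleaves the channels so that information from the two branches is mixed before the next unit.

```python
import torch

def channel_shuffle(x, groups=2):
    """Channel-shuffle sketch: interleave the channels of the two branches
    so that information can flow between them in the next unit."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# toy check: channels [0..5] with 2 groups become [0, 3, 1, 4, 2, 5]
x = torch.arange(6).view(1, 6, 1, 1).float()
print(channel_shuffle(x).flatten().tolist())
```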
2.4.4. MobilenetV3 Backbone Network
The MobileNet family of networks represented a highly efficient model tailored for mobile and embedded vision applications [
37]. Its primary advantage lies in substituting conventional convolution with depthwise separable convolution. The MobileNetV3 architecture builds upon the foundations established by the MobileNetV1 and MobileNetV2 networks. While preserving the inverted residual structure, it incorporates a feature recalibration mechanism, the Squeeze-and-Excitation (SE) module, into the base network blocks. This mechanism amplifies the weights of informative channels while suppressing the influence of ineffective or less significant ones. Compared with ShuffleNetV2, the SE module gives the MobileNet architecture an advantage in high-precision target recognition tasks.
Figure 11 provides a visual representation of the MobileNetV3 network architecture.
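A minimal PyTorch sketch of the SE mechanism described above is given below; the reduction ratio and the hard-sigmoid gate follow common MobileNetV3 practice, and the layer sizes are illustrative rather than the exact configuration used in this study.

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Sketch of a Squeeze-and-Excitation block as used in MobileNetV3 bnecks:
    global-average-pool the feature map, pass it through a small bottleneck MLP,
    and rescale each channel by the resulting weight."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Hardsigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: N x C channel descriptor
        return x * w.view(n, c, 1, 1)     # excite: reweight the channels
```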
2.5. Integration of the Lightweight Backbones into YOLOV5
In this segment of the research, two lightweight network architectures, ShuffleNetV2 and MobileNetV3, were selected to replace the backbone networks of the three-scale model YOLOV5s and the four-scale model YOLOV5n6, respectively.
2.5.1. Optimization of the Three-Scale Model
Initially, the feature maps from layer 6 at 1/16 scale and layer 4 at 1/8 scale of the YOLOV5s backbone network were channel-spliced with the corresponding scales from the head network. The Shuffle block in ShuffleNetV2 facilitated the extraction of feature maps across various scales; thus, the layer 4 feature map at 1/16 scale, the layer 2 feature map at 1/8 scale, and the feature maps at the corresponding scales from the YOLO head network were also channel-spliced.
Furthermore, the bneck module in MobileNetV3 similarly enables the extraction of feature maps at different scales. Consequently, the layer 8 feature map at a 1/16 scale, the layer 3 feature map at a 1/8 scale, and the feature maps at the corresponding scales from the YOLO head network underwent channel splicing.
Figure 12a illustrates the architecture of the enhanced YOLOV5s_ShuffleNetV2 network, while
Figure 12b depicts the structure of the improved YOLOV5s_MobileNetV3 network.
2.5.2. Optimization of the Four-Scale Model
Initially, the output feature maps from the C3 structure of the 23rd, 26th, 29th, and 32nd layers of the YOLOV5n6 head network were sized at 1/8, 1/16, 1/32, and 1/64, respectively. However, neither ShuffleNetV2 nor MobileNetV3 included a module for extracting the 1/64 feature maps. To facilitate the proper training of the models, a downsampling feature extraction network was incorporated between the lightweight network and the YOLOV5n6 base model network. The downsampling feature extraction network for ShuffleNetV2 consisted of a shuffle_block with a stride of 2, followed by three shuffle_blocks with a stride of 1, whereas that for MobileNetV3 comprised one bneck block with a stride of 2 and two bneck blocks with a stride of 1. All configurations utilized depthwise separable convolution, which introduced only a minimal increase in computational load and memory access.
Upon the completion of the enhancements, the feature maps from layer 6 at 1/32 scale, layer 4 at 1/16 scale, and layer 2 at 1/8 scale, as extracted via ShuffleNetV2, were subjected to channel splicing with the feature maps of the corresponding scales from the YOLO head network. In a similar manner, the feature maps from the 11th layer at 1/32 scale, the 8th layer at 1/16 scale, and the 3rd layer at 1/8 scale, extracted via MobileNetV3 post-improvement, were also channel-spliced with the feature maps of the corresponding scales from the YOLO head network.
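As an illustration of the added 1/32 → 1/64 stage, the sketch below uses generic depthwise separable blocks in place of the shuffle_block and bneck modules described above; the channel widths are assumptions, not the configuration used in this study.

```python
import torch.nn as nn

def dw_separable(c_in, c_out, stride):
    """Depthwise-separable convolution: a depthwise 3x3 followed by a pointwise 1x1."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# Extra stage appended after the lightweight backbone so that the four-scale
# YOLOV5n6 head receives a 1/64-scale (P6) feature map; widths are illustrative.
extra_p6_stage = nn.Sequential(
    dw_separable(c_in=192, c_out=384, stride=2),  # downsample 1/32 -> 1/64
    dw_separable(c_in=384, c_out=384, stride=1),
    dw_separable(c_in=384, c_out=384, stride=1),
)
```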
Figure 13a illustrates the architecture of the enhanced YOLOV5n6_ShuffleNetV2 network, while
Figure 13b depicts the architecture of the enhanced YOLOV5n6_MobileNetV3 network.
3. Results
3.1. Experimental Platform
The experiments were conducted on a workstation running the Windows Server 2019 operating system (64-bit), powered by an Intel i9-10920 processor with a base frequency of 3.50 GHz and equipped with dual NVIDIA GeForce RTX 3090 graphics cards, each possessing 24 GB of video memory, alongside a total of 128 GB of system memory. The software environment employed Python 3.9 as the programming language and the PyTorch 2.1 deep learning framework for model training and evaluation, supplemented with tool libraries such as CUDA 11.8.
3.2. Evaluation Index
To objectively assess the performance of the neural network model in grading tea buds, various metrics were employed, including precision, recall, average precision (AP), mAP, F1, inference time, and model size.
Among these, precision refers to the proportion of correctly identified tea buds of a given grade among all targets recognized as that grade, as illustrated in Formula (1). Recall, on the other hand, denotes the proportion of actual tea buds of that grade that are correctly identified, as demonstrated in Formula (2).
True Positives (TPs) refer to the number of instances where tea buds are accurately identified as belonging to the corresponding grade, False Positives (FPs) indicate the number of instances where other categories are incorrectly classified as this grade of tea, and False Negatives (FNs) denote the number of instances where this grade of tea is misclassified as belonging to other categories.
AP is the area enclosed by the precision–recall (PR) curve and the coordinate axes, where the PR curve plots recall on the
x-axis and precision on the
y-axis. Points on the curve that are closer to the upper right corner indicate better model performance in identifying the target class, as illustrated in Formula (3). The mAP represents the average of the AP values across all classes. Specifically, mAP@0.5 denotes the mean AP across m classes when the Intersection over Union (IoU) threshold is set at 0.5, as shown in Formula (4), where m represents the number of grades of tea buds.
F1 represents the harmonic mean of precision and recall, serving as a metric for evaluating the efficacy of a classification model. Its values typically span from 0 to 1, with higher values indicating superior model performance. This is illustrated in Formula (5).
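For reference, the standard definitions corresponding to Formulas (1)–(5) as described in the text are reproduced below; the notation is reconstructed from the prose rather than copied from the original typeset equations.

```latex
\begin{align}
  \text{Precision} &= \frac{TP}{TP + FP} \tag{1} \\
  \text{Recall}    &= \frac{TP}{TP + FN} \tag{2} \\
  \text{AP}        &= \int_{0}^{1} P(R)\,\mathrm{d}R \tag{3} \\
  \text{mAP@0.5}   &= \frac{1}{m}\sum_{i=1}^{m} \text{AP}_i \tag{4} \\
  F1               &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}
\end{align}
```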
FPS (frames per second) denotes the quantity of images that the model is capable of processing within a second. This metric should be evaluated under identical hardware configurations. In the context of target detection, a higher value signifies an increased capacity for detecting more targets per second.
3.3. Comparative Analysis of Models
3.3.1. Comparison of Models at Different Scales
In this section, we analyze and compare the three-scale and four-scale network models to validate the impact of the selected improvement strategies on model optimization. The dataset utilized for this experiment comprises 19,562 images following augmentation and expansion. The data were randomly partitioned into 15,650 training samples and 3912 validation samples, adhering to an 8:2 ratio. Throughout the training phase, the learning rate was set to 0.01 with the Stochastic Gradient Descent (SGD) optimizer, the default Complete Intersection over Union (CIoU) loss was used for bounding-box regression, the number of training epochs was fixed at 300, and the batch size was set to 32. The input image dimensions for the three-scale model were 640 × 640, while the four-scale model utilized dimensions of 1280 × 1280.
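For clarity, the training settings listed above are summarized in the plain dictionary below; the key names are ours and do not correspond to any specific trainer API.

```python
# Summary of the training configuration described in this section (illustrative keys).
train_cfg = {
    "optimizer": "SGD",
    "initial_lr": 0.01,
    "box_loss": "CIoU",
    "epochs": 300,
    "batch_size": 32,
    "img_size_three_scale": 640,    # YOLOV5s-based models
    "img_size_four_scale": 1280,    # YOLOV5n6-based models
    "train_val_split": (15650, 3912),
}
```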
Table 1 presents the experimental results before and after the improvements for the three-scale network.
Table 2 presents the experimental results before and after the improvements for the four-scale network.
The precision, recall, mAP, and F1 scores of the YOLOV5s model were markedly superior to those of the YOLOV5n variant, suggesting that YOLOV5s demonstrated enhanced accuracy in target detection tasks. Furthermore, the feature information acquired via the YOLOV5n model was constrained due to its reduced parameter count. Compared to the YOLOV5s network, the improved YOLOV5s_ShuffleNetV2 and YOLOV5s_MobileNetV3 achieved accuracy enhancements of 0.99% and 1.79%, respectively, while the weight files decreased by 5.42 MB and 6.68 MB. Accuracy and F1 scores remained relatively stable, whereas the weight files shrank by over 40%, further illustrating the benefits of the lightweight backbone network in enhancing the overall performance of the model.
The YOLOV5l6, YOLOV5m6, and YOLOV5s6 models demonstrated commendable accuracies of 94.54%, 90.22%, and 84.59%, respectively. However, each model size exceeded 20 MB, necessitating substantial computational resources and complicating their deployment on lightweight embedded systems. In comparison with YOLOV5s, YOLOV5s_ShuffleNetV2, and YOLOV5s_MobileNetV3, the YOLOV5n6, YOLOV5n6_ShuffleNetV2, and YOLOV5n6_MobileNetV3 models exhibited enhancements in accuracy of 4.49%, 1.28%, and 5.29%, respectively, while achieving reductions in weight of 49.05%, 39.37%, and 29.06%, respectively. Additionally, the F1 scores improved by 0.21%, 10.94%, and 10.02%. Notably, YOLOV5n6_MobileNetV3 demonstrated superior performance, characterized by the lowest weight and the highest accuracy.
To further analyze whether the models were overfitting or underfitting, we next compared the performance of each model in grading the four grades of tender tea buds, as illustrated in
Table 3. G0 represents a single bud, G1 denotes one bud and one leaf, G2 indicates one bud and two leaves, and G3 signifies one bud and three leaves.
In the three-scale analysis, the YOLOV5s_ShuffleNetV2 and YOLOV5s_MobileNetV3 networks, along with the YOLOV5n6 network in the four-scale analysis, achieved recognition accuracies that exceeded 90% for the G3 tea buds. However, the accuracies for the G0, G1, and G2 tea buds were notably lower, indicating an imbalance in the recognition accuracy across different grades of tea buds. Conversely, the YOLOV5n6_ShuffleNetV2 and YOLOV5n6_MobileNetV3 networks in the four-scale analysis exhibited the opposite trend, suggesting that the features available for the network to learn from in the third-grade tea leaf data were insufficient.
During the actual harvesting process, premium teas focus on the collection of G0, G1, or G2 tea buds, while G3 buds are only harvested after the onset of summer in that year. Therefore, it is more appropriate to concentrate on the identification of the top three grades of tea buds. The improved YOLOV5n6_MobileNetV3 network was selected to enhance the model’s accuracy and efficiency in classifying tender tea buds.
Table 4 presents the performance results of YOLOV5n6_MobileNetV3 across four and three grades of tea buds, respectively.
The YOLOV5n6_MobileNetV3 model achieved an average recognition accuracy of 88.29% for identifying four grades of tea buds. For the classification of three grades of tea buds, the model attained an average accuracy of 92.43%, a recall rate of 87.25%, and a mean average precision of 93.17%. Notably, precision improved by 4.41%, recall increased by 4.5%, F1 improved by 4.33%, and mAP rose by 3.61%, demonstrating significant effectiveness.
3.3.2. Comparative Analysis of Model Training Losses
To provide a more comprehensive assessment of the model’s performance, this paper conducts a comparative analysis with other large models. As illustrated in
Figure 14, it was evident that none of the models exhibited significant overfitting overall, with the loss stabilizing after 150 epochs. Although YOLOV5s6, YOLOV5m6, and YOLOV5l6 demonstrated a rapid decrease in loss, their substantial weight did not meet the requirements for lightweight deployment. In contrast, the YOLOV5n6_MobileNetV3 model showed improved convergence speed and effectiveness compared to the original YOLOV5n6 model, thereby validating the feasibility of the proposed enhancement strategy.
3.3.3. Comparative Analysis of Weighting Precision
The size of a neural network model and its accuracy are closely related. An excessively large model can become overly complex and capture noise within the training data, resulting in overfitting. Conversely, a model that is too small may be overly simplistic, failing to capture the true distribution of the data.
Figure 15 illustrates the relationship between the weight sizes of the ten models discussed in this paper and their corresponding recognition accuracies. Points that were closer to the upper left region of the graph aligned more closely with the requirements for model lightweighting and high accuracy as outlined in this study. Specifically, the YOLOV5n6_MobileNetV3 model met the criteria for accurate detection and classification of tea bud recognition, while smaller models were more amenable to deployment in embedded devices.
3.3.4. Comparison with Other Research Models
To validate the comprehensive performance of the proposed YOLOV5n6_MobileNetV3 model, we conducted a comparative analysis with other target detection models for tea buds presented in the literature. Most models focus solely on single-class recognition, treating all tea buds as a single category, or they perform binary classification distinguishing between premium and first-grade categories.
Table 5 presents a comparison between the research approach outlined in this study and those found in the literature.
According to the data presented in the table, the YOLOV5n6_MobileNetV3 model discussed in this paper not only identified a greater number of tea bud grades but also demonstrated higher accuracy compared to binary classification. It took into account varying weather conditions, enhancing the model’s generalization capabilities. Furthermore, after optimization for lightweight performance, the model’s weight file was reduced to just 4.98 MB, making it more suitable for deployment on edge devices.
3.3.5. Comparison with Other Lightweight Mainstream Models
To assess the efficacy of the proposed model, YOLOV5n6_MobileNetV3 was evaluated against the contemporary YOLOV7-tiny and YOLOV8n mainstream models, utilizing a dataset exclusively comprising the G0, G1, and G2 tea grades. The findings from this experimental comparison are presented in
Table 6.
The YOLOV5n6_MobileNetV3 model demonstrated commendable performance in precision, recall, mAP, and light weight. With a weight of merely 4.98 MB, it is the lightest among the three models, showcasing significant potential for implementation in resource-limited embedded systems. The F1 score stands at 89.76%, which is on par with that of YOLOV8n, suggesting an effective equilibrium between precision and recall.
3.4. Comparative Analysis of Model Detection Performance
To evaluate the recognition performance of all models, the detection results of tea buds were visualized and analyzed, as shown in
Figure 16. It was observed that the improved YOLOV5n6_MobileNetV3 model, which increased the detection scale and reduced the number of channels, was more suitable for detecting tender tea bud targets, thereby enhancing the model’s accuracy in detecting tea buds.
3.5. Embedded Deployment Experiment
The experimental findings presented above indicate that the lightweight enhancement of YOLOV5n6_MobileNetV3 markedly alleviates the computational load on hardware, concurrently decreasing both the computational expense and the parameter count of the original model. To assess the efficacy of this lightweight modification on devices with limited computational capabilities, deployment tests were performed utilizing the NVIDIA Jetson Nano b01 as the deployment platform, with the system parameters detailed in
Table 7.
The Jetson Nano B01 not only addresses the requirements of various application scenarios but also offers robust support for device expandability and compatibility, owing to its compact form factor and high-performance capabilities. Its efficient hardware configuration and comprehensive interface design significantly lower the research and development costs associated with AI terminals. In this study, we first established a remote connection to the Jetson Nano using MobaXterm software (version 23.4) on the PC, then deployed and installed the model runtime environment on the Ubuntu 18.04 operating system with JetPack version 4.6.1.
Figure 17 illustrates the flowchart for real-time detection using the Jetson Nano (NVIDIA, Santa Clara, CA, USA).
Initially, high-resolution industrial cameras were employed to capture image data on tea buds, facilitating the creation of the training dataset. The acquired data were then transferred to a computer workstation for model training and optimization utilizing a deep learning framework. Upon the completion of the model training, the optimized weight files were deployed to the Jetson Nano embedded platform. Ultimately, the real-time acquisition and processing of the tea garden scene were conducted by the industrial camera integrated with the Jetson Nano, enabling the online detection and identification of tea buds.
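A hedged sketch of such an on-device detection loop is shown below; it assumes the standard Ultralytics YOLOv5 hub interface and OpenCV camera capture, and the weight-file name is illustrative rather than taken from the authors' deployment scripts.

```python
import cv2
import torch

# Sketch only: load custom-trained weights through the YOLOv5 hub interface.
# The weight-file name below is a placeholder.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='yolov5n6_mobilenetv3.pt')
model.conf = 0.25  # confidence threshold

cap = cv2.VideoCapture(0)  # industrial camera exposed as a video device
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = model(rgb, size=1280)  # inference at the four-scale input resolution
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        x1, y1, x2, y2 = map(int, xyxy)
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.imshow('tea bud detection', frame)
    if cv2.waitKey(1) == 27:  # press Esc to stop
        break
cap.release()
cv2.destroyAllWindows()
```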
In this study, the optimal weight file derived from the training of the YOLOV5n6_MobileNetV3 model was employed to evaluate the test dataset. The average inference time for the enhanced model to identify a single image was 272.1 ms on a desktop computer, while on an embedded platform, the average inference time was reduced to 205 ms, representing a decrease of approximately 24.6%. In comparison to the original model, the enhanced model demonstrated a reduced computation time on the Jetson Nano device, making it more suitable for devices with limited computational resources and better aligned with practical real-world applications.
4. Discussion
As the standard of living continues to improve, the demand for high-quality tea buds has been increasing year by year, and the traditional hand-picking of tea can no longer meet the market demand. To enhance the efficiency of tea harvesting and lower the expenses associated with tea production, a tea bud detection model was introduced, which was based on an enhanced algorithm of the YOLOV5 convolutional neural network and integrated with the lightweight MobileNetV3 architecture. Through a comparative experimental analysis, this research achieved notable advancements in several key areas:
(1) Firstly, this study collected datasets of three types of tea buds under both sunny and overcast conditions. To enhance the network’s ability to learn the characteristics of tea bud grades, an offline data augmentation strategy was proposed. This strategy centered around the bounding boxes in the label files and enumerated nine different methods for cropping these bounding boxes under various conditions. This approach effectively preserved the grade information contained within all bounding boxes of the original images, expanding the dataset from the initial 4290 images to a total of 19,562 images.
(2) Following a comparative analysis of various models within the YOLOV5 network series, it was determined that the accuracy and weight characteristics of the foundational models, YOLOV5s and YOLOV5n6, aligned more closely with the demands for high-precision and lightweight networks. Consequently, these models were selected as the primary frameworks for target detection across varying scales in this research. To further minimize model weight and enhance detection speed, the backbone networks of YOLOV5s and YOLOV5n6 were substituted with ShuffleNetV2 and MobileNetV3, respectively, and a compact feature map extraction module was incorporated. The experimental findings indicated that the optimized YOLOV5n6_MobileNetV3 model achieved an accuracy of 88.29% and 92.43% for four-category and three-category tea bud recognition, respectively, with a mean average precision (mAP) of 89.56% and 93.17%, and F1 scores of 85.43% and 89.76%. Notably, the model’s weight was a mere 4.98 MB, representing a significant improvement over the original model.
(3) The improved model proposed in this study not only featured high accuracy and lightweight characteristics but also achieved a detection time of 205 ms per image on resource-constrained embedded devices, reducing inference time by approximately 24.6% compared to the PC platform while maintaining a high recognition accuracy. This model is suitable for real-time detection tasks, and it provides technical support for the intelligent management of tea plantations and the practical application of harvesting robots, contributing to the modernization and intelligent development of the tea industry.
Despite the significant findings of this research, several limitations persisted. For instance, regarding the classification of tea bud grades, although the data had been evaluated by seasoned tea specialists, there remained a need to establish more comprehensive and standardized grading criteria aligned with industry benchmarks to mitigate the biases introduced via subjectivity. Furthermore, the scarcity of labels for G3 tea samples hampered the model’s feature learning for this category, resulting in diminished accuracy. Additionally, the enhanced model exhibited suboptimal performance in FPS. Thus, future investigations can leverage the latest iteration of the model to enhance detection speed while maintaining the model’s lightweight characteristics.
5. Conclusions
In response to the growing need for automated identification technology for tea buds, this research introduced a streamlined YOLOV5n6_MobileNetV3 detection model. The model was implemented on embedded devices to facilitate real-time grading of tea buds, thereby providing both theoretical and technological foundations for the modernization and intelligent evolution of tea gardens. The key findings derived from the experimental results were summarized as follows:
(1) The data augmentation strategy introduced in this study effectively expanded the dataset, bolstering the deep learning model’s capacity to recognize the distinct characteristics of tea buds across various grades and thereby significantly increasing grading efficiency and minimizing labor costs.
(2) The enhanced YOLOV5n6_MobileNetV3 architecture demonstrated commendable performance in terms of both precision and efficiency, achieving mean average precision (mAP) of 89.56% and 93.17% across four and three tea categories, respectively. Additionally, it recorded F1 scores of 85.43% and 89.76%, respectively, while maintaining a model size of merely 4.98 MB.
(3) In comparison to the YOLOV7-tiny, YOLOV8n, and YOLO11n models, the YOLOV5n6_MobileNetV3 model demonstrated notable superiority in accuracy for the extraction of tea buds while maintaining smaller model parameters. This made it particularly well-suited for deployment in embedded systems, thereby offering a viable research solution for advancing agricultural detection technologies.