Article

Segmentation Network for Multi-Shape Tea Bud Leaves Based on Attention and Path Feature Aggregation

1
National Key Laboratory of Agricultural Equipment Technology, College of Engineering, South China Agricultural University, Guangzhou 510642, China
2
College of Mechanical and Electrical Engineering, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
3
Guangdong Engineering Technology Research Center for Creative Hilly Orchard Machinery, Guangzhou 510642, China
*
Author to whom correspondence should be addressed.
Agriculture 2024, 14(8), 1388; https://doi.org/10.3390/agriculture14081388
Submission received: 26 July 2024 / Revised: 13 August 2024 / Accepted: 15 August 2024 / Published: 17 August 2024
(This article belongs to the Section Digital Agriculture)

Abstract

Accurately detecting tea bud leaves is crucial for the automation of tea-picking robots. However, tea stem occlusion and the overlapping of buds and leaves produce one bud–one leaf targets with varied shapes in the field of view, making precise segmentation of tea bud leaves challenging. To improve the segmentation accuracy for one bud–one leaf targets of different shapes and fine granularities, this study proposes a novel semantic segmentation model for tea bud leaves. The method designs a hierarchical Transformer block based on a self-attention mechanism in the encoding network, which is beneficial for capturing long-range dependencies between features and enhancing the representation of common features. Then, a multi-path feature aggregation module is designed to effectively merge the feature outputs of encoder blocks with decoder outputs, thereby alleviating the loss of fine-grained features caused by downsampling. Furthermore, a refined polarized attention mechanism is employed after the aggregation module to perform polarized filtering on features in the channel and spatial dimensions, enhancing the output of fine-grained features. The experimental results demonstrate that the proposed Unet-Enhanced model performs well in segmenting one bud–one leaf targets of different shapes, with a mean intersection over union (mIoU) of 91.18% and a mean pixel accuracy (mPA) of 95.10%. The semantic segmentation network can accurately segment tea bud leaves, providing a decision-making basis for the spatial positioning of tea-picking robots.

1. Introduction

Tea plants are among the most extensively cultivated agroforestry crops worldwide. After maturing, tea bud leaves are processed into tea and related products, becoming one of the most consumed beverages globally [1,2]. With the continuous expansion of the tea beverage consumption market, there is a growing demand for high-quality tea [3]. However, harvesting high-nutrition and economically valuable one bud–one leaf high-quality tea predominantly relies on manual labor, leading to challenges such as high labor intensity, low efficiency, and elevated costs [4]. Moreover, labor shortages during the tea-picking season hinder the timely harvesting of high-quality tea. Therefore, developing automated tea-picking robots is becoming an inevitable trend for the sustainable development of the tea industry [5,6].
The primary task in selective harvesting of high-quality tea bud leaves is achieving fine detection via machine vision, critical for subsequent picking point localization and other harvesting determinations [7]. Traditional image analysis methods can segment tea bud leaves but require manual classifier design through feature engineering [8,9,10,11]. However, these algorithms have limited feature information extraction capabilities and are often only suitable for specific background experimental conditions, lacking generalizability for fine tea bud leaf segmentation.
Due to the complexity of agricultural production environments, convolutional neural networks (CNNs) have been widely used for crop detection and recognition [12,13]. For tea bud leaf detection, Chen et al. employed Faster R-CNN, achieving 79% accuracy and 90% recall [14]. Li et al. utilized an improved YOLOv4, incorporating channel and spatial attention mechanisms, resulting in an average precision of 85.26% [15]. Meng et al. proposed an enhanced YOLOX-TINY model for tea bud detection, using content-aware reshuffling of feature modules for upsampling, improving average accuracy and recall to 97.42% and 95.09%, respectively [16]. Other relevant studies have also demonstrated promising results [17,18,19].
However, most object detection tasks rely on rectangular bounding boxes for precise localization of target areas. After achieving tea bud leaf detection, obtaining fine-grained tea bud leaf segmentation through clustering remains challenging. The background information within detection boxes also hinders locating stem-picking points. Fang et al. found that smaller targets with good image integrity and fit enable effective localization [20]. Chen et al., despite an 84.91% recognition accuracy, noted stem occlusion significantly affects picking point localization [14]. Yan et al. measured picking point positioning using the comprehensive metric F2 through object detection methods, yielding only 0.313 [21]. In our previous research on multi-scale, multi-target detection of tea bud leaves, we similarly found that, although the detection results of tea bud leaves are robust, achieving fine localization of tea bud leaves after detection remains a challenge [22].
In comparison to the aforementioned research results, researchers have begun considering pixel-level semantic segmentation methods for more distinctive features of tea bud leaves [23]. Lu et al. employed the HRNet_W18 model for semantic segmentation of tea bud leaves, achieving the mIoU of 0.81 and obtaining satisfactory segmentation results [24]. This approach addresses the challenge of recognition difficulty caused by variations in the growth positions of tea buds. Wang et al. implemented fine-grained segmentation of tea bud leaves based on Mask R-CNN, demonstrating increased robustness and improved utilization of the model for stem-picking point localization [25]. Zhang et al. proposed the MDY7-3PTB model, achieving an IoU of 86.61% on the tea bud leaf segmentation dataset, directly capable of segmenting tea stem positions [26]. Yan et al. improved the accuracy of bud and leaf recognition with DeepLabv3 to over 80%, even without considering bud and leaf overlap or occlusion [21].
Although these methods have achieved encouraging results in directly detecting and segmenting stem positions through visual perception, most tea bud leaves have clear shapes and obvious stem positions. This clear and single-class target reduces the difficulty of detection and segmentation. However, in the actual picking process, tea bud leaves have different shapes when imaged from different angles. Tea stem lengths vary at different growth stages, and the branching angles of tea bud leaves also differ. There can be overlapping between tea buds and leaves, and tea stems may be occluded and not be visible in the field of view, resulting in diverse forms of tea bud leaves. Therefore, solving the detection and segmentation of tea bud leaves in different shapes within the field of view plays a crucial role in determining the posture of tea bud leaves and making decisions on picking locations.
This study addresses the precise segmentation of multi-shape one bud–one leaf targets by proposing a novel segmentation network suitable for various shapes and fine granularity. We constructed a dataset containing three main shapes of tea bud leaves and accurately annotated the different shapes of these targets at the pixel level. The role of the multi-path feature aggregation module was analyzed, and the characteristics of different attention mechanisms are discussed. Subsequently, the segmentation results of different models for various shapes and fine granularity of tea bud leaves were compared to obtain a more suitable and effective network, achieving precise segmentation of multi-shape tea bud leaf targets.

2. Materials and Methods

2.1. Data Acquisition and Processing

The dataset used in this study was obtained from tea plantations in natural environments. Considering the varying shapes of tea bud leaves due to different camera perspectives, images were captured from multiple angles during collection. The capturing device was the Intel RealSense depth camera D435i with a resolution of 1920 × 1080. The tea variety collected was an arbor-type, large-leaf black tea, specifically the Yinghong 9 cultivar. In Guangzhou, Guangdong Province, China, 734 valid images were collected in June 2022 at the South China Agricultural University, and 356 valid images were collected in May 2023 at the Zhongkai University of Agriculture and Engineering. In Yingde, Guangdong Province, China, 458 valid images were collected in July 2023. The dataset comprises 1548 original images.
Subsequently, images were classified and annotated using the Labelme software (https://github.com/labelmeai/labelme, accessed on 17 August 2023), with annotations saved as label files in “json” format. During annotation, focusing on one bud–one leaf as the target, three main shape labels were applied. Targets with a clearly visible tea bud, tea leaf, and stem were labeled “tea_Y”. Targets where the tea bud and tea leaf exhibited an obvious branching angle, but the stem’s position was obscured, were labeled “tea_V”. Targets where the tea bud and tea leaf overlapped in the camera’s field of view were labeled “tea_I”. The tea plantation environment and label schematics are illustrated in Figure 1.
We first divided the raw data into training–validation and test datasets in a 9:1 ratio. Additionally, a statistical analysis was performed on the labeled samples in the training–validation dataset. There were 4306 samples labeled as “tea_Y”, 2403 labeled as “tea_V”, and 1100 labeled as “tea_I”.
To achieve a more balanced dataset, 424 images in the training–validation dataset primarily labeled as “tea_V” and “tea_I” were selected for data augmentation. Augmentation consisted of randomly generating images with brightness, contrast, and saturation adjustments using scale factors ranging from 0.7 to 1.3, as well as mirroring the images. These augmentation methods were chosen primarily because tea leaf images are affected by factors such as changes in natural lighting and variations in image capture angles; simulating these natural environmental factors while keeping the augmented images undistorted makes the dataset more generalizable. Other augmentation methods, such as adding noise or removing parts of the image, were not used because such distortions are less common in real-world scenarios.
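The paper does not specify the augmentation implementation; the following is a minimal sketch of the described transformations (brightness, contrast, and saturation jitter with factors in [0.7, 1.3], plus mirroring), assuming torchvision and paired image/mask inputs.

```python
import random

from PIL import Image
from torchvision.transforms import functional as TF


def augment(image: Image.Image, mask: Image.Image):
    """Randomly jitter brightness/contrast/saturation (factors 0.7-1.3) and mirror."""
    for adjust in (TF.adjust_brightness, TF.adjust_contrast, TF.adjust_saturation):
        factor = random.uniform(0.7, 1.3)   # scale factors from the paper
        image = adjust(image, factor)       # photometric changes do not alter the mask
    if random.random() < 0.5:               # mirroring, applied to image and mask together
        image = TF.hflip(image)
        mask = TF.hflip(mask)
    return image, mask
```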
The augmented dataset consisted of 2668 images total. In this dataset, there were 6459 samples labeled as “tea_Y”, 5478 labeled as “tea_V”, and 4612 labeled as “tea_I”. The dataset statistics are summarized in Table 1. The training–validation dataset was split in a 9:1 ratio for training and validating subsequent segmentation models.

2.2. Unet-Enhanced Model

2.2.1. Segmentation Network Architecture

Unet [27] is a simple and practical semantic segmentation network that, through its symmetric U-shaped architecture and skip connections, can integrate local and global feature information to some extent. However, this process mainly captures local features that are gradually aggregated and fused, and due to the limitations of its convolutional operations, Unet still faces challenges in capturing long-range dependencies and global context. Developing enhanced networks tailored to different research objects has great potential [28]. In this study, Unet serves as the primary benchmark architecture, consisting of an encoder network and a decoder network. The encoder network serves as the main feature extraction network, responsible for extracting features from input images and generating a series of feature layers, which are connected to the decoder network through skip connections. The decoder network merges high-level semantic features from different levels and transforms features into pixel-level image space information to generate precise segmentation results.
Based on this foundation, a new design approach is proposed to achieve fine-grained segmentation of different shapes of tea buds and leaves in the field of view:
(1)
A hierarchical Transformer block is designed based on a self-attention mechanism in the encoder network. It captures dependencies between input features during the generation of high-resolution coarse features and low-resolution fine features, enhancing the extraction of common features of tea buds and leaves with different shapes.
(2)
A multi-path feature aggregation module is designed to fuse feature information from multiple different levels, alleviating the loss of spatial detail information caused by downsampling.
(3)
A polarized attention mechanism is introduced to endow the model with a more comprehensive perception capability for fine-grained targets within the field of view.
The newly designed segmentation network is shown in Figure 2.

2.2.2. Transformer Block

The backbone serves as an image feature extractor, playing a critical role in semantic segmentation networks [29]. Although traditional CNN models like Unet can capture some global information through feature pyramids and skip connections, they struggle to effectively capture long-range dependencies and global features due to their reliance on local receptive fields. Unlike CNN-based models, the self-attention mechanism excels at capturing the global context and demonstrates strong adaptability to downstream tasks during training [30]. Therefore, to better capture long-range dependencies and global context information, this study introduces Transformer blocks. Transformers, through their self-attention mechanism, can dynamically model long-range dependencies between input features, addressing the shortcomings of traditional CNN architectures. This design allows the model to explicitly focus on distant contextual information in the image, thereby enhancing its ability to capture global features.
After completing the encoding of the feature sequences through patch embedding, the self-attention mechanism and multi-layer perceptron in the Transformer block are computed sequentially. The self-attention mechanism is employed to calculate dependencies between sequences, generating new sequence features fused with global information. Assuming the input sequence is $X = [x_1, \ldots, x_n] \in \mathbb{R}^{D \times N}$, the output sequence $H = [h_1, \ldots, h_n] \in \mathbb{R}^{D \times N}$ is obtained. Then, a multi-layer perceptron is applied to the features $H$ to obtain the final output $O = [o_1, \ldots, o_n] \in \mathbb{R}^{D \times N}$, where $D$ represents the sequence dimension and $N$ represents the number of sequences. The Transformer block process is illustrated in Figure 3.
For each input sequence, features are initially mapped to three different spaces, resulting in three feature vectors: the query vector $q_i \in \mathbb{R}^{D}$, key vector $k_i \in \mathbb{R}^{D}$, and value vector $v_i \in \mathbb{R}^{D}$. These vectors form the feature matrices $Q$, $K$, and $V$. Then, by computing self-attention and weighting the different attention values, a feature sequence $H$ fused with global information is obtained. Finally, a multi-layer perceptron is used to derive the feature sequence $O$. For the detailed calculation process, refer to [31].
It is worth noting that, similar to CNNs, as the depth of the backbone network increases, the dimension of the feature sequence increases, while the number of sequences decreases. Single-head self-attention mechanisms have weak learning capabilities for distant elements, and the relationships between the elements they focus on are singular, unable to distinguish between different types of relationships. Therefore, multi-head self-attention mechanisms are introduced, which divide the sequence into multiple sequence blocks of the same dimensionality along the dimension direction, perform self-attention calculations separately, and then merge them. This approach allows for better focus on information from different positions, thereby capturing both global and local relationships. The parameter table for the hierarchical backbone network constructed using Transformer blocks is shown in Table 2.
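As a concrete reference for the attention computation above, the following is a minimal multi-head self-attention sketch, assuming PyTorch; the sequence dimensions and head counts of the actual hierarchical backbone follow Table 2, and the values here are placeholders.

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention over N tokens of dimension D.

    An illustrative sketch of the computation in Section 2.2.2, not the exact
    Transformer block used in the paper.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # project tokens to queries, keys, values
        self.proj = nn.Linear(dim, dim)      # linear output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape                                 # (batch, tokens, dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)                       # dependencies between tokens
        h = (attn @ v).transpose(1, 2).reshape(B, N, D)   # concatenate the heads
        return self.proj(h)
```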

2.2.3. Multi-Path Feature Aggregation Module

For the fine segmentation of tea bud leaves under the field of view, there is significant segmentation difficulty for fine-grained targets due to the challenge of restoring global information in the upsampling process of the decoding network. An effective method for segmenting tea bud leaves is to compensate as much as possible for the feature space information loss caused by the downsampling process.
In DenseNet, since each layer of the model can access features from all preceding layers, the model can better capture fine-grained features of the input, thereby improving the network’s representation ability for complex tasks [32]. Inspired by this, in the Unet architecture, instead of using skip connections to connect features of the same resolution for upsampling, multiple path features with higher resolutions are aggregated together to compensate for the loss of spatial detail information caused by downsampling. By aggregating multiple paths with different levels of granularity through the pathway aggregation module, the network can make more informed decisions, thereby improving segmentation accuracy. The pathway feature aggregation module is illustrated in Figure 4.
First, the feature map $X$ of the decoding network is subjected to convolution and upsampling, and the resolution of the obtained feature $\hat{X}$ is used as the baseline resolution for concatenation in the aggregation module. Different pathway features from the hierarchical Transformer encoder blocks are introduced for aggregation, with pathway features from top to bottom denoted as $X_1, X_2, \ldots, X_i$, where $X_1 \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C_1}$.
Multiple pathway features have varying channel dimensions and resolutions. Features with larger dimensions may receive disproportionately higher attention from the network, failing to accurately reflect their importance [33]. Therefore, when processing features with different dimensions, the pathway features are first subjected to convolution operations to unify the channel dimensions. Furthermore, the resolution of each pathway feature is transformed to be consistent with $\hat{X}$, ensuring that all feature maps have consistent sizes. Equation (1) gives the transformation for the different pathway features, where $R(\cdot)$ denotes the feature resolution.
$$\hat{X}_i = \begin{cases} \mathrm{Downsample}(\mathrm{Conv2d}(X_i)), & \text{if } R(X_i) > R(\hat{X}), \\ \mathrm{Conv2d}(X_i), & \text{if } R(X_i) = R(\hat{X}), \\ \mathrm{Upsample}(\mathrm{Conv2d}(X_i)), & \text{if } R(X_i) < R(\hat{X}), \end{cases} \tag{1}$$
Then, multiple feature maps with the same resolution and channel dimensions are concatenated and subjected to convolutional operations to obtain the final output features.
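A minimal sketch of the aggregation logic described above and in Equation (1) is given below, assuming PyTorch; the pathway channel widths and the fused channel count are illustrative placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiPathAggregation(nn.Module):
    """Sketch of the multi-path feature aggregation in Section 2.2.3.

    Each encoder pathway is projected to a common channel width with a 1x1
    convolution, resized to the baseline resolution of the decoder feature,
    concatenated, and fused with a 3x3 convolution.
    """

    def __init__(self, path_channels, out_channels: int):
        super().__init__()
        self.align = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in path_channels
        )
        self.fuse = nn.Conv2d(out_channels * (len(path_channels) + 1),
                              out_channels, kernel_size=3, padding=1)

    def forward(self, decoder_feat: torch.Tensor, paths):
        # decoder_feat is assumed to already have out_channels channels
        target = decoder_feat.shape[-2:]                  # baseline resolution of X_hat
        aligned = [decoder_feat]
        for conv, x in zip(self.align, paths):
            x = conv(x)                                   # unify the channel dimension
            if x.shape[-2:] != target:                    # Eq. (1): resize to match X_hat
                x = F.interpolate(x, size=target, mode='bilinear', align_corners=False)
            aligned.append(x)
        return self.fuse(torch.cat(aligned, dim=1))       # concatenate and convolve
```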

2.2.4. Polarized Attention Mechanism

In the tea plantation environment, the growth of tea bud leaves is irregular: tea stems at different growth stages vary in length, and tea buds and leaves differ in size. These factors pose challenges for the image segmentation of tea bud leaves in the field of view, especially for fine-grained features.
Therefore, this study proposes a polarized attention mechanism to emphasize the learning of different shapes and fine-grained features of tea bud leaves. The implementation of the attention module consists of channel-wise self-attention and spatial self-attention. This module is illustrated in Figure 5.
First, the feature $X \in \mathbb{R}^{C \times H \times W}$ is used as input for both channel-wise self-attention and spatial self-attention, highlighting a pixel's membership in a particular category along the channel dimension and emphasizing the pixel positions of the same semantic feature along the spatial dimension. $H$ and $W$ are the height and width of the feature map, and $C$ is the number of channels.
In the channel dimension, the query vector $W_q \in \mathbb{R}^{1 \times H \times W}$ and value vector $W_v \in \mathbb{R}^{C/2 \times H \times W}$ are obtained through $1 \times 1$ convolutional kernels, and then a weighted attention proportion $Y_1 \in \mathbb{R}^{C \times 1 \times 1}$ is generated in the channel dimension. The calculation method is shown in Equations (2) and (3).
$$y_1 = \sigma(\Gamma(W_q)) \times \Gamma(W_v) \tag{2}$$
$$Y_1 = \mathrm{Conv}(\Gamma(y_1)) \tag{3}$$
where Γ represents the reshape operation, and σ represents the softmax activation function.
Further, guided by the weights of the channel branch, the feature $Z_1 \in \mathbb{R}^{C \times H \times W}$ with channel-wise global attention is obtained through weighted aggregation.
$$Z_1 = \zeta(Y_1) \odot X \tag{4}$$
where $\zeta$ is the sigmoid activation function, and $\odot$ represents element-wise multiplication.
In the spatial dimension, similarly, $W_q \in \mathbb{R}^{C/2 \times H \times W}$ and $W_v \in \mathbb{R}^{C/2 \times H \times W}$ are obtained through convolutional operations. Equation (5) represents the global pooling of $W_q$, generating the feature $F$ used to calculate the spatial global feature weights $Y_2 \in \mathbb{R}^{1 \times H \times W}$, as shown in Equation (6).
$$F = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} W_q(i, j) \tag{5}$$
$$Y_2 = \Gamma\left[\sigma(\Gamma(F)) \times \Gamma(W_v)\right] \tag{6}$$
Further, guided by the weights of the spatial branch, the feature $Z_2 \in \mathbb{R}^{C \times H \times W}$ with spatial-wise global attention is obtained through weighted aggregation.
$$Z_2 = \zeta(Y_2) \odot X \tag{7}$$
Finally, by merging the channel and spatial features, the feature map $Z \in \mathbb{R}^{C \times H \times W}$ with multi-dimensional global attention capability is obtained.
$$Z = Z_1 \oplus Z_2 \tag{8}$$
where $\oplus$ represents element-wise addition.
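To make the two branches concrete, the following is a minimal polarized attention sketch following Equations (2)–(8) and the structure of [38], assuming PyTorch; normalization details and exact channel reductions may differ from the paper's implementation.

```python
import torch
import torch.nn as nn


class PolarizedAttention(nn.Module):
    """Sketch of the polarized (channel + spatial) attention in Section 2.2.4."""

    def __init__(self, channels: int):
        super().__init__()
        c2 = channels // 2
        # channel branch: Eq. (2)-(4)
        self.ch_q = nn.Conv2d(channels, 1, 1)
        self.ch_v = nn.Conv2d(channels, c2, 1)
        self.ch_up = nn.Conv2d(c2, channels, 1)
        # spatial branch: Eq. (5)-(7)
        self.sp_q = nn.Conv2d(channels, c2, 1)
        self.sp_v = nn.Conv2d(channels, c2, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # --- channel-wise attention ---
        q = self.ch_q(x).view(b, 1, h * w).softmax(dim=-1)        # softmax over pixels
        v = self.ch_v(x).view(b, c // 2, h * w)
        y1 = torch.bmm(v, q.transpose(1, 2)).view(b, c // 2, 1, 1)
        z1 = torch.sigmoid(self.ch_up(y1)) * x                    # Eq. (4)
        # --- spatial-wise attention ---
        f = self.sp_q(x).mean(dim=(2, 3)).view(b, 1, c // 2).softmax(dim=-1)  # Eq. (5)
        v2 = self.sp_v(x).view(b, c // 2, h * w)
        y2 = torch.bmm(f, v2).view(b, 1, h, w)                    # Eq. (6)
        z2 = torch.sigmoid(y2) * x                                # Eq. (7)
        return z1 + z2                                            # Eq. (8)
```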

2.3. Model Training

2.3.1. Training Details

The experimental setup for this study is outlined in Table 3. During the training process, the batch size was set to 16, and the maximum training epochs were set to 800. The image resolution was normalized to 512 × 512. The maximum learning rate of the model was 0.01, and the minimum learning rate dropped to 0.001. The optimizer used was stochastic gradient descent (SGD).
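A minimal sketch of these training settings is given below, assuming PyTorch; the momentum value and the use of cosine annealing to decay the learning rate from 0.01 to 0.001 are assumptions, as the exact schedule is not stated.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for Unet-Enhanced; only the optimizer and
# learning-rate settings below follow the values reported in Section 2.3.1.
model = nn.Conv2d(3, 4, kernel_size=3, padding=1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800, eta_min=0.001)

for epoch in range(800):
    # ... one pass over the training set in batches of 16 images at 512 x 512 ...
    scheduler.step()
```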
During the training process, the loss function combines dice loss [34] and binary cross-entropy loss (BCE Loss), calculated as follows:
$$\text{Dice\_Loss} = 1 - \sum_{c=1}^{4} \frac{2 y_c p_c}{y_c + p_c}$$
$$\text{BCE\_Loss} = -\sum_{c=1}^{4} y_c \log(p_c)$$
$$\text{Loss} = 1 - \sum_{c=1}^{4} \frac{2 y_c p_c}{y_c + p_c} - \sum_{c=1}^{4} y_c \log(p_c)$$
where $y_c$ represents the ground truth label, and $p_c$ represents the predicted label.
The dice loss function is particularly effective in dealing with class imbalance issues, especially when the foreground (or object of interest) occupies a relatively small proportion of the image, and the background is more prevalent. Additionally, it is more sensitive to the accuracy of boundary predictions, which is crucial when optimizing the segmentation of small targets or fine boundaries, such as the segmentation of tea bud leaves. This characteristic makes it highly suitable for tasks where precise boundary delineation is important.
The BCE Loss function can adapt to different label structures by treating each category as an independent binary classification problem. Although we applied data augmentation to our samples, there were still differences in the number of instances between categories, leading to some degree of imbalance. BCE Loss can effectively handle class imbalance by calculating the loss for each category’s pixels separately. Moreover, BCE Loss treats each category independently, ensuring that the learning process for each category is optimized separately. This helps stabilize training in the presence of class imbalance and simplifies the overall model training process.
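The following is a minimal sketch of the combined loss defined by the equations above, assuming PyTorch tensors of per-class probabilities and one-hot labels; the smoothing term and the averaging over classes are implementation assumptions.

```python
import torch


def dice_bce_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Combined Dice + BCE loss over per-class probability maps.

    pred, target: (B, C, H, W), with pred in [0, 1] and target one-hot encoded.
    The smoothing term eps is an implementation assumption.
    """
    # Dice term: 1 - mean_c of 2 * sum(y_c * p_c) / (sum(y_c) + sum(p_c))
    inter = (pred * target).sum(dim=(0, 2, 3))
    union = pred.sum(dim=(0, 2, 3)) + target.sum(dim=(0, 2, 3))
    dice = 1.0 - (2.0 * inter / (union + eps)).mean()
    # Cross-entropy term, following -sum_c y_c * log(p_c) from the text
    bce = -(target * torch.log(pred + eps)).mean()
    return dice + bce
```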

2.3.2. Evaluation Metrics

In this study, mean pixel accuracy (mPA), intersection over union (IoU), and mean IoU (mIoU) are used as evaluation metrics to measure the model's performance. mPA is the average, over all classes, of the ratio of correctly predicted pixels of a class to the total number of pixels of that class. IoU is the ratio of the intersection to the union of the predicted and ground truth pixel sets for a class. mIoU is the mean IoU over all classes.
Let the total number of semantic categories be k + 1 (k target classes and 1 background class). Pii represents the number of pixels that belong to class i and are predicted as class i (true positive), Pij represents the number of pixels that belong to class i but are predicted as class j (false negative), and Pji represents the number of pixels that belong to class j but are predicted as class i (false positive). The calculation formulas are as follows:
$$\text{mPA} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$$
$$\text{IoU}_i = \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \times 100\%$$
$$\text{mIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \text{IoU}_i$$
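As a reference for these definitions, the following is a minimal sketch that computes mPA, per-class IoU, and mIoU from a confusion matrix, assuming NumPy; the example counts are arbitrary.

```python
import numpy as np


def segmentation_metrics(conf: np.ndarray):
    """Compute mPA, per-class IoU, and mIoU from a (k+1) x (k+1) confusion matrix.

    conf[i, j] counts pixels of true class i predicted as class j, so the diagonal
    holds the true positives P_ii defined above.
    """
    tp = np.diag(conf).astype(float)
    per_class_pa = tp / conf.sum(axis=1)                   # P_ii / sum_j P_ij
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)  # per-class IoU_i
    return per_class_pa.mean(), iou, iou.mean()            # mPA, IoU_i, mIoU


# Toy example: background plus the three tea bud leaf classes (k + 1 = 4), arbitrary counts
conf = np.array([[950, 10, 20, 20],
                 [ 15, 80,  3,  2],
                 [ 10,  5, 70,  5],
                 [  8,  2,  4, 76]])
mpa, iou, miou = segmentation_metrics(conf)
```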

3. Results and Analysis

3.1. Training Results

After training, the curve of loss function values for the training dataset is obtained. As shown in Figure 6a, loss values rapidly decrease between training epochs 0 and 200, then slow down. After 400 epochs, training loss stabilizes around 0.014 and validation loss around 0.016.
However, to ensure the model’s generalization ability, we chose to continue training up to 800 epochs. This decision was based on the following considerations: First, we wanted to observe the model’s stability over a longer training period and check for any signs of overfitting. Second, setting a maximum of 800 epochs provided ample space for learning rate adjustments and fine-tuning. In some cases, the process of decreasing the learning rate may result in slight performance improvements in the subsequent epochs. Therefore, by setting a longer training period, we could maximize the model’s performance. It is worth noting that, by monitoring the performance on the validation set, we confirmed that the validation loss did not significantly increase during the 800 epochs, indicating that the model did not overfit. Additionally, during the training process, we saved the model weights every 10 epochs, which allowed us to effectively restore and recover the model if needed.
To validate the segmentation performance of the improved model, the images in the test dataset were evaluated to obtain IoU values for tea bud leaves of different shapes. As shown in Figure 6b, the test results yielded an mIoU of 91.18%. The IoU values for the three shapes of tea bud leaves are similar, with the background the easiest to segment at 0.99 IoU; among the three shapes, “tea_I” exhibits the best segmentation performance at 0.87 IoU.

3.2. Comparison Experiment of Different Backbones

Unet was used as the baseline model to compare network models constructed with different backbones, including Mobilenetv3 [35], VGGNet [36], Resnet [37], and the proposed Tea–Trans backbone. The experimental results are shown in Table 4.
Mobilenetv3, which prioritizes a lightweight network structure, inadequately extracts the features of tea bud leaves with varying shapes in environments with multiple sources of interference. Consequently, it exhibits insufficient detection and segmentation capability, with an mIoU of 50.71% and an mPA of 55.55%.
VGGNet, owing to its network depth and structure, possesses robust feature extraction capabilities. Similarly, Resnet, with its introduction of residual structures, allows for deeper network construction, enhancing the model’s representational capacity. Consequently, both models significantly improve segmentation results. Specifically, both models show comparable segmentation capabilities on Tea_Y, with Vgg–Unet focusing more on tea_I and Resnet–Unet emphasizing tea_V.
Different from traditional convolutional neural networks, the Transformer–Unet model constructed with a Transformer block has shown improved segmentation capabilities for the three shapes of tea bud leaves. The segmentation IoU values for Tea_Y, Tea_V, and Tea_I are 80.49%, 80.71%, and 80.95%, respectively, reaching the highest among all models.
Therefore, selecting a Transformer-based backbone to construct the Unet model enables the model to pay more attention to the global visual information of tea bud leaves of different shapes, rather than just local perspectives. This allows for more effective extraction of common features among tea bud leaves.

3.3. Ablation Experiments

To evaluate the model’s segmentation capability for tea bud leaves of different shapes, ablation experiments were conducted on the path feature aggregation module and different attention mechanism, and the impact of different modules on the network model was analyzed. The experimental results are shown in Table 5.

3.3.1. Path Feature Aggregation Module

On the basis of using a Transformer as the backbone, aggregating different path features improved the model’s mIoU from 85.39% to 89.10% and mPA from 90.83% to 93.76%. This module aggregates features from different paths in the encoding network, facilitating information flow within the model and allowing better utilization of low-level features. Importantly, by aggregating these features, spatial features from the encoding network are effectively utilized to compensate for and restore decoding network features, better mitigating the loss of spatial detail information caused by downsampling. This ultimately enhances the segmentation performance for tea bud leaves of different shapes.

3.3.2. Attention Mechanism

The use of polarized attention [38] further enhanced the segmentation performance of the model, with mIoU reaching 91.18% and mPA reaching 95.10%. Polarized attention enables the model to better distinguish the important regions and key features of different forms of tea bud leaves. By polarizing the features, attention is concentrated on the most discriminative information, improving the model's ability to distinguish features and yielding feature information with a multi-dimensional global attention capability. In addition, the polarized self-attention mechanism combines Softmax and Sigmoid functions in both the channel and spatial branches, which better fits the output distribution of fine-grained regression.
The selective kernel (SK) attention mechanism [39] enhances the model's ability to attend to different forms of tea bud leaves. On top of the path feature aggregation module, it increased mIoU by 1.26% and mPA by 0.76%.
Gate attention [40] enhances the model's response to spatial regions of interest but struggles to exploit global features. With it, the model achieves an mIoU of 87.94% and an mPA of 92.60%, so this attention degrades the segmentation performance of the model.
The convolutional block attention module (CBAM) [41] also did not improve model performance. In terms of the attention dimensions they combine, polarized attention and CBAM are similar. However, from a structural perspective, CBAM applies sequential channel attention and spatial attention modules to weight the channel and spatial information of feature maps, respectively. This mechanism primarily relies on local convolution operations, making it more focused on capturing local spatial and channel information. In contrast, PSA is based on a self-attention mechanism that captures global information interactions within feature maps. Unlike CBAM, polarized attention not only focuses on local spatial and channel information but also more extensively captures long-range dependencies and handles distant feature interactions.
Furthermore, in terms of model functionality, CBAM is used to enhance local perception capabilities and perform fine-grained feature weighting in the spatial and channel dimensions. On the other hand, polarized attention not only handles local spatial and channel information but also captures more extensive long-range feature interactions. By processing the interactions between different subspaces of the input features, PSA enhances the model’s attention to important information.

3.4. Comparison with Advanced Segmentation Models

The experiments were conducted on the dataset collected in natural environments. Advanced segmentation models were selected for comparison, including DeepLabv3+, PSPNet, HRNet, and Segformer [42,43,44,45]. The experimental results are shown in Figure 7.
DeepLabv3+ uses dilated convolutions with various dilation rates to effectively capture a broader range of contextual information at different scales. This method allows the network to understand long-range dependencies within an image by expanding the receptive field without losing resolution. By applying dilated convolutions, DeepLabv3+ can aggregate multi-scale contextual information from deeper layers of the network, which is crucial for capturing both fine details and the broader context in semantic segmentation tasks.
However, the performance of DeepLabv3+ is highly sensitive to the resolution of the input images. When the input images have a low resolution, the network may not be able to capture fine-grained details adequately. This is because the process of dilating convolutions increases the receptive field, but it also spreads out the information, potentially leading to a loss in detail in lower-resolution images. As the network depth increases, the risk of missing detailed information becomes more pronounced. This can result in a reduction in the accuracy and quality of the segmentation results, particularly when fine details are crucial for precise segmentation.
PSPNet (pyramid scene parsing network) processes an input image by dividing it into multiple regions of varying sizes. It then applies pooling operations to these regions to capture contextual information at different scales. Finally, PSPNet combines the pooled results from these different scales to form the final output.
However, this approach can lead to some limitations. Specifically, by focusing on different scales, PSPNet may struggle to retain fine-grained details and subtle features in the image. In the context of segmenting tea bud leaves, which can vary in shape and texture, this loss in detail becomes significant. As a result, the model might fail to accurately segment different forms of tea bud leaves, particularly those with intricate or small-scale features. This challenge is reflected in the segmentation performance metrics: the mean intersection over union (mIoU) is 69.21%, and the mean pixel accuracy (mPA) is 75.87%. These scores indicate that, while the model captures some contextual information effectively, it does not perform as well in preserving and segmenting the detailed characteristics of the tea bud leaves, leading to reduced overall segmentation quality.
HRNet excels in segmentation tasks by maintaining high-resolution feature maps throughout the network. It constructs a high-resolution feature pyramid, which allows the model to simultaneously capture both low-resolution and high-resolution details. This approach enables the model to focus on a wide range of features, including fine details and global context, which is particularly useful for the segmentation of tea bud leaves with varying shapes and sizes.
In the case of tea bud leaf segmentation, HRNet effectively differentiates between different forms due to its ability to preserve detailed features at various scales. This results in better performance in capturing intricate structures and textures of tea bud leaves. Despite achieving commendable results, with the model effectively handling the segmentation of three different tea bud forms, there remains potential for further enhancement. Additional refinements could be made to improve the model’s accuracy and robustness, particularly in challenging scenarios where fine details are critical.
Segformer, leveraging the Transformer architecture, excels in capturing global features and integrating multi-scale information from images. Unlike traditional CNN-based models, Segformer uses self-attention mechanisms to effectively aggregate and process information across the entire image, enabling it to capture long-range dependencies and contextual relationships.
In the context of tea bud leaf segmentation, Segformer demonstrates robust generalization across various leaf shapes. Its ability to handle diverse scales and complex shapes is evident in its consistent performance. For the three different shapes of tea bud leaves tested, Segformer achieves an intersection over union (IoU) score exceeding 80% for each shape. This indicates the model’s effectiveness in accurately segmenting different forms of tea bud leaves, due to its comprehensive feature extraction and integration capabilities. The model’s high performance highlights its suitability for tasks requiring detailed and accurate segmentation across varied target shapes and sizes.
Among all models, the Unet-Enhanced model has the best segmentation performance, with an mIoU of 91.18% and an mPA of 95.10%. This is mainly attributable to the Transformer-based encoder, the multi-path feature aggregation module, and the polarized attention mechanism.
The segmentation results of the different models are shown in Figure 8, Figure 9 and Figure 10. The orange boxes indicate areas with missing segmentation, and the blue boxes indicate areas with incorrect segmentation. As can be seen from Figure 8, when there are few segmentation targets in the image and the targets are large, the different models can all segment the different forms of tea bud leaves reasonably well. However, as the number of targets increases and fine-grained targets appear, accurately segmenting the different forms of tea bud leaves becomes more difficult. As can be seen from Figure 9, the other models exhibit varying degrees of missing or false segmentation, while Unet-Enhanced still maintains better performance.
As can be seen from Figure 10, the segmentation capability of DeepLabv3+, PSPNet, and HRNet is still insufficient, with missing or incorrect segmentation areas to varying degrees. Segformer and Unet-Enhanced segment the different shapes of tea bud leaves with better integrity. However, Segformer did not segment the boundaries of tea bud leaves clearly enough and had difficulty accurately locating boundary regions, resulting in incorrect pixels along fine boundary features. With Unet-Enhanced, only a small number of pixels are not segmented, showing better segmentation performance.

3.5. Segmentation Performance

The segmentation results of the proposed Unet-Enhanced model for tea bud leaves with different shapes and fine granularities are shown in Figure 11. In Figure 11a, due to the large vertical tilt angle of the image, the tea bud leaves mainly present “I” and “V” shapes. In Figure 11b,c, the tilt angle decreases, resulting in a gradual increase in the “Y”-shaped tea bud leaves. In addition, the images contain tea bud leaves at different growth stages, and tea bud leaf targets with different fine granularities appear. As can be seen from the segmentation results, except for a few incompletely segmented pixels, the network model segments the different tea bud leaf targets well, demonstrating that the designed network is robust in segmenting tea bud leaves with multiple targets, multiple shapes, and different fine granularities.
Although the Unet-Enhanced model demonstrates a strong ability to accurately segment most tea buds and leaves, it still exhibits certain limitations. Specifically, as highlighted in the orange box in Figure 12, the model tends to incorrectly segment tea buds located in boundary regions. This issue likely arises because the model, while enhancing the recognition of fine-grained features through the use of enhancing modules, struggles when the overall shape or goal of the tea buds and leaves is not clearly defined. In such cases, the fine-grained features may introduce interference, leading to incorrect segmentation of the tea buds.
Moreover, the model also shows some limitations when dealing with complete tea bud leaves that possess fine-grained characteristics. In these instances, there can be instances of missing segmentation, where parts of the tea buds are not correctly identified or segmented, as highlighted in the blue box in Figure 12. This suggests that, while the model excels in many scenarios, it may still require further refinement to improve its accuracy in handling complex boundary regions and fine-grained features within the tea buds and leaves.

3.6. Feature Extraction Analysis

In order to analyze the model’s feature extraction capabilities for tea bud leaves and stems, the feature sequences at shallow layers of the model were converted into multi-channel feature maps for visualization. As shown in Figure 13, a 64-channel shallow layer feature map is presented. The image contains features such as the edges, textures, and colors of the tea bud leaves and stems.
From the figure, it can be observed that some clearly visible feature maps primarily represent different color channel features, including information on color distribution and contrast. However, color alone is not a reliable indicator, as the color of tea leaves can change depending on factors such as growth stage and lighting conditions.
In rows two and five of Figure 13, some features also capture simple texture information, such as repetitive patterns or local grayscale variations in the image. Additionally, there are inherent differences in the shapes of tea leaves and stems. Tea leaves generally have a broad, spread-out surface, while stems are relatively slender, resulting in different observable characteristics.
To some extent, shallow features are limited in achieving optimal performance. To better analyze the deep features, we also visualized the 64-channel output features from the decoding network, as shown in Figure 14. The deep features mainly represent high-level characteristics of the tea bud leaves. From the figure, it is evident that these high-level features can preliminarily identify the complete tea bud leaf targets or parts of them. However, besides these analyzable features, we also observed some feature maps with low grayscale values that are difficult to interpret for specific meanings. The abstract nature of these features might not be limited to understandable characteristics; they may also highlight more abstract patterns and rules that are sometimes challenging for humans to directly explain but are useful for the task, helping to understand the overall structure and semantics of the image.
Therefore, in addition to understandable features such as color, texture, and shape, the model might also utilize contextual information to assist in segmentation. Even if certain local regions share similar colors or textures, the model can make segmentation decisions based on the contextual relationships within the entire image (such as the relative positions of leaves and the spatial relationship between stems and leaves). This use of global information can further enhance the segmentation accuracy. So, it may be helpful to utilize the Transformer module to build global information associations.
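The paper does not state how the feature maps in Figures 13 and 14 were exported; a minimal sketch using a PyTorch forward hook and matplotlib is shown below, where the choice of layer is a placeholder for whichever shallow or deep module is being inspected.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt


def visualize_feature_maps(model: nn.Module, layer: nn.Module,
                           image: torch.Tensor, n_channels: int = 64):
    """Capture the output of `layer` during a forward pass and plot its first channels.

    `layer` is whichever module's activations are of interest (e.g. a shallow encoder
    block or the decoder output); the choice is illustrative, not from the paper.
    """
    captured = {}
    handle = layer.register_forward_hook(
        lambda module, inp, out: captured.update(feat=out.detach())
    )
    with torch.no_grad():
        model(image)                          # single image, shape (1, 3, H, W)
    handle.remove()

    feat = captured['feat'][0]                # (C, H', W') feature maps
    n = min(n_channels, feat.shape[0])
    cols = 8
    rows = (n + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(2 * cols, 2 * rows))
    for i, ax in enumerate(axes.flat):
        ax.axis('off')
        if i < n:
            ax.imshow(feat[i].cpu(), cmap='gray')   # one grayscale map per channel
    plt.show()
```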

3.7. Feature Attention Analysis

Although neural network models can produce segmented results for different classes when performing image segmentation, the deep features are not very interpretable in network processing. Therefore, we use class activation heatmaps to compare the visualization effects of the Unet model and the improved Unet-Enhanced model. In the heatmap, the darker the red area, the stronger the model’s focus on features at that location.
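The paper does not specify which class activation mapping method was used to generate these heatmaps; the following is a minimal Grad-CAM-style sketch adapted to a segmentation output, assuming PyTorch, with the target layer and class index as placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gradcam_heatmap(model: nn.Module, target_layer: nn.Module,
                    image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Grad-CAM-style heatmap for one class of a segmentation model.

    `target_layer` and `class_idx` are placeholders; the paper does not state
    which CAM variant or layer was used for its visualizations.
    """
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(v=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

    logits = model(image)                        # (1, num_classes, H, W)
    score = logits[:, class_idx].sum()           # aggregate the class logits over pixels
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads['v'].mean(dim=(2, 3), keepdim=True)          # channel importance
    cam = F.relu((weights * feats['v']).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear', align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)    # normalize to [0, 1]
```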
As shown in Figure 15, the heatmap generated by the Unet model highlights the feature attention for tea bud leaves. From the figure, it can be observed that the model demonstrates some ability to focus on the features of the tea bud leaves. However, the strong attention to features, as indicated by the red regions in the heatmap, is primarily concentrated in the center area of the tea leaves, with less focus on the stem. This phenomenon may occur because the CNN model extracts a large number of features through the convolutional layers, but these features belong to different categories, such as shape, texture, edge features, and color. The model struggles to differentiate and focus on the most representative target features, as it tends to give equal attention to features of different classes.
Figure 16 shows the heatmap of feature attention for tea bud leaves generated by the Unet-Enhanced model. It is evident that the proposed model exhibits significantly stronger feature attention for the tea bud leaves. A large portion of the attention is focused on both the leaves and the stems, and these regions are predominantly red.
Comparing the heatmaps of the two models, there is a clear difference in the attention paid to the tea bud leaves and stems. In the Unet model, most of the feature areas are green, whereas in the Unet-Enhanced model, most areas are red. From a quantitative perspective, the Unet-Enhanced model’s attention is noticeably higher than that of the Unet model, and the attention is concentrated on the targeted areas of the tea bud leaves. From the perspective of edge attention, the Unet-Enhanced model displays a more complete and clear focus on the edges of the tea bud leaves and stems. This suggests that the Transformer module strengthens the dependencies between features, allowing the model to more effectively capture global information and long-distance dependencies, rather than merely local features. Moreover, the introduction of the polarized attention mechanism better balances and utilizes different information from channels and spatial dimensions, enabling the model to focus on both global and detailed features simultaneously.

3.8. Limitations and Future Work

The model proposed in this study has been trained and tested on a specific tea variety, Yinghong 9, characterized by its strip-like or oval shape, with complete and plump leaves that exhibit serrated edges. The tea buds are relatively robust, with overall uniform leaves and clearly visible leaf veins. The color of the leaves is a bright green, often glossy, with healthy leaves displaying a vibrant green or deep green hue. This variety differs from others, such as the flat green leaves of Longjing tea or the needle-like pale green leaves of Baihao Yinzhen. Although the model performs well on Yinghong 9, its applicability to other tea varieties has not yet been validated. Significant differences in the geometric shape, color, and texture of different tea varieties could impact the model’s segmentation and recognition performance. To ensure the model’s generalization across various tea varieties, further experimental validation is required, and the model may need to be retrained or fine-tuned according to the specific characteristics of these varieties.
In future work, we will explore ways to enhance the model’s generalizability through fine-tuning or by incorporating a more diverse training dataset. We also plan to further evaluate the model’s robustness and generalization capabilities through comprehensive cross-variety experiments.
Additionally, the segmentation model proposed in this study can effectively segment different shape types of tea bud leaves, providing valuable information for analyzing the occluded or overlapping states of tea bud leaves. However, to some extent, the model does not address the segmentation of occluded or hidden areas. In future research, we will build upon this model to analyze different occlusion or overlapping scenarios and design post-processing steps to achieve the localization of invisible regions. For example, we plan to develop a standard morphological model of tea buds and use morphological processing or conditional random field (CRF) techniques to address these issues.

4. Conclusions

This study proposes a new segmentation network for tea bud leaves with different shapes and different fine granularities in the field of view. The experimental results demonstrate the effectiveness of the proposed Unet-Enhanced model. The specific conclusions are as follows:
(1)
One bud–one leaf targets present a variety of shapes under different imaging angles due to stem occlusion or the overlapping of buds and leaves. Utilizing a hierarchical Transformer block based on a self-attention mechanism facilitates the capture of long-range dependencies between features, enhancing the expression of common features.
(2)
Using multi-path feature aggregation to fuse features and introducing a polarized self-attention mechanism can effectively enhance the fine-grained features of tea bud leaves. The experimental results show that the IoU values for the background, Tea_Y, Tea_V, and Tea_I are 99.62%, 88.69%, 86.46%, and 89.94%, respectively; the mIoU was 91.18% and the mPA was 95.10%. These results verify that the designed model can effectively and accurately segment tea bud leaf targets with different shapes and different fine granularities in the field of view.

Author Contributions

Data curation, T.C. and H.L.; funding acquisition, W.W.; project administration, W.W.; formal analysis, T.C. and J.L.; investigation, W.W.; methodology, T.C., H.L. and J.C.; software, T.C. and J.C.; supervision, W.W.; resources, T.C. and J.C.; validation, W.W.; writing—original draft, T.C.; writing—review and editing, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the 2024 Rural Revitalization Strategy Special Funds Provincial Project (2023LZ04), Guangdong Province (Shenzhen) Digital and Intelligent Agricultural Service Industrial Park (FNXM012022020-1), Construction of Smart Agricultural Machinery and Control Technology Research and Development, and the 2023 Guangdong Provincial Special Fund for Modern Agriculture Industry Technology Innovation Teams (2023KJ120).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the article.

Acknowledgments

The authors acknowledge the editors and reviewers for their constructive comments and all the support on this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xie, S.; Sun, H. Tea-YOLOv8s: A tea bud detection model based on deep learning and computer vision. Sensors 2023, 23, 6576. [Google Scholar] [CrossRef]
  2. Wang, J.; Li, X.; Yang, G.; Wang, F.; Men, S.; Xu, B.; Xu, Z.; Yang, H.; Yan, L. Research on Tea Trees Germination Density Detection Based on Improved YOLOv5. Forests 2022, 13, 2091. [Google Scholar] [CrossRef]
  3. Zhu, Y.; Wu, C.; Tong, J.; Tong, J.; Chen, J.; He, L.; Wang, R.; Jia, J. Deviation tolerance performance evaluation and experiment of picking end effector for famous tea. Agriculture 2021, 11, 128. [Google Scholar] [CrossRef]
  4. Xu, W.; Zhao, L.; Li, J.; Shang, S.; Ding, X.; Wang, T. Detection and classification of tea buds based on deep learning. Comput. Electron. Agric. 2022, 192, 106547. [Google Scholar] [CrossRef]
  5. Zhang, S.; Yang, H.; Yang, C.; Zhang, S.; Yang, H.; Yang, C.; Yuan, W.; Li, X.; Wang, X.; Zhang, Y.; et al. Edge device detection of tea leaves with one bud and two leaves based on ShuffleNetv2-YOLOv5-Lite-E. Agronomy 2023, 13, 577. [Google Scholar] [CrossRef]
  6. Yang, H.; Chen, L.; Ma, Z.; Chen, M.; Zhong, Y.; Deng, F.; Li, M. Computer vision-based high-quality tea automatic plucking robot using Delta parallel manipulator. Comput. Electron. Agric. 2021, 181, 105946. [Google Scholar] [CrossRef]
  7. Li, Y.; Wu, S.; He, L.; Tong, J.; Zhao, R.; Jia, J.; Chen, J.; Wu, C. Development and field evaluation of a robotic harvesting system for plucking high-quality tea. Comput. Electron. Agric. 2023, 206, 107659. [Google Scholar] [CrossRef]
  8. Tang, Y.; Han, W.; Hu, A.; Wang, W.; Anguo, H. Design and Experiment of Intelligentized Tea plucking Machine for Human Riding Based on Machine Vision. Nongye Jixie Xuebao/Trans. Chin. Soc. Agric. Mach. 2016, 47, 15–20. [Google Scholar]
  9. Zhang, L.; Zhang, H.; Chen, Y.; Dai, S.; Li, X.; Kenji, I.; Liu, Z.; Li, M. Real-time monitoring of optimum timing for harvesting fresh tea leaves based on machine vision. Int. J. Agric. Biol. Eng. 2019, 12, 6–9. [Google Scholar] [CrossRef]
  10. Karunasena, G.; Priyankara, H. Tea bud leaf identification by using machine learning and image processing techniques. Int. J. Sci. Eng. Res. 2020, 11, 624–628. [Google Scholar] [CrossRef]
  11. Zhang, L.; Zou, L.; Wu, C.; Jia, J.; Chen, J. Method of famous tea sprout identification and segmentation based on improved watershed algorithm. Comput. Electron. Agric. 2021, 184, 106108. [Google Scholar] [CrossRef]
  12. Chen, S.; Zou, X.; Zhou, X.; Xiang, Y.; Wu, M. Study on fusion clustering and improved yolov5 algorithm based on multiple occlusion of camellia oleifera fruit. Comput. Electron. Agric. 2023, 206, 107706. [Google Scholar] [CrossRef]
  13. Wu, W.; He, Z.; Li, J.; Chen, T.; Luo, Q.; Luo, Y.; Wu, W.; Zhang, Z. Instance Segmentation of Tea Garden Roads Based on an Improved YOLOv8n-seg Model. Agriculture 2024, 14, 1163. [Google Scholar] [CrossRef]
  14. Chen, Y.T.; Chen, S.F. Localizing plucking points of tea leaves using deep convolutional neural networks. Comput. Electron. Agric. 2020, 171, 105298. [Google Scholar] [CrossRef]
  15. Li, J.; Li, J.; Zhao, X.; Su, X.; Wu, W. Lightweight detection networks for tea bud on complex agricultural environment via improved YOLO v4. Comput. Electron. Agric. 2023, 211, 107955. [Google Scholar] [CrossRef]
  16. Meng, J.; Wang, Y.; Zhang, J.; Tong, S.; Chen, C.; Zhang, C.; An, Y.; Kang, F. Tea Bud and Picking Point Detection Based on Deep Learning. Forests 2023, 14, 1188. [Google Scholar] [CrossRef]
  17. Cao, M.; Fu, H.; Zhu, J.; Cai, C. Lightweight tea bud recognition network integrating GhostNet and YOLOv5. Math. Biosci. Eng. 2022, 19, 12897–12914. [Google Scholar] [CrossRef] [PubMed]
  18. Yan, L.; Wu, K.; Lin, J.; Xu, X.; Zhang, J.; Zhao, X.; Tayor, J.; Chen, D. Identification and picking point positioning of tender tea shoots based on MR3P-TS model. Front. Plant Sci. 2022, 13, 962391. [Google Scholar] [CrossRef]
  19. Li, Y.; Ma, R.; Zhang, R.; Cheng, Y.; Dong, C. A tea buds counting method based on YOLOV5 and Kalman filter tracking algorithm. Plant Phenomics 2023, 5, 0030. [Google Scholar] [CrossRef]
  20. Fang, M.R.; Lu, J.; Ruan, J.Y.; Bian, L.; Wu, C.Y.; Yao, Q. Tea Buds Detection Model Using Improved YOLOv4-tiny. J. Tea Sci. 2022, 42, 549–560. [Google Scholar]
  21. Yan, C.; Chen, Z.; Li, Z.; Liu, R.; Li, Y.; Xiao, H.; Lu, P.; Xie, B. Tea sprout picking point identification based on improved deepLabV3+. Agriculture 2022, 12, 1594. [Google Scholar] [CrossRef]
  22. Chen, T.; Li, H.; Chen, J.; Zeng, Z.; Han, C.; Wu, W. Detection network for multi-size and multi-target tea bud leaves in the field of view via improved YOLOv7. Comput. Electron. Agric. 2024, 218, 108700. [Google Scholar] [CrossRef]
  23. Qian, C.; Li, M.; Ren, Y. Tea sprouts segmentation via improved deep convolutional encoder-decoder network. IEICE Trans. Inf. Syst. 2020, 103, 476–479. [Google Scholar] [CrossRef]
  24. Lu, J.; Yang, Z.; Sun, Q.; Gao, Z.; Ma, W. A Machine Vision-Based Method for Tea Buds Segmentation and Picking Point Location Used on a Cloud Platform. Agronomy 2023, 13, 1537. [Google Scholar] [CrossRef]
  25. Wang, T.; Zhang, K.; Zhang, W.; Wang, R.; Wan, S.; Rao, Y.; Jiang, Z.; Gu, L. Tea picking point detection and location based on Mask-RCNN. Inf. Process. Agric. 2021, 10, 267–275. [Google Scholar] [CrossRef]
  26. Zhang, F.; Sun, H.; Xie, S.; Dong, C.; Li, Y.; Xu, Y.; Zhang, Z.; Chen, F. A tea bud segmentation, detection and picking point localization based on the MDY7-3PTB model. Front. Plant Sci. 2023, 14, 1199473. [Google Scholar] [CrossRef] [PubMed]
  27. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin, Germany, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  28. Qian, L.; Wen, C.; Li, Y.; Hu, Z.; Zhou, X.; Xia, X.; Kim, S.H. Multi-scale context UNet-like network with redesigned skip connections for medical image segmentation. Comput. Methods Programs Biomed. 2024, 243, 107885. [Google Scholar] [CrossRef] [PubMed]
  29. Lin, G.; Wang, C.; Xu, Y.; Wang, M.; Zhang, Z.; Zhu, L. Real-time guava tree-part segmentation using fully convolutional network with channel and spatial attention. Front. Plant Sci. 2022, 13, 991487. [Google Scholar] [CrossRef]
  30. Li, Y.; Jing, B.; Feng, X.; Li, Z.; He, Y.; Wang, J.; Zhang, Y. nnSAM: Plug-and-play Segment Anything Model Improves nnUNet Performance. arXiv 2023, arXiv:2309.16967. [Google Scholar]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  32. Iandola, F.; Moskewicz, M.; Karayev, S.; Girshick, R.; Darrell, T.; Keutzer, K. Densenet: Implementing efficient convnet descriptor pyramids. arXiv 2014, arXiv:1404.1869. [Google Scholar]
  33. Liang, Z.; Zhao, K.; Liang, G.; Li, S.; Wu, Y.; Zhou, Y. MAXFormer: Enhanced transformer for medical image segmentation with multi-attention and multi-scale features fusion. Knowl.-Based Syst. 2023, 280, 110987. [Google Scholar] [CrossRef]
  34. Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; Li, J. Dice loss for data-imbalanced NLP tasks. arXiv 2019, arXiv:1911.02855. [Google Scholar]
  35. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  36. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  38. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise regression. arXiv 2021, arXiv:2107.00782. [Google Scholar]
  39. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  40. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  41. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  42. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  43. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  44. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  45. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
Figure 1. Tea garden environment and label schematics.
Figure 2. Segmentation network architecture. Note: Conv1 × 1: convolution with a 1 × 1 kernel; Conv3 × 3: convolution with a 3 × 3 kernel; BN: batch normalization; Upsampling2D: feature upsampling; ReLU: activation function; Linear: linear transformation; Interpolate: bilinear interpolation.
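For readers who want to relate the notation above to code, the snippet below is a minimal PyTorch sketch, assuming a generic decoder stage built from the operations listed in the note (3 × 3 convolution, batch normalization, ReLU, and 2× bilinear upsampling); the channel widths are illustrative and taken from Table 2, and the block is not the authors' released implementation.

```python
# Minimal sketch (not the authors' exact code) of the building blocks named in
# Figure 2: 3x3 convolution + batch normalization + ReLU, then 2x upsampling.
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Hypothetical decoder stage: refine the feature map, then upsample by 2."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.refine(x))

# Example: a (16, 16, 256) feature map is refined and upsampled to (32, 32, 160).
x = torch.randn(1, 256, 16, 16)
print(UpBlock(256, 160)(x).shape)  # torch.Size([1, 160, 32, 32])
```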
Figure 3. Computation of Transformer block.
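As a concrete reference for Figure 3, the sketch below implements a generic pre-norm Transformer block (multi-head self-attention plus an MLP, each with a residual connection); the hierarchical block used in the paper may differ in details such as an efficient-attention variant, so treat this only as an illustration. The token shape corresponds to the fourth stage in Table 2.

```python
# Illustrative generic Transformer block: LayerNorm -> multi-head self-attention
# -> residual, then LayerNorm -> MLP -> residual.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, sequence length, channels), e.g. (1, 256, 256) for stage 4
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h)
        tokens = tokens + attn_out
        return tokens + self.mlp(self.norm2(tokens))

tokens = torch.randn(1, 256, 256)                      # stage-4 token sequence from Table 2
print(TransformerBlock(256, heads=8)(tokens).shape)    # torch.Size([1, 256, 256])
```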
Figure 4. Path feature aggregation module.
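One plausible reading of the aggregation step in Figure 4 is sketched below, assuming encoder features from several stages are resized to a common resolution, concatenated, and fused by a 1 × 1 convolution; channel widths follow Table 2, but the actual module may organize its paths differently.

```python
# Sketch of multi-path feature aggregation under the stated assumption:
# resize all encoder outputs, concatenate along channels, fuse with 1x1 conv.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathAggregation(nn.Module):
    def __init__(self, in_channels=(32, 64, 160, 256), out_ch=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(sum(in_channels), out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats, size):
        # feats: list of encoder feature maps at different spatial resolutions
        resized = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                   for f in feats]
        return self.fuse(torch.cat(resized, dim=1))

feats = [torch.randn(1, c, s, s) for c, s in [(32, 128), (64, 64), (160, 32), (256, 16)]]
print(PathAggregation()(feats, size=(64, 64)).shape)   # torch.Size([1, 64, 64, 64])
```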
Figure 5. Polarized attention mechanism module.
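For orientation, the sketch below implements only the channel branch of polarized self-attention in the spirit of [38]; the refined module used in this work also contains a spatial branch, so this is an illustration rather than the authors' implementation.

```python
# Simplified channel-only polarized attention: compress channels for queries,
# halve channels for values, compute a channel descriptor, and reweight the input.
import torch
import torch.nn as nn

class ChannelPolarizedAttention(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.wq = nn.Conv2d(ch, 1, kernel_size=1)        # query: collapse channels
        self.wv = nn.Conv2d(ch, ch // 2, kernel_size=1)  # value: halve channels
        self.wz = nn.Conv2d(ch // 2, ch, kernel_size=1)  # restore channel width
        self.softmax = nn.Softmax(dim=-1)
        self.ln = nn.LayerNorm(ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.softmax(self.wq(x).view(b, 1, h * w))           # (b, 1, hw)
        v = self.wv(x).view(b, c // 2, h * w)                     # (b, c/2, hw)
        z = torch.bmm(v, q.transpose(1, 2)).view(b, c // 2, 1, 1) # channel descriptor
        attn = torch.sigmoid(self.ln(self.wz(z).view(b, c))).view(b, c, 1, 1)
        return x * attn                                           # channel-wise reweighting

x = torch.randn(1, 64, 64, 64)
print(ChannelPolarizedAttention(64)(x).shape)   # torch.Size([1, 64, 64, 64])
```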
Figure 6. Results of training and testing: (a) Loss curve for the training dataset; (b) Statistics of the mIoU for the testing dataset.
Figure 7. Comparison of segmentation performance of different models.
Figure 8. Comparison of segmentation results for large-size tea bud leaves.
Figure 9. Comparison of segmentation results for multi-target and fine-grained tea bud leaves.
Figure 10. Comparison of segmentation results of different network models. (A) Original image. (B) Ground truth. (C) DeepLabv3+. (D) PSPNet. (E) HRNet. (F) SegFormer. (G) Unet-Enhanced.
Figure 11. Segmentation results for tea bud leaves with different shapes and fine-grained features: (a) mainly “tea_I”; (b) mainly “tea_V”; (c) mainly “tea_Y”.
Figure 12. Failure cases of Unet-Enhanced.
Figure 13. Shallow feature visualization.
Figure 14. Deep feature visualization.
Figure 15. Unet heat maps. (a–f) are the test results of different samples.
Figure 16. Unet-Enhanced heat maps. (a–f) are the test results of different samples.
Table 1. Dataset statistics.

| Datasets | Image Samples | Tea_Y | Tea_V | Tea_I |
|---|---|---|---|---|
| Raw dataset | 1548 | 4829 | 2644 | 1238 |
| Test set | 154 | 523 | 241 | 138 |
| Training–validation dataset | 1394 | 4306 | 2403 | 1100 |
| Augmented dataset | 2668 | 6459 | 5478 | 4612 |
| Training set | 2402 | 5781 | 4944 | 4123 |
| Validation set | 266 | 678 | 534 | 489 |
Table 2. Backbone network parameters.

| Operation | Input | Output | Kernel | Stride | Heads |
|---|---|---|---|---|---|
| Patch embedding | (512, 512, 3) | (128, 128, 32) | (7, 7) | 4 | – |
| (Transformer block) × 2 | (128, 128, 32) | (16,384, 32) | – | – | 1 |
| Feature conversion | (16,384, 32) | (128, 128, 32) | – | – | – |
| Patch embedding | (128, 128, 32) | (64, 64, 64) | (3, 3) | 2 | – |
| (Transformer block) × 2 | (64, 64, 64) | (4096, 64) | – | – | 2 |
| Feature conversion | (4096, 64) | (64, 64, 64) | – | – | – |
| Patch embedding | (64, 64, 64) | (32, 32, 160) | (3, 3) | 2 | – |
| (Transformer block) × 2 | (32, 32, 160) | (1024, 160) | – | – | 5 |
| Feature conversion | (1024, 160) | (32, 32, 160) | – | – | – |
| Patch embedding | (32, 32, 160) | (16, 16, 256) | (3, 3) | 2 | – |
| (Transformer block) × 2 | (16, 16, 256) | (256, 256) | – | – | 8 |
| Feature conversion | (256, 256) | (16, 16, 256) | – | – | – |
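The shape transitions in Table 2 can be reproduced with a strided convolution followed by flattening the spatial grid into a token sequence. The sketch below is an assumption about the patch-embedding layout (padding chosen to match the tabulated sizes) and shows only the first stage, which maps a (512, 512, 3) image to a (128, 128, 32) feature map and 16,384 tokens of width 32.

```python
# Sketch of one patch-embedding stage: strided convolution, then flatten the
# spatial grid into a sequence of tokens for the Transformer blocks.
import torch
import torch.nn as nn

def patch_embed(x, out_ch, kernel, stride):
    conv = nn.Conv2d(x.shape[1], out_ch, kernel_size=kernel,
                     stride=stride, padding=kernel // 2)
    fmap = conv(x)                              # (b, out_ch, H/stride, W/stride)
    tokens = fmap.flatten(2).transpose(1, 2)    # (b, H*W/stride^2, out_ch)
    return fmap, tokens

x = torch.randn(1, 3, 512, 512)
fmap, tokens = patch_embed(x, out_ch=32, kernel=7, stride=4)
print(fmap.shape, tokens.shape)  # (1, 32, 128, 128) and (1, 16384, 32)
```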
Table 3. Experimental environment.

| Configuration | Parameters |
|---|---|
| Operating system | Windows 10 |
| GPU | Nvidia RTX 3090 |
| CPU | Intel i9-10920X |
| Library | PyTorch 1.10.0 |
| Accelerated environment | CUDA 11.3; cuDNN 8.2.1 |
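A quick, illustrative way to confirm that a local setup matches Table 3 is to query the framework directly; the snippet below assumes a PyTorch build that reports its CUDA and cuDNN versions.

```python
# Illustrative environment check against the versions listed in Table 3.
import torch

print(torch.__version__)                 # expected: 1.10.0
print(torch.version.cuda)                # expected: 11.3
print(torch.backends.cudnn.version())    # expected: 8201 (cuDNN 8.2.1)
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```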
Table 4. Experimental comparison results of different models.

| Model | Backbone | Background IoU (%) | Tea_Y IoU (%) | Tea_V IoU (%) | Tea_I IoU (%) | mIoU (%) | mPA (%) |
|---|---|---|---|---|---|---|---|
| MobileNetV3–Unet | MobileNetV3 | 96.75 | 31.29 | 30.44 | 44.37 | 50.71 | 55.55 |
| VGG–Unet | VGGNet | 99.04 | 77.83 | 67.73 | 80.04 | 81.16 | 85.36 |
| ResNet–Unet | ResNet | 99.08 | 77.18 | 78.95 | 77.77 | 83.24 | 87.11 |
| Transformer–Unet | Transformer | **99.41** | **80.49** | **80.71** | **80.95** | **85.39** | **90.83** |

Note: The bold font indicates the maximum value of each metric.
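The IoU, mIoU, and mPA values reported above follow the standard definitions; the sketch below, using a toy 4 × 4 confusion matrix over the four classes (background, Tea_Y, Tea_V, Tea_I), shows how such metrics are typically computed.

```python
# Per-class IoU and pixel accuracy from a confusion matrix, averaged over classes.
import numpy as np

def miou_mpa(conf: np.ndarray):
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp        # predicted as class c but labelled otherwise
    fn = conf.sum(axis=1) - tp        # labelled class c but predicted otherwise
    iou = tp / (tp + fp + fn)         # per-class intersection over union
    pa = tp / conf.sum(axis=1)        # per-class pixel accuracy (recall)
    return iou.mean(), pa.mean()

# Toy confusion matrix (rows: ground truth, columns: prediction)
conf = np.array([[980, 5, 10, 5],
                 [  8, 90,  1, 1],
                 [  6,  2, 88, 4],
                 [  4,  1,  3, 92]])
miou, mpa = miou_mpa(conf)
print(f"mIoU = {miou:.2%}, mPA = {mpa:.2%}")
```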
Table 5. Experimental comparison results of different components.

| Tag | Transformer–Unet | Path Feature Aggregation | Polarized Attention | SK Attention | Gates Attention | CBAM Attention | mIoU (%) | mPA (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | ✓ | | | | | | 85.39 | 90.83 |
| 2 | ✓ | ✓ | | | | | 89.10 | 93.76 |
| 3 | ✓ | ✓ | ✓ | | | | **91.18** | **95.10** |
| 4 | ✓ | ✓ | | ✓ | | | 90.36 | 94.52 |
| 5 | ✓ | ✓ | | | ✓ | | 87.94 | 92.60 |
| 6 | ✓ | ✓ | | | | ✓ | 88.07 | 93.10 |

Note: The bold font indicates the maximum value of each metric; ✓ marks the components included in each configuration.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
