Article

Research on a Trellis Grape Stem Recognition Method Based on YOLOv8n-GP

1 College of Mathematics and Computer Science, Zhejiang A&F University, Hangzhou 311300, China
2 Key Laboratory of Forestry Intelligent Monitoring and Information Technology of Zhejiang Province, Hangzhou 311300, China
3 Key Laboratory of State Forestry and Grassland Administration on Forestry Sensing Technology and Intelligent Equipment, Hangzhou 311300, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Agriculture 2024, 14(9), 1449; https://doi.org/10.3390/agriculture14091449
Submission received: 23 June 2024 / Revised: 19 August 2024 / Accepted: 22 August 2024 / Published: 25 August 2024
(This article belongs to the Section Digital Agriculture)

Abstract

Grapes are an important cash crop that contributes to the rapid development of the agricultural economy. The harvesting of ripe fruits is one of the crucial steps in the grape production process. However, at present, picking is mainly manual, which is time-consuming and costly. Therefore, it is particularly important to implement intelligent grape picking, in which the accurate detection of grape stems is a key step. In this study, a trellis grape stem detection model, YOLOv8n-GP, was proposed by combining the SENetV2 attention module and the CARAFE upsampling operator with YOLOv8n-pose. Specifically, this study first embedded the SENetV2 attention module at the bottom of the backbone network to enhance the model’s ability to extract key feature information. Then, we utilized the CARAFE upsampling operator to replace the upsampling modules in the neck network, expanding the receptive field of the model without increasing its parameters. Finally, to validate the detection performance of YOLOv8n-GP, we compared it with keypoint detection models built on YOLOv8n-pose, YOLOv5-pose, YOLOv7-pose, and YOLOv7-Tiny-pose. Experimental results show that the precision, recall, mAP, and mAP-kp of YOLOv8n-GP reached 91.6%, 91.3%, 97.1%, and 95.4%, representing improvements of 3.7%, 3.6%, 4.6%, and 4.0%, respectively, over YOLOv8n-pose. Furthermore, YOLOv8n-GP exhibited superior detection performance compared with the other keypoint detection models in terms of each evaluation indicator. These results demonstrate that YOLOv8n-GP can detect trellis grape stems efficiently and accurately, providing technical support for advancing intelligent grape harvesting.

1. Introduction

Grapes occupy an important place in the fruit industry. In China, grapes, as an important cash crop, play a key role in the rapid development of the agricultural economy [1,2]. The ripening time of grapes is short and concentrated in the summer. Once ripe, they need to be picked in a timely manner; otherwise, they are prone to rotting and spoilage. However, the current grape harvesting method is mainly manual, with shortcomings such as low efficiency, high costs, and considerable labor demands [3,4,5]. In addition, with the accelerating trend of population aging in China, problems such as labor shortages and rapidly rising labor costs have made it difficult to meet the demand for large-scale grape production. Therefore, it is crucial to realize intelligent grape harvesting, in which accurately and efficiently detecting grape stems is a key step [6,7].
Research on grape stem detection is mainly categorized into traditional image processing methods and deep learning methods [8]. The traditional image processing method combines features such as grayscale, color, and texture of target objects with traditional image processing algorithms such as threshold segmentation and region growing to achieve stem detection [9,10,11]. Stem detection methods based on traditional image processing have shown a certain degree of usability from some research results. For example, Luo et al. first utilized a K-means-based method to segment the grapes in the image, then extracted the edge images of grape clusters and built a geometric model to select the region of interest for each grape stem. Finally, the cutting point of the grape stems was determined by a geometric constraint algorithm [12]. Jin et al. detected the center of mass of distant grape clusters using a series of image processing methods, followed by combining edge detection and cumulative probabilistic Hough transform to detect and localize close grape stems and cut points. This method proved highly reliable for recognizing grape stems and locating cut points [13]. Zhu et al. segmented grapes in two-dimensional space using the maximum connectivity domain and the Otsu algorithm. In the meantime, a device was built to locate grape picking using point infrared tubes, which determined whether the device had moved to the position of the grape stems according to the planar coordinate information, thus identifying the picking point [14]. The results of the above studies demonstrate that the grape stem detection methods based on traditional image processing exhibited a certain degree of practicality. However, the above methods are affected by natural factors such as light changes, complex backgrounds, and object occlusion, which limit their effectiveness in complex natural environments [15].
As computer technology and agricultural informatization have advanced, deep-learning techniques have achieved substantial progress in the field of fruit detection. Xiao et al. integrated CA modules and optimized the backbone within YOLOv7 to identify citrus fruits in natural environments; additionally, they paired the detector with a flexible hand-clawed robot to automate the citrus harvesting process [16]. Li et al. identified and localized apples in intricate scenes by utilizing Faster R-CNN and an improved template matching approach, achieving an AP of 88.12% and a localization precision of 99.64% [17]. Zhang et al. employed YOLOv5 in combination with a depth camera for the identification and localization of trellis pears, with a recognition accuracy of 99% and a picking success rate of 86.67% [18]. Numerous studies have reported commendable results in the domains of fruit detection and automated fruit picking. However, unlike the aforementioned fruits, which can be harvested efficiently simply by accurately locating the fruit during picking, grapes present unique challenges because of their soft texture and the difficulty of detaching their stems. Thus, relying solely on identifying the location of the fruit is insufficient to efficiently execute the automatic grape harvesting task [19,20]. Therefore, identifying grape stems is essential for an automated grape harvesting system.
In recent years, deep-learning approaches have been extensively utilized for identifying grape stems. Deep learning models can autonomously learn the complex features of target objects, resulting in better detection performance and robustness. At present, grape stem detection methods based on object detection models fall into two main categories: one-stage object detection models, such as SSD [21] and the YOLO algorithms [22], and two-stage object detection models, such as Faster R-CNN [23] and Mask R-CNN [24]. Ning et al. used the region growing algorithm for fine segmentation of grape stems after coarse segmentation with an improved Mask R-CNN model and calculated the optimal picking region according to the center of mass of the grape stems, achieving a detection accuracy of 88% and a localization time of 4.9 s [25]. Wang et al. proposed the DualSeg model to segment grape stems in complex vineyard environments and achieved superior detection results; however, the variety of grapes detected by this approach was limited, making it difficult to apply in agricultural production environments [26]. Wu et al. divided the grape stem detection task into two steps, combining YOLOv5m with Ghost-HRNet and achieving 90.2% stem recognition accuracy. Regrettably, the performance of this approach was relatively poor in extremely complex scenarios, and its detection speed was only 7.7 FPS because of the two-step approach [27]. Zhu et al. regarded the center point of the prediction boxes for grape stems detected by the YOLOv5-CFD model as the picking point, with a localization accuracy of 88.72% and a detection speed of 20.8 FPS [28]. Zhang et al. extracted grape bunches obtained from YOLOv5-GAP using image processing techniques and identified the picking point as the position ten pixels directly above the lower extreme point defined by the centroid and upper boundary of the grapes. Although this approach had a low localization error for grape stems, it ignored the irregular postures of grape stems [29].
Summarizing the previous deep learning-based grape stem detection methods, grape stem detection still faces the following difficulties:
(1)
Numerous researchers have divided the process of grape stem detection into multiple steps, making it tedious and time-consuming.
(2)
Although numerous researchers have studied grape bunch detection with good results, the detection accuracy for grape stems is not satisfactory.
To address the difficulties mentioned above, this study proposed the YOLOv8n-GP model to simultaneously detect grape fruits and stems, which used the idea of a human posture estimation algorithm to analyze the posture of grape stems. First, the structure of the SENetV2 network was integrated into the bottom of the backbone network to improve the model’s capacity to extract features. Subsequently, CARAFE modules were employed to substitute the upsampling modules within the neck network to obtain a larger receptive field. Figure 1 depicts the workflow of this study, with the following contributions:
(1)
Acquisition of a comprehensive image dataset with multiple trellis grape varieties, including “Kyoho”, “Summer Black”, and “Shine Muscat”.
(2)
Development of the YOLOv8n-GP model by integrating the SENetV2 module and CARAFE upsampling operator into the YOLOv8n-pose keypoint detection model to simultaneously detect grapes and stems.
(3)
Comparison of performance differences between YOLOv8n-GP and YOLOv8n-pose in a complex environment with densely distributed grapes. YOLOv8n-GP achieved a precision, recall, and mAP of 91.9%, 86.5%, and 93.1% for grape stem detection, representing improvements of 4.4%, 1.3%, and 2.3% over the YOLOv8n-pose model.
(4)
Comparison of performance differences between YOLOv8n-GP and other keypoint detection methods. Considering mAP, mAP-kp, model size, and FLOPs, YOLOv8n-GP demonstrates superior detection performance compared with other keypoint detection models.

2. Materials and Methods

2.1. Image Dataset Acquisition

The dataset used in the experimental procedures of this research was collected in Pujiang County, Jinhua City, Zhejiang Province, China (north latitude: 29°28′15.17″, east longitude: 119°59′21.18″); Yuhuan County, Taizhou City, Zhejiang Province, China (north latitude: 28°06′7.25″, east longitude: 121°10′40.40″); and Lin’an District, Hangzhou City, Zhejiang Province, China (north latitude: 30°19′8.42″, east longitude: 119°45′5.50″). The data were collected in July 2022, May 2023, and August 2023, under natural light on sunny days. The dataset contained three varieties of grapes: “Kyoho”, “Summer Black”, and “Shine Muscat”.
In addition, this study used a camera (Kinect v2, Microsoft, Redmond, WA, USA) to photograph grapes from multiple angles, with images saved in .jpg format at a resolution of 1920 × 1080 pixels. Subsequently, we performed data cleansing on the dataset to remove grape images of poor quality. The final dataset comprised a total of 1180 grape images, with the distribution of images across each grape category listed in Table 1. Moreover, the grape dataset included a rich variety of sample conditions, such as lit from the front, single-string, multi-string, and fruit occlusion, as shown in Figure 2.

2.2. Image Preprocessing

In the image preprocessing stage, this study used the Labelme tool to annotate the grape bunches and the three key points on their stems in each image, as shown in Figure 3. First, we applied bounding boxes to mark grape bunches with the label “grape”. Then, the top and bottom of the stem were selected as key points, and these points were labeled “top” and “down”, respectively. In addition, we identified the middle point of the stem as the picking point, with the label “pick”.
After image annotation, this study saved the annotation file in .json format, including the location information of bounding boxes and key points as well as the path and size of the image. Subsequently, the annotation file was converted from .json format to .txt format. Finally, the grape image dataset was divided into three parts: the training (80%, 943 images), validation (10%, 118 images), and testing (10%, 119 images) sets.
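To make this conversion step concrete, the following is a minimal Python sketch of how a Labelme .json annotation could be turned into the normalized .txt label format used by YOLO-pose models. The keypoint order, the single-bunch-per-image assumption, and all paths and helper names are illustrative assumptions, not the exact script used in this study.

```python
import json
from pathlib import Path

# Assumed keypoint order; the order actually used in this study is not specified.
KEYPOINTS = ["top", "pick", "down"]

def labelme_to_yolo_pose(json_path: Path, out_dir: Path) -> None:
    """Convert one Labelme .json file into a YOLO-pose .txt label (one bunch per image)."""
    data = json.loads(json_path.read_text(encoding="utf-8"))
    img_w, img_h = data["imageWidth"], data["imageHeight"]

    # Collect the grape bounding box and the stem keypoints from the shapes list.
    box, points = None, {}
    for shape in data["shapes"]:
        if shape["label"] == "grape" and shape["shape_type"] == "rectangle":
            (x1, y1), (x2, y2) = shape["points"]
            box = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        elif shape["label"] in KEYPOINTS:
            points[shape["label"]] = shape["points"][0]

    if box is None:
        return  # nothing to write for this image

    # YOLO-pose line: class cx cy w h kpt1_x kpt1_y kpt1_v ... (all coordinates normalized)
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
    bw, bh = (x2 - x1) / img_w, (y2 - y1) / img_h
    fields = [0, cx, cy, bw, bh]  # single class: grape
    for name in KEYPOINTS:
        if name in points:
            px, py = points[name]
            fields += [px / img_w, py / img_h, 2]  # 2 = labeled and visible
        else:
            fields += [0, 0, 0]                    # 0 = not labeled

    out_file = out_dir / (json_path.stem + ".txt")
    out_file.write_text(" ".join(f"{v:.6f}" if isinstance(v, float) else str(v)
                                 for v in fields) + "\n")
```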

2.3. Methods

2.3.1. YOLOv8-Pose Model

YOLOv8 was developed by Ultralytics (Washington, DC, USA) on the basis of YOLOv5 and aims to detect objects more accurately and rapidly [30]. Compared with YOLOv5, YOLOv8 introduces the following improvements: (1) replacing the C3 modules in YOLOv5 with C2f modules; (2) removing two 1 × 1 convolution operations from the upsampling stage in the neck; (3) introducing the idea of a decoupled head and using the anchor-free method in the head [31].
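As a rough illustration of the C2f idea mentioned above, the following PyTorch sketch shows a simplified C2f block with plain convolutions; the official Ultralytics implementation additionally wraps each convolution with batch normalization and SiLU, so this is an approximation for exposition only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """Simplified residual bottleneck used inside C2f."""
    def __init__(self, c: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c, 3, padding=1)
        self.cv2 = nn.Conv2d(c, c, 3, padding=1)
        self.shortcut = shortcut

    def forward(self, x):
        y = self.cv2(F.silu(self.cv1(x)))
        return x + y if self.shortcut else y

class C2f(nn.Module):
    """C2f: split the features, run n bottlenecks, and concatenate every intermediate output."""
    def __init__(self, c_in: int, c_out: int, n: int = 1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)         # split into two branches
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1)  # fuse all branches
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.m:
            y.append(m(y[-1]))                            # keep every intermediate output
        return self.cv2(torch.cat(y, dim=1))
```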
Furthermore, YOLOv8 is equipped to handle a variety of tasks, including pose estimation, object detection, and instance segmentation [32]. YOLOv8 for pose estimation, called YOLOv8-pose, is designed by introducing a keypoint detection branch into the head of YOLOv8, allowing it to perform object detection and keypoint detection simultaneously. The architecture of YOLOv8-pose comprises three primary components: the backbone, neck, and head. The backbone is composed of an SPPF module, a series of convolutional layers, and C2f modules to extract feature maps for targets across various scales. The neck, built on the PAN-FPN network [33,34], integrates feature maps of various scales derived from the backbone to enhance feature representation capability. The head uses a decoupled structure and anchor-free detection to process the feature maps transmitted from the neck and derive detection results. Based on model size and complexity, YOLOv8-pose offers five versions: YOLOv8n-pose, YOLOv8s-pose, YOLOv8m-pose, YOLOv8l-pose, and YOLOv8x-pose. To allow subsequent deployment on mobile devices, this study chose YOLOv8n-pose, the version with the least computation and the fastest detection speed, as the baseline network.
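For reference, such a baseline can be fine-tuned with the Ultralytics Python API roughly as follows. The dataset YAML path is a placeholder, and the hyperparameter values simply mirror Table 3; this is an illustrative sketch rather than the exact training script of this study.

```python
from ultralytics import YOLO

# Load the YOLOv8n-pose pretrained weights as the baseline keypoint detector.
model = YOLO("yolov8n-pose.pt")

# Fine-tune on the grape dataset (hypothetical dataset config with kpt_shape: [3, 3]).
# Hyperparameters follow Table 3: 640 x 640 input, Adam, lr 0.001, batch 16, 200 epochs.
model.train(
    data="grape-pose.yaml",
    imgsz=640,
    epochs=200,
    batch=16,
    optimizer="Adam",
    lr0=0.001,
)

# Evaluate on the validation split and run inference on a sample image.
metrics = model.val()
results = model.predict("sample_grape.jpg", conf=0.5)
```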

2.3.2. Structure of the YOLOv8n-GP Model

Herein, we present an improved model for detecting trellis grape stems, YOLOv8n-GP, based on YOLOv8n-pose. Specifically, to substantially improve the extraction of critical feature information from grape images and thereby more effectively enhance the representation learning of the model, we incorporated the SENetV2 attention mechanism behind the SPPF module in the backbone. Then, the upsampling modules in the neck were replaced by CARAFE modules, which can accurately predict the upsampling kernel according to the input feature map, thus efficiently realizing feature reorganization. At the same time, introducing CARAFE modules can significantly reduce the feature loss during information processing, which facilitates a more comprehensive recovery and presentation of the details within the grape images. Figure 4 shows the basic architecture of YOLOv8n-GP, where the box and dots represent detection results of grape bunch and three key points on stems, respectively.

2.3.3. SENetV2 Module

Because of the environment in which trellised grapes grow, the grape stem detection task is often affected by dynamic conditions such as light variations, occlusion by branches and leaves, and fruit occlusion, which can significantly impair the overall performance of models. Attention mechanisms can improve a model's ability to extract features, enabling it to focus on important feature information and ignore complex background information [35]. The fusion of the squeeze-excitation module and dense layers in the SENetV2 module enhances the model's capacity to capture channel patterns and global knowledge, and it achieves better feature representation for grape stems than other attention mechanisms such as Squeeze-and-Excitation (SE) [36] and Efficient Channel Attention (ECA) [37]. Consequently, introducing the SENetV2 module into the backbone can improve the feature extraction capability for grape images in complex environments.
The SENetV2 attention module was proposed by Narayanan in 2023 on the basis of the SE module [38]. The SENetV2 utilizes multibranch fully connected layers to reinforce the global representation learning, resulting in a notable improvement in detection performance with only a slight increase in the parameter count.
The SENetV2 attention module integrates ideas from the SE module and the ResNeXt network. The ResNeXt network combines the multibranch strategy of the inception module with the residual module. The SE module compresses the feature map using global average pooling after the standard convolution operation, then derives channel weights through the utilization of two fully connected layers followed by a sigmoid activation function. Ultimately, the learned weights are multiplied by the input feature map to generate the weighted feature map. Figure 5 illustrates the basic architecture of the ResNeXt, SE, and SENetV2 modules.
The SENetV2 module employs a more sophisticated strategy in the squeeze operation to capture more global information. Unlike ResNeXt, which uses a cardinality of 32, SENetV2 uses a cardinality of 4. Then, SENetV2 performs excitation operations on the global features to analyze the interrelationships among channels and derive the corresponding channel weights. Compared with SENet, SENetV2 introduces a new module called SaE (Squeeze Aggregated Excitation), which adds multibranch fully connected layers to the excitation operation to enhance the global representation of the network. Figure 6 illustrates the structure of the SaE module, which allows models to learn different features of input images efficiently while taking into account the interdependencies among various channels during feature transformation.
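A minimal PyTorch sketch of an SaE-style block is given below, following the description above: squeeze by global average pooling, several parallel fully connected branches, aggregation, and sigmoid gating of the channels. The reduction ratio and the number of branches are illustrative assumptions, not values taken from this study.

```python
import torch
import torch.nn as nn

class SaE(nn.Module):
    """Squeeze Aggregated Excitation (sketch): SE with multibranch fully connected layers."""
    def __init__(self, channels: int, reduction: int = 16, branches: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pooling
        hidden = channels // reduction
        # multibranch fully connected layers (the "aggregated" part of SaE)
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
            for _ in range(branches)
        )
        # aggregate branch outputs back to per-channel weights
        self.fc_out = nn.Linear(hidden * branches, channels)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.pool(x).view(b, c)                       # (b, c) squeezed channel descriptor
        z = torch.cat([branch(s) for branch in self.branches], dim=1)
        w = self.gate(self.fc_out(z)).view(b, c, 1, 1)    # channel weights in [0, 1]
        return x * w                                      # reweight the input feature map
```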

2.3.4. CARAFE Module

The upsampling modules in YOLOv8n-pose aim to increase the resolution of the feature map using nearest neighbor interpolation for fusing feature information at multiple scales, thereby enriching the spatial information contained within the feature map [39]. However, since they solely rely on the weighting calculation of neighboring pixel points in the image while neglecting the comprehensive semantic content [40], the upsampling modules exhibit certain limitations when faced with a scenario involving a dense distribution of grapes, which contains a lot of semantic information. In contrast, the CARAFE module can perform content-aware processing for the input feature map, dynamically generate adaptive kernels, and obtain a larger receptive field [41], thus significantly reducing the feature loss and presenting complete details of grape stems in images. Meanwhile, this module has wide applicability owing to its fast computational speed and lightweight design. Therefore, this study utilized CARAFE modules to replace the original upsampling modules in the neck.
CARAFE (Content-Aware ReAssembly of Features) is a lightweight upsampling operator [42] composed of two fundamental components: the kernel prediction module and the content-aware reassembly module [43]. Figure 7 illustrates the structure of the CARAFE module. In the kernel prediction module (Figure 7a), the channels of an input feature map of size H × W × C are first reduced to Cm by a 1 × 1 convolution. Assuming an upsampling ratio of σ and an upsampling kernel size of kup × kup, the content encoder uses a convolutional layer to predict upsampling kernels of size σ² × kup² from the compressed feature map. The channel dimension is then unfolded into the spatial dimensions to obtain upsampling kernels of size σH × σW × kup². Finally, each generated upsampling kernel is normalized with the softmax function so that the sum of its weights is 1. In the content-aware reassembly module (Figure 7b), each position in the output feature map is mapped back to the input feature map, the corresponding kup × kup region is extracted, and a dot product is computed with the predicted upsampling kernel at that position. The result is a new feature map of size σH × σW × C.
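The following is a compact PyTorch sketch of the CARAFE operator as described above (kernel prediction followed by content-aware reassembly). The hyperparameters (c_mid = 64, k_up = 5, k_enc = 3) follow commonly used defaults, and the implementation is a readable approximation rather than the optimized official version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Sketch of Content-Aware ReAssembly of Features (upsampling by a factor of `scale`)."""
    def __init__(self, channels: int, c_mid: int = 64, scale: int = 2,
                 k_up: int = 5, k_enc: int = 3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # kernel prediction module: channel compressor + content encoder
        self.compress = nn.Conv2d(channels, c_mid, 1)
        self.encoder = nn.Conv2d(c_mid, scale ** 2 * k_up ** 2, k_enc, padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # 1) predict reassembly kernels: (b, s^2 * k^2, h, w) -> (b, k^2, s*h, s*w)
        kernels = self.encoder(self.compress(x))
        kernels = F.pixel_shuffle(kernels, s)
        kernels = F.softmax(kernels, dim=1)           # each k x k kernel sums to 1
        # 2) content-aware reassembly: gather k x k neighborhoods of the input
        neigh = F.unfold(x, k, padding=k // 2)        # (b, c*k^2, h*w)
        neigh = neigh.view(b, c * k * k, h, w)
        # map every output position to its source input position (nearest neighbor)
        neigh = F.interpolate(neigh, scale_factor=s, mode="nearest")
        neigh = neigh.view(b, c, k * k, s * h, s * w)
        # weighted sum of each neighborhood with its predicted kernel
        return (neigh * kernels.unsqueeze(1)).sum(dim=2)   # (b, c, s*h, s*w)

# Example: upsample a 40 x 40 feature map with 128 channels to 80 x 80.
x = torch.randn(1, 128, 40, 40)
print(CARAFE(128)(x).shape)   # torch.Size([1, 128, 80, 80])
```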

3. Results

3.1. Experimental Environment

Herein, we developed the improved model for detecting grape stems in trellis environments on a server running the Windows 11 operating system, and all experiments were conducted in the same environment. The specific experimental environment and training parameter settings are shown in Table 2 and Table 3, respectively.

3.2. Evaluation Indicators

In this study, the evaluation indicators precision, recall, mean average precision (mAP), mAP-kp, FLOPs, inference time, and model size were used to evaluate model performance. Precision, recall, mAP, and FLOPs are calculated as shown in Formulas (1)–(4).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$
$$\mathrm{mAP} = \frac{1}{n}\sum_{i=1}^{n} AP_i = \frac{1}{n}\sum_{i=1}^{n}\int_{0}^{1} P(R)\,\mathrm{d}R \tag{3}$$
$$\mathrm{FLOPs} = 2 \times H \times W \times (C_i \times K^2 + 1) \times C_o \tag{4}$$
In Formulas (1)–(4), TP denotes the number of grapes predicted correctly by the model; FP denotes the number of other objects incorrectly predicted as grapes; FN denotes the number of grapes incorrectly predicted as other objects; Ci and Co represent the number of input channels and output channels, respectively; H and W represent the height and width of the output feature map; and K represents the convolution kernel size.
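As a concrete illustration of Formulas (1)–(3), the following sketch computes precision, recall, and a single-class AP from a ranked list of detections; the IoU matching is assumed to have been done beforehand, the AP integration is a simplified non-interpolated version, and the toy numbers are made up for demonstration.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Formulas (1) and (2)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def average_precision(scores, is_tp, n_gt):
    """Single-class AP: area under the precision-recall curve (Formula (3) with n = 1)."""
    order = np.argsort(-scores)                 # sort detections by confidence
    tp_cum = np.cumsum(is_tp[order])
    fp_cum = np.cumsum(~is_tp[order])
    recall = tp_cum / n_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # trapezoidal integration of P(R) dR (a simplified, non-interpolated AP)
    return float(np.sum((recall[1:] - recall[:-1]) * (precision[1:] + precision[:-1]) / 2))

# Toy example: 5 detections of grape bunches, 4 ground-truth bunches in total.
scores = np.array([0.95, 0.90, 0.80, 0.60, 0.40])
is_tp = np.array([True, True, False, True, False])
print(precision_recall(tp=3, fp=2, fn=1))            # (0.6, 0.75)
print(round(average_precision(scores, is_tp, 4), 3))
```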

3.3. Analysis of Experimental Results

3.3.1. Training Results

To validate the effectiveness of the improved grape stem detection model, we evaluated the detection performance of YOLOv8n-pose and YOLOv8n-GP by analyzing the variation of loss and accuracy during the training process in the same experimental environment.
As illustrated in Figure 8a,b, the loss values of the detection box and key point of YOLOv8n-pose and YOLOv8n-GP decreased significantly during the initial 50 epochs, followed by a gradual convergence, indicating that the detection models have a better training effect and tend toward a convergence state. After 160 epochs, the loss values of the detection box and key point of YOLOv8n-GP stabilized around 0.5 and 0.1, respectively, which were lower than those of the original YOLOv8n-pose model. This demonstrates that YOLOv8n-GP has a better convergence effect compared with YOLOv8n-pose. Meanwhile, from Figure 8c,d, it is evident that the detection accuracy and keypoint detection accuracy of YOLOv8n-GP remained consistent at around 0.97 and 0.95, respectively, both of which surpassed the performance of YOLOv8n-pose. This demonstrates that YOLOv8n-GP has higher accuracy when facing the task of grape stem detection in a trellis environment. In summary, YOLOv8n-GP exhibited superior performance compared to the original YOLOv8n-pose model during the training process.

3.3.2. Performance of the Model before and after Improvement

To more intuitively reflect the superiority of YOLOv8n-GP for the task of grape stem detection, this study compared the model performance before and after the improvement in terms of precision, recall, detection accuracy, and keypoint detection accuracy. The specific experimental results are shown in Table 4.
YOLOv8n-GP outperformed YOLOv8n-pose in all evaluation indicators except for inference time (Table 4). Specifically, the precision, recall, mAP, and mAP-kp of YOLOv8n-GP reached 91.6%, 91.3%, 97.1%, and 95.4%, respectively, which represent improvements of 3.7%, 3.6%, 4.6%, and 4.0% over YOLOv8n-pose. Moreover, YOLOv8n-GP achieved an inference time of 7.3 ms. Although the inference time of YOLOv8n-GP increased by 16.4% compared with YOLOv8n-pose, it still met the demand for real-time grape stem detection in actual agricultural production environments. Considering all of the above evaluation indicators, YOLOv8n-GP is the optimal choice for detecting grape stems.
In addition, this study used the above two models to detect grapes and their fruit stems, with detection results illustrated in Figure 9. Although the improved model suffered from missed detections, the YOLOv8n-GP model had better detection confidence than YOLOv8n-pose for all grapes. Moreover, in terms of keypoint detection accuracy, YOLOv8n-GP was more accurate in locating grape stems.

3.3.3. Effect of Applying Different Attention Mechanisms

The implementation of an attention mechanism can improve the model’s ability to capture critical features, thereby improving its detection performance. To confirm the superiority of the SENetV2 module in the grape stem detection task, various attention mechanism modules were introduced into the backbone network of YOLOv8n-pose, including ECA, SE, and CBAM (Convolutional Block Attention Module). The experimental results are shown in Table 5.
Introducing attention modules into the backbone can effectively improve the detection accuracy of grapes and their stems (Table 5). Among them, the improved model with the SENetV2 attention module demonstrated superior detection performance. Its recall, mAP, and mAP-kp reached 91.1%, 96.6%, and 95.1%, respectively, representing improvements of 3.4%, 4.1%, and 3.7% over YOLOv8n-pose. Compared to incorporating the ECA, CBAM, and SE attention modules, the SENetV2-based model improved the recall of detection boxes by 1.7%, 5.1%, and 1.7%, the mAP of detection boxes by 0.7%, 0.4%, and 0.6%, and the mAP of keypoint detection by 0.3%, 0.2%, and 0.1%, respectively. In summary, introducing the SENetV2 attention module into the backbone network of the YOLOv8n-pose model yields the largest improvement in detection performance among the attention modules tested.

3.3.4. Ablation Experiments

To further demonstrate the contribution of the SENetV2 attention module and the CARAFE upsampling operator to the YOLOv8n-GP model, a total of four groups of ablation experiments were performed in this study. Table 6 illustrates the experimental results.
In order to validate the contribution of the SENetV2 module to YOLOv8n-GP, this study embedded the SENetV2 module into the backbone. The resulting SENetV2-only model (Model B in Table 6) achieved a precision, recall, mAP, and mAP-kp of 90.1%, 91.1%, 96.6%, and 95.1%, representing improvements of 2.2%, 3.4%, 4.1%, and 3.7% over YOLOv8n-pose, respectively. The experimental results demonstrate that the introduction of the SENetV2 module can enhance the model's ability to handle complex spatial features, thereby improving its detection performance.
To demonstrate the effectiveness of CARAFE modules, we utilized CARAFE modules to replace the upsampling modules in the neck network of YOLOv8n-pose. As illustrated in Table 6, although the CARAFE-based detection model showed no change in recall, it improved precision, mAP, and mAP-kp by 7.3%, 3.7%, and 4.0%, respectively. These experimental results demonstrate that CARAFE modules can guarantee the detection performance of the model while consuming minimal computational resources.
The results of the ablation experiments demonstrate that combining the SENetV2 module with the CARAFE module can achieve the optimal performance of the grape stem detection model. YOLOv8n-GP improved precision, recall, mAP, and mAP-kp by 3.7%, 3.6%, 4.6%, and 4.1%, respectively. Figure 10 illustrates the variation curves of the loss values of the detection boxes and key points during the training process for each set of ablation experiments. As illustrated in Figure 10, the YOLOv8n-GP model exhibited the lowest loss values for detection boxes and key points upon achieving a convergence state during the training process, indicating its superior convergence effect. This indicates that the detection performance of YOLOv8n-GP surpasses that of the other models. In summary, combining the SENetV2 and CARAFE modules to optimize the structure of YOLOv8n-pose for trellis grape stem detection is effective and feasible.

3.3.5. Effect of Densely Distributed Grape Bunches on Model Performance

In real-world trellis environments, grape bunches can exhibit high density in relation to one another, potentially affecting detection accuracy. To verify the robustness and stability of YOLOv8n-GP, we used YOLOv8n-pose and YOLOv8n-GP to detect grapes and stems in dense environments, respectively.
The detection accuracy of YOLOv8n-pose and YOLOv8n-GP was compromised when faced with more densely distributed grape bunches (Table 7). Among them, the precision, recall, and mAP of the YOLOv8n-GP keypoint detection model reached 91.9%, 86.5%, and 93.1%, respectively, representing improvements of 4.4%, 1.3%, and 2.3% over YOLOv8n-pose. Furthermore, this study conducted a comparative analysis of the detection results of YOLOv8n-pose and YOLOv8n-GP in dense environments, as shown in Figure 11.
As illustrated in Figure 11, YOLOv8n-GP outperformed YOLOv8n-pose when confronted with a denser distribution of grape bunches. First, in terms of the effectiveness of detection boxes, although both YOLOv8n-GP and YOLOv8n-pose can accurately detect grape bunches, the confidence of YOLOv8n-GP detection boxes was higher than those of YOLOv8n-pose. Second, in terms of the accuracy of keypoint detection, although YOLOv8n-GP had a certain degree of error for the grape stem detection, the error rate was much lower than that of YOLOv8n-pose. The above experimental results demonstrated the robustness and stability of YOLOv8n-GP when facing the task of detecting grape stems with dense distribution.

3.3.6. Comparison of the Performances of the Different Keypoint Detection Models

To demonstrate the advantages of the YOLOv8n-GP model for grape and stem detection tasks, this study compared the YOLOv8n-GP model with mainstream keypoint detection models such as YOLOv8n-pose, YOLOv5-pose, YOLOv7-pose, and YOLOv7-Tiny-pose from two aspects.
First, this study compared and analyzed the effectiveness of various models for grape bunch detection, as illustrated in Table 8.
YOLOv8n-GP outperformed the other keypoint detection models in grape bunch detection with an mAP of 97.1%, an increase of 5.1%, 4.4%, 4.1%, and 4.6% over YOLOv5-pose, YOLOv7-Tiny-pose, YOLOv7-pose, and YOLOv8n-pose, respectively (Table 8). The recall of YOLOv8n-GP improved by 1.3%, 1.4%, 6.9%, and 3.6% compared with YOLOv5-pose, YOLOv7-Tiny-pose, YOLOv7-pose, and YOLOv8n-pose. In terms of precision, YOLOv8n-GP improved by 10.8%, 7.7%, and 3.7% over YOLOv5-pose, YOLOv7-Tiny-pose, and YOLOv8n-pose, although the precision of YOLOv7-pose was higher than that of YOLOv8n-GP. Nevertheless, considering precision, recall, and mAP together, YOLOv8n-GP is the optimal choice for grape bunch detection. Second, this study also compared and analyzed the effectiveness of different models for grape stem detection, as illustrated in Table 9.
YOLOv8n-GP improved the detection accuracy of key points on grape stems while consuming minimal computational effort (Table 9). The mAP-kp of YOLOv8n-GP reached 95.4%, representing improvements of 4.6%, 3.9%, 3.7%, and 4.0% over YOLOv5-pose, YOLOv7-Tiny-pose, YOLOv7-pose, and YOLOv8n-pose, respectively. In addition, the FLOPs and model size of YOLOv8n-GP are about 1/10 and 1/25 of those of YOLOv7-pose and 1/2 and 1/9 of those of YOLOv5-pose, greatly facilitating the deployment and application of the model.
Meanwhile, to further demonstrate the usefulness of YOLOv8n-GP, we presented the recognition results of various keypoint detection models for grapes and stems, as illustrated in Figure 12. From Figure 12, it is evident that YOLOv8n-GP outperformed the other keypoint detection models in the task of grape stem detection. YOLOv8n-pose and YOLOv5-pose performed poorly in detecting key points on grape stems, and the points detected by them deviated more from the real ones. Moreover, YOLOv7-pose and YOLOv7-Tiny-pose also exhibited some degree of error in grape stem detection. By contrast, YOLOv8n-GP accurately localized key points on grape stems with minimal error.

3.3.7. SOTA Experiments

In this study, we compared YOLOv8n-GP with three keypoint detection models proposed in other studies: YOLOv5-tea [44], Improved YOLOv8-pose [45], and YOLOv8-GP [15]. All experiments were performed on the same dataset and in the same experimental environment. YOLOv8n-GP proved to be effective and feasible for detecting grape stems in trellis environments (Table 10). The mAP of YOLOv8n-GP improved by 5.1%, 1.3%, and 0.4% compared with YOLOv5-tea, Improved YOLOv8-pose, and YOLOv8-GP, respectively. Moreover, in terms of mAP-kp, YOLOv8n-GP improved by 3.5%, 0.2%, and 0.6% over YOLOv5-tea, Improved YOLOv8-pose, and YOLOv8-GP. Overall, YOLOv8n-GP outperformed these existing keypoint detection models and is a feasible choice for grape detection and stem localization.

4. Discussion

Grapes, as a crucial cash crop, play a significant role in the rapid advancement of the agricultural economy in China. Harvesting fruits is an important part of the grape production process. At present, grape harvesting predominantly relies on manual labor, which presents several drawbacks, including low efficiency, high costs, and considerable labor demands. In addition, unlike fruits such as apples and citrus, grapes are soft and their stems are difficult to break, so it is not feasible to efficiently complete the harvesting task by identifying only the location of the fruit. Therefore, detecting grape stems accurately is the key to achieving intelligent grape harvesting. In this study, YOLOv8n-GP was developed for identifying grape stems. Specifically, the structure of the SENetV2 network was introduced into the backbone. Then, CARAFE modules were employed to substitute the upsampling modules within the neck network. The experimental results demonstrated that the mAP and mAP-kp of YOLOv8n-GP reached 97.1% and 95.4%, representing improvements of 4.6% and 4.0% over the YOLOv8n-pose model. Moreover, YOLOv8n-GP can detect grape stems in real time in real agricultural production environments, with an inference time of 7.3 ms.
In recent years, the evolution of deep learning technology, coupled with advances in agricultural informatization, has led numerous researchers to employ deep-learning models for detecting grape stems. Wu et al. proposed a lightweight approach combining YOLOv5m with Ghost-HRNet for detecting grape stems [27]. Although this approach exhibited superior detection performance for grape stems, its inference time and model size reached 129.8 ms and 59.4 MB because of the two-stage detection approach, almost 18 times slower and 9 times larger than the YOLOv8n-GP model. YOLOv5-CFD, designed by Zhu et al. [28], achieved 88.7% stem detection accuracy, but its inference time was about 7 times longer than that of YOLOv8n-GP because of its complex localization strategy for grape picking points. Ning et al. utilized an improved Mask R-CNN model to detect grape stems with a detection accuracy of 88% and a localization time of 4.9 s [25]. Compared with these approaches, YOLOv8n-GP achieved higher detection accuracy and speed for grape stems and showed strong robustness in complex environments with dense grape distribution, as it detects grapes and stems in one step and eliminates the tedious process of picking point localization, reducing detection time while maintaining detection accuracy. In addition, YOLOv8n-GP exhibited significant practical utility for real-time grape stem detection and can readily be implemented on embedded devices. This is attributed to its FLOPs of 8.8 G, which aligns with the computational requirements of most approaches designed for real-time detection tasks, typically ranging from 6 G to 13 G. Configurations of low-power embedded devices, such as the Raspberry Pi-4B (Raspberry Pi Foundation, Cambridge, UK), which offers a maximum of 13.5 GFLOPs with 4 GB of RAM, support this claim.
Although YOLOv8n-GP can rapidly and accurately detect trellis grape stems, this study has some limitations. For instance, the variety of grapes in the dataset was insufficient. Therefore, the variety of grape samples should be further extended in future work to make the model more applicable to real agricultural production environments. Because of the limited dataset, the detection performance of YOLOv8n-GP on grape bunches and stems that overlap with one another or are occluded by leaves and branches has not been verified. To address this issue, we plan to employ the Kinect V2 camera to collect images from multiple angles, thereby building a more comprehensive dataset covering these special scenarios. We will then use cutting-edge computer vision approaches to further optimize the network structure in future work, improving its capacity to distinguish between grapes, stems, and other obstacles. Moreover, we intend to implement YOLOv8n-GP on an embedded device (Raspberry Pi-4B 64bit) and integrate it with other hardware and software to form a comprehensive automated grape harvesting system in our forthcoming study, as illustrated in Table 11.

5. Conclusions

Aiming at challenges associated with low efficiency, elevated cost, and considerable labor demands in the process of manual grape picking, this study took “Kyoho”, “Summer Black”, and “Shine Muscat” in a trellis environment as research objects, and developed YOLOv8n-GP for identifying grape stems. Specifically, we incorporated the SENetV2 module into the backbone and employed CARAFE modules to substitute the original upsampling module. Experimental results demonstrated that YOLOv8n-GP achieved precision, recall, mAP, and mAP-kp of 91.6%, 91.3%, 97.1%, and 95.4% on the homemade grape dataset, representing an improvement of 3.7%, 3.6%, 4.6%, and 4.0%, respectively, over YOLOv8n-pose. In addition, compared with mainstream keypoint detection models such as YOLOv5-pose, YOLOv7-pose, and YOLOv7-Tiny-pose, YOLOv8n-GP exhibited superior performance in the detection accuracy of target boxes and key points and efficiently and accurately realized the detection and localization of grape stems in trellis environments. Overall, the YOLOv8n-GP model can provide a foundational technological framework for the subsequent implementation of automated grape harvesting.

Author Contributions

Conceptualization, T.J.; methodology, T.J. and Y.L.; validation, T.J.; formal analysis, Y.L.; investigation, J.W. and W.S.; resources, H.F.; data curation, Y.R.; writing—original draft preparation, T.J.; writing—review and editing, T.J. and Y.L.; visualization, Y.L.; project administration, H.F.; funding acquisition, H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Zhejiang Province (LGF20F020002); the Key R&D Projects of Zhejiang Province (2022C02009, 2022C02044, 2022C02020); the Three Agricultural Nine-Party Science and Technology Collaboration Projects of Zhejiang Province (2022SNJF036); and the Research Development Foundation of Zhejiang A&F University (2019RF065).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Y.; Xiao, J.; Yan, Y.; Liu, W.; Cui, P.; Xu, C.; Nan, L.; Liu, X. Multivariate Analysis and Optimization of the Relationship between Soil Nutrients and Berry Quality of Vitis vinifera cv. Cabernet Franc Vineyards in the Eastern Foothills of the Helan Mountains, China. Horticulturae 2024, 10, 61. [Google Scholar] [CrossRef]
  2. Li, W.; Liu, C.; Yang, Q.; You, Y.; Zhuo, Z.; Zuo, X. Factors Influencing Farmers’ Vertical Collaboration in the Agri-Chain Guided by Leading Enterprises: A Study of the Table Grape Industry in China. Agriculture 2023, 13, 1915. [Google Scholar] [CrossRef]
  3. Zhao, J.; Yao, X.; Wang, Y.; Yi, Z.; Xie, Y.; Zhou, X. Lightweight-Improved YOLOv5s Model for Grape Fruit and Stem Recognition. Agriculture 2024, 14, 774. [Google Scholar] [CrossRef]
  4. Coll-Ribes, G.; Torres-Rodríguez, I.J.; Grau, A.; Guerra, E.; Sanfeliu, A. Accurate detection and depth estimation of table grapes and peduncles for robot harvesting, combining monocular depth estimation and CNN methods. Comput. Electron. Agric. 2023, 215, 108362. [Google Scholar] [CrossRef]
  5. Chen, Z.; Wang, Y.; Tong, S.; Chen, C.; Kang, F. Grapevine Branch Recognition and Pruning Point Localization Technology Based on Image Processing. Appl. Sci. 2024, 14, 3327. [Google Scholar] [CrossRef]
  6. Liu, B.; Zhang, Y.; Wang, J.; Luo, L.; Lu, Q.; Wei, H.; Zhu, W. An improved lightweight network based on deep learning for grape recognition in unstructured environments. Inf. Process. Agric. 2024, 11, 202–216. [Google Scholar] [CrossRef]
  7. Xu, Z.; Liu, J.; Wang, J.; Cai, L.; Jin, Y.; Zhao, S.; Xie, B. Realtime Picking Point Decision Algorithm of Trellis Grape for High-Speed Robotic Cut-and-Catch Harvesting. Agronomy 2023, 13, 1618. [Google Scholar] [CrossRef]
  8. Wang, W.; Shi, Y.; Liu, W.; Che, Z. An Unstructured Orchard Grape Detection Method Utilizing YOLOv5s. Agriculture 2024, 14, 262. [Google Scholar] [CrossRef]
  9. Bhargava, A.; Bansal, A. Fruits and vegetables quality evaluation using computer vision: A review. J. King Saud Univ.-Comput. Inf. Sci. 2021, 33, 243–257. [Google Scholar] [CrossRef]
  10. Zhang, K.; Zhao, L.; Sun, Z.; Geng, C.; Li, W. Design and target extraction of intelligent grape bagging robot. Nongye Jixie Xuebao/Trans. Chin. Soc. Agric. Mach. 2013, 44, 240–246. [Google Scholar] [CrossRef]
  11. Luo, L.; Tang, Y.; Zou, X.; Ye, M.; Feng, W.; Li, G. Vision-based extraction of spatial information in grape clusters for harvesting robots. Biosyst. Eng. 2016, 151, 90–104. [Google Scholar] [CrossRef]
  12. Luo, L.; Tang, Y.; Lu, Q.; Chen, X.; Zhang, P.; Zou, X. A vision methodology for harvesting robot to detect cutting points on peduncles of double overlapping grape clusters in a vineyard. Comput. Ind. 2018, 99, 130–139. [Google Scholar] [CrossRef]
  13. Jin, Y.; Yu, C.; Yin, J.; Yang, S.X. Detection method for table grape ears and stems based on a far-close-range combined vision system and hand-eye-coordinated picking test. Comput. Electron. Agric. 2022, 202, 107364. [Google Scholar] [CrossRef]
  14. Zhu, Y.; Zhang, T.; Liu, L.; Liu, P.; Li, X. Fast Location of Table Grapes Picking Point Based on Infrared Tube. Inventions 2022, 7, 27. [Google Scholar] [CrossRef]
  15. Chen, J.; Ma, A.; Huang, L.; Li, H.; Zhang, H.; Huang, Y.; Zhu, T. Efficient and lightweight grape and picking point synchronous detection model based on key point detection. Comput. Electron. Agric. 2024, 217, 108612. [Google Scholar] [CrossRef]
  16. Xiao, X.; Wang, Y.; Zhou, B.; Jiang, Y. Flexible Hand Claw Picking Method for Citrus-Picking Robot Based on Target Fruit Recognition. Agriculture 2024, 14, 1227. [Google Scholar] [CrossRef]
  17. Li, T.; Fang, W.; Zhao, G.; Gao, F.; Wu, Z.; Li, R.; Fu, L.; Dhupia, J. An improved binocular localization method for apple based on fruit detection using deep learning. Inf. Process. Agric. 2023, 10, 276–287. [Google Scholar] [CrossRef]
  18. Zhang, H.; Li, X.; Wang, L.; Liu, D.; Wang, S. Construction and Optimization of a Collaborative Harvesting System for Multiple Robotic Arms and an End-Picker in a Trellised Pear Orchard Environment. Agronomy 2024, 14, 80. [Google Scholar] [CrossRef]
  19. Sun, J.; Sun, Y.; Zhao, R.; Ji, Y.; Zhang, M.; Li, H. Tomato Recognition Method Based on Iterative Random Circle and Geometric Morphology. Nongye Jixie Xuebao/Trans. Chin. Soc. Agric. Mach. 2019, 50, 22–26, 61. [Google Scholar] [CrossRef]
  20. Zhou, H.; Wang, X.; Au, W.; Kang, H.; Chen, C. Intelligent robots for fruit harvesting: Recent developments and future challenges. Precis. Agric. 2022, 23, 1856–1907. [Google Scholar] [CrossRef]
  21. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  24. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  25. Ning, Z.; Luo, L.; Liao, J.; Wen, H.; Wei, H.; Lu, Q. Recognition and the optimal picking point location of grape stems based on deep learning. Nongye Gongcheng Xuebao/Trans. Chin. Soc. Agric. Eng. 2021, 37, 222–229. [Google Scholar] [CrossRef]
  26. Wang, J.; Zhang, Z.; Luo, L.; Wei, H.; Wang, W.; Chen, M.; Luo, S. DualSeg: Fusing transformer and CNN structure for image segmentation in complex vineyard environment. Comput. Electron. Agric. 2023, 206, 107682. [Google Scholar] [CrossRef]
  27. Wu, Z.; Xia, F.; Zhou, S.; Xu, D. A method for identifying grape stems using keypoints. Comput. Electron. Agric. 2023, 209, 107825. [Google Scholar] [CrossRef]
  28. Zhu, Y.; Li, S.; Du, W.; Du, Y.; Liu, P.; Li, X. Identification of table grapes in the natural environment based on an improved Yolov5 and localization of picking points. Precis. Agric. 2023, 24, 1333–1354. [Google Scholar] [CrossRef]
  29. Zhang, T.; Wu, F.; Wang, M.; Chen, Z.; Li, L.; Zou, X. Grape-Bunch Identification and Location of Picking Points on Occluded Fruit Axis Based on YOLOv5-GAP. Horticulturae 2023, 9, 498. [Google Scholar] [CrossRef]
  30. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  31. Yang, W.; Wu, J.; Zhang, J.; Gao, K.; Du, R.; Wu, Z.; Firkat, E.; Li, D. Deformable convolution and coordinate attention for fast cattle detection. Comput. Electron. Agric. 2023, 211, 108006. [Google Scholar] [CrossRef]
  32. Chen, J.; Ji, C.; Zhang, J.; Feng, Q.; Li, Y.; Ma, B. A method for multi-target segmentation of bud-stage apple trees based on improved YOLOv8. Comput. Electron. Agric. 2024, 220, 108876. [Google Scholar] [CrossRef]
  33. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  34. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  35. Yin, J.; Huang, P.; Xiao, D.; Zhang, B. A Lightweight Rice Pest Detection Algorithm Using Improved Attention Mechanism and YOLOv8. Agriculture 2024, 14, 1052. [Google Scholar] [CrossRef]
  36. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  37. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  38. Narayanan, M. SENetV2: Aggregated dense layer for channelwise and global representations. arXiv 2023, arXiv:2311.10807. [Google Scholar]
  39. Li, A.; Sun, S.; Zhang, Z.; Feng, M.; Wu, C.; Li, W. A Multi-Scale Traffic Object Detection Algorithm for Road Scenes Based on Improved YOLOv5. Electronics 2023, 12, 878. [Google Scholar] [CrossRef]
  40. Zeng, W.; He, M. Rice disease segmentation method based on CBAM-CARAFE-DeepLabv3+. Crop Prot. 2024, 180, 106665. [Google Scholar] [CrossRef]
  41. Li, T.; Sun, M.; He, Q.; Zhang, G.; Shi, G.; Ding, X.; Lin, S. Tomato recognition and location algorithm based on improved YOLOv5. Comput. Electron. Agric. 2023, 208, 107759. [Google Scholar] [CrossRef]
  42. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  43. Zhang, T.; Zhou, J.; Liu, W.; Yue, R.; Yao, M.; Shi, J.; Hu, J. Seedling-YOLO: High-Efficiency Target Detection Algorithm for Field Broccoli Seedling Transplanting Quality Based on YOLOv7-Tiny. Agronomy 2024, 14, 931. [Google Scholar] [CrossRef]
  44. Shuai, L.; Mu, J.; Jiang, X.; Chen, P.; Zhang, B.; Li, H.; Wang, Y.; Li, Z. An improved YOLOv5-based method for multi-species tea shoot detection and picking point location in complex backgrounds. Biosyst. Eng. 2023, 231, 117–132. [Google Scholar] [CrossRef]
  45. Liu, M.; Chu, Z.; Cui, M.; Yang, Q.; Wang, J.; Yang, H. Red Ripe Strawberry Recognition and Stem Detection Based on Improved YOLO v8-Pose. Nongye Jixie Xuebao/Trans. Chin. Soc. Agric. Mach. 2023, 54, 244–251. [Google Scholar]
Figure 1. Flow chart of this research.
Figure 2. Examples of grape images: (a) lit from the front; (b) backlit; (c) single-string; (d) multi-string; (e) fruit occlusion; (f) branches and leaves occlusion.
Figure 3. Example of the image annotation.
Figure 4. Architecture of YOLOv8n-GP.
Figure 5. Comparison of different attention mechanism modules.
Figure 6. Structure of the SaE module.
Figure 7. Structure of the CARAFE module.
Figure 8. Training loss and accuracy variation curves: (a) Box_loss; (b) Pose_loss; (c) Box_accuracy; (d) Pose_accuracy.
Figure 9. Model detection results. (a) Original images; (b) YOLOv8n-pose; (c) YOLOv8n-GP.
Figure 10. Loss variation results of ablation experiments. (a) Box_loss; (b) Pose_loss.
Figure 11. Detection results in dense environments. (a) Original images; (b) YOLOv8n-pose; (c) YOLOv8n-GP.
Figure 12. Detection effect of different models.
Table 1. Details of the grape dataset.

Variety         Number
Kyoho           569
Summer Black    415
Shine Muscat    196
Total           1180
Table 2. Configuration of experimental environment.

Environment Configuration    Version
Operating system             Windows 11
CPU                          13th Gen Intel(R) Core(TM) i5-13600KF 3.50 GHz
GPU                          NVIDIA GeForce RTX 4070 Ti
GPU memory                   12 GB
CUDA                         11.1
Python                       3.8.0
PyTorch                      1.9.0
Table 3. Parameter settings of the model.

Training Parameter    Value
Input shape           640 × 640
Optimizer             Adam
Learning rate         0.001
Batch size            16
Epochs                200
Table 4. Comparison of detection performance of YOLOv8n-pose and YOLOv8n-GP.

Model           Precision (Box)    Recall (Box)    mAP (Box)    mAP-kp (Point)    Inference Time (ms)
YOLOv8n-pose    87.9%              87.7%           92.5%        91.4%             6.1
YOLOv8n-GP      91.6%              91.3%           97.1%        95.4%             7.3
Table 5. Effect of different attention mechanism modules on the performance of the model.

Model           Recall (Box)    mAP (Box)    mAP-kp (Point)
YOLOv8n-pose    87.7%           92.5%        91.4%
+ECA            89.4%           95.9%        94.8%
+CBAM           86.0%           96.2%        94.9%
+SE             89.4%           96.0%        95.0%
+SENetV2        91.1%           96.6%        95.1%
Table 6. Results of ablation experiments.

Model    SENetV2    CARAFE    Precision (Box)    Recall (Box)    mAP (Box)    mAP-kp (Point)
A        ×          ×         87.9%              87.7%           92.5%        91.4%
B        √          ×         90.1%              91.1%           96.6%        95.1%
C        ×          √         95.2%              87.7%           96.2%        95.4%
D        √          √         91.6%              91.3%           97.1%        95.5%
Table 7. Effect of densely distributed grape bunches on model performance.

Model           Precision (Point)    Recall (Point)    mAP-kp (Point)
YOLOv8n-pose    87.5%                85.2%             90.8%
YOLOv8n-GP      91.9%                86.5%             93.1%
Table 8. Comparison of the performance of different models for grape bunch detection.

Model               Precision    Recall    mAP
YOLOv5-pose         80.9%        90.0%     92.0%
YOLOv7-Tiny-pose    83.9%        89.9%     92.7%
YOLOv7-pose         93.7%        84.4%     93.0%
YOLOv8n-pose        87.9%        87.7%     92.5%
YOLOv8n-GP          91.6%        91.3%     97.1%
Table 9. Comparison of the performance of different models for grape stem detection.

Model               mAP-kp    FLOPs (G)    Model Size (MB)
YOLOv5-pose         90.8%     16.4         54.5
YOLOv7-Tiny-pose    91.5%     19.7         18.6
YOLOv7-pose         91.7%     100.8        153.0
YOLOv8n-pose        91.4%     8.3          6.1
YOLOv8n-GP          95.4%     8.8          6.4
Table 10. Performance of different keypoint detection models proposed by other studies.

Source               Research Object    Method                  mAP (Box)    mAP-kp (Point)
Shuai et al. [44]    tea bud            YOLOv5-tea              92.0%        91.9%
Liu et al. [45]      strawberry         Improved YOLOv8-pose    95.8%        95.2%
Chen et al. [15]     grape              YOLOv8-GP               96.7%        94.8%
Ours                 grape              YOLOv8n-GP              97.1%        95.4%
Table 11. Hardware and software scheme for an automated grape harvesting system.

Device                      Designation
Embedded system             Raspberry Pi-4B 64bit
Image acquisition           Kinect V2 camera
Navigator                   LIDAR (DJI, Shenzhen, China)
Transportation equipment    Unmanned vehicles
Picking equipment           Robotic arms; ring scissor; lifting platform
Software system             YOLOv8n-GP algorithm; data analytics platforms