Article

Instance Segmentation of Lentinus edodes Images Based on YOLOv5seg-BotNet

1 College of Information Technology, Jilin Agricultural University, Changchun 130118, China
2 Faculty of Agronomy, Jilin Agricultural University, Changchun 130118, China
3 Department of Biology, University of British Columbia, Okanagan, Kelowna, BC V1V 1V7, Canada
* Authors to whom correspondence should be addressed.
Agronomy 2024, 14(8), 1808; https://doi.org/10.3390/agronomy14081808
Submission received: 22 July 2024 / Revised: 9 August 2024 / Accepted: 15 August 2024 / Published: 16 August 2024

Abstract

The shape and quantity of Lentinus edodes (commonly known as shiitake) fruiting bodies significantly affect their quality and yield. Accurate and rapid segmentation of these fruiting bodies is therefore crucial for quality grading and yield prediction. This study proposed YOLOv5seg-BotNet, a model for the instance segmentation of Lentinus edodes, and investigated its application in the mushroom industry. First, the backbone network was replaced with BoTNet, substituting global self-attention modules for the spatial convolutions in the later backbone stages to enhance feature extraction. Subsequently, PANet was adopted to effectively manage and integrate Lentinus edodes images in complex backgrounds at various scales. Finally, the Varifocal Loss function was employed to adjust the weights of different samples, addressing missed segmentation and mis-segmentation. The enhanced model achieved precision, recall, Mask_AP, F1-Score, and FPS of 97.58%, 95.74%, 95.90%, 96.65%, and 32.86 frames per second, respectively, representing gains of 2.37, 4.55, 4.56, and 3.50 percentage points and 2.61 frames per second over the original model. The model achieved dual improvements in segmentation accuracy and speed, exhibiting excellent detection and segmentation performance on Lentinus edodes fruiting bodies. This study provides a technical foundation for applying image-based detection and decision-making to mushroom production, including quality grading and intelligent harvesting.

1. Introduction

Lentinus edodes, commonly known as shiitake mushrooms, are highly nutritious and possess medicinal properties [1,2], making them the most widely produced edible fungus in China [3]. The shape and quantity of their fruiting bodies are essential to quality and yield, and assessing them efficiently remains a significant challenge for large-scale production [4]. In complex scenarios, such as occlusion between fruiting bodies, efficient and precise segmentation of Lentinus edodes is crucial for quality grading and yield estimation. Instance segmentation enables the simultaneous detection of multiple fruiting bodies together with their precise localization and segmentation.
Traditional image-segmentation algorithms, including thresholding [5], edge detection [6], and region-growing methods [7], have been widely used. For example, Chen et al. [8] applied a multi-threshold Otsu criterion with an improved moth–flame optimization (IMFO) algorithm for fruit recognition and localization, particularly in challenging nighttime environments. Sun et al. [9] focused on clustered tomatoes, achieving 85.1% accuracy using the Canny edge detection algorithm. Ji et al. [10] employed region growing and color features together with a support vector machine (SVM), achieving an 89% success rate for apple recognition with an average recognition time of 352 ms. However, these algorithms often rely on manual feature extraction and rule design, focus primarily on local features, and lack a comprehensive understanding of the image context. These limitations reduce segmentation accuracy and model generalization.
Smart agriculture incorporates advanced technologies into existing farming practices to increase production efficiency, improve the quality of agricultural products, reduce production costs, and shrink the environmental footprint [11]. Deep learning-based image segmentation algorithms are widely used in agriculture [12,13,14]. Shen et al. [15] proposed an improved Mask R-CNN-based algorithm for segmenting unsound wheat grains, achieving 86% accuracy and 91% recall and laying a foundation for wheat grading. Wang et al. [16] used a Mask R-CNN network with ResNeSt and FPN as the backbone, adding convolutional layers and a Dual Attention Network (DANet) to improve bounding box detection and instance segmentation accuracy for apples; this approach achieved a recall rate of 97.4%, a mask mAP of 92.0%, and a speed of 0.27 frames per second. Yu et al. [17] used Mask R-CNN to generate mask images of mature strawberries and then applied a visual localization approach to determine picking points; on 100 test images, fruit detection reached an average accuracy of 95.78% and a recall of 95.41%, with an average intersection-over-union of 89.85% for instance segmentation. Although deep learning segmentation algorithms offer high accuracy and strong generalization, two-stage segmentation models, implemented as two independent stages, achieve excellent accuracy [18] at the cost of slower processing speeds.
Single-stage object detection models such as YOLO excel in both accuracy and speed [19,20,21], and instance segmentation models built on YOLO networks have shown promise. Zhu et al. [22] introduced SwinT-YOLACT, a YOLACT model with a Swin-Transformer backbone, achieving precise segmentation and identification of corn kernels and tassels with a mean mask average precision of 79.43%. Another study applied YOLOv8-seg models to autonomous weeders, enabling recognition and segmentation of uncut weeds and obstacles within rows; the YOLOv8s-seg model surpassed YOLOv5n-seg, YOLOv5s-seg, and YOLOv8n-seg in mAP by 16.1%, 7.0%, and 2.1%, respectively. Lawal [23] proposed a lightweight YOLOv5-LiNet model for fruit instance segmentation, outperforming ShuffleNetv2 and other lightweight models with a box accuracy of 0.893, an instance segmentation accuracy of 0.885, a weight size of 3.0 MB, and an inference time of 2.6 ms.
At present, Lentinus edodes is cultivated mainly on spawn sticks and in spawn bags. During growth, the fruiting bodies undergo significant morphological changes, leading to uneven distribution, overlapping of stems and caps, and highly similar colors, all of which make identification and segmentation difficult. The present study applied the proposed YOLOv5seg-BotNet to instance segmentation of Lentinus edodes images, supporting detection and decision-making for quality grading and yield prediction. The specific contributions are as follows:
(1) Innovative integration method: BoTNet, PANet, and VFL were integrated into YOLOv5seg for the first time, yielding a novel instance segmentation technique that improves both the accuracy and the speed of Lentinus edodes instance segmentation;
(2) Domain-specific optimization: the model was optimized for the complex situations common in Lentinus edodes cultivation (such as morphological changes and overlapping occlusion), giving it higher robustness and stability when handling these problems;
(3) Practical application verification: experiments verified the effectiveness of the improved model in practical application, showing that it not only improves segmentation accuracy but also offers significant advantages for real-time detection and decision support;
(4) Data diversity: the samples covered Lentinus edodes of different shapes, sizes, and maturity levels, and repeated measurements and data augmentation were carried out to ensure the diversity and representativeness of the data and to improve the generalization ability of the model in various environments;
(5) Performance comparison: in a detailed comparison with existing methods (such as Mask R-CNN, YOLACT, and YOLOv8), the proposed method achieves strong overall performance and provides a more effective solution for Lentinus edodes instance segmentation.
The following sections describe data acquisition and preprocessing in detail, explain how BoTNet, PANet, and VFL are integrated to address these problems, and analyze and discuss the experimental results.

2. Materials and Methods

2.1. Data Collection and Preprocessing

The Lentinus edodes data for this study were collected at Jilin Agricultural University, Changchun City, Jilin Province, between early March and mid-July 2023. Videos of mature Lentinus edodes were captured with an iPhone 14 Pro (China) from a distance of approximately 0.2 to 0.5 m. The turntable rotation speed was set to 33 s per revolution, and one image was extracted every 45 frames from each video. During data collection, a standardized light source and background setup was used to ensure image quality and consistency. The image resolution was 1280 × 720 pixels. To ensure data diversity, 1000 Lentinus edodes images were collected for the experiment. The samples covered Lentinus edodes of different shapes, sizes, and maturity levels to ensure diversity and representativeness. Each Lentinus edodes sample was measured multiple times to ensure the accuracy and consistency of the data; specifically, five independent segmentation measurements were conducted for each sample under different lighting conditions and backgrounds to evaluate the stability and robustness of the model in various environments.
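Frame extraction at a fixed interval, as described above, can be scripted with OpenCV. The snippet below is a minimal sketch: the 45-frame step comes from the description above, while the file paths and naming scheme are illustrative assumptions rather than the exact script used in this study.

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, step: int = 45) -> int:
    """Save one frame every `step` frames from a turntable video (Section 2.1)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:          # end of video
            break
        if idx % step == 0:  # keep every 45th frame
            cv2.imwrite(str(out / f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```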
The original images were annotated using Labelme [24]. To improve the generalization and robustness of the model, five data augmentation methods were applied, including brightness adjustment, noise addition, panning, random point value modification, and horizontal flipping [25,26], each generating 1000 images, as shown in Figure 1. Python scripts were developed to apply these augmentations to the images and generate the corresponding annotation files. This process resulted in 6000 Lentinus edodes images, divided into training, validation, and test sets in a ratio of 7:2:1, and data augmentation was applied only to the training set.
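The five augmentations can be applied with standard image-processing libraries. The sketch below, using OpenCV and NumPy, illustrates one possible implementation; the parameter ranges (brightness offset, noise level, translation range, number of modified pixels) are assumptions, since the exact values are not reported, and the geometric transforms would also require the Labelme polygon annotations to be shifted or flipped accordingly.

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> dict:
    """Apply the five augmentations described in Section 2.1 to one color image.
    Parameter ranges are illustrative assumptions, not the paper's exact settings."""
    h, w = image.shape[:2]

    # (b) brightness adjustment: shift pixel intensities by a random offset
    bright = cv2.convertScaleAbs(image, alpha=1.0, beta=float(np.random.randint(-40, 41)))

    # (c) noise addition: additive Gaussian noise
    noise = np.clip(image + np.random.normal(0, 15, image.shape), 0, 255).astype(np.uint8)

    # (d) panning: translate the image by a random offset
    tx, ty = np.random.randint(-50, 51, size=2)
    M = np.float32([[1, 0, tx], [0, 1, ty]])
    panned = cv2.warpAffine(image, M, (w, h))

    # (e) random point value modification: perturb randomly chosen pixels
    point = image.copy()
    rows, cols = np.random.randint(0, h, 500), np.random.randint(0, w, 500)
    point[rows, cols] = np.random.randint(0, 256, (500, 3))

    # (f) horizontal flip
    flipped = cv2.flip(image, 1)

    return {"brightness": bright, "noise": noise, "pan": panned,
            "points": point, "flip": flipped}
```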

2.2. Network Design and Improvement

2.2.1. YOLOv5s Instance Segmentation

YOLOv5 is available in several variants; YOLOv5n, for example, is lighter and faster but achieves these benefits at the cost of reduced precision and recall. For the segmentation of Lentinus edodes, maintaining high accuracy was essential, and YOLOv5s provided a better trade-off between speed and precision. The YOLOv5s instance segmentation (YOLOv5seg) model is an instance segmentation model built on a single-stage object detector [27]. It excels in rapid and accurate object detection and pixel-level instance segmentation [28]. Owing to this strong performance in real-time detection and instance segmentation applications, this study adopted the YOLOv5seg network as the baseline, as depicted in Figure 2.
The YOLOv5seg network divided the instance segmentation into two parallel tasks: prediction and segmentation. Initially, the backbone network extracted features from the input images, and a Feature Pyramid Network (FPN) was utilized to fuse the feature maps at various scales. Subsequently, the Prediction Head identified the target objects and produced k mask coefficients per instance, whereas the Protonet generated k prototype masks. Finally, the prototype masks were linearly combined with the per-instance coefficients, and the results were cropped to each bounding box and thresholded to produce the instance segmentation results.
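The prototype-and-coefficient assembly described above (and the Crop/Threshold steps shown in Figure 2) can be expressed in a few lines of PyTorch. The following is a simplified sketch with assumed tensor shapes, not the actual YOLOv5seg implementation.

```python
import torch

def assemble_masks(prototypes: torch.Tensor,   # (k, H, W) prototype masks from the Protonet
                   coeffs: torch.Tensor,       # (n, k) mask coefficients, one row per detected instance
                   boxes: torch.Tensor,        # (n, 4) boxes as (x1, y1, x2, y2) in pixel coordinates
                   threshold: float = 0.5) -> torch.Tensor:
    """Linearly combine prototypes with coefficients, crop to each box, and binarize."""
    k, H, W = prototypes.shape
    # Linear combination followed by a sigmoid gives per-instance soft masks: (n, H, W)
    masks = torch.sigmoid(torch.einsum("nk,khw->nhw", coeffs, prototypes))

    # Crop: zero out mask values outside each instance's bounding box
    ys = torch.arange(H).view(1, H, 1)
    xs = torch.arange(W).view(1, 1, W)
    x1, y1, x2, y2 = boxes[:, 0:1, None], boxes[:, 1:2, None], boxes[:, 2:3, None], boxes[:, 3:4, None]
    inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
    masks = masks * inside

    # Threshold: binarize the soft masks, as in Figure 2
    return (masks > threshold).float()
```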

2.2.2. BoTNet

The Bottleneck Transformer Network (BoTNet) is an architecture that combines convolution with the self-attention mechanism of Transformers. It mitigates the information loss inherent in convolution by letting the convolutional layers learn abstract, low-resolution feature maps from high-resolution images while the self-attention mechanism processes and aggregates the information contained in these feature maps. This combination extends the model's ability to attend to different positions.
The BoTNet design was integrated into YOLOv5seg, with global multi-head self-attention (MHSA) replacing the 3 × 3 spatial convolutions in the last three bottleneck blocks, as in the original ResNet-based BoTNet. MHSA, the core structure of BoTNet [29], simultaneously considers the relationships between all positions in the input sequence, captures global information, and enhances feature extraction. BoTNet also reduces the parameter count of the model by relying on 1 × 1 convolutions around the attention layer.
The MHSA is an attention mechanism derived from self-attention. Figure 3 illustrates the structure of the self-attention mechanism and MHSA module. Figure 3a depicts the self-attention mechanism module comprising three matrix operations, Q, K, and V, which are essentially self-operations. Figure 3b presents the MHSA module, which is an enhanced version of the self-attention mechanism, where each attention operation can extract effective feature information from multiple dimensions through grouping.
The formula for Figure 3a is
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{1}$$
where Q, K, and V are matrices obtained by linear transformation of the input matrix, with dimensions identical to the input matrix; $QK^{T}$, obtained through matrix multiplication, measures the similarity between Q and K; $\sqrt{d_k}$ is a scaling factor ($d_k$ being the key dimension) that prevents $QK^{T}$ from becoming too large when the dimension is high, which would cause vanishing gradients during backpropagation through the Softmax function; and Softmax is a normalization function whose output is multiplied with V to obtain the channel correlations.
The formula for Figure 3b is
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \tag{2}$$
$$\mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right), \tag{3}$$
where Q, K, and V are the matrices obtained by a linear transformation of the input matrix; h is the number of heads; Concat denotes the concatenation operation; $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the query, key, and value projection matrices of the i-th head, and $W^{O}$ is the output matrix. MHSA divides the input into h subspaces and independently applies scaled dot-product attention to each subspace, producing h output vectors.
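Equations (1)–(3) map directly onto code. The PyTorch sketch below implements multi-head self-attention over the flattened spatial positions of a feature map; it is a simplified illustration (BoTNet's relative position encoding and other details are omitted), not the exact module used in the network.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Multi-head self-attention over the spatial positions of a feature map (Eqs. 1-3).
    Simplified sketch: BoTNet's relative position encoding is omitted."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_k = heads, dim // heads
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)  # 1x1 projections for Q, K, V
        self.out = nn.Conv2d(dim, dim, kernel_size=1, bias=False)         # output matrix W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)

        def split(t):  # reshape to (batch, heads, positions, d_k)
            return t.reshape(b, self.heads, self.d_k, h * w).transpose(2, 3)

        q, k, v = split(q), split(k), split(v)
        # scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(2, 3).reshape(b, c, h, w)
        return self.out(out)
```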

2.2.3. PANet

Figure 4 illustrates the structure of the PANet. The Path Aggregation Network (PANet) [30] is an architecture built on the Feature Pyramid Network (FPN) [31]. It aggregates semantic information along multi-scale pathways in both the top–down (FPN) and bottom–up (PAN) directions, complemented by adaptive feature pooling (AFP), which enhances the capability for multi-scale feature extraction.
The PANet architecture consists of the FPN, PAN, and AFP components and enhances feature extraction for Lentinus edodes fruiting bodies. The FPN fuses feature maps through multiple up-samplings, whereas the bottom–up path achieves this through multiple down-samplings. ROIAlign pooling [32] is applied to obtain four feature maps of the same size, as depicted in the dark gray area of Figure 4b. Subsequently, a fully connected layer processes the four feature maps separately, and the AFP fuses the four sets of features, as shown in Figure 4c. Finally, the fused features are passed into different connection layers, yielding the class, box regression, and mask prediction results.
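As a rough illustration of the top–down plus bottom–up aggregation, the following PyTorch sketch fuses three backbone levels with an FPN pass followed by a PAN pass. The channel widths and convolution choices are assumptions, and ROIAlign, AFP, and the prediction branches are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplePANet(nn.Module):
    """Minimal top-down (FPN) + bottom-up (PAN) aggregation over three backbone levels.
    Channel width (256) and the plain 1x1/3x3 convolutions are illustrative assumptions."""
    def __init__(self, in_channels=(256, 512, 1024), width: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1) for _ in in_channels)
        self.down = nn.ModuleList(nn.Conv2d(width, width, 3, stride=2, padding=1) for _ in in_channels[:-1])

    def forward(self, c3, c4, c5):
        # Top-down pathway (FPN): upsample deeper maps and add lateral connections
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p3, p4, p5 = [s(p) for s, p in zip(self.smooth, (p3, p4, p5))]
        # Bottom-up pathway (PAN): downsample shallower maps and add them back in
        n3 = p3
        n4 = p4 + self.down[0](n3)
        n5 = p5 + self.down[1](n4)
        return n3, n4, n5
```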

2.2.4. VFL

Varifocal Loss (VFL) is an extension of Focal Loss [33]. It introduces an adjustable parameter α to balance the loss magnitude between easy- and hard-to-classify samples. VFL lets the model focus more on learning from high-quality positive samples during training by reducing the weight of negative samples and decreasing their contribution to the loss, directing more attention to occluded targets and other challenging positive samples, as formulated in Equation (4).
$$\mathrm{VFL}(p, q) = \begin{cases} -q\left(q\log(p) + (1-q)\log(1-p)\right), & q > 0 \\ -\alpha p^{\gamma}\log(1-p), & q = 0, \end{cases} \tag{4}$$
where p is the predicted score; q is the target score; α is a hyperparameter used to balance the loss between positive and negative samples, and γ is the focusing parameter that down-weights the loss of easy negative samples.
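Equation (4) can be implemented directly. The sketch below assumes p is a sigmoid-activated predicted score and q the target score, with α = 0.75 and γ = 2.0 as illustrative defaults; the values used in this study are not reported.

```python
import torch

def varifocal_loss(p: torch.Tensor, q: torch.Tensor,
                   alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """Varifocal Loss (Eq. 4): BCE weighted by the target score q for positives,
    down-weighted by alpha * p^gamma for negatives. alpha/gamma values are assumptions."""
    eps = 1e-7
    p = p.clamp(eps, 1.0 - eps)
    pos = q > 0
    loss = torch.zeros_like(p)
    # q > 0: positive samples, weighted by the continuous target score q
    loss[pos] = -q[pos] * (q[pos] * torch.log(p[pos]) + (1 - q[pos]) * torch.log(1 - p[pos]))
    # q = 0: negative samples, down-weighted so easy negatives contribute less
    loss[~pos] = -alpha * p[~pos].pow(gamma) * torch.log(1 - p[~pos])
    return loss.mean()
```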

2.2.5. YOLOv5seg-BotNet

Figure 5 depicts the architecture of the YOLOv5seg-BotNet model, which comprises three parts: the backbone, the neck, and the head. The backbone is composed of multiple CBS, C3, SPPF, and MHSA modules, whereas the neck consists of CBS, C3, and upsample modules; the PANet structure is marked by the red dashed box. The head consists of detection and segmentation heads. This study proposed three improvements to the YOLOv5seg model. First, introducing BoTNet into the backbone network enhanced feature extraction while reducing the parameter count, thereby improving computational efficiency, as depicted by the red module in the backbone of Figure 5. Second, adopting the PANet structure enhanced feature extraction for Lentinus edodes and improved instance segmentation performance. Finally, using VFL instead of Focal Loss adjusted the weights of different samples to increase the focus on the detection and segmentation of difficult samples.
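To make the backbone substitution concrete, the sketch below shows a bottleneck block in which the 3 × 3 spatial convolution is replaced by multi-head self-attention. It uses PyTorch's built-in nn.MultiheadAttention for brevity, and the channel-reduction ratio and activation choices are assumptions rather than the exact YOLOv5seg-BotNet modules.

```python
import torch
import torch.nn as nn

class BoTBottleneck(nn.Module):
    """Bottleneck block whose 3x3 spatial convolution is replaced by global self-attention.
    Channel reduction ratio of 4 and SiLU activations are assumptions."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        hidden = channels // 4
        assert hidden % heads == 0
        self.reduce = nn.Sequential(nn.Conv2d(channels, hidden, 1, bias=False),
                                    nn.BatchNorm2d(hidden), nn.SiLU())
        self.attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=heads, batch_first=True)
        self.expand = nn.Sequential(nn.Conv2d(hidden, channels, 1, bias=False),
                                    nn.BatchNorm2d(channels))
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        y = self.reduce(x)                               # 1x1 conv: reduce channels
        tokens = y.flatten(2).transpose(1, 2)            # (b, h*w, hidden): one token per position
        attended, _ = self.attn(tokens, tokens, tokens)  # global self-attention replaces the 3x3 conv
        y = attended.transpose(1, 2).reshape(b, -1, h, w)
        return self.act(x + self.expand(y))              # 1x1 conv back to `channels`, residual add
```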

2.3. Experimental Environment and Parameter Settings

The experimental setup employed Windows 10 as the operating system, an NVIDIA GeForce RTX 3060 Ti GPU, 32 GB of memory, an Intel(R) Core(TM) i9-10920X CPU @ 3.50 GHz, and PyTorch 1.12.1 with CUDA for deep-learning model development. The input image size was 640 × 640, the batch size was 8, the learning rate was set to 0.01, the momentum was set to 0.937, and the optimizer was SGD, as detailed in Table 1. Early stopping and a maximum-epoch criterion were used during training to ensure the best model performance.
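The settings in Table 1 correspond to a conventional SGD training loop with early stopping; a minimal sketch is shown below. The weight decay, Nesterov flag, and the user-supplied train/validate callables are assumptions beyond what Table 1 reports.

```python
import torch

def train(model: torch.nn.Module, train_one_epoch, validate,
          max_epochs: int = 100, patience: int = 10) -> float:
    """Training loop matching Table 1: SGD (lr=0.01, momentum=0.937), at most 100 epochs,
    early stopping with a patience of 10 epochs. weight_decay/nesterov are assumptions."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937,
                                weight_decay=5e-4, nesterov=True)
    best_ap, wait = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)    # user-supplied epoch of SGD updates (batch size 8)
        val_ap = validate(model)             # user-supplied validation returning Mask_AP
        if val_ap > best_ap:
            best_ap, wait = val_ap, 0        # improvement: reset the early-stop counter
        else:
            wait += 1
            if wait >= patience:             # no improvement for `patience` epochs: stop
                break
    return best_ap
```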

2.4. Evaluation Metrics

In this study, precision (P), recall (R), and average precision (AP) were used as evaluation metrics, together with the F1-Score (the harmonic mean of precision and recall). The model's inference speed was assessed in frames per second (FPS). The formulas for calculating P, R, and AP are as follows:
$$P = \frac{TP}{TP + FP} \times 100\%, \tag{5}$$
$$R = \frac{TP}{TP + FN} \times 100\%, \tag{6}$$
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R. \tag{7}$$
True Positives (TP) indicate instances predicted as positive that are indeed positive, whereas True Negatives (TN) indicate instances predicted as negative that are indeed negative. False Positives (FP) indicate instances predicted as positive but actually negative, and False Negatives (FN) indicate instances predicted as negative but actually positive.
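These metrics can be computed from the TP/FP/FN counts and the precision–recall curve. The sketch below is a generic implementation; the monotone-envelope convention for AP is an assumption, as the exact AP computation protocol is not specified in the text.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1-Score from the counts defined above."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """AP approximated by numerical integration of the precision-recall curve,
    with precision made monotonically non-increasing first (a common convention)."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], recalls[order], [1.0]))
    p = np.concatenate(([1.0], precisions[order], [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # envelope: p(r) := max precision at recall >= r
    return float(np.trapz(p, r))               # integrate p dr over [0, 1]
```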

3. Results

3.1. Ablation Study

Table 2 presents the evaluation results of YOLOv5seg, YOLOv5seg-MHSA, YOLOv5seg-PANet, YOLOv5seg-VFL, and YOLOv5seg-BotNet on the test set. In YOLOv5seg-MHSA, the spatial convolutions in the backbone were replaced with MHSA; YOLOv5seg-PANet incorporated PANet, and YOLOv5seg-VFL integrated VFL. Mask_AP denotes the AP value of the model for instance segmentation.
According to Table 2, YOLOv5seg-MHSA achieved precision, recall, Mask_AP, F1-Score, and FPS of 96.47%, 91.03%, 92.79%, 93.67%, and 32.59 frames/s, respectively. Compared to the original model, it improved the precision, Mask_AP, and F1-Score by 1.26, 1.45, and 0.52 percentage points and the FPS by 2.34 frames/s, while the recall decreased by 0.16 percentage points. YOLOv5seg-PANet achieved precision, recall, Mask_AP, F1-Score, and FPS of 96.82%, 92.81%, 93.65%, 94.77%, and 27.42 frames/s, respectively. Compared to the original model, it improved the precision, recall, Mask_AP, and F1-Score by 1.61, 1.62, 2.31, and 1.62 percentage points, respectively, whereas the FPS decreased by 2.83 frames/s. YOLOv5seg-VFL achieved precision, recall, Mask_AP, F1-Score, and FPS of 97.10%, 94.58%, 93.73%, 95.82%, and 31.63 frames/s, respectively. Compared to the original model, it improved the precision, recall, Mask_AP, and F1-Score by 1.89, 3.39, 2.39, and 2.61 percentage points and the FPS by 1.38 frames/s.
YOLOv5seg-BotNet achieved precision, recall, Mask_AP, F1-Score, and FPS of 97.58%, 95.74%, 95.90%, 96.65%, and 32.86 frames/s, respectively. Compared to the original model, it improved the precision, recall, Mask_AP, and F1-Score by 2.37, 4.55, 4.56, and 3.50 percentage points and the FPS by 2.61 frames/s.
Figure 6 illustrates the Mask_AP variation during the training process for YOLOv5seg, YOLOv5seg-MHSA, YOLOv5seg-PANet, YOLOv5seg-VFL, and YOLOv5seg-BotNet. YOLOv5seg-BotNet demonstrated the best performance for the training set.

3.2. YOLOv5seg-BotNet Compared with Other Segmentation Models

Table 3 presents the evaluation results of Mask R-CNN, YOLACT, and YOLOv8 on the test set. All the models adopted the same experimental environment and training strategy described in this study.
In terms of segmentation accuracy, YOLOv5seg-BotNet achieved precision, recall, Mask_AP, and F1-Score values of 97.58%, 95.74%, 95.90%, and 96.65%, respectively, with recall, Mask_AP, and F1-Score all higher than those of the other models. Specifically, Mask_AP was improved by 4.31, 0.56, and 3.62 percentage points compared with Mask R-CNN, YOLACT, and YOLOv8, respectively. Regarding segmentation speed, YOLOv5seg-BotNet reached 32.86 frames/s, an improvement of 25.66, 5.07, and 2.24 frames/s over Mask R-CNN, YOLACT, and YOLOv8, respectively.
Figure 7 illustrates the variation in Mask_AP during the training process for Mask R-CNN, YOLACT, YOLOv8, and YOLOv5seg-BotNet. YOLOv5seg-BotNet demonstrated the best performance on the training set.

3.3. YOLOv5seg-BotNet Compared with Results from Other Segmentation Models

Figure 8 and Figure 9 illustrate the segmentation results of YOLOv5seg-BotNet compared with the other models. The Mask R-CNN model exhibited poor segmentation performance, particularly where the color difference between the Lentinus edodes stem and cap was minimal; this led to inaccurate segmentation of the complete Lentinus edodes shape and noticeable under-segmentation, as highlighted by the green circles in the figures. The YOLACT and YOLOv8 models exhibited moderate performance in segmenting Lentinus edodes and their edges, occasionally mis-segmenting small targets, as indicated by the yellow circles in the figures. All models performed poorly on Lentinus edodes images under backlit conditions, showing tendencies toward both under-segmentation and mis-segmentation. The proposed YOLOv5seg-BotNet model achieved precise segmentation of Lentinus edodes images under various conditions and complex backgrounds, demonstrating satisfactory segmentation performance.

4. Discussion

Segmenting Lentinus edodes fruiting bodies is crucial for phenotypic research. Factors such as uncertain lighting, high color similarity between stems and caps, and occlusion could increase the segmentation difficulty [37,38].
After replacing the spatial convolutions in the later backbone stages with MHSA, YOLOv5seg-MHSA achieved precision, recall, Mask_AP, F1-Score, and FPS of 96.47%, 91.03%, 92.79%, 93.67%, and 32.59 frames/s, respectively. Compared to the original model, the precision, Mask_AP, and F1-Score increased by 1.26, 1.45, and 0.52 percentage points and the FPS by 2.34 frames/s, whereas the recall decreased by 0.16 percentage points. Compared with the standard YOLOv5seg backbone, the BoTNet module, which combines convolution and self-attention, effectively addressed the issue of convolutional information loss and enhanced the feature extraction capabilities of the model. The attention mechanism better accounts for spatial relationships and contextual information, leading to improved instance segmentation performance and confirming the effectiveness of MHSA in extracting diverse features [39].
In this study, we compared common feature fusion modules such as FPN, BiFPN, and PANet. YOLOv5seg-PANet achieved precision, recall, Mask_AP, F1-Score, and FPS of 96.82%, 92.81%, 93.65%, 94.77%, and 27.42 frames/s, respectively. Compared to the original model, the precision, recall, Mask_AP, and F1-Score improved by 1.61, 1.62, 2.31, and 1.62 percentage points, respectively, whereas the FPS decreased by 2.83 frames/s. The conventional FPN possesses only a single top–down information flow, which causes the loss of feature information [40]. PANet demonstrated superior feature fusion capabilities and enhanced the segmentation accuracy of the model [41]. The decrease in FPS was attributed to the PANet structure, which extends FPN with PAN, AFP, and other modules; consequently, PANet has more parameters than FPN and BiFPN [42].
Mask R-CNN, YOLACT, and YOLOv8 faced challenges with under-segmentation and mis-segmentation in Lentinus edodes fruiting-body instance segmentation, particularly when dealing with occlusion between fruiting bodies, leading to suboptimal detection performance [43]. This study addressed sample imbalance by comparing loss functions such as EIoU, SIoU, and VFL. The convergence of the loss curves for these functions during training is illustrated in Figure 10, with convergence typically occurring at approximately 100 iterations.
YOLOv5seg-VFL achieved precision, recall, Mask_AP, F1-Score, and FPS of 97.10%, 94.58%, 93.73%, 95.82%, and 31.63 frames/s, respectively. Compared to the original model, the precision, recall, Mask_AP, and F1-Score improved by 1.89, 3.39, 2.39, and 2.61 percentage points and the FPS by 1.38 frames/s. By reducing the negative sample weight, VFL directs more attention to occluded targets and other challenging samples [44,45], effectively addressing under-segmentation and mis-segmentation.
Figure 11a illustrates the segmentation results of the original model. The Lentinus edodes fruiting bodies circled in the figure are mis-segmented. Figure 11b illustrates the segmentation results of the model after introducing VFL. The Lentinus edodes fruiting bodies can be correctly segmented.

5. Conclusions

This study introduced the YOLOv5seg-BotNet model for segmenting Lentinus edodes fruiting bodies, with the objective of addressing challenges such as overlap, lighting uncertainties, and high color similarity between caps and stems. A dataset comprising 6000 images (including data augmentation) was first created for Lentinus edodes fruiting-body instance segmentation. The impact of four loss functions (Focal Loss, EIoU, SIoU, and VFL) on model precision was then compared. Additionally, the backbone network of the model was substituted with BoTNet, and the PANet adaptive spatial feature fusion mechanism was applied. The results indicated that VFL achieved the best enhancement in precision, whereas BoTNet and PANet enhanced the model's feature extraction, leading to improved Mask_AP. YOLOv5seg-BotNet achieved precision, recall, Mask_AP, F1-Score, and FPS of 97.58%, 95.74%, 95.90%, 96.65%, and 32.86 frames/s, respectively. Compared to YOLOv5seg, Mask R-CNN, YOLACT, and YOLOv8, YOLOv5seg-BotNet demonstrated superior overall performance. The combined gains in Mask_AP and FPS enhance the overall efficiency and effectiveness of the segmentation system, making it more robust and suitable for practical use. Therefore, YOLOv5seg-BotNet can provide effective detection and segmentation of Lentinus edodes fruiting bodies, supporting subsequent tasks such as quality grading and intelligent harvesting.

Author Contributions

X.X.: conceptualization, methodology, investigation, writing—original draft; X.S.: conceptualization, methodology, investigation, formal analysis, writing—original draft, supervision, writing—review/editing; L.Z.: methodology, investigation, formal analysis, visualization; H.Y.: visualization, investigation, supervision; J.Z.: methodology, writing—review/editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Jilin Province (YDZJ202201ZYTS544) and the Technology Development Plan Project of Jilin Province (20240304096SF).

Data Availability Statement

Data supporting the findings of this study are available from the corresponding author.

Acknowledgments

The authors would like to acknowledge the anonymous reviewers for their valuable comments and members of the editorial team for editing this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Z.E. Application value and prospect of mushroom. Mod. Food 2023, 29, 26–28. [Google Scholar]
  2. Yao, F.; Gao, H.; Wang, Z.H.; Yin, C.M.; Shi, D.F.; Zhang, J.; Fan, X.Z. Comparative analysis of active substances and in vitro activities of alcohol extracts from mushrooms with different textures. Food Res. Dev. 2023, 44, 28–33. [Google Scholar]
  3. Liu, F.; Zhang, M.Z.; Cao, B.; Ling, Y.Y.; Zhao, R.Y. Application and prospect analysis of MNP molecular markers in accurate identification of edible mushroom varieties. Fungal Res. 2024, 1–9. [Google Scholar] [CrossRef]
  4. Zhu, W.Q.; Guo, L.L.; Chen, L.Y. Preliminary exploration of cultivating high-quality mushrooms using solid selenium-rich additives. Spec. Econ. Anim. Plants 2023, 26, 16–18. [Google Scholar]
  5. Shajin, F.H.; Aruna Devi, B.; Prakash, N.B.; Sreekanth, G.R.; Rajesh, P. Sailfish optimizer with Levy flight, chaotic and opposition-based multi-level thresholding for medical image segmentation. Soft Comput. 2023, 27, 12457–12482. [Google Scholar] [CrossRef]
  6. Elharrouss, O.; Hmamouche, Y.; Idrissi, A.K.; El Khamlichi, B.; El Fallah-Seghrouchni, A. Refined edge detection with cascaded and high-resolution convolutional network. Pattern Recognit. 2023, 138, 109361. [Google Scholar] [CrossRef]
  7. Sahu, K.; Minz, S. Adaptive fusion of K-means region growing with optimized deep features for enhanced LSTM-based multi-disease classification of plant leaves. Geocarto Int. 2023, 38, 2178520. [Google Scholar] [CrossRef]
  8. Chen, R.J.; Tang, W.Y.; Lv, W.G.; Li, D.Y. Multi-threshold segmentation of fruit depth images based on IMFO-Otsu. Mod. Agric. Equip. 2023, 44, 30–35. [Google Scholar]
  9. Sun, J.T.; Sun, Y.F.; Zhao, R.; Ji, Y.H.; Zhang, M.; Li, H. Tomato recognition method based on geometric morphology and iterative random circle. J. Agric. Mach. 2019, 50, 22–26+61. [Google Scholar]
  10. Ji, W.; Zhao, D.; Cheng, F.; Xu, B.; Zhang, Y.; Wang, J. Automatic recognition vision system guided for apple harvesting robot. Comput. Electr. Eng. 2012, 38, 1186–1195. [Google Scholar] [CrossRef]
  11. Zhang, J.; Trautman, D.; Liu, Y.; Bi, C.; Chen, W.; Ou, L.; Goebel, R. Achieving the Rewards of Smart Agriculture. Agronomy 2024, 14, 452. [Google Scholar] [CrossRef]
  12. Li, K.; Huang, J.; Song, W.; Wang, J.; Lv, S.; Wang, X. Automatic segmentation and measurement methods of living stomata of plants based on the CV model. Plant Methods 2019, 15, 67. [Google Scholar] [CrossRef] [PubMed]
  13. Guo, Z.; Shi, Y.; Ahmad, I. Design of smart citrus picking model based on Mask RCNN and adaptive threshold segmentation. PeerJ Comput. Sci. 2024, 10, e1865. [Google Scholar] [CrossRef] [PubMed]
  14. Lu, J.; Yang, R.; Yu, C.; Lin, J.; Chen, W.; Wu, H.; Chen, X.; Lan, Y.; Wang, W. Citrus green fruit detection via improved feature network extraction. Front. Plant Sci. 2022, 13, 946154. [Google Scholar] [CrossRef]
  15. Shen, R.; Zhen, T.; Li, Z. Segmentation of Unsound Wheat Kernels Based on Improved Mask RCNN. Sensors 2023, 23, 3379. [Google Scholar] [CrossRef]
  16. Wang, D.; He, D. Apple detection and instance segmentation in natural environments using an improved Mask Scoring R-CNN Model. Front. Plant Sci. 2022, 13, 1016470. [Google Scholar] [CrossRef]
  17. Yu, Y.; Zhang, K.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN. Comput. Electron. Agric. 2019, 163, 104846. [Google Scholar] [CrossRef]
  18. Wang, F.K.; Huang, Y.Q.; Huang, Z.C.; Shen, H.; Huang, C.; Qiao, X.; Qian, W.Q. MRUNet: A two-stage segmentation model for small insect targets in complex environments. J. Integr. Agric. 2023, 22, 1117–1130. [Google Scholar] [CrossRef]
  19. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  20. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  21. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  22. Zhu, D.L.; Yu, M.S.; Liang, M.F. Real-time instance segmentation of corn ears based on SwinT-YOLACT. Trans. Chin. Soc. Agric. Eng. 2023, 39, 164–172. [Google Scholar]
  23. Lawal, O.M. YOLOv5-LiNet: A lightweight network for fruits instance segmentation. PLoS ONE 2023, 18, e0282297. [Google Scholar] [CrossRef]
  24. Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 2008, 77, 157–173. [Google Scholar] [CrossRef]
  25. Fan, Y.; Zhang, S.; Feng, K.; Qian, K.; Wang, Y.; Qin, S. Strawberry maturity recognition algorithm combining dark channel enhancement and YOLOv5. Sensors 2022, 22, 419. [Google Scholar] [CrossRef] [PubMed]
  26. Pan, S.Q.; Qiao, J.F.; Wang, R.; Yu, H.L.; Wang, C.; Kerry, T.; Pan, H.Y. Intelligent diagnosis of northern corn leaf blight with deep learning model. J. Integr. Agric. 2022, 21, 1094–1105. [Google Scholar] [CrossRef]
  27. Patel, S.C.; Salot, P. Survey on Different Object Detection and Segmentation Methods. Int. J. Innov. Sci. Res. Technol. 2021, 6, 608–611. [Google Scholar]
  28. Kwon, D.J.; Lee, S. Car detection area segmentation using deep learning system. Int. J. Adv. Smart Converg. 2023, 12, 182–189. [Google Scholar]
  29. Vasanthi, P.; Mohan, L. Multi-Head-Self-Attention based YOLOv5X-transformer for multi-scale object detection. Multimed. Tools Appl. 2024, 83, 36491–36517. [Google Scholar] [CrossRef]
  30. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
  31. Luo, Y.; Cao, X.; Zhang, J.; Guo, J.; Shen, H.; Wang, T.; Feng, Q. CE-FPN: Enhancing channel information for object detection. Multimed. Tools Appl. 2022, 81, 30685–30704. [Google Scholar] [CrossRef]
  32. Zarbakhsh, P.; Demirel, H. Low-rank sparse coding and region of interest pooling for dynamic 3D facial expression recognition. Signal Image Video Process. 2018, 12, 1611–1618. [Google Scholar] [CrossRef]
  33. Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8514–8523. [Google Scholar]
  34. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  35. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar]
  36. Sohan, M.; Sai Ram, T.; Reddy, R.; Venkata, C. A review on yolov8 and its advancements. In Algorithms for Intelligent Systems, Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 18–20 November 2024; Springer: Singapore, 2024; pp. 529–545. [Google Scholar]
  37. Fan, Z.; Sun, N.; Qiu, Q.; Li, T.; Feng, Q.; Zhao, C. In situ measuring stem diameters of maize crops with a high-throughput phenotyping robot. Remote Sens. 2022, 14, 1030. [Google Scholar] [CrossRef]
  38. Song, C.Y.; Zhang, F.; Li, J.S.; Xie, J.Y.; Yang, C.; Zhou, H.; Zhang, J.X. Detection of maize tassels for UAV remote sensing image with an improved YOLOX Model. J. Integr. Agric. 2023, 22, 1671–1683. [Google Scholar] [CrossRef]
  39. Li, P.; Zheng, J.; Li, P.; Long, H.; Li, M.; Gao, L. Tomato maturity detection and counting model based on MHSA-YOLOv8. Sensors 2023, 23, 6701. [Google Scholar] [CrossRef] [PubMed]
  40. Ma, L.; Zhao, L.; Wang, Z.; Zhang, J.; Chen, G. Detection and Counting of Small Target Apples under Complicated Environments by Using Improved YOLOv7-tiny. Agronomy 2023, 13, 1419. [Google Scholar] [CrossRef]
  41. Liu, H.; Sun, F.; Gu, J.; Deng, L. SF-YOLOv5: A lightweight small object detection algorithm based on improved feature fusion mode. Sensors 2022, 22, 5817. [Google Scholar] [CrossRef] [PubMed]
  42. Li, A.; Zhao, Y.; Zheng, Z. Novel Recursive BiFPN Combining with Swin Transformer for Wildland Fire Smoke Detection. Forests 2022, 13, 2032. [Google Scholar] [CrossRef]
  43. Liu, X.; Li, G.; Chen, W.; Liu, B.; Chen, M.; Lu, S. Detection of dense Citrus fruits by combining coordinated attention and cross-scale connection with weighted feature fusion. Appl. Sci. 2022, 12, 6600. [Google Scholar] [CrossRef]
  44. Wang, C.; Zhang, Y.; Zhou, Y.; Sun, S.; Zhang, H.; Wang, Y. Automatic detection of indoor occupancy based on improved YOLOv5 model. Neural Comput. Appl. 2023, 35, 2575–2599. [Google Scholar] [CrossRef]
  45. Cao, X.; Su, Y.; Geng, X.; Wang, Y. YOLO-SF: YOLO for Fire Segmentation Detection; IEEE Access: New York, NY, USA, 2023. [Google Scholar]
Figure 1. Example of enhanced image sample of Lentinus edodes. (a) original picture; (b) brightness adjusted; (c) noise added; (d) panning; (e) value randomly changed; and (f) horizontal flip.
Figure 2. Structure of YOLOv5seg. NMS is non-maximal suppression; denotes the matrix multiplication; Crop denotes a zeroing operation for the mask outside the bounding box, and Threshold denotes an image binarization of the generated mask with a threshold of 0.5. The same annotations are applicable across the subsequent figures.
Figure 3. Scaled dot-product attention and MHSA module structure diagram. (a) Scaled dot-product attention, (b) MHSA layers running in parallel, and Q, K, and V are matrices obtained from the input matrix by a linear transformation.
Figure 4. Structure of PANet. (a) FPN backbone; (b) Bottom–up Path Enhancement; (c) Adaptive Feature Pooling; (d) Box Branch, and (e) Fully Connected Fusion. For brevity, the channel dimensions of the feature maps in (a,b) were omitted.
Figure 5. Structure of YOLOv5seg-BotNet. Black arrows indicate the direction of the data flow during network operation, with different colors representing different network modules. For example, the CBS module is represented in light blue, and the C3 module is represented in dark blue. The name of each module is located in the middle of the rectangle. Similarly, the improved MHSA module is indicated by a red module, and PANet is marked with a red rectangle.
Figure 6. Comparison of Mask_AP in different modules. The horizontal axis represents the number of iterations, while the vertical axis represents the value of Mask_AP.
Figure 7. Comparison of Mask_AP in different models. The horizontal axis represents the number of iterations, while the vertical axis represents the value of Mask_AP.
Figure 8. Comparison of the segmentation effect of different models. The yellow circle represents false segmentation, and the green circle represents missing segmentation.
Figure 9. Comparison of the segmentation effect of different models at different angles. The yellow circle represents false segmentation, and the green circle represents missing segmentation.
Figure 10. Comparison of different loss functions. The horizontal axis represents the number of iterations, while the vertical axis represents the loss value. The comparison between the different loss functions can be clearly seen in the local magnification.
Figure 11. Comparison of different modules. (a) Segmentation results of the original model, and (b) segmentation results after introducing the VFL. The yellow circle represents missing segmentation.
Table 1. Network model training parameters.

Parameter Category | Parameter Setting
Image size | 640 × 640
Epochs | 100
Batch size | 8
lr | 0.01
Momentum | 0.937
Optimizer | SGD
Early stop | 10
Table 2. The results of the ablation experiments.

Models | P (%) | R (%) | Mask_AP (%) | F1-Score (%) | FPS
YOLOv5seg | 95.21 | 91.19 | 91.34 | 93.15 | 30.25
YOLOv5seg-MHSA | 96.47 | 91.03 | 92.79 | 93.67 | 32.59
YOLOv5seg-PANet | 96.82 | 92.81 | 93.65 | 94.77 | 27.42
YOLOv5seg-VFL | 97.10 | 94.58 | 93.73 | 95.82 | 31.63
YOLOv5seg-BotNet | 97.58 | 95.74 | 95.90 | 96.65 | 32.86
Table 3. Comparison of the results of different segmentation models.

Models | P (%) | R (%) | Mask_AP (%) | F1-Score (%) | FPS
Mask R-CNN [34] | 91.37 | 89.72 | 91.59 | 90.53 | 7.20
YOLACT [35] | 97.86 | 92.58 | 95.34 | 95.14 | 27.79
YOLOv8 [36] | 95.94 | 90.24 | 92.28 | 93.00 | 30.62
YOLOv5seg-BotNet | 97.58 | 95.74 | 95.90 | 96.65 | 32.86
Share and Cite

Xu, X.; Su, X.; Zhou, L.; Yu, H.; Zhang, J. Instance Segmentation of Lentinus edodes Images Based on YOLOv5seg-BotNet. Agronomy 2024, 14, 1808. https://doi.org/10.3390/agronomy14081808
