Enhancing Instance Segmentation in Agriculture: An Optimized YOLOv8 Solution
Abstract
1. Introduction
- (1) A small target detection layer is added that specifically processes feature maps downsampled by a factor of four from the original image, significantly improving detection accuracy for small targets such as people.
- (2) A C2f_CPCA module is proposed, which introduces the channel prior convolutional attention (CPCA) mechanism into the C2f module (a minimal sketch of such an attention block is given after this list). By dynamically adjusting attention weights in both the channel and spatial dimensions, the mechanism captures relationships at different spatial scales, improving feature extraction and category recognition in agriculture-specific scenarios and enhancing the model's robustness against complex backgrounds.
- (3) A C3RFEM module is proposed, which combines dilated convolutions with a weighted layer and is added to the backbone network, enabling the model to extract richer features across different receptive-field ranges and further improving its generalization ability.
- (4) The computational efficiency of the model is further optimized, achieving a better balance between accuracy and speed while maintaining high segmentation accuracy.
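To make the mechanism in contribution (2) concrete, the following is a minimal, self-contained PyTorch sketch of a CPCA-style block in the spirit of the cited channel prior convolutional attention work: channel attention is applied first, then a spatial attention map built from multi-scale depth-wise strip convolutions. The class name, kernel sizes, and reduction ratio are illustrative assumptions, not the authors' C2f_CPCA implementation.

```python
import torch
import torch.nn as nn

class CPCABlock(nn.Module):
    """Sketch of a channel-prior convolutional attention block: channel
    attention first, then spatial attention from multi-scale depth-wise
    strip convolutions (kernel sizes here are illustrative assumptions)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: global average pool followed by a small MLP.
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # Spatial attention: depth-wise strip convolutions at several scales.
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw1x7 = nn.Conv2d(channels, channels, (1, 7), padding=(0, 3), groups=channels)
        self.dw7x1 = nn.Conv2d(channels, channels, (7, 1), padding=(3, 0), groups=channels)
        self.dw1x11 = nn.Conv2d(channels, channels, (1, 11), padding=(0, 5), groups=channels)
        self.dw11x1 = nn.Conv2d(channels, channels, (11, 1), padding=(5, 0), groups=channels)
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel prior: reweight channels using pooled statistics.
        ca = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3), keepdim=True)))
        x = x * ca
        # Spatial attention map from multi-scale strip convolutions.
        s = self.dw5(x)
        s = s + self.dw7x1(self.dw1x7(s)) + self.dw11x1(self.dw1x11(s))
        sa = self.mix(s)
        return x * sa


if __name__ == "__main__":
    feats = torch.randn(1, 64, 80, 80)   # e.g., a neck feature map
    print(CPCABlock(64)(feats).shape)    # torch.Size([1, 64, 80, 80])
```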
2. Materials and Methods
2.1. YOLOv8-Seg Instance Segmentation Network Structure
2.2. Improvement of the YOLOv8-Seg Instance Segmentation Network Structure
2.3. Add Small Object Detection Layer
- (1) Challenges faced by the original YOLOv8n-seg in detecting small objects: from a 640 × 640 input, the YOLOv8n model generates feature maps at three scales (80 × 80, 40 × 40, and 20 × 20) for detection. For extremely small objects that occupy only a few pixels (such as distant people), the effective information remaining on the low-resolution 40 × 40 and 20 × 20 feature maps after repeated downsampling is almost completely lost, making such objects difficult to recognize.
- (2) Loss of detail information: shallow layers contain rich detail and edge information, which is crucial for accurately localizing small objects, but lack high-level semantic information; deep layers carry strong semantics but suffer from low spatial resolution and coarse details. The original YOLOv8n-seg neck partially restores detail through upsampling and fusion, but its finest fused feature map is only 80 × 80. For the numerous tiny targets common in agricultural scenes, the feature detail available at this resolution is still insufficient.
- (3) Scarcity of assigned anchor points: object detection algorithms assign ground-truth boxes to specific locations (anchor points) on the feature map for learning. Low-resolution feature maps have sparse grids, so a tiny target may not find enough matching anchor points within a 20 × 20 grid, leading to missed detections (false negatives). Even when assigned, the number of grid cells available for prediction is very limited, making it difficult for the model to learn stable and robust feature representations.
Adding a P2 small-object detection layer addresses these issues in three ways (a short grid-density calculation follows this list):
- (1) Preserving high-resolution detail: the newly added P2 layer processes feature maps at a scale of 160 × 160. These maps are obtained by upsampling and fusing with shallow-layer outputs that have undergone fewer downsampling operations, so they retain richer spatial details and texture information. This lets the model "see" clearer edges and finer structures, providing the information foundation needed to accurately segment and localize small objects.
- (2) Providing denser prediction anchors for small objects: the 160 × 160 resolution corresponds to a much denser grid of cells. This greatly increases the chance that small objects are correctly assigned and learned, effectively reducing missed detections caused by sparse anchors. Each small object can now be predicted cooperatively by more, finer grid cells, improving detection recall and the accuracy of segmentation boundaries.
- (3) Optimized feature pyramid structure: adding the P2 layer strengthens the model's multi-scale detection capability. The model now has four detection scales, from 160 × 160 (fine) through 80 × 80 (medium) and 40 × 40 (coarse) to 20 × 20 (very coarse), forming a feature pyramid with broader coverage and smoother scale transitions. Objects of any size can thus be detected and segmented at the most appropriate feature scale, closing the original model's perception gap at the extremely small object scale.
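The following short calculation (not from the paper) illustrates the anchor-density argument above: for a 640 × 640 input it lists the grid size at each detection stride and how many grid cells a hypothetical 12-pixel object can span. The stride values are the standard YOLOv8 P2–P5 strides; the 12-pixel object size is an assumed example.

```python
# Grid density per detection scale for a 640 x 640 input, and how many grid
# cells a small object can overlap at each stride. Purely illustrative.
input_size = 640
strides = {"P2 (new)": 4, "P3": 8, "P4": 16, "P5": 32}

object_px = 12  # assumed size of a distant person, in pixels
for name, s in strides.items():
    grid = input_size // s
    cells_covered = max(1, object_px // s) ** 2  # cells the object can span
    print(f"{name:9s} stride {s:2d} -> {grid:3d} x {grid:3d} grid, "
          f"a {object_px}px object spans ~{cells_covered} cell(s)")
# The stride-4 P2 grid gives roughly 9 candidate cells for this object,
# versus a single cell on the 80 x 80 and coarser grids.
```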
2.4. CPCA Attention Mechanism
2.5. Neck Network Improvement
2.6. Backbone Network Improvement
3. Experiments and Analysis
3.1. Data Set
3.2. Evaluation Index
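For reference, the mask-level metrics reported in the tables below (PM, RM, mAP0.5M, and mAP0.5:0.95M, where the superscript M denotes the mask branch) presumably follow the standard precision, recall, and average-precision definitions; the summary below is stated as an assumption rather than quoted from the paper.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R,
\qquad
mAP_{0.5} = \frac{1}{N}\sum_{i=1}^{N} AP_i\big|_{IoU=0.5}, \qquad
mAP_{0.5:0.95} = \frac{1}{10}\sum_{t\in\{0.50,\,0.55,\,\dots,\,0.95\}} mAP_t .
```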
3.3. Experimental Environment
3.4. Experimental Result
3.4.1. Ablation Experiment
- (1) Fundamental enhancement of multi-scale feature extraction: the multi-scale receptive fields of the C3RFEM module significantly strengthen the model's capacity to represent complex structures and contextual information, something that neither the attention mechanism (C2f_CPCA) nor the shallow P2 layer alone could fully achieve (an illustrative sketch of this parallel dilated-convolution idea follows this list).
- (2) Complementary synergy with the attention mechanism: while C3RFEM produces richer feature maps, C2f_CPCA excels at precisely extracting the critical information from them. The combination balances enriching the available features against focusing computation on what matters, yielding further performance gains.
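As referenced in point (1), the sketch below illustrates the general idea behind a receptive-field enhancement module: parallel dilated 3 × 3 convolutions with different dilation rates, fused by a learnable weighted sum. The branch count, dilation rates, class name, and residual fusion are assumptions for illustration; this is not the paper's exact C3RFEM.

```python
import torch
import torch.nn as nn

class RFEBranchSketch(nn.Module):
    """Illustrative receptive-field enhancement idea: parallel dilated 3x3
    convolutions at different rates, combined by a softmax-weighted sum."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False)
            for d in dilations
        ])
        # One learnable fusion weight per branch, normalized in forward().
        self.weights = nn.Parameter(torch.zeros(len(dilations)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.weights, dim=0)
        out = sum(w[i] * branch(x) for i, branch in enumerate(self.branches))
        return out + x  # residual connection keeps the original features


if __name__ == "__main__":
    x = torch.randn(1, 128, 40, 40)
    print(RFEBranchSketch(128)(x).shape)  # torch.Size([1, 128, 40, 40])
```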
3.4.2. Algorithm Comparison Experiment
3.4.3. Model Generalization Analysis and Discussion
4. Discussion
4.1. Limitations and Challenge Scenario Analysis
4.1.1. Robustness Under Extreme Light Conditions
4.1.2. Intense Occlusion and High Overlap of Targets
4.1.3. Long-Range Very Small Target Detection
4.1.4. Limitations of Model Generalization Ability
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Rovira-Más, F. Sensor Architecture and Task Classification for Agricultural Vehicles and Environments. Sensors 2010, 10, 11226.
- Mavridou, E.; Vrochidou, E.; Papakostas, G.A.; Pachidis, T.; Kaburlasos, V.G. Machine Vision Systems in Precision Agriculture for Crop Farming. J. Imaging 2019, 5, 89.
- Issa, R.B.; Das, M.; Rahman, M.S.; Barua, M.; Rhaman, M.K.; Ripon, K.S.N.; Alam, M.G.R. Double Deep Q-Learning and Faster R-CNN-Based Autonomous Vehicle Navigation and Obstacle Avoidance in Dynamic Environment. Sensors 2021, 21, 1468.
- Ball, D.; Upcroft, B.; Wyeth, G.; Corke, P.; English, A.; Ross, P.; Patten, T.; Fitch, R.; Sukkarieh, S.; Bate, A. Vision-based Obstacle Detection and Navigation for an Agricultural Robot. J. Field Robot. 2016, 33, 1107–1130.
- Zhang, Q.; Liu, Y.Q.; Gong, C.Y.; Chen, Y.Y.; Yu, H.H. Applications of Deep Learning for Dense Scenes Analysis in Agriculture: A Review. Sensors 2020, 20, 1520.
- Debnath, S.; Paul, M.; Debnath, T. Applications of LiDAR in Agriculture and Future Research Directions. J. Imaging 2023, 9, 57.
- Kragh, M.; Jorgensen, R.N.; Pedersen, H. Object Detection and Terrain Classification in Agricultural Fields Using 3D Lidar Data. Comput. Vis. Syst. 2015, 9163, 188–197.
- Champ, J.; Mora-Fallas, A.; Goëau, H.; Mata-Montero, E.; Bonnet, P.; Joly, A. Instance segmentation for the fine detection of crop and weed plants by precision agricultural robots. Appl. Plant Sci. 2020, 8, e11373.
- Shoaib, M.; Hussain, T.; Shah, B.B.; Ullah, I.; Shah, S.M.; Ali, F.; Park, S.H. Deep learning-based segmentation and classification of leaf images for detection of tomato plant disease. Front. Plant Sci. 2022, 13, 1031748.
- Bai, Y.H.; Guo, Y.X.; Zhang, Q.; Cao, B.Y.; Zhang, B.H. Multi-network fusion algorithm with transfer learning for green cucumber segmentation and recognition under complex natural environment. Comput. Electron. Agric. 2022, 194, 106789.
- Liu, X.; Zhao, D.; Jia, W.; Ji, W.; Ruan, C.; Sun, Y. Cucumber Fruits Detection in Greenhouses Based on Instance Segmentation. IEEE Access 2019, 7, 139635–139642.
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
- Feng, C.J.; Zhong, Y.J.; Gao, Y.; Scott, M.R.; Huang, W.L. TOOD: Task-aligned One-stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3490–3499.
- Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. arXiv 2020, arXiv:2006.04388.
- Huang, H.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C.; Song, Y.; Zhang, H.; Yan, F. Channel prior convolutional attention for medical image segmentation. Comput. Biol. Med. 2024, 178, 108784.
- Liu, S.; Qi, L.; Qin, H.F.; Shi, J.P.; Jia, J.Y. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
- Ji, X.; Chen, S.; Hao, L.Y.; Zhou, J.; Chen, L. FBDPN: CNN-Transformer hybrid feature boosting and differential pyramid network for underwater object detection. Expert Syst. Appl. 2024, 256, 124978.
- Yu, Z.P.; Huang, H.B.; Chen, W.J.; Su, Y.X.; Liu, Y.H.; Wang, X.Y. YOLO-FaceV2: A scale and occlusion aware face detector. Pattern Recognit. 2024, 155, 110714.
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
- Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
| Category | Training Set Instances | Validation Set Instances | Test Set Instances | Total Instances |
|---|---|---|---|---|
| unharvested area | 5216 | 1452 | 752 | 7420 |
| harvested area | 3289 | 942 | 461 | 4692 |
| farm | 2155 | 601 | 312 | 3068 |
| ridge between fields | 1843 | 521 | 264 | 2628 |
| slight lodging area | 1327 | 385 | 188 | 1900 |
| harvester | 704 | 198 | 102 | 1004 |
| obstacle | 563 | 162 | 81 | 806 |
| people | 488 | 135 | 67 | 690 |
| Experiment | Parameters | GFLOPs | PM | RM | mAP0.5M | mAP0.5:0.95M |
|---|---|---|---|---|---|---|
| YOLOv8n-seg | 3,259,624 | 12.0 | 0.907 | 0.892 | 0.929 | 0.676 |
| YOLOv8n-seg + P2 | 3,175,632 | 26.0 | 0.922 | 0.911 | 0.954 | 0.698 |
| YOLOv8n-seg + P2 + C2f_CPCA | 3,133,056 | 25.9 | 0.914 | 0.927 | 0.958 | 0.705 |
| YOLOv8n-seg + P2 + C2f_CPCA_1 | 3,140,096 | 26.0 | 0.916 | 0.912 | 0.955 | 0.699 |
| YOLOv8n-seg + P2 + C2f_CPCA + C3RFEM | 3,347,200 | 26.0 | 0.921 | 0.932 | 0.959 | 0.711 |
| Category | YOLOv8n-seg PM | YOLOv8n-seg RM | YOLOv8n-seg mAP0.5M | YOLOv8n-seg mAP0.5:0.95M | Ours PM | Ours RM | Ours mAP0.5M | Ours mAP0.5:0.95M |
|---|---|---|---|---|---|---|---|---|
| harvested area | 0.962 | 0.954 | 0.967 | 0.675 | 0.958 | 0.952 | 0.962 | 0.679 |
| obstacle | 0.976 | 0.969 | 0.985 | 0.894 | 0.975 | 0.983 | 0.988 | 0.904 |
| slight lodging area | 0.968 | 0.935 | 0.973 | 0.737 | 0.964 | 0.946 | 0.970 | 0.753 |
| harvester | 0.924 | 0.947 | 0.964 | 0.791 | 0.894 | 0.947 | 0.966 | 0.825 |
| ridge between fields | 0.792 | 0.598 | 0.718 | 0.344 | 0.830 | 0.796 | 0.856 | 0.436 |
| people | 0.859 | 0.803 | 0.855 | 0.498 | 0.870 | 0.898 | 0.921 | 0.545 |
| unharvested area | 0.936 | 0.911 | 0.956 | 0.693 | 0.938 | 0.945 | 0.967 | 0.744 |
| farm | 0.914 | 0.943 | 0.962 | 0.732 | 0.902 | 0.957 | 0.964 | 0.748 |
| Model | Parameters | GFLOPs | Inference Time (ms) | mAP0.5M | mAP0.5:0.95M |
|---|---|---|---|---|---|
| YOLOv5n | 1,889,221 | 6.8 | 4.2 | 0.741 | 0.460 |
| YOLOv5s | 7,417,301 | 25.7 | 8.5 | 0.845 | 0.554 |
| YOLOv5m | 21,680,645 | 69.9 | 15.3 | 0.896 | 0.609 |
| YOLOv7 | 3,032,992 | 172.3 | 22.1 | 0.921 | 0.636 |
| YOLOv8n | 3,259,624 | 12.0 | 5.1 | 0.929 | 0.676 |
| YOLOv8s | 11,782,696 | 42.5 | 10.8 | 0.959 | 0.762 |
| YOLOv9t | 3,032,992 | 17.1 | 6.2 | 0.813 | 0.505 |
| YOLOv10n | 2.3 M | 6.8 | 4.0 | 0.935 | 0.690 |
| YOLOv10s | 7.2 M | 21.0 | 8.0 | 0.965 | 0.780 |
| Mask R-CNN | 44.0 M | 275.0 | 48.5 | 0.950 | 0.720 |
| Mask2Former | 45.0 M | 290.0 | 52.0 | 0.970 | 0.790 |
| Ours | 3,347,200 | 26.0 | 9.2 | 0.959 | 0.711 |
| Model | mAP@0.5↓ * | mAP@0.5:0.95↓ * | GFLOPs |
|---|---|---|---|
| YOLOv8n-seg (benchmark) | 0.641 | 0.381 | 12.0 |
| Ours (YOLOv8n-seg + P2 + C2f_CPCA + C3RFEM) | 0.683 (+4.2%) | 0.422 (+4.1%) | 26.0 |
| Scenario Type | mAP@0.5↓ * | Recall↓ * | Main Failure Causes |
|---|---|---|---|
| Strong light (overexposure) | 0.892 (−7.0%) | 0.841 (−9.8%) | Loss of feature detail; CPCA attention fails |
| Long-range small target (<15 px) | 0.803 (−16.3%) | 0.712 (−23.6%) | Insufficient feature resolution |
| Heavy occlusion (>50%) | 0.867 (−9.6%) | 0.794 (−14.8%) | Discontinuous spatial information |
| Rain and fog interference | 0.908 (−5.3%) | 0.861 (−7.6%) | Reduced contrast and increased noise |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).