Article

Tomato Recognition Method Based on the YOLOv8-Tomato Model in Complex Greenhouse Environments

1 College of Mechanical and Electrical Engineering, Fujian Agriculture and Forestry University, Fuzhou 350002, China
2 Fujian University Engineering Research Center for Modern Agricultural Equipment, Fujian Agriculture and Forestry University, Fuzhou 350002, China
3 Fujian Key Laboratory of Green Intelligent Drive and Transmission for Mobile Machinery, Xiamen 361021, China
* Author to whom correspondence should be addressed.
Agronomy 2024, 14(8), 1764; https://doi.org/10.3390/agronomy14081764
Submission received: 4 July 2024 / Revised: 1 August 2024 / Accepted: 7 August 2024 / Published: 12 August 2024
(This article belongs to the Section Precision and Digital Agriculture)

Abstract:
Tomatoes are a critical economic crop. Realizing automated tomato harvesting is of great significance for solving labor shortages and improving the efficiency of current harvesting operations, and accurate recognition of fruits is the key to automated harvesting. Harvesting fruit at optimum ripeness ensures the highest nutrient content, flavor, and market value, thus maximizing economic benefits. Owing to foliage and non-target fruits obstructing target fruits, as well as color changes under varying light, current methods suffer from low recognition rates and missed detections. Taking the greenhouse tomato as the research object, this paper proposes a tomato recognition model based on an improved YOLOv8 architecture to detect tomato fruits in complex situations. First, to improve the model’s sensitivity to local features, we introduced the LSKA (Large Separable Kernel Attention) attention mechanism to aggregate feature information from different locations for better feature extraction. Secondly, to provide a higher-quality upsampling effect, the ultra-lightweight and efficient dynamic upsampler Dysample replaced the traditional nearest neighbor interpolation method, which improves the overall performance of YOLOv8. Subsequently, the Inner-IoU function replaced the original CIoU loss function to hasten bounding box regression and raise model detection performance. Finally, a model comparison test was conducted on the self-built dataset, and the results show that the mAP0.5 of the YOLOv8-Tomato model reached 99.4% and the recall rate reached 99.0%, exceeding the detection performance of the original YOLOv8 model. Compared with the Faster R-CNN, SSD, YOLOv3-tiny, YOLOv5, and YOLOv8 models, the mean average precision is 7.5%, 11.6%, 8.6%, 3.3%, and 0.6% higher, respectively. This study demonstrates the model’s capacity to efficiently and accurately recognize tomatoes in unstructured growing environments, providing a technical reference for automated tomato harvesting.

1. Introduction

Tomatoes are an important cash crop whose unique flavor, rich nutrients, and strong ecological adaptability have led to their widespread cultivation worldwide [1]. However, tomato harvesting still relies mainly on manual picking, which suffers from low harvesting efficiency, considerable human input, and high labor intensity, and harvest quality cannot be guaranteed [2]. Therefore, exploring more efficient harvesting methods is the key to promoting the growth of the tomato planting industry. Accurate identification of the fruit is not only a critical premise for realizing intelligent harvesting but also a vital guarantee of efficient, timely, and damage-free harvesting [3,4].
Accurately determining whether a tomato fruit is ripe is a critical part of the automated harvesting process. Only when the fruit is harvested at optimal maturity can its nutrient content and market value reach their best state. The ripeness of tomatoes is determined primarily by the surface color of the fruit: ripe tomatoes usually exhibit a uniform red or orange color (orange tomatoes can ripen under natural conditions during transport). In practice, however, automated harvesting robots face many challenges in recognizing ripeness. Tomato color varies significantly under the different light conditions caused by weather, which affects the accuracy of the color information in the image. Different light intensities and angles may cause the fruit surface to appear with different brightness and shades in the image, adding to the complexity of recognition. In addition, leaves, branches, and other non-target objects around the fruit may partially obscure the target fruit, further increasing the difficulty of recognition.
To overcome these challenges, advanced image processing and computer vision techniques are needed to improve the recognition rate and accuracy for ripe tomatoes. Traditional object detection methods based on image processing, such as image segmentation, image feature extraction, and color segmentation [5,6,7], have limitations, including low accuracy, poor robustness, and time-consuming processing in unstructured growing environments where the target is interfered with by occlusion, noise, and light, and they struggle to meet practical needs. Zhao et al. [8] proposed a non-color-coded tomato fruit recognition algorithm, which obtains a number of weak classifiers through threshold judgment based on Haar-like features and uses the AdaBoost algorithm to combine them into a strong classifier for recognizing red ripe tomato fruits. The recognition rate of this method for ripe tomato fruits is low under poor lighting conditions and severe fruit occlusion. In recent years, object detection based on deep learning has been used extensively in agriculture and has effectively alleviated the problems of traditional target detection methods. These methods are mainly based on Convolutional Neural Networks (CNNs) [9,10]. According to the processing pipeline, deep learning detectors are often divided into two categories: two-stage detection and one-stage detection. The R-CNN series of algorithms is the most representative of two-stage detection [11,12,13,14]. Wang et al. [15] used ResNet-50 as the backbone, utilized RoIAlign (Region of Interest Align) to obtain more accurate bounding boxes in the feature mapping stage, and introduced PAN to address the difficulty of tomato detection in complex environments. Fu et al. [16] used backpropagation with ZFNet (Zeiler and Fergus Net) combined with the Stochastic Gradient Descent (SGD) algorithm for end-to-end training of a Faster R-CNN network, which substantially enhanced the precision and speed of the model for detecting kiwifruits. Long et al. [17] proposed an improved Mask R-CNN method for segmenting tomato fruits of different ripeness in a greenhouse environment, fusing a Cross Stage Partial Network (CSPNet) with a Residual Network (ResNet) in the Mask R-CNN network; the final average accuracy was 95.45%. Despite its high accuracy, the R-CNN family of algorithms cannot meet real-time detection requirements because of its long detection time and large network size. In contrast, one-stage detection methods have faster detection speed and higher scalability and are thus more applicable in practice. The YOLO [18,19,20,21,22] family of algorithms is the most well-known and widely used one-stage detection technique, and numerous researchers have applied the YOLO series in agriculture. Zhang et al. [23] improved the YOLOv3 network and clustered the constructed fruit dataset using the K-means clustering method, achieving a recognition rate of 95%. Chen et al. [24] proposed an improved YOLOv3 cherry tomato detection algorithm that uses a dual-path network as the feature extraction network, establishes four feature layers at different scales for multi-scale prediction, and calculates the anchor box scales using an improved K-means++ clustering algorithm; the final accuracy of the algorithm reached 94.29%.
Gai et al. [25] proposed an improved YOLOv4 algorithm to address the low detection accuracy of cherry fruits under environmental problems such as shading and to realize fast and precise cherry fruit identification; compared with the benchmark YOLOv4 model, its mean average precision improved by 1.2 percentage points. Li et al. [26] proposed a recognition method combining YOLO v4 and HSV (Hue, Saturation, Value). By comparing the recognition effect of this algorithm on the test set under different percentages, 16% was selected as the percentage for the ripe tomato recognition algorithm, and the accuracy of the YOLO v4 + HSV algorithm at this percentage was 94.77%. Xiong et al. [27] used a lightweight YOLOv5-Lite to classify and recognize papaya ripeness, achieving an overall detection accuracy of 92.4% for fruits with different shooting distances, shading situations, and lighting. Rong et al. [28] proposed an improved YOLOv5 tomato detection algorithm that fuses RGB images and depth images as inputs and uses ByteTrack to track tomato clusters across consecutive frames; the improved algorithm achieved 97.9% detection accuracy. Long et al. [29] made improvements based on YOLOv7 and designed a model for apple detection during the fruit-thinning stage that is applicable to small object detection in near-scene color; compared with the original model, the mAP0.5 value, precision, and recall increased by 2.3%, 0.9%, and 1.3%, respectively. Chen et al. [30] proposed an MTD-YOLOv7 model for recognizing cherry tomatoes, which greatly enhanced the identification accuracy and detection effectiveness of the model by adding two additional decoders, designing a loss function based on multi-task features, and employing SIoU instead of CIoU. Yang et al. [31] replaced the conventional convolution operation in the YOLOv8s model with Depthwise Separable Convolution (DSConv), designed a Dual Path Attention Gate (DPAG) module, and introduced a Feature Enhancement Module (FEM), which effectively improved the detection accuracy of the model in complex environments to 93.4%.
Although the above studies have demonstrated feasibility in fruit and vegetable recognition, in scenes occluded by fruits and leaves, the blurring or loss of tomato features caused by the volume, shape, and color of the occluding objects significantly affects the network model’s feature extraction. Moreover, the large number of tomato targets and the small scale and low resolution of individual targets increase the probability of missed detection by the model. In dense scenes, noise interferes with tomato features, which makes it difficult for the network model to classify and localize individual tomatoes accurately; this process is also prone to false detection. The interaction of these factors poses significant challenges for existing detection models when detecting tomatoes in complex and highly dense greenhouse environments, making it difficult for the models to achieve the desired accuracy and robustness. Current improved models for tomatoes perform well in improving detection accuracy, enhancing robustness, and adapting to diverse environments. However, these models also face several challenges, including high data collection costs, increased computational complexity, risk of overfitting, long training times, and difficulties in practical deployment that may result from larger models. Therefore, to improve the recognition accuracy of the model and provide a reference for automated agricultural harvesting technology, a tomato object recognition model, YOLOv8-Tomato, is proposed in this paper. In the Spatial Pyramid Pooling Fast (SPPF) layer of the model, the LSKA attention mechanism was included to raise the feature extraction and perception capacity of the model. In the model’s neck layer, a lightweight Dysample upsampler, which is lighter than the original upsampling operation, was used. The Inner-IoU function was also utilized to substitute the model’s original loss function to reduce the misdetection caused by multiple targets. The improved model was evaluated on a self-collected dataset, which effectively verified the feasibility and reliability of the method.

2. Materials and Methods

2.1. Image Acquisition and Dataset Construction

2.1.1. Image Acquisition

The tomato image acquisition site used for the recognition experiments in this study was a tomato cultivation greenhouse at the Yinong Agricultural Base, Changle District, Fuzhou City, Fujian Province, China (longitude 119°47′ E, latitude 25°91′ N), with the Syngenta Sibede variety as the study subject. The image acquisition equipment was an Intel RealSense D435i depth camera (Intel Corporation, Santa Clara, CA, USA), and images were captured on 20 March 2024, from 9 a.m. to 5 p.m. Under the direction of a tomato breeder, tomatoes of various maturity stages were photographed, including fruits shaded to varying degrees by sister fruits and plant leaves and fruits under frontlight and backlight owing to distinct sunlight exposure angles. Invalid images, such as blurred and repeated ones, were filtered out, and 1336 images of tomatoes at various phases of ripeness were collected and saved in JPG format. The types of images collected were single fruit images, overlapping fruit images, frontlight images, backlight images, plant leaf shade images, near-distance images, and far-distance images. Some of the photos taken under various conditions are shown in Figure 1.

2.1.2. Data Preprocessing and Dataset Construction

To enrich the diversity of the samples, mitigate overfitting during training, and improve the generalization of network models to tomato images, offline data enhancement operations such as panning, rotating, flipping, varying brightness, blurring, and adding Gaussian noise were applied to the tomato images using OpenCV; Figure 2 illustrates the specific results. Among them, the brightness change operation simulated the lighting conditions in different weather, the rotation and flip operations mimicked changes in the detection equipment’s shooting angle, and the blurring and noise operations simulated artifacts that the detection equipment may capture when acquiring images. The dataset was finally expanded to 7800 images and used as the dataset for this experiment.
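A minimal OpenCV sketch of these offline enhancement operations is shown below; the brightness offsets, blur kernel size, and noise standard deviation are illustrative assumptions rather than the exact settings used for the dataset.

```python
import cv2
import numpy as np

def augment(img):
    """Apply the offline enhancement operations described above to one image."""
    out = []
    out.append(cv2.convertScaleAbs(img, alpha=1.0, beta=40))    # brighten
    out.append(cv2.convertScaleAbs(img, alpha=1.0, beta=-40))   # dim
    out.append(cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE))        # rotate by 90 degrees
    out.append(cv2.flip(img, 0))                                 # mirror vertically
    out.append(cv2.flip(img, 1))                                 # mirror horizontally
    out.append(cv2.GaussianBlur(img, (5, 5), 0))                 # blur
    noise = np.random.normal(0, 15, img.shape).astype(np.int16)  # Gaussian noise
    out.append(np.clip(img.astype(np.int16) + noise, 0, 255).astype(np.uint8))
    return out
```

Note that for geometric operations such as rotation, flipping, and panning, the corresponding YOLO-format bounding boxes must be transformed in the same way as the image.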
To determine whether tomato fruits were ripe, the fruits were divided into two categories: unripe tomatoes (including young green fruits and yellow tomatoes at the expansion stage) and ripe tomatoes (bright red). The dataset was divided into training, validation, and test sets at a ratio of 7:1:2. The training set included 5460 images, the validation set comprised 780 images, and the test set comprised 1560 images. The tomato ripeness categories within the samples were manually labeled using the LabelImg 1.8.6 software tool. The annotation process comprised marking category and location information, where the categories were divided into ripe and unripe labels. The annotations were converted into the YOLO format and saved as .txt files. A total of 25,150 tomatoes of different ripeness levels were labeled. The labeling results were finally provided to cultivation experts for confirmation to ensure the accuracy of the tomato labeling, thus completing the construction of the dataset. Figure 3 shows the classification of tomatoes.
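For reference, each YOLO-format .txt label file contains one line per annotated tomato with the class index and the box center, width, and height normalized to the image size. The tiny parser below illustrates the format; the class mapping (0 = ripe, 1 = unripe) is an assumption for illustration only.

```python
def parse_yolo_label(line):
    """Parse one YOLO-format annotation line: class_id and a normalized box."""
    cls, xc, yc, w, h = line.split()
    return int(cls), float(xc), float(yc), float(w), float(h)

# Example line from a label file (class 0 = ripe is an assumed mapping):
print(parse_yolo_label("0 0.512 0.437 0.118 0.156"))
```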

2.2. Improvements to YOLOv8

2.2.1. YOLOv8 Object Detection Algorithm

The YOLOv8 algorithm is the 8th version of the YOLO series. YOLOv8’s network architecture carries on the three-part concept of YOLOv5, which is still split into the Cross Stage Partial backbone network (backbone), the feature-enhanced network (neck), and the detecting head (head). A lighter C2f module replaces the C3 module of YOLOv5 in the backbone, therefore improving the receptive field and feature representation capacity. In terms of the Feature Enhancement Network, the idea of Path Aggregation Feature Pyramid Network (PA-FPN) is adopted by removing the convolution from the sampling stage of YOLOv5’s PA-FPN and replacing the C3 module with the C2f module, which helps the network to extract characteristics at many levels and enhances the ability of tiny item identification. The accuracy and robustness of object detection are further enhanced by cascading feature maps with varying degrees of scale. For the detection head, the decoupled head is used to enhance the training and inference efficiency of the network. Figure 4 displays the YOLOv8 network architecture.
Among the YOLOv8 series models, YOLOv8n has the smallest model size, which aligns with the practical application conditions of tomato harvesting. Therefore, this study chose to design and improve the model for YOLOv8n.

2.2.2. YOLOv8-Tomato Object Detection Algorithm

This paper proposes an improved tomato recognition model based on the YOLOv8 model for tomato fruit detection in complex situations. Firstly, to improve the sensitivity of the model to local features, this paper introduces the LSKA attention mechanism in the SPPF (Spatial-Pyramid-Pooling-Fast) layer of the base model to aggregate feature information from different locations for better feature extraction. Secondly, to provide a higher-quality upsampling effect, the traditional nearest neighbor interpolation method was replaced by Dysample, an ultra-lightweight and efficient dynamic up-sampler, which improved the overall performance of YOLOv8. Subsequently, the Inner-IoU function replaced the original CIoU loss function, which sped up the bounding box regression and improved the model detection performance. Figure 5 illustrates the network and structure diagram of the improved YOLOv8-Tomato model, where specific improvements are identified using red dashed boxes.

2.2.3. Large Separable Kernel Attention

Large Separable Kernel Attention (LSKA) is an attention module designed for Visual Attention Networks (VANs) [32]. It addresses the quadratic increase in the computational and memory demands of the depth-wise convolutional layers in the LKA module as the convolutional kernel size grows. LSKA facilitates the use of large convolutional kernels in VANs’ attention modules by decomposing a 2D convolutional kernel into cascaded horizontal and vertical 1D convolutional kernels. This decomposition reduces computational complexity and memory usage compared with the traditional Large Kernel Attention (LKA) design, enabling the direct application of depth-wise convolutional layers with very large kernels in the attention module without additional blocks. As the convolutional kernel size increases, this approach also improves the capture of shape information over texture information.
Figure 6a displays the LKA design, in which a depth-wise dilated convolution with a fairly large kernel is used to alleviate the high computational cost that arises because the depth-wise convolution in the LKA module grows quadratically in computational complexity as the kernel size increases. Figure 6b shows the modified LSKA configuration of the LKA module, obtained by splitting the 2D separable weight kernels of the depth-wise convolution and the depth-wise dilated convolution into two cascaded 1D separable weight kernels.
The specific output formulae are as follows. First, given the input feature map $F \in \mathbb{R}^{C \times H \times W}$, the number of input channels is represented by $C$, and the feature map’s height and width are denoted by $H$ and $W$, respectively.
$$Z^{C} = \sum_{H,W} W_{k \times k}^{C} \ast F^{C} \tag{1}$$
$$A^{C} = W_{1 \times 1} \ast Z^{C} \tag{2}$$
$$\bar{F}^{C} = A^{C} \otimes F^{C} \tag{3}$$
$$\bar{Z}^{C} = \sum_{H,W} W_{(2d-1) \times (2d-1)}^{C} \ast F^{C} \tag{4}$$
$$Z^{C} = \sum_{H,W} W_{\lceil k/d \rceil \times \lceil k/d \rceil}^{C} \ast \bar{Z}^{C} \tag{5}$$
In this case, $\otimes$ and $\ast$ stand for the Hadamard product and convolution, respectively, and $d$ is the dilation rate. In Equation (1), $Z^{C}$ is the output of the depth-wise convolution obtained by convolving the kernel $W$ of size $k \times k$ with the input feature map $F$; $k$ also represents the maximum receptive field of kernel $W$. In Equation (2), $A^{C}$ is the attention map obtained by applying a $1 \times 1$ convolution, and the LKA output $\bar{F}^{C}$ in Equation (3) is the Hadamard product of the input feature map $F^{C}$ and the attention map $A^{C}$. Equations (4) and (5) give the decomposed form of LKA. By splitting the 2D weight kernels of the depth-wise convolution and the depth-wise dilated convolution into two cascaded 1D separable weight kernels, the LSKA structure is obtained; its outputs are given by Equations (6) and (7).
$$\bar{Z}^{C} = \sum_{H,W} W_{(2d-1) \times 1}^{C} \ast \left( \sum_{H,W} W_{1 \times (2d-1)}^{C} \ast F^{C} \right) \tag{6}$$
$$Z^{C} = \sum_{H,W} W_{\lceil k/d \rceil \times 1}^{C} \ast \left( \sum_{H,W} W_{1 \times \lceil k/d \rceil}^{C} \ast \bar{Z}^{C} \right) \tag{7}$$
In Equation (6), $\bar{Z}^{C}$ denotes the output of the depth-wise convolution with a kernel size of $(2d-1) \times 1$, which compensates for the gridding effect of the subsequent dilated convolution and captures local spatial information. In Equation (7), the dilated depth-wise convolution with a kernel size of $\lceil k/d \rceil \times 1$ captures the global spatial information of the depth-wise convolution output.
Figure 7 shows the structural diagram of adding the LSKA attention mechanism to the Spatial-Pyramid-Pooling-Fast (SPPF) layer, which is improved according to this experiment, where the LSKA module is added to the structure of the SPPF after all the Maximum Pooling Layers (MaxPool2d) operations are completed and before the second convolutional layer (Conv). In the forward method of the code, the input x is first passed through a convolutional layer cv1 and then through three consecutive maximum pooling layers M. The outputs of these layers are concatenated, and the whole concatenated output is fed to the LSKA module. After processing by the LSKA module, the output is fed to another convolutional layer, cv2.
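For illustration, a minimal PyTorch sketch of the LSKA module (following Equations (2), (3), (6), and (7)) and of the SPPF layer with LSKA inserted after the pooling concatenation is given below. The kernel size k = 23 and dilation d = 3 are illustrative choices, and plain nn.Conv2d layers stand in for YOLOv8’s Conv blocks (which also include batch normalization and SiLU); this is a sketch of the structure in Figure 7, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """LSKA sketch: cascaded 1D depth-wise convs (local), cascaded dilated 1D
    depth-wise convs (global), a 1x1 conv forming the attention map, and
    Hadamard gating of the input feature map."""
    def __init__(self, dim, k=23, d=3):
        super().__init__()
        local, glob = 2 * d - 1, k // d
        self.conv_h = nn.Conv2d(dim, dim, (1, local), padding=(0, local // 2), groups=dim)
        self.conv_v = nn.Conv2d(dim, dim, (local, 1), padding=(local // 2, 0), groups=dim)
        self.conv_h_d = nn.Conv2d(dim, dim, (1, glob), padding=(0, (glob // 2) * d),
                                  dilation=d, groups=dim)
        self.conv_v_d = nn.Conv2d(dim, dim, (glob, 1), padding=((glob // 2) * d, 0),
                                  dilation=d, groups=dim)
        self.conv1x1 = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.conv_v(self.conv_h(x))          # local spatial information, Eq. (6)
        attn = self.conv_v_d(self.conv_h_d(attn))   # global spatial information, Eq. (7)
        attn = self.conv1x1(attn)                   # attention map, Eq. (2)
        return x * attn                             # Hadamard gating, Eq. (3)

class SPPF_LSKA(nn.Module):
    """SPPF with LSKA placed after the max-pooling concatenation (Figure 7)."""
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = nn.Conv2d(c1, c_, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.lska = LSKA(c_ * 4)
        self.cv2 = nn.Conv2d(c_ * 4, c2, 1)

    def forward(self, x):
        x = self.cv1(x)                              # first convolutional layer
        y1 = self.m(x)
        y2 = self.m(y1)
        y3 = self.m(y2)                              # three cascaded max-pooling layers
        out = torch.cat((x, y1, y2, y3), dim=1)      # concatenate all pooled scales
        return self.cv2(self.lska(out))              # LSKA before the second conv layer
```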

2.2.4. Dysample Dynamic Upsampler

DySample (School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China) is an ultra-lightweight and efficient dynamic upsampler, constructed by reconstructing the upsampling process from the perspective of point sampling; it effectively reduces parameters, FLOPs (floating-point operations), GPU memory, and latency, and dramatically improves the efficiency of resource utilization [33]. The improved model maintains the same feature map input as the original YOLOv8 but no longer uses the traditional nearest neighbor interpolation when upsampling the feature map. Instead, the input feature maps are upsampled more finely by generating content-aware sampling points. The upsampled feature maps are then fused pixel-by-pixel with the larger-scale feature maps, and features are extracted by a convolutional layer to generate the feature maps that are finally used for detection. Figure 8 shows the network structure diagram after replacing the upsampler with Dysample.
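To make the point-sampling idea concrete, the sketch below shows a minimal content-aware upsampler in the spirit of DySample: a 1 × 1 convolution predicts per-pixel offsets, which are added to a regular sampling grid and used with grid_sample instead of nearest neighbor interpolation. The class name, the 0.25 offset scale, and the offset-head design are illustrative assumptions and not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleLite(nn.Module):
    """Minimal sketch of point-sampling dynamic upsampling."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # 2 offset values (x, y) for every upsampled position
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        # predicted offsets, rearranged to the upsampled resolution (b, 2, h*s, w*s)
        off = F.pixel_shuffle(self.offset(x), s) * 0.25
        # regular base sampling grid, normalized to [-1, 1]
        ys = torch.linspace(-1, 1, h * s, device=x.device)
        xs = torch.linspace(-1, 1, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = grid + off.permute(0, 2, 3, 1)        # add content-aware offsets
        return F.grid_sample(x, grid, mode="bilinear", align_corners=False)
```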

2.2.5. Inner-CIoU Loss Function

During automated tomato harvesting, the orchard environment is complex: tomatoes are covered by branches and leaves and the fruits overlap heavily, which makes it difficult for the harvesting robot to recognize them. To improve accuracy and speed up bounding box regression, we replaced the CIoU [34] loss function with the Inner-CIoU loss function, which enhances CIoU with Inner-IoU [35]. The CIoU loss function only considers the shape loss, and when the target box and the bounding box have the same aspect ratio, the CIoU loss function degenerates into the IoU loss function.
To address the slow convergence and inadequate generalization of the CIoU loss function in detection tasks, Inner-IoU proposes an auxiliary bounding-box-based loss computation method to speed up the bounding box regression. By incorporating a scaling factor and employing auxiliary bounding boxes of different scales depending on the particular dataset and detector, Inner-IoU strengthens the model’s robustness and generalization capacity. Figure 9 depicts the Inner-IoU loss function, where the blue solid box denotes the ground-truth bounding box, the blue dashed box denotes the anchor box, the black solid box denotes the auxiliary bounding box, and the black dashed box denotes the auxiliary anchor box. The blue region indicates the overlap between the ground-truth bounding box and the anchor box, and the yellow region indicates the intersection between the auxiliary bounding box and the auxiliary anchor box. These two overlap regions are important for calculating the Intersection over Union (IoU) metric. The ground-truth bounding box and the anchor box are denoted by $b^{gt}$ and $b$, respectively. The center of the ground-truth bounding box is denoted by $(x_{c}^{gt}, y_{c}^{gt})$, the center of the anchor box by $(x_{c}, y_{c})$, the width and height of the ground-truth bounding box by $w^{gt}$ and $h^{gt}$, and those of the anchor box by $w$ and $h$, with the scale factor $r \in [0.5, 1.5]$. The formulae for calculating the Inner-CIoU are given below.
$$IoU_{inner} = \frac{inter}{union} \tag{8}$$
$$b_{l}^{gt} = x_{c}^{gt} - \frac{w^{gt} \cdot r}{2}, \quad b_{r}^{gt} = x_{c}^{gt} + \frac{w^{gt} \cdot r}{2} \tag{9}$$
$$b_{t}^{gt} = y_{c}^{gt} - \frac{h^{gt} \cdot r}{2}, \quad b_{b}^{gt} = y_{c}^{gt} + \frac{h^{gt} \cdot r}{2} \tag{10}$$
$$b_{l} = x_{c} - \frac{w \cdot r}{2}, \quad b_{r} = x_{c} + \frac{w \cdot r}{2} \tag{11}$$
$$b_{t} = y_{c} - \frac{h \cdot r}{2}, \quad b_{b} = y_{c} + \frac{h \cdot r}{2} \tag{12}$$
In Equation (8), $IoU_{inner}$ is the IoU of Inner-IoU, $inter$ is the area where the auxiliary anchor box intersects the auxiliary bounding box, and $union$ is the area of their union. In Equation (9), $b_{l}^{gt}$ and $b_{r}^{gt}$ are the horizontal coordinates of the left and right boundaries of the auxiliary bounding box, and $r$ is the scaling factor that controls the size of the auxiliary bounding box. In Equation (10), $b_{t}^{gt}$ and $b_{b}^{gt}$ are the vertical coordinates of the top and bottom boundaries of the auxiliary bounding box. In Equation (11), $b_{l}$ and $b_{r}$ are the horizontal coordinates of the left and right boundaries of the auxiliary anchor box, and in Equation (12), $b_{t}$ and $b_{b}$ are the vertical coordinates of the top and bottom boundaries of the auxiliary anchor box. The intersection and union areas are given by
$$inter = \left( \min(b_{r}^{gt}, b_{r}) - \max(b_{l}^{gt}, b_{l}) \right) \cdot \left( \min(b_{b}^{gt}, b_{b}) - \max(b_{t}^{gt}, b_{t}) \right) \tag{13}$$
$$union = w^{gt} h^{gt} r^{2} + w h r^{2} - inter \tag{14}$$
$$L_{IoU} = 1 - IoU \tag{15}$$
$$IoU = \frac{\left| B \cap B^{gt} \right|}{\left| B \cup B^{gt} \right|} \tag{16}$$
In Equation (15), $L_{IoU}$ denotes the IoU loss function, and Equation (16) gives the IoU, where $B$ represents the anchor box area and $B^{gt}$ represents the ground-truth bounding box area. The CIoU and Inner-CIoU losses are then
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}\left( b, b^{gt} \right)}{d^{2}} + \alpha v \tag{17}$$
$$v = \frac{4}{\pi^{2}} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^{2} \tag{18}$$
$$\alpha = \frac{v}{1 - IoU + v} \tag{19}$$
$$L_{Inner\text{-}CIoU} = L_{CIoU} + IoU - IoU_{inner} \tag{20}$$
$L_{CIoU}$ denotes the CIoU loss function, where $\rho$ is the Euclidean distance between the box centers, $d$ is the diagonal of the minimal enclosing box, $v$ measures the consistency of the aspect ratio, and $\alpha$ is the trade-off parameter; $L_{Inner\text{-}CIoU}$ stands for the Inner-CIoU loss function.
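A compact sketch of the Inner-CIoU computation for boxes in $(x_c, y_c, w, h)$ form, following Equations (8)-(20), is given below; the function name and the default ratio of 0.75 are illustrative, and the official Inner-IoU implementation may differ in detail.

```python
import math
import torch

def inner_ciou(box_pred, box_gt, ratio=0.75, eps=1e-7):
    """Inner-CIoU loss sketch for (x_c, y_c, w, h) boxes, ratio in [0.5, 1.5]."""
    xc, yc, w, h = box_pred.unbind(-1)
    xcg, ycg, wg, hg = box_gt.unbind(-1)

    # ---- plain IoU between the original boxes, Eq. (16) ----
    x1, y1, x2, y2 = xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2
    x1g, y1g, x2g, y2g = xcg - wg / 2, ycg - hg / 2, xcg + wg / 2, ycg + hg / 2
    inter = (torch.min(x2, x2g) - torch.max(x1, x1g)).clamp(0) * \
            (torch.min(y2, y2g) - torch.max(y1, y1g)).clamp(0)
    union = w * h + wg * hg - inter + eps
    iou = inter / union

    # ---- Inner-IoU between auxiliary (scaled) boxes, Eqs. (8)-(14) ----
    iw, ih, iwg, ihg = w * ratio, h * ratio, wg * ratio, hg * ratio
    inter_in = (torch.min(xc + iw / 2, xcg + iwg / 2) -
                torch.max(xc - iw / 2, xcg - iwg / 2)).clamp(0) * \
               (torch.min(yc + ih / 2, ycg + ihg / 2) -
                torch.max(yc - ih / 2, ycg - ihg / 2)).clamp(0)
    union_in = iw * ih + iwg * ihg - inter_in + eps
    iou_inner = inter_in / union_in

    # ---- CIoU terms: center distance and aspect-ratio consistency, Eqs. (17)-(19) ----
    cw = torch.max(x2, x2g) - torch.min(x1, x1g)      # enclosing box width
    ch = torch.max(y2, y2g) - torch.min(y1, y1g)      # enclosing box height
    rho2 = (xc - xcg) ** 2 + (yc - ycg) ** 2
    d2 = cw ** 2 + ch ** 2 + eps
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    l_ciou = 1 - iou + rho2 / d2 + alpha * v

    return l_ciou + iou - iou_inner                   # Eq. (20)
```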

2.3. Model Training and Performance Evaluation

2.3.1. Experimental Environment

The training environment hardware for this experiment comprised a 12th Gen Intel(R) Core(TM) i7-12700H CPU (Intel Corporation, Santa Clara, CA, USA), an NVIDIA GeForce RTX 3050 Ti Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA), and 16 GB RAM (Samsung Electronics Co., Ltd., Seoul, Republic of Korea). The software environment was Windows 11, PyTorch 1.13.1, CUDA 11.6, and Python 3.8.18. Iterative optimization was performed using the SGD optimizer during model training, which, due to its stochastic nature, effectively reduced the possibility of the model falling into a local optimum and thus achieved faster convergence. The final training parameters of the model in this experiment were set as follows: the training period was 200 epochs, the batch size was 16, the initial learning rate was 0.01, and the weight decay coefficient was 0.0005. By randomly splicing four images together after applying different image enhancements to the input images, Mosaic data augmentation during model training improved the model’s generalization and increased the diversity of the data while also strengthening the model’s robustness. Since the dataset was rich enough after Mosaic data enhancement, this experiment turned off Mosaic data enhancement in the last 10 training rounds to improve the model’s performance.
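For reference, the training configuration described above maps roughly onto the ultralytics training API as sketched below. The dataset file tomato.yaml and the use of the stock yolov8n.yaml are illustrative placeholders; the improved YOLOv8-Tomato network would be described by its own model YAML.

```python
from ultralytics import YOLO

# Minimal sketch of the training setup described above (assumed file names).
model = YOLO("yolov8n.yaml")          # base model; the improved variant needs a custom YAML
model.train(
    data="tomato.yaml",               # dataset splits and the two ripeness classes
    epochs=200,                       # 200 training rounds
    batch=16,                         # batch size
    optimizer="SGD",                  # stochastic gradient descent
    lr0=0.01,                         # initial learning rate
    weight_decay=0.0005,              # weight decay coefficient
    close_mosaic=10,                  # disable Mosaic augmentation for the last 10 epochs
)
```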

2.3.2. Assessment of Indicators

To assess how well the improved model performed, this study primarily used precision (P), recall (R), average precision (AP), and mean average precision (mAP) as evaluation metrics. Precision gauges how accurately the model identifies a category by measuring the proportion of samples predicted to belong to that category (e.g., ripe tomatoes) that actually belong to it. Recall measures the proportion of samples in a category that the model correctly predicts, reflecting the model’s coverage ability, i.e., its ability to find all samples of that category. AP is the area under the precision-recall curve, i.e., the average of the precision at various recall thresholds, and mAP is the mean of the AP values over all categories, providing a comprehensive metric for evaluating the model’s performance under different detection thresholds.
$$P = \frac{TP}{TP + FP} \tag{21}$$
$$R = \frac{TP}{TP + FN} \tag{22}$$
$$AP = \int_{0}^{1} P(R)\, dR \tag{23}$$
$$mAP = \frac{1}{k} \sum_{i=1}^{k} AP_{i} \tag{24}$$
In this instance, TP indicates the number of tomatoes that were correctly predicted to be ripe, FP indicates the number of tomatoes that were mistakenly predicted to be ripe (actually unripe), and FN indicates the number of tomatoes that were erroneously predicted to be unripe but were actually ripe. $AP_{i}$ is the average precision of category $i$, and $k$ is the number of categories. mAP is the mean average precision over all categories, which can comprehensively evaluate the accuracy and robustness of the algorithm; a greater mAP value corresponds to higher prediction accuracy. These evaluation indexes comprehensively measure the performance of the network model in the tomato ripeness detection experiments, ensuring the effectiveness and reliability of the model in practical applications.
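As a small illustration of Equations (23) and (24), the snippet below approximates AP as the area under a precision-recall curve by trapezoidal integration and averages the per-class values into mAP; the sample PR points are made-up toy numbers.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (Eq. 23), trapezoidal approximation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # monotone precision envelope
    return float(np.trapz(p, r))

# Toy usage with illustrative PR points for the two classes (ripe, unripe)
ap_ripe = average_precision(np.array([0.5, 0.8, 0.99]), np.array([0.99, 0.97, 0.95]))
ap_unripe = average_precision(np.array([0.5, 0.8, 0.98]), np.array([0.98, 0.96, 0.94]))
mAP = (ap_ripe + ap_unripe) / 2                # Eq. (24) with k = 2 categories
print(f"AP(ripe)={ap_ripe:.3f}, AP(unripe)={ap_unripe:.3f}, mAP={mAP:.3f}")
```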

3. Results and Discussion

3.1. Visualization

3.1.1. Graphical Analysis of Results

To validate the improved model’s effectiveness, the training behavior of the original and improved models on the same dataset was compared. Figure 10 shows the changes in the precision, recall, mAP0.5, and mAP0.5:0.95 curves of the YOLOv8 and YOLOv8-Tomato models during training.
According to the four sets of precision and recall curves in Figure 10, in the first 50 rounds of iterative training, the YOLOv8-Tomato model metrics improved rapidly, which indicates that its training efficiency is high. It could quickly achieve high accuracy, thus speeding up the deployment and use of the model in real applications. As training continued, the metrics of the YOLOv8-Tomato model gradually levelled off and converged by the 200th round, indicating that the model had reached an ideal fitting state. The precision curve in Figure 10a is generally slightly higher than that of YOLOv8, and the difference is most obvious in the early training cycles. Although the accuracy of both models tends to be close to one, YOLOv8-Tomato maintains a higher accuracy in the vast majority of training cycles, indicating that it performs better in avoiding false positives. In Figure 10b, the recall curve behaves similarly to the precision curve, and the recall of YOLOv8-Tomato is also generally higher than that of YOLOv8, especially in the early stage of training, indicating that YOLOv8-Tomato detects more true positives. Figure 10c shows mAP0.5, the average precision at an IoU threshold of 0.5; the plot shows that the integrated detection performance of YOLOv8-Tomato was better than that of YOLOv8. Figure 10d shows that the mAP0.5:0.95 of YOLOv8-Tomato was consistently higher than that of YOLOv8, which indicates that its performance is more stable under different IoU thresholds and that it has higher detection accuracy and stability under different IoU standards. Since the mAP value combines the precision and recall of the two ripeness categories of greenhouse tomatoes, the weights of the best-performing round in mAP were used as the final weights of the model. In this state, the YOLOv8-Tomato model has a precision of 99.1%, a mAP0.5 of 99.4%, and a recall of 99.0%. In comparison, the YOLOv8 model has a precision of 96.1%, a mAP0.5 value of 98.8%, and a recall of 96.6%. The improved model is 3.0%, 0.6%, and 2.4% higher than the original YOLOv8 in precision, mAP0.5, and recall, respectively, which indicates that the improved YOLOv8-Tomato model greatly improves recall and precision.
The loss curve in Figure 11 shows the ground truth box’s deviation from the predicted box in the model’s validation set. From the curves, we can find that the initial loss values of both models are high, while the initial loss value of YOLOv8-Tomato is slightly lower than that of YOLOv8. However, with the increase of the training period, the loss values of both models decrease rapidly, which indicates that the models can learn the location of the bounding box quickly during the training process and reduce the deviation of the ground truth box from the predicted box effectively. After about 50 training cycles, the loss values of the two models slowed down and gradually levelled off, indicating that the models gradually converged. This can be observed by comparing the loss function change curves of the two networks. YOLOv8-Tomato showed better performance throughout the training process, and its loss values were always lower than those of YOLOv8, which indicates that the YOLOv8-Tomato model has a smaller amount of error and higher accuracy in the validation set. This indicates that the model maintains a low classification error rate in the validation set and has good generalization ability.

3.1.2. Visualization and Analysis of Grad-CAM

Grad-CAM (Gradient-weighted Class Activation Mapping) is a visualization method that highlights the areas of an image that the model considers important for classification by computing gradient information. Grad-CAM’s heat map analysis visualization makes it feasible to see the areas that the model concentrates on while analyzing photos. The heat maps of the YOLOv8n and YOLOv8-Tomato models are shown in Figure 12.
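The sketch below illustrates the generic Grad-CAM computation with forward and backward hooks: gradients of a scalar score are averaged per channel, used to weight the activations of a chosen convolutional layer, passed through a ReLU, and upsampled to the image size. The target layer and score function are assumptions; this is not necessarily the exact visualization pipeline used in this study.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, score_fn):
    """Generic Grad-CAM sketch: `image` is a (1, 3, H, W) tensor and `score_fn`
    maps the model output to a scalar (e.g., the confidence of one detection)."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    model.zero_grad()
    score = score_fn(model(image))                 # scalar score to explain
    score.backward()
    h1.remove(); h2.remove()

    acts, grads = activations[0], gradients[0]     # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)           # channel-wise gradient weights
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))  # weighted sum of feature maps
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam                                      # heat map to overlay on the image
```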
In the heat map of the YOLOv8 model, the model mainly focuses on some regions of the tomato and some irregularly distributed attention points. Although the model captures some of the features of the tomato, the overall area of attention is more dispersed and does not focus on the critical features of the tomato, indicating that the model has a low concentration of attention in discriminating the ripeness of a tomato. In contrast, the heat map under the YOLOv8-Tomato model shows that the model focused more on the whole surface of the tomato, especially the edge and center regions, covering more key feature areas. Moreover, the model’s area of focus was more extensive and uniform and could capture different parts of the tomato’s features more comprehensively. This indicates that the model focuses on more explicit and delicate features when discriminating between ripe and unripe tomatoes. In summary, the improved model significantly enhanced the model’s performance in the tomato ripening classification task. The Grad-CAM heat map comparative analysis demonstrates that the LSKA attention mechanism increases the focus, comprehensiveness, and precision of the model’s attention area, which enhances the model’s capacity for feature extraction, accurate classification, and generalization.

3.1.3. Visualization of the Detection Results of the Improved Model

To better visualize the ability of the improved model to identify tomato ripeness in real growing environments, we used the YOLOv8 and YOLOv8-Tomato models to identify tomatoes in the greenhouse for comparison. The results are shown in Figure 13. Panel 1 demonstrates that the improved model increases the confidence score when recognizing tomatoes under different light conditions. Panel 2 indicates that the YOLOv8-Tomato model recognized tomato fruits not recognized by YOLOv8 under fruit shading and reduced the false detection rate. Panel 3 demonstrates that under the occlusion of plant leaves, the YOLOv8-Tomato model can effectively reduce the missed-detection and false-detection rates of YOLOv8. Panel 4 indicates that YOLOv8-Tomato can effectively recognize fruits obscured by leaves at different distances and improve the detection quality. In summary, the improved YOLOv8-Tomato model can effectively improve the recognition accuracy in the actual planting environment and reduce false and missed detections, which verifies the accurate detection ability of this algorithm for tomato ripeness under various environmental conditions.

3.2. Ablation Experiment

To validate the performance enhancement of the proposed YOLOv8-Tomato model, YOLOv8-Tomato and YOLOv8n were compared step-by-step to ascertain the effectiveness of each improvement. Specific results are shown in Table 1.
In the baseline experiments, this paper used the YOLOv8n model. It had good performance in identifying tomato ripeness, with 98.8% for mAP0.5, 90.1% for mAP0.5–0.95, 8.1 G FLOPs, and 3.006 M parameter counts. These results provide a benchmark for subsequent experiments.
In Experiment 2, after introducing the LSKA module, the mAP0.5 increased to 99.0% and the mAP0.5:0.95 value reached 90.7%, which indicates that the LSKA module can effectively improve the accuracy of the model without significantly increasing the amount of computation or the number of parameters. In Experiment 3, after replacing the upsampler with the Dysample module, the mAP0.5 value increased to 99.1% and the mAP0.5:0.95 value to 90.6%, while the FLOPs and parameters remained basically unchanged. In Experiment 4, after replacing the loss function with Inner-IoU, the mAP0.5 value reached 99.3% and the mAP0.5:0.95 value was 90.4%, while the FLOPs and the number of parameters remained at 8.1 G and 3.006 M. This shows that the Inner-IoU module can improve the measurement accuracy without increasing the computational burden. In Experiments 5, 6, and 7, the mAP0.5 values reached 99.0%, 99.1%, and 99.2%, respectively, and the mAP0.5:0.95 values increased to 90.8%, 90.7%, and 90.8%, with a slight increase in computation and number of parameters, which suggests synergistic effects between the modules. In Experiment 8, when all the innovations (LSKA, Dysample, Inner-IoU) were introduced, the recall reached 99.0%, the mAP0.5 value reached 99.4%, the mAP0.5:0.95 value improved to 91.0%, the FLOPs were 8.3 G, and the parameter size was 3.291 M. The results show that the combined use of these modules significantly improves the detection ability of the model, and the model maintains high performance at different IoU thresholds. Figure 14 shows the variation of the mAP0.5 curves for the different improvement methods.
The ablation experiments validate the modules’ effectiveness in improving the detection performance of the YOLOv8n model and show how the detection accuracy of the model can be significantly improved by the combined application of these techniques while maintaining low computational complexity. These results provide strong technical support for tomato ripening detection in practical applications.

3.3. Comparison of Different Models

To validate the improved model’s effectiveness, the overall performance of different deep learning models trained on the same dataset was compared. Table 2 shows the precision, recall, mAP0.5 value, and model size of the Faster-RCNN, SSD, YOLOv3-tiny, YOLOv5, YOLOv8n, and YOLOv8-Tomato models during the training process.
From the data in Table 2, compared with Faster R-CNN, YOLOv8-Tomato improved recall by 9.4 percentage points and the mAP0.5 metric by 7.5 percentage points, with a model size only 6.1% of that of Faster R-CNN. Compared with SSD, YOLOv8-Tomato improved recall by 9.2 percentage points and the mAP0.5 metric by 11.6 percentage points, with a model size 93.5% smaller than the SSD model. Figure 15 illustrates the changes in the data in Table 2.
Combining Table 2 and Figure 15, it can be observed that compared with the lightweight networks of the same series of YOLO algorithms, such as YOLOv3-tiny, YOLOv5, and YOLOv8n, the amount of calculation required and the number of parameters were decreased to various degrees, and mAP0.5 was improved by 8.6%, 3.3%, and 0.6%, respectively. The model size of YOLOv8-Tomato was reduced by 74.4% and 64.7% compared with YOLOv3-tiny and YOLOv5, respectively. The combined experimental results show that the YOLOv8-Tomato algorithm performs well compared to other mainstream object detection algorithms. The algorithm is ideal for real-time object detection of tomatoes because it not only achieves higher detection efficiency but also a smaller model size, lower computational cost, and higher detection accuracy. It is also simple to deploy on end devices. This provides a significant reference value for future fruit and vegetable robotic harvesting technology development.

4. Conclusions

This paper proposes a YOLOv8-Tomato model built on YOLOv8 as the base model. The model can identify the ripeness of greenhouse tomatoes growing in complex environments more rapidly and accurately than other models, including under different light conditions and with the fruit shaded to various degrees by leaves and neighboring fruits. The aim is to support automated tomato harvesting in precision agriculture and thus promote the development of intelligent agriculture.
(1)
To address the situation in which the edges of green tomatoes are highly similar to the background of green leaves, which leads to some immature tomatoes being misrecognized as background, the LSKA attention mechanism is added to the SPPF layer of the YOLOv8 model. This better captures the shape features of the tomatoes, enhances the network’s attention to the essential features, and effectively improves the overall performance of the model;
(2)
To increase resource utilization efficiency, Dysample, an ultra-lightweight and efficient dynamic upsampler, has replaced the original upsampling to provide higher-quality upsampling effects;
(3)
The loss function CIoU of the YOLOv8n model was replaced with the Inner-CIoU loss function, and an auxiliary bounding box was used to compute the loss and accelerate the bounding box regression to improve the recognition accuracy effectively;
(4)
The ablation tests confirmed that the YOLOv8-Tomato model produced superior results on all performance metrics under identical test settings. The final mAP0.5 value reached 99.4%, which is 7.5%, 11.6%, 8.6%, 3.3%, and 0.6% higher than that of the Faster R-CNN, SSD, YOLOv3-tiny, YOLOv5, and YOLOv8 models, respectively. This enables fast and accurate detection of tomatoes in complex environments and meets the requirements of practical applications.
In conclusion, the enhanced YOLOv8-Tomato model can accurately extract tomato ripeness features, which effectively improves the recognition rate of the model, and it performs well in recognition and detection tasks in unstructured real planting scenarios. It also substantially lowers the false-detection and missed-detection rates for fruit. In the future, to truly realize efficient mechanized harvesting operations, embedding the YOLOv8-Tomato model, combined with a depth camera and a harvesting robotic arm, into agricultural harvesting robots should be a focus of research.

Author Contributions

Conceptualization, X.J.; methodology, X.J., S.Z. and Z.Z.; software, X.J.; validation, T.L. and M.H.; formal analysis, X.J.; investigation, M.H. and T.L.; resources, X.J., T.L. and M.H.; data curation, M.H. and Z.Z.; writing—original draft preparation, X.J.; writing—review and editing, W.W. and S.Z.; visualization, X.J., Z.Z. and S.Z.; project administration, S.Z. and W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Guiding Project of Fujian Provincial Department of Science and Technology (grant number 2022N0009); the Open Foundation of Fujian Key Laboratory of Green Intelligent Drive and Transmission for Mobile Machinery (grant number GIDT-202308); and Fujian Agriculture and Forestry University (grant number K1520005A05).

Data Availability Statement

All data are presented in this article in the form of figures and tables.

Acknowledgments

The authors would like to acknowledge the College of Mechanical Electronic Engineering, Fujian Agriculture and Forestry University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bac, C.W.; Van Henten, E.J.; Hemming, J. Harvesting robots for high-value crops: State-of-the-art review and challenges ahead. J. Field Robot. 2014, 31, 888–911. [Google Scholar] [CrossRef]
  2. Fu, L.; Gao, F.; Wu, J. Application of consumer RGB-D cameras for fruit detection and localization in field: A critical review. Comput. Electron. Agric. 2020, 177, 105687. [Google Scholar] [CrossRef]
  3. Goldenberg, L.; Yaniv, Y.; Porat, R. Mandarin fruit quality: A review. J. Sci. Food Agric. 2018, 98, 18–26. [Google Scholar] [CrossRef] [PubMed]
  4. Chen, Q.; Yin, C.; Guo, Z. Current status and future development of the key technologies for apple picking robots. Nung Yeh Kung Ch’eng Hsueh Pao 2023, 38, 1–15. [Google Scholar]
  5. Sun, S.; Jiang, M.; He, D. Recognition of green apples in an orchard environment by combining the GrabCut model and Ncut algorithm. Biosyst. Eng. 2019, 187, 201–213. [Google Scholar] [CrossRef]
  6. Lu, J.; Lee, W.S.; Gan, H. Immature citrus fruit detection based on local binary pattern feature and hierarchical contour analysis. Biosyst. Eng. 2018, 171, 78–90. [Google Scholar] [CrossRef]
  7. Hayashi, S.; Yamamoto, S.; Saito, S. Field operation of a movable strawberry-harvesting robot using a travel platform. Jpn. Agric. Res. Q. 2014, 48, 307–316. [Google Scholar] [CrossRef]
  8. Zhao, Y.; Gong, L.; Zhou, B.; Hua, Y.; Niu, Q.; Liu, C. Object recognition algorithm of tomato harvesting robot using non-color coding approach. Trans. Chin. Soc. Agric. Mach. 2016, 47, 1–7. [Google Scholar]
  9. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 1980, 36, 193–202. [Google Scholar] [CrossRef] [PubMed]
  10. LeCun, Y.; Bottou, L.; Bengio, Y. Gradient-based learning applied to document recognition. Inst. Electr. Electron. Eng. Proc. 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  11. Girshick, R.; Donahue, J.; Darrell, T. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  12. Girshick, R. Fast r-cnn. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  13. Ren, S.; He, K.; Girshick, R. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  14. He, K.; Gkioxari, G.; Dollár, P. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  15. Wang, Z.; Ling, Y.; Wang, X. An improved Faster R-CNN model for multi-object tomato maturity detection in complex scenarios. Ecol. Inform. 2022, 72, 101886. [Google Scholar] [CrossRef]
  16. Fu, L.; Feng, Y.; Majeed, Y. Kiwifruit detection in field images using Faster R-CNN with ZFNet. IFAC Pap. OnLine 2018, 51, 45–50. [Google Scholar] [CrossRef]
  17. Long, J.; Zhao, C.; Lin, S.; Guo, W.; Wen, C.; Zhang, Y. Segmentation method of the tomato fruits with different maturities under greenhouse environment based on improved Mask R-CNN. Trans. Chin. Soc. Agric. Eng. 2021, 37, 100–108. [Google Scholar]
  18. Redmon, J.; Divvala, S.; Girshick, R. You Only Look Once: Unified, real-time object Detection. In Proceedings of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  19. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  20. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  21. Li, C.; Li, L.; Jiang, H. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  22. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  23. Zhang, X.; Gao, Q.; Pan, D. Picking recognition research of pineapple in complex field environment based on improved YOLOv3. J. Chin. Agric. Mech. 2021, 42, 201–206. [Google Scholar]
  24. Chen, J.; Wang, Z.; Wu, J.; Hu, Q.; Zhao, C.; Tan, C.; Teng, L.; Luo, T. An improved Yolov3 based on dual path network for cherry tomatoes detection. J. Food Process Eng. 2021, 44, e13803. [Google Scholar] [CrossRef]
  25. Gai, R.; Chen, N.; Yuan, H.; Gai, R.; Chen, N.; Yuan, H. A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput. Appl. 2023, 35, 13895–13906. [Google Scholar] [CrossRef]
  26. Li, T.; Sun, M.; Ding, X.; Li, Y.; Zhang, G.; Shi, G.; Li, W. Tomato recognition method at the ripening stage based on YOLO v4 and HSV. Trans. Chin. Soc. Agric. Eng. 2021, 37, 183–190. [Google Scholar]
  27. Xiong, J.T.; Han, Y.L.; Wang, X. Method of maturity detection for papaya fruits in natural environment based on YOLO v5-lite. Trans. Chin. Soc. Agric. Mach. 2023, 54, 243–252. [Google Scholar]
  28. Rong, J.; Zhou, H.; Zhang, F.; Yuan, T.; Wang, P. Tomato cluster detection and counting using improved YOLOv5 based on RGB-D fusion. Comput. Electron. Agric. 2023, 207, 107741. [Google Scholar] [CrossRef]
  29. Long, Y.; Yang, Z.; He, M. Recognizing apple targets before thinning using improved YOLOv7. Trans. Chin. Soc. Agric. Eng. 2023, 39, 191–199. [Google Scholar]
  30. Chen, W.; Liu, M.; Zhao, C.J. MTD-YOLO: Multi-task deep convolutional neural network for cherry tomato fruit bunch maturity detection. Comput. Electron. Agric. 2024, 216, 108533. [Google Scholar] [CrossRef]
  31. Yang, G.; Wang, J.; Nie, Z.; Yang, H.; Yu, S. A lightweight YOLOv8 tomato detection algorithm combining feature enhancement and attention. Agronomy 2023, 13, 1824. [Google Scholar] [CrossRef]
  32. Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert Syst. Appl. 2024, 236, 121352. [Google Scholar] [CrossRef]
  33. Liu, W.; Lu, H.; Fu, H. Learning to Upsample by Learning to Sample. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023. [Google Scholar]
  34. Zheng, Z.; Wang, P.; Liu, W. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  35. Zhang, H.; Xu, C.; Zhang, S. Inner-iou: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
Figure 1. Some representative tomato fruits: (a) single tomato; (b) overlapping fruits; (c) fruits in a complex background; (d) fruits in frontlight; (e) fruits in backlight; (f) obscured fruit.
Figure 2. Data-enhanced images: (a) brighten; (b) dim; (c) blur; (d) add noise; (e) rotate by 90 degrees; (f) rotate randomly; (g) mirror vertically; (h) mirror horizontally.
Figure 3. Tomato ripening determination.
Figure 4. YOLOv8 network structure diagram.
Figure 5. YOLOv8-Tomato network structure diagram.
Figure 6. The design of LKA and LSKA: (a) displays the LKA design in VAN, built from a depth-wise convolution (DW-Conv), a depth-wise dilated convolution, and a 1 × 1 convolution; (b) displays the LSKA design, in which the first two layers of LKA are broken down into four layers by LSKA, each made up of cascaded one-dimensional convolutional kernels.
Figure 7. Adding LSKA attention mechanism to SPPF layer.
Figure 8. Diagram of network structure after replacement.
Figure 9. Inner-CIoU loss function.
Figure 10. YOLOv8 and YOLOv8-Tomato model comparison curves: (a) precision; (b) recall; (c) mAP0.5; (d) mAP0.5:0.95.
Figure 11. Loss for validation sets.
Figure 12. Visualization and analysis of Grad-CAM.
Figure 13. Comparison of test results between YOLOv8 and YOLOv8-Tomato.
Figure 14. Changes in mAP0.5 curves for different improvement methods.
Figure 15. Comparison plot of detection performance of different models.
Table 1. Experimental results of different improvement methods.
Module | Precision/% | Recall/% | mAP@50/% | mAP@50–95/% | FLOPs/G | Param/M
YOLOv8n | 96.1 | 96.6 | 98.8 | 90.1 | 8.1 | 3.006
+LSKA | 98.0 | 98.1 | 99.0 | 90.7 | 8.3 | 3.278
+Dysample | 97.2 | 97.2 | 99.1 | 90.6 | 8.1 | 3.018
+Inner-IoU | 95.5 | 97.9 | 99.3 | 90.4 | 8.1 | 3.006
+Inner + LSKA | 96.4 | 97.3 | 99.0 | 90.8 | 8.3 | 3.291
+Dy + LSKA | 96.7 | 97.6 | 99.1 | 90.7 | 8.3 | 3.291
+Inner + Dy | 97.5 | 97.7 | 99.2 | 90.8 | 8.1 | 3.018
YOLOv8-Tomato | 99.1 | 99.0 | 99.4 | 91.0 | 8.3 | 3.291
Table 2. Comparison of detection performance of different models.
Model Name | Precision/% | Recall/% | mAP@50/% | Model Size/MB
Faster-RCNN | 82.5 | 89.6 | 91.9 | 110.8
SSD | 81.7 | 89.8 | 87.8 | 93.3
YOLOv3-tiny | 87.9 | 88.3 | 90.8 | 23.8
YOLOv5 | 90.1 | 94.4 | 96.1 | 17.1
YOLOv8n | 96.1 | 96.6 | 98.8 | 6.1
YOLOv8-Tomato | 99.1 | 99.0 | 99.4 | 6.8
