Article

Cherry Tomato Detection for Harvesting Using Multimodal Perception and an Improved YOLOv7-Tiny Neural Network

1 College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen 518060, China
2 School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
3 R&D Division, Litemaze Technology, Shenzhen 518060, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Agronomy 2024, 14(10), 2320; https://doi.org/10.3390/agronomy14102320
Submission received: 15 August 2024 / Revised: 29 September 2024 / Accepted: 5 October 2024 / Published: 9 October 2024
(This article belongs to the Special Issue AI, Sensors and Robotics for Smart Agriculture—2nd Edition)

Abstract

Robotic fruit harvesting has great potential to revolutionize agriculture, but detecting cherry tomatoes in farming environments still faces challenges in accuracy and efficiency. To overcome the shortcomings of existing cherry tomato detection methods for harvesting, this study introduces a deep-learning-based cherry tomato detection scheme for robotic harvesting in greenhouses using multimodal RGB-D perception and an improved YOLOv7-tiny Cherry Tomato Detection (YOLOv7-tiny-CTD) network, which has been modified from the original YOLOv7-tiny by eliminating the “Objectness” output layer, introducing a new “Classness” method for the prediction box, and incorporating a new hybrid non-maximum suppression. Acquired RGB-D images undergo preprocessing such as color space transformation, point cloud normal vector angle computation, and multimodal regions of interest segmentation before being fed into the YOLOv7-tiny-CTD. The proposed method was tested using an AGV-based robot in a greenhouse cherry tomato farming facility. The results indicate that the multimodal perception and deep learning method improves detection precision and accuracy over existing methods while running in real time, and the robot achieved over 80% successful picking rates in two-trial mode in the greenhouse farm, showing promising potential for practical harvesting applications.

1. Introduction

Fruit harvesting robots have been studied for decades, mostly in research settings [1,2,3,4]. With the advancement of robotics and deep learning, they have gained increasing attention in recent years [5,6,7]. For harvesting robots, rapid and accurate identification and localization of fruits have become key challenges. Previous studies have primarily focused on the recognition and localization of fruits using color images. Bulanon et al. used an automatic threshold selection algorithm based on the red color difference histogram, effectively improving the accuracy of segmentation and the success rate of recognition [8]. Payne et al. separated mangoes from the background by performing color segmentation in the RGB and YCbCr color spaces [9]. Senthilnath et al. proposed a co-spectral space classification method that first uses the naive Bayes algorithm to obtain the optimal number of image clusters, followed by spectral clustering using K-means, expectation maximization, and self-organizing maps for tomato identification [10]. Luo et al. introduced an automatic detection method for ripe grapes based on visual sensors, integrating the AdaBoost framework with multiple color components, which can suppress the impact of environmental noise to some extent [11]. Teixidó et al. employed a classification method based on multiple color hyper-planes implicit in the color space, calculating the distance of pixel points to different hyper-planes to improve detection accuracy [12]. Kurtulmus et al. extracted features through circular Gabor texture analysis and achieved the recognition of immature peach fruits in natural images using various classifiers [13]. Zhou et al. utilized the K-means algorithm for cherry tomato picking [14]. Fruit picking robots often encounter issues such as drastic changes in lighting, complex backgrounds, and random occlusions of target objects when operating in outdoor unstructured environments. These factors lead to insufficient stability of target detection methods that rely only on features such as color, spectrum, and shape. Although the aforementioned algorithms have high detection efficiency in well-controlled environments, they still face challenges in unstructured field environments.
Due to the strong representational capability of deep learning for images, it can effectively compensate for the lack of robustness in algorithms that depend solely on color, shape, and texture features. Many studies have introduced deep learning into fruit harvesting robots [15]. Chen et al. introduced a dual-path network based on YOLOv3 to enhance semantic feature extraction for cherry tomatoes [16]. Zheng et al. proposed YOLOX-Dense-CT for cherry tomatoes based on the YOLOX model; it uses DenseNet as its core framework and integrates the convolutional block attention module to significantly improve the model’s recognition performance [17]. Yan et al. constructed a new Si-YOLO network to identify cherry tomatoes; it is based on YOLOv5, integrating the SimAM attention module and a generative adversarial network to enhance the model’s generalization performance [18]. Wang et al. proposed a lightweight model for detecting cherry tomatoes based on YOLOv5; it determines the sizes and aspect ratios of the anchor boxes through K-means++ and incorporates a coordinate attention mechanism and a weighted intersection over union (IOU) loss function [19]. The generated weight file is only 4 MB in size.
The more recent YOLOv7 model consists of Input, Backbone, Neck, and Head and provides three basic models: YOLOv7-tiny, YOLOv7, and YOLOv7-W6 [20]. Compared with other YOLO series models, the construction of YOLOv7 is similar to that of YOLOv4 and YOLOv5 [21]. Its core structure mainly consists of convolutions and the Extended-ELAN (E-ELAN), MPConv, and SPPCSPC modules. The E-ELAN module can greatly improve the learning performance of the network by changing the structure of the computation block without affecting the original gradient path, using techniques to expand, shuffle, and merge cardinality. The SPPCSPC module greatly reduces the distortion and repetitive feature problems that may occur in image processing by introducing multiple parallel MaxPool operations, thereby effectively improving performance. In the MPConv module, the MaxPool operation greatly expands the receptive field of the current feature layer, and its output is combined with the features processed by convolutions, thus greatly improving the generalization ability of the network.
To ensure a high recall rate, existing target detection algorithms often output multiple candidate boxes for the same target. However, redundant candidate boxes degrade detection accuracy, so postprocessing is needed to filter them out and output the candidate boxes with the highest confidence scores. The original YOLOv7 uses non-maximum suppression (NMS) to eliminate redundant candidate boxes, but the operational efficiency of this method is low, and only IOU is used as the evaluation metric, which is insufficient to fully describe the overlapping relationship. In addition, using a manually set threshold makes the results less robust. Although other studies have proposed various solutions, such as Weighted NMS [22], DIOU NMS [23], and CIOU NMS [24], it is still difficult to solve the problem effectively. Learning-based NMS has also been introduced [25], but its implementation is not simple. Table 1 shows the advantages and disadvantages of various NMS methods.
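For reference, the following is a minimal sketch of the standard greedy IoU-based NMS that the original YOLO postprocessing relies on (not the improved hybrid method proposed later in this paper); the box format, helper names, and threshold value are illustrative assumptions.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Standard NMS: keep the highest-scoring box, drop boxes that overlap it above the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep
```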
In addition, the aforementioned algorithms primarily rely on color images and under-utilize depth information. During the operation of a harvesting robot, these models need high-performance GPU hardware, which leads to a significant increase in cost. Depth information can improve the detection of cherry tomatoes, as distance can help reduce the ambiguities arising from partial occlusions by neighboring cherry tomatoes. To overcome these shortcomings of existing cherry tomato detection methods for harvesting, this study combines multimodal perception (color and depth, RGB-D for short) with an improved YOLOv7-tiny network, named YOLOv7-tiny-CTD, to achieve real-time detection and positioning of cherry tomatoes in commercial greenhouse farms. It was evaluated with a harvesting robot consisting of an off-the-shelf collaborative robotic arm, a custom automated ground vehicle (AGV) platform, and a custom end effector with force sensors and an RGB-D sensor. Results indicate that the proposed YOLOv7-tiny-CTD model significantly enhances the recognition capability for cherry tomatoes. Moreover, harvesting experiments demonstrate that the robot achieved a more than 80% picking success rate and has potential for applications in commercial farming facilities. Partial results have been previously presented at a conference [26].
Cherry tomatoes can be harvested either individually or in bunches. The latter is much more efficient only if picking from the vines alone is considered. For many farms, if harvested in bunches, individual fruits still need to be separated from the bunches, examined, and classified before being packaged, which takes much more time than picking the bunches from the vines. Some farms have envisioned using robots to classify and pick individual fruits and package them directly. Whether this is more efficient needs to be studied in the future. In the current study, we chose to detect and pick individual cherry tomatoes at the request of the farm that we worked with.
The main contributions of the paper are:
(1)
Multimodal RGB-D images are utilized in combination with simple preprocessing methods to screen regions of interest (ROIs) for cherry tomato detection to improve efficiency.
(2)
To better utilize depth information for cherry tomato detection, the normal vector angles of a point cloud are introduced and combined with the non-luminance color channels in the Lab color space of an RGB image as the input to neural networks.
(3)
In addition to the multimodal image input, YOLOv7-tiny has been improved in three different aspects: eliminating the “Objectness” output layer; introducing a new “Classness” method for the prediction box; and improving the NMS by using a hybrid method.
(4)
The proposed approach has been evaluated with a cherry tomato harvesting robot in a commercial greenhouse farm, and it outperforms several state-of-the-art detection neural networks in precision, recall, and accuracy while running at 26 FPS on an Nvidia Jetson TX1. Cherry tomato picking based on the detection results shows promising potential for practical applications.

2. Materials and Methods

2.1. Cherry Tomato Picking Robot

We first introduce the cherry tomato harvesting robot. The functional module diagram of the robotic system is shown in Figure 1. The main hardware modules include a customized AGV platform, a collaborative robotic arm (cobot), a customized end effector with force sensors and an RGB-D sensor, and computing, communication, and control modules. The main software modules are obstacle avoidance, eye–hand calibration, cherry tomato detection, robotic arm and end effector control, communication and data transfer, and a debugging tool and graphical user interface (GUI).

2.1.1. Main Hardware Modules

As the robot is intended for use in greenhouse cherry tomato farming facilities, the AGV only runs on embedded tracks on the ground (Figure 2a), and no extra navigation is needed. For obstacle avoidance, two Litemaze time-of-flight (TOF30) sensors (the highest resolutions of color and depth images are 5 MP and 0.3 MP (640 × 480); Litemaze Technology, Shenzhen, China) are installed at the front and back sides of the AGV, respectively. These TOF sensors have a short working distance of 3 m, which is sufficient for the low-speed AGV, whose speed is capped at 1 m/s.
A cobot arm, the Aubo i5 (Aubo Robotics, Beijing, China), is mounted on top of the AGV to offer 6 degrees of freedom for manipulation. It has a reach of 88.65 cm and a payload of 5 kg. A customized end effector with two straight parallel fingers is attached to the end arm of the cobot. A force sensor is built into each finger of the end effector. The two fingers can move towards or away from each other, driven by an electrical motor. A multimodal RGB-D sensor, either the Microsoft Kinect DK (Microsoft, Redmond, WA, USA) or the Litemaze TOF30, can be mounted on the cobot end arm. The Kinect DK is a developer kit, and its highest resolutions of color and depth images are 12 MP and 1 MP, respectively. The total weight of the end effector, RGB-D sensor, and installation parts, illustrated in Figure 2b, is around 3 kg, leaving sufficient payload capacity for cherry tomato picking. Two-arm solutions have been used to imitate humans for fruit picking [27], but such systems are more complex and come at a much higher cost. The embedded computing platform for perception and robotic control is the Nvidia Jetson TX1 (Nvidia, Santa Clara, CA, USA). A separate personal computer (PC) with an Intel i7 CPU and 16 GB of RAM (Lenovo, Beijing, China) is used for development and for hosting the user interface.
The bill of materials (BOM) cost of the hardware is about USD 28,000 (AGV platform: USD 14,000; cobot arm: USD 8000; end effector and sensors: USD 3000; PC and TX1: USD 1800; other parts and cables: USD 1200).

2.1.2. Main Software Modules

The main software modules are obstacle avoidance, cherry tomato detection, robotic arm and end effector control, communication and data transfer, and a debugging tool and graphical user interface (GUI). The algorithms involved in obstacle avoidance, eye–hand calibration, and cherry tomato detection are discussed in detail later. The debugging tool and GUI are shown in Figure 3. System parameters can be configured via the settings interface. Communication between the PC, TX1, and cobot uses the TCP/IP protocol, and the TX1 communicates with the RGB-D sensor over USB.

2.2. Multimodal Perception

Multimodal perception here refers to the utilization of color, depth, and force sensors for the cherry tomato picking robot. The force sensor feedback is used to prevent damage to the cherry tomatoes.

2.2.1. Obstacle Avoidance

As the robot moves around the greenhouse in autonomous mode, it needs to be able to avoid obstacles, such as farm workers, unexpected objects falling on the ground, or loose cherry tomato vines. As the AGV is on the track, obstacle avoidance is only provided for forward and backward movements. Both RGB and depth images are used for obstacle detection, using a strategy similar to the salient feature learning in a previous study [28]. The control strategy is simply distance based: the space ahead in the moving direction is divided into three zones: danger/red (<1 m), risk/yellow (1–2 m), and safe/green (>2 m). If an obstacle with a diameter greater than a given size (in this study, it was set to 2 cm) is detected in the red or yellow zone, the robot immediately stops or slows down, respectively. Otherwise, it continues moving at the current speed. Obstacles smaller than the given size are ignored as they are unlikely to pose a risk. In the study, obstacle avoidance testing was carried out by dropping objects on the ground in front of the AGV, and human operators were not allowed to be in front of it.
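The following is a minimal sketch of the distance-based zone logic described above; the function name, slow-down factor, and data format are illustrative assumptions rather than the authors’ implementation.

```python
def agv_speed_command(obstacles, current_speed, min_size_cm=2.0, max_speed=1.0):
    """Return the AGV speed command given obstacles detected in the moving direction.

    obstacles: list of (distance_m, diameter_cm) tuples from the RGB-D obstacle detector.
    Obstacles smaller than min_size_cm are ignored; zones: <1 m stop, 1-2 m slow down, >2 m keep speed.
    """
    relevant = [d for d, size in obstacles if size >= min_size_cm]
    speed = min(current_speed, max_speed)
    if not relevant:
        return speed                 # no relevant obstacle: continue at the current speed
    nearest = min(relevant)
    if nearest < 1.0:                # danger/red zone
        return 0.0                   # stop immediately
    if nearest < 2.0:                # risk/yellow zone
        return 0.5 * speed           # slow down (the factor is an illustrative choice)
    return speed                     # safe/green zone
```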

2.2.2. Cherry Tomato ROI Image Patches

Mature cherry tomatoes have distinct red colors and are in great contrast with the surrounding environment. At the same time, most cherry tomatoes are in strings. Based on these characteristics, we propose a preprocessing method to segment cherry tomato ROIs by superimposing the depth information onto the RGB image to obtain a new multimodal image.
Compared with the RGB color space, the Lab color space has a more compact color range and higher contrast. Acquired RGB images are converted into Lab space [29], and thresholding is carried out to obtain a binary mask $B_C$ for each color channel, where 1s represent candidate ROIs.
$$B_C(i,j) = \begin{cases} 1, & C_L \le C(i,j) \le C_H \\ 0, & \text{otherwise} \end{cases}$$
where $C(i,j)$ is the value of the color channel under consideration ($C \in \{L, a, b\}$) at pixel $(i,j)$, and $C_L$ and $C_H$ are the low and high thresholds of that channel (for the dataset in this study, $C_L$ and $C_H$ were set to 30 and 70 for $L$ (range [0, 100]), 18 and 127 for $a$ (range [−128, 127]), and 10 and 50 for $b$ (range [−128, 127])). Similarly, for the depth channel, thresholding is carried out to obtain a binary mask $B_D$ where 1s are candidate ROIs.
$$B_D(u,v) = \begin{cases} 1, & d_L \le D(u,v) \le d_H \\ 0, & \text{otherwise} \end{cases}$$
where $D(u,v)$ is the depth of pixel $(u,v)$, and $d_L$ and $d_H$ are the low and high depth thresholds (in this study, $d_L$ and $d_H$ were set to 30 and 100 cm, respectively). The initial cherry tomato ROIs are determined as
$$B_{ROI}(i,j) = B_L(i,j) \cap B_a(i,j) \cap B_b(i,j) \cap \left\{ f\left[ B_D(u,v) \right] > 0 \right\}$$
where $f[\cdot]$ is the rotation and translation operation that maps the depth camera coordinates $(u,v)$ to the RGB camera coordinates $(i,j)$, which can be obtained from camera calibration. $B_{ROI}$ is the intersection of all candidate ROIs in color and depth, where pixels with true values (1s) are ROIs. Figure 4 shows an example of the process of obtaining the initial ROIs, which can be further cleaned up by morphological operations, such as open and close [30,31].
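A minimal OpenCV/NumPy sketch of the ROI masking defined by the equations above is given below, assuming the depth map has already been registered to the RGB camera frame by the transform $f[\cdot]$; the threshold values follow those reported in the text, and the morphological kernel size is an illustrative assumption.

```python
import cv2
import numpy as np

def cherry_tomato_roi_mask(bgr, depth_registered_cm,
                           l_range=(30, 70), a_range=(18, 127), b_range=(10, 50),
                           d_range=(30, 100)):
    """Binary ROI mask from Lab color thresholds intersected with a depth threshold.

    depth_registered_cm: depth map (in cm) already registered to the RGB camera frame.
    """
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    # Convert OpenCV's 8-bit Lab encoding to L in [0, 100] and a, b in [-128, 127]
    L = lab[..., 0] * (100.0 / 255.0)
    a = lab[..., 1] - 128.0
    b = lab[..., 2] - 128.0

    mask = ((L >= l_range[0]) & (L <= l_range[1]) &
            (a >= a_range[0]) & (a <= a_range[1]) &
            (b >= b_range[0]) & (b <= b_range[1]) &
            (depth_registered_cm >= d_range[0]) & (depth_registered_cm <= d_range[1]))

    # Clean up with morphological open and close, as mentioned in the text
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask.astype(np.uint8) * 255, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask > 0
```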
As the resolution of the original images is high, down-sampling them to feed into deep neural networks (DNNs) would compress the images and result in a loss of information. We could instead crop the input images into patches of a fixed size and feed them into DNNs in batches, but the computational cost would be high. To address this issue, we introduce a threshold parameter for each image patch as
$$S_p = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} B_{ROI}(i,j)$$
where $m$ and $n$ are the row and column sizes of the patch. When $S_p$ is less than the set threshold (based on the image size of the cherry tomatoes in the dataset, this study set the threshold to 10), there are no cherry tomatoes in the patch, and it is discarded. The sequentially cropped image and the filtered cropped image are shown in Figure 5.
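A minimal sketch of this patch filtering step is shown below, operating on the ROI mask $B_{ROI}$ from the previous step; the patch size is an illustrative assumption.

```python
import numpy as np

def filter_patches(roi_mask, patch_h=320, patch_w=320, min_roi_pixels=10):
    """Sequentially crop the ROI mask into fixed-size patches and keep only those whose
    ROI pixel sum S_p reaches the threshold; patches without cherry tomatoes are discarded."""
    kept = []
    H, W = roi_mask.shape[:2]
    for top in range(0, H, patch_h):
        for left in range(0, W, patch_w):
            patch = roi_mask[top:top + patch_h, left:left + patch_w]
            if patch.sum() >= min_roi_pixels:   # S_p threshold (10 in this study)
                kept.append((top, left, patch.shape[0], patch.shape[1]))
    return kept  # (top, left, height, width) windows to crop from the full image
```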
With simple cropping, a string of cherry tomatoes or even a single fruit may be split across different patches. This leads to inaccurate coordinates of the cherry tomatoes, which greatly reduces the accuracy and robustness of the detection results. Using sliding windows to generate many overlapping patches can overcome this issue, but at the cost of a dramatic increase in computation. Therefore, we use a patch merging strategy: patches are merged if (1) the pixel distances between the patch borders and the image edges are less than a given threshold, or (2) the pixel distances between adjacent patches are within a given threshold, which can be accomplished by the distance transform [32].

2.2.3. Normal Vector Angles of a Point Cloud

The image captured by the TOF camera contains only depth. In Euclidean space, horizontal and vertical coordinates are also needed to represent position, forming XYZ coordinate values. Therefore, it is necessary to calculate the 3D coordinates of each point in the depth map according to the intrinsic parameters of the camera and then convert the depth map into a point cloud. Because of the small size of cherry tomatoes, their depth differences are very small. To make better use of the depth information, we calculate the normal vector angles of the point cloud, as illustrated in Figure 6, and convert them into an image.
(1)
The integral image of the Z data of the point cloud is constructed, which is the sum of all Z coordinate values in the rectangular area from $Z(0, 0)$ to $Z(P, Q)$.
$$I_Z = \sum_{i=0}^{P} \sum_{j=0}^{Q} Z(i, j)$$
where P and Q are the row and column sizes of the depth map.
(2)
The horizontal and vertical vectors are computed as [33]
$$\begin{aligned}
V_x^h &= 0.5\,[X(m+r, n) - X(m-r, n)] \\
V_y^h &= 0.5\,[Y(m+r, n) - Y(m-r, n)] \\
V_z^h &= 0.5\,[S(I_Z, m+1, n, r-1) - S(I_Z, m-1, n, r-1)] \\
V_x^v &= 0.5\,[X(m, n+r) - X(m, n-r)] \\
V_y^v &= 0.5\,[Y(m, n+r) - Y(m, n-r)] \\
V_z^v &= 0.5\,[S(I_Z, m, n+1, r-1) - S(I_Z, m, n-1, r-1)]
\end{aligned}$$
where $X$ and $Y$ are the values of the pixels at the corresponding positions of the X and Y channels in the Euclidean coordinate system, $r$ is the radius of a smooth rectangular region, and $S$ is the average value in the specified area, which can be calculated as
$$S(I_Z, m, n, r) = \frac{I_Z(m+r, n+r) - I_Z(m-r, n+r) - I_Z(m+r, n-r) + I_Z(m-r, n-r)}{4 r^2}$$
(3)
The local point cloud normal vector, as shown in Figure 6a, is
$$\mathbf{V} = (V_x, V_y, V_z) = (V_x^h, V_y^h, V_z^h) \times (V_x^v, V_y^v, V_z^v)$$
where $\times$ represents the cross product of the two vectors. The normal vector is further transformed into an angle and converted to a pixel value.
$$\theta = \operatorname{atan}\!\left( \frac{V_z}{\sqrt{V_x^2 + V_y^2 + V_z^2}} \right), \qquad P_\theta = \frac{P_{max}\,\theta}{4\pi}$$
where $P_{max}$ is the maximum pixel value used to convert the normal vector angles to image pixel values. For 8-bit images, $P_{max}$ is 255.
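A minimal NumPy sketch of converting an organized point cloud into a normal-vector-angle image is given below. It is a simplified form of the equations above: central differences with radius $r = 1$ are used instead of the integral-image smoothing, so it is an illustrative reduction rather than the exact procedure.

```python
import numpy as np

def normal_angle_image(X, Y, Z, p_max=255.0):
    """Per-pixel normal-vector-angle image from an organized point cloud (X, Y, Z are H x W arrays)."""
    def central_diff(A, axis):
        d = np.zeros_like(A, dtype=np.float64)
        if axis == 0:   # vertical (row) direction
            d[1:-1, :] = 0.5 * (A[2:, :] - A[:-2, :])
        else:           # horizontal (column) direction
            d[:, 1:-1] = 0.5 * (A[:, 2:] - A[:, :-2])
        return d

    # Horizontal and vertical tangent vectors at each pixel
    vh = np.stack([central_diff(X, 1), central_diff(Y, 1), central_diff(Z, 1)], axis=-1)
    vv = np.stack([central_diff(X, 0), central_diff(Y, 0), central_diff(Z, 0)], axis=-1)

    # Local surface normal as the cross product of the two tangent vectors
    n = np.cross(vh, vv)
    norm = np.linalg.norm(n, axis=-1) + 1e-9
    theta = np.arctan2(n[..., 2], norm)      # angle term following the formula above
    return p_max * theta / (4.0 * np.pi)     # convert angles to pixel values
```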

2.2.4. Depth Value in the Prediction Box

For the “Classness” prediction described below, using individual pixel depth values is unreliable. To improve stability, we utilize a depth value ($Z_p$) corresponding to a percentile of the cumulative distribution function of the depth values in the prediction box, as shown in Figure 7.
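A minimal sketch of extracting $Z_p$ from the depth values inside a prediction box is shown below; the percentile value is an illustrative assumption, as the exact percentile used is not specified here.

```python
import numpy as np

def box_depth_percentile(depth, box, percentile=50.0):
    """Depth value Z_p at the given percentile of the depth CDF inside a prediction box.

    depth: H x W depth map; box: (x1, y1, x2, y2) in pixel coordinates.
    Invalid (zero) depth readings are ignored.
    """
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    values = depth[y1:y2, x1:x2].ravel()
    values = values[values > 0]
    if values.size == 0:
        return None
    return float(np.percentile(values, percentile))
```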

2.2.5. YOLOv7-Tiny-CTD: Improvement of YOLOv7-Tiny for Cherry Tomato Detection

We introduce an improved YOLOv7-tiny for cherry tomato detection as YOLOv7-tiny-CTD. YOLOv7-tiny is chosen as it is compact, with less than 1/6 of the network parameters of the other two YOLOv7 versions. The following improvements are made to YOLOv7-tiny:
(1)
Replacing the color image input with a multimodal image input. Most object detection models, including YOLOv7, use a color image input. In some methods using RGB-D input, the color image is first fed into the model to obtain preliminary results, which are then combined with the depth map for further processing; this struggles to make full use of the depth information. Hybrid RGB-D DNNs have been proposed to better utilize both color and depth information [34,35,36], but they usually introduce significant complexity. For simplicity, we adjust the input structure of YOLOv7-tiny and add a depth-derived map as a separate channel alongside the color channels in Lab space. As shown in Figure 8, the RGB image input is replaced by a four-channel image formed by adding the normal vector angle map described above.
(2)
Eliminating the “Objectness” output layer of YOLOv7-tiny. Since there is only one type of detection target (cherry tomato), the object inside every prediction box must be a cherry tomato.
(3)
Defining a new “Classness” measure for the prediction box:
$$P_{class} = \frac{A_d}{2wh} + \frac{1}{2}\left( 1 - \frac{Z_{top\,A_d}}{Z_{max} - Z_{min}} \right)$$
where $w$ and $h$ are the width and height of the prediction box, respectively, $Z_{max}$ and $Z_{min}$ are the maximum and minimum depth thresholds for cherry tomato detection, and $A_d$ is the area with the depth value $d = Z_p$ in the prediction box, as explained in the previous section.
(4)
Improving the NMS in YOLOv7-tiny. We propose a custom hybrid CWD-NMS, combining CIOU NMS [24] and Weighted NMS [22], to address the problem:
$$M = \frac{\sum_i W_i B_i}{\sum_i W_i}, \qquad B_i \in \left\{ B \mid CIOU(M, B) > T \right\}, \qquad W_i = S_i \cdot CIOU(M, B_i)$$
where $M$ is the box with the current highest score, $B$ the set of detected prediction boxes, $S_i$ the prediction box score, and $T$ the threshold.
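A minimal sketch of this hybrid suppression step is given below, following the reconstruction of the formula above: boxes whose CIOU with the current highest-scoring box exceeds the threshold are merged by a weighted average with weights $W_i = S_i \cdot CIOU(M, B_i)$. The CIOU computation follows the standard definition [24]; all function names are illustrative.

```python
import numpy as np

def ciou(box, boxes, eps=1e-9):
    """Complete IoU (CIOU) between one box and an array of boxes, (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    iou = inter / (area_a + area_b - inter + eps)

    # Normalized center-distance term
    rho2 = ((box[0] + box[2] - boxes[:, 0] - boxes[:, 2]) ** 2 +
            (box[1] + box[3] - boxes[:, 1] - boxes[:, 3]) ** 2) / 4.0
    cw = np.maximum(box[2], boxes[:, 2]) - np.minimum(box[0], boxes[:, 0])
    ch = np.maximum(box[3], boxes[:, 3]) - np.minimum(box[1], boxes[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term
    wa, ha = box[2] - box[0], box[3] - box[1]
    wb, hb = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    v = (4 / np.pi ** 2) * (np.arctan(wa / (ha + eps)) - np.arctan(wb / (hb + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

def cwd_nms(boxes, scores, ciou_thresh=0.5):
    """Hybrid NMS sketch: each cluster of boxes around the highest-scoring box M (CIOU > threshold)
    is replaced by the weighted average of its boxes, with weights W_i = S_i * CIOU(M, B_i)."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(scores)[::-1]
    merged = []
    while order.size > 0:
        c = ciou(boxes[order[0]], boxes[order])
        in_cluster = c > ciou_thresh          # includes M itself, since CIOU(M, M) = 1
        w = scores[order[in_cluster]] * c[in_cluster]
        merged.append((boxes[order[in_cluster]] * w[:, None]).sum(0) / (w.sum() + 1e-9))
        order = order[~in_cluster]
    return np.array(merged)
```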

2.2.6. Evaluation Metrics

A few common metrics are used for evaluating the proposed YOLOv7-tiny-CTD cherry tomato detection method, namely mean average precision (mAP), accuracy, and recall. For robot picking, the success rate is computed as the number of successful pickings within two attempts divided by the number of fruits for which the system initiated picking attempts.
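A small worked example of the picking success rate defined above, using the first run in Table 4 (77 fruits with picking actions, 62 successes within two attempts):

```python
def picking_success_rate(num_fruits_attempted, num_successes_within_two_attempts):
    """Success rate = successful pickings within two attempts / fruits with initiated attempts."""
    return num_successes_within_two_attempts / num_fruits_attempted

# Trial 1 of Table 4: 62 / 77 = 0.805..., reported as 81%
rate = picking_success_rate(77, 62)
print(f"{rate:.1%}")  # 80.5%
```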

2.3. Eye–Hand Calibration

The cherry tomato detection RGB-D sensor is mounted on the cobot end arm in the eye-in-hand mode. It is calibrated by placing a planar calibration board with an array of small circles on the ground or the AGV platform while the cobot arm moves the sensor to different viewing angles and distances. Intrinsic and extrinsic parameters are obtained simultaneously with about 30 calibration images [37]. Hand–eye calibration is performed with the relatively simple 9-point method [38,39].
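As an illustration of mapping camera-frame points to robot-frame points with a small set of corresponding calibration points (in the spirit of the simple multi-point method cited above), the following sketch estimates a least-squares rigid transform with the SVD-based (Kabsch) method; it is an illustrative reconstruction, not necessarily the exact procedure used.

```python
import numpy as np

def rigid_transform_from_points(cam_pts, robot_pts):
    """Least-squares rigid transform (R, t) mapping 3D points from the camera frame to the
    robot frame; cam_pts and robot_pts are N x 3 arrays of corresponding points
    (e.g., N = 9 calibration-board circle centers observed in both frames)."""
    cam_c, rob_c = cam_pts.mean(axis=0), robot_pts.mean(axis=0)
    H = (cam_pts - cam_c).T @ (robot_pts - rob_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:     # correct a possible reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = rob_c - R @ cam_c
    return R, t                  # robot_point ~= R @ cam_point + t
```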

2.4. End Effector Trajectory Planning and Cherry Tomato Picking

Once cherry tomatoes are detected, their distances to the fingertips of the end effector can be calculated using the coordinates of their centers and the camera center, as well as the spatial relationship between the camera and end effector obtained from the hand–eye calibration. The simplest strategy is employed: a straight-line trajectory is used to carry out cherry tomato picking. The two end effector fingers open up to about 1 cm wider than the diameter of the current target cherry tomato, which is the closest and fully exposed one in the current camera field of view. The two fingers grip the target fruit and make a turn of about 90 degrees while pulling it away from the vine by moving about 5–10 cm. This helps the fruit detach easily without tearing the stem or damaging the plant. If the fruit has not been detached after this process, the robot gives up and moves on to the next cycle. In this process, the gripping force of the two fingers is capped at 10 N, as measured by the force sensors, to avoid bruising the fruits. It should be noted that most cherry tomatoes occluded by others will be exposed once the latter are picked, and those occluded by leaves are ignored.
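A minimal sketch of turning a detected cherry tomato center into the parameters of this straight-line picking motion is shown below, assuming the camera-to-robot transform (R, t) from the hand–eye calibration; the function name, return structure, and pull-back value are illustrative placeholders.

```python
import numpy as np

def plan_pick(center_cam_xyz, fruit_diameter_cm, R, t,
              grip_margin_cm=1.0, pull_back_cm=7.5, max_force_n=10.0):
    """Picking parameters for one detected fruit.

    center_cam_xyz: fruit center in the camera frame (from the RGB-D detection);
    R, t: camera-to-robot-base transform from the hand-eye calibration.
    """
    target_base = R @ np.asarray(center_cam_xyz) + t     # straight-line goal in the base frame
    return {
        "target_base_xyz": target_base,
        "finger_opening_cm": fruit_diameter_cm + grip_margin_cm,  # ~1 cm wider than the fruit
        "twist_deg": 90.0,               # turn about 90 degrees while pulling
        "pull_back_cm": pull_back_cm,    # move roughly 5-10 cm away from the vine
        "max_grip_force_n": max_force_n, # force cap measured by the finger force sensors
    }
```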

3. Experimental Results

The cherry tomato picking robot was built and tested in a commercial greenhouse farming facility. The tests and experiments ran from April to July so that a long period with different harvesting conditions was covered. Overall system evaluations were carried out to test autonomous running and obstacle avoidance. No collision with obstacles occurred during the test period.
More importantly, two types of specific experiments were performed to evaluate the proposed YOLOv7-tiny-CTD cherry tomato detection method and the cherry tomato picking success rate.

3.1. Dataset

The dataset was collected in the commercial greenhouse farming facility over the 4-month period. The RGB-D sensor on the cobot arm collected both color and depth images while the AGV autonomously ran through the facility. The images were taken strictly according to the posture and distance of the cobot arm, and the collection times corresponded to the peak and end periods of cherry tomato maturity. The dataset contains 799 RGB-D image pairs and 6312 annotated labels. It was divided into training and testing sets at a ratio of 4 to 1 (639 and 160 images for training and testing, respectively).
The cherry tomatoes are mostly large and dense at the mature stage and relatively small and sparse at the end stage. In the latter case, the leaf color is yellowish, and the fruit is shaded. Due to the continuous mode during image acquisition, some cherry tomatoes appear blurry in the images.

3.2. Model Training and Testing

The hardware and software environments for training and testing the models are summarized in Table 2.

3.3. YOLOv7-Tiny-CTD Compared with Existing Models

YOLOv7-tiny-CTD is compared to several existing deep learning models using the same RGB-D dataset described above, including YOLOv5-s [40], Faster R-CNN [41], SSDLite [42], and YOLOv7-tiny. As expected, for all existing models, using multimodal RGB-D images improves the cherry tomato detection performance in terms of all metrics over using RGB images only (Table 3). In addition, YOLOv7-tiny outperforms all other existing models for both RGB and RGB-D inputs. More importantly, YOLOv7-tiny-CTD further improves over YOLOv7-tiny by 2.1%, 4.9%, and 4.2% in mAP, recall, and accuracy, respectively, while both of them ran at about 26 frames per second (FPS), achieving real-time detection. Figure 9 shows an example where YOLOv7-tiny-CTD detected all six cherry tomatoes while YOLOv7-tiny only detected four of them in the top left corner of the input image. The two fruits missed by YOLOv7-tiny (pointed to by orange arrows in Figure 9d) are slightly further away from the robot and partially occluded by neighboring fruits on the same string, showcasing the advantage of making good use of the depth information.

3.4. Cherry Tomato Robot Picking Results

Cherry tomato robot picking experiments were carried out over five runs, each with 80 image acquisitions, and the number of picking actions performed by the cobot arm and the number of successful pickings were counted. Because the end effector may collide with other objects around the cherry tomatoes, the picking could fail even if the positioning is accurate. Therefore, the force sensor feedback was used to judge whether a picking was successful. If the first attempt failed, a second picking attempt was carried out with the same positioning coordinates. The results are shown in Table 4.

4. Discussion

The proposed YOLOv7-tiny-CTD was tested in a commercial cherry tomato farming greenhouse with the harvesting robot. It obtains better cherry tomato detection with the transformed multimodal image inputs. Incorporating depth improves cherry tomato detection precision and accuracy for all the deep learning models compared, and YOLOv7-tiny-CTD outperforms all compared models without sacrificing efficiency. Combined with the eye-in-hand configuration and the force sensing feedback built into the custom end effector, robotic picking of cherry tomatoes was carried out in the greenhouse using a very simple trajectory planning and grasping strategy. In 400 picking attempts at five different rack positions, the average success rate was 83.5%, indicating that the picking robot system has potential for application in a real commercial farming environment.
However, there are still many shortcomings in the harvesting robot used. For example, the robot picks a single fruit at a time, and the cobot arm picking movement is slow, thus the overall system efficiency is low. It is foreseeable that with a proper end effector design and a more sophisticated picking strategy, cherry tomatoes could be harvested in bunches instead of individually, as may be desired in some cases. This needs a different detection strategy, but we think the multimodal perception scheme is likely to be of great help for cherry tomato bunch detection with some adjustment.
In addition, though the cherry tomato detection accuracy is high (about 95%), the picking success rate is about 10% lower even with a second try allowed. The limitations are more on the hardware side. It should be noted that more sophisticated techniques are available, such as humanoid end effectors [43,44,45], but at the moment the cost and complexities associated with such devices are prohibitive for practical fruit harvesting robot applications. The hardware cost of the robotic system in the current study is only about USD 28,000, comparable to the yearly labor cost of two workers in the commercial farm. Affordability is an important factor in the practical adoption of fruit picking robots.
In the future, we plan to improve the robotic system in several aspects: (1) design a new end effector to support both single picking and bunch picking to improve efficiency; (2) incorporate bunch detection into the neural network model; (3) add tactile sensors embedded in the fingertips of the end effector to achieve more delicate visual–tactile-guided fruit picking; (4) develop a miniature and portable multispectral sensor that can be integrated into the end effector to judge the maturity and status of cherry tomatoes before picking.

5. Conclusions

This study designed, integrated, and tested a cherry tomato detection scheme using a multimodal RGB-D sensor and an improved YOLOv7-tiny cherry tomato detection network named YOLOv7-tiny-CTD for a large greenhouse farming environment. In order to fully utilize the depth information in RGB-D images and improve cherry tomato detection performance, the RGB-D images are segmented using both color and depth to obtain cherry tomato ROIs, and a normal vector angle transformation of the point cloud is introduced. At the same time, the YOLOv7-tiny model is improved in multiple aspects. The color image input is replaced by a four-channel image containing the normal vector angle map. New Classness and hybrid NMS methods are also utilized.
The proposed method was tested using a harvesting robot in a commercial cherry tomato farming greenhouse. The experimental results show that the improved YOLOv7-tiny-CTD model obtains better cherry tomato detection with the multimodal image inputs. Incorporating depth improves cherry tomato detection precision and accuracy for all the deep learning models compared, and the proposed YOLOv7-tiny-CTD outperforms all compared models without sacrificing efficiency, running at 26 FPS on the Nvidia Jetson TX1.

6. Patents

A patent application was filed with China National Intellectual Property Administration in 2023 (No. CN202310505684.X).

Author Contributions

Conceptualization, Z.Z. and Y.T.; methodology, B.C., H.D. and Y.T.; software, B.C. and H.D.; validation, Y.C. (Yingqi Cai), B.C. and Q.W.; formal analysis, B.C. and Y.C. (Yingqi Cai); investigation, D.L. and Y.C. (Yukang Cui); resources, H.D. and Y.T.; data curation, B.C. and H.D.; writing—original draft preparation, Y.C. (Yingqi Cai), B.C. and Q.W.; writing—review and editing, D.L., Z.Z., Y.C. (Yukang Cui) and Y.T.; visualization, Y.C. (Yingqi Cai) and Q.W.; supervision, Y.T.; project administration, H.D.; funding acquisition, Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially funded by Shenzhen University (No. 868/03010315 and 86902/00248).

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

B.C. and H.D. are with Litemaze Technology of Shenzhen, China, a 3D sensor manufacturer. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Bogue, R. Fruit picking robots: Has their time come? Ind. Robot Int. J. Robot. Res. Appl. 2020, 47, 141–145. [Google Scholar] [CrossRef]
  2. Ceres, R.; Pons, J.L.; Jimenez, A.; Martin, J.; Calderon, L. Design and implementation of an aided fruit-harvesting robot (Agribot). Ind. Robot Int. J. 1998, 25, 337–346. [Google Scholar] [CrossRef]
  3. Muscato, G.; Prestifilippo, M.; Abbate, N.; Rizzuto, I. A prototype of an orange picking robot: Past history, the new robot and experimental results. Ind. Robot Int. J. 2005, 32, 128–138. [Google Scholar] [CrossRef]
  4. Scarfe, A.J.; Flemmer, R.C.; Bakker, H.; Flemmer, C.L. Development of an autonomous kiwifruit picking robot. In Proceedings of the 4th International Conference on Autonomous Robots and Agents, Wellington, New Zealand, 10–12 February 2009; pp. 380–384. [Google Scholar]
  5. Hua, X.; Li, H.; Zeng, J.; Han, C.; Chen, T.; Tang, L.; Luo, Y. A review of target recognition technology for fruit picking robots: From digital image processing to deep learning. Appl. Sci. 2023, 13, 4160. [Google Scholar] [CrossRef]
  6. Pal, A.; Leite, A.C.; From, P.J. A novel end-to-end vision-based architecture for agricultural human–robot collaboration in fruit picking operations. Robot. Auton. Syst. 2024, 172, 104567. [Google Scholar] [CrossRef]
  7. Chen, B.; Gong, L.; Yu, C.; Du, X.; Chen, J.; Xie, S.; Le, X.; Li, Y.; Liu, C. Workspace decomposition based path planning for fruit-picking robot in complex greenhouse environment. Comput. Electron. Agric. 2023, 215, 108353. [Google Scholar] [CrossRef]
  8. Bulanon, D.M.; Kataoka, T.; Ota, Y.; Hiroma, T. AE—Automation and emerging technologies: A segmentation algorithm for the automatic recognition of Fuji apples at harvest. Biosyst. Eng. 2002, 83, 405–412. [Google Scholar] [CrossRef]
  9. Payne, A.B.; Walsh, K.B.; Subedi, P.; Jarvis, D. Estimation of mango crop yield using image analysis–segmentation method. Comput. Electron. Agric. 2013, 91, 57–64. [Google Scholar] [CrossRef]
  10. Senthilnath, J.; Dokania, A.; Kandukuri, M.; Ramesh, K.; Anand, G.; Omkar, S. Detection of tomatoes using spectral-spatial methods in remotely sensed RGB images captured by UAV. Biosyst. Eng. 2016, 146, 16–32. [Google Scholar] [CrossRef]
  11. Luo, L.; Tang, Y.; Zou, X.; Wang, C.; Zhang, P.; Feng, W. Robust grape cluster detection in a vineyard by combining the AdaBoost framework and multiple color components. Sensors 2016, 16, 2098. [Google Scholar] [CrossRef]
  12. Teixidó, M.; Font, D.; Pallejà, T.; Tresanchez, M.; Nogués, M.; Palacín, J. Definition of linear color models in the RGB vector color space to detect red peaches in orchard images taken under natural illumination. Sensors 2012, 12, 7701–7718. [Google Scholar] [CrossRef] [PubMed]
  13. Kurtulmus, F.; Lee, W.S.; Vardar, A. Immature peach detection in colour images acquired in natural illumination conditions using statistical classifiers and neural network. Precis. Agric. 2014, 15, 57–79. [Google Scholar] [CrossRef]
  14. Zhou, W.; Meng, F.; Li, K. A cherry tomato classification-picking Robot based on the K-means algorithm. J. Phys. Conf. Ser. 2020, 1651, 012126. [Google Scholar] [CrossRef]
  15. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
  16. Chen, J.; Wang, Z.; Wu, J.; Hu, Q.; Zhao, C.; Tan, C.; Teng, L.; Luo, T. An improved Yolov3 based on dual path network for cherry tomatoes detection. J. Food Process Eng. 2021, 44, e13803. [Google Scholar] [CrossRef]
  17. Zheng, H.; Wang, G.; Li, X. YOLOX-Dense-CT: A detection algorithm for cherry tomatoes based on YOLOX and DenseNet. J. Food Meas. Charact. 2022, 16, 4788–4799. [Google Scholar] [CrossRef]
  18. Yan, Y.; Zhang, J.; Bi, Z.; Wang, P. Identification and Location Method of Cherry Tomato Picking Point Based on Si-YOLO. In Proceedings of the IEEE 13th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Qinhuangdao, China, 11–14 July 2023; pp. 373–378. [Google Scholar]
  19. Wang, C.; Wang, C.; Wang, L.; Wang, J.; Liao, J.; Li, Y.; Lan, Y. A lightweight cherry tomato maturity real-time detection algorithm based on improved YOLOV5n. Agronomy 2023, 13, 2106. [Google Scholar] [CrossRef]
  20. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  21. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  22. Zhou, H.; Li, Z.; Ning, C.; Tang, J. Cad: Scale invariant framework for real-time object detection. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 760–768. [Google Scholar]
  23. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  24. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar]
  25. Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4507–4515. [Google Scholar]
  26. Cui, B.; Zeng, Z.; Tian, Y. A Yolov7 cherry tomato identification method that integrates depth information. In Proceedings of the Third International Conference on Optics and Image Processing (ICOIP 2023), Hangzhou, China, 14–16 April 2023; Volume 12747, pp. 312–320. [Google Scholar]
  27. Gursoy, E.; Navarro, B.; Cosgun, A.; Kulić, D.; Cherubini, A. Towards vision-based dual arm robotic fruit harvesting. In Proceedings of the IEEE 19th International Conference on Automation Science and Engineering (CASE), Auckland, New Zealand, 26–30 August 2023; pp. 1–6. [Google Scholar]
  28. Wang, H.; Cui, B.; Wen, X.; Jiang, Y.; Gao, C.; Tian, Y. Pallet detection and estimation with RGB-D salient feature learning. In Proceedings of the 2023 China Automation Congress (CAC), Chongqing, China, 17–19 November 2023; pp. 8914–8919. [Google Scholar]
  29. Durmus, D. CIELAB color space boundaries under theoretical spectra and 99 test color samples. Color Res. Appl. 2020, 45, 796–802. [Google Scholar] [CrossRef]
  30. Tian, Y. Dynamic focus window selection using a statistical color model. Digit. Photogr. II 2006, 6069, 98–106. [Google Scholar]
  31. Serra, J.; Vincent, L. An overview of morphological filtering. Circuits Syst. Signal Process. 1992, 11, 47–108. [Google Scholar] [CrossRef]
  32. Fabbri, R.; Costa, L.D.F.; Torelli, J.C.; Bruno, O.M. 2D Euclidean distance transform algorithms: A comparative survey. ACM Comput. Surv. (CSUR) 2008, 40, 1–44. [Google Scholar] [CrossRef]
  33. Holzer, S.; Rusu, R.B.; Dixon, M.; Gedikli, S.; Navab, N. Adaptive neighborhood selection for real-time surface normal estimation from organized point cloud data using integral images. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 2684–2689. [Google Scholar]
  34. Zia, S.; Yuksel, B.; Yuret, D.; Yemez, Y. RGB-D object recognition using deep convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 896–903. [Google Scholar]
  35. Gené-Mola, J.; Vilaplana, V.; Rosell-Polo, J.R.; Morros, J.-R.; Ruiz-Hidalgo, J.; Gregorio, E. Multi-modal deep learning for Fuji apple detection using RGB-D cameras and their radiometric capabilities. Comput. Electron. Agric. 2019, 162, 689–698. [Google Scholar] [CrossRef]
  36. Eitel, A.; Springenberg, J.T.; Spinello, L.; Riedmiller, M.; Burgard, W. Multimodal deep learning for robust RGB-D object recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 681–687. [Google Scholar]
  37. Guan, L.; Wang, F.; Li, B.; Tang, R.; Wei, R.; Deng, H.; Tian, Y. Adaptive automotive chassis welding joint inspection using a cobot and a multi-modal vision sensor. In Proceedings of the International Conference on Digital Economy and Artificial Intelligence, Shenzhen, China, 24–26 June 2024; pp. 841–849. [Google Scholar]
  38. Jiang, J.; Luo, X.; Luo, Q.; Qiao, L.; Li, M. An overview of hand-eye calibration. Int. J. Adv. Manuf. Technol. 2022, 119, 77–97. [Google Scholar] [CrossRef]
  39. Enebuse, I.; Foo, M.; Ibrahim, B.S.K.K.; Ahmed, H.; Supmak, F.; Eyobu, O.S. A comparative review of hand-eye calibration techniques for vision guided robots. IEEE Access 2021, 9, 113143–113155. [Google Scholar] [CrossRef]
  40. Zhou, Q.; Zhang, W.; Li, R.; Wang, J.; Zhen, S.; Niu, F. Improved YOLOv5-S object detection method for optical remote sensing images based on contextual transformer. J. Electron. Imaging 2022, 31, 043049. [Google Scholar] [CrossRef]
  41. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  42. Kim, S.; Na, S.; Kong, B.Y.; Choi, J.; Park, I.-C. Real-time SSDLite object detection on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 1192–1205. [Google Scholar] [CrossRef]
  43. Fukaya, N.; Toyama, S.; Asfour, T.; Dillmann, R. Design of the TUAT/Karlsruhe humanoid hand. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Takamatsu, Japan, 30 October–5 November 2000; Volume 3, pp. 1754–1759. [Google Scholar]
  44. Parlikar, S.; Jagannath, V. Application of pneumatic soft actuators as end-effectors on a humanoid torso playing percussion instrument. In Proceedings of the 8th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 17–19 March 2021; pp. 676–680. [Google Scholar]
  45. Ramón, J.L.; Calvo, R.; Trujillo, A.; Pomares, J.; Felicetti, L. Trajectory optimization and control of a free-floating two-arm humanoid robot. J. Guid. Control Dyn. 2022, 45, 1661–1675. [Google Scholar] [CrossRef]
Figure 1. The function module diagram of the cherry tomato picking robotic system.
Figure 2. The cherry tomato harvesting robot in the greenhouse commercial farming facility. (a) The robotic system on the track in the greenhouse. (b) Single cherry tomato picking.
Figure 3. The graphic user interface of the robotic system. S/S/S represent Start/Stop/Settings.
Figure 4. Initial cherry tomato ROI segmentation. (a) RGB image. (b) Image in Lab space. (c) Single channel in the Lab space. (d) Binary mask from all color channels. (e) Binary mask from the depth channel. (f) Initial cherry tomato ROIs.
Figure 5. The sequential cropped patches before (left) and after (right) filtering. Some patches (with red cross signs in the middle) on the top right, right, and bottom are discarded.
Figure 6. Illustration of the normal vector angles of a point cloud. (a) Normal vector angles. Since a cherry tomato is a small spherical object, the depth range ($dZ = Z_2 - Z_1$) is only a tiny fraction of its distance to the camera ($Z_1$). (b) A local point cloud neighborhood and its normal vector (depiction of the purple rectangular area in (a)). Each colored circle represents a point in the local neighborhood.
Figure 7. Illustration of depth percentile in prediction box. (a) Histogram of depth values. (b) Cumulative distribution function (CDF) of the depth values.
Figure 8. The adjustment to the network input for YOLOv7-tiny-CTD.
Figure 9. Cherry tomato detection by the YOLOv7-tiny and YOLOv7-tiny-CTD. (a) RGB image. (b) Point cloud normal vector angles added to (a), green areas indicate small, far away, or partially occluded cherry tomatoes, and purple areas large or close ones. (c) Detection output from the YOLOv7-tiny. (d) Detection output from the YOLOv7-tiny-CTD. The two orange arrows point to cherry tomatoes that are detected by the YOLOv7-tiny-CTD but missed by the YOLOv7-tiny.
Table 1. Comparison of different Non-Maximum Suppression (NMS) methods.

NMS | Advantage | Disadvantage
Regular | Simple | Sequential processing, IOU threshold selected based on experience
Weighted [22] | High precision | Sequential processing, low efficiency
DIOU [23] | High recall; can be combined with other methods | Low efficiency; abnormal conditions when centers coincide
CIOU [24] | Overcomes DIOU anomalies | Low efficiency; increasing number of iterations
Learning [25] | No hand-crafted settings | Complex implementation
Table 2. Model training and evaluation conditions.

Environment | Parameters/Version
Operating system | Ubuntu 18.04
CPU | Intel i7-10700F
Memory | 16 GB
GPU | NVIDIA RTX 3070
CUDA | 11.2
CUDNN | 8.1.1
Python | 3.8
PaddlePaddle-GPU | 2.2.1
Table 3. Cherry tomato detection comparison between YOLOv7-tiny-CTD and 4 existing deep learning models using RGB-only and RGB-D images. YOLOv7-tiny-CTD has no RGB-only mode as it utilizes the normal vector angles of the point cloud.

Model | Input | mAP | Recall | Accuracy
YOLOv5-s | RGB | 86.0% | 87.6% | 88.2%
YOLOv5-s | RGB-D | 91.1% | 89.7% | 91.2%
Faster R-CNN | RGB | 88.0% | 90.1% | 89.1%
Faster R-CNN | RGB-D | 91.8% | 91.3% | 91.5%
SSDLite | RGB | 87.3% | 87.9% | 88.7%
SSDLite | RGB-D | 89.6% | 90.2% | 91.1%
YOLOv7-tiny | RGB | 90.4% | 90.2% | 89.9%
YOLOv7-tiny | RGB-D | 92.8% | 91.6% | 91.8%
YOLOv7-tiny-CTD | RGB-D | 94.9% | 96.1% | 95.7%

Note: Bold is the best value, and italic underlined is the second-best value in each column.
Table 4. Cherry tomato picking success rates over five different runs, each with 80 image acquisitions and cherry tomato detections.

Trial | Num of Fruits with Picking Action | Num of Successes on 1st Attempt | Num of Successes on 2nd Attempt | Success Rate with Two Attempts
1 | 77 | 56 | 62 | 81%
2 | 78 | 57 | 63 | 81%
3 | 78 | 60 | 65 | 83%
4 | 72 | 61 | 63 | 87%
5 | 71 | 61 | 61 | 86%
