Machines
  • Article
  • Open Access

13 September 2024

Deep Learning-Based Real-Time 6D Pose Estimation and Multi-Mode Tracking Algorithms for Citrus-Harvesting Robots

1 School of ICT, Robotics and Mechanical Engineering, Hankyong National University, Anseong 456-749, Republic of Korea
2 Smart Convergence Technology Research Center, Hankyong National University, Anseong 456-749, Republic of Korea
* Author to whom correspondence should be addressed.
This article belongs to the Section Robotics, Mechatronics and Intelligent Machines

Abstract

In the agricultural sector, fruit harvesting relies heavily on manual labor, leading to an unstable labor supply and rising costs. Agricultural harvesting robots are gaining attention as a solution to these problems, but effective robotic harvesting requires accurate 6D pose estimation of the target object, which remains a significant challenge. This study proposes a method to enhance the performance of fruit-harvesting robots, including the development of a dataset named HWANGMOD, which was created in both virtual and real environments with tools such as Blender and BlenderProc. Additionally, we present methods for training an EfficientPose-based model for 6D pose estimation and ripeness classification, and an algorithm for determining the optimal harvest sequence among multiple fruits. Finally, we propose a multi-object tracking method that uses the coordinates estimated by the deep learning model to improve the robot’s performance in dynamic environments. The proposed methods were evaluated using metrics such as ADD and ADD-S, showing that the deep learning model for agricultural harvesting robots excelled in accuracy, robustness, and real-time processing. These advancements contribute to the commercialization potential of agricultural harvesting robots and to the broader field of agricultural automation technology.

1. Introduction

The agricultural sector faces significant challenges due to an aging farming population and the increasing reliance on manual labor, particularly in tasks such as fruit harvesting. These challenges lead to an unstable labor supply and rising operational costs, necessitating innovative technological solutions to maintain productivity and sustainability in agriculture [1]. Among these solutions, agricultural harvesting robots have gained significant attention for their potential to address these issues by enabling fast and accurate harvesting [2]. However, the effectiveness of these robots is heavily dependent on their ability to accurately estimate the 6D pose of fruits, which is crucial for recognizing and optimally approaching the target for harvesting [3].
Despite the progress made in developing 6D pose estimation techniques, the existing methods often face limitations when applied in the dynamic and unstructured environments typical of agriculture. Previous studies have predominantly focused on industrial applications where the environmental conditions are controlled, and the objects are more uniform in shape. These approaches often struggle with the variability of lighting, occlusions, and the complex geometries of fruits found in agricultural settings. As a result, inaccuracies in pose estimation can lead to reduced efficiency and effectiveness in robotic harvesting systems.
In recent research, various approaches have been developed to enhance robotic harvesting systems. For instance, one study proposed a method based on the single-shot multibox detector (SSD) to detect apples, leveraging stereo cameras and inverse kinematics for harvesting [4,5]. Another study introduced rotated YOLO (R-YOLO), based on YOLOv3, which predicts rotated bounding boxes to improve the accuracy of strawberry picking points [6,7]. Additionally, other studies employed Mask R-CNN for instance segmentation, obtaining pixel-level position information on crops to assist precise fruit harvesting [8,9,10,11]. These studies demonstrated significant progress in the field but still faced challenges related to real-time accuracy and adaptability in unstructured environments.
In comparison with these studies, our approach focuses on the specific challenges posed by agricultural environments. We propose a more robust and adaptable 6D pose estimation model that overcomes the limitations of existing systems. This study not only addresses real-time accuracy and adaptability in unstructured environments but also contributes to a more efficient robotic harvesting process. A comparison of the proposed approach with state-of-the-art methods makes the novelty and significance of our work evident.
The objective of this study was to develop a deep learning model that accurately estimates the 6D pose of fruits, thereby improving the performance and accuracy of fruit-harvesting robots. This research aimed to overcome the limitations identified in previous studies by addressing the specific challenges posed by agricultural environments. By constructing datasets in both virtual and real environments, and employing advanced deep learning techniques, this study sought to improve the robustness and accuracy of pose estimation models used in agricultural robotics. Additionally, we propose a method for labeling the harvesting order and efficiently tracking targets in situations where there are multiple suitable fruits, based on the recognition of ripeness. Ultimately, this research contributes to the advancement of agricultural automation technology and enhances the commercialization potential of agricultural harvesting robots.
This article proposes several approaches to enhance the performance of fruit-harvesting robots. First, we propose a method to build a dataset named HWANGMOD. This method can be used in both virtual and real environments and utilizes Blender and BlenderProc to automatically generate large-scale fruit datasets in virtual environments. Furthermore, the method includes techniques for collecting data from real-world environments via a mapping process between 2D images and 3D objects and can effectively build a variety of large datasets required for the model’s training.
Second, the article provides a detailed explanation of the training process and hyperparameter settings for the EfficientPose-based 6D pose estimation model. This process was designed to enable accurate ripeness classification and 6D pose estimation of fruits, playing a crucial role in allowing the robot to identify and harvest ripe fruits.
Third, an algorithm for determining the harvest sequence among multiple suitable fruits with estimated 6D poses is proposed. This algorithm allows for the creation of efficient harvesting plans based on the camera coordinate system of the agricultural harvesting robot, helping the robot to systematically and efficiently harvest the fruits.
Finally, we propose a method to track multiple citrus objects by leveraging the coordinates estimated by deep learning models. This tracking method enhances the robot’s ability to effectively recognize and navigate toward multiple fruits while in motion, improving its performance in dynamic environments.
In the experiments, the proposed methods were evaluated by comparing the performance of the EfficientPose model and the YOLOv5-6D model on a single-object dataset using the ADD (average distance of the model’s points) and ADD-S (average distance of the model’s points for symmetric objects) metrics. YOLO stands for “you only look once”, a popular object detection model that has been extended in this case to support 6D pose estimation. Subsequently, the effectiveness of the proposed model’s 6D pose estimation and ripeness classification, as well as the proposed harvesting and tracking algorithms, was assessed in real-time scenarios involving multiple objects in both virtual and real environments. The experimental results demonstrated that the EfficientPose model excelled in accuracy, robustness, and real-time processing capability, proving to be a critical factor for the practical deployment of agricultural harvesting robots.

3. Proposed Method

3.1. Building a Dataset

In this study, the creation of a large-scale dataset, including automatically renderable virtual datasets, utilized the BlenderProc pipeline, which combines Blender and PyTorch [23]. This enabled the generation and processing of the large-scale 3D datasets necessary for training deep learning models for 6D object pose estimation. The following describes the rendering process, the data used, and the structure and content of the generated dataset.
In the Blender environment, the previously scanned 3D fruit models of the ripe fruit class (red) and unripe fruit class (green) were uploaded. Their position coordinates and rotation angles were then randomized for rendering. Thresholds were set to ensure that the two fruit objects did not exceed a certain cubic volume, and images of the two objects, background, and lighting were captured through the camera. Additionally, the camera’s 6D pose was randomly arranged to ensure that both fruits could be captured. If one object was completely occluded by another, the scene was labeled as part of an occlusion dataset under the class name “valid poses” and rendered automatically.
This automatic rendering algorithm generated n scenes, each captured twice by the camera, resulting in 2n scenes being stored. An automatic rendering environment was established to automatically obtain the internal and external parameters of the camera. Thus, 2n RGB and depth images, a camera.json file containing the camera’s parameters used for rendering, a scene_camera.json file containing the camera’s parameters for each image, and a scene_gt.json file containing the parameters of the relationship between the camera and objects in the scene were automatically generated. Additionally, the mask and mask_visib were represented by a single grayscale image (values 0 to 255) showing the mask information for each object in the 2n scenes. The mask’s information was extracted using the RGB and depth images, storing information on whether all objects were present in the scene or which class was occluded. Subsequently, a scene_gt_info.json file containing the ground-truth pose’s metadata, such as the bbox_obj parameter with the 2D bounding box information (x, y, h, w) and the px_count_all parameter with the pixel count of the object’s silhouette, was generated to construct the virtual dataset. Figure 11 shows the automatic rendering environment established using Blender and BlenderProc. Figure 12 illustrates the RGB and mask images from the automatically rendered dataset in the virtual environment.
Figure 11. Auto rendering of virtual environment using Blender.
Figure 12. RGB and mask images automatically rendered in the virtual environment.
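To make the rendering procedure concrete, the following is a minimal sketch of such an automatic rendering loop built on the public BlenderProc 2.x API. The model file names, pose-sampling ranges, intrinsics, and output directory are illustrative assumptions rather than the exact configuration used for HWANGMOD, and call names may vary slightly between BlenderProc versions.

```python
import numpy as np
import blenderproc as bproc

bproc.init()

# Load the scanned fruit models (file names are hypothetical placeholders).
ripe = bproc.loader.load_obj("models/citrus_red.obj")[0]      # ripe class (red)
unripe = bproc.loader.load_obj("models/citrus_green.obj")[0]  # unripe class (green)
ripe.set_cp("category_id", 1)
unripe.set_cp("category_id", 2)

light = bproc.types.Light()
light.set_location([1.0, -1.0, 1.5])
light.set_energy(300)

# Illustrative intrinsics for 896 x 896 renders.
K = np.array([[900.0, 0.0, 448.0], [0.0, 900.0, 448.0], [0.0, 0.0, 1.0]])
bproc.camera.set_intrinsics_from_K_matrix(K, 896, 896)
bproc.renderer.enable_depth_output(activate_antialiasing=False)

n_scenes = 10
for _ in range(n_scenes):
    # Randomize both fruit poses inside a bounded cubic volume.
    for obj in (ripe, unripe):
        obj.set_location(np.random.uniform([-0.1, -0.1, -0.1], [0.1, 0.1, 0.1]))
        obj.set_rotation_euler(np.random.uniform([0.0, 0.0, 0.0], [2 * np.pi] * 3))
    # Two camera poses per scene, each looking toward the fruits -> 2n images in total.
    for _ in range(2):
        cam_location = np.random.uniform([0.3, -0.3, 0.2], [0.6, 0.3, 0.5])
        rotation = bproc.camera.rotation_from_forward_vec(ripe.get_location() - cam_location)
        bproc.camera.add_camera_pose(bproc.math.build_transformation_mat(cam_location, rotation))

    data = bproc.renderer.render()
    # Write RGB, depth, masks, and the camera/scene_camera/scene_gt JSON files in BOP format.
    bproc.writer.write_bop("output/hwangmod_virtual", target_objects=[ripe, unripe],
                           colors=data["colors"], depths=data["depth"], m2mm=True)
    bproc.utility.reset_keyframes()
```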
Extracting the 6D pose values of objects in a virtual environment yields accurate results, but datasets from real environments are also necessary to achieve high recognition rates. Therefore, we proposed a dataset construction environment using SolvePnP and QT Creator to match 3D object models and extract 6D poses from 2D images.
SolvePnP estimates the 6D pose (position and orientation) of an object using the camera’s internal parameters and several points on the 2D image. This method is widely used in computer vision and is essential for obtaining 3D information from 2D images. By utilizing point matching, the 3D object model of the scanned fruit is matched with corresponding points on the 2D image [24]. Pose estimation is then carried out using the SolvePnP algorithm, and the 6D pose of the fruit is estimated based on the matched points. The internal parameters of the camera used to capture the actual data are utilized to extract accurate 6D poses.
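As a concrete illustration of this step, the snippet below estimates a fruit’s 6D pose with OpenCV’s solvePnP from a handful of matched 3D–2D point pairs. All numeric values are placeholders; in the actual tool, the 3D points come from the scanned fruit model, the 2D points from the user’s clicks, and the intrinsic matrix from the calibrated camera.

```python
import cv2
import numpy as np

# 3D points selected on the scanned fruit model (object frame, metres) -- placeholder values.
object_points = np.array([[0.00, 0.00, 0.04], [0.03, 0.00, 0.00], [-0.03, 0.00, 0.00],
                          [0.00, 0.03, 0.00], [0.00, -0.03, 0.00], [0.00, 0.00, -0.04]],
                         dtype=np.float64)

# Corresponding pixel coordinates marked on the 2D image -- placeholder values.
image_points = np.array([[652.0, 310.0], [690.0, 352.0], [612.0, 350.0],
                         [651.0, 391.0], [648.0, 353.0], [650.0, 395.0]], dtype=np.float64)

# Intrinsic matrix of the camera that captured the real images (placeholder calibration).
K = np.array([[905.0, 0.0, 640.0], [0.0, 905.0, 360.0], [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)  # lens distortion assumed negligible here

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist_coeffs,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
print("rotation:\n", R, "\ntranslation (m):", tvec.ravel())
```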
To efficiently perform the algorithm that extracts the pose of the 3D fruit model using point-to-point matching in 2D images, a graphical user interface (GUI) was developed using QT Creator. QT Creator is used for cross-platform development of GUI applications and provides a user-friendly interface to facilitate the matching process. This interface includes algorithms for loading and displaying images, a 3D object viewer, matching tools, and saving the matched results in a dataset format.
In the proposed method, the acquired 2D images of fruits, containing both classes, were loaded and displayed. The 3D fruit objects to be point-to-point matched were then visualized, and their coordinate systems were also imported to match the positional and rotational information. Interactive tools are essential for connecting corresponding points by adjusting the rotation, depth, and position to align the 3D fruit object with the pose in the image during point-to-point matching. This interactive tool interface was constructed using QT Creator.
Finally, a tool was developed for precise matching between the 3D object model and the 2D image by integrating the two algorithms mentioned above. This tool was used to estimate the accurate 6D pose (position and orientation) of the 3D object. By selecting points on the 3D model and linking them with corresponding points on the 2D image using the user-friendly interface, the pose of the 3D object was calculated and output to a data file, enabling construction of a dataset in real environments. Figure 13 shows the interface environment built for constructing the actual dataset. Figure 14 illustrates the RGB and mask images extracted from the real dataset in an actual environment.
Figure 13. Actual dataset construction using SolvePnP and QT Creator.
Figure 14. RGB and mask images constructed through point-to-point matching in a real environment.

3.2. Deep Learning Model Architecture

The proposed model, based on EfficientPose, could distinguish between harvest-ready fruits (red class) and unripe fruits (green class). Additionally, the 6D pose of each fruit class was extracted for harvesting operations by agricultural robots. Training was possible using the HWANGMOD dataset, which includes both real and virtual datasets. The proposed model allows for flexible adjustment of image resolution according to the model’s backbone. Therefore, if the internal parameters of the camera used to create the dataset are known, it is possible to build and train the model on both real and virtual datasets without being restricted by the image resolution output by the camera, enabling the application of the model to various objects.
The EfficientPose model referenced in this study was originally developed for TensorFlow version 1. The backbone components of the algorithm, including EfficientNet, BiFPN, and the sub-networks, were designed to operate only on TensorFlow version 1. However, this imposed a limitation in utilizing the latest graphics processing units (GPUs). Therefore, the proposed model was modified to function on TensorFlow version 2, updating all the model’s functions and structures accordingly. This upgrade allows the core components of the model, including the backbone network, BiFPN, and sub-networks, to be utilized on the latest GPUs.
EfficientNet, used as the backbone in EfficientPose, is a network for feature extraction, as shown on the left side of Figure 15. The backbone structure responsible for feature extraction employs the model described earlier, with the compound coefficient ϕ set to 3 to utilize compound scaling for depth, width, and resolution. During compound scaling, the parameters were adjusted to be 1.4 times deeper, 1.2 times wider, and with an input resolution of 896 × 896 compared with the base backbone model, forming a model with seven blocks. The seven blocks used as the backbone in the proposed model are shown on the right side of Figure 15. The model’s depth, width, and input image resolution varied, depending on the parameters used for compound scaling. Consequently, the model could be trained on the HWANGMOD dataset, which includes more complex and occluded data compared with models using the Linemod dataset that classify only one type of object, improving the accuracy of simultaneous multi-object tracking.
Figure 15. Backbone structure.
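For reference, the sketch below reproduces the EfficientNet-style rounding rules that turn the quoted compound-scaling coefficients (1.4× depth, 1.2× width, 896 × 896 input) into concrete per-block configurations. The base settings follow the public EfficientNet-B0 definition; treat the resulting numbers as an approximation of, not a substitute for, the configuration used in this study.

```python
import math

DEPTH_COEF, WIDTH_COEF, INPUT_RES = 1.4, 1.2, 896  # values reported for phi = 3 in this study

def round_repeats(repeats: int) -> int:
    # Scale the number of layers in a block and round up (EfficientNet convention).
    return int(math.ceil(DEPTH_COEF * repeats))

def round_filters(filters: int, divisor: int = 8) -> int:
    # Scale channel counts and round to the nearest multiple of `divisor`.
    filters *= WIDTH_COEF
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:
        new_filters += divisor
    return int(new_filters)

# Base EfficientNet-B0 settings for the seven MBConv blocks: (repeats, output channels).
base_blocks = [(1, 16), (2, 24), (2, 40), (3, 80), (3, 112), (4, 192), (1, 320)]
scaled = [(round_repeats(r), round_filters(c)) for r, c in base_blocks]
print(INPUT_RES, scaled)
# 896 [(2, 24), (3, 32), (3, 48), (5, 96), (5, 136), (6, 232), (2, 384)]
```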
The MBConv blocks utilized in the proposed model were mobile inverted bottleneck convolution blocks. MBConv was designed as a lightweight block that optimizes the performance relative to the computational cost, making it ideal for efficient modeling. MBConv consists of three main components.
The first component is the 1 × 1 convolution layer (Conv1×1). This layer expands or reduces the number of input channels, adjusting the depth of each feature map to alleviate bottlenecks. A batch normalization (BN) layer follows, enhancing the network’s stability and speeding up learning. The second component is the depth-wise convolution layer [25]. This layer extracts spatial features by applying filters independently to each input channel, significantly reducing the computational cost while maintaining performance. As shown in Figure 16, a 3 × 3 depth-wise convolution is typically used, but a 5 × 5 depth-wise convolution can also be used to capture a broader range of features. This layer is followed by BN and a ReLU activation function [26]. The third component is the squeeze-and-excitation (SE) block [27]. This block learns channel-wise interactions to emphasize the important features. The SE block consists of pooling, fully connected (FC) layers, ReLU, sigmoid, and multiplication operations. It learns and adjusts the importance of each channel, enhancing the model’s representational capabilities. Each MBConv block ends with an addition operation with the original input, which prevents information loss as the network deepens and mitigates the problem of vanishing gradients during backpropagation. This structure allows the model to optimize its performance while minimizing the computational cost. By employing these MBConv blocks, the proposed model can effectively balance computational efficiency and high performance, making it well suited for applications in agricultural robotics, particularly for tasks requiring real-time processing and accuracy.
Figure 16. The MBConv structure used in the backbone.
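The following Keras sketch shows one way to assemble the MBConv structure just described (1 × 1 expansion, depth-wise convolution, SE gating, 1 × 1 projection, and residual addition). The expansion ratio and SE reduction are typical defaults, not values taken from the EfficientPose implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def mbconv_block(inputs, out_channels, expand_ratio=6, kernel_size=3, se_ratio=0.25):
    """Minimal MBConv block: expansion -> depth-wise conv -> SE -> projection (+ residual)."""
    in_channels = inputs.shape[-1]
    x = inputs
    # 1x1 expansion convolution with batch normalization.
    if expand_ratio != 1:
        x = layers.Conv2D(in_channels * expand_ratio, 1, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    # Depth-wise convolution (3x3 here, 5x5 for a wider receptive field).
    x = layers.DepthwiseConv2D(kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Squeeze-and-excitation: global pooling, two 1x1 (FC) layers, sigmoid gating.
    se = layers.GlobalAveragePooling2D(keepdims=True)(x)
    se = layers.Conv2D(max(1, int(in_channels * se_ratio)), 1, activation="relu")(se)
    se = layers.Conv2D(in_channels * expand_ratio, 1, activation="sigmoid")(se)
    x = layers.Multiply()([x, se])
    # 1x1 projection back to the target channel count (no activation).
    x = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    # Residual addition when shapes match, mitigating vanishing gradients.
    if in_channels == out_channels:
        x = layers.Add()([x, inputs])
    return x
```

In a full backbone, blocks like this are stacked according to the scaled repeat and channel counts from the previous sketch, with strided variants between stages.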
The EfficientPose model can estimate 6D poses through four sub-networks from an input image. However, this model operates only in TensorFlow version 1 and cannot utilize high-resolution input images. Therefore, the proposed model addressed these limitations by utilizing the latest features and APIs of TensorFlow 2 to improve the performance and efficiency. Additionally, it increased the depth of the backbone through compound scaling parameters and optimized the flow of information between the features in the BiFPN part to enable more accurate object detection and segmentation. This improvement also leverages new functions in TensorFlow 2. Finally, the sub-networks, which perform predictions in the final stage of the model, were enhanced using TensorFlow 2’s improved performance to provide more precise predictions and reduce training time. These upgrades allowed the model to operate faster and more efficiently, taking advantage of the various benefits of the latest TensorFlow version to enhance the overall performance.
The proposed model has several advantages over the basic EfficientPose by modifying the data generator part. The model can accept high-resolution images as input. Additionally, it can simultaneously recognize multiple objects in real-time, outputting various labeled classes of fruits, and can also recognize occluded objects through dataset transformations. Therefore, the proposed deep learning model can determine the harvest readiness of fruits and estimate the 6D poses of various objects using the created datasets, regardless of the resolution.
Figure 17 shows the structure of the proposed model. The model comprises seven compound-scaled blocks as the backbone, a BiFPN for fusing the feature maps at various levels, and four sub-networks for the outputs (class, bounding box, rotation, and translation). The class network allows simultaneous estimation of multiple objects. In this study, this part was used to determine that a fruit was harvest-ready if its class was “red” and not harvest-ready if its class was “green”. The bounding box network estimates the 2D bounding boxes of objects and combines them with 3D object information from the dataset to estimate the 3D bounding boxes. The rotation network estimates the rotation values of the 6D poses for each detected object. Finally, the translation network estimates the position values of the 6D poses for each detected object.
Figure 17. The model’s structure.
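As a rough illustration of these four output branches, the sketch below attaches lightweight prediction heads to a single fused BiFPN feature map. The head widths, depths, anchor count, and output parameterizations are assumptions; the actual EfficientPose sub-networks share weights across feature levels and recover the translation from a predicted 2D center and depth using the camera intrinsics.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 2   # "red" (harvest-ready) and "green" (not harvest-ready)
NUM_ANCHORS = 9   # assumption: anchors per feature-map cell

def prediction_heads(fused_feature):
    """Sketch of the four sub-networks applied to one BiFPN feature level."""
    def head(width, depth, out_channels, name):
        x = fused_feature
        for i in range(depth):
            x = layers.SeparableConv2D(width, 3, padding="same", activation="swish",
                                       name=f"{name}_conv{i}")(x)
        return layers.SeparableConv2D(out_channels, 3, padding="same", name=f"{name}_out")(x)

    class_logits = head(160, 4, NUM_ANCHORS * NUM_CLASSES, "class")   # ripeness class scores
    boxes        = head(160, 4, NUM_ANCHORS * 4, "box")               # 2D bounding-box offsets
    rotations    = head(160, 4, NUM_ANCHORS * 3, "rotation")          # axis-angle rotation
    translations = head(160, 4, NUM_ANCHORS * 3, "translation")       # (tx, ty, tz) components
    return class_logits, boxes, rotations, translations

# Example: a hypothetical 112 x 112 x 160 feature map from one BiFPN level.
feature = tf.keras.Input(shape=(112, 112, 160))
model = tf.keras.Model(feature, prediction_heads(feature))
```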

3.3. Object Labeling and Tracking According to the Driving and Harvesting Mode

In this study, we applied the SORT algorithm to predict the 2D bounding boxes for object tracking in a fruit-harvesting robot. The SORT algorithm integrates an object detector, a Kalman filter, IoU distance calculation, and the Hungarian algorithm to enable real-time object tracking. The object detector predicts the position of objects in each frame by generating bounding boxes in the format [x, y, a, h], where x and y represent the center coordinates of the bounding box, a is the aspect ratio, and h is the height. Using the information on bounding boxes from previously tracked objects, the Kalman filter predicts each object’s position in the current frame. The Kalman filter estimates the object’s trajectory and provides a corrected position to maintain tracking continuity, resulting in predictions of the bounding boxes for the next frame.
Next, we calculated the IoU values between the detected bounding boxes in the current frame and the predicted bounding boxes from the Kalman filter to compute matching scores. IoU represents the overlap ratio between two bounding boxes, with higher values indicating greater similarity. Based on these matching scores, the Hungarian algorithm optimally matches the detected objects with the predicted objects. The Hungarian algorithm minimizes the matching cost, assigning tracking IDs to each object and maintaining object continuity across frames.
The proposed algorithm’s workflow is as follows. The Kalman filter uses the detection results from the previous frame to predict the current frame’s object positions while new detections are simultaneously performed. IoU distances are calculated to match the predicted and detected bounding boxes. Unmatched detections are treated as new objects, while unmatched predictions are considered tracking failures. Matched objects update their state via the Kalman filter, while unmatched objects are either deleted or added as new objects, enabling real-time object tracking.
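The core association step of this workflow can be sketched as follows: an IoU-based cost matrix between detected and Kalman-predicted boxes is solved with the Hungarian algorithm (here via SciPy’s linear_sum_assignment). The box format and threshold are illustrative; a full SORT tracker additionally maintains the per-track Kalman filter state and ID bookkeeping.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(detections, predictions, iou_threshold=0.3):
    """Match detections to Kalman-predicted boxes; return matches and both unmatched sets."""
    if not detections or not predictions:
        return [], list(range(len(detections))), list(range(len(predictions)))
    cost = np.array([[1.0 - iou(d, p) for p in predictions] for d in detections])
    det_idx, pred_idx = linear_sum_assignment(cost)  # minimizes total (1 - IoU)
    matches = []
    unmatched_det = [d for d in range(len(detections)) if d not in det_idx]
    unmatched_pred = [p for p in range(len(predictions)) if p not in pred_idx]
    for d, p in zip(det_idx, pred_idx):
        if 1.0 - cost[d, p] < iou_threshold:   # too little overlap: treat as unmatched
            unmatched_det.append(d)
            unmatched_pred.append(p)
        else:
            matches.append((d, p))             # matched detection keeps the track's ID
    return matches, unmatched_det, unmatched_pred
```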
This proposed algorithm for tracking the 2D bounding boxes of fruits is utilized during the robot’s navigation mode, when the target is farther away than the threshold distance. Hence, it is an essential component that helps a fruit-harvesting robot accurately navigate toward the target fruit. Figure 18 represents the overall workflow of the tracking algorithm used in the fruit-harvesting robot. When multiple classes of fruits are recognized by the detector in the previous and current frames, ripe fruits are indicated by blue boxes and unripe fruits by orange boxes. Subsequently, real-time object tracking is performed following the process described above.
Figure 18. Overall operational flow of the tracking algorithm for citrus-harvesting robots.
We propose an algorithm that utilizes the 6D coordinates of objects to sort the harvesting order so that the fruit-harvesting robot can effectively harvest multiple ripe fruits. This enables the robot to harvest the fruits along the most efficient path, optimizing the harvesting task. Figure 19 shows the schematic of estimating the position of a fruit in three-dimensional space using a geometric model of the camera. In the camera coordinate system, the fruit’s position is represented as t = (t_x, t_y, t_z), which is projected onto the image plane through the principal point. The camera is aligned with the optical axis, and the principal point is the center of the camera’s image sensor. The image plane of the camera is set at z = f in the camera coordinate system, where f is the focal length of the camera. The 3D coordinates of the fruit are projected onto the image plane along the optical axis, resulting in the image coordinates (c_x, c_y). This projection uses the camera’s internal and external parameters to convert the 3D coordinates into the image coordinates (c_x, c_y).
Figure 19. Schematic for labeling the harvesting order.
Through the four sub-networks of the previously explained deep learning model, the class, 2D bounding box, rotation, and translation of the objects can be extracted. We propose an algorithm that labels the harvesting order using the class and translation values of the objects when they are ready for harvesting. The determination of whether an object is ready for harvesting is made based on the extracted class ID of the object. If the recognized object’s class is “red” and its ID is 0, it is considered ready for harvesting; if the class is “green” and its ID is 1, it is considered unready for harvesting. When multiple objects of either class are recognized on the tree, the agricultural harvesting robot may become confused about which ripe object to harvest first. Therefore, an algorithm for labeling the harvesting order is necessary. By using the Euclidean distance formula, the translation values of multiple ripe objects are utilized to calculate the distance in the 3D space for each object. The Euclidean distance formula is as follows:
d = \sqrt{(t_x - X_c)^2 + (t_y - Y_c)^2 + (t_z - Z_c)^2},
where (X_c, Y_c, Z_c) denotes the reference point of the camera coordinate system (the camera origin) and t = (t_x, t_y, t_z) is the estimated translation of a ripe fruit.
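A minimal sketch of this harvest-order labeling is given below. The detection keys (class_id, translation) are hypothetical names for the outputs of the class and translation sub-networks, and the reference point (X_c, Y_c, Z_c) is taken as the camera origin (0, 0, 0).

```python
import math

def label_harvest_order(detections):
    """Sort ripe ("red", class id 0) detections by Euclidean distance from the camera origin."""
    ripe = [d for d in detections if d["class_id"] == 0]   # keep only the harvest-ready class
    for d in ripe:
        tx, ty, tz = d["translation"]                      # camera-frame position of the fruit
        d["distance"] = math.sqrt(tx ** 2 + ty ** 2 + tz ** 2)
    ripe.sort(key=lambda d: d["distance"])                  # nearest fruit is harvested first
    for order, d in enumerate(ripe, start=1):
        d["harvest_order"] = order
    return ripe

# Hypothetical detections: translations in metres, class 0 = "red", class 1 = "green".
detections = [{"class_id": 0, "translation": (0.10, -0.05, 0.42)},
              {"class_id": 1, "translation": (0.00, 0.02, 0.35)},
              {"class_id": 0, "translation": (-0.08, 0.01, 0.27)}]
print([(d["harvest_order"], round(d["distance"], 3)) for d in label_harvest_order(detections)])
# [(1, 0.282), (2, 0.435)]
```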
Finally, we propose an algorithm that controls the harvest and navigation modes using a threshold to enable the fruit-harvesting robot to effectively harvest multiple fruits. The proposed algorithm calculates the Euclidean distance to the detected fruits, applying the harvest mode when the distance to the nearest fruit is within the 30 cm threshold and the navigation mode when it exceeds this threshold.
The overall flow of the algorithm is as follows. First, the robot checks whether any fruits have been detected. If fruits are detected, it then determines whether any of them are ripe and ready for harvest. If no ripe fruits are detected, it displays “green class”. If ripe fruits are present, it proceeds to the distance calculation stage and computes the Euclidean distance to each fruit. If the distance to the nearest ripe fruit is within 30 cm, it applies the harvest mode and displays “red class harvest sequence labeling” to label the harvesting order. Conversely, if the distance to the nearest ripe fruit exceeds 30 cm, it applies the navigation mode (SORT tracking) and displays “red class tracking ID labeling”.
Figure 20 shows the overall structure of the model, which includes the algorithm for estimating the 6D pose of fruits, determining the harvest-readiness of multiple objects, and switching between harvest and navigation modes depending on the distance. The algorithm first checks for detected fruits, then determines if there are any ripe fruits. It then calculates the distance to the nearest fruits and compares it with the 30 cm threshold to decide between the harvest and navigation modes. This algorithm is integrated into the overall deep learning model. The deep learning model is based on the EfficientNet Backbone and BiFPN network and extracts the object’s class, 2D bounding box, rotation, and translation through four sub-networks. The extracted data are fed into the proposed algorithm to determine the harvest and navigation modes.
Figure 20. Algorithm for switching between harvesting and navigation modes for a fruit-harvesting robot.
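Building on the ordering sketch above, the threshold-based mode decision can be expressed as follows; the returned strings mirror the labels displayed by the algorithm, and 30 cm is the threshold stated in the text.

```python
HARVEST_THRESHOLD_M = 0.30  # 30 cm mode-switching threshold

def select_mode(ripe_detections):
    """Choose between harvest and navigation behavior from the nearest ripe fruit's distance.

    `ripe_detections` is the nearest-first list produced by label_harvest_order above,
    i.e., ripe fruits with a precomputed `distance` field in metres.
    """
    if not ripe_detections:
        return "green class"                          # no ripe fruit detected: keep searching
    if ripe_detections[0]["distance"] <= HARVEST_THRESHOLD_M:
        # Harvest mode: the harvest sequence comes from the distance-sorted order.
        return "red class harvest sequence labeling"
    # Navigation mode: hand the 2D boxes to the SORT tracker and keep driving toward the fruit.
    return "red class tracking ID labeling"
```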

4. Experimental Results

In this study, three experiments were conducted to evaluate the performance of the fruit-harvesting robot. The first experiment compared the ADD and ADD-S metrics of the YOLOv5-6D [28] model and the EfficientPose model using an optimized dataset of single fruits. The second experiment validated the recognition rate for multiple citrus fruits, including ripe and unripe citrus, using multi-object virtual and real datasets. The third experiment compared the FPS (frames per second) and recognition rate of the overall model, including the algorithm for switching between harvest and navigation modes, based on the recognition results.
ADD and ADD-S were used as performance evaluation metrics for object recognition [29]. All experiments were conducted using two NVIDIA RTX 3090 Ti GPUs. These two metrics evaluate the accuracy of a model in predicting the 6D position and pose of an object. ADD is used for asymmetric objects and calculates the average distance between the model’s points under the predicted 6D pose and under the actual 6D pose. The formula for ADD is as follows:
\mathrm{ADD} = \frac{1}{m} \sum_{x \in M} \left\| (Rx + T) - (\hat{R}x + \hat{T}) \right\|,
where M is the set of the 3D model’s points; m is the number of the model’s points; R and T are the actual rotation matrix and translation vector of the object, respectively; \hat{R} and \hat{T} are the predicted rotation matrix and translation vector, respectively; and x represents a model point. ADD indicates how closely the predicted pose matches the actual pose, with smaller values indicating higher accuracy. ADD-S is used for symmetric objects and calculates, for each point in the model, the shortest distance between the predicted pose and the actual pose, averaging these distances. The formula for ADD-S is as follows:
\mathrm{ADD\text{-}S} = \frac{1}{m} \sum_{x_1 \in M} \min_{x_2 \in M} \left\| (Rx_1 + T) - (\hat{R}x_2 + \hat{T}) \right\|,
where x_1 and x_2 are points in the model’s point set M, and the other symbols have the same meaning as in ADD. ADD-S takes symmetry into account when calculating the average distance between the predicted and actual poses, making it a more appropriate metric for evaluating symmetric objects.
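Both metrics follow directly from their definitions; a NumPy sketch is shown below, together with the 10% of object size acceptance criterion used in the experiments.

```python
import numpy as np

def add_metric(model_points, R, T, R_hat, T_hat):
    """ADD: mean distance between corresponding model points under the true and predicted poses."""
    gt = model_points @ R.T + T          # (m, 3) points under the ground-truth pose
    pred = model_points @ R_hat.T + T_hat
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_metric(model_points, R, T, R_hat, T_hat):
    """ADD-S: for each ground-truth point, distance to the closest predicted point (symmetric case)."""
    gt = model_points @ R.T + T
    pred = model_points @ R_hat.T + T_hat
    pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # (m, m) distances
    return pairwise.min(axis=1).mean()

def pose_correct(error, object_size, threshold=0.10):
    """10% criterion: the pose counts as correct if the error is below 10% of the object size."""
    return error < threshold * object_size
```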
To evaluate performance, the FPS and recognition accuracy of the entire model were measured. FPS is a metric that evaluates the real-time processing performance of a model, representing the number of frames processed per second [30]. A high FPS indicates the model’s capability to operate in real time, ensuring the robot’s fast and accurate operation. In the experiments, the FPS of the entire system based on the YOLOv5-6D and EfficientPose models was measured and compared. The recognition rate for fruits suitable for harvesting evaluated how accurately the model classified the suitability of fruits for harvesting, using the confusion matrix components TP (true positive), FP (false positive), TN (true negative), and FN (false negative) [31]. The recognition rate indicates the percentage of classes correctly recognized by the model, calculated from the confusion matrix.
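As a simple illustration of how the recognition rate follows from these confusion-matrix counts (the numbers below are made-up placeholders, not the measured results):

```python
def recognition_rate(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of fruits whose harvest suitability was classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical counts for 100 evaluated fruits (placeholders only).
print(recognition_rate(tp=77, fp=2, tn=18, fn=3))  # -> 0.95
```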
In the first experiment, we compared the 6D pose accuracy for a single ripe citrus object between the YOLOv5-6D model and the proposed model using the ADD and ADD-S metrics. The YOLOv5-6D model is a deep learning model based on the YOLO (you only look once) architecture, which detects 2D objects and subsequently estimates their 6D pose. In contrast, the proposed model directly estimates the 6D pose in an end-to-end manner from the input image. Both models were trained using the dataset of single citrus objects automatically rendered in a virtual environment, as proposed in this study. The dataset was divided into 80% (8000 images) for training and 20% (2000 images) for validation. Table 1 shows the performance evaluation metrics for single objects of the harvestable “red” class and the non-harvestable “green” class. The comparison used the 10% ADD and ADD-S metrics, evaluating the models with the criterion that errors within 10% of the size of the modeled object were acceptable. This means that a 6D pose estimate was considered successful if the error in the predicted position and orientation was within 10% of the object’s total size. By using this 10% threshold, the experiment ensured that the models were robust enough to perform accurate pose estimation in scenarios where slight deviations are permissible, reflecting real-world applications where perfect accuracy is not always necessary but the error must remain within a tolerable range.
Table 1. Performance evaluation for a single object using the 10% ADD and ADD-S metrics.
Figure 21 and Figure 22 illustrate the ground truth of the 6D pose and the predicted 6D pose for a single object, represented as 3D bounding boxes.
Figure 21. YOLOv5-6D’s performance on the virtual dataset of single red fruits: correct (green box) and predicted (blue box) 6D poses.
Figure 22. EfficientPose’s performance on the virtual dataset of single red fruits: correct (green box) and predicted (blue box) 6D poses.
In the second experiment, a combined dataset of virtual and real environments was used to simultaneously recognize ripe and unripe citrus fruits. The dataset consisted of 10,000 images, with virtual and real images mixed in a 7:3 ratio. All 10,000 images were used for training. Validation was performed on an independent dataset of 4000 images obtained from outdoor environments on Jeju Island. The backbone structure of the model was set with a compound coefficient of 3 to recognize multiple objects in a more complex structure. Table 2 shows the recognition accuracy of multi-object class classification in both virtual and real environments through a confusion matrix. The ripeness classification of fruits involved assigning each of the multiple objects a class (red or green) and distinguishing them. To evaluate the performance of the ripeness classification, 10 scenes each containing 10 citrus fruits were filmed, comprising 8 fruits suitable for harvest and 2 unsuitable ones. In total, 100 citrus fruits were evaluated using a confusion matrix. The results were categorized into harvest-suitable fruits classified correctly (true positive, TP), harvest-suitable fruits classified incorrectly (false negative, FN), harvest-unsuitable fruits classified correctly (true negative, TN), and harvest-unsuitable fruits classified incorrectly (false positive, FP).
Table 2. Recognition rate for the confusion matrix after 10 rounds of evaluating performance in discriminating ripeness.
Figure 23 is one of the scenes from the maturity recognition experiment shown in Table 2. Figure 24 shows the results of estimating 6D poses for green and red objects photographed in three different scenes from an actual environment in Jeju.
Figure 23. Experimental results of determination of the ripeness of multiple objects, represented as harvestable (blue boxes) and unharvestable (red boxes).
Figure 24. Accurate poses (green boxes) and predicted poses (blue and red boxes) of 6D pose estimations for citrus fruits in three different scenes from a real environment of a Jeju Island farm.
In the final experiment, we evaluated the processing performance of the harvest and driving modes required by an agricultural harvesting robot, using the weights learned in the previous experiments. We utilized the same training results as in the second experiment. If the closest object was within the threshold distance, this corresponded to the robot’s harvest mode, and the harvest-order IDs for multiple objects were sorted. Conversely, if the object was beyond the threshold distance, this corresponded to the robot’s driving mode, and tracking was performed. On the basis of this algorithm, we measured the inference time per frame, from the RGB input through the classification of multiple objects, extraction of 2D and 3D bounding boxes, and estimation of the objects’ translation and rotation, to the proposed sorting and ordering.
The experimental results showed that using the model with a compound coefficient of 3, which utilized a relatively complex backbone, processing 4000 images took 236.13 s. Among the 4000 images, the number of images that were correctly recognized was 3998. Thus, a performance of approximately 16.94 FPS was verified. Figure 25 and Figure 26 show the application of the tracking function in driving mode and the function of classifying the harvesting order in harvesting mode in an agricultural harvesting robot.
Figure 25. Assigning a harvesting order to harvestable citrus fruits in harvest mode.
Figure 26. Tracking the same objects in driving mode compared with the previous frame.

5. Discussion

Our experimental results provided a comprehensive understanding of the performance and practical applicability of the EfficientPose-based model for fruit-harvesting robots. The EfficientPose-based model outperformed the YOLOv5-6D model across all key metrics. In the first experiment, the EfficientPose model achieved higher accuracy in 6D pose estimation, with an average ADD and ADD-S of 97.615% compared with 96.51% for the YOLOv5-6D model. This improvement was attributed to the end-to-end architecture of EfficientPose, which directly estimated the 6D pose from input images, thereby reducing the errors associated with the intermediate steps in the YOLOv5-6D model.
In the second experiment, the model successfully recognized and classified ripe and unripe citrus fruits in both virtual and real environments, achieving a recognition rate of 97.5% for ripeness. The high true positive (TP) rate demonstrated the model’s robustness in accurately identifying harvestable fruits, while the relatively low false negative (FN) and false positive (FP) rates underscored its precision in distinguishing between ripe and unripe fruits. This precision is crucial for optimizing the harvesting process and minimizing damage to unripe fruits.
The third experiment validated the model’s real-time processing capabilities, with an average FPS of 16.94. This indicated that the model can effectively support the dynamic and fast-paced requirements of an agricultural harvesting robot. The ability to switch between harvesting and navigation modes based on the proximity of target objects should ensure efficient operation and reduce downtime.
In conclusion, the EfficientPose-based model presents a viable solution for enhancing the performance and efficiency of fruit-harvesting robots. The integration of advanced deep learning techniques and comprehensive dataset construction methods significantly contributes to the advancement of agricultural automation technology, demonstrating the potential for commercial deployment and real-world application in diverse agricultural settings.

6. Conclusions

We developed and validated a deep learning model specifically designed for agricultural robots, with a focus on improving the efficiency and accuracy of fruit harvesting. A key contribution of this research is the creation and validation of the HWANGMOD dataset, which includes both virtual and real-world data. This dataset was instrumental in enabling the model to accurately extract 6D pose information for multiple fruits in real-time, and to assess the ripeness of each fruit.
One of the distinguishing features of our proposed model is its ability to switch between navigation and harvesting modes according to a predefined threshold. When the distance between the robot and the target fruit exceeds this threshold, the model utilizes the SORT algorithm to track multiple objects, ensuring efficient navigation toward the fruits. As the robot approaches the fruits and the distance falls below the threshold, the model transitions to harvesting mode. In this mode, the Euclidean distance is employed to prioritize the fruits based on their proximity to the robot, allowing the robot to harvest the fruits in the most efficient sequence.
This dynamic mode-switching capability, driven by real-time 6D pose estimation and threshold-based decision-making, significantly enhances the robot’s operational efficiency. By accurately distinguishing between ripe and unripe fruits and adapting its behavior according to the proximity of the target, the robot can optimize both its navigation and harvesting processes. The integration of these advanced techniques contributes to the development of more autonomous and effective agricultural robots, paving the way for their practical deployment in diverse agricultural environments.
Additionally, we proposed a model that includes 6D pose estimation and tracking algorithms for switching between harvesting and navigation modes to improve the performance of fruit-harvesting robots. The proposed model was designed to perform 6D pose estimation and determine the ripeness of fruits in real-time. To achieve this, we constructed virtual and real datasets to accurately estimate the 6D pose of fruits and developed an algorithm to efficiently determine the harvesting order by distinguishing between ripe and unripe fruits.
In a comparison of the ADD and ADD-S metrics on an optimized fruit dataset using the YOLOv5-6D model and the EfficientPose model, the EfficientPose model achieved higher accuracy in 6D pose estimation. The model successfully recognized and classified ripe and unripe fruits in both virtual and real environments, achieving a recognition rate of 97.5% for suitability for harvesting. The high true positive rate and low false negative and false positive rates demonstrated the model’s robustness and precision. The model’s real-time processing capability was validated with an average FPS of 16.94. This indicated that the model can effectively support the dynamic and fast-paced requirements of agricultural harvesting robots, allowing efficient operation by switching between the harvesting and navigation modes depending on the proximity of the target objects.
This study suggests that the development of a robust and efficient 6D pose estimation model can significantly impact the commercialization of agricultural harvesting robots. The ability to accurately identify and harvest ripe fruits in real time can lead to substantial improvements in productivity and cost-efficiency. Additionally, integrating virtual and real datasets for training enhanced the model’s adaptability to various environmental conditions, supporting practical deployment in diverse agricultural settings.
While the current study has demonstrated promising results, further research is needed to address certain limitations. Future work should focus on improving the model’s performance under varying lighting conditions and in more complex environments. Additionally, integrating other sensory data, such as depth information, could enable accurate pose estimation regardless of the fruits’ characteristics and improve the rotational accuracy. Expanding the dataset to include a wider variety of fruit types and sizes could also enhance the model’s generalizability and robustness.

Author Contributions

Conceptualization, methodology, and software, H.-J.H.; investigation, J.-H.C.; writing and original draft preparation, Y.-T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Korea Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET) through Open Field Smart Agriculture Technology Short-term Advancement Program, funded by the Ministry of Agriculture, Food, and Rural Affairs (MAFRA) (122032-03-1SB010).

Data Availability Statement

Restrictions apply to the datasets. The datasets presented in this article are not readily available due to restrictions imposed by the Korean government, as the data were generated under a government-funded project. Therefore, the data cannot be shared publicly.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

References

  1. Rad, M.; Lepetit, V. Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3828–3836. [Google Scholar]
  2. Lin, K.-Y.; Tseng, Y.-H.; Chiang, K.-W. Interpretation and transformation of intrinsic camera parameters used in photogrammetry and computer vision. Sensors 2022, 22, 9602. [Google Scholar] [CrossRef] [PubMed]
  3. Szeliski, R. Computer Vision: Algorithms and Applications; Springer Nature: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  4. Onishi, Y.; Yoshida, T.; Kurita, H.; Fukao, T.; Arihara, H.; Iwai, A. An automated fruit harvesting robot by using deep learning. Robomech J. 2019, 6, 13. [Google Scholar] [CrossRef]
  5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. pp. 21–37. [Google Scholar]
  6. Yu, Y.; Zhang, K.; Liu, H.; Yang, L.; Zhang, D. Real-time visual localization of the picking points for a ridge-planting strawberry harvesting robot. IEEE Access 2020, 8, 116556–116568. [Google Scholar] [CrossRef]
  7. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1–6. [Google Scholar]
  8. Santos, T.T.; De Souza, L.L.; dos Santos, A.A.; Avila, S. Grape detection, segmentation, and tracking using deep neural networks and three-dimensional association. Comput. Electron. Agric. 2020, 170, 105247. [Google Scholar] [CrossRef]
  9. Jia, W.; Tian, Y.; Luo, R.; Zhang, Z.; Lian, J.; Zheng, Y. Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot. Comput. Electron. Agric. 2020, 172, 105380. [Google Scholar] [CrossRef]
  10. Afonso, M.; Fonteijn, H.; Fiorentin, F.S.; Lensink, D.; Mooij, M.; Faber, N.; Polder, G.; Wehrens, R. Tomato fruit detection and counting in greenhouses using deep learning. Front. Plant Sci. 2020, 11, 571299. [Google Scholar] [CrossRef]
  11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  12. Song, S.; Yu, F.; Zeng, A.; Chang, A.X.; Savva, M.; Funkhouser, T. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1746–1754. [Google Scholar]
  13. Brito, A. Blender 3D; Novatec: New York, NY, USA, 2007. [Google Scholar]
  14. Hodan, T.; Michel, F.; Brachmann, E.; Kehl, W.; GlentBuch, A.; Kraft, D.; Drost, B.; Vidal, J.; Ihrke, S.; Zabulis, X. Bop: Benchmark for 6d object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 19–34. [Google Scholar]
  15. Bukschat, Y.; Vetter, M. EfficientPose: An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach. arXiv 2020, arXiv:2011.04307. [Google Scholar]
  16. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  17. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  18. Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv 2017, arXiv:1711.00199. [Google Scholar]
  19. Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. Mar. 1960, 82, 35–45. [Google Scholar] [CrossRef]
  20. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  21. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  22. Dokmanic, I.; Parhizkar, R.; Ranieri, J.; Vetterli, M. Euclidean distance matrices: Essential theory, algorithms, and applications. IEEE Signal Process. Mag. 2015, 32, 12–30. [Google Scholar] [CrossRef]
  23. Denninger, M.; Sundermeyer, M.; Winkelbauer, D.; Zidan, Y.; Olefir, D.; Elbadrawy, M.; Lodhi, A.; Katam, H. Blenderproc. arXiv 2019, arXiv:1911.01911. [Google Scholar]
  24. Lee, D.H.; Lee, S.S.; Kang, H.H.; Ahn, C.K. Camera position estimation for UAVs using SolvePnP with Kalman filter. In Proceedings of the 2018 1st IEEE International Conference on Hot Information-Centric Networking (HotICN), Shenzhen, China, 15–17 August 2018; pp. 250–251. [Google Scholar]
  25. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  26. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  28. Viviers, C.G.; Filatova, L.; Termeer, M.; de With, P.H.; van der Sommen, F. Advancing 6-DoF Instrument Pose Estimation in Variable X-Ray Imaging Geometries. IEEE Trans. Image Process. 2024, 33, 2462–2476. [Google Scholar] [CrossRef] [PubMed]
  29. Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; Navab, N. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Proceedings of the Computer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Republic of Korea, 5–9 November 2012; Revised Selected Papers, Part I 11. pp. 548–562. [Google Scholar]
  30. Liu, Y.; Zhai, G.; Zhao, D.; Liu, X. Frame rate and perceptual quality for HD video. In Proceedings of the Advances in Multimedia Information Processing—PCM 2015: 16th Pacific-Rim Conference on Multimedia, Gwangju, Republic of Korea, 16–18 September 2015; Proceedings, Part II 16. pp. 497–505. [Google Scholar]
  31. Düntsch, I.; Gediga, G. Confusion matrices and rough set data analysis. J. Phys. Conf. Ser. 2019, 1229, 012055. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
