1. Introduction
The labor used in fruit picking operations accounts for 33% to 55% of the labor used in the entire production process. At present, the vast majority of fruit picking in China is still performed manually, and the mechanical harvesting rate in 2020 was only 2.7%. With rising labor costs and an aging population, the industrial development of fruit production is facing a bottleneck [1]. At the same time, China's labor force is rapidly shrinking. To solve these outstanding problems, China's policies and guidelines in recent years have made agricultural mechanization and automation key development strategies. With the deep integration of agronomy, agricultural machinery, and recognition algorithms, automatic orchard picking has developed rapidly, bringing convenience to mechanized fruit planting. Researchers have conducted considerable research and development on the recognition and positioning capabilities of orchard picking robots. Automatic picking robots have become the development trend in fruit harvesting, helping to solve labor shortages, improve productivity, and reduce production costs.
The Queensland University of Technology in Australia designed a bell pepper picking robot with a six-degree-of-freedom collaborative robotic arm and an end-effector combining a suction cup with a cutting blade [2]. The robot uses a camera to acquire the shape, position, and attitude of the pepper, and selects an appropriate attitude to grasp and cut the pepper stalk without damaging the plant. Xiong et al. achieved efficient picking of ripe strawberries by improving the vision system and using a dual-arm system [
3]. The Israeli startup MetoMotion designed and developed Sweeper, a greenhouse pepper picking robot based on the Robot Operating System (ROS). The robot consists of an RGB-D camera, a six-degree-of-freedom industrial robotic arm, an end-effector to cut the stalks, a high-performance computer, and a PLC [
4]. The robot’s vision system identifies the location and shape of ripe bell peppers and estimates the center of the fruit and the position of the stalk. The University of Madrid, Spain, has developed a two-armed collaborative intelligent eggplant picking robot, which allows for simultaneous picking with both arms [
5]. It can perform accurate recognition under shading and improves the picking success rate in environments with cluttered branches and leaves. A strawberry picking robot was designed and developed by Feng et al. of the China Intelligent Agricultural Equipment Research Center; its main parts include a binocular vision camera, a sonar-guided four-wheeled mobile platform, a six-degree-of-freedom industrial robotic arm, and an end-effector with suction cups to hold the fruit [6]. The vision recognition system identifies the fruit, and the robotic arm is controlled to complete the picking operation. A hybrid pneumatic–electric driven apple picking robot was designed by Zhao et al. of Jiangsu University [
7]. The apple recognition method can switch between deep learning and conventional vision and is capable of all-weather operation. The robot contains a lifting device and a pneumatic telescopic arm to carry out picking operations. However, the above picking robots suffer from poor mechanical stability, low reliability, and low recognition accuracy, as well as high planning failure rates and long picking operation times. Wang et al. used deep learning to identify clusters of the entire banana fruit during harvesting by counting the number of clusters. They used edge detection algorithms to extract the centroid points of the fruit fingers and clustering algorithms to determine the optimal number of clusters on the visual detection surface. The results indicated a target segmentation MIoU of 0.878 during the debudding period, a mean pixel precision of 0.936, and a final bunch detection accuracy rate of 86% [
8]. Wu et al. proposed a spatio-temporal convolutional neural network model that leverages the shifted window Transformer fusion region convolutional neural network model for the purpose of detecting pineapple fruits [
9]. Therefore, this article proposes a YOLOv5 algorithm with an attention mechanism for recognition and positioning, and uses a robotic picking arm to complete the picking operation.
The idea of the attention mechanism is to quickly scan the global image to locate the region that requires attention, form an attention focus, and then devote more processing to that region to obtain more detailed information about the target while suppressing other, useless information. In recent years, the attention mechanism has been widely applied in image recognition, and it has become one of the most noteworthy core technologies in deep learning for extracting correlated features from the input image while suppressing irrelevant ones. The ECA attention mechanism was proposed by Wang et al. [8] in the article titled "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks". The ECA module improves on the SE attention mechanism [10]: a dynamic convolution kernel operating on a 1 × 1 mapping replaces the fully connected layer of the SE module, avoiding damage to the direct correspondence between channels and weights. Wang et al. proposed a residual attention network to make the extraction of target features more efficient [
11]. Shen et al. proposed an improved Mask R-CNN that enhances the model’s ability to extract and generalize complex features by introducing an attention mechanism [
12]. Experimental results on a test set showed that its accuracy and recall reached 95.8% and 97.1%, respectively. Although this method detects apples accurately, the network model is large and places high demands on hardware devices. Yang et al. proposed BCo-YOLOv5, which introduces an attention mechanism to make the backbone network more sensitive to the surface texture, color, and features of the target, enabling the effective detection of orchard fruits [
13]. However, this experiment used only a single type of data set, so the data set needs to be expanded to further test the detection performance of the model.
At the same time, there have been many studies on object detection based on deep learning in the agricultural field. Fan et al. proposed using a lightweight YOLOv5 model to effectively detect the position and maturity of tomato fruits in real-time [
14]. This improvement reduced the model size by 51.1% while maintaining a true detection rate of 93% [
15]. Gao et al. achieved efficient detection of multiple targets by enhancing the YOLOv8 network [
16]. Zeng et al. introduced an apple small fruit detection method; this method utilizes the YOLOv5s deep learning algorithm with channel pruning to achieve fast and accurate detection. A compact model size of 1.4 MB is beneficial for the development of portable mobile terminals [
17]. Ma et al. evaluated six versions of YOLO (YOLOv3, YOLOv3-tiny, YOLOv4, YOLOv4-tiny, YOLOv5x, and YOLOv5s) for the real-time detection and counting of grape bunches. YOLOv4-tiny was found to be the best choice due to its optimal balance between accuracy and speed. This study provides valuable insights into the performance of different YOLO versions and their applicability in the grape industry [
18]. Wang et al. used a single stage detector based on YOLOv5 to effectively identify nodes, fruits, and flowers on challenging datasets obtained during stress experiments on various tomato genotypes, and achieved relatively high scores [
19]. Machine vision-based fruit and vegetable picking robots not only reduce labor costs but also improve the picking rate of fruits and vegetables, and they have become a research hotspot in the agricultural intelligent equipment industry [
20]. Cardellicchio et al. proposed an improved YOLOv5 network for robotic vision systems to address the current issues of inaccurate visual positioning and low recognition efficiency of robots [
21]. Song et al. proposed a BiFPN-S structure to improve the feature extraction network of YOLOv5. The improved algorithm can detect fruits at different growth stages for use in the machine vision system of subsequent fruit picking robots [
22]. Moreover, the apple detection algorithm based on the YOLOv7 model proposed by Yang et al. and the tomato detection algorithm based on the YOLOv8 model proposed by Yue et al. are examples of fruit detection methods based on the YOLO framework [
23,
24].
In summary, although previous studies have made achievements in the development of robotic arm systems and object detection, challenges remain in unstructured orchard scenarios. Existing robots generally face problems such as insufficient mechanical stability and low operational reliability. In real environments, the path planning failure rate is relatively high and picking operations are time-consuming, making it difficult to meet the requirements of actual production for efficient and stable operation. In addition, accurate target localization under occlusion and in cluttered environments remains to be further explored. To address these issues, this study developed an efficient harvesting system that can operate continuously in small picking spaces and retains high-precision target detection under environmental occlusion. The contributions of this study are as follows:
The apple farming techniques in Liaoning Province are backward, with most orchards being unstructured. The dense planting of fruit trees leads to a large number of leaf occlusions and interlaced branches in the environment. Traditional robotic arms have small picking spaces and find it difficult to complete continuous picking. To solve this problem, this study designed a six-degree-of-freedom robotic arm with a 120° working angle. Precise spatial positioning is achieved through coordinate transformation, and the positioning error of the end-effector is within 1.5 mm, achieving a motion planning success rate of 92%.
Aiming at the problem of low apple detection accuracy in orchard environments, this study proposes a machine vision-based apple harvesting system. By integrating the ECA attention module into the YOLOv5s model, the feature representation of occluded targets is significantly enhanced, addressing occlusion and clutter in unstructured environments and achieving a high apple recognition rate. Compared with the base model, the average precision is improved by 2.5%, realizing accurate target localization.
3. Identification and Positioning
3.1. YOLOv5 Algorithm for Apple Recognition with the Introduction of Attention Mechanism
Obtaining the location and pose of apples is a prerequisite for successful picking. The complex backgrounds, varied shapes, differing lighting and shading, and different ripeness levels of apples growing in the natural orchard environment can all affect recognition. In this paper, we adopt an improved YOLOv5 recognition algorithm that incorporates an attention mechanism. The idea of the attention mechanism originally derives from the signal processing mechanism unique to human vision: by quickly scanning the global image, human vision locates the area that needs to be focused on, forms an attention focus, and then devotes more attention to this area to obtain more detailed information about the target while suppressing other, useless information. The attention mechanism has been widely used in image recognition in recent years.
The network structure of YOLOv5 consists of the Backbone and the Head (the Head can be subdivided into Neck and Detect). In this paper, YOLOv5s, with depth_multiple = 0.33 and width_multiple = 0.50, is chosen as the network model for this experiment. To seek ways to enhance the detection performance of YOLOv5s, four attention mechanisms (SE, CBAM, ECA, and CA) were integrated at two positions in the backbone feature extraction network: between the C3-3 network layer and the SPPF network layer, and inside the C3 network layers (C3-3, C3-6, C3-9). The experimental results show that incorporating the ECA attention mechanism in the C3 module effectively enhances the backbone network's feature extraction ability: the ECA module strengthens the input features along the channel dimension while maintaining the direct correspondence between channels and weights, which improves the performance of the YOLOv5 model in detecting apples.
The experimental results show that incorporating the ECA attention mechanism in the C3 network layers (C3-3, C3-6, C3-9) makes the detection performance of the algorithm optimal; the network architecture of C3ECA-YOLOv5s is shown in
Figure 4.
The backbone feature extraction network consists of three modules: Focus, CSP, and SPPF. In YOLOv5 (version 6.1), the Focus module is replaced by a 6 × 6 convolutional layer. There are two structures of CSP (Cross Stage Partial Network) [26,27] in YOLOv5: the CSP in the backbone network uses residual connections [28], while the CSP in the Head part uses direct connections. In this paper, the improvement of the C3ECA-YOLOv5s algorithm is made to the Cross Stage Partial Network in the backbone: the backbone's ability to extract features is enhanced by incorporating the ECA attention mechanism into the Bottleneck module. In YOLOv5 (version 6.1), the SPP module is replaced by SPPF, which achieves the same effect as the SPP module while speeding up the network by cascading three 5 × 5 pooling kernels.
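To make the cascaded pooling concrete, below is a minimal PyTorch sketch of the SPPF idea described above, assuming an arbitrary channel count; the 1 × 1 convolutions that the full SPPF module places before and after the pooling are omitted for brevity.

```python
import torch
import torch.nn as nn

class SPPFPooling(nn.Module):
    """Sketch of SPPF's core trick: three cascaded 5x5 max-pools reproduce the
    5/9/13 receptive fields of SPP while reusing intermediate results."""
    def __init__(self, k=5):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        y1 = self.pool(x)        # equivalent to a 5x5 pool
        y2 = self.pool(y1)       # equivalent to a 9x9 pool of the input
        y3 = self.pool(y2)       # equivalent to a 13x13 pool of the input
        return torch.cat([x, y1, y2, y3], dim=1)  # concatenate along channels
```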
The Head module consists of two parts: Neck and Detect. The Neck module adopts the Path Aggregation Network (PANet) [29], in which features passed through the Feature Pyramid Network (top-down path fusion) are then fused along bottom-up paths, enhancing the Neck module's ability to fuse and detect features. The Detect module performs bounding box regression and non-maximum suppression; YOLOv5s uses the CIoU loss to measure the distance between the ground-truth box and the predicted box, which effectively mitigates the inaccuracy of plain IOU calculation between the two and improves target detection accuracy. The CIoU loss
is calculated as shown in Equation (3):

$$L_{CIoU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \tag{3}$$

Note: $IOU$ denotes the intersection-over-union between the true and predicted boxes; $\rho(b, b^{gt})$ denotes the Euclidean distance between the center points of the predicted and true target bounding boxes; $c$ denotes the diagonal distance of the smallest enclosing rectangle of the predicted and true target bounding boxes; $\alpha v$ is the aspect ratio penalty term, in which $v$ measures the consistency of the width-to-height ratio and $\alpha$ is its weighting coefficient.
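As a concrete illustration of Equation (3), the sketch below computes the CIoU loss for axis-aligned boxes; the (x1, y1, x2, y2) box format and the function name are assumptions of this example, not the exact YOLOv5 implementation.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss of Equation (3) for boxes in (x1, y1, x2, y2) format."""
    # Intersection and union -> IOU
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # rho^2: squared distance between the two box centers
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2 +
            (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    # c^2: squared diagonal of the smallest enclosing rectangle
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # alpha * v: aspect-ratio consistency penalty
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wt, ht = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```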
In this study, the logistic regression algorithm is applied to the linear regression outputs over the data set, and the detection with the highest confidence is output as the final result of image detection and recognition.
The confidence level reflects the degree of certainty that the target falls within a specific box area, and is computed as follows:

$$score = Pr(Object) \times IOU$$

where $score$ is the confidence value, $Pr(Object)$ is the probability that the prediction box contains a training-sample object, and $IOU$ is the overlap rate between the candidate box and the original marker box.
The ECA is a channel attention module with lightweight, plug-and-play characteristics. It can enhance the channel-dimension features of the input feature map without changing its size. In ECA-Net, the fully connected layers used by the SE attention mechanism to learn channel-dimension information are replaced by a dynamic convolution kernel, which completes the corresponding operation over a 1 × 1 mapping. The dynamic convolution kernel is one whose size adapts to the number of channels of the feature map; the relationship between the kernel size $k$ and the number of channels $C$ is shown in Equation (7):

$$k = \psi(C) = \left| \frac{\log_{2} C}{\gamma} + \frac{b}{\gamma} \right|_{odd} \tag{7}$$

where $|\cdot|_{odd}$ denotes the nearest odd number, and $\gamma$ and $b$ are hyperparameters (set to 2 and 1 in ECA-Net). The structure of the ECA module is shown in Figure 5.
We incorporated the ECA attention modules into the C3 network layers (C3-3, C3-6, C3-9) of the backbone feature network to enhance the network's ability to extract features. As shown in Figure 4 (the network architecture of C3ECA-YOLOv5s), an attention module is added after the 1 × 1 and 3 × 3 convolution kernels, and the outputs are then added to the feature maps carried by the residual connection. Experiments show that the mAP of this method reaches 94.7%, which is 4.0% higher than that of YOLOv5s, demonstrating the effectiveness of the improved algorithm.
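The following is a minimal PyTorch sketch of the ECA block described above, using the adaptive kernel size of Equation (7); the class name and default hyperparameters are illustrative, and inserting it into the C3 Bottleneck would follow the scheme in Figure 4 rather than this standalone form.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling followed by a 1D
    convolution across channels, with kernel size k adapted via Equation (7)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1                     # force an odd kernel size
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                              # x: (N, C, H, W)
        y = self.pool(x)                               # (N, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2)) # 1D conv across channels
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
        return x * y.expand_as(x)                      # channel-wise reweighting
```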
3.2. Evaluation Indicators of the Model
This study evaluated the model using four indicators: precision, recall, mean average precision (mAP), and detection speed. The precision, recall, and mAP scores are calculated by Formulas (8)–(10); the higher the score, the better the detection effect and the more stable and faithful the model's performance:

$$P = \frac{TP}{TP + FP} \tag{8}$$

$$R = \frac{TP}{TP + FN} \tag{9}$$

$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i \tag{10}$$

Note: TP indicates that the actual sample is positive and the prediction is also positive; FP indicates that the actual sample is negative but the prediction is positive; FN indicates that the actual sample is positive but the prediction is negative. AP (average precision) is the area under the precision-recall curve for one category; the higher the AP value, the better the performance of the model. mAP is the average of the AP values over the four apple categories, and C represents the number of categories.
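As a worked illustration of Formulas (8)–(10), the sketch below computes the three scores from detection counts and a precision-recall curve; the function names and the trapezoidal AP approximation are assumptions of this example.

```python
import numpy as np

def precision(tp, fp):
    # Formula (8): P = TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):
    # Formula (9): R = TP / (TP + FN)
    return tp / (tp + fn)

def average_precision(recalls, precisions):
    # AP: area under the precision-recall curve (trapezoidal approximation;
    # recalls must be sorted in ascending order)
    return float(np.trapz(precisions, recalls))

def mean_average_precision(ap_per_class):
    # Formula (10): mAP = (1/C) * sum of AP_i over the C categories
    return float(np.mean(ap_per_class))

# Hypothetical example: 90 TP, 10 FP, and 12 FN for one category
print(precision(90, 10), recall(90, 12))
```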
3.3. Comparison of C3ECA with SE, CBAM, CA Attention Mechanism
To further analyze the impact of the C3ECA attention mechanism, along with SE, CBAM, and CA, on detection performance, this study embeds these four attention mechanisms into the YOLOv5s model for evaluation. To ensure experimental fairness, the three comparative attention mechanisms are also embedded into the same C3 layers (C3-3, C3-6, C3-9). Precision, recall, and mAP are used as evaluation metrics. The detailed comparative data of the four attention mechanisms are presented in
Table 2.
In
Table 2, compared with the SE, CBAM, and CA attention mechanisms, the mAP of the model embedded with the C3ECA detection module improved by 1.2%, 0.9%, and 1.8%, respectively, and its inference time was faster by 0.1 ms, 0.3 ms, and 0.1 ms, respectively.
3.4. Comparison of C3ECA YOLOv5s Algorithm with Other Detection Methods
To further analyze the detection performance of the C3ECA-YOLOv5s algorithm, precision, recall, and mAP were used to evaluate the algorithm in this article. The performance was compared with the latest mainstream detection models, such as YOLOv8s and the original YOLOv5s detection algorithm. The comparative data of the three algorithms are detailed in
Table 3.
In
Table 3, it can be seen that the precision, recall, mAP@0.5, and inference time were 90.7%, 88.1%, 92.5%, and 2.5 ms, respectively. Compared to the original YOLOv5s algorithm, these represent improvements of 2.1%, 1.2%, and 2.5%, with inference 1.2 ms faster. Compared with the mainstream detection models shown in the table, mAP improved by 0.9%, 1.5%, 1.8%, and 0.6%, and the inference time is faster than those of the above mainstream detection methods. In summary, the precision, recall, and mAP@0.5 are all improved.
3.5. Algorithm Detection Field Experiment
To enrich the dataset, to improve the accuracy of the network model, and to guard against an insufficient number of images and gaps in environmental variables, the apple image training set was expanded to 844 images, collected across the daytime sunlight hours. To determine recognition effectiveness, real-time localization and recognition trials were performed in the orchard field. The YOLOv5s model was used to identify and localize apples at straight-line distances of 0.5–2.5 m, and 50 localization trials were conducted for each model. Visualized images from some of the trials of the YOLOv5s model are shown in Figure 6; different angles and lighting conditions were considered within the recognition range, and the distribution of apples is irregular and partially occluded (see Figure 6). All apples within the field of view were localized and measured. The confidence level based on the YOLOv5s model reached 90%, and the recognition rate of apples within the recognition range reached 98%, evidence that the algorithm is robust in different environments. The improved YOLOv5s algorithm can effectively identify targets and provide accurate target information for robotic picking.
3.6. Target Identification and Positioning
The target identification and positioning system mainly includes image acquisition software and a motion control module. The depth camera captures information about the picking object, which can be quickly located by filtering the data set. The vision camera is mounted on the upper part of the end-effector of the 6-DOF robotic arm and allows multi-angle adjustment within the workspace. The experiment used an Intel RealSense D435i stereo depth camera with a depth map resolution of 1280 × 720 pixels, a color image resolution of 848 × 480 pixels, and a depth detection range of 0.2–10.0 m, powered via USB. The camera's internal parameters were obtained using Intel RealSense Viewer, the built-in software of the RealSense camera.
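As an illustrative sketch of how the streams and internal parameters can also be read programmatically, the snippet below uses the pyrealsense2 package at the resolutions stated above; the frame rates chosen here are assumptions.

```python
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 848, 480, rs.format.bgr8, 30)
profile = pipeline.start(config)

# Read the factory-calibrated intrinsics (fx, fy, u0, v0) of the color stream
intr = profile.get_stream(rs.stream.color).as_video_stream_profile().get_intrinsics()
print(intr.fx, intr.fy, intr.ppx, intr.ppy)
pipeline.stop()
```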
Since the target coordinate system of the object captured in the vision camera's field of view does not match the robot's own coordinate system, the camera coordinate system must be transformed into the robot coordinate system, i.e., target calibration must be performed. The transformation uses a 4 × 4 homogeneous matrix containing a rotation matrix $R$ (3 × 3) and a translation vector $t$ (3 × 1) as follows:

$$T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$
where {$x_c$, $y_c$, $z_c$} is the camera coordinate system with origin $O_c$; {$x$, $y$} is the image coordinate system with origin $O_i$; {$u$, $v$} is the pixel coordinate system with origin $O_p$; point $P_c$ is the positioning point in the camera coordinate system; point $P$ is the intersection of the projection ray $O_cP_c$ with the image coordinate plane; and $f$ is the focal length, in mm. The model of the camera coordinate system is shown in Figure 7.
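The following sketch applies the 4 × 4 homogeneous transform above with numpy; the rotation and translation values are hypothetical placeholders, not the system's calibrated extrinsics.

```python
import numpy as np

# Hypothetical calibration: R (3x3 rotation) and t (3x1 translation), camera -> robot base
R = np.eye(3)
t = np.array([[0.10], [0.00], [0.25]])  # meters

# Assemble the 4x4 homogeneous transformation matrix T = [[R, t], [0, 1]]
T = np.block([[R, t], [np.zeros((1, 3)), np.ones((1, 1))]])

p_cam = np.array([0.05, -0.02, 0.80, 1.0])  # homogeneous point in the camera frame
p_base = T @ p_cam                           # the same point in the robot frame
print(p_base[:3])
```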
The conversion from the image coordinate system to the pixel coordinate system requires a translation operation as follows:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

where $f_x$, $f_y$, $u_0$, $v_0$ are camera parameters and $d_x$, $d_y$ are the unit pixel sizes along the u and v axes of the pixel coordinate system. Combining the two transformation matrices yields the conversion equation as follows:

$$z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}, \qquad f_x = \frac{f}{d_x}, \; f_y = \frac{f}{d_y}$$
The depth image of the apple is computed by triangulation from the infrared images obtained by the 3D camera, and the raw depth is then processed and aligned. Let the original color pixel coordinates be (uc, vc) and the processed depth image coordinates be (ud, vd); the two are aligned so that each detected color pixel (uc, vc) corresponds to a depth pixel (ud, vd). Finally, the 3D coordinates of the localization point in the camera coordinate system (xc, yc, zc) are obtained from the transformed coordinates.
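A minimal sketch of this back-projection step, assuming aligned color and depth pixels and hypothetical intrinsics (fx, fy, u0, v0):

```python
import numpy as np

def deproject(u, v, z, fx, fy, u0, v0):
    """Back-project an aligned pixel (u, v) with depth z (meters) into camera
    coordinates (xc, yc, zc) by inverting the pinhole model above."""
    xc = (u - u0) * z / fx
    yc = (v - v0) * z / fy
    return np.array([xc, yc, z])

# Hypothetical intrinsics and a detected apple-center pixel at 0.85 m depth
print(deproject(424, 240, 0.85, fx=615.0, fy=615.0, u0=424.0, v0=240.0))
```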
In this experiment, the laser rangefinder is installed on top of the depth camera. The measured height difference between the laser rangefinder and the depth camera is far smaller than the distance to the positioning point, so this height difference is ignored. The devices are connected to a laptop computer, which drives the depth camera to obtain the distance estimate $z_{ci}$ and the laser rangefinder to obtain the reference distance $z_{dci}$, and both values are saved. To evaluate positioning accuracy, the mean error $E_z$ and mean error ratio $E_{zr}$ are used as evaluation indicators: $E_z$ reflects the absolute error between the estimated value and the true value, while $E_{zr}$ reflects the relative error between them. The calculation equations are as follows:

$$E_z = \frac{1}{m} \sum_{i=1}^{m} \left| z_{ci} - z_{dci} \right|, \qquad E_{zr} = \frac{1}{m} \sum_{i=1}^{m} \frac{\left| z_{ci} - z_{dci} \right|}{z_{dci}}$$
In the equations, m is the number of apples successfully identified and located in the same image.
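For illustration, the two indicators can be computed as in the sketch below; the sample distances are hypothetical.

```python
import numpy as np

def localization_errors(z_c, z_dc):
    """Mean absolute error E_z and mean error ratio E_zr between camera
    estimates z_c and laser-rangefinder references z_dc (m apples, one image)."""
    z_c, z_dc = np.asarray(z_c, float), np.asarray(z_dc, float)
    e_z = np.mean(np.abs(z_c - z_dc))
    e_zr = np.mean(np.abs(z_c - z_dc) / z_dc)
    return e_z, e_zr

# Hypothetical measurements for three apples in one image (meters)
print(localization_errors([1.52, 1.98, 2.41], [1.50, 2.00, 2.45]))
```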
5. Conclusions
This study presents an enhanced vision-based apple harvesting system that addresses critical challenges of occlusion and clutter in unstructured orchard environments. By integrating an efficient channel attention (ECA) module into strategic C3 layers (C3-3, C3-6, C3-9) of the YOLOv5 architecture, the proposed algorithm significantly strengthens feature representation for obscured targets, achieving a robust confidence level of 90% and an in-range apple recognition rate of 98%. This represents a 4% improvement in mean average precision (mAP) over the baseline YOLOv5s model.
Coordinated with a 6-DOF robotic arm featuring a 120° maximum working angle, the system attains precise spatial localization through coordinate transformation matrices, confining end-effector positioning errors to ≤1.5 mm, a negligible tolerance given typical fruit dimensions. Field validation further confirms a motion planning success rate of 92% with a picking cycle time of 23 s per apple, demonstrating operational efficiency suitable for real-world deployment.
Future work will focus on developing adaptive end-effectors capable of dynamically adjusting the grasping force for diverse fruit morphologies, alongside integrating multi-modal sensors to augment visual perception for maturity assessment and quality control, thereby extending the system’s applicability to broader horticultural crops.