Article

Enhanced YOLOv5 with ECA Module for Vision-Based Apple Harvesting Using a 6-DOF Robotic Arm in Occluded Environments

College of Engineering, Shenyang Agricultural University, Shenyang 110866, China
*
Author to whom correspondence should be addressed.
Agriculture 2025, 15(17), 1850; https://doi.org/10.3390/agriculture15171850
Submission received: 17 July 2025 / Revised: 27 August 2025 / Accepted: 27 August 2025 / Published: 29 August 2025
(This article belongs to the Special Issue Perception, Decision-Making, and Control of Agricultural Robots)

Abstract

Accurate target recognition and localization remain significant challenges for robotic fruit harvesting in unstructured orchard environments characterized by branch occlusion and leaf clutter. To address the difficulty in identifying and locating apples under such visually complex conditions, this paper proposes an improved YOLOv5-based visual recognition algorithm incorporating an efficient channel attention (ECA) module. The ECA module is strategically integrated into specific C3 layers (C3-3, C3-6, C3-9) of the YOLOv5 network architecture to enhance feature representation for occluded targets. During operation, the system simultaneously acquires apple pose information and achieves precise spatial localization through coordinate transformation matrices. Comprehensive experimental evaluations demonstrate the effectiveness of the proposed system. The custom-designed six-degree-of-freedom (6-DOF) robotic arm exhibits a wide operational range with a maximum working angle of 120°. The ECA-enhanced YOLOv5 model achieves a confidence level of 90% and an impressive in-range apple recognition rate of 98%, representing a 2.5% improvement in the mean Average Precision (mAP) compared to the baseline YOLOv5s algorithm. The end-effector positioning error is consistently controlled within 1.5 mm. The motion planning success rate reaches 92%, with the picking completed within 23 s per apple. This work provides a novel and effective vision recognition solution for future development of harvesting robots.

1. Introduction

Labor for fruit picking accounts for 33% to 55% of the labor used in the entire production process. At present, the vast majority of fruit in China is still picked by hand; the mechanical harvesting rate in 2020 was only 2.7%. With rising labor costs and an aging population, the fruit industry faces a development bottleneck [1], and China's labor force is shrinking rapidly. To address these problems, China's recent policies and guidelines have made agricultural mechanization and automation key development strategies. With the deep integration of agronomy, agricultural machinery, and recognition algorithms, automatic orchard picking has developed rapidly, bringing convenience to mechanized fruit cultivation. Researchers have devoted considerable work to the recognition and positioning capabilities of orchard picking robots. Automatic picking robots have become the development trend in fruit harvesting, helping to alleviate labor shortages, improve productivity, and reduce production costs.
The Queensland University of Technology in Australia has designed a bell pepper picking robot with a six-degree-of-freedom collaborative robotic arm and an end-effector combining a suction cup and cutting blade [2]. The robot uses a camera to acquire the shape, position, and attitude of the pepper and selects an appropriate pose to grasp and cut the pepper stalk without damaging the plant. Xiong et al. achieved efficient picking of ripe strawberries by improving the vision system and using a dual-arm design [3]. The Israeli startup MetoMotion has developed Sweeper, a greenhouse pepper picking robot based on the Robot Operating System (ROS), consisting of an RGB-D camera, a six-degree-of-freedom industrial robotic arm, a stalk-cutting end-effector, a high-performance computer, and a PLC [4]. Its vision system identifies the location and shape of ripe bell peppers and estimates the center of the fruit and the position of the stalk. The University of Madrid, Spain, has developed a dual-arm collaborative eggplant picking robot that picks with both arms simultaneously [5]; it recognizes targets accurately under shading and improves the picking success rate in environments with cluttered branches and leaves. A strawberry picking robot was designed and developed by Feng et al. of the China Intelligent Agricultural Equipment Research Center; its main parts include a binocular vision camera, a sonar-guided four-wheeled mobile platform, a six-degree-of-freedom industrial robotic arm, and a suction-cup end-effector [6]. The vision recognition system identifies the fruit, and the robotic arm is controlled to complete the picking operation. A hybrid pneumatic–electric apple picking robot was designed by Zhao et al. of Jiangsu University [7]; its apple recognition method can switch between deep learning and conventional vision, it is capable of all-weather operation, and the robot contains a lifting device and a pneumatic telescopic arm for picking. However, the above picking robots suffer from poor mechanical stability, low reliability, low recognition accuracy, high planning failure rates, and long picking times. Wang et al. used deep learning to identify clusters of the entire banana fruit during harvesting by counting the number of clusters, used edge detection to extract the centroid points of the fruit fingers, and used clustering algorithms to determine the optimal number of clusters on the visual detection surface; the results showed a target segmentation MIoU of 0.878 during the debudding period, a mean pixel precision of 0.936, and a final bunch detection accuracy of 86% [8]. Wu et al. proposed a spatio-temporal convolutional neural network model that fuses a shifted-window Transformer with a region-based convolutional neural network for detecting pineapple fruits [9]. Therefore, this article proposes a YOLOv5 algorithm with an attention mechanism for recognition and positioning and uses a robotic picking arm to complete the picking operation.
The idea of the attention mechanism is to quickly scan the global image to find the target area that needs attention, form an attention focus, and then devote more attention to that area to obtain more detailed information about the target while suppressing other useless information. In recent years, the attention mechanism has been widely applied in image recognition, becoming one of the most noteworthy core technologies in deep learning for extracting correlated features from an input image while suppressing irrelevant ones. The ECA attention mechanism was proposed by Wang et al. [8] in "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks." The ECA module improves on the SE attention mechanism [10] by replacing the fully connected layers of the SE module with a dynamic convolution kernel operating on a 1 × 1 mapping, avoiding damage to the direct correspondence between channels and weights. Wang et al. proposed a residual attention network to make target feature extraction more efficient [11]. Shen et al. proposed an improved Mask R-CNN that enhances the model's ability to extract and generalize complex features by introducing an attention mechanism [12]; on a test set, its precision and recall reached 95.8% and 97.1%, respectively. Although this method detects apples accurately, the network model is large and places high demands on hardware. Yang et al. proposed BCo-YOLOv5, which introduces an attention mechanism to make the backbone network more sensitive to the surface texture, color, and features of the target, enabling effective detection of orchard fruits [13]. However, the dataset in that experiment contained a single type, so the dataset needs to be expanded to further test the detection performance of the model.
At the same time, there have been many studies on deep learning-based object detection in agriculture. Fan et al. proposed a lightweight YOLOv5 model to detect the position and maturity of tomato fruits in real time [14]; this improvement reduced the model size by 51.1% while maintaining a true detection rate of 93% [15]. Gao et al. achieved efficient detection of multiple targets by enhancing the YOLOv8 network [16]. Zeng et al. introduced an apple small-fruit detection method using the YOLOv5s deep learning algorithm with channel pruning for fast and accurate detection; its compact model size of 1.4 MB benefits the development of portable mobile terminals [17]. Ma et al. evaluated six YOLO versions (YOLOv3, YOLOv3-tiny, YOLOv4, YOLOv4-tiny, YOLOv5x, and YOLOv5s) for real-time detection and counting of grape bunches; YOLOv4-tiny was found to be the best choice due to its optimal balance between accuracy and speed, and this study provides valuable insights into the performance of different YOLO versions and their applicability in the grape industry [18]. Wang et al. used a YOLOv5-based single-stage detector to identify nodes, fruits, and flowers on challenging datasets obtained during stress experiments on various tomato genotypes, achieving relatively high scores [19]. Machine vision-based fruit and vegetable picking robots not only reduce labor costs but also improve the picking rate of fruits and vegetables, and they have become a research hotspot in the agricultural intelligent equipment industry [20]. Cardellicchio et al. proposed an improved YOLOv5 network for robotic vision systems to address inaccurate visual positioning and low recognition efficiency [21]. Song et al. proposed a BiFPN-S structure to improve the feature extraction network of YOLOv5; the improved algorithm can detect fruits at different growth stages for use in the machine vision systems of subsequent fruit picking robots [22]. Moreover, the apple detection algorithm based on the YOLOv7 model proposed by Yang et al. and the tomato detection algorithm based on the YOLOv8 model proposed by Yue et al. are further examples of fruit detection methods based on the YOLO framework [23,24].
In summary, although previous studies have made achievements in robotic arm systems and object detection, challenges remain in unstructured orchard scenarios. Existing robots commonly face insufficient mechanical stability and low operational reliability; in real environments, the path planning failure rate is high and picking is time-consuming, making it difficult to meet the demands of production for efficient, stable operation. In addition, accurate target localization under occlusion and clutter remains to be further explored. To address these issues, this study developed an efficient harvesting system that can operate continuously in small picking spaces while retaining high-precision target detection under environmental occlusion. The contributions of this study are as follows:
  • The apple farming techniques in Liaoning Province are backward, with most orchards being unstructured. The dense planting of fruit trees leads to a large number of leaf occlusions and interlaced branches in the environment. Traditional robotic arms have small picking spaces and find it difficult to complete continuous picking. To solve this problem, this study designed a six-degree-of-freedom robotic arm with a 120° working angle. Precise spatial positioning is achieved through coordinate transformation, and the positioning error of the end-effector is within 1.5 mm, achieving a motion planning success rate of 92%.
  • Aiming at the problem of low apple detection accuracy in orchard environments, this study proposed a machine vision-based apple harvesting system. By integrating the ECA attention module into the YOLOv5s model, the feature representation of occluded targets is significantly enhanced, solving the problems of occlusion and clutter in unstructured environments, and achieving an extremely high apple recognition rate. Compared with the base model, the average precision is improved by 2.5%, realizing accurate target localization.

2. Environment and Picking Machinery System

2.1. Environmental Analysis

The apple orchards in Liaoning Province are agronomically backward, and the apple trees are densely planted. The fruit is concentrated in the middle of the canopy [25]: about 80% or more of the apples are distributed 1 to 2 m above the ground and within 1 to 2 m of the trunk axis. There is no obvious difference in the distribution of apples in the shade; the branches and trunks are interlaced, and the picking space is small, which is not conducive to the operation of a traditional robotic arm, as shown in Figure 1. The operational stability of the traditional picking arm in this unstructured environment is poor. Because apple plants grow with natural randomness, their stems, branches, and other obstacles place high demands on the degrees of freedom and autonomous recognition ability of the robotic picking arm. According to the requirements of apple orchard picking tasks, the 6-DOF robotic picking arm must perform picking operations in an unstructured environment, and much of the work must be performed in narrow spaces, posing a great challenge to the flexibility of the robotic arm.
Based on the above circumstances, this paper proposes an improved visual recognition algorithm based on YOLOv5, which can collaborate more efficiently with robotic arms to complete fruit picking tasks in orchards under unstructured conditions. The specific workflow is shown in Figure 2.

2.2. Picking Machinery System

The 6-DOF robotic picking arm (AUBO Robotics, Beijing, China) is a vertical multi-joint arm consisting of a base, a connector, a large arm, a small arm, joints, and an end-effector. The end picking hand has a flexible control drive, and its stiffness can change with the surface hardness of the picking target. It can adapt to changes in fruit size and pick fruit without damage, and there is no dead space within the grasp of the variable-stiffness adaptive flexible robotic hand, as shown in Figure 3. The joints of the robotic arm are driven by actuators, and the relative motion of the joints drives the links to bring the end-effector to the working position. For the forward and inverse kinematic analysis, the 6-DOF arm can be simplified to an open linkage system in which the connecting rods and the end-effector are connected in series through the joints, with the base as the base coordinate system; relative motion coordinate systems are established for the other joints. In this paper, we establish the transformed coordinate system as follows: take the end-effector Z-axis of the robotic arm as the picking direction and keep the Z-axis perpendicular to the picking object; establish the coordinate system of the end-effector according to the right-hand rule; and finally establish the D-H parameters of the robotic arm based on the coordinate system establishment principle, as shown in Table 1.
The mapping from joint space to task space is called forward kinematics: the pose of the end of the robotic arm can be solved from the known joint angles. The forward kinematics of the robotic arm is solved by calculating the position of the end-effector from the known joint variables θi (i = 1, 2, 3, ..., 6). After establishing the link coordinate systems for the robotic arm, the general form of the transformation between adjacent link coordinate systems is as follows:
$$A_n = \begin{bmatrix} c\theta_i & -s\theta_i c\alpha_i & s\theta_i s\alpha_i & a_i c\theta_i \\ s\theta_i & c\theta_i c\alpha_i & -c\theta_i s\alpha_i & a_i s\theta_i \\ 0 & s\alpha_i & c\alpha_i & d_i \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad (1)$$
where $c\theta_i = \cos(\theta_i)$, $s\theta_i = \sin(\theta_i)$, $c\alpha_i = \cos(\alpha_i)$, and $s\alpha_i = \sin(\alpha_i)$.
For tandem robots, the end coordinate poses can be obtained by multiplying each of their sub-transformation matrices as follows:
$$T_0^6 = T_0^1 \, T_1^2 \, T_2^3 \cdots T_5^6 = f(\theta_1, \theta_2, \theta_3, \ldots, \theta_6) \quad (2)$$
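To make the chain in Equations (1) and (2) concrete, the following minimal Python sketch multiplies the per-link D-H transforms to obtain the end-effector pose. The joint angles and link values are hypothetical placeholders for illustration, not the actual D-H parameters of the arm in Table 1.

```python
import numpy as np

def dh_transform(theta, alpha, d, a):
    """Homogeneous transform between adjacent links from standard D-H parameters, Eq. (1)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(thetas, alphas, ds, a_s):
    """Chain the six link transforms: T_0^6 = A_1 A_2 ... A_6, Eq. (2)."""
    T = np.eye(4)
    for theta, alpha, d, a in zip(thetas, alphas, ds, a_s):
        T = T @ dh_transform(theta, alpha, d, a)
    return T  # end-effector pose in the base frame

# Hypothetical joint angles and link values, for illustration only
thetas = np.radians([10, -30, 45, 0, 60, 90])
alphas = np.radians([0, -90, 0, -90, 90, 90])
ds  = [0.12, 0.0, 0.0, 0.40, 0.0, 0.0]   # assumed link offsets (m)
a_s = [0.0, 0.35, 0.30, 0.05, 0.0, 0.0]  # assumed link lengths (m)
print(forward_kinematics(thetas, alphas, ds, a_s))
```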

3. Identification and Positioning

3.1. YOLOv5 Algorithm for Apple Recognition with the Introduction of Attention Mechanism

Obtaining the location and pose of apples is a prerequisite for successful picking. The complex background, varied shapes, differing lighting and shading, and differing ripeness of apples growing in the natural orchard environment can all degrade recognition. In this paper, we choose the improved YOLOv5 recognition algorithm and introduce the attention mechanism into it. The idea of the attention mechanism was originally derived from the signal processing mechanism of human vision: by quickly scanning the global image, human vision locates the target area that needs attention, forms an attention focus, and then devotes more attention to this area to obtain more detailed information about the target while suppressing other useless information. The attention mechanism has been widely used in the field of image recognition in recent years.
The network structure of YOLOv5 consists of the Backbone and the Head (the Head can be subdivided into Neck and Detect). In this paper, YOLOv5s, with depth_multiple = 0.33 and width_multiple = 0.50, is chosen as the network model for this experiment. To seek ways to enhance the detection performance of YOLOv5s, four attention mechanisms (SE, CBAM, ECA, and CA) were integrated between the C3-3 network layer and the SPPF network layer of the backbone feature extraction network, and the same four attention mechanisms were integrated into the C3 network layers (C3-3, C3-6, C3-9) of the backbone. The experimental results show that incorporating the ECA attention mechanism in the C3 module effectively enhances the backbone's ability to extract features: the ECA module strengthens the input features along the channel dimension while maintaining the direct correspondence between channels and weights, which improves the performance of the YOLOv5 model for detecting apples.
The experimental results show that the incorporation of the ECA attention mechanisms in the C3 (C3-3, C3-6, C3-9) network layer makes the detection performance of the algorithm optimal; the network architecture of the C3ECA-YOLOv5s is shown in Figure 4.
The backbone feature extraction network consists of three modules: Focus, CSP, and SPPF. In YOLOv5 (version 6.1), the Focus module is replaced by a 6 × 6 convolutional layer. There are two CSP (Cross Stage Partial Network) [26,27] structures in YOLOv5: the CSP in the backbone network uses residual connections [28], while the CSP in the Head uses direct connections. In this paper, the improvement of the C3ECA-YOLOv5s algorithm is made to the Cross Stage Partial Network in the backbone: the feature extraction ability of the backbone is enhanced by incorporating the ECA attention mechanism in the Bottleneck module. In YOLOv5 (version 6.1), the SPP module is replaced by SPPF, which achieves the same effect as SPP and speeds up the network by cascading three 5 × 5 pooling kernels.
The Head module consists of two parts, Neck and Detect. The Neck module consists of the Path Aggregation Network (PANet) [29], in which features are fused along bottom-up paths after passing through the Feature Pyramid Network (top-down path fusion); this strengthens the Neck module's ability to fuse features and thereby to detect them. The Detect module's prediction includes the bounding box loss and non-maximum suppression. YOLOv5s uses the CIoU loss to measure the distance between the ground-truth box and the predicted box, which effectively mitigates inaccurate IoU calculation between the two and improves the accuracy of target detection.
The CIoU loss is calculated as shown in Equations (3)–(5):
$$\mathrm{Loss}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \quad (3)$$
$$\alpha = \frac{v}{1 - IoU + v} \quad (4)$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \quad (5)$$
Note: IoU denotes the intersection over union between the ground-truth and predicted boxes; $\rho^2(b, b^{gt})$ denotes the squared Euclidean distance between the center points of the predicted and ground-truth bounding boxes; c denotes the diagonal length of the smallest enclosing rectangle of the two boxes; v measures the consistency of the aspect ratios, and α is its trade-off weight.
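As a concrete reference for Equations (3)–(5), the sketch below computes the CIoU loss for two axis-aligned boxes given as (center x, center y, width, height). The box values are hypothetical and the function is a generic illustration, not the training code used in this study.

```python
import math

def ciou_loss(box, box_gt):
    """CIoU loss of Eqs. (3)-(5) for boxes given as (cx, cy, w, h)."""
    def corners(b):
        cx, cy, w, h = b
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    x1, y1, x2, y2 = corners(box)
    gx1, gy1, gx2, gy2 = corners(box_gt)

    # Intersection over union of the two boxes
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = box[2] * box[3] + box_gt[2] * box_gt[3] - inter
    iou = inter / union

    # Squared center distance over squared diagonal of the smallest enclosing box
    rho2 = (box[0] - box_gt[0]) ** 2 + (box[1] - box_gt[1]) ** 2
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2

    # Aspect-ratio consistency v, Eq. (5), and its trade-off weight alpha, Eq. (4)
    v = (4 / math.pi ** 2) * (math.atan(box_gt[2] / box_gt[3]) - math.atan(box[2] / box[3])) ** 2
    alpha = v / (1 - iou + v)

    return 1 - iou + rho2 / c2 + alpha * v

# Hypothetical predicted and ground-truth boxes, for illustration only
print(ciou_loss((0.50, 0.50, 0.20, 0.30), (0.55, 0.50, 0.25, 0.30)))
```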
In this study, logistic regression is applied to the network outputs, and the detection with the highest confidence is taken as the final result of image detection and recognition. The confidence level reflects how likely it is that the target falls within a specific box area, and is judged as shown in Equation (6):
$$score = \Pr(Object) \times IOU_{pred}^{truth} \quad (6)$$
where score is the confidence value, Pr(Object) is the probability that the prediction box contains a training-sample object, and IoU is the overlap ratio between the candidate box and the original marked box.
The ECA is a lightweight, plug-and-play channel attention module. It can enhance the channel-dimension features of the input feature map without changing its size. In ECA-Net, the fully connected layers that the SE attention mechanism uses to learn channel information are replaced by a dynamic convolution kernel operating on the 1 × 1 mapping. The dynamic convolution kernel is one whose size adapts to the number of channels of the feature map; the relationship between the kernel size and the number of channels is shown in Equation (7). The structure of the ECA module is shown in Figure 5.
$$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{odd} \quad (7)$$
where C is the number of channels, $\left| \cdot \right|_{odd}$ denotes the nearest odd number, and γ and b are hyperparameters (γ = 2 and b = 1 in ECA-Net).
We chose to incorporate the ECA attention modules into the C3 network layers (C3-3, C3-6, C3-9) of the backbone feature network to enhance the network's ability to extract features. As shown in Figure 4 (the network architecture of C3ECA-YOLOv5s), in the 1 × 1 and 3 × 3 mappings an attention module is added after the convolutional kernels, and the result is then added to the feature maps connected by the residuals. Experiments show that the mAP of this method reaches 92.5%, which is 2.5% higher than YOLOv5s; the results demonstrate the effectiveness of the improved algorithm.
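The following PyTorch sketch shows an ECA block of the kind described above, with the kernel size chosen by Equation (7). The channel count and tensor shapes are illustrative, and this is a generic re-implementation under the stated assumptions rather than the exact module used in the authors' code.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: kernel size follows Eq. (7) with gamma=2, b=1."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1            # kernel size must be odd
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # (B, C, H, W) -> (B, C, 1, 1): one global descriptor per channel
        y = self.pool(x)
        # 1D convolution across channels replaces SE's fully connected layers
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        return x * self.sigmoid(y)           # reweight channels; feature map size unchanged

# Usage: wrap the output of a C3 bottleneck (shapes are illustrative)
x = torch.randn(1, 256, 40, 40)
print(ECA(256)(x).shape)  # torch.Size([1, 256, 40, 40])
```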

3.2. Evaluation Indicators of the Model

This study evaluated the model using four indicators: precision, recall, mean average precision (mAP), and detection speed. Formulas (8)–(10) give the precision, recall, and mean average precision, respectively. The higher the score obtained, the better the detection effect and the more stable the performance of the model.
$$Precision = \frac{TP}{TP + FP} \times 100\% \quad (8)$$
$$Recall = \frac{TP}{TP + FN} \times 100\% \quad (9)$$
$$mAP = \frac{\sum_{i=1}^{C} AP_i}{C} \quad (10)$$
Note: TP indicates that the actual sample is positive and the prediction is also positive; FP indicates that the actual sample is negative but the prediction is positive; FN indicates that the actual sample is positive but the prediction is negative. AP (average precision) represents the average precision of one category; the higher the AP, the better the performance of the model. mAP is the average of the AP values over the four different apple categories, and C is the number of categories.
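A minimal sketch of Formulas (8)–(10) in Python; the detection counts and per-class AP values below are hypothetical and only illustrate the arithmetic.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, Eqs. (8)-(9), in percent."""
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    return precision, recall

def mean_average_precision(ap_per_class):
    """mAP as the mean of per-class AP values, Eq. (10)."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts and per-class AP values, for illustration only
print(precision_recall(tp=881, fp=90, fn=119))
print(mean_average_precision([0.93, 0.92, 0.91, 0.94]))
```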

3.3. Comparison of C3ECA with SE, CBAM, CA Attention Mechanism

To further analyze the impact of the C3ECA attention mechanism, along with SE, CBAM, and CA, on detection performance, this study embeds these four attention mechanisms into the YOLOv5s model for evaluation. To ensure experimental fairness, the three comparative attention mechanisms are also embedded into the same C3 layers (C3-3, C3-6, C3-9). Accuracy, recall, and mAP are used as evaluation metrics. The detailed comparative data of the four attention mechanisms are presented in Table 2.
As Table 2 shows, compared with the SE, CBAM, and CA attention mechanisms, the mAP of the model embedded with C3ECA improves by 1.2%, 0.9%, and 1.8%, respectively, and its inference time is faster by 0.1 ms, 0.7 ms, and 0.1 ms, respectively.

3.4. Comparison of C3ECA YOLOv5s Algorithm with Other Detection Methods

To further analyze the detection performance of the C3ECA-YOLOv5s algorithm, precision, recall, and mAP were used to evaluate the algorithm in this article. The performance was compared with the latest mainstream detection models, such as YOLOv8s and the original YOLOv5s detection algorithm. The comparative data of the three algorithms are detailed in Table 3.
As Table 3 shows, the precision, recall, mAP@0.5, and inference time of C3ECA-YOLOv5s were 90.7%, 88.1%, 92.5%, and 2.5 ms, respectively. Compared with the original YOLOv5s algorithm, precision, recall, and mAP improved by 2.1%, 1.2%, and 2.5%, respectively, while the inference time increased by 1.2 ms. Compared with the mainstream detection models in the table (YOLOv8s, YOLOv10n, YOLOv11s, and Faster-RCNN), mAP improved by 0.9%, 1.4%, 3.1%, and 1.2%, respectively, and the inference time remains competitive. In summary, the precision, recall, and mAP@0.5 are all improved.

3.5. Algorithm Detection Field Experiment

To enrich the dataset, improve the accuracy of the network model, and guard against an insufficient number of images and gaps in environmental variables, the apple image training set was expanded to 844 images, collected across the daytime sunlight hours. To determine the recognition effectiveness, real-time localization and recognition were performed in the orchard field. The model was used to identify and localize apples on trees at straight-line distances of 0.5–2.5 m, and 50 localization trials were conducted for each model. Visualized images from some of the trials are shown in Figure 6. Different angles and lighting conditions were considered within the recognition range, and the distribution of apples is irregular and partially obstructed (see Figure 6). All apples within the field of view and the model's localization range were measured. The confidence level of the model reached 90%, and the recognition rate of apples within the recognition range reached 98%, evidence that the algorithm is robust in different environments. The improved YOLOv5s algorithm can effectively identify targets and provide accurate target information for robotic picking.

3.6. Target Identification and Positioning

The target identification and positioning system mainly includes image acquisition software and a motion control module. The depth camera captures information about the picking object, which can be quickly located by filtering the data. The vision camera is mounted on the upper part of the end-effector of the 6-DOF robotic arm and can be adjusted to multiple angles within the workspace. The experiment used an Intel RealSense D435i stereo depth camera with a depth map resolution of 1280 × 720 pixels, a color image resolution of 848 × 480 pixels, and a depth detection range of 0.2–10.0 m, powered by USB. The camera's internal parameters were obtained using Intel RealSense Viewer, the camera's built-in software.
Since the coordinate system of the target object captured in the vision camera's field of view does not match the robot's own coordinate system, the camera coordinate system must be transformed into the robot coordinate system, i.e., target calibration must be performed. The transformation requires a 4 × 4 homogeneous matrix containing a rotation matrix (3 × 3) and a translation vector (3 × 1). The perspective projection is as follows:
$$Z_c \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} \quad (11)$$
where {xc, yc, zc} is the camera coordinate system with origin Oc, {x, y} is the image coordinate system with origin Oi, {u, v} is the pixel coordinate system with origin Op, point Pc is the localization point in the camera coordinate system, point P is the intersection of the projection ray OcPc with the image plane, and f is the focal length (mm). The camera coordinate system model is shown in Figure 7.
The conversion of the image coordinate system to the pixel coordinate system requires the operation of the translation matrix as follows:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (12)$$
where fx, fy, u0, and v0 are the camera intrinsic parameters and dx and dy are the sizes of a unit pixel along the u and v axes of the pixel coordinate system (fx = f/dx, fy = f/dy). Combining the two transformation matrices yields the conversion equation as follows:
$$x_c = \frac{z_c (u - u_0)}{f_x}, \qquad y_c = \frac{z_c (v - v_0)}{f_y} \quad (13)$$
The depth image of the apple is computed from the infrared images acquired by the 3D camera using the triangulation principle, and the raw depth is aligned to the color stream. Let (uc, vc) be the original color pixel coordinates and (ud, vd) the processed depth image coordinates; after alignment, each detected color pixel (uc, vc) corresponds to a depth pixel (ud, vd). Finally, the 3D coordinates of the localization point in the camera coordinate system (xc, yc, zc) are obtained from the transformed coordinates.
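The back-projection of Equation (13) is small enough to show directly. In the sketch below the intrinsic values are hypothetical stand-ins in the style of a RealSense D435i color stream, not calibrated parameters from this study.

```python
import numpy as np

def deproject(u, v, depth, fx, fy, u0, v0):
    """Back-project an aligned pixel (u, v) with depth z_c into camera coordinates, Eq. (13)."""
    x_c = depth * (u - u0) / fx
    y_c = depth * (v - v0) / fy
    return np.array([x_c, y_c, depth])

# Hypothetical intrinsics, for illustration only
fx, fy, u0, v0 = 615.0, 615.0, 424.0, 240.0
print(deproject(u=500, v=260, depth=1.25, fx=fx, fy=fy, u0=u0, v0=v0))
```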
In this experiment, a laser rangefinder is installed on top of the depth camera. The measured height difference between the laser rangefinder and the depth camera is far smaller than the distance to the localization point, so this height difference is ignored. The devices are connected to a laptop, which drives the depth camera to obtain the distance measurement zci; the laser rangefinder provides the reference distance zdci, which is saved accordingly. To evaluate the positioning accuracy, the mean error Ez and mean error ratio Ezr are used as evaluation indicators: Ez reflects the absolute error between the estimated value and the true value, while Ezr reflects the relative error between them. The calculation equations are as follows:
$$E_z = \frac{\sum_{i=1}^{m} \left| z_{dci} - z_{ci} \right|}{m} \quad (14)$$
$$E_{zr} = \frac{\sum_{i=1}^{m} \frac{\left| z_{dci} - z_{ci} \right|}{z_{dci}}}{m} \times 100\% \quad (15)$$
In the equation, m is the number of apples successfully identified and located in the same image.
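A direct transcription of Equations (14) and (15); the paired range readings below are made-up numbers used only to exercise the formulas.

```python
def depth_errors(z_laser, z_camera):
    """Mean error E_z and mean error ratio E_zr from paired range readings, Eqs. (14)-(15)."""
    m = len(z_laser)
    ez = sum(abs(zd - zc) for zd, zc in zip(z_laser, z_camera)) / m
    ezr = sum(abs(zd - zc) / zd for zd, zc in zip(z_laser, z_camera)) / m * 100
    return ez, ezr

# Hypothetical paired measurements in meters, for illustration only
print(depth_errors([1.20, 1.55, 2.10], [1.22, 1.52, 2.15]))
```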

4. Experiment

4.1. Simulation Experiment

The Robotics Toolbox is used to simulate the 6-DOF picking arm during the picking operation. Among the control methods, the polynomial control method is used, which has the advantages of simple control and fast iteration. The robot model is established from the D-H parameter table of the 6-DOF picking arm, as shown in Figure 8. The workspace of the robotic arm is analyzed by the Monte Carlo method, with the sampled region set to 400 cm in length, 400 cm in width, and 200 cm in height; the more sample iterations, the closer the estimate is to the real workspace. The number of samples was set to 100,000, and the test results are shown in Figure 8. The shape of the working space of the robotic arm can be regarded as an ellipsoid. The minimum and maximum distances of the end point from the coordinate origin along the X-axis are 105 cm and 183 cm, respectively; along the Y-axis they are 110 cm and 175 cm, respectively; and the maximum reach along the Z-axis is 168 cm. The working angle of the robotic arm can reach 120°, which meets the demand for steering in narrow spaces, and the movement range is large enough to cover the fruit trees for picking.
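The Monte Carlo workspace estimate is straightforward to sketch: sample random joint configurations within joint limits, evaluate the forward kinematics, and collect the reachable end-effector positions. The joint limits and the fk callable below are assumptions standing in for the arm's real model, not its actual limits.

```python
import numpy as np

def monte_carlo_workspace(fk, joint_limits, n_samples=100_000, seed=0):
    """Approximate the reachable workspace as a point cloud of end-effector positions."""
    rng = np.random.default_rng(seed)
    lo = np.array([l for l, _ in joint_limits])
    hi = np.array([h for _, h in joint_limits])
    points = np.empty((n_samples, 3))
    for i in range(n_samples):
        q = rng.uniform(lo, hi)       # random joint configuration within limits
        points[i] = fk(q)[:3, 3]      # end-effector position from the 4x4 pose matrix
    return points

# Example with the earlier forward_kinematics sketch and illustrative +/-120 deg limits:
# limits = [(-np.radians(120), np.radians(120))] * 6
# cloud = monte_carlo_workspace(lambda q: forward_kinematics(q, alphas, ds, a_s), limits)
# print(cloud.min(axis=0), cloud.max(axis=0))  # bounding extents of the sampled workspace
```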
The spatial trajectory of the robotic arm's end-effector was obtained by a polynomial trajectory planning algorithm. As shown in Figure 9, the robotic arm can reach the target point in the simulation space, which demonstrates the rationality of the six-axis, 6-DOF robotic picking arm design. Given the complexity of the robotic arm's working environment, either third-order (cubic) or fifth-order (quintic) polynomial interpolation can be selected for planning, as shown in Figure 9.
Figure 9a presents the planning parameters for the third-order polynomial interpolation, and Figure 9b those for the fifth-order polynomial interpolation. Comparing the two reveals that the angles and velocities obtained from both interpolations are essentially identical, with no sudden changes in either. However, the angular acceleration of the third-order polynomial interpolation is discontinuous and exhibits sudden changes. In robotic trajectory planning, a sudden change in joint angular acceleration indicates that the joint motor may experience an impact, whereas smooth motor operation must be ensured during normal harvesting tasks. The angular acceleration derived from the fifth-order polynomial interpolation is continuous and smooth, thereby resolving the impact problem present in the third-order polynomial interpolation.
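To illustrate why the quintic form avoids acceleration jumps, the sketch below solves for the six polynomial coefficients from position, velocity, and acceleration boundary conditions, so the resulting acceleration profile is continuous by construction. The move duration and angles are illustrative values, not the planner parameters used on the arm.

```python
import numpy as np

def quintic_coeffs(q0, qf, T, v0=0.0, vf=0.0, a0=0.0, af=0.0):
    """Quintic joint trajectory: six boundary conditions (position, velocity,
    acceleration at t=0 and t=T) give six linear equations in the coefficients."""
    A = np.array([
        [1, 0,    0,      0,       0,        0],
        [0, 1,    0,      0,       0,        0],
        [0, 0,    2,      0,       0,        0],
        [1, T,    T**2,   T**3,    T**4,     T**5],
        [0, 1,    2*T,    3*T**2,  4*T**3,   5*T**4],
        [0, 0,    2,      6*T,     12*T**2,  20*T**3],
    ])
    b = np.array([q0, v0, a0, qf, vf, af])
    return np.linalg.solve(A, b)

# Illustrative move: 0 deg -> 90 deg in 2 s, at rest at both ends
c = quintic_coeffs(0.0, np.radians(90), T=2.0)
t = np.linspace(0, 2.0, 5)
pos = sum(c[i] * t**i for i in range(6))
print(np.degrees(pos))  # smooth S-curve from 0 to 90 degrees
```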

4.2. Ablation Experiment

In order to demonstrate the effectiveness of the improved model more intuitively, this study conducted a systematic and comprehensive ablation experimental study at the C3 layer, aiming to fully demonstrate the effectiveness of the selected attention embedding layer.
As shown in Table 4, when the ECA attention mechanism is simultaneously embedded in layers C3-3, C3-6, and C3-9, the mAP, precision, and recall reach their highest levels. However, we found that ECA is not suitable for embedding in the C3-1 layer, as doing so degrades detection performance and increases inference time.

4.3. Picking Experiment

Using the algorithm proposed above to recognize the target fruit on apple trees, recognition and localization experiments were conducted on different potted plants to verify the effectiveness of the improved recognition algorithm; the results are shown in Figure 10. The coordinates of the identified target fruit (see Figure 10) are used as the target points for the picking experiment. During the picking process, 100 recognition and positioning experiments were performed for each model configuration. The visual results of the improved recognition algorithm in several experimental scenarios are shown in Figure 10, with apples in the designated area successfully identified. Because the experimental environment is relatively simple compared with the actual orchard, the recognition rate of apples within the recognition range of this experiment reached 100%. The improved recognition algorithm can effectively identify and locate targets, providing accurate coordinate information for robot harvesting.
The apple picking robot developed by this group was used to validate the quintic polynomial interpolation method in the laboratory environment. To reflect the distribution of apple fruits, branches, and leaves in the natural environment while ensuring experimental accuracy, small potted apple plants were used as the experimental objects. As shown in Figure 11, different potted plants were set up for the experiment: potted plant 1 without shading, and potted plant 2 with branches and leaves shading the front of the fruit. The picking operation was performed in both cases, and the execution time and success rate of the picking operation were recorded. In its initial posture, the end of the apple picking robotic arm was at a horizontal distance of 30 cm from the apple tree. Ten picking experiments were conducted for each fruit. The picking process of the apple picking robotic arm is shown in Figure 11.
As shown in Figure 11A–F, for the picking operation on potted plant 1, the average picking operation took 25 s and the total picking success rate was 95%. As shown in Figure 11G–L, for the picking operation on potted plant 2, the average picking operation took 28 s and the total picking success rate was 92%. The picking experiments show that the robotic picking arm can complete the picking operation well, but obstacles in the way reduce picking efficiency and the success rate. During the experiments, the end position was tracked, and the average tracking error over the ten operations was obtained. The rate of change of each joint angle during the operation (10 s) was collected, and the angle, angular velocity, and angular acceleration of the six rotating joints were recorded, as shown in Figure 12. The tracking error during operation is shown in Figure 13.
The experimental results show that the robotic arm can perform the complete picking task smoothly during physical picking. The joint coordinate systems and angle transformations remain within a reasonable working range (see Figure 14 and Figure 15). As shown in Figure 15, the control error of the end-effector is kept within 1.5 mm, which is negligible in actual picking given typical fruit sizes, and the planning success rate of the algorithm is 92%. With quintic polynomial interpolation, the joint angle, angular velocity, and angular acceleration are continuous and free of abrupt changes, so the joint motors are not impacted during trajectory planning (see Figure 12). This reduces shock in practical application and improves the picking efficiency of the robotic arm.

5. Conclusions

This study presents an enhanced vision-based apple harvesting system that addresses the critical challenges of occlusion and clutter in unstructured orchard environments. By integrating an efficient channel attention (ECA) module into strategic C3 layers (C3-3, C3-6, C3-9) of the YOLOv5 architecture, the proposed algorithm significantly strengthens feature representation for obscured targets, achieving a robust confidence level of 90% and an in-range apple recognition rate of 98%. This represents a 2.5% improvement in mean Average Precision (mAP) over the baseline YOLOv5s model.
Coordinated with a 6-DOF robotic arm featuring a 120° maximum working angle, the system attains precise spatial localization through coordinate transformation matrices, confining end-effector positioning errors to ≤1.5 mm, a negligible tolerance given typical fruit dimensions. Field validation further confirms a motion planning success rate of 92% with a picking cycle time of 23 s per apple, demonstrating operational efficiency suitable for real-world deployment.
Future work will focus on developing adaptive end-effectors capable of dynamically adjusting the grasping force for diverse fruit morphologies, alongside integrating multi-modal sensors to augment visual perception for maturity assessment and quality control, thereby extending the system’s applicability to broader horticultural crops.

Author Contributions

Conceptualization, Y.X. and X.Y.; methodology, L.D.; software, X.Q.; writing—original draft preparation, X.L. and Z.C.; writing—review and editing, Y.X.; visualization, X.Y.; supervision, Y.X.; funding acquisition, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Liaoning Provincial Science and Technology Planning Project, grant number 2023JH2/10700006.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, T.; Qiu, Q.; Zhao, C.J.; Xie, F. Task planning of dwarf close planting orchard multi-arm picking robot. J. Agric. Eng. 2021, 37, 10.
  2. Lehnert, C.; English, A.; Mccool, C.; Tow, A.W.; Perez, T. Autonomous Sweet Pepper Harvesting for Protected Cropping Systems. IEEE Robot. Autom. Lett. 2017, 2, 872–879.
  3. Xiong, Y.; Ge, Y.; Grimstad, L.; From, P.J. An autonomous strawberry-harvesting robot: Design, development, integration, and field evaluation. J. Field Robot. 2020, 37, 202–224.
  4. Arad, B.; Balendonck, J.; Barth, R.; Ben-Shahar, O.; Edan, Y.; Hellström, T.; Hemming, J.; Kurtser, P.; Ringdahl, O.; Tielen, T.; et al. Development of a sweet pepper harvesting robot. J. Field Robot. 2020, 37, 1027–1039.
  5. Sepúlveda, D.; Fernández, R.; Navas, E.; Armada, M.; Gonzalez-de-Santos, P. Robotic aubergine harvesting using dual-arm manipulation. IEEE Access 2020, 8, 121889–121904.
  6. Feng, Q.; Wang, X.; Zheng, W.; Qiu, Q.; Jiang, K. New strawberry harvesting robot for elevated-trough culture. Int. J. Agric. Biol. Eng. 2012, 5, 1–8.
  7. Zhao, D.A.; Shen, T.; Chen, Y.; Jia, W.K. Rapid tracking and recognition of overlapping fruits by apple picking robots. J. Agric. Eng. 2015, 31, 22–28.
  8. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
  9. Wu, F.; Yang, Z.; Mo, X.; Wu, Z.; Tang, W.; Duan, J.; Zou, X. Detection and counting of banana bunches by integrating deep learning and classic image-processing algorithms. Comput. Electron. Agric. 2023, 209, 107827.
  10. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  11. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164.
  12. Shen, L.; Su, J.; Huang, R.; Quan, W.; Song, Y.; Fang, Y.; Su, B. Fusing attention mechanism with Mask R-CNN for instance segmentation of grape cluster in the field. Front. Plant Sci. 2022, 13, 934450.
  13. Yang, R.; Hu, Y.; Yao, Y.; Gao, M.; Liu, R. Fruit Target Detection Based on BCo-YOLOv5 Model. Mob. Inf. Syst. 2022, 2022, 8457173.
  14. Fan, Y.Y.; Zhang, Z.M.; Chen, G.P. Application of vision sensor in the target fruit recognition system of picking robot. J. Agric. Mech. Res. 2019, 41, 210–214.
  15. Wei, J.; Yi, D.; Bo, X.; Chen, G.Y.; Zhao, D. Adaptive variable parameter impedance control for apple harvesting robot compliant picking. Complexity 2020, 2020, 4812657.
  16. Gao, L.; Zhao, X.; Yue, X.; Yue, Y.; Wang, X.; Wu, H.; Zhang, X. A Lightweight YOLOv8 Model for Apple Leaf Disease Detection. Appl. Sci. 2024, 14, 6710.
  17. Zeng, T.; Li, S.; Song, Q.; Zhong, F.; Wei, X. Lightweight Tomato Real-Time Detection Method Based on Improved YOLO and Mobile Deployment. Comput. Electron. Agric. 2023, 205, 107625.
  18. Ma, J.; Lu, A.; Chen, C.; Ma, X.; Ma, Q. YOLOv5-Lotus, an Efficient Object Detection Method for Lotus Seedpod in a Natural Environment. Comput. Electron. Agric. 2023, 206, 107635.
  19. Wang, D. Channel Pruned YOLO V5s-Based Deep Learning Approach for Rapid and Accurate Apple Fruitlet Detection before Fruit Thinning. Biosyst. Eng. 2021, 210, 271–281.
  20. Sozzi, M.; Cantalamessa, S.; Cogato, A.; Kayad, A.; Marinello, F. Automatic Bunch Detection in White Grape Varieties Using YOLOv3, YOLOv4, and YOLOv5 Deep Learning Algorithms. Agronomy 2022, 12, 319.
  21. Cardellicchio, A.; Solimani, F.; Dimauro, G.; Petrozza, A.; Summerer, S.; Cellini, F.; Renò, V. Detection of Tomato Plant Phenotyping Traits Using YOLOv5-Based Single Stage Detectors. Comput. Electron. Agric. 2023, 207, 107757.
  22. Song, Q.; Li, S.; Bai, Q.; Yang, J.; Zhang, X.X.; Li, Z.A.; Duan, Z.J. Object detection method for grasping robot based on improved YOLOv5. Micromachines 2021, 12, 1273.
  23. Yang, H.; Liu, Y.; Wang, S.; Qu, H.; Li, N.; Wu, J.; Yan, Y.; Zhang, H.; Wang, J.; Qiu, J. Improved apple fruit target recognition method based on YOLOv7 model. Agriculture 2023, 13, 1278.
  24. Yue, X.; Qi, K.; Na, X.Y.; Zhang, Y.; Liu, Y.H.; Liu, C.H. Improved YOLOv8-Seg Network for Instance Segmentation of Healthy and Diseased Tomato Plants in the Growth Stage. Agriculture 2023, 13, 1643.
  25. Tustin, D.S.; Breen, K.C.; Van Hooijdonk, B.M. Light utilisation, leaf canopy properties and fruiting responses of narrow-row, planar cordon apple orchard planting systems—A study of the productivity of apple. Sci. Hortic. 2022, 294, 110778.
  26. Zheng, S.S.; Li, Y.C.; Zhang, S.; Ji, W.; Xia, W. Structural Design of Apple Picking Robot. Technol. Innov. Appl. 2015, 21, 11–12.
  27. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  29. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
Figure 1. The apple orchard environment.
Figure 2. Overall work map.
Figure 3. Composition of the robotic system.
Figure 4. Enhanced YOLOv5 network architecture diagram.
Figure 5. Structure of the ECA module.
Figure 6. The on-site target detection effect in the orchard.
Figure 7. Camera coordinate system conversion model.
Figure 8. Parametric model and Monte Carlo working domain analysis.
Figure 9. Polynomial interpolation trajectory planning: (a) cubic polynomial programming; (b) quintic polynomial programming.
Figure 10. Identification experimental results.
Figure 11. Physical picking map: (A–F) picking operation on potted plant 1; (G–L) picking operation on potted plant 2.
Figure 12. Angle, angular velocity, and angular acceleration curves of each joint.
Figure 13. Trajectory tracking error.
Figure 14. Three-coordinate trajectory tracking change curve.
Figure 15. The variation pattern of the angle of each joint in the trajectory plan.
Table 1. Six-degree-of-freedom robotic arm D-H parameter table.

Link         | θi (°) | αi (°) | di (m) | ai (m)
1            | θ1     | 0      | d1     | 0
2            | θ2     | −90    | 0      | a2
3            | θ3     | 0      | 0      | a3
4            | θ4     | −90    | d4     | a4
5            | θ5     | 90     | 0      | 0
6            | θ6     | 90     | 0      | 0
End-effector | 90     | −90    | −470   | −360
Table 2. Performance comparison of algorithms incorporating four attention mechanisms.

Method         | Precision (%) | Recall (%) | mAP@0.5 (%) | Inference Time (ms)
C3ECA-YOLOv5s  | 90.7          | 88.1       | 92.5        | 2.5
SE-YOLOv5s     | 90.1          | 87.4       | 91.3        | 2.6
CBAM-YOLOv5s   | 90.4          | 87.8       | 91.6        | 3.2
CA-YOLOv5s     | 89.2          | 87.5       | 90.7        | 2.6
Table 3. Comparison of the algorithms.

Method         | Precision (%) | Recall (%) | mAP@0.5 (%) | Inference Time (ms)
YOLOv5s        | 88.6          | 86.9       | 90.0        | 1.3
YOLOv8s        | 89.8          | 87.2       | 91.6        | 1.4
YOLOv10n       | 89.2          | 87.1       | 91.1        | 1.8
YOLOv11s       | 88.9          | 86.9       | 89.4        | 2.7
Faster-RCNN    | 90.1          | 87.4       | 91.3        | 2.6
C3ECA-YOLOv5s  | 90.7          | 88.1       | 92.5        | 2.5
Table 4. ECA embedding layer ablation experiment.

Layer                          | Precision (%) | Recall (%) | mAP@0.5 (%) | Inference Time (ms)
C3                             | 88.6          | 86.9       | 90.0        | 1.3
C3 (C3-1)                      | 88.2          | 86.5       | 89.4        | 1.6
C3 (C3-1, C3-3)                | 88.6          | 86.7       | 89.6        | 1.8
C3 (C3-1, C3-6)                | 88.3          | 86.6       | 89.9        | 1.9
C3 (C3-1, C3-9)                | 88.7          | 87.0       | 90.3        | 1.8
C3 (C3-1, C3-3, C3-6, C3-9)    | 90.0          | 87.7       | 91.6        | 2.7
C3 (C3-1, C3-3, C3-6)          | 89.6          | 87.3       | 90.9        | 2.4
C3 (C3-1, C3-3, C3-9)          | 89.2          | 87.1       | 91.2        | 2.5
C3 (C3-3)                      | 89.1          | 87.0       | 90.6        | 1.6
C3 (C3-3, C3-6)                | 89.8          | 87.5       | 91.4        | 2.0
C3 (C3-3, C3-9)                | 89.6          | 87.7       | 91.6        | 2.0
C3 (C3-6)                      | 89.3          | 87.2       | 90.7        | 1.8
C3 (C3-6, C3-9)                | 90.2          | 87.9       | 92.0        | 2.2
C3 (C3-9)                      | 89.4          | 87.4       | 91.1        | 1.7
C3 (C3-3, C3-6, C3-9)          | 90.7          | 88.1       | 92.5        | 2.5


