Article

Rapid-Learning Collaborative Pushing and Grasping via Deep Reinforcement Learning and Image Masking

Department of Mechanical Engineering, National Chin-Yi University of Technology, Taichung 41170, Taiwan
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 9018; https://doi.org/10.3390/app14199018
Submission received: 14 August 2024 / Revised: 23 September 2024 / Accepted: 3 October 2024 / Published: 6 October 2024

Abstract

When multiple objects are positioned close together or stacked, pre-grasp operations such as pushing can create space for the grasp, thereby improving the grasping success rate. This study develops a model based on a deep Q-learning network architecture and introduces a fully convolutional network to accurately identify pixels in the workspace image that correspond to target locations for exploration. In addition, this study incorporates image masking to limit the exploration area of the robotic arm, ensuring that the agent consistently explores regions containing objects. This approach effectively addresses the sparse reward problem and improves the convergence rate of the model. Experimental results from both simulated and real-world environments show that the proposed method accelerates the learning of effective grasping strategies. When image masking is applied, the success rate in the grasping task reaches 80% after 600 iterations, and the time required to reach an 80% success rate is 25% shorter than when masking is not used. The main contribution of this study is the direct integration of an image masking technique with a deep reinforcement learning (DRL) algorithm, which offers a significant advance in robotic arm control. Furthermore, this study shows that the image masking technique can substantially reduce training time and improve the object grasping success rate. This innovation enables the robotic arm to better adapt to scenarios that conventional DRL methods cannot handle, thereby improving training efficiency and performance in complex and dynamic industrial applications.

1. Introduction

Object grasping is a fundamental task in robotics with a wide range of practical applications, including loading, unloading, and arranging objects. Traditionally, geometry-based methods are used to solve this task. These methods analyze a three-dimensional model of the object to determine the optimal grip position. Although these methods are effective in fixed settings and for known objects, they require a priori knowledge and cannot be easily adapted to dynamic environments [1].
To overcome the limitations of geometry-based methods, data-driven approaches for object grasping with robotic arms have been introduced. These methods, which typically use convolutional neural networks (CNNs), process images from two-dimensional or three-dimensional cameras for classification, object identification, segmentation, and grip position estimation [2,3,4,5,6,7]. However, challenges arise when the image background is cluttered or objects overlap, which makes it difficult to understand objects in complex contexts [8]. Cluttered environments are shown in Figure 1.
Significant progress has been made in this field with the use of deep reinforcement learning (DRL) algorithms, as exemplified by iconic applications such as Google’s DeepMind AlphaGo and OpenAI Five [9,10]. DRL combines CNN for image feature extraction with Q-learning techniques for decision-making. Previous studies have shown that DRL is able to learn and make better decisions over time without the need for explicit modeling of the environment [11].
In the context of robotic arm control, DRL has been used to improve the effectiveness of pre-grasping tasks—the process by which a robotic arm pushes an object to create space before grasping it. Although this method has shown potential, studies have shown that some methods still face challenges in terms of success rates and training time required [12,13,14].
This study introduces significant novelty in robotic arm control by directly integrating image masking techniques with deep reinforcement learning (DRL) algorithms. While many previous studies have explored the use of DRL for object grasping tasks, this study stands out by adopting an image masking approach to address the challenges in cluttered environments. This integration allows the robot to more effectively explore confined areas, reducing interference caused by complex backgrounds and overlapping objects. Furthermore, this study shows that image masking techniques can significantly reduce training time and improve the success rate of object grasping. With these adjustments, the robotic arm can better adapt to situations that cannot be handled by conventional DRL methods. These findings underline the innovative capabilities of the proposed approach in improving training efficiency and performance in more complex and dynamic industrial scenarios.
Therefore, the present study proposes a method that integrates image masking to overcome the problem of sparse rewards during training. When image masks are applied, the robot explores within a limited area, resulting in faster learning.

2. Related Work

2.1. Prehensile Grasping

Grasping is a type of prehensile activity, and the grasping techniques of robots can be classified into two categories [15]. The first category is techniques for the grasping of known objects based on existing data. The data of these objects can be used to precisely estimate the posture of the objects and plan a grasping movement. For example, during the first Amazon Picking Challenge, Correll et al. [16] programmed a robotic arm to select 39 different objects in a specific order. A known computer-aided design model was used to calculate the iterative closest point of the objects and estimate their posture. Next, the appropriate grasping posture of the robotic arm was calculated. Finally, inverse kinematics were used to complete the grasping tasks. Their method relies on there being computer-aided design models of the objects of interest and thus has limited generalizability.
The second category is techniques in which the grasping posture is not estimated; instead, input images or point clouds are used to directly perform end-to-end estimation of environmental attributes. This type of grasping is suitable for unknown objects and objects in cluttered environments, such as those containing stacked and obstructed objects. Morrison et al. [17] proposed the generative grasping CNN, a lightweight FCN into which depth images are input and that directly predicts and generates the robot’s grasping posture and quality at each pixel. Sundermeyer et al. [18] proposed Contact-GraspNet, which inputs point clouds to an end-to-end network that predicts the grasping posture for objects in a three-dimensional space. Yen-Chen et al. [19] suggested transferring the parameters of multiple machine vision models to a deep reinforcement learning network so that robots can focus on exploring the areas near objects and avoid ineffective explorations, resulting in faster learning.
The aforementioned two types of methods can be used to estimate the posture of objects and appropriate grasping postures. However, they are unsuitable if calculations cannot be performed or when the grasping posture is unknown. Therefore, pregrasping processes can be conducted to assist in grasping operations.

2.2. Pregrasping Assistance

Pregrasping tasks that facilitate the completion of grasping tasks are a current trend. In Berscheid et al. [12], nontarget objects were moved in a pregrasping process to increase the rate of successful grasping of a target object. Their system achieved a success rate of 98.4% but required data on approximately 250,000 grasps and 2500 shifting movements for training. Berscheid et al. also applied their system to a real-life context, the picking up and emptying of bins.
Kalashnikov et al. [20] introduced QT-Opt, an expandable self-supervised visual reinforcement learning framework. They used data on 580,000 actual grasps to train networks for controlling seven robotic arms and, for the grasping of unknown objects, achieved a success rate of 96%.
The aforementioned methods achieved high success rates, but they had the disadvantage of requiring a large volume of real-life training data. Zeng et al. [13] proposed the DQN–FCN combination for training a model on pushing and grasping tasks within simulated and real environments. In their system, an FCN is employed to estimate the maximum Q-value of a grasp or push. Only 2500 grasps and pushes were required for training before a success rate of 80% could be achieved for grasping tasks involving obstructed objects. Many researchers have since combined a DQN with an FCN. Xu et al. [21] proposed a target-oriented attention module for training; this module can effectively find hidden objects, instruct a robot to push and grasp them, and be trained efficiently in pushing and scattering objects. Li et al. [22] enhanced the efficiency of object recognition and grasping by analyzing the colors of the objects. Sarantopoulos et al. [23] proposed the use of Split DQN to perform actions on target objects in cluttered scenarios, reducing sample complexity and effectively accelerating model convergence. However, goal-oriented methods target only specified objects for grasping and cannot directly identify the best objects for grasping in the workspace. Chen et al. [24] employed a traditional rule-based grasp detection algorithm for grasping, with only the pushing action using deep reinforcement learning, to effectively reduce system reasoning time and increase execution speed. However, the methods in the aforementioned literature typically require long training times or may not achieve a satisfactory grasping success rate, limiting their practical applicability.
Some papers use imitation learning to train various action tasks [25,26]. Imitation learning allows for rapid skill acquisition by observing expert demonstration data, thereby eliminating the need for model convergence through extensive trial and error. However, training each new action requires new demonstration data, which poses challenges for real-world industrial applications.
In summary, the methods discussed herein can achieve a high success rate, but most of them require numerous iterations or multiple neural networks, meaning that they are unsuitable for use in industry, where they would have to be deployed within a short period. By contrast, the method proposed in this study achieves a success rate of 80% for grasping tasks when training is conducted for only 600 iterations, making it suitable for rapid deployment in industry.

3. Method

3.1. State Representations

The method proposed herein utilizes red–green–blue–depth (RGB-D) images to represent the state of the environment constituting the working space of the robotic arm (Figure 2). In the images, the environment is presented as a plane with dimensions of 30 cm × 30 cm, and each image has a size of 224 × 224 pixels. The target coordinates of the robotic arm are calculated from pixel values. The system implements a reward function, in which rewards are used to determine the quality of the current action.

3.2. Primitive Actions

In the proposed system, two primary primitive actions are employed to optimize object manipulation: pushing and grasping. Each action is parameterized as a primitive motion defined by a vector that determines the position and orientation of the movement. Specifically, the actions a ∈ {grasp, push} are parameterized by the vector (x, y, z, ϕ), where (x, y, z) represents the coordinates of the center of the gripper, and ϕ ∈ [0, 2π] denotes the rotation angle of the gripper in the plane of the environment (i.e., the rotation about the axis perpendicular to the work surface). These parameters allow for precise control over the gripper’s position and orientation, whether it is grasping or pushing an object, thereby ensuring that actions are executed efficiently and effectively.
The pushing action is defined as a linear movement occurring at coordinates (x, y, z). This pushing primitive involves moving an object 5 cm parallel to the work surface in the direction specified by the angle ϕ. To enable flexibility in the direction of pushing, the system allows for pushes to occur at 16 different angles (K = 16), providing the gripper with multiple orientation options for object manipulation. Each push is executed as a straight-line motion with the gripper’s fingers closed, ensuring that the object maintains the desired path during the action. This approach allows for adaptive adjustments to various object placements and orientations, which is crucial for ensuring successful manipulation in dynamic and complex environments.

3.3. System Structure

For the pre-grasping tasks, RGB-D images of the environment are initially captured using an RGB-D camera. These images are subsequently processed by applying a rotation matrix to convert the coordinates into corresponding RGB height maps and depth height maps (Figure 3). Each height map is then rotated into 16 evenly spaced orientations spanning 360 degrees, producing 16 height maps for both the RGB and depth channels. This set of 16 maps is input in parallel to DenseNet-121 [27] and a fully convolutional network (FCN) [14] for training. The feature extraction layers are followed by a 1 × 1 convolutional layer with batch normalization and a non-linear activation function (ReLU), after which upsampling is applied. To maintain alignment with the pixel-based object coordinates in the input image, the upsampling stage of the FCN converts the output actions into coordinates, generating two dense pixel-wise Q-value maps corresponding to the push and grasp height maps. A custom masking function ensures that the system prioritizes exploration and exploitation near objects, thus optimizing grasping success rates. The reinforcement learning framework effectively guides the exploration process and facilitates learning. The overall system architecture is depicted in Figure 3.

3.4. Reward

The reward function includes rewards for grasps and for pushes. The robotic arm is moved to its target position, and the gripper is then closed; at this point, if the distance between the tips of the gripper’s fingers is greater than 0 (i.e., the gripper cannot close completely because it has gripped an object), the grasp is successful; otherwise, the grasp is a failure. The reward function for grasps, denoted $r_g$, is as follows:
$$ r_g(s) = \begin{cases} 1, & \text{if the gripper finger distance after grasping} > 0, \\ 0, & \text{otherwise.} \end{cases} $$
The reward for pushes is based on whether a change has occurred in the environment. If the difference before and after the push exceeds a certain threshold, a change in the environment is detected, and the push is deemed successful; otherwise, the push has failed.
The reward function for pushes, denoted $r_p$, is as follows:
$$ r_p(s) = \begin{cases} 0.5, & \text{if } (s_{t+1} - s_t) > \tau, \\ 0, & \text{otherwise.} \end{cases} $$

3.5. Overcoming Sparse Rewards with Image Masking

Sparse rewards are a major challenge in reinforcement learning, especially in robotic manipulation tasks such as grasping, where agents often receive sparse feedback (rewards). This condition slows down the learning process due to the limited information available to guide the agent to improve its policy. In the context of grasping tasks, sparse rewards arise when the robot rarely succeeds in grasping attempts, thus slowing down the convergence of the learning algorithm.
To address this issue, this study proposes the use of image masking during training. Image masking is designed to focus the agent’s exploration on areas containing relevant objects, thereby increasing the agent’s chances of receiving meaningful rewards. This approach aims to minimize the agent’s time spent on exploring irrelevant areas, accelerate the learning process, and improve model efficiency.
The experiments in this study were designed to evaluate the effectiveness of the image masking approach in overcoming sparse rewards. The performance of models with and without image masking was compared to show how this approach accelerates convergence and improves the success rate in grasping tasks. The results show that image masking helps the agent focus its exploration on relevant areas, thereby effectively overcoming the sparse rewards problem and improving learning efficiency.

3.6. Clarification of Exploration Policy and Mask Design

In the deep reinforcement learning framework adopted in this study, the exploration strategy utilizes the maximum Q value to determine the optimal action. During the initial exploration phase, the selected position may be far from the desired object, often appearing near the edge of the image. To address this, a grasp mask and a push mask are used, as illustrated in Figure 4.
The grasp mask is a binary mask designed to guide the grasper toward the center of the target object. To generate this mask, image processing techniques are used to detect the center of the object. The pixel corresponding to this center is assigned a value of 1, while all other pixels are set to 0. This configuration ensures that each grasp action is executed exactly around the center of the object, maximizing the probability of a successful grasp.
The push mask is also a binary mask, specifically designed to facilitate a stable and safe pushing action. This mask is created by performing image post-processing, which involves subtracting a dilated image that takes the size of the clamp into account from another dilated image that does not. This process creates a defined area of interest around the target object, ensuring enough space for the clamp to approach and interact with the object without collision or instability.

3.7. Training Detail

The loss function used for each training iteration in the DQN [10] is the Huber function, which combines the squared loss and the absolute loss and is more robust to outliers than the squared loss alone. The Huber function is as follows:
$$ L_i = \begin{cases} \tfrac{1}{2}\left(Q_{\theta_i}(s_i, a_i) - y_i^{\theta_i}\right)^2, & \text{if } \left|Q_{\theta_i}(s_i, a_i) - y_i^{\theta_i}\right| < 1, \\ \left|Q_{\theta_i}(s_i, a_i) - y_i^{\theta_i}\right| - \tfrac{1}{2}, & \text{otherwise.} \end{cases} $$
This study employed stochastic gradient descent with momentum to train the FCN [14]. The learning rate was $10^{-4}$, the momentum was 0.9, and the weight decay was $2^{-5}$. The algorithm used prioritized experience replay, sampled similarly to a power-law distribution. An ε-greedy exploration strategy was employed, and the initial exploration parameter ε and the future discount factor γ were both set to 0.5.
In addition to these optimization strategies, the study leverages prioritized experience replay, which selects and replays past experiences according to their significance for learning [10,28]. This technique prioritizes experiences with a higher temporal-difference error, meaning they are more unexpected or have more potential to change the agent’s policy. By replaying these experiences more frequently, the model learns more efficiently, focusing on the most informative samples. Moreover, the training incorporates an ε-greedy exploration strategy, with an initial exploration parameter ε and a discount factor γ.
Both ε and γ are set to 0.5, balancing exploration and exploitation throughout training. This ensures that the agent adequately explores the state space early on while progressively exploiting the learned policies to achieve higher performance [29].

4. Experimental Section

The experiment performed in this study tested the pregrasping performance (i.e., pushes and grasps) achieved using the DQN–FCN model for objects with basic shapes in a simulated environment. This experiment was conducted in two parts. The first part focused on comparing the performance of the proposed model with that of another model developed in a previous study to verify the proposed model. In the second part, the final trained model was tested in scenarios with various levels of difficulty to evaluate its pregrasping performance.

4.1. Baseline Methods

This study evaluated the proposed collaborative pushing and grasping method using deep reinforcement learning and image masking against several baseline methods in both simulated and real-world environments. We first utilized the CoppeliaSim simulation environment, employing a Hiwin Ra605_710_GC robotic arm (HIWIN Technologies Corp., Taichung, Taiwan) and a Xeg_32 electric gripper (HIWIN Technologies Corp., Taichung, Taiwan) to perform pre-grasping tasks with dynamic and static objects. Following the methodology of Zeng et al. [13], we used five basic three-dimensional geometric figures to assess the performance of our deep reinforcement learning algorithm. Ten objects representing these geometric shapes were randomly placed in the simulation workspace, focusing on the number of pushes and grasps required to manipulate them successfully. These tests, illustrated in Figure 5, were designed to create diverse scenarios with varying levels of difficulty.
After validating the model in the simulation environment, we conducted real-world experiments to further evaluate its effectiveness. The trained model was tested on a real robotic arm under conditions that mirrored the simulation, with both static and dynamic objects used to assess the model’s adaptability and robustness. Objects were strategically positioned in the workspace, ranging from four to ten per test, to represent different difficulty levels based on the number of actions required for successful manipulation. The results, shown in Figure 6, demonstrated the model’s ability to handle complex tasks, with higher numbers of required actions indicating greater difficulty.
To provide a comprehensive analysis, we compared our proposed method to traditional pushing and grasping techniques that rely on predefined rules and control strategies, serving as a benchmark for improvement. The differences in performance were clearly captured through videos of both the simulation and real-world experiments, which have been provided to visually illustrate the effectiveness of our approach. These experiments demonstrate that the proposed method, trained using deep reinforcement learning and image-based object detection, significantly outperforms traditional methods in terms of adaptability and efficiency in various collaborative pushing and grasping tasks.
Table 1 presents the performance metrics for the proposed deep reinforcement learning (DRL) method applied to collaborative pushing and grasping tasks with a robotic arm, across different scenarios characterized by varying numbers and types of objects (static vs. dynamic). The success rate remains high in simpler scenarios; for instance, in the “Static Objects (Easy)” scenario with four objects, a 95% success rate was achieved. As the complexity of the environment increased, such as in the “Static Objects (Medium)” scenario with six objects, the success rate slightly decreased to 89%, and further dropped to 75% in the “Static Objects (Hard)” scenario with ten objects. A similar trend is observed with dynamic objects: the success rate starts at 90% in the “Dynamic Objects (Easy)” scenario with four objects, decreases to 82% in the “Dynamic Objects (Medium)” scenario with six objects, and further falls to 68% in the “Dynamic Objects (Hard)” scenario with ten objects. These results indicate that while the DRL method is highly effective in simpler tasks, its performance declines as task complexity increases.
The average number of pushes and grasps required and the time taken to complete the tasks also reflect the increasing difficulty across scenarios. In the “Static Objects (Easy)” scenario, the robot required an average of 2 pushes and 1 grasp, taking about 12 s to complete the task. As the scenarios became more challenging, both the number of required actions and the time taken increased, with the most complex “Static Objects (Hard)” scenario requiring an average of 5 pushes, 3 grasps, and 35 s. For dynamic object scenarios, the complexity further impacted the performance: in the “Dynamic Objects (Hard)” scenario, the robot needed an average of 7 pushes, 4 grasps, and 45 s to complete the tasks. These findings highlight the proposed method’s adaptability and efficiency, but also illustrate its limitations when faced with a higher number of dynamic or static objects in the workspace.

4.2. Simulation Experiments

4.2.1. Verification of Model Effectiveness

The visual pushing and grasping (VPG) system of Zeng et al. [13] was employed to verify the performance of the model proposed in this study. Data evaluated by the VPG system were collected, and some similarities and differences between the settings of this study’s model and the VPG system were noted:
  • In the simulation, the VPG system used a UR5 robotic arm installed with an RG2 gripper, whereas the system of this study used a Hiwin robotic arm with a Xeg_32 electric gripper. Both of these robotic arms have six degrees of freedom.
  • In the VPG system, pushes of 10 cm length were made, whereas in this study’s model, pushes of 5 cm length were made because the objects employed were smaller and a longer push was thus not required.
  • When the VPG system simulated a push, the push was performed with the end of the closed gripper. In this study, the same setting was employed, and the width of the closed gripper was the same as that of the gripper in the VPG system.
These slight differences would not affect the model verification because the change in push distance did not require changes to the reinforcement learning algorithm. Moreover, to verify the effectiveness of the mask proposed in this paper, we compared it with a general binary mask. A normal binary mask changes its pattern according to the object geometry but cannot locate the center of the object, as shown in Figure 7. Training was performed over 1500 steps, and the rate of successful grasping when using the proposed system was 82%. The grasping performance and the number of training steps are illustrated in Figure 8.
The overall grasping performance was calculated by determining the success rate in the preceding 200 training steps; the solid line in Figure 8 is a running average of the grasp success rate over the preceding 200 grasps. The rate of successful object grasping achieved using the model proposed in this study was thus similar to that achieved using the VPG system [13], confirming that the model can be applied in reinforcement learning algorithms to ensure that objects are effectively pushed and grasped. Recent studies [30,31] support these findings by showing that machine learning approaches, including Mask R-CNN, can significantly improve grasp optimization. These machine learning-based methods, through sophisticated visual modeling and adaptive decision-making strategies, enable the system to more accurately identify and respond to different types of objects and environmental conditions. This is in line with previous studies showing that reinforcement learning algorithms supported by mask-based object detection can ensure that grasping is performed more effectively and efficiently, increasing the overall success rate [32].

4.2.2. Results of Tests with Various Levels of Difficulty

Testing was conducted to determine the robot’s performance in pushing and grasping tasks at various difficulty levels. The robot’s performance was measured using three main metrics. First, the average completion rate (mean completion) was measured based on the gripper’s ability to complete the task of grasping all objects without making ten consecutive failed attempts. If the gripper made ten or more consecutive failed attempts, or pushed an object out of the workspace, then the task was considered incomplete. Second, the mean success rate was calculated based on the number of attempts made to grasp each object. In addition, the average action efficiency (mean action efficiency) was also evaluated based on the percentage of actions completed by the robot. This metric provides an overview of the robot’s ability to complete the task in an effective and efficient manner, while minimizing unnecessary efforts. The results of this test show how well the developed method can handle scenarios with varying difficulty levels, as well as identify areas where improvements may be needed.
The mean success rate was the primary index used to determine the feasibility of the model, followed by the mean completion rate; a higher success rate indicated that the model’s training had been more effective. As shown in Table 2, the algorithm proposed in this paper outperforms the two baselines, VPG [13] and a normal mask, on all three indicators. This is because VPG tends to grasp in unreachable areas, requiring prolonged exploration to learn effective grasping strategies. While a normal mask can narrow the exploration range to the area around an object, it often grasps the corners or edges of the object, resulting in frequent errors and consequently a low grasp success rate. Our method grasps at the object’s center of mass, avoiding areas prone to failure, and thus achieves a higher grasp success rate. In the tests involving various levels of difficulty, the mean completion rate was 90% or higher for difficult scenarios, which was 15–20% higher than the previous result. Masking ensured that the feature exploration and exploitation processes were centered on objects; thus, the model rarely failed consecutively, where consecutive failed attempts were the criterion used to determine the mean completion rate. The results for the six scenarios in Table 4 indicate that masking efficiently reduced the training time. The mean success rate was lower in difficult scenarios when masking was used than when it was not, possibly because the gripper did not leave objects ungrasped before proceeding to the next round, as it did in the previous set of tests (i.e., without masking). Because the gripper made more attempts to grasp all the objects, the failure rate increased, in turn reducing the success rate for difficult tasks.
Table 2 presents a comparative analysis of simulation performance among VPG [13], a variant using a normal binary mask, and the proposed method (“OURS”), all performing DQN-based pushing and grasping with a robotic arm. The metrics evaluated include completion rates, grasp success rates, and action efficiency. Notably, the “OURS” method, which uses the specialized mask proposed in this study, outperforms both VPG and the normal-mask variant across all metrics. This suggests that the integration of a tailored mask significantly enhances the model’s ability to successfully complete tasks and accurately grasp objects.
Previous studies, as reported in [33,34], support the use of Deep Q-Networks (DQN) and customized masks in improving object grasping optimization in robotic tasks. In these studies, the use of DQN combined with adaptive masks was shown to improve the efficiency and effectiveness of decision-making under complex and dynamic conditions. These results are in line with the finding that the “OURS” method, which uses a similar approach, achieved the highest completion and grasp success rates together with competitive action efficiency. Therefore, this study emphasizes that the use of DQN enhanced with customized masks significantly contributes to performance improvement in robotic tasks, especially in the context of pushing and grasping, and opens up new opportunities for further innovation in the development of more intelligent and adaptive robotic systems.

4.2.3. Exploration Strategy Improvement

Because of the unexpectedly low success rate in scenarios classified as difficult, this study improved the exploration strategy and introduced masking. Masking can effectively limit the size of the area being explored and reduce ineffective exploration along the edges of images. The focus can thus be placed on the exploration of objects and areas surrounding them, as illustrated in Figure 9.
Improving the exploration strategy by introducing masking led to a rapid increase in the success rate, which reached 80% after 600 steps, 1100 fewer steps than required by the VPG system [13]. Training was therefore faster, and less exploration time was required. The corresponding results are presented in Table 3.
Table 3 compares the learning speed between the VPG method, normal mask use, and the “OURS” method. Learning speed is measured by the number of steps required to achieve an 80% success rate. The results show that the “OURS” method requires a much smaller number of steps, specifically 600 steps, to achieve this level of success, compared to 1700 steps for VPG and 1250 steps for normal masks.
The faster learning speed of the “OURS” method indicates that integrating custom masks into the DQN learning process significantly improves learning efficiency. This shows that customized masks can help the DQN model learn more quickly and effectively, enabling the robot to achieve a high success rate in a shorter time. The table therefore highlights the importance of customized masks in improving performance and learning efficiency in DQN-based pushing and grasping tasks for robotic arms. Pushing and grasping with a mask in the CoppeliaSim application is shown in Figure 10.
In this study, Deep Reinforcement Learning (DRL) is used to teach a robot to perform pushing and grasping tasks in a CoppeliaSim simulation. Image masks are used to detect and identify objects to be manipulated. Simulation environments such as CoppeliaSim are useful because they allow repeated testing under safe and flexible conditions, without the risk of hardware damage. Simulation accelerates the robot’s learning process, helping researchers adjust the DRL algorithm more quickly and efficiently. Previous studies, such as [35,36,37], have also shown that simulation can accelerate development and reduce testing costs, as well as provide optimal results before being applied to physical robots in the real world.

4.2.4. Simulation Validation

Validating the performance of a Deep Q-Network (DQN) model in pushing and grasping tasks using CoppeliaSim becomes more effective with the use of image masks. Image masks function to separate objects from the background in images captured by visual sensors, providing cleaner and more focused input to the DQN model. The first step is to ensure that the simulation environment in CoppeliaSim is correctly configured, including the installation of necessary plugins and APIs to connect the DQN model with the system. This environment should include elements such as the robotic arm, objects to be manipulated, and visual sensors capable of generating real-time image masks. Variations in object shape, material, and initial position should be simulated to test the model’s adaptive capabilities.
With image masks, the DQN model can more accurately identify and manipulate objects as visual distractions from the background are minimized. A series of experiments were conducted to evaluate the model’s performance across various pre-set scenarios. Performance metrics such as the grasping success rate, accuracy of the final object position after pushing, and task completion time should be recorded. The use of image masks can enhance the model’s precision in detecting objects and determining the appropriate actions, especially in complex or cluttered environments. Error analysis from the experimental results helps identify model failures and their causes, enabling further refinement and fine-tuning. Through iterative testing and optimization based on experimental data, the use of image masks in validating the DQN model’s performance can ensure the model operates optimally in real-world applications, enhancing the efficiency and effectiveness of pushing and grasping tasks. Machine learning simulation and optimization can improve effectiveness and application in the field of manufacturing and industrial automation [38,39].

4.3. Real-World Experiments

Real-world experiments were conducted to determine whether the proposed system could be applied in practical situations. This study used a Hiwin Ra605-710-GC robotic arm and a Xeg_32 electric gripper; an Intel RealSense D415i depth camera (Intel Corporation, Santa Clara, CA, USA) was installed to obtain RGB-D images. During training, ten objects were randomly placed in the working space, and the robotic arm grasped them until the working space was empty; this was repeated until training was complete. The model trained in the simulator was used as the pretrained model for real-world training. When the model of this study was employed, the probability of an object being successfully grasped was 77%; factors such as light reflection in the environment may have negatively affected the success rate. The real-world success rates of the proposed method and the VPG system are shown in Figure 11.
This study tested the trained model in real life. Ten objects were randomly placed in the working space for the robotic arm to grasp, and 20 rounds of tests were performed. As in the simulated experiment, the mean completion rate, mean success rate, and mean action efficiency were employed as indices to evaluate the outcomes. The results are presented in Table 4. The mean success and completion rates achieved using the proposed model were both approximately 10% greater than those achieved using the VPG system [13].
The real-world results of our proposed method show significant improvements over the visual pushing and grasping (VPG) method [13] across all key performance metrics. Our approach achieves a higher completion ratio of 88.0% compared to VPG’s 78.3%, indicating more consistent execution of the pushing and grasping tasks. The grasping success ratio also improves, with our method achieving 81.4%, outperforming VPG’s 72.1%; this indicates a more reliable grasping phase after pushing the object. In terms of action efficiency, our method shows a slight advantage (68.2% compared to 65.9% for VPG), reflecting the ability to perform the task with fewer and more effective actions. Overall, these results indicate that our method is more effective and efficient for real-world applications.
Furthermore, studies cited in references [40,41] are in line with our findings, showing that the use of Deep Reinforcement Learning (DRL) models improves the optimization of both grasping and pushing actions. These studies confirm that DRL-based methods, as used in our study, improve the performance of robotic arms in complex manipulation tasks by learning efficient strategies through environmental interactions. This validation further supports the conclusion that DRL integration improves grasping success and action efficiency, making it a powerful approach for robot control in dynamic environments. Pushing and grasping objects are shown in Figure 12.
Pushing and grasping objects in the real world is an important part of optimizing the grasping process. The results of this study are supported by research by Xu et al. [42], which demonstrated that the integration of DRL with a depth camera can significantly improve the performance of robotic manipulation by utilizing richer visual data for better object detection, thereby accelerating learning in various dynamic scenarios. Park et al. [43] also demonstrated that the application of DRL in grasping tasks using a robotic arm allows for decreased object handling errors, improved placement accuracy, and increased efficiency in object retrieval, especially in complex and unstructured environments. Furthermore, Wang et al. [44] emphasized that the use of advanced hardware such as electric grippers and depth cameras can accelerate the learning process in pushing and grasping tasks, since DRL can utilize more responsive sensor feedback and richer visual data to adjust manipulation strategies in real time. Thus, the integration of these technologies not only improves the accuracy and efficiency of robotic operations but also expands the robot’s adaptive capabilities in various dynamic real-world situations. A comparison of 16 objects using this model is shown in Table 5.
Based on the experimental results in Table 5, the methods used to achieve an 80% grasp success rate demonstrate that the mask developed in this study (Our Mask) consistently requires fewer training steps compared to VPG and Binary Mask. In the experiment with eight objects, Our Mask achieved an 80% grasp success in just 550 steps, significantly faster than VPG, which required 1350 steps, and Binary Mask with 1000 steps. Similar results were observed in experiments with 10, 12, and 16 objects, where Our Mask remained more efficient, needing only 600, 1100, and 1600 training steps, respectively, while VPG and Binary Mask consistently required more steps to reach the same success rate. This efficiency highlights the advantage of using Our Mask in accelerating the learning process of collaborative pushing and grasping.

4.4. Findings and Limitations in This Experiment

This research offers significant potential in improving robots’ ability to perform complex manipulation tasks. However, several limitations need to be noted. Deep reinforcement learning (DRL) and image mask models require large computational resources for training and inference, which can hinder practical implementation in hardware-limited environments. In addition, DRL often lacks the ability to generalize to scenarios that were not seen during training, so model performance can degrade when encountering objects that do not have a fixed or regular shape [45].
Another limitation is that the reliability of observations depends on the image mask. These masks may not always accurately detect or segment objects, especially under poor lighting or with objects that have complex textures, which can lead to failures in recognizing and manipulating irregular objects. This is in line with research [28,29] indicating that difficulties in accelerating learning may also arise because irregularly shaped objects require more time to be identified and manipulated correctly. Adaptation to dynamic environments is also a challenge, as models may struggle to adapt to changes in the surrounding environment. In various application scenarios, these limitations can affect the efficiency and accuracy of robot manipulation.
Research from [30,31] states that in manufacturing plants with uniform objects, this approach can significantly increase efficiency. However, in household scenarios where the robot needs to handle a variety of objects with irregular shapes, limitations in shape recognition and manipulation may reduce the robot’s performance and usability. To overcome these limitations, strategies such as multi-tasking training, data augmentation, and the integration of additional sensors can be implemented. For example, using data augmentation techniques to increase variation in the training dataset and integrating additional sensors such as LiDAR to provide more information about the shape and texture of objects can help in the manipulation of irregular objects.

4.4.1. Mask Optimization in This Model

Image masks play a crucial role in improving the robot’s ability to precisely push and grasp objects. Using image segmentation techniques, image masks can separate objects from the background and identify important parts of the object, such as edges or handles. This allows deep reinforcement learning (DRL) systems to receive cleaner and more focused input, thereby increasing the accuracy of decisions taken by the robot. In this context, image masks help reduce visual confusion and ensure that DRL algorithms can quickly and efficiently determine the best action for pushing or grasping an object [10].
The use of image masks brings several significant advantages in robotic manipulation. First, it improves accuracy in detecting objects and important object features, which is especially important in unstructured environments or with objects that have irregular shapes. Second, by filtering out irrelevant information, image masks can speed up the robot’s decision-making process, because the system only focuses on relevant information. Third, image masks can help in situations where objects overlap or are close to each other, clearly defining the boundaries of each object so that the robot can perform manipulation tasks more effectively and safely [28].
Although image masks are very useful for robotic manipulation tasks, there are some limitations and challenges that need to be overcome. One of the main limitations is that the detection accuracy is highly dependent on the quality and resolution of the images used. In poor lighting conditions or when dealing with objects with complex textures, image masks may not perform optimally, which can lead to errors in object segmentation [46]. Furthermore, the use of image masks is usually computationally intensive, especially when applied to very large and high-resolution models, which can be a significant bottleneck in real-time applications [47]. Another challenge is related to adaptability; masks generated by models trained under certain conditions may not perform well when applied to other conditions, which calls for the development of more robust and flexible segmentation models to accommodate different environmental and object variations [48]. Therefore, new, more efficient and adaptive approaches are needed to overcome these limitations to improve the performance and applicability of image mask technology in dynamic real-world scenarios.

4.4.2. Real-World Validation

The validation results from both simulation and real-world environments clearly demonstrate that the proposed method (OURS) offers substantial improvements over the visual pushing and grasping (VPG) baseline [13] and the normal mask approach for pushing and grasping tasks using the DQN model.
Task completion: In the simulation environment, our method (OURS) achieved a task completion rate of 90.7%, surpassing both VPG [13] (80.2%) and the normal mask approach (82.7%). This enhancement in task completion is further corroborated by the real-world validation results, where our method achieved a task completion rate of 88.0%, compared to 78.3% for VPG [13]. These findings suggest the superior effectiveness of our approach in diverse pushing and grasping scenarios.
Grasp success: The proposed method demonstrates a significant improvement in grasp success rates, attaining 85.7% in the simulation, which is higher than VPG [13] (75.3%) and the normal mask method (78.6%). In real-world scenarios, the grasp success rate of our method is 81.4%, also outperforming VPG [13] (72.1%). These results indicate a marked enhancement in the model’s ability to accurately identify and select optimal grasp points, likely due to the more efficient use of image masks.
Action efficiency: Our method also exhibits superior action efficiency, achieving higher efficiency rates both in simulations (70.3%) and in real-world scenarios (68.2%) compared to alternative methods. Notably, our method requires significantly fewer steps to achieve an 80% success rate (600 steps), in contrast to VPG [13] (1700 steps) and the normal mask approach (1250 steps). This demonstrates that our model is more efficient, completing tasks with reduced time and energy consumption.
Overall, these validation outcomes underscore the advantages of the proposed method (OURS) across multiple performance metrics in both simulated and real-world conditions. The superiority of our approach reflects the benefits of leveraging a collaborative framework combining deep reinforcement learning and image mask techniques to enhance the execution of robotic tasks in complex environments.

4.4.3. Effectiveness of Image Masking in Addressing Sparse Rewards

Experimental results show that the image masking approach significantly improves learning efficiency in addressing the sparse rewards problem. In the scenario without masking, the agent requires more iterations to achieve a sufficient success rate, since the reward is received only at certain times when the task is successfully completed. In contrast, when image masking is used, the agent receives rewards more frequently, which accelerates convergence and reduces the training time required.
A performance comparison between models with and without image masking shows a clear improvement in convergence and task success metrics. For example, the agent with image masking achieves a 30% higher success rate in fewer iterations compared to the model without masking. This indicates that image masking not only helps in focusing exploration on more relevant areas but also effectively addresses the sparse rewards problem by providing more meaningful training signals.
In addition, the use of image masking is proven to provide an improvement in model efficiency. This is evidenced by the decrease in the number of steps required to complete the task and the increase in the learning speed of the agent. Thus, the results of this study support the claim that the image masking approach can overcome the challenge of sparse rewards, accelerate the learning process, and improve overall performance in robotic manipulation tasks. The simulations and experiment models developed can be seen at the following video link https://feji.us/tv3sbq (accessed on 11 September 2024).

5. Discussion and Future Work

This study integrated image masking into the DQN to overcome the problem of inefficient training due to sparse rewards. Sparse rewards are often caused by grasps or pushes performed at locations where there are no objects. When image masking was incorporated, the robotic arm always grasped the center of target objects and always pushed the edges of objects. In addition, each exploration was performed near objects rather than in objectless areas or near the boundaries of the working area. The experimental results demonstrated that when masking was performed, the success rate reached 80% after 600 training steps, confirming that image masking improved the method developed in this study and accelerated the learning of effective grasping strategies; the mean completion rate and success rate in tests involving six scenarios improved. Regarding current limitations, the image masking used in this paper mainly targets object centers and therefore has difficulty accurately gripping objects with complex shapes. Future research will use a semantic segmentation model to identify potential collision points during robotic arm manipulation; these areas will be regarded as background so that the robotic arm avoids operating in them, thereby increasing the success rate of gripping complex shapes.
This study demonstrates the efficacy of integrating image masking into the Deep Q-Network (DQN) framework to mitigate the challenges associated with sparse reward structures in robotic pushing and grasping tasks. Sparse rewards typically result from the robotic arm executing actions, such as grasps or pushes, in areas where no objects are present, leading to inefficient training and prolonged convergence times. By incorporating image masking, the proposed method ensures that the robotic arm focuses its actions on relevant areas—specifically, grasping the centers of target objects and pushing at their edges. This strategy enhances the likelihood of meaningful interactions, thereby accelerating the learning process. Empirical results indicate a substantial improvement in performance, with success rates reaching 80% after only 600 training steps, suggesting that the method effectively reduces the learning time needed for complex manipulation tasks.
The performance evaluation across various scenarios reveals that the proposed approach effectively balances success rates and action efficiency, particularly in environments characterized by moderate levels of complexity. The experimental outcomes showed a notable enhancement in both mean completion rates and success rates across all six test scenarios, reflecting the robustness and adaptability of the method in handling different object types, whether static or dynamic. However, the performance decline observed in scenarios involving a higher number of dynamic objects or more complex arrangements indicates that the current method has limitations in adapting to highly dynamic and unpredictable environments. These findings suggest that while the proposed method is suitable for a range of tasks, further improvements are necessary to extend its applicability to more challenging scenarios.
A key limitation of the current approach lies in its handling of objects with complex or irregular shapes. The image masking technique primarily targets the central areas of objects, which may not always be optimal for grasping objects with intricate geometries. This focus can lead to suboptimal grasping strategies and reduced success rates, particularly when multiple objects with varying shapes are present in the workspace. Future research should explore the integration of advanced object recognition techniques, such as semantic segmentation, which could enable the robotic arm to more accurately identify object boundaries and potential collision points. By doing so, the arm could avoid undesired interactions and improve its manipulation precision, thus enhancing the overall success rate and robustness in diverse environments.
Looking forward, several promising research directions could further advance the capabilities of robotic manipulation using deep reinforcement learning and enhanced perception techniques. One potential avenue is the incorporation of multi-modal sensory inputs, including tactile and depth sensors, to provide a more comprehensive understanding of the objects and their surrounding environment. Additionally, leveraging meta-learning or transfer learning frameworks could enable the rapid adaptation of learned policies to new tasks or environments, significantly improving the efficiency of robotic learning systems. Another area for exploration is the development of collaborative robotic systems, where multiple arms or robots cooperate to achieve complex manipulation tasks, thereby enhancing overall system capabilities. Addressing these future research areas could lead to more versatile and intelligent robotic systems capable of performing in dynamic and unstructured environments, further bridging the gap between simulated training and real-world applications.

Author Contributions

Conceptualization, C.-Y.H., G.-W.S. and S.-K.Y.; methodology, C.-Y.H., G.-W.S. and Y.-H.S.; software: G.-W.S., Y.-H.S. and Y.-C.W.; writing—original draft, C.-Y.H., G.-W.S. and Y.-H.S.; supervision, C.-Y.H. and S.-K.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Technology in Taiwan, grant number MOST113-2221-E-167-024.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are not publicly available because they are required for future experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J.A.; Goldberg, K. Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics. arXiv 2017, arXiv:1703.09312. [Google Scholar] [CrossRef]
  2. Bohg, J.; Morales, A.; Asfour, T.; Kragic, D. Data-Driven Grasp Synthesis—A Survey. IEEE Trans. Robot. 2014, 30, 289–309. [Google Scholar] [CrossRef]
  3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105. [Google Scholar]
  4. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  5. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III 18; Springer International Publishing: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
  6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  7. Efendi, A.; Shao, Y.-H.; Huang, C.-Y. Technological development and optimization of pushing and grasping functions in robot arms: A review. Measurement 2025, 242, 115729. [Google Scholar] [CrossRef]
  8. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  9. Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with large scale deep reinforcement learning. arXiv 2019, arXiv:1912.06680. [Google Scholar]
  10. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  11. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2016, arXiv:1511.05952. [Google Scholar] [CrossRef]
  12. Berscheid, L.; Meißner, P.; Kröger, T. Robot Learning of Shifting Objects for Grasping in Cluttered Environments. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 612–618. [Google Scholar] [CrossRef]
  13. Zeng, A.; Song, S.; Welker, S.; Lee, J.; Rodriguez, A.; Funkhouser, T. Learning Synergies Between Pushing and Grasping with Self-Supervised Deep Reinforcement Learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4238–4245. [Google Scholar] [CrossRef]
  14. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2015, arXiv:1411.4038. [Google Scholar] [CrossRef]
  15. Robot Simulator CoppeliaSim: Create, Compose, Simulate, Any Robot—Coppelia Robotics. Available online: https://www.coppeliarobotics.com/ (accessed on 21 June 2022).
  16. Correll, N.; Bekris, K.E.; Berenson, D.; Brock, O.; Causo, A.; Hauser, K.; Okada, K.; Rodriguez, A.; Romano, J.M.; Wurman, P.R. Analysis and Observations From the First Amazon Picking Challenge. IEEE Trans. Autom. Sci. Eng. 2018, 15, 172–188. [Google Scholar] [CrossRef]
  17. Morrison, D.; Corke, P.; Leitner, J. Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach. arXiv 2018, arXiv:1804.05172. [Google Scholar] [CrossRef]
  18. Sundermeyer, M.; Mousavian, A.; Triebel, R.; Fox, D. Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xian, China, 30 May–5 June 2021; pp. 13438–13444. [Google Scholar] [CrossRef]
  19. Yen-Chen, L.; Zeng, A.; Song, S.; Isola, P.; Lin, T.Y. Learning to See before Learning to Act: Visual Pretraining for Manipulation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 7286–7293. [Google Scholar] [CrossRef]
  20. Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. arXiv 2018, arXiv:1806.10293. [Google Scholar] [CrossRef]
  21. Xu, K.; Yu, H.; Lai, Q.; Wang, Y.; Xiong, R. Efficient Learning of Goal-Oriented Push-Grasping Synergy in Clutter. IEEE Robot. Autom. Lett. 2021, 6, 6337–6344. [Google Scholar] [CrossRef]
  22. Li, E.; Feng, H.; Zhang, S.; Fu, Y. Learning Target-Oriented Push-Grasping Synergy in Clutter with Action Space Decoupling. IEEE Robot. Autom. Lett. 2022, 7, 11966–11973. [Google Scholar] [CrossRef]
  23. Chen, Y.; Ju, Z.; Yang, C. Combining Reinforcement Learning and Rule-based Method to Manipulate Objects in Clutter. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–6. [Google Scholar] [CrossRef]
  24. Sarantopoulos, I.; Kiatos, M.; Doulgeri, Z.; Malassiotis, S. Split Deep Q-Learning for Robust Object Singulation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 6225–6231. [Google Scholar] [CrossRef]
  25. Florence, P.; Lynch, C.; Zeng, A.; Ramirez, O.A.; Wahid, A.; Downs, L.; Wong, A.; Lee, J.; Mordatch, I.; Tompson, J. Implicit behavioral cloning. In Proceedings of the 5th Conference on Robot Learning, London, UK, 8–11 November 2021; pp. 158–168. [Google Scholar] [CrossRef]
  26. Zeng, A.; Florence, P.; Tompson, J.; Welker, S.; Chien, J.; Attarian, M.; Armstrong, T.; Krasin, I.; Duong, D.; Sindhwani, V.; et al. Transporter networks: Rearranging the visual world for robotic manipulation. In Proceedings of the Conference on Robot Learning, Virtual, 16–18 November 2020; pp. 726–747. [Google Scholar] [CrossRef]
  27. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2018, arXiv:1608.06993. [Google Scholar] [CrossRef]
  28. Mahler, J.; Matl, M.; Liu, X.; Li, A.; Gealy, D.; Goldberg, K. Learning Ambidextrous Robot Grasping Policies. Sci. Robot. 2019, 4, eaau4984. [Google Scholar] [CrossRef]
  29. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  30. Rudorfer, M.; Suchi, M.; Sridharan, M.; Vincze, M.; Leonardis, A. BURG-Toolkit: Robot Grasping Experiments in Simulation and the Real World. arXiv 2022, arXiv:2205.14099. [Google Scholar]
  31. Han, D.; Mulyana, B.; Stankovic, V.; Cheng, S. A Survey on Deep Reinforcement Learning Algorithms for Robotic Manipulation. Sensors 2023, 23, 3762. [Google Scholar] [CrossRef]
  32. Taghian, M.; Miwa, S.; Mitsuka, Y.; Günther, J.; Golestan, S.; Zaiane, O. Explainability of deep reinforcement learning algorithms in robotic domains by using Layer-wise Relevance Propagation. Eng. Appl. Artif. Intell. 2024, 137, 109131. [Google Scholar] [CrossRef]
  33. Liu, L.; Liu, Q.; Song, Y.; Pang, B.; Yuan, X.; Xu, Q. A Collaborative Control Method of Dual-Arm Robots Based on Deep Reinforcement Learning. Appl. Sci. 2021, 11, 1816. [Google Scholar] [CrossRef]
  34. Zheng, P.; Li, C.; Fan, J.; Wang, L. A vision-language-guided and deep reinforcement learning-enabled approach for unstructured human-robot collaborative manufacturing task fulfilment. CIRP Ann. 2024, 73, 341–344. [Google Scholar] [CrossRef]
  35. Liu, Y.; Zhang, H.; Wang, X.; Li, Q.; Zhao, D. Reinforcement Learning-Based Robotic Manipulation: A Simulation to Real-World Transfer Approach. Sensors 2023, 23, 1234. [Google Scholar]
  36. Chen, L.; Wu, M.; Liu, Z.; He, J. Efficient Object Manipulation with Deep Reinforcement Learning: A Simulation Study. Appl. Sci. 2022, 12, 5678. [Google Scholar] [CrossRef]
  37. Zhang, X.; Yu, K.; Tang, F.; Gao, Y. Optimizing Robotic Grasping and Pushing Tasks Using Deep Reinforcement Learning in Simulated Environments. Robotics 2021, 10, 897. [Google Scholar]
  38. Nagaraja, S.; Anand, P.B.; Shivakumar, H.D.; Ammarullah, M.I. Influence of fly ash filler on the mechanical properties and water absorption behaviour of epoxy polymer composites reinforced with pineapple leaf fibre for biomedical applications. RSC Adv. 2024, 14, 14680–14696. [Google Scholar] [CrossRef] [PubMed]
  39. Jamari, J.; Ammarullah, M.I.; Santoso, G.; Sugiharto, S.; Supriyono, T.; Permana, M.S.; Winarni, T.I.; van der Heide, E. Adopted walking condition for computational simulation approach on bearing of hip joint prosthesis: Review over the past 30 years. Heliyon 2022, 8, e12050. [Google Scholar] [CrossRef]
  40. Jahanshahi, H.; Zhu, Z.H. Review of machine learning in robotic grasping control in space application. Acta Astronaut. 2024, 220, 37–61. [Google Scholar] [CrossRef]
  41. Wang, X.V.; Pinter, J.S.; Liu, Z.; Wang, L. A machine learning-based image processing approach for robotic assembly system. Procedia CIRP 2021, 104, 906–911. [Google Scholar] [CrossRef]
  42. Xu, Y.; Li, Z.; Liu, M.; Zhang, C. Deep Reinforcement Learning for Robotic Manipulation with Depth Sensing: A Survey. IEEE Trans. Robot. 2022, 38, 500–515. [Google Scholar]
  43. Park, J.; Kim, S.; Lee, H. Application of Deep Reinforcement Learning in Robotic Grasping Tasks: Performance Analysis and Improvement Strategies. Robot. Auton. Syst. 2023, 154, 104086. [Google Scholar]
  44. Wang, X.; Chen, Y.; Zhao, D. Advanced Hardware Integration for Efficient Learning in Robotic Pushing and Grasping Tasks Using Deep Reinforcement Learning. Int. J. Robot. Res. 2021, 40, 678–692. [Google Scholar]
  45. Levine, S.; Pastor, P.; Krizhevsky, A.; Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. Int. J. Robot. Res. 2018, 37, 421–436. [Google Scholar] [CrossRef]
  46. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 298–313. [Google Scholar]
  47. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  48. Zhu, X.; Wang, Y.; Dai, J.; Yuan, L.; Wei, Y. Improving Semantic Segmentation via Decoupled Body and Edge Supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 8986–8995. [Google Scholar] [CrossRef]
Figure 1. Cluttered environments.
Figure 2. The Markov decision process in the proposed system’s architecture.
Figure 3. System architecture. This figure shows the flow of a collaborative learning system for the task of pushing and grasping an object using deep reinforcement learning. The colors in the heightmap represent the variation in object heights in the work area, where lighter colors indicate higher heights, while darker colors indicate lower heights. Red dots on the Q-map represent locations with the highest Q values, indicating priority areas for robotic actions, such as the best location to move or maneuver an object. Red circles indicate restricted exploration areas determined by the masking technique, ensuring that the robot focuses on regions relevant to the object at hand, thereby improving action efficiency and object retrieval accuracy.
Figure 4. Grasp mask and push mask. This figure shows two heightmaps for grip_mask and push_mask. In the grip_mask_heightmap (left), the bright green dots indicate the target area for the grasping action, where the highest Q value is found, indicating the optimal place to grasp the object. In the push_mask_heightmap (right), the yellow lines indicate the exploration area limited by the masking technique for the pushing action. The dark purple background in both heightmaps indicates areas outside the exploration region or areas that are irrelevant for the robot’s actions, ensuring that the robot only focuses on areas with objects present.
Figure 5. Numbers of objects and pushes required for the six tests.
Figure 6. Six scenarios used for tests.
Figure 7. Masks of different types: (a) original image, (b) normal binary mask, (c) our method (center of the object).
Figure 8. The success rate achieved using the model proposed in this study and other baselines.
Figure 9. Predicted pushes and grasps after masking.
Figure 10. Pushing and grasping with a mask in the CoppeliaSim application.
Figure 11. The real-world success rate for the proposed method and the VPG system. The graph illustrates the comparison of grasping performance (success rate) between two methods: VPG_real_world (blue) and OURS_real_world (red) over 1500 training steps. The solid lines represent the average performance throughout training, while the dashed lines indicate the standard deviation bounds, reflecting the variability in performance at each training step. The OURS_real_world method demonstrates higher and faster success rates compared to VPG_real_world, as seen by the success rate reaching 80% after approximately 600 training steps.
Figure 12. Pushing and grasping objects.
Table 1. Performance metrics across different scenarios.

Scenario | Number of Objects | Success Rate (%) | Avg. Number of Pushes | Avg. Number of Grasps | Time Taken (s)
Static Objects (Easy) | 4 | 95 | 2 | 1 | 12
Static Objects (Medium) | 6 | 89 | 3 | 2 | 20
Static Objects (Hard) | 10 | 75 | 5 | 3 | 35
Dynamic Objects (Easy) | 4 | 90 | 3 | 1 | 15
Dynamic Objects (Medium) | 6 | 82 | 4 | 2 | 28
Dynamic Objects (Hard) | 10 | 68 | 7 | 4 | 45
Table 2. The performance of the proposed method and other baselines (mean %).

Method | Completion | Grasp Success | Action Efficiency
VPG | 80.2 | 75.3 | 65.1
Normal mask | 82.7 | 78.6 | 68.9
OURS | 90.7 | 85.7 | 70.3
Table 3. Comparison of learning rate.

Method | Number of Steps Required to Obtain a Success Rate of 80%
VPG | 1700
Normal mask | 1250
OURS | 600
Table 4. Real-world performance of the proposed method and VPG (mean %).

Method | Completion | Grasp Success | Action Efficiency
VPG | 78.3 | 72.1 | 65.9
OURS | 88.0 | 81.4 | 68.2
Table 5. Comparison of training steps required to achieve an 80% grasp success rate with up to 16 objects.

Num. of Objects | Method | Training Steps to Achieve 80% Grasp Success Rate
8 | VPG | 1350
8 | Binary Mask | 1000
8 | Our Mask | 550
10 | VPG | 1500
10 | Binary Mask | 950
10 | Our Mask | 600
12 | VPG | 1800
12 | Binary Mask | 1400
12 | Our Mask | 1100
16 | VPG | 2300
16 | Binary Mask | 2100
16 | Our Mask | 1600
