Reinforcement Learning-Based Control for Robotic Flexible Element Disassembly
Abstract
1. Introduction
- Product complexity: Disassembly often involves products with numerous, intricately connected components. The complexity rises with the number of parts and the intricacy of their connections, requiring sophisticated handling to avoid damaging valuable elements.
- Product variability: Variability across different products, or even between different versions of the same product, necessitates highly adaptable disassembly processes. Traditional automated systems struggle to accommodate this variability without extensive reconfiguration.
- Condition of components: The condition of the components of a product can vary widely. Parts may be damaged, worn out, or contaminated, complicating the disassembly process and requiring adaptable strategies to effectively handle them.
1. RL-based control strategy: The design and implementation of an RL-based control strategy tailored to the disassembly of flexible elements, emphasizing force minimization and adaptability.
2. Adaptive reward function: The introduction of an adaptive reward function that normalizes task complexity based on material properties, ensuring consistent performance across varying elasticities.
3. Algorithm comparison: A comparative analysis of state-of-the-art RL algorithms (SAC, DDPG, and PPO) to evaluate their effectiveness in dynamic disassembly environments. By benchmarking these algorithms, this work provides practical insights into their applicability for real-world disassembly tasks while also identifying key limitations, such as challenges in generalizing to unseen extraction directions.
4. Experimental validation: A comprehensive experimental evaluation in a simulated environment, demonstrating the ability to generalize across different disassembly scenarios and material characteristics.
2. Related Work
3. Problem Formulation
3.1. Markov Decision Process Formulation
- S is the state space, representing the robot’s observations of its environment.
- A is the action space, consisting of the robot’s possible movements.
- P is the transition probability function, describing the dynamics of the environment.
- R is the reward function, providing feedback to the robot based on its actions.
- γ is the discount factor, balancing immediate and future rewards.
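In standard notation (not reproduced verbatim from the paper’s equations), the resulting control problem and learning objective can be written as:

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, R(s_t, a_t)\right],
\qquad \gamma \in [0, 1).
```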
3.1.1. State Space (S)
- The position of the end effector relative to the grasping point, $p_{ee} - p_{g}$.
- The Cartesian force exerted by the end effector, computed as the Euclidean norm of the force components: $\lVert F \rVert = \sqrt{F_x^2 + F_y^2 + F_z^2}$.
- The distance, d, between the end-effector position $p_{ee}$ and the grasping point $p_{g}$: $d = \lVert p_{ee} - p_{g} \rVert$.
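A minimal sketch of how such an observation vector could be assembled is given below, assuming NumPy arrays for the end-effector position, grasping point, and measured Cartesian force; the function and variable names are illustrative and not taken from the authors’ implementation.

```python
import numpy as np

def build_observation(p_ee: np.ndarray, p_grasp: np.ndarray, f_ee: np.ndarray) -> np.ndarray:
    """Assemble the state vector: relative position, force norm, and distance.

    p_ee    : (3,) end-effector position
    p_grasp : (3,) grasping-point position in the same frame
    f_ee    : (3,) Cartesian force measured at the end effector
    """
    rel_pos = p_ee - p_grasp               # position relative to the grasping point
    force_norm = np.linalg.norm(f_ee)      # Euclidean norm of the force components
    distance = np.linalg.norm(rel_pos)     # distance d to the grasping point
    return np.concatenate([rel_pos, [force_norm, distance]])

# Example: 2 cm lateral offset from the grasping point with a small contact force
obs = build_observation(np.array([0.40, 0.02, 0.30]),
                        np.array([0.40, 0.00, 0.30]),
                        np.array([1.5, 0.0, -0.5]))
```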
3.1.2. Action Space (A)
3.2. Reward Function Design
- d represents the progress made in the disassembly task, measured using the distance between the grasping point and the current position of the end effector.
- $\lVert F \rVert$ denotes the physical interaction forces exerted by the robot, which should be minimized to prevent damage to the flexible elements and ensure safe handling.
- Two fixed weighting coefficients govern the trade-off between task progress and force minimization; they determine the relative importance of each objective in the reward function, ensuring a balanced optimization strategy.
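One plausible form of Equation (3) consistent with the description above is sketched below; the weighting symbols $w_d$ and $w_F$ are placeholders for the paper’s fixed coefficients rather than its exact notation:

```latex
R_t = w_d \, d_t \;-\; w_F \, \lVert F_t \rVert ,
```

with $d_t$ the extraction progress at step $t$ and $\lVert F_t \rVert$ the force norm defined in Section 3.1.1.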
Adaptive Reward Function
- $R$ is the reward computed using Equation (3).
- $R_{\min}$ and $R_{\max}$ are the minimum and maximum expected reward values for the episode, estimated based on the elastic constant of the flexible element.
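Under these definitions, the adaptive reward amounts to a min–max normalization of the base reward; the exact scaling in the paper may differ, but a sketch of the normalization idea is:

```latex
R_{\text{adapt}} = \frac{R - R_{\min}(k)}{R_{\max}(k) - R_{\min}(k)},
```

where the bounds $R_{\min}(k)$ and $R_{\max}(k)$ are estimated from the elastic constant $k$, so that episodes with stiff and compliant elements yield rewards on a comparable scale; rewards falling outside the estimated bounds map outside $[0, 1]$.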
4. Methodology
4.1. Experimental Setup
4.1.1. Simulated Environment
- Kinematics and dynamics: the simulation includes the kinematic and dynamic models of the KUKA LBR iiwa14 robot, ensuring realistic interaction with the flexible elements.
- Use case workspace: the workspace mimics the real-world setup, including the constraints and preferred extraction direction for the flexible element.
- Interaction forces: The forces exerted during extraction are simulated using two main components: the reaction force of the gripper and the elastic force of the flexible element. These forces are modeled to replicate the physical interactions between the robot and the flexible element during disassembly. However, it is important to note that the main sim-to-real gaps are expected in this aspect, as real-world conditions may introduce additional complexities, such as unmodeled friction, material imperfections, or dynamic perturbations, which are not fully captured in the simulation. A simplified sketch of such a force model is given below.
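The sketch below assumes a linear-elastic (Hooke’s-law) element with constant k; the anchor-point formulation and function names are illustrative assumptions, not the simulator’s actual force model.

```python
import numpy as np

def interaction_force(p_ee: np.ndarray, p_anchor: np.ndarray,
                      k: float, f_gripper: np.ndarray) -> np.ndarray:
    """Approximate force felt at the end effector while extracting the element.

    p_ee      : (3,) current end-effector position
    p_anchor  : (3,) point where the flexible element remains attached
    k         : elastic constant of the element [N/m] (Hooke's-law assumption)
    f_gripper : (3,) reaction force of the gripper on the element
    """
    stretch = p_ee - p_anchor        # elongation of the flexible element
    f_elastic = -k * stretch         # restoring force pulling back toward the anchor
    return f_elastic + f_gripper     # combined interaction force acting on the robot

# Example: element with k = 200 N/m stretched 5 cm along x, no gripper reaction
f_total = interaction_force(np.array([0.05, 0.0, 0.0]), np.zeros(3),
                            k=200.0, f_gripper=np.zeros(3))
```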
4.1.2. Hybrid Planning Architecture
- General objective: detach the entire flexible element by sequentially grasping it at various positions until complete extraction is achieved.
- Specific objective: preserve the physical integrity of the element by minimizing the applied force during extraction at each grasping point, thereby identifying low-force extraction trajectories for each operation.
1. Global planning: At the start of the task, the global planner generates a reference trajectory that includes all grasping points. The robot then moves to the first grasping point.
2. Local planning and execution: Upon reaching the grasping point, the local planner (using RL-based control) takes over to handle the interaction with the flexible element. The local planner adjusts the robot’s actions in response to real-time feedback, ensuring efficient and low-force extraction.
3. Switching back to global planning: Once the element at the current grasping point is successfully extracted, the planner manager switches control back to the global planner, which moves the robot to the next grasping point.
4. Repeating the process: This process repeats until all grasping points are addressed and the disassembly task is completed.
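The switching logic can be summarized with the schematic sketch below; the planner interfaces (generate_grasping_points, move_to, get_observation, predict, apply) are hypothetical stand-ins for the hybrid planning architecture rather than its real API.

```python
def run_disassembly(global_planner, local_planner, robot, element):
    """Alternate between global trajectory planning and RL-based local control."""
    # Global planning: reference trajectory covering all grasping points.
    grasping_points = global_planner.generate_grasping_points(element)

    for point in grasping_points:
        robot.move_to(point)                 # follow the global reference trajectory

        # Local planning: the RL policy handles the contact-rich extraction,
        # reacting to force feedback until the element is released at this point.
        done = False
        while not done:
            obs = robot.get_observation()
            action = local_planner.predict(obs)
            done = robot.apply(action)

        # Control then returns to the global planner for the next grasping point.
    # The loop ends once every grasping point has been addressed.
```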
4.2. Experiments
4.2.1. Training and Testing Procedure
- Structured scenario (S): This scenario represents an ideal case where comprehensive information about the environment is available beforehand (elastic properties and the expected direction of extraction). The agent is both trained and tested under these well-defined conditions, allowing for fast learning and high task performance due to the consistency of the environment.
- Operational scenario (O): This setup reflects real-world disassembly conditions, where exact environmental characteristics are unknown but operational limits can be estimated. Since this scenario closely mirrors practical applications, it serves as the primary benchmark for evaluating system performance and deriving key conclusions.
- Unstructured scenario (U): In this configuration, the agent encounters environments significantly different from those used during training. This scenario is designed to test the adaptive capabilities of the RL-based controller, assessing its ability to generalize and perform in completely unfamiliar conditions.
1. Environment configuration: Training and testing are conducted under different environmental characteristics to analyze the behavior and performance of the RL algorithms. This approach is designed to evaluate the learning capacities of RL agents by exposing them to a range of conditions, spanning from structured scenarios (S) to operational scenarios (O) and, finally, to unexplored configurations (U). This progression allows us to assess adaptability and robustness across increasingly complex and uncertain environments.
- Structured configuration (S): elastic modulus (200 [N/m]); direction of extraction (0°).
- Operational range (O): elastic modulus ([200, 700] [N/m]); direction of extraction ([−30°, 30°]).
- Unexplored configuration (U): elastic modulus (1000 [N/m]); direction of extraction (60°).
2. Training roll-out setting: For the training of each scenario, a set of 50 roll-outs of 300,000 steps each is performed (a configuration-sampling sketch is given after this list).
3. Episode initialization: At the start of each episode, the end effector of the robot is positioned near the first grasping point of the flexible element. The elastic modulus (k) of the element and the preferred direction of extraction are selected according to the roll-out specification.
4. Episode execution: The robot, guided by the RL-based local planner, attempts to extract the flexible element, adjusting its actions based on the interaction forces and the reward function.
5. Policy update: The RL algorithm updates its policy based on the cumulative rewards received during each episode, gradually improving its performance over time.
6. Agent evaluation: The strategy learned by the agent in each roll-out is tested in different environment configurations (S, O, and U), where the metrics are computed over a batch of 100 individual tests.
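A small sketch of how the roll-out configurations could be sampled from the ranges listed in item 1 is given below; the dictionary layout and seeding are illustrative choices, not the authors’ code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_configuration(scenario: str) -> dict:
    """Return an environment configuration (elastic constant, extraction direction)."""
    if scenario == "S":      # structured: fixed, fully known conditions
        return {"k": 200.0, "direction_deg": 0.0}
    if scenario == "O":      # operational: drawn within the estimated operating limits
        return {"k": float(rng.uniform(200.0, 700.0)),
                "direction_deg": float(rng.uniform(-30.0, 30.0))}
    if scenario == "U":      # unexplored: outside the training distribution
        return {"k": 1000.0, "direction_deg": 60.0}
    raise ValueError(f"unknown scenario: {scenario}")

# Example: one operational-range configuration for a training roll-out
cfg = sample_configuration("O")
```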
4.2.2. Algorithms Used
4.2.3. Evaluation Metrics
- Success rate: The percentage of episodes in which the robot successfully extracted the flexible element without causing damage. This was measured using two thresholds: a position range to ensure proper trajectory execution and a direction threshold to ensure correct extraction direction.
- Force exertion: The average forces exerted by the robot during the extraction process were measured to evaluate the ability to minimize interaction forces and avoid damaging the flexible element. For this analysis, two baseline trajectories were used for comparison: an ideal trajectory, which follows the theoretical extraction direction and represents the optimal path for minimizing forces, and a deviated trajectory, which diverges by 45° from the ideal path, simulating a suboptimal or misaligned extraction scenario. By comparing the system’s performance against these two baselines, we expect the results to fall between them, ideally closer to the ideal trajectory. This comparison provides a clear benchmark for assessing the effectiveness of maintaining low force levels and ensuring the safe handling of the flexible elements.
- Adaptability: The adaptability of the RL agent to different elastic constants and environmental conditions is qualitatively assessed using two key metrics: the success rate and the mean reward value. The success rate indicates whether the disassembly task was completed, while the mean reward value provides insight into how well the task was performed. A high mean reward value (closer to 1) suggests that the agent not only completed the task but also minimized excessive forces during the process. This dual evaluation is crucial because, in some cases, the task may be completed successfully even though high forces were exerted on the element during the process, potentially causing damage. By considering both the success rate and the mean reward value, we ensure that the agent not only achieves the goal but also performs the task efficiently and safely.
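For completeness, a minimal sketch of how these metrics could be aggregated over a batch of 100 test episodes follows; the tolerance values are illustrative placeholders, not the thresholds used in the paper.

```python
import numpy as np

def evaluate_batch(final_positions, target_positions, directions_deg, target_dir_deg,
                   episode_rewards, pos_tol=0.01, dir_tol_deg=5.0):
    """Compute the success rate and mean (normalized) reward over a batch of tests."""
    final_positions = np.asarray(final_positions)    # (N, 3) end-of-episode positions
    target_positions = np.asarray(target_positions)  # (N, 3) expected extraction positions
    directions_deg = np.asarray(directions_deg)      # (N,) realized extraction directions
    episode_rewards = np.asarray(episode_rewards)    # (N,) mean adaptive reward per episode

    pos_ok = np.linalg.norm(final_positions - target_positions, axis=1) <= pos_tol
    dir_ok = np.abs(directions_deg - target_dir_deg) <= dir_tol_deg
    success_rate = float(np.mean(pos_ok & dir_ok))   # both thresholds must be satisfied
    mean_reward = float(np.mean(episode_rewards))    # closer to 1 means low-force execution
    return success_rate, mean_reward
```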
5. Results and Discussion
5.1. Training Results
- PPO exhibited one of the fastest convergences and the lowest computational time, reaching a high cumulative reward within fewer training episodes than SAC. Its more stable learning process compared with SAC and DDPG makes it well suited for environments where rapid learning is essential.
- DDPG also converged quickly but exhibited more variance during training, which suggests that it struggled more with the complex, dynamic nature of the disassembly task.
- SAC showed a slower but stable convergence, demonstrating its robustness in environments with continuous action spaces and dynamic conditions. However, SAC also showed a high computational time and the lowest cumulative reward compared to PPO and DDPG.
5.2. Evaluating the Learned Strategies
5.2.1. Performance of the Agent in the Disassembly Task
- Metrics performance: From the results in Table 2, it is evident that the agents trained and tested under structured configurations (S) achieved the highest performance, as indicated by the mean reward metrics. The agents consistently demonstrated a success rate of 1.0, indicating flawless task execution. Moreover, as shown in Figure 7, the agents exhibited the lowest variance in comparison to algorithms trained and evaluated in other task configurations, reflecting both stability and reliability in structured environments.
- Force exertion: Figure 8 compares the force signatures of the three algorithms during the extraction task. All three algorithms closely followed the ideal (lowest-force) extraction path, highlighting the effectiveness of the RL-based control in minimizing applied forces. The agents consistently reduced force exertion by at least 20% compared to a suboptimal 45° trajectory, demonstrating their capability to dynamically optimize disassembly actions while ensuring minimal physical stress on the flexible element.
- Trajectory efficiency: In successful cases, all three agents exhibited efficient, direct trajectories during the disassembly task. As shown in the force signature analysis in Figure 8, the PPO agent consistently performed smooth, controlled movements with minimal variation, further indicating its ability to maintain a stable and optimized trajectory throughout the extraction process.
5.2.2. Adaptability and Generalization
5.3. Discussion
- As expected, the agent performs optimally when trained and tested in structured conditions, achieving the highest success rates.
- In operational conditions, the agent also demonstrates strong performance, achieving a perfect success rate. This is a crucial finding, as it validates the proposed approach and supports its potential transfer to real-world experiments.
- A significant observation is that the only cases where the agent fails to complete the task involve unknown extraction directions. However, when faced with unknown elastic properties, the agent successfully adapts, demonstrating its ability to generalize across different material conditions.
- The adaptive reward function played a crucial role in this success, particularly in handling varying elastic properties of flexible elements. The results indicate that dynamically normalizing the force component of the reward function based on material elasticity allows the RL agent to generalize effectively across diverse scenarios. By scaling the reward function according to the elasticity (k) of the element, the system ensures consistent and meaningful feedback, regardless of material properties.
- These results were achieved despite using simplified force models and state representations, aligning with the objective of validating the use of simplifications without compromising task success. Furthermore, since the agent successfully overcame these simplifications through the adaptive reward function, this finding suggests that the system may also be capable of bridging the sim-to-real gap and handling real-world uncertainties such as noise and unmodeled environmental factors.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
RL | Reinforcement Learning
SAC | Soft Actor–Critic
DDPG | Deep Deterministic Policy Gradient
PPO | Proximal Policy Optimization
AI | Artificial Intelligence
LfD | Learning from Demonstration
IRL | Inverse Reinforcement Learning
MDP | Markov Decision Process
ROS2 | Robot Operating System 2
k | Modulus of Elasticity
RL Algorithm | Learning Rate | Batch Size | Tau | Gamma | GAE Lambda
---|---|---|---|---|---
SAC | 0.003 | 256 | 0.005 | 0.99 | -
DDPG | 0.001 | 256 | 0.005 | 0.99 | -
PPO | 0.003 | 64 | 0.005 | 0.99 | 0.95
Evaluation of Learned Strategies Under Different Environment Configurations.

Algorithm | Training Force (k) | Training Direction | Test Force (k) | Test Direction | Mean Reward | Success Rate
---|---|---|---|---|---|---
SAC | S | S | S | S | 0.85 | 1.00 |
SAC | O | O | S | S | 0.60 | 1.00 |
SAC | O | O | O | O | 0.61 | 1.00 |
SAC | O | O | U | O | 0.48 | 0.00 |
SAC | O | O | O | U | −0.08 | 0.00 |
SAC | O | O | U | U | −0.05 | 0.00 |
DDPG | S | S | S | S | 0.75 | 1.00 |
DDPG | O | O | S | S | 0.44 | 1.00 |
DDPG | O | O | O | O | 0.44 | 1.00 |
DDPG | O | O | U | O | 0.48 | 0.57 |
DDPG | O | O | O | U | −0.25 | 0.00 |
DDPG | O | O | U | U | −0.02 | 0.00 |
PPO | S | S | S | S | 0.80 | 1.00 |
PPO | O | O | S | S | 0.62 | 1.00 |
PPO | O | O | O | O | 0.62 | 1.00 |
PPO | O | O | U | O | 0.62 | 1.00 |
PPO | O | O | O | U | −0.46 | 0.00 |
PPO | O | O | U | U | −0.47 | 0.00 |