Reinforcement Learning-Based Control for Robotic Flexible Element Disassembly
Abstract
1. Introduction
- Product complexity: Disassembly often involves products with numerous, intricately connected components. The complexity rises with the number of parts and the intricacy of their connections, requiring sophisticated handling to avoid damaging valuable elements.
- Product variability: Variability across different products, or even between different versions of the same product, necessitates highly adaptable disassembly processes. Traditional automated systems struggle to accommodate this variability without extensive reconfiguration.
- Condition of components: The condition of the components of a product can vary widely. Parts may be damaged, worn out, or contaminated, complicating the disassembly process and requiring adaptable strategies to effectively handle them.
1. RL-based control strategy: The design and implementation of an RL-based control strategy tailored to the disassembly of flexible elements, emphasizing force minimization and adaptability.
2. Adaptive reward function: The introduction of an adaptive reward function that normalizes task complexity based on material properties, ensuring consistent performance across varying elasticities.
3. Algorithm comparison: A comparative analysis of state-of-the-art RL algorithms (SAC, DDPG, and PPO) to evaluate their effectiveness in dynamic disassembly environments. By benchmarking these algorithms, this work provides practical insights into their applicability for real-world disassembly tasks while also identifying key limitations, such as challenges in generalizing to unseen extraction directions.
4. Experimental validation: A comprehensive experimental evaluation in a simulated environment, demonstrating the ability to generalize across different disassembly scenarios and material characteristics.
2. Related Work
3. Problem Formulation
3.1. Markov Decision Process Formulation
- S is the state space, representing the robot’s observations of its environment.
- A is the action space, consisting of the robot’s possible movements.
- P is the transition probability function, describing the dynamics of the environment.
- R is the reward function, providing feedback to the robot based on its actions.
- γ is the discount factor, balancing immediate and future rewards.
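In standard notation (not reproduced verbatim from the paper’s equations), the resulting control problem and learning objective can be written as:

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, R(s_t, a_t)\right],
\qquad \gamma \in [0, 1).
```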
3.1.1. State Space (S)
- The position of the end effector relative to the grasping point, $p_{ee} - p_{g}$.
- The Cartesian force exerted by the end effector, computed as the Euclidean norm of the force components: $\lVert F \rVert = \sqrt{F_x^2 + F_y^2 + F_z^2}$.
- The distance, d, between the end-effector position $p_{ee}$ and the grasping point $p_{g}$: $d = \lVert p_{ee} - p_{g} \rVert$.
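A minimal sketch of how such an observation vector could be assembled is given below, assuming NumPy arrays for the end-effector position, grasping point, and measured Cartesian force; the function and variable names are illustrative and not taken from the authors’ implementation.

```python
import numpy as np

def build_observation(p_ee: np.ndarray, p_grasp: np.ndarray, f_ee: np.ndarray) -> np.ndarray:
    """Assemble the state vector: relative position, force norm, and distance.

    p_ee    : (3,) end-effector position
    p_grasp : (3,) grasping-point position in the same frame
    f_ee    : (3,) Cartesian force measured at the end effector
    """
    rel_pos = p_ee - p_grasp               # position relative to the grasping point
    force_norm = np.linalg.norm(f_ee)      # Euclidean norm of the force components
    distance = np.linalg.norm(rel_pos)     # distance d to the grasping point
    return np.concatenate([rel_pos, [force_norm, distance]])

# Example: 2 cm lateral offset from the grasping point with a small contact force
obs = build_observation(np.array([0.40, 0.02, 0.30]),
                        np.array([0.40, 0.00, 0.30]),
                        np.array([1.5, 0.0, -0.5]))
```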
3.1.2. Action Space (A)
3.2. Reward Function Design
- d represents the progress made in the disassembly task, measured using the distance between the grasping point and the current position of the end effector.
- $\lVert F \rVert$ denotes the physical interaction forces exerted by the robot, which should be minimized to prevent damage to the flexible elements and ensure safe handling.
- Two fixed weighting coefficients govern the trade-off between task progress and force minimization; they determine the relative importance of each objective in the reward function, ensuring a balanced optimization strategy.
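One plausible form of Equation (3) consistent with the description above is sketched below; the weighting symbols $w_d$ and $w_F$ are placeholders for the paper’s fixed coefficients rather than its exact notation:

```latex
R_t = w_d \, d_t \;-\; w_F \, \lVert F_t \rVert ,
```

with $d_t$ the extraction progress at step $t$ and $\lVert F_t \rVert$ the force norm defined in Section 3.1.1.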
Adaptive Reward Function
- $R$ is the reward computed using Equation (3).
- $R_{\min}$ and $R_{\max}$ are the minimum and maximum expected reward values for the episode, estimated based on the elastic constant of the flexible element.
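Under these definitions, the adaptive reward amounts to a min–max normalization of the base reward; the exact scaling in the paper may differ, but a sketch of the normalization idea is:

```latex
R_{\text{adapt}} = \frac{R - R_{\min}(k)}{R_{\max}(k) - R_{\min}(k)},
```

where the bounds $R_{\min}(k)$ and $R_{\max}(k)$ are estimated from the elastic constant $k$, so that episodes with stiff and compliant elements yield rewards on a comparable scale; rewards falling outside the estimated bounds map outside $[0, 1]$.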
4. Methodology
4.1. Experimental Setup
4.1.1. Simulated Environment
- Kinematics and dynamics: the simulation includes the kinematic and dynamic models of the KUKA LBR iiwa14 robot, ensuring realistic interaction with the flexible elements.
- Use case workspace: the workspace mimics the real-world setup, including the constraints and preferred extraction direction for the flexible element.
- Interaction forces: The forces exerted during extraction are simulated using two main components: the reaction force of the gripper and the elastic force of the flexible element. These forces are modeled to replicate the physical interactions between the robot and the flexible element during disassembly. However, it is important to note that the main sim-to-real gaps are expected in this aspect, as real-world conditions may introduce additional complexities, such as unmodeled friction, material imperfections, or dynamic perturbations, which are not fully captured in the simulation. A simplified sketch of such a force model is given below.
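The sketch below assumes a linear-elastic (Hooke’s-law) element with constant k; the anchor-point formulation and function names are illustrative assumptions, not the simulator’s actual force model.

```python
import numpy as np

def interaction_force(p_ee: np.ndarray, p_anchor: np.ndarray,
                      k: float, f_gripper: np.ndarray) -> np.ndarray:
    """Approximate force felt at the end effector while extracting the element.

    p_ee      : (3,) current end-effector position
    p_anchor  : (3,) point where the flexible element remains attached
    k         : elastic constant of the element [N/m] (Hooke's-law assumption)
    f_gripper : (3,) reaction force of the gripper on the element
    """
    stretch = p_ee - p_anchor        # elongation of the flexible element
    f_elastic = -k * stretch         # restoring force pulling back toward the anchor
    return f_elastic + f_gripper     # combined interaction force acting on the robot

# Example: element with k = 200 N/m stretched 5 cm along x, no gripper reaction
f_total = interaction_force(np.array([0.05, 0.0, 0.0]), np.zeros(3),
                            k=200.0, f_gripper=np.zeros(3))
```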
4.1.2. Hybrid Planning Architecture
- General objective: detach the entire flexible element by sequentially grasping it at various positions until complete extraction is achieved.
- Specific objective: preserve the physical integrity of the element by minimizing the applied force during extraction at each grasping point, thereby identifying low-force extraction trajectories for each operation.
1. Global planning: At the start of the task, the global planner generates a reference trajectory that includes all grasping points. The robot then moves to the first grasping point.
2. Local planning and execution: Upon reaching the grasping point, the local planner (using RL-based control) takes over to handle the interaction with the flexible element. The local planner adjusts the robot’s actions in response to real-time feedback, ensuring efficient and low-force extraction.
3. Switching back to global planning: Once the element at the current grasping point is successfully extracted, the planner manager switches control back to the global planner, which moves the robot to the next grasping point.
4. Repeating the process: This process repeats until all grasping points are addressed and the disassembly task is completed.
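The switching logic can be summarized with the schematic sketch below; the planner interfaces (generate_grasping_points, move_to, get_observation, predict, apply) are hypothetical stand-ins for the hybrid planning architecture rather than its real API.

```python
def run_disassembly(global_planner, local_planner, robot, element):
    """Alternate between global trajectory planning and RL-based local control."""
    # Global planning: reference trajectory covering all grasping points.
    grasping_points = global_planner.generate_grasping_points(element)

    for point in grasping_points:
        robot.move_to(point)                 # follow the global reference trajectory

        # Local planning: the RL policy handles the contact-rich extraction,
        # reacting to force feedback until the element is released at this point.
        done = False
        while not done:
            obs = robot.get_observation()
            action = local_planner.predict(obs)
            done = robot.apply(action)

        # Control then returns to the global planner for the next grasping point.
    # The loop ends once every grasping point has been addressed.
```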
4.2. Experiments
4.2.1. Training and Testing Procedure
- Structured scenario (S): This scenario represents an ideal case where comprehensive information about the environment is available beforehand (elastic properties and the expected direction of extraction). The agent is both trained and tested under these well-defined conditions, allowing for fast learning and high task performance due to the consistency of the environment.
- Operational scenario (O): This setup reflects real-world disassembly conditions, where exact environmental characteristics are unknown but operational limits can be estimated. Since this scenario closely mirrors practical applications, it serves as the primary benchmark for evaluating system performance and deriving key conclusions.
- Unstructured scenario (U): In this configuration, the agent encounters environments significantly different from those used during training. This scenario is designed to test the adaptive capabilities of the RL-based controller, assessing its ability to generalize and perform in completely unfamiliar conditions.
1. Environment configuration: Training and testing are conducted under different environmental characteristics to analyze the behavior and performance of the RL algorithms. This approach is designed to evaluate the learning capacities of RL agents by exposing them to a range of conditions, spanning from structured scenarios (S) to operational scenarios (O) and, finally, to unexplored configurations (U). This progression allows us to assess adaptability and robustness across increasingly complex and uncertain environments.
- Structured configuration (S): elastic modulus (200 [N/m]); direction of extraction (0°).
- Operational range (O): elastic modulus ([200, 700] [N/m]); direction of extraction ([−30°, 30°]).
- Unexplored configuration (U): elastic modulus (1000 [N/m]); direction of extraction (60°).
2. Training roll-out setting: For the training of each scenario, a set of 50 roll-outs of 300,000 steps each is performed (a configuration-sampling sketch is given after this list).
3. Episode initialization: At the start of each episode, the end effector of the robot is positioned near the first grasping point of the flexible element. The elastic modulus (k) of the element and the preferred direction of extraction are selected according to the roll-out specification.
4. Episode execution: The robot, guided by the RL-based local planner, attempts to extract the flexible element, adjusting its actions based on the interaction forces and the reward function.
5. Policy update: The RL algorithm updates its policy based on the cumulative rewards received during each episode, gradually improving its performance over time.
6. Agent evaluation: The strategy learned by the agent in each roll-out is tested in different environment configurations (S, O, and U), where the metrics are computed over a batch of 100 individual tests.
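A small sketch of how the roll-out configurations could be sampled from the ranges listed in item 1 is given below; the dictionary layout and seeding are illustrative choices, not the authors’ code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_configuration(scenario: str) -> dict:
    """Return an environment configuration (elastic constant, extraction direction)."""
    if scenario == "S":      # structured: fixed, fully known conditions
        return {"k": 200.0, "direction_deg": 0.0}
    if scenario == "O":      # operational: drawn within the estimated operating limits
        return {"k": float(rng.uniform(200.0, 700.0)),
                "direction_deg": float(rng.uniform(-30.0, 30.0))}
    if scenario == "U":      # unexplored: outside the training distribution
        return {"k": 1000.0, "direction_deg": 60.0}
    raise ValueError(f"unknown scenario: {scenario}")

# Example: one operational-range configuration for a training roll-out
cfg = sample_configuration("O")
```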
4.2.2. Algorithms Used
4.2.3. Evaluation Metrics
- Success rate: The percentage of episodes in which the robot successfully extracted the flexible element without causing damage. This was measured using two thresholds: a position range to ensure proper trajectory execution and a direction threshold to ensure correct extraction direction.
- Force exertion: The average forces exerted by the robot during the extraction process were measured to evaluate the ability to minimize interaction forces and avoid damaging the flexible element. For this analysis, two baseline trajectories were used for comparison: an ideal trajectory, which follows the theoretical extraction direction and represents the optimal path for minimizing forces, and a deviated trajectory, which diverges by 45° from the ideal path, simulating a suboptimal or misaligned extraction scenario. By comparing the system’s performance against these two baselines, we expect the results to fall between them, ideally closer to the ideal trajectory. This comparison provides a clear benchmark for assessing the effectiveness of maintaining low force levels and ensuring the safe handling of the flexible elements.
- Adaptability: The adaptability of the RL agent to different elastic constants and environmental conditions is qualitatively assessed using two key metrics: the success rate and the mean reward value. The success rate indicates whether the disassembly task was completed, while the mean reward value provides insight into how well the task was performed. A high mean reward value (closer to 1) suggests that the agent not only completed the task but also minimized excessive forces during the process. This dual evaluation is crucial because, in some cases, the task may be completed successfully even though high forces were exerted on the element during the process, potentially causing damage. By considering both the success rate and the mean reward value, we ensure that the agent not only achieves the goal but also performs the task efficiently and safely.
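For completeness, a minimal sketch of how these metrics could be aggregated over a batch of 100 test episodes follows; the tolerance values are illustrative placeholders, not the thresholds used in the paper.

```python
import numpy as np

def evaluate_batch(final_positions, target_positions, directions_deg, target_dir_deg,
                   episode_rewards, pos_tol=0.01, dir_tol_deg=5.0):
    """Compute the success rate and mean (normalized) reward over a batch of tests."""
    final_positions = np.asarray(final_positions)    # (N, 3) end-of-episode positions
    target_positions = np.asarray(target_positions)  # (N, 3) expected extraction positions
    directions_deg = np.asarray(directions_deg)      # (N,) realized extraction directions
    episode_rewards = np.asarray(episode_rewards)    # (N,) mean adaptive reward per episode

    pos_ok = np.linalg.norm(final_positions - target_positions, axis=1) <= pos_tol
    dir_ok = np.abs(directions_deg - target_dir_deg) <= dir_tol_deg
    success_rate = float(np.mean(pos_ok & dir_ok))   # both thresholds must be satisfied
    mean_reward = float(np.mean(episode_rewards))    # closer to 1 means low-force execution
    return success_rate, mean_reward
```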
5. Results and Discussion
5.1. Training Results
- PPO exhibited one of the fastest convergences and the lowest computational time, reaching a high cumulative reward within fewer training episodes than SAC. Its more stable learning process compared with SAC and DDPG makes it well suited for environments where rapid learning is essential.
- DDPG also converged quickly but exhibited more variance during training, which suggests that it struggled more with the complex, dynamic nature of the disassembly task.
- SAC showed a slower but stable convergence, demonstrating its robustness in environments with continuous action spaces and dynamic conditions. However, SAC also showed a high computational time and the lowest cumulative reward compared to PPO and DDPG.
5.2. Evaluating the Learned Strategies
5.2.1. Performance of the Agent in the Disassembly Task
- Metrics performance: From the results in Table 2, it is evident that the agents trained and tested under structured configurations (S) achieved the highest performance, as indicated by the mean reward metrics. The agents consistently demonstrated a success rate of 1.0, indicating flawless task execution. Moreover, as shown in Figure 7, the agents exhibited the lowest variance in comparison to algorithms trained and evaluated in other task configurations, reflecting both stability and reliability in structured environments.
- Force exertion: Figure 8 compares the force signatures of the three algorithms during the extraction task. All three algorithms closely followed the ideal (lowest-force) extraction path, highlighting the effectiveness of the RL-based control in minimizing applied forces. The agents consistently reduced force exertion by at least 20% compared to a suboptimal 45° trajectory, demonstrating their capability to dynamically optimize disassembly actions while ensuring minimal physical stress on the flexible element.
- Trajectory efficiency: In successful cases, all three agents exhibited efficient, direct trajectories during the disassembly task. As shown in the force signature analysis in Figure 8, the PPO agent consistently performed smooth, controlled movements with minimal variation, further indicating its ability to maintain a stable and optimized trajectory throughout the extraction process.
5.2.2. Adaptability and Generalization
5.3. Discussion
- As expected, the agent performs optimally when trained and tested in structured conditions, achieving the highest success rates.
- In operational conditions, the agent also demonstrates strong performance, achieving a perfect success rate. This is a crucial finding, as it validates the proposed approach and supports its potential transfer to real-world experiments.
- A significant observation is that the only cases where the agent fails to complete the task involve unknown extraction directions. However, when faced with unknown elastic properties, the agent successfully adapts, demonstrating its ability to generalize across different material conditions.
- The adaptive reward function played a crucial role in this success, particularly in handling varying elastic properties of flexible elements. The results indicate that dynamically normalizing the force component of the reward function based on material elasticity allows the RL agent to generalize effectively across diverse scenarios. By scaling the reward function according to the elasticity (k) of the element, the system ensures consistent and meaningful feedback, regardless of material properties.
- These results were achieved despite using simplified force models and state representations, aligning with the objective of validating the use of simplifications without compromising task success. Furthermore, since the agent successfully overcame these simplifications through the adaptive reward function, this finding suggests that the system may also be capable of bridging the sim-to-real gap and handling real-world uncertainties such as noise and unmodeled environmental factors.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
RL | Reinforcement Learning
SAC | Soft Actor–Critic
DDPG | Deep Deterministic Policy Gradient
PPO | Proximal Policy Optimization
AI | Artificial Intelligence
LfD | Learning from Demonstration
IRL | Inverse Reinforcement Learning
MDP | Markov Decision Process
ROS2 | Robot Operating System 2
k | Modulus of Elasticity
RL Algorithm | Learning Rate | Batch Size | Tau | Gamma | GAE Lambda
---|---|---|---|---|---
SAC | 0.003 | 256 | 0.005 | 0.99 | -
DDPG | 0.001 | 256 | 0.005 | 0.99 | -
PPO | 0.003 | 64 | 0.005 | 0.99 | 0.95
Evaluation of Learned Strategies Under Different Environment Configurations.

Algorithm | Training Force (k) | Training Direction | Test Force (k) | Test Direction | Mean Reward | Success Rate
---|---|---|---|---|---|---
SAC | S | S | S | S | 0.85 | 1.00 |
SAC | O | O | S | S | 0.60 | 1.00 |
SAC | O | O | O | O | 0.61 | 1.00 |
SAC | O | O | U | O | 0.48 | 0.00 |
SAC | O | O | O | U | −0.08 | 0.00 |
SAC | O | O | U | U | −0.05 | 0.00 |
DDPG | S | S | S | S | 0.75 | 1.00 |
DDPG | O | O | S | S | 0.44 | 1.00 |
DDPG | O | O | O | O | 0.44 | 1.00 |
DDPG | O | O | U | O | 0.48 | 0.57 |
DDPG | O | O | O | U | −0.25 | 0.00 |
DDPG | O | O | U | U | −0.02 | 0.00 |
PPO | S | S | S | S | 0.80 | 1.00 |
PPO | O | O | S | S | 0.62 | 1.00 |
PPO | O | O | O | O | 0.62 | 1.00 |
PPO | O | O | U | O | 0.62 | 1.00 |
PPO | O | O | O | U | −0.46 | 0.00 |
PPO | O | O | U | U | −0.47 | 0.00 |