2.1. The Continuum Robot Environment
The environment is designed to simulate the kinematics of a continuum robot for control purposes. It is composed of the state space, the action space, and the reward function. The state space includes the robot's current position and the target position. The action space describes how the robot moves, such as changes in position or velocity. Feedback from the reward function is used to evaluate the effectiveness of a control policy.
A basic schematic diagram of the DDPG algorithm is depicted in Figure 2. As Figure 2 makes evident, a simulation-based environment must be created to control the robot. This section describes in depth the environment created for the continuum robot simulation.
To provide a clearer understanding of the planar continuum robot used in our study, we present the kinematics parameters in Table 1. These parameters include the length and curvature of each section of the robot.
The kinematics of a continuum robot have been extensively researched in the literature [16,17]. As previously mentioned, the continuum robot's environment is described by its kinematics and input signals. To this end, a framework based on forward and velocity kinematics has been developed. The model is described as follows:

$$\mathbf{x} = f(\boldsymbol{\kappa}) \quad (1)$$

$$\dot{\mathbf{x}} = \frac{\partial f(\boldsymbol{\kappa})}{\partial \boldsymbol{\kappa}}\,\dot{\boldsymbol{\kappa}} \quad (2)$$

where $\mathbf{x}$ is the vector that represents the robot's position in task space, and the dot indicates differentiation with respect to time. It is important to mention that this applies specifically to our planar robot with three sections, for which $\mathbf{x} = [x, y]^{T}$ and $\boldsymbol{\kappa} = [\kappa_{1}, \kappa_{2}, \kappa_{3}]^{T}$. The matrix $J(\boldsymbol{\kappa}) = \partial f(\boldsymbol{\kappa}) / \partial \boldsymbol{\kappa}$ is known as the Jacobian matrix and is a function of the curvature variable $\boldsymbol{\kappa}$. From Equations (1) and (2), we obtain the final equation, Equation (3):

$$\dot{\mathbf{x}} = J(\boldsymbol{\kappa})\,\dot{\boldsymbol{\kappa}} \quad (3)$$
In Equation (3), the Jacobian expresses, in a dynamic manner, the connection between two separate representations of the system; in our scenario, this connection arises from changes in both position and curvature. The aim of velocity kinematics is to describe the movement of the endpoint. Since the curvatures strongly constrain this movement, describing the state of motion at given curvatures is equivalent to describing the movement of the endpoint.
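To illustrate how the Jacobian of Equation (3) can be obtained numerically, the sketch below approximates it with central finite differences. The function `forward_kinematics` is a hypothetical stand-in for the planar three-section constant-curvature model of [16]; the section lengths are illustrative assumptions, not values from Table 1.

```python
import numpy as np

def forward_kinematics(kappa, lengths=(0.1, 0.1, 0.1)):
    """Hypothetical planar piecewise-constant-curvature forward kinematics.

    Chains three circular-arc sections of fixed (assumed) length and
    returns the tip position [x, y] in task space.
    """
    p = np.zeros(2)   # current tip position
    phi = 0.0         # current heading angle
    for k, L in zip(kappa, lengths):
        if abs(k) < 1e-9:                       # straight-section limit
            local = np.array([L, 0.0])
        else:                                   # circular arc of curvature k
            local = np.array([np.sin(k * L) / k,
                              (1.0 - np.cos(k * L)) / k])
        rot = np.array([[np.cos(phi), -np.sin(phi)],
                        [np.sin(phi),  np.cos(phi)]])
        p = p + rot @ local                     # accumulate section endpoint
        phi = phi + k * L                       # accumulate bending angle
    return p

def jacobian(kappa, eps=1e-6):
    """Numerical 2x3 Jacobian of Equation (3) via central differences."""
    J = np.zeros((2, 3))
    for i in range(3):
        dk = np.zeros(3)
        dk[i] = eps
        J[:, i] = (forward_kinematics(kappa + dk)
                   - forward_kinematics(kappa - dk)) / (2.0 * eps)
    return J
```

For example, `jacobian(np.array([1.0, -2.0, 0.5]))` returns the 2 × 3 matrix mapping curvature rates to the tip velocity at that configuration.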
According to Equation (3), the actions computed by the DDPG algorithm, which affect the environment, are the time derivatives of the curvatures of each segment. The Jacobian matrix is computed with planar forward kinematics [16] and numerical differentiation. After each action, the resulting motion is transformed into the final position using Equation (4):

$$\mathbf{x}_{t+1} = \mathbf{x}_{t} + \dot{\mathbf{x}}_{t}\,\Delta t \quad (4)$$
where $\Delta t$ is the time step, chosen as 0.05 s. The objective of this work is to navigate the robot from its initial state $\mathbf{x}_{0}$ to the goal state $\mathbf{x}_{g} = [-0.2, 0.15]$. The goal state was kept fixed in order to reduce the required learning time, which allows more efficient testing of various environment designs. The initial position $\mathbf{x}_{0}$ of the continuum robot is selected randomly within the task space, as illustrated in Figure 3, which also displays the goal state and the potential initial positions.
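To make the update of Equation (4) concrete, the sketch below wraps it in a minimal step routine, reusing the `forward_kinematics` and `jacobian` helpers from the previous sketch. The curvature range used for random initialization is an illustrative assumption; the paper only states that initial positions are sampled within the task space (Figure 3).

```python
import numpy as np

DT = 0.05                        # time step from Equation (4)
GOAL = np.array([-0.2, 0.15])    # fixed goal state

class ContinuumEnv:
    """Minimal kinematic environment built on Equations (3) and (4)."""

    def __init__(self):
        self.kappa = np.zeros(3)

    def reset(self, rng=None):
        # Illustrative assumption: random curvatures yield a random
        # initial tip position inside the task space (cf. Figure 3).
        rng = rng or np.random.default_rng()
        self.kappa = rng.uniform(-4.0, 4.0, size=3)
        return forward_kinematics(self.kappa)

    def step(self, kappa_dot):
        x = forward_kinematics(self.kappa)
        x_dot = jacobian(self.kappa) @ kappa_dot   # Equation (3)
        x_next = x + x_dot * DT                    # Equation (4)
        # Integrate the curvatures the same way so the Jacobian can be
        # evaluated at the new configuration on the next step.
        self.kappa = self.kappa + kappa_dot * DT
        return x_next
```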
In our study, we primarily focus on the kinematic aspects of the continuum robot, as represented by Equations (3) and (4). We acknowledge that these equations do not capture the mechanical and physical features of the robot. However, our aim was to demonstrate that effective control can be achieved using RL even with a simplified model; the complexity and computational demand of a more detailed model could slow down the learning process.
The most important aspect of the RL approach is defining the environment. In our work, we derive the environment directly from the continuum robot model described in Equations (3) and (4). The resulting states and actions of the environment are shown in Table 2.
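Table 2 itself is not reproduced here; as an illustration consistent with the description above, the state can be encoded as the current tip position together with the goal position, and the action as the vector of curvature rates. The exact layout of Table 2 may differ.

```python
import numpy as np

# Assumed encoding, consistent with Section 2.1 (illustrative only):
#   state  : tip position [x, y] concatenated with the goal [x_g, y_g]
#   action : curvature rates [kappa_dot_1, kappa_dot_2, kappa_dot_3]
def make_state(x, goal):
    return np.concatenate([x, goal])     # shape (4,)

STATE_DIM, ACTION_DIM = 4, 3
```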
In order to finalize the environment, it is necessary to design the reward. To simplify the description of the reward function, the state space is designed not to exceed the robot's task space. Additionally, we define the Euclidean distance $d$ to the goal state $\mathbf{x}_{g}$ as shown in Equation (5):

$$d = \lVert \mathbf{x} - \mathbf{x}_{g} \rVert_{2} \quad (5)$$
The definition of a reward can vary depending on its intended impact on the learning process. Here, we propose a set of rewards that address the continuum robot control problem, with the goal of guiding the robot from its initial state to its goal state. The rewards take the robot's position into account, with the first reward given below:

$$r_{1} = -d^{2} \quad (6)$$
The first reward function proposed for the continuum robot control problem is defined in Equation (6). It takes into account the robot's current position in relation to the goal state and is calculated as the negative of the squared Euclidean distance of Equation (5) between the current state and the goal state. Its purpose is to motivate the robot to move from its starting position to the goal state: the reward is negative to discourage deviation from the goal state, and the penalty grows as the Euclidean distance between the robot and the goal state increases. This reward function serves as the foundation of the RL algorithm used for controlling the continuum robot and plays a pivotal role in shaping the robot's behavior during learning.
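In code, Equations (5) and (6) are a direct transcription, with `x` the current tip position and `goal` the goal state:

```python
import numpy as np

def distance(x, goal):
    """Euclidean distance of Equation (5)."""
    return np.linalg.norm(np.asarray(x) - np.asarray(goal))

def reward_1(x, goal):
    """Equation (6): negative squared distance to the goal."""
    return -distance(x, goal) ** 2
```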
The second reward function is defined as

$$r_{2} = \begin{cases} 1, & d_{t} < d_{t-1} \\ -0.5, & d_{t} = d_{t-1} \\ -1, & d_{t} > d_{t-1} \end{cases} \quad (7)$$

It considers the change in the Euclidean distance between the current state and the goal state, represented by $\Delta d = d_{t} - d_{t-1}$. The reward is assigned based on whether the distance to the goal is decreasing, constant, or increasing: a reward of 1 is assigned if the distance decreases, −0.5 if it remains constant, and −1 if it increases. This reward function aims to incentivize the robot to move towards the goal state and to penalize it for moving away from, or remaining at the same distance from, the goal state.
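The second reward depends only on how the distance changes between consecutive steps; a direct transcription of Equation (7):

```python
def reward_2(d, d_prev):
    """Equation (7): reward the sign of progress toward the goal."""
    if d < d_prev:
        return 1.0    # distance decreased: moving toward the goal
    if d == d_prev:
        return -0.5   # distance unchanged
    return -1.0       # distance increased: moving away
```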
The third reward is calculated as a weighted portion of the distance:

$$r_{3} = -0.7\,d \quad (8)$$

The third reward function weights the Euclidean distance of Equation (5) between the robot's current position and the goal position by a factor of 0.7. The aim of this reward function is to assign a numerical value to the distance between the robot and its goal, with the weight factor adjusting how strongly this distance influences the learning process. As the robot moves further from its goal the reward decreases, and as it approaches the goal the reward increases, promoting movement towards the goal state.
Regarding the weight of 0.7 in Equation (8), this value was chosen based on empirical observations during our initial experiments. We found that it provided a good balance between encouraging the robot to reach the goal and avoiding excessive movements. However, we acknowledge that this value may not be optimal in all situations and could depend on the specific characteristics of the robot and the task.
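Reading Equation (8) as a negatively weighted distance, so that the reward increases as the robot approaches the goal as described above, the implementation is a one-liner:

```python
def reward_3(d, weight=0.7):
    """Equation (8): distance weighted by 0.7, negated so the reward
    grows as the robot approaches the goal."""
    return -weight * d
```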
The final reward function, described in Equation (9), aims to incentivize the robot to reach the goal state in the shortest possible time. It takes into account the Euclidean distance $d$ between the robot's current position and the goal position and is divided into multiple intervals based on the value of $d$. The highest reward, 200, is given if the distance is at most 0.025; a reward of 150 is given if the distance is between 0.025 and 0.05; and a reward of 100 is given if the distance is between 0.05 and 0.1. If the distance is greater than 0.1, the reward is calculated from the distance-based formula of Equation (9). If the Euclidean distance is unchanged from the previous step, the reward is −100, discouraging the robot from staying in the same place. To recap, we have established four distinct rewards that consider both the target position and the current position.
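A sketch of Equation (9) with the thresholds quoted above. The exact expression used when d > 0.1 is not reproduced in this text, so `far_formula` is a hypothetical placeholder for it; checking the no-progress case before the distance bands is also an assumption about the branch ordering.

```python
def reward_4(d, d_prev, far_formula=lambda d: -d):
    """Equation (9), sketched. `far_formula` is a hypothetical
    placeholder for the paper's expression used when d > 0.1."""
    if d == d_prev:          # no movement since the last step
        return -100.0
    if d <= 0.025:
        return 200.0
    if d <= 0.05:
        return 150.0
    if d <= 0.1:
        return 100.0
    return far_formula(d)    # distance-based formula of Equation (9)
```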
Table 3 presents the attributes of each reward function. The reward functions of Equations (6)–(9) are relatively straightforward and are based on a standard measure, the Euclidean distance of Equation (5), which is a common choice for this type of problem. Our goal was to show that effective control can be achieved using RL even with such simple reward functions.
2.2. The Learning Algorithm
The continuous action and state spaces of the continuum robot's environment are a crucial factor in determining the ideal RL algorithm. We employ the DDPG algorithm, which permits continuous action and state spaces, to address this issue. The algorithm utilizes neural network approximations and is based on the actor–critic method: the actor is responsible for determining the best course of action in a given situation, while the critic evaluates the efficacy of that choice. A replay buffer stores past experiences, and a target network helps to stabilize the learning process, so that the control policy improves over time.
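One concrete stabilization detail: the target networks in DDPG are typically synchronized with the trained networks by Polyak averaging. A minimal PyTorch-style sketch, where the smoothing factor `tau` is an illustrative value rather than one taken from the paper:

```python
def soft_update(target_net, source_net, tau=0.005):
    """Blend source parameters into the target network (Polyak averaging),
    which stabilizes the bootstrapped critic targets in DDPG.
    tau is illustrative; its value is not quoted in this text."""
    for t_param, s_param in zip(target_net.parameters(),
                                source_net.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)
```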
The DDPG algorithm collects information from its environment to determine the course of action that maximizes future rewards. The target network is used to stabilize the learning process, while the replay buffer is used to continuously update the actor and critic networks. The actor network computes the optimal action for a given state, forming the basis of the final control strategy. DDPG trains the deterministic policy in an off-policy way. However, because the policy is deterministic, the agent may not initially explore a wide enough variety of actions to collect useful learning signals. To address this problem, the authors of [2] improved the exploration capacity of DDPG policies by adding noise to the actions during training. In this study, we use the Ornstein–Uhlenbeck process to produce this noise; the process generates noise that is correlated with the preceding noise, so that successive perturbations do not cancel out the overall dynamics. The parameters of the Ornstein–Uhlenbeck process, which generates the temporally correlated exploration, are listed in Table 4.
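A minimal sketch of the Ornstein–Uhlenbeck noise generator; the default values of `mu`, `theta`, and `sigma` below are common illustrative choices, not the values from Table 4.

```python
import numpy as np

class OUNoise:
    """Ornstein–Uhlenbeck process: temporally correlated exploration noise.

    Euler–Maruyama discretization:
        n_{t+1} = n_t + theta * (mu - n_t) * dt + sigma * sqrt(dt) * N(0, 1)
    Defaults are illustrative; see Table 4 for the values actually used.
    """

    def __init__(self, size=3, mu=0.0, theta=0.15, sigma=0.2, dt=0.05):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(size, mu)

    def reset(self):
        self.state = np.full_like(self.state, self.mu)

    def sample(self):
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state
```

During training, a sample from this process is added to the actor's deterministic action before the action is applied to the environment.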
The success of the DDPG algorithm is strongly influenced by the state space, action space, and reward function of the environment. Consequently, it is essential to meticulously design and evaluate multiple environment configurations in order to determine the most suitable control strategy for the continuum robot.