2.1. The Continuum Robot Environment
The environment is designed to simulate the kinematics of a continuum robot for control purposes. It is composed of the state space, the action space, and the reward function. The state space includes the robot's current position and the target position. The action space describes how the robot moves, such as changes in position or velocity. Feedback from the reward function is used to evaluate the effectiveness of a control policy.
A basic schematic diagram of the DDPG algorithm is depicted in Figure 2. As Figure 2 makes evident, a simulation-based environment must be created to control the robot. This section describes in depth the environment created for the continuum robot simulation.
To provide a clearer understanding of the planar continuum robot used in our study, we present the kinematics parameters in Table 1. These parameters include the length and curvature of each section of the robot.
The kinematics of a continuum robot have been extensively researched in the literature [16,17]. As previously mentioned, the continuum robot's environment is described by its kinematics and input signals. To this end, a framework based on forward and velocity kinematics has been developed. The model is described as follows:

$$\mathbf{x} = f(\boldsymbol{\kappa}) \quad (1)$$

$$\dot{\mathbf{x}} = \frac{\partial f(\boldsymbol{\kappa})}{\partial \boldsymbol{\kappa}}\,\dot{\boldsymbol{\kappa}} \quad (2)$$

where $\mathbf{x}$ is the vector that represents the robot's position in task space, and the dot indicates differentiation with respect to time. It is important to mention that this applies specifically to our planar robot with three sections, for which $\mathbf{x} = [x, y]^{T}$ and $\boldsymbol{\kappa} = [\kappa_{1}, \kappa_{2}, \kappa_{3}]^{T}$. The matrix $J(\boldsymbol{\kappa}) = \partial f(\boldsymbol{\kappa}) / \partial \boldsymbol{\kappa}$ is known as the Jacobian matrix and is a function of the curvature variable $\boldsymbol{\kappa}$. From Equations (1) and (2), we obtain the final equation, Equation (3):

$$\dot{\mathbf{x}} = J(\boldsymbol{\kappa})\,\dot{\boldsymbol{\kappa}} \quad (3)$$
In Equation (3), the Jacobian expresses, in a dynamic manner, the connection between two separate representations of the system; in our scenario, this connection arises from changes in both position and curvature. The aim of velocity kinematics is to describe the movement of the endpoint. Since the curvatures strongly constrain this movement, describing the state of motion at given curvatures is equivalent to describing the movement of the endpoint.
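To illustrate how the Jacobian of Equation (3) can be obtained numerically, the sketch below approximates it with central finite differences. The function `forward_kinematics` is a hypothetical stand-in for the planar three-section constant-curvature model of [16]; the section lengths are illustrative assumptions, not values from Table 1.

```python
import numpy as np

def forward_kinematics(kappa, lengths=(0.1, 0.1, 0.1)):
    """Hypothetical planar piecewise-constant-curvature forward kinematics.

    Chains three circular-arc sections of fixed (assumed) length and
    returns the tip position [x, y] in task space.
    """
    p = np.zeros(2)   # current tip position
    phi = 0.0         # current heading angle
    for k, L in zip(kappa, lengths):
        if abs(k) < 1e-9:                       # straight-section limit
            local = np.array([L, 0.0])
        else:                                   # circular arc of curvature k
            local = np.array([np.sin(k * L) / k,
                              (1.0 - np.cos(k * L)) / k])
        rot = np.array([[np.cos(phi), -np.sin(phi)],
                        [np.sin(phi),  np.cos(phi)]])
        p = p + rot @ local                     # accumulate section endpoint
        phi = phi + k * L                       # accumulate bending angle
    return p

def jacobian(kappa, eps=1e-6):
    """Numerical 2x3 Jacobian of Equation (3) via central differences."""
    J = np.zeros((2, 3))
    for i in range(3):
        dk = np.zeros(3)
        dk[i] = eps
        J[:, i] = (forward_kinematics(kappa + dk)
                   - forward_kinematics(kappa - dk)) / (2.0 * eps)
    return J
```

For example, `jacobian(np.array([1.0, -2.0, 0.5]))` returns the 2 × 3 matrix mapping curvature rates to the tip velocity at that configuration.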
According to Equation (3), the actions computed by the DDPG algorithm, which affect the environment, are the time derivatives of the curvatures of each segment. The Jacobian matrix is computed with planar forward kinematics [16] and numerical differentiation. After each action, the resulting motion is transformed into the final position using Equation (4):

$$\mathbf{x}_{t+1} = \mathbf{x}_{t} + \dot{\mathbf{x}}_{t}\,\Delta t \quad (4)$$
where $\Delta t$ is the time step, chosen as 0.05 s. The objective of this work is to navigate the robot from its initial state $\mathbf{x}_{0}$ to the goal state $\mathbf{x}_{g} = [-0.2, 0.15]$. The goal state was kept fixed in order to reduce the required learning time, which allows more efficient testing of various environment designs. The initial position $\mathbf{x}_{0}$ of the continuum robot is selected randomly within the task space, as illustrated in Figure 3, which also displays the goal state and the potential initial positions.
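To make the update of Equation (4) concrete, the sketch below wraps it in a minimal step routine, reusing the `forward_kinematics` and `jacobian` helpers from the previous sketch. The curvature range used for random initialization is an illustrative assumption; the paper only states that initial positions are sampled within the task space (Figure 3).

```python
import numpy as np

DT = 0.05                        # time step from Equation (4)
GOAL = np.array([-0.2, 0.15])    # fixed goal state

class ContinuumEnv:
    """Minimal kinematic environment built on Equations (3) and (4)."""

    def __init__(self):
        self.kappa = np.zeros(3)

    def reset(self, rng=None):
        # Illustrative assumption: random curvatures yield a random
        # initial tip position inside the task space (cf. Figure 3).
        rng = rng or np.random.default_rng()
        self.kappa = rng.uniform(-4.0, 4.0, size=3)
        return forward_kinematics(self.kappa)

    def step(self, kappa_dot):
        x = forward_kinematics(self.kappa)
        x_dot = jacobian(self.kappa) @ kappa_dot   # Equation (3)
        x_next = x + x_dot * DT                    # Equation (4)
        # Integrate the curvatures the same way so the Jacobian can be
        # evaluated at the new configuration on the next step.
        self.kappa = self.kappa + kappa_dot * DT
        return x_next
```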
In our study, we primarily focus on the kinematic aspects of the continuum robot, as represented by Equations (3) and (4). We acknowledge that these equations do not capture the mechanical and physical features of the robot. However, our aim was to demonstrate that effective control can be achieved using RL even with a simplified model; the complexity and computational demand of a more detailed model could slow down the learning process.
The most important aspect of the RL approach is defining the environment. In our work, we derive the environment directly from the continuum robot model described in Equations (3) and (4). The resulting states and actions of the environment are shown in Table 2.
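Table 2 itself is not reproduced here; as an illustration consistent with the description above, the state can be encoded as the current tip position together with the goal position, and the action as the vector of curvature rates. The exact layout of Table 2 may differ.

```python
import numpy as np

# Assumed encoding, consistent with Section 2.1 (illustrative only):
#   state  : tip position [x, y] concatenated with the goal [x_g, y_g]
#   action : curvature rates [kappa_dot_1, kappa_dot_2, kappa_dot_3]
def make_state(x, goal):
    return np.concatenate([x, goal])     # shape (4,)

STATE_DIM, ACTION_DIM = 4, 3
```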
In order to finalize the environment, it is necessary to design the reward. To simplify the description of the reward function, the state space is designed not to exceed the robot's task space. Additionally, we define the Euclidean distance $d$ to the goal state $\mathbf{x}_{g}$ as shown in Equation (5):

$$d = \lVert \mathbf{x} - \mathbf{x}_{g} \rVert_{2} \quad (5)$$
The definition of a reward can vary depending on its intended impact on the learning process. Here, we propose a set of rewards that address the continuum robot control problem, with the goal of guiding the robot from its initial state to its goal state. The rewards take the robot's position into account, with the first reward given below:

$$r_{1} = -d^{2} \quad (6)$$
The first reward function proposed for the continuum robot control problem is defined in Equation (6). It takes into account the robot's current position in relation to the goal state and is calculated as the negative of the squared Euclidean distance of Equation (5) between the current state and the goal state. Its purpose is to motivate the robot to move from its starting position to the goal state: the reward is negative to discourage deviation from the goal state, and the penalty grows as the Euclidean distance between the robot and the goal state increases. This reward function serves as the foundation of the RL algorithm used for controlling the continuum robot and plays a pivotal role in shaping the robot's behavior during learning.
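In code, Equations (5) and (6) are a direct transcription, with `x` the current tip position and `goal` the goal state:

```python
import numpy as np

def distance(x, goal):
    """Euclidean distance of Equation (5)."""
    return np.linalg.norm(np.asarray(x) - np.asarray(goal))

def reward_1(x, goal):
    """Equation (6): negative squared distance to the goal."""
    return -distance(x, goal) ** 2
```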
The second reward function is defined as

$$r_{2} = \begin{cases} 1, & d_{t} < d_{t-1} \\ -0.5, & d_{t} = d_{t-1} \\ -1, & d_{t} > d_{t-1} \end{cases} \quad (7)$$

It considers the change in the Euclidean distance between the current state and the goal state, represented by $\Delta d = d_{t} - d_{t-1}$. The reward is assigned based on whether the distance to the goal is decreasing, constant, or increasing: a reward of 1 is assigned if the distance decreases, −0.5 if it remains constant, and −1 if it increases. This reward function aims to incentivize the robot to move towards the goal state and to penalize it for moving away from, or remaining at the same distance from, the goal state.
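The second reward depends only on how the distance changes between consecutive steps; a direct transcription of Equation (7):

```python
def reward_2(d, d_prev):
    """Equation (7): reward the sign of progress toward the goal."""
    if d < d_prev:
        return 1.0    # distance decreased: moving toward the goal
    if d == d_prev:
        return -0.5   # distance unchanged
    return -1.0       # distance increased: moving away
```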
The third reward is calculated as a weighted portion of the distance:

$$r_{3} = -0.7\,d \quad (8)$$

The third reward function weights the Euclidean distance of Equation (5) between the robot's current position and the goal position by a factor of 0.7. The aim of this reward function is to assign a numerical value to the distance between the robot and its goal, with the weight factor adjusting how strongly this distance influences the learning process. As the robot moves further from its goal the reward decreases, and as it approaches the goal the reward increases, promoting movement towards the goal state.
Regarding the weight of 0.7 in Equation (8), this value was chosen based on empirical observations during our initial experiments. We found that it provided a good balance between encouraging the robot to reach the goal and avoiding excessive movements. However, we acknowledge that this value may not be optimal in all situations and could depend on the specific characteristics of the robot and the task.
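Reading Equation (8) as a negatively weighted distance, so that the reward increases as the robot approaches the goal as described above, the implementation is a one-liner:

```python
def reward_3(d, weight=0.7):
    """Equation (8): distance weighted by 0.7, negated so the reward
    grows as the robot approaches the goal."""
    return -weight * d
```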
The final reward function, described in Equation (9), aims to incentivize the robot to reach the goal state in the shortest possible time. It takes into account the Euclidean distance $d$ between the robot's current position and the goal position and is divided into multiple intervals based on the value of $d$. The highest reward, 200, is given if the distance is at most 0.025; a reward of 150 is given if the distance is between 0.025 and 0.05; and a reward of 100 is given if the distance is between 0.05 and 0.1. If the distance is greater than 0.1, the reward is calculated from the distance-based formula of Equation (9). If the Euclidean distance is unchanged from the previous step, the reward is −100, discouraging the robot from staying in the same place. To recap, we have established four distinct rewards that consider both the target position and the current position.
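A sketch of Equation (9) with the thresholds quoted above. The exact expression used when d > 0.1 is not reproduced in this text, so `far_formula` is a hypothetical placeholder for it; checking the no-progress case before the distance bands is also an assumption about the branch ordering.

```python
def reward_4(d, d_prev, far_formula=lambda d: -d):
    """Equation (9), sketched. `far_formula` is a hypothetical
    placeholder for the paper's expression used when d > 0.1."""
    if d == d_prev:          # no movement since the last step
        return -100.0
    if d <= 0.025:
        return 200.0
    if d <= 0.05:
        return 150.0
    if d <= 0.1:
        return 100.0
    return far_formula(d)    # distance-based formula of Equation (9)
```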
Table 3 presents the attributes of each reward function. The reward functions of Equations (6)–(9) are relatively straightforward and are based on a standard measure, the Euclidean distance of Equation (5), which is a common choice for this type of problem. Our goal was to show that effective control can be achieved using RL even with such simple reward functions.
2.2. The Learning Algorithm
The continuous action and state spaces of the continuum robot's environment are a crucial factor in determining the ideal RL algorithm. We employ the DDPG algorithm, which permits continuous action and state spaces, to address this issue. The algorithm utilizes neural network approximations and is based on the actor–critic method: the actor is responsible for determining the best course of action in a given situation, while the critic evaluates the efficacy of that choice. A replay buffer stores past experiences, and a target network helps to stabilize the learning process, so that the control policy improves over time.
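One concrete stabilization detail: the target networks in DDPG are typically synchronized with the trained networks by Polyak averaging. A minimal PyTorch-style sketch, where the smoothing factor `tau` is an illustrative value rather than one taken from the paper:

```python
def soft_update(target_net, source_net, tau=0.005):
    """Blend source parameters into the target network (Polyak averaging),
    which stabilizes the bootstrapped critic targets in DDPG.
    tau is illustrative; its value is not quoted in this text."""
    for t_param, s_param in zip(target_net.parameters(),
                                source_net.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)
```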
The DDPG algorithm collects information from its environment to determine the course of action that maximizes future rewards. The target network is used to stabilize the learning process, while the replay buffer is used to continuously update the actor and critic networks. The actor network computes the optimal action for a given state, forming the basis of the final control strategy. DDPG trains the deterministic policy in an off-policy way. However, because the policy is deterministic, the agent may not initially explore a wide enough variety of actions to collect useful learning signals. To address this problem, the authors of [2] improved the exploration capacity of DDPG policies by adding noise to the actions during training. In this study, we use the Ornstein–Uhlenbeck process to produce this noise; the process generates noise that is correlated with the preceding noise, so that successive perturbations do not cancel out the overall dynamics. The parameters of the Ornstein–Uhlenbeck process, which generates the temporally correlated exploration, are listed in Table 4.
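A minimal sketch of the Ornstein–Uhlenbeck noise generator; the default values of `mu`, `theta`, and `sigma` below are common illustrative choices, not the values from Table 4.

```python
import numpy as np

class OUNoise:
    """Ornstein–Uhlenbeck process: temporally correlated exploration noise.

    Euler–Maruyama discretization:
        n_{t+1} = n_t + theta * (mu - n_t) * dt + sigma * sqrt(dt) * N(0, 1)
    Defaults are illustrative; see Table 4 for the values actually used.
    """

    def __init__(self, size=3, mu=0.0, theta=0.15, sigma=0.2, dt=0.05):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(size, mu)

    def reset(self):
        self.state = np.full_like(self.state, self.mu)

    def sample(self):
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state
```

During training, a sample from this process is added to the actor's deterministic action before the action is applied to the environment.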
The success of the DDPG algorithm is strongly influenced by the state space, action space, and reward function of the environment. Consequently, it is essential to meticulously design and evaluate multiple environment configurations in order to determine the most suitable control strategy for the continuum robot.