2.1. Environment Modeling
In this study, an AMR with differential drive was modeled to maneuver and avoid collisions with both dynamic and static obstacles. The environment was created based on the cell decomposition technique, since it facilitates the development of the RL algorithm by converting the problem into a finite state space. Each cell represents the position, orientation, and any other state variable relevant to the problem to be solved.
In the environment, the robot is considered a point mass with a radius smaller than the cell size, so 8 actions are allowed for each state: forward, backward, left, right, forward–left, forward–right, backward–left, and backward–right, as shown in Figure 2a. In this method, obstacles are modeled in such a manner that if the border of an obstacle partially overlaps a cell, the entire cell is considered an obstacle.
The environments used all have the same dimensions in cells, and only the positions of the obstacles vary.
Figure 2b shows an example of the environment. The red-filled cells represent the obstacles, the red-dotted cells represent the area in which an obstacle moves at each interaction, and the target point is represented in green. The robot and the target are located at the positions indicated in the figure. Each cell represents an $(x, y)$ coordinate with integer values.
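To make the cell-decomposition model concrete, a minimal grid environment with the eight actions described above is sketched below. The grid size, obstacle layout, and reward values are illustrative placeholders only and do not correspond to the exact configuration used in the experiments.

```python
# Minimal sketch of a cell-decomposition environment (illustrative values only).
# The eight allowed moves: forward, backward, left, right, and the four diagonals.
ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0),
           (-1, 1), (1, 1), (-1, -1), (1, -1)]  # (dx, dy) for each action

class GridEnvironment:
    def __init__(self, size=20, obstacles=None, target=(18, 18)):
        self.size = size
        self.obstacles = set(obstacles or [])  # a cell touched by an obstacle is fully blocked
        self.target = target
        self.state = (0, 0)

    def reset(self, start=(0, 0)):
        self.state = start
        return self.state

    def step(self, action_index):
        dx, dy = ACTIONS[action_index]
        x, y = self.state
        nx, ny = x + dx, y + dy
        # Stay in place if the move leaves the grid or enters an obstacle cell.
        if not (0 <= nx < self.size and 0 <= ny < self.size) or (nx, ny) in self.obstacles:
            nx, ny = x, y
        self.state = (nx, ny)
        done = self.state == self.target
        reward = 100.0 if done else -1.0  # placeholder reward, not the study's Equation (6)
        return self.state, reward, done
```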
2.2. Reinforcement Learning and Q-Learning Algorithm
The sequence of decisions in RL is determined by the Markov decision process (MDP) introduced by Bellman [28]. The MDP consists of a set of environment states, a set of actions available to the agent, and a transition probability distribution that gives the probability of moving from one state to another after taking an action in a given state. This process captures the effect of each action, which has an associated reward.
Figure 3 shows the process of RL, in which the following elements interact [15,29] (a minimal interaction loop illustrating these elements is sketched after the list):
An agent is an entity that perceives/explores an environment and makes decisions.
An environment includes everything surrounding an agent and is generally assumed to be stochastic.
Actions correspond to an agent’s movement in an environment.
State is the representation of the agent's situation in the environment at a given time.
Reward is a numerical value that an agent tries to maximize by the selection of its actions.
Policy represents a strategy used by an agent to select an action based on the present state.
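These elements interact in a simple loop: the agent observes a state, selects an action with its policy, and receives a reward and a new state from the environment. The following sketch illustrates this loop with a generic `environment` and `policy` interface, which is assumed here for illustration rather than taken from the code of this study.

```python
# Generic agent-environment interaction loop (illustrative interface).
def run_episode(environment, policy, max_steps=500):
    state = environment.reset()          # initial state provided by the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)           # the policy maps the current state to an action
        state, reward, done = environment.step(action)  # environment returns the outcome
        total_reward += reward           # the agent tries to maximize the accumulated reward
        if done:
            break
    return total_reward
```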
The following is a general presentation of the MDP, extracted from [14,15].
The result of the MDP corresponds to a sequence or trajectory of states, $S_t$, actions, $A_t$, and rewards, $R_t$, for each instant of time, $t$. The beginning of the sequence is shown in Equation (1):

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots \tag{1}$$
The selection process for a subsequent state must comply with the fact that the next state depends only on the immediately preceding state and action, and not on any other previous states or actions. Furthermore, all possible states must have a probability greater than $0$, which guarantees that all actions are eligible at time $t$. The probability of the occurrence of these states is represented in Equation (2):

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \tag{2}$$
where $p$ defines the dynamics of the MDP for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$; therefore, $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$. From this function, we can calculate the transition probability from one state to another, which is commonly denoted as follows:

$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \tag{3}$$
On the other hand, Equation (4) shows that it is possible to calculate the reward for each transition based on two arguments, $r(s, a)$:

$$r(s, a) \doteq \mathbb{E}\left[R_t \mid S_{t-1} = s, A_{t-1} = a\right] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \tag{4}$$
If we want to estimate the expected reward considering three arguments (state, action, and next state), then $r(s, a, s')$ is expressed as follows:

$$r(s, a, s') \doteq \mathbb{E}\left[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'\right] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)} \tag{5}$$
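Equations (3)–(5) can be evaluated directly when the dynamics $p(s', r \mid s, a)$ are stored in tabular form. The snippet below is a small illustration assuming the dynamics are kept as a dictionary of `(next_state, reward, probability)` triples; this data layout is an assumption made for the example.

```python
# p[s][a] is assumed to be a list of (next_state, reward, probability) triples.
def transition_probability(p, s, a, s_next):
    """Equation (3): p(s'|s,a) = sum over r of p(s', r | s, a)."""
    return sum(prob for ns, r, prob in p[s][a] if ns == s_next)

def expected_reward(p, s, a):
    """Equation (4): r(s,a) = sum over r of r * sum over s' of p(s', r | s, a)."""
    return sum(r * prob for ns, r, prob in p[s][a])

def expected_reward_given_next(p, s, a, s_next):
    """Equation (5): r(s,a,s') = sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    denom = transition_probability(p, s, a, s_next)
    return sum(r * prob for ns, r, prob in p[s][a] if ns == s_next) / denom
```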
Reward terms for the present study are shown in Equation (6), where there are two main objectives: to incentivize the agent to move as fast as possible, reducing the number of direction changes, and to encourage diagonal movements, which cover a greater distance per step.
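Since the exact terms and weights of Equation (6) are specific to this study, the following sketch only illustrates how a reward with these two objectives (penalizing steps and direction changes, rewarding diagonal moves and the target) might be encoded; all numerical values are hypothetical.

```python
# Illustrative reward shaping (hypothetical weights, not the values of Equation (6)).
def shaped_reward(previous_action, action, reached_target, is_diagonal):
    reward = -1.0                    # step penalty: encourages reaching the target quickly
    if previous_action is not None and action != previous_action:
        reward -= 0.5                # penalty for changing direction
    if is_diagonal:
        reward += 0.5                # bonus for diagonal moves (greater distance per step)
    if reached_target:
        reward += 100.0              # terminal bonus for reaching the target
    return reward
```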
The objective is to maximize the performance, $G_t$, which corresponds to the sum of the rewards at each time instant, as shown in Equation (8), where $T$ is the final time step:

$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \tag{8}$$
A complete sequence of rewards, from the initial state to the final state, is called an episode. Although episodes conclude in terminal states, the rewards are not always the same for each episode. In this regard, the maximization of the expected performance is accomplished through the parameter $\gamma$, which is represented as follows:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$

where the parameter $\gamma$ is referred to as the discount rate and takes values $0 \leq \gamma \leq 1$.
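For a finite episode, the discounted performance can be computed directly from the recorded rewards, as in the short function below (the discount rate value is only an example).

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for the rewards of one episode."""
    g = 0.0
    for reward in reversed(rewards):   # accumulate backwards so gamma is applied correctly
        g = reward + gamma * g
    return g

# Example: three step penalties of -1 followed by a terminal reward of 100.
print(discounted_return([-1.0, -1.0, -1.0, 100.0], gamma=0.9))
```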
Several exact solutions and approximations have been developed to solve the MDP. These methods can be classified into two main groups: tabular methods and neural networks.
Figure 4 shows a general diagram of the main algorithms based on the methods used.
The policy establishes the mapping between states and actions, and the optimal policy is the one that yields the maximum return. Dynamic programming is based on the assumption of a perfect model of the MDP, which allows the return to be calculated for all possible actions.
The most commonly used dynamic programming algorithms are value iteration and policy iteration. In the present study, policy iteration was used, since it requires a lower number of iterations to converge and therefore a lower computational cost. In this algorithm, the policy is evaluated by computing the state-value function, $v_\pi(s)$:

$$v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$
The improved policy, $\pi'$, is then calculated through a one-step lookahead that replaces the initial policy:

$$\pi'(s) \doteq \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$
Algorithm 1 shows a complete version of policy iteration [15].
Algorithm 1: Policy Iteration
1: Initialize
2:   $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily, for all states $s \in \mathcal{S}$
3: Policy Evaluation
4: Loop:
5:   $\Delta \leftarrow 0$
6:   Loop for each $s \in \mathcal{S}$:
7:     $v \leftarrow V(s)$
8:     $V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s))\left[r + \gamma V(s')\right]$
9:     $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
10: Until $\Delta < \theta$ (a small positive threshold)
11: Policy Improvement
12: policy-stable $\leftarrow$ true
13: Loop for each $s \in \mathcal{S}$:
14:   old-action $\leftarrow \pi(s)$
15:   $\pi(s) \leftarrow \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]$
16:   If old-action $\neq \pi(s)$, then policy-stable $\leftarrow$ false
17: If policy-stable, then stop and return $V \approx v_*$ and $\pi \approx \pi_*$
18: Else, go to Policy Evaluation (step 3)
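A compact implementation of Algorithm 1 is sketched below. It assumes the MDP dynamics are available as a function `dynamics(s, a)` that returns `(next_state, reward, probability)` triples; this interface and the tolerance value are assumptions made for illustration rather than the code used in this study.

```python
def policy_iteration(states, actions, dynamics, gamma=0.99, theta=1e-6):
    """Policy iteration (Algorithm 1): alternate policy evaluation and improvement."""
    value = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}        # arbitrary initial policy

    def backup(s, a):
        # One-step lookahead: sum over (s', r) of p(s', r | s, a) * [r + gamma * V(s')].
        return sum(p * (r + gamma * value[ns]) for ns, r, p in dynamics(s, a))

    while True:
        # Policy evaluation: sweep until the value function stops changing.
        while True:
            delta = 0.0
            for s in states:
                v = value[s]
                value[s] = backup(s, policy[s])
                delta = max(delta, abs(v - value[s]))
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current value function.
        policy_stable = True
        for s in states:
            old_action = policy[s]
            policy[s] = max(actions, key=lambda a: backup(s, a))
            if old_action != policy[s]:
                policy_stable = False
        if policy_stable:
            return policy, value
```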
Table 1 shows the hyperparameters used in the policy iteration algorithm. A high discount rate is considered for the agent to pursue long-term rewards. On the other hand, a low learning rate is considered to guarantee convergence, which implies a higher number of iterations. For practical purposes, the policy iteration technique will be referred to hereafter as RL to differentiate it from the Q-Learning (QL) and Deep Q-Learning (DQL) techniques.
The algorithm was developed in Python 3.11.7 on a system with an Intel Core i5-10300 processor with 4 cores up to 2.50 GHz, an NVIDIA GeForce GTX 1650 graphics card, 16 GB of RAM, and Windows 11.
Q-Learning is a value-based, off-policy method that uses a Q-table to update the action-value function at each step rather than at the end of each episode. Each row contains a state-action value, where Q represents the "quality" of the action performed in that state. During training, the agent updates the Q-table to determine the optimal policy. It is possible to obtain the optimal policy by using the Bellman equation to select the best action in each state, $\pi_*(s)$ [14]:

$$\pi_*(s) = \arg\max_{a} q_*(s, a)$$
The optimal action-value function, which corresponds to following the optimal policy $\pi_*$, is described as follows:

$$q_*(s, a) \doteq \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\right]$$
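The tabular update derived from the Bellman optimality equation can be written in a few lines. The sketch below uses a dictionary as the Q-table and an epsilon-greedy behavior policy; the environment interface and the hyperparameter values are assumptions made for illustration.

```python
from collections import defaultdict
import random

def q_learning_episode(env, q_table, n_actions=8, alpha=0.1,
                       gamma=0.99, epsilon=0.1, max_steps=500):
    """One episode of tabular Q-Learning with an epsilon-greedy behavior policy."""
    state = env.reset()
    for _ in range(max_steps):
        # Epsilon-greedy action selection over the Q-table row for this state.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: q_table[(state, a)])
        next_state, reward, done = env.step(action)
        # Off-policy update toward the Bellman optimality target.
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
        target = reward + (0.0 if done else gamma * best_next)
        q_table[(state, action)] += alpha * (target - q_table[(state, action)])
        state = next_state
        if done:
            break

# Usage: q_table = defaultdict(float); q_learning_episode(env, q_table)
```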
Figure 5 shows the Q-Learning and Deep Q-Learning approaches, where in Deep Q-Learning the Q-values are updated by a deep learning architecture rather than stored in a table.
The Deep Q-Learning algorithm updates the parameters of the neural network from the transitions gathered while exploring the environment. Algorithm 2 shows a complete version of Deep Q-Learning that follows an off-policy strategy and uses the mean squared error as the loss function.
Algorithm 2: Deep Q-Learning
1: Input: learning rate $\alpha$, discount rate $\gamma$, exploration rate $\epsilon$
2: Initialize
3:   q-value network $q(s, a; \theta)$ with random parameters $\theta$
4:   target parameters $\theta^{-} \leftarrow \theta$
5: Initialize replay buffer $D$
6: Loop: for each episode
7:   Restart the environment and observe the initial state $s$
8:   Loop: for each step of the episode
9:     Select action $a$ ($\epsilon$-greedy with respect to $q(s, \cdot\,; \theta)$)
10:    Execute action $a$ and observe the reward $r$ and the next state $s'$
11:    Insert the transition $(s, a, r, s')$ into the buffer $D$
12:    Sample a mini-batch from $D$ and compute the loss function: $L(\theta) = \big(r + \gamma \max_{a'} q(s', a'; \theta^{-}) - q(s, a; \theta)\big)^{2}$
13:    Update the NN parameters $\theta$ by gradient descent on $L(\theta)$
14:  End Loop
15:  Every episode, update the target parameters: $\theta^{-} \leftarrow \theta$
16: End Loop
17: Output: Optimal policy and q-value approximation
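A training step corresponding to lines 12 and 13 of Algorithm 2 is sketched below with TensorFlow. The mini-batch layout (arrays of states, integer actions, rewards, next states, and terminal flags), the separate target network, and the variable names are assumptions made for illustration rather than the exact training code of this study.

```python
import tensorflow as tf

def dqn_training_step(q_network, target_network, optimizer, batch, n_actions=8, gamma=0.99):
    """One gradient step on the mean squared error between Q(s, a) and the TD target."""
    states, actions, rewards, next_states, dones = batch    # mini-batch from the replay buffer
    rewards = tf.cast(rewards, tf.float32)                   # rewards and terminal flags as float32
    dones = tf.cast(dones, tf.float32)

    # TD target: r + gamma * max_a' q(s', a'; theta-), with no bootstrap on terminal steps.
    next_q = target_network(next_states)                     # shape: (batch, n_actions)
    max_next_q = tf.reduce_max(next_q, axis=1)
    targets = rewards + gamma * max_next_q * (1.0 - dones)

    with tf.GradientTape() as tape:
        q_values = q_network(states)                         # shape: (batch, n_actions)
        action_mask = tf.one_hot(actions, n_actions)
        q_taken = tf.reduce_sum(q_values * action_mask, axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))  # mean squared error loss

    gradients = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))
    return loss
```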
The architecture employed to approximate the Q function in the Deep Q-Learning algorithm is shown in
Figure 6. It is a fully connected neural network with the following characteristics:
Input layer: two input neurons, one for the $x$ position and one for the $y$ position.
Hidden layers: two fully connected layers with 64 neurons each and the ReLU activation function.
Output layer: a fully connected dense layer that takes the 64 outputs of the last hidden layer as its input and produces a vector of Q-values, one for each possible action in the environment (a Keras sketch of this architecture follows this list).
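Under this description (two inputs for the $x$ and $y$ coordinates, two hidden layers of 64 ReLU units, and one output per action, eight in this environment), the network can be declared in Keras as follows; this is a sketch consistent with Figure 6 rather than the exact training script.

```python
import tensorflow as tf

def build_q_network(n_actions=8):
    """Fully connected Q-network: inputs (x, y) -> 64 -> 64 -> one Q-value per action."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(2,)),                             # (x, y) cell coordinates
        tf.keras.layers.Dense(64, activation="relu"),           # first hidden layer
        tf.keras.layers.Dense(64, activation="relu"),           # second hidden layer
        tf.keras.layers.Dense(n_actions, activation="linear")   # Q-value for each action
    ])
    return model

# Usage: q_network = build_q_network(); optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
```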
Table 2 describes the hyperparameters selected for this study. Selecting a high discount rate allows the agent to pursue a greater reward in the long term; as the discount rate decreases, the agent pursues short-term payoffs. A low learning rate translates into slow convergence, so a high number of episodes is necessary to guarantee convergence, whereas a high learning rate can cause the network to diverge.
The algorithm was developed in Python with support from the open-source TensorFlow framework. The hardware settings are the same as those used for the policy iteration algorithm.