2.3.1. Deep Reinforcement Learning
The main walking control methods for bipedal robots are model-based control and learning-based control. In this study, we use learning-based control, which has potential for future development and has been actively studied [23,24,25,26].
Humans learn to walk as infants by adjusting their movements through trial and error according to their abilities and the characteristics of the terrain [27]. This learning process is considered similar to the process of reinforcement learning, in which the actions that maximize future value are learned through trial and error. The neural networks used in deep learning are said to mimic the mechanism of human neurons [28]. For these reasons, among learning-based control methods, we believe that deep reinforcement learning can be employed to incorporate human gait characteristics as rewards for acquiring a human-like bipedal gait in various environments.
We first explain reinforcement learning. In reinforcement learning, the agent executes actions according to a policy, and the environment returns rewards to the agent as feedback based on the state (position, velocity, posture, joint angles, etc.) and the agent's choice of action. The agent improves the policy so that the cumulative reward received over the sequence of actions becomes larger. This process is repeated to find the optimal solution.
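As a minimal illustration of this interaction loop, the sketch below collects one episode of states, actions, and rewards; the `env` and `policy` objects and their Gym-style methods are hypothetical placeholders rather than the interfaces used in this study.

```python
# Minimal sketch of the agent-environment interaction described above.
# `env` and `policy` are hypothetical placeholders with Gym-style methods.

def run_episode(env, policy, max_steps=1000):
    """Collect one episode of (state, action, reward) transitions."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy.sample(state)                 # act according to the policy
        next_state, reward, done = env.step(action)   # environment feedback
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                      # e.g., the robot falls over
            break
    return trajectory
```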
Classical reinforcement learning has the problem that it can only handle a discrete action space. Therefore, we combine the high function-approximation ability of neural networks with the action selection of reinforcement learning. As a result, a continuous action space can be learned with continuous values without artificially discretizing it, and a system that can respond more flexibly to changes in the state can be realized.
In this study, we use the policy gradient method, in which the policy is a function expressed by a certain parameter, and the policy can be learned directly by learning that parameter. Moreover, by using the REINFORCE algorithm and a Gaussian policy, the policy gradient method can be applied to a continuous action space [29].
2.3.2. Policy Gradient Methods
The update equation of the policy parameter $\boldsymbol{\theta}$ can be expressed as follows:
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}),$$
where $\eta$ is the learning rate factor, $\nabla_{\boldsymbol{\theta}}$ denotes the partial differential vector with respect to $\boldsymbol{\theta}$, and $\boldsymbol{\theta}$ is a multidimensional vector.
In the policy gradient method, we consider the problem of maximizing the objective function, i.e., the expected return. We define the objective function $J(\boldsymbol{\theta})$ as the value function $V^{\pi_{\boldsymbol{\theta}}}(s_0)$ calculated under the policy $\pi_{\boldsymbol{\theta}}$ at the time of the start of learning, since this is the expected return at the initial state $s_0$:
$$J(\boldsymbol{\theta}) = V^{\pi_{\boldsymbol{\theta}}}(s_0) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}\!\left[ G_0 \mid s_0 \right],$$
where $G_0$ is the cumulative discounted reward, and $s_0$ is the state variable at time $t = 0$. The gradient $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$ of the objective function with respect to the policy parameters is expressed as follows for any policy $\pi_{\boldsymbol{\theta}}$ that is differentiable with respect to the parameter $\boldsymbol{\theta}$, and for the objective function $J(\boldsymbol{\theta})$ defined by Equation (2) [30]:
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}\!\left[ \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s)\, Q^{\pi_{\boldsymbol{\theta}}}(s, a) \right],$$
where $a$ is the action and $Q^{\pi_{\boldsymbol{\theta}}}(s, a)$ is the action value function. There are two problems in calculating Equation (3): it is difficult to compute the expected value, and an estimate of the action value function is required.
The solution to the first problem is to use Monte Carlo approximation, which is a method for approximating the expected value. This allows the expected value to be calculated even when the probability distribution is complex. Based on the probabilistic policy $\pi_{\boldsymbol{\theta}}$, the action is executed for $T$ steps, and the gradient is approximated from the obtained observations of states, actions, and rewards.
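Written out for a single trajectory of $T$ steps, a standard form of this Monte Carlo estimate is (up to the choice of normalization):
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \approx \sum_{t=0}^{T-1} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t)\, Q^{\pi_{\boldsymbol{\theta}}}(s_t, a_t).$$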
The solution to the second problem is to use the REINFORCE algorithm, which does not require direct estimation of the action value function but instead approximates it with the cumulative discounted reward actually obtained. The action value function $Q^{\pi_{\boldsymbol{\theta}}}(s_t, a_t)$ in Equation (4) is approximated by the cumulative discounted reward $G_t$:
$$G_t = \sum_{k=0}^{T-t-1} \gamma^{k} r_{t+k},$$
where $\gamma$ is the discount rate factor, and $r_t$ is the reward variable at time $t$.
The REINFORCE algorithm reduces the variance of the action value function $Q^{\pi_{\boldsymbol{\theta}}}(s, a)$ by introducing a baseline, which is a function $b(s)$ that provides a reference for the action value function $Q^{\pi_{\boldsymbol{\theta}}}(s, a)$. This makes learning easier to converge. The equation with the baseline is shown below:
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}\!\left[ \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s) \left( Q^{\pi_{\boldsymbol{\theta}}}(s, a) - b(s) \right) \right]$$
The value function $V^{\pi_{\boldsymbol{\theta}}}(s)$ is, by definition, the average of the action value function $Q^{\pi_{\boldsymbol{\theta}}}(s, a)$ weighted by the policy probability $\pi_{\boldsymbol{\theta}}(a \mid s)$. Therefore, the value function $V^{\pi_{\boldsymbol{\theta}}}(s)$ is used as the baseline in this case. We define the advantage function, which represents the quantity of the action value measured with respect to its mean value, as follows:
$$A^{\pi_{\boldsymbol{\theta}}}(s, a) = Q^{\pi_{\boldsymbol{\theta}}}(s, a) - V^{\pi_{\boldsymbol{\theta}}}(s)$$
The expected value can be expressed by substituting Equation (8) into Equation (7) and computing the Monte Carlo approximation as follows:
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \approx \sum_{t=0}^{T-1} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t)\, A^{\pi_{\boldsymbol{\theta}}}(s_t, a_t),$$
where $a_t$ is the action variable at time $t$. The action value function $Q^{\pi_{\boldsymbol{\theta}}}(s_t, a_t)$ included in the advantage function is approximated by the cumulative discounted reward $G_t$ using the REINFORCE algorithm. Then, the update formula for $\boldsymbol{\theta}$ is calculated as follows:
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta \sum_{t=0}^{T-1} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t) \left( G_t - V^{\pi_{\boldsymbol{\theta}}}(s_t) \right)$$
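As a concrete sketch of these quantities, the functions below compute the cumulative discounted reward $G_t$ and the advantage estimates $G_t - V(s_t)$ for one episode; the function names and the default discount rate are illustrative choices, not those of the study.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Cumulative discounted reward G_t for every time step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

def advantage_estimates(rewards, values, gamma=0.99):
    """Advantage estimates A_t = G_t - V(s_t) that weight the log-probability gradients."""
    return discounted_returns(rewards, gamma) - np.asarray(values)
```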
2.3.3. Neural Network
In this study, we use two types of neural networks.
The first is a neural network used for the Gaussian policy. To make the objective explicit, the derivative operator is removed from Equation (11) and the result is multiplied by −1 to obtain Equation (13) as the loss function. RMSprop is used as the optimization algorithm, and the parameters are updated to minimize this loss function. The input, hidden, and output layers are listed in Table 3.
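Following this description, the loss minimized by RMSprop is a negative log-likelihood weighted by the advantage; written in the notation above (the exact expression of Equation (13) may differ in detail), it takes the form
$$L_{\pi}(\boldsymbol{\theta}) = -\sum_{t=0}^{T-1} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t)\, A^{\pi_{\boldsymbol{\theta}}}(s_t, a_t).$$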
The second is a neural network for estimating the state value function. This neural network is designed to output the cumulative discounted reward, using Equation (15) as the loss function. Adam is used as the optimization algorithm, and the parameters are updated to minimize this loss function. The input, hidden, and output layers are listed in Table 4.
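A standard choice consistent with this description is the mean squared error between the network output and the cumulative discounted reward; writing the value network's weights as $\boldsymbol{w}$ (a symbol introduced here only for illustration), such a loss is
$$L_{V}(\boldsymbol{w}) = \frac{1}{T} \sum_{t=0}^{T-1} \left( V_{\boldsymbol{w}}(s_t) - G_t \right)^{2}.$$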
A schematic diagram of the learning algorithm used in this study is shown in Figure 3. We define one step as selecting the next action to be taken in a given state according to the Gaussian policy and executing that action to obtain the next state and reward. One episode is defined as repeating this process until a termination condition is met: the bipedal robot falls over, or the maximum number of steps is reached. When a termination condition is met, the neural network of the state value function is updated based on the history of states, actions, and rewards, and the cumulative discounted reward is calculated. The advantage function is then calculated, and the neural network of the Gaussian policy is updated. In this study, this learning process is repeated for 500,000 episodes.
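The sketch below summarizes this per-episode procedure; `env`, `policy_net`, and `value_net` are hypothetical placeholders, and `discounted_returns` is the helper function defined in the earlier sketch.

```python
def train(env, policy_net, value_net, episodes=500_000, gamma=0.99):
    """Illustrative per-episode training loop for the procedure described above."""
    for _ in range(episodes):
        states, actions, rewards = [], [], []
        state, done = env.reset(), False
        while not done:                               # one episode: until a fall or the step limit
            action = policy_net.sample(state)         # one step: act according to the Gaussian policy
            states.append(state)
            actions.append(action)
            state, reward, done = env.step(action)
            rewards.append(reward)
        returns = discounted_returns(rewards, gamma)  # cumulative discounted rewards
        value_net.update(states, returns)             # state value network update (Adam)
        adv = returns - value_net.predict(states)     # advantage estimates
        policy_net.update(states, actions, adv)       # Gaussian policy network update (RMSprop)
```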
2.3.4. Probabilistic Policy with the Gaussian Model
The Gaussian model is a typical example of a probabilistic policy for control in a continuous space. This model samples a K-dimensional action vector $\boldsymbol{a}$ from a K-dimensional normal distribution with a mean $\boldsymbol{\mu}(s)$ and covariance matrix $\boldsymbol{\Sigma}(s)$ as parameters in state $s$. The equation is expressed as follows:
$$\pi_{\boldsymbol{\theta}}(\boldsymbol{a} \mid s) = \frac{1}{\sqrt{(2\pi)^{K} \left| \boldsymbol{\Sigma}(s) \right|}} \exp\!\left( -\frac{1}{2} \left( \boldsymbol{a} - \boldsymbol{\mu}(s) \right)^{\top} \boldsymbol{\Sigma}(s)^{-1} \left( \boldsymbol{a} - \boldsymbol{\mu}(s) \right) \right).$$
If the covariance matrix $\boldsymbol{\Sigma}(s)$ is not diagonal, the different components of the action vector will interact. This complicates the implementation of the policy function in a neural network. Therefore, we assume that the components of the action vector are independent and that the covariance matrix consists only of diagonal components. Let the $k$-th diagonal component of the covariance matrix be $\sigma_k^{2}(s)$; then the K-dimensional normal distribution can be decomposed into a product of independent 1-dimensional normal distributions as follows:
$$\pi_{\boldsymbol{\theta}}(\boldsymbol{a} \mid s) = \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi \sigma_k^{2}(s)}} \exp\!\left( -\frac{\left( a_k - \mu_k(s) \right)^{2}}{2\sigma_k^{2}(s)} \right),$$
where $\mu_k(s)$ and $\sigma_k(s)$ are functions with the state $s$ as input, obtained from the neural network introduced in Section 2.3.3.
Using the Gaussian policy, the log-probability $\log \pi_{\boldsymbol{\theta}}(\boldsymbol{a} \mid s)$ in Equation (13) can be calculated as follows:
$$\log \pi_{\boldsymbol{\theta}}(\boldsymbol{a} \mid s) = -\sum_{k=1}^{K} \left( \frac{\left( a_k - \mu_k(s) \right)^{2}}{2\sigma_k^{2}(s)} + \log \sigma_k(s) + \frac{1}{2} \log 2\pi \right)$$
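A minimal sketch of sampling and log-probability evaluation under this diagonal Gaussian policy is given below; `mu` and `sigma` stand for the per-component mean and standard deviation produced by the policy network for the current state.

```python
import numpy as np

def sample_action(mu, sigma, rng=None):
    """Sample a K-dimensional action from independent Gaussians (diagonal covariance)."""
    rng = rng or np.random.default_rng()
    return rng.normal(mu, sigma)

def log_prob(action, mu, sigma):
    """Log-probability of an action under the diagonal Gaussian policy."""
    action, mu, sigma = map(np.asarray, (action, mu, sigma))
    return np.sum(-0.5 * ((action - mu) / sigma) ** 2
                  - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
```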
2.3.5. Rewards
This study used the following four rewards for reinforcement learning.
This is the reward that keeps the upper body at a certain height and reduces the rotation of the body in the pitch direction. It is computed from the height of the center of the body and the angle of the center of the body about the pitch axis.
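As an illustration only, a reward of this kind might penalize deviation from a target body height together with rotation about the pitch axis; the quadratic form, the function name, and the weights below are assumptions made for this sketch rather than the expression used in the study.

```python
def posture_reward(body_height, pitch_angle, target_height, w_height=1.0, w_pitch=1.0):
    """Illustrative posture reward (assumed quadratic penalty form)."""
    return -w_height * (body_height - target_height) ** 2 - w_pitch * pitch_angle ** 2
```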
This is the reward for the distance advanced by the center of the body. First, we find the distance from the position of the body at time $t$ to the target position:
$$d_t = \sqrt{\left( x_{\mathrm{g}} - x_t \right)^{2} + \left( y_{\mathrm{g}} - y_t \right)^{2}},$$
where $x_{\mathrm{g}}$ and $y_{\mathrm{g}}$ are the target positions in the $x$ and $y$ directions, respectively, and $x_t$ and $y_t$ are the positions of the body at time $t$. The advanced distance is then obtained from the change in $d_t$.
The reward is then computed by weighting the advanced distance with a weight coefficient.
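For illustration, the distance and a reward of this type could be computed as follows; the linear form `weight * (d_prev - d_curr)` and the function names are assumptions made for this sketch.

```python
import math

def distance_to_target(x, y, x_goal, y_goal):
    """Euclidean distance from the body position (x, y) to the target position."""
    return math.hypot(x_goal - x, y_goal - y)

def distance_reward(d_prev, d_curr, weight=1.0):
    """Illustrative forward-progress reward: weight times the distance advanced
    toward the target since the previous step (assumed linear form)."""
    return weight * (d_prev - d_curr)
```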
This is the reward for ensuring that each joint does not exceed its range of motion. It is computed from a weight coefficient and the number of joints that exceed their range of motion.
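A sketch of this penalty, assuming it scales linearly with the number of violating joints, is shown below.

```python
def joint_limit_reward(num_joints_out_of_range, weight=1.0):
    """Illustrative joint-range penalty: negative reward proportional to the number
    of joints outside their range of motion (assumed linear form)."""
    return -weight * num_joints_out_of_range
```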
This is the reward for planar covariation, which is a characteristic of the human gait. Since the kinematic synergy is such that the elevation angles at the thigh, shank, and foot lie on a single plane, it is sufficient to ensure that the three calculated elevation angles lie on a single plane. The following is a detailed description of the application method and the calculation process.
First, we explain how the angles are read and when the reward is computed. The angle of each joint is read each time an action is executed according to the policy. When the heel of the bipedal robot lands on the ground, the reward is computed and the angle data are reset.
Next, we explain how we obtain a plane from the angle readings. In this study, the plane is found using the least-squares method. The least-squares plane is the plane that minimizes the sum of squared distances to all the points in a 3-dimensional point cloud. The least-squares plane can be expressed as follows:
$$z = Ax + By + C,$$
where $A$, $B$, and $C$ are coefficients. We can obtain the unknowns $A$, $B$, and $C$ using the lower–upper (LU) decomposition method. The least-squares plane is obtained by substituting the calculated coefficients into Equation (23). In this study, $x$ is the elevation angle at the thigh, $y$ is the elevation angle at the shank, and $z$ is the elevation angle at the foot, so, for simplicity, the elevation angles at the thigh, shank, and foot are written as $\alpha_{\mathrm{t}}$, $\alpha_{\mathrm{s}}$, and $\alpha_{\mathrm{f}}$, respectively. The fitted plane is obtained by substituting these variables into Equation (23).
The average $S$ of the squared differences between the plane and the elevation angle at the foot should be zero for the angle data to lie on a single plane. Therefore, we train so that $S$ is close to zero. $S$ is expressed as
$$S = \frac{1}{N} \sum_{i=1}^{N} \left( A\alpha_{\mathrm{t},i} + B\alpha_{\mathrm{s},i} + C - \alpha_{\mathrm{f},i} \right)^{2},$$
where $N$ is the number of angle data points and $\alpha_{\mathrm{f},i}$ is the $i$-th elevation angle at the foot; the reward in Equation (26) is then obtained from $S$ with a weight coefficient. Since planar covariation is valid for each of the left and right legs, the reward in Equation (26) is applied to each of the left and right legs separately.
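The computation described above can be sketched as follows for one leg; NumPy's least-squares routine is used here in place of an explicit LU decomposition of the normal equations, and the penalty form `-weight * S` is an assumption made for this sketch.

```python
import numpy as np

def planar_covariation_reward(thigh, shank, foot, weight=1.0):
    """Illustrative planar-covariation reward for one leg.

    thigh, shank, foot: 1-D arrays of elevation angles collected between heel strikes.
    Fits the least-squares plane foot ~ A*thigh + B*shank + C and penalizes the mean
    squared residual S (the angle data lie on one plane when S is zero).
    """
    X = np.column_stack([thigh, shank, np.ones_like(thigh)])  # design matrix for z = Ax + By + C
    coeffs, *_ = np.linalg.lstsq(X, foot, rcond=None)         # least-squares A, B, C
    S = np.mean((X @ coeffs - foot) ** 2)                     # mean squared deviation from the plane
    return -weight * S
```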