1. Research Background
1.1. Background of Quadruped Robots
Quadruped animals have demonstrated extremely strong adaptability and coordinated movement ability in various natural environments [1,2]. This natural advantage has inspired research in bionics, thereby giving birth to quadruped robots [3]. Walking robots have made significant progress in recent years, especially showing unique advantages in adaptability and trafficability in complex terrains. Compared with wheeled and tracked robots, walking robots can better cope with uneven terrains, which makes them particularly outstanding in application scenarios such as disaster relief, field navigation, and military reconnaissance [4]. Especially in disaster relief environments, walking robots can traverse rubble, crushed stones, and other irregular terrains and quickly reach the disaster-stricken areas, providing great convenience for rescue work [5]. In addition, in field navigation tasks, walking robots can autonomously avoid obstacles, cross gullies, and climb steep slopes to complete complex tasks [6,7].
1.2. Existing Control Methods and Limitations
To achieve efficient motion control of quadruped robots in complex terrains, many researchers have proposed various methods and strategies. These studies mainly focus on traditional control methods, bio-inspired methods, and, in recent years, emerging deep reinforcement learning methods:
Model Predictive Control:
MPC is an advanced control method that achieves precise control of the system by predicting a series of future states and optimizing the current control input. It has been widely used in robot control, especially for the motion control of quadruped robots in complex terrains.
MPC works by predicting the system's state changes over the next few steps: it builds a dynamic model and adjusts the control input at each step to match the desired state [8]. The advantage of MPC is that it can achieve optimal control under multiple constraints and multiple objectives. However, MPC relies heavily on the accuracy of the model, and inaccurate models can lead to poor control performance [9]. Ding et al. proposed a cascaded MPC framework based on a disturbance observer for humanoid walking, which integrates ankle, step, hip, and height-change strategies to achieve high robustness [10]. Daneshmand et al. developed a variable-horizon MPC for bipedal walking control, which allows the robot to adjust its prediction horizon in real time and improves flexibility and adaptability during walking [11].
MPC has shown significant advantages in the control of quadruped robots, especially in known and static terrains. However, when dealing with highly dynamic and unknown complex terrains, its performance is still limited by model accuracy and real-time computation [12].
Zero Moment Point (ZMP) Control:
ZMP control is a classic method for robot stability control, which ensures the balance of the robot by maintaining the position of the zero moment point. The ZMP control method is widely applied in biped and quadruped robots and plays a crucial role in ensuring stable walking [13].
The ZMP is the point on the robot's supporting surface where the resultant moment of all supporting forces is zero. Controlling the position of the ZMP ensures that the robot does not tip over during walking and thus achieves stable walking. The advantage of the ZMP control method lies in its simplicity and effectiveness: the precise ZMP position can be obtained through calculation, and the robot's posture can be adjusted in real time to maintain balance [14]. Kajita et al. presented a method for generating stable walking patterns for biped robots using ZMP preview control, allowing real-time adjustment and stability during locomotion [15]. Herdt et al. discussed a method for online walking motion generation that uses the ZMP to automatically adjust foot placement, improving the robot's ability to navigate unpredictable terrain [16].
However, when dealing with dynamic changes and complex terrains, ZMP control shows certain limitations. The ZMP control method assumes that the contact between the robot and the ground is completely known and controllable; when the terrain is complex and irregular, this assumption is often difficult to meet [17]. For example, on rubble or crushed-stone terrains, the contact points between the robot's feet and the ground keep changing, making it difficult to accurately calculate the ZMP position and resulting in poor control performance. In addition, the ZMP control method mainly focuses on balance and neglects walking speed and efficiency, whereas in some application scenarios, such as disaster relief and emergency response, robots are required to walk both stably and quickly [18].
The ZMP control method therefore has significant advantages in ensuring stable walking, especially in known and static terrains, but on dynamic and complex terrains its performance is limited and it needs to be combined with other methods.
1.3. Applications of Deep Reinforcement Learning in Robot Control
Deep reinforcement learning (DRL), with its strong adaptive ability and its capacity to handle high-dimensional state spaces, provides a new path for robot control [19]. Traditional control methods such as model predictive control (MPC) and zero moment point (ZMP) control rely on pre-defined models and struggle to adapt to unpredictable conditions, limiting their ability to balance speed and stability in real time. In contrast, a dual-layer reinforcement learning system leverages the strengths of both value-based methods, such as DDQN, and policy-based methods, such as PPO, to dynamically adjust robot behavior based on real-time feedback.
The dual-layer approach allows the robot to optimize discrete decision making (via DDQN) for high-level speed commands while using PPO to refine continuous motion adjustments for stability. This layered structure makes it more adaptable to changing environments compared to traditional methods, ensuring that the robot can navigate a variety of terrains with both efficiency and precision. Furthermore, the dual-layer system provides better generalization across different terrain types, making it more robust and scalable than single-layer approaches or classical control techniques. This flexibility and adaptability make it a superior choice for complex robot control tasks.
The Deep Q-Network (DQN) addresses the poor performance of traditional Q-learning in high-dimensional state spaces by using deep neural networks to approximate the Q-value function. However, DQN suffers from overestimation of Q-values, which can lead to instability and inefficiency of the learned strategy. Double DQN effectively reduces this overestimation bias and improves the stability and efficiency of learning by separating action selection from Q-value evaluation [20]. This method provides more accurate and stable action strategies, thereby enhancing the maneuverability and stability of robots in complex terrains.
As a widely applied policy optimization algorithm, PPO balances the speed and stability of policy improvement by restricting the step size of the policy update [21]. PPO adopts a clipped probability ratio objective function, effectively controlling gradient explosion and vanishing during policy updates. In legged robots, PPO can be used to optimize the control strategy of joint positions: through continuous interaction with the environment, the joint positions can be dynamically adjusted to achieve more flexible and precise motion control. The PPO algorithm performs well in a wide range of reinforcement learning tasks and is widely used in robot control [22]. By avoiding instability in the policy update process, PPO enables robots to achieve stable joint control in complex environments.
The advantages of deep reinforcement learning lie in its high flexibility and strong generalization ability, which allow it to adapt to different environments and tasks. First, DRL can autonomously learn the optimal strategy through continuous interaction with the environment, without relying on detailed models or prior knowledge, which is important in complex and dynamically changing environments. Second, DRL can handle high-dimensional state and action spaces and exploit the powerful representational ability of deep neural networks to capture complex environmental features, thereby making more precise decisions [23]. In addition, DRL shows a strong exploration ability in unknown environments and can search for the optimal solution among many possible strategies [24]. This exploration ability gives DRL a significant advantage when dealing with tasks with high uncertainty and variability. In the field of gait control, Zhang et al. developed an exploration framework using DDPG and an improved SAC algorithm for biped robots, improving stability, training speed, and walking adaptability in complex terrains [25,26]. Therefore, the method proposed in this paper uses DRL as the top layer of the control system, which outputs commands to the lower-layer DRL controller.
1.4. Hierarchical Control Methods
Deep reinforcement learning has significant advantages in high-level action selection and strategy optimization [27,28]. However, in practical applications, low-level fine torque control mostly relies on classical control methods. The PD controller achieves precise torque output by regulating the errors of joint position and velocity; it is simple to implement, insensitive to parameter changes, and widely used in industrial control and robot joint control. In legged robots, combining the PD controller with reinforcement learning enables precise torque adjustment after high-level action selection and mid-level joint control, thereby achieving stable torque output and smooth motion trajectories [29].
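As an illustration, a minimal PD control law of the kind described above is sketched below; the gain values and the 12-joint example are assumptions for a generic quadruped, not the parameters used in this work.

```python
import numpy as np

def pd_torque(q_des, q, qd_des, qd, kp=20.0, kd=0.5):
    """Minimal PD law: torque from joint position and velocity errors.

    q_des, q   : desired and measured joint positions (rad)
    qd_des, qd : desired and measured joint velocities (rad/s)
    kp, kd     : proportional and derivative gains (illustrative values)
    """
    return kp * (q_des - q) + kd * (qd_des - qd)

# Example: 12 joints of a quadruped tracking targets from a higher-level policy
q_des = np.zeros(12)                         # joint targets output by the RL layer
q = np.random.uniform(-0.1, 0.1, 12)         # measured joint positions
tau = pd_torque(q_des, q, np.zeros(12), np.zeros(12))
```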
Although a large number of studies have made significant progress in the motion control of walking robots, several problems still need to be solved urgently. Existing control strategies can often only optimize either speed or stability: in complex terrains, high-speed walking strategies can easily cause the robot to lose balance, while high-stability strategies struggle to achieve rapid movement. At the same time, the adaptability and robustness of existing control algorithms across different terrains and tasks still need to be improved [30].
Especially in the face of unknown and dynamically changing environments, ensuring stable walking remains a major challenge. Researchers have made significant progress in the control algorithms of walking robots [30]. Discrete-time model predictive control has been used to optimize the gait of legged robots online, enabling them to walk smoothly on uneven terrains, and DRL technology has also been widely used in robot control, significantly improving the adaptability and dynamic motion generation ability of quadruped robots in complex terrains [31,32].
To address this challenge, this paper proposes a double-layer reinforcement learning algorithm that combines the advantages of DDQN and PPO to realize the dynamic speed adjustment of quadruped robots using terrain information and robot states. By combining Double DQN, PPO, and the PD controller to form a hierarchical control structure, efficient motion control of legged robots in complex terrains is achieved. Detailed experimental verification on the Isaac Gym simulation platform shows that this algorithm exhibits excellent speed adjustment ability and stability in complex terrains. In general, the contributions of this paper can be summarized as follows:
This paper innovatively proposes a double-layer reinforcement learning algorithm that combines DDQN and PPO to address the motion control problem of quadruped robots in complex terrains.
Terrain information and robot states are used to dynamically adjust the walking speed of the quadruped robot, so that it maintains both high speed and high stability within complex terrains and when switching between different terrains.
Various complex terrains, including slopes, cracks, and obstacles, are constructed in Isaac Gym to simulate the diverse situations of real environments. Using the quadruped robot Unitree Go1 as the test platform, the effectiveness of the algorithm is verified.
1.5. Research Content and Structure of This Paper
To systematically elaborate on the above research content, the structure of this paper is arranged as follows:
Section 1 introduces the application background and importance of walking robots in complex terrains.
Section 2 elaborates on the basic concepts of the reinforcement learning methods used in this study, as well as the principles and design of the two-layer reinforcement learning algorithm.
Section 3 describes in detail the experimental design and implementation process for validating the algorithm, including using the Unitree Go1 quadruped robot to build various complex terrain environments on the Isaac Gym simulation platform for testing.
Section 4 summarizes the full text and conducts a result analysis.
2. Double-Layer Reinforcement Learning Algorithm Design
2.1. Basic Concepts of RL
Reinforcement learning (RL) is a machine learning method that learns the optimal policy through interaction with the environment to maximize cumulative rewards. The core concepts of RL include state, action, reward, policy, value function, and environment. The state represents the specific situation of the environment at a certain moment, the action is the behavior that the agent can take in a specific state, and the reward is the immediate score that the environment feeds back after the agent takes an action. The policy determines the rule of the action taken by the agent in each state, and the value function estimates the value of a certain state or state–action pair.
The RL problem is usually modeled as a Markov Decision Process (MDP), which includes a state space $S$, an action space $A$, a state transition probability $P$, a reward function $R$, and a discount factor $\gamma$. At each time step $t$, the agent starts from state $s_t$ and selects action $a_t$; the environment transitions to a new state $s_{t+1}$ according to the state transition probability $P(s_{t+1} \mid s_t, a_t)$ and provides a reward $r_t$. The objective of the MDP is to find the optimal policy $\pi^*$ that maximizes the expected cumulative reward, expressed as follows:

$$\pi^* = \arg\max_{\pi} \ \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right] \tag{1}$$

where $\gamma \in [0, 1)$ is the discount factor and $r_t$ is the reward at time step $t$.
The value iteration approach is common in RL. Q-learning is a model-free value iteration method that finds the optimal policy by iteratively updating the state–action value function $Q(s, a)$. DQN approximates the Q-value function with a deep neural network to overcome the limitations of traditional Q-learning in high-dimensional state spaces, but it suffers from overestimation of Q-values. Double Q-learning reduces this overestimation and improves learning stability and efficiency by separating action selection from Q-value evaluation.
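A minimal sketch of the Double DQN target computation described here is given below, assuming batched PyTorch tensors; the network interfaces are generic placeholders rather than the networks used in this paper.

```python
import torch

def double_dqn_target(reward, next_obs, done, online_net, target_net, gamma=0.99):
    """Double DQN target: the online network selects the greedy next action,
    the target network evaluates it, which reduces Q-value overestimation.

    reward, done : tensors of shape [B] (done is 0/1 float)
    next_obs     : tensor of shape [B, obs_dim]
    """
    with torch.no_grad():
        # 1. action selection by the online network
        next_action = online_net(next_obs).argmax(dim=1, keepdim=True)
        # 2. action evaluation by the target network
        next_q = target_net(next_obs).gather(1, next_action).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q
```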
2.2. Policy Gradient Methods
Policy gradient methods are a type of algorithm that solves reinforcement learning problems by directly optimizing the policy function $\pi_\theta(a \mid s)$. Unlike value iteration methods, policy gradient methods do not estimate the value function directly but instead optimize the policy itself to maximize the expected cumulative reward. These methods are widely used for reinforcement learning tasks with continuous action spaces and high-dimensional state spaces.
The basic idea of policy gradient methods is to directly optimize the policy parameters by gradient ascent so as to maximize the cumulative reward. The core of policy gradient methods is the policy gradient theorem, which provides the method for computing the gradient of the expected cumulative reward with respect to the policy parameters.
Let $\pi_\theta(a \mid s)$ be a parameterized probability distribution representing the probability of taking action $a$ in state $s$, with parameters $\theta$. The policy gradient theorem gives the gradient of the expected cumulative reward $J(\theta)$ with respect to the policy parameters $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right] \tag{2}$$
In Formula (2), the term $\nabla_\theta J(\theta)$ represents the gradient of the expected cumulative reward $J(\theta)$ with respect to the policy parameters $\theta$, indicating how to adjust $\theta$ to increase the reward. The expectation $\mathbb{E}_{\pi_\theta}$ averages over all state–action pairs $(s, a)$ that could occur under the current policy, capturing the average behavior of the policy. The term $\nabla_\theta \log \pi_\theta(a \mid s)$ is the gradient of the log probability of taking action $a$ in state $s$, which measures how sensitive the action probabilities are to changes in $\theta$ and simplifies the gradient computation for policy updates. Finally, $Q^{\pi_\theta}(s, a)$ is the state–action value function, representing the expected cumulative reward obtained from state $s$ after taking action $a$ and then following the policy $\pi_\theta$; it quantifies how beneficial a certain action is in a given state. Together, this formula directs the policy update towards actions that are both probable and lead to high rewards.
The steps of a policy gradient method are illustrated in Algorithm 1:
Algorithm 1: Policy Gradient Method
1: Initialize policy parameters $\theta$ and other related parameters.
2: Sampling: Interact with the environment using the current policy $\pi_\theta$ to generate state–action–reward sequences $(s_t, a_t, r_t)$.
3: Compute Gradients: Compute the gradient of the expected cumulative reward with respect to the policy parameters according to the policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]$.
4: Update Policy: Update the policy parameters using gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate.
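As a concrete illustration of Algorithm 1, the following is a minimal REINFORCE-style sketch in PyTorch; it assumes a generic `policy` network with a discrete (categorical) output and precomputed returns, and is not the controller used in this paper.

```python
import torch

def policy_gradient_step(policy, optimizer, states, actions, returns):
    """One update of Algorithm 1: ascend E[grad log pi(a|s) * return].

    states  : tensor [B, obs_dim], actions : tensor [B], returns : tensor [B]
    """
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    # minimizing the negative objective is equivalent to gradient ascent on J(theta)
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```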
PPO is a widely used policy gradient algorithm. PPO balances the speed and stability of policy improvement by limiting the step size of policy updates, addressing the instability issues often encountered in policy gradient methods. The core idea of PPO is to introduce a clipped probability ratio objective function, which limits the extent of change between the new and old policies, ensuring stable updates.
PPO optimizes the following objective function:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right] \tag{3}$$

In Formula (3), $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ represents the probability ratio of the new policy to the old policy, indicating how much the new policy differs from the previous one. The term $\hat{A}_t$ is the estimated advantage function, which measures how much better a particular action is compared to the expected outcome. The clipping operation $\operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ limits the probability ratio to the range between $1-\epsilon$ and $1+\epsilon$, where $\epsilon$ is a hyperparameter that controls how much the policy can change in each update step. This prevents excessively large policy updates and helps stabilize learning.
By optimizing the above objective function, PPO effectively controls gradient explosion and vanishing problems during the policy update process. It also balances the speed and stability of policy improvement, enabling the robot to achieve stable action control in complex environments.
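For reference, a minimal PyTorch sketch of the clipped surrogate objective in Formula (3) is given below; the log-probabilities and advantage estimates are assumed to be precomputed, and the value of epsilon is illustrative.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective of Formula (3), returned as a loss to minimize.

    new_log_probs, old_log_probs, advantages : tensors of shape [B]
    """
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```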
2.3. The Design Structure of the Double-Layer Structure
The double-layer reinforcement learning framework proposed in this paper combines DDQN and PPO to improve the speed adjustment ability and stability of the quadruped robot Unitree Go1 in complex terrains. In the initial stage, the parameters of the DDQN and PPO models are initialized, the terrain information and the robot state are set as observation variables, and the robot environment and the Isaac Gym simulation platform are initialized. The algorithm then collects the current terrain information and robot state, such as position, speed, and posture, and feeds these data into the DDQN network to generate the walking speed command. The speed command serves as a high-level instruction that determines how fast the robot should move based on its environment and internal state; it provides a target speed that keeps the robot balanced and clear of obstacles while optimizing overall movement efficiency.
Subsequently, the PPO model takes this speed command, together with the current observations, and generates specific robot actions such as joint angles and gaits, which control the precise movement of the robot's legs and joints. The PPO model continuously refines these actions through policy optimization, helping the robot adapt to uneven or complex terrain while maintaining stability and smooth locomotion. After the robot executes the actions generated by PPO, the system updates its state, collects new observations, and feeds them back to the DDQN and PPO models for further optimization. During training, the DDQN algorithm updates the Q-values to optimize the walking speed strategy, and the PPO algorithm updates the policy network to optimize the action control strategy. These steps are executed in a loop until the model converges or reaches the predetermined number of training iterations, as illustrated in Algorithm 2.
Algorithm 2: Dual-Layer Reinforcement Learning Algorithm
1: Initialization:
2: Initialize parameters for the DDQN and PPO models
3: Set terrain information and robot state as observations
4: Initialize the environment and simulation platform (Isaac Gym) for Unitree Go1
5: loop
6: Terrain Information and State Observation:
7: Collect current terrain information and robot state (position, speed, posture, etc.)
8: Input the observation information into the DDQN network
9: DDQN Speed Command Generation:
10: The DDQN network outputs a walking speed command based on the input observations
11: PPO Action Control:
12: The PPO model generates specific robot actions (joint angles, gait, etc.) based on the speed command and current observations
13: The PPO model continuously adjusts actions through policy optimization to adapt to complex terrains and maintain stability
14: Execute Action and Update State:
15: Execute the actions generated by PPO and update the robot's state in the environment
16: Collect new observations and feed them back to the DDQN and PPO models
17: Training and Updating:
18: Use the DDQN algorithm to update Q-values and optimize the walking speed strategy
19: Use the PPO algorithm to update the policy network and optimize the action control strategy
20: if convergence or maximum training iterations reached then
21: Model Validation:
22: Perform detailed validation of the trained model in Isaac Gym
23: Evaluate the model's speed adjustment ability and stability in various terrains
24: Exit loop
25: end if
26: end loop
27: Performance Evaluation:
28: Assess the model's performance, ensuring high walking speed and stability across different terrains
The schematic diagram of the proposed method is shown in Figure 1. The dual-layer structure offers several advantages by separating high-level decision making from low-level control, improving both learning efficiency and motion precision. In the upper layer, DDQN reduces the risk of overestimating action values, allowing for more accurate decisions in complex environments, while PPO ensures stable policy updates, leading to smoother transitions in control actions. This setup enhances the system's ability to learn optimal strategies without being overwhelmed by fine-grained control details. Meanwhile, the lower layer, governed by the PD controller, precisely converts these high-level commands into torque outputs, ensuring the robot's movements are accurate and responsive. By decoupling strategic decision making from mechanical control, the dual-layer structure allows for more effective learning and adaptability, especially in dynamic and unpredictable environments.
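To make the control flow of Figure 1 and Algorithm 2 concrete, the following Python sketch outlines one high-level cycle of the dual-layer structure; the objects `env`, `ddqn`, `ppo_policies`, and `pd_controller` are placeholder interfaces, not the actual implementation used in this work.

```python
SPEED_COMMANDS = [0.5, 1.5, 2.5, 3.5]  # m/s, the discrete speed set defined in Section 2.5

def dual_layer_step(env, obs, ddqn, ppo_policies, pd_controller):
    """One high-level cycle: DDQN picks a speed command, a PPO policy produces
    joint targets, and the PD controller converts them into torques."""
    # Upper layer: discrete speed command from terrain + robot state
    speed_idx = ddqn.select_action(obs)               # e.g. epsilon-greedy over Q-values
    speed_cmd = SPEED_COMMANDS[speed_idx]

    # Middle layer: PPO policy associated with that speed outputs joint targets
    joint_targets = ppo_policies[speed_cmd].act(obs, speed_cmd)

    # Bottom layer: PD controller tracks the joint targets with torques
    torques = pd_controller(joint_targets, env.joint_positions, env.joint_velocities)
    next_obs, reward, done = env.step(torques)
    return next_obs, reward, done
```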
2.4. Terrain Information and Robot State Observation
In reinforcement learning tasks, choosing the right observation vectors is very important. These vectors help the robot understand both its environment and its internal state. They act as the robot’s “sensors”, giving the model real-time information about what is happening. For example, the robot needs to know its speed, position, and orientation, as well as details about the terrain or any obstacles it might face. This information allows the learning model to make better decisions. Without good observations, the robot would not know how to adjust its actions to achieve its goals. Therefore, clear and accurate observation data ensure the robot can learn quickly, adapt to new situations, and perform tasks successfully in complex environments.
In reinforcement learning tasks, the reasonable selection of robot state observations is very important. In this paper, the observation information of Unitree Go1 is divided into the following types: velocity v and acceleration a obtained from the inertial measurement unit (IMU); torque T, position p, and joint angular velocity collected from the joints; and the surrounding terrain height h extracted by the depth camera. At each time step, these pieces of information are integrated into a 235-dimensional vector.
The observations specifically include the following. The IMU data consist of the linear acceleration, the angular velocity, and the computed attitude (pitch, roll, and yaw angles). The linear acceleration is measured by the triaxial accelerometer, which captures the acceleration changes of the robot in the x, y, and z directions, while the angular velocity is measured by the triaxial gyroscope, which records the rotational speed of the robot in space. The joint information comes from multiple joint sensors, including joint position sensors, torque sensors, and angular velocity sensors. The joint position sensor records the absolute and relative angular positions of each joint; the torque sensor measures the torque T applied at the joint, helping to evaluate the load and resistance of each joint during movement; and the joint angular velocity sensor provides real-time data on the rotational speed of the joint. In addition, the depth camera captures the three-dimensional environment around the robot, generating point cloud data and height maps: by measuring the reflection time or deformation of light, it constructs depth information of the environment, from which the terrain height distribution and features such as the ground, obstacles, and slopes can be identified.
Through the real-time fusion of these multi-source sensor data, the robot can accurately perceive its own state and the surrounding environment and thus move stably and efficiently in complex terrains. In this study, both the DDQN-based speed command generation and the PPO-based action generation use the full set of observation information.
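As an illustration of how such a 235-dimensional observation could be assembled, a minimal sketch follows; the field names and the split of dimensions (e.g., the number of sampled terrain heights) are assumptions for illustration, not the exact layout used in this paper.

```python
import numpy as np

def build_observation(imu, joints, height_map):
    """Concatenate multi-sensor data into a single observation vector.

    imu        : dict with 'lin_acc' (3,), 'ang_vel' (3,), 'attitude' (3,)
    joints     : dict with 'pos' (12,), 'vel' (12,), 'torque' (12,)
    height_map : sampled terrain heights around the robot (flattened grid)
    """
    obs = np.concatenate([
        imu["lin_acc"], imu["ang_vel"], imu["attitude"],   # body state from the IMU
        joints["pos"], joints["vel"], joints["torque"],    # joint state from joint sensors
        height_map.ravel(),                                # local terrain heights from the depth camera
    ])
    return obs.astype(np.float32)   # fed to both the DDQN and PPO networks
```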
2.5. Speed Command Generation Based on DDQN
To achieve efficient motion control of the quadruped robot in complex terrains, this paper proposes a speed command generation method based on DDQN. In complex terrains, a single speed command cannot maintain good terrain adaptability: in rough terrain, an overly fast speed command prevents the robot from maintaining balance, while on flat terrain, an overly slow speed command makes the robot walk unnecessarily slowly. Therefore, the speed must be adjusted dynamically, and DDQN can effectively judge the terrain and output the corresponding command. The Q-network structure of this layer is shown in Figure 2:
In previous verifications, even on flat ground, the robot had difficulty walking at a speed exceeding 3.5 m/s. Therefore, the output of the network is one of four discrete speeds: 0.5, 1.5, 2.5, and 3.5 m/s. Since the task of the DDQN model is to maintain as fast a speed as possible, the reward function of this task is set as Equation (4), where the reward is designed to encourage actions that maximize speed while penalizing unstable or inefficient movements. This structure helps the DDQN model prioritize behaviors that achieve high velocity while maintaining balance and control, ensuring the robot performs well in dynamic environments:
$$r = \alpha\, v_x \tag{4}$$

where $v_x$ is the horizontal speed of the robot along its own heading direction, and $\alpha$ is a parameter that determines the reward difference corresponding to different speeds.
Taking the speed command as input, the corresponding action planning is carried out by the PPO-based control module. In each training step, the robot selects the appropriate speed command by perceiving changes in the environment and updates the Q-value function according to the actual walking result. To avoid the model overfitting to a specific terrain, a variety of terrain types are mixed during training, including flat ground, slopes, steps, and complex gravel roads.
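A minimal sketch of the upper-layer speed selection is given below; the network width and the epsilon-greedy exploration rate are illustrative and do not correspond to the exact structure shown in Figure 2.

```python
import torch
import torch.nn as nn

SPEEDS = [0.5, 1.5, 2.5, 3.5]  # m/s, discrete outputs of the DDQN layer

class SpeedQNet(nn.Module):
    """Maps the 235-dim observation to Q-values of the four speed commands."""
    def __init__(self, obs_dim=235, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, len(SPEEDS)),
        )

    def forward(self, obs):
        return self.net(obs)

def select_speed(q_net, obs, epsilon=0.05):
    """Epsilon-greedy speed command from the current observation (1-D tensor)."""
    if torch.rand(1).item() < epsilon:
        idx = torch.randint(len(SPEEDS), (1,)).item()
    else:
        idx = q_net(obs.unsqueeze(0)).argmax(dim=1).item()
    return SPEEDS[idx]
```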
2.6. Reward Function
In reinforcement learning, the reward function is one of the core elements guiding the agent's learning process: it provides positive or negative feedback, helping the agent make optimal decisions in its environment. The design of the reward function is key in this algorithm, as it directly affects the robot's ability to walk stably in complex terrains. Since different factors contribute differently to the robot's overall performance, a series of weighted reward functions is used to guide the robot's behavior in both positive and negative directions [33].
To determine the most effective combination of weights for these reward functions, we employed the grid search method, a widely used hyperparameter tuning technique. Grid search involves defining a grid of possible values for each weight and systematically evaluating the performance of the model for every possible combination of these values. In our case, we defined a range of candidate values for each reward component, such as penalties for instability or incentives for speed, and tested the algorithm across these values.
Each combination was evaluated by running simulations in various terrains, and the overall performance was measured by metrics such as walking stability, terrain adaptability, and motion efficiency. By comparing the results for different weight combinations, we identified the optimal set of weights that maximized performance across all terrains. This approach ensured that the robot could balance speed and stability under diverse conditions, leading to more reliable control. The specific reward functions and their associated weights are shown in Table 1.
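The following sketch illustrates the grid-search procedure described above; the weight names, candidate values, and the `evaluate` function are placeholders rather than the actual entries of Table 1.

```python
from itertools import product

# Hypothetical candidate values for a few reward weights
weight_grid = {
    "speed": [0.5, 1.0, 2.0],
    "orientation_penalty": [-0.1, -0.5],
    "joint_torque_penalty": [-1e-4, -1e-3],
}

def grid_search(evaluate):
    """Try every weight combination and keep the one with the best score.

    `evaluate(weights)` is assumed to run training/simulation in the target
    terrains and return a scalar score combining stability and speed metrics.
    """
    best_score, best_weights = float("-inf"), None
    keys = list(weight_grid.keys())
    for values in product(*(weight_grid[k] for k in keys)):
        weights = dict(zip(keys, values))
        score = evaluate(weights)
        if score > best_score:
            best_score, best_weights = score, weights
    return best_weights, best_score
```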
2.7. Motion Trajectory Generation Based on PPO
To achieve efficient motion planning of quadruped robots in complex terrains, this study further adopts a motion trajectory generation method based on PPO. The input of the PPO model is the same 235-dimensional observation vector used by DDQN. In order to train the robot to walk smoothly under different terrain complexities, this study uses four sets of terrain parameters for training on four different terrains, as shown in Table 2.
Four different models are used, one for each of the four speeds. When walking at a slow speed, the corresponding model mainly focuses on maintaining stability; its reward function is therefore adjusted to emphasize the robot's balance in rough terrains rather than speed. When walking at medium and fast speeds, the corresponding models balance the requirements of stability and speed. When walking at high speed, the reward function is more inclined towards increasing speed while still considering the balance requirements at that speed. Therefore, at each speed, the corresponding PPO model generates the optimal motion trajectory that meets the current speed requirement.
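A brief sketch of how the four speed-specific PPO policies might be stored and dispatched by the commanded speed is shown below; the file names and the TorchScript export format are assumptions for illustration.

```python
import torch

SPEEDS = [0.5, 1.5, 2.5, 3.5]  # m/s

def load_speed_policies(path_template="ppo_speed_{:.1f}.pt"):
    """Load one exported (TorchScript) PPO policy per speed command.
    The file names here are hypothetical placeholders."""
    return {v: torch.jit.load(path_template.format(v)) for v in SPEEDS}

def joint_targets(policies, obs, speed_cmd):
    """Dispatch the observation to the policy trained for the commanded speed."""
    with torch.no_grad():
        return policies[speed_cmd](obs)
```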
3. Experiment
Isaac Gym is a physics engine and development platform for robotics simulation and reinforcement learning developed by NVIDIA [34]. It combines high-performance physics simulation with large-scale reinforcement learning capabilities, enabling developers to train and simulate robots in a unified environment. Its NVIDIA GPU acceleration enables large-scale parallel physics simulation, allowing developers to train tens of thousands of robots simultaneously. In this study, PyTorch (https://pytorch.org/) and Isaac Gym (https://developer.nvidia.com/isaac-gym) are combined, with PyTorch used to build the neural networks. Ten thousand robots are trained simultaneously. Training is conducted on an Intel Core i9-11900 CPU and an NVIDIA GeForce RTX 3080 GPU for 3000 episodes, and each round of training takes approximately 150 min. In the speed command generation module, 500 episodes are conducted for each training run, which takes approximately 40 min. During training, the GPU utilization remains at approximately 80%, and the total memory usage averages around 10 GB.
In this study, the specific parameters of the quadruped robot Unitree Go1 are shown in Table 3:
During the training process, the duration of each training episode is 10 s. The command output structure based on DDQN runs at 2.5 Hz, while the position output structure based on PPO runs at 50 Hz, meaning that the speed command is re-determined every 20 position outputs. In addition, between every two adjacent position outputs, the PD controller performs five torque control updates. This study uses a 7 × 7 terrain grid, with each terrain cell measuring 8 × 8 m. The entire terrain layout is shown in the upper part of Figure 3. The specific size parameters of the terrains are obtained by randomly sampling the parameters used to train the four different models. The four terrain types, illustrated in the lower part of Figure 3, are concave slope, convex slope, cuboid obstacle, and stepping terrain with gaps, in proportions of 2:3:2:3. At the beginning of each training cycle, the robot's initial position is randomly set at the center of one of the terrain cells, and it moves forward in the direction it faces.
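The timing hierarchy described above can be summarized by the following sketch; the frequencies come from the text (2.5 Hz DDQN, 50 Hz PPO, five PD updates per PPO output, i.e. 250 Hz), while the loop structure and interfaces are illustrative placeholders.

```python
EPISODE_SECONDS = 10
DDQN_HZ, PPO_HZ, PD_PER_PPO = 2.5, 50, 5

def run_episode(env, select_speed, ppo_act, pd_step):
    """Nested control loop for one 10 s episode (interfaces are placeholders)."""
    obs = env.reset()
    for _ in range(int(EPISODE_SECONDS * DDQN_HZ)):      # 25 speed decisions per episode
        speed_cmd = select_speed(obs)                     # upper layer, 2.5 Hz
        for _ in range(int(PPO_HZ / DDQN_HZ)):            # 20 position outputs per command
            joint_targets = ppo_act(obs, speed_cmd)       # middle layer, 50 Hz
            for _ in range(PD_PER_PPO):                   # 5 torque updates per position
                obs = pd_step(env, joint_targets)         # bottom layer, PD control
    return obs
```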
Experimental Results and Analysis
To verify the effectiveness of the proposed method, the robot was made to walk forward for 10 s, and the effects of different models were compared based on the distance the robot walked. During training, the performance of a single model was compared with that of the four specially designed models; the single model was trained for a speed of 2.5 m/s. No early-stopping condition was set, and training ended after 500 episodes. The evolution of the reward over training episodes is shown in Figure 4. From these curves, it can be observed that, as training progresses, the agents gradually receive increasing rewards, indicating that they progressively improve their adaptation to the environment through learning.
At the same time, we conducted 10 experiments in random terrains, averaged the results, and compared the distances walked by the robot within 10 s using six different methods. The specific results are shown in Table 4; as the table shows, the proposed method maintains a fast moving speed while preserving stability by adjusting the walking speed to the terrain. The six methods are as follows:
Using a fixed speed command of 0.5 m/s.
Using a fixed speed command of 1.5 m/s.
Using a fixed speed command of 2.5 m/s.
Using a fixed speed command of 3.5 m/s.
Using a 2.5 m/s motion model and dynamically adjusting the speed using the speed generation module.
The double-layer reinforcement learning algorithm proposed in this paper.
To more intuitively show the advantages of the proposed dynamic adjustment method, this paper compares the gait differences between the fixed-speed method at 3.5 m/s and the proposed method when climbing an obstacle with a height of 0.5 m. The result is shown in Figure 5. It can be clearly seen that the proposed method adapts to complex terrains much better than the fixed-speed methods and achieves the longest walking distance among all the methods.