1. Introduction
The leg design of bipedal robots is mainly divided into bent-knee and telescopic types [1,2], as shown in Figure 1. The bent-knee type draws on the results of biological evolution, imitating the human gait and partly the way birds move. It increases the degrees of freedom of movement and allows the robot to adapt to different speeds and terrains, while the passive oscillation of the knee joint helps energy recovery. However, the bent-knee structure also introduces a larger torsional inertia, which increases control difficulty and leads to higher energy consumption of the knee motors [3,4]. In contrast, the telescopic leg uses a simplified joint design, driven primarily through the hip, with the knee function replaced by a linear telescopic leg [1]. Although it is not bionic in appearance, its control is more direct and it reduces dynamic loading, allowing simpler gait planning [5,6]. Telescopic joints are less common in the biological world but are easier to implement in engineering, making leg-length adjustment more efficient and meeting the need for simplified dynamics models. Control methods for telescopic-legged robots have made significant progress in recent research, with the central goal of balancing dynamic performance and stability. Approaches based on model simplification, such as the linear inverted pendulum model and the single-rigid-body model, have been widely adopted to reduce the multibody dynamics for real-time gait planning [1]. In telescopic-legged robots such as Slider, the sliding-joint design significantly reduces the leg inertia; combined with the center-of-mass trajectory generated by the linear inverted pendulum model and feedback control, this effectively reduces vertical motion deviation and improves walking stability. For highly dynamic motion, a further study proposed an inertia-shaping model that adjusts the inertia by optimizing the leg configuration, enhancing jumping efficiency and attitude control [7]. The combination of trajectory optimization algorithms (e.g., multi-stage nonlinear programming) and whole-body control makes it possible to generate and execute complex movements (e.g., twisting jumps), while contact detection based on mean spatial velocity, combined with force control strategies, achieves soft landing. In the L04 robot, the dual-slider telescopic leg design further optimizes the dynamic characteristics by emulating knee bending and leg-height adjustment. The split-coupled design of its hip joint reduces the number of motors, lowering energy consumption and cost [8]. With a linear inverted pendulum model and angular momentum trajectory planning, the L04 robot achieves stable forward and lateral walking. However, these conventional methods rely on accurate state observation, and the method of [8] does not enable steered walking.
With recent advancements in reinforcement learning (RL), an increasing number of researchers are investigating its application to the locomotion challenges of bipedal robots. RL is a machine learning paradigm in which an agent learns optimal behavior through interactions with its environment [9]. The agent explores various behaviors to maximize cumulative rewards, thereby learning to operate optimally in complex environments. This data-driven approach obviates the need for manual controller design in bipedal robots [10]. Current learning frameworks for bipedal locomotion are predominantly categorized into two approaches: reference trajectory imitation and cyclic gait encouragement [11,12]. Trajectory imitation requires high-quality data and the formulation of an imitation reward function that incentivizes the robot to adhere to the reference data. Xue et al. combined motion imitation with task-specific objectives [11]: the robot is trained using motion capture data, yielding high-quality and robust movements, and the approach further facilitates a wider array of complex maneuvers, such as dynamic rotations, kicks, and flips with intermittent ground contact. Li et al. developed robust walking controllers by emulating reference motions and validated their performance on the Cassie robot at varying speeds [13]. In [14], a framework that emulates human motion data allowed a bipedal robot to effectively learn environment-appropriate movements. The approach of promoting periodic walking entails guiding the robot toward a periodic gait through the use of periodic signals. Siekmann et al. utilized an RL-based approach to learn Cassie's walking gait by designing a reward function grounded in clock signals, thereby achieving a periodic gait [12]. In [15], curriculum learning was introduced to enhance periodic walking, and omnidirectional walking in bipedal robots was accomplished through the progressive escalation of task difficulty. Wu et al. expedited the training of periodic gaits on the Cassie robot by utilizing custom sinusoidal curves [16]. In [17], periodic joint references were generated using state machines and cosine curves, enabling bipedal walking through a hybrid approach combining feedforward and RL techniques. However, trajectory imitation requires extensive high-quality data [18]; acquiring such data is challenging, and the quality of training depends heavily on it. Promoting periodic walking may lead to unnatural gait patterns, including persistently splayed lateral joints or one foot consistently positioned ahead of the other. None of these methods take into account the special structure of the L04, which makes it difficult to maintain the parallel attitude of the dual telescopic rods, and existing motion data cannot be retargeted to the mechanical structure of the L04.
To address the aforementioned issues, our contributions are as follows:
We propose a novel walking learning framework that synergistically combines prior feedforward knowledge with RL.
We derive the forward and inverse kinematics of the parallel dual-slider telescopic leg robot and design prior knowledge and feedforward actions.
We implement sim-to-real transfer on the physical L04 and design smooth regularization terms to improve the sim-to-real performance.
2. Background
The L04 robot is depicted in Figure 2. The robot comprises 17 joints and is structured into four main components: the head, arms, hips, and legs [8]. The head incorporates two rotary joints, each arm is equipped with three, the hips accommodate five, and each leg features two linear joints. All joints are independently driven by motors. The left and right hip transverse roll joints utilize a motor-driven split configuration, and the left and right hip steering joints likewise adopt a split-pair design. Four motors are integrated within the hip region. The coupled split design improves efficiency and minimizes energy consumption, thereby enhancing the robot's overall performance.
The control relationship between the linear drive and the corresponding joint angles within the linkage-driven mechanism is also depicted in Figure 2. The lateral bifurcation five-bar linkage mechanism comprises the crossbar A, upper linkage B, lower linkage C, bottom linkage D, and a central axis implemented as a rolling screw rod. The crossbar A moves vertically along the central axis, causing the linkage C to swing laterally. The rotary linear drive at the base moves the crossbar A along the bottom centerline, enabling forward and backward motion, which in turn drives the linkage C to produce rotational swinging of the hip. The lateral screw rod, oriented vertically downward and perpendicular to the horizontal rotational screw axis, enables the synchronous coupling of lateral and steering movements in the hip of a bipedal robot. Furthermore, this design integrates all degrees of freedom into a compact structure, realizing hip rotation and lateral swinging through a dual-coupling bifurcation mechanism and thereby significantly reducing power consumption during locomotion.
3. Design of Prior Knowledge
Conventional RL methods utilize the representational power of neural networks to develop walking strategies in an end-to-end fashion. Most data-driven walking approaches neglect the robot’s underlying physical principles and prior knowledge, instead focusing on training policies end-to-end, from sensor data to joint position commands. However, such data-driven methods necessitate meticulously designed reward functions and substantial training time, yet frequently fail to produce natural and graceful gaits. To enable bipedal robots to generate natural and graceful walking patterns, researchers commonly design imitation-based rewards, incentivizing robots to replicate a predefined reference motion. Conventional methods for generating reference motions include video-based learning or the utilization of costly motion capture systems. However, the dual-telescopic-leg structure of the L04 poses significant challenges in retargeting human motion data. Therefore, we fully leverage the robot’s prior knowledge and integrate it with RL to achieve stable walking.
We use a Cartesian coordinate system whose origin O is the midpoint of the line connecting the lateral symmetric rotation points of the hip, as shown in Figure 3. The COM is positioned 0.17 m directly above O. This setting is based on the actual physical structure and mass distribution of the L04 robot. The positive direction of the X-axis is defined as the forward direction of the robot; the positive direction of the Y-axis is defined as the direction in which the robot moves to the left; and the positive direction of the Z-axis is defined as the direction in which the robot's center of mass moves vertically downward. The vertical distance from the origin O to the ground is denoted h. The hip split-rotation (steering) joint angles are denoted α_l and α_r; the pair-split structure of the robot's hip joint ensures that α_l and α_r are equal. Points A and A′ are the centers of rotation of the left and right hip lateral joints, respectively, and the corresponding lateral joint rotation angles are denoted β_l and β_r. Due to the pairwise split structure design, β_l = β_r. The left key points O, A, B, and C always lie in the same plane; similarly, the right-side points O, A′, B′, and C′ always lie in the same plane. The swing angles of the right and left hip joints are denoted γ_r and γ_l, respectively, and l_l and l_r denote the lengths of the left and right legs. Apart from the changes in leg length, the lengths of the hip linkages are fixed: a is the horizontal distance from the steering center O to the rotation center A of the hip lateral rotation joint; b is the vertical distance from the rotation center A of the hip lateral rotation joint to the rotation center C of the hip joint; and c is the horizontal distance from the rotation center A to the rotation center C.
Forward kinematics refers to determining the position and orientation of the virtual leg end effector in Cartesian space given the joint angles (α, β, γ, l) in joint space. Here, a geometric method of solution is employed. For the left leg, let the length of the virtual leg be L; the position of the left leg end point F in Cartesian space along the X, Y, and Z directions can then be expressed in terms of the joint angles and the fixed link parameters a, b, and c. Combined with the virtual ankle angle formed by the front and rear legs, the pose at the end point F of the virtual leg can be derived.
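As a concrete illustration, the geometric construction above can be sketched in code. The rotation order, the link-offset layout, and the default parameter values below are our own assumptions for illustration, not the L04's actual calibrated geometry.

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def fk_left(alpha, beta, gamma, leg_len, a=0.10, b=0.05, c=0.03):
    """Position of the left virtual-leg end point F in the hip frame O.

    alpha: steering angle (about Z), beta: lateral angle (about X),
    gamma: hip swing angle (about Y), leg_len: virtual leg length L.
    a, b, c: fixed link offsets (hypothetical values); Z points down.
    """
    Rz = rot_z(alpha)
    A = Rz @ np.array([0.0, a, 0.0])       # lateral rotation center A
    C = A + Rz @ np.array([c, 0.0, b])     # hip joint rotation center C
    # leg direction: unit down-vector rotated by roll, pitch, then steering
    d = Rz @ rot_x(beta) @ rot_y(gamma) @ np.array([0.0, 0.0, 1.0])
    return C + leg_len * d                 # end point F
```

With all angles zero and the assumed offsets, F lies directly below C at depth b + L, i.e., fk_left(0, 0, 0, 0.6) gives (0.03, 0.10, 0.65); a positive swing angle moves F forward along X.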
Virtual leg inverse kinematics refers to solving the joint angles (α, β, γ, l) given the known position and pose of the virtual leg end point F in Cartesian space. To solve the inverse kinematics, the forward kinematics is used to obtain constraint equations containing the known position of the end point F and the steering variables. Newton's iterative method is used to solve for two of the joint variables, which are then substituted into the constraint equations to obtain the steering angle and the virtual ankle joint angle.
Using the above derivation, the swing angle of the hip joint, the angle of the virtual ankle joint, and the geometric length of the virtual leg can be obtained.
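To illustrate the Newton-iteration step, the sketch below solves a reduced sagittal-plane version of the inverse kinematics, recovering a swing angle g and virtual leg length L from a desired end point (x_d, z_d). The planar residual and two-variable reduction are simplifying assumptions, not the full L04 constraint equations.

```python
import numpy as np

def ik_newton(x_d, z_d, gamma0=0.0, l0=0.5, iters=20, tol=1e-10):
    """Newton's method on the planar residual f = (L sin g - x_d, L cos g - z_d)."""
    g, L = gamma0, l0
    for _ in range(iters):
        f = np.array([L * np.sin(g) - x_d, L * np.cos(g) - z_d])
        if np.linalg.norm(f) < tol:
            break
        # analytic Jacobian of f with respect to (g, L)
        J = np.array([[L * np.cos(g), np.sin(g)],
                      [-L * np.sin(g), np.cos(g)]])
        g, L = np.array([g, L]) - np.linalg.solve(J, f)
    return g, L
```

For a target (0.1, 0.6) this converges in a few iterations to g = atan2(0.1, 0.6) ≈ 0.1651 and L = sqrt(0.1² + 0.6²) ≈ 0.6083.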
By analyzing fundamental human walking patterns, we observe that the foot’s position relative to the COM follows a sinusoidal trajectory. Human-like alternating foot-lifting gaits are achieved by regulating the foot placement of the virtual leg using a sinusoidal function. The frequency and amplitude of the sinusoidal function are dynamically adjusted to control the foot-lifting motion during walking. However, relying exclusively on the sinusoidal strategy may lead to instability during foot contact and lift-off, including a bouncing effect. To mitigate this issue, we incorporate a double-support phase into the sinusoidal function. During the double-support phase, both feet maintain contact with the ground, ensuring seamless gait transitions. A sinusoidal function regulates the Z-axis variations of the foot landing points. The foot-lifting height is dynamically calculated and adjusted in real time according to the gait cycle phase.
Here, z_max is the maximum position that the virtual leg can reach in the Z-direction, z_min is the minimum position that the virtual leg can reach in the Z-direction, and k is a parameter used to adjust the duration of the double-support standing phase.
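One common way to realize a sinusoidal lift profile with an embedded double-support phase is to clamp the lower part of the sine wave so that the foot stays on the ground for a tunable fraction of the cycle. The sketch below is our own illustrative construction using the parameters named above (z_max, z_min, k); it is not the paper's exact formula.

```python
import numpy as np

def foot_height(phase, z_max=0.08, z_min=0.0, k=0.3):
    """Foot lift height over one gait cycle, phase in [0, 1).

    Whenever sin(2*pi*phase) <= k the foot is clamped to z_min, so a
    larger k in [0, 1) lengthens the double-support (flat) portion.
    """
    s = np.sin(2.0 * np.pi * phase)
    return z_min + (z_max - z_min) * np.maximum(s - k, 0.0) / (1.0 - k)
```

The peak at phase 0.25 reaches z_max exactly, while the entire second half-cycle (and a margin around it, controlled by k) stays at z_min, giving a seamless touch-down and lift-off.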
During the foot-lifting phase, the COM undergoes an offset in the Y-direction towards the supporting foot. As the foot makes contact with the ground, the COM offsets back towards the origin in the Y-direction. In the frame of the COM, the changes in the foot's landing points in the Y-direction also conform to a sinusoidal curve, exhibiting the same periodic characteristics as the changes in the Z-direction of the foot's landing points.
Feedforward control alone enables the robot to perform alternating leg movements while airborne. However, upon landing, the robot can only remain stationary, stepping briefly before it falls, because the feedforward reference cannot accommodate changes in body posture. In the initial state, there is also the problem of not being able to distinguish which foot should swing first: if one foot is always fixed to swing first, the gait becomes one-sided. Therefore, the phase variable t is randomized in the reset phase so that either foot may lead.
4. Walking Learning Framework Design
During the leg-lifting phase of the L04, it is crucial to maintain consistent ground clearance for the bottoms of the dual telescopic rods of each leg to ensure a natural and aesthetically pleasing gait. End-to-end methods typically rely on reward functions to encourage uniform ground clearance of the parallel telescopic legs; however, this approach does not guarantee the desired policy and significantly increases exploration time. Moreover, as discussed above, the dual-telescopic leg structure poses challenges for retargeting human motion data as imitation references. To address these issues, we designed an L04 walking learning method based on PFRL. The PFRL method integrates feedforward and feedback mechanisms, as shown in Figure 4. This method extends traditional neural network-based RL by introducing a novel feedforward control component that operates independently and in parallel with the neural network. This design enables the simultaneous processing of prior knowledge and feedback information, thereby enhancing the robustness of the entire framework. PFRL ensures the aesthetic quality of the parallel telescopic legs during the swing phase using the virtual leg model while also reducing the action space. Notably, we set the virtual leg's ankle angle to zero; this not only keeps the foot parallel to the ground in real time but also avoids excessive and unnecessary exploration of the ankle joint. Unlike common gait learning methods, our policy network outputs foot placement positions. The robot's actions are derived from the sum of the feedforward reference and the policy's foot placement output, which is converted into joint positions through the virtual leg model.
4.1. Policy Construction
The state space of the policy network input is defined as follows:
Feedforward joint positions: the feedforward landing points converted to joint positions by the virtual leg model.
Body attitude: the roll and pitch of the body's COM.
Body angular velocity: the roll, pitch, and yaw rates of the body's COM.
Joint positions: the current joint positions of the robot's lower body.
Joint velocities: the current joint velocities of the robot's lower body.
Phase signal: the gait cycle phase.
Last action: the previous action output by the policy.
Speed command: the commanded walking speed.
Unlike common end-to-end approaches, the output of the virtual leg model-based framework is a six-dimensional vector representing the real-time target positions of the robot's feet, expressed relative to the COM coordinate system. It is worth noting that, due to the structural advantages of the L04, the left and right lateral target positions output by the policy network are set to the same value, so the action space contains only five dimensions.
The policy network outputs the landing-point positions, which are then expanded to conform to the hip pair-split mechanism; the left and right expanded components are equal. The expanded landing points are added to the feedforward landing points, and the joint target positions are solved by the virtual leg model. The steering joints are learned directly by the end-to-end method, with the target positions of the left and right steering joints kept equal in real time. The policy network outputs target landing-point positions at 40 Hz. The target positions solved by the virtual leg model are filtered by a low-pass filter and fed to a PD controller running at 1000 Hz. The sampling period of the low-pass filter is 0.001 s, and the cutoff frequency is 4 Hz. The PD controller uses low stiffness and outputs torque to control the biped robot's walking through position control. With the PD controller, the target position is converted into motor torque as τ = K_p (q_target − q) + K_d (q̇_target − q̇), where the target joint velocity q̇_target is set to 0.
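The filtering and PD stage described above can be sketched as follows. The first-order filter form and the gain values are illustrative assumptions; the text specifies only the 0.001 s sampling period, the 4 Hz cutoff, and the zero target velocity.

```python
import math

class LowPass:
    """First-order low-pass filter: y += a * (x - y), a = dt / (dt + 1/(2*pi*fc))."""
    def __init__(self, fc=4.0, dt=0.001):
        self.a = dt / (dt + 1.0 / (2.0 * math.pi * fc))
        self.y = None

    def step(self, x):
        self.y = x if self.y is None else self.y + self.a * (x - self.y)
        return self.y

def pd_torque(q_target, q, dq, kp=20.0, kd=0.5):
    """Low-stiffness PD position control with zero target joint velocity."""
    return kp * (q_target - q) + kd * (0.0 - dq)

# 1 kHz inner loop tracking one held 40 Hz policy target (hypothetical values)
lp = LowPass()
target = 0.3
for _ in range(1000):
    q_filt = lp.step(target)
tau = pd_torque(q_filt, q=0.0, dq=0.0)
```

After one second of filtering, the filtered target has converged to the commanded 0.3 rad, and the PD stage converts the remaining position error into torque.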
4.2. Rewards
Most existing studies establish complex reward engineering to constrain the training process. However, this introduces a large number of reward-function coefficients, which further complicates training. To best demonstrate the effect of the structure on sim-to-real results, we implement the walking motion using only three of the most common reward functions in bipedal training: a velocity tracking reward, a stability reward, and a periodic foot-lift reward.
The velocity tracking reward encourages the robot to follow the commanded velocities. Here, v_cmd^xy denotes the robot's commanded linear velocity in the x and y directions, v^xy the robot's actual linear velocity during walking, ω_cmd the commanded angular velocity, and ω^z the robot's actual angular velocity about the z-axis.
The stability reward penalizes deviations of the body attitude, where φ and θ are the roll and pitch of the robot's COM, respectively.
To ensure that the virtual leg length varies periodically and within a reasonable range, a periodic reward mechanism is designed based on the virtual leg model. This mechanism guides the robot during walking so that the change in its virtual leg length follows a predetermined periodic pattern while remaining within reasonable bounds. Specifically, the periodic reward is defined over l_v, the current virtual leg length obtained by summing the feedforward and feedback components.
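A minimal sketch of the three rewards follows, assuming the common exponential-kernel form for tracking terms; the kernel shapes, scale factors, and leg-length bounds are our own assumptions, since the exact expressions are not reproduced here.

```python
import numpy as np

def velocity_tracking_reward(v_cmd_xy, v_xy, w_cmd, w_z, scale=5.0):
    """Exponential kernel on linear (x, y) and yaw-rate tracking errors."""
    lin_err = np.sum((np.asarray(v_cmd_xy) - np.asarray(v_xy)) ** 2)
    ang_err = (w_cmd - w_z) ** 2
    return np.exp(-scale * (lin_err + ang_err))

def stability_reward(roll, pitch, scale=5.0):
    """Penalizes COM roll and pitch deviations from upright."""
    return np.exp(-scale * (roll ** 2 + pitch ** 2))

def periodic_leg_reward(leg_len, leg_ref, l_min=0.4, l_max=0.7, scale=50.0):
    """Rewards tracking a periodic leg-length reference within bounds."""
    if not (l_min <= leg_len <= l_max):
        return 0.0  # out of the reasonable range: no reward
    return np.exp(-scale * (leg_len - leg_ref) ** 2)
```

Each term equals 1 at perfect tracking and decays smoothly with error, which keeps the reward gradient informative throughout training.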
4.3. Smooth Regularization
To achieve sim-to-real transfer of the policy, Reference [19] improves robustness through regularization techniques that keep the actions smooth after the policy is disturbed by noise. Compared with drones, bipedal robots experience greater noise and external interference, such as fluctuations in IMU data when the robot touches the ground and disturbances from different terrains. Using only the CAPS method [19], a bipedal robot therefore still converts noise interference into small-range oscillations of the joint positions, although the method does suppress large-range action mutations and thereby protects the hardware. The regularization in CAPS is divided into an abrupt-change (temporal) smoothing term L_T and an oscillation (spatial) smoothing term L_S. L_T requires that the current action be similar to the next action under the same policy, while L_S requires that similar input states map to similar actions under the same policy. The regularization terms used in the CAPS method are L_T = D(π(s_t), π(s_{t+1})) and L_S = D(π(s_t), π(s̄_t)), where s̄ is drawn from a distribution around s.
In contrast to the CAPS approach, we define the noise as a spherical space B(s, r) with radius r, where distances are computed with the L2 norm, i.e., ‖s̄ − s‖₂ ≤ r. All states in this spherical perturbation space should yield similar policies π(s̄). However, we cannot compute the policies for all states in the space; therefore, we pick the most adversarial perturbed state s̄* in B(s, r) by s̄* = argmax_{s̄ ∈ B(s, r)} D(π(s), π(s̄)), where D is the distributional distance between the two policies under the noise perturbation. We use the L2 norm to compute D.
Restricting the fluctuation between the policies π(s_t) and π(s_{t+1}) to the next moment alone cannot suppress the small oscillations caused by the noise of a bipedal robot. Therefore, we also compute the fluctuation between π(s_t) and π(s_{t+2}), suppressing small-scale oscillations by limiting the fluctuations among the three policies. In addition, the state-value function V(s) becomes unstable when the training exploration process is perturbed by state noise, so we impose a smoothing constraint on V(s) to increase its robustness to noise.
The total noise-robust regularization term combines the three constraints as L_reg = λ_T L_T + λ_S L_S + λ_V L_V, where λ_T, λ_S, and λ_V are regularization weights.
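The sketch below illustrates the regularization terms under the description above: a temporal term over three consecutive actions, a worst-case spatial term over sampled states in the L2 ball, and a value-smoothing term. The sampling scheme, function signatures, and default weights are illustrative assumptions, not the trained system's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_term(a_t, a_t1, a_t2):
    """Limits fluctuation across pi(s_t), pi(s_t+1), pi(s_t+2)."""
    return np.linalg.norm(a_t1 - a_t) + np.linalg.norm(a_t2 - a_t1)

def spatial_term(policy, s, radius=0.05, n_samples=16):
    """Worst-case action deviation over sampled states in the L2 ball B(s, r)."""
    worst = 0.0
    a = policy(s)
    for _ in range(n_samples):
        d = rng.standard_normal(s.shape)
        # rescale to a uniform sample inside the radius-r ball
        d *= radius * rng.uniform() ** (1.0 / s.size) / np.linalg.norm(d)
        worst = max(worst, np.linalg.norm(policy(s + d) - a))
    return worst

def value_term(value_fn, s, s_bar):
    """Smoothing constraint on the state-value function V."""
    return abs(value_fn(s_bar) - value_fn(s))

def total_regularizer(lt, ls, lv, w_t=1.0, w_s=1.0, w_v=0.5):
    """Weighted sum L_reg = w_t*L_T + w_s*L_S + w_v*L_V."""
    return w_t * lt + w_s * ls + w_v * lv
```

For a Lipschitz policy the spatial term is bounded by the Lipschitz constant times the ball radius, which is why shrinking r directly tightens the allowed action oscillation.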