1. Introduction
With the rapid advancement of artificial intelligence and autonomous driving technologies, intelligent vehicles represented by autonomous cars have become a research hotspot in special transportation, logistics distribution, and other fields. However, in complex real-world scenarios, the trajectory tracking control systems of intelligent vehicles face severe challenges. For example, in vehicle motion planning, the coupling between the nonlinear planar single-track vehicle model and the dynamic wheel model significantly increases control complexity [1], while the complexity of driving scenarios and these dynamic coupling relationships further increase the computational burden of vehicle motion control [2,3,4]. Therefore, there is an urgent need to develop vehicle trajectory tracking control systems with high computational efficiency and excellent real-time performance [5,6].
As one of the core technologies in vehicle automatic control, trajectory tracking aims to guide vehicles to accurately follow pre-defined trajectories through advanced control algorithms and system architectures, thereby enhancing safety and control efficiency [7,8,9]. Among the numerous control methods, Model Predictive Control (MPC) is widely used because it can explicitly handle constraints and predict future states [10,11,12,13,14]. However, MPC must solve a finite-horizon optimization problem online, and its computational efficiency and solving capability directly determine control accuracy and real-time performance, which in turn affect the overall tracking and stability control performance [15,16,17].
With its advantages of autonomous learning and model-free control, reinforcement learning (RL), and in particular the Deep Deterministic Policy Gradient (DDPG) algorithm, has attracted extensive research in vehicle trajectory tracking [18]. A variety of optimization strategies have been explored for trajectory tracking, such as a reward function mechanism designed with logarithmic functions [19]; a DDPG controller that directly uses the lateral deviation, heading angle deviation, and curvature of the target trajectory as state inputs [20]; and the dynamic selection and switching of the optimal underlying policy model based on real-time conditions, which can effectively improve system accuracy during trajectory following [21]. On the other hand, hybrid approaches combining RL with traditional model-based control have been widely explored. For instance, using RL to optimize compensation parameters online (e.g., front-wheel steering angle and acceleration) for traditional controllers can significantly improve real-time response speed and adaptability to high-precision path tracking requirements [22]. Architectures that directly embed the outputs of pre-trained deep RL models into predictive control frameworks and fuse them via quadratic programming have achieved effective tracking of complex paths, such as double-lane-change and slalom trajectories [23]. In specific application scenarios (e.g., logistics AGVs), hybrid control schemes combining RL with traditional PI control have also demonstrated adaptive learning capabilities [24]. Meanwhile, multi-core online RL controllers based on advanced planning algorithms (e.g., DHP) have been validated in simulation environments and shown to exhibit superior tracking accuracy and control smoothness under complex road conditions (S-curves and urban roads) [25]. The introduction of disturbance observers and event-triggering techniques provides new ideas for addressing system robustness and computational efficiency [26,27].
While these control schemes have demonstrated potential in specific scenarios, RL-based trajectory tracking methods (e.g., DDPG) still face core challenges, such as insufficient adaptability to complex and variable trajectories (especially sharp curvature changes) and low training efficiency, which limit their performance in full-scenario applications requiring high real-time performance and strong robustness. Therefore, designing a novel controller that combines dynamic control strategy adjustment with an efficient optimization algorithm holds significant research value and application prospects. This paper focuses on a combined control mechanism for autonomous vehicle trajectory tracking that integrates a reinforcement learning algorithm with a heuristic algorithm, the Crayfish Optimization Algorithm (COA), together with the concept of symmetry. Trajectories of different symmetry types, such as axially symmetric and centrally symmetric trajectories, were designed, and simulation comparisons were conducted. This study analyzes error variations during tracking control and evaluates adaptability to different trajectories. The findings provide a new framework for the trajectory tracking control of intelligent vehicles, including autonomous cars.
2. Kinematic Modeling of Autonomous Vehicle Motion Control
Kinematics analyzes the motion laws of objects from a geometric perspective and can describe how an autonomous vehicle's position, heading, and velocity evolve. In practical motion control, it involves coordinate translation transformations and the definition of the reference heading angle. Therefore, choosing appropriate reference coordinate systems and kinematic models is crucial.
The motion of an autonomous vehicle typically involves two coordinate systems: the inertial coordinate system, denoted as XOY, and the body coordinate system, denoted as xoy. The XOY is used by the inertial navigation system, while the xoy is employed to describe the relative motion of the autonomous vehicle.
As shown in Figure 1, the inertial coordinate system XOY in this paper is defined as follows: the X-axis points due east, the Y-axis points due north, and the Z-axis points upward. The body coordinate system xoy is defined such that the x-axis points directly forward of the vehicle and the y-axis points to the left side of the vehicle. The vehicle's yaw angle $\varphi$ is the angle between the x-axis of the body coordinate system and the X-axis of the inertial coordinate system, with the counterclockwise direction defined as positive.
Assuming the autonomous vehicle is undergoing arbitrary curvilinear or rectilinear motion at a certain moment, we establish the steering motion model, in which $(x_r, y_r)$ and $(x_f, y_f)$ are the coordinates of the rear and front axle centers in the inertial coordinate system, respectively; $v_r$ is the velocity at the rear axle center, in m/s; $l$ is the wheelbase, in m; $R$ is the instantaneous turning radius at the rear axle center, in m; and $\delta_f$ is the front-wheel steering angle, in rad. The velocity at the rear axle center $v_r$ is:
The kinematic constraints for the front and rear axles are:
Combining Equations (1) and (2), we obtain:
Based on Equation (3) and the geometric relationship between the front and rear wheels, it can be derived that:
Substituting Equations (3) and (4) into Equation (2), the yaw rate can be obtained as:
where $\omega$ is the yaw rate; the turning radius $R$ and the front-wheel steering angle $\delta_f$ can then be readily obtained as:
From Equations (3) and (5), the kinematic model of the autonomous vehicle can be derived as:
where $\varphi$ is the current heading angle and $v_r$ is the velocity of the rear axle center. The model can be rewritten in a more compact form as:
In the equation, the state vector is $\boldsymbol{\xi} = [x, y, \varphi]^{T}$ and the control vector is $\boldsymbol{u} = [v_r, \delta_f]^{T}$. In the path tracking control of unmanned vehicles, it is often desirable to use the velocity and the yaw rate as the control quantities. Combining Equations (5) and (7), the kinematic model of the autonomous vehicle can be transformed into the following form:
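As an illustration, a discrete-time version of this kinematic model with velocity and yaw rate as control inputs can be sketched as follows; this is a minimal sketch assuming simple Euler integration, and the function names and time step are ours rather than the paper's.

```python
import math

def kinematic_step(x, y, phi, v, omega, dt=0.05):
    """One Euler-integration step of the kinematic model with
    rear-axle velocity v [m/s] and yaw rate omega [rad/s] as inputs."""
    x_next = x + v * math.cos(phi) * dt      # position update along X
    y_next = y + v * math.sin(phi) * dt      # position update along Y
    phi_next = phi + omega * dt              # heading update
    return x_next, y_next, phi_next

def yaw_rate_from_steering(v, delta_f, wheelbase):
    """Equivalent yaw rate for a given front-wheel steering angle
    (single-track geometry: omega = v * tan(delta_f) / l)."""
    return v * math.tan(delta_f) / wheelbase
```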
In the motion control of autonomous vehicles, the heading angle is a critical concept. Although the term originates in ship navigation, it equally describes the directional state of ground vehicles and is commonly used in navigation and tracking control to determine the absolute direction of the vehicle. It serves as a key reference index for determining the driving direction and its angular changes, thereby enabling precise tracking and control.
In the ground coordinate system, the heading angle refers to the angle between the direction of the vehicle's center-of-mass velocity and a reference axis (usually the true north direction or the X-axis of the coordinate system), and it describes the vehicle's orientation in that coordinate system. Generally, the heading angle ranges from 0° to 360°, i.e., from 0 to 2π in radians, with the clockwise direction defined as positive and the counterclockwise direction as negative.
Suppose the current heading angle of the autonomous vehicle is $\varphi$ and the reference heading angle is $\varphi_{ref}$, both in rad. The calculation equation for the heading angle deviation is:
where the calculation equations for $\varphi$ and $\varphi_{ref}$ are:
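As an illustration of how the heading deviation can be evaluated in practice, the reference heading may be taken from the local direction of the target trajectory and the deviation wrapped into [−π, π]; this is our own sketch rather than the paper's exact equations.

```python
import math

def heading_deviation(phi, x_ref, y_ref, x_ref_next, y_ref_next):
    """Heading deviation between the vehicle heading phi and the local
    direction of the reference trajectory, wrapped to [-pi, pi]."""
    phi_ref = math.atan2(y_ref_next - y_ref, x_ref_next - x_ref)  # local trajectory direction
    e_phi = phi_ref - phi
    # wrap the deviation into [-pi, pi] to avoid 2*pi jumps
    return math.atan2(math.sin(e_phi), math.cos(e_phi))
```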
Reinforcement learning (RL), a sub-field of machine learning, is a paradigm that achieves optimal decision making through trial and error and dynamic interaction with the environment. Its principle is illustrated schematically in Figure 2:
As shown in Figure 2, the agent is guided indirectly by reward signals during its interaction with the environment. It continuously adjusts its behavior through trial and error to adapt to the environment's rules, aiming to maximize the long-term cumulative reward and thereby achieve specific goals. Meanwhile, it optimizes its decision making by constructing corresponding models and value functions to autonomously explore the optimal behavior of the control system. The specific process is as follows (a code sketch of this loop is given after the list):
- (1)
The agent observes the current state of the environment and the reward signal.
- (2)
Based on the current state and policy, the agent selects an action.
- (3)
The agent executes the selected action in the environment.
- (4)
The environment returns a reward and transitions to the next state.
- (5)
The agent updates its policy based on the received reward and the new state.
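The following minimal sketch illustrates this generic agent–environment loop using a Gymnasium-style interface; the environment and the random action choice are placeholders rather than the vehicle environment used in this paper.

```python
import gymnasium as gym

env = gym.make("Pendulum-v1")            # placeholder continuous-control task
obs, info = env.reset(seed=0)

for step in range(200):
    # (1)-(2): observe the state and pick an action from the current policy
    action = env.action_space.sample()   # stand-in for a learned policy
    # (3)-(4): execute the action; the environment returns reward and next state
    obs, reward, terminated, truncated, info = env.step(action)
    # (5): a learning agent would update its policy here using (obs, action, reward)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```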
Supervised learning finds the optimal model parameters from a training dataset sampled from a given data distribution. Reinforcement learning obtains its data differently: the data come from the interaction between the agent and the environment, so if the agent never takes a certain action, the data corresponding to that action are never acquired. The agent's current training data therefore come from the decisions of earlier versions of the agent, meaning that different policies lead to different data distributions generated by interaction with the environment. To address the data distribution mismatch caused by this policy iteration, researchers have proposed deep reinforcement learning (DRL) methods that integrate RL with deep learning. This approach can feed multi-dimensional information into the algorithm, combining the decision-making process of RL with the large-scale data processing capability of deep learning; neural networks are then used to approximate the relevant functions, and the resulting actions are output to control the agent.
3. Optimization of DDPG Trajectory Tracking Controller Based on Crayfish Algorithm
The DDPG algorithm is an off-policy learning algorithm in the Actor–Critic family. It addresses the limitation that the DQN algorithm can only handle discrete, low-dimensional action spaces. The Actor network makes action decisions based on the state input and stores the current state information for sampling and updating the network parameters. The Critic network judges the quality of these decisions by calculating the Q-value of the Actor network's actions. The main framework of the DDPG algorithm is as follows:
According to Figure 3, in the DDPG algorithm flow, the Actor network receives the agent's current state as input and outputs an action $a_t$. The Actor network is mainly responsible for action selection, while the Critic network is responsible for action evaluation. During the interaction between the agent and the environment, a series of tuples $(s_t, a_t, r_t, s_{t+1})$ of states, actions, and rewards is obtained. Throughout the algorithm, the information in these tuples is stored in the experience pool (replay buffer) and updated.
During training, the target network parameters of DDPG are updated using the following equation:
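For reference, the standard DDPG soft (Polyak) target-network update takes the following form, where $\tau \ll 1$ is the soft-update rate and $\theta^{Q}$, $\theta^{\mu}$ are the critic and actor parameters:

```latex
\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad
\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}
```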
The pseudocode process of the DDPG algorithm is shown in Algorithm 1:
Algorithm 1: Specific Process of the DDPG Algorithm
…
8. Randomly sample a batch of tuples from R;
…
11. Update the target network according to Equation (12);
…
For exploration, random noise is added to the actions output by the Actor network:
In the equation, $a$ represents the output action, C represents the noise coefficient, and $d$ represents the dimension of the action. Introducing random noise can enrich the action space and enhance the agent's ability to explore different actions.
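A minimal sketch of this exploration mechanism is shown below; the Gaussian form and the names are illustrative assumptions rather than the paper's exact noise equation.

```python
import numpy as np

def noisy_action(base_action, noise_coeff=0.1, low=-1.0, high=1.0, rng=np.random.default_rng()):
    """Add exploration noise to the Actor output and clip to the valid action range."""
    d = np.asarray(base_action).shape[-1]           # action dimension
    noise = noise_coeff * rng.standard_normal(d)    # zero-mean random perturbation
    return np.clip(base_action + noise, low, high)
```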
As shown in Figure 4 and Figure 5, the Actor network consists of an input layer, two hidden layers, and a final output layer. Based on the main control variables in Equation (9), the number of nodes in the output layer is designed to be 2, corresponding to the velocity and the yaw rate, respectively.
The state space is the main data source for the interaction between the agent and the training environment; if it contains too few state variables, the network cannot capture the relevant state changes, and the quality of the output actions decreases. In this study, the DDPG algorithm was used for trajectory tracking control of unmanned vehicles. The main parameters of the Actor network and the Critic network are shown in Table 1.
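As an illustration of the Actor structure described above (an input layer, two hidden layers, and a two-node output for velocity and yaw rate), such a network could be defined as follows; the state dimension, layer widths, activations, and output limits are illustrative assumptions rather than the settings in Table 1.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: maps the state vector to [velocity, yaw rate]."""
    def __init__(self, state_dim=4, hidden=128, v_max=2.0, omega_max=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),   # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),      # hidden layer 2
            nn.Linear(hidden, 2), nn.Tanh(),           # output layer: 2 nodes in [-1, 1]
        )
        # scale the normalized outputs to physical ranges (assumed limits)
        self.register_buffer("scale", torch.tensor([v_max, omega_max]))

    def forward(self, state):
        return self.net(state) * self.scale

actor = Actor()
action = actor(torch.zeros(1, 4))   # example forward pass: [[v, omega]]
```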
On the other hand, the calculation of “reward value” runs through the entire training process of reinforcement learning, reflecting the value of the actions selected during the algorithm training process. It directly relates to whether the agent can recognize and learn the appropriate actions and strategies. Therefore, it is necessary to design a reward function to calculate the evaluation criteria for the agent to perform the selected action in the current state.
This article designs a reward function that takes into account changes in heading and position. The equation for calculating the reward value related to heading changes is as follows:
In the equation, $e_{\varphi}$ represents the heading error between the unmanned vehicle and the target trajectory, i.e., the difference between the slope angle of the line connecting the reference points of the target trajectory and the orientation angle of the vehicle itself in the vehicle's coordinate reference system; its absolute value is taken during the calculation. The equation for calculating $e_{\varphi}$ is:
In addition to considering the impact of heading changes on the reward value, introducing the distance error prevents the unmanned vehicle from spinning in place or remaining stationary at a certain point during training. When the distance between the unmanned vehicle and the target trajectory decreases, a positive reward is given; when it does not decrease, a penalty is imposed. After multiple comparative attempts, it was found that a simple linear function did not represent the distance error effectively. The finally designed reward function for the distance error is calculated as follows:
The two constants in Equations (14) and (15) are user-defined; they were tuned through multiple attempts and are ultimately taken as −0.4 and 0.2, respectively, in this study. Among them, $e_d$ is the distance error between the center of mass of the unmanned vehicle and the target trajectory point, calculated as the Euclidean distance:
In this equation, $(x, y)$ are the centroid position coordinates of the unmanned vehicle at the current time, and $(x_{ref}, y_{ref})$ are the position coordinates of the target trajectory reference point at the current time.
Combining Equations (13) and (15), the final reward function takes the form:
In the equation, the two thresholds are conditions used for the logical judgment in the reward calculation within the training environment; they can be set during environment initialization.
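For illustration, a reward of this general shape could be implemented as follows; the exponential distance term, the thresholds, and the bonus term are our assumptions, with only the constants −0.4 and 0.2 and the approach/penalty logic taken from the description above.

```python
import math

def reward(e_phi, d, d_prev, k1=-0.4, k2=0.2, d_thresh=0.1, phi_thresh=0.05):
    """Illustrative reward combining heading and distance terms.

    e_phi    : absolute heading error [rad]
    d, d_prev: current / previous Euclidean distance to the target point [m]
    """
    r_heading = k1 * abs(e_phi)                    # penalize heading error
    if d < d_prev:                                 # moving toward the trajectory
        r_distance = k2 * math.exp(-d)             # nonlinear positive reward
    else:
        r_distance = -k2 * (1.0 + d)               # penalty when not approaching
    bonus = 1.0 if (d <= d_thresh and e_phi <= phi_thresh) else 0.0
    return r_heading + r_distance + bonus

def distance_error(x, y, x_ref, y_ref):
    """Euclidean distance between the vehicle centroid and the reference point."""
    return math.hypot(x - x_ref, y - y_ref)
```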
According to the task requirements of trajectory tracking control in this study, the training termination condition is set as follows:
- (1)
The distance error and heading error are both less than or equal to the preset error threshold;
- (2)
The maximum number of steps in a single training episode is reached;
- (3)
All training episodes are completed.
This study selected Bézier curves, S-shaped curves, and double-lane-change paths as the test scenarios for unmanned vehicle trajectory tracking control. A Bézier curve controls the shape and variation of the curve through its control points $P_0, P_1, \ldots, P_n$. The curve passes only through the starting point $P_0$ and the ending point $P_n$, not through the intermediate control points. It can describe arbitrary continuously varying curves and has the advantages of smoothness and continuous curvature, which makes it suitable for the simulation design of trajectory tracking control. Its general parametric expression is as follows:
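For reference, the standard general form of an $n$-th order Bézier curve with control points $P_0,\ldots,P_n$ is:

```latex
B(t) = \sum_{i=0}^{n} \binom{n}{i} (1-t)^{\,n-i}\, t^{\,i}\, P_i , \qquad t \in [0, 1]
```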
Through multiple adjustments, the control points of the Bézier target trajectory were set to (0.8, −0.6), (8, 20), (20, −80), (30, 35), (5, 20), (35, 25), and the trajectory is a fifth-order Bézier curve. At the same time, regardless of which control algorithm is used, the same initial state of the unmanned vehicle is set, so as to reduce deviations in the calculations and evaluation criteria caused by different initial states:
Using the hyperparameter settings in Table 2 for training, the variation of the average reward curve was plotted. The results are as follows:
As shown in Figure 6, as the number of training episodes increases, the reward curve rises from 0 and gradually converges to around 400. After 100 episodes it stabilizes within a certain fluctuation range, indicating that the DDPG training results have converged within the target number of episodes.
As shown in Figure 7, Figure 7a shows the result of training on the Bézier curve and then tracking that same curve. The overall tracking effect is good, but a certain deviation remains between the actual trajectory and the reference trajectory. Figure 7b,c use the training results of Figure 7a to track a circular arc and a straight line, respectively, and a significant deviation between the reference and actual trajectories can be observed. The reason for this phenomenon is that, although the reward curve of the DDPG training converges and stabilizes, the training trajectory and the test trajectories differ in curvature and heading variation; even though noise exploration is added during training to enrich the action space, if certain good actions and strategies are never explored, the agent cannot learn the optimal policy. In summary, the following issues may arise when the DDPG algorithm is used for trajectory tracking control.
- (1)
Insufficient generalization ability of the model: The DDPG algorithm lacks dynamic adaptability to diverse trajectories.
- (2)
Imbalance between exploration and exploitation: fixed noise mechanisms can easily lead to local optima (e.g., difficulty adapting to abrupt-curvature scenarios after learning straight-line trajectories).
The Crayfish Optimization Algorithm (COA) was proposed by Jia Heming et al. in 2023 [28]. The algorithm simulates the summer-escape, competition, and foraging behaviors of crayfish and is divided into three stages to achieve an effective balance between global and local search [29,30,31]. The structure of the crayfish is shown in Figure 8.
The process of the COA algorithm is shown in Figure 9:
In multidimensional optimization problems, each crayfish is a 1 × dim vector, and each element represents one component of a candidate solution. In the set of variables $(X_{i,1}, X_{i,2}, \ldots, X_{i,dim})$, each variable $X_{i,j}$ lies between the lower and upper bounds. The initialization of the COA algorithm randomly generates a set of candidate solutions X in the search space, where X is determined by the population size N and the dimension dim.
In the equation, X is the initial population position, N represents the population size, dim is the dimensionality of the population, and $X_{i,j}$ is the position of individual i in the j-th dimension. The expression for $X_{i,j}$ is as follows:
In the equation, $lb_j$ denotes the lower bound of the j-th dimension, $ub_j$ denotes the upper bound of the j-th dimension, and rand is a random number in [0, 1].
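A minimal sketch of this initialization step (with illustrative names) is:

```python
import numpy as np

def init_population(n, dim, lb, ub, rng=np.random.default_rng(0)):
    """Randomly initialize N crayfish within [lb, ub] in each dimension.

    lb, ub : arrays of shape (dim,) with the lower/upper bounds.
    Returns an (N, dim) array; row i is candidate solution X_i.
    """
    lb = np.asarray(lb, dtype=float)
    ub = np.asarray(ub, dtype=float)
    return lb + rng.random((n, dim)) * (ub - lb)   # X_ij = lb_j + rand * (ub_j - lb_j)
```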
The behavioral patterns of crayfish are primarily modulated by temperature fluctuations, inducing distinct phases that serve as critical criteria for determining individual behavioral strategies. In the original Crayfish Optimization Algorithm (COA), temperature is mathematically defined as follows:
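A commonly cited form of this temperature definition in the COA literature (quoted here as an assumption) is:

```latex
\mathrm{temp} = \mathrm{rand} \times 15 + 20
```

which yields temperatures in the range [20, 35] °C, consistent with the 30 °C threshold used below.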
The competitive stage of crayfish is shown in Figure 10.
Under appropriate temperature conditions, crayfish exhibit foraging behavior, and their food intake is also influenced by temperature. The food intake p of crayfish is calculated as follows:
In the equation, $\mu$ denotes the optimal temperature for crayfish, while $\sigma$ and $C_1$ are parameters used to regulate the food intake of crayfish under different temperature conditions.
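The intake p is typically modeled as a Gaussian-shaped function of temperature centered at the optimal temperature $\mu$; the commonly cited form (included here as an assumption) is:

```latex
p = C_1 \times \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(\mathrm{temp}-\mu)^2}{2\sigma^2} \right)
```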
Under appropriate temperature conditions, crayfish exhibit foraging behavior. When the temperature exceeds 30 °C, crayfish will choose to seek shelter in caves:
In the equation, C serves as a decreasing coefficient that gradually diminishes with the progression of the maximum number of iterations T; $X_G$ denotes the best position obtained through iterative updates, while $X_L$ represents the current global best position within the population. The average of these two positions is used to determine the current cave position, i.e., the target optimal solution, as defined by Equation (26).
When the temperature exceeds a predefined threshold and the random number is greater than or equal to a preset value, crayfish enter the competition phase:
In the equation, Z represents a randomly selected crayfish individual. During the competition phase, crayfish compete with each other, and the current position $X_i$ of each crayfish is adjusted based on the position $X_Z$ of another randomly chosen individual. This mechanism expands the search scope of the COA and enhances the algorithm's exploratory capability.
For crayfish in the foraging phase, when the temperature T ≤ 30 °C, they will move toward the food source. By evaluating the size of the food, crayfish decide whether to tear it apart or consume it directly. The position and quantity of the food are mathematically defined as follows:
In the equation, C is the food factor, representing the maximum food quantity and taking a constant value; $\mathrm{fitness}_i$ denotes the fitness value of the i-th crayfish, and $\mathrm{fitness}_{food}$ represents the fitness value corresponding to the food position.
The foraging stage of crayfish is shown in Figure 11.
Therefore, we propose a biologically inspired combination control strategy based on the Deep Deterministic Policy Gradient (DDPG) algorithm and enhance it through the Crayfish Optimization Algorithm (COA) to address the limitations in generalization and dynamic adaptability. The proposed DDPG-COA controller embodies a symmetrical structure: DDPG serves as the main controller for global trajectory tracking, while COA acts as a compensating regulator, dynamically optimizing the actions through a disturbance-observation mechanism. The symmetric balance between learning-based control (DDPG) and the Crayfish Optimization Algorithm (COA) ensures robust performance in complex scenarios.
The COA is introduced and combined with the DDPG algorithm for trajectory tracking control: the offline-trained DDPG model provides the base actions, and COA optimizes and compensates them by adjusting the DDPG model's outputs. The specific equation design and iteration process are as follows:
- (1)
Initialize the crayfish population size and number of iterations; use DDPG to train models on different trajectories (sine, straight-line, Bézier, and arc trajectories); and store the results as a policy library.
- (2)
Curvature describes the degree of bending of a curve and is usually denoted by $\kappa$; the larger $\kappa$ is, the more sharply the curve bends, and when $\kappa = 0$ the curve degenerates into a straight line. The curvature of the target trajectory is calculated as:
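For a parametric trajectory $(x(s), y(s))$, the standard curvature formula, quoted here for reference, is:

```latex
\kappa = \frac{\left| x'\,y'' - y'\,x'' \right|}{\left( x'^{2} + y'^{2} \right)^{3/2}}
```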
Calculate the curvature at the current target trajectory point, select the action-output decision layer based on the absolute curvature value, input the current state to the corresponding model in the policy library, and output the base action.
- (3)
Set the upper and lower limits for the action compensation optimization:
In Equations (33) and (34), the two pairs of bounds are the upper and lower limits of the adjustment range for the velocity and the yaw rate, respectively. Then, based on these upper and lower limits, N candidate actions are generated, centered on the current base action. The calculation equation is:
In the equation, the base action is the value output by the DDPG algorithm, and a weight coefficient adjusts its proportion in the output action within a preset value range (a combined code sketch of the candidate generation and fitness evaluation steps is given after this list).
- (4)
In combination with Equation (9), the COA performs observation-based compensation on the current output action. The current error values and the fitness values are calculated over the generated candidate action set. The dynamic influence of the trajectory curvature on the control objective must also be considered: a traditional heuristic fitness function with fixed weights can achieve basic control, but it does not reflect the dynamic adjustment required when the curvature changes.
Therefore, the dynamic weight adjustment mechanism is improved as follows:
Curvature perception module: Calculate the curvature of the current trajectory point in real time.
Dynamic allocation of weights:
Segmented fitness function:
In the equation, the curvature sensitivity coefficient is taken as 0.1 by default. When the trajectory curvature is large or the position error is significant, the weight of the distance error is strengthened; when the trajectory is nearly straight and the error is small, the emphasis is placed on heading control, with the curvature threshold set at 0.05 rad/m and M kept at the original error threshold. The heading error and the distance error are calculated according to Equations (15) and (18). As can be seen from Equation (37), when the position error is too large, the fitness value is dominated by the distance term, and the action command should primarily drive the autonomous vehicle toward the target trajectory point; when the error is within a certain range, the fitness value is dominated by the heading term, and the yaw angle should be adjusted preferentially for heading control.
- (5)
Choose to enter the summer escape or competition behavior based on the temperature. During the summer escape behavior, update the position of the crayfish (i.e., the candidate solution) according to the size of the cave. Using the change in error as the fitness value for evaluating each crayfish, select the optimal solution and output the new action command.
- (6)
Update the population fitness values, evaluate the population, and determine whether to exit the loop; if not, continue iterating.
- (7)
If the maximum number of iterations is reached or the performance meets the preset criteria, the iteration stops, and the current optimal fitness value and the corresponding optimal action are output.
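The following sketch ties the candidate generation, curvature-aware fitness, and selection steps together in simplified form. It is an illustrative reconstruction under our own assumptions: the function names are hypothetical, a plain random search stands in for the full COA update rules, and the weighting is a simplified version of the segmented fitness described above.

```python
import numpy as np

def simulate_errors(state, action, target, dt=0.05):
    """One-step rollout of the kinematic model; returns (distance error, heading error).
    A simplified stand-in for the controller's internal error prediction."""
    x, y, phi = state
    v, omega = action
    x, y, phi = x + v * np.cos(phi) * dt, y + v * np.sin(phi) * dt, phi + omega * dt
    tx, ty, tphi = target
    e_d = float(np.hypot(tx - x, ty - y))
    e_phi = float(np.arctan2(np.sin(tphi - phi), np.cos(tphi - phi)))
    return e_d, e_phi

def fitness(action, state, target, kappa, kappa_th=0.05, alpha=0.1):
    """Curvature-aware fitness: weight the distance error more heavily on curved
    segments or when the position error is large (illustrative weighting only)."""
    e_d, e_phi = simulate_errors(state, action, target)
    if kappa > kappa_th or e_d > 0.5:
        w_d = min(1.0, 0.5 + alpha * kappa / kappa_th)
    else:
        w_d = 0.4
    return w_d * e_d + (1.0 - w_d) * abs(e_phi)

def coa_compensate(base_action, state, target, kappa, bounds,
                   n_candidates=20, iters=30, rng=np.random.default_rng(0)):
    """Generate candidate actions around the DDPG base action within the allowed
    bounds and keep the best one (a plain random search standing in for COA)."""
    low, high = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    best_a = np.clip(np.asarray(base_action, float), low, high)
    best_f = fitness(best_a, state, target, kappa)
    for _ in range(iters):
        cands = np.clip(best_a + rng.uniform(-0.2, 0.2, (n_candidates, 2)) * (high - low),
                        low, high)
        f = np.array([fitness(a, state, target, kappa) for a in cands])
        if f.min() < best_f:
            best_f, best_a = float(f.min()), cands[f.argmin()]
    return best_a   # compensated [v, omega] command

# Example usage with illustrative numbers
state, target = (0.0, 0.0, 0.0), (1.0, 0.5, 0.3)   # (x, y, phi) and reference point/heading
base = np.array([1.0, 0.2])                         # action from the DDPG policy library
print(coa_compensate(base, state, target, kappa=0.08, bounds=([0.0, -1.0], [2.0, 1.0])))
```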
In summary, the process of the control algorithm is shown in Figure 12: