1. Introduction
With the rapid advancement of artificial intelligence and autonomous driving technologies, intelligent vehicles represented by autonomous cars have become a research hotspot in special transportation, logistics distribution, and other fields. However, in complex real-world scenarios, the trajectory tracking control systems of intelligent vehicles face severe challenges. For example, in vehicle motion planning, the coupling between the nonlinear planar single-track vehicle model and the dynamic wheel model significantly increases control complexity [1], while the complexity of driving scenarios and these dynamic coupling relationships further increase the computational burden of vehicle motion control [2,3,4]. Therefore, there is an urgent need to develop vehicle trajectory tracking control systems with high computational efficiency and excellent real-time performance [5,6].
As one of the core technologies in vehicle automatic control, trajectory tracking aims to guide vehicles to accurately follow pre-defined trajectories through advanced control algorithms and system architectures, thereby enhancing safety and control efficiency [7,8,9]. Among the numerous control methods, Model Predictive Control (MPC) is widely used because it can explicitly handle constraints and predict future states [10,11,12,13,14]. However, MPC must solve a finite-horizon optimization problem online, and its computational efficiency and solving capability directly determine control accuracy and real-time performance, which in turn affect the overall tracking and stability control performance [15,16,17].
With its advantages of autonomous learning and model-free control, reinforcement learning (RL), and in particular the Deep Deterministic Policy Gradient (DDPG) algorithm, has attracted extensive research in vehicle trajectory tracking [18]. A variety of optimization strategies have been explored for trajectory tracking, such as a reward function mechanism designed with logarithmic functions [19]; a DDPG controller that directly uses the lateral deviation, heading angle deviation, and curvature of the target trajectory as state inputs [20]; and the dynamic selection and switching of the optimal underlying policy model based on real-time conditions, which can effectively improve system accuracy during trajectory following [21]. On the other hand, hybrid approaches combining RL with traditional model-based control have been widely explored. For instance, using RL to optimize compensation parameters online (e.g., front-wheel steering angle and acceleration) for traditional controllers can significantly improve real-time response speed and adaptability to high-precision path tracking requirements [22]. Architectures that directly embed the outputs of pre-trained deep RL models into predictive control frameworks and fuse them via quadratic programming have achieved effective tracking of complex paths, such as double-lane-change and slalom trajectories [23]. In specific application scenarios (e.g., logistics AGVs), hybrid control schemes combining RL with traditional PI control have also demonstrated adaptive learning capabilities [24]. Meanwhile, multi-core online RL controllers based on advanced planning algorithms (e.g., DHP) have been validated in simulation environments and shown to exhibit superior tracking accuracy and control smoothness under complex road conditions (S-curves and urban roads) [25]. The introduction of disturbance observers and event-triggering techniques provides new ideas for addressing system robustness and computational efficiency [26,27].
While these control schemes have demonstrated potential in specific scenarios, RL-based trajectory tracking methods (e.g., DDPG) still face core challenges, such as insufficient adaptability to complex and variable trajectories (especially sharp curvature changes) and low training efficiency, which limit their performance in full-scenario applications requiring high real-time performance and strong robustness. Therefore, designing a novel controller that combines dynamic control strategy adjustment with an efficient optimization algorithm holds significant research value and application prospects. This paper focuses on a combined control mechanism for autonomous vehicle trajectory tracking that integrates a reinforcement learning algorithm with a heuristic algorithm, the Crayfish Optimization Algorithm (COA), together with the concept of symmetry. Trajectories of different symmetry types, such as axially symmetric and centrally symmetric trajectories, were designed, and simulation comparisons were conducted. This study analyzes error variations during tracking control and evaluates adaptability to different trajectories. The findings provide a new framework for the trajectory tracking control of intelligent vehicles, including autonomous cars.
2. Kinematic Modeling of Autonomous Vehicle Motion Control
Kinematics analyzes the motion laws of objects from a geometric perspective and can describe how an autonomous vehicle's position, heading, and velocity evolve. In practical motion control, it involves coordinate translation transformations and the definition of the reference heading angle. Therefore, choosing appropriate reference coordinate systems and kinematic models is crucial.
The motion of an autonomous vehicle typically involves two coordinate systems: the inertial coordinate system, denoted as XOY, and the body coordinate system, denoted as xoy. The XOY is used by the inertial navigation system, while the xoy is employed to describe the relative motion of the autonomous vehicle.
As shown in Figure 1, the inertial coordinate system XOY in this paper is defined as follows: the X-axis points due east, the Y-axis points due north, and the Z-axis points upward. The body coordinate system xoy is defined such that the x-axis points directly forward of the vehicle and the y-axis points to the left side of the vehicle. The vehicle's yaw angle $\varphi$ is the angle between the x-axis of the body coordinate system and the X-axis of the inertial coordinate system, with the counterclockwise direction defined as positive.
Assuming the autonomous vehicle is undergoing arbitrary curvilinear or rectilinear motion at a certain moment, we establish the steering motion model, in which $(x_r, y_r)$ and $(x_f, y_f)$ are the coordinates of the rear and front axle centers in the inertial coordinate system, respectively; $v_r$ is the velocity at the rear axle center, in m/s; $l$ is the wheelbase, in m; $R$ is the instantaneous turning radius at the rear axle center, in m; and $\delta_f$ is the front-wheel steering angle, in rad. The velocity at the rear axle center $v_r$ is:
The kinematic constraints for the front and rear axles are:
Combining Equations (1) and (2), we obtain:
Based on Equation (3) and the geometric relationship between the front and rear wheels, it can be derived that:
Substituting Equations (3) and (4) into Equation (2), the yaw rate can be obtained as:
where $\omega$ is the yaw rate; the turning radius $R$ and the front-wheel steering angle $\delta_f$ can then be readily obtained as:
From Equations (3) and (5), the kinematic model of the autonomous vehicle can be derived as:
where $\varphi$ is the current heading angle and $v_r$ is the velocity of the rear axle center. The model can be rewritten in a more compact form as:
In the equation, the state vector is $\boldsymbol{\xi} = [x, y, \varphi]^{T}$ and the control vector is $\boldsymbol{u} = [v_r, \delta_f]^{T}$. In the path tracking control of unmanned vehicles, it is often desirable to use the velocity and the yaw rate as the control quantities. Combining Equations (5) and (7), the kinematic model of the autonomous vehicle can be transformed into the following form:
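As an illustration, a discrete-time version of this kinematic model with velocity and yaw rate as control inputs can be sketched as follows; this is a minimal sketch assuming simple Euler integration, and the function names and time step are ours rather than the paper's.

```python
import math

def kinematic_step(x, y, phi, v, omega, dt=0.05):
    """One Euler-integration step of the kinematic model with
    rear-axle velocity v [m/s] and yaw rate omega [rad/s] as inputs."""
    x_next = x + v * math.cos(phi) * dt      # position update along X
    y_next = y + v * math.sin(phi) * dt      # position update along Y
    phi_next = phi + omega * dt              # heading update
    return x_next, y_next, phi_next

def yaw_rate_from_steering(v, delta_f, wheelbase):
    """Equivalent yaw rate for a given front-wheel steering angle
    (single-track geometry: omega = v * tan(delta_f) / l)."""
    return v * math.tan(delta_f) / wheelbase
```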
In the motion control of autonomous vehicles, the heading angle is a critical concept. Although the term originates in ship navigation, it equally describes the directional state of ground vehicles and is commonly used in navigation and tracking control to determine the absolute direction of the vehicle. It serves as a key reference index for determining the driving direction and its angular changes, thereby enabling precise tracking and control.
In the ground coordinate system, the heading angle refers to the angle between the direction of the vehicle's center-of-mass velocity and a reference axis (usually the true north direction or the X-axis of the coordinate system), and it describes the vehicle's orientation in that coordinate system. Generally, the heading angle ranges from 0° to 360°, i.e., from 0 to 2π in radians, with the clockwise direction defined as positive and the counterclockwise direction as negative.
Suppose the current heading angle of the autonomous vehicle is $\varphi$ and the reference heading angle is $\varphi_{ref}$, both in rad. The calculation equation for the heading angle deviation is:
where the calculation equations for $\varphi$ and $\varphi_{ref}$ are:
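As an illustration of how the heading deviation can be evaluated in practice, the reference heading may be taken from the local direction of the target trajectory and the deviation wrapped into [−π, π]; this is our own sketch rather than the paper's exact equations.

```python
import math

def heading_deviation(phi, x_ref, y_ref, x_ref_next, y_ref_next):
    """Heading deviation between the vehicle heading phi and the local
    direction of the reference trajectory, wrapped to [-pi, pi]."""
    phi_ref = math.atan2(y_ref_next - y_ref, x_ref_next - x_ref)  # local trajectory direction
    e_phi = phi_ref - phi
    # wrap the deviation into [-pi, pi] to avoid 2*pi jumps
    return math.atan2(math.sin(e_phi), math.cos(e_phi))
```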
Reinforcement learning (RL), a sub-field of machine learning, is a paradigm that achieves optimal decision making through trial and error and dynamic interaction with the environment. Its principle is illustrated schematically in Figure 2:
As shown in Figure 2, the agent is guided indirectly by reward signals during its interaction with the environment. It continuously adjusts its behavior through trial and error to adapt to the environment's rules, aiming to maximize the long-term cumulative reward and thereby achieve specific goals. Meanwhile, it optimizes its decision making by constructing corresponding models and value functions to autonomously explore the optimal behavior of the control system. The specific process is as follows (a code sketch of this loop is given after the list):
- (1)
The agent observes the current state of the environment and the reward signal.
- (2)
Based on the current state and policy, the agent selects an action.
- (3)
The agent executes the selected action in the environment.
- (4)
The environment returns a reward and transitions to the next state.
- (5)
The agent updates its policy based on the received reward and the new state.
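The following minimal sketch illustrates this generic agent–environment loop using a Gymnasium-style interface; the environment and the random action choice are placeholders rather than the vehicle environment used in this paper.

```python
import gymnasium as gym

env = gym.make("Pendulum-v1")            # placeholder continuous-control task
obs, info = env.reset(seed=0)

for step in range(200):
    # (1)-(2): observe the state and pick an action from the current policy
    action = env.action_space.sample()   # stand-in for a learned policy
    # (3)-(4): execute the action; the environment returns reward and next state
    obs, reward, terminated, truncated, info = env.step(action)
    # (5): a learning agent would update its policy here using (obs, action, reward)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```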
Supervised learning finds the optimal model parameters from a training dataset sampled from a given data distribution. Reinforcement learning obtains its data differently: the data come from the interaction between the agent and the environment, so if the agent never takes a certain action, the data corresponding to that action are never acquired. The agent's current training data therefore come from the decisions of earlier versions of the agent, meaning that different policies lead to different data distributions generated by interaction with the environment. To address the data distribution mismatch caused by this policy iteration, researchers have proposed deep reinforcement learning (DRL) methods that integrate RL with deep learning. This approach can feed multi-dimensional information into the algorithm, combining the decision-making process of RL with the large-scale data processing capability of deep learning; neural networks are then used to approximate the relevant functions, and the resulting actions are output to control the agent.
3. Optimization of DDPG Trajectory Tracking Controller Based on Crayfish Algorithm
The DDPG algorithm is an off-policy learning algorithm in the Actor–Critic family. It addresses the limitation that the DQN algorithm can only handle discrete, low-dimensional action spaces. The Actor network makes action decisions based on the state input and stores the current state information for sampling and updating the network parameters. The Critic network judges the quality of these decisions by calculating the Q-value of the Actor network's actions. The main framework of the DDPG algorithm is as follows:
According to Figure 3, in the DDPG algorithm flow, the Actor network receives the agent's current state as input and outputs an action $a_t$. The Actor network is mainly responsible for action selection, while the Critic network is responsible for action evaluation. During the interaction between the agent and the environment, a series of tuples $(s_t, a_t, r_t, s_{t+1})$ of states, actions, and rewards is obtained. Throughout the algorithm, the information in these tuples is stored in the experience pool (replay buffer) and updated.
During training, the target network parameters of DDPG are updated using the following equation:
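For reference, the standard DDPG soft (Polyak) target-network update takes the following form, where $\tau \ll 1$ is the soft-update rate and $\theta^{Q}$, $\theta^{\mu}$ are the critic and actor parameters:

```latex
\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad
\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}
```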
The pseudocode process of the DDPG algorithm is shown in Algorithm 1:
Algorithm 1: Specific Process of the DDPG Algorithm
…
8. Randomly sample a batch of tuples from R;
…
11. Update the target network according to Equation (12);
…
For exploration, random noise is added to the actions output by the Actor network:
In the equation, $a$ represents the output action, C represents the noise coefficient, and $d$ represents the dimension of the action. Introducing random noise can enrich the action space and enhance the agent's ability to explore different actions.
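A minimal sketch of this exploration mechanism is shown below; the Gaussian form and the names are illustrative assumptions rather than the paper's exact noise equation.

```python
import numpy as np

def noisy_action(base_action, noise_coeff=0.1, low=-1.0, high=1.0, rng=np.random.default_rng()):
    """Add exploration noise to the Actor output and clip to the valid action range."""
    d = np.asarray(base_action).shape[-1]           # action dimension
    noise = noise_coeff * rng.standard_normal(d)    # zero-mean random perturbation
    return np.clip(base_action + noise, low, high)
```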
As shown in Figure 4 and Figure 5, the Actor network consists of an input layer, two hidden layers, and a final output layer. Based on the main control variables in Equation (9), the number of nodes in the output layer is designed to be 2, corresponding to the velocity and the yaw rate, respectively.
The state space is the main data source for the interaction between the agent and the training environment; if it contains too few state variables, the network cannot capture the relevant state changes, and the quality of the output actions decreases. In this study, the DDPG algorithm was used for trajectory tracking control of unmanned vehicles. The main parameters of the Actor network and the Critic network are shown in Table 1.
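As an illustration of the Actor structure described above (an input layer, two hidden layers, and a two-node output for velocity and yaw rate), such a network could be defined as follows; the state dimension, layer widths, activations, and output limits are illustrative assumptions rather than the settings in Table 1.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: maps the state vector to [velocity, yaw rate]."""
    def __init__(self, state_dim=4, hidden=128, v_max=2.0, omega_max=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),   # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),      # hidden layer 2
            nn.Linear(hidden, 2), nn.Tanh(),           # output layer: 2 nodes in [-1, 1]
        )
        # scale the normalized outputs to physical ranges (assumed limits)
        self.register_buffer("scale", torch.tensor([v_max, omega_max]))

    def forward(self, state):
        return self.net(state) * self.scale

actor = Actor()
action = actor(torch.zeros(1, 4))   # example forward pass: [[v, omega]]
```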
On the other hand, the calculation of “reward value” runs through the entire training process of reinforcement learning, reflecting the value of the actions selected during the algorithm training process. It directly relates to whether the agent can recognize and learn the appropriate actions and strategies. Therefore, it is necessary to design a reward function to calculate the evaluation criteria for the agent to perform the selected action in the current state.
This article designs a reward function that takes into account changes in heading and position. The equation for calculating the reward value related to heading changes is as follows:
In the equation, $e_{\varphi}$ represents the heading error between the unmanned vehicle and the target trajectory, i.e., the difference between the slope angle of the line connecting the reference points of the target trajectory and the orientation angle of the vehicle itself in the vehicle's coordinate reference system; its absolute value is taken during the calculation. The equation for calculating $e_{\varphi}$ is:
In addition to considering the impact of heading changes on the reward value, introducing the distance error prevents the unmanned vehicle from spinning in place or remaining stationary at a certain point during training. When the distance between the unmanned vehicle and the target trajectory decreases, a positive reward is given; when it does not decrease, a penalty is imposed. After multiple comparative attempts, it was found that a simple linear function did not represent the distance error effectively. The finally designed reward function for the distance error is calculated as follows:
The two constants in Equations (14) and (15) are user-defined; they were tuned through multiple attempts and are ultimately taken as −0.4 and 0.2, respectively, in this study. Among them, $e_d$ is the distance error between the center of mass of the unmanned vehicle and the target trajectory point, calculated as the Euclidean distance:
In this equation, $(x, y)$ are the centroid position coordinates of the unmanned vehicle at the current time, and $(x_{ref}, y_{ref})$ are the position coordinates of the target trajectory reference point at the current time.
Combining Equations (13) and (15), the final reward function takes the form:
In the equation, the two thresholds are conditions used for the logical judgment in the reward calculation within the training environment; they can be set during environment initialization.
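For illustration, a reward of this general shape could be implemented as follows; the exponential distance term, the thresholds, and the bonus term are our assumptions, with only the constants −0.4 and 0.2 and the approach/penalty logic taken from the description above.

```python
import math

def reward(e_phi, d, d_prev, k1=-0.4, k2=0.2, d_thresh=0.1, phi_thresh=0.05):
    """Illustrative reward combining heading and distance terms.

    e_phi    : absolute heading error [rad]
    d, d_prev: current / previous Euclidean distance to the target point [m]
    """
    r_heading = k1 * abs(e_phi)                    # penalize heading error
    if d < d_prev:                                 # moving toward the trajectory
        r_distance = k2 * math.exp(-d)             # nonlinear positive reward
    else:
        r_distance = -k2 * (1.0 + d)               # penalty when not approaching
    bonus = 1.0 if (d <= d_thresh and e_phi <= phi_thresh) else 0.0
    return r_heading + r_distance + bonus

def distance_error(x, y, x_ref, y_ref):
    """Euclidean distance between the vehicle centroid and the reference point."""
    return math.hypot(x - x_ref, y - y_ref)
```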
According to the task requirements of trajectory tracking control in this study, the training termination condition is set as follows:
- (1)
The distance error and heading error are both less than or equal to the preset error threshold;
- (2)
The maximum number of steps in a single training episode is reached;
- (3)
All training episodes are completed.
This study selected Bézier curves, S-shaped curves, and double-lane-change paths as the test scenarios for unmanned vehicle trajectory tracking control. A Bézier curve controls the shape and variation of the curve through its control points $P_0, P_1, \ldots, P_n$. The curve passes only through the starting point $P_0$ and the ending point $P_n$, not through the intermediate control points. It can describe arbitrary continuously varying curves and has the advantages of smoothness and continuous curvature, which makes it suitable for the simulation design of trajectory tracking control. Its general parametric expression is as follows:
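For reference, the standard general form of an $n$-th order Bézier curve with control points $P_0,\ldots,P_n$ is:

```latex
B(t) = \sum_{i=0}^{n} \binom{n}{i} (1-t)^{\,n-i}\, t^{\,i}\, P_i , \qquad t \in [0, 1]
```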
Through multiple adjustments, the control points of the Bézier target trajectory were set to (0.8, −0.6), (8, 20), (20, −80), (30, 35), (5, 20), (35, 25), and the trajectory is a fifth-order Bézier curve. At the same time, regardless of which control algorithm is used, the same initial state of the unmanned vehicle is set, so as to reduce deviations in the calculations and evaluation criteria caused by different initial states:
Using the hyperparameter settings in Table 2 for training, the variation of the average reward curve was plotted. The results are as follows:
As shown in Figure 6, as the number of training episodes increases, the reward curve rises from 0 and gradually converges to around 400. After 100 episodes it stabilizes within a certain fluctuation range, indicating that the DDPG training results have converged within the target number of episodes.
As shown in Figure 7, Figure 7a shows the result of training on the Bézier curve and then tracking that same curve. The overall tracking effect is good, but a certain deviation remains between the actual trajectory and the reference trajectory. Figure 7b,c use the training results of Figure 7a to track a circular arc and a straight line, respectively, and a significant deviation between the reference and actual trajectories can be observed. The reason for this phenomenon is that, although the reward curve of the DDPG training converges and stabilizes, the training trajectory and the test trajectories differ in curvature and heading variation; even though noise exploration is added during training to enrich the action space, if certain good actions and strategies are never explored, the agent cannot learn the optimal policy. In summary, the following issues may arise when the DDPG algorithm is used for trajectory tracking control.
- (1)
Insufficient generalization ability of the model: The DDPG algorithm lacks dynamic adaptability to diverse trajectories.
- (2)
Imbalance between exploration and exploitation: fixed noise mechanisms can easily lead to local optima (e.g., difficulty adapting to abrupt-curvature scenarios after learning straight-line trajectories).
The Crayfish Optimization Algorithm (COA) was proposed by Jia Heming et al. in 2023 [28]. The algorithm simulates the summer-escape, competition, and foraging behaviors of crayfish and is divided into three stages to achieve an effective balance between global and local search [29,30,31]. The structure of the crayfish is shown in Figure 8.
The process of the COA algorithm is shown in Figure 9:
In multidimensional optimization problems, each crayfish is a 1 × dim vector, and each element represents one component of a candidate solution. In the set of variables $(X_{i,1}, X_{i,2}, \ldots, X_{i,dim})$, each variable $X_{i,j}$ lies between the lower and upper bounds. The initialization of the COA algorithm randomly generates a set of candidate solutions X in the search space, where X is determined by the population size N and the dimension dim.
In the equation, X is the initial population position, N represents the population size, dim is the dimensionality of the population, and $X_{i,j}$ is the position of individual i in the j-th dimension. The expression for $X_{i,j}$ is as follows:
In the equation, $lb_j$ denotes the lower bound of the j-th dimension, $ub_j$ denotes the upper bound of the j-th dimension, and rand is a random number in [0, 1].
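A minimal sketch of this initialization step (with illustrative names) is:

```python
import numpy as np

def init_population(n, dim, lb, ub, rng=np.random.default_rng(0)):
    """Randomly initialize N crayfish within [lb, ub] in each dimension.

    lb, ub : arrays of shape (dim,) with the lower/upper bounds.
    Returns an (N, dim) array; row i is candidate solution X_i.
    """
    lb = np.asarray(lb, dtype=float)
    ub = np.asarray(ub, dtype=float)
    return lb + rng.random((n, dim)) * (ub - lb)   # X_ij = lb_j + rand * (ub_j - lb_j)
```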
The behavioral patterns of crayfish are primarily modulated by temperature fluctuations, inducing distinct phases that serve as critical criteria for determining individual behavioral strategies. In the original Crayfish Optimization Algorithm (COA), temperature is mathematically defined as follows:
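A commonly cited form of this temperature definition in the COA literature (quoted here as an assumption) is:

```latex
\mathrm{temp} = \mathrm{rand} \times 15 + 20
```

which yields temperatures in the range [20, 35] °C, consistent with the 30 °C threshold used below.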
The competitive stage of crayfish is shown in Figure 10.
Under appropriate temperature conditions, crayfish exhibit foraging behavior, and their food intake is also influenced by temperature. The food intake p of crayfish is calculated as follows:
In the equation, $\mu$ denotes the optimal temperature for crayfish, while $\sigma$ and $C_1$ are parameters used to regulate the food intake of crayfish under different temperature conditions.
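The intake p is typically modeled as a Gaussian-shaped function of temperature centered at the optimal temperature $\mu$; the commonly cited form (included here as an assumption) is:

```latex
p = C_1 \times \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(\mathrm{temp}-\mu)^2}{2\sigma^2} \right)
```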
Under appropriate temperature conditions, crayfish exhibit foraging behavior. When the temperature exceeds 30 °C, crayfish will choose to seek shelter in caves:
In the equation, C serves as a decreasing coefficient that gradually diminishes with the progression of the maximum number of iterations T; $X_G$ denotes the best position obtained through iterative updates, while $X_L$ represents the current global best position within the population. The average of these two positions is used to determine the current cave position, i.e., the target optimal solution, as defined by Equation (26).
When the temperature exceeds a predefined threshold and the random number is greater than or equal to a preset value, crayfish enter the competition phase:
In the equation, Z represents a randomly selected crayfish individual. During the competition phase, crayfish compete with each other, and the current position $X_i$ of each crayfish is adjusted based on the position $X_Z$ of another randomly chosen individual. This mechanism expands the search scope of the COA and enhances the algorithm's exploratory capability.
For crayfish in the foraging phase, when the temperature T ≤ 30 °C, they will move toward the food source. By evaluating the size of the food, crayfish decide whether to tear it apart or consume it directly. The position and quantity of the food are mathematically defined as follows:
In the equation, C is the food factor, representing the maximum food quantity and taking a constant value; $\mathrm{fitness}_i$ denotes the fitness value of the i-th crayfish, and $\mathrm{fitness}_{food}$ represents the fitness value corresponding to the food position.
The foraging stage of crayfish is shown in Figure 11.
Therefore, we propose a biologically inspired combination control strategy based on the Deep Deterministic Policy Gradient (DDPG) algorithm and enhance it through the Crayfish Optimization Algorithm (COA) to address the limitations in generalization and dynamic adaptability. The proposed DDPG-COA controller embodies a symmetrical structure: DDPG serves as the main controller for global trajectory tracking, while COA acts as a compensating regulator, dynamically optimizing the actions through a disturbance-observation mechanism. The symmetric balance between learning-based control (DDPG) and the Crayfish Optimization Algorithm (COA) ensures robust performance in complex scenarios.
The COA is introduced and combined with the DDPG algorithm for trajectory tracking control: the offline-trained DDPG model provides the base actions, and COA optimizes and compensates them by adjusting the DDPG model's outputs. The specific equation design and iteration process are as follows:
- (1)
Initialize the crayfish population size and number of iterations; use DDPG to train models on different trajectories (sine, straight-line, Bézier, and arc trajectories); and store the results as a policy library.
- (2)
Curvature describes the degree of bending of a curve and is usually denoted by $\kappa$; the larger $\kappa$ is, the more sharply the curve bends, and when $\kappa = 0$ the curve degenerates into a straight line. The curvature of the target trajectory is calculated as:
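For a parametric trajectory $(x(s), y(s))$, the standard curvature formula, quoted here for reference, is:

```latex
\kappa = \frac{\left| x'\,y'' - y'\,x'' \right|}{\left( x'^{2} + y'^{2} \right)^{3/2}}
```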
Calculate the curvature at the current target trajectory point, select the action-output decision layer based on the absolute curvature value, input the current state to the corresponding model in the policy library, and output the base action.
- (3)
Set the upper and lower limits for the action compensation optimization:
In Equations (33) and (34), the two pairs of bounds are the upper and lower limits of the adjustment range for the velocity and the yaw rate, respectively. Then, based on these upper and lower limits, N candidate actions are generated, centered on the current base action. The calculation equation is:
In the equation, the base action is the value output by the DDPG algorithm, and a weight coefficient adjusts its proportion in the output action within a preset value range (a combined code sketch of the candidate generation and fitness evaluation steps is given after this list).
- (4)
In combination with Equation (9), the COA performs observation-based compensation on the current output action. The current error values and the fitness values are calculated over the generated candidate action set. The dynamic influence of the trajectory curvature on the control objective must also be considered: a traditional heuristic fitness function with fixed weights can achieve basic control, but it does not reflect the dynamic adjustment required when the curvature changes.
Therefore, the dynamic weight adjustment mechanism is improved as follows:
Curvature perception module: Calculate the curvature of the current trajectory point in real time.
Dynamic allocation of weights:
Segmented fitness function:
In the equation, the curvature sensitivity coefficient is taken as 0.1 by default. When the trajectory curvature is large or the position error is significant, the weight of the distance error is strengthened; when the trajectory is nearly straight and the error is small, the emphasis is placed on heading control, with the curvature threshold set at 0.05 rad/m and M kept at the original error threshold. The heading error and the distance error are calculated according to Equations (15) and (18). As can be seen from Equation (37), when the position error is too large, the fitness value is dominated by the distance term, and the action command should primarily drive the autonomous vehicle toward the target trajectory point; when the error is within a certain range, the fitness value is dominated by the heading term, and the yaw angle should be adjusted preferentially for heading control.
- (5)
Choose to enter the summer escape or competition behavior based on the temperature. During the summer escape behavior, update the position of the crayfish (i.e., the candidate solution) according to the size of the cave. Using the change in error as the fitness value for evaluating each crayfish, select the optimal solution and output the new action command.
- (6)
Update the population fitness values, evaluate the population, and determine whether to exit the loop; if not, continue iterating.
- (7)
If the maximum number of iterations is reached or the performance meets the preset criteria, the iteration stops, and the current optimal fitness value and the corresponding optimal action are output.
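The following sketch ties the candidate generation, curvature-aware fitness, and selection steps together in simplified form. It is an illustrative reconstruction under our own assumptions: the function names are hypothetical, a plain random search stands in for the full COA update rules, and the weighting is a simplified version of the segmented fitness described above.

```python
import numpy as np

def simulate_errors(state, action, target, dt=0.05):
    """One-step rollout of the kinematic model; returns (distance error, heading error).
    A simplified stand-in for the controller's internal error prediction."""
    x, y, phi = state
    v, omega = action
    x, y, phi = x + v * np.cos(phi) * dt, y + v * np.sin(phi) * dt, phi + omega * dt
    tx, ty, tphi = target
    e_d = float(np.hypot(tx - x, ty - y))
    e_phi = float(np.arctan2(np.sin(tphi - phi), np.cos(tphi - phi)))
    return e_d, e_phi

def fitness(action, state, target, kappa, kappa_th=0.05, alpha=0.1):
    """Curvature-aware fitness: weight the distance error more heavily on curved
    segments or when the position error is large (illustrative weighting only)."""
    e_d, e_phi = simulate_errors(state, action, target)
    if kappa > kappa_th or e_d > 0.5:
        w_d = min(1.0, 0.5 + alpha * kappa / kappa_th)
    else:
        w_d = 0.4
    return w_d * e_d + (1.0 - w_d) * abs(e_phi)

def coa_compensate(base_action, state, target, kappa, bounds,
                   n_candidates=20, iters=30, rng=np.random.default_rng(0)):
    """Generate candidate actions around the DDPG base action within the allowed
    bounds and keep the best one (a plain random search standing in for COA)."""
    low, high = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    best_a = np.clip(np.asarray(base_action, float), low, high)
    best_f = fitness(best_a, state, target, kappa)
    for _ in range(iters):
        cands = np.clip(best_a + rng.uniform(-0.2, 0.2, (n_candidates, 2)) * (high - low),
                        low, high)
        f = np.array([fitness(a, state, target, kappa) for a in cands])
        if f.min() < best_f:
            best_f, best_a = float(f.min()), cands[f.argmin()]
    return best_a   # compensated [v, omega] command

# Example usage with illustrative numbers
state, target = (0.0, 0.0, 0.0), (1.0, 0.5, 0.3)   # (x, y, phi) and reference point/heading
base = np.array([1.0, 0.2])                         # action from the DDPG policy library
print(coa_compensate(base, state, target, kappa=0.08, bounds=([0.0, -1.0], [2.0, 1.0])))
```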
In summary, the process of the control algorithm is shown in Figure 12: