Dynamic Path Planning for Vehicles Based on Causal State-Masking Deep Reinforcement Learning

Hua, Xia; Zhang, Tengteng; Cao, Jun

doi:10.3390/a18030146

Open AccessArticle

Dynamic Path Planning for Vehicles Based on Causal State-Masking Deep Reinforcement Learning

by

Xia Hua

,

Tengteng Zhang

^* and

Jun Cao

College of Mechanical Engineering, Zhejiang University of Technology, 18 Chaowang Road, Hangzhou 310014, China

^*

Author to whom correspondence should be addressed.

Algorithms 2025, 18(3), 146; https://doi.org/10.3390/a18030146

Submission received: 18 December 2024 / Revised: 10 February 2025 / Accepted: 13 February 2025 / Published: 5 March 2025

(This article belongs to the Section Combinatorial Optimization, Graph, and Network Algorithms)

Download

Browse Figures

Versions Notes

Abstract

Dynamic path planning enables vehicles to autonomously navigate in unknown or continuously changing environments, thereby reducing reliance on fixed maps. Deep reinforcement learning (DRL), with its superior performance in handling high-dimensional state spaces and complex dynamic environments, has been widely applied to dynamic path planning. Traditional DRL methods are prone to capturing unnecessary noise information and irrelevant features during the training process, leading to instability and decreased adaptability of models in complex dynamic environments. To address this challenge, we propose a dynamic path-planning method based on our Causal State-Masking Twin-delayed Deep Deterministic Policy Gradient (CSM-TD3) algorithm. CSM-TD3 integrates a causal inference mechanism by introducing dynamic state masks and intervention mechanisms, allowing the policy network to focus on genuine causal features for decision optimization and thereby enhancing the convergence speed and generalization capabilities of the agent. Furthermore, causal state-masking DRL allows the system to learn the optimal mask configurations through backpropagation, enabling the model to adaptively adjust the causal features of interest. Extensive experimental results demonstrate that this method significantly enhances the convergence of the TD3 algorithm and effectively improves its performance in dynamic path planning.

Keywords:

dynamic path planning; vehicles; deep reinforcement learning; causal inference mechanism; state masks

1. Introduction

Path-planning technology aims to generate safe and efficient routes for vehicles, enabling them to reach target locations without collisions. Achieving optimal paths is crucial for ensuring that vehicles complete tasks accurately and swiftly [1,2,3]. Path planning includes two mainstream strategies: global and reactive planning. Global planning adopts an offline approach utilizing predefined environmental maps, typically represented by grid or graph structures, and determines the most effective path through search algorithms such as breadth-first search [4] and rapidly exploring random trees (RRTs) [5]. These methods are referred to as global planning because they consider the entire environment when planning the path, formulating a comprehensive route from the starting point to the destination.

Reactive planning, on the other hand, is applied in online scenarios, dynamically navigating by continuously processing real-time data from sensors, emphasizing local path planning, and focusing on immediate obstacle avoidance and adaptation to environmental changes. Common reactive planning methods include the vector field histogram (VFH) [6], dynamic window approach (DWA) [7], simultaneous localization and mapping [8], and model predictive control [9]. These techniques enable vehicles to make rapid decisions based on current sensor inputs, achieving excellent performance in unpredictable and dynamic environments.

In recent years, machine learning has played a significant role in global and reactive path planning, substantially enhancing the intelligence level of vehicle navigation [10,11,12]. In addition to traditional supervised learning algorithms, such as support vector machines (SVMs) [13] and decision trees, used for predicting optimal paths based on labeled data, reinforcement learning (RL) [14,15], a key branch of machine learning, is rapidly becoming a research hotspot in the field of path planning. RL autonomously optimizes decision-making strategies by interacting with the environment and utilizing reward mechanisms, enabling vehicles to achieve adaptive navigation in dynamic and complex environments. Common RL algorithms include Q-learning, deep Q-networks (DQNs) [16], policy gradient methods, and proximal policy optimization [17]. These methods not only improve the efficiency and flexibility of path planning but also reduce the dependence on large amounts of labeled data. Deep RL (DRL) [18], because of its excellent performance in handling high-dimensional state spaces and complex decision-making tasks, has become a valuable method in the field of motion planning.

However, traditional DRL methods are prone to capturing task-irrelevant features during the training process. These spurious correlations often arise from environmental noise, sensor errors, or dynamic obstacles, leading to a significant decline in the model’s generalization ability and robustness in new environments. To address this challenge in path planning, this study proposes an innovative Causal State-Masking Twin-Delayed Deep Deterministic Policy Gradient (CSM-TD3) method. This method, based on the twin-delayed deep deterministic (TD3) policy gradient algorithm, integrates a causal inference mechanism by introducing state masks and intervention mechanisms to effectively filter out features irrelevant to decision making. CSM-TD3 generates dynamic state masks to ensure that the policy network relies solely on state features with genuine causal relationships.

The main contributions of this study are summarized as follows:

Enhanced model robustness and transferability: By combining the causal consistency loss term and dynamic state masks, the policy network is compelled to rely solely on state features with genuine causal relationships for decision making. This addresses the issue of DRL models capturing spurious correlations during training.
Improved convergence and generalization capabilities of DRL in dynamic path planning: Introducing encoding configurations optimized through backpropagation learning enables the model to adaptively adjust the causal features of each state dimension.
Algorithm validation: The effectiveness of CSM-TD3 was validated through multiple simulation experiments. The experimental results demonstrate that this method performs excellently in terms of convergence and generalization capabilities in dynamic path planning.

The remainder of this paper is structured as follows. The related work is introduced in Section 2. The problem formulation is presented in Section 3. The implementation of the improved CSM-TD3 is elaborated in Section 4. The simulation results are compared in Section 5. The conclusions are summarized in Section 6.

2. Related Work

In the field of path planning, with the continuous advancement of automation technologies, particularly in autonomous driving and robotics, researchers have proposed various path-planning methods. These methods can primarily be categorized into traditional path-planning approaches and reinforcement learning (RL)-based methods. Traditional path-planning techniques, such as the A* algorithm and RRT, exhibit certain advantages in specific environments. On the other hand, RL-based path-planning methods demonstrate enhanced flexibility and adaptability, enabling online learning and adaptation in dynamic environments. Both of these approaches have been extensively studied in recent years.

2.1. Traditional Path-Planning Methods

Numerous studies on path planning have focused on optimizing classical path-planning algorithms to address increasingly complex environments and specific application requirements. For autonomous-vehicle path planning, Lin et al. [19] proposed an improved A* algorithm that incorporates redundant safety spaces and predictive planning, generating shorter and smoother paths to enhance collision avoidance capabilities. In terms of hybrid methods, Sun et al.’s study combined the enhanced A* algorithm with the DWA, utilizing bidirectional search, adaptive heuristics, and key node filtering to shorten path lengths and improve algorithm reliability [20]. Chen et al. [21] improved the Lifelong Planning A* (LPA*) algorithm by introducing a path priority partitioning mechanism, dividing paths into high- and low-priority categories to maintain the stability of high-priority paths. This method effectively avoids path conflicts by weighting high-priority path nodes, thereby enhancing the efficiency and stability of multi-path planning.

Li et al., in a study on autonomous driving [22], integrated environmental and vehicle constraints to propose an improved RRT-Connect algorithm. By designing appropriate path cost functions and heuristic guidance strategies, the method significantly reduced path lengths and planning times, expanding the path-planning capabilities of intelligent vehicles in complex environments. For mobile vehicles in complex unknown environments, Shi et al. [23] optimized the RRT-Connect algorithm through goal-biased strategies and node selection, improving path smoothness and success rates.

In dynamic environments, Lee et al. [24] combined Artificial Potential Fields (APFs) with RRT methods, rapidly generating collision-free local paths through weighted obstacle configurations, yielding improved replanning efficiency. In other developments, the virtual potential field combined with RRT (VPF-RRT) algorithm [25] and the convex decomposition topology-based RRT (CDT-RRT) algorithm [26] were designed to optimize the path smoothness and the performance in exploring the shortest paths, respectively. For robotic-arm path planning, Wen et al. [27] proposed the PS-RRT algorithm based on partitioned dynamic sampling and node rejection mechanisms, effectively reducing the number of invalid nodes and achieving efficient path planning.

Traditional path-planning methods largely rely on predefined environmental maps, making it difficult to cope with highly dynamic or unknown environmental changes, limiting their real-time application capabilities. Many optimization algorithms exhibit high computational complexity in complex environments, affecting the real-time performance and efficiency of path planning. These methods typically focus on the geometric optimization of paths, such as optimizing path length and smoothness, while neglecting the causal relationships and environmental interference factors in the path-planning process, resulting in insufficient robustness in dynamic environments.

2.2. RL-Based Path-Planning Methods

With the rapid development of deep learning and RL, RL-based path-planning methods have become a research hotspot because of their adaptability and online learning capabilities in dynamic environments. Many researchers have explored how to leverage RL to enhance the effectiveness of path planning. Tan et al. [28] proposed a complete coverage path-planning algorithm based on Q-learning, which achieved 100% coverage by introducing global environmental information and optimizing path strategies near obstacles while significantly reducing path redundancy. For unplugged ground platforms, Zhang et al. [29] improved the Q-learning algorithm to enhance path-planning flexibility and convergence speed, making it suitable for autonomous driving functions. Similarly, Zhang et al. [30] combined deep Q-Learning with path-smoothing techniques, demonstrating their model’s effectiveness in complex environments.

Kong et al. [31] proposed an APF path-planning strategy based on the partially observable Markov decision process (POMDP) model, which integrates sensor information to predict obstacle positions, enabling real-time path planning for uncrewed aerial vehicles in complex environments. In further research [32], the path generation method based on the deep Markov model (RL-PG) could ensure path-tracking safety through a motion fine-tuning module, exhibiting a high success rate. In research on drone path planning [33], the authors combined the A* algorithm with velocity obstacles in an RL approach, improving planning efficiency in dynamic environments. In addition, Jaramillo et al. [34] designed a reward function strategy to encourage agents to choose the shortest paths with fewer turns, thereby optimizing the number of turns and total distance.

For time-constrained path planning in complex environments, Wang et al. [35] integrated prioritized experience replay and improved exploration mechanisms in an off-policy RL method, enhancing sampling efficiency and rescue target achievement rates. Hua et al. [36] introduced the PIOPGRL algorithm based on the pigeon-inspired optimization algorithm, fusing it with RL to optimize spacecraft path planning under multiple attitude constraints. Finally, Feng et al. [37] proposed a path obstacle-avoidance control method based on the deep deterministic policy gradient that maps perception to control through deep neural networks, improving obstacle-avoidance performance.

RL methods are prone to capturing task-irrelevant noise and features in high-dimensional state spaces, leading to unstable policies and decreased generalization capabilities. Moreover, these methods typically require large amounts of training data and computational resources, resulting in lengthy training processes that hinder real-time applications. Additionally, many RL algorithms lack effective causal inference mechanisms when facing environmental disturbances and dynamic changes, affecting the model’s convergence in complex environments and its ability to be generalized to new environments.

3. Problem Formulation

To effectively address the problem of autonomous path planning for vehicles in complex dynamic environments, this study first needed to precisely model the vehicle’s kinematic and dynamic characteristics. This process required not only a comprehensive understanding of the vehicle’s motion behavior in different coordinate systems but also the integration of causal relationship models with RL frameworks to construct a suitable mathematical model for solving the problem. This section details the mathematical formalization process for the problem. Firstly, by establishing a vehicle dynamics model, we can describe the motion characteristics of the vehicle within a two-dimensional plane and clarify the specific description of the dynamic path-planning problem. Subsequently, we formalize the problem as a Markov decision process (MDP), laying the theoretical foundation for the subsequent causal RL algorithms. Finally, we construct a structural equation model (SEM) to identify key state features that directly impact vehicle decision making, introducing the concept of a state mask to enhance the interpretability of the model and the effectiveness of decision-making.

3.1. Dynamic Path-Planning Model

In dynamic path planning, accurately describing the vehicle’s dynamic characteristics is crucial for achieving precise control and path optimization. We model the vehicle as a planar rigid-body dynamic model moving within a two-dimensional plane, as represented in Figure 1. The dynamic equations of the vehicle in the body coordinate system are as follows:

\begin{matrix} \{\begin{matrix} M ({\dot{v}}_{x} - v_{y} ω) & = F_{x} \\ M ({\dot{v}}_{y} + v_{x} ω) & = F_{y} \\ I_{γ} \dot{ω} & = M_{z} \end{matrix} \end{matrix}

(1)

where M is the mass of the vehicle;

I_{γ}

is the moment of inertia of the vehicle about the centroid perpendicular to the plane;

{\dot{v}}_{x}

and

{\dot{v}}_{y}

are the longitudinal and lateral accelerations, respectively;

\dot{ω}

is the angular acceleration of the vehicle; and

F_{x}

,

F_{y}

, and

M_{z}

are the longitudinal force, lateral force, and yaw moment of the vehicle, respectively. To represent the vehicle’s motion state in the global coordinate system, we establish a kinematic model describing the position and heading-angle changes in that system:

\begin{matrix} \{\begin{matrix} \dot{x} & = v_{x} \cos ψ - v_{y} \sin ψ \\ \dot{y} & = v_{x} \sin ψ + v_{y} \cos ψ \\ \dot{ψ} & = ω \end{matrix} \end{matrix}

(2)

where

\dot{x}

and

\dot{y}

are the longitudinal and lateral velocities of the vehicle in the global coordinate system, respectively, and

ψ

is the vehicle’s heading angle. By integrating the vehicle’s dynamic and kinematic Equations (1) and (2), respectively, we can comprehensively consider the vehicle’s motion in its coordinate system and the global coordinate system, as well as the coupling effects of rotational motion on lateral and longitudinal movements. In the design of dynamic path planning for vehicles, the state variables are defined as

\begin{matrix} s_{v} = [\begin{matrix} x & y & ψ & v_{x} & v_{y} & ω \end{matrix}], \end{matrix}

(3)

and the control variables are defined as

\begin{matrix} u = [\begin{matrix} F_{x} & F_{y} & M_{z} \end{matrix}] . \end{matrix}

(4)

Based on the established vehicle dynamics model, we describe the dynamic path-planning problem as follows: given the vehicle’s initial state

s_{0}

and the target position

s_{g} = [x_{g}, y_{g}]

, a control policy

π_{θ}

that enables the vehicle to move from the starting position to the target position while satisfying dynamic and environmental constraints needs to be learned. During motion, the vehicle must avoid dynamic obstacles in the environment while optimizing path length and smoothness to achieve efficient and safe path planning. To formalize the dynamic path-planning problem as an optimization problem, we aim to minimize the cumulative cost from the start point to the target point. The specific optimization objective function is defined as

\begin{matrix} min_{π_{θ}} J = \sum_{t = 0}^{T} ∆ t [α C_{pat} (t) + β C_{smo} (t) + λ C_{col} (t)] \end{matrix}

(5)

where

∆ t

is the control time step; the coefficients

α

,

β

, and

λ

are weight factors used to balance the importance of each cost component;

C_{pat} (t)

represents the path length cost, encouraging the vehicle to choose shorter paths and reducing the travel distance and time;

C_{smo} (t)

denotes the path smoothness cost, encouraging the vehicle to maintain a smoother motion process and reducing motion shocks; and

C_{col} (t)

signifies the collision risk cost, penalizing the vehicle’s proximity to obstacles to ensure the safety of the path.

We define the path length cost

C_{pat} (t)

as the Euclidean distance between the current position and the target position:

\begin{matrix} C_{pat} (t) = \sqrt{{(x (t) - x_{g})}^{2} + {(y (t) - y_{g})}^{2}} \end{matrix}

(6)

where

x (t)

and

y (t)

represent the vehicle’s longitudinal and lateral positions in the global coordinate system at the current time step, respectively, and

x_{g}

and

y_{g}

denote the longitudinal and lateral positions of the endpoint in the global coordinate system, respectively.

Because the variations in control variables at adjacent time steps reflect the variation in the vehicle’s state at those times, we define the path smoothness cost

C_{smo} (t)

as related to the variations in control variables:

\begin{matrix} \begin{matrix} C_{smo} (t) = {(F_{x} (t) - F_{x} (t - 1))}^{2} + {(F_{y} (t) - F_{y} (t - 1))}^{2} + {(M_{z} (t) - M_{z} (t - 1))}^{2} \end{matrix} \end{matrix}

(7)

where

F_{x} (t)

,

F_{y} (t)

, and

M_{z} (t)

represent the vehicle’s current longitudinal force, lateral force, and yaw moment, respectively.

The collision risk cost is a penalty for the vehicle’s proximity to dynamic obstacles in the environment. To quantify this risk, we employ the following formula:

\begin{matrix} C_{col} (t) = \frac{1}{d_{\min} (t) + ϵ} \end{matrix}

(8)

where

ϵ

is a small constant to avoid division by 0 and

d_{\min} (t) = \sqrt{{(x (t) - x_{\min})}^{2} + {(y (t) - y_{\min})}^{2}}

represents the Euclidean distance between the vehicle’s current position and the nearest obstacle’s position, where

x_{\min}

and

y_{\min}

denote the longitudinal and lateral positions of the nearest obstacle in the global coordinate system, respectively. When the vehicle approaches an obstacle, the value of this cost function increases significantly, driving the vehicle away from obstacles to establish a safer path.

3.2. Formalization of the Markov Decision Process

Based on the definitions in Section 3.1, we formalize the vehicle’s actual control and dynamic path-planning optimization problem as an MDP to enable effective control decision making and optimization using DRL algorithms [38]. An MDP is a mathematical framework widely used for modeling decision-making problems, especially in the presence of uncertainty and dynamic environments. Within this framework, as represented in Figure 2, the agent’s actions are defined to be the same as the vehicle’s control variables, i.e.,

a = u

. The state is defined to include both the vehicle’s current state and the environmental state, specifically represented as

\begin{matrix} s = [\begin{matrix} s_{v} & d_{\min} & s_{g} \end{matrix}] = [\begin{matrix} x & y & ψ & v_{x} & v_{y} & ω & d_{\min} & x_{g} & y_{g} \end{matrix}] \end{matrix}

(9)

To ensure that the agent’s actions are executable and can maintain the vehicle’s stability, we constrain the vehicle’s control inputs within predefined ranges. These constraints not only reflect the vehicle’s physical limitations but also help prevent excessive control inputs that could lead to vehicle instability or loss of control. The specific constraints are as follows:

\begin{matrix} \{\begin{matrix} F_{x, min} & \leq F_{x} (t) \leq F_{x, max} \\ F_{y, min} & \leq F_{y} (t) \leq F_{y, max} \\ M_{z, min} & \leq M_{z} (t) \leq M_{z, max} \end{matrix} \end{matrix}

(10)

These limitations ensure that the control inputs remain within a safe and executable range for the vehicle, avoiding unnecessary risks caused by excessively large or small forces and moments.

For the vehicle to avoid collisions with any obstacles throughout the path-planning process, we introduce the concept of safety distance, in which the distance

d_{i}

between the vehicle and all obstacles must be greater than a preset safety distance

d_{s}

. This constraint can be expressed as

\begin{matrix} d_{i} (t) = \sqrt{{(x (t) - x_{i} (t))}^{2} + {(y (t) - y_{i} (t))}^{2}} \geq d_{s} \end{matrix}

(11)

where i denotes the number of obstacles and

x_{i} (t)

and

y_{i} (t)

represent the longitudinal and lateral positions, respectively, of the ith obstacle in the global coordinate system at time t. By mandating that the vehicle maintain a distance of at least

d_{s}

from all obstacles, we can effectively avoid potential collision risks.

We define the condition for the vehicle to reach the endpoint as the distance between the vehicle’s current position, and the endpoint as less than a predefined distance

d_{g}

:

\begin{matrix} d (t) = \sqrt{{(x (t) - x_{g})}^{2} + {(y (t) - y_{g})}^{2}} \leq d_{g} \end{matrix}

(12)

This condition ensures that the vehicle can reach the target position during the path-planning process while allowing a certain tolerance to accommodate uncertainties and dynamic changes in the actual environment.

In summary, the optimization problem for dynamic path planning can be formalized as

\begin{matrix} \begin{matrix} a^{*} = & \arg \min_{π_{θ}} \sum_{t = 0}^{T} ∆ t [α C_{pat} (t) + β C_{smo} (t) + λ C_{col} (t)] \\ s . t . & F_{x, min} \leq F_{x} (t) \leq F_{x, max} \\ F_{y, min} \leq F_{y} (t) \leq F_{y, max} \\ M_{z, min} \leq M_{z} (t) \leq M_{z, max} \\ d_{i} (t) \geq d_{s}, \forall i, \forall t \\ d (t) \leq d_{g} \end{matrix} \end{matrix}

(13)

The objective of this optimization problem is to minimize the cumulative cost from the start point to the target point by designing an appropriate control policy

π_{θ}

while satisfying dynamic and environmental constraints. Solving this problem provides both theoretical foundations and practical guidance for the autonomous path planning of intelligent vehicles in complex dynamic environments.

To guide the agent’s action decisions and optimize the planned path toward the predefined target, we design the following reward function:

\begin{matrix} r = \{\begin{matrix} - (K_{α} c o s t_{pat} (t) + K_{β} c o s t_{smo} (t) + K_{λ} c o s t_{col} (t)) & d (t) \geq d_{g}, d_{i} (t) \geq d_{s} \\ K_{a} & d (t) \leq d_{g} \\ - K_{b} & d_{i} (t) \leq d_{s} \end{matrix} \end{matrix}

(14)

where

K_{α}

is a positive gain parameter associated with the distance to the target point,

K_{β}

is a negative gain parameter for action smoothness,

K_{λ}

is a negative gain parameter associated with the distance to the nearest obstacle,

K_{a}

is the reward term for reaching the target point, and

K_{b}

is the penalty term for collision with obstacles. When the vehicle is in a normal driving state, i.e., the distance to the target point

d (t)

is greater than the preset distance

d_{g}

and the distance to all obstacles

d_{i} (t)

is greater than the safety distance

d_{s}

, the reward function is the negative of the weighted sum of the respective costs. This reward function design encourages the agent to choose efficient, safe, and smooth paths by minimizing the path length, path smoothness, and collision risk. When the vehicle successfully reaches the target point, i.e.,

d (t) \leq d_{g}

, a positive reward

K_{a}

is provided to encourage the agent to complete the path-planning task. When the vehicle’s distance to any obstacle satisfies

d_{i} (t) \leq d_{s}

, a negative penalty

- K_{b}

is imposed to strongly discourage the agent from selecting actions that lead to collisions, improving the safety of the path. Through this reward function design, the agent can progressively optimize its policy during the learning process, enabling efficient, safe, and smooth path planning in complex dynamic environments.

3.3. Causal Relationship Model

In complex dynamic environments, the state information perceived by the vehicle may contain a large amount of noise and many redundant features. These task-irrelevant features can lead to decreased model performance and weakened generalization capabilities. To address this, we introduce an SEM to identify key state features that have a direct causal impact on vehicle decision making and propose the concept of a state mask.

In the causal graph model, we use a directed acyclic graph (DAG) to represent the causal relationships between variables [39], as represented in Figure 3. Nodes represent states, actions, and rewards, while directed edges represent causal influences. We assume the following causal relationships:

(1): Relationships between state features. $s_{i} \to s_{i + 1}$ , indicating that feature $s_{i}$ has a direct causal influence on feature $s_{i + 1}$ .
(2): The direct influence of state features on control inputs. $s_{i} \to a_{i}$ , indicating that feature $s_{i}$ has a direct causal influence on control input $a_{i}$ .
(3): The influence of control inputs on the next state features. $a_{i} \to s_{i + 1}$ , indicating that control input $a_{i}$ influences the state feature $s_{i + 1}$ at the next time step.
(4): States and control inputs jointly determine rewards. $s, a \to r$ .

Based on the above assumptions, we can define an SEM for each variable to provide a quantitative description of causal relationships:

\begin{matrix} \{\begin{matrix} s_{i} & = f_{i} (Pa (s_{i}), n_{i}), & i = 1, 2, \dots, n \\ a_{l} & = π_{θ} (\tilde{s}), & l = 1, 2, \dots, m \\ r & = r (s, a) \end{matrix} \end{matrix}

(15)

where

Pa (s_{i})

denotes the set of parent nodes of

s_{i}

, i.e., the variables that directly influence

s_{i}

;

n_{i}

represents independent noise terms, that is, unmodeled external influences

n_{i} \sim P (n_{i})

; and

\tilde{s}

represents the set of state features after being masked. To identify the state features that have a direct causal influence on the control inputs

a

, we partition the set of state features

s

into causal and non-causal feature sets. The causal feature set

s^{C}

represents the features that directly influence the control inputs. The non-causal feature set

s^{N}

represents the features that do not have a direct causal relationship with the control inputs, that is,

s = s^{C} \cup s^{N}

. In the causal graph, only the features in

s^{C}

point to the control inputs

a

.

To leverage causal features in DRL, we introduce a state mask vector

m = [m_{1}, m_{2},

\dots, m_{n}]^{T}

, where

m_{i} \in 0, 1

, which is used to indicate whether each feature is causal:

\begin{matrix} m_{i} = \{\begin{matrix} 1 s_{i} \in s^{C} \\ 0 s_{i} \in s^{N} \end{matrix} \end{matrix}

(16)

After applying the state mask, the masked state is represented as

\tilde{s} = s ⊙ m

. In this manner, the policy function

π_{θ}

makes decisions solely based on the causal features

s^{C}

, filtering out noise features irrelevant to decision making and reducing the impact of spurious correlations on the model. To ensure model consistency under intervention conditions, we introduce a causal consistency constraint. Specifically, when intervening on a causal feature

s_{i} \in s^{C}

by setting it to a specific value

do (s_{i} = {\hat{s}}_{i})

, the conditional distribution of the control input should satisfy

\begin{matrix} P (a ∣ do (s_{i} = {\hat{s}}_{i}), s^{∖ s_{i}}) = P (a ∣ s_{i} = {\hat{s}}_{i}, s^{∖ s_{i}}) \end{matrix}

(17)

where

s^{∖ s_{i}}

denotes the set of state features excluding

s_{i}

. This consistency constraint helps the policy network learn the true causal relationships, preventing it from being misled by spurious correlations. To implement the causal consistency constraint during the RL training process, we introduce a causal consistency loss function

L_{c} (θ)

to quantify the consistency deviation of the model under intervention conditions:

\begin{matrix} L_{c} (θ) = E_{s_{i} \in s^{C}} [{∥π_{θ} (\tilde{s}) - π_{θ} ({\tilde{s}}_{d o (s_{i} = {\hat{s}}_{i})})∥}^{2}] \end{matrix}

(18)

Upon minimizing the causal consistency loss

L_{c} (θ)

, the policy network can better learn the true relationships between causal features and decision making.

4. Causal State-Masking TD3 Algorithm

To enhance the convergence and generalization capabilities of DRL in path planning within complex dynamic environments, we integrate a causal state-masking mechanism into the TD3 algorithm, resulting in the CSM-TD3 algorithm. The TD3 algorithm is an advanced deterministic policy gradient method in DRL that effectively mitigates the problem of overestimation bias and improves the convergence speed and stability of the algorithm through the mechanisms of dual-value networks and delayed policy updates [40]. Building upon TD3, CSM-TD3 enhances the model’s causal inference capabilities by introducing state masks and intervention mechanisms, enabling the agent to focus more on causal features that have the greatest impact on path-planning decisions, exhibiting greater efficiency and robustness in complex environments.

In the standard actor–critic architecture [41], the actor network is responsible for generating policies, while the critic network evaluates the value of policies. Based on this framework, we design and modify the critic and actor networks by incorporating causal inference theory into the TD3 algorithm. The following sections detail the design of the critic and actor networks, as well as the overall steps of the algorithm.

4.1. Critic Networks

The main function of the critic network is to evaluate the value of a given state and action, that is, to estimate the state–action value function

Q (s, a)

. The design of the critic network follows the standard TD3 algorithm structure. We utilize a twin-value network architecture, employing two independent critic networks

Q_{ϕ_{1}}

and

Q_{ϕ_{2}}

, represented by parameters

ϕ_{1}

and

ϕ_{2}

, respectively [42]. The purpose of this is to reduce the bias in value estimation by using two independent estimates, enhancing algorithm performance. The inputs to the critic networks include the original state

s

and action

a

, and the outputs are the corresponding Q-values

Q_{ϕ_{j}} (s, a)

,

j = 1, 2

. During the update process of the value networks, we define target value networks

Q_{ϕ_{j}^{'}}

, whose parameters are obtained from the main network parameters

ϕ_{j}

through a soft update method. We define the loss function of the value network

L_{critic}^{j} (ϕ_{j})

as

L_{critic}^{j} (ϕ_{j}) = \frac{1}{N} \sum_{i = 1}^{N} {(Q_{ϕ_{j}} (s_{i}, a_{i}) - y_{i})}^{2}

(19)

where

y_{i}

is the target Q-value:

y_{i} = r_{i} + γ min_{j = 1, 2} Q_{ϕ_{j}^{'}} (s_{i}^{'}, {\tilde{a}}_{i}^{'})

(20)

where

γ

is the discount factor and

{\tilde{a}}_{i}^{'}

is the target action. The target Q-value

y_{i}

is based on the Bellman equation, which combines the current immediate reward

r_{i}

with the discounted cumulative future return, forming a robust learning signal. By taking the minimum value from the two target value networks, we further reduce the overestimation of Q-values.

The update of the value network aims to minimize the value estimation error so that the critic network evaluates the actor network’s policy accurately and reliably. To update the critic network’s parameters

ϕ_{j}

, we use gradient descent to minimize the loss function

L_{critic}^{j} (ϕ_{j})

:

ϕ_{j} \leftarrow ϕ_{j} - η_{ϕ} \nabla_{ϕ_{j}} L_{critic}^{j} (ϕ_{j})

(21)

where

η_{ϕ}

is the learning rate of the critic network, which determines the step size for each parameter update.

4.2. Actor Networks

The actor network is responsible for generating optimal actions based on the current state. The design of the policy network directly affects the agent’s decision-making capability. To enhance the model’s focus on key causal features and improve convergence and generalization capabilities in path-planning tasks, we introduce state masks and intervention mechanisms into the actor network. Specifically, a trainable state-mask vector

m = [m_{1}, m_{2}, \dots, m_{n}]

, where

m_{i} \in [0, 1]

, is added before the input layer of the actor network. The original state

s_{t}

is processed through the mask to obtain the masked state

{\tilde{s}}_{t}

.

The state-mask vector

m

highlights causal features directly related to path-planning decisions while suppressing non-causal features and noise interference [43]. In complex dynamic environments, the state space typically contains a large number of features, only a subset of which directly influences path-planning decisions. By introducing state masks, the model can automatically identify and reinforce these key features while suppressing irrelevant or interfering features, enhancing the efficiency and stability of the policy. During the training process of the agent, the mask vector is treated as a trainable parameter and optimized through backpropagation. This means that the optimization of the mask vector is synchronized with the optimization of the policy network parameters, allowing the mask to adaptively adjust to accommodate different environments and task requirements.

To enable the actor network to better learn the true relationship between causal features and path-planning decisions, we introduce intervention mechanisms and causal consistency loss during the training process. The specific operations are as follows:

(1): Intervention feature selection. For each training sample, randomly select a causal feature $s_{t, k}$ . This process simulates possible causal disturbances in real environments, forcing the model to learn robust policies under different causal contexts.
(2): Intervention value generation. Apply perturbations to the selected causal feature to generate an intervention value ${\hat{s}}_{t, k}$ :

${\hat{s}}_{t, k} = s_{t, k} + δ, δ \sim N (0, σ_{int}^{2})$

(22)

where $δ$ follows a Gaussian distribution with a mean of zero and variance $σ_{int}^{2}$ . This perturbation simulates unpredictable changes or noise in the environment, helping the model learn robustness when causal features change.
(3): Intervention state construction. Generate the intervened state $s_{t, do}$ :

$s_{t, do} = [s_{t, 1}, \dots, s_{t, k - 1}, {\hat{s}}_{t, k}, s_{t, k + 1}, \dots, s_{t, n}]$

(23)

Replacing the selected causal feature with the intervention value constructs a new state, reflecting the environmental state when the feature is disturbed.
(4): Intervention state masking. Apply mask processing to the intervened state to obtain ${\tilde{s}}_{t, do} = s_{t, do} ⊙ m$ . Through state masks, adjust the intervened state features, allowing the model to focus on key causal features under intervention conditions.
(5): Policy network output calculation. Employ the policy network to generate actions $a_{t}$ and $a_{t, do}$ on ${\tilde{s}}_{t}$ and ${\tilde{s}}_{t, do}$ , respectively. By comparing the actions generated under original and intervened states, the model can learn the true impact of causal feature changes on decisions.
(6): Causal consistency loss calculation. Compute the causal consistency loss $L_{c} (θ)$ . This loss term encourages the policy network to maintain consistency in action outputs when facing causal feature interventions, thereby reducing dependence on non-causal features and enhancing the model’s causal inference capabilities.

By minimizing

L_{c} (θ)

, the policy network is compelled to respond appropriately to changes in causal features under intervention conditions, reducing dependence on non-causal features. Additionally, to encourage the sparsity of the mask vector, we introduce a mask sparsity regularization term

L_{m} (m)

in the loss function:

L_{m} (m) = {∥m∥}_{1} = \sum_{i = 1}^{n} | m_{i} |

(24)

L1 regularization helps drive the weights of unnecessary features in the mask vector toward zero, achieving sparsity in feature selection. This reduces the complexity of the model and further reinforces the focus on key causal features. Upon combining the above designs, the loss function of the policy network

L_{actor} (θ, m)

can be expressed as

L_{actor} (θ, m) = - \frac{1}{N} \sum_{i = 1}^{N} Q_{ϕ_{1}} (s_{i}, π_{θ} ({\tilde{s}}_{i})) + λ_{c} L_{c} (θ) + λ_{m} L_{m} (m)

(25)

where N is the batch size and

λ_{c}

and

λ_{m}

are hyperparameters used to balance the causal consistency loss and the mask sparsity regularization, respectively, so that the policy network considers both causal relationships and feature selection during the optimization process. By performing gradient descent on

L_{actor} (θ, m)

, we update the policy network parameters

θ

and the mask vector

m

:

\begin{matrix} θ \leftarrow θ - η_{θ} \nabla_{θ} L_{actor} (θ, m) \end{matrix}

(26)

\begin{matrix} m \leftarrow m - η_{m} \nabla_{m} L_{actor} (θ, m) \end{matrix}

(27)

where

η_{θ}

and

η_{m}

are the learning rates for the policy network and the mask vector, respectively. To ensure that

m_{i} \in [0, 1]

, the updated mask vector is clipped so that each element of the mask vector remains within a reasonable range. This prevents negative values or values exceeding 1, resulting in the effectiveness and interpretability of the mask when weighting features.

m_{i} \leftarrow clip (m_{i}, 0, 1)

(28)

With the above design, the actor network can generate high-quality policies and adaptively select and reinforce key causal features, enhancing the overall algorithm’s convergence and generalization capabilities in path planning within complex dynamic environments.

4.3. CSM-TD3 Architecture

Based on the above design of the critic and actor networks, the pseudo-code for the overall training and optimization steps of the CSM-TD3 algorithm is as described in Algorithm 1.

Algorithm 1 CSM-TD3

1:: Initialize the critic networks $Q_{ϕ_{1}}, Q_{ϕ_{2}}$ and the policy network $π_{θ}$ with random parameters $ϕ_{1}, ϕ_{2}, θ$ ;
2:: Initialize the target networks $Q_{ϕ_{1}}^{'}, Q_{ϕ_{2}}^{'}, π_{θ}^{'}$ , setting their parameters to $ϕ_{1}^{'} \leftarrow ϕ_{1}$ , $ϕ_{2}^{'} \leftarrow ϕ_{2}$ , $θ^{'} \leftarrow θ$ ;
3:: Initialize the state mask vector $m$ ;
4:: Initialize the experience-replay buffer Ω;
5:: for each $e p i s o d e = 1$ to M do
6:: Initialize the random process $R_{N}$ for action exploration;
7:: Receive the initial observation state $s_{1}$ ;
8:: for each time step $t = 1$ to E do
9:: Compute the masked state ${\tilde{s}}_{t} = s_{t} ⊙ m$ ;
10:: Select action $a_{t} = π_{θ} ({\tilde{s}}_{t}) + N_{t}$ , where $N_{t}$ is the exploration noise;
11:: Execute action $a_{t}$ , observe reward $r_{t}$ and next state $s_{t + 1}$ ;
12:: Store transition $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ into the experience replay buffer Ω;
13:: Randomly sample a batch of samples ${s_{i}, a_{i}, r_{i}, s_{i}^{'}}_{i = 1}^{N}$ from the buffer;
14:: Compute the target Q-value

$y_{i} = r_{i} + γ min_{j = 1, 2} Q_{ϕ_{j}^{'}} (s_{i}^{'}, {\tilde{a}}_{i}^{'})$
15:: Update the critic network parameters $ϕ_{j}$ by minimizing the loss function:

$L_{critic}^{j} (ϕ_{j}) = \frac{1}{N} \sum_{i = 1}^{N} {(Q_{ϕ_{j}} (s_{i}, a_{i}) - y_{i})}^{2}, j = 1, 2$
16:: if $t \mod d_{delay} = = 0$ then
17:: Randomly select a causal feature $s_{t, k}$ ;
18:: Generate an intervention value ${\hat{s}}_{t, k} = s_{t, k} + δ$ , where $δ \sim N (0, σ_{int}^{2})$ ;
19:: Construct the intervened state $s_{t, do}$ ;
20:: Compute the masked intervened state ${\tilde{s}}_{t, do} = s_{t, do} ⊙ m$ ;
21:: Compute the actions $a_{t}$ , $a_{t, do}$ ;
22:: Compute the causal consistency loss $L_{c} (θ)$ ;
23:: Compute the mask sparsity regularization $L_{m} (m)$ ;
24:: Compute the loss function for the policy network:

$L_{actor} (θ, m) = - \frac{1}{N} \sum_{i = 1}^{N} Q_{ϕ_{1}} (s_{i}, π_{θ} ({\tilde{s}}_{i})) + λ_{c} L_{c} (θ) + λ_{m} L_{m} (m)$
25:: Update the mask vector $m$ :

$\begin{matrix} m & \leftarrow m - η_{m} \nabla_{m} L_{actor} (θ, m) \end{matrix}$
26:: Clip the mask vector to ensure $m_{i} \in [0, 1]$ :

$m_{i} \leftarrow clip (m_{i}, 0, 1), \forall i \in {1, \dots, n}$
27:: Update the policy network and the target network parameters $θ$ , $ϕ_{j}$ :

$\begin{matrix} ϕ_{j} \leftarrow ϕ_{j} - η_{ϕ} \nabla_{ϕ_{j}} L_{critic}^{j} (ϕ_{j}), & j = 1, 2 \\ θ \leftarrow θ - η_{θ} \nabla_{θ} L_{actor} (θ, m) \end{matrix}$
28:: end if
29:: end for
30:: end for

Finally, the complete framework of the proposed CSM-TD3 algorithm interacts with the vehicle path-planning environment, as illustrated in Figure 4. This framework comprises three core components: the vehicle dynamic path-planning environment, the actor–critic network, and the experience replay buffer. The vehicle dynamic path-planning environment consists of the vehicle dynamics model and the environmental model, designed to interact with the DRL algorithm to generate the data required for training. This dynamic model is capable of simulating the vehicle’s motion states and responses under various environmental conditions, providing diverse training scenarios so that the agent can learn effective path-planning strategies in complex and dynamic environments. The actor–critic network component includes both the online and target actor neural networks, which are responsible for generating appropriate control actions. The online actor network outputs the action policy based on the current state, while the target actor network is utilized to stabilize the training process and mitigate fluctuations during policy updates. Concurrently, the DRL critic component comprises two online critic neural networks and two target critic networks. These are designed to evaluate the value of actions and provide reliable feedback for updating the actor network. The design of dual critic networks effectively reduces the overestimation bias in value estimation, enhancing the stability and performance of the algorithm. The experience-replay buffer stores historical experience data generated by the agent during interactions with the environment. These data include information such as states, actions, rewards, and future states. By leveraging the experience-replay buffer, the algorithm can break the temporal correlations between consecutive samples, improve data efficiency, and stabilize the training process.

5. Comparison of Simulation Results

To validate the effectiveness of the proposed improved DRL algorithm (CSM-TD3) in dynamic path planning, we compared the performance of the CSM-TD3 algorithm with that of the TD3, proximal policy optimization (PPO) and DQN algorithms in terms of convergence and generalization through simulation experiments. All experiments were conducted on a hardware platform equipped with an Intel Core i9-13900K processor, sourced from Intel Corporation, Santa Clara, CA, USA, and an NVIDIA GeForce RTX 4080 graphics card, sourced from NVIDIA Corporation, Santa Clara, CA, USA.

5.1. Convergence Analyses

The simulation parameters for different DRL algorithms are shown in Table 1. The discount factor determines the weight of future rewards relative to immediate rewards. The soft update coefficient controls the rate at which the target network converges toward the current network. The exploration noise reflects the degree of variability in exploration behavior, with its variation determined by bounded noise. The refresh frequency of the policy network dictates the update rate of the associated parameters. In the path-planning environment, the starting point is fixed at the center of the geographical coordinate system, while the endpoint is randomly generated within the path-planning boundary. The speed of obstacles varies within a specified range to ensure that the DRL algorithm can be trained in a more diverse environment; the training environment parameters are shown in Table 2. To ensure comparability among the algorithms, all the hyperparameters and training environment parameters for the comparative algorithms were kept consistent.

In the established training environment, each dynamic path-planning task was performed 10 times to validate the stability and effectiveness of the DRL algorithms. During training, an evaluation report was generated in each assessment. In the comparison graphs for the reward curves, the solid line represents the average results of the reward curves across the 10 trials, while the shaded area indicates the range of variation in the reward curves. The path curves in the dynamic path-planning comparison illustrate the optimal paths generated by the trained agents of the same algorithm across the 10 trials in the same environment.

The average return curves obtained from three DRL algorithms in this convergence comparison experiment—DQN, PPO, TD3, and the proposed CSM-TD3—are compared in Figure 5 and Figure 6, with all the other parameters held constant. The average return of the DQN demonstrates a significantly slower growth rate throughout the training process compared to TD3 and CSM-TD3 and exhibits considerable fluctuations, consistently hovering at a lower level. This is attributed to the DQN’s original design, which is tailored for discrete-action decision making, making it challenging to effectively model and optimize policies for high-dimensional continuous control tasks. Furthermore, the high volatility of the DQN indicates that it is heavily influenced by randomness during policy updates, lacking an effective mechanism to smooth and stabilize its optimization process, presenting certain limitations for continuous control tasks in dynamic path planning.

PPO is a policy gradient-based method that directly optimizes the policy for continuous actions, avoiding the complexity of discretizing actions in continuous spaces as required by DQN. PPO improves training stability by clipping policy updates, thereby restricting the magnitude of each update. This helps to mitigate performance fluctuations caused by excessively large updates, resulting in better convergence performance compared to DQN in continuous action spaces.

In this experiment, TD3 also outperformed the DQN, showing a clear upward trend in average returns and ultimately achieving higher return levels than the DQN. However, the return curve for TD3 exhibited significant fluctuations, indicating a degree of instability, particularly during the initial training phase, where the returns varied greatly. This suggests that TD3’s handling of noise during the policy exploration phase and its ability to improve policies were relatively limited, making it difficult to maintain consistent performance improvements. Thus, although TD3 gradually surpassed the DQN during convergence, its overall stability and final convergence performance were still inferior to those of CSM-TD3.

CSM-TD3, on the other hand, demonstrated a consistently increasing trend in average returns throughout the training process, with relatively small fluctuations. This indicates that the algorithm possesses strong sample efficiency and stable convergence performance, attributable to the improvements made to TD3 through the introduction of causal models and a state-masking mechanism. Specifically, CSM-TD3 employs an SEM to identify and leverage state features that have direct causal effects on control inputs, using a state-masking vector to filter out irrelevant features that do not contribute to decision making, reducing noise and spurious correlations. Moreover, the incorporation of causal consistency constraints and causal consistency loss enables the effective learning of causal relationships, which significantly enhances CSM-TD3’s policy optimization compared to the traditional TD3 algorithm. The comparative results of different DRL algorithms are presented in Table 3. Compared to the DQN, the proposed CSM-TD3 improved the convergence performance by 51%, and compared to TD3, it achieved an improvement of 37%.

5.2. Generalization Analyses

In this path-planning simulation, we designed a dynamic path-planning environment to evaluate the generalization capabilities and path-planning performance of three RL algorithms: DQN, TD3, and CSM-TD3. The entire experiment was divided into two phases: the first phase involved a static obstacle environment, while the second phase featured a dynamic obstacle environment. This design allows us to progressively increase the difficulty of path planning from simple to complex scenarios. In the static obstacle environment, the agent only needs to contend with the challenges posed by a static setting, providing a performance baseline for each algorithm under fundamental conditions. In contrast, the dynamic obstacle environment introduces continuously moving obstacles, significantly increasing the complexity of path planning and enabling the assessment of each algorithm’s adaptability and decision-making responsiveness in the face of dynamic changes. Through this phased approach, we can observe how the agent transitions from handling static obstacles to confronting more complex dynamic environments.

The environment is modeled as a two-dimensional plane with dimensions set to

[- 5, 5] \times [- 5, 5]

, where the agent’s starting point is at

(0, 0)

and the target points are located at three positions:

(0, 4)

,

(- 3, 4)

, and

(4, 4)

. This design maintains consistent initial conditions for each algorithm across different phases, ensuring comparability in their performances. The diversity of target points further tests the agent’s flexibility and adaptability in addressing various path requirements. This design, transitioning from the same starting point to different endpoints, increases the difficulty of path planning incrementally in each phase. As a result, the agent must demonstrate higher generalization and policy optimization capabilities to effectively navigate the challenges presented. In Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12, the black points represent the starting point, gray points represent obstacles, dashed circles represent the moving trajectory of obstacles, and purple points represent the goal point. The arrow indicates the direction of movement of the obstacle

In the static obstacle scenarios, as illustrated in Figure 7, Figure 8 and Figure 9, the path-planning results of the DQN exhibit significant deviations and instability. When faced with static obstacles, the DQN frequently demonstrates unnecessary detours and backtracking, particularly in densely packed obstacle areas, leading to chaotic path selections filled with superfluous turns. This performance issue stems from the DQN’s heavy reliance on discrete Q-value estimations, which results in inaccurate assessments of the optimal path and an inability to generate effective continuous actions. This directly manifests in its paths, showing excessive backtracking and redundant movements, greatly increasing the complexity and overall cost of the planned routes.

TD3 also manages to complete tasks in a static obstacle environment, but its paths are characterized by numerous unnecessary detours and twists. Especially when navigating narrow passages, the paths selected by TD3 lack smoothness, indicating either overly conservative obstacle avoidance or insufficient understanding of the environment. An analysis of the path trajectories reveals that TD3 experiences repeated adjustments in certain areas, suggesting that its strategy fails to optimize control decisions effectively when confronted with complex obstacle distributions. Although TD3 employs delayed policy updates to mitigate overestimation bias, it still faces limitations in extracting and utilizing critical state features in complex environments. Particularly when the agent encounters multiple obstacles, the lack of a global perspective on the optimal path results in redundant avoidance actions.

In contrast, CSM-TD3 demonstrates significant advantages in static obstacle scenarios, producing the smoothest paths that closely approximate the shortest route. The agent is able to navigate around obstacles precisely, reaching the target point with minimal detour costs and optimal decision making. The path diagram indicates that CSM-TD3 avoids unnecessary turns and backtracking, showcasing its superior performance. This advantage is primarily attributed to the causal enhancement strategies introduced in CSM-TD3, which effectively identify key state features that directly impact control decisions through structural equation modeling. The state-masking mechanism filters out redundant noise features so that the agent can focus on the crucial information affecting path planning during decision making, enhancing both the effectiveness and efficiency of path planning. Notably, when the agent traverses narrow corridors, CSM-TD3 exhibits exceptional smoothness and agility in responding to environmental changes, minimizing fluctuations along the entire path. This reflects its accurate judgment of obstacle layouts and efficient planning capabilities.

In the dynamic obstacle scenarios, as shown in Figure 10, Figure 11 and Figure 12, the paths generated by the DQN frequently deviate from the optimal direction. Particularly when obstacles approach, the agent struggles to react in time to avoid them, resulting in repeated backtracking and detours. This behavior highlights the DQN’s problems in handling rapidly changing dynamic obstacles, often causing its path planning to fall into local suboptimal traps and lack a global planning perspective.

TD3’s performance in the dynamic obstacle environment also declines. While TD3 ultimately manages to navigate around obstacles to reach the target point, its paths exhibit noticeable detours. Especially when dynamic obstacles suddenly come close, TD3’s responses appear sluggish, and its path adjustments lag behind, leading to decreased overall smoothness and efficiency. This inflexibility becomes particularly evident in scenarios requiring frequent path adjustments to cope with moving obstacles, resulting in unnecessary circumventions. Such delays directly affect the overall quality of path planning, making the final planned paths more complex and less efficient compared to those generated by CSM-TD3.

CSM-TD3, on the other hand, maintains exceptional performance in the dynamic obstacle environment. Its paths can adapt in real time to the dynamic changes of obstacles, demonstrating the rapid processing of dynamic information and real-time decision-making capabilities. The experimental results show that CSM-TD3 can swiftly adjust paths as obstacles approach, allowing the agent to maintain a smooth trajectory while effectively avoiding all dynamic obstacles, reaching the target point successfully. This performance advantage of CSM-TD3 is attributed to the causal consistency constraints, which ensure that the agent’s strategy focuses on causal features that directly influence control inputs when responding to environmental changes. This reduces the interference of spurious correlations in path planning, enabling the agent to react quickly and make effective adjustments in the presence of dynamic obstacles, achieving optimality and stability of its paths.

Upon combining the above analysis with the path step comparisons in Table 4 and Table 5, we see that CSM-TD3 exhibits exceptional planning performance in both static and dynamic obstacle environments. Leveraging causal enhancement strategies and causal consistency constraints, CSM-TD3 effectively identifies and utilizes state features that have direct causal impacts on control decisions. This capability allows the agent to maintain optimal, shortest, and smooth paths even in complex dynamic environments.

6. Conclusions

This study investigates the challenges faced by traditional DRL algorithms in dynamic path planning, particularly how confounding factors in the environment can impair convergence and generalization, making it difficult to generate high-quality paths. We propose a causal state-masking-based DRL method (CSM-TD3). Initially, we integrate a causal consistency loss term and dynamic state masking to combine causal inference mechanisms with TD3, compelling the policy network to rely solely on state features that reflect true causal relationships for decision making. This effectively filters out irrelevant features that do not contribute to the decision-making process. To enable TD3 to adaptively adjust the causal features across various state dimensions, we introduce an encoding configuration optimized through backpropagation learning. This allows the model to dynamically fine-tune causal features, enhancing the agent’s convergence during training and its generalization ability across different environments. Numerical simulations validate the efficacy of the proposed method. The experimental results indicate a 25% improvement in convergence performance. The CSM-TD3 algorithm effectively addresses dynamic path-planning challenges across various complex environments. Compared to other DRL algorithms, our approach generates shorter paths with better smoothness and demonstrates superior obstacle-avoidance capabilities.

Although the CSM-TD3 method demonstrates advantages in dynamic path planning, its training process relies on a large amount of training data and significant computational resources, which may limit its application in resource-constrained scenarios. Future research could integrate other forms of causal inference and adaptive mechanisms to make the learning strategy more efficient.

Author Contributions

Conceptualization, X.H. and T.Z.; methodology, T.Z.; software, T.Z.; validation, X.H., T.Z. and X.H.; formal analysis, T.Z.; resources, X.H.; data curation, T.Z.; writing—original draft preparation, J.C. and T.Z.; writing—review and editing, X.H., J.C., T.Z. and J.C.; project administration, X.H.; funding acquisition, X.H. and T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China through Grant No. 52005443, and the Natural Science Foundation of Zhejiang Province through Grant No. LQ21E050016.

Data Availability Statement

The authors do not have permission to share data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Quan, L.; Han, L.; Zhou, B.; Shen, S.; Gao, F. Survey of UAV motion planning. IET Cyber-Syst. Robot. 2020, 2, 14–21. [Google Scholar] [CrossRef]
Li, D.; Wang, P.; Du, L. Path planning technologies for autonomous underwater vehicles-a review. IEEE Access 2018, 7, 9745–9768. [Google Scholar] [CrossRef]
Wu, Z. Path planning techniques for autonomous vehicles. AIP Conf. Proc. 2024, 3144, 050020. [Google Scholar]
Vásconez, J.P.; Basoalto, F.; Briceño, I.C.; Pantoja, J.M.; Larenas, R.A.; Rios, J.H.; Castro, F.A. Comparison of path planning methods for robot navigation in simulated agricultural environments. Procedia Comput. Sci. 2023, 220, 898–903. [Google Scholar] [CrossRef]
Feng, L.; Jia, J. Improved algorithm of RRT path planning based on comparison optimization. Jisuanji Gongcheng Yu Yingyong (Comput. Eng. Appl.) 2011, 47. Available online: https://api.semanticscholar.org/CorpusID:62978159 (accessed on 23 October 2024).
Hou, J.; Shi, L.; Jiang, W.; Luo, Z.; Yang, L. Dynamic path planning of mobile robots by combining improved informed-RRT* and VFH+ algorithms. In Proceedings of the Eleventh International Symposium on Precision Mechanical Measurements, Guangzhou, China, 24–26 November 2023; Volume 13178, pp. 624–641. [Google Scholar]
Liu, Y.; Wang, C.; Wu, H.; Wei, Y. Mobile Robot Path Planning Based on Kinematically Constrained A-Star Algorithm and DWA Fusion Algorithm. Mathematics 2023, 11, 4552. [Google Scholar] [CrossRef]
Zhang, X.; Lai, J.; Xu, D.; Li, H.; Fu, M. 2D Lidar-Based SLAM and Path Planning for Indoor Rescue Using Mobile Robots. J. Adv. Transp. 2020, 2020, 8867937. [Google Scholar] [CrossRef]
Ishihara, S.; Kanai, M.; Narikawa, R.; Ohtsuka, T. A proposal of path planning for robots in warehouses by model predictive control without using global paths. IFAC-PapersOnLine 2022, 55, 573–578. [Google Scholar] [CrossRef]
Huang, D.X. Prospects of AI assisted planning based on machine learning. Urban Dev. Stud. 2017, 24, 50–55. [Google Scholar]
Rocha, L.G.S.; Kim, P.H.C.; Teixeira Vivaldini, K.C. Performance analysis of path planning techniques for autonomous robots: A deep path planning analysis in 2D environments. Int. J. Intell. Robot. Appl. 2023, 7, 778–794. [Google Scholar] [CrossRef]
Mokhtari, S.A. Fopid control of quadrotor based on neural networks optimization and path planning through machine learning and pso algorithm. Int. J. Aeronaut. Space Sci. 2022, 23, 567–582. [Google Scholar] [CrossRef]
Xu, T.; Chen, S.; Wang, D.; Wu, T.; Xu, Y.; Zhang, W. A novel path planning method for articulated road roller using support vector machine and longest accessible path with course correction. IEEE Access 2019, 7, 182784–182795. [Google Scholar]
Zhang, Y.; Zhao, W.; Wang, J.; Yuan, Y. Recent progress, challenges and future prospects of applied deep reinforcement learning: A practical perspective in path planning. Neurocomputing 2024, 608, 128423. [Google Scholar] [CrossRef]
Almazrouei, K.; Kamel, I.; Rabie, T. Dynamic obstacle avoidance and path planning through reinforcement learning. Appl. Sci. 2023, 13, 8174. [Google Scholar] [CrossRef]
Yang, H.; Qi, Y.; Wu, B.; Rong, D.; Hong, M.; Wang, J. Path planning of mobile robots based on memristor reinforcement learning in dynamic environment. J. Syst. Simul. 2023, 35, 1619. [Google Scholar]
Yang, L.; Bi, J.; Yuan, H. Dynamic path planning for mobile robots with deep reinforcement learning. IFAC-PapersOnLine 2022, 55, 19–24. [Google Scholar] [CrossRef]
Ren, J.; Huang, X.; Huang, R.N. Efficient deep reinforcement learning for optimal path planning. Electronics 2022, 11, 3628. [Google Scholar] [CrossRef]
Lin, Z.; Wu, K.; Shen, R.; Yu, X.; Huang, S. An Efficient and Accurate A-star Algorithm for Autonomous Vehicle Path Planning. IEEE Trans. Veh. Technol. 2023, 73, 9003–9008. [Google Scholar] [CrossRef]
Sun, Y.; Yuan, Q.; Gao, Q.; Xu, L. A Multiple Environment Available Path Planning Based on an Improved A* Algorithm. Int. J. Comput. Intell. Syst. 2024, 17, 172. [Google Scholar] [CrossRef]
Chen, G.; Tan, Y.; Zeng, G.; Tang, F.; Zhou, W. Research on multipath planning problems based on an improved LPA* algorithm. In Proceedings of the Seventh International Conference on Advanced Electronic Materials, Computers, and Software Engineering (AEMCSE 2024), Nanchang, China, 10–12 May 2024; Volume 13229, pp. 674–679. [Google Scholar]
Li, J.; Huang, C.; Pan, M. Path-planning algorithms for self-driving vehicles based on improved RRT-Connect. Transp. Saf. Environ. 2023, 5, tdac061. [Google Scholar] [CrossRef]
Shi, X.; Zhang, L.; Tang, L.; Dong, L.; Peng, J. Path planning of mobile robot using improved RRT_Connect algorithm. Sci. Technol. Rev. 2024, 42, 111–119. [Google Scholar]
Lee, C.C.; Song, K.T. Path re-planning design of a cobot in a dynamic environment based on current obstacle configuration. IEEE Robot. Autom. Lett. 2023, 8, 1183–1190. [Google Scholar] [CrossRef]
Chen, Z.; Yu, J.; Zhao, Z.; Wang, X.; Chen, Y. A path-planning method considering environmental disturbance based on VPF-RRT. Drones 2023, 7, 145. [Google Scholar] [CrossRef]
Liu, J.; Fu, M.; Liu, A.; Zhang, W.; Chen, B. A Homotopy Invariant Based on Convex Dissection Topology and a Distance Optimal Path Planning Algorithm. IEEE Robot. Autom. Lett. 2023, 8, 7695–7702. [Google Scholar] [CrossRef]
Wen, Y.; Wen, H.; Zhang, Z. Obstacle avoidance path planning of manipulator based on improved RRT algorithm. In Proceedings of the 2021 International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China, 8–10 January 2021; pp. 104–109. [Google Scholar]
Tan, X.; Han, L.; Gong, H.; Wu, Q. Biologically inspired complete coverage path planning algorithm based on Q-learning. Sensors 2023, 23, 4647. [Google Scholar] [CrossRef]
Zhang, P.; Zhang, C.Y.; Gai, W.L. Research on path planning algorithm of unmanned ground platform based on reinforcement learning. In Proceedings of the Third International Conference on Artificial Intelligence and Computer Engineering (ICAICE 2022), Wuhan, China, 4–6 November 2022; Volume 12610, pp. 1241–1246. [Google Scholar]
Zhang, D.; Ju, R.; Cao, Z. Reinforcement learning-based motion control for snake robots in complex environments. Robotica 2024, 42, 947–961. [Google Scholar] [CrossRef]
Kong, F.; Wang, Q.; Gao, S.; Yu, H. B-APFDQN: A UAV path planning algorithm based on deep Q-network and artificial potential field. IEEE Access 2023, 11, 44051–44064. [Google Scholar] [CrossRef]
Zhang, L.; Hou, Z.; Wang, J.; Liu, Z.; Li, W. Robot navigation with reinforcement learned path generation and fine-tuned motion control. IEEE Robot. Autom. Lett. 2023, 8, 4489–4496. [Google Scholar] [CrossRef]
Dan, H.; Peng, H. 3D path planning of UAV based on improved reinforcement learning. In Proceedings of the Second International Conference on Electronic Information Engineering and Computer Communication (EIECC 2022), Xi’an, China, 25–27 November 2022; Volume 12594, pp. 21–30. [Google Scholar]
Jaramillo-Martínez, R.; Chavero-Navarrete, E.; Ibarra-Pérez, T. Reinforcement-Learning-Based Path Planning: A Reward Function Strategy. Appl. Sci. 2024, 14, 7654. [Google Scholar] [CrossRef]
Wang, Z.; Gao, W.; Li, G.; Wang, Z.; Gong, M. Path Planning for Unmanned Aerial Vehicle via Off-Policy Reinforcement Learning With Enhanced Exploration. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2625–2639. [Google Scholar] [CrossRef]
Hua, B.; Sun, S.; Wu, Y.; Chen, Z. A spacecraft attitude maneuvering path planning method based on PIO-improved reinforcement learning. Sci. Sin. Technol. 2021, 53, 1674–7259. [Google Scholar] [CrossRef]
Feng, Z.; Wang, G.; Chang, Y.; Shi, Y.; Geng, J.; Zhu, C. Research on path planning and obstacle avoidance for unmanned platforms based on reinforcement learning. In Proceedings of the International Conference on Electronic Materials and Information Engineering (EMIE 2023), Guangzhou, China, 14–16 July 2023; Volume 12919, pp. 107–112. [Google Scholar]
Liu, Y.; Wang, H.; Wu, T.; Lun, Y.; Fan, J.; Wu, J. Attitude control for hypersonic reentry vehicles: An efficient deep reinforcement learning method. Appl. Soft Comput. 2022, 123, 108865. [Google Scholar] [CrossRef]
Tang, W.; Wu, F.; Lin, S.W.; Ding, Z.; Liu, J.; Liu, Y.; He, J. Causal deconfounding deep reinforcement learning for mobile robot motion planning. Knowl.-Based Syst. 2024, 303, 112406. [Google Scholar] [CrossRef]
Teng, Y.; Natalino, C.; Li, H.; Yang, R.; Majeed, J.; Shen, S.; Monti, P.; Nejabati, R.; Yan, S.; Simeonidou, D. Deep-reinforcement-learning-based RMSCA for space division multiplexing networks with multi-core fibers. J. Opt. Commun. Netw. 2024, 16, C76–C87. [Google Scholar] [CrossRef]
Zheng, J.; Kurt, M.N.; Wang, X. Stochastic integrated actor–critic for deep reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 6654–6666. [Google Scholar] [CrossRef]
Luo, X.; Wang, Q. UAV path planning based on the average TD3 algorithm with prioritized experience replay. IEEE Access 2024, 12, 38017–38029. [Google Scholar] [CrossRef]
Yu, K.; Liu, L.; Li, J. A unified view of causal and non-causal feature selection. ACM Trans. Knowl. Discov. Data (TKDD) 2021, 15, 63. [Google Scholar] [CrossRef]

Figure 1. Vehicle dynamic model and path-planning environment.

Figure 2. Agent and environment interactions according to the Markov model.

Figure 3. Causal graph model of the DAG.

Figure 4. A diagram of the proposed path-planning control framework for vehicles based on CSM-TD3.

Figure 5. Comparisons of the average return obtained from the DQN, TD3, and the proposed CSM-TD3 when the obstacle is stationary.

Figure 6. Comparisons of the average return obtained from the DQN, TD3, and the proposed CSM-TD3 when the obstacle is moving.

Figure 7. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is stationary and the endpoint is

(0, 4)

.

Figure 7. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is stationary and the endpoint is

(0, 4)

.

Figure 8. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is stationary and the endpoint is

(- 4, 3)

.

Figure 8. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is stationary and the endpoint is

(- 4, 3)

.

Figure 9. Comparison of the path-planning diagrams of the DQN, TD3, and proposed GER-TD3 when the obstacle is stationary and the endpoint is

(4, 4)

.

Figure 9. Comparison of the path-planning diagrams of the DQN, TD3, and proposed GER-TD3 when the obstacle is stationary and the endpoint is

(4, 4)

.

Figure 10. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is moving and the endpoint is

(0, 4)

.

Figure 10. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is moving and the endpoint is

(0, 4)

.

Figure 11. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is moving and the endpoint is

(- 4, 3)

.

Figure 11. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is moving and the endpoint is

(- 4, 3)

.

Figure 12. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is moving and the endpoint is

(4, 4)

.

Figure 12. Comparison of the path-planning diagrams of the DQN, TD3, and the proposed GER-TD3 when the obstacle is moving and the endpoint is

(4, 4)

.

Table 1. Simulation parameters used in different DRL algorithms.

DRL Algorithm	CSM-TD3	TD3	PPO	DQN
Hidden layer dimension	256	256	256	256
Batch size	256	256	256	256
Discount factor	0.99	0.99	0.99	0.99
Soft update coefficient	0.05	0.05	×	×
Policy noise	0.2	0.2	×	×
Noise clipping range	0.5	0.5	×	×
Policy update frequency	2	2	2	×
Priority exponent	0.6	0.6	0.6	0.6
Learning rate	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁴

Table 2. System parameters of the environment.

Path-Planning Parameter	Parameter
Vehicle mass (kg)	1477
Vehicle yaw inertia ( $kg \cdot m^{2}$ )	1536.7
Boundary length (m)	8
Angular velocity range of obstacles (rad/s)	$[0.2, 0.4]$
Radius of obstacles (m)	0.5
Radius of obstacle motion path (m)	2.83

Table 3. Mean average returns obtained from different DRL algorithms.

DRL Algorithm	DQN	TD3	CSM-TD3
Mean average return	−781.6090	−508.2607	−318.9694

Table 4. Comparisons of the path steps obtained from the DQN, TD3, and the proposed CSM-TD3 when the obstacle is stationary.

DRL Algorithm	DQN	TD3	CSM-TD3
Location of Goal Point
$(0, 4)$	26	24	23
$(- 4, 3)$	40	35	33
$(4, 4)$	67	50	43

Table 5. Comparisons of the path steps obtained from the DQN, TD3, and the proposed CSM-TD3 when the obstacle is moving.

DRL Algorithm	DQN	TD3	CSM-TD3
Location of Goal Point
$(0, 4)$	49	42	23
$(- 4, 3)$	44	34	32
$(4, 4)$	58	40	36

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hua, X.; Zhang, T.; Cao, J. Dynamic Path Planning for Vehicles Based on Causal State-Masking Deep Reinforcement Learning. Algorithms 2025, 18, 146. https://doi.org/10.3390/a18030146

AMA Style

Hua X, Zhang T, Cao J. Dynamic Path Planning for Vehicles Based on Causal State-Masking Deep Reinforcement Learning. Algorithms. 2025; 18(3):146. https://doi.org/10.3390/a18030146

Chicago/Turabian Style

Hua, Xia, Tengteng Zhang, and Jun Cao. 2025. "Dynamic Path Planning for Vehicles Based on Causal State-Masking Deep Reinforcement Learning" Algorithms 18, no. 3: 146. https://doi.org/10.3390/a18030146

APA Style

Hua, X., Zhang, T., & Cao, J. (2025). Dynamic Path Planning for Vehicles Based on Causal State-Masking Deep Reinforcement Learning. Algorithms, 18(3), 146. https://doi.org/10.3390/a18030146

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Dynamic Path Planning for Vehicles Based on Causal State-Masking Deep Reinforcement Learning

Abstract

1. Introduction

2. Related Work

2.1. Traditional Path-Planning Methods

2.2. RL-Based Path-Planning Methods

3. Problem Formulation

3.1. Dynamic Path-Planning Model

3.2. Formalization of the Markov Decision Process

3.3. Causal Relationship Model

4. Causal State-Masking TD3 Algorithm

4.1. Critic Networks

4.2. Actor Networks

4.3. CSM-TD3 Architecture

5. Comparison of Simulation Results

5.1. Convergence Analyses

5.2. Generalization Analyses

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI