1. Introduction
Reinforcement Learning (RL) has emerged as a powerful framework for solving sequential decision-making problems, where an agent interacts with an environment to learn optimal behaviors by maximizing cumulative rewards [1]. This ability to adapt through interaction makes RL particularly suitable for mobile robot path planning in dynamic and uncertain environments. Traditional path planning algorithms, such as A* and Dijkstra's, excel in static and well-structured settings but falter in applications characterized by uncertainty, unpredictability, and incomplete knowledge of the environment. For instance, search-and-rescue operations in disaster-stricken areas often involve navigating through unknown terrains filled with debris and dynamic obstacles. Similarly, autonomous delivery robots operating in crowded urban environments must make real-time decisions to avoid pedestrians and other obstacles while efficiently reaching their destinations. In such scenarios, RL offers a robust solution by learning policies that directly map sensory inputs to actions, enabling real-time decision-making and adaptability [2].
Deep Reinforcement Learning (DRL), a subset of RL that leverages deep neural networks, has significantly advanced the capabilities of RL by addressing challenges associated with high-dimensional state–action spaces and continuous control. Unlike traditional RL, which struggles to scale in complex environments, DRL employs neural networks to approximate value functions and policies, enabling efficient learning and decision-making in scenarios involving raw sensory inputs like camera images or LiDAR data [3]. This scalability and versatility make DRL an ideal framework for tackling complex navigation problems, such as autonomous navigation in cluttered indoor spaces, multi-agent coordination, and socially aware robot path planning. Techniques such as Proximal Policy Optimization (PPO) [4], Deep Deterministic Policy Gradient (DDPG) [3], and Soft Actor–Critic (SAC) [2] have demonstrated remarkable performance in such applications, achieving state-of-the-art results across a variety of benchmark environments.
This paper presents a comprehensive review of DRL techniques for mobile robot path planning, categorizing existing methods and analyzing their strengths and limitations. It explores how DRL can address challenges in path planning, including navigation in dynamic and uncertain environments, balancing efficiency with safety, and generalizing to unseen scenarios. The review focuses on the application of DRL in tasks requiring continuous control, adaptive decision-making, and integration with multimodal sensory inputs.
Several reviews of DRL solutions for mobile robot path planning have been presented in the literature, offering broad overviews of existing methods and their applications. However, this work differentiates itself by providing a structured and standardized analysis of DRL techniques through a unified notation framework. Specifically, we categorize various DRL methods and reformulate their mathematical formulations under a consistent notation, facilitating direct comparisons across different approaches. Furthermore, for each category, multiple representative solutions are analyzed, with their key components restructured to maintain uniformity in presentation. A notable contribution of this review is the inclusion of original block diagrams and pseudocode representations for nearly all solutions discussed, offering a clearer understanding of their working principles and implementation details. These enhancements provide a more accessible and structured perspective on DRL-based path planning, allowing researchers and practitioners to better interpret, compare, and apply these methodologies in real-world scenarios.
The rest of this paper is organized as follows: In Section 2, the problem of designing a reinforcement learning solution for mobile robot path planning is formally defined using a differential drive robot as an illustrative example, and the advantages of reinforcement learning are detailed. Section 3 then explores deep reinforcement learning and discusses how uncertainties are modeled and handled in DRL, with a note on sensor fusion for enhanced path planning using DRL; a categorization of the methods discussed in the subsequent sections is also presented. Section 4 examines value-based DRL approaches such as DQN and D3QN, followed by Section 5, where policy-based methods are introduced. Recent actor–critic methods, including DDPG, A3C, and PPO, are highlighted as the main developments in DRL for robot path planning, with several contemporary studies reviewed under each sub-category in Section 6. Some emerging trends in DRL for robotics are discussed in Section 7, including transformer-based DRL, meta-learning for transfer learning, multi-agent methods, and attention mechanisms. The section also covers the scaling of DRL solutions and the safety of mobile robot exploration using DRL. Section 8 covers practical considerations for real-world applications, such as the challenges of hybrid methods, optimization for real-time processing, and empirical performance comparisons. Samples of path planning outcomes from selected works are presented and discussed in Section 9, illustrating the benefits of these deep reinforcement learning strategies. The paper concludes by presenting insights and potential future research directions in Section 10.
3. Deep Reinforcement Learning
Deep Reinforcement Learning (Deep RL) leverages deep neural networks as function approximators to address high-dimensional state and action spaces. In scenarios such as mobile robot path planning, where the state space may include sensor data, such as images or LIDAR scans, and the action space involves continuous control, Deep RL provides a scalable and effective solution [3,8]. The key elements of Deep RL build directly on the Markov Decision Process (MDP) framework and extend it by introducing function approximation methods.
In Deep RL, the policy $\pi_\theta(a \mid s)$ is parameterized by a neural network with parameters $\theta$. The policy maps states $s$ to a probability distribution over actions $a$. The objective remains to optimize the parameters $\theta$ to maximize the expected cumulative reward (2):

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],$$

where $\gamma$ is the discount factor, as defined previously [1].
In addition to the policy, in deep RL, the state–value function $V^\pi(s)$ and the action–value function $Q^\pi(s,a)$, which are critical for evaluating the quality of a policy or action, can be approximated using neural networks:

$$V^\pi(s) \approx V_\phi(s), \qquad Q^\pi(s,a) \approx Q_\phi(s,a),$$

where $\phi$ represents the parameters of the neural network. Algorithms such as Deep Q-Networks (DQN) [8] use this approximation to solve the Bellman equation:

$$Q(s,a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q(s',a') \,\middle|\, s, a \right].$$
When deep RL combines a policy network (actor) $\pi_\theta(a \mid s)$ with a value network (critic) $V_\phi(s)$, the actor selects actions, while the critic evaluates them. This approach is called the actor–critic paradigm. The policy gradient is computed as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A(s,a) \right],$$

where the advantage $A(s,a)$ is provided by the critic network [2,4].
Deep RL also addresses the balance between exploration (choosing less certain actions to gather information) and exploitation (choosing actions that maximize known rewards). Techniques such as $\epsilon$-greedy policies [1,8] or entropy regularization [2] are commonly used to manage this trade-off. To stabilize learning, deep RL often employs mechanisms such as target networks and experience replay. Target networks provide a stable reference for value updates, and experience replay reuses past experiences for training, improving sample efficiency [8].
The key components of a Deep Reinforcement Learning (Deep RL) framework and their interactions are illustrated in Figure 1. The interaction flow in the diagram starts with the state $s_t$, which is processed by both the policy network (actor) and the value network. The policy network generates an action $a_t$, which interacts with the environment to produce a reward $r_t$ and a new state $s_{t+1}$. The value network evaluates $s_t$ to inform the critic. The critic provides feedback for updating both the value network and the policy network, ensuring iterative improvements in both evaluation and decision-making.
3.1. Modeling and Handling Uncertainty in Deep Reinforcement Learning
Uncertainty is an inherent challenge in real-world robotic applications, arising from sensor noise, dynamic obstacles, and environmental unpredictability. Traditional reinforcement learning models assume a stationary Markov decision process (MDP), which often fails to capture the stochastic nature of real-world interactions. To address these issues, researchers have proposed several uncertainty-aware methods to improve the robustness and adaptability of DRL-based robotic systems.
Bayesian deep reinforcement learning (BDRL) has been widely explored as a means to incorporate model uncertainty into decision-making. Zheng et al. [
11] introduced a Bayesian DRL framework that quantifies model uncertainty, aleatoric uncertainty, and reward function uncertainty in robot manipulation tasks. Their findings demonstrated that incorporating Bayesian networks into DRL significantly enhances convergence speed in sparse reward environments.
Robust policy optimization techniques have been developed to mitigate the performance degradation caused by uncertain transitions. Zhang et al. [
12] proposed an uncertainty set regularization (USR) technique, which formulates an uncertainty-aware policy using an adversarial approach to model unknown transition uncertainties. Their results on real-world reinforcement learning benchmarks showed improved robustness against environmental perturbations.
Trajectory planning in dynamic environments requires DRL agents to anticipate and adapt to uncertain constraints in real time. Govindarajula et al. [13] presented a trajectory planning framework that integrates deep learning models with reinforcement learning to handle dynamic obstacles and unpredictable environments effectively. Their approach demonstrated improved safety and efficiency in both simulated and real-world scenarios.
Handling sensor noise is crucial for ensuring stable DRL deployment in robotic navigation. Joshi et al. [
14] analyzed the impact of sensor noise on UAV navigation and obstacle avoidance using DRL. Their results indicated that denoising techniques such as Kalman filtering and artificially injecting noise during training improved real-world performance. Similarly, Martinez et al. [
15] developed a velocity-space-based DRL planner that mitigates discrepancies between simulation and reality, reducing failure rates in highly dynamic environments.
A novel data-driven stochastic modeling approach was introduced by Han et al. [
16], which utilizes a deep stochastic Koopman operator (DeSKO) for robust learning control in uncertain nonlinear robotic systems. Their framework achieved robust stability in real-world soft robotics applications, demonstrating resilience against unexpected disturbances.
These advancements illustrate the importance of uncertainty-aware DRL techniques in real-world robotics. Future research should focus on improving generalization across diverse operational conditions while maintaining computational efficiency.
3.2. Sensor Fusion for Enhanced Path Planning in DRL
Sensor fusion has become a crucial component in reinforcement learning-based robot navigation, allowing autonomous agents to integrate multiple sensory inputs to improve decision-making in complex environments. Combining LiDAR, camera images, and IMU data enhances depth perception, spatial awareness, and obstacle avoidance, leading to improved policy robustness in deep reinforcement learning (DRL).
Recent studies have demonstrated the effectiveness of LiDAR–camera fusion in DRL-based navigation. Ou et al. [
17] introduced a sensor fusion framework that integrates LiDAR and camera data to establish a more reliable environmental representation. Their approach enables mobile robots to navigate unfamiliar environments more efficiently than traditional single-sensor setups. Additionally, Tan [
18] designed a lightweight multimodal data fusion network that combines LiDAR and image-based semantic segmentation, bridging the gap between simulated and real-world navigation.
Incorporating IMU data into sensor fusion frameworks further enhances DRL-based path planning by providing motion stability and aiding localization. Xue and Gonsalves [
19] proposed a short-term visual-IMU fusion memory agent for drone motion planning, significantly reducing neural network complexity while maintaining robust obstacle avoidance. Similarly, Liu et al. [
20] developed a sensor fusion system that integrates IMU with LiDAR-based SLAM to improve 3D color reconstruction in large-scale environments.
Several DRL frameworks now explicitly incorporate sensor fusion to improve real-time navigation. Yan et al. [
21] introduced a deep deterministic policy gradient (DDPG)-based approach that fuses laser and vision sensor data, achieving a 10% improvement in navigation success rates in dynamic environments. Furthermore, Jiang et al. [
22] developed a multi-sensor fusion framework that preprocesses RGB and depth images from cameras before integrating them with LiDAR data, significantly improving obstacle avoidance in real-world scenarios.
These advancements highlight the growing importance of sensor fusion in DRL-based robotics. Future research should focus on optimizing real-time data processing pipelines and developing lightweight fusion architectures to enhance computational efficiency.
3.3. Categorization
Our investigations have resulted in the categorization of recent DRL solutions for mobile robot path planning into three general categories: value-based methods, policy-based methods, and hybrid actor–critic methods. In the next three sections, we review each of these categories. Since the last category is more recent and has shown considerable potential for the guidance and control of mobile robots, we provide an extensive exploration of different actor–critic algorithms, which by nature combine the principles used in value-based and policy-based approaches to DRL for mobile robot path planning. To provide a clearer understanding of the relationships between different DRL methods,
Figure 2 presents a conceptual diagram summarizing the categorization of DRL techniques. By structuring the discussion around these categories, we aim to establish a cohesive narrative that logically connects various methods and their applications.
4. Value-Based Methods
In the domain of mobile robot path planning, recent advances in deep reinforcement learning (Deep RL) have introduced both value-based and policy-based methods to solve complex navigation tasks. These methods leverage deep neural networks to approximate functions such as the action–value function $Q(s,a)$ or to directly parameterize policies $\pi_\theta(a \mid s)$. This section reviews key value-based methods, such as Deep Q-Networks (DQN), Double DQN (DDQN), and Dueling DQN (D3QN), as well as policy-based approaches, outlining their mathematical foundations and contributions to mobile robot navigation.
Value-based methods in deep RL focus on estimating the action–value function $Q^\pi(s,a)$, which represents the expected cumulative reward of taking action $a$ in state $s$ and following the policy $\pi$ thereafter. These methods solve the Bellman equation:

$$Q(s,a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q(s',a') \,\middle|\, s, a \right]$$

by using neural networks as function approximators. A well-known value-based deep RL method is Deep Q-Networks (DQN) [8], which approximates $Q(s,a)$ using a neural network parameterized by $\phi$, denoted as $Q_\phi(s,a)$. The network is trained by minimizing the mean squared error between the current Q-value and the target value:

$$L(\phi) = \mathbb{E}_{(s,a,r,s')}\!\left[ \big( y - Q_\phi(s,a) \big)^2 \right],$$

where the target value $y$ is as follows:

$$y = r + \gamma \max_{a'} Q_{\phi^-}(s',a').$$

Here, $\phi^-$ denotes the parameters of a separate target network, which is updated periodically for stability. The step-by-step pseudocode for DQN is presented in Algorithm 1.
Algorithm 1: Deep Q-Network (DQN) Algorithm
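The following is a minimal sketch of a single DQN gradient step consistent with the loss above, written with PyTorch. The tensor batch layout, the optimizer, and the function name `dqn_update` are assumptions for illustration; it is not the exact implementation of Algorithm 1.

```python
import torch
import torch.nn.functional as F


def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: regress Q(s,a) toward r + gamma * max_a' Q_target(s',a')."""
    states, actions, rewards, next_states, dones = batch  # pre-built float/long tensors
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # target network provides a fixed bootstrap reference
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```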
While DQN is effective, it suffers from overestimation bias in Q-value updates. Double DQN (DDQN) [23] addresses this by decoupling the action selection and evaluation processes. The target value is modified as follows:

$$y^{DDQN} = r + \gamma\, Q_{\phi^-}\!\big(s',\, \arg\max_{a'} Q_\phi(s',a')\big).$$

This correction reduces overestimation and improves learning stability. An alternative is the Dueling DQN (D3QN) method [24], which introduces a novel architecture that separates the representation of the state–value and advantage functions:

$$Q(s,a) = V(s) + \Big(A(s,a) - \tfrac{1}{|\mathcal{A}|}\textstyle\sum_{a'} A(s,a')\Big),$$

where $V(s)$ is the state–value function and $A(s,a)$ is the advantage function. This architecture helps the network learn which states are valuable independently of the specific actions taken, improving performance in complex environments. The pseudocode shown in Algorithm 2 presents how the D3QN method works in a step-by-step fashion.
Algorithm 2: Dueling Deep Q-Network (D3QN) Algorithm
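To make the two refinements above concrete, the sketch below shows a dueling head that combines $V(s)$ and $A(s,a)$, and a Double-DQN target that selects $a'$ with the online network and evaluates it with the target network. Class and function names and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    """Dueling architecture: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)
        self.advantage_head = nn.Linear(hidden, n_actions)

    def forward(self, state):
        h = self.feature(state)
        value = self.value_head(h)
        advantage = self.advantage_head(h)
        return value + advantage - advantage.mean(dim=1, keepdim=True)


def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: select a' with the online network, evaluate it with the target one."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1 - dones) * next_q
```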
6. Actor–Critic Methods
Actor–critic methods combine a policy-based “actor” with a value-based “critic”. The actor represents the policy $\pi_\theta(a \mid s)$, while the critic estimates a value function, such as the state–value function $V_\phi(s)$ or the action–value function $Q_\phi(s,a)$. The critic helps reduce the variance of the policy gradient by replacing the raw return $G_t$ with an advantage estimate:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t).$$

The policy gradient in actor–critic methods is typically expressed as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \right].$$
Four main categories of actor–critic methods have been applied in mobile robot path planning applications: (1) the deep deterministic policy gradient (DDPG) method and its variants, (2) the asynchronous advantage actor–critic (A3C) method and its variants, (3) the proximal policy optimization (PPO) method and its variants, and (4) the soft actor–critic (SAC) method and its variants. Each of these categories is presented in a separate section as follows.
6.1. Deep Deterministic Policy Gradient
The DDPG method is an actor–critic reinforcement learning algorithm designed for continuous action spaces. It extends the Deterministic Policy Gradient (DPG) method [26] by incorporating deep neural networks to parameterize both the policy and value functions. DDPG has been widely used in robotics, including mobile robot path planning, where the control commands often lie in a continuous domain, such as linear and angular velocities. It uses an actor–critic framework in which a deterministic actor $\mu_\theta(s)$ maps states to actions and a critic $Q_\phi(s,a)$ estimates their value. The policy is updated by maximizing the estimated Q-value:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\!\left[ \nabla_a Q_\phi(s,a)\big|_{a=\mu_\theta(s)}\, \nabla_\theta \mu_\theta(s) \right],$$

where $\mathcal{D}$ is a replay buffer used to store past transitions $(s, a, r, s')$ for sampling during training. The critic is updated by minimizing the mean squared error of the Bellman residual:

$$L(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[ \big( y - Q_\phi(s,a) \big)^2 \right],$$

where the target value $y$ is computed using a target actor $\mu_{\theta^-}$ and a target critic $Q_{\phi^-}$:

$$y = r + \gamma\, Q_{\phi^-}\!\big(s', \mu_{\theta^-}(s')\big).$$
DDPG uses target networks to stabilize training. The target actor $\mu_{\theta^-}$ and target critic $Q_{\phi^-}$ are updated slowly using a soft update mechanism:

$$\theta^- \leftarrow \tau \theta + (1-\tau)\,\theta^-, \qquad \phi^- \leftarrow \tau \phi + (1-\tau)\,\phi^-,$$

where $\tau \ll 1$ controls the update rate.

To encourage exploration in continuous action spaces, DDPG adds noise to the deterministic policy during training. A common choice is Ornstein–Uhlenbeck (OU) noise, which models temporally correlated random processes:

$$a_t = \mu_\theta(s_t) + \mathcal{N}_t, \qquad d\mathcal{N}_t = -\theta_{OU}\,\mathcal{N}_t\,dt + \sigma\, dW_t.$$
Algorithm 3 shows the step-by-step pseudocode for the DDPG algorithm.
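The two DDPG-specific ingredients above, soft target updates and OU exploration noise, are sketched below under the assumption that the networks are PyTorch modules; the class and function names and the default parameters are illustrative.

```python
import numpy as np


class OUNoise:
    """Ornstein–Uhlenbeck process for temporally correlated exploration noise."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        dx = -self.theta * self.x * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x


def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.copy_(tau * o_param.data + (1.0 - tau) * t_param.data)
```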
In mobile robot path planning, DDPG has been applied to learn navigation strategies in continuous control domains. The actor network generates control commands, such as the linear velocity $v$ and angular velocity $\omega$, while the critic evaluates the navigation performance. The state $s$ may include the following:
The robot’s position and orientation.
Sensor inputs, such as LIDAR or depth images, providing information about obstacles.
The relative position of the goal.
The reward function is typically similar to (21), and is designed to guide the robot toward the goal while avoiding collisions.
6.1.1. Assisted DDPG (AsDDPG)
In the work by Xie et al. [27], Assisted DDPG (AsDDPG) extends the standard DDPG algorithm by incorporating an external controller to accelerate training and stabilize learning. The external controller acts as a supplementary policy during the early stages of training and is progressively replaced by the learned policy as training converges. The integration of the external controller introduces an adaptive switching mechanism, enabling the algorithm to dynamically choose between the controller and the actor network.
Algorithm 3: Deep Deterministic Policy Gradient (DDPG) Algorithm
The core innovation in AsDDPG is the inclusion of a Critic-DQN network that evaluates both the learned policy and the external controller. At each time step $t$, the Critic-DQN decides which policy to use based on the Q-value estimates:

$$a^{sw}_t = \arg\max_{a^{sw}} Q\big(s_t, a^{sw}\big),$$

where $a^{sw}_t$ is the switching action and $Q(s_t, a^{sw})$ represents the Q-value of the state–action pair under the selected policy.
The critic network in AsDDPG consists of two branches: the Critic branch and the DQN branch. The Critic branch evaluates the Q-value of the actor’s actions and computes the policy gradient, while the DQN branch estimates the Q-values of both the external controller and the actor network. The advantage function is computed to improve stability:

$$A(s, a) = Q(s, a) - V(s),$$

where $V(s)$ is the state-value function learned using a dueling architecture.
The loss function for the Critic-DQN combines the mean squared TD errors of the Critic and DQN branches, whose TD targets are computed by bootstrapping from the corresponding target networks. Here, $\gamma$ is the discount factor, $r$ is the reward, and the target network provides the bootstrapped value estimates. The actor network is updated using the deterministic policy gradient derived from the Critic branch.
The primary advantage of AsDDPG lies in its ability to leverage the external controller to guide exploration during the early stages of training, reducing the reliance on random exploration. As training progresses, the actor network learns to outperform the external controller, and the reliance on external guidance diminishes (see Figure 3). This adaptive approach ensures faster convergence and more robust learning, particularly in environments with sparse rewards or complex dynamics.
6.1.2. Dynamic Path Planning Using an Improved DDPG Algorithm
The study by Li et al. [28] builds on the DDPG framework by incorporating several enhancements to address its limitations, such as a low success rate and slow convergence in dynamic environments. These enhancements include replacing the optimizer, adding prioritized experience replay, introducing transfer learning, and using a curiosity module.

To improve convergence speed and accuracy, the RAdam optimizer is integrated into the DDPG algorithm. RAdam rectifies the variance of the adaptive learning rate using exponential moving averages of the gradient and its square as the parameters of the optimization process. This optimizer stabilizes training and enhances convergence.
Prioritized experience replay is employed to prioritize sampling experiences with higher temporal-difference (TD) error. The TD error is defined as follows:

$$\delta_t = r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t),$$

where $Q$ is the action–value function, $s_t$ is the current state, $a_t$ is the current action, and $\gamma$ is the discount factor. This prioritization accelerates learning by focusing on significant experiences.
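A minimal sketch of proportional prioritized replay is given below: transitions are sampled with probability proportional to $(|\delta| + \epsilon)^\alpha$. The class name, capacity, and exponent are illustrative defaults, not the settings used in [28].

```python
import numpy as np


class ProportionalReplay:
    """Prioritized replay (proportional variant): P(i) ~ (|TD error_i| + eps)^alpha."""

    def __init__(self, capacity=50_000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def push(self, transition, td_error):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # refresh priorities with the latest TD errors after each learning step
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```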
To improve exploration in complex environments, a curiosity module is introduced. This module provides intrinsic rewards based on the prediction error of a forward dynamics model:

$$\hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t; \theta_F\big),$$

where $\phi(s_t)$ is the encoded feature vector of the current state, $a_t$ is the action, and $\theta_F$ represents the model parameters. The intrinsic reward is calculated as follows:

$$r^{i}_t = \frac{\eta}{2}\,\big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|^2,$$

where $\eta$ is a scaling factor.

The total reward combines intrinsic and extrinsic rewards:

$$r_t = r^{e}_t + r^{i}_t,$$

where $r^{e}_t$ is the external reward from the environment.
Transfer learning is applied to initialize the policy and value networks with pre-trained weights, reducing the required training time. The combined effect of these enhancements significantly improves the convergence speed and success rate of the DDPG algorithm in dynamic environments.
6.1.3. Virtual-to-Real Deep Reinforcement Learning for Mapless Navigation
Tai et al. [29] build on the DDPG framework by introducing asynchronous updates and leveraging low-dimensional sparse range findings for efficient mapless navigation of nonholonomic mobile robots.

To improve sampling efficiency, the sample collection process is separated into an independent thread, creating an asynchronous version of DDPG (ADDPG). This asynchronous mechanism ensures that the agent can collect more samples per iteration compared to traditional DDPG, enhancing convergence. The effectiveness of ADDPG is illustrated by the faster increase in the Q-value during training:

$$Q(s_t, a_t) = r_t + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1})\big),$$

where $r_t$ is the immediate reward, $s_t$ is the state, $a_t$ is the action, and $\gamma$ is the discount factor.
The input to the ADDPG model consists of the laser range findings, the target position relative to the robot, and the robot’s previous velocities, where the laser range findings are abstracted into a 10-dimensional sparse representation. These inputs are normalized to the range $(0, 1)$.
The output of the actor network consists of continuous control commands, including the linear velocity $v$ and angular velocity $\omega$. The actor network is updated using the deterministic policy gradient, as in standard DDPG. The critic network uses a three-layer fully connected structure to predict the Q-value. The Q-value is computed using a linear activation function at the output layer:

$$Q(s, a) = w^{\top} h + b,$$

where $w$ and $b$ are the weights and bias of the output layer, respectively, and $h$ is the last hidden-layer activation.
The reward function is defined as follows:

$$r(s_t, a_t) = \begin{cases} r_{arrive} & \text{if the target is reached,} \\ r_{collision} & \text{if a collision occurs,} \\ c_r\,(d_{t-1} - d_t) & \text{otherwise,} \end{cases}$$

where $d_t$ is the distance to the target at time $t$, and $r_{arrive}$, $r_{collision}$, and $c_r$ are hyperparameters.
The overall method, visualized in Figure 4, enables end-to-end training of a mapless motion planner capable of transferring knowledge from virtual to real environments without additional fine-tuning, ensuring robust navigation in unseen scenarios.
6.1.4. Goal-Oriented Obstacle Avoidance in Continuous Action Space
The proposed method in [
30] extends the DDPG framework by integrating depth-wise separable convolutional layers to handle depth image inputs effectively. This enhancement allows for efficient processing of large-scale sequential depth images, combined with goal-oriented navigation capabilities.
The actor network takes as input a stack of sequential depth images along with positional information in polar coordinates. The stack of depth images from $t-n$ to $t$ provides a short-term memory, enabling the robot to detect peripheral obstacles that may not be visible in the current input. The depth-wise separable convolutional operation is defined as follows:

$$y_j = \sigma\big( K_j * x_j \big),$$

where $K_j$ represents the filter applied to channel $j$, $x_j$ is the input feature map, and $\sigma$ is the activation function.
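The sketch below shows a generic depth-wise separable convolution block in PyTorch (a per-channel convolution followed by a 1×1 point-wise convolution); the layer sizes and the class name are illustrative and not taken from [30].

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: per-channel (depthwise) convolution
    followed by a 1x1 (pointwise) convolution that mixes channels."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.pointwise(self.depthwise(x)))
```

Compared with a standard convolution, this factorization greatly reduces the number of parameters, which is the motivation for using it on large stacks of depth images.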
The polar coordinates of the robot’s position relative to the goal are concatenated with the processed depth image features. The output of the actor network consists of continuous control commands:

$$(v, \omega) = \pi_\theta(s),$$

where $\omega$ is the angular velocity, $v$ is the linear velocity, and $\pi_\theta$ represents the actor policy parameterized by $\theta$.

The critic network evaluates the Q-value using the current state and action:

$$Q(s_t, a_t) = r_t + \gamma\, Q(s_{t+1}, a_{t+1}),$$

where $r_t$ is the immediate reward and $\gamma$ is the discount factor.
The reward function is designed to encourage linear motion and penalize unnecessary angular adjustments; it combines a term based on the distance to the goal with coefficients weighting the linear and angular velocities.
The target networks for the actor and critic are updated using a soft update mechanism:

$$\theta^- \leftarrow \tau\theta + (1-\tau)\,\theta^-, \qquad \phi^- \leftarrow \tau\phi + (1-\tau)\,\phi^-,$$

where $\tau$ is a small constant controlling the update rate.
This method enables end-to-end learning of a goal-oriented obstacle avoidance policy, trained entirely in simulation and transferable to real-world scenarios without additional fine-tuning. Algorithm 4 shows a step-by-step pseudocode of the method.
Algorithm 4: Pseudocode of the goal-oriented obstacle avoidance algorithm in continuous action space [30].
6.1.5. DDPG for Path Planning in Cluttered Indoor Spaces
In the study by Gao et al. [31], incremental training, parameter transfer, and a hybrid PRM+TD3 planner are introduced to improve development efficiency, convergence, and generalization in indoor mobile robot path planning. The training process employs an incremental strategy starting in a lightweight 2D simulation environment and gradually transitioning to a complex 3D environment. The training in the 2D environment uses simpler simulations to debug parameters and optimize the reward function before transferring the network to the 3D environment; the weights learned in the 2D environment serve as the initial weights of the networks trained in the 3D environment.
The hybrid planner PRM+TD3 combines the Probabilistic Roadmap (PRM) algorithm for global path planning and TD3 for local control. PRM generates intermediate waypoints, which decompose the global path into local sub-paths for TD3 to navigate.
TD3 enhances DDPG by addressing overestimation bias using two Q-networks and delayed updates. The Q-value target is as follows:

$$y = r + \gamma\,(1-d)\, \min_{i=1,2} Q_{\phi_i^-}\!\big(s',\, \mu_{\theta^-}(s')\big),$$

where $d$ is the termination flag, $r$ is the reward, and $\mu_{\theta^-}$ is the target policy.
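The clipped double-Q target of TD3 (including the optional target-policy smoothing noise) can be sketched as follows; network handles, noise parameters, and the action bound are assumptions for illustration.

```python
import torch


def td3_target(critic1_t, critic2_t, actor_t, rewards, next_states, dones,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """TD3 target: minimum of two target critics, with clipped Gaussian noise
    added to the target action (target policy smoothing)."""
    with torch.no_grad():
        next_action = actor_t(next_states)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)
        target_q = torch.min(critic1_t(next_states, next_action),
                             critic2_t(next_states, next_action)).squeeze(-1)
    return rewards + gamma * (1 - dones) * target_q
```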
The critic networks are updated by minimizing the mean squared error against this target, and the actor network is updated using the deterministic policy gradient. Target networks for both the actor and the critics are updated softly, where $\tau$ controls the rate of update.
The reward function is designed to incentivize collision-free navigation and goal-reaching: a positive reward is granted when the distance to the target falls below the goal proximity threshold, and a scaled shaping term otherwise rewards progress toward the target.
This approach enhances convergence, reduces training time, and improves generalization to complex environments. See Algorithm 5 for a step-by-step pseudocode of the proposed method.
Algorithm 5: PRM+TD3 with Incremental Training [31].
6.1.6. Learning World Transition Model for Socially Aware Robot Navigation
Cui et al. [32] present an extension of the DDPG framework by integrating a model-based reinforcement learning approach with a deep world transition model, which predicts future states and rewards. This improves sample efficiency and allows socially compliant robot navigation.

The world transition model predicts the evolution of the robot’s environment, including future obstacle maps and rewards. Let the current state $s_t$ be represented as a sequence of obstacle maps, the relative goal position, and the current velocity. The predicted next state $\hat{s}_{t+1}$ is given by the following:

$$\hat{s}_{t+1} = f(s_t, a_t; \theta_f),$$

where $f$ is the world transition model parameterized by $\theta_f$.
The obstacle map representation disentangles static and dynamic obstacles using an ego-motion transformation. Dynamic features are computed from the sequence of ego-motion-aligned obstacle maps $m_t$, where $m_t$ is the obstacle map at time $t$. Static features are extracted by encoding the last obstacle map.
The reward function incorporates goal-reaching, collision avoidance, and social compliance: a goal-reaching term encourages progress toward the target, a collision term penalizes contact with obstacles, and social compliance is achieved by penalizing proximity to dynamic obstacles.
The policy network uses stacked obstacle maps processed with a 3D-CNN to output control commands $(v, \omega) = \pi_\theta(s_t)$, where $\pi_\theta$ is the actor policy parameterized by $\theta$. The critic networks estimate the Q-values using the TD3 framework, and the critic loss is minimized against the clipped double-Q target

$$y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}\!\big(s', \mu_{\theta^-}(s')\big).$$
The actor network is updated using the deterministic policy gradient, and the target networks are updated softly. This method enables efficient policy training using a combination of real and virtual data, leading to socially compliant navigation in dynamic environments. The block diagram shown in Figure 5 illustrates the process flow of the method.
6.2. Asynchronous Advantage Actor–Critic (A3C) Method
The Asynchronous Advantage Actor–Critic (A3C) method, introduced by Mnih et al. [33], maintains a policy $\pi_\theta(a_t \mid s_t)$ and an estimate of the value function $V_{\theta_v}(s_t)$. The algorithm operates in the forward view and uses $n$-step returns to update both the policy and the value function.

The advantage function $A(s_t, a_t)$ is computed as follows:

$$A(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V_{\theta_v}(s_{t+k}) - V_{\theta_v}(s_t),$$

where $k$ is bounded by $t_{max}$, the maximum number of steps. The policy is updated using the gradient:

$$\nabla_{\theta} \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t),$$

accumulated over the range of steps in each rollout segment. The value function is updated by minimizing the loss:

$$L(\theta_v) = \big( R_t - V_{\theta_v}(s_t) \big)^2,$$

where $R_t$ is the $n$-step return:

$$R_t = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} V_{\theta_v}(s_{t+n}).$$
Entropy regularization is added to the policy loss to encourage exploration. The entropy of the policy $\pi_\theta$ is given by the following:

$$H\big(\pi_\theta(\cdot \mid s_t)\big) = -\sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t),$$

and the entropy term is added to the policy gradient with a weight $\beta$:

$$\nabla_{\theta}\Big[ \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) + \beta\, H\big(\pi_\theta(\cdot \mid s_t)\big) \Big].$$
The updates for the policy and value networks are performed asynchronously across multiple threads. Each thread maintains a local copy of the parameters $\theta$ and $\theta_v$, which are periodically synchronized with the global parameters. The parameters are updated using the following:

$$\theta \leftarrow \theta + \alpha\, d\theta, \qquad \theta_v \leftarrow \theta_v + \alpha\, d\theta_v,$$

where $\alpha$ is the learning rate.

This asynchronous framework stabilizes training by allowing parallel actor-learners to explore different parts of the state space, reducing the correlation between updates and improving training efficiency. Algorithm 6 presents the step-by-step pseudocode of the A3C method.
Algorithm 6: Asynchronous Advantage Actor–Critic (A3C) Algorithm
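The $n$-step returns and advantages used by A3C can be computed with the short NumPy sketch below; the function name and the assumption that `values` are the critic's estimates along the rollout are illustrative.

```python
import numpy as np


def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step returns R_t = r_t + gamma*r_{t+1} + ... + gamma^k * V(s_{t+k}),
    bootstrapped from the critic's value of the last state in the rollout."""
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Example usage: advantages = n_step_returns(rewards, v_last, gamma) - values
```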
6.2.1. A3C Algorithm for Learning to Navigate in Complex Environments
Mirowski et al. [34] have extended A3C by incorporating auxiliary tasks to improve navigation in complex environments. The architecture augments the standard A3C framework with two auxiliary outputs: depth prediction and loop closure detection. These tasks enhance data efficiency and representation learning by providing additional gradient updates during training.

The first auxiliary task, depth prediction, estimates a depth map from RGB inputs. The loss function for depth prediction is implemented as a classification task, dividing depth values into discrete bins:

$$L_{depth} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{j=1}^{C} d_{ij} \log \hat{d}_{ij},$$

where $d_{ij}$ represents the true probability of the $j$-th depth bin for the $i$-th pixel, $\hat{d}_{ij}$ is the predicted probability, $N$ is the number of pixels, and $C$ is the number of depth bins.
The second auxiliary task, loop closure detection, predicts whether the agent is revisiting a previously visited location. The training targets are generated based on position similarity between the current position and earlier positions in the trajectory. The binary classification loss is given by the following:

$$L_{loop} = -\frac{1}{T}\sum_{t=1}^{T} \big[ l_t \log \hat{l}_t + (1 - l_t) \log (1 - \hat{l}_t) \big],$$

where $l_t$ is the ground-truth loop closure label for time step $t$, $\hat{l}_t$ is the predicted probability, and $T$ is the total number of steps.
The overall loss function combines the A3C objective with these auxiliary losses. The total loss is as follows:

$$L = L_{A3C} + \beta_d L_{depth} + \beta_l L_{loop},$$

where $\beta_d$ and $\beta_l$ are coefficients balancing the contributions of the auxiliary tasks.
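A minimal sketch of a shared encoder with policy/value outputs and two auxiliary heads is shown below; the class name, layer sizes, number of depth bins, and pixel count are all illustrative placeholders rather than the architecture of [34].

```python
import torch
import torch.nn as nn


class NavNetWithAuxHeads(nn.Module):
    """Shared encoder feeding policy/value outputs plus auxiliary depth and
    loop-closure heads; all sizes here are illustrative placeholders."""

    def __init__(self, obs_dim=512, feat_dim=256, n_actions=4,
                 n_pixels=64, n_depth_bins=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        self.policy_head = nn.Linear(feat_dim, n_actions)               # pi(a|s)
        self.value_head = nn.Linear(feat_dim, 1)                        # V(s)
        self.depth_head = nn.Linear(feat_dim, n_pixels * n_depth_bins)  # depth bin logits
        self.loop_head = nn.Linear(feat_dim, 1)                         # loop-closure logit

    def forward(self, obs):
        h = self.encoder(obs)
        return (torch.softmax(self.policy_head(h), dim=-1),
                self.value_head(h),
                self.depth_head(h),
                torch.sigmoid(self.loop_head(h)))
```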
To integrate auxiliary tasks with the A3C framework, the architecture uses a shared convolutional encoder to process RGB inputs. The shared features are passed to both the policy/value networks and the auxiliary task heads. Specifically, the depth prediction head predicts depth values from the convolutional features, while the loop closure head predicts loop closure probabilities using the hidden states of the LSTM layer. The policy and value networks follow the A3C architecture and are updated with standard A3C gradients.
The depth prediction output is computed from the convolutional features $f_{conv}$ as follows:

$$\hat{d} = \mathrm{softmax}\big( W_d f_{conv} + b_d \big),$$

where $W_d$ and $b_d$ are the weights and biases of the depth prediction layer.

The loop closure output is computed from the LSTM hidden states $h_t$ as follows:

$$\hat{l}_t = \sigma\big( W_l h_t + b_l \big),$$

where $W_l$ and $b_l$ are the weights and biases of the loop closure layer.
These auxiliary tasks provide additional supervision signals, accelerating learning and improving the agent’s ability to represent spatial and geometric features critical for navigation. The block diagram shown in Figure 6 presents how the proposed architecture extends A3C by integrating auxiliary tasks for depth prediction and loop closure detection. The shared encoder processes RGB inputs to extract features $f_{conv}$, which are passed to the policy and value networks as well as the auxiliary task-specific heads. Depth prediction and loop closure detection contribute to the total loss $L$, accelerating training and improving navigation performance.
6.2.2. A3C Algorithm for Robot Navigation in Unknown Rough Terrain
Zhang et al. [35] have extended A3C to address navigation in unknown rough terrain environments by incorporating an elevation map, depth images, and the robot’s 3D orientation as inputs into the policy and value networks. The architecture consists of three separate branches for processing these inputs, which are then merged to compute navigation actions.

The depth image branch processes downsampled depth images through a series of convolutional layers, reducing the high-dimensional sensory input into feature embeddings $f_{depth}$. The elevation map branch processes a single-channel 84 × 84 grid that encodes terrain elevation, using convolutional layers to extract features $f_{elev}$. The robot’s 3D orientation is passed through a fully connected layer, producing feature embeddings $f_{orient}$ that match the spatial dimensions of $f_{elev}$. These two branches are merged using element-wise addition:

$$f_{merged} = f_{elev} + f_{orient}.$$
The outputs from the depth and merged branches are concatenated to form a unified feature representation:

$$f = \big[ f_{depth},\, f_{merged} \big].$$

This representation is passed through an LSTM layer to capture temporal dependencies in the robot’s navigation states. The output from the LSTM layer is used to compute the policy $\pi(a_t \mid s_t)$ and the value $V(s_t)$. The policy network computes the probability distribution over navigation actions $a_t$, including moving forward, backward, or turning left or right. The policy is updated using the A3C loss function:

$$L_{\pi} = -\log \pi(a_t \mid s_t)\, A(s_t, a_t) - \beta\, H\big(\pi(\cdot \mid s_t)\big),$$

where $A(s_t, a_t)$ is the advantage function, computed as follows:

$$A(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V(s_{t+k}) - V(s_t),$$

and $H(\pi)$ is the entropy of the policy, encouraging exploration, with $\beta$ as a scaling factor.
The value network minimizes the squared advantage:

$$L_{V} = \big( R_t - V(s_t) \big)^2.$$

The reward function is tailored to the rough terrain navigation task. It assigns positive rewards for reaching the goal within a certain tolerance and penalizes undesirable states such as collisions or exceeding the maximum number of steps per episode. Here, the distance to the goal at time $t$ and the closest distance observed so far shape the intermediate reward, the goal tolerance defines successful arrival, and undesirable terminal states such as flipping or getting stuck incur penalties.
The network is trained in a simulated environment with high-dimensional sensory inputs. Nine agents operate in parallel, each updating the global policy and value functions asynchronously to accelerate convergence. The block diagram shown in Figure 7 visualizes how the architecture extends A3C for navigation in unknown rough terrain by incorporating multiple input branches for depth images, elevation maps, and 3D orientation. These inputs are processed through separate branches to extract features ($f_{depth}$, $f_{elev}$, $f_{orient}$), which are merged and concatenated to form a unified representation ($f$). Temporal dependencies are captured using an LSTM layer, and the outputs are used by the policy and value networks to compute navigation actions.
6.2.3. A3C Algorithm for Teaching a Machine to Read Maps
The study by Brunner et al. [36] proposes an architecture that decomposes the navigation task into intermediate subtasks, each handled by a specialized module. These modules are trained either independently or in conjunction with the main A3C policy. The subtasks include visible local map estimation, recurrent localization, map interpretation, and reactive action execution.

The visible local map network processes the 3D RGB visual input and a discretized 3-hot encoded orientation angle. The map excerpt is gated by element-wise multiplication with the visible field estimation, and the visible local map output is used to construct an egocentric local map.
The recurrent localization cell integrates the visible local map excerpts into an egocentric local map. The egomotion is estimated by a two-layer feedforward network applied to the 2D convolution of the previous local map estimation with the current visible local map input. The estimated local map is then updated and clipped to a fixed range, and the updated map is combined with a feedback map to enhance localization. The location probability distribution is calculated by correlating the estimated local map with the global map $M$, normalized over the $N$ discrete location cells.
The map interpretation network processes the global map to identify rewarding locations and outputs a short-term target direction (STTD). The STTD at each location is derived using a shortest path algorithm, producing a direction distribution.
The reactive agent combines the STTD distribution, the normalized entropy of the location probability, and the estimated target distance to select actions. It maximizes the total reward, which combines extrinsic and intrinsic rewards; the intrinsic reward is based on the estimated egomotion vector and the STTD vector. The agent’s policy is updated using A3C’s policy gradient with the advantage function and the entropy of the policy, as in standard A3C.
The proposed architecture improves navigation by integrating modular components, leveraging A3C’s stability, and introducing additional supervision signals through intrinsic rewards and auxiliary tasks. Algorithm 7 presents the step-by-step pseudocode for implementing the proposed architecture.
Algorithm 7: Pseudocode for the navigation algorithm proposed in [36].
6.2.4. A3C Algorithm for Target-Driven Visual Navigation in Indoor Scenes
Zhu et al. [37] extended A3C by introducing a target-driven visual navigation framework based on a deep Siamese actor–critic network. The key innovation is the use of a dual input structure, where both the current state $s_t$ and the target representation $g$ are processed to jointly estimate the policy $\pi(a \mid s_t, g)$ and value $V(s_t, g)$. This allows generalization to unseen targets without re-training.

The network takes as input two images: $s_t$, representing the agent’s current observation, and $g$, representing the target. Both inputs are processed through Siamese layers that share weights, producing embeddings $\phi(s_t)$ and $\phi(g)$ in a shared latent space. The embeddings are concatenated to form a joint representation:

$$z = \big[\phi(s_t),\, \phi(g)\big].$$
This joint representation is passed through scene-specific layers to capture environment-specific features. The scene-specific representation is used to compute the policy $\pi(a \mid s_t, g)$ and the value $V(s_t, g)$. The policy network predicts the probability distribution over actions, and the value network estimates the expected future reward. The policy is updated using the A3C policy gradient:

$$\nabla_\theta \log \pi(a_t \mid s_t, g)\, A(s_t, g) + \beta\, \nabla_\theta H\big(\pi(\cdot \mid s_t, g)\big),$$

where $A(s_t, g) = R_t - V(s_t, g)$ is the advantage function and $H(\pi)$ is the entropy of the policy to encourage exploration. The value network is updated by minimizing the squared error $\big(R_t - V(s_t, g)\big)^2$.
The network leverages asynchronous training, where multiple threads independently interact with the environment. Each thread collects gradients and updates the shared model parameters. Generic Siamese layers are updated across all targets and scenes, while scene-specific layers are fine-tuned for each environment, enabling both inter-target and inter-scene generalization.
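The weight-sharing Siamese embedding and the joint policy/value heads can be sketched as follows; the class name, layer sizes, and the collapsing of the scene-specific layers into a single MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SiameseActorCritic(nn.Module):
    """Shared (Siamese) embedding of the current observation and the target,
    concatenated and mapped to a policy and a value; sizes are illustrative."""

    def __init__(self, obs_dim=2048, embed_dim=512, n_actions=4):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim, embed_dim), nn.ReLU())
        self.scene_layers = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.ReLU())
        self.policy_head = nn.Linear(embed_dim, n_actions)
        self.value_head = nn.Linear(embed_dim, 1)

    def forward(self, obs, target):
        z_obs = self.embed(obs)      # shared weights for both inputs
        z_goal = self.embed(target)
        joint = self.scene_layers(torch.cat([z_obs, z_goal], dim=-1))
        return torch.softmax(self.policy_head(joint), dim=-1), self.value_head(joint)
```

In the original framework, one set of scene-specific layers is kept per environment while the Siamese encoder is shared, which is what enables inter-target and inter-scene generalization.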
The reward function for navigation includes a goal-reaching reward and a small time penalty applied at each step. This framework improves upon A3C by enabling generalization to unseen targets through the Siamese structure and shared parameter updates, while maintaining data efficiency through asynchronous training.
Figure 8 shows a block diagram of the proposed target-driven navigation method based on A3C. The Siamese network processes the current state $s_t$ and target $g$ to generate embeddings $\phi(s_t)$ and $\phi(g)$, which are concatenated to form a joint representation. Scene-specific layers extract environment-specific features used by the policy and value networks. The policy network outputs navigation actions, while the value network evaluates future rewards. Both networks are updated asynchronously using the A3C framework.
Figure 8. Block diagram of the deep reinforcement learning framework proposed in [37].
6.2.5. A3C Algorithm for Autonomous Navigation in Indoor Environments
The study by Surmann et al. [38] builds on A3C by using the GA3C framework, which improves efficiency and scalability in parallel training. Instead of maintaining separate network copies for each agent, GA3C centralizes predictions and updates using a global deep neural network (DNN) model. This eliminates the need for synchronization and allows for efficient GPU utilization. The algorithm employs two global queues: a prediction queue for obtaining observations and a learning queue for training updates.

The inputs to the network are the last four laser scans, each with 1081 values, and a one-hot encoded orientation vector representing the direction of the goal. The laser scan values are normalized, and the goal direction is one-hot encoded over discrete orientation bins.
The network begins with two 1D convolutional layers. The outputs are flattened and passed to a dense layer with 256 neurons to generate the shared latent representation. The policy network computes the probability distribution over actions $\pi(a_t \mid s_t)$, where the actions are discrete commands corresponding to linear and angular velocities, such as forward motion and turning. The value network estimates the state value $V(s_t)$.
The policy and value networks are updated using the A3C loss. The policy gradient is computed as follows:

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t),$$

where $A(s_t, a_t) = R_t - V(s_t)$ is the advantage function. The value loss $\big(R_t - V(s_t)\big)^2$ is minimized to improve the value network.
The reward function integrates goal-reaching rewards, collision penalties, and intermediate rewards: a positive reward is granted when the goal is reached, a negative reward is given on collision, and intermediate rewards encourage progress toward the goal. The GA3C framework allows for high-speed parallel training with up to 32 simulation instances running concurrently, ensuring robust and efficient learning. A step-by-step pseudocode of the algorithm is presented in Algorithm 8.
Algorithm 8: GA3C-based Robot Navigation Framework [38]
6.2.6. A3C-Based End-to-End Navigation Strategy
The study by Shi et al. [
39] introduced an Intrinsic Curiosity Module (ICM) to address the challenges of sparse rewards in navigation tasks. The ICM generates intrinsic rewards by leveraging the prediction error of a forward model, thereby encouraging exploration in complex environments. Additionally, the network is enhanced with two LSTM layers to capture temporal dependencies, enabling the agent to better utilize previous observations, actions, and rewards.
The policy $\pi_\theta(a_t \mid s_t)$ and value $V_{\theta_v}(s_t)$ are represented by neural networks with parameters $\theta$ and $\theta_v$, respectively. The A3C policy gradient and value loss functions are extended to incorporate intrinsic rewards. The policy loss is as follows:

$$L_{\pi} = -\log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t),$$

where the total reward $r_t$ used to compute the returns combines extrinsic and intrinsic rewards:

$$r_t = r^{e}_t + r^{i}_t.$$

The value loss minimizes the discrepancy between the predicted and actual returns:

$$L_{V} = \big( R_t - V_{\theta_v}(s_t) \big)^2.$$
To calculate intrinsic rewards, the ICM includes an inverse model and a forward model. The inverse model predicts the action $\hat{a}_t$ given consecutive states $s_t$ and $s_{t+1}$:

$$\hat{a}_t = g\big(\phi(s_t), \phi(s_{t+1}); \theta_I\big),$$

where $\theta_I$ are the parameters of the inverse model. The loss for the inverse model is the cross-entropy between the true action probability and the predicted action probability. The forward model predicts the next state feature $\hat{\phi}(s_{t+1})$ based on the current state feature $\phi(s_t)$ and action $a_t$:

$$\hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t; \theta_F\big),$$

where $\theta_F$ are the parameters of the forward model. The forward model loss is as follows:

$$L_F = \tfrac{1}{2}\,\big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|^2.$$
The intrinsic reward $r^{i}_t$ is derived from the forward model’s prediction error:

$$r^{i}_t = \frac{\eta}{2}\,\big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|^2,$$

where $\eta$ scales the intrinsic reward. The total loss function combines the policy, value, inverse, and forward losses:

$$L = \lambda\, L_{\pi} + L_{V} + (1-\beta)\, L_{I} + \beta\, L_{F},$$

where $\lambda$ adjusts the weight of the policy gradient, and $\beta$ balances the forward and inverse model losses.
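The curiosity bonus itself reduces to a feature-space prediction error, as in the minimal sketch below; `feature_net` and `forward_model` are assumed to be PyTorch modules, the action is assumed to be a (one-hot or continuous) tensor, and the scaling constant is a placeholder.

```python
import torch
import torch.nn.functional as F


def icm_intrinsic_reward(feature_net, forward_model, state, next_state, action, eta=0.01):
    """Curiosity bonus: scaled prediction error of a forward model in feature space."""
    with torch.no_grad():
        phi_t = feature_net(state)
        phi_t1 = feature_net(next_state)
        phi_t1_pred = forward_model(torch.cat([phi_t, action], dim=-1))
        error = F.mse_loss(phi_t1_pred, phi_t1, reduction="none").sum(dim=-1)
    return eta * 0.5 * error
```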
This method significantly enhances A3C by addressing sparse rewards and enabling robust navigation in complex environments. The use of intrinsic rewards, alongside the temporal memory provided by LSTMs, allows the agent to explore effectively and generalize better from simulation to real-world scenarios. The block diagram of Figure 9 illustrates the integration of the Intrinsic Curiosity Module (ICM) with A3C. The input states $s_t$ and $s_{t+1}$ are processed to extract features $\phi(s_t)$ and $\phi(s_{t+1})$, which feed into the inverse and forward models. The forward model generates intrinsic rewards $r^{i}_t$, while the policy and value networks compute the action policy $\pi_\theta$ and the value estimate $V_{\theta_v}$. The output is the navigation action $a_t$, optimized to balance exploration and exploitation in sparse-reward environments.
6.2.7. A3C Algorithm for Autonomous Navigation with Landmark Generators
Wang et al. [
40] proposed a method that enhances the A3C algorithm by introducing a hierarchical navigation framework that integrates global and local planners connected by a dynamic sub-goal generator. The global planner uses traditional path planning algorithms like A* or Dijkstra to create a global path, while the local planner employs A3C to navigate to dynamically generated sub-goals, optimizing navigation efficiency in dynamic environments.
The local planner is modeled as a partially observable Markov decision process (POMDP) defined as a 5-tuple $(T, S, O, A, R)$, where $T$ is the current time, $S$ represents the environment states, $O$ denotes the observed information, $A$ is the action space, $R$ is the reward function, and $\gamma$ is the reward discount factor. To address the challenge of sparse rewards, a shaped reward mechanism is designed that combines proximity to the sub-goal, collision penalties, and progress terms.
The policy and value networks are enhanced with a shared feature extraction backbone and a memory module to retain long-term dependencies. The actor network is updated using the standard A3C policy gradient weighted by the advantage function, which is computed as the $n$-step return minus the value estimate, and the critic network minimizes the temporal difference (TD) loss.
The integration of the sub-goal generator with A3C further reduces the computational burden. The sub-goal generator dynamically identifies intermediate targets based on two types of frontiers, A and B, defined in the robot’s LiDAR data, and sub-goals are generated according to rules over these frontiers.
The hierarchical design of global and local planners ensures robust navigation in dynamic environments, leveraging the memory-enhanced A3C to handle long-term dependencies and optimize navigation efficiency. The block diagram of the proposed hierarchical navigation method, shown in Figure 10, illustrates how the system integrates a global planner with a sub-goal generator to dynamically identify intermediate targets for the local planner. The feature extraction backbone processes the current state, the reward function computes the reward based on proximity, collision, and progress metrics, and the policy and value networks are optimized using the advantage function, with the final output being the navigation action $a_t$.
6.3. Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO), introduced by Schulman et al. [4], is a widely used reinforcement learning algorithm that strikes a balance between sample efficiency and computational simplicity. PPO builds upon policy gradient methods and introduces a clipping mechanism to ensure stable and monotonic policy updates.

The goal of PPO is to maximize the expected reward while ensuring that policy updates do not deviate excessively from the current policy. The optimization objective is defined as follows:

$$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right],$$

where $r_t(\theta)$ is the probability ratio given by the following:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}.$$

Here, $\hat{A}_t$ represents the advantage function, $\epsilon$ is a hyperparameter that determines the clipping range, and the clip operator restricts the probability ratio to the interval $[1-\epsilon,\, 1+\epsilon]$.
The advantage function $\hat{A}_t$ is computed as follows:

$$\hat{A}_t = Q(s_t, a_t) - V(s_t),$$

where $Q(s_t, a_t)$ is the action–value function and $V(s_t)$ is the state–value function. PPO employs a surrogate objective function to balance exploration and exploitation while avoiding large policy updates, which could destabilize learning.

In addition to the clipping mechanism, the total loss function in PPO combines the policy loss $L^{CLIP}(\theta)$, the value function loss $L^{VF}(\theta)$, and an entropy term $S[\pi_\theta]$ to encourage policy exploration:

$$L(\theta) = \mathbb{E}_t\!\left[ L^{CLIP}(\theta) - c_1\, L^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \right],$$

where $L^{VF}(\theta)$ is the squared error loss for value function approximation:

$$L^{VF}(\theta) = \big( V_\theta(s_t) - V^{target}_t \big)^2,$$

and $S[\pi_\theta]$ is the entropy loss:

$$S[\pi_\theta](s_t) = -\sum_a \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t).$$

The coefficients $c_1$ and $c_2$ control the relative importance of the value loss and entropy loss, respectively. A step-by-step pseudocode of PPO is presented in Algorithm 9.
Algorithm 9: Proximal Policy Optimization (PPO) Algorithm
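The total PPO loss above maps directly onto the short PyTorch sketch below; the function name, tensor layout, and default coefficients are assumptions for illustration rather than tuned values from any reviewed work.

```python
import torch


def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """PPO total loss: clipped surrogate policy term, value-function MSE,
    and an entropy bonus. Coefficients are typical defaults."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```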
PPO has been applied to various domains, including robotic control, game playing, and autonomous navigation. Its robustness and simplicity make it a preferred choice for continuous and discrete action space problems. By preventing large policy changes through the clipping mechanism, PPO ensures stable learning and improved convergence properties. Some of the recent applications of this algorithm in robotic path planning and navigation are reviewed in the following sections.
6.3.1. Navigation Skills Acquisition for Wheel-Legged Robots
The study by Chen et al. [
41] builds upon PPO by introducing hierarchical training strategies with secondary and primary policies, along with domain randomization to improve robustness and generalization. The approach addresses challenges such as reward sparsity, temporal credit assignment, and data inefficiency in high-dimensional navigation tasks.
The hierarchical training begins with secondary policies, each trained for a specific behavior such as moving straight, avoiding obstacles, climbing over obstacles, and squeezing through narrow passages. The secondary policies are trained using PPO with a compound loss function that combines the clipped PPO surrogate objective and the value function loss, with the probability ratio clipped to $[1-\epsilon,\, 1+\epsilon]$ for a clipping parameter $\epsilon$, as defined in Section 6.3.
The reward function for secondary policies encourages the corresponding behaviors. The primary policy is trained using samples from the domain-randomized batch and real interactions in complex environments. Trajectories from secondary policies are replayed in randomized environments to generate action–observation–return tuples, from which the advantage function for primary policy training is estimated; the policy parameters are then updated using stochastic gradient ascent on the compound loss.
A block diagram showing the flow of computations involved in this method is presented in Figure 11.
Domain randomization improves generalization by altering non-essential aspects of the training environments, such as obstacle appearance, number, and configuration. The randomized environments ensure task-relevant sensory features are preserved while diversifying the training data. The final policy maps height-map images and robot poses to motor actions using a neural network with three convolutional layers followed by fully connected layers.
By decomposing navigation into manageable sub-tasks, utilizing domain randomization, and optimizing with PPO, the proposed method achieves robust and efficient navigation in high-dimensional environments.
6.3.2. PPO-Based Path Planning for Cleaning Robot
Moon et al. [
42] propose a PPO method integrated with transfer learning (TL), detection of nearest uncleaned tiles (DNUT), reward shaping (RS), and an elite set (ES) to improve cleaning performance by robots. These additions address challenges such as reward sparsity, inefficient exploration, and data inefficiency in reinforcement learning tasks for cleaning robots.
Transfer learning allows the agent to leverage knowledge from previously trained environments to initialize training in new ones, improving convergence and robustness; the transferred policy initializes the primary policy. The DNUT technique ensures that the agent focuses on uncleaned tiles: the agent detects the nearest uncleaned tile and adjusts the reward accordingly. The reward shaping introduces a stacked value mechanism to encourage consistent positive behavior, where the stacked value grows with the number $k$ of consecutive positive rewards and is added to the total reward. The elite set (ES) ensures the agent avoids learning from poor-performance episodes: if the agent takes more than 500 steps in a single episode, the episode is terminated and its data are excluded from training, so that only high-performance episodes are retained.
The PPO loss function is modified to incorporate these components. The clipped objective is the standard PPO surrogate, with the probability ratio and the advantage function computed from the shaped rewards, and the value network minimizes the temporal difference loss. As presented in Algorithm 10, the agent learns the optimal policy by combining TL for initialization, DNUT for focus, RS for behavior shaping, and ES for data curation. These components significantly enhance the efficiency and robustness of PPO in cleaning environments.
Algorithm 10: The PPO-based Cleaning Method Proposed by [42].
6.3.3. PPO-Based End-to-End Path Planning for Lunar Rovers with Safety Constraints
Yu et al. [43] proposed an end-to-end path planning method based on PPO by integrating a safety-aware reward function, sliding behavior analysis, and curriculum learning to address the challenges of autonomous path planning for lunar rovers. These improvements ensure safety, stability, and adaptability across different terrains. The path planning problem is formulated as a Markov Decision Process (MDP), with the policy $\pi_\theta$ represented by a deep neural network parameterized by $\theta$. The observations consist of depth camera data, LiDAR data, and rover/target state information, and the optimal policy maximizes the expected cumulative reward over these observations. The safety-aware reward function incorporates distance and safety considerations, weighted by a safety factor. The distance reward is based on the change in the rover’s distance to the target between times $t-1$ and $t$ and on the distance to the nearest obstacle.
The safety reward penalizes sliding and unstable behaviors based on the pitch and roll angles and on the sliding rate and sliding angle, which are calculated using constants derived from experimental data. The policy is updated using the PPO objective with a clipped surrogate, where the probability ratio and the advantage function are computed as in standard PPO.
Curriculum learning is applied to improve adaptability by incrementally increasing terrain complexity: training begins on flat terrain and progresses to more complex environments with adjusted safety factors. This progressive training ensures robust and efficient learning of safe and optimal paths.
Figure 12 shows a block diagram of the method for lunar rover path planning. The input includes depth camera data, LiDAR data, and rover/target state information. Feature extraction is performed using a neural network, followed by safety-aware reward calculation and policy updates using PPO. A curriculum learning strategy incrementally increases terrain complexity, resulting in a safe and efficient navigation policy.
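As an illustration only, the snippet below sketches how a distance-progress term and a stability penalty might be combined under a safety factor. The weights, thresholds, and function signature are hypothetical and are not taken from [43].

```python
def safety_aware_reward(d_prev, d_curr, d_obstacle, pitch, roll,
                        slide_rate, safety_factor=0.5):
    """Hypothetical reward mixing progress toward the target with a penalty
    for unstable or sliding behavior, scaled by a safety factor."""
    # Progress term: positive when the rover moves closer to the target.
    r_dist = d_prev - d_curr
    # Mild penalty when the nearest obstacle is closer than 1 m (placeholder).
    if d_obstacle < 1.0:
        r_dist -= (1.0 - d_obstacle)
    # Stability term: penalize large attitude angles and sliding.
    r_safe = -(abs(pitch) + abs(roll)) - slide_rate
    return (1.0 - safety_factor) * r_dist + safety_factor * r_safe
```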
6.3.4. PPO-Based Real-Time Collision Avoidance in Dense Crowds
The study by Liang et al. [
44] proposes a method, CrowdSteer, that builds upon PPO by incorporating multi-sensor fusion, a hybrid reward function, and specialized training scenarios to address challenges in dense and dynamic crowd navigation. The approach uses data from a 2-D LiDAR and a depth camera to implicitly model interactions with obstacles and pedestrians, generating smoother and collision-free trajectories. The navigation task is formulated as a Partially Observable Markov Decision Process (POMDP) defined by the 6-tuple $(S, A, P, R, \Omega, O)$, where $S$ represents the state space, $A$ the action space, $P$ the state transition probabilities, $R$ the reward function, $\Omega$ the observation space, and $O$ the observation probability distribution. The robot's action at time $t$ is sampled from the policy as $a_t \sim \pi_\theta(a_t \mid o_t)$, where the observation $o_t$ comprises LiDAR data, camera data, the relative goal location, and the robot velocity. The hybrid reward function encourages goal-reaching and safe navigation while penalizing collisions and oscillatory behavior; its individual components are weighted and summed into the total reward.
The policy is trained using PPO with the clipped surrogate objective, where $r_t(\theta)$ is the probability ratio and $\hat{A}_t$ is the advantage function. The policy $\pi_\theta$ is represented by a neural network with separate branches for processing LiDAR and camera data, followed by fusion layers that integrate these features with the goal and velocity observations; the LiDAR and camera streams are processed independently before being fused.
During training, the policy is initialized in static environments and progressively trained in dynamic scenarios with occluded obstacles and high pedestrian densities. This curriculum learning approach improves the robot’s ability to generalize to real-world dense crowd navigation while ensuring smooth and efficient trajectories.
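The following PyTorch sketch shows one way such a two-branch observation encoder could be organized, with a 1-D convolutional branch for the LiDAR scan and a 2-D convolutional branch for the depth image, fused with the goal and velocity vectors. Layer sizes and names are illustrative assumptions, not the CrowdSteer architecture.

```python
import torch
import torch.nn as nn

class FusionPolicyNet(nn.Module):
    """Illustrative two-branch encoder: 1-D conv for LiDAR, 2-D conv for the
    depth image, fused with goal and velocity features before the action head."""
    def __init__(self, action_dim=2):
        super().__init__()
        self.lidar_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.camera_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fusion = nn.LazyLinear(256)   # infers the fused feature size
        self.head = nn.Linear(256, action_dim)

    def forward(self, lidar, depth, goal, velocity):
        # lidar: (B, 1, L), depth: (B, 1, H, W), goal/velocity: (B, 2) each
        feats = torch.cat([self.lidar_branch(lidar),
                           self.camera_branch(depth),
                           goal, velocity], dim=1)
        return self.head(torch.relu(self.fusion(feats)))
```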
6.3.5. Self-Learned Vehicle Control Using PPO
Canal and Taschin [
45] proposed a method that enhances PPO for vehicle control by combining minimal path planning with reinforcement learning to achieve efficient and generalizable vehicle control in obstacle-rich environments. The key innovation lies in using PPO to learn vehicle control policies that follow a pre-computed approximate path while adapting to unforeseen obstacles and optimizing for faster completion times.
The environment is framed as a POMDP, where the state $s_t$ includes the relative positions of the next $n$ checkpoints, the vehicle velocity, and simulated LiDAR data. The action $a_t$ consists of continuous control variables, such as throttle and steering for cars, and x-axis and z-axis movements for drones. The policy $\pi_\theta$ is parameterized by a neural network that maps the observed state $s_t$ to an action $a_t$.
The reward function balances three objectives: checkpoint progress, collision avoidance, and minimizing time; the total reward $r_t$ at time $t$ sums these weighted terms. The policy is trained using PPO with the clipped surrogate objective, where $r_t(\theta)$ is the probability ratio and $\hat{A}_t$ is the advantage function.
The training process employs curriculum learning, progressively increasing the complexity of generated maps by adjusting the proportion of obstacles and randomizing the initial orientation of the vehicle. Early training stages use simpler maps with fewer obstacles and optimal starting orientations. The difficulty gradually increases as the agent improves, enhancing generalization to unseen environments.
Additionally, the method introduces a fractional factorial design to optimize hyperparameters and state representations. This approach ensures the efficient evaluation of different configurations without exhaustive testing, reducing overall computational costs during training. By combining minimal path planning, curriculum learning, and PPO-based control, the proposed method eliminates the need for detailed motion planning while achieving high performance in generalizable navigation tasks.
6.4. Soft Actor–Critic (SAC)
Soft Actor–Critic (SAC), introduced by Haarnoja et al. [
2], is an off-policy reinforcement learning algorithm that combines maximum entropy reinforcement learning with actor–critic methods. The primary goal of SAC is to optimize a stochastic policy by maximizing both the expected cumulative reward and the entropy of the policy, encouraging exploration and robustness to model uncertainty.
The objective of SAC can be expressed as
$$J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],$$
where $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a_t \sim \pi}\big[\log \pi(a_t \mid s_t)\big]$ is the entropy of the policy, $\alpha$ is the temperature parameter that balances exploration (via entropy maximization) and exploitation (via reward maximization), and $\gamma$ is the discount factor.
SAC employs two key networks: a stochastic policy $\pi_\phi$ parameterized by $\phi$, and two Q-value functions $Q_{\theta_1}$ and $Q_{\theta_2}$ parameterized by $\theta_1$ and $\theta_2$, respectively. The use of two Q-value functions addresses the overestimation bias typically observed in value-based reinforcement learning methods. The target Q-value is computed as
$$y = r + \gamma \Big( \min_{i=1,2} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi_\phi(a' \mid s') \Big), \quad a' \sim \pi_\phi(\cdot \mid s'),$$
where $\bar{\theta}_i$ are the parameters of the target Q-networks. The Q-value networks are updated by minimizing the Bellman residual
$$J_Q(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[ \big( Q_{\theta_i}(s, a) - y \big)^2 \Big],$$
where $\mathcal{D}$ is a replay buffer storing transitions.
The policy $\pi_\phi$ is optimized to maximize the expected entropy-augmented reward,
$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi}\Big[ \min_{i=1,2} Q_{\theta_i}(s, a) - \alpha \log \pi_\phi(a \mid s) \Big].$$
The temperature parameter $\alpha$ can be adjusted dynamically to control the trade-off between exploration and exploitation; it is updated by minimizing the objective
$$J(\alpha) = \mathbb{E}_{a \sim \pi_\phi}\big[ -\alpha \log \pi_\phi(a \mid s) - \alpha \bar{\mathcal{H}} \big],$$
where $\bar{\mathcal{H}}$ is the desired minimum entropy. The step-by-step pseudocode for the SAC algorithm is presented in Algorithm 11.
Algorithm 11: Soft Actor–Critic (SAC) Algorithm.
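A minimal PyTorch-style sketch of the three SAC updates (critic, actor, and temperature) is given below. It assumes generic `actor`, `critic1`/`critic2`, and target-critic networks with an `actor.sample` method returning an action and its log-probability; these names and the batch layout are assumptions, and the sketch follows the standard formulation above rather than any specific robotics implementation.

```python
import torch
import torch.nn.functional as F

def sac_update(batch, actor, critic1, critic2, target1, target2,
               log_alpha, gamma=0.99, target_entropy=-2.0):
    """One SAC gradient step on a replay batch; returns the three losses."""
    s, a, r, s_next, done = batch
    alpha = log_alpha.exp()

    # Critic target: minimum of the two target Q-values plus the entropy bonus.
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)     # reparameterized sample
        q_next = torch.min(target1(s_next, a_next), target2(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)

    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)

    # Actor: maximize the entropy-augmented Q, i.e., minimize its negative.
    a_new, logp_new = actor.sample(s)
    q_new = torch.min(critic1(s, a_new), critic2(s, a_new))
    actor_loss = (alpha.detach() * logp_new - q_new).mean()

    # Temperature: drive the policy entropy toward the target entropy.
    alpha_loss = -(log_alpha * (logp_new.detach() + target_entropy)).mean()
    return critic_loss, actor_loss, alpha_loss
```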
SAC has been shown to outperform traditional reinforcement learning algorithms such as Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3) in tasks involving high-dimensional continuous action spaces. Its ability to encourage exploration through entropy maximization makes it particularly effective in environments with sparse rewards and high uncertainty.
The combination of off-policy learning, entropy regularization, and a stochastic actor enables SAC to achieve robust performance in a wide range of robotic control tasks, including locomotion, navigation, and manipulation. The recent advancements related to robotic path planning are reviewed in the following sections.
6.4.1. SAC Algorithm in Dynamic Path Planning for Mobile Robots
Yang et al. [
46] improved the original SAC algorithm to address the challenges of mobile robot path planning in environments with static and dynamic obstacles. They introduced an enhanced reward function, state dynamic normalization, and a prioritized replay buffer to improve robustness and decision-making efficiency.
The SAC framework is retained: the policy $\pi_\phi$ maps the current state $s_t$ to a stochastic action $a_t \sim \pi_\phi(\cdot \mid s_t)$, and the actor network is optimized to maximize the entropy-regularized objective, in which the entropy term balances exploration and exploitation and $\alpha$ is the entropy temperature coefficient. The critic networks are updated by minimizing the temporal difference (TD) error for the value function, whose target includes the entropy term. To enhance the reward structure, a hybrid reward function is defined to guide the robot's behavior.
State dynamic normalization ensures stable training by normalizing the state vector to approximately follow a standard normal distribution, $\hat{s} = (s - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the state distribution. The prioritized replay buffer selects samples with higher importance for training, sampling transition $i$ with probability proportional to its priority $p_i$ raised to an exponent that controls the influence of priority on sampling.
This combination of techniques enables SAC to navigate efficiently in complex environments, outperforming the original SAC and PPO algorithms in terms of stability and cumulative rewards. A block diagram of the proposed SAC-based method for mobile robot navigation is shown in Figure 13. The input comprises state information including LiDAR, velocity, and target observations. Normalized states are processed by the actor and critic networks; the prioritized replay buffer improves sampling, and the entropy adjustment refines exploration. The final policy update combines these outputs to produce an optimized policy that yields efficient navigation actions.
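The sketch below shows a simple proportional prioritized replay buffer of the kind described above. Using the absolute TD error as the priority and naming the exponent `beta` (to avoid clashing with the SAC temperature) are standard choices from prioritized experience replay, assumed here rather than taken from [46].

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized replay: P(i) is proportional to p_i**beta."""
    def __init__(self, capacity, beta=0.6):
        self.capacity, self.beta = capacity, beta
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:        # drop the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(abs(td_error) + 1e-6)

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.beta
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + 1e-6
```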
6.4.2. Sim-to-Real Transfer with Incremental Environment Complexity
The study by Chaffre et al. [
47] proposed a method that builds on SAC by introducing incremental environment complexity and an improved reward shaping mechanism to achieve robust depth-based mapless navigation for mobile robots. The main enhancements address the sim-to-real transfer problem by combining domain randomization with structured environment transitions. The SAC framework is used, where the policy $\pi_\phi$ maps states $s_t$ to actions $a_t$, optimizing the entropy-regularized objective in which $\mathcal{H}(\pi_\phi(\cdot \mid s_t))$ is the entropy of the policy and $\alpha$ is a temperature parameter balancing exploration and exploitation.
The Q-value and value functions follow the standard SAC formulation. The reward function is shaped to guide the robot's behavior effectively during training: the reward $r_t$ is defined in terms of the change in distance to the target and a velocity reduction factor, with constants set to encourage task completion while penalizing collisions and ineffective movements.
Incremental complexity is introduced by progressively training the policy in three simulated environments (empty room, static obstacles, and mixed static and dynamic obstacles). Transitions between environments are governed by the success rate over recent episodes: if the success rate exceeds an upper threshold, the agent progresses to a more complex environment; if it falls below a lower threshold, it returns to a simpler one. The policy update is performed by minimizing the KL divergence between the policy and the exponentiated Q-function,
$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s) \,\middle\|\, \frac{\exp\big(Q_\theta(s, \cdot)\big)}{Z_\theta(s)} \right) \right],$$
where $Z_\theta(s)$ is a partition function that normalizes the distribution.
This approach ensures smooth transitions between simulation and real-world scenarios by gradually increasing task complexity and refining the policy through domain randomization and reward shaping.
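A minimal sketch of the success-rate-driven environment schedule is shown below; the thresholds, window size, and environment names are placeholders, since the exact values used in [47] are not reproduced here.

```python
from collections import deque

class EnvironmentScheduler:
    """Move between environments of increasing difficulty based on the
    recent success rate (placeholder thresholds and level names)."""
    LEVELS = ["empty_room", "static_obstacles", "mixed_obstacles"]

    def __init__(self, window=50, up_threshold=0.8, down_threshold=0.4):
        self.results = deque(maxlen=window)
        self.level = 0
        self.up, self.down = up_threshold, down_threshold

    def report(self, success: bool) -> str:
        """Record an episode outcome and return the environment to use next."""
        self.results.append(1.0 if success else 0.0)
        rate = sum(self.results) / len(self.results)
        if rate > self.up and self.level < len(self.LEVELS) - 1:
            self.level += 1
            self.results.clear()      # restart statistics in the new level
        elif rate < self.down and self.level > 0:
            self.level -= 1
            self.results.clear()
        return self.LEVELS[self.level]
```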
A block diagram of the SAC-based method is presented in Figure 14. The input includes the state with LiDAR, velocity, and target observations. State normalization refines the inputs for the actor and critic networks; the actor generates the policy, while the critic evaluates the value functions. The shaped reward function and entropy adjustment guide the training process, and the policy update produces optimized navigation actions.
Figure 14. Block diagram of the sim-to-real transfer method of [47].
6.4.3. Dynamic Obstacle Avoidance and Path Planning Integration
Choi et al. [
48] developed an SAC-based framework that integrates decentralized dynamic obstacle avoidance with path planning. The enhancements to the SAC algorithm include a tailored state representation, a hybrid reward function, and a combination of reinforcement learning and classical path planning for improved navigation. The state comprises LiDAR measurements from three consecutive time steps, capturing the temporal dynamics of obstacles; a goal state giving the relative position of the robot with respect to the target; and a speed state consisting of the forward velocity $v$ and the rotational velocity $\omega$. The action is defined in a continuous space of linear and angular velocity commands.
The reward function combines terms for goal achievement, collision avoidance, and control efficiency. The SAC objective is augmented to optimize this hybrid reward while maintaining exploration through entropy regularization, where the entropy of the policy encourages diverse action exploration.
To address navigation inefficiencies and conflicts between obstacle avoidance and target-reaching, a classical path planning component is integrated. The global path planner generates a trajectory using the A* algorithm, which provides a “look-ahead point” for the SAC agent, selected along the path according to a distance parameter ahead of the robot. The relative position of the look-ahead point is incorporated into the RL agent’s input. The actor and critic networks are designed with convolutional layers to process the LiDAR data and fully connected layers for the speed and goal components; the critic evaluates the Q-value and the value function following the standard SAC formulation.
By combining SAC with classical path planning and integrating dynamic obstacle prediction into the state representation, the method achieves robust and efficient navigation in complex environments. Algorithm 12 presents a step-by-step pseudocode for the method.
Algorithm 12: The SAC-based Dynamic Obstacle Avoidance and Path Planning of [48].
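The following sketch illustrates one way a look-ahead point can be extracted from a global A* path at a fixed distance ahead of the robot and expressed in the robot frame. The distance parameter and helper names are illustrative assumptions, not the exact scheme of [48].

```python
import numpy as np

def look_ahead_point(path, robot_xy, robot_yaw, look_ahead_dist=1.5):
    """Pick the first waypoint on the global path lying at least
    `look_ahead_dist` away from the robot, expressed in the robot frame."""
    path = np.asarray(path, dtype=float)              # (N, 2) A* waypoints
    robot_xy = np.asarray(robot_xy, dtype=float)
    dists = np.linalg.norm(path - robot_xy, axis=1)
    ahead = np.flatnonzero(dists >= look_ahead_dist)
    target = path[ahead[0]] if ahead.size else path[-1]

    # Transform the chosen point into the robot-centric frame.
    dx, dy = target - robot_xy
    c, s = np.cos(-robot_yaw), np.sin(-robot_yaw)
    return np.array([c * dx - s * dy, s * dx + c * dy])
```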
6.4.4. Robot Navigation in Constrained Pedestrian Environments
The study by Perez-D’Arpino et al. [
49] proposed a method that extends SAC by integrating a sampling-based motion planner with a reactive sensorimotor policy to achieve dynamic navigation in constrained pedestrian environments. The problem is formulated as a Partially Observable Markov Decision Process (POMDP) $(S, A, O, T, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $O$ is the observation space, $T$ is the state transition model, $R$ is the reward function, and $\gamma$ is the discount factor. Observations comprise a goal, a LiDAR scan, and waypoints: the goal represents the 2D coordinates of the target in the robot’s frame, the LiDAR observation consists of 128 range measurements from a 1D LiDAR sensor, and the waypoints are derived from a global planner using a 2D map.
The policy outputs actions $(v_x, v_y, \omega)$, where $v_x$ and $v_y$ are the linear velocities and $\omega$ is the angular velocity.
The reward function incorporates terms to encourage goal-reaching, minimize time, avoid collisions, and follow global waypoints. The goal term is awarded when the robot is within a goal tolerance distance of the target, and the waypoint term is a potential-based reward inversely proportional to the distance to the next waypoint.
The SAC algorithm optimizes the expected sum of rewards while maximizing the entropy of the policy, with $\alpha$ the entropy temperature and $\mathcal{H}(\pi(\cdot \mid s_t))$ the policy entropy. The policy and value networks are updated using SAC’s off-policy gradient rules: the value function follows the entropy-augmented formulation, and the critic is updated by minimizing the Bellman error.
The policy receives high-level waypoints from a motion planner and uses RL to navigate while avoiding collisions and adapting to dynamic pedestrian behavior. Training involves multi-layout environments (e.g., corridors, doors, intersections), allowing the policy to generalize to unseen, compositional environments.
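As a rough illustration of the waypoint-following term, the snippet below computes a potential-based bonus that grows as the robot approaches the next waypoint, together with a terminal goal bonus. The constants and function signature are hypothetical and only echo the reward structure described above.

```python
import numpy as np

def waypoint_reward(robot_xy, waypoint_xy, goal_xy,
                    goal_tol=0.3, waypoint_scale=0.1, goal_bonus=10.0):
    """Hypothetical shaping: a potential term inversely proportional to the
    distance to the next waypoint, plus a bonus when the goal is reached."""
    d_wp = np.linalg.norm(np.subtract(waypoint_xy, robot_xy))
    d_goal = np.linalg.norm(np.subtract(goal_xy, robot_xy))
    reward = waypoint_scale / (d_wp + 1e-3)      # potential-based waypoint term
    if d_goal < goal_tol:                        # within the goal tolerance
        reward += goal_bonus
    return reward
```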
6.4.5. Velocity Range-Based SAC Reward Shaping for Mapless Navigation
A novel SAC reward shaping method was proposed by Lee and Jeong [
50] in which a “Velocity Range-Based Reward Shaping” technique is devised to improve mapless navigation for autonomous mobile robots. This technique adjusts the reward function to prioritize efficient and safe movements by incorporating velocity-based evaluations into the reinforcement learning framework. The robot’s state consists of LiDAR-based distance data and the relative target coordinates. The action consists of the linear velocity $v$ and the angular velocity $\omega$, each bounded within a fixed range.
The reward function incorporates velocity-based evaluation and combines the distance to the target, the orientation error, and a velocity-based evaluation score, weighted by importance weights. The velocity evaluation score is determined by dividing the velocity range into $n$ segments and assigning an evaluation score to each segment. The policy is optimized using the SAC objective, modified to include the velocity-based reward, where $\alpha$ is the entropy temperature and $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the policy. The value function and Q-function are trained with the standard SAC updates.
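The velocity-range scoring can be pictured with the short sketch below, which splits the admissible linear-velocity range into a fixed number of segments and returns the score of the segment the current velocity falls into. The segment boundaries and scores are placeholder values, not those of [50].

```python
import numpy as np

def velocity_range_score(v, v_min=0.0, v_max=0.8, scores=(0.2, 0.5, 1.0, 0.5)):
    """Divide [v_min, v_max] into len(scores) segments and return the
    evaluation score of the segment containing velocity v (placeholder scores)."""
    edges = np.linspace(v_min, v_max, num=len(scores) + 1)
    idx = int(np.clip(np.searchsorted(edges, v, side="right") - 1,
                      0, len(scores) - 1))
    return scores[idx]
```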
By incorporating velocity-based evaluation into the reward function and leveraging the entropy-regularized objective, the method achieves stable and efficient mapless navigation in dynamic environments. A block diagram of the process flow is presented in Figure 15. The input includes state information derived from LiDAR data and the target position. State normalization prepares the inputs for the actor and critic networks; the actor generates the policy, while the critic evaluates the Q-function. The hybrid reward function and entropy adjustment guide the training process, the replay buffer stores transitions for policy updates, and the final policy update produces optimized navigation actions.
7. Emerging Trends in Reinforcement Learning for Robotics
Recent advancements in reinforcement learning have introduced new techniques that improve learning efficiency, generalization, and decision-making in robotics. This section explores emerging trends, including transformer-based RL, attention mechanisms, meta-learning, and transfer learning.
7.1. Transformer-Based DRL
Transformers, originally developed for natural language processing, have shown promise in reinforcement learning by improving long-term sequence modeling and decision-making. Hu et al. [
51] provided a comprehensive survey on transformer-based RL, highlighting its applications in robotic manipulation, autonomous driving, and multi-agent coordination. These architectures enable better credit assignment over long time horizons and mitigate the “deadly triad” problem of RL. Similarly, Zhou et al. [
52] introduced a transformer-based approach for autonomous driving, demonstrating a 41% reduction in collision rates compared to conventional reinforcement learning algorithms.
7.2. Meta-Learning and Transfer Learning in Robotics
Meta-learning aims to enable RL agents to adapt rapidly to new environments by learning how to learn. Ren et al. [
53] developed a meta-RL framework that leverages human preference feedback for few-shot adaptation, allowing robots to learn complex tasks with minimal demonstrations. Furthermore, Liu and Ahmad [
54] proposed a multi-task RL framework that reuses previously learned policies for efficient adaptation in continuous control tasks, reducing sample complexity and improving policy generalization.
Transfer learning has also gained traction in RL-based robotics. Jiang et al. [
55] introduced a transformer-based multi-agent RL approach for cross-domain transfer in urban environments, significantly improving generalization across different cities. These methods enable RL agents to transfer knowledge across different environments, reducing training times for real-world deployment.
7.3. Advancements in Multi-Agent Reinforcement Learning
Multi-agent RL (MARL) has witnessed significant progress in handling complex coordination problems. Gabler and Wollherr [
56] proposed a decentralized MARL framework with dual critics, enabling robots to optimize joint team rewards while considering individual constraints. Additionally, Song et al. [
57] introduced a hierarchical reinforcement learning method for large-scale multi-agent path planning, enhancing sample efficiency through spatiotemporal abstraction.
7.4. Attention Mechanisms in Robot Navigation
Attention mechanisms have been increasingly applied to RL-based navigation tasks. Escudie et al. [
58] developed an attention graph-based RL framework for multi-robot social navigation, improving interaction modeling in dense human environments. Their results demonstrated enhanced cooperation and safety in real-world scenarios.
These emerging trends illustrate the growing capabilities of RL in robotics and highlight promising directions for future research. Integrating these advanced methodologies will enable more efficient, robust, and scalable autonomous systems.
7.5. Scaling DRL for Large-Scale Robotic Systems and Multi-Agent Environments
Deploying deep reinforcement learning (DRL) in large-scale robotic systems, such as urban autonomous fleets or search-and-rescue missions, presents unique challenges related to computational efficiency, coordination, and policy transferability. Recent studies have explored various methods to improve the scalability of DRL-based approaches for multi-agent environments.
One approach is the development of hierarchical DRL techniques that enable more efficient decision-making in large agent populations. Liu et al. [
59] provided an extensive survey on scalability challenges in multi-agent DRL (MADRL), highlighting the role of hierarchical structures in managing state–action space complexity. Their study emphasized that scalable algorithms must incorporate robustness mechanisms to handle dynamic environments effectively.
Graph-based learning methods have also proven effective for large-scale coordination. Zhang et al. [
60] introduced a graph-based DRL technique for multi-robot task allocation, leveraging graph normalization to enhance scalability. Their approach demonstrated improved task efficiency and adaptability to varying team sizes, outperforming conventional heuristic-based task assignment methods.
Distributed and federated learning have also been explored as viable solutions for scalability. Pramuk et al. [
61] proposed a federated DRL framework that enables multi-robot systems to share learned policies without centralized computation, reducing communication overhead while maintaining performance in large-scale navigation tasks. Similarly, Chen et al. [
62] introduced an open-source framework, MultiRoboLearn, designed to bridge the gap between multi-agent DRL simulations and real-world multi-robot applications, ensuring better generalization across different environments.
Another critical consideration in scalability is decentralized policy learning. Liu et al. [
63] presented an Independent Soft Actor–Critic (ISAC)-based method that enhances decentralized training in large-scale multi-agent path planning. Their results showed improved success rates and computation efficiency compared to centralized learning approaches.
Overall, these advancements illustrate the growing feasibility of DRL in large-scale robotic systems, paving the way for scalable, efficient, and adaptive multi-agent coordination in real-world applications.
7.6. Safe Exploration in DRL for Robotics
Ensuring safe exploration is a critical challenge in deep reinforcement learning (DRL) for robotic applications. While traditional RL methods rely on trial-and-error exploration, real-world deployment necessitates strategies that prevent unsafe actions during training and execution. Recent advancements in safe DRL have introduced several approaches, including constrained optimization, risk-aware policy learning, and uncertainty-aware exploration techniques.
One approach to safe exploration is through constrained optimization, where policies are trained under explicit safety constraints. Zhao et al. [
64] introduced Safe Set Guided State-wise Constrained Policy Optimization (S-3PO), a method that ensures zero training violations by employing a safety monitor with black-box dynamics. Their approach improves sample efficiency while preventing unsafe transitions. Similarly, Xu et al. [
65] proposed an Extra Safety Budget (ESB-CPO) framework that balances exploration efficiency with safety compliance, dynamically adjusting constraints throughout training.
Hierarchical reinforcement learning has also been explored for safe navigation. Roza et al. [
66] developed a constrained hierarchical RL framework that integrates a safety layer to modify low-level policies, reducing collision rates while maintaining high performance. Additionally, Marchesini et al. [
67] introduced a Safety-Oriented Search technique that biases policy optimization towards safer actions, leveraging an evolutionary cost optimization framework.
Uncertainty-aware policies have been proposed to address model inaccuracies in dynamic environments. Ramírez and Yu [
68] developed a Monte Carlo dropout-based method to estimate uncertainty in state–action pairs, enabling safer decision-making in robotic control. Their approach significantly reduces risk while maintaining policy efficiency. Furthermore, Jayant and Bhatnagar [
69] introduced a model-based constrained RL approach that leverages an ensemble of neural networks to estimate uncertainty and ensure robust constraint satisfaction.
Beyond policy optimization, risk-aware exploration strategies have been studied to enhance safety. Kim et al. [
70] proposed SafeTAC, a Tsallis entropy-regularized safe RL method that improves exploration without violating safety constraints. Their method successfully mitigates overly conservative behaviors while optimizing performance in high-dimensional robotic tasks.
10. Conclusions and Future Directions
This review provides a comprehensive analysis of the application of Deep Reinforcement Learning (DRL) in mobile robot path planning, showcasing its potential to tackle dynamic and uncertain environments where traditional methods often fall short. By leveraging neural networks to handle high-dimensional state–action spaces, DRL enables mobile robots to make real-time decisions with enhanced efficiency and adaptability. The categorized exploration of value-based methods, policy-based approaches, and hybrid frameworks demonstrates the breadth and versatility of DRL techniques in addressing complex navigation challenges. Key advancements such as Proximal Policy Optimization (PPO), Soft Actor–Critic (SAC), and hybrid frameworks have delivered state-of-the-art performance in scenarios ranging from indoor navigation to multi-agent coordination. Despite this progress, limitations persist in scalability, safety, interpretability, and generalization to unseen environments. Addressing these challenges is essential for fully harnessing the transformative potential of DRL in autonomous navigation.
The future of DRL in mobile robot path planning lies in overcoming these limitations and expanding its applicability to real-world scenarios. One critical area is scalability, as current DRL models often struggle with large and complex environments. Distributed and hierarchical learning frameworks could address this by decomposing navigation tasks into subproblems. For instance, a warehouse robot might use a hierarchical approach to first navigate to a specific aisle and then locate a target object within it. Real-time performance improvements will also be essential, especially for logistics robots in high-demand environments like fulfillment centers.
Ensuring safety in navigation is another crucial direction, particularly for robots operating in human-centric environments. By integrating DRL with safety constraints and verification mechanisms, robots can learn to avoid risky behaviors. Autonomous delivery robots in crowded urban areas, for example, could benefit from policies that account for pedestrian movements and dynamic obstacles while maintaining a reliable path to their destination.
The gap between simulation and reality remains a significant challenge. While simulation environments are invaluable for training DRL models, real-world deployment often suffers from discrepancies in environmental conditions. Techniques like domain adaptation and transfer learning could help bridge this gap. For example, autonomous vehicles trained on simulated highways could adapt their policies to real-world conditions involving variable weather, road textures, and traffic dynamics.
Coordination among multiple robots presents another promising avenue for DRL. Multi-agent reinforcement learning frameworks could promote collaboration and efficient resource use in shared environments. Search-and-rescue missions, for instance, could benefit from teams of drones and ground robots working together to navigate and locate survivors in disaster-stricken areas. Similarly, such frameworks could optimize fleets of warehouse robots for collaborative tasks, reducing bottlenecks and improving throughput.
Socially aware navigation is another frontier where DRL could play a transformative role. Incorporating human preferences and social norms into reward functions can help robots operate seamlessly in public spaces. Service robots in airports, shopping malls, or hospitals must navigate crowded areas while respecting personal space and pedestrian flow patterns. These considerations could enhance their acceptance and utility in everyday human environments.
Energy efficiency is an often-overlooked aspect of path planning but is critical for battery-powered mobile robots. Reward functions that factor in energy consumption and optimize for sustainable operation could extend the operational range of robots in applications like agricultural fieldwork or remote inspection tasks. For instance, agricultural robots managing tasks like planting or weed removal would benefit from energy-efficient navigation strategies that maximize battery life across large fields.
The exploration of unstructured terrains presents additional challenges and opportunities for DRL. Robots operating in forests, disaster sites, or underwater environments require policies capable of handling incomplete environmental data and complex dynamics. Exploration-based DRL approaches, which use curiosity-driven rewards, could enable robots to navigate unknown terrains effectively. For example, underwater robots conducting coral reef surveys could leverage these techniques to explore and document fragile ecosystems while avoiding obstacles.
Deep reinforcement learning presents significant computational challenges when deployed on real-world robotic systems, particularly due to the high-dimensional sensory inputs required for robust decision-making. LiDAR and vision-based sensing are widely used in mobile robot path planning, but processing these inputs in real-time demands specialized hardware configurations.
Gautier et al. [
91] highlighted that multi-robot systems relying on DRL for task allocation face severe computational bottlenecks, particularly when cloud resources are unavailable. Their study demonstrated that distributing computation across local robotic clusters improves efficiency but introduces coordination challenges. Similarly, Kwon et al. [
92] explored Hyperdimensional Computing (HDC) as a lightweight alternative to deep learning for sensorimotor control, achieving a 14.2× improvement in computational efficiency while maintaining comparable accuracy.
For autonomous navigation, high-dimensional image data processing is another major concern. Vijetha and Geetha [
93] proposed a state abstraction technique to transform raw vision data into compact representations, significantly reducing the computational overhead of DRL policies running on resource-constrained devices. Additionally, Xiang et al. [
94] introduced RMBench, a benchmarking framework for DRL-based robotic manipulation, revealing that Soft Actor–Critic (SAC) achieves the best trade-off between computational cost and task performance.
Another critical challenge is balancing computational efficiency with training stability. Han et al. [
16] proposed a deep stochastic Koopman operator (DeSKO) to model complex robot dynamics, improving decision-making efficiency while mitigating the impact of computational constraints. Their approach demonstrated robust performance in real-world soft robotic systems.
Given these considerations, future research should focus on optimizing DRL architectures for edge computing and developing hybrid approaches that integrate learning-based methods with classical control techniques to alleviate hardware limitations.
Finally, advancements in integrating vision and multimodal sensory systems could significantly enhance DRL’s robustness in real-world applications. Combining inputs from cameras, LiDAR, and other sensors could enable robots to build richer environmental models. Autonomous drones delivering packages in urban areas, for instance, could utilize multimodal inputs for precise obstacle detection and landing.
As DRL continues to evolve, its potential to generalize across unseen scenarios will become increasingly important. Approaches like meta-reinforcement learning and few-shot learning could enable robots to adapt rapidly to new tasks or environments with minimal additional training. For example, autonomous robots in industrial settings could quickly learn to navigate newly reconfigured layouts or operate in entirely different facilities.
By addressing these challenges and directions, DRL could become a cornerstone of mobile robot navigation, paving the way for its widespread deployment in diverse applications. These advancements will not only improve the adaptability and efficiency of robots but also enhance their safety and reliability, enabling them to meet the demands of real-world scenarios effectively.