1. Introduction
Reinforcement Learning (RL) has emerged as a powerful framework for solving sequential decision-making problems, where an agent interacts with an environment to learn optimal behaviors by maximizing cumulative rewards [1]. This ability to adapt through interaction makes RL particularly suitable for mobile robot path planning in dynamic and uncertain environments. Traditional path planning algorithms, such as A* and Dijkstra's, excel in static and well-structured settings but falter in applications characterized by uncertainty, unpredictability, and incomplete knowledge of the environment. For instance, search-and-rescue operations in disaster-stricken areas often involve navigating through unknown terrains filled with debris and dynamic obstacles. Similarly, autonomous delivery robots operating in crowded urban environments must make real-time decisions to avoid pedestrians and other obstacles while efficiently reaching their destinations. In such scenarios, RL offers a robust solution by learning policies that directly map sensory inputs to actions, enabling real-time decision-making and adaptability [2].
Deep Reinforcement Learning (DRL), a subset of RL that leverages deep neural networks, has significantly advanced the capabilities of RL by addressing challenges associated with high-dimensional state–action spaces and continuous control. Unlike traditional RL, which struggles to scale in complex environments, DRL employs neural networks to approximate value functions and policies, enabling efficient learning and decision-making in scenarios involving raw sensory inputs like camera images or LiDAR data [3]. This scalability and versatility make DRL an ideal framework for tackling complex navigation problems, such as autonomous navigation in cluttered indoor spaces, multi-agent coordination, and socially aware robot path planning. Techniques such as Proximal Policy Optimization (PPO) [4], Deep Deterministic Policy Gradient (DDPG) [3], and Soft Actor–Critic (SAC) [2] have demonstrated remarkable performance in such applications, achieving state-of-the-art results across a variety of benchmark environments.
This paper presents a comprehensive review of DRL techniques for mobile robot path planning, categorizing existing methods and analyzing their strengths and limitations. It explores how DRL can address challenges in path planning, including navigation in dynamic and uncertain environments, balancing efficiency with safety, and generalizing to unseen scenarios. The review focuses on the application of DRL in tasks requiring continuous control, adaptive decision-making, and integration with multimodal sensory inputs.
Several reviews of DRL solutions for mobile robot path planning have been presented in the literature, offering broad overviews of existing methods and their applications. However, this work differentiates itself by providing a structured and standardized analysis of DRL techniques through a unified notation framework. Specifically, we categorize various DRL methods and reformulate their mathematical formulations under a consistent notation, facilitating direct comparisons across different approaches. Furthermore, for each category, multiple representative solutions are analyzed, with their key components restructured to maintain uniformity in presentation. A notable contribution of this review is the inclusion of original block diagrams and pseudocode representations for nearly all solutions discussed, offering a clearer understanding of their working principles and implementation details. These enhancements provide a more accessible and structured perspective on DRL-based path planning, allowing researchers and practitioners to better interpret, compare, and apply these methodologies in real-world scenarios.
The rest of this paper is organized as follows: In Section 2, the problem of designing a reinforcement learning solution for mobile robot path planning is formally defined using a differential drive robot as an illustrative example, and the advantages of reinforcement learning are detailed. Section 3 then explores deep reinforcement learning and discusses how uncertainties are modeled and handled in DRL, with a note on sensor fusion for enhanced path planning using DRL; a categorization of the methods discussed in the subsequent sections is also presented. Section 4 examines value-based DRL approaches such as DQN and D3QN, followed by Section 5, where policy-based methods are introduced. Recent actor–critic methods, including DDPG, A3C, and PPO, are highlighted as the main developments in DRL for robot path planning, with several contemporary studies reviewed under each sub-category in Section 6. Some emerging trends in DRL for robotics are discussed in Section 7, including transformer-based DRL, meta-learning for transfer learning, multi-agent methods, and attention mechanisms. The section also covers the scaling of DRL solutions and the safety of mobile robot exploration using DRL. Section 8 covers practical considerations for real-world applications, such as the challenges of hybrid methods, optimization for real-time processing, and empirical performance comparisons. Samples of path planning outcomes from selected works are presented and discussed in Section 9, illustrating the benefits of these deep reinforcement learning strategies. The paper concludes by presenting insights and potential future research directions in Section 10.
3. Deep Reinforcement Learning
Deep Reinforcement Learning (Deep RL) leverages deep neural networks as function approximators to address high-dimensional state and action spaces. In scenarios such as mobile robot path planning, where the state space may include sensor data, such as images or LIDAR scans, and the action space involves continuous control, Deep RL provides a scalable and effective solution [3,8]. The key elements of Deep RL build directly on the Markov Decision Process (MDP) framework and extend it by introducing function approximation methods.
In Deep RL, the policy $\pi_\theta(a \mid s)$ is parameterized by a neural network with parameters $\theta$. The policy maps states $s$ to a probability distribution over actions $a$. The objective remains to optimize the parameters $\theta$ to maximize the expected cumulative reward (2):

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],$$

where $\gamma$ is the discount factor, as defined previously [1].
In addition to the policy, in deep RL, the state–value function $V^\pi(s)$ and the action–value function $Q^\pi(s,a)$, which are critical for evaluating the quality of a policy or action, can be approximated using neural networks:

$$V^\pi(s) \approx V_\phi(s), \qquad Q^\pi(s,a) \approx Q_\phi(s,a),$$

where $\phi$ represents the parameters of the neural network. Algorithms such as Deep Q-Networks (DQN) [8] use this approximation to solve the Bellman equation:

$$Q(s,a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q(s',a') \,\middle|\, s, a \right].$$
When deep RL combines a policy network (actor) $\pi_\theta(a \mid s)$ with a value network (critic) $V_\phi(s)$, the actor selects actions, while the critic evaluates them. This approach is called the actor–critic paradigm. The policy gradient is computed as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A(s,a) \right],$$

where the advantage $A(s,a)$ is provided by the critic network [2,4].
Deep RL also addresses the balance between exploration (choosing less certain actions to gather information) and exploitation (choosing actions that maximize known rewards). Techniques such as $\epsilon$-greedy policies [1,8] or entropy regularization [2] are commonly used to manage this trade-off. To stabilize learning, deep RL often employs mechanisms such as target networks and experience replay. Target networks provide a stable reference for value updates, and experience replay reuses past experiences for training, improving sample efficiency [8].
The key components of a Deep Reinforcement Learning (Deep RL) framework and their interactions are illustrated in Figure 1. The interaction flow in the diagram starts with the state $s_t$, which is processed by both the policy network (actor) and the value network. The policy network generates an action $a_t$, which interacts with the environment to produce a reward $r_t$ and a new state $s_{t+1}$. The value network evaluates $s_t$ to inform the critic. The critic provides feedback for updating both the value network and the policy network, ensuring iterative improvements in both evaluation and decision-making.
3.1. Modeling and Handling Uncertainty in Deep Reinforcement Learning
Uncertainty is an inherent challenge in real-world robotic applications, arising from sensor noise, dynamic obstacles, and environmental unpredictability. Traditional reinforcement learning models assume a stationary Markov decision process (MDP), which often fails to capture the stochastic nature of real-world interactions. To address these issues, researchers have proposed several uncertainty-aware methods to improve the robustness and adaptability of DRL-based robotic systems.
Bayesian deep reinforcement learning (BDRL) has been widely explored as a means to incorporate model uncertainty into decision-making. Zheng et al. [
11] introduced a Bayesian DRL framework that quantifies model uncertainty, aleatoric uncertainty, and reward function uncertainty in robot manipulation tasks. Their findings demonstrated that incorporating Bayesian networks into DRL significantly enhances convergence speed in sparse reward environments.
Robust policy optimization techniques have been developed to mitigate the performance degradation caused by uncertain transitions. Zhang et al. [
12] proposed an uncertainty set regularization (USR) technique, which formulates an uncertainty-aware policy using an adversarial approach to model unknown transition uncertainties. Their results on real-world reinforcement learning benchmarks showed improved robustness against environmental perturbations.
Trajectory planning in dynamic environments requires DRL agents to anticipate and adapt to uncertain constraints in real time. Govindarajula et al. [13] presented a trajectory planning framework that integrates deep learning models with reinforcement learning to handle dynamic obstacles and unpredictable environments effectively. Their approach demonstrated improved safety and efficiency in both simulated and real-world scenarios.
Handling sensor noise is crucial for ensuring stable DRL deployment in robotic navigation. Joshi et al. [
14] analyzed the impact of sensor noise on UAV navigation and obstacle avoidance using DRL. Their results indicated that denoising techniques such as Kalman filtering and artificially injecting noise during training improved real-world performance. Similarly, Martinez et al. [
15] developed a velocity-space-based DRL planner that mitigates discrepancies between simulation and reality, reducing failure rates in highly dynamic environments.
A novel data-driven stochastic modeling approach was introduced by Han et al. [
16], which utilizes a deep stochastic Koopman operator (DeSKO) for robust learning control in uncertain nonlinear robotic systems. Their framework achieved robust stability in real-world soft robotics applications, demonstrating resilience against unexpected disturbances.
These advancements illustrate the importance of uncertainty-aware DRL techniques in real-world robotics. Future research should focus on improving generalization across diverse operational conditions while maintaining computational efficiency.
3.2. Sensor Fusion for Enhanced Path Planning in DRL
Sensor fusion has become a crucial component in reinforcement learning-based robot navigation, allowing autonomous agents to integrate multiple sensory inputs to improve decision-making in complex environments. Combining LiDAR, camera images, and IMU data enhances depth perception, spatial awareness, and obstacle avoidance, leading to improved policy robustness in deep reinforcement learning (DRL).
Recent studies have demonstrated the effectiveness of LiDAR–camera fusion in DRL-based navigation. Ou et al. [
17] introduced a sensor fusion framework that integrates LiDAR and camera data to establish a more reliable environmental representation. Their approach enables mobile robots to navigate unfamiliar environments more efficiently than traditional single-sensor setups. Additionally, Tan [
18] designed a lightweight multimodal data fusion network that combines LiDAR and image-based semantic segmentation, bridging the gap between simulated and real-world navigation.
Incorporating IMU data into sensor fusion frameworks further enhances DRL-based path planning by providing motion stability and aiding localization. Xue and Gonsalves [
19] proposed a short-term visual-IMU fusion memory agent for drone motion planning, significantly reducing neural network complexity while maintaining robust obstacle avoidance. Similarly, Liu et al. [
20] developed a sensor fusion system that integrates IMU with LiDAR-based SLAM to improve 3D color reconstruction in large-scale environments.
Several DRL frameworks now explicitly incorporate sensor fusion to improve real-time navigation. Yan et al. [
21] introduced a deep deterministic policy gradient (DDPG)-based approach that fuses laser and vision sensor data, achieving a 10% improvement in navigation success rates in dynamic environments. Furthermore, Jiang et al. [
22] developed a multi-sensor fusion framework that preprocesses RGB and depth images from cameras before integrating them with LiDAR data, significantly improving obstacle avoidance in real-world scenarios.
These advancements highlight the growing importance of sensor fusion in DRL-based robotics. Future research should focus on optimizing real-time data processing pipelines and developing lightweight fusion architectures to enhance computational efficiency.
3.3. Categorization
Our investigations have resulted in the categorization of recent DRL solutions for mobile robot path planning into three general categories: value-based methods, policy-based methods, and hybrid actor–critic methods. In the next three sections, we review each of these categories. Since the last category is more recent and has shown considerable potential for the guidance and control of mobile robots, we provide an extensive exploration of different actor–critic algorithms, which by nature combine the principles used in value-based and policy-based approaches to DRL for mobile robot path planning. To provide a clearer understanding of the relationships between different DRL methods,
Figure 2 presents a conceptual diagram summarizing the categorization of DRL techniques. By structuring the discussion around these categories, we aim to establish a cohesive narrative that logically connects various methods and their applications.
4. Value-Based Methods
In the domain of mobile robot path planning, recent advances in deep reinforcement learning (Deep RL) have introduced both value-based and policy-based methods to solve complex navigation tasks. These methods leverage deep neural networks to approximate functions such as the action–value function $Q(s,a)$ or to directly parameterize policies $\pi_\theta(a \mid s)$. This section reviews key value-based methods, such as Deep Q-Networks (DQN), Double DQN (DDQN), and Dueling DQN (D3QN), as well as policy-based approaches, outlining their mathematical foundations and contributions to mobile robot navigation.
Value-based methods in deep RL focus on estimating the action–value function $Q^\pi(s,a)$, which represents the expected cumulative reward of taking action $a$ in state $s$ and following the policy $\pi$ thereafter. These methods solve the Bellman equation:

$$Q(s,a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q(s',a') \,\middle|\, s, a \right]$$

by using neural networks as function approximators. A well-known value-based deep RL method is Deep Q-Networks (DQN) [8], which approximates $Q(s,a)$ using a neural network parameterized by $\phi$, denoted as $Q_\phi(s,a)$. The network is trained by minimizing the mean squared error between the current Q-value and the target value:

$$L(\phi) = \mathbb{E}_{(s,a,r,s')}\!\left[ \big( y - Q_\phi(s,a) \big)^2 \right],$$

where the target value $y$ is as follows:

$$y = r + \gamma \max_{a'} Q_{\phi^-}(s',a').$$

Here, $\phi^-$ denotes the parameters of a separate target network, which is updated periodically for stability. The step-by-step pseudocode for DQN is presented in Algorithm 1.
Algorithm 1: Deep Q-Network (DQN) Algorithm
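The following is a minimal sketch of a single DQN gradient step consistent with the loss above, written with PyTorch. The tensor batch layout, the optimizer, and the function name `dqn_update` are assumptions for illustration; it is not the exact implementation of Algorithm 1.

```python
import torch
import torch.nn.functional as F


def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: regress Q(s,a) toward r + gamma * max_a' Q_target(s',a')."""
    states, actions, rewards, next_states, dones = batch  # pre-built float/long tensors
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # target network provides a fixed bootstrap reference
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```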
While DQN is effective, it suffers from overestimation bias in Q-value updates. Double DQN (DDQN) [23] addresses this by decoupling the action selection and evaluation processes. The target value is modified as follows:

$$y^{DDQN} = r + \gamma\, Q_{\phi^-}\!\big(s',\, \arg\max_{a'} Q_\phi(s',a')\big).$$

This correction reduces overestimation and improves learning stability. An alternative is the Dueling DQN (D3QN) method [24], which introduces a novel architecture that separates the representation of the state–value and advantage functions:

$$Q(s,a) = V(s) + \Big(A(s,a) - \tfrac{1}{|\mathcal{A}|}\textstyle\sum_{a'} A(s,a')\Big),$$

where $V(s)$ is the state–value function and $A(s,a)$ is the advantage function. This architecture helps the network learn which states are valuable independently of the specific actions taken, improving performance in complex environments. The pseudocode shown in Algorithm 2 presents how the D3QN method works in a step-by-step fashion.
Algorithm 2: Dueling Deep Q-Network (D3QN) Algorithm
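To make the two refinements above concrete, the sketch below shows a dueling head that combines $V(s)$ and $A(s,a)$, and a Double-DQN target that selects $a'$ with the online network and evaluates it with the target network. Class and function names and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    """Dueling architecture: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)
        self.advantage_head = nn.Linear(hidden, n_actions)

    def forward(self, state):
        h = self.feature(state)
        value = self.value_head(h)
        advantage = self.advantage_head(h)
        return value + advantage - advantage.mean(dim=1, keepdim=True)


def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: select a' with the online network, evaluate it with the target one."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1 - dones) * next_q
```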
6. Actor–Critic Methods
Actor–critic methods combine a policy-based “actor” with a value-based “critic”. The actor represents the policy $\pi_\theta(a \mid s)$, while the critic estimates a value function, such as the state–value function $V_\phi(s)$ or the action–value function $Q_\phi(s,a)$. The critic helps reduce the variance of the policy gradient by replacing the raw return $G_t$ with an advantage estimate:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t).$$

The policy gradient in actor–critic methods is typically expressed as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \right].$$
Four main categories of actor–critic methods have been applied in mobile robot path planning applications: (1) the deep deterministic policy gradient (DDPG) method and its variants, (2) the asynchronous advantage actor–critic (A3C) method and its variants, (3) the proximal policy optimization (PPO) method and its variants, and (4) the soft actor–critic (SAC) method and its variants. Each of these categories is presented in a separate section as follows.
6.1. Deep Deterministic Policy Gradient
The DDPG method is an actor–critic reinforcement learning algorithm designed for continuous action spaces. It extends the Deterministic Policy Gradient (DPG) method [26] by incorporating deep neural networks to parameterize both the policy and value functions. DDPG has been widely used in robotics, including mobile robot path planning, where the control commands often lie in a continuous domain, such as linear and angular velocities. It uses an actor–critic framework in which a deterministic actor $\mu_\theta(s)$ maps states to actions and a critic $Q_\phi(s,a)$ estimates their value. The policy is updated by maximizing the estimated Q-value:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\!\left[ \nabla_a Q_\phi(s,a)\big|_{a=\mu_\theta(s)}\, \nabla_\theta \mu_\theta(s) \right],$$

where $\mathcal{D}$ is a replay buffer used to store past transitions $(s, a, r, s')$ for sampling during training. The critic is updated by minimizing the mean squared error of the Bellman residual:

$$L(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[ \big( y - Q_\phi(s,a) \big)^2 \right],$$

where the target value $y$ is computed using a target actor $\mu_{\theta^-}$ and a target critic $Q_{\phi^-}$:

$$y = r + \gamma\, Q_{\phi^-}\!\big(s', \mu_{\theta^-}(s')\big).$$
DDPG uses target networks to stabilize training. The target actor $\mu_{\theta^-}$ and target critic $Q_{\phi^-}$ are updated slowly using a soft update mechanism:

$$\theta^- \leftarrow \tau \theta + (1-\tau)\,\theta^-, \qquad \phi^- \leftarrow \tau \phi + (1-\tau)\,\phi^-,$$

where $\tau \ll 1$ controls the update rate.

To encourage exploration in continuous action spaces, DDPG adds noise to the deterministic policy during training. A common choice is Ornstein–Uhlenbeck (OU) noise, which models temporally correlated random processes:

$$a_t = \mu_\theta(s_t) + \mathcal{N}_t, \qquad d\mathcal{N}_t = -\theta_{OU}\,\mathcal{N}_t\,dt + \sigma\, dW_t.$$
Algorithm 3 shows the step-by-step pseudocode for the DDPG algorithm.
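The two DDPG-specific ingredients above, soft target updates and OU exploration noise, are sketched below under the assumption that the networks are PyTorch modules; the class and function names and the default parameters are illustrative.

```python
import numpy as np


class OUNoise:
    """Ornstein–Uhlenbeck process for temporally correlated exploration noise."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        dx = -self.theta * self.x * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x


def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.copy_(tau * o_param.data + (1.0 - tau) * t_param.data)
```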
In mobile robot path planning, DDPG has been applied to learn navigation strategies in continuous control domains. The actor network generates control commands, such as the linear velocity $v$ and angular velocity $\omega$, while the critic evaluates the navigation performance. The state $s$ may include the following:
The robot’s position and orientation.
Sensor inputs, such as LIDAR or depth images, providing information about obstacles.
The relative position of the goal.
The reward function is typically similar to (21), and is designed to guide the robot toward the goal while avoiding collisions.
6.1.1. Assisted DDPG (AsDDPG)
In the work by Xie et al. [27], Assisted DDPG (AsDDPG) extends the standard DDPG algorithm by incorporating an external controller to accelerate training and stabilize learning. The external controller acts as a supplementary policy during the early stages of training and is progressively replaced by the learned policy as training converges. The integration of the external controller introduces an adaptive switching mechanism, enabling the algorithm to dynamically choose between the controller and the actor network.
Algorithm 3: Deep Deterministic Policy Gradient (DDPG) Algorithm
The core innovation in AsDDPG is the inclusion of a Critic-DQN network that evaluates both the learned policy and the external controller. At each time step $t$, the Critic-DQN decides which policy to use based on the Q-value estimates:

$$a^{sw}_t = \arg\max_{a^{sw}} Q\big(s_t, a^{sw}\big),$$

where $a^{sw}_t$ is the switching action and $Q(s_t, a^{sw})$ represents the Q-value of the state–action pair under the selected policy.
The critic network in AsDDPG consists of two branches: the Critic branch and the DQN branch. The Critic branch evaluates the Q-value of the actor’s actions and computes the policy gradient, while the DQN branch estimates the Q-values of both the external controller and the actor network. The advantage function is computed to improve stability:

$$A(s, a) = Q(s, a) - V(s),$$

where $V(s)$ is the state-value function learned using a dueling architecture.
The loss function for the Critic-DQN combines the mean squared TD errors of the Critic and DQN branches, whose TD targets are computed by bootstrapping from the corresponding target networks. Here, $\gamma$ is the discount factor, $r$ is the reward, and the target network provides the bootstrapped value estimates. The actor network is updated using the deterministic policy gradient derived from the Critic branch.
The primary advantage of AsDDPG lies in its ability to leverage the external controller to guide exploration during the early stages of training, reducing the reliance on random exploration. As training progresses, the actor network learns to outperform the external controller, and the reliance on external guidance diminishes (see Figure 3). This adaptive approach ensures faster convergence and more robust learning, particularly in environments with sparse rewards or complex dynamics.
6.1.2. Dynamic Path Planning Using an Improved DDPG Algorithm
The study by Li et al. [28] builds on the DDPG framework by incorporating several enhancements to address its limitations, such as a low success rate and slow convergence in dynamic environments. These enhancements include replacing the optimizer, adding prioritized experience replay, introducing transfer learning, and using a curiosity module.

To improve convergence speed and accuracy, the RAdam optimizer is integrated into the DDPG algorithm. RAdam rectifies the variance of the adaptive learning rate using exponential moving averages of the gradient and its square as the parameters of the optimization process. This optimizer stabilizes training and enhances convergence.
Prioritized experience replay is employed to prioritize sampling experiences with higher temporal-difference (TD) error. The TD error is defined as follows:

$$\delta_t = r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t),$$

where $Q$ is the action–value function, $s_t$ is the current state, $a_t$ is the current action, and $\gamma$ is the discount factor. This prioritization accelerates learning by focusing on significant experiences.
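A minimal sketch of proportional prioritized replay is given below: transitions are sampled with probability proportional to $(|\delta| + \epsilon)^\alpha$. The class name, capacity, and exponent are illustrative defaults, not the settings used in [28].

```python
import numpy as np


class ProportionalReplay:
    """Prioritized replay (proportional variant): P(i) ~ (|TD error_i| + eps)^alpha."""

    def __init__(self, capacity=50_000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def push(self, transition, td_error):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # refresh priorities with the latest TD errors after each learning step
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```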
To improve exploration in complex environments, a curiosity module is introduced. This module provides intrinsic rewards based on the prediction error of a forward dynamics model:

$$\hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t; \theta_F\big),$$

where $\phi(s_t)$ is the encoded feature vector of the current state, $a_t$ is the action, and $\theta_F$ represents the model parameters. The intrinsic reward is calculated as follows:

$$r^{i}_t = \frac{\eta}{2}\,\big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|^2,$$

where $\eta$ is a scaling factor.

The total reward combines intrinsic and extrinsic rewards:

$$r_t = r^{e}_t + r^{i}_t,$$

where $r^{e}_t$ is the external reward from the environment.
Transfer learning is applied to initialize the policy and value networks with pre-trained weights, reducing the required training time. The combined effect of these enhancements significantly improves the convergence speed and success rate of the DDPG algorithm in dynamic environments.
6.1.3. Virtual-to-Real Deep Reinforcement Learning for Mapless Navigation
Tai et al. [29] build on the DDPG framework by introducing asynchronous updates and leveraging low-dimensional sparse range findings for efficient mapless navigation of nonholonomic mobile robots.

To improve sampling efficiency, the sample collection process is separated into an independent thread, creating an asynchronous version of DDPG (ADDPG). This asynchronous mechanism ensures that the agent can collect more samples per iteration compared to traditional DDPG, enhancing convergence. The effectiveness of ADDPG is illustrated by the faster increase in the Q-value during training:

$$Q(s_t, a_t) = r_t + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1})\big),$$

where $r_t$ is the immediate reward, $s_t$ is the state, $a_t$ is the action, and $\gamma$ is the discount factor.
The input to the ADDPG model consists of the laser range findings, the target position relative to the robot, and the robot’s previous velocities, where the laser range findings are abstracted into a 10-dimensional sparse representation. These inputs are normalized to the range $(0, 1)$.
The output of the actor network consists of continuous control commands, including the linear velocity $v$ and angular velocity $\omega$. The actor network is updated using the deterministic policy gradient, as in standard DDPG. The critic network uses a three-layer fully connected structure to predict the Q-value. The Q-value is computed using a linear activation function at the output layer:

$$Q(s, a) = w^{\top} h + b,$$

where $w$ and $b$ are the weights and bias of the output layer, respectively, and $h$ is the last hidden-layer activation.
The reward function is defined as follows:

$$r(s_t, a_t) = \begin{cases} r_{arrive} & \text{if the target is reached,} \\ r_{collision} & \text{if a collision occurs,} \\ c_r\,(d_{t-1} - d_t) & \text{otherwise,} \end{cases}$$

where $d_t$ is the distance to the target at time $t$, and $r_{arrive}$, $r_{collision}$, and $c_r$ are hyperparameters.
The overall method, visualized in Figure 4, enables end-to-end training of a mapless motion planner capable of transferring knowledge from virtual to real environments without additional fine-tuning, ensuring robust navigation in unseen scenarios.
6.1.4. Goal-Oriented Obstacle Avoidance in Continuous Action Space
The proposed method in [
30] extends the DDPG framework by integrating depth-wise separable convolutional layers to handle depth image inputs effectively. This enhancement allows for efficient processing of large-scale sequential depth images, combined with goal-oriented navigation capabilities.
The actor network takes as input a stack of sequential depth images along with positional information in polar coordinates. The stack of depth images from $t-n$ to $t$ provides a short-term memory, enabling the robot to detect peripheral obstacles that may not be visible in the current input. The depth-wise separable convolutional operation is defined as follows:

$$y_j = \sigma\big( K_j * x_j \big),$$

where $K_j$ represents the filter applied to channel $j$, $x_j$ is the input feature map, and $\sigma$ is the activation function.
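The sketch below shows a generic depth-wise separable convolution block in PyTorch (a per-channel convolution followed by a 1×1 point-wise convolution); the layer sizes and the class name are illustrative and not taken from [30].

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: per-channel (depthwise) convolution
    followed by a 1x1 (pointwise) convolution that mixes channels."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.pointwise(self.depthwise(x)))
```

Compared with a standard convolution, this factorization greatly reduces the number of parameters, which is the motivation for using it on large stacks of depth images.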
The polar coordinates of the robot’s position relative to the goal are concatenated with the processed depth image features. The output of the actor network consists of continuous control commands:

$$(v, \omega) = \pi_\theta(s),$$

where $\omega$ is the angular velocity, $v$ is the linear velocity, and $\pi_\theta$ represents the actor policy parameterized by $\theta$.

The critic network evaluates the Q-value using the current state and action:

$$Q(s_t, a_t) = r_t + \gamma\, Q(s_{t+1}, a_{t+1}),$$

where $r_t$ is the immediate reward and $\gamma$ is the discount factor.
The reward function is designed to encourage linear motion and penalize unnecessary angular adjustments; it combines a term based on the distance to the goal with coefficients weighting the linear and angular velocities.
The target networks for the actor and critic are updated using a soft update mechanism:

$$\theta^- \leftarrow \tau\theta + (1-\tau)\,\theta^-, \qquad \phi^- \leftarrow \tau\phi + (1-\tau)\,\phi^-,$$

where $\tau$ is a small constant controlling the update rate.
This method enables end-to-end learning of a goal-oriented obstacle avoidance policy, trained entirely in simulation and transferable to real-world scenarios without additional fine-tuning. Algorithm 4 shows a step-by-step pseudocode of the method.
Algorithm 4: Pseudocode of the goal-oriented obstacle avoidance algorithm in continuous action space [30].
6.1.5. DDPG for Path Planning in Cluttered Indoor Spaces
In the study by Gao et al. [31], incremental training, parameter transfer, and a hybrid PRM+TD3 planner are introduced to improve development efficiency, convergence, and generalization in indoor mobile robot path planning. The training process employs an incremental strategy starting in a lightweight 2D simulation environment and gradually transitioning to a complex 3D environment. The training in the 2D environment uses simpler simulations to debug parameters and optimize the reward function before transferring the network to the 3D environment; the weights learned in the 2D environment serve as the initial weights of the networks trained in the 3D environment.
The hybrid planner PRM+TD3 combines the Probabilistic Roadmap (PRM) algorithm for global path planning and TD3 for local control. PRM generates intermediate waypoints, which decompose the global path into local sub-paths for TD3 to navigate.
TD3 enhances DDPG by addressing overestimation bias using two Q-networks and delayed updates. The Q-value target is as follows:

$$y = r + \gamma\,(1-d)\, \min_{i=1,2} Q_{\phi_i^-}\!\big(s',\, \mu_{\theta^-}(s')\big),$$

where $d$ is the termination flag, $r$ is the reward, and $\mu_{\theta^-}$ is the target policy.
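The clipped double-Q target of TD3 (including the optional target-policy smoothing noise) can be sketched as follows; network handles, noise parameters, and the action bound are assumptions for illustration.

```python
import torch


def td3_target(critic1_t, critic2_t, actor_t, rewards, next_states, dones,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """TD3 target: minimum of two target critics, with clipped Gaussian noise
    added to the target action (target policy smoothing)."""
    with torch.no_grad():
        next_action = actor_t(next_states)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)
        target_q = torch.min(critic1_t(next_states, next_action),
                             critic2_t(next_states, next_action)).squeeze(-1)
    return rewards + gamma * (1 - dones) * target_q
```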
The critic networks are updated by minimizing the mean squared error against this target, and the actor network is updated using the deterministic policy gradient. Target networks for both the actor and the critics are updated softly, where $\tau$ controls the rate of update.
The reward function is designed to incentivize collision-free navigation and goal-reaching: a positive reward is granted when the distance to the target falls below the goal proximity threshold, and a scaled shaping term otherwise rewards progress toward the target.
This approach enhances convergence, reduces training time, and improves generalization to complex environments. See Algorithm 5 for a step-by-step pseudocode of the proposed method.
Algorithm 5: PRM+TD3 with Incremental Training [31].
6.1.6. Learning World Transition Model for Socially Aware Robot Navigation
Cui et al. [32] present an extension of the DDPG framework by integrating a model-based reinforcement learning approach with a deep world transition model, which predicts future states and rewards. This improves sample efficiency and allows socially compliant robot navigation.

The world transition model predicts the evolution of the robot’s environment, including future obstacle maps and rewards. Let the current state $s_t$ be represented as a sequence of obstacle maps, the relative goal position, and the current velocity. The predicted next state $\hat{s}_{t+1}$ is given by the following:

$$\hat{s}_{t+1} = f(s_t, a_t; \theta_f),$$

where $f$ is the world transition model parameterized by $\theta_f$.
The obstacle map representation disentangles static and dynamic obstacles using an ego-motion transformation. Dynamic features are computed from the sequence of ego-motion-aligned obstacle maps $m_t$, where $m_t$ is the obstacle map at time $t$. Static features are extracted by encoding the last obstacle map.
The reward function incorporates goal-reaching, collision avoidance, and social compliance: a goal-reaching term encourages progress toward the target, a collision term penalizes contact with obstacles, and social compliance is achieved by penalizing proximity to dynamic obstacles.
The policy network uses stacked obstacle maps processed with a 3D-CNN to output control commands $(v, \omega) = \pi_\theta(s_t)$, where $\pi_\theta$ is the actor policy parameterized by $\theta$. The critic networks estimate the Q-values using the TD3 framework, and the critic loss is minimized against the clipped double-Q target

$$y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}\!\big(s', \mu_{\theta^-}(s')\big).$$
The actor network is updated using the deterministic policy gradient, and the target networks are updated softly. This method enables efficient policy training using a combination of real and virtual data, leading to socially compliant navigation in dynamic environments. The block diagram shown in Figure 5 illustrates the process flow of the method.
6.2. Asynchronous Advantage Actor–Critic (A3C) Method
The Asynchronous Advantage Actor–Critic (A3C) method, introduced by Mnih et al. [33], maintains a policy $\pi_\theta(a_t \mid s_t)$ and an estimate of the value function $V_{\theta_v}(s_t)$. The algorithm operates in the forward view and uses $n$-step returns to update both the policy and the value function.

The advantage function $A(s_t, a_t)$ is computed as follows:

$$A(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V_{\theta_v}(s_{t+k}) - V_{\theta_v}(s_t),$$

where $k$ is bounded by $t_{max}$, the maximum number of steps. The policy is updated using the gradient:

$$\nabla_{\theta} \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t),$$

accumulated over the range of steps in each rollout segment. The value function is updated by minimizing the loss:

$$L(\theta_v) = \big( R_t - V_{\theta_v}(s_t) \big)^2,$$

where $R_t$ is the $n$-step return:

$$R_t = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} V_{\theta_v}(s_{t+n}).$$
Entropy regularization is added to the policy loss to encourage exploration. The entropy of the policy $\pi_\theta$ is given by the following:

$$H\big(\pi_\theta(\cdot \mid s_t)\big) = -\sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t),$$

and the entropy term is added to the policy gradient with a weight $\beta$:

$$\nabla_{\theta}\Big[ \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) + \beta\, H\big(\pi_\theta(\cdot \mid s_t)\big) \Big].$$
The updates for the policy and value networks are performed asynchronously across multiple threads. Each thread maintains a local copy of the parameters $\theta$ and $\theta_v$, which are periodically synchronized with the global parameters. The parameters are updated using the following:

$$\theta \leftarrow \theta + \alpha\, d\theta, \qquad \theta_v \leftarrow \theta_v + \alpha\, d\theta_v,$$

where $\alpha$ is the learning rate.

This asynchronous framework stabilizes training by allowing parallel actor-learners to explore different parts of the state space, reducing the correlation between updates and improving training efficiency. Algorithm 6 presents the step-by-step pseudocode of the A3C method.
Algorithm 6: Asynchronous Advantage Actor–Critic (A3C) Algorithm
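The $n$-step returns and advantages used by A3C can be computed with the short NumPy sketch below; the function name and the assumption that `values` are the critic's estimates along the rollout are illustrative.

```python
import numpy as np


def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step returns R_t = r_t + gamma*r_{t+1} + ... + gamma^k * V(s_{t+k}),
    bootstrapped from the critic's value of the last state in the rollout."""
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Example usage: advantages = n_step_returns(rewards, v_last, gamma) - values
```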
6.2.1. A3C Algorithm for Learning to Navigate in Complex Environments
Mirowski et al. [34] have extended A3C by incorporating auxiliary tasks to improve navigation in complex environments. The architecture augments the standard A3C framework with two auxiliary outputs: depth prediction and loop closure detection. These tasks enhance data efficiency and representation learning by providing additional gradient updates during training.

The first auxiliary task, depth prediction, estimates a depth map from RGB inputs. The loss function for depth prediction is implemented as a classification task, dividing depth values into discrete bins:

$$L_{depth} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{j=1}^{C} d_{ij} \log \hat{d}_{ij},$$

where $d_{ij}$ represents the true probability of the $j$-th depth bin for the $i$-th pixel, $\hat{d}_{ij}$ is the predicted probability, $N$ is the number of pixels, and $C$ is the number of depth bins.
The second auxiliary task, loop closure detection, predicts whether the agent is revisiting a previously visited location. The training targets are generated based on position similarity between the current position and earlier positions in the trajectory. The binary classification loss is given by the following:

$$L_{loop} = -\frac{1}{T}\sum_{t=1}^{T} \big[ l_t \log \hat{l}_t + (1 - l_t) \log (1 - \hat{l}_t) \big],$$

where $l_t$ is the ground-truth loop closure label for time step $t$, $\hat{l}_t$ is the predicted probability, and $T$ is the total number of steps.
The overall loss function combines the A3C objective with these auxiliary losses. The total loss is as follows:

$$L = L_{A3C} + \beta_d L_{depth} + \beta_l L_{loop},$$

where $\beta_d$ and $\beta_l$ are coefficients balancing the contributions of the auxiliary tasks.
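A minimal sketch of a shared encoder with policy/value outputs and two auxiliary heads is shown below; the class name, layer sizes, number of depth bins, and pixel count are all illustrative placeholders rather than the architecture of [34].

```python
import torch
import torch.nn as nn


class NavNetWithAuxHeads(nn.Module):
    """Shared encoder feeding policy/value outputs plus auxiliary depth and
    loop-closure heads; all sizes here are illustrative placeholders."""

    def __init__(self, obs_dim=512, feat_dim=256, n_actions=4,
                 n_pixels=64, n_depth_bins=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        self.policy_head = nn.Linear(feat_dim, n_actions)               # pi(a|s)
        self.value_head = nn.Linear(feat_dim, 1)                        # V(s)
        self.depth_head = nn.Linear(feat_dim, n_pixels * n_depth_bins)  # depth bin logits
        self.loop_head = nn.Linear(feat_dim, 1)                         # loop-closure logit

    def forward(self, obs):
        h = self.encoder(obs)
        return (torch.softmax(self.policy_head(h), dim=-1),
                self.value_head(h),
                self.depth_head(h),
                torch.sigmoid(self.loop_head(h)))
```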
To integrate auxiliary tasks with the A3C framework, the architecture uses a shared convolutional encoder to process RGB inputs. The shared features are passed to both the policy/value networks and the auxiliary task heads. Specifically, the depth prediction head predicts depth values from the convolutional features, while the loop closure head predicts loop closure probabilities using the hidden states of the LSTM layer. The policy and value networks follow the A3C architecture and are updated with standard A3C gradients.
The depth prediction output is computed from the convolutional features $f_{conv}$ as follows:

$$\hat{d} = \mathrm{softmax}\big( W_d f_{conv} + b_d \big),$$

where $W_d$ and $b_d$ are the weights and biases of the depth prediction layer.

The loop closure output is computed from the LSTM hidden states $h_t$ as follows:

$$\hat{l}_t = \sigma\big( W_l h_t + b_l \big),$$

where $W_l$ and $b_l$ are the weights and biases of the loop closure layer.
These auxiliary tasks provide additional supervision signals, accelerating learning and improving the agent’s ability to represent spatial and geometric features critical for navigation. The block diagram shown in Figure 6 presents how the proposed architecture extends A3C by integrating auxiliary tasks for depth prediction and loop closure detection. The shared encoder processes RGB inputs to extract features $f_{conv}$, which are passed to the policy and value networks as well as the auxiliary task-specific heads. Depth prediction and loop closure detection contribute to the total loss $L$, accelerating training and improving navigation performance.
6.2.2. A3C Algorithm for Robot Navigation in Unknown Rough Terrain
Zhang et al. [35] have extended A3C to address navigation in unknown rough terrain environments by incorporating an elevation map, depth images, and the robot’s 3D orientation as inputs into the policy and value networks. The architecture consists of three separate branches for processing these inputs, which are then merged to compute navigation actions.

The depth image branch processes downsampled depth images through a series of convolutional layers, reducing the high-dimensional sensory input into feature embeddings $f_{depth}$. The elevation map branch processes a single-channel 84 × 84 grid that encodes terrain elevation, using convolutional layers to extract features $f_{elev}$. The robot’s 3D orientation is passed through a fully connected layer, producing feature embeddings $f_{orient}$ that match the spatial dimensions of $f_{elev}$. These two branches are merged using element-wise addition:

$$f_{merged} = f_{elev} + f_{orient}.$$
The outputs from the depth and merged branches are concatenated to form a unified feature representation:

$$f = \big[ f_{depth},\, f_{merged} \big].$$

This representation is passed through an LSTM layer to capture temporal dependencies in the robot’s navigation states. The output from the LSTM layer is used to compute the policy $\pi(a_t \mid s_t)$ and the value $V(s_t)$. The policy network computes the probability distribution over navigation actions $a_t$, including moving forward, backward, or turning left or right. The policy is updated using the A3C loss function:

$$L_{\pi} = -\log \pi(a_t \mid s_t)\, A(s_t, a_t) - \beta\, H\big(\pi(\cdot \mid s_t)\big),$$

where $A(s_t, a_t)$ is the advantage function, computed as follows:

$$A(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V(s_{t+k}) - V(s_t),$$

and $H(\pi)$ is the entropy of the policy, encouraging exploration, with $\beta$ as a scaling factor.
The value network minimizes the squared advantage:

$$L_{V} = \big( R_t - V(s_t) \big)^2.$$

The reward function is tailored to the rough terrain navigation task. It assigns positive rewards for reaching the goal within a certain tolerance and penalizes undesirable states such as collisions or exceeding the maximum number of steps per episode. Here, the distance to the goal at time $t$ and the closest distance observed so far shape the intermediate reward, the goal tolerance defines successful arrival, and undesirable terminal states such as flipping or getting stuck incur penalties.
The network is trained in a simulated environment with high-dimensional sensory inputs. Nine agents operate in parallel, each updating the global policy and value functions asynchronously to accelerate convergence. The block diagram shown in Figure 7 visualizes how the architecture extends A3C for navigation in unknown rough terrain by incorporating multiple input branches for depth images, elevation maps, and 3D orientation. These inputs are processed through separate branches to extract features ($f_{depth}$, $f_{elev}$, $f_{orient}$), which are merged and concatenated to form a unified representation ($f$). Temporal dependencies are captured using an LSTM layer, and the outputs are used by the policy and value networks to compute navigation actions.
6.2.3. A3C Algorithm for Teaching a Machine to Read Maps
The study by Brunner et al. [36] proposes an architecture that decomposes the navigation task into intermediate subtasks, each handled by a specialized module. These modules are trained either independently or in conjunction with the main A3C policy. The subtasks include visible local map estimation, recurrent localization, map interpretation, and reactive action execution.

The visible local map network processes the 3D RGB visual input and a discretized 3-hot encoded orientation angle. The map excerpt is gated by element-wise multiplication with the visible field estimation, and the visible local map output is used to construct an egocentric local map.
The recurrent localization cell integrates the visible local map excerpts into an egocentric local map. The egomotion is estimated by a two-layer feedforward network applied to the 2D convolution of the previous local map estimation with the current visible local map input. The estimated local map is then updated and clipped to a fixed range, and the updated map is combined with a feedback map to enhance localization. The location probability distribution is calculated by correlating the estimated local map with the global map $M$, normalized over the $N$ discrete location cells.
The map interpretation network processes the global map to identify rewarding locations and outputs a short-term target direction (STTD). The STTD at each location is derived using a shortest path algorithm, producing a direction distribution.
The reactive agent combines the STTD distribution, the normalized entropy of the location probability, and the estimated target distance to select actions. It maximizes the total reward, which combines extrinsic and intrinsic rewards; the intrinsic reward is based on the estimated egomotion vector and the STTD vector. The agent’s policy is updated using A3C’s policy gradient with the advantage function and the entropy of the policy, as in standard A3C.
The proposed architecture improves navigation by integrating modular components, leveraging A3C’s stability, and introducing additional supervision signals through intrinsic rewards and auxiliary tasks. Algorithm 7 presents the step-by-step pseudocode for implementing the proposed architecture.
Algorithm 7: Pseudocode for the navigation algorithm proposed in [36].
6.2.4. A3C Algorithm for Target-Driven Visual Navigation in Indoor Scenes
Zhu et al. [37] extended A3C by introducing a target-driven visual navigation framework based on a deep Siamese actor–critic network. The key innovation is the use of a dual input structure, where both the current state $s_t$ and the target representation $g$ are processed to jointly estimate the policy $\pi(a \mid s_t, g)$ and value $V(s_t, g)$. This allows generalization to unseen targets without re-training.

The network takes as input two images: $s_t$, representing the agent’s current observation, and $g$, representing the target. Both inputs are processed through Siamese layers that share weights, producing embeddings $\phi(s_t)$ and $\phi(g)$ in a shared latent space. The embeddings are concatenated to form a joint representation:

$$z = \big[\phi(s_t),\, \phi(g)\big].$$
This joint representation is passed through scene-specific layers to capture environment-specific features. The scene-specific representation is used to compute the policy $\pi(a \mid s_t, g)$ and the value $V(s_t, g)$. The policy network predicts the probability distribution over actions, and the value network estimates the expected future reward. The policy is updated using the A3C policy gradient:

$$\nabla_\theta \log \pi(a_t \mid s_t, g)\, A(s_t, g) + \beta\, \nabla_\theta H\big(\pi(\cdot \mid s_t, g)\big),$$

where $A(s_t, g) = R_t - V(s_t, g)$ is the advantage function and $H(\pi)$ is the entropy of the policy to encourage exploration. The value network is updated by minimizing the squared error $\big(R_t - V(s_t, g)\big)^2$.
The network leverages asynchronous training, where multiple threads independently interact with the environment. Each thread collects gradients and updates the shared model parameters. Generic Siamese layers are updated across all targets and scenes, while scene-specific layers are fine-tuned for each environment, enabling both inter-target and inter-scene generalization.
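The weight-sharing Siamese embedding and the joint policy/value heads can be sketched as follows; the class name, layer sizes, and the collapsing of the scene-specific layers into a single MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SiameseActorCritic(nn.Module):
    """Shared (Siamese) embedding of the current observation and the target,
    concatenated and mapped to a policy and a value; sizes are illustrative."""

    def __init__(self, obs_dim=2048, embed_dim=512, n_actions=4):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim, embed_dim), nn.ReLU())
        self.scene_layers = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.ReLU())
        self.policy_head = nn.Linear(embed_dim, n_actions)
        self.value_head = nn.Linear(embed_dim, 1)

    def forward(self, obs, target):
        z_obs = self.embed(obs)      # shared weights for both inputs
        z_goal = self.embed(target)
        joint = self.scene_layers(torch.cat([z_obs, z_goal], dim=-1))
        return torch.softmax(self.policy_head(joint), dim=-1), self.value_head(joint)
```

In the original framework, one set of scene-specific layers is kept per environment while the Siamese encoder is shared, which is what enables inter-target and inter-scene generalization.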
The reward function for navigation includes a goal-reaching reward and a small time penalty applied at each step. This framework improves upon A3C by enabling generalization to unseen targets through the Siamese structure and shared parameter updates, while maintaining data efficiency through asynchronous training.
Figure 8 shows a block diagram of the proposed target-driven navigation method based on A3C. The Siamese network processes the current state $s_t$ and target $g$ to generate embeddings $\phi(s_t)$ and $\phi(g)$, which are concatenated to form a joint representation. Scene-specific layers extract environment-specific features used by the policy and value networks. The policy network outputs navigation actions, while the value network evaluates future rewards. Both networks are updated asynchronously using the A3C framework.
Figure 8. Block diagram of the deep reinforcement learning framework proposed in [37].
6.2.5. A3C Algorithm for Autonomous Navigation in Indoor Environments
The study by Surmann et al. [38] builds on A3C by using the GA3C framework, which improves efficiency and scalability in parallel training. Instead of maintaining separate network copies for each agent, GA3C centralizes predictions and updates using a global deep neural network (DNN) model. This eliminates the need for synchronization and allows for efficient GPU utilization. The algorithm employs two global queues: a prediction queue for obtaining observations and a learning queue for training updates.

The inputs to the network are the last four laser scans, each with 1081 values, and a one-hot encoded orientation vector representing the direction of the goal. The laser scan values are normalized, and the goal direction is one-hot encoded over discrete orientation bins.
The network begins with two 1D convolutional layers. The outputs are flattened and passed to a dense layer with 256 neurons to generate the shared latent representation. The policy network computes the probability distribution over actions $\pi(a_t \mid s_t)$, where the actions are discrete commands corresponding to linear and angular velocities, such as forward motion and turning. The value network estimates the state value $V(s_t)$.
The policy and value networks are updated using the A3C loss. The policy gradient is computed as follows:

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t),$$

where $A(s_t, a_t) = R_t - V(s_t)$ is the advantage function. The value loss $\big(R_t - V(s_t)\big)^2$ is minimized to improve the value network.
The reward function integrates goal-reaching rewards, collision penalties, and intermediate rewards: a positive reward is granted when the goal is reached, a negative reward is given on collision, and intermediate rewards encourage progress toward the goal. The GA3C framework allows for high-speed parallel training with up to 32 simulation instances running concurrently, ensuring robust and efficient learning. A step-by-step pseudocode of the algorithm is presented in Algorithm 8.
Algorithm 8: GA3C-based Robot Navigation Framework [38]
6.2.6. A3C-Based End-to-End Navigation Strategy
The study by Shi et al. [
39] introduced an Intrinsic Curiosity Module (ICM) to address the challenges of sparse rewards in navigation tasks. The ICM generates intrinsic rewards by leveraging the prediction error of a forward model, thereby encouraging exploration in complex environments. Additionally, the network is enhanced with two LSTM layers to capture temporal dependencies, enabling the agent to better utilize previous observations, actions, and rewards.
The policy $\pi_\theta(a_t \mid s_t)$ and value $V_{\theta_v}(s_t)$ are represented by neural networks with parameters $\theta$ and $\theta_v$, respectively. The A3C policy gradient and value loss functions are extended to incorporate intrinsic rewards. The policy loss is as follows:

$$L_{\pi} = -\log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t),$$

where the total reward $r_t$ used to compute the returns combines extrinsic and intrinsic rewards:

$$r_t = r^{e}_t + r^{i}_t.$$

The value loss minimizes the discrepancy between the predicted and actual returns:

$$L_{V} = \big( R_t - V_{\theta_v}(s_t) \big)^2.$$
To calculate intrinsic rewards, the ICM includes an inverse model and a forward model. The inverse model predicts the action $\hat{a}_t$ given consecutive states $s_t$ and $s_{t+1}$:

$$\hat{a}_t = g\big(\phi(s_t), \phi(s_{t+1}); \theta_I\big),$$

where $\theta_I$ are the parameters of the inverse model. The loss for the inverse model is the cross-entropy between the true action probability and the predicted action probability. The forward model predicts the next state feature $\hat{\phi}(s_{t+1})$ based on the current state feature $\phi(s_t)$ and action $a_t$:

$$\hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t; \theta_F\big),$$

where $\theta_F$ are the parameters of the forward model. The forward model loss is as follows:

$$L_F = \tfrac{1}{2}\,\big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|^2.$$
The intrinsic reward $r^{i}_t$ is derived from the forward model’s prediction error:

$$r^{i}_t = \frac{\eta}{2}\,\big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|^2,$$

where $\eta$ scales the intrinsic reward. The total loss function combines the policy, value, inverse, and forward losses:

$$L = \lambda\, L_{\pi} + L_{V} + (1-\beta)\, L_{I} + \beta\, L_{F},$$

where $\lambda$ adjusts the weight of the policy gradient, and $\beta$ balances the forward and inverse model losses.
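The curiosity bonus itself reduces to a feature-space prediction error, as in the minimal sketch below; `feature_net` and `forward_model` are assumed to be PyTorch modules, the action is assumed to be a (one-hot or continuous) tensor, and the scaling constant is a placeholder.

```python
import torch
import torch.nn.functional as F


def icm_intrinsic_reward(feature_net, forward_model, state, next_state, action, eta=0.01):
    """Curiosity bonus: scaled prediction error of a forward model in feature space."""
    with torch.no_grad():
        phi_t = feature_net(state)
        phi_t1 = feature_net(next_state)
        phi_t1_pred = forward_model(torch.cat([phi_t, action], dim=-1))
        error = F.mse_loss(phi_t1_pred, phi_t1, reduction="none").sum(dim=-1)
    return eta * 0.5 * error
```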
This method significantly enhances A3C by addressing sparse rewards and enabling robust navigation in complex environments. The use of intrinsic rewards, alongside the temporal memory provided by LSTMs, allows the agent to explore effectively and generalize better from simulation to real-world scenarios. The block diagram of Figure 9 illustrates the integration of the Intrinsic Curiosity Module (ICM) with A3C. The input states $s_t$ and $s_{t+1}$ are processed to extract features $\phi(s_t)$ and $\phi(s_{t+1})$, which feed into the inverse and forward models. The forward model generates intrinsic rewards $r^{i}_t$, while the policy and value networks compute the action policy $\pi_\theta$ and the value estimate $V_{\theta_v}$. The output is the navigation action $a_t$, optimized to balance exploration and exploitation in sparse-reward environments.
6.2.7. A3C Algorithm for Autonomous Navigation with Landmark Generators
Wang et al. [
40] proposed a method that enhances the A3C algorithm by introducing a hierarchical navigation framework that integrates global and local planners connected by a dynamic sub-goal generator. The global planner uses traditional path planning algorithms like A* or Dijkstra to create a global path, while the local planner employs A3C to navigate to dynamically generated sub-goals, optimizing navigation efficiency in dynamic environments.
The local planner is modeled as a partially observable Markov decision process (POMDP) defined as a 5-tuple $(T, S, O, A, R)$, where $T$ is the current time, $S$ represents the environment states, $O$ denotes the observed information, $A$ is the action space, $R$ is the reward function, and $\gamma$ is the reward discount factor. To address the challenge of sparse rewards, a shaped reward mechanism is designed that combines proximity to the sub-goal, collision penalties, and progress terms.
The policy and value networks are enhanced with a shared feature extraction backbone and a memory module to retain long-term dependencies. The actor network is updated using the standard A3C policy gradient weighted by the advantage function, which is computed as the $n$-step return minus the value estimate, and the critic network minimizes the temporal difference (TD) loss.
The integration of the sub-goal generator with A3C further reduces the computational burden. The sub-goal generator dynamically identifies intermediate targets based on two types of frontiers, A and B, defined in the robot’s LiDAR data, and sub-goals are generated according to rules over these frontiers.
The hierarchical design of global and local planners ensures robust navigation in dynamic environments, leveraging the memory-enhanced A3C to handle long-term dependencies and optimize navigation efficiency. The block diagram of the proposed hierarchical navigation method, shown in Figure 10, illustrates how the system integrates a global planner with a sub-goal generator to dynamically identify intermediate targets for the local planner. The feature extraction backbone processes the current state, the reward function computes the reward based on proximity, collision, and progress metrics, and the policy and value networks are optimized using the advantage function, with the final output being the navigation action $a_t$.
6.3. Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO), introduced by Schulman et al. [4], is a widely used reinforcement learning algorithm that strikes a balance between sample efficiency and computational simplicity. PPO builds upon policy gradient methods and introduces a clipping mechanism to ensure stable and monotonic policy updates.

The goal of PPO is to maximize the expected reward while ensuring that policy updates do not deviate excessively from the current policy. The optimization objective is defined as follows:

$$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right],$$

where $r_t(\theta)$ is the probability ratio given by the following:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}.$$

Here, $\hat{A}_t$ represents the advantage function, $\epsilon$ is a hyperparameter that determines the clipping range, and the clip operator restricts the probability ratio to the interval $[1-\epsilon,\, 1+\epsilon]$.
The advantage function $\hat{A}_t$ is computed as follows:

$$\hat{A}_t = Q(s_t, a_t) - V(s_t),$$

where $Q(s_t, a_t)$ is the action–value function and $V(s_t)$ is the state–value function. PPO employs a surrogate objective function to balance exploration and exploitation while avoiding large policy updates, which could destabilize learning.

In addition to the clipping mechanism, the total loss function in PPO combines the policy loss $L^{CLIP}(\theta)$, the value function loss $L^{VF}(\theta)$, and an entropy term $S[\pi_\theta]$ to encourage policy exploration:

$$L(\theta) = \mathbb{E}_t\!\left[ L^{CLIP}(\theta) - c_1\, L^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \right],$$

where $L^{VF}(\theta)$ is the squared error loss for value function approximation:

$$L^{VF}(\theta) = \big( V_\theta(s_t) - V^{target}_t \big)^2,$$

and $S[\pi_\theta]$ is the entropy loss:

$$S[\pi_\theta](s_t) = -\sum_a \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t).$$

The coefficients $c_1$ and $c_2$ control the relative importance of the value loss and entropy loss, respectively. A step-by-step pseudocode of PPO is presented in Algorithm 9.
Algorithm 9: Proximal Policy Optimization (PPO) Algorithm
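The total PPO loss above maps directly onto the short PyTorch sketch below; the function name, tensor layout, and default coefficients are assumptions for illustration rather than tuned values from any reviewed work.

```python
import torch


def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """PPO total loss: clipped surrogate policy term, value-function MSE,
    and an entropy bonus. Coefficients are typical defaults."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```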
PPO has been applied to various domains, including robotic control, game playing, and autonomous navigation. Its robustness and simplicity make it a preferred choice for continuous and discrete action space problems. By preventing large policy changes through the clipping mechanism, PPO ensures stable learning and improved convergence properties. Some of the recent applications of this algorithm in robotic path planning and navigation are reviewed in the following sections.
6.3.1. Navigation Skills Acquisition for Wheel-Legged Robots
The study by Chen et al. [
41] builds upon PPO by introducing hierarchical training strategies with secondary and primary policies, along with domain randomization to improve robustness and generalization. The approach addresses challenges such as reward sparsity, temporal credit assignment, and data inefficiency in high-dimensional navigation tasks.
The hierarchical training begins with secondary policies, each trained for a specific behavior such as moving straight, avoiding obstacles, climbing over obstacles, and squeezing through narrow passages. The secondary policies are trained using PPO with a compound loss function that combines the clipped PPO surrogate objective and the value function loss, with the probability ratio clipped to $[1-\epsilon,\, 1+\epsilon]$ for a clipping parameter $\epsilon$, as defined in Section 6.3.
The reward function for secondary policies encourages the corresponding behaviors. The primary policy is trained using samples from the domain-randomized batch and real interactions in complex environments. Trajectories from secondary policies are replayed in randomized environments to generate action–observation–return tuples, from which the advantage function for primary policy training is estimated; the policy parameters are then updated using stochastic gradient ascent on the compound loss.
A block diagram showing the flow of computations involved in this method is presented in Figure 11.
Domain randomization improves generalization by altering non-essential aspects of the training environments, such as obstacle appearance, number, and configuration. The randomized environments ensure task-relevant sensory features are preserved while diversifying the training data. The final policy maps height-map images and robot poses to motor actions using a neural network with three convolutional layers followed by fully connected layers.
By decomposing navigation into manageable sub-tasks, utilizing domain randomization, and optimizing with PPO, the proposed method achieves robust and efficient navigation in high-dimensional environments.
6.3.2. PPO-Based Path Planning for Cleaning Robot
Moon et al. [
42] propose a PPO method integrated with transfer learning (TL), detection of nearest uncleaned tiles (DNUT), reward shaping (RS), and an elite set (ES) to improve cleaning performance by robots. These additions address challenges such as reward sparsity, inefficient exploration, and data inefficiency in reinforcement learning tasks for cleaning robots.
Transfer learning allows the agent to leverage knowledge from previously trained environments to initialize training in new ones, improving convergence and robustness; the transferred policy initializes the primary policy. The DNUT technique ensures that the agent focuses on uncleaned tiles: the agent detects the nearest uncleaned tile and adjusts the reward accordingly. The reward shaping introduces a stacked value mechanism to encourage consistent positive behavior, where the stacked value grows with the number $k$ of consecutive positive rewards and is added to the total reward. The elite set (ES) ensures the agent avoids learning from poor-performance episodes: if the agent takes more than 500 steps in a single episode, the episode is terminated and its data are excluded from training, so that only high-performance episodes are retained.
The PPO loss function is modified to incorporate these components. The clipped objective is the standard PPO surrogate, with the probability ratio and the advantage function computed from the shaped rewards, and the value network minimizes the temporal difference loss. As presented in Algorithm 10, the agent learns the optimal policy by combining TL for initialization, DNUT for focus, RS for behavior shaping, and ES for data curation. These components significantly enhance the efficiency and robustness of PPO in cleaning environments.
Algorithm 10: The PPO-based Cleaning Method Proposed by [42].
6.3.3. PPO-Based End-to-End Path Planning for Lunar Rovers with Safety Constraints
Yu et al. [43] proposed an end-to-end path planning method based on PPO by integrating a safety-aware reward function, sliding behavior analysis, and curriculum learning to address the challenges of autonomous path planning for lunar rovers. These improvements ensure safety, stability, and adaptability across different terrains. The path planning problem is formulated as a Markov Decision Process (MDP), with the policy $\pi_\theta$ represented by a deep neural network parameterized by $\theta$. The observations consist of depth camera data, LiDAR data, and rover/target state information, and the optimal policy maximizes the expected cumulative reward over these observations. The safety-aware reward function incorporates distance and safety considerations, weighted by a safety factor. The distance reward is based on the change in the rover’s distance to the target between times $t-1$ and $t$ and on the distance to the nearest obstacle.
The safety reward penalizes sliding and unstable behaviors based on the pitch and roll angles and on the sliding rate and sliding angle, which are calculated using constants derived from experimental data. The policy is updated using the PPO objective with a clipped surrogate, where the probability ratio and the advantage function are computed as in standard PPO.
Curriculum learning is applied to improve adaptability by incrementally increasing terrain complexity: training begins on flat terrain and progresses to more complex environments with adjusted safety factors. This progressive training ensures robust and efficient learning of safe and optimal paths.
Figure 12 shows a block diagram of the method for lunar rover path planning. The input includes depth camera data, LiDAR data, and rover/target state information. Feature extraction is performed using a neural network, followed by safety-aware reward calculation and policy updates using PPO. A curriculum learning strategy incrementally increases terrain complexity, resulting in a safe and efficient navigation policy.
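As an illustration only, the snippet below sketches how a distance-progress term and a stability penalty might be combined under a safety factor. The weights, thresholds, and function signature are hypothetical and are not taken from [43].

```python
def safety_aware_reward(d_prev, d_curr, d_obstacle, pitch, roll,
                        slide_rate, safety_factor=0.5):
    """Hypothetical reward mixing progress toward the target with a penalty
    for unstable or sliding behavior, scaled by a safety factor."""
    # Progress term: positive when the rover moves closer to the target.
    r_dist = d_prev - d_curr
    # Mild penalty when the nearest obstacle is closer than 1 m (placeholder).
    if d_obstacle < 1.0:
        r_dist -= (1.0 - d_obstacle)
    # Stability term: penalize large attitude angles and sliding.
    r_safe = -(abs(pitch) + abs(roll)) - slide_rate
    return (1.0 - safety_factor) * r_dist + safety_factor * r_safe
```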
6.3.4. PPO-Based Real-Time Collision Avoidance in Dense Crowds
The study by Liang et al. [
44] proposes a method, CrowdSteer, that builds upon PPO by incorporating multi-sensor fusion, a hybrid reward function, and specialized training scenarios to address challenges in dense and dynamic crowd navigation. The approach uses data from a 2-D LiDAR and a depth camera to implicitly model interactions with obstacles and pedestrians, generating smoother and collision-free trajectories. The navigation task is formulated as a Partially Observable Markov Decision Process (POMDP) defined by the 6-tuple $(S, A, P, R, \Omega, O)$, where $S$ represents the state space, $A$ the action space, $P$ the state transition probabilities, $R$ the reward function, $\Omega$ the observation space, and $O$ the observation probability distribution. The robot's action at time $t$ is sampled from the policy as $a_t \sim \pi_\theta(a_t \mid o_t)$, where the observation $o_t$ comprises LiDAR data, camera data, the relative goal location, and the robot velocity. The hybrid reward function encourages goal-reaching and safe navigation while penalizing collisions and oscillatory behavior; its individual components are weighted and summed into the total reward.
The policy is trained using PPO with the clipped surrogate objective, where $r_t(\theta)$ is the probability ratio and $\hat{A}_t$ is the advantage function. The policy $\pi_\theta$ is represented by a neural network with separate branches for processing LiDAR and camera data, followed by fusion layers that integrate these features with the goal and velocity observations; the LiDAR and camera streams are processed independently before being fused.
During training, the policy is initialized in static environments and progressively trained in dynamic scenarios with occluded obstacles and high pedestrian densities. This curriculum learning approach improves the robot’s ability to generalize to real-world dense crowd navigation while ensuring smooth and efficient trajectories.
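The following PyTorch sketch shows one way such a two-branch observation encoder could be organized, with a 1-D convolutional branch for the LiDAR scan and a 2-D convolutional branch for the depth image, fused with the goal and velocity vectors. Layer sizes and names are illustrative assumptions, not the CrowdSteer architecture.

```python
import torch
import torch.nn as nn

class FusionPolicyNet(nn.Module):
    """Illustrative two-branch encoder: 1-D conv for LiDAR, 2-D conv for the
    depth image, fused with goal and velocity features before the action head."""
    def __init__(self, action_dim=2):
        super().__init__()
        self.lidar_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.camera_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fusion = nn.LazyLinear(256)   # infers the fused feature size
        self.head = nn.Linear(256, action_dim)

    def forward(self, lidar, depth, goal, velocity):
        # lidar: (B, 1, L), depth: (B, 1, H, W), goal/velocity: (B, 2) each
        feats = torch.cat([self.lidar_branch(lidar),
                           self.camera_branch(depth),
                           goal, velocity], dim=1)
        return self.head(torch.relu(self.fusion(feats)))
```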
6.3.5. Self-Learned Vehicle Control Using PPO
Canal and Taschin [
45] proposed a method that enhances PPO for vehicle control by combining minimal path planning with reinforcement learning to achieve efficient and generalizable vehicle control in obstacle-rich environments. The key innovation lies in using PPO to learn vehicle control policies that follow a pre-computed approximate path while adapting to unforeseen obstacles and optimizing for faster completion times.
The environment is framed as a POMDP, where the state $s_t$ includes the relative positions of the next $n$ checkpoints, the vehicle velocity, and simulated LiDAR data. The action $a_t$ consists of continuous control variables, such as throttle and steering for cars, and x-axis and z-axis movements for drones. The policy $\pi_\theta$ is parameterized by a neural network that maps the observed state $s_t$ to an action $a_t$.
The reward function balances three objectives: checkpoint progress, collision avoidance, and minimizing time; the total reward $r_t$ at time $t$ sums these weighted terms. The policy is trained using PPO with the clipped surrogate objective, where $r_t(\theta)$ is the probability ratio and $\hat{A}_t$ is the advantage function.
The training process employs curriculum learning, progressively increasing the complexity of generated maps by adjusting the proportion of obstacles and randomizing the initial orientation of the vehicle. Early training stages use simpler maps with fewer obstacles and optimal starting orientations. The difficulty gradually increases as the agent improves, enhancing generalization to unseen environments.
Additionally, the method introduces a fractional factorial design to optimize hyperparameters and state representations. This approach ensures the efficient evaluation of different configurations without exhaustive testing, reducing overall computational costs during training. By combining minimal path planning, curriculum learning, and PPO-based control, the proposed method eliminates the need for detailed motion planning while achieving high performance in generalizable navigation tasks.
6.4. Soft Actor–Critic (SAC)
Soft Actor–Critic (SAC), introduced by Haarnoja et al. [
2], is an off-policy reinforcement learning algorithm that combines maximum entropy reinforcement learning with actor–critic methods. The primary goal of SAC is to optimize a stochastic policy by maximizing both the expected cumulative reward and the entropy of the policy, encouraging exploration and robustness to model uncertainty.
The objective of SAC can be expressed as
$$J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],$$
where $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a_t \sim \pi}\big[\log \pi(a_t \mid s_t)\big]$ is the entropy of the policy, $\alpha$ is the temperature parameter that balances exploration (via entropy maximization) and exploitation (via reward maximization), and $\gamma$ is the discount factor.
SAC employs two key networks: a stochastic policy $\pi_\phi$ parameterized by $\phi$, and two Q-value functions $Q_{\theta_1}$ and $Q_{\theta_2}$ parameterized by $\theta_1$ and $\theta_2$, respectively. The use of two Q-value functions addresses the overestimation bias typically observed in value-based reinforcement learning methods. The target Q-value is computed as
$$y = r + \gamma \Big( \min_{i=1,2} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi_\phi(a' \mid s') \Big), \quad a' \sim \pi_\phi(\cdot \mid s'),$$
where $\bar{\theta}_i$ are the parameters of the target Q-networks. The Q-value networks are updated by minimizing the Bellman residual
$$J_Q(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[ \big( Q_{\theta_i}(s, a) - y \big)^2 \Big],$$
where $\mathcal{D}$ is a replay buffer storing transitions.
The policy $\pi_\phi$ is optimized to maximize the expected entropy-augmented reward,
$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi}\Big[ \min_{i=1,2} Q_{\theta_i}(s, a) - \alpha \log \pi_\phi(a \mid s) \Big].$$
The temperature parameter $\alpha$ can be adjusted dynamically to control the trade-off between exploration and exploitation; it is updated by minimizing the objective
$$J(\alpha) = \mathbb{E}_{a \sim \pi_\phi}\big[ -\alpha \log \pi_\phi(a \mid s) - \alpha \bar{\mathcal{H}} \big],$$
where $\bar{\mathcal{H}}$ is the desired minimum entropy. The step-by-step pseudocode for the SAC algorithm is presented in Algorithm 11.
Algorithm 11: Soft Actor–Critic (SAC) Algorithm.
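A minimal PyTorch-style sketch of the three SAC updates (critic, actor, and temperature) is given below. It assumes generic `actor`, `critic1`/`critic2`, and target-critic networks with an `actor.sample` method returning an action and its log-probability; these names and the batch layout are assumptions, and the sketch follows the standard formulation above rather than any specific robotics implementation.

```python
import torch
import torch.nn.functional as F

def sac_update(batch, actor, critic1, critic2, target1, target2,
               log_alpha, gamma=0.99, target_entropy=-2.0):
    """One SAC gradient step on a replay batch; returns the three losses."""
    s, a, r, s_next, done = batch
    alpha = log_alpha.exp()

    # Critic target: minimum of the two target Q-values plus the entropy bonus.
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)     # reparameterized sample
        q_next = torch.min(target1(s_next, a_next), target2(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)

    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)

    # Actor: maximize the entropy-augmented Q, i.e., minimize its negative.
    a_new, logp_new = actor.sample(s)
    q_new = torch.min(critic1(s, a_new), critic2(s, a_new))
    actor_loss = (alpha.detach() * logp_new - q_new).mean()

    # Temperature: drive the policy entropy toward the target entropy.
    alpha_loss = -(log_alpha * (logp_new.detach() + target_entropy)).mean()
    return critic_loss, actor_loss, alpha_loss
```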
SAC has been shown to outperform traditional reinforcement learning algorithms such as Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3) in tasks involving high-dimensional continuous action spaces. Its ability to encourage exploration through entropy maximization makes it particularly effective in environments with sparse rewards and high uncertainty.
The combination of off-policy learning, entropy regularization, and a stochastic actor enables SAC to achieve robust performance in a wide range of robotic control tasks, including locomotion, navigation, and manipulation. The recent advancements related to robotic path planning are reviewed in the following sections.
6.4.1. SAC Algorithm in Dynamic Path Planning for Mobile Robots
Yang et al. [
46] improved the original SAC algorithm to address the challenges of mobile robot path planning in environments with static and dynamic obstacles. They introduced an enhanced reward function, state dynamic normalization, and a prioritized replay buffer to improve robustness and decision-making efficiency.
The SAC framework is retained: the policy $\pi_\phi$ maps the current state $s_t$ to a stochastic action $a_t \sim \pi_\phi(\cdot \mid s_t)$, and the actor network is optimized to maximize the entropy-regularized objective, in which the entropy term balances exploration and exploitation and $\alpha$ is the entropy temperature coefficient. The critic networks are updated by minimizing the temporal difference (TD) error for the value function, whose target includes the entropy term. To enhance the reward structure, a hybrid reward function is defined to guide the robot's behavior.
State dynamic normalization ensures stable training by normalizing the state vector to approximately follow a standard normal distribution, $\hat{s} = (s - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the state distribution. The prioritized replay buffer selects samples with higher importance for training, sampling transition $i$ with probability proportional to its priority $p_i$ raised to an exponent that controls the influence of priority on sampling.
This combination of techniques enables SAC to navigate efficiently in complex environments, outperforming the original SAC and PPO algorithms in terms of stability and cumulative rewards. A block diagram of the proposed SAC-based method for mobile robot navigation is shown in Figure 13. The input comprises state information including LiDAR, velocity, and target observations. Normalized states are processed by the actor and critic networks; the prioritized replay buffer improves sampling, and the entropy adjustment refines exploration. The final policy update combines these outputs to produce an optimized policy that yields efficient navigation actions.
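The sketch below shows a simple proportional prioritized replay buffer of the kind described above. Using the absolute TD error as the priority and naming the exponent `beta` (to avoid clashing with the SAC temperature) are standard choices from prioritized experience replay, assumed here rather than taken from [46].

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized replay: P(i) is proportional to p_i**beta."""
    def __init__(self, capacity, beta=0.6):
        self.capacity, self.beta = capacity, beta
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:        # drop the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(abs(td_error) + 1e-6)

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.beta
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + 1e-6
```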
6.4.2. Sim-to-Real Transfer with Incremental Environment Complexity
The study by Chaffre et al. [
47] proposed a method that builds on SAC by introducing incremental environment complexity and an improved reward shaping mechanism to achieve robust depth-based mapless navigation for mobile robots. The main enhancements address the sim-to-real transfer problem by combining domain randomization with structured environment transitions. The SAC framework is used, where the policy $\pi_\phi$ maps states $s_t$ to actions $a_t$, optimizing the entropy-regularized objective in which $\mathcal{H}(\pi_\phi(\cdot \mid s_t))$ is the entropy of the policy and $\alpha$ is a temperature parameter balancing exploration and exploitation.
The Q-value and value functions follow the standard SAC formulation. The reward function is shaped to guide the robot's behavior effectively during training: the reward $r_t$ is defined in terms of the change in distance to the target and a velocity reduction factor, with constants set to encourage task completion while penalizing collisions and ineffective movements.
Incremental complexity is introduced by progressively training the policy in three simulated environments (empty room, static obstacles, and mixed static and dynamic obstacles). Transitions between environments are governed by the success rate over recent episodes: if the success rate exceeds an upper threshold, the agent progresses to a more complex environment; if it falls below a lower threshold, it returns to a simpler one. The policy update is performed by minimizing the KL divergence between the policy and the exponentiated Q-function,
$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s) \,\middle\|\, \frac{\exp\big(Q_\theta(s, \cdot)\big)}{Z_\theta(s)} \right) \right],$$
where $Z_\theta(s)$ is a partition function that normalizes the distribution.
This approach ensures smooth transitions between simulation and real-world scenarios by gradually increasing task complexity and refining the policy through domain randomization and reward shaping.
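A minimal sketch of the success-rate-driven environment schedule is shown below; the thresholds, window size, and environment names are placeholders, since the exact values used in [47] are not reproduced here.

```python
from collections import deque

class EnvironmentScheduler:
    """Move between environments of increasing difficulty based on the
    recent success rate (placeholder thresholds and level names)."""
    LEVELS = ["empty_room", "static_obstacles", "mixed_obstacles"]

    def __init__(self, window=50, up_threshold=0.8, down_threshold=0.4):
        self.results = deque(maxlen=window)
        self.level = 0
        self.up, self.down = up_threshold, down_threshold

    def report(self, success: bool) -> str:
        """Record an episode outcome and return the environment to use next."""
        self.results.append(1.0 if success else 0.0)
        rate = sum(self.results) / len(self.results)
        if rate > self.up and self.level < len(self.LEVELS) - 1:
            self.level += 1
            self.results.clear()      # restart statistics in the new level
        elif rate < self.down and self.level > 0:
            self.level -= 1
            self.results.clear()
        return self.LEVELS[self.level]
```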
A block diagram of the SAC-based method is presented in Figure 14. The input includes the state with LiDAR, velocity, and target observations. State normalization refines the inputs for the actor and critic networks; the actor generates the policy, while the critic evaluates the value functions. The shaped reward function and entropy adjustment guide the training process, and the policy update produces optimized navigation actions.
Figure 14. Block diagram of the sim-to-real transfer method of [47].
6.4.3. Dynamic Obstacle Avoidance and Path Planning Integration
Choi et al. [
48] developed an SAC-based framework that integrates decentralized dynamic obstacle avoidance with path planning. The enhancements to the SAC algorithm include a tailored state representation, a hybrid reward function, and a combination of reinforcement learning and classical path planning for improved navigation. The state comprises LiDAR measurements from three consecutive time steps, capturing the temporal dynamics of obstacles; a goal state giving the relative position of the robot with respect to the target; and a speed state consisting of the forward velocity $v$ and the rotational velocity $\omega$. The action is defined in a continuous space of linear and angular velocity commands.
The reward function combines terms for goal achievement, collision avoidance, and control efficiency. The SAC objective is augmented to optimize this hybrid reward while maintaining exploration through entropy regularization, where the entropy of the policy encourages diverse action exploration.
To address navigation inefficiencies and conflicts between obstacle avoidance and target-reaching, a classical path planning component is integrated. The global path planner generates a trajectory using the A* algorithm, which provides a “look-ahead point” for the SAC agent, selected along the path according to a distance parameter ahead of the robot. The relative position of the look-ahead point is incorporated into the RL agent’s input. The actor and critic networks are designed with convolutional layers to process the LiDAR data and fully connected layers for the speed and goal components; the critic evaluates the Q-value and the value function following the standard SAC formulation.
By combining SAC with classical path planning and integrating dynamic obstacle prediction into the state representation, the method achieves robust and efficient navigation in complex environments. Algorithm 12 presents a step-by-step pseudocode for the method.
Algorithm 12: The SAC-based Dynamic Obstacle Avoidance and Path Planning of [48].
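The following sketch illustrates one way a look-ahead point can be extracted from a global A* path at a fixed distance ahead of the robot and expressed in the robot frame. The distance parameter and helper names are illustrative assumptions, not the exact scheme of [48].

```python
import numpy as np

def look_ahead_point(path, robot_xy, robot_yaw, look_ahead_dist=1.5):
    """Pick the first waypoint on the global path lying at least
    `look_ahead_dist` away from the robot, expressed in the robot frame."""
    path = np.asarray(path, dtype=float)              # (N, 2) A* waypoints
    robot_xy = np.asarray(robot_xy, dtype=float)
    dists = np.linalg.norm(path - robot_xy, axis=1)
    ahead = np.flatnonzero(dists >= look_ahead_dist)
    target = path[ahead[0]] if ahead.size else path[-1]

    # Transform the chosen point into the robot-centric frame.
    dx, dy = target - robot_xy
    c, s = np.cos(-robot_yaw), np.sin(-robot_yaw)
    return np.array([c * dx - s * dy, s * dx + c * dy])
```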
6.4.4. Robot Navigation in Constrained Pedestrian Environments
The study by Perez-D’Arpino et al. [
49] proposed a method that extends SAC by integrating a sampling-based motion planner with a reactive sensorimotor policy to achieve dynamic navigation in constrained pedestrian environments. The problem is formulated as a Partially Observable Markov Decision Process (POMDP) $(S, A, O, T, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $O$ is the observation space, $T$ is the state transition model, $R$ is the reward function, and $\gamma$ is the discount factor. Observations comprise a goal, a LiDAR scan, and waypoints: the goal represents the 2D coordinates of the target in the robot’s frame, the LiDAR observation consists of 128 range measurements from a 1D LiDAR sensor, and the waypoints are derived from a global planner using a 2D map.
The policy outputs actions $(v_x, v_y, \omega)$, where $v_x$ and $v_y$ are the linear velocities and $\omega$ is the angular velocity.
The reward function incorporates terms to encourage goal-reaching, minimize time, avoid collisions, and follow global waypoints. The goal term is awarded when the robot is within a goal tolerance distance of the target, and the waypoint term is a potential-based reward inversely proportional to the distance to the next waypoint.
The SAC algorithm optimizes the expected sum of rewards while maximizing the entropy of the policy, with $\alpha$ the entropy temperature and $\mathcal{H}(\pi(\cdot \mid s_t))$ the policy entropy. The policy and value networks are updated using SAC’s off-policy gradient rules: the value function follows the entropy-augmented formulation, and the critic is updated by minimizing the Bellman error.
The policy receives high-level waypoints from a motion planner and uses RL to navigate while avoiding collisions and adapting to dynamic pedestrian behavior. Training involves multi-layout environments (e.g., corridors, doors, intersections), allowing the policy to generalize to unseen, compositional environments.
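As a rough illustration of the waypoint-following term, the snippet below computes a potential-based bonus that grows as the robot approaches the next waypoint, together with a terminal goal bonus. The constants and function signature are hypothetical and only echo the reward structure described above.

```python
import numpy as np

def waypoint_reward(robot_xy, waypoint_xy, goal_xy,
                    goal_tol=0.3, waypoint_scale=0.1, goal_bonus=10.0):
    """Hypothetical shaping: a potential term inversely proportional to the
    distance to the next waypoint, plus a bonus when the goal is reached."""
    d_wp = np.linalg.norm(np.subtract(waypoint_xy, robot_xy))
    d_goal = np.linalg.norm(np.subtract(goal_xy, robot_xy))
    reward = waypoint_scale / (d_wp + 1e-3)      # potential-based waypoint term
    if d_goal < goal_tol:                        # within the goal tolerance
        reward += goal_bonus
    return reward
```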
6.4.5. Velocity Range-Based SAC Reward Shaping for Mapless Navigation
A novel SAC reward shaping method was proposed by Lee and Jeong [
50] in which a “Velocity Range-Based Reward Shaping” technique is devised to improve mapless navigation for autonomous mobile robots. This technique adjusts the reward function to prioritize efficient and safe movements by incorporating velocity-based evaluations into the reinforcement learning framework. The robot’s state consists of LiDAR-based distance data and the relative target coordinates. The action consists of the linear velocity $v$ and the angular velocity $\omega$, each bounded within a fixed range.
The reward function incorporates velocity-based evaluation and combines the distance to the target, the orientation error, and a velocity-based evaluation score, weighted by importance weights. The velocity evaluation score is determined by dividing the velocity range into $n$ segments and assigning an evaluation score to each segment. The policy is optimized using the SAC objective, modified to include the velocity-based reward, where $\alpha$ is the entropy temperature and $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the policy. The value function and Q-function are trained with the standard SAC updates.
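The velocity-range scoring can be pictured with the short sketch below, which splits the admissible linear-velocity range into a fixed number of segments and returns the score of the segment the current velocity falls into. The segment boundaries and scores are placeholder values, not those of [50].

```python
import numpy as np

def velocity_range_score(v, v_min=0.0, v_max=0.8, scores=(0.2, 0.5, 1.0, 0.5)):
    """Divide [v_min, v_max] into len(scores) segments and return the
    evaluation score of the segment containing velocity v (placeholder scores)."""
    edges = np.linspace(v_min, v_max, num=len(scores) + 1)
    idx = int(np.clip(np.searchsorted(edges, v, side="right") - 1,
                      0, len(scores) - 1))
    return scores[idx]
```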
By incorporating velocity-based evaluation into the reward function and leveraging the entropy-regularized objective, the method achieves stable and efficient mapless navigation in dynamic environments. A block diagram of the process flow is presented in Figure 15. The input includes state information derived from LiDAR data and the target position. State normalization prepares the inputs for the actor and critic networks; the actor generates the policy, while the critic evaluates the Q-function. The hybrid reward function and entropy adjustment guide the training process, the replay buffer stores transitions for policy updates, and the final policy update produces optimized navigation actions.
7. Emerging Trends in Reinforcement Learning for Robotics
Recent advancements in reinforcement learning have introduced new techniques that improve learning efficiency, generalization, and decision-making in robotics. This section explores emerging trends, including transformer-based RL, attention mechanisms, meta-learning, and transfer learning.
7.1. Transformer-Based DRL
Transformers, originally developed for natural language processing, have shown promise in reinforcement learning by improving long-term sequence modeling and decision-making. Hu et al. [
51] provided a comprehensive survey on transformer-based RL, highlighting its applications in robotic manipulation, autonomous driving, and multi-agent coordination. These architectures enable better credit assignment over long time horizons and mitigate the “deadly triad” problem of RL. Similarly, Zhou et al. [
52] introduced a transformer-based approach for autonomous driving, demonstrating a 41% reduction in collision rates compared to conventional reinforcement learning algorithms.
7.2. Meta-Learning and Transfer Learning in Robotics
Meta-learning aims to enable RL agents to adapt rapidly to new environments by learning how to learn. Ren et al. [
53] developed a meta-RL framework that leverages human preference feedback for few-shot adaptation, allowing robots to learn complex tasks with minimal demonstrations. Furthermore, Liu and Ahmad [
54] proposed a multi-task RL framework that reuses previously learned policies for efficient adaptation in continuous control tasks, reducing sample complexity and improving policy generalization.
Transfer learning has also gained traction in RL-based robotics. Jiang et al. [
55] introduced a transformer-based multi-agent RL approach for cross-domain transfer in urban environments, significantly improving generalization across different cities. These methods enable RL agents to transfer knowledge across different environments, reducing training times for real-world deployment.
7.3. Advancements in Multi-Agent Reinforcement Learning
Multi-agent RL (MARL) has witnessed significant progress in handling complex coordination problems. Gabler and Wollherr [
56] proposed a decentralized MARL framework with dual critics, enabling robots to optimize joint team rewards while considering individual constraints. Additionally, Song et al. [
57] introduced a hierarchical reinforcement learning method for large-scale multi-agent path planning, enhancing sample efficiency through spatiotemporal abstraction.
7.4. Attention Mechanisms in Robot Navigation
Attention mechanisms have been increasingly applied to RL-based navigation tasks. Escudie et al. [
58] developed an attention graph-based RL framework for multi-robot social navigation, improving interaction modeling in dense human environments. Their results demonstrated enhanced cooperation and safety in real-world scenarios.
These emerging trends illustrate the growing capabilities of RL in robotics and highlight promising directions for future research. Integrating these advanced methodologies will enable more efficient, robust, and scalable autonomous systems.
7.5. Scaling DRL for Large-Scale Robotic Systems and Multi-Agent Environments
Deploying deep reinforcement learning (DRL) in large-scale robotic systems, such as urban autonomous fleets or search-and-rescue missions, presents unique challenges related to computational efficiency, coordination, and policy transferability. Recent studies have explored various methods to improve the scalability of DRL-based approaches for multi-agent environments.
One approach is the development of hierarchical DRL techniques that enable more efficient decision-making in large agent populations. Liu et al. [
59] provided an extensive survey on scalability challenges in multi-agent DRL (MADRL), highlighting the role of hierarchical structures in managing state–action space complexity. Their study emphasized that scalable algorithms must incorporate robustness mechanisms to handle dynamic environments effectively.
Graph-based learning methods have also proven effective for large-scale coordination. Zhang et al. [
60] introduced a graph-based DRL technique for multi-robot task allocation, leveraging graph normalization to enhance scalability. Their approach demonstrated improved task efficiency and adaptability to varying team sizes, outperforming conventional heuristic-based task assignment methods.
Distributed and federated learning have also been explored as viable solutions for scalability. Pramuk et al. [
61] proposed a federated DRL framework that enables multi-robot systems to share learned policies without centralized computation, reducing communication overhead while maintaining performance in large-scale navigation tasks. Similarly, Chen et al. [
62] introduced an open-source framework, MultiRoboLearn, designed to bridge the gap between multi-agent DRL simulations and real-world multi-robot applications, ensuring better generalization across different environments.
Another critical consideration in scalability is decentralized policy learning. Liu et al. [
63] presented an Independent Soft Actor–Critic (ISAC)-based method that enhances decentralized training in large-scale multi-agent path planning. Their results showed improved success rates and computation efficiency compared to centralized learning approaches.
Overall, these advancements illustrate the growing feasibility of DRL in large-scale robotic systems, paving the way for scalable, efficient, and adaptive multi-agent coordination in real-world applications.
7.6. Safe Exploration in DRL for Robotics
Ensuring safe exploration is a critical challenge in deep reinforcement learning (DRL) for robotic applications. While traditional RL methods rely on trial-and-error exploration, real-world deployment necessitates strategies that prevent unsafe actions during training and execution. Recent advancements in safe DRL have introduced several approaches, including constrained optimization, risk-aware policy learning, and uncertainty-aware exploration techniques.
One approach to safe exploration is through constrained optimization, where policies are trained under explicit safety constraints. Zhao et al. [
64] introduced Safe Set Guided State-wise Constrained Policy Optimization (S-3PO), a method that ensures zero training violations by employing a safety monitor with black-box dynamics. Their approach improves sample efficiency while preventing unsafe transitions. Similarly, Xu et al. [
65] proposed an Extra Safety Budget (ESB-CPO) framework that balances exploration efficiency with safety compliance, dynamically adjusting constraints throughout training.
Hierarchical reinforcement learning has also been explored for safe navigation. Roza et al. [
66] developed a constrained hierarchical RL framework that integrates a safety layer to modify low-level policies, reducing collision rates while maintaining high performance. Additionally, Marchesini et al. [
67] introduced a Safety-Oriented Search technique that biases policy optimization towards safer actions, leveraging an evolutionary cost optimization framework.
Uncertainty-aware policies have been proposed to address model inaccuracies in dynamic environments. Ramírez and Yu [
68] developed a Monte Carlo dropout-based method to estimate uncertainty in state–action pairs, enabling safer decision-making in robotic control. Their approach significantly reduces risk while maintaining policy efficiency. Furthermore, Jayant and Bhatnagar [
69] introduced a model-based constrained RL approach that leverages an ensemble of neural networks to estimate uncertainty and ensure robust constraint satisfaction.
Beyond policy optimization, risk-aware exploration strategies have been studied to enhance safety. Kim et al. [
70] proposed SafeTAC, a Tsallis entropy-regularized safe RL method that improves exploration without violating safety constraints. Their method successfully mitigates overly conservative behaviors while optimizing performance in high-dimensional robotic tasks.
10. Conclusions and Future Directions
This review provides a comprehensive analysis of the application of Deep Reinforcement Learning (DRL) in mobile robot path planning, showcasing its potential to tackle dynamic and uncertain environments where traditional methods often fall short. By leveraging neural networks to handle high-dimensional state–action spaces, DRL enables mobile robots to make real-time decisions with enhanced efficiency and adaptability. The categorized exploration of value-based methods, policy-based approaches, and hybrid frameworks demonstrates the breadth and versatility of DRL techniques in addressing complex navigation challenges. Key advancements such as Proximal Policy Optimization (PPO), Soft Actor–Critic (SAC), and hybrid frameworks have delivered state-of-the-art performance in scenarios ranging from indoor navigation to multi-agent coordination. Despite this progress, limitations persist in scalability, safety, interpretability, and generalization to unseen environments. Addressing these challenges is essential for fully harnessing the transformative potential of DRL in autonomous navigation.
The future of DRL in mobile robot path planning lies in overcoming these limitations and expanding its applicability to real-world scenarios. One critical area is scalability, as current DRL models often struggle with large and complex environments. Distributed and hierarchical learning frameworks could address this by decomposing navigation tasks into subproblems. For instance, a warehouse robot might use a hierarchical approach to first navigate to a specific aisle and then locate a target object within it. Real-time performance improvements will also be essential, especially for logistics robots in high-demand environments like fulfillment centers.
Ensuring safety in navigation is another crucial direction, particularly for robots operating in human-centric environments. By integrating DRL with safety constraints and verification mechanisms, robots can learn to avoid risky behaviors. Autonomous delivery robots in crowded urban areas, for example, could benefit from policies that account for pedestrian movements and dynamic obstacles while maintaining a reliable path to their destination.
The gap between simulation and reality remains a significant challenge. While simulation environments are invaluable for training DRL models, real-world deployment often suffers from discrepancies in environmental conditions. Techniques like domain adaptation and transfer learning could help bridge this gap. For example, autonomous vehicles trained on simulated highways could adapt their policies to real-world conditions involving variable weather, road textures, and traffic dynamics.
Coordination among multiple robots presents another promising avenue for DRL. Multi-agent reinforcement learning frameworks could promote collaboration and efficient resource use in shared environments. Search-and-rescue missions, for instance, could benefit from teams of drones and ground robots working together to navigate and locate survivors in disaster-stricken areas. Similarly, such frameworks could optimize fleets of warehouse robots for collaborative tasks, reducing bottlenecks and improving throughput.
Socially aware navigation is another frontier where DRL could play a transformative role. Incorporating human preferences and social norms into reward functions can help robots operate seamlessly in public spaces. Service robots in airports, shopping malls, or hospitals must navigate crowded areas while respecting personal space and pedestrian flow patterns. These considerations could enhance their acceptance and utility in everyday human environments.
Energy efficiency is an often-overlooked aspect of path planning but is critical for battery-powered mobile robots. Reward functions that factor in energy consumption and optimize for sustainable operation could extend the operational range of robots in applications like agricultural fieldwork or remote inspection tasks. For instance, agricultural robots managing tasks like planting or weed removal would benefit from energy-efficient navigation strategies that maximize battery life across large fields.
The exploration of unstructured terrains presents additional challenges and opportunities for DRL. Robots operating in forests, disaster sites, or underwater environments require policies capable of handling incomplete environmental data and complex dynamics. Exploration-based DRL approaches, which use curiosity-driven rewards, could enable robots to navigate unknown terrains effectively. For example, underwater robots conducting coral reef surveys could leverage these techniques to explore and document fragile ecosystems while avoiding obstacles.
Deep reinforcement learning presents significant computational challenges when deployed on real-world robotic systems, particularly due to the high-dimensional sensory inputs required for robust decision-making. LiDAR and vision-based sensing are widely used in mobile robot path planning, but processing these inputs in real-time demands specialized hardware configurations.
Gautier et al. [
91] highlighted that multi-robot systems relying on DRL for task allocation face severe computational bottlenecks, particularly when cloud resources are unavailable. Their study demonstrated that distributing computation across local robotic clusters improves efficiency but introduces coordination challenges. Similarly, Kwon et al. [
92] explored Hyperdimensional Computing (HDC) as a lightweight alternative to deep learning for sensorimotor control, achieving a 14.2× improvement in computational efficiency while maintaining comparable accuracy.
For autonomous navigation, high-dimensional image data processing is another major concern. Vijetha and Geetha [
93] proposed a state abstraction technique to transform raw vision data into compact representations, significantly reducing the computational overhead of DRL policies running on resource-constrained devices. Additionally, Xiang et al. [
94] introduced RMBench, a benchmarking framework for DRL-based robotic manipulation, revealing that Soft Actor–Critic (SAC) achieves the best trade-off between computational cost and task performance.
Another critical challenge is balancing computational efficiency with training stability. Han et al. [
16] proposed a deep stochastic Koopman operator (DeSKO) to model complex robot dynamics, improving decision-making efficiency while mitigating the impact of computational constraints. Their approach demonstrated robust performance in real-world soft robotic systems.
Given these considerations, future research should focus on optimizing DRL architectures for edge computing and developing hybrid approaches that integrate learning-based methods with classical control techniques to alleviate hardware limitations.
Finally, advancements in integrating vision and multimodal sensory systems could significantly enhance DRL’s robustness in real-world applications. Combining inputs from cameras, LiDAR, and other sensors could enable robots to build richer environmental models. Autonomous drones delivering packages in urban areas, for instance, could utilize multimodal inputs for precise obstacle detection and landing.
As DRL continues to evolve, its potential to generalize across unseen scenarios will become increasingly important. Approaches like meta-reinforcement learning and few-shot learning could enable robots to adapt rapidly to new tasks or environments with minimal additional training. For example, autonomous robots in industrial settings could quickly learn to navigate newly reconfigured layouts or operate in entirely different facilities.
By addressing these challenges and directions, DRL could become a cornerstone of mobile robot navigation, paving the way for its widespread deployment in diverse applications. These advancements will not only improve the adaptability and efficiency of robots but also enhance their safety and reliability, enabling them to meet the demands of real-world scenarios effectively.