1. Introduction
Autonomous intelligent mobile robots [1] are suitable for completing various tasks in dangerous and complex environments; such tasks include urban rescue, public security patrol, and epidemic prevention. Motion planning, in which information obtained from the external environment by sensors is used to evaluate the optimal collision-free path between a starting point and an ending point, is an essential technique in autonomous navigation. Traditional motion planning [2] is a core component of the pipeline framework and must be integrated with other subtasks, such as perception, localization, and control, to accomplish autonomous driving tasks; such pipeline methods [3] are often inflexible and require substantial computational resources and numerous manual heuristic functions because they must process real-time sensory information. The pipeline framework for autonomous driving consists of many interconnected modules, whereas in the end-to-end method, the entire framework is treated as a single learning task. In end-to-end autonomous motion [2,4], raw sensor data are directly used as the input to a neural network, which outputs low-level control commands. This method is attractive because complex modules do not need to be designed; instead, a network with a simple structure is constructed, and the entire motion process is optimized as a single machine-learning task.
On the basis of their principles and era of development, robot planning methods can be divided into traditional and learning-based methods. Traditional methods include graph search algorithms, sample-based algorithms, interpolating curve algorithms, and reaction-based algorithms [1]. The Dijkstra method [5] is based on a greedy and optimal graph search; however, it lacks directionality during pathfinding. In the A* method [6], a heuristic function is employed to measure the distance between the real-time search location and the target location; this yields a targeted search and a higher search speed than that of the Dijkstra method. A rapidly exploring random tree (RRT) [7] is a sample-based approach that generates a sequence of dynamically feasible kinematic connections. The RRT method is sensitive to the sampling distribution, and there is no guarantee that it converges to the optimal solution within a sufficiently short time. Interpolating curve algorithms draw a smooth path based on computer-aided geometric design. Typical path smoothing and curve generation rules include lines and circles [8], clothoid curves [9], polynomial curves [10], Bezier curves [11], and spline curves [12]. The potential field method (PFM), the velocity obstacle method (VOM), and the dynamic window approach (DWA) are three reaction-based algorithms that are widely used in engineering and manufacturing. The PFM [13] uses the gradient of a potential field to move the robot from an initial point to a target point. The VOM [14] relies on the current positions and velocities of robots and obstacles to calculate a reachable avoidance velocity (RAV) space; a proper avoidance velocity is selected from the RAV space to avoid static and moving obstacles. The DWA [15] is an online collision avoidance strategy that selects the translational and rotational velocities that maximize an objective function within a dynamic window. The shortcoming of traditional methods is that they require knowledge of the environment prior to planning, which makes them unsuitable for path-planning problems in environments with unknown characteristics. In addition, when a traditional method is used, trajectory optimization must be performed at the back end after the path search has been performed at the front end, and this trajectory optimization requires considerable computation.
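As a concrete illustration of the graph-search family discussed above, the following minimal sketch implements A* on a 4-connected occupancy grid with a Euclidean heuristic; the grid layout, function name, and unit step cost are illustrative assumptions rather than details from the cited works.

```python
import heapq
import math

def a_star(grid, start, goal):
    """Minimal A* on a 4-connected occupancy grid.

    grid: 2D list, 0 = free cell, 1 = obstacle.
    start, goal: (row, col) tuples.
    Returns a list of cells from start to goal, or None if no path exists.
    """
    def h(cell):  # Euclidean heuristic, admissible for unit step costs
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    open_set = [(h(start), 0.0, start)]      # entries are (f = g + h, g, cell)
    came_from, g_cost = {}, {start: 0.0}

    while open_set:
        _, g, current = heapq.heappop(open_set)
        if current == goal:                  # reconstruct the path by backtracking
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nbr = (current[0] + dr, current[1] + dc)
            if not (0 <= nbr[0] < len(grid) and 0 <= nbr[1] < len(grid[0])):
                continue
            if grid[nbr[0]][nbr[1]] == 1:    # skip obstacle cells
                continue
            tentative_g = g + 1.0
            if tentative_g < g_cost.get(nbr, float("inf")):
                g_cost[nbr] = tentative_g
                came_from[nbr] = current
                heapq.heappush(open_set, (tentative_g + h(nbr), tentative_g, nbr))
    return None

# Example: 0 = free, 1 = obstacle
grid = [[0, 0, 0, 1],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 3)))
```

The heuristic term is what gives A* its directionality compared with the Dijkstra method, which corresponds to setting the heuristic to zero.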
Reinforcement learning (RL) [16,17,18] is a goal-directed computational approach. In contrast to supervised and unsupervised learning methods, RL uses the reward obtained by interacting with an unknown dynamic environment as a feedback signal rather than relying on many labeled samples [19]. RL methods can be categorized into value-based, policy-based, and actor–critic (AC)-based frameworks. Value-based RL is the simplest RL method, and it performs well in most discrete action spaces. In policy-based RL, stochastic policies are optimized by directly mapping states to actions through a probability function. The AC-based framework [20] combines the advantages of the value-based and policy-based frameworks and typically includes actor and critic networks. The actor is a policy network that maps input states to output actions, and the critic is a value network that evaluates the quality of each state–action pair.
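The actor and critic roles described above can be sketched as a pair of small networks; this is a minimal illustration assuming PyTorch, with layer sizes and class names chosen by us rather than taken from the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a state to a bounded continuous action."""
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # squash output to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Value network: scores a state-action pair with a scalar Q-value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))
```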
An RL agent learns by directly interacting with the environment without supervision and without a model of the environment. The agent generally receives a positive reward through environmental feedback only at the end of an episode. For long-distance navigation in environments containing obstacles, robots have difficulty obtaining this final positive reward signal. The resulting sparse reward problem [21] leads to slow learning and difficult convergence. Reward shaping [22] is a means of manually tuning and modifying fine-grained reward signal values for robots in different states. This tuning is intuitive and highly dependent on the expert experience of the person conducting the process, and inappropriate rewards might change the optimal policy and produce abnormal behavior. In the curiosity-driven method [23], the sparse reward problem is solved by exploiting existing trajectories: an intrinsic curiosity module extracts additional intrinsic reward signals from the environment and encourages the agent to explore effectively. Hindsight experience replay (HER) [24] is another approach for addressing the sparse reward problem. In HER, additional reward signals are explored and acquired during training by mapping no-reward states to new goals and replacing the previous goals. Another approach for addressing reward sparsity is hierarchical RL (HRL) [25,26], in which the original task is hierarchically decomposed into multiple continuous and easy-to-solve subtasks, which are then further divided and completed to provide dense reward signals to the agent. The reward function is difficult to configure for some specific and complex planning tasks. In inverse RL (IRL) [27], expert trajectories are utilized to inversely learn the reward function, and policies are then optimized. IRL is an imitation-learning paradigm [28], as is behavior cloning, which relies on a supervised learning process and suffers from a mismatch between the learner's policy and the expert's policy.
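The goal-relabeling step that underlies HER can be sketched as follows; this is a simplified illustration assuming a goal-conditioned transition format and the "final" relabeling strategy, not the exact formulation of [24].

```python
import numpy as np

def her_relabel(episode, reward_fn):
    """Hindsight experience replay with the 'final' relabeling strategy.

    episode: list of (state, action, next_state, goal) tuples from one rollout.
    reward_fn(next_state, goal): sparse task reward, e.g., 0 on success, -1 otherwise.
    Returns extra transitions in which the originally intended goal is replaced
    by the state actually achieved at the end of the episode.
    """
    achieved_goal = episode[-1][2]          # final achieved state becomes the new goal
    relabeled = []
    for state, action, next_state, _ in episode:
        reward = reward_fn(next_state, achieved_goal)
        relabeled.append((state, action, next_state, achieved_goal, reward))
    return relabeled

# Example sparse reward: success if the state is within 0.05 of the goal.
reward_fn = lambda s, g: 0.0 if np.linalg.norm(np.asarray(s) - np.asarray(g)) < 0.05 else -1.0
```

Because the relabeled transitions end at their own "goal", at least some of them carry a non-negative reward, which is what turns a failed rollout into useful training signal.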
AC-based deep RL (DRL) approaches, including the deep deterministic policy gradient (DDPG) [29], twin-delayed DDPG (TD3) [30], and soft AC (SAC) [31] algorithms, are often used to optimize action sequences for robots. Vecerik et al. [32] proposed a general and model-free DDPG-based approach for solving high-dimensional robotics problems. Other studies [33,34] have proposed DDPG-based path-planning methods in which HER is used to overcome the performance degradation caused by sparse rewards. TD3 with traditional global path planning was employed in [35] to improve the generalization of the developed model. In [30] and [36], the TD3 and HRL methods were combined: the state–action space was divided in accordance with the information maximization criterion, and a deterministic option policy was then learned for each region of the state–action space. The SAC algorithm [31,37] uses a stochastic policy and obtains the optimal policy by maximizing an entropy-augmented objective. When HER is used with the SAC algorithm, sampling efficiency improves and favorable exploration performance is achieved.
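The features that distinguish TD3 from DDPG (clipped double-Q learning, target-policy smoothing, and delayed actor updates) can be summarized in a short sketch of the critic target computation; this is a generic illustration assuming PyTorch-style networks and tensors, not the exact implementation used in the studies cited above.

```python
import torch

def td3_critic_target(reward, next_state, done, actor_target, critic1_target,
                      critic2_target, gamma=0.99, noise_std=0.2, noise_clip=0.5,
                      max_action=1.0):
    """Compute the TD3 target value y = r + gamma * (1 - done) * min(Q1', Q2')(s', a~)."""
    with torch.no_grad():
        next_action = actor_target(next_state)
        # Target-policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)
        # Clipped double-Q learning: take the minimum of the two target critics.
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        target_q = reward + gamma * (1.0 - done) * torch.min(q1, q2)
    return target_q

# The actor and the target networks are updated only every few critic steps,
# which is the "delayed" part of TD3; DDPG omits all three of these mechanisms.
```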
In an environment containing multiple obstacles, the agent must learn how to avoid the obstacles and reach its destination without collision. The DWA [15] is a traditional method for static obstacle avoidance; for dynamic obstacles, predicting an obstacle's next movement involves complex conditions and equations and therefore requires a large amount of computing time. CADRL [38] and MRCA [39] are representative obstacle avoidance methods based on reinforcement learning. CADRL aims to avoid pedestrians and needs to obtain pedestrian position, speed, and body shape information; its disadvantage is that a pedestrian who is not detected cannot be avoided. MRCA can avoid various dynamic obstacles by using LiDAR distance information without detecting dynamic objects. In a multi-agent learning environment, however, MRCA has difficulty comprehensively avoiding dynamic obstacles that follow multiple different avoidance strategies or no avoidance strategy at all. Liang et al. [40] used multiple perception sensors, including 2-D LiDAR and depth cameras, for smooth collision avoidance. Choi et al. [41] proposed a framework for decentralized collision avoidance in which each agent independently makes its decision without communicating with the others. A collision avoidance/mitigation system [42] uses a predictive occupancy map to rapidly evaluate the potential risks associated with all surrounding vehicles and to maneuver the vehicle into a safer region in critically dangerous situations. A two-stage RL-based collision avoidance approach was proposed in [43]: the first stage is supervised training with a loss function that encourages the agent to follow the well-known reciprocal collision avoidance strategy, and the second stage uses a conventional RL training method to maximize the reward function and refine the policy. A hybrid algorithm [44] combining RL and force-based motion planning was presented to solve the distributed motion-planning problem in dense and dynamic environments; its reward function yields conservative behavior by adjusting the collision punishment to decrease the chance of collision in crowded environments. Many studies [26,40,41,45] impose a large penalty on the reward when the robot collides with or approaches an obstacle. The survival penalty imposes a negative reward at each time step, which encourages the robot to reach the goal as quickly as possible. Obstacle avoidance based solely on reinforcement learning may lead to path-finding problems in some cases; to solve this problem and improve navigation efficiency, the path planner is integrated with reinforcement learning-based multi-obstacle avoidance.
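As an illustration of the survival-penalty idea, a dense reward of this family might look like the sketch below; the constants, distance terms, and function name are our assumptions for illustration and are not the reward function proposed in this paper.

```python
def survival_penalty_reward(dist_to_goal, prev_dist_to_goal,
                            min_obstacle_dist, reached_goal, collided):
    """Example dense reward with a per-step survival penalty.

    A small negative reward is returned at every step, so the return shrinks
    as the episode grows longer and the agent is pushed to reach the goal fast.
    """
    if collided:
        return -100.0                                   # large terminal penalty
    if reached_goal:
        return 100.0                                    # terminal bonus on success
    reward = -0.1                                       # survival penalty each step
    reward += 2.0 * (prev_dist_to_goal - dist_to_goal)  # reward progress to the goal
    if min_obstacle_dist < 0.5:                         # soft penalty near obstacles
        reward -= 0.5 * (0.5 - min_obstacle_dist)
    return reward
```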
To address the severe problem of reward sparsity in complex environments, survival penalties can be applied as rewards to ensure successful learning. In this study, a survival penalty-based approach is used within an AC framework to solve a motion-planning problem in which the target position and orientation must be reached simultaneously. We design a comprehensive reward function and evaluate it in three navigation scenarios to ensure that the goal can be approached while obstacles are avoided dynamically.
The remainder of this paper is organized as follows. Section 2 details the navigation problem and the requirements for a WMR. Section 3 briefly describes two AC frameworks, namely, DDPG and TD3; in the aforementioned section, we elaborate on the observation state, action space, and reward function and then construct the environment and agent for DRL. Section 4 presents details on the proposed model's training and evaluation and on simulation results for four scenarios. Section 5 discusses the rationality of the survival penalty function and the potential applications of this study. Finally, Section 6 provides the conclusions of this study.
2. Problem Statement and Preliminaries
A path is a route from one point in space to another. The path-planning problem is closely related to the collision avoidance problem. Although most studies have aimed to solve the path-planning problem in a flexible and efficient manner, they have considered only motion planning involving position transitions, without considering orientation-related tasks. In this paper, we propose a planning approach in which orientation and localization tasks are tackled simultaneously to further investigate the potential of WMRs; attitude adjustment is addressed as part of the final goal of the WMR's task. As illustrated in Figure 1, our goal is for the WMR to move in a timesaving manner and without colliding with obstacles while attaining the required orientation. The proposed method can be applied to many situations. For example, the operation of an automated parking system is a position and orientation co-localization problem: the system must locate the target parking space, define a path that avoids obstacles, and park in the space with the correct orientation and high accuracy [46]. The proposed method involves the planning of collision-free paths during precise directional navigation. To meet the requirements for position and orientation transformation, kinematic constraints as well as position and attitude constraints are considered.
As depicted in Figure 1, a differential-drive WMR consists of two rear driving wheels parallel to the back-to-front axis of the body and a caster wheel that supports the robotic platform. The WMR is moved by two motors that directly drive the wheels. The global inertial frame is fixed in the plane of motion, and the moving WMR carries a local body frame. The global inertial frame and local body frame are denoted by X–Y and x–y, respectively. The origin of the local frame's coordinate system is the center of gravity of the WMR. The x-direction is the direction that the WMR is facing, whereas the y-direction points to the left when the WMR is viewed facing forward. The WMR is assumed to be in contact with a nondeformable horizontal plane on which it purely rolls and never slips during motion.
The attitude of the WMR is described by the following generalized coordinate vector with five components: $q = [x_c, y_c, \theta, \phi_r, \phi_l]^T$, where $(x_c, y_c)$ are the inertial coordinates of the WMR's center of mass, $\theta$ is the yaw angle of the WMR relative to the horizontal inertial axis $X$, and $\phi_r$ and $\phi_l$ are the rotation angles of the right and left driving wheels, respectively. The width of the WMR is $2b$, the distance between the center of mass and geometric center of the WMR platform is $d$, and the driving wheel radius is $r$. As shown in Figure 1, the centroid $C$ of the WMR does not coincide with its geometric center $O$. The WMR is subject to three nonholonomic constraints when its wheels are rolling and not slipping:
$$A(q)\dot{q} = 0, \tag{1}$$
where
$$A(q) = \begin{bmatrix} -\sin\theta & \cos\theta & -d & 0 & 0 \\ \cos\theta & \sin\theta & b & -r & 0 \\ \cos\theta & \sin\theta & -b & 0 & -r \end{bmatrix}$$
is the nonholonomic constraint matrix.
Let $\omega_r$ and $\omega_l$ represent the angular velocities of the right and left driving wheels, respectively. If $\eta = [\omega_r, \omega_l]^T$ is considered a control input, the kinematic model of the WMR [47] can be expressed as follows:
$$\dot{q} = S(q)\eta, \tag{2}$$
where
$$S(q) = \frac{r}{2b}\begin{bmatrix} b\cos\theta - d\sin\theta & b\cos\theta + d\sin\theta \\ b\sin\theta + d\cos\theta & b\sin\theta - d\cos\theta \\ 1 & -1 \\ 2b/r & 0 \\ 0 & 2b/r \end{bmatrix}.$$
The line velocity and angular velocity of the WMR at point $O$ are denoted by $v$ and $\omega$, respectively. The relationship between $v$, $\omega$, $\omega_r$, and $\omega_l$ is as follows:
$$\begin{bmatrix} \omega_r \\ \omega_l \end{bmatrix} = \frac{1}{r}\begin{bmatrix} 1 & b \\ 1 & -b \end{bmatrix}\begin{bmatrix} v \\ \omega \end{bmatrix}. \tag{3}$$
Equation (3) is substituted into (2) to obtain the ordinary form of a WMR with two actuated wheels, which is expressed as follows:
$$\dot{q} = \bar{S}(q)u, \quad u = [v, \omega]^T, \tag{4}$$
where
$$\bar{S}(q) = \begin{bmatrix} \cos\theta & -d\sin\theta \\ \sin\theta & d\cos\theta \\ 0 & 1 \\ 1/r & b/r \\ 1/r & -b/r \end{bmatrix},$$
which satisfies the equation $A(q)\bar{S}(q) = 0$.
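For simulation purposes, the ordinary-form kinematics (4) can be integrated numerically as in the following sketch; the forward-Euler step, time step, and parameter values are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np

def wmr_step(q, v, omega, d, b, r, dt=0.05):
    """Forward-Euler update of the WMR state q = [x_c, y_c, theta, phi_r, phi_l]
    under the ordinary-form kinematics q_dot = S_bar(q) @ [v, omega]."""
    x_c, y_c, theta, phi_r, phi_l = q
    q_dot = np.array([
        v * np.cos(theta) - d * omega * np.sin(theta),   # x_c rate
        v * np.sin(theta) + d * omega * np.cos(theta),   # y_c rate
        omega,                                           # yaw rate
        (v + b * omega) / r,                             # right-wheel rate
        (v - b * omega) / r,                             # left-wheel rate
    ])
    return q + dt * q_dot

# Example: drive forward at 0.5 m/s while turning left at 0.3 rad/s for one step.
q = np.zeros(5)
print(wmr_step(q, v=0.5, omega=0.3, d=0.1, b=0.25, r=0.05))
```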
5. Discussion
The reward function is an incentive mechanism that tells the agent what is correct and what is wrong through rewards and penalties. RL applications can be transformed into sequential decision-making problems in which states, actions, and rewards are established, and the agent's goal is to maximize the total reward. Designing and implementing a reward function that is consistent with an application is challenging. For example, in the application of RL to electric vehicle battery-charging management [50], positive rewards are used so that the management strategy plays a major role in coordinating the charging and discharging mechanism and effectively achieves a safe, efficient, and reliable power system. In autonomous navigation applications, as shown in Figure 5, negative rewards can encourage agents to move to the destination as quickly as possible. In contrast, positive rewards may lead the agent to prolong the navigation time (number of steps) in order to accumulate the maximum reward. In our study, because the navigation time varies greatly among navigation situations, it is not directly designed as a penalty parameter.
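A small numerical comparison makes this point concrete; the step rewards, goal bonus, and episode lengths below are hypothetical values chosen for illustration.

```python
# Hypothetical values: per-step reward, episode length, and a goal bonus of +10.
def episode_return(step_reward, n_steps, goal_bonus=10.0):
    return step_reward * n_steps + goal_bonus

# Positive step reward: the slow agent accumulates the larger return.
print(episode_return(+0.1, 20), episode_return(+0.1, 200))   # about 12 vs 30
# Survival penalty (negative step reward): the fast agent is preferred.
print(episode_return(-0.1, 20), episode_return(-0.1, 200))   # about 8 vs -10
```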
TD3 and DDPG are model-free, off-policy RL algorithms. Model-free algorithms learn directly from experience and can operate without complete knowledge of environmental dynamics or transition models. Simulated environments are the most common means of training agents. The kinematic model of the WMR is based on a two-dimensional framework in the x- and y-directions. When a vehicle travels on uneven terrain, it generates significant vertical vibrations, and movements in the z-direction may affect the precise positioning of the vehicle, which is a key requirement for autonomous navigation. Bicycles, three-wheeled cars, and four-wheel-drive cars are real vehicles with different motion structures. The simulation environment could use an accurate vehicle model to generate relevant experience, but this would require considerable computing time to solve the nonlinear vehicle equations. The agent relies on sampling to estimate the value function, which results in noisy estimates and slower convergence, but model-free algorithms can be used in large, complex environments. Our approach can be extended to complex 3D applications such as UAVs or robotic manipulators. The goal of path-planning techniques is not only to discover optimal, collision-free paths but also to minimize quantities such as path length, travel time, and energy consumption, which requires further research. Using multiple robots to conduct collaborative multi-observation of multiple moving targets is a future research direction.
The main contributions of this study are summarized as follows:
- (1)
Appropriate reward functions are designed by considering human driving experience and using a survival penalty method so that a wheeled mobile robot (WMR) receives negative reward feedback at every step. The reward function varies continuously with the observation and action. Dense reward signals are found to improve convergence during training, and this finding verifies the feasibility of solving the sparse reward problem encountered in autonomous driving tasks for WMRs.
- (2)
When the WMR collides with an obstacle, two consecutive RL episodes are connected to increase the cumulative penalty. The agent is thereby prevented from selecting the wrong path, learns a path in which obstacles are avoided, and successfully and safely reaches the target.
- (3)
The effectiveness and robustness of the proposed method are evaluated in three navigation scenarios. The simulation results indicate that the proposed method can effectively solve the problem of planning a path along which a WMR can drive to its destination in a complex environment. The results also indicate that the proposed method is robust and self-adaptive.
- (4)
By using the DDPG and TD3 algorithms with the same AC framework, the WMR can safely and quickly reach its target while aligning its position and orientation simultaneously. In the same environment and under the same standard, the TD3 algorithm requires considerably fewer computations than the DDPG algorithm and has higher real-time performance.
- (5)
The proposed path-planning method can handle orientation and positioning simultaneously. In an environment without map information, the intermediate waypoints between the departure point and the destination do not need to be predetermined segment by segment, and an end-to-end obstacle-free autonomous U-turn path can be realized. End-to-end vehicle navigation has the potential to be used in high-end industrial robot applications such as electric vehicle parking and forklift picking and placing.