1. Introduction
Due to their high flexibility, quadrotor unmanned aerial vehicles (QUAVs) have gained significant popularity in various applications, including parcel delivery [1], precision agriculture [2,3,4], search and rescue [5,6], and surveillance [7]. In these scenarios, the QUAV is typically required to autonomously navigate to a target position. However, in dense environments, unexpected obstacles can obstruct the path and lead to collisions. This task is even more challenging in a multi-agent system [8,9]. Therefore, obstacle avoidance becomes a crucial task for ensuring safe path planning. The QUAV must find an unobstructed path to the target while considering its physical limitations. Conventional approaches to obstacle-free path planning include Dijkstra, probabilistic roadmap (PRM), rapidly-exploring random trees (RRT), and artificial potential field (APF) [10,11,12,13]. Since these approaches do not assume the physical realization of the agent to be of a certain type, they are applicable to QUAV navigation. APF faces a critical challenge in dense environments: when multiple obstacles narrow the passageway to the target, the agent may oscillate because the repulsive forces created by the obstacles conflict with the attractive force towards the target. As a result, reaching the target smoothly becomes difficult.
Most existing approaches to reducing APF oscillations either rely on second-order optimization theory techniques or employ an escaping mechanism when the agent detects oscillations. However, these approaches are not fully compatible with a hierarchical path planning framework because they directly modify the forces applied to the QUAV without considering its often nonlinear dynamics. Additionally, they commonly attribute oscillations to local minima, which is not always the case. One possible solution to this problem is to utilize global obstacle information to bend an imaginary rubber band path [14]. However, this approach poses challenges for real-time decision-making, as it requires prior knowledge of obstacle positions.
Based on the symmetrical motion controller design in [15], our previous work [16] enabled the utilization of an APF in a hierarchical framework, utilizing only local information provided by sensors mounted on the QUAV. In this framework, an APF path planner generates a state that incorporates modified position errors, while low-level motion controllers are responsible for both position and attitude control. In [17], we introduced the Hessian matrix of the APF as a damping term, which effectively dissipates system energy, and we proved its stability using an energy-based Lyapunov function. In the current work, we utilize the time-derivative of the APF to further reduce oscillations in QUAV path planning within a hierarchical control framework.
Considering that the QUAV dynamic system is nonlinear and strongly coupled, we use deep reinforcement learning (DRL) to train the low-level motion controllers [18]. DRL not only alleviates the labor-intensive process of parameter tuning, as compared to the traditional proportional–integral–derivative (PID) method, but also demonstrates remarkable fitting capability with the aid of deep neural networks. The key contributions of this study are outlined as follows.
We incorporate the time-derivatives of the potential function to account for the velocities of intermediate target points, thereby resolving the conflict between the fixed-point chasing logic of the DRL motion controller and the potentially erratic movements of the intermediate target points.
To enhance the control performance of a high-dimensional nonlinear system, we reconfigure the states of the DRL motion controller to eliminate any asymmetry and ensure stability.
We conducted comparative simulation experiments to validate the effectiveness of the proposed method in reducing oscillations and preventing overshooting.
The complex dynamics of the QUAV system are considered, in contrast to approaches that either do not discuss agent dynamics or simplify them to a second-order point-mass model.
The structure of this paper is outlined as follows. In Section 2, we provide a comprehensive overview of the existing research on obstacle-free path planning. In Section 3, we introduce the necessary background knowledge, including the principles of conventional artificial potential fields, drone dynamics, and deep reinforcement learning. Section 4 elaborates on the methodology that we have employed in this study. We present the experiments conducted in Section 5 to validate the effectiveness of the proposed method in reducing oscillation and overshooting. Section 6 discusses limitations of this work and its applicability to dynamic and irregular environments. Finally, in Section 7, we conclude our findings.
2. Related Work
We identify three major approaches to achieve obstacle-free path planning. Since they do not set restrictions on the type of agents being used, these approaches are applicable to QUAV obstacle avoidance. The first approach is grid search, exemplified by algorithms such as Dijkstra, A*, and D* [10,19,20]. These algorithms are guaranteed to find a viable least-cost path when the environment is finite. However, they suffer from the curse of dimensionality, as the number of vertices to be explored grows exponentially with the dimensionality of the working space. Additionally, the need for higher control resolutions further increases the computational burden. The second approach is sampling-based search, with PRM [11] and RRT [12] being typical examples. For example, Farooq et al. [21] guided the QUAV to avoid dangerous zones in a dynamic environment by computing the PRM. Compared to grid search, these algorithms are capable of quickly producing paths in high-dimensional spaces. However, since sampling-based methods rely on global information, such as environment boundaries, they are not well-suited for real-time control where sensor data are used to generate actions for the next step. The third approach is APF, which assumes that both the target and obstacles generate potential fields that influence the movement of the agent. By leveraging the gradient of the summed potential field, the agent is attracted to the target while being repelled from nearby obstacles. For example, Ma'arif et al. [22] used APF to guide a single QUAV to reach the target and avoid collisions with dynamic obstacles. APF is widely used in QUAV navigation due to its ease of implementation and ability to guide the QUAV using only local information.
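To make this mechanism concrete, the following minimal sketch computes one APF step for a point agent with the textbook quadratic attractive field and inverse-distance repulsive field; the gains and field shapes are illustrative only and are not the formulation used later in this paper.

```python
import numpy as np

def apf_step(pos, goal, obstacles, k_att=1.0, k_rep=0.5, d0=2.0):
    """One textbook APF step: move along the negative gradient of the summed field.

    pos, goal: (2,) arrays; obstacles: list of (2,) arrays; d0: obstacle influence range.
    """
    # Gradient of the attractive field U_att = 0.5 * k_att * ||pos - goal||^2
    grad = k_att * (pos - goal)
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 0.0 < d < d0:
            # Gradient of the repulsive field U_rep = 0.5 * k_rep * (1/d - 1/d0)^2
            grad += -k_rep * (1.0 / d - 1.0 / d0) * diff / d**3
    return -grad  # steepest-descent direction, i.e., the "force" acting on the agent

# Usage: force = apf_step(np.array([0., 0.]), np.array([5., 0.]), [np.array([2., 0.3])])
```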
Despite its numerous advantages, APF has certain inherent limitations [23]. One recognizable limitation is that the agent can easily become trapped in local minima, where the attraction and repulsion forces cancel each other out. Algorithms designed to address this problem can generally be classified into three categories. The first category is local minimum removal. For example, Kim et al. [24] employed harmonic functions to construct potential fields, allowing for the selection of singularity locations and the elimination of local minima in free space. The second category is local minimum escape. For instance, Park et al. [25] combined APF with simulated annealing, which introduces randomness to the agent's actions, enabling it to escape from local minima. Wang et al. [26] proposed the Left Turning scheme, which effectively handles U-shaped obstacles and helps the agent escape local minima. Lai et al. [27] proposed a dynamic step-size adjustment method to help multiple UAVs escape local minima. The third category is local minimum avoidance. For instance, Doria et al. [28] utilized the deterministic annealing approach to expand the repulsive area and shrink the attractive area. This allows the agent to avoid local minima at the beginning, when the potential function is convex due to a high initial temperature. Additionally, Ge et al. [29] addressed a specific case of the local minimum problem known as goals non-reachable with obstacles nearby (GNRON). They considered the relative distance between the agent and the target when designing the repulsive potential function. As the agent approaches the target, this function approaches zero, thereby reducing the repulsive force in the target's vicinity and overcoming the local minimum problem. Overall, these different approaches highlight the efforts made to overcome the limitations of APF and improve its performance in various scenarios.
APF faces another dilemma, which is the occurrence of oscillations in narrow passages with densely distributed obstacles nearby. Previous research on oscillation reduction is relatively limited and mainly draws inspiration from optimization theory techniques. For instance, Ren et al. [30] proposed the use of Levenberg's modification of Newton's method as a solution to oscillation problems. This approach incorporates second-order information by utilizing the Hessian matrix. Additionally, they adjusted the control law to maintain a constant speed, ensuring smooth movement of the agents. Biswas et al. [31] further compared first-order gradient descent methods with two second-order approaches and concluded that Levenberg–Marquardt is more effective in generating smoother trajectories and improving convergence speeds.
The second branch of oscillation reduction algorithms introduces the concept of virtual obstacles or targets. Similar to the local minimum escape (LME) techniques discussed above, methods belonging to this branch often employ escaping techniques once oscillations are detected. For instance, Zhao et al. [32] enhanced the manipulator's predictive ability by incorporating dynamic virtual target points and utilized an extreme point jump-out function to escape oscillations. Zhang et al. [33] employed tangent APF to avoid local oscillations and introduced the back virtual obstacle setting strategy-APF algorithm, which enables the agent to return to previous steps and withdraw from concave obstacles. In a rule-based fashion, Zheng et al. [34] specified the condition for adding obstacles, compelling the resultant force to deflect when its angle to the obstacle center is too small. The dynamic step-size adjustment method proposed by Lai et al. [27] is also able to escape the jitter area where a local minimum is encountered, but it does not address oscillations in other cases, such as narrow passageways.
Oscillations can also be mitigated by modifying the repulsive forces in a certain manner. Tran et al. [35] estimated the projection of the repulsive force vector onto the attractive force vector and subtracted this component to prevent the agent from moving in the opposite direction of the attractive force. Martis et al. [36] introduced vortex potential fields to achieve seamless cooperative collision avoidance between mobile agents, and these were validated using Lyapunov stability analysis. For larger obstacles with irregular shapes, Szczepanski [37] combined the benefits of repulsive APF and vortex APF by defining multiple layers around the surface area of the obstacles. This approach surpassed traditional APF and pure vortex APF in terms of path smoothness.
There are also works specifically designed for QUAV path planning and obstacle avoidance. For example, Meradi et al. [38] proposed a nonlinear model predictive control method based on quaternions for QUAVs' obstacle avoidance. Valencia et al. [39] constructed a QUAV obstacle detection and avoidance system using a monocular camera. Gageik et al. [40] discussed the use of complementary low-cost sensors in QUAV collision avoidance. However, the problem of oscillation reduction in QUAV obstacle-free path planning based on APF is largely under-explored.
With the advancement in DRL technology, motion controllers based on DRL have gained widespread usage in QUAV path planning and obstacle avoidance by virtue of their strong fitting capability. In this context, the APF can be seen as an upper layer that indirectly or directly influences the agent's current pursuit position, while the low-level motion control task is handled by the DRL agent. For instance, the RL environment may employ convolutional neural networks to receive information about the surrounding potential energy and generate estimated rewards for various actions [41]. Xing et al. [42] combined the enhanced APF method with deep deterministic policy gradient to navigate autonomous underwater vehicles. In our previous research, we developed a hierarchical framework where the APF path planner was utilized to modify the position errors perceived by the DRL motion controllers, effectively altering the target position [16]. In this study, we further address oscillations by considering the velocity of the virtual target point, which has been validated in a collision-free formation control task [17]. Experimental results demonstrate that this improved algorithm reduces oscillations compared to approaches that do not incorporate second-order information, such as velocities.
4. Methodology
4.1. Motion Controller Training
As shown in Figure 4, motion control is divided into two hierarchies: translational control and angular control. Due to the presence of gravity, vertical control differs inherently from horizontal control. Hence, we further subdivide translational control into vertical and horizontal control. As the three angular controllers adhere to the same control logic, and likewise the two horizontal controllers, only three networks are required, with separate actor-network parameters for angular, vertical, and horizontal control. Their states are designed around the corresponding tracking errors, together with two auxiliary variables that characterize the pitch and the roll. There are a few points to note here. First, the target position and the desired yaw are given by external sources, while the desired roll and pitch are calculated from the outputs of the translational controllers. Second, horizontal states not only consider the corresponding position and velocity errors but also take into account height and angular information. Third, angular errors are defined using the rotation matrix instead of the angular differences; they are calculated with the vee map, the inverse of the hat map, from the actual rotation matrix and the desired rotation matrix, which is uniquely determined by the desired roll, pitch, and yaw.
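The original state equations are not reproduced here. As a hedged illustration of the rotation-matrix error described above, the sketch below uses the attitude error that is standard in SE(3) geometric control, together with the vee map as the inverse of the hat map; whether this is exactly the expression used in the paper is an assumption, and the Euler-angle convention is likewise assumed.

```python
import numpy as np

def vee(S):
    """Vee map: inverse of the hat map, extracting a vector from a skew-symmetric matrix."""
    return np.array([S[2, 1], S[0, 2], S[1, 0]])

def attitude_error(R, R_d):
    """Rotation-matrix attitude error commonly used in geometric (SE(3)) control."""
    return 0.5 * vee(R_d.T @ R - R.T @ R_d)

def rotation_from_euler(roll, pitch, yaw):
    """Desired rotation matrix from desired roll, pitch, and yaw (ZYX convention assumed)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx
```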
To obtain the desired roll and pitch, the translational controllers' networks first generate desired forces along each axis, with the networks' outputs bounded by the maximum translational acceleration. Then, the desired roll and pitch can be calculated as in [15]. The ultimate desired thrust and torques are generated following Equation (15), where the corresponding output bound is the maximum angular acceleration. The desired steady-state rotor speeds are then found by taking the inverse of the matrix in Equation (5), and the PWM signals are eventually obtained.
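The relations from [15] are not reproduced above. The following sketch shows a commonly used quadrotor mapping from desired per-axis accelerations and the desired yaw to the desired roll, pitch, and total thrust, under the assumption of a standard rigid-body model with mass m and gravity g; the paper's own equations may differ in details such as normalization or sign conventions.

```python
import numpy as np

def desired_attitude_and_thrust(a_des, yaw_d, m=0.027, g=9.81):
    """Map desired accelerations (ax, ay, az) and desired yaw to (roll_d, pitch_d, thrust).

    A common quadrotor relation; m defaults to roughly the Crazyflie 2.0 mass (about 27 g).
    """
    ax, ay, az = a_des
    # The total thrust must also compensate gravity.
    f = m * np.sqrt(ax**2 + ay**2 + (az + g) ** 2)
    sy, cy = np.sin(yaw_d), np.cos(yaw_d)
    roll_d = np.arcsin(m * (ax * sy - ay * cy) / f)
    pitch_d = np.arctan2(ax * cy + ay * sy, az + g)
    return roll_d, pitch_d, f
```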
Equation (16) defines the reward function for the angular, vertical, and horizontal reinforcement learning environments. In this equation, the error at the current time step carries a subscript indicating the specific controller to which it is applied. The agent receives a reward when the error approaches zero and is penalized otherwise. One notable advantage of this reward design is the absence of hyperparameters that require manual fine-tuning.
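Equation (16) itself is not rendered above. As an illustration only, a hyperparameter-free reward that is consistent with the description (positive when the absolute error shrinks toward zero, negative otherwise) could be shaped as the normalized decrease of the absolute error, as sketched below; this is an assumption about the form, not the paper's exact equation.

```python
import numpy as np

def shaping_reward(err_prev, err_curr, eps=1e-8):
    """Hyperparameter-free shaping: reward in (-1, 1), positive when |error| decreases.

    Illustrative only; the paper's Equation (16) may use a different form.
    """
    e0, e1 = np.abs(err_prev), np.abs(err_curr)
    return (e0 - e1) / (e0 + e1 + eps)  # eps is only a numerical guard
```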
During the training process, the reinforcement learning environments receive the agent's action and output the next state as well as the reward based on the current state. In this work, the pitch and x-axis controllers are chosen as the prototypes for the angular and horizontal controllers, meaning that their states stand in for the generic angular and horizontal states during training. The training follows a certain order. First, the pitch and vertical controllers are trained independently: when the pitch controller is being trained, the remaining desired forces and torques are fixed to nominal values, while when the vertical controller is being trained, the desired torques are set to zero. The x-axis controller is trained only after the pitch and vertical policies have been obtained, because the horizontal state contains height and pitch information, which is affected by the vertical and pitch controllers. No obstacles or external disturbances are considered during training.
4.2. State Transformation
The trained motion controllers have the ability to navigate the QUAV in the corresponding dimension. However, when the states in the testing phase are too large, it becomes challenging for the networks to generate reasonable outputs based on the limited training samples. This issue typically arises in translational control, where the position errors are not bounded, except by the simulator. Since it is not feasible to gather information for position errors of arbitrary magnitude within a finite training period, using the policy networks in a large or boundless environment can result in uncontrollable behavior, which is inherently dangerous. To mitigate the impact of large horizontal position errors, we employ a hard-clipping technique on the horizontal position errors, with a fixed clipping bound. Although no hard-clipping is applied to the vertical position error, its range is constrained due to the presence of a floor and a ceiling in the working space of this study.
As observed in [18], the trained DRL controllers decelerate to avoid overshooting, indicating that there exists a natural velocity upper bound: the controller accelerates the QUAV when its velocity is smaller than this bound and decelerates it when the velocity exceeds the bound. Considering that this natural bound is sometimes too large to guarantee safety in the testing phase, we manually set a velocity upper bound in the testing phase to prevent the QUAV from moving too quickly in free space. To increase data efficiency, we multiply the horizontal velocity errors by the ratio of the natural bound to the manual bound, which maps them from the manually bounded range to the natural range and therefore forces the actual velocity inputs to have a similar scale as the training samples.
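A minimal sketch of the two transformations described above, assuming placeholder symbol names for the unrendered notation: the horizontal position errors are clipped, and the velocity errors are rescaled so that testing-time inputs stay within the range seen during training. Only the 0.5 m clipping bound and the 2 m/s manual bound are values reported in the experiments; the natural bound here is a placeholder.

```python
import numpy as np

def transform_horizontal_state(e_pos, e_vel, clip_bound=0.5, v_nat=3.0, v_max=2.0):
    """Clip horizontal position errors and rescale velocity errors for testing.

    e_pos, e_vel: (2,) horizontal position/velocity errors.
    v_nat: natural velocity bound learned by the controller (placeholder value);
    v_max: manually set velocity bound (2 m/s in the experiments).
    """
    e_pos_clipped = np.clip(e_pos, -clip_bound, clip_bound)
    # Map velocity errors from the manually bounded range back to the natural range.
    e_vel_scaled = e_vel * (v_nat / v_max)
    return e_pos_clipped, e_vel_scaled
```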
In addition to the aforementioned issues, it is necessary to address the problem of anisotropic control performance. As depicted in Figure 5a, we denote the force exerted on the QUAV at position A and at position B. Ideally, if positions A and B are equidistant from the target at a distance r, we expect the two force vectors to have identical magnitudes and angles relative to the corresponding position error vectors; that is, the control performance should exhibit isotropy. However, the current motion controllers do not guarantee isotropy for two reasons: (1) the x-axis and y-axis controllers independently calculate the force components along each axis based on the corresponding per-axis errors, and (2) the policy networks are non-linear. To recover this property, we transform the horizontal coordinate system such that the force vector components are calculated radially and tangentially to the horizontal error vector. As shown in Figure 5b, the unit radial and tangential vectors are defined with respect to this error vector; the horizontal states in the testing phase are then expressed in terms of the radial and tangential position and velocity errors, and the horizontal forces defined in Equation (13) are modified accordingly. It should be noted that, up to now, the tangential position and velocity errors are zero, and there is no reason for the QUAV to move along the tangential direction, because no obstacles have been taken into consideration yet. As we shall see in the next section, these components will be further modified by the repulsive potential fields.
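As a sketch of this isotropy-preserving transformation, the code below forms the unit radial and tangential vectors from the horizontal position error and projects the position and velocity errors onto them; variable names are illustrative, since the paper's own symbols are not rendered above.

```python
import numpy as np

def radial_tangential_state(e_pos, e_vel, eps=1e-8):
    """Express horizontal errors in a frame aligned with the horizontal error vector.

    Returns (e_r, e_t, v_r, v_t) and the unit basis vectors (u_r, u_t).
    """
    norm = np.linalg.norm(e_pos)
    u_r = e_pos / (norm + eps)            # unit radial vector (towards the target)
    u_t = np.array([-u_r[1], u_r[0]])     # unit tangential vector (90 deg CCW rotation)
    e_r, e_t = norm, 0.0                  # by construction the tangential error is zero
    v_r, v_t = e_vel @ u_r, e_vel @ u_t
    return (e_r, e_t, v_r, v_t), (u_r, u_t)

def radial_tangential_to_xy(f_r, f_t, u_r, u_t):
    """Map the controller's radial/tangential force components back to the x-y axes."""
    return f_r * u_r + f_t * u_t
```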
4.3. APF for Network Controller
In a hierarchical framework, RL motion controllers are solely responsible for the navigation of the QUAV in open spaces and do not address higher-level tasks such as obstacle avoidance. In this study, we incorporate the use of APF to guide the RL motion controllers by influencing the state inputs, thereby indirectly altering the QUAV’s current pursuit position.
Figure 6 provides an illustration of how the target is updated, with the saturated target being the direct result of the clipping operation along the radial direction. Similar to traditional APF methods, any obstacle that falls within the QUAV’s sensing range generates a repulsive force. However, instead of directly applying this force to the agent, it is utilized to further modify the target’s position. This is accomplished by providing the motion controllers with modified position errors in the following manner:
It is important to note that distance clipping is performed prior to the modification of the errors by the repulsive forces, as depicted in Figure 6a. This implies that the radial and tangential position errors are not strictly limited by the distance d. This is done to ensure that the repulsive forces exerted by nearby obstacles have a significant impact on the movements of the QUAV; without this consideration, the actual influence of repulsion on the target would be insignificant, as shown in Figure 6b, particularly when the original target is located far away.
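The order of operations in Figure 6a can be sketched as follows: the radial error is first saturated to the clipping bound, and only then is the summed repulsive displacement added, so that nearby obstacles retain a significant influence even when the original target is far away. The repulsion term here is a placeholder for the field designed in the remainder of this section.

```python
import numpy as np

def modified_position_error(e_pos, repulsive_sum, clip_bound=0.5, eps=1e-8):
    """Clip the radial position error first, then add the repulsive modification.

    e_pos: (2,) horizontal position error towards the original target.
    repulsive_sum: (2,) summed repulsive displacement from all sensed obstacles.
    """
    norm = np.linalg.norm(e_pos)
    u_r = e_pos / (norm + eps)
    e_saturated = min(norm, clip_bound) * u_r   # saturated target along the radial direction
    return e_saturated + repulsive_sum          # modified error fed to the motion controller
```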
The newly-introduced repulsive forces are dynamic, as the geometric relationship between the QUAV and adjacent obstacles changes. This implies that the position of the target also changes dynamically. Therefore, it is not enough to solely consider the agent's current velocity in the motion controllers' states, as the target was assumed to be stationary during the training process. In this study, we calculate the derivative of the repulsive force with respect to time to determine the target's velocity. Using the chain rule, the velocity of the dynamic target can be expressed in terms of the horizontal distance to each obstacle and the Hessian matrix of each obstacle's repulsive field. A few properties are to be satisfied when designing the repulsive potential as a function of the obstacle distance x:
The potential should be monotonically decreasing with respect to x;
The magnitude of its first derivative should be monotonically decreasing with respect to x;
The magnitude of its first derivative should approach infinity at the obstacle surface and become zero at the boundary of the sensing range;
The potential should be second-order differentiable.
The first requirement ensures that the force is repulsive. The second property states that the influence of the obstacle weakens as the distance increases. The third property implies that the repulsion force approaches infinity at the surface of the obstacle, while it becomes zero when the obstacle is at the boundary of the sensing range. The final requirement guarantees the existence of the Hessian matrix. Since the repulsive APF is only applied through its derivative forms, it is reasonable to directly design the derivative itself, in which a manually set factor scales the intensity of repulsion.
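The exact field used in the paper is not rendered above. The sketch below uses one common choice that satisfies all four listed properties (monotone decrease, decaying derivative magnitude, a derivative that blows up at the obstacle surface and vanishes at the sensing range d, and second-order differentiability inside the range), scaled by the manually set factor; treat this precise form as an assumption.

```python
def repulsive_potential(x, d=2.0, eta=0.18):
    """A repulsive field satisfying the four listed properties (illustrative choice).

    x: horizontal distance to the obstacle surface; d: sensing range (2 m in the experiments);
    eta: repulsion scaling factor (0.18 in the experiments).
    """
    if x >= d:
        return 0.0
    return 0.5 * eta * (1.0 / x - 1.0 / d) ** 2

def repulsive_gradient_magnitude(x, d=2.0, eta=0.18):
    """Magnitude of the first derivative: blows up as x -> 0 and vanishes at x = d."""
    if x >= d:
        return 0.0
    return eta * (1.0 / x - 1.0 / d) / x**2
```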
Considering that the calculation of the Hessian matrices necessitates significant computational resources and is time-consuming, we approximate the target's velocity by the finite time difference of the repulsive modification. If the radial component of this target velocity points towards the target, we eliminate it, as in such cases the obstacle poses no threat to the agent. To be more specific, we define a distinct target velocity in the radial direction, as given in Equation (24). This technique prevents the modified position error from exceeding the bound d and therefore avoids overshooting. Ultimately, the horizontal states in the testing phase are modified to incorporate these target velocity terms.
This design enables the seamless integration of RL motion control and APF obstacle avoidance, without requiring any modifications to the underlying structure of the pre-trained motion controllers. The modification of states only occurs during the testing phase. What sets this strategy apart from traditional APF approaches is that, instead of directly applying forces to the QUAV, they are utilized to propel the target. This approach is equivalent to generating a sequence of waypoints that are carefully designed to avoid any potential obstacles.
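Putting the pieces together, the sketch below approximates the dynamic target's velocity by the finite time difference of the repulsive modification and cancels its radial component whenever that component points towards the target, as described above. The names and the exact cancellation rule are illustrative of Equation (24), not a verbatim reproduction.

```python
import numpy as np

def target_velocity(rep_curr, rep_prev, u_r, dt=0.02):
    """Approximate the dynamic target's velocity and cancel its harmless radial part.

    rep_curr, rep_prev: (2,) repulsive modifications at consecutive time steps;
    u_r: unit radial vector pointing from the QUAV towards the target.
    """
    v_target = (rep_curr - rep_prev) / dt          # finite-difference approximation
    v_radial = v_target @ u_r
    if v_radial > 0.0:                              # radial component points towards the target
        v_target = v_target - v_radial * u_r        # the obstacle poses no threat: drop it
    return v_target
```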
5. Simulation Experiments
5.1. Training Setup
The QUAV's dynamics followed Crazyflie 2.0 [46]. All agents were trained using TD3, whose hyperparameters are summarized in Table 1. The time step was set as 0.02 s to ensure control accuracy, while fixed upper bounds were imposed on the translational acceleration (in m/s²) and the angular acceleration (in rad/s²).
An MLP with two hidden layers, each consisting of 32 nodes, was utilized for each controller, with Adam as the optimizer. The targets were randomly sampled from uniform distributions over fixed angular and translational ranges. It is worth noting that the horizontal network was trained only after the angular and vertical networks had been obtained. The training stage of each controller comprised a total of 200 training episodes, with episode lengths set at 100 and 500 steps for angular and translational control, respectively. This ensured that the agent had sufficient time to reach the target before an episode concluded. Throughout the training phase, no obstacles or external disturbances were taken into consideration.
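For orientation, the sketch below shows how such a controller could be trained with a standard TD3 implementation (here stable-baselines3) using the 2×32 MLP and Adam optimizer mentioned above. The environment is a stand-in: the paper uses custom environments implementing the states and reward of Section 4.1, and the remaining hyperparameters here are placeholders rather than the values in Table 1.

```python
import gymnasium as gym
from stable_baselines3 import TD3

# Stand-in environment so that this snippet runs as written; in the paper, custom
# environments implement the angular/translational states and the reward of Section 4.1.
env = gym.make("Pendulum-v1")

model = TD3(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[32, 32]),  # two hidden layers of 32 nodes; Adam is the default optimizer
    learning_rate=1e-3,                     # placeholder; the actual hyperparameters are in Table 1
    verbose=0,
)
model.learn(total_timesteps=20_000)         # roughly 200 episodes x 100 steps for the angular controller
```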
5.2. Testing Setup
We designed three different controllers for all experiments. Controller 1 utilized the default setting, while controller 2 excluded the radial target velocity cancellation operation. Controller 3 solely employed APF to influence position errors, without affecting velocity errors.
The testing environment consisted of an open space with a floor and a ceiling that was 5 m high. Additionally, several cylindrical obstacles, each standing at a height of 5 m, were strategically placed around the center of the space. The positions of these obstacles were determined by uniformly sampling the distances of their axes from the center within a fixed range, and the radii of the cylinders were likewise uniformly sampled; a minimum separation between any two obstacles was enforced. For each trial, three QUAVs were utilized, which acted as dynamic obstacles for one another and generated APF repulsive forces similar to the fixed obstacles in the environment.
The QUAVs were initialized at fixed starting positions, with each target placed on the opposite side of the obstacle field. As the agents moved, proximity sensors mounted on them could detect nearby obstacles. The maximum sensing range d was 2 m and the position error clipping bound was 0.5 m, while the repulsive force scaling factor was set as 0.18. The manually set velocity upper bound was 2 m/s, and through experiments it was found that the natural velocity upper bound exceeded this value.
The agents faced the challenge of reaching the targets on the opposite side without any collisions. The success of navigation was determined by whether each agent entered the 0.1 m proximity of its corresponding target. We created four different environments with different combinations of the obstacle-spacing parameter and the obstacle number N. In each environment, we conducted 1000 trials for every controller setting. Based on these trials, we calculated the success rates and the average time taken for successful trials.
To evaluate path smoothness, we calculated the signed curvature κ = (ẋÿ − ẏẍ)/(ẋ² + ẏ²)^(3/2) of the projection of each actual path onto the horizontal plane, in which the time-derivatives of the positions are approximated by the differences between two time steps. It should be emphasized that κ may be either positive or negative, depending on the turning direction. Path oscillations are reflected by the occurrence of sharp turns, which correspond to large absolute curvatures. In other words, if a path contains multiple points with large absolute curvatures, it is oscillatory. In this work, the curvature threshold for a sharp turn is defined as 1000.
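The finite-difference curvature computation can be implemented as sketched below; the 1000 threshold for sharp turns is the one used in this work, while the function and variable names are illustrative.

```python
import numpy as np

def signed_curvature(xy, dt=0.02):
    """Signed curvature of a planar path sampled at fixed time steps.

    xy: (N, 2) array of horizontal positions; returns an (N,) array of curvatures.
    """
    x, y = xy[:, 0], xy[:, 1]
    dx, dy = np.gradient(x, dt), np.gradient(y, dt)
    ddx, ddy = np.gradient(dx, dt), np.gradient(dy, dt)
    denom = (dx**2 + dy**2) ** 1.5
    return np.where(denom > 1e-9, (dx * ddy - dy * ddx) / np.maximum(denom, 1e-9), 0.0)

def count_sharp_turns(xy, threshold=1000.0, dt=0.02):
    """Number of points whose absolute curvature exceeds the sharp-turn threshold."""
    return int(np.sum(np.abs(signed_curvature(xy, dt)) > threshold))
```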
5.3. Experiment Results
5.3.1. General Performance
Figure 7 is a top-down view of the agents' trajectories as they moved around in the four environments, where the circles represent cylindrical obstacles, the curves in orange, green, and blue illustrate the trajectories, and the stars at their ends are the targets. A smaller spacing parameter indicates a more challenging scenario, since the minimum width of the passageway between two obstacles is smaller, while a bigger N means there were more obstacles around the origin. It can be seen that the proposed hierarchical control framework is capable of navigating the agents to the corresponding targets smoothly without collisions or detours, even in the densest environment.
To validate the effectiveness of the proposed modifications, we conducted 1000 trials for the three controllers in various environments. The success rates and average reaching times are presented in Table 2. In the environment with low density (0.9, 9), all controllers achieved a high success rate. Among them, controller 2 demonstrated the fastest performance, with an average reaching time of only 22.801 s. This can be attributed to controller 2's tendency to accelerate more dramatically when the repulsive force contributes to the attractive force. However, as the density increased, controller 2 experienced a more significant decrease in success rate compared to controller 1. This is because the higher acceleration increases the risk of overshooting, which will be discussed in the following section. Furthermore, controller 3, which did not consider any velocity errors, exhibited a severe decline in performance. It not only took more time to reach the targets but also struggled to handle dense environments. This was due to the delay between the assumed target and the actual target, modified by the repulsive APF forces. As a result, oscillatory behaviors were observed. In summary, the introduction of velocity errors effectively mitigated oscillations, while the cancellation of the target velocity radial component enhanced the agents' ability to address overshooting and the resulting oscillations. However, this came at the expense of slower agent movement.
The method is also applicable to an environment with more QUAVs, as shown in Figure 8. The environment setting is similar to that discussed in the testing setup. In general, the agents are able to reach their targets without collisions. For example, in Figure 8b, where three pairs of agents move head to head, they can still reach the targets safely.
5.3.2. Oscillation Reduction and Overshooting Prevention
Table 3 lists the average number of points with an absolute curvature greater than 1000 for the three controllers over all successful trials. It can be seen that controller 3 demonstrates the worst performance in terms of path smoothness, which testifies that target velocity information plays an essential role in reducing path oscillations. On the other hand, controllers 1 and 2 are close, with the path of the former being slightly more oscillatory. It should be noted that overshooting is not equivalent to global path smoothness, since it only occurs at certain areas, such as the triangular trap. Therefore, the oscillations caused by overshooting cannot easily be reflected by path curvature.
To quantitatively analyze the effectiveness of the radial target velocity cancellation operation defined by Equation (24), another experiment setting was used. Concretely, three cylindrical obstacles with a radius of 1.5 m were randomly placed on a ring whose inner and outer radii were 2.3 m and 2.5 m, respectively. The three QUAVs were initialized inside the ring, and the corresponding targets were placed outside it. In this way, the QUAVs were at first trapped by the encircling obstacles, and the task was to escape the triangular trap. Ideally, the maximum distance between any QUAV and the corresponding target position is 8.6 m; any larger distance is caused by overshooting. In this experiment, overshooting cases are defined as cases where at least one QUAV's distance to its target exceeded 10 m, as shown in Figure 9a. Furthermore, we define trapped cases as cases where at least one QUAV's distance to the origin remained within 4 m, as shown in Figure 9b.
Table 4 shows the results. It can be seen that controller 1 encountered the fewest overshooting cases, which shows that the velocity cancellation operation helps prevent overshooting. Moreover, since this operation simplifies the obstacle environment by neglecting obstacles behind the agent, trapped cases are largely avoided as well.
An intuitive comparison of the three controllers can be found in Figure 10. In Figure 10b, the agent represented by the green trajectory experienced oscillation within the triangular area formed by obstacles 1, 2, and 3. The reason behind this oscillation was that the combined force exerted by obstacle 1 and obstacle 2 was directed towards the target and was not counteracted, even though these obstacles did not pose a threat to the agent. Consequently, the force towards the goal became excessively strong, leading to the agent overshooting in this direction. Furthermore, disregarding velocity information entirely resulted in more frequent oscillations. The reason was that, in this case, the agent assumed the intermediate target to be fixed, while in reality the intermediate target was affected by the relative position between the agent and the obstacles. Oscillations occurred when the intermediate target updated before the agent reached the previous one. This can be observed in Figure 10c, where all agents' trajectories exhibit a zigzag pattern, especially when maneuvering between obstacles. This increases the risk of collision with nearby obstacles.
5.3.3. The Effect of a Clipping Bound
In the aforementioned experiments, we utilized a position error clipping bound of 0.5 m. This choice was made based on previous experiments that indicated that it was the safest option. We varied the clipping bound for controller 1, ranging from 0.3 m to 1.0 m with an increment of 0.1 m. A total of 1000 trials were conducted with the environment parameters set at (0.7, 12). Additionally, we observed that, for clipping bounds of 0.3 m and 0.4 m, the agent's velocity was relatively low, resulting in a longer time to reach the target. To account for more trials that had the potential to reach the targets, we relaxed the time requirement for success. This adjustment allowed us to consider a broader range of trials. The results of these experiments are presented in Table 5. From the table, it can be observed that the average reaching time is inversely correlated with the clipping bound. Furthermore, the 0.5 m bound yielded the highest success rate. This can be attributed to the fact that, when the agent moved too slowly, it lacked the flexibility to handle imminent collisions. Conversely, when the agent moved too quickly, the outputs of the low-level flight controllers were not sufficiently strong to decelerate the agent in time, thereby increasing the likelihood of collisions.
5.3.4. Comparison to Existing Methods
This section compares the proposed method to two other existing methods in terms of success rate, average reaching time, average absolute curvature, and the number of sharp turns. The first method uses A* as the path planner and the SE(3) control law proposed in [45] as the motion controller. This method discretizes the environment into cubes with a side length of 0.2 m, also known as voxels. It should be noted that the environment information for this algorithm is known in advance instead of being obtained with sensors. The second method [47] is a hierarchical DRL method that uses a DQN network to generate waypoints based on the sensor information returned by 16 rangefinders. The waypoints then serve as the target points of a motion controller trained with PPO. The testing environment was a single-agent scenario, where the agent was initialized at [0, 4.5, 2.5] and the target position was [0, −4.5, 2.5]. Twelve cylindrical obstacles with a radius of 0.4 m were randomly placed around the origin, with the distances of their axes from the center sampled uniformly within a fixed range. The results are listed in Table 6, which also includes the performance of controller 3 in this environment. We can see that the proposed method outperformed the other two algorithms in terms of success rate and average reaching time, with a slightly more oscillatory path. Although path oscillation is an inherent problem with APF, it can be reduced to a large extent with the help of velocity information. After addressing this issue, the controller can navigate the QUAV in a safer and more efficient manner.
Figure 11 illustrates the paths of the four methods in the same environment. It can be seen that the proposed method’s path is smooth on a macro level and is far away from obstacles. The path of A* + SE(3) is mostly straight because the waypoints returned by A* are exactly on the centers of the voxels.
Figure 11c shows a similar pattern, since there are only 16 possible directions to choose from when updating the waypoints. The path of controller 3, on the other hand, is the most oscillatory one.