1. Introduction
Unmanned aerial vehicles (UAVs) are aircraft that fly autonomously without an onboard pilot and are notable for their convenience, flexibility, low cost, and wide range of applications [1]. Because no pilot is on board, UAVs can perform a variety of high-risk or otherwise inaccessible tasks in complex and changing environments, including terrain mapping, precision agriculture (PA), surveillance and reconnaissance, power line inspections, search and rescue (SAR), film shooting, aerial delivery, and military operations [2,3,4,5,6,7,8].
In the field of UAVs, the execution of numerous tasks fundamentally relies on target tracking. For instance, in security management, it is imperative to track, encircle, and apprehend criminals when a terrorist event occurs; in such circumstances, UAVs can track the suspects and even carry light weapons to assist the police. Another vivid case is search and rescue after a disaster. When natural disasters occur, the terrain in the affected area is complex and communication is interrupted, posing significant challenges to SAR tasks. In such scenarios, UAVs equipped with advanced sensors such as high-definition cameras and infrared thermal imagers can quickly enter these areas, conduct aerial patrols, and perform target tracking, exploiting their aerial superiority to provide valuable information for SAR operations.
With the rapid development of technology, UAVs have achieved significant improvements in intelligence, endurance, payload capacity, and resistance to interference, broadening their applicability across diverse scenarios. Despite these improvements, UAVs operating in SAR tasks often encounter unique challenges, such as infrastructure damage and severe natural conditions, which can disrupt communications. This disruption severely impacts the UAVs' ability to perceive their environment and diminishes the success rate of rescues. Furthermore, in denied environments, communication between UAVs and the ground command center is disrupted, so fast and reliable communication cannot be guaranteed. Therefore, each UAV must be able to complete its tasks independently and intelligently in these severe environments, achieving the highest possible degree of autonomy.
How UAVs can make rapid and effective maneuvering decisions based on real-time observations to accomplish tasks efficiently in unknown and complex environments has therefore emerged as a significant research topic. In recent years, advanced methods such as model predictive control, optimization-based methods, and particle swarm optimization (PSO) algorithms have been successfully applied to UAVs. These algorithms empower UAVs to navigate and complete complex tasks in unfamiliar environments by improving route planning and decision-making.
Faced with denied environments lacking external positioning systems, Abraham et al. [9] presented a method that enables a quadrotor helicopter equipped with a laser rangefinder to autonomously explore and map unstructured indoor environments. Mac et al. [10] introduced a method combining sensor fusion for localization with an improved potential field method for UAV obstacle avoidance, alongside a PID controller optimally tuned using multi-objective particle swarm optimization. However, these methods relied heavily on onboard sensors, which may limit performance in more complex or dynamically changing environments. Rothmund et al. [11] utilized scenario-based model predictive control for inspection drones to avoid obstacles in unknown environments. Although the UAV would change speed or take detours to avoid flying through potentially dangerous areas, the system was limited by its reliance on a pre-planned path, and the high computational cost of the method made it difficult to guarantee real-time performance. Similarly, Kulathunga et al. [12] introduced an optimization-based trajectory-tracking approach for multi-rotor aerial vehicles operating in unknown environments. The method used a dual-planner system consisting of a global planner and a local planner: the global planner adjusted the initial reference trajectory to avoid obstacles, while the local planner generated optimal control policies for following the adjusted trajectory. This approach relied on precise mathematical models, which can be a limitation in unknown environments where model parameters are often difficult to obtain accurately. Saccani et al. [13] introduced a multi-trajectory model predictive control scheme for UAV navigation in unknown, static environments, using LiDAR for obstacle detection and path planning that balanced safety and goal achievement. To coordinate unmanned surface vehicles (USVs) and UAVs for maritime parallel search, Li et al. [14] proposed a dynamic event-triggered control mechanism to reduce communication load and a sensor-tolerant control scheme to handle sensor faults; simulation experiments demonstrated its effectiveness in maintaining formation and achieving full coverage of the search area. However, the method depended on threshold tuning and assumed that the states of the USVs and UAVs were known, which is difficult to guarantee in practice, and it did not consider obstacles, limiting its applicability in complex environments.
In recent years, with the rapid development of artificial intelligence (AI), reinforcement learning (RL), a branch of machine learning, has achieved remarkable results in decision-making for complex tasks such as autonomous driving, robot control, game strategy optimization, and financial trading [15,16,17]. Unlike traditional optimal control methods, an RL agent learns a policy through interaction with the environment, aiming to maximize the cumulative reward over time. It does not necessarily require a known model of the system dynamics and adapts through trial and error, making it better suited to handling uncertainties and unknown dynamics.
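To make this interaction scheme concrete, the following minimal Python sketch uses a toy one-dimensional environment and a random placeholder policy (both purely illustrative and not part of any cited work) to show how an agent accumulates reward through trial-and-error interaction without a model of the system dynamics.

```python
import random

class ToyEnv:
    """Illustrative 1D environment: the agent should move toward the origin."""
    def reset(self):
        self.pos = random.uniform(-10.0, 10.0)
        return self.pos  # observation

    def step(self, action):
        self.pos += action                      # apply the chosen action
        reward = -abs(self.pos)                 # closer to the origin is better
        done = abs(self.pos) < 0.1              # episode ends near the origin
        return self.pos, reward, done

def random_policy(observation):
    """Placeholder policy; an RL algorithm would improve this from experience."""
    return random.uniform(-1.0, 1.0)

env = ToyEnv()
obs, total_reward = env.reset(), 0.0
for _ in range(100):                            # one episode of interaction
    action = random_policy(obs)
    obs, reward, done = env.step(action)
    total_reward += reward                      # cumulative reward to be maximized
    if done:
        break
print(f"cumulative reward: {total_reward:.2f}")
```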
Pham et al. [18] integrated a PID controller with Q-learning, a classical reinforcement learning algorithm, to manage a UAV's trajectory; this combination enabled the UAV to adjust its path dynamically based on real-time inputs, learning from the environment to enhance both navigation precision and efficiency. Li et al. [19] enhanced a UAV's ability to adapt swiftly to unpredictable target movements by integrating deep reinforcement learning (DRL) with meta-learning, significantly improving tracking accuracy and efficiency in diverse scenarios such as wildlife protection and emergency aid. For aerial robots performing search and rescue in unknown environments, Ramezani et al. [20] introduced an energy-aware hierarchical reinforcement learning approach that uses a predictive energy consumption algorithm to enhance mission sustainability and endurance. To address the real-time challenges in UAV decision-making, Zhao et al. [21] proposed adaptive and random exploration methods, ensuring that UAVs reach their target positions via reasonable and safe paths and enhancing their effectiveness and safety in dynamic environments. Kun et al. [22] utilized the deep deterministic policy gradient (DDPG) algorithm to direct UAVs toward fixed-point targets; however, their method did not address the guidance and tracking of moving targets. In response to this limitation, Alejandro et al. [23] developed a Gazebo-based reinforcement learning framework that applied the DDPG algorithm to continuous UAV landing on a moving ship, improving the UAV's adaptability and performance in dynamic environments. However, the initial position of the UAV was fixed in their simulations, which does not fully reflect autonomous maneuvering decision-making. To address UAV flight control in dynamic environments with random wind turbulence, Ma et al. [24] introduced an incremental reinforcement learning (IRL) algorithm that employs policy relief (PR) for exploration and significance weighting (SW) to enhance learning. The method was validated through simulations and real-world tests and demonstrated good tracking performance under wind disturbances, although it assumed a fully known environment and did not account for collision avoidance. Xia et al. [25] proposed an end-to-end cooperative multi-agent reinforcement learning (MARL) scheme for UAV target tracking, addressing limitations such as unknown trajectories and limited flight performance. Their work modeled communication between the UAVs in the swarm, improving coordination and efficiency, and enhanced tracking with an energy-saving strategy and spatial information entropy, outperforming deep reinforcement learning baselines in simulations. However, it requires position information, assumes half-duplex communication, and may struggle in communication-denied environments.
Faced with the complexity of autonomous decision-making for UAVs in unknown environments, and to enable UAVs to make decisions autonomously from limited observation information while maintaining strong generalization capabilities, this paper applies deep reinforcement learning to UAV SAR tasks. A two-stage target search and tracking method for UAVs based on deep reinforcement learning is proposed. The novelties of this paper are as follows.
- (1) A deep deterministic policy gradient with three critic networks (DDPG-3C) algorithm is proposed to alleviate the overestimation problem in the critic network of the traditional DDPG algorithm. This method enhances both training speed and effectiveness by adopting three critic networks and introducing an experience replay buffer mechanism.
- (2) A two-stage target search and tracking method is introduced to enhance the search success rate and tracking performance of UAVs in SAR tasks within unknown environments. This method divides SAR tasks into a search stage and a tracking stage, and the controller for each stage is trained with the proposed DDPG-3C algorithm.
- (3) A simplified two-dimensional SAR scenario is designed to demonstrate the practical application of the proposed methods and algorithms.
By introducing the DDPG-3C algorithm, this paper effectively alleviates the overestimation problem found in the traditional DDPG algorithm, leading to faster convergence and improved decision-making ability. The proposed two-stage target search and tracking method significantly improves the efficiency of UAV operations in SAR tasks, providing not only a more efficient search strategy but also better adaptability to the target's movements. Additionally, the development of a simplified two-dimensional SAR simulation environment provides a solid foundation for future research in more complex 3D environments.
The structure of this paper is as follows: Section 2 describes the search and rescue task and presents a simplified model of the SAR scenario. Section 3 introduces the proposed deep deterministic policy gradient with three critic networks (DDPG-3C) and its training process. Section 4 presents the proposed two-stage search and tracking method for SAR tasks. In Section 5, the effectiveness of the proposed DDPG-3C model and of the two-stage target search and tracking method is validated through extensive experiments. Finally, conclusions are drawn in Section 6.
5. Experimental Simulations
To validate the effectiveness of the proposed DDPG-3C model, which uses three critic networks to alleviate the overestimation problem, its performance is compared with two popular reinforcement learning baselines, DDPG [27] and TD3 (twin delayed deep deterministic policy gradient) [29], in the two-dimensional search and rescue (SAR) scenario described in Section 2. In addition, to validate the advantages of the proposed two-stage target search and tracking method, this paper compares it with a traditional single-stage reinforcement learning model in the same SAR scenario.
When training the DDPG, TD3, and DDPG-3C decision models, the hyperparameters are set according to Table 2.
In the following subsections, the DDPG, TD3, and DDPG-3C decision models are applied to the search stage and the tracking stage of the SAR task, and the traditional single-stage method is included for comparison. All algorithms are implemented in Python and run on a desktop computer with a GeForce RTX 4090 GPU and a 13th Gen Intel Core i9-13900KS CPU.
5.1. Search Stage Simulations
In the search stage, no target is present in the scenario while the decision model is being trained. The designed SAR scenario includes a single UAV with a sensor range of 200 m and two obstacles, each with a radius of 50 m. Both the obstacles and the UAV are positioned at randomly generated locations within a two-dimensional rescue area spanning 0 to 800 m. The obstacles are static, while the UAV is driven by horizontal and vertical thrust forces with a maximum action value of 400 N. The hyperparameters for the DDPG, TD3, and DDPG-3C decision models are listed in Table 2.
The training results of the DDPG-3C, TD3, and DDPG decision models in the search stage of the SAR task in denied environments are shown in Figure 7. The results are smoothed using an exponential moving average (EMA) to help visualize trends more clearly by reducing noise from the raw data, according to Equation (19):

$\tilde{R}_t = \alpha \tilde{R}_{t-1} + (1 - \alpha) R_t$, (19)

where $R_t$ is the raw reward at training step $t$, $\tilde{R}_t$ is the smoothed reward, and $\alpha$ is the smoothing factor: a value closer to 1 implies heavier smoothing, and a value closer to 0 implies lighter smoothing. Moreover, the detailed convergence rewards and convergence steps are listed in Table 3.
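For illustration, the EMA smoothing of Equation (19) can be computed with a few lines of Python; the function name and the example reward values below are chosen only for demonstration.

```python
def exponential_moving_average(values, alpha=0.9):
    """Smooth a reward curve; a larger alpha gives heavier smoothing."""
    smoothed = []
    prev = values[0]
    for v in values:
        prev = alpha * prev + (1.0 - alpha) * v   # recursion of Equation (19)
        smoothed.append(prev)
    return smoothed

# Example: smooth a short, noisy reward sequence
raw_rewards = [520, 610, 480, 590, 650, 570, 630]
print(exponential_moving_average(raw_rewards, alpha=0.9))
```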
It can be seen from Figure 7a and Table 3 that the DDPG-3C model, which utilizes three critic networks, shows a reward trajectory that steadily increases and stabilizes at approximately 600, indicating robust learning and performance stability. At the same time, the DDPG-3C model achieves optimal performance more rapidly than its counterparts. In contrast, the traditional DDPG model, which uses a single critic network, reaches a lower peak reward of around 560 and takes approximately 7 million training steps to converge, showing a slower learning rate compared with the DDPG-3C model. Similarly, the TD3 model, which uses twin critic networks, stabilizes at a reward of approximately 560 but requires about 4 million training steps to converge, demonstrating slower convergence than DDPG-3C but faster than DDPG.
When it comes to the estimation of target Q-values, the red line in Figure 7b represents the target Q-value for the DDPG-3C model. This value is calculated by discarding the highest of the three Q-value estimates and averaging the remaining two. Figure 7b shows that the target Q-value for DDPG-3C is consistently lower than that of the traditional DDPG and TD3, indicating that the proposed decision model can effectively alleviate the problem of overestimation.
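The NumPy sketch below illustrates this target calculation; the function and variable names are illustrative rather than taken from the paper's implementation. The highest of the three critic estimates is discarded for each sample, the remaining two are averaged, and the result is folded into the usual bootstrapped target.

```python
import numpy as np

def ddpg3c_target(q1, q2, q3, rewards, dones, gamma=0.99):
    """Target Q-value in the spirit of DDPG-3C: drop the largest of the three
    critic estimates per sample, average the remaining two, and form the
    standard bootstrapped target."""
    q_stack = np.stack([q1, q2, q3], axis=0)      # shape: (3, batch)
    q_sorted = np.sort(q_stack, axis=0)           # ascending along the critic axis
    q_conservative = q_sorted[:2].mean(axis=0)    # discard the highest estimate
    return rewards + gamma * (1.0 - dones) * q_conservative

# Example with a batch of four transitions
q1 = np.array([10.0, 8.0, 5.0, 7.0])
q2 = np.array([ 9.0, 9.5, 4.0, 6.5])
q3 = np.array([12.0, 7.0, 6.0, 9.0])
rewards = np.array([1.0, 0.5, -1.0, 0.0])
dones = np.array([0.0, 0.0, 1.0, 0.0])
print(ddpg3c_target(q1, q2, q3, rewards, dones))
```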
To compare the performance of these three models during the search stage, each model is tested across 10,000 episodes. The average coverage ratio and average collision rate are calculated to evaluate their effectiveness and safety. All models are initialized under the same conditions to maintain fairness in testing. The results are shown in Table 4. For the simulation video of the search stage, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view (Supplementary Materials).
According to the data presented in Table 4, the DDPG model recorded a collision frequency of 543, meaning that collisions occurred in approximately 5.43% of the episodes. The TD3 model demonstrated an improvement, with a collision frequency of 461 (4.61%). However, the DDPG-3C model achieved a significantly lower collision frequency, with a total of 349 collisions (3.49%). This improvement indicates that the DDPG-3C decision model is more effective at avoiding obstacles and is therefore safer during the search stage of SAR tasks. In terms of area coverage, the traditional DDPG model achieved a coverage rate of 87.92%, while the TD3 model achieved a slightly better coverage rate of 89.14%. However, the DDPG-3C model outperformed both, achieving a higher coverage rate of 90.17%. This indicates not only more thorough coverage per episode but also improved search efficiency.
5.2. Tracking Stage Simulations
In the tracking stage, the goal is to maintain continuous surveillance of a moving target within a defined two-dimensional area spanning 0 to 800 m. The scenario includes a single UAV that has complete perception of the environment’s states. The environment also contains two static obstacles, each with a radius of 50 m, randomly placed in the area. The UAV operates based on the horizontal and vertical thrust forces, with a maximum action value of 400 N, to keep track of the target without colliding with the obstacles. By providing the UAV with full environmental awareness, the model can learn the optimal strategies for tracking and obstacle avoidance without the constraints of sensor range.
The DDPG, TD3, and DDPG-3C models are trained under the same conditions using the hyperparameters listed in Table 2. Training results are smoothed using the exponential moving average (EMA) of Equation (19) to highlight trends more effectively. Moreover, the detailed convergence rewards and convergence steps are listed in Table 5.
Figure 8a,b displays the reward and target Q-value trajectories for the DDPG, TD3, and DDPG-3C decision models. Similar to the results from the search stage, the DDPG-3C model, with its three critic networks, converges more quickly, at about 3 million training steps, and attains a higher reward of approximately 200. In contrast, the TD3 model takes about 4 million training steps to converge, with a reward of approximately 180, and the traditional DDPG model converges at around 7 million training steps with a lower reward of roughly 160. These results suggest that DDPG-3C adapts better to the target's movements than the other models. The target Q-values for DDPG-3C, calculated by discarding the highest estimate and averaging the remaining two, are lower than those of the traditional DDPG and TD3. This offers a more conservative and accurate estimation, reducing the risk of overestimation.
To compare the performance of these three models during the tracking stage, each model is tested across 10,000 episodes. The average collision frequency and the tracking success rate, defined as the percentage of episodes in which the UAV successfully tracks and maintains the target within a range of 20 m, are calculated and compared. All models are initialized under the same conditions to maintain fairness in testing. The results are shown in Table 6. For the simulation video of the tracking stage, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view.
According to the data presented in Table 6, collisions occurred in approximately 3.16% of the episodes when implementing the DDPG decision model in the tracking stage, while the TD3 model showed an improvement with a collision frequency of 274 (2.74%). The DDPG-3C decision model demonstrated an even lower collision frequency of 189, corresponding to a collision rate of 1.89%. In terms of tracking success, the DDPG model achieved a tracking success frequency of 9357 out of 10,000 episodes, while the TD3 model performed slightly better with a tracking success frequency of 9397 (93.97%). However, the proposed DDPG-3C decision model outperformed both, achieving a tracking success rate of 95.12%. This higher success rate suggests that DDPG-3C, with its additional critic networks, provides more accurate value estimates and better decision-making ability than DDPG and TD3.
5.3. Whole Period
In this experiment, the whole period of the SAR task is considered, combining the search stage and the tracking stage. Initially, the UAV operates in search mode, aiming to extensively cover a defined two-dimensional area (0 to 800 m) to detect any potential targets. During this stage, the UAV lacks target and obstacle information and relies on its sensor system to detect obstacles and other objects. Once a target enters the UAV's radar range (within 200 m), the UAV switches to tracking mode and continuously tracks the moving target. The search stage and tracking stage use the DDPG-3C decision models trained in Section 5.1 and Section 5.2, respectively. The implementation process of the whole period is shown in Figure 4.
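A minimal sketch of this mode-switching logic is given below; the policy objects, the state representation, and the distance test are illustrative assumptions rather than the paper's exact implementation, and trained DDPG-3C actors would take the place of the placeholder policies.

```python
import math

SENSOR_RANGE = 200.0  # m, the UAV's radar range assumed in the scenario

def select_action(uav_state, target_position, search_policy, tracking_policy):
    """Run the search-stage controller until the target enters radar range,
    then hand control over to the tracking-stage controller."""
    if target_position is not None:
        dx = uav_state["x"] - target_position[0]
        dy = uav_state["y"] - target_position[1]
        if math.hypot(dx, dy) <= SENSOR_RANGE:
            return tracking_policy(uav_state, target_position)  # tracking stage
    return search_policy(uav_state)                             # search stage

# Example with placeholder policies (thrust commands in newtons)
search_policy = lambda s: (400.0, 0.0)
tracking_policy = lambda s, t: (200.0, 200.0)
uav_state = {"x": 700.0, "y": 700.0}
print(select_action(uav_state, (600.0, 650.0), search_policy, tracking_policy))
```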
Figure 9 shows the experimental results of the proposed two-stage target search and tracking method for SAR tasks, which include avoiding obstacles, searching for targets, and tracking the target. In Figure 9a–c, the UAV's initial position is (700, 700), with a random initial angle and an initial speed of 0. The target's initial position is (100, 100), with a speed of 30 m/s and an initial angle of
. The positions of the obstacles are (200, 200) and (400, 400). In Figure 9d–f, the UAV's initial position is also (700, 700), with a random initial angle and an initial speed of 0. The target's initial position is (100, 100), with a speed of 30 m/s and an initial angle of
, while the obstacles are located at (200, 80) and (350, 200).
It can be seen from Figure 9a,d that the UAV starts from a random position and searches the environment for potential targets. After the UAV detects a target, as illustrated in Figure 9b,e, it transitions into tracking mode. The UAV adjusts its path to approach the target while avoiding obstacles. In the tracking stage, as shown in Figure 9c,f, the UAV not only maintains its course towards the target but also adapts to changes in the target's speed. Additionally, it consistently avoids obstacles, as demonstrated in Figure 9c. For the simulation video of the whole period, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view.
Figure 10 presents the experimental results of the proposed two-stage target search and tracking method for SAR tasks, featuring scenarios with 5, 9, and 13 obstacles, respectively. In each scenario, the UAV's initial position is (700, 700), with a random initial angle and an initial speed of 0, and the target's speed is 30 m/s. In Figure 10a, the target's initial position is (400, 500), with an angle of
, and the obstacles are located at (100, 100), (100, 700), (700, 100), (700, 700), and (400, 400). In Figure 10b, four additional obstacles are added, positioned at (400, 100), (100, 400), (700, 400), and (400, 700), and the target's initial position is (400, 500), with an angle of
. In Figure 10c, the number of obstacles is increased to 13, evenly distributed throughout the SAR scenario, and the target moves at a speed of 30 m/s along a predetermined trajectory.
It can be seen from the figure that, in SAR scenarios with varying numbers of obstacles, the UAV successfully avoids obstacles during the search stage while covering the SAR area to locate the target. Once the UAV's radar detects the target, it switches to tracking mode. During the tracking stage, the UAV not only effectively tracks the target but also continuously avoids obstacles. For the simulation video of the whole period, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view.
To better illustrate the effectiveness of the proposed two-stage target search and tracking method for SAR tasks, this paper compares the models trained with the two-stage method against a model trained with a traditional single-stage reinforcement learning method. In the traditional single-stage SAR task, the UAV is required to search for and track the target as quickly as possible while avoiding obstacles. The traditional single-stage method also uses the DDPG-3C model, with hyperparameters similar to those of the DDPG-3C decision models used in the search and tracking stages. The state is defined as in Equation (2). The reward function is as follows:

$r = r_{\mathrm{prox}} + r_{\mathrm{find}} + r_{\mathrm{coll}}$,

where $r_{\mathrm{prox}}$ is the proximity reward, calculated from the Euclidean distance between the UAV and the target as shown in Equation (17); $r_{\mathrm{find}}$ represents the reward for finding the target; and $r_{\mathrm{coll}}$ represents the collision reward for being too close to barriers or obstacles.
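The sketch below shows one plausible way to assemble this reward in Python; the weighting coefficient and the bonus and penalty values are placeholders chosen for illustration, not the values defined by Equation (17) or used in the paper.

```python
import math

def single_stage_reward(uav_pos, target_pos, target_found, collided,
                        distance_weight=0.01, find_bonus=10.0,
                        collision_penalty=-50.0):
    """Combine the three reward terms of the single-stage method:
    proximity, target-finding, and collision."""
    distance = math.dist(uav_pos, target_pos)        # Euclidean distance
    r_prox = -distance_weight * distance             # closer to the target is better
    r_find = find_bonus if target_found else 0.0     # bonus for detecting the target
    r_coll = collision_penalty if collided else 0.0  # penalty for getting too close
    return r_prox + r_find + r_coll

# Example: UAV at (700, 700), target at (100, 100), not yet found, no collision
print(single_stage_reward((700.0, 700.0), (100.0, 100.0), False, False))
```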
When comparing the performance of the two-stage target search and tracking method with the traditional single-stage method, each model is tested across 20,000 episodes. The average number of steps taken to find the target, defined as the target being detected by the UAV's sensor, and the average number of steps to successfully track the target, defined as the UAV maintaining the target within a 20 m range, are calculated and compared, along with the collision rates. Both models are initialized under the same conditions to maintain fairness in testing. The results are presented in Table 7. To view the comparison video of these two methods, please click here (https://easylink.cc/5r3njl, accessed on 30 September 2024).
According to the data presented in Table 7, the two-stage target search and tracking method, which divides SAR tasks into search and tracking stages, requires an average of 171.27 steps to find the target. This is significantly fewer than the traditional single-stage method, which takes an average of 313.78 steps, suggesting that the specialized training of the search stage yields a more efficient search strategy. Once the target is detected, the two-stage method takes an average of 354.54 steps to track the target and maintain it within a 20 m range, whereas the traditional single-stage method requires more steps, averaging 481.68. This means that the UAV operates more effectively in dynamic environments when using the two-stage target search and tracking method. In terms of collision rates, the traditional single-stage method has a slightly higher collision rate of 4.51%, compared with 4.07% for the two-stage method, indicating that the two-stage method is not only faster but also safer.
It should be noted that although the two-stage target search and tracking method is more efficient overall in finding and tracking targets than the traditional single-stage method, it does not hold an advantage in the number of steps taken from detecting the target to successfully tracking it. Specifically, when measuring the steps from finding the target to tracking it, the traditional single-stage method requires fewer steps (167.9) than the two-stage method (183.27). This can be attributed to the state transition from the search stage to the tracking stage: for the two models trained with the two-stage method, the UAV's state at the end of the search stage acts like noise that the tracking controller must overcome, requiring additional steps to stabilize and effectively track the target.