1. Introduction
In recent years, the use of unmanned aerial vehicles (UAVs) for various applications has increased rapidly. Multiple UAVs are deployed for cooperative missions such as passenger transportation, logistics delivery, and surveillance [1]. To carry out such missions successfully with limited resources and time, an energy-efficient navigation control method for multiple UAVs is needed. Since the energy consumption of a UAV is proportional to its operating time, energy efficiency is directly related to mission performance [2]. In developing an energy-efficient multiple-UAV control model, control complexity is a typical problem that must be resolved. When UAVs perform cooperative missions together, the decision of one UAV affects the decisions of the others, and the complexity increases exponentially with the number of UAVs [3]. Consequently, conventional heuristic-based search algorithms face clear limitations in solving such problems.
Multiagent deep reinforcement learning (MADRL) is a novel approach that enables each agent to perform cooperative tasks by interacting with other agents through its own decisions. Compared with conventional models, MADRL can be applied to various environments where multiple agents exist, such as multirobot control, multiplayer games, and multiple-UAV control [4,5]. Unlike a ground vehicle that moves on a 2D plane, a UAV has a much broader range of motion, so its movement strategies for mission performance are more diverse. Furthermore, each UAV must make appropriate decisions by using both its own sensor information and the information retrieved by other UAVs. For these reasons, a suitable MADRL model must be selected for efficient navigation control.
There has been considerable research on reinforcement learning (RL)-based UAV navigation and its applications. G. Muñoz et al. [6] developed a DQN-based model applied to a single UAV for navigation with obstacle avoidance. A realistic AirSim-based simulated 3D environment was utilized for training the agent, and the authors demonstrated that the proposed model outperformed other DQN-based algorithms. Similarly, H. Qie et al. [7] proposed a multiagent deep deterministic policy gradient (MADDPG)-based model for multiple-UAV target assignment and path planning. The results showed that agents could be assigned to targets at relatively close distances with clear threat-avoidance behavior. Linfei Feng [8] introduced a policy gradient (PG) model to optimize the logistics distribution routes of a single UAV; the results showed that the UAV arranged delivery routes to multiple destinations along the shortest path. Ory Walker et al. [9] developed a framework combining proximal policy optimization (PPO) and an adaptive belief tree (ABT) for multiple-UAV exploration and target finding. The proposed algorithm was verified in both 2D and 3D environments with physically simulated UAVs using the PX4 software stack. W.J. Yoon et al. [10] utilized the QMIX model for eVTOL mobility in drone taxi applications; the QMIX-based algorithm showed optimal performance compared with independent DQN (I-DQN) and a random walk in the drone taxi service scenario. Zhou W. et al. [11] proposed a reciprocal-reward multiagent actor-critic (MAAC-R) method for learning cooperative tracking policies for UAV swarms. The training results demonstrated that the proposed model outperformed the MAAC model in terms of cooperative tracking behaviors of UAV swarms. D. Xu et al. [12] improved the MADDPG-based algorithm for autonomous and cooperative control of UAV clusters in combat missions. Tested on two conventional combat missions, the algorithm improved learning efficiency and the operational safety factor compared with the original MADDPG algorithm. Similarly, Guang Zhan et al. [13] applied multiagent proximal policy optimization (MAPPO) in a Unity-based 3D-simulated air combat environment, trained with a Ray-based distributed training framework; in the experiments, MAPPO outperformed COMA and BiCNet in average accumulated reward.
Table 1 shows a detailed comparison of research activity conducted utilizing MADRL and RL.
From Table 1, most of the research was carried out using actor-critic-based models. Additionally, based on the previous research related to MADRL, we conclude that centralized training with decentralized execution is more suitable for real-world situations. During real-world execution, it is difficult for one UAV to obtain data from all other UAVs in real time, and a decentralized actor network can infer actions in such a partially observable environment. We therefore focused on the multi-actor-attention-critic (MAAC) model, which showed the best performance among algorithms based on a centralized critic and decentralized policies and can be used in environments where information exchange between agents is not guaranteed [16].
This study makes the following significant contributions.
The development of an MAAC-based model with two significant improvements by applying a sensor fusion layer in the actor network and a dissimilarity layer in the critic network.
A new feature to calculate the energy efficiency of UAVs is incorporated into the previously developed UAV LDS simulation environment.
The performances of existing RL and MADRL models are compared using two energy efficiency indicators.
In this research, we focus on optimizing learning efficiency by efficiently processing the observations of multiple UAVs by adding two features to the MAAC model. First, we introduce a sensor fusion layer in the actor network to extract features from various sensors such as a ray-cast sensor for preventing collision with adjacent obstacles, an inertial navigation system (INS) for the self-awareness of flight status, and a radio detection and ranging (RADAR) system for collecting location data from other UAVs. Second, in the critic network, a dissimilarity layer is added to provide more weight to the information of agents with fewer similarities. By implementing these functions, the efficiency of information processing is increased, and we prove through experiments that it plays a decisive role in achieving the goal of energy-efficient UAV navigation control.
To validate our proposed MADRL model experimentally, the logistics delivery service virtual test bed is adopted from our previous research [21]. The test bed is customized by adding an energy efficiency module for multiple-UAV cooperation in logistics delivery. To examine whether UAVs can perform missions cooperatively, the environment includes a scenario in which two UAVs cooperate to transport cargo, and a function measuring the total travel distance of the UAVs has been added to validate their energy efficiency. We measure energy efficiency by the number of trips completed in the same time and the number of cargos delivered over the same distance traveled. Our proposed model shows the highest performance on both indicators compared with conventional RL algorithms.
Our work is structured as follows.
Section 2 covers the general background of the RL and MADRL algorithms. In Section 3, we expound on the proposed fusion-MAAC (F-MAAC) method. In Section 4, the test bed for training and evaluation is described in detail. Section 5 presents the results and discusses the performance evaluation. Finally, the study concludes with future directions in Section 6.
3. Fusion-Multiactor-Attention-Critic (F-MAAC) Model
In this section, the F-MAAC model is discussed for application to multiple-UAV cooperative navigation. To increase the learning efficiency of the agents, we added a new sensor fusion layer to MAAC. The sensor fusion layer processes each UAV's local observation, and another layer, named the cosine dissimilarity layer, was added to efficiently utilize the global information obtained by the other UAVs. The overall architecture of the proposed F-MAAC model is illustrated in Figure 4.
The overall flow of the F-MAAC model follows the basis of the MAAC model, including its loss function and the gradients of its objective function. Each agent has its own independent actor and critic, following centralized training with decentralized execution. In the training phase, all agents' observations are entered as inputs to each agent's critic network. In the execution phase, the decentralized actor network chooses actions by inference using only the agent's own observation as input. This general F-MAAC model can be applied to N agents equipped with M types of sensors. The step-by-step training procedure of the F-MAAC model is as follows:
Step 1: Initialize the critic networks Q_ψ and actor networks π_θ with random parameters, and synchronize the parameters ψ′ of the target critics with the critics and the parameters θ′ of the target actors with the actors.
Step 2: Get the observations o from the environment, feed them forward through the actors π_θ, and select the actions a.
Step 3: Proceed to the next time step with the actions a and get the next observations o′ and rewards r from the environment.
Step 4: Push the obtained set (o, a, r, o′) to the replay buffer D.
Step 5: Repeat Steps 2 to 4 until E sets of data are collected.
Step 6: Sample a minibatch B from the replay buffer D.
Step 7: Perform gradient descent using B to minimize the loss function in Equation (11) with respect to the critic network parameters ψ, where the target value in the loss is computed with the target critics and target actors.
Step 8: Perform gradient ascent using B to maximize the objective function in Equation (12) with respect to the actor network parameters θ.
Step 9: Update the parameters ψ′ of the target critics with Equation (13) and the parameters θ′ of the target actors with Equation (14), using an update rate of τ.
Step 10: Repeat Steps 2 to 9 until the end of the episode.
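The data-collection and update mechanics of Steps 4 to 6 and the soft target update of Step 9 can be sketched as follows. This is a minimal pure-Python sketch, not the authors' implementation; the buffer capacity, batch size, and scalar toy transitions are illustrative assumptions.

```python
import random
from collections import deque

def soft_update(target_params, params, tau=0.005):
    # Step 9: new_target = tau * online + (1 - tau) * old_target, per parameter.
    return [tau * p + (1 - tau) * tp for tp, p in zip(target_params, params)]

class ReplayBuffer:
    """Fixed-capacity buffer of (obs, actions, rewards, next_obs) sets."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def push(self, obs, actions, rewards, next_obs):
        # Step 4: store one joint transition for all N agents.
        self.storage.append((obs, actions, rewards, next_obs))

    def ready(self, e):
        # Step 5: wait until E sets of data are collected.
        return len(self.storage) >= e

    def sample(self, batch_size):
        # Step 6: draw a random minibatch B for the gradient updates.
        return random.sample(self.storage, batch_size)

# Toy usage with two agents and scalar observations/actions.
buffer = ReplayBuffer(capacity=1000)
for t in range(50):
    obs = [t, t + 0.5]                # one observation per agent
    actions = [0, 1]
    rewards = [1.0, 0.5]
    next_obs = [t + 1, t + 1.5]
    buffer.push(obs, actions, rewards, next_obs)

if buffer.ready(32):
    batch = buffer.sample(32)         # minibatch B of 32 transitions
```

In a full implementation, the sampled minibatch would feed the critic and actor updates of Steps 7 and 8, and `soft_update` would be applied to every network parameter tensor.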
3.1. Deep Fusion Layer in Actor Network
As illustrated in
Figure 5, we propose a deep fusion layer in the actor network to increase efficiency. Observations are separated into the M types of sensor data so that features can be extracted from each sensor. For instance, three different types of sensors are used for the UAVs in our virtual UAV LDS environment: a ray-cast sensor for preventing collisions with surrounding obstacles, an INS for self-awareness of flight status, and a RADAR for retrieving the coordinates of other UAVs and hubs. Each sensor's data pass through its sensor encoder; the encoded sensor data are then concatenated and passed through two fully connected layers. The output of the deep fusion layer can be expressed by Equation (15):

h = FC2(FC1(concat(SE_1(x_1), …, SE_M(x_M)))),    (15)

where FC1, FC2, and the sensor encoders SE_m are fully connected layers.
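The forward pass of the deep fusion layer can be sketched numerically as below. This is a minimal NumPy sketch under stated assumptions: the sensor dimensions (ray-cast 20, INS 6, RADAR 8), layer sizes, ReLU activations, and random weights are all illustrative, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def linear(x, w, b):
    return x @ w + b

# Illustrative per-sensor input dimensions: ray-cast (20), INS (6), RADAR (8).
sensor_dims = [20, 6, 8]
enc_dim, hidden_dim = 16, 32

# One fully connected encoder per sensor type: SE_m in Equation (15).
encoders = [(rng.standard_normal((d, enc_dim)) * 0.1, np.zeros(enc_dim))
            for d in sensor_dims]
# Two fully connected layers after concatenation: FC1 and FC2.
w1 = rng.standard_normal((enc_dim * len(sensor_dims), hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b2 = np.zeros(hidden_dim)

def deep_fusion(sensor_obs):
    # Encode each sensor's observation separately, then fuse.
    encoded = [relu(linear(x, w, b)) for x, (w, b) in zip(sensor_obs, encoders)]
    fused = np.concatenate(encoded)           # concat(SE_1(x_1), ..., SE_M(x_M))
    return relu(linear(relu(linear(fused, w1, b1)), w2, b2))  # FC2(FC1(.))

obs = [rng.standard_normal(d) for d in sensor_dims]
features = deep_fusion(obs)   # fused feature vector fed to the actor head
```

The point of the structure is that each sensor modality gets its own encoder before fusion, so heterogeneous inputs of different dimensions are mapped into a common feature space.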
3.2. Dissimilarity Layer in Critic Network
In the critic network, the state encodings (SE_i) are shared with the other agents, as shown in Figure 6. The attention head in the multiattention head layer selects relevant information from the other agents' observations. The attention head is constructed with a scaled dot product [27], which calculates the degree of similarity between the encoded observations of agent i (SE_i) and the encoded observations of the other agents (SE_j, j ≠ i). UAVs at adjacent distances will have similar observation data. When more weight is given to similar observations, the UAVs have a wider field of view and less chance of colliding with each other.
However, the multiattention head layer also has drawbacks. When the observation of a distant agent, which is dissimilar to the current agent's observation, plays an essential role in performing the mission, downweighting it can lead to serious performance degradation. More specifically, the observation from a distant agent near a target point may provide helpful information. For these reasons, a dissimilarity layer was added to prevent performance degradation due to attention and to improve learning stability. In a previous study, we verified the effect of adding a dissimilarity layer to the MAAC model in a simple 2D cooperative navigation environment [28].
Cosine similarity measures the similarity between two vectors by the cosine of the angle between them. Additionally using the observations multiplied by the dissimilarity value may offset the attention's bias toward similar observations. The dissimilarity value is calculated from the encoded observations of agent i (SE_i) and the encoded observations of the other agents (SE_j, j ≠ i). The dissimilarity value (DV) output by the dissimilarity layer is concatenated with the attention value (AV) from the multiattention head layer and the encoding SE_i. The concatenated value is then sent to the fully connected layers to calculate the critic value Q_i.
Figure 7 shows the detailed process of the dissimilarity layer. The dissimilarity weight between the agents' observations is calculated by negating the cosine similarity value, as in Equation (16):

DW_ij = −cos(SE_i, SE_j) = −(SE_i · SE_j) / (‖SE_i‖ ‖SE_j‖),    (16)

where j ≠ i.
The negative dissimilarity values are replaced with 0 to focus on the information of agents with different observation patterns. Then, the observations of each agent are multiplied by the cosine dissimilarity weight and concatenated, and the concatenated value is entered as the input of a fully connected layer. The output value DV from the dissimilarity layer, the output value AV from the multiattention head layer, and the encoded value SE_i are concatenated as the input of the fully connected layers.
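The dissimilarity weighting described above can be sketched as follows. This is a minimal NumPy sketch; the three-dimensional toy encodings are illustrative assumptions, not actual encoder outputs.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dissimilarity_weight(a, b):
    # Negate the cosine similarity (Equation (16)) and clamp negative results
    # to 0, so only agents with dissimilar observation patterns receive weight.
    return max(0.0, -cosine_similarity(a, b))

# Toy encoded observations for three agents.
se_i = np.array([1.0, 0.0, 0.0])
se_j = np.array([0.9, 0.1, 0.0])    # similar to agent i -> weight clamped to 0
se_k = np.array([-1.0, 0.2, 0.0])   # dissimilar to agent i -> positive weight

weights = [dissimilarity_weight(se_i, se) for se in (se_j, se_k)]
# Each other agent's encoding is scaled by its dissimilarity weight,
# then the scaled encodings are concatenated for the dissimilarity layer's FC.
weighted = [w * se for w, se in zip(weights, (se_j, se_k))]
dv_input = np.concatenate(weighted)
```

Note how this complements the attention head: attention emphasizes agent k's neighbors with similar views, while the dissimilarity weight is nonzero only for agents whose observations differ, such as a distant agent near a target point.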
5. Experimental Simulation and Results
The proposed F-MAAC model was validated using the environment proposed in
Section 4. For a more efficient evaluation of the proposed F-MAAC model, we first compared the mean episode rewards of the MAAC, MADDPG, and DDPG models trained for 20k episodes. Then, the two models with the highest performance, F-MAAC and MAAC, were selected for training over 150k episodes. To evaluate the trained models over a meaningful scale length, the episode length was increased from 1000 to 3000 time steps. The timescale of the environment was decreased in the evaluation phase to observe and analyze the strategies of the UAVs. The total number of deliveries during one episode and over the same distance traveled was evaluated to verify the energy efficiency.
The hyperparameters for training the RL models are shown in
Table 5.
5.1. Comparison of Performance of RL Models
Two MADRL models (MAAC and MADDPG) and one single-agent RL model (DDPG) were compared with the proposed F-MAAC model. Each model was trained for 20k episodes with 1000 steps per episode in the proposed UAV LDS simulation environment.
According to
Figure 9, the DDPG model's mean episode reward showed the worst performance, as it did not increase significantly until 20k episodes. Although the MADDPG model rose slightly higher than DDPG from 5k to 20k episodes, its increase in mean episode reward was also minor. The F-MAAC and MAAC models, on the other hand, displayed impressive performance and successfully delivered quantities of both large and small cargos. Between 10k and 20k training episodes, the F-MAAC model achieved a higher mean episode reward than the MAAC model, and at the end of the 20k training episodes, its mean episode reward exceeded that of the MAAC model by more than 30%.
5.2. Comparison of Performance between F-MAAC and MAAC Models
We retrained the F-MAAC and MAAC models with 150k episodes, which took about six days with two GPU machines. The detailed specifications of the machine are listed below in
Table 6.
Figure 10 shows the mean episode rewards of the MAAC and F-MAAC models over 150k training episodes. The mean episode rewards of both models increased noticeably in this experiment compared with the experiment in Section 5.1. The difference between them was unnoticeable until 40k training episodes, after which the F-MAAC model started to outperform the MAAC model. From 80k to 150k episodes, the mean episode reward of the MAAC model decreased while that of the F-MAAC model constantly increased. At the end of training, the F-MAAC model obtained 50% more reward than the MAAC model. The randomness and instability of the complex 3D environment produced learning patterns different from the previous training, since the maps of the UAV LDS environment were generated randomly for every episode. Nevertheless, both results showed that the F-MAAC model outperformed the MAAC model, and this experiment provides a more reliable comparison since the models were trained longer, up to 150k episodes.
5.3. Comparison of Energy Efficiency between F-MAAC and MAAC Models
For the energy efficiency evaluation, we executed the F-MAAC and MAAC models trained for 150k episodes. Each model was executed for 100 episodes of 3000 time steps each. The average performance per episode is shown with a box plot in Figure 11. We report the number of successful deliveries of small and big cargos. Furthermore, the total performance was evaluated with Score = N_small + 1.5 × N_big, where N_small and N_big denote the numbers of delivered small and big cargos, respectively. The weight of 1.5 was applied to the number of big cargos since we gave 50% more reward for big cargos in the training phase.
The result showed that the number of deliveries in both small and big cargos with the F-MAAC model was higher than in the MAAC model.
Table 7 shows that the score of the F-MAAC model was 38% higher than that of the MAAC model during one episode, indicating that the F-MAAC model was more energy efficient.
We also measured energy efficiency with Score_movement, the performance per 1000 m of distance moved. We recorded the total movement of the UAVs during execution, and Score_movement was calculated as Score_movement = Score / (total movement / 1000 m). The results showed that the F-MAAC model was 30% more efficient than the MAAC model.
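The two indicators can be illustrated concretely as follows. This is a sketch with made-up counts for illustration only, not experimental results, and it assumes the scoring formulas reconstructed above.

```python
def score(n_small, n_big):
    # Total performance: big cargos weighted 1.5x, matching the 50% extra
    # reward they received in the training phase.
    return n_small + 1.5 * n_big

def score_movement(total_score, total_distance_m):
    # Energy efficiency indicator: performance per 1000 m of distance moved.
    return total_score / (total_distance_m / 1000.0)

# Hypothetical episode: 10 small and 4 big cargos over 12,000 m of flight.
s = score(10, 4)                  # 10 + 1.5 * 4 = 16.0
sm = score_movement(s, 12_000)    # 16.0 / 12 km, about 1.33 per 1000 m
```

Because Score_movement normalizes by distance, a model that delivers the same cargo with shorter flight paths scores higher, which is what makes it an energy efficiency indicator rather than a pure throughput measure.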
In addition, the number of collisions of the F-MAAC model was about 9% less than that of the MAAC model. The improvement of the F-MAAC model’s sensor processing efficiency can be interpreted as having a positive effect on the obstacle avoidance performance of the UAV.
6. Conclusions
This study proposed an MAAC-based multiple-UAV navigation control model that improved energy efficiency through efficient data processing of the UAVs. The following significant findings were obtained.
(a) In the proposed model, a sensor fusion layer was adopted in the actor network, and a dissimilarity layer was utilized in the critic network. When applied to the UAV LDS simulation environment, the model outperformed conventional RL models in terms of energy efficiency.
(b) The sensor fusion layer extracted features from each sensor, enabling the UAVs to use various sensor data efficiently. The dissimilarity layer compensated for the loss derived from the attention layer by giving weight to observations that are highly dissimilar from those of the other agents.
(c) The UAVs with F-MAAC applied transported more cargo than those with MAAC in the same amount of time and over the same distance, with greater cooperation and fewer collisions.
The feature of measuring the total movement of the UAVs was added to the existing UAV LDS environment to calculate energy efficiency, and we provided two indicators of UAV energy efficiency. The proposed model showed the best performance on both indicators among the various RL models, including the original MAAC model. In future studies, the model needs further verification and development in a more sophisticated environment, including realistic sensors and dynamic flight models. Furthermore, its scalability should be verified in a broader environment where more agents exist.