*4.4. Heuristic Reward Function*

In this section, we design a heuristic reward function to evaluate the result when the UAV swarm executes action $\mathbf{a}_t$ based on the respective observation $\mathbf{o}_t$. Since each UAV agent in SAG-MCS is exposed only to local information and acts in a decentralized manner, we expect the reward function to help agents achieve a better CFE score without directly optimizing the global metrics described in Section 3.4. The reward function therefore accounts for data collection, battery charging, energy consumption, and collisions with boundaries.

Firstly, we encourage the UAV swarm to collect as much data as possible. Note that PoIs within a UAV's coverage range $R_{cov}$ are referred to as 'covered'. For UAV $u$, we design an individual coverage term $r_u^{self}$ and a swarm coverage term $r_u^{swarm}$:

$$r_u^{self} = \begin{cases} \eta_1 \cdot \sum_p d(p), & \text{if PoI } p \text{ is covered only by UAV } u \\ -1, & \text{if no PoI is covered by UAV } u \end{cases} \tag{18}$$

$$r_u^{swarm} = \begin{cases} \frac{\eta_2}{n_u} \cdot \sum_q d(q), & \text{if PoI } q \text{ is covered by other UAVs in } G_{\mathcal{U}} \\ 0, & \text{if UAV } u \text{ is not networking with others} \end{cases} \tag{19}$$

where $r_u^{self}$ counts the amount of data collected individually by UAV $u$, and $r_u^{swarm}$ counts the amount of data covered by agents connected to UAV $u$ within one hop. These terms are expected to improve the data coverage index through both individual exploration and swarm cooperation. Let $n_u$ denote the number of UAV $u$'s one-hop neighboring nodes. In addition, we set the balance coefficients $\eta_1 = 0.4$ and $\eta_2 = 0.04$.
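To make the coverage terms concrete, the following minimal Python sketch shows how Equations (18) and (19) could be evaluated for one UAV. The function name `coverage_terms` and the data structures (`poi_data`, `covered_by`, `neighbors`) are illustrative placeholders, not part of our actual implementation.

```python
def coverage_terms(u, poi_data, covered_by, neighbors, eta1=0.4, eta2=0.04):
    """Illustrative evaluation of the coverage terms in Eqs. (18)-(19).

    poi_data:   dict mapping PoI id -> remaining data amount d(p)
    covered_by: dict mapping PoI id -> set of UAV ids whose coverage range contains it
    neighbors:  set of UAV ids networking with UAV u (one-hop neighbors)
    """
    # Individual term r_u^self (Eq. 18): sum the data of PoIs covered exclusively
    # by UAV u; a -1 penalty applies when UAV u covers no PoI at all.
    own = [poi_data[p] for p, uavs in covered_by.items() if uavs == {u}]
    covers_any = any(u in uavs for uavs in covered_by.values())
    r_self = eta1 * sum(own) if covers_any else -1.0

    # Swarm term r_u^swarm (Eq. 19): data of PoIs covered by one-hop neighbors,
    # shared by the neighbor count n_u; zero when UAV u is isolated.
    n_u = len(neighbors)
    if n_u == 0:
        r_swarm = 0.0
    else:
        shared = [poi_data[q] for q, uavs in covered_by.items() if uavs & neighbors]
        r_swarm = (eta2 / n_u) * sum(shared)

    return r_self, r_swarm
```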

Secondly, in order to guide UAVs to charging stations when their batteries are low, we propose a charge term $r_u^{charge}$ as:

$$r_u^{charge} = -\min_{c} \theta_c^u, \quad \forall c \in \mathcal{C}, \tag{20}$$

where $\theta_c^u \in [0, 1]$ is the normalized Euclidean distance between UAV $u$ and charging station $c$. The charge term $r_u^{charge}$ increases as the UAV moves closer to its nearest charging station. We deem the UAV to be in the charging state when the relative distance satisfies $\theta_c^u \le 2.0$, in which case an extra reward of 2.0 points is added to $r_u^{charge}$.

Other factors such as energy consumption and collisions are considered as well. According to Equation (6), we simply define an energy term as $r_u^{energy} = 1/e_t^u$. UAVs that consume less energy are expected to gain higher rewards. Then, we define a penalty term $p_u = 1$ when UAV $u$ collides with the fixed boundary in our scenario, and $p_u = 0$ otherwise. We integrate the local evaluation terms and define the heuristic reward function as:

$$r_u = \frac{\left(r_u^{self} + r_u^{swarm}\right) \times \varepsilon + r_u^{charge} \times (1 - \varepsilon)}{r_u^{energy}} + p_u, \quad \text{if } \phi(u) > 0, \tag{21}$$

where the weight parameter $\varepsilon$ refers to the remaining battery percentage, denoted as $\varepsilon = \phi(u)/100\%$. Equation (21) only applies when the battery is not empty; otherwise, the reward function is defined as:

$$r_u = r_u^{energy}, \quad \text{if } \phi(u) \le 0. \tag{22}$$

For training simplicity, a UAV can still operate after its battery has drained, but it can no longer gain reward from data collection and receives an extra punishment.
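For completeness, the sketch below assembles the charge, energy, and penalty terms into Equations (20)-(22). The function `heuristic_reward` and its arguments are again illustrative placeholders; details of how the simulator tracks distances and energy are omitted.

```python
def heuristic_reward(r_self, r_swarm, dist_to_stations, energy_step, battery_pct,
                     collided, charge_threshold=2.0, charge_bonus=2.0):
    """Illustrative combination of the reward terms in Eqs. (20)-(22).

    dist_to_stations: normalized distances theta_c^u to every charging station c
    energy_step:      energy e_t^u consumed by UAV u at this timestep
    battery_pct:      remaining battery phi(u) in percent
    collided:         whether UAV u hit the scenario boundary
    """
    # Charge term r_u^charge (Eq. 20): negative distance to the nearest station,
    # plus a 2.0-point bonus once the UAV is close enough to enter the charging state.
    nearest = min(dist_to_stations)
    r_charge = -nearest
    if nearest <= charge_threshold:
        r_charge += charge_bonus

    # Energy term and boundary penalty as defined in the text.
    r_energy = 1.0 / energy_step
    p_u = 1.0 if collided else 0.0

    # Eq. (22): with an empty battery, only the energy term remains.
    if battery_pct <= 0:
        return r_energy

    # Eq. (21): the battery percentage eps weighs coverage against charging.
    eps = battery_pct / 100.0
    return ((r_self + r_swarm) * eps + r_charge * (1 - eps)) / r_energy + p_u
```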

#### **5. Experiments**

In this section, we introduce the experimental setup and performance metrics. Then, we compare our approach with three state-of-the-art DRL baselines. Case studies are performed to analyze the effectiveness, expansibility and robustness of ms-SDRGN.

#### *5.1. Experimental Settings*

We use PyTorch 1.9.0 to perform experiments on Ubuntu 20.04 servers with two NVIDIA RTX 3080 GPUs and an A100 GPU. In the SAG-MCS simulation environment, we set a 2D continuous target area of 200 × 200 pixels, in which 120 PoIs, 3 charging stations, and 50 obstacles (20 round obstacles and 30 rectangular obstacles) are randomly initialized. PoIs are scattered around 3 major points following a Gaussian distribution, and each PoI is randomly assigned associated data within [1, 5]. We deploy 20 UAVs in the training stage with a parameter-shared model for action inference. We define the coverage range $R_{cov} = 10$, the observation range $R_{obs} = 13$, and the communication range $R_{comm} = 18$, with a probability $p = 0.5$ of communication dropout. The fuzzy global observation with a size of 40 × 40 pixels is updated from satellites to UAVs every 5 timesteps. Each UAV's battery is initially fully charged to 100%, and the energy consumed at each timestep is calculated after every movement according to Equations (5) and (6).
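For reference, these environment settings can be grouped into a simple configuration object as in the sketch below; the class and field names are illustrative and do not correspond to our actual code.

```python
from dataclasses import dataclass

@dataclass
class SAGMCSConfig:
    """Illustrative container for the environment settings in Section 5.1."""
    map_size: int = 200            # 200 x 200 pixel continuous target area
    n_pois: int = 120              # PoIs scattered around 3 major points
    n_charging_stations: int = 3
    n_round_obstacles: int = 20
    n_rect_obstacles: int = 30
    n_uavs: int = 20               # parameter-shared agents in training
    r_cov: float = 10.0            # coverage range
    r_obs: float = 13.0            # observation range
    r_comm: float = 18.0           # communication range
    comm_dropout_p: float = 0.5    # probability of communication dropout
    global_obs_size: int = 40      # fuzzy global observation, 40 x 40 pixels
    global_obs_interval: int = 5   # satellite update every 5 timesteps
    poi_data_range: tuple = (1, 5) # data amount assigned to each PoI
    init_battery: float = 100.0    # initial battery percentage
```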

In our implementation, the target entropy factor is set to $p_\alpha = 0.3$ and the discount factor $\gamma$ is 0.99. We use Adam for optimization with a learning rate of $1 \times 10^{-4}$, and ReLU as the activation function for all hidden layers. The experience replay buffer is initialized with a size of $2.5 \times 10^4$ for storing interaction histories, and the batch size is set to 256. As for the exploration strategy, we apply $\varepsilon$-multinomial exploration for stochastic policies such as ms-SDRGN, letting $\varepsilon$ start at 0.9 and decay exponentially to 0 by the end of training. For deterministic policies, we use an $\varepsilon$-greedy strategy and set $\varepsilon$ to decay exponentially to 0.05 at 30,000 training episodes.
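A minimal sketch of such an exponential $\varepsilon$ schedule is given below; the interpolation constant (-5.0) is an illustrative assumption, and the exact decay curve used in our experiments may differ.

```python
import math

def epsilon_schedule(episode, eps_start, eps_end, decay_episodes):
    """Illustrative exponential epsilon decay for exploration.

    For the stochastic policies (eps-multinomial), eps_start=0.9 and eps_end~0;
    for the deterministic baselines (eps-greedy), eps decays to 0.05 at 30,000 episodes.
    """
    if episode >= decay_episodes:
        return eps_end
    # Exponential interpolation from eps_start towards eps_end; the rate constant
    # -5.0 is an assumed value chosen only for illustration.
    ratio = episode / decay_episodes
    return eps_end + (eps_start - eps_end) * math.exp(-5.0 * ratio)
```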

One simulation episode lasts for 100 timesteps, and each DRL model interacts with the simulation environment for 50,000 episodes in total. Interaction experiences are pushed to the replay buffer concurrently. After each simulation episode, the learned network is trained 4 times using experiences sampled from the replay buffer, while the target network is updated every 5 episodes by directly copying the parameters from the learned network. After training, we test the converged models for 1000 episodes to reduce randomness.
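This interaction and update schedule can be summarized by the following sketch, where `env`, `learned_net`, `target_net`, and `buffer` are placeholder objects with gym- and PyTorch-like interfaces rather than our actual classes.

```python
def train(env, learned_net, target_net, buffer, n_episodes=50_000,
          updates_per_episode=4, target_sync_every=5, batch_size=256):
    """Illustrative outline of the training schedule in Section 5.1."""
    for episode in range(n_episodes):
        obs = env.reset()
        for t in range(100):                        # one episode lasts 100 timesteps
            actions = learned_net.act(obs)          # decentralized, parameter-shared inference
            next_obs, rewards, done, _ = env.step(actions)
            buffer.push(obs, actions, rewards, next_obs, done)
            obs = next_obs

        for _ in range(updates_per_episode):        # 4 gradient updates per episode
            batch = buffer.sample(batch_size)
            learned_net.update(batch, target_net)

        if (episode + 1) % target_sync_every == 0:  # hard copy of parameters every 5 episodes
            target_net.load_state_dict(learned_net.state_dict())
```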

As introduced in Section 3.4, we use the coverage index, fairness index, energy index, and the overall CFE score as metrics to evaluate the performance.


#### *5.2. Analysis of Training Convergence and Heuristic Reward Function*

To validate the feasibility and effectiveness of our SAG-MCS environment design and the heuristic reward function, we first present the learning curves of episodic reward and the global metrics over time. During the training phase, we evaluate the model for 20 episodes after every 100 training episodes, and calculate the average global metrics and accumulated reward, as illustrated in Figure 5.

**Figure 5.** (**a**) The episodic reward learning curves of DRL algorithms. (**b**) The global metrics learning curves of ms-SDRGN.

In Figure 5a, we observe that the average episodic reward of ms-SDRGN improves rapidly at the beginning and gradually converges at around 20,000 episodes. Figure 5b presents the changes of the four global metrics during the training of ms-SDRGN. The final energy index gradually drops and stabilizes at 0.9 around 20,000 episodes, indicating that UAVs have learned to operate at an optimal cruising speed. In addition, the final coverage and fairness indices grow quickly and converge at around 10,000 episodes. Correspondingly, the overall CFE score shows a similar growth trend and converges rapidly. This demonstrates that ms-SDRGN has learned a policy that fulfills the overall objective of maximizing the CFE score. After convergence, the UAV swarm can continuously cover as many PoIs as possible while flying at an energy-efficient speed. These training results suggest the effectiveness of the heuristic reward function.

Through visualization, we can observe that UAVs have learned to appropriately assign tasks at different remaining battery levels. When its battery drops to around 25~40%, a UAV proceeds to the closest charging station for battery exchange. In each 100-timestep simulation episode, the whole swarm rarely runs out of power, as such a charging process happens about twice for each UAV.

#### *5.3. Comparing with DRL Baselines*

We then compare our approach ms-SDRGN with three DRL baselines: DGN [36], DQN [24] and MAAC [52]. DQN is a simple and efficient single-agent DRL approach, but it is still applicable to multi-agent tasks. Based on DQN, DGN uses GAT for modeling and exploiting the communication between agents. MAAC integrates a self-attention mechanism with MADDPG [47], and provides agents with fully observable information to learn decentralized stochastic policies using a centralized critic. Thus, we compare ms-SDRGN with DGN to show the effectiveness of the multi-scale encoder and memory unit, and with MAAC to validate the necessity of communication for the multi-agent swarm, especially in a partially observable environment.

We evaluated the converged methods for 1000 episodes and report the mean value and standard deviation of all metrics in Table 1. Note that, for a fair comparison, we also provide the fuzzy global observations to the baselines to ensure that the raw observation inputs are the same.



The evaluation results are presented in Table 1. We then conduct an independent t-test between our approach and each of the three DRL baselines on every evaluation metric (a minimal sketch of this test is given at the end of this subsection). It can be concluded that ms-SDRGN differs significantly from the baselines (*p* < 0.05). We can draw the following observations from Table 1:

Firstly, the proposed ms-SDRGN significantly outperforms all other baselines in terms of reward and coverage index. This demonstrates that, with the help of the multi-scale convolutional encoder and graph-based communication, ms-SDRGN achieves better data collection and energy management efficiency in the SAG-MCS scenario. Compared with DQN and DGN, ms-SDRGN can better sense the surrounding environment from previous experiences stored in the memory unit, and decides more efficiently between seeking more PoIs and returning to charge.

Secondly, from the perspective of fairness and energy, MAAC improves the fairness index by 0.005 and the energy index by 0.0075 over ms-SDRGN. As a fully observable algorithm, we believe MAAC can achieve cooperative exploration similar to that of ms-SDRGN by using observation embeddings from the whole UAV swarm. Without having to extract features from neighboring UAV nodes or from the memory unit, MAAC has a simpler objective and can focus on reducing its energy consumption to obtain a higher reward.

Furthermore, the reward standard deviation of ms-SDRGN is higher than that of the other methods, which may be attributed to the randomness introduced by the more complex MADRL framework.
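For reference, the significance test mentioned above can be reproduced with a standard independent two-sample t-test; the sketch below uses SciPy with randomly generated placeholder scores in place of the real per-episode evaluation logs, so the printed numbers are illustrative only.

```python
import numpy as np
from scipy import stats

# Placeholder per-episode CFE scores over 1000 evaluation episodes; in practice
# these arrays come from the evaluation logs of ms-SDRGN and a baseline.
rng = np.random.default_rng(0)
ms_sdrgn_cfe = rng.normal(loc=0.60, scale=0.05, size=1000)  # arbitrary placeholder values
baseline_cfe = rng.normal(loc=0.55, scale=0.05, size=1000)  # arbitrary placeholder values

# Independent two-sample t-test (Welch's variant, no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(ms_sdrgn_cfe, baseline_cfe, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, significant at 0.05: {p_value < 0.05}")
```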

#### *5.4. Analysis of Communication Dropout*

In practical wireless networking applications, communication losses commonly occur in the form of delay, congestion or packet loss. To better cope with such demanding real-world communication conditions, we assume a probability $p = 0.5$ of communication dropout between interconnected UAVs during the training phase. Theoretically, this setting improves the robustness of our model when deployed under different conditions. Therefore, we train two ms-SDRGN models in environments with and without communication dropout, respectively, and then test them in SAG-MCS with the random communication dropout rate $p$ varying in [0, 1] at an interval of 0.1. The evaluation result is shown in Figure 6.
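Conceptually, this communication dropout can be realized by randomly masking links in the UAV adjacency matrix before the graph attention layers. The sketch below is one possible realization; details such as keeping self-connections or dropping each directed link independently are assumptions for illustration rather than our exact procedure.

```python
import torch

def apply_comm_dropout(adjacency, p=0.5, training=True):
    """Illustrative random communication dropout on the UAV adjacency matrix.

    adjacency: (N, N) binary tensor, 1 where two UAVs are within communication range.
    Each directed link is dropped independently with probability p during training,
    simulating delay, congestion, or packet loss.
    """
    if not training or p <= 0:
        return adjacency
    keep_mask = (torch.rand_like(adjacency.float()) >= p).float()
    dropped = adjacency.float() * keep_mask
    # Assumed choice: self-connections are always kept so each agent still attends to itself.
    dropped.fill_diagonal_(1.0)
    return dropped
```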

From Figure 6a, it can be observed that as the dropout rate grows in the evaluation environment, the reward of the model trained w/o dropout continuously decreases, while the model trained w/ dropout achieves a more stable evaluated reward and outperforms the other once the dropout rate $p$ exceeds 0.4. In terms of the major metric, the CFE score in Figure 6b, ms-SDRGN trained w/ dropout continuously surpasses ms-SDRGN trained w/o dropout. When the evaluation communication dropout rate changes from 0 to 1.0, the CFE score of ms-SDRGN trained w/ dropout drops by around 0.05 points. By contrast, ms-SDRGN trained w/o dropout suffers a 0.19-point degradation in CFE score.

Random communication dropout can affect the stability of the temporal correlation in the GRU memory unit. However, after being trained in an environment with a 50% probability of communication loss, our ms-SDRGN proves to be more robust and does not suffer significant performance loss even under unreliable communication conditions.

**Figure 6.** The evaluation results in environments with different communication dropout rate: (**a**) mean episodic reward and (**b**) CFE score.

#### *5.5. Impact of Simulation Environment Scale Setting*

Next, we proceed to verify the performance of the multi-scale convolutional encoder. In actual MCS tasks, UAVs may encounter various densities of overground PoIs. For regions with dense PoI distributions, such as modern cities, we hope to perform finer-grained observations for higher feature resolution, while coarse-grained or lightweight observations suffice for areas with sparse PoIs. In order to handle tasks of different observation scales and enhance robustness, we implement CNNs as the multi-scale encoder, which is technically more applicable than linear encoders for large-scale observations. Therefore, we compare the front-end multi-scale convolutional encoder with the original linear encoder using different local observation scales.

In this experiment, we simulate different sizes of observation inputs by proportionally scaling the whole map, which maintains the distribution of all elements and ensures comparison fairness. Specifically, we take the original environment settings introduced in Section 5.1 as scale factor 1.0, and adjust the scale factor from 0.5 to 2.0 with an interval of 0.5. The major settings of the different scale factors are listed in Table 2.
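The proportional scaling can be sketched as follows for the clearly spatial quantities from Section 5.1; which other settings (e.g., numbers of PoIs and obstacles) change with the scale factor follows Table 2 and is not reproduced in this sketch.

```python
def scaled_settings(base, scale):
    """Illustrative proportional scaling of the spatial settings for Section 5.5.

    Only quantities that are clearly spatial in Section 5.1 are scaled here.
    """
    return {
        "map_size": int(base["map_size"] * scale),  # 200 -> 100 / 200 / 300 / 400 px
        "r_cov":    base["r_cov"] * scale,
        "r_obs":    base["r_obs"] * scale,
        "r_comm":   base["r_comm"] * scale,
    }

base_settings = {"map_size": 200, "r_cov": 10.0, "r_obs": 13.0, "r_comm": 18.0}
for scale in (0.5, 1.0, 1.5, 2.0):
    print(scale, scaled_settings(base_settings, scale))
```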


**Table 2.** Simulation Environment Scale Experiment Settings.

The evaluation results for the four environment scales are presented in Figure 7. As the size of the local observation space varies with the observation range $R_{obs}$, we observe that the CNN encoder consistently outperforms the linear encoder on episodic reward. As for the CFE score, ms-SDRGN with the local CNN encoder achieves a better CFE score than the linear encoder when the scale factor is greater than or equal to 1.0, while the linear encoder exceeds the CNN encoder by 0.04 points at scale 0.5. This result indicates that the linear encoder can efficiently extract features from small-size inputs, whereas the local CNN used in our multi-scale convolutional encoder has better representational capacity for large observation spaces. This finding demonstrates the expansibility of ms-SDRGN towards various scales of raw observations.

**Figure 7.** The evaluation results in environments with different scale factors: (**a**) mean episodic reward and (**b**) CFE score. ('w/o local CNN encoder' denotes using linear encoder to process local observations).

#### *5.6. Ablation Study*

Finally, we conduct an ablation study by separately removing components of ms-SDRGN, including the multi-scale encoder, the GAT layers, and the GRU. We evaluate each case for 1000 episodes and list the average results in Table 3.

**Table 3.** Ablation study of ms-SDRGN method.


'-ms' means removing the local CNN encoder. '-Soft' means training a deterministic policy instead of a stochastic policy. '-1GAT' and '-2GAT' denote disabling one GAT layer and two GAT layers, respectively. '-GRU' means disabling the GRU memory unit.

It can be observed in Table 3 that removing any component of ms-SDRGN generally results in performance degradation. Firstly, removing the local CNN encoder in case '-ms' reduces the average CFE score and reward, which demonstrates the validity of the CNN encoder, as discussed in Section 5.5. Secondly, case '-Soft' demonstrates that the stochastic policy outperforms the deterministic policy by improving exploration and coverage efficiency. Thirdly, case '-1GAT' disables one GAT layer and limits the ad-hoc communication to one-hop range, which decreases the CFE score by 0.08 points and the reward by 530 points. Case '-2GAT' disables both GAT layers, which completely cuts off the communication of the UAV swarm and causes further performance loss. This finding suggests the necessity of the GAT mechanism for modeling the communication between agents. Moreover, case '-GRU' removes the memory unit and significantly reduces the average reward and CFE score. For complex MARL tasks such as SAG-MCS in this paper, the memory unit helps agents recall long-term experiences, especially when the positions of PoIs and obstacles are fixed.

#### **6. Discussion**

In this section, we discuss two limitations of our method and explore future directions for practical implementation.

Firstly, computational complexity is crucial for practical applications. The proposed MADRL approach functions in a decentralized manner: each UAV agent infers its action with its on-board processor and then executes the action. The multi-scale convolutional encoder introduced in Section 4.1 becomes the major computational burden for embedded processors. Therefore, future work will focus on introducing more computationally efficient spatial feature extractors.

Secondly, the hand-crafted reward function limits scalability. The heuristic reward function designed in Section 4.4 is customized for the SAG-MCS simulation environment; when migrated to other application scenarios, it requires case-by-case modification. Inverse reinforcement learning can be a solution that lets agents infer reward functions from expert trajectories [53].

#### **7. Conclusions**

This paper introduced a partially observable MCS scenario named SAG-MCS, in which an aerial UAV swarm jointly performs a data collection task under energy limits. We proposed a value-based MADRL model named ms-SDRGN to address this multi-agent problem. Specifically, ms-SDRGN applies a multi-scale convolutional encoder to handle the multi-scale observations, and utilizes GAT and GRU for modeling communications and providing long-term memories. Furthermore, a maximum-entropy method with configurable action entropy is employed to learn a stochastic policy. Experiments were conducted to demonstrate the superiority of our model over other DRL baselines and to validate the necessity of the major components in ms-SDRGN. In addition, we analyzed the effectiveness of the communication dropout setting and the front-end CNN encoder. Future work will focus on implementing a fully continuous action space and exploring multi-stage multi-agent scenarios.

**Author Contributions:** Conceptualization, Y.R. and Z.Y.; methodology, Z.Y.; software, Y.R.; validation, Y.R., Z.Y. and G.S.; formal analysis, Y.R.; investigation, Y.R.; resources, X.J.; data curation, Z.Y.; writing—original draft preparation, Y.R.; writing—review and editing, Y.R., Z.Y. and G.S.; visualization, Y.R.; supervision, G.S.; project administration, G.S.; funding acquisition, X.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National key R & D program of China sub project "Emergent behavior recognition, training and interpretation techniques" grant number No. 2018AAA0102302.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We gratefully acknowledge the reviewers for their comments and suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.
