#### *3.4. Evaluation Metrics*

As stated in Section 3.1, the UAV swarm aims to collect as much information over PoIs as possible for as long as possible. UAVs should avoid collisions with obstacles and map borders during movement, and recharge in time when their battery power is low. Following Ye et al. [37] and Liu et al. [45], we propose three global metrics to evaluate the effectiveness of the joint cooperation of the UAV swarm in this SAG-MCS task. These metrics are ultimately used to evaluate the DRL policy we train.

The first metric is the *Data Coverage Index*, which describes the average amount of data collected by the whole UAV swarm per timestep:

$$c\_t = \frac{\sum\_{k=1}^{K} w\_t(k)d(k)}{Kt}, \quad t = 1, \ldots, T. \tag{3}$$

where $w_t(k)$ denotes the number of timesteps in which PoI $k$ was successfully collected from timestep 1 through $t$, $d(k)$ denotes the data amount carried by PoI $k$, and $K$ is the total number of PoIs.
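As a concrete illustration, the following is a minimal sketch of Equation (3), assuming `w_t` is an array counting how many timesteps each PoI has been collected up to timestep `t` and `d` holds each PoI's data amount; both inputs and the function name are hypothetical, not part of the paper's implementation.

```python
import numpy as np

def data_coverage_index(w_t: np.ndarray, d: np.ndarray, t: int) -> float:
    """Eq. (3): c_t = sum_k w_t(k) * d(k) / (K * t)."""
    K = len(d)
    return float(np.sum(w_t * d) / (K * t))
```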

We notice that in some cases, isolated PoIs in rural areas may remain uncovered even when the data coverage index is quite high; however, isolated or sparse PoIs in remote areas can carry valuable information in scenarios such as disaster relief. Considering the comprehensiveness of the data collection task, we propose a second global metric, the *Geographical Fairness Index*, to evaluate the exploration ability of the UAV team:

$$f\_t = \frac{\left(\sum\_{k=1}^{K} w\_t(k)d(k)\right)^2}{K \sum\_{k=1}^{K} \left(w\_t(k)d(k)\right)^2}, \quad t = 1, \ldots, T. \tag{4}$$

where $w_t(k)$ and $d(k)$ are defined as in Equation (3). When all PoIs are evenly covered, Equation (4) gives $f_t = 1$.
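A minimal sketch of Equation (4) follows, reusing the same hypothetical `w_t` and `d` arrays as above; the zero-denominator guard is an implementation assumption.

```python
import numpy as np

def geographical_fairness_index(w_t: np.ndarray, d: np.ndarray) -> float:
    """Eq. (4): f_t = (sum_k w_t(k) d(k))^2 / (K * sum_k (w_t(k) d(k))^2)."""
    x = w_t * d
    K = len(x)
    denom = K * np.sum(x ** 2)
    return float(np.sum(x) ** 2 / denom) if denom > 0 else 0.0
```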

In addition, the third metric, the *Energy Consumption Index*, indicates the energy-saving status of the UAV swarm. To better model the energy consumed by a multi-rotor UAV in reality, we adopt an equation relating output power to flight speed [46]:

$$P\_T = \frac{1}{2} C\_D A \rho v^3 + \frac{W^2}{\rho b^2 v}, \tag{5}$$

where $C_D$ is the aerodynamic drag coefficient, $\rho$ is the air density and $v$ is the current flying speed. Parameters $A$, $W$ and $b$ denote the UAV's frontal area, total weight, and width, respectively. For simplicity, we adopt a general UAV model and omit the specific values in this paper. In timestep $t$, we assume the energy $e_t^u$ consumed by UAV $u$ is linear in its output power:

$$e\_t^u = e\_0 + \eta\_e P\_{T,t}^u, \tag{6}$$

where $e_0$ represents the hovering energy consumption and $\eta_e$ is an energy coefficient. $P_{T,t}^u$ refers to the output power of UAV $u$ in timestep $t$. Equations (5) and (6) show that the UAV's battery is used most efficiently at an optimal cruising speed, while hovering or flying at maximum speed consumes more power. Note that the energy consumed during flight comes mainly from the rotors and embedded sensors, and we ignore the communication cost of the ad-hoc network. We therefore define the energy consumption index as the average over all $U$ UAVs and the elapsed timesteps:

$$e\_t = \frac{1}{t \times U} \sum\_{\tau=1}^{t} \sum\_{u=1}^{U} e\_{\tau}^u, \quad t = 1, \ldots, T. \tag{7}$$
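The sketch below mirrors Equations (5)-(7) under stated assumptions: the drag coefficient, frontal area, air density, weight, width, hovering term `e0` and coefficient `eta_e` are illustrative placeholders, since the paper omits the specific UAV parameters.

```python
import numpy as np

def flight_power(v, C_D=0.5, A=0.1, rho=1.225, W=20.0, b=0.5):
    """Eq. (5): drag power plus induced power at flying speed v (v > 0)."""
    return 0.5 * C_D * A * rho * v ** 3 + W ** 2 / (rho * b ** 2 * v)

def step_energy(v, e0=1.0, eta_e=0.01):
    """Eq. (6): per-timestep energy of one UAV flying at speed v."""
    return e0 + eta_e * flight_power(v)

def energy_consumption_index(energy_per_step: np.ndarray) -> float:
    """Eq. (7): average over a (t, U) array of per-UAV, per-timestep energies."""
    return float(energy_per_step.mean())
```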

After a complete simulation episode, we calculate the metrics above as the *final global metrics*, denoted as $\{c_T, f_T, e_T\} = \{c_t, f_t, e_t\}_{t=T}$. We want to maximize the coverage and fairness indices so that data are sensed adequately, while minimizing the energy consumption index to save energy. Therefore, following Ye et al. [37], we define the overall objective, the *coverage-fairness-energy score* (CFE score), achieved by a DRL policy $\pi$:

$$CFE\_t(\pi) = \frac{c\_t \times f\_t}{e\_t}, \quad t = 1, \ldots, T. \tag{8}$$

Our objective is thus to optimize the policy $\pi$ to maximize $CFE_T(\pi)$ over the whole episode. Since SAG-MCS is a practical partially observable scenario, UAV agents cannot observe these global metrics of the whole swarm; they can only select actions according to the decentralized policy $\pi_u, \forall u \in \mathcal{U}$, and their own information. Therefore, we propose a heuristic reward function to train the optimal policy $\pi$, which is further introduced in Section 4.4.
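For completeness, a one-line sketch of Equation (8) is given below; it simply combines the three final metrics computed by the helper functions sketched above (all names are hypothetical).

```python
def cfe_score(c_t: float, f_t: float, e_t: float) -> float:
    """Eq. (8): CFE_t = c_t * f_t / e_t (reward coverage and fairness, penalize energy)."""
    return c_t * f_t / e_t
```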

#### **4. Proposed ms-SDRGN Solution For SAG-MCS**

Owing to the multi-scale observation space and the complexity of the SAG-MCS task, we propose a heuristic DRL method named *Multi-Scale Soft Deep Recurrent Graph Network (ms-SDRGN)*. As illustrated in Figure 3, we first utilize a multi-scale convolutional encoder to integrate locally and globally observed information for better feature extraction from the observation space. Based on the concept of DRGN [37], we use a graph attention mechanism (GAT [26]) to aggregate neighboring information through ad-hoc connections, and adopt a gated recurrent unit (GRU [27]) as a memory unit for better long-term performance. In addition, we utilize a maximum-entropy method to learn stochastic policies via a configurable action-entropy objective, and control each UAV agent in a distributed manner. Furthermore, a customized heuristic reward function is proposed for decentralized training.

**Figure 3.** ms-SDRGN Model Architecture.

#### *4.1. Multi-Scale Convolutional Encoder*

Exploiting observations properly is essential for agents to perceive the current state of the RL system and select corresponding actions. Previous DRL methods (e.g., DQN, DGN, MAAC) apply multi-layer perceptrons (MLPs) as linear encoders to process raw observations, which is preferable for scenarios with smaller observation dimensions or less information, such as Cooperative Navigation [47]. However, in our SAG-MCS task, observations and environment states are more complicated and their input sizes are relatively large.

Our intuition is that, compared with an MLP, a convolutional neural network (CNN) is more capable of processing data with spatial structure and large receptive fields, such as images, and can also integrate information from different input channels. We therefore treat the local observation O*local* and the satellites' fuzzy global observation O*global* as simplified real images, and design two CNNs to extract spatial feature representations of the local and global input states separately. Specifically, we construct the local CNN with two convolutional layers and two fully connected layers, which outputs the local embedding $\mathbf{e}_u^{local}$. The global CNN has a larger input scale, so we use five convolutional layers, which yield the global embedding $\mathbf{e}_u^{global}$. As for the auxiliary information in O*self*, we simply use a fully connected layer and take $\mathbf{e}_u^{self}$ as the output from the UAV's self-owned information. Finally, we concatenate them into a multi-scale observation embedding $\mathbf{e}_u$ for UAV $u$:

$$\mathbf{e}\_{u} = \text{concatenate}(\mathbf{e}\_{u}^{\text{local}} \mid \mathbf{e}\_{u}^{\text{global}} \mid \mathbf{e}\_{u}^{\text{self}}), \quad \forall u \in \mathcal{U}. \tag{9}$$

Such multi-scale features help UAVs select better actions by taking full account of: (a) the relative positions between the current UAV and surrounding PoIs, obstacles or other agents; (b) the correlation between the current UAV's remaining battery and the distance to the closest charging station; (c) the distribution of PoIs in the fuzzy global map, for better exploration and coverage.
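The PyTorch sketch below illustrates one possible form of this encoder. The layer counts follow the text (two convolutional plus two fully connected layers for the local CNN, five convolutional layers for the global CNN, one fully connected layer for the self features), but all channel widths, kernel sizes, input resolutions and the class name are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    def __init__(self, local_size=11, global_size=40, self_dim=4, emb_dim=64):
        super().__init__()
        # Local CNN: two conv layers + two FC layers -> e_local
        self.local_cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * local_size * local_size, 128), nn.ReLU(),
            nn.Linear(128, emb_dim), nn.ReLU(),
        )
        # Global CNN: five conv layers over the larger fuzzy global map -> e_global
        self.global_cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, emb_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One FC layer for the UAV's self-owned auxiliary information -> e_self
        self.self_fc = nn.Sequential(nn.Linear(self_dim, emb_dim), nn.ReLU())

    def forward(self, o_local, o_global, o_self):
        # Eq. (9): concatenate local, global and self embeddings per UAV.
        e_local = self.local_cnn(o_local)
        e_global = self.global_cnn(o_global)
        e_self = self.self_fc(o_self)
        return torch.cat([e_local, e_global, e_self], dim=-1)
```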

#### *4.2. Aggregate Adjacent Information with Graph Attention Mechanism*

To enable the multiple agents to exchange information through ad-hoc connections in SAG-MCS, we model the UAV swarm as a graph network, where each node represents a UAV and the edges are the communication links between neighboring UAV pairs. For each node $i$, we take the embedding $\mathbf{e}_i$ extracted from the observation space as its node embedding, and let $G_i$ denote the set of all UAVs networked with UAV node $i$. This is implemented by an adjacency mask $\mathbf{A}$, a $U \times U$ symmetric matrix satisfying $\mathbf{A}(i, j) = 1$ if UAV node $i$ is connected with UAV node $j$. For every UAV node $j \in G_i$, we utilize GAT to determine the attention weight $\alpha_{ij}$ of UAV node $i$ towards each of its neighbors $j$. Building on the concept of self-attention [48], an attention coefficient between node $i$ and its neighboring node $j$ is defined as $\mathbf{e}_{ij} = a(\mathbf{W}\mathbf{e}_i, \mathbf{W}\mathbf{e}_j)$, where $a(\cdot)$ is a shared attentional mechanism. Then, we obtain the attention weight $\alpha_{ij}$ by normalizing $\mathbf{e}_{ij}$ across all possible nodes $j$ with the softmax function:

$$\alpha\_{ij} = \text{softmax}\_j(\mathbf{e}\_{ij}) = \frac{\exp\left(\left(\mathcal{W}\_K \mathbf{e}\_j\right)^T \cdot \mathcal{W}\_Q \mathbf{e}\_i\right)}{\sum\_{k \in G\_i} \exp\left(\left(\mathcal{W}\_K \mathbf{e}\_k\right)^T \cdot \mathcal{W}\_Q \mathbf{e}\_i\right)},\tag{10}$$

Then GAT aggregates information from all adjacent nodes *j* by weighted summation, which is given by:

$$\mathbf{g}\_i = \sum\_{j \in G\_i} \alpha\_{ij} \cdot W\_V \mathbf{e}\_j. \tag{11}$$

where $\mathbf{g}_i$ denotes the aggregated output embedding of UAV $i$ after one GAT layer, and $W_Q, W_K, W_V \in \mathbf{W}$ are learnable weight matrices associated with the query, key, and value vectors.
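A minimal sketch of one such graph-attention aggregation step (Equations (10) and (11)) is given below, using dot-product attention over masked neighbors; the tensor shapes, the `-inf` masking trick and the assumption that the adjacency mask contains self-loops are implementation choices, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W_Q = nn.Linear(dim, dim, bias=False)
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)

    def forward(self, e, adj):
        """e: (U, dim) node embeddings; adj: (U, U) 0/1 mask (assumed to include self-loops)."""
        q, k, v = self.W_Q(e), self.W_K(e), self.W_V(e)
        # scores[i, j] = (W_K e_j)^T (W_Q e_i), as in Eq. (10).
        scores = q @ k.t()
        scores = scores.masked_fill(adj == 0, float('-inf'))
        alpha = F.softmax(scores, dim=-1)   # Eq. (10): attention weights over neighbors j
        return alpha @ v                    # Eq. (11): g_i = sum_j alpha_ij * W_V e_j
```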

As shown in Figure 3, we utilize two GAT layers to aggregate information from neighboring UAV agents within a two-hop communication range, which further expands the perception range and enhances the cooperation of the UAV swarm. For better convergence, we then apply skip connections [49] by concatenating the input observation embedding $\mathbf{e}_i$, the output of the first GAT layer $\mathbf{g}_{i,1}$ and that of the second GAT layer $\mathbf{g}_{i,2}$, as $\mathbf{g}_i = \text{concatenate}(\mathbf{e}_i \mid \mathbf{g}_{i,1} \mid \mathbf{g}_{i,2})$.

Additionally, to make full use of temporal information during the interaction with the RL environment and to improve long-term performance, we integrate a gated recurrent unit (GRU) to memorize temporal features:

$$h\_t = \text{GRU}(\mathbf{g}\_i, h\_{t-1}). \tag{12}$$

where $\mathbf{g}_i$ is taken as the input and $h_t$ is the hidden state of timestep $t$ stored in the memory unit. After adjacent information aggregation and the GRU, we apply an affine transformation layer to $h_t$ to calculate the Q-value $Q(\mathcal{O}_t, a_t)$.
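The following sketch shows one way the pieces of Figure 3 and Equation (12) could be wired together: two GAT layers, skip-connection concatenation, a GRU memory and a final affine Q-head. It reuses the `GraphAttentionLayer` sketch above; the class name, layer sizes and the use of `nn.GRUCell` are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class RecurrentQHead(nn.Module):
    def __init__(self, emb_dim, hidden_dim, n_actions):
        super().__init__()
        self.gat1 = GraphAttentionLayer(emb_dim)          # sketch from Section 4.2
        self.gat2 = GraphAttentionLayer(emb_dim)
        self.gru = nn.GRUCell(3 * emb_dim, hidden_dim)    # input: [e | g_1 | g_2]
        self.q_head = nn.Linear(hidden_dim, n_actions)    # affine layer -> Q(O_t, a_t)

    def forward(self, e, adj, h_prev):
        g1 = self.gat1(e, adj)                 # one-hop aggregation
        g2 = self.gat2(g1, adj)                # two-hop aggregation
        g = torch.cat([e, g1, g2], dim=-1)     # skip connections
        h = self.gru(g, h_prev)                # Eq. (12): GRU memory update
        return self.q_head(h), h
```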

#### *4.3. Learn Stochastic Policies with Adjustable Action Entropy*

Based on the Q-values produced by DRGN, we could learn a deterministic policy, where each Q-value maps to a fixed probability of the corresponding action. However, deterministic policies easily fall into local optima and lack exploration in complex, real-world scenarios. Inspired by the maximum-entropy RL framework [50,51], we utilize a soft Q-loss to learn a stochastic policy in SAG-MCS, with the objective of maximizing the expected reward while steering the action entropy towards a given target. A flow chart of the training process is presented in Figure 4.

**Figure 4.** The training process of ms-SDRGN. In the flow chart, solid lines indicate feed forward propagation, and dashed lines denote updating parameters by backpropagation.

Firstly, we sample previous interaction experiences from the replay buffer as training inputs. The ms-SDRGN learned model infers a set of Q-values from these experiences. Then, we apply a temperature-scaled softmax to the Q-values to obtain the action probabilities:

$$\pi(\mathcal{O}\_t, a\_t) = \text{softmax}\_{a\_t}\left(\frac{Q(\mathcal{O}\_t, a\_t)}{\alpha}\right) = \exp\left(\frac{Q(\mathcal{O}\_t, a\_t)}{\alpha} - \log \sum\_{a\_t} \exp\left(\frac{Q(\mathcal{O}\_t, a\_t)}{\alpha}\right)\right), \tag{13}$$

where $\alpha$ is an adjustable temperature parameter and the Q-value $Q(\mathcal{O}_t, a_t)$ is produced by the learned model given $\mathcal{O}_t$ and $a_t$ as inputs. During simulation, the specific action is sampled from this action probability distribution. Then, we use Equation (13) to estimate the action entropy by computing the expectation of the information entropy over the sampled experiences:

$$\mathbb{E}[\mathcal{H}\_{\pi}(\mathcal{O}, a)] = \mathbb{E}\left[-\sum\_{a\_t \sim \pi} \pi(\mathcal{O}\_t, a\_t) \cdot \log \pi(\mathcal{O}\_t, a\_t)\right], \tag{14}$$

The action entropy represents the action uncertainty of policy $\pi$, which can be adjusted through the temperature parameter $\alpha$. Therefore, we preset a target action entropy $\mathcal{H}_{\pi}^{target} = p_{\alpha} \cdot \max \mathcal{H}_{\pi}$, where the maximum action entropy is determined by the action space as $\max \mathcal{H}_{\pi} = \log(\dim \mathcal{A})$, and $p_{\alpha}$ is a hyper-parameter named the target entropy factor. Note that different RL tasks require different levels of exploration, so $p_{\alpha}$ should be adjusted according to the specific scenario. More concretely, our goal is to let the action entropy $\mathbb{E}[\mathcal{H}_{\pi}(\mathcal{O}, a)]$ approach the pre-defined target action entropy $\mathcal{H}_{\pi}^{target}$ by updating the temperature parameter $\alpha$ through gradient descent:

$$\nabla\_{\alpha} = f\left(\mathcal{H}\_{\pi}^{\text{target}} - \mathbb{E}[\mathcal{H}\_{\pi}(\mathcal{O}, a)]\right). \tag{15}$$

where $f$ is a customized activation function and $\mathcal{H}_{\pi}^{target}$ denotes the target action entropy. This configurable action entropy balances the interaction stability and the exploration capability of the policy.
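The sketch below illustrates Equations (13)-(15) in one place: the temperature-scaled softmax policy, the batch estimate of the action entropy, and an update that pushes $\alpha$ towards the entropy target. The choice of `tanh` for the activation $f$, the log-space update, the learning rate and all function names are assumptions for illustration only.

```python
import math
import torch

def soft_policy(q_values: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Eq. (13): pi(O_t, .) = softmax(Q(O_t, .) / alpha) over the action dimension."""
    return torch.softmax(q_values / alpha, dim=-1)

def update_temperature(q_values: torch.Tensor, log_alpha: torch.Tensor,
                       p_alpha: float, n_actions: int, lr: float = 1e-3):
    """One update step for the temperature, given a (batch, n_actions) batch of Q-values."""
    alpha = log_alpha.exp()
    probs = soft_policy(q_values, alpha)
    # Eq. (14): expected action entropy over the sampled experiences.
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    target = p_alpha * math.log(n_actions)          # H_target = p_alpha * log(dim A)
    # Eq. (15): move alpha according to f(H_target - E[H]); here f = tanh (assumption).
    grad = torch.tanh(target - entropy)
    with torch.no_grad():
        log_alpha += lr * grad                      # raise alpha if entropy is below target
    return log_alpha
```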

Following soft Q-learning [50], we also include the temperature parameter $\alpha$ when defining a V-value function for the target model. Finally, we use the mean squared error between the Q-value function and the V-value function as the Q-loss $Q_{loss}$:

$$V(\mathcal{O}\_t) = \alpha \cdot \log \sum\_{a\_t} \exp\left(\frac{Q(\mathcal{O}\_t, a\_t)}{\alpha}\right), \tag{16}$$

$$Q\_{loss} = \frac{1}{S} \sum \left(r\_t + V(\mathcal{O}\_{t+1}) - Q(\mathcal{O}\_t, a\_t)\right)^2. \tag{17}$$

where $r_t$ is the reward earned in timestep $t$, $V(\mathcal{O}_t)$ denotes the V-value function and $S$ is the batch size. The Q-value function $Q(\mathcal{O}, a)$ of the learned model is updated by minimizing the $Q_{loss}$ in Equation (17). In the learning process, the ms-SDRGN target model is updated periodically by directly copying the parameters of the learned model.
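A minimal sketch of the soft value function and Q-loss in Equations (16) and (17) follows. Here `q_t` holds $Q(\mathcal{O}_t, \cdot)$ from the learned model and `q_next` holds $Q(\mathcal{O}_{t+1}, \cdot)$ from the periodically copied target model; the tensor shapes and the absence of a discount factor simply follow the equations as written, and all names are hypothetical.

```python
import torch

def soft_value(q_next: torch.Tensor, alpha: float) -> torch.Tensor:
    """Eq. (16): V(O_{t+1}) = alpha * logsumexp(Q(O_{t+1}, a) / alpha)."""
    return alpha * torch.logsumexp(q_next / alpha, dim=-1)

def soft_q_loss(q_t: torch.Tensor, actions: torch.Tensor, rewards: torch.Tensor,
                q_next: torch.Tensor, alpha: float) -> torch.Tensor:
    """Eq. (17): mean squared error between r_t + V(O_{t+1}) and Q(O_t, a_t)."""
    q_taken = q_t.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # Q(O_t, a_t)
    target = rewards + soft_value(q_next, alpha)                  # target-model bootstrap
    return ((target.detach() - q_taken) ** 2).mean()
```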
