#### **2. Related Works**

In this section, we review the literature related to the mobile crowdsensing problem, DRL approaches for multi-agent systems, and joint studies of these two topics.

Threat to validity [28,29]: For this review, we used multiple search strings to identify relevant literature from the past decade, such as 'UAV swarm and mobile crowdsensing', 'multi-task allocation and mobile crowdsensing', and 'multi-agent deep reinforcement learning'. Google Scholar was used for forward searches, and most of the related works were retrieved from five databases: IEEE Xplore, SpringerLink, Web of Science, ScienceDirect, and arXiv.

#### *2.1. Multi-Task Allocation for Mobile Crowdsensing*

MCS scenarios usually involve multiple constraints and objectives. One of the key issues is task allocation, i.e., how to choose appropriate action strategies for different tasks. The main tasks of SAG-MCS are data collection by covering PoIs and energy management by keeping batteries charged: UAVs need to automatically select action strategies that meet the data collection requirements under the energy-efficiency constraint. Such multi-agent task allocation is NP-hard, and the related research is still at a relatively early stage. Feng et al. [30] utilized dynamic programming for path planning in UAV-aided MCS and used a Gale–Shapley-based matching algorithm to allocate tasks among agents. Wang et al. [31] modeled multi-task allocation as a dynamic matching problem and proposed a multiple-waitlist-based task assignment (MWTA) algorithm. In addition, several surveys of task allocation have demonstrated the effectiveness of heuristic algorithms. Hayat et al. [13] proposed a genetic algorithm to minimize task completion time in UAV path planning. Similarly, Xu et al. [32] formulated the problem as a specific mathematical model and minimized the incentive cost under a sensing-quality constraint using greedy and genetic algorithms.

#### *2.2. Deep Reinforcement Learning (DRL) for Multi-Agent Systems*

In multi-agent systems, Reinforcement Learning (RL) generally targets problems where agents sequentially interact with a local environment. At timestep $t$, the environment is at state $s_t$ and agent $i$ obtains an observation $o_t^i$. Agent $i$ then selects and executes an action $a_t^i$ based on $o_t^i$, and receives a reward $r_t^i$ from the environment. In a POMDP, agents cannot directly perceive the underlying state, so $o_t^i$ is not equal to $s_t$. The objective of RL is to learn a policy $\pi_i(a^i \mid o^i)$ for agent $i$ that maximizes the expected discounted return $\mathbb{E}[R_t] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_k^i\right]$, with a discount factor $\gamma \in [0, 1]$.
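To make this formalism concrete, the following is a minimal sketch of the per-agent interaction loop and the discounted return above; the `env` and `policies` interfaces are hypothetical placeholders rather than part of our system.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} for t = 0 over one episode."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret

def run_episode(env, policies, horizon):
    """Hypothetical POMDP loop: each agent i sees o_t^i, not the full state s_t."""
    observations = env.reset()                      # {agent_id: o_0^i}
    rewards = {i: [] for i in policies}
    for t in range(horizon):
        actions = {i: policies[i].act(observations[i]) for i in policies}
        observations, step_rewards = env.step(actions)
        for i, r in step_rewards.items():
            rewards[i].append(r)
    return {i: discounted_return(rs) for i, rs in rewards.items()}
```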

Currently, DRL methods have achieved state-of-the-art performance on various RL tasks [24,25], and can be categorized into value-based and policy-based methods. In this paper, we adopt a value-based method. Deep Q-learning (DQN [24]) is one of the most influential value-based DRL approaches. Building on Q-learning, DQN uses deep neural networks to learn a Q-value function $Q(o, a)$, which estimates the expected return $\mathbb{E}[R_t]$ and is updated recursively. DQN regards the action with the largest Q-value as optimal, $\pi(o) = \arg\max_a Q_\pi(o, a)$, and selects it to interact with the environment. In addition, DQN integrates a fixed target network and experience replay to make training more efficient and stable [33]. Specifically, the Q-value function $Q(o, a)$ is updated by minimizing the Q-loss:

$$Q\_{loss} = \left(r\_t + \gamma \max\_{a\_{t+1}} Q'(o\_{t+1}, a\_{t+1}) - Q(o\_t, a\_t)\right)^2, \tag{1}$$

where $Q$ is the learned network, $Q'$ is the target network, and $\gamma$ is the discount factor. Note that the policies learned by DQN are deterministic; therefore, DQN should be trained with exploratory action policies such as $\epsilon$-greedy to enhance exploration.
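For illustration, here is a minimal PyTorch-style sketch of the Q-loss in Equation (1) together with $\epsilon$-greedy action selection; the network interfaces and minibatch layout are assumptions for this sketch, not the implementation used in this paper.

```python
import random
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Q-loss of Equation (1); termination masking is omitted for brevity."""
    obs, actions, rewards, next_obs = batch        # assumed tensor minibatch
    q_taken = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # target network Q' is held fixed
        next_q = target_net(next_obs).max(dim=1).values
    return F.mse_loss(q_taken, rewards + gamma * next_q)

def epsilon_greedy(q_net, obs, epsilon, n_actions):
    """Exploratory action selection around the deterministic DQN policy."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(obs.unsqueeze(0)).argmax(dim=1).item())
```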

Compared with classical heuristic algorithms, DRL allows agents to learn strategies more efficiently and independently, so as to achieve multiple objectives in the sensing area simultaneously.

#### *2.3. DRL Methods for UAV Mobile Crowdsensing*

To date, several studies have investigated the application of DRL algorithms to the UAV Mobile Base Station (MBS) scenario, a sub-topic of MCS in which a swarm of UAVs serves as mobile base stations providing long-term communication services for ground users. Liu et al. [15] proposed a DRL model based on Deep Deterministic Policy Gradient (DDPG [34]) to provide long-term communication coverage in the MBS scenario. Further, Liu et al. [16] implemented DDPG in a fully distributed manner.

Different from policy gradient methods, Dai et al. [35] applied Graph Convolutional Reinforcement Learning (DGN [36]) to MBS. They modeled the UAV swarm as a graph and used a Graph Attention Network (GAT [26]) as the convolution kernel to extract information shared between neighboring UAVs. To further explore the potential of graph networks, Ye et al. [37] designed a flying ad hoc network (FANET) architecture based on GAT, named GAT-FANET, allowing adjacent UAV agents within communication range to exchange information at low cost. This work also applied a Gated Recurrent Unit (GRU) as a memory unit to record and process long-term temporal information from the graph network.

On the basis of MBS, Liu et al. [38,39] took practical factors such as obstacles and charging stations into consideration in the UAV MCS scenario. Based on the actor-critic network of DDPG, their DRL models used CNNs to extract observed spatial information and deployed a distributed experience replay buffer to store previous training information. Piao et al. [40], Dai et al. [41], and Liu et al. [38] utilized Long Short-Term Memory (LSTM [42]) networks to store sequential temporal information from previous interaction episodes. As a specific application of MCS, Dai et al. [41] designed an approach for mobile crowdsensing in which mobile agents retrieve data from and refresh sensors distributed across a city, subject to the sensors' limited storage capacities. Wang et al. [43] proposed a more practical and challenging 3D MCS scene for disaster response simulation, in which the UAVs' action space is expanded to three dimensions.

Compared with the UAV MBS and MCS works mentioned above, this paper proposes a more complicated and promising SAG-MCS scenario, which incorporates global and local observations from space and air, respectively, and encourages UAVs to interact with charging stations as ground nodes. While [38–41] proposed multi-UAV MCS scenarios solved with policy-based DRL methods that rely on LSTM to store temporal information, our approach adopts a value-based method built on DQN and uses a GRU as the memory unit, which performs similarly to LSTM but is more computationally efficient [44]. Furthermore, whereas most MADRL studies on MCS solve the problem with deterministic policies, our method learns a stochastic policy following Ye et al. [37] to improve robustness.

#### **3. System Model and Problem Statement**

In this section, we design a partially observable space-air-ground integrated MCS system in which space-based remote sensing satellites and an aerial UAV swarm jointly perform the MCS task. We define the problem and present the 2D simulation system model in detail. Then, we describe the design of the evaluation metrics.

#### *3.1. System Model*

As illustrated in Figure 1, the SAG-MCS scenario is simplified to a 2-dimensional continuous square area of size $L \times L$ pixels. The simulation area has fixed borders and multiple obstacles that UAVs cannot fly over. We assume a set $\mathcal{K} \triangleq \{k \mid k = 1, 2, \dots, K\}$ of PoIs, where each PoI $k$ is assigned a certain data amount $d(k)$. Note that PoIs are regarded as persistent information nodes and do not disappear after being covered. Additionally, we consider a set $\mathcal{C} \triangleq \{c \mid c = 1, 2, \dots, C\}$ of charging stations and a set $\mathcal{B} \triangleq \{b \mid b = 1, 2, \dots, B\}$ of circular and rectangular obstacles. At the beginning of each simulation episode, the locations of all PoIs, charging stations, and obstacles are randomly distributed over the 2D map. Each PoI's data amount $d(k)$ is randomly assigned within a certain range as well, but the total data volume $\sum_k d(k)$ remains consistent across episodes.
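As a small illustration of this episode initialization, the sketch below samples PoI positions and data amounts and rescales them so that $\sum_k d(k)$ is identical across episodes; the sampling range and total are hypothetical values.

```python
import numpy as np

def init_pois(K, L, d_range=(0.5, 1.5), total_data=100.0, rng=np.random):
    """Randomly place K PoIs in the L x L area and assign data amounts d(k)
    whose sum is rescaled to a fixed per-episode total."""
    positions = rng.uniform(0, L, size=(K, 2))
    d = rng.uniform(*d_range, size=K)
    d *= total_data / d.sum()                      # keep sum_k d(k) constant
    return positions, d
```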

Let $\mathcal{U} \triangleq \{u \mid u = 1, 2, \dots, U\}$ be the $U$ UAV agents deployed in the simulation area, where the UAVs perform continuous horizontal flying movements at a fixed altitude. We define $R_{obs}$ as the observation range and $R_{cov}$ as the coverage (sensing) range of each UAV. Any UAV can observe the local map within radius $R_{obs}$ in real-time, and receives an $L_{sat} \times L_{sat}$ fuzzy global map captured by satellites every few timesteps. Any PoI $k$ within a UAV's $R_{cov}$ is recognized as covered, and all its data $d(k)$ is collected once per timestep $t$. Note that $R_{cov}$ is smaller than $R_{obs}$: UAVs can only collect data when close to PoIs, but can observe a wider area in general. Moreover, we assume the UAV swarm can autonomously form an ad hoc network, in which each pair of agents within communication range $R_{comm}$ can interconnect and exchange observed information for joint decision making. Considering the delays and packet losses in real-world ad hoc networking, we set a communication dropout probability $p$ between adjacent UAV nodes in training and evaluation. As for energy consumption, we track the onboard battery status $\varphi(u) \in [0, 100\%]$ for every UAV $u$.

Each simulation episode of the data collection task in the SAG-MCS scenario lasts $T$ timesteps in total. At the beginning, each UAV's position is randomly assigned and its battery is fully charged. At each timestep $t$, UAV $u$ obtains a local observation from its embedded sensors, and every few timesteps it additionally obtains a fuzzy global observation from the satellite. Using the multi-scale observation $\mathbf{o}_t^u$, UAV $u$ performs an action $\mathbf{a}_t^u$. The battery $\varphi(u)$ consumed at timestep $t$ is denoted $e_t^u$, which is determined by the current flying speed $v_t^u$ and will be introduced in Section 3.4. When a UAV flies close to a charging station, its battery is fully recharged in the next timestep, simulating a real-world battery replacement process on the ground.
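The per-timestep battery dynamics described above can be sketched as follows, where `energy_cost` stands in for the speed-dependent model of Section 3.4 and `near_station` is a hypothetical proximity test against the charging stations.

```python
def update_battery(phi, speed, near_station, energy_cost):
    """One timestep of battery dynamics: consume e_t^u = energy_cost(v_t^u),
    then fully recharge if the UAV has reached a charging station."""
    phi = max(0.0, phi - energy_cost(speed))
    if near_station:
        phi = 1.0                                  # battery replaced / fully charged
    return phi
```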

#### *3.2. Observation Space*

In SAG-MCS, each UAV agent $u$ obtains a multi-scale observation $\mathbf{o}_t^u$ at timestep $t$ from different sources, as introduced in Section 3.1. As shown in Figure 2, we formulate the observation space with three elements: $\mathcal{O} \triangleq \{\mathbf{o}_t^u = (\mathcal{O}^u_{local}, \mathcal{O}^u_{global}, \mathcal{O}^u_{self})\}_{\forall u \in \mathcal{U}}$.

**Figure 2.** The observation space of UAV *u* in SAG-MCS.

(1) Local observation $\mathcal{O}_{local}$ from embedded sensors: each UAV observes local information in real-time within a circle of radius $R_{obs}$ centered on itself. Let $\mathcal{O}_{local} \triangleq \{\mathbf{o}_l^u = (\mathcal{O}^{u,1}_{local}, \mathcal{O}^{u,2}_{local}, \mathcal{O}^{u,3}_{local})\}_{\forall u \in \mathcal{U}}$ denote the local observation space, which consists of three 2D channels. The first channel contains the data amounts and distribution of surrounding PoIs: a pixel takes the value $d(k)$ if it corresponds to PoI $k$, and 0 otherwise. The second channel contains the locations of obstacles relative to the UAV, with pixel value 1 at obstacle coordinates and 0 elsewhere. The third channel includes the locations of other UAVs within $R_{obs}$, likewise with pixel value 1 at UAV coordinates and 0 elsewhere (the channel encoding is sketched after this list).

(2) Global observation $\mathcal{O}^u_{global}$ from satellites: every $n$ timesteps, satellites capture a fuzzy global observation and transmit it to all UAVs. As shown in Figure 2, $\mathcal{O}^u_{global}$ consists of three 2D channels with reduced size $L_{sat} \times L_{sat}$ ($L_{sat} < L$), which cannot provide precise global locations of the environment elements. We define $\mathcal{O}_{global} \triangleq \{\mathbf{o}_g^u = (\mathcal{O}^{u,1}_{global}, \mathcal{O}^{u,2}_{global}, \mathcal{O}^{u,3}_{global})\}_{\forall u \in \mathcal{U}}$ in absolute positioning coordinates. The encoding is nearly the same as for the local observation, except that in the third channel (UAV locations) the pixel corresponding to the absolute location of UAV $u$ itself is set to $-1$ in the global map.

(3) Auxiliary observation $\mathcal{O}^u_{self}$: we further utilize information from the onboard flight control computer to help each UAV learn an optimal policy. Specifically, we define $\mathcal{O}_{self} \triangleq \{\mathbf{o}_s^u = \mathrm{concatenate}(x(u), y(u), v_x(u), v_y(u), \varphi(u), \{\Delta x(c), \Delta y(c)\}_{\forall c \in \mathcal{C}})\}_{\forall u \in \mathcal{U}}$. For UAV $u$, $\mathbf{o}_s^u$ includes its absolute position, velocity, current remaining battery, and the relative locations of all charging stations with respect to UAV $u$.
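To summarize the encoding rules of the local channels, the sketch below builds the three 2D channels of $\mathcal{O}_{local}$ for one UAV; the integer grid resolution and UAV-relative coordinates are simplifying assumptions of this sketch.

```python
import numpy as np

def encode_local_obs(grid_size, pois, poi_data, obstacles, other_uavs):
    """Three 2D channels of O_local: PoI data amounts, obstacle mask, UAV mask.
    All inputs are integer grid coordinates relative to the observing UAV."""
    obs = np.zeros((3, grid_size, grid_size), dtype=np.float32)
    for (x, y), d in zip(pois, poi_data):
        obs[0, x, y] = d                           # channel 1: d(k) at PoI cells
    for (x, y) in obstacles:
        obs[1, x, y] = 1.0                         # channel 2: obstacle cells
    for (x, y) in other_uavs:
        obs[2, x, y] = 1.0                         # channel 3: neighboring UAVs
    return obs
```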

#### *3.3. Action Space*

Rotor UAVs are capable of applying different *thrust* in all directions responsively. For simplicity, we discretize the entire 2-dimensional continuous space into eight directions, and UAV agents can apply *maximum-thrust* (denoted as 1.0 unit), *half-thrust* (0.5 unit), or *zero-thrust* (0 unit) in any direction. Note that zero-thrust represents hovering in place. Therefore, the action space in SAG-MCS is defined as:

$$\mathcal{A} \triangleq \left\{ \mathbf{a}\_t^u = (\theta\_t^u, f\_t^u) \mid \theta\_t^u \in \{\frac{k\pi}{4} \mid k = 0, 1, \dots, 7\}, f\_t^u \in \{0, 0.5, 1.0\} \right\},\tag{2}$$

where $\theta_t^u$ denotes the thrust angle and $f_t^u$ the thrust magnitude. The action space $\mathcal{A}$ consists of 17 actions in total: 8 directions × 2 nonzero thrust levels, plus a single hovering action, since zero thrust is identical for all directions. Since the timestep interval in the simulation is quite short, we model the physics as uniform acceleration within each timestep, so a UAV adjusts the magnitude and direction of its velocity by selecting actions.
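As a sanity check on Equation (2), the discrete action set can be enumerated directly and the uniform-acceleration update applied; the unit timestep is an assumption of this sketch.

```python
import math

# 8 directions x {0.5, 1.0} thrust, plus one hovering action: 17 actions in total
ACTIONS = [(k * math.pi / 4, f) for k in range(8) for f in (0.5, 1.0)]
ACTIONS.append((0.0, 0.0))                         # zero thrust: hover in place
assert len(ACTIONS) == 17

def apply_action(vx, vy, theta, f, dt=1.0):
    """Uniform acceleration over one short timestep: thrust changes velocity."""
    return vx + f * math.cos(theta) * dt, vy + f * math.sin(theta) * dt
```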
