Article

Joint Drone Access and LEO Satellite Backhaul for a Space–Air–Ground Integrated Network: A Multi-Agent Deep Reinforcement Learning-Based Approach

1 6G Research Center, China Telecom Research Institute, Beijing 102209, China
2 Hisilicon Technologies Co., Ltd., Beijing 100085, China
3 Department of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(6), 218; https://doi.org/10.3390/drones8060218
Submission received: 19 April 2024 / Revised: 23 May 2024 / Accepted: 24 May 2024 / Published: 25 May 2024
(This article belongs to the Special Issue Space–Air–Ground Integrated Networks for 6G)

Abstract

The space–air–ground integrated network can provide services to ground users in remote areas by utilizing high-altitude platform (HAP) drones to support stable user access and using low earth orbit (LEO) satellites to provide large-scale traffic backhaul. However, the rapid movement of LEO satellites requires dynamic maintenance of the matching relationship between LEO satellites and HAP drones. Additionally, different traffic types generated at HAP drones hold varying levels of value. Therefore, a tripartite matching problem among LEO satellites, HAP drones, and traffic types, jointly considering multi-dimensional characteristics such as remaining visible time, channel condition, handover latency, and traffic storage capacity, is formulated as a mixed integer nonlinear programming problem to maximize the average transmitted traffic value. The traffic generation state at HAP drones is modeled as a mixture of stochasticity and determinism, which aligns with real-world scenarios but poses challenges for traditional optimization solvers. Thus, the original problem is decoupled into two independent sub-problems: traffic–drone matching and LEO–drone matching, which are addressed by mathematical simplification and by multi-agent deep reinforcement learning with centralized training and decentralized execution, respectively. Simulation results verify the effectiveness and superiority of the proposed tripartite matching approach.

1. Introduction

Densely deployed ground communication infrastructures can provide access services for mobile and Internet of Things (IoT) users in urban areas, with the advantages of high data rates and small propagation delay. However, deploying infrastructures in remote areas such as the ocean and desert is challenging and expensive. Various applications in remote areas, such as forest monitoring, desert communication, and maritime logistics, are therefore difficult to serve [1,2]. There are still approximately three billion people worldwide living without Internet access, presenting an obstacle for 6G in realizing seamless connectivity and ubiquitous access [3,4]. Achieving user access and traffic backhaul for mobile and IoT users in remote areas has thus become crucial [5].
Satellite communication makes up for the shortage of terrestrial networks and provides users with large-scale access services with wide coverage. The utilization of low earth orbit (LEO) satellites for global Internet access and traffic backhaul has garnered attention due to their lower development and launch cost and transmission latency compared with geostationary earth orbit (GEO) and medium earth orbit (MEO) satellites [6]. The use of inter-satellite links (ISLs) enables the traffic generated by ground IoT users to be relayed among LEO satellites and transmitted back to the terrestrial traffic center [7]. However, the severe path loss between LEO satellites and ground IoT users makes it difficult for users to directly access LEO satellites due to limited transmission power.
In order to reduce the demand for user-side transmission power, the space–air–ground integrated network has attracted a lot of attention from academia and industry in 6G [8,9]. Compared to the orbital altitude of hundreds or thousands of kilometers of LEO satellites, the altitude of drones is much lower, thus requiring lower transmission power from ground IoT users [10]. In the space–air–ground integrated network, drones in the air are utilized to support user access with lower transmission energy costs, and satellites in space are used to provide traffic backhaul with global coverage [11]. They work together with communication infrastructures on the ground to provide users with various application services. In recent years, a category of drone that can provide users with more stable access, namely the high-altitude platform (HAP) drone, has become a research hotspot. Different from traditional drones, HAP drones hover at an altitude of about 20 km in the stratosphere, with base stations deployed on them to provide users with ubiquitous and stable access. HAP drones can extend communication capabilities across the space, air, and ground domains. Specifically, aerial networks composed of HAP drones are utilized to support user access and collect traffic generated by users in remote or inaccessible areas lacking communication infrastructures. Then, LEO satellites are used to support traffic backhaul to the terrestrial traffic center, thus supplying stable access and traffic backhaul [12].
Due to the advantages of low deployment cost, flexible on-demand deployment, and reliable line-of-sight communication links, HAP drones have been employed in the satellite–ground network for user access, traffic backhaul, and task execution. However, practical issues in the space–air–ground integrated network have been overlooked in existing research. For instance, due to the high mobility of LEO satellites, HAP drones need to be handed over between different LEO satellites. Therefore, the calculation of the available traffic transmission time of HAP drones must jointly consider the remaining visible time and the handover latency. Furthermore, different traffic types generated at HAP drones hold varying values, suggesting a preference for establishing matching for high-value traffic types first. Lastly, the assumption of a specific constant traffic generation state at HAP drones in existing research does not align with the stochastic and deterministic nature of traffic generation in practice, rendering conventional static matching algorithms inapplicable [13].
Therefore, in order to address the issues mentioned above, a tripartite matching problem among LEO satellites, HAP drones, and traffic types is investigated for the space–air–ground integrated network in this paper. Specifically, the main contributions of this paper are as follows:
  • First, the network architecture and working mechanism of the space–air–ground integrated network are introduced, aiming at achieving user access and traffic backhaul in remote areas. Different from the conventional static traffic generation state with deterministic variables, the traffic generation state at HAP drones is modeled as a mixture of stochasticity and determinism, which aligns with real-world scenarios.
  • Then, different from the conventional schemes that treat all traffic types as equally important, we develop a tripartite matching problem among LEO satellites, HAP drones, and traffic types based on the different values of different traffic types. The problem can be decoupled into two sub-problems: traffic–drone matching and LEO–drone matching. Traffic–drone matching is simplified into multiple separate sub-subproblems through mathematical analysis, which can be addressed independently. LEO–drone matching cannot be solved by conventional optimization solvers since the traffic generation state at drones is a mixture of stochasticity and determinism; thus, reinforcement learning is adopted. Moreover, due to the significant propagation latency between the terrestrial traffic center and LEO satellites, a conventional centralized scheme cannot obtain the latest status of the network and therefore cannot devise LEO–drone matching strategies in a timely manner. In addition, the state space of the LEO–drone matching sub-problem is continuous. Therefore, a multi-agent deep reinforcement learning approach with centralized training and decentralized execution is proposed, in which the value network is centrally trained at the terrestrial traffic center and the LEO–drone matching strategy is devised in a timely, decentralized manner at the LEO satellites.
  • Finally, the convergence performance of the proposed matching approach is discussed and analyzed through simulations. In addition, the proposed algorithm is compared with state-of-the-art algorithms under different network parameters to validate its effectiveness.
The rest of the paper is organized as follows. The related works are discussed in Section 2. The system model and working mechanism are illustrated in Section 3. Section 4 formulates and simplifies the tripartite matching problem. In Section 5, the formulated problem is solved by the multi-agent deep reinforcement learning algorithm. Simulation results are presented and discussed in Section 6. Future work is summarized in Section 7. Finally, conclusions are drawn in Section 8.

2. Related Works

Abbasi et al. first presented the potential use cases, open challenges, and possible solutions of HAP drones for next-generation networks [14]. The main communication links between HAP drones and other non-terrestrial network (NTN) platforms, along with their advantages and challenges, are presented in [15]. Due to the rapid movement of LEO satellites, the matching relationship between HAP drones and LEO satellites is not fixed, so efficient matching and association strategies need to be developed. In [16], the matching relationship between user equipment (UE), HAP drones, and terrestrial base stations (BSs) is formulated as a mixed discrete–continuous optimization problem under HAP drone payload connectivity constraints, HAP drone and BS power constraints, and backhaul constraints to maximize the network throughput; the formulated problem is solved using a combination of integer linear programming and generalized assignment problems. A deep Q-learning (DQL) approach is proposed in [17] to perform the user association between a terrestrial base station and a HAP drone based on the channel state information of the previous time slot. In addition to the above-mentioned UE selection between terrestrial and non-terrestrial networks, there has been relevant research on the three-party matching problem among users, HAP drones, and satellites in remote areas without terrestrial network coverage. In [18], the matching problem among users, HAP drones, and satellites is formulated to maximize the total revenue and solved by a satellite-oriented restricted three-sided matching algorithm. In [19], a throughput maximization problem is formulated for ground users in an integrated satellite–aerial–ground network by comprehensively optimizing user association, transmission power, and unmanned aerial vehicle (UAV) trajectory. In [20], a UAV-LEO integrated traffic collection network is proposed to maximize the uploaded traffic volume under energy consumption constraints by comprehensively considering bandwidth allocation, UAV trajectory design, power allocation, and LEO satellite selection. The maximum computation delay among terminals is minimized in [21] by jointly considering the matching relationship, resource allocation, and deployment location optimization; an alternating optimization algorithm based on block coordinate descent and successive convex approximation is proposed to solve it. A joint association and power allocation approach is proposed for the space–air–ground network in [22] to maximize the transmitted traffic amount while minimizing the transmit power under the constraints of the power budget and quality of service (QoS) requirements of HAP drones and the data storage and visibility time of LEO satellites. The association problem and the power allocation problem are alternately addressed by the GUROBI optimizer and the whale optimization algorithm, respectively.
It is worth mentioning that reinforcement learning (RL) algorithms are widely used for HAP drone problems. HAP drones form a distributed network, and with multi-agent RL, the space–air–ground integrated network can effectively become self-organizing. In [23], a multi-agent Q-learning approach is proposed to tackle the service function chain placement problem for LEO satellite networks in a discrete-time stochastic control framework, thus optimizing the long-term system performance. In [24], a multi-agent deep reinforcement learning algorithm with global rewards is proposed to optimize the transmit power, CPU frequency, bit allocation, offloading decision, and bandwidth allocation via a decentralized method, thus achieving the computation offloading and resource allocation for the LEO satellite edge computing network. In [25], the utility of HAP drones is maximized by jointly optimizing association and resource allocation, which is formulated as a Stackelberg game. The formulated problem is transformed into a stochastic game model, and a multi-agent deep RL algorithm is adopted to solve it.

3. System Model and Working Mechanism

In order to provide services for mobile users and IoT users in remote areas, the space–air–ground integrated network is investigated in this paper, and its network architecture is shown in Figure 1. It utilizes an aerial network composed of HAP drones to collect traffic generated by various IoT users, thus providing stable and large-scale access services for areas without ground communication infrastructures. Via a drone–LEO link and multiple LEO–LEO links, the collected traffic is then relayed to the LEO satellite connected to a ground station to achieve traffic backhaul. Finally, the ground station downloads the traffic via the LEO–ground link and transmits it back to the terrestrial traffic center for processing via optical fibers. Ground devices access HAP drones through the C-band. HAP drones are directly connected to LEO satellites through the Ka-band to achieve high-rate traffic backhaul [26].

3.1. Traffic Generation Model for HAP Drones

For the space–air–ground integrated network, the drone–LEO link needs to transmit the traffic generated by the HAP drone itself and the traffic collected from various mobile and IoT users on the ground. This traffic can be divided into traffic types generated at a determined rate, which mainly include HAP drone health status and UE location, and traffic types generated abruptly with random probability, such as malfunction diagnosis and signaling execution. Therefore, the traffic generation state at HAP drones is modeled as a mixture of stochasticity and determinism. Markov chains can be used to describe the traffic generation models of the various types uniformly, as shown in Figure 2. Specifically, the generation of each traffic type at HAP drones is modeled as a Markov chain with two states: on and off. In the on state, traffic is generated at a constant rate, whereas traffic generation ceases in the off state. Denote the self-transition probabilities of the q-th traffic type from on to on as $p_{1,q}$ and from off to off as $p_{2,q}$, where $q \in \{1, 2, \ldots, Q\}$ and Q is the total number of traffic types. For traffic types generated at a constant rate, we have $p_{1,q} = 1$ and $p_{2,q} = 0$. For traffic types generated abruptly with random probability, we have $0 < p_{1,q} < 1$ and $0 < p_{2,q} < 1$, which means that the state switches randomly between on and off.
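To make the on–off model concrete, the following minimal Python sketch simulates the per-slot generation of a single traffic type as a two-state Markov chain. The rate and probability values in the usage lines are illustrative placeholders, not parameters taken from the paper.

```python
import random

def simulate_traffic(p1_q, p2_q, rate_q, num_slots, start_on=True):
    """Simulate one traffic type as a two-state (on/off) Markov chain.

    p1_q: P(on -> on), p2_q: P(off -> off). In the 'on' state, traffic is
    generated at the constant rate rate_q; in the 'off' state, none is generated.
    """
    on = start_on
    generated = []
    for _ in range(num_slots):
        generated.append(rate_q if on else 0.0)
        # Sample the next state from the self-transition probabilities.
        if on:
            on = random.random() < p1_q
        else:
            on = not (random.random() < p2_q)
    return generated

# Deterministic traffic type: p1 = 1, p2 = 0 -> always on (illustrative rate).
print(simulate_traffic(1.0, 0.0, rate_q=2.0, num_slots=5))
# Bursty traffic type: the state switches randomly between on and off.
print(simulate_traffic(0.9, 0.95, rate_q=2.0, num_slots=5))
```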
In addition, different traffic types have varying levels of importance in practical scenarios. For instance, the traffic carrying the remaining power of HAP drones is more valuable than other traffic types. To account for this, we introduce a value factor μ q to represent the value of the q-th traffic type. The optimization objective is to maximize the average transmitted traffic value of the network in each time slot. Unlike the conventional approach, which treats all traffic types equally, we prioritize the transmission of high-value traffic when system resources are restricted, which aligns better with actual transmission requirements.

3.2. Traffic Transmission Model between LEO Satellites and HAP Drones

Suppose that there are I HAP drones at an altitude of $h_1$ and J LEO satellites at an altitude of $h_2$ in the space–air–ground integrated network. The LEO satellite set is denoted as $\mathcal{J} = \{1, 2, \ldots, J\}$, and the HAP drone set is denoted as $\mathcal{I} = \{1, 2, \ldots, I\}$. Each HAP drone is equipped with an omnidirectional antenna, and each LEO satellite is equipped with L steerable beams. The time interval is divided into M time slots of length $T_0$, and the time slot set is denoted as $\mathcal{M} = \{1, 2, \ldots, M\}$. When $T_0$ is sufficiently small, the matching between LEO satellites and HAP drones in each time slot can be treated as quasi-static. In each time slot, one LEO satellite beam can serve no more than one HAP drone, and one HAP drone can establish a connection with at most one LEO satellite. We define a LEO–drone matching matrix $\mathbf{X}_{I \times J}^m$ to describe the matching relationship between LEO satellites and HAP drones in the m-th time slot. If the i-th HAP drone is served by the j-th LEO satellite in the m-th time slot, then $x_{i,j}^m = 1$; otherwise, $x_{i,j}^m = 0$.
This work focuses on mobile users and IoT users in depopulated regions with almost no obstacles. Therefore, small-scale fading due to multi-path effects can be neglected. The channel gain from the i-th HAP drone to the j-th LEO satellite in the m-th time slot can be expressed as follows [27]:
$$h_{i,j}^m = \left( \frac{c}{4 \pi f_c d_{i,j}^m} \right)^2, \quad (1)$$
where c and $f_c$ represent the speed of light and the carrier frequency, respectively, and $d_{i,j}^m$ represents the distance between the i-th HAP drone and the j-th LEO satellite in the m-th time slot. Based on this, the traffic transmission rate between the i-th HAP drone and the j-th LEO satellite can be expressed as follows:
$$R_{i,j}^m = W \log_2 \left( 1 + \frac{P_h G_i G_j h_{i,j}^m}{k_B T_b W} \right), \quad (2)$$
where W is the bandwidth of LEO beams, $P_h$ is the transmit power of HAP drones, and $G_i$ and $G_j$ represent the antenna gains of the HAP drone transmitter and the LEO satellite receiver, respectively [28]. $k_B$ is Boltzmann's constant, and $T_b$ is the system noise temperature. When the channel gain between the i-th HAP drone and the j-th LEO satellite exceeds a given threshold $h_0$, this HAP drone is considered to be within the visible range of the j-th LEO satellite. In the m-th time slot, the set of HAP drones within the visible range of the j-th LEO satellite can be expressed as follows:
$$\mathcal{I}_j^m = \left\{ i \mid h_{i,j}^m \geq h_0 \right\}. \quad (3)$$
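As a quick numerical illustration of (1) and (2), the snippet below evaluates the free-space channel gain and the resulting Shannon rate for one drone–satellite link. The link-budget numbers in the example (antenna gains, noise temperature, slant range) are assumed for illustration and are not the simulation settings of this paper.

```python
import math

C = 3.0e8           # speed of light (m/s)
K_B = 1.380649e-23  # Boltzmann's constant (J/K)

def channel_gain(d_m, f_c_hz):
    """Free-space channel gain of Equation (1): (c / (4*pi*f_c*d))^2."""
    return (C / (4.0 * math.pi * f_c_hz * d_m)) ** 2

def transmission_rate(d_m, f_c_hz, bw_hz, p_tx_w, g_tx, g_rx, t_noise_k):
    """Shannon rate of Equation (2) for one drone-LEO link."""
    snr = p_tx_w * g_tx * g_rx * channel_gain(d_m, f_c_hz) / (K_B * t_noise_k * bw_hz)
    return bw_hz * math.log2(1.0 + snr)

# Placeholder example: 10 GHz carrier, 10 MHz beam, ~600 km slant range,
# 10 W transmit power, 30 dBi antennas on both ends, 290 K noise temperature.
g_30dbi = 10 ** (30 / 10)
rate = transmission_rate(600e3, 10e9, 10e6, 10.0, g_30dbi, g_30dbi, 290.0)
print(f"rate = {rate / 1e6:.2f} Mbit/s")
```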
As a result of the high-speed movement of LEO satellites, handover is required when the HAP drone moves outside the visible range of its serving LEO satellite. HAP drones are unable to send traffic to LEO satellites during the handover duration $T_h$, which can be approximately expressed as follows:
$$T_h = \frac{\kappa \, d_{i,j}^m}{c}, \quad (4)$$
where $\kappa$ is the number of signaling messages that need to be exchanged between the HAP drone and the LEO satellite during handover. Therefore, the available traffic transmission time in the m-th time slot can be represented as follows:
$$T_{i,j}^m = \begin{cases} T_0 - \left( 1 - x_{i,j}^{m-1} \right) \dfrac{\kappa \, d_{i,j}^m}{c}, & T_{i,j}^{remain} \geq T_0, \\ T_{i,j}^{remain}, & T_{i,j}^{remain} < T_0, \end{cases} \quad (5)$$
where $T_{i,j}^{remain}$ represents the remaining visible time between the j-th LEO satellite and the i-th HAP drone. In each time slot, a HAP drone can only choose one of the Q traffic types for transmission. We define a traffic–drone matching matrix $\mathbf{Y}_{I \times Q}^m$ to describe the transmission status of different traffic types at each HAP drone in the m-th time slot. If the q-th traffic type of the i-th HAP drone is sent in the m-th time slot, then $y_{i,q}^m = 1$; otherwise, $y_{i,q}^m = 0$. Thus, in the m-th time slot, the maximum traffic volume from the q-th traffic type of the i-th HAP drone to the j-th LEO satellite can be expressed as follows:
$$U_{i,q,j}^m = x_{i,j}^m \, y_{i,q}^m \, R_{i,j}^m \, T_{i,j}^m. \quad (6)$$
The transmitted traffic volume of the q-th traffic type of the i-th HAP drone in the m-th time slot can be represented as follows:
$$\tilde{U}_{i,q}^m = \min \left( S_{i,q}^m, \sum_{j=1}^{J} U_{i,q,j}^m \right), \quad (7)$$
where $S_{i,q}^m$ is the traffic volume of the q-th traffic type stored at the i-th HAP drone in the m-th time slot, which can be obtained as follows:
$$S_{i,q}^m = S_{i,q}^{m-1} - \tilde{U}_{i,q}^{m-1} + G_{i,q}^{m-1}, \quad (8)$$
where $G_{i,q}^{m-1}$ denotes the traffic volume of the q-th traffic type newly generated at the i-th HAP drone in the $(m-1)$-th time slot. It is a random variable that follows the traffic generation model defined in Section 3.1.
Therefore, the total transmitted traffic value of the space–air–ground integrated network can be given as follows:
$$U_{total} = \sum_{m=1}^{M} \sum_{i=1}^{I} \sum_{q=1}^{Q} \mu_q \, \tilde{U}_{i,q}^m. \quad (9)$$
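The per-slot bookkeeping implied by (4)–(9) can be sketched as follows for a single HAP drone served by at most one LEO satellite in a slot; the function and argument names are illustrative, and the returned value corresponds to this drone's contribution to $U_{total}$ in one slot.

```python
def slot_update(x_prev, x_now, y_iq, rate, t_remain, slot_len, kappa, dist,
                stored, generated, mu):
    """One-slot bookkeeping for a single HAP drone, following (4)-(9).

    x_prev/x_now: LEO-drone matching indicator in the previous/current slot,
    y_iq: 0/1 list marking the traffic type chosen for transmission,
    stored/generated: per-type buffered and newly generated traffic volumes,
    mu: per-type value factors.
    Returns this drone's transmitted value in the slot and the updated buffers.
    """
    c = 3.0e8
    handover = (1 - x_prev) * kappa * dist / c                           # Equation (4)
    t_avail = slot_len - handover if t_remain >= slot_len else t_remain  # Equation (5)
    value = 0.0
    new_stored = []
    for s_q, g_q, mu_q, y_q in zip(stored, generated, mu, y_iq):
        capacity = x_now * y_q * rate * t_avail                          # Equation (6)
        sent = min(s_q, capacity)                                        # Equation (7)
        value += mu_q * sent                                             # term of (9)
        new_stored.append(s_q - sent + g_q)                              # Equation (8)
    return value, new_stored
```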

4. Problem Formulation and Transformation

The optimization objective is to establish tripartite matching among LEO satellites, HAP drones, and traffic types by choosing the most suitable LEO–drone matching matrix $\mathbf{X}_{I \times J}^m$ and traffic–drone matching matrix $\mathbf{Y}_{I \times Q}^m$ in each time slot, so as to maximize the average transmitted traffic value of the network. The objective function and constraints can be formulated as follows:
$$\max_{\mathbf{X}, \mathbf{Y}} \ \lim_{M \to \infty} \frac{U_{total}}{M} \quad (10a)$$
$$\text{s.t.} \ \sum_{j=1}^{J} x_{i,j}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, \quad (10b)$$
$$\sum_{i=1}^{I} x_{i,j}^m = L, \ \forall j \in \mathcal{J}, \forall m \in \mathcal{M}, \quad (10c)$$
$$\sum_{q=1}^{Q} y_{i,q}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, \quad (10d)$$
$$x_{i,j}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall j \in \mathcal{J}, \forall m \in \mathcal{M}, \quad (10e)$$
$$y_{i,q}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall q \in \mathcal{Q}, \forall m \in \mathcal{M}. \quad (10f)$$
Constraint (10b) specifies that each HAP drone can connect to at most one LEO satellite in each time slot. Constraint (10c) specifies that the number of HAP drones served by each LEO satellite equals the beam number L. Note that although each LEO satellite could serve fewer than L HAP drones, doing so would leave satellite beams underutilized; thus, in order to maximize the average transmitted traffic value of the network in each time slot, all beams of each satellite are utilized. Constraint (10d) specifies that each HAP drone can transmit at most one traffic type in each time slot. Constraints (10e) and (10f) restrict the elements of the LEO–drone matching matrix and the traffic–drone matching matrix to binary values, respectively.
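For reference, a short helper that checks whether candidate matching matrices for one time slot satisfy constraints (10b)–(10f); representing the matrices as nested Python lists is an illustrative choice.

```python
def is_feasible(X, Y, L):
    """Check candidate matchings for one time slot against (10b)-(10f).

    X: I x J LEO-drone matching matrix, Y: I x Q traffic-drone matching matrix,
    both as nested lists of 0/1 entries; L is the beam number per satellite.
    """
    num_drones, num_sats = len(X), len(X[0])
    # (10e)/(10f): entries must be binary.
    if any(v not in (0, 1) for row in X for v in row):
        return False
    if any(v not in (0, 1) for row in Y for v in row):
        return False
    # (10b): each HAP drone connects to at most one LEO satellite.
    if any(sum(row) > 1 for row in X):
        return False
    # (10c): each LEO satellite serves exactly L HAP drones (all beams used).
    if any(sum(X[i][j] for i in range(num_drones)) != L for j in range(num_sats)):
        return False
    # (10d): each HAP drone transmits at most one traffic type.
    if any(sum(row) > 1 for row in Y):
        return False
    return True
```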
The formulated problem (10a) is a mixed integer nonlinear programming problem. In the following, we analyze and simplify it. Given a specific $\mathbf{X}_{I \times J}^m$ and substituting (9) into (10a), the original problem becomes the following:
$$\max_{\mathbf{Y}} \ \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{I} \sum_{q=1}^{Q} \mu_q \, \tilde{U}_{i,q}^m \quad (11a)$$
$$\text{s.t.} \ \sum_{q=1}^{Q} y_{i,q}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, \quad (11b)$$
$$y_{i,q}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall q \in \mathcal{Q}, \forall m \in \mathcal{M}. \quad (11c)$$
Through analysis, it becomes evident that $\sum_{q=1}^{Q} \mu_q \tilde{U}_{i,q}^m$ depends solely on the matching $\{ y_{i,q}^m \mid q \in \mathcal{Q} \}$ between the i-th HAP drone and all traffic types in the m-th time slot, and is independent of the matching $\{ y_{i',q}^{m'} \mid q \in \mathcal{Q}, i' \neq i, m' \neq m \}$ between other HAP drones and traffic types in other time slots. Consequently, maximizing (11a) can be achieved by maximizing each term within the brackets of (11a). Thus, (11a) can be rephrased as follows:
$$\lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{I} \max_{\{ y_{i,q}^m \mid q \in \mathcal{Q} \}} \sum_{q=1}^{Q} \mu_q \, \tilde{U}_{i,q}^m \quad (12a)$$
$$\text{s.t.} \ \sum_{q=1}^{Q} y_{i,q}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, \quad (12b)$$
$$y_{i,q}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall q \in \mathcal{Q}, \forall m \in \mathcal{M}. \quad (12c)$$
Formulation (12a) is equivalent to optimizing $I \times M$ independent sub-subproblems. For any $i \in \mathcal{I}$ and $m \in \mathcal{M}$, the sub-subproblem can be formulated as follows:
$$\max_{\{ y_{i,q}^m \mid q \in \mathcal{Q} \}} \ \sum_{q=1}^{Q} \mu_q \, \tilde{U}_{i,q}^m \quad (13a)$$
$$\text{s.t.} \ \sum_{q=1}^{Q} y_{i,q}^m \in \{0, 1\}, \quad (13b)$$
$$y_{i,q}^m \in \{0, 1\}, \ \forall q \in \mathcal{Q}. \quad (13c)$$
Its feasible region can be expressed as follows:
$$y_{i,q}^m = 0, \ \forall q \in \mathcal{Q}, \quad (14)$$
or
$$y_{i,q}^m = \begin{cases} 1, & q = q_0, \\ 0, & q \in \mathcal{Q}, q \neq q_0. \end{cases} \quad (15)$$
Regarding the former, the optimal value of (13a) is 0, whereas for the latter, the optimal value is greater than or equal to 0. Hence, the optimal solution of (13a) must adhere to (15), so as to maximize the objective function. By substituting (15) into (13a), it is equivalent to addressing the following:
$$\max_{q \in \mathcal{Q}} \ \mu_q \cdot \min \left( S_{i,q}^m, \sum_{j=1}^{J} x_{i,j}^m R_{i,j}^m T_{i,j}^m \right). \quad (16)$$
Its optimal solution can be expressed as follows:
$$q_i^{*}(m) = \arg\max_{q \in \mathcal{Q}} \ \mu_q \min \left( S_{i,q}^m, \sum_{j=1}^{J} x_{i,j}^m R_{i,j}^m T_{i,j}^m \right). \quad (17)$$
Based on this, the optimal solution of (11a) can be expressed as follows:
$$y_{i,q}^m = \begin{cases} 1, & q = q_i^{*}(m), \\ 0, & q \neq q_i^{*}(m), \end{cases} \quad \forall i \in \mathcal{I}, \forall m \in \mathcal{M}. \quad (18)$$
At this point, we have successfully decomposed the optimization sub-problem (11a) into $I \times M$ independent sub-subproblems through mathematical analysis. The optimal traffic–drone matching matrix $\mathbf{Y}_{I \times Q}^m$ can be obtained according to (18). Intuitively, once the LEO–drone matching of each time slot is determined, the maximum average transmitted traffic value of the network can be achieved by choosing the traffic type with the highest value for each HAP drone to transmit.
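Equations (17) and (18) translate directly into a per-drone selection rule; a minimal sketch, with illustrative argument names, is given below.

```python
def best_traffic_type(mu, stored_i, x_i, rates_i, t_avail_i):
    """Return (q*, y_i) for one HAP drone per Equations (17)-(18).

    mu: value factors, stored_i: buffered volume per traffic type,
    x_i/rates_i/t_avail_i: per-satellite matching indicator, rate, and
    available transmission time for this drone in the current slot.
    """
    capacity = sum(x * r * t for x, r, t in zip(x_i, rates_i, t_avail_i))
    # Value achieved if traffic type q is the one selected for transmission.
    values = [mu_q * min(s_q, capacity) for mu_q, s_q in zip(mu, stored_i)]
    q_star = max(range(len(mu)), key=values.__getitem__)
    y_i = [1 if q == q_star else 0 for q in range(len(mu))]
    return q_star, y_i
```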
Substituting the optimal solution (18) into the objective function (10a) yields the following:
$$\max_{\mathbf{X}} \ \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{I} \max_{q \in \mathcal{Q}} \ \mu_q \min \left( S_{i,q}^m, \sum_{j=1}^{J} x_{i,j}^m R_{i,j}^m T_{i,j}^m \right) \quad (19a)$$
$$\text{s.t.} \ \sum_{j=1}^{J} x_{i,j}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, \quad (19b)$$
$$\sum_{i=1}^{I} x_{i,j}^m = L, \ \forall j \in \mathcal{J}, \forall m \in \mathcal{M}, \quad (19c)$$
$$x_{i,j}^m \in \{0, 1\}, \ \forall i \in \mathcal{I}, \forall j \in \mathcal{J}, \forall m \in \mathcal{M}, \quad (19d)$$
which is solely associated with the LEO–drone matching matrix $\mathbf{X}_{I \times J}^m$.

5. Problem Solving and Algorithm Designing

Typically, conventional optimization solvers are employed to solve problems with deterministic variables [29]. Problems with random variables are difficult to solve using these solvers. Nevertheless, the tripartite matching problem that this paper focuses on involves a mixture of stochasticity and determinism. Therefore, we adopt reinforcement learning to dynamically solve the LEO–drone matching sub-problem (19a). Specifically, the matching between each LEO satellite and the HAP drones is modeled as a Markov decision process [30], where each LEO satellite is treated as an agent. The state, action, and reward of the j-th LEO satellite are defined as follows:
  • State: $s_j^m = \left\{ \{T_{i,j}^m\}_{i \in \mathcal{I}}, \{R_{i,j}^m\}_{i \in \mathcal{I}}, \{S_{i,q}^{m-1}\}_{i \in \mathcal{I}, q \in \mathcal{Q}}, \{G_{i,q}^{m-1}\}_{i \in \mathcal{I}, q \in \mathcal{Q}} \right\}$.
    In the m-th time slot, the j-th LEO satellite obtains the state of each HAP drone within its visible range, which includes the available traffic transmission time $T_{i,j}^m$ and the traffic transmission rate $R_{i,j}^m$ of the current m-th time slot, as well as the stored traffic volume $S_{i,q}^{m-1}$ and the newly generated traffic volume $G_{i,q}^{m-1}$ of each traffic type in the previous $(m-1)$-th time slot. For a HAP drone $i_0$ that is not within the visible range of the j-th LEO satellite, i.e., $i_0 \notin \mathcal{I}_j^m$, we set $T_{i_0,j}^m = 0$, $R_{i_0,j}^m = 0$, $S_{i_0,q}^{m-1} = 0$, and $G_{i_0,q}^{m-1} = 0$ for all $q \in \mathcal{Q}$ (a construction sketch follows this list).
  • Action: $a_j^m = \left\{ x_{i,j}^m \,\middle|\, \sum_{i=1}^{I} x_{i,j}^m = L \right\}$.
    In the m-th time slot, the action of the j-th LEO satellite is to determine which L HAP drones to serve. If multiple LEO satellites decide to provide services to the same HAP drone, this HAP drone will connect to the LEO satellite offering the highest transmitted traffic value.
  • Reward: $r_j \left( s_j^m, a_j^m \right) = \sum_{i=1}^{I} \sum_{q=1}^{Q} \mu_q U_{i,q,j}^m$.
    In the m-th time slot, the reward obtained by the j-th LEO satellite after taking action $a_j^m$ in state $s_j^m$ is defined as the total transmitted traffic value of the j-th LEO satellite in the current time slot.
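To make the zero-padding of non-visible drones explicit, the sketch below assembles the local state vector of one LEO agent; the flat-list layout and the argument names are assumptions for illustration rather than the paper's implementation.

```python
def build_state(visible, t_avail, rates, stored_prev, generated_prev,
                num_drones, num_types):
    """Assemble the local state s_j^m of one LEO agent as a flat list.

    visible: set of drone indices within this satellite's visible range;
    t_avail[i], rates[i]: per-drone scalars for the current slot;
    stored_prev[i][q], generated_prev[i][q]: per-drone, per-type volumes from
    the previous slot. Drones outside the visible range are zero-padded.
    """
    state = []
    for i in range(num_drones):
        if i in visible:
            state += [t_avail[i], rates[i]]
            state += list(stored_prev[i]) + list(generated_prev[i])
        else:
            state += [0.0, 0.0] + [0.0] * (2 * num_types)
    return state
```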
Then, reinforcement learning is employed to solve (19a) based on the above definitions. The discounted return of the j-th LEO satellite in the m-th time slot is defined as follows:
$$G_j^m = \sum_{\tau=0}^{\infty} \gamma^{\tau} r_j \left( s_j^{m+\tau}, a_j^{m+\tau} \right), \quad (20)$$
where $\gamma \in [0, 1)$ represents the discount rate, which is used to balance the impact of short-term and long-term rewards. If $\gamma$ is close to 0, the discounted return mainly depends on recent rewards. Conversely, if $\gamma$ approaches 1, the discounted return depends more heavily on future rewards. Q-values can be used to evaluate the expected return that the j-th LEO satellite can achieve by taking action $a_j$ based on policy $\pi_j$ in state $s_j$, which can be expressed as follows:
$$q_{\pi_j} \left( s_j, a_j \right) = \mathbb{E} \left[ G_j^m \mid s_j^m = s_j, a_j^m = a_j \right]. \quad (21)$$
In conventional Q-learning, the Q-values of the optimal policy $\pi_j^*$ can be continuously updated through iterations. Generate an episode of length $T_{\max}$. For the t-th iteration, the Q-value of the state–action pair $\left( s_j^t, a_j^t \right)$ can be obtained as follows [31]:
$$q_j^{t+1} \left( s_j^t, a_j^t \right) = q_j^t \left( s_j^t, a_j^t \right) - \vartheta_j^t \left[ q_j^t \left( s_j^t, a_j^t \right) - \left( \bar{r}_j^t \left( s_j^t, a_j^t \right) + \gamma \max_{a \in \mathcal{A}_j} q_j^t \left( s_j^{t+1}, a \right) \right) \right], \quad (22)$$
where $t \in \{1, 2, \ldots, T_{\max}\}$, $s_j^t$ represents the state at the t-th step of the episode, and $a_j^t$ denotes the action taken in state $s_j^t$. $\vartheta_j^t$ represents the learning rate, and $\mathcal{A}_j$ denotes the action space of the j-th LEO satellite. $\bar{r}_j^t \left( s_j^t, a_j^t \right)$ denotes the average one-step immediate reward acquired after taking action $a_j^t$ in state $s_j^t$, which can be represented as follows:
$$\bar{r}_j^t \left( s_j^t, a_j^t \right) = \mathbb{E} \left[ r_j \left( s, a \right) \mid s = s_j^t, a = a_j^t \right]. \quad (23)$$
Supposing that the proposed approach converges after C iterations, the optimal policy can be expressed as follows [32]:
$$\pi_j^* \left( a \mid s_j \right) = \begin{cases} 1, & a = \arg\max_{a' \in \mathcal{A}_j} q_j^C \left( s_j, a' \right), \\ 0, & \text{otherwise}. \end{cases} \quad (24)$$
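For contrast with the deep variant adopted below, a tabular version of the update in (22) and the greedy policy in (24) can be sketched as follows; the dictionary-keyed Q-table and the toy numbers are illustrative only.

```python
from collections import defaultdict

def q_learning_step(q_table, s, a, reward, s_next, actions, lr, gamma):
    """One tabular update of Equation (22):
    q(s,a) <- q(s,a) - lr * (q(s,a) - (r + gamma * max_a' q(s',a')))."""
    target = reward + gamma * max(q_table[(s_next, a2)] for a2 in actions)
    q_table[(s, a)] -= lr * (q_table[(s, a)] - target)

def greedy_policy(q_table, s, actions):
    """Deterministic optimal policy of Equation (24) after convergence."""
    return max(actions, key=lambda a: q_table[(s, a)])

# Usage sketch with a default-zero Q-table and toy states/actions.
q_table = defaultdict(float)
q_learning_step(q_table, s=0, a=1, reward=2.5, s_next=1,
                actions=[0, 1], lr=0.1, gamma=0.98)
print(greedy_policy(q_table, s=0, actions=[0, 1]))
```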
The aforementioned conventional Q-learning algorithm stores the calculated Q-values $q_j^t \left( s_j^t, a_j^t \right)$ in the form of a table, known as a Q-table, which has the advantage of being intuitive and easy to analyze. However, due to the continuous state space of (19a), using the conventional tabular Q-learning algorithm requires storing a large volume of data, thereby increasing storage costs. Furthermore, the generalization ability of the conventional Q-learning algorithm is poor. To address these issues, a deep Q-learning algorithm is employed in this paper, which is one of the earliest and most successful algorithms introducing deep neural networks into reinforcement learning [32]. In deep Q-learning, the high-dimensional Q-table is approximated by a deep Q network with low-dimensional parameters, thereby significantly reducing the storage cost. In addition, the Q-values of unvisited state–action pairs can be calculated through value function approximation, giving the approach strong generalization ability.
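A minimal value-network sketch in PyTorch is shown below: an MLP that maps a state vector to one Q-value per candidate action. The layer sizes, the number of enumerated candidate actions, and the state dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates q_j(s, a) for all enumerated candidate actions of one agent."""

    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q-value per candidate action
        )

    def forward(self, state):
        return self.net(state)

# Illustrative sizes: I = 20 drones, Q = 4 types -> state length 20 * (2 + 2*4).
q_main = QNetwork(state_dim=200, num_actions=32)
q_values = q_main(torch.zeros(1, 200))  # batch of one state
```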
In addition, the aforementioned algorithm is fully decentralized, in which each satellite calculates its Q-values according to its own local states, local actions, and local rewards. However, LEO satellites are not completely independent but influence each other. For example, if the i-th HAP drone is connected to the j-th LEO satellite at the current moment, other LEO satellites cannot provide service for this HAP drone. Therefore, the fully decentralized reinforcement learning algorithm cannot achieve high performance and may not even converge in some cases. An alternative solution is to use a fully centralized reinforcement learning algorithm, in which, in each time slot, each LEO satellite sends the experience obtained from its interaction with the environment to the terrestrial traffic center, and both value network training and strategy making are performed at the center based on global experiences. Nevertheless, the experience of each satellite must pass through multiple ISLs, an LEO–ground link, and an optical fiber link to be transmitted back to the terrestrial traffic center, incurring high propagation latency. The terrestrial traffic center is therefore unable to obtain the latest status of the space–air–ground integrated network and cannot make timely LEO–drone matching strategies. To address these issues, we employ multi-agent deep reinforcement learning with centralized training and decentralized execution. The value network of each LEO satellite is trained in a centralized manner at the terrestrial traffic center. Then, the trained value networks are distributed to the corresponding LEO satellites [33]. Each satellite distributively trains its policy network based on the received value network and the latest local observations, so that it can devise LEO–drone matching strategies in a timely manner.
Specifically, when training the value network, each LEO satellite sends its local experience $\left( s_j, a_j, r_j, s_j' \right)$ obtained from its interaction with the environment to the terrestrial traffic center, where $s_j'$ is the state reached after taking action $a_j$ in state $s_j$. Based on the collected local experiences of the various LEO satellites, the terrestrial traffic center forms the global experience, including the global state $s = \left( s_1, s_2, \ldots, s_J \right)$, the global action $a = \left( a_1, a_2, \ldots, a_J \right)$, and the global reached state $s' = \left( s_1', s_2', \ldots, s_J' \right)$, and stores $\left( s, a, r_j, s' \right)$ in the replay buffer $\mathcal{D}_j$. Afterwards, the terrestrial traffic center trains the value network of the j-th LEO satellite based on $\left( s, a, r_j, s' \right)$ to evaluate the quality of the matching approach. As previously mentioned, the deep Q-learning algorithm is adopted, where the true Q-values of the optimal strategy are approximated by the Q-values calculated by the trained value network, which can be obtained through the quasi-static target network scheme [34]. Specifically, two networks need to be defined: the target network $\hat{q}_j \left( S, A, \omega_{j,target} \right)$ and the main network $\hat{q}_j \left( S, A, \omega_{j,main} \right)$, described by the parameters $\omega_{j,target}$ and $\omega_{j,main}$, respectively, where S and A are the global states and global actions collected by the terrestrial traffic center in the form of random variables. The objective of parameter iteration is to minimize the mean square error between the Q-values calculated by the target network and the main network. This can be achieved by minimizing the loss function, which can be expressed as follows:
$$J_j \left( \omega_{j,main} \right) = \mathbb{E} \left[ \left( \hat{q}_j \left( S, A, \omega_{j,main} \right) - \left( R_j + \gamma \max_{a} \hat{q}_j \left( S', a, \omega_{j,target} \right) \right) \right)^2 \right], \quad (25)$$
where $S'$ and $R_j$ represent the reached state and the acquired reward after taking action A in state S, respectively. The gradient-descent algorithm is then adopted to minimize the objective function. The gradient of (25) can be calculated as follows:
$$\nabla_{\omega_{j,main}} J_j \left( \omega_{j,main} \right) = \mathbb{E} \left[ \left( R_j + \gamma \max_{a} \hat{q}_j \left( S', a, \omega_{j,target} \right) - \hat{q}_j \left( S, A, \omega_{j,main} \right) \right) \nabla_{\omega_{j,main}} \hat{q}_j \left( S, A, \omega_{j,main} \right) \right], \quad (26)$$
where $\nabla_{\omega_{j,main}} \hat{q}_j \left( S, A, \omega_{j,main} \right)$ can be obtained through the gradient back-propagation algorithm [35]. In each iteration, an experience batch $\mathcal{D}_j^{batch}$ is randomly sampled from the replay buffer $\mathcal{D}_j$ to train the value network. For each sample $\left( s, a, r_j, s' \right)$ in $\mathcal{D}_j^{batch}$, the parameter $\omega_{j,main}$ of the main network is updated as follows:
$$\omega_{j,main} \leftarrow \omega_{j,main} - \beta \nabla_{\omega_{j,main}} J_j \left( \omega_{j,main} \right), \quad (27)$$
where $\beta$ is the learning rate. After every $\Delta$ iterations, the parameter $\omega_{j,target}$ of the target network is updated to $\omega_{j,main}$:
$$\omega_{j,target} \leftarrow \omega_{j,main}. \quad (28)$$
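A training-step sketch corresponding to (25)–(28) is given below, assuming main and target networks with the interface of the QNetwork sketched earlier (any torch.nn.Module mapping states to per-action Q-values would do); the batch layout and the choice of optimizer are illustrative.

```python
import torch
import torch.nn.functional as F

def train_value_network(q_main, q_target, optimizer, batch, gamma):
    """One gradient step on the loss of Equation (25) using sampled experience.

    batch: (states, actions, rewards, next_states), where actions is a
    LongTensor of shape (B,) holding the index of the action taken.
    """
    states, actions, rewards, next_states = batch
    # Q-value of the taken action under the main network.
    q_sa = q_main(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target: r + gamma * max_a' q_target(s', a').
        target = rewards + gamma * q_target(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, target)   # Equation (25)
    optimizer.zero_grad()
    loss.backward()                   # gradient of Equation (26) via backprop
    optimizer.step()                  # parameter update of Equation (27)
    return loss.item()

def sync_target(q_main, q_target):
    """Periodic hard update of Equation (28)."""
    q_target.load_state_dict(q_main.state_dict())
```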
Algorithm 1 presents the matching algorithm based on multi-agent deep reinforcement learning, in which an $\epsilon$-greedy strategy is used to balance exploitation and exploration. The value network of each LEO satellite is centrally trained at the terrestrial traffic center based on the global states, the global actions, and the local reward of each LEO satellite. Then, the trained value network is sent to the corresponding LEO satellite. At the j-th LEO satellite, its policy network can be trained in a decentralized manner based on its received value network with parameter $\omega_{j,target}$ and its local observations. Afterwards, each LEO satellite develops its own optimal strategy based on its trained policy network to maximize the long-term return. Finally, each LEO satellite broadcasts the matching strategy to all HAP drones within its visible range.
Algorithm 1 Matching approach based on multi-agent deep reinforcement learning
Input: Episode length $T_{\max}$, learning rate $\beta$, greedy factor $\epsilon$, discount factor $\gamma$, iteration number $\Delta$; randomly initialize the parameters $\omega_{j,main}$ and the states $s_j^1$; let $\omega_{j,target} = \omega_{j,main}$, $\delta = 0$, $\mathcal{D}_j = \varnothing$, and $\mathcal{D}_j^{batch} = \varnothing$;
Output: Optimal strategy for each LEO satellite
 1: for $t = 1$ to $T_{\max}$ do
 2:     for $j = 1$ to $J$ do
 3:         The j-th LEO satellite takes action $a_j^t$ according to the $\epsilon$-greedy strategy, where the optimal action is $\arg\max_{a \in \mathcal{A}_j} \hat{q}_j \left( s_j^t, a, \omega_{j,target} \right)$;
 4:         Interact with the environment to obtain the reward $r_j^t$ and the reached state $s_j^{t+1}$;
 5:     end for
 6:     Form the global state $s^t = \left( s_1^t, \ldots, s_J^t \right)$, the global action $a^t = \left( a_1^t, \ldots, a_J^t \right)$, and the global reached state $s^{t+1} = \left( s_1^{t+1}, \ldots, s_J^{t+1} \right)$;
 7:     for $j = 1$ to $J$ do
 8:         Store $\left( s^t, a^t, r_j^t, s^{t+1} \right)$ into the replay buffer $\mathcal{D}_j$;
 9:         Randomly sample an experience batch $\mathcal{D}_j^{batch}$ from $\mathcal{D}_j$;
10:         Update $\omega_{j,main}$ based on $\mathcal{D}_j^{batch}$ according to (27);
11:     end for
12:     $\delta = \delta + 1$;
13:     if $\delta == \Delta$ then
14:         $\omega_{j,target} = \omega_{j,main}$ for $j = 1, \ldots, J$;
15:         $\delta = 0$;
16:     end if
17:     for $j = 1$ to $J$ do
18:         Send the trained value network $\hat{q}_j \left( s_j, a_j, \omega_{j,target} \right)$ to the j-th LEO satellite;
19:         The j-th LEO satellite trains its own policy network based on $s_j$ and $\hat{q}_j \left( s_j, a_j, \omega_{j,target} \right)$;
20:         Develop the optimal strategy of the j-th LEO satellite based on its trained policy network;
21:     end for
22: end for

6. Simulation Results

In order to verify the effectiveness of the proposed matching algorithm, preliminary simulations are conducted. The main simulation parameters are listed in Table 1. We compare the proposed approach with some state-of-the-art algorithms, including deep deterministic policy gradient (DDPG), deep Q-network (DQN), and two greedy methods.
  • For the first greedy method (abbreviated as Greedy 1), each LEO satellite will choose the L HAP drones with the highest channel gains within its visible range to establish connections (a selection sketch follows this list).
  • For the second greedy method (abbreviated as Greedy 2), each LEO satellite will choose the L HAP drones with the longest remaining visible time within its visible range to establish connections.
  • For both Greedy 1 and Greedy 2, each HAP drone that has established a connection with an LEO satellite will choose the traffic type with the largest transmitted traffic value for transmission.
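The drone-selection rule shared by the two greedy baselines can be sketched as follows for one LEO satellite; the inputs are illustrative per-drone metrics (channel gains for Greedy 1, remaining visible times for Greedy 2).

```python
def greedy_select(metric, visible, L):
    """Pick the L visible HAP drones with the largest metric value.

    For Greedy 1 the metric is the channel gain; for Greedy 2 it is the
    remaining visible time.
    """
    ranked = sorted(visible, key=lambda i: metric[i], reverse=True)
    return ranked[:L]

# Example: metric[i] holds the channel gain (or remaining time) of drone i.
# chosen = greedy_select(metric, visible={0, 3, 7, 9, 12}, L=4)
```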
Figure 3 illustrates the transmitted traffic values of the proposed matching approach in one time slot under episode lengths of 500, 1000, 1500, 2000, 2500, and 3000. When the episode length does not exceed 2000, the transmitted traffic values in one time slot increase significantly as the episode length increases. However, when the episode length exceeds 2000, the transmitted traffic values in one time slot are essentially the same for the various episode lengths. Thus, the episode length is set to 2000 in subsequent simulations, thereby saving computational resources while ensuring performance. Furthermore, it can be observed that for any episode length, the transmitted traffic value first increases and then remains essentially stable, which validates the convergence of the proposed matching algorithm.
Figure 4 illustrates the variation of the relative mean square error of the Q-values obtained by the target network and the main network under learning rates of 0.15, 0.1, 0.08, and 0.05. As the learning rate $\beta$ increases from 0.05 to 0.1, the rate of decrease in the relative mean square error accelerates. Nevertheless, as the learning rate continues to increase from 0.1 to 0.15, the rate of decrease remains almost unchanged, but its fluctuations increase. Therefore, in order to balance convergence speed and stability, we set the learning rate $\beta$ to 0.1 in subsequent simulations.
Figure 5 illustrates the total transmitted traffic values of different algorithms under varying HAP drone transmission powers. It can be seen that as the transmission power increases, the total transmitted traffic values of all algorithms increase. This is because, according to (2), increasing the transmission power of HAP drones improves the traffic transmission rates, thereby increasing the total transmitted traffic value of the space–air–ground integrated network. From Figure 5, we can see that the proposed multi-agent deep RL algorithm performs best. Since multi-agent deep RL utilizes centralized training and decentralized execution to reduce the interference of non-stationary environments among agents, the proposed algorithm can increase the transmitted traffic value compared with DDPG and DQN. Furthermore, all three RL-based algorithms perform better than the greedy methods for the following two reasons.
  • Greedy 1 aims to improve the transmission rate between LEO satellites and HAP drones by choosing HAP drones with higher channel gains, thereby increasing the total transmitted traffic value. Similarly, Greedy 2 focuses on reducing the handover latency by choosing HAP drones with long remaining visible time, thereby improving the available traffic transmission time of HAP drones, so as to increase the total transmitted traffic value. In contrast, the RL-based algorithms take a more comprehensive perspective by jointly considering multi-dimensional characteristics such as remaining visible time, channel condition, handover latency, and traffic storage capacity. Thus, the RL-based algorithms can improve the total transmitted traffic value of the network from a global perspective, surpassing the performance of Greedy 1 and Greedy 2.
  • Both Greedy 1 and Greedy 2 rely on static matching algorithms, which fail to account for the randomness of traffic generation at HAP drones. In contrast, the RL-based algorithms can learn the randomness of the traffic generation at HAP drones and make the matching strategy based on this learning.
Figure 6 illustrates the total transmitted traffic values of different algorithms with respect to the LEO satellite beam number L. As the number of LEO satellite beams increases, the total transmitted traffic values of all algorithms also increase. This is because increasing the number of LEO satellite beams relaxes constraint (10c), thereby allowing more HAP drones to transmit traffic to LEO satellites simultaneously, so as to increase the total transmitted traffic value of the space–air–ground integrated network. From Figure 6, we can see that the proposed multi-agent deep reinforcement learning algorithm performs best since it can learn from the experience of the other LEO satellites. Furthermore, all three RL-based algorithms perform better than the greedy methods for the same reasons discussed for Figure 5.

7. Future Work

Although the proposed approach can effectively address the tripartite matching problem among LEO satellites, HAP drones, and traffic types, there are some limitations.

7.1. Matching among Various Network Nodes

In this paper, only the matching problem between HAP drones and LEO satellites is considered. However, in the space–air–ground integrated network, in addition to HAP drones and LEO satellites, there are also a variety of network nodes, such as ground users, gateway stations, and geostationary earth orbit satellites. In the future, it is necessary to investigate the matching relationships among different nodes to improve the topology of the space–air–ground integrated network. For example, the matching problem between ground users and HAP drones should be addressed by comprehensively considering multiple factors such as the location, movement speed, and service requirements of ground users and the payloads of HAP drones.

7.2. Computing Task Assignment and Resource Allocation

Our research only considers how to perform user access and traffic backhaul in remote areas where ground base stations are difficult to deploy. However, in addition to serving remote areas, HAP drones can also provide low-latency edge computing services for IoT devices in urban areas with ground base station coverage. In the future, the great pressure that computing-intensive applications place on resource-constrained Internet of Things (IoT) devices with limited computing capability and energy storage can be alleviated by offloading latency-sensitive computing tasks to nearby edge nodes. A matching strategy for ground users, HAP drones, and ground base stations should be developed by jointly optimizing computing task assignment and resource allocation, thus improving the performance of the space–air–ground integrated network, such as minimizing the maximum task execution latency among IoT devices or maximizing the amount of transmitted traffic per unit time.

7.3. HAP Drone Localization

The positions of HAP drones are assumed to be stationary and known in our paper. However, the positions of HAP drones will constantly change due to jitter. Only by knowing the exact location of each HAP drone can we accurately calculate the distance between the HAP drone and the LEO satellite, the remaining visible time, and the channel capacity. Therefore, the exact location of each HAP drone is essential for making the user access and traffic backhaul strategy of the space–air–ground integrated network. In the future, the HAP drone localization problem needs to be solved. Additional positioning systems can be added to estimate the exact location of each HAP drone. For example, reinforcement learning-based algorithms can be used to regularly predict the exact location of each HAP drone by inputting atmospheric data such as wind speed.

8. Conclusions

In this paper, the matching problem between HAP drones and LEO satellites in the space–air–ground integrated network has been investigated. First, we introduced the network architecture and working mechanism, including the traffic generation model and the traffic transmission model. Then, a tripartite matching problem that takes comprehensive consideration of multi-dimensional characteristics has been formulated to maximize the average transmitted traffic value of the network. Through mathematical simplification, the optimization problem is then decoupled into two independent sub-problems: traffic–drone matching and LEO–drone matching. The former can be decomposed into multiple independent and easily solvable sub-subproblems. Considering the mixed stochastic and deterministic traffic generation model, the long propagation latency between LEO satellites and the terrestrial traffic center, and the continuous state space, we proposed a multi-agent deep reinforcement learning approach with centralized training and decentralized execution to solve the LEO–drone matching problem. In this approach, the value network is trained in a centralized manner at the terrestrial traffic center, and the matching strategy is formulated in a timely, decentralized manner at the LEO satellites. Finally, the proposed approach has been compared with multiple state-of-the-art algorithms through simulations, and the results have demonstrated the effectiveness and efficiency of the proposed algorithm.

Author Contributions

Conceptualization, X.H.; methodology, Z.W. and X.X.; validation, X.H., Z.W. and X.X.; formal analysis, X.H. and M.P.; investigation, X.H. and Z.W.; writing—original draft preparation, X.H.; writing—review and editing, X.H., Z.W., X.X. and M.P.; supervision, X.H. and Z.W.; project administration, X.X.; funding acquisition, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2020 National Key R&D Program “Broadband Communication and New Network” special “6G Network Architecture and Key Technologies” 2020YFB1806700.

Data Availability Statement

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Conflicts of Interest

Author Zhibo Wang is employed by Hisilicon Technologies Co., Ltd. All authors declare that there are no conflicts of interest.

References

  1. Jia, Z.; Sheng, M.; Li, J.; Han, Z. Toward data collection and transmission in 6G space-air-ground integrated networks: Cooperative HAP and LEO satellite schemes. IEEE Internet Things J. 2022, 9, 10516–10528. [Google Scholar] [CrossRef]
  2. Li, Z.; Wang, Y.; Liu, M.; Sun, R.; Chen, Y.; Yuan, J.; Li, J. Energy efficient resource allocation for UAV-assisted space-air-ground internet of remote things networks. IEEE Access 2019, 7, 145348–145362. [Google Scholar] [CrossRef]
  3. Liu, J.; Shi, Y.; Fadlullah, Z.M.; Kato, N. Space-air-ground integrated network: A survey. IEEE Commun. Surv. Tutor. 2018, 20, 2714–2741. [Google Scholar] [CrossRef]
  4. Heng, M.; Wang, S.Y.; Li, J.; Liu, R.; Zhou, D.; He, L. Toward a flexible and reconfigurable broadband satellite network: Resource management architecture and strategies. IEEE Wirel. Commun. 2017, 24, 127–133. [Google Scholar]
  5. Qiu, J.; Grace, D.; Ding, G.; Zakaria, M.D.; Wu, Q. Air-ground heterogeneous networks for 5G and beyond via integrating high and low altitude platforms. IEEE Wirel. Commun. 2019, 26, 140–148. [Google Scholar] [CrossRef]
  6. Zhou, D.; Sheng, M.; Luo, J.; Liu, R.; Li, J.; Han, Z. Collaborative data scheduling with joint forward and backward induction in small satellite networks. IEEE Trans. Commun. 2019, 67, 3443–3456. [Google Scholar] [CrossRef]
  7. Karapantazis, S.; Pavlidou, F. Broadband communications via high-altitude platforms: A survey. IEEE Commun. Surv. Tutor. 2005, 7, 2–31. [Google Scholar] [CrossRef]
  8. Nafees, M.; Huang, S.; Thompson, J.; Safari, M. Backhaul-aware user association and throughput maximization in UAV-aided hybrid FSO/RF network. Drones 2023, 7, 74. [Google Scholar] [CrossRef]
  9. Ding, C.; Wang, J.B.; Zhang, H.; Lin, M.; Li, G.Y. Joint optimization of transmission and computation resources for satellite and high altitude platform assisted edge computing. IEEE Trans. Wirel. Commun. 2022, 21, 1362–1377. [Google Scholar] [CrossRef]
  10. Gonzalo, J.; López, D.; Domínguez, D.; García, A.; Escapa, A. On the capabilities and limitations of high altitude pseudo-satellites. Prog. Aerosp. Sci. 2018, 98, 37–56. [Google Scholar] [CrossRef]
  11. Wang, W.; Li, H.; Liu, Y.; Cheng, W.; Liang, R. Files cooperative caching strategy based on physical layer security for air-to-ground integrated IoV. Drones 2023, 7, 163. [Google Scholar] [CrossRef]
  12. Huang, X.; Chen, P.; Xia, X. Heterogeneous optical network and power allocation scheme for inter-cubesat communication. Opt. Lett. 2024, 49, 1213–1216. [Google Scholar] [CrossRef]
  13. Pham, Q.-V.; Mirjalili, S.; Kumar, N.; Alazab, M.; Hwang, W.-J. Whale optimization algorithm with applications to resource allocation in wireless networks. IEEE Trans. Veh. Technol. 2020, 69, 4285–4297. [Google Scholar] [CrossRef]
  14. Abbasi, O.; Yadav, A.; Yanikomeroglu, H.; Dao, N.-D.; Senarath, G.; Zhu, P. HAPS for 6G networks: Potential use cases, open challenges, and possible solutions. IEEE Wirel. Commun. 2024, 1–8. [Google Scholar] [CrossRef]
  15. Lou, Z.; Youcef Belmekki, B.E.; Alouini, M.-S. HAPS in the non-terrestrial network nexus: Prospective architectures and performance insights. IEEE Wirel. Commun. 2024, 30, 52–58. [Google Scholar] [CrossRef]
  16. Liu, S.; Dahrouj, H.; Alouini, M.-S. Joint user association and beamforming in integrated satellite-HAPS-ground networks. IEEE Trans. Veh. Technol. 2024, 73, 5162–5178. [Google Scholar] [CrossRef]
  17. Khoshkbari, H.; Sharifi, S.; Kaddoum, G. User association in a VHetNet with delayed CSI: A deep reinforcement learning approach. IEEE Commun. Lett. 2023, 27, 2257–2261. [Google Scholar] [CrossRef]
  18. Jia, Z.; Sheng, M.; Li, J.; Zhou, D.; Han, Z. Joint HAP access and LEO satellite backhaul in 6G: Matching game-based approaches. IEEE J. Sel. Areas Commun. 2021, 39, 1147–1159. [Google Scholar] [CrossRef]
  19. Pervez, F.; Zhao, L.; Yang, C. Joint user association, power optimization and trajectory control in an integrated satellite-aerial-terrestrial network. IEEE Trans. Wirel. Commun. 2022, 21, 3279–3290. [Google Scholar] [CrossRef]
  20. Ma, T.; Zhou, H.; Qian, B.; Cheng, N.; Shen, X.; Chen, X.; Bai, B. UAV-LEO integrated backbone: A ubiquitous data collection approach for B5G internet of remote things networks. IEEE J. Sel. Areas Commun. 2021, 39, 3491–3505. [Google Scholar] [CrossRef]
  21. Mao, S.; He, S.; Wu, J. Joint UAV position optimization and resource scheduling in space-air-ground integrated networks with mixed cloud-edge computing. IEEE Syst. J. 2021, 15, 3992–4002. [Google Scholar] [CrossRef]
  22. Ei, N.N.; Aung, P.S.; Park, S.-B.; Huh, E.-N.; Hong, C.S. Joint association and power allocation for data collection in HAP-LEO-assisted IoT networks. In Proceedings of the International Conference on Information Networking (ICOIN), Bangkok, Thailand, 11–14 January 2023; pp. 206–211. [Google Scholar]
  23. Doan, K.; Avgeris, M.; Leivadeas, A.; Lambadaris, I.; Shin, W. Service function chaining in LEO satellite networks via multi-agent reinforcement learning. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Kuala Lumpur, Malaysia, 4–8 December 2023; pp. 7145–7150. [Google Scholar]
  24. Li, H.; Yu, J.; Cao, L.; Zhang, Q.; Hou, S.; Song, Z. Multi-agent Reinforcement learning based computation offloading and resource allocation for LEO satellite edge computing networks. Comput. Commun. 2024, 222, 268–276. [Google Scholar] [CrossRef]
  25. Seid, A.M.; Erbad, A. Multi-agent RL for SDN-based resource allocation in HAPS-assisted IoV networks. In Proceedings of the IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 1664–1669. [Google Scholar]
  26. Mei, C.; Gao, C.; Wang, H.; Xing, Y.; Ju, N.; Hu, B. Joint task offloading and resource allocation for space-air-ground collaborative network. Drones 2023, 7, 482. [Google Scholar] [CrossRef]
  27. Dong, F.; Li, H.; Gong, X.; Liu, Q.; Wang, J. Energy-efficient transmissions for remote wireless sensor networks: An integrated HAP/satellite architecture for emergency scenarios. Sensors 2015, 15, 22266–22290. [Google Scholar] [CrossRef] [PubMed]
  28. Leyva-Mayorga, I.; Gala, V.; Chiariotti, F.; Popovski, P. Continent-wide efficient and fair downlink resource allocation in LEO satellite constellations. In Proceedings of the IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 6689–6694. [Google Scholar]
  29. Sutton, R.; Barto, A. Reinforcement learning: An introduction. IEEE Trans. Neural Netw. 1998, 9, 1054. [Google Scholar] [CrossRef]
  30. Badini, N.; Jaber, M.; Marchese, M.; Patrone, F. Reinforcement learning-based load balancing satellite handover using NS-3. In Proceedings of the IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 2595–2600. [Google Scholar]
  31. Luong, N.C.; Hoang, D.T.; Gong, S.; Niyato, D.; Wang, P.; Liang, Y.-C.; Kim, D.I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174. [Google Scholar] [CrossRef]
  32. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef]
  33. Wang, G.; Yang, F.; Song, J.; Han, Z. Multi-agent deep reinforcement learning for dynamic laser inter-satellite link scheduling. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Kuala Lumpur, Malaysia, 4–8 December 2023; pp. 5751–5756. [Google Scholar]
  34. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  35. Amari, S.I. Backpropagation and stochastic gradient descent method. Neurocomputing 1993, 5, 185–196. [Google Scholar] [CrossRef]
Figure 1. Network architecture of the space–air–ground integrated network.
Figure 2. Traffic generation model for each HAP drone.
Figure 3. Transmitted traffic values in one time slot with different episode lengths.
Figure 4. Mean square error of the Q-values obtained by the target network and the main network.
Figure 5. Total transmitted traffic value under different HAP drone transmission powers.
Figure 6. Total transmitted traffic value under different beam numbers.
Table 1. System parameters.

Description | Notation | Value
Self-transition probability from on to on | p_1 | 0.9
Self-transition probability from off to off | p_2 | 0.95
Total number of traffic types | Q | 4
Value factors | μ_q, q = 1, 2, 3, 4 | 1, 2, 3, 4
Number of LEO satellites | J | 5
Number of HAP drones | I | 20
Number of beams for each LEO satellite | L | 4
Beam bandwidth | W | 10 MHz
Number of time slots | T | 1000
Length of time slot | T_0 | 0.1 s
HAP drone altitude | h_1 | 20 km
LEO altitude | h_2 | 600 km
Carrier frequency | f_c | 10 GHz
Number of signaling | κ | 8
Discount rate | γ | 0.98
Learning rate | β | 0.1
