Double-Timescale Multi-Agent Deep Reinforcement Learning for Flexible Payload in VHTS Systems

Feng, Linqing; Zhang, Cheng; Zhang, Qiuyang; Zeng, Lingchao; Qin, Pengfei; Wang, Ying

doi:10.3390/electronics13142764


by Linqing Feng ¹, Cheng Zhang ², Qiuyang Zhang ¹, Lingchao Zeng ², Pengfei Qin ²,* and Ying Wang ¹,*

¹ State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

² Institute of Telecommunication and Navigation Satellites, China Academy of Space Technology, Beijing 100094, China

* Authors to whom correspondence should be addressed.

Electronics 2024, 13(14), 2764; https://doi.org/10.3390/electronics13142764

Submission received: 14 June 2024 / Revised: 6 July 2024 / Accepted: 12 July 2024 / Published: 14 July 2024


Abstract: With the expansion of very-high-throughput satellite (VHTS) systems, the uneven distribution of traffic demands in time and space has become too significant to ignore. Efficiently and dynamically allocating scarce on-board resources so that capacity matches demand is a major challenge. The advancement of flexible payload technology makes it possible to overcome this challenge. However, computational complexity is increasing due to the unsynchronized resource adjustment and the time-varying demands of the VHTS system. Therefore, we propose a double-timescale bandwidth and power allocation (DT-BPA) scheme to effectively manage the available resources in the flexible payload architecture. We use a multi-agent deep reinforcement learning (MADRL) algorithm to meet the time-varying traffic demands of each beam and improve resource utilization. The simulation results demonstrate that the proposed DT-BPA algorithm improves the matching degree of capacity and demand and reduces the system’s power consumption. Additionally, it can be trained offline and implemented online, providing a more cost-effective solution for the VHTS system.

Keywords: bandwidth allocation; power allocation; double timescale; multi-agent deep reinforcement learning; very-high-throughput satellite

1. Introduction

Establishing a wireless communication network with seamless global coverage is the developmental vision of 6G [1,2,3,4]. Satellite communication, with its wide coverage, plays a vital role in realizing this goal [5,6,7]. As the communication requirements of civil and military networks change, satellite manufacturers will strive to reduce unit costs while increasing satellite capacity. Accordingly, the very-high-throughput satellite (VHTS) has emerged as the focus in the future development of satellite networks [8]. Currently, next-generation VHTSs like Jupiter-3, Konnect VHTS, and Viasat-3 operate in geostationary orbit, providing wide coverage without the need for complex orbital adjustments or maintenance [9]. These satellites can deliver a cumulative capacity of hundreds of gigabits or even terabits per second, offering broadband connectivity for aviation, navigation, and remote area users, thereby ensuring a stable and high-quality network experience [10]. Nevertheless, the elevated beam density and extensive bandwidth range of VHTS systems exacerbate the challenges associated with resource management, making resource allocation an imperative issue.

Currently, VHTS mainly serves areas not covered by terrestrial communication networks, mostly scenarios with fixed routes such as aviation and navigation. Due to the wide coverage of the VHTS and the differences in route density within the coverage area, the demand distribution is significantly non-uniform in space [11]. Moreover, communication demands exhibit tidal effects due to changes in human activities within the coverage area. The time-varying nature of demands is a crucial factor that cannot be overlooked [12]. However, the resource management strategies of a traditional payload structure employ a fixed and uniform allocation method that fails to reflect the substantial differences in demand between beams and the significant temporal variations, leading to reduced resource utilization. Consequently, flexible payload technology has developed rapidly to facilitate efficient resource utilization in VHTS systems [13]. The resource management of a flexible payload can provide dynamic resource allocation solutions, allocating bandwidth, power, and other resources according to traffic demands, thus achieving efficient utilization of system resources as a whole [14].

However, the flexible payload resource management technology of VHTS systems currently encounters the following challenges. Firstly, frequent large-scale resource adjustments can impair the system’s stability and efficiency, because synchronization issues among multiple components and competition for limited resources may lead to service disruptions or performance fluctuations. Fortunately, demand along fixed routes, such as aviation and maritime traffic, fluctuates less than that of terrestrial users. Therefore, a joint adjustment of multiple types of resources is required only at specific peak hours, and a small portion of resources is adjusted at other times to cope with small fluctuations in traffic demand. Taking into account factors such as satellite hardware conditions, user business types, management efficiency, and costs, it is necessary to adjust different types of resources at different timescales. Taking bandwidth and power as examples, bandwidth adjustment has a relatively large impact on system capacity and user experience. We therefore adjust both bandwidth and power on a large timescale and only power on a small timescale. Secondly, the system must generate precise and swift decisions based on real-time demands, and the growing optimization space imposes a computational burden. Furthermore, convex optimization algorithms generally require precise mathematical modeling and are time-consuming [15,16], rendering them unsuitable for the dynamic resource management (DRM) problems of VHTS systems, which are NP-hard. Heuristic algorithms perform well in wireless resource allocation. Nevertheless, their multiple iterations and slow convergence make them unsuitable for real-time resource management problems [17].

In addressing the challenges of flexible payload resource management, it is crucial to consider the adjustment scales of different resource types in a comprehensive manner. The system may develop appropriate adjustment strategies based on specific service scenarios to achieve a balance between resource flexibility and system stability while improving resource utilization. Notably, a learning-based algorithm will improve the accurate response capability of VHTS flexible payloads to the real-time demands of multi-beam scenarios. For the complex highly dynamic environment of VHTS, the deep reinforcement learning (DRL) algorithm can achieve efficient real-time resource allocation without an accurate model by learning the optimal strategy through interaction with the environment. Although the DRL algorithm may require computing resources during the training process, once the training is completed, the complexity is usually lower when the decision is made by the learned strategy at the actual run time.

In this paper, we propose a double-timescale bandwidth and power allocation (DT-BPA) scheme to tackle DRM problems, which can better accommodate the dynamic traffic demands of multiple beams in VHTS systems in aviation-type scenarios. The DRM problem we formulate is NP-hard due to its complex constraints, so we utilize a multi-agent deep reinforcement learning (MADRL) algorithm to solve it. Because offline training is performed at the ground station and the trained model is then deployed to the satellite for online decision making, the proposed algorithm can reduce the computing resources and energy consumption of the satellite while maintaining the real-time accuracy of the model. The main contributions of this paper can be summarized in three aspects as follows.

A double-timescale framework is proposed in this paper to accommodate uneven time-varying traffic demands in aviation-type scenarios. In this framework, we allocate bandwidth and power on a large timescale, and subsequently allocate power on multiple small timescales. Compared to simultaneously adjusting the bandwidth and power resources, the double-timescale resource management approach enables us to fully and flexibly utilize all resources without compromising system stability, allowing better adaptation to the specific changes in aviation demands.
The MADRL-based DT-BPA algorithm is employed to construct a two-layer network for addressing the DRM problem, which is low in complexity for offline training and online decision making. We model the DRM problem as a Markov decision process (MDP) and utilize DRL networks for offline training, which reduces the complexity of online execution to meet aviation demands in real time. Additionally, a multi-agent method is used to cope with the exponential growth of the action space with the number of beams.
We conduct extensive simulation experiments to evaluate the performance of the proposed scheme. The numerical results demonstrate that, compared with the benchmark scheme, the proposed strategy better matches the multi-beam traffic demands over a long timescale. The proposed scheme effectively utilizes the double-timescale resource management framework, enabling better adaptation to time-varying traffic demands. While meeting the demand-matching degree, it also reduces the power consumption of the system, so the resource utilization and overall performance of the system are improved.

The rest of this paper is organized as follows. Section 2 summarizes the relevant work on the resource management of multi-beam satellite communication systems. Section 3 describes the VHTS model assumptions and architecture, and formulates the dynamic power and bandwidth management problem. Section 4 proposes the DT-BPA scheme using the MADRL algorithm and provides a detailed explanation of its implementation process. Section 5 presents simulation results and analysis. Section 6 provides a summary of the research.

2. Related Work

At present, research on resource allocation optimization in multi-beam satellite systems can be categorized into two directions: single-objective optimization and multi-objective optimization. To address the single-objective optimization problem of multi-beam satellite resource allocation, several optimization algorithms have been proposed, including traditional optimization algorithms [18,19], heuristic algorithms [20,21,22], and deep reinforcement learning algorithms [23,24]. For instance, Abdu et al. [19] decomposed the joint resource allocation problem into two sub-problems: carrier allocation and power allocation. They proposed continuous carrier allocation (CCA) and interference-aware carrier allocation (IACA), which are solved by the difference-of-convex (DC) algorithm and the successive convex approximation (SCA) algorithm, respectively. Paris et al. [22] used a genetic algorithm (GA) combined with repair functions to address the joint power and bandwidth resource allocation problem. Frequency reuse was considered to maximize the utilization of spectrum resources, and the unsatisfied system capacity (USC) was used as the objective function. However, the case of a resource surplus was not considered. Cocco [21] and Paris et al. [22] observed that bandwidth flexibility had a more significant effect on improving the USC. Liao et al. [24] studied only the bandwidth allocation problem, and proposed a collaborative multi-agent deep reinforcement learning (CMDRL) framework to implement resource management strategies.

The DRM problem is highly complex because it involves various allocated resources, such as bandwidth, power, carrier, beam size, and beam position, as well as multiple evaluation criteria and optimization objectives. For example, power consumption shortens the life of satellite components and makes heat dissipation more difficult. Therefore, minimizing the total power consumption (TPC) of the system by optimizing the transmit power of each beam is also a key objective [25]. Multi-objective formulations of the problem mainly adopt heuristic algorithms [26,27] and various learning algorithms [28,29]. For instance, Pachler et al. [27] studied the multi-objective optimization of the joint bandwidth and power allocation problem, combining the USC and TPC objectives and solving the problem with an improved PSO-GA algorithm. The hybrid algorithm reduces the TPC by up to 60% while increasing the USC by only 2%. Ortiz-Gomez et al. [28] proposed the use of a machine learning (ML) algorithm to train bandwidth and power allocation offline in response to fluctuations in traffic demands. The optimization objectives encompassed minimizing the capacity demand gap (CDG), TPC, and total bandwidth consumption (TBC). However, the available bandwidth for each frequency reuse color was not constrained. Table 1 compares the resource allocation problems of multi-beam satellite systems in the literature in terms of optimization objectives, optimization algorithms, and whether demand matching and resource flexibility are considered.

In order to cope with the challenge that convex optimization algorithms and heuristic algorithms do not match the multi-beam real-time DRM problem, more and more studies have proposed the use of reinforcement learning technology to solve the problem of resource management in satellite communication. Hu et al. [30] demonstrated that the DRL framework is an effective method for wireless resource allocation. Bi et al. [31] discussed the advantages of the DRL framework using a model-free method in solving resource management problems. Liu et al. [32] also verified the strong performance of centralized DRL in real-time resource management. Ferreira et al. [33] used DRL technology in satellite communication to overcome the assumption limitations of other resource allocation methods. Ortiz-Gomez et al. [34] showed that the use of a multi-agent method can reduce the computational complexity and improve the scalability of the system.

Our analysis of the resource allocation problem for multi-beam satellite systems shows that few studies have focused on adjusting different types of resources at different timescales to balance resource utilization and system stability. The VHTS system should be able to formulate an appropriate adjustment strategy for the specific business scenarios with fixed routes. To this end, we propose the DT-BPA scheme and use the MADRL algorithm to deal with the DRM problem, where each agent only makes decisions on the power of one beam. Unlike previous studies, the VHTS system studied in this paper uses a frequency reuse scheme. Geographically isolated beams use the same bandwidth, so a single agent decides the bandwidth of the different reused bands.

3. System Model and Problem Formulation

3.1. System Model

We propose a VHTS system model that utilizes a GEO satellite with multiple spot beams for communication and multiple gateway stations to accommodate the high bandwidth required by the feeder link. To confine the DRM problem to the beam level, we set the following assumptions for the VHTS system [35]. Firstly, we assume that the power or bandwidth between intra-beam carriers is allocated in a proportional fixed allocation mode. Secondly, we aggregate the traffic demands of all users within a beam to obtain the total demands of the beam, and the resource allocation among users in the beam is based on existing literature. Finally, the beams in the model are spot beams, assuming that beam hopping and timeslot allocation are not used.

The payload considered by the VHTS system should be able to flexibly manage two types of resources, namely power and bandwidth. Therefore, it is assumed that the satellite payload is equipped with the necessary modules, including flexible traveling wave tube amplifiers (TWTA) and channelizers. The TWTA dynamically allocates power to its associated beams based on the input power back-off, enabling flexible power management. The channelizer separates the signal into frequency blocks and rearranges them based on instructions to achieve flexible bandwidth allocation. The flexibility of these resources is realized through the architecture depicted in Figure 1. The payload manager receives input demand data, generates the optimal resource allocation result through the network control center, and transmits it to the satellite, thereby influencing the user links.

The VHTS system utilizes the limited spectrum more effectively through multi-color frequency reuse, enabling high-speed broadband data transmission. Figure 2a depicts a schematic 19-beam layout. Taking the simplest practical reuse pattern, the four-color reuse pattern (involving two frequencies and two polarizations) [36], as an example, the beams assigned with yellow and blue colors use left-handed circular polarization (LHCP), while the beams assigned with red and green colors use right-handed circular polarization (RHCP). When allocating bandwidth resources, beams with different polarizations are independent. Adjacent beams with the same polarization can exchange bandwidth, as shown in Figure 2b. We assume that adjacent beams cannot use overlapping frequency ranges, otherwise strong inter-beam interference will occur. The DRM scheme proposed below is suitable for any frequency reuse mode.

We assume that a very-high-throughput GEO satellite is configured with $B$ beams. The total power of the VHTS system is denoted by $P_{tot}$, while the total bandwidth of the VHTS system is represented by $BW_{tot}$. The transmission power of the b-beam is $P_{b}$, and the bandwidth of the b-beam is $BW_{b}$. The system includes $N_{a}$ amplifiers, where $N_{a} \leq B$, and the maximum power allowed by amplifier $a$ is $P_{max}^{a}$. Each beam is connected to one and only one amplifier, and each amplifier can be connected to multiple beams. The system adopts the TWTA to manage the power allocation adjustment. To meet the power constraints, the power allocated to the b-beam must not exceed the upper limit defined for each beam. The beam power cap constraint is expressed as

$$P_{t}^{b} \leq P_{max}^{b}. \tag{1}$$

Additionally, the sum of the power allocated to all beams should not exceed the total power of the system. The total power constraint is expressed as

$$\sum_{b=1}^{B} P_{t}^{b} \leq P_{tot}. \tag{2}$$

Moreover, the sum of the power allocated to beams connected to the same amplifier must not exceed the maximum power allowed by the amplifier. The amplifier power constraint is expressed as

$$\sum_{b=1}^{B} P_{t}^{b,a} \leq P_{max}^{a}, \tag{3}$$

$$P_{t}^{b,a} = \begin{cases} P_{t}^{b}, & b \text{ connected to } a \\ 0, & b \text{ not connected to } a \end{cases}. \tag{4}$$

The channelizer is used to manage the adjustment of bandwidth allocation in the system. The system adopts a dual-polarized multi-color frequency reuse pattern. Adjacent beams with the same polarization can exchange bandwidth when allocating bandwidth. As shown in Figure 2, the No. 1 beam and the No. 3 beam are adjacent in position and have the same polarization direction in the spectrum, so the two beams can exchange bandwidth. The sum of bandwidth allocated to different frequency colors of the same polarization in the bandwidth plan is equal to the total bandwidth of the system. The bandwidth constraint is expressed as

$$\sum_{j=1}^{J} BW_{i,j} = BW_{tot}, \quad i = 0, 1, \tag{5}$$

where $i$ denotes the frequency polarization direction, $i = 0$ denotes the left polarization, and $i = 1$ denotes the right polarization. $J$ denotes the number of sub-bands of the same polarization direction. $BW_{i,j}$ denotes the bandwidth of the sub-band $j$ of the polarization direction $i$.
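As a minimal sketch, constraints (1)–(5) can be checked for one time slot as follows. The function names, the list-based data layout, and the `beam_to_amp` mapping are illustrative assumptions and do not come from the paper.

```python
from typing import List


def check_power_constraints(
    beam_power: List[float],      # P_t^b for each beam b
    beam_power_cap: List[float],  # P_max^b for each beam b
    beam_to_amp: List[int],       # hypothetical mapping: amplifier index of each beam
    amp_power_cap: List[float],   # P_max^a for each amplifier a
    total_power: float,           # P_tot
) -> bool:
    """Return True if constraints (1)-(4) hold for one time slot."""
    # (1) per-beam power cap
    if any(p > cap for p, cap in zip(beam_power, beam_power_cap)):
        return False
    # (2) total system power
    if sum(beam_power) > total_power:
        return False
    # (3)-(4) per-amplifier cap: only beams connected to amplifier a contribute
    amp_load = [0.0] * len(amp_power_cap)
    for p, a in zip(beam_power, beam_to_amp):
        amp_load[a] += p
    return all(load <= cap for load, cap in zip(amp_load, amp_power_cap))


def check_bandwidth_plan(
    sub_band_bw: List[List[float]],  # sub_band_bw[i][j] = BW_{i,j}
    total_bw: float,                 # BW_tot
    tol: float = 1e-9,
) -> bool:
    """Constraint (5): for each polarization i, the sub-bands sum to BW_tot."""
    return all(abs(sum(row) - total_bw) <= tol for row in sub_band_bw)
```

Each beam contributes to exactly one amplifier load, which mirrors the "one and only one amplifier" assumption stated above.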

The VHTS communication system in this paper adopts a double-timescale management strategy of bandwidth and power to flexibly meet the real-time traffic demands while maintaining the resource utilization and system stability. The management strategy consists of a large-timescale bandwidth and power allocation stage and a small-timescale power allocation stage, as shown in Figure 3. In the large-timescale resources allocation stage, the system allocates bandwidth to each sub-band and allocates power to each beam based on the resource constraints and the traffic demands over a period of time. In the small-timescale power allocation stage, the system allocates the power to each beam connected to all amplifiers based on the bandwidth allocation result, power resource constraints, and real-time traffic demands. Furthermore, the system needs to set reasonable time intervals and a corresponding proportion of different time scales according to the specific situation. The strategy can improve the performance and user experience of the satellite communication system, and provide customized services for different beams and users.
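The two-stage strategy can be sketched as a simple control loop. This is only an illustration under stated assumptions: `allocate_large` and `allocate_small` are hypothetical placeholders for the trained policies, `slots_per_frame` is an assumed ratio between the two timescales, and the large-timescale stage is assumed to see the demand of the coming frame (for example, from a forecast).

```python
from typing import Callable, List, Tuple

Allocation = Tuple[List[float], List[float]]  # (bandwidth per beam, power per beam)


def run_double_timescale(
    demands: List[List[float]],  # demands[t][b]: demand of beam b in small slot t
    slots_per_frame: int,        # small slots per large-timescale frame (assumed)
    allocate_large: Callable[[List[List[float]]], Allocation],
    allocate_small: Callable[[List[float], List[float], List[float]], List[float]],
) -> List[Allocation]:
    """Return the (bandwidth, power) decision taken in every small slot."""
    decisions: List[Allocation] = []
    bandwidth: List[float] = []
    power: List[float] = []
    for t, demand in enumerate(demands):
        if t % slots_per_frame == 0:
            # Large timescale: set bandwidth and power from the frame's demand.
            frame = demands[t:t + slots_per_frame]
            bandwidth, power = allocate_large(frame)
        else:
            # Small timescale: bandwidth stays fixed, only power tracks demand.
            power = allocate_small(bandwidth, power, demand)
        decisions.append((list(bandwidth), list(power)))
    return decisions
```

Keeping the bandwidth plan unchanged between frame boundaries is what limits large-scale reconfiguration to the large timescale.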

3.2. Link Budget

It is assumed that the capacity provided by the b-beam $C_{b}$ depends on the bandwidth allocated to the beam $BW_{b}$ and the spectral efficiency of the beam $SE_{b}$, which is specifically denoted as

$$C_{b} = BW_{b} \times SE_{b}, \tag{6}$$

where $SE_{b}$ is the spectral efficiency of the modulation and coding schemes used by commercial modems. Due to the variation in the beam gain and transmission channel conditions within the coverage range, each beam will experience different carrier-to-interference-plus-noise ratios. By referring to the DVB-S2X system, the relationship between the spectral efficiency of each beam and the carrier-to-interference-plus-noise ratio $CINR_{total}$ is modeled by a general function $f_{1}(\cdot)$. The DRM problem is a discrete problem since $f_{1}(\cdot)$ is a step function.

$$SE_{b} = f_{1}(CINR_{total}). \tag{7}$$
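To make the step-function behavior of $f_{1}(\cdot)$ concrete, the following sketch maps a CINR value to a spectral efficiency and then to a beam capacity via Equation (6). The threshold and efficiency values are placeholders chosen for illustration and are not taken from the DVB-S2X standard or from the paper.

```python
from bisect import bisect_right
from typing import List

# Hypothetical MODCOD table: CINR thresholds (dB) and spectral efficiencies
# (bit/s/Hz). Placeholder values, NOT the DVB-S2X figures.
CINR_THRESHOLDS_DB: List[float] = [0.0, 5.0, 10.0, 15.0]
SPECTRAL_EFF: List[float] = [0.5, 1.5, 2.5, 3.5]


def spectral_efficiency(cinr_db: float) -> float:
    """Step function f_1: efficiency of the highest threshold met, 0 if none."""
    idx = bisect_right(CINR_THRESHOLDS_DB, cinr_db)
    return SPECTRAL_EFF[idx - 1] if idx > 0 else 0.0


def beam_capacity(bandwidth_hz: float, cinr_db: float) -> float:
    """Equation (6): C_b = BW_b * SE_b, in bit/s."""
    return bandwidth_hz * spectral_efficiency(cinr_db)
```

Because the efficiency only changes when a threshold is crossed, small power changes can leave the capacity unchanged, which is why the resulting optimization problem is discrete.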

In the VHTS system, the feeder link from the gateway station to the satellite should be considered. The total carrier-to-interference-plus-noise ratio $CINR_{total}$ is related to the carrier-to-interference ratio $CIR_{b}$ and the carrier-to-noise ratio $CNR_{b}$.

The carrier-to-interference ratio $CIR_{b}$ represents the difference between the power allocated to the b-beam and the interference power of the b-beam, which is specifically denoted as

$$CIR_{b} = P_{b} - I_{b}. \tag{8}$$

The frequency color needs to be assigned to each beam before calculation. The principle is that the adjacent beams have different frequencies, and that the beams of the same frequency are as far apart as possible. The co-channel interference power $I_{b}$ is a function of the frequency reuse scheme and $\theta_{b}$, which is specifically denoted as

$$I_{b} = \sum_{\varphi=1}^{\Phi} P_{co}(\phi, \theta_{b}), \tag{9}$$

where $\theta_{b}$ is the 3 dB beamwidth of the b-beam, $\phi$ represents the angle between the b-beam and the interference beam, $\Phi$ represents the total number of interference beams of the b-beam, and $P_{co}$ represents the interference power level of the $\varphi$-th interference beam of the b-beam.

The calculation of the carrier-to-noise ratio $CNR_{b}$ is related to the b-beam power $P_{b}$, beam-transmitting antenna gain $G_{b}$, user-receiving antenna gain $G_{u}$, Boltzmann constant $k$, system temperature $T_{s}$, total link loss $L_{b}$, amplifier power back-off $OBO$, and the b-beam bandwidth $BW_{b}$. The specific expression of $CNR_{b}$ is

$$CNR_{b} = P_{b} + G_{b} + G_{u}/T_{s} - L_{b} - k - \log(BW_{b}) - OBO. \tag{10}$$

The total link loss $L_{b}$ includes free space loss $L_{FS,b}$, atmospheric loss $L_{atm,b}$, transmission loss $L_{RF,b}$, and reception loss $L_{RF,u}$. The specific expression of $L_{b}$ is

$$L_{b} = L_{FS,b} + L_{atm,b} + L_{RF,b} + L_{RF,u}. \tag{11}$$

The relationship between transmitting antenna gain $G_{b}$ and beamwidth $\theta_{b}$ can be modeled by an appropriate function $f_{2}(\cdot)$ as

$$G_{b} = f_{2}(\theta_{b}). \tag{12}$$
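The link budget of Equations (8), (10), and (11) can be sketched in decibel units as below. Several points are assumptions rather than statements from the paper: all terms are taken to be in dB, the bandwidth term is read as $10\log_{10}(BW_{b})$ with $BW_{b}$ in Hz, $G_{u}/T_{s}$ is passed as a single figure of merit in dB/K, and CNR and CIR are combined into CINR by summing noise and interference in the linear domain, which is a common convention that the text does not spell out.

```python
import math

# Boltzmann constant in dBW/K/Hz: 10*log10(1.380649e-23), rounded.
BOLTZMANN_DBW_PER_K_HZ = -228.6


def total_link_loss_db(l_fs: float, l_atm: float, l_rf_tx: float, l_rf_rx: float) -> float:
    """Equation (11): sum of the individual losses in dB."""
    return l_fs + l_atm + l_rf_tx + l_rf_rx


def cnr_db(p_b_dbw: float, g_b_dbi: float, g_over_t_dbk: float,
           loss_db: float, bw_hz: float, obo_db: float) -> float:
    """Equation (10) with every term expressed in dB (assumed units)."""
    return (p_b_dbw + g_b_dbi + g_over_t_dbk - loss_db
            - BOLTZMANN_DBW_PER_K_HZ - 10.0 * math.log10(bw_hz) - obo_db)


def cir_db(p_b_dbw: float, i_b_dbw: float) -> float:
    """Equation (8): carrier power minus interference power, in dB."""
    return p_b_dbw - i_b_dbw


def cinr_db(cnr: float, cir: float) -> float:
    """Assumed combination rule: add noise and interference in linear units."""
    inverse = 10.0 ** (-cnr / 10.0) + 10.0 ** (-cir / 10.0)
    return -10.0 * math.log10(inverse)
```

With this combination rule the CINR is always below both the CNR and the CIR, so raising the power of one beam helps its own CNR while lowering the CIR of its co-channel beams.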

3.3. Problem Formulation

The VHTS system manages both bandwidth and power resources through flexible payloads. The goal is to match the capacity provided by the beam with the traffic demand as much as possible. Therefore, we formulate CDG as the evaluation index of resource allocation results. The smaller the CDG, the higher the matching degree of capacity and demand.

$$CDG = \sum_{b=1}^{B} \left| D_{t}^{b} - C_{t}^{b} \right|. \tag{13}$$

The DRM objective function is defined as minimizing the average CDG per time to achieve the optimal matching of capacity and demand within T time. In addition, to ensure the reliable operation of the equipment and prolong its service life, the TPC is minimized under the premise of ensuring capacity and demand matching. Based on the above analysis, the DRM problem is defined as

$$\min_{P_{t}^{b},\, BW_{b}} F_{1} = \frac{1}{T} \sum_{t=1}^{T} \left( \beta_{1} \sum_{b=1}^{B} \left| D_{t}^{b} - C_{t}^{b} \right| + \beta_{2} \sum_{b=1}^{B} P_{t}^{b} \right) \tag{14}$$

$$\text{s.t.} \quad P_{t}^{b} \leq P_{max}^{b}, \quad b = 1, 2, \dots, B, \tag{15}$$

$$\sum_{b=1}^{B} P_{t}^{b,a} \leq P_{max}^{a}, \quad a = 1, 2, \dots, N_{a}, \tag{16}$$

$$\sum_{b=1}^{B} P_{t}^{b} \leq P_{tot}, \tag{17}$$

$$BW_{b} \leq BW_{tot}, \tag{18}$$

$$\sum_{j=1}^{J} BW_{i,j} = BW_{tot}, \quad i = 0, 1. \tag{19}$$

The DRM objective function $F_{1}$ minimizes the time average of two terms, a performance index and a cost index. The performance index measures the degree to which the system capacity meets the demand and is represented by the CDG, whose weight in the objective function is denoted by $\beta_{1}$. The cost index measures the resource consumption required to meet all beam demands. Assuming that the VHTS uses all bandwidth resources, the selected index is the total power of all beams in the satellite, where $\beta_{2}$ is the weight of the TPC. Constraints (15)–(17) are power constraints, and (18) and (19) are bandwidth constraints. The formulated optimization problem is nonlinear and discrete with multiple constraints. Considering this complexity, the double-timescale MADRL method is adopted to solve the formulated DRM problem.
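The objective in Equation (14) can be evaluated for a given allocation trajectory with a few lines of code. This sketch only computes the objective value for given inputs; it does not solve the optimization problem, and the function names and list layout are illustrative assumptions.

```python
from typing import List


def capacity_demand_gap(demand: List[float], capacity: List[float]) -> float:
    """Equation (13): sum over beams of |D_t^b - C_t^b| for one time slot."""
    return sum(abs(d - c) for d, c in zip(demand, capacity))


def drm_objective(
    demands: List[List[float]],     # demands[t][b] = D_t^b
    capacities: List[List[float]],  # capacities[t][b] = C_t^b
    powers: List[List[float]],      # powers[t][b] = P_t^b
    beta1: float,                   # weight of the CDG term
    beta2: float,                   # weight of the TPC term
) -> float:
    """Equation (14): time average of beta1 * CDG + beta2 * TPC."""
    total = 0.0
    for d_t, c_t, p_t in zip(demands, capacities, powers):
        total += beta1 * capacity_demand_gap(d_t, c_t) + beta2 * sum(p_t)
    return total / len(demands)
```

The absolute value penalizes both unmet demand and surplus capacity, so over-provisioning a beam is counted as a mismatch as well.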

4. Double-Timescale MADRL Algorithm

The VHTS system is characterized by a large number of beams, rapid demand changes, and significant differences in the demand between beams. Consequently, the computational complexity of the double-timescale DRM problem is extremely high. To address this issue, we reformulate the DRM problem as a DT-MADRL problem. The key idea behind the DT-MADRL approach is to employ a single agent, trained with a Deep Q-Network (DQN), to manage the bandwidth of each sub-band in multi-color reuse. Additionally, B agents are utilized to manage the power of all beams connected to the different amplifiers in the satellite system, with the traffic demands of beams acting as the environment. The B agents share the same resource allocation space and are trained using the QMIX network. When different traffic demand data are input, the network can output the bandwidth and power allocation results that meet the optimization objectives. We assume that an amplifier can connect any number of beams, and the power management network corresponding to each amplifier is independent. Consequently, we can train the power management network first. The trained network is required to output, for all possible traffic demand and bandwidth inputs, the power values that achieve the optimal matching of beam capacity and demand in the amplifier. The trained power management network parameters of each amplifier are then imported as the environment to train the whole system resource management network. This trained network is required to output the bandwidth and power that achieve the optimal matching of beam capacity and demand in the system for all possible traffic demand inputs. The DT-BPA scheme using the MADRL algorithm architecture is shown in Figure 4. Both models use an evaluation network, a target network, and an experience replay buffer, which together ensure the stability of learning.

4.1. Small-Timescale Power Allocation Strategy

For the power management network at a small timescale, the power allocation of the beams connected to each amplifier is required. The objective is to achieve the optimal matching of the beam capacity and demand in the amplifier. On this basis, the TPC of beams in the amplifier is minimized to reduce the device power consumption and extend service life. The offline-trained power management network can adapt to traffic variations in real time and automatically adjust strategies. The multi-agent method adopted by the network can readily accommodate scenarios with an increasing number of beams by simply adding more agents. These features enable it to better integrate with resource management networks operating on a large timescale.

Based on the above objectives, we assume that the amplifier is connected to $M$ beams and model $M$ agents to manage the power allocation of beams connected to the amplifier. All agents operate in a fully cooperative manner. Achieving the maximum reward for the system requires the coordination of all agents, that is, the ultimate goal of all agents is to minimize the weighted sum of CDG and TPC of the beams in the amplifier. Since a single agent cannot obtain complete state information, the multi-beam cooperative power allocation decision process is modeled as a decentralized partially observable Markov decision process (Dec-POMDP), which is defined by the tuple $\langle \mathcal{M}, S, O, A, P, R, \gamma \rangle$, where $\mathcal{M} = \{1, 2, \dots, M\}$ represents the set of agents. $S$ denotes the set of environmental states, and $O$ denotes the set of joint observations. $A$ denotes the joint action space, and each agent $m \in \mathcal{M}$ selects an action $a^{m} \in A^{m}$ to form a joint action $a \in A$. The joint action is applied to the environment, and then the next state is obtained. The corresponding transition function is $P(s' \mid s, a) : S \times A \times S \to [0, 1]$. $R(s, a) : S \times A \to \mathbb{R}$ denotes the reward function. $\gamma \in [0, 1]$ represents a discount factor.

1. State space

The environment of each time slot $t$ includes the following. $D_{t} = \{D_{t}^{1}, D_{t}^{2}, \dots, D_{t}^{M}\}$ represents the traffic demands of beams connected to the same amplifier, $C_{t} = \{C_{t}^{1}, C_{t}^{2}, \dots, C_{t}^{M}\}$ represents the capacity provided by the beams, $\Delta_{t} = \{\Delta_{t}^{1}, \Delta_{t}^{2}, \dots, \Delta_{t}^{M}\}$ represents the CDG of beams, $P_{t} = \{P_{t}^{1}, P_{t}^{2}, \dots, P_{t}^{M}\}$ represents the current power allocated to beams, and $BW_{t} = \{BW_{t}^{1}, BW_{t}^{2}, \dots, BW_{t}^{M}\}$ represents the current bandwidth allocated to the beams. Therefore, $S_{t} = \{O_{t}^{1}, O_{t}^{2}, \dots, O_{t}^{M}\}$ represents the current state of $M$ agents, where $O_{t}^{m}$ indicates the bandwidth, power, traffic demand, and CDG of the current beam $m$.

$$O_{t}^{m} = \{BW_{t}^{m}, P_{t}^{m}, D_{t}^{m}, \Delta_{t}^{m}\}. \tag{20}$$

2.: Action space

The action space represents the adjustment strategy of the agents for power allocation, and determines the power change of each beam. Specifically,

A_{t} = {a_{t}^{1}, a_{t}^{2}, \dots, a_{t}^{M}}

represents the action of M agents, that is, their movement in the space of possible allocated power resources, where

a_{t}^{m}

denotes the action of agent m, namely increasing, decreasing, or leaving unchanged the power of beam m.

a_{t}^{m} = \begin{cases} a_{1}, & \text{increase in power} \\ a_{2}, & \text{decrease in power} \\ a_{3}, & \text{do nothing} \end{cases} .

(21)
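As a minimal sketch of Equation (21), the three discrete actions can be mapped to per-beam power changes as shown below. The step size `step_w` and the limits `p_min_w` and `p_max_w` are illustrative assumptions, as are all function and constant names; they are not values or identifiers given in this section.

```python
import numpy as np

# Action indices follow Equation (21): a_1 increase, a_2 decrease, a_3 do nothing.
INCREASE, DECREASE, HOLD = 0, 1, 2
ACTION_TO_SIGN = {INCREASE: +1.0, DECREASE: -1.0, HOLD: 0.0}


def apply_power_actions(power_w, actions, step_w=1.0, p_min_w=0.0, p_max_w=np.inf):
    """Return the new per-beam powers after each agent applies its action.

    step_w, p_min_w and p_max_w are hypothetical parameters for illustration.
    """
    signs = np.array([ACTION_TO_SIGN[a] for a in actions])
    new_power = np.asarray(power_w, dtype=float) + signs * step_w
    # Keep every beam power inside the assumed feasible range.
    return np.clip(new_power, p_min_w, p_max_w)
```

With one agent per beam, the joint action is simply the list of the M individual choices, so the joint action space grows as 3^M even though each agent only chooses among three options.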

3. Reward

M agents share the same instant reward

r_{t} = - α_{1} F_{t} + α_{2} F_{t - 1} - α_{3} P_{pun},

(22)

F = f_{3} (\sum_{b = 1}^{M} (β_{1} Δ_{b}^{2} + β_{2} P_{b})),

(23)

P_{pun} = \begin{cases} K, & \sum_{b = 1}^{M} P_{b} > P_{max}^{a} \\ 0, & \sum_{b = 1}^{M} P_{b} \leq P_{max}^{a} \end{cases},

(24)

where F denotes the measurement of the weighted sum of the CDG and the TPC of the beams in the amplifier.

P_{pun}

is the penalty term applied when the amplifier power limit is exceeded. In addition,

α_{1}, α_{2}, α_{3}

are positive values, indicating the weight of each parameter in the instant reward.
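The shared reward of Equations (22)-(24) can be sketched as follows. The function f_3 is not defined in this section, so it is passed in as a callable and defaults to the identity purely as an assumption; all function names are hypothetical.

```python
import numpy as np


def weighted_cost(cdg, power, beta1, beta2, f3=lambda x: x):
    """F in Equation (23): f3 applied to sum_b (beta1 * cdg_b^2 + beta2 * P_b).

    f3 defaults to the identity, which is an assumption made for illustration.
    """
    cdg = np.asarray(cdg, dtype=float)
    power = np.asarray(power, dtype=float)
    return f3(float(np.sum(beta1 * cdg**2 + beta2 * power)))


def power_penalty(power, p_max_amp, K):
    """P_pun in Equation (24): K if the amplifier power limit is exceeded, else 0."""
    return float(K) if float(np.sum(power)) > p_max_amp else 0.0


def shared_reward(F_t, F_prev, p_pun, alpha1, alpha2, alpha3):
    """r_t in Equation (22), received identically by all M agents."""
    return -alpha1 * F_t + alpha2 * F_prev - alpha3 * p_pun
```

Because the reward contains both F_t and F_{t-1}, with equal weights it is positive when the weighted cost decreased from the previous step, which rewards improvement rather than the absolute cost level.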

4. Value function

The action-observation history of each agent is expressed as

τ^{m} \in T \equiv {(O \times A^{m})}^{*}

, that is, the sequential action-observation history starting from the initial state. The corresponding sequential action-observation history of all agents can be expressed as

τ = (τ^{1}, τ^{2}, \dots, τ^{M})

. Based on this, the stochastic strategy function of each agent is expressed as

π^{m} (a^{m} | τ^{m}) : T \times A \to [0, 1]

. The joint action value function corresponding to the joint strategy is expressed as

Q_{tot}^{π} (τ, a) = E_{τ \in T^{M}, a \in A} [\sum_{t = 0}^{\infty} γ^{t} r (s, a)] .

(25)

Here, the value function depends on the action-observation history rather than on the state.

The power management network is trained with QMIX, which consists of a mixing network and M agent networks. Under the centralized training with decentralized execution (CTDE) paradigm, the local action-value functions are learned in a distributed manner, and the joint action-value function is then constructed by combining all the local action-value functions. The joint reward is shared by all agents, promoting cooperation among them to maximize the global benefit.

When dealing with multi-agent collaboration, two aspects need to be considered. One is how to learn the joint action-value function, and the other is how to extract a reasonable decentralized strategy from the joint value function. For centralized learning, the mixing network estimates the joint Q-value as a complex nonlinear combination of each agent’s Q-value. The network imposes a structural constraint so that the joint Q-value function is monotonic in each agent’s Q-value function, ensuring consistency between the centralized strategy and the decentralized strategy. Specifically, the individual-global-max (IGM) condition is as follows:

\underset{a}{\arg\max} Q_{tot} (τ, a) = \begin{pmatrix} \arg\max_{a^{1}} Q_{1} (τ^{1}, a^{1}) \\ \vdots \\ \arg\max_{a^{m}} Q_{m} (τ^{m}, a^{m}) \\ \vdots \\ \arg\max_{a^{M}} Q_{M} (τ^{M}, a^{M}) \end{pmatrix},

(26)

where

Q_{tot}

represents the joint Q-value function, and

Q_{m}

represents the Q-value function of agent m. To guarantee the above condition, the QMIX network enforces the following monotonicity constraint:

\frac{\partial Q_{tot}}{\partial Q_{m}} \geq 0, \forall m \in {1, 2, \dots, M} .

(27)
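One common way to satisfy the monotonicity constraint in Equation (27) is to generate the mixing weights from the global state and force them to be non-negative. The toy NumPy mixer below is a sketch under that assumption; the linear "hypernetworks", the hidden size, and the class name are illustrative choices and not the configuration used in this paper.

```python
import numpy as np


def elu(x):
    """Monotonically increasing activation, so non-negative weights keep Q_tot monotonic."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)


class MonotonicMixer:
    """Toy QMIX-style mixer: Q_tot = w2 . elu(q W1 + b1) + b2 with W1, w2 >= 0."""

    def __init__(self, n_agents, state_dim, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        # Linear maps from the global state to mixing weights and biases (assumed form).
        self.hyper_w1 = rng.normal(size=(state_dim, n_agents * hidden))
        self.hyper_b1 = rng.normal(size=(state_dim, hidden))
        self.hyper_w2 = rng.normal(size=(state_dim, hidden))
        self.hyper_b2 = rng.normal(size=(state_dim,))
        self.n_agents, self.hidden = n_agents, hidden

    def __call__(self, agent_qs, state):
        # Absolute values make the weights non-negative, which gives dQ_tot/dQ_m >= 0.
        w1 = np.abs(state @ self.hyper_w1).reshape(self.n_agents, self.hidden)
        b1 = state @ self.hyper_b1
        w2 = np.abs(state @ self.hyper_w2)
        b2 = state @ self.hyper_b2
        hidden = elu(agent_qs @ w1 + b1)
        return float(hidden @ w2 + b2)
```

Since raising any single agent's Q-value can never lower Q_tot in such a mixer, each agent can take the argmax of its own Q-value and the resulting joint action also maximizes Q_tot, which is the IGM condition of Equation (26).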

For the decentralized strategy, after obtaining the estimate of the global Q-value function, QMIX selects the action based on the local Q-value function and the output of the mixing network, combined with the

ε

-greedy algorithm. By adjusting

ε

, the trade-off between exploration and exploitation can be controlled. A higher value of

ε

encourages the agent to explore and choose random actions to discover new strategies, while a lower value of

ε

encourages the agent to exploit the learned strategy by selecting the action with the highest Q-value. The mixing network is updated in a manner similar to DQN, as both use the temporal-difference (TD) error to update the network. In addition, both an evaluation network and a target network are maintained. The QMIX network loss function is

L (ω) = \sum_{i = 1}^{b} [{(y_{i}^{tot} - Q_{tot} (τ, a, s; ω))}^{2}],

(28)

where b represents the number of samples drawn from the experience replay buffer.

y^{tot}

is expressed as

y^{tot} = r + γ \max_{a'} Q_{tot} (τ', a', s'; ω^{-}),

(29)

where

ω^{-}

denotes the parameters of the target network, so the temporal-difference error can be expressed as

δ = r + γ Q_{tot} (\text{target}) - Q_{tot} (\text{evaluate}),

(30)

where

Q_{tot} (\text{target})

represents the maximum

Q_{tot}

over all actions in the next state, and

Q_{tot} (\text{evaluate})

represents the

Q_{tot}

of the action selected by the current network strategy in the current state.
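Equations (28)-(30) can be illustrated with a small batch computation. The sketch below assumes that the evaluation-network outputs and the target-network maxima have already been computed for each sample; the function and argument names are hypothetical.

```python
import numpy as np


def qmix_td_loss(rewards, q_tot_eval, q_tot_target_next_max, gamma):
    """Return (loss, td_errors) for a batch of b sampled transitions.

    rewards:               shape (b,), shared reward r of each sample
    q_tot_eval:            shape (b,), Q_tot(tau, a, s; omega) from the evaluation network
    q_tot_target_next_max: shape (b,), max over a' of Q_tot(tau', a', s'; omega^-)
    """
    rewards = np.asarray(rewards, dtype=float)
    q_eval = np.asarray(q_tot_eval, dtype=float)
    q_next = np.asarray(q_tot_target_next_max, dtype=float)
    y_tot = rewards + gamma * q_next        # Equation (29)
    td_errors = y_tot - q_eval              # Equation (30)
    loss = float(np.sum(td_errors**2))      # Equation (28)
    return loss, td_errors
```

In practice the target is often masked at terminal steps; that detail is not stated in this section and is therefore omitted from the sketch.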

Since MADRL is trained offline, training can be stopped once the network performance has converged. Therefore, only the complexity of the proposed algorithm in the online execution phase needs to be analyzed. The computational complexity of execution is the complexity of computing the Q-value, which depends on the network structure. Since the agents choose their actions in parallel, the computational complexity of the proposed algorithm is determined by the most complex neural network among all agents. Assuming that

L_{QMIX}

and

n_{i}^{QMIX}

denote the number of layers of the deep neural networks (DNNs) and the number of neurons in layer i used by the DT-BPA algorithm for the small time scale power allocation network, respectively, the execution computational complexity is

O (\sum_{i = 1}^{L_{QMIX} - 1} n_{i}^{QMIX} n_{i + 1}^{QMIX})

[37]. The specific algorithm process is shown in Algorithm 1.

Algorithm 1 QMIX-based power allocation algorithm.

Input: network parameters and system parameters.
Output: power allocation strategy.

1: for $epoch = 1, 2, \dots, N_{epoch}$ do
2:   for $episode = 1, 2, \dots, N_{episode}$ do
3:     Initialize the environment and obtain the state S, the observation of each agent $O^{m}$, the available actions of each agent $A^{m}$, and the reward R.
4:     for $t = 1, 2, \dots, N_{step}$ do
5:       Each agent uses eval_agent_network to obtain the Q-value of each action. With probability $ε$, select a random action $a_{t}^{m}$; otherwise, select the action that maximizes the Q-value.
6:       Store $S$, $S_{next}$, the observation of each agent $O^{m}$, the available actions of each agent $A^{m}$, the next available actions of each agent $A_{next}^{m}$, the reward R, the selected action a, and the termination flag $terminate$ in the experience pool D.
7:       Randomly sample a batch from D; the sampled data must be the same transition in different episodes.
8:       Update the parameters by minimizing
         $L (ω) = \sum_{i = 1}^{b} [(y_{i}^{tot} - Q_{tot} (τ, a, s; ω))^{2}]$
9:       if $terminate == True$ and $step \leq N_{step}$ then
10:        for $k = step$ to $N_{step}$ do
11:          Fill in the missing data with 0 to ensure data consistency.
12:        end for
13:      end if
14:      $S = S_{next}$, $A^{m} = A_{next}^{m}$
15:      if $t \bmod p == 0$ (p is the update frequency) then
16:        Copy the evaluation network parameters to the target network.
17:      end if
18:    end for
19:  end for
20:  Calculate the average reward of the $N_{episode}$ episodes and add it to the list.
21: end for
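Step 5 of Algorithm 1 can be sketched as the following ε-greedy selection over one agent's Q-values. The optional availability mask reflects the "available actions" mentioned in the algorithm; the function name and signature are assumptions made for illustration.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng, available=None):
    """Pick a random available action with probability epsilon, else the greedy one."""
    q_values = np.asarray(q_values, dtype=float)
    if available is None:
        available = np.ones(q_values.shape, dtype=bool)
    available = np.asarray(available, dtype=bool)
    if rng.random() < epsilon:
        # Explore: uniform choice among the actions that are currently allowed.
        return int(rng.choice(np.flatnonzero(available)))
    # Exploit: unavailable actions are masked out before taking the argmax.
    masked = np.where(available, q_values, -np.inf)
    return int(np.argmax(masked))
```

A typical usage is `epsilon_greedy(q, eps, np.random.default_rng(0))`, with `eps` decreased over the course of training so that exploration gives way to exploitation.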

4.2. Large-Timescale Resources Allocation Strategy

The resource management network on the large timescale allocates bandwidth to the different color sub-bands and power to the different beams in the VHTS system. After the sub-band bandwidth and initial beam power are obtained on the large timescale based on the average traffic demands over a period of time, the trained power management network is employed to allocate power at subsequent time points during that period. The objective is to achieve an optimal matching of beam capacity and demand in the system. Based on the stated objectives, we assume that the VHTS system consists of B beams. We model a single agent to manage the bandwidth allocation of M frequency colors and B agents to handle power allocation for all beams. Additionally, we import the parameters of the converged power management network model as environment variables, eliminating the need to retrain the power allocation process. The shared goal of all agents is to minimize the weighted sum of the CDG and TPC of the beams in the system. Assuming that the agent managing the bandwidth has access to complete state information, the bandwidth allocation decision process is modeled as a Markov decision process (MDP) defined by the tuple

< S, A, P, r, γ >

. The strategy gives the corresponding feasible action

A_{t}

according to the given state

S_{t}

and action–value function, and then, the agent obtains the reward

r_{t}

for taking the action. The system then transitions to the state

S_{t + 1}

, where the state transition probability is

P (s_{t + 1}, r_{t} | s_{t}, a_{t})

, and the corresponding strategy is

π (a_{t} | s_{t})

, and

γ \in [0, 1]

represents the discount factor.

1. State space

The environment of each time slot t includes the following.

D_{t} = {D_{t}^{1}, D_{t}^{2}, \dots, D_{t}^{B}}

represents the traffic demand of B beams,

C_{t} = {C_{t}^{1}, C_{t}^{2}, \dots, C_{t}^{B}}

represents the capacity provided by B beams,

Δ_{t} = {Δ_{t}^{1}, Δ_{t}^{2}, \dots, Δ_{t}^{B}}

represents the CDG of B beams,

P_{t} = {P_{t}^{1}, P_{t}^{2}, \dots, P_{t}^{B}}

represents the current power of B beams,

BW_{t} = {BW_{t}^{1}, BW_{t}^{2}, \dots, BW_{t}^{M}}

represents the current bandwidth of M frequency colors, and

ω_{t}

represents the parameters of the power management network. Therefore,

S_{t}

represents the current state of the agent managing the bandwidth allocation, indicating the current bandwidth, power, traffic demand, and CDG of all beams in the system.

S_{t} = {BW_{t}, P_{t}, D_{t}, Δ_{t}} .

(31)

2. Action space

The action space represents the adjustment strategy of the agent for bandwidth allocation, and determines the bandwidth change of each frequency color.

A_{t}

represents the action of the bandwidth agent, that is, its movement in the space of bandwidth resources that may be allocated. There are

3^{M}

possible actions, since the bandwidth of each of the M frequency colors can be increased, decreased, or left unchanged.
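The size of this action space can be made concrete by enumerating the joint actions. The sketch below assumes a fixed, hypothetical bandwidth step `step_mhz`; the function names are illustrative as well.

```python
from itertools import product

# Per-color choices: +1 increase, -1 decrease, 0 unchanged.
PER_COLOR_CHOICES = (+1, -1, 0)


def enumerate_bandwidth_actions(n_colors):
    """List all 3^M joint bandwidth actions for M frequency colors."""
    return list(product(PER_COLOR_CHOICES, repeat=n_colors))


def apply_bandwidth_action(bandwidth_mhz, action, step_mhz=10.0):
    """Apply one joint action; step_mhz is a hypothetical step size."""
    return [bw + a * step_mhz for bw, a in zip(bandwidth_mhz, action)]
```

For M = 4, as in a four-color reuse pattern, this yields 3^4 = 81 joint actions, which is still small enough for a single DQN output layer with one Q-value per joint action.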

3. Reward

The instant reward of the bandwidth agent is

r_{t} = - α_{1} H_{t} + α_{2} H_{t - 1},

(32)

H = f_{4} (\sum_{b = 1}^{B} (β_{1} Δ_{b}^{2} + β_{2} P_{b})),

(33)

where H is the measure of the weighted sum of the CDG and TPC of all beams in the system. Similarly,

α_{1}, α_{2}

are positive values, indicating the weight of each parameter in the instant reward.

A DQN is used as the function approximator to estimate the action value. Based on the current Q-value estimate, the strategy selects an action, which is then executed by the bandwidth agent. The other B power agents perform multi-step adjustments according to the trained network to obtain the optimal power allocation at that time. The agents observe the immediate reward r returned by the environment and the next state

s^{'}

. We add the current state, action, reward, and next state to the experience replay buffer. Subsequently, a batch of experience samples is randomly drawn from the experience replay buffer to train the neural network. The target Q-value of a sample is

Q_{target} = r + γ \max_{a'} Q (s', a'; θ^{-}),

(34)

where r is the instant reward,

s^{'}

is the next state,

a^{'}

is the optimal action of the next state,

γ

is the discount factor, and

θ^{-}

denotes the parameters of the target network.

The mean square error (MSE) loss function is employed to quantify the disparity between the current Q-value estimation and the target Q-value. By utilizing backpropagation, the parameters of the neural network are adjusted to minimize this loss

L (θ) = \frac{1}{2} {[Q_{target} - Q (s, a; θ)]}^{2} .

(35)

The parameters of the target network are updated regularly by copying the current evaluation Q network parameters to the target network to enhance stability and convergence.

Assuming that

L_{DQN}

and

n_{i}^{DQN}

denote the number of layers of the DNNs and the number of neurons in layer i used by the DT-BPA algorithm for the large-timescale resource allocation network, respectively, the execution computational complexity is

O (\sum_{i = 1}^{L_{DQN} - 1} n_{i}^{DQN} n_{i + 1}^{DQN})

. The specific algorithm process is shown in Algorithm 2.

Algorithm 2 DQN-based resources allocation algorithm.

Input: network parameters and system parameters.
Output: bandwidth and power allocation strategy.

1: for $epoch = 1, 2, \dots, N_{epoch}$ do
2:   for $episode = 1, 2, \dots, N_{episode}$ do
3:     Initialize the environment and obtain the state S and the reward R.
4:     for $t = 1, 2, \dots, N_{step}$ do
5:       The agent uses eval_network to obtain the Q-value of each action. With probability $ε$, select a random action $a_{t}$; otherwise, select the action that maximizes the Q-value.
6:       Store $S$, $S_{next}$, the selected action a, the reward R, and the termination flag $terminate$ in the experience pool D.
7:       Randomly sample a batch from D; the sampled data must be the same transition in different episodes.
8:       Update the parameters by minimizing
         $L (θ) = \sum_{i = 1}^{b} \frac{1}{2} [Q_{target}^{i} - Q (s_{i}, a_{i}; θ)]^{2}$
9:       if $terminate == True$ and $step \leq N_{step}$ then
10:        for $k = step$ to $N_{step}$ do
11:          Fill in the missing data with 0 to ensure data consistency.
12:        end for
13:      end if
14:      $S = S_{next}$
15:      if $t \bmod p == 0$ (p is the update frequency) then
16:        Copy the evaluation network parameters to the target network.
17:      end if
18:    end for
19:  end for
20:  Calculate the average reward of the $N_{episode}$ episodes and add it to the list.
21: end for
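The execution-complexity expressions given for the QMIX-based and DQN-based networks both reduce to a sum of products of adjacent layer widths. The helper below evaluates that sum for a fully connected network; the layer sizes in the example are hypothetical and are not the sizes used in the simulations.

```python
def forward_pass_multiplications(layer_sizes):
    """Sum of n_i * n_{i+1} over consecutive layers of a fully connected network."""
    return sum(
        n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


# Hypothetical agent network: 4 observation inputs, two hidden layers, 3 actions.
example_cost = forward_pass_multiplications([4, 64, 64, 3])
```

For this example the sum is 4·64 + 64·64 + 64·3 = 4544 multiplications per forward pass, which illustrates why online execution is inexpensive compared with offline training.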

5. Performance Evaluation

In this section, we employ Python simulations to show the performance of the proposed DT-BPA scheme in the VHTS system. We begin by presenting the simulation parameters. Subsequently, the convergence performance of the proposed algorithm is explained. Furthermore, we conduct simulations to analyze the performance of bandwidth and power resource management under various traffic demands. Finally, we compare the results with other algorithms to demonstrate the enhanced matching performance of the capacity and demand achieved by the proposed algorithm.

5.1. Simulation Setting

In the simulation scenario, we consider a very high-throughput GEO satellite that utilizes the Ka-band for communication. The channel gain primarily takes into account the free space propagation loss. The transmitting antenna gain depends on the antenna radiation pattern with a beamwidth of 0.6 degrees. The multi-beam antenna radiation pattern follows Rec. ITU-R S.672 [30]:

G (θ) = \begin{cases} G_{m}, & \text{for } θ < θ_{b} \\ G_{m} - 3 {(\frac{θ}{θ_{b}})}^{2}, & \text{for } θ_{b} \leq θ < a θ_{b} \\ G_{m} + L_{s}, & \text{for } a θ_{b} \leq θ < b θ_{b} \\ \max \{G_{m} + L_{s} + 20 - 25 \lg (\frac{θ}{θ_{b}}), 0\}, & \text{else} \end{cases} .

(36)
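Equation (36) can be evaluated with a short piecewise function. The sketch below follows the four regions exactly as written above; the default values of `l_s_db`, `a`, and `b` are assumptions chosen for illustration (they are not given in this section), and `g_max_db` and `theta_b_deg` must be supplied by the caller.

```python
import math


def antenna_gain_db(theta_deg, g_max_db, theta_b_deg, l_s_db=-20.0, a=2.58, b=6.32):
    """Piecewise gain of Equation (36) in dB at off-axis angle theta_deg >= 0.

    l_s_db, a and b are assumed illustrative defaults, not values from this paper.
    """
    ratio = theta_deg / theta_b_deg
    if ratio < 1.0:
        return g_max_db                      # main-lobe region
    if ratio < a:
        return g_max_db - 3.0 * ratio**2     # roll-off region
    if ratio < b:
        return g_max_db + l_s_db             # near side-lobe plateau
    # Far side lobes decay with 25*lg(ratio) and are floored at 0 dB.
    return max(g_max_db + l_s_db + 20.0 - 25.0 * math.log10(ratio), 0.0)
```

The value returned for each beam and user direction would then enter the link budget as the transmit antenna gain.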

The

G / T

value of the receiving antenna at the client is 22 dB/K. It is assumed that the satellite consists of 16 spot beams and 4 TWTAs and adopts a dual-polarized four-color reuse pattern. The beams of different frequencies are connected to the same amplifier. The relationship between the SE of each beam and the carrier-to-interference-plus-noise ratio is determined as follows [38]. The SE value is selected from low to high, and the energy per bit to noise power spectral density ratio of the beam is calculated as

E_{b} / N_{0} = CINR_{total} - 10 \log (SE) .

(37)

If this value is not less than the threshold corresponding to the SE, the beam can select the SE value until the highest SE value for the beam is reached. The correspondence between SE and the threshold is presented in Table 2. The specific system simulation parameters are outlined in Table 3. We input an episode to the network for training each time, and the algorithm simulation parameters are provided in Table 4.
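The SE selection procedure described around Equation (37) can be sketched as a scan from the lowest to the highest SE. The (SE, threshold) pairs below are placeholders for illustration only and are not the entries of Table 2; the function name is hypothetical, and the logarithm is taken to base 10 because the quantities are in dB.

```python
import math

# Hypothetical (SE in bit/s/Hz, Eb/N0 threshold in dB) pairs, sorted by SE.
# These numbers are placeholders and are NOT the entries of Table 2.
EXAMPLE_SE_TABLE = [(1.0, 1.0), (2.0, 3.0), (3.0, 6.0), (4.0, 9.0)]


def select_spectral_efficiency(cinr_total_db, se_table=EXAMPLE_SE_TABLE):
    """Return the highest SE whose threshold is met, scanning low to high.

    Returns None when even the lowest SE cannot be supported.
    """
    selected = None
    for se, threshold_db in se_table:
        eb_n0_db = cinr_total_db - 10.0 * math.log10(se)  # Equation (37)
        if eb_n0_db < threshold_db:
            break  # higher SE values are not tried once a threshold fails
        selected = se
    return selected
```

Multiplying the selected SE by the beam bandwidth then gives the capacity offered by that beam.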

We compare the proposed DT-BPA algorithm with the following four algorithms.

Single-timescale management of bandwidth and power: To follow the change in traffic demands during a period, the bandwidth and power resources are managed simultaneously, and only on a large timescale.
Proportional allocation of bandwidth and power (PABP): For the difference in traffic demand of each beam, the power and bandwidth resources are allocated according to the proportion of demand.
QMIX [39]: Each agent manages the bandwidth of each frequency color and the power of all beams using the bandwidth, and uses the QMIX algorithm for training.
DQN [40]: A single agent manages the bandwidth of all frequency colors and the power of all beams in the system, and uses the DQN algorithm for training.

In this paper, CDG is taken as the performance index, which is defined as the sum of gaps between the beam capacity and the beam demand of the system. The distribution of traffic demand is related to the population density, geographical location and temporal variations during the day.

5.2. Simulation Results

Figure 5 and Figure 6 illustrate the convergence trends of the reward and CDG of the three DRL algorithms during the training process. The DT-BPA algorithm and the QMIX algorithm start to converge at approximately 70 epochs, while the DQN algorithm does not converge. Initially, actions are selected randomly, resulting in a small reward. However, as the agents continue to learn and improve their actions, the reward gradually increases. The DT-BPA algorithm reaches a higher final reward and a smaller CDG than the QMIX algorithm, indicating that it matches capacity to demand more closely. The reason is that, in the QMIX baseline, a single agent manages too many resources, resulting in an overly large action space. Consequently, it becomes difficult to schedule resources effectively as demands change, and training usually ends in a local optimum.

Figure 7 illustrates the CDG obtained by different resource management algorithms as the number of beams in the VHTS system changes. Multiple sets of traffic demands with different distributions are input, and the average and standard deviation of the resulting CDG are calculated. Compared with the PABP, QMIX, and DQN algorithms, the proposed DT-BPA algorithm achieves the smallest CDG across different numbers of beams. This indicates that the DT-BPA algorithm matches the system’s capacity to the demand more closely than the baselines, improving resource utilization. Since the DQN did not converge, it was equivalent to random allocation, resulting in the worst performance. When the number of system beams is relatively small, the proposed algorithm is not much different from QMIX and PABP. However, as the number of beams increases, the superiority of the proposed DT-BPA algorithm becomes more prominent, indicating that our algorithm is suitable for the current VHTS system with a large number of beams.

Figure 8 presents the TPC obtained by different resource management algorithms as the number of beams in the VHTS system changes. Multiple sets of traffic demands with different distributions are input, and the average and standard deviation of the resulting TPC are calculated. The PABP algorithm allocates all resources according to the proportion of the demands of different beams, so its TPC is fixed. Compared with the PABP, QMIX, and DQN algorithms, the DT-BPA algorithm consistently achieves the lowest TPC across the different numbers of beams. This indicates that the DT-BPA algorithm effectively reduces the power consumption of the system, contributing to increased service life.

Figure 9 shows the CDG obtained by different resource management algorithms with 16 beams as an example in the case of traffic demands with different variances. When the total demand of the system is fixed, the higher the variance of traffic demand data of each beam, the more heterogeneous the distribution of the demands becomes. By comparing the proposed DT-BPA algorithm with PABP, QMIX, and DQN algorithms, it is evident that the DT-BPA algorithm achieves the minimum CDG under varying demand distributions, exhibiting the least susceptibility to the influence of uneven demand distribution, with the QMIX algorithm ranking second. The PABP algorithm allocates the resources according to the proportion of each beam demand, so it is most affected by the uneven distribution of demands. Due to its inability to converge, the DQN algorithm resorts to random allocation regardless of the demand distribution, resulting in a largely unchanged CDG.

Figure 10 compares the system capacity results of double-timescale and single-timescale resource management within a day. In the case of double-timescale resource management, we adjust the bandwidth and power based on the average demand for the next six hours at 0:00, 6:00, 12:00, and 18:00. Additionally, we adjust the power every hour according to the current demand. In the case of single-timescale resource management, the bandwidth and power resources are adjusted simultaneously based on the average demand only four times to satisfy the traffic demand for the whole day as much as possible. It is evident that, despite significant fluctuations in traffic demand over time, the double-timescale resource management algorithm still achieves a satisfactory matching of capacity and demand. This demonstrates the effectiveness of managing bandwidth and power resources on different timescales.
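The double-timescale schedule described for Figure 10 can be sketched as a simple daily loop: a joint bandwidth and power update every six hours based on the average demand of the coming window, and a power-only update every hour. The two callables stand in for the trained DQN-based and QMIX-based networks and are hypothetical, as is the use of scalar demands.

```python
def double_timescale_schedule(hourly_demand, allocate_large, allocate_small, period_h=6):
    """Run one day of double-timescale resource management.

    hourly_demand:  list of per-hour demand values (scalars here for simplicity)
    allocate_large: callable(avg_demand) -> (bandwidth, power); hypothetical stand-in
                    for the large-timescale agent
    allocate_small: callable(bandwidth, power, demand) -> power; hypothetical stand-in
                    for the small-timescale power agents
    """
    log = []
    bandwidth = power = None
    for hour, demand in enumerate(hourly_demand):
        if hour % period_h == 0:
            # Large timescale: use the average demand of the next period_h hours.
            window = hourly_demand[hour:hour + period_h]
            avg_demand = sum(window) / len(window)
            bandwidth, power = allocate_large(avg_demand)
        # Small timescale: refine the power every hour with the current demand.
        power = allocate_small(bandwidth, power, demand)
        log.append((hour, bandwidth, power))
    return log
```

The single-timescale baseline corresponds to dropping the hourly `allocate_small` call, so that both resources change only at the four large-timescale instants.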

Figure 11 depicts the demands and capacities of 16 beams in the system based on the DT-BPA algorithm as the total demand keeps increasing. The proposed algorithm closely matches the capacities provided by all beams to the demands for a total demand of 21 Gbps at 0:00 and 40 Gbps at 19:00. The capacities and demands of individual beams show a relatively large gap, which can be attributed to the frequency reuse scheme of the system. As a result of this scheme, some beams share the same bandwidth resources, which are weighted according to the traffic demands associated with each beam. However, when the total system demand surges to 60 Gbps, which exceeds the capacity that the system resources can provide, the demand of each beam is difficult to meet. In this case, the system requires additional optimization or capacity expansion measures.

6. Conclusions

In this paper, we proposed a DT-BPA scheme to address the bandwidth and power resource management problem in VHTS systems. Our objective was to achieve efficient resource allocation while ensuring that the capacity provided by each beam closely matches the traffic demand over a given time period, considering the frequency reuse scheme. To tackle this problem, we employed the MADRL method, which combines the QMIX network and the DQN network, to manage power and bandwidth flexibly. We evaluated the performance of our algorithm under time-varying traffic demands, using metrics such as the CDG and TPC. Furthermore, we compared the proposed algorithm with other resource allocation strategies, including single-timescale resource allocation, proportional resource allocation, QMIX algorithm, and DQN algorithm. Through experimental results, we demonstrated that our algorithm achieved a superior performance in terms of capacity matching for each beam under time-varying traffic demands. Additionally, the training of the DT-BPA algorithm was performed offline. As a result, the online decision-making process can quickly allocate resources for all beams, making it cost-effective in the VHTS communication scenario. The efficiency is crucial for real-time resource management and optimization in satellite communication systems.

Author Contributions

Conceptualization, L.F., C.Z. and Q.Z.; methodology, L.F. and Q.Z.; software, L.F. and Q.Z.; validation, L.F., C.Z. and L.Z.; formal analysis, L.F.; investigation, L.F.; resources, C.Z. and P.Q.; data curation, L.F.; writing—original draft preparation, L.F.; writing—review and editing, C.Z., Q.Z., P.Q. and Y.W.; supervision, P.Q. and Y.W.; project administration, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Program of China under Grant 2021YFB2900504; and by the cooperation project with China Academy of Space Technology.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Saad, W.; Bennis, M.; Chen, M. A vision of 6G wireless systems: Applications, trends, technologies, and open research problems. IEEE Netw. 2019, 34, 134–142.
2. Zhang, Z.; Xiao, Y.; Ma, Z.; Xiao, M.; Ding, Z.; Lei, X.; Karagiannidis, G.K.; Fan, P. 6G wireless networks: Vision, requirements, architecture, and key technologies. IEEE Veh. Technol. Mag. 2019, 14, 28–41.
3. Tataria, H.; Shafi, M.; Molisch, A.F.; Dohler, M.; Sjöland, H.; Tufvesson, F. 6G wireless systems: Vision, requirements, challenges, insights, and opportunities. Proc. IEEE 2021, 109, 1166–1199.
4. Yang, P.; Xiao, Y.; Xiao, M.; Li, S. 6G wireless communications: Vision and potential techniques. IEEE Netw. 2019, 33, 70–75.
5. Giordani, M.; Zorzi, M. Non-terrestrial networks in the 6G era: Challenges and opportunities. IEEE Netw. 2020, 35, 244–251.
6. Giordani, M.; Zorzi, M. Satellite communication at millimeter waves: A key enabler of the 6G era. In Proceedings of the International Conference on Computing, Networking and Communications (ICNC), Big Island, HI, USA, 19–22 February 2020; pp. 383–388.
7. Zhu, X.; Jiang, C. Integrated satellite-terrestrial networks toward 6G: Architectures, applications, and challenges. IEEE Internet Things J. 2021, 9, 437–461.
8. Vidal, O.; Verelst, G.; Lacan, J.; Alberty, E.; Radzik, J.; Bousquet, M. Next generation high throughput satellite system. In Proceedings of the IEEE First AESS European Conference on Satellite Telecommunications (ESTEL), Rome, Italy, 2–5 October 2012; pp. 1–7.
9. Le Kernec, A.; Canuet, L.; Maho, A.; Sotom, M.; Matter, D.; Francou, L.; Edmunds, J.; Welch, M.; Kehayas, E.; Perlot, N.; et al. The H2020 VERTIGO project towards tbit/s optical feeder links. In Proceedings of the SPIE International Conference on Space Optics, Virtual, 6–11 March 2021; pp. 508–519.
10. Guan, Y.; Geng, F.; Saleh, J.H. Review of high throughput satellites: Market disruptions, affordability-throughput map, and the cost per bit/second decision tree. IEEE Aerosp. Electron. Syst. Mag. 2019, 34, 64–80.
11. Jia, H.; Wang, Y.; Wu, W. Dynamic Resource Allocation for Remote IoT Data Collection in SAGIN. IEEE Internet Things J. 2024, 11, 20575–20589.
12. Al-Hraishawi, H.; Lagunas, E.; Chatzinotas, S. Traffic simulator for multibeam satellite communication systems. In Proceedings of the Advanced Satellite Multimedia Systems Conference and Signal Processing for Space Communications Workshop (ASMS/SPSC), Graz, Austria, 20–21 October 2020; pp. 1–8.
13. De Gaudenzi, R.; Angeletti, P.; Petrolati, D.; Re, E. Future technologies for very high throughput satellite systems. Int. J. Satell. Commun. Netw. 2020, 38, 141–161.
14. Hasan, M.; Bianchi, C. Ka band enabling technologies for high throughput satellite (HTS) communications. Int. J. Satell. Commun. Netw. 2016, 34, 483–501.
15. Li, Z.; Chen, W.; Wu, Q.; Cao, H.; Wang, K.; Li, J. Robust beamforming design and time allocation for IRS-assisted wireless powered communication networks. IEEE Trans. Commun. 2022, 70, 2838–2852.
16. Li, Z.; Chen, W.; Cao, H.; Tang, H.; Wang, K.; Li, J. Joint communication and trajectory design for intelligent reflecting surface empowered UAV SWIPT networks. IEEE Trans. Veh. Technol. 2022, 71, 12840–12855.
17. Hu, X.; Liao, X.; Liu, Z.; Liu, S.; Ding, X.; Helaoui, M.; Wang, W.; Ghannouchi, F.M. Multi-agent deep reinforcement learning-based flexible satellite payload for mobile terminals. IEEE Trans. Veh. Technol. 2020, 69, 9849–9865.
18. Wang, H.; Liu, A.; Pan, X.; Li, J. Optimization of power allocation for a multibeam satellite communication system with interbeam interference. J. Appl. Math. 2014, 2014, 1–8.
19. Abdu, T.S.; Kisseleff, S.; Lagunas, E.; Chatzinotas, S. Flexible resource optimization for GEO multibeam satellite communication system. IEEE Trans. Wirel. Commun. 2021, 20, 7888–7902.
20. Durand, F.R.; Abrão, T. Power allocation in multibeam satellites based on particle swarm optimization. AEU Int. J. Electron. Commun. 2017, 78, 124–133.
21. Cocco, G.; De Cola, T.; Angelone, M.; Katona, Z.; Erl, S. Radio resource management optimization of flexible satellite payloads for DVB-S2 systems. IEEE Trans. Broadcast. 2017, 64, 266–280.
22. Paris, A.; Del Portillo, I.; Cameron, B.; Crawley, E. A genetic algorithm for joint power and bandwidth allocation in multibeam satellite systems. In Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 2–9 March 2019; pp. 1–15.
23. Zhang, P.; Wang, X.; Ma, Z.; Liu, S.; Song, J. An online power allocation algorithm based on deep reinforcement learning in multibeam satellite systems. Int. J. Satell. Commun. Netw. 2020, 38, 450–461.
24. Liao, X.; Hu, X.; Liu, Z.; Ma, S.; Xu, L.; Li, X.; Wang, W.; Ghannouchi, F.M. Distributed intelligence: A verification for multi-agent DRL-based multibeam satellite resource allocation. IEEE Commun. Lett. 2020, 24, 2785–2789.
25. Fenech, H.; Amos, S.; Hirsch, A.; Soumpholphakdy, V. VHTS systems: Requirements and evolution. In Proceedings of the European Conference on Antennas and Propagation (EUCAP), Paris, France, 19–24 March 2017; pp. 2409–2412.
26. Aravanis, A.I.; MR, B.S.; Arapoglou, P.D.; Danoy, G.; Cottis, P.G.; Ottersten, B. Power allocation in multibeam satellite systems: A two-stage multi-objective optimization. IEEE Trans. Wirel. Commun. 2015, 14, 3171–3182.
27. Pachler, N.; Luis, J.J.G.; Guerster, M.; Crawley, E.; Cameron, B. Allocating power and bandwidth in multibeam satellite systems using particle swarm optimization. In Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 7–14 March 2020; pp. 1–11.
28. Ortiz-Gomez, F.G.; Tarchi, D.; Rodriguez-Osorio, R.M.; Vanelli-Coralli, A.; Salas-Natera, M.A.; Landeros-Ayala, S. Supervised machine learning for power and bandwidth management in VHTS systems. In Proceedings of the 10th Advanced Satellite Multimedia Systems Conference and the 16th Signal Processing for Space Communications Workshop (ASMS/SPSC), Graz, Austria, 20–21 October 2020; pp. 1–7.
29. Luis, J.J.G.; Guerster, M.; del Portillo, I.; Crawley, E.; Cameron, B. Deep reinforcement learning for continuous power allocation in flexible high throughput satellites. In Proceedings of the IEEE Cognitive Communications for Aerospace Applications Workshop (CCAAW), Cleveland, OH, USA, 25–26 June 2019; pp. 1–4.
30. Hu, X.; Zhang, Y.; Liao, X.; Liu, Z.; Wang, W.; Ghannouchi, F.M. Dynamic beam hopping method based on multi-objective deep reinforcement learning for next generation satellite broadband systems. IEEE Trans. Broadcast. 2020, 66, 630–646.
31. Bi, S.; Huang, L.; Wang, H.; Zhang, Y.J.A. Stable online computation offloading via Lyapunov-guided deep reinforcement learning. In Proceedings of the IEEE International Conference on Communications, Montreal, QC, Canada, 14–23 June 2021; pp. 1–7.
32. Liu, S.; Hu, X.; Wang, W. Deep reinforcement learning based dynamic channel allocation algorithm in multibeam satellite systems. IEEE Access 2018, 6, 15733–15742.
33. Ferreira, P.V.R.; Paffenroth, R.; Wyglinski, A.M.; Hackett, T.M.; Bilén, S.G.; Reinhart, R.C.; Mortensen, D.J. Multiobjective reinforcement learning for cognitive satellite communications using deep neural network ensembles. IEEE J. Sel. Areas. Commun. 2018, 36, 1030–1041.
34. Ortiz-Gomez, F.G.; Tarchi, D.; Martínez, R.; Vanelli-Coralli, A.; Salas-Natera, M.A.; Landeros-Ayala, S. Cooperative multi-agent deep reinforcement learning for resource management in full flexible VHTS systems. IEEE Trans. Cognit. Commun. Netw. 2021, 8, 335–349.
35. Sharma, S.K.; Borras, J.Q.; Maturo, N.; Chatzinotas, S.; Ottersten, B. System modeling and design aspects of next generation high throughput satellites. IEEE Commun. Lett. 2020, 25, 2443–2447.
36. Gonthier, G. Formal proof—The four-color theorem. Notices AMS 2008, 55, 1382–1393.
37. Qiuyang, Z.; Ying, W.; Xue, W. A double-timescale reinforcement learning based cloud-edge collaborative framework for decomposable intelligent services in industrial Internet of Things. China Commun. 2024, 2024, 1–19.
38. Bachir, A.F.B.A.; Zhour, M.; Ahmed, M. Modeling and Design of a DVB-S2X system. In Proceedings of the International Conference on Optimization and Applications (ICOA), Kenitra, Morocco, 25–26 April 2019; pp. 1–5.
39. Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 7234–7284.
Hu, X.; Liu, S.; Chen, R.; Wang, W.; Wang, C. A deep reinforcement learning-based framework for dynamic resource allocation in multibeam satellite systems. IEEE Commun. Lett. 2018, 22, 1612–1615. [Google Scholar] [CrossRef]

Figure 1. Double-timescale DRM architecture of VHTS system.

Figure 2. Multi-color frequency reuse in the system: (a) 19-beam layout; (b) four-color reuse pattern.

Figure 3. Double-timescale management strategy of bandwidth and power.

Figure 4. Double-timescale resource management algorithm architecture.

Figure 5. Reward convergence curve of DRL algorithms.

Figure 6. CDG convergence curve of DRL algorithms.

Figure 7. CDG comparison of each algorithm under different beam numbers.

Figure 8. TPC comparison of each algorithm under different beam numbers.

Figure 9. CDG comparison of each algorithm under different variances of the demand distribution.

Figure 10. Changes in demand and capacity within a day.

Figure 11. Demands and capacities of 16 beams based on the DT-BPA algorithm under different total demands: (a) 21 Gbps at 0:00; (b) 40 Gbps at 19:00; (c) 60 Gbps.

Table 1. Literature comparison of multi-beam satellite system resource allocation problem.

Type	Objective	Algorithm Classification	Concrete Algorithm	Demand Matching	Bandwidth Flexibility	Frequency Reuse	Power Flexibility	Literature
Single-objective	AFI	Convex optimization	Lagrange dual theory	×	×	×	✓	[18]
	USC, TPC, TBC	Convex optimization	DC, SCA	✓	✓	×	✓	[19]
	TPC	Heuristics	PSO	×	×	×	✓	[20]
	CDG, AFI	Heuristics	SA	✓	✓	×	✓	[21]
	USC	Heuristics	GA	✓	✓	✓	✓	[22]
	TPC	Deep reinforcement learning	DRL-DPA	×	×	×	✓	[23]
	CDG, AFI	Deep reinforcement learning	DRL-DBA	✓	✓	×	×	[24]
Multi-objective	USC, TPC	Heuristics	GA-SA, NSGA-II	✓	×	×	✓	[26]
	USC, TPC	Heuristics	PSO-GA, NSGA-II	✓	✓	×	✓	[27]
	CDG, TPC, TBC	Machine learning	SL	✓	✓	×	✓	[28]
	USC, TPC	Deep reinforcement learning	PPO	✓	×	×	✓	[29]

Table 2. SE values and their corresponding thresholds.

SE	Threshold	SE	Threshold
0.83	1.8	2.22	5.6
1	2.2	2.5	6.3
1.11	2.6	2.67	6.8
1.25	3	2.78	7.2
1.33	3.4	3.13	7.8
1.39	3.8	3.33	8.5
1.5	3.9	3.47	8.9
1.67	4.6	3.7	10.1
1.88	5.4	3.75	10.4
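
Table 2 can be read as a lookup table: for a given link quality, a beam uses the highest SE whose threshold is still met. The following is a minimal sketch of that lookup, not the authors' implementation. It assumes that Threshold is the minimum link-quality value (for example an SNR or SINR figure) required for each SE and that SE is in bit/s/Hz; neither unit is stated in the table. The names `spectral_efficiency` and `beam_capacity_bps` are hypothetical.

```python
from bisect import bisect_right

# (threshold, SE) pairs copied from Table 2, in increasing threshold order.
# Assumption: threshold = minimum link quality needed to use that SE.
SE_TABLE = [
    (1.8, 0.83), (2.2, 1.0), (2.6, 1.11), (3.0, 1.25), (3.4, 1.33),
    (3.8, 1.39), (3.9, 1.5), (4.6, 1.67), (5.4, 1.88), (5.6, 2.22),
    (6.3, 2.5), (6.8, 2.67), (7.2, 2.78), (7.8, 3.13), (8.5, 3.33),
    (8.9, 3.47), (10.1, 3.7), (10.4, 3.75),
]
THRESHOLDS = [threshold for threshold, _ in SE_TABLE]


def spectral_efficiency(link_quality: float) -> float:
    """Return the highest SE whose threshold does not exceed link_quality.

    Returns 0.0 when the link quality is below the lowest threshold,
    i.e. no entry of Table 2 can be used.
    """
    idx = bisect_right(THRESHOLDS, link_quality)
    return 0.0 if idx == 0 else SE_TABLE[idx - 1][1]


def beam_capacity_bps(link_quality: float, bandwidth_hz: float) -> float:
    """Capacity of one beam as SE times bandwidth (SE assumed in bit/s/Hz)."""
    return spectral_efficiency(link_quality) * bandwidth_hz
```

For example, under these assumptions a beam with link quality 6.3 and 500 MHz of bandwidth maps to SE 2.5, so `beam_capacity_bps(6.3, 500e6)` gives 1.25 Gbps.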

Table 3. Satellite system simulation parameters.

Parameters	Values
Satellite altitude h	35,786 km
Beam number B	16
System total power $P_{tot}$	1600 W
Amplifier maximum power $P_{max}^{a}$	400 W
System total bandwidth $BW_{tot}$	2 GHz
Beamwidth $θ$	0.6°
Link loss $FSPL$	210 dB
Amplifier output back-off $OBO$	3 dB
UT receiving $G_{u}/T_{s}$	22 dB/K
Boltzmann constant k	−228.6 dBW/K/Hz
Amplifier number $N_{a}$	4
Frequency color number M	4
Polarization number $N_{p}$	2
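
To show how the Table 3 values combine, the sketch below evaluates a textbook downlink budget, C/N0 = EIRP − FSPL + G/T − k, all in decibel units. It is an illustration under stated assumptions and may differ from the link budget in Section 3.2: inter-beam interference is ignored, the OBO is applied as a plain reduction of the beam power, and the satellite transmit antenna gain `tx_gain_dbi` is a hypothetical input that Table 3 does not list.

```python
import math

# Values taken from Table 3.
FSPL_DB = 210.0            # link loss
OBO_DB = 3.0               # amplifier OBO
G_OVER_T_DB_K = 22.0       # user-terminal G/T
BOLTZMANN_DB = -228.6      # Boltzmann constant in dBW/K/Hz


def to_db(x: float) -> float:
    """Convert a linear ratio (or watts / hertz) to decibels."""
    return 10.0 * math.log10(x)


def cn0_dbhz(beam_power_w: float, tx_gain_dbi: float) -> float:
    """Carrier-to-noise-density ratio of one beam, in dBHz.

    tx_gain_dbi is a hypothetical satellite antenna gain (not in Table 3).
    Assumption: OBO simply reduces the radiated power; interference ignored.
    """
    eirp_dbw = to_db(beam_power_w) - OBO_DB + tx_gain_dbi
    return eirp_dbw - FSPL_DB + G_OVER_T_DB_K - BOLTZMANN_DB


def snr_db(beam_power_w: float, bandwidth_hz: float, tx_gain_dbi: float) -> float:
    """Signal-to-noise ratio over the beam bandwidth, in dB."""
    return cn0_dbhz(beam_power_w, tx_gain_dbi) - to_db(bandwidth_hz)
```

With, say, 100 W of beam power and an assumed 50 dBi antenna gain, `cn0_dbhz(100.0, 50.0)` evaluates to 107.6 dBHz; raising the allocated bandwidth lowers the SNR for the same power, which is the coupling between bandwidth and power that the allocation scheme has to balance.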

Table 4. Algorithm simulation parameters.

Parameters	Values
Step number $N_{step}$ (QMIX)	100
Epoch number $N_{epoch}$ (QMIX)	100
Episode number $N_{episode}$ (QMIX)	10
Step number $N_{step}$ (DQN)	50
Epoch number $N_{epoch}$ (DQN)	50
Episode number $N_{episode}$ (DQN)	12
Update frequency	100
Learning rate	0.001
Discount factor $γ$	0.98
Batch size	32


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

