Deep Reinforcement Learning Heterogeneous Channels for Poisson Multiple Access
Abstract
1. Introduction
- We propose a DRL-based MCT-DLMA protocol for efficient spectrum utilization in multi-channel HetNets. It learns a near-optimal spectrum access policy by exploiting the collected historical channel information. A salient feature of MCT-DLMA is that it enables the cognitive user (CU) to transmit data on multiple channels at a time, e.g., via multi-carrier technology, thereby avoiding the waste of channel resources when multiple channels are idle simultaneously (see the action-set sketch after these bullets).
- The proposed MCT-DLMA is optimized for saturated and unsaturated traffic networks, respectively. The experimental results show that MCT-DLMA achieves higher throughput than the existing Whittle index policy [31] and the DLMA protocol in static HetNets. In particular, MCT-DLMA also shows enhanced robustness in dynamic environments, where the PUs communicating with the CU change over time.
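As an illustration of the multi-channel feature, the following is a minimal sketch (ours, not from the paper; `build_action_space` is a hypothetical helper, and encoding actions as channel subsets is our assumption about the protocol) of the CU's action set when it may transmit on any subset of N channels in a slot:

```python
# A minimal sketch, assuming the CU's action is the subset of channels it
# transmits on in a slot; build_action_space is a hypothetical helper name.
from itertools import combinations

def build_action_space(n_channels: int):
    """Enumerate every subset of channels, from 'stay silent' () to all channels."""
    actions = []
    for k in range(n_channels + 1):          # k = number of channels used at once
        actions.extend(combinations(range(n_channels), k))
    return actions

# For 3 channels: 2**3 = 8 subset actions, vs. only 4 (3 channels + wait)
# when the CU is restricted to single-channel transmission.
print(len(build_action_space(3)))  # -> 8
```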
2. Deep Reinforcement Learning Framework
2.1. Q-Learning
2.2. Deep Q Network
- Double DQN: solves the overestimation problem in DQN [33]. It changes the loss function by letting the online network select the next action while the target network (with parameters $\theta^-$) evaluates it, i.e., the loss becomes $L(\theta)=\mathbb{E}\big[\big(r_t+\gamma Q(s_{t+1},\arg\max_{a}Q(s_{t+1},a;\theta);\theta^-)-Q(s_t,a_t;\theta)\big)^2\big]$.
- Dueling DQN: changes the unbranched neural-network structure in DQN [34]. An advantage stream and a value stream are added, which estimate the advantage of each action and the value of the current state, respectively, before being recombined into Q-values. A minimal sketch of the dueling architecture is given after this list.
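The following is a minimal sketch of a dueling Q-network, assuming a PyTorch implementation (the paper does not publish code, and the layer sizes here are placeholders, not the authors' values):

```python
# A minimal sketch, assuming a PyTorch implementation (the paper does not
# publish code; layer sizes here are placeholders, not the authors' values).
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling Q-network: shared features, then value and advantage streams [34]."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s): scalar state value
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a): per-action advantage

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.feature(s)
        v, a = self.value(h), self.advantage(h)
        # Subtract the mean advantage so V and A are uniquely identifiable [34].
        return v + a - a.mean(dim=-1, keepdim=True)
```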
3. System Model and Problem of Interest
1. TDMA: sends a data packet in a fixed time slot of each frame.
2. Q-ALOHA: sends a data packet with a fixed probability q in each time slot.
3. Fixed-window ALOHA (FW-ALOHA): randomly generates a value w within a fixed window after sending a data packet, and waits w time slots before the next transmission [25].
4. CU: adopts the MCT-DLMA protocol; it monitors the channel states (BUSY/IDLE) over a window of past time slots and selects the channels on which to send data packets based on the deep Q-network. (A hedged sketch of the coexisting nodes' per-slot decisions follows this list.)
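To make the coexistence setup concrete, below is a minimal Python sketch of the per-slot transmit decisions of the TDMA, Q-ALOHA, and FW-ALOHA nodes (our illustration; the window size, transmit probability, and frame length are assumptions, not values from the paper):

```python
# A minimal sketch of the coexisting nodes' per-slot transmit decisions.
# Window size, transmit probability, and frame length are assumptions,
# not values from the paper.
import random

def tdma_tx(slot: int, my_slot: int, frame_len: int = 10) -> bool:
    """TDMA node: transmits only in its fixed slot of each frame."""
    return slot % frame_len == my_slot

def q_aloha_tx(q: float = 0.3) -> bool:
    """Q-ALOHA node: transmits with fixed probability q in every slot."""
    return random.random() < q

class FWAlohaNode:
    """FW-ALOHA node: after a transmission, waits a random number of slots
    drawn from a fixed window before transmitting again [25]."""
    def __init__(self, window: int = 4):
        self.window, self.wait = window, 0

    def tx(self) -> bool:
        if self.wait > 0:              # still counting down the backoff
            self.wait -= 1
            return False
        self.wait = random.randrange(1, self.window + 1)  # new random wait w
        return True
```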
4. MCT-DLMA Protocol
4.1. Action
4.2. State
4.3. Reward
4.4. Neural Network
4.5. Algorithm
Algorithm 1 Double-dueling deep Q-network. (A hedged sketch of the corresponding update step is given below.)
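As an illustration of the double-dueling update that Algorithm 1 names, the sketch below shows one double-DQN training step using the dueling network from Section 2.2 (our reconstruction under assumptions; GAMMA and the replay-batch format are placeholders, and terminal-state masking is omitted for brevity):

```python
# A minimal sketch of one double-DQN update step with the dueling network
# above; GAMMA and the replay-batch format are assumptions, and terminal-state
# masking is omitted for brevity.
import torch
import torch.nn.functional as F

GAMMA = 0.9  # assumed discount factor

def double_dqn_update(online, target, optimizer, batch):
    s, a, r, s_next = batch                    # a: int64 action indices
    # Online network selects the greedy next action ...
    next_a = online(s_next).argmax(dim=1, keepdim=True)
    # ... while the target network evaluates it, decoupling selection
    # from evaluation to curb overestimation [33].
    with torch.no_grad():
        y = r + GAMMA * target(s_next).gather(1, next_a).squeeze(1)
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```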
5. Simulation Results
5.1. MCT-DLMA + 3-User TDMA
5.2. MCT-DLMA + TDMA + Q-ALOHA + FW-ALOHA
5.3. MCT-DLMA + 10-User Dynamic TDMA
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Zhao, Q.; Swami, A. A Survey of Dynamic Spectrum Access: Signal Processing and Networking Perspectives. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, HI, USA, 15–20 April 2007; Volume 4, pp. IV-1349–IV-1352.
2. Colvin, A. CSMA with collision avoidance. Comput. Commun. 1983, 6, 227–235.
3. Abramova, M.; Artemenko, D.; Krinichansky, K. Transmission Channels between Financial Deepening and Economic Growth: Econometric Analysis Comprising Monetary Factors. Mathematics 2022, 10, 242.
4. Liu, S.; Huang, X. Sparsity-aware channel estimation for mmWave massive MIMO: A deep CNN-based approach. China Commun. 2021, 18, 162–171.
5. Tangelapalli, S.; Saradhi, P.P.; Pandya, R.J.; Iyer, S. Deep Learning Oriented Channel Estimation for Interference Reduction for 5G. In Proceedings of the 2021 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 24–25 September 2021; pp. 1–4.
6. Yao, R.; Qin, Q.; Wang, S.; Qi, N.; Fan, Y.; Zuo, X. Deep Learning Assisted Channel Estimation Refinement in Uplink OFDM Systems Under Time-Varying Channels. In Proceedings of the 2021 International Wireless Communications and Mobile Computing (IWCMC), Harbin, China, 28 June–2 July 2021; pp. 1349–1353.
7. Rahman, M.H.; Shahjalal, M.; Ali, M.O.; Yoon, S.; Jang, Y.M. Deep Learning Based Pilot Assisted Channel Estimation for Rician Fading Massive MIMO Uplink Communication System. In Proceedings of the 2021 Twelfth International Conference on Ubiquitous and Future Networks (ICUFN), Jeju Island, Republic of Korea, 17–20 August 2021; pp. 470–472.
8. Ahn, Y.; Shim, B. Deep Learning-Based Beamforming for Intelligent Reflecting Surface-Assisted mmWave Systems. In Proceedings of the 2021 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 20–22 October 2021; pp. 1731–1734.
9. Dong, R.; Wang, B.; Cao, K. Deep Learning Driven 3D Robust Beamforming for Secure Communication of UAV Systems. IEEE Wirel. Commun. Lett. 2021, 10, 1643–1647.
10. Al-Saggaf, U.M.; Hassan, A.K.; Moinuddin, M. Ergodic Capacity Analysis of Downlink Communication Systems under Covariance Shaping Equalizers. Mathematics 2022, 10, 4340.
11. Ma, H.; Fang, Y.; Chen, P.; Li, Y. Reconfigurable Intelligent Surface-aided M-ary FM-DCSK System: A New Design for Noncoherent Chaos-based Communication. IEEE Trans. Veh. Technol. 2022, early access.
12. Yang, M.; Bian, C.; Kim, H.S. OFDM-Guided Deep Joint Source Channel Coding for Wireless Multipath Fading Channels. IEEE Trans. Cogn. Commun. Netw. 2022, 8, 584–599.
13. Xu, J.; Ai, B.; Chen, W.; Yang, A.; Sun, P.; Rodrigues, M. Wireless Image Transmission Using Deep Source Channel Coding With Attention Modules. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 2315–2328.
14. Xu, C.; Van Luong, T.; Xiang, L.; Sugiura, S.; Maunder, R.G.; Yang, L.L.; Hanzo, L. Turbo Detection Aided Autoencoder for Multicarrier Wireless Systems: Integrating Deep Learning Into Channel Coded Systems. IEEE Trans. Cogn. Commun. Netw. 2022, 8, 600–614.
15. Chen, P.; Shi, L.; Fang, Y.; Lau, F.C.M.; Cheng, J. Rate-Diverse Multiple Access over Gaussian Channels. IEEE Trans. Wirel. Commun. 2023, early access.
16. Fang, Y.; Bu, Y.; Chen, P.; Lau, F.C.M.; Otaibi, S.A. Irregular-Mapped Protograph LDPC-Coded Modulation: A Bandwidth-Efficient Solution for 6G-Enabled Mobile Networks. IEEE Trans. Intell. Transp. Syst. 2021, early access.
17. Mazhari Saray, A.; Ebrahimi, A. MAX-MIN Power Control of Cell Free Massive MIMO System employing Deep Learning. In Proceedings of the 2022 4th West Asian Symposium on Optical and Millimeter-Wave Wireless Communications (WASOWC), Tabriz, Iran, 12–13 May 2022; pp. 1–4.
18. Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. 2013, 47, 253–279.
19. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
20. Nasir, Y.S.; Guo, D. Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks. IEEE J. Sel. Areas Commun. 2019, 37, 2239–2250.
21. Malik, T.S.; Malik, K.R.; Afzal, A.; Ibrar, M.; Wang, L.; Song, H.; Shah, N. RL-IoT: Reinforcement Learning-based Routing Approach for Cognitive Radio-enabled IoT Communications. IEEE Internet Things J. 2022, 10, 1836–1847.
22. Yang, H.; Xiong, Z.; Zhao, J.; Niyato, D.; Xiao, L.; Wu, Q. Deep Reinforcement Learning-Based Intelligent Reflecting Surface for Secure Wireless Communications. IEEE Trans. Wirel. Commun. 2021, 20, 375–388.
23. Wang, S.; Liu, H.; Gomes, P.H.; Krishnamachari, B. Deep Reinforcement Learning for Dynamic Multichannel Access in Wireless Networks. IEEE Trans. Cogn. Commun. Netw. 2018, 4, 257–265.
24. Naparstek, O.; Cohen, K. Deep multi-user reinforcement learning for distributed dynamic spectrum access. IEEE Trans. Wirel. Commun. 2018, 18, 310–323.
25. Yu, Y.; Wang, T.; Liew, S.C. Deep-Reinforcement Learning Multiple Access for Heterogeneous Wireless Networks. IEEE J. Sel. Areas Commun. 2019, 37, 1277–1290.
26. Wang, N.; Hu, J. Performance Analysis of the IEEE 802.11p EDCA for Vehicular Networks in Imperfect Channels. In Proceedings of the 2021 20th International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS), London, UK, 20–22 December 2021; pp. 535–540.
27. Ruan, Y.; Zhang, Y.; Li, Y.; Zhang, R.; Hang, R. An Adaptive Channel Division MAC Protocol for High Dynamic UAV Networks. IEEE Sens. J. 2020, 20, 9528–9539.
28. Cao, J.; Cleveland, W.S.; Lin, D.; Sun, D.X. Internet traffic tends toward Poisson and independent as the load increases. In Nonlinear Estimation and Classification; Springer: Berlin/Heidelberg, Germany, 2003; pp. 83–109.
29. Liu, X.; Sun, C.; Yu, W.; Zhou, M. Reinforcement-Learning-Based Dynamic Spectrum Access for Software-Defined Cognitive Industrial Internet of Things. IEEE Trans. Ind. Inform. 2022, 18, 4244–4253.
30. Liu, X.; Sun, C.; Zhou, M.; Lin, B.; Lim, Y. Reinforcement learning based dynamic spectrum access in cognitive Internet of Vehicles. China Commun. 2021, 18, 58–68.
31. Liu, K.; Zhao, Q. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans. Inf. Theory 2010, 56, 5547–5567.
32. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
33. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
34. Wang, Z.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), PMLR, New York, NY, USA, 20–22 June 2016; pp. 1995–2003.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).