Article

Hybrid NOMA/OMA-Based Dynamic Power Allocation Scheme Using Deep Reinforcement Learning in 5G Networks

1 School of Electrical Engineering, University of Ulsan, Ulsan 44610, Korea
2 College of Engineering Technology, Can Tho University, Can Tho 94000, Vietnam
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(12), 4236; https://doi.org/10.3390/app10124236
Submission received: 3 June 2020 / Revised: 16 June 2020 / Accepted: 17 June 2020 / Published: 20 June 2020

Abstract

Non-orthogonal multiple access (NOMA) is considered a potential technique for fifth-generation (5G) networks. Nevertheless, applying NOMA to a massive-access scenario is relatively complex. Thus, in this paper, a hybrid NOMA/OMA scheme is considered for uplink wireless transmission systems where multiple cognitive users (CUs) can simultaneously transmit their data to a cognitive base station (CBS). We adopt a user-pairing algorithm in which the CUs are grouped into multiple pairs, and each group is assigned to an orthogonal sub-channel such that each user in a pair applies NOMA to transmit data to the CBS without causing interference with other groups. Subsequently, the signal transmitted by the CUs of each NOMA group can be independently retrieved by using successive interference cancellation (SIC). The CUs are assumed to harvest solar energy to maintain operations. Moreover, joint power and bandwidth allocation is taken into account at the CBS to optimize energy and spectrum efficiency in order to obtain the maximum long-term data rate for the system. To this end, we propose a deep actor-critic reinforcement learning (DACRL) algorithm to respectively model the policy function and value function for the actor and critic of the agent (i.e., the CBS), in which the actor can learn about system dynamics by interacting with the environment. Meanwhile, the critic can evaluate the action taken such that the CBS can optimally assign power and bandwidth to the CUs when the training phase finishes. Numerical results validate the superior performance of the proposed scheme, compared with other conventional schemes.

1. Introduction

Fourth-generation (4G) systems have recently reached maturity and are evolving into fifth-generation (5G) systems, where only limited amounts of new spectrum can be utilized to meet the stringent demands of users. However, critical challenges will come from explosive growth in devices and data volumes, which requires more efficient exploitation of the valuable spectrum. Therefore, non-orthogonal multiple access (NOMA) is one of the potential candidates for 5G and upcoming cellular network generations [1,2,3].
According to NOMA principles, multiple users are allowed to share time and spectrum resources in the same spatial layer via power-domain multiplexing, in contrast to conventional orthogonal multiple access (OMA) techniques such as frequency-division multiple access (FDMA) and time-division multiple access (TDMA) [4]. Interuser interference can be alleviated by performing successive interference cancellation (SIC) on the receiver side. Considerable research has been aimed at sum-rate maximization, and the results show that higher spectral efficiency (SE) can be obtained by using NOMA, compared to baseline OMA schemes [5,6,7,8]. Zeng et al. [5] investigated a multiple-user scenario in which users are clustered and share the same transmit beamforming vector. Di et al. [6] proposed a joint sub-channel assignment and power allocation scheme to maximize the weighted total sum rate of the system while adhering to a user fairness constraint. Timotheou et al. [7] studied a decoupled problem of user clustering and power allocation in NOMA systems, in which the proposed user clustering approach is based on an exhaustive search with high complexity. Liang et al. [8] studied solutions for user pairing, and investigated the power allocation problem by using NOMA in cognitive radio (CR) networks.
Nowadays, energy consumption for wireless communications is becoming a major social and economic issue, especially given the explosive growth in data traffic. However, limited efforts have been devoted to the energy-efficient resource allocation problem in NOMA-enabled systems [9,10,11]. The authors in [9] maximized energy efficiency subject to a minimum required data rate for each user, which leads to a nonconvex fractional programming problem. Meanwhile, a power allocation solution aiming to maximize energy efficiency under users' quality-of-service requirements was investigated in [10]. Fang et al. [11] proposed a gradient-based binary-search power allocation approach for downlink NOMA systems, but it requires high complexity. NOMA was also applied to future machine-to-machine (M2M) communications in [12], and it was shown that the outage probability of the system can be improved compared with OMA. Additionally, by jointly considering beamforming, user scheduling, and power allocation, the system performance of millimeter wave (mmWave) networks was studied [13].
On the other hand, CR, one of the promising techniques to improve SE, has been extensively investigated for decades. In CR networks, cognitive users (CUs) can utilize the licensed spectrum bands of the primary users (PUs) as long as the interference caused by the CUs is tolerable [14,15,16]. Goldsmith et al. [17] proposed three operation models (opportunistic spectrum access, spectrum sharing, and sensing-based enhanced spectrum sharing) to exploit the CR technique in practice. It is conceivable that the combination of CR with NOMA technologies is capable of further boosting the SE in wireless communication systems. Therefore, the performance of spectrum-sharing CR combined with NOMA has been analyzed in many studies [18,19].
Along with the rapid proliferation of wireless communication applications, most battery-limited devices become useless once their battery power is depleted. As one of the remedies, energy harvesting (EH) exploits ambient energy resources to replenish batteries, such as solar energy [20], radio frequency (RF) signals [21], and both non-RF and RF sources [22]. Among the various kinds of renewable energy resources, solar power has been considered one of the most effective energy sources for wireless devices. However, solar power density highly depends on the environmental conditions and may vary over time. Thus, it is critical to establish proper approaches to efficiently utilize harvested energy in wireless communication systems.
Many early studies on NOMA applications have mainly focused on the downlink scenario. However, there are fewer contributions investigating uplink NOMA, where an evolved NodeB (eNB) has to receive different levels of transmitted power from all user devices using NOMA. Zhang et al. [23] proposed a novel power control scheme, and the outage probability of the system was derived. In addition, the user-pairing approach was studied with several predefined power allocation schemes in NOMA communication systems [24], in which Internet of Things (IoT) devices first harvest energy from BS transmissions during the harvesting phase and then utilize the harvested energy to transmit data using the NOMA technique during the transmission phase. The pricing and bandwidth allocation problem in terms of energy efficiency in heterogeneous networks was investigated in [25]. In addition, the authors in [20] proposed joint resource allocation and transmission mode selection to maximize the secrecy rate in cognitive radio networks. Nevertheless, most of the existing work on resource allocation assumes that the amount of harvested energy is known, or that traffic loads are predictable, which is hard to obtain in practical wireless networks.
Since information regarding network dynamics (e.g., the harvested energy distribution or the primary user's behavior) is sometimes unavailable in cognitive radio systems, researchers usually formulate optimization problems within the framework of a Markov decision process (MDP) [20,22,26,27]. Reinforcement learning is one of the potential approaches to obtaining the optimal solution of an MDP problem by interacting with the environment, without prior information about the network dynamics and without any supervision [28,29,30]. However, it is a big issue for reinforcement learning to deal with large-state-space optimization problems. For this reason, deep reinforcement learning (DRL) is being investigated extensively these days for wireless communication systems, where deep neural networks (DNNs) work as function approximators and are utilized to learn the optimal policy [31,32,33]. Meng et al. proposed a deep reinforcement learning method for a joint spectrum sensing and power control problem in a cognitive small cell [31]. In addition, deep Q-learning was studied for a wireless gateway that is able to derive the optimal policy to maximize throughput in cognitive radio networks [32]. Zhang et al. [33] proposed an asynchronous advantage actor-critic-based deep scheme to optimize spectrum-sharing efficiency and guarantee the QoS requirements of PUs and CUs.
To the best of our knowledge, there has been little research into resource allocation using deep reinforcement learning under a non-RF energy-harvesting scenario in uplink cognitive radio networks. Thus, we propose a deep actor-critic reinforcement learning framework for efficient joint power and bandwidth allocation by adopting hybrid NOMA/OMA in uplink cognitive radio networks (CRNs). In these networks, solar energy-powered CUs are assigned the proper transmission power and bandwidth to transmit data to a cognitive base station in every time slot in order to maximize the long-term data transmission rate of the system. Specifically, the main contributions of this paper are as follows.
  • We study a model of a hybrid NOMA/OMA uplink cognitive radio network adopting energy harvesting at the CUs, where solar energy-powered CUs opportunistically use the licensed channel of the primary network to transmit data to a cognitive base station using NOMA/OMA techniques. Besides that, a user-pairing algorithm is adopted such that we can assign orthogonal frequency bands to each NOMA group after pairing. We take power and bandwidth allocation into account such that the transmission power and bandwidth are optimally utilized by each CU under energy constraints and environmental uncertainty. The system is assumed to work on a time-slotted basis.
  • We formulate the problem of long-term data transmission rate maximization as the framework of a Markov decision process (MDP), and we obtain the optimal policy by adopting a deep actor-critic reinforcement learning (DACRL) framework under a trial-and-error learning algorithm. More specifically, we use DNNs to approximate the policy function and the value function for the actor and critic components, respectively. As a result, the cognitive base station can allocate the appropriate transmission power and bandwidth to the CUs by directly interacting with the environment, such that the system reward can be maximized in the long run by using the proposed algorithm.
  • Lastly, extensive numerical results are provided to assess the proposed algorithm performance through diverse network parameters. The simulation results of the proposed scheme are shown to be superior to conventional schemes where decisions on transmission power and bandwidth allocation are taken without long-term considerations.
The rest of this paper is structured as follows. The system model is presented in Section 2. We introduce the problem formulation in Section 3, and we describe the deep actor-critic reinforcement learning scheme for resource allocation in Section 4. The simulation results and discussions are in Section 5. Finally, we conclude the paper in Section 6.

2. System Model

We consider an uplink CRN that consists of a cognitive base station (CBS), a primary base station (PBS), multiple primary users, and $2M$ cognitive users, as illustrated in Figure 1. Each CU is outfitted with a single antenna to transmit data to the CBS, and each is equipped with an energy-harvesting component (i.e., solar panels). The PBS and PUs have the license to use the primary channel at will. However, they do not always have data to transmit on the primary channel. Meanwhile, the CBS and the CUs can opportunistically utilize the primary channel by adopting a hybrid NOMA/OMA technique when the channel is sensed to be free. To this end, the CBS divides the set of CUs into pairs according to Algorithm 1, where the CU with the highest channel gain is coupled with the CU with the lowest channel gain, and one of the available subchannels is assigned to each pair. More specifically, the CUs are arranged into $M$ NOMA groups, and the primary channel is divided into multiple subchannels to apply hybrid NOMA/OMA for the transmissions between the CUs and the CBS, with $\mathcal{G} = \{G_1, G_2, G_3, \ldots, G_M\}$ denoting the set of NOMA groups. Additionally, the $M$ NOMA groups are assigned to $M$ orthogonal subchannels, $\mathcal{SC} = \{SC_1, SC_2, SC_3, \ldots, SC_M\}$, of the primary channel such that the CUs in each NOMA group transmit on the same subchannel and do not interfere with the other groups. In this paper, successive interference cancellation (SIC) [34] is applied at the CBS for decoding the received signals transmitted from the CUs. Moreover, we assume that the CUs always have data to transmit, and that the CBS has complete channel state information (CSI) for all the CUs.
Algorithm 1 User-pairing Algorithm
1: Input: channel gains, number of groups $M$, number of CUs $2M$.
2: Sort the channel gains of all CUs in descending order: $g_1 \ge g_2 \ge \ldots \ge g_{2M}$.
3: Define the set of channel gains $U = \{g_1, g_2, \ldots, g_{2M}\}$.
4: for $j = 1 : M$
5:     $G_j = \emptyset$
6:     $G_{\max} = \max U$, $G_{\min} = \min U$
7:     $G_j = G_j \cup \{G_{\max}\} \cup \{G_{\min}\}$
8:     $U = U \setminus \{G_{\max}\} \setminus \{G_{\min}\}$
9: end for
10: Output: set of CU pairs.
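To make the pairing rule concrete, a minimal Python sketch of Algorithm 1 is given below. The function name and input format are ours, and the channel gains are assumed to be given on a linear scale; this is an illustration rather than the authors' implementation.

```python
def pair_users(channel_gains):
    """Pair the strongest remaining CU with the weakest remaining CU (Algorithm 1)."""
    # Sort CU indices by channel gain in descending order (Algorithm 1, step 2)
    order = sorted(range(len(channel_gains)),
                   key=lambda i: channel_gains[i], reverse=True)
    pairs = []
    while len(order) >= 2:
        strongest = order.pop(0)    # CU with the highest remaining gain
        weakest = order.pop(-1)     # CU with the lowest remaining gain
        pairs.append((strongest, weakest))
    return pairs

# Example with 2M = 6 CUs (indices 0..5)
print(pair_users([0.9, 0.1, 0.5, 0.7, 0.3, 0.05]))
# -> [(0, 5), (3, 1), (2, 4)]
```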
The network system operation is illustrated in Figure 2. In particular, at the beginning of a time slot, all CUs concurrently perform spectrum sensing for a duration $\tau_{ss}$ and report their local results to the CBS. Based on these sensing results, the CBS first makes the global sensing decision as to whether the primary channel is busy or not, following the combination rule [35,36], and then allocates power and bandwidth to all CUs for uplink data transmission. As a consequence, according to the allocated power and bandwidth of the NOMA groups, the CUs in each NOMA group can transmit their data to the CBS through the same subchannel, without causing interference with other groups, within the duration $\tau_{Tr} = T_{tot} - \tau_{ss}$, where $T_{tot}$ is the total time slot duration. Information regarding the remaining energy in all the CUs is reported to the CBS at the end of each time slot. Each data transmission session of the CUs may take place over more than one time slot, until all their data have been transmitted successfully.
During data transmission, the received composite signal at the CBS on subchannel $SC_m$ is given by
$y_m(t) = \sqrt{P_{1m}(t)}\, x_{1m}(t)\, h_{1m} + \sqrt{P_{2m}(t)}\, x_{2m}(t)\, h_{2m} + \omega_m,$ (1)
where $P_{im}(t) = e^{tr}_{im}(t)/\tau_{Tr}$, $i \in \{1,2\}$, $m \in \{1,2,\ldots,M\}$, is the transmission power of $CU_{im}$ in NOMA group $G_m$, in which $e^{tr}_{im}(t)$ is the transmission energy assigned to $CU_{im}$ in time slot $t$; $x_{im}(t)$ denotes the transmit signal of $CU_{im}$ in time slot $t$, with $\mathbb{E}\{|x_{im}(t)|^2\} = 1$; $\omega_m$ is the additive white Gaussian noise (AWGN) at the CBS on subchannel $SC_m$, with zero mean and variance $\sigma^2$; and $h_{im}$ is the channel coefficient between $CU_{im}$ and the CBS. The overall received signal at the CBS in time slot $t$ is given by
$y(t) = \sum_{m=1}^{M} y_m(t).$ (2)
The received signals at the CBS on the different sub-channels are independently retrieved from the composite signal $y_m(t)$ using the SIC technique. In particular, the signal of the CU with the higher channel gain is decoded first and then removed from the composite signal at the CBS, in a successive manner. The signal of the CU with the lower channel gain on the sub-channel is treated as noise when decoding the signal of the CU with the higher channel gain. We assume perfect SIC implementation at the CBS. The achievable transmission rates for the CUs in NOMA group $G_m$ are
$R_{1m}(t) = \dfrac{\tau_{Tr}}{T_{tot}} \times b_m(t) \times \log_2\left(1 + \dfrac{P_{1m}(t)\, g_{1m}}{P_{2m}(t)\, g_{2m} + \sigma^2}\right), \quad R_{2m}(t) = \dfrac{\tau_{Tr}}{T_{tot}} \times b_m(t) \times \log_2\left(1 + \dfrac{P_{2m}(t)\, g_{2m}}{\sigma^2}\right),$ (3)
where $b_m(t)$ is the amount of bandwidth allocated to subchannel $SC_m$ in time slot $t$, $g_{im} = |h_{im}|^2$ is the channel gain of $CU_{im}$ on subchannel $m$, and $g_{1m} \ge g_{2m}$. Since the channel gain of $CU_{1m}$, $g_{1m}$, is higher, $CU_{1m}$ has higher priority for decoding. Consequently, the signal of $CU_{1m}$ is decoded first by treating the signal of $CU_{2m}$ as interference. Next, the signal of $CU_{1m}$ is removed from $y_m(t)$, and the signal of $CU_{2m}$ is decoded interference-free. The sum achievable transmission rate of NOMA group $G_m$ can be calculated as
$R_m(t) = R_{1m}(t) + R_{2m}(t).$ (4)
The sum achievable transmission rate at the CBS can be given as follows:
$R(t) = \sum_{m=1}^{M} R_m(t).$ (5)
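As a numerical illustration of Equations (3)–(5), the helper below computes the sum rate of the NOMA pairs under perfect SIC. The function and variable names, as well as the sample values, are ours and are only meant to show how the formulas combine.

```python
import numpy as np

def pair_rate(p1, p2, g1, g2, bw, noise_var, tau_tr, t_tot):
    """Sum achievable rate of one NOMA pair under perfect SIC (Eqs. (3)-(4)).

    p1, p2    : transmission powers of CU_1m and CU_2m (assumes g1 >= g2)
    g1, g2    : channel gains |h_1m|^2 and |h_2m|^2
    bw        : bandwidth b_m(t) allocated to subchannel SC_m
    noise_var : AWGN variance sigma^2
    """
    scale = (tau_tr / t_tot) * bw
    r1 = scale * np.log2(1 + p1 * g1 / (p2 * g2 + noise_var))  # decoded first
    r2 = scale * np.log2(1 + p2 * g2 / noise_var)              # decoded interference-free
    return r1 + r2                                             # Eq. (4)

# Sum rate over M = 2 illustrative groups (Eq. (5))
groups = [(5e-5, 5e-5, 1e-2, 1e-3), (5e-5, 5e-5, 3e-3, 5e-4)]   # (p1, p2, g1, g2)
R = sum(pair_rate(p1, p2, g1, g2, bw=0.5, noise_var=1e-11,
                  tau_tr=0.198, t_tot=0.2) for p1, p2, g1, g2 in groups)
print(R)
```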

Energy Arrival and Primary User Models

In this paper, the CUs have a finite-capacity battery, $E_{bat}$, which can be constantly recharged by the solar energy harvesters. Therefore, the CUs can perform their other operations and harvest solar energy simultaneously. For many reasons (such as the weather, the season, and the time of day), the energy harvested from solar resources may vary in practice. Herein, we take into account a practical case where the harvested energy of $CU_i$ in NOMA group $G_m$ (denoted as $e^h_{im}$) follows a Poisson distribution with mean value $\xi_{avg}$, as studied in [37]. The amount of energy that $CU_{im}$ can harvest during time slot $t$ is given as $e^h_{im}(t) \in \{e^h_1, e^h_2, \ldots, e^h_{\upsilon}\}$, where $0 < e^h_1 < e^h_2 < \ldots < e^h_{\upsilon} < E_{bat}$. The cumulative distribution function can be given as follows:
$F\left(e^h_{im}(t); \xi_{avg}\right) = \sum_{k=0}^{\lfloor e^h_{im}(t) \rfloor} \dfrac{e^{-\xi_{avg}}\, \xi_{avg}^{k}}{k!}.$ (6)
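For simulation purposes, the Poisson arrival model can be sampled directly. The sketch below uses the mean value and battery capacity from Table 1; clipping a single arrival to the battery capacity is our simplification of the finite energy-level set described above.

```python
import numpy as np

rng = np.random.default_rng(0)
xi_avg = 5.0    # mean harvested energy per time slot (micro-joules), Table 1
E_bat = 30.0    # battery capacity (micro-joules), Table 1

def harvest_energy(num_slots):
    """Sample e^h_im(t) for num_slots time slots from a Poisson distribution
    with mean xi_avg, clipped so no single arrival exceeds the battery capacity."""
    return np.minimum(rng.poisson(lam=xi_avg, size=num_slots), E_bat)

print(harvest_energy(5))   # e.g., an array of integer energy amounts in micro-joules
```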
Herein, we use a two-state discrete-time Markov process to model the state of the primary channel, as depicted in Figure 3. We assume that the state of the primary channel does not change during the time slot duration, $T_{tot}$, and that the primary channel can switch states between two adjacent time slots. The state transition probabilities between two time slots are denoted as $P_{ij}$, $i, j \in \{F, B\}$, in which $F$ stands for the free state and $B$ stands for the busy state. In this paper, we consider cooperative spectrum sensing, in which all CUs collaboratively detect spectrum holes based on an energy detection method and send their local sensing results to the CBS. Subsequently, the final decision on the primary users' activities is attained by combining the local sensing data at the CBS [36]. The performance of the cooperative sensing scheme can be evaluated based on the probability of detection, $P_d$, and the probability of false alarm, $P_f$. $P_d$ is the probability that the PU's presence is correctly detected (i.e., the primary channel is actually used by the PUs). Meanwhile, $P_f$ is the probability that the PU is detected to be active when it is actually inactive (i.e., the sensing result says the primary channel is busy, but it is actually free).
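The two-state Markov channel and the imperfect cooperative sensing outcome can be emulated with a few lines; the sketch below uses the transition and detection probabilities of Table 1 and is only a simulation aid, not part of the proposed algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
P_FF, P_BF = 0.8, 0.2   # Pr(free -> free) and Pr(busy -> free), Table 1
P_d, P_f = 0.9, 0.1     # detection and false-alarm probabilities, Table 1

def next_channel_state(is_free):
    """Advance the primary channel by one time slot following the Markov chain."""
    p_free = P_FF if is_free else P_BF
    return rng.random() < p_free

def global_sensing_free(is_free):
    """Return True if the CBS's combined sensing decision declares the channel free."""
    if is_free:
        return rng.random() >= P_f   # false alarm occurs with probability P_f
    return rng.random() >= P_d       # misdetection occurs with probability 1 - P_d

state = True   # start from the free state
for _ in range(3):
    print(state, global_sensing_free(state))
    state = next_channel_state(state)
```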

3. Long-Term Transmission Rate Maximization Problem Formulation

In this section, we aim to maximize the long-term data transmission rate for uplink NOMA/OMA. The $2M$ users in the CRN are grouped into pairs according to their channel gains, as described in Algorithm 1. After user pairing, the joint power and bandwidth allocation problem can be formulated as follows:
$a^*(t) = \arg\max_{a(t)} \sum_{k=t}^{\infty} \sum_{m=1}^{M} R_m(k) \quad \text{s.t.} \quad 0 \le e^{tr}_{im} \le e^{tr}_{max}, \quad \sum_{m=1}^{M} b_m(t) = W,$ (7)
where $a(t) = (\mathbf{b}(t), \boldsymbol{\varepsilon}(t))$ represents the action that the CBS assigns to the CUs in time slot $t$; $\mathbf{b}(t) = [b_1(t), b_2(t), \ldots, b_M(t)]$ indicates a vector of the allocated bandwidth portions assigned to the corresponding sub-channels, where $b_m(t)$, with $\sum_{m=1}^{M} b_m(t) = W$, is the bandwidth amount assigned to the $m$-th sub-channel; $\boldsymbol{\varepsilon}(t) = [e^{tr}_{11}(t), e^{tr}_{21}(t), e^{tr}_{12}(t), e^{tr}_{22}(t), \ldots, e^{tr}_{1M}(t), e^{tr}_{2M}(t)]$ refers to a transmission energy vector of the CUs, where $e^{tr}_{im}(t) \in \{0, e^{tr}_1, e^{tr}_2, \ldots, e^{tr}_{max}\}$ is the transmission energy value for $CU_{im}$, and $e^{tr}_{max}$ represents the upper bound on the transmission energy of each CU in time slot $t$.
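Because the transmission-energy levels and the bandwidth split both come from finite sets, the action space can be enumerated explicitly. The sketch below follows the energy levels of Table 1, while the bandwidth grid is an assumed example; the paper does not specify the exact discretization of $\mathbf{b}(t)$.

```python
import itertools

ENERGY_LEVELS = [0.0, 10.0, 20.0]   # e^tr candidates in micro-joules (Table 1)
BW_SPLITS = [(0.5, 0.3, 0.2), (1/3, 1/3, 1/3), (0.2, 0.3, 0.5)]  # assumed fractions of W
NUM_CUS, W = 6, 1.0                 # 2M = 6 CUs, total bandwidth W (Table 1)

def build_action_space():
    """Enumerate actions a(t) = (b(t), eps(t)) satisfying sum_m b_m(t) = W."""
    actions = []
    for split in BW_SPLITS:
        bandwidths = tuple(W * f for f in split)
        for energies in itertools.product(ENERGY_LEVELS, repeat=NUM_CUS):
            actions.append((bandwidths, energies))
    return actions

print(len(build_action_space()))    # 3 * 3**6 = 2187 candidate actions
```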

4. Deep Reinforcement Learning-Based Resource Allocation Policy

In this section, we first reformulate the joint power and bandwidth allocation problem, which aims to maximize the long-term data transmission rate of the system, as the framework of an MDP. Then, we apply a DRL approach to solve the problem, in which the agent (i.e., the CBS) learns the optimal resource policy via trial-and-error interactions with the environment. One of the disadvantages of reinforcement learning is that high computational costs can be imposed by the long learning process in a system with a large state space and action space. However, the proposed scheme requires less computational overhead by adopting deep neural networks, compared to other algorithms such as value iteration-based dynamic programming in the partially observable Markov decision process (POMDP) framework [20], in which the transition probability of the energy arrival is required to obtain the solution. Thus, compared to the POMDP scheme, the complexity in formulation and computation can be relieved regardless of the dynamic properties of the environment by using the proposed scheme.
Furthermore, the advantage of a deep reinforcement learning scheme over the POMDP scheme is that the unknown harvested-energy distribution can be estimated to create the optimal policy by interacting with the environment over the time horizon. In addition, the proposed scheme can work effectively in a system with large state and action spaces by adopting deep neural networks, whereas other reinforcement learning schemes, such as Q-learning [38] and actor-critic learning [39], might not be appropriate for such systems. In the proposed scheme, deep neural networks are trained to obtain the optimal policy, under which the reward of the system converges to the optimal value. Then, the system can choose an optimal action in every state according to the policy learned in the training phase, without re-training. Thus, deep actor-critic reinforcement learning is more applicable to the wireless communication system.

4.1. Markov Decision Process

Generally, the purpose of reinforcement learning is for the agent to learn how to map each system state to an optimal action through a trial-and-error learning process. In this way, the accumulated sum of rewards can be maximized after a number of training time slots. Figure 4 illustrates traditional reinforcement learning via agent–environment interaction. In particular, the agent observes the system state and then chooses an action at the beginning of a time slot. After that, the system receives the corresponding reward at the end of the time slot and transitions to the next state based on the performed action. The system is then updated and goes into the next interaction between agent and environment.
We denote the state space and action space of the system in this paper as $\mathcal{S}$ and $\mathcal{A}$, respectively; $s(t) = (\mu(t), \mathbf{e}^{re}(t)) \in \mathcal{S}$ represents the state of the network in time slot $t$, where $\mu(t)$ is the probability (belief) that the primary channel is free in that time slot, and $\mathbf{e}^{re}(t) = [e^{re}_{11}(t), e^{re}_{21}(t), e^{re}_{12}(t), e^{re}_{22}(t), \ldots, e^{re}_{1M}(t), e^{re}_{2M}(t)]$ denotes a vector of the remaining energy of the CUs, where $0 \le e^{re}_{im} \le E_{bat}$ represents the remaining energy value of $CU_{im}$. The action at the CBS is denoted as $a(t) = (\mathbf{b}(t), \boldsymbol{\varepsilon}(t)) \in \mathcal{A}$. In this paper, we define the reward as the sum data rate of the system, as presented in Equation (5).
The decision-making process can be expressed as follows. At the beginning of time slot $t$, the agent observes the state, $s(t) \in \mathcal{S}$, from information about the environment, and then chooses action $a(t) \in \mathcal{A}$ following a stochastic policy, $\pi(a|s) = \Pr(a(t) = a \mid s(t) = s)$, which maps the environment state to the probability of taking an action. In this work, the network agent (i.e., the CBS) determines the transmission power for each CU and decides whether to allocate a bandwidth portion to each NOMA group in each time slot. Then, the CUs perform their operations (transmit data or stay silent) according to the action assigned by the CBS. Afterward, the instant reward, $R(t)$, which is defined in Equation (5), is fed back to the agent, and the environment transitions to the next state, $s(t+1)$. At the end of the time slot, the CUs report the current remaining energy level of each CU to the CBS for network management. In the following, we describe how the belief and the remaining energy are updated based on the actions assigned at the CBS.
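The interaction loop of Figure 4 maps naturally onto a small environment interface. The class below only fixes the shapes of the state, action, and reward; the class and method names are ours, and the transition logic is left as a placeholder to be filled in with Equations (8)–(15).

```python
import numpy as np

class HybridNomaOmaEnv:
    """Skeleton environment for the MDP described above.

    State  s(t) = (belief mu(t), remaining energies e^re of all 2M CUs).
    Action a(t) = (bandwidth vector b(t), transmission-energy vector eps(t)).
    Reward R(t) = sum data rate of Eq. (5), or 0 in the silent / collision cases.
    """

    def __init__(self, num_cus=6, e_bat=30.0, mu0=0.5):
        self.num_cus, self.e_bat = num_cus, e_bat
        self.state = np.concatenate(([mu0], np.full(num_cus, e_bat)))

    def step(self, action):
        # Placeholder: apply Eqs. (8)-(15) depending on the sensing result
        # and the observation (Phi_1 or Phi_2), then return (s(t+1), R(t)).
        next_state, reward = self.state, 0.0
        return next_state, reward

env = HybridNomaOmaEnv()
print(env.state.shape)   # (2M + 1,), which matches the critic's input size
```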

4.1.1. Silent Mode

The global sensing decision shows that the primary channel is busy in the current time slot, and thus, the CBS trusts this result and has all CUs stay silent. As a consequence, there is no reward in this time slot, i.e., $R(t) = 0$. The belief in current time slot $t$ can be calculated according to Bayes' rule [40], as follows:
$\mu(t) = \dfrac{\mu(t) P_f}{\mu(t) P_f + \left(1 - \mu(t)\right) P_d}.$ (8)
Belief $\mu(t+1)$ for the next time slot is updated as follows:
$\mu(t+1) = \mu(t) P_{FF} + \left(1 - \mu(t)\right) P_{BF}.$ (9)
The remaining energy of $CU_{im}$ for the next time slot is updated as
$e^{re}_{im}(t+1) = \min\left(e^{re}_{im}(t) + e^h_{im}(t) - e_{ss},\; E_{bat}\right),$ (10)
where $e_{ss}$ is the energy consumed by the spectrum sensing process.
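Equations (8)–(10) translate into a few lines of code; below is a sketch under the notation above, with parameter values from Table 1 and a function name chosen by us.

```python
def silent_mode_update(mu, e_re, e_h, P_f=0.1, P_d=0.9,
                       P_FF=0.8, P_BF=0.2, e_ss=1.0, E_bat=30.0):
    """Belief and battery updates when the CBS keeps all CUs silent (R(t) = 0)."""
    # Bayes' rule given the 'busy' sensing outcome (Eq. (8))
    mu_post = mu * P_f / (mu * P_f + (1 - mu) * P_d)
    # One-step prediction through the channel Markov chain (Eq. (9))
    mu_next = mu_post * P_FF + (1 - mu_post) * P_BF
    # Battery update: harvested energy minus sensing cost, capped at E_bat (Eq. (10))
    e_re_next = [min(e + h - e_ss, E_bat) for e, h in zip(e_re, e_h)]
    return mu_next, e_re_next

print(silent_mode_update(0.5, [20.0, 25.0], [5.0, 3.0]))   # -> (0.26, [24.0, 27.0])
```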

4.1.2. Transmission Mode

The global sensing decision indicates that the primary channel is free in the current time slot, and then, the CBS assigns transmission power levels to the CUs for transmitting their data to the CBS. We assume that the data of the CUs will be successfully decoded if the primary channel is actually free; otherwise, no data can be retrieved due to collisions between the signals of the PUs and CUs. In this case, there are two possible observations, as follows.
Observation 1 ($\Phi_1$): All data are successfully received and decoded at the CBS at the end of the time slot. This result means the primary channel was actually free during this time slot, and the global sensing result was correct. The total reward for the network is calculated as
$R_s(t \mid \Phi_1) = \sum_{m=1}^{M} R_m(t),$ (11)
where the immediate data transmission rate of NOMA group $G_m$, $R_m(t)$, can be computed with Equation (4). Belief $\mu(t+1)$ for the next time slot is updated as
$\mu(t+1) = P_{FF}.$ (12)
The remaining energy in $CU_{im}$ for the next time slot will be
$e^{re}_{im}(t+1) = \min\left(e^{re}_{im}(t) + e^h_{im}(t) - e_{ss} - e^{tr}_{im}(t),\; E_{bat}\right),$ (13)
where $e^{tr}_{im}(t)$ denotes the transmission energy assigned to $CU_{im}$ in time slot $t$.
Observation 2 ($\Phi_2$): The CBS cannot successfully decode the data from the CUs at the end of time slot $t$ due to collisions between the signals of the CUs and the PUs. This implies that the primary channel was occupied and misdetection happened. In this case, no reward is achieved, i.e., $R_s(t \mid \Phi_2) = 0$. Belief $\mu(t+1)$ for the next time slot can be updated as
$\mu(t+1) = P_{BF}.$ (14)
The remaining energy in $CU_{im}$ for the next time slot is updated by
$e^{re}_{im}(t+1) = \min\left(e^{re}_{im}(t) + e^h_{im}(t) - e_{ss} - e^{tr}_{im}(t),\; E_{bat}\right).$ (15)
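The two observation branches of transmission mode (Equations (11)–(15)) can be folded into a single update; again a sketch under the notation above, with the function name and argument order chosen by us.

```python
def transmission_mode_update(e_re, e_h, e_tr, success, sum_rate,
                             P_FF=0.8, P_BF=0.2, e_ss=1.0, E_bat=30.0):
    """Reward, belief, and battery updates after the CUs transmit.

    success  : True for observation Phi_1 (all data decoded), False for Phi_2.
    sum_rate : sum rate of Eq. (5), used as the reward under Phi_1.
    """
    reward = sum_rate if success else 0.0     # Eq. (11), or zero reward under Phi_2
    mu_next = P_FF if success else P_BF       # Eq. (12) or Eq. (14)
    # Battery update including the transmission cost (Eqs. (13) and (15))
    e_re_next = [min(e + h - e_ss - tr, E_bat)
                 for e, h, tr in zip(e_re, e_h, e_tr)]
    return reward, mu_next, e_re_next

print(transmission_mode_update([24.0, 27.0], [5.0, 5.0], [10.0, 10.0],
                               success=True, sum_rate=2.1))
```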
In reinforcement learning, the agent is capable of improving the policy based on the recursive lookup table of state-value functions. The state-value function, $V^{\pi}(s)$, is defined as the maximum expected value of the accumulated reward starting from current state $s$ under the given policy, which is written as [28]:
$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t} R(t) \,\middle|\, s(t) = s, \pi\right],$ (16)
where $\mathbb{E}[\cdot]$ denotes the expectation; $\gamma \in [0, 1)$ is the discount factor, which can affect the agent's decisions on myopic or foresighted operations; and $\pi$ is the stochastic policy, which maps environment state space $\mathcal{S}$ to action space $\mathcal{A}$, $\pi(a|s) = \Pr(a(t) = a \mid s(t) = s)$. The objective of the resource allocation problem is to find the optimal policy $\pi^*$ that provides the maximum discounted value function in the long run, which satisfies the Bellman equation as follows [41]:
$\pi^*(a|s) = \arg\max_{\pi} V^{\pi}(s).$ (17)
The policy can be explored by using an $\epsilon$-greedy policy, in which a random action is chosen with probability $\epsilon$, or an action is selected based on the current policy with probability $(1-\epsilon)$, during the training process [42]. As a result, the problem of joint power and bandwidth allocation in Equation (7) can be rewritten as Equation (17), and the deep actor-critic reinforcement learning solution is presented in the following section.

4.2. Deep Actor-Critic Reinforcement Learning Algorithm

The maximization problem in Equation (17) can be solved by using the actor-critic method, which combines the value-based method [43] and the policy-based method [44]. The actor-critic structure involves two neural networks (actor and critic) and an environment, as shown in Figure 5. The actor determines the action according to the policy, and the critic evaluates the selected actions based on value functions and the instant rewards that are fed back from the environment. The input of the actor is the state of the network, and the output is the policy, which directly affects how the agent chooses the optimal action. The output of the critic is a state-value function, $V^{\pi}(s)$, which is used to calculate the temporal difference (TD) error. Thereafter, the TD error is used to update both the actor and the critic.
Herein, both the policy function in the actor and the value function in the critic are approximated with parameter vectors $\theta$ and $\omega$, respectively, by two sequential deep neural network models. Both the value function parameter $\omega$ and the policy parameter $\theta$ are stochastically initialized and are updated constantly by the critic and the actor, respectively, during the training process.

4.2.1. The Critic with a DNN

Figure 6 depicts the DNN at the critic, which is composed of an input layer, two hidden layers, and an output layer. The critic network is a feed-forward neural network that evaluates the action taken by the actor. Then, the evaluation of the critic is used by the actor to update its control policy. The input layer of the critic is an environment state, which contains $2M+1$ elements. Each hidden layer is a fully connected layer, which involves $H_C$ neurons and uses the rectified linear unit (ReLU) activation function [45,46] as follows:
$f_{ReLU}(z) = \max(0, z),$ (18)
where $z = \sum_{i=1}^{2M+1} \omega_i s_i(t)$ is the estimated output of the layer before applying the activation function, in which $s_i(t)$ indicates the $i$-th element of the input state, $s(t)$, and $\omega_i$ is the weight for the $i$-th input. The output layer of the DNN at the critic contains one neuron and uses the linear activation function to estimate the state-value function, $V^{\pi}(s)$. In this paper, the value function parameter is optimized by adopting stochastic gradient descent with a back-propagation algorithm to minimize the loss function, defined as the mean squared error, which is computed by
$L(\omega) = \delta^2(t),$ (19)
where $\delta(t)$ is the TD error between the target value and the estimated value, which is given by
$\delta(t) = \mathbb{E}\left[R(t) + \gamma V_{\omega}(s(t+1)) - V_{\omega}(s(t))\right],$ (20)
and is utilized to evaluate the selected action $a(t)$ of the actor. If the value of $\delta(t)$ is positive, the tendency to choose action $a(t)$ in the future, when the system is in the same state, will be strengthened; otherwise, it will be weakened. The critic parameter can be updated in the direction of the gradient, as follows:
$\Delta\omega = \alpha_c\, \delta(t)\, \nabla_{\omega} V^{\pi}_{\omega}(s(t)),$ (21)
where $\alpha_c$ is the learning rate of the critic.
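A Keras sketch of the critic in Figure 6 is shown below. The layer sizes follow Section 5 (two hidden layers of $H_C = 24$ ReLU units and a linear scalar output); the use of the Adam optimizer and the mean-squared-error loss wrapper is our assumption for illustration.

```python
import tensorflow as tf

def build_critic(state_dim, hidden_units=24, lr_c=0.005):
    """Feed-forward value network V_omega(s): (2M+1)-dimensional state -> scalar value."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation='relu',
                              input_shape=(state_dim,)),         # hidden layer 1
        tf.keras.layers.Dense(hidden_units, activation='relu'),  # hidden layer 2
        tf.keras.layers.Dense(1, activation='linear'),           # V(s)
    ])
    # Squared TD error as in Eq. (19); the optimizer choice is an assumption.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_c), loss='mse')
    return model

critic = build_critic(state_dim=7)   # 2M + 1 = 7 for M = 3 groups
```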

4.2.2. The Actor with a DNN

The DNN in the actor is shown in Figure 7; it includes an input layer, two hidden layers, and an output layer. The input layer of the actor is the current state of the environment. There are two hidden layers in the actor, each comprising $H_A$ neurons. The output layer of the actor provides the probabilities of selecting actions in a given state. Furthermore, the output layer utilizes the soft-max activation function [28] to compute the policy for each action in the action space, which is given as:
$\pi_{\theta}(a|s) = \dfrac{e^{z_a}}{\sum_{a' \in \mathcal{A}} e^{z_{a'}}},$ (22)
where $z_a$ is the estimated preference value for choosing action $a$. In the actor, the policy can be enhanced by optimizing the state-value function as follows:
$J(\pi_{\theta}) = \mathbb{E}\left[V^{\pi}(s)\right] = \sum_{s \in \mathcal{S}} d^{\pi}(s) V^{\pi}(s),$ (23)
where $d^{\pi}(s)$ is the state distribution. Policy parameter $\theta$ can be updated in the direction of gradient ascent to maximize the objective function [39], as follows:
$\Delta\theta = \alpha_a\, \nabla_{\theta} J(\pi_{\theta}),$ (24)
where $\alpha_a$ denotes the learning rate of the actor network, and policy gradient $\nabla_{\theta} J(\pi_{\theta})$ can be computed by using the TD error [47]:
$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}\left[\nabla_{\theta} \log \pi_{\theta}(s, a)\, \delta(t)\right].$ (25)
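Analogously, the actor of Figure 7 maps the state to a soft-max distribution over the discrete action set (two hidden layers of $H_A = 24$ units). The explicit Glorot (Xavier) initializer mirrors the initialization mentioned in Section 5; everything else in the sketch, including the action count, is an illustrative assumption.

```python
import tensorflow as tf

def build_actor(state_dim, num_actions, hidden_units=24):
    """Policy network pi_theta(a|s): state -> probability of each discrete action."""
    init = tf.keras.initializers.GlorotUniform()   # uniform Xavier initialization [49]
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation='relu',
                              kernel_initializer=init, input_shape=(state_dim,)),
        tf.keras.layers.Dense(hidden_units, activation='relu',
                              kernel_initializer=init),
        tf.keras.layers.Dense(num_actions, activation='softmax',
                              kernel_initializer=init),          # Eq. (22)
    ])

actor = build_actor(state_dim=7, num_actions=2187)   # action count from the earlier sketch
```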
It is worth noting that TD error $\delta(t)$ is supplied by the critic. The training procedure of the proposed DACRL approach is summarized in Algorithm 2. In the algorithm, the agent interacts with the environment and learns to select the optimal action in each state. The convergence of the proposed algorithm depends on the number of steps per episode, the number of training episodes, and the learning rates, which are discussed in the following section.
Algorithm 2 The training procedure of the deep actor-critic reinforcement learning algorithm
1: Input: $\mathcal{S}$, $\mathcal{A}$, $\gamma$, $\alpha_a$, $\alpha_c$, $\mathbf{e}^{re}(t)$, $\mu(t)$, $E_{ca}$, $\xi_{avg}$, $T$, $W$, $\epsilon_{min}$, $\epsilon_{max}$, $\epsilon_d$.
2: Initialize the network parameters of the actor and the critic: $\theta$, $\omega$.
3: Initialize $\epsilon = \epsilon_{max}$.
4: for each episode $e = 1, 2, 3, \ldots, L$:
5:     Sample an initial state $s \in \mathcal{S}$.
6:     for each time step $t = 0, 1, 2, 3, \ldots, T-1$:
7:         Observe current state $s(t)$, and estimate state value $V^{\pi}_{\omega}(s(t))$.
8:         Choose an action:
9:             $a(t) = \arg\max_a \pi_{\theta}(a(t)|s(t))$ with probability $(1-\epsilon)$; a random action $a(t) \in \mathcal{A}$ otherwise.
10:        Execute action $a(t)$, and observe the current reward $R(t)$.
11:        Update the epsilon rate: $\epsilon = \max(\epsilon \cdot \epsilon_d, \epsilon_{min})$.
12:        Update the next state $s(t+1)$.
13:        Critic Process:
14:        Estimate the next state value $V^{\pi}_{\omega}(s(t+1))$.
15:        The critic calculates the TD error $\delta(t)$:
16:        if the episode ends at time slot $t$:
17:            $\delta(t) = R(t) - V_{\omega}(s(t))$.
18:        else
19:            $\delta(t) = R(t) + \gamma V_{\omega}(s(t+1)) - V_{\omega}(s(t))$.
20:        end if
21:        Update the parameter of the critic network: $\omega \leftarrow \omega + \Delta\omega$.
22:        Actor Process:
23:        Update the parameter of the actor network: $\theta \leftarrow \theta + \Delta\theta$.
24:     end for
25: end for
26: Output: the final policy $\pi^*_t(a(t)|s(t))$.
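Gluing the pieces together, one inner iteration of Algorithm 2 can be sketched in TensorFlow as a single gradient-tape update that realizes the TD error of Eq. (20), the critic loss of Eq. (19), and the policy-gradient update of Eqs. (24)–(25). The environment and network builders are the illustrative ones sketched earlier; this is not the authors' code.

```python
import tensorflow as tf

gamma, alpha_a, alpha_c = 0.9, 0.001, 0.005      # Table 1
actor_opt = tf.keras.optimizers.Adam(alpha_a)
critic_opt = tf.keras.optimizers.Adam(alpha_c)

def train_step(actor, critic, state, action_idx, reward, next_state, done):
    """One actor-critic update for a single transition (s, a, R, s').

    state, next_state : 1-D numpy arrays of length 2M+1.
    """
    s = tf.convert_to_tensor(state[None, :], dtype=tf.float32)
    s_next = tf.convert_to_tensor(next_state[None, :], dtype=tf.float32)
    with tf.GradientTape() as tape_a, tf.GradientTape() as tape_c:
        v = critic(s)[0, 0]
        v_next = tf.stop_gradient(critic(s_next)[0, 0])
        target = reward if done else reward + gamma * v_next   # Algorithm 2, steps 16-20
        delta = target - v                                      # TD error, Eq. (20)
        critic_loss = tf.square(delta)                          # Eq. (19)
        log_prob = tf.math.log(actor(s)[0, action_idx] + 1e-8)
        actor_loss = -log_prob * tf.stop_gradient(delta)        # Eqs. (24)-(25)
    critic_opt.apply_gradients(
        zip(tape_c.gradient(critic_loss, critic.trainable_variables),
            critic.trainable_variables))
    actor_opt.apply_gradients(
        zip(tape_a.gradient(actor_loss, actor.trainable_variables),
            actor.trainable_variables))
    return float(delta)
```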

5. Simulation Results

In this section, we investigate the performance of uplink NOMA systems using our proposed scheme. The simulation results are compared with other myopic schemes [48] (Myopic-UP, Myopic-Random, and Myopic-OMA) in terms of average data transmission rate and energy efficiency. In the myopic schemes, the system only maximizes the reward in the current time slot, and the system bandwidth is allocated to a group only if it has at least one active CU in the current time slot. In particular, with the Myopic-UP scheme, the CBS arranges the CUs into different pairs based on Algorithm 1. In the Myopic-Random scheme, the CBS randomly groups the CUs into pairs. In the Myopic-OMA scheme, the total system bandwidth is divided equally into sub-channels in order to assign them to each active CU without applying user pairing. In the following, we analyze the influence of the network parameters on the schemes through the numerical results.
In this paper, we used Python 3.7 with the TensorFlow deep learning library to implement the DACRL algorithm. Herein, we consider a network with different channel gain values between the CUs and the CBS: $h_1 = 20$ dB, $h_2 = 25$ dB, $h_3 = 30$ dB, $h_4 = 35$ dB, $h_5 = 40$ dB, and $h_6 = 45$ dB, where $h_1, h_2, h_3, h_4, h_5, h_6$ are the channel gains between $CU_1, CU_2, CU_3, CU_4, CU_5, CU_6$ and the CBS, respectively. Two sequential DNNs are utilized to model the value function and the policy function in the proposed algorithm. Each DNN is designed with an input layer, two hidden layers, and an output layer, as described in Section 4. The numbers of neurons in each hidden layer of the value-function DNN in the critic and of the policy-function DNN in the actor are set at $H_C = 24$ and $H_A = 24$, respectively. For the training process, value function parameter $\omega$ and policy parameter $\theta$ are stochastically initialized by using uniform Xavier initialization [49]. The other simulation parameters for the system are shown in Table 1.
We first examine the average transmission rates of the DACRL scheme under different numbers of training iterations, $T$, while the number of episodes, $L$, increases from 1 to 400. We obtained the results by averaging the transmission rate over 20 separate simulation runs, as shown in Figure 8. The curves increase sharply in the first 50 training episodes and then gradually converge to the optimal value. We can see that the agent needs more than 350 episodes to learn the optimal policy at $T = 1000$ iterations per episode. However, as $T$ increases, the algorithm begins to converge faster. For instance, the proposed scheme learns the optimal policy in fewer than 300 episodes when $T = 2000$. Nevertheless, the training process might take a very long time if each episode uses too many iterations, and the algorithm may even converge to a locally optimal policy. As a result, the number of training iterations per episode and the number of training episodes should be neither too large nor too small. In the rest of the simulations, we set the number of training episodes at 300 and the number of training iterations at 2000.
Figure 9 shows the convergence rate of the proposed scheme according to various values of actor learning rate α a and critic learning rate α c . The figure shows that the reward converges faster with increments in the learning rates. In addition, we can observe that the proposed scheme with actor learning rate α a = 0.001 and critic learning rate α c = 0.005 provides the best performance after 300 episodes. When the learning rates of the actor and the critic increase to α a = 0.01 and α c = 0.005 , respectively, the algorithm converges very fast, but does not bring a good reward due to underfitting. Therefore, we set the actor and critic learning rates at α a = 0.001 and α c = 0.005 , respectively, for the rest of the simulations.
Figure 10 illustrates the average transmission rates under the influence of mean harvested energy. We can see that the average transmission rate of the system increases when the mean value of harvested energy grows. The reason is that with an increase in ξ a v g , the CUs can harvest more solar energy, and thus, the CUs have a greater chance to transmit data to the CBS. In addition, the average transmission rate of the proposed scheme dominates the conventional schemes because the conventional schemes focus on maximizing the current reward, and they ignore the impact of the current decision on the future reward. Thus, whenever the primary channel is free, these conventional schemes allow all CUs to transmit their data by consuming most of the energy in the battery in order to maximize the instant reward. This makes the CUs stay silent in the future due to energy shortages. Although the Myopic-Random scheme had lower performance than the Myopic-UP scheme, it still had greater rewards than Myopic-OMA. This outcome demonstrates the efficiency of the hybrid NOMA/OMA approach, compared with the OMA approach, in terms of average transmission rate.
In Figure 11, the energy efficiency of the schemes is compared with respect to the mean value of the harvested energy. In this paper, we define energy efficiency as the data transmission rate obtained at the CBS over the total energy consumption of the CUs during their operations. We can see that the energy efficiency declines as $\xi_{avg}$ rises. The reason is that when the harvested energy goes up, the CUs can gather more energy for their operations; however, the amount of energy overflowing the CUs' batteries also increases. The curves show that the proposed scheme outperforms the other conventional schemes because the DACRL agent can learn about the dynamic arrival of harvested energy from the environment. Thus, the proposed scheme can make proper decisions in each time slot.
In Figure 12 and Figure 13, we plot the average transmission rate and the energy efficiency, respectively, for different values of noise variance at the CBS. The curves show that system performance notably degrades when the noise variance increases. This is because a higher noise variance degrades the data transmission rate, as shown in Equation (3). As a consequence, energy efficiency also decreases with an increase in noise variance. Across the considered noise variance values at the CBS, the figures verify that the proposed scheme outperforms the myopic schemes.

6. Conclusions

In this paper, we investigated a deep reinforcement learning framework for joint power and bandwidth allocation by adopting both hybrid NOMA/OMA and user pairing in uplink CRNs. The DACRL algorithm was employed to maximize the long-term transmission rate under the energy constraints of the CUs. DNNs were applied to approximate the policy function and the value function such that the algorithm can work in a system with large state and action spaces. The agent of the DACRL can explore the optimal policy by interacting with the environment. As a consequence, the CBS can effectively allocate bandwidth and power to the CUs based on the current network state in each time slot. The simulation results verified the advantages of the proposed scheme in improving network performance under various network conditions in the long run, compared to the conventional schemes.

Author Contributions

All authors conceived and proposed the research idea. H.T.H.G. made the formulation and performed the simulations under the supervision of T.N.K.H. and P.D.T.; I.K. analyzed the simulation results; H.T.H.G. wrote the draft paper; T.N.K.H., P.D.T. and I.K. reviewed and edited the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant through the Korean Government (MSIT) under Grant NRF-2018R1A2B6001714.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ding, Z.; Liu, Y.; Choi, J.; Sun, Q.; Elkashlan, M.; Chih-Lin, I.; Poor, H.V. Application of non-orthogonal multiple access in LTE and 5G networks. IEEE Commun. Mag. 2017, 55, 185–191. [Google Scholar] [CrossRef] [Green Version]
  2. Dai, L.; Wang, B.; Yuan, Y.; Han, S.; Chih-Lin, I.; Wang, Z. Non-orthogonal multiple access for 5G: Solutions, challenges, opportunities, and future research trends. IEEE Commun. Mag. 2015, 53, 74–81. [Google Scholar] [CrossRef]
  3. Islam, S.M.R.; Zeng, M.; Dobre, O.A.; Kwak, K.-S. Resource allocation for downlink NOMA systems: Key techniques and open issues. IEEE Wirel. Commun. 2018, 25, 40–47. [Google Scholar] [CrossRef] [Green Version]
  4. Yu, W.; Musavian, L.; Ni, Q. Link-layer capacity of NOMA under statistical delay QoS guarantees. IEEE Trans. Commun. 2018, 66, 4907–4922. [Google Scholar] [CrossRef]
  5. Zeng, M.; Yadav, A.; Dobre, O.A.; Tsiropoulos, G.I.; Poor, H.V. On the sum rate of MIMO-NOMA and MIMO-OMA systems. IEEE Wirel. Commun. Lett. 2017, 6, 534–537. [Google Scholar] [CrossRef]
  6. Di, B.; Song, L.; Li, Y. Sub-channel assignment, power allocation, and user scheduling for non-orthogonal multiple access networks. IEEE Trans. Wirel. Commun. 2016, 15, 7686–7698. [Google Scholar] [CrossRef] [Green Version]
  7. Timotheou, S.; Krikidis, I. Fairness for non-orthogonal multiple access in 5G systems. IEEE Signal Process. Lett. 2015, 22, 1647–1651. [Google Scholar] [CrossRef] [Green Version]
  8. Liang, W.; Ding, Z.; Li, Y.; Song, L. User pairing for downlink nonorthogonal multiple access networks using matching algorithm. IEEE Trans. Commun. 2017, 65, 5319–5332. [Google Scholar] [CrossRef] [Green Version]
  9. Zhang, Y.; Wang, H.-M.; Zheng, T.-X.; Yang, Q. Energy-efficient transmission design in non-orthogonal multiple access. IEEE Trans. Veh. Technol. 2017, 66, 2852–2857. [Google Scholar] [CrossRef] [Green Version]
  10. Hao, W.; Zeng, M.; Chu, Z.; Yang, S. Energy-efficient power allocation in millimeter wave massive MIMO with non-orthogonal multiple access. IEEE Wireless Commun. Lett. 2017, 6, 782–785. [Google Scholar] [CrossRef] [Green Version]
  11. Fang, F.; Zhang, H.; Cheng, J.; Leung, V.C.M. Energy-efficient resource allocation for downlink non-orthogonal multiple access network. IEEE Trans. Commun. 2016, 64, 3722–3732. [Google Scholar] [CrossRef]
  12. Lv, T.; Ma, Y.; Zeng, J.; Mathiopoulos, P.T. Millimeter-wave NOMA transmission in cellular M2M communications for Internet of Things. IEEE Internet Things J. 2018, 5, 1989–2000. [Google Scholar] [CrossRef] [Green Version]
  13. Cui, J.; Liu, Y.; Ding, Z.; Fan, P.; Nallanathan, A. Optimal user scheduling and power allocation for millimeter wave NOMA systems. IEEE Trans. Wirel. Commun. 2018, 17, 1502–1517. [Google Scholar] [CrossRef] [Green Version]
  14. Ahmad, W.S.H.M.W.; Radzi, N.A.M.; Samidi, F.S.; Ismail, A.; Abdullah, F.; Jamaludin, M.Z.; Zakaria, M.N. 5G Technology: Towards Dynamic Spectrum Sharing Using Cognitive Radio Networks. IEEE Access 2020, 8, 14460–14488. [Google Scholar] [CrossRef]
  15. Amjad, M.; Musavian, L.; Rehmani, M.H. Effective Capacity in Wireless Networks: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2019, 21, 3007–3038. [Google Scholar] [CrossRef] [Green Version]
  16. Zhou, F.; Beaulieu, N.C.; Li, Z.; Si, J.; Qi, P. Energy-efficient optimal power allocation for fading cognitive radio channels: Ergodic capacity, outage capacity, and minimum-rate capacity. IEEE Trans. Wirel. Commun. 2016, 15, 2741–2755. [Google Scholar] [CrossRef]
  17. Goldsmith, A.; Jafar, S.A.; Maric, I.; Srinivasa, S. Breaking spectrum gridlock with cognitive radios: An information theoretic perspective. Proc. IEEE 2009, 97, 894–914. [Google Scholar] [CrossRef]
  18. Lv, L.; Chen, J.; Li, Q.; Ding, Z. Design of cooperative nonorthogonal multicast cognitive multiple access for 5G systems: User scheduling and performance analysis. IEEE Trans. Commun. 2017, 65, 2641–2656. [Google Scholar] [CrossRef] [Green Version]
  19. Lv, L.; Ni, Q.; Ding, Z.; Chen, J. Application of non-orthogonal multiple access in cooperative spectrum-sharing networks over Nakagami-m fading channels. IEEE Trans. Veh. Technol. 2017, 66, 5506–5511. [Google Scholar] [CrossRef] [Green Version]
  20. Thanh, P.D.; Hoan, T.N.K.; Koo, I. Joint Resource Allocation and Transmission Mode Selection Using a POMDP-Based Hybrid Half-Duplex/Full-Duplex Scheme for Secrecy Rate Maximization in Multi-Channel Cognitive Radio Networks. IEEE Sens. J. 2020, 20, 3930–3945. [Google Scholar] [CrossRef]
  21. Lu, X.; Wang, P.; Niyato, D.; Kim, D.I.; Han, Z. Wireless networks with RF energy harvesting: A contemporary survey. IEEE Commun. Surv. Tutor. 2015, 17, 757–789. [Google Scholar] [CrossRef] [Green Version]
  22. Giang, H.T.H.; Hoan, T.N.K.; Thanh, P.D.; Koo, I. A POMDP-based long-term transmission rate maximization for cognitive radio networks with wireless-powered ambient backscatter. Int. J. Commun. Syst. 2019, 32, e3993. [Google Scholar] [CrossRef]
  23. Zhang, N.; Wang, J.; Kang, G.; Liu, Y. Uplink nonorthogonal multiple access in 5g systems. IEEE Commun. Lett. 2016, 20, 458–461. [Google Scholar] [CrossRef]
  24. Ni, Z.; Chen, Z.; Zhang, Q.; Zhou, C. Analysis of RF Energy Harvesting in Uplink-NOMA IoT-Based Network. In Proceedings of the 2019 IEEE 90th Vehicular Technology Conference (VTC2019-Fall), Honolulu, HI, USA, 22–25 September 2019; pp. 1–5. [Google Scholar]
  25. Gussen, C.M.G.; Belmega, E.V.; Debbah, M. Pricing and bandwidth allocation problems in wireless multi-tier networks. In Proceedings of the 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Pacific Grove, CA, USA, 6–9 November 2011; pp. 1633–1637. [Google Scholar]
  26. Arunthavanathan, S.; Kandeepan, S.; Evans, R.J. A Markov Decision Process-Based Opportunistic Spectral Access. IEEE Wirel. Commun. Lett. 2016, 5, 544–547. [Google Scholar] [CrossRef]
  27. Xiao, H.; Yang, K.; Wang, X.; Shao, H. A robust MDP approach to secure power control in cognitive radio networks. In Proceedings of the 2012 IEEE International Conference on Communications (ICC), Ottawa, ON, USA, 10–15 June 2012; pp. 4642–4647. [Google Scholar]
  28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; A Bradford Book; MIT Press: London, UK, 2018. [Google Scholar]
  29. Li, R.; Zhao, Z.; Chen, X.; Palicot, J.; Zhang, H. TACT: A transfer actor-critic learning framework for energy saving in cellular radio access networks. IEEE Trans. Wirel. Commun. 2014, 13, 2000–2011. [Google Scholar] [CrossRef] [Green Version]
  30. Puspita, R.H.; Shah, S.D.A.; Lee, G.; Roh, B.; Oh, J.; Kang, S. Reinforcement Learning Based 5G Enabled Cognitive Radio Networks. In Proceedings of the 2019 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Korea, 16–18 October 2019; pp. 555–558. [Google Scholar]
  31. Meng, X.; Inaltekin, H.; Krongold, B. Deep Reinforcement Learning-Based Power Control in Full-Duplex Cognitive Radio Networks. In Proceedings of the 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, UAE, 9–13 December 2018; pp. 1–7. [Google Scholar]
  32. Ong, K.S.H.; Zhang, Y.; Niyato, D. Cognitive Radio Network Throughput Maximization with Deep Reinforcement Learning. In Proceedings of the 2019 IEEE 90th Vehicular Technology Conference (VTC2019-Fall), Honolulu, HI, USA, 22–25 September 2019; pp. 1–5. [Google Scholar]
  33. Zhang, H.; Yang, N.; Huangfu, W.; Long, K.; Leung, V.C.M. Power Control Based on Deep Reinforcement Learning for Spectrum Sharing. IEEE Trans. Wirel. Commun. 2020. [Google Scholar] [CrossRef]
  34. Ding, Z.; Yang, Z.; Fan, P.; Poor, H.V. On the Performance of Non-Orthogonal Multiple Access in 5G Systems with Randomly Deployed Users. IEEE Signal Process. Lett. 2014, 21, 1501–1505. [Google Scholar] [CrossRef] [Green Version]
  35. Han, W.; Li, J.; Li, Z.; Si, J.; Zhang, Y. Efficient Soft Decision Fusion Rule in Cooperative Spectrum Sensing. IEEE Trans. Signal Process. 2013, 61, 1931–1943. [Google Scholar] [CrossRef]
  36. Ma, J.; Zhao, G.; Li, Y. Soft Combination and Detection for Cooperative Spectrum Sensing in Cognitive Radio Networks. IEEE Trans. Wirel. Commun. 2008, 7, 4502–4507. [Google Scholar]
  37. Lee, P.; Eu, Z.A.; Han, M.; Tan, H. Empirical modeling of a solar-powered energy harvesting wireless sensor node for time-slotted operation. In Proceedings of the 2011 IEEE Wireless Communications and Networking Conference, Cancun, Mexico, 28–31 March 2011; pp. 179–184. [Google Scholar]
  38. Kawamoto, Y.; Takagi, H.; Nishiyama, H.; Kato, N. Efficient Resource Allocation Utilizing Q-Learning in Multiple UA Communications. IEEE Trans. Netw. Sci. Eng. 2019, 6, 293–302. [Google Scholar] [CrossRef]
  39. Wei, Y.; Yu, F.R.; Song, M.; Han, Z. User Scheduling and Resource Allocation in HetNets With Hybrid Energy Supply: An Actor-Critic Reinforcement Learning Approach. IEEE Trans. Wirel. Commun. 2018, 17, 680–692. [Google Scholar] [CrossRef]
  40. Stone, J.V. Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis; Sebtel Press: Sheffield, UK, 2013; p. 174. [Google Scholar]
  41. Grondman, I.; Busoniu, L.; Lopes, G.A.D.; Babuska, R. A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients. IEEE Trans. Syst. Man Cybern. Part C 2012, 42, 1291–1307. [Google Scholar] [CrossRef] [Green Version]
  42. Wei, Y.; Yu, F.R.; Song, M.; Han, Z. Joint Optimization of Caching, Computing, and Radio Resources for Fog-Enabled IoT Using Natural Actor–Critic Deep Reinforcement Learning. IEEE Internet Things J. 2019, 6, 2061–2073. [Google Scholar] [CrossRef]
  43. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar]
  44. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–14. [Google Scholar]
  45. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  46. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  47. Zhong, C.; Lu, Z.; Gursoy, M.C.; Velipasalar, S. A Deep Actor-Critic Reinforcement Learning Framework for Dynamic Multichannel Access. IEEE Trans. Cognit. Commun. Netw. 2019, 5, 1125–1139. [Google Scholar] [CrossRef] [Green Version]
  48. Wang, K.; Chen, L.; Liu, Q. On Optimality of Myopic Policy for Opportunistic Access With Nonidentical Channels and Imperfect Sensing. IEEE Trans. Veh. Technol. 2014, 63, 2478–2483. [Google Scholar] [CrossRef]
  49. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
Figure 1. System model of the proposed scheme.
Figure 2. Time frame of the cognitive users' operations.
Figure 3. Markov chain model of the primary channel.
Figure 4. The agent–environment interaction process.
Figure 5. The structure of deep actor-critic reinforcement learning.
Figure 6. The deep neural network in the critic.
Figure 7. The deep neural network in the actor.
Figure 8. The convergence rate of the proposed actor-critic deep reinforcement learning with different training steps in each episode.
Figure 9. The convergence rate of the proposed actor-critic deep reinforcement learning according to different learning rate values.
Figure 10. Average transmission rate according to different values for mean harvested energy.
Figure 11. Energy efficiency according to different values of harvested mean energy.
Figure 12. Average transmission rate according to noise variance.
Figure 13. Energy efficiency according to noise variance.
Table 1. Simulation Parameters.

Parameter | Description | Value
$M$ | Number of groups | 3
$T_{tot}$ | Time slot duration | 200 ms
$\tau_{ss}$ | Sensing duration | 2 ms
$W$ | Total system bandwidth | 1 Hz
$E_{bat}$ | Battery capacity | 30 μJ
$e_{ss}$ | Sensing cost | 1 μJ
$e^{tr}$ | Transmission energy | {0, 10, 20} μJ
$\xi_{avg}$ | Mean value of harvested energy | 5 μJ
$\mu$ | Initial belief that the primary channel is free | 0.5
$P_{FF}$ | Transition probability of the primary channel from state F to itself | 0.8
$P_{BF}$ | Transition probability of the primary channel from state B to state F | 0.2
$P_d$ | Probability of detection | 0.9
$P_f$ | Probability of false alarm | 0.1
$\sigma^2$ | Noise variance | −80 dB
$\gamma$ | Discount factor | 0.9
$\alpha_a$ | Learning rate of the actor | 0.001
$\alpha_c$ | Learning rate of the critic | 0.005
$\epsilon$ | Epsilon rate (initial → minimum) | 1 → 0.01
$\epsilon_d$ | Epsilon decay | 0.9999
$L$ | Number of episodes | 300
$T$ | Number of iterations per episode | 2000
