Article

Reinforcement Learning-Based Joint User Pairing and Power Allocation in MIMO-NOMA Systems

Department of Electronic Engineering, Sogang University, Seoul 04107, Korea
* Author to whom correspondence should be addressed.
Sensors 2020, 20(24), 7094; https://doi.org/10.3390/s20247094
Submission received: 4 November 2020 / Revised: 26 November 2020 / Accepted: 9 December 2020 / Published: 11 December 2020
(This article belongs to the Section Sensor Networks)

Abstract

In this paper, we consider a multiple-input multiple-output (MIMO) non-orthogonal multiple access (NOMA) system with reinforcement learning (RL). NOMA, which is a technique for increasing the spectrum efficiency, has been extensively studied in fifth-generation (5G) wireless communication systems. The application of MIMO to NOMA can result in an even higher spectral efficiency. Moreover, user pairing and power allocation are important techniques in NOMA. However, NOMA has a fundamental limitation of high computational complexity due to rapidly changing radio channels. This limitation makes it difficult to utilize the characteristics of the channel and allocate radio resources efficiently. To reduce the computational complexity, we propose an RL-based joint user pairing and power allocation scheme. By applying Q-learning, we are able to perform user pairing and power allocation simultaneously, which reduces the computational complexity. The simulation results show that the proposed scheme achieves a sum rate similar to that achieved with the exhaustive search (ES).

1. Introduction

5G mobile communication is further increasing the number of users using the wireless Internet. Moreover, the number of autonomous vehicles connected to 5G is also increasing. Hence, the importance of spectrum efficiency has been significantly increasing, and non-orthogonal multiple access (NOMA) is one of the most important research areas [1]. NOMA techniques can be categorized into two main classes: power-domain and code-domain NOMA. Code-domain NOMA is a technique for multiplexing users based on “codewords.” The concept of code-domain NOMA was inspired by the classic code division multiple access (CDMA) system [2]. Code-domain NOMA allows multiple users to share the same time-frequency resources but adopts unique user-specific spreading sequences. The spreading sequences are restricted to sparse sequences or non-orthogonal low cross-correlation sequences in code-domain NOMA. Sparse code multiple access (SCMA) is one of the most important techniques in recent code-domain NOMA. In particular, studies to improve the spectral efficiency by using low-density parity-check (LDPC) codes are actively being conducted [3,4]. Power-domain NOMA is another technique that allows multiple user equipment (UEs) to access the same time/frequency resource, where the signals from the UEs are multiplexed through different power allocation coefficients [5]. The transmit power at the base station (BS) is divided up between the UEs. UEs with poor channel conditions receive more transmit power, whereas UEs with better channel conditions receive less transmit power. On the receiver side, successive interference cancellation (SIC) is used to recover each signal. The SIC receiver successively decodes and subtracts the received signals until it reaches its desired signal [5]. SIC and power allocation are important techniques in power-domain NOMA systems. In this paper, we study user pairing and power allocation for power-domain NOMA systems.
Multiple-input multiple-output (MIMO) is another technique for enhancing the spectrum efficiency. The application of MIMO to NOMA can result in an even higher spectral efficiency. We consider user pairing and power allocation in MIMO-NOMA systems. Many researchers have already investigated user pairing or power allocation in MIMO-NOMA systems [6,7,8,9,10,11,12]. In [6], a joint user pairing and power allocation scheme in virtual MIMO systems was proposed. First, with the paired user groups known, the power allocation problem was solved with a multi-level water-filling method. In the next step, joint user pairing and power allocation were conducted with an iterative algorithm based on the analysis in the first step. In [7], the authors proposed user pairing and scheduling algorithms for massive MIMO–NOMA systems to maximize the sum rate by mitigating inter-pair interference. In [8], an optimal NOMA power allocation scheme for improving the spectrum efficiency of coexisting multi-user (MU)-MIMO and orthogonal multiple access (OMA) device-to-device (D2D) networks was proposed. In [9], a two-user downlink MIMO–NOMA power allocation scheme was proposed; the non-convex MIMO–NOMA power allocation problem was formulated with optimal and suboptimal solutions. Furthermore, an optimal power allocation scheme for maximizing fairness was proposed in [10]; all UEs have the same data rate based on the max–min rate criterion of the power allocation scheme. In [11], user pairing was combined with power allocation in downlink NOMA systems. The UEs were sorted according to the channel gain, and then the optimal power allocation was applied to enhance the spectrum efficiency. In [12], the authors proposed a user pairing and power allocation scheme in a NOMA system, where the number of users is limited to two. In [8,9,10,11,12], power allocation schemes are proposed for NOMA systems. These conventional schemes formulated the power allocation problem based on convex optimization and tried to find the power by mathematically solving the convex problem. However, we apply RL to determine the power of the UEs in each pair in the MIMO-NOMA system. Moreover, while the conventional schemes require a high computational complexity to determine the user pairing and power allocation in a MIMO-NOMA system, we find the user pairing and power allocation at the same time with low computational complexity.
Many researchers have applied deep learning (DL) in wireless communication [13,14]; the learning methods include supervised, unsupervised, and reinforcement learning (RL). Supervised learning requires many datasets for training, which may make it difficult to apply to real-time wireless communication environments. In unsupervised learning, data are classified or statistical distributions are estimated, which makes user pairing and power allocation difficult. Another learning method is Q-learning, which is a widely used model-free RL technique. Q-learning can solve the user pairing and power allocation problem through its actions. The channel state information (CSI) between the UE and the BS changes continuously at every time slot owing to the movement of UEs or shadowing between buildings. Therefore, Q-learning, which determines the optimal reward by applying the CSI without a dataset, may be more suitable for wireless communications than supervised learning techniques that require many datasets.
Some researchers have applied DL to NOMA systems [15,16,17,18,19,20,21,22,23,24,25]. In [15], a DL-aided sparse code multiple access (SCMA) scheme was proposed in which the mapping of data to resources and the decoding of received signals are conducted with a deep neural network (DNN). In [16], the authors proposed a deep RL-based power allocation with a dual DNN to overcome the noisiness/randomness problem in training data. Moreover, in [17], the NOMA channel was estimated by applying long short-term memory (LSTM), which is used to learn the CSI of the NOMA system through offline and online training. The authors in [18,19] proposed a fast RL method with a $(\tau, \epsilon)$-greedy based deep Q network (DQN) in jamming environments. Furthermore, user pairing was achieved in [20] by applying multi-agent RL to a multi-carrier NOMA system. In [21], the authors proposed a DQN-based joint power allocation and channel assignment for NOMA systems; they derived a closed-form solution for the power allocation and proposed an attention-based DQN for the channel assignment problem. In [22], the dynamic channel access problem was formulated as a partially observable Markov decision process (POMDP), and a DQN was applied to find the access policy via online learning. In [23], the authors proposed a multi-agent DNN approach to predict the spectrum occupation of unknown neighboring networks in slotted wireless networks, where they trained the DNN in an online way using both RL and supervised learning. The authors in [24] proposed a DQN-based power allocation for a multi-cell network to maximize the total network throughput. In [25], a joint precoding and SIC decoding scheme for MIMO–NOMA systems was presented in an imperfect SIC decoding environment.
The key challenges in MIMO-NOMA are beamforming, optimization, power allocation, user pairing, and SIC ordering. These challenges have been studied jointly or partially, under specific performance metrics. MIMO-NOMA is a technology that can enhance the spectral efficiency in 5G, but it has a fundamental limitation of high computational complexity. This paper aims to increase the sum rate and reduce the computational complexity by using RL-based joint power allocation and user pairing in MIMO-NOMA systems. The contributions of this paper are as follows: First, we propose an RL-based joint user pairing and power allocation scheme for MIMO-NOMA systems. Previous studies investigated the user pairing and power allocation problems independently, or they addressed them via mathematical approaches such as convex optimization in a simplified system with a few users. To the best of the authors’ knowledge, this study is the first attempt in which RL is applied to perform user pairing and power allocation jointly in a practical system with multiple users. Second, the proposed RL-based scheme reduces the computational complexity. In the conventional schemes, user pairing is performed after the BS has received the location information and CSI from the UEs, and then the power is allocated to the UEs in each pair. In this paper, user pairing and power allocation are performed simultaneously through RL when the BS receives the location information and CSI from the UEs. Exhaustive search (ES) is a scheme that finds the maximum sum rate, but its computational complexity is extremely high because it evaluates every possible user pairing and every feasible power allocation coefficient and then computes the corresponding sum rates. The proposed RL scheme reduces the computational complexity because the sum rate is calculated with one action selection. Third, the proposed RL-based scheme shows a sum rate superior to those of OMA and other comparable schemes. At the beginning of the simulation, the sum rate of the proposed scheme is low because the BS randomly selects the action; however, as the time slots elapse, the learning proceeds and the sum rate approximately converges to that of the ES. Moreover, it is shown that the proposed scheme is more efficient than the ES or phased RL schemes in terms of the time and computational complexity.
The remainder of this paper is organized as follows: Section 2 describes the system model, and Section 3 presents the proposed RL-based joint user pairing and power allocation in MIMO-NOMA systems. The numerical results are presented in Section 4, and Section 5 concludes this paper.
For the sake of clarity, the main symbols and their descriptions used in this paper are summarized in Table 1.
Notations: Vectors are represented by boldface lowercase letters, while matrices are represented by boldface capital letters; $\mathbf{I}_N$ is the identity matrix, and $\hat{h}$ denotes the quantized value of $h$.

2. System Model

2.1. System Description

In this paper, we consider a downlink MIMO–NOMA system in a macro cell with a 500 m radius, as shown in Figure 1. The BS has a transmit power of $P_{BS}$, and it allocates the same power to the $N$ antennas. Thus, the BS transmits a superimposed signal, considering the characteristics of NOMA. To create a MIMO–NOMA applicable scenario, all $M$ UEs are randomly distributed in the cell. The transmit power at each beam can be expressed as $P_n = P_{BS}/N$. We assume that the channel gain is ordered as follows:
$|\mathbf{h}_{n,i}|^2 \le |\mathbf{h}_{n,j}|^2, \quad \text{for } i \le j$.  (1)
In NOMA, the UE close to the BS can cancel the interference signal by using SIC, where the interference signal is the signal sent to the UE with poor channel conditions. Here, the SIC is assumed to operate with little or no error. In addition, the BS is responsible for pairing the UEs, and it then determines the transmit power of each UE. Each UE suffers from Rayleigh fading and additive white Gaussian noise (AWGN) with zero mean and variance $\sigma_n^2$. The superimposed signal transmitted by the BS is as follows:
$x_n = \sum_{k=1}^{K} \sqrt{\alpha_{n,k} P_n}\, s_{n,k}$,  (2)
where $s_{n,k}$, $\alpha_{n,k}$, and $P_n$ denote the signal transmitted by the BS, the power allocation coefficient, and the transmit power of each beam, respectively. The signal received at UE$_{n,k}$ is as follows:
$y_{n,k} = \mathbf{h}_{n,k} \sum_{n'=1}^{N} \mathbf{w}_{n'} x_{n'} + n_{n,k}$,  (3)
where $\mathbf{h}_{n,k}$ is the Rayleigh fading channel vector from the BS to UE$_{n,k}$, $\mathbf{w}_n$ is the precoding vector for each beam in the precoding matrix $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_N]$, $\mathbf{w}_n \in \mathbb{C}^{1 \times N}$, and $n_{n,k}$ is the AWGN; $\mathbf{h}_{n,k}$ can be expressed as follows:
$\mathbf{h}_{n,k} = \dfrac{\tilde{\mathbf{h}}_{n,k}}{d_{n,k}^{\eta}}$.  (4)
Here, $\tilde{\mathbf{h}}_{n,k}$ denotes the Rayleigh fading vector, the distance between the BS and UE$_{n,k}$ is denoted as $d_{n,k}$, and the path loss exponent is $\eta$; the quantized version of $\mathbf{h}_{n,k}$ is later used as the state of the RL model. Equation (3) can be rewritten as follows:
$y_{n,k} = \mathbf{h}_{n,k}\mathbf{w}_n \sqrt{P_n \alpha_{n,k}}\, s_{n,k} + \underbrace{\mathbf{h}_{n,k}\mathbf{w}_n \sum_{k'=k+1}^{K} \sqrt{P_n \alpha_{n,k'}}\, s_{n,k'}}_{\text{intra-beam interference}} + \underbrace{\mathbf{h}_{n,k} \sum_{n'=1, n' \ne n}^{N} \mathbf{w}_{n'} x_{n'}}_{\text{inter-beam interference}} + n_{n,k}$.  (5)
After SIC, Equation (5) can be rewritten as follows:
$y_{n,k} = \begin{cases} \mathbf{h}_{n,k}\mathbf{w}_n \sqrt{P_n \alpha_{n,k}}\, s_{n,k} + \mathbf{h}_{n,k} \sum_{n'=1, n' \ne n}^{N} \mathbf{w}_{n'} x_{n'} + n_{n,k}, & \text{if } k = K, \\ \mathbf{h}_{n,k}\mathbf{w}_n \sqrt{P_n \alpha_{n,k}}\, s_{n,k} + \mathbf{h}_{n,k}\mathbf{w}_n \sum_{k'=k+1}^{K} \sqrt{P_n \alpha_{n,k'}}\, s_{n,k'} + \mathbf{h}_{n,k} \sum_{n'=1, n' \ne n}^{N} \mathbf{w}_{n'} x_{n'} + n_{n,k}, & \text{if } 1 \le k \le K,\ k \ne K. \end{cases}$  (6)
Following the principle of NOMA, the power allocation coefficient $\alpha_{n,k}$ of each UE is expressed as follows:
$0 \le \alpha_{n,k} \le 1, \quad \sum_{k=1}^{K} \alpha_{n,k} = 1, \quad \alpha_{n,k} \in \Omega$,  (7)
where Ω denotes the space of the feasible power allocation coefficient.
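To make the signal model concrete, the following minimal sketch (an illustration, not the authors' code) generates Rayleigh fading channels with distance-based path loss as in Equation (4) and forms the superimposed signal of Equation (2); the parameter values and the BPSK stand-in symbols are assumptions.

```python
# Sketch of the Section 2.1 signal model; NumPy only, all values are assumptions.
import numpy as np

rng = np.random.default_rng(0)

N = 2           # number of BS antennas / beams
K = 2           # UEs paired per beam
P_BS = 10.0     # total BS transmit power (linear scale)
P_n = P_BS / N  # equal power per beam
eta = 3         # path loss exponent

# Rayleigh fading vectors and distances for every UE (n, k), Equation (4).
d = rng.uniform(50, 500, size=(N, K))                 # BS-UE distances in meters
g = (rng.standard_normal((N, K, N)) +
     1j * rng.standard_normal((N, K, N))) / np.sqrt(2)
h = g / d[..., None] ** eta                           # path-loss-scaled channel vectors

# Superimposed NOMA signal per beam, Equation (2), with unit-power symbols.
alpha = np.tile([0.8, 0.2], (N, 1))                   # weak UE first, strong UE last
s = np.sign(rng.standard_normal((N, K)))              # BPSK symbols as a stand-in
x = (np.sqrt(alpha * P_n) * s).sum(axis=1)            # one superimposed symbol per beam
print(np.round(x, 3))
```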

2.2. Problem Formulation

Based on Equation (5), the signal-to-interference-plus-noise ratio (SINR) for UE$_{n,k}$ is given by
$\gamma_{n,k} = \dfrac{\alpha_{n,k} P_n |\mathbf{h}_{n,k}\mathbf{w}_n|^2}{I_{n,k}^{U} + I_{n,k}^{N} + \sigma_n^2}$,  (8)
where $I_{n,k}^{U}$ and $I_{n,k}^{N}$ are the intra-beam and inter-beam interference, respectively, as follows:
$I_{n,k}^{U} = |\mathbf{h}_{n,k}\mathbf{w}_n|^2 \sum_{k'=k+1}^{K} P_n \alpha_{n,k'}$,  (9)
$I_{n,k}^{N} = \sum_{n'=1, n' \ne n}^{N} |\mathbf{h}_{n,k}\mathbf{w}_{n'}|^2 P_{n'}$.  (10)
The objective is to maximize the sum rate of all UEs. Thus, the user pairing $\Phi_n$ of each beam, the power allocation coefficient $\alpha_{n,k}$ for each UE, and the precoding vector $\mathbf{w}_n$ should be determined [8]. The problem can then be formulated as follows:
$\begin{aligned} \max_{\Phi_n, \mathbf{w}_n, \alpha_{n,k}} \ & R_{all} \\ \text{s.t.}\ (C1)\ & \sum_{k=1}^{K} \alpha_{n,k} = 1, \ \alpha_{n,k} \in \mathbb{R}, \ n = 1, 2, \dots, N, \\ (C2)\ & R_{n,k} \ge R_0, \\ (C3)\ & |\mathbf{h}_{n,k}\mathbf{w}_{n'}| = 0, \ n' \ne n, \end{aligned}$  (11)
where $R_{all}$ in Equation (11) represents the sum rate of the MIMO-NOMA UEs. The constraint (C1) is the summation of the power allocation coefficients in a beam. The constraint (C2) means that the BS satisfies the minimum data rate $R_0$ of each UE. The constraint (C3) represents the beamforming constraint. The optimization problem is non-convex and NP-hard. To solve this problem, the computational complexity should be reduced. The precoding matrix can be expressed as follows [5]:
$\mathbf{W} = \mathbf{I}_N$,  (12)
where $\mathbf{I}_N$ is the $N \times N$ identity matrix. Equation (12) implies that the inter-beam interference $I_{n,k}^{N}$ can be canceled. Therefore, the complex MIMO–NOMA system can be simplified into single-input single-output (SISO) NOMA systems.
From Equations (8) and (12), the data rate of UE$_{n,k}$ can be expressed as follows:
$R_{n,k} = \log_2 \left( 1 + \dfrac{\alpha_{n,k} P_n |\mathbf{h}_{n,k}\mathbf{w}_n|^2}{I_{n,k}^{U} + \sigma_n^2} \right)$.  (13)
UE$_{n,K}$ is the user closest to the BS, and SIC can be used to remove its intra-beam interference $I_{n,k}^{U}$. Consequently, Equation (13) can be rewritten as follows:
$R_{n,k} = \begin{cases} \log_2 \left( 1 + \dfrac{\alpha_{n,k} P_n |\mathbf{h}_{n,k}\mathbf{w}_n|^2}{\sigma_n^2} \right), & \text{if } k = K, \\ \log_2 \left( 1 + \dfrac{\alpha_{n,k} P_n |\mathbf{h}_{n,k}\mathbf{w}_n|^2}{I_{n,k}^{U} + \sigma_n^2} \right), & \text{if } 1 \le k \le K,\ k \ne K. \end{cases}$  (14)
From Equation (14), the data rate of the UEs with $1 \le k \le K$ in a beam can be calculated; the sum rate of the whole MIMO–NOMA system can be calculated by summing the data rates of all beams. The sum rate $R_{all}$ of the MIMO–NOMA system can be expressed as follows:
$R_{all} = \sum_{n=1}^{N} \sum_{k=1}^{K} \log_2 \left( 1 + \dfrac{\alpha_{n,k} P_n |\mathbf{h}_{n,k}\mathbf{w}_n|^2}{I_{n,k}^{U} + \sigma_n^2} \right)$.  (15)
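As an illustration of how Equations (8)-(15) combine, the short sketch below computes the sum rate for a given channel tensor and power allocation matrix; it assumes identity precoding as in Equation (12), neglects inter-beam interference per constraint (C3), and uses hypothetical function and variable names.

```python
# Sketch of the sum-rate computation in Equations (8)-(15); assumes W = I_N and
# that UE (n, K) is the strongest UE in each beam (Equation (1)).
import numpy as np

def sum_rate(h, alpha, P_n, sigma2):
    """h: (N, K, N) channel vectors; alpha: (N, K) coefficients summing to 1 per beam."""
    N, K, _ = h.shape
    W = np.eye(N)                                # identity precoding, Equation (12)
    rate = 0.0
    for n in range(N):
        for k in range(K):
            gain = abs(h[n, k] @ W[:, n]) ** 2   # |h_{n,k} w_n|^2
            # Intra-beam interference from UEs k+1..K, Equation (9); zero when k = K.
            I_u = gain * P_n * alpha[n, k + 1:].sum()
            sinr = alpha[n, k] * P_n * gain / (I_u + sigma2)   # Equation (8), no inter-beam term
            rate += np.log2(1.0 + sinr)          # accumulate Equation (15)
    return rate
```

With the channel sketch from Section 2.1, `sum_rate(h, alpha, P_n, sigma2=1.0)` evaluates $R_{all}$ in bps/Hz.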
In the conventional user pairing and power allocation procedure, after the BS acquires the CSI from the UEs, the BS determines the pairs according to the locations or channel gains. This information is transmitted to the UEs. When the response from the UEs has been received, the power allocation coefficient of the UEs belonging to each beam is determined, and the superimposed signal is then transmitted to each UE with the allocated power.

3. Proposed RL-Based Joint User Pairing and Power Allocation

In this section, joint user pairing and power allocation for maximizing the sum rate of a MIMO–NOMA system is proposed. In the wireless channel environment, user pairing and power allocation can be modeled as repeated interactions between the BS and the UEs. The optimal user pairing and power allocation depend on the locations of the UEs and their radio channel states [18]. The user pairing and power allocation of the BS affect the sum rate of the MIMO–NOMA system. Because the MIMO–NOMA transmission process can be formulated as a Markov decision process, Q-learning can be applied to a MIMO–NOMA system.
Q-learning is based on the state, action, and reward [26]. Figure 2 shows the basic structure of RL. In the proposed Q-learning model, the agent is the BS, and the environment comprises the fading, shadowing, and distances between the BS and the UEs.

3.1. Design State and Action

The BS performs the user pairing and power allocation based on Q-learning, and the Q-function determines the user pairing and power allocation values. The state $s^t$ is the quantized channel vector $\hat{\mathbf{h}}_{n,k}$ of the UEs, the action $\theta^t$ comprises a user pairing set $\Phi_n$ and a power allocation coefficient $\alpha_{n,k}$, and the reward is defined as the quantized sum rate $\hat{R}_{all}$ of the MIMO–NOMA system. The quantization is performed in $L$ steps; that is, the channel vector of the UEs generated with the Rayleigh distribution is quantized into $L$ levels.
The state at time t is as follows:
$s^t = \left[ \hat{\mathbf{h}}_{n,k}^{\,t-1} \right]_{1 \le n \le N,\ 1 \le k \le K} \in \xi$,  (16)
where $\xi$ is the space of all possible channel vectors. Moreover, the size of the state space can be expressed as $L^{NK}$.
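A possible realization of this state construction is sketched below: the channel magnitude of every UE is quantized into $L$ levels, and the level vector is mapped to one of the $L^{NK}$ state indices. The bin edges are an assumption and would in practice be matched to the channel statistics.

```python
# Sketch of the state mapping in Equation (16); bin edges are an assumption.
import numpy as np

def state_index(h, edges):
    """h: (N, K, N) channel vectors; edges: L-1 increasing bin boundaries on |h|."""
    L = len(edges) + 1
    mags = np.linalg.norm(h, axis=-1).ravel()        # |h_{n,k}| for all N*K UEs
    levels = np.digitize(mags, edges)                # quantized CSI, values in 0..L-1
    # Interpret the level vector as a base-L number -> index in [0, L^(N*K)).
    return int(sum(int(l) * L ** i for i, l in enumerate(levels)))
```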
The action set of the BS is defined as the index of the joint user pairing and power allocation procedure. As assumed in the system model, when there are $M$ UEs in the cell and the BS forms $N$ beamforming vectors, $K$ UEs form a pair in each beam. The user pairing set is defined as $\Phi_n$:
$\Phi_n = \{ (n,1), (n,2), \dots, (n,K) \}, \quad K \ge 2, \ 1 \le n \le N$.  (17)
When we use the ES method for user pairing, the computational complexity exponentially increases. Meanwhile, if the channel gain of the UEs grouped in the same nth pair is assumed to be ordered by Equation (1), the user pairing complexity can be reduced.
Moreover, the power allocation coefficients are quantized into $K$ levels according to the number of UEs in each beam, and the sum of the power allocation coefficients is set to 1. Thus, Equation (7) can be rewritten as follows:
$\alpha_{n,k} \in \left\{ \tfrac{k}{K} \right\}_{1 \le k \le K}, \quad \sum_{k=1}^{K} \alpha_{n,k} = 1, \quad \alpha_{n,k} \in \Omega$.  (18)
The joint action of Q-learning is obtained by combining the user pairing index with the $K$ quantized power allocation coefficients. Hence, joint user pairing and power allocation can be performed in one step. From Equations (17) and (18), the action at time $t$ can be expressed as follows:
$\theta^t = \Phi_n \times \Omega$.  (19)
The size of the action space is as follows:
$n(\theta^t) = \binom{M}{N} K = \dfrac{M!\,K}{N!\,(M-N)!}$.  (20)
From Equation (20), the action set $\theta^t$ can be converted into an index set, i.e., $\theta^t \in \{0, 1, \dots, n(\theta^t) - 1\}$.
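The sketch below illustrates one way such an index set could be enumerated for the two-UE-per-beam case used later in the simulations: every assignment of disjoint UE pairs to the $N$ beams is combined with a quantized coefficient from $\Omega$. The enumeration convention and names are assumptions; the paper counts the actions with Equation (20).

```python
# Sketch of the joint action set theta = Phi_n x Omega of Equation (19).
from itertools import combinations, permutations

M, N, K = 4, 2, 2
omega = [0.2, 0.4]                               # quantized coefficients of the strong UE (Table 2)

pairs = list(combinations(range(M), K))          # all candidate K-UE pairs
pairings = [p for p in permutations(pairs, N)    # one pair per beam,
            if len({u for pr in p for u in pr}) == N * K]   # beams must use disjoint UEs

actions = [(pairing, a) for pairing in pairings for a in omega]

def decode_action(theta):
    """Integer action index -> (pairing over the N beams, strong-UE coefficient)."""
    return actions[theta]

print(len(actions), decode_action(0))
```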
The choice of an action in RL is determined by the tradeoff between exploitation and exploration. In this paper, the action is chosen by applying the $\epsilon$-greedy policy, which decides, according to $\epsilon$, whether to explore with a random action or to exploit the action with the best value given the current information. The $\epsilon$-greedy policy is as follows:
$\theta^t = \begin{cases} \arg\max_{\theta} Q(s^t, \theta), & \text{with probability } 1 - \epsilon, \\ \text{random action}, & \text{with probability } \epsilon. \end{cases}$  (21)
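A tabular version of this policy is only a few lines; the sketch below assumes the Q-function is stored as a NumPy array of shape (number of states, number of actions).

```python
# Sketch of the epsilon-greedy selection in Equation (21).
import numpy as np

def choose_action(Q, state, epsilon, rng=np.random.default_rng()):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: pick a random action index
    return int(np.argmax(Q[state]))            # exploit: best action for the current state
```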
An important point when designing the Q-learning model is the size of the (action × state) space. As the (action × state) space increases, the RL complexity exponentially increases. The number of quantization levels $L$ of $\hat{\mathbf{h}}_{n,k}$ increases the state space. The number of user pairing sets, which depends on the number of UEs, and the number of quantization levels of the power allocation coefficient determine the action space. The (action × state) space exponentially increases with the number of UEs, as shown in Figure 3. As the quantization level increases, $\hat{\mathbf{h}}_{n,k}$ approaches the actual $\mathbf{h}_{n,k}$; however, increasing the quantization level may be inefficient because the complexity exponentially increases.
Because of the tradeoff between the complexity and the sum rate, it is important to find the optimal quantization level in the RL structure. Figure 4 shows the sum rate for an increasing quantization level when the number of time slots is limited to 100,000. The results show that, when the ES scheme is applied, the sum rate increases and converges to about 17.3 bps/Hz. By contrast, when the proposed Q-learning scheme is applied, the sum rate increases and then decreases after a certain level because of the limited number of time slots (100,000). If the number of time slots is not limited, the sum rate of Q-learning increases as the quantization level increases. However, as the number of quantization levels increases, the number of states increases, and the RL model requires more time for the sum rate to converge. Our objective is to achieve a sum rate similar to that obtained with the ES scheme, while reducing the computational complexity.
In Figure 4, for the case in which the reward of RL is calculated with $\hat{\mathbf{h}}_{n,k}$, the sum rate is highest when the quantization level is 5. Here, we assumed that there are four UEs in the cell. For the case in which the reward of RL is calculated with $\mathbf{h}_{n,k}$, the sum rate is highest when the quantization level is 4. Here, $\hat{R}_{all}$, which is the reward of RL, is calculated with $\hat{\mathbf{h}}_{n,k}$, and $R_{all}$, which is the sum rate, is calculated with $\mathbf{h}_{n,k}$. The difference between $\hat{R}_{all}$ and $R_{all}$ is due to the quantization error in the CSI. Because the objective is to increase the sum rate, we chose a quantization level of 4 in the proposed Q-learning.

3.2. Q-Learning-Based Joint User Pairing and Power Allocation Procedure

The reward is the sum rate of the MIMO–NOMA UEs. From Equation (15), the reward at time $t$ can be expressed as follows:
$\hat{R}_{all} = \sum_{n=1}^{N} \sum_{k=1}^{K} \log_2 \left( 1 + \dfrac{\alpha_{n,k} P_n |\hat{\mathbf{h}}_{n,k}\mathbf{w}_n|^2}{I_{n,k}^{U} + \sigma_n^2} \right)$,  (22)
where $\hat{R}_{all}$ is the sum rate calculated with $\hat{\mathbf{h}}_{n,k}$. In Q-learning, $\hat{R}_{all}$ is continuously updated by the Q-function, whereas $R_{all}$ is calculated with $\mathbf{h}_{n,k}$. The user pairing index and the power allocation coefficient are simultaneously determined by using Q-learning.
Moreover, $Q(s, \theta)$ denotes the Q-function of the BS for system state $s$ and action $\theta$:
$Q(s^t, \theta^t) \leftarrow (1 - \beta)\, Q(s^t, \theta^t) + \beta \left[ r(s^t, \theta^t) + \delta \max_{\theta'} Q(s^{t+1}, \theta') \right]$,  (23)
where the learning rate $\beta \in (0, 1]$ represents the weight of the recent experience in the learning process, and the discount factor $\delta \in [0, 1]$ controls the importance of the immediate and future rewards.
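In tabular form, the update of Equation (23) is a single assignment; the sketch below uses the learning rate and discount factor listed in Table 2 as defaults.

```python
# Sketch of the Q-function update in Equation (23); Q is a 2D NumPy array.
def q_update(Q, s, theta, reward, s_next, beta=0.9999, delta=0.0001):
    Q[s, theta] = (1 - beta) * Q[s, theta] + beta * (reward + delta * Q[s_next].max())
```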
The main structure of the joint user pairing and power allocation based on Q-learning is illustrated in Figure 5 and the algorithm is summarized in Algorithm  1.
Algorithm 1 Joint user pairing and power allocation with Q-learning
1: Set $Q(s^t, \theta^t) = 0$, $\theta^t = 0$, and $s^t = 0$
2: for $t = 1$ to $T$ do
3:     Observe the current state $s^t$
4:     Choose action $\theta^t$ in Equation (19)
5:     Convert the action into the user pairing set $\Phi_n$ and power allocation coefficient $\alpha_{n,k}^t$
6:     for $n = 1$ to $N$ do
7:         for $k = 1$ to $K$ do
8:             Allocate the transmit power $\alpha_{n,k}^t P_n$ and pair $\Phi_n$ for the signal to user $k$
9:         end for
10:     end for
11:     Send the superimposed signal $x^t$ via $N$ antennas
12:     Observe fading, shadowing, and the distance between the BS and UEs
13:     Observe the CSI $\mathbf{h}_{n,k}^t$
14:     Calculate the reward $\hat{R}_{all}$
15:     $s^{t+1} = [\hat{\mathbf{h}}_{n,k}^t]_{1 \le n \le N}$
16:     Update $Q(s^t, \theta^t)$ in Equation (23)
17:     Calculate $R_{all}$ in Equation (15)
18: end for
Algorithm 1 works as follows: First, the Q-learning parameters $Q(s^t, \theta^t)$, $\theta^t$, and $s^t$ are initialized. In Step 3, the BS observes the current state $s^t$. In Step 4, the BS selects the action $\theta^t$ according to the $\epsilon$-greedy policy. In Step 5, the BS converts the selected $\theta^t$ into a user pairing set $\Phi_n$ and the power allocation coefficient $\alpha_{n,k}$. In Step 11, the BS transmits the superimposed signal $x^t$ via $N$ antennas to the UEs. In Step 12, the BS observes fading, shadowing, and the distance between the BS and UEs. In Step 13, the CSI $\mathbf{h}_{n,k}^t$ is observed, and in Step 14, the reward $\hat{R}_{all}$ is calculated. In Step 15, the next state $s^{t+1}$ is quantized. Finally, in Steps 16 and 17, the BS updates $Q(s^t, \theta^t)$ and $R_{all}$ based on Equations (23) and (15), respectively.
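Putting the pieces together, the sketch below mirrors the main loop of Algorithm 1, reusing the helper sketches from above (`actions`, `decode_action`, `choose_action`, `q_update`, `sum_rate`, `state_index`). The UE drop, the parameter values, the data-driven quantization bins, and the use of the unquantized channel for the reward are simplifications and assumptions, not the authors' exact procedure.

```python
# Sketch of the Algorithm 1 training loop; helper sketches above are assumed in scope.
import numpy as np

rng = np.random.default_rng(0)
N, K, M, L = 2, 2, 4, 4
P_n, sigma2, epsilon, T = 5.0, 1.0, 0.1, 100_000

def draw_channels(rng, M, N, eta=3):
    """Placeholder UE drop: new distance and Rayleigh fading for each of the M UEs."""
    d = rng.uniform(50, 500, size=M)
    g = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    return g / d[:, None] ** eta

# Data-driven quantization bins for the state (an assumption, see Section 3.1).
warmup = np.linalg.norm(np.stack([draw_channels(rng, M, N) for _ in range(200)]), axis=-1).ravel()
edges = np.quantile(warmup, np.linspace(0, 1, L + 1)[1:-1])

Q = np.zeros((L ** (N * K), len(actions)))          # Step 1: initialize the Q-table

s = 0
for t in range(T):
    theta = choose_action(Q, s, epsilon, rng)       # Step 4: epsilon-greedy action
    pairing, a_strong = decode_action(theta)        # Step 5: pairing and coefficient
    ch = draw_channels(rng, M, N)                   # Steps 11-13: transmit, observe the CSI
    h = np.empty((N, K, N), dtype=complex)          # per-beam channels from the chosen pairing
    for n, pair in enumerate(pairing):
        order = sorted(pair, key=lambda u: np.linalg.norm(ch[u]))   # weak UE first (Eq. (1))
        h[n] = ch[order]
    alpha = np.tile([1.0 - a_strong, a_strong], (N, 1))
    reward = sum_rate(h, alpha, P_n, sigma2)        # Step 14 (paper uses quantized CSI here)
    s_next = state_index(h, edges)                  # Step 15: quantize the next state
    q_update(Q, s, theta, reward, s_next)           # Step 16: Equation (23)
    s = s_next
```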

4. Numerical Results

We consider a MIMO-NOMA system with one BS. The BS is located at the center of the cell. The UEs are randomly distributed in the cell within a radius of 50 to 500 m. To take the movement and the channel fluctuation of each UE into consideration, the location and the CSI of each UE are randomly generated in every time slot. In addition, two UEs are assumed to be paired in one beam; Equation (15) can then be expressed as follows:
$R_{all} = \sum_{n=1}^{N} \left[ \log_2 \left( 1 + \dfrac{\alpha_{n,1} P_n |\mathbf{h}_{n,1}\mathbf{w}_n|^2}{I_{n,1}^{U} + \sigma_n^2} \right) + \log_2 \left( 1 + \dfrac{\alpha_{n,2} P_n |\mathbf{h}_{n,2}\mathbf{w}_n|^2}{\sigma_n^2} \right) \right]$.  (24)
Because $K = 2$, the power allocation coefficient is quantized into two levels. The power allocation coefficient set $\Omega$ is assumed to be $\Omega = [0.2, 0.4]$. The learning rate of the Q-function is set to 0.9999, and the discount factor is set to 0.0001. The time slot is one TTI, e.g., 1 ms, in an LTE system or a 5G system with 15 kHz subcarrier spacing [27]. At every time slot, the BS observes the CSI of the UEs and performs the user pairing and power allocation. The total number of time slots is 100,000, and the simulation results are obtained by repeating the simulation for 1000 iterations. The simulation parameters used in this paper are listed in Table 2.
The simulation was performed in the following environment: Intel(R) Core i9-9900K CPU @ 3.60 GHz, 16.0 GB RAM, Windows 10, Python 3.7, and a GeForce RTX 2080 Ti GPU.
The performance of the proposed RL-based scheme is compared with that of the following schemes for determining the user pairing and the transmit power of the UEs: the ES, OMA, random selection, and phased RL schemes. In the ES scheme, the user pairing and the transmit power are optimally determined by using the exhaustive search method, and therefore the ES scheme shows the highest performance. In the random selection scheme, the BS randomly determines the user pairing and the transmit power of the UEs. In the OMA scheme, the BS serves only one UE in a beam, and therefore the sum rate is given by [28]
$R_{OMA} = \sum_{n=1}^{N} \sum_{k=1}^{K} \dfrac{1}{k} \log_2 \left( 1 + \dfrac{P_n |\mathbf{h}_{n,k}\mathbf{w}_n|^2}{\sigma_n^2} \right)$.  (25)
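For comparison, a sketch of the OMA baseline rate is given below; it follows the $1/k$ pre-log weight of Equation (25) as written and reuses the identity-precoding assumption of the earlier `sum_rate` sketch.

```python
# Sketch of the OMA baseline in Equation (25); assumes W = I_N.
import numpy as np

def oma_rate(h, P_n, sigma2):
    """h: (N, K, N) channel vectors -> OMA sum rate."""
    N, K, _ = h.shape
    rate = 0.0
    for n in range(N):
        for k in range(K):
            gain = abs(h[n, k, n]) ** 2          # |h_{n,k} w_n|^2 with identity precoding
            rate += (1.0 / (k + 1)) * np.log2(1.0 + P_n * gain / sigma2)   # 1/k weight, k = 1..K
    return rate
```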
In the phased RL-based user pairing and power allocation scheme, the BS sequentially determines the user pairing and the transmit power of the UEs. That is, after pairing the UEs, the BS then determines the transmit power of the UEs. In the phased RL scheme, the Q-function for user pairing is defined as $Q_{UP}(s, \theta_{UP})$ and the Q-function for power allocation is defined as $Q_{PA}(s, \theta_{PA})$. From Equation (17), the action of the user pairing RL is defined as $\theta_{UP} = \Phi_n$. From Equation (18), the action of the power allocation RL is defined as $\theta_{PA} = \alpha_{n,k}$. First, the user pairing RL proceeds, in which the reward, calculated with a fixed power allocation, is used only to update the Q-function. The user pairing set $\Phi_n$ is determined by the BS through $Q_{UP}(s, \theta_{UP})$. In the power allocation RL, the user pairing set $\Phi_n$ is observed as part of the state along with $\hat{\mathbf{h}}_{n,k}$. The power allocation coefficient is determined by the BS through $Q_{PA}(s, \theta_{PA})$. Finally, the BS updates $Q_{PA}(s, \theta_{PA})$ and $R_{all}$. The phased RL-based user pairing and power allocation scheme is summarized in Algorithm 2.
Algorithm 2 Phased RL-based user pairing and power allocation
1: Set $Q_{UP}(s_{UP}^t, \theta_{UP}^t) = 0$, $\theta_{UP}^t = 0$, and $s_{UP}^t = 0$
2: Set $Q_{PA}(s_{PA}^t, \theta_{PA}^t) = 0$, $\theta_{PA}^t = 0$, and $s_{PA}^t = 0$
3: for $t = 1$ to $T$ do
4:     Choose action $\theta_{UP}^t$ in Equation (17)
5:     for $n = 1$ to $N$ do
6:         for $k = 1$ to $K$ do
7:             Allocate the fixed transmit power for the signal to user $k$
8:         end for
9:     end for
10:     Send the superimposed signal $x^t$ via $N$ antennas
11:     Observe $s^t$ and reward $\hat{R}_{UP}^t$
12:     Update $Q_{UP}(s_{UP}^t, \theta_{UP}^t)$ in Equation (23)
13:     Choose action $\theta_{PA}^t$ in Equation (18)
14:     for $n = 1$ to $N$ do
15:         for $k = 1$ to $K$ do
16:             Apply the user pairing $\theta_{UP}^t$
17:             Allocate the transmit power $\alpha_{n,k}^t P_n$ for the signal to user $k$
18:         end for
19:     end for
20:     Observe reward $\hat{R}_{PA}^t$
21:     $s^{t+1} = [\hat{\mathbf{h}}_{n,k}^t]_{1 \le n \le N}$
22:     Update $Q_{PA}(s_{PA}^t, \theta_{PA}^t)$ in Equation (23)
23:     Calculate $R_{all}$ in Equation (15)
24: end for
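A compact sketch of this phased baseline is given below: two separate tabular Q-functions are updated one after the other in every time slot, reusing the helper sketches from Section 3. The fixed power split used during the pairing phase and the way the chosen pairing is folded into the power allocation state are assumptions.

```python
# Sketch of the Algorithm 2 (phased RL) loop; helper sketches above are assumed in scope.
import numpy as np

rng = np.random.default_rng(1)
n_states = L ** (N * K)
Q_up = np.zeros((n_states, len(pairings)))                 # Q_UP(s, theta_UP)
Q_pa = np.zeros((n_states * len(pairings), len(omega)))    # Q_PA(s, theta_PA), state includes Phi_n

s = 0
for t in range(T):
    # Phase 1: user pairing RL with a fixed power split (0.8 / 0.2 is an assumption).
    theta_up = choose_action(Q_up, s, epsilon, rng)
    ch = draw_channels(rng, M, N)
    h = np.empty((N, K, N), dtype=complex)
    for n, pair in enumerate(pairings[theta_up]):
        order = sorted(pair, key=lambda u: np.linalg.norm(ch[u]))
        h[n] = ch[order]
    s_next = state_index(h, edges)
    r_up = sum_rate(h, np.tile([0.8, 0.2], (N, 1)), P_n, sigma2)
    q_update(Q_up, s, theta_up, r_up, s_next)

    # Phase 2: power allocation RL, with the chosen pairing folded into its state.
    s_pa = s_next * len(pairings) + theta_up
    theta_pa = choose_action(Q_pa, s_pa, epsilon, rng)
    a = omega[theta_pa]
    r_pa = sum_rate(h, np.tile([1.0 - a, a], (N, 1)), P_n, sigma2)
    q_update(Q_pa, s_pa, theta_pa, r_pa, s_pa)             # next PA state approximated by s_pa
    s = s_next
```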
Figure 6 shows the sum rate of the RL scheme with respect to the time slot when the number of UEs is 4 and the number of CSI quantization levels is 4. The transmit power of the BS is 43 dBm. In the RL-based scheme, the actions are randomly determined at the beginning, which leads to a lower sum rate. As time elapses, the sum rate of the RL-based scheme increases, and when the time slot reaches about 40,000, it approximately converges to that of the ES scheme with a performance difference of 0.57%. This also means that it takes about 40 s (when the time slot is 1 ms) to achieve a sum rate similar to that of the ES. However, the proposed RL-based scheme can keep up with the changing radio channels of the UEs because the BS continues learning at every time slot. Hence, if the wireless channel environment of the UEs does not change very rapidly, the proposed RL-based scheme can be applied to real-time scenarios. Because of the quantization error, the reward of the RL is lower than the sum rate calculated with $\mathbf{h}_{n,k}$. The numerical results are compared with those of the other schemes in terms of the sum rate calculated with $\mathbf{h}_{n,k}$.
When the transmit power of the BS increases, the sum rate increases, as shown in Figure 7. As the transmit power of the BS increases, the sum rates of all schemes increase. The random selection scheme shows the worst sum rate because the SIC is not perfect. As presented in Figure 7, the proposed scheme shows approximately the same results as the ES, and the phased RL scheme also exhibits a similar sum rate. When the transmit power is 43 dBm, the proposed RL scheme increases the sum rate by about 21.15% and about 41.98% in comparison with the OMA scheme and the random selection scheme, respectively.
Figure 8 shows the sum rate as the number of UEs increases. As the number of UEs increases, the sum rates of all schemes increase and gradually converge. The performance difference between the ES scheme and the proposed scheme slightly increases as the number of UEs increases. For 10 UEs, the performance difference is about 5.48%, which is due to the increased size of the state space. The proposed scheme increases the sum rate by about 13.17% and about 47.67% in comparison with the OMA scheme and the random selection scheme, respectively. However, the proposed scheme and the phased RL scheme show similar performance.
Figure 9 presents the required simulation time as the number of UEs increases. Because the ES scheme investigates all possible actions, its simulation time is extremely high. The results show that the proposed scheme is more efficient than the phased RL scheme in terms of the time complexity. The proposed scheme reduces the time complexity by about 20.97 % compared with the phased RL scheme.
The proposed scheme reduces the computational complexity. The ES scheme evaluates all possible actions and therefore, when the size of the action space is denoted by $n = n(\theta^t)$, the complexity of the ES scheme is represented by $O(n)$. The phased RL scheme sequentially determines the user pairing and the transmit power of the UEs in each pair. Hence, the complexity of the phased RL can be expressed as $2 \cdot O(1)$, because the RL requires a complexity of $O(1)$ after it converges. The proposed RL-based scheme calculates the reward by choosing one action and therefore has a complexity of $O(1)$.

5. Conclusions

In this paper, an RL-based joint user pairing and power allocation scheme for MIMO–NOMA systems was proposed. To reduce the computational complexity of finding the user pairing and the transmit power of the users, Q-learning was applied. The user pairing and the transmit power allocation were simultaneously performed by the Q-learning action. The proposed scheme achieves a sum rate similar to that of the ES scheme with low computational complexity. The proposed scheme reduces the time complexity compared with the phased RL scheme, although they show similar performance in terms of the sum rate. However, as the number of UEs increases, the performance difference between the proposed scheme and the ES scheme slightly increases. In the future, we will apply a DQN to the MIMO-NOMA system in order to reduce the performance difference.

Author Contributions

J.L. contributed to designing the algorithm, performing the simulations, and preparing the manuscript. J.S. led the research project and supervised the activities as the corresponding author. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1F1A1058716 and No. 2020R1F1A1065109). Also, this research was supported by the “HPC Support” project funded by the Korea Ministry of Science and ICT and NIPA.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Saito, Y.; Kishiyama, Y.; Benjebbour, A.; Nakamura, T.; Li, A.; Higuchi, K. Non-orthogonal multiple access (NOMA) for cellular future radio access. In Proceedings of the 2013 IEEE 77th Vehicular Technology Conference (VTC Spring), Dresden, Germany, 2–5 June 2013; pp. 1–5.
2. Dai, L.; Wang, B.; Ding, Z.; Wang, Z.; Chen, S.; Hanzo, L. A survey of non-orthogonal multiple access for 5G. IEEE Commun. Surv. Tutor. 2018, 20, 2294–2323.
3. Jerkovits, T.; Liva, G.; Amat, A.G.I. Improving the decoding threshold of tailbiting spatially coupled LDPC codes by energy shaping. IEEE Commun. Lett. 2018, 22, 660–663.
4. Fang, Y.; Chen, P.; Cai, G.; Lau, F.C.M.; Liew, S.C.; Han, G. Outage-limit-approaching channel coding for future wireless communications: Root-protograph low-density parity-check codes. IEEE Veh. Technol. Mag. 2019, 14, 85–93.
5. Ding, Z.; Adachi, F.; Poor, H.V. The application of MIMO to non-orthogonal multiple access. IEEE Trans. Wirel. Commun. 2016, 15, 537–552.
6. Jia, B.; Hu, H.; Zeng, Y.; Xu, T.; Chen, H. Joint user pairing and power allocation in virtual MIMO systems. IEEE Trans. Wirel. Commun. 2018, 17, 3697–3708.
7. Chen, X.; Gong, F.; Li, G.; Zhang, H.; Song, P. User pairing and pair scheduling in massive MIMO-NOMA systems. IEEE Commun. Lett. 2018, 22, 788–791.
8. Sun, H.; Xu, Y.; Hu, R.Q. A NOMA and MU-MIMO supported cellular network with underlaid D2D communications. In Proceedings of the 2016 IEEE 83rd Vehicular Technology Conference (VTC Spring), Nanjing, China, 15–18 May 2016; pp. 1–5.
9. Sun, Q.; Han, S.; Chin-Lin, I.; Pan, Z. On the ergodic capacity of MIMO NOMA systems. IEEE Wirel. Commun. Lett. 2015, 4, 405–408.
10. Timotheou, S.; Krikidis, I. Fairness for non-orthogonal multiple access in 5G systems. IEEE Signal Process. Lett. 2015, 22, 1647–1651.
11. Guo, J.; Wang, X.; Yang, J.; Zheng, J.; Zhao, B. User pairing and power allocation for downlink non-orthogonal multiple access. In Proceedings of the IEEE Globecom Workshops (GC Wkshps), Washington, DC, USA, 4–8 December 2016; pp. 1–6.
12. Liu, F.; Mähönen, P.; Petrova, M. Proportional fairness-based user pairing and power allocation for non-orthogonal multiple access. In Proceedings of the IEEE International Symposium on Personal, Indoor, and Mobile Radio Communication (PIMRC), Hong Kong, China, 30 August–2 September 2015; pp. 1–5.
13. Zhang, C.; Patras, P.; Haddadi, H. Deep learning in mobile and wireless networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 2224–2287.
14. Luong, N.C.; Hoang, D.T.; Gong, S.; Niyato, D.; Wang, P.; Liang, Y.; Kim, D.I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174.
15. Kim, M.; Kim, N.; Lee, W.; Cho, D. Deep learning-aided SCMA. IEEE Commun. Lett. 2018, 22, 720–723.
16. Doan, K.N.; Vaezi, M.; Shin, W.; Poor, H.V.; Shin, H.; Quek, T.Q.S. Power allocation in cache-aided NOMA systems: Optimization and deep reinforcement learning approaches. IEEE Trans. Commun. 2020, 68, 630–644.
17. Gui, G.; Huang, H.; Song, Y.; Sari, H. Deep learning for an effective nonorthogonal multiple access scheme. IEEE Trans. Veh. Technol. 2018, 67, 8440–8450.
18. Xiao, L.; Li, Y.; Dai, C.; Dai, H.; Poor, H.V. Reinforcement learning-based NOMA power allocation in the presence of smart jamming. IEEE Trans. Veh. Technol. 2018, 67, 3377–3389.
19. Ye, P.; Wang, Y.; Li, J.; Xiao, L. Fast reinforcement learning for anti-jamming communications. arXiv 2020, arXiv:2002.05364.
20. Wang, S.; Lv, T.; Zhang, X. Multi-agent reinforcement learning-based user pairing in multi-carrier NOMA systems. In Proceedings of the IEEE International Conference on Communications Workshops (ICC Workshops), Shanghai, China, 20–24 May 2019; pp. 1–6.
21. He, C.; Hu, Y.; Chen, Y.; Zeng, B. Joint power allocation and channel assignment for NOMA with deep reinforcement learning. IEEE J. Sel. Areas Commun. 2019, 37, 2200–2210.
22. Wang, S.; Liu, H.; Gomes, P.H.; Krishnamachari, B. Deep reinforcement learning for dynamic multichannel access in wireless networks. IEEE Trans. Cognit. Commun. Netw. 2018, 4, 257–265.
23. Mennes, R.; De Figueiredo, F.A.; Latré, S. Multi-agent deep learning for multi-channel access in slotted wireless networks. IEEE Access 2020, 8, 95032–95045.
24. Ahmed, K.I.; Hossain, E. A deep Q-learning method for downlink power allocation in multi-cell networks. arXiv 2019, arXiv:1904.13032.
25. Kang, J.; Kim, I.; Chun, C. Deep learning-based MIMO-NOMA with imperfect SIC decoding. IEEE Syst. J. 2020, 14, 3414–3417.
26. Watkins, C.J.; Dayan, P. Technical note: Q-learning. Mach. Learn. 1992, 8, 279–292.
27. 3GPP. Study on New Radio Access Technology Physical Layer Aspects (Release 14); Technical Report (TR) 38.802, Version 14.2.0; 3rd Generation Partnership Project (3GPP): Valbonne, France, 2017.
28. Ding, Z.; Lei, X.; Karagiannidis, G.K.; Schober, R.; Yuan, J.; Bhargava, V.K. A survey on non-orthogonal multiple access for 5G networks: Research challenges and future trends. IEEE J. Sel. Areas Commun. 2017, 35, 2181–2195.
Figure 1. System model.
Figure 2. Typical reinforcement learning (RL) architecture.
Figure 3. The (action × state) space versus the number of user equipment (UEs).
Figure 4. Sum rate versus the number of channel state information (CSI) quantization levels when the time slot is 100,000.
Figure 5. Illustration of the Q-learning-based joint user pairing and power allocation scheme.
Figure 6. Sum rate of the RL scheme.
Figure 7. Sum rate versus transmit power.
Figure 8. Sum rate versus the number of UEs.
Figure 9. The total simulation time for 1000 iterations versus the number of UEs.
Table 1. Symbols and description.
Symbol | Description
$M$ | Total number of UEs
$N$ | Number of BS antennas
$K$ | Number of UEs in a beam
$P_{BS}$ | Total transmit power of the BS
$P_n$ | Transmit power at the $n$th beam
$s_{n,k}$ | Signal transmitted to the $k$th UE at the $n$th beam
$x_n$ | Superimposed signal at the $n$th beam
$\mathbf{h}_{n,k}$ | Channel vector to the $k$th UE at the $n$th beam
$\hat{\mathbf{h}}_{n,k}$ | Quantized channel vector to the $k$th UE at the $n$th beam
$d_{n,k}$ | Distance between the BS and the $k$th UE at the $n$th beam
$\mathbf{w}_n$ | Precoding vector at the $n$th beam
$\gamma_{n,k}$ | SINR of the $k$th UE at the $n$th beam
$R_{n,k}$ | Data rate of the $k$th UE at the $n$th beam
$R_{all}$ | Sum rate of the MIMO-NOMA system
$\hat{R}_{all}$ | Sum rate of the MIMO-NOMA system using the quantized channel vector
$\Phi_n$ | User pairing set at the $n$th beam
$\alpha_{n,k}$ | Power allocation coefficient of the $k$th UE at the $n$th beam
$\eta$ | Path loss exponent
$n_{n,k}$ | Additive white Gaussian noise (AWGN) to the $k$th UE at the $n$th beam
$L$ | Number of CSI quantization levels
$I_{n,k}^{N}$ | Inter-beam interference to the $k$th UE at the $n$th beam
$I_{n,k}^{U}$ | Intra-beam interference to the $k$th UE at the $n$th beam
$s$ | State of Q-learning
$\theta$ | Action of Q-learning
$r$ | Reward of Q-learning
$\beta$ | Learning rate
$\delta$ | Discount factor
Table 2. Simulation parameters.
Parameter | Value
Total number of UEs, $M$ | 2, 4, 6, 8, 10
Number of transmit antennas, $N$ | 1, 2, 3, 4, 5
Number of UEs in a beam, $K$ | 2
Power allocation coefficient, $\alpha_{n,k}$ | 0.2, 0.4
Path loss exponent, $\eta$ | 3
Learning rate, $\beta$ | 0.9999
Discount factor, $\delta$ | 0.0001
Number of time slots (1 ms each), $T$ | 100,000
Number of iterations, $I$ | 1000
