1. Introduction
5G mobile communication is further increasing the number of users on the wireless Internet, and the number of autonomous vehicles connected to 5G is also growing. Hence, spectrum efficiency has become increasingly important, and non-orthogonal multiple access (NOMA) is one of the most important research areas [1]. NOMA techniques can be categorized into two main classes: power-domain and code-domain NOMA. Code-domain NOMA is a technique for multiplexing users based on codewords; the concept was inspired by the classic code division multiple access (CDMA) system [2]. Code-domain NOMA allows multiple users to share the same time-frequency resources but adopts unique user-specific spreading sequences, which are restricted to sparse sequences or non-orthogonal low cross-correlation sequences. Sparse code multiple access (SCMA) is one of the most important techniques in recent code-domain NOMA. In particular, studies on improving spectral efficiency by using low density parity check (LDPC) codes are actively being conducted [3,4]. Power-domain NOMA is another technique that allows multiple user equipments (UEs) to access the same time/frequency resource, where the signals from the UEs are multiplexed through different power allocation coefficients [5]. The transmit power at the base station (BS) is divided among the UEs: UEs with poor channel conditions receive more transmit power, whereas UEs with better channel conditions receive less. On the receiver side, successive interference cancellation (SIC) is used to recover each signal; the receiver successively decodes and subtracts the received signals until it reaches its desired signal [5]. SIC and power allocation are key techniques in power-domain NOMA systems. In this paper, we study user pairing and power allocation for power-domain NOMA systems.
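As a concrete illustration of these two mechanisms, the following sketch (our own toy example, not the system model used later in this paper) computes the achievable rates of a two-user downlink power-domain NOMA pair with the standard log2(1 + SINR) formulas; the channel gains, power split, and noise power are arbitrary example values.

```python
import numpy as np

# Example two-user downlink power-domain NOMA pair (illustrative values only).
P = 1.0                        # total BS transmit power for the pair (normalized)
noise = 0.01                   # receiver noise power
g_weak, g_strong = 0.2, 1.5    # channel gains: the "weak" UE has the poorer channel

# Power allocation: the weak UE receives the larger share of the transmit power.
a_weak, a_strong = 0.8, 0.2    # coefficients sum to one

# The weak UE decodes its own signal directly, treating the strong UE's signal as interference.
sinr_weak = (a_weak * P * g_weak) / (a_strong * P * g_weak + noise)

# The strong UE first decodes and subtracts the weak UE's signal (SIC),
# then decodes its own signal without intra-pair interference.
sinr_strong = (a_strong * P * g_strong) / noise

rate_weak = np.log2(1 + sinr_weak)
rate_strong = np.log2(1 + sinr_strong)
print(f"weak-UE rate   = {rate_weak:.3f} bps/Hz")
print(f"strong-UE rate = {rate_strong:.3f} bps/Hz")
print(f"pair sum rate  = {rate_weak + rate_strong:.3f} bps/Hz")
```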
Multiple-input multiple-output (MIMO) is another technique for enhancing spectrum efficiency, and applying MIMO to NOMA can yield an even higher spectral efficiency. We consider user pairing and power allocation in MIMO-NOMA systems. Many researchers have already investigated user pairing or power allocation in MIMO-NOMA systems [6,7,8,9,10,11,12]. In [6], a joint user pairing and power allocation scheme in virtual MIMO systems was proposed. First, power allocation was performed with known paired user groups and solved with a multi-level water-filling method; then, joint user pairing and power allocation were conducted with an iterative algorithm based on the analysis in the first step. In [7], the authors proposed user pairing and scheduling algorithms for massive MIMO-NOMA systems to maximize the sum rate by mitigating inter-pair interference. In [8], an optimal NOMA power allocation scheme was proposed for improving the spectrum efficiency of coexisting multi-user (MU)-MIMO and orthogonal multiple access (OMA) device-to-device (D2D) networks. In [9], a two-user downlink MIMO-NOMA power allocation scheme was proposed, where the non-convex MIMO-NOMA power allocation problem was formulated with optimal and suboptimal solutions. Furthermore, an optimal power allocation scheme for maximizing fairness was proposed in [10]; all UEs have the same data rate under the max-min rate criterion. In [11], user pairing was combined with power allocation in downlink NOMA systems: the UEs were sorted according to the channel gain, and then the optimal power allocation was applied to enhance the spectrum efficiency. In [12], the authors proposed a user pairing and power allocation scheme for a NOMA system in which the number of users is limited to two. In [8,9,10,11,12], power allocation schemes were proposed for NOMA systems. The conventional schemes formulated the power allocation problem as a convex optimization problem and tried to find the power by solving it mathematically. In contrast, we apply RL to determine the power of the UEs in each pair of the MIMO-NOMA system. Moreover, while the conventional schemes require high computational complexity to determine the user pairing and power allocation in a MIMO-NOMA system, we find the user pairing and power allocation at the same time with low computational complexity.
Many researchers have applied deep learning (DL) in wireless communication [13,14]; the methods include supervised, unsupervised, and reinforcement learning (RL). Supervised learning requires large datasets for training, which may make it difficult to apply in real-time wireless communication environments. Unsupervised learning classifies data or estimates statistical distributions, so user pairing and power allocation are difficult to handle with it. Q-learning is another option and a widely used model-free RL technique; it can solve the user pairing and power allocation problem through actions and rewards. The channel state information (CSI) between a UE and the BS changes continuously in every time slot owing to the movement of UEs or shadowing between buildings. Therefore, Q-learning, which finds the optimal reward by using the CSI without a dataset, may be more suitable for wireless communications than supervised learning techniques that require many datasets.
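To make the role of Q-learning concrete, the following minimal tabular sketch shows the generic ε-greedy action selection and one-step Q-update that such schemes build on; the state/action sizes, learning rate, discount factor, and exploration rate are placeholder values, not the ones used in this paper.

```python
import numpy as np

# Minimal tabular Q-learning skeleton (generic sketch, not the paper's exact formulation).
n_states, n_actions = 16, 8              # placeholder sizes, e.g., quantized CSI x pairing/power actions
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # placeholder learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def choose_action(state):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    """Standard Q-learning update toward reward + gamma * max_a' Q(s', a')."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])

# One illustrative interaction step with a dummy environment transition.
s = 0
a = choose_action(s)
r = 1.0                                  # dummy reward, e.g., the achieved sum rate
s_next = int(rng.integers(n_states))     # dummy next state, e.g., the next slot's quantized CSI
update(s, a, r, s_next)
```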
Some researchers have applied DL to NOMA systems [15,16,17,18,19,20,21,22,23,24,25]. In [15], a DL-aided sparse code multiple access (SCMA) scheme was proposed in which the mapping of data to resources and the decoding of received signals are conducted with a deep neural network (DNN). In [16], the authors proposed a deep RL-based power allocation with a dual DNN to overcome the noisiness/randomness problem in training data. Moreover, in [17], the NOMA channel was estimated by applying long short-term memory (LSTM), which is used to learn the CSI of the NOMA system through offline and online training. The authors in [18,19] proposed a fast RL method with an ε-greedy based deep Q network (DQN) in jamming environments. Furthermore, user pairing was achieved in [20] by applying multi-agent RL to a multi-carrier NOMA system. In [21], the authors proposed a DQN-based joint power allocation and channel assignment for NOMA systems; they derived a closed-form solution for power allocation and proposed an attention-based DQN for the channel assignment problem. In [22], the dynamic channel access problem was formulated as a partially observable Markov decision process (POMDP), and a DQN was applied to find the access policy via online learning. In [23], the authors proposed a multi-agent DNN approach to predict the spectrum occupation of unknown neighbouring networks in slotted wireless networks, where the DNN was trained online using both RL and supervised learning. The authors in [24] proposed a DQN-based power allocation for a multi-cell network to maximize the total network throughput. In [25], a joint precoding and SIC decoding scheme for MIMO-NOMA systems was presented for the imperfect SIC decoding environment.
The key challenges in MIMO-NOMA are beamforming, optimization, power allocation, user pairing, and SIC ordering; these challenges have been studied jointly or partially under specific performance metrics. MIMO-NOMA can enhance the spectral efficiency of 5G, but it suffers from a fundamental limitation of high computational complexity. This paper aims to increase the sum rate and reduce the computational complexity by using RL-based joint power allocation and user pairing in MIMO-NOMA systems. The contributions of this paper are as follows. First, we propose an RL-based joint user pairing and power allocation scheme for MIMO-NOMA systems. Previous studies investigated the user pairing and power allocation problems independently, or addressed them via mathematical approaches such as convex optimization in simplified systems with a few users. To the best of the authors' knowledge, this study is the first attempt in which RL is applied to perform user pairing and power allocation jointly in a practical system with multiple users. Second, the proposed RL-based scheme reduces the computational complexity. In the conventional schemes, user pairing is performed after the BS has received the location and CSI from the UEs, and then the power is allocated to the UEs in each pair. In this paper, user pairing and power allocation are performed simultaneously through RL when the BS receives the location and CSI from the UEs. Exhaustive search (ES) finds the maximum sum rate, but its computational complexity is extremely high because it enumerates all possible user pairings and all possible power allocation coefficients and then computes the sum rate for every combination. The proposed RL scheme reduces the computational complexity because the sum rate is calculated with a single action selection. Third, the proposed RL-based scheme achieves a sum rate superior to those of OMA and other comparable schemes. At the beginning of the simulation, the sum rate of the proposed scheme is low because the BS selects actions randomly, but as the time slots increase, the learning proceeds and the sum rate approximately converges to that of the ES. Moreover, the proposed scheme is shown to be more efficient than the ES or phased RL schemes in terms of time and computational complexity.
The remainder of this paper is organized as follows: Section 2 describes the system model, and Section 3 presents the proposed RL-based joint user pairing and power allocation in MIMO-NOMA systems. The numerical results are presented in Section 4, and Section 5 concludes this paper. For the sake of clarity, the main symbols used in this paper and their descriptions are summarized in Table 1.
Notations: Vectors are represented by boldface small letters, while matrices are represented by boldface capital letters; $\mathbf{I}$ denotes the identity matrix, and $\hat{h}$ denotes the quantized value of $h$.
4. Numerical Results
We consider a MIMO-NOMA system with one BS located at the center of the cell. The UEs are randomly distributed in the cell at distances of 50 to 500 m from the BS. To take the movement and the channel fluctuation of each UE into consideration, the location and the CSI of each UE are randomly generated in every time slot. In addition, two UEs are assumed to be paired in one beam, and Equation (15) can then be expressed accordingly. Because two UEs are paired in each beam, the power allocation coefficient is quantized into two levels, which define the power allocation coefficient set. The learning rate and the discount factor of the Q-function are set to fixed values. The time slot is one TTI, e.g., 1 ms, in an LTE system or a 5G system with 15 kHz subcarrier spacing [27]. In every time slot, the BS observes the CSI of the UEs and performs the user pairing and power allocation. The total number of time slots is 100,000, and the simulation results are obtained by repeating the simulation 1000 times. The simulation parameters used in this paper are listed in Table 2.
The simulation was performed in the following environment: an Intel Core CPU, Windows 10, Python, and a GeForce RTX 2080 Ti GPU.
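As a rough sketch of how such an environment can be generated per time slot, the snippet below drops UEs uniformly between 50 and 500 m from the BS and draws fresh channel gains every slot; the Rayleigh fading model and the path-loss exponent are our own assumptions for illustration, not necessarily the ones used to produce the results below.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_UE = 4
R_MIN, R_MAX = 50.0, 500.0     # UE distances from the BS in meters (from the setup above)
PATHLOSS_EXP = 3.5             # assumed path-loss exponent (illustrative only)

def drop_ues(num_ue):
    """Place UEs uniformly at random between 50 m and 500 m from the BS."""
    r = np.sqrt(rng.uniform(R_MIN**2, R_MAX**2, num_ue))   # uniform over the annulus area
    theta = rng.uniform(0, 2 * np.pi, num_ue)
    return r, theta

def channel_gains(distances):
    """Per-slot CSI: assumed Rayleigh small-scale fading times a distance-based path loss."""
    fading = rng.exponential(1.0, distances.shape)          # |h|^2 for Rayleigh fading
    return fading * distances ** (-PATHLOSS_EXP)

for t in range(3):                                          # a few example time slots
    distances, _ = drop_ues(NUM_UE)                         # locations are re-drawn every slot
    gains = channel_gains(distances)
    print(f"slot {t}: distances (m) = {np.round(distances, 1)}, gains = {gains}")
```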
The performance of the proposed RL-based scheme is compared with the following schemes, which determine the user pairing and the transmit power of the UEs: the ES, OMA, random selection, and phased RL schemes. In the ES scheme, the user pairing and the transmit power are optimally determined by using the exhaustive search method, and therefore the ES scheme shows the highest performance. In the random selection scheme, the BS randomly determines the user pairing and the transmit power of the UEs. In the OMA scheme, the BS serves only one UE in a beam, and therefore the sum rate is given in [28].
In the phased RL-based user pairing and power allocation scheme, the BS sequentially determines the user pairing and the transmit power of the UEs; that is, the BS first pairs the UEs and then determines their transmit power. In the phased RL scheme, separate Q-functions are defined for user pairing and for power allocation. The action of the user pairing RL is defined in Equation (17), and the action of the power allocation RL is defined in Equation (18). First, the user pairing RL proceeds, in which the rewards are used only to update the user pairing Q-function; the reward is calculated with a fixed power allocation. The BS determines the user pairing set through the user pairing Q-function. In the power allocation RL, the determined user pairing set is observed as part of the state along with the CSI, and the power allocation coefficient is determined by the BS through the power allocation Q-function. Finally, the BS updates both Q-functions. The phased RL-based user pairing and power allocation scheme is summarized in Algorithm 2.
Algorithm 2 Phased RL-based user pairing and power allocation
1:  Initialize the user pairing Q-function and the power allocation Q-function
2:  Initialize the state and the action sets
3:  for t = 1 to T do
4:      Choose the user pairing action according to Equation (17)
5:      for n = 1 to N do
6:          for k = 1 to K do
7:              Allocate the fixed transmit power for the signal to user k
8:          end for
9:      end for
10:     Send the superimposed signal via N antennas
11:     Observe the next state and the reward
12:     Update the user pairing Q-function according to Equation (23)
13:     Choose the power allocation action according to Equation (18)
14:     for n = 1 to N do
15:         for k = 1 to K do
16:             Apply the user pairing
17:             Allocate the transmit power for the signal to user k
18:         end for
19:     end for
20:     Observe the reward
21:     Update the state
22:     Update the power allocation Q-function according to Equation (23)
23:     Calculate the sum rate with Equation (15)
24: end for
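A compact sketch of this two-stage procedure is given below: it keeps two separate Q-tables and, in each time slot, first selects a pairing action with a fixed power split and then selects a power-allocation action for that pairing. The environment interface, table sizes, and hyperparameters are placeholders, and the reward stands in for the sum rate of Equation (15) in the actual system.

```python
import numpy as np

rng = np.random.default_rng(2)
N_STATES, N_PAIRINGS, N_POWER_LEVELS = 16, 6, 4    # placeholder sizes
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1                  # placeholder hyperparameters

Q_pair = np.zeros((N_STATES, N_PAIRINGS))                     # user-pairing Q-function
Q_power = np.zeros((N_STATES, N_PAIRINGS, N_POWER_LEVELS))    # power-allocation Q-function

def eps_greedy(q_row):
    """Epsilon-greedy action selection over one row of a Q-table."""
    return int(rng.integers(len(q_row))) if rng.random() < EPS else int(np.argmax(q_row))

def sum_rate(state, pairing, power_level=None):
    """Stand-in reward; in the real system this would be the sum rate of Equation (15)."""
    return rng.random()

for t in range(3):                                 # a few illustrative time slots
    s = int(rng.integers(N_STATES))                # quantized CSI observed at slot t
    s_next = int(rng.integers(N_STATES))           # quantized CSI at slot t+1 (dummy transition)

    # Stage 1: user-pairing RL with a fixed power allocation.
    pairing = eps_greedy(Q_pair[s])
    r_pair = sum_rate(s, pairing)                  # reward under the fixed power split
    Q_pair[s, pairing] += ALPHA * (r_pair + GAMMA * np.max(Q_pair[s_next]) - Q_pair[s, pairing])

    # Stage 2: power-allocation RL, with the chosen pairing as part of the state.
    power = eps_greedy(Q_power[s, pairing])
    r_power = sum_rate(s, pairing, power)          # reward under the selected power level
    Q_power[s, pairing, power] += ALPHA * (
        r_power + GAMMA * np.max(Q_power[s_next, pairing]) - Q_power[s, pairing, power]
    )
```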
Figure 6 shows the sum rate of the RL scheme with respect to the time slot when the number of UEs is 4, the quantization level of the CSI is 4, and the transmit power of the BS is 43 dBm. In the RL-based scheme, the actions are determined randomly at first, which leads to a low sum rate. As time elapses, the sum rate of the RL-based scheme increases, and when the time slot reaches about 40,000, it approximately converges to that of the ES scheme with only a small performance difference. This means that it takes about 40 s (when the time slot is 1 ms) to achieve a sum rate similar to that of the ES. However, the proposed RL-based scheme can keep up with the changing radio channel of the UEs because the BS continuously trains in every time slot. Hence, if the wireless channel environment of the UEs does not change very rapidly, the proposed RL-based scheme can be applied to real-time scenarios. Because of the quantization error, the reward of the RL is lower than the sum rate calculated with the unquantized CSI; the numerical results are therefore compared with those of the other schemes in terms of the sum rate calculated with the unquantized CSI.
Figure 7 shows the sum rate as the transmit power of the BS increases; the sum rates of all schemes increase with the transmit power. The random selection scheme shows the worst sum rate because the SIC is not perfect. As presented in Figure 7, the proposed scheme shows approximately the same results as the ES, and the phased RL scheme also exhibits a similar sum rate. When the transmit power is 43 dBm, the proposed RL scheme achieves a considerably higher sum rate than the OMA scheme and the random selection scheme.
Figure 8 shows the sum rate as the number of UEs increases. As the number of UEs increases, the sum rates of all schemes increase and gradually converge. The performance difference between the ES scheme and the proposed scheme slightly increases with the number of UEs; for 10 UEs, a small gap remains, which is due to the increased size of the state space. The proposed scheme achieves a noticeably higher sum rate than the OMA scheme and the random selection scheme, while the proposed scheme and the phased RL scheme show similar performance.
Figure 9 presents the required simulation time as the number of UEs increases. Because the ES scheme investigates all possible actions, its simulation time is extremely high. The results show that the proposed scheme is more efficient than the phased RL scheme in terms of time complexity, reducing the simulation time compared with the phased RL scheme.
The proposed scheme also reduces the computational complexity. The ES scheme evaluates all possible actions; therefore, when the action space is denoted by $\mathcal{A}$, the complexity of the ES scheme is $O(|\mathcal{A}|)$. The phased RL scheme sequentially determines the user pairing and the transmit power of the UEs in each pair; hence, its complexity can be expressed as $2 \times O(1)$, because each RL stage requires a complexity of $O(1)$ after it converges. The proposed RL-based scheme calculates the reward by choosing a single action and therefore has a complexity of $O(1)$.