5.1. User Selection Step
The user selection step is responsible for selecting the users that are eligible for RB allocation. User selection is done per cell.
Depending on the number of users, the number of possible resource block allocation combinations can be very high.
It does not make sense to allocate resources to users that have no packets to be sent to them. Packets should be sent within an acceptable delay, otherwise they are discarded. The selection process therefore considers the buffer status (transmission and retransmission queues) and the head of line (HoL) delay, so that only users with packets to receive are selected. Users are selected in two phases to reduce the state space for the resource allocation step.
In the first phase, users are sorted according to their buffer status. Since a resource block should only be allocated to users that can effectively use it, in order not to waste it, only users that have data to receive are eligible for selection. Both transmission and retransmission queues are considered. Two lists are obtained: the first contains users with data in their transmission queue and the second contains users with data in their retransmission queue. The two lists are then combined into a single list.
In the second phase, users are selected based on their priority, which is defined by the HoL delay. Users whose HoL delay is close to the maximum allowed for the considered service are placed at the top of the list. The number of users selected is greater than the number of resource blocks available.
The selection process is summarized in Algorithm 1.
Algorithm 1 not only cares about QoS but also about fairness, since the HoL delay of a starving user keeps increasing. It also cares about users with bad channel conditions, since users with traffic in their retransmission buffer are considered. Users whose HoL delay is greater than the maximum allowed delay will have some packets discarded by the frequency domain scheduler, which will lower their HoL delay.
Algorithm 1: User selection.
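Read as plain pseudocode, the two-phase selection can be sketched in Python as below; the User fields, the hol_max threshold, and the margin factor are illustrative assumptions, not values taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class User:
    uid: int
    tx_queue: int     # bytes waiting in the transmission queue
    retx_queue: int   # bytes waiting in the retransmission queue
    hol_delay: float  # head-of-line delay (ms)

def select_users(users, num_rbs, hol_max, margin=2):
    """Two-phase selection: keep users with data, then rank them by HoL priority."""
    # Phase 1: only users with data in the transmission or retransmission queue are eligible.
    tx_list = [u for u in users if u.tx_queue > 0]
    retx_list = [u for u in users if u.retx_queue > 0 and u.tx_queue == 0]
    candidates = tx_list + retx_list

    # Phase 2: users whose HoL delay is closest to the maximum allowed delay
    # for the service are placed at the top of the list.
    candidates.sort(key=lambda u: hol_max - u.hol_delay)

    # More users than available RBs are selected (here: a fixed margin factor).
    return candidates[: num_rbs * margin]
```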
5.2. Resource Allocation Step
This step is responsible for allocating resource blocks to users selected in the previous step.
In this step, policy space based reinforcement learning is used instead of value function space to allocate RBs to the users selected in the previous step. Policy space search methods use explicit policies and modify them through search operators [41]. We use a GA as a policy space search, where a policy is represented by a chromosome. An agent interacts with its environment and learns the best policy to allocate RBs to users. The reinforcement learning agent–environment interaction [13] for resource allocation is represented in Figure 2, where $s_t$ is a representation of the state at time $t$ and $a_t$ represents the action taken by the agent at time $t$, which leads to the state $s_{t+1}$ after receiving reward $r_{t+1}$ at time $t+1$. $r_t$ represents the reward received from the previous action. There are two possible rewards: −1 and 1.
The environment is composed of the following parameters: user buffer status and user head of line delay (HoL).
A state represents the number of users with their HoL, their buffer status, the overall packet loss ratio, and the QoS bearers, as in [53].
An agent is a packet scheduler located at each eNodeB; thus we have a multi-agent environment. A policy is an action taken by an agent in a given state. An action chooses, for each RB, the user to which the RB will be allocated. For simplicity, an RB carries one packet.
A reward is then associated with each action taken by the agent. If the action taken by an agent increases the packet loss ratio, a negative reward is given to the agent. If the action decreases the packet loss ratio, a positive reward is given. The two reward values are −1 and +1.
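As a minimal sketch of this reward rule (the handling of an unchanged packet loss ratio is an assumption, since only the increase and decrease cases are specified):

```python
def allocation_reward(plr_before: float, plr_after: float) -> int:
    """+1 if the action decreased the packet loss ratio, -1 if it did not."""
    return 1 if plr_after < plr_before else -1
```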
A single policy represents a chromosome [41]. A good policy should have a fitness value below a given threshold corresponding to the QoS requirements of a given service. The fitness function of the resource allocation GA is the sum of HoL delays.
An RB represents a gene. Its value corresponds to the user to which the RB has been allocated. A chromosome is the set of available RBs that are allocated each transmission time interval (TTI).
Each gene is encoded in decimal to ease the computational process. The size of the population does not vary.
An elitism strategy is used to keep the best solutions, and the least feasible solutions are discarded.
A sample chromosome is shown in Figure 3; it represents a policy used by a sub-agent located in a cell with a bandwidth of 25 RBs and 60 active users. The numbers represent users. We can see that User 60 has been allocated two non-contiguous RBs, User 9 two non-contiguous RBs, User 25 two contiguous RBs, and User 16 two non-contiguous RBs.
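A chromosome can therefore be sketched as an integer array of length 25 whose entries are user identifiers, with the fitness taken as the sum of HoL delays. Summing over the distinct scheduled users and the helper names below are assumptions.

```python
import random

NUM_RBS = 25  # one gene per resource block, as in the Figure 3 example

def random_chromosome(selected_users):
    """One policy: gene i holds the id of the user that RB i is allocated to."""
    return [random.choice(selected_users) for _ in range(NUM_RBS)]

def fitness(chromosome, hol_delay):
    """Resource allocation fitness: the sum of HoL delays (lower is better)."""
    return sum(hol_delay[user] for user in set(chromosome))
```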
The resource allocation can be modeled by a matrix where each row represents a cell and each column represents a resource block to be allocated. A sample resource allocation matrix can be seen in Table 1 for a network composed of $n$ cells and a bandwidth corresponding to $m$ RBs, where $C_i$ identifies cells and $RB_j$ identifies resource blocks. For example, $C_1$ is the first cell and $RB_1$ is the first resource block. $U_{i,j}$ represents user $i$ in cell $j$. A vector is used for the chromosome representation to ease the GA crossover and mutation operations.
An initial population of policies is generated at the beginning of the reinforcement learning process. An agent evaluates each policy through policy evaluation. A fitness function is used to do this.
Each policy is evaluated and improved to obtain the best policy. The best policy is the one that delivers a packet loss ratio less than or equal to the maximum packet loss ratio acceptable according to the QoS requirements [39]. The population of policies is modified through crossover and mutation operations over a defined number of generations, with the aim of converging to the best solution. This corresponds to the learning process during which policies are improved.
During the crossover operation, some chromosomes are selected according to the crossover probability and divided into two groups. Pairs of chromosomes are formed by choosing one chromosome from each of the two groups to be crossed in order to produce new chromosomes. An example of a crossover operation is shown in Figure 4. The number of RBs in each chromosome is 25. In this example, two chromosomes, Child 1 and Child 2, are obtained from two chromosomes, Parent 1 and Parent 2.
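The crossover variant is not stated; a simple single-point crossover over 25-gene chromosomes would look like this sketch:

```python
import random

def crossover(parent1, parent2):
    """Single-point crossover: exchange the tails of two parent chromosomes."""
    point = random.randint(1, len(parent1) - 1)  # cut somewhere inside the chromosome
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2
```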
During the mutation operation, some chromosomes are selected according to the mutation probability to be mutated. Six genes are randomly chosen to be mutated on each selected chromosome, adding diversity to the population.
A gene value represents the user to whom the corresponding RB is allocated. Mutating a gene consists of changing its value; in other words, it means deallocating an RB from a user and allocating it to another user. The RB is allocated to a user in the waiting list, which contains the users selected in the first step of the proposed ICIC scheme that have not been used to form the initial population of chromosomes.
An example of a mutation operation is shown in Figure 5. An initial chromosome is mutated by changing the color of some genes, which means deallocating the RBs previously allocated to the users associated with these genes in the initial chromosome and allocating them to users in the waiting list.
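A sketch of this mutation, re-assigning six randomly chosen genes to users drawn from the waiting list (the uniform draw from the waiting list is an assumption):

```python
import random

GENES_TO_MUTATE = 6

def mutate(chromosome, waiting_list):
    """Deallocate six randomly chosen RBs and reallocate them to waiting-list users."""
    mutated = chromosome[:]
    for pos in random.sample(range(len(mutated)), GENES_TO_MUTATE):
        mutated[pos] = random.choice(waiting_list)  # new user for this RB
    return mutated
```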
The best policy is obtained after selection, crossover, mutation, and evaluation have been repeated for a certain number of generations, after which the learning process terminates.
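Putting the pieces together, the resource allocation learning loop could be sketched as follows, reusing the fitness, crossover, and mutate helpers above; the parent selection rule, elite size, and population handling are assumptions built on the description in this section.

```python
import random

def run_resource_allocation_ga(population, hol_delay, waiting_list,
                               generations, crossover_rate, mutation_rate,
                               elite_size=2):
    """Evolve allocation policies; lower fitness (sum of HoL delays) is better."""
    pop_size = len(population)
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, hol_delay))
        next_gen = population[:elite_size]               # elitism keeps the best policies
        while len(next_gen) < pop_size:
            # Parents drawn from the better half of the population (an assumption).
            p1, p2 = random.sample(population[: pop_size // 2], 2)
            if random.random() < crossover_rate:
                c1, c2 = crossover(p1, p2)
            else:
                c1, c2 = p1[:], p2[:]
            if random.random() < mutation_rate:
                c1 = mutate(c1, waiting_list)
            if random.random() < mutation_rate:
                c2 = mutate(c2, waiting_list)
            next_gen.extend([c1, c2])
        population = next_gen[:pop_size]                 # population size stays constant
    return min(population, key=lambda c: fitness(c, hol_delay))  # best policy found
```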
5.3. Dynamic Power Allocation Step
An agent learns to find the best policy, which maximizes the SINR value of each user in each cell. The environment of the agent is a multi-cell network composed of eNodeBs and the users' geographic locations.
A state represents the location of a user and its SINR value. Only users that have been allocated at least one physical resource block are considered.
The agent is located at a centralized unit that may be hosted on one of the eNodeBs and collects the geographic location of the users of each eNodeB. The agent learns the best policy to set the transmit power of each eNodeB on each physical resource block allocated to users, based on their locations.
A reward is associated with each action taken by the agent. If the action taken by the agent results in an SINR below the minimum acceptable SINR, a negative reward (−1) is given to the agent. If the action results in an SINR above the minimum, a positive reward (+1) is given.
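A corresponding sketch of the power allocation reward (the case where the SINR exactly equals the minimum is an assumption):

```python
def power_reward(obtained_sinr: float, min_sinr: float) -> int:
    """+1 when the obtained SINR exceeds the minimum acceptable SINR, -1 otherwise."""
    return 1 if obtained_sinr > min_sinr else -1
```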
The reinforcement learning agent–environment interaction [13] is presented in Figure 6, where $s_t$, $a_t$, $s_{t+1}$, $r_{t+1}$, and $r_t$ have been described in the resource allocation step.
A single policy represents a chromosome [41]. The transmit power of an eNodeB on an RB represents a gene. A chromosome is the set of transmit powers of each eNodeB on each RB.
Each policy is evaluated through the minimum SINR obtained with the corresponding eNodeB transmit powers. A chromosome sample is shown in Table 2.
The power allocation problem can be seen as a matrix, represented in Table 3 for a network composed of $n$ cells with a bandwidth corresponding to $m$ RBs. In Table 2 and Table 3, $P_{i,j}$ represents the transmit power of the eNodeB in cell $i$ on resource block $j$. For example, $P_{1,1}$ is the transmit power of the eNodeB in cell 1 on the first resource block and $P_{2,1}$ is the transmit power of the eNodeB in cell 2 on the first resource block.
An initial population of policies is generated at the beginning of the learning process. Each gene in a policy is randomly assigned a transmit power value between the minimum and the maximum transmit power. An agent evaluates each policy through policy evaluation. The minimum SINR is used as the fitness function of the power allocation GA; more precisely, the range of SINR values is used as the fitness function. To lower the computation time, the SINR is calculated for each user only on the RBs that have been allocated to that user. Each policy is evaluated during a process in which the initial population is modified through crossover and mutation operations. A good policy is one whose minimum SINR value is greater than the minimum acceptable SINR value. Policy evaluation is repeated over a number of generations in order to find the best policy, and policies are improved during this evaluation.
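A sketch of the policy encoding and fitness for this step, assuming a flat gene layout (one gene per cell/RB pair) and a caller-supplied compute_sinr helper; these names are illustrative, not the paper's implementation.

```python
import random

def random_power_policy(num_cells, num_rbs, p_min, p_max):
    """One policy: gene (i * num_rbs + j) is the transmit power of cell i's eNodeB on RB j."""
    return [random.uniform(p_min, p_max) for _ in range(num_cells * num_rbs)]

def power_fitness(policy, users_with_rbs, compute_sinr):
    """Power allocation fitness: the minimum SINR over users with allocated RBs (higher is better)."""
    return min(compute_sinr(user, policy) for user in users_with_rbs)
```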
The crossover operation is performed similarly to the one in the resource allocation step. An example of a crossover operation is shown in Figure 7 for power allocation in a network with three cells and a bandwidth of 5 MHz corresponding to 25 RBs.
The mutation operation is performed similarly to the one in the resource allocation step. An example of a mutation operation is shown in Figure 8 for a three-cell network with 25 RBs to allocate in each cell.
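For the power chromosome there is no waiting list, so mutation can instead re-draw the selected transmit-power genes between the minimum and maximum power; this is a sketch of one plausible reading, with the six-gene count carried over from the resource allocation step.

```python
import random

def mutate_power(policy, p_min, p_max, genes_to_mutate=6):
    """Re-draw a few transmit-power genes uniformly between the minimum and maximum power."""
    mutated = policy[:]
    for pos in random.sample(range(len(mutated)), genes_to_mutate):
        mutated[pos] = random.uniform(p_min, p_max)
    return mutated
```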
The whole process of dynamic power allocation is summarized in Algorithm 5. The algorithm takes the following input parameters: the user coordinates in the network, the maximum and minimum transmit power, the population size, the number of generations, the crossover rate, and the mutation rate. Details of how the initial population is generated, how policies are evaluated and selected, and how the crossover and mutation operations are performed are given in Algorithm 6.
Algorithm 5: Power allocation summary algorithm.