1. Introduction
With the continuous development of new-generation mobile communication technology, the Internet of Vehicles (IoV) has attracted extensive attention worldwide in recent years because of its potential to promote the development of intelligent transportation systems (ITSs) and smart cities. Technically, the IoV connects vehicles to the mobile network to realize full connectivity in vehicle-to-infrastructure (V2I), vehicle-to-pedestrian (V2P), vehicle-to-network (V2N), and vehicle-to-vehicle (V2V) communication. Spectrum resources are a necessary condition for realizing all of the above vehicular communications.
However, with the continuous expansion of IoV applications and the growth of communication requirements, the existing spectrum resources are clearly insufficient to meet all the communication demands. Therefore, in order to guarantee low-delay and high-reliability IoV communication services, a new spectrum resource allocation scheme is needed [1].
Studies in the IoV area initially used traditional schemes such as mathematical modeling. Compared with the traditional cellular network, the uncertain network environment of the IoV, including high-speed moving vehicles and time-varying channels, has brought unprecedented challenges to spectrum resource allocation. Concerning the dynamic allocation of spectrum resources among high-speed vehicles in the IoV, a dynamic spectrum allocation algorithm based on channel feedback and a graph-theoretic coloring model was proposed in [2]. A feedback matrix is defined to analyze the occupation and communication state of the channel, and each vehicle node judges whether the current channel is available from the master node according to the parameter values returned in the channel feedback matrix. To adapt to an IoV communication environment with a changing number of vehicles, a three-step cognitive spectrum allocation algorithm based on a clustering structure was proposed in [3]. In addition, a spectrum sharing scheme designed for slowly changing large-scale fading channels was proposed in [4,5], which maximizes the throughput of the V2I links and reduces the network signaling overhead. In this scheme, V2I links can share spectrum resources with V2V links, and resources can also be shared between two V2V links.
In recent years, researchers have begun to use deep learning and reinforcement learning to model and solve the IoV spectrum resource allocation problem in unknown, dynamic vehicular environments [6]. For example, to meet the dynamic and diverse needs of different entities, a Q-learning framework was proposed in [7] to solve the resource allocation problem in the vehicular cloud environment. In [8], based on the deep Q-network (DQN), a joint caching and computing spectrum resource scheme was proposed for an unknown number of spectrum resources. To further address the high mobility and the centralized management of spectrum resources in most vehicular environments, a hybrid spectrum multiplexing and power allocation solution for vehicular communications was proposed in [9], and a method based on a convolutional neural network (CNN) was developed to achieve real-time decisions on power allocation and spectrum reuse. In [10], a distributed multi-agent spectrum resource allocation scheme was proposed based on DQN theory, where each V2V link is treated as an agent that periodically observes and explores the vehicular environment. Based on its own observations and rewards, each agent then learns how to independently select its transmit power and spectrum. In [11], aiming to solve the joint load and resource allocation problem in the IoV, a Q-learning-based solution that reasonably allocates load, controls transmission power, plans sub-channels, and assigns spectrum resources was proposed, which effectively reduces the system overhead compared with other algorithms. High-density vehicular environments lead to a high-dimensional action space. Therefore, in [12], a multi-agent method based on the deep deterministic policy gradient (DDPG) was used to study the spectrum resource allocation of V2I and V2V links under non-orthogonal multiple access, maximizing the V2I link delivery rate while meeting the strict delay and reliability constraints of V2V communication. To sum up, researchers mainly use the DQN and DDPG algorithms to solve the dynamic spectrum resource allocation problem in the IoV. However, these two algorithms have limitations: DQN only performs well in discrete, low-dimensional action spaces, and although DDPG adopts the actor critic (AC) framework and is therefore suitable for continuous, high-dimensional action spaces, it suffers from poor stability because it uses a deterministic behavior policy [13]. In short, the existing algorithms either only handle discrete, low-dimensional environments or perform poorly in continuous, complex environments. The method proposed in this paper can overcome these issues.
To deal with the continuously changing vehicular network environment, a fully cooperative soft actor critic (SAC) algorithm is proposed in this paper, which achieves a larger V2I capacity and a higher V2V payload delivery success probability than DQN and DDPG. In addition, in order to reduce the complexity, a parameter-sharing soft actor critic (PSSAC) algorithm is proposed, which performs well in the vehicular communication environment with low complexity.
The rest of this paper is organized as follows. In Section 2, the system model is described in detail. Section 3 introduces the SAC-based allocation scheme. Section 4 introduces the two proposed algorithms. The simulation results are provided in Section 5 and the conclusion is given in Section 6.
2. System Model
The IoV model is shown in Figure 1, which consists of several cars, a single base station (BS), X V2I links, and Y V2V links. To ensure high-quality V2I communication, it is assumed that each V2I link is pre-allocated a different orthogonal frequency spectrum subcarrier (sub-band) using spectrum sensing technology, so that interference between different V2I links in the IoV environment is eliminated. We assume that each V2I link has a fixed subcarrier allocation, and the transmission power of all V2I links is fixed to $P^c$. Since the V2V links and V2I links share the same spectrum resources, each V2V link needs to select a specific sub-band and transmission power to improve its communication quality. Therefore, this paper focuses on the design of a spectrum resource and power allocation scheme for the V2V links in the IoV to maximize the total transmission rate of the V2I and V2V links.
Suppose that the channel gain of a V2V link is composed of small-scale and large-scale fading only, where $\alpha_y$ represents the frequency-independent large-scale fading effect, namely the shadowing and path loss, and $h_y[x]$ represents the frequency-dependent small-scale fading channel gain. Therefore, the channel power gain when the y-th V2V link occupies the x-th subcarrier can be expressed as

$g_y[x] = \alpha_y h_y[x].$ (1)

Thus, the received signal-to-interference-plus-noise ratios corresponding to the x-th V2I link and the y-th V2V link using the x-th sub-band can be, respectively, expressed as

$\gamma_x^c = \dfrac{P^c \hat{g}_{x,B}[x]}{\sigma^2 + \sum_{y} \rho_y[x] P_y^d[x] \hat{g}_{y,B}[x]}$ (2)

and

$\gamma_y^d[x] = \dfrac{P_y^d[x] g_y[x]}{\sigma^2 + I_y[x]},$ (3)

where $\sigma^2$ denotes the noise power, $P^c$ is the V2I transmit power, $P_y^d[x]$ is the transmit power of the y-th V2V link using the x-th sub-band, $\hat{g}_{y,B}[x]$ is the interference channel between the BS and the y-th V2V transmitter over the x-th sub-band, $\hat{g}_{x,B}[x]$ is the channel between the BS and the x-th V2I transmitter over the x-th sub-band, $\rho_y[x]$ denotes the binary allocation indicator with $\rho_y[x] = 1$ meaning that the y-th V2V link is using the x-th sub-band and $\rho_y[x] = 0$ otherwise, and $I_y[x]$ denotes the interference power given by

$I_y[x] = P^c \tilde{g}_{x,y}[x] + \sum_{y' \neq y} \rho_{y'}[x] P_{y'}^d[x] g_{y',y}[x],$ (4)

where $\tilde{g}_{x,y}[x]$ is the channel between the y-th V2V receiver and the x-th V2I transmitter over the x-th sub-band, and $g_{y',y}[x]$ is the channel between the y-th V2V receiver and the y'-th V2V transmitter over the x-th sub-band. Every V2V link is assumed to access only one sub-band, i.e., $\sum_{x} \rho_y[x] \leq 1$.
The capacities of the x-th V2I link and the y-th V2V link over the x-th sub-band are then expressed as

$C_x^c = W \log_2\left(1 + \gamma_x^c\right)$ (5)

and

$C_y^d[x] = W \log_2\left(1 + \gamma_y^d[x]\right),$ (6)

where W denotes the bandwidth of each spectrum sub-band.
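For concreteness, the following minimal NumPy sketch evaluates (2)-(6) for a toy two-link configuration; all channel gains, powers, the noise level, and the bandwidth are illustrative placeholders rather than values used in this paper:

import numpy as np

X, Y = 2, 2                     # number of V2I links / V2V links
W = 1e6                         # sub-band bandwidth [Hz]
sigma2 = 1e-13                  # noise power
P_c = 0.2                       # common V2I transmit power [W]
P_d = np.array([0.1, 0.05])     # V2V transmit power of each link y [W]

rho = np.array([[1, 0],         # rho[y, x] = 1 if V2V link y reuses sub-band x
                [0, 1]])

g_xB = np.array([2e-6, 3e-6])            # V2I tx x -> BS, on its own sub-band x
g_yB = np.array([[1e-8, 2e-9],
                 [3e-9, 2e-8]])          # V2V tx y -> BS, per sub-band x
g_d  = np.array([[5e-6, 4e-6],
                 [6e-6, 4e-6]])          # desired V2V gain g_y[x]
g_cv = np.array([[3e-8, 1e-8],
                 [2e-8, 2e-8]])          # V2I tx x -> V2V rx y (indexed [x, y])
g_vv = np.zeros((Y, Y, X))               # V2V tx y' -> V2V rx y, per sub-band x

# Eqs. (2) and (5): V2I SINR and capacity on each sub-band x.
interference_at_bs = (rho * P_d[:, None] * g_yB).sum(axis=0)
gamma_c = P_c * g_xB / (sigma2 + interference_at_bs)
C_c = W * np.log2(1.0 + gamma_c)

# Eqs. (3), (4), (6): V2V SINR and capacity on the selected sub-band.
C_d = np.zeros(Y)
for y in range(Y):
    x = int(np.argmax(rho[y]))           # sub-band selected by V2V link y
    I = P_c * g_cv[x, y] + sum(rho[yp, x] * P_d[yp] * g_vv[yp, y, x]
                               for yp in range(Y) if yp != y)
    gamma_d = P_d[y] * g_d[y, x] / (sigma2 + I)
    C_d[y] = W * np.log2(1.0 + gamma_d)

print("V2I capacities [bit/s]:", C_c)
print("V2V capacities [bit/s]:", C_d)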
On the one hand, the sum capacity of the V2I links needs to be maximized because the V2I links support high-data-rate mobile communication services. On the other hand, the V2V links are designed to support the sending and dependable receiving of important messages. In [10], such a requirement is mathematically modeled as the delivery of packets of size B within the time budget T, i.e.,

$\Pr\left\{ \sum_{t=1}^{T} \sum_{x} \rho_y[x] C_y^d[x, t] \geq B / \Delta T \right\}, \quad y \in \{1, \dots, Y\},$ (7)

where $C_y^d[x, t]$ denotes the capacity of the y-th V2V link at time slot t, B is the V2V payload size, and $\Delta T$ is the channel coherence time. The objective is to maximize the payload delivery rate of the V2V links and the sum capacity of all V2I links.
Because of the mobility, a distributed V2V resource allocation scheme is preferable to a centralized control scheme in a vehicular environment. A key challenge is then how to coordinate the actions of all V2V links so that they do not act selfishly in their own interest. To address this problem, we propose deep reinforcement learning-based V2V spectrum allocation algorithms in the next section.
3. SAC-Based Resource Allocation
Unlike traditional distributed optimization algorithms, reinforcement learning can solve the sequential decision-making problem in the dynamic vehicular networking environment and enables the agent to explore the unknown environment by trial and error in search of the policy with the largest return. The SAC algorithm is a reinforcement learning method proposed by T. Haarnoja et al. [14] in 2018 based on the AC framework. Its main idea is to add an entropy term to the original reward to encourage exploration, and then to train a behavior policy that maximizes the entropy-augmented reward. Because it retains the randomness of the behavior policy to the greatest extent, it improves the agent's perception of the environment and enables the agent to adaptively adjust its policy in a vehicular networking environment with changing channel conditions, which is more conducive to reasonable spectrum selection. Owing to these advantages of SAC, this paper uses the idea of SAC to propose a new dynamic spectrum resource allocation algorithm for the IoV [10,15].
To do this, it is necessary to establish a Markov decision process (MDP) model [16], which is shown in Figure 2. Each V2V link acts as an agent interacting with the environment in real time: it collects the state information of the environment at each time step, makes decisions according to the vehicle conditions and requirements of the current environment, and obtains a reward.
3.1. State and Observation Space
As an agent, each V2V link y explores the unknown vehicular environment in the multi-agent resource allocation scenario [17,18], and the resource allocation problem can be mathematically modeled as an MDP. As shown in Figure 2, given the current state $S_t$, each V2V agent y obtains an observation $O_t^y$ of the environment at each time step t, and then chooses an action $a_t^y$ to form the joint action $\mathbf{a}_t$. Subsequently, each agent receives the same reward $r_{t+1}$ and the vehicular environment enters the next state $S_{t+1}$ with probability $P(S_{t+1} \mid S_t, \mathbf{a}_t)$. Then, each V2V agent receives the next observation $O_{t+1}^y$.
The real environment state, which includes all agents' behaviors and channel conditions, is unknown to each agent; the agents can only observe the environment through an observation function. The observation space of V2V agent y includes its own channel gain $g_y[x]$, the interference channels from the other V2V transmitters $g_{y',y}[x]$, its interference channel to the BS $\hat{g}_{y,B}[x]$, the interference channel from all the V2I transmitters $\tilde{g}_{x,y}[x]$, the received interference power $I_y[x]$, the remaining payload of the V2V link $B_y$, and the remaining time budget $T_y$.
Apart from $\hat{g}_{y,B}[x]$, such channel information can be accurately estimated by the y-th V2V link receiver at the start of time slot t [19]. Therefore, the observation function can be defined as

$O_t^y = \left\{ \{ g_y[x] \}_{x}, \{ g_{y',y}[x] \}_{x}, \{ \hat{g}_{y,B}[x] \}_{x}, \{ \tilde{g}_{x,y}[x] \}_{x}, \{ I_y[x] \}_{x}, B_y, T_y \right\},$ (8)

where $x \in \{1, \dots, X\}$.
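For illustration, the sketch below shows one possible way to flatten the observation in (8) into a feature vector for the neural networks; the array shapes, the dB normalization, and the helper name build_observation are assumptions rather than details given in this paper:

import numpy as np

def build_observation(g_d, g_vv_in, g_toBS, g_cv_in, I_prev,
                      remaining_bits, remaining_time):
    """Flatten the quantities in Eq. (8) into one observation vector.

    g_d        : (X,) desired V2V channel gain per sub-band
    g_vv_in    : (X,) aggregate interference channel from other V2V transmitters
    g_toBS     : (X,) channel from this V2V transmitter to the BS
    g_cv_in    : (X,) interference channel from the V2I transmitters
    I_prev     : (X,) interference power measured in the previous slot
    remaining_bits, remaining_time : scalars B_y and T_y
    """
    def to_db(x):
        # Work in dB to compress the large dynamic range of channel gains
        # (a common trick; whether the paper does this is an assumption).
        return 10.0 * np.log10(np.asarray(x) + 1e-20)

    return np.concatenate([to_db(g_d), to_db(g_vv_in), to_db(g_toBS),
                           to_db(g_cv_in), to_db(I_prev),
                           [remaining_bits, remaining_time]])

# Hypothetical usage with X = 4 sub-bands.
X = 4
obs = build_observation(np.random.rand(X) * 1e-6, np.random.rand(X) * 1e-8,
                        np.random.rand(X) * 1e-8, np.random.rand(X) * 1e-8,
                        np.random.rand(X) * 1e-12, 2 * 1060 * 8, 100)
print(obs.shape)   # (5 * X + 2,)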
3.2. Action Space
In the spectrum resource allocation of the multi-agent vehicular network, each V2V agent mainly selects its sub-band and its own transmit power. In this paper, the transmit power is chosen from −100 dBm to 23 dBm. The action function is defined as

$a_t^y = \left\{ x, P_y^d \right\}, \quad x \in \{1, \dots, X\}, \; P_y^d \in [-100, 23] \text{ dBm}.$ (9)
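One common way to realize (9) with a policy over a finite action set is to enumerate every (sub-band, power-level) pair; the sketch below is such a hypothetical discretization, and the four power levels are assumed for illustration since the paper only states the range from −100 dBm to 23 dBm:

# Hypothetical discretization of the action space in Eq. (9): each action index
# maps to one (sub-band, transmit-power) pair.
NUM_SUBBANDS = 4
POWER_LEVELS_DBM = [23, 10, 5, -100]   # -100 dBm effectively means "keep silent"

def index_to_action(a):
    """Map a flat action index to (sub-band x, transmit power in dBm)."""
    subband = a // len(POWER_LEVELS_DBM)
    power_dbm = POWER_LEVELS_DBM[a % len(POWER_LEVELS_DBM)]
    return subband, power_dbm

ACTION_DIM = NUM_SUBBANDS * len(POWER_LEVELS_DBM)
print([index_to_action(a) for a in range(ACTION_DIM)][:5])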
3.3. Reward Design
The advantage of reinforcement learning for solving the optimization problem lies in the reward design. With an appropriate reward design, the system is trained at each episode and its performance improves. As mentioned above, the objectives in this paper are to maximize the sum V2I capacity and to improve the transmission success probability of the V2V links within the constrained time T.
To achieve the first goal, the sum capacity of all V2I links, $\sum_{x} C_x^c$, with $C_x^c$ defined in (5), forms one part of the reward. To achieve the other objective, for each agent y, the reward $L_y(t)$ is set to the V2V transmission rate until all the payload is delivered; after that, the reward is set to a constant $\beta$, which is larger than the largest V2V transmission rate. At each time step t, the V2V-related reward is therefore set as

$L_y(t) = \begin{cases} \sum_{x} \rho_y[x] C_y^d[x, t], & B_y(t) > 0, \\ \beta, & \text{otherwise}. \end{cases}$ (10)
The elements of the reward function are not constants. Only when an agent finishes its transmission does it obtain the fixed reward $\beta$; under other observations, the agents receive different rewards. The agents never receive a negative reward: when an agent performs a good action it receives a large reward, whereas a less good action yields a small reward. The objective is to find an optimal allocation that selects the optimal action at each state so as to maximize the total reward. It should be noted that the constant $\beta$ in the above reward function is set to balance the final training goal and the actual training efficiency. If only the final goal were considered, an agent would obtain zero reward before transmitting all of its payload. However, in the actual training process, it is found that such a design seriously hinders learning: with such a sparse reward, the agents keep obtaining zero reward in the early stage of training, learn nothing useful, and the system performance becomes poor. To avoid this, we add some prior experience into the reward. In practice, $\beta$ is a hyperparameter that needs to be set empirically. According to our training experience, $\beta$ should be larger than the largest V2V transmission rate but less than twice the largest V2V transmission rate. Therefore, we design the second part of the reward in (10) to solve this issue.
As a result, to achieve the two objectives, the reward function is designed as

$r_t = \lambda_c \sum_{x} C_x^c + \lambda_d \sum_{y} L_y(t),$ (11)

where $\lambda_c$ and $\lambda_d$ are positive weights balancing the V2I and V2V goals. It should be noted that all V2V agents receive the same reward so that a cooperative policy among all the agents is encouraged.
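The reward in (10) and (11) can be computed directly from the per-slot capacities; the following hedged sketch illustrates this, where the values of beta, lambda_c, and lambda_d are placeholders rather than the ones used in the experiments:

import numpy as np

def v2v_reward(rate_y, remaining_bits_y, beta):
    """Per-agent V2V reward L_y(t) in Eq. (10): the achieved V2V rate while the
    payload is still being delivered, and the constant beta afterwards."""
    return rate_y if remaining_bits_y > 0 else beta

def total_reward(C_c, v2v_rates, remaining_bits, beta=3.0,
                 lambda_c=0.1, lambda_d=0.9):
    """Common reward r_t in Eq. (11). beta, lambda_c, and lambda_d are
    hyperparameters; the values here are assumptions, not the paper's."""
    v2i_part = np.sum(C_c)
    v2v_part = sum(v2v_reward(r, b, beta)
                   for r, b in zip(v2v_rates, remaining_bits))
    return lambda_c * v2i_part + lambda_d * v2v_part

# Hypothetical usage: two V2I links, two V2V agents (one already finished).
print(total_reward(C_c=np.array([2.1, 1.8]), v2v_rates=[1.2, 0.9],
                   remaining_bits=[500.0, 0.0]))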
4. Learning Algorithm
In this paper, SAC is used to solve the spectrum resource allocation problem. The basic idea is as follows: firstly, a policy network that generates the action strategy and soft Q networks that evaluate the policy are established. Then, at each step, the state, next state, action, and reward generated by each V2V link in the IoV are stored as an experience in the memory. Finally, the networks are trained by minimizing their loss functions through backpropagation to obtain a better resource allocation strategy [14,20].
The optimization objective of reinforcement learning is to find an optimal policy that maximizes the cumulative return. The optimization objective of the SAC algorithm is to maximize not only the cumulative reward but also the entropy, which is expressed as

$\pi^* = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ \gamma^t \left( r(s_t, a_t) + \alpha \mathcal{H}\left(\pi(\cdot \mid s_t)\right) \right) \right]$ (12)

with

$\mathcal{H}(x) = \mathbb{E}_{x \sim P} \left[ -\log P(x) \right],$ (13)

where $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy denoting the degree of randomization of the current policy $\pi$, $\alpha$ is the temperature parameter balancing the weight of the reward against the stochasticity of the optimal policy, and $\gamma$ is the discount parameter. The probability density function of the random variable x is P, so that the entropy of x can be determined by (13).
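For a stochastic policy over a finite action set, the entropy in (13) can be evaluated in closed form; the short sketch below, which assumes a discrete action distribution, contrasts a maximally random policy with a nearly deterministic one:

import numpy as np

def policy_entropy(probs, eps=1e-12):
    """Entropy H(pi(.|s)) of Eq. (13) for a discrete action distribution."""
    probs = np.asarray(probs)
    return float(-np.sum(probs * np.log(probs + eps)))

uniform = np.full(16, 1 / 16)          # maximally random policy
greedy = np.eye(16)[0]                 # near-deterministic policy
print(policy_entropy(uniform), policy_entropy(greedy))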
According to Bellman's recurrence equation, the function used to evaluate the value of a behavior policy in reinforcement learning is

$Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1}, a_{t+1}} \left[ Q(s_{t+1}, a_{t+1}) \right].$ (14)

The Q value of state $s_t$ taking action $a_t$ is determined by the sum of the reward $r(s_t, a_t)$ and the discounted expectation of the Q value of state $s_{t+1}$ taking action $a_{t+1}$. In SAC, the entropy of the policy shown in (13) also needs to be considered. Thereafter, the new Q value function becomes

$Q_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim D,\, a_{t+1} \sim \pi} \left[ Q_{\mathrm{soft}}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1}) \right],$ (15)

where $\mathbb{E}_{s_{t+1} \sim D,\, a_{t+1} \sim \pi}[\cdot]$ denotes the expectation over the next state $s_{t+1}$ drawn from the memory D and the action $a_{t+1}$ drawn from the current policy.
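The difference between the standard backup (14) and the soft backup (15) can be seen on a single transition; the numbers in this sketch are arbitrary placeholders:

import math

r, gamma, alpha = 1.0, 0.99, 0.2
q_next = 2.0                 # Q(s_{t+1}, a_{t+1}) for the sampled next action
log_pi_next = math.log(0.25) # log pi(a_{t+1}|s_{t+1}) under the current policy

q_standard = r + gamma * q_next                            # Eq. (14)
q_soft = r + gamma * (q_next - alpha * log_pi_next)        # Eq. (15)
print(q_standard, q_soft)    # the soft target adds an entropy bonus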
4.1. Policy Network
First of all, a policy neural network needs to be constructed as the actor network to generate the action strategy. The input of the policy network is all the observation information from the environment, and the outputs are the probability of every action and the chosen action. The policy function is

$a_t \sim \pi_\phi(\cdot \mid s_t).$ (16)

The loss function of the policy network can be defined as follows [14]:

$J_\pi(\phi) = \mathbb{E}_{s_t \sim D} \left[ \mathbb{E}_{a_t \sim \pi_\phi} \left[ \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \right] \right]$ (17)

with

$Q_\theta(s_t, a_t) = \min \left( Q_{\theta_1}(s_t, a_t), Q_{\theta_2}(s_t, a_t) \right).$ (18)

Two different critic target networks are used to avoid the overestimation of the Q value. Therefore, there are two Q values, i.e., $Q_{\theta_1}$ and $Q_{\theta_2}$, in (18), and the smaller one is chosen.
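A possible discrete-action implementation of the policy loss (17) with the clipped double Q of (18) is sketched below in PyTorch; treating the actor output as logits over a finite action set is an implementation assumption, not something stated in the paper:

import torch
import torch.nn.functional as F

def policy_loss(logits, q1, q2, alpha):
    """Discrete-action version of Eq. (17): E_s[ E_{a~pi}[alpha*log pi - Q] ],
    with Q taken as the element-wise minimum of the two critics, Eq. (18)."""
    log_pi = F.log_softmax(logits, dim=-1)     # log pi(a|s) for every action
    pi = log_pi.exp()
    q_min = torch.min(q1, q2)                  # clipped double Q, Eq. (18)
    # Exact expectation over the finite action set (no sampling needed).
    return (pi * (alpha * log_pi - q_min)).sum(dim=-1).mean()

# Hypothetical batch: 4 states, 16 actions.
logits = torch.randn(4, 16, requires_grad=True)
q1, q2 = torch.randn(4, 16), torch.randn(4, 16)
loss = policy_loss(logits, q1, q2, alpha=0.05)
loss.backward()                                # gradients flow into the logits
print(float(loss))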
4.2. Soft Q Network
Secondly, in order to evaluate the policy network, we construct two kinds of soft Q neural networks, namely the current value networks and the target networks. The inputs of both are the observed state information, and the outputs are the Q values of the actions, denoted as

$Q_{\theta_i}(s_t, a_t), \quad i = 1, 2.$ (19)

The soft value can be defined as

$V(s_{t+1}) = \mathbb{E}_{a_{t+1} \sim \pi_\phi} \left[ Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \right],$ (20)

where $\pi_\phi(a_{t+1} \mid s_{t+1})$ denotes the probability of every action based on state $s_{t+1}$. Then, the loss function of the soft Q current network can be expressed as

$J_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D} \left[ \frac{1}{2} \left( Q_{\theta_i}(s_t, a_t) - \left( r_t + \gamma V(s_{t+1}) \right) \right)^2 \right].$ (21)
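Correspondingly, a hedged PyTorch sketch of the soft Q loss (21), with the TD target built from the soft value (20) and the two target critics; again, the discrete action space and the tensor shapes are assumptions:

import torch
import torch.nn.functional as F

def critic_loss(q_pred, reward, next_log_pi, next_q1_targ, next_q2_targ,
                alpha, gamma, done):
    """Loss of one soft Q current network, Eq. (21), for a discrete action
    space (an assumption): the TD target uses the soft value of Eq. (20)
    computed from the two target critics."""
    with torch.no_grad():
        next_pi = next_log_pi.exp()
        next_q = torch.min(next_q1_targ, next_q2_targ)
        soft_v = (next_pi * (next_q - alpha * next_log_pi)).sum(dim=-1)  # Eq. (20)
        target = reward + gamma * (1.0 - done) * soft_v
    return F.mse_loss(q_pred, target)

# Hypothetical batch of 4 transitions, 16 actions; q_pred is Q(s_t, a_t).
q_pred = torch.randn(4, requires_grad=True)
next_log_pi = torch.log_softmax(torch.randn(4, 16), dim=-1)
loss = critic_loss(q_pred, torch.randn(4), next_log_pi,
                   torch.randn(4, 16), torch.randn(4, 16),
                   alpha=0.05, gamma=0.99, done=torch.zeros(4))
loss.backward()
print(float(loss))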
In order to further consider the impact of the temperature parameter $\alpha$, we compare a fixed temperature parameter with an adaptive temperature parameter [20]. The latter can adjust automatically depending on the environment and the network learning. The loss function of the temperature parameter $\alpha$ can be defined as

$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t} \left[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right],$ (22)

where the target entropy $\bar{\mathcal{H}}$ is set according to the dimension of the action space, following practical experience [14].
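The adaptive temperature can be trained by minimizing (22); the sketch below parameterizes log alpha to keep the temperature positive, and the target-entropy heuristic is an assumption rather than the paper's setting:

import torch

def temperature_loss(log_alpha, log_pi, target_entropy):
    """Loss of the adaptive temperature, Eq. (22): alpha is pushed up when the
    policy entropy falls below the target and down otherwise. Parameterizing
    log(alpha) keeps alpha positive (an implementation choice)."""
    pi = log_pi.exp()
    entropy = -(pi * log_pi).sum(dim=-1)               # current policy entropy
    return (log_alpha.exp() * (entropy - target_entropy).detach()).mean()

# Hypothetical usage: 16 discrete actions, target entropy tied to action-space size.
log_alpha = torch.zeros(1, requires_grad=True)
log_pi = torch.log_softmax(torch.randn(4, 16), dim=-1)
target_entropy = 0.98 * torch.log(torch.tensor(16.0))  # an assumed heuristic
loss = temperature_loss(log_alpha, log_pi, target_entropy)
loss.backward()
print(float(loss), float(log_alpha.grad))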
4.3. Memory Buffer
The SAC algorithm is similar to Q-learning in that it adopts the idea of experience replay. In the training process, using all the sample data in every update slows down training and harms convergence, whereas using only a small part of the sample data cannot achieve a good training effect. This problem can be addressed by establishing two fixed-size memory buffers: all the experience is stored in the large memory buffer, while the experience with a high temporal difference (TD) error is also stored in the small memory buffer. The experience in the small buffer is selected for training more often because of its high TD error, which accelerates the training process. When a memory buffer is full, old samples are released to reserve space for new sample data. By setting the capacity M of the memory buffer to control the amount of data involved in training, most of the sample data are retained as much as possible to achieve good training results. In the actual training process, to speed up training, experience is randomly sampled from the memory in mini-batches for the backward training of the networks.
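A possible realization of the two fixed-size memories described above is sketched below; the capacities, the TD-error threshold, and the mixing ratio between the two buffers are illustrative assumptions:

import random
from collections import deque

class DualReplayBuffer:
    """Two fixed-size memories as described in Section 4.3: every transition goes
    into the large buffer, while transitions with a high TD error are also kept
    in a small buffer that is sampled more often."""

    def __init__(self, big_capacity=100_000, small_capacity=2_000,
                 td_threshold=1.0):
        self.big = deque(maxlen=big_capacity)
        self.small = deque(maxlen=small_capacity)
        self.td_threshold = td_threshold

    def store(self, transition, td_error):
        self.big.append(transition)
        if abs(td_error) > self.td_threshold:
            self.small.append(transition)

    def sample(self, batch_size, high_td_fraction=0.5):
        # Mix ordinary experience with high-TD-error experience.
        n_small = min(int(batch_size * high_td_fraction), len(self.small))
        batch = random.sample(list(self.small), n_small)
        batch += random.sample(list(self.big),
                               min(batch_size - n_small, len(self.big)))
        return batch

# Hypothetical usage: store (o_t, a_t, r_t, o_{t+1}) tuples.
buf = DualReplayBuffer()
for i in range(500):
    buf.store((f"o{i}", i % 16, float(i % 3), f"o{i+1}"), td_error=i % 5)
print(len(buf.sample(32)))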
4.4. Network Learning
The essence of network learning is the process of continually updating the network parameters to minimize the network loss functions. In Algorithm 1, gradient updates and soft updates are used to update the networks. Specifically, the gradient update method is used to update the parameters of the soft Q current networks and the policy network as well as the temperature coefficient, and the updating functions are

$\theta_i \leftarrow \theta_i - \eta \nabla_{\theta_i} J_Q(\theta_i), \quad \phi \leftarrow \phi - \eta \nabla_{\phi} J_\pi(\phi), \quad \alpha \leftarrow \alpha - \eta \nabla_{\alpha} J(\alpha),$ (23)

where $\eta$ is the gradient update weight, and $\theta_i$ and $\phi$ are the parameters of the soft Q current networks and the policy network, respectively.
Algorithm 1 Resource sharing based on SAC in multi-agent reinforcement learning.
1: Generate the vehicular environment and initialize all parameters
2: for all V2V agents y do
3:   Initialize the actor, critic, and critic target networks randomly
4:   Initialize the memory buffer
5: end for
6: for each episode do
7:   Update all vehicle locations and the large-scale fading
8:   Reset B_y = B and T_y = T for all y
9:   for each step t do
10:    for each V2V agent y do
11:      Observe the vehicular environment O_t^y
12:      Choose action a_t^y from the action space according to the policy
13:    end for
14:    All agents take their actions according to the policy
15:    Receive the reward r_{t+1}
16:    Update the channel small-scale fading
17:    for each V2V agent y do
18:      Observe O_{t+1}^y
19:      Store (O_t^y, a_t^y, r_{t+1}, O_{t+1}^y) in the memory buffer
20:    end for
21:  end for
22:  for each V2V agent y do
23:    Randomly sample mini-batches from the memory buffer
24:    Update the critic networks and the policy network
25:    Update the target networks
26:  end for
27: end for
However, the soft Q target networks do not participate in the gradient-based learning process, so they cannot update themselves independently. Therefore, we use a soft update that copies the latest parameters of the soft Q current networks at regular intervals for small-scale updates, as follows:

$\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau) \bar{\theta}_i,$ (24)

where $\tau$ is the soft update weight and $Q_{\bar{\theta}_i}$ denotes the Q value of the target network. The training process is summarized in Algorithm 1.
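The soft update in (24) is a Polyak averaging step over the target-network parameters; a minimal PyTorch sketch follows, where the layer sizes are taken from Section 5 and everything else (observation and action dimensions, the helper names) is assumed:

import torch
import torch.nn as nn

def soft_update(target_net, current_net, tau=0.01):
    """Polyak averaging of Eq. (24): target <- tau * current + (1 - tau) * target."""
    with torch.no_grad():
        for p_targ, p_cur in zip(target_net.parameters(), current_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p_cur)

# Hypothetical critic pair with the hidden-layer sizes reported in Section 5.
def make_critic(obs_dim=22, act_dim=16):
    return nn.Sequential(nn.Linear(obs_dim, 500), nn.ReLU(),
                         nn.Linear(500, 250), nn.ReLU(),
                         nn.Linear(250, 125), nn.ReLU(),
                         nn.Linear(125, act_dim))

critic, critic_target = make_critic(), make_critic()
critic_target.load_state_dict(critic.state_dict())  # start from identical weights
soft_update(critic_target, critic, tau=0.01)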
However, when the vehicular environment becomes more sophisticated with more cars (the number of cars is greater than 30), it is difficult for each agent with five networks to converge. Thus, we present Algorithm 2 to deal with this more complex case. In Algorithm 2, only one policy network (actor network) and one Q network (critic network) need to be trained, with the parameters shared in this learning problem [21]. At each episode, only one agent interacts with the vehicular environment. The networks are trained with that agent's experience, and then the next agent inherits the networks and continues the training. Algorithm 2 greatly reduces the complexity of the networks and is easier to converge.
Algorithm 2 Sharing soft actor critic for resource allocation in V2X.
1: Generate the vehicular environment and initialize all parameters
2: for all V2V agents y do
3:   Initialize the actor, critic, and critic target networks randomly
4:   Initialize the memory buffer
5:   Initialize the temperature α
6: end for
7: for each V2V agent y do
8:   for each episode do
9:     Update all vehicle locations and the large-scale fading
10:    Reset B_y = B and T_y = T for all y
11:    for each step do
12:      Observe O_t^y
13:      Choose action a_t^y from the action space according to the policy
14:      Receive the reward r_{t+1}
15:      Update the channel small-scale fading
16:      Observe O_{t+1}^y
17:      Store (O_t^y, a_t^y, r_{t+1}, O_{t+1}^y) in the replay memory buffer
18:      Randomly sample mini-batches from the memory buffer
19:      Update the critic network
20:      Update the policy network
21:      Update the temperature α
22:      Update the target network
23:      Update the observation
24:    end for
25:  end for
26: end for
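To illustrate the parameter-sharing idea behind Algorithm 2, the sketch below lets all V2V agents act through a single shared actor (and a single shared critic), distinguishing agents only by their own observations; the network sizes follow Section 5, while the observation and action dimensions and all names are assumptions:

import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    # Fully connected network with the hidden-layer sizes reported in Section 5.
    return nn.Sequential(nn.Linear(in_dim, 500), nn.ReLU(),
                         nn.Linear(500, 250), nn.ReLU(),
                         nn.Linear(250, 125), nn.ReLU(),
                         nn.Linear(125, out_dim))

OBS_DIM, ACT_DIM, NUM_AGENTS = 22, 16, 20

# PSSAC idea: one actor and one critic shared by every V2V agent, instead of
# five networks per agent as in Algorithm 1.
shared_actor = mlp(OBS_DIM, ACT_DIM)
shared_critic = mlp(OBS_DIM, ACT_DIM)

def select_actions(observations):
    """Each agent feeds its own observation through the same shared actor and
    samples its action from the resulting categorical distribution."""
    logits = shared_actor(observations)                  # (num_agents, ACT_DIM)
    dist = torch.distributions.Categorical(logits=logits)
    return dist.sample()                                 # one action per agent

obs = torch.randn(NUM_AGENTS, OBS_DIM)                   # hypothetical observations
print(select_actions(obs))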
5. Simulation Results
The simulation considers a vehicular networking topology in an area of one-way lanes 375 m wide and 649 m long. A base station is placed on the left of the scenario. The other simulation parameters of the system follow 3GPP TR 36.885 [21,22], which describes the vehicle drop models, densities, speeds, directions of movement, vehicular channels, V2V data traffic, etc., as shown in Table 1. The rectified linear unit is used as the activation function and the RMSProp optimizer is used to update the network parameters. Each V2V agent consists of five fully connected neural networks, each with three hidden layers containing 500, 250, and 125 neurons, respectively, and we train each agent's networks for 4000 episodes. In both algorithms, we set the gradient update rate of the soft Q current network and the policy network to 0.001 and the soft update rate of the soft Q target network to 0.01. In Algorithm 1, the temperature parameter $\alpha$ is fixed and set to 0.001, while in Algorithm 2 the temperature parameter is adaptive and initialized to 0.95. Thereafter, we compare the different SAC algorithms with DQN in the simulation to show their performance.
Firstly, in order to verify the effectiveness of the SAC algorithms, their convergence must be verified. As shown in Figure 3, Figure 4 and Figure 5, as the number of training episodes increases, the cumulative reward returned in each episode gradually increases and finally converges, which verifies the effectiveness of the two SAC algorithms. Additionally, the SAC algorithm with a fixed temperature (SAC-fix) converges to a return of about 110, PSSAC converges to about 118, and the SAC algorithm with a learned temperature (SAC-learn) converges to about 120. This means that SAC-learn and PSSAC perform better than SAC-fix.
Secondly, the V2V payload transmission success probability and the V2I sum capacity of PSSAC, SAC-fix, SAC-learn, and DQN are compared. When the number of vehicles is 6, the V2I sum capacity and the V2V payload transmission success probability of the SAC algorithms and the DQN algorithm are simulated and compared in Figure 6a,b, respectively. The V2I sum capacity of the DQN algorithm is better than that of the SAC algorithms when the payload size is small, but its performance clearly degrades as the payload size increases: when the required payload increases to 8 × 1060 bytes, the performance of the SAC algorithms begins to exceed that of the DQN algorithm, and it changes more smoothly as the payload grows. In Figure 7a,b, when the number of vehicles reaches 10, the advantage of the DQN algorithm is no longer prominent and the SAC algorithms always perform better than DQN as the payload size increases. In addition, when the vehicular environment becomes more complex, for example with a larger payload size or more vehicles, SAC-learn performs better than the others.
In Figure 8a,b, it is obvious that, when the number of vehicles reaches 20, PSSAC always performs best. Because of the complexity of the SAC-fix and SAC-learn algorithms, they find it difficult to converge to a good return and sometimes barely converge. Because DQN cannot deal with such a complex situation, it is not included in these simulations. If the number of V2V connections increases further, the performance of all algorithms degrades and they converge more slowly. In addition, in real-world scenarios there are more details to consider, such as the weather; therefore, the reward function and the environment would need to be redefined.
In addition, in order to see how different agents transmit and cooperate during the payload delivery process, we record the remaining payload of each agent. When the number of vehicles is 6, the changes in the residual load of the V2V links under the SAC-fix algorithm and the DQN algorithm are observed, as shown in Figure 9a,b, respectively. On the one hand, both algorithms complete the load transmission task in a very short time. On the other hand, the DQN algorithm takes about twice as much time as the SAC-fix algorithm, and its stability is worse. The reason is that the maximum entropy idea of the SAC algorithm makes the agent randomize its actions as much as possible, on the basis of completing the current task, so as to obtain a variety of approximately optimal choices; this improves the agent's ability to explore the environment and the stability of the SAC algorithm in the dynamic environment.
6. Conclusions
In this paper, we addressed the problems of spectrum allocation in the IoV. Firstly, we proposed an SAC-based scheme to solve the spectrum allocation problem in the complex vehicular environment, which performs better than the DQN scheme. In addition, we proposed another SAC-based scheme that reduces the complexity and eases convergence, cutting the training time by about 50 percent. These novel multi-agent SAC schemes (SAC-fix, SAC-learn, PSSAC) achieve 10 percent performance improvements in terms of the V2I sum capacity and the V2V payload transmission success probability compared with DQN in the vehicular environment. Some elements of the real ITS environment have not been considered in our scenario and algorithms, such as the weather, pedestrians, and vehicle density. In the future, we will consider more details of the vehicular network to make the V2V environment more realistic, and further improve our algorithms.