1. Introduction
Highway emergencies are uncertain [1,2] and are accompanied by a series of unpredictable traffic phenomena. For example, a sudden lane drop causes a large number of vehicles to gather quickly in the section upstream of the emergency, which may trigger a series of negative impacts and chain effects, such as queuing and even secondary accidents, resulting in heavier casualties and greater property losses [3,4,5]. Therefore, timely traffic management and control of the emergency section are essential.
The variable speed limit (VSL) is an effective control measure to improve road safety in emergency environments [6] and has been widely and successfully applied to highway systems in the US, the EU, and Australia [7,8]. VSLs can reduce the probability of traffic blockage by balancing vehicle speeds. The principle is to constrain the inflow by adjusting the VSL values in time, optimizing the traffic flow state to avoid stop-and-go traffic, shockwaves, and other unstable conditions [9]. VSL control can adjust traffic states based on real-time traffic information, including traffic speed, flow, and so on, especially under emergency events, severe conditions, and other dynamic states. Dynamically adjusting the speed limit to current conditions has advantages in both efficiency and safety. Some studies [10,11] show that implementing the VSL method can reduce vehicle collisions and improve road safety. Moreover, it also makes important contributions to traffic efficiency and environmental benefits [12,13].
VSL control methods can be roughly divided into four types: rule-based, feedback-based, optimal-control-based, and reinforcement learning methods [14]. Rule-based VSL control methods are implemented by formulating logical rules [15,16]: the speed limit value is determined from preselected thresholds on traffic flow, density, average speed, and so on, which can coordinate speed differences and stabilize traffic flow. Rule-based methods are simple and easy to implement compared with other methods and are already deployed in most traffic management and control systems; their successful application also illustrates their effectiveness in coordinating traffic and improving safety and efficiency. The main idea of the feedback-based VSL controller [17,18] is to calculate a VSL value from current and past traffic conditions, which usually requires less computing time than optimal-control-based methods. However, the performance of feedback-based VSL control relies heavily on accurate measurement of traffic conditions, such as traffic flow and density; small perturbations in the measured density may therefore lead to suboptimal performance of the closed-loop system. VSL based on optimal control is typically implemented within a model predictive control framework [19,20,21,22]. At each time step, the VSL control command is calculated by solving an optimization problem whose objective function involves a performance metric, such as total travel time, safety, emissions, or fuel consumption; its control effect depends on the accuracy of the model. The VSL decision-making control strategy based on reinforcement learning relies on real-time data fed back by the environment [12,23,24]: it automatically senses the environmental state and trains interactively with the external environment through a continuous trial-and-error mechanism, thereby learning the optimal control decision.
The implementation of traditional VSL control relies on variable message signs (VMSs) as transportation infrastructure to transmit control values. VMSs on the highway are fixed and placed at discrete locations, so it is very difficult for them to react to a dynamic traffic environment. Because emergencies are uncertain and random, the locations of VMSs are crucial to the effectiveness of VSL control strategies; VMSs would therefore have to be placed widely and densely enough to respond effectively to a dynamic emergency environment. However, building a continuous VMS system requires expensive and cumbersome accessories, such as gantries. With the application and development of the Cooperative Vehicle Infrastructure System (CVIS), the comprehensive use of a variety of sensors provides new technical means to collect traffic data [25,26]. Connected and Autonomous Vehicles (CAVs), with autonomous driving and network communication capabilities, not only maintain smaller expected headways but also obey control commands issued by the control center promptly and reliably [27]. This solves the problems of poor flexibility and slow control actions in strategies based on fixed traffic infrastructure, and mitigates the difficulties of measuring actual data, ensuring driver compliance, and handling the negative impact of driver uncertainty in traffic control. Therefore, CAVs, with their mobility and perception capabilities, bring great potential to traffic management and control and also provide a new method of detecting traffic information. Han et al. [28] first used CAVs to implement VSL control at fixed bottlenecks in 2017, freeing traffic management and control from reliance on infrastructure.
As mentioned above, considering practical engineering applications, the traditional VSL method is easy to implement, but the performance of the control strategy depends to a large extent on the accuracy of the established traffic flow model. The parameters of the traffic flow model are related to many factors [29]. For example, the traffic flow models of adjacent road sections differ because of their different alignments. Most traffic models are too idealized to accurately distinguish and describe the traffic flows at different locations or on different road surfaces. When the model is not detailed and accurate, the performance of the model-based VSL control strategy is unsatisfactory.
Deep reinforcement learning (DRL) is a self-learning method that completes the decision-making process through continuous interaction with the environment [30]. Through continuous interaction, it learns the characteristics of the environment so that its decisions suit that environment. In recent years, some studies have introduced reinforcement learning algorithms into VSL control [12,23,24] and confirmed that DRL-based VSL methods outperform traditional methods. However, most research on DRL-based VSL methods focuses on the balanced control of traffic flow, and there is a lack of research on management and control after emergencies.
The purpose of this work is to design an intelligent traffic management and control strategy for VSL with DRL under emergencies. The algorithm learns an adaptive controller that can adapt to a changing environment in a short time. The self-triggered intelligent decision-making framework addresses traffic evacuation in emergencies for sustainable traffic development. The contributions of this paper are as follows:
1. The method in this paper is model-free, which avoids the need to obtain an accurate traffic flow model;
2. The self-triggering traffic management and control method can take timely measures after an emergency occurs;
3. The proposed method has advantages in efficiency and safety compared with other methods.
The structure of this research consists of six sections. Section 1 provides an overview of the topic and the background for the study. Section 2 proposes the research problem. Section 3 describes the Markov decision process of VSL control. Section 4 explains the VSL control problem based on an improved deep deterministic policy gradient (DDPG) under emergency. Section 5 discusses the study's results. Finally, Section 6, the Conclusion, summarizes the research findings and provides insights into future work.
3. Markov Decision Process of VSL Control
In this section, the VSL control problem under an emergency environment is formulated as a Markov decision process (MDP) to be addressed using deep reinforcement learning.
3.1. Markov Decision Process
The MDP is a classic decision-making model, and we apply a reinforcement learning architecture to solve it. The agent perceives the current system state and acts on the environment according to its policy, thereby changing the state of the environment and receiving rewards. The accumulation of rewards over time is called the return. The MDP is described by its five components: (1) state space S; (2) action space A; (3) reward function R; (4) state transition matrix P; and (5) discount rate $\gamma$.
The MDP describes the process of interaction between an agent and the environment: the agent takes action $a_t$ in the current state $s_t$, the environment then transitions to state $s_{t+1}$, and at the same time it returns the reward $r_t$, as shown in Figure 4. The policy $\pi$ is a mapping from the state space S to the action space A, that is, $\pi: S \to A$. The purpose of DRL is to find the optimal policy $\pi^*$ that maximizes the long-term cumulative reward.
The VSL control problem under emergency conditions is abstracted as an MDP. It is essential to appropriately define the five elements inherent to the MDP framework. Therefore, the remainder of this section focuses on formulating the five elements of the MDP for the VSL control process.
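To make the interaction loop concrete, the following minimal sketch iterates through states, actions, and rewards with a hypothetical toy environment (not the traffic simulator used in this paper; the state line, reward shape, and random policy are all illustrative assumptions):

```python
import random

class ToyEnv:
    """Hypothetical one-dimensional environment, used only to illustrate the MDP loop."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Transition: move left or right on a line of 11 states; reward peaks at state 5.
        self.state = max(0, min(10, self.state + action))
        reward = -abs(self.state - 5)
        done = self.state == 5
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
return_value = 0.0
for t in range(50):
    action = random.choice([-1, 1])              # a_t chosen by a (random) policy
    next_state, reward, done = env.step(action)  # environment returns s_{t+1} and r_t
    return_value += reward                       # accumulated reward is the return
    state = next_state
    if done:
        break
```

In a trained agent the random choice would be replaced by the learned policy $\pi(s_t)$.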
3.2. MDP of VSL Control
Traditional VSL control dynamically posts speed values through VMSs and relies heavily on road infrastructure, resulting in poor flexibility and slow control actions when implementing control strategies. With the development of roadside and vehicle-mounted equipment, sensor technology, and vehicle–road wireless communication networks, the cooperative vehicle infrastructure system enhances traffic information management and service capabilities. In this environment, vehicles, roadside unit systems, and their remote data centers can establish effective communication to share relevant status information and traffic control strategies, as well as accurately transmit the speed limit value to each vehicle, which solves the problem of poor flexibility and slow control actions of infrastructure-based control strategies. At the same time, the proposed method removes the heavy dependence on driver compliance and on high-accuracy traffic flow state prediction models and reduces the negative impact of driver uncertainty on traffic control.
In this paper, we focus on using VSL to relieve queuing vehicles in emergencies and avoid vehicle congestion. The purpose is to explore VSL solutions in the CVIS environment and maximize the transportation network throughput. In the cooperative vehicle infrastructure system environment, it is not difficult to implement variable speed limit control actions by sending speed limit commands to vehicles in the corresponding area. For example, existing driver assistance systems and cruise control systems can be used to force vehicles to follow the received speed value.
Figure 5 shows the VSL control process in the CVIS environment. The road section upstream of the emergency point is divided into a status monitoring area and a speed limit control area. The status monitoring area contains $n$ cells, the VSL implementation area contains $m$ ($m \le n$) cells, and the state of each cell is a pair of density and speed. Once an emergency occurs, the roadside unit inputs the information obtained from the status monitoring area to the algorithm unit. The algorithm unit calculates the current speed limit control value based on the current status information and then sends the result to the roadside unit. The roadside unit system sends the VSL values to each vehicle in the policy implementation area. In addition, a reinforcement learning algorithm with the actor–critic architecture is used to solve the emergency condition.
The actor is used to output a VSL control strategy, and the critic is used to evaluate the actor’s strategy. The reward function can quantify the efficiency, safety and emission reduction capabilities of the transportation network.
In this section, the VSL control process is formulated as a Markov decision process. Agent, state, action, transition probability, and reward are defined as follows:
Agent: The VSL controller is regarded as an agent. The agent can output different speed limit values for different cells (sub-sections) upstream of the emergency point. The goal of the agent is to control roads once emergencies occur, divert vehicles on accident sections, avoid vehicle congestion, and improve road capacity.
State space: The state space reflects the traffic flow state of the road in the real-time traffic environment. Based on the simulation platform, real-time information on the road can be obtained. This paper studies traffic flow control methods under emergencies; the traffic state upstream of the emergency point has a greater impact than the downstream state, so this paper focuses on the traffic state upstream of the emergency point. The state detection area of the section upstream of the emergency point is determined and divided into $n$ cells. In this paper, the state space S at time t of the VSL controller can be expressed as:

$$s_t = \left[\rho_1(t), v_1(t), \rho_2(t), v_2(t), \ldots, \rho_n(t), v_n(t)\right]$$

where $\rho_i(t)$ and $v_i(t)$ denote the density and average speed of cell i at time t.
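Assuming, as above, that each cell's state is its (density, speed) pair, a state vector of this form could be assembled as in the sketch below (the per-cell readings are hypothetical placeholders for values a detector or simulator would provide):

```python
def build_state(densities, speeds):
    """Interleave per-cell density and speed into a flat state vector s_t."""
    if len(densities) != len(speeds):
        raise ValueError("each cell needs both a density and a speed reading")
    state = []
    for rho, v in zip(densities, speeds):
        state.extend([rho, v])
    return state

# Example with n = 3 monitored cells (densities in veh/km, speeds in km/h).
s_t = build_state([32.0, 41.5, 55.2], [98.0, 84.3, 61.7])
```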
Action space: The dimension of the action space is related to the number of controlled cells. In the scenario of this paper, the dimension of the action space is generally less than or equal to that of the state space. Considering real-world implementation and driver compliance issues, the elements of the action space are set to discrete values; the action space can be expressed as:

$$a_t = \left[u_1(t), u_2(t), \ldots, u_m(t)\right], \quad u_i(t) \in V$$

where $u_i(t)$ is the speed limit applied to cell i and V is the finite set of allowed speed limit values.
Transition probability: The agent is trained on the open-source traffic simulation platform Simulation of Urban MObility (SUMO). SUMO provides flexible interfaces for network design, traffic sensors, and traffic control solutions. The state transition dynamics in this article are defined implicitly by the SUMO platform.
Reward value: Reinforcement learning selects actions by maximizing a given reward signal. The key issue with this approach is ensuring that the agent receives rewards that promote good system-level behavior. The reward function of the variable speed limit control problem can be defined from the optimization goal. The VSL was first proposed to reduce traffic conflicts and enhance the consistency of traffic flow speeds, which improves road safety; therefore, safety factors cannot be ignored when formulating a VSL strategy. The optimization goal of the VSL method can be total travel time, low collision probability, minimum vehicle emissions [33], etc. In this paper, the difference between the traffic flows upstream and downstream of the emergency point is used as the reward function. For the traffic road network, the closer the traffic flow downstream of the emergency point is to the traffic flow upstream, the smaller the impact of the accident point on the traffic capacity of the road. The reward function can be expressed as:

$$r_t = -\left|q_{\mathrm{up}}(t) - q_{\mathrm{down}}(t)\right|$$

where $q_{\mathrm{up}}(t)$ and $q_{\mathrm{down}}(t)$ are the traffic flows measured upstream and downstream of the emergency point at time t.
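Under this definition, a reward of this shape could be computed from flow measurements as in the following sketch (the detector readings are hypothetical):

```python
def reward(q_up, q_down):
    """Negative absolute gap between upstream and downstream flow (veh/h).

    The reward approaches 0 as the downstream flow recovers to the upstream level.
    """
    return -abs(q_up - q_down)

# Example: upstream 1800 veh/h, downstream only 1200 veh/h after the incident.
r_t = reward(1800.0, 1200.0)
```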
4. VSL Control Problem Based on Improving DDPG under Emergency
The architecture of the self-triggered VSL intelligent decision-making control strategy based on improved DDPG proposed in this paper is shown in Figure 6. The framework mainly includes three parts: the environment module, the trigger module, and the intelligent decision-making control module.
The environment module is mainly used to receive the VSL value of each road section provided by the intelligent decision-making control module and provide real-time traffic status information for the trigger module and intelligent decision-making control module. The environment module mainly comprises the road network model and vehicle model.
The trigger module determines whether to start the intelligent decision-making control module based on the accident flag. The state of the emergency flag can be set by a video monitoring system or manually by road monitoring staff. The trigger module is asleep when the accident flag is inactive, indicating that everything on the road is normal. When the accident flag is activated, indicating that an emergency has occurred on the road, the intelligent decision-making control module is triggered, and the trigger module sends the location of the emergency to the intelligent decision-making control module. The status of the trigger module thus determines whether the decision-making control module works. In this paper, the control strategy is implemented only when an emergency occurs. Under normal road conditions, the intelligent decision-making control module is not started, which avoids excessive control of roads and saves computing resources and energy.
The intelligent decision-making module works as follows: once an emergency occurs on the road, it receives the emergency location sent by the trigger module and then collects the traffic information upstream of the emergency point from the environment module. Based on the current state, it outputs the VSL values of the road and implements them on the control section; that is, the calculated VSL value of each sub-section guides the vehicles in the corresponding sub-section until the next control interval. Vehicles on the road are monitored at the same time; when a vehicle leaves the detection area of the emergency event, the control of that vehicle is automatically released. This module is based on the DDPG algorithm and adds an action noise parameter in the early stage of the algorithm update, which improves exploration efficiency and stability, avoids local optima, and combines the continuous output action values to keep traffic smooth. In terms of implementation, the DDPG algorithm in this paper uses four neural networks: the actor and critic each use a network of the same structure, and each has a target network. In the DDPG algorithm, the actor also needs its target network to calculate the target value; the target networks adopt a soft update method that lets them change slowly, gradually approaching the online networks, finally yielding a trained algorithm model. The actor output is used as the VSL value.
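The trigger logic described above can be sketched as a simple event loop (the module names, the flag convention, and the stub classes are illustrative assumptions, not the paper's exact implementation):

```python
class StubEnv:
    """Illustrative stand-in for the SUMO-based environment module."""
    def observe(self, location):
        return [0.5, 0.8]                # placeholder upstream traffic state

    def apply_speed_limits(self, vsl):
        self.last = vsl                  # roadside unit would send these to vehicles

class StubController:
    """Illustrative stand-in for the trained DDPG actor."""
    def act(self, state):
        return [80, 100]                 # placeholder VSL values (km/h), one per cell

def self_triggered_control(accident_flag, location, controller, env):
    """Run VSL control only while the accident flag is active; otherwise sleep."""
    if not accident_flag:
        return None                      # trigger module asleep: no control output
    state = env.observe(location)        # collect upstream traffic information
    vsl_values = controller.act(state)   # one speed limit per controlled cell
    env.apply_speed_limits(vsl_values)   # publish limits via the roadside unit
    return vsl_values
```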
Neural Network Design and Algorithm Update
The DDPG algorithm used in this paper belongs to the actor–critic architecture and includes a policy network and a value network. The policy network is deterministic: it makes decisions based on the current state s to obtain a deterministic action a. The value network evaluates this action a given state s, which drives the policy network to improve.
The deterministic policy network can be expressed as $a = \mu(s; \theta)$, where $\theta$ is the policy network parameter. The value network can be expressed as $Q(s, a; \omega)$, where $\omega$ is the value network parameter, and its output is a real number Q, which evaluates the action taken in the current state.
The temporal difference (TD) algorithm is used to update the value network based on a transition $(s_t, a_t, r_t, s_{t+1})$. Based on these data, the value network can evaluate the action at time t given $s_t$. The value $q_t$ is calculated with Formula (5):

$$q_t = Q(s_t, a_t; \omega)$$
In the same way, the value network can also predict the value $q_{t+1}$ of the action at time $t+1$:

$$q_{t+1} = Q(s_{t+1}, a_{t+1}; \omega)$$

where $a_{t+1} = \mu(s_{t+1}; \theta)$.
Therefore, the TD error $\delta_t$ can be obtained from the following formula:

$$\delta_t = q_t - \left(r_t + \gamma \, q_{t+1}\right)$$

To make $\delta_t$ as small as possible, gradient descent is used to update the value network:

$$\omega \leftarrow \omega - \alpha \, \delta_t \, \frac{\partial Q(s_t, a_t; \omega)}{\partial \omega}$$

where $\alpha$ is the learning rate of the value network.
The policy network is updated using the deterministic policy gradient method. The goal of training the policy network is to make the value $Q(s, \mu(s;\theta); \omega)$ as large as possible, where $a = \mu(s; \theta)$. The objective of the deterministic policy network is given by the following formula, which is a real number:

$$J(\theta) = Q\left(s, \mu(s; \theta); \omega\right)$$

We want the value of Formula (9) to be as large as possible, so the gradient ascent method is used to update the policy network:

$$\theta \leftarrow \theta + \beta \, \frac{\partial \mu(s; \theta)}{\partial \theta} \, \frac{\partial Q(s, a; \omega)}{\partial a}\bigg|_{a = \mu(s;\theta)}$$

where $\beta$ is the learning rate of the policy network.
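A minimal numerical sketch of these two updates, with scalar linear functions standing in for the paper's deep networks (all shapes, values, and learning rates here are illustrative assumptions):

```python
import numpy as np

# Scalar stand-ins: mu(s) = theta*s (actor), Q(s, a) = w[0]*s + w[1]*a (critic).
theta = 0.1
w = np.array([0.1, 0.1])
alpha, beta, gamma = 0.01, 0.01, 0.99     # critic lr, actor lr, discount rate

def mu(s, theta):
    return theta * s

def Q(s, a, w):
    return w[0] * s + w[1] * a

# One update on a single transition (s, a, r, s').
s, a, r, s_next = 1.0, 0.5, -0.2, 0.8
q_t = Q(s, a, w)                          # current value estimate
a_next = mu(s_next, theta)
y_t = r + gamma * Q(s_next, a_next, w)    # TD target
delta = q_t - y_t                         # TD error
w = w - alpha * delta * np.array([s, a])  # critic: gradient descent on delta^2 / 2
# Actor: deterministic policy gradient dQ/da * dmu/dtheta = w[1] * s.
theta = theta + beta * w[1] * s           # gradient ascent step
```

In the full algorithm these updates run on mini-batches from the replay memory, with target networks supplying the TD target.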
The core challenge of reinforcement learning is balancing exploration, actively searching for actions that may yield high returns and bring long-term benefits. Without sufficient exploration, the agent may be unable to discover effective VSL strategies. Therefore, this paper introduces an action noise parameter $\aleph$ into the output of the DDPG algorithm, as shown below:

$$a_t = \mu(s_t; \theta) + \aleph$$

In the early stage of the algorithm update, possible action values can be explored as randomly as possible. As the rounds of algorithm updates increase, $\aleph$ becomes smaller and smaller, and its impact on the output action decreases. Adding noise to perturb the output of the policy network therefore increases exploration and can also prevent local optima in the early stage of the algorithm.
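One common way to realize a decaying exploration noise of this kind is zero-mean Gaussian noise whose scale shrinks with the training episode (the decay schedule and parameter values below are assumed examples, not the paper's exact setting):

```python
import numpy as np

def noisy_action(mu_output, episode, sigma0=0.5, decay=0.99, rng=None):
    """Add zero-mean Gaussian noise whose scale decays geometrically per episode."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = sigma0 * decay ** episode       # noise shrinks as training progresses
    return mu_output + rng.normal(0.0, sigma)

# Early episodes explore widely; late episodes follow the policy almost exactly.
a_early = noisy_action(100.0, episode=0, rng=np.random.default_rng(1))
a_late = noisy_action(100.0, episode=500, rng=np.random.default_rng(1))
```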
This architecture reasons about actions in a continuous space and then uses a simple integer transformation of the continuous action values to output discrete actions. The actual speed limit value can be obtained from the following formula:

$$u = c + I \cdot \lfloor a \rceil$$

In the formula, $c$ and $I$ are constants, $a$ is obtained from the network, and $u$ is the output quantity (control quantity).
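As a sketch of such an integer transformation (the base speed of 60 km/h, the 10 km/h step, and the 120 km/h cap are assumed values for illustration, not the paper's constants):

```python
def discretize_vsl(a, base=60.0, step=10.0, v_max=120.0):
    """Map a continuous network output to a discrete speed limit in km/h."""
    u = base + step * round(a)           # snap to a multiple of the step size
    return max(base, min(v_max, u))      # clamp to the allowed speed-limit range

limits = [discretize_vsl(x) for x in (0.2, 1.7, 9.0)]
```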
In addition, as an off-policy reinforcement learning algorithm, one advantage of DDPG is that its exploration can be independent of the learning algorithm, and the trained model can be used directly in the corresponding scenario, which has high practical engineering significance. The specific process of the algorithm is shown in Algorithm 1.
Algorithm 1 Self-triggering VSL based on improved DDPG algorithm.
5. Simulation Experiment and Analysis
The purpose of this section is to evaluate the implementation and effectiveness of deep reinforcement learning in VSL control strategies. To verify the advantages and disadvantages of the proposed method, this paper builds a joint simulation platform based on SUMO and Python and compares the DDPG-based algorithm (DDPG-VSL), the improved DDPG algorithm (I-DDPG-VSL), no control (baseline), and a traditional rule-based method (rule-based) for VSL in different road environments through simulation and analysis.
The simulation platform in this paper uses the open-source software Simulation of Urban MObility (SUMO 1.19.0). The software is very flexible and supports the use of its Traffic Control Interface (TraCI) to set speed limits for vehicles. Moreover, SUMO can also define various vehicle models and car-following and lane-changing parameters for each vehicle. Python is selected for algorithm implementation; it receives the traffic state information from the simulation platform and outputs the corresponding control strategy. The implementation architecture of the intelligent decision-making control strategy is shown in Figure 7. For the construction of the simulation scenario, the selected area is exported from OpenStreetMap.org, and the traffic network for simulation is built from this map. The netconvert command is used to convert the OSM file to the NET file.
The Hong Kong–Zhuhai–Macao Bridge highway from Zhuhai to Hong Kong is selected as the research scenario, as shown in Figure 8. The first reason is that the Hong Kong–Zhuhai–Macao Bridge is a typical long section without on-ramps and off-ramps, so the VSL control strategy is the only available measure once an emergency occurs. The second reason is that the Hong Kong–Zhuhai–Macao Bridge is a world-class cross-sea channel of national strategic significance, affecting the operation of the highway network and comprehensive transportation; Guangdong, Hong Kong, and Macao all share responsibility for it [34], so the study of its roads has important practical engineering significance.
5.1. Simulation Model Parameters
To reduce the complexity of the traffic flow as much as possible, four types of vehicles are selected, and their parameters are shown in Table 1. To reproduce road traffic flow as realistically as possible, the experiment covers three traffic flow levels: low, medium, and high. To simulate dynamic traffic flow, Figure 9, Figure 10 and Figure 11 respectively show the traffic flow generated in the different environments randomly selected during the simulation process.
5.2. Evaluation Metrics
Average travel time (ATT): ATT refers to the average time each vehicle takes to traverse the studied scenario.
Potential collision number (PCN): Time-to-collision (TTC) is defined as the space gap between the lead vehicle and the following vehicle divided by their speed difference. If the value of TTC is less than 3 s, the number of potential collisions is increased by 1.
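Under this definition, the two safety quantities could be computed as in the sketch below (the 3 s threshold matches the text; the sample gap and speed values are hypothetical):

```python
def time_to_collision(gap_m, v_follow, v_lead):
    """TTC in seconds; infinite when the follower is not closing the gap (speeds in m/s)."""
    closing_speed = v_follow - v_lead
    return gap_m / closing_speed if closing_speed > 0 else float("inf")

def potential_collision_number(events, threshold_s=3.0):
    """Count events whose TTC falls below the threshold."""
    return sum(
        1 for gap, v_follow, v_lead in events
        if time_to_collision(gap, v_follow, v_lead) < threshold_s
    )

# Each event: (gap in m, follower speed, leader speed) with speeds in m/s.
events = [(20.0, 30.0, 20.0), (50.0, 25.0, 24.0), (10.0, 22.0, 28.0)]
pcn = potential_collision_number(events)   # only the first event has TTC below 3 s
```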
5.3. Simulation Results and Analysis
5.3.1. Parameter Setting and Training
When the traffic flow on the road is running stably, an emergency event is set to activate the trigger module and start the VSL intelligent decision-making module. The emergency vehicle is then removed after 3 min. After the traffic flow is detected to have returned to the normal state, the state of the trigger module changes from active to sleep, the traffic management and control strategy stops outputting VSL values, and one episode of training is completed.
Table 2 lists some necessary parameters for the DDPG algorithm.
The reward value obtained by the agent in each training episode reflects the quality of the training: the higher the reward value, the better the training effect. The change in the reward value obtained by the DDPG-based strategy with the number of episodes during training is shown in Figure 12, which also compares the reward values under basic rule control and no control. As can be seen from Figure 12, before about 30 episodes, the reward value of the DDPG-based strategy converges locally. This is because the memory buffer size is set to 512 in the algorithm: only when the memory pool is filled are the stored transitions sampled and trained on. However, with the basic DDPG algorithm, only some conservative states are stored in this process, and the action selection is relatively simple, so when sampling and learning from the states in the memory pool, the update of the reward value is relatively gentle and easily converges to local optima. In the control strategy based on the improved DDPG, the noise parameter is introduced to explore as many possible actions as possible while filling the memory, and actions continue to be explored even after the memory pool is filled. The reward value of the improved DDPG control strategy therefore converges quickly while avoiding local convergence. At the same time, the evolution of the reward value also shows that, compared with the rule-based and no-control strategies, the control strategy based on improved DDPG can gradually learn not only the traffic data but also the factors affecting them, including road alignment. Compared with the control strategy based on DDPG, the exploration process of the improved DDPG algorithm is more flexible, and it finally obtains the maximum reward value.
5.3.2. Results and Analysis
The proposed method was compared with the basic DDPG algorithm, rule-based control, and the no-control method at different traffic flow levels. Table 3, Table 4 and Table 5 show the effects of the traffic control strategies under the different traffic flow levels, respectively. The results show that although no control outperforms the other three strategies in terms of efficiency at the low traffic level, it performs worse than the other methods in terms of safety. Compared with the other three methods, the control based on improved DDPG performs excellently in safety; in terms of efficiency, it is also better than the rule-based and DDPG-based control strategies. Taking no control as the baseline, in the medium-level traffic flow environment, the rule-based, DDPG-based, and improved-DDPG-based control strategies reduced potential collisions by 17.24%, 24.14%, and 29.31%, respectively, and improved efficiency by 2.82%, 3.90%, and 7.21%. In the high-level traffic flow environment, the performance of the four control strategies is similar to that at the medium level. The control strategy based on improved DDPG performs best in terms of both safety and efficiency. The results demonstrate that VSL control based on the improved DDPG algorithm has the best performance in efficiency and safety under emergencies, and the improvement in safety is greater than that in efficiency. This suggests that at a relatively low traffic flow level, excessive vehicle control is not required; only timely guidance of traffic flow is needed.
In addition, the weights of the two evaluation indicators are calculated based on the entropy weight method [35,36] for the three levels of traffic flows. The comprehensive evaluations of the different methods under the three traffic flows are shown in Figure 13. It shows that, taking the no-control method as the benchmark, the comprehensive evaluation index value of the IDDPG-VSL method is the highest under all three traffic flows among the three compared methods. Therefore, from a comprehensive point of view, the traffic management and control strategy based on IDDPG-VSL gives the best traffic performance.
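The entropy weight method referenced above can be sketched for a generic decision matrix as follows (the sample matrix is illustrative, not the paper's measurements):

```python
import numpy as np

def entropy_weights(X):
    """Entropy weight method: rows are alternatives, columns are indicators.

    Indicators whose values differ more across alternatives carry more
    information (lower entropy) and therefore receive larger weights.
    """
    X = np.asarray(X, dtype=float)
    P = X / X.sum(axis=0)                      # normalize each column to proportions
    n = X.shape[0]
    with np.errstate(divide="ignore", invalid="ignore"):
        logP = np.where(P > 0, np.log(P), 0.0)
    E = -(P * logP).sum(axis=0) / np.log(n)    # entropy of each indicator column
    d = 1.0 - E                                # degree of diversification
    return d / d.sum()                         # weights sum to 1

# Illustrative matrix: 4 control methods x 2 indicators (normalized ATT, PCN scores).
weights = entropy_weights([[0.9, 0.4], [0.8, 0.6], [0.7, 0.8], [0.6, 1.0]])
```

An indicator that is identical across all alternatives receives zero weight, since it cannot discriminate between the methods.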
6. Conclusions and Prospect
In response to the common traffic problem of reduced road capacity during emergencies, this paper proposes a self-triggered intelligent traffic management and control strategy based on VSL and reinforcement learning to achieve effective traffic flow evacuation. On the one hand, when an accident occurs, this strategy is triggered immediately and is automatically shut down when the traffic flow recovers, avoiding excessive management and control of the transportation system; on the other hand, this strategy can output suitable VSL values for dynamic traffic flow. To test the effect of the strategy, this paper established a joint Python and SUMO simulation platform for verification. The results show that, in terms of safety, under different traffic flow levels, the proposed strategy improves by over 28.30% compared with the other methods. In terms of efficiency, except for being inferior to no control under low traffic flow conditions, it improves by over 7.21% compared with the others. In addition, from the perspective of the proposed comprehensive indicator, the IDDPG-VSL method has the highest performance under the three traffic flow levels compared with the other methods. The improvements in safety and efficiency benefit sustainable transport systems, making a positive contribution to environmental, social, and economic sustainability.
The proposed strategy applies to a wide range of scenarios, including traffic congestion caused by traffic accidents, bad weather, and so on. However, it does not apply to large-scale accidents that completely close the road. In subsequent research, we will consider expanding the traffic scenarios and integrating ramp control into the VSL strategy to deal with larger traffic accidents. Moreover, factors affecting the traffic flow on the road, such as weather [37], will also be taken into account.