Article

A Resilient Intelligent Traffic Signal Control Scheme for Accident Scenario at Intersections via Deep Reinforcement Learning

1 Department of Electrical and Computer Engineering, Tarbiat Modares University, Tehran 14115-111, Iran
2 Department of Mechanical Engineering, Polytechnique Montréal, Montreal, QC H3T 1J4, Canada
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(2), 1329; https://doi.org/10.3390/su15021329
Submission received: 5 November 2022 / Revised: 27 December 2022 / Accepted: 6 January 2023 / Published: 10 January 2023

Abstract

Deep reinforcement learning methods have shown promising results in the development of adaptive traffic signal controllers. Accidents, weather conditions, or special events all have the potential to abruptly alter the traffic flow in real life. The traffic light must take immediate and appropriate action based on a reasonable understanding of the environment. In this way, traffic congestion would be prevented. In this paper, we develop a reliable controller for such a highly dynamic environment and investigate the resilience of these controllers to a variety of environmental disruptions, such as accidents. In this method, the agent is provided with a complete understanding of the environment by discretizing the intersection and modifying the state space. The proposed algorithm is independent of the location and time of accidents. If the location of the accident changes, the agent does not need to be retrained. The agent is trained using deep Q-learning and experience replay. The model is evaluated in the traffic microsimulator SUMO. The simulation results demonstrate that the proposed method is effective at shortening queues when there is disruption.

1. Introduction

Traffic congestion has numerous negative effects, including air pollution, extended travel times, and energy loss. Although expanding roads may appear to be an easy way to reduce congestion, such investment is costly and ineffective in the long run. Optimizing traffic signal control is therefore a better option, as it makes existing infrastructure work more efficiently [1].

1.1. Literature Review

Fixed timing and adaptive traffic light control are the two main approaches to controlling traffic lights. Instead of taking into account real-time traffic data, fixed timing control uses historical data to determine the timing of traffic lights [2]. However, traffic demand is not constant: weather conditions, accidents, and special events can all affect it. Consequently, the effectiveness of fixed-time traffic light control is compromised when traffic demand is dynamic. Based on real-time traffic demand, adaptive traffic signal control (ATSC) can optimize the duration of green and red traffic signals using inductive loops to detect vehicles [3,4,5]. The well-known adaptive signal control systems SCOOT [6], SCATS [7], OPAC [8], and PRODYN [9] regulate signal timings by solving optimization problems.
A model-based predictive controller was presented in [10], which aims to reduce traffic congestion by reducing the number of cars waiting at red lights. In addition to analytical models, simulation models can also be used to analyze traffic system responses under various conditions. By modeling traffic flow, these simulations can mimic real-world processes and provide information about them. There are two types of traffic simulation models: microscopic and macroscopic. A microscopic simulation model takes into account the individual behavior of the driver as well as interactions with other vehicles and pedestrians, while a macroscopic simulation considers the traffic flow as a whole. Many microsimulation packages are used for studying urban traffic dynamics, including AIMSUN, MATSim, Paramics, SUMO, and VISSIM. The simulation models AIMSUN, VISSIM, and SUMO allow the user to create and control models with an external programming language via APIs. Traffic signal control uses a microsimulation model of the network to evaluate the current solution, and the results of each simulation are fed back into the proposed algorithm until a stopping criterion is reached [11]. Some research in the literature analyzes the traffic signal control problem from a multiobjective perspective [12,13]. Ref. [12] used a fuzzy programming approach in which different weight coefficients were assigned to the different optimization objectives; after assigning the weights, the multiobjective function was transformed into a single objective. Ref. [14] formulated the traffic signal optimization problem as a mixed-integer linear programming problem and a nonlinear programming model. Using convex (quadratic) programming, [15] optimized traffic signal timings at an isolated intersection. In [16], a new algorithm for traffic control optimization was proposed using the number of vehicles in the queue; a comparison of this algorithm with the GA and PSA algorithms showed a 10 percent and 16 percent improvement in queue length reduction, respectively.
Artificial intelligence techniques have recently been applied to traffic signal control [17]. In this area, the reinforcement learning (RL) approach enables the adaptive control of the traffic signals, where the control problem is formulated as a Markov decision process (MDP) [18]. Reinforcement learning can model complex environments to derive optimal control actions by interacting with dynamic systems and observing changes in their behavior. In this approach, when an agent interacts with the environment, it tries to maximize the specified cumulative reward to determine the optimal state-action policy [19]. In reinforcement learning, function approximations are used to compute the value of each action in applications where the state space becomes larger [20].
Numerous studies have investigated the use of deep reinforcement learning techniques for traffic light control [21,22,23,24,25,26,27,28]. These techniques incorporate reinforcement learning and deep learning [29]. Deep learning can help in making optimal decisions in complex traffic environments where decision-making is challenging. This is because multiple layers in deep networks enable choosing the best approximations for decisions.
An agent formulation in RL that deals with the definition of state, action, and reward undoubtedly has implications for the performance of the algorithm. The reward is a numerical value sent by the environment when the agent performs an action. This value is interpreted as an objective function in optimization problems; thus, the total reward should be maximized throughout the entire optimization process. In traffic signal control, various factors are considered as reward functions, such as total queue length, waiting time, number of stops, and average speed, to name a few. In some studies, the weighted sum of multiple factors is considered as the reward [22,30,31,32]. It is possible to perform different types of actions, such as setting the duration of the current phase [24], maintaining the current phase or moving to the next phase, or selecting the next phase from a predetermined set [20,21]. A state is defined as the agent’s knowledge about its environment. Consequently, positions, average speed, queue length, waiting time, phase, and duration of phases at traffic lights are some of the elements of states considered in the literature [22,24,33,34,35]. Researchers proposed using image representations of intersections as states to better perceive the environment [24,36]. However, there are cases where complex states such as images do not lead to excellent performance and simple states are preferable [30]. Ref. [1] presented discrete traffic state encoding (DTSE), where each lane is divided into cells. Based on [1], ref. [35] proposed a deep reinforcement learning method for controlling traffic lights at a single intersection by discretizing the lanes. Each cell has a different length. In the state vector, the presence of a vehicle within the cell is indicated as 1 and the absence of a vehicle is indicated as 0. This state definition has resulted in the desired performance in low traffic volume scenarios. A reinforcement learning method was proposed in [37] for achieving maximum intersection throughput. Based on the total pressure and the total queue length at an intersection, [37] defines an adaptive reward function that uses an exponential approach. Recently, safety and accident analysis have become increasingly popular in traffic signal control [38,39,40,41]. An adaptive traffic signal control algorithm was proposed in [38] to maximize traffic efficiency and safety simultaneously. In [38], high-resolution real-time traffic data are used as the input to the agent, which selects the appropriate signal phase every second to reduce the vehicle delay and crash risk at the intersection. There is a discussion in [41] about the robustness of deep reinforcement learning-based controllers with respect to a variety of uncertainty scenarios.

1.2. Contribution

According to the literature review, since the trained agent (traffic light) in the traffic light control understands the environment through its state representation, environmental disruptions can have a negative impact on the training process. Some of the things that make the environment unpredictable and cause disruption include bad weather, accidents, the stopping of one or more vehicles, changes in the traffic density during special events, and other factors. As a result, the control algorithm is unquestionably more resilient when it has a better understanding of its surroundings. The following is a summary of the contributions made by our study:
  • In this paper, we explore the resilience of a deep reinforcement learning algorithm on a single four-way intersection under accident conditions. In [35], a deep reinforcement learning approach to adaptive traffic light control is presented in which a vector of a vehicle’s presence or absence is considered as an observation of the environment. However, the proposed algorithm in [35] is not resilient under accident conditions, especially in a high traffic scenario. Hence, we modified the state space, so that three types of information will be used for the intersection instead of relying only on the presence of the vehicle as in [35]. The reason for this is that providing inputs that indicate a change in the environmental conditions associated with an accident gives the agent a more realistic perception of the environment. Therefore, we need input variables that accurately reflect changes in the environment. Since an accident causes cars to stop, the agent can better understand the current state of the intersection by adding the number of queued cars to that state. Consequently, this paper considers the presence or absence of cars and the number of queued cars as well as the current traffic light phase as the state. We discuss its utility for a resilient adaptive traffic management system.
  • The proposed algorithm is independent of the location and time of the accident. To demonstrate this, several accidents are simulated at different locations and times in the environment to assess the effectiveness of the method. In this paper, we simulate an accident by halting a vehicle for a period of time on the lane. This can lead to traffic congestion.
  • To the best of our knowledge, there are no studies on the occurrence of multiple accidents whose duration overlaps. In this paper, two different cars are halted at two different places, so their stopping times overlap. It can be observed that the algorithm converges and the corresponding control signal avoids a long queue of cars on the intersection.
  • The proposed resilient control algorithm will be able to handle accidents regardless of where they occur. In other words, if the location of the accident in the training phase differs from the location of the accident during the testing phase, the agent does not need to be retrained to become resilient.
  • The proposed method increases the convergence rate compared to the existing deep reinforcement learning algorithm in [35] as well as to prior work. This means that the algorithm converges to the optimal control signal with fewer episodes during training.

1.3. Organization

Since it is very difficult to obtain a model of the environment with its associated uncertainties for the traffic control problem, a model-free learning algorithm is required. On the other hand, the problem defined in this study has a discrete action space, so the agent can determine the value of each action and choose the one with the highest value. Thus, a value-based algorithm can be used to learn. A simple learning algorithm with these properties that is capable of addressing the aforementioned problem is the Q-learning algorithm, a model-free reinforcement learning technique that is widely used in this field. Therefore, the control problem is modeled as deep reinforcement learning and the Q-learning algorithm is employed as the learning algorithm. In our model, a fully connected network approximates the Q-values for the actions. In addition, experience replay and target networks are used to avoid the divergence of the algorithm [33].
To evaluate the performance of the algorithm, the Simulation of Urban MObility (SUMO) traffic simulator is used [42]. Through its application programming interface (API), observations of the intersection can be retrieved and the agent's actions can be applied to the simulation, enabling a real-time traffic simulation. Even in high traffic situations, the evaluation results show that our approach can effectively deal with disruption of the environment.
The framework of the proposed methodology is summarized in Figure 1. The rest of this article is structured as follows: Section 2 presents the problem formulation. The details of the proposed model including the definition of the state, action, reward, and deep network are described in Section 3. In Section 4, the algorithm is evaluated by simulation and comparative results are presented to prove the effectiveness of the proposed method. Finally, in Section 5, the paper is concluded.

2. Problem Statement

2.1. Problem Assumption

In this paper, a single four-way intersection is considered as the environment, simulated in SUMO. The traffic light is utilized as an agent to control the traffic flow. The agent observes the environment and uses this information to make decisions. The reward for the previously chosen action is also calculated using a measure of the current traffic situation. For the agent training, a sample of data containing information about the last simulation steps is stored in the memory. Based on the reward received and the state of the environment, the agent then chooses a new action, and the simulation continues. The environment is shown in Figure 2. There are a total of four lanes on each road and each lane is 750 m long. In the right lane, drivers can turn right or go straight; in the two middle lanes, drivers can only go straight; and in the left lane, only a left turn is allowed. As shown in Table 1, the traffic signal in this problem is configured to have four main phases, called green phases, and four auxiliary phases, called yellow phases. E, W, N, and S stand for the east, west, north, and south arms, respectively. The durations of the green phase and the yellow phase are fixed at 10 and 4 s, respectively. When there are vehicles coming from multiple directions at an intersection, a single traffic signal may not be sufficient to control all the traffic. Therefore, multiple traffic lights must work together at an intersection. Traffic lights direct vehicles at such an intersection from nonconflicting directions simultaneously by changing the traffic light statuses [34].
In this paper, each arm is divided into four lanes, with the left lane controlled by its own traffic light, while the other three are controlled by a common traffic light. Consequently, signal control refers to the process of selecting an appropriate phase from predefined phases. Moreover, the disruption of the environment takes the form of an accident. Despite the occurrence of accidents in the environment, our goal is to train the agent to maximize the traffic flow through the intersection so that the queue length remains short even in high traffic volume scenarios.

2.2. Optimization Goal

It is essential to keep in mind that the lane in which a car has crashed will be blocked in the event of an accident. As a result, the issue at hand is how to select the most effective phase from the set of predetermined phases for the traffic light in order to ensure that other vehicles can cross the intersection without waiting in line and that the intersection operates as efficiently as possible.
The principal objective of this work is to increase the traffic flow at the intersection and reduce the length of the queue that forms when an accident occurs. Let $q_l$ be the number of vehicles in the queue on lane $l$ in a time step, and let $Q_m$ be the maximum length of the queue formed on all lanes of the intersection during the entire time period $t$. $Q_m$ is defined as follows, where $L$ is the set of all the lanes at the intersection:
$Q_m = \max_{l \in L} q_l$
Our goal is to shorten the maximum length of the queue caused by a traffic accident over all lanes, especially during rush hour. As a result, the optimization problem of this paper is to reduce the maximum length of the queues on all lanes formed during the time period $t$.
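For illustration, this metric can be read directly from SUMO through the TraCI Python API at every simulation step; a minimal sketch is given below, in which the lane identifiers are hypothetical, since the actual IDs depend on the network file.

```python
import traci

# Incoming lanes of the four-way intersection (hypothetical IDs; the actual
# SUMO network file defines its own lane identifiers).
INCOMING_LANES = [f"{arm}_{lane}" for arm in ("N2TL", "S2TL", "E2TL", "W2TL")
                  for lane in range(4)]

def max_queue_length(lanes=INCOMING_LANES):
    """Return Q_m = max_{l in L} q_l, where q_l is the number of halted
    vehicles (speed below 0.1 m/s) reported by SUMO for lane l."""
    return max(traci.lane.getLastStepHaltingNumber(lane) for lane in lanes)
```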

2.3. Methodology: Background on Deep Reinforcement Learning

Figure 3 depicts the standard cycle of reinforcement learning.
At the beginning of time step $t$, the agent observes the intersection state $s_t$ based on its interaction with the environment. After observing this state, an action $a_t$ is taken and the traffic signal is actuated by the agent. Due to the movement of the vehicles, the state of the environment changes to $s_{t+1}$ and the agent receives a reward signal $r_{t+1}$ at the end of the time step. Based on a performance measure, a reward signal informs the agent about the appropriateness of the action performed [33].
A reinforcement learning strategy is about learning an optimal policy that maximizes the cumulative expected rewards based on the current state. If the agent knows the optimal Q-values of successive states, the optimal action selection policy is simply to choose the action that yields the highest cumulative reward. The optimal Q-values $Q^*(s, a)$ are obtained by the recursive equation known as the Bellman optimality equation [20]:
$Q^*(s, a) = \mathbb{E}\left\{ R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a \right\} \quad \text{for all } s \in S,\ a \in A$
Based on (3), the agent's optimal cumulative future reward consists of the immediate reward it receives after choosing action $a$ in state $s$ plus the optimal future reward thereafter. The expression $\max_{a'}$ means that the most valuable of all possible actions is chosen in state $S_{t+1}$.
It is worth noting that the solution of (3) requires that the states are finite and the transition probabilities are known. However, in complex traffic environments, there are numerous states, so it is extremely difficult to calculate the Q-value for each state-action pair. In this paper, traffic signal control is formulated as a deep reinforcement learning problem, as illustrated in Figure 4.
To find the optimal signal control policy, a mechanism called deep Q-learning is used. In this approach, Equation (3) is not solved directly; instead, a parameterized deep neural network (DNN) is used to approximate the optimal Q-values $Q^*(s, a)$, with the approximation updated by the Q-learning rule.
The Q-value is updated through the following rule:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right)$
Here, $(s_t, a_t)$ and $(s_{t+1}, a_{t+1})$ represent the current and next state-action pairs, respectively. $\alpha$ is the learning rate and $r_{t+1}$ is the reward that results from performing action $a_t$ in state $s_t$. $\gamma$ is the discount factor, $0 < \gamma < 1$, which indicates how much importance the agent assigns to future rewards. When $\gamma = 0$, the agent considers only immediate rewards, while values of $\gamma$ close to 1 make the agent weight future rewards more heavily. After training the agent, the optimal action $a_t$ to take in state $s_t$ is the one that maximizes $Q(s_t, a_t)$.

3. Proposed Model Construction

Using the reinforcement learning approach, states, actions, and rewards should be defined when developing a traffic light control system. The following is a description of our model’s three components.

3.1. State Definition

The state must provide sufficient information about the intersection, particularly the distribution of vehicles on each arm, for the agent to effectively optimize the traffic flow. Figure 5 illustrates the discretization of the road proposed in [35]. In this work, we adopt the same discretization but use additional information to define the state.
As shown in Figure 5, a cell that is closer to a traffic light is smaller than one that is farther away, since an accurate perception of the distribution of vehicles near traffic lights is more important.
It is worth noting that the trained agents should be equipped with sufficient knowledge of the environment. On the one hand, the presence or absence of a vehicle may not provide sufficient information about the environment. On the other hand, accidents affect the number of vehicles in the queue. To make the algorithm more resilient, this information should be given to the agent as an observation of the environment. It should be noted that, in our method, each intersection arm consists of 20 cells with different sizes (Figure 5), so the intersection contains 80 cells in total. Each cell represents one element in the state vector. This means that there are 80 elements in the vectors E and Q.
The 80 entries indicating the presence or absence of a vehicle in each cell, together with the 80 entries giving the number of queued vehicles in each cell, yield 160 elements. In addition, the traffic light phase is encoded as a 4-dimensional vector and appended to the state. As a result, the state vector used for training and testing contains 164 elements ((20 × 4 × 2) + 4 = 164), i.e., its dimension is 164 × 1.
As a result, a vector $E = [1, 1, 0, 1, 0, \ldots, 1]^T$ is used to represent the presence or absence of vehicles in each cell, where 1 indicates the presence and 0 the absence of a vehicle; $T$ denotes the transpose operator. The vector $Q = [5, 3, 0, 2, 2, \ldots, 1]^T$ (the numbers are given as examples) indicates the number of queued vehicles in each cell. Moreover, the current green phase of the traffic light is encoded as a four-dimensional one-hot vector $P = [1, 0, 0, 0]^T$, since there are four possible green phases for the agent. Once the green light is turned on for a phase, the corresponding entry in the vector is set to 1.
Consequently, in this paper, the vector of the presence or absence of vehicles, the vector of the number of queued cars in each cell, and the current traffic light phase are concatenated as the state vector $S = [E^T, Q^T, P^T]^T$.
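To make the construction of this 164-element state concrete, the following Python sketch queries SUMO through the TraCI API. The edge identifiers, the cell boundaries, and the helper names are assumptions introduced for illustration; they are not taken from the authors' implementation.

```python
import numpy as np
import traci

ARMS = ["N2TL", "S2TL", "E2TL", "W2TL"]   # hypothetical incoming edge IDs
CELLS_PER_ARM = 20                        # 20 cells per arm, 80 cells in total
ROAD_LENGTH = 750.0                       # each arm is 750 m long

# Hypothetical cell boundaries (distance to the stop line in metres): finer
# cells near the traffic light, coarser cells farther away.
CELL_EDGES = np.concatenate([np.linspace(7, 98, 14),
                             [140, 200, 300, 450, 600, 750]])

def get_state(green_phase_index):
    """Build the state S = [E, Q, P]: vehicle presence (80), queued-vehicle
    counts (80), and the one-hot current green phase (4)."""
    presence = np.zeros(4 * CELLS_PER_ARM)
    queue = np.zeros(4 * CELLS_PER_ARM)
    for veh in traci.vehicle.getIDList():
        edge = traci.vehicle.getRoadID(veh)
        if edge not in ARMS:
            continue                                    # not on an incoming arm
        # lane position is measured from the start of the edge, so the
        # distance to the stop line is the remaining length of the arm
        dist = ROAD_LENGTH - traci.vehicle.getLanePosition(veh)
        cell = min(int(np.searchsorted(CELL_EDGES, dist)), CELLS_PER_ARM - 1)
        idx = ARMS.index(edge) * CELLS_PER_ARM + cell
        presence[idx] = 1.0
        if traci.vehicle.getSpeed(veh) < 0.1:           # queued vehicle
            queue[idx] += 1.0
    phase = np.zeros(4)
    phase[green_phase_index] = 1.0
    return np.concatenate([presence, queue, phase])     # shape (164,)
```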

3.2. Action

Depending on the current state of the intersection, the agent must perform the appropriate action. The action space of this paper is defined by choosing the green phase from a predefined set of green phases, namely {NSA, NSLA, EWA, EWLA}. NSA, or north-south advance, means that the green phase applies to vehicles in the northern and southern arms that want to go straight or turn right. Similarly, EWA means that the green phase applies to vehicles going straight or wanting to turn right in the westbound or eastbound arms. NSLA means that the green phase is active for vehicles turning left in the north and south arms, and EWLA applies to vehicles turning left in the east and west arms. As mentioned earlier, the green phase time and the yellow phase time are 10 s and 4 s, respectively. If the action in step t differs from the action in step t − 1, a yellow phase of 4 s is activated between the two actions. Otherwise, there is no yellow phase, i.e., the current green phase continues [35].
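A minimal sketch of how such an action could be actuated through TraCI is shown below. The traffic light ID and the ordering of green and yellow phases in the SUMO signal program are assumptions; in a real network they must match the phase definitions of the traffic light program.

```python
import traci

TLS_ID = "TL"        # hypothetical traffic light ID in the SUMO network
GREEN_DURATION = 10  # seconds, as defined in the paper
YELLOW_DURATION = 4  # seconds

def run_steps(seconds):
    """Advance the simulation; one SUMO step corresponds to one second here."""
    for _ in range(seconds):
        traci.simulationStep()

def apply_action(action, previous_action):
    """Actuate the chosen green phase (0=NSA, 1=NSLA, 2=EWA, 3=EWLA),
    inserting a 4 s yellow phase whenever the green phase changes.
    Assumes green phases sit at even indices and their yellow phases at the
    following odd indices of the signal program."""
    if previous_action is not None and action != previous_action:
        traci.trafficlight.setPhase(TLS_ID, previous_action * 2 + 1)  # yellow
        run_steps(YELLOW_DURATION)
    traci.trafficlight.setPhase(TLS_ID, action * 2)                   # green
    run_steps(GREEN_DURATION)
```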

3.3. Reward

The reward should be defined based on a measure of traffic efficiency. Since the main objective of this work is to reduce the length of the queue caused by an accident, the waiting time, which closely reflects the queue length, is considered as the performance measure. Therefore, the reward is defined as the reduction in the cumulative waiting time between two consecutive actions. The reward function is represented as follows:
$r_t = wt_{t-1} - wt_t$
in which:
$wt_t = \sum_{i=1}^{N} w_{i,t}$
The waiting time of vehicle $i$ at time step $t$ is denoted by $w_{i,t}$. Here, the waiting time refers to the time in seconds that a vehicle has a speed of less than 0.1 m/s. $N$ is the total number of vehicles in time step $t$. The reward function is such that a positive value represents a successful action by the agent. Choosing the right action results in fewer vehicles in the queue at the current time step $t$ than at the previous time step $t-1$, and the waiting time for the vehicles is also reduced. In this way, the reward grows over time.
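One possible way to obtain this reward from the simulator is sketched below. It uses SUMO's accumulated waiting time, which counts the seconds a vehicle spends below 0.1 m/s; the class is an illustrative stand-in, not the authors' exact implementation.

```python
import traci

class WaitingTimeReward:
    """Reward r_t = wt_{t-1} - wt_t: the reduction in the cumulative waiting
    time of all vehicles between two consecutive actions."""

    def __init__(self):
        self.previous_total_wait = 0.0

    def __call__(self):
        total_wait = sum(traci.vehicle.getAccumulatedWaitingTime(veh)
                         for veh in traci.vehicle.getIDList())
        reward = self.previous_total_wait - total_wait
        self.previous_total_wait = total_wait
        return reward
```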

3.4. Deep Neural Network Structure

In this paper, a fully connected network is used to estimate the Q-values $Q(s, a; \theta)$ for all actions $a \in A$ under the observed state $s$. A visual representation of the deep neural network is shown in Figure 6. The intersection state $s_t$ is the input to the network, and a vector of the estimated values $Q(s, a; \theta)$ is generated as the output; $\theta$ denotes the parameters of the network. The parameters should be learned based on traffic data so that $Q(s, a; \theta)$ approaches the optimal value $Q^*(s_t, a_t)$.
The state vector observed in the environment is the input of the deep neural network, so the number of neurons in the input layer must equal the number of elements in the state vector. As explained in Section 3.1, the dimension of the state vector is 164 × 1, which corresponds to the discretization of the intersection arms; accordingly, the input layer consists of 164 neurons. In addition, the network contains 5 hidden layers of 400 neurons with the rectified linear unit (ReLU) activation function and an output layer of 4 neurons with a linear activation function. ReLU is the most commonly used activation function in hidden layers and mitigates the vanishing gradient problem. Due to its constant derivative (slope) on the linear part and zero derivative elsewhere, learning with the ReLU function is much faster; moreover, since ReLU does not contain exponential terms, the calculations are also faster [43]. The neurons of the output layer give the Q-value of each action with respect to a state.
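The architecture described above is a plain feed-forward network. A minimal PyTorch sketch is given below; the framework choice is ours for illustration and is not stated in the paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected Q-network: 164 state inputs, five hidden layers of
    400 ReLU units each, and four linear outputs (one Q-value per green phase)."""

    def __init__(self, state_dim=164, hidden_dim=400, num_actions=4, num_hidden=5):
        super().__init__()
        layers, in_dim = [], state_dim
        for _ in range(num_hidden):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, num_actions))  # linear output layer
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return self.net(state)  # Q(s, a; theta) for all four actions

# Example: Q-values for one (batched) state vector of size 164
# q_values = QNetwork()(torch.zeros(1, 164))   # tensor of shape (1, 4)
```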

3.5. Model Training

To train the deep network, the techniques of experience replay and a target Q-network are used. Observations are grouped into batches that are used to train the agent. A batch consists of experiences, or samples, $e = \{s_t, a_t, r_{t+1}, s_{t+1}\}$, where $r_{t+1}$ is the reward received when the environment evolves to state $s_{t+1}$ after action $a_t$ is performed in state $s_t$. Experience replay reduces the correlations between samples, and the current Q-network parameters are periodically copied to the target network. In each training instance, to learn the deep neural network parameters, the agent requires training data consisting of the inputs $e = \{s_t, a_t, r_{t+1}, s_{t+1}\}$ and the target Q-values $Q^*(s_t, a_t)$ as outputs. The input data for the training are randomly retrieved from the replay memory (see Figure 7) and each sample in each batch is used for the training. The target Q-values $Q^*(s_t, a_t)$ are estimated as:
$Q^*(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1}} Q'(s_{t+1}, a_{t+1}; \theta')$
where $Q'(s_{t+1}, a_{t+1}; \theta')$ is the output of a separate target network, with parameters $\theta'$ and a structure similar to that of the main deep neural network, and denotes the Q-value associated with a subsequent action after the action $a_t$ is performed in state $s_t$.
To update the estimated Q-values, (7) uses the information available in each sample. For each training instance, the number of samples retrieved from the memory forms the batch size, which in this case is 100. The replay memory is used to store these batches and then apply them to the agent at different time steps. The memory has a finite capacity, indicating the number of samples it can store, which in this work is set to 5000. Old samples are discarded when the memory is full.
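A replay memory with exactly these properties (capacity 5000, batches of 100, oldest samples discarded first) can be sketched as follows; the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    """Finite memory of experience tuples e = (s_t, a_t, r_{t+1}, s_{t+1});
    the deque discards the oldest samples once the capacity is reached."""

    def __init__(self, capacity=5000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=100):
        # uniform random sampling breaks the temporal correlation
        # between consecutive experiences
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```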
The network parameters are learned by minimizing the following mean squared error (MSE) of the Q-values as the loss function. A stochastic gradient descent method, which performs a parameter update for each training instance, is used to minimize this error:
$\text{loss} = \frac{1}{m} \sum_{t=1}^{m} \left( r_{t+1} + \gamma \max_{a_{t+1}} Q'(s_{t+1}, a_{t+1}; \theta') - Q(s_t, a_t; \theta) \right)^2$
where $m$ is the size of the input data (i.e., the batch size).
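The following sketch shows one mini-batch update that combines the target computation of (7) with the MSE loss above. It continues the PyTorch sketch; the discount factor value is a placeholder (the actual hyperparameters are listed in Table 2).

```python
import torch
import torch.nn.functional as F

def train_step(policy_net, target_net, optimizer, batch, gamma=0.75):
    """One DQN update: targets r + gamma * max_a' Q'(s', a'; theta') from the
    target network, MSE loss against Q(s, a; theta), gradient step on theta."""
    states = torch.tensor([b[0] for b in batch], dtype=torch.float32)
    actions = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    next_states = torch.tensor([b[3] for b in batch], dtype=torch.float32)

    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target network is not updated by the gradient
        targets = rewards + gamma * target_net(next_states).max(dim=1).values

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically copy the online parameters theta into the target network theta':
# target_net.load_state_dict(policy_net.state_dict())
```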
After training, the agent is expected to be able to estimate the optimal Q-values and learn the optimal action selection policy. Since the agent has only experienced a limited number of states and not the entire state space, the Q-values for the states not experienced may not be as accurate as they could be. In addition, the state space itself may be constantly changing, leading to inefficiencies in the previously estimated Q-values.
Therefore, the agent should decide whether to use the Q-values it has already learned and select the action with the highest Q-value (exploitation) or to explore other possible actions to improve the estimation of the Q-values and eventually optimize the action policy. In this paper, the ε-greedy method is used as the action policy. In this method, a probability ε is defined such that the agent takes an exploratory action with probability ε and chooses the exploitative action with the highest estimated Q-value with probability 1 − ε. Note that, at the beginning of the training, ε = 1, which means that the agent performs only exploratory actions. As the training progresses, ε decreases and exploitation gradually takes over. The ε-greedy formulation is presented below:
$\varepsilon = 1 - \frac{\text{current episode}}{\text{total number of episodes}}$
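A compact sketch of this ε-greedy rule with its linear decay over the 50 training episodes is given below; the function and argument names are illustrative.

```python
import random
import torch

def select_action(policy_net, state, episode, total_episodes=50, num_actions=4):
    """Epsilon-greedy action selection with epsilon = 1 - episode/total_episodes,
    so the agent starts with pure exploration and gradually shifts to exploitation."""
    epsilon = 1.0 - episode / total_episodes
    if random.random() < epsilon:
        return random.randrange(num_actions)                  # explore
    with torch.no_grad():
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        return int(policy_net(state_t).argmax(dim=1).item())  # exploit
```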

4. Evaluation Results

In this section, we describe the simulation results that evaluate the performance of the proposed approach. We perform the evaluations in the SUMO simulator, where real-time traffic is simulated. Figure 8 shows the studied environment for the evaluation along with the simulated data on the SUMO platform. Our analysis focuses on a single intersection to understand the resilience of deep RL controllers; different network configurations and other input distributions will be explored in future work. Traffic congestion can be the result of accidents that disrupt the traffic system. Different conditions and events such as accidents should be considered when training the agent so that the controllers learn to be resilient in uncertain environments. To create a realistic situation, in this paper we simulate a lane closure caused by an unpredictable accident by stopping a southbound vehicle. During each episode, the vehicle stops until the simulation time expires. Thus, the control signal should be adjusted so that a long queue does not form on the road where the accident occurred.
During the simulation, we try to maximize the defined reward, i.e., minimize the cumulative delay of all vehicles, and minimize the length of the vehicle queue in case of an accident.

4.1. Traffic Generation Process

Several probability distributions have been proposed in the literature that can be used to model a real traffic flow [44]. To move closer to reality, we refer to [35], where the travel demand is based on the Weibull distribution. Figure 9 shows the traffic generated in an episode, with the number of vehicles at each step of the simulation. This distribution represents the traffic flow over a whole day. The number of vehicles increases in the initial phase, representing the peak hour; over time, the number of arriving cars decreases, describing the gradual easing of the traffic congestion.
Following the shape of the Weibull distribution (Figure 9), the traffic flow is generated so that a small number of vehicles initially enter the environment, the number gradually increases to represent the peak hours, and then it gradually decreases toward the end of the day. It should be noted that, in each episode, a random generator function randomly determines each vehicle's origin and destination.
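One possible way to realize such a demand profile is sketched below; the Weibull shape parameter and the mapping onto the 5400 simulation steps are assumptions for illustration, since the paper does not report the exact generator settings.

```python
import numpy as np

def generate_departure_steps(n_vehicles=3000, n_steps=5400, shape=2.0, seed=None):
    """Sample vehicle departure steps from a Weibull distribution: few arrivals
    at first, a pronounced peak, and a gradual decline toward the end."""
    rng = np.random.default_rng(seed)
    samples = rng.weibull(shape, size=n_vehicles)
    samples = samples / samples.max()                 # normalise to [0, 1]
    return np.sort((samples * (n_steps - 1)).astype(int))

# Each departure step is then written to a SUMO route file, with the origin
# and destination drawn at random (75% straight, 25% turning).
```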
It should be noted that the traffic flows at different intersections may vary significantly in reality. Therefore, the results presented here represent what the performance would be for a typical all-day traffic volume. The authors intend to evaluate the proposed algorithm under different distributions for generating traffic with short-term fluctuations in a future work.

4.2. Setup for Microscopic Traffic Simulator

The impact of the accident is studied in two different demand scenarios with different traffic volumes: high traffic volume scenario and low traffic volume scenario. In the low traffic volume scenario, 600 vehicles approach the intersection, while in the high traffic volume scenario, 3000 vehicles approach the intersection. In both scenarios, 75% of the vehicles travel straight through the intersection and 25% turn left or right [35].
The agent’s training lasts 50 episodes. Each episode consists of 5400 steps and the time frequency provided by the SUMO is 1 s per step. The parameters of the algorithm are listed in Table 2.

4.3. Model Validation Using Different Travel Demand Scenarios

The resilience of the RL approach is analyzed in both the high and low traffic scenarios. The performance of the trained agent is studied from two perspectives: the reward curve during the training and the queue formed during an accident, according to the waiting time or the dwell time of the vehicles. Moreover, the proposed RL approach is compared with the performance of [35] in terms of the resilience to environmental uncertainties, such as an accident.
Accident simulation: To study the effects of an accident, we simulate a route closure caused by an accident in the SUMO simulator. To accomplish this, one of the cars approaching the traffic light stops before the intersection in one of the lanes. The accident can happen at any place in the area and at any time. In such a scenario, the agent is exposed to the accident during the training.
First, we stop one of the cars in one of the southbound lanes of the intersection 50 m before the intersection. Figure 10 shows the reward curve during the training based on the cumulative negative reward in a low traffic scenario, and Figure 11 shows this curve in a high traffic scenario.
The performance of the agent is indicated by these curves. It can be observed that the agent is well trained in both scenarios and the cumulative reward increases as the training progresses. Moreover, the reward curve shows that a stable policy has been learned: there is no oscillation between choosing bad and good actions, and the algorithm does not drift toward bad decisions. The presented approach also converges quickly to an appropriate policy. These results show the resilience of the proposed algorithm to possible accidents and, thus, its ability to reduce traffic congestion.
Other metrics for evaluating the agent's performance are the average queue length and the cumulative delay during learning, shown in Figure 12 and Figure 13, respectively, for the high traffic scenario. The vehicle delay is defined as the time the vehicle is stationary between two consecutive actions. As the training progresses, the queue length and the delay decrease.
At the beginning of the learning process, the agent only explores the environment and performs its actions randomly. Due to these random actions, many vehicles are queued, resulting in high delays in the early stages of learning. The learning episodes progress and the agent starts to choose exploitative actions rather than exploratory actions due to a better understanding of the action-value function. In this way, the agent makes its decisions in a more optimal way, reducing the cumulative delay and average queue length.

4.4. A Comparison Based on Performance Measure

To show the efficiency of the proposed model in reducing the queue length in the presence of accidents, a comparison is made between the proposed approach and [35,41]. In [35], the traffic control problem is handled without disruption in the environment, and the agent is trained without an accident occurring. Furthermore, the agent perceives the environment only through a vector indicating whether a vehicle is present or not, and it takes actions depending on this. To evaluate the algorithm presented in [35], an accident is created in a high traffic scenario by stopping a vehicle, and the controller is then trained under the new conditions.
Few studies have examined the impact of an accident at a single intersection; the closest study to our paper is [41], which investigates the robustness of a deep RL-based controller in the presence of disruption, such as accidents. Like our work, [41] uses DQN as the learning algorithm and a fully connected network as the function approximator. We use the state vector from [41] as the input to our algorithm and train the agent with these new inputs to compare the performance of our algorithm with that of the algorithm presented in [41].
Figure 14 illustrates the maximum queue length during the test. The longest queue is formed during the peak traffic hours. As can be seen, the presented approach reduces the maximum queue length by 15% compared to the conventional deep RL method in [35] and by about 9% compared to the deep RL method in [41]. Based on this performance metric, we can conclude that the proposed algorithm has learned a superior control policy.
Moreover, the cumulative reward of [35] during the training episodes in the presence of an accident is presented in Figure 15. The observed fluctuations show the learned policy's tendency toward instability; hence, it takes much more time to reach convergence compared to the proposed method. This is because a poorer understanding of the environment degrades the performance of the algorithm when there is disruption in the environment.
Table 3 shows the numerical comparison between the two agents (the one in [35] and the proposed method) during training in the presence of an accident. The average values of the cumulative negative rewards, cumulative delays, and queue length were used as the performance measures; these values are summed over 50 episodes and then averaged. The proposed resilient traffic signal control method outperforms the conventional deep RL method in [35] in all performance metrics. In comparison with [35], the average time delay and average queue length have both decreased by 10%, and the average cumulative reward has increased by 20%. In future work, the authors will test the algorithm on a real dataset and evaluate the error rate of the actual queue length compared to the queue length determined in the simulation.

4.5. Changing the Accident Location

Since accidents can happen at any place, in this section we change the location of the accident and stop a car at three different locations in the northern, western, and eastern arms of the system to prove the effectiveness of the proposed controller. Figure 16 and Figure 17 show the cumulative reward curve and the average queue length curve, respectively, for a high traffic scenario when the accident occurs in the western arm 50 m from the intersection. The reward curve converges after a few episodes, demonstrating the stability of the proposed policy. During the learning process, the agent also reduced the length of the vehicle queues, indicating that there is no traffic congestion.
In order to change the location of the accident, we stopped a car approaching a traffic light 30 m before the intersection in the eastern arm. According to Figure 18 and Figure 19, the algorithm is able to reduce the queue length and increase the cumulative rewards during the training. It seems that the closer the accident is to the intersection, the more episodes the agent needs to learn for the proposed strategy to converge.
After that, we moved the accident location 30 m from the intersection in the northern arm and verified that the proposed method of traffic control was effective in this case. Figure 20 and Figure 21 show the training results in this case, again showing a decreasing queue length and increasing cumulative reward.
The observed trend confirms that the proposed traffic control method is efficient no matter when or where the accident occurs. The control signal prevents a long queue of vehicles, regardless of the location of the accident. The numerical results of the performance evaluation of the two agents in the proposed method and [35] when training in the presence of an accident at different locations are presented in Table 4. N, E, and W refer to the north, east, and west arms of the intersection, respectively. By comparing the performance metrics in the table, we can conclude that the proposed method is generally more effective. In comparison with the conventional method in [35], it has increased the rewards by an average of 22% while reducing the queue length and delays by about 10%.
In another experiment, we investigate the performance of the proposed algorithm when an accident occurs within the intersection or just after it and very close to it. To accomplish this, we stop a vehicle moving away from the intersection in the northern arm for a period of time. The location of the stop is just 5 m after the intersection, so this accident effectively happens almost inside the intersection. The training results are presented in Figure 22 and Figure 23, and the queue length curve during the test is shown in Figure 24. The convergence of the reward curve with little fluctuation confirms the stability of the learned policy, and the decreasing queue length curve during the training shows the desired performance of the proposed method in reducing the traffic.
The queue length curve in the test phase (Figure 24) shows that the highest queue length occurred during the peak hours, but then, the queue gradually reduced. Therefore, the designed controller is reasonably capable of handling the accidents within the intersection.

4.6. Changing the Accident Time

In this experiment, we change the time at which the accident occurs and stop a vehicle at a later time than in the previous experiments. The location of the stop is 50 m from the intersection in the western arm, approaching the intersection. We implemented this experiment in a high traffic scenario. Moreover, in this case, the vehicle is stopped only for a period of time. This means that the accident is temporary and does not persist during the entire training period.
Figure 25 shows the reward curve and Figure 26 shows the average queue length during the training. As can be seen, the cumulative reward increases with the training time and then converges. Therefore, the proposed algorithm can achieve stability regardless of the time of the accident. During the learning process, the length of the vehicle queues also decreases, indicating the evacuation of the traffic.
There is also an experiment where the time and location of the accident are changed, and the accident occurs 350 m from the intersection. The vehicle is stopped only for a certain time. In other words, the accident is temporary. This case evaluates the algorithm’s performance when the accident occurs far from the intersection. The cumulative reward and average queue length are illustrated in Figure 27 and Figure 28, respectively. These curves show that the performance of the algorithm is independent of the time and location of the accident. Over time, the reward curve has converged well, and the queue length has decreased.

4.7. Multiple Simultaneous Accidents

As a test of the proposed method in a more complex environment, we simulated two accidents in this section by stopping two cars at two different locations so that the duration of their stops overlapped. One car stopped 700 m before the intersection in the southern arm and another stopped 730 m before the intersection in the northern arm. It should be noted that the starting time of the stopping cars is not the same, but their stopping periods overlap.
As shown in Figure 29 and Figure 30, the cumulative reward and average queue length curves illustrate the stability and success of the proposed strategy even in the presence of two accidents at different locations.
Lastly, we compare the maximum queue length of the proposed method with that of the conventional Deep RL method in [35] to further evaluate the proposed method. The proposed method reduces the maximum queue length at the intersection by about 24% as shown in Figure 31.

4.8. Model Validation with Different Accidents in Training and Testing

In this section, we perform experiments in which the accident location in the training phase differs from the accident location in the testing phase. In this way, the agent is not retrained under the new conditions and, thus, the resilience of the algorithm is investigated. We consider three scenarios and apply our proposed algorithm to them; these scenarios were tested in high traffic volumes. To verify the effectiveness of our approach, the defined scenarios are also applied to [35,41], and our proposed algorithm is compared to these works in terms of the maximum queue length formed during the testing phase.
  • First scenario
In the first scenario, we train the agent for the case where an accident occurs in the southbound arm at a distance of 350 m from the intersection for a period of time. The trained agent is then tested in an environment where an accident has occurred at a distance of 550 m from the intersection in the eastbound arm, which is relatively far from the traffic light. The queue length curve during the test is shown in Figure 32. In this high-volume traffic flow, the maximum queue length is about 122 vehicles during the peak hours, and the queue then empties well. During this evaluation, the proposed algorithm proved resilient, and the agent optimized the control signal without having to be retrained for the new accident conditions.
  • Second scenario
In the second scenario, the agent is trained with an accident occurring in the westbound arm at a distance of 30 m from the intersection throughout the whole training period. We then tested this agent with an accident occurring 50 m from the intersection in the southern arm. Table 5 shows the maximum queue length for our method compared with [35,41]. It is demonstrated that the proposed control algorithm is resilient to accidents in the environment and can eliminate their effects without retraining. The queue formed behind the light dissipates gradually, with a maximum of 140 cars during the peak hours. Based on the results of the presented approach, the maximum queue length is reduced by 8% compared to [35] and by about 11% compared to [41].
  • Third scenario
An accident occurred 50 m from the intersection in the third scenario, during the training. The agent is then put to the test, assuming the accident occurred on the eastbound arm and 30 m from the intersection. A numerical comparison of the agents in [35,41] and the proposed method is shown in Table 6. Comparing the presented approach to [35,41], we observe that the maximum queue length is reduced by 23% and 28%, respectively.

5. Discussion

In order to implement such control algorithms, adaptive traffic signal controllers can be used, which can learn the control signals from the data from a variety of sources, such as traffic detection sensors. A traffic signal that is adaptive adjusts according to the actual traffic demand. With the development of embedded systems technology, electronic boards, software packages, and closed-loop control systems, the implementation of such algorithms is now possible. Therefore, these adaptive control algorithms can be implemented with a combination of hardware and software [45].
Our proposed model can run both within computer control systems and as an additional application. Considering that the testing phase of the algorithm does not require a high volume of processing, the proposed algorithm can be implemented either as an application in computer control systems or as part of an isolated signal controller. The distinctive feature of the proposed system is that it can be fully implemented on an embedded platform with a single-board computer and a single core as the computing unit.
With these considerations, the proposed algorithm can be extended to multiple intersections. However, this study is currently limited to one intersection and the proposed model has been trained for one intersection. To achieve an optimal policy for all intersections, we would need to use cooperative learning methods; the development of the proposed algorithm for environments with multiple intersections is one of the authors' future works. Another limitation is that the action does not dynamically adjust the duration of the green light; rather, it selects the green phase from a predefined set. It is also planned to dynamically adjust the green time of the traffic light in the future. Moreover, assuming a specific distribution for traffic flow generation can be seen as a limitation of the proposed algorithm. It should be mentioned, however, that the choice of the Weibull distribution for generating the traffic is based on the traffic of a typical day, which includes rush hours.
The data gathered for this study cover a wide range of traffic flow patterns, so they are as close as possible to actual conditions. The vehicle's origin and destination are selected at random in each episode using a random generator function. High and low traffic scenarios are taken into consideration, with 25% of vehicles turning left or right at the intersection and 75% going straight. Traffic detectors are a basic component of any intelligent transportation framework. They can collect information about the traffic conditions, such as volumes, speeds, occupancy, and travel times, for better traffic management. There are two basic kinds of traffic detectors.
One type is in-roadway detectors, such as induction loops, magnetometers, and built-in temperature sensors. The other type is over-roadway detectors, such as video cameras, radar, Bluetooth/Wi-Fi, and weather sensors. Since we have generated the required data through simulation, any sensor that can process and provide this information can be used. For example, these data can be obtained by sensing devices such as high-resolution cameras and image processing systems. The captured images should be processed in real time with an image processing toolkit. The image processing software can capture the traffic images from a camera, detect and count the moving vehicles, estimate the traffic density, and control the traffic signals according to the processed results [46]. On the other hand, traffic data collection often requires an effective method of calibrating surveillance cameras efficiently and accurately. Camera calibration is a necessary step to determine the real positions of the vehicles that appear in the video. Many methods have been proposed for calibrating the detection devices to achieve the desired result [47,48].

6. Conclusions

A deep reinforcement learning framework is proposed in this paper to improve the efficiency of the traffic flow in the presence of accidents and fluctuating traffic demand. The accident is simulated by halting a vehicle approaching a traffic light in one lane. To provide the agent with enough information about the environment, each road is discretized into cells of different sizes based on [35]. To make the deep RL controller more resilient to possible uncertainties, the extended state representation was used while training the agent in the presence of an accident. The state contains three types of information: a vector indicating the presence or absence of vehicles, the number of vehicles in the queue in each cell, and the traffic phase. The work uses SUMO, a realistic traffic simulator, to provide an environment for training and evaluating the RL agents. DQN is used as the learning algorithm and a fully connected network approximates the Q-values for the actions. The simulation results demonstrate that, regardless of the time, location, or number of accidents, the algorithm learns a stable strategy. In addition, when the accident location changes, the proposed resilient control algorithm does not need to be retrained; thus, the controller can handle accidents that occur at any location without retraining. To verify the effectiveness of our main contributions, we compare the proposed method to [35,41]. There is a significant reduction in the queue length compared to [35,41] when accidents occur, so that no vehicles have to wait for long periods of time to cross the intersection. The convergence speed is also better with the proposed model than with [35].
In future work, the concepts of optimizing hyperparameters and locating critical nodes based on centrality in complex networks will be pursued. Given the importance of accident handling and the resilience of adaptive traffic control algorithms for automated vehicles, our next phase of research could also investigate how the proposed algorithm extends to traffic environments with autonomous vehicles.

Author Contributions

Z.Z.: Methodology, software, validation, visualization, writing—original draft, writing—review and editing; M.S.: Definition of project, formal analysis, project administration, supervision, writing—review and editing; S.B.: Formal analysis, investigation, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Genders, W.; Razavi, S. Using a Deep Reinforcement Learning Agent for Traffic Signal Control. arXiv 2016, arXiv:1611.01142. [Google Scholar]
  2. Casas, N. Deep Deterministic Policy Gradient for Urban Traffic Light Control. arXiv 2017, arXiv:1703.09035. [Google Scholar]
  3. Zaidi, A.A.; Kulcsár, B.; Wymeersch, H. Back-Pressure Traffic Signal Control with Fixed and Adaptive Routing for Urban Vehicular Networks. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2134–2143. [Google Scholar] [CrossRef] [Green Version]
  4. Wang, T.; Cao, J.; Hussain, A. Adaptive Traffic Signal Control for Large-Scale Scenario with Cooperative Group-Based Multi-Agent Reinforcement Learning. Transp. Res. Part C Emerg. Technol. 2021, 125, 103046. [Google Scholar] [CrossRef]
  5. Jamil, A.R.M.; Ganguly, K.K.; Nower, N. Adaptive Traffic Signal Control System Using Composite Reward Architecture Based Deep Reinforcement Learning. IET Intell. Transp. Syst. 2021, 14, 2030–2041. [Google Scholar] [CrossRef]
  6. Hunt, P.B.; Robertson, D.I.; Bretherton, R.D.; Royle, M.C. The SCOOT On-Line Traffic Signal Optimisation Technique. Traffic Eng. Control 1982, 23, 190–192. [Google Scholar]
  7. Luk, J.Y.K. Two Traffic-Responsive Area Traffic Control Methods: SCAT and SCOOT. Traffic Eng. Control 1984, 25, 14–22. [Google Scholar]
  8. Gartner, N.H. Demand-Responsive Decentralized Urban Traffic Control. Part I: Single-Intersection Policies. In Transportation Research Record 906, TRB; National Research Council: Washington, DC, USA, 1983; pp. 75–81. [Google Scholar]
  9. Henry, J.-J.; Farges, J.L.; Tuffal, J. The PRODYN Real Time Traffic Algorithm. In Control in Transportation Systems; Elsevier: Amsterdam, The Netherlands, 1984; pp. 305–310. [Google Scholar]
  10. Jafari, S.; Shahbazi, Z.; Byun, Y.-C. Improving the Performance of Single-Intersection Urban Traffic Networks Based on a Model Predictive Controller. Sustainability 2021, 13, 5630. [Google Scholar] [CrossRef]
  11. Qadri, S.S.S.M.; Gökçe, M.A.; Öner, E. State-of-Art Review of Traffic Signal Control Methods: Challenges and Opportunities. Eur. Transp. Res. Rev. 2020, 12, 55. [Google Scholar] [CrossRef]
  12. Yu, D.; Tian, X.; Xing, X.; Gao, S. Signal Timing Optimization Based on Fuzzy Compromise Programming for Isolated Signalized Intersection. Math. Probl. Eng. 2016, 2016, 1682394. [Google Scholar] [CrossRef] [Green Version]
  13. Jia, H.; Lin, Y.; Luo, Q.; Li, Y.; Miao, H. Multi-Objective Optimization of Urban Road Intersection Signal Timing Based on Particle Swarm Optimization Algorithm. Adv. Mech. Eng. 2019, 11, 1687814019842498. [Google Scholar] [CrossRef] [Green Version]
  14. Mohebifard, R.; Hajbabaie, A. Optimal Network-Level Traffic Signal Control: A Benders Decomposition-Based Solution Algorithm. Transp. Res. Part B Methodol. 2019, 121, 252–274. [Google Scholar] [CrossRef]
  15. Yu, C.; Ma, W.; Han, K.; Yang, X. Optimization of Vehicle and Pedestrian Signals at Isolated Intersections. Transp. Res. Part B Methodol. 2017, 98, 135–153. [Google Scholar] [CrossRef] [Green Version]
  16. An, H.K.; Awais Javeed, M.; Bae, G.; Zubair, N.; Metwally, M.A.S.; Bocchetta, P.; Na, F.; Javed, M.S. Optimized Intersection Signal Timing: An Intelligent Approach-Based Study for Sustainable Models. Sustainability 2022, 14, 11422. [Google Scholar] [CrossRef]
  17. Qin, H.; Zhang, H. Intelligent Traffic Light under Fog Computing Platform in Data Control of Real-Time Traffic Flow. J. Supercomput. 2021, 77, 4461–4483. [Google Scholar] [CrossRef]
  18. Yoon, J.; Ahn, K.; Park, J.; Yeo, H. Transferable Traffic Signal Control: Reinforcement Learning with Graph Centric State Representation. Transp. Res. Part C Emerg. Technol. 2021, 130, 103321. [Google Scholar] [CrossRef]
  19. Mahdavimoghadam, M.; Nikanjam, A.; Abdoos, M. Improved Reinforcement Learning in Cooperative Multi-Agent Environments Using Knowledge Transfer. J. Supercomput. 2022, 78, 10455–10479. [Google Scholar] [CrossRef]
  20. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  21. Chu, K.-F.; Lam, A.Y.; Li, V.O. Traffic Signal Control Using End-to-End Off-Policy Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2021, 23, 7184–7195. [Google Scholar] [CrossRef]
  22. Chu, T.; Wang, J.; Codecà, L.; Li, Z. Multi-Agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1086–1095. [Google Scholar] [CrossRef] [Green Version]
  23. Li, L.; Lv, Y.; Wang, F.-Y. Traffic Signal Timing via Deep Reinforcement Learning. IEEE/CAA J. Autom. Sin. 2016, 3, 247–254. [Google Scholar]
  24. Liang, X.; Du, X.; Wang, G.; Han, Z. A Deep Q-Learning Network for Traffic Lights’ Cycle Control in Vehicular Networks. IEEE Trans. Veh. Technol. 2019, 68, 1243–1253. [Google Scholar] [CrossRef] [Green Version]
  25. Louati, A.; Louati, H.; Li, Z. Deep Learning and Case-Based Reasoning for Predictive and Adaptive Traffic Emergency Management. J. Supercomput. 2021, 77, 4389–4418. [Google Scholar] [CrossRef]
  26. Norouzi, M.; Abdoos, M.; Bazzan, A.L.C. Experience Classification for Transfer Learning in Traffic Signal Control. J. Supercomput. 2021, 77, 780–795. [Google Scholar] [CrossRef]
  27. Shamsi, M.; Rasouli Kenari, A.; Aghamohammadi, R. Reinforcement Learning for Traffic Light Control with Emphasis on Emergency Vehicles. J. Supercomput. 2022, 78, 4911–4937. [Google Scholar] [CrossRef]
  28. Wei, H.; Zheng, G.; Gayah, V.; Li, Z. Recent Advances in Reinforcement Learning for Traffic Signal Control: A Survey of Models and Evaluation. ACM SIGKDD Explor. Newsl. 2021, 22, 12–18. [Google Scholar] [CrossRef]
  29. Bengio, Y. Learning Deep Architectures for AI; Foundations and Trends in Machine Learning: Hanover, MA, USA, 2009; Volume 2, pp. 1–127. [Google Scholar]
  30. Zheng, G.; Zang, X.; Xu, N.; Wei, H.; Yu, Z.; Gayah, V.; Xu, K.; Li, Z. Diagnosing Reinforcement Learning for Traffic Signal Control. arXiv 2019, arXiv:1905.04716. [Google Scholar]
  31. Wei, H.; Zheng, G.; Yao, H.; Li, Z. IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 2496–2505. [Google Scholar]
  32. Van der Pol, E.; Oliehoek, F.A. Coordinated Deep Reinforcement Learners for Traffic Light Control. In Proceedings of the Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016), Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  33. Gao, J.; Shen, Y.; Liu, J.; Ito, M.; Shiratori, N. Adaptive Traffic Signal Control: Deep Reinforcement Learning Algorithm with Experience Replay and Target Network. arXiv 2017, arXiv:1705.02755. [Google Scholar]
  34. Liang, X.; Du, X.; Wang, G.; Han, Z. Deep Reinforcement Learning for Traffic Light Control in Vehicular Networks. arXiv 2018, arXiv:1803.11115. [Google Scholar]
  35. Vidali, A.; Crociani, L.; Vizzari, G.; Bandini, S. A Deep Reinforcement Learning Approach to Adaptive Traffic Lights Management. In Proceedings of the WOA 2019, the 20th Workshop “From Objects to Agents”, Parma, Italy, 26–28 June 2019; pp. 42–50. [Google Scholar]
  36. Mousavi, S.S.; Schukat, M.; Howley, E. Traffic Light Control Using Deep Policy-Gradient and Value-Function-Based Reinforcement Learning. IET Intell. Transp. Syst. 2017, 11, 417–423. [Google Scholar] [CrossRef] [Green Version]
  37. Fuad, M.R.T.; Fernandez, E.O.; Mukhlish, F.; Putri, A.; Sutarto, H.Y.; Hidayat, Y.A.; Joelianto, E. Adaptive Deep Q-Network Algorithm with Exponential Reward Mechanism for Traffic Control in Urban Intersection Networks. Sustainability 2022, 14, 14590. [Google Scholar] [CrossRef]
  38. Gong, Y.; Abdel-Aty, M.; Yuan, J.; Cai, Q. Multi-Objective Reinforcement Learning Approach for Improving Safety at Intersections with Adaptive Traffic Signal Control. Accid. Anal. Prev. 2020, 144, 105655. [Google Scholar] [CrossRef] [PubMed]
  39. Essa, M.; Sayed, T. Self-Learning Adaptive Traffic Signal Control for Real-Time Safety Optimization. Accid. Anal. Prev. 2020, 146, 105713. [Google Scholar] [CrossRef] [PubMed]
  40. Li, M.; Li, Z.; Xu, C.; Liu, T. Deep Reinforcement Learning-Based Vehicle Driving Strategy to Reduce Crash Risks in Traffic Oscillations. Transp. Res. Rec. 2020, 2674, 42–54. [Google Scholar] [CrossRef]
  41. Rodrigues, F.; Azevedo, C.L. Towards Robust Deep Reinforcement Learning for Traffic Signal Control: Demand Surges, Incidents and Sensor Failures. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 3559–3566. [Google Scholar]
  42. Krajzewicz, D.; Erdmann, J.; Behrisch, M.; Bieker, L. Recent Development and Applications of SUMO-Simulation of Urban MObility. Int. J. Adv. Syst. Meas. 2012, 128–138. [Google Scholar]
  43. Montesinos López, O.A.; Montesinos López, A.; Crossa, J. Fundamentals of Artificial Neural Networks and Deep Learning. In Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer International Publishing: Cham, Switzerland, 2022; pp. 379–425. [Google Scholar]
  44. Maurya, A.K.; Dey, S.; Das, S. Speed and Time Headway Distribution under Mixed Traffic Condition. J. East. Asia Soc. Transp. Stud. 2015, 11, 1774–1792. [Google Scholar]
  45. Jaber Abougarair, A.; Edardar, M.M. Adaptive Traffic Light Dynamic Control Based on Road Traffic Signal from Google Maps. In Proceedings of the 7th International Conference on Engineering & MIS 2021, Almaty, Kazakhstan, 11–13 October 2021; pp. 1–9. [Google Scholar]
  46. Srivastava, S. Adaptive Traffic Light Timer Control (ATLTC). 29 March 2016. Available online: http://www.iitk.ac.in/nerd/web/articles/adaptive-traffic-light-timer-control-atltc/#.Y7qoJnZBzIU (accessed on 26 December 2022).
  47. Ismail, K.; Sayed, T.; Saunier, N. A Methodology for Precise Camera Calibration for Data Collection Applications in Urban Traffic Scenes. Can. J. Civ. Eng. 2013, 40, 57–67. [Google Scholar] [CrossRef]
  48. Ke, R.; Pan, Z.; Pu, Z.; Wang, Y. Roadway Surveillance Video Camera Calibration Using Standard Shipping Container. In Proceedings of the 2017 International Smart Cities Conference (ISC2), Wuxi, China, 14–17 September 2017; pp. 1–6. [Google Scholar]
Figure 1. Framework of the proposed methodology.
Figure 2. A four-way intersection.
Figure 3. The standard reinforcement learning cycle.
Figure 4. Deep reinforcement learning model for traffic light control.
Figure 5. Road discretization into cells for state representation.
Figure 6. Structure of deep neural network.
Figure 7. Representation of sampling the memory.
Figure 8. Simulation in SUMO.
Figure 9. Traffic flow generation.
Figure 10. The cumulative reward during the training episodes in low traffic scenario.
Figure 11. The cumulative reward during the training episodes in high traffic scenario.
Figure 12. The average queue length during the training episodes in high traffic scenario.
Figure 13. The cumulative delay during the training episodes in high traffic scenario.
Figure 14. Comparison between maximum queue length during the test results in high traffic scenario.
Figure 15. The cumulative reward during the training episodes in high traffic scenario with accident in [35].
Figure 16. The cumulative reward during the training episodes in high traffic scenario with accident in western arm.
Figure 17. The average queue length during the training episodes in high traffic scenario with accident in western arm.
Figure 18. The cumulative reward during the training episodes in high traffic scenario with accident in eastern arm.
Figure 19. The average queue length during the training episodes in high traffic scenario with accident in eastern arm.
Figure 20. The cumulative reward during the training episodes in high traffic scenario with accident in northern arm.
Figure 21. The average queue length during the training episodes in high traffic scenario with accident in northern arm.
Figure 22. The cumulative reward during the training episodes in high traffic scenario with accident in the intersection.
Figure 23. The average queue length in the training episodes in high traffic scenario with accident in the intersection.
Figure 24. The average queue length in the testing episodes in high traffic scenario with accident in the intersection.
Figure 25. The cumulative reward during the training episodes in high traffic scenario by changing time of accident.
Figure 26. The average queue length during the training episodes in high traffic scenario by changing time of accident.
Figure 27. The cumulative reward during the training episodes in high traffic scenario by changing location and time of accident.
Figure 28. The average queue length during the training episodes in high traffic scenario by changing location and time of accident.
Figure 29. The cumulative reward during the training episodes in high traffic scenario in presence of two accidents.
Figure 30. The average queue length during the training episodes in high traffic scenario in presence of two accidents.
Figure 31. Comparison between maximum queue length during the test results with two accidents in high traffic scenario.
Figure 32. The queue length during test results without retraining in first scenario.
Table 1. Signal phases.
The intersection uses eight signal phases. The columns of the table correspond to four link groups, (N→S, S→N, N→W, S→E), (N→E, S→W), (E→W, W→E, E→N, W→S), and (E→S, W→N), and the rows Phase 1 through Phase 8 show the signal indication for each group in each phase (signal-state diagrams not reproduced here).
Table 2. Parameters for reinforcement learning method.

Parameter             Value
Discount factor γ     0.75
Learning rate         0.001
Replay memory size    5000
Batch size            100
Starting ε            1
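To make the hyperparameters in Table 2 concrete, the following hedged sketch shows how they would typically be wired into a DQN agent with experience replay. The network width, the use of PyTorch, and the absence of a separate target network are assumptions made for illustration; only the numerical values come from Table 2.

```python
# Hypothetical DQN wiring for the Table 2 hyperparameters (PyTorch).
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

GAMMA = 0.75           # discount factor (Table 2)
LEARNING_RATE = 0.001  # learning rate (Table 2)
MEMORY_SIZE = 5000     # replay memory size (Table 2)
BATCH_SIZE = 100       # batch size (Table 2)
EPSILON_START = 1.0    # initial exploration rate (Table 2)

class QNetwork(nn.Module):
    """Fully connected Q-value approximator; hidden size is an assumption."""
    def __init__(self, state_dim, n_actions, hidden=400):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.layers(x)

memory = deque(maxlen=MEMORY_SIZE)   # experience replay buffer of (s, a, r, s') tuples

def train_step(q_net, optimizer):
    """One Q-learning update on a random minibatch sampled from the replay memory."""
    if len(memory) < BATCH_SIZE:
        return
    states, actions, rewards, next_states = zip(*random.sample(memory, BATCH_SIZE))
    s = torch.tensor(np.array(states), dtype=torch.float32)
    a = torch.tensor(actions, dtype=torch.int64)
    r = torch.tensor(rewards, dtype=torch.float32)
    s2 = torch.tensor(np.array(next_states), dtype=torch.float32)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a) for taken actions
    with torch.no_grad():
        target = r + GAMMA * q_net(s2).max(dim=1).values       # bootstrapped target
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example setup: optimizer = torch.optim.Adam(q_net.parameters(), lr=LEARNING_RATE);
# epsilon-greedy exploration would start at EPSILON_START and decay over the episodes.
```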
Table 3. Overview of average value of metrics for agents during training for accident in southern arm.

Metric                                Proposed Method    Conventional Deep RL Method in [35]
Average cumulative negative reward    −152,050           −190,530
Average cumulative delay (seconds)    431,000            476,070
Average queue length (vehicles)       79                 88
Table 4. Overview of average value of metrics for agents during training by changing accident location.

Metric                               Accident Location    Proposed Method    Conventional Deep RL Method in [35]
Average cumulative negative reward   W                    −176,950           −213,280
                                     E                    −200,550           −279,030
                                     N                    −174,660           −216,240
Average cumulative delay (seconds)   W                    506,200            506,800
                                     E                    538,780            629,350
                                     N                    492,340            530,110
Average queue length (vehicles)      W                    92                 95
                                     E                    99                 117
                                     N                    91                 99
Table 5. Maximum queue length during the test result without retraining in second scenario.

Metric                                               Proposed Method    Conventional Deep RL Method in [35]    [41]
Maximum queue length during test phase (vehicles)    140                153                                    158
Table 6. Maximum queue length during the test result without retraining in third scenario.

Metric                                               Proposed Method    Conventional Deep RL Method in [35]    [41]
Maximum queue length during test phase (vehicles)    186                261                                    243
