A Comparative Study of Traffic Signal Control Based on Reinforcement Learning Algorithms
Abstract
1. Introduction
- We formulate the traffic signal control problem as a Markov Decision Process (MDP) and employ reinforcement learning algorithms (Q-learning and DQN) to learn a dynamic and effective traffic signal control strategy. The evaluation covers vehicle travel time, average speed, and lane occupancy rate to demonstrate the effectiveness of the proposed method.
- We propose a signal control framework based on Q-learning and Deep Q-Network (DQN) and define a distinct action space for each algorithm, differing from other researchers’ approaches [20]. In Q-learning, the action space consists of selecting the duration of each green phase in the next cycle. In DQN, the action compares the current phase with a selected phase: if the two coincide, the selected phase is executed; otherwise, the system moves to the next phase, resulting in more efficient signal control.
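The two action-space designs above can be sketched as follows. This is an illustrative sketch only: the duration candidates, phase count, and function names are assumptions, not the paper's actual implementation.

```python
GREEN_DURATIONS = [10, 20, 30, 42]  # candidate green times (s) for Q-learning (assumed values)
NUM_PHASES = 4                      # number of signal phases in one cycle (assumed)

def q_learning_action(action_index):
    """Q-learning action: choose the green-phase duration for the next cycle."""
    return GREEN_DURATIONS[action_index]

def dqn_next_phase(current_phase, selected_phase):
    """DQN action semantics: if the selected phase coincides with the current
    phase, execute it; otherwise advance to the next phase in the cycle."""
    if selected_phase == current_phase:
        return selected_phase                 # phases coincide: keep executing it
    return (current_phase + 1) % NUM_PHASES   # phases differ: move to the next phase
```

The key design difference is that Q-learning commits to a timing plan for a whole cycle, while DQN makes a keep-or-switch decision at every step.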
2. Related Works
3. Method
3.1. Q-Learning
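A minimal tabular Q-learning sketch, using the standard temporal-difference update; the hyperparameter values are illustrative, not those tuned in the paper.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration (assumed values)

def update_q(Q, state, action, reward, next_state, n_actions):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])

def choose_action(Q, state, n_actions):
    """Epsilon-greedy selection over the tabular action values."""
    if random.random() < EPSILON:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])
```

Here `Q` is a `defaultdict(float)` keyed by (state, action) pairs, which keeps unvisited entries at zero without preallocating the table.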
3.2. DQN
Algorithm 1: DQN algorithm
Initialize replay memory D to capacity N
Initialize observation steps S and total steps T: T = 0
Initialize action-value network Q with random weights θ and target network Q̂ with weights θ⁻ = θ
For all episodes n = 1, 2, …, N do
  Observe the initial state s1 of the traffic light
  For t = 1 to K do
    With probability ε select a random action at; otherwise select at = argmaxa Q(st, a; θ)
    Execute at, observe reward rt and next state st+1
    Store transition (st, at, rt, st+1) in D
    Update T: T = T + 1
    If T > S do
      Sample a random minibatch of transitions (sj, aj, rj, sj+1) from D
      Set target yj = rj + γ maxa′ Q̂(sj+1, a′; θ⁻)
      Perform a gradient descent step on (yj − Q(sj, aj; θ))² with respect to θ
    End if
    Every C steps copy weights θ into target network: θ⁻ = θ
  End for
End for
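The inner update of Algorithm 1 can be sketched in a few lines. As a simplifying assumption, a linear Q-function stands in for the deep network, so the gradient step is explicit; all sizes and hyperparameters are illustrative.

```python
import numpy as np
from collections import deque

STATE_DIM, N_ACTIONS = 4, 2            # illustrative sizes, not the paper's
GAMMA, LR = 0.9, 0.01                  # discount and learning rate (assumed values)

memory = deque(maxlen=10000)           # replay memory D with capacity N
theta = np.zeros((N_ACTIONS, STATE_DIM))   # online weights θ
theta_target = theta.copy()                # target weights θ⁻

def q_values(w, s):
    """Q(s, ·) under the linear stand-in model: one row of weights per action."""
    return w @ s

def train_step(batch):
    """One gradient-descent step on (y - Q(s,a))^2 using the target network."""
    for s, a, r, s_next in batch:
        y = r + GAMMA * q_values(theta_target, s_next).max()  # target yj
        td_error = y - q_values(theta, s)[a]
        theta[a] += LR * td_error * s   # gradient of 0.5 * td_error^2 w.r.t. θ

def sync_target():
    """Every C steps: copy weights θ into the target network (θ⁻ = θ)."""
    theta_target[:] = theta
```

Transitions would be appended to `memory` during interaction and `train_step` called on a random minibatch once T exceeds the observation steps S, mirroring the pseudocode above.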
3.3. TSC Setting
3.3.1. Road Model
3.3.2. State Space
3.3.3. Action Space
3.3.4. Reward
4. Simulation
4.1. Simulation Platform
4.2. Experimental Results
5. Conclusions and Directions for Future Research
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Pérez-Gil, Ó.; Barea, R.; López-Guillén, E.; Bergasa, L.M.; Gomez-Huelamo, C.; Gutiérrez, R.; Diaz-Diaz, A. Deep reinforcement learning based control for Autonomous Vehicles in CARLA. Multimed. Tools Appl. 2022, 81, 3553–3576. [Google Scholar] [CrossRef]
- Miao, W.; Li, L.; Wang, Z. A Survey on Deep Reinforcement Learning for Traffic Signal Control. In Proceedings of the 2021 33rd Chinese Control and Decision Conference (CCDC), Kunming, China, 22–24 May 2021; pp. 1092–1097. [Google Scholar]
- Majstorović, Ž.; Tišljarić, L.; Ivanjko, E.; Carić, T. Urban Traffic Signal Control under Mixed Traffic Flows: Literature Review. Appl. Sci. 2023, 13, 4484. [Google Scholar] [CrossRef]
- Zhu, T.M.; Boada, M.J.L.; Boada, B.L. Intelligent Signal Control Module Design for Intersection Traffic Optimization. In Proceedings of the IEEE 7th International Conference on Intelligent Transportation Engineering (ICITE), Beijing, China, 11–13 November 2022; pp. 522–527. [Google Scholar]
- Mu, Y.; Chen, S.F.; Ding, M.Y.; Chen, J.Y.; Chen, R.J.; Luo, P. CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
- You, C.X.; Lu, J.B.; Filev, D.; Tsiotras, P. Advanced planning for autonomous vehicles using reinforcement learning and deep inverse reinforcement learning. Robot. Auton. Syst. 2019, 114, 1–18. [Google Scholar] [CrossRef]
- Tan, J.R. A Method to Plan the Path of a Robot Utilizing Deep Reinforcement Learning and Multi-Sensory Information Fusion. Appl. Artif. Intell. 2023, 37, 2224996. [Google Scholar] [CrossRef]
- Lin, Y.; McPhee, J.; Azad, N.L. Longitudinal Dynamic versus Kinematic Models for Car-Following Control Using Deep Reinforcement Learning. In Proceedings of the IEEE Intelligent Transportation Systems Conference (IEEE-ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 1504–1510. [Google Scholar]
- Chen, J.; Zhou, Z.; Duan, Y.; Yu, B. Research on Reinforcement-Learning-Based Truck Platooning Control Strategies in Highway On-Ramp Regions. World Electr. Veh. J. 2023, 14, 273. [Google Scholar] [CrossRef]
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
- Xian, B.; Zhang, X.; Zhang, H.N.; Gu, X. Robust Adaptive Control for a Small Unmanned Helicopter Using Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 7589–7597. [Google Scholar] [CrossRef] [PubMed]
- Agostinelli, F.; Hocquet, G.; Singh, S.; Baldi, P. From Reinforcement Learning to Deep Reinforcement Learning: An Overview. In Braverman Readings in Machine Learning—Key Ideas from Inception to Current State; Rozonoer, L., Mirkin, B., Muchnik, I., Eds.; Part of the Lecture Notes in Computer Science Book Series; Springer: Cham, Switzerland, 2018; pp. 298–328. [Google Scholar] [CrossRef]
- Choi, S.; Le, T.P.; Nguyen, Q.D.; Abu Layek, M.; Lee, S.; Chung, T. Toward Self-Driving Bicycles Using State-of-the-Art Deep Reinforcement Learning Algorithms. Symmetry 2019, 11, 290. [Google Scholar] [CrossRef]
- Væhrens, L.; Alvarez, D.D.; Berger, U.; Bogh, S. Learning Task-independent Joint Control for Robotic Manipulators with Reinforcement Learning and Curriculum Learning. In Proceedings of the 21st IEEE International Conference on Machine Learning and Applications (IEEE ICMLA), Nassau, Bahamas, 12–14 December 2022; pp. 1250–1257. [Google Scholar]
- Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1334–1373. [Google Scholar]
- Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv 2015, arXiv:1506.02438. [Google Scholar]
- Charpentier, A.; Élie, R.; Remlinger, C. Reinforcement Learning in Economics and Finance. Comput. Econ. 2023, 62, 425–462. [Google Scholar] [CrossRef]
- Hu, M.Z.; Zhang, J.H.; Matkovic, L.; Liu, T.; Yang, X.F. Reinforcement learning in medical image analysis: Concepts, applications, challenges, and future directions. J. Appl. Clin. Med. Phys. 2023, 24, e13898. [Google Scholar] [CrossRef] [PubMed]
- Clark, T.; Barn, B.; Kulkarni, V.; Barat, S. Language Support for Multi Agent Reinforcement Learning. In Proceedings of the 13th Innovations in Software Engineering Conference (ISEC), Jabalpur, India, 27–29 February 2020. [Google Scholar]
- Gu, J.; Lee, M.; Jun, C.; Han, Y.; Kim, Y.; Kim, J. Traffic Signal Optimization for Multiple Intersections Based on Reinforcement Learning. Appl. Sci. 2021, 11, 10688. [Google Scholar] [CrossRef]
- Wang, Z.; Liu, X.; Wu, Z. Design of Unsignalized Roundabouts Driving Policy of Autonomous Vehicles Using Deep Reinforcement Learning. World Electr. Veh. J. 2023, 14, 52. [Google Scholar] [CrossRef]
- Zhu, R.J.; Wu, S.N.; Li, L.L.; Lv, P.; Xu, M.L. Context-Aware Multiagent Broad Reinforcement Learning for Mixed Pedestrian-Vehicle Adaptive Traffic Light Control. IEEE Internet Things J. 2022, 9, 19694–19705. [Google Scholar] [CrossRef]
- Shakya, A.K.; Pillai, G.; Chakrabarty, S. Reinforcement learning algorithms: A brief survey. Expert Syst. Appl. 2023, 231, 120495. [Google Scholar] [CrossRef]
- Mahler, G.; Vahidi, A. An Optimal Velocity-Planning Scheme for Vehicle Energy Efficiency Through Probabilistic Prediction of Traffic-Signal Timing. IEEE Trans. Intell. Transp. Syst. 2014, 15, 2516–2523. [Google Scholar] [CrossRef]
- Mirheli, A.; Hajibabai, L.; Hajbabaie, A. Development of a signal-head-free intersection control logic in a fully connected and autonomous vehicle environment. Transp. Res. Part C-Emerg. Technol. 2018, 92, 412–425. [Google Scholar] [CrossRef]
- Ma, J.M.; Wu, F. Learning to Coordinate Traffic Signals With Adaptive Network Partition. IEEE Trans. Intell. Transp. Syst. 2023. Early Access. [Google Scholar] [CrossRef]
- Zhou, X.K.; Zhu, F.; Liu, Q.; Fu, Y.C.; Huang, W. A Sarsa(λ)-Based Control Model for Real-Time Traffic Light Coordination. Sci. World J. 2014, 2014, 759097. [Google Scholar] [CrossRef]
- Yen, C.C.; Ghosal, D.; Zhang, M.; Chuah, C.N. A Deep On-Policy Learning Agent for Traffic Signal Control of Multiple Intersections. In Proceedings of the 23rd IEEE International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020. [Google Scholar]
- Reza, S.; Ferreira, M.C.; Machado, J.J.M.; Tavares, J. A citywide TD-learning based intelligent traffic signal control for autonomous vehicles: Performance evaluation using SUMO. Expert Syst. 2023. [Google Scholar] [CrossRef]
- Arel, I.; Liu, C.; Urbanik, T.; Kohls, A.G. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intell. Transp. Syst. 2010, 4, 128–135. [Google Scholar] [CrossRef]
- Abdoos, M.; Mozayani, N.; Bazzan, A.L.C. Hierarchical control of traffic signals using Q-learning with tile coding. Appl. Intell. 2014, 40, 201–213. [Google Scholar] [CrossRef]
- Wei, Z.B.; Peng, T.; Wei, S.J. A Robust Adaptive Traffic Signal Control Algorithm Using Q-Learning under Mixed Traffic Flow. Sustainability 2022, 14, 5751. [Google Scholar] [CrossRef]
- Zeng, J.H.; Hu, J.M.; Zhang, Y. Adaptive Traffic Signal Control with Deep Recurrent Q-learning. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 1215–1220. [Google Scholar]
- Xie, D.H.; Wang, Z.; Chen, C.L.; Dong, D.Y. IEDQN: Information Exchange DQN with a Centralized Coordinator for Traffic Signal Control. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020. [Google Scholar] [CrossRef]
- Tunc, I.; Soylemez, M.T. Fuzzy logic and deep Q learning based control for traffic lights. Alex. Eng. J. 2023, 67, 343–359. [Google Scholar] [CrossRef]
- Wang, X.Y.; Taitler, A.; Smirnov, I.; Sanner, S.; Abdulhai, B. eMARLIN: Distributed Coordinated Adaptive Traffic Signal Control with Topology-Embedding Propagation. Transp. Res. Rec. J. Transp. Res. Board 2023. [Google Scholar] [CrossRef]
- Babatunde, J.; Osman, O.A.; Stevanovic, A.; Dobrota, N. Fuel-Based Nash Bargaining Approach for Adaptive Signal Control in an N-Player Cooperative Game. Transp. Res. Rec. J. Transp. Res. Board 2023, 2677, 451–463. [Google Scholar] [CrossRef]
- Ounoughi, C.; Ounoughi, D.; Ben Yahia, S. EcoLight plus: A novel multi-modal data fusion for enhanced eco-friendly traffic signal control driven by urban traffic noise prediction. Knowl. Inf. Syst. 2023, 65, 5309–5329. [Google Scholar] [CrossRef]
- Zeinaly, Z.; Sojoodi, M.; Bolouki, S. A Resilient Intelligent Traffic Signal Control Scheme for Accident Scenario at Intersections via Deep Reinforcement Learning. Sustainability 2023, 15, 1329. [Google Scholar] [CrossRef]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
| Parameters | Value |
|---|---|
| Lane length | 100 m |
| Vehicle length | 5 m |
| Minimum safe distance between vehicles | 3 m |
| Maximum vehicle speed | 50 km/h |
| Maximum vehicle acceleration | |
| Maximum vehicle deceleration | |
| Transition phase duration | 3 s |
| Signal phase duration | 42 s |
| Simulation time step | 1 s |
| Path selection | random |
| Vehicle input | 3600 PCU/h |
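For convenience, the simulation parameters in the table can be collected in one configuration object; the key names are assumptions for this sketch, and the acceleration/deceleration limits are omitted because they are not given above.

```python
# Simulation parameters from the table above (key names are illustrative;
# the paper's actual simulator configuration may differ).
SIM_PARAMS = {
    "lane_length_m": 100,
    "vehicle_length_m": 5,
    "min_gap_m": 3,
    "max_speed_kmh": 50,
    "transition_phase_s": 3,
    "signal_phase_s": 42,
    "sim_step_s": 1,
    "route_choice": "random",
    "vehicle_input_pcu_per_h": 3600,
}

def max_speed_ms(params):
    """Convert the speed limit from km/h to m/s, the unit most simulators expect."""
    return params["max_speed_kmh"] / 3.6
```

Keeping the raw table values in one dict makes unit conversions explicit at the point of use rather than scattered through the setup code.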
Share and Cite
Ouyang, C.; Zhan, Z.; Lv, F. A Comparative Study of Traffic Signal Control Based on Reinforcement Learning Algorithms. World Electr. Veh. J. 2024, 15, 246. https://doi.org/10.3390/wevj15060246