Article

Traffic Signal Control with State-Optimizing Deep Reinforcement Learning and Fuzzy Logic

by Teerapun Meepokgit and Sumek Wisayataksin *
Department of Electronics Engineering, School of Engineering, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7908; https://doi.org/10.3390/app14177908
Submission received: 18 July 2024 / Revised: 23 August 2024 / Accepted: 3 September 2024 / Published: 5 September 2024
(This article belongs to the Special Issue Intelligent Transportation System Technologies and Applications)

Abstract:
Traffic lights are the most commonly used tool to manage urban traffic to reduce congestion and accidents. However, the poor management of traffic lights can result in further problems. Consequently, many studies on traffic light control have been conducted using deep reinforcement learning in the past few years. In this study, we propose a traffic light control method in which a Deep Q-network with fuzzy logic is used to reduce waiting time while enhancing the efficiency of the method. Nevertheless, existing studies using the Deep Q-network may yield suboptimal results because of the reward function, leading to the system favoring straight vehicles, which results in left-turning vehicles waiting too long. Therefore, we modified the reward function to consider the waiting time in each lane. For the experiment, Simulation of Urban Mobility (SUMO) software version 1.18.0 was used for various environments and vehicle types. The results show that, when using the proposed method in a prototype environment, the average total waiting time could be reduced by 18.46% compared with the traffic light control method using a conventional Deep Q-network with fuzzy logic. Additionally, an ambulance prioritization system was implemented that significantly reduced the ambulance waiting time. In summary, the proposed method yielded better results in all environments.

1. Introduction

Nowadays, numerous countries worldwide are experiencing traffic congestion. The rapid growth of the urban population has resulted in a significant increase in the number of vehicles on roads, a root cause of this problem. Consequently, traffic congestion ensues, which results in prolonged travel times, particularly during periods of high traffic demand. In 2023, people living in Bangkok faced a significant loss of time due to heavy traffic, totaling 63 h, which represents an increase of 25 percent compared to the previous year [1].
Traffic lights are the most commonly used tool to manage city traffic. However, problems often arise when traffic lights are not properly adjusted for the number of cars on the road, leading to congestion and longer waiting times. Therefore, in this study, we propose a method that uses deep reinforcement learning with fuzzy logic to improve traffic light control. This method uses data on the number of cars and waiting times to adjust traffic lights according to current traffic conditions, aiming to reduce waiting times and minimize traffic congestion while enhancing safety.
When traffic congestion is reduced, vehicles can move more smoothly, reducing the likelihood of accidents or dangerous situations [2] (pp. 208–209). A smooth traffic flow minimizes sudden stops and there is less crowding between vehicles, both common causes of accidents, while improving drivers’ responses to various situations and reducing stress, thereby contributing to a safer overall road environment.
In 1977, Pappis and Mamdani [3] recommended using fuzzy logic as an approach to controlling traffic lights, which was implemented at a single intersection. Subsequently, many researchers have extensively studied the application of fuzzy logic to traffic signal control. Data were collected from various situations at an intersection, such as the number of vehicles waiting at red lights, the number of vehicles at the intersection to be granted green signals [4,5,6], cycle starting time, change probability [5], queue lengths [6,7], waiting time [8], and total number of vehicles [9]. These parameters have also been used to analyze and determine traffic signal phases [5,6], or the duration of green signals [4,7,10].
The inference engine uses fuzzy rule-based systems and forgoes the need to construct exact mathematical models [7] by creating a rule base that parallels human traffic control reasoning [8,10,11]. However, it also has the limitation of being unable to autonomously learn information and thereby resolve highly complex situations.
Artificial intelligence (AI) has played a significant role in successfully solving numerous problems. Supervised learning, unsupervised learning, and reinforcement learning [12,13] have been used to solve multiple problems. These learning methods are classified as machine learning methods, which differ from fuzzy logic control in that fuzzy logic control does not use learning algorithms. Reinforcement learning (RL), of which Q-learning is the most widely known algorithm, involves repetitive learning and the use of calculated Q-values to obtain optimal action selections [14,15]. The underlying principle mirrors human learning through trial and error, integrating knowledge from real-time data to determine the appropriateness of actions based on the current environmental conditions [16]. The result of each action is evaluated through a reward value, which is used to calculate the Q-value and store it in a table called the Q-table. This table retains information about all situations and serves as a reference for decision-making [17,18].
Owing to the complexity of traffic situations and the enormous state space, Q-learning is limited by the need to store data in an excessively large Q-table, which has led to the integration of deep learning with reinforcement learning. This integrated method is known as Deep Q-learning or the Deep Q-network (DQN). It effectively handles large, intricate state spaces, improves learning by using experience replay, and is widely adopted in research because of its simplicity and effectiveness. In this method, a neural network replaces the Q-table, which simplifies configuration and tuning without requiring an overly complex design: the network predicts Q-values and adjusts its weights to approximate the most suitable Q-values for a given situation. This contrasts with traditional Q-learning, in which learning starts from scratch in every situation, making the Deep Q-network more flexible [19]. However, their fundamental principles are similar. In this method, the state refers to the collection of situational data, such as vehicle positions [20], speeds [21,22], and the number of vehicles in each lane [23].
In the action of a Deep Q-network, the optimal traffic signal phase for a given situation is selected. However, the duration of the green signal is predefined and does not dynamically adapt to the current situation [23,24]. The calculation of reward values varies across studies, with the predominant factor being waiting time [24,25], and some studies alternatively use queue length and car speed [21].
In traffic control using reinforcement learning (RL), in addition to the widely popular DQN (Deep Q-network), there are other methods such as TD3 (Twin Delayed Deep Deterministic Policy Gradient) and DDPG (Deep Deterministic Policy Gradient). These methods show high potential for problems with continuous action spaces or more complex control requirements, for example, continuous control values for autonomous vehicles, which allow the vehicle to adapt smoothly to its environment, or temperature control, where inputs must be adjusted continuously to maintain the desired state.
For DDPG [26], the Actor–Critic approach is employed, where the Actor determines the control policy and the Critic evaluates this policy, making it effective for continuous control scenarios. To address problems in the DDPG, the TD3 [27] was developed by implementing delayed policy updates and using a clipped double Q-learning technique, which enhances the model’s stability.
However, the DQN is more suitable for problems that require decision-making in discrete action spaces [28], such as adjusting traffic signal phases where a new phase is selected only after the previous one has finished. It handles these problems more effectively and straightforwardly and is also simpler in structure and easier to implement.
In their study, Tunc and Soylemez [29] used a combination of deep Q-learning and fuzzy logic control to change the traffic light phase and duration of the green light according to the current situation. They employed Deep Q-learning to select the optimal traffic signal phase, while using fuzzy logic control to compute the duration of the green signal.
The results showed that this approach outperformed other methods, such as traditional fixed-time control, Deep Q-learning, and fuzzy logic control.
However, one problem discovered in that study relates to the right-hand traffic environment. In particular, the reward function used may calculate an unsuitable reward, so vehicles wishing to turn left experience excessive waiting times at the traffic signals. Because of how the reward is computed, the agent can obtain higher rewards simply by giving straight-moving vehicles the green signal. Therefore, in our study, we use a modified Deep Q-network and fuzzy logic control method to optimize the management of traffic lights at a single intersection and enhance efficacy. The following characteristics have been improved.
  • State improvement: We redesigned the state to reflect the perspective of the operators who make traffic light control decisions. The state collects information on the waiting time, divided into ten intervals, together with the vehicle’s position.
  • Reward improvement: We have modified the calculation of the reward values, where the waiting time for each lane is calculated, and the highest waiting time is selected from the lanes on each arm that use the same traffic signals.
In addition to improving the performance, we built an environment close to the real environment, with an increased variety of vehicle types. We added buses, minibuses, and ambulances to the simulation and considered the system with ambulance prioritization.
The remainder of this paper is organized as follows. Section 2 presents the fundamental principles of this study. Section 3 describes the design of the system. Section 4 describes the simulation design. Section 5 presents the methodology used in this study. The experimental results are presented in Section 6. Finally, this article is concluded in Section 7.

2. Fundamental Principles

The principle of adaptive traffic light control using deep reinforcement learning (DRL) and fuzzy logic involves combining DRL, which utilizes neural networks and reinforcement learning, with fuzzy logic to manage traffic lights based on current traffic conditions. The DRL model learns the most effective control strategies by interacting with the traffic environment and receives rewards in the form of reduced waiting times to determine the traffic light phases. Meanwhile, fuzzy logic determines the duration of the green light based on the number of cars at the intersection. The fundamental principles of the methods used for adaptive traffic light control in this study divide the content into subsections as follows.

2.1. Reinforcement Learning

Reinforcement learning [30,31] is a type of machine learning algorithm. It is a self-teaching system that learns through autonomous trial and error. It acts to optimize rewards and gains knowledge through practical experience to obtain the highest return. Its fundamental components are shown in Figure 1.
  • Environment: The specific environment in which we desire to operate.
  • Agent: The entity that performs actions.
  • Action: The set of possible actions that can be taken in that environment.
  • State: The current situation of the environment, used as the basis for decision-making.
  • Reward: The reward value is a measure of the performance of an action.
  • Policy: Rules or methods for determining the best choice of action based on the current state.
The Markov Decision Process [32] is used to determine the selection of actions. The selection process involves determining the action believed to provide the highest return or cumulative reward. The cumulative reward refers to the sum of all the rewards obtained from the current moment to the future. However, owing to the natural unpredictability of the future when calculating cumulative rewards, it is necessary to reduce the importance of expected rewards. The discount factor (γ), with a value ranging from zero to 1 [33], is applied under Equation (1).
$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ (1)
where $R_{t+1}$ is the reward for transitioning from state $S_t$ to $S_{t+1}$.
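As a concrete illustration, the short Python sketch below computes the discounted return of Equation (1) for a finite reward sequence; the reward values and discount factor are illustrative only.

```python
# Minimal sketch of the discounted return in Equation (1).
# The reward sequence and discount factor are illustrative values only.
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: rewards received after time step t
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))  # 1.0 + 0 + 0.81*2 + 0.729*1
```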

2.2. Deep Reinforcement Learning

Deep reinforcement learning (DRL) [34] applies deep learning combined with reinforcement learning, in which neural networks are used to handle complex structures and datasets. Typically, DRL is used in tasks requiring learning to make decisions in an environment that is subject to variation and uncertainty. It has various applications in multiple fields, such as gaming [35], robotics [36], stock trading [37], medicine [38], energy management [39], and traffic management [40].
For DRL [41], a widely used method is the Deep Q-network (DQN), in which a neural network approximates the Q-function. By replaying stored data and learning from experience, the network approximates the Q-value, a performance measure for taking a given action in a given state, which is then applied to determine the best action in the current situation.
The main equations of the DQN are the Q-function and loss function.
$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')$ (2)
Equation (2) is known as the Bellman equation [42]. It is used to update the Q-value of the current action in state $s_t$, combining the immediate reward and the discounted Q-value [29], where $\gamma$ represents the discount factor, $r(s, a)$ is the reward received after taking an action in state $s_t$, and $\max Q$ refers to the Q-value of the action with the highest value among all possible actions in the next state $s_{t+1}$.
$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( Q(s, a) - q(s, a) \right)^2$ (3)
For the DQN, the implemented loss function [42] is the mean squared error (MSE), according to Equation (3). It is calculated by comparing the predicted Q-value with the target Q-value, where $n$ is the total amount of data, $q(s, a)$ is the predicted Q-value from the neural network, and $Q(s, a)$ is the target Q-value according to Equation (2).
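To make Equations (2) and (3) concrete, the following sketch computes the Bellman targets and the MSE loss for a small illustrative batch; the arrays and the discount factor are made-up values, not the settings used in this study.

```python
import numpy as np

# Sketch of the DQN target (Equation (2)) and MSE loss (Equation (3)) for one batch.
# q_pred, q_next, rewards, and actions are illustrative arrays, not real training data.
gamma = 0.75                                  # illustrative discount factor
q_pred = np.array([[0.2, 0.5, 0.1, 0.3],      # predicted Q-values q(s, a) for 2 samples
                   [0.4, 0.1, 0.6, 0.2]])
q_next = np.array([[0.3, 0.7, 0.2, 0.1],      # predicted Q-values for the next states s'
                   [0.5, 0.2, 0.4, 0.9]])
rewards = np.array([1.0, -2.0])               # r(s, a) for each sample
actions = np.array([1, 2])                    # actions taken in each sample

# Bellman target: r(s, a) + gamma * max_a' Q(s', a')
targets = rewards + gamma * q_next.max(axis=1)

# MSE between the targets and the predicted Q-values of the chosen actions
chosen = q_pred[np.arange(len(actions)), actions]
mse = np.mean((targets - chosen) ** 2)
print(mse)
```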

2.3. Epsilon-Greedy Policy

An epsilon-greedy policy [43] helps the model consider unknown states. In the Deep Q-network, the Q-value is used to determine the optimal action to select. However, the estimated Q-values may not be optimal, and occasionally other actions can provide additional rewards. Therefore, a small probability of taking a previously unexplored action is included.
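A minimal sketch of epsilon-greedy action selection is given below; the Q-values and epsilon value are illustrative.

```python
import random

# Minimal epsilon-greedy action selection; q_values and epsilon are illustrative.
def epsilon_greedy(q_values, epsilon):
    if random.random() < epsilon:                                     # explore
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

action = epsilon_greedy([0.1, 0.8, 0.3, 0.5], epsilon=0.2)
```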

2.4. Fuzzy Logic

Fuzzy logic [44,45] is an alternative to standard logic, which typically considers only right or wrong, yes or no; in digital or Boolean thinking, the binary values 0 and 1 are used. Fuzzy logic, in contrast, is a method for analyzing situations that are ambiguous or unclear. For instance, when deciding whether a light or dark shade of pink should be considered red, fuzzy logic can evaluate the degree to which red and white are present in combination, even though perspectives on this may differ. Fuzzy logic is applied in many fields to support decision-making in confusing or unclear situations and is beneficial where there is considerable uncertainty.

3. System Design

In urban traffic management, one of the main challenges is optimizing traffic signal control at intersections to minimize congestion and reduce waiting time. However, traditional traffic signal control systems often rely on fixed timing plans, which may not be efficient under constantly changing traffic conditions, leading to increased delays and congestion.
To address this limitation, we propose traffic signal control using a Deep Q-network combined with fuzzy logic. The system response varies according to the level of traffic congestion at the intersection. This allows traffic lights to automatically change phases through learning with a Deep Q-network to facilitate the passage of vehicles at intersections. In addition, the signal duration is appropriately adjusted using fuzzy logic by applying a prototype method developed based on the existing studies [29] and serves as the comparison target. The content of the system design is divided into the following subsections.

3.1. Environment

In this study, we used a prototype intersection environment based on a comparison target [29]. The intersection consists of four arms, each with four lanes and a length of 750 m.
However, to ensure the correct and effective direction of traffic, vehicles must use the appropriate lane when turning left or right: the rightmost lane must be used when making a right turn, and the leftmost lane when making a left turn. Vehicles intending to proceed straight can also use the far-right lane, as shown in Figure 2a.
Furthermore, several designs of the environment were used. In the first, every arm has a balanced structure with four lanes. The number of lanes was then reduced from four to three on one, two, and three arms in turn, covering each possible combination of directions. The last design uses left-hand driving with four equal lanes on all arms, as shown in Figure 2b–e.
In Figure 2b–d, the only difference is the lane structure, which reduces straight lanes. In Figure 2e, there is a change in traffic driving on the left side of the road.
Moreover, there is an ambulance prioritization system that gives priority to ambulances to pass through the intersections first. However, their safety must also be considered. Therefore, it is imperative to consider the safety duration, which refers to the time duration before changing to the green light phase to allow the ambulance to pass, as well as the duration after the ambulance has passed through the intersection and the phase selection returns to normal. The safety duration also affects the detection distance. If the safety duration increases, it is necessary to provide sufficient time for the green light to be activated before the ambulance arrives at the intersection. Consequently, it was necessary to consider the detection distance, as shown in Figure 3.
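The sketch below illustrates the prioritization rule described above in plain Python, assuming that each approaching vehicle’s type, distance to the stop line, and required phase have already been extracted from the simulation; the data structure and field names are hypothetical, not the actual implementation.

```python
# Plain-Python sketch of the prioritization rule described above. `vehicles` is assumed
# to be a list of (vehicle_type, distance_to_stop_line_m, approach_phase) tuples already
# extracted from the simulation; the structure and field names are hypothetical.
DETECTION_DISTANCE_M = 250   # detection distance considered in the experiments
SAFETY_DURATION_S = 3        # margin before switching to green and after the pass-through

def ambulance_request(vehicles):
    """Return the green phase an approaching ambulance needs, or None if none is detected."""
    for vtype, distance, phase in vehicles:
        if vtype == "ambulance" and distance <= DETECTION_DISTANCE_M:
            return phase  # grant this phase, applying the safety duration before and after
    return None
```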

3.2. State

This is one of the main components of the proposed method. The current state of the environment is the knowledge used to improve the DQN decision-making process.
For the system to have knowledge about controlling the traffic flow to improve its effectiveness, sufficient information regarding vehicle data at the intersection must be collected. This state information is necessary data that can be immediately provided to the traffic signal operators by considering the location of the vehicle in the environment.
The state variable stores data on the waiting time, divided into ten intervals. These data can be used to indicate the vehicle’s position and time spent waiting. The variable $s_t$ represents the state of the intersection at a specific time step $t$.
A division of lanes at traffic intersections was implemented. Each lane was divided into parts known as cells. The cell sizes are designed to be unequal, in which the cell nearest to the intersection is narrow and widens when it moves further from the intersection [20]. This is because of its superior performance compared with the use of a uniform cell size for all distances [29].
If a cell has a vehicle and there is no waiting time, the value is 1. However, if the waiting time is not zero, the value is divided into ten intervals according to the waiting time, as shown in Table 1, and the cell has a zero value in the absence of a vehicle. Using the state design of the proposed method, it is possible to observe both the presence of vehicles and waiting time. This differs from the prototype Deep Q-network with fuzzy logic [29], which has values of 0 and 1, indicating the presence or absence of a vehicle.
There were ten cells in each lane. The three lanes on the right were combined into a group unit, and the same traffic signals were used. Therefore, the value recorded in the cell was selected from the maximum values of the three lanes.
Furthermore, in the proposed method, the length of each vehicle type varies and the cells near the intersection are narrow, which means that a vehicle longer than a cell can extend into the next cell. Consequently, the highest value within each cell is saved to the state storage. As shown in Figure 4, this differs from the prototype Deep Q-network with fuzzy logic [29], which has only one vehicle type whose length is less than the cell size.
Figure 4 displays an example of a state for the southern side of the intersection. Meanwhile, Figure 4c presents the waiting time of each car, divided into cells based on Figure 4b. If a cell contains more than one vehicle, the highest value within the cell is used for comparison in Table 1 before being stored as part of the state, as depicted in cell 3 of Figure 4c,d.
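The following sketch illustrates the cell-encoding idea described above, assuming the per-cell maximum waiting times have already been extracted from the simulation; the interval boundaries are placeholders, since the actual intervals are defined in Table 1.

```python
# Simplified sketch of the cell encoding described above. Per-cell maximum waiting
# times are assumed to be already extracted from the simulation; WAIT_INTERVALS is
# a placeholder for the ten intervals of Table 1 (the actual boundaries differ).
WAIT_INTERVALS = [5, 10, 20, 40, 80, 160, 320, 640, 1280, float("inf")]  # hypothetical (s)

def encode_cell(has_vehicle, max_wait_s):
    """0 = empty cell, 1 = vehicle present with no waiting time,
    otherwise a value derived from the waiting-time interval (assumed mapping)."""
    if not has_vehicle:
        return 0
    if max_wait_s == 0:
        return 1
    for level, upper in enumerate(WAIT_INTERVALS, start=2):
        if max_wait_s <= upper:
            return level
    return len(WAIT_INTERVALS) + 1

# Example: one lane group with ten cells, nearest to the stop line first
cells = [(True, 12.0), (True, 0.0)] + [(False, 0.0)] * 8
state_group = [encode_cell(v, w) for v, w in cells]  # e.g., [4, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```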

3.3. Action

The action involves the agent choosing the traffic light phase that provides the highest expected reward based on the current state information. The Deep Q-network either generates this choice directly as its output or selects it randomly using the epsilon-greedy strategy, which encourages the exploration of unknown states.
Regarding the directions of the traffic signals, as shown in Figure 5, the traffic light switches to the green phase for a duration determined by the fuzzy logic control. After this duration elapses, the next phase is selected. If the same phase is chosen again, it proceeds without delay; however, if the agent chooses a different phase, the light switches to yellow for 3 s before changing to green for the newly selected phase. The same traffic signal phases, shown in Figure 5, were used in all environments.
The system chooses the appropriate phase of the traffic green signal from one of the four already-defined phases as follows:
  • Green light for north–south.
  • Green light for north–south left turn.
  • Green light for east–west.
  • Green light for east–west left turn.

3.4. Reward

Rewards are the measurements that can be used to evaluate the results of an agent’s actions. The calculation begins when the agent has completed a one-time green light duration, which is when the agent switches from state st to state st+1. The selection of the parameters used in calculating the reward is paramount because of the necessity of using agents to choose actions and apply them for additional learning.
We used the waiting time to calculate rewards because the main objective of this study is to reduce traffic congestion, improve traffic control efficiency, and minimize waiting time at intersections. However, a problem was discovered in the prototype Deep Q-network with fuzzy logic [29] related to the prototype environment. In particular, the reward function calculates the total waiting time for all cars.
Consequently, left-turning vehicles experience long waiting times at the traffic lights: because of how the reward is computed, the agent learns that favoring the straight-through phases yields higher rewards. As shown in Figure 6, groups 1, 3, 5, and 7 are left-turn lane groups, and groups 2, 4, 6, and 8 are straight lane groups. The accumulated waiting time summed over all cars in a straight lane group is naturally higher than in a left-turn lane group because the straight group contains three lanes. Therefore, the accumulated waiting time is computed per lane, and the highest value is selected from the lanes on each arm that share the same traffic signal. This differs from the prototype Deep Q-network with fuzzy logic [29], which sums the accumulated waiting time over all cars. The accumulated waiting time of a vehicle increases whenever its velocity is below 0.1 m/s.
The waiting time calculation in each lane is shown in Equation (4).
$Twt_t = \sum_{i=1}^{4} G_{2i-1,t} + \sum_{i=1}^{4} \max G_{2i,t}$ (4)
In Equation (4), $Twt_t$ is the total waiting time at time step $t$, $G_{2i-1,t}$ is the total waiting time of the left-turn lane group at time step $t$, and $\max G_{2i,t}$ is the highest per-lane total waiting time within the lane group used for going straight and turning right at time step $t$, as shown in Figure 6.
The reward function is shown in Equation (5).
$r_t = Twt_{t-1} - Twt_t$ (5)
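A small sketch of the reward computation of Equations (4) and (5) is given below, assuming the per-lane accumulated waiting times have been grouped as in Figure 6; the numbers and data layout are illustrative.

```python
# Hedged sketch of the reward in Equations (4) and (5). `left_turn_groups` holds the
# accumulated waiting time of each single-lane left-turn group (groups 1, 3, 5, 7 in
# Figure 6); `straight_groups` holds, per arm, the per-lane accumulated waiting times
# of the three lanes that share a signal (groups 2, 4, 6, 8). Values are illustrative.
def total_waiting_time(left_turn_groups, straight_groups):
    # Equation (4): sum of the left-turn groups plus, for each straight group,
    # the highest per-lane accumulated waiting time.
    return sum(left_turn_groups) + sum(max(lanes) for lanes in straight_groups)

def reward(twt_previous, twt_current):
    # Equation (5): positive when the total waiting time decreases.
    return twt_previous - twt_current

twt_t = total_waiting_time([30.0, 0.0, 12.0, 5.0],
                           [[40.0, 22.0, 8.0], [0.0, 3.0, 1.0],
                            [15.0, 60.0, 10.0], [7.0, 7.0, 2.0]])
```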

3.5. Agent

The agent uses the epsilon-greedy policy to explore the environment during the initial learning episodes. Under this policy, a random action is chosen with probability ε, and the action with the highest predicted Q-value is selected with probability 1 − ε. The epsilon value for each episode is specified according to Equation (6) [20].
$\varepsilon = 1 - \frac{n}{N}$ (6)
where $n$ is the current episode, and $N$ is the total number of episodes.
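A one-function sketch of this linear decay is shown below; the total of 500 episodes matches the number used for training in this study.

```python
# Linear epsilon decay of Equation (6) over the training episodes.
TOTAL_EPISODES = 500  # as used for training in this study

def epsilon_for_episode(n, N=TOTAL_EPISODES):
    return 1.0 - n / N

# epsilon goes from 1.0 (pure exploration) toward 0 (pure exploitation)
print(epsilon_for_episode(0), epsilon_for_episode(250), epsilon_for_episode(499))
```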

3.6. Necessary Information

This section describes the key categories, conditions, and costs involved in implementing a deep reinforcement learning-based traffic signal control system. Table 2 is a summary of this necessary information and its associated costs. Understanding this information is important for comprehending the basic requirements and practical limitations involved in developing and deploying the system.

4. Simulation Design

This section focuses on the structure of the DQN and the design of the traffic simulation environment.

4.1. Network Setting

Regarding the neural network structure, the Deep Q-network was constructed using Keras and TensorFlow. The input layer receives the state obtained from the environment and comprises 80 nodes, followed by two hidden layers of 100 nodes each with Rectified Linear Unit (ReLU) activations. Each value from the preceding layer is weighted before further processing: the first hidden layer receives the weighted inputs from the input layer, its output is passed to the second hidden layer, and the outputs of the second hidden layer are sent to the output layer, which produces the Q-values for the four action phases. The learning parameters used in the experiments are listed in Table 3.
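The following Keras sketch reproduces the described structure (80 inputs, two hidden layers of 100 ReLU units, and four linear outputs); the optimizer and learning rate are assumptions for illustration, since the actual learning parameters are listed in Table 3.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal sketch of the described network: 80 state inputs, two hidden layers of
# 100 ReLU units, and a linear output of 4 Q-values (one per traffic signal phase).
# The optimizer and learning rate below are assumptions; the actual values are in Table 3.
def build_q_network(num_states=80, num_actions=4, learning_rate=1e-3):
    model = keras.Sequential([
        layers.Dense(100, activation="relu", input_shape=(num_states,)),
        layers.Dense(100, activation="relu"),
        layers.Dense(num_actions, activation="linear"),  # Q-value for each phase
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="mse")  # MSE loss as in Equation (3)
    return model

q_network = build_q_network()
```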
We describe the system’s learning process in Algorithm 1. At the beginning of each episode, the epsilon value is calculated to determine whether to explore or exploit. Subsequently, the system controls the traffic light based on the epsilon value and stores the state, action, reward, and next state in memory. We utilized the deque data structure from Python’s collections module to efficiently manage this memory. The process continues until the maximum step is reached. Afterward, the episode statistics are saved, and the training process known as experience replay is initiated. This important training phase involves using data on the state, action, reward, and next state to enhance learning from experience. Finally, it iterates until the total number of episodes is completed, at which point the model is saved and prepared for testing.
Algorithm 1. Training.
Input: number of hidden layers, number of nodes in hidden layers, batch size, learning rate, training epochs, memory size, discount factor, number of states, number of actions, max steps, total cars generated, total episodes
Output: traffic light control model and episode stats
1: set: episode = 0
2: while current episode < total episodes do
3:  calculate the epsilon value
4:  set: step = 0
5:  while step < max steps do
6:     get current state
7:     calculate the reward
8:     add sample data to memory
9:     if random number 0.0 to 1.0 < epsilon value then
10:        get random action
11:     else
12:        get action based on model prediction
13:     end if
14:     if the phase has changed then
15:        set yellow phase
16:        set yellow duration
17:        while yellow duration > 0 do
18:           step = step + 1
19:           yellow duration = yellow duration − 1
20:        end while
21:     end if
22:     set green phase based on action
23:     get green duration based on fuzzy
24:     set green duration
25:     while green duration > 0 do
26:        step = step + 1
27:        green duration = green duration − 1
28:     end while
29:  end while
30:  save episode stats
31:  for epoch in range training epochs do
32:     get samples from batch = batch size
33:     train
34:  end for
35:  episode = episode + 1
36: end while
37: save model
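The replay memory used in Algorithm 1 can be sketched as follows, using a deque from Python’s collections module and uniform random batch sampling; the memory size and batch size shown are placeholders, with the actual values given in Table 3.

```python
import random
from collections import deque

# Minimal sketch of the replay memory used during training (a deque, as mentioned above)
# and of sampling a batch for experience replay. The memory size and batch size here are
# placeholders; the actual values are listed in Table 3.
MEMORY_SIZE = 50_000   # placeholder
BATCH_SIZE = 100       # placeholder

memory = deque(maxlen=MEMORY_SIZE)

def add_sample(state, action, reward, next_state):
    memory.append((state, action, reward, next_state))

def sample_batch():
    # Sample uniformly at random once enough experience has been collected.
    if len(memory) < BATCH_SIZE:
        return list(memory)
    return random.sample(memory, BATCH_SIZE)
```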

4.2. Environment Setting

The simulation environment was built using SUMO (Simulation of Urban Mobility) software version 1.18.0 and Python 3.9.17 [46]. SUMO software primarily employs microscopic models for traffic simulation. This model focuses on the behavior and interactions of individual vehicles within the traffic system. The driving actions of the vehicle in front, such as accelerating, decelerating, and stopping, affect the driving behavior of the following vehicles. The movement of each vehicle in the traffic flow is based on the following:
  • Car-Following Models: The models define the distance and speed of vehicles according to changing traffic conditions.
  • Lane-Changing Models: The models determine the lane-changing behavior of vehicles based on rules such as gap and safety considerations.
In the simulation, a random seed in Python was used to generate vehicles in each direction. This ensures that an episode can be repeated with exactly the same traffic flow when comparing methods or verifying results, whereas different seeds give each of the 500 training episodes and 25 test episodes a different traffic flow. Vehicles generated in a given situation passed through the intersection only once, and the number of vehicles coming from each direction was nearly equal.
We set the probability of a vehicle turning (left or right) versus proceeding straight at an approximate ratio of 1 to 3, respectively. One thousand vehicles were generated continuously over 5400 s, which constitutes one episode. The probability of releasing one car per second was determined using a Weibull distribution with a shape parameter of 2.
Figure 7 shows the number of vehicles that were built according to the settings over a single episode.
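The departure-time generation can be sketched as follows, drawing 1000 samples from a Weibull distribution with shape parameter 2 and rescaling them to the 5400 s episode; the min–max rescaling and the use of NumPy are assumptions about the implementation.

```python
import numpy as np

# Sketch of generating 1000 departure times over a 5400 s episode from a Weibull
# distribution with shape parameter 2, as described above. The min-max rescaling to
# the episode length is an assumption about how raw samples are mapped to seconds.
N_CARS = 1000
EPISODE_LENGTH_S = 5400

rng = np.random.default_rng(seed=42)          # fixed seed for repeatable episodes
raw = np.sort(rng.weibull(2, size=N_CARS))    # shape parameter = 2
depart_times = (raw - raw.min()) / (raw.max() - raw.min()) * EPISODE_LENGTH_S
depart_times = np.rint(depart_times).astype(int)   # whole seconds
```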
The parameter settings for the environment are listed in Table 4. We assumed a simulation scenario in which 1000 cars pass through the intersection in 5400 s, or 90 min. This number of cars is appropriate, as it generates reasonably complex traffic situations while keeping system complexity and resource demands within manageable limits. This balance facilitates effective learning and control improvement.
For the simulation, vehicles were classified into four categories: personal cars, minibuses, buses, and ambulances. Personal cars had a length of 4.5 m, a maximum speed of 80 km/h, and an acceleration of 1 m/s2. Regarding minibuses and buses, the minibus was 7 m long, whereas the bus was 12 m long. Both had a maximum speed limit of 60 km/h and an acceleration of 1 m/s2. The ambulance had a length of 6.5 m, a maximum velocity of 120 km/h, and an acceleration of 2 m/s2. All vehicles had a deceleration of 4.5 m/s2, and a departure speed of 36 km/h. The minimum gap between any two vehicles was 2.5 m, and a sigma (driver imperfection) value of 0.5 was used.
We set a speed limit for general vehicles according to local driving regulations in urban areas. However, ambulances may need to drive at high speed in emergencies. Therefore, the ambulance speed limit was set higher.
Ambulances were generated with an approximately 1% probability. Buses and minibuses together had an approximately 20% probability of appearing, and most of the remaining probability applied to personal cars.

5. Methodology

The scikit-fuzzy (skfuzzy) library was used as the fuzzy logic toolkit [44]. The chosen parameters are the inputs GP (green phase) and RP (red phase) [29], which denote the number of vehicles at the intersection to be granted a green light and the number of vehicles waiting at a red light, respectively. In this experiment, the GP values ranged from 0 to 20, whereas the RP values ranged from 0 to 60. The output is the green duration, which ranges from 0 to 30. Because fuzzy logic processing when selecting a new phase requires 0.5–1 s and thus introduces system latency, a dataset covering all conceivable combinations of GP and RP was generated in advance. When the green light duration must be determined, the input values GP and RP are used to look up the result in this pre-existing set instead of repeating the fuzzy logic processing.
Fuzzy logic processing uses the Mamdani fuzzy inference method [44]. The first step is fuzzification, which takes an input value that is a precise number, applies a membership function, and adapts it according to the membership level, which is fuzzy data, as shown in Figure 8a,b. A description of the membership functions is presented in Table 5 and Table 6. Then, it is sent to the inference engine process using the IF-THEN concept based on the predefined rules presented in Table 7. However, the input can match several rules. Next, the results derived from the inference engine process using all the defined rules are combined using the union operation.
Defuzzification converts fuzzy sets into crisp values using a membership function, as shown in Figure 9a. A description of the membership functions is presented in Table 8. This method determines the centroid or center of gravity (COG) [44] on the x-axis, with all the results rounded to the nearest integer. An example of the results obtained is the combination of all results per defined rule and the defuzzification process. When GP was set to 7 and RP was set to 3, the center of gravity was located on the x-axis at 12.097. Therefore, the duration of the green light during this phase is 13 s, as shown in Figure 9b.
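The fuzzy controller described in this section can be sketched with scikit-fuzzy as follows; the membership functions and rules below are simplified placeholders, since the actual definitions are given in Tables 5–8, so the computed duration will not match the 13 s example exactly.

```python
import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

# Hedged sketch of the Mamdani controller described above using scikit-fuzzy.
# The membership functions and rules are illustrative placeholders; the actual
# functions and rule base are defined in Tables 5-8.
gp = ctrl.Antecedent(np.arange(0, 21, 1), 'GP')            # vehicles to be granted green
rp = ctrl.Antecedent(np.arange(0, 61, 1), 'RP')            # vehicles waiting at red
green = ctrl.Consequent(np.arange(0, 31, 1), 'green_duration')

# Placeholder triangular membership functions (three levels each)
gp['low'] = fuzz.trimf(gp.universe, [0, 0, 10])
gp['medium'] = fuzz.trimf(gp.universe, [5, 10, 15])
gp['high'] = fuzz.trimf(gp.universe, [10, 20, 20])
rp['low'] = fuzz.trimf(rp.universe, [0, 0, 30])
rp['medium'] = fuzz.trimf(rp.universe, [15, 30, 45])
rp['high'] = fuzz.trimf(rp.universe, [30, 60, 60])
green['short'] = fuzz.trimf(green.universe, [0, 0, 15])
green['medium'] = fuzz.trimf(green.universe, [5, 15, 25])
green['long'] = fuzz.trimf(green.universe, [15, 30, 30])

rules = [
    ctrl.Rule(gp['high'] & rp['low'], green['long']),
    ctrl.Rule(gp['low'] & rp['high'], green['short']),
    ctrl.Rule(gp['medium'] | rp['medium'], green['medium']),
]

system = ctrl.ControlSystemSimulation(ctrl.ControlSystem(rules))
system.input['GP'] = 7
system.input['RP'] = 3
system.compute()                                  # centroid defuzzification by default
print(round(system.output['green_duration']))
```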

6. Experimental Results

The experimental results are divided into seven parts. The first part presents the results of the training process, and the remaining parts present the experimental results. Following the completion of the learning process, the reported results are the average values obtained over a total of 25 test episodes, each with a different traffic situation. The experimental results cover three types of traffic light control.
  • Traditional traffic light control is a method that is currently used worldwide, in which a green light signal is given for 10 s in a sequence and repeated for each arm, as shown in Figure 5. It is called the traditional method in the experimental results.
  • Traffic light control using a prototype Deep Q-network with fuzzy logic is used for comparison. In the experimental results, it is called the original DQN with the fuzzy logic method [29].
  • Traffic light control using the proposed method.
In the experimental results tables, ALL-4 indicates that all arms had four outgoing lanes, whereas E-3, N-3, S-3, and W-3 indicate that the arm coming from the east, north, south, or west, respectively, had three outgoing lanes.
The experimental results in Section 6.1, Section 6.2, Section 6.3, Section 6.4, Section 6.5, Section 6.6 and Section 6.7 are compared with the traditional and original DQN with fuzzy logic. The following values were used to compare the results.
  • The average total waiting time (WTAVG) is the average accumulated waiting time of all vehicles in the outgoing lane per step.
    $WT_{Total} = \sum_{i=1}^{N} WT[i]$ (7)
    where N is the number of vehicles at the intersection in the outgoing lane at time step t, and WT is the accumulated waiting time of the vehicle.
    $WT_{AVG} = \frac{1}{T} \sum_{i=1}^{T} WT_{Total}[i]$ (8)
    where T is the maximum step, and WTTotal in Equation (7) is the total waiting time at time step t.
  • The average total waiting time for vehicles to make a left turn (WTLAVG) is the average accumulated waiting time for all vehicles in the outgoing lane to make a left turn per step.
    $WTL_{Total} = \sum_{i=1}^{N} WTL[i]$ (9)
    where N is the number of vehicles at the intersection of the left-turn lane at time step t, and WTL is the accumulated waiting time of vehicles in the left-turn lane.
    $WTL_{AVG} = \frac{1}{T} \sum_{i=1}^{T} WTL_{Total}[i]$ (10)
    where T is the maximum step, and WTLTotal in Equation (9) is the total waiting time of the vehicle on the left-turn lane at time step t.
  • The average total waiting time per car (WTCAVG) is the average accumulated waiting time per vehicle in the outgoing lane per step.
    $WTC_{AVG} = \frac{1}{T} \sum_{i=1}^{T} \frac{1}{Q} WT_{Total}[i]$ (11)
    where T is the maximum step, Q is the number of vehicles at the intersection with a velocity below 0.1 m per second, and WTTotal in Equation (7) is the total waiting time at time step t.

6.1. All-4 Training Results

The system completed a learning process consisting of 500 episodes, and the training results are shown in Figure 10a. Negative rewards indicate that the waiting time increased after the selected action. The graph shows a decreasing trend in the total negative reward per episode, which indicates that the system has learned.
In Figure 10b–d, the graph shows that after learning, the proposed method outperformed the original DQN with the fuzzy logic method.

6.2. All-4

The experimental results were obtained in the original environment with four outgoing lanes on each arm.
In Table 9, for the proposed method, the values of WTAVG, WTLAVG, and WTCAVG can be reduced by 69.80%, 23.24%, and 60.78% when compared with the traditional method, and 18.46%, 40.36%, and 21.23% when compared with the original DQN with the fuzzy logic method, respectively.
It is noted that the original DQN with the fuzzy logic method yielded worse results than the traditional method for left-turning vehicles because of the inappropriate reward, given that the traditional method uses a fixed 10 s green signal in sequence for every phase.
However, the proposed method, in which we modified the reward calculation, yielded better results than all the other methods when compared with all metrics.
The experimental results are presented in Section 6.3. The following value was used to compare the results instead of Equation (10):
  • The average total waiting time for vehicles to make a right turn (WTRAVG) is the average accumulated waiting time for all vehicles in the outgoing lane to make a right turn per step.
    $WTR_{Total} = \sum_{i=1}^{N} WTR[i]$ (12)
    where N is the number of vehicles at the intersection of the right-turn lane at time step t, and WTR is the accumulated waiting time of vehicles in the right-turn lane.
    $WTR_{AVG} = \frac{1}{T} \sum_{i=1}^{T} WTR_{Total}[i]$ (13)
    where T is the maximum step, and WTRTotal in Equation (12) is the total waiting time of the vehicle in the right-turn lane at time step t.

6.3. Left-Hand Traffic

The experimental results are presented in Table 10 for a left-hand traffic environment with four outgoing lanes on each arm.
Using the proposed method, the values of WTAVG, WTRAVG, and WTCAVG can be reduced by 71.21%, 28.54%, and 61.23% when compared with the traditional method, and 22.87%, 45.86%, and 21.12% when compared with the original DQN with fuzzy logic method, respectively.

6.4. Three Lanes One Arm

The experimental results are presented in Table 11 for environments in which one arm had three outgoing lanes and three arms had four outgoing lanes, giving four environments. Averaged over these four environments, the proposed method reduced the values of WTAVG, WTLAVG, and WTCAVG by 78.01%, 14.73%, and 64.23% compared with the traditional method, and by 19.12%, 39.92%, and 19.69% compared with the original DQN with the fuzzy logic method, respectively.
The traditional method yielded the worst overall waiting time. However, it yielded better results than the original DQN with the fuzzy logic method for left-turning vehicles because the green light signal was given for 10 s in sequence in every phase. Nevertheless, the proposed method yielded better results than all the other methods on all metrics.

6.5. Three Lanes Two Arms

The experimental results are presented in Table 12 for environments in which two arms had three outgoing lanes and two arms had four outgoing lanes, giving six environments. Averaged over these six environments, the proposed method reduced the values of WTAVG, WTLAVG, and WTCAVG by 81.66%, 9.22%, and 66.21% compared with the traditional method, and by 20.23%, 45.14%, and 21.35% compared with the original DQN with the fuzzy logic method, respectively.
It is noted that the results are consistent with Section 6.4: the traditional method yielded the worst overall waiting time but better results than the original DQN with the fuzzy logic method for left-turning vehicles. However, the proposed method yielded better results than all the other methods on all metrics.

6.6. Three Lanes Three Arms

The experimental results are shown in Table 13 for environments in which three arms had three outgoing lanes and one arm had four outgoing lanes, giving four environments. Averaged over these four environments, the proposed method reduced the values of WTAVG, WTLAVG, and WTCAVG by 84.34%, 6.26%, and 67.54% compared with the traditional method, and by 19.98%, 48.46%, and 20.66% compared with the original DQN with the fuzzy logic method, respectively.
The results are consistent with those in Section 6.4 and Section 6.5, in which the proposed method yielded better results than all the methods when comparing all the metrics.
The experimental results are presented in Section 6.7. The following additional values were used to compare the results:
  • Total ambulance waiting time (WTATotal) is the accumulated waiting time for all ambulances.
    $WTA_{Total} = \sum_{i=1}^{A} WTA[i]$ (14)
    where A is the number of ambulances, and WTA is the accumulated waiting time of ambulances.

6.7. Ambulance Prioritization System

The experimental results are presented in Table 14, showing safety durations of 3 s and 5 s combined with detection distances of 150 m, 200 m, 250 m, and 300 m.
A few variables were considered when selecting the appropriate safety duration and detection distance for the ambulance prioritization system, based on two metrics: the average total waiting time and the total ambulance waiting time. When analyzing the average total waiting time, a detection distance of 300 m resulted in the highest value and was therefore deemed inappropriate. When analyzing the total ambulance waiting time, the pair with a safety duration of 3 s and a detection distance of 250 m provided the optimal value, indicating no waiting time for the ambulance. Therefore, a safety duration of 3 s and a detection distance of 250 m were selected for the ambulance prioritization system.
Based on the experimental results shown in Table 15, using a safety duration of 3 s and a detection distance of 250 m, it was found that the average total waiting time is similar for the proposed method. However, the original DQN with the fuzzy logic method had an approximately 5.39% higher average total waiting time. Overall, the proposed method provided better results.

7. Conclusions

In this study, we proposed a traffic light control method using a Deep Q-network with fuzzy logic, aiming to reduce waiting times at intersections and increase the method’s efficiency. The state was adjusted to reflect the perspective of the individuals responsible for making traffic light control decisions. However, previous studies using the Deep Q-network have yielded suboptimal results because of the reward function, which causes the system to prioritize straight-moving vehicles and makes left-turning vehicles wait excessively. To solve this problem, we adjusted the reward function to consider the waiting time in each lane. We conducted experiments in a variety of environments to test the flexibility of the method and account for the different lane configurations found in real situations. Moreover, we implemented a system that prioritizes ambulances passing through the intersection.
By training in the prototype environment and applying the learned model to various environments close to real conditions, the proposed method outperformed the comparison target in all environments, reducing the average total waiting time, the average total waiting time for left-turning vehicles, and the average total waiting time per car by 18.46%, 40.36%, and 21.23%, respectively, in the prototype environment. Additionally, in the experiment with the ambulance prioritization system, an appropriate safety duration of 3 s and detection distance of 250 m were determined, resulting in a significant reduction in ambulance waiting time.
In this research, the proposed method was applied only to an isolated four-way intersection that was not connected to other intersections, and a larger number and variety of vehicles would be required for greater consistency with real environments. Therefore, for future work, we plan to develop a system that can accommodate more situations by increasing the number of vehicles, diversifying vehicle types, and creating interconnected intersections.

Author Contributions

Conceptualization, T.M. and S.W.; methodology, T.M. and S.W.; writing—original draft preparation, T.M.; writing—review and editing, T.M. and S.W.; visualization, T.M. and S.W.; supervision, S.W.; funding acquisition, T.M. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the School of Engineering, King Mongkut’s Institute of Technology Ladkrabang.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. INRIX 2023 Global Traffic Scorecard. Available online: https://inrix.com/scorecard/?utm_source=hellobar&utm_medium=direct#city-ranking-list (accessed on 20 June 2024).
  2. Wang, C. The Relationship between Traffic Congestion and Road Accidents: An Econometric Approach Using GIS. Ph.D. Thesis, Loughborough University, Leicestershire, UK, February 2010. [Google Scholar]
  3. Pappis, C.P.; Mamdani, E.H. A Fuzzy Logic Controller for a Traffic Junction. IEEE Trans. Syst. Man Cybern. 1977, 7, 707–717. [Google Scholar] [CrossRef]
  4. Taskin, H.; Gumustas, R. Simulation of traffic flow system and control using fuzzy logic. In Proceedings of the 12th IEEE International Symposium on Intelligent Control, Istanbul, Turkey, 16–18 July 1997. [Google Scholar]
  5. Liu, H.-H.; Hsu, P.-L. Design and Simulation of Adaptive Fuzzy Control on the Traffic Network. In Proceedings of the 2006 SICE-ICASE International Joint Conference, Busan, Republic of Korea, 18–21 October 2006. [Google Scholar]
  6. Kulkarni, G.H.; Waingankar, P.G. Fuzzy logic based traffic light controller. In Proceedings of the 2007 International Conference on Industrial and Information Systems, Peradeniya, Sri Lanka, 9–11 August 2007. [Google Scholar]
  7. Cai, Y.; Lv, Z.; Chen, J.; Wu, L. An intelligent control for crossroads traffic light. In Proceedings of the 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Shanghai, China, 26–28 July 2011. [Google Scholar]
  8. Firdous, M.; Din Iqbal, F.U.; Ghafoor, N.; Qureshi, N.K.; Naseer, N. Traffic Light Control System for Four-Way Intersection and T-Crossing Using Fuzzy Logic. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 29–31 March 2019. [Google Scholar]
  9. Tunc, I.; Yesilyurt, A.Y.; Soylemez, M.T. Intelligent Traffic Light Control System Simulation for Different Strategies with Fuzzy Logic Controller. In Proceedings of the 2019 11th International Conference on Electrical and Electronics Engineering (ELECO), Bursa, Turkey, 28–30 November 2019. [Google Scholar]
  10. Prontri, S.; Wuttidittachotti, P.; Thajchayapong, S. Traffic signal control using fuzzy logic. In Proceedings of the 2015 12th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Hua Hin, Thailand, 24–27 June 2015. [Google Scholar]
  11. Bi, Y.; Li, J.; Lu, X. Single Intersection Signal Control and Simulation Based on Fuzzy Logic. In Proceedings of the 2011 Third International Conference on Intelligent Human-Machine Systems and Cybernetics, Hangzhou, China, 26–27 August 2011. [Google Scholar]
  12. Sun, C. Fundamental Q-learning Algorithm in Finding Optimal Policy. In Proceedings of the 2017 International Conference on Smart Grid and Electrical Automation (ICSGEA), Changsha, China, 27–28 May 2017. [Google Scholar]
  13. Pandey, D.; Pandey, P. Approximate Q-Learning: An Introduction. In Proceedings of the 2010 Second International Conference on Machine Learning and Computing, Bangalore, India, 9–11 February 2010. [Google Scholar]
  14. Rosyadi, A.R.; Wirayuda, T.A.B.; Al-Faraby, S. Intelligent traffic light control using collaborative Q-Learning algorithms. In Proceedings of the 2016 4th International Conference on Information and Communication Technology (ICoICT), Bandung, Indonesia, 25–27 May 2016. [Google Scholar]
  15. Liao, Y.; Cheng, X. Study on Traffic Signal Control Based on Q-Learning. In Proceedings of the 2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery, Tianjin, China, 14–16 August 2009. [Google Scholar]
  16. Liu, Y.; Liu, L.; Chen, W.-P. Intelligent traffic light control using distributed multi-agent Q learning. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017. [Google Scholar]
  17. Ye, B.-L.; Wu, P.; Wu, W.; Li, L.; Zhu, Y.; Chen, B. Q-learning based traffic signal control method for an isolated intersection. In Proceedings of the 2022 China Automation Congress (CAC), Xiamen, China, 25–27 November 2022. [Google Scholar]
  18. Yusop, M.A.M.; Mansor, H.; Gunawan, T.S.; Nasir, H. Intelligent Traffic Lights using Q-learning. In Proceedings of the 2022 IEEE 8th International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA), Melaka, Malaysia, 26–28 September 2022. [Google Scholar]
  19. Sutisna, N.; Ilmy, A.M.R.; Arifuzzaki, Z.; Syafalni, I.; Maulana, D.; Mulyawan, R.; Adiono, T. Deep Q-Network Model for Intelligent Traffic Light. In Proceedings of the 2022 International Symposium on Electronics and Smart Devices (ISESD), Bandung, Indonesia, 8–9 November 2022. [Google Scholar]
  20. Vidali, A.; Crociani, L.; Vizzari, G.; Bandini, S. A Deep Reinforcement Learning Approach to Adaptive Traffic Lights Management. In Proceedings of the 20th Workshop “From Objects to Agents”, Parma, Italy, 26–28 June 2019. [Google Scholar]
  21. Wu, T.; Zhou, P.; Liu, K.; Yuan, Y.; Wang, X.; Huang, H.; Wu, D.O. Multi-Agent Deep Reinforcement Learning for Urban Traffic Light Control in Vehicular Networks. IEEE Trans. Veh. Technol. 2020, 69, 8243–8256. [Google Scholar] [CrossRef]
22. Abhishek, A.; Nayak, P.; Hegde, K.P.; Prasad, A.L.; Nagegowda, K.S. Smart Traffic Light Controller using Deep Reinforcement Learning. In Proceedings of the 2022 3rd International Conference for Emerging Technology (INCET), Belgaum, India, 27–29 May 2022.
23. Zhancheng, S. Research on Application of Deep Reinforcement Learning in Traffic Signal Control. In Proceedings of the 2021 6th International Conference on Frontiers of Signal Processing (ICFSP), Paris, France, 9–11 September 2021.
24. Tigga, A.; Hota, L.; Patel, S.; Kumar, A. A Deep Q-Learning-Based Adaptive Traffic Light Control System for Urban Safety. In Proceedings of the 2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India, 16–17 December 2022.
25. Kodama, N.; Harada, T.; Miyazaki, K. Traffic Signal Control System Using Deep Reinforcement Learning With Emphasis on Reinforcing Successful Experiences. IEEE Access 2022, 10, 128943–128950.
26. Yang, J.; Wang, P.; Ju, Y. Variable Speed Limit Intelligent Decision-Making Control Strategy Based on Deep Reinforcement Learning under Emergencies. Sustainability 2024, 16, 965.
27. Jiang, H.; Zhang, H.; Feng, Z.; Zhang, J.; Qian, Y.; Wang, B. A Multi-Objective Optimal Control Method for Navigating Connected and Automated Vehicles at Signalized Intersections Based on Reinforcement Learning. Appl. Sci. 2024, 14, 3124.
28. Tagesson, D. A Comparison between Deep Q-learning and Deep Deterministic Policy Gradient for an Autonomous Drone in a Simulated Environment. Bachelor’s Thesis, Mälardalens University, Västerås, Sweden, 24 June 2021.
29. Tunc, I.; Soylemez, M.T. Fuzzy logic and deep Q learning based control for traffic lights. Alex. Eng. J. 2023, 67, 343–359.
30. Yau, K.-L.A.; Chong, Y.-W.; Fan, X.; Wu, C.; Saleem, Y.; Lim, P.-C. Reinforcement Learning Models and Algorithms for Diabetes Management. IEEE Access 2023, 11, 28391–28415.
31. LA, P.; Bhatnagar, S. Reinforcement Learning With Function Approximation for Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2011, 12, 412–421.
32. Singh, R.; Gupta, A.; Shroff, N.B. Learning in Constrained Markov Decision Processes. IEEE Trans. Control Netw. Syst. 2023, 10, 441–453.
33. Zhu, Z.; Lin, K.; Jain, A.K.; Zhou, J. Transfer Learning in Deep Reinforcement Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362.
34. Luo, Y.; Gang, T.; Chen, L. Research on Target Defense Strategy Based on Deep Reinforcement Learning. IEEE Access 2022, 10, 82329–82335.
35. Oh, I.; Rho, S.; Moon, S.; Son, S.; Lee, H.; Chung, J. Creating Pro-Level AI for a Real-Time Fighting Game Using Deep Reinforcement Learning. IEEE Trans. Games 2022, 14, 212–220.
36. Kumaar, A.A.N.; Kochuvila, S. Mobile Service Robot Path Planning Using Deep Reinforcement Learning. IEEE Access 2023, 11, 100083–100096.
37. Ansari, Y.; Yasmin, S.; Naz, S.; Zaffar, H.; Ali, Z.; Moon, J.; Rho, S. A Deep Reinforcement Learning-Based Decision Support System for Automated Stock Market Trading. IEEE Access 2022, 10, 127469–127501.
38. Yin, S.; Wang, K.; Han, Y.; Pan, J.; Wang, Y.; Li, S.; Yu, F.R. Left Ventricle Contouring in Cardiac Images in the Internet of Medical Things via Deep Reinforcement Learning. IEEE IOT J. 2023, 10, 17705–17717.
39. Lu, R.; Jiang, Z.; Wu, H.; Ding, Y.; Wang, D.; Zhang, H.-T. Reward Shaping-Based Actor–Critic Deep Reinforcement Learning for Residential Energy Management. IEEE Trans. Ind. Inform. 2023, 19, 2662–2673.
40. Liang, X.; Du, X.; Wang, G.; Han, Z. A Deep Reinforcement Learning Network for Traffic Light Cycle Control. IEEE Trans. Veh. Technol. 2019, 68, 1243–1253.
41. Haydari, A.; Yılmaz, Y. Deep Reinforcement Learning for Intelligent Transportation Systems: A Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 11–32.
42. Sarikhani, R.; Keynia, F. Cooperative Spectrum Sensing Meets Machine Learning: Deep Reinforcement Learning Approach. IEEE Commun. Lett. 2020, 24, 1459–1462.
43. Mughal, B.; Fadlullah, Z.M.; Fouda, M.M.; Ikki, S. Optimizing Packet Forwarding Performance in Multiband Relay Networks via Customized Reinforcement Learning. IEEE Open J. Commun. Soc. 2022, 3, 973–985.
44. Ali, M.E.M.; Durdu, A.; Celtek, S.A.; Yilmaz, A. An Adaptive Method for Traffic Signal Control Based on Fuzzy Logic With Webster and Modified Webster Formula Using SUMO Traffic Simulator. IEEE Access 2021, 9, 102985–102997.
45. Khan, N.; Elizondo, D.A.; Deka, L.; Molina-Cabello, M.A. Fuzzy Logic Applied to System Monitors. IEEE Access 2021, 9, 56523–56538.
46. Liu, B.; Ding, Z. A distributed deep reinforcement learning method for traffic light control. Neurocomputing 2022, 490, 390–399.
Figure 1. Reinforcement learning cycle [30].
Figure 2. (a) The prototype intersection. (b) The three-lane environment on a single arm is called N-3. (c) The three-lane environment on two arms is called E-3 S-3. (d) The three-lane environment on three arms is called S-3 N-3 W-3. (e) Left-hand traffic environment.
Figure 3. Ambulance detection distance in the ambulance prioritization system.
Figure 4. (a) Overview of traffic at the intersection. (b) Overview of the traffic situation on the southern side of the intersection, divided into cells. (c) Each car’s waiting time on the southern side of the intersection, divided into cells. (d) The stored state values for the southern side of the intersection.
Figure 5. The directions of the traffic signal phases.
Figure 6. Group of lanes.
Figure 7. (a) Traffic generation over a single episode based on vehicle categories. (b) Traffic generation over a single episode based on vehicle direction. (c) Traffic generation over a single episode based on intersection direction.
Figure 8. (a) The input fuzzy membership of the GP. (b) The input fuzzy membership of the RP.
Figure 9. (a) The output fuzzy membership of the green duration. (b) Examples of results obtained from the defuzzification process.
Figure 10. (a) Training result graph based on negative reward. (b) Training result graph based on average total waiting time. (c) Training result graph based on average total waiting time for vehicles to make a left turn. (d) Training result graph based on average total waiting time per car.
Table 1. The level of division is determined by waiting time.
Level | Waiting Time (s)
1 | 0
2 | 1–6
3 | 7–12
4 | 13–18
5 | 19–24
6 | 25–30
7 | 31–36
8 | 37–42
9 | 43–48
10 | More than 48
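As an illustration of how Table 1 can be applied, the following minimal sketch (not the authors' code) maps a vehicle's accumulated waiting time in seconds to its discrete level; the function name is hypothetical.

```python
def waiting_time_level(wait_s: float) -> int:
    """Map an accumulated waiting time (s) to the discrete level in Table 1."""
    if wait_s <= 0:
        return 1                       # level 1 covers exactly 0 s
    if wait_s > 48:
        return 10                      # level 10 covers everything above 48 s
    # Levels 2-9 cover successive 6 s bands: 1-6 s, 7-12 s, ..., 43-48 s.
    return 2 + int((wait_s - 1) // 6)

# Example: a car that has waited 14 s falls into level 4 (the 13-18 s band).
assert waiting_time_level(14) == 4
```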
Table 2. Necessary information for deep reinforcement learning-based traffic signal control systems.
Category | Necessary Information | Conditions | Cost
Data collection devices | Number and position of vehicles | Installation of sensors, cameras, or detection devices | Cost for installation, maintenance, and time required for data processing
Processing systems and servers | Data storage, processing, model training | Data management, data processing, setting up servers, and security | Cost for hardware, software, and time required for system development and management
Traffic signal control systems | Current signal, phase duration, and control | Integration with existing traffic control systems or new system installation | Cost for integration or installation and system updates
Table 3. Parameter setting of neural network.
Parameter | Value
Hidden layers | 2
Nodes in hidden layers | 100
Batch size | 64
Learning rate | 0.001
Training epochs | 500
Memory size | 50,000
Discount factor | 0.75
Number of states (Input) | 80
Number of actions (Output) | 4
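For concreteness, the sketch below instantiates a Q-network with the sizes and hyperparameters listed in Table 3 (80 state inputs, two hidden layers of 100 nodes, four phase actions). The choice of PyTorch, ReLU activations, and the Adam optimizer are assumptions for illustration; only the values shown in the table are taken from it.

```python
import torch
import torch.nn as nn

# Q-network dimensions from Table 3: 80 states in, 2 hidden layers of 100 nodes, 4 actions out.
q_network = nn.Sequential(
    nn.Linear(80, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 4),               # one Q-value per traffic-signal phase
)
optimizer = torch.optim.Adam(q_network.parameters(), lr=0.001)  # learning rate from Table 3

BATCH_SIZE = 64        # samples drawn from replay memory per update (Table 3)
MEMORY_SIZE = 50_000   # replay memory capacity (Table 3)
GAMMA = 0.75           # discount factor (Table 3)
EPOCHS = 500           # training epochs (Table 3)
```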
Table 4. Parameter setting of the environment.
Parameter | Value
Max steps (s) | 5400
Total cars generated | 1000
Total episodes | 500
Table 5. Description of membership function of GP.
Index | Membership Function | Description | Range
1 | vl | Very Low | 0–5
2 | l | Low | 0–10
3 | m | Medium | 5–15
4 | h | High | 10–20
5 | vh | Very High | 15 onwards
Table 6. Description of membership function of RP.
Index | Membership Function | Description | Range
1 | vl | Very Low | 0–10
2 | l | Low | 0–30
3 | m | Medium | 15–45
4 | h | High | 30–60
5 | vh | Very High | 45 onwards
Table 7. Fuzzy inference rules.
GP \ RP | vl | l | m | h | vh
vl | vvs | vvs | vvs | vvs | vvs
l | s | s | vs | vs | vs
m | m | m | m | s | s
h | l | l | m | m | s
vh | vvl | vl | l | m | m
Table 8. Description of membership function of green duration.
Index | Membership Function | Description | Range
1 | vvs | Very very short | 0–5
2 | vs | Very short | 0–10
3 | s | Short | 5–15
4 | m | Medium | 10–20
5 | l | Long | 15–25
6 | vl | Very long | 20–30
7 | vvl | Very very long | 25 onwards
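To show how Tables 5–8 fit together, the sketch below implements one Mamdani-style inference pass: triangular memberships whose breakpoints are assumed from the ranges in Tables 5 and 6, the rule matrix of Table 7 with min as the AND operator, and a weighted-average defuzzification over assumed centre values for the output terms of Table 8. It is an illustrative approximation, not the authors' implementation or the exact defuzzifier shown in Figure 9.

```python
def tri(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Assumed breakpoints consistent with the ranges listed in Tables 5 and 6.
GP_SETS = {"vl": (-1, 0, 5), "l": (0, 5, 10), "m": (5, 10, 15),
           "h": (10, 15, 20), "vh": (15, 20, 1e9)}
RP_SETS = {"vl": (-1, 0, 10), "l": (0, 15, 30), "m": (15, 30, 45),
           "h": (30, 45, 60), "vh": (45, 60, 1e9)}

# Assumed representative green durations (s) for the output terms of Table 8.
OUT_CENTER = {"vvs": 2.5, "vs": 5, "s": 10, "m": 15, "l": 20, "vl": 25, "vvl": 30}

# Rule matrix of Table 7: RULES[gp_term][rp_term] -> green-duration term.
RULES = {
    "vl": {"vl": "vvs", "l": "vvs", "m": "vvs", "h": "vvs", "vh": "vvs"},
    "l":  {"vl": "s",   "l": "s",   "m": "vs",  "h": "vs",  "vh": "vs"},
    "m":  {"vl": "m",   "l": "m",   "m": "m",   "h": "s",   "vh": "s"},
    "h":  {"vl": "l",   "l": "l",   "m": "m",   "h": "m",   "vh": "s"},
    "vh": {"vl": "vvl", "l": "vl",  "m": "l",   "h": "m",   "vh": "m"},
}

def green_duration(gp, rp):
    """Fire every rule (AND = min) and defuzzify by weighted average."""
    num = den = 0.0
    for g, gset in GP_SETS.items():
        for r, rset in RP_SETS.items():
            w = min(tri(gp, *gset), tri(rp, *rset))
            if w > 0.0:
                num += w * OUT_CENTER[RULES[g][r]]
                den += w
    return num / den if den else OUT_CENTER["m"]

# Example: a long green-phase queue (GP = 18) and a short red-phase queue (RP = 5)
# produce a long green duration, roughly 24 s with these assumed shapes.
print(round(green_duration(18, 5), 1))
```

Replacing the weighted average with a true centroid over the aggregated output sets would follow Figure 9 more closely; the weighted average is used here only to keep the sketch short.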
Table 9. The experimental results in the ALL-4 environment.
Metrics | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
WTAVG (s) | 43.57 | 16.14 | 13.16
WTLAVG (s) | 5.12 | 6.59 | 3.93
WTCAVG (s) | 9.46 | 4.71 | 3.71
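From the WTAVG row of Table 9, the relative reduction achieved by the proposed method works out to (16.14 − 13.16)/16.14 ≈ 18.46% with respect to the original DQN with fuzzy logic [29], and to (43.57 − 13.16)/43.57 ≈ 69.8% with respect to the traditional method.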
Table 10. The experimental results in the left-hand traffic environment.
Metrics | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
WTAVG (s) | 43.45 | 16.22 | 12.51
WTRAVG (s) | 4.94 | 6.52 | 3.53
WTCAVG (s) | 9.44 | 4.64 | 3.66
Table 11. The experimental results in a three-lane, one-arm environment.
Average total waiting time (s)
Lane | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
E-3 | 66.51 | 17.32 | 13.73
N-3 | 59.48 | 17.79 | 14.53
S-3 | 56.59 | 18.18 | 15.09
W-3 | 88.50 | 18.05 | 14.36
Average total waiting time for vehicles to make a left turn (s)
Lane | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
E-3 | 5.05 | 6.93 | 4.07
N-3 | 5.02 | 6.96 | 4.09
S-3 | 5.00 | 7.23 | 4.70
W-3 | 5.01 | 7.37 | 4.26
Average total waiting time per car (s)
Lane | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
E-3 | 11.07 | 4.84 | 3.85
N-3 | 10.61 | 4.82 | 3.90
S-3 | 10.60 | 4.90 | 3.96
W-3 | 11.73 | 4.89 | 3.91
Table 12. The experimental results in a three-lane, two-arm environment.
Average total waiting time (s)
Lane | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
E-3 S-3 | 75.85 | 19.53 | 16.75
E-3 W-3 | 104.60 | 19.93 | 14.97
N-3 S-3 | 72.83 | 19.81 | 15.85
N-3 W-3 | 100.23 | 21.57 | 16.17
N-3 E-3 | 83.61 | 19.39 | 16.03
S-3 W-3 | 94.58 | 19.57 | 15.68
Average total waiting time for vehicles to make a left turn (s)
Lane | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
E-3 S-3 | 5.10 | 8.29 | 4.88
E-3 W-3 | 5.11 | 8.38 | 4.42
N-3 S-3 | 5.14 | 8.30 | 4.54
N-3 W-3 | 5.06 | 9.21 | 4.76
N-3 E-3 | 5.03 | 8.22 | 4.43
S-3 W-3 | 5.02 | 8.07 | 4.62
Average total waiting time per car (s)
Lane | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
E-3 S-3 | 11.44 | 5.17 | 4.08
E-3 W-3 | 12.86 | 5.15 | 3.96
N-3 S-3 | 11.37 | 5.16 | 4.08
N-3 W-3 | 12.47 | 5.31 | 4.11
N-3 E-3 | 12.02 | 5.11 | 4.13
S-3 W-3 | 12.16 | 5.10 | 4.02
Table 13. The experimental results in three-lane, three-arm environments.
Average total waiting time (s)
Lane | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
E-3 S-3 W-3 | 122.02 | 22.48 | 17.37
E-3 S-3 N-3 | 95.74 | 21.09 | 17.86
E-3 N-3 W-3 | 118.49 | 22.38 | 16.90
S-3 N-3 W-3 | 113.16 | 21.20 | 17.51
Average total waiting time for vehicles to make a left turn (s)
Lane | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
E-3 S-3 W-3 | 5.06 | 9.63 | 4.80
E-3 S-3 N-3 | 5.05 | 8.70 | 4.69
E-3 N-3 W-3 | 5.02 | 9.11 | 4.71
S-3 N-3 W-3 | 5.00 | 9.21 | 4.67
Average total waiting time per car (s)
Lane | Traditional | Original DQN with Fuzzy Logic [29] | Proposed Method
E-3 S-3 W-3 | 13.39 | 5.40 | 4.25
E-3 S-3 N-3 | 12.41 | 5.28 | 4.28
E-3 N-3 W-3 | 13.46 | 5.38 | 4.17
S-3 N-3 W-3 | 12.94 | 5.27 | 4.22
Table 14. The experimental results of the ambulance prioritization system with different safety durations and detection distances.
Average total waiting time (s)
Safety Duration (s) \ Detection Distance (m) | 150 | 200 | 250 | 300
3 | 12.87 | 12.98 | 13.14 | 13.53
5 | 14.17 | 13.23 | 13.48 | 13.92
Total ambulance waiting time (s)
Safety Duration (s) \ Detection Distance (m) | 150 | 200 | 250 | 300
3 | 0.24 | 0.24 | 0.00 | 0.00
5 | 2.96 | 1.20 | 0.20 | 0.16
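Table 14 indicates that a 3 s safety duration with a detection distance of 250 m or more eliminates ambulance waiting while keeping the overall average waiting time close to 13 s. A minimal SUMO/TraCI sketch of such an override is shown below; the vehicle type name "ambulance" and the phase_for_link mapping are hypothetical, and the sketch illustrates the idea rather than the authors' implementation.

```python
from typing import Dict, Optional

import traci  # SUMO's TraCI Python client

DETECTION_DISTANCE = 250.0  # metres (a well-performing setting in Table 14)
SAFETY_DURATION = 3.0       # seconds of transition before forcing the new phase

def ambulance_override(tls_id: str, phase_for_link: Dict[int, int]) -> Optional[int]:
    """Return the phase index that serves an approaching ambulance, or None.

    phase_for_link maps a traffic-light link index to the green phase serving it
    (a hypothetical lookup table defined for the simulated intersection).
    """
    for veh_id in traci.vehicle.getIDList():
        if traci.vehicle.getTypeID(veh_id) != "ambulance":
            continue
        # getNextTLS returns (tlsID, linkIndex, distance, state) for upcoming signals.
        for next_tls, link_index, distance, _state in traci.vehicle.getNextTLS(veh_id):
            if next_tls == tls_id and distance <= DETECTION_DISTANCE:
                return phase_for_link.get(link_index)
    return None
```

In the control loop, a non-None return value would pre-empt the agent's chosen action: the controller would first run a transition of SAFETY_DURATION seconds and then hold the returned phase until the ambulance has cleared the junction.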
Table 15. The experimental results with and without the ambulance prioritization system.
With the ambulance prioritization system
Metrics | Original DQN with Fuzzy Logic [29] | Proposed Method
WTAVG (s) | 17.01 | 13.14
WTATotal (s) | 0.04 | 0.00
Without the ambulance prioritization system
Metrics | Original DQN with Fuzzy Logic [29] | Proposed Method
WTAVG (s) | 16.14 | 13.16
WTATotal (s) | 65.24 | 67.60
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
