Article

Carbon Dioxide Emission Reduction-Oriented Optimal Control of Traffic Signals in Mixed Traffic Flow Based on Deep Reinforcement Learning

College of Automobile and Traffic Engineering, Nanjing Forestry University, Nanjing 210037, China
*
Author to whom correspondence should be addressed.
Sustainability 2023, 15(24), 16564; https://doi.org/10.3390/su152416564
Submission received: 17 October 2023 / Revised: 1 December 2023 / Accepted: 3 December 2023 / Published: 5 December 2023

Abstract

To alleviate traffic congestion and reduce carbon emissions at intersections, research on exploiting reinforcement learning for intersection signal control has become a frontier topic in the field of intelligent transportation. This study utilizes a deep reinforcement learning algorithm based on the D3QN (dueling double deep Q network) to achieve adaptive control of signal timings. Under a mixed traffic environment with connected and automated vehicles (CAVs) and human-driven vehicles (HDVs), this study constructs a reward function (Reward—CO2 Reduction) to minimize vehicle waiting time and carbon dioxide emissions at the intersection. Additionally, to account for the spatiotemporal distribution characteristics of traffic flow, an adaptive-phase action space and a fixed-phase action space are designed to optimize action selection. The proposed algorithm is validated in SUMO simulations with different traffic volumes and CAV penetration rates, and the experimental results are compared with other control strategies such as Webster's method (fixed-time control). The analysis shows that the proposed model can effectively reduce carbon dioxide emissions at low and medium traffic volumes. As the penetration rate of CAVs increases, the average carbon dioxide emissions and waiting time can be further reduced with the proposed model. The significance of this study is twofold: it presents a flexible strategy that lowers carbon dioxide emissions while enhancing traffic efficiency, providing a tangible example of the advancement of green intelligent transportation systems.

1. Introduction

With the rapid development of the global economy, urban transportation systems have progressed extensively, and the urban road network, a critical component of the urban transportation system, has been gradually optimized. Meanwhile, the proliferation of automobiles, coupled with the coexistence of conventional human-driven vehicles (HDVs) and connected and automated vehicles (CAVs), has intensified urban traffic congestion, consequently exacerbating environmental issues such as air pollution and carbon emissions. Intersections, as pivotal nodes where roads converge, play an important role in determining or constraining the traffic capacity of the entire urban road network. To mitigate traffic congestion and improve traffic efficiency, an increasing number of urban areas are resorting to signal control systems for intersection traffic management. Traditional traffic signal control systems are typically categorized into three types: fixed signal timing control systems, actuated signal control systems [1,2], and adaptive signal control systems. Currently, numerous urban centers have extensively adopted control schemes such as SCATS [3], RHODES [4], and SCOOT [5], which, in essence, remain deterministic and present various limitations, such as inadequate adaptability, predictability, and real-time responsiveness, as well as an incapacity to effectively address problems such as traffic congestion and carbon emissions.
In recent years, carbon emission reduction has become one of the global environmental issues that has received widespread attention. Increasingly, scholars are considering carbon reduction in their studies on intersection signal optimization. Coelho et al. [6] investigated the effects of traffic signal controls on emissions and traffic behavior. Their findings highlight a trade-off: while signal controls reduce speeding, they can increase stationary emissions. However, speed signal controls can mitigate overall pollutant emissions. Yao et al. [7] refined emission coefficients for green- and red-light periods using a vehicle-specific power (VSP) model. By factoring in turning movements and varied traffic flows, they optimized signal control on major roads, targeting reduced vehicle delays and emissions. Yao et al. [8] presented a novel hierarchical optimization model for traffic signal and vehicle trajectory coordination in mixed traffic of automated and human-driven vehicles. The model leverages an innovative combination of model predictive control for trajectory planning and dynamic programming for traffic signal timing, aiming for fuel efficiency. Simulations suggest reductions in fuel consumption and emissions, but further validation in varied traffic situations is needed. Chen and Yuan [9] presented a novel traffic signal optimization approach, integrating macroscopic traffic and emission models. Employing a genetic algorithm with emissions as the focal objective, they tested their method on two urban intersections in Xi’an. While simulations showed promising reductions in travel time and emissions, the model’s scalability to broader urban settings remains to be explored. Lin et al. [10] proposed a traffic signal optimization model using fuzzy control and a differential evolution algorithm, targeting capacity maximization and delay minimization. Innovatively, the study automates the optimization of fuzzy membership functions and rules and introduces a coordination method for adjacent intersection signals based on traffic flow analysis and fuzzy reasoning, confirming effectiveness via simulations. Xiao et al. [11] introduced a lifecycle analysis of public bikes, highlighting that strategic turnover rates significantly optimize carbon emission reductions, a novel approach in green transportation studies. Public transportation, being a crucial component of the overall transportation system, should also be integrated into the optimization of intersection signal control [12]. He et al. [13] proposed an innovative adaptive control algorithm for presignals at intersections, improving bus priority without obstructing regular traffic flow. Unlike static models, this dynamic approach adjusts in real time to changing traffic demands, optimizing the presignal strategy for different urban scenarios. However, existing studies predominantly rely on emission models designed for fuel vehicles to calculate CO2 emissions, thereby overlooking the CO2 emissions of electric vehicles at intersections.
Meanwhile, artificial intelligence technology has been developing rapidly. Deep reinforcement learning (DRL), which combines reinforcement learning (RL) and deep learning (DL), has emerged as a promising technique widely applied in the field of traffic control. DRL has been increasingly recognized for its potential in discerning the most effective strategies by engaging directly with complex traffic environments [14,15,16]. Its interactive learning process enables it to adapt and evolve within these scenarios, providing innovative solutions for traffic management and control. Arel et al. [17] were among the first to undertake relevant work in optimizing signal control at intersections using feedforward neural networks to approximate the value function of the Q-learning algorithm. However, due to the lack of an experience replay mechanism and a target network, two essential components of the DQN, it is not a complete DQN algorithm. Genders and Razavi [18] have previously used the DQN algorithm to study traffic signal control optimization, and proposed discrete traffic state encoding (DTSE) to characterize the vehicle position state, velocity state, and current traffic signal phase at intersections, adopting a convolutional neural network to approximate the Q value of discrete actions. Ma et al. [19] developed a deep actor–critic algorithm, which combines time-series images of intersections as input states with the actor–critic model, avoiding the drawbacks of value-based and policy-based control algorithms. Li et al. [20] unveiled the KS-DDPG algorithm for regional traffic management. Using a unique knowledge sharing protocol, agents exchange traffic data for coordinated multi-intersection control. Its robustness in dynamic scenarios is yet to be fully explored. Lu et al. [21] introduced the 3DRQN algorithm, an enhanced DQN variant employing dueling and double Q networks for faster convergence and improved control. By integrating an LSTM network, the algorithm’s reliance on current intersection traffic data is diminished, bolstering its robustness. Kim and Sohn [22] leveraged graph neural networks to craft the DGQN algorithm for optimal large-scale traffic signal control. While their asynchronous update method accelerates convergence, the algorithm’s adaptability in rapidly changing traffic dynamics remains a challenge. Zhu et al. [23] have innovated traffic signal control with DRL by extracting interpretable decision trees, enabling clarity on how strategies are derived. They successfully minimized complexity in decision-making without compromising performance. However, they acknowledge the inherent complexity risks in decision trees, a factor that may limit real-world applicability in more diverse traffic conditions. Yan et al. [24] proposed the graph cooperation Q-learning network traffic signal control (GCQN-TSC) model to achieve more efficient traffic signal coordination in multiple intersections in large road networks. The model features an embedded self-attention mechanism that enables agents to adjust their attention in real time based on dynamic traffic flow information, facilitating more efficient cooperation between agents. Chen et al. [25] developed the AWA-DDQN algorithm for traffic signal optimization, which dynamically tweaks parameters based on real-world interactions. While its adaptability is novel, its applicability in interconnected multi-intersection settings is yet to be tested and the algorithm’s efficiency needs to be improved. 
While optimizing the efficiency of intersection traffic flow, some scholars have also taken the reduction in carbon dioxide emissions within the intersection area as an evaluation criterion for optimizing intersection signal control [26,27,28]. Research has shown that intersection signal optimization schemes based on deep reinforcement learning have a certain effect on reducing CO2 emissions.
In light of the above, despite considerable scholarly effort devoted to signal control optimization methods aimed at CO2 emission reduction, as well as to the use of deep reinforcement learning for signal control optimization, there is still a lack of signalized intersection control methods that are based on deep reinforcement learning and oriented towards CO2 emission reduction. Such methods could effectively reduce CO2 emissions in urban traffic and promote the sustainable development of urban transportation. Meanwhile, previous studies on deep reinforcement learning-based intersection signal control optimization have focused more on algorithmic innovation while neglecting the design of reward functions. Sutton et al. [29] posited that the reward signal conveys to the agent the objectives one desires to achieve rather than prescribing the methods of their attainment. This indicates that the appropriateness of the reward function design is pivotal in determining whether the agent can learn the correct strategies. Booth et al. [30] observed that experts in the field of deep reinforcement learning often prioritize the optimization of the reward function before refining other aspects of reinforcement learning design. It is evident that the reward function is considered foundational to reinforcement learning and is accorded substantial emphasis. Concurrently, reward shaping stands as a prominent research direction within reinforcement learning, where experts and scholars dedicate efforts to adapting and fine-tuning reward functions, thus enabling agents to perform diverse tasks efficiently [31,32,33].
In summary, the reward function plays a crucial role in enabling the agent to learn the correct strategies in deep reinforcement learning.
Therefore, this study focuses on reducing CO2 emissions and improving intersection efficiency in mixed traffic of human-driven vehicles and connected and automated vehicles. The specific contributions of this study are as follows:
(1)
A reward function is proposed, which is guided by CO2 reduction, and CO2 emissions are calculated based on the instantaneous fuel consumption of fuel vehicles and the instantaneous energy consumption of electric vehicles.
(2)
Two sets of distinct action spaces are designed to adapt to spatiotemporal differences at various intersections, which facilitates the achievement of multiobjective optimization of CO2 reduction and intersection efficiency improvement.
(3)
The dueling double deep Q network (D3QN) algorithm is employed to optimize intersection signal control. Incorporating vehicle acceleration data into the state space modeling of traffic signal control significantly enhances the learning performance of the reward function.

2. Problem Statement

The optimization of traffic signal control at intersections based on deep reinforcement learning is illustrated in Figure 1. This section provides a concise description of the problem of traffic signal control at intersections based on deep reinforcement learning, and models the state space, action space, and reward function settings of the intersection.

2.1. Problem Description and Assumptions

2.1.1. Problem Description

The intersection model established in this paper consists of six lanes in the north–south direction and four lanes in the east–west direction. The north–south direction has one right-turn lane, one left-turn lane, and four through lanes, while the east–west direction has one right-turn lane, one left-turn lane, and two through lanes, as shown in Figure 2. The signal plan is designed to account for the temporal and spatial differences that exist among real-world intersections, such as different time periods at the same intersection and different geographical locations of intersections. To address these differences, two sets of action plans are proposed, namely fixed-phase and adaptive-phase. The yellow light time during phase switching is set to 3 s to clear vehicles from the intersection and ensure traffic safety.
Traffic signal control can be viewed as a Markov decision process (MDP). A classical MDP consists of a state space (S), an action space (A), state transition probabilities (P), and a reward function (R). In the context of traffic signal control at intersections, these components are defined as follows (a minimal interface sketch is given after the list):
(1)
State space—S: The state space of an intersection comprises the factors that form its environment, typically encompassing vehicle status and traffic signal phase information, such as the location, speed, turning direction, and queue length of vehicles, as well as the duration and phase of traffic signals.
(2)
Action space—A: In the optimization of traffic signal control, the traffic light is considered as an agent that makes action decisions, namely, switching the signal phase, based on the observed state of the intersection.
(3)
State transition probability—P: The transition of the intersection state refers to the process in which the signal controller, acting as the agent, observes the intersection state $s_t$ at time $t$, takes action $a_t$, and causes the intersection state to transition to $s_{t+1}$ at time $t+1$. The probability of this state transition can be expressed as $p(s_{t+1} \mid s_t, a_t)$, where $s_t$ and $a_t$ represent the intersection state and action at time $t$, respectively, and $s_{t+1}$ represents the intersection state at time $t+1$.
(4)
Reward function—R: The reward function in intersection signal optimization is designed to assist the signal light, or the agent, in learning the control policy. Every time the agent executes an action, the environment returns a reward value as feedback to evaluate the effectiveness of the action. Typical reward functions in intersection signal optimization include average queue length, difference in cumulative waiting time between adjacent time steps, difference in cumulative queue length between adjacent time steps, and traffic pressure.
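To make the MDP formulation concrete, the following minimal Python sketch shows how the four components above map onto an interaction loop. It is an illustration only, not the paper's implementation; `TrafficEnv`, `agent`, and the method names are hypothetical.

```python
class TrafficEnv:
    """Hypothetical intersection environment exposing the MDP interface (S, A, P, R)."""

    def observe(self):
        """Return state s_t: position/speed/acceleration matrices plus phase info."""
        raise NotImplementedError

    def step(self, action):
        """Apply a signal phase a_t; the simulator realizes the transition
        p(s_{t+1} | s_t, a_t) and returns (s_{t+1}, r_t, done)."""
        raise NotImplementedError

def run_episode(env, agent, max_steps):
    """One interaction episode between the signal agent and the intersection."""
    s = env.observe()
    for _ in range(max_steps):
        a = agent.select_action(s)        # phase decision from the current state
        s_next, r, done = env.step(a)     # environment feeds back reward r_t
        agent.remember(s, a, r, s_next, done)
        s = s_next
        if done:
            break
```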

2.1.2. Assumptions

To underpin the rigor of our model, we delineate the following foundational assumptions:
(1)
Vehicles are categorized into four types based on their driving mechanism and powertrain: human-driven fuel vehicles (HDFV), human-driven electric vehicles (HDEV), connected and automated fuel vehicles (CAFV), and connected and automated electric vehicles (CAEV).
(2)
Based on the 2022 statistics [34] presented in Table 1, and guided by the literature [35,36], this study predicts a rising market share for electric automated vehicles in China, while other vehicle categories are expected to decline.
(3)
V2X communication is assumed to exhibit inherent reliability, eliminating concerns about congestion, latency, or data attrition.
(4)
HDV drivers consistently exhibit driving behaviors that are in strict adherence to prevailing traffic regulations.

2.2. State Space Definition

The vehicle information includes the position, speed, acceleration, and deceleration of the vehicles on the entrance lanes. DTSE discretization is adopted to express the intersection state space. Research has shown that the discrete encoding of the traffic state achieved by DTSE does not differ significantly in optimization effectiveness from high-resolution state input [37]. In this study, the section within 200 m of each entrance lane's stop line is considered as the state space. The state space is composed of position, speed, and acceleration matrices, with each lane divided into 40 cells of 5 m. The position matrix consists of 0 and 1 elements, where 1 indicates the presence of a vehicle and 0 indicates no vehicle. The min–max normalization method is used to process the above data, with the specific formulas as follows:
(1)
Vehicle speed normalized
$$V_{nor} = \frac{V - V_{min}}{V_{max} - V_{min}}$$
In the equation, $V_{nor}$ denotes the normalized speed, with values confined to the $[0, 1]$ interval. $V_{min}$ signifies the minimum vehicle speed, which is set at $0\ \mathrm{m/s}$. $V_{max}$ represents the maximum vehicle speed, which is determined by the intersection speed limit and the speed factor, as calculated by the formula below:
$$V_{max} = V_{lim} \times speed\ factor$$
Typically, urban intersection speed limits fall within the range of 40 to 60 km/h. In this study, the speed limit $V_{lim}$ is set at $50\ \mathrm{km/h}$. Because some drivers may exceed the limit, this paper introduces a speed factor of 1.2, meaning that vehicles may traverse the signalized intersection at up to 1.2 times the posted lane speed limit.
(2)
Vehicle acceleration normalized
$$acc_{nor} = \frac{acc - acc_{min}}{acc_{max} - acc_{min}}$$
where $acc_{nor}$ represents the normalized acceleration, with values confined to the $[0, 1]$ interval. $acc_{min}$ and $acc_{max}$ represent the minimum and maximum accelerations, with respective values of $0\ \mathrm{m/s^2}$ and $2.6\ \mathrm{m/s^2}$.
(3)
Vehicle deceleration normalized
$$dec_{nor} = \frac{dec - dec_{min}}{dec_{max} - dec_{min}}$$
where $dec_{nor}$ represents the normalized deceleration, with values confined to the $[0, 1]$ interval. $dec_{min}$ and $dec_{max}$ represent the minimum and maximum decelerations, with respective values of $0\ \mathrm{m/s^2}$ and $9.0\ \mathrm{m/s^2}$.
The normalized values are then filled into the corresponding speed and acceleration matrices based on the vehicle's position. As a result, the state space matrix size is $3 \times 20 \times 40$.
Figure 3 illustrates the intersection entrance state space at time $t$, where the state space is composed of the entrance lanes in the four directions. Taking the north entrance lane as an example, the position matrix registers vehicles at $0\ \mathrm{m}$, $15\ \mathrm{m}$, and $25\ \mathrm{m}$ from the stop line, corresponding to vehicles A, B, and C. The velocities of these three vehicles are $0\ \mathrm{m/s}$, $3\ \mathrm{m/s}$, and $5\ \mathrm{m/s}$, respectively. Following normalization with respect to the intersection speed limit and speed factor, these velocities become 0, 0.18, and 0.30. The accelerations of the vehicles are $0\ \mathrm{m/s^2}$, $1\ \mathrm{m/s^2}$, and $0.20\ \mathrm{m/s^2}$; upon normalization, accounting for the maximum acceleration and deceleration, these values become 0, 0.11, and 0.08, respectively. The information pertaining to vehicle speed, acceleration, and deceleration is subsequently inserted into the appropriate speed and acceleration matrices based on the vehicle's location in the position matrix.
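For illustration, the DTSE encoding described above can be sketched in Python as follows. This is a minimal reconstruction under stated assumptions: in particular, the text does not specify how accelerations and decelerations share the third channel, so decelerations are stored here with a negative sign.

```python
import numpy as np

N_LANES, N_CELLS, CELL_LEN = 20, 40, 5.0       # 200 m detection zone per lane
V_MAX = (50 / 3.6) * 1.2                       # 50 km/h limit x 1.2 speed factor (m/s)
ACC_MAX, DEC_MAX = 2.6, 9.0                    # m/s^2

def encode_state(vehicles):
    """vehicles: iterable of (lane_idx, dist_to_stopline_m, speed_mps, accel_mps2)."""
    state = np.zeros((3, N_LANES, N_CELLS), dtype=np.float32)
    for lane, dist, v, acc in vehicles:
        if dist >= N_CELLS * CELL_LEN:
            continue                           # outside the 200 m zone
        cell = int(dist // CELL_LEN)
        state[0, lane, cell] = 1.0             # position channel: occupancy
        state[1, lane, cell] = v / V_MAX       # speed channel, min-max normalized
        if acc >= 0:
            state[2, lane, cell] = acc / ACC_MAX   # acceleration normalized by 2.6
        else:
            state[2, lane, cell] = acc / DEC_MAX   # deceleration normalized by 9.0
    return state

# Figure 3 example (north approach): a vehicle at 15 m moving at 3 m/s
# gets the speed value 3 / 16.67 = 0.18, matching the text.
```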

2.3. Action Space Modeling

As the agent for action decision making, the traffic signal selects the optimal action to execute from all possible action plans based on the observed current intersection state. Considering the spatiotemporal heterogeneity of actual traffic flow at intersections, two sets of action spaces (a fixed-phase sequence and an adaptive-phase sequence) are established. Section 4 analyzes the application scenarios of these two action spaces.
(1)
Fixed-phase sequence
The fixed-phase action space $A = \{NSG, NSLG, EWG, EWLG\}$ for intersection control is defined in this paper, where $NSG$ denotes the north–south through movement, $NSLG$ the north–south left turn, $EWG$ the east–west through movement, and $EWLG$ the east–west left turn. Right-turning vehicles are not regulated by the traffic signal and may turn whenever there is no conflict with other vehicles. The schematic diagram of the fixed-phase action space is depicted in Figure 4.
(2)
Adaptive-phase sequence
Compared with the fixed-phase sequence, the adaptive-phase sequence involves a more intricate action space. Nonetheless, the agent's decision-making process becomes more flexible, allowing it to select the optimal action based on the intersection's state. The action space of the adaptive-phase sequence, $A = \{NG, SG, NSG, NSLG, EG, WG, EWG, EWLG\}$, comprises $NG$ for northbound through and left turns, $SG$ for southbound through and left turns, $NSG$ for north–south through movements, $NSLG$ for north–south left turns, $EG$ for eastbound through and left turns, $WG$ for westbound through and left turns, $EWG$ for east–west through movements, and $EWLG$ for east–west left turns. Likewise, right-turning vehicles are permitted to turn only if they do not impede the passage of through or left-turning vehicles. The adaptive-phase action space is illustrated in Figure 5, and both action spaces are written out in the short sketch below.
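As referenced above, the two action spaces can be listed directly; the phase labels follow the definitions in this section, and the list lengths (4 and 8) correspond to the output dimension of the dueling network head described in Section 3.2.

```python
# The two action spaces as plain Python lists (a sketch; labels follow the text).
FIXED_PHASE_ACTIONS = ["NSG", "NSLG", "EWG", "EWLG"]                 # 4 phases
ADAPTIVE_PHASE_ACTIONS = ["NG", "SG", "NSG", "NSLG",
                          "EG", "WG", "EWG", "EWLG"]                 # 8 phases
```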

2.4. Reward Function Settings

The objective of deep reinforcement learning is to maximize the cumulative reward, where an agent learns to make optimal decisions based on the rewards obtained from executing actions. In this context, a traffic signal receives a reward $r_t$ from the environment after taking an action $a_t$ based on the intersection state $s_t$. The reward evaluates the quality of the signal's action. Taking the reduction in carbon emissions and the improvement in intersection efficiency as our objectives, we define the reward function as follows:
$$R = k_1 \left( E_t - b_1 E_{t-1} \right) + k_2 \left( D_t - b_2 D_{t-1} \right)$$
where $E_t$ represents the cumulative CO2 emissions of the intersection entrances at time $t$, $E_{t-1}$ the cumulative CO2 emissions at time $t-1$, $D_t$ the cumulative waiting time at the intersection entrances at time $t$, and $D_{t-1}$ the cumulative waiting time at time $t-1$. After multiple experiments, the values of $k_1$ and $k_2$ were determined to be $-2$ and $-1$, respectively, while $b_1$ and $b_2$ were both set to 0.9.
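A direct translation of this reward into code is straightforward; the sketch below uses the coefficient values reported above and is offered only as an illustration of the computation:

```python
# Sketch of the Reward-CO2 Reduction function with the coefficients given above.
K1, K2 = -2.0, -1.0      # weights on emissions and waiting time (both negative)
B1, B2 = 0.9, 0.9        # reduction factors applied to the previous time step

def reward_co2_reduction(E_t, E_prev, D_t, D_prev):
    """E: cumulative CO2 emissions (g); D: cumulative waiting time (s)."""
    return K1 * (E_t - B1 * E_prev) + K2 * (D_t - B2 * D_prev)
```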
To achieve real-time adjustment of intersection signal phases using deep reinforcement learning and make decisions based on instantaneous CO2 emissions more accurately, an instantaneous emission model should be adopted for vehicle emissions modeling. Considering the current level of road infrastructure construction, roadside perception devices can obtain the instantaneous speed and acceleration of vehicles, and thus, the instantaneous fuel consumption of fuel vehicles and the instantaneous energy consumption of electric vehicles are calculated from the perspectives of vehicle-specific power and electric energy conversion, respectively.
(1)
Instantaneous CO2 Emissions from Fuel Vehicles
The instantaneous fuel consumption of fuel vehicles is calculated using the vehicle-specific power method, which offers the advantages of relative simplicity and broad applicability. As reported in the literature [38], the simplified formula for the vehicle-specific power calculation is expressed as follows:
$$VSP = v \left( 1.1a + 9.81 \cdot grade + 0.132 \right) + 0.000302 v^3$$
where $v$ represents the instantaneous velocity of the vehicle, $a$ the instantaneous acceleration, and $grade$ the road gradient. Since the research scene is an urban intersection, $grade = 0$, and the formula simplifies to:
$$VSP = v \left( 1.1a + 0.132 \right) + 0.000302 v^3$$
The CO2 emissions of the fuel vehicles are determined based on the corresponding CO2 emission rates for different vehicle-specific power intervals, as provided in reference [39]. The vehicle-specific power intervals and the vehicle’s corresponding CO2 emission rate are presented in Appendix A Table A1. This information is utilized to calculate the instantaneous emissions of HDVs.
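The fuel-vehicle emission estimate can be sketched as follows. The VSP formula is the simplified form given above; the bin edges and CO2 rates in the lookup table are illustrative placeholders only, since the actual operating-mode table is given in Appendix A, Table A1.

```python
def vsp(v, a, grade=0.0):
    """Simplified vehicle-specific power (kW/t); grade = 0 at urban intersections."""
    return v * (1.1 * a + 9.81 * grade + 0.132) + 0.000302 * v ** 3

# (upper VSP bound, CO2 rate in g/s): placeholder values, not those of Table A1
VSP_BINS = [(-2.0, 1.5), (0.0, 1.0), (4.0, 2.2), (10.0, 3.8), (float("inf"), 5.5)]

def co2_rate_fuel(v, a):
    """Instantaneous CO2 emission rate (g/s) of one fuel vehicle."""
    p = vsp(v, a)
    for upper, rate in VSP_BINS:
        if p <= upper:
            return rate
```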
(2)
Instantaneous CO2 Emissions from Electric Vehicles
The instantaneous emissions of electric vehicles are determined by converting their energy consumption into CO2 emissions based on the CO2 emission factor of the national power grid, according to reference [40]. The instantaneous energy consumption model for electric vehicles is specified as follows.
The power demand of an electric vehicle comprises the power of the vehicle's propulsion system and the power of its auxiliary components. The propulsion power is related to the vehicle's instantaneous speed $v$ and the traction force $F_t$. Therefore, the output power of the vehicle's battery can be expressed as follows:
$$P_{out} = \frac{F_t v}{\eta_{pow}} + P_0, \quad F_t > 0$$
where $\eta_{pow}$ is the energy efficiency of the vehicle powertrain and $P_0$ is the power consumed by vehicle accessories.
Unlike conventional fuel vehicles, electric vehicles require consideration of kinetic energy recovery. Specifically, when the vehicle traction force $F_t < 0$ (i.e., during braking), the charging power of the battery is given by:
$$P_{in} = k \eta_{pow} F_t v + P_0, \quad F_t < 0$$
where $k$ is the energy recovery efficiency of the vehicle, which is determined as follows, according to the literature [41]:
$$k = \begin{cases} 0.5 \times \dfrac{v}{5}, & v < 5\ \mathrm{m/s} \\ 0.5 + 0.3 \times \dfrac{v - 5}{20}, & v \geq 5\ \mathrm{m/s} \end{cases}$$
In summary, the instantaneous power of an electric vehicle battery is as follows:
$$P = \begin{cases} P_{out}, & F_t > 0 \\ P_{in}, & F_t < 0 \end{cases}$$
The instantaneous energy consumption of electric vehicles at the intersection entrance can be determined as follows:
$$W_e = \sum_{i=1}^{n} p_i$$
In the formula, $p_i$ is the instantaneous energy consumption of the $i$-th electric vehicle at time $t$, $n$ is the total number of electric vehicles at the intersection entrances at time $t$, and $W_e$ is the cumulative instantaneous energy consumption of electric vehicles at the intersection entrances at time $t$.
Considering the energy loss of electric vehicles during the charging process, the electricity consumed from the national power grid to charge the electric vehicles on the entrance lanes of the intersection at time $t$ is calculated as follows:
$$W_t = \frac{W_e}{\eta_{gb}}$$
where $\eta_{gb}$ is the grid-to-battery charging efficiency (accounting for charging losses) and $W_t$ is the actual power drawn from the national power grid at time $t$.
The actual electricity consumption from the power grid can be combined with the CO2 emission factor of the national power grid published in the same year to obtain the instantaneous CO2 emissions of electric vehicles at the intersection entrances at time $t$. According to the document issued by the Environmental Protection Bureau [42], the CO2 emission factor is determined as $0.5703\ \mathrm{t\ CO_2/MWh}$.
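The full EV chain, from battery power through grid losses to CO2, can be sketched as below. The parameters $\eta_{pow}$, $P_0$, and $\eta_{gb}$ are not assigned values in this section, so the defaults here are placeholders for illustration; $\eta_{gb}$ is treated as a charging efficiency.

```python
GRID_FACTOR_G_PER_WH = 0.5703   # 0.5703 t CO2/MWh is numerically 0.5703 g CO2/Wh

def recovery_coeff(v):
    """Kinetic-energy recovery efficiency k as a function of speed v (m/s)."""
    return 0.5 * v / 5.0 if v < 5.0 else 0.5 + 0.3 * (v - 5.0) / 20.0

def battery_power(F_t, v, eta_pow=0.90, P0=500.0):
    """Instantaneous battery power (W); eta_pow and P0 values are placeholders."""
    if F_t > 0:                                        # traction: battery discharges
        return F_t * v / eta_pow + P0
    return recovery_coeff(v) * eta_pow * F_t * v + P0  # braking: energy recovery

def co2_rate_ev(battery_powers_w, eta_gb=0.90):
    """CO2 rate (g/s) for all EVs on the approaches via the grid emission factor."""
    W_e = sum(battery_powers_w)          # cumulative instantaneous power (W)
    W_t = W_e / eta_gb                   # grid-side power after charging losses
    return (W_t / 3600.0) * GRID_FACTOR_G_PER_WH   # W -> Wh per second -> g CO2/s
```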

3. Methodology

After a thorough comparison of the DQN [43], D3QN, and the discretized ASAC (adaptive soft actor–critic) algorithms [44], this study ultimately decided to employ the D3QN algorithm based on its convergence rate and optimization effectiveness. As demonstrated in Figure 6, this study established a two-hour peak traffic period, utilizing a Weibull distribution for vehicle generation, to assess the intelligent agent’s decision-making capabilities under varying traffic volumes. According to the results shown in Figure 7, throughout 300 iterations of training, the D3QN algorithm exhibited the strongest convergence and resilience against traffic volume fluctuations. This algorithm is developed by incorporating the dueling DQN [45] and double DQN [46] into the traditional DQN algorithm, addressing the problem of the traditional DQN’s complete reliance on action value for state prediction and the overestimation issue in value estimation. Compared with the discretized ASAC algorithm, the D3QN further enhances decision optimization for the agent, thereby improving the optimization effectiveness for signalized intersections.

3.1. D3QN Algorithm

The D3QN algorithm is a form of deep reinforcement learning that builds upon the conventional DQN algorithm, incorporating the advantages of both the dueling DQN and double DQN methods.
The main improvement of the dueling DQN lies in its network structure, which allows for a better modeling of the state-value function and thus enhances the algorithm’s performance. The network structure of the dueling DQN divides the neural network into two branches: one for the value function and the other for the advantage function. The value function branch is used to estimate the state-value function, while the advantage function branch is used to estimate the advantage value of each action relative to the mean action. In this way, the dueling DQN can separately estimate the value of each action, rather than just estimating the value of the entire state. The action value function of the dueling DQN is shown below:
$$Q(s, a; \theta) = V(s; \theta) + \left( A(s, a; \theta) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta) \right)$$
The equation defines the value function $V(s; \theta)$ for state $s$ under network parameters $\theta$, and the advantage function $A(s, a; \theta)$ based on the state–action pair, which predicts the relative significance of each available action given the current state $s$.
Moreover, the D3QN algorithm employs the double DQN technique to alleviate the problem of Q-value overestimation in the DQN algorithm, thereby improving the stability of the algorithm. Specifically, the double DQN uses two neural networks, a main Q network $Q(s, a; \theta)$ and a target Q network $Q_{target}(s, a; \theta')$, to estimate the action-value function. The main Q network is used for action selection, while the target Q network is used to estimate the target Q value. Selecting the action with the main Q network and evaluating it with the target Q network effectively alleviates the overestimation of the target Q value caused by the maximization step in computing the optimal action value. Compared with the traditional DQN algorithm, signal lights trained with the D3QN can make reasonable decisions in complex intersection environments and are better suited to the multiobjective optimization task of this study, namely reducing cumulative CO2 emissions while reducing the average waiting time.
The signal control algorithm based on the D3QN comprises a main Q network $Q(s, a; \theta)$ and introduces a target Q network $Q_{target}(s, a; \theta')$.
(1)
Main Q network parameters update
The update strategy for the main Q network in the D3QN algorithm is similar to that of the DQN, with the incorporation of a target Q network designed to mitigate the overestimation issue during training. The main Q network is denoted by $Q(s, a; \theta)$, where $\theta$ represents the corresponding neural network parameters, and the target Q network by $Q_{target}(s, a; \theta')$, where $\theta'$ denotes the respective neural network parameters. The main Q network is updated via minibatch gradient descent on the temporal-difference (TD) error, and the specific update process can be described as follows:
(1.1)
The sample $(s_t, a_t, r_t, s_{t+1}, d)$ is extracted from the experience pool, where $s_t$ denotes the state, $a_t$ the action taken based on state $s_t$, $r_t$ the reward provided by the environment for taking action $a_t$, $s_{t+1}$ the new state resulting from the state transition after taking action $a_t$, and $d$ whether or not the final state has been reached.
(1.2)
The double DQN network uses the main Q network to calculate the optimal action $a^*$. The formula is as follows:
$$a^* = \arg\max_{a} Q(s_{t+1}, a; \theta)$$
(1.3)
The target network calculates the $Q_{target}$ value. The formula is as follows:
$$Q_{target} = r + \gamma (1 - d)\, Q_{target}(s_{t+1}, a^*; \theta')$$
where $a^*$ is calculated by the main Q network of the agent in state $s_{t+1}$, and $\gamma$ is the discount factor.
(1.4)
Loss function: squared error function $J(\theta)$
$$J(\theta) = \frac{1}{2} \left[ Q(s_t, a_t; \theta) - Q_{target} \right]^2$$
(1.5)
Stochastic gradient descent
$$\frac{\partial J(\theta)}{\partial \theta} = \left( Q(s_t, a_t; \theta) - Q_{target} \right) \frac{\partial Q(s_t, a_t; \theta)}{\partial \theta}$$
(1.6)
Parameter $\theta$ update
$$\theta \leftarrow \theta - \lambda \frac{\partial J(\theta)}{\partial \theta}$$
(2)
Target Q network parameter update
In order to alleviate the overestimation problem of the main Q network, the target Q network is employed as an auxiliary network to aid in the parameter training process. The target Q network is updated using a delayed update and soft update strategy. Delayed update refers to updating the network parameters after every fixed time step, known as a generational gap. Soft update involves using a weighted average of the main Q network parameters and the historical target Q network parameters as the target Q network parameters during the parameter update process. The specific formula is as follows:
$$\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$$
where $\tau$ is a smoothing coefficient that reflects the degree of influence of the main Q network parameters on the target network.
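The two mechanisms described in this section, double-DQN target computation and the delayed soft update, can be sketched compactly in PyTorch. This is an illustrative fragment under the assumption that `q_net` and `target_net` share the dueling architecture of Section 3.2:

```python
import torch

def td_targets(batch, q_net, target_net, gamma=0.99):
    """Double-DQN targets: action selected by the main net, valued by the target net."""
    s, _, r, s_next, done = batch
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection: main net
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation: target net
        return r + gamma * (1.0 - done) * q_next

def soft_update(q_net, target_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta' (the tau value is a placeholder)."""
    for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```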

3.2. D3QN Network Model and Algorithmic Process

The present study employs the D3QN deep neural network structure depicted in Figure 8, together with the relevant parameters. Given the use of a discretized traffic state encoding scheme for the signalized intersection state space, a two-dimensional convolutional neural network is first applied to extract features from the intersection state information matrix, performing padding, convolution, and pooling operations in succession. Subsequently, the flattened tensor of length 1200 is fed into a dueling network consisting of two fully connected branches, each with 128 neurons and the ReLU activation function: one outputs the state value of size 1, and the other outputs the action advantage values of size 4 or 8, matching the fixed-phase or adaptive-phase action space of this study. Finally, the state value and the action advantage values are combined to output the Q value.
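A PyTorch sketch of this network is given below. The exact convolution settings are not reported in the text, so the configuration here is one assumption that reproduces the stated flatten length of 1200 from the $3 \times 20 \times 40$ state tensor:

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Sketch of the dueling architecture described above (conv settings assumed)."""
    def __init__(self, n_actions=4):                     # 4 (fixed) or 8 (adaptive) phases
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=3, padding=1),   # padding + convolution
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: 6 x 10 x 20 = 1200
            nn.Flatten(),
        )
        self.value = nn.Sequential(                      # state-value branch V(s)
            nn.Linear(1200, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage = nn.Sequential(                  # advantage branch A(s, a)
            nn.Linear(1200, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, x):
        h = self.features(x)
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=1, keepdim=True)   # Q = V + (A - mean A)
```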
The D3QN is an improved algorithm based on the DQN, so the algorithm’s procedure is similar to that of the DQN. The specific steps of the signalized intersection control algorithm based on the D3QN are as follows (Algorithm 1):
Algorithm 1: Dueling Double DQN Algorithm
Input: $D$: empty replay buffer; $\theta$: initial main Q network parameters; $\theta'$: copy of $\theta$
Input: $\gamma$: discount factor; $\tau$: smoothing coefficient; $\alpha$: main Q network learning rate;
    $\varepsilon$: exploration rate; $M$: replay buffer maximum size; $B$: training batch size;
    $\mu$: target network update frequency; $I$: number of iterations; $N$: max training epoch; $T$: max time step
for episode $e \in \{1, 2, 3, \ldots, N\}$ do
 Initialize state $s_0$
  for $t \in \{1, 2, \ldots, T\}$ do
   Observe state $s_t$ and choose action $a_t$ based on the $\varepsilon$-greedy policy
   Execute action $a_t$; observe reward $r_t$ and next state $s_{t+1}$
   Store transition tuple $(s_t, a_t, r_t, s_{t+1}, d)$ in $D$
   Replace and delete the oldest tuple if $|D| > M$
   Sample a minibatch of $B$ transitions $(s, a, r, s', d)$ from $D$
   for each transition $(s, a, r, s', d)$ do
    Define $a^* = \arg\max_{a'} Q(s', a'; \theta)$
    Set $y_i = r + \gamma (1 - d)\, Q_{target}(s', a^*; \theta')$
    Perform a gradient descent step on the loss $\left( y_i - Q(s, a; \theta) \right)^2$
    Update target parameters $\theta' = \tau \theta + (1 - \tau) \theta'$ every $I$ steps
   end
  end
end

4. Experiment and Results

4.1. Simulation Environment and Parameter Settings

The present study established a simulation environment for signalized intersections using the SUMO software (Version 1.18.0), as illustrated in Figure 2. Vehicle position, speed, and acceleration data were acquired as state inputs through SUMO's built-in TraCI interface and lane area detectors (E2). The agent makes action decisions by combining the signal phase, the remaining green light time, and other information. The D3QN algorithm employed for intersection signal control optimization was implemented with a deep neural network in the PyTorch framework.
The simulated intersection has six south-to-north inbound lanes and four outbound lanes, as well as four east-to-west inbound lanes and two outbound lanes. The number of vehicles entering from each direction varies across the simulation periods; the specific proportion of vehicles from each direction relative to the total number of vehicles is presented in Table 2.
The vehicle types considered in the present study comprise human-driven and connected and automated vehicles, for which the Krauss and CACC car-following models are employed, respectively. The vehicle length is $5\ \mathrm{m}$, the maximum deceleration is $4.5\ \mathrm{m/s^2}$, the maximum acceleration is $2.6\ \mathrm{m/s^2}$, and the maximum speed is $60\ \mathrm{km/h}$. The D3QN algorithm is trained using the Adam optimizer in combination with minibatch stochastic gradient descent. For each iteration, the optimizer randomly samples a batch of 64 training samples from the experience replay buffer. A simulation period of $1800\ \mathrm{s}$ is taken as one epoch, with a total of 800 epochs. The algorithm's main hyperparameters are detailed in Table 3.
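For reference, state acquisition through TraCI can be sketched as follows; the detector IDs and configuration file name are hypothetical, and only standard TraCI calls are used:

```python
import traci

def collect_vehicle_states(detector_ids):
    """Gather (position, speed, acceleration) for vehicles on E2 detectors."""
    vehicles = []
    for det in detector_ids:
        for veh_id in traci.lanearea.getLastStepVehicleIDs(det):
            vehicles.append((
                traci.vehicle.getLanePosition(veh_id),   # position along the lane (m)
                traci.vehicle.getSpeed(veh_id),          # instantaneous speed (m/s)
                traci.vehicle.getAcceleration(veh_id),   # acceleration (m/s^2)
            ))
    return vehicles

traci.start(["sumo", "-c", "intersection.sumocfg"])      # hypothetical config file
while traci.simulation.getMinExpectedNumber() > 0:
    traci.simulationStep()
    states = collect_vehicle_states(["e2_north", "e2_south", "e2_east", "e2_west"])
    # feed `states` to the DTSE encoder and the agent here
traci.close()
```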

4.2. Experimental Evaluation and Results Analysis

This paper conducts the following four experiments to evaluate the proposed reward function:
1.
Considering the increasing penetration rate of connected and automated vehicles, it is essential to investigate signal control strategies for intersections in a high CAV penetration environment. Therefore, this study sets the CAV penetration rate at 90% and conducts a comparative analysis of the optimization effects of the proposed reward function on signalized intersections under three traffic volume conditions: high, medium, and low.
2.
In real-world scenarios, the penetration rate of CAVs in mixed traffic flow at urban intersections varies continuously over time. Therefore, we set the traffic volume to 3600 pcu·h⁻¹ and compare the optimization effects of the proposed reward function on signalized intersections under different CAV penetration rates.
3.
Given the spatiotemporal distribution characteristics of traffic flow at signalized intersections, this study conducts a comparative analysis of the optimization effects of the reward function proposed in this paper on signalized intersections under different action space schemes and provides recommendations for the selection of action space schemes.
4.
To validate the feasibility of the algorithm proposed in this paper in real-world intersections, this study tests the robustness of the model in scenarios where traffic accidents occur at intersections.
In this study, we focus on two key performance indicators for assessing experimental outcomes: firstly, the average waiting time of vehicles, and secondly, the average carbon dioxide emissions from vehicles. These criteria stand out as our primary metrics for evaluation.
Referring to the literature [47], the optimization scheme proposed in this paper, i.e., Scheme 1: Reward—CO2 Reduction, is compared with the other four schemes in the first and second experiments.
1.
Scheme 2: The deep reinforcement learning signal timing optimization scheme, referred to as Reward—Wait Time, employs the difference in the cumulative waiting time between adjacent time steps as the reward function, and is formulated as follows:
$$R_T = -1 \times \left( D_t - c D_{t-1} \right)$$
where $D_t$ is the cumulative waiting time of vehicles at the intersection entrances at time $t$, $D_{t-1}$ is the cumulative waiting time at time $t-1$, and $c$ is a reduction factor of 0.9.
2.
Scheme 3: The deep reinforcement learning signal timing optimization scheme, referred to as Reward—Queue Length, employs the cumulative queuing length as the reward function, and is formulated as follows:
$$R_L = -1 \times L_t$$
where $L_t$ is the cumulative queue length of vehicles at the intersection entrances at time $t$.
3.
Scheme 4: Fixed signal timing control (FSTC), using the Webster signal timing method to calculate the green light duration of each phase.
4.
Scheme 5: Actuated signal control (ASC), working by prolonging traffic phases whenever a continuous stream of traffic is detected.

4.2.1. Analysis of Experimental Results under Different Traffic Volume Conditions

To evaluate the optimization performance of the proposed reward function under the higher CAV penetration rates expected in the future, a 90% penetration rate of CAVs is assumed, and training is conducted under mixed traffic conditions with high, medium, and low traffic volumes, corresponding to 4800 pcu·h⁻¹, 3600 pcu·h⁻¹, and 2400 pcu·h⁻¹, respectively. In accordance with the assumptions outlined in the introduction, the market penetration rates of the vehicle types in this experiment are given in Table 4.
In this study, we choose the parameters of the epoch with the highest average reward during the training phase as the basis for our network model. Based on Figure 9a, the neural network parameters at 585, 556, and 321 epochs were selected as the network model parameters for the Reward—CO2 Reduction scheme under high, medium, and low vehicle volumes, respectively. Similarly, based on Figure 9b, the neural network parameters at 506, 784, and 625 epochs were selected for the Reward—Wait Time scheme under high, medium, and low vehicle volumes, respectively, while the neural network parameters at 373, 701, and 754 epochs were selected for the Reward—Queue Length scheme under the same traffic volumes based on Figure 9c. To compare the optimization effects of various signal control schemes, 20 groups of high, medium, and low traffic volumes were randomly generated as validation flows.
Figure 10 and Figure 11 depict the mean values of vehicular average CO2 emissions and average waiting times, respectively, for each signal control scheme under 20 randomly generated traffic flow scenarios. As observed from Figure 10, when the market penetration rate of CAVs reaches 90%, the three signal control schemes implemented using the D3QN algorithm demonstrate a reduction in vehicular average CO2 emissions compared with the FSTC scheme under high, medium, and low traffic flow conditions. Notably, the D3QN-based schemes exhibit a pronounced reduction in CO2 emissions at low and medium traffic volumes, while their effectiveness decreases slightly under high traffic conditions. At a traffic flow rate of 2400 pcu·h⁻¹, implementing the proposed Reward—CO2 Reduction scheme for intersection signal control results in average vehicle CO2 emissions of 128.29 g, a decrease of 2.41% compared with Scheme 2, 4.31% compared with Scheme 3, 7.13% compared with Scheme 4, and 5.74% compared with Scheme 5. Similarly, at a traffic flow rate of 3600 pcu·h⁻¹, the Reward—CO2 Reduction scheme results in average vehicle CO2 emissions of 134.96 g, a reduction of 2.20% compared with Scheme 2, 3.93% compared with Scheme 3, 7.42% compared with Scheme 4, and 5.93% compared with Scheme 5. As the traffic volume increases, the effectiveness of the proposed Reward—CO2 Reduction scheme in reducing CO2 emissions diminishes. At a traffic volume of 4800 pcu·h⁻¹, the optimization performance of the Reward—CO2 Reduction scheme is still superior to that of Schemes 2, 4, and 5, but slightly inferior to that of Scheme 3, which is less affected by traffic volume in terms of reducing CO2 emissions.
In addition to the average CO2 emissions per vehicle, the average waiting time of vehicles serves as another evaluation criterion, one that relates to the intersection's traffic efficiency. As shown in Figure 11, at a traffic volume of 2400 pcu·h⁻¹, the Reward—CO2 Reduction scheme yielded the highest average waiting time among all schemes, at 23.33 s. However, considering Figure 10, it can be observed that under low traffic flow conditions, the Reward—CO2 Reduction scheme aligns more closely with the concept of eco-driving: at low traffic volumes, the proposed scheme prioritizes reducing CO2 emissions at the intersection over improving traffic flow efficiency, which sets it apart from previous reward function learning objectives. Under medium traffic conditions, the Reward—CO2 Reduction scheme provides the best optimization effect for the average waiting time of intersection vehicles. At a traffic volume of 3600 pcu·h⁻¹, the Reward—CO2 Reduction scheme resulted in an average waiting time of 19.68 s, representing a 13.72% reduction compared with Scheme 2, a 16.68% reduction compared with Scheme 3, a 28.88% reduction compared with Scheme 4, and a 22.56% reduction compared with Scheme 5. As with CO2 emissions, the optimization effect of the Reward—CO2 Reduction scheme on average waiting time decreases at a traffic volume of 4800 pcu·h⁻¹, where it is slightly inferior only to Scheme 3, which achieves the lowest average vehicle waiting time under high traffic conditions.
In summary, by considering the instantaneous acceleration of vehicles in the intersection approach lanes as part of the state space, this study allows the reward function of the Reward—CO2 Reduction scheme to learn the corresponding control strategies for low and medium traffic volumes effectively, thereby making rational action choices to cope with different traffic conditions. At a traffic volume of 3600 pcu·h⁻¹, the agent trained with the proposed reward function yielded the most effective optimization of traffic signals for reducing CO2 emissions and enhancing throughput efficiency at the intersection, satisfying the traffic flow requirements for the vast majority of time periods. As traffic volume increases, the agent's focus shifts from lowering CO2 emissions to improving intersection throughput efficiency, thereby alleviating congestion at the intersection.

4.2.2. Analysis of Experimental Results under Different Penetration Rates of CAVs

As the market penetration rate of connected and automated vehicles continues to increase, it is imperative to compare and analyze the optimization effects of signalized intersection control strategies under different penetration rates of CAVs. With the traffic volume set at 3600 pcu·h⁻¹, this study conducted training under mixed traffic conditions with high, medium, and low penetration rates of CAVs, corresponding to penetration rates of 90%, 60%, and 30%, respectively. The composition of vehicles at a 90% CAV market penetration rate is consistent with Experiment 1, while the vehicle compositions for 30% and 60% CAV market penetration rates are detailed in Table 5 and Table 6. In addition, 20 groups of traffic flows with different penetration rates of CAVs were randomly generated as validation traffic to compare and analyze the optimization effects of the various signal control strategies.
As depicted in Figure 12, employing the Reward—CO2 Reduction scheme for intersection signal optimization under a 30% CAV penetration rate yields average CO2 emissions of 230.46 g, a reduction of 3.12% compared with Scheme 3, 8.49% compared with Scheme 4, and 5.62% compared with Scheme 5. At a 60% CAV penetration rate, the Reward—CO2 Reduction scheme results in average CO2 emissions of 187.59 g per vehicle, an improvement of 12.32% over Scheme 4 and 7.95% over Scheme 5; the optimization performances of the three D3QN-based schemes were nearly identical. At intersections with low and medium CAV penetration, Scheme 2 slightly outperforms the Reward—CO2 Reduction scheme in terms of optimization effectiveness. As the penetration rate of CAVs rises, battery-powered electric vehicles constitute a progressively higher proportion of the traffic flow, leading to a marked reduction in CO2 emissions. At a 90% CAV penetration rate, utilizing the Reward—CO2 Reduction scheme for intersection signal optimization results in a considerable decrease in the average CO2 emissions of vehicles, down to 134.96 g, reflecting a 2.14% decrease compared with Scheme 2, a 3.72% decrease relative to Scheme 3, a 7.42% decrease compared with Scheme 4, and a 5.93% decrease compared with Scheme 5. The Reward—CO2 Reduction scheme advanced in this paper delivers the most noticeable optimization effect on average vehicle CO2 emissions when CAV penetration rates are high.
Figure 13 illustrates the optimization effect under a traffic flow of 3600 pcu·h⁻¹, with vehicle average waiting time as the evaluation criterion. At a 30% penetration rate of CAVs, applying the Reward—CO2 Reduction scheme to signal optimization results in an average waiting time of 21.49 s, which is 12.71% lower than Scheme 3, 19.72% lower than Scheme 4, and 16.96% lower than Scheme 5; however, the optimization effect of Scheme 2 was found to be superior to that of the Reward—CO2 Reduction scheme in this context. At a 60% penetration rate of CAVs, the Reward—CO2 Reduction scheme yields an average waiting time of 20.55 s, which is 9.67% lower than Scheme 3, 23.44% lower than Scheme 4, and 19.88% lower than Scheme 5. At intersections with a 90% penetration rate of CAVs, the Reward—CO2 Reduction scheme leads to an average waiting time of 19.68 s, which is 13.72% lower than Scheme 2, 16.68% lower than Scheme 3, 28.88% lower than Scheme 4, and 22.28% lower than Scheme 5. The proposed signal scheme thus shows a significant improvement in average vehicle waiting time under different market penetration rates of CAVs at a traffic flow of 3600 pcu·h⁻¹.

4.2.3. Analysis of Experimental Results under Different Action Space Schemes

Given the spatiotemporal distribution characteristics of traffic flow within signalized intersections, it is imperative to investigate the design of action spaces for deep reinforcement learning in signal control under different spatial and temporal conditions. Building upon the previous two experiments, this study aims to analyze the impact of different action spaces on the optimization effectiveness of the Reward—CO2 Reduction scheme and provide action space design recommendations for signalized intersections in different geographical locations or at different time periods. The spatiotemporal variability of traffic flow within signalized intersections is primarily reflected in different traffic volumes. To this end, this experiment employs the same high, medium, and low traffic volume settings as in Experiment 1, while setting the penetration rate of CAVs at 60%.
The trained model was tested under 20 sets of randomly generated traffic flows with high, medium, and low traffic volumes, respectively. The average performance of the model in each scenario is presented in Table 7, Table 8 and Table 9. Specifically, Scheme 1—APS denotes the Reward—CO2 Reduction approach employing an adaptive-phase sequence as the action space, while Scheme 1—FPS represents the Reward—CO2 Reduction approach utilizing a fixed-phase sequence as the action space. Scheme 4 aligns with the experimental configurations previously outlined in Experiments 1 and 2. Figure 14a,b show the test process under low traffic flow.
As shown in Table 7, under a traffic flow of 2400 pcu·h⁻¹, Scheme 1—FPS achieves average CO2 emissions of 171.24 g and an average waiting time of 15.85 s, exhibiting reductions of 3.65% and 24.95%, respectively, compared with Scheme 1—APS, and reductions of 8.25% and 18.55%, respectively, compared with Scheme 4. As depicted in Table 8 and Table 9, Scheme 1—FPS shows poor generalization ability as the traffic flow increases, with significant fluctuations in its performance under medium and high traffic flow conditions. In contrast, Scheme 1—APS demonstrates superior performance. At a traffic volume of 3600 pcu·h⁻¹, Scheme 1—APS achieved average vehicle CO2 emissions of 187.13 g and an average vehicle waiting time of 20.55 s; compared with Scheme 1—FPS, these values are 6.86% and 39.58% lower, respectively, and compared with Scheme 4, 5.27% and 24.73% lower. At high traffic volumes, Scheme 1—APS again achieves notable reductions, with average CO2 emissions of 199.61 g and an average waiting time of 29.19 s, representing decreases of 14.49% and 46.61% compared with Scheme 1—FPS, and of 2.77% and 19.36% compared with Scheme 4. These results indicate that the Reward—CO2 Reduction scheme has a considerable optimization effect across all three levels of traffic flow. During peak hours in urban centers, it is recommended to utilize the adaptive-phase sequence for intersection signal optimization, while for suburban areas with lower traffic volumes, the fixed-phase sequence is more suitable.

4.2.4. Analysis of Experimental Results concerning the Robustness of the Algorithm during Traffic Accident Scenarios

To assess the robustness of the algorithm proposed in this paper under unexpected conditions in actual intersection environments, this experiment simulates peak traffic flow using a Weibull distribution for vehicle generation. The simulation lasts for 7200 s, with a traffic volume of 2400 vehicles, and a 90% market penetration rate for connected vehicles. To compare the optimization effectiveness of different algorithms in the event of traffic accidents, 20 sets of random traffic flows are generated as validation flows.
As shown in Figure 15 and Figure 16, the Reward—CO2 Reduction scheme introduced in this study achieves average CO2 emissions of 137.11 g per vehicle during traffic incidents at intersections. This emission level is 2.07% less than that of Scheme 2, 6.57% less than that of Scheme 3, 11.37% less than that of Scheme 4, and 10.13% less than that of Scheme 5. In addition to exhibiting superior optimization performance in terms of average CO2 emissions per vehicle, the scheme proposed in this paper also demonstrates a reduction in average vehicle waiting time during traffic accidents compared with other schemes. Specifically, it shows an 11.31% decrease compared with Scheme 2, a 15.67% decrease compared with Scheme 3, a significant 34.12% decrease compared with Scheme 4, and a 24.58% decrease compared with Scheme 5. Additionally, compared with the other deep reinforcement learning algorithms used in Scheme 2 and Scheme 3, the proposed scheme demonstrates a relatively tight data distribution, indicating more stable performance under varying conditions and exhibiting robustness.

5. Conclusions

The present study utilizes the D3QN algorithm and proposes a novel reward function, Reward—CO2 Reduction, to train a deep reinforcement learning model for optimizing traffic signal control at intersections. We introduce the acceleration of vehicles in the inbound lanes of the intersection as a state variable, which enhances the agent's ability to learn effective control policies and promotes efficient traffic flow while reducing CO2 emissions. To the best of our knowledge, this is the first attempt to incorporate vehicle acceleration information into the state-space modeling of traffic signal control. In addition, when calculating CO2 emissions at the intersection, this paper considers not only the instantaneous emission model of fuel vehicles but also the instantaneous energy consumption model of electric vehicles, which better reflects the composition of real-world traffic.
(1)
In various traffic scenarios with high penetration rates of connected and automated vehicles (CAVs), the proposed signal control scheme adopts different control strategies according to the traffic volume, focusing on optimizing CO2 emissions and traffic efficiency. The scheme achieves the best results in scenarios with medium traffic volume, but its optimization performance diminishes as the traffic volume increases.
(2)
Experiment 2 demonstrates that the proposed signal control scheme performs better in scenarios with higher CAV penetration rates. When the penetration rate reaches 90%, the scheme reduces the average waiting time of vehicles at the intersection by 13.72%, 16.68%, 28.88%, and 22.88% compared with Schemes 2, 3, 4, and 5, respectively. In addition, the optimized average CO2 emissions at the intersection are reduced by 2.14%, 3.72%, 7.42%, and 5.93%, respectively.
(3)
To further reduce CO2 emissions at intersections and improve intersection throughput efficiency, this paper compares fixed-phase sequence and adaptive-phase sequence optimization schemes under different traffic volume conditions and provides recommendations: the fixed-phase sequence action space suits the Reward—CO2 Reduction scheme in low-traffic scenarios, while the adaptive-phase sequence action space combined with the Reward—CO2 Reduction scheme performs better in medium- to high-traffic scenarios and maintains good robustness.
(4)
Considering the complexity of real-world intersections, this study simulates scenarios of traffic accidents occurring during peak hours to validate the optimization performance and robustness of the model. The experiments indicate that the proposed signal control scheme exhibits desirable optimization capabilities under unexpected conditions. Furthermore, compared with deep reinforcement learning models trained with other reward functions, our proposed model demonstrates enhanced robustness.
The current study proposes a deep reinforcement learning algorithm utilizing the Reward—CO2 Reduction function for optimizing signal control at a single intersection. Several avenues for future research arise from the identified limitations of the proposed method. First, the effectiveness of the reward function needs to be further examined at different single intersections and in multi-intersection coordinated signal control, as suggested by the relevant literature [48]. Second, because discretized traffic state encoding is adopted, the scalability of the model across intersections of varying geometry is somewhat constrained; we intend to explore traffic state representations based on the direction of traffic flow to enhance scalability. Third, to improve the practical applicability of the model at actual intersections, the algorithmic framework and neural network architecture should be refined so that the model's decision-making responsiveness can be tested on hardware platforms with lower processing capabilities. Fourth, the efficacy of the Reward—CO2 Reduction function appears suboptimal under heavy traffic flow, indicating a need to refine the design of the reward function. Finally, reducing CO2 emissions at intersections relies not solely on optimized signal control but also on real-time trajectory optimization based on the optimized signals; future work will therefore address the coordinated optimization of intersection signals and vehicle trajectories.

Author Contributions

Conceptualization, data curation, methodology, software, validation, and writing—original draft, Z.W.; conceptualization, data curation, and writing—review and editing, L.X.; conceptualization and writing—review and editing, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Appendix A

Table A1. Vehicles' CO2 emission rates in different VSP intervals.

Number    VSP Interval/(kW·t⁻¹)    CO2 Emission Rate/(g·s⁻¹)
1         <−2                      1.5437
2         −2~0                     1.6044
3         0~1                      1.1308
4         1~4                      2.3863
5         4~7                      3.2102
6         7~10                     3.9577
7         10~13                    4.7520
8         13~16                    5.3742
9         16~19                    5.9400
10        19~23                    6.4275
11        23~28                    7.0660
12        28~33                    7.6177
13        33~39                    8.3224
14        ≥39                      8.4750
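For readers who want to apply Table A1, the following minimal Python sketch shows one way to go from a vehicle's speed and acceleration to a per-second CO2 rate. It uses the common light-duty form of the VSP equation associated with [38]; the coefficient values and the flat-road assumption are illustrative choices, not parameters confirmed by the paper.

```python
# Look up a per-second CO2 rate from Table A1: compute vehicle specific
# power (VSP) with the widely used light-duty coefficients, then map the
# result to the emission rate of the VSP bin it falls into.
import bisect

def vsp(v: float, a: float, grade: float = 0.0) -> float:
    """VSP in kW/t; v in m/s, a in m/s^2, grade as the road slope term."""
    return v * (1.1 * a + 9.81 * grade + 0.132) + 0.000302 * v ** 3

# Upper bin edges of Table A1 and the matching CO2 rates (g/s).
EDGES = [-2, 0, 1, 4, 7, 10, 13, 16, 19, 23, 28, 33, 39]
RATES = [1.5437, 1.6044, 1.1308, 2.3863, 3.2102, 3.9577, 4.7520,
         5.3742, 5.9400, 6.4275, 7.0660, 7.6177, 8.3224, 8.4750]

def co2_rate(v: float, a: float) -> float:
    """CO2 emission rate (g/s) for the VSP bin the vehicle falls into."""
    return RATES[bisect.bisect_right(EDGES, vsp(v, a))]

# Example: cruising at 12 m/s with mild acceleration on flat road.
print(co2_rate(12.0, 0.5))  # VSP ≈ 8.71 kW/t -> 7~10 bin -> 3.9577 g/s
```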

References

  1. Fellendorf, M. VISSIM: A microscopic simulation tool to evaluate actuated signal control including bus priority. In Proceedings of the 64th Institute of Transportation Engineers Annual Meeting, Dallas, TX, USA, 16–19 October 1994; pp. 1–9.
  2. Mirchandani, P.; Head, L. A real-time traffic signal control system: Architecture, algorithms, and analysis. Transp. Res. Part C Emerg. Technol. 2001, 9, 415–432.
  3. Lowrie, P. SCATS: A traffic responsive method of controlling urban traffic. In Sales Information Brochure; Roads & Traffic Authority: Sydney, Australia, 1990.
  4. Mirchandani, P.; Wang, F.-Y. RHODES to intelligent transportation systems. IEEE Intell. Syst. 2005, 20, 10–15.
  5. Hunt, P.; Robertson, D.; Bretherton, R.; Royle, M.C. The SCOOT on-line traffic signal optimisation technique. Traffic Eng. Control 1982, 23, 190–192.
  6. Coelho, M.C.; Farias, T.L.; Rouphail, N.M. Impact of speed control traffic signals on pollutant emissions. Transp. Res. Part D Transp. Environ. 2005, 10, 323–340.
  7. Yao, R.; Sun, L.; Long, M. VSP-based emission factor calibration and signal timing optimisation for arterial streets. IET Intell. Transp. Syst. 2019, 13, 228–241.
  8. Yao, Z.; Zhao, B.; Yuan, T.; Jiang, H.; Jiang, Y. Reducing gasoline consumption in mixed connected automated vehicles environment: A joint optimization framework for traffic signals and vehicle trajectory. J. Clean. Prod. 2020, 265, 121836.
  9. Chen, X.; Yuan, Z. Environmentally friendly traffic control strategy—A case study in Xi'an city. J. Clean. Prod. 2020, 249, 119397.
  10. Lin, H.; Han, Y.; Cai, W.; Jin, B. Traffic signal optimization based on fuzzy control and differential evolution algorithm. IEEE Trans. Intell. Transp. Syst. 2022, 24, 8555–8566.
  11. Xiao, G.; Lu, Q.; Ni, A.; Zhang, C. Research on carbon emissions of public bikes based on the life cycle theory. Transp. Lett. 2023, 15, 278–295.
  12. Haitao, H.; Yang, K.; Liang, H.; Menendez, M.; Guler, S.I. Providing public transport priority in the perimeter of urban networks: A bimodal strategy. Transp. Res. Part C Emerg. Technol. 2019, 107, 171–192.
  13. He, H.; Guler, S.I.; Menendez, M. Adaptive control algorithm to provide bus priority with a pre-signal. Transp. Res. Part C Emerg. Technol. 2016, 64, 28–44.
  14. Wiering, M.A. Multi-agent reinforcement learning for traffic light control. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford, CA, USA, 29 June–2 July 2000; pp. 1151–1158.
  15. Abdulhai, B.; Kattan, L. Reinforcement learning: Introduction to theory and potential for transport applications. Can. J. Civ. Eng. 2003, 30, 981–991.
  16. El-Tantawy, S.; Abdulhai, B. An agent-based learning towards decentralized and coordinated traffic signal control. In Proceedings of the 13th International IEEE Conference on Intelligent Transportation Systems, Funchal, Portugal, 19–22 September 2010; pp. 665–670.
  17. Arel, I.; Liu, C.; Urbanik, T.; Kohls, A.G. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intell. Transp. Syst. 2010, 4, 128–135.
  18. Genders, W.; Razavi, S. Using a deep reinforcement learning agent for traffic signal control. arXiv 2016.
  19. Ma, D.; Zhou, B.; Song, X.; Dai, H. A deep reinforcement learning approach to traffic signal control with temporal traffic pattern mining. IEEE Trans. Intell. Transp. Syst. 2021, 23, 11789–11800.
  20. Li, Z.; Yu, H.; Zhang, G.; Dong, S.; Xu, C.-Z. Network-wide traffic signal control optimization using a multi-agent deep reinforcement learning. Transp. Res. Part C Emerg. Technol. 2021, 125, 103059.
  21. Lu, L.; Cheng, K.; Chu, D.; Wu, C.; Qiu, Y. Adaptive traffic signal control based on dueling recurrent double Q network. China J. Highw. Transp. 2022, 35, 267.
  22. Kim, G.; Sohn, K. Area-wide traffic signal control based on a deep graph Q-Network (DGQN) trained in an asynchronous manner. Appl. Soft Comput. 2022, 119, 108497.
  23. Zhu, Y.; Yin, X.; Chen, C. Extracting decision tree from trained deep reinforcement learning in traffic signal control. IEEE Trans. Comput. Soc. Syst. 2023, 10, 1997–2007.
  24. Yan, L.; Zhu, L.; Song, K.; Yuan, Z.; Yan, Y.; Tang, Y.; Peng, C. Graph cooperation deep reinforcement learning for ecological urban traffic signal control. Appl. Intell. 2023, 53, 6248–6265.
  25. Chen, Y.; Zhang, H.; Liu, M.; Ye, M.; Xie, H.; Pan, Y. Traffic signal optimization control method based on adaptive weighted averaged double deep Q network. Appl. Intell. 2023, 53, 18333–18354.
  26. Ren, A.; Zhou, D.; Feng, J. Attention mechanism based deep reinforcement learning for traffic signal control. Appl. Res. Comput. 2023, 40, 430–434.
  27. Haddad, T.A.; Hedjazi, D.; Aouag, S. A deep reinforcement learning-based cooperative approach for multi-intersection traffic signal control. Eng. Appl. Artif. Intell. 2022, 114, 105019.
  28. Kumar, N.; Rahman, S.S.; Dhakad, N. Fuzzy inference enabled deep reinforcement learning-based traffic light control for intelligent transportation system. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4919–4928.
  29. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
  30. Booth, S.; Knox, W.B.; Shah, J.; Niekum, S.; Stone, P.; Allievi, A. The perils of trial-and-error reward design: Misdesign through overfitting and invalid task specifications. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 5920–5929.
  31. Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, 27–30 June 1999; pp. 278–287.
  32. Burda, Y.; Edwards, H.; Storkey, A.; Klimov, O. Exploration by random network distillation. arXiv 2018, arXiv:1810.12894.
  33. Badia, A.P.; Sprechmann, P.; Vitvitskyi, A.; Guo, D.; Piot, B.; Kapturowski, S.; Tieleman, O.; Arjovsky, M.; Pritzel, A.; Bolt, A.J.; et al. Never give up: Learning directed exploration strategies. arXiv 2020, arXiv:2002.06038.
  34. Market Analysis Report of China's Intelligent Connected Passenger Vehicles from January to December 2022; China Industry Innovation Alliance for the Intelligent and Connected Vehicles (CAICV): Beijing, China, 2023.
  35. New Energy Vehicle Industry Development Plan (2021–2035); China, 2020. Available online: https://www.iea.org/policies/15529-new-energy-vehicle-industry-development-plan-2021-2035 (accessed on 20 October 2020).
  36. IEA. Global EV Data Explorer; IEA: Paris, France, 2023.
  37. Genders, W.; Razavi, S. Evaluating reinforcement learning state representations for adaptive traffic signal control. Procedia Comput. Sci. 2018, 130, 26–33.
  38. Jimenez-Palacios, J.L. Understanding and Quantifying Motor Vehicle Emissions with Vehicle Specific Power and TILDAS Remote Sensing; Massachusetts Institute of Technology: Cambridge, MA, USA, 1998.
  39. Frey, H.; Unal, A.; Chen, J.; Li, S.; Xuan, C. Methodology for Developing Modal Emission Rates for EPA's Multi-Scale Motor Vehicle & Equipment Emission System; US Environmental Protection Agency: Ann Arbor, MI, USA, 2002; p. 13.
  40. Zhao, H. Simulation and Optimization of Vehicle Energy Consumption and Emission at Urban Road Signalized Intersection; Lanzhou Jiaotong University: Lanzhou, China, 2019.
  41. Yang, S.; Li, M.; Lin, Y.; Tang, T. Electric vehicle's electricity consumption on a road with different slope. Phys. A Stat. Mech. Its Appl. 2014, 402, 41–48.
  42. Climate Letter of Approval No. 43; China, 2023. Available online: https://www.mee.gov.cn/xxgk2018/xxgk/xxgk06/202302/t20230207_1015569.html (accessed on 7 February 2023).
  43. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013.
  44. Christodoulou, P. Soft actor-critic for discrete action settings. arXiv 2019.
  45. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1995–2003.
  46. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
  47. Haydari, A.; Yılmaz, Y. Deep reinforcement learning for intelligent transportation systems: A survey. IEEE Trans. Intell. Transp. Syst. 2020, 23, 11–32.
  48. Haitao, H.; Menendez, M.; Guler, S.I. Analytical evaluation of flexible-sharing strategies on multimodal arterials. Transp. Res. Part A Policy Pract. 2018, 114, 364–379.
Figure 1. Deep reinforcement learning for traffic signal control.
Figure 2. Schematic diagram of intersection model.
Figure 3. Example of traffic state space on north entrance road.
Figure 4. Fixed-phase sequence action space.
Figure 5. Adaptive-phase sequence action space.
Figure 6. The arrival times of vehicles (Weibull distribution).
Figure 7. Comparisons between the D3QN, DQN, and discretized ASAC on average reward.
Figure 8. Network structure diagram of D3QN (adaptive-phase sequence).
Figure 9. Average reward during training—90% market penetration rate of CAVs.
Figure 10. Average CO2 emissions for vehicles—90% market penetration rate of CAVs.
Figure 11. Average waiting time for vehicles—90% market penetration rate of CAVs.
Figure 12. Average CO2 emissions for vehicles—traffic volume of 3600 pcu·h⁻¹.
Figure 13. Average waiting time for vehicles—traffic volume of 3600 pcu·h⁻¹.
Figure 14. Average CO2 emissions and waiting time for vehicles during a 20-round evaluation process—60% penetration rate of CAVs and 2400 pcu·h⁻¹ traffic volume.
Figure 15. Average CO2 emissions for vehicles during traffic accident occurrences.
Figure 16. Average waiting time for vehicles during traffic accident occurrences.
Table 1. Current market penetration rate of vehicles.

                    Human-Driven Vehicle (%)    Connected and Automated Vehicle with Level 2 and above Automation Level (%)
Fuel vehicle        51.4                        19.6
Electric vehicle    13.7                        15.3

Table 2. The proportion of vehicles in different driving directions at each entrance of the intersection.

Entrance          Going Straight    Turning Left    Turning Right
East entrance     13.33%            5.56%           3.33%
West entrance     13.33%            5.56%           3.33%
South entrance    16.67%            6.94%           4.17%
North entrance    16.67%            6.94%           4.17%
Table 3. Main hyperparameters of the D3QN model.

Hyperparameter                           Value
Learning rate α                          0.0003
Discount factor γ                        0.99
Batch size B                             64
Exploration rate ε                       0.9
Target network update frequency μ        100
Experience replay buffer M               500,000
Smoothing coefficient τ                  0.05
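To make the roles of these hyperparameters concrete, the following PyTorch sketch shows the two D3QN ingredients they configure: a dueling Q-network and the double-Q learning target. It is an illustrative reimplementation under assumed dimensions (a state size of 80 and four phase actions are placeholders), not the authors' code.

```python
# Minimal D3QN sketch: dueling architecture plus double-Q target,
# configured with the learning rate, discount factor, and batch size
# from Table 3.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(128, n_actions)  # advantage stream A(s, a)

    def forward(self, s):
        h = self.feature(s)
        v, a = self.value(h), self.advantage(h)
        # Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)
        return v + a - a.mean(dim=1, keepdim=True)

GAMMA, LR, BATCH = 0.99, 3e-4, 64  # from Table 3
online, target_net = DuelingQNet(80, 4), DuelingQNet(80, 4)
target_net.load_state_dict(online.state_dict())
opt = torch.optim.Adam(online.parameters(), lr=LR)

def td_target(r, s_next, done):
    """Double-Q target: online net selects the action, target net evaluates it."""
    with torch.no_grad():
        a_star = online(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        return r + GAMMA * (1.0 - done) * q_next

# One gradient step on a synthetic stand-in for a replay batch; per Table 3,
# the target network would be refreshed every μ = 100 steps with soft
# updates weighted by τ = 0.05.
s = torch.randn(BATCH, 80); a = torch.randint(0, 4, (BATCH, 1))
r = torch.randn(BATCH); s2 = torch.randn(BATCH, 80); d = torch.zeros(BATCH)
q = online(s).gather(1, a).squeeze(1)
loss = F.smooth_l1_loss(q, td_target(r, s2, d))
opt.zero_grad(); loss.backward(); opt.step()
```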
Table 4. The 90% market penetration rate of CAVs.

                    Human-Driven Vehicle (%)    Connected and Automated Vehicle with Level 2 and above Automation Level (%)
Fuel vehicle        2.0                         19.6
Electric vehicle    8.0                         70.4

Table 5. The 30% market penetration rate of CAVs.

                    Human-Driven Vehicle (%)    Connected and Automated Vehicle with Level 2 and above Automation Level (%)
Fuel vehicle        56.3                        19.6
Electric vehicle    13.7                        10.4

Table 6. The 60% market penetration rate of CAVs.

                    Human-Driven Vehicle (%)    Connected and Automated Vehicle with Level 2 and above Automation Level (%)
Fuel vehicle        26.3                        19.6
Electric vehicle    13.7                        40.4
Table 7. Performance of schemes under the condition of low traffic flow at an intersection.

Scheme          Average CO2 Emissions/g    Average Waiting Time/s
Scheme 1—APS    177.73                     21.12
Scheme 1—FPS    171.24                     15.85
Scheme 4        186.64                     19.46

Table 8. Performance of schemes under the condition of medium traffic flow at an intersection.

Scheme          Average CO2 Emissions/g    Average Waiting Time/s
Scheme 1—APS    187.13                     20.55
Scheme 1—FPS    200.92                     34.01
Scheme 4        197.55                     27.30

Table 9. Performance of schemes under the condition of high traffic flow at an intersection.

Scheme          Average CO2 Emissions/g    Average Waiting Time/s
Scheme 1—APS    199.61                     29.19
Scheme 1—FPS    233.44                     54.67
Scheme 4        205.29                     36.20
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
