Article

Multi-Agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control with Spatio-Temporal Attention Mechanism

Wenzhe Jia and Mingyu Ji
1 Aulin College, Northeast Forestry University, Harbin 150040, China
2 College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8605; https://doi.org/10.3390/app15158605
Submission received: 8 May 2025 / Revised: 19 June 2025 / Accepted: 28 July 2025 / Published: 3 August 2025

Abstract

Traffic congestion in large-scale road networks significantly impacts urban sustainability. Traditional traffic signal control methods lack adaptability to dynamic traffic conditions. Recently, deep reinforcement learning (DRL) has emerged as a promising solution for optimizing signal control. This study proposes a Multi-Agent Deep Reinforcement Learning (MADRL) framework for large-scale traffic signal control. The framework employs spatio-temporal attention networks to extract relevant traffic patterns and a hierarchical reinforcement learning strategy for coordinated multi-agent optimization. The problem is formulated as a Markov Decision Process (MDP) with a novel reward function that balances vehicle waiting time, throughput, and fairness. We validate our approach on simulated large-scale traffic scenarios using SUMO (Simulation of Urban Mobility). Experimental results demonstrate that our framework reduces vehicle waiting time by 25% compared to baseline methods while maintaining scalability across different road network sizes. The proposed spatio-temporal multi-agent reinforcement learning framework effectively optimizes large-scale traffic signal control, providing a scalable and efficient solution for smart urban transportation.

1. Introduction

Traffic congestion has emerged as one of the most pressing challenges in modern urban management, significantly impacting economic productivity, environmental sustainability, and public health. The rapid rise in private vehicle ownership, particularly in densely populated countries such as China, has intensified congestion across major metropolitan areas. As of 2022, the number of registered vehicles in China exceeded 320 million, and the number of licensed drivers surpassed 452 million, resulting in severe traffic congestion during peak hours [1]. Economically, congestion contributes to substantial productivity losses—an estimated USD 160 billion annually in the United States alone, primarily due to delays in the movement of goods and services [2]. Environmentally, idling vehicles emit large quantities of greenhouse gases such as carbon dioxide (CO2) and nitrogen oxides (NOx), exacerbating urban air pollution and global warming. According to the IPCC, the transportation sector accounts for approximately 23% of global CO2 emissions from fuel combustion, with road transport responsible for nearly 75% of that share [3]. From a public health perspective, prolonged exposure to vehicle emissions has been linked to increased risks of respiratory and cardiovascular diseases, as well as heightened stress levels due to noise pollution [4]. These facts underscore the urgent need for innovative and scalable solutions to optimize urban traffic flows.
One of the most cost-effective and impactful strategies to mitigate urban congestion is the optimization of traffic signal control at intersections. Over the past decades, various traffic signal control strategies have been developed, ranging from simple fixed-time schedules to complex real-time adaptive systems. Fixed-time control methods, such as the Webster algorithm [5], rely on pre-configured timing plans and perform poorly under variable traffic conditions. Semi-adaptive or actuated control systems, which detect vehicle presence via sensors and adjust signal durations accordingly, offer greater flexibility [6]. Between these two, dynamic control systems like TRANSYT and TASS dynamically adjust signal phases based on predefined logic rules or traffic flow predictions [7,8], but still often operate in a localized fashion.
More sophisticated adaptive systems such as SCATS (Sydney Coordinated Adaptive Traffic System) [9] and SCOOT (Split Cycle Offset Optimization Technique) [10] respond to real-time traffic data to optimize signal plans across a network of intersections. However, these systems typically require extensive sensor infrastructure and centralized data processing, which pose significant scalability and cost challenges. Furthermore, both SCATS and SCOOT represent centralized control architectures, which do not leverage the potential benefits of decentralized or hybrid models. Notably, decentralized systems such as MOTION [11] and TASS [8] provide more distributed control policies, offering better scalability and robustness in some urban deployments.
Despite the variety of available approaches, conventional traffic control methods are limited by their inability to holistically adapt to complex and dynamically changing traffic environments. They often focus on local optimization and fail to capture interdependencies between neighboring intersections, leading to inefficiencies such as increased queue lengths and poor network throughput. Additionally, these methods struggle to scale in large urban networks consisting of thousands of signalized intersections; Manhattan alone, for example, contains more than 2500 traffic lights requiring continuous coordination [12].
To address these limitations, this study proposes a Hybrid Multi-Agent Deep Reinforcement Learning (MADRL) framework that incorporates a spatio-temporal attention mechanism to facilitate intelligent, distributed traffic signal control. Our approach formulates traffic signal optimization as a Markov Decision Process (MDP) and introduces a novel reward function that jointly optimizes average vehicle waiting time, network throughput, and fairness. The framework leverages Graph Attention Networks (GATs) to model spatial dependencies among intersections, while Long Short-Term Memory (LSTM) networks are used to capture temporal traffic dynamics. Furthermore, the proposed method adopts a hierarchical agent structure with both local and global coordination levels, enabling effective multi-agent cooperation in large-scale urban networks.

2. Related Works

Traditional traffic signal control methods have been widely deployed in urban traffic systems. Fixed-time control strategies, such as the Webster method [5], use pre-determined signal plans and are easy to implement but lack adaptability to real-time conditions. Actuated control improves upon this by using detectors to trigger phase changes based on vehicle presence [6], yet its scope remains locally limited. Dynamic control frameworks like TRANSYT and TASS bridge this gap partially by enabling time-of-day-based schedule adaptation and local logic adjustments [7,8]. More complex adaptive control methods, such as SCOOT and SCATS [9,10], allow for continuous real-time signal updates. However, they often require centralized infrastructure and suffer from scalability issues when deployed across large-scale city networks [11].
Reinforcement learning (RL) has emerged as a viable solution to overcome these limitations. Early works demonstrated that Q-learning-based agents could outperform fixed strategies in simple intersections [12]. With advances in Deep RL, models like DQN [13] and PPO [14] have been adapted to urban traffic control, enabling agents to learn state-action value functions and stochastic policies. Yet, these models struggle with scalability due to the exponential growth of the action space in multi-intersection environments.
Recent methods attempt to address this by using spatio-temporal representations and multi-agent settings. For instance, STMARL [15] introduces a spatio-temporal attention mechanism over agent neighborhoods to improve coordination. HeteroLight [16] addresses the heterogeneity of intersection types using a general policy structure that adapts across zones. MARL-DSTAN [17] constructs temporal graphs between agents to support long-range dependency modeling in irregular traffic networks. These works motivate our choice to combine spatio-temporal modeling and hierarchical multi-agent structures to achieve better generalization and scalability in realistic networks. In addition to reinforcement learning, graph neural networks (GNNs) have been explored as a way to capture spatial dependencies in traffic networks. Graph-based approaches model road networks as graphs, where intersections represent nodes and roads represent edges. GAT (Graph Attention Network) [18] has been applied in traffic signal control to improve information aggregation across intersections. Ref. [19] utilized GNNs to enhance the representation of traffic states in reinforcement learning models, demonstrating improved coordination in multi-agent systems. However, the integration of graph neural networks with multi-agent reinforcement learning for large-scale traffic control is still an evolving research area.
Multi-agent reinforcement learning (MARL) is another promising direction for traffic signal control, where multiple agents collaborate to optimize traffic flow across a network of intersections. Ref. [20] applied MARL to decentralized traffic signal control, showing that independent agents can learn effective policies but struggle with global coordination. To address this limitation, Ref. [21] introduced a centralized training and decentralized execution (CTDE) approach, enabling agents to share information during training while maintaining scalability in real-time control. However, MARL approaches often suffer from non-stationarity [22], where agents continuously adapt their policies, leading to instability in learning. Addressing this issue requires improved training frameworks, such as parameter sharing and hierarchical reinforcement learning.
Despite these advancements, several challenges remain in applying reinforcement learning to large-scale traffic signal control [23]. One major challenge is the trade-off between local and global optimization [24]. While centralized methods achieve better global coordination, they are computationally expensive and difficult to scale. Conversely, fully decentralized approaches are more scalable but often lead to suboptimal traffic management due to lack of coordination. Additionally, reward function design [25] plays a critical role in RL-based traffic signal control. Many existing methods use reward functions based on average waiting time or queue length, but designing an effective reward function that balances multiple objectives—such as throughput [26], fairness [27], and robustness to traffic fluctuations [28]—remains an open research problem.
To address these limitations, this study proposes a multi-agent deep reinforcement learning (MADRL) framework that integrates spatio-temporal attention networks for enhanced traffic signal coordination. By leveraging Graph Attention Networks (GATs) to model spatial dependencies and recurrent neural networks (RNNs) to capture temporal variations, the proposed approach enables more efficient traffic control in large-scale networks. The hierarchical reinforcement learning structure ensures that local and global objectives are balanced, improving both scalability and coordination. Extensive experiments using Simulation of Urban Mobility (SUMO) demonstrate that the proposed framework outperforms baseline methods in terms of reducing vehicle waiting times and improving traffic throughput. These findings contribute to the growing body of research on intelligent traffic management and highlight the potential of deep reinforcement learning for large-scale urban mobility optimization.
To distinguish our work from existing literature, we highlight three core innovations. First, unlike STMARL or MARL-DSTAN that primarily focus on spatial neighborhood interactions, our approach incorporates both graph-based spatial attention and LSTM-based temporal encoding to better capture cross-time dependencies. Second, our framework introduces a hybrid hierarchical structure where global agents guide sub-region controllers, achieving improved coordination across large networks without incurring full centralization costs. Third, we propose a customized reward function that jointly balances throughput, waiting time, and fairness—an aspect often treated separately or implicitly in prior work. These contributions collectively enable our model to scale effectively while maintaining robust performance under dynamic traffic loads.

3. Hybrid Multi-Agent Reinforcement Learning (Hybrid MARL)

3.1. Spatio-Temporal Information Extraction

In large-scale road networks, traffic conditions [29] exhibit strong spatial and temporal correlations due to dynamic vehicle movements. Effectively capturing these relationships is crucial for optimizing traffic signal control, as it enables reinforcement learning (RL) agents to make more informed decisions [30]. Traditional methods often rely on manually designed traffic features, which may fail to fully represent the complex traffic dynamics in large-scale networks [31]. To address this limitation, this study integrates Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Attention Networks (GATs) to extract spatial and temporal dependencies from traffic data [32].
The proposed spatio-temporal feature extraction framework is illustrated in Figure 1. This framework consists of three key modules: state embedding, spatial module, and temporal module. The state embedding module processes raw traffic data into structured feature representations that capture key characteristics of vehicle movements. The spatial module employs Graph Attention Networks (GATs) to model interdependencies between intersections, allowing the system to learn dynamic traffic interactions. The temporal module utilizes Long Short-Term Memory (LSTM) networks to extract sequential patterns from historical traffic data, enabling the model to anticipate congestion trends. By integrating these three modules, the extracted spatio-temporal features serve as informative inputs for reinforcement learning agents, leading to improved decision-making in traffic signal control.
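The sketch below (in PyTorch) illustrates how such an encoder can be assembled: a simple MLP state embedding, a single-head graph attention layer over the intersection adjacency matrix, and an LSTM over a short observation history. Layer sizes and tensor shapes are illustrative assumptions and do not reproduce the exact architecture of Figure 1.

```python
# Minimal sketch of the three-module encoder described above (assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head GAT-style layer: each intersection attends to its neighbours."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h, adj):
        # h: (N, dim) node embeddings, adj: (N, N) 0/1 adjacency with self-loops
        Wh = self.W(h)
        N = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1))   # raw attention scores (N, N)
        e = e.masked_fill(adj == 0, float("-inf"))       # keep only real edges
        alpha = torch.softmax(e, dim=-1)                 # attention weights
        return F.elu(alpha @ Wh)                         # aggregated neighbour features


class SpatioTemporalEncoder(nn.Module):
    """State embedding -> spatial attention -> temporal LSTM, as in Figure 1."""

    def __init__(self, raw_dim, dim=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(raw_dim, dim), nn.ReLU())
        self.spatial = GraphAttentionLayer(dim)
        self.temporal = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, obs_seq, adj):
        # obs_seq: (T, N, raw_dim) raw traffic observations over T time steps
        spatial_seq = [self.spatial(self.embed(x), adj) for x in obs_seq]
        seq = torch.stack(spatial_seq, dim=1)            # (N, T, dim)
        out, _ = self.temporal(seq)
        return out[:, -1]                                # per-intersection feature (N, dim)
```

The last hidden state of the LSTM serves as the per-intersection input to the reinforcement learning agents described next.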

3.2. Single-Agent Reinforcement Learning Framework

Traffic signal control can be formulated as a Markov Decision Process (MDP), where each intersection functions as an autonomous agent that learns an optimal control policy through interaction with its environment. The MDP is defined as a tuple (S, A, P, R, γ), where
  • S (State Space) represents the current traffic conditions, including queue lengths at each lane, average vehicle speeds, and the current traffic light phase.
  • A (Action Space) defines the possible traffic signal phase changes at an intersection. Each agent selects an action from a discrete set of phase transition options.
  • P (State Transition Probability) specifies the probability of transitioning from one state to another, given a selected action.
  • R (Reward Function) evaluates the effectiveness of an action based on metrics such as total vehicle waiting time, throughput, and fairness in traffic signal allocation.
  • γ (Discount Factor) determines the importance of future rewards relative to immediate rewards.
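As an illustration of this formulation, the snippet below sketches one possible per-intersection observation and a discrete phase-selection action set; the field names and the four-phase example are assumptions for illustration, not the exact encoding used in the paper.

```python
# Illustrative state/action encoding for one intersection (assumed fields).
from dataclasses import dataclass
from typing import List


@dataclass
class IntersectionState:
    queue_lengths: List[float]   # vehicles queued on each incoming lane
    mean_speeds: List[float]     # average speed per incoming lane (m/s)
    current_phase: int           # index of the active signal phase


# Example discrete action space: which signal phase to activate next.
PHASES = ["NS_through", "NS_left", "EW_through", "EW_left"]
ACTIONS = list(range(len(PHASES)))   # the agent picks one phase index per decision step
```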
To guide agent behavior, we propose a composite reward function that balances three objectives: minimizing waiting time, maximizing throughput, and ensuring fairness. The overall reward $R_t$ at time $t$ is given by
$$R_t = \alpha R_{wait}^t + \beta R_{throughput}^t + \gamma R_{fairness}^t$$
where $R_{wait}^t = -\sum_{i=1}^{n} w_i^t$ is the total waiting time across all approaches, $R_{throughput}^t = \sum_{i=1}^{n} v_i^t$ is the number of vehicles that have passed through, and $R_{fairness}^t = -\mathrm{Var}(g_i^t)$ is the variance of green time among phases.
The coefficients α = 0.5, β = 0.3, and γ = 0.2 are tuned via grid search on the validation set to achieve optimal trade-offs between the three objectives.
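A minimal implementation of this composite reward could look as follows. The sign conventions (penalizing total waiting time and green-time variance, rewarding throughput) are assumptions consistent with the stated objectives, and in practice the components would typically be normalized before weighting.

```python
# Sketch of the composite reward with the tuned weights alpha=0.5, beta=0.3, gamma=0.2.
from statistics import pvariance
from typing import Sequence


def composite_reward(wait_times: Sequence[float],
                     vehicles_passed: int,
                     green_times: Sequence[float],
                     alpha: float = 0.5,
                     beta: float = 0.3,
                     gamma: float = 0.2) -> float:
    r_wait = -sum(wait_times)              # total waiting time across approaches (penalized)
    r_throughput = float(vehicles_passed)  # vehicles that cleared the intersection (rewarded)
    r_fairness = -pvariance(green_times)   # variance of green time among phases (penalized)
    return alpha * r_wait + beta * r_throughput + gamma * r_fairness


# Example: three approaches waiting 4 s / 9 s / 2 s, 12 vehicles passed,
# phases received 20 s / 25 s / 15 s of green time in the last cycle.
r = composite_reward([4.0, 9.0, 2.0], 12, [20.0, 25.0, 15.0])
```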
However, a major limitation of single-agent reinforcement learning [25] in traffic control is its inability to capture interactions between adjacent intersections. Since traffic flow at one intersection directly affects the conditions at neighboring intersections [26], independent learning approaches often lead to inefficient, locally optimized policies [27]. Figure 2 illustrates the basic structure of the single-agent reinforcement learning model, where each intersection operates independently without considering the global traffic state.

3.3. Multi-Agent Reinforcement Learning Framework

In large-scale road networks, optimizing individual intersections in isolation is insufficient to achieve network-wide efficiency. Therefore, a multi-agent reinforcement learning (MARL) framework is proposed to enable decentralized traffic signal controllers to cooperate. However, MARL presents unique challenges, such as non-stationarity (as agents continuously adapt their strategies) and exponential growth of the action space. To address these issues, we propose a Hybrid MARL framework that balances local decision-making with global coordination. The detailed hyperparameter settings used for training our Hybrid MARL framework are presented in Appendix A.
The proposed Hybrid MARL framework consists of two levels of agents:
(1)
Sub-region Agents: Each sub-region agent controls a small cluster of intersections and optimizes traffic flow within its local region. In our experimental setting, each sub-region typically contains 4 to 8 signalized intersections, depending on the density and structure of the urban road layout. This decomposition ensures both manageable learning complexity and sufficient spatial coordination.
(2)
Global Agent: This centralized agent aggregates traffic information from all sub-region agents and provides high-level guidance to ensure network-wide coordination.
This hierarchical structure reduces computational complexity while maintaining effective coordination across large road networks. The overall structure of the Hybrid MARL model is presented in Figure 3, where local agents operate within defined regions, and a central agent optimizes traffic flow at a global level.
The learning process follows a Centralized Training and Decentralized Execution (CTDE) paradigm. During training, all sub-region agents share experience and receive coordinated feedback from the global agent, allowing them to collectively learn optimal traffic control policies. During real-time execution, each sub-region agent operates independently, making decisions based on local observations. This hybrid approach achieves scalability while retaining the benefits of collaborative learning.
Each agent in the Hybrid MARL framework is trained using the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, which extends the standard Deep Deterministic Policy Gradient (DDPG) to a multi-agent setting. The policy gradient update is computed as
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\left[\nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid s_i)\, Q^{\pi}(s, a)\right]$$
where $\pi_{\theta_i}$ represents the policy network of agent $i$, and $Q^{\pi}(s, a)$ is the action-value function. The global agent provides an auxiliary reward signal to sub-region agents to encourage network-wide optimization.
To improve stability and convergence, we implement Experience Replay and Target Networks, which mitigate policy fluctuations and enhance learning efficiency. Additionally, Graph Attention Networks (GATs) are incorporated into agent architectures to improve communication between neighboring intersections, allowing agents to learn cooperative traffic control strategies.
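The condensed sketch below shows how such an update can be organized: a decentralized actor that sees only local observations, a centralized critic over the joint state and joint action (the CTDE pattern described above), an experience replay buffer, and a softly updated target critic. Network sizes, the soft-update rate, and the stochastic-actor form (matching the gradient expression above rather than a deterministic-policy variant) are assumptions for illustration, not the authors' training code.

```python
# Sketch of one multi-agent actor-critic update with replay and target networks.
from collections import deque
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, obs):                      # local observation only (decentralized execution)
        return F.softmax(self.net(obs), dim=-1)


class CentralCritic(nn.Module):
    def __init__(self, joint_obs_dim, joint_act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_obs_dim + joint_act_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, joint_obs, joint_act):     # sees all agents (centralized training)
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))


replay = deque(maxlen=100_000)                   # experience replay buffer (transitions appended during rollouts)


def soft_update(target, source, tau=0.01):       # target network for stable bootstrapping
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.mul_(1 - tau).add_(tau * s.data)


def update_agent(actor, critic, target_critic, actor_opt, critic_opt, batch, gamma=0.95):
    (joint_obs, joint_act, local_obs, own_act_idx,
     reward, joint_next_obs, joint_next_act) = batch
    # Critic: regress Q(s, a) towards the bootstrapped one-step target.
    with torch.no_grad():
        target_q = reward + gamma * target_critic(joint_next_obs, joint_next_act)
    critic_loss = F.mse_loss(critic(joint_obs, joint_act), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: log-probability of the taken action weighted by the centralized Q value,
    # mirroring the gradient expression above.
    log_pi = torch.log(actor(local_obs).gather(-1, own_act_idx) + 1e-8)
    actor_loss = -(log_pi * critic(joint_obs, joint_act).detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    soft_update(target_critic, critic)
```

In practice, the target critic would be initialized as a deep copy of the critic and refreshed softly after every update, and minibatches would be sampled uniformly from the replay buffer before each call.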

3.4. Algorithm Workflow

The overall Hybrid MARL algorithm workflow is presented in Figure 4, outlining the sequential process of data collection, spatio-temporal feature extraction, reinforcement learning updates, and execution.
The algorithm follows these key steps:
(1)
Traffic Data Processing: Data is collected and preprocessed into structured state embeddings.
(2)
Spatio-Temporal Feature Extraction: CNNs, LSTMs, and GATs are used to extract relevant spatial and temporal patterns.
(3)
Policy Learning: Each sub-region agent optimizes its traffic signal policy using MADDPG.
(4)
Global Coordination: The central agent aggregates information and provides coordination signals.
(5)
Execution Phase: Trained policies are deployed for real-time traffic signal control.
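The workflow can be summarized in the compact training-loop sketch below. All helper names (collect_observations, apply_phase, coordinate, observe_and_store, and so on) are placeholders standing in for the components described above, not functions defined in the paper.

```python
# High-level training loop mirroring the five workflow steps (placeholder helpers).
def train(env, encoder, agents, global_agent, episodes=200, horizon=3600):
    for episode in range(episodes):
        obs_history = []
        for t in range(horizon):
            # (1) Traffic data processing: raw detector data -> structured state embeddings.
            raw_obs = env.collect_observations()
            obs_history.append(raw_obs)
            # (2) Spatio-temporal feature extraction over the recent observation window.
            features = encoder(obs_history[-10:], env.adjacency)
            # (4) Global coordination: the central agent issues guidance signals.
            guidance = global_agent.coordinate(features)
            # (3) Policy learning: each sub-region agent selects phases and stores experience.
            for agent in agents:
                action = agent.act(features[agent.region], guidance[agent.region])
                env.apply_phase(agent.region, action)
            env.step()
            for agent in agents:
                agent.observe_and_store(env, guidance)
        for agent in agents:
            agent.update()          # MADDPG updates from the replay buffer
        global_agent.update()
    # (5) Execution phase: deploy the trained decentralized policies (critics are discarded).
    return [agent.policy for agent in agents]
```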

4. Experiments

4.1. Experimental Setup

To validate the effectiveness of the proposed Hybrid Multi-Agent Reinforcement Learning (Hybrid MARL) framework, we conduct extensive experiments on a simulated traffic network. The simulation is implemented using SUMO (Simulation of Urban Mobility), a widely used open-source microscopic traffic simulator. The experimental setup includes multiple road network configurations, varying traffic flow intensities, and comparisons with state-of-the-art baseline methods. Figure 5 illustrates the experimental environment and road network topology used in this study.
The experimental road network is configured to mimic real-world traffic conditions, incorporating intersections of varying complexities. Traffic demand patterns follow a Poisson distribution to simulate realistic vehicle arrivals, ensuring diverse and dynamic traffic conditions. The simulation runs for 10,000 time steps, with each agent making decisions at predefined intervals. The performance of each traffic signal control strategy is evaluated based on multiple key performance indicators (KPIs), including average vehicle waiting time, throughput, congestion level, and fairness of green light distribution.
To ensure consistent comparability, all scenarios are based on a regular 4 × 4 grid layout representing a medium-sized urban road network. Each intersection has four incoming approaches, and each lane receives an identical vehicle inflow rate unless specified. Directional imbalances in flow patterns are not modeled in this study.
The theoretical value of 2000 vehicles/hour/lane is cited as an idealized saturation flow from traffic flow theory. In practice, our simulation maintains much lower demand rates (ranging from 800 to 1500 vehicles/hour/lane), and this theoretical upper limit is only used as a normalization constant for throughput metrics. All performance results in Table 1 and Table 2 are reported under the same simulation conditions for fair comparison.
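For reference, the snippet below shows one way the waiting-time and throughput indicators can be logged from a SUMO run through the standard TraCI interface. The configuration file name is a placeholder, and the halting-based waiting-time accumulation is a simple proxy rather than the exact metric definition used in our evaluation.

```python
# Sketch of KPI logging from a SUMO simulation via TraCI (config name is a placeholder).
import traci


def evaluate(config="grid4x4.sumocfg", steps=10_000, step_length=1.0):
    traci.start(["sumo", "-c", config])
    waiting_seconds, arrived = 0.0, 0
    for _ in range(steps):
        traci.simulationStep()
        # One waiting-second is accumulated for every vehicle halted in this step.
        halted = sum(1 for v in traci.vehicle.getIDList()
                     if traci.vehicle.getSpeed(v) < 0.1)
        waiting_seconds += halted * step_length
        # Vehicles that finished their trip in this step count towards throughput.
        arrived += traci.simulation.getArrivedNumber()
    traci.close()
    avg_wait = waiting_seconds / max(arrived, 1)             # seconds per completed vehicle
    throughput = arrived * 3600.0 / (steps * step_length)    # vehicles per hour
    return avg_wait, throughput
```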

4.2. Comparison with Baseline Methods

To demonstrate the superiority of the proposed Hybrid MARL framework, we compare its performance with several baseline methods, including traditional rule-based approaches and state-of-the-art deep reinforcement learning (DRL) models. The baseline methods include the following:
(1)
Fixed-Time Control (FT): A conventional method that assigns pre-defined green time intervals to each phase, irrespective of real-time traffic conditions.
(2)
Actuated Control (AC): A semi-adaptive method that adjusts signal timings based on real-time vehicle detection sensors.
(3)
Max-Pressure Control (MP): A widely used optimization-based approach that balances incoming and outgoing vehicle flows at intersections.
(4)
Deep Q-Network (DQN): A reinforcement learning approach that uses a single-agent Q-learning framework for traffic signal control.
(5)
Proximal Policy Optimization (PPO): A policy-gradient reinforcement learning model designed for continuous control tasks.
(6)
Multi-Agent Deep Deterministic Policy Gradient (MADDPG): A standard multi-agent RL approach that enables decentralized traffic control with policy sharing.
Table 1 presents the comparison of average vehicle waiting times for different methods. It is evident that our proposed Hybrid MARL framework significantly outperforms traditional rule-based methods and single-agent RL models. The reduction in waiting time demonstrates the effectiveness of multi-agent coordination and hierarchical optimization in alleviating congestion.
The 25% improvement refers to the reduction in average vehicle waiting time using our proposed method, compared to fixed-time control (FT) and actuated control (AC) baseline methods. This improvement was observed under three traffic density settings: low (800 vehicles/h/lane), medium (1200 vehicles/h/lane), and high (1600 vehicles/h/lane). The reduction was calculated as the difference in waiting times between our proposed Hybrid MARL framework and the baseline methods under the same traffic conditions.
Additionally, we measure throughput, which quantifies the number of vehicles successfully passing through intersections within a given time frame. Table 2 shows that Hybrid MARL achieves the highest throughput of all methods, demonstrating its effectiveness in improving network-wide efficiency and indicating that the proposed model facilitates smoother and more efficient traffic flow.

4.3. Performance Analysis Under Different Traffic Conditions

To evaluate the adaptability of Hybrid MARL under varying traffic conditions, we simulate different vehicle arrival rates and measure model performance. While all models experience performance degradation under high congestion, Hybrid MARL maintains a significant advantage over baselines, demonstrating robustness and scalability. Figure 6 shows the impact of different traffic intensities on waiting time: Hybrid MARL consistently achieves lower waiting times across different congestion levels.
Furthermore, we analyze traffic flow stability, which quantifies fluctuations in vehicle speeds and queue lengths. Figure 6 also visualizes the real-time traffic flow stability for Hybrid MARL and baseline models. The results indicate that Hybrid MARL produces smoother traffic flow, reducing the probability of sudden congestion spikes.
As shown in Figure 6, Hybrid MARL exhibits more stable traffic movement, reducing fluctuations in congestion patterns.

4.4. Ablation Study

To further investigate the contribution of different components within Hybrid MARL, we conduct an ablation study by systematically removing key modules from the framework. The following variations are tested:
(1)
Hybrid MARL (Full Model): The complete proposed model, integrating spatio-temporal feature extraction, hierarchical reinforcement learning, and global coordination.
(2)
Without Spatio-Temporal Attention (No-ST): This variant removes Graph Attention Networks (GATs) and LSTMs, preventing agents from learning dynamic traffic dependencies.
(3)
Without Global Coordination (No-GC): This model eliminates the global agent, forcing sub-region agents to operate independently.
(4)
Without Hierarchical Learning (No-HL): A single-layer reinforcement learning approach without hierarchical control.
Figure 7 presents the results of the ablation study. Removing spatio-temporal attention results in a significant degradation in performance, confirming its importance in modeling dynamic traffic conditions. Similarly, without global coordination, waiting times increase sharply, indicating that hierarchical structure plays a crucial role in optimizing network-wide traffic flow.
Figure 7 highlights the impact of different model components. The full Hybrid MARL model achieves the best performance.

5. Conclusions

To deploy our proposed Hybrid MARL framework in real-world urban traffic systems, we would need to integrate real-time traffic data from sensors and traffic cameras. This could be achieved using existing infrastructure, such as smart traffic lights and vehicle detection systems. However, challenges such as data sparsity, sensor inaccuracies, and integration with existing traffic management platforms need to be addressed for successful deployment. Moreover, collaboration with local traffic authorities would be necessary to adapt our model to the specific requirements and constraints of the city’s road network.
In this study, we proposed a Hybrid Multi-Agent Reinforcement Learning (Hybrid MARL) framework for large-scale traffic signal control, integrating spatio-temporal attention networks, hierarchical reinforcement learning, and cooperative decision-making. The framework effectively captures dynamic traffic dependencies by leveraging Graph Attention Networks (GATs) for spatial feature extraction and Long Short-Term Memory (LSTM) for temporal modeling. Furthermore, the introduction of a hierarchical learning structure with sub-region agents and a global coordinating agent significantly improves the scalability and stability of multi-agent reinforcement learning in complex traffic networks.
Extensive experiments conducted on SUMO simulations demonstrate the effectiveness of the proposed framework. Hybrid MARL outperforms traditional rule-based methods and deep reinforcement learning baselines in key performance indicators, including average vehicle waiting time, throughput, and traffic flow stability. The results show that Hybrid MARL achieves a 25% reduction in waiting time compared to state-of-the-art reinforcement learning methods, while maintaining robust performance across different traffic densities. Additionally, the ablation study confirms the critical role of spatio-temporal attention and hierarchical coordination, showing that removing these components significantly degrades performance.
The findings of this study suggest that Hybrid MARL has significant potential for real-world deployment in smart urban transportation systems. Future research directions include extending the framework to multi-modal transportation environments, integrating real-world sensor data, and exploring federated learning techniques to enhance decentralized optimization in large-scale intelligent traffic management.

Author Contributions

Writing—original draft, W.J. and M.J.; writing—review and editing, W.J. and M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Algorithm Parameters

This appendix provides the key algorithm parameters used in our experiments. The values for each parameter were chosen based on empirical tuning and literature suggestions.
| Parameter | Value | Description | Justification |
|---|---|---|---|
| Discount Factor (γ) | 0.95 | Determines the importance of future rewards. | A common value used in reinforcement learning for traffic control systems. |
| Learning Rate | 0.0003 | Step size for updating the weights in the optimization process. | Chosen to ensure stable convergence during training. |
| Batch Size | 64 | Number of experiences used for each gradient update. | Empirically selected to balance learning speed and memory usage. |
| Experience Replay Size | 100,000 | Number of experiences stored in the replay buffer for training. | A large buffer size improves training stability and efficiency in large-scale systems. |
| Reward Scaling | Varies | Scaling factor applied to individual reward components for balancing objectives (waiting time, throughput, fairness). | Tuned to ensure a balanced trade-off between competing objectives. |
These parameters were tuned on a validation set using a grid search method. The values presented are optimal for the simulation environment described in Section 4.1, ensuring convergence and performance improvements in various traffic conditions.
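For convenience, these settings can be collected into a single configuration mapping such as the one below; the packaging (and the optimizer implied by the learning rate) is illustrative, while the values themselves are taken from the table above.

```python
# Appendix A settings gathered into one training configuration (packaging is illustrative).
HYPERPARAMS = {
    "gamma": 0.95,                   # discount factor
    "learning_rate": 3e-4,           # gradient step size
    "batch_size": 64,                # experiences per gradient update
    "replay_buffer_size": 100_000,   # transitions kept in the replay buffer
    "reward_weights": {"wait": 0.5, "throughput": 0.3, "fairness": 0.2},  # from Section 3.2
}
```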

References

1. Wang, B.; He, Z.K.; Sheng, J.F.; Liu, Y.X. Multi-agent deep reinforcement learning with actor-attention-critic for traffic light control. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2024, 238, 2880–2888.
2. Lee, S.W.; Heo, Y.J.; Zhang, B.T. Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog. arXiv 2018, arXiv:1802.03881.
3. Ding, L.; Lin, Z.; Shi, X.; Yan, G. Target-Value-Competition-Based Multi-Agent Deep Reinforcement Learning Algorithm for Distributed Nonconvex Economic Dispatch. IEEE Trans. Power Syst. 2023, 38, 204–217.
4. Shang, P.; Liu, X.; Yu, C.; Yan, G.; Xiang, Q.; Mi, X. A new ensemble deep graph reinforcement learning network for spatio-temporal traffic volume forecasting in a freeway network. Digit. Signal Process. 2022, 123, 103419.
5. Qu, A.; Tang, Y.; Ma, W. Adversarial attacks on deep reinforcement learning-based traffic signal control systems with colluding vehicles. ACM Trans. Intell. Syst. Technol. 2023, 14, 113.
6. Feng, Y.; Head, K.L.; Khoshmagham, S.; Zamanipour, M. A real-time adaptive signal control in a connected vehicle environment. Transp. Res. Part C Emerg. Technol. 2015, 55, 460–473.
7. Luo, Z.; Xu, J.; Chen, F. Multi-agent Reinforcement Traffic Signal Control Based on Interpretable Influence Mechanism and Biased ReLU Approximation. arXiv 2024, arXiv:2403.13639.
8. Li, Y.; Guan, Q.; Gu, J.F.; Jiang, X.; Li, Y. A hierarchical deep reinforcement learning method for solving urban route planning problems under large-scale customers and real-time traffic conditions. Int. J. Geogr. Inf. Sci. 2025, 39, 118–141.
9. Li, X.; Lu, L.; Ni, W.; Jamalipour, A.; Zhang, D.; Du, H. Federated multi-agent deep reinforcement learning for resource allocation of vehicle-to-vehicle communications. IEEE Trans. Veh. Technol. 2022, 71, 8810–8824.
10. Liu, J.; Li, F.; Wang, J.; Han, H. Proximal Policy Optimization Based Decentralized Networked Multi-Agent Reinforcement Learning. In Proceedings of the 2024 IEEE 18th International Conference on Control & Automation (ICCA), Reykjavik, Iceland, 18–21 June 2024; pp. 839–844.
11. Kim, G.; Sohn, K. Area-wide traffic signal control based on a deep graph Q-Network (DGQN) trained in an asynchronous manner. Appl. Soft Comput. 2022, 119, 108497.
12. Paul, A.; Mitra, S. Deep reinforcement learning based cooperative control of traffic signal for multi-intersection network in intelligent transportation system using edge computing. Trans. Emerg. Telecommun. Technol. 2022, 33, e4588.
13. Zhu, R.; Ding, W.; Wu, S.; Li, L.; Lv, P.; Xu, M. Auto-learning communication reinforcement learning for multi-intersection traffic light control. Knowl.-Based Syst. 2023, 275, 110696.
14. Sun, Z.; Wu, H.; Shi, Y.; Yu, X.; Gao, Y.; Pei, W.; Yang, Z.; Piao, H.; Hou, Y. Multi-agent air combat with two-stage graph-attention communication. Neural Comput. Appl. 2023, 35, 19765–19781.
15. Zhang, Y.; Li, P.; Fan, M.; Sartoretti, G. HeteroLight: A General and Efficient Learning Approach for Heterogeneous Traffic Signal Control. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 1010–1017.
16. Yao, L.; Torabi, A.; Cho, K.; Ballas, N.; Pal, C.; Larochelle, H.; Courville, A. Video description generation incorporating spatio-temporal features and a soft-attention mechanism. arXiv 2015, arXiv:1502.08029.
17. Wang, Y.; Xu, T.; Niu, X.; Tan, C.; Chen, E.; Xiong, H. STMARL: A spatio-temporal multi-agent reinforcement learning approach for cooperative traffic light control. IEEE Trans. Mob. Comput. 2020, 21, 2228–2242.
18. Du, X.; Wang, J.; Chen, S.; Liu, Z. Multi-agent deep reinforcement learning with spatio-temporal feature fusion for traffic signal control. In Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. Proceedings of the European Conference, ECML PKDD 2021, Bilbao, Spain, 13–17 September 2021, Proceedings, Part IV; Springer International Publishing: Cham, Switzerland, 2021; pp. 470–485.
19. Wang, K.; Shen, Z.; Lei, Z.; Zhang, T. Towards multi-agent reinforcement learning based traffic signal control through spatio-temporal hypergraphs. arXiv 2024, arXiv:2404.11014.
20. Li, Y.; Zhang, Y.; Li, X.; Sun, C. Regional multi-agent cooperative reinforcement learning for city-level traffic grid signal control. IEEE/CAA J. Autom. Sin. 2024, 11, 1987–1998.
21. Fang, J.; You, Y.; Xu, M.; Wang, J.; Cai, S. Multi-objective traffic signal control using network-wide agent coordinated reinforcement learning. Expert Syst. Appl. 2023, 229, 120535.
22. Chergui, O.; Sayad, L. Mitigating congestion in multi-agent traffic signal control: An efficient self-attention proximal policy optimization approach. Int. J. Inf. Technol. 2024, 16, 2273–2282.
23. Zhang, Y.; Yu, Z.; Zhang, J.; Wang, L.; Luan, T.H.; Guo, B.; Yuen, C. Learning decentralized traffic signal controllers with multi-agent graph reinforcement learning. IEEE Trans. Mob. Comput. 2023, 23, 7180–7195.
24. Mao, F.; Li, Z.; Lin, Y.; Li, L. Mastering arterial traffic signal control with multi-agent attention-based soft actor-critic model. IEEE Trans. Intell. Transp. Syst. 2022, 24, 3129–3144.
25. Zhou, B.; Zhou, Q.; Hu, S.; Ma, D.; Jin, S.; Lee, D.H. Cooperative traffic signal control using a distributed agent-based deep reinforcement learning with incentive communication. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10147–10160.
26. Kang, L.; Huang, H.; Lu, W.; Liu, L. Optimizing gate control coordination signal for urban traffic network boundaries using multi-agent deep reinforcement learning. Expert Syst. Appl. 2024, 255, 124627.
27. Barnhart, C.; Bertsimas, D.; Caramanis, C.; Fearing, D. Equitable and Efficient Coordination in Traffic Flow Management. Transp. Sci. 2012, 46, 262–280.
28. Hoogendoorn, S.P.; Knoop, V.L.; van Zuylen, H.J. Robust Control of Traffic Networks under Uncertain Conditions. J. Adv. Transp. 2008, 42, 357–377.
29. Wang, X.; Ma, Y.; Wang, Y.; Jin, W.; Wang, X.; Tang, J.; Jia, C.; Yu, J. Traffic Flow Prediction Via Spatial Temporal Graph Neural Network. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020.
30. Ammar, H.; Yasin, Y. Deep Reinforcement Learning for Intelligent Transportation Systems: A Survey. arXiv 2022, arXiv:2005.00935.
31. Hu, H.; Li, X.; Zhang, Y.; Shang, C.; Zhang, S. Multi-objective Location-Routing Model for Hazardous Material Logistics with Traffic Restriction Constraint in Inter-City Roads. Comput. Ind. Eng. 2019, 128, 861–876.
32. Ge, G.; Wei, Y. Short-term Traffic Speed Forecasting Based on Graph Attention Temporal Convolutional Networks. Neurocomputing 2020, 410, 387–393.
Figure 1. Spatio-temporal feature extraction framework integrating CNNs, LSTMs, and GATs to model dynamic traffic patterns.
Figure 2. Single-agent reinforcement learning framework for traffic signal control, where each agent operates independently without considering network-wide coordination.
Figure 3. Hybrid MARL framework with sub-region agents handling local intersections and a global agent ensuring network-wide coordination.
Figure 4. Algorithm workflow for Hybrid MARL.
Figure 5. The experimental road network setup used in SUMO, showing intersections and vehicle flow patterns. The figure provides a schematic representation of an urban road network organized into a grid pattern, where gray lines depict roadway segments and lanes, green squares indicate intersections with traffic signals currently permitting vehicle passage, and red squares represent intersections where vehicles must stop and wait for the traffic signal to change, collectively illustrating typical traffic signal control at intersections.
Figure 6. Traffic flow stability comparison.
Figure 7. Ablation study results.
Table 1. Comparison of average vehicle waiting times across different traffic control strategies.
| Traffic Intensity (vehicles/h) | Fixed-Time Control (FT) | Actuated Control (AC) | Max-Pressure Control (MP) | Deep Q-Network (DQN) | Proximal Policy Optimization (PPO) | MADDPG | Hybrid MARL |
|---|---|---|---|---|---|---|---|
| Low (800) | 45.6 | 38.2 | 32.4 | 28.9 | 26.7 | 24.5 | 18.3 |
| Medium (1200) | 62.5 | 55.8 | 48.2 | 41.3 | 38.9 | 35.7 | 27.4 |
| High (1600) | 85.3 | 78.9 | 67.1 | 59.8 | 54.6 | 49.3 | 38.1 |
| Very High (2000) | 110.7 | 103.2 | 89.4 | 82.5 | 76.2 | 70.4 | 52.7 |
Table 2. Comparison of throughput (vehicles/hour) among different models.
| Method | Throughput (vehicles/h) |
|---|---|
| Fixed-Time Control (FT) | 1250 |
| Actuated Control (AC) | 1360 |
| Max-Pressure Control (MP) | 1485 |
| Deep Q-Network (DQN) | 1620 |
| Proximal Policy Optimization (PPO) | 1680 |
| MADDPG | 1745 |
| Hybrid MARL | 1890 |