Article
Peer-Review Record

HAPS-PPO: A Multi-Agent Reinforcement Learning Architecture for Coordinated Regional Control of Traffic Signals in Heterogeneous Road Networks

Appl. Sci. 2025, 15(20), 10945; https://doi.org/10.3390/app152010945
by Qiong Lu 1,2,*, Haoda Fang 1, Zhangcheng Yin 1 and Guliang Zhu 1
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 7 September 2025 / Revised: 27 September 2025 / Accepted: 30 September 2025 / Published: 12 October 2025
(This article belongs to the Special Issue Advances in Intelligent Transportation and Its Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript proposes HAPS-PPO, a parameter-sharing MARL framework tailored to dual heterogeneity in real traffic networks, i.e., varying observation and action spaces across intersections. Two core components are introduced: an OPW that zero-pads per-agent observations to a common size for a shared backbone, and DMSGL, which clusters agents by action-space cardinality and assigns each cluster a dedicated policy head while sharing the feature extractor. The paper claims large gains over Fixed-time and improvements over modern MARL baselines on a SUMO setting derived from a 2×3 grid around Fujian University of Technology. Reported reductions vs Fixed-time are 44.74%, 23.47%, and 59.60%, with additional gains over MADQN. Methodologically, OPW zero-pads observations online to the network-wide maximum dimension, enabling shared-weight processing. At the same time, DMSGL performs “shared backbone, separate heads” with dynamic policy mapping by action-set size. The Dec-POMDP formulation uses a cumulative waiting-time difference reward and an unusually high discount factor. Experiments rely on six agents with topological and action-space heterogeneity.
Drawbacks:
1. The paper asserts that existing solutions have “fundamental flaws,” and that the literature “severely lacks” a unified framework for dual heterogeneity. This strong claim would benefit from direct empirical comparisons against the most practical competing strategies, especially parameter-shared policies with action masking and GNN-based encoders with head-specific masking, to show HAPS-PPO’s advantage arises from architecture rather than implementation choices.
2. Baseline configuration/tuning details are thin relative to those for HAPS-PPO. This raises concerns about fairness and reproducibility regarding the baseline comparisons.
3. Zero-padding heterogeneous observations can mix “real zeros” with “padding zeros,” potentially confounding the policy; no padding mask or positional scheme is described to disambiguate. This invites representation bias and brittle generalization across agents with very different dimensionalities. An ablation isolating OPW’s contribution is missing.
4. Grouping only by action-space size may be insufficient. Intersections with the same number of phases can still differ materially. The paper does not explore alternatives such as Mixture-of-Experts gating, meta-policies, or feature-conditioned heads, nor does it present an ablation vs a single shared head with action masking or fully independent per-agent heads.
5. The cumulative waiting-time difference reward is sensible, but with γ=0.999 it may encourage slow-moving credit assignment and potential oscillations without explicit regularizers. The manuscript does not report sensitivity to γ or to reward variants, nor does it discuss signal safety constraints.
6. Reporting focuses on aggregated network metrics. There is no breakdown of per-intersection fairness, throughput/stops, emissions, or stability, which are key for TSC.
Recommendations:
1. Equalize baseline tuning: document hyperparameter searches and training budgets for MADQN/MPLight/FMA2C to ensure fair comparisons. Include multiple seeds with mean±95% CI and statistical tests.
2. For OPW, introduce an explicit padding mask concatenated to inputs or use learned embeddings/positional encodings so the network can distinguish “missing lanes” from true zeros. Evaluate a masked MLP/GNN alternative to reduce padding artifacts.
3. For DMSGL, explore content-aware routing instead of grouping purely by action-space size. Report complexity vs. performance trade-offs.
4. Provide learning curves, wall-clock training time, and inference latency per decision to substantiate claims of scalability/efficiency.
5. Release code, configs, SUMO networks/routes, and seeds, or attach them as supplementary files.

Author Response


 

 

Response to Reviewer 1 Comments

 

1. Summary

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.

 

2. Questions for General Evaluation

Does the introduction provide sufficient background and include all relevant references? Yes
Are all the cited references relevant to the research? Yes
Is the research design appropriate? Can be improved
Are the methods adequately described? Can be improved
Are the results clearly presented? Must be improved
Are the conclusions supported by the results? Yes

3. Point-by-point response to Comments and Suggestions for Authors

 

Comments 1: The paper asserts that existing solutions have “fundamental flaws,” and that the literature “severely lacks” a unified framework for dual heterogeneity. This strong claim would benefit from direct empirical comparisons against the most practical competing strategies, especially parameter-shared policies with action masking and GNN-based encoders with head-specific masking, to show HAPS-PPO’s advantage arises from architecture rather than implementation choices.

Response 1:

Thank you for pointing this out. We agree with this comment. Therefore, we have revised the relevant content (lines 97–110) to ensure the accuracy and scientific rigor of the language. The specific modifications are as follows:

These problems pose several insurmountable obstacles for existing MARL frameworks. First, a policy trained for a specific type of intersection lacks compatibility and transferability due to dimensional mismatch. Second, training a separate model for each intersection type would lead to a proliferation of models and inefficient training; more importantly, it would disregard the universal traffic-flow knowledge shared across intersection types, failing to achieve knowledge sharing. Finally, for rare topologies within the network (e.g., a five-way intersection), data sparsity would make it difficult for the model to be adequately trained and to converge.

To overcome this fundamental obstacle, this paper proposes a learning framework named Heterogeneity-Aware Policy Sharing Proximal Policy Optimization (HAPS-PPO). By incorporating built-in heterogeneity-aware and adaptive mechanisms, this framework ensures training robustness and policy effectiveness in complex heterogeneous environments, making it feasible to apply advanced MARL algorithms to highly heterogeneous real-world urban traffic networks.

 

Comments 2: Baseline configuration/tuning details are thin relative to those for HAPS-PPO. This raises concerns about fairness and reproducibility regarding the baseline comparisons.

Response 2:

Thank you for pointing this out. We agree with this comment. Therefore, we have supplemented the baseline model content by adding textual descriptions and supplementary images for the phase and timing of the fixed-time method, providing a detailed introduction to the MADQN algorithm, and including textual and tabular explanations of its model training hyperparameters. The specific content is supplemented in lines 606–637.

 

Comments 3: Zero-padding heterogeneous observations can mix “real zeros” with “padding zeros,” potentially confounding the policy; no padding mask or positional scheme is described to disambiguate. This invites representation bias and brittle generalization across agents with very different dimensionalities. An ablation isolating OPW’s contribution is missing.

Response 3:

We sincerely thank the reviewer for pointing out this critical methodological issue. We agree that naively padding with zeros can lead to ambiguity between real data and padded values, potentially causing representation bias and hindering the model's generalization. This is a crucial point that we had overlooked. We acknowledge the limitations raised and provide the following responses and planned improvements:

(1)    Regarding the potential confusion between “real zeros” and “padding zeros”:

We agree that zero-padding may introduce ambiguity if the policy network cannot distinguish actual zero values (e.g., zero vehicles) from padded zeros. In our current implementation, we relied on non-negative traffic observations (e.g., queue length, vehicle count) and on padding applied only at the end of the vector. Nevertheless, our algorithm performs acceptably well on the two road networks studied in the paper.

In the future, we will incorporate a binary mask into the observation representation to explicitly indicate valid vs. padded positions. This mask will be concatenated with the padded observation or used in the network’s attention mechanism to ignore padded values.
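As an illustration of this planned mechanism, the following is a minimal sketch of zero-padding combined with an explicit validity mask; the function and variable names are ours for illustration and do not come from the manuscript:

```python
import numpy as np

def pad_with_mask(obs, max_dim):
    """Zero-pad a per-agent observation to the network-wide maximum
    dimension and append a binary validity mask, so the policy can
    distinguish real zeros (e.g., an empty lane) from padding zeros."""
    obs = np.asarray(obs, dtype=np.float32)
    padded = np.zeros(max_dim, dtype=np.float32)
    padded[:obs.shape[0]] = obs
    mask = np.zeros(max_dim, dtype=np.float32)
    mask[:obs.shape[0]] = 1.0          # 1 marks valid entries, 0 marks padding
    return np.concatenate([padded, mask])  # shape: (2 * max_dim,)
```

The doubled input dimension is the cost of disambiguation; alternatively, the mask can drive an attention mechanism rather than being concatenated.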

(2)    Regarding the lack of a masking or positional scheme:

We apologize for not describing such a mechanism in the original manuscript. To address this, in a future revision we plan to integrate a positional encoding scheme (e.g., sinusoidal embeddings) or a mask-aware network architecture (e.g., masked attention in Transformer-based backbones) to help the model disambiguate padded positions.

(3)    Regarding the missing ablation study on OPW:

We agree with the reviewer that an ablation study is essential to experimentally validate the contribution of our approach for handling observation heterogeneity, and we apologize for this omission in the original submission. In fact, we previously conducted an ablation experiment on the OPW: with the OPW removed, the model failed to converge. This in itself demonstrates the necessity and effectiveness of the proposed OPW.

 

Comments 4: Grouping only by action-space size may be insufficient. Intersections with the same number of phases can still differ materially. The paper does not explore alternatives such as Mixture-of-Experts gating, meta-policies, or feature-conditioned heads, nor does it present an ablation vs a single shared head with action masking or fully independent per-agent heads.

Response 4:

We thank the reviewer for raising this important point regarding the granularity of our grouping strategy and the suggested alternative approaches.

(1)    Regarding grouping by action-space size:

We acknowledge that grouping agents solely by the cardinality of their action spaces is a simplified approach. Our primary rationale was to establish a practical and computationally efficient baseline that directly addresses the central heterogeneity challenge: the structural incompatibility of policy-network outputs. While intersections with the same number of phases may differ in other respects (e.g., traffic-flow patterns, topology), action-space dimensionality represents the most critical barrier to parameter sharing in MARL for TSC.
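To make the grouping rule concrete, here is a minimal sketch of DMSGL-style head assignment by action-space cardinality; names are illustrative and this is not the authors' implementation:

```python
from collections import defaultdict

def group_by_action_size(action_sizes):
    """DMSGL-style grouping: agents whose action spaces have the same
    cardinality are routed to one shared policy head, while all groups
    sit on top of a common feature-extraction backbone."""
    groups = defaultdict(list)
    for agent_id, n_actions in action_sizes.items():
        groups[n_actions].append(agent_id)
    return dict(groups)
```

Each key then indexes a dedicated output head sized to that action count, which is what makes parameter sharing across heterogeneous intersections structurally possible.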

(2) Regarding alternative approaches (MoE, meta-policies, etc.):

We agree that more sophisticated conditioning mechanisms, such as Mixture-of-Experts gating or feature-conditioned policy heads, offer promising directions for capturing finer-grained heterogeneity. These approaches represent valuable future research avenues. In this work, we focused on establishing a foundational framework that first solves the fundamental compatibility problem, with the intention that more advanced conditioning schemes can be built upon this foundation in future extensions.

 

Comments 5: The cumulative waiting-time difference reward is sensible, but with γ=0.999 it may encourage slow-moving credit assignment and potential oscillations without explicit regularizers. The manuscript does not report sensitivity to γ or to reward variants, nor does it discuss signal safety constraints.

Response 5:

We appreciate the reviewer's insightful comments regarding our reward function design and the choice of discount factor.

(1)    Regarding γ=0.999 and credit assignment:

We selected a high discount factor (γ=0.999) based on the characteristic timescales of traffic signal control, where actions often have delayed effects that propagate through the network. This choice emphasizes long-term traffic efficiency, which aligns with reducing cumulative congestion. While we acknowledge that high discount factors can slow credit assignment, we note that:

The episodic nature of our traffic simulations (2-hour episodes) provides sufficient time horizons for long-term value propagation.

The waiting-time difference reward provides immediate, dense feedback that mitigates slow credit assignment.

Throughout our experiments we adjusted hyperparameters iteratively, and this tuning showed that γ=0.999 yielded the best results for the model.
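As a rough sanity check on this choice, the rule-of-thumb effective planning horizon 1/(1−γ) can be computed; this is a common heuristic, not a quantity reported in the manuscript:

```python
def effective_horizon(gamma):
    """Rule-of-thumb planning horizon 1/(1 - gamma): roughly the number
    of decision steps over which rewards retain meaningful weight."""
    return 1.0 / (1.0 - gamma)
```

With γ=0.999 this gives about 1000 decision steps, which is consistent with wanting value to propagate across most of a 2-hour episode at typical multi-second decision intervals.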

We want to clarify that our framework incorporates two key mechanisms that function as explicit regularizers to counteract these potential issues.

PPO’s Clipped Surrogate Objective is a core feature of the algorithm, which inherently constrains the policy update step size. This prevents drastic policy changes and thus ensures stable learning, directly acting as a powerful regularizer. As detailed in our hyperparameters (Table 4), we employ an entropy coefficient (α=0.01). This form of entropy regularization explicitly encourages the policy to maintain a degree of randomness, which prevents premature convergence to a suboptimal deterministic policy and further enhances training stability.

The empirical evidence of this stability can be observed in our convergence plots (Figures 5-12 in the manuscript). The smooth convergence trends for policy loss, value loss, and entropy, without significant oscillations, demonstrate that our HAPS-PPO framework successfully mitigates the potential instabilities associated with a high discount factor.
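For concreteness, the clipped surrogate described above can be sketched as follows. This is a generic PPO objective in NumPy, not the authors' code; `eps` is the standard clipping parameter, assumed here at its common default of 0.2:

```python
import numpy as np

def ppo_clipped_objective(ratio, adv, eps=0.2):
    """Generic PPO clipped surrogate objective (to be maximized):
    clipping the probability ratio pi_new/pi_old bounds how far a
    single update can move the policy, acting as a trust region."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.minimum(unclipped, clipped).mean())
```

In practice an entropy bonus (the α=0.01 coefficient mentioned above) is subtracted from the negated objective to keep the policy from collapsing prematurely.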

(2)    Regarding reward variants and regularization:

The cumulative waiting-time difference reward was chosen for its direct relationship with congestion reduction and its proven effectiveness in prior TSC research. It provides a balanced trade-off between immediate feedback and long-term optimization.

We will expand our reward function analysis to include queue-length-based rewards, travel-time-based rewards, multi-objective rewards combining efficiency and sustainability metrics, and a discussion of potential regularization techniques to stabilize learning.

(3)    Regarding signal safety constraints:

We apologize for not explicitly discussing safety constraints in the original manuscript. In our implementation, we enforce several practical constraints in lines 452–455:

Minimum green time (15s) to ensure pedestrian safety and driver expectations.

Maximum green time (60s) to prevent starvation of minor approaches.

Yellow time transitions (3s) between phase changes.
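A hypothetical sketch of how such timing constraints can gate an agent's switching decision follows; the helper name and decision convention are ours, and yellow transitions are typically handled by the simulator's phase program, so they are omitted here:

```python
def constrain_phase_action(proposed_switch, green_elapsed,
                           min_green=15, max_green=60):
    """Apply practical safety constraints to a phase-switch decision:
    hold the current phase until min_green seconds have elapsed, and
    force a switch at max_green to prevent starvation of minor roads."""
    if green_elapsed < min_green:
        return False          # too early to switch: hold current phase
    if green_elapsed >= max_green:
        return True           # force a switch to avoid starvation
    return proposed_switch    # otherwise follow the agent's decision
```

Wrapping the agent's raw action this way keeps the learned policy safe by construction rather than relying on the reward to discourage unsafe timings.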

 

Comments 6: Reporting focuses on aggregated network metrics. There is no breakdown of per-intersection fairness, throughput/stops, emissions, or stability, which are key for TSC.

Response 6:

Thank you for your valuable feedback on our research. Your observation that the current report primarily focuses on overall network metrics while lacking detailed analysis of key traffic signal control (TSC) indicators—such as fairness, volume/stopping ratio, emissions, and stability at each intersection—is very pertinent. This insight provides essential guidance for further refining our paper.

In response to your suggestions, we have incorporated the following additions and analysis into the revised draft:

(1) We modeled the six intersections in the simulation as six agents for traffic signal control and conducted separate simulation evaluations for each.

(2) We have incorporated two additional metrics—average vehicle speed and average carbon dioxide emissions per vehicle—into our detailed comparative analysis. This approach aims to yield more granular simulation evaluation data for assessment purposes. We believe this methodology will provide a more precise reflection of the evaluation outcomes.

(3) We have expanded and refined Table 6 (in line 638) and supplemented it with textual descriptions (in lines 632–637 and 644–645). We have also provided a detailed introduction to the newly added evaluation metrics and included the calculation formulas in lines 503–544 and 603–608.

 

Comments 7: Equalize baseline tuning: document hyperparameter searches and training budgets for MADQN/MPLight/FMA2C to ensure fair comparisons. Include multiple seeds with mean±95% CI and statistical tests.

Response 7:

We sincerely thank you for your meticulous review and valuable comments on our manuscript. Your suggestions are crucial for improving the rigor of our research and the quality of our paper.

We completely agree with your point that ensuring fair comparisons with baseline models and the robustness of experimental results is of utmost importance. Following your suggestions, we have comprehensively enhanced and supplemented our experiments. The specific revisions are as follows:

(1) We have added a detailed introduction to the baseline MADQN model in this paper, specifically covering the model's theoretical principles, hyperparameter tuning, model architecture parameters, and simulation parameter settings. This content is included in lines 616–637 of the text.

(2) We compared the MPLight and FMA2C models using the parameter settings from reference 48; specific parameters can be found in their work. We standardized the training parameters and variables to ensure fairness and obtain comparable results.

(3) Regarding multiple seeds with mean±95% CI and statistical tests: we previously lacked work in this area, which was an oversight on our part. This will be a key focus of our future efforts. Thank you very much for your reminder and suggestions.
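For reference, the mean±95% CI across seeds that the reviewer requests can be computed as follows (normal-approximation half-width; with very few seeds, a t-distribution critical value should replace 1.96):

```python
import math
import statistics

def mean_ci95(samples):
    """Mean and normal-approximation 95% CI half-width across seed runs."""
    m = statistics.mean(samples)
    if len(samples) < 2:
        return m, 0.0
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return m, 1.96 * sem
```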

 

Comments 8: For OPW, introduce an explicit padding mask concatenated to inputs or use learned embeddings/positional encodings so the network can distinguish “missing lanes” from true zeros. Evaluate a masked MLP/GNN alternative to reduce padding artifacts.

Response 8:

Thank you very much for your suggestion. We have provided a detailed response to this matter in our response to Comment 3 above.

 

Comments 9: For DMSGL, explore content-aware routing instead of grouping purely by action-space size. Report complexity vs. performance trade-offs.

Response 9:

Thank you very much for your suggestion. We have provided a detailed response to this matter in our response to Comment 4 above.

 

Comments 10: Provide learning curves, wall-clock training time, and inference latency per decision to substantiate claims of scalability/efficiency.

Response 10:

Thank you for your insightful comment, which is crucial for enhancing the robustness of our scalability and efficiency claims.

First, we would like to note that the learning curves related to training efficiency have been included in the original manuscript (Figures 5–12). Specifically, Figure 5 shows the monotonic convergence of episode rewards (max/min/mean) for HAPS-PPO, confirming stable learning progress; Figures 7–12 illustrate the convergence of policy loss, value loss, and entropy loss across different agent groups, reflecting efficient gradient propagation and exploration-exploitation balance enabled by the shared backbone in DMSGL.

We also admit a current limitation: our scalability validation, while covering two heterogeneous networks (2×3 grid and Cologne 8-intersection network), lacks systematic data on training efficiency and latency in larger-scale heterogeneous networks (e.g., 5×5 or 10×10 grids). We will supplement comparative analyses of training time and inference latency across networks of varying sizes to further substantiate HAPS-PPO’s scalability.

 

Comments 11: Release code, configs, SUMO networks/routes, and seeds, or attach them as supplementary files.

Response 11:

Thank you for your valuable comment. We fully understand the need for supporting materials to enhance the reproducibility of our research. Accordingly, we will upload the config files, SUMO networks/routes, and seeds as supplementary files through the submission system for your reference and verification. Regrettably, we cannot provide the source code at this stage due to considerations related to academic and scientific research intellectual property protection, as it involves unpublished technical details and the research team’s core technical reserves. We sincerely hope for your understanding and are willing to provide any other necessary non-proprietary information to assist with the review process.

 

 

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript introduces a novel framework (HAPS-PPO) that addresses the dual heterogeneity challenge in MARL for traffic signal control. The study is scientifically sound, the methodology is robust, and the results show clear improvements compared to baseline approaches. The topic is relevant and timely, with significant potential for real-world applications.

However, the English expression could be improved for better clarity and fluency. In addition, some figures (especially training convergence plots) would benefit from clearer legends, larger fonts, and more descriptive captions. A schematic diagram comparing HAPS-PPO with other MARL baselines would also help to illustrate the contribution.

Finally, the discussion of limitations and practical implications could be expanded, particularly regarding scalability to larger networks and deployment challenges in real-world contexts. Addressing these points would enhance the clarity, accessibility, and impact of the paper.

Comments on the Quality of English Language

The English is generally clear and understandable, but some sentences are lengthy and complex. Improving fluency, grammar, and transitions would enhance readability and make the manuscript easier to follow. A careful language revision is recommended to ensure smoother expression and more concise phrasing.

Author Response

 

 


Response to Reviewer 2 Comments

 

1. Summary

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the revisions/corrections highlighted/in track changes in the re-submitted files.

2. Questions for General Evaluation

Does the introduction provide sufficient background and include all relevant references? Yes
Are all the cited references relevant to the research? Yes
Is the research design appropriate? Yes
Are the methods adequately described? Yes
Are the results clearly presented? Yes
Are all figures and tables clear and well-presented? Can be improved

3. Point-by-point response to Comments and Suggestions for Authors

 

Comment 1: The English expression could be improved for better clarity and fluency. In addition, some figures (especially training convergence plots) would benefit from clearer legends, larger fonts, and more descriptive captions. A schematic diagram comparing HAPS-PPO with other MARL baselines would also help to illustrate the contribution.

Response 1:

Thank you for your constructive feedback, which is of great significance for improving the readability and expressiveness of our manuscript. We fully agree with your suggestions and will implement the following improvements in the revised version:

(1) We acknowledge that the clarity and fluency of the English expression need to be further polished. To address this issue, we have carefully refined and optimized the language throughout the text to ensure greater clarity and fluency.

(2) We have comprehensively updated and refined our charts—particularly the training convergence plots—to enhance clarity, increase font size, and improve explanatory power. These adjustments lend a more scientific and intuitive quality to the visuals. Additionally, we have incorporated traffic distribution and phase diagrams, enabling readers to grasp traffic flow patterns more intuitively.

 

Comments 2: The discussion of limitations and practical implications could be expanded, particularly regarding scalability to larger networks and deployment challenges in real-world contexts. Addressing these points would enhance the clarity, accessibility, and impact of the paper.

Response 2:

Thank you for your insightful feedback, which is crucial for deepening the academic depth and practical value of our work. We fully agree with your suggestion to expand the discussion on limitations and practical implications, and we will enrich the relevant content in the revised manuscript based on our research foundation and real-world considerations:

(1) We will supplement the "Conclusion and Future Work" section to specifically address scalability to larger networks. Building on the current validation in the 2×3 grid and Cologne 8-intersection networks, we will explicitly discuss potential challenges in ultra-large-scale heterogeneous networks (e.g., 5×5 or 10×10 grids): (a) the gradient aggregation of the shared backbone may face communication latency in distributed training when the number of agents exceeds 50; (b) the DMSGL mechanism, while efficient for small numbers of action-space groups, could encounter memory-overhead issues if the number of heterogeneous action types surges (e.g., >10 types). We will also quantify these preliminary observations (e.g., training-time growth rate with network scale) based on extended simulation tests.

(2) We will connect these discussions to our core innovations: For example, the shared backbone in DMSGL improves sample efficiency and reduces deployment costs (fewer parameters than independent models for each intersection), which is critical for municipal adoption. We will also mention pilot application prospects (e.g., adapting the Fujian University of Technology testbed to a real suburban network) to strengthen the paper’s practical relevance.

 

Comments 3: The English is generally clear and understandable, but some sentences are lengthy and complex. Improving fluency, grammar, and transitions would enhance readability and make the manuscript easier to follow. A careful language revision is recommended to ensure smoother expression and more concise phrasing.

Response 3:

Thank you for your valuable feedback on the language expression of our manuscript. We fully agree with your observations—some sentences are overly lengthy and complex, which may hinder readability.

We have significantly revised the language of the article to ensure it is concise, logically coherent, and more research-oriented.

 

4. Response to Comments on the Quality of the English Language

Point 1: The English could be improved to more clearly express the research.

Response 1:

Thank you for your valuable feedback. We fully agree that the English expression needs refinement to clearly convey our HAPS-PPO research for heterogeneous traffic signal control. We will conduct a meticulous, section-by-section revision of the manuscript: simplifying overly lengthy and complex sentences, standardizing technical terminology (e.g., "heterogeneity-aware", "DMSGL"), enhancing logical transitions between technical points (such as the link between the OPW design and observation standardization), and double-checking grammatical accuracy, all while preserving the precision of the core technical content. This revision will improve readability and make our research findings more accessible. We appreciate your guidance.

 

 

 

Author Response File: Author Response.pdf
