Multi-Agent Deep Reinforcement Learning Cooperative Control Model for Autonomous Vehicle Merging into Platoon in Highway

Chen, Jiajia; Zhu, Bingqing; Zhang, Mengyu; Ling, Xiang; Ruan, Xiaobo; Deng, Yifan; Guo, Ning

doi:10.3390/wevj16040225

Open AccessArticle

Multi-Agent Deep Reinforcement Learning Cooperative Control Model for Autonomous Vehicle Merging into Platoon in Highway

by

Jiajia Chen

¹,

Bingqing Zhu

¹,

Mengyu Zhang

²,

Xiang Ling

¹,

Xiaobo Ruan

¹,

Yifan Deng

³

and

Ning Guo

^1,*

¹

School of Automotive and Transportation Engineering, Hefei University of Technology, Hefei 230009, China

²

Hefei Communication Investment Holding Group Co., Ltd., Hefei 230009, China

³

School of Chang’an-Dublin International College of Transportation, Chang’an University, Xi’an 710064, China

^*

Author to whom correspondence should be addressed.

World Electr. Veh. J. 2025, 16(4), 225; https://doi.org/10.3390/wevj16040225

Submission received: 28 February 2025 / Revised: 3 April 2025 / Accepted: 7 April 2025 / Published: 10 April 2025

(This article belongs to the Special Issue Recent Advances in Autonomous Vehicles)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

This study presents the first investigation into the problem of autonomous vehicle (AV) merging into existing platoons, proposing a multi-agent deep reinforcement learning (MA-DRL)-based cooperative control framework. The developed MA-DRL architecture enables coordinated learning among multiple autonomous agents to address the multi-objective coordination challenge through synchronized control of platoon longitudinal acceleration, AV steering and acceleration. To enhance training efficiency, we develop a dual-layer multi-agent maximum Q-value proximal policy optimization (MAMQPPO) method, which extends the multi-agent PPO algorithm (a policy gradient method ensuring stable policy updates) by incorporating maximum Q-value action selection for platoon gap control and discrete command generation. This method simplifies the training process by using maximum Q-value action policy optimization to learn platoon gap selection and discrete action commands. Furthermore, a partially decoupled reward function (PD-Reward) is designed to properly guide the behavioral actions of both AVs and platoons while accelerating network convergence. Comprehensive highway simulation experiments show the proposed method reduces merging time by 37.69% (12.4 s vs. 19.9 s) and energy consumption by 58% (3.56 kWh vs. 8.47 kWh) compared to existing methods (the quintic polynomial-based + PID (Proportional–Integral–Differential)).

Keywords:

autonomous vehicle; platooning control; deep reinforcement learning; multi-agent systems

1. Introduction

Vehicle platooning has emerged as a research priority in intelligent transportation systems (ITSs), particularly for automated highway applications. This method of traveling significantly improves traffic safety, reduces energy consumption, and brings significant social and economic benefits. However, existing methods face a critical challenge in maintaining coordinated control during AV merging into dynamic platoon scenarios, particularly due to the triadic constraints arising from the complex interaction between AV merging efficiency, safety, and real-time energy optimization demands, a multi-objective control framework that current single-agent approaches fail to provide.

As shown in Figure 1, the merging process into a platoon consists of three sub-problems: (1) platoon longitudinal control, (2) single-vehicle trajectory planning and (3) single-vehicle control. Finding the location of a platoon gap is neglected in this paper, because the issue is relatively simple; existing methods use the proximity principle [1] and the core technical challenges reside in the subsequent phases. The first sub-problem addresses the maintenance of inter-vehicle spacing within the platoon, while the second sub-problem involves synthesizing trajectory planning and lateral–longitudinal control for safe merging maneuvers. Currently, researchers have explored classical control, optimal control, and learning-based methods for platoon and single-vehicle control tasks.

1.1. Platoon Longitudinal Control

Although classical control methods can achieve speed and spacing control of vehicle convoys through basic algorithms, they have limitations in terms of nonlinear scenarios and model dependence: for example, Shaju et al. [2] used dual-loop feedback PID control to effectively reduce the steady-state error and stabilize the vehicle spacing, but it is difficult to cope with nonlinear scenarios (e.g., rapid acceleration/deceleration); the sliding mode control (SMC) proposed by Y. Ying et al. [3] ensures speed stability in nonlinear environments such as curves by strongly robustly suppressing parameter variations and external disturbances, but it needs to rely on the vehicle dynamics model framework (e.g., state equation). Although these methods have the advantages of simple structure and high robustness, respectively, the dependence on models or scenarios still limits their practical applications.

Optimal control frameworks enhance platoon speed–spacing coordination via distributed control with real-time feedback. Distributed leader–follower frameworks such as constrained platooning control proposed by Gaagai et al. [4] show potential for platoon coordination, but their reliance on linearized dynamics and fixed communication topologies limits adaptability to nonlinear disturbances and scalable fleets. Similarly, Distributed Model Predictive Control (DMPC)-based Cooperative Adaptive Cruise Control (CACC) algorithms (e.g., Tapli et al. [5]) significantly improve platoon stability through multi-vehicle collaborative optimization; however, their high computational complexity and strong dependence on vehicle dynamics models lead to limited real-time control performance in dynamic traffic scenarios.

Deep reinforcement learning (DRL) overcomes model dependency by adaptive multi-objective coordination of platoon speed and spacing, and its advantages are reflected in complex scenario adaptation and multi-agent co-optimization, for example, agent-based simulations for multi-vehicle coordination [6] and shared autonomous vehicle platoon modeling [7]. Xu et al. [8] achieved trajectory generation and oscillation suppression of heterogeneous vehicles in a mixed platoon based on multi-agent reinforcement learning to ensure the stability of speed and spacing of vehicles within the platoon; Lin et al. [9] fused dynamic topology sensing and collaborative decision optimization in platoon following control by enhancing multi-agent state representation modeling, which significantly improved the stability of speed control. These methods effectively solved the platoon longitudinal control problem in mixed traffic flow environments by avoiding the reliance on accurate dynamics models.

1.2. Single-Autonomous-Vehicle Merging Control

Although trajectory tracking for a single autonomous vehicle and longitudinal platoon control share a methodological framework, their objectives diverge: the former focuses on eliminating steady-state tracking errors, whereas the latter prioritizes maintaining inter-vehicle platoon distances. Classical methods optimize merging efficiency in fixed scenarios. S. Dasgupta et al. [10] applied PID for constant-distance merging but failed in dynamic spacing.

Optimization-based control frameworks address vehicle merging via spatiotemporal coordination and dynamic decoupling. Reference [11] applied decentralized LQR with V2V communication for trajectory synchronization, requiring low-latency networks. H. Min et al. [12] formulated merging as constrained DMPC to optimize acceleration–spatial trade-offs, minimizing collisions and discomfort but scaling poorly. G. An [13] hybridized ACC-DMPC to suppress jerk during cut-ins, though linear dynamics assumptions limit nonlinear adaptability. These methods balance safety, comfort, and efficiency but face scalability and model dependency limits.

Compared to optimization-based control methods, agent-based deep reinforcement learning demonstrates strong adaptability to dynamic environments, particularly excelling in complex interaction scenarios of mixed traffic flows. For instance, Chen et al. [14] achieved a dynamic balance between safety and efficiency in highway ramp merging scenarios through a multi-agent deep reinforcement learning framework, ensuring the safety of vehicle merging control; Zhou et al. [15] proposed a multi-agent cooperative lane-changing strategy, which uses a distributed reward mechanism to adaptively optimize lane-changing timing and spacing, effectively reducing collision risks in dynamic traffic environments. These approaches are able to cope with complex and dynamic traffic scenarios because they are model-independent and adaptable.

1.3. Single-Autonomous-Vehicle Trajectory Planning

Moreover, many scholars have conducted extensive research on vehicle trajectory planning, which mainly falls into three categories: traditional, optimization, and learning-based approaches.

Traditional trajectory planning methods are widely used in structured environments due to their high computational efficiency and lack of dependence on complex dynamic models. For example, C. Wang et al. [16] improved the A* algorithm by extending direction-constrained nodes, significantly enhancing path search efficiency in grid maps; Li et al. [17] employed the quintic polynomial curves for dynamic lane-changing trajectory planning, generating smooth paths quickly through predefined geometric constraints. However, such methods typically rely on environmental simplifying assumptions (such as static obstacles and fixed traffic rules), making it difficult to handle dynamic interactions in complex traffic scenarios (like sudden obstacle avoidance). This limits their adaptability in unstructured or highly uncertain environments.

Optimized methods enhance planning by improving algorithms and cost functions, achieving smooth trajectories. For instance, K. Bergman et al. [18] combined grid sampling with optimal control to generate kinematically feasible trajectories. However, this approach heavily relies on models and demands high real-time performance and computational complexity, making it challenging to meet complex traffic scenarios.

DRL-based planners bypass explicit modeling by learning adaptive policies from interactions. Hu et al. [19] designed a DRL framework to dynamically adjust trajectories in mixed traffic, balancing collision avoidance and travel efficiency. K. Yang et al. [20] enhanced DQN’s robustness via prioritized experience replay, enabling rapid adaptation to mixed traffic flow environments. DRL methods resolve the environmental adaptation of classical and optimization-based approaches through model-free generalization.

The above research is summarized in Table 1.

In summary, whether it is classical control, optimization-based methods, or learning-based strategies, each has demonstrated certain advantages in their respective application scenarios but also has significant limitations. Specifically, traditional methods often rely on accurate models and complete environmental priorities, making them difficult to adapt to the complex and ever-changing traffic environment. On the other hand, existing DRL methods have made breakthroughs in real-time adaptation and multi-objective coordination, but they face high training complexity and challenges in training efficiency. More importantly, various methods struggle to construct a unified control model for the dynamic process of AVs merging into a platoon, instead breaking down the problem into multiple sub-problems to solve separately. This hierarchical control architecture may lead to a loss of control precision during information flow. Given these challenges, there is an urgent need for an end-to-end method that balances environmental adaptability and global coordination, achieving efficient cooperative control between the platoon and AV while maintaining safety, efficiency, and energy optimization. Moreover, efficient AV merging into platoons not only reduces vehicle merging time but also enhances road safety and promotes coordinated traffic flow.

After years of development, multi-agent reinforcement learning (MARL) has established a comprehensive theoretical framework and algorithmic system, demonstrating its high-efficiency collaborative decision-making capabilities in autonomous driving cooperation and other domains, thereby providing a new paradigm for addressing complex cooperative control problems. In contrast to traditional methods, Multi-Agent DRL (MADRL) enables direct modeling of dynamic game-theoretic interactions between platoon and AV through joint policy learning among distributed agents, effectively mitigating information fragmentation inherent in conventional hierarchical control architectures. Simultaneously, MADRL addresses the limitations of existing DRL approaches through the following core advantages:

Distributed Collaborative Decision-Making: Based on the Centralized Training with Decentralized Execution (CTDE) framework, MADRL allows AV and Platoon Agents to share environmental states while making independent decisions. This architecture captures spatiotemporal correlations in global traffic flow while ensuring real-time control performance.

Adaptive Environment Modeling: Through end-to-end learning of uncertainties in dynamic environments (e.g., intent prediction of surrounding vehicles, communication delays), MADRL eliminates reliance on precise vehicle dynamics models or fixed communication topologies, significantly enhancing robustness in complex scenarios.

To this end, this paper proposes a novel end-to-end cooperative control model based on multi-agent deep reinforcement learning. The framework unifies critical tasks such as platoon longitudinal control and vehicle merging into a holistic formulation, aiming to overcome the rigidity of conventional models, improve environmental adaptability, and resolve multi-objective coordination challenges, thereby enhancing the efficiency and reducing energy consumption of autonomous vehicles merging into platoons in highway environments. This research contributes the following:

1. This study proposes a dual-agent MA-DRL model to address platoon gap selection, AV lateral–longitudinal control, and platoon coordination during merging. It synchronizes AV steering, acceleration and platoon adjustments, resolving coordinated control challenges in complex merging scenarios.

2. We formulated MAMQPPO, a dual-layer DRL algorithm integrating DQN and PPO, which utilizes maximum Q-value action selection to hierarchically learn platoon gap decisions and synchronized AV–platoon control, resolving high-dimensional action space challenges in multi-stage merging while improving training efficiency.

3. We designed a partially decoupled reward (PD-Reward) decomposed into AV, platoon, and energy efficiency sub-functions, addressing multi-objective coordination and computational efficiency while reducing learning complexity and accelerating network convergence.

The organization of this paper is as follows. Section 2 describes the problem of AVs merging into a platoon. Section 3 introduces the control model based on multi-agent deep reinforcement learning. In Section 4, experiments on the model itself and comparative experiments are conducted to evaluate our design. Finally, Section 5 summarizes this work.

2. Problem Formulation

2.1. AVs Merging into Platoon Environment

Currently, electric vehicles, due to their environmental protection and energy-saving advantages, have been widely used in the field of ITSs (intelligent transportation systems). Thanks to the deep compatibility between EV drive systems and autonomous driving architectures, they can more efficiently integrate key hardware components such as high-precision environmental perception sensors, real-time decision-making units, and by-wire actuators, providing reliable technical support for intelligent control in complex traffic scenarios. Therefore, the research object of this paper takes electric vehicles as an example. At this stage, this paper focuses on verifying basic algorithms, primarily targeting good highway conditions and not considering the impact of adverse weather. As shown in Figure 2, the blue vehicles are social vehicles, the red AV is the experimental vehicle, and the yellow vehicles are the autonomous platoon. The red AV wants to merge into the platoon on the highway.

This paper models the problem of merging an AV into a platoon and reducing energy consumption as a multi-dimensional mixed Markov Decision Process (MDP) and can be defined by a 7-tuple ⟨S, A, P, R, Ω, U, γ⟩, where S is the state space, A is the action space,

P (S_{t + 1} = S^{'} | a_{t} = a)

is the state transfer function,

R (s, a) = {[r_{1}, r_{2}, \dots, r_{k}]}^{T}

is the partially decoupled reward function, Ω is the preference space (denoting the combination of the weights of each optimization objective,

w = {[w_{1}, w_{2}, \dots, w_{k}]}^{T} ϵ Ω

, U(s, a, w) is the utility function, and γ

ϵ

(0, 1) is the discount factor. At each time step t, both the AV and the platoon observe and receive the current state as input, and they output the front wheel angle of the AV, the target acceleration of the AV, and the target acceleration of the platoon. These actions are executed in the environment, resulting in the reward for the AV, the reward for the platoon, the energy reward, and the state for the next step.

In a highway environment, the process of an AV merging into a platoon is a complex dynamic process that requires real-time processing of multi-source heterogeneous information. This information includes the AV’s own status (such as speed, acceleration, position, etc.), the status information of the platoon (such as the position of individual vehicles in the platoon, relative speed, relative spacing, etc.), and the real-time changes in the surrounding traffic flow (such as the speed and density of other vehicles). Not only is this information structurally complex and highly dynamic, but there is also a high degree of coupling and mutual influence between them. For example, the merging decision of the AV will directly affect the driving status of the platoon, and the dynamic adjustments of the platoon (such as increasing the spacing or changing the speed) will in turn affect the path planning and control strategy of the AV. This interdependence significantly increases the difficulty of handling the two processes of “increasing the spacing of the platoon” and “AV planning and control for merging” in a combined manner.

2.2. State and Action

To achieve the efficient merging of AVs into a platoon, the AV needs to not only focus on its own status and the status of the surrounding vehicles but also closely monitor the relative positions and the relative dynamics between each AV in the platoon. In addition, the platoon should include its own location status, the status of the surrounding vehicles, and the position status of the red AV in the state space. In any time step t, the AV is traveling at speed V_AV with acceleration A_AV and front wheel angle δ_AV. The platoon and surrounding vehicles are traveling at different speeds (V_plat, V_veh) and accelerations (A_plat, A_veh). Therefore, in any time step t, the state space S_AV of the AV, the state space S_plat of the platoon, and the state space S_veh of the surrounding vehicles are defined as S_AV = {V_AV, A_AV, δ_AV, P_x, P_y}, S_plat = {V_plat, P_x, P_y, A_plat} and S_veh = {V_veh, P_x, P_y, A_veh}, respectively, where V denotes velocity, A denotes acceleration, and P_x and P_y denote longitudinal position and lateral position. The state spaces S_AV, S_plat and S_veh are stored together in a state space dictionary to access during network learning and training.

The control strategy of the AV is based on a continuous action space, Action_Av, which consists of two continuous control variables: longitudinal acceleration (A_AV) and the angle of the front wheels (δ_AV). The platoon, on the other hand, uses a discrete action space {−1, 1} to control its acceleration, where −1 represents deceleration and 1 represents acceleration.

2.3. Reward

During the interaction between the AV and the platoon, the reward function evaluates actions after state acquisition and execution. Actions that facilitate the AV’s integration into the platoon and reduce energy consumption are assigned higher rewards. However, reward function complexity can negatively impact computational efficiency, causing model instability and affecting learning outcomes. Traditional reward functions, typically using a weighted sum of multi-dimensional objectives, fail to clearly reflect the multi-objective requirements of AV–platoon interaction due to scenario uncertainty and differing objectives. This increases network computational complexity and reduces efficiency. Reference [21] proposed a multi-agent credit allocation strategy that can reduce learning complexity and accelerate convergence by constructing a partially decoupled reward mechanism. To address this, this paper proposes a multi-dimensional PD-Reward function, optimizing interactions from perspectives such as AV behavior, platoon state, and energy consumption, specifically, R = (

r_{A V}, r_{p l a t}, r_{E n e r g y}

). However, the AV and platoon reward function design needs to consider factors such as safety, comfort, and AV merging efficiency, and therefore incorporates a weighted summation solution.

2.3.1. AV Reward

When designing the reward function for an AV merging into a platoon, we take into account three core elements: safety, comfort, and efficiency. The safety reward

r_{s a}

is based on the Time To Collision (TTC) relative to the relative velocity and the vehicle dynamics model. The comfort reward

r_{c}

is based on rate of change in acceleration and the smoothness of the front wheel angle. Efficiency is then measured by the total time (t) to successfully merge into the platoon, with a positive reward for fast integration and a negative reward for the opposite. The reward function for AV can be designed as follows:

r_{A V} = \{\begin{matrix} w_{s a} \times r_{s a} + {w_{E} \times r}_{c}, i f n o t s u c c e s s \\ w_{s a} \times r_{s a} + {w_{E} \times r}_{c} - t, i f s u c c e s s \end{matrix}

(1)

Among them, when the vehicle safely merges into the platoon, it is considered a success. The minimum value of the TTC (Time to Collision) when the vehicle cuts in by overtaking is 7.1 s [22].

w_{s a}

and

w_{E}

are the weighting coefficients of safety and comfort, respectively, and can be adjusted according to the actual scenarios and requirements.

2.3.2. Platoon Reward

In the issue of an AV merging into the platoon gap, the platoon reward function designed in this paper takes into account four key factors: collision detection of the platoon, headway distance, speed range, and acceleration range. This is carried out to facilitate the smooth merging of the AV and optimize the overall energy and efficiency of the platoon. Therefore, the formula for the platoon reward function is as follows:

r_{p l a t} = r_{c o l l i} + \sum_{i = 1}^{n} (r_{i_{_} h e a d w a y} \times W_{h e a d} + r_{i_{_} s p e e d} \times W_{s p e e d} + r_{i_{_} a c c} \times W_{a c c})

(2)

r_{p l a t}

is a function that combines these factors to quantify and maximize the efficiency and economy of platoon operation. The n in Equation (2) denotes the number of autonomous vehicles in the platoon;

r_{i_h e a d w a y}

,

r_{i_s p e e d}

and

r_{i_a c c}

denote the reward function of the ith vehicle based on headway distance, speed, and acceleration, respectively, and

W_{h e a d}

,

W_{s p e e d}

and

W_{a c c}

denote the weights coefficients of headway, speed, and acceleration, respectively.

2.3.3. Energy Reward

The energy consumption reward function

r_{E n e r g y}

is based on the vehicle dynamics model, as detailed in the literature [23], which describes the energy consumption formula, and the parameter values of the formula, with the main unit being watts per second

(w \cdot s)

. The negative value of its energy consumption is utilized as a reward–penalty term to promote the vehicle to reduce the energy consumption of the platoon by learning to perform higher rewarding.

f_{E} = b_{0} + b_{1} \times v + b_{2} \times v^{2} + b_{3} \times v^{3} + a \times (c_{0} {+ c}_{1} \times v + c_{2} \times v^{2})

(3)

r_{E n e r g y} = - f_{E}

(4)

Among them,

b_{0}, b_{1}, b_{2}, b_{3}

are used to describe the nonlinear relationship between the energy consumption rate and the motor torque and rotational speed, while

c_{0} {, c}_{1}, a n d c_{2}

are employed to depict the nonlinear relationship between the driving force and the motor torque.

In summary, the multi-dimensional design of the PD-Reward function allows for comprehensive assessment and guidance of AV and platoon behavior, facilitating safer, more efficient, and energy-saving interactions while reducing network computational complexity.

In this paper, we design the PD-Reward function, whose weight coefficients are designed with reference to [24], with preference-based feature learning of the reward function; i.e., we use a scaling model that focuses on energy consumption and safety in the case of the AV and speed in the case of the platoon. The specific parameters of the reward function are shown Table 2.

3. Multi-Agent-Based Deep Reinforcement Learning Coupled Model and Train Details

3.1. Model Structure

3.1.1. Modeling Framework

To address the problem of AVs merging into a platoon, we propose a multi-agent deep reinforcement learning control model integrating the MAMQPPO algorithm and Actor–Critic network, aiming to provide more accurate and efficient target strategies for AVs and platoons.

As shown in Figure 3, the multi-agent deep reinforcement learning model in this paper uses Actor–Critic networks and control techniques, enabling efficient control of AVs and platoons. The model consists of an AV Agent and a Platoon Agent. The input dimension of the Platoon Agent depends on the platoon size (n).

3.1.2. Interaction and Decision-Making Processes

At each time step, the model first collects the status information of the AV, the platoon, and the surrounding environmental vehicles. The AV Agent processes the status of the AV itself, the status of each AV in the platoon, and the information of environmental vehicles that affect the AV’s merging into the platoon. Based on these inputs, the AV Agent predicts the optimal action for the next time step through a multivariate Gaussian distribution and calculates r_AV. Meanwhile, the Platoon Agent receives the AV’s status and the information of environmental vehicles that affect the speed change of the platoon’s lead vehicle. It uses the maximum value function to determine the platoon gap for the AV’s merging and the actions of each vehicle in the platoon {−1, 1}. Combining the current state of the platoon, the Platoon Agent calculates the platoon’s acceleration for the next time step through a Beta distribution and calculates r_plat and r_energy.

As the time steps progress, once the collected data reaches the length set by the experience replay pool, each piece of data contains r_t, which includes r_AV, r_plat, and r_energy. During training, the AV Agent and Platoon Agent select the corresponding reward functions. We then select the specified mini-batch data from the pool for training and update the parameters of the MQMAPPO algorithm. These parameters are subsequently copied to the Actor–Critic network of the multi-agent control model to continue collecting data for the next round.

3.2. Vehicle Dynamics Modeling

In the data collection phase, the state information set is shared, and the AV’s merging control and the platoon’s longitudinal control are carried out in parallel. We refer to this parallel execution process as “pull–insert synchronization”. In order to ensure that network output conforms to vehicle performance requirements, a vehicle dynamics model is integrated into the model, and the constraints on the platoon’s acceleration and the AV’s acceleration and front wheel angle are considered, taking into account the grade factor θ. Below is a detailed analysis of the vehicle dynamics model.

F_{a e r o} = \frac{1}{2} \times ρ \times C_{d} \times A_{F} \times {(v_{x} + v_{w i n d})}^{2}

(5)

F_{j} = m \times a

(6)

F_{f} = m \times g \times μ \times c o s θ

(7)

The

μ

in Equation (7) is the rolling resistance coefficient, which ranges from [0.01, 0.02] [25], utilizing the small-angle assumption.

F_{i} = m \times g \times θ

(8)

Equations (5)–(8) describe the calculation of air resistance, longitudinal force, rolling resistance, and slope resistance, respectively, and the constraints on the longitudinal acceleration of the AV due to resistance can be deduced; see Equation (9).

a = g \times (θ + \cos θ) + \frac{ρ \times C_{d} \times A_{F} \times {(v_{x} - v_{w i n d})}^{2}}{2 m}

(9)

However, Equation (9) does not account for the fact that each vehicle in the platoon experiences different levels of air resistance. The lead vehicle (serial number 0) faces the greatest wind resistance, while the others have resistances dependent on their individual drag coefficients. To accurately model the platoon’s dynamics, we developed an updated formula for the air resistance coefficient, with the head vehicle’s coefficient calculated using Equation (10).

C_{d i} = 0.7231 \times {C_{d i}}_{_}^{0.09199}

(10)

For the other vehicles in the platoon, the air resistance coefficients are updated by Equation (11).

C_{d i} = 0.2241 \times {C_{d i}}_{_}^{0.1369} + 0.5016

(11)

Finally, the longitudinal acceleration of each vehicle in the platoon is constrained by Equation (9). In addition, i represents the serial number of the vehicle in the platoon, C_di denotes the drag coefficient of the ith vehicle at the current time step, and

C_{d i_}

denotes the drag coefficient of the ith vehicle at the previous time step. This design is based on the differences in air resistance faced by vehicles in different positions within the platoon: the head vehicle directly faces the wind resistance, while the drag coefficients of the subsequent vehicles are affected by the wake of the preceding vehicles [26].

Furthermore, regarding the constraint on the AV’s front wheel angle, a lateral dynamics model is employed for description. This model constrains the AV’s front wheel angle by analyzing the relationship between the vehicle’s lateral acceleration a_y, the yaw rate φ, and the lateral forces of the front and rear tires (F_yf, F_yr) [27], as shown in Equation (16).

φ = v_{x} \times \frac{t a n (δ)}{L}

(12)

F_{y f} = 2 \times C_{f} \times (δ - \frac{v_{y} + l_{f} \times φ}{v_{x}})

(13)

F_{y r} = 2 \times C_{r} \times (- (v_{y} - l_{r} \times φ) / v_{x})

(14)

F_{y j} = m \times a_{y}

(15)

δ = \frac{v_{y} + l_{f} \times φ}{v_{x}} + \frac{\frac{{m \times a}_{y}}{2} - C_{r} \times (\frac{- (v_{y} - l_{r} \times φ)}{v_{x}})}{C_{f}}

(16)

Details of the symbolic parameters as well as the values appearing in the above dynamics model are given in Appendix A.

3.3. MAMQPPO Training Algorithm

3.3.1. Structure of the MAMQPPO Algorithm

To address the gap decision and collaborative control challenges in AVs merging into platoons, we employ a dual-layer network architecture that combines the Deep Q-Network (DQN) maximum value function with the proximal policy optimization (PPO) algorithm. This approach aims to simplify the problem, reduce computational complexity in MARL, and enhance the model’s learning efficiency and effectiveness.

As shown in Figure 4, the proposed MAMQPPO algorithm consists of two Actor and two Critic networks, where the AV Actor and Platoon Actor represent the two Actor networks. The AV Actor is a seven-layer network, including an input layer, three hidden layers (with 64, 128, and 64 neurons, respectively), and an output layer, utilizing ReLU and Tanh activation functions to enhance nonlinear processing. The Platoon Actor incorporates the Max-Value function (MaxQ) from the DQN at the output layer to select the maximum-value platoon gap and actions, such as acceleration (1) or deceleration (−1). The action to increase platoon spacing is chosen by comparing energy consumption, enabling efficient merging while maintaining low energy consumption.

The Critic network evaluates the actions from the AV Actor and Platoon Actor and updates policy network parameters via stochastic gradient ascent, which promotes reward-maximizing strategy learning and expedites training. These updates guide the MAMQPPO algorithm to achieve effective learning and decision-making in multi-agent environments.

During the training process, the algorithm first samples the trajectory data {

{s_{n, t}, a_{n, t}, r}_{n, t}, s_{n, t + 1}

} from the replay buffer and inputs the states

s_{n, t}

into the AV Actor and Platoon Actor networks to generate the action

a_{n, t}

, respectively; evaluates the action by Q-value through the dual Critic network (Q_AV and Q_Platoon); optimizes the parameters of the policy network based on the temporal difference error; and finally ensures the stability of multi-intelligence cooperative control through the policy gradient update mechanism. See the pseudo-code for the exact steps.

3.3.2. MAMQPPO Pseudocode

The pseudo-code of the algorithm is described below Algorithm 1.

Algorithm 1. MAMQPPO

Initialize the AV Actor and Platoon Actor networks,

π_{θ} (a| s)

and critic networks,

v_{w} (s_{t}),

with weights

μ,

and

θ

, w.
Initialize the batch size B, iteration steps α and β, and done = 0.
Sample data of the specified length from the Buffer to obtain {

{s_{n, t}, a_{n, t}, r}_{n, t}, s_{n, t + 1}

}.
Retrieve state information s_t from the database.
for episode = 0, 1, 2, … until convergence done = 1
while not done = 1
Initialize the parameters of AV Actor, Platoon Actor, and critic networks.
AV Actor:
i = 2 (two-dimensional action space)

{a^{i}}_{t, m e a n}, {a^{i}}_{t, v a r} = π (s_{t}| μ)

C O V (x_{1}, x_{2}) = E (x_{1} - {a^{1}}_{t, m e a n}) (x_{2} - {a^{2}}_{t, m e a n})

Σ = (\begin{matrix} {a^{1}}_{t, v a r} & C O V (x_{1}, x_{2}) \\ C O V (x_{1}, x_{2}) & {a^{2}}_{t, v a r} \end{matrix})

Sample A_AV (a, δ) from the multivariate Gaussian distribution

N ({a^{i}}_{t, m e a n}, Σ)

v_{t} = V (s_{t}; θ)

Save the action strategy

(a, δ)

Platoon Actor:
n is the output dimension of the platoon
Select platoon gap
J =

π (s_{t}| μ)

,
Choose Discrete action
action = max_aQ(

s_{t}, a; θ

), and action = {−1, 1},
compute

a

_plat
Sampling

a

_plat from normal distribution Beta,
if action = 1
the range is [0, 3] m/s²
else
the range is [−3, 0] m/s²
A_plat = {J,

a

_plat}
Send

a

_t = {A_AV, A_plat}
Compute the advantage function and return,

r_{t}

= {

r_{A V}, r_{p l a t}, r_{e n e r g y}

}
end for
for

f o r k = 0, \dots, M A M Q P P O e p o c h

perform
Compute

J (θ)

Update the network parameters using a gradient method.
Update the platoon reward

\begin{matrix} R_{p l a t} = r_{p l a t} + r_{E n e r g y} \end{matrix}

Update the AV reward

\begin{matrix} R_{A V} = r_{A V} + r_{E n e r g y} \end{matrix}

Calculate the accumulated reward

\begin{matrix} u_{t} = \sum_{k = t}^{n} γ^{k - t} \cdot R_{p l a t, k} + m a x (s_{k}, a_{t}) \end{matrix}

\begin{matrix} u_{t} = \sum_{k = t}^{n} γ^{k - t} \cdot R_{A V, k} \end{matrix}

Calculate the policy gradient

\begin{matrix} \nabla J (θ) = \sum_{t = 1}^{n} γ^{t - 1} \cdot u_{t} \cdot \nabla_{θ} \ln π (a_{t}| s_{t} : θ_{n o ω}) \end{matrix}

Update the Actor parameters

\begin{matrix} θ_{n e w} = θ_{n o ω} + β \cdot \nabla J (θ) \end{matrix}

Calculate the target value

\begin{matrix} y_{t} = r_{t} + γ \cdot v_{t + 1} \end{matrix}

Update the Critic parameters

\begin{matrix} w_{n e w} = w_{n o w} - α \times \nabla_{w} (r - v \times (s_{t}, a_{t}; w_{n o w})) \end{matrix}

end for

4. Experimental Analysis

4.1. Simulation Experiment

Reinforcement learning requires simulation environments for policy training and learning. In this study, a Python-based simulation mimics a realistic highway with two 2000 m parallel lanes, each 3.5 m wide, occupied by platoons of n trucks (7.8 m each). Experiments were conducted on an Intel i9-10900K CPU (Hefei, China) and NVIDIA RTX 3090 GPU (Hefei, China) setup; each training session lasts approximately 6–8 h. Parameters include n ∈ [4, 8], a deceleration command (−1) within [−3, 0], and acceleration within [0, 3].

At the start of the experiment, the initial states of AV, platoon, and environmental vehicles were detailed in Attachment 2. According to the literature [28], traffic flow states can be classified based on vehicle spacing thresholds: free flow (≥40 m), synchronized flow (20–40 m), and congested flow (≤20 m). To enhance algorithm generalization, we used three sets of random seeds to simulate free (10 vehicles, vehicle spacing = 100 m), synchronized (30 vehicles, vehicle spacing = 33.3 m), and congested (50 vehicles, vehicle spacing = 20 m) traffic flow conditions. Experimental results, as shown in Figure 5, Figure 6, Figure 7 and Figure 8, display reward return, value loss, merging time, energy consumption, success rate, average headway distance, and average speed after merging.

As shown in Figure 5 and Figure 6, the network model converges when training iterations reach 2 × 10⁶, entering a stable state with an average reward of 1.15 × 10⁵ and value loss stabilizing at 15.3. To holistically evaluate the control framework, we quantify performance through the following:

1. Merging time (Figure 7a): The merging time converges to 13.5 s ± 0.3 s during training, although there is a brief fluctuation of 0.2 s in the 100–200 interval of the training steps, but ultimately a stable convergence is achieved through the optimization of the strategy.

2. Headway Stability (Figure 7b): Post-merging headway stabilizes at 15.5 m ± 0.4 m, demonstrating robust spacing control despite initial oscillations (±1.3 m).

3. Energy Efficiency (Figure 7c): Sustained power consumption at 4.2 kW

\cdot

h, 18% lower than baseline CACC systems [29].

4. Speed Coordination (Figure 7d): AV velocity converges to 22.5 m/s (max_target: 25 m/s) with <10% deviation, ensuring seamless platoon synchronization.

5. Algorithmic Convergence (Figure 8): The PD-Reward function achieves plateaued returns at 1.5 × 10³ steps—37% faster than standard rewards (2.2 × 10³ steps)—validating accelerated learning for dynamic merging.

4.2. Comparison Experiment

To comprehensively verify the advantages of the proposed control model in AV merging and platoon longitudinal control, we compare the MAMQPPO algorithm with leading reinforcement learning algorithms—Soft Actor–Critic (SAC), Twin Delayed Deep Deterministic Policy Gradient (TD3), Deep Deterministic Policy Gradient (DDPG), and classic PPO—under a well-controlled traffic flow simulation environment and a unified reward framework. This ensures the fairness and reproducibility of the experiments. The results, including return, energy consumption, and merging time, are shown in Figure 9 and Figure 10.

The five colors, orange, green, blue, purple and gray, denote the DDPG, SAC, MAMQPPO, TD3, and PPO algorithms, respectively. For exploring the application of different RL algorithms in AVs merging into a platoon, Figure 9 and Figure 10 visualize the dynamic changes of MAMQPPO, TD3, DDPG, SAC and the classical PPO algorithms during the training process, which provides a strong basis for in-depth understanding of the performance of each algorithm.

According to Table 3, the MAMQPPO algorithm demonstrates superior performance in reward return, success rate, energy consumption, and merging efficiency. In reward return, MAMQPPO surpasses TD3, DDPG, and SAC by an order of magnitude and outperforms PPO by 43.97%. The success rate of MAMQPPO is significantly higher than TD3, SAC, and DDPG, with a 27.40% improvement over PPO. Additionally, MAMQPPO achieves lower energy consumption than PPO and reduces it by 29.47% (TD3), 29.91% (DDPG), and 29.80% (SAC). Furthermore, MAMQPPO exhibits higher merging efficiency compared to PPO and DDPG, with a 30.68% increase over TD3 and 78.21% over SAC. Overall, MAMQPPO excels in solving the AV merging problem, achieving high efficiency and low energy consumption.

In addition, comparative experiments are conducted to further validate the performance advantages of the model proposed in this paper. Specifically, the method proposed in this paper is applied to the AV merge-in-platoon problem and its performance is compared with the quintic polynomial-based control model and the PID (Proportional–Integral–Differential) control model proposed in the literature [30]. In the polynomial-based and PID control model, the PID parameters were set as

k_{p}

,

k_{i}

and

k_{d}

= 4, 0.1, 4.1, with the platoon’s longitudinal behavior controlled by CACC (Cooperative Adaptive Cruise Control). The experimental conclusions regarding AV merging time and average energy consumption are presented in Figure 11 and Figure 12, respectively.

Based on the information from Figure 11 and Figure 12, Table 4, the following detailed experimental results were obtained: Both models successfully completed the task of the AV merging into a platoon. The MAMQPPO method demonstrated significant advantages over the polynomial planning and PID control model in terms of energy consumption, achieving reductions of 58%. Additionally, in the key indicator of merging efficiency, the MAMQPPO method also showed a 37.69% increase compared to the polynomial planning and PID control model. Therefore, MAMQPPO is a more effective solution for AVs merging into platoons compared to traditional control models.

5. Conclusions

This study proposes an innovative multi-agent deep reinforcement learning cooperative control model, which is designed to enhance the overall efficiency of AVs merging into platoons and platoon reorganization processes, while ensuring merging safety and reducing overall energy consumption. By introducing the PD-Reward mechanism, the network convergence is accelerated. Additionally, the MAMQPPO algorithm integrates the maximum Q-value policy from DQN, which effectively reduces the task complexity and enhances both learning efficiency and policy stability.

In simulated highway environments with varying traffic flows, the model was thoroughly trained and tested. Experimental results demonstrate significant advantages of the proposed control model in key performance indicators, including reward return, energy consumption, training success rate, average vehicle spacing after merging, and merging time. These outcomes highlight the model’s excellent performance in AV platoon control, effectively improving efficiency and reducing energy consumption while ensuring safety.

The limitations of this study are threefold: first, model validation is based solely on simulated highway scenarios and does not cover the complex dynamic environment of urban traffic; second, multi-vehicle coordinated merging uses a phased serial strategy, lacking parallel decision optimization, which limits reorganization efficiency; third, it does not consider the interference of adverse weather conditions such as rain and snow on sensor. Future work will focus on (1) integrating real platoon traffic data to enhance generalization capabilities in complex scenarios; (2) developing a multi-vehicle parallel merging mechanism based on game theory or swarm reinforcement learning to optimize global decision-making efficiency; and (3) incorporating weather disturbance factors and robust control modules to improve model reliability in extreme environments.

Author Contributions

Methodology, J.C. and X.R.; Validation, B.Z.; Investigation, X.R. and Y.D.; Writing—original draft, B.Z.; Writing—review & editing, J.C., M.Z., X.L., Y.D. and N.G.; Supervision, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Nature Science Foundation of China, Grant No.51805133, and the Innovation Project of New Energy Vehicle and Intelligent Connected Vehicle of Anhui Province.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

Author Mengyu Zhang was employed by Hefei Communication Investment Holding Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Parameter Name	Interpretation	Value	Unit
$F_{a e r o}$	Atmospheric drag	--	N
$F_{j}$	Vehicle longitude power	--	N
$F_{f}$	Frictional resistance	--	N
$F_{i}$	Gradient resistance	--	N
$F_{y f}$	Front wheel lateral force	--	N
$F_{y r}$	Rear wheel lateral force	--	N
$F_{y j}$	Vehicle lateral force	--	N
$φ$	Lateral angular velocity	--	$r a d / s$
$δ$	Front wheel angle	--	$r a d$
$a_{y}$	Lateral acceleration	--	$m / s^{2}$
$L$	Wheelbase	7.8	$m$
$C_{f}$	Front wheel lateral deflection stiffness	15,000	$N / r a d$
$C_{r}$	Rear wheel lateral deflection stiffness	14,000	$N / r a d$
$l_{r}$	Vehicle rear axle	2.75	$m$
$l_{f}$	Vehicle front axle	3.55	$m$
$v_{y}$	Vehicle lateral speed	--	$m / s$
$ρ$	Air density	1.28	$k g / m^{3}$
$C_{d}$	Wind resistance	0.564	--
$A_{F}$	Windward side	5.8	$m^{2}$
$v_{x}$	Vehicle longitude speed	--	$m / s$
$v_{w i n d}$	Wind speed	5.0	$m / s$
$m$	Vehicle quality	20,000	$k g$
$a$	Vehicle longitude acceleration	--	$m / s^{2}$
$θ$	Slope angle	--	$r a d$
$g$	Gravitational acceleration	9.8	$k g \cdot m / s^{2}$
$u_{t}$	Return	--	--
$γ$	Discount rate	0.05	--
$β$	Learning rate	$1 \times 10^{- 5}$	--
$α$	Learning rate	$1 \times 10^{- 5}$	--
$v$	Value of Critic networks	--	--
$ω$	Weight of Critic networks	--	--
$R$	Reward	--	--
$y_{t}$	Target value	--	--
$r_{t}$	The immediate reward for t	--	--
$θ_{n e w}$	Updated value network parameters	--	--
$θ_{n o ω}$	Current value network parameters	--	--
$\nabla_{w}$	The gradient of w in value network	--	--
W	Width of AV	2.8	m

Appendix B

ID	$v_{x}$ (m/s)	$v_{y}$ (m/s)	$a_{x}$ (m/s²)	$a_{y}$ (m/s²)	$x$ (m)	$y$ (m)	$δ$ (rad)
AV	23	0	0	0	70	1.75	0
EV	Random (17, 25)	0	Random (−3, 3)	0	Random (0, 1000)	1.75 or 5.25	0
PV0	22	0	0	0	0	5.25	0
PV1	22	0	0	0	20	5.25	0
PV2	22	0	0	0	40	5.25	0
PV3	22	0	0	0	60	5.25	0
PV4	22	0	0	0	80	5.25	0
PV5	22	0	0	0	100	5.25	0
PV6	22	0	0	0	120	5.25	0
PV7	22	0	0	0	140	5.25	0

References

Ding, J.; Li, L.; Peng, H.; Zhang, Y. A rule-based cooperative merging strategy for connected and automated vehicles. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3436–3446. [Google Scholar] [CrossRef]
Shaju, A.; Southward, S.; Ahmadian, M. PID-Based Longitudinal Control of Platooning Trucks. Machines 2023, 11, 1069. [Google Scholar] [CrossRef]
Ying, Y.; Mei, T.; Song, Y.; Liu, Y. A Sliding Mode Control approach to longitudinal control of vehicles in a platoon. In Proceedings of the 2014 IEEE International Conference on Mechatronics and Automation, Tianjin, China, 3–6 August 2014; pp. 1509–1514. [Google Scholar]
Gaagai, R.; Horn, J. Distributed Predecessor-Follower Constrained Platooning Control of Linear Heterogeneous Vehicles. In Proceedings of the 2024 UKACC 14th International Conference on Control (CONTROL), Winchester, UK, 10–12 April 2024; pp. 274–280. [Google Scholar]
Tapli, T.; Akar, M. Cooperative Adaptive Cruise Control Algorithms for Vehicular Platoons Based on Distributed Model Predictive Control. In Proceedings of the 2020 IEEE 16th International Workshop on Advanced Motion Control (AMC), Kristiansand, Norway, 14–16 September 2020; pp. 305–310. [Google Scholar]
Huang, Z.; Chu, D.; Wu, C.; He, Y. Path planning and cooperative control for automated vehicle platoon using hybrid automata. IEEE Trans. Intell. Transp. Syst. 2018, 20, 959–974. [Google Scholar] [CrossRef]
Sala, M.; Soriguera, F. Macroscopic modeling of connected autonomous vehicle platoons under mixed traffic conditions. Transp. Res. Procedia 2020, 47, 163–170. [Google Scholar] [CrossRef]
Xu, Y.; Shi, Y.; Tong, X.; Chen, S.; Ge, Y. A Multi-Agent Reinforcement Learning Based Control Method for CAVs in a Mixed Platoon. IEEE Trans. Veh. Technol. 2024, 73, 16160–16172. [Google Scholar] [CrossRef]
Lin, H.; Lyu, C.; He, Y.; Liu, Y.; Gao, K.; Qu, X. Enhancing State Representation in Multi-Agent Reinforcement Learning for Platoon-Following Models. IEEE Trans. Veh. Technol. 2024, 73, 12110–12114. [Google Scholar] [CrossRef]
Dasgupta, S.; Raghuraman, V.; Choudhury, A.; Teja, T.N.; Dauwels, J. Merging and splitting maneuver of platoons by means of a novel PID controller. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–8. [Google Scholar]
Liu, H.; Zhuang, W.; Yin, G.; Tang, Z.; Xu, L. Strategy for heterogeneous vehicular platoons merging in automated highway system. In Proceedings of the Chinese Control And Decision Conference, Shenyang, China, 9–11 June 2018. [Google Scholar]
Min, H.; Yang, Y.; Fang, Y.; Sun, P.; Zhao, X. Constrained Optimization and Distributed Model Predictive Control-Based Merging Strategies for Adjacent Connected Autonomous Vehicle Platoons. IEEE Access 2019, 7, 163085–163096. [Google Scholar] [CrossRef]
An, G.; Talebpour, A. Vehicle Platooning for Merge Coordination in a Connected Driving Environment: A Hybrid ACC-DMPC Approach. IEEE Trans. Intell. Transp. Syst. 2023, 24, 5239–5248. [Google Scholar] [CrossRef]
Chen, D.; Hajidavalloo, M.R.; Li, Z.; Chen, K.; Wang, Y.; Jiang, L.; Wang, Y. Deep Multi-Agent Reinforcement Learning for Highway On-Ramp Merging in Mixed Traffic. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11623–11638. [Google Scholar] [CrossRef]
Zhou, W.; Chen, D.; Yan, J.; Li, Z.; Yin, H.; Ge, W. Multi-agent reinforcement learning for cooperative lane changing of connected and autonomous vehicles in mixed traffic. Auton. Intell. Syst. 2022, 2, 5. [Google Scholar] [CrossRef]
Wang, C.; Wang, L.; Qin, J.; Wu, Z.; Duan, L.; Li, Z.; Cao, M.; Ou, X.; Su, X.; Li, W.; et al. Path planning of automated guided vehicles based on improved A-Star algorithm. In Proceedings of the 2015 IEEE International Conference on Information and Automation, Lijiang, China, 8–10 August 2015; pp. 2071–2076. [Google Scholar]
Li, Y.; Li, L.; Ni, D. Dynamic trajectory planning for automated lane changing using the quintic polynomial curve. J. Adv. Transp. 2023, 2023, 6926304. [Google Scholar] [CrossRef]
Bergman, K.; Ljungqvist, O.; Axehill, D. Improved Path Planning by Tightly Combining Lattice-Based Path Planning and Optimal Control. IEEE Trans. Intell. Veh. 2021, 6, 57–66. [Google Scholar] [CrossRef]
Hu, H.; Wang, Y.; Tong, W.; Zhao, J.; Gu, Y. Path Planning for Autonomous Vehicles in Unknown Dynamic Environment Based on Deep Reinforcement Learning. Appl. Sci. 2023, 13, 10056. [Google Scholar] [CrossRef]
Yang, K.; Liu, L. An Improved Deep Reinforcement Learning Algorithm for Path Planning in Unmanned Driving. IEEE Access 2024, 12, 67935–67944. [Google Scholar] [CrossRef]
Kapoor, A.; Freed, B.; Choset, H.; Schneider, J. Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization. arXiv 2024, arXiv:2408.04295. [Google Scholar]
Zhou, J.; Tkachenko, P.; del Re, L. Gap Acceptance Based Safety Assessment Of Autonomous Overtaking Function. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2113–2118. [Google Scholar]
Li, M.; Cao, Z.; Li, Z. A Reinforcement Learning-Based Vehicle Platoon Control Strategy for Reducing Energy Consumption in Traffic Oscillations. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5309–5322. [Google Scholar] [CrossRef] [PubMed]
Mindermann, S.; Shah, R.; Gleave, A.; Hadfield-Menell, D. Active inverse reward design. arXiv 2018, arXiv:1809.03060. [Google Scholar]
Gillespie, T.D. Tire-Road Interaction. In Fundamentals of Vehicle Dynamics; Society of Automotive Engineers: Warrendale, PA, USA, 1992; pp. 53–110. [Google Scholar]
Jiang, T.; Shi, Y.; Xu, P.; Li, Z.; Wang, L.; Zhang, W. Simulation of wind resistance and calculation of fuel saving rate of heavy truck formation driving. Intern. Combust. Engine Powerpl. 2022, 39, 81–85. [Google Scholar]
Di, H.-Y.; Zhang, Y.-H.; Wang, B.; Zhong, G.; Zhou, W. A review of research on lateral control models and methods for autonomous driving. J. Chongqing Univ. Technol. Nat. Sci. 2021, 35, 71–81. [Google Scholar]
Kerner, B.S.; Rehborn, H. Experimental properties of phase transitions in traffic flow. Phys. Rev. Lett. 1997, 79, 4030. [Google Scholar] [CrossRef]
Li, J.; Chen, C.; Yang, B.; He, J.; Guan, X. Energy-Efficient Cooperative Adaptive Cruise Control for Electric Vehicle Platooning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 4862–4875. [Google Scholar] [CrossRef]
Goli, M.; Eskandarian, A. Evaluation of lateral trajectories with different controllers for multi-vehicle merging in platoon. In Proceedings of the 2014 International Conference on Connected Vehicles and Expo (ICCVE), Vienna, Austria, 3–7 November 2014; pp. 673–678. [Google Scholar]

Figure 1. Merging into platoon course.

Figure 2. Merging environment.

Figure 3. Multi-agent based deep reinforcement learning control model.

Figure 4. MAMQPPO algorithm framework.

Figure 5. Return.

Figure 6. Value loss.

Figure 7. Evaluation indicator charts ((a) merging time, (b) vehicle spacing, (c) energy consumption, (d) speed).

Figure 8. Return convergence effect.

Figure 9. RL compared return.

Figure 10. Evaluation indicator charts ((a) energy consumption, (b) merging time, (c) training success rate).

Figure 11. Compared control merging time.

Figure 12. Compared control consumption.

Table 1. Literature review.

Control
Method	Applicable Environment	Advantages	Disadvantages	Study on Platoon	Study on AV
classical control	For static or simple scenarios where the vehicle dynamics model is known and relatively accurate.	The algorithm structure is simple, and the controller is mature and well-developed.	Less adaptable to nonlinear, dynamically changing traffic environments.	[2,3]	[10]
optimal control	Suitable for scenarios requiring precise control and balancing of multi-objectives (such as platoon coordination and merging control).	Optimize safety, comfort, and efficiency simultaneously by using mathematical modeling and cost function design to achieve optimal performance.	More rigid model assumptions, strong reliance on a priori information, high computational complexity, limited real-time adaptability.	[4,5]	[11,12,13]
learning-based	Suitable for dynamic, variable, nonlinear environments (e.g., complex interactions, heterogeneous traffic).	Model liberalization, adaptive to complex scenes, strong multi-objective coordination, good real-time feedback.	Convergence and stability challenges remain.	[6,7,8,9]	[14,15]
Trajectory planning
classical	Suitable for structured maps	The method is mature and highly stable.	Difficult to handle dynamic obstacles and complex interactions.	-	[16,17]
optimal	Dynamic obstacle avoidance and real-time adjustment.	Smoother trajectory for comfort	High computational complexity and real-time dependence on simplified models.	-	[18]
learning-based	Dynamic and complex traffic scenarios.	Adaptation to high-dimensional state spaces and complex interactions.	Less interpretable, longer training time.	-	[19,20]

Table 2. Reward parameter weights.

Reward Weight
Name	Value
$w_{s a}$	1.2
$w_{E}$	1.5
$w_{h e a d}$	0.9
$w_{s p e e d}$	1.5
$w_{a c c}$	1.1
$b_{0}$	0.1569
$b_{1}$	0.0245
$b_{2}$	$- 7.415 \times 10^{- 4}$
$b_{3}$	$5.975 \times 10^{- 5}$
$c_{0}$	0.07224
$c_{1}$	0.09681
$c_{2}$	0.001075

Table 3. RL comparison data.

Algorithm	Return	Success Rate (%)	Consumption (W·s)
TD3	$2.11 \times 10^{4}$	13.5	391.5
DDPG	$1.29 \times 10^{4}$	45.3	393.2
SAC	$7.01 \times 10^{4}$	2.37	392.8
$P P O$	$1.41 \times 10^{5}$	17.4	694.4
MAMQPPO	$1.91 \times 10^{5}$	62.4	276.1

Table 4. Compared control data.

Algorithm	Merging Time (s)	Consumption (W)
Polynomial + PID + CACC	19.9	425.6
MAMQPPO	12.4	288.3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Published by MDPI on behalf of the World Electric Vehicle Association. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, J.; Zhu, B.; Zhang, M.; Ling, X.; Ruan, X.; Deng, Y.; Guo, N. Multi-Agent Deep Reinforcement Learning Cooperative Control Model for Autonomous Vehicle Merging into Platoon in Highway. World Electr. Veh. J. 2025, 16, 225. https://doi.org/10.3390/wevj16040225

AMA Style

Chen J, Zhu B, Zhang M, Ling X, Ruan X, Deng Y, Guo N. Multi-Agent Deep Reinforcement Learning Cooperative Control Model for Autonomous Vehicle Merging into Platoon in Highway. World Electric Vehicle Journal. 2025; 16(4):225. https://doi.org/10.3390/wevj16040225

Chicago/Turabian Style

Chen, Jiajia, Bingqing Zhu, Mengyu Zhang, Xiang Ling, Xiaobo Ruan, Yifan Deng, and Ning Guo. 2025. "Multi-Agent Deep Reinforcement Learning Cooperative Control Model for Autonomous Vehicle Merging into Platoon in Highway" World Electric Vehicle Journal 16, no. 4: 225. https://doi.org/10.3390/wevj16040225

APA Style

Chen, J., Zhu, B., Zhang, M., Ling, X., Ruan, X., Deng, Y., & Guo, N. (2025). Multi-Agent Deep Reinforcement Learning Cooperative Control Model for Autonomous Vehicle Merging into Platoon in Highway. World Electric Vehicle Journal, 16(4), 225. https://doi.org/10.3390/wevj16040225

Article Menu

Multi-Agent Deep Reinforcement Learning Cooperative Control Model for Autonomous Vehicle Merging into Platoon in Highway

Abstract

1. Introduction

1.1. Platoon Longitudinal Control

1.2. Single-Autonomous-Vehicle Merging Control

1.3. Single-Autonomous-Vehicle Trajectory Planning

2. Problem Formulation

2.1. AVs Merging into Platoon Environment

2.2. State and Action

2.3. Reward

2.3.1. AV Reward

2.3.2. Platoon Reward

2.3.3. Energy Reward

3. Multi-Agent-Based Deep Reinforcement Learning Coupled Model and Train Details

3.1. Model Structure

3.1.1. Modeling Framework

3.1.2. Interaction and Decision-Making Processes

3.2. Vehicle Dynamics Modeling

3.3. MAMQPPO Training Algorithm

3.3.1. Structure of the MAMQPPO Algorithm

3.3.2. MAMQPPO Pseudocode

4. Experimental Analysis

4.1. Simulation Experiment

4.2. Comparison Experiment

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A

Appendix B

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI