Article

Deep Reinforcement Learning-Based Multi-Agent System with Advanced Actor–Critic Framework for Complex Environment

1
College of Information Science and Technology, Donghua University, Shanghai 201620, China
2
Engineering Research Center of Digitized Textile and Apparel Technology, Ministry of Education, Shanghai 201620, China
3
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
*
Author to whom correspondence should be addressed.
This author is the second contributor to this work.
Mathematics 2025, 13(5), 754; https://doi.org/10.3390/math13050754
Submission received: 23 January 2025 / Revised: 13 February 2025 / Accepted: 24 February 2025 / Published: 25 February 2025
(This article belongs to the Special Issue Application of Machine Learning and Data Mining, 2nd Edition)

Abstract

The development of artificial intelligence (AI) game agents that use deep reinforcement learning (DRL) algorithms to process visual information for decision-making has emerged as a key research focus in both academia and industry. However, previous game agents have struggled to execute multiple commands simultaneously in a single decision, failing to accurately replicate the complex control patterns that characterize human gameplay. In this paper, we utilize the ViZDoom environment as the DRL research platform and transform the agent–environment interactions into a Partially Observable Markov Decision Process (POMDP). We introduce an advanced multi-agent DRL framework, Multi-Agent Proximal Policy Optimization (MA-PPO), designed to optimize target acquisition while operating within defined ammunition and time constraints. In MA-PPO, each agent handles a distinct parallel task with a custom reward function for performance evaluation. The agents make independent decisions while simultaneously executing multiple commands to mimic human-like gameplay behavior. Our evaluation compares MA-PPO against other DRL algorithms, showing a 30.67% performance improvement over the baseline algorithm.

1. Introduction

Recent advancements in fundamental physics have been facilitated by the implementation of artificial neural networks and machine learning methodologies [1]. In parallel, AlphaFold, an innovative system integrating deep learning and reinforcement learning methodologies, was honored with the Nobel Prize in Chemistry for its revolutionary advancements in the field of protein structure prediction [2]. The progression from AlphaStar and AlphaGo’s notable successes in gaming to AlphaFold’s groundbreaking applications in biology exemplifies the versatility of deep reinforcement learning (DRL) technology [3]. DRL, which integrates neural networks with reinforcement learning principles, has not only demonstrated remarkable efficacy in gaming scenarios but also continues to significantly advance various domains of scientific research [4,5].
DRL has attracted substantial attention across both industry and academia [6], demonstrating its capability as a model-free methodology for addressing complex decision-making challenges in industrial environments [7]. AlphaGo, an artificial intelligence (AI) system for the ancient game of Go, achieved a historic milestone by defeating human Go champions [8]. The system utilizes a sophisticated dual-phase approach, initially employing supervised learning on human data before transitioning to reinforcement learning through self-play iterations. Leveraging the powerful function approximation capabilities of deep neural networks, DRL agents demonstrate remarkable proficiency in acquiring knowledge through interactions with unknown, information-limited environments, ultimately leading to robust real-time decision-making. Building upon the exceptional adaptability of DRL agents in complex environments, DRL-based algorithms have demonstrated significant value across multiple domains, including intraday scheduling for home energy systems [9], flight trajectory optimization for unmanned aerial vehicles [10], and energy trading between electric vehicles and power grids [11]. The foundation for these achievements is the extensive testing and validation of DRL frameworks in diverse gaming environments and simulators, which has established DRL as an advanced and effective paradigm [12,13].
The journey commenced with DeepMind’s groundbreaking implementation of DRL technology in Atari games [14], where artificial intelligence agents demonstrated exceptional capabilities in mastering and subsequently surpassing human performance levels [15]. While AlphaStar showcased unprecedented mastery by outperforming 99.8% of human players in StarCraft through its precise real-time execution, it encountered limitations in certain scenarios due to deficiencies in cognitive reasoning and strategic planning capabilities. Currently, OpenAI Gym and related toolkits provide a diverse range of game interfaces for training and evaluating DRL algorithms, e.g., SC2LE [16], ViZDoom [17], and MuJoCo [18], establishing a robust foundation for developing more sophisticated game AI [19]. Therefore, our research focuses on optimizing the implementation of advanced DRL frameworks within ViZDoom, a simulation environment that incorporates complex multi-dimensional action controls.
Moreover, collaborative interaction between multiple agents has been demonstrated to be an effective approach for converging toward near-optimal solutions [20,21]. With the advancement of single-agent DRL implementations, multi-agent DRL has sparked a wave of enthusiasm [22,23]. Multi-agent reinforcement learning enables collaboration by maximizing collective rewards, which expands traditional reinforcement learning from individual problem-solving to cooperative achievement [24]. The methodology has shown remarkable effectiveness in advanced applications including multiplayer games [25], coordinated UAV operations [26], and robot swarm control systems [27]. Since real-world scenarios often involve multiple interacting agents, we incorporate a multi-agent framework into the DRL paradigm to improve agent performance in ViZDoom.
In reinforcement learning, developing sophisticated agents capable of replicating human-like behavioral patterns within gaming environments remains a significant challenge. To address this challenge, we introduce MA-PPO, an advanced reinforcement learning algorithm framework specifically designed for complex gaming scenarios. Using the first-person shooter game ViZDoom as our experimental platform, we model environmental interactions as a Partially Observable Markov Decision Process. Our MA-PPO algorithm incorporates an advanced multi-agent architecture that overcomes a key limitation of traditional reinforcement learning: instead of executing only one action per decision, it enables agents to perform multiple commands simultaneously, much like human players. To ensure genuine intelligence rather than scripted behavior, we implement a distributed task allocation mechanism, where agents operate independently on parallel objectives, avoiding control conflicts. Through comprehensive empirical evaluation against other reinforcement learning algorithms, our framework achieves superior performance in the ViZDoom environment.
The main contributions of our work are summarized as follows:
  • We transform the interactions between the ViZDoom environment and agents into a Partially Observable Markov Decision Process, and systematically define the corresponding states, actions, and reward functions. The reward functions serve as key indicators for assessing the effectiveness of each agent in executing their designated specialized tasks.
  • We propose MA-PPO, an innovative multi-agent reinforcement learning framework based on PPO. Our framework utilizes image information as input and outputs joint actions composed of multiple individual actions to interact with the ViZDoom environment, achieving predefined objectives in a manner that better approximates human players.
  • Simulation experiments show that MA-PPO achieves a 30.67% reward gain over the original PPO and at least a 32.00% performance improvement over three other benchmark algorithms, including DQN. Visual analysis shows that MA-PPO completes the task with the fewest steps, and parameter experiment results indicate that all selected parameters are optimal.
The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 first introduces the ViZDoom simulator and then the multi-agent POMDP formulation. Section 4 describes the update process of the proposed MA-PPO algorithm and the network architecture. Section 5 evaluates the performance of the proposed algorithm through simulations and conducts parameter experiments. Finally, Section 6 concludes the paper. The main symbols used in this paper are summarized in Table 1.

2. Related Work

In this section, we review related work, focusing on applications of the actor–critic framework in RL.
Although RL has demonstrated remarkable efficacy in practical applications, several critical areas still need further exploration and research. Traditional tabular RL methods, e.g., Q-learning and Sarsa, employ a tabular structure to maintain states and actions, facilitating value function retrieval through systematic table lookup. Zeng et al. introduced Approximate Dynamic Programming (ADP) as an innovative methodology for optimizing long-term operational costs in microgrid systems [28]. The real-time scheduling problem was modeled as a finite-horizon Markov decision process. Through the implementation of neural networks for power generation forecasting, the ADP approach demonstrated effectiveness when evaluated using authentic power grid data simulations. Tavares et al. studied the modeling of player interactions in StarCraft through Nash equilibrium theory, with findings indicating that the implementation of minimax-Q algorithms led to enhanced performance for AI agents in StarCraft tournaments [29].
However, value function-based tabular RL algorithms are limited to discrete action and state spaces and encounter the problem of decision fatigue. The actor–critic framework therefore presents an effective solution to these operational limitations. Wang et al. implemented an actor–critic network in home energy systems to handle continuous battery charging and discharging actions [9]. The actor–critic-based RL scheme enhances the economic efficiency of battery scheduling while maintaining safety constraints throughout daily operations. Zha et al. introduced an actor–critic algorithm named Advanced Actor–Critic (AAC) that trains agents to match human-level skill in the complex strategy game StarCraft II [30]. The AAC-based AI demonstrates rapid convergence despite encountering vast state and action spaces.
Although reinforcement learning has advanced significantly across many fields, research has primarily focused on single-agent scenarios, even though real-world applications often involve multiple decision-makers working together [31]. Thus, multi-agent reinforcement learning has emerged as a powerful approach for tackling complex real-world problems [32]. Lowe et al. first introduced the multi-agent actor–critic framework, i.e., multi-agent deep deterministic policy gradient (MADDPG), which handles environmental non-stationarity through a system of multiple trained sub-policies, with random policy selection during each episode [33]. MADDPG, as a versatile multi-agent reinforcement learning framework, has made significant contributions through its centralized training and distributed execution strategy in multi-robot collaborative systems [34], human behavior imitation [35], and Data Center Digital Twins [36].

3. Modeling in ViZDoom

In this section, we introduce the ViZDoom simulator, a gaming platform widely used for training DRL agents. We then present the concept of Partially Observable Markov Decision Process (POMDP) and its extension to a multi-agent framework. Finally, we detail the construction process and theoretical foundations of multi-agent POMDP.

3.1. Preliminaries

3.1.1. ViZDoom

ViZDoom is a reinforcement learning research platform built on a popular open-source 3D first-person shooter game, allowing agents to play Doom by interacting with screen buffers and game variables [37]. From a technical performance perspective, the ViZDoom environment is highly optimized: at a resolution of 320 × 240, it achieves an average rendering rate of 2500 frames per second, and at a lower resolution of 160 × 120, even a single CPU thread can reach up to 7000 frames per second. For reinforcement learning research, the ViZDoom simulator runs in synchronous mode, where the game engine pauses to await decisions from the learning agent. As a result, the learning system can progress at its own pace without time restrictions. Additionally, the environment comes with multiple pre-built scenarios specifically designed to train fundamental skills such as shooting and navigation. Based on these technical advantages, the ViZDoom environment provides an ideal experimental platform for our research.
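For readers unfamiliar with the platform, the following fragment sketches how an agent typically drives ViZDoom's synchronous mode from Python. It is a minimal illustration only: the scenario file name and the button ordering are assumptions for this sketch and are not taken from the paper.

import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("basic.cfg")              # Basic scenario; file name assumed for illustration
game.set_screen_resolution(vzd.ScreenResolution.RES_320X240)
game.set_window_visible(False)
game.set_mode(vzd.Mode.PLAYER)             # synchronous mode: the engine waits for make_action
game.init()

game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()               # screen buffer and game variables
    frame = state.screen_buffer
    reward = game.make_action([0, 0, 1])   # one decision per engine step, e.g., fire
print("episode reward:", game.get_total_reward())
game.close()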

3.1.2. Reinforcement Learning with POMDP

In the reinforcement learning framework, the environment and agent continuously interact in order to maximize rewards [38]. The Partially Observable Markov Decision Process (POMDP) provides a formal mathematical model for analyzing and characterizing agent–environment interactions, denoted as a five-tuple $\langle S, A, R, P, \gamma \rangle$.
In the POMDP, we denote $S$ as the state space, where $s_t \in S$ refers to the state obtained by the agent from the environment at time slot $t$. We denote $A$ as the action space, where $a_t \in A$ refers to the specific action taken by the agent to interact with the environment at time slot $t$. Following the execution of an action, the agent receives environmental feedback, formally designated as the reward signal. We denote $R$ as the reward function, where $R(s_t, a_t)$ refers to the immediate feedback received by the agent when executing action $a_t$ in state $s_t$ at time slot $t$. We denote $P$ as the state transition probability function, where $P(s_{t+1} \mid s_t, a_t)$ characterizes the probability of the system transitioning from state $s_t$ to the next state $s_{t+1}$ through action $a_t$. Finally, we denote $\gamma \in [0, 1]$ as the discount factor, regulating the trade-off between immediate and long-term rewards: a larger $\gamma$ emphasizes long-term gains, while a smaller $\gamma$ focuses more on immediate returns. During the continuous interaction between the agent and the environment, the system produces an ordered sequence of state–action pairs, which forms a temporal transition sequence parameterized by the time slot $t$. We define this sequence of state–action pairs as the trajectory
$$\tau = \{ s_1, a_1, s_2, a_2, \ldots, s_t, a_t, s_{t+1}, a_{t+1}, \ldots, s_T, a_T \} \quad (1)$$
where T is the termination time of the interaction.
In this paper, we consider a multi-agent POMDP, where $I$ agents share information and collaborate to achieve a common goal. We denote $i \in \{1, \ldots, I\}$ as the agent index. All agents share the same state space $S$ and discount factor $\gamma$. Because different agents have unique action spaces, we redefine the action space $A$ of the multi-agent POMDP as a composite action space. We denote $a_t^i \in A^i \subseteq A$ as the action taken by agent $i$ to interact with the environment at time $t$; thus, the joint action of the whole system satisfies $a_t = \{ a_t^{i=1}, \ldots, a_t^{i=I} \}$ at time $t$. Ultimately, each agent $i$ updates its policy $\pi^i$ according to its own reward function.
As shown in Figure 1, the environment interaction of the multi-agent system follows the process below: based on the environment state $s_t$, each agent $i$ independently selects an action $a_t^i$, and these individual decisions collectively form the system's joint action. After the agents execute the joint action $a_t$ and interact with the environment, the state transitions from $s_t$ to $s_{t+1}$ and the system receives a reward $r_t$. Based on the next state $s_{t+1}$ and reward $r_t$, the system continues making decisions, and the cycle ultimately generates a complete interaction trajectory $\tau$ as given in Equation (1).

3.2. Multi-Agent POMDP Formulation

In the multi-agent POMDP, the agents first observe the state $s_t$ in the ViZDoom environment and then select the corresponding actions $a_t^{i=1}$ and $a_t^{i=2}$ from the action spaces $A^{i=1}$ and $A^{i=2}$, respectively. The system then executes the combined joint action $a_t$ in the ViZDoom environment. Following execution, the system receives both the next state $s_{t+1}$ and the reward $r_{t+1}$ from the environment before proceeding to the next decision cycle. Since the ViZDoom simulator updates all agent-observable information in real time, the next state $s_{t+1}$ depends only on the current state $s_t$ and action $a_t$ rather than on the earlier history, satisfying the Markov property. The POMDP formulation is as follows.

3.2.1. State

We adopt the ViZDoom Basic environment as the experimental platform. The system is configured with a dual-agent architecture, i.e., I = 2 . Both agents share the same state observation information, satisfying
$$s_t^{i=1} = s_t^{i=2} \quad (2)$$
The state space consists of a series of post-interaction game images obtained directly from the ViZDoom environment; it is a discrete state space containing key decision-making information, such as the player's weapon orientation, enemy positions, ammunition levels, terrain features, and player health status.

3.2.2. Action

To align with the ViZDoom environment, we define four basic operations for the agents, i.e., stay still, move left, move right, and fire. We construct a multi-agent system composed of two agents with differentiated action spaces to control the player unit in the environment. For the agents, we establish a clear division of action spaces, denoted as
$$a_t^{i=1} \in A^{i=1} = \{ a_{stay}, a_{left}, a_{right} \}, \qquad a_t^{i=2} \in A^{i=2} = \{ a_{hold}, a_{shoot} \} \quad (3)$$
where agent $i=1$ is responsible for controlling the player's movement, and its action space contains three actions: $a_{stay}$ represents the agent maintaining a stationary position, $a_{left}$ represents the agent moving to the left, and $a_{right}$ represents the agent moving to the right. The second agent ($i=2$) is responsible for shooting decisions, with its action space containing two actions: $a_{hold}$ represents the agent holding the weapon without firing, and $a_{shoot}$ represents the agent firing the weapon. Through this collaborative mechanism, the multi-agent system achieves complete control over the player's behavior and interacts with the environment.
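As a hedged illustration, the two action spaces of Equation (3) can be mapped onto a single ViZDoom button vector. The button ordering [MOVE_LEFT, MOVE_RIGHT, ATTACK] assumed below mirrors the Basic scenario configuration but is not specified in the paper.

# Hypothetical mapping from each agent's discrete action index to the button
# vector [MOVE_LEFT, MOVE_RIGHT, ATTACK] expected by make_action(); ordering assumed.
MOVE_ACTIONS = {0: [0, 0], 1: [1, 0], 2: [0, 1]}    # 0: a_stay, 1: a_left, 2: a_right
SHOOT_ACTIONS = {0: [0], 1: [1]}                    # 0: a_hold, 1: a_shoot

def joint_action(a_move: int, a_shoot: int) -> list:
    """Concatenate the movement and shooting agents' choices into one command."""
    return MOVE_ACTIONS[a_move] + SHOOT_ACTIONS[a_shoot]

buttons = joint_action(1, 1)    # move left while firing -> [1, 0, 1]

Because the two sub-spaces control disjoint buttons, the joint command never contains conflicting controls, matching the task-allocation design described above.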

3.2.3. Reward

The reward functions adopted in our research measure the multi-agent system's actual performance in the game. To achieve differentiated behavior guidance, we establish different evaluation criteria for each agent in the system, as follows:
$$r_t^{i=1} = r_{move} + r_{hit}, \qquad r_t^{i=2} = r_{shoot} + r_{hit} \quad (4)$$
When the system successfully hits the target, it receives a positive reward $r_{hit}$. Meanwhile, we introduce $r_{move}$ and $r_{shoot}$ as behavioral penalty factors, corresponding to the costs of movement and shooting actions. These two penalties are designed to optimize the system's behavioral strategy, encouraging the agents to adopt more deliberate movement patterns and achieve more efficient shooting performance under limited ammunition.
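A minimal sketch of how the reward split of Equation (4) could be evaluated per time slot is given below; the numeric magnitudes of r_hit, r_move, and r_shoot are placeholder assumptions, since the paper specifies only the structure of the two reward functions.

def agent_rewards(hit: bool, moved: bool, fired: bool,
                  r_hit: float = 100.0, r_move: float = -1.0, r_shoot: float = -5.0):
    """Return (r_t^{i=1}, r_t^{i=2}) following Equation (4).

    Both agents share the positive hit bonus, while each pays only its own
    behavioral penalty; the default magnitudes are illustrative assumptions.
    """
    hit_bonus = r_hit if hit else 0.0
    r_move_agent = (r_move if moved else 0.0) + hit_bonus
    r_shoot_agent = (r_shoot if fired else 0.0) + hit_bonus
    return r_move_agent, r_shoot_agent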

4. Proposed Algorithm

The core scheme adopted in this paper is the Proximal Policy Optimization (PPO) algorithm. In this section, we first extend the PPO algorithm to a multi-agent architecture. Then, we present a comprehensive analysis of the optimization mechanisms underlying these algorithm variants, along with a detailed examination of their respective network architectures.

4.1. Proximal Policy Optimization

As an advanced method in DRL, PPO achieves high training efficiency while maintaining learning stability, making it particularly suitable for agent decision training in complex environments. PPO is based on an actor–critic architecture, where the critic network is responsible for evaluating the state value function and plays an evaluative role in the policy optimization process. Within the ViZDoom environment, agent $i$ perceives the state $s_t^i$ at the beginning of time slot $t$. Subsequently, agent $i$ engages with the environment by selecting and executing an action from a discrete set, $a_t^i \in A^i$. The agent then observes an immediate reward $r_t^i$ from the ViZDoom environment before transitioning to the next state $s_{t+1}^i$. Through agent–environment interactions, we obtain the interaction trajectory $\tau$. We feed the state transitions and corresponding rewards from the trajectory $\tau$ into the critic network to calculate the state target value $V_{\text{target}}$:
$$V_{\text{target}} = r_t + \gamma_V \cdot V_\Theta(s_{t+1}) \quad (5)$$
where $\gamma_V$ is the discount factor for the value function and $\Theta$ represents the weights of the critic network. Based on the critic network's estimate of the current state value and the target value $V_{\text{target}}$, we construct the following loss function:
$$L^{VF}(\Theta) = \hat{\mathbb{E}}_t\!\left[ \left( V_\Theta(s_t) - V_{\text{target}} \right)^2 \right] \quad (6)$$
By applying the Adam optimizer, we minimize the loss function $L^{VF}(\Theta)$ to update the critic network weights $\Theta$. The loss function $L^{VF}(\Theta)$ reduces the disparity between the critic's estimate and the target value, thereby enhancing the model's capacity to accurately assess the quality of the current state.
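As a hedged PyTorch sketch of the critic step described by Equations (5) and (6), one update could look as follows; the batching and tensor shapes are assumptions, since the paper does not detail them.

import torch
import torch.nn.functional as F

def update_critic(critic, optimizer, states, rewards, next_states, gamma_v: float = 0.98):
    """Regress V_Theta(s_t) toward the bootstrapped target r_t + gamma_V * V_Theta(s_{t+1})."""
    with torch.no_grad():                                # the target is treated as a constant
        v_target = rewards + gamma_v * critic(next_states).squeeze(-1)
    v_pred = critic(states).squeeze(-1)
    loss = F.mse_loss(v_pred, v_target)                  # Equation (6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # Adam step on Theta
    return loss.item()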
The actor network's main responsibility is to generate action decisions. To optimize the actor network, we first use the Generalized Advantage Estimator (GAE) method to calculate the advantage estimate $\hat{A}_t$:
$$\hat{A}_t = \delta_t + (\gamma_A \cdot \lambda)\,\delta_{t+1} + \cdots + (\gamma_A \cdot \lambda)^{T-t+1}\,\delta_{T-1} \quad (7)$$
where $\delta_t = r_t + \gamma_A \cdot V_\Theta(s_{t+1}) - V_\Theta(s_t)$. During the estimation process, the state values $V_\Theta(s_{t+1})$ and $V_\Theta(s_t)$ are provided by the critic network, $\gamma_A$ is the discount coefficient for the advantage function, and $\lambda$ is used to balance bias and variance.
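The truncated sum of Equation (7) is usually evaluated with the standard backward recursion sketched below, under the assumption that values stacks the critic outputs V_Theta(s_1), ..., V_Theta(s_{T+1}) for one trajectory; the value of lambda used here is only a placeholder.

import torch

def gae_advantages(rewards, values, gamma_a: float = 0.98, lam: float = 0.95):
    """Compute the GAE advantages A_hat_t for one trajectory (Equation (7)).

    rewards: tensor of shape (T,); values: tensor of shape (T + 1,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma_a * values[t + 1] - values[t]   # delta_t
        gae = delta + gamma_a * lam * gae                          # accumulates the (gamma_A * lambda)^l terms
        advantages[t] = gae
    return advantages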
We denote the weights of the actor network as $\theta$ and the network weights before updating as $\theta_{\text{old}}$. To prevent drastic gradient fluctuations during parameter updates, the PPO algorithm employs a clipping mechanism to constrain the difference between the new and old policies $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$. Let $\pi_\theta(a_t \mid s_t)$ denote the probability of policy $\pi_\theta$ selecting action $a_t$ in state $s_t$, and define the probability ratio between the new and old policies as $p_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$. We then adopt the following loss function for the PPO scheme:
$$L^{\text{clip}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( p_t(\theta)\,\hat{A}_t,\; p^{\text{clip}}(\theta)\,\hat{A}_t \right) \right] \quad (8)$$
where the clip function $p^{\text{clip}}(\theta)$ is defined as
$$p^{\text{clip}}(\theta) = \begin{cases} 1-\epsilon, & \text{if } p_t(\theta) < 1-\epsilon \\ p_t(\theta), & \text{if } 1-\epsilon \le p_t(\theta) \le 1+\epsilon \\ 1+\epsilon, & \text{if } p_t(\theta) > 1+\epsilon \end{cases} \quad (9)$$
Figure 2 shows how the loss function $L^{\text{clip}}(\theta)$ varies with the probability ratio $p_t$ under different advantage values. The loss function $L^{\text{clip}}(\theta)$ provides the actor model with information about policy advantages while limiting the magnitude of a single update to ensure controlled optimization. Here, $\epsilon$ is a hyper-parameter used to limit the clipping range. Finally, we update the actor network weights $\theta$ using the Adam optimizer based on the loss function $L^{\text{clip}}(\theta)$.
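A compact PyTorch sketch of the clipped objective in Equations (8) and (9) follows. It returns the negative surrogate so that minimizing it with Adam maximizes the clipped advantage; working with log-probabilities is an implementation convention assumed here, not something prescribed by the paper.

import torch

def ppo_actor_loss(new_log_probs, old_log_probs, advantages, epsilon: float = 0.2):
    """Clipped surrogate loss for the actor (Equations (8) and (9))."""
    ratio = torch.exp(new_log_probs - old_log_probs)                  # p_t(theta)
    clipped_ratio = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)  # p^clip(theta)
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()                                          # minimize the negative objective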

4.2. Multi-Agent-Based RL Framework for ViZDoom

We systematically preprocess the game screen output from the ViZDoom environment to optimize the model input. Specifically, the ViZDoom simulator outputs a three-channel color image with a resolution of 240 × 320. After selective cropping and grayscale conversion, the image is resized to a single-channel 84 × 84 feature map. To enhance the temporal information, we concatenate four consecutive frames, ultimately forming a state input of four 84 × 84 channels. This preprocessing scheme effectively preserves the dynamic characteristics of the environment while providing sufficient contextual information for the agent.
It is worth noting that only one frame is available in the game's initial state. Therefore, when processing the initial state, we replicate the first frame four times and concatenate the copies to construct the initial state $s_0$. This not only ensures the uniformity of input dimensions but also provides a stable and coherent learning foundation for the model. Through the above standardized processing, we obtain continuous and consistent state representations from the beginning to the end of the game and provide the agent with a stable learning starting point.
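The preprocessing pipeline described above can be sketched as follows. The crop bounds and the use of OpenCV are assumptions, and the code expects frames in H × W × 3 layout (ViZDoom's default channel-first buffer would need a transpose first).

from collections import deque

import cv2
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Crop, grayscale, and resize one 240 x 320 RGB frame to a single 84 x 84 channel."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    cropped = gray[40:, :]                       # assumed crop; the paper does not give exact bounds
    return cv2.resize(cropped, (84, 84)).astype(np.float32) / 255.0

class FrameStack:
    """Maintain the four most recent frames as the 4 x 84 x 84 state input."""
    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        processed = preprocess(first_frame)
        for _ in range(self.frames.maxlen):      # replicate the first frame to build s_0
            self.frames.append(processed)
        return np.stack(self.frames)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess(frame))
        return np.stack(self.frames)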
In the actor network responsible for shooting control, we use the preprocessed game information as the model input. The input then passes through four convolutional layers: the first uses a 5 × 5 kernel with a stride of 2, followed by a ReLU activation; the second uses a 4 × 4 kernel with a stride of 2, followed by a ReLU activation; the third uses a 3 × 3 kernel with a stride of 2, followed by a ReLU activation; and the fourth uses a 3 × 3 kernel with a stride of 2, producing output dimensions of (batch size, 32, 9, 9). The output is then flattened and passed through three fully connected layers outputting 512, 64, and 2 units, respectively, each followed by a ReLU activation. Finally, a softmax layer generates an action probability distribution to determine whether to fire.
The actor network architecture for movement control is essentially identical to the shooting model, with only the output dimension in the final layer changed to 3, used to control left movement, right movement, or staying still. The two agents share the same critic network structure. It maintains the same input processing method and convolutional layer configuration, and through a similar network architecture, ultimately outputs a single state value estimate. Specifically, the critic network’s input will go through four convolutional layers. After flattening, it passes through fully connected layers with dimensions of 512, 64, and 1, each followed by a ReLU activation function, ultimately outputting the state value assessment. The specific process is shown in Figure 3.
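A hedged PyTorch sketch of the network heads is given below. The paper fixes the kernel sizes, strides, the 32-channel output of the last convolution, and the 512/64 fully connected widths; the intermediate channel widths, padding, and placement of the final activation are assumptions, so the flattened dimension is inferred at construction time rather than hard-coded.

import torch
import torch.nn as nn

class ConvTorso(nn.Module):
    """Convolutional feature extractor following the layout described above."""
    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():    # infer the flattened size from an 84 x 84 dummy input
            self.out_dim = self.features(torch.zeros(1, in_channels, 84, 84)).shape[1]

    def forward(self, x):
        return self.features(x)

class Actor(nn.Module):
    """Policy head: 512 -> 64 -> n_actions, followed by a softmax distribution."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.torso = ConvTorso()
        self.head = nn.Sequential(
            nn.Linear(self.torso.out_dim, 512), nn.ReLU(),
            nn.Linear(512, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return torch.softmax(self.head(self.torso(x)), dim=-1)

class Critic(nn.Module):
    """Value head: 512 -> 64 -> 1 state-value estimate."""
    def __init__(self):
        super().__init__()
        self.torso = ConvTorso()
        self.head = nn.Sequential(
            nn.Linear(self.torso.out_dim, 512), nn.ReLU(),
            nn.Linear(512, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.head(self.torso(x))

# movement_actor = Actor(n_actions=3); shooting_actor = Actor(n_actions=2); critic = Critic()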
According to the MA-PPO training process in Algorithm 1, we extend the original PPO algorithm to implement a dual-agent collaborative control system. Compared to the PPO pseudocode in Algorithm A1, the MA-PPO training process trains and updates multiple agents simultaneously. Two independently trained agents jointly control one game character, with their policy updates occurring relatively independently. The algorithm accepts several input variables, i.e., the maximum number of epochs $\eta$, the maximum number of steps $\xi$, the maximum number of training repetitions $\upsilon$, and the environment state $s_t$, and then outputs the control strategy. During practical application, the agents accept the same inputs and generate joint actions $a_t$ as output. The training procedure is as follows. After initializing the system parameters and model architectures of the two agents, the algorithm enters a training loop consisting of $\eta$ epochs, with each epoch containing $\xi$ training steps. During training, the agents first interact with the environment until the end of a game round; the system then records the interaction trajectory $\tau$ and updates the counter, followed by $\upsilon$ iterations of optimization on the collected data for both agents. During the optimization phase, the system first calculates the state target value $V_{\text{target}}$ based on trajectory $\tau$, constructs the critic network's loss function $L^{VF}(\Theta)$, and updates the relevant parameters through the Adam optimizer. Next, the system evaluates the advantage function $\hat{A}_t$ based on trajectory $\tau$, which is used to construct the actor network's loss function $L^{\text{clip}}(\theta)$, and similarly employs the Adam optimizer for parameter updates. Finally, the control strategy $\pi$ is formed. These steps collectively form the core optimization mechanism of the MA-PPO algorithm; a hedged code sketch of this loop is provided after Algorithm 1.
Algorithm 1 Multi-Agent Proximal Policy Optimization.
Input: state from environment $s_t$, $\eta$, $\xi$, $\upsilon$;
Output: control strategy;
 1: for agent $i = 1, 2$ do
 2:       initialize critic parameters $\Theta$ and actor parameters $\theta$
 3:       initialize the learning rate $\beta_{lr}$ and optimizer
 4: EndFor
 5: for epoch $k = 1, 2, \ldots, \eta$ do
 6:       for step $t = 1, 2, \ldots, \xi$ do
 7:             play the game under parameters $\Theta$ and $\theta$
 8:             while the game is not over do
 9:                   the agents obtain environment information
10:                   the agents execute actions
11:             EndWhile
12:             gain the trajectory $\tau$
13:             record the game steps
14:             for agent $i = 1, 2$ do
15:                   for training repetition $j = 1, 2, \ldots, \upsilon$ do
16:                         compute $V_{\text{target}}$ according to Equation (5)
17:                         compute $L^{VF}(\Theta)$ according to Equation (6)
18:                         update the critic network parameters $\Theta$
19:                         compute $\hat{A}_t$ according to Equation (7)
20:                         compute $L^{\text{clip}}(\theta)$ according to Equation (8)
21:                         update the actor network parameters $\theta$
22:                   EndFor
23:             EndFor
24:       EndFor
25: EndFor
26: form a control strategy $\pi$ for the whole system
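To make the control flow of Algorithm 1 concrete, the fragment below sketches one inner training step (one game round followed by both agents' updates), assuming that helper routines such as joint_action, FrameStack, and per-agent act/update methods (wrapping the critic and actor steps of Equations (5)–(9)) exist as outlined earlier in this section. It illustrates the dual-agent decision and update order only; it is not the authors' released implementation.

def run_training_step(game, move_agent, shoot_agent, stack, repeat_updates: int):
    """Collect one game round with the joint policy, then update each agent independently."""
    trajectory = []
    game.new_episode()
    state = stack.reset(game.get_state().screen_buffer)
    while not game.is_episode_finished():
        a_move = move_agent.act(state)                              # independent decisions ...
        a_shoot = shoot_agent.act(state)
        reward = game.make_action(joint_action(a_move, a_shoot))    # ... executed jointly
        if game.is_episode_finished():
            next_state = state                                      # terminal: reuse the last stacked state
        else:
            next_state = stack.step(game.get_state().screen_buffer)
        trajectory.append((state, (a_move, a_shoot), reward, next_state))
        state = next_state
    for agent in (move_agent, shoot_agent):                         # relatively independent updates
        for _ in range(repeat_updates):                             # upsilon optimization passes
            agent.update(trajectory)                                # critic step, Eqs. (5)-(6); actor step, Eqs. (7)-(9)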

5. Results and Analysis

5.1. Experiment Setup

To assess the efficacy of our multi-agent deep reinforcement learning framework in first-person shooter scenarios, we conduct comprehensive experiments on the open-source ViZDoom platform. The simulation is conducted on a workstation with an Intel Core i7-9700K processor (Intel Corporation, Santa Clara, CA, USA) running Ubuntu 20.04.6 LTS (Focal Fossa). The code is written in Python 3.6.0 using PyTorch 1.6.1 and ViZDoom 1.2.4. The parameters are listed in Table 2. The learning rate $\beta_{lr}$ gradually decreases from 1 × 10−5 to 1 × 10−7. The discount factors $\gamma_A$ and $\gamma_V$ are both set to 0.98 according to Section 5.3.2. The bias–variance parameter $\lambda$ is multiplied by the discount factor $\gamma_A$ when calculating the advantage function. Following the recommendation of the original PPO, we set the clipping range to $\epsilon = 0.2$. For the training iteration variables $\eta$, $\xi$, and $\upsilon$, we select values that enable algorithm convergence during the actual training process.
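The decaying learning rate can be realized with a standard PyTorch scheduler; the exponential form, the per-epoch stepping, and the stand-in network below are assumptions, since the paper reports only the initial and final values.

import torch
import torch.nn as nn

policy = nn.Linear(84 * 84 * 4, 3)                 # stand-in for the movement actor of Section 4.2
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)
# Decay factor chosen so that 1e-5 shrinks to roughly 1e-7 over 1000 epochs:
# (1e-7 / 1e-5) ** (1 / 1000) is approximately 0.9954.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9954)

for epoch in range(1000):
    # ... collect a trajectory and run the critic/actor updates for this epoch ...
    scheduler.step()                               # learning rate decays toward 1e-7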

5.2. Performance Evaluation

5.2.1. Baseline

We evaluate five algorithms for experimental comparison and analysis, i.e., PPO, the proposed MA-PPO, A2C, MA-A2C, and DQN. The following elaborates on the characteristics and implementation of each algorithm.
Proximal Policy Optimization (PPO) [39]: PPO was proposed by Schulman et al. in 2017 and is an advanced reinforcement learning method. PPO adopts an actor–critic architecture and can be applied to both discrete and continuous action spaces, demonstrating strong versatility in reinforcement learning problems. The most distinctive feature of PPO is the clip operation, which limits the magnitude of a single policy update and greatly enhances stability. Therefore, we develop MA-PPO, extending the PPO algorithm to a multi-agent system framework. Compared to the original PPO algorithm, we change the task allocation structure and design two agents responsible for movement and shooting, respectively. The two agents are both updated according to the PPO algorithm and jointly control the player's behavior. The specific algorithm details are elaborated in Section 4.
Advantage Actor–Critic (A2C) [9]: The A2C algorithm is an efficient policy optimization method, which employs an actor–critic architecture. The algorithm integrates policy gradient methods with value function estimation, and introduces advantage functions to optimize the training process. The actor network is responsible for action decision-making and updates based on the advantage function, while the critic network evaluates action values and optimizes parameters in a manner similar to PPO. Correspondingly, MA-A2C is our improved version of the A2C algorithm after adapting it to a multi-agent framework.
Deep Q-Network (DQN) [40]: DQN is a deep reinforcement learning algorithm developed by the DeepMind team in 2013. DQN approximates the Q-value function through deep neural networks to handle complex high-dimensional input data. DQN introduces an experience replay mechanism, storing the agent’s interactions with the environment in a replay buffer, and conducts training through random sampling, effectively reducing data correlation and significantly improving training efficiency and stability.

5.2.2. Convergence

We evaluate the convergence performance of the proposed multi-agent architecture through a training period of 1000 epochs. We conduct systematic comparative analysis with PPO, MA-PPO, A2C, MA-A2C, and DQN to validate the performance advantages of our architecture.
In the analysis of Figure 4a, the results show that MA-A2C demonstrates optimal performance during the convergence phase, followed by DQN and A2C. DQN demonstrates significant rapid convergence characteristics, reaching a stable state at around 50 epochs. The efficient convergence performance can be attributed to its streamlined algorithm structure and lower parameter complexity. A2C converges and maintains stability around epoch 80. The MA-A2C learning curve exhibits unique dynamic characteristics. The convergence curve exhibits a significant drop at epoch 86, i.e., reward value −313, followed by a strong recovery. MA-A2C surpasses the performance levels of A2C at epoch 130 and DQN at epoch 140, before achieving stable convergence at epoch 150. The performance curve characteristics reflect the collaborative learning challenges of the multi-agent system in the initial phase and the effective cooperation mechanism subsequently established through continuous environmental interaction. The results in Figure 4b show that MA-PPO consistently maintains a leading advantage, while PPO performs better than DQN after convergence. Specifically, PPO achieves initial convergence at epoch 30, followed by a continuous optimization trend, surpassing DQN in overall performance at epoch 250. MA-PPO reaches convergence at epoch 50, demonstrating excellent performance stability.
The comprehensive evaluation results in Figure 4c indicate that MA-PPO demonstrates far superior convergence compared to the other four baselines, i.e., PPO, MA-A2C, DQN, and A2C. After convergence, the rewards for MA-PPO, PPO, MA-A2C, and DQN are 75, 52, 51, and 31, respectively. Quantitative analysis shows that MA-PPO achieves test performance improvements of 30.67%, 32.00%, and 58.67% compared to PPO, MA-A2C, and DQN, respectively. Meanwhile, both MA-PPO versus the original PPO and MA-A2C versus the original A2C show highly significant improvements in experimental reward. Therefore, the experimental data confirm that the algorithm variants based on the multi-agent architecture, i.e., MA-PPO and MA-A2C, achieve significant performance improvements compared to their original versions, fully validating the effectiveness of the proposed multi-agent architecture in optimizing reinforcement learning performance.

5.3. Performance Comparison

5.3.1. Rendering

Figure 5 shows the results of our visualization-based comparative analysis of the five baselines' performance in the game environment. As shown in Table 3, the multi-agent architecture MA-PPO demonstrates optimal performance, efficiently completing the task with a reward of 76, 20 action steps, and 48 remaining ammunition. The PPO algorithm ranks second, achieving a reward of 56 with 35 action steps and 47 remaining ammunition. MA-A2C achieves a reward of 47, takes 44 steps, and retains 47 ammunition. DQN achieves a reward of 36, takes 55 steps, and retains 47 ammunition. The A2C algorithm shows relatively weak performance, obtaining a reward of −109, taking 165 steps, and retaining 40 ammunition.
Then, we conduct a detailed analysis of the visualization results from the perspective of each agent's actions. MA-PPO executes shooting actions at steps 7 and 20, successfully hitting the enemy at step 20 while rapidly approaching the target during the intermediate phase. PPO executes shooting actions at steps 12, 25, and 35, successfully hitting the enemy at step 35 while slowly approaching the target during the intermediate phase. MA-A2C executes shooting actions at steps 10, 21, and 44, successfully hitting the enemy at step 44, with a brief stationary period between steps 27 and 32, while slowly approaching the target during the intermediate phase. DQN executes shooting actions at steps 20, 37, and 55, successfully hitting the enemy at step 55, with a brief stationary period between steps 25 and 35, while slowly approaching the target during the intermediate phase. A2C executes a total of 10 shots, hitting the enemy at step 165, with the intermediate phase involving frequent lateral oscillations and stationary periods. The experimental data clearly demonstrate that the algorithm variants based on the multi-agent architecture, i.e., MA-PPO and MA-A2C, outperform their corresponding baselines across all performance metrics, with particularly notable improvements in the case of the A2C algorithm. The empirical results strongly confirm that the proposed multi-agent system has significant advantages in improving agent game performance.

5.3.2. Parameter Experiments

We conduct systematic parameter experiments on the discount factors $\gamma_A$ and $\gamma_V$, used for estimating the advantage function $\hat{A}_t$ and the target value $V_{\text{target}}$, to evaluate how their different combinations affect model training. Through the analysis of 1000 training epochs, we obtain the following key findings.
Figure 6a shows the impact of different $\gamma_A$ values (0.98, 0.95, 0.90, and 0.85) on model convergence when $\gamma_V$ is fixed at 0.98. The experimental results show that $\gamma_A = 0.98$ not only demonstrates optimal training stability but also achieves the highest reward value. The data show that as $\gamma_A$ approaches 1, the model performance improves.
Figure 6b shows the experimental results with different γ V values (0.98, 0.95, 0.90, and 0.85) when γ A is fixed at 0.98. The study finds that changes in γ V can lead to significant fluctuations in the model training process, but as the γ V value approaches 1, the model training becomes more stable and can achieve higher reward levels.
Figure 6c visualizes the effects of 16 combinations of γ A and γ V through a heatmap, where darker regions represent superior model performance. The experimental results clearly demonstrate that the parameter combination of γ A = 0.98 and γ V = 0.98 produces the best performance, indicating that the combination achieves an ideal balance between bias and variance in reward function estimation. Above all, we accordingly determine the final parameter configuration.

6. Conclusions

In this paper, we propose a multi-agent reinforcement learning framework called MA-PPO that achieves predetermined objectives through continuous interaction with the ViZDoom environment under specific constraints. Specifically, the primary objective is to achieve target acquisition with optimal efficiency while adhering to specified ammunition and time limitations. Comprehensive performance evaluation shows that the improved MA-PPO algorithm achieves a 30.67% improvement in convergence performance compared to the baseline algorithm. Through systematic parameter experiments, we validate the scientific basis and effectiveness of our chosen discount factor parameters. Through a series of simulation experiments, it is demonstrated that the proposed multi-agent architecture has good generality and can be integrated with various actor–critic frameworks to enhance overall performance.
Future work will be expanded in the following directions. In practice, we will explore the potential applications of multi-agent reinforcement learning algorithms in broader and more complex game environments to validate their generality and practical value. We plan to conduct in-depth research on optimizing the decision-making mechanisms and develop algorithm architectures that better adapt to diverse gaming environments. Furthermore, we will focus on resource efficiency in multi-agent systems, emphasizing research on the balance between agent scale expansion and computational resource consumption, aiming to achieve an ideal trade-off between performance improvement and computational efficiency.

Author Contributions

Conceptualization, H.Z. and S.J.; Methodology, Z.C. and H.Z.; Software, Z.C.; Validation, Z.C.; Formal analysis, Z.Z.; Writing—original draft, Z.C. and H.Z.; Writing—review & editing, K.D., H.Z., Z.Z. and S.J.; Supervision, K.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the NSFC Programs under Grant 62403121.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author. Project page at https://lintian106.github.io/MAppo2VizDoom (accessed on 23 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. PPO Algorithm

Based on the PPO training process shown in Algorithm A1, the algorithm accepts several input variables, i.e., the maximum number of epochs $\eta$, the maximum number of steps $\xi$, the maximum number of training repetitions $\upsilon$, and the environment state $s_t$, and then outputs both the trained critic and actor networks. During practical application, the agent accepts the same inputs and generates agent actions $a_t$ as output. The training procedure is as follows. After initializing the system parameters and model architecture, we enter a training loop consisting of $\eta$ epochs, with each epoch containing $\xi$ training steps. During training, the agent first interacts with the environment until the end of a game round; the system then records the interaction trajectory $\tau$ and updates the counter, followed by $\upsilon$ iterations of optimization on the collected data. During the optimization phase, the system first calculates the state target value $V_{\text{target}}$ based on trajectory $\tau$, constructs the critic network's loss function $L^{VF}(\Theta)$, and updates the relevant parameters through the Adam optimizer. Next, the system evaluates the advantage function $\hat{A}_t$ based on trajectory $\tau$, which is used to construct the actor network's loss function $L^{\text{clip}}(\theta)$, and similarly employs the Adam optimizer for parameter updates. These steps collectively form the core optimization mechanism of the PPO algorithm.
Algorithm A1 Proximal Policy Optimization.
Input: state from environment $s_t$, $\eta$, $\xi$, $\upsilon$;
Output: actor network, critic network;
 1: initialize critic parameters $\Theta$ and actor parameters $\theta$
 2: initialize the learning rate $\beta_{lr}$ and optimizer
 3: for epoch $k = 1, 2, \ldots, \eta$ do
 4:       for step $t = 1, 2, \ldots, \xi$ do
 5:             play the game under parameters $\Theta$ and $\theta$
 6:             while the game is not over do
 7:                   the agent obtains environment information
 8:                   the agent executes actions
 9:             EndWhile
10:             gain the trajectory $\tau$
11:             record the game steps
12:             for training repetition $j = 1, 2, \ldots, \upsilon$ do
13:                   compute $V_{\text{target}}$ according to Equation (5)
14:                   compute $L^{VF}(\Theta)$ according to Equation (6)
15:                   update the critic network parameters $\Theta$
16:                   compute $\hat{A}_t$ according to Equation (7)
17:                   compute $L^{\text{clip}}(\theta)$ according to Equation (8)
18:                   update the actor network parameters $\theta$
19:             EndFor
20:       EndFor
21: EndFor

Appendix B. Performance Comparison with SAC and MADDPG

Despite the availability of various established algorithms in the DRL field, e.g., SAC, and multi-agent frameworks, e.g., MADDPG, our research focuses on developing an enhanced multi-agent architecture based on PPO. In the following, we examine the SAC algorithm and the MADDPG framework as reference points to illustrate the rationality and novelty of the algorithm adopted and proposed in our work.
Soft Actor–Critic (SAC) [41], introduced by Haarnoja et al., implements an actor–critic architecture with entropy-based exploration mechanisms, demonstrating exceptional performance across various tasks. However, to address bootstrapping issues, SAC utilizes a configuration of four critic models and one actor model, whereas PPO, another state-of-the-art DRL method, operates with a single actor–critic model. The implementation of SAC in a multi-agent framework would necessitate significantly greater training resources and memory allocation compared to MA-PPO, as the resource requirements would scale with the number of agents, which leads us to adopt the more streamlined PPO algorithm for our multi-agent framework. Performance comparisons with SAC are presented in Figure A1.
MADDPG [33], introduced by Lowe et al., represents a multi-agent architecture that directly outputs agent actions rather than probability distributions. MADDPG employs a centralized-training, decentralized-execution learning framework, allowing models to access global information during training while operating with a limited perspective during execution. In contrast, our proposed MA-PPO algorithm employs a task-allocation learning framework in the ViZDoom environment, enabling concurrent specialized training of distinct agents on different tasks, which is a more streamlined and efficient approach that provides more possibilities for collaboration and coordination between agents. As demonstrated in the comparative analysis below, the proposed MA-PPO algorithm exhibits superior performance.
Figure A1. Comparison of average reward using MA-PPO, MADDPG, PPO, and SAC during the training process.
We evaluate the convergence performance of each baseline through a training period of 1000 epochs. We conduct a systematic comparative analysis between MA-PPO and MADDPG to validate the performance advantages of our architecture. Meanwhile, we compare PPO with SAC to explain our rationale for choosing PPO as the baseline for multi-agent improvements.
In the analysis shown in Figure A1, the results demonstrate that MA-PPO exhibits the best performance during the convergence phase, followed by MADDPG, PPO, and SAC. MA-PPO reaches convergence at epoch 50, demonstrating excellent performance stability. MADDPG converges at epoch 30, showing similar stability but with lower performance compared to MA-PPO. SAC converges at epoch 60, with slight performance fluctuations throughout the training process. PPO achieves convergence at epoch 30, followed by a continuous optimization trend, surpassing SAC in overall performance at epoch 800. After convergence, the rewards for MA-PPO, MADDPG, PPO, and SAC are 75, 60, 52, and 50, respectively. Quantitative analysis shows that MA-PPO achieves test performance improvements of 20.00%, 30.67%, and 33.33% compared to MADDPG, PPO, and SAC, respectively. Meanwhile, the performance of PPO is slightly higher than that of SAC after convergence. Therefore, the experimental data confirm that the proposed MA-PPO demonstrates performance superiority compared to existing multi-agent algorithms, and PPO is a more reasonable baseline choice.

Appendix C. Performance Evaluation with Error Bars

We develop the testing procedure according to the ViZDoom simulator [37]. We save the converged models and, to control variables, conduct 100 games in a fixed scenario with controlled environmental conditions, i.e., fixed enemy and initial positions, recording the reward values obtained by each baseline in every game. In Figure A2, the blue bars indicate the average rewards from the testing procedure, while the black lines indicate the standard deviation of the rewards. The results show that MA-PPO achieves the highest average reward, followed by MADDPG, PPO, MA-A2C, SAC, and DQN, with average reward values of 75.12, 59.60, 52.24, 51.64, 49.82, and 31.34, respectively, and standard deviation values of 1.85, 1.87, 2.10, 8.59, 5.15, and 7.78, respectively. Therefore, the algorithm with the highest stability remains the proposed MA-PPO, followed by MADDPG, PPO, SAC, DQN, and MA-A2C. The experimental data thus confirm that the proposed MA-PPO demonstrates the best performance and stability.
Figure A2. Comparison of baseline performance after convergence with error bars.

Appendix D. Parameter Experiments of Learning Rate

We conduct systematic parameter experiments on the learning rate $\beta_{lr}$ to evaluate the impact of variable and fixed learning rates on model training. Through the analysis of 1000 training epochs, we find that a variable learning rate significantly improves the model's performance. Figure A3 shows the convergence curves of MA-PPO under variable and fixed learning rates. With a variable learning rate, the model converges at epoch 100, demonstrating excellent stability and better convergence. With a fixed learning rate, the model converges at epoch 50, with slight fluctuations during the training process. Therefore, the experimental data confirm that a variable learning rate significantly improves the model's performance and stability.
Figure A3. Comparison of model training effects under different learning rates.

References

  1. Abriata, L.A. The Nobel Prize in Chemistry: Past, present, and future of AI in biology. Commun. Biol. 2024, 7, 1409. [Google Scholar] [CrossRef]
  2. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef] [PubMed]
  3. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  4. Dong, C.; Shafiq, M.; Dabel, M.M.A.; Sun, Y.; Tian, Z. DNN Inference Acceleration for Smart Devices in Industry 5.0 by Decentralized Deep Reinforcement Learning. IEEE Trans. Consum. Electron. 2024, 70, 1519–1530. [Google Scholar] [CrossRef]
  5. Wen, J.; Dai, H.; He, J.; Xi, M.; Xiao, S.; Yang, J. Federated Offline Reinforcement Learning with Multimodal Data. IEEE Trans. Consum. Electron. 2024, 70, 4266–4276. [Google Scholar] [CrossRef]
  6. Zhang, L.; U, L.H.; Zhou, M.; Yang, F. Elastic Tracking Operation Method for High-Speed Railway Using Deep Reinforcement Learning. IEEE Trans. Consum. Electron. 2024, 70, 3384–3391. [Google Scholar] [CrossRef]
  7. Yi, Y.; Zhang, G.; Jiang, H. Online Digital Twin-Empowered Content Resale Mechanism in Age of Information-Aware Edge Caching Networks. IEEE Trans. Commun. 2024, 1. [Google Scholar] [CrossRef]
  8. Scheiermann, J.; Konen, W. AlphaZero-Inspired Game Learning: Faster Training by Using MCTS Only at Test Time. IEEE Trans. Games 2023, 15, 637–647. [Google Scholar] [CrossRef]
  9. Wang, B.; Zha, Z.; Zhang, L.; Liu, L.; Fan, H. Deep Reinforcement Learning-Based Security-Constrained Battery Scheduling in Home Energy System. IEEE Trans. Consum. Electron. 2024, 70, 3548–3561. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Lu, J.; Zhang, H.; Huang, Z.; Briso-Rodríguez, C.; Zhang, L. Experimental study on low-altitude UAV-to-ground propagation characteristics in campus environment. Comput. Netw. 2023, 237, 110055. [Google Scholar] [CrossRef]
  11. Bi, X.; Wang, R.; Jia, Q. On the Speed-Varying Range of Electric Vehicles in Time-Windowed Routing Problems With En-Route Partial Re-Charging. IEEE Trans. Consum. Electron. 2024, 70, 3650–3657. [Google Scholar] [CrossRef]
  12. Wang, W.; Xu, X.; Bilal, M.; Khan, M.; Xing, Y. UAV-Assisted Content Caching for Human-Centric Consumer Applications in IoV. IEEE Trans. Consum. Electron. 2024, 70, 927–938. [Google Scholar] [CrossRef]
  13. Zhang, L. Joint Energy Replenishment and Data Collection Based on Deep Reinforcement Learning for Wireless Rechargeable Sensor Networks. IEEE Trans. Consum. Electron. 2024, 70, 1052–1062. [Google Scholar] [CrossRef]
  14. Jin, Y.; Song, X.; Slabaugh, G.; Lucas, S. Partial Advantage Estimator for Proximal Policy Optimization. IEEE Trans. Games 2024, 1–10. [Google Scholar] [CrossRef]
  15. Moniruzzaman, M.; Yassine, A.; Hossain, M.S. Energizing Charging Services for Next-Generation Consumers E-Mobility With Reinforcement Learning and Blockchain. IEEE Trans. Consum. Electron. 2024, 70, 2269–2280. [Google Scholar] [CrossRef]
  16. Liu, R.Z.; Shen, Y.; Yu, Y.; Lu, T. Revisiting of AlphaStar. IEEE Trans. Games 2024, 16, 317–330. [Google Scholar] [CrossRef]
  17. Li, S.; Xu, J.; Dong, H.; Yang, Y.; Yuan, C.; Sun, P.; Han, L. The Fittest Wins: A Multistage Framework Achieving New SOTA in ViZDoom Competition. IEEE Trans. Games 2024, 16, 225–234. [Google Scholar] [CrossRef]
  18. Huang, J.; Zhang, H.; Zhao, M.; Wu, Z. IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation. arXiv 2024, arXiv:2403.19336. [Google Scholar]
  19. Zhu, A.; He, H.; Yang, Y.; Zheng, Z.; Shao, J. Hands-Free: Action Abstraction With Hierarchical Reinforcement Learning in Text-Based Games. IEEE Trans. Consum. Electron. 2024, 1. [Google Scholar] [CrossRef]
  20. Cao, W.; Zhang, D.; Feng, G. Resilient semi-global finite-time cooperative output regulation of heterogeneous linear multi-agent systems subject to denial-of-service attacks. Automatica 2025, 173, 112099. [Google Scholar] [CrossRef]
  21. Zhang, D.; Chen, H.; Lu, Q.; Deng, C.; Feng, G. Finite-time cooperative output regulation of heterogeneous nonlinear multi-agent systems under switching DoS attacks. Automatica 2025, 173, 112062. [Google Scholar] [CrossRef]
  22. Chen, Y.; Li, Y.; Huang, X.; Wang, Y.; Zhu, G.; Min, G.; Li, J. An Explainable Recommendation Method for Artificial Intelligence of Things Based on Reinforcement Learning With Knowledge Graph Inference. IEEE Trans. Consum. Electron. 2024, 1. [Google Scholar] [CrossRef]
  23. Bebortta, S.; Sekhar Tripathy, S.; Bhatia Khan, S.; Al Dabel, M.M.; Almusharraf, A.; Kashif Bashir, A. TinyDeepUAV: A Tiny Deep Reinforcement Learning Framework for UAV Task Offloading in Edge-Based Consumer Electronics. IEEE Trans. Consum. Electron. 2024, 70, 7357–7364. [Google Scholar] [CrossRef]
  24. Wang, S.; Cao, H.; Yang, L.; Garg, S.; Kaddoum, G.; Alrashoud, M. GCN-Based Multi-Agent Deep Reinforcement Learning for Dynamic Service Function Chain Deployment in IoT. IEEE Trans. Consum. Electron. 2024, 70, 6105–6118. [Google Scholar] [CrossRef]
  25. Li, H.; He, H. Multiagent Trust Region Policy Optimization. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 12873–12887. [Google Scholar] [CrossRef] [PubMed]
  26. Zhang, R.; Xia, H.; Chen, Z.; Kang, Z.; Wang, K.; Gao, W. Computation Cost-Driven Offloading Strategy Based on Reinforcement Learning for Consumer Devices. IEEE Trans. Consum. Electron. 2024, 70, 4120–4131. [Google Scholar] [CrossRef]
  27. Jabbari, A.; Khan, H.; Duraibi, S.; Budhiraja, I.; Gupta, S.; Omar, M. Energy Maximization for Wireless Powered Communication Enabled IoT Devices With NOMA Underlaying Solar Powered UAV Using Federated Reinforcement Learning for 6G Networks. IEEE Trans. Consum. Electron. 2024, 70, 3926–3939. [Google Scholar] [CrossRef]
  28. Zeng, P.; Li, H.; He, H.; Li, S. Dynamic Energy Management of a Microgrid Using Approximate Dynamic Programming and Deep Recurrent Neural Network Learning. IEEE Trans. Smart Grid 2019, 10, 4435–4445. [Google Scholar] [CrossRef]
  29. Tavares, A.R.; Vieira, D.K.S.; Negrisoli, T.; Chaimowicz, L. Algorithm Selection in Adversarial Settings: From Experiments to Tournaments in StarCraft. IEEE Trans. Games 2019, 11, 238–247. [Google Scholar] [CrossRef]
  30. Zha, Z.; Wang, B.; Tang, X. Evaluate, explain, and explore the state more exactly: An improved Actor-Critic algorithm for complex environment. Neural Comput. Appl. 2023, 35, 1–12. [Google Scholar] [CrossRef]
  31. Yang, D.; Yang, K.; Wang, Y.; Liu, J.; Xu, Z.; Yin, R.; Zhai, P.; Zhang, L. How2comm: Communication-efficient and collaboration-pragmatic multi-agent perception. Adv. Neural Inf. Process. Syst. 2024, 36, 25151–25164. [Google Scholar]
  32. Guo, X.G.; Liu, P.M.; Wu, Z.G.; Zhang, D.; Ahn, C.K. Hybrid Event-Triggered Group Consensus Control for Heterogeneous Multiagent Systems With TVNUD Faults and Stochastic FDI Attacks. IEEE Trans. Autom. Control 2023, 68, 8013–8020. [Google Scholar] [CrossRef]
  33. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv 2017, arXiv:1706.02275. [Google Scholar]
  34. Seraj, E.; Xiong, J.; Schrum, M.; Gombolay, M. Mixed-initiative multiagent apprenticeship learning for human training of robot teams. In Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  35. Hong, J.; Levine, S.; Dragan, A. Learning to influence human behavior with offline reinforcement learning. arXiv 2024, arXiv:2303.02265. [Google Scholar]
  36. Sarkar, S.; Naug, A.; Guillen, A.; Luna, R.; Gundecha, V.; Babu, A.R.; Mousavi, S. Sustainability of Data Center Digital Twins with Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, 20–27 February 2024; Volume 38, pp. 23832–23834. [Google Scholar]
  37. Wydmuch, M.; Kempka, M.; Jaśkowski, W. ViZDoom Competitions: Playing Doom From Pixels. IEEE Trans. Games 2019, 11, 248–259. [Google Scholar] [CrossRef]
  38. Ren, X.; Lai, C.S.; Guo, Z.; Taylor, G. Eco-Driving With Partial Wireless Charging Lane at Signalized Intersection: A Reinforcement Learning Approach. IEEE Trans. Consum. Electron. 2024, 70, 6547–6559. [Google Scholar] [CrossRef]
  39. Zhang, L.; Zhang, Y.; Lu, J.; Xiao, Y.; Zhang, G. Deep Reinforcement Learning Based Trajectory Design for Customized UAV-Aided NOMA Data Collection. IEEE Wirel. Commun. Lett. 2024, 13, 3365–3369. [Google Scholar] [CrossRef]
  40. Zhang, H.; Zhang, G.; Zhao, M.; Liu, Y. Load Forecasting-Based Learning System for Energy Management With Battery Degradation Estimation: A Deep Reinforcement Learning Approach. IEEE Trans. Consum. Electron. 2024, 70, 2342–2352. [Google Scholar] [CrossRef]
  41. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
Figure 1. Agent–environment interaction in DRL for ViZDoom. The figure illustrates the interaction mechanism between the multi-agent system and the environment. Based on the current state, the system generates each agent's action and influences the environment by executing the joint action. The environment then transitions to a new state and returns the corresponding rewards, forming a complete feedback control loop. The environment panel in the figure shows a game scene from the training environment.
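To make the loop in Figure 1 concrete, the following minimal sketch runs one episode of the state–action–reward cycle through the ViZDoom Python API; the scenario configuration file, the frame skip, and the random placeholder policy are illustrative assumptions, not the settings used in this work.

```python
import random
from vizdoom import DoomGame

# Set up the ViZDoom environment (the scenario config is a placeholder).
game = DoomGame()
game.load_config("basic.cfg")  # assumed scenario; the paper's map may differ
game.init()

# A joint action is one boolean per available button (e.g., move left, move right, attack).
n_buttons = game.get_available_buttons_size()
actions = [[i == j for j in range(n_buttons)] for i in range(n_buttons)]

game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()              # observation: screen buffer and game variables
    action = random.choice(actions)       # placeholder policy; MA-PPO would decide here
    reward = game.make_action(action, 4)  # execute for 4 tics and receive the reward
print("Episode return:", game.get_total_reward())
game.close()
```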
Figure 2. The loss function L^clip(θ) as a function of the ratio p_t(θ) for positive and negative values of the advantage function A. When the advantage A is positive, L^clip(θ) is also positive: it grows linearly on the interval p_t(θ) ∈ [0, 1 + ϵ] and remains constant on [1 + ϵ, +∞). Conversely, when the advantage A is negative, L^clip(θ) is negative: it remains constant on [0, 1 − ϵ] and decreases linearly on [1 − ϵ, +∞).
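The behaviour described in the caption of Figure 2 matches the standard PPO clipped surrogate objective, restated here in the paper's notation for reference (the sign convention of the implemented loss may differ):

```latex
p_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{clip}}(\theta) = \mathbb{E}_t\left[ \min\left( p_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}\left(p_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]
```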
Figure 3. The MA-PPO network architecture. The architecture consists of a critic network and an actor network, both combining convolutional and fully connected layers. The critic network outputs a value estimate that quantifies the utility of the current state, while the actor network outputs a probability distribution over the actions in the model's action space, providing the basis for decision-making. The two networks are optimized jointly, continuously adjusting the policy parameters and value estimates to improve algorithm performance.
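As a rough illustration of the convolutional-plus-fully-connected structure in Figure 3, the PyTorch sketch below builds an actor head and a critic head. For brevity it shares one convolutional trunk, whereas the paper uses separate critic and actor networks, and all layer sizes, the input resolution, and the number of actions are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Illustrative actor-critic with convolutional and fully connected layers (sizes assumed)."""

    def __init__(self, in_channels: int = 3, n_actions: int = 3):
        super().__init__()
        # Shared convolutional feature extractor (the paper uses separate networks).
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = self._feature_dim(in_channels)
        # Actor head: logits over the discrete action space.
        self.actor = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                   nn.Linear(256, n_actions))
        # Critic head: scalar state-value estimate.
        self.critic = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                    nn.Linear(256, 1))

    def _feature_dim(self, in_channels: int, height: int = 84, width: int = 84) -> int:
        # Infer the flattened feature size by passing a dummy frame through the trunk.
        with torch.no_grad():
            return self.features(torch.zeros(1, in_channels, height, width)).shape[1]

    def forward(self, obs: torch.Tensor):
        x = self.features(obs)
        return Categorical(logits=self.actor(x)), self.critic(x)
```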
Figure 4. Evaluation of the baselines in the ViZDoom environment. (a) Average reward of MA-A2C, A2C, and DQN during training. (b) Average reward of MA-PPO, PPO, and DQN during training. (c) Average reward of the proposed algorithm compared with all baselines. The legend is the same as in (a,b).
Figure 5. Visualized results of MA-PPO and the other baselines. The figure shows the actual performance of each trained model, rendered frame by frame for comparison. Task-completion efficiency is reflected by a low step count combined with a high reward. From top to bottom: MA-PPO, PPO, MA-A2C, DQN, and A2C.
Figure 6. Comparison of training performance under different (γ_A, γ_V) combinations. In the legend, a denotes γ_A and v denotes γ_V. (a) Convergence curves for varying γ_A with γ_V = 0.98. (b) Convergence curves for varying γ_V with γ_A = 0.98. (c) Heatmap of the final training performance under different (γ_A, γ_V) combinations.
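Figure 6 sweeps the two discount factors γ_A (advantage) and γ_V (target value) separately. As a point of reference, the NumPy sketch below shows one common way such a pair of discounts could enter a GAE-style advantage and a discounted value target; this is an assumed formulation for illustration, and the authors' exact estimator may differ.

```python
import numpy as np

def advantages_and_targets(rewards, values, last_value,
                           gamma_a=0.98, gamma_v=0.98, lam=0.95):
    """GAE-style advantages (discount gamma_a) and discounted value targets (gamma_v).

    rewards: rewards of one trajectory, shape (T,); values: critic estimates, shape (T,);
    last_value: bootstrap value for the state reached after the final step.
    """
    T = len(rewards)
    values_ext = np.append(values, last_value)

    # Advantage: exponentially weighted sum of TD errors, discounted by gamma_a.
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma_a * values_ext[t + 1] - values_ext[t]
        gae = delta + gamma_a * lam * gae
        advantages[t] = gae

    # Target value: discounted return bootstrapped from last_value with gamma_v.
    targets = np.zeros(T)
    running = last_value
    for t in reversed(range(T)):
        running = rewards[t] + gamma_v * running
        targets[t] = running
    return advantages, targets
```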
Table 1. List of notations.
Symbol | Definition
P | State transition probability function
γ | Discount factor
I, i | Set of agents; agent index
T, t | Termination time; time slot
R, r_t | Reward function; reward at time slot t
r_move, r_shoot | Penalty of movement and shooting
r_hit | Reward of hitting the target
S, s_t | State space set; state at time slot t
A, a_t | Action space set; action made by the agent at time slot t
A_i | Action space set of agent i
a_stay, a_left, a_right | Stay still, move to the left or right
a_hold, a_shoot | Hold a weapon without firing, fire
r_t^i, s_t^i, a_t^i | Reward, state, or action at time slot t of agent i
τ | Trajectory
V_target | Target value
Θ | Parameter of the critic network
V_Θ(s_t) | Value calculated by the critic network with Θ
L^VF(Θ) | Loss function of the critic network with Θ
Â_t | Estimated value of the advantage function
δ_t | Measurement of the loss estimated by the critic network
θ | Parameter of the actor network
θ_old | Old parameter of the actor network before updating
π_θ | Policy of the actor network
π_θ_old | Old policy of the actor network before updating
π_θ(a_t ∣ s_t) | Probability of policy π_θ taking action a_t at state s_t
p_t(θ) | Ratio of π_θ to π_θ_old in taking action a_t at state s_t
L^clip(θ) | Loss function for updating the actor network
p^clip(θ) | Value of p_t(θ) after the clip operation
Table 2. Parameters of the MA-PPO training settings.
Parameter | Description | Value
β_lr | Learning rate | [1 × 10^−7, 1 × 10^−5]
γ_V | Discount factor for calculating the target value | 0.98
γ_A | Discount factor for calculating the advantage function | 0.98
λ | Bias between 0 and 1 | 0.95
ϵ | Range of the clip operation | 0.2
η | Training epochs | 1000
ξ | Training steps | 2000
υ | Repeat training times | 10
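For reproducibility, the settings in Table 2 can be gathered into a single configuration object; the sketch below merely mirrors the table, and the field names (and the interpretation of the step count as steps per epoch) are our own assumptions.

```python
from dataclasses import dataclass

@dataclass
class MAPPOConfig:
    """Training settings mirroring Table 2 (field names are illustrative)."""
    lr_min: float = 1e-7         # lower bound of the learning-rate range beta_lr
    lr_max: float = 1e-5         # upper bound of the learning-rate range beta_lr
    gamma_v: float = 0.98        # discount factor for the target value
    gamma_a: float = 0.98        # discount factor for the advantage function
    lam: float = 0.95            # bias parameter lambda between 0 and 1
    clip_eps: float = 0.2        # range epsilon of the clip operation
    epochs: int = 1000           # training epochs eta
    steps_per_epoch: int = 2000  # training steps xi (assumed per epoch)
    update_repeats: int = 10     # repeat training times upsilon

config = MAPPOConfig()
```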
Table 3. ViZDoom testing results for MA-PPO and other baselines.
Scheme | Average Reward ↑ | Average Steps ↓ | Ammunition (AMMO) ↑
PPO | 56 | 35 | 47
A2C | −109 | 165 | 40
DQN | 36 | 55 | 47
MA-A2C | 47 | 44 | 47
MA-PPO (Ours) | 76 | 20 | 48
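The metrics in Table 3 (average reward, average steps, and remaining ammunition) could be collected with an evaluation loop along the following lines; the ammunition game variable (AMMO2), the frame skip, and the number of evaluation episodes are assumptions rather than the authors' exact protocol.

```python
import vizdoom as vzd

def evaluate(game, policy, episodes=100):
    """Average episode reward, step count, and remaining ammunition over several episodes."""
    totals = {"reward": 0.0, "steps": 0.0, "ammo": 0.0}
    for _ in range(episodes):
        game.new_episode()
        steps, ammo = 0, 0.0
        while not game.is_episode_finished():
            action = policy(game.get_state())  # trained policy maps a state to a button list
            game.make_action(action, 4)        # assumed frame skip of 4 tics
            ammo = game.get_game_variable(vzd.GameVariable.AMMO2)  # AMMO2 is assumed
            steps += 1
        totals["reward"] += game.get_total_reward()
        totals["steps"] += steps
        totals["ammo"] += ammo
    return {k: v / episodes for k, v in totals.items()}
```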