Article

Assisted-Value Factorization with Latent Interaction in Cooperative Multi-Agent Reinforcement Learning

1 College of Management Science, Chengdu University of Technology, Chengdu 610059, China
2 School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
3 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(9), 1429; https://doi.org/10.3390/math13091429
Submission received: 7 March 2025 / Revised: 4 April 2025 / Accepted: 16 April 2025 / Published: 27 April 2025
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

With the development of value decomposition methods, multi-agent reinforcement learning (MARL) has made significant progress in balancing autonomous decision making with collective cooperation. However, the collaborative dynamics among agents are continuously changing. The current value decomposition methods struggle to adeptly handle these dynamic changes, thereby impairing the effectiveness of cooperative policies. In this paper, we introduce the concept of latent interaction, upon which an innovative method for generating weights is developed. The proposed method derives weights from the history information, thereby enhancing the accuracy of value estimations. Building upon this, we further propose a dynamic masking mechanism that recalibrates history information in response to the activity level of agents, improving the precision of latent interaction assessments. Experimental results demonstrate the improved training speed and superior performance of the proposed method in both a multi-agent particle environment and the StarCraft Multi-Agent Challenge.

1. Introduction

Collaborative multi-agent reinforcement learning has been an important topic of study in recent years [1,2], with significant advancements in several fields, such as autonomous driving [3], drone swarms [4], robotic swarm control [5], and computer games [6]. The problem of partial observability is one of the primary challenges of multi-agent collaboration. A paradigm of centralized training with decentralized execution (CTDE) has been proposed, allowing for independent decision making by agents while being trained in a centralized manner [7,8]. The key idea of CTDE is the centralized training of all agents’ policies with global information and the decentralized execution of independent policies that rely exclusively on local information [9]. One of the most popular MARL modeling techniques is the value decomposition method under the CTDE paradigm, which decomposes the global Q-value function $Q_{tot}$ into individual q-value functions $Q_i$ for each agent $i$.
However, partial observability hinders agents’ ability to estimate their individual contributions, while global state information does not adequately capture the cooperative interactions among agents. Centralized training compounds this problem: if an agent’s contribution is overestimated or underestimated at the beginning of training, the error tends to persist and grow. This implies that the significant contributions of some agents might not be adequately represented in the joint Q-value function, as shown in Figure 1. Assuming two agents contribute equally to the team, the q-value of Agent 1 is underestimated and the q-value of Agent 2 is overestimated. The evaluation of the overall contribution is captured by the joint Q-value $Q_{tot}$, whereas the individual q-values quantify the contributions of individual agents. In the joint Q-value, Agent 2, with its higher q-value, has a predominant influence, whereas Agent 1 has a comparatively minor influence. Upon updating the joint Q-value with a single shared reward, the high q-value of Agent 2 results in it receiving an increased share of the reward, thereby amplifying the overestimation of its contribution. The resulting shortfall in reward allocation exacerbates the underestimation of Agent 1’s contribution in the update.
To precisely determine the actual contribution of each agent, we introduce the concept of latent interaction to enhance the estimation of individual q-values. The latent interaction metric refines the assessment of each agent’s contribution by adjusting its q-value. Figure 2 depicts a case in which the q-value of Agent 2 is overestimated while the q-value of Agent 1 is underestimated. Latent interaction corrects these erroneous estimates, enabling both agents to accurately determine their respective q-values.
The partially observable agents require the estimation of value functions assisted by cooperative dynamic information, which is crucial to support the learning of optimal decentralized policies. In this paper, we propose a novel latent interaction value cooperation learning framework for value decomposition, which employs the history information of agents to generate latent interaction values. The misestimated individual q-value functions of agents are corrected by the latent interaction values, ensuring that these agents are not marginalized. Additionally, the framework incorporates a dynamic masking mechanism that evaluates the shifts of agents via their observations, eliminating history information that falls short of a specified threshold. This mechanism guarantees both streamlined input and an accurate assessment of latent interaction values. Our major contributions are summarized as follows:
  • In pursuit of effectively assessing the latent interaction of agents, we analyzed the correlation between observations and history information, introducing a dynamic masking mechanism. The mechanism is capable of excising history information that falls below the dynamic change threshold, ensuring the history information is succinct and precise.
  • We propose a novel latent interaction value (LIV) cooperation learning framework, which integrates history information to generate latent interaction values for each agent. The latent interaction values are used to directly correct the individual q-value functions, thereby enhancing the cooperation of agents.
  • We conducted a series of experiments to demonstrate the effectiveness of LIV in a multi-agent particle environment and the StarCraft Multi-Agent Challenge. The results show that it significantly outperforms state-of-the-art multi-agent reinforcement learning methods in terms of performance and learning speed.
In the rest of this paper, we first present the related works in Section 2. Then, we formulate the preliminaries of MARL in Section 3. Our main method, the LIV cooperation learning framework, is introduced and analyzed in detail in Section 4. Experimental results and ablation studies are described in Section 5. Finally, we conclude the paper in Section 6.

2. Related Works

The simplest and most convenient approach in MARL is to treat each agent as a single-agent reinforcement learning problem and to ignore the interactions between agents, which is known as the independent learners method [10,11]. Partially observable independent learners can utilize recurrent neural networks (RNNs) to learn superior individual policies from history information [12,13]. However, due to a lack of coordination, agents may fail to learn the optimal joint policy [14]. Multi-agent reinforcement learning attempts to share information in cooperation to achieve coordinated learning. Initially, agents engage in direct communication to share information, a process that somewhat enhances the learning of the joint policy [15,16,17]. Communication challenges such as limited capacity [18] and potential interference [19] result in the sharing of constrained or incomplete information [8].
The paradigm of CTDE has attracted significant attention since its introduction, allowing agents to share information without communication issues [7,8]. The fundamental principle of the CTDE paradigm lies in providing all agents with direct access to global information during centralized training, while ensuring independent decision making based on individual information during execution. Value-based methods in the CTDE paradigm are reflected in Value Decomposition Networks (VDN) [20]. VDN evenly decomposes the joint Q-value function $Q_{tot}$ into individual q-value functions $q_i$ for each agent $i$, addressing the "lazy agent" problem in multi-agent cooperation. Further, QMIX [21] decomposes the joint Q-value function through monotonic hypernetworks [22]. The monotonic hypernetwork ensures the joint Q-value function is monotonic in the individual q-value functions of the agents, thus maintaining consistency between the joint policy and the decentralized individual policies. However, the monotonicity limits the expressiveness of the joint Q-value function, thereby constraining the training of the joint policy. QTRAN [23] improves the value decomposition method by transforming the original joint Q-value function into a more easily decomposable function, thereby freeing it from the monotonic structural constraints imposed by QMIX. Weighted QMIX [24] reduces the learning weight of suboptimal actions, maintaining the expressiveness of the joint Q-value function while ensuring consistency between joint and individual policies. Qatten [25] delves into the theoretical relationship between $q_i$ and $Q_{tot}$, utilizing a multi-head attention mechanism for the decomposition of the joint Q-value function. QPLEX [26] introduces a dueling network structure to factorize the joint Q-value function. In summary, these approaches predominantly focus on building networks during training that capture present environmental information to augment observations, rather than on the cooperative dynamics reflected in history information.
The cornerstone of policy-based approaches in the CTDE paradigm lies in leveraging global information to enable the critic to more precisely assess the policies of agents. The most notable method is MADDPG [27], which trains the localized deterministic policies of agents through a centralized critic. Since the centralized critic considers information from all agents, an overload of redundant information limits its efficiency and accuracy. Multi-agent actor attention critic (MAAC) [28] enhances the efficiency and accuracy of the critic in MADDPG by integrating information with an attention mechanism. The large size of the attention network model in MAAC creates certain difficulties in its training process. COMA [29] quantifies the contribution of each agent by utilizing counterfactual baselines. Furthermore, certain methodologies incorporate the concept of value decomposition within policy-based learning. LICA [30] employs a hypernetwork to develop a centralized critic for agent evaluation, incorporating entropy regularization to enhance exploration. FACMAC [31] adapts QMIX for use in MADDPG, aiming to augment the precision of the centralized critic. DOP [32] decomposes the multi-agent actor–critic method in a similar manner. These policy-based MARL algorithms emphasize the refinement of information to bolster the evaluative capacity of the centralized critic. However, without adequate preprocessing or reduction of information, substantial enhancements are challenging to realize.

3. Preliminaries

3.1. Dec-POMDP

A fully cooperative multi-agent task can be modeled as a decentralized, partially observable Markov decision process (Dec-POMDP) [33], which consists of a tuple $G = \langle I, S, A, P, r, Z, O, n, \gamma \rangle$, where $s \in S$ is the true state of the environment. Each agent $i \in I \equiv \{1, 2, \ldots, n\}$ takes an action $a_i \in A$ at each time step, forming a joint action $\mathbf{a} = [a_i]_{i=1}^{n} \in \mathbf{A} \equiv A^n$. This leads to a state transition of the environment according to the state transition function $P(s' \mid s, \mathbf{a}): S \times \mathbf{A} \times S \rightarrow [0, 1]$, and all of the agents share a joint reward $r(s, \mathbf{a})$. We consider a partially observable setting in which each agent receives an individual partial observation $z \in Z$ according to the observation function $O(s, a)$. Each agent has its own action–observation history $\tau_i \in T \equiv (Z \times A)^{*}$, on which it conditions a stochastic policy $\pi_i(a_i \mid \tau_i): T \times A \rightarrow [0, 1]$. The joint policy $\boldsymbol{\pi} = \langle \pi_1, \pi_2, \ldots, \pi_n \rangle$ has a joint Q-value function $Q^{\boldsymbol{\pi}}(s_t, \mathbf{a}_t) = \mathbb{E}\left[ R_t \mid s_t, \mathbf{a}_t \right]$, where $t$ is the time step and $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted return with discount factor $\gamma$. The optimal team performance is obtained by maximizing the joint Q-value function $Q^{\boldsymbol{\pi}}(s_t, \mathbf{a}_t)$.
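As a concrete illustration of the discounted return defined above, the following minimal Python sketch (the function name and reward values are our own, purely illustrative) computes $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ for a finite episode by folding rewards backwards.

```python
from typing import Sequence

def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Compute R_t = sum_k gamma^k * r_{t+k} for a finite reward sequence
    starting at the current time step."""
    ret = 0.0
    # Iterating backwards folds each future reward into the running total once.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# Example: three shared team rewards observed from time t onward.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```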

3.2. Centralized Training with Decentralized Execution

In the CTDE paradigm, algorithms are able to access the global states of all agents to centrally train policies while enabling agents to make autonomous decisions based on their own observations. Within value-based methods, each agent $i$ independently makes a decision, drawing on its individual observation $o_i$ and action–observation history $\tau_i$, thereby producing the individual q-value function $Q_i(\tau_i, a_i)$. The objective of centralized training is realized by formulating a joint Q-value function $Q_{tot}$ for synchronized updates. In multi-agent reinforcement learning, the joint Q-value function quantifies the long-term return for the collective actions of all agents $\mathbf{a} = (a_1, a_2, \ldots, a_n)$ by integrating the global state information $s$ and the individual actions $a_i$, thereby guiding joint policy optimization. After training, although each agent $i$ takes action $a_i$ based solely on its local history $\tau_i$, it still benefits from the global perspective provided by the joint Q-value function $Q_{tot}(s, \mathbf{a})$. Consequently, the essence of value decomposition methods lies in decomposing the joint Q-value function $Q_{tot}$ into individual q-value functions $Q_i$ for each agent $i$, which can be simplified as follows:
$Q_{tot}(s, \mathbf{a}) = M_{\theta}\left(Q_1(\tau_1, a_1), \ldots, Q_n(\tau_n, a_n); s\right),$
where $\theta$ denotes the parameters of the centralized network $M$, and $s$ is the global state information. Additionally, to ensure consistency between the individual policies (derived from the individual q-values) and the joint policy (derived from the joint Q-value), the Individual-Global-Max (IGM) condition needs to be guaranteed, which is expressed as follows:
$\arg\max_{\mathbf{a}} Q_{tot}(\boldsymbol{\tau}, \mathbf{a}) = \left( \arg\max_{a_1} Q_1(\tau_1, a_1), \ldots, \arg\max_{a_n} Q_n(\tau_n, a_n) \right).$
VDN [20] and QMIX [21], as two established value decomposition methods, introduce the two principal forms of value decomposition structure: additivity and monotonicity. VDN decomposes the joint Q-value function into the sum of the individual q-value functions of the agents. QMIX, on the other hand, combines the individual q-value function of each agent through a state-dependent continuous monotonic function. A simple example of the value decomposition method within the centralized training and decentralized execution paradigm is depicted in Figure 3.
Within a value decomposition structure, the individual q-value function of each agent can be updated through the joint Q-value function. The joint Q-value function is trained with the environment reward r in the form of DQN [34]:
$Q_{tot}(s, \mathbf{a}) = r + \gamma \max_{\mathbf{a}'} Q_{tot}(s', \mathbf{a}') = r + \gamma \max_{\mathbf{a}'} M_{\theta}\left(Q_1(\tau_1', a_1'), \ldots, Q_n(\tau_n', a_n'); s'\right),$
where $\gamma$ is the discount factor. The centralized network $M$, along with the individual q-value functions, undergoes simultaneous updating with the joint Q-value function via backpropagation.
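To make the monotonic decomposition concrete, the following PyTorch sketch shows a QMIX-style mixing network in which the state-conditioned weights are forced to be non-negative with an absolute value, so that $\partial Q_{tot} / \partial Q_i \geq 0$ and the IGM condition is respected. Layer names and dimensions are our own assumptions for illustration, not the implementation used in the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: hypernetworks map the state s to the weights of a
    small network that combines the individual q-values into Q_tot."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, q_values: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # q_values: (batch, n_agents); state: (batch, state_dim)
        b = q_values.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(q_values.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)  # (batch, 1): Q_tot(s, a)
```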

4. Methods

In this section, we elaborate, in detail, on the newly introduced latent interaction value (LIV) framework for cooperation learning. This framework aims to improve the estimation accuracy of the individual q-value functions through the integration of latent interaction values. The LIV cooperation learning framework is principally composed of two innovative components: (1) the latent interaction network, and (2) the dynamic masking mechanism. To begin with, we detail the methodology for generating latent interaction values and analyze their interplay with individual q-value functions. Next, we assess the importance of removing superfluous information in the latent interaction values’ generation process and describe the operational principles of the dynamic masking mechanism. Finally, we provide a comprehensive overview of the implementation protocols and updating strategies for the LIV cooperation learning framework.

4.1. Agent Network

Firstly, we introduce the architecture of the agent network, as shown in the brown section of Figure 4c. The agent network consists of two fully connected multi-layer perceptrons (MLPs) with a gated recurrent unit (GRU). The first-layer MLP handles the information collected by the agent, which includes its current observation $o_t$ and the action $a_{t-1}$ performed in the previous step. The features derived from the first-layer MLP provide insights not only into the environmental conditions from the agent’s perspective, but also contain state information of other agents. After that, these features are passed to the GRU module, which is dedicated to capturing the temporal correlations, dynamic transitions, and historical dependencies between time steps. The input of the GRU consists of the features from the first-layer MLP and the hidden state $\tau_{t-1}$ from the previous time step. The output of the GRU is the hidden state $\tau_t$ at the current time step. By continually updating the hidden state over time, the GRU module preserves the agent’s history information, thereby enhancing decision-making effectiveness in complex and dynamic environments. Finally, the second-layer MLP is responsible for converting the hidden state output by the GRU module into specific policy or value distributions. This process directs the agent to independently select actions based on the current state, generating the corresponding individual q-value function $q_t^i$. In the training phase, the individual q-value function $q_t^i$ is fed into the mixing network for optimization, while the hidden state $\tau_t^i$ is used to produce the latent interaction values.
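The following PyTorch sketch mirrors the MLP–GRU–MLP structure described above; the layer dimensions and the one-hot encoding of the previous action are assumptions made for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    """Agent network: first-layer MLP -> GRU cell -> second-layer MLP."""
    def __init__(self, input_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)    # processes [o_t, a_{t-1}]
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)  # carries the hidden state tau_t
        self.fc2 = nn.Linear(hidden_dim, n_actions)    # maps tau_t to per-action q-values

    def forward(self, obs_and_last_action: torch.Tensor, tau_prev: torch.Tensor):
        x = torch.relu(self.fc1(obs_and_last_action))
        tau_t = self.gru(x, tau_prev)   # updated history information
        q_t = self.fc2(tau_t)           # individual q-values for every action
        return q_t, tau_t

# Usage: the input concatenates the observation (30 dims assumed) with a
# one-hot encoding of the previous action (5 actions assumed).
net = AgentNetwork(input_dim=30 + 5, n_actions=5)
tau = torch.zeros(1, 64)                       # initial hidden state
q, tau = net(torch.randn(1, 35), tau)          # one decision step
```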

4.2. Latent Interaction Generation

The generation of latent interaction values is handled by the latent interaction network, as depicted in Figure 4a. Initially, the network integrates the hidden states $\tau_t^i$ produced by the GRU module of each agent $i$, resulting in a joint hidden state $\tau_t$. Since the hidden state of agent $i$ has already been incorporated in its own independent decision making, the latent interaction value for agent $i$ is calculated by excluding its own hidden state from the joint hidden state, which is denoted as $\tau_t^{-i}$. For each agent $i$, the latent interaction network removes that agent’s hidden state from the joint hidden state, yielding the joint hidden state information for Agents 1 through $n$, denoted as $\tau_t^{-1}$ to $\tau_t^{-n}$. As shown in Figure 4a, each column in the figure represents the joint hidden state information of an agent after excluding its own hidden state. The joint hidden state is subsequently passed through the dynamic masking mechanism, resulting in the processed joint hidden state information $\hat{\tau}_t^{-i}$. Finally, the processed joint hidden state information is fed into a fully connected network to generate the latent interaction value for agent $i$. The latent interaction value generation process can be represented mathematically as follows:
$L_t^i = f_{\phi}(\hat{\tau}_t^{-i}),$
where $\phi$ represents the parameters of the network $f$, and $\hat{\tau}_t^{-i}$ refers to the joint hidden state information processed by the dynamic masking mechanism (which is discussed in detail in the next subsection). It is important to note that the latent interaction values are strictly non-negative, which is based on the assumption that interactions between any agents make a positive contribution to the overall team. Additionally, this requirement ensures that the value decomposition method adheres to the IGM condition.
The latent interaction value $L_t^i$ of agent $i$ and its individual q-value function $q_t^i$ are both input into the mixing network for training in order to adjust the evaluation of the individual q-value function and train the model accordingly. Specifically, when the latent interaction value is between 0 and 1, it imposes a penalty on the individual q-value function; when it exceeds 1, it serves as an incentive. Latent interaction values are dynamically adjusted in response to changes in the history information, which allows agents to receive appropriate feedback even if their estimation of the individual q-value function is excessively high or low. This adjustment mechanism is neither a static, unchanging penalty nor an irreversible evaluation, but rather a dynamic process designed to motivate agents to continuously improve and adapt.
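A minimal sketch of this latent interaction computation is given below. The masking step is applied here as a pre-computed mask (its construction is sketched in the next subsection), and the use of a softplus output to keep $L_t^i$ non-negative, as well as the layer sizes, are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentInteractionNet(nn.Module):
    """Produces one non-negative latent interaction value L_t^i per agent from
    the (masked) joint hidden state with agent i's own hidden state excluded."""
    def __init__(self, n_agents: int, hidden_dim: int = 64):
        super().__init__()
        self.n_agents = n_agents
        # Input: concatenated hidden states of the other (n_agents - 1) agents.
        self.fc = nn.Sequential(
            nn.Linear((n_agents - 1) * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, tau: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # tau:  (batch, n_agents, hidden_dim) joint hidden state
        # mask: (batch, n_agents, 1) dynamic mask, 1 = keep, 0 = masked
        tau = tau * mask                                    # processed joint hidden state
        values = []
        for i in range(self.n_agents):
            others = torch.cat([tau[:, j] for j in range(self.n_agents) if j != i], dim=-1)
            values.append(F.softplus(self.fc(others)))      # non-negative L_t^i
        return torch.cat(values, dim=-1)                    # (batch, n_agents)
```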

4.3. Dynamic Masking Mechanism

The joint hidden state $\tau_t$ includes the agent’s historical observations, the actions taken, and the history information of other agents. Although this information plays a critical role in global optimization during multi-agent cooperation, a significant portion of it is redundant, containing repetitive or irrelevant information in the current context. The excessive accumulation of such redundant information not only increases computational complexity, but also interferes with the generation of latent interaction values, thus diminishing learning efficiency and, in extreme cases, leading to a decline in cooperation performance. Therefore, the challenge addressed by the mechanism presented in this section is how to streamline the information while maintaining learning efficiency.
The dynamic masking mechanism is centered on the introduction of a threshold $\varsigma$, which guides the decision of whether hidden state information should be preserved or masked. Specifically, this threshold acts as a metric to assess whether there has been a significant change in an agent’s observations. If the change surpasses the threshold, the information is considered beneficial for the current decision and is retained; otherwise, the hidden state is masked. In implementation, the decision is made by evaluating the difference between the observation vectors of agent $i$ at two consecutive time steps, $o_{t-1}^i$ and $o_t^i$, as follows:
$\Delta o^i(t) = o_t^i - o_{t-1}^i.$
This difference reflects the environmental changes experienced by agent $i$ between the two time points, providing a basis for the subsequent assessment of latent interaction values. The dynamic masking mechanism defines a condition for retaining an agent’s hidden state information. If the norm of the observation difference $\|\Delta o^i(t)\|$ exceeds the preset threshold $\varsigma$, it indicates a significant change in the agent’s observations, and the corresponding hidden state information is retained. If the difference is smaller than or equal to $\varsigma$, the information is masked. This process can be formalized by the following indicator function:
$\delta^i(t) = \begin{cases} 1, & \text{if } \|\Delta o^i(t)\| > \varsigma \\ 0, & \text{if } \|\Delta o^i(t)\| \leq \varsigma \end{cases},$
where $\delta^i(t) = 1$ means the hidden state information is preserved, and $\delta^i(t) = 0$ means it is masked. At any time $t$, the dynamic masking matrix $\delta(t) = [\delta^1(t), \delta^2(t), \ldots, \delta^n(t)]$, which is created by the dynamic masking mechanism, is element-wise multiplied with the joint hidden state $\tau_t$ to obtain the processed joint hidden state $\hat{\tau}_t$:
$\hat{\tau}_t = \delta(t) \cdot \tau_t.$
This allows for the dynamic adjustment of the hidden state information, thereby maintaining a balance between model size and learning efficiency.
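A short sketch of this masking rule, assuming the Euclidean norm for $\|\Delta o^i(t)\|$ (the text does not specify which norm is used), is shown below.

```python
import torch

def dynamic_mask(obs_t: torch.Tensor, obs_prev: torch.Tensor,
                 threshold: float = 0.0) -> torch.Tensor:
    """Return the dynamic masking indicators delta(t): 1 keeps an agent's
    hidden state, 0 masks it, based on how much its observation changed.

    obs_t, obs_prev: (batch, n_agents, obs_dim)
    """
    delta_o = obs_t - obs_prev                       # Delta o^i(t)
    change = delta_o.norm(dim=-1, keepdim=True)      # ||Delta o^i(t)|| per agent
    return (change > threshold).float()              # (batch, n_agents, 1)

# Usage with the latent interaction network sketched above:
#   mask = dynamic_mask(obs_t, obs_prev, threshold=0.0)
#   tau_hat = mask * tau    # element-wise masking of the joint hidden state
```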

4.4. LIV Cooperation Learning Framework

In this subsection, the various components are combined to form the LIV cooperation learning framework, which is designed to assist in refining the individual q-value functions. The overall structure of the LIV cooperation learning framework is shown in Figure 4b. The framework consists of three primary modules: the agent network module, the latent interaction network module, and the mixing network module. The entire process is divided into four distinct steps, which are elaborated below.
Step 1—Independent decision making: The agent network architecture, which is shown in the brown section of Figure 4c, consists of two fully connected multi-layer perceptrons (MLPs) and a gated recurrent unit (GRU). The first MLP receives the current observation $o_t^i$ and the action $a_{t-1}^i$ from the previous time step of each agent $i$ to extract features. The GRU combines the extracted features with the hidden state $\tau_{t-1}^i$ from the previous time step to generate the current hidden state $\tau_t^i$. Based on the current hidden state $\tau_t^i$, each agent independently makes a decision, selects an action $a_t^i$, and generates the individual q-value function $q_t^i(\tau_t^i, a_t^i)$. Notably, the agent network employs a parameter-sharing mechanism, meaning that the inputs of all agents are processed through the same network, and each agent outputs its respective hidden state and individual q-value function. In the training phase, the hidden states of all agents are input into the latent interaction network, while the individual q-value functions are fed into the mixing network.
Step 2—Latent interaction value generation: Latent interaction values are generated in the latent interaction network, as shown in Figure 4a, where the decision process of the dynamic masking mechanism is depicted in the dashed box. Firstly, the network integrates the hidden state information of all agents into the joint hidden state $\tau_t$. Secondly, for each agent $i$, the joint hidden state excluding that agent’s own hidden state, represented as $\tau_t^{-i}$, is used to generate its latent interaction value. Next, the dynamic masking mechanism filters out the inactive parts of the joint hidden state, resulting in the processed joint hidden state $\hat{\tau}_t^{-i}$. Finally, the processed information is input into the MLP to generate the corresponding latent interaction value $L_t^i$ for agent $i$.
Step 3—Correcting the q-value functions: The mixing network architecture, as shown in the upper part of Figure 4c, receives the individual q-value functions from the agent network and the latent interaction values from the latent interaction network, and it then outputs the overall joint Q-value function $Q_{tot}$. The latent interaction value $L_t^i$ of each agent acts as a multiplicative factor and directly modifies its corresponding individual q-value function:
$\hat{q}_t^i(\tau_t, a_t^i) = L_t^i(\tau_t^{-i}) \cdot q_t^i(\tau_t^i, a_t^i).$
Subsequently, the corrected individual q-value functions $\hat{q}_t^i$ are processed through monotonic decomposition, where the global state information is incorporated to generate the joint Q-value function. To ensure that the Individual-Global-Max (IGM) condition is satisfied, all of the weights generated from the global state information in the network are strictly constrained to be non-negative.
$Q_{tot}(s, \mathbf{a}) = M_{\theta}\left(\hat{q}_t^1(\tau_t, a_t^1), \ldots, \hat{q}_t^n(\tau_t, a_t^n); s\right) = w(s) \times \left[\hat{q}_t^1(\tau_t, a_t^1), \ldots, \hat{q}_t^n(\tau_t, a_t^n)\right] + b(s),$
where $M$ represents a nonlinear hypernetwork with parameters $\theta$, and its structure is illustrated in green in the upper part of Figure 4c.
Step 4—Framework updating: The joint Q-value function is optimized in the manner of the Double Deep Q-Network (DDQN) [35]. To improve the stability of the learning process, a dual-network framework is employed [36]. The current network, with its parameters denoted by $\theta$, is responsible for the current learning. The target network, with parameters $\theta^{-}$, is used specifically for calculating the joint Q-value estimates for the next state. Finally, the update is performed by minimizing the temporal difference (TD) loss:
$\mathcal{L}(\theta) = \mathbb{E}_{\mathcal{D}}\left[\left(y_{tot} - Q_{tot}(s, \mathbf{a}; \theta)\right)^2\right],$
where $y_{tot} = r + \gamma \max_{\mathbf{a}'} Q_{tot}(s', \mathbf{a}'; \theta^{-})$ represents the target value, $r$ is the reward, and $\gamma$ is the discount factor. The mixing network, latent interaction network, and agent network are updated synchronously through backpropagation. The overall architecture is illustrated in Figure 4b. LIV enables multi-agent systems to optimize the joint policy during the training phase by leveraging shared information while ensuring independent decision making in the execution phase. The proposed method can be applied in domains such as robotics, transportation, and energy management, demonstrating its effectiveness in coordinating complex tasks. Moreover, Algorithm 1 provides a detailed description of the LIV cooperation learning framework.
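The sketch below ties Steps 3 and 4 together: it applies the multiplicative correction, forms the TD target from a separate target network following the $y_{tot}$ expression above, and performs the periodic hard update of $\theta^{-}$. Function and variable names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def corrected_q(q_values: torch.Tensor, latent_values: torch.Tensor) -> torch.Tensor:
    """Step 3: q_hat_t^i = L_t^i * q_t^i, applied per agent (element-wise)."""
    return latent_values * q_values            # both (batch, n_agents)

def td_loss(q_tot: torch.Tensor, q_tot_target_next: torch.Tensor,
            reward: torch.Tensor, terminated: torch.Tensor,
            gamma: float = 0.99) -> torch.Tensor:
    """Step 4: mean squared TD error (y_tot - Q_tot(s, a; theta))^2.

    q_tot:             Q_tot(s, a; theta) from the current network, (batch, 1)
    q_tot_target_next: max_a' Q_tot(s', a'; theta^-) from the target network, (batch, 1)
    """
    y_tot = reward + gamma * (1.0 - terminated) * q_tot_target_next.detach()
    return ((y_tot - q_tot) ** 2).mean()

def hard_update(target_net: nn.Module, net: nn.Module) -> None:
    """Copy theta into theta^- every update-interval steps."""
    target_net.load_state_dict(net.state_dict())
```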
Algorithm 1 LIV Cooperation Learning Framework
1:  Initialize the network parameters $\theta$; set the target network parameters $\theta^{-} = \theta$
2:  Set the learning rate $\alpha$ and the maximum time step $step_{max}$; initialize the replay buffer $\mathcal{D} = \varnothing$, $step = 0$
3:  while $step < step_{max}$ do
4:      $t = 0$, $s_0$ = initial state
5:      while $s_t \neq$ terminal and $t <$ episode limit do
6:          for each agent $i$ do
7:              $\tau_t^i = \tau_{t-1}^i \cup \{o_t^i, a_{t-1}^i\}$
8:              Select action $a_t^i$ by $\varepsilon$-greedy
9:              Generate the individual q-value function $q_t^i(\tau_t^i, a_t^i)$
10:         end for
11:         Get reward $r_t$ and next state $s_{t+1}$
12:         $\mathcal{D} = \mathcal{D} \cup \{(s_t, \mathbf{a}_t, r_t, s_{t+1})\}$
13:         $t = t + 1$, $step = step + 1$
14:     end while
15:     if $|\mathcal{D}| >$ batch size then
16:         $b \leftarrow$ random batch of episodes from $\mathcal{D}$
17:         for each time step $t$ in each episode in batch $b$ do
18:             Construct each agent $i$'s joint hidden state $\tau_t^{-i}$
19:             Construct each agent $i$'s processed joint hidden state $\hat{\tau}_t^{-i}$ by Equation (7)
20:             Generate the latent interaction values $L_t^i$ by Equation (4)
21:             Calculate the corrected individual q-value functions $\hat{q}_t^i$ by Equation (8)
22:             Calculate the joint Q-value functions $Q_{tot}(s, \mathbf{a}; \theta)$ by Equation (9)
23:             Calculate the target joint Q-value functions $Q_{tot}(s', \mathbf{a}'; \theta^{-})$
24:         end for
25:         $\Delta\theta = \nabla_{\theta} \mathcal{L}(\theta)$
26:         $\theta = \theta - \alpha \Delta\theta$
27:     end if
28:     if $step$ % update-interval $= 0$ then
29:         $\theta^{-} \leftarrow \theta$
30:     end if
31: end while

5. Experiment

To demonstrate the effectiveness of the proposed LIV cooperation learning framework, we conducted experiments in the multi-agent particle environment and the StarCraft Multi-Agent Challenge. We also verified the superiority of LIV in terms of parameter scale compared with algorithms that rely on more complex centralized network designs. Finally, we include ablation studies to analyze the contribution of each component of the proposed framework.

5.1. Experimental Environment

The experiment section aims to evaluate the cooperation performance across two multi-agent reinforcement learning environments, namely the Hard Multi-Agent Particle Environment (Hard-MPE) [37] and the StarCraft Multi-Agent Challenge (SMAC) [6]. MPE provides a simplified physical particle system, which is ideal for analyzing basic cooperation and competition among agents. Hard-MPE is the improved version of MPE, which increases complexity to further assess agents’ decision-making capabilities. SMAC is built upon the real-time strategy game StarCraft II, which provides an ideal environment for assessing agent performance in high-dimensional state spaces.
Hard Multi-Agent Particle Environment: MPE is a scenario library based on a simplified particle system and fundamental physical interactions. It incorporates tasks like tracking, cooperation, and competition to explore the dynamics of collaboration and competition among agents. Building upon this, Hard-MPE provides more challenging scenarios on top of MPE. Specifically, we used two typical cooperative scenarios from Hard-MPE: Coverage Control and Formation Control. These scenarios set different cooperation goals to encourage agents to work together to complete specific tasks. The Coverage Control scenario involves N agents and N landmarks, with the primary goal being to ensure that each landmark is occupied by one agent. The key challenge in this task is to establish an effective division of labor and cooperation among agents, where each agent independently decides which landmark to occupy based on partial observations while avoiding conflicts with other agents. The Formation Control scenario involves N agents and one landmark, with the task being to arrange the agents in a regular polygon around the landmark. In this scenario, agents need to maintain reasonable distances from each other and a fixed distance from the landmark to ensure the formation of a regular geometric shape. This requires agents to consider their own position, the positions of other agents, and the location of the landmark to achieve precise spatial coordination.
StarCraft Multi-Agent Challenge: SMAC, built upon the real-time strategy game StarCraft II, serves as a platform for evaluating the cooperative performance of multi-agent systems in intricate, dynamic, and uncertain environments. SMAC utilizes the application interface of StarCraft II, modeling agents as units in the game, and it requires them to collaborate in partially observable environments to tackle complex challenges. SMAC provides a wide range of task scenarios, including basic combat scenarios and intricate formations, as well as the detailed control of units. Six representative scenarios from SMAC were chosen to evaluate the cooperation performance of the different MARL methods. In these scenarios, agents are confronted with asymmetric formations, where they need to defend against the attacks of enemy units while quickly developing and implementing effective policies with limited resources. Scenarios are divided into three difficulty levels: easy, hard, and super-hard. The easy scenarios are characterized by simple unit control and enemy forces that are equal to or smaller than the agents. In the hard scenarios, unit control becomes more complex, or there are more enemies than agents. The super-hard scenarios combine these factors, with complex unit control and a significantly larger number of enemies. This design intensifies the complexity of decision making, particularly when agents collaborate in partially observable scenarios. Rewards are usually assigned based on the level of task completion, for example, 10 for defeating an enemy unit and 200 for defeating all enemies. It is important to note that rewards are given to the team as a whole, not to individual agents, with the aim of encouraging cooperation and collaboration.

5.2. Baseline Method

To validate the effectiveness of the LIV cooperation learning framework, we compared it with several widely used multi-agent value decomposition baseline methods. The baseline methods chosen for comparison included QMIX [21] and Weighted QMIX [24]. Furthermore, QPLEX [26] and GraphMIX [38], which employ complex network designs to capture additional environmental characteristics, were also included in the comparative analysis.
QMIX: A multi-agent reinforcement learning algorithm based on a monotonic hypernetwork that constructs a joint Q-value function from individual q-value functions. The main principle of QMIX lies in establishing an accurate mapping between individual and joint Q-value functions under monotonicity constraints. This design guarantees adherence to the Individual Global Maximization (IGM) condition, facilitating efficient coordination and control in multi-agent systems. Moreover, by integrating global state information into its mixing network, QMIX significantly improves the cooperation performance, leading to more accurate joint Q-value function estimates. We employed the original parameter configuration to validate the QMIX algorithm, guaranteeing extensive comparability.
Weighted QMIX: A method that enhances the QMIX algorithm. Unlike QMIX, which builds the joint Q-value function through a monotonic network, Weighted-QMIX employs a fully expressive network that is not limited by monotonicity constraints. Specifically, this method directly receives joint actions and the global state as input, and it then outputs the corresponding joint Q-value function. By discarding the monotonicity constraints, this design enables the neural network to express and learn more intricate and adaptable joint Q-value functions, improving flexibility. Weighted-QMIX updates the individual q-value functions by applying weights derived from the joint Q-value function. The method introduces two different weighting strategies: Optimistic Weighted and Central Weighted (referred to as OW-QMIX and CW-QMIX, respectively). The Optimistic Weighted strategy takes a positive view of the joint Q-value function, assuming that the q-value function of each agent contributes positively to the global optimum. The Central Weighted strategy focuses on the global states and joint Q-value functions, providing a more balanced weight distribution across the q-value functions. These strategies are designed to strengthen the relationship between the individual and joint Q-value functions from multiple perspectives, helping to enhance the cooperation performance and adaptability in various scenarios. We utilized the default parameter settings of the Weighted-QMIX algorithm, with the weight parameter $w$ being uniformly set to 0.75.
QPLEX: A multi-agent reinforcement learning method built on the Dueling DQN [36], which enhances decision-making efficiency by finely decomposing individual q-value functions into value functions and advantage functions. Specifically, the value functions evaluate the performance in a given state, while the advantage functions focus on quantifying the superiority of a specific action compared to the average action policy. This decomposition method helps distinguish between the inherent value of states and the relative importance of action choices, enhancing the expressiveness and generalization ability in complex environments. QPLEX introduces an attention mechanism to effectively incorporate the state value and advantage functions with global state information. The incorporation guarantees that the joint Q-value function calculation fully accounts for the intricate interactions between agents and the environment, yielding more precise joint Q-value estimation. We also employed the default QPLEX parameters, with the attention network closely adhering to the original specifications, to ensure consistency and comparability in the experiment.
GraphMIX: A graph neural network-based method designed to optimize cooperation and decision-making processes in multi-agent systems by building graph relationships between agents. GraphMIX utilizes an attention mechanism to generate an adjacency matrix, which models the intricate interactions between agents. It constructs a graph network that effectively combines the individual q-value functions with global state information, generating a joint Q-value function and additional reward signals to aid team decision making. We also utilized the default parameter configuration of the GraphMIX algorithm.

5.3. Parameter Settings

The LIV cooperation learning framework comprises three main components: the agent network, the latent interaction network, and the mixing network. The agent network design aligns with the baseline methods, consisting of two MLPs with a GRU, where the dimension of the hidden layer is 64. The input dimension of the first MLP is determined by the observation dimension of the scenario, and its output dimension matches the input dimension of the GRU, both of which are 64. The output dimension of the GRU aligns with the input dimension of the second MLP, which is also 64. The output dimension of the second MLP is determined by the number of actions, and the $\varepsilon$-greedy strategy is used to select actions according to the individual q-values. The $\varepsilon$ value is reduced from 1 to 0.05 over the first 50,000 time steps, and it is maintained at 0.05 until the maximum time step is reached. The agent network follows the principles of the centralized training and decentralized execution paradigm, and it employs parameter sharing to improve learning efficiency. The MLP in the latent interaction network also uses a three-layer architecture, with the hidden layer dimension set to 64. The mixing network integrates global state information, and its hypernetwork structure closely resembles that of QMIX with a two-layer configuration: the dimension of the intermediate layer (first MLP) is 64, and the dimension of the encoding layer (second MLP) is 32. To maintain fairness in the experiments, all of the methods use the same hyper-parameters. The dynamic masking threshold is set to 0 in the experiments. A threshold of 0 implies that the dynamic masking mechanism removes information from agents whose observations remain unchanged. Unchanged observations indicate that the agent is no longer participating in cooperation, providing a rationale for removing its associated information.
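For reference, the exploration schedule described above can be written as a small helper; linear annealing is our assumption, since the text only states the start value, end value, and annealing horizon.

```python
def epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
            anneal_steps: int = 50_000) -> float:
    """Anneal epsilon from 1.0 to 0.05 over the first 50,000 steps, then hold it."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# epsilon(0) == 1.0, epsilon(25_000) == 0.525, epsilon(100_000) == 0.05
```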
The SMAC environment was evaluated using scenarios with three increasing difficulty levels: easy, hard, and super-hard. The key distinction between the scenarios is the varying number and types of agents and enemy units, which requires adjusting the network input dimensions to accommodate each scenario. In the decentralized execution phase, agents are given a sight range of 9 to account for their partial observability, meaning they only perceive themselves and other agents within this range. Information outside the sight range is set to 0, simulating the incomplete information that agents encounter in real-world settings. The maximum number of time steps ($step_{max}$) is set to 2 million (with the corridor scenario using 5 million steps), and the replay buffer size is fixed at 5000. The update interval is set to 200 time steps. In Hard-MPE, the number of agents N is set to 4 for both the Coverage Control and Formation Control scenarios. The maximum time step of the two scenarios is set to 0.4 million. The replay buffer size and update interval are kept consistent with the SMAC environment, being set to 5000 and 200 time steps, respectively. Any other parameters not specified for the two environments are aligned with the settings of the Python (latest v. 3.13) multi-agent reinforcement learning framework (pymarl) [6].

5.4. Evaluation Metrics

In SMAC, the evaluation metric is the win rate of the agents against the built-in AI enemies, which is determined by the ratio of the number of wins to the total number of evaluation episodes. Performance evaluations are conducted every 10,000 time steps, with each evaluation including 32 episodes, and the actions are selected using a greedy strategy to ensure the most optimal actions are taken during the evaluation. All results are based on the median of five independent runs with different random seeds to reduce the influence of randomness. To better illustrate the changes in cooperation performance, the plots include a shaded area representing the 25% to 75% quartile range.
In the Hard-MPE environment, the performance is measured by cumulative reward, which is defined as the highest accumulated return achieved by the agents. The evaluation is conducted every 50,000 time steps, with each evaluation containing 32 episodes, and the greedy strategy is also used for selecting actions. Similar to the SMAC environment, the results are compared using the median from five independent runs, with different random seeds employed in each run. The final evaluation results are displayed both as a summary performance table and a learning curve.
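The aggregation used for both environments (median over five seeds with a 25%–75% quartile band) can be reproduced with the short NumPy sketch below; array shapes and names are illustrative.

```python
import numpy as np

def summarize_runs(results_by_seed: np.ndarray):
    """Summarize evaluation results across independent runs.

    results_by_seed: shape (n_seeds, n_eval_points), e.g. (5, T), holding the
    test return or win rate of each run at each evaluation point.
    Returns the median curve and the 25%/75% quartiles used for the shaded band.
    """
    median = np.median(results_by_seed, axis=0)
    q25 = np.percentile(results_by_seed, 25, axis=0)
    q75 = np.percentile(results_by_seed, 75, axis=0)
    return median, q25, q75

# Example: 5 seeds evaluated at 3 points.
runs = np.array([[0.1, 0.5, 0.9],
                 [0.0, 0.4, 0.8],
                 [0.2, 0.6, 1.0],
                 [0.1, 0.5, 0.7],
                 [0.0, 0.3, 0.9]])
median, q25, q75 = summarize_runs(runs)
```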

5.5. Evaluation Results

5.5.1. Hard-MPE Results

Figure 5 compares the learning processes of the LIV method and the baseline methods QMIX and Weighted-QMIX in the Coverage Control and Formation Control scenarios. Weighted-QMIX includes two variants, Optimistic Weighting QMIX (OW-QMIX) and Central Weighting QMIX (CW-QMIX). For the Coverage Control scenario, LIV exhibited a faster learning rate initially and maintained a higher average test return than CW-QMIX, OW-QMIX, and QMIX. In particular, after 0.2 million time steps, the LIV method showed a more stable performance. In contrast, the average test return of CW-QMIX and OW-QMIX fluctuated, indicating less stability compared to LIV. In the Formation Control scenario, LIV also exhibited excellent performance. Especially at 0.3 million time steps, the average test return of LIV was significantly higher than that of the other baseline methods. In summary, LIV demonstrated significant advantages over the compared baseline methods in terms of learning speed, final test return, and policy stability.
Table 1 provides the final performance evaluations of all methods based on two metrics: the average test return and the average test time. The average test return indicates the average final test return over different runs in the same scenario, with higher values reflecting better performance. The average test time measures the efficiency of the different methods, where shorter times denote better cooperation efficiency. In the Coverage Control scenario, the average test return of LIV was superior to that of the other baseline methods. Additionally, the average test time of LIV was on par with that of the best method, QMIX, indicating that LIV excels not only in performance, but also in cooperation efficiency. In the Formation Control scenario, the average test return of LIV was comparable to those of CW-QMIX and OW-QMIX, and the average test time of LIV was superior to those of the other baseline methods. Overall, the proposed LIV achieves higher cooperation efficiency and superior final performance.

5.5.2. SMAC Results

Figure 6 illustrates the comparison between the proposed LIV and two baseline methods: QMIX and Weighted-QMIX. In the easy scenarios of 1c3s5z, 2m_vs_1z, and 2s_vs_1sc, all of the methods achieved a nearly 100% win rate. It is noteworthy that, within the same number of time steps, LIV showed a higher win rate than the baseline methods. In the easy scenarios, the alignment of contributions with their individual q-values led to minimal performance differences. In the hard scenarios of 2c_vs_64zg, 5m_vs_6m, and bane_vs_bane, LIV not only showed superior performance within the same number of time steps, but also achieved a marked increase in the final win rate compared to the baseline methods. However, in the hard scenario bane_vs_bane, the performance of QMIX exhibited significant fluctuations. In the corridor scenario, QMIX failed to obtain any wins. These scenarios, which involve a larger variety and number of agents, demonstrate that QMIX tends to fail as the number of agents increases. The results suggest that the introduction of latent interaction values facilitates cooperation among agents, allowing each agent to accurately assess its contribution and thus improve overall performance. This advantage is still noticeable in the super-hard scenarios corridor and MMM2. In these challenging scenarios, LIV far surpassed the performance of the baseline methods. This indicates that LIV is more efficient in coordinating agent cooperation in complex tasks, effectively reducing instability in the learning process by fine-tuning the individual q-value functions.
Figure 7 shows a further comparison of LIV with the graph network-based method GraphMIX and the cutting-edge method QPLEX. Similar to LIV, both baseline methods integrate joint history information to improve cooperation. However, the network scale of the two baseline methods, which use attention mechanisms, is significantly larger than that of LIV. In the easy scenarios of 2m_vs_1z and 1c3s5z, the performance of LIV was similar to that of the baseline methods. This indicates that, in simple scenarios, the complex network structures of the baseline methods do not noticeably improve performance. In the hard scenario 2c_vs_64zg, LIV showed a higher win rate within the same number of time steps. The two baseline methods exhibited a slower learning speed, which resulted in lower win rates compared to LIV. In the other hard scenario 5m_vs_6m, LIV achieved a win rate comparable to QPLEX. It is noteworthy that the network scale of QPLEX is approximately twice that of LIV. LIV achieves similar results with a lightweight architecture, providing strong evidence for the crucial role of latent interaction values in cooperation policy learning. In the super-hard scenario corridor, GraphMIX showed a stable learning process. Although LIV exhibited lower stability, its average performance consistently surpassed that of GraphMIX. In the other super-hard scenario MMM2, LIV exhibited a significant advantage over the two baseline methods in terms of learning efficiency and final performance.

5.6. Network Scale Comparison

In this subsection, a comparative analysis of the network scales of all methods is presented. Owing to the need to construct extra networks for joint history information extraction, most of the methods possess a larger network scale than the baseline method QMIX. Prior research (e.g., DFAC [39]) indicates that expanding the network scale contributes to improved performance. As shown in Table 2, QMIX, as the most basic multi-agent cooperation method, exhibited the smallest network scale. The maximum parameter count was obtained from the 27m_vs_30m scenario, while the minimum value was derived from the 2m_vs_1z scenario in SMAC. Weighted-QMIX constructed the largest network structure, as it necessitates processing extensive information and training an additional unconstrained neural network. GraphMIX and QPLEX integrate joint historical information, each employing more sophisticated network architectures to enhance feature extraction. Although the increased complexity can significantly enhance performance, it also substantially expands network depth and parameter count, thereby increasing computational cost. The proposed LIV method preprocesses information prior to network input, substantially decreasing the number of layers and parameters and maintaining a streamlined network structure. Compared with the baseline methods, the proposed LIV method achieves comparable performance without additional network design.

5.7. Ablation Results

To evaluate the effectiveness of the joint history information and the dynamic masking mechanism, two variants were developed and benchmarked against the proposed LIV method in SMAC. The first variant, termed QMIX_history, combines the joint history information and the global state information as the input of the QMIX mixing network, aiming to enhance multi-agent cooperation through the mixing network. The second variant was designed specifically to assess the impact of the dynamic masking mechanism on the generation of latent interaction values. In this variant, two versions were considered: LIV_wo_mask, which omits the dynamic masking mechanism; and LIV_w_mask, which incorporates it. Figure 8 shows the ablation results, which demonstrate the performance disparities between the two variants and the baseline LIV method. In the easy scenario 2s_vs_1sc, the result of the QMIX_history variant demonstrated that reducing network complexity can help prevent detrimental impacts in uncomplicated settings. In the hard scenarios of 2c_vs_64zg and 5m_vs_6m, the latent interaction network exhibited superior performance. Additionally, the cooperation performance remained consistent irrespective of whether the dynamic masking mechanism was employed, indicating that the removal of redundant information does not adversely affect the experimental results. The LIV_w_mask variant showed a faster learning speed due to its dynamic masking mechanism, demonstrating that the removal of redundant information improves cooperative efficiency. Comparable trends were observed in the super-hard scenarios bane_vs_bane and MMM2. In the super-hard scenario corridor, the LIV_wo_mask variant failed to yield positive results, whereas the LIV_w_mask variant demonstrated positive performance, highlighting the potential detrimental impact of redundant information on cooperative decision-making processes. The dynamic masking mechanism demonstrated a facilitative effect on the generation of latent interaction values. Moreover, in the scenarios 2s_vs_1sc, 2c_vs_64zg, and MMM2, the LIV_w_mask variant did not consistently outperform LIV_wo_mask. This inconsistency might result from the threshold setting in the dynamic masking mechanism. When the threshold settings interfere with agent collaboration, it is advisable to adjust them appropriately to minimize the interference. In the future, we will investigate the development of a dynamic threshold masking mechanism to mitigate such interference.

6. Conclusions

In this paper, we introduced a novel latent interaction value (LIV) cooperation learning framework that integrates the history information of agents to enhance the individual q-value functions and facilitate coordinated policy optimization. Specifically, the LIV cooperation learning framework integrates history information from the hidden states of the agent network to generate a unique latent interaction value for each agent. Subsequently, these latent interaction values are incorporated into the individual q-value functions to mitigate the estimation errors arising from partial observability. Furthermore, a dynamic masking mechanism was devised to adaptively refine the history information based on the activity level of agents, enhancing the accuracy of latent interaction value generation. The mechanism assesses the observational dynamics of agents and filters out history information that falls below a specified threshold, effectively mitigating the detrimental impact of redundant information. Experimental results demonstrate that the proposed LIV outperforms the baseline methods in both the multi-agent particle environment and the StarCraft Multi-Agent Challenge, achieving a rapid learning speed and a significant improvement in cooperation performance.

Author Contributions

Conceptualization, Z.Z.; Data curation, Z.Z.; Formal analysis, Z.Z.; Investigation, Z.Z., Y.Z. (Ya Zhang), S.W., R.Z. and W.C.; Methodology, Z.Z.; Project administration, Y.Z. (Ya Zhang); Software, Z.Z., Y.Z. (Yang Zhou) and R.Z.; Supervision, W.C.; Visualization, Z.Z. and Y.Z. (Ya Zhang); Writing—Original Draft, Z.Z.; Writing—Review and Editing, Y.Z. (Ya Zhang), S.W. and Y.Z. (Yang Zhou). All authors have read and agreed to the published version of the manuscript.

Funding

There is no funding support for the work of this paper.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bertsekas, D. Results in Control and Optimization. Results Control Optim. 2020, 1, 100003. [Google Scholar] [CrossRef]
  2. Cassano, L.; Yuan, K.; Sayed, A.H. Multiagent Fully Decentralized Value Function Learning With Linear Convergence Rates. IEEE Trans. Autom. Control 2021, 66, 1497–1512. [Google Scholar] [CrossRef]
  3. Cao, Y.; Yu, W.; Ren, W.; Chen, G. An Overview of Recent Progress in the Study of Distributed Multi-Agent Coordination. IEEE Trans. Ind. Informatics 2013, 9, 427–438. [Google Scholar] [CrossRef]
  4. Zanol, R.; Chiariotti, F.; Zanella, A. Drone Mapping through Multi-Agent Reinforcement Learning. In Proceedings of the 2019 IEEE Wireless Communications and Networking Conference (WCNC), Marrakech, Morocco, 15–19 April 2019; pp. 1–7. [Google Scholar] [CrossRef]
  5. Hüttenrauch, M.; Šošić, A.; Neumann, G. Guided Deep Reinforcement Learning for Swarm Systems. arXiv 2017, arXiv:1709.06011. [Google Scholar]
  6. Samvelyan, M.; Rashid, T.; Schroeder de Witt, C.; Farquhar, G.; Nardelli, N.; Rudner, T.G.J.; Hung, C.M.; Torr, P.H.S.; Foerster, J.; Whiteson, S. The StarCraft Multi-Agent Challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 2186–2188. [Google Scholar]
  7. Oliehoek, F.A.; Spaan, M.T.J.; Vlassis, N. Optimal and Approximate Q-Value Functions for Decentralized POMDPs. J. Artif. Intell. Res. 2008, 32, 289–353. [Google Scholar] [CrossRef]
  8. Kraemer, L.; Banerjee, B. Multi-Agent Reinforcement Learning as a Rehearsal for Decentralized Planning. Neurocomputing 2016, 190, 82–94. [Google Scholar] [CrossRef]
  9. Kim, G.; Chung, W. Tripodal Schematic Control Architecture for Integration of Multi-Functional Indoor Service Robots. IEEE Trans. Ind. Electron. 2006, 53, 1723–1736. [Google Scholar] [CrossRef]
  10. Tan, M. Multi-Agent Reinforcement Learning: Independent versus Cooperative Agents. In Proceedings of the 10th International Conference on International Conference on Machine Learning, Amherst, MA, USA, 27–29 June 1993; pp. 330–337. [Google Scholar]
  11. Tuyls, K.; Weiss, G. Multiagent Learning: Basics, Challenges, and Prospects. AI Mag. 2012, 33, 41. [Google Scholar] [CrossRef]
  12. Hausknecht, M.; Stone, P. Deep Recurrent Q-Learning for Partially Observable MDPs. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 29–37. [Google Scholar]
  13. Ni, T.; Eysenbach, B.; Salakhutdinov, R. Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 16691–16723. [Google Scholar]
  14. Laurent, G.J.; Matignon, L.; Le Fort-Piat, N. The World of Independent Learners Is Not Markovian. Int. J. Knowl.-Based Intell. Eng. Syst. 2011, 15, 55–64. [Google Scholar] [CrossRef]
  15. Mordatch, I.; Abbeel, P. Emergence of Grounded Compositional Language in Multi-Agent Populations. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 1495–1502. [Google Scholar]
  16. Foerster, J.N.; Assael, Y.M.; de Freitas, N.; Whiteson, S. Learning to Communicate with Deep Multi-Agent Reinforcement Learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2145–2153. [Google Scholar]
  17. Sukhbaatar, S.; Szlam, A.; Fergus, R. Learning Multiagent Communication with Backpropagation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2252–2260. [Google Scholar]
  18. Lowe, R.; Foerster, J.; Boureau, Y.L.; Pineau, J.; Dauphin, Y. On the Pitfalls of Measuring Emergent Communication. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 693–701. [Google Scholar]
  19. Kim, W.; Cho, M.; Sung, Y. Message-Dropout: An Efficient Training Method for Multi-Agent Deep Reinforcement Learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6079–6086. [Google Scholar] [CrossRef]
  20. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based on Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, Stockholm, Sweden, 10–15 July 2018; Volume 3, pp. 2085–2087. [Google Scholar]
  21. Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; Whiteson, S. Qmix: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 4295–4304. [Google Scholar]
  22. Ha, D.; Dai, A.; Le, Q.V. HyperNetworks. arXiv 2016, arXiv:1609.09106. [Google Scholar]
  23. Son, K.; Kim, D.; Kang, W.J.; Hostallero, D.; Yi, Y. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5887–5896. [Google Scholar]
  24. Rashid, T.; Farquhar, G.; Peng, B.; Whiteson, S. Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 10199–10210. [Google Scholar]
  25. Yang, Y.; Hao, J.; Liao, B.; Shao, K.; Chen, G.; Liu, W.; Tang, H. Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning. arXiv 2020, arXiv:2002.03939. [Google Scholar]
  26. Wang, J.; Ren, Z.; Liu, T.; Yu, Y.; Zhang, C. QPLEX: Duplex Dueling Multi-Agent Q-Learning. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  27. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6382–6393. [Google Scholar]
  28. Iqbal, S.; Sha, F. Actor-Attention-Critic for Multi-Agent Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2961–2970. [Google Scholar]
  29. Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual Multi-Agent Policy Gradients. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2974–2982. [Google Scholar]
  30. Zhou, M.; Liu, Z.; Sui, P.; Li, Y.; Chung, Y.Y. Learning Implicit Credit Assignment for Cooperative Multi-Agent Reinforcement Learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 11853–11864. [Google Scholar]
  31. Peng, B.; Rashid, T.; Schroeder de Witt, C.; Kamienny, P.A.; Torr, P.; Boehmer, W.; Whiteson, S. FACMAC: Factored Multi-Agent Centralised Policy Gradients. Adv. Neural Inf. Process. Syst. 2021, 34, 12208–12221. [Google Scholar]
  32. Wang, Y.; Han, B.; Wang, T.; Dong, H.; Zhang, C. Off-Policy Multi-Agent Decomposed Policy Gradients. arXiv 2020, arXiv:2007.12322. [Google Scholar]
  33. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; Springer International Publishing: Cham, Switzerland, 2016; Volume 1. [Google Scholar] [CrossRef]
  34. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  35. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar]
  36. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1995–2003. [Google Scholar]
  37. Agarwal, A.; Kumar, S.; Sycara, K. Learning Transferable Cooperative Behavior in Multi-Agent Teams. arXiv 2019, arXiv:1906.01202. [Google Scholar] [CrossRef]
  38. Naderializadeh, N.; Hung, F.H.; Soleyman, S.; Khosla, D. Graph Convolutional Value Decomposition in Multi-Agent Reinforcement Learning. arXiv 2021, arXiv:2010.04740. [Google Scholar]
  39. Sun, W.F.; Lee, C.K.; Lee, C.Y. DFAC Framework: Factorizing the Value Function via Quantile Mixture for Multi-Agent Distributional Q-Learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 9945–9954. [Google Scholar]
Figure 1. The update process in a centralized network further exacerbates the biases of agents whose q-values are underestimated or overestimated.
Figure 2. The update process in a centralized network incorporating latent interaction corrects agents' underestimated or overestimated q-values accordingly.
Figure 3. A schematic diagram of the value decomposition within the CTDE paradigm.
Figure 4. (a) The structure of the latent interaction network, (b) the main architecture of the LIV cooperation learning framework, and (c) the structure of the agent network (bottom) and the mixing network (top).
Figure 5. The learning curves of the proposed LIV method and classical MARL baselines in the Hard-MPE environment: (a) the Coverage Control scenario and (b) the Formation Control scenario.
Figure 6. The learning curves of the proposed LIV method and classical MARL baselines in the SMAC environment.
Figure 7. The learning curves of the proposed LIV method and SOTA MARL baselines in the SMAC environment.
Figure 8. The learning curves of different methods for processing history information in typical SMAC scenarios.
Table 1. The final performance of LIV and the baseline methods in the Hard-MPE environment.

Scenarios         | Methods    | Average Test Return | Average Test Time
Coverage Control  | LIV (Ours) | −27.51              | 36.82
Coverage Control  | CW-QMIX    | −41.61              | 36.83
Coverage Control  | OW-QMIX    | −44.33              | 38.40
Coverage Control  | QMIX       | −29.42              | 34.26
Formation Control | LIV (Ours) | −16.15              | 13.70
Formation Control | CW-QMIX    | −16.10              | 13.74
Formation Control | OW-QMIX    | −16.36              | 14.17
Formation Control | QMIX       | −40.15              | 17.82
Table 2. A comparison of the network layers and parameters across different methods.

Methods       | Layers | Parameters (Minimum) | Parameters (Maximum)
QMIX ¹        | 1      | 11,457               | 283,105
Weighted-QMIX | 2      | 157,891              | 1,021,667
Qplex         | 6      | 17,165               | 708,581
GraphMix      | 5      | 50,034               | 471,026
LIV (Ours)    | 2      | 15,747               | 391,395

¹ Basic multi-agent cooperation method.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
