Article

Reinforcement Learning with Value Function Decomposition for Hierarchical Multi-Agent Consensus Control

Xiaoxia Zhu
1 School of Intelligent Manufacturing, Shanghai Zhongqiao Vocational and Technical University, Shanghai 201514, China
2 School of Automation, Southeast University, Nanjing 210096, China
Mathematics 2024, 12(19), 3062; https://doi.org/10.3390/math12193062
Submission received: 23 August 2024 / Revised: 18 September 2024 / Accepted: 25 September 2024 / Published: 30 September 2024
(This article belongs to the Special Issue Advance in Control Theory and Optimization)

Abstract: A hierarchical consensus control algorithm based on value function decomposition is proposed for hierarchical multi-agent systems. To implement the consensus control algorithm, the reward function of the multi-agent system is decomposed, and two value functions are obtained by analyzing the communication content and the corresponding control objective of each layer of the hierarchical multi-agent system. Accordingly, for each agent in the system, a dual-critic network and single-actor network structure is applied to realize the objective of each layer. In addition, a target network is introduced to prevent overfitting in the critic network and to improve the stability of the online learning process. During the updating of the network parameters, a soft updating mechanism and an experience replay buffer are introduced to slow down the update rate of the network and improve the utilization of the training data. The convergence and stability of the consensus control algorithm with the soft updating mechanism are analyzed theoretically. Finally, the correctness of the theoretical analysis and the effectiveness of the algorithm are verified by two experiments.

1. Introduction

In recent years, due to the intense development of technology and the rapid improvement of data processing capabilities, many practical systems in the field of engineering have become increasingly complex. Therefore, the study of large and intricate systems has emerged as a prominent research focus. At present, the main research trend is to model complex network systems and divide them into multiple simple and identical subsystems [1]. The hierarchical multi-agent system is an architecture for organizing and managing multiple agents [2]. The agents of the systems can be divided into different layers, each of them having different responsibilities and control objectives. Typically, higher-layer agents are responsible for global policy making, while lower layer agents perform more specific tasks or operations. Therefore, the model of hierarchical multi-agent systems can be described by multiple subsystems. This holds significant practical relevance; that is, we can combine many of the same simple agents to form a large and complex system, or we can realize the complex tasks that cannot be completed by a single agent. In the control process of robotic swarms [3], hierarchical multi-agent systems can be employed to coordinate multiple robots to accomplish complex tasks. For example, in search-and-rescue missions [4], higher-layer agents can plan the search area, while lower layer agents carry out specific search actions. In intelligent traffic management [5], higher-layer agents can be responsible for overall traffic flow management, whereas lower layer agents control specific traffic signals or vehicles. In complex supply chain systems [6], higher-layer agents can perform global supply chain optimization, while lower layer agents handle specific tasks such as production, transportation, or inventory management. By decomposing complex tasks and allocating them to different layers of agents, hierarchical multi-agent systems can enhance the efficiency and flexibility of the system. Over the past decade, the analysis of hierarchical multi-agent systems and the development of distributed control schemes have become focal issues in the field of control and obtained numerous research outcomes [7].
For large-scale multi-agent systems, the model is usually described by a hierarchical structure because large-scale systems can usually be divided into groups, and the number of agents and the communication topology may differ between groups. Research on the consensus control of hierarchical multi-agent systems is the basis of all the problems mentioned above, and consensus control is critical for ensuring the stability and performance of the entire system. Williams et al. [8] outlined the concept of hierarchy used to describe relationships between subformations, for example, "formations of formations". This concept is valuable in fields like robotics, swarm intelligence, and military applications. The authors provide a strong theoretical foundation by discussing the dynamics of stability within hierarchical formations, which is important for ensuring that the proposed structures are robust and applicable in real-world scenarios; however, it may pose challenges in large-scale systems with unpredictable environmental factors. Smith et al. [9] applied the hierarchical structure to cyclic pursuit in vehicle networks and obtained better results than the traditional cyclic pursuit algorithm: the convergence rate of the vehicle groups to the center of mass is significantly increased. Hara et al. [10] proposed a general hierarchical multi-agent model with a fractal structure and studied the stability, global convergence, and low rank of the interconnection structure between different layers. Consensus in hierarchical systems with low-rank interconnections has practical implications in various fields, such as decentralized control systems and distributed optimization. However, like many theoretical papers, it relies on certain simplifying assumptions that might not hold in real-world scenarios, potentially limiting the generalizability of the findings. Tsubakino and Hara [11] proposed a hierarchical characterization method based on eigenvectors for low-rank, interconnected, multi-agent dynamical systems. This eigenvector-based approach could have practical implications in areas such as robotic coordination, decentralized decision-making, and networked systems, where efficient and stable intergroup communication is crucial. However, while low-rank interconnections aim to reduce complexity, eigenvector computations in large, dynamic systems may still be computationally expensive, especially in real-time applications where swift decisions are needed. Sang et al. [12] studied the group consensus problem based on hierarchical containment control in linear multi-agent systems. The influence of time delay in dynamic environments on consensus is considered in their work, and a robust control strategy is designed to ensure that the system achieves stable containment control and consensus in the presence of dynamic changes and communication delays. Wang et al. [13] proposed a new hierarchical control framework for the distributed control of multi-agent systems. The framework divides the system into multiple levels, each responsible for a different control task, and is designed to improve the scalability and robustness of the system and simplify the control design of complex systems.
Hierarchical reinforcement learning is one of the effective methods of solving large-scale and multi-task reinforcement learning problems [14]. Through task decomposition, strategies are learned according to the goals of each subtask, and the learned strategies are combined to obtain a global strategy, which can effectively solve the dimensional disaster problem in large-scale systems. Makar et al. [15] propose a hierarchical reinforcement learning algorithm based on value function decomposition to solve the multi-agent discrete control problem. The combination of hierarchical structures with multi-agent reinforcement learning represents a novel framework. Hierarchical approaches allow for the decomposition of complex tasks into subtasks, making it easier to handle environments with multiple agents and large state-action spaces. However, implementing a hierarchical algorithm is often more challenging. Designing the appropriate hierarchy of tasks and sub-tasks and determining how agents should transition between them can require significant domain knowledge and expertise. To solve large-scale and complex continuous control problems, Yang et al. proposed the hierarchical Deep Deterministic Policy Gradient (DDPG) algorithm [16], which is a dual-critic network and multi-actor network structure. The critic network is divided into two layers. In the first layer, a Multi-DDPG algorithm structure is used to solve simple task problems, and the second layer focuses on solving complex combined tasks. At the same time, hierarchical reinforcement learning can also solve the problem of switching topology in the multi-agent systems. For example, Wang et al. [17] proposed a two-layer reinforcement learning framework to solve the output consensus problem of multi-agent systems under switching topology conditions. The first layer uses the Q-learning algorithm, wherein each agent selects the optimal action according to the current state, and the second layer selects the appropriate action strategy according to the current topology to ensure the output consensus of the entire system. This two-layer structure gives the control strategy strong adaptability and robustness and can be adjusted quickly when the topology changes.
The research on hierarchical consensus control mentioned above is largely based on a specific ordering of the tasks, which can limit the convergence performance of the algorithm. In this paper, by contrast, a hierarchical consensus control algorithm based on value function decomposition is proposed for hierarchical multi-agent systems. According to the communication content of each layer of the multi-agent system and the corresponding control objectives, the reward function is decomposed, and two value functions are obtained. Specifically, for each agent in the system, a dual-critic network and single-actor network structure is designed. The updating of the dual-critic network is based on the decomposition of the value functions of the different tasks, and the two decomposed value functions have no logical order, so the training order of the two critic networks does not need to be considered during the training process. In addition, this paper introduces a target network to avoid overfitting in the critic network and improve the stability of the online learning process. In the updating of the network parameters, a soft updating mechanism and an experience replay buffer are introduced to slow down the update rate of the network and improve the utilization of the training samples. The main contributions of this paper are as follows:
(1)
For hierarchical multi-agent systems, a hierarchical consensus control algorithm based on value function decomposition is proposed. Firstly, the structure of the algorithm is that of a distributed actor–critic network, which ensures that the distributed characteristics of multi-agent systems are fully utilized. In addition, a value function decomposition algorithm is introduced to ensure the simultaneous optimization of the control objectives of agents at different levels, which gives the training process a certain robustness.
(2)
The convergence and stability analysis of the consensus control algorithm with a soft update mechanism are proposed. It is proved that for each agent, the action value function estimated by the critic network can converge to the optimal value, the policy output from the actor network can converge to the corresponding optimal value, and the multi-agent system can be asymptotically stable.
In this paper, the implementation of the hierarchical consensus control algorithm based on value function decomposition is presented, and the convergence and stability of the algorithm are analyzed. Finally, the correctness of the theoretical analysis and the effectiveness of the algorithm are verified by two experiments. Multi-agent systems with and without leaders are both considered in Section 5.

2. Preliminaries

2.1. Graph Theory

The communication of the leader–follower multi-agent system can be represented by a weighted directed graph G(A) = (V, E, A), which is composed of a set of N agents V = {v_1, v_2, …, v_N}, a set of edges E = {e_ij = (v_i, v_j)} ⊆ V × V, and a non-negative adjacency matrix A = [a_ij]. Here, a_ij > 0 if and only if (v_i, v_j) ∈ E, which means that agents i and j can communicate with each other; otherwise, a_ij = 0. For i = 1, 2, …, N, a_ii = 0. The in-degree matrix D = [d_ij] ∈ R^{N×N} is a diagonal matrix with d_ii = Σ_{j∈N_i} a_ij, and the Laplacian matrix is defined as L = D − A. The communication between the leader and the followers is modeled by the diagonal matrix B = [b_ii] ∈ R^{N×N}, where b_ii > 0 means there is a directed path between the leader and the i-th follower; otherwise, b_ii = 0. If the digraph contains a directed spanning tree rooted at the leader, there exists a directed path from the leader to every other follower.
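As a concrete illustration of these quantities, the short numpy sketch below builds the in-degree matrix, the Laplacian, and a leader pinning matrix for a small directed graph; the adjacency values and pinning gains are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative 4-agent adjacency matrix A = [a_ij]; a_ij > 0 means agents i
# and j can communicate, and a_ii = 0 by convention (values are made up).
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])

# In-degree matrix D = diag(sum_j a_ij) and Laplacian L = D - A.
D = np.diag(A.sum(axis=1))
L = D - A

# Leader pinning matrix B = diag(b_ii); b_ii > 0 when follower i receives
# information directly from the leader (here only agent 0, as an assumption).
B = np.diag([1., 0., 0., 0.])

print("Laplacian L:\n", L)
print("Row sums of L (always zero):", L.sum(axis=1))
```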

2.2. Problem Formulation

Consider the hierarchical (two-layer) discrete-time multi-agent system [18]. The structure is shown in Figure 1. The system contains N = p × q agents:
x_i(k+1) = A x_i(k) + B u_i(k),  i = 1, 2, …, N,
where x_i(k) ∈ R^n is the system state and u_i(k) ∈ R^m is the control signal. The system matrices are A ∈ R^{n×n} and B ∈ R^{n×m}. The N agents are divided into q groups, and each group has p agents. Based on the graph theory mentioned above, the communication among the p agents within each group of the bottom layer H_2 can be represented by an adjacency matrix A_1 of dimensions p × p, and the communication between the q groups of the top layer H_1 can be represented by an adjacency matrix A_2 of dimensions q × q. The Laplacian matrices L_1 ∈ R^{p×p} and L_2 ∈ R^{q×q} correspond to the adjacency matrices A_1 and A_2, respectively. The communication mode between groups, and whether the agents in each group can receive inter-group information, are crucial to the consensus of hierarchical multi-agent systems. In this paper, it is assumed that the agents in each group can receive information from outside the group according to the topology A_2, and that the communication information exchanged between groups is the average value of all agents in each group, g_j [19]; that is, for the j-th group of agents in the bottom layer, g_j(k) = \frac{1}{p} \sum_{i \in \text{group } j} x_i(k), j = 1, …, q.
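A minimal sketch of this two-layer arrangement and of the inter-group signal g_j, assuming the q groups of p agents are simply stacked group by group in one state array (this stacking convention and the placeholder system matrices are my assumptions, not the paper's):

```python
import numpy as np

p, q, n = 4, 3, 2            # agents per group, number of groups, state dimension
N = p * q
rng = np.random.default_rng(0)

# States stacked group by group: x[j, i] is the state of agent i in group j.
x = rng.normal(size=(q, p, n))

# Inter-group communication signal: each group exchanges only the average
# state of its p members, g_j(k) = (1/p) * sum_i x_{j,i}(k).
g = x.mean(axis=1)           # shape (q, n)

# One step of the agent dynamics x_i(k+1) = A x_i(k) + B u_i(k),
# with placeholder system matrices (n = 2 states, m = 1 input).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.2],
              [0.1]])
u = rng.normal(size=(q, p, 1))
x_next = x @ A.T + u @ B.T   # batched update over all N agents
print(g.shape, x_next.shape)
```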
Assumption 1.
There is no time delay in intra-group communication and inter-group communication in hierarchical multi-agent systems. The number and topology of agents in each group are the same.
In the two-layer discrete multi-agent system, the agents in the same group are relatively concentrated; for example, robots or vehicles acting as communication nodes within a group are close to one another, so the intra-group communication delay is small and can be ignored. The communication distance between groups is often much larger, the communication links are long, and the signal transmission capacity is limited, so the inter-group communication delay generally cannot be ignored. To reduce the complexity of the algorithm, however, the problem of communication time delay is not considered in this paper.
In hierarchical multi-agent systems, the number and topology of agents within each group can indeed be the same, but this is not always necessary. If each group has the same number of agents, it simplifies the system design and interaction protocols. This uniformity makes coordination easier since each group can follow a similar communication and decision-making process. If each group shares the same topology, it typically means that the connections and interactions between agents are similar across groups. The specific number and topology of agents depend on the problem’s requirements.
Definition 1.
The two-layer discrete multi-agent system (1) reaches hierarchical consensus if, for any initial state and for all i, j, \lim_{k \to \infty} \| x_i(k) - x_j(k) \| = 0.
If the hierarchical multi-agent system has a leader agent, the leader dynamics are defined as follows:
x_0(k+1) = A x_0(k),
where x_0(k) ∈ R^n is the reference signal of the multi-agent system.
Definition 2.
The two-layer discrete multi-agent systems (1) and (2) reach hierarchical consensus if, for any initial state and for all i, \lim_{k \to \infty} \| x_i(k) - x_0(k) \| = 0.
Considering the communication characteristics of hierarchical multi-agent systems, each agent can obtain intra-group and inter-group information, so the local tracking error for agent i in each group is as follows:
e_i(k) = \sum_{j} \left[ a_{1,ij} \left( x_i(k) - x_j(k) \right) + b_{ii} \left( x_i(k) - x_0(k) \right) \right] + \sum_{j} a_{2,ij} \left( g_i(k) - g_j(k) \right),
where the first sum runs over the intra-group neighbors of agent i and the second sum runs over the groups neighboring the group of agent i.
According to the equation above, the global tracking error is defined as e(k) = [e_1^T(k), e_2^T(k), …, e_N^T(k)]^T ∈ R^{Nn}:
e(k) = \left( I_q \otimes (L_1 + B) \otimes I_n \right) \left( X(k) - X_0(k) \right) + \frac{1}{p} \left( L_2 \otimes \Delta_1 \otimes I_n \right) X(k),
where ⊗ is the Kronecker product, I_n is the n-dimensional identity matrix, I_q is the q-dimensional identity matrix, the state of the h-th group is X_h(k) = [x_{(h−1)p+1}^T(k), x_{(h−1)p+2}^T(k), …, x_{hp}^T(k)]^T, the state vector of all agents is X(k) = [x_1^T(k), x_2^T(k), …, x_N^T(k)]^T ∈ R^{Nn}, the state vector of the leader agent is X_0(k) = [x_0^T(k), x_0^T(k), …, x_0^T(k)]^T ∈ R^{Nn}, \frac{1}{p}(L_2 \otimes \Delta_1 \otimes I_n) X(k) is the inter-group state error, and \Delta_1 = [1, …, 1]^T [1, …, 1] ∈ R^{p×p} describes the information interaction structure between groups. If the multi-agent system has no leader, then both b_ii and B are zero.
According to (4), the global tracking errors can be defined as follows:
e ( k ) = e 1 ( k ) + e 2 ( k ) ,
where e_1(k) = \left( I_q \otimes (L_1 + B) \otimes I_n \right)\left( X(k) - X_0(k) \right) and e_2(k) = \frac{1}{p}\left( L_2 \otimes \Delta_1 \otimes I_n \right) X(k).
For hierarchical multi-agent systems with a leader agent, in order to reach consensus, the global consensus error of the whole system is defined as follows:
\epsilon(k) = X(k) - X_0(k) + \frac{1}{p}\left( L_2 \otimes \Delta_1 \otimes I_n \right) X(k),
where \epsilon_1(k) = X(k) - X_0(k) and \epsilon_2(k) = \frac{1}{p}\left( L_2 \otimes \Delta_1 \otimes I_n \right) X(k).
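Under the same stacking convention, the error terms e_1(k) and e_2(k) of Equation (5) can be assembled directly with Kronecker products. The numpy sketch below does this for illustrative graph matrices; only the structure of the expression, not the numerical values, follows the paper.

```python
import numpy as np

p, q, n = 4, 3, 2
N = p * q
rng = np.random.default_rng(1)

# Intra-group Laplacian L1 (p x p), inter-group Laplacian L2 (q x q), and
# leader pinning matrix B (p x p); all values are illustrative placeholders.
A1 = np.ones((p, p)) - np.eye(p)            # complete graph inside each group
L1 = np.diag(A1.sum(axis=1)) - A1
A2 = np.array([[0., 1., 1.],
               [1., 0., 1.],
               [1., 1., 0.]])
L2 = np.diag(A2.sum(axis=1)) - A2
Bpin = np.diag([1., 0., 0., 0.])

X = rng.normal(size=N * n)                  # stacked follower states X(k)
X0 = np.tile(rng.normal(size=n), N)         # stacked leader state X_0(k)
Delta1 = np.ones((p, p))                    # [1,...,1]^T [1,...,1]

# e1(k): intra-group tracking error; e2(k): inter-group error (Eq. (5)).
e1 = np.kron(np.eye(q), np.kron(L1 + Bpin, np.eye(n))) @ (X - X0)
e2 = (1.0 / p) * np.kron(L2, np.kron(Delta1, np.eye(n))) @ X
e = e1 + e2
print(e.shape)                              # (N * n,)
```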
Lemma 1
([20]). If the matrix (L_1 + B) is nonsingular, then \| \epsilon_1(k) \| \le \| e_1(k) \| / \lambda_{\min}(L_1 + B), where \lambda_{\min}(L_1 + B) denotes the minimum singular value of the matrix (L_1 + B).
Lemma 1 shows that if the global tracking error is small enough, then the global consensus error can be arbitrarily small. To ensure that the inequality in Lemma 1 holds, the following hypothesis is given:
Assumption 2.
The communication graphs G ( A 1 ) and G ( A 2 ) of the multi-agent systems are connected.

2.3. Optimal Consensus Control Based on Value Function Decomposition

Considering the two-layer multi-agent systems, the performance index function of agent i in any group can be defined as follows:
J_i(e_i(k), \underline{u}_i(k)) = \sum_{t=k}^{\infty} \gamma^{t-k} r_i(e_i(t), u_i(t)) = r_i(e_i(k), u_i(k)) + \gamma J_i(e_i(k+1), \underline{u}_i(k+1)),
where \underline{u}_i(k) = ( u_i(k), u_i(k+1), \ldots ) is the control sequence composed of all the control inputs of agent i from the current moment onward, 0 < \gamma < 1 is the discount factor, and r_i(e_i(k), u_i(k)) is the reward provided by the environment.
The objective of the optimal consensus control is to minimize the performance index function (7). Therefore, according to Bellman’s principle, the optimal value of the state value function J i * ( e i ( k ) ) satisfies the following equation:
J i * ( e i ( k ) ) = min u i ( k ) { r i ( e i ( k ) , u i ( k ) ) + γ J i * ( e i ( k + 1 ) ) } .
The DTHJB equation for agent i can be expressed as follows:
J i * ( e i ( k ) ) = r i ( e i ( k ) , u i * ( k ) ) + γ J i * ( e i ( k + 1 ) ) .
Therefore, the DTHJB equation for hierarchical multi-agent systems can be expressed as follows:
J * ( e ( k ) ) = R ( e ( k ) , u * ( k ) ) + γ J * ( e ( k + 1 ) ) .
The action value function Q and the optimal action value function Q * are introduced. Then, the optimal action value function of agent i is as follows:
Q i * ( e i ( k ) , u i ( k ) ) = r i ( e i ( k ) , u i ( k ) ) + min u i ( k + 1 ) γ Q i * ( e i ( k + 1 ) , u i ( k + 1 ) ) .
By minimizing the action value function, the optimal control value u * can be obtained directly as follows:
u i * ( e i ( k ) ) = arg min u i ( k ) Q i * ( e i ( k ) , u i ( k ) ) .
For two-layer discrete multi-agent systems, a hierarchical reinforcement learning algorithm can be introduced to decompose the performance index function and action value function according to the control objectives of different layers. The following content gives the conditions that need to be satisfied by the decomposition of the action value function, so as to provide the basis for the implementation of the algorithm in the next section.
Theorem 1
([21]). Suppose the reward function r can be decomposed into M reward functions, namely, r(e, u) = \sum_{k=1}^{M} r_k(e, u). Then, the performance index function and the action value function can be decomposed as J(e) = \sum_{k=1}^{M} J_k(e) and Q(e, u) = \sum_{k=1}^{M} Q_k(e, u).
Proof of Theorem 1.
The performance index function J ( e ) is expressed as follows:
J(e) = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l+1}(e, u).
According to the decomposed form of the reward function in Theorem 1, the performance index function can be decomposed into the following:
J(e) = \sum_{l=0}^{\infty} \gamma^{l} \sum_{k=1}^{M} r_{t+l+1,k}(e, u) = \sum_{k=1}^{M} \sum_{l=0}^{\infty} \gamma^{l} r_{t+l+1,k}(e, u) = \sum_{k=1}^{M} J_k(e).
Similarly, the decomposition form of the action value function can be obtained:
Q(e, u) = \sum_{k=1}^{M} Q_k(e, u).
   □
Theorem 1 shows that the action value function can be decomposed into several sub-functions corresponding to different objectives, with no sequential relationship between the objectives. How these sub-functions can be optimized separately is the problem studied below.
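Because the decomposition in Theorem 1 is purely additive, it can be checked numerically on a toy trajectory: splitting the per-step reward as r = r_1 + r_2 and accumulating the two discounted returns separately gives exactly the undecomposed return. A small sketch with placeholder random rewards:

```python
import numpy as np

gamma = 0.95
T = 200
rng = np.random.default_rng(2)

# Two reward components along one trajectory (e.g. an intra-group and an
# inter-group consensus reward); the values here are random placeholders.
r1 = rng.normal(size=T)
r2 = rng.normal(size=T)
r = r1 + r2

discounts = gamma ** np.arange(T)
J1 = np.sum(discounts * r1)   # J_1(e)
J2 = np.sum(discounts * r2)   # J_2(e)
J = np.sum(discounts * r)     # J(e)

# Theorem 1: J(e) = J_1(e) + J_2(e), up to floating-point error.
assert np.isclose(J, J1 + J2)
print(J, J1 + J2)
```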

3. Proposed Algorithm

3.1. Consensus Algorithm Based on Action Value Function Decomposition

For a hierarchical multi-agent system containing N agents, the overall structure of the consensus algorithm based on action value function decomposition, which is used to obtain the optimal control value (12) for each agent, is shown in Figure 2. The reward function of agent i is decomposed into two rewards corresponding to the intra-group consensus objective, r_{i,1}(e_i(k), u_i(k)), and the inter-group consensus objective, r_{i,2}(e_i(k), u_i(k)); that is, the reward function is expressed as follows:
r i ( e i ( k ) , u i ( k ) ) = r i , 1 ( e i ( k ) , u i ( k ) ) + r i , 2 ( e i ( k ) , u i ( k ) ) ,
where r i , 1 ( e i ( k ) , u i ( k ) ) = e i , 1 T ( k ) Q i e i , 1 ( k ) + u i T ( k ) P i u i ( k ) , r i , 2 ( e i ( k ) , u i ( k ) ) = e i , 2 T ( k ) Q i e i , 2 ( k ) + u i T ( k ) P i u i ( k ) , Q i ( k ) > 0 , and P i ( k ) > 0 are positive definite matrices. According to Theorem 1, the action value function of agent i can be decomposed into the following:
Q i ( e i ( k ) , u i ( k ) ) = Q i , 1 ( e i ( k ) , u i ( k ) ) + Q i , 2 ( e i ( k ) , u i ( k ) ) .
Therefore, for agent i, the initial value of the action value function is defined as Q i , 1 0 ( e i ( k ) , u i ( k ) ) and Q i , 2 0 ( e i ( k ) , u i ( k ) ) , and the corresponding control value is as follows:
u i 0 ( e i ( k ) ) = arg min u i ( k ) Q i 0 ( e i ( k ) , u i ( k ) ) .
The objective of the hierarchical consensus control is to achieve consensus states of all agents; that is, for each agent i, the intra-group error e i , 1 and the inter-group error e i , 2 tend toward zero. Therefore, the action value function is decomposed and updated according to the reward function decomposition. Based on Formula (18), each sub-function of the action value function can be calculated as follows:
Q i , 1 1 ( e i ( k ) , u i ( k ) ) = r i , 1 ( e i ( k ) , u i ( k ) ) + min u i ( k + 1 ) γ Q i , 1 0 ( e i ( k + 1 ) , u i ( k + 1 ) ) , Q i , 2 1 ( e i ( k ) , u i ( k ) ) = r i , 2 ( e i ( k ) , u i ( k ) ) + min u i ( k + 1 ) γ Q i , 2 0 ( e i ( k + 1 ) , u i ( k + 1 ) ) .
Similarly, the control value of the m-th iteration is as follows:
u i m ( e i ( k ) ) = arg min u i ( k ) Q i m ( e i ( k ) , u i ( k ) ) .
The corresponding decomposed action value function is as follows:
Q i , 1 m + 1 ( e i ( k ) , u i ( k ) ) = r i , 1 ( e i ( k ) , u i ( k ) ) + min u i ( k + 1 ) γ Q i , 1 m ( e i ( k + 1 ) , u i ( k + 1 ) ) , Q i , 2 m + 1 ( e i ( k ) , u i ( k ) ) = r i , 2 ( e i ( k ) , u i ( k ) ) + min u i ( k + 1 ) γ Q i , 2 m ( e i ( k + 1 ) , u i ( k + 1 ) ) ,
where m = 1, 2, … indicates the iteration number. The algorithm is based on Bellman's principle of optimality and solves the problem iteratively. For each agent i, the optimal Q-function and the corresponding control value can be obtained by iteratively updating Formulas (20) and (21). The iterative algorithm presented in this paper can deal with the consensus control problem of hierarchical multi-agent systems both with and without leaders. How to obtain the optimal control and minimize the Q-function in practice is addressed below.
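For intuition, the iteration (20) and (21) can be mimicked in tabular form on a tiny finite problem: two Q-tables are updated with their own reward components, and the single control policy is read off from their sum. The sketch below is a schematic tabular analogue with a random placeholder transition model, not the paper's neural-network implementation.

```python
import numpy as np

n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(3)

# Placeholder reward tables r1, r2 and a random deterministic transition model.
r1 = rng.normal(size=(n_states, n_actions))
r2 = rng.normal(size=(n_states, n_actions))
next_state = rng.integers(n_states, size=(n_states, n_actions))

Q1 = np.zeros((n_states, n_actions))   # Q_{i,1}^0
Q2 = np.zeros((n_states, n_actions))   # Q_{i,2}^0

for m in range(200):
    # Decomposed Bellman updates in the spirit of Eq. (21): each sub-critic
    # bootstraps from its own minimum over the next action.
    Q1 = r1 + gamma * Q1.min(axis=1)[next_state]
    Q2 = r2 + gamma * Q2.min(axis=1)[next_state]

# Control value in the spirit of Eq. (20): greedy action of the summed Q.
policy = (Q1 + Q2).argmin(axis=1)
print("greedy policy after iteration:", policy)
```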

3.2. Algorithm Implementation

For each agent i, the hierarchical multi-agent consensus algorithm based on action value function decomposition is implemented with a dual-critic network and a single-actor network. The corresponding offline training process is shown in Figure 3. The structure is similar to that of the hierarchical DDPG, but the difference is that the training of the dual-critic network has no primary or secondary order. The dual-critic network learns the control objectives of each layer and aggregates the learned knowledge into the same actor network to obtain the corresponding control value. For the actor network of each agent, the input is the local tracking error e_i(k), and the output is the control policy u_i(k), which is obtained by minimizing the action value function Q_i(e_i(k), u_i(k)) = Q_{i,1}(e_i(k), u_i(k)) + Q_{i,2}(e_i(k), u_i(k)). According to Theorem 1, each agent has a dual-critic network with parameters θ^{C1} and θ^{C2}, whose outputs Q̂_{i,1}(e_i(k), u_i(k)) and Q̂_{i,2}(e_i(k), u_i(k)) are used to evaluate the effect of the current control value on the intra-group consensus and the inter-group consensus of the hierarchical multi-agent system, respectively. For the updating of the critic network, according to the Bellman equation, the temporal difference errors are defined as follows:
E_{td,1} = r_{i,1}(e_i(k), u_i(k)) + \gamma \min_{u_i(k+1)} \hat{Q}_{i,1}(e_i(k+1), u_i(k+1)) - \hat{Q}_{i,1}(e_i(k), u_i(k)), \quad E_{td,2} = r_{i,2}(e_i(k), u_i(k)) + \gamma \min_{u_i(k+1)} \hat{Q}_{i,2}(e_i(k+1), u_i(k+1)) - \hat{Q}_{i,2}(e_i(k), u_i(k)),
where y_1(k) = r_{i,1}(e_i(k), u_i(k)) + \gamma \min_{u_i(k+1)} \hat{Q}_{i,1}(e_i(k+1), u_i(k+1)) and y_2(k) = r_{i,2}(e_i(k), u_i(k)) + \gamma \min_{u_i(k+1)} \hat{Q}_{i,2}(e_i(k+1), u_i(k+1)) are the target values of the dual-critic network. The parameters of the critic network are updated by gradient descent, minimizing the square of the temporal difference error.
The traditional learning process of the critic network often tends to be unstable or even divergent. One reason is that the learning of the critic network is based on the traditional Q-learning algorithm, which bootstraps the Q value through a greedy (extremum) operation and may therefore suffer a large performance loss. Meanwhile, the observation data used to train the critic network are limited and cannot fully reflect the dynamic characteristics of the system; the neural network may therefore overfit, which leads to unstable training. For these reasons, the target networks θ^{T1} and θ^{T2} and a soft update mechanism are introduced into the training process of the critic network. The main idea is to slow down the parameter updating of the neural network to avoid overfitting to incomplete observation data, so that the parameters of the target network gradually approach those of the critic network; in this way, the accuracy of the critic network output is ensured. The initial weights of the target network are set to be consistent with those of the critic network. The target networks θ^{T1} and θ^{T2} are updated as follows:
Q ^ i , 1 ( e i ( k ) , u i ( k ) | θ k + 1 T 1 ) = ( 1 τ ) Q ^ i , 1 ( e i ( k ) , u i ( k ) | θ k T 1 ) + τ Q ^ i , 1 ( e i ( k ) , u i ( k ) | θ k + 1 C 1 ) , Q ^ i , 2 ( e i ( k ) , u i ( k ) | θ k + 1 T 2 ) = ( 1 τ ) Q ^ i , 2 ( e i ( k ) , u i ( k ) | θ k T 2 ) + τ Q ^ i , 2 ( e i ( k ) , u i ( k ) | θ k + 1 C 2 ) ,
where τ ∈ (0, 1] indicates the soft update rate; when τ = 1, the update process is the same as that of the traditional actor–critic network. The value of τ is generally small, which slows down the parameter updating of the target network. Accordingly, the temporal difference errors used for the critic network updates are as follows:
E_{td,1} = r_{i,1}(e_i(k), u_i(k)) + \gamma \min_{u_i(k+1)} \hat{Q}_{i,1}(e_i(k+1), u_i(k+1) \,|\, \theta_k^{T1}) - \hat{Q}_{i,1}(e_i(k), u_i(k) \,|\, \theta_{k+1}^{C1}), \quad E_{td,2} = r_{i,2}(e_i(k), u_i(k)) + \gamma \min_{u_i(k+1)} \hat{Q}_{i,2}(e_i(k+1), u_i(k+1) \,|\, \theta_k^{T2}) - \hat{Q}_{i,2}(e_i(k), u_i(k) \,|\, \theta_{k+1}^{C2}).
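The soft update (23) used in these targets is a Polyak-averaging rule applied to the target-network parameters. A minimal, framework-agnostic sketch of this rule, with numpy arrays standing in for the network weights (the toy weight shapes are arbitrary):

```python
import numpy as np

def soft_update(target_params, critic_params, tau=0.001):
    """theta_T <- (1 - tau) * theta_T + tau * theta_C, applied per weight array."""
    return [(1.0 - tau) * t + tau * c for t, c in zip(target_params, critic_params)]

# Toy weight arrays standing in for theta^{T1} and theta^{C1}.
theta_T1 = [np.zeros((4, 4)), np.zeros(4)]
theta_C1 = [np.ones((4, 4)), np.ones(4)]

for _ in range(10):
    theta_T1 = soft_update(theta_T1, theta_C1, tau=0.001)

# The target weights approach the critic weights geometrically, with
# rate (1 - tau) per update: after 10 steps, value = 1 - (1 - 0.001)**10.
print(theta_T1[1][:2])
```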
The hierarchical multi-agent consensus algorithm based on action value function decomposition proposed in this paper adopts an experience replay buffer and mini-batch training, which greatly improves the utilization of historical data. Figure 3 shows the use of the experience replay buffer in the training process, and the implementation of the algorithm is summarized in Algorithm 1. For each agent i at time step k, the data tuple (e_i(k), u_i(k), r_{i,1}(k), r_{i,2}(k), e_i(k+1)) is stored in the experience replay buffer. The pair (e_i(k), u_i(k)) is used as the input of the critic networks, whose trained weights produce the outputs Q_{i,1}(e_i(k), u_i(k)) and Q_{i,2}(e_i(k), u_i(k)); the objective of the critic networks is to minimize a batch of squared temporal difference errors based on Formula (24), defined as follows:
TD_{i,1} = \frac{1}{M} \sum_{n=1}^{M} \left( E_{td,1}(n) \right)^2, \quad TD_{i,2} = \frac{1}{M} \sum_{n=1}^{M} \left( E_{td,2}(n) \right)^2,
where n = 1, …, M, and M is the number of tuples sampled from the experience replay buffer.
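The following sketch illustrates the replay-buffer sampling and the batched squared TD errors of (24) and (25). The buffer layout and the simple quadratic stand-ins for the critic and target networks are illustrative assumptions; in the actual algorithm these would be the neural networks described above.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, M = 0.995, 128

# Replay buffer of tuples (e_i(k), u_i(k), r_{i,1}, r_{i,2}, e_i(k+1)).
buffer_size, e_dim, u_dim = 5000, 2, 1
buf_e  = rng.normal(size=(buffer_size, e_dim))
buf_u  = rng.normal(size=(buffer_size, u_dim))
buf_r1 = rng.normal(size=buffer_size)
buf_r2 = rng.normal(size=buffer_size)
buf_e2 = rng.normal(size=(buffer_size, e_dim))

# Simple quadratic stand-ins for the critic and target networks.
def q_hat(w, e, u):
    z = np.concatenate([e, u], axis=1)
    return np.sum((z @ w) * z, axis=1)

w_C1 = np.eye(e_dim + u_dim); w_T1 = np.eye(e_dim + u_dim)
w_C2 = np.eye(e_dim + u_dim); w_T2 = np.eye(e_dim + u_dim)

# Sample a mini-batch of M tuples from the buffer.
idx = rng.integers(buffer_size, size=M)
e, u, r1, r2, e_next = buf_e[idx], buf_u[idx], buf_r1[idx], buf_r2[idx], buf_e2[idx]

# Target values y_1, y_2 built from the target networks; the minimizing next
# action would come from the actor, and is approximated here by reusing u.
y1 = r1 + gamma * q_hat(w_T1, e_next, u)
y2 = r2 + gamma * q_hat(w_T2, e_next, u)

# Batched squared TD errors TD_{i,1}, TD_{i,2} of Eq. (25).
TD1 = np.mean((y1 - q_hat(w_C1, e, u)) ** 2)
TD2 = np.mean((y2 - q_hat(w_C2, e, u)) ** 2)
print(TD1, TD2)
```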
Algorithm 1 Hierarchical consensus algorithm based on action value function decomposition
1: Initialization:
2: Randomly initialize the weights θ^{C1}, θ^{C2} of the critic networks Q̂_{i,1}, Q̂_{i,2}
3: Randomly initialize the weights θ^{T1}, θ^{T2} of the target networks
4: Randomly initialize the weights θ^{u_i} of the actor network u_i
5: Initialize the experience replay buffer R with capacity M
6: Learning process:
7: for training episode = 1, …, M do
8:     Initialize the exploration noise sequence
9:     Randomly initialize the system state
10:    Obtain the local tracking error e_i(1)
11:    for time step k = 1, …, T do
12:        Calculate the control input u_i(k) with the exploration noise
13:        Execute the control input u_i(k) and observe the rewards r_{i,1}(k), r_{i,2}(k) and the new error e_i(k+1)
14:        Store the data tuple (e_i(k), u_i(k), r_{i,1}(k), r_{i,2}(k), e_i(k+1)) in the experience replay buffer R
15:        Sample M training tuples from the experience replay buffer R
16:        Update the weight parameters θ^{C1}, θ^{C2} of the critic networks according to Formula (25)
17:        Update the weight parameters θ^{T1}, θ^{T2} of the target networks according to Formula (23)
18:        Update the weight parameters θ^{u_i} of the actor network according to Formula (20)
19:    end for
20: end for
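For readers who prefer code to pseudocode, a compact PyTorch-style sketch of one inner-loop update of Algorithm 1 (roughly steps 12–18) is given below. The network sizes and learning rates follow Table 1, but the module structure, the use of the actor to supply the minimizing next action, and all remaining details are illustrative assumptions rather than a reproduction of the author's implementation.

```python
import torch
import torch.nn as nn

e_dim, u_dim, hidden, gamma, tau = 2, 1, 30, 0.995, 0.001

def mlp(in_dim, out_dim):
    # Three-layer network: input -> 30 ReLU units -> linear output.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

actor = mlp(e_dim, u_dim)                                   # u_i = pi(e_i)
critic1, critic2 = mlp(e_dim + u_dim, 1), mlp(e_dim + u_dim, 1)
target1, target2 = mlp(e_dim + u_dim, 1), mlp(e_dim + u_dim, 1)
target1.load_state_dict(critic1.state_dict())               # targets start equal to critics
target2.load_state_dict(critic2.state_dict())

opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=1e-3)

def update(e, u, r1, r2, e_next):
    # Critic step: minimize the batched squared TD errors (Eq. (25)).
    with torch.no_grad():
        u_next = actor(e_next)                              # actor supplies the minimizing action
        y1 = r1 + gamma * target1(torch.cat([e_next, u_next], dim=1))
        y2 = r2 + gamma * target2(torch.cat([e_next, u_next], dim=1))
    q1 = critic1(torch.cat([e, u], dim=1))
    q2 = critic2(torch.cat([e, u], dim=1))
    critic_loss = ((y1 - q1) ** 2).mean() + ((y2 - q2) ** 2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # Actor step: minimize the summed action value Q_{i,1} + Q_{i,2} (Eq. (20)).
    u_pi = actor(e)
    actor_loss = (critic1(torch.cat([e, u_pi], dim=1)) + critic2(torch.cat([e, u_pi], dim=1))).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    # Soft update of the target networks (Eq. (23)).
    for tgt, src in ((target1, critic1), (target2, critic2)):
        for pt, ps in zip(tgt.parameters(), src.parameters()):
            pt.data.mul_(1.0 - tau).add_(tau * ps.data)

# Example call with a random mini-batch of M = 128 transitions.
M = 128
update(torch.randn(M, e_dim), torch.randn(M, u_dim),
       torch.randn(M, 1), torch.randn(M, 1), torch.randn(M, e_dim))
```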

4. Theoretical Analysis

4.1. Convergence Analysis of the Proposed Algorithm

For agent i, the update mechanisms (24) and (23) are simplified as follows:
\hat{Q}_i^C(e_i(k), u_i(k)) = r_i(e_i(k), u_i(k)) + \gamma \min_{u_i(k+1)} \hat{Q}_i^T(e_i(k+1), u_i(k+1)), \quad \hat{Q}_i^T(e_i(k), u_i(k)) = (1 - \tau) \hat{Q}_i^T(e_i(k), u_i(k)) + \tau \hat{Q}_i^C(e_i(k), u_i(k)),
where the critic network of agent i is \hat{Q}_i^C(e_i(k), u_i(k)) = \hat{Q}_{i,1}^{C1}(e_i(k), u_i(k)) + \hat{Q}_{i,2}^{C2}(e_i(k), u_i(k)), and the target network is \hat{Q}_i^T(e_i(k), u_i(k)) = \hat{Q}_{i,1}^{T1}(e_i(k), u_i(k)) + \hat{Q}_{i,2}^{T2}(e_i(k), u_i(k)). Considering the whole hierarchical multi-agent system, the control value is defined as u(k) = [u_1(k), u_2(k), …, u_N(k)]^T, and the overall optimization objective is Q(e(k), u(k)) = [Q_1(e_1(k), u_1(k)), …, Q_N(e_N(k), u_N(k))]^T. For simplicity of expression, u denotes u(k) and Q denotes Q(e(k), u(k)). When the layered multi-agent system reaches consensus, the optimal performance index is Q* and the optimal control signal is u*.
Theorem 2.
Let Q_1^0 and Q_2^0 denote the randomly initialized 2N critic networks, each of which is a bounded continuous function. Consider the action value function sequence Q^m generated by Formula (26) and the control sequence u^m generated by Formula (20). If the discount factor satisfies γ ∈ (0, 1), then as the iteration number m → ∞, Q^m → Q* and u^m → u*.
Proof of Theorem 2.
For the layered multi-agent system, the global tracking error is e(k) = [e_1^T(k), e_2^T(k), …, e_N^T(k)]^T, and according to Formula (26), we introduce two operators T_q and S_q:
T_q Q(e, u) = R(e, u) + \gamma \min_{u'} Q(e', u'), \quad S_q Q(e, u) = (1 - \tau) Q(e, u) + \tau T_q Q(e, u),
where R(e, u) = [r_1(k), r_2(k), …, r_N(k)]^T and e' denotes the global tracking error at the next time step. Based on the two mappings above, Formula (26) can be expressed as follows:
Q ^ k + 1 C ( e , u ) = T q Q ^ k T ( e , u ) , Q ^ k + 1 T ( e , u ) = S q Q ^ k T ( e , u ) .
First, the monotonicity of the operator T_q is proved. Let f(e, u) and g(e, u) be two estimates of the action value function with f(e, u) ≤ g(e, u) for all (e, u); we show that T_q f(e, u) ≤ T_q g(e, u). Let u* = \arg\min_{u'} g(e', u'); then
T_q f(e, u) = R(e, u) + \gamma \min_{u'} f(e', u') \le R(e, u) + \gamma f(e', u^*) \le R(e, u) + \gamma g(e', u^*) = R(e, u) + \gamma \min_{u'} g(e', u') = T_q g(e, u).
Therefore, it can be proved that the operator T q is monotonic.
If the hierarchical multi-agent system can reach consensus, the optimal value Q* must be a fixed point of the mapping S_q; that is, Q* = S_q Q*. To prove that the mapping S_q has a unique fixed point, \| \cdot \| is used as shorthand for the supremum norm \| \cdot \|_{\sup}, and two bounded continuous functions f and g are considered. Let I_N = [1, 1, …, 1]^T; then f \le g + I_N \| f - g \|, and we can obtain the following:
S_q f = (1 - \tau) f + \tau T_q f \le (1 - \tau)\left( g + I_N \| f - g \| \right) + \tau T_q \left( g + I_N \| f - g \| \right) = S_q g + (1 - \tau + \tau\gamma) I_N \| f - g \|.
Similarly, we can obtain the following formula:
S_q g - S_q f \le (1 - \tau + \tau\gamma) I_N \| g - f \|.
According to the Pareto optimality [22], we can obtain:
\| S_q f - S_q g \| \le (1 - \tau + \tau\gamma) \| g - f \|.
Since 1 − τ + τγ < 1 for τ ∈ (0, 1] and γ ∈ (0, 1), the above inequality shows that the mapping S_q is a strict contraction, and by Banach's fixed point theorem [23] it has a unique fixed point. It is easy to verify that the optimal action value function Q* is a fixed point of S_q, and therefore it must be the unique one.
Based on the above analysis, we now prove that Q^m → Q* as m → ∞:
\| Q_{m+1}^T - Q^* \| = \| S_q Q_m^T - S_q Q^* \| \le (1 - \tau + \tau\gamma) \| Q_m^T - Q^* \| = (1 - \tau + \tau\gamma) \| Q_m^T - Q_{m+1}^T + Q_{m+1}^T - Q^* \| \le (1 - \tau + \tau\gamma) \| Q_m^T - Q_{m+1}^T \| + (1 - \tau + \tau\gamma) \| Q_{m+1}^T - Q^* \|.
Then, letting δ = 1 − τ + τγ, the above formula can be written as follows:
\| Q_{m+1}^T - Q^* \| \le \frac{\delta}{1 - \delta} \| Q_m^T - Q_{m+1}^T \| \le \frac{\delta^{m+1}}{1 - \delta} \| Q_1^T - Q_0^T \|.
According to the analysis above, Q_m^T → Q* as m → ∞. Hence, when the discount factor γ ∈ (0, 1), the performance index function Q^m and the control policy u^m reach their optimal values; that is, Q_m^C → Q*, Q_m^T → Q*, and u^m → u*. □
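The contraction factor δ = 1 − τ + τγ can also be checked numerically: iterating the operator S_q of (27) on a toy tabular problem, the sup-norm distance to the fixed point shrinks by at least a factor δ at every step. A small verification sketch with a random placeholder model:

```python
import numpy as np

n_states, n_actions = 6, 3
gamma, tau = 0.9, 0.1
delta = 1.0 - tau + tau * gamma              # contraction factor of S_q
rng = np.random.default_rng(5)

R = rng.normal(size=(n_states, n_actions))
next_state = rng.integers(n_states, size=(n_states, n_actions))

def T_q(Q):
    # T_q Q = R + gamma * min over the next action of Q at the next state.
    return R + gamma * Q.min(axis=1)[next_state]

def S_q(Q):
    # S_q Q = (1 - tau) Q + tau T_q Q  (soft target update, Eq. (27)).
    return (1.0 - tau) * Q + tau * T_q(Q)

# Fixed point of S_q (same as the fixed point of T_q, i.e. the optimal Q*).
Q_star = np.zeros((n_states, n_actions))
for _ in range(5000):
    Q_star = T_q(Q_star)

Q = rng.normal(size=(n_states, n_actions))
prev_err = np.abs(Q - Q_star).max()
for m in range(20):
    Q = S_q(Q)
    err = np.abs(Q - Q_star).max()
    assert err <= delta * prev_err + 1e-9    # geometric decay at rate delta
    prev_err = err
print("final sup-norm error:", prev_err)
```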

4.2. Stability Analysis

Theorem 3.
Assuming that Assumption 2 holds, the hierarchical multi-agent consensus algorithm based on action value function decomposition makes the global tracking error e(k) of the multi-agent systems (1) and (2) asymptotically stable, and the states of all agents reach consensus.
Proof of Theorem 3.
The hierarchical multi-agent consensus algorithm is based on action value function decomposition; consider the DTHJB equation of the multi-agent system below:
J * ( e ( k ) ) = R ( e ( k ) , u * ( k ) ) + γ J * ( e ( k + 1 ) ) ,
Then, multiplying both sides by γ k I N T gives the following:
γ k I N T J * ( e ( k ) ) γ k + 1 I N T J * ( e ( k + 1 ) ) = γ k I N T R ( e ( k ) , u * ( k ) ) .
In order to prove the stability of the system, the first difference of the Lyapunov function is defined as follows:
Δ ( γ k I N T J * ( e ( k ) ) ) = γ k + 1 I N T J * ( e ( k + 1 ) ) γ k I N T J * ( e ( k ) ) .
Based on Formula (36), we can obtain the following:
\Delta\left( \gamma^k I_N^T J^*(e(k)) \right) = -\gamma^k I_N^T R(e(k), u^*(k)) \le 0.
Since the reward function R is a quadratic (or absolute-value) function of the global tracking error, \Delta( \gamma^k I_N^T J^*(e(k)) ) = 0 if and only if e(k) = 0. Therefore, the global tracking error e(k) of the multi-agent system is asymptotically stable, and the states of the entire multi-agent system reach consensus. □

5. Simulation

In this section, two simulation examples are used to verify the effectiveness of the proposed algorithm. For each agent, the dual-critic network, the actor network, and the target network all adopt a three-layer neural network trained by backpropagation. The hidden layer of each network contains 30 rectified linear units and is fully connected to the input layer; the output layer is a fully connected linear layer. The weights of all networks are updated according to Algorithm 1, using the Adaptive Moment Estimation (Adam) optimizer. To ensure the implementation of the algorithm and the consensus control effect, the hyperparameters of the neural networks are given in Table 1. The capacity of the experience replay buffer was set to 50,000.

5.1. Multi-Agent Systems without Leader

Consider the multi-agent system with the topology shown in Figure 4: the system contains 12 agents, which are divided into 3 groups of 4 agents each. As can be seen from the figure, the topology of the agents in each group is a connected graph. The dynamics of a single agent are defined as follows:
x i ( k + 1 ) = A x i ( k ) + B u i ( k ) ,
where the system matrices are A = [0.995, 0.09983; 0.09983, 0.995] and B = [0.2, 0.1]^T. The initial value of the system state is randomly selected as [1, 1], and the safe value of the system state is set to ±2.0. The reward function is decomposed according to the intra-group error and the inter-group error, Q is taken as the identity matrix, and R = 0.1. To verify Theorem 2, we set γ = 0.995 and used offline training, that is, the offline implementation of Algorithm 1. We set the soft update rate of the target network to τ = 0.001 and used the experience replay buffer to maximize the use of historical data. The number of iterative steps in each episode during training was 500, and the total number of episodes was 300.
Figure 5 shows the control effect of the algorithm. In a randomly given initial state, the state of all agents can reach consensus after 200 time steps. Figure 6a,b show the convergence process of the intra-group error, and Figure 6c,d show the convergence process of the inter-group error. From all the simulation results, it can be seen that all the agents can reach the same state in 150 time steps.

5.2. Multi-Agent Systems with Leader

Consider the multi-agent system with the topology shown in Figure 7: the system contains 1 virtual leader agent and 12 follower agents. The 12 follower agents are divided into 3 groups of 4 agents each. As can be seen from the figure, the topology graph of the agents in each group is connected. The multi-agent system is defined as follows:
x 0 ( k + 1 ) = A x 0 ( k ) , x i ( k + 1 ) = A x i ( k ) + B u i ( k ) ,
where the system matrices are A = [0.995, 0.09983; 0.09983, 0.995] and B = [0.2, 0.1]^T. The initial value of the system state is randomly selected as [1, 1], and the safe value of the system state is set to ±2.0. The reward function is decomposed according to the intra-group error and the inter-group error, Q is taken as the identity matrix, and R = 0.1. To verify Theorem 2, we set γ = 1 and used offline training, that is, the offline implementation of Algorithm 1, with the soft update rate of the target network set to τ = 0.001. The number of iterative steps in each episode during training was 3000, and the total number of episodes was 100.
Figure 8 shows the control effect of the algorithm. For a randomly given initial state, all agents can follow the leader agent within 400 time steps. Figure 9a,b show the convergence process of the intra-group error, and Figure 9c,d show the convergence process of the inter-group error. It can be seen from all the simulation graphs that the consensus control algorithm based on action value function decomposition proposed in this paper can make all the agent states reach consensus.

6. Conclusions

In this paper, a hierarchical consensus control algorithm based on value function decomposition is proposed for hierarchical multi-agent systems. According to the communication content and the corresponding control objective of each layer of the multi-agent system, the value function is decomposed according to the reward function of each layer. For each agent in the system, a dual-critic network and a single-actor network are adopted. A target network is introduced to avoid overfitting in the critic network and improve the stability of the online learning process, and a soft update mechanism and an experience replay buffer are introduced in the network parameter update process to improve the utilization of the training samples. The convergence and stability of the consensus control algorithm with the soft update mechanism are analyzed theoretically. Finally, the correctness of the theoretical analysis and the effectiveness of the algorithm are verified by two experiments. The value function decomposition method in this paper is used for hierarchical consensus control of homogeneous multi-agent systems; extending this undifferentiated value decomposition method to the consensus control of heterogeneous multi-agent systems is of great research value. In addition, this paper assumes that there is no time delay in the information interaction between the groups of agents, whereas communication delay often exists in actual systems. Therefore, the design of an effective hierarchical algorithm for multi-agent systems with communication delay also deserves further study.

Funding

This research was funded by the project "Research on Intelligent Manufacturing Process Control Method Based on Reinforcement Learning" of Shanghai Zhongqiao Vocational and Technical University, grant number ZQZR202419.

Data Availability Statement

All data generated or analyzed during this study are included in this article.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DDPG    Deep Deterministic Policy Gradient
Adam    Adaptive Moment Estimation

References

1. Hou, J.; Zheng, R. Hierarchical consensus problem via group information exchange. IEEE Trans. Cybern. 2018, 49, 2355–2361.
2. Cheng, C.; Yang, B.; Xiao, Q. Hierarchical Coordinated Predictive Control of Multiagent Systems for Process Industries. Appl. Sci. 2024, 14, 6025.
3. Albani, D.; Hönig, W.; Nardi, D.; Ayanian, N.; Trianni, V. Hierarchical task assignment and path finding with limited communication for robot swarms. Appl. Sci. 2021, 11, 3115.
4. Yang, T.; Jiang, Z.; Dong, J.; Feng, H.; Yang, C. Multi agents to search and rescue based on group intelligent algorithm and edge computing. In Proceedings of the 2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), Halifax, NS, Canada, 30 July–3 August 2018; pp. 389–394.
5. Li, Z.; Al Hassan, R.; Shahidehpour, M.; Bahramirad, S.; Khodaei, A. A hierarchical framework for intelligent traffic management in smart cities. IEEE Trans. Smart Grid 2017, 10, 691–701.
6. Long, Q.; Zhang, W. An integrated framework for agent based inventory–production–transportation modeling and distributed simulation of supply chains. Inf. Sci. 2014, 277, 567–581.
7. Zhou, H.; Li, W.; Shi, J. Hierarchically Distributed Charge Control of Plug-In Hybrid Electric Vehicles in a Future Smart Grid. Energies 2024, 17, 2412.
8. Williams, A.; Glavaski, S.; Samad, T. Formations of formations: Hierarchy and stability. In Proceedings of the 2004 American Control Conference, Boston, MA, USA, 30 June–2 July 2004; Volume 4, pp. 2992–2997.
9. Smith, S.L.; Broucke, M.E.; Francis, B.A. A hierarchical cyclic pursuit scheme for vehicle networks. Automatica 2005, 41, 1045–1053.
10. Hara, S.; Shimizu, H.; Kim, T.H. Consensus in hierarchical multi-agent dynamical systems with low-rank interconnections: Analysis of stability and convergence rates. In Proceedings of the 2009 American Control Conference, St. Louis, MO, USA, 10–12 June 2009; pp. 5192–5197.
11. Tsubakino, D.; Hara, S. Eigenvector-based intergroup connection of low rank for hierarchical multi-agent dynamical systems. Syst. Control Lett. 2012, 61, 354–361.
12. Sang, J.; Ma, D.; Zhou, Y. Group-consensus of hierarchical containment control for linear multi-agent systems. IEEE/CAA J. Autom. Sin. 2023, 10, 1462–1474.
13. Wang, X.; Xu, Y.; Cao, Y.; Li, S. A hierarchical design framework for distributed control of multi-agent systems. Automatica 2024, 160, 111402.
14. Pateria, S.; Subagdja, B.; Tan, A.H.; Quek, C. Hierarchical reinforcement learning: A comprehensive survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35.
15. Makar, R.; Mahadevan, S.; Ghavamzadeh, M. Hierarchical multi-agent reinforcement learning. In Proceedings of the Fifth International Conference on Autonomous Agents, Montreal, QC, Canada, 28 May–1 June 2001; pp. 246–253.
16. Yang, Z.; Merrick, K.; Jin, L.; Abbass, H.A. Hierarchical deep reinforcement learning for continuous action control. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5174–5184.
17. Wang, Z.; Liu, Y.; Zhang, H. Two-Layer Reinforcement Learning for Output Consensus of Multiagent Systems Under Switching Topology. IEEE Trans. Cybern. 2024, 54, 5463–5472.
18. Duan, Z.; Zhai, G.; Xiang, Z. State consensus for hierarchical multi-agent dynamical systems with inter-layer communication time delay. J. Frankl. Inst. 2015, 352, 1235–1249.
19. Lin, Z.; Hou, J.; Yan, G.; Yu, C.B. Reach almost sure consensus with only group information. Automatica 2015, 52, 283–289.
20. Abouheaf, M.I.; Lewis, F.L.; Vamvoudakis, K.G.; Haesaert, S.; Babuska, R. Multi-agent discrete-time graphical games and reinforcement learning solutions. Automatica 2014, 50, 3038–3053.
21. Van Seijen, H.; Fatemi, M.; Romoff, J.; Laroche, R.; Barnes, T.; Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
22. Lopez Mejia, V.G.; Lewis, F.L. Dynamic Multiobjective Control for Continuous-time Systems using Reinforcement Learning. IEEE Trans. Autom. Control 2019, 64, 2869–2874.
23. Ciarlet, P.G. Linear and Nonlinear Functional Analysis with Applications; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2013.
Figure 1. Hierarchical multi-agent systems. H_1 is the top layer, and H_2 is the bottom layer.
Figure 2. Schematic diagram of the reward function decomposition framework for two-layer multi-agent systems.
Figure 3. Training process and schematic diagram of the single agent. The solid black line indicates the signal flow, and the dashed line indicates the parameter adjustment path.
Figure 4. Hierarchical multi-agent system topology without leader.
Figure 5. State trajectories of the layered multi-agent systems in Experiment 1. For (a,b), the horizontal axis represents the simulation time steps, and the vertical axis represents the state of the system.
Figure 6. Local tracking error of the layered multi-agent systems in Experiment 1. The horizontal axis represents the simulation time steps, and the vertical axis represents the local tracking error value of the system.
Figure 7. Hierarchical multi-agent system topology with leader.
Figure 8. State trajectories of the layered multi-agent systems in Experiment 2. For (a,b), the horizontal axis represents the simulation time steps, and the vertical axis represents the state of the system.
Figure 9. Local tracking error of the layered multi-agent systems in Experiment 2. The horizontal axis represents the simulation time steps, and the vertical axis represents the local tracking error value of the system.
Table 1. Parameter settings in the experiment.
Parameter Name                                      Number
learning rate of critic network (l_c)               0.001
learning rate of actor network (l_a)                0.001
discount factor (γ)                                 0.995
soft update rate (τ)                                0.001
experience replay buffer capacity (R)               50,000
batch number (M)                                    128
hidden layer nodes of critic network (N_ch)         30
hidden layer nodes of actor network (N_ah)          30
