1. Introduction
In recent years, driven by the intense development of technology and the rapid improvement of data-processing capabilities, many practical engineering systems have become increasingly complex. The study of large, intricate systems has therefore emerged as a prominent research focus. A main research trend at present is to model complex network systems by dividing them into multiple simple, identical subsystems [1]. The hierarchical multi-agent system is an architecture for organizing and managing multiple agents [2]. The agents of such a system can be divided into different layers, each with its own responsibilities and control objectives. Typically, higher-layer agents are responsible for global policy making, while lower-layer agents perform more specific tasks or operations. The model of a hierarchical multi-agent system can therefore be described by multiple subsystems. This holds significant practical relevance: many identical simple agents can be combined into one large, complex system, accomplishing tasks that no single agent could complete alone. In the control of robotic swarms [3], hierarchical multi-agent systems can be employed to coordinate multiple robots to accomplish complex tasks. For example, in search-and-rescue missions [4], higher-layer agents can plan the search area, while lower-layer agents carry out specific search actions. In intelligent traffic management [5], higher-layer agents can be responsible for overall traffic-flow management, whereas lower-layer agents control specific traffic signals or vehicles. In complex supply-chain systems [6], higher-layer agents can perform global supply-chain optimization, while lower-layer agents handle specific tasks such as production, transportation, or inventory management. By decomposing complex tasks and allocating them to different layers of agents, hierarchical multi-agent systems can enhance the efficiency and flexibility of the overall system. Over the past decade, the analysis of hierarchical multi-agent systems and the development of distributed control schemes have become focal issues in the field of control and have produced numerous research outcomes [7].
For large-scale multi-agent systems, the model is usually described by a hierarchical structure, because a large-scale system can typically be divided into groups, and the number of agents and the communication topology may differ between groups. Research on the consensus control of hierarchical multi-agent systems is the basis of all the problems mentioned above; consensus control is critical for ensuring the stability and performance of the entire system. Williams et al. [8] outlined the concept of hierarchy used to describe relationships between subformations, for example, “formations of formations”. This concept is valuable in fields such as robotics, swarm intelligence, and military applications. The authors provide a strong theoretical foundation by discussing the stability dynamics of hierarchical formations, which is important for ensuring that the proposed structures are robust and applicable in real-world scenarios, although it may pose challenges in large-scale systems with unpredictable environmental factors. Smith et al. [9] applied the hierarchical structure to the circular tracking of vehicle networks; compared with the traditional circular tracking algorithm, better results are obtained, and the convergence rate of vehicle groups to the center of mass is significantly increased. Hara et al. [10] proposed a general hierarchical multi-agent model with a fractal structure and studied the stability, global convergence, and low rank of the interconnection structure between layers. Consensus in hierarchical systems with low-rank interconnections has practical implications in fields such as decentralized control and distributed optimization. However, like many theoretical works, it relies on simplifying assumptions that might not hold in real-world scenarios, potentially limiting the generalizability of the findings. Tsubakino and Hara [11] proposed a hierarchical characterization method based on eigenvectors for low-rank interconnected multi-agent dynamical systems. This eigenvector-based approach has practical implications in areas such as robotic coordination, decentralized decision making, and networked systems, where efficient and stable inter-group communication is crucial. However, while low-rank interconnections aim to reduce complexity, eigenvector computations in large, dynamic systems may still be computationally expensive, especially in real-time applications where swift decisions are needed. Sang et al. [12] studied the group consensus problem based on hierarchical containment control in linear multi-agent systems, considering the influence of time delay on consensus in dynamic environments; a robust control strategy is designed to ensure that the system achieves stable containment control and consensus in the presence of dynamic changes and communication delays. Wang et al. [13] proposed a new hierarchical framework for the distributed control of multi-agent systems. The framework divides the system into multiple levels, each responsible for a different control task, and is designed to improve the scalability and robustness of the system while simplifying the control design of complex systems.
Hierarchical reinforcement learning is one of the most effective methods for solving large-scale, multi-task reinforcement learning problems [14]. Through task decomposition, a strategy is learned for the goal of each subtask, and the learned strategies are combined into a global strategy, which can effectively alleviate the curse of dimensionality in large-scale systems. Makar et al. [15] proposed a hierarchical reinforcement learning algorithm based on value function decomposition to solve the multi-agent discrete control problem. The combination of hierarchical structures with multi-agent reinforcement learning represents a novel framework: hierarchical approaches allow complex tasks to be decomposed into subtasks, making it easier to handle environments with multiple agents and large state–action spaces. However, implementing a hierarchical algorithm is often more challenging; designing the appropriate hierarchy of tasks and subtasks and determining how agents should transition between them can require significant domain knowledge and expertise. To solve large-scale, complex continuous control problems, Yang et al. [16] proposed the hierarchical Deep Deterministic Policy Gradient (DDPG) algorithm, which has a dual-critic-network and multi-actor-network structure. The critic network is divided into two layers: the first layer uses a multi-DDPG structure to solve simple task problems, while the second layer focuses on solving complex combined tasks. Hierarchical reinforcement learning can also address the problem of switching topologies in multi-agent systems. For example, Wang et al. [17] proposed a two-layer reinforcement learning framework to solve the output consensus problem of multi-agent systems under switching topologies. The first layer uses the Q-learning algorithm, wherein each agent selects the optimal action according to its current state, and the second layer selects the appropriate action strategy according to the current topology to ensure output consensus of the entire system. This two-layer structure gives the control strategy strong adaptability and robustness, allowing rapid adjustment when the topology changes.
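The first-layer Q-learning rule used in frameworks of this kind is the standard temporal-difference update. The following is a minimal sketch; the state/action space sizes, learning rate, and discount factor are illustrative assumptions, not the setup of the cited work:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Standard tabular Q-learning temporal-difference update."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped one-step target
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the estimate toward the target
    return Q

# Toy usage: 4 states, 2 actions (illustrative only).
Q = np.zeros((4, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```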
The research on hierarchical consensus control reviewed above largely depends on a specific ordering of tasks, which can limit the convergence performance of the algorithms. In this paper, by contrast, a hierarchical consensus control algorithm based on value function decomposition is proposed for hierarchical multi-agent systems. According to the communication content of each layer of the multi-agent system and the corresponding control objectives, the reward function is decomposed, and two value functions are obtained. Specifically, a dual-critic and single-actor network structure is designed for each agent. The update of the dual-critic network is based on the decomposition of the value function across tasks, and the two decomposed value functions have no logical order, so the training order of the two critic networks need not be considered during training. In addition, this paper introduces target networks to avoid overfitting in the critic networks and to improve the stability of online learning. In the update of the network parameters, a soft update mechanism and an experience replay buffer are introduced to slow down the update rate of the networks and improve the utilization of training samples. The main contributions of this paper are as follows:
- (1)
For hierarchical multi-agent systems, a hierarchical consensus control algorithm based on value function decomposition is proposed. Firstly, the structure of the algorithm is that of a distributed actor–critic network, which ensures that the distributed characteristics of multi-agent systems are fully utilized. In addition, a value function decomposition algorithm is introduced to ensure the simultaneous optimization of the control objectives of agents at different levels, which gives the training process a certain robustness.
- (2)
A convergence and stability analysis of the consensus control algorithm with the soft update mechanism is provided. It is proved that, for each agent, the action value function estimated by the critic network converges to the optimal value, the policy output by the actor network converges to the corresponding optimal policy, and the multi-agent system is asymptotically stable.
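The soft update mechanism referred to above is Polyak averaging of target-network parameters toward the online-network parameters. A minimal sketch follows; the rate τ and the parameter shapes are illustrative assumptions, not values from this paper:

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: the target network slowly tracks the online network,
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]

# Toy usage with one weight vector per network (illustrative).
online = [np.array([1.0, 2.0])]
target = [np.array([0.0, 0.0])]
target = soft_update(target, online, tau=0.1)
```

A small τ slows the drift of the target network, which is what stabilizes the bootstrapped critic targets during online learning.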
In this paper, the implementation of the hierarchical consensus control algorithm based on value function decomposition is presented, and the convergence and stability of the algorithm are analyzed. Finally, the correctness of the theoretical analysis and the effectiveness of the algorithm are verified by two experiments in Section 5, which consider multi-agent systems both with and without a leader.
2. Preliminaries
2.1. Graph Theory
The communication of the leader–follower multi-agent system can be represented by a weighted directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{A})$, which is composed of a set of $N$ agents $\mathcal{V}=\{v_1,\dots,v_N\}$, a set of edges $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$, and a non-negative adjacency matrix $\mathcal{A}=[a_{ij}]\in\mathbb{R}^{N\times N}$. Here, $a_{ij}>0$ if and only if $(v_j,v_i)\in\mathcal{E}$, which means agents $i$ and $j$ can communicate with each other; otherwise, $a_{ij}=0$. For all $i$, $a_{ii}=0$. The in-degree matrix $D=\mathrm{diag}(d_1,\dots,d_N)$ is a diagonal matrix with $d_i=\sum_{j=1}^{N}a_{ij}$, and the Laplacian matrix is defined as $L=D-\mathcal{A}$. The communication between the leader and the followers is modeled by the diagonal matrix $B=\mathrm{diag}(b_1,\dots,b_N)$, where $b_i>0$ means there is a directed path between the leader and the $i$th follower; otherwise, $b_i=0$. If a digraph contains a directed spanning tree, there exists a directed path from the leader to every other follower.
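As a concrete illustration of these definitions, the following sketch builds the in-degree and Laplacian matrices from a small adjacency matrix; the 3-agent topology is an arbitrary example, not one used in this paper:

```python
import numpy as np

# Adjacency matrix of a 3-agent graph (illustrative).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

D = np.diag(A.sum(axis=1))   # in-degree matrix: d_i = sum_j a_ij
L = D - A                    # graph Laplacian

# Each row of L sums to zero, a defining property of the Laplacian.
print(L.sum(axis=1))
```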
2.2. Problem Formulation
Consider the hierarchical (two-layer) discrete multi-agent system [18]. The structure is shown in Figure 1. The system contains $N$ agents:
$$x_i(k+1)=Ax_i(k)+Bu_i(k),\qquad i=1,2,\dots,N,$$
where $x_i(k)\in\mathbb{R}^n$ is the system state and $u_i(k)\in\mathbb{R}^m$ is the control signal. The system matrix parameters are $A\in\mathbb{R}^{n\times n}$ and $B\in\mathbb{R}^{n\times m}$. The $N$ agents are divided into $q$ groups, each containing $p$ agents. Based on the graph theory above, the communication of the $p$ agents in each bottom-layer group can be represented by an adjacency matrix $\mathcal{A}_p$ of dimensions $p\times p$. The communication between the $q$ groups of the top layer can be represented by an adjacency matrix $\mathcal{A}_q$ of dimensions $q\times q$. The Laplacian matrices are $L_p$ and $L_q$, corresponding to the adjacency matrices $\mathcal{A}_p$ and $\mathcal{A}_q$, respectively. The communication mode between groups, and whether the agents in each group can receive inter-group information, are crucial to the consensus of hierarchical multi-agent systems. In this paper, it is assumed that the agents in each group can receive information from outside the group according to the topology $\mathcal{A}_q$, and that the information exchanged between groups is fixed as the average state of all agents in each group [19]; that is, for the $h$th group of agents in the bottom layer, $\bar{x}_h(k)=\frac{1}{p}\sum_{i=1}^{p}x_i^{h}(k)$.
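The inter-group signal described above is simply the per-group mean state. A minimal sketch follows; the group sizes and state values are illustrative:

```python
import numpy as np

def group_means(states, q, p):
    """states: array of shape (q*p, n) stacking the states of all agents,
    group by group. Returns the (q, n) array of per-group average states,
    which is the quantity exchanged between groups."""
    n = states.shape[1]
    return states.reshape(q, p, n).mean(axis=1)

# Toy usage: q=2 groups, p=3 agents per group, scalar state n=1.
x = np.array([[1.0], [2.0], [3.0], [10.0], [20.0], [30.0]])
print(group_means(x, q=2, p=3))
```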
Assumption 1. There is no time delay in intra-group communication and inter-group communication in hierarchical multi-agent systems. The number and topology of agents in each group are the same.
In two-layer discrete multi-agent systems, the agents in the same group are relatively concentrated; for example, the communication distances between the robots or vehicles that form the communication nodes within a group are short, so the intra-group communication delay is small and can be ignored. The communication distance between groups is often much larger, the communication lines are long, and the signal transmission capacity is limited, so the inter-group communication delay of each group of agents cannot, in general, be ignored. To reduce the difficulty of the algorithm, however, the problem of communication delay is not considered in this paper.
In hierarchical multi-agent systems, the number and topology of agents within each group can indeed be the same, but this is not always necessary. If each group has the same number of agents, the system design and interaction protocols are simplified; this uniformity makes coordination easier, since each group can follow a similar communication and decision-making process. If each group shares the same topology, the connections and interactions between agents are similar across groups. The specific number and topology of agents depend on the requirements of the problem.
Definition 1. The two-layer discrete multi-agent system (1) reaches hierarchical consensus if, for any initial state, $\lim_{k\to\infty}\|x_i(k)-x_j(k)\|=0$ for all $i,j\in\{1,\dots,N\}$. If the hierarchical multi-agent system has a leader agent, consensus can be defined as
$$\lim_{k\to\infty}\|x_i(k)-x_0(k)\|=0,\qquad i=1,\dots,N,$$
where $x_0(k)$ is the reference signal of the multi-agent system.
Definition 2. The two-layer discrete multi-agent systems (1) and (2) reach hierarchical consensus if, for any initial state, $\lim_{k\to\infty}\|x_i(k)-x_j(k)\|=0$ for all $i,j$. Considering the communication characteristics of hierarchical multi-agent systems, each agent can obtain both intra-group and inter-group information, so the local tracking error for agent $i$ in each group is defined in (3).
According to (3), the global tracking errors are defined in (4), where ⊗ is the Kronecker product, $I_n$ is the $n$-dimensional identity matrix, $I_q$ is the $q$-dimensional identity matrix, $x^h(k)$ is the state of the agents in group $h$, $x(k)=[x_1^{\mathrm T}(k),\dots,x_N^{\mathrm T}(k)]^{\mathrm T}$ is the state vector of all agents, and $x_0(k)$ is the state vector of the leader agent; the remaining terms denote the inter-group state error and the information interaction between the group structures, and if the multi-agent system has no leader, both of these terms are zero.
According to (4), the global tracking error can be expressed in the compact form (5).
For hierarchical multi-agent systems with a leader agent, in order to reach consensus, the global consensus error of the whole system is defined in (6).
Lemma 1 ([20]). If the matrix relating the global consensus error to the global tracking error is nonsingular, then $\|\delta(k)\|\le\|e(k)\|/\underline{\sigma}$, where $\underline{\sigma}$ is the minimum singular value of that matrix. Lemma 1 shows that if the global tracking error is small enough, then the global consensus error can be made arbitrarily small. To ensure that the inequality in Lemma 1 holds, the following assumption is made:
Assumption 2. The communication graphs of the bottom layer and the top layer of the multi-agent systems are connected.
2.3. Optimal Consensus Control Based on Value Function Decomposition
Considering the two-layer multi-agent systems, the performance index function of agent $i$ in any group can be defined as follows:
$$J_i(k)=\sum_{t=k}^{\infty}\gamma^{\,t-k}\,r_i\big(e_i(t),u_i(t)\big),$$
where $\{u_i(k),u_i(k+1),\dots\}$ is the control sequence composed of all the control inputs of agent $i$ from the current moment onward, $\gamma\in(0,1)$ is the discount factor, and $r_i(\cdot)$ is the reward value provided by the environment.
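A finite-horizon numerical sketch of this discounted performance index is given below; the reward sequence and discount factor are arbitrary illustrations:

```python
def discounted_return(rewards, gamma=0.9):
    """Truncated performance index: sum_{t=0}^{T-1} gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Toy usage (illustrative reward sequence): 1 + 0.9 + 0.81 = 2.71.
J = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```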
The objective of the optimal consensus control is to minimize the performance index function (7). According to Bellman's principle of optimality, the optimal state value function $V_i^{*}$ satisfies the following equation:
$$V_i^{*}\big(e_i(k)\big)=\min_{u_i(k)}\Big\{r_i\big(e_i(k),u_i(k)\big)+\gamma V_i^{*}\big(e_i(k+1)\big)\Big\}.$$
The DTHJB equation for agent $i$ can be expressed as follows:
Therefore, the DTHJB equation for hierarchical multi-agent systems can be expressed as follows:
The action value function $Q$ and the optimal action value function $Q^{*}$ are introduced. Then, the optimal action value function of agent $i$ is as follows:
$$Q_i^{*}\big(e_i(k),u_i(k)\big)=r_i\big(e_i(k),u_i(k)\big)+\gamma\min_{u_i(k+1)}Q_i^{*}\big(e_i(k+1),u_i(k+1)\big).$$
By minimizing the action value function, the optimal control value can be obtained directly as follows:
$$u_i^{*}(k)=\arg\min_{u_i(k)}Q_i^{*}\big(e_i(k),u_i(k)\big).$$
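For a discrete set of candidate controls, the greedy control above reduces to an argmin over the Q-values. A minimal sketch follows; the Q-values are a stand-in for the learned critic, not the paper's network:

```python
import numpy as np

def greedy_action(q_values):
    """Pick the action minimizing the (cost-based) action value function."""
    return int(np.argmin(q_values))

# Toy usage: Q-values of 4 candidate controls for the current error state.
q = np.array([3.2, 1.5, 2.8, 4.0])
print(greedy_action(q))   # -> 1
```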
For two-layer discrete multi-agent systems, a hierarchical reinforcement learning algorithm can be introduced to decompose the performance index function and the action value function according to the control objectives of the different layers. The following theorem gives the conditions that the decomposition of the action value function must satisfy, providing the basis for the implementation of the algorithm in the next section.
Theorem 1 ([21]). Suppose the reward function $r$ can be decomposed into $M$ reward functions, namely, $r=\sum_{m=1}^{M}r_m$. Then, the performance index function and the action value function can be decomposed into $J=\sum_{m=1}^{M}J_m$ and $Q=\sum_{m=1}^{M}Q_m$, respectively.
Proof of Theorem 1. The performance index function $J(k)$ is expressed as follows:
$$J(k)=\sum_{t=k}^{\infty}\gamma^{\,t-k}\,r(t).$$
According to the decomposed form of the reward function in Theorem 1, the performance index function can be decomposed into the following:
$$J(k)=\sum_{t=k}^{\infty}\gamma^{\,t-k}\sum_{m=1}^{M}r_m(t)=\sum_{m=1}^{M}\sum_{t=k}^{\infty}\gamma^{\,t-k}\,r_m(t)=\sum_{m=1}^{M}J_m(k).$$
Similarly, the decomposition form of the action value function can be obtained:
$$Q(k)=\sum_{m=1}^{M}Q_m(k).$$
□
From Theorem 1, we know that the action value function can be decomposed into several sub-functions for different objectives, with no sequential relationship between the objectives. How to optimize these sub-functions separately is the problem studied below.
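The additivity in Theorem 1 can be checked numerically: discounting a summed reward gives the same return as summing the discounted sub-returns. The sub-reward sequences below are arbitrary illustrations:

```python
def discounted_return(rewards, gamma=0.9):
    """sum_{t} gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Two sub-reward sequences, e.g. an intra-group and an inter-group objective.
r1 = [1.0, 0.5, 0.25]
r2 = [0.0, 2.0, 1.0]
total = [a + b for a, b in zip(r1, r2)]

lhs = discounted_return(total)                        # J from the summed reward
rhs = discounted_return(r1) + discounted_return(r2)   # J1 + J2
print(abs(lhs - rhs) < 1e-12)   # -> True
```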