Article

Research on Multi-Robot Formation Control Based on MATD3 Algorithm

by Conghang Zhou, Jianxing Li, Yujing Shi and Zhirui Lin
1 School of Electronic, Electrical Engineering and Physics, Fujian University of Technology, Fuzhou 350118, China
2 Technical Development Base of Industrial Integration Automation of Fujian Province, Fuzhou 350118, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(3), 1874; https://doi.org/10.3390/app13031874
Submission received: 15 December 2022 / Revised: 19 January 2023 / Accepted: 19 January 2023 / Published: 31 January 2023

Abstract

This paper investigates multi-robot formation control strategies in environments with obstacles based on deep reinforcement learning. To address the value-function overestimation problem of the deep deterministic policy gradient (DDPG) algorithm, this paper proposes an improved multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm that combines the twin delayed deep deterministic policy gradient (TD3) algorithm with the centralized training and decentralized execution (CTDE) framework and adopts a prioritized experience replay strategy to improve learning efficiency. To address the difficulty of obstacle avoidance for a robot formation, a hybrid reward mechanism is designed that applies different formation maintenance strategies in obstacle and obstacle-free areas, so that the formation achieves obstacle avoidance by changing its shape appropriately. Simulation experiments verified the effectiveness of the multi-robot formation control strategy designed in this paper, and comparative simulations verified that the algorithm converges faster and performs more stably.

1. Introduction

The fields of formation control, cooperative control, and swarm control of multi-agent systems have been rapidly developing in recent decades [1]. The research on formation control focuses on making multi-agent systems generate and maintain a predetermined formation structure to perform the corresponding scenario tasks when moving toward a target area. Multi-robot formation control has been widely used in military and firefighting applications, such as cruise reconnaissance, collaborative search and rescue, and transportation [2,3,4,5,6]. In the early stages of formation control research, researchers applied traditional control methods such as predictive control [7] and sliding mode variable structure control [8] to solve formation problems and developed formation control strategies based on the leader-following method [9], the virtual structure method [10], and the behavior-based method [11]. All of the above traditional control methods require accurate system modeling to calculate the control law, which significantly increases the complexity of the system design.
Obstacle avoidance during formation navigation is a fundamental and important capability for multiple robots. The work [12] proposed a distributed control method for the formation navigation task of robot teams that combines constrained nonlinear optimization with consensus to achieve obstacle avoidance in formations; however, this approach must still cope with the risk of deadlock. The work [13] combined state transition techniques with model predictive control (MPC) to propose a decentralized controller for the formation reconfiguration of a UAV system in an obstacle environment. The experimental results showed that, with this method, the UAV system can successfully learn an unknown dynamics model and solve the optimal control problem using MPC.
Over the years, reinforcement-learning-based solutions to formation control problems have received much attention, since they avoid cumbersome dynamics modeling and cope well with changes in unknown environments through self-learning capabilities [14]. The work [15] used the GQ(λ) algorithm to learn leader-following control, which improved the path-planning capability of the followers and was tested on a physical robot platform; however, only a linear formation structure could be considered. The work [16] improved the actor–critic algorithm by adding an adaptive identifier to estimate the unknown system dynamics and performed a stability analysis. The work [17] improved the DDPG algorithm by applying the advantage function to a multi-USV formation generation and maintenance strategy. The work [18] used the DDPG algorithm to solve the USV formation path-tracking problem by adding a random braking mechanism after action selection, which allowed the USV team to control its velocity correctly when approaching the reference position and prevented it from falling into a local optimum. The work [19] improved the VDN algorithm by proposing a C2VDN algorithm with cognitive consistency for the fish-like robot formation control problem, in which all robots share state-value network parameters, and achieved circular formation control with a decentralized controller. The work [20] used the DDPG algorithm for formation motion planning and control of omnidirectional mobile robot teams and performed comparison experiments with the artificial potential field algorithm. We also refer to [21,22,23,24] for more related research on learning-based methods for different applications.
All of the works above studied and improved formation control in their respective scenarios using reinforcement learning algorithms. However, most of them do not consider the agents' ability to detect obstacles, and there is little research on multi-agent formation navigation strategies in environments with obstacles. On the other hand, most of the reinforcement learning algorithms used in the existing literature are variants of the DDPG algorithm [25], which originated from the DQN algorithm [26] and is an off-policy approach. The DDPG algorithm updates the target value function during training using the action that is currently considered to have the largest Q value, which can lead to Q-value overestimation. This overestimation is usually non-uniform, leading to unreliable decisions. Motivated by these problems, this paper investigates multi-robot formation navigation in environments with obstacles using deep reinforcement learning in conjunction with the leader-following method. Based on the twin delayed deep deterministic policy gradient (TD3) algorithm [27] combined with the CTDE framework [28], an improved multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm is proposed. The main contributions of this paper are as follows: (1) To address the overestimation problem of the DDPG algorithm, we used the TD3 algorithm in combination with the CTDE framework to implement multi-robot formation control. (2) We designed a hybrid reward mechanism that allows the robot formation to change its shape reasonably when facing an obstacle area and thus pass through it. (3) A prioritized experience replay (PER) mechanism [29] was added to make full use of high-quality samples during training and improve learning efficiency. At the same time, we set up both guided rewards and sparse rewards to facilitate robot learning. The simulation results showed that the algorithm performs the formation navigation task in environments with obstacles well.
The rest of the paper is organized as follows: In Section 2, we provide the problem description and system modeling. Section 3 describes the reinforcement learning algorithm used in this paper and the improvements made. Section 4 describes the reinforcement-learning-based formation control scheme, including the state space design and reward function design. In Section 5, we conduct the simulation experiments and analyze the experimental results. Finally, Section 6 concludes the whole paper.

2. Problem Description and Modeling

2.1. System Model

In a two-dimensional environment, there exist $N$ omnidirectional robots: the index $i = lea$ denotes the leader robot $r_{lea}$, and $i = 1, \ldots, N-1$ denotes the following robots $r_i$. The position and velocity of $r_{lea}$ are denoted by $p_{lea}$ and $v_{lea}$, respectively, and the position and velocity of $r_i$ are denoted by $p_i$ and $v_i$, respectively. There exist $M$ circular obstacles of radius $R_{obs}$, where $o_j$ denotes the $j$-th obstacle, $j = 1, \ldots, M$. The position of $o_j$ is denoted as $p_{o_j}$, and there exists one target point whose position is denoted as $p_g$. The dynamics model of robot $r_i$ is described as follows:
$$a_i^t = F_i^t / m, \qquad v_i^t = v_i^{t-1} + a_i^t \cdot \Delta t, \qquad p_i^t = p_i^{t-1} + v_i^t \cdot \Delta t \qquad (1)$$
where $i \in \{1, \ldots, N-1\} \cup \{lea\}$; $a_i^t$, $v_i^t$, and $p_i^t$ all belong to $\mathbb{R}^2$ and denote the acceleration, velocity, and position of robot $r_i$ at moment $t$; $\Delta t$ denotes the system sampling period; and $m$ is the mass of the robot. Let $u_i^t = F_i^t$ denote the control input, where $F_i^t \in \mathbb{R}^2$ represents the force vector of robot $r_i$.
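As a minimal illustration of this point-mass model, the following Python sketch performs one integration step of Equation (1); the mass, sampling period, and function name are placeholders rather than values taken from the paper.

```python
import numpy as np

def step_dynamics(p, v, F, m=1.0, dt=0.1):
    """One Euler step of the point-mass model in Eq. (1).

    p, v, F are 2D numpy arrays (position, velocity, force);
    m and dt are illustrative values for the robot mass and sampling period.
    """
    a = F / m                    # acceleration from the control force
    v_next = v + a * dt          # integrate velocity
    p_next = p + v_next * dt     # integrate position (semi-implicit Euler)
    return p_next, v_next

# Example: a robot at rest at the origin pushed along +x for one sampling period.
p, v = np.zeros(2), np.zeros(2)
p, v = step_dynamics(p, v, F=np.array([1.0, 0.0]))
```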

2.2. Formation Problem Description

The multi-robot formation scenario is shown in Figure 1, where there are some obstacles and robots in a finite plane area, and the robots are divided into leader and followers. The formation navigation task requires the robots to collectively form a default formation and reach the target area while avoiding collisions.
As shown in Figure 2, at the beginning, the robots move the shortest path from their respective initial positions to form a default formation.
For the formation avoidance problem, we considered the following scheme: when the leader robot leads the formation into the obstacle area (we assumed that the leader robot knows the area range of the obstacle distribution), the formation changes into a chain structure when it passes the obstacle area, and after passing it, it returns to the default formation and continues to move towards the target point, as shown in Figure 3.
In this paper, we assumed that the leader robot can globally perceive the location of the target point. Define the detection distance of the robot as $\rho_d > 0$: when the distance between a robot and an obstacle is less than $\rho_d$, the robot can sense the obstacle's position. During the formation movement, each robot can obtain the status information of itself and the other robots in real time, which includes the velocity information $v_i^t$ and position information $p_i^t$. The control target of formation navigation is divided into three parts: maintaining the default formation, avoiding collisions, and reaching the target position.
(1) Maintaining the default formation: The formation members start from their respective starting positions and move to form a default formation, and the default spacing should be maintained between all robots. Therefore, the control objective is to minimize the distance error between robots:
$$\min \, \big| d_{ij} - d_{ij}^* \big|, \quad \forall i, j \in \{1, \ldots, N\}, \; i \neq j \qquad (2)$$
where $d_{ij}^*$ is the desired relative distance between robot $r_i$ and robot $r_j$ and $d_{ij}$ is the actual relative distance between robot $r_i$ and robot $r_j$.
(2) Collision avoidance: During the formation movement, the distances between robots and between robots and obstacles should satisfy $d_{ij} \geq \rho_r$ and $dis(r_i, o_j) \geq \rho_o$, where $dis(\cdot)$ denotes the Euclidean distance in two dimensions and $\rho_r$ and $\rho_o$ (with $\rho_d > \rho_o$) are the safe distance between robots and the safe distance between robots and obstacles, respectively.
(3) Reaching the target position: During the formation movement, our goal is to minimize the distance between the leader robot and the target point, as follows:
$$\min \, \| p_{lea} - p_g \|_2 \qquad (3)$$

3. Multi-Robot Formation Control Based on Multi-Agent Reinforcement Learning

3.1. Multi-Agent Reinforcement Learning

In single-agent reinforcement learning, the environment is stationary from the agent's point of view. In a multi-agent system, however, every agent interacts with the environment simultaneously, so the environment becomes non-stationary from each individual agent's perspective. To address this problem, the centralized training and decentralized execution (CTDE) framework [28] is often used. During training, all agents participate simultaneously to jointly improve the parameters of the policy networks and the value networks; after training, each agent makes decisions and takes actions according to its own policy network.
The Markov decision process (MDP) is defined as the process of interaction between an agent and its environment [30]. The Markov decision process of a multi-robot formation can be represented explicitly by a tuple $\langle S, A, R, O, P \rangle$, where $S$ denotes the state space of the environment and $s^t \in S$ denotes the state at moment $t$. $A = \{A_1, \ldots, A_N\}$ denotes the joint action space of all robots, and the joint action of all robots is denoted as $a = (a_1, \ldots, a_N)$, where $a_i \in A_i$ denotes the action of robot $r_i$. $R = \{R_1, \ldots, R_N\}$ denotes the set of reward functions of all robots, and $r_i = R_i(s^t, a_i)$ denotes the reward of robot $r_i$ when it performs action $a_i$ in state $s^t$. The observation space of all robots is denoted as the set $O = \{O_1, \ldots, O_N\}$, and the observations of all robots are denoted by $o = (o_1, \ldots, o_N)$. The state transition probability function is denoted as $P(s^{t+1} \mid s^t, a_i)$, that is, the probability distribution specifying how the environmental state $s^t$ changes to the state $s^{t+1}$ at the next moment when robot $r_i$ performs action $a_i$ in state $s^t$.
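For concreteness, one joint transition of this MDP (the data later stored in the centralized replay buffer) could be represented as in the following sketch; the field names are our own and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class JointTransition:
    """One step of the multi-robot MDP: joint observation o, joint action a,
    per-robot rewards r, and the next joint observation o'."""
    obs: List[np.ndarray]       # o = (o_1, ..., o_N)
    actions: List[np.ndarray]   # a = (a_1, ..., a_N)
    rewards: List[float]        # r = (r_1, ..., r_N), with r_i = R_i(s_t, a_i)
    next_obs: List[np.ndarray]  # o' = (o_1^{t+1}, ..., o_N^{t+1})
```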

3.2. MATD3

Overestimation bias is a common problem in reinforcement learning: the action-value function is maximized when learning the optimal policy, which can cause the learned policy to become inaccurate [31]. The TD3 algorithm alleviates this problem by learning two critic networks and using the smaller of the two estimated Q values to update the target value function so that it stays close to the true Q value. Furthermore, TD3 uses a delayed policy update, where the critic networks are updated for a certain number of steps before the actor network is updated [27]. The MATD3 algorithm combines the implementation of TD3 with the "centralized training, distributed execution" architecture. The algorithm structure of MATD3 is shown in Figure 4. MATD3 uses two sets of actor–critic structures: each robot has its own independent actor network $\mu(o_i; \theta_i^\mu)$ with parameter $\theta_i^\mu$, which computes the action $a_i$ from the current observation $o_i$ of robot $r_i$. The two critic networks, $Q_1(o_1, \ldots, o_N, a_1, \ldots, a_N; \theta_i^{Q_1})$ with parameter $\theta_i^{Q_1}$ and $Q_2(o_1, \ldots, o_N, a_1, \ldots, a_N; \theta_i^{Q_2})$ with parameter $\theta_i^{Q_2}$, are trained in a centralized manner; they take the observations and actions of all robots as input and output estimated Q values that evaluate the goodness of action $a_i$, thus improving the performance of the actor network. For stable learning, a target actor network and two target critic networks are constructed with parameters $\hat{\theta}_i^\mu$, $\hat{\theta}_i^{Q_1}$, and $\hat{\theta}_i^{Q_2}$, respectively, and the parameter update law of the target networks is:
$$\hat{\theta}_i^\mu = \tau \theta_i^\mu + (1 - \tau) \hat{\theta}_i^\mu \qquad (4)$$
$$\hat{\theta}_i^{Q_1} = \tau \theta_i^{Q_1} + (1 - \tau) \hat{\theta}_i^{Q_1} \qquad (5)$$
$$\hat{\theta}_i^{Q_2} = \tau \theta_i^{Q_2} + (1 - \tau) \hat{\theta}_i^{Q_2} \qquad (6)$$
where $\tau$ is the soft update parameter. To alleviate the overestimation problem, the smaller of the two Q values is chosen to calculate the target Q value, so that the estimated Q value approaches a more realistic and reliable target, which in turn guides the parameter update of the actor network. The target Q value of robot $r_i$ is defined as follows:
$$y_i^t = r_i^t + \gamma \cdot \min_{j=1,2} Q_j\big(o_1^{t+1}, \ldots, o_N^{t+1}, \tilde{a}_1^{t+1}, \ldots, \tilde{a}_N^{t+1}; \hat{\theta}_i^{Q_j}\big) \qquad (7)$$
where $\gamma$ is the discount factor and $\tilde{a}_i^{t+1}$ is the clipped target action, which is the output of the target actor network with the following expression:
$$\tilde{a}_i^{t+1} = \mu\big(o_i^{t+1}; \hat{\theta}_i^\mu\big) + \mathrm{clip}\big(\mathcal{N}(0, \sigma), -c_i, c_i\big) \qquad (8)$$
where $\mathcal{N}(0, \sigma)$ denotes random noise drawn from a normal distribution and $c_i$ is the noise clipping bound.
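The following PyTorch-style sketch shows how the target value of Equation (7) can be computed with the target policy smoothing of Equation (8), i.e., clipped Gaussian noise and the minimum of the two target critics. The network interfaces, tensor shapes, and default constants are assumptions made for illustration, not the authors' implementation.

```python
import torch

def compute_target_q(rewards, next_obs, target_actors, target_critic1, target_critic2,
                     gamma=0.95, sigma=0.2, noise_clip=0.5):
    """Clipped double-Q target of Eq. (7) with target policy smoothing, Eq. (8).

    next_obs: list of per-robot observation tensors, each of shape [batch, obs_dim_i].
    target_actors: list of per-robot target policy networks.
    The two target critics take the concatenated observations and actions of all robots.
    """
    with torch.no_grad():
        next_actions = []
        for mu_target, o_next in zip(target_actors, next_obs):
            mu_out = mu_target(o_next)
            noise = (torch.randn_like(mu_out) * sigma).clamp(-noise_clip, noise_clip)
            next_actions.append(mu_out + noise)                  # ~a_i^{t+1}, Eq. (8)
        joint = torch.cat(next_obs + next_actions, dim=-1)       # centralized critic input
        q_next = torch.min(target_critic1(joint), target_critic2(joint))
        return rewards + gamma * q_next                          # y_i^t, Eq. (7)
```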
Define the two critic loss functions of robot $r_i$ as $J(\theta_i^{Q_1})$ and $J(\theta_i^{Q_2})$, as follows:
$$J\big(\theta_i^{Q_1}\big) = \mathbb{E}_{o^t, a^t, r^t, o^{t+1}}\Big[\big(Q_1(o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t; \theta_i^{Q_1}) - y_i^t\big)^2\Big] \qquad (9)$$
$$J\big(\theta_i^{Q_2}\big) = \mathbb{E}_{o^t, a^t, r^t, o^{t+1}}\Big[\big(Q_2(o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t; \theta_i^{Q_2}) - y_i^t\big)^2\Big] \qquad (10)$$
The two critic network parameters $\theta_i^{Q_1}$ and $\theta_i^{Q_2}$ are updated by gradient descent, as shown in Equations (11) and (12):
$$\theta_i^{Q_1} \leftarrow \theta_i^{Q_1} - \alpha_{C_1} \nabla_{\theta_i^{Q_1}} J\big(\theta_i^{Q_1}\big) \qquad (11)$$
$$\theta_i^{Q_2} \leftarrow \theta_i^{Q_2} - \alpha_{C_2} \nabla_{\theta_i^{Q_2}} J\big(\theta_i^{Q_2}\big) \qquad (12)$$
where $\alpha_{C_1}$ and $\alpha_{C_2}$ are the learning rates of the two critic networks, respectively. The actor gradient of robot $r_i$ is defined as follows:
$$\nabla_{\theta_i^\mu} J\big(\theta_i^\mu\big) = \mathbb{E}_{o^t, a^t}\Big[\nabla_{\theta_i^\mu} \mu_i\big(o_i^t; \theta_i^\mu\big) \, \nabla_{a_i} Q_1\big(o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t; \theta_i^{Q_1}\big)\big|_{a_i = \mu_i(o_i^t; \theta_i^\mu)}\Big] \qquad (13)$$
Then, the update law of the actor network parameter $\theta_i^\mu$ is:
$$\theta_i^\mu \leftarrow \theta_i^\mu + \alpha_A \nabla_{\theta_i^\mu} J\big(\theta_i^\mu\big) \qquad (14)$$
where $\alpha_A$ is the learning rate of the actor network.
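A condensed sketch of the update cycle described by Equations (9)–(14), including the delayed actor update and the soft target update of Equations (4)–(6), is given below. The container names (`nets`, `opts`, `batch`), target-network keys, and the delay/τ values are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def soft_update(target_net, net, tau=0.01):
    """Polyak averaging of Eqs. (4)-(6)."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

def matd3_update(i, batch, nets, opts, step, policy_delay=2, tau=0.01):
    """One MATD3 update for robot i.

    `batch` holds lists of per-robot observation/action tensors and the target
    values y computed as in Eq. (7); `nets`/`opts` hold this robot's networks
    (including target copies under "target_*") and their optimizers.
    """
    joint_in = torch.cat(batch["obs"] + batch["actions"], dim=-1)
    y = batch["target_q"]

    # Critic updates, Eqs. (9)-(12): descend the squared TD error of each critic.
    for key in ("q1", "q2"):
        loss = F.mse_loss(nets[key](joint_in), y)
        opts[key].zero_grad()
        loss.backward()
        opts[key].step()

    # Delayed actor update, Eqs. (13)-(14): ascend Q_1 w.r.t. robot i's own action.
    if step % policy_delay == 0:
        actions = [a.detach() for a in batch["actions"]]
        actions[i] = nets["actor"](batch["obs"][i])      # a_i = mu_i(o_i; theta_i^mu)
        actor_loss = -nets["q1"](torch.cat(batch["obs"] + actions, dim=-1)).mean()
        opts["actor"].zero_grad()
        actor_loss.backward()                            # only the actor optimizer is stepped
        opts["actor"].step()

        for key in ("actor", "q1", "q2"):                # soft-update the target networks
            soft_update(nets["target_" + key], nets[key], tau)
```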

3.3. Combining Prioritized Experience Replay

In order to reuse historical experience data, the experience replay mechanism was first introduced in the DQN algorithm to store historical data in an experience replay buffer, which greatly improves experience utilization [26]. However, uniform random sampling ignores the importance of the experience data, which may lead to uneven quality of the sampled experiences and, thus, reduce learning efficiency. To solve this problem, the prioritized experience replay (PER) method was proposed: PER assigns each experience a priority according to its importance and samples the more important experiences more frequently, thus increasing learning efficiency [29]. The PER method performs well in single-agent algorithms such as DQN and DDPG. To match the multi-agent CTDE framework, this paper used a centralized experience buffer to store the joint information of all agents and then introduced the prioritized experience replay mechanism to prioritize sampling according to experience quality.
On the other hand, the TD-error is the difference between the estimated Q value and the target Q value. It is normally used to update the critic network's estimate, i.e., it acts as a correction to that estimate, so the TD-error reflects how far the algorithm's assessment is from the true value of an experience. Samples with larger TD-error values are therefore given higher priority, and conversely, samples with smaller TD-error values are given lower priority. The TD-errors obtained by robot $r_i$ for sample $k$ through the two critic networks are shown in Equations (15) and (16):
$$\delta_{i,1}^k = y_i^t - Q_1\big(o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t; \theta_i^{Q_1}\big) \qquad (15)$$
$$\delta_{i,2}^k = y_i^t - Q_2\big(o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t; \theta_i^{Q_2}\big) \qquad (16)$$
In this paper, the larger of the absolute values of the two TD-errors was chosen as the priority of sample $k$, as shown in Equation (17):
$$p_k = \max\big(\big|\delta_{i,1}^k\big|, \big|\delta_{i,2}^k\big|\big) \qquad (17)$$
Then, the sampling probability of sample k can be defined as:
$$\tilde{p}_k = \frac{p_k^\alpha}{\sum_l p_l^\alpha} \qquad (18)$$
where $\tilde{p}_k$ is the sampling probability of sample $k$, $p_k^\alpha$ is the priority of sample $k$ in the replay buffer raised to the power $\alpha$, $\alpha$ is a hyperparameter between 0 and 1 that controls how strongly the priority affects sampling ($\alpha = 0$ reduces to uniform random sampling), and $l$ runs over the samples in the replay buffer.
Using the PER method changes the sampled state distribution. Before PER was proposed, the replay buffer served to break the correlation between data and make the sample data approximately independent and identically distributed. Because PER replays experiences with higher TD-error values more frequently, it inevitably introduces bias. To prevent this bias from causing oscillation or even divergence during training, importance sampling is introduced to appropriately reduce the weights of the frequently replayed samples so that the expected contribution of each sample to the gradient descent remains the same, thus ensuring the convergence of the results. The sampling weight of sample $k$ is defined as:
$$\omega_k = \left(\frac{1}{l} \cdot \frac{1}{\tilde{p}_k}\right)^{\beta} \qquad (19)$$
where $\beta$ is a hyperparameter between 0 and 1 that determines how much the importance sampling correction affects the convergence result; the larger $\beta$ is, the greater the relative weight given to the lower-priority samples. To improve the stability of training, we normalized the weights as follows:
$$\omega_k = \frac{(l \tilde{p}_k)^{-\beta}}{\max_m \omega_m} = \frac{(l \tilde{p}_k)^{-\beta}}{\max_m (l \tilde{p}_m)^{-\beta}} = \frac{\tilde{p}_k^{-\beta}}{\max_m \tilde{p}_m^{-\beta}} \qquad (20)$$
where $m$ is the index of the sample with the largest weight. Therefore, the two critic loss functions of the MATD3 algorithm can be rewritten as:
$$J\big(\theta_i^{Q_1}\big) = \mathbb{E}_{o^t, a^t, r^t, o^{t+1}}\Big[\omega_k \big(Q_1(o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t; \theta_i^{Q_1}) - y_i^t\big)^2\Big] \qquad (21)$$
$$J\big(\theta_i^{Q_2}\big) = \mathbb{E}_{o^t, a^t, r^t, o^{t+1}}\Big[\omega_k \big(Q_2(o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t; \theta_i^{Q_2}) - y_i^t\big)^2\Big] \qquad (22)$$
where $\omega_k$ is the sampling weight of sample $k$, $y_i^t$ is the target Q value of robot $r_i$, and $Q_j(o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t; \theta_i^{Q_j})$, $j = 1, 2$, are the estimated Q values of the two critic networks.
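The prioritized sampling of Equations (17)–(20) can be sketched with NumPy as follows; this is a simplified proportional variant (with a small stabilizing constant added to the priorities, and priorities recomputed from stored TD-errors for clarity), not the authors' implementation.

```python
import numpy as np

def per_sample(td_err_1, td_err_2, batch_size, alpha=0.6, beta=0.5, eps=1e-6):
    """Prioritized sampling over the whole buffer.

    td_err_1, td_err_2: arrays with the two critics' TD-errors for every stored sample.
    Returns the sampled indices and their normalized importance weights omega_k.
    """
    p = np.maximum(np.abs(td_err_1), np.abs(td_err_2)) + eps   # Eq. (17), eps keeps p > 0
    probs = p ** alpha / np.sum(p ** alpha)                    # Eq. (18)
    idx = np.random.choice(len(p), size=batch_size, p=probs)   # prioritized draw
    l = len(p)
    w = (l * probs[idx]) ** (-beta)                            # Eq. (19)
    w = w / ((l * probs.min()) ** (-beta))                     # Eq. (20): divide by the max weight
    return idx, w
```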
The flow of the MATD3 algorithm based on prioritized experience replay is shown in Algorithm 1. Steps 1 and 2 initialize the network parameters of each robot and the corresponding target network parameters. Step 3 initializes the training parameters, including the priority coefficient $\alpha$, the sampling weight coefficient $\beta$, the experience buffer $D$, the number of experiences $K$ sampled per update, and the policy update frequency $d$. Steps 4 to 26 describe the MATD3 framework with PER, as presented in Section 3.2; $o$ denotes the set of all robot observations, $a_i$ denotes the action of robot $r_i$, and $\mathcal{N}(0, \sigma)$ is the exploration noise used in training. MATD3 calculates the priority $p_k$ and the sampling weight $\omega_k$ of each sampled experience according to the formulas in Section 3.3 and subsequently updates the critic networks and the actor network. Finally, a soft update of the target network parameters is performed.
Algorithm 1 MATD3 with prioritized experience replay.
 1: Create the actor network $\mu$ and the two critic networks $Q_1$, $Q_2$ with parameters $\theta_i^\mu$, $\theta_i^{Q_1}$, and $\theta_i^{Q_2}$
 2: Initialize the target networks for $\mu$, $Q_1$, and $Q_2$ with parameters $\hat{\theta}_i^\mu$, $\hat{\theta}_i^{Q_1}$, and $\hat{\theta}_i^{Q_2}$
 3: Initialize the priority parameters $\alpha$, $\beta$; the replay buffer $D$; the minibatch size $K$; and the policy update frequency $d$
 4: for episode = 1 to total episodes do
 5:     Generate random noise for action exploration
 6:     Acquire the original observations $o = (o_1, \ldots, o_N)$
 7:     for step $t$ = 1 to max episode length do
 8:         For each robot $i$, select action $a_i = \mu_i(o_i) + \mathrm{clip}(\mathcal{N}(0, \sigma), -c_i, c_i)$
 9:         Execute actions $(a_1^t, \ldots, a_N^t)$; observe $r_i^t$ and $o_i^{t+1}$
10:         Store $(o^t, a_1^t, \ldots, a_N^t, r^t, o^{t+1})$ in $D$
11:         $s^t \leftarrow s^{t+1}$
12:         for robot $i$ = 1 to $N$ do
13:             for $j$ = 1 to $K$ do
14:                 Sample experience $k$ from $D$ with probability $\tilde{p}_k$
15:                 Compute the weight $\omega_k$ and TD-error $\delta_k$
16:                 Update the priority of $k$ according to $|\delta_k|$
17:             end for
18:             Calculate the target critic Q value via (7)
19:             Update the critics by minimizing the losses (21) and (22)
20:             if $t \bmod d = 0$ then
21:                 Update the policy $\mu_i$ with the gradient (13)
22:             end if
23:         end for
24:         Soft update the target network parameters $\hat{\theta}_i^\mu$, $\hat{\theta}_i^{Q_1}$, and $\hat{\theta}_i^{Q_2}$ via (4)–(6)
25:     end for
26: end for

4. Multi-Robot Formation Strategy Design Based on Improved MATD3

4.1. Robot Formation MDP Design

We designed a Markov decision process for robot formation based on a leader-following strategy. Using the leader-following strategy allows for a better description of the relationships between the formation members and makes it easier to implement and adjust the formation. In addition, to distinguish the relationship between the leader and followers and to make the formation more scalable, each robot has its own MDP. Robots in the formation are divided into three different categories: leader robot, left-following robots, and right-following robots, where:
(1) The leader needs to find an optimal strategy to plan the shortest path to reach the goal point while avoiding obstacles.
(2) The followers need to maintain the formation by keeping a predetermined distance from other robots.

4.2. State and Action Space Design

The state space is defined according to the specific type of robot in the formation. For the leader robot $r_{lea}$, the main goal is to find an optimal path to the target point, so its state information includes its own position $S_{lea}^p$ and velocity $S_{lea}^v$, the target point position $S_g^p$, and the detected obstacle positions $S_{lea}^o$, specifically expressed as:
$$S_{leader} = \big[S_{lea}^p, S_{lea}^v, S_g^p, S_{lea}^o\big] = \big[p_{lea}^x, p_{lea}^y, v_{lea}^x, v_{lea}^y, p_{lea}^x - p_g^x, p_{lea}^y - p_g^y, p_{lea}^x - p_{o_j}^x, p_{lea}^y - p_{o_j}^y\big], \quad j = 1, \ldots, M \qquad (23)$$
For the following robots, the left-following robots and the right-following robots share the same state space. The state information of following robot $r_i$ includes its own position $S_i^p$ and velocity $S_i^v$, the other robots' positions $S_l^p$, and the detected obstacle positions $S_i^o$, specifically expressed as:
$$S_{follower_i} = \big[S_i^p, S_i^v, S_l^p, S_i^o\big] = \big[p_i^x, p_i^y, v_i^x, v_i^y, p_l^x, p_l^y, p_i^x - p_{o_j}^x, p_i^y - p_{o_j}^y\big], \quad j = 1, \ldots, M \qquad (24)$$
where $i \in \{1, \ldots, N-1\}$ and $l \in \{1, \ldots, i-1, i+1, \ldots\}$ indexes the other robots.
The action space of robot $r_i$ is the force applied to it, i.e.,
$$A_i = \big[F_i^x, F_i^y\big], \quad i = 1, \ldots, N \qquad (25)$$
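To make Equations (23)–(25) concrete, the sketch below assembles the leader and follower observation vectors from raw positions and velocities; the array layout and function names are our own assumptions.

```python
import numpy as np

def leader_obs(p_lea, v_lea, p_goal, p_obstacles):
    """Leader state of Eq. (23): own position/velocity, relative goal position,
    and relative positions of all M obstacles (p_obstacles has shape (M, 2))."""
    rel_goal = p_lea - p_goal
    rel_obs = (p_lea - p_obstacles).ravel()        # flattened to shape (2M,)
    return np.concatenate([p_lea, v_lea, rel_goal, rel_obs])

def follower_obs(p_i, v_i, p_others, p_obstacles):
    """Follower state of Eq. (24): own position/velocity, the other robots'
    positions, and relative positions of all obstacles."""
    rel_obs = (p_i - p_obstacles).ravel()
    return np.concatenate([p_i, v_i, np.asarray(p_others).ravel(), rel_obs])

# The action of each robot is a 2D force vector, Eq. (25), e.g.:
action = np.array([0.3, -0.1])   # [F_x, F_y]
```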

4.3. Reward Function Design

The reward function of each robot is designed according to the multi-robot formation control problem. To avoid low training efficiency due to sparse rewards, we used a combination of guided rewards and sparse rewards. During the robot–environment interaction, if the robot reaches the target point, collides with another robot, collides with an obstacle, or reaches the preset spacing from the other robots, the sparse reward is used; otherwise, the guided reward is used.
Different types of robots play different roles. For the leader robot, the guided reward function is designed as:
$$R_G^{leader} = -\big\| p_{lea} - p_g \big\|_2 \qquad (26)$$
and sparse reward:
$$R_S^{leader} = \begin{cases} R_1, & \text{if } \|p_{lea} - p_g\|_2 \le \rho_g \\ R_4, & \text{if } \|p_{lea} - p_{o_j}\|_2 \le \rho_o \\ R_5, & \text{if } \|p_{lea} - p_i\|_2 \le \rho_r \end{cases} \qquad (27)$$
where $i = 1, \ldots, N-1$; $R_1$ is the leader robot's reward for reaching the target point; $R_4$ is the penalty for a collision between the leader robot and an obstacle; $R_5$ is the penalty for a collision between the leader robot and another robot; $\rho_g$ denotes the target point radius; $\rho_o$ denotes the obstacle safety distance; and $\rho_r$ denotes the collision safety distance.
When the leader robot enters the obstacle area, the formation will change to a more passable one. Specifically, the formation strategy is switched when the leader robot satisfies $p_{o,\min}^x < p_{lea}^x < p_{o,\max}^x$ and $p_{o,\min}^y < p_{lea}^y < p_{o,\max}^y$, where $\big(p_{o,\min}^x, p_{o,\max}^x, p_{o,\min}^y, p_{o,\max}^y\big)$ defines the range of the obstacle area.
We set up a hybrid reward mechanism for the following robots. Outside the obstacle area, the robot team forms a V-shaped formation to perform the formation task, and each following robot needs to maintain a preset distance from all its neighboring robots, as shown in Figure 2; when entering the obstacle area, each following robot only needs to maintain a preset distance from a specific neighboring robot in order to form a chain formation structure, as shown in Figure 5. The follower guided reward function is set as:
$$R_G^{follower} = \begin{cases} -\big|d_{ij} - d_{ij}^*\big|, & \text{if outside the obstacle area} \\ -\big|d_{nm} - d_{nm}^*\big|, & \text{otherwise} \end{cases} \qquad (28)$$
and the sparse reward as:
$$R_S^{follower} = \begin{cases} R_2, & \text{if outside the obstacle area and } d_{ij} = d_{ij}^* \\ R_3, & \text{if inside the obstacle area and } d_{nm} = d_{nm}^* \\ R_4, & \text{if } \|p_i - p_{o_j}\|_2 \le \rho_o \\ R_5, & \text{if } \|p_i - p_j\|_2 \le \rho_r \end{cases} \qquad (29)$$
where $d_{ij}$ and $d_{ij}^*$ denote the actual and desired relative distance between any two robots in the obstacle-free area, respectively; $d_{nm}$ and $d_{nm}^*$ denote the actual and desired relative distance between the $m$-th left-side following robot and the $n$-th right-side following robot in the obstacle area, respectively; $R_2$ and $R_3$ denote the rewards for maintaining the formation in the obstacle-free and obstacle areas, respectively; and $R_4$ and $R_5$ denote the penalty for a collision between any robot and any obstacle and the penalty for a collision between any two robots, respectively.
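The hybrid mechanism above can be summarized by the following sketch, which tests whether the leader is inside the rectangular obstacle area and switches between the two follower reward branches accordingly; the constants, the tolerance that replaces exact distance equality, and the function names are placeholders, not the authors' code.

```python
def in_obstacle_area(p_lea, x_min=-3.0, x_max=3.0, y_min=-2.0, y_max=2.0):
    """Formation-switching test for the leader (Section 4.3); bounds are placeholders."""
    return x_min < p_lea[0] < x_max and y_min < p_lea[1] < y_max

def follower_reward(d_ij, d_ij_star, d_nm, d_nm_star,
                    dist_to_obstacle, dist_to_robot, inside_area,
                    rho_o=0.1, rho_r=0.1, R2=1.0, R3=1.0, R4=-1.0, R5=-1.0, tol=1e-2):
    """Hybrid follower reward of Eqs. (28)-(29).

    R2/R3 are formation-keeping rewards, R4/R5 collision penalties; `tol`
    replaces the exact equality d = d* with a small tolerance (our simplification).
    """
    if dist_to_obstacle <= rho_o:
        return R4                                  # collided with an obstacle
    if dist_to_robot <= rho_r:
        return R5                                  # collided with another robot
    if not inside_area:                            # V-shaped formation branch
        return R2 if abs(d_ij - d_ij_star) < tol else -abs(d_ij - d_ij_star)
    return R3 if abs(d_nm - d_nm_star) < tol else -abs(d_nm - d_nm_star)
```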

5. Simulation Experiments

5.1. Experimental Environment Setup

The experimental hardware environment of this paper was an Intel Core i7-10700K CPU and NVIDIA GeForce RTX 3090 GPU, and a multi-robot formation simulation training environment was built based on the OpenAI platform. The simulation environment was a square two-dimensional plane with a side length of 8 m. Five robots existed in the environment, and one of them was the formation leader; the radius of each robot was set to 0.1 m, and the maximum speed limit was 0.85 m/s. Four obstacles and one target point existed in the environment; the radius of the obstacles was randomly distributed between (0.4 m, 0.6 m), and the radius of the target point was 0.05 m. At the beginning of the training, the obstacles and target point were randomly generated within a fixed range.

5.2. Training Parameters’ Setting

This experiment used the MADDPG algorithm, the MATD3 algorithm, and the improved MATD3 algorithm for training. The network structure was the same for all three algorithms: each network contained two hidden layers with 128 nodes per layer, the activation function was ReLU, and the optimizer was Adam. The maximum number of training episodes was 60,000; the maximum number of time steps per episode was 150; the learning rate was set to 0.001; the network parameters were updated every 100 steps; the discount factor $\gamma$ was 0.95; the parameter $\lambda$ was set to 0.2; the replay buffer size was $10^6$; the training batch size was 1024; the parameters $\alpha$ and $\beta$ were set to 0.6 and 0.5, respectively; the robot detection distance $\rho_d$ was set to 0.2; the collision safety distance $\rho_r$ and the obstacle safety distance $\rho_o$ were both set to 0.1; the obstacle area parameters $\big(p_{o,\min}^x, p_{o,\max}^x, p_{o,\min}^y, p_{o,\max}^y\big)$ were set to $-3$, $3$, $-2$, $2$, respectively; and the locations of the target points were randomly generated within the range $(3, 3.5)$.
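For reference, the settings listed above can be gathered into a single configuration structure, as in the sketch below; the dictionary keys are our own naming, while the values mirror the text.

```python
config = {
    "max_episodes": 60_000,
    "max_steps_per_episode": 150,
    "learning_rate": 1e-3,
    "update_every_steps": 100,
    "gamma": 0.95,
    "lambda": 0.2,
    "replay_buffer_size": 1_000_000,
    "batch_size": 1024,
    "per_alpha": 0.6,
    "per_beta": 0.5,
    "detection_distance_rho_d": 0.2,
    "safety_distance_rho_r": 0.1,
    "obstacle_safety_rho_o": 0.1,
    "obstacle_area": (-3, 3, -2, 2),   # (x_min, x_max, y_min, y_max)
}
```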

5.3. Experimental Results and Analysis

The goal of reinforcement learning is to maximize the cumulative discounted reward, so the cumulative episode average reward is commonly used to measure the goodness of the training results, and we used the average reward of the last 1000 episodes as the final episode average reward value. Training was performed using the three algorithms described above, respectively. The episode average rewards are shown in Figure 6. It can be seen that the three algorithms can learn better policies around 20,000 episodes and gradually reach convergence. In comparison, the improved MATD3 algorithm had the highest average reward after convergence and had better stability compared to the other two algorithms.
On the other hand, to measure the ability of the robot formation to avoid collisions, we used two metrics: the collision rate between robots (CBRS) and the collision rate between robots and obstacles (CBRO). They were obtained based on whether the distance between robots was less than $\rho_r$ and whether the distance between robots and obstacles was less than $\rho_o$, as shown in Equation (30):
$$CBRS = \frac{\text{Number of collisions between robots}}{\text{Number of time steps per episode}}, \qquad CBRO = \frac{\text{Number of collisions between robots and obstacles}}{\text{Number of time steps per episode}} \qquad (30)$$
The experimental results are shown in Figure 7. Obviously, the improved MATD3 algorithm had better performance in formation collision avoidance.
The path-planning ability of the leader robot directly determined the success of the formation navigation task, so we used the final distance between the leader robot and the target point (UDBLT) as one of the algorithm metrics. The experimental results are shown in Figure 8. All three algorithms can converge to near 0, indicating that the leader robot can successfully plan the path to the target point.
Formation stability error (RFSE) is an important indicator of robot formation navigation, which we define according to the following equation.
$$e(r) = \frac{1}{T} \sum_{t=0}^{T} \sum_{i, j \in N} \big| d_{ij} - d_{ij}^* \big|, \quad i \neq j \qquad (31)$$
where $d_{ij}$ and $d_{ij}^*$ denote the actual relative distance and the desired relative distance between robot $r_i$ and robot $r_j$, respectively. The experimental results are shown in Figure 9. It can be seen that the improved MATD3 algorithm had a lower formation stability error compared to the other algorithms, indicating that the multiple robots can maintain the preset formation well.
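A sketch of how these evaluation metrics (Equations (30) and (31)) could be computed from logged pairwise distances is given below; the trajectory format and the way collisions are counted per time step are assumptions.

```python
import numpy as np

def collision_rates(dist_rr, dist_ro, rho_r=0.1, rho_o=0.1):
    """CBRS and CBRO of Eq. (30).

    dist_rr: array [T, n_robot_pairs] of pairwise robot distances per time step.
    dist_ro: array [T, n_robot_obstacle_pairs] of robot-obstacle distances per time step.
    Each pair closer than the safety distance at a given step counts as one collision.
    """
    T = dist_rr.shape[0]
    cbrs = np.sum(dist_rr < rho_r) / T
    cbro = np.sum(dist_ro < rho_o) / T
    return cbrs, cbro

def formation_stability_error(dist_rr, desired_rr):
    """RFSE of Eq. (31): time-averaged sum of deviations from the desired pairwise distances."""
    return np.mean(np.sum(np.abs(dist_rr - desired_rr), axis=1))
```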
Table 1 shows the comparative data of the above metrics in more detail. According to Table 1, the improved MATD3 algorithm performed better in terms of collision avoidance and formation stability errors in the obstacle environment.
The trained multi-robot formation path is shown in Figure 10, where the robot team formed a preset formation at Step 10 and was moving toward the target area. At Step 35, the robot formation completed the formation change and entered the obstacle area. At Step 95, the robot formation passed the obstacle area, resumed the preset formation, and continued to move towards the target area. At the end, the robot formation successfully navigated to the target area. As can be seen, due to the hybrid reward mechanism, the robot formation can reasonably change its formation when facing the obstacle area and, thus, achieve obstacle avoidance more easily.

6. Conclusions

This paper investigated the multi-robot formation navigation problem in environments with obstacles using reinforcement learning methods, alleviated the overestimation problem using the TD3 algorithm, and made training more efficient by introducing the prioritized experience replay method. A formation control strategy was proposed in combination with the leader-following method. To handle randomly distributed environmental obstacles, a hybrid reward mechanism was designed to change the formation shape and collaboratively avoid the obstacles. The final experimental results showed that the improved algorithm had a lower collision rate and better formation stability. Applying the algorithm to environments with dynamic obstacles and reducing the training time are directions for our future research. It would also be interesting to integrate physics-based obstacle-detection methods into our learning-based framework [32,33,34,35,36,37,38,39].

Author Contributions

C.Z. proposed and implemented the methods of this paper, performed the experiments, analyzed the results, and wrote the original manuscript. Z.L. participated in some experiments and data collection. J.L. and Y.S. made suggestions for the analysis of the experimental results and participated in some of the manuscript revisions. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Fujian Province (Grant 2020J01876) and the Science Research Foundation for Introduced Talents, Fujian Province of China (Grants GY-Z21215, GY-Z21216).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dorri, A.; Kanhere, S.S.; Jurdak, R. Multi-agent systems: A survey. IEEE Access 2018, 6, 28573–28593.
  2. Alonso-Mora, J.; Baker, S.; Rus, D. Multi-robot formation control and object transport in dynamic environments via constrained optimization. Int. J. Robot. Res. 2017, 36, 1000–1021.
  3. Macwan, A.; Vilela, J.; Nejat, G.; Benhabib, B. A multirobot path-planning strategy for autonomous wilderness search and rescue. IEEE Trans. Cybern. 2014, 45, 1784–1797.
  4. Miyazaki, K.; Matsunaga, N.; Murata, K. Formation path learning for cooperative transportation of multiple robots using MADDPG. In Proceedings of the 2021 21st International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 12–15 October 2021; pp. 1619–1623.
  5. Wasik, A.; Pereira, J.N.; Ventura, R.; Lima, P.U.; Martinoli, A. Graph-based distributed control for adaptive multi-robot patrolling through local formation transformation. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 1721–1728.
  6. Zhang, T.J. Unmanned aerial vehicle formation inspired by bird flocking and foraging behavior. Int. J. Autom. Comput. 2018, 15, 402–416.
  7. Xiao, H.; Li, Z.; Chen, C.P. Formation control of leader–follower mobile robots' systems using model predictive control based on neural-dynamic optimization. IEEE Trans. Ind. Electron. 2016, 63, 5752–5762.
  8. Defoort, M.; Floquet, T.; Kokosy, A.; Perruquetti, W. Sliding-mode formation control for cooperative autonomous mobile robots. IEEE Trans. Ind. Electron. 2008, 55, 3944–3953.
  9. He, S.; Wang, M.; Dai, S.L.; Luo, F. Leader–follower formation control of USVs with prescribed performance and collision avoidance. IEEE Trans. Ind. Inform. 2018, 15, 572–581.
  10. Lewis, M.A.; Tan, K.H. High precision formation control of mobile robots using virtual structures. Auton. Robot. 1997, 4, 387–403.
  11. Balch, T.; Arkin, R.C. Behavior-based formation control for multirobot teams. IEEE Trans. Robot. Autom. 1998, 14, 926–939.
  12. Alonso-Mora, J.; Montijano, E.; Schwager, M.; Rus, D. Distributed multi-robot formation control among obstacles: A geometric and optimization approach with consensus. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 5356–5363.
  13. Hafez, A.; Givigi, S. Formation reconfiguration of cooperative UAVs via Learning Based Model Predictive Control in an obstacle-loaded environment. In Proceedings of the 2016 Annual IEEE Systems Conference (SysCon), Orlando, FL, USA, 18–21 April 2016; pp. 1–8.
  14. Kiumarsi, B.; Vamvoudakis, K.G.; Modares, H.; Lewis, F.L. Optimal and autonomous control using reinforcement learning: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 2042–2062.
  15. Knopp, M.; Aykın, C.; Feldmaier, J.; Shen, H. Formation control using GQ(λ) reinforcement learning. In Proceedings of the 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Lisbon, Portugal, 28–31 August 2017; pp. 1043–1048.
  16. Wen, G.; Chen, C.P.; Feng, J.; Zhou, N. Optimized multi-agent formation control based on an identifier–actor–critic reinforcement learning algorithm. IEEE Trans. Fuzzy Syst. 2017, 26, 2719–2731.
  17. Xie, J.; Zhou, R.; Liu, Y.; Luo, J.; Xie, S.; Peng, Y.; Pu, H. Reinforcement-learning-based asynchronous formation control scheme for multiple unmanned surface vehicles. Appl. Sci. 2021, 11, 546.
  18. Zhao, Y.; Ma, Y.; Hu, S. USV formation and path-following control via deep reinforcement learning with random braking. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5468–5478.
  19. Zhang, T.; Li, Y.; Li, S.; Ye, Q.; Wang, C.; Xie, G. Decentralized Circle Formation Control for Fish-like Robots in the Real-world via Reinforcement Learning. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, 30 May–5 June 2021; pp. 8814–8820.
  20. Tian, H.; Lu, Y.; Zhang, O.; Sun, G.; Wu, C.; Yao, W. Deep Reinforcement Learning Based Multiple Omnidirectional Mobile Robots Control. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021; pp. 7226–7231.
  21. Zhang, P.; Meng, P.; Yin, W.; Liu, H. A neural network method for time-dependent inverse source problem with limited-aperture data. J. Comput. Appl. Math. 2023, 421, 114842.
  22. Liu, H. On local and global structures of transmission eigenfunctions and beyond. J. Inverse Ill-Posed Probl. 2022, 30, 287–305.
  23. Gao, Y.; Liu, H.; Wang, X.; Zhang, K. On an artificial neural network for inverse scattering problems. J. Comput. Phys. 2022, 448, 110771.
  24. Yin, W.; Yang, W.; Liu, H. A neural network scheme for recovering scattering obstacles with limited phaseless far-field data. J. Comput. Phys. 2020, 417, 109594.
  25. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
  26. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  27. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor–critic methods. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
  28. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor–critic for mixed cooperative-competitive environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6379–6390.
  29. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952.
  30. Otterlo, M.v.; Wiering, M. Reinforcement learning and Markov decision processes. In Reinforcement Learning; Springer: Berlin, Germany, 2012; pp. 3–42.
  31. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
  32. Liu, H.; Zou, J. Uniqueness in an inverse acoustic obstacle scattering problem for both sound-hard and sound-soft polyhedral scatterers. Inverse Probl. 2006, 22, 515.
  33. Li, J.; Liu, H.; Zou, J. Strengthened linear sampling method with a reference ball. SIAM J. Sci. Comput. 2010, 31, 4013–4040.
  34. Li, J.; Liu, H.; Zou, J. Locating multiple multiscale acoustic scatterers. Multiscale Model. Simul. 2014, 12, 927–952.
  35. Yin, Y.; Yin, W.; Meng, P.; Liu, H. On a hybrid approach for recovering multiple obstacles. Commun. Comput. Phys. 2022, 31, 869–892.
  36. Liu, H.; Tsou, C.H. Stable determination by a single measurement, scattering bound and regularity of transmission eigenfunctions. Calc. Var. Partial Differ. Equ. 2022, 61, 91.
  37. Chow, Y.T.; Deng, Y.; He, Y.; Liu, H.; Wang, X. Surface-localized transmission eigenstates, super-resolution imaging, and pseudo surface plasmon modes. SIAM J. Imaging Sci. 2021, 14, 946–975.
  38. Blåsten, E.L.; Liu, H. Scattering by curvatures, radiationless sources, transmission eigenfunctions, and inverse scattering problems. SIAM J. Math. Anal. 2021, 53, 3801–3837.
  39. Diao, H.; Cao, X.; Liu, H. On the geometric structures of transmission eigenfunctions with a conductive boundary condition and applications. Commun. Partial Differ. Equ. 2021, 46, 630–679.
Figure 1. Multi-robot formation scene.
Figure 2. Formation generation.
Figure 3. Toggle formation navigation.
Figure 4. MATD3 algorithm framework.
Figure 5. Changing formation strategy.
Figure 6. Episode average rewards.
Figure 7. Collision rate during robot formation movement: (a) collision rate between robots (CBRS); (b) collision rate between robots and obstacles (CBRO).
Figure 8. Final distance of the leader robot from the target point (UDBLT).
Figure 9. Formation stability error (RFSE).
Figure 10. Multi-robot formation paths with different time steps: (a) Step 10; (b) Step 35; (c) Step 95; (d) Step 120.
Table 1. Detailed comparison of experimental results.

Algorithm   | CBRO (%) | UDBLT | RFSE
MADDPG      | 3.72     | 0.15  | 2.28
MATD3       | 4.29     | 0.05  | 1.09
PER-MATD3   | 2.37     | 0.08  | 0.36
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
