Article

Adaptive Distributed Control for Leader–Follower Formation Based on a Recurrent SAC Algorithm

College of Mechanical & Energy Engineering, Beijing University of Technology, Beijing 100124, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3513; https://doi.org/10.3390/electronics13173513
Submission received: 31 July 2024 / Revised: 30 August 2024 / Accepted: 3 September 2024 / Published: 4 September 2024

Abstract

This study proposes a novel adaptive distributed recurrent SAC (Soft Actor–Critic) control method to address the leader–follower formation control problem of omnidirectional mobile robots. Our method eliminates the reliance on the complete state of the leader and accomplishes the formation task using only the relative pose between robots. Moreover, we develop a novel recurrent SAC reinforcement learning framework that ensures that the controller exhibits good transient and steady-state characteristics, achieving outstanding control performance. We also present an episode-based memory replay buffer and sampling approach, along with a novel normalized reward function, which enable the recurrent SAC reinforcement learning formation framework to converge rapidly and to receive consistent rewards across various leader–follower tasks. This facilitates better learning and adaptation to the formation task requirements in different scenarios. Furthermore, to bolster the generalization capability of our method, we normalize the state space, effectively eliminating differences between formation tasks of different shapes. Leader–follower formation experiments with different shapes in the Gazebo simulator achieve excellent results, validating the efficacy of our method. Comparative experiments with traditional PID controllers and common network controllers demonstrate that our method achieves faster convergence and greater robustness. These simulation results provide strong support for our study and demonstrate the potential and reliability of our method in solving real-world problems.

1. Introduction

In the field of control and robotics, the distributed control of multiple mobile robots, especially formation control, has attracted significant attention in recent years [1,2]. Formation control is a fundamental motion coordination problem in mobile robotics [3] that aims to track group trajectories while maintaining the desired spacing between robots, defined by relative distances and angles [4]. In addition, compared to single robots, multi-robot cooperative systems offer significant improvements in efficiency, scalability, and robustness. Such systems have widespread applications in fields such as search [5], transportation [6], rescue operations [7], and surveillance [8]. To date, there has been a great deal of research on formation control problems, which can be broadly categorized into the following four main approaches: leader–follower [9,10,11,12], virtual structure [13], behaviorally based [14], and model-predictive control [15,16] strategies. The virtual structure method has high complexity and low flexibility: it requires the design of a virtual structure that all agents must strictly follow, which increases the complexity of the control algorithm. The behaviorally based method has poor predictability; it is difficult to accurately predict the behavior of the entire group, and the design and implementation of the behavior rules require precise adjustments, otherwise the agents in the formation may behave inconsistently. Model Predictive Control (MPC) has high computational costs; due to the need for frequent calculation updates, MPC may struggle to meet real-time requirements in highly dynamic environments. In contrast, the leader–follower method is more commonly used in many practical applications due to its simple structure and ease of implementation. From the perspective of obtaining agent states, formation control can be divided into displacement-based and distance-based methods [17]. For displacement-based control, the states of neighboring agents are obtained from a global coordinate system [18]. In contrast, distance-based control relies on sensors such as inertial measurement units, LiDAR, cameras, or other onboard sensors to acquire agent states in a local coordinate system [19]. Here, we do not review all formation control methods. Instead, considering simplicity of implementation and scalability in applications, we focus primarily on leader–follower formation methods.
In leader–follower formation methods, the leader robot moves along a predefined smooth trajectory in the plane, while the followers must maintain an expected pose with respect to the leader. In terms of robot movement types, differential-drive robots [20], omnidirectional mobile robots [21], and their combination [22] are the most common types of mobile robots in leader–follower control tasks. Researchers have studied the leader–follower formation control problem using various methods. Some studies have used kinematic models or single-integrator dynamics to achieve formation control. For example, Roza et al. [23] combined a bounded translational consensus controller with an attitude synchronizer to form a distance-based differential-drive robot formation control strategy. However, this method may fail to achieve the target formation under certain extreme initial conditions, and it relies on an accurate kinematic model; uncertainties in practical applications could potentially affect its performance. Furthermore, as the number of robots increases, the consumption of computational resources may also impact the system’s real-time performance. Tang et al. [24], on the other hand, used a fuzzy control algorithm to design a controller for differential-drive robots to address distance-based leader–follower formation control problems. However, the fuzzy control algorithm itself involves the tuning and optimization of multiple parameters, which may require more experiments and adjustments to ensure the stability and reliability of the algorithm under various conditions. Maghenem [25], Dasdemir [26], and Loria [27] proposed distributed formation-tracking control methods for vehicle fleets based on leader-rooted spanning-tree communication topologies. However, tree topologies exhibit poor robustness against communication failures. Additionally, Sun and Mou et al. [28] ensured that agents reached the desired formation within a finite time, while Oh et al. [29] proposed gradient-like control laws for formation control. Shi and Song et al. [30] utilized Deep Reinforcement Learning (DRL) to address nonlinear, highly coupled control problems by combining deep neural networks with approximate complex control strategies, providing a practical solution for the nonlinear dynamics of formation control. However, the generalization ability of this method in diverse environments has not been fully validated, and under suboptimal initial conditions, the convergence process exhibited significant overshoot and high steady-state error.
The existing literature thus highlights the limitations of traditional formation control methods, such as inaccuracies in environmental modeling, the extensive need for experimental expertise and parameter adjustments, and suboptimal performance in dynamically complex environments, together with issues such as overshoot and steady-state errors in learning-based algorithms. Therefore, we propose an adaptive distributed control method for leader–follower formation based on the recurrent SAC algorithm. It learns the optimal policy through interactions between the agents and the environment and does not require the manual design of models or tuning of parameters. Moreover, it can better adapt to uncertain environments and tasks to achieve more efficient formation control, which overcomes the drawbacks of traditional methods in model and parameter tuning. The overall architecture of the proposed method is shown in Figure 1.
The originality and main contributions of our work are outlined as follows:
  • An innovative reinforcement learning formation control method is proposed, which uses only the relative pose between robots and learns the optimal policy through the interaction between the agents and the environment. We achieve formation control without the need for manual model design or parameter tuning.
  • A novel recurrent SAC reinforcement learning framework is constructed, which effectively ensures excellent transient and steady-state characteristics of the controller.
  • An episode-based memory replay buffer and sampling methods, along with a novel normalized reward function, are proposed, which enable the recurrent SAC reinforcement learning formation framework to converge rapidly. This ensures consistent rewards across different tasks, facilitating better learning and adaptation to different formation task requirements in diverse scenarios.
  • By normalizing the state space, the discrepancies between formation tasks with different shapes are eliminated, thereby enhancing the generalization capabilities of our controller.
  • Extensive formation experiments with various shapes and comparative experiments with other controllers are conducted in the Gazebo simulator, and the results demonstrate the outstanding performance of our method in various tasks.
The remainder of this paper is organized as follows. The problem formulation is presented in Section 2. The proposed methods are described in detail in Section 3. Section 4 focuses on the simulation setup and evaluation process, while Section 5 presents the detailed results and analysis of leader–follower formation control experiments conducted in the Gazebo simulator. Finally, Section 6 summarizes the results of this work.

2. Problem Formulation

2.1. Leader–Follower Kinematics

A group of $N = (R_f, R_l)$ leader–follower formation configurations is illustrated in Figure 2, where $R_f$ is the follower robot and $R_l$ is the leader robot. Both robots are omnidirectional mobile robots with mecanum wheels, and they move in the same plane. The kinematics of each robot can be represented by Equation (1).

$$\begin{aligned} \dot{x}_i &= v_{ix} \cos\theta_i - v_{iy} \sin\theta_i \\ \dot{y}_i &= v_{ix} \sin\theta_i + v_{iy} \cos\theta_i \\ \dot{\theta}_i &= \omega_i \end{aligned} \quad (1)$$

where $i \in \{l, f\}$, and $r_i = [x_i, y_i]^T$ and $\theta_i$ are the position and orientation of robot $R_i$ with respect to the global coordinate system ($F_0$), respectively. $v_{ix}$, $v_{iy}$, and $\omega_i$ are the linear velocities in the x and y directions and the angular velocity of robot $R_i$ with respect to its own coordinate system ($F_i$), respectively.
The position of the leader with respect to the follower coordinate system, $r_{lf} = [x_{lf}, y_{lf}]^T$, can be calculated using Equation (2).

$$r_{lf} = R(\theta_f)(r_l - r_f) \quad (2)$$

where $R(\theta_f)$ is the rotation matrix, which can be calculated using Equation (3).

$$R(\theta_f) = \begin{bmatrix} \cos\theta_f & \sin\theta_f \\ -\sin\theta_f & \cos\theta_f \end{bmatrix} \quad (3)$$
The position of the leader relative to the follower, $r_{lf}$, is differentiated using Equation (4).

$$\dot{r}_{lf} = \begin{bmatrix} 0 & \omega_f \\ -\omega_f & 0 \end{bmatrix} r_{lf} + \begin{bmatrix} \cos\theta_{lf} & -\sin\theta_{lf} \\ \sin\theta_{lf} & \cos\theta_{lf} \end{bmatrix} v_l - v_f \quad (4)$$

where $\theta_{lf} = \theta_l - \theta_f$ is the angle of the leader robot ($R_l$) relative to the follower robot ($R_f$). Therefore, according to Equation (5), we can calculate the angular velocity of $R_l$ with respect to $R_f$.

$$\dot{\theta}_{lf} = \omega_l - \omega_f \quad (5)$$
Finally, based on Equations (1)–(5), we can derive the kinematics of the leader–follower robots as shown in Equation (6).
$$\begin{aligned} \dot{x}_{lf} &= \omega_f y_{lf} + v_{lx} \cos\theta_{lf} - v_{ly} \sin\theta_{lf} - v_{fx} \\ \dot{y}_{lf} &= -\omega_f x_{lf} + v_{lx} \sin\theta_{lf} + v_{ly} \cos\theta_{lf} - v_{fy} \\ \dot{\theta}_{lf} &= \omega_l - \omega_f \end{aligned} \quad (6)$$
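To make the relative kinematics concrete, the sketch below numerically integrates Equation (6) with a simple forward-Euler step; the function names, the integration step, and the use of NumPy are illustrative assumptions rather than part of the original implementation.

```python
import numpy as np

def relative_kinematics(x_lf, y_lf, theta_lf, v_l, omega_l, v_f, omega_f):
    """Right-hand side of Equation (6): time derivative of the leader's pose
    expressed in the follower's coordinate frame."""
    v_lx, v_ly = v_l  # leader body-frame linear velocities
    v_fx, v_fy = v_f  # follower body-frame linear velocities
    dx_lf = omega_f * y_lf + v_lx * np.cos(theta_lf) - v_ly * np.sin(theta_lf) - v_fx
    dy_lf = -omega_f * x_lf + v_lx * np.sin(theta_lf) + v_ly * np.cos(theta_lf) - v_fy
    dtheta_lf = omega_l - omega_f
    return np.array([dx_lf, dy_lf, dtheta_lf])

def euler_step(state, v_l, omega_l, v_f, omega_f, dt=0.1):
    """One forward-Euler integration step of the relative pose (x_lf, y_lf, theta_lf)."""
    x_lf, y_lf, theta_lf = state
    return state + dt * relative_kinematics(x_lf, y_lf, theta_lf, v_l, omega_l, v_f, omega_f)

# Example: leader moves straight ahead at 0.5 m/s while the follower is still
state = np.array([2.0, 0.0, 0.0])
state = euler_step(state, v_l=(0.5, 0.0), omega_l=0.0, v_f=(0.0, 0.0), omega_f=0.0)
```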

2.2. Problem Statements

This study proposes a leader–follower formation method that solely uses the pose information of the robots. Specifically, given a constant desired position ($r_{lf}^*$) and angle ($\theta_{lf}^*$) of the leader with respect to the follower, our method ensures that the relative position ($r_{lf}$) and angle ($\theta_{lf}$) converge to within an arbitrarily small neighborhood of $r_{lf}^*$ and $\theta_{lf}^*$. In other words, it satisfies Equation (7).

$$r_{lf}^* - \delta < r_{lf} < r_{lf}^* + \delta, \quad \theta_{lf}^* - \varepsilon < \theta_{lf} < \theta_{lf}^* + \varepsilon \quad (7)$$

where $\delta > 0$ and $\varepsilon > 0$ are the neighborhood radii of $r_{lf}^*$ and $\theta_{lf}^*$, respectively.
The recurrent SAC leader–follower formation method receives relative pose information of the robots and outputs velocities for the follower robot to move, as shown in Equation (8).
$$r_{lf} = f(v_l, v_f, \omega_l, \omega_f, \ldots)$$

$$(v_f, \omega_f) = g(r_{lf}, \theta_{lf}, \beta) \quad (8)$$

The function $f(v_l, v_f, \omega_l, \omega_f, \ldots)$ is provided by the Gazebo simulator, and we do not delve into it here. The function $g(r_{lf}, \theta_{lf}, \beta)$ is provided by our formation framework, with $\beta$ being the network parameters.

3. Simulation Methods

This section provides a detailed description of the proposed recurrent SAC distributed formation control method, including motion augmentation techniques, reinforcement learning settings, an episode-based memory replay buffer, reward functions, network architecture, state and action spaces, etc.

3.1. Motion Augmentation

To ensure that the formation method generalizes well across different scenarios, some simple yet effective motion augmentation techniques were used during the training process. The implementation of these techniques is described as follows.

3.1.1. Multiple Types of Motion

The motion of the leader robot is set to five different motion modes, namely straight, lateral, diagonal, rotational, and a mixed mode.

3.1.2. Various Motion Speeds

The motion speed of the leader is divided into translational speeds along the X and Y axes and a rotational speed around the Z axis. To simulate a more realistic and dynamic environment, maximum and minimum speed limits were set for each direction. During actual movement, the speeds of the leader are randomly generated within these predefined limits to simulate different motion scenarios and conditions.

3.1.3. Leader Motion Diversity

In each simulation episode, a motion mode is randomly selected from Section 3.1.1. Following this, a velocity value is randomly generated by using the method described in Section 3.1.2. This two-stage random selection process ensures that the motion of the leader is diverse and unpredictable to better simulate the dynamic changes in real-world environments.
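The two-stage random selection described above can be sketched as follows; the mode names mirror Section 3.1.1, while the speed limits and function names are illustrative assumptions rather than the exact values used in training.

```python
import random

MOTION_MODES = ["straight", "lateral", "diagonal", "rotational", "mixed"]

# Illustrative speed limits (m/s and rad/s); the actual training limits may differ.
SPEED_LIMITS = {"vx": (-0.8, 0.8), "vy": (-0.8, 0.8), "wz": (-0.5, 0.5)}

def sample_leader_motion():
    """Stage 1: pick a motion mode; Stage 2: draw speeds within the limits."""
    mode = random.choice(MOTION_MODES)
    vx = random.uniform(*SPEED_LIMITS["vx"])
    vy = random.uniform(*SPEED_LIMITS["vy"])
    wz = random.uniform(*SPEED_LIMITS["wz"])
    if mode == "straight":
        vy, wz = 0.0, 0.0
    elif mode == "lateral":
        vx, wz = 0.0, 0.0
    elif mode == "diagonal":
        wz = 0.0
    elif mode == "rotational":
        vx, vy = 0.0, 0.0
    # "mixed" keeps all three velocity components
    return mode, (vx, vy, wz)
```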

3.2. Recurrent Soft Actor–Critic Algorithm

Hausknecht et al. [31] effectively addressed the problem of partially observable Markov decision processes (POMDPs), where a complete system state is not available at every decision point, by integrating LSTM into the DQN framework. Inspired by this, we incorporated LSTM into the SAC framework to develop the recurrent soft actor–critic (RSAC) algorithm. This enhancement allows RSAC to retain information over time series, which is crucial for partially observable scenarios in leader–follower formation decision making, where only the leader’s position information is available without velocity data.
Furthermore, we applied RSAC to train agents to obtain an excellent leader–follower formation controller. The key novelties of the recurrent SAC method are the introduction of Long Short-Term Memory (LSTM) neural networks and the adoption of different training methods to better perceive the temporal information of input states.
Unlike the traditional SAC algorithm, the training approach of the recurrent SAC algorithm undergoes significant changes. Specifically, it takes the information of an entire episode as input rather than single-step information. This comprehensively accounts for the correlation between states, allowing the agent to better understand dynamic changes and long-term reward signals, and consequently facilitates more effective learning and policy optimization. During decision making, however, recurrent SAC still uses single-step states as input and obtains the temporal information between successive steps through the memory cells of the LSTM, which enables more accurate and reliable decision making.
In the initial step of each episode, the hidden state (h) and cell state (c) of the LSTM are initialized so that the network starts every new episode in a clean state, unaffected by previous episodes. This establishes a stable starting state for each episode, allows better adaptation to new environments and tasks, avoids information leakage and confusion between episodes, and enhances the stability and reliability of the algorithm. A complete description of the algorithm is shown in Algorithm 1.
Algorithm 1 Recurrent Soft Actor–Critic
  • Initialize parameters $\psi$, $\theta_1$, $\phi$
  • Set target network weights $\bar{\psi} \leftarrow \psi$, $\theta_2 \leftarrow \theta_1$
  • Initialize an empty replay buffer $\mathcal{D}$
  • Set a training episode length $L_l$
  • for each episode do
  •      Initialize hidden and cell states $h \leftarrow h_0$, $c \leftarrow c_0$
  •      Initialize target network hidden and cell states $\bar{h} \leftarrow h_0$, $\bar{c} \leftarrow c_0$
  •      for each step do
  •            $a_t \sim \pi_\phi(a_t | s_t)$
  •            $s_{t+1} \sim P(s_{t+1} | s_t, a_t)$
  •            Update hidden and cell states $h_{i+1} \leftarrow h_i$, $c_{i+1} \leftarrow c_i$
  •            Update target network hidden and cell states $\bar{h}_{i+1} \leftarrow \bar{h}_i$, $\bar{c}_{i+1} \leftarrow \bar{c}_i$
  •      end for
  •       $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_0, a_0, r(s_0, a_0), s_1), \ldots, (s_t, a_t, r(s_t, a_t), s_{t+1})\}$
  •      for each gradient step do
  •            $\psi \leftarrow \psi - \lambda_V \hat{\nabla}_\psi J_V(\psi)$
  •            $\theta_i \leftarrow \theta_i - \lambda_Q \hat{\nabla}_{\theta_i} J_Q(\theta_i)$ for $i \in \{1, 2\}$
  •            $\phi \leftarrow \phi - \lambda_\pi \hat{\nabla}_\phi J_\pi(\phi)$
  •            $\bar{\psi} \leftarrow \tau \psi + (1 - \tau) \bar{\psi}$
  •      end for
  • end for
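As a minimal illustration of how the LSTM memory is carried across steps during decision making (single-step inputs, hidden and cell states reset at the start of every episode), a PyTorch sketch of a recurrent Gaussian actor is shown below; the layer sizes, dimensions, and class name are assumptions, not the exact network used in this paper.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Gaussian policy whose observations pass through an LSTM so that
    single-step inputs are enriched with temporal context (hypothetical sizes)."""
    def __init__(self, state_dim=3, action_dim=3, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, state_seq, hidden=None):
        # state_seq: (batch, seq_len, state_dim); hidden carries (h, c) across calls
        features, hidden = self.lstm(state_seq, hidden)
        mean = self.mean(features)
        log_std = self.log_std(features).clamp(-20, 2)
        return mean, log_std, hidden

actor = RecurrentActor()
hidden = None  # (h, c) reset to zero at the start of every episode
for _ in range(10):                 # decision making with single-step inputs
    obs = torch.randn(1, 1, 3)      # placeholder observation (batch=1, seq_len=1)
    with torch.no_grad():
        mean, log_std, hidden = actor(obs, hidden)
        action = torch.tanh(mean + log_std.exp() * torch.randn_like(mean))
```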

3.3. Episode-Based Memory Replay Buffer

Drawing inspiration from Hausknecht’s work [31] on a memory replay buffer for DRQN, we introduce a novel episode-based memory replay buffer specifically designed for recurrent SAC, as well as creative storage and sampling approaches.

3.3.1. Storage Approach

Diverging from conventional step-wise storage, we use an episode-based storage approach. First, the data of each step (state, action, reward, next state, and terminal flag) are defined as a tuple. Then, each experience in our memory replay buffer is defined as the set of all tuples within a complete episode.

3.3.2. Sampling Approach

We use a random sampling strategy, specifically selecting N experiences randomly from the memory replay buffer. For each experience, we randomly choose a time point and feed the sequence of fixed length starting from that time point into the LSTM network in a single batch. At the start of each training iteration, we reset the hidden state of the LSTM to zero to ensure the independence of each sequence.
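A minimal sketch of the episode-based storage and fixed-length sequence sampling described above is given here; the buffer capacity, sequence length, and class name are illustrative assumptions.

```python
import random
from collections import deque

class EpisodeReplayBuffer:
    """Stores whole episodes; samples fixed-length sub-sequences for the LSTM."""
    def __init__(self, capacity=1000, seq_len=16):
        self.episodes = deque(maxlen=capacity)
        self.seq_len = seq_len

    def store_episode(self, episode):
        # episode: list of (state, action, reward, next_state, done) tuples
        self.episodes.append(episode)

    def sample(self, batch_size):
        batch = []
        n = min(batch_size, len(self.episodes))
        for episode in random.sample(list(self.episodes), n):
            # pick a random start point, then take a fixed-length window
            start = random.randint(0, max(0, len(episode) - self.seq_len))
            batch.append(episode[start:start + self.seq_len])
        return batch  # hidden states are re-initialized to zero before training
```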

3.4. State Space

The pose of the leader robot with respect to the follower robot is set as the state space. To enable the agent to learn and optimize its policy without relying on a fixed leader–follower pose, the state space is mapped according to Equation (9).
$$State = \left[ \frac{\theta_{lf} - \theta_e}{\theta_{max} - \theta_{min}}, \; \frac{x_{lf} - x_e}{x_{max} - x_{min}}, \; \frac{y_{lf} - y_e}{y_{max} - y_{min}} \right] \quad (9)$$

where $\theta_{lf}$, $x_{lf}$, and $y_{lf}$ are the angle, x coordinate, and y coordinate of the leader relative to the follower, respectively. Similarly, $\theta_e$, $x_e$, and $y_e$ are the desired angle, x coordinate, and y coordinate of the leader relative to the follower, respectively. $\theta_{max}$, $x_{max}$, and $y_{max}$ are the maximum angle, x coordinate, and y coordinate of the leader relative to the follower, respectively, while $\theta_{min}$, $x_{min}$, and $y_{min}$ are the minimum angle, x coordinate, and y coordinate, respectively.
The mapping of states standardizes the angle and position information, enabling the agent to better understand its pose relative to the leader and reducing reliance on absolute position.
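A minimal sketch of the state mapping in Equation (9) follows; the bounds, desired pose, and function names are hypothetical placeholders.

```python
def normalized_state(theta_lf, x_lf, y_lf, desired, bounds):
    """Map the relative pose into the normalized state of Equation (9).
    desired = (theta_e, x_e, y_e); bounds = {'theta': (min, max), 'x': ..., 'y': ...}."""
    theta_e, x_e, y_e = desired
    s_theta = (theta_lf - theta_e) / (bounds["theta"][1] - bounds["theta"][0])
    s_x = (x_lf - x_e) / (bounds["x"][1] - bounds["x"][0])
    s_y = (y_lf - y_e) / (bounds["y"][1] - bounds["y"][0])
    return (s_theta, s_x, s_y)

# Example with hypothetical bounds and a desired pose 2 m directly behind the leader
state = normalized_state(0.1, 1.8, 0.0,
                         desired=(0.0, 2.0, 0.0),
                         bounds={"theta": (-3.14, 3.14), "x": (0.0, 4.0), "y": (-2.0, 2.0)})
```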

3.5. Action Space

In this study, we define a new action space that consists of the movements of the follower in the following three key directions: linear velocity along the x and y axes and angular velocity along the z axis. Specifically, we meticulously designed a mapping method to convert these velocities into values within the range of $[-1, 1]$. This mapping is shown in Equation (10).
$$action = (v_{xn}, v_{yn}, \omega_{zn})$$

$$v_{xn} = \frac{2(v_x - v_{xmin})}{v_{xmax} - v_{xmin}} - 1, \quad v_{yn} = \frac{2(v_y - v_{ymin})}{v_{ymax} - v_{ymin}} - 1, \quad \omega_{zn} = \frac{2(\omega_z - \omega_{zmin})}{\omega_{zmax} - \omega_{zmin}} - 1 \quad (10)$$

where $v_{xn}$, $v_{yn}$, and $\omega_{zn}$ represent the normalized values of the robot’s linear velocity in the x and y directions and angular velocity in the z direction, respectively. $action$ is the action space, and $v_x$, $v_y$, and $\omega_z$ are the linear velocities of the follower robot along the x and y axes and the angular velocity along the z axis, respectively. $v_{xmin}$, $v_{ymin}$, and $\omega_{zmin}$ are the minimum linear velocities of the follower along the x and y axes and its angular velocity along the z axis, respectively. $v_{xmax}$, $v_{ymax}$, and $\omega_{zmax}$ are the maximum linear velocities of the follower along the x and y axes and its angular velocity along the z axis, respectively.
This design standardizes the representation of the follower’s motion state and makes its behavior easier to control, thereby enhancing the flexibility and adaptability of the overall system. Additionally, since all action values are mapped into a fixed range, optimization of the algorithm becomes easier and the control problem is simplified, which helps regulate the behavior of the follower in different scenarios. Overall, this carefully designed action space allows the follower to adapt more easily to varying scenarios.
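A minimal sketch of the action mapping in Equation (10) and its inverse, which would be applied before sending commands to the robot, is shown below; the velocity limits and function names are illustrative assumptions.

```python
def normalize(v, v_min, v_max):
    """Equation (10): map a velocity into [-1, 1]."""
    return 2.0 * (v - v_min) / (v_max - v_min) - 1.0

def denormalize(a, v_min, v_max):
    """Inverse mapping used when sending the policy output to the robot."""
    return (a + 1.0) / 2.0 * (v_max - v_min) + v_min

# Hypothetical velocity limits for the follower (m/s and rad/s)
LIMITS = {"vx": (-1.5, 1.5), "vy": (-1.5, 1.5), "wz": (-1.0, 1.0)}

def to_action(vx, vy, wz):
    """Build the normalized action tuple (v_xn, v_yn, w_zn)."""
    return (normalize(vx, *LIMITS["vx"]),
            normalize(vy, *LIMITS["vy"]),
            normalize(wz, *LIMITS["wz"]))
```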

3.6. Reward Function

The reward signal determines the behavior of the agent and plays a crucial role in reinforcement learning. The reward function should provide effective feedback to evaluate the action value of the agent. Additionally, it is necessary to inform the agent about the quality of its behavior and the desired actions for the task in reinforcement learning. Therefore, the reward function designed in this study adheres to the following three principles:
  • It encourages the follower to maintain a specific distance and angle relative to the leader.
  • When the pose between the follower and the leader is close to the expected pose, the reward value is greater.
  • The changes in reward values should effectively reflect the changes in the state after performing an action.
Based on the above principles, the reward function proposed in this study is expressed as shown in Equation (11).
$$r = A - \frac{1}{N}\left( \frac{|\theta_{lf} - \theta_e|}{\Delta\theta_{max}} + \frac{|x_{lf} - x_e|}{\Delta x_{max}} + \frac{|y_{lf} - y_e|}{\Delta y_{max}} \right) \quad (11)$$

The reward function consists of two parts. The first part is the shaping signal of the reward ($A > 0$). The second part is the original reward, $or = \frac{1}{N}\left( \frac{|\theta_{lf} - \theta_e|}{\Delta\theta_{max}} + \frac{|x_{lf} - x_e|}{\Delta x_{max}} + \frac{|y_{lf} - y_e|}{\Delta y_{max}} \right) \in [0, 1]$, where $N$ is the averaging factor of the reward. The variables include the angle, x coordinate, and y coordinate; therefore, $N$ is set to 3. $\Delta\theta_{max}$, $\Delta x_{max}$, and $\Delta y_{max}$ are the normalization factors of the angle, x coordinate, and y coordinate, respectively, calculated according to Equation (12).

$$\Delta i_{max} = \max(i_{max} - i_e, \; i_e - i_{min}) \quad (12)$$
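A minimal sketch of the reward computation in Equations (11) and (12) follows; the bounds, default shaping value A, and function names are illustrative assumptions, while N = 3 matches the averaging factor discussed above.

```python
def normalization_factor(i_min, i_max, i_e):
    """Equation (12): the largest possible deviation from the desired value i_e."""
    return max(i_max - i_e, i_e - i_min)

def reward(theta_lf, x_lf, y_lf, desired, bounds, A=1.0, N=3):
    """Equation (11): shaped reward; the original-reward term lies in [0, 1]."""
    theta_e, x_e, y_e = desired
    d_theta = normalization_factor(bounds["theta"][0], bounds["theta"][1], theta_e)
    d_x = normalization_factor(bounds["x"][0], bounds["x"][1], x_e)
    d_y = normalization_factor(bounds["y"][0], bounds["y"][1], y_e)
    original = (abs(theta_lf - theta_e) / d_theta
                + abs(x_lf - x_e) / d_x
                + abs(y_lf - y_e) / d_y) / N
    return A - original

# Example with hypothetical bounds and a desired pose 2 m directly behind the leader
r = reward(0.1, 1.8, 0.0,
           desired=(0.0, 2.0, 0.0),
           bounds={"theta": (-3.14, 3.14), "x": (0.0, 4.0), "y": (-2.0, 2.0)})
```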

3.7. Network Architecture

The network structure of the adaptive distributed recurrent SAC formation framework consists of two parts, namely a temporal feature extractor and a decision maker, as illustrated in Figure 3.

3.7.1. Temporal Feature Extractor

The temporal feature extractor consists of two independent LSTM recurrent neural networks, which take encoded pose information as input. These inputs pass through input gates, forget gates, and output gates to extract crucial temporal features from the leader–follower formation task.

3.7.2. Decision Maker

The decision maker consists of the following two parts: an actor network and a critic network (the value network and the soft Q-network, respectively), both constructed by fully connected layers. The actor network takes the output of the first LSTM recurrent neural network from the temporal feature extractor as input, while the critic network takes the concatenation of the output of the second LSTM recurrent neural network from the temporal feature extractor and the encoded actions as input.
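A minimal PyTorch sketch of the critic branch described above, in which a dedicated LSTM extracts temporal features that are concatenated with the encoded action before the fully connected Q-head, is shown below; the layer sizes and class name are assumptions rather than the exact architecture used here.

```python
import torch
import torch.nn as nn

class RecurrentSoftQ(nn.Module):
    """Critic branch: an LSTM extracts temporal features from the pose sequence,
    which are concatenated with the action before the fully connected Q-head."""
    def __init__(self, state_dim=3, action_dim=3, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Sequential(
            nn.Linear(hidden_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state_seq, action_seq, hidden=None):
        # state_seq: (batch, seq_len, state_dim); action_seq: (batch, seq_len, action_dim)
        features, hidden = self.lstm(state_seq, hidden)
        q = self.q_head(torch.cat([features, action_seq], dim=-1))
        return q, hidden
```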

4. Simulation Configurations

4.1. Simulation Environment

We use Gazebo as a simulator to simulate highly realistic physical environments. The physical properties of the environment, the kinematics, and the dynamics of the robots involved in the experiments are all provided by Gazebo. For specific details, refer to [32].
To enable interaction between the robots and the reinforcement learning environment, we first design the URDF models of the robots and the environment model. Then, we use the Python API to achieve interaction between the multi-robot system and the environment. Finally, PyTorch is used to construct the recurrent SAC leader–follower formation network and the reinforcement learning algorithm. The details of the configuration are shown in Figure 4.
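As an illustration of this interaction loop, the follower’s commanded velocities could be published to the Gazebo simulation through the standard ROS Python API; the node and topic names below are assumptions that depend on the actual URDF and launch configuration.

```python
import rospy
from geometry_msgs.msg import Twist

# Requires a running ROS master and a Gazebo robot subscribed to this topic.
rospy.init_node("rsac_follower_controller")
cmd_pub = rospy.Publisher("/follower/cmd_vel", Twist, queue_size=1)

def publish_action(vx, vy, wz):
    """Send the de-normalized policy output to the simulated follower robot."""
    msg = Twist()
    msg.linear.x, msg.linear.y = vx, vy
    msg.angular.z = wz
    cmd_pub.publish(msg)
```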

4.2. Computational Details

This study involved simulation validations on the ROS Noetic platform. The simulation setup included devices equipped with an Intel Core i9-14900K processor (3.20 GHz), 32 GB of RAM, and a GeForce RTX 4090 graphics card. The simulation involved two omnidirectional mobile robots with mecanum wheels, with a data transmission rate between them of 10 Hz. Our developed control algorithm processed and transmitted position information through a fully connected network. The results show that the agent’s per-frame processing time (including the processing of observation data and the outputting of actions) was 3.2 milliseconds, with memory usage of about 2.7 GB and CPU utilization of about 41.6%. Under these conditions, our method not only ensures control accuracy but also achieves stable real-time performance on the ROS Noetic platform.

4.3. Simulation Setups

We trained the model using the method described in Section 3 to obtain the recurrent SAC leader–follower formation controller. Subsequently, we conducted two sets of experiments using this controller—one for V-shaped formation and another for random motion evaluation.

4.3.1. Baseline Configuration

Baseline 1 utilizes three sets of PID controllers, each tailored to manage the robot’s linear velocity in the x and y directions and angular velocity in the z direction. The PID parameters for each axis are configured as follows:
  • X-axis linear velocity: $k_p = 0.7$, $k_i = 0.07$, $k_d = 0.01$;
  • Y-axis linear velocity: $k_p = 0.7$, $k_i = 0.07$, $k_d = 0.01$;
  • Z-axis angular velocity: $k_p = 0.6$, $k_i = 0.04$, $k_d = 0.01$.
These settings were refined through extensive tuning and optimization to achieve a swift control response, minimize overshoot, and ensure the stability and reliability of the control system.
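For reference, a single axis of this PID baseline can be sketched as follows, using the gains listed above; the sampling period and class structure are assumptions.

```python
class PID:
    """Discrete PID controller for a single velocity axis."""
    def __init__(self, kp, ki, kd, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        """Compute the control output for the current tracking error."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Gains from the baseline configuration above
pid_x = PID(0.7, 0.07, 0.01)   # x-axis linear velocity
pid_y = PID(0.7, 0.07, 0.01)   # y-axis linear velocity
pid_z = PID(0.6, 0.04, 0.01)   # z-axis angular velocity
```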
According to [30], baseline 2 was configured as a common network controller without LSTM. This controller consists of four fully connected layers for feature extraction and two decision-making layers that output the mean and variance, respectively. Actions are generated by sampling from the normal distribution formed by these outputs. The controller was optimized using the Soft Actor–Critic (SAC) algorithm and trained with the reward function designed by us, as shown in Equation (11). The inputs and outputs of this controller are consistent with those of our proposed controller, ensuring a fair and effective comparison. The hyperparameters during the SAC training process are set as follows: the learning rates for the actor network, the critic network, and the temperature parameter are all set to $3 \times 10^{-4}$; the discount factor is set to 0.99; the soft update parameter is set to 0.005; and the batch size is set to 64.

4.3.2. V-Shaped Formation

Five robots are used for the V-shaped formation evaluation. In this configuration, R1 acts as the leader for R2 and R3, while R4 and R5 are followers of R2 and R3, respectively, as illustrated in Figure 5. The desired distances and angles are set as follows: $r_{12} = r_{13} = r_{24} = r_{35} = 2$ and $\theta_{12} = \theta_{13} = \theta_{24} = \theta_{35} = 0$.

4.3.3. Random Motion

Two robots are used to perform the random motion formation evaluation—one acting as the leader and the other as the follower. The leader follows a pattern of random motion, as described in Section 3.1. The maximum velocities for the leader are set to 0.4 m/s, …, 1.5 m/s. The follower maintains a specific pose relative to the leader, and the relative distance and angle are recorded. Additionally, we compare our proposed recurrent SAC leader–follower formation controller with a conventional network-based formation controller and a PID controller.

4.4. Assumptions and Limitations of Our Simulation

4.4.1. Assumptions

  • The robots used in this study are equipped with mecanum wheels, allowing for smooth movement and rotation in any direction.
  • The operating environment is a flat factory floor suitable for the movement of omnidirectional robots.
  • The communication delay between robots does not exceed 0.1 s to ensure the timeliness of information transfer, maintaining continuous control and responsive performance.
  • The positioning error of the robots does not exceed 0.01 m, providing sufficient precision for accurate formation control.

4.4.2. Limitations

  • Although our method performs well in simulations, its performance may depend on specific hardware configurations, such as processing speed and sensor precision, which could limit its application on low-cost or low-performance hardware.
  • While we assume a flat surface, the existing methods may need adjustments or improvements in more complex or irregular terrain.
  • Our study assumes shorter communication delays, but in practical applications, longer delays or unstable communications may affect system performance.

5. Simulation Results and Analysis

To compare the performance of the proposed recurrent SAC formation control method in leader–follower formation tasks, we conducted the two evaluations described in Section 4.3 in the Gazebo simulation environment.

5.1. Training Results

We trained the follower agents using DDPG, TD3, pure SAC, and our proposed algorithms. The training results, as shown in Figure 6, clearly demonstrate that our method performed the best in the mobile robot formation training process. Our method not only converged to a higher reward value but also exhibited less fluctuation during training, showcasing better stability.

5.2. V-Shaped Formation

We conduct evaluations following the setup described in Section 4.3.2, and the results are depicted in Figure 7. Figure 7a illustrates the trajectories of the robots on the plane. It can be observed that the desired V-shaped formation is initially achieved around t = 2 s and reaches a stable state around t = 4 s. The robot snapshots in Figure 7d–g also show that the robots achieve a stable formation at t = 4 s. Figure 7b shows that, despite the poor initial positions of the leader and followers, the distance errors of all followers converge to zero rapidly. However, as shown in Figure 7c, significant fluctuations in the relative angles between the leader and followers occur before t = 2 s, which may be attributed to the robots rapidly adjusting their angles to achieve a stable distance formation. Nevertheless, after t = 2 s, these fluctuations quickly subside, with the angles ultimately converging to values close to zero by t = 6 s. Similarly, the snapshots in Figure 7e,g show that at t = 2 s, the formation is initially formed but still exhibits significant angle errors, whereas by t = 6 s, these errors are mostly eliminated.

5.3. Random Motion Formation

We conducted random motion evaluation according to the setup described in Section 4.3.3. The results for the relative distances and angles between the leader and follower robots are illustrated in Figure 8 and Figure 9, respectively.
From Figure 8, it is evident that when our proposed method is employed as the controller, the distance error and fluctuation between the leader and follower during random motion remain consistently low across different maximum velocities and are significantly lower than those of the PID controller and the common network controller without LSTM. It is particularly noteworthy that our method maintains small and stable distance errors regardless of whether the leader is set to a smaller or larger velocity. In contrast, as observed in Figure 8e–l, when the other methods are used as the controller and the leader’s maximum velocity exceeds 0.8 m/s, the distance error rapidly grows and exhibits significant fluctuations.
Figure 9 shows that, in terms of angle control, when our proposed method is used as the controller, the angle error and fluctuation are significantly smaller than those of the other two methods when the leader’s maximum velocity is greater than or equal to 0.8 m/s. When the leader’s velocity is less than 0.8 m/s, although our method is less effective in angle control than the traditional PID controller, it still outperforms the common network controller without LSTM.

5.4. Challenges and Solutions in Real-World Robotic Systems

5.4.1. Hardware Constraints

  • Computing power: The actual computing power of the hardware may limit the execution efficiency of the positioning and formation control algorithms. To address this, we can run the algorithms on high-performance hardware or optimize the computational steps.
  • Sensor accuracy: The reliability of sensors is crucial for map building and localization in the SLAM system. We will use high-accuracy sensors and implement error correction techniques to enhance the precision of location data.

5.4.2. Environmental Uncertainty

  • Communication delay and signal interference: We will implement redundant data transmission mechanisms and optimize network protocols to minimize the impacts of latency. Furthermore, we will utilize advanced signal processing technologies, select frequencies less susceptible to interference, and employ signal enhancement and error correction techniques to ensure high-quality communication.
  • Dynamic environment: Dynamic obstacles and variable environmental conditions can disrupt the stability of formation control. We will use real-time sensor data to dynamically adjust control strategies, ensuring adaptability to environmental fluctuations.
  • SLAM error: Environmental variations and sensor noise may induce errors in SLAM. We will employ data fusion techniques to mitigate these errors and introduce error compensation mechanisms in formation control to bolster the system’s overall robustness.

6. Conclusions

In this study, we present a novel adaptive distributed recurrent SAC formation control method designed to tackle the leader–follower formation control challenge of omnidirectional mobile robots. Additionally, we introduce episode-based memory replay buffers and sampling methods, along with a novel normalized reward function. These enhancements effectively expedite the convergence speed of the recurrent SAC reinforcement learning formation framework and enhance its generalization capability across diverse scenarios.
In terms of experimentation, we extensively tested our method in the Gazebo simulator to validate its efficacy. In the V-shaped formation experiment, we observed that our method rapidly and steadily achieves the desired formation shape, and the distance and angle errors between the leader and followers converge to zero quickly. In the random motion experiment, we further affirmed the robustness of our method. It effectively maintained the desired distance and angle between followers and the leader across various maximum velocities. Comparatively, our method outperforms traditional PID and common network controllers without LSTM, showcasing superior robustness and generalization performance.
We note that the current method does not sufficiently consider issues of excessive communication delays and communication instability. Therefore, in our future research, we plan to further validate and optimize the practicality and effectiveness of this method during actual deployment and in environments with unstable communication.

Author Contributions

Conceptualization, H.L.; methodology, M.L.; software, M.L.; validation, M.L.; investigation, F.X. and H.H.; resources, H.L.; writing, M.L. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (grant 2021YFB1716200) and the Research Funds for Leading Talents Program (052000514124510).

Data Availability Statement

The datasets used in this study are available from the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bullo, F.; Cortés, J.; Martinez, S. Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms; Princeton University Press: Princeton, NJ, USA, 2009; Volume 27. [Google Scholar]
  2. Mesbahi, M. Graph Theoretic Methods in Multiagent Networks; Princeton University Press: Princeton, NJ, USA, 2010; pp. 1–424. [Google Scholar]
  3. Kagan, E.; Shvalb, N.; Ben-Gal, I. Autonomous Mobile Robots and Multi-Robot Systems: Motion-Planning, Communication, and Swarming; John Wiley & Sons: Hoboken, NJ, USA, 2019. [Google Scholar]
  4. Kamel, M.A.; Yu, X.; Zhang, Y. Formation control and coordination of multiple unmanned ground vehicles in normal and faulty situations: A review. Annu. Rev. Control 2020, 49, 128–144. [Google Scholar] [CrossRef]
  5. Zhao, W.; Meng, Q.; Chung, P.W. A heuristic distributed task allocation method for multivehicle multitask problems and its application to search and rescue scenario. IEEE Trans. Cybern. 2015, 46, 902–915. [Google Scholar] [CrossRef] [PubMed]
  6. Farrugia, J.L.; Fabri, S.G. Swarm robotics for object transportation. In Proceedings of the 2018 UKACC 12th International Conference on Control (CONTROL), Sheffield, UK, 5–7 September 2018; pp. 353–358. [Google Scholar]
  7. Queralta, J.P.; Taipalmaa, J.; Pullinen, B.C.; Sarker, V.K.; Gia, T.N.; Tenhunen, H.; Gabbouj, M.; Raitoharju, J.; Westerlund, T. Collaborative multi-robot search and rescue: Planning, coordination, perception, and active vision. IEEE Access 2020, 8, 191617–191643. [Google Scholar] [CrossRef]
  8. Trevai, C.; Ota, J.; Arai, T. Multiple mobile robot surveillance in unknown environments. Adv. Robot. 2007, 21, 729–749. [Google Scholar] [CrossRef]
  9. Yu, X.; Liu, L. Distributed formation control of nonholonomic vehicles subject to velocity constraints. IEEE Trans. Ind. Electron. 2015, 63, 1289–1298. [Google Scholar] [CrossRef]
  10. Miao, Z.; Liu, Y.H.; Wang, Y.; Yi, G.; Fierro, R. Distributed estimation and control for leader-following formations of nonholonomic mobile robots. IEEE Trans. Autom. Sci. Eng. 2018, 15, 1946–1954. [Google Scholar] [CrossRef]
  11. Lin, J.; Miao, Z.; Zhong, H.; Peng, W.; Wang, Y.; Fierro, R. Adaptive image-based leader–follower formation control of mobile robots with visibility constraints. IEEE Trans. Ind. Electron. 2020, 68, 6010–6019. [Google Scholar] [CrossRef]
  12. Ramírez-Neria, M.; González-Sierra, J.; Madonski, R.; Ramírez-Juárez, R.; Hernandez-Martinez, E.G.; Fernández-Anaya, G. Leader–follower formation and disturbance rejection control for omnidirectional mobile robots. Robotics 2023, 12, 122. [Google Scholar] [CrossRef]
  13. Rezaee, H.; Abdollahi, F. A decentralized cooperative control scheme with obstacle avoidance for a team of mobile robots. IEEE Trans. Ind. Electron. 2013, 61, 347–354. [Google Scholar] [CrossRef]
  14. Arrichiello, F.; Chiaverini, S.; Indiveri, G.; Pedone, P. The null-space-based behavioral control for mobile robots with velocity actuator saturations. Int. J. Robot. Res. 2010, 29, 1317–1337. [Google Scholar] [CrossRef]
  15. Xiao, H.; Li, Z.; Chen, C.P. Formation control of leader–follower mobile robots’ systems using model predictive control based on neural-dynamic optimization. IEEE Trans. Ind. Electron. 2016, 63, 5752–5762. [Google Scholar] [CrossRef]
  16. Xiao, H.; Chen, C.P. Incremental updating multirobot formation using nonlinear model predictive control method with general projection neural network. IEEE Trans. Ind. Electron. 2018, 66, 4502–4512. [Google Scholar] [CrossRef]
  17. Oh, K.K.; Park, M.C.; Ahn, H.S. A survey of multi-agent formation control. Automatica 2015, 53, 424–440. [Google Scholar] [CrossRef]
  18. Wang, W.; Huang, J.; Wen, C.; Fan, H. Distributed adaptive control for consensus tracking with application to formation control of nonholonomic mobile robots. Automatica 2014, 50, 1254–1263. [Google Scholar] [CrossRef]
  19. Zou, Y.; Wen, C.; Guan, M. Distributed adaptive control for distance-based formation and flocking control of multi-agent systems. IET Control Theory Appl. 2019, 13, 878–885. [Google Scholar] [CrossRef]
  20. Yan, L.; Ma, B. Practical formation tracking control of multiple unicycle robots. IEEE Access 2019, 7, 113417–113426. [Google Scholar] [CrossRef]
  21. Taheri, H.; Zhao, C.X. Omnidirectional mobile robots, mechanisms and navigation approaches. Mech. Mach. Theory 2020, 153, 103958. [Google Scholar] [CrossRef]
  22. Paniagua-Contro, P.; Hernandez-Martinez, E.G.; González-Medina, O.; González-Sierra, J.; Flores-Godoy, J.J.; Ferreira-Vazquez, E.D.; Fernandez-Anaya, G. Extension of leader-follower behaviours for wheeled mobile robots in multirobot coordination. Math. Probl. Eng. 2019, 2019, 4957259. [Google Scholar] [CrossRef]
  23. Roza, A.; Maggiore, M.; Scardovi, L. A smooth distributed feedback for formation control of unicycles. IEEE Trans. Autom. Control 2019, 64, 4998–5011. [Google Scholar] [CrossRef]
  24. Tang, X.; Ji, Y.; Gao, F.; Zhao, C. Research on multi-robot formation controlling method. In Proceedings of the Third International Conference on Cyberspace Technology (CCT 2015), Beijing, China, 17–18 October 2015; pp. 1–3. [Google Scholar]
  25. Maghenem, M.; Loria, A.; Panteley, E. Lyapunov-based formation-tracking control of nonholonomic systems under persistency of excitation. IFAC-PapersOnLine 2016, 49, 404–409. [Google Scholar] [CrossRef]
  26. Dasdemir, J.; Loría, A. Robust formation tracking control of mobile robots via one-to-one time-varying communication. Int. J. Control. 2014, 87, 1822–1832. [Google Scholar] [CrossRef]
  27. Loria, A.; Dasdemir, J.; Jarquin, N.A. Leader–follower formation and tracking control of mobile robots along straight paths. IEEE Trans. Control. Syst. Technol. 2015, 24, 727–732. [Google Scholar] [CrossRef]
  28. Sun, Z.; Mou, S.; Deghat, M.; Anderson, B.D. Finite time distributed distance-constrained shape stabilization and flocking control for d-dimensional undirected rigid formations. Int. J. Robust Nonlinear Control 2016, 26, 2824–2844. [Google Scholar] [CrossRef]
  29. Oh, K.K.; Ahn, H.S. Distance-based undirected formations of single-integrator and double-integrator modeled agents in n-dimensional space. Int. J. Robust Nonlinear Control 2014, 24, 1809–1820. [Google Scholar] [CrossRef]
  30. Shi, Y.; Song, J.; Hua, Y.; Yu, J.; Dong, X.; Ren, Z. Leader-Follower Formation Control for Fixed-Wing UAVs using Deep Reinforcement Learning. In Proceedings of the 2022 41st Chinese Control Conference (CCC), Hefei, China, 25–27 July 2022; pp. 3456–3461. [Google Scholar]
  31. Hausknecht, M.; Stone, P. Deep recurrent q-learning for partially observable mdps. In Proceedings of the 2015 AAAI Fall Symposium Series, Arlington, Virginia, 12–14 November 2015. [Google Scholar]
  32. Koenig, N.; Howard, A. Design and use paradigms for gazebo, an open-source multi-robot simulator. In Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No. 04CH37566), Sendai, Japan, 28 September–2 October 2004; Volume 3, pp. 2149–2154. [Google Scholar]
Figure 1. The framework of our proposed method.
Figure 2. The leader–follower formation setup.
Figure 3. The network architecture of the proposed method.
Figure 4. The simulation layout.
Figure 5. The leader–follower topology of the V-shaped formation.
Figure 6. Reward curves during training with different algorithms.
Figure 7. The V-shaped formation results. (a) Trajectories of all robots on a plane; (b,c) relative distances and angles between leader and follower robots, respectively; (d–g) snapshots of robots at t = 0 s, t = 2 s, t = 4 s, and t = 6 s, respectively.
Figure 8. The formation results concerning the distance from random motions. Baseline1 is the common network controller without LSTM; Baseline2 is the PID controller. (a–l) Distance curves when the leader is set to maximum speeds of 0.4 m/s, 0.5 m/s, …, and 1.5 m/s, respectively. (m) The means and standard deviations of distance errors.
Figure 9. The formation results concerning the angle of random motions. Baseline1 is the common network controller without LSTM; Baseline2 is the PID controller. (a–l) Angle curves when the leader is set to maximum speeds of 0.4 m/s, 0.5 m/s, …, and 1.5 m/s, respectively. (m) The means and standard deviations of angle errors.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
