In this section, to realize the scalability of the UGV formation, general MDP models are designed for the leader and for the UGVs with follower attributes according to their different tasks, including the state space, action space, and reward function. In addition, the distributed training and execution strategy based on the deep deterministic policy gradient (DDPG) is explained.
3.3. Reward Functions
The reward function, as the environment's feedback to the agent's behavior, plays a crucial role in reinforcement learning. In this work, a set of subreward functions was designed for the distributed leader and the other UGVs to achieve the goal of a cooperative formation while satisfying the obstacle-avoidance requirement. The total reward was composed of all subrewards.
For the leader UGV, its main purpose was to ensure its safety while reaching the target point. The total reward function of the leader consisted of three subreward functions:
Arrival reward function ($r_{\mathrm{arr}}$): this reward function was designed to encourage the UGV to reach the target, and a large positive value was assigned to $r_{\mathrm{arr}}$ as a reward only when the UGV reached the target position.
Distance reward function ($r_{\mathrm{dis}}$): to avoid slow learning or even nonconvergence due to sparse rewards, the distance reward function was designed to provide dense rewards for the leader's exploration, so as to guide the leader towards the target position. The reward was calculated as:

$$ r_{\mathrm{dis}} = c_1 \left( \left\| p_t - p_g \right\| - \left\| p_{t+1} - p_g \right\| \right) $$

where $p_t$ and $p_{t+1}$ are the coordinate positions of the UGV at time $t$ and time $t+1$, respectively, $p_g$ is the coordinate position of the target point, and $c_1$ is a constant used to adjust the importance of the reward. Therefore, when the distance between the UGV and the target was shortened, a positive value was assigned to $r_{\mathrm{dis}}$ as a reward; otherwise, the behavior of moving away from the target was punished with a negative value.
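As a concrete illustration, the following is a minimal Python sketch of this distance reward; the function name, the default coefficient value, and the use of NumPy are our own assumptions, not specifications from the paper.

```python
import numpy as np

def distance_reward(p_t, p_next, p_goal, c1=1.0):
    """Dense reward: positive when the UGV moves closer to the goal.

    p_t, p_next, p_goal: 2D positions at time t, time t+1, and the target.
    c1: importance coefficient (assumed default; tuned per task).
    """
    d_t = np.linalg.norm(np.asarray(p_t) - np.asarray(p_goal))
    d_next = np.linalg.norm(np.asarray(p_next) - np.asarray(p_goal))
    return c1 * (d_t - d_next)  # > 0 if the distance shrank, < 0 otherwise
```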
Safety protection reward function ($r_{\mathrm{safe}}$): safety is one of the core requirements in path planning: the planned path must be free of obstacles, and the UGV is further expected to maintain a certain distance from them to obtain a safety margin and avoid collision accidents.
First, consider the handling of collision events. In the environment of continuous space and discrete time, it is necessary to give corresponding rewards as feedback for three situations generated by the interaction between the UGV and obstacles:
No collision: if there is no collision with obstacles in the next state, no penalty is given.
Collision: if the UGV collides with an obstacle in the next state, it is given a large negative reward as a punishment, guiding it to reduce the selection of similar actions in this state.
Collision along the route: compared with the continuous real world, discrete time introduces a special situation: the next state of the UGV does not collide with the obstacle, but the line segment connecting the two states overlaps the obstacle. Checking for such collisions is easily overlooked but necessary, and this case was also penalized as a collision along the route.
In addition, to obtain safety redundancy, the UGV had to avoid getting too close to an obstacle, reducing the possibility of collision. Assuming that $d_{\mathrm{safe}}$ was the minimum safe distance between the UGV and the obstacle, when the distance between them was less than $d_{\mathrm{safe}}$ but no collision occurred, a negative reward was assigned to the UGV as a penalty to guide safer choices in the future. The value of $d_{\mathrm{safe}}$ was configured according to the importance of safety redundancy.
In summary, the safety protection reward function could be expressed as:

$$ r_{\mathrm{safe}} = \begin{cases} r_c, & \text{collision or collision along the route} \\ r_n, & d_{\mathrm{obs}} < d_{\mathrm{safe}} \text{ and no collision} \\ 0, & \text{otherwise} \end{cases} $$

where $r_c$ and $r_n$ are negative constants and $d_{\mathrm{obs}}$ is the distance between the UGV and the nearest obstacle.
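A sketch of this logic, assuming circular obstacles, is given below; the segment–circle intersection test handles the "collision along the route" case, and all names and penalty values are illustrative assumptions.

```python
import numpy as np

def segment_hits_circle(p0, p1, center, radius):
    """True if the segment p0->p1 intersects a circular obstacle."""
    p0, p1, center = map(np.asarray, (p0, p1, center))
    d = p1 - p0
    t = np.clip(np.dot(center - p0, d) / (np.dot(d, d) + 1e-12), 0.0, 1.0)
    closest = p0 + t * d  # point on the segment closest to the circle center
    return np.linalg.norm(center - closest) <= radius

def safety_reward(p_t, p_next, obstacles, d_safe, r_c=-10.0, r_n=-1.0):
    """Piecewise safety reward: collision (incl. along the route) or proximity.

    obstacles: iterable of (center, radius); r_c, r_n: assumed penalty values.
    """
    for center, radius in obstacles:
        if segment_hits_circle(p_t, p_next, center, radius):
            return r_c  # collision, or collision along the route
    gap = min((np.linalg.norm(np.asarray(p_next) - np.asarray(c)) - r
               for c, r in obstacles), default=float("inf"))
    return r_n if gap < d_safe else 0.0
```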
Summarizing the rewards proposed for the leader UGV, including the three subreward functions, in which $r_{\mathrm{arr}}$ and $r_{\mathrm{dis}}$ are positive rewards and $r_{\mathrm{safe}}$ is a negative reward, the total reward function of the leader could be expressed as:

$$ r_{\mathrm{leader}} = r_{\mathrm{arr}} + r_{\mathrm{dis}} + r_{\mathrm{safe}} $$
For other UGVs with follower attributes, their main purpose was to maintain the shape of the formation while also ensuring safety. The total reward function of the distributed follower consisted of four subreward functions:
Formation distance reward function ($r_{\mathrm{form}}$): for the distributed follower, keeping the shape of the formation is one of its main tasks. Corresponding to the layered leader-following structure mentioned above, the triangular shape was naturally selected as the target formation. Consider a basic triangular formation consisting of a leader and two followers, as shown in Figure 6, where $d_1$ and $d_2$ represent the distances between follower 1 and its leader and follower 2, respectively; the closer these distances are to the formation demand, the smaller the penalty. The specific formation achievement reward was calculated as:

$$ r_{\mathrm{form}} = -c_2 \left| d_1 - l \right| - c_3 \left| d_2 - l \right| $$

where $l$ is the required formation distance corresponding to its level, and $c_2$ and $c_3$ are constant coefficients used to adjust the importance of the two distance-keeping terms in the triangular formation.
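A minimal Python sketch of this penalty follows; the function name and default coefficients are assumptions for illustration.

```python
import numpy as np

def formation_reward(p_f1, p_leader, p_f2, l, c2=1.0, c3=1.0):
    """Penalty growing with deviation from the required formation distance l.

    p_f1: this follower's position; p_leader, p_f2: its leader and peer follower.
    c2, c3: assumed importance coefficients.
    """
    d1 = np.linalg.norm(np.asarray(p_f1) - np.asarray(p_leader))
    d2 = np.linalg.norm(np.asarray(p_f1) - np.asarray(p_f2))
    return -c2 * abs(d1 - l) - c3 * abs(d2 - l)  # 0 only when both distances equal l
```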
Safety protection reward function ($r_{\mathrm{safe}}$): safety is also a critical issue for followers. We adopted essentially the same solution as for the leader, including collision prevention and safety redundancy.
Formation position reward function ($r_{\mathrm{pos}}$): corresponding to the state space, the followers were divided into two categories: the left follower UGV and the right follower UGV. As shown in Figure 7, the main goal of this reward function was to adjust the relative position between the followers and the leader, so that the left and right follower UGVs were more inclined to stay on the left and right sides, respectively, of the ray formed by the leader's coordinates and heading, thus avoiding the formation instability and the collision risk inside the formation caused by the two followers exchanging positions during the advance. If the follower was on the correct side of the ray, zero reward was distributed; otherwise, a negative value was assigned to $r_{\mathrm{pos}}$. The relative position relationship could be calculated as:

$$ S = (y_r - y_L)(x_F - x_L) - (x_r - x_L)(y_F - y_L) $$

where $(x_L, y_L)$ is the leader's coordinate, $(x_r, y_r)$ is the coordinate of any point on the ray, and $(x_F, y_F)$ is the follower's coordinate. If $S$ was greater than 0, the follower was on the right side; if $S$ was less than 0, the follower was on the left side; otherwise, the follower was on the ray.
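The side test above is a standard 2D cross-product sign check; a sketch under that reading follows, with the function names and the unit-length ray point being our own choices.

```python
import numpy as np

def side_of_ray(leader_pos, heading, follower_pos):
    """Sign test: > 0 right of the leader's heading ray, < 0 left, 0 on the ray.

    heading: leader's heading angle in radians (assumed representation).
    """
    x_l, y_l = leader_pos
    # Any point on the ray: one unit ahead of the leader along its heading.
    x_r, y_r = x_l + np.cos(heading), y_l + np.sin(heading)
    x_f, y_f = follower_pos
    return (y_r - y_l) * (x_f - x_l) - (x_r - x_l) * (y_f - y_l)

def position_reward(leader_pos, heading, follower_pos, side, penalty=-1.0):
    """Zero on the correct side ('left' or 'right'), an assumed penalty otherwise."""
    s = side_of_ray(leader_pos, heading, follower_pos)
    correct = s > 0 if side == "right" else s < 0
    return 0.0 if correct else penalty
```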
Action reward function ($r_{\mathrm{act}}$): different from the previous rewards, which were all aimed at completing the formation and ensuring safety, an action reward function was introduced to keep the formation shape more stable, so as to avoid repeated oscillation of the formation between the completed and incomplete states. Considering the formation task under the leader-following structure, the more consistent the speed and heading of the UGV and its leader, the more stable the shape of the formation, and the reward function could be expressed as:

$$ r_{\mathrm{act}} = c_4 e^{-\left| v_F - v \right|} + c_5 e^{-\left| \theta_F - \theta \right|} $$

where $v_F$ is the velocity of the follower and $v$ is the velocity of its leader, $\theta_F$ is the heading angle of the follower and $\theta$ is the heading angle of its leader, and $c_4$ and $c_5$ are constant coefficients used to adjust the importance of keeping the speed and heading consistent. Note that, to avoid the follower UGV adopting this strategy while the formation is still incomplete, $c_4$ and $c_5$ should be configured to smaller values.
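A sketch consistent with the consistency bonus described above is given below; the exponential form mirrors our reconstruction of the equation, and the small default coefficients and the angle-wrapping step are assumptions.

```python
import numpy as np

def action_reward(v_f, v_l, theta_f, theta_l, c4=0.1, c5=0.1):
    """Small positive bonus when the follower matches its leader's motion.

    c4, c5: assumed small coefficients; keeping them small prevents the
    follower from chasing this bonus before the formation is complete.
    """
    # Wrap the heading difference into [0, pi] so opposite headings score lowest.
    dtheta = np.abs(np.arctan2(np.sin(theta_f - theta_l),
                               np.cos(theta_f - theta_l)))
    return c4 * np.exp(-abs(v_f - v_l)) + c5 * np.exp(-dtheta)
```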
Summarizing the rewards proposed for UGVs with follower attributes, including the four subreward functions, in which $r_{\mathrm{form}}$, $r_{\mathrm{safe}}$, and $r_{\mathrm{pos}}$ are negative rewards and $r_{\mathrm{act}}$ is a positive reward, the total reward function of the followers could be expressed as:

$$ r_{\mathrm{follower}} = r_{\mathrm{form}} + r_{\mathrm{safe}} + r_{\mathrm{pos}} + r_{\mathrm{act}} $$
3.4. Formation Expansion Method
In order to satisfy the requirement of formation expansion and reduce the possibility of collision within the formation, a hierarchical triangular formation structure under the layered leader-following structure was designed. Its core was that each agent formed a triangular formation with its leader and with its followers, respectively, whenever such a leader or followers existed.
Specifically, consider the basic single-layer triangular formation consisting of three UGVs, as shown in Figure 8a. The formation is divided into a leader and two followers, and the distance of the triangular formation is set to $l$.
On this basis, it is extended to a double-layer formation consisting of nine UGVs, as shown in Figure 8b. In this case, the formation consists of the leader, level 1 leaders, and followers, and the distance of the triangular formation consisting of the leader and the level 1 leaders is set to $2l$.
Furthermore, it is extended to a three-layer formation consisting of 27 UGVs, as shown in Figure 8c, where the distance of the triangular formation consisting of the leader and the level 2 leaders is set to $4l$.
To conclude, for an $n$-layer formation with a maximum capacity of $3^n$ UGVs, the level $i$ leader needs to form a triangular formation with a distance of $2^i l$ from its leader, while followers keep a distance $l$ from their leader.
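Under this reading of the hierarchy (a capacity of $3^n$ and a per-level distance that doubles, inferred from the 3-, 9-, and 27-UGV examples), the expansion rule can be sketched as follows; the function names are our own.

```python
def formation_capacity(n_layers: int) -> int:
    """Maximum number of UGVs in an n-layer triangular formation (3^n)."""
    return 3 ** n_layers

def level_distance(i: int, l: float) -> float:
    """Required triangle distance for a level-i leader (followers are level 0).

    Inferred doubling rule: l for followers, 2l for level 1, 4l for level 2, ...
    """
    return (2 ** i) * l

# Example: a three-layer formation holds 27 UGVs; its top triangle uses 4l.
assert formation_capacity(3) == 27
assert level_distance(2, 1.0) == 4.0
```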
3.5. Cooperative Formation for Distributed Training and Execution Using Deep Deterministic Policy Gradient
In this work, a distributed and extensible cooperative formation method using DDPG was implemented. Corresponding to the layered formation structure proposed above, instead of adopting a centralized strategy in which all UGVs were treated as one agent, a distributed training and execution strategy was designed, as shown in Figure 9. Each UGV had its own actor and critic networks. Instead of using global state information, each agent obtained its own observations $o_i$ and rewards $r_i$ by performing actions and interacting with the environment, allowing each network to learn an optimal control strategy to achieve a cooperative formation.
In addition, such a training strategy could largely meet the scalability requirements of UGV formations. Once the networks were trained on the base formation shape, they could be easily reused by adjusting the formation distance to meet the needs of UGVs at other levels, achieving formation expansion without retraining.
The training and testing framework is shown in Figure 10. Its purpose was to perform reinforcement learning training on simple UGV formations and then complete more complex formation tasks with the learned knowledge in the testing phase. Specifically, in the training process, the actor–critic networks of the leader, left follower, and right follower were each trained on a basic triangular UGV formation. In the testing phase, formation cases of different sizes and shapes (testing block in Figure 10) were designed to verify the scalability advantages of the proposed algorithm. Thanks to the design of the state space and reward function, UGVs of different levels belonging to the left or right followers could reuse the strategies learned from the basic formation in the training phase by being allocated the required formation distance. The workflow of the proposed distributed and scalable cooperative formation control algorithm is shown in Algorithm 1.
Algorithm 1 DDPG for Distributed and Scalable Cooperative Formation

Set the number of UGVs $n$ in the formation and determine the category of each UGV
Configure the distance of the UGV formation
for UGV $i = 1, 2, \ldots, n$ do
    Initialize replay buffer $D_i$
    Randomly initialize critic network $Q_i(s, a \mid \theta_i^{Q})$ and actor $\mu_i(s \mid \theta_i^{\mu})$ with weights $\theta_i^{Q}$ and $\theta_i^{\mu}$
    Initialize target networks $Q_i'$ and $\mu_i'$ with weights $\theta_i^{Q'} \leftarrow \theta_i^{Q}$, $\theta_i^{\mu'} \leftarrow \theta_i^{\mu}$
end for
for episode = 1 : max-episode do
    Reset the training scenario settings
    Initialize a random process $\mathcal{N}$ for action exploration
    for step $t$ = 1 : max-step do
        for UGV $i = 1, 2, \ldots, n$ do
            Receive observation state $s_t$
            Select action $a_t = \mu_i(s_t \mid \theta_i^{\mu}) + \mathcal{N}_t$
            Execute action $a_t$ in the training scenario and observe reward $r_t$ and new state $s_{t+1}$
            Store transition $(s_t, a_t, r_t, s_{t+1})$ in $D_i$
            Sample a random minibatch of $N$ transitions $(s_j, a_j, r_j, s_{j+1})$ from $D_i$
            Set $y_j = r_j + \gamma Q_i'(s_{j+1}, \mu_i'(s_{j+1} \mid \theta_i^{\mu'}) \mid \theta_i^{Q'})$
            Update the critic by minimizing the loss: $L = \frac{1}{N} \sum_j \left( y_j - Q_i(s_j, a_j \mid \theta_i^{Q}) \right)^2$
            Update the actor policy using the sampled policy gradient: $\nabla_{\theta_i^{\mu}} J \approx \frac{1}{N} \sum_j \nabla_a Q_i(s, a \mid \theta_i^{Q}) \big|_{s = s_j,\, a = \mu_i(s_j)} \nabla_{\theta_i^{\mu}} \mu_i(s \mid \theta_i^{\mu}) \big|_{s_j}$
            Update the target networks: $\theta_i^{Q'} \leftarrow \tau \theta_i^{Q} + (1 - \tau)\theta_i^{Q'}$, $\theta_i^{\mu'} \leftarrow \tau \theta_i^{\mu} + (1 - \tau)\theta_i^{\mu'}$
        end for
        if leader UGV arrives at goal point then
            break
        end if
    end for
end for
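For readers who prefer code, the per-agent update in the inner loop might look like the following PyTorch sketch; the function signature, optimizer handling, and hyperparameter values are our own assumptions, and the target networks are assumed to start as deep copies of the online networks.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update for a single agent (assumed hyperparameters).

    batch: tensors (s, a, r, s_next) sampled from this agent's replay buffer.
    """
    s, a, r, s_next = batch
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    # Critic: minimize the mean-squared TD error against the target y.
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: ascend the critic's value of the actor's own actions.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft-update the target networks with mixing coefficient tau.
    for net, tgt in ((critic, target_critic), (actor, target_actor)):
        for p, tp in zip(net.parameters(), tgt.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```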
In the design of the actor and critic networks of the DDPG algorithm, based on the dimensions of the state space and action space mentioned above, a network structure with three hidden layers (containing 256, 128, and 128 neurons, respectively) was adopted to approximate the policy and the state–action value function, as shown in Figure 11.
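A minimal PyTorch sketch of such networks is given below; the paper specifies only the 256/128/128 hidden layout, so the ReLU activations, the tanh output scaling, and the state-action concatenation point are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network with the 256/128/128 hidden layout from the paper."""
    def __init__(self, state_dim: int, action_dim: int, max_action: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),  # assumed bounded actions
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q-network: concatenates state and action, same hidden layout."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```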
Table 1 lists the other relevant parameters used by the DDPG algorithm.