1. Introduction
With the rapid development of technology, the application of agents in the field of rescue has become increasingly widespread [1,2,3]. However, efficiently dispatching agents for rescue missions and ensuring their accurate and rapid arrival at designated locations remains a significant open problem. Traditional agent-dispatching methods often lack precise consideration of real-time performance and average arrival time [4,5], while control strategies may suffer performance degradation in complex environments [6]. Therefore, the study of new methods for agent rescue dispatching and control is of great significance for improving rescue efficiency and reducing rescue costs.
In the context of rescue operations, the promptness of agent arrival is paramount to the overall effectiveness of the rescue effort [7,8,9]. Men et al. [10] proposed a multi-objective emergency response facility location (ERFsL) model for emergency management, taking into account the spatial characteristics of facilities and the potential risks associated with catastrophic interlocking chemical accidents. Regarding the application of swarm intelligence algorithms in positioning, Shalli Rani et al. [11] proposed a method that utilizes the krill herd algorithm for the non-ranging distributed processing of mobile target nodes in a maritime rescue network. Addressing the emergency dispatch problem for large-scale forest fires with multiple rescue centers (depots) and limited firefighting resources, Wang et al. [12] aimed to minimize the total completion time of all firefighting tasks by determining the optimal rescue routes for firefighting teams from multiple rescue centers. Nonetheless, previous research has frequently neglected the precise estimation and optimization of agent arrival times, thereby compromising the efficiency of scheduling strategies. By constructing an average distance function, we can more precisely evaluate the arrival time of agents under different scheduling strategies, thus optimizing the scheduling plan and ensuring that the agents arrive at the rescue site in the shortest possible time.
Amidst the complexity and volatility of rescue environments, a rational and swift dispatch strategy optimizes resource allocation and raises utilization efficiency [13,14,15]. Traditional scheduling optimization methods often rely on mathematical programming techniques [16,17], such as linear programming, integer programming, and mixed-integer linear programming. These methods formulate scheduling problems as optimization problems with the objective of minimizing a specific cost function, such as makespan, tardiness, or total completion time, and utilize established algorithms, including simplex methods, branch-and-bound, and heuristic search, to find optimal or near-optimal solutions. Yue et al. [18] established an evaluation model for electric vehicle (EV) indicators, determined the weight of each indicator using the optimal combination weight method based on accelerated genetic algorithms, and proposed an optimized scheduling strategy to achieve optimal EV fleet scheduling. For the large-scale and time-consuming agile Earth observation satellite scheduling problem, Du et al. [19] proposed a data-driven parallel scheduling method that incorporates a probability prediction model, a task allocation strategy, and a parallel scheduling approach. Within the IEEE Time-Sensitive Networking (TSN) Task Group framework, Min et al. [20] proposed two metaheuristic optimization methods to enhance the scheduling success rate of the Time-Aware Shaper (TAS). However, these traditional methods can suffer from limitations, particularly when dealing with complex real-world scheduling problems characterized by uncertainty, dynamic changes, and non-linear constraints [21]. They may struggle to scale to large instances, require significant computational resources, and often rest on assumptions that do not fully capture the complexities of real-world systems. Furthermore, traditional methods may not be flexible enough to accommodate emerging requirements, such as real-time decision making and integration with machine learning techniques. To overcome these limitations, we design a scheduling strategy optimization method based on the PPO algorithm. By exploiting the characteristics of PPO, we obtain an optimization method suited to agent rescue scheduling, achieving efficient optimization of the agent-scheduling strategy and improving rescue efficiency.
Predefined-time control can ensure that an agent arrives at the designated location within a preset time limit during the rescue process, thus guaranteeing the timeliness and effectiveness of the rescue operation [22,23,24]. To enable the direct presetting of the upper bound of the settling time, ref. [25] first proposed a new concept called predefined-time stability. Predefined-time control drives the control system to stability before a user-preset time, which has attracted considerable research interest. Refs. [26,27] separately studied predefined-time control design for first-order and second-order systems. In [28], a predefined-time leader–follower consensus protocol was established to achieve the predefined-time convergence of tracking errors. The authors of [29] investigated the predefined-time bipartite consensus control problem for multiple quadrotors under malicious attacks. However, the aforementioned predefined-time control design methods are proposed for control systems that satisfy the matching conditions and cannot be used to address non-strict-feedback nonlinear systems. To overcome these constraints, we formulate a precision-oriented, non-singular predefined-time fuzzy adaptive tracking control approach tailored to nonlinear systems. This approach combines the strengths of fuzzy control and adaptive control, enabling the agent to precisely calibrate control parameters during its motion and thereby cope with diverse environmental scenarios.
The core innovations are highlighted as follows:
(1) In optimizing the dispatch of rescue agents, we have formulated an evaluation function centered around the average flying distance of the agents. This function serves as a rigorous assessment criterion for rescue mission scheduling, accurately quantifying the time taken by agents to reach their intended destinations.
(2) In the research on agent-task allocation, this paper designs a task allocation method based on the proximal policy optimization algorithm. This algorithm can adaptively adjust the task allocation strategy, thereby enhancing the collaborative efficiency of the agent team and improving the quality of task completion.
(3) Addressing the issue of controlling the arrival of agents at target locations, this paper proposes a singularity-free predefined-time fuzzy adaptive tracking control strategy tailored for nonlinear systems. The agents are capable of accurately and efficiently reaching their destinations from their starting points within a predefined time.
3. Scheduling Policy Optimization Method Based on PPO
In rescue operations involving agents, the agent-scheduling strategy is crucial for enhancing rescue effectiveness. Especially under resource constraints, the challenge lies in allocating agents fairly and efficiently. To address this issue, this paper introduces the PPO algorithm and explores its application to optimizing agent-scheduling strategies. The PPO algorithm, with its fast learning speed and robust convergence, offers strong support for tackling this intricate problem. Its ability to efficiently traverse a vast search space and converge to optimal solutions, even in high-dimensional problems, underscores its applicability to complex decision-making challenges in agent rescue operations.
To address the agent-scheduling strategy optimization problem, we adopt a Markov Decision Process (MDP) framework, specified as follows:
State: The state is framed as an $n \times p$ matrix encoding the enclosure status between the $n$ agents and the $p$ targets. Specifically, the entry $s_{ij}(t)$ takes a value of 1 if the $i$-th agent is scheduling the $j$-th target at time $t$; otherwise, it is 0.
Action: The action space comprises a vector of length $p$, where each element represents the allocation of agents to targets. Specifically, $a_i$ indicates the identity of the agent assigned to surround the $i$-th target, thereby defining the scheduling strategy.
Reward: To optimize the multiple-target assignment process using PPO, our primary goal is to maximize the efficiency of surrounding. This is achieved by defining a reward function as outlined in (4).
Return: The total return is calculated as the discounted sum of all future rewards, starting from time step $t$ and extending indefinitely. This is formally expressed as $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$, where $\gamma$ serves as the discount factor, set to 0.9 in this context, balancing the importance of immediate and future rewards.
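To make the MDP encoding concrete, here is a minimal Python sketch of the state matrix, action vector, and discounted return described above. The sizes, variable names, and sample values are illustrative assumptions; the reward function of (4) is not reproduced here.

```python
import numpy as np

# Hypothetical sizes: n agents and p targets (names and values are
# illustrative, not taken from the paper).
n_agents, n_targets = 4, 3

# State: an n x p matrix; entry (i, j) is 1 if agent i is scheduling
# target j at the current time step, and 0 otherwise.
state = np.zeros((n_agents, n_targets), dtype=np.int8)
state[0, 2] = 1  # e.g., agent 0 is currently assigned to target 2

# Action: a vector of length p; element i names the agent assigned to
# surround the i-th target.
action = np.array([0, 3, 1])

def discounted_return(rewards, gamma=0.9):
    """Discounted return G_t = sum_k gamma^k * r_{t+k} over a finite trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9**2 * 2.0 = 2.62
```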
The scheduling strategy optimization method based on PPO iteratively seeks the optimal scheduling strategy; the procedure is summarized in Algorithm 1. It entails a systematic reinforcement learning process: the policy and value networks are initialized; data are collected through agent–environment interactions; the policy is updated using a clipped surrogate loss function and importance sampling for stability and efficiency; the value function is concurrently refined; and this cycle repeats to gradually improve the scheduling strategy, harnessing PPO's strengths in stability, efficiency, and flexibility. The specific process is as follows:
Algorithm 1 The optimization of the agent-scheduling policy based on PPO.
1: Initialize:
2:   Number of agents: $n$
3:   Policy network $\pi_{\theta}$ with parameters $\theta$
4:   Value network $V_{\phi}$ with parameters $\phi$
5:   Replay buffer $D$
6:   Learning rates $\alpha_{\theta}$, $\alpha_{\phi}$
7:   Discount factor $\gamma$
8:   Clipping factor $\epsilon$
9:   Number of iterations $T$
10:  Batch size $B$
11: for $t = 1$ to $T$ do
12:   Clear micro-batch data for the current iteration
13:   for each episode do
14:     Initialize environment state $s$ (e.g., locations and demands of all sites)
15:     for each agent $i = 1$ to $n$ do
16:       Select action $a_i$ (allocation of rescue packages) based on $\pi_{\theta}(a \mid s)$
17:       Execute $a_i$ (allocate packages to sites)
18:       Receive reward $r_i$ based on satisfaction of demands
19:       Observe next state $s'$
20:       Store $(s, a_i, r_i, s')$ in $D$
21:       Update $s \leftarrow s'$
22:     end for
23:   end for
24:   for each training step do
25:     Sample a batch randomly from $D$
26:     for each $(s, a, r, s')$ in the batch do
27:       Compute advantage function $\hat{A}(s, a)$
28:       Compute clipped surrogate objective $L^{\mathrm{CLIP}}(\theta)$
29:       Update policy network parameters $\theta$ using gradient ascent
30:       Compute value network target
31:       Update value network parameters $\phi$ using mean squared error loss
32:     end for
33:   end for
34:   Evaluate the current policy on a validation set
35:   if performance does not improve significantly then
36:     Adjust hyperparameters or improve the network structure
37:   end if
38: end for
39: Output: The trained policy network $\pi_{\theta}$ for agent rescue package allocation
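As a reading aid, the experience-collection loop of Algorithm 1 (lines 13–23) might look as follows in Python; `env`, `policy`, and `buffer` are hypothetical interfaces standing in for the components named in the pseudocode, not a specific library's API.

```python
def collect_episode(env, policy, buffer, n_agents):
    """One episode of experience collection (cf. Algorithm 1, lines 13-23).

    `env.reset`, `env.step`, and `policy.sample_action` are assumed
    interfaces for illustration; adapt them to the actual simulator.
    """
    s = env.reset()  # initialize state, e.g., locations and demands of all sites
    for i in range(n_agents):
        a = policy.sample_action(s)       # allocation of rescue packages
        s_next, r = env.step(i, a)        # execute allocation, observe reward
        buffer.append((s, a, r, s_next))  # store transition in replay buffer D
        s = s_next                        # advance to the next state
```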
Step 1. Initialization of Policy Network: In the initial phase, we establish a randomly initialized policy network, typically a neural network with parameters $\theta$. This network takes the current state $s$ as input and outputs two values: the probability distribution $\pi_{\theta}(a \mid s)$ for the action $a$ (i.e., the scheduling decision) and the state value function $V(s)$. The objective of the policy network is to maximize the expected value of the cumulative rewards.
Step 2. Data Collection: In each iteration, we interact with the environment using the current policy network. Given the state $s_t$, the policy network selects an action $a_t$ based on the probability distribution $\pi_{\theta}(a_t \mid s_t)$ and executes it. The environment then responds with the next state $s_{t+1}$ and an immediate reward $r_t$. We collect these data points to form trajectories, which are utilized for subsequent policy updates.
Step 3. Calculation of Advantage Function: The advantage function $A^{\pi}(s, a)$ represents the additional reward that can be obtained by taking action $a$ in a given state $s$ compared to the average reward obtained by following the current policy $\pi$. It can be expressed as follows:
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s),$$
where $Q^{\pi}(s, a)$ is the action-value function, representing the expected cumulative reward obtained by following policy $\pi$ after taking action $a$ in state $s$, and $V^{\pi}(s)$ is the state-value function, representing the expected cumulative reward obtained by following policy $\pi$ in state $s$.
In this paper, we estimate the advantage function using the Generalized Advantage Estimation (GAE) approach. GAE estimates the advantage function by introducing the $\lambda$-return method, which combines multi-step Temporal Difference (TD) errors. The TD error $\delta_t$ is defined as follows:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$
where $r_t$ is the immediate reward, $V(s_{t+1})$ and $V(s_t)$ are the value function estimates for the next and current states, respectively, and $\gamma$ is the discount factor.
GAE calculates the advantage estimate $\hat{A}_t$ as a weighted sum of multi-step TD errors, specifically
$$\hat{A}_t = \sum_{k=0}^{\infty} (\gamma \lambda)^{k}\, \delta_{t+k},$$
where $k$ is the number of steps considered and $\lambda \in [0, 1]$ is the GAE hyperparameter that balances bias and variance. When $\lambda$ is close to 1, GAE favors multi-step returns, thus reducing bias at the cost of higher variance; when $\lambda$ is close to 0, GAE favors one-step returns, thus reducing variance at the cost of higher bias.
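As a concrete illustration, a minimal Python implementation of this GAE recursion over one finite trajectory could look as follows; the default $\lambda = 0.95$ is a common choice, not a value taken from the paper.

```python
def gae_advantages(rewards, values, gamma=0.9, lam=0.95):
    """Generalized Advantage Estimation over one finite trajectory.

    `values` must hold one more entry than `rewards` (the bootstrap value
    of the final state). `lam` close to 1 lowers bias at the cost of
    variance; close to 0 it does the opposite. The default 0.95 is a
    common choice, not a value taken from the paper.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae  # accumulates the weighted sum of TD errors
        advantages[t] = gae
    return advantages
```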
Step 4. Update of Policy Network: Based on the PPO algorithm, we utilize the collected data and computed advantage function values to update the policy network. The PPO algorithm ensures stable updates by limiting the magnitude of changes between the old and new policies. Specifically, it employs a clipped surrogate objective function to update the parameters $\theta$ of the policy network, which can be expressed as follows:
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
where $r_t(\theta) = \pi_{\theta}(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the ratio of the probability of taking action $a_t$ under the new policy to that under the old policy, $\hat{A}_t$ is an estimate of the advantage function $A^{\pi}(s, a)$, and $\epsilon$ is a hyperparameter that controls the degree of change between the old and new policies.
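The clipped objective translates almost directly into code. The following PyTorch sketch computes the (negated) loss from per-sample log-probabilities and advantage estimates; $\epsilon = 0.2$ is the usual default rather than a paper-specific value.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, epsilon=0.2):
    """Clipped surrogate objective, negated so it can be minimized.

    All inputs are 1-D tensors of per-sample log-probabilities and
    advantage estimates; epsilon = 0.2 is the usual default rather than
    a paper-specific value.
    """
    ratio = torch.exp(new_logp - old_logp)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.mean(torch.min(unclipped, clipped))
```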
We iteratively repeat the process from Step 2 to Step 4 until a predefined number of iterations is reached or a convergence criterion is satisfied. During this iterative procedure, the policy network gradually converges to an optimal scheduling strategy, which maximizes the expected cumulative reward.
4. Predefined-Time Control Design and Stability Analysis
We define the error of the system as follows:
$$z_1 = x_1 - y_d, \qquad z_i = x_i - \alpha_{i-1}, \quad i = 2, \ldots, n,$$
where $z_1$ represents the tracking error, $y_d$ stands for the desired command signal, $z_i$ refers to the virtual error variable, and $\alpha_{i-1}$ signifies the intermediate control variable.
The following Lyapunov function is to be chosen:
$$V_1 = \frac{1}{2} z_1^{2} + \frac{1}{2 r_1} \tilde{\theta}_1^{2},$$
where $\tilde{\theta}_1 = \theta_1 - \hat{\theta}_1$ and $\hat{\theta}_1$ is the estimated value of parameter $\theta_1$.
From (5), (10), and (11), one can obtain the following:
Additionally, in order to achieve the desired functionality, we can establish the intermediate control variable $\alpha_1$ and a strategy for the adaptive update of $\hat{\theta}_1$ as specified in (13) and (14), where the fractional exponent is determined by dividing a positive even integer (greater than the accompanying positive odd integer) by that odd integer, and the time constant serves as a positive design parameter; the remaining design parameters are set to positive constants.
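For orientation, the sketch below shows one representative form of a non-singular predefined-time virtual control law and adaptive update, written in standard backstepping notation. The exponent $\iota$, time constant $T_1$, gain $r_1$, damping term $\sigma_1$, and regressor $\varphi_1$ are illustrative assumptions drawn from the predefined-time literature, not the paper's exact Equations (13) and (14).

```latex
% Illustrative only: a common non-singular predefined-time virtual control
% law and adaptive update in backstepping form. The exponent \iota = m/k
% (m a positive even integer greater than the positive odd integer k) and
% the time constant T_1 > 0 are design parameters; this is NOT necessarily
% the exact form of the paper's Equations (13) and (14).
\begin{align}
  \alpha_1 &= -\frac{\pi}{2\,\iota\,T_1}\Bigl(z_1^{\,2-\iota} + z_1^{\,\iota}\Bigr)
              - \hat{\theta}_1\, z_1\, \varphi_1^{2}(x_1), \\
  \dot{\hat{\theta}}_1 &= r_1\, z_1^{2}\, \varphi_1^{2}(x_1) - \sigma_1\, \hat{\theta}_1 .
\end{align}
```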
By incorporating Equations (13) and (14) into Equation (12) and subsequently applying the corresponding inequality, we obtain the inequality in (15), where the first parameter is positive and intentionally chosen and the accompanying constant is equal to 0.2785; the desired outcome can thus be achieved, where the lumped constant collects the remaining bounded terms.
In a similar manner to (11), we have selected the following Lyapunov function:
Subsequently, we are able to derive
Similarly, we define the intermediate control variables and the adaptive update strategies as follows:
By incorporating Equations (19) and (20) into Equation (18) and utilizing the inequality stated in (15), one can obtain the following, where the lumped constant is defined analogously to that of the first design step.
Consistent with our approach, we adopt the Lyapunov function specified below, from which we can obtain the bound in (23). Similarly, we define the predefined-time controller $u$ and the adaptive update strategies as given in (24) and (25).
Substituting (24) and (25) into (23) and applying the inequality in (15), one can obtain (26), where the lumped constant is defined analogously.
The following inequalities can be derived from [31,32]:
Substituting (27) and (28) into (26), one has the following, where the resulting constants are determined by the preceding design parameters. According to [33,34], for any admissible argument, there exist constants $p$ and $q$ satisfying the corresponding conditions, for which the following inequalities are valid: