Article

Optimization of Predefined-Time Agent-Scheduling Strategy Based on PPO

Dingding Qi, Yingjun Zhao, Longyue Li and Zhanxiao Jia
1 Air Defense and AntiMissile School, Air Force Engineering University, Xi’an 710043, China
2 Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(15), 2387; https://doi.org/10.3390/math12152387
Submission received: 21 June 2024 / Revised: 23 July 2024 / Accepted: 27 July 2024 / Published: 31 July 2024
(This article belongs to the Section Engineering Mathematics)

Abstract

In this paper, we introduce an agent rescue scheduling approach grounded in proximal policy optimization, coupled with a singularity-free predefined-time control strategy. The primary objective of this methodology is to bolster the efficiency and precision of rescue missions. Firstly, we design an evaluation function closely related to the average flying distance of agents, which provides a quantitative benchmark for assessing different scheduling schemes and assists in optimizing the allocation of rescue resources. Secondly, we develop a scheduling strategy optimization method using the Proximal Policy Optimization (PPO) algorithm. This method can automatically learn and adjust scheduling strategies to adapt to complex rescue environments and varying task demands. The evaluation function provides crucial feedback signals for the PPO algorithm, ensuring that the algorithm can precisely adjust the scheduling strategies to achieve optimal results. Thirdly, aiming to attain stability and precision in agent navigation to designated positions, we formulate a singularity-free predefined-time fuzzy adaptive tracking control strategy. This approach dynamically modulates control parameters in reaction to external disturbances and uncertainties, thus ensuring the precise arrival of agents at their destinations within the predefined time. Finally, to substantiate the validity of the proposed approach, we build a simulation environment in Python 3.7 and conduct a comparative analysis between PPO and another optimization method, the Deep Q-Network (DQN), using the variation in reward values as the evaluation benchmark.

1. Introduction

With the rapid development of technology, the application of agents in the field of rescue has become increasingly widespread [1,2,3]. However, how to efficiently dispatch agents for rescue missions and ensure their accurate and rapid arrival at designated locations remains a significant issue facing current research. Traditional agent-dispatching methods often lack precise considerations for real-time performance and average arrival time [4,5], while control strategies may experience performance degradation in complex environments [6]. Therefore, the study of new methods for agent rescue dispatching and control is of great significance for improving rescue efficiency and reducing rescue costs.
In the context of rescue operations, the promptness of agent arrival holds paramount importance in determining the overall effectiveness of the rescue efforts [7,8,9]. Men et al. [10] proposed a multi-objective emergency response facility location (ERFsL) model for emergency management, taking into account the spatial characteristics of facilities and the potential risks associated with catastrophic interlocking chemical accidents. Regarding the application of swarm intelligence algorithms in positioning, Shalli Rani et al. [11] proposed a method that utilizes the krill herd algorithm for the non-ranging distributed processing of mobile target nodes in a maritime rescue network. Addressing the emergency dispatch issue for large-scale forest fires with multiple rescue centers (depots) and limited firefighting resources, Wang et al. [12] aim to minimize the total completion time of all firefighting tasks by determining the optimal rescue routes for firefighting teams from multiple rescue centers. Nonetheless, previous research has frequently neglected the need for the precise estimation and optimization of agent arrival times, thereby compromising the efficiency of scheduling strategies. By constructing an average distance function, we are able to more precisely evaluate the arrival time of agents under different scheduling strategies, thus optimizing the scheduling plan and ensuring that the agents arrive at the rescue site in the shortest possible time.
Amidst the complexity and volatility of rescue environments, a rational and swift dispatch strategy serves to optimize resource allocation and elevate utilization efficiency [13,14,15]. Traditional scheduling optimization methods often rely on mathematical programming techniques [16,17], such as linear programming, integer programming, and mixed-integer linear programming. These methods formulate scheduling problems as optimization problems with the objective of minimizing a specific cost function, such as makespan, tardiness, or total completion time. They utilize established algorithms, including simplex methods, branch-and-bound, and heuristic search algorithms, to find optimal or near-optimal solutions. Yue et al. [18] established an evaluation model for electric vehicle (EV) indicators, determined the weight of each indicator using the optimal combination weight method based on accelerated genetic algorithms, and proposed an optimized scheduling strategy to achieve optimal EV fleet scheduling. For the large-scale and time-consuming agile Earth observation satellite scheduling problem, Du et al. [19] propose a data-driven parallel scheduling method that incorporates a probability prediction model, a task allocation strategy, and a parallel scheduling approach. In the IEEE Time-Sensitive Networking (TSN) Task Group, Min et al. [20] proposed two metaheuristic optimization methods to enhance the scheduling success rate of the Time-Aware Shaper (TAS). However, these traditional methods can suffer from limitations, particularly when dealing with complex real-world scheduling problems characterized by uncertainty, dynamic changes, and non-linear constraints [21]. They may struggle to scale to large-scale instances, require significant computational resources, and often make assumptions that do not fully capture the complexities of real-world systems. Furthermore, traditional methods may not be flexible enough to accommodate emerging trends and requirements, such as real-time decision making and integration with machine learning techniques. To overcome the previously mentioned limitations, a scheduling strategy optimization method based on the PPO algorithm is designed. Combining the characteristics of the PPO algorithm, an optimization method suitable for agent rescue scheduling is designed, achieving the efficient optimization of the agent-scheduling strategy and improving rescue efficiency.
Predefined-time control can ensure that the intelligent agent arrives at the designated location within the pre-set time limit during the rescue process, thus ensuring the timeliness and effectiveness of the rescue operation [22,23,24]. To enable the direct preset of the upper bound of the settling time, ref. [25] first proposed a new concept called predefined-time stability. Predefined-time control can make the control system achieve stability before the user-preset time, which has aroused the research interest of scholars. Refs. [26,27] separately studied the predefined-time control design issues for first-order and second-order systems. In [28], a predefined-time leader–follower consensus protocol was established to achieve the predefined-time convergence of tracking errors. The authors of [29] investigated the predefined-time bipartite consensus control problem for multiple quadrotors under malicious attacks. However, the aforementioned predefined-time control design methods are proposed for control systems that satisfy the matching conditions and cannot be used to address the problem for non-strict feedback nonlinear systems. To overcome the aforementioned constraints, we formulate a precision-oriented fuzzy adaptive tracking control approach with non-singular predefined time, specifically tailored for nonlinear systems. This approach synergistically leverages the strengths of fuzzy control and adaptive control, empowering the agent with the capability to precisely calibrate control parameters during its dynamic motion, thereby effectively navigating through the complexities of diverse environmental scenarios.
The core innovations are highlighted as follows:
(1) In optimizing the dispatch of rescue agents, we have formulated an evaluation function centered around the average flying distance of the agents. This function serves as a rigorous assessment criterion for rescue mission scheduling, accurately quantifying the time taken by agents to reach their intended destinations.
(2) In the research on agent-task allocation, this paper designs a task allocation method based on the proximal policy optimization algorithm. This algorithm can adaptively adjust the task allocation strategy, thereby enhancing the collaborative efficiency of the agent team and improving the quality of task completion.
(3) Addressing the issue of controlling the arrival of agents at target locations, this paper proposes a singularity-free predefined-time fuzzy adaptive tracking control strategy tailored for nonlinear systems. The agents are capable of accurately reaching their destinations from their starting points with enhanced efficiency within a predefined time.

2. Preliminaries and Problem Description

2.1. Preliminaries

The following nonlinear system is taken into account:
$$\dot{x}(t) = f(x, t), \qquad x(0) = x_0,$$ (1)
where $x \in \mathbb{R}^{n}$ signifies the state vector of the system and $f(x, t): \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}$ is a continuous function.
Definition 1.
When the state trajectory $x(t, x_0)$ of system (1) fulfills the condition $\|x(t, x_0)\| \le \varepsilon$ after a predetermined time $T_d$ has elapsed, the equilibrium point is referred to as practical predefined-time stable (PPTS), where $\varepsilon$ is a positive constant.
Lemma 1
(See [30]). Given a Lyapunov function $V(x)$ that pertains to system (1) and satisfies the inequality
$$\dot{V} \le -\frac{\pi}{\theta T_d} V^{1+\frac{\theta}{2}} - \bar{\rho}\,\frac{\pi}{\theta T_d} V^{1-\frac{\theta}{2}} + \Theta,$$ (2)
where $T_d > 0$ is the predefined time, $\theta$ and $\bar{\rho}$ are design parameters with $0 < \theta < 1$ and $\bar{\rho} = \sqrt{2} + 1$, and $\Theta$ is a positive constant. Under these circumstances, the equilibrium point of system (1) is deemed PPTS with the predefined time $T_d$. Additionally, the residual set of the state trajectory $x(t, x_0)$ can be defined as
$$x \in \left\{ V(x) : V(x) \le 2\theta T_d \Theta / \pi \right\}.$$ (3)

2.2. Problem Description

In this paper, we are devoted to exploring the pivotal question of how to allocate resources and satisfy demands in rescue operations. Precisely, the operation involves n rescue agents, each able to transport rescue packages with a standardized load capacity of m units. Furthermore, there are p disaster-affected rescue locations, each exhibiting a different degree of need for rescue resources, quantified as $q_j$, where $j = 1, 2, \dots, p$ indexes the rescue locations. To describe the scheduling result, we introduce a binary matrix $A \in \{0, 1\}^{n \times p}$: $a_{ij} = 1$ when agent i is dispatched to relief site j, and $a_{ij} = 0$ otherwise. Concurrently, we establish a distance matrix $D \in \mathbb{R}^{n \times p}$, wherein $d_{ij}$ represents the flying distance between agent i and relief site j. The fundamental aim of this research is to minimize the average flying distance for agents to reach the relief sites while satisfying all relief site requirements.
Based on the description provided, the evaluation function for the task-scheduling approach presented in this paper is formulated to accurately assess its performance, which is expressed as:
$$\min \ \frac{\sum_{i=1}^{n}\sum_{j=1}^{p} a_{ij}\, d_{ij}}{n} \qquad \text{s.t.} \quad m\sum_{i=1}^{n} a_{ij} \ge q_j, \quad j = 1, 2, \dots, p.$$ (4)
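For concreteness, the evaluation criterion in (4) can be computed directly from an assignment matrix, a distance matrix, and the demand vector. The following minimal NumPy sketch is our own illustration; the function name and the convention of returning an infinite cost for infeasible schedules are assumptions rather than part of the original formulation.

```python
import numpy as np

def average_flying_distance(A, D, demands, load_m):
    """Evaluate a candidate schedule against the criterion in (4).

    A       : (n, p) 0/1 assignment matrix, a_ij = 1 if agent i serves rescue area j
    D       : (n, p) matrix of flying distances d_ij
    demands : length-p vector of demands q_j
    load_m  : number of rescue packages carried by each agent (m)
    """
    A, D, demands = np.asarray(A), np.asarray(D), np.asarray(demands)
    # constraint in (4): m * sum_i a_ij >= q_j for every rescue area j
    if np.any(load_m * A.sum(axis=0) < demands):
        return np.inf                            # infeasible schedule (penalty convention assumed)
    return float((A * D).sum() / A.shape[0])     # average flying distance over the n agents
```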
The dynamic model chosen for the nonlinear multi-agent system is outlined below:
$$\dot{x}_i = f_i(x) + x_{i+1} + d_i(t), \quad i = 1, \dots, n-1, \qquad \dot{x}_n = f_n(x) + u + d_n(t), \qquad y = x_1,$$ (5)
where u and y represent the input and output of the system, respectively, $x_i$ ($i = 1, 2, \dots, n$) are the system states, $d_i(t)$ is an external disturbance bounded by an unknown positive constant, $f_i(x) = \Phi_i^{T}\lambda(x) + \varepsilon_i$ denotes the unknown nonlinear dynamics expressed through a fuzzy logic approximation, and $\varepsilon_i$ is the minimum approximation error.
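For intuition about how the model in (5) can be simulated, the following sketch integrates the strict-feedback dynamics with a simple explicit Euler step; the step size and the callable interfaces for $f_i$ and $d_i$ are our own assumptions, not part of the original formulation.

```python
import numpy as np

def euler_step(x, u, f_list, d_list, t, dt=0.01):
    """One explicit-Euler step of the strict-feedback dynamics in (5).

    x      : current state vector [x_1, ..., x_n] (NumPy array)
    u      : scalar control input
    f_list : callables f_i(x) standing in for the unknown nonlinear dynamics
    d_list : callables d_i(t) giving the bounded external disturbances
    """
    n = len(x)
    dx = np.zeros(n)
    for i in range(n - 1):
        dx[i] = f_list[i](x) + x[i + 1] + d_list[i](t)   # x_i-dot = f_i(x) + x_{i+1} + d_i(t)
    dx[n - 1] = f_list[n - 1](x) + u + d_list[n - 1](t)  # x_n-dot = f_n(x) + u + d_n(t)
    return x + dt * dx                                    # the output is y = x_1
```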

3. Scheduling Policy Optimization Method Based on PPO

In rescue operations involving agents, the scheduling strategy of agents is crucial for enhancing rescue effectiveness. Especially under resource constraints, the challenge lies in how to allocate agents fairly and efficiently. To address this issue, this paper introduces the PPO algorithm and explores its application in optimizing agent-scheduling strategies. The PPO algorithm, with its distinctive advantages such as swift learning speed and robust convergence performance, has offered formidable support for tackling this intricate problem. Its capability to efficiently traverse the vast search space and converge to optimal solutions, even in high-dimensional problems, underscores its applicability and significance in addressing complex decision-making challenges within the context of agent rescue operations.
To address the agent-scheduling strategy optimization problem, we adopt a Markov Decision Process (MDP) framework. In detail, the specifics are as follows:
State: The state is framed as an $n \times p$ matrix encoding the assignment status between agents and targets. Specifically, $(S_t)_{ij}$ takes the value 1 if the i-th agent is assigned to the j-th target at time t, and 0 otherwise.
Action: The action space comprises a vector $A_c$ of length p, where each element represents the allocation of an agent to a target. Specifically, $A_c[i]$ indicates the index of the agent assigned to the i-th target, thereby defining the scheduling strategy.
Reward: To optimize the multi-target assignment process using PPO, our primary goal is to maximize scheduling efficiency. This is achieved by defining the reward function according to the evaluation criterion in (4).
Return: The total return $R_t$ is calculated as the discounted sum of all future rewards, starting from $t+1$ and extending indefinitely, $R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$, where $\gamma$ serves as the discount factor, set to 0.9 in this context, balancing the importance of immediate and future rewards. A minimal environment sketch illustrating this formulation is given below.
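To make the formulation concrete, the sketch below wraps the state, action, and reward definitions into a minimal one-shot environment. The class name, the Euclidean distance computation, and the linear penalty for unmet demand are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

class RescueSchedulingEnv:
    """Toy environment mirroring the scheduling MDP described above."""

    def __init__(self, agent_pos, area_pos, demands, load_m=1):
        agent_pos = np.asarray(agent_pos, dtype=float)
        area_pos = np.asarray(area_pos, dtype=float)
        self.n, self.p = len(agent_pos), len(area_pos)
        self.demands, self.load_m = np.asarray(demands), load_m
        # flying distances d_ij between every agent and every rescue area
        self.D = np.linalg.norm(agent_pos[:, None, :] - area_pos[None, :, :], axis=-1)

    def reset(self):
        return np.zeros((self.n, self.p), dtype=int)       # empty assignment matrix S_t

    def step(self, action):
        """action[j] = index of the agent assigned to rescue area j (length-p vector)."""
        A = np.zeros((self.n, self.p), dtype=int)
        A[np.asarray(action), np.arange(self.p)] = 1
        avg_dist = (A * self.D).sum() / self.n             # objective of (4)
        unmet = np.maximum(self.demands - self.load_m * A.sum(axis=0), 0).sum()
        reward = -avg_dist - 10.0 * unmet                  # penalty weight is an assumption
        return A, reward, True, {}                         # single-step episode
```

With the discount factor $\gamma = 0.9$, the return $R_t$ is then the discounted sum of the rewards produced by successive interactions with such an environment.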
The scheduling strategy optimization method based on PPO aims to iteratively seek the optimal scheduling strategy. The optimization of the agent-scheduling policy based on PPO is shown in Algorithm 1. It entails a systematic reinforcement learning process that initializes policy and value networks, iteratively collects data through agent–environment interactions, updates policy using a clipped surrogate loss function and importance sampling for stability and efficiency, concurrently refines the value function, and repeats this cycle to gradually improve scheduling strategies, harnessing PPO’s strengths in stability, efficiency, and flexibility. The specific process is as follows:
Algorithm 1 The optimization of the agent-scheduling policy based on PPO.
1:  Initialize:
2:  Number of agents: n
3:  Policy network π(a|s; θ) with parameters θ
4:  Value network V(s; θ) with parameters θ
5:  Replay buffer D
6:  Learning rates α_actor, α_critic
7:  Discount factor γ
8:  Clipping factor ϵ
9:  Number of iterations T
10: Batch size B
11: for t = 1 to T do
12:     Clear the micro-batch data B_t for the current iteration
13:     for each episode do
14:         Initialize environment state s (e.g., locations and demands of all sites)
15:         for each agent u = 1 to n do
16:             Select action a_u (allocation of rescue packages) based on π(a|s; θ)
17:             Execute a_u (allocate packages to sites)
18:             Receive reward r_u based on the satisfaction of demands
19:             Observe the next state s′
20:             Store (s, a_u, r_u, s′) in D
21:             Update s ← s′
22:         end for
23:     end for
24:     for each training step do
25:         Sample a batch B_t randomly from D
26:         for each (s, a_u, r_u, s′) ∈ B_t do
27:             Compute the advantage function A(s, a_u)
28:             Compute the clipped surrogate objective L_t(θ)
29:             Update the policy network parameters θ by gradient ascent: θ ← θ + α_actor ∇_θ L_t(θ)
30:             Compute the value network target V_target = r_u + γ V(s′; θ)
31:             Update the value network parameters θ by minimizing the mean squared error: θ ← θ − α_critic ∇_θ (V(s; θ) − V_target)²
32:         end for
33:     end for
34:     Evaluate the current policy π(a|s; θ) on a validation set
35:     if performance does not improve significantly then
36:         Adjust hyperparameters or improve the network structure
37:     end if
38: end for
39: Output: The trained policy network π(a|s; θ) for agent rescue package allocation
Step 1. Initialization of Policy Network: In the initial phase, we establish a randomly initialized policy network, typically a neural network with parameters θ . This network takes the current state s as input and outputs two values: the probability distribution π ( a | s ; θ ) for the action a (i.e., the scheduling decision) and the state value function V ( s ; θ ) . The objective of the policy network is to maximize the expected value of the cumulative rewards.
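A minimal PyTorch sketch of one possible actor–critic parameterization consistent with Step 1 is given below; the hidden width, the separate policy and value heads, and the softmax output dimension are assumptions on our part (compare Table 6), not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Policy pi(a|s; theta) and state value V(s; theta) computed from the same state input."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Softmax(dim=-1),
        )
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state):
        return self.policy(state), self.value(state)   # action distribution and V(s)
```

During data collection, an action can then be sampled from the returned distribution, for example with torch.distributions.Categorical(probs).sample().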
Step 2. Data Collection: In each iteration, we interact with the environment using the current policy network. Given the state s t , the policy network selects an action a t based on the probability distribution π ( a | s t ; θ ) and executes it. The environment then responds with the next state s t + 1 and an immediate reward r t . We collect these data points ( s t , a t , r t , s t + 1 ) to form trajectories, which are utilized for subsequent policy updates.
Step 3. Calculation of the Advantage Function: The advantage function $A^{\pi}(s, a)$ represents the additional reward obtained by taking action a in a given state s compared with the average reward obtained by following the current policy $\pi$. It is defined as
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s),$$ (6)
where $Q^{\pi}(s, a)$ is the action-value function, representing the expected cumulative reward obtained by following policy $\pi$ after taking action a in state s, and $V^{\pi}(s)$ is the state-value function, representing the expected cumulative reward obtained by following policy $\pi$ from state s.
In this paper, we estimate the advantage function using the Generalized Advantage Estimation (GAE) approach. GAE estimates the advantage function by introducing the $\varphi$-return method, which combines multi-step Temporal Difference (TD) errors. The TD error $\delta_t$ is defined as
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$ (7)
where $r_t$ is the immediate reward, $V(s_{t+1})$ and $V(s_t)$ are the value function estimates for the next and current states, respectively, and $\gamma$ is the discount factor.
GAE calculates $A_t^{\mathrm{GAE}(\varphi)}$ as a weighted sum of multi-step TD errors, specifically
$$A_t^{\mathrm{GAE}(\varphi)} = \sum_{l=0}^{k-1} (\gamma\varphi)^{l}\, \delta_{t+l},$$ (8)
where k is the number of steps considered and $\varphi \in [0, 1]$ is the GAE hyperparameter that balances bias and variance. When $\varphi$ is close to 1, GAE incorporates long multi-step returns, which reduces bias at the cost of higher variance; when $\varphi$ is close to 0, GAE relies mainly on one-step returns, which reduces variance at the cost of higher bias.
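The backward recursion commonly used to evaluate (8) over a finite trajectory is sketched below; the value $\varphi = 0.95$ is an assumed placeholder, since the paper does not report the GAE parameter.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.9, phi=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards : [r_0, ..., r_{T-1}]
    values  : [V(s_0), ..., V(s_T)], including one bootstrap value for the final state
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]  # TD errors (7)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):            # accumulate the weighted sum in (8) backwards
        running = deltas[t] + gamma * phi * running
        advantages[t] = running
    return advantages
```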
Step 4. Update of Policy Network: Based on the PPO algorithm, we utilize the collected data and computed advantage function values to update the policy network. The PPO algorithm ensures stable updates by limiting the magnitude of changes between the old and new policies. Specifically, it employs a clipped surrogate objective function to update the parameters θ of the policy network, which can be expressed as follows:
$$L^{\mathrm{clip}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\right)\right],$$ (9)
where $r_t(\theta) = \pi(a_t \mid s_t; \theta) / \pi(a_t \mid s_t; \theta_{\mathrm{old}})$ is the ratio of the probability of taking action $a_t$ under the new policy to that under the old policy, $\hat{A}_t$ is an estimate of the advantage function $A^{\pi}(s_t, a_t)$, and $\epsilon$ is a hyperparameter that controls the degree of change between the old and new policies.
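In code, the clipped surrogate objective (9) is typically implemented from log-probabilities, as in the hedged PyTorch sketch below; the clipping value $\epsilon = 0.2$ is an assumed default, since the paper does not state it explicitly.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Clipped surrogate objective (9), returned as a loss to minimize (negative of L_clip)."""
    ratio = torch.exp(new_logp - old_logp)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # E_t[min(., .)] with flipped sign
```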
We iteratively repeat the process from Step 2 to Step 4 until a predefined number of iterations is reached or a convergence criterion is satisfied. During this iterative procedure, the policy network gradually converges to an optimal scheduling strategy, which maximizes the expected cumulative reward.

4. Predefined-Time Control Design and Stability Analysis

We define the error of the system as follows:
$$\mu_1 = y - y_d, \qquad \mu_i = x_i - \beta_{i-1}, \quad i = 2, \dots, n,$$ (10)
where $\mu_1$ represents the tracking error, $y_d$ stands for the desired command signal, $\mu_i$ refers to the virtual error variable, and $\beta_{i-1}$ signifies the intermediate control variable.
The following Lyapunov function is to be chosen:
$$V_1 = \frac{1}{2}\mu_1^{2} + \frac{1}{2}\tilde{\Phi}_1^{2},$$ (11)
where $\tilde{\Phi}_1 = \Phi_1 - \hat{\Phi}_1$ and $\hat{\Phi}_1$ is the estimated value of the parameter $\Phi_1$.
From (5), (10), and (11), one can obtain the following:
$$\dot{V}_1 = \mu_1\!\left[\Phi_1^{T}\lambda_1(x) + \varepsilon_1 + \mu_2 + \beta_1 + d_1(t) - \dot{y}_d\right] - \tilde{\Phi}_1\dot{\hat{\Phi}}_1.$$ (12)
Additionally, in order to achieve the desired functionality, we can establish the intermediate control variable β 1 and a strategy for the adaptive update of Φ ^ 1 as specified below:
$$\beta_1 = -\frac{\varphi\pi}{\theta T_d}\mu_1^{1+\theta}\tanh\!\left(\frac{\varphi\pi\,\mu_1^{2+\theta}}{\theta T_d\,\nu_1}\right) + \dot{y}_d - \frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_1^{1-\theta}\tanh\!\left(\frac{\bar{\varphi}c_0\pi\,\mu_1^{2-\theta}}{\theta T_d\,\nu_1}\right) - \frac{\mu_1}{\chi_1}\hat{\Phi}_1,$$ (13)
$$\dot{\hat{\Phi}}_1 = -\frac{\gamma\varphi\pi}{\theta T_d}\hat{\Phi}_1^{1+\theta} - \frac{\bar{\gamma}\bar{\varphi}c_0\pi}{\theta T_d}\hat{\Phi}_1^{1-\theta} + \frac{\mu_1^{2}}{\chi_1}, \qquad \hat{\Phi}_1(0) > 0,$$ (14)
where $\theta = \theta_1/\theta_2$, with $\theta_1$ a positive even integer and $\theta_2$ a positive odd integer satisfying $\theta_1 < \theta_2$ (so that $0 < \theta < 1$). The time constant $T_d$ serves as a positive design parameter. The remaining constants are chosen as $c_0 = \sqrt{2} + 1$, $\varphi = (2n)^{\theta/2}/2^{1+\theta/2}$, $\bar{\varphi} = 1/2^{1-\theta/2}$, $\gamma = 2 + \theta$, and $\bar{\gamma} = 2 - \theta$.
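For illustration only, the sketch below evaluates the expressions (13) and (14), as reconstructed here, numerically; the parameter dictionary, the helper for sign-preserving fractional powers, and all variable names are our own choices, so this should be read as a sketch under those assumptions rather than verified reference code.

```python
import numpy as np

def frac_pow(z, a):
    """Sign-preserving power z**a for the odd-ratio exponents 1 +/- theta
    (this is why theta1/theta2 is chosen as even/odd)."""
    return np.sign(z) * np.abs(z) ** a

def first_step_control(mu1, phi1_hat, yd_dot, p):
    """Evaluate the virtual control (13) and adaptive law (14) for given design constants.

    p : dict with keys theta, Td, nu1, chi1, c0, varphi, varphi_bar, gamma, gamma_bar.
    """
    th = p["theta"]
    k1 = p["varphi"] * np.pi / (th * p["Td"])
    k2 = p["varphi_bar"] * p["c0"] * np.pi / (th * p["Td"])
    s1 = np.abs(mu1) ** (2 + th)      # mu1^(2+theta): even numerator, hence always >= 0
    s2 = np.abs(mu1) ** (2 - th)
    beta1 = (-k1 * frac_pow(mu1, 1 + th) * np.tanh(k1 * s1 / p["nu1"])
             + yd_dot
             - k2 * frac_pow(mu1, 1 - th) * np.tanh(k2 * s2 / p["nu1"])
             - mu1 / p["chi1"] * phi1_hat)
    phi1_hat_dot = (-p["gamma"] * k1 * frac_pow(phi1_hat, 1 + th)
                    - p["gamma_bar"] * k2 * frac_pow(phi1_hat, 1 - th)
                    + mu1 ** 2 / p["chi1"])
    return beta1, phi1_hat_dot
```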
By incorporating Equations (13) and (14) into Equation (12) and applying the inequality
$$|w| - w\tanh(w/\nu) \le \tau\nu,$$ (15)
which holds for any $w \in \mathbb{R}$, any positive, intentionally chosen parameter $\nu$, and $\tau = 0.2785$, the desired bound is obtained:
$$\dot{V}_1 \le -\frac{\varphi\pi}{\theta T_d}\mu_1^{2+\theta}\tanh\!\left(\frac{\varphi\pi\,\mu_1^{2+\theta}}{\theta T_d\,\nu_1}\right) + \frac{\gamma\varphi\pi}{\theta T_d}\tilde{\Phi}_1\hat{\Phi}_1^{1+\theta} + \mu_1\mu_2 - \frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_1^{2-\theta}\tanh\!\left(\frac{\bar{\varphi}c_0\pi\,\mu_1^{2-\theta}}{\theta T_d\,\nu_1}\right) + \frac{\bar{\gamma}\bar{\varphi}c_0\pi}{\theta T_d}\tilde{\Phi}_1\hat{\Phi}_1^{1-\theta} + \frac{3}{4} \le -\frac{\varphi\pi}{\theta T_d}\mu_1^{2+\theta} + \frac{\gamma\varphi\pi}{\theta T_d}\tilde{\Phi}_1\hat{\Phi}_1^{1+\theta} + \mu_1\mu_2 - \frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_1^{2-\theta} + \frac{\bar{\gamma}\bar{\varphi}c_0\pi}{\theta T_d}\tilde{\Phi}_1\hat{\Phi}_1^{1-\theta} + \Theta_1,$$ (16)
where $\Theta_1 = \frac{3}{4} + 2\tau\nu_1$.
In a similar manner to (11), we have selected the following Lyapunov function:
$$V_i = V_{i-1} + \frac{1}{2}\mu_i^{2} + \frac{1}{2}\tilde{\Phi}_i^{2}.$$ (17)
Subsequently, we are able to derive
$$\dot{V}_i = \dot{V}_{i-1} + \mu_i\Big\{\Phi_i^{T}\lambda_i(x) + \varepsilon_i + \mu_{i+1} + \beta_i + d_i(t) - H_{i-1} - \sum_{j=1}^{i-1}\frac{\partial\beta_{i-1}}{\partial x_j}\big[\Phi_j^{T}\lambda_j(x) + \varepsilon_j + d_j(t)\big]\Big\} - \tilde{\Phi}_i\dot{\hat{\Phi}}_i.$$ (18)
Similarly, we define the intermediate control variable $\beta_i$ and the adaptive update law for $\hat{\Phi}_i$ as follows:
$$\beta_i = -\frac{\varphi\pi}{\theta T_d}\mu_i^{1+\theta}\tanh\!\left(\frac{\varphi\pi\,\mu_i^{2+\theta}}{\theta T_d\,\nu_i}\right) + H_{i-1} - \frac{\mu_i}{\chi_i}\hat{\Phi}_i - \frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_i^{1-\theta}\tanh\!\left(\frac{\bar{\varphi}c_0\pi\,\mu_i^{2-\theta}}{\theta T_d\,\nu_i}\right) - \mu_{i-1},$$ (19)
$$\dot{\hat{\Phi}}_i = -\frac{\gamma\varphi\pi}{\theta T_d}\hat{\Phi}_i^{1+\theta} - \frac{\bar{\gamma}\bar{\varphi}c_0\pi}{\theta T_d}\hat{\Phi}_i^{1-\theta} + \frac{\mu_i^{2}}{\chi_i}, \qquad \hat{\Phi}_i(0) > 0.$$ (20)
By incorporating Equations (19) and (20) into Equation (18) and utilizing the inequality stated in (15), one can obtain
$$\dot{V}_i \le -\sum_{j=1}^{i-1}\frac{\varphi\pi}{\theta T_d}\mu_j^{2+\theta} + \sum_{j=1}^{i}\frac{\gamma\varphi\pi}{\theta T_d}\tilde{\Phi}_j\hat{\Phi}_j^{1+\theta} - \sum_{j=1}^{i-1}\frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_j^{2-\theta} + \sum_{j=1}^{i}\frac{\bar{\gamma}\bar{\varphi}c_0\pi}{\theta T_d}\tilde{\Phi}_j\hat{\Phi}_j^{1-\theta} - \frac{\varphi\pi}{\theta T_d}\mu_i^{2+\theta}\tanh\!\left(\frac{\varphi\pi\,\mu_i^{2+\theta}}{\theta T_d\,\nu_i}\right) - \frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_i^{2-\theta}\tanh\!\left(\frac{\bar{\varphi}c_0\pi\,\mu_i^{2-\theta}}{\theta T_d\,\nu_i}\right) + \mu_i\mu_{i+1} + \Theta_{i-1} + \frac{3i}{4} \le -\sum_{j=1}^{i}\frac{\varphi\pi}{\theta T_d}\mu_j^{2+\theta} + \sum_{j=1}^{i}\frac{\gamma\varphi\pi}{\theta T_d}\tilde{\Phi}_j\hat{\Phi}_j^{1+\theta} + \mu_i\mu_{i+1} - \sum_{j=1}^{i}\frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_j^{2-\theta} + \sum_{j=1}^{i}\frac{\bar{\gamma}\bar{\varphi}c_0\pi}{\theta T_d}\tilde{\Phi}_j\hat{\Phi}_j^{1-\theta} + \Theta_i,$$ (21)
where $\Theta_i = \Theta_{i-1} + \frac{3i}{4} + 2\tau\nu_i$.
Consistent with our approach, we adopt the Lyapunov function specified below:
$$V_n = V_{n-1} + \frac{1}{2}\mu_n^{2} + \frac{1}{2}\tilde{\Phi}_n^{2},$$ (22)
then we can obtain
$$\dot{V}_n = \dot{V}_{n-1} + \mu_n\Big\{\Phi_n^{T}\lambda_n(x) + \varepsilon_n + u + d_n(t) - H_{n-1} - \sum_{j=1}^{n-1}\frac{\partial\beta_{n-1}}{\partial x_j}\big[\Phi_j^{T}\lambda_j(x) + \varepsilon_j + d_j(t)\big]\Big\} - \tilde{\Phi}_n\dot{\hat{\Phi}}_n.$$ (23)
Similarly, we define the predefined-time controller u and the adaptive update law for $\hat{\Phi}_n$ as follows:
$$u = -\frac{\varphi\pi}{\theta T_d}\mu_n^{1+\theta}\tanh\!\left(\frac{\varphi\pi\,\mu_n^{2+\theta}}{\theta T_d\,\nu_n}\right) + H_{n-1} - \frac{\mu_n}{\chi_n}\hat{\Phi}_n - \frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_n^{1-\theta}\tanh\!\left(\frac{\bar{\varphi}c_0\pi\,\mu_n^{2-\theta}}{\theta T_d\,\nu_n}\right) - \mu_{n-1},$$ (24)
$$\dot{\hat{\Phi}}_n = -\frac{\gamma\varphi\pi}{\theta T_d}\hat{\Phi}_n^{1+\theta} - \frac{\bar{\gamma}\bar{\varphi}c_0\pi}{\theta T_d}\hat{\Phi}_n^{1-\theta} + \frac{\mu_n^{2}}{\chi_n}, \qquad \hat{\Phi}_n(0) > 0.$$ (25)
Substituting (24) and (25) into (23) and applying the inequality in (15), one can obtain
$$\dot{V}_n \le -\sum_{j=1}^{n-1}\frac{\varphi\pi}{\theta T_d}\mu_j^{2+\theta} + \sum_{j=1}^{n}\frac{\gamma\varphi\pi}{\theta T_d}\tilde{\Phi}_j\hat{\Phi}_j^{1+\theta} - \sum_{j=1}^{n-1}\frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_j^{2-\theta} + \sum_{j=1}^{n}\frac{\bar{\gamma}\bar{\varphi}c_0\pi}{\theta T_d}\tilde{\Phi}_j\hat{\Phi}_j^{1-\theta} - \frac{\varphi\pi}{\theta T_d}\mu_n^{2+\theta}\tanh\!\left(\frac{\varphi\pi\,\mu_n^{2+\theta}}{\theta T_d\,\nu_n}\right) - \frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_n^{2-\theta}\tanh\!\left(\frac{\bar{\varphi}c_0\pi\,\mu_n^{2-\theta}}{\theta T_d\,\nu_n}\right) + \Theta_{n-1} + \frac{3n}{4} \le -\sum_{j=1}^{n}\frac{\varphi\pi}{\theta T_d}\mu_j^{2+\theta} + \sum_{j=1}^{n}\frac{\gamma\varphi\pi}{\theta T_d}\tilde{\Phi}_j\hat{\Phi}_j^{1+\theta} - \sum_{j=1}^{n}\frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_j^{2-\theta} + \sum_{j=1}^{n}\frac{\bar{\gamma}\bar{\varphi}c_0\pi}{\theta T_d}\tilde{\Phi}_j\hat{\Phi}_j^{1-\theta} + \Theta_n,$$ (26)
where $\Theta_n = \Theta_{n-1} + \frac{3n}{4} + 2\tau\nu_n$.
The following inequalities can be derived from [31,32]:
$$\gamma\tilde{\Phi}_j\hat{\Phi}_j^{1+\theta} \le \Phi_j^{2+\theta} - \hat{\Phi}_j^{2+\theta} \le 2\Phi_j^{2+\theta} - \tilde{\Phi}_j^{2+\theta},$$ (27)
$$\bar{\gamma}\tilde{\Phi}_j\hat{\Phi}_j^{1-\theta} \le \Phi_j^{2-\theta} - \hat{\Phi}_j^{2-\theta} \le 2\Phi_j^{2-\theta} - \tilde{\Phi}_j^{2-\theta}.$$ (28)
Substituting (27) and (28) into (26), one has
$$\dot{V}_n \le -\sum_{j=1}^{n}\frac{\varphi\pi}{\theta T_d}\mu_j^{2+\theta} - \sum_{j=1}^{n}\frac{\varphi\pi}{\theta T_d}\tilde{\Phi}_j^{2+\theta} - \sum_{j=1}^{n}\frac{\bar{\varphi}c_0\pi}{\theta T_d}\mu_j^{2-\theta} - \sum_{j=1}^{n}\frac{\bar{\varphi}c_0\pi}{\theta T_d}\tilde{\Phi}_j^{2-\theta} + \Theta,$$ (29)
where $\Theta = \Theta_n + \sum_{j=1}^{n}\frac{\pi(2n)^{\theta/2}}{2^{\theta/2}\theta T_d}\Phi_j^{2+\theta} + \sum_{j=1}^{n}\frac{2^{\theta/2}c_0\pi}{\theta T_d}\Phi_j^{2-\theta}$.
According to [33,34], for any $x_i \in \mathbb{R}$, $i = 1, \dots, n$, and any constants p and q satisfying $0 < p \le 1$ and $q > 1$, the following inequalities are valid:
$$\sum_{i=1}^{n}|x_i|^{p} \ge \left(\sum_{i=1}^{n}|x_i|\right)^{p}, \qquad \sum_{i=1}^{n}|x_i|^{q} \ge n^{1-q}\left(\sum_{i=1}^{n}|x_i|\right)^{q}.$$ (30)
Then, one can obtain
$$\dot{V}_n \le -\frac{\pi}{\theta T_d}\left[\sum_{j=1}^{n}\left(\frac{1}{2}\mu_j^{2} + \frac{1}{2}\tilde{\Phi}_j^{2}\right)\right]^{1+\frac{\theta}{2}} - \frac{c_0\pi}{\theta T_d}\left[\sum_{j=1}^{n}\left(\frac{1}{2}\mu_j^{2} + \frac{1}{2}\tilde{\Phi}_j^{2}\right)\right]^{1-\frac{\theta}{2}} + \Theta = -\frac{\pi}{\theta T_d}V_n^{1+\frac{\theta}{2}} - \frac{c_0\pi}{\theta T_d}V_n^{1-\frac{\theta}{2}} + \Theta.$$ (31)
Inequality (31) has exactly the form required by Lemma 1 with $\bar{\rho} = c_0$; therefore, the closed-loop system is practical predefined-time stable, and the tracking error $\mu_1$ converges to a small residual set around zero within the predefined time $T_d$.

5. Simulation Setup and Result Analysis

5.1. Simulation Setup

To validate the effectiveness of the theoretical derivation and the proposed agent-scheduling optimization strategy, this section designs and executes a series of simulation experiments using the Python programming language. To comprehensively consider the potential impact of the number of agents and the scale of rescue areas on algorithm performance, we specifically set up two experimental scenarios with different characteristics to demonstrate the versatility and adaptability of the algorithm under various conditions. In both scenarios, each agent is set to carry one rescue package as the task load. Specifically, scenario one simulates a multi-agent system consisting of ten agents and three rescue areas, with the initial positions of the agents and the locations of the rescue targets detailed in Table 1 and Table 2, respectively. Scenario two further expands the system complexity by introducing fourteen agents and four rescue areas, and, similarly, the initial deployment of agents and the locations of task targets are set according to Table 3 and Table 4.
The simulation platform is PyTorch; the specific parameter configuration of PPO is given in Table 5, and the structure of the PPO networks is given in Table 6. To demonstrate the performance of the method proposed in this study, we conducted an in-depth comparison between the PPO algorithm and the classic DQN, a traditional optimization technique, aiming to validate the advantages of the PPO algorithm in terms of optimization efficiency and effectiveness through direct performance evaluation.
For the singularity-free predefined-time fuzzy adaptive tracking control strategy based on nonlinear systems, the design parameters in (13), (14), (24), and (25) are set as $T_d = 5$, $\theta = \theta_1/\theta_2 = 6/75$, and $\nu_1 = \nu_2 = 0.001$. The fuzzy basis functions are defined as
$$\lambda_c(x) = \frac{\prod_{i=1}^{b}\varpi_{B_i^{c}}(x_i)}{\sum_{c=1}^{\iota}\prod_{i=1}^{b}\varpi_{B_i^{c}}(x_i)}.$$ (32)
The fuzzy membership functions are selected as
$$\varpi_{B_1^{c}} = e^{-\frac{(x_1 - 3 + c)^{2}}{16}}, \qquad \varpi_{B_2^{c}} = e^{-\frac{(x_2 - 6 + 2c)^{2}}{4}},$$ (33)
where the number of fuzzy rules is $\iota = 5$ and the inputs are indexed by $i = 1, 2$ (i.e., $b = 2$).
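A small NumPy sketch of the fuzzy basis functions, following the expressions (32) and (33) as reconstructed above, is given below; the small constant added to the denominator is a numerical safeguard of our own.

```python
import numpy as np

def fuzzy_basis(x1, x2, iota=5):
    """Fuzzy basis functions lambda_c(x) built from the membership functions in (33)."""
    c = np.arange(1, iota + 1)
    w1 = np.exp(-((x1 - 3 + c) ** 2) / 16)     # membership of x1 in rule c
    w2 = np.exp(-((x2 - 6 + 2 * c) ** 2) / 4)  # membership of x2 in rule c
    prod = w1 * w2                              # product over the b = 2 inputs
    return prod / (prod.sum() + 1e-12)          # normalization in (32)
```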

5.2. Result Analysis

Figure 1 and Figure 2 comprehensively exhibit the comparative results of agent scheduling under two distinct scenarios, aiming to compare the performance of the two scheduling strategies. Specifically, Figure 1a visually presents the task allocation obtained under the random scheduling strategy in the scenario with ten agents and three rescue areas, where the average distance from the agents to the rescue areas is 6.61. In contrast, Figure 1b shows the optimized task allocation and the more compact layout obtained in the same scenario after adopting the PPO-optimized scheduling strategy, with the average distance reduced to 5.40. Furthermore, Figure 2a,b contrast random scheduling and PPO-optimized scheduling in the more complex scenario with fourteen agents and four rescue areas, with the latter decreasing the average distance from the agents to the rescue areas from 4.88 to 3.72. This series of comparative analyses highlights the superiority of the PPO algorithm in shortening the average distance from agents to rescue areas and, through the two different scenarios, validates the algorithm's wide applicability and efficiency.
Compared with the results under the random scheduling strategy, the PPO-optimized allocation demonstrates a markedly better outcome. This finding validates the effectiveness of the PPO strategy in scheduling optimization. By comparing the performance of the two strategies, we obtain a clearer understanding of their differences in resource allocation and efficiency, providing a useful reference for future research and applications.
This paper also comparatively analyzes the performance of the PPO algorithm against the DQN algorithm to demonstrate the superiority of the proposed approach. Figure 3 illustrates the differences in average return values and their respective trends, summarizing the learning progression of both algorithms under Scenario 2 and providing quantitative evidence for assessing their efficiency and stability. For the DQN algorithm, representative of traditional methods, the average return gradually declines from a peak of 10.5 during the initial 15,000 iterations and then stabilizes over the next 5000 iterations, converging at approximately 7.9. This behavior indicates DQN's susceptibility to overfitting in the early learning phase and its relatively slow convergence, which limits performance gains in complex environments that require rapid adaptation and optimization. In contrast, the PPO algorithm exhibits remarkable convergence efficiency and optimization capability: as depicted in Figure 3, its average return quickly peaks at 9.5 and then steadily declines, converging to around 6.2 after approximately 12,500 iterations, about 2500 iterations earlier than DQN. This outcome highlights PPO's efficiency in strategy optimization and its advantage in maintaining training stability through constrained policy update steps, allowing it to identify and settle on a relatively superior strategy more quickly.
The simulation results of the singularity-free predefined-time fuzzy adaptive tracking control strategy for nonlinear systems are presented in Figure 4, Figure 5, Figure 6 and Figure 7. Figure 4 compares the trajectories of the system output signal $y$ and the desired command signal $y_d$, demonstrating a high degree of overlap between the two signals and visually showcasing the dynamic response of the system. Figure 5 shows the trajectory of the tracking error, indicating that before the predefined time $T_d = 5$ the tracking error $\mu_1$ stably converges to a very small range near zero. This result validates the effectiveness and accuracy of the control algorithm. As evident from Figure 4 and Figure 5, the proposed predefined-time control algorithm exhibits excellent tracking control performance. Figure 6 presents the trajectory of the control input $u$, while Figure 7 depicts the evolution of the adaptive parameters $\hat{\Phi}_i$. Together, these results demonstrate that the proposed predefined-time control method enables the agent to reach its destination within the desired time frame for rescue.

6. Conclusions

In this paper, we introduce innovative scheduling and control strategies for the rescue missions of intelligent agents. We construct a function related to the average distance of the intelligent agents and design an optimization method for the scheduling strategy based on the PPO algorithm, achieving efficient response to rescue missions. Furthermore, we propose a singularity-free predefined-time fuzzy adaptive tracking control strategy based on nonlinear systems, ensuring that the intelligent agents can accurately reach designated locations within the predetermined time. Finally, two independent experimental scenarios are designed to comprehensively evaluate the universality and efficiency of the proposed method. Specifically, to highlight the superiority of the PPO algorithm in optimizing scheduling strategies, the PPO algorithm is directly compared with the classical DQN algorithm. The experimental results show that, when compared with the scheduling strategy optimized by the DQN algorithm, the strategy generated by the PPO algorithm significantly shortens the average travel distance for the agent to complete the task by 1.7 units, while achieving a 27% increase in efficiency, effectively demonstrating the superior performance of the PPO algorithm in the field of scheduling optimization. These innovative approaches not only improve the rescue efficiency, but also enhance the reliability of the intelligent agents in complex environments.

Author Contributions

Quantitative Analysis, Modeling: D.Q.; Method Validation and Simulation Scenario Design: Y.Z.; Method Design, Paper Editing, and Submission: L.L.; Simulation Processing and Analysis: Z.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (NSFC), Grant/Award Number: 72071209 and the Youth Talent Lifting Project of the China Association for Science and Technology No. 2021-JCJQ-QT-018.

Data Availability Statement

Data will be made available on request.

Acknowledgments

Thanks to the editors and reviewers for their help in improving the quality of the paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

References

1. Cheng, J.; Gao, Y.; Tian, Y.; Liu, H. GA-LNS Optimization for Helicopter Rescue Dispatch. IEEE Trans. Intell. Veh. 2023, 8, 3898–3912.
2. Wu, J.; Song, C.; Ma, J.; Wu, J.; Han, G. Reinforcement Learning and Particle Swarm Optimization Supporting Real-Time Rescue Assignments for Multiple Autonomous Underwater Vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 6807–6820.
3. Akgun, S.A.; Ghafurian, M.; Crowley, M.; Dautenhahn, K. Using Affect as a Communication Modality to Improve Human-Robot Communication in Robot-Assisted Search and Rescue Scenarios. IEEE Trans. Affect. Comput. 2023, 14, 3013–3030.
4. Wang, Y.; Chen, X.; Wang, L. Deep Reinforcement Learning-Based Rescue Resource Distribution Scheduling of Storm Surge Inundation Emergency Logistics. IEEE Trans. Ind. Inform. 2023, 19, 10004–10013.
5. Wu, J.; Sun, Y.; Li, D.; Shi, J.; Li, X.; Gao, L.; Yu, L.; Han, G.; Wu, J. An Adaptive Conversion Speed Q-Learning Algorithm for Search and Rescue UAV Path Planning in Unknown Environments. IEEE Trans. Veh. Technol. 2023, 72, 15391–15404.
6. Huo, M.; Duan, H.; Zeng, Z. Cluster Space Control Method of Manned-Unmanned Aerial Team for Target Search Task. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 2545–2549.
7. Ribeiro, R.G.; Cota, L.P.; Euzébio, T.A.M.; Ramírez, J.A.; Guimarães, F.G. Unmanned-Aerial-Vehicle Routing Problem with Mobile Charging Stations for Assisting Search and Rescue Missions in Postdisaster Scenarios. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 6682–6696.
8. Zhang, J.; Jia, Y.; Zhu, D.; Hu, W.; Tang, Z. Study on the Situational Awareness System of Mine Fire Rescue Using Faster Ross Girshick-Convolutional Neural Network. IEEE Intell. Syst. 2020, 35, 54–61.
9. Dell’Agnola, F.; Jao, P.K.; Arza, A.; Chavarriaga, R.; Millán, J.d.R.; Floreano, D.; Atienza, D. Machine-Learning Based Monitoring of Cognitive Workload in Rescue Missions with Drones. IEEE J. Biomed. Health Inform. 2022, 26, 4751–4762.
10. Men, J.; Jiang, P.; Zheng, S.; Kong, Y.; Zhao, Y.; Sheng, G.; Su, N.; Zheng, S. A Multi-Objective Emergency Rescue Facilities Location Model for Catastrophic Interlocking Chemical Accidents in Chemical Parks. IEEE Trans. Intell. Transp. Syst. 2020, 21, 4749–4761.
11. Rani, S.; Babbar, H.; Kaur, P.; Alshehri, M.D.; Shah, S.H. An Optimized Approach of Dynamic Target Nodes in Wireless Sensor Network Using Bio Inspired Algorithms for Maritime Rescue. IEEE Trans. Intell. Transp. Syst. 2023, 24, 2548–2555.
12. Wang, L.; Zhao, X.; Wu, P. Resource-Constrained Emergency Scheduling for Forest Fires via Artificial Bee Colony and Variable Neighborhood Search Combined Algorithm. IEEE Trans. Intell. Transp. Syst. 2024, 25, 5791–5806.
13. Yang, F.; Wu, N.; Qiao, Y.; Zhou, M.; Su, R.; Qu, T. Modeling and Optimal Cyclic Scheduling of Time-Constrained Single-Robot-Arm Cluster Tools via Petri Nets and Linear Programming. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 871–883.
14. Sun, H.; Wang, S.; Zhou, F.; Yin, L.; Liu, M. Dynamic Deployment and Scheduling Strategy for Dual-Service Pooling-Based Hierarchical Cloud Service System in Intelligent Buildings. IEEE Trans. Cloud Comput. 2023, 11, 139–155.
15. Wei, F.; Wan, Z.; He, H.; Lin, X.; Li, Y. A Novel Scheduling Strategy for Controllable Loads with Power-Efficiency Characteristics. IEEE Trans. Smart Grid 2020, 11, 2151–2161.
16. Xie, J.; Huang, S.; Wei, D.; Zhang, Z. Scheduling of Multisensor for UAV Cluster Based on Harris Hawks Optimization with an Adaptive Golden Sine Search Mechanism. IEEE Sens. J. 2022, 22, 9621–9635.
17. Du, W.; Zhong, W.; Tang, Y.; Du, W.; Jin, Y. High-Dimensional Robust Multi-Objective Optimization for Order Scheduling: A Decision Variable Classification Approach. IEEE Trans. Ind. Inform. 2019, 15, 293–304.
18. Yue, H.; Zhang, Q.; Zeng, X.; Huang, W.; Zhang, L.; Wang, J. Optimal Scheduling Strategy of Electric Vehicle Cluster Based on Index Evaluation System. IEEE Trans. Ind. Appl. 2023, 59, 1212–1221.
19. Du, Y.; Wang, T.; Xin, B.; Wang, L.; Chen, Y.; Xing, L. A Data-Driven Parallel Scheduling Approach for Multiple Agile Earth Observation Satellites. IEEE Trans. Evol. Comput. 2020, 24, 679–693.
20. Min, J.; Kim, W.; Paek, J.; Govindan, R. Effective Routing and Scheduling Strategies for Fault-Tolerant Time-Sensitive Networking. IEEE Internet Things J. 2024, 11, 11008–11020.
21. Zhang, W.; Shi, J.; Zhang, S.; Chen, M. Scenario-Based Robust Remanufacturing Scheduling Problem Using Improved Biogeography-Based Optimization Algorithm. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 3414–3427.
22. Kumar, S.; Soni, S.K.; Kumar Pal, A.; Kamal, S.; Xiong, X. Nonlinear Polytopic Systems with Predefined Time Convergence. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 632–636.
23. Becerra, H.M.; Vázquez, C.R.; Arechavaleta, G.; Delfin, J. Predefined-Time Convergence Control for High-Order Integrator Systems Using Time Base Generators. IEEE Trans. Control Syst. Technol. 2018, 26, 1866–1873.
24. Wang, J.; Wong, W.C.; Luo, X.; Li, X.; Guan, X. Cooperative Platoon Control for Uncertain Networked Aerial Vehicles with Predefined-Time Convergence. IEEE Internet Things J. 2022, 9, 5982–5991.
25. Sánchez Torres, J.D.; Gomez-Gutierrez, D.; López, E.; Loukianov, A. A class of predefined-time stable dynamical systems. IMA J. Math. Control Inf. 2017, 35, i1–i29.
26. Jiménez-Rodríguez, E.; Muñoz-Vázquez, A.J.; Sánchez-Torres, J.D.; Loukianov, A.G. A Note on Predefined-Time Stability. IFAC-PapersOnLine 2018, 51, 520–525.
27. Sánchez-Torres, J.D.; Muñoz-Vázquez, A.J.; Defoort, M.; Jiménez-Rodríguez, E.; Loukianov, A.G. A class of predefined-time controllers for uncertain second-order systems. Eur. J. Control 2020, 53, 52–58.
28. Ni, J.; Liu, L.; Tang, Y.; Liu, C. Predefined-Time Consensus Tracking of Second-Order Multiagent Systems. IEEE Trans. Syst. Man Cybern. Syst. 2021, 51, 2550–2560.
29. Cui, G.; Xu, H.; Xu, S.; Gu, J. Predefined-Time Adaptive Fuzzy Bipartite Consensus Control for Multiquadrotors Under Malicious Attacks. IEEE Trans. Fuzzy Syst. 2024, 32, 2187–2197.
30. Xu, H.; Yu, D.; Sui, S.; Chen, C.L.P. An Event-Triggered Predefined Time Decentralized Output Feedback Fuzzy Adaptive Control Method for Interconnected Systems. IEEE Trans. Fuzzy Syst. 2023, 31, 631–644.
31. Yang, H.; Ye, D. Adaptive fixed-time bipartite tracking consensus control for unknown nonlinear multi-agent systems: An information classification mechanism. Inf. Sci. 2018, 459, 238–254.
32. Cui, G.; Xu, H.; Yu, J.; Lam, H.K. Event-Triggered Distributed Fixed-Time Adaptive Attitude Control with Prescribed Performance for Multiple QUAVs. IEEE Trans. Autom. Sci. Eng. 2023, 1–11.
33. Zuo, Z.; Tie, L. A new class of finite-time nonlinear consensus protocols for multi-agent systems. Int. J. Control 2014, 87, 363–370.
34. Cui, G.; Xu, H.; Chen, X.; Yu, J. Fixed-Time Distributed Adaptive Formation Control for Multiple QUAVs with Full-State Constraints. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4192–4206.
Figure 1. Agent scheduling results for Scenario 1. × represents the rescue area, • represents the agent. (a) The results of random scheduling of agents; the average distance is 6.61. (b) The scheduling results after optimization using the PPO algorithm; the average distance is 5.40.
Figure 2. Agent scheduling results for Scenario 2. × represents the rescue area, • represents the agent. (a) The results of random scheduling of agents; the average distance is 4.88. (b) The scheduling results after optimization using the PPO algorithm; the average distance is 3.72.
Figure 3. Comparison of the average return values between the PPO algorithm (this paper) and the DQN algorithm (traditional).
Figure 4. The trajectories of $y$ and $y_d$.
Figure 5. The trajectory of the tracking error $\mu_1$.
Figure 6. The trajectory of the control input $u$.
Figure 7. The trajectories of the adaptive parameters $\hat{\Phi}_i$ ($i = 1, 2$).
Table 1. The initial position of the agent in Scenario 1.
Agent   Lat      Lon      Agent   Lat      Lon
1       0.0426   1.0557   6       4.5103   7.9949
2       2.8584   0.2695   7       1.4984   4.0147
3       4.7161   0.6012   8       0.5418   4.5941
4       7.7187   7.4370   9       1.7561   9.4918
5       5.9442   8.8786   10      8.4732   8.7486
Table 2. The information of rescue areas in Scenario 1.
Rescue Area   Lat      Lon      Demand
1             6.4831   2.1476   3
2             9.4930   0.1211   3
3             1.8088   1.8772   4
Table 3. The initial position of the agent in Scenario 2.
Agent   Lat      Lon      Agent   Lat      Lon
1       6.1470   3.8101   8       7.1160   2.0503
2       6.3711   4.7446   9       3.0778   9.8086
3       4.4253   0.9577   10      0.1026   4.6604
4       6.1416   0.5733   11      4.6038   8.5465
5       5.6571   5.3323   12      4.5247   6.3166
6       3.9005   9.0885   13      4.7600   2.2003
7       5.3337   7.0734   14      7.1359   6.1904
Table 4. The information of rescue areas in Scenario 2.
Rescue Area   Lat      Lon      Demand
1             2.1660   2.5708   3
2             0.4580   1.7551   3
3             6.1767   8.2907   4
4             5.2463   2.7081   4
Table 5. Hyperparameter configuration.
Parameter                 Value
Time steps                20,000
Number of agents          10/14
Number of rescue areas    3/4
Actor learning rate       0.01
Critic learning rate      0.001
Discount                  0.9
Table 6. Structure of the PPO networks.
Policy Network                   Value Network
Input(c, state_dim)              Input(c, state_dim)
Linear(state_dim, hidden_dim)    Linear(state_dim, hidden_dim)
relu                             softmax
Linear(hidden_dim, 1)            Linear(hidden_dim, 1)

