1. Introduction
Urban air mobility (UAM), a novel air transportation concept designed for urban environments, has garnered significant interest in recent years from the aerospace and transportation sectors [1,2]. UAM's goal is to enhance the efficiency of transporting people and goods through specialized vehicles such as electric vertical take-off and landing (eVTOL) aircraft and small unmanned air vehicles (sUAVs) [3]. These vehicles, characterized by high levels of automation, navigate autonomously from take-off to landing without human operators. However, this innovative transportation approach faces numerous complex challenges, including environmental concerns, urban infrastructure considerations, and issues specific to the UAM platform itself. These include interactions with building infrastructure [1], dense traffic of aerial vehicles [2], micro-weather patterns [3,4], urban emergencies or disasters [1], and the quality and reliability of communication, navigation, and surveillance systems [5,6]. These elements contribute to the formation of various static, dynamic, and uncertain no-fly zones, obstacles, and geo-fences in urban airspace [7], presenting three primary safety challenges for automated aerial vehicles. Firstly, there is a need for real-time, adaptive re-planning in response to environmental uncertainties. Secondly, the nonlinear kinematics and dynamics of these systems, coupled with aerodynamic and power constraints, limit the vehicles' cruising speed and acceleration. Thirdly, rapid and feasible trajectory generation becomes crucial, especially when destination vertiports change due to congestion or when emergency scenarios necessitate on-the-fly adjustments. In this paper, we tackle these issues by integrating methodologies from sampling-based path planning [8], trajectory optimization, recurrent neural networks (RNNs) [9], reinforcement learning (RL) [10], generative adversarial imitation learning (GAIL) [11], and transformers [12]. Our approach aims to enable efficient, real-time re-planning under uncertain airspace conditions.
In our study, we have developed two distinct algorithms for addressing the challenges of path planning and trajectory optimization in urban air mobility (UAM) operations: a coupled approach and a decoupled one. These algorithms represent a significant advancement over traditional path planning methods, primarily due to their markedly reduced computational times, which enable real-time implementation in UAM scenarios. The coupled algorithm is designed to ensure that the vehicle adheres to kinematic and dynamic constraints while also achieving the quickest possible journey to its destination; it integrates the constraints directly into the planning process, thereby ensuring efficient and safe navigation through urban airspace. The decoupled algorithm, on the other hand, focuses on optimizing straight path segments, making it particularly suitable for scenarios where the aerial vehicle must adhere to strict scheduling constraints. It first plans the path and then optimizes the trajectory, allowing greater flexibility in dealing with dynamic urban environments.
Central to both approaches is the use of tree data generated by the RRT* algorithm or its variants. The RRT* algorithm is known for its asymptotic optimality and probabilistic completeness [13], characteristics that are critical in ensuring a comprehensive exploration of the configuration space to yield feasible and optimal paths. However, RRT*-based algorithms typically require extensive exploration time, making them unsuitable for real-time path planning applications. Our algorithms address this limitation by utilizing the tree data generated from previous RRT* explorations. These tree data consist of a series of nodes, each with specific attributes such as coordinates, cost to the tree root, and the index of its parent node; in more complex models, the nodes may also include velocity and acceleration data. In our tree structure, all nodes maintain optimal paths from the tree root, which is defined as the current position of the vehicle or the position of the departure vertiport.
The primary innovation in our approach lies in the real-time update of these tree data in dynamic environments. We aim to continuously guarantee the optimized attributes and connections (parent indices) of the nodes on the tree, as well as the feasibility and optimality of the flight path to the destination. Traditional graph-rewiring techniques based on RRT*, which involve significant forward and backward propagation, are too time-consuming for real-time updates [14]. Instead, we utilize RL policies or GAIL generators to guide the update of tree connections and node attributes, replacing computationally expensive calculations with more efficient neural network inference.
Moreover, our approaches offer considerable flexibility in UAM operations. They can adapt to changes in the destination of the aerial vehicle at any moment, generating a feasible and optimal path from the vehicle's current position to the new destination in real time. This flexibility is a direct result of the intensive exploration that spreads tree nodes throughout the configuration space and of the inherent nature of the tree data, in which the paths from the root to all nodes are optimized. To the best of our knowledge, this paper introduces some of the first methodologies capable of performing on-the-fly updates of the destination and optimizing the path and trajectory in real time while maintaining a comprehensive exploration of the environment. This capability is especially valuable in urban settings, where sudden changes in destination or route may be necessary due to unforeseen circumstances such as traffic congestion, weather conditions, or emergency situations.
Furthermore, our algorithms are designed to be robust against the dynamic and unpredictable nature of urban airspace. By continuously updating the tree data based on the latest environmental information, our approaches ensure that the trajectory remains feasible and optimal even in the face of rapid changes in the urban landscape, which is crucial for maintaining safety standards and operational efficiency in UAM applications. In summary, the coupled and decoupled algorithms proposed in this paper represent a significant step forward in UAM path planning and trajectory optimization. By leveraging advanced machine learning techniques, such as RL and GAIL, combined with the efficient use of RRT*-based tree data, our approaches not only reduce computational time but also enhance the flexibility and robustness of UAM operations. This makes them highly suitable for the dynamic and complex environment of urban air mobility, where real-time updates and adaptability are key to successful operation.
The structure of this paper is outlined as follows. Section 2 delves into the existing literature relevant to our study, highlighting both the contributions and limitations of previous methods in motion planning and re-planning. Section 3 introduces a novel decoupled planning algorithm that leverages reinforcement learning and a piecewise polynomial trajectory generation method to produce minimum-snap trajectories with significantly reduced computation times, while also ensuring comprehensive environmental exploration. Section 4 discusses our coupled planning algorithm, which employs a generative adversarial imitation learning framework; this approach generates time-optimized trajectories with remarkably short computation durations while maintaining a thorough exploration of the configuration space. Finally, Section 5 provides the concluding remarks of this paper, summarizing our key findings and contributions to the field of motion planning in urban air mobility.
4. Approach Based on Generative Adversarial Imitation Learning
In this section, we propose a coupled planning algorithm that generates the time-optimal trajectory in an extremely short computation time via a generative adversarial imitation learning (GAIL) algorithm [11], while guaranteeing sufficient exploration of the environment. This method also performs motion planning and re-planning on the basis of existing tree data, but the algorithm used to produce the tree data is kinodynamic RRT* [16]. There are two main reasons why kinodynamic RRT* is computationally expensive. First, a complete run requires a large number of iterations. Second, in kinodynamic RRT*, evaluating the connection cost between two states requires solving a two-point boundary value problem (TPBVP), which is usually complicated because the dynamic transition from one state to another must be considered. Unlike RRT*, when the goal is to generate a minimum-time trajectory, the step size of kinodynamic RRT* is not a Euclidean distance but a time difference; in addition, the connection between tree nodes generated by kinodynamic RRT* is not a straight line but a curved trajectory that conforms to kinematic and dynamic constraints [26]. When the goal of kinodynamic RRT* is the minimum-time trajectory, using time as the step size does provide some intuitive benefits for solving the TPBVP compared to using Euclidean distance: the TPBVP is simplified when the exact time difference between two points is known. Furthermore, since the goal of the optimization is to minimize time, making the step size correspond to time is intuitive and helps to ensure that the generated path is consistent with the optimization goal. In the overall algorithm, whenever a node in the tree is connected to a new random sample, or to another node in the 'choose parent' and 'rewire' phases, the existence of a feasible trajectory satisfying the dynamic constraints must be checked. This requires solving a TPBVP in almost every iteration, which leads to a very long processing time. Therefore, in order to make the processing time of our algorithm short enough to meet real-time requirements, we need to bypass both the large number of iterations and the challenging two-point boundary value problems.
Each node in the kinodynamic RRT* tree data has more attributes than a node in RRT*. Each kinodynamic RRT* tree node contains its own index, the index of its parent, and the time cost to reach the node from the starting point, as well as the coordinate, velocity, and acceleration in each dimension. If we still used the reinforcement-learning-based approach, this more complex tree data would lead to bloated reward functions, greatly increasing the difficulty of designing appropriate rewards and violating our intention of pursuing simpler but effective approaches. Hence, we utilize an approach based on GAIL, which has advantages in implementation and hyper-parameter tuning because, unlike RL algorithms, it does not require the design of a reward function.
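To make this data layout concrete, the following sketch shows one possible representation of a kinodynamic RRT* tree node with the attributes listed above; the field names and Python structure are illustrative assumptions, not the exact implementation used in this work.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KinodynamicNode:
    """One node of the kinodynamic RRT* tree (illustrative field names)."""
    index: int                          # index of the node itself
    parent: Optional[int]               # index of the parent node (None for the tree root)
    time_cost: float                    # time to reach this node from the starting point
    position: tuple[float, float]       # coordinates in each dimension (2D case)
    velocity: tuple[float, float]       # per-dimension velocity at the node
    acceleration: tuple[float, float]   # per-dimension acceleration at the node

# The tree is then a list of nodes indexed by `index`; the root is the vehicle's
# current position or the departure vertiport, with zero initial time cost.
tree: list[KinodynamicNode] = [
    KinodynamicNode(0, None, 0.0, (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)),
]
```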
GAIL requires the agent to interact with the environment but cannot obtain rewards from the environment. In addition, GAIL needs an object to be imitated and, in practice, it needs to collect the decision-making records of the imitated object. In our method, this imitated object is kinodynamic RRT*. Equation (14) shows the decision-making record created by kinodynamic RRT*, in which each state remains the same for five consecutive time steps in two-dimensional motion planning and denotes the node index and positions, while the corresponding actions represent the parent selection and the single-dimensional velocities and accelerations of that state.
GAIL consists of a generator and a discriminator. The generator is a policy network that makes decisions, and we use PPO [27] to train it. The target of GAIL is to learn a policy network such that the discriminator cannot distinguish whether a decision was made by the policy network or by the imitated object. The discriminator is a neural network trained by gradient descent; training makes the discriminator more accurate in determining where decisions come from, and this adversarial process of mutual improvement is the main idea of GAIL.
As mentioned previously, PPO is used as the algorithm for training the policy network. Similar to the method in Section 3, we cannot use the entire tree data as input to the policy network: the tree data contain many nodes, each with several attributes, which makes the tree data a very high-dimensional vector that PPO is unable to process. Therefore, we again apply a recurrent neural network as the policy network (the generator) of GAIL to handle this problem.
Figure 8 shows the details of the RNN policy network.
Using the RNN policy network is the key to bypassing the large number of iterations and the challenging two-point boundary value problems: solving the TPBVP is replaced by neural network inference, and the tree is built not by iterations but by the sequential RNN tokens. The RNN policy network of GAIL still contains node structures whose number equals the number of nodes in the tree data. Each node structure can be regarded as a set of single-input, multi-output RNN sequences. In two-dimensional motion planning, each node structure has five tokens, and each token outputs an action.
The design of the first token in each node structure is very similar to that in Section 3: the input is the pre-processed state, which contains a node index and position, and its output is a one-hot vector over a fixed number of candidate parent nodes, representing the selection of the parent node. The number of candidate parents is a manually tuned hyper-parameter that affects the speed of convergence of the learning algorithm and the results of training. Before training starts, the algorithm iterates over the tree data and augments each tree node with its nearest nodes as candidate parents; closer nodes also correspond to shorter arrival times when the initial velocity and acceleration are zero. Equations (15), (17) and (18) display the pre-processing and input formation, probability distribution generation, and action determination of the first token.
In these equations, the input to the first token at time t is formed from the one-hot encoding of the node index and from the node position (coordinates), each passed through its own embedding matrix; the GRU function takes this input, the previous hidden state, and the GRU parameters and produces the hidden state of the current time step; a weight matrix and bias vector then map the hidden state to the probability distribution output by the first token; and the one-hot encoded action representing the selection of the parent node is obtained by taking the index of the maximum value of this distribution, i.e., the most probable parent node.
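As a concrete illustration of the first-token computation described by Equations (15), (17) and (18), the sketch below assembles an index embedding, a position embedding, a GRU cell, and a softmax over candidate parents in PyTorch; the module names, dimensions, and the use of a single GRU cell are illustrative assumptions rather than the exact architecture of Figure 8.

```python
import torch
import torch.nn as nn

class FirstToken(nn.Module):
    """Parent-selection token: node index + position -> distribution over candidate parents."""
    def __init__(self, num_nodes: int, num_candidates: int, emb_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.idx_embed = nn.Embedding(num_nodes, emb_dim)   # embedding of the one-hot node index
        self.pos_embed = nn.Linear(2, emb_dim)               # embedding of the 2D node position
        self.gru = nn.GRUCell(2 * emb_dim, hidden_dim)       # recurrent update of the hidden state
        self.out = nn.Linear(hidden_dim, num_candidates)     # logits over the candidate parent nodes

    def forward(self, node_idx, node_pos, h_prev):
        x = torch.cat([self.idx_embed(node_idx), self.pos_embed(node_pos)], dim=-1)
        h = self.gru(x, h_prev)                               # hidden state of the current time step
        probs = torch.softmax(self.out(h), dim=-1)            # probability distribution over parents
        action = probs.argmax(dim=-1)                         # greedy parent selection via argmax
        return action, probs, h

# Example usage with hypothetical sizes: 1200 tree nodes, 8 candidate parents per node.
token = FirstToken(num_nodes=1200, num_candidates=8)
h0 = torch.zeros(1, 64)
parent, probs, h1 = token(torch.tensor([5]), torch.rand(1, 2), h0)
```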
The output of the second token of each node is a single-dimensional velocity whose value depends on the position of the node itself and the choice of the parent node, and is influenced by other nodes; the hidden layer of the RNN carries the index and position of the current node as well as historical information about other nodes. The input of the second token is the output of the first token (the selection of the parent) after passing through an embedding layer. The output of the third token of each node is also a single-dimensional velocity, whose value depends on the position of the node itself, the choice of the parent node, and the velocity already obtained for the other dimension, and is likewise influenced by other nodes; its input is therefore the output of the second token after passing through an embedding layer. The input of the fourth token is the output of the third token after the embedding layer, since the output of the fourth token is a single-dimensional acceleration that also depends on velocity. The input of the fifth token is formed in the same way. In the second to fifth tokens of each node, we utilize a gained tanh (tanh scaled by a gain factor) as the activation function to output the velocity and acceleration in each dimension. Because tanh is an odd, monotonically increasing function, the RNN policy network performs equally well whether the output velocity and acceleration in each dimension are positive or negative. The gain is determined in accordance with the velocity and acceleration constraints. Equations (19) and (20) describe the processing of the second to fifth tokens.
In these equations, the input for the i-th token is computed from the output of the previous token through an embedding matrix specific to the i-th token; the output of the i-th token, which is a single-dimensional velocity or acceleration, is obtained by applying a weight matrix and bias vector to the hidden state of the RNN at the current time step and passing the result through the tanh activation function, chosen for its symmetric properties, scaled by a gain factor that keeps the output within the velocity and acceleration constraints.
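A corresponding sketch of one of the second to fifth tokens, in the spirit of Equations (19) and (20), is given below: the previous token's output is embedded, passed through the GRU, and mapped through a gain-scaled tanh so that the velocity or acceleration output respects its bound; the gain value and layer sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class ScalarOutputToken(nn.Module):
    """Velocity/acceleration token: previous token's output -> bounded scalar via gained tanh."""
    def __init__(self, in_dim: int, emb_dim: int = 32, hidden_dim: int = 64, gain: float = 5.0):
        super().__init__()
        self.embed = nn.Linear(in_dim, emb_dim)     # token-specific embedding of the previous output
        self.gru = nn.GRUCell(emb_dim, hidden_dim)  # recurrent state carried across tokens
        self.out = nn.Linear(hidden_dim, 1)         # scalar pre-activation
        self.gain = gain                            # e.g. the velocity or acceleration limit

    def forward(self, prev_output, h_prev):
        x = self.embed(prev_output)
        h = self.gru(x, h_prev)
        value = self.gain * torch.tanh(self.out(h))  # output bounded in (-gain, +gain)
        return value, h

# Example: the second token takes the parent selection and outputs one velocity component,
# with the gain assumed equal to a 5 m/s velocity limit.
token_v = ScalarOutputToken(in_dim=8, gain=5.0)
h = torch.zeros(1, 64)
parent_one_hot = torch.zeros(1, 8); parent_one_hot[0, 3] = 1.0
v_x, h = token_v(parent_one_hot, h)
```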
It is worth mentioning that the four embedding layers between consecutive tokens in each node structure are different, as they perform different tasks, whereas all node structures share this set of four embedding layers because they execute the same tasks in every node. The output of each embedding layer is concatenated with a corresponding type embedding vector, indicating whether the token performs parent selection or outputs a velocity or acceleration of a particular dimension, before being fed into the recurrent unit. Similar to the approach proposed in Section 3, we do not use a standard RNN but a GRU (Equation (16)) to avoid the problems of vanishing and exploding gradients, and each token has two hidden layers. Equation (21) displays the decision-making record created by the RNN policy network, in which each state again remains the same for five consecutive time steps in two-dimensional motion planning and denotes the node index and positions, the corresponding actions represent the parent selection and the single-dimensional velocities and accelerations of that state, and m is the number of tokens in the RNN policy network.
The discriminator of GAIL is a neural network whose structure is shown in Figure 9. The discriminator is essentially a binary classifier: its output, parameterized by the neural network weights, represents a judgment of authenticity. The closer the output is to 1, the more likely the action is judged to be true, that is, produced by kinodynamic RRT*; the closer the output is to 0, the more likely it is judged to be false, that is, generated by the policy network.
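To make the binary-classifier view concrete, a minimal sketch of a discriminator operating on concatenated state and action vectors is shown below; the layer sizes and input dimensions are assumptions and do not reproduce the exact structure of Figure 9.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Binary classifier: output near 1 -> decision judged to come from kinodynamic RRT*,
    output near 0 -> decision judged to come from the RNN policy network."""
    def __init__(self, state_dim: int = 3, action_dim: int = 3, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),   # squashes the authenticity score into (0, 1)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```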
The goal of training GAIL is to make the generator (the RNN policy network) produce decision records that are as good as those of the imitated object; at the end of training, the discriminator cannot distinguish between the generator's decision records and the imitated object's decision records. Therefore, while training the generator, we must train the discriminator simultaneously, and only if the discriminator is good enough will a generator that can fool it obtain satisfactory results. When training the discriminator, we encourage it to make more accurate judgments: we want the discriminator to recognize the decisions of kinodynamic RRT* as true, so its output for those decisions is encouraged to be as large as possible, and we want it to recognize the decisions of the policy network as false, so its output for those decisions is encouraged to be as small as possible. Equation (22) defines the corresponding loss function. We expect this loss to be as small as possible, so the discriminator parameters are updated by gradient descent, as shown in Equation (23).
In Equation (23), the discriminator parameters are updated by a gradient descent step with a given learning rate and the gradient of the loss. The larger the discriminator output for a decision generated by the RNN policy network, the more similar that decision is to those generated by kinodynamic RRT*, and the more successful the imitation learning; therefore, we substitute the reward used by the policy update with the discriminator-based quantity defined in Equation (24).
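The sketch below shows one standard way to implement the discriminator update and the reward substitution just described. Since Equations (22)-(24) are not reproduced here, the binary cross-entropy form of the loss and the -log(1 - D) reward are assumptions taken from the original GAIL formulation [11] and may differ in detail from our equations.

```python
import torch

def discriminator_step(disc, optimizer, expert_s, expert_a, policy_s, policy_a):
    """One gradient-descent step: push D towards 1 on expert data and towards 0 on policy data."""
    d_expert = disc(expert_s, expert_a)
    d_policy = disc(policy_s, policy_a)
    loss = -(torch.log(d_expert + 1e-8).mean() + torch.log(1.0 - d_policy + 1e-8).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def surrogate_reward(disc, state, action):
    """Reward for PPO: larger when the discriminator believes the decision is an expert one."""
    with torch.no_grad():
        d = disc(state, action)
    return -torch.log(1.0 - d + 1e-8)   # one common GAIL choice; the paper's Equation (24) may differ
```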
Then, using this surrogate reward, we can apply the PPO algorithm to train the RNN policy network of GAIL with Equations (25) and (26), in which a hyper-parameter controls the clipping ratio, the objective is an expectation over the collected decisions, the advantage function is evaluated under the old policy, and the probability ratio compares the probabilities of the new and old policies.
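For reference, a minimal sketch of the PPO clipped surrogate objective referred to by Equations (25) and (26) follows; it implements the standard clipped-ratio form of PPO [27], with the log-probabilities, advantages, and clipping hyper-parameter supplied by the caller.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps: float = 0.2):
    """Standard PPO clipped surrogate loss (to be minimized)."""
    ratio = torch.exp(log_probs_new - log_probs_old)            # probability ratio pi_new / pi_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Elementwise minimum of the unclipped and clipped objectives, negated for minimization.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```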
Algorithm 3 shows the details of the training process. The training performance of the RNN policy network acting as the generator of GAIL is strongly dependent on the hyper-parameters of the PPO algorithm. The discriminator that outputs rewards for the PPO algorithm is also a neural network whose performance depends on the choice of hyper-parameters. Therefore, optimizing hyper-parameters is essential, as this approach is highly sensitive to them. Among the commonly used hyper-parameter optimization methods for machine learning, Bayesian hyper-parameter optimization has shown advantages in both accuracy and efficiency compared to grid search and random search [38], since this optimization problem has no explicit objective function expression.
Algorithm 3: Train RNN policy network using GAIL with PPO updates.
Initialization: initialize the RNN policy network and the GAIL discriminator; initialize the environment and the imitated decision-making algorithm (kinodynamic RRT*); define the loss functions for the generator and the discriminator.
Training loop: alternately collect decision records, update the discriminator (Equations (22) and (23)), and update the generator with PPO using the discriminator-based reward (Equations (24)-(26)).
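Read alongside Algorithm 3, the sketch below outlines how these pieces could be alternated in a training loop; the functions that collect expert decision records and policy rollouts, as well as the discriminator-update, reward, and PPO-update callables, are hypothetical placeholders passed in by the caller.

```python
def train_gail(policy, disc, disc_optimizer, num_iterations,
               collect_expert_batch, collect_policy_rollout,
               disc_step_fn, reward_fn, ppo_update_fn):
    """Alternate discriminator updates with PPO updates of the generator (RNN policy network)."""
    for _ in range(num_iterations):
        # 1. Decision records from the imitated object (kinodynamic RRT*) and from the generator.
        expert_s, expert_a = collect_expert_batch()
        policy_s, policy_a, rollout = collect_policy_rollout(policy)

        # 2. Update the discriminator to separate expert decisions from generated ones.
        disc_step_fn(disc, disc_optimizer, expert_s, expert_a, policy_s, policy_a)

        # 3. Relabel the rollout with the discriminator-based reward and run a PPO update.
        rollout.rewards = reward_fn(disc, policy_s, policy_a)
        ppo_update_fn(policy, rollout)
```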
The core of the Bayesian optimization method consists of a surrogate model and an acquisition function. In our approach, a Gaussian process (GP) model is applied as the surrogate model. A Gaussian process is a joint distribution over a collection of random variables, any finite subset of which follows a normal distribution. Based on this model, the distribution of the objective function can be estimated from the posterior mean, and the uncertainty at each position can be obtained from the posterior variance, where the input is a set of hyper-parameters and the objective value is the mean ratio of the time cost of the trajectories generated by the RNN policy network to that of the benchmark algorithm, kinodynamic RRT*, in the Monte Carlo simulation. In detail, we randomly generated 1000 sets of different starting points, target points, and obstacle positions for the Monte Carlo simulation, with the Euclidean distance between each starting point and its target point greater than a threshold. Our proposed method and the baseline classical method each generate a trajectory in every environment; we then compute the ratio of the time costs of the two trajectories and finally obtain the mean ratio over the 1000 environments.
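As a sketch of the objective that the GP surrogate models, the function below evaluates one hyper-parameter set with the Monte Carlo procedure described above; the environment sampler and the two trajectory-timing callables are hypothetical placeholders.

```python
def mean_time_ratio(hyper_params, sample_env, policy_time_fn, rrt_time_fn, num_envs: int = 1000):
    """Objective for Bayesian optimization: mean ratio of policy trajectory time to kinodynamic RRT* time."""
    ratios = []
    for _ in range(num_envs):
        env = sample_env()                            # random start/goal (distance above a threshold) and obstacles
        t_policy = policy_time_fn(env, hyper_params)  # travel time of the RNN-policy trajectory
        t_rrt = rrt_time_fn(env)                      # travel time of the kinodynamic RRT* trajectory
        ratios.append(t_policy / t_rrt)
    return sum(ratios) / len(ratios)
```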
After constructing the surrogate model, the acquisition function is used to determine the next set of hyper-parameters, trading off exploration (sampling from high-uncertainty areas) against exploitation (sampling from high-value areas). The process is iterated multiple times until it is close to the global optimum. The next set of hyper-parameters is determined by maximizing the acquisition function given the n observation points collected so far, as shown in Equation (27).
The expected improvement (EI) is a common choice of acquisition function, and it can be evaluated under the GP model as in Equation (28) [39], in which the standard normal density and the standard normal distribution function appear. The first term in Equation (28) is used for exploitation, the second term is used for exploration, and an exploration parameter determines the proportion of exploration.
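Since Equation (28) itself is not reproduced in the text, the sketch below evaluates the standard EI acquisition with an exploration parameter under the GP posterior, written for a minimization objective (a lower mean time ratio is better); this follows the usual formulation [39] and is an assumption about the exact form used.

```python
import math

def expected_improvement(mu: float, sigma: float, best_f: float, xi: float = 0.01) -> float:
    """EI of a candidate hyper-parameter set under the GP posterior (minimization form).

    mu, sigma : GP posterior mean and standard deviation at the candidate point.
    best_f    : best (lowest) objective value observed so far.
    xi        : exploration parameter trading off exploitation and exploration.
    """
    if sigma <= 0.0:
        return 0.0
    z = (best_f - mu - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal distribution function
    return (best_f - mu - xi) * cdf + sigma * pdf             # exploitation term + exploration term
```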
Additionally, when the vehicle that needs to perform motion planning changes, that is, when the speed and acceleration limits of the vehicle change, we do not need to retrain an RNN policy network from scratch but instead perform fine-tuning on the basis of the previously trained RNN policy network. Fine-tuning performs the following five steps:
Changing the gain values of the gained tanh activation function;
Setting independent learning rates for the two hidden layers of the RNN policy network;
Fine-tuning by GAIL with fewer episodes;
Applying a Bayesian hyper-parameter optimization method;
Using the greedy soup recipe of the model soups method [40].
Model soups is a method that averages the weights of multiple models fine-tuned with different hyper-parameter configurations, improving accuracy and robustness without incurring any additional inference or memory costs. The greedy soup is constructed by sequentially adding each model as a potential ingredient of the soup and keeping the model in the soup only if it improves the performance on a held-out validation set, evaluated here by Monte Carlo simulation results. Before running this procedure, we sort the models in decreasing order of validation set accuracy, so the greedy soup can be no worse than the best individual model on the held-out validation set [40].
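A compact sketch of the greedy soup procedure is given below; it operates on plain parameter dictionaries, and the validation function (here, a Monte Carlo evaluation) is a caller-supplied placeholder.

```python
def greedy_soup(models, validate):
    """Greedy soup [40]: average weights of fine-tuned models, keeping only those that help.

    models   : list of state dicts (parameter-name -> tensor), sorted by the caller
               in decreasing order of validation accuracy.
    validate : callable mapping a state dict to a validation score (higher is better),
               e.g. based on Monte Carlo simulation results.
    """
    soup = [models[0]]                      # start from the best individual model
    best_score = validate(models[0])
    for candidate in models[1:]:
        trial = soup + [candidate]
        averaged = {k: sum(m[k] for m in trial) / len(trial) for k in trial[0]}
        score = validate(averaged)
        if score >= best_score:             # keep the candidate only if the soup does not get worse
            soup = trial
            best_score = score
    return {k: sum(m[k] for m in soup) / len(soup) for k in soup[0]}
```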
Figure 10 shows the results produced by the approach based on GAIL. Our algorithm successfully generates a collision-free tree with curved connections and obtains a safe trajectory. Figure 10a shows planning before flight, and Figure 10b shows re-planning during flight. In addition, a GRU is utilized as the RNN policy network, and cuDNN acceleration is applied. Based on tree data of 1200 nodes, only an extremely short computation time is required to generate the trajectory using the trained policy network, around three seconds on our device, whereas completing 1200 iterations of kinodynamic RRT* takes more than two minutes.
Table 2 displays the average processing time and average normalized vehicle travel time of the benchmark method and the three methods used for comparison under 50 different situations of environment 1, as well as the average processing time and mean normalized vehicle travel time of our approach's Monte Carlo simulations. Kinodynamic RRT* is our benchmark algorithm; it needs a large number of iterations and requires solving a TPBVP in almost every iteration, which leads to a very long processing time. Learning-based kinodynamic RRT* [41] replaces solving the TPBVP in the 'choose parent' and 'rewire' phases with neural network inference, thus significantly accelerating the algorithm; however, its processing time is still not short enough, as a large number of iterations is still required. The stable sparse RRT (SST) [42] algorithm achieves optimality guarantees without requiring optimal boundary value problem solutions and only requires forward dynamic propagation of random actions from a selected node, but such a technique is prone to 'wandering' through the state space and can take a long time to identify a solution [43]. Since our approach bypasses both the large number of iterations and the challenging two-point boundary value problems, its processing time is significantly shorter, and the average vehicle travel time is also close to that of the baseline algorithm. Moreover, our approach produces significantly better trajectories than the 'directly imitate trajectory using kinodynamic RRT*' method, which reflects the necessity and superiority of the RNN policy network.
Figure 11a shows the Monte Carlo simulation results produced by the approach based on GAIL. After applying the Bayesian hyper-parameter optimization method, the properties of the resulting trees, including the velocity, acceleration, and the time it takes to reach each node, are close to the results of kinodynamic RRT*. Figure 11b displays the Monte Carlo simulation results for fine-tuning based on the originally trained RNN policy network; in this scene, the velocity limit is increased by half and the acceleration limit is doubled. As shown in Table 2, with Bayesian hyper-parameter optimization and the greedy soup, the performance of the policy network fine-tuned for a new vehicle is close to that of the original policy network.
Figure 12 shows a trajectory generated by our GAIL-based approach in environment 2, which includes five dynamic obstacles and three static obstacles. Table 3 displays the average processing time and average normalized vehicle travel time of the benchmark method, the three methods used for comparison, and our approaches under 50 different situations of environment 2. Our methods perform relatively better in environments with more obstacles and fewer effective samples.