Article

Multiple UAS Traffic Planning Based on Deep Q-Network with Hindsight Experience Replay and Economic Considerations

by Shao Xuan Seah and Sutthiphong Srigrarom *
Mechanical Engineering Department, National University of Singapore, Singapore 117411, Singapore
* Author to whom correspondence should be addressed.
Aerospace 2023, 10(12), 980; https://doi.org/10.3390/aerospace10120980
Submission received: 6 October 2023 / Revised: 11 November 2023 / Accepted: 13 November 2023 / Published: 22 November 2023
(This article belongs to the Collection Air Transportation—Operations and Management)

Abstract:
This paper explores the use of deep reinforcement learning in solving the multi-agent aircraft traffic planning (individual paths) and collision avoidance problem for multiple UAS, such as a cargo drone network. Specifically, the Deep Q-Network (DQN) with Hindsight Experience Replay framework is adopted and trained on a three-dimensional state space that represents a congested urban environment with dynamic obstacles. By formalising a Markov decision process (MDP), various flight and control parameters are varied between training simulations to study their effects on agent performance. Both fully observable MDPs (FOMDPs) and partially observable MDPs (POMDPs) are formulated to understand the role of shaping reward signals on training performance. While conventional traffic planning and optimisation techniques are evaluated based on path length or time, this paper incorporates economic analysis by considering tangible and intangible sources of cost, such as the cost of energy, the value of time (VOT) and the value of reliability (VOR). By integrating multiple cost sources, this paper is better able to gauge the impact of various parameters on efficiency. To further explore the feasibility of multiple UAS traffic planning, such as cargo drone networks, the trained agents are also subjected to multi-agent point-to-point and hub-and-spoke network environments. In these simulations, delivery orders are generated using a discrete event simulator whose arrival rate is varied to investigate the effect of travel demand on economic costs. Simulation results point to the importance of signal engineering, as reward signals play a crucial role in shaping reinforcement. The results also reflect an increase in costs in environments where congestion and arrival time uncertainty arise from the presence of other agents in the network.

1. Introduction

Modern-day consumerism rests on a foundation of a complex and interconnected global supply chain network. Online marketplaces and the promise of cheap and quick deliveries place immense stress on logistics networks, which compete for scarce transport resources in an already congested environment. The COVID-19 pandemic greatly exacerbated these issues by sidelining workers and delaying shipments, leaving gaps in the system that sent supply shocks through the economy.
As road networks in densely populated urban areas are increasingly congested, the development of an aerial cargo drone network is a natural next step towards reducing the land-based last-mile delivery bottleneck, the journey from distribution warehouses to final users. An example is shown in Figure 1 [1]. With the airspace projected to be shared by a growing number of users, a fair and efficient traffic control system must, therefore, be adopted to ensure safe and continual operation.
Tall and often irregular structures in congested urban environments pose many challenges to generating optimal traffic flows and individual aircraft trajectories. Conventional collision avoidance techniques, such as potential fields (a repulsive force around obstacles and an attractive force towards the destination), take longer to compute and may give suboptimal solutions. In addition, collision avoidance manoeuvres are usually formulated as convex optimisation problems even though the underlying problem may not be convex; the outcome then depends on the initial guess, the result can likewise be suboptimal, and the optimisation itself requires additional computing time.
Reinforcement learning, however, surpasses other forms of control when individual agents following simple rules create systems that may not have easy closed-form analytical solutions. Despite each drone behaving within a well-defined set of dynamic and control parameters, the sheer number of combinations of sequential actions leads to an exponential growth of possibilities, analogous to the many different possible games and outcomes of chess. Only a sufficiently complex model trained on large sets of simulated data may possibly converge to an optimal solution.
Therefore, this paper explores the use of a technique in reinforcement learning known as the Deep Q-Network (DQN) to search for factors that lead to optimal drone traffic and individual trajectories. An analysis of economic costs across models trained on modified parameters also provides a sound understanding of these effects in different environments.
Because it is a relatively simple model, this Double Deep Q-Network with experience replay could potentially be deployed on the onboard systems of drones for real-world missions. The main limitation is the integration with onboard distance sensors and communication with other drones.

Advantage of the Proposed Approach

While conventional traffic planning and optimisation techniques are evaluated based on path length or time, this paper incorporates economic analysis by considering tangible and intangible sources of cost, such as the cost of energy, the value of time (VOT) and the value of reliability (VOR). This is more practical and has a clear advantage over the conventional approach.

2. Literature Review

2.1. Graph Search for Path Planning

The path length is a useful heuristic for finding efficient aircraft trajectories. Graph networks model locations as nodes and distances as edges to solve for the shortest trajectory between two points. For instance, graph-based methods in the class of shortest path algorithms, such as Dijkstra's algorithm and the A* algorithm, use dynamic programming to iteratively search for the shortest path [2]. Tree-based methods, such as the rapidly-exploring random tree* (RRT*), on the other hand, circumvent the inefficiencies arising from random walks in graph-based searches by rapidly growing search trees towards unexplored regions of the space [3].
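As an illustration of this class of methods, the following is a minimal sketch of a grid-based shortest-path search (A* with a Manhattan-distance heuristic on a 6-connected voxel grid, matching the unit-step action set used later in this paper); the function and variable names are illustrative and not taken from any cited implementation.

```python
import heapq

def a_star_3d(grid, start, goal):
    """Illustrative A* on a cubic 3D occupancy grid (1 = obstacle, 0 = free).

    Uses the Manhattan distance as an admissible heuristic for 6-connected
    (unit step) motion; returns the list of visited cells, or None.
    """
    def h(p):
        return sum(abs(a - b) for a, b in zip(p, goal))

    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    n = len(grid)
    open_set = [(h(start), 0, start, [start])]   # (f, g, node, path)
    visited = set()
    while open_set:
        f, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path                          # shortest 6-connected path
        if node in visited:
            continue
        visited.add(node)
        for dx, dy, dz in moves:
            nxt = (node[0] + dx, node[1] + dy, node[2] + dz)
            if all(0 <= c < n for c in nxt) and grid[nxt[0]][nxt[1]][nxt[2]] == 0:
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None                                  # no feasible path
```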
However, several limitations of graph-based methods exist, with the most important being the assumption that the shortest path is equivalent to the most dynamically efficient path. Graph networks fail to consider the dynamic history of drones as they assign a fixed cost for traversals between adjacent nodes, whereas the current momentum and energy of these drones may alter these costs significantly. Drone networks are also expected to operate in relatively large environments, and the time complexities of graph-based algorithms do not scale well with the size or resolution of the search space. Therefore, other approaches to path planning need to be adopted instead.

2.2. Optimal Control

The modern approach to optimal control involves the formulation of a Bolza problem comprising an objective function of terminal and running cost components, Φ and L, as shown in (1), which are minimised based on a dynamic constraint d, path constraint τ and boundary constraint ϕ [4]. The optimisation problem may be solved by using advanced numerical methods in non-linear programming, which is beyond the scope of this paper.
$$ \min_{x,\,u,\,t_0,\,t_f} \; \Phi\big(x(t_0), t_0, x(t_f), t_f\big) + \int_{t_0}^{t_f} L\big(x(t), u(t), t\big)\, dt \qquad (1) $$
subject to $\dot{x}(t) = d\big(x(t), u(t), t\big)$, $\tau\big(x(t), u(t), t\big) \le 0$ and $\phi\big(x(t_0), t_0, x(t_f), t_f\big) = 0$.
Unfortunately, most of these constrained optimisation problems are not analytically solvable and must be numerically solved for each problem through approximations or simulation-based methods, especially those of real-world problems. Furthermore, irregular state boundaries give rise to additional complexities in formulating solutions. Therefore, a new approach to solving optimal control problems is a better option.

2.3. Reinforcement Learning

Reinforcement learning allows agents with no prior information to learn an optimal policy through agent–environment interactions, without the need for retraining when new environments are encountered. The goal of reinforcement learning is to generate an optimal policy π ( s , a ) , representing the probability of taking an action a at each state s.
Deep Q-Networks arise from a combination of Q-learning and deep neural networks. Mnih et al. first used a DQN to train an agent to play several Atari 2600 computer games, with the agent performance surpassing all previously trained agents for six out of seven games and exceeding the level of an expert human player for three of them [5]. By iteratively computing expected future rewards using (2), known as the Bellman optimality equation, agents are trained to perform actions derived from a pre-determined policy (such as ϵ -greedy) according to the perceived cumulative discounted future reward of the state–action values Q ( s , a ) generated by the trained neural network [6].
$$ Q^{*}(s, a) = \mathbb{E}\Big[ R_t + \gamma \max_{a'} Q^{*}(s', a') \Big] \qquad (2) $$
DQN is an off-policy algorithm that allows the neural network to be updated by Q-values of the subsequent state–action pair not derived from the actual policy. Such off-policy behaviour allows for better exploration and a faster convergence to an optimal path as compared with an on-policy algorithm, such as SARSA [7]. DQN also uses TD(0), a one-step lookahead temporal difference target, to update the neural network online, allowing for faster convergence than offline updating via whole Monte Carlo simulation episodes.
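To make the off-policy TD(0) update concrete, the following is a minimal tabular Q-learning sketch of the Bellman backup in Equation (2); the `env` object with `reset()`, `step()` and `sample_action()` methods is a placeholder assumed for illustration and is not part of the cited works.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular sketch of the update Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy (off-policy w.r.t. the greedy target policy)
            a = env.sample_action() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            td_target = r + gamma * np.max(Q[s_next]) * (not done)   # TD(0) one-step target
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```

A DQN replaces the table Q with a neural network but keeps exactly this target structure, which is why the same off-policy and bootstrapping properties carry over.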
DQN has also been adopted to overcome the complexity of robot path planning. Raajan et al. used a DQN in an environment with quantised agent states and both static and dynamic obstacles and compared it with traditional path planning algorithms, such as RRT* [8]. The results show a great improvement in computational time with little to no reduction in optimality, demonstrating the potential for deep reinforcement learning methods to explore decisions that go against the conventional intuition of heuristic-based approaches. Nevertheless, DQN has a computational overhead, and this can be a limiting factor in practical applications.

2.4. Transport Costs

Current path-finding algorithms are assessed based on the length of the shortest path between the origin and destination. While path length may be an important factor, there are other sources of economic costs that influence the optimal trajectory. These costs may be broadly categorised into tangible and intangible costs.

2.4.1. Tangible Costs

The tangible costs of operating a drone delivery service can be decomposed into fixed capital costs and variable operating costs. Fixed capital costs involve the initial capital required to initialise such a service, including the mechanical costs of constructing delivery drones, the infrastructure required for performing take-offs and landings and the regular maintenance costs required to maintain a fleet of drones. Since fixed costs are invariant across different control algorithms, they are beyond the focus of this paper.
Variable operating costs comprise several input costs that are dependent on the aggregate output of the drone service, quantifiable in terms of the total payload distance travelled. The seminal paper by D'Andrea estimates the power consumption for a cargo drone in (3), with m_p as the payload mass in kg, m_v as the vehicular mass in kg, g as the gravitational acceleration in m/s², v as the cruising velocity in m/s, η as the dimensionless motor–propeller power transfer efficiency, L/D as the dimensionless lift-to-drag ratio and p_a as the power consumption of all other electronics in kW [9]. Using c as the cost of energy in dollars per kWh and e as the dimensionless charging efficiency, the cost per unit distance is also given by D'Andrea in (4).
$$ \text{Power required (kW)} = \frac{(m_p + m_v)\, g\, v}{\eta\,(L/D)} + p_a \qquad (3) $$
$$ \text{Cost per unit distance (USD/m)},\; p = \frac{c}{e}\left(\frac{(m_p + m_v)\, g}{\eta\,(L/D)} + \frac{p_a}{v}\right) \qquad (4) $$
Further studies have extended this model to include other factors that may affect the electric power required for different profiles of drone flight. A systematic review was conducted among extended models, accounting for variable weight profiles, different flight profiles and specific flight parameters, such as angle of attack, wind vectors and the various forces of flight [10].

2.4.2. Intangible Costs

Intangible costs are typically derivative of two sources: the value of time (VOT) and the value of reliability (VOR). The VOT refers to the willingness to pay to reduce the amount of travel time by one unit, and the VOR is the amount that a commuter is willing to pay to reduce a specified measure of uncertainty in arrival time by one unit.
In the case of physical commutes, Lam and Small estimated the VOT to be USD 22.87/h, and the VOR to be USD 15.12/h for males and USD 31.91/h for females [11]. The unit of uncertainty specified by Lam and Small is the difference between the 50th and 90th percentiles of the distribution of arrival timings, as there is typically little to no cost for arriving earlier than planned.
However, the applicability of commuter VOTs and VORs to last-mile drone deliveries may be weak. While there are numerous stated preference studies that estimate the VOT and VOR for passenger travel, there are few studies that focus on the domain of freight.
While there is no existing literature specific to cargo delivery drones, Shams examined the VOT and VOR for freight in the United States and produced estimates in [12] based on the mean time and the standard deviation of the distribution of arrival timings as unit measures. Fowkes also conducted a similar stated preference study in the United Kingdom but added controls for the values of early starts and late arrivals for freight transport [13]. The unit of uncertainty employed by Fowkes is the time difference between the earliest arrival and the 98th percentile of arrival. The results of both studies are presented in Table 1.
The following sections draw on these models to evaluate the effect of varying hyperparameters on economic costs for an aerial cargo drone network.

3. Methodology

In this section, we briefly discuss the concept of the Deep Q-Network and Reward Engineering, specifically Hindsight Experience Replay. This is to understand our reinforcement learning approach. Subsequently, the setup environment and sample single- and multiple-agent traffic and individual path planning are presented.

3.1. Deep Q-Network (DQN)

As mentioned in Section 2.3 (Reinforcement Learning), we employ the Deep Q-Network as our tool; several excellent treatments exist in the literature, such as Roderick et al. [14] and Fan et al. [15]. Q-learning is a sampling-based method for estimating the discounted future value of taking an action a at a state s. In its tabular form, a Q-table of dimension |S| × |A| is used to look up the best action a* to take at every state s, with convergence driven by the Bellman optimality equation shown in Equation (2).
The neural network is parameterised by parameter array θ , representing a series of weight matrices, and receives the state space input s S to produce an output action a A . With the large state space S expected of the continuous environment, it is not feasible to store each action–value pair in a Q-table of dimension S × A . Rather, a function approximator in the form of a deep neural network may be used to parameterise Q ( s , a ) .
Since the environment is three-dimensional (3D), the state space is fed through a network of 3D convolution layers (3D-Conv) for feature extraction before passing through several fully connected hidden layers. By using a convolutional neural network (CNN), continuous and dynamic state spaces evolving over time are permitted since features of continuous terrain data may be extracted. Figure 2 shows how a CNN is incorporated into the DQN architecture. The model architectures used in this paper are specified in Appendix A.
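As a concrete illustration, a 3D-convolutional DQN of this form can be defined in TensorFlow Keras as follows; the exact layer counts and widths used in this paper are those listed in Appendix A, so the ones below are assumptions chosen only for the sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_dqn(env_size=10, n_actions=6):
    """Illustrative 3D-Conv DQN: voxel state in, one Q-value per action out."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(env_size, env_size, env_size, 1)),  # voxel occupancy grid
        layers.Conv3D(16, kernel_size=3, activation="relu", padding="same"),
        layers.Conv3D(32, kernel_size=3, activation="relu", padding="same"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_actions, activation="linear"),             # Q(s, a) for each of the 6 actions
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss=tf.keras.losses.Huber())
    return model
```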
In dealing with the “deadly triad” of function approximation [16], temporal difference targets and off-policy learning, which may lead to a divergence in values [6], a modified DQN known as the Double DQN (DDQN) is used instead [17]. By obtaining Q-values from a secondary target network that is periodically updated from the primary network, action selection is decoupled from action evaluation, leading to better training stability. The DQN algorithm is adapted from van Hasselt et al. [17] and is presented in Algorithm 1. The corresponding schematic diagrams are available in Figure A1, Figure A2, Figure A3 and Figure A4 in Appendix A.
Algorithm 1: Double Deep Q-Network with Experience Replay
  •  Initialise primary network Q_θ and replay buffer D with fixed capacity C
  •  Initialise target network Q_θ′ and set θ′ ← θ
  •  Initialise environment with start state s_t
  •  for each iteration in M until convergence do
  •      for each step until terminal or exceeding threshold do
  •          select a_t from Q_θ(s_t, a_t) (e.g., ε-greedy)
  •          execute a_t and observe r_t and s_{t+1}
  •          add (s_t, a_t, r_t, s_{t+1}) to D
  •          sample minibatch from D
  •          for each item in minibatch do
  •              y_t = r_t + γ Q_θ′(s_{t+1}, argmax_a Q_θ(s_{t+1}, a))   [TD(0) TARGET]
  •              θ ← θ + α [y_t − Q_θ(s_t, a_t)] ∇_θ Q_θ(s_t, a_t)
  •              s_t ← s_{t+1}
  •              if number of updates is more than threshold then
  •                  θ′ ← θ
  •                  reset threshold
  •              end if
  •          end for
  •      end for
  •  end for
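The inner update of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration only: the buffer capacity, batch size and hard target-update helper are assumptions of the sketch, and `primary`/`target` are Keras models of the kind shown earlier.

```python
import random
from collections import deque
import numpy as np

# D with fixed capacity C; the environment loop appends (s, a, r, s_next, done) tuples.
replay_buffer = deque(maxlen=50_000)

def ddqn_update(primary, target, batch_size=32, gamma=0.95):
    """One minibatch step: Double DQN target, then a gradient step on the primary network."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = map(np.array, zip(*batch))

    # Action selection by the primary network, evaluation by the target network.
    best_next = np.argmax(primary.predict(s_next, verbose=0), axis=1)
    q_next = target.predict(s_next, verbose=0)[np.arange(batch_size), best_next]
    y = r + gamma * q_next * (1.0 - done)        # TD(0) target

    # Move only the chosen actions' Q-values towards y; other outputs keep their current values.
    q = primary.predict(s, verbose=0)
    q[np.arange(batch_size), a] = y
    primary.train_on_batch(s, q)

def sync_target(primary, target):
    """Periodic hard update of the target network, as in the if-branch of Algorithm 1."""
    target.set_weights(primary.get_weights())
```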

3.2. Reward Engineering

The objective of each agent is to navigate from its start state to its end state in the most efficient path possible, without the event of a collision. The reward function can be modelled as a signal from the environment, including the static or dynamic obstacles. With scarce positive rewards and abundant negative rewards in the MDP due to the presence of multiple obstacles in a congested urban environment, a lack of positive reinforcement may cause the agent to follow a suboptimal path that reduces the risk of terrain collision instead of converging toward the most efficient path. Therefore, a degree of reward engineering is necessary to improve the training paradigm. Current techniques in reward shaping involve heuristic inputs and may not be feasible for complex problems such as this. One novel technique involves the use of Hindsight Experience Replay (HER), an improvement to existing experience replay buffers by modifying the goal and the reward such that the frequency of positive reinforcement is increased [18]. This is done by shifting the goal state of the drone to its actual terminal state to simulate positive reward in certain replays, leading to additional simulated training examples that may be otherwise unattainable. The following Figure 3 and Figure 4 show the data movement and environment evolution for minibatch updating and Hindsight Experience Replay updating.
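A minimal sketch of the goal-relabelling step is given below; the transition layout (state, goal, action, reward, next_state) and the reward values (reusing the conventions of Table 2) are assumptions of this illustration, and states are assumed to be comparable tuples.

```python
def her_relabel(episode, step_reward=-0.01, success_reward=1.0):
    """Hindsight Experience Replay sketch: relabel the goal as the state the agent
    actually reached, so at least one transition earns the positive reward."""
    achieved_goal = episode[-1][4]          # terminal state actually reached
    relabelled = []
    for state, _, action, _, next_state in episode:
        reward = success_reward if next_state == achieved_goal else step_reward
        relabelled.append((state, achieved_goal, action, reward, next_state))
    return relabelled

# Both the original and the relabelled transitions are added to the replay buffer,
# increasing the frequency of positive reinforcement in sparse-reward environments.
```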

3.3. Environment

We specify the digitised environment in terms of urban terrain (static obstacles) and dynamic obstacles:

3.3.1. Urban Terrain/Static Obstacles

A random urban terrain generator is used to generate a simulated environment for model training. Each environment is initialised as empty, and B non-overlapping structures of various base shapes and heights are added to the terrain. The terrain is represented by three-dimensional voxels, with 0 used to represent empty space and 1 used to represent the presence of a built structure. Several samples are shown in Figure 5, with the yellow and green cubes representing the drone and goal, respectively.
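A minimal sketch of such a generator is shown below; the footprint and height ranges are assumptions for illustration, and the rejection loop simply resamples overlapping structures.

```python
import numpy as np

def generate_terrain(N=10, B=10, max_height=None, rng=None):
    """Illustrative urban terrain generator: N x N x N voxel grid,
    0 = empty space, 1 = built structure (B non-overlapping cuboids)."""
    rng = rng or np.random.default_rng()
    max_height = max_height or N
    terrain = np.zeros((N, N, N), dtype=np.int8)
    placed = 0
    while placed < B:
        w, d = rng.integers(1, N // 3 + 1, size=2)          # random base footprint
        h = rng.integers(1, max_height + 1)                 # random height
        x, y = rng.integers(0, N - w + 1), rng.integers(0, N - d + 1)
        if terrain[x:x + w, y:y + d, :h].any():
            continue                                        # overlap: resample
        terrain[x:x + w, y:y + d, :h] = 1
        placed += 1
    return terrain
```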

3.3.2. Dynamic Obstacles

In addition to the static obstacles, D unit-sized dynamic obstacles, shown in pink in Figure 6, are also introduced to simulate other agents not controlled by the network. These agents follow a non-stationary random walk with drift α and Gaussian noise of variance σ_ε², as specified in (5).
$$ X_t = \alpha + X_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}\big(0, \sigma_\epsilon^2 I\big) \qquad (5) $$
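A one-step simulation of this update is sketched below; rounding to the voxel grid and clipping to the environment bounds are assumptions added here for the discrete environment.

```python
import numpy as np

def step_dynamic_obstacles(X, drift, sigma, N, rng=None):
    """One update of Equation (5): X_t = drift + X_{t-1} + eps, eps ~ N(0, sigma^2 I).
    X is a (D, 3) array of obstacle positions."""
    rng = rng or np.random.default_rng()
    eps = rng.normal(0.0, sigma, size=X.shape)
    X_next = drift + X + eps
    return np.clip(np.rint(X_next), 0, N - 1).astype(int)   # snap to grid, keep in bounds
```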

3.3.3. Risks and Heterogeneous Unit Step Cost

There are inherent risks to operating aerial drone networks over populous, dense urban areas or areas with no navigational aids. Such risks are exogenous, yet they should reasonably affect the solution to the trajectory optimisation problem because of the immense negative externalities to life and infrastructure on the ground in the event of an accident. Therefore, the environment considers the risk associated with the occupancy of an agent in all valid spaces by assigning a value Risk ∈ (0, 1) to each location. These values manifest as state information for the agent to learn, resulting in a non-binary representation of the state s. The state information also modifies the reward space R as a proportionate coefficient to the reward signal r, penalising traversal through risky regions. The level of risk associated with a particular location is represented by an equivalent degree of transparency in the respective spaces, as shown in Figure 7. This equivalent degree of transparency can also incorporate the proximity to infrastructure and the degree of visibility or occlusion within the navigation areas.

3.4. Single-Agent Path Planning

3.4.1. Fully Observable Markov Decision Process (MDP)

The single-agent drone path planning problem is first modelled as a fully observable Markov decision process (MDP) (S, A, P, R, γ), defined as follows:
  • S : three-dimensional discrete state space;
  • A : six possible actions corresponding to unit increments/decrements for each dimension (north, south, east, west, up, down); see Figure 8;
  • P: state transition matrix representing a probabilistic correspondence of each action a ∈ A to a transition in S × S, i.e., P(s′ | s, a) for s, s′ ∈ S, a ∈ A;
  • R : reward function as a signal from the environment to the agent for taking an action a A at a state s S and transitioning to state s S ;
  • γ : discount factor to account for uncertainty in future rewards.
In this MDP, Table 2 specifies the variables S, A, P, R and γ. Δ(s′ − s) ∈ A represents the deterministic kinematic transition a from state s to s′, without the presence of noise.
To understand the role of different environment characteristics and training paradigms on the performance of trained agents, a schedule of simulations in Table 3 is conducted. By varying one parameter between scheduled runs, the effect of that parameter on economic cost may be analysed under ceteris paribus conditions.

3.4.2. Partially Observable MDP

However, fully observable MDPs assume perfect and complete information, which may be an unrealistic assumption since the agent's knowledge of the state is limited to the range of its onboard sensors. Complete information also increases the number of irrelevant dimensions for the agent to process when learning the optimal trajectory. Therefore, a partially observable MDP (S, A, P, R, Ω, O, γ) is formulated:
  • Ω: three-dimensional observation of the adjacent space with range ρ, leading to an observed state of dimensions (2ρ + 1) × (2ρ + 1) × (2ρ + 1);
  • O: observation function; here, the observation is deterministic conditional on s′.
For simplicity, the observation o is made deterministic given each state s, and the partially observable MDP is approximated as a fully observable MDP, with the state s set to the observed state o ∈ Ω. Here, o is the extended Moore neighbourhood of range ρ: a cube of side length 2ρ + 1 centred on the agent. Examples with varying ρ are shown in Figure 9.
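The observation extraction can be sketched as below; padding out-of-bounds voxels with 1 (i.e., treating space beyond the environment as non-traversable) is an assumption of this illustration.

```python
import numpy as np

def observe(terrain, position, rho):
    """Extract the extended Moore neighbourhood of range rho around the agent:
    a (2*rho+1)^3 cube centred on `position` in the voxel grid."""
    padded = np.pad(terrain, rho, mode="constant", constant_values=1)
    x, y, z = np.asarray(position) + rho          # shift indices into the padded frame
    return padded[x - rho:x + rho + 1,
                  y - rho:y + rho + 1,
                  z - rho:z + rho + 1]
```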
A separate schedule of simulations in Table 4 is conducted to study agent performance in partially observable MDPs and the effects of different observation distances.

3.5. Multiple-Agent Traffic and Individual-Path Planning

In operating a fleet network of drones, a fully observable MDP would have to consider the exponentially complex joint action space of multiple agents. Therefore, this multiple-agent traffic planning problem considers independent agents operating within the same environment, each acting upon an observed state s_X centred around its respective location X. The formulated MDP is, therefore, the partially observable MDP specified in the single-agent traffic planning problem. Each agent is trained as a single agent but can thereafter be deployed as an independent agent in a multiple-agent system.
The performance of trained agents in the multiple-agent traffic and individual-path planning problem is assessed by simulating common airline network topologies. Cook and Goodwin outlined two prominent types of airline networks: point-to-point and hub-and-spoke systems [19] (see Figure 10), which may be used as multi-agent environments for agent evaluation.

3.5.1. Point-to-Point (P2P) Environment

The most intuitive aerial transport network involves the direct routing of a drone from its origin to its destination via the shortest path. To simulate daily continuous operations, independent agents are generated with uniform spatial probability via a non-homogeneous Poisson process with rate parameter λ(t) specified in Equation (6). λ(t) is a kernel function that represents the aggregate demand for drone deliveries at time t; its shape is given in Figure 11. Likewise, the intended destination of each drone is randomly generated.
$$ \lambda(t) = \lambda_{\max}\left(1 - \frac{(t - T/2)^2}{T^2/4}\right) \qquad (6) $$
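One standard way to sample arrivals from such a non-homogeneous Poisson process is thinning (acceptance–rejection); the sketch below is an illustration of that method rather than the authors' discrete event simulator.

```python
import numpy as np

def generate_arrivals(T=1000, lam_max=1.0, rng=None):
    """Sample order arrival times for lambda(t) = lam_max * (1 - (t - T/2)^2 / (T^2/4))
    using thinning: propose at the constant rate lam_max, accept with prob lambda(t)/lam_max."""
    rng = rng or np.random.default_rng()
    lam = lambda t: lam_max * (1.0 - (t - T / 2) ** 2 / (T ** 2 / 4))
    arrivals, t = [], 0.0
    while t < T:
        t += rng.exponential(1.0 / lam_max)       # candidate inter-arrival time
        if t < T and rng.uniform() < lam(t) / lam_max:
            arrivals.append(t)                    # accepted delivery order
    return arrivals
```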
By varying the environment size, the number of buildings and the peak rate λ_max, the simulation schedule in Table 5 captures the effect of such parameters on agent performance. Each trained model in Section 3.2 is tested on each of the scheduled runs.

3.5.2. Hub-and-Spoke Environment

The hub-and-spoke model is a widely adopted network topology that was popularised in the late 1970s, serving as an economically superior alternative to the existing P2P mode of civil air travel because of its ability to reap economies of density [19]. Likewise, the structure of modern-day logistics networks resembles the hub-and-spoke model, as distributed payloads are agglomerated at a central arrival distribution hub, shipped to destination distribution hubs, and subsequently re-distributed to their respective destinations.
To assess the performance of trained agents serving as last-mile delivery services, a series of simulations in Table 6 with varying numbers of hubs and varying demands is performed. The equilibrium locations of the H hubs of multiple delivery firms are hypothesised from Hotelling's model of spatial competition [20] as a dense cluster, generated using a multivariate normal distribution centred at the centroid of the environment.
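A minimal sketch of this hub placement is given below; the covariance scale (`spread`), the restriction to ground-plane coordinates and the snap-to-grid step are assumptions of the illustration.

```python
import numpy as np

def generate_hubs(H, N, spread=0.1, rng=None):
    """Place H hub locations as a dense cluster around the environment centroid,
    drawn from a multivariate normal and snapped to the voxel grid."""
    rng = rng or np.random.default_rng()
    centroid = np.full(2, (N - 1) / 2.0)                   # ground-plane centroid
    cov = (spread * N) ** 2 * np.eye(2)
    hubs = rng.multivariate_normal(centroid, cov, size=H)
    return np.clip(np.rint(hubs), 0, N - 1).astype(int)    # (H, 2) grid coordinates
```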

3.6. Performance Evaluation

Traditional approaches to optimal control minimise an objective function that comprises the path length and a heuristic representing control effort. However, an economic approach to cost is adopted, considering both tangible and intangible costs of operation for each drone delivery flight. This provides a more holistic perspective for cost–benefit analysis since the values of operations are quantified on an economic scale. The first type of costs is tangible resource costs. While the multitude of drone models available allows for much space to vary the parameters of flight, for the purposes of comparison, this methodology adopts conventional parameters that best represent the general population of cargo delivery drones. These parameters are fixed in Table 7.
Using (3) and (4), these drone and flight parameters give a total power consumption of 0.34525 kilowatts and a cost per unit distance p of 0.172625 US cents per kilometre. The other form of cost is intangible. The VOT estimates for freight transport are unlikely to apply in this context since the main opportunity costs of operating a vehicle for traditional freight shipping methods do not arise for an automated drone service. On the contrary, the concept of VOR is applicable to cargo delivery since there are costs to the consumer arising from the uncertainty of package arrival timings. With a higher variance, consumers may need to plan around such uncertainty and, therefore, incur higher costs. For the purposes of comparison across the various simulations, a value of USD 4.36 (2021 prices) is used as the monetary value associated with reducing the standard deviation of one ton of cargo drone delivery time by one hour [12]. With the parameters in Table 7, this corresponds to 1.308 US cents per hour of standard deviation. Although the 50th-to-90th-percentile spread of arrival timings is a superior measure of reliability, this paper does not use that metric because estimates for cargo deliveries are unavailable.
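For reference, the following sketch reproduces the tangible-cost figures above from Equations (3) and (4) and the Table 7 parameters; g = 9.81 m/s² is assumed.

```python
def power_required_kw(m_p, m_v, v, eta, lift_to_drag, p_a, g=9.81):
    """Equation (3): propulsion power plus avionics power, in kW."""
    return (m_p + m_v) * g * v / (1000.0 * eta * lift_to_drag) + p_a

def cost_per_metre_usd(m_p, m_v, v, eta, lift_to_drag, p_a, c, e, g=9.81):
    """Equation (4): energy cost per metre flown, in USD/m (c in USD/kWs)."""
    return (c / e) * ((m_p + m_v) * g / (1000.0 * eta * lift_to_drag) + p_a / v)

p_kw = power_required_kw(m_p=2.0, m_v=3.0, v=10.0, eta=0.5, lift_to_drag=4.0, p_a=0.1)
usd_per_m = cost_per_metre_usd(m_p=2.0, m_v=3.0, v=10.0, eta=0.5, lift_to_drag=4.0,
                               p_a=0.1, c=0.00004, e=0.8)
print(p_kw)             # 0.34525 kW
print(usd_per_m * 1e5)  # 0.172625 US cents per kilometre
```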

4. Results

We ran the proposed Double Deep Q-Network with experience replay approach, as shown in Algorithm 1, offline on an Intel i7 processor. All simulations were coded in the Python programming language, using the TensorFlow Keras deep learning framework for model training, accelerated with the NVIDIA CUDA Deep Neural Network (cuDNN) library. The results from 1000 test simulations per model for single-agent FOMDPs, for journeys whose shortest path length equals the environment size N, are presented in Table 8.
The graphs in Figure 12 present the test distribution of travel times for each trained agent, and Figure 13 shows the expected economic cost for Runs 1 to 12.
Likewise, the results from the single agent POMDP simulations for a fixed journey of length N are presented in Table 9.
The graphs in Figure 14 present the test distribution of travel times for each trained agent, and Figure 15 shows the expected economic cost for Runs 13 to 24.

4.1. Point-to-Point Environment Test Results

In the P2P multi-agent environment, each simulation ran for 1000 time-steps. The overall economic cost per run is presented in Table 10 and summarised in Figure 16.

4.2. Hub-and-Spoke Environment Test Results

Lastly, the hub-and-spoke simulations were conducted similarly to the P2P simulations, with the results presented in Table 11 and summarised in Figure 17.

5. Discussion

5.1. Training Stability

In tuning the relevant hyperparameters of the DDQN, it was found that convergence to optimal values was not guaranteed, as shown by the results of several runs. Not all models reached convergence, with several simulations producing rewards and travel times that are completely unusable, pointing to the delicate nature of training DQNs. This is consistent with the theoretical instability of off-policy learning with bootstrapping and function approximation, with convergence achieved only for a sufficiently small learning rate α and discount factor γ. When these parameters were set to conservative values, the training regime stabilised, leading to overall better results. Nonetheless, the volatility of the resulting average reward curves means the optimal model must be picked based on the calculated training average reward, which is certainly an overestimation of the model's performance. Table 8 and Table 9 unequivocally point to the train–test gap of machine learning, which therefore requires a series of test simulations to assess the overall performance of the resulting models.

5.2. Signal Engineering

Prior to obtaining the results, it was hypothesised that a fully observable MDP would result in superior results to the partially observable MDP. Agents with perfect and complete information about the environment should result in more efficient outcomes because of the ability to learn the costs and benefits of taking a particular action. However, with POMDP models outperforming those with FOMDPs, it suggests that a degree of signal engineering may be required to guide the agent into learning an optimal strategy for its trajectory. There are several possible reasons that can be attributed to these observations. Firstly, it is well understood that the curse of dimensionality in statistical models, which refers to the high dimensionality of signal input into the model, may cause issues with learning if the signal-to-noise ratio is sufficiently small. The DDQN may pick up on unimportant features of the environment that may result in drastic changes to the output, especially given the instability of the TD targets, even in DDQN models where a target network is used. Secondly, placing the drone in the centre of its observed environment in the POMDP models, as compared to a delocalised environment in the FOMDP models, allows the agent to better learn the relative importance of spatial information in determining its next step. In spite of the universal approximation theorem, which allows sufficiently large neural networks the capability to learn the structure of any signal presented, it is likely necessary in practice to shape the state input to better assist the agent in identifying relevant features that will accelerate learning.

5.3. Effect of Hindsight Experience Replay

In the FOMDP simulations, the models trained with HER consistently performed better than the equivalent training regimes without HER, as shown in Table 8 and Figure 13. This underscores the importance of handling scarce reward signals, a well-known problem in reinforcement learning that may cause training models to fail to converge because of a lack of positive reinforcement signals.

5.4. Effect of Dynamic Obstacles

Another surprising result is the consistent positive training effect of added dynamic obstacles to the environment. In both FOMDP and POMDP simulations, models with added dynamic obstacles tended to perform better, despite the added risk of collision. It is likely that these obstacles introduced a wider set of training examples into the CNN, giving the model a higher inclination to avoid actions that would lead to negative rewards.

5.5. Effect of Observation Radius

In the POMDP simulations with varying observation radius, there is no clear pattern from the results. However, it should be noted that the training of Run 22 did not converge well, suggesting that large environment sizes may require a certain degree of observation radius to train effectively. Nonetheless, there is insufficient evidence to support this claim definitively, and more tests should be conducted to ascertain the effect of observation radius on agent performance.

5.6. Effect of Travel Demand

In varying the value of λ max , which represents the maximum level of travel demand, the multi-agent simulations Run 25 to Run 30 reflected a higher economic cost because of the presence of congestion. However, the same effect was not observed from Run 31 to Run 33, which might be due to the high density of built structures. There was also no observable effect of demand on the hub-and-spoke environments.

5.7. Effect of Network Topology

A contrast between Figure 16 and Figure 17 is the effect of travel demand on economic costs. While the P2P environment suffered from inefficiencies arising from higher travel demand, there was less of a clear pattern for the hub-and-spoke environments. Intuitively, this could be due to the disruption of flow between aircraft heading in opposing directions in the P2P environments, whereas the hub-and-spoke networks facilitate a largely unidirectional flow of traffic and are less prone to obstruction. However, these tests only simulated the outflow of cargo drones and did not consider inbound traffic. Further studies may be performed to ascertain the effects of bidirectional flow on economic efficiency.

5.8. Proportion of Tangible and Intangible Costs

In all single-agent simulation tests, tangible costs accounted for 87% to 96% of total economic costs. However, the multiple agent tests reflected a lower percentage of 66% to 93%, as illustrated in Figure 18. This agrees with intuition as the level of travel time uncertainty rises in a multiple-agent environment, where multiple agents are competing for limited airspace, resulting in a higher tendency for congestion and a degree of unreliability. This cost structure does not include VOT and costs due to externalities, which may further reduce the proportion of tangible costs.

5.9. VOT and VOR Estimates

In calculating the expected economic costs for each aircraft trajectory, the VOT was assumed to be approximately zero, while the VOR was estimated from freight studies that may not be applicable to last-mile drone deliveries. Theoretically, the VOT for commuters comprises three components: the wage forgone, the dollar value of the utility gained from working, and the dollar value of the disutility gained from travel, as underlined in (7), where λ is the marginal utility per dollar of unearned income. However, in last-mile deliveries, there is no wage forgone by the recipient, and the work time is fixed since drone deliveries do not affect the work time of the consumer. Therefore, under certain assumptions, the VOT can potentially vary with delivery time if and only if there is an associated disutility gained from additional travel time.
$$ \text{VOT} = \text{wage} + \frac{MU_{\text{work time}}}{\lambda} - \frac{MU_{\text{travel time}}}{\lambda} \qquad (7) $$
Goods that are time-sensitive may have a preferred time of arrival, t * , such that any arrival timing later than t * generates additional disutility to the consumer. On the other hand, if there is no urgency, there is no disutility generated from longer delivery times. While there is no definite study at present to estimate such numbers, a stated preference study may be conducted by giving consumers options for “Regular” and “Express” shipping as part of a choice set to measure the utility gained from earlier arrivals of different types of packages. As a result, different goods may have different VOT profiles for travel times after t * . Several candidate VOT profiles are shown in Figure 19.
Alternatively, a new framework for utility may be in order, given that the components of the utility function for cargo delivery may be significantly different from those for commuting. Morisugi developed separate utility specifications that distinguish the case of non-business freight service, specifying a quality level indicator of freight time s(t) > 0 that decreases with time [21]. From solving the first-order conditions, the computed VOT is mathematically equivalent to the cost of freight service per freight time, p, multiplied by the time elasticity of s(t), ε_{s(t)}, as given in (8). By estimating the time elasticity of the quality of cargo drone deliveries, a more representative value of VOT for these purposes may be computed.
$$ \text{VOT} = p\,\varepsilon_{s(t)} \qquad (8) $$
Intuitively, the VOR for delivery times is an important and significant component in the economic cost. Consumers often rely on expected arrival times to plan workflows that may be disrupted if these goods do not arrive at the expected arrival time. It is also important to acknowledge that the nature of the goods transported may widely affect these estimates since the ultimate value derived from the delivery time reliability may differ across various goods. More specifically, the interpretation of schedule delay early (SDE) and schedule delay late (SDL) costs can be extended into cargo deliveries. SDE costs may arise from additional storage costs due to early arrivals of goods, while SDL costs can accumulate from the potential cascading of delays and inconveniences due to late deliveries. Regardless, the assumed VOR structure is likely to be accurate because of the similarities underlying commuter travel and last-mile cargo deliveries.

5.10. Effect of Uncertainty of Data or Environment

As the proposed approach depends largely on incoming sensing data and environmental input, the accuracy and reliability of this information may affect its performance. The effect of uncertainty in the environment or sensing information enters through the signal input to the model. The proposed approach has two means of addressing this: (1) the use of probabilistic transitions in the MDP, which inherently cover varying levels of uncertainty, and (2) the use of more training data covering a variety of situations (and uncertainties). Hence, variation in the data is accounted for, and the robustness of the method can be maintained.

6. Conclusions and Future Work

Through formulating an MDP over a discrete three-dimensional space, this paper explores high-level control of aircraft trajectories for multi-agent traffic and path planning. The effects of various environment and flight parameters were analysed to calculate their impact on economic costs. However, these studies assumed a non-dynamical representation of the agent in the environment, as the agents did not conform to dynamic and kinematic constraints. DQN also assumes a finite and discrete action space, since the network outputs one value per action. These results may differ if a dynamic model of cargo drones with differentiated costs for the various actions is considered. For instance, the energy cost of ascending may differ from that of cruising or descending. Therefore, value-based learning methods can be further supported by policy gradient or actor–critic networks on more realistic models of cargo drones to derive more realistic estimates.
While conventional traffic planning and optimisation techniques are evaluated based on path length or time, this paper incorporates economic analysis by considering tangible and intangible sources of cost, such as the cost of energy, the value of time (VOT) and the value of reliability (VOR). This is more practical and has a clear advantage over the conventional approach.
More work is also required to study the VOT and VOR specific to cargo drone deliveries in congested urban areas. Present studies estimate freight VOT and VOR through containerised shipping methods that may not reflect the circumstances required for last-mile drone delivery. In measuring intangible costs, one of the major sources of externalities, noise pollution, may also be considered to reflect the total economic cost. It is reasonable to assume that the presence of flying aircraft in the vicinity of residential areas may cause unwanted disturbances that inflict external costs on society.
In addition, it must be acknowledged that these models only provide a high-level overview of multi-agent drone traffic, individual-path planning, and control. Other control models—not limited to those trained using reinforcement learning or neural networks—should also be developed to optimise the low-level actuator functions of cargo drones, in accompaniment to the policies that are established in this paper.
Furthermore, the proposed method can be scaled to large-scale networks by increasing the number of drones and the size of the environment in the training data, although there may be a limit because of the large computation involved.
Overall, the results from this paper show that the parameters that govern an aerial cargo drone network must be set correctly to maximise the efficiency of the network. This paper also demonstrates the usefulness of incorporating economic cost structures to an engineering-based problem. Such methods should continue to influence the analysis of engineering problems, especially where they are slated for implementation in society.

Author Contributions

Conceptualisation, S.X.S.; methodology, S.X.S. and S.S.; software, S.X.S.; validation, S.X.S. and S.S.; formal analysis, S.X.S. and S.S.; investigation, S.X.S. and S.S.; writing—original draft preparation, S.S.; writing—review and editing, S.X.S. and S.S.; project administration and funding, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
A*	A-Star Pathfinding Algorithm
CNN	Convolutional Neural Network
CPI	Consumer Price Index
cuDNN	NVIDIA CUDA Deep Neural Network
DDQN	Double Deep Q-Network
DQN	Deep Q-Network
FOMDP	Fully Observable Markov Decision Process
HER	Hindsight Experience Replay
MDP	Markov Decision Process
P2P	Point-to-Point Multi-Agent Environment Model
POMDP	Partially Observable Markov Decision Process
RRT*	Rapidly-Exploring Random Tree Star Pathfinding Algorithm
SDE	Schedule Delay Early
SDL	Schedule Delay Late
TD(n)	Temporal Difference Target, n-step lookahead
VOR	Value of Reliability
VOT	Value of Time

Appendix A. Model Architecture and Complexity

Details of the two model architectures used for training in the FOMDP and POMDP simulations are given in the following descriptors and schematics in Figure A1, Figure A2, Figure A3 and Figure A4.
Figure A1. Description of CNN-DQN model for FOMDP simulations.
Figure A2. Description of CNN-DQN model for POMDP simulations.
Figure A3. CNN-DQN schematic for FOMDP with input size (10, 10, 10), 3.
Figure A4. CNN-DQN schematic for POMDP with input size 343, 3.

References

  1. Unmanned Airspace. Unmanned Air System Traffic Management (UTM) Directory. 2023. Available online: https://www.unmannedairspace.info/unmanned-air-system-traffic-management-utm-directory/ (accessed on 30 September 2023).
  2. Hart, P.E.; Nilsson, N.J.; Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107.
  3. Karaman, S.; Frazzoli, E. Incremental sampling-based algorithms for optimal motion planning. Robot. Sci. Syst. VI 2010, 104, 267–274.
  4. Bliss, G.A. The problem of Bolza in the calculus of variations. In Annals of Mathematics; Mathematics Department, Princeton University: Princeton, NJ, USA, 1932; pp. 261–274.
  5. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
  6. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
  7. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 1999, 12, 1057–1063.
  8. Raajan, J.; Srihari, P.; Satya, J.P.; Bhikkaji, B.; Pasumarthy, R. Real time path planning of robot using deep reinforcement learning. IFAC-PapersOnLine 2020, 53, 15602–15607.
  9. D’Andrea, R. Can Drones Deliver? IEEE Trans. Autom. Sci. Eng. 2014, 11, 647–648.
  10. Zhang, J.; Campbell, J.F.; Sweeney, D.C., II; Hupman, A.C. Energy consumption models for delivery drones: A comparison and assessment. Transp. Res. Part D Transp. Environ. 2021, 90, 102668.
  11. Lam, T.C.; Small, K.A. The value of time and reliability: Measurement from a value pricing experiment. Transp. Res. Part E Logist. Transp. Rev. 2001, 37, 231–251.
  12. Shams, K. Understanding the Value of Travel Time Reliability for Freight Transportation to Support Freight Planning. 2016. Available online: https://digitalcommons.fiu.edu/etd/2828/ (accessed on 30 September 2023).
  13. Fowkes, T. The design and interpretation of freight stated preference experiments seeking to elicit behavioural valuations of journey attributes. Transp. Res. Part B Methodol. 2007, 41, 966–980.
  14. Roderick, M.; MacGlashan, J.; Tellex, S. Implementing the deep Q-network. arXiv 2017, arXiv:1711.07478.
  15. Fan, J.; Wang, Z.; Xie, Y.; Yang, Z. A theoretical analysis of deep Q-learning. In Proceedings of the 2nd Conference on Learning for Dynamics and Control, PMLR, Berkeley, CA, USA, 10–11 June 2020; pp. 486–489.
  16. Van Hasselt, H.; Doron, Y.; Strub, F.; Hessel, M.; Sonnerat, N.; Modayil, J. Deep reinforcement learning and the deadly triad. arXiv 2018, arXiv:1812.02648.
  17. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
  18. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Pieter Abbeel, O.; Zaremba, W. Hindsight experience replay. Adv. Neural Inf. Process. Syst. 2017, 30.
  19. Cook, G.N.; Goodwin, J. Airline networks: A comparison of hub-and-spoke and point-to-point systems. J. Aviat./Aerosp. Educ. Res. 2008, 17.
  20. Hotelling, H. Stability in competition. Econ. J. 1929, 39, 41–57.
  21. Morisugi, H. Measurement of value of time for freight trips and its benefit by market information. Transp. Res. Procedia 2017, 25, 5144–5159.
Figure 1. Sample multiple cargo drone traffic [1].
Figure 2. CNN-DQN model architecture and training regime.
Figure 3. Algorithm schematic for training with replay buffers.
Figure 4. Algorithm schematic for HER updates.
Figure 5. Sample urban terrain with drone locations and end goal locations.
Figure 6. Sample urban terrain with added dynamic obstacles, highlighted in pink.
Figure 7. Exogenous risk models. (a) Uniform step cost. (b) Heterogeneous risk costs.
Figure 8. Six possible actions corresponding to unit increments/decrements for each dimension (north, south, east, west, up, down).
Figure 9. Varying observable radius of agent in partially observable MDP in the extended Moore neighbourhood of range ρ.
Figure 10. Point-to-point and hub-and-spoke systems [19].
Figure 11. Poisson process rate parameter with T = 1000 and λ_max = 1.
Figure 12. Travel time distributions. (a) Runs 1 to 8, with environment size = path length = 20 m. (b) Runs 9 to 12, with environment size = path length = 40 m.
Figure 13. Summary of economic costs, FOMDP, single agent.
Figure 14. Travel time distributions. (a) Runs 13 to 21, with environment size = path length = 20 m. (b) Runs 22 to 24, with environment size = path length = 40 m.
Figure 15. Summary of economic costs, POMDP, single agent.
Figure 16. Summary of economic costs, P2P environment, multiple agents.
Figure 17. Summary of economic costs, hub-and-spoke environment, multiple agents.
Figure 18. Tangible costs as a fraction of economic costs, all simulations.
Figure 19. Different VOT profiles for different goods.
Table 1. Estimates for VOT and VOR for freight.

| Type of Goods | VOT (USD/ton-h) [12] | VOT (USD/ton-h) [13] | VOR (USD/ton-h) [12] | VOR (USD/ton-h) [13] |
|---|---|---|---|---|
| Perishable | 0.71 | 0.49–0.84 | 4.95 | 0.65–4.82 |
| Non-perishable | 1.61 | 0.49–0.84 | 3.55 | 0.65–4.82 |
Table 2. MDP specifications.

| Variable | Specification (Set as Hyperparameters) |
|---|---|
| S | N × N × N grid (N variable); ∣S∣ = N³ |
| A | {(−1, 0, 0), (+1, 0, 0), (0, −1, 0), (0, +1, 0), (0, 0, −1), (0, 0, +1)}; ∣A∣ = 6 |
| P | P(s′ ∣ s, a) = 0.95, for a = Δ(s′ − s); P(s′ ∣ s, a) = 0.05/6, for a′ ≠ a, a′ ∈ A, a′ = Δ(s′ − s); P(s′ ∣ s, a) = 0, otherwise; ∣P∣ = 6N⁶ |
| R | r(s, a, s′) = 1, for successful delivery; r(s, a, s′) = −0.01 × f(s′), for heterogeneous unit step cost; r(s, a, s′) = −0.25, for collision; ∣R∣ = 6N⁶ |
| γ | 0.95 |
Table 3. Fully observable MDP simulation schedule.

| Run | Environment Size, N | # Buildings, B | # Dynamic Obstacles, D | Hindsight Experience Replay |
|---|---|---|---|---|
| 1 | 10 | 10 | 0 | No |
| 2 | 10 | 10 | 0 | Yes |
| 3 | 10 | 10 | 10 | No |
| 4 | 10 | 10 | 10 | Yes |
| 5 | 10 | 25 | 0 | No |
| 6 | 10 | 25 | 0 | Yes |
| 7 | 10 | 25 | 10 | No |
| 8 | 10 | 25 | 10 | Yes |
| 9 | 20 | 50 | 25 | No |
| 10 | 20 | 50 | 25 | Yes |
| 11 | 20 | 100 | 25 | No |
| 12 | 20 | 100 | 25 | Yes |
Table 4. Partially observable MDP simulation schedule.

| Run | Environment Size, N | # Buildings, B | # Dynamic Obstacles, D | Observable Radius, ρ |
|---|---|---|---|---|
| 13 | 10 | 10 | 0 | 1 |
| 14 | 10 | 10 | 0 | 2 |
| 15 | 10 | 10 | 0 | 3 |
| 16 | 10 | 25 | 0 | 1 |
| 17 | 10 | 25 | 0 | 2 |
| 18 | 10 | 25 | 0 | 3 |
| 19 | 10 | 25 | 10 | 1 |
| 20 | 10 | 25 | 10 | 2 |
| 21 | 10 | 25 | 10 | 3 |
| 22 | 20 | 50 | 25 | 1 |
| 23 | 20 | 50 | 25 | 2 |
| 24 | 20 | 50 | 25 | 3 |
Table 5. Point-to-point multi-agent simulation schedule.

| Run | Environment Size, N | # Buildings, B | # Dynamic Obstacles, D | λ_max |
|---|---|---|---|---|
| 25 | 20 | 50 | 10 | 0.1 |
| 26 | 20 | 50 | 10 | 0.2 |
| 27 | 20 | 50 | 10 | 0.5 |
| 28 | 20 | 50 | 25 | 0.1 |
| 29 | 20 | 50 | 25 | 0.2 |
| 30 | 20 | 50 | 25 | 0.5 |
| 31 | 20 | 100 | 25 | 0.1 |
| 32 | 20 | 100 | 25 | 0.2 |
| 33 | 20 | 100 | 25 | 0.5 |
Table 6. Hub-and-spoke multi-agent simulation schedule.

| Run | Environment Size, N | # Buildings, B | # Dynamic Obstacles, D | # Hubs, H | λ_max |
|---|---|---|---|---|---|
| 34 | 20 | 50 | 25 | 2 | 0.1 |
| 35 | 20 | 50 | 25 | 2 | 0.2 |
| 36 | 20 | 50 | 25 | 2 | 0.5 |
| 37 | 20 | 50 | 25 | 5 | 0.1 |
| 38 | 20 | 50 | 25 | 5 | 0.2 |
| 39 | 20 | 50 | 25 | 5 | 0.5 |
| 40 | 20 | 50 | 25 | 10 | 0.1 |
| 41 | 20 | 50 | 25 | 10 | 0.2 |
| 42 | 20 | 50 | 25 | 10 | 0.5 |
Table 7. Drone parameters.

| Parameter | Value |
|---|---|
| Mass of payload, m_p (kg) | 2.0 |
| Mass of vehicle, m_v (kg) | 3.0 |
| Unit step size, Δx (m) | 2.0 |
| Unit time, Δt (s) | 0.2 |
| Constant velocity, v = Δx/Δt (m/s) | 10 |
| Lift-to-drag ratio, L/D | 4.0 |
| Power transfer efficiency, η | 0.5 |
| Avionics power, p_a (kW) | 0.1 |
| Cost of electric power, c_h (USD/kWh) | 0.144 |
| Cost of electric power, c (USD/kWs) | 0.00004 |
| Charging efficiency, e | 0.8 |
Table 8. Fully Observable MDP simulation results.

| Run | Parameters [N, B, D, HER] | Training Reward | Test Reward | Mean Time (s) | Time S.D. (s) | Economic Cost (US Cents) |
|---|---|---|---|---|---|---|
| 1 | [10, 10, 0, N] | −0.640 | −1.010 | 20.051 | 1.592 | 0.03519 |
| 2 | [10, 10, 0, Y] | 0.137 | −0.918 | 17.413 | 6.168 | 0.03230 |
| 3 | [10, 10, 10, N] | −0.490 | −1.427 | 19.947 | 62.009 | 0.03516 |
| 4 | [10, 10, 10, Y] | 0.755 | 0.403 | 6.144 | 6.730 | 0.01305 |
| 5 | [10, 25, 0, N] | −0.13 | −0.992 | 20.020 | 1.733 | 0.03519 |
| 6 | [10, 25, 0, Y] | 0.148 | −0.674 | 14.096 | 7.985 | 0.02723 |
| 7 | [10, 25, 10, N] | 0.151 | 0.173 | 8.074 | 7.548 | 0.01668 |
| 8 | [10, 25, 10, Y] | 0.716 | 0.190 | 7.082 | 7.576 | 0.01498 |
| 9 | [20, 50, 25, N] | −0.430 | −1.560 | 20.188 | 0.354 | 0.03498 |
| 10 | [20, 50, 25, Y] | −0.240 | −1.006 | 20.200 | 0.000 | 0.03487 |
| 11 | [20, 100, 25, N] | −0.450 | −1.001 | 20.142 | 0.921 | 0.03510 |
| 12 | [20, 100, 25, Y] | 0.340 | 0.026 | 10.250 | 7.212 | 0.02032 |
Table 9. Partially Observable MDP simulation results.

| Run | Parameters [N, B, D, ρ] | Training Reward | Test Reward | Mean Time (s) | Time S.D. (s) | Economic Cost (US Cents) |
|---|---|---|---|---|---|---|
| 13 | [10, 10, 0, 1] | 0.868 | 0.514 | 5.631 | 2.957 | 0.01079 |
| 14 | [10, 10, 0, 2] | 0.869 | 0.335 | 7.098 | 3.417 | 0.01349 |
| 15 | [10, 10, 0, 3] | 0.896 | 0.689 | 4.074 | 2.356 | 0.00789 |
| 16 | [10, 25, 0, 1] | 0.820 | 0.366 | 6.715 | 3.868 | 0.01299 |
| 17 | [10, 25, 0, 2] | 0.794 | 0.498 | 5.502 | 3.070 | 0.01061 |
| 18 | [10, 25, 0, 3] | 0.843 | 0.145 | 7.787 | 3.319 | 0.01464 |
| 19 | [10, 25, 10, 1] | 0.859 | 0.718 | 4.076 | 2.672 | 0.00800 |
| 20 | [10, 25, 10, 2] | 0.855 | 0.585 | 5.087 | 3.019 | 0.00987 |
| 21 | [10, 25, 10, 3] | 0.757 | 0.314 | 6.828 | 3.392 | 0.01302 |
| 22 | [20, 50, 25, 1] | −0.310 | −0.914 | 18.223 | 3.676 | 0.03279 |
| 23 | [20, 50, 25, 2] | 0.368 | 0.201 | 9.481 | 2.301 | 0.01720 |
| 24 | [20, 50, 25, 3] | 0.452 | −0.060 | 11.143 | 2.803 | 0.02025 |
Table 10. Point-to-point multi-agent simulation results.

| Run | Parameters [N, B, D, λ_max] | Economic Cost (US Cents) |
|---|---|---|
| 25 | [20, 50, 10, 0.1] | 0.02302 |
| 26 | [20, 50, 10, 0.2] | 0.02616 |
| 27 | [20, 50, 10, 0.5] | 0.02728 |
| 28 | [20, 50, 25, 0.1] | 0.01824 |
| 29 | [20, 50, 25, 0.2] | 0.02439 |
| 30 | [20, 50, 25, 0.5] | 0.03029 |
| 31 | [20, 100, 10, 0.1] | 0.03276 |
| 32 | [20, 100, 10, 0.2] | 0.02812 |
| 33 | [20, 100, 10, 0.5] | 0.03267 |
Table 11. Hub-and-spoke multi-agent simulation results.

| Run | Parameters [N, B, D, H, λ_max] | Economic Cost (US Cents) |
|---|---|---|
| 34 | [20, 50, 25, 2, 0.1] | 0.01696 |
| 35 | [20, 50, 25, 2, 0.2] | 0.02593 |
| 36 | [20, 50, 25, 2, 0.5] | 0.02327 |
| 37 | [20, 50, 25, 5, 0.1] | 0.02401 |
| 38 | [20, 50, 25, 5, 0.2] | 0.02349 |
| 39 | [20, 50, 25, 5, 0.5] | 0.02468 |
| 40 | [20, 50, 25, 10, 0.1] | 0.02847 |
| 41 | [20, 50, 25, 10, 0.2] | 0.02692 |
| 42 | [20, 50, 25, 10, 0.5] | 0.02125 |