1. Introduction
Autonomous flight is already a reality. The American company Xwing has applied its autonomous flight software, Super-pilot, to the Cessna 208B Caravan and is pursuing standard certification for large unmanned aerial systems (UASs) from the Federal Aviation Administration (FAA) [1]. This trend is evident globally: policy documents such as the Urban Air Mobility (UAM) roadmap and the Defense Roadmap underscore the significance of autonomous flight and promote its development as a core technology. This suggests that autonomous flight is not limited to experimental trials but is being practically implemented in both commercial and military sectors. Moreover, considering the global trend of population decline, autonomous flight is expected to be increasingly utilized and to become an indispensable technology [2,3].
Previous research on autonomous flight systems primarily involved modeling the system based on the flight dynamics of the aircraft and then developing software code to compute and execute the optimal trajectory [4,5,6]. However, such approaches have encountered many challenges in the aerospace field. Aircraft require complex and sensitive interactions with the air to sustain flight, making them highly dependent on real-time performance and SWaP-C (Size, Weight, Power, and Cost). Consequently, both the environments in which traditional code-based autonomous flight systems can be deployed and the performance of the systems themselves are subject to numerous spatiotemporal constraints. Furthermore, code-based autonomous flight systems exhibit limitations in dynamic and uncertain settings, such as those involving human cooperation, swarm flight, and traffic control [7,8].
To overcome these limitations, reinforcement learning (RL) has emerged as an alternative. RL approximately optimizes sequential decision-making problems by learning from data obtained through interactions (rollouts) with the environment [9]. In particular, deep reinforcement learning (DRL), which integrates deep neural networks (DNNs), has demonstrated excellent performance in various state and action spaces following the success of DeepMind’s AlphaGo. This approach is a powerful tool for tackling complex problems such as nonlinear multi-objective optimization and shows high potential for applications in the aerospace field [10,11].
However, deep reinforcement learning requires collecting large volumes of data through interactions with the environment for training. This process typically demands significant time and expense, and acquiring data for hazardous scenarios in safety-critical systems—such as those in the aerospace field—is especially challenging [12,13]. As a result, researchers predominantly train deep reinforcement learning models in simulation; however, differences between virtual and real environments cause these models to behave abnormally, leading to the Sim-to-Real problem [14]. Moreover, because aerodynamic properties are extremely sensitive, developing an accurate simulator for aircraft is both complex and expensive, and every time a new aircraft is developed, researchers must update the simulator accordingly [15,16].
One approach to overcoming these challenges is model-based reinforcement learning (MBRL) [17,18]. The core idea of MBRL is to incorporate a model that simulates the interaction between the reinforcement learning agent and the environment into the training process. Unlike model-free reinforcement learning, which relies solely on experience data obtained through direct interactions with the environment, MBRL leverages both real experience data and data generated from internal predictions to meet the training data requirements. Consequently, MBRL enables efficient learning while minimizing interactions with the real environment. However, the accuracy of the simulated environment model and the inherently low sample efficiency in sparse-reward settings still require improvement [19,20,21].
In this study, we propose the MGH (model-based reinforcement learning with GANs and HER) framework. The framework integrates a Generative Adversarial Network (GAN) [22] and Hindsight Experience Replay (HER) [23] into model-based reinforcement learning to enhance its performance. Our key insight is that, by using the GAN to improve the accuracy of the environmental model and HER to maximize sample efficiency, the framework can learn solely from sparse real-world data without requiring a separately pre-built simulation model. The MGH framework comprises two main stages. In the first stage, we initialize the GAN-based environmental model using data collected through expert-controlled agent interactions. In the second stage, we conduct model-based reinforcement learning augmented with Hindsight Experience Replay on top of the initialized environmental model.
We implemented and evaluated the MGH framework with the Deep Deterministic Policy Gradient (DDPG) algorithm in a 3D real-world environment in which a quadcopter was controlled. In the experiment, the quadcopter’s mission was to take off from a starting point and fly to a designated target location. We conducted the experiment indoors with a small quadcopter equipped only with a monocular camera—without any additional self-localization sensors—and remotely controlled the quadcopter through the MGH framework model running on a server. For performance comparison, models implementing conventional deep reinforcement learning algorithms were also trained on the same dataset and evaluated on the same real-world flight task. The results showed that the MGH framework accelerated learning convergence by up to 70.59% compared with existing algorithms and clearly demonstrated the impact of the environmental model on reinforcement learning.
Our contributions through this study are as follows:
Novel integration of GANs and HER in MBRL
We are the first to unify a GAN-based environment model and Hindsight Experience Replay within an MBRL paradigm for real flight control tasks. This joint approach addresses both high-fidelity model learning and sample efficiency in sparse-reward environments.
Real-world validation on a quadcopter platform
Unlike many simulation-based studies, we use actual flight data from an indoor quadcopter environment. This setup reveals how our method performs under real-world conditions where data collection is expensive and risky.
Significant improvement in convergence speed and model accuracy
Experimental results show that our framework reduces the required number of training samples and accelerates convergence by up to 70.59% over existing algorithms (measured against DDPG with HER, the strongest baseline). Additionally, the GAN component demonstrates high accuracy in predicting transitions compared with naive or traditional model-based approaches.
By jointly addressing model fidelity and sparse-reward sample efficiency, we expect this framework to make a significant contribution to the field of deep reinforcement learning-based autonomous flight.
The paper is structured as follows. In Section 2, we compare our study with related work. In Section 3, we provide the background and fundamental theories relevant to our study. In Section 4, we present the proposed MGH framework in detail. In Section 5, we describe the experimental setup and report the experimental results. Section 6 offers a discussion of our findings, and Section 7 discusses threats to validity. Finally, Section 8 concludes the paper.
2. Related Work
Model-based reinforcement learning (MBRL) enables efficient learning while minimizing interactions with the real environment, making it an especially important research topic in areas where data collection is challenging. In the aerospace domain, the scarcity of real flight data necessitates reinforcement learning methods with high sample efficiency, and in complex environments such as drone flight control, accurate environmental modeling is even more critical. Against this backdrop, various researchers have proposed methods to enhance both the sample efficiency and the accuracy of environmental models in MBRL.
Zhao et al. [24] proposed a novel method that integrates Conditional Generative Adversarial Networks (CGANs) into MBRL to improve sample efficiency. They trained the state transition model of the environment by using a CGAN and enhanced training stability with a Wasserstein GAN (WGAN). Similarly, Charlesworth and Montana [25] introduced PlanGAN, which employs GANs in sparse-reward, multi-goal environments to generate trajectories that help an agent achieve its goals. These studies focus on leveraging GANs to enhance the accuracy of environmental models and maximize sample efficiency.
Meanwhile, researchers have also applied Hindsight Experience Replay (HER) to MBRL to boost sample efficiency. Yang et al. [26] proposed Model-based Hindsight Experience Replay (MHER), which uses the environmental model to generate virtual goals and thereby introduces a more efficient goal re-labeling method. Huang et al. [27] combined model-based reinforcement learning with experience replay techniques in MRHER to effectively handle sparse rewards in continuous object manipulation tasks. These studies contribute to improving data efficiency by integrating HER with MBRL.
Research aimed at strengthening the theoretical foundation of MBRL and enhancing sample efficiency is also noteworthy. Luo et al. [28] proposed an algorithmic framework that guarantees convergence with nonlinear dynamic models and increases sample efficiency through Stochastic Lower Bounds Optimization (SLBO). Li et al. [29] addressed data scarcity in offline reinforcement learning by incorporating a pessimism principle to minimize sample complexity. Wang et al. [30] presented a Conservative Model-Based Actor–Critic (CMBAC) method to compensate for model inaccuracies and enhance sample efficiency. Additionally, Sun et al. [31] proposed the MOBILE algorithm, which utilizes model-Bellman discrepancies to improve policy learning stability in offline reinforcement learning, while Ji et al. [32] reduced unnecessary model updates and increased sample efficiency by dynamically determining the timing of model updates with Constrained Model-shift Lower-bound Optimization (CMLO).
Other studies have focused on improving performance through enhanced reward prediction and internal state representations. Lee et al. [33] proposed DREAMSMOOTH, which improves reward prediction accuracy via reward smoothing and enhances sample efficiency in sparse-reward environments. Scholz et al. [34] improved the internal state representation of the MuZero algorithm by using self-supervised learning to boost sample efficiency, and Ma et al. [35] increased sample efficiency by automatically balancing observation modeling and reward modeling through Harmony World Models.
MBRL has also been actively applied to drone flight control. Becker-Ehmck et al. [36] used a latent state space model based on a Variational Autoencoder (VAE) to improve sample efficiency in drone flight control. Lambert et al. [37] combined a neural network-based dynamics model with Model Predictive Control (MPC) to perform low-level control of quadrotors with limited data. Although these studies represent meaningful attempts to address data scarcity in drone flight control, they fall short of simultaneously maximizing both environmental model accuracy and sample efficiency.
In addition, Khalid et al. [38] applied MBRL to other domains, using differentiable ordinary differential equations (ODEs) in quantum control problems to enhance model accuracy and reduce sample complexity. Although their study is not directly related to the aerospace field, its approach of improving sample efficiency by increasing model accuracy is noteworthy.
Beyond these, researchers have worked on integrating imitation learning with MBRL for agile aircraft control using pilot demonstration data. For example, Sever et al. [39] proposed a hybrid approach that unifies imitation learning, transfer learning, and reinforcement learning to address limited pilot data in high-agility maneuvers. They demonstrated that leveraging both simulation-generated proxy data and a small amount of real pilot data can robustly adapt to aircraft parameter changes while preserving maneuver stability.
Overall, various researchers have striven to enhance sample efficiency and environmental model accuracy in MBRL. However, most previous research has focused primarily on either improving model accuracy via GANs or boosting sample efficiency through HER. In this study, we aim to overcome these limitations by integrating HER and GANs within MBRL to address data scarcity in the aerospace domain and simultaneously enhance both sample efficiency and the accuracy of environmental models in drone flight control environments.
By leveraging GANs to improve the accuracy of environmental models, our approach enables precise environmental simulation even when real flight data are scarce. This strategy reinforces the effectiveness of GANs as demonstrated in the studies by Zhao et al. [24] and Charlesworth and Montana [25]. At the same time, employing HER to recycle unsuccessful experiences and boost sample efficiency builds upon the work of Yang et al. [26] and Huang et al. [27]. By combining these two strategies, we aim to comprehensively address issues that previous researchers treated separately.
Therefore, in this study, we propose a novel approach that integrates HER and a GAN within MBRL to simultaneously enhance sample efficiency and environmental model accuracy. We expect that this method will make a significant contribution to reinforcement learning in the aerospace domain, particularly in applications such as drone flight control where real data are scarce.
3. Background
In this chapter, we provide a detailed introduction to the key theories and technologies that form the basis of this study. Section 3.1 reviews the fundamental concepts of reinforcement learning. Section 3.2 examines the deep reinforcement learning algorithm DDPG, while Section 3.3 analyzes the principles and limitations of model-based reinforcement learning. Additionally, Section 3.4 and Section 3.5 discuss the principles of GANs and HER and their roles in reinforcement learning, respectively.
3.1. Reinforcement Learning
Reinforcement Learning (RL) addresses sequential decision-making problems that are typically formulated as a Markov Decision Process (MDP). An agent observes a state $s_t$, takes an action $a_t$ according to a policy $\pi$, transitions to $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$, and receives a reward $r_t$ from the reward function $R$. The goal is to find the optimal policy $\pi^*$ that maximizes the expected return $J(\pi)$:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right],$$
where $\tau$ is the trajectory sampled from the policy $\pi$, $\gamma \in [0, 1)$ is the discount factor, $r_t$ is the reward at time step $t$, and $T$ is the horizon.
3.2. Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient (DDPG) tackles continuous action spaces by combining actor–critic methods with neural network function approximators. The policy network (actor) $\mu_{\phi}(s)$ deterministically outputs actions, while the value network (critic) $Q_{\theta}(s, a)$ estimates the corresponding action value. DDPG stabilizes training via an experience replay buffer of $(s_t, a_t, r_t, s_{t+1})$ tuples and target networks for both the actor and the critic. Although successful in many tasks, DDPG can require a large number of real-environment interactions due to its model-free nature. The DDPG algorithm is shown in Algorithm 1.
Algorithm 1 DDPG
1: Initialize actor parameters φ and critic parameters θ.
2: Initialize target networks with parameters φ′ ← φ, θ′ ← θ.
3: Initialize replay buffer D.
4: for each episode do
5:   Reset environment and get initial state s_1.
6:   for each step t = 1 to T do
7:     Select action a_t = μ_φ(s_t) + ε, where ε ∼ N(0, σ).
8:     Execute action a_t and observe reward r_t and next state s_{t+1}.
9:     Store transition (s_t, a_t, r_t, s_{t+1}) in D.
10:    if size(D) > MinBatchSize then
11:      Sample a random mini-batch of N transitions (s_i, a_i, r_i, s′_i) from D.
12:      Compute targets y_i = r_i + γ Q_{θ′}(s′_i, μ_{φ′}(s′_i)).
13:      Update critic θ by minimizing L(θ) = (1/N) Σ_i (y_i − Q_θ(s_i, a_i))².
14:      Update actor φ via the deterministic policy gradient (1/N) Σ_i ∇_a Q_θ(s_i, a)|_{a=μ_φ(s_i)} ∇_φ μ_φ(s_i).
15:      Soft-update target networks: θ′ ← τθ + (1 − τ)θ′, φ′ ← τφ + (1 − τ)φ′.
16:    end if
17:    Update state s_t ← s_{t+1}.
18:  end for
19: end for
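To make the update rules in Algorithm 1 concrete, the following is a minimal PyTorch sketch of one DDPG update step: critic regression toward the bootstrapped target, the deterministic policy gradient for the actor, and soft target updates. The network sizes, learning rates, and toy mini-batch are illustrative assumptions, not the configuration used in our experiments.

# Minimal DDPG update sketch (illustrative hyperparameters and toy data).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA, TAU = 8, 3, 0.99, 0.005

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor, critic = mlp(STATE_DIM, ACTION_DIM, nn.Tanh()), mlp(STATE_DIM + ACTION_DIM, 1)
actor_t, critic_t = mlp(STATE_DIM, ACTION_DIM, nn.Tanh()), mlp(STATE_DIM + ACTION_DIM, 1)
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    # Critic: regress Q(s, a) toward the target y = r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = r + GAMMA * critic_t(torch.cat([s_next, actor_t(s_next)], dim=-1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the deterministic policy gradient by maximizing Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks.
    for net, tgt in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)

# Toy mini-batch standing in for samples from the replay buffer D.
batch = 32
ddpg_update(torch.randn(batch, STATE_DIM), torch.rand(batch, ACTION_DIM) * 2 - 1,
            torch.zeros(batch, 1), torch.randn(batch, STATE_DIM))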
3.3. Model-Based Reinforcement Learning
In Model-Based Reinforcement Learning (MBRL), one learns an approximate environment model $\hat{P}(s_{t+1} \mid s_t, a_t)$. The environmental model is expressed as follows:
$$\hat{P}_{\theta}(s_{t+1} \mid s_t, a_t) \approx P(s_{t+1} \mid s_t, a_t).$$
The model is used to simulate imaginary rollouts, thereby reducing real-environment sampling. The policy can then be improved using both real data and model-generated data. However, bias arises if $\hat{P}$ deviates significantly from the true dynamics $P$. This deviation may accumulate over multiple time steps, potentially degrading learning stability or performance.
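The following is a minimal sketch of how a learned dynamics model can generate short imaginary rollouts to supplement real transitions. The linear toy model, random-free policy, and list buffer are illustrative assumptions, not the implementation used in this paper.

# Imaginary rollouts from a learned model (toy stand-ins for model, policy, buffer).
import numpy as np

class ToyModel:
    """Stands in for the learned model: predicts the next state and reward."""
    def predict(self, s, a):
        return s + 0.1 * a, 0.0   # placeholder dynamics and sparse reward

def imaginary_rollouts(model, policy, start_states, horizon, buffer):
    """Branch model-based rollouts from real states and store synthetic transitions."""
    for s in start_states:
        for _ in range(horizon):
            a = policy(s)                    # action from the current policy
            s_next, r = model.predict(s, a)  # learned model replaces the real environment
            buffer.append((s, a, r, s_next))
            s = s_next

buffer = []
imaginary_rollouts(ToyModel(), lambda s: np.clip(-s, -1, 1),
                   start_states=[np.zeros(3), np.ones(3)], horizon=5, buffer=buffer)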
3.4. Generative Adversarial Networks
Generative Adversarial Networks (GANs) consist of two neural networks—the generator ($G$) and the discriminator ($D$)—which compete adversarially during training. The generator $G$ accepts a noise vector $z$, sampled from the latent distribution $p_z(z)$, as input and produces a fake data sample $G(z)$. The discriminator $D$ determines whether an input $x$ comes from the real data distribution $p_{data}(x)$ or from the generated data distribution $p_g(x)$. To accomplish this, the GAN is trained by having the generator and discriminator play a mini-max game, which is expressed by the following value function $V(D, G)$:
$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big].$$
The discriminator is trained to output 1 for real data $x$ and 0 for generated data $G(z)$. Meanwhile, the generator seeks to trick the discriminator by maximizing $D(G(z))$. The optimal discriminator $D^{*}$ for a given generator $G$ is derived as follows:
$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}.$$
The precise data representation capabilities of GANs can be leveraged in model-based reinforcement learning to enhance the accuracy of the environmental model.
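The following is a minimal PyTorch sketch of one adversarial update of the mini-max game above, using the common non-saturating generator loss. The 1-D toy data, network sizes, and learning rates are illustrative assumptions.

# One GAN update step: discriminator then generator (illustrative toy setup).
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM = 16, 8
G = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(), nn.Linear(64, DATA_DIM))
D = nn.Sequential(nn.Linear(DATA_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(real_batch):
    n = real_batch.size(0)
    fake = G(torch.randn(n, NOISE_DIM))

    # Discriminator: push D(x) toward 1 on real data and D(G(z)) toward 0 on fakes.
    d_loss = bce(D(real_batch), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator, i.e., push D(G(z)) toward 1.
    g_loss = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

gan_step(torch.randn(32, DATA_DIM))  # toy "real" batch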
3.5. Hindsight Experience Replay
Hindsight Experience Replay (HER) is a technique that improves learning efficiency by reusing experiences gathered through an agent’s interactions with the environment for various goals. Conventional reinforcement learning only learns when a reward is provided, rendering trajectories without rewards unusable. As a result, in environments with sparse rewards, most experiences go unused, which severely diminishes sample efficiency. HER addresses this issue by reinterpreting the states reached during an episode as if they had been the goal, thereby assigning a corresponding success reward. This approach effectively transforms the problem into multi-goal reinforcement learning, where—unlike in traditional reinforcement learning—a goal $g$ from the goal space $\mathcal{G}$ is incorporated as a parameter into the policy, value function, and reward function, redefining them as follows:
$$\pi(s_t, g), \qquad Q(s_t, a_t, g), \qquad r_t = R(s_t, a_t, g),$$
where $\pi(s_t, g)$ is the policy, $Q(s_t, a_t, g)$ is the action-value function, and $R(s_t, a_t, g)$ is the reward function.
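The following is a minimal sketch of "final"-style HER re-labeling on one episode. The transition layout and the goal-extraction function are illustrative assumptions rather than our exact implementation.

# HER re-labeling sketch: treat the final achieved state as the goal and re-reward.
def her_relabel(episode, buffer, achieved_goal):
    """Re-label an episode's transitions with a goal that was actually reached.

    episode: list of (s, a, r, s_next, g) transitions that failed to reach goal g.
    achieved_goal: maps a state to the goal it represents (e.g., the drone position).
    """
    new_goal = achieved_goal(episode[-1][3])          # treat the final state as the goal
    for s, a, _, s_next, _ in episode:
        reached = achieved_goal(s_next) == new_goal   # sparse success test under the new goal
        r_new = 1.0 if reached else 0.0
        buffer.append((s, a, r_new, s_next, new_goal))  # relabeled, now partly "successful"

# Toy usage: states are integers, and the goal of a state is the state itself.
episode = [(0, +1, 0.0, 1, 9), (1, +1, 0.0, 2, 9), (2, +1, 0.0, 3, 9)]
buffer = []
her_relabel(episode, buffer, achieved_goal=lambda s: s)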
4. Proposed Method
In this chapter, we propose the MGH framework, which integrates a GAN and HER into model-based reinforcement learning (MBRL). First, we detail the structure and components of the framework and present the training algorithm. Next, we explain in detail how the GAN and HER are integrated, interact, and fulfill their respective roles. Finally, we demonstrate the convergence and stability of the proposed method by theoretically establishing a performance lower bound and mathematically proving the impact of the GAN and HER on performance.
4.1. Overview of MGH Framework
The MGH framework aims to achieve data-efficient reinforcement learning in real-world environments by integrating a GAN and HER into model-based reinforcement learning. Its main components include an Actor–Critic module, a GAN module, a HER module, and an Experience Replay Buffer. In this framework, the Actor–Critic module utilizes an off-policy algorithm to take advantage of the Experience Replay Buffer.
Figure 1 illustrates the overall structure of the MGH framework.
First, the GAN module comprises a generator and a discriminator. The generator $G$ takes the state $s_t$ and action $a_t$ as input and learns the state transition model to predict the next state, $\hat{s}_{t+1} = G(s_t, a_t)$. Subsequently, the discriminator $D$ distinguishes between the real transition $(s_t, a_t, s_{t+1})$ and the generated transition $(s_t, a_t, \hat{s}_{t+1})$. The generator learns to trick the discriminator into classifying its fake data as genuine environmental data, while the discriminator learns to accurately differentiate between the generated data and the actual environmental data. Through this adversarial training process, the GAN module attains a highly accurate environmental model.
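The following is a minimal PyTorch sketch of this transition GAN: unlike the generic noise-conditioned GAN of Section 3.4, the generator here predicts $s_{t+1}$ from $(s_t, a_t)$, and the discriminator scores whole transitions. Vector state/action sizes and MLP architectures are illustrative assumptions; in our setting the state is an image, so convolutional networks would be used instead.

# Transition GAN sketch: generator predicts s_{t+1}; discriminator scores (s_t, a_t, s_{t+1}).
import torch
import torch.nn as nn

S_DIM, A_DIM = 8, 3
G = nn.Sequential(nn.Linear(S_DIM + A_DIM, 128), nn.ReLU(), nn.Linear(128, S_DIM))
D = nn.Sequential(nn.Linear(S_DIM + A_DIM + S_DIM, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def transition_gan_step(s, a, s_next):
    s_fake = G(torch.cat([s, a], dim=-1))            # predicted next state
    real = torch.cat([s, a, s_next], dim=-1)
    fake = torch.cat([s, a, s_fake], dim=-1)
    n = s.size(0)

    # Discriminator: real transitions -> 1, generated transitions -> 0.
    d_loss = bce(D(real), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: make generated transitions indistinguishable from real ones.
    g_loss = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

transition_gan_step(torch.randn(32, S_DIM), torch.rand(32, A_DIM) * 2 - 1,
                    torch.randn(32, S_DIM))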
The Actor–Critic module consists of an actor and a critic, where the actor approximates the policy and the critic approximates the value function. The actor $\mu$ produces the optimal action $a_t$ for a given state $s_t$ and goal $g$, while the critic $Q$ evaluates the value of the state–action–goal pair that includes the action provided by the actor. In this setup, the actor is trained by using a policy-based approach and the critic is trained by using a value-based approach, thereby integrating the strengths of both methods to facilitate effective learning.
The HER module reassigns various goals to both the agent’s experienced trajectories and those generated by the environmental model. This allows the agent to learn from experiences that did not originally reach the desired goal, thereby enhancing sample efficiency during training.
Finally, the Experience Replay Buffer $D$ stores data collected from the real environment as well as data generated by the environmental model. The stored data are sampled in minibatches to train both the actor and the critic. Since the GAN structure generally requires a large amount of data for training, the Experience Replay Buffer is initially populated with data obtained from expert demonstrations.
4.2. Learning Algorithm
The MGH framework adopts Model-Based Reinforcement Learning (MBRL) as its foundation and integrates a Generative Adversarial Network (GAN) for environment modeling and Hindsight Experience Replay (HER) for improved sample efficiency. Building on the theoretical analysis in Section 4.3, which demonstrates how the GAN and HER jointly reduce model bias and enhance reward utilization, this section provides a procedural overview of how the MGH framework operates in practice.
4.2.1. GAN–HER Integration
In this framework, the generator $G$ and the discriminator $D$ are trained adversarially so that virtual transitions $(s_t, a_t, \hat{s}_{t+1})$ produced by $G$ closely resemble real transitions $(s_t, a_t, s_{t+1})$. The real data are collected through direct interaction with the environment, while the GAN-based virtual data supplement these real transitions in the replay buffer. The objective is to ensure that $\hat{s}_{t+1}$ remains realistic enough to prevent large model bias when learning the policy. As discussed in Section 4.3, a more accurate internal model tightens the performance bounds by reducing the divergence from the true dynamics $P$.
Concurrently, HER reassigns goals after each episode by designating future states as new goals and labeling these as successful transitions. This approach effectively increases the density of reward signals, addressing the scarcity issue commonly observed in sparse-reward tasks. By retrospectively treating previously failed attempts as successful under alternate goals, HER enables the agent to reuse a broader range of experience for policy improvement. Thus, the dual benefits of improved model fidelity (via GAN) and richer reward feedback (via HER) collectively boost the learning speed and sample efficiency of the MGH framework.
4.2.2. Algorithmic Flow
The learning process proceeds in five phases. First, both the GAN parameters $w_G, w_D$ and the actor–critic parameters $\phi, \theta$ are randomly initialized, and additional expert demonstration data are stored in the replay buffer $D$ to stabilize early training. Next, the agent collects real transitions $(s_t, a_t, r_t, s_{t+1})$ by executing actions sampled from the policy, perturbed by Gaussian exploration noise, and records the resulting tuples in $D$. Simultaneously, the generator produces virtual transitions $(s_t, a_t, \hat{r}_t, \hat{s}_{t+1})$, which are also stored in $D$ to provide supplementary data.
At the end of each episode, a GAN update stage occurs in which minibatches of real and generated transitions are sampled from $D$. The discriminator is optimized to distinguish real from generated transitions, while the generator aims to fool the discriminator with increasingly realistic outputs. This procedure gradually refines the internal model, narrowing the gap between the true dynamics $P$ and the learned dynamics $\hat{P}$.
Subsequently, HER re-labeling is applied by selecting future states from each sampled trajectory as new goals, assigning a success reward for reaching those states, and storing these artificially successful experiences in $D$. This re-labeling step effectively transforms trajectories that originally failed to reach the actual goal into positive experiences for a different goal. Finally, actor–critic learning updates both the critic, $\theta$, to minimize temporal difference errors, and the actor, $\phi$, to maximize the expected return. The target networks $\theta'$ and $\phi'$ receive soft updates to ensure stable training.
By repeating these phases across multiple episodes, the agent gains access to both real and GAN-generated transitions that are further enriched by HER. The practical steps of this unified procedure are shown in Algorithm 2, where lines 6–7 illustrate an initial batch of training using expert demonstrations, lines 14–15 insert real and generated data into the buffer, lines 18–22 detail the adversarial update of the GAN, lines 23–27 reassign goals via HER, and lines 28–33 carry out standard off-policy actor–critic updates.
Algorithm 2 MGH framework.
1: Initialize generator parameters w_G and discriminator parameters w_D.
2: Initialize actor parameters φ and critic parameters θ.
3: Initialize target networks with parameters φ′ ← φ, θ′ ← θ.
4: Initialize goal g.
5: Initialize experience replay buffer D with expert-collected data.
6: Train the generator and discriminator using data from D.
7: Train the actor and critic using data from D.
8: for each episode do
9:   Reset environment and get initial state s_1.
10:  for each step t = 1 to T do
11:    Select action a_t = μ_φ(s_t, g) + ε, where ε ∼ N(0, σ).
12:    Execute action a_t and observe reward r_t and next state s_{t+1}.
13:    Store transition (s_t, a_t, r_t, s_{t+1}, g) in D.
14:    Predict next state ŝ_{t+1} and reward r̂_t with the generator.
15:    Store transition (s_t, a_t, r̂_t, ŝ_{t+1}, g) in D.
16:    Update state s_t ← s_{t+1}.
17:  end for
18:  for k = 1 to N_GAN do
19:    Sample minibatch of transitions from D.
20:    Update the discriminator w_D to discriminate real and fake transitions.
21:    Update the generator w_G to generate realistic transitions.
22:  end for
23:  for k = 1 to N_HER do
24:    Sample a sequence consisting of (s_t, a_t, r_t, s_{t+1}, g) from D.
25:    Apply HER to create new goal g′ and reward r′.
26:    Store each relabeled transition (s_t, a_t, r′, s_{t+1}, g′) in D.
27:  end for
28:  for k = 1 to N_AC do
29:    Sample minibatch from D.
30:    Update critic θ by minimizing the temporal difference loss.
31:    Update actor φ via the deterministic policy gradient.
32:    Update target networks θ′, φ′ with soft updates.
33:  end for
34: end for
4.3. Theoretical Justification for MGH Framework
To theoretically establish the performance benefits of the MGH framework, we analyze the improvement in sample efficiency achieved by incorporating the GAN and HER within the Model-Based Reinforcement Learning (MBRL) paradigm.
4.3.1. Model Error Bound
In an MDP with discount factor $\gamma$, the difference in expected returns when using an imperfect model $\hat{P}$ instead of the true dynamics $P$ can be bounded. A standard multi-step analysis [40] shows that any local difference in transition probabilities can compound over repeated rollouts. Concretely, if $\hat{\pi}$ is the policy trained under $\hat{P}$, then the gap between the return under the true dynamics and the return predicted by the model can be bounded as follows.
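With $R_{\max}$ denoting the reward bound and $\epsilon_m = \max_{s,a} D_{TV}\big(P(\cdot \mid s, a) \,\|\, \hat{P}(\cdot \mid s, a)\big)$ the worst-case total variation model error, a standard form of this bound (the exact constants depend on the analysis in [40]) is
$$\big| J_{P}(\hat{\pi}) - J_{\hat{P}}(\hat{\pi}) \big| \le \frac{2\gamma R_{\max}}{(1-\gamma)^{2}}\, \epsilon_m .$$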
The factor $\gamma/(1-\gamma)^{2}$ arises because, at each step, the agent’s return is discounted by $\gamma$, but the errors can propagate through future time steps—thus incurring a geometric series whose partial sum is on the order of $1/(1-\gamma)^{2}$. This derivation confirms that lowering the total variation distance $\epsilon_m$ across state–action pairs will tighten the upper bound on the performance loss.
When the GAN is used to learn $\hat{P}$, the generator is adversarially trained to produce transitions $(s_t, a_t, \hat{s}_{t+1})$ that closely match real transitions, thereby minimizing $\epsilon_m$. Substituting this smaller $\epsilon_m$ into the bound yields a tighter guarantee compared with naive function approximators, providing a solid theoretical rationale for our GAN-based model.
4.3.2. Joint Sample Efficiency from GAN and HER
First, let $p$ be the probability of “success” under the original reward structure. In a sparse-reward environment, $p$ might be very small, leading to a high sample complexity on the order of $O(1/p)$. HER re-labels each real trajectory $k$ times, increasing the effective success rate to the following:
$$p_{\mathrm{HER}} \approx (k+1)\, p,$$
thus reducing the sample complexity from $O(1/p)$ to $O\!\left(\tfrac{1}{(k+1)p}\right)$. However, model-based methods also gain efficiency by simulating extra transitions. Suppose each real transition can generate on average $m$ synthetic transitions from the GAN model. Then, for every real step, the agent effectively observes $m$ additional transitions in the replay buffer. If we treat these synthetic transitions as having the same success probability (or re-labeled success probability) $p_{\mathrm{HER}}$—an approximation that holds if the learned model $\hat{P}$ is sufficiently accurate—then each real step yields $(1+m)$-fold more “successful” data.
Hence, the overall number of successes per real interaction roughly scales as $(1+m)(k+1)\,p$. Since RL sample complexity depends inversely on the fraction of successful experiences, we can write a new effective sample complexity order:
$$O\!\left(\frac{1}{(1+m)(k+1)\,p}\right),$$
demonstrating that both the GAN model, through $m$, and HER, through $k$, synergistically reduce the needed real-environment interactions. In practice, $m$ reflects how many reliable virtual rollouts we can generate per real step, and $k$ denotes the number of HER re-labelings per trajectory. As $m$ and $k$ grow, the sample complexity decreases, consistent with our empirical findings. This combination underlies the MGH framework’s advantage in data-scarce, sparse-reward tasks.
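As an illustrative numeric example (hypothetical values, not measured from our experiments): with a raw success probability of $p = 0.01$, $k = 4$ HER re-labelings per trajectory, and $m = 3$ reliable synthetic transitions per real step, the effective fraction of successful experiences grows from $0.01$ to $(1+3)(4+1)(0.01) = 0.2$, so the number of required real-environment interactions shrinks by roughly a factor of 20.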
5. Experiments
In this section, we describe the experiments conducted to evaluate the performance of the proposed MGH framework by comparing it with the representative deep reinforcement learning algorithm DDPG. The experiments were designed to address three research questions and were carried out in an actual drone environment. We objectively analyze the performance of DDPG, DDPG-HER, DDPG-GAN, and MGH-DDPG by using various evaluation metrics, and we validate the data efficiency and learning performance of the MGH framework based on the results.
5.1. Research Questions
5.1.1. RQ1: How Much Data Does the MGH Framework Require for Training?
It is essential to verify whether the quantity of data required by the proposed MGH framework for convergence is practically sufficient, even if it is lower than that required by conventional reinforcement learning methods. Therefore, we compare the learning convergence speed and mission success rate of the MGH framework when trained with different amounts of data: 2K (2000), 5K (5000), 8K (8000), and 10K (10,000) samples.
5.1.2. RQ2: How Does the MGH Framework Perform Compared with Conventional Reinforcement Learning in Data-Sparse Environments?
In this study, we primarily aim to enhance data efficiency in real-world environments where collecting flight data is challenging due to high time and cost constraints. To this end, we compare the learning convergence speed and mission success rate of the proposed MGH framework with those of DDPG, DDPG-HER, and DDPG-GAN—where DDPG-HER and DDPG-GAN integrate HER and GANs with DDPG, respectively—after training with 10K samples.
5.1.3. RQ3: How Accurate Is the GAN in Mimicking the Environment?
We incorporated a GAN to improve the accuracy of the environmental model, thereby generating experiences that closely resemble those obtained through real-world interactions. To evaluate this, we compare the geometric transformation from $s_t$ to $s_{t+1}$ in the real transition $(s_t, a_t, s_{t+1})$ with the transformation from $s_t$ to the generated state $\hat{s}_{t+1}$ produced by the GAN. Since the state is represented as an image, we extract feature points by using the Scale-Invariant Feature Transform (SIFT) [41] and quantify the geometric differences between the two images by comparing their homography matrices.
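The following is a minimal OpenCV sketch of this comparison: estimate the homography from $s_t$ to the real $s_{t+1}$ and to the GAN-generated next frame, then compare the two transformations. The toy frames and the simple matrix-difference comparison at the end are illustrative assumptions; the reprojection-error-based measure actually used is described in Section 5.3.3.

# SIFT + RANSAC homography estimation between consecutive frames (toy data).
import cv2
import numpy as np

def homography(img_a, img_b, min_matches=10):
    """Estimate the homography mapping img_a onto img_b using SIFT keypoints."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_a, des_b, k=2)
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    if len(good) < min_matches:
        return None
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

# Toy frames standing in for s_t, the real s_{t+1}, and the GAN-generated next frame.
rng = np.random.default_rng(0)
s_t = cv2.GaussianBlur((rng.random((480, 640)) * 255).astype(np.uint8), (5, 5), 0)
s_next_real = np.roll(s_t, 12, axis=1)   # camera appears to shift to the right
s_next_gan = np.roll(s_t, 10, axis=1)    # GAN prediction with a slightly different shift

H_real, H_gan = homography(s_t, s_next_real), homography(s_t, s_next_gan)
if H_real is not None and H_gan is not None:
    print("Frobenius difference of homographies:", np.linalg.norm(H_real - H_gan))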
5.2. Experimental Environment
5.2.1. Quadcopter
The quadcopter used in the experiment is the DJI RMTT, weighing approximately 80 g with a maximum flight speed of 8 m/s. It captures video at 720p resolution with a field of view (FoV) of 82.6° by using an integrated monocular camera. In addition, it can hover indoors without GPS by utilizing its bottom-mounted infrared (IR) sensor and optical flow sensor. The onboard Inertial Measurement Unit (IMU) enables the monitoring of the quadcopter’s speed and acceleration. Control of the quadcopter is executed by a controller connected to a server via Wi-Fi, where both the learning and control algorithms run.
5.2.2. Flight Mission
The experiments were conducted in an indoor environment with dimensions of 3 m (width) × 5 m (length) × 2.5 m (height), as shown in Figure 2. In this setting, the quadcopter’s mission was to fly toward a target located at the far end of the area, ensuring that the center of the target aligned with the center of the camera view and that the target’s vertical dimension matched the height of the camera frame. Any deviation from the designated experimental area was considered a collision or mission failure.
5.2.3. Pre-Collected Data
In the aforementioned environment, a human pilot directly controlled the quadcopter to collect data. Each collected sample included the current image, speed, and acceleration; after the pilot issued a control command, the next image, speed, and acceleration were recorded and combined with it into a single data sample. The sampling interval was 0.2 s, and approximately 10K samples were collected over about 30 min of flight. Additionally, upon mission success, each sample was assigned a reward through object detection based on YOLOv8 [42]. This setup represents a very-sparse-reward environment, and the total quantity of data is considerably lower than what is typically used in model-free reinforcement learning.
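The following is a hedged sketch of how a YOLOv8 detector could assign the sparse reward: reward 1 only if the detected target is centered in the frame and tall enough relative to the image, mirroring the mission criteria above. The model weights, target class id, and thresholds are assumptions, not the paper’s exact criterion.

# Sparse reward via YOLOv8 detection (assumed weights, class id, and thresholds).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # assumed pretrained weights; the actual detector may differ
TARGET_CLASS = 0             # assumed class id of the target object
CENTER_TOL, MIN_HEIGHT_RATIO = 0.1, 0.9

def sparse_reward(image_path):
    result = model(image_path)[0]
    h, w = result.orig_shape
    for box in result.boxes:
        if int(box.cls) != TARGET_CLASS:
            continue
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        centered = abs(((x1 + x2) / 2 - w / 2) / w) < CENTER_TOL
        tall_enough = (y2 - y1) / h > MIN_HEIGHT_RATIO
        if centered and tall_enough:
            return 1.0   # mission success
    return 0.0           # otherwise, no reward (sparse setting)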
To mitigate potential variations arising from battery depletion, temperature drift, or sensor noise, we conducted the data-collection flights in short, consistent sessions of about 5 min each. After each flight session, we replaced or recharged the drone’s battery to keep power levels as consistent as possible. Furthermore, we carefully tuned hyperparameters with small grid searches and ran multiple flight trajectories under similar environmental conditions to reduce confounding factors. By doing so, we sought to minimize the influence of uncontrolled variables on both the data collection and the reinforcement learning outcomes.
5.2.4. MDP Modeling
We define the state and action spaces, the sparse reward function, and the episode termination conditions. Table 1 summarizes these elements. We represent each transition in the form $(s_t, a_t, r_t, s_{t+1}, g)$, where the state $s_t$ includes both an RGB image (downsampled to 64 × 64) and the quadcopter’s IMU readings (speed and acceleration). The action $a_t$ is a 3-dimensional continuous vector (roll, pitch, throttle) over a bounded continuous range. At each step, the agent receives a sparse reward, and the episode ends upon mission success, collision, or reaching the time limit.
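The following is a minimal sketch of the MDP interface implied by Table 1, written with Gymnasium spaces. The exact bounds, observation keys, and step limit are illustrative assumptions.

# Illustrative MDP specification (assumed ranges, keys, and step limit).
import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    "image": spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8),    # downsampled RGB frame
    "imu": spaces.Box(low=-np.inf, high=np.inf, shape=(6,), dtype=np.float32),  # assumed speed/acceleration
})
# 3-D continuous command: roll, pitch, throttle (normalized range is an assumption).
action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

def reward_fn(mission_success: bool) -> float:
    return 1.0 if mission_success else 0.0   # sparse reward

def done_fn(mission_success: bool, collided: bool, step: int, max_steps: int = 200) -> bool:
    return mission_success or collided or step >= max_steps   # termination conditions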
5.2.5. MGH Framework
Our framework integrates DDPG (actor–critic) with a GAN-based environment model and Hindsight Experience Replay (HER). We chose DDPG because it is the most basic off-policy algorithm for continuous action spaces, which makes it easier to observe the effects of HER and the GAN directly, with minimal influence from the algorithm itself.
Table 2 summarizes the main hyperparameters used throughout training, and Table 3 provides a concise overview of each neural network architecture (actor, critic, generator, and discriminator).
In our implementation, the Actor maps the quadcopter’s visual observations to continuous control commands, and the Critic evaluates their corresponding Q-values for policy optimization. The Generator uses the current state and action to synthesize the next image, while the Discriminator learns to distinguish genuine transitions from generated ones. By training these modules jointly, we augment limited real-flight data with realistic synthetic experiences and apply HER to effectively handle the sparse-reward setting. This combination improves data efficiency and accelerates convergence in the quadcopter navigation task.
5.3. Experimental Results
5.3.1. RQ1: How Much Data Does the MGH Framework Require for Training?
To address this research question, we evaluated the performance of each model after running reinforcement learning on data that a human pilot had previously collected by controlling the quadcopter. The training data sizes for the models were 2K (2000), 5K (5000), 8K (8000), and 10K (10,000) samples. Since a reward of 1 was given only upon mission success, the rewards in the graphs range between 0 and 1.
The experimental results are shown in Figure 3. The MGH framework, when trained with 10K samples, converged around the 2K mark, and with 8K samples, it converged around the 4K mark. In contrast, the models trained with 5K and 2K samples failed to converge, which indicates that the mission itself was unsuccessful. We attribute this to the GAN not being properly trained because of the initial lack of sufficient data, so that erroneous model predictions continued to be used during training and prevented convergence.
5.3.2. RQ2: How Does the MGH Framework Perform Compared with Conventional Reinforcement Learning in Data-Sparse Environments?
To address this question, we compared the performance of the MGH framework with that of DDPG, DDPG-HER, and DDPG-GAN after training each model with 10K samples. The hyperparameters for each model were set to the same values as those used in the MGH framework.
The experimental results are shown in Figure 4. The MGH framework converged at approximately 2K samples, while DDPG-HER converged at around 6.8K samples. This demonstrates that, compared with DDPG-HER, the MGH framework’s convergence speed in this environment is 70.59% faster. Although DDPG-GAN showed a trend toward convergence, it did not reach a convergence point within the 10K sample range. Meanwhile, DDPG failed to show any sign of convergence and oscillated, which we attribute to the sparse rewards in the mission environment and insufficient flight data.
5.3.3. RQ3: How Accurate Is the GAN in Mimicking the Environment?
To address this question, we compare the geometric transformation from $s_t$ to $s_{t+1}$ in the real transition $(s_t, a_t, s_{t+1})$ with the transformation from $s_t$ to the GAN-generated state $\hat{s}_{t+1}$. We perform this comparison by extracting the homography matrices for each pair of states by using the Scale-Invariant Feature Transform (SIFT) algorithm. We then quantify the geometric discrepancy by computing the average normalized reprojection error over the matched keypoints, defining an error measure $E$ that ranges from 0 (identical) to 1 (completely different):
$$E = \frac{1}{N}\sum_{i=1}^{N}\min\!\left(\frac{e_i}{\epsilon}, 1\right).$$
Here, $E$ denotes the geometric discrepancy between the two images (e.g., $s_{t+1}$ and $\hat{s}_{t+1}$). $N$ represents the total number of matched keypoints used in the homography estimation. $e_i$ is the reprojection error for the $i$th keypoint pair, computed as the Euclidean distance between the transformed keypoint and its corresponding keypoint in the target image. $\epsilon$ is a predefined threshold that normalizes the error, ensuring that errors greater than or equal to $\epsilon$ yield a normalized value of 1.
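The following is a minimal sketch of this error measure: the average normalized reprojection error of matched SIFT keypoints under the estimated homography. The threshold value is an assumption; the exact $\epsilon$ used in our experiments is not restated here.

# Average normalized reprojection error E (toy usage with an identity homography).
import numpy as np

def reprojection_error_measure(H, src_pts, dst_pts, eps=5.0):
    """E in [0, 1]: 0 means the transformed keypoints coincide with their matches."""
    src_h = np.hstack([src_pts, np.ones((len(src_pts), 1))])   # homogeneous coordinates
    proj = (H @ src_h.T).T
    proj = proj[:, :2] / proj[:, 2:3]                          # back to Euclidean coordinates
    errors = np.linalg.norm(proj - dst_pts, axis=1)            # e_i for each matched keypoint
    return float(np.mean(np.minimum(errors / eps, 1.0)))

src = np.array([[10.0, 10.0], [50.0, 40.0], [30.0, 60.0]])
dst = src + 1.0
print(reprojection_error_measure(np.eye(3), src, dst))  # small E, i.e., nearly identical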
Figure 5 shows the experimental results. In the figure, the bold line represents the mean of the data, and the dotted lines above and below it indicate one standard deviation. Relative to the black vertical line, the value on the left indicates the magnitude of the standard deviation, and the value on the right indicates the mean. At 2K and 5K, the mean values are 0.76 and 0.59, respectively, indicating that the geometric transformation of the real data differs considerably from that of the GAN-generated data. From 8K onward, however, the error drops dramatically to below 0.2. This suggests that with more data, the GAN better captures features such as edges, resulting in a more accurate representation. Notably, these findings are consistent with the results from RQ1, confirming that the quality of the environmental model learned from the training data ultimately determines the success of model-based reinforcement learning.
Additionally, Figure 6 displays images of the environment generated by the GAN at different data levels (from 2K to 8K, left to right). At 2K, the image is considerably blurry, and when the SIFT algorithm is applied, the extracted feature points differ significantly from those in the other images, resulting in completely different values. Although the image quality improves at 5K, some blurriness persists, leading to a high standard deviation. However, from 8K onward, the images become relatively clear, and the geometric transformations closely match the actual results, resulting in a low error.
6. Discussion
We have proposed the MGH framework, which integrates a GAN and HER into model-based reinforcement learning (MBRL) to enhance data efficiency in reinforcement learning for autonomous flight control. In particular, we explored approaches to maximizing the applicability of reinforcement learning by using real flight data in the aerospace domain. This work offers important insights for practical autonomous flight solutions sought by the aviation industry, as detailed in the following.
First, regarding data sparsity, the MGH framework outperformed conventional model-free reinforcement learning (MF-RL). Collecting flight data from aircraft or drones is limited by safety concerns and costs. The GAN-based environmental modeling and HER-based experience reuse methods integrated in this study maximize the utilization of such limited data and enable stable learning with a relatively small amount of flight data. This is significant because it allows aerospace companies to reduce costs while developing effective autonomous flight systems based on actual flight data.
Second, in terms of real-time control system efficiency, the MGH framework achieved a 70.59% improvement in convergence speed compared with the strongest baseline, DDPG combined with HER. This is a critical performance indicator for autonomous flight control systems that must operate in real time. Moreover, our results confirm that model accuracy significantly affects overall reinforcement learning performance; by enhancing model accuracy with the GAN, our approach reduces the performance gap with the real environment.
Third, our approach also makes a notable contribution to resolving the Sim-to-Real problem. Traditional simulator-based learning often leads to abnormal behavior during real flights due to discrepancies in aerodynamic conditions. In contrast, the proposed GAN-based environmental modeling accurately replicates real flight data, achieving a homography error of less than 0.06. This indicates that our method can serve as a robust foundation for training more precise autonomous flight systems during actual flight tests of aircraft or drones.
Fourth, this study has important implications for enhancing aviation safety. In aerospace systems, instability in learning due to data scarcity or collisions during flight can be catastrophic. By reusing even unsuccessful experiences through HER, the MGH framework provides stable learning under data-sparse conditions. This capability is particularly crucial to the development of Urban Air Mobility (UAM) systems and defense-related autonomous flight systems.
In conclusion, the MGH framework offers significant potential for implementing autonomous flight control systems that are practical in the aviation industry. The reinforcement learning approach that combines a GAN and HER demonstrated high data efficiency and performance even in real-world environments with sparse data, and it is expected to make substantial contributions to future developments in autonomous flight systems and aircraft control technologies.
7. Threats to Validity
7.1. Internal Validity
A potential threat to internal validity arises from factors not fully controlled during data collection and experimentation. For instance, variations in the drone’s battery status, minor sensor drift, and potential environmental changes (e.g., slight temperature fluctuations in the indoor facility) could influence flight performance and learning outcomes. Although we minimized these effects by conducting flights within a short time window and periodically monitoring battery levels, we acknowledge that perfect control was not feasible. Additionally, the sparse-reward setting might introduce variability in how quickly each model converges, but we mitigated this by carefully tuning hyperparameters and by running multiple flight trajectories under similar conditions.
7.2. External Validity
All experiments were conducted on a single quadcopter model and in a relatively simple indoor environment. Consequently, our results may not directly generalize to different aircraft types, large outdoor areas, or environments with extensive obstacles and dynamic conditions. While the core mechanism of combining a GAN-based model with HER is not inherently limited to any specific drone or setting, additional adaptations—such as more advanced GAN architectures or explicit obstacle modeling—may be required for complex scenarios.
8. Conclusions
In this study, we proposed the MGH framework, which effectively leverages sparse data in real flight environments. By integrating Generative Adversarial Networks (GANs) and Hindsight Experience Replay (HER) into model-based reinforcement learning (MBRL), the MGH framework was evaluated in an actual quadcopter flight environment and compared with conventional model-free reinforcement learning methods. The results demonstrated that the MGH framework offers superior data efficiency and, in particular, achieves a convergence speed up to 70.59% faster than the best-performing baseline, DDPG combined with HER.
A key innovation of our approach is that it simultaneously addresses both the accuracy of the environmental model and the sample efficiency in sparse-reward settings. Previous model-based methods have often focused solely on improving the dynamic model (e.g., through GANs) or on handling sparse rewards with techniques such as HER, but seldom have they combined both strategies in a single MBRL framework with real flight data. By unifying GAN-driven environment modeling and HER-based sample reuse, our framework fills this gap and demonstrates enhanced performance in real quadcopter flight tasks. This synergy between improving model accuracy and maximizing sample efficiency represents the primary contribution of our study and differentiates it from existing research that typically addresses these challenges in isolation.
Although the primary aim was to validate data efficiency gains, several practical considerations merit further exploration. First, the current analysis chiefly focuses on learning convergence speed and mission success rate. Future work should investigate additional performance factors, such as computational cost, long-term policy stability, and hyperparameter sensitivity, to provide a more comprehensive assessment of the framework. Second, the experimental setup was deliberately simplified—featuring a relatively small indoor flight area with few obstacles—to highlight the core benefits of combining GANs and HER. While our results underscore the framework’s efficacy under these conditions, more intricate obstacle arrangements, dynamic environmental factors, and larger or outdoor arenas could challenge the GAN-based modeling and are promising directions for subsequent research.
To the best of our knowledge, this study is the first in which a GAN and HER are combined within an MBRL framework, which significantly enhances the practical applicability of reinforcement learning in autonomous flight. The combined GAN–HER learning framework is expected to maximize data efficiency and learning performance in diverse real-world environments, extending its potential application to autonomous flight, robotic control, and other related fields.