1. Introduction
Autonomous flight is already a reality. The American company Xwing has applied its autonomous flight software, Super-pilot, to the Cessna 208B Caravan and is pursuing standard certification for large unmanned aerial systems (UASs) from the Federal Aviation Administration (FAA) [1]. This trend is evident globally: policy documents such as the Urban Air Mobility (UAM) roadmap and the Defense Roadmap underscore the significance of autonomous flight and promote its development as a core technology. This suggests that autonomous flight is not limited to experimental trials but is being practically implemented in both commercial and military sectors. Moreover, considering the global trend of population decline, autonomous flight is expected to be increasingly utilized and to become an indispensable technology [2,3].
Previous research on autonomous flight systems primarily involved modeling the system based on the flight dynamics of the aircraft and then developing software code to compute and execute the optimal trajectory [4,5,6]. However, such approaches have encountered many challenges in the aerospace field. Aircraft require complex and sensitive interactions with the air to sustain flight, making them highly dependent on real-time performance and SWaP-C (Size, Weight, Power, and Cost). Consequently, both the environments in which traditional code-based autonomous flight systems can be deployed and the performance of the systems themselves are subject to numerous spatiotemporal constraints. Furthermore, code-based autonomous flight systems exhibit limitations in dynamic and uncertain settings, such as those involving human cooperation, swarm flight, and traffic control [7,8].
To overcome these limitations, reinforcement learning (RL) has emerged as an alternative. RL approximately optimizes sequential decision-making problems by learning from data obtained through interactions (rollouts) with the environment [9]. In particular, deep reinforcement learning (DRL), which integrates deep neural networks (DNNs), has demonstrated excellent performance in various state and action spaces following the success of DeepMind’s AlphaGo. This approach is a powerful tool for tackling complex problems such as nonlinear multi-objective optimization and shows high potential for applications in the aerospace field [10,11].
However, deep reinforcement learning requires collecting large volumes of data through interactions with the environment for training. This process typically demands significant time and expense, and acquiring data for hazardous scenarios in safety-critical systems—such as those in the aerospace field—is especially challenging [12,13]. As a result, researchers predominantly train deep reinforcement learning models in simulation; however, differences between virtual and real environments cause these models to behave abnormally, leading to the Sim-to-Real problem [14]. Moreover, because aerodynamic properties are extremely sensitive, developing an accurate simulator for aircraft is both complex and expensive, and every time a new aircraft is developed, researchers must update the simulator accordingly [15,16].
One approach to overcoming these challenges is model-based reinforcement learning (MBRL) [17,18]. The core idea of MBRL is to incorporate a model that simulates the interaction between the reinforcement learning agent and the environment into the training process. Unlike model-free reinforcement learning, which relies solely on experience data obtained through direct interactions with the environment, MBRL leverages both real experience data and data generated from internal predictions to meet the training data requirements. Consequently, MBRL enables efficient learning while minimizing interactions with the real environment. However, the accuracy of the simulated environment model and the inherently low sample efficiency in sparse-reward settings still require improvement [19,20,21].
In this study, we propose the MGH (model-based reinforcement learning with GANs and HER) framework. The framework integrates a Generative Adversarial Network (GAN) [22] and Hindsight Experience Replay (HER) [23] into model-based reinforcement learning to enhance its performance. Our key insight is that, by using the GAN to improve the accuracy of the environmental model and HER to maximize sample efficiency, the framework can learn solely from sparse real-world data without requiring a separately pre-built simulation model. The MGH framework comprises two main stages. In the first stage, we initialize the GAN-based environmental model using data collected through expert-controlled agent interactions. In the second stage, we conduct model-based reinforcement learning augmented with Hindsight Experience Replay on top of the initialized environmental model.
We implemented and evaluated the MGH framework with the Deep Deterministic Policy Gradient (DDPG) algorithm in a 3D real-world environment in which a quadcopter was controlled. In the experiment, the quadcopter’s mission was to take off from a starting point and fly to a designated target location. We conducted the experiment indoors with a small quadcopter equipped only with a monocular camera—without any additional self-localization sensors—and remotely controlled the quadcopter through the MGH framework model running on a server. For performance comparison, models implementing conventional deep reinforcement learning algorithms were also trained on the same dataset and evaluated on the same real-world flight task. The results showed that the MGH framework accelerated learning convergence by up to 70.59% compared with existing algorithms and clearly demonstrated the impact of the environmental model on reinforcement learning.
Our contributions through this study are as follows:
Novel integration of GANs and HER in MBRL
We are the first to unify a GAN-based environment model and Hindsight Experience Replay within an MBRL paradigm for real flight control tasks. This joint approach addresses both high-fidelity model learning and sample efficiency in sparse-reward environments.
Real-world validation on a quadcopter platform
Unlike many simulation-based studies, we use actual flight data from an indoor quadcopter environment. This setup reveals how our method performs under real-world conditions where data collection is expensive and risky.
Significant improvement in convergence speed and model accuracy
Experimental results show that our framework reduces the required number of training samples and accelerates convergence by up to 70.59% over existing algorithms (measured against DDPG with HER, the strongest baseline). Additionally, the GAN component demonstrates high accuracy in predicting transitions compared with naive or traditional model-based approaches.
By jointly addressing model fidelity and sparse-reward sample efficiency, we expect this framework to make a significant contribution to the field of deep reinforcement learning-based autonomous flight.
The paper is structured as follows. In Section 2, we compare our study with related work. In Section 3, we provide the background and fundamental theories relevant to our study. In Section 4, we present the proposed MGH framework in detail. In Section 5, we describe the experimental setup and report the experimental results. Section 6 offers a discussion of our findings, and Section 7 discusses threats to validity. Finally, Section 8 concludes the paper.
2. Related Work
Model-based reinforcement learning (MBRL) enables efficient learning while minimizing interactions with the real environment, making it an especially important research topic in areas where data collection is challenging. In the aerospace domain, the scarcity of real flight data necessitates reinforcement learning methods with high sample efficiency, and in complex environments such as drone flight control, accurate environmental modeling is even more critical. Against this backdrop, various researchers have proposed methods to enhance both the sample efficiency and the accuracy of environmental models in MBRL.
Zhao et al. [24] proposed a novel method that integrates Conditional Generative Adversarial Networks (CGANs) into MBRL to improve sample efficiency. They trained the state transition model of the environment by using a CGAN and enhanced training stability with a Wasserstein GAN (WGAN). Similarly, Charlesworth and Montana [25] introduced PlanGAN, which employs GANs in sparse-reward, multi-goal environments to generate trajectories that help an agent achieve its goals. These studies focus on leveraging GANs to enhance the accuracy of environmental models and maximize sample efficiency.
Meanwhile, researchers have also applied Hindsight Experience Replay (HER) to MBRL to boost sample efficiency. Yang et al. [26] proposed Model-based Hindsight Experience Replay (MHER), which uses the environmental model to generate virtual goals and thereby introduces a more efficient goal re-labeling method. Huang et al. [27] combined model-based reinforcement learning with experience replay techniques in MRHER to effectively handle sparse rewards in continuous object manipulation tasks. These studies contribute to improving data efficiency by integrating HER with MBRL.
Research aimed at strengthening the theoretical foundation of MBRL and enhancing sample efficiency is also noteworthy. Luo et al. [28] proposed an algorithmic framework that guarantees convergence with nonlinear dynamic models and increases sample efficiency through Stochastic Lower Bounds Optimization (SLBO). Li et al. [29] addressed data scarcity in offline reinforcement learning by incorporating a pessimism principle to minimize sample complexity. Wang et al. [30] presented a Conservative Model-Based Actor–Critic (CMBAC) method to compensate for model inaccuracies and enhance sample efficiency. Additionally, Sun et al. [31] proposed the MOBILE algorithm, which utilizes model-Bellman discrepancies to improve policy learning stability in offline reinforcement learning, while Ji et al. [32] reduced unnecessary model updates and increased sample efficiency by dynamically determining the timing of model updates with Constrained Model-shift Lower-bound Optimization (CMLO).
Other studies have focused on improving performance through enhanced reward prediction and internal state representations. Lee et al. [33] proposed DREAMSMOOTH, which improves reward prediction accuracy via reward smoothing and enhances sample efficiency in sparse-reward environments. Scholz et al. [34] improved the internal state representation of the MuZero algorithm by using self-supervised learning to boost sample efficiency, and Ma et al. [35] increased sample efficiency by automatically balancing observation modeling and reward modeling through Harmony World Models.
MBRL has also been actively applied to drone flight control. Becker-Ehmck et al. [36] used a latent state space model based on a Variational Autoencoder (VAE) to improve sample efficiency in drone flight control. Lambert et al. [37] combined a neural network-based dynamics model with Model Predictive Control (MPC) to perform low-level control of quadrotors with limited data. Although these studies represent meaningful attempts to address data scarcity in drone flight control, they fall short of simultaneously maximizing both environmental model accuracy and sample efficiency.
In addition, Khalid et al. [38] applied MBRL to other domains, using differentiable ordinary differential equations (ODEs) in quantum control problems to enhance model accuracy and reduce sample complexity. Although their study is not directly related to the aerospace field, its approach of improving sample efficiency by increasing model accuracy is noteworthy.
Beyond these, researchers have worked on integrating imitation learning with MBRL for agile aircraft control using pilot demonstration data. For example, Sever et al. [39] proposed a hybrid approach that unifies imitation learning, transfer learning, and reinforcement learning to address limited pilot data in high-agility maneuvers. They demonstrated that leveraging both simulation-generated proxy data and a small amount of real pilot data can robustly adapt to aircraft parameter changes while preserving maneuver stability.
Overall, various researchers have striven to enhance sample efficiency and environmental model accuracy in MBRL. However, most previous research has focused primarily on either improving model accuracy via GANs or boosting sample efficiency through HER. In this study, we aim to overcome these limitations by integrating HER and GANs within MBRL to address data scarcity in the aerospace domain and simultaneously enhance both sample efficiency and the accuracy of environmental models in drone flight control environments.
By leveraging GANs to improve the accuracy of environmental models, our approach enables precise environmental simulation even when real flight data are scarce. This strategy reinforces the effectiveness of GANs as demonstrated in the studies by Zhao et al. [24] and Charlesworth and Montana [25]. At the same time, employing HER to recycle unsuccessful experiences and boost sample efficiency builds upon the work of Yang et al. [26] and Huang et al. [27]. By combining these two strategies, we aim to comprehensively address issues that previous researchers treated separately.
Therefore, in this study, we propose a novel approach that integrates HER and a GAN within MBRL to simultaneously enhance sample efficiency and environmental model accuracy. We expect that this method will make a significant contribution to reinforcement learning in the aerospace domain, particularly in applications such as drone flight control where real data are scarce.
3. Background
In this chapter, we provide a detailed introduction to the key theories and technologies that form the basis of this study. Section 3.1 reviews the fundamental concepts of reinforcement learning. Section 3.2 examines the deep reinforcement learning algorithm DDPG, while Section 3.3 analyzes the principles and limitations of model-based reinforcement learning. Additionally, Section 3.4 and Section 3.5 discuss the principles of GANs and HER and their roles in reinforcement learning, respectively.
3.1. Reinforcement Learning
Reinforcement Learning (RL) addresses sequential decision-making problems that are typically formulated as a Markov Decision Process (MDP). An agent observes a state $s_t$, takes an action $a_t$ according to a policy $\pi$, transitions to $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$, and receives a reward $r_t$ from the reward function $R$. The goal is to find the optimal policy $\pi^*$ that maximizes the expected return $J(\pi)$:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right],$$
where $\tau$ is the trajectory sampled from the policy $\pi$, $\gamma \in [0, 1)$ is the discount factor, $r_t$ is the reward at time step $t$, and $T$ is the horizon.
3.2. Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient (DDPG) tackles continuous action spaces by combining actor–critic methods with neural network function approximators. The policy network (actor) $\mu_{\phi}(s)$ deterministically outputs actions, while the value network (critic) $Q_{\theta}(s, a)$ estimates the corresponding action value. DDPG stabilizes training via an experience replay buffer of $(s_t, a_t, r_t, s_{t+1})$ tuples and target networks for both the actor and the critic. Although successful in many tasks, DDPG can require a large number of real-environment interactions due to its model-free nature. The DDPG algorithm is shown in Algorithm 1.
Algorithm 1 DDPG
1: Initialize actor parameters φ and critic parameters θ.
2: Initialize target networks with parameters φ′ ← φ, θ′ ← θ.
3: Initialize replay buffer D.
4: for each episode do
5:   Reset environment and get initial state s_1.
6:   for each step t = 1 to T do
7:     Select action a_t = μ_φ(s_t) + ε, where ε ∼ N(0, σ).
8:     Execute action a_t and observe reward r_t and next state s_{t+1}.
9:     Store transition (s_t, a_t, r_t, s_{t+1}) in D.
10:    if size(D) > MinBatchSize then
11:      Sample a random mini-batch of N transitions (s_i, a_i, r_i, s′_i) from D.
12:      Compute targets y_i = r_i + γ Q_{θ′}(s′_i, μ_{φ′}(s′_i)).
13:      Update critic θ by minimizing L(θ) = (1/N) Σ_i (y_i − Q_θ(s_i, a_i))².
14:      Update actor φ via the deterministic policy gradient (1/N) Σ_i ∇_a Q_θ(s_i, a)|_{a=μ_φ(s_i)} ∇_φ μ_φ(s_i).
15:      Soft-update target networks: θ′ ← τθ + (1 − τ)θ′, φ′ ← τφ + (1 − τ)φ′.
16:    end if
17:    Update state s_t ← s_{t+1}.
18:  end for
19: end for
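To make the update rules in Algorithm 1 concrete, the following is a minimal PyTorch sketch of one DDPG update step: critic regression toward the bootstrapped target, the deterministic policy gradient for the actor, and soft target updates. The network sizes, learning rates, and toy mini-batch are illustrative assumptions, not the configuration used in our experiments.

# Minimal DDPG update sketch (illustrative hyperparameters and toy data).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA, TAU = 8, 3, 0.99, 0.005

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor, critic = mlp(STATE_DIM, ACTION_DIM, nn.Tanh()), mlp(STATE_DIM + ACTION_DIM, 1)
actor_t, critic_t = mlp(STATE_DIM, ACTION_DIM, nn.Tanh()), mlp(STATE_DIM + ACTION_DIM, 1)
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    # Critic: regress Q(s, a) toward the target y = r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = r + GAMMA * critic_t(torch.cat([s_next, actor_t(s_next)], dim=-1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the deterministic policy gradient by maximizing Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks.
    for net, tgt in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)

# Toy mini-batch standing in for samples from the replay buffer D.
batch = 32
ddpg_update(torch.randn(batch, STATE_DIM), torch.rand(batch, ACTION_DIM) * 2 - 1,
            torch.zeros(batch, 1), torch.randn(batch, STATE_DIM))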
3.3. Model-Based Reinforcement Learning
In Model-Based Reinforcement Learning (MBRL), one learns an approximate environment model $\hat{P}(s_{t+1} \mid s_t, a_t)$. The environmental model is expressed as follows:
$$\hat{P}_{\theta}(s_{t+1} \mid s_t, a_t) \approx P(s_{t+1} \mid s_t, a_t).$$
The model is used to simulate imaginary rollouts, thereby reducing real-environment sampling. The policy can then be improved using both real data and model-generated data. However, bias arises if $\hat{P}$ deviates significantly from the true dynamics $P$. This deviation may accumulate over multiple time steps, potentially degrading learning stability or performance.
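The following is a minimal sketch of how a learned dynamics model can generate short imaginary rollouts to supplement real transitions. The linear toy model, random-free policy, and list buffer are illustrative assumptions, not the implementation used in this paper.

# Imaginary rollouts from a learned model (toy stand-ins for model, policy, buffer).
import numpy as np

class ToyModel:
    """Stands in for the learned model: predicts the next state and reward."""
    def predict(self, s, a):
        return s + 0.1 * a, 0.0   # placeholder dynamics and sparse reward

def imaginary_rollouts(model, policy, start_states, horizon, buffer):
    """Branch model-based rollouts from real states and store synthetic transitions."""
    for s in start_states:
        for _ in range(horizon):
            a = policy(s)                    # action from the current policy
            s_next, r = model.predict(s, a)  # learned model replaces the real environment
            buffer.append((s, a, r, s_next))
            s = s_next

buffer = []
imaginary_rollouts(ToyModel(), lambda s: np.clip(-s, -1, 1),
                   start_states=[np.zeros(3), np.ones(3)], horizon=5, buffer=buffer)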
3.4. Generative Adversarial Networks
Generative Adversarial Networks (GANs) consist of two neural networks—the generator ($G$) and the discriminator ($D$)—which compete adversarially during training. The generator $G$ accepts a noise vector $z$, sampled from the latent distribution $p_z(z)$, as input and produces a fake data sample $G(z)$. The discriminator $D$ determines whether an input $x$ comes from the real data distribution $p_{data}(x)$ or from the generated data distribution $p_g(x)$. To accomplish this, the GAN is trained by having the generator and discriminator play a mini-max game, which is expressed by the following value function $V(D, G)$:
$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big].$$
The discriminator is trained to output 1 for real data $x$ and 0 for generated data $G(z)$. Meanwhile, the generator seeks to trick the discriminator by maximizing $D(G(z))$. The optimal discriminator $D^{*}$ for a given generator $G$ is derived as follows:
$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}.$$
The precise data representation capabilities of GANs can be leveraged in model-based reinforcement learning to enhance the accuracy of the environmental model.
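The following is a minimal PyTorch sketch of one adversarial update of the mini-max game above, using the common non-saturating generator loss. The 1-D toy data, network sizes, and learning rates are illustrative assumptions.

# One GAN update step: discriminator then generator (illustrative toy setup).
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM = 16, 8
G = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(), nn.Linear(64, DATA_DIM))
D = nn.Sequential(nn.Linear(DATA_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(real_batch):
    n = real_batch.size(0)
    fake = G(torch.randn(n, NOISE_DIM))

    # Discriminator: push D(x) toward 1 on real data and D(G(z)) toward 0 on fakes.
    d_loss = bce(D(real_batch), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator, i.e., push D(G(z)) toward 1.
    g_loss = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

gan_step(torch.randn(32, DATA_DIM))  # toy "real" batch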
3.5. Hindsight Experience Replay
Hindsight Experience Replay (HER) is a technique that improves learning efficiency by reusing experiences gathered through an agent’s interactions with the environment for various goals. Conventional reinforcement learning only learns when a reward is provided, rendering trajectories without rewards unusable. As a result, in environments with sparse rewards, most experiences go unused, which severely diminishes sample efficiency. HER addresses this issue by reinterpreting the states reached during an episode as if they had been the goal, thereby assigning a corresponding success reward. This approach effectively transforms the problem into multi-goal reinforcement learning, where—unlike in traditional reinforcement learning—a goal $g$ from the goal space $\mathcal{G}$ is incorporated as a parameter into the policy, value function, and reward function, redefining them as follows:
$$\pi(s_t, g), \qquad Q(s_t, a_t, g), \qquad r_t = R(s_t, a_t, g),$$
where $\pi(s_t, g)$ is the policy, $Q(s_t, a_t, g)$ is the action-value function, and $R(s_t, a_t, g)$ is the reward function.
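The following is a minimal sketch of "final"-style HER re-labeling on one episode. The transition layout and the goal-extraction function are illustrative assumptions rather than our exact implementation.

# HER re-labeling sketch: treat the final achieved state as the goal and re-reward.
def her_relabel(episode, buffer, achieved_goal):
    """Re-label an episode's transitions with a goal that was actually reached.

    episode: list of (s, a, r, s_next, g) transitions that failed to reach goal g.
    achieved_goal: maps a state to the goal it represents (e.g., the drone position).
    """
    new_goal = achieved_goal(episode[-1][3])          # treat the final state as the goal
    for s, a, _, s_next, _ in episode:
        reached = achieved_goal(s_next) == new_goal   # sparse success test under the new goal
        r_new = 1.0 if reached else 0.0
        buffer.append((s, a, r_new, s_next, new_goal))  # relabeled, now partly "successful"

# Toy usage: states are integers, and the goal of a state is the state itself.
episode = [(0, +1, 0.0, 1, 9), (1, +1, 0.0, 2, 9), (2, +1, 0.0, 3, 9)]
buffer = []
her_relabel(episode, buffer, achieved_goal=lambda s: s)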
4. Proposed Method
In this chapter, we propose the MGH framework, which integrates a GAN and HER into model-based reinforcement learning (MBRL). First, we detail the structure and components of the framework and present the training algorithm. Next, we explain in detail how the GAN and HER are integrated, interact, and fulfill their respective roles. Finally, we demonstrate the convergence and stability of the proposed method by theoretically establishing a performance lower bound and mathematically proving the impact of the GAN and HER on performance.
4.1. Overview of MGH Framework
The MGH framework aims to achieve data-efficient reinforcement learning in real-world environments by integrating a GAN and HER into model-based reinforcement learning. Its main components include an Actor–Critic module, a GAN module, a HER module, and an Experience Replay Buffer. In this framework, the Actor–Critic module utilizes an off-policy algorithm to take advantage of the Experience Replay Buffer.
Figure 1 illustrates the overall structure of the MGH framework.
First, the GAN module comprises a generator and a discriminator. The generator $G$ takes the state $s_t$ and action $a_t$ as input and learns the state transition model to predict the next state, $\hat{s}_{t+1} = G(s_t, a_t)$. Subsequently, the discriminator $D$ distinguishes between the real transition $(s_t, a_t, s_{t+1})$ and the generated transition $(s_t, a_t, \hat{s}_{t+1})$. The generator learns to trick the discriminator into classifying its fake data as genuine environmental data, while the discriminator learns to accurately differentiate between the generated data and the actual environmental data. Through this adversarial training process, the GAN module attains a highly accurate environmental model.
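The following is a minimal PyTorch sketch of this transition GAN: unlike the generic noise-conditioned GAN of Section 3.4, the generator here predicts $s_{t+1}$ from $(s_t, a_t)$, and the discriminator scores whole transitions. Vector state/action sizes and MLP architectures are illustrative assumptions; in our setting the state is an image, so convolutional networks would be used instead.

# Transition GAN sketch: generator predicts s_{t+1}; discriminator scores (s_t, a_t, s_{t+1}).
import torch
import torch.nn as nn

S_DIM, A_DIM = 8, 3
G = nn.Sequential(nn.Linear(S_DIM + A_DIM, 128), nn.ReLU(), nn.Linear(128, S_DIM))
D = nn.Sequential(nn.Linear(S_DIM + A_DIM + S_DIM, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def transition_gan_step(s, a, s_next):
    s_fake = G(torch.cat([s, a], dim=-1))            # predicted next state
    real = torch.cat([s, a, s_next], dim=-1)
    fake = torch.cat([s, a, s_fake], dim=-1)
    n = s.size(0)

    # Discriminator: real transitions -> 1, generated transitions -> 0.
    d_loss = bce(D(real), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: make generated transitions indistinguishable from real ones.
    g_loss = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

transition_gan_step(torch.randn(32, S_DIM), torch.rand(32, A_DIM) * 2 - 1,
                    torch.randn(32, S_DIM))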
The Actor–Critic module consists of an actor and a critic, where the actor approximates the policy and the critic approximates the value function. The actor $\mu$ produces the optimal action $a_t$ for a given state $s_t$ and goal $g$, while the critic $Q$ evaluates the value of the state–action–goal pair that includes the action provided by the actor. In this setup, the actor is trained by using a policy-based approach and the critic is trained by using a value-based approach, thereby integrating the strengths of both methods to facilitate effective learning.
The HER module reassigns various goals to both the agent’s experienced trajectories and those generated by the environmental model. This allows the agent to learn from experiences that did not originally reach the desired goal, thereby enhancing sample efficiency during training.
Finally, the Experience Replay Buffer $D$ stores data collected from the real environment as well as data generated by the environmental model. The stored data are sampled in minibatches to train both the actor and the critic. Since the GAN structure generally requires a large amount of data for training, the Experience Replay Buffer is initially populated with data obtained from expert demonstrations.
4.2. Learning Algorithm
The MGH framework adopts Model-Based Reinforcement Learning (MBRL) as its foundation and integrates a Generative Adversarial Network (GAN) for environment modeling and Hindsight Experience Replay (HER) for improved sample efficiency. Building on the theoretical analysis in Section 4.3, which demonstrates how the GAN and HER jointly reduce model bias and enhance reward utilization, this section provides a procedural overview of how the MGH framework operates in practice.
4.2.1. GAN–HER Integration
In this framework, the generator $G$ and the discriminator $D$ are trained adversarially so that virtual transitions $(s_t, a_t, \hat{s}_{t+1})$ produced by $G$ closely resemble real transitions $(s_t, a_t, s_{t+1})$. The real data are collected through direct interaction with the environment, while the GAN-based virtual data supplement these real transitions in the replay buffer. The objective is to ensure that $\hat{s}_{t+1}$ remains realistic enough to prevent large model bias when learning the policy. As discussed in Section 4.3, a more accurate internal model tightens the performance bounds by reducing the divergence from the true dynamics $P$.
Concurrently, HER reassigns goals after each episode by designating future states as new goals and labeling these as successful transitions. This approach effectively increases the density of reward signals, addressing the scarcity issue commonly observed in sparse-reward tasks. By retrospectively treating previously failed attempts as successful under alternate goals, HER enables the agent to reuse a broader range of experience for policy improvement. Thus, the dual benefits of improved model fidelity (via GAN) and richer reward feedback (via HER) collectively boost the learning speed and sample efficiency of the MGH framework.
4.2.2. Algorithmic Flow
The learning process proceeds in five phases. First, both the GAN parameters $w_G, w_D$ and the actor–critic parameters $\phi, \theta$ are randomly initialized, and additional expert demonstration data are stored in the replay buffer $D$ to stabilize early training. Next, the agent collects real transitions $(s_t, a_t, r_t, s_{t+1})$ by executing actions sampled from the policy, perturbed by Gaussian exploration noise, and records the resulting tuples in $D$. Simultaneously, the generator produces virtual transitions $(s_t, a_t, \hat{r}_t, \hat{s}_{t+1})$, which are also stored in $D$ to provide supplementary data.
At the end of each episode, a GAN update stage occurs in which minibatches of real and generated transitions are sampled from $D$. The discriminator is optimized to distinguish real from generated transitions, while the generator aims to fool the discriminator with increasingly realistic outputs. This procedure gradually refines the internal model, narrowing the gap between the true dynamics $P$ and the learned dynamics $\hat{P}$.
Subsequently, HER re-labeling is applied by selecting future states from each sampled trajectory as new goals, assigning a success reward for reaching those states, and storing these artificially successful experiences in $D$. This re-labeling step effectively transforms trajectories that originally failed to reach the actual goal into positive experiences for a different goal. Finally, actor–critic learning updates both the critic, $\theta$, to minimize temporal difference errors, and the actor, $\phi$, to maximize the expected return. The target networks $\theta'$ and $\phi'$ receive soft updates to ensure stable training.
By repeating these phases across multiple episodes, the agent gains access to both real and GAN-generated transitions that are further enriched by HER. The practical steps of this unified procedure are shown in Algorithm 2, where lines 6–7 illustrate an initial batch of training using expert demonstrations, lines 14–15 insert real and generated data into the buffer, lines 18–22 detail the adversarial update of the GAN, lines 23–27 reassign goals via HER, and lines 28–33 carry out standard off-policy actor–critic updates.
Algorithm 2 MGH framework.
1: Initialize generator parameters w_G and discriminator parameters w_D.
2: Initialize actor parameters φ and critic parameters θ.
3: Initialize target networks with parameters φ′ ← φ, θ′ ← θ.
4: Initialize goal g.
5: Initialize experience replay buffer D with expert-collected data.
6: Train the generator and discriminator using data from D.
7: Train the actor and critic using data from D.
8: for each episode do
9:   Reset environment and get initial state s_1.
10:  for each step t = 1 to T do
11:    Select action a_t = μ_φ(s_t, g) + ε, where ε ∼ N(0, σ).
12:    Execute action a_t and observe reward r_t and next state s_{t+1}.
13:    Store transition (s_t, a_t, r_t, s_{t+1}, g) in D.
14:    Predict next state ŝ_{t+1} and reward r̂_t with the generator.
15:    Store transition (s_t, a_t, r̂_t, ŝ_{t+1}, g) in D.
16:    Update state s_t ← s_{t+1}.
17:  end for
18:  for k = 1 to N_GAN do
19:    Sample minibatch of transitions from D.
20:    Update the discriminator w_D to discriminate real and fake transitions.
21:    Update the generator w_G to generate realistic transitions.
22:  end for
23:  for k = 1 to N_HER do
24:    Sample a sequence consisting of (s_t, a_t, r_t, s_{t+1}, g) from D.
25:    Apply HER to create new goal g′ and reward r′.
26:    Store each relabeled transition (s_t, a_t, r′, s_{t+1}, g′) in D.
27:  end for
28:  for k = 1 to N_AC do
29:    Sample minibatch from D.
30:    Update critic θ by minimizing the temporal difference loss.
31:    Update actor φ via the deterministic policy gradient.
32:    Update target networks θ′, φ′ with soft updates.
33:  end for
34: end for
4.3. Theoretical Justification for MGH Framework
To theoretically establish the performance benefits of the MGH framework, we analyze the improvement in sample efficiency achieved by incorporating the GAN and HER within the Model-Based Reinforcement Learning (MBRL) paradigm.
4.3.1. Model Error Bound
In an MDP with discount factor $\gamma$, the difference in expected returns when using an imperfect model $\hat{P}$ instead of the true dynamics $P$ can be bounded. A standard multi-step analysis [40] shows that any local difference in transition probabilities can compound over repeated rollouts. Concretely, if $\hat{\pi}$ is the policy trained under $\hat{P}$, then the gap between the return under the true dynamics and the return predicted by the model can be bounded as follows.
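With $R_{\max}$ denoting the reward bound and $\epsilon_m = \max_{s,a} D_{TV}\big(P(\cdot \mid s, a) \,\|\, \hat{P}(\cdot \mid s, a)\big)$ the worst-case total variation model error, a standard form of this bound (the exact constants depend on the analysis in [40]) is
$$\big| J_{P}(\hat{\pi}) - J_{\hat{P}}(\hat{\pi}) \big| \le \frac{2\gamma R_{\max}}{(1-\gamma)^{2}}\, \epsilon_m .$$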
The factor $\gamma/(1-\gamma)^{2}$ arises because, at each step, the agent’s return is discounted by $\gamma$, but the errors can propagate through future time steps—thus incurring a geometric series whose partial sum is on the order of $1/(1-\gamma)^{2}$. This derivation confirms that lowering the total variation distance $\epsilon_m$ across state–action pairs will tighten the upper bound on the performance loss.
When the GAN is used to learn $\hat{P}$, the generator is adversarially trained to produce transitions $(s_t, a_t, \hat{s}_{t+1})$ that closely match real transitions, thereby minimizing $\epsilon_m$. Substituting this smaller $\epsilon_m$ into the bound yields a tighter guarantee compared with naive function approximators, providing a solid theoretical rationale for our GAN-based model.
4.3.2. Joint Sample Efficiency from GAN and HER
First, let $p$ be the probability of “success” under the original reward structure. In a sparse-reward environment, $p$ might be very small, leading to a high sample complexity on the order of $O(1/p)$. HER re-labels each real trajectory $k$ times, increasing the effective success rate to the following:
$$p_{\mathrm{HER}} \approx (k+1)\, p,$$
thus reducing the sample complexity from $O(1/p)$ to $O\!\left(\tfrac{1}{(k+1)p}\right)$. However, model-based methods also gain efficiency by simulating extra transitions. Suppose each real transition can generate on average $m$ synthetic transitions from the GAN model. Then, for every real step, the agent effectively observes $m$ additional transitions in the replay buffer. If we treat these synthetic transitions as having the same success probability (or re-labeled success probability) $p_{\mathrm{HER}}$—an approximation that holds if the learned model $\hat{P}$ is sufficiently accurate—then each real step yields $(1+m)$-fold more “successful” data.
Hence, the overall number of successes per real interaction roughly scales as $(1+m)(k+1)\,p$. Since RL sample complexity depends inversely on the fraction of successful experiences, we can write a new effective sample complexity order:
$$O\!\left(\frac{1}{(1+m)(k+1)\,p}\right),$$
demonstrating that both the GAN model, through $m$, and HER, through $k$, synergistically reduce the needed real-environment interactions. In practice, $m$ reflects how many reliable virtual rollouts we can generate per real step, and $k$ denotes the number of HER re-labelings per trajectory. As $m$ and $k$ grow, the sample complexity decreases, consistent with our empirical findings. This combination underlies the MGH framework’s advantage in data-scarce, sparse-reward tasks.
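As an illustrative numeric example (hypothetical values, not measured from our experiments): with a raw success probability of $p = 0.01$, $k = 4$ HER re-labelings per trajectory, and $m = 3$ reliable synthetic transitions per real step, the effective fraction of successful experiences grows from $0.01$ to $(1+3)(4+1)(0.01) = 0.2$, so the number of required real-environment interactions shrinks by roughly a factor of 20.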
5. Experiments
In this section, we describe the experiments conducted to evaluate the performance of the proposed MGH framework by comparing it with the representative deep reinforcement learning algorithm DDPG. The experiments were designed to address three research questions and were carried out in an actual drone environment. We objectively analyze the performance of DDPG, DDPG-HER, DDPG-GAN, and MGH-DDPG by using various evaluation metrics, and we validate the data efficiency and learning performance of the MGH framework based on the results.
5.1. Research Questions
5.1.1. RQ1: How Much Data Does the MGH Framework Require for Training?
It is essential to verify whether the quantity of data required by the proposed MGH framework for convergence is practically sufficient, even if it is lower than that required by conventional reinforcement learning methods. Therefore, we compare the learning convergence speed and mission success rate of the MGH framework when trained with different amounts of data: 2K (2000), 5K (5000), 8K (8000), and 10K (10,000) samples.
5.1.2. RQ2: How Does the MGH Framework Perform Compared with Conventional Reinforcement Learning in Data-Sparse Environments?
In this study, we primarily aim to enhance data efficiency in real-world environments where collecting flight data is challenging due to high time and cost constraints. To this end, we compare the learning convergence speed and mission success rate of the proposed MGH framework with those of DDPG, DDPG-HER, and DDPG-GAN—where DDPG-HER and DDPG-GAN integrate HER and GANs with DDPG, respectively—after training with 10K samples.
5.1.3. RQ3: How Accurate Is the GAN in Mimicking the Environment?
We incorporated a GAN to improve the accuracy of the environmental model, thereby generating experiences that closely resemble those obtained through real-world interactions. To evaluate this, we compare the geometric transformation from $s_t$ to $s_{t+1}$ in the real transition $(s_t, a_t, s_{t+1})$ with the transformation from $s_t$ to the generated state $\hat{s}_{t+1}$ produced by the GAN. Since the state is represented as an image, we extract feature points by using the Scale-Invariant Feature Transform (SIFT) [41] and quantify the geometric differences between the two images by comparing their homography matrices.
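The following is a minimal OpenCV sketch of this comparison: estimate the homography from $s_t$ to the real $s_{t+1}$ and to the GAN-generated next frame, then compare the two transformations. The toy frames and the simple matrix-difference comparison at the end are illustrative assumptions; the reprojection-error-based measure actually used is described in Section 5.3.3.

# SIFT + RANSAC homography estimation between consecutive frames (toy data).
import cv2
import numpy as np

def homography(img_a, img_b, min_matches=10):
    """Estimate the homography mapping img_a onto img_b using SIFT keypoints."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_a, des_b, k=2)
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    if len(good) < min_matches:
        return None
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

# Toy frames standing in for s_t, the real s_{t+1}, and the GAN-generated next frame.
rng = np.random.default_rng(0)
s_t = cv2.GaussianBlur((rng.random((480, 640)) * 255).astype(np.uint8), (5, 5), 0)
s_next_real = np.roll(s_t, 12, axis=1)   # camera appears to shift to the right
s_next_gan = np.roll(s_t, 10, axis=1)    # GAN prediction with a slightly different shift

H_real, H_gan = homography(s_t, s_next_real), homography(s_t, s_next_gan)
if H_real is not None and H_gan is not None:
    print("Frobenius difference of homographies:", np.linalg.norm(H_real - H_gan))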
5.2. Experimental Environment
5.2.1. Quadcopter
The quadcopter used in the experiment is the DJI RMTT, weighing approximately 80 g with a maximum flight speed of 8 m/s. It captures video at 720p resolution with a field of view (FoV) of 82.6° by using an integrated monocular camera. In addition, it can hover indoors without GPS by utilizing its bottom-mounted infrared (IR) sensor and optical flow sensor. The onboard Inertial Measurement Unit (IMU) enables the monitoring of the quadcopter’s speed and acceleration. Control of the quadcopter is executed by a controller connected to a server via Wi-Fi, where both the learning and control algorithms run.
5.2.2. Flight Mission
The experiments were conducted in an indoor environment with dimensions of 3 m (width) × 5 m (length) × 2.5 m (height), as shown in Figure 2. In this setting, the quadcopter’s mission was to fly toward a target located at the far end of the area, ensuring that the center of the target aligned with the center of the camera view and that the target’s vertical dimension matched the height of the camera frame. Any deviation from the designated experimental area was considered a collision or mission failure.
5.2.3. Pre-Collected Data
In the aforementioned environment, a human pilot directly controlled the quadcopter to collect data. Each collected sample included the current image, speed, and acceleration; after the pilot issued a control command, the next image, speed, and acceleration were recorded and combined with it into a single data sample. The sampling interval was 0.2 s, and approximately 10K samples were collected over about 30 min of flight. Additionally, upon mission success, each sample was assigned a reward through object detection based on YOLOv8 [42]. This setup represents a very-sparse-reward environment, and the total quantity of data is considerably lower than what is typically used in model-free reinforcement learning.
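The following is a hedged sketch of how a YOLOv8 detector could assign the sparse reward: reward 1 only if the detected target is centered in the frame and tall enough relative to the image, mirroring the mission criteria above. The model weights, target class id, and thresholds are assumptions, not the paper’s exact criterion.

# Sparse reward via YOLOv8 detection (assumed weights, class id, and thresholds).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # assumed pretrained weights; the actual detector may differ
TARGET_CLASS = 0             # assumed class id of the target object
CENTER_TOL, MIN_HEIGHT_RATIO = 0.1, 0.9

def sparse_reward(image_path):
    result = model(image_path)[0]
    h, w = result.orig_shape
    for box in result.boxes:
        if int(box.cls) != TARGET_CLASS:
            continue
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        centered = abs(((x1 + x2) / 2 - w / 2) / w) < CENTER_TOL
        tall_enough = (y2 - y1) / h > MIN_HEIGHT_RATIO
        if centered and tall_enough:
            return 1.0   # mission success
    return 0.0           # otherwise, no reward (sparse setting)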
To mitigate potential variations arising from battery depletion, temperature drift, or sensor noise, we conducted the data-collection flights in short, consistent sessions of about 5 min each. After each flight session, we replaced or recharged the drone’s battery to keep power levels as consistent as possible. Furthermore, we carefully tuned hyperparameters with small grid searches and ran multiple flight trajectories under similar environmental conditions to reduce confounding factors. By doing so, we sought to minimize the influence of uncontrolled variables on both the data collection and the reinforcement learning outcomes.
5.2.4. MDP Modeling
We define the state and action spaces, the sparse reward function, and the episode termination conditions. Table 1 summarizes these elements. We represent each transition in the form $(s_t, a_t, r_t, s_{t+1}, g)$, where the state $s_t$ includes both an RGB image (downsampled to 64 × 64) and the quadcopter’s IMU readings (speed and acceleration). The action $a_t$ is a 3-dimensional continuous vector (roll, pitch, throttle) over a bounded continuous range. At each step, the agent receives a sparse reward, and the episode ends upon mission success, collision, or reaching the time limit.
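The following is a minimal sketch of the MDP interface implied by Table 1, written with Gymnasium spaces. The exact bounds, observation keys, and step limit are illustrative assumptions.

# Illustrative MDP specification (assumed ranges, keys, and step limit).
import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    "image": spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8),    # downsampled RGB frame
    "imu": spaces.Box(low=-np.inf, high=np.inf, shape=(6,), dtype=np.float32),  # assumed speed/acceleration
})
# 3-D continuous command: roll, pitch, throttle (normalized range is an assumption).
action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

def reward_fn(mission_success: bool) -> float:
    return 1.0 if mission_success else 0.0   # sparse reward

def done_fn(mission_success: bool, collided: bool, step: int, max_steps: int = 200) -> bool:
    return mission_success or collided or step >= max_steps   # termination conditions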
5.2.5. MGH Framework
Our framework integrates DDPG (actor–critic) with a GAN-based environment model and Hindsight Experience Replay (HER). We chose DDPG because it is the most basic off-policy algorithm for continuous action spaces, which makes it easier to observe the effects of HER and the GAN directly, with minimal influence from the algorithm itself.
Table 2 summarizes the main hyperparameters used throughout training, and Table 3 provides a concise overview of each neural network architecture (actor, critic, generator, and discriminator).
In our implementation, the Actor maps the quadcopter’s visual observations to continuous control commands, and the Critic evaluates their corresponding Q-values for policy optimization. The Generator uses the current state and action to synthesize the next image, while the Discriminator learns to distinguish genuine transitions from generated ones. By training these modules jointly, we augment limited real-flight data with realistic synthetic experiences and apply HER to effectively handle the sparse-reward setting. This combination improves data efficiency and accelerates convergence in the quadcopter navigation task.
5.3. Experimental Results
5.3.1. RQ1: How Much Data Does the MGH Framework Require for Training?
To address this research question, we evaluated the performance of each model after running reinforcement learning on data that a human pilot had previously collected by controlling the quadcopter. The training data sizes for the models were 2K (2000), 5K (5000), 8K (8000), and 10K (10,000) samples. Since a reward of 1 was given only upon mission success, the rewards in the graphs range between 0 and 1.
The experimental results are shown in Figure 3. The MGH framework, when trained with 10K samples, converged around the 2K mark, and with 8K samples, it converged around the 4K mark. In contrast, the models trained with 5K and 2K samples failed to converge, which indicates that the mission itself was unsuccessful. We attribute this to the GAN not being properly trained because of the initial lack of sufficient data, so that erroneous model predictions continued to be used during training and prevented convergence.
5.3.2. RQ2: How Does the MGH Framework Perform Compared with Conventional Reinforcement Learning in Data-Sparse Environments?
To address this question, we compared the performance of the MGH framework with that of DDPG, DDPG-HER, and DDPG-GAN after training each model with 10K samples. The hyperparameters for each model were set to the same values as those used in the MGH framework.
The experimental results are shown in Figure 4. The MGH framework converged at approximately 2K samples, while DDPG-HER converged at around 6.8K samples. This demonstrates that, compared with DDPG-HER, the MGH framework’s convergence speed in this environment is 70.59% faster. Although DDPG-GAN showed a trend toward convergence, it did not reach a convergence point within the 10K sample range. Meanwhile, DDPG failed to show any sign of convergence and oscillated, which we attribute to the sparse rewards in the mission environment and insufficient flight data.
5.3.3. RQ3: How Accurate Is the GAN in Mimicking the Environment?
To address this question, we compare the geometric transformation from $s_t$ to $s_{t+1}$ in the real transition $(s_t, a_t, s_{t+1})$ with the transformation from $s_t$ to the GAN-generated state $\hat{s}_{t+1}$. We perform this comparison by extracting the homography matrices for each pair of states by using the Scale-Invariant Feature Transform (SIFT) algorithm. We then quantify the geometric discrepancy by computing the average normalized reprojection error over the matched keypoints, defining an error measure $E$ that ranges from 0 (identical) to 1 (completely different):
$$E = \frac{1}{N}\sum_{i=1}^{N}\min\!\left(\frac{e_i}{\epsilon}, 1\right).$$
Here, $E$ denotes the geometric discrepancy between the two images (e.g., $s_{t+1}$ and $\hat{s}_{t+1}$). $N$ represents the total number of matched keypoints used in the homography estimation. $e_i$ is the reprojection error for the $i$th keypoint pair, computed as the Euclidean distance between the transformed keypoint and its corresponding keypoint in the target image. $\epsilon$ is a predefined threshold that normalizes the error, ensuring that errors greater than or equal to $\epsilon$ yield a normalized value of 1.
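The following is a minimal sketch of this error measure: the average normalized reprojection error of matched SIFT keypoints under the estimated homography. The threshold value is an assumption; the exact $\epsilon$ used in our experiments is not restated here.

# Average normalized reprojection error E (toy usage with an identity homography).
import numpy as np

def reprojection_error_measure(H, src_pts, dst_pts, eps=5.0):
    """E in [0, 1]: 0 means the transformed keypoints coincide with their matches."""
    src_h = np.hstack([src_pts, np.ones((len(src_pts), 1))])   # homogeneous coordinates
    proj = (H @ src_h.T).T
    proj = proj[:, :2] / proj[:, 2:3]                          # back to Euclidean coordinates
    errors = np.linalg.norm(proj - dst_pts, axis=1)            # e_i for each matched keypoint
    return float(np.mean(np.minimum(errors / eps, 1.0)))

src = np.array([[10.0, 10.0], [50.0, 40.0], [30.0, 60.0]])
dst = src + 1.0
print(reprojection_error_measure(np.eye(3), src, dst))  # small E, i.e., nearly identical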
Figure 5 shows the experimental results. In the figure, the bold line represents the mean of the data, and the dotted lines above and below it indicate one standard deviation. Relative to the black vertical line, the value on the left indicates the magnitude of the standard deviation, and the value on the right indicates the mean. At 2K and 5K, the mean values are 0.76 and 0.59, respectively, indicating that the geometric transformation of the real data differs considerably from that of the GAN-generated data. From 8K onward, however, the error drops dramatically to below 0.2. This suggests that with more data, the GAN better captures features such as edges, resulting in a more accurate representation. Notably, these findings are consistent with the results from RQ1, confirming that the quality of the environmental model learned from the training data ultimately determines the success of model-based reinforcement learning.
Additionally, Figure 6 displays images of the environment generated by the GAN at different data levels (from 2K to 8K, left to right). At 2K, the image is considerably blurry, and when the SIFT algorithm is applied, the extracted feature points differ significantly from those in the other images, resulting in completely different values. Although the image quality improves at 5K, some blurriness persists, leading to a high standard deviation. However, from 8K onward, the images become relatively clear, and the geometric transformations closely match the actual results, resulting in a low error.
6. Discussion
We have proposed the MGH framework, which integrates a GAN and HER into model-based reinforcement learning (MBRL) to enhance data efficiency in reinforcement learning for autonomous flight control. In particular, we explored approaches to maximizing the applicability of reinforcement learning by using real flight data in the aerospace domain. This work offers important insights for practical autonomous flight solutions sought by the aviation industry, as detailed in the following.
First, regarding data sparsity, the MGH framework outperformed conventional model-free reinforcement learning (MF-RL). Collecting flight data from aircraft or drones is limited by safety concerns and costs. The GAN-based environmental modeling and HER-based experience reuse methods integrated in this study maximize the utilization of such limited data and enable stable learning with a relatively small amount of flight data. This is significant because it allows aerospace companies to reduce costs while developing effective autonomous flight systems based on actual flight data.
Second, in terms of real-time control system efficiency, the MGH framework achieved a 70.59% improvement in convergence speed compared with the strongest baseline, DDPG combined with HER. This is a critical performance indicator for autonomous flight control systems that must operate in real time. Moreover, our results confirm that model accuracy significantly affects overall reinforcement learning performance; by enhancing model accuracy with the GAN, our approach reduces the performance gap with the real environment.
Third, our approach also makes a notable contribution to resolving the Sim-to-Real problem. Traditional simulator-based learning often leads to abnormal behavior during real flights due to discrepancies in aerodynamic conditions. In contrast, the proposed GAN-based environmental modeling accurately replicates real flight data, achieving a homography error of less than 0.06. This indicates that our method can serve as a robust foundation for training more precise autonomous flight systems during actual flight tests of aircraft or drones.
Fourth, this study has important implications for enhancing aviation safety. In aerospace systems, instability in learning due to data scarcity or collisions during flight can be catastrophic. By reusing even unsuccessful experiences through HER, the MGH framework provides stable learning under data-sparse conditions. This capability is particularly crucial to the development of Urban Air Mobility (UAM) systems and defense-related autonomous flight systems.
In conclusion, the MGH framework offers significant potential for implementing autonomous flight control systems that are practical in the aviation industry. The reinforcement learning approach that combines a GAN and HER demonstrated high data efficiency and performance even in real-world environments with sparse data, and it is expected to make substantial contributions to future developments in autonomous flight systems and aircraft control technologies.
7. Threats to Validity
7.1. Internal Validity
A potential threat to internal validity arises from factors not fully controlled during data collection and experimentation. For instance, variations in the drone’s battery status, minor sensor drift, and potential environmental changes (e.g., slight temperature fluctuations in the indoor facility) could influence flight performance and learning outcomes. Although we minimized these effects by conducting flights within a short time window and periodically monitoring battery levels, we acknowledge that perfect control was not feasible. Additionally, the sparse-reward setting might introduce variability in how quickly each model converges, but we mitigated this by carefully tuning hyperparameters and by running multiple flight trajectories under similar conditions.
7.2. External Validity
All experiments were conducted on a single quadcopter model and in a relatively simple indoor environment. Consequently, our results may not directly generalize to different aircraft types, large outdoor areas, or environments with extensive obstacles and dynamic conditions. While the core mechanism of combining a GAN-based model with HER is not inherently limited to any specific drone or setting, additional adaptations—such as more advanced GAN architectures or explicit obstacle modeling—may be required for complex scenarios.
8. Conclusions
In this study, we proposed the MGH framework, which effectively leverages sparse data in real flight environments. By integrating Generative Adversarial Networks (GANs) and Hindsight Experience Replay (HER) into model-based reinforcement learning (MBRL), the MGH framework was evaluated in an actual quadcopter flight environment and compared with conventional model-free reinforcement learning methods. The results demonstrated that the MGH framework offers superior data efficiency and, in particular, achieves a convergence speed up to 70.59% faster than the best-performing baseline, DDPG combined with HER.
A key innovation of our approach is that it simultaneously addresses both the accuracy of the environmental model and the sample efficiency in sparse-reward settings. Previous model-based methods have often focused solely on improving the dynamic model (e.g., through GANs) or on handling sparse rewards with techniques such as HER, but seldom have they combined both strategies in a single MBRL framework with real flight data. By unifying GAN-driven environment modeling and HER-based sample reuse, our framework fills this gap and demonstrates enhanced performance in real quadcopter flight tasks. This synergy between improving model accuracy and maximizing sample efficiency represents the primary contribution of our study and differentiates it from existing research that typically addresses these challenges in isolation.
Although the primary aim was to validate data efficiency gains, several practical considerations merit further exploration. First, the current analysis chiefly focuses on learning convergence speed and mission success rate. Future work should investigate additional performance factors, such as computational cost, long-term policy stability, and hyperparameter sensitivity, to provide a more comprehensive assessment of the framework. Second, the experimental setup was deliberately simplified—featuring a relatively small indoor flight area with few obstacles—to highlight the core benefits of combining GANs and HER. While our results underscore the framework’s efficacy under these conditions, more intricate obstacle arrangements, dynamic environmental factors, and larger or outdoor arenas could challenge the GAN-based modeling and are promising directions for subsequent research.
To the best of our knowledge, this study is the first in which a GAN and HER are combined within an MBRL framework, which significantly enhances the practical applicability of reinforcement learning in autonomous flight. The combined GAN–HER learning framework is expected to maximize data efficiency and learning performance in diverse real-world environments, extending its potential application to autonomous flight, robotic control, and other related fields.