Article

A Hybrid Decision-Making Framework for UAV-Assisted MEC Systems: Integrating a Dynamic Adaptive Genetic Optimization Algorithm and Soft Actor–Critic Algorithm with Hierarchical Action Decomposition and Uncertainty-Quantified Critic Ensemble

1 School of Mechanical Engineering, Dalian University of Technology, Dalian 116024, China
2 China North Artificial Intelligence and Innovation Research Institute, Beijing 100072, China
3 NORINCO Unmanned Vehicle Research and Development Center, China North Vehicle Research Institute, Beijing 100072, China
4 Collective Intelligence and Collaboration Laboratory, Beijing 100072, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(3), 206; https://doi.org/10.3390/drones9030206
Submission received: 24 January 2025 / Revised: 6 March 2025 / Accepted: 11 March 2025 / Published: 13 March 2025
(This article belongs to the Special Issue Unmanned Aerial Vehicles for Enhanced Emergency Response)

Abstract: With the continuous progress of UAV technology and the rapid development of mobile edge computing (MEC), the UAV-assisted MEC system has shown great application potential in special fields such as disaster rescue and emergency response. However, traditional deep reinforcement learning (DRL) decision-making methods suffer from limitations such as difficulty in balancing multiple objectives and in achieving training convergence when making decisions over the mixed action space of UAV path planning and task offloading. This article proposes a hybrid decision framework based on the improved Dynamic Adaptive Genetic Optimization Algorithm (DAGOA) and a soft actor–critic (SAC) with hierarchical action decomposition, an uncertainty-quantified critic ensemble, and adaptive entropy temperature, where DAGOA performs an effective search and optimization in the discrete action space, while SAC performs fine control and adjustment in the continuous action space. By combining the two algorithms, the joint optimization of drone path planning and task offloading can be achieved, improving the overall performance of the system. The experimental results show that the framework offers significant advantages in improving system performance, reducing energy consumption, and enhancing task completion efficiency. When the system adopts the hybrid decision framework, the reward score increases by a maximum of 153.53% compared to pure deep reinforcement learning algorithms for decision-making. Moreover, it achieves an average improvement of 61.09% on the basis of various reinforcement learning algorithms such as the proposed SAC, proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), and twin delayed deep deterministic policy gradient (TD3).

1. Introduction

1.1. Research Background

With its rapid development, UAV technology has come to play an irreplaceable role in numerous fields, especially in mobile edge computing (MEC) systems. Traditional ground MEC servers suffer from problems such as high deployment costs, poor adaptability to network dynamics, and a fixed service scope. In recent years, the paradigm shift from ground-edge computing to air–ground joint-edge computing, namely UAV-assisted MEC, has provided a new way of addressing the above limitations. The UAV-assisted MEC system aims to provide low-latency, high-bandwidth computing services by deploying computing resources close to the user or edge platform, thereby meeting the growing demand for data processing. This technology serves as an essential foundation for the Internet of Things, end-to-end large model inference, and collaboration among multiple unmanned platforms.
In recent years, the research focus in the field of UAV-assisted MEC systems has mainly been on deploying, integrating, and optimizing the communication, perception, and computing capabilities of various nodes in the network through drone platforms to form intelligent edge networks, aiming to achieve an agile and ubiquitous intelligence of things. The high maneuverability of drones enables them to be rapidly and flexibly deployed as airborne MEC servers to assist ground MEC servers in providing temporary computing services anytime, anywhere. In addition, the high-probability line of sight (LOS) link of drones could also improve the communication reliability and network capacity of ground MEC networks [1]. As a mobile edge server, the UAV not only improves data transmission efficiency but also reduces data transmission latency, providing efficient and convenient services for terminal platforms and users.
The UAV-assisted MEC systems in Figure 1 have demonstrated great potential in several application scenarios and can play a significant role in post-disaster rescue and emergency response to accidents due to their combination of the high maneuverability of drones and the proximity computation capability of MEC. In the process of post-disaster rescue, the UAV serves as an airborne mobile edge computing server, providing computing resources close to the ground users. The ground users generate tasks that require computation, and these tasks can be offloaded to the UAV for processing. The UAV communicates with the ground users through data links, receiving tasks and transmitting processed results back to users. The UAV’s high mobility allows it to quickly move to areas where computing resources are needed, enhancing the system’s responsiveness and efficiency.

1.2. Research Gap

In practical applications, UAV-assisted MEC systems still face many challenges, such as the contradiction between supply and demand in resource allocation, the heterogeneity of supply and demand in computational offloading, energy saving and real-time trajectory control, and different time-scale dynamics, which put forward higher requirements for the effective allocation of computational resources, UAV trajectory control, and the design of task offloading strategies [2]. First, the endurance and energy consumption problems of UAVs need to be solved. Since UAVs are limited by their battery capacity, how to extend endurance while ensuring the efficiency of mission execution remains a critical open issue. Second, path planning and mission offloading strategies for UAVs in complex environments are also a major difficulty. In addition, as the number of user devices increases, the growing demand for effective management and scheduling of MEC resources across different application scenarios severely constrains the stability of the system and its real-time data processing in complex environments.
In order to cope with the above challenges, research on optimization for UAV-assisted MEC systems, especially in optimizing decision-making algorithms for UAVs, is particularly crucial. By introducing advanced decision-making algorithms, such as reinforcement learning and deep learning, the decision-making efficiency of UAVs in path planning and task offloading can be improved, thus further enhancing the overall performance of the MEC system. However, the traditional pure deep reinforcement learning (DRL) decision-making methods have certain limitations in practical applications [3]. For example, DRL algorithms often require a large amount of training data and computational resources when dealing with complex dynamic environments, resulting in a long training time and difficulties in convergence. Secondly, DRL usually faces difficulty in balancing the weights between different objectives when facing multi-objective optimization problems, resulting in ineffective decision-making. In addition, the DRL algorithm also suffers from certain accuracy degradation when dealing with mixed problems in discrete and continuous action spaces [4].
To overcome such limitations, this paper proposes an innovative hybrid decision-making framework that combines the Dynamic Adaptive Genetic Optimization Algorithm (DAGOA) with deep reinforcement learning for UAV-assisted MEC systems. The key idea of the framework is to make task offloading decisions via the improved genetic algorithm, which can locate near-optimal solutions in a complex search space thanks to its global search capability and good convergence, whereas DRL focuses on path planning decisions and can achieve efficient decision-making in dynamic environments by learning the optimal policy through interaction with the environment. The potential advantage of the hybrid decision-making framework is that it can fully exploit the respective strengths of DAGOA and DRL to compensate for the deficiencies of a single algorithm. Specifically, DAGOA performs an effective search and optimization in the discrete action space, while DRL performs fine control and adjustment in the continuous action space. By combining the two, the joint optimization of UAV path planning and mission offloading can be achieved to improve the overall performance of the system.
In the next part of this paper, a hybrid decision-making framework based on DAGOA and the soft actor–critic (SAC) with multiple novel mechanisms is proposed as a primary solution for UAV control in the UAV-assisted MEC system, and simulation experiments are conducted to verify that DAGOA, when coupled with a variety of DRL algorithms, is able to help UAVs achieve efficient path planning and task offloading decisions in complex dynamic environments in UAV-assisted MEC systems. The proposed decision framework achieves the co-optimization of discrete task offloading decisions (via DAGOA) and continuous path planning (via enhanced SAC), overcoming the limitations of single-algorithm approaches in handling mixed action spaces.
The experimental results show that the framework achieves significant advantages in enhancing system performance, reducing energy consumption, and improving task completion efficiency. With the hybrid decision-making framework, the system attains a maximum increase of 153.53% in reward scores compared to pure deep reinforcement learning algorithms for decision-making. The framework achieves an average improvement of 61.09% in rewards on the basis of multiple reinforcement learning algorithms such as SAC, Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed Deep Deterministic Policy Gradient (TD3) [5]. Meanwhile, the research in this paper provides new ideas and methods for the optimization of UAV-assisted MEC systems, which has high academic value and application prospects and is of great theoretical, as well as practical, significance for promoting the development and application of UAV-assisted MEC systems.

2. Related Works

2.1. State of the Art of UAV Path-Planning Algorithms

An algorithm based on the Kalman filter and the improved rapid exploration random tree (KF-RRT) was proposed by Yan and Fu [6] for the dynamic path-planning problem of UAVs. The algorithm improves the efficiency and adaptability of UAV path planning by introducing weighting coefficients for target region trends and predicting trajectories of dynamic obstacles using the Kalman filter. The feasibility of the path is further optimized via the B-spline curve smoothing technique. Compared with existing RRT algorithms, the KF-RRT algorithm shows superiority in simulation experiments when planning UAV paths in dynamic environments. However, the limitation of this study is that the algorithm may encounter high computational complexity and a lack of real-time performance in highly dynamic and complex environments.
Wang et al. [7] proposed an unmanned aerial vehicle (UAV) path-planning method based on modified particle swarm optimization (MPSO). By establishing a path-planning model that considers UAV performance limitations and spatial obstacle threats, the MPSO algorithm with dynamically adjusted inertia weights is used to plan the navigation path of the UAV, which effectively improves the efficiency and reliability of path planning. The simulation results show that the algorithm can effectively solve the path-planning problem of UAVs. However, the limitation of this study is that the algorithm may encounter high computational complexity and a lack of real-time performance in highly dynamic and complex environments.
An ant colony optimization (ACO)-based path-planning algorithm for unmanned aerial vehicles (UAVs) in dynamic environments was proposed by Baroomi et al. [8]. The algorithm determines the optimal path of the UAV using colony intelligence and ant behavior and dynamically adapts to environmental changes through pheromone trajectories and heuristic functions. Simulation results show that the algorithm is effective and efficient in dynamic environments. However, the research limitation is that the energy constraints of UAVs and the dynamic changes in real complex environments are not considered. Although Zhang et al. [9] optimized the efficiency and success rate of UAV path planning by improving the goal-biased RRT algorithm, the adaptability in highly dynamic and complex constraint environments still needs further research.
Zhou et al. [10] proposed a multi-UAV trajectory planning algorithm based on the multi-agent deep deterministic policy gradient (MADDPG) algorithm, which is named potential field dense MADDPG (PF-MADDPG). This algorithm employs a potential field-dense reward function to effectively improve the learning efficiency and path-planning performance in complex environments. Nevertheless, the study neglects the communication synergy between UAVs, and the scalability and robustness in terms of high dynamics have not been fully verified.

2.2. UAV-Assisted Mobile Edge Computing

Li et al. [11] proposed a cooperative computation offloading strategy in UAV-assisted mobile edge computing, which optimizes the initial number and 3D location by deploying multiple UAVs on demand. The rational planning of UAV trajectories during computation offloading is investigated to ensure the communication quality of mobile users and reduce energy consumption in order to accomplish the task with a limited battery capacity and computational power. The mixed-integer nonlinear planning problem is solved by the block coordinate descent method, and the results show that the algorithm can effectively reduce the path loss and total energy consumption. However, the study is limited to an idealized UAV energy and computational model, and real-world applications may face more complex environmental and equipment constraints.
Xiang et al. [12] proposed a UAV-assisted mobile edge computing system that aims to minimize the energy consumption and latency-weighted sum of the system by jointly optimizing the flight trajectory of the UAV and the user’s task offloading strategy. The study employs a differential evolutionary algorithm and an optimistic actor–critic algorithm and verifies the advantages of the algorithms in terms of energy consumption and convergence performance through simulations. However, the limitation of the study is that it does not take into account the energy constraints of UAVs or dynamics in real complex environments.
Gao et al. [13] proposed a secure data transmission scheme for UAV relay-assisted maritime mobile edge computing (MEC) systems. Interference signals are sent via ground jammers to reduce the risk of interception, and the amplify-and-forward (AF) protocol is utilized to jointly optimize the user’s transmission power, time-slot allocation factor, and UAV flight trajectory in order to maximize the user’s minimum secure computing power. The study employs block coordinate descent (BCD) and successive convex approximation (SCA) techniques to solve the optimization problem. Simulation results show that the proposed scheme is more effective in improving the safe computing power of the system compared to the four benchmark schemes. However, the limitations of this study are the assumption of static relative positions between UAVs and the neglect of the effects of variable weather conditions in marine environments on UAV communication and flight performance.
A deep reinforcement learning-based trajectory optimization and resource allocation scheme (DRTORA) is proposed by Gao et al. [14] to address the problem of improving the secure computing performance of UAV-assisted MEC networks. The scheme utilizes deep Q-learning (DQN) to optimize the flight trajectory, task offloading decision, and time allocation of UAVs to maximize the system’s secure computing capability. Simulation results show that DRTORA is effective in improving cybersecurity computing performance. However, the limitation of this study is that a simplified communication model is assumed, which does not adequately consider the dynamic changes and complexity in real-world environments.
Wang and Sun [15] proposed a UAV-assisted MEC system to reduce energy consumption by jointly optimizing the computation offloading decisions and resource allocation, but they did not include consideration of signal coverage and energy dynamics. Zhang et al. [16] employed a PPO reinforcement learning algorithm to optimize the computational offloading strategy for UAV-assisted MEC, which improves the task processing efficiency but ignores the UAV energy consumption and potential communication changes. Zhou et al. [17] optimized a computational offloading strategy for UAV-assisted MEC via the DDPG algorithm, which effectively reduces the latency but ignores the challenges of UAV energy consumption and the dynamic ocean environment.
Yin and Tian [18] explored how to reduce the energy consumption of user devices by optimizing user task offloading policies and UAV trajectories in UAV-assisted mobile edge computing (MEC) in the context of 5G. An iterative optimization algorithm based on the deep deterministic policy gradient (DDPG) (IOECA) was proposed to minimize energy consumption. The study considered the battery capacity limitations of UAVs and validated the effectiveness of the proposed algorithm through simulation experiments. However, the limitation of the study is that a fixed UAV altitude and an idealized communication environment are assumed, and the dynamic changes and interference factors in practical applications were not fully considered.
Han et al. [19] proposed a joint user-association and UAV-deployment optimization method in UAV-assisted MEC networks, which effectively reduces task latency through optimal transmission theory and particle swarm optimization algorithms. However, the study faces several limitations: first, the paper did not consider UAV energy management and fine-grained computing loads, which may affect the UAV's sustained operation capability and communication quality; second, though the paper focused on the optimization of task delay, it did not account for the algorithm execution efficiency, which makes practical application more difficult.

2.3. Application of Genetic Algorithms

The current genetic algorithms applied to UAV-related control and decision-making generally suffer from limitations in dynamic adaptation and performance. Yao et al. [20] presented an improved genetic algorithm and two related coding methods for the multi-UAV cooperative search problem. Despite their innovative coding strategies, the genetic algorithm still faces challenges in dealing with large-scale search spaces and dynamic environments, leaving considerable room for improving both the accuracy and speed of the algorithm. Wang et al. [21] proposed an improved genetic algorithm based on the beetle tentacle search algorithm for solving the multi-UAV cooperative multi-tasking problem. The algorithm improves the global and local search ability of the genetic algorithm by enhancing population diversity and dynamically adjusting the mutation probability, but further research is still needed to overcome premature convergence and parameter-tuning difficulties. Chen and Qi [22] proposed a path-planning method for UAVs that incorporates a genetic algorithm and an improved ant colony algorithm, aiming to improve the efficiency and safety of path planning. The method utilizes a genetic algorithm to initialize the pheromone matrix of the ant colony algorithm to accelerate convergence and avoid local optima. However, the limitation of this fusion algorithm is that the iterative redundancy of the genetic algorithm in the later search stages may lead to a long planning time, while the improved ACO algorithm enhances the global search capability, but the response speed and path smoothing for unexpected threats in dynamic environments still need to be improved.
On the other hand, the genetic algorithm research field lacks research on joint applications with deep reinforcement learning in the field of UAV-assisted MEC systems, where UAV path planning and task execution cannot be balanced currently. Gao et al. [23] introduced a genetic algorithm combining oppositional and chaotic search for solving the path-planning problem of UAVs in a multi-obstacle environment. The algorithm improves the search capability and convergence speed of the genetic algorithm by improving the initial population generation and crossover strategies. Nevertheless, the genetic algorithm still faces the challenges of convergence speed and local optimization when dealing with complex constraints and high-dimensional problems, and future research should focus on improving the performance for dynamic path point processing and real-time path planning. Li et al. [24] investigated the multi-UAV air combat weapon-target assignment problem based on genetic algorithms and deep learning by constructing an optimization model in order to improve combat efficiency. Although genetic algorithms perform well in global search, their computational complexity and sensitivity to parameters when dealing with large-scale problems limit their application. An exploration for more efficient algorithms to meet the challenges of real-time dynamic battlefield environments is still needed.
The current improvements to genetic algorithms are still insufficient in terms of execution efficiency and dynamic environment adaptation. Su et al. [25] proposed an improved genetic algorithm that effectively improves the efficiency and effectiveness of multi-UAV task assignment by introducing the Metropolis criterion of simulated annealing and the variable step annealing cooling method. However, compared with the DAGOA adaptive genetic algorithm, the method faces limitations in terms of dynamic environment adaptation and early stopping mechanisms, and future research needs to be further optimized to improve the flexibility and efficiency of the algorithm. Li et al. [26] proposed a multi-UAV maritime target search path-planning algorithm based on GA, which achieves the target classification and path planning by combining K-means clustering and multi-chromosome GA path planning. However, compared with the DAGOA adaptive genetic algorithm, the algorithm faces limitations in dynamic adaptation and individual selection, resulting in limitations in the robustness and performance of the algorithm.

3. Framework Design

3.1. Design of DAGOA

Recent studies have demonstrated the complementary advantages of combining evolutionary algorithms with deep reinforcement learning. Liang et al. [27] proposed RL-RVEA that integrates reinforcement learning with reference vector adaptation, showing RL’s capability of dynamically adjusting search strategies while maintaining EA’s global exploration. In the PSO domain, Li et al. [28] developed NRLPSO using Q-learning to guide velocity vector generation, achieving a 61.09% performance improvement over standalone RL. For complex action spaces, Wang et al. [29] employed genetic programming to automatically design trigger conditions in multi-agent RL systems, leveraging EA’s structure search ability. These hybrid approaches inspire our framework design: DAGOA inherits GA’s global search capability through dynamic mutation rate adaptation and an early stopping mechanism, while SAC with hierarchical action decomposition enables fine-grained policy learning. Compared with existing GA-DRL hybrids [30], our innovation lies in the uncertainty-quantified critic ensemble that overcomes the challenge of dynamic and partial observability in the environment.
The proposed Dynamic Adaptive Genetic Optimization Algorithm (DAGOA) is an advanced evolutionary computation technique that enhances traditional genetic algorithms by integrating adaptivity and dynamic parameter tuning. This section provides a rigorous examination of DAGOA's methodology, detailing its operational principles, step-by-step procedure, mathematical foundations, and the innovative mechanisms that contribute to its superior performance in solving complex optimization problems.
Genetic algorithms (GAs) are stochastic search methods inspired by the principles of natural evolution and genetics. While effective in exploring large solution spaces, traditional GAs can suffer from premature convergence and inefficiency in dynamic environments. DAGOA addresses these limitations through adaptive mechanisms and early stopping criteria, optimizing the task offloading decisions in user-centric systems. The DAGOA algorithm dynamically adjusts parameters, such as the mutation rate and population size, according to the real-time state of the optimization process, which increases the adaptability of the algorithm to complex tasks and dynamic environments; this adaptability enables the algorithm to maintain population diversity, thus avoiding local optima and increasing the convergence speed. The integration of early stopping criteria further enhances computational efficiency by terminating the process when negligible improvements are observed. The following formulation describes the main mechanism of DAGOA as implemented in the framework.
1.
Initialization: The DAGOA begins by initializing a population of candidate solutions. Each individual is represented as a vector of decision variables, populated with random values from a uniform distribution.
$P^{(0)} = \{ x_1, x_2, \ldots, x_N \}, \qquad x_i = (x_{i1}, x_{i2}, \ldots, x_{iM}) \sim \mathrm{Uniform}(0, 1)$
where $N$ is the number of individuals, and each $x_i$ consists of decision variables for a specific number of users (user_num).
2.
Fitness evaluation: The fitness of each individual is assessed using a predefined fitness function, which evaluates the quality of the solution based on the state of the system and its parameters. An adaptation function, $f(x_i)$, is defined for each individual to evaluate its offloading decision performance under a specific environment state, $s$, and location, $(x, y)$.
$f(x_i) = w_{\mathrm{energy}} \cdot E(x_i) + w_{\mathrm{delay}} \cdot D(x_i) - w_{\mathrm{coverage}} \cdot C$
where $E(x_i)$, $D(x_i)$, and $C$ denote the total energy consumption, the total latency, and the coverage, respectively, which are described in the experiment design section.
3.
Selection: DAGOA employs tournament selection to choose individuals for reproduction. This method selects a subset of individuals randomly, and the one with the highest fitness is chosen as a parent, promoting high-quality solutions while maintaining genetic diversity.
$\mathrm{Select}(P) = \arg\max_{x \in \mathrm{Tournament}} f(x)$
4.
Crossover: genetic diversity is further enhanced through crossover, where pairs of individuals exchange segments of their decision variables, producing offspring that inherit characteristics from both parents.
Given $x_1 = (x_{11}, x_{12}, \ldots, x_{1N})$ and $x_2 = (x_{21}, x_{22}, \ldots, x_{2N})$, choose a crossover point $c \in \{1, 2, \ldots, N-1\}$ and generate the offspring:
$x_{\mathrm{offspring},1} = (x_{11}, \ldots, x_{1c}, x_{2(c+1)}, \ldots, x_{2N})$
$x_{\mathrm{offspring},2} = (x_{21}, \ldots, x_{2c}, x_{1(c+1)}, \ldots, x_{1N})$
5.
Mutation: The mutation operation introduces random variations in the offspring to explore new regions of the solution space. DAGOA adaptively adjusts the mutation rate, increasing it if the population shows signs of stagnation.
$x_{ij}' = \begin{cases} \mathrm{Uniform}(0, 1), & \text{if } \mathrm{rand} < p_{\mathrm{mutate}}(t) \\ x_{ij}, & \text{otherwise} \end{cases}$
where the mutation probability, $p_{\mathrm{mutate}}(t)$, is increased when no improvement is observed over successive generations, in order to raise population diversity.
$p_{\mathrm{mutate}}(t) = \begin{cases} 0.05, & \text{if no\_improvement\_count} \geq \mathrm{patience}/2 \\ 0.01, & \text{otherwise} \end{cases}$
6.
Adaptation and early stopping: The algorithm proceeds through generations, tracking improvements in fitness scores. Early stopping is triggered if the best fitness improvement remains below a defined threshold over several generations, conserving computational resources.
if $\max\big(f(P^{(t)})\big) - \mathrm{best\_fitness} \leq \epsilon$, then increment no_improvement_count
7.
Extracting the best solution: upon termination, the algorithm selects the individual with the highest fitness as the ultimate solution.
$x^{*} = \arg\max_{x \in P^{(t)}} f(x)$
DAGOA showcases several notable advantages, such as adaptive mutation, early stopping mechanism, and tournament selection, ensuring its robustness and flexibility. Through the above steps, the DAGOA achieves the optimization of the task offloading problem, dynamically adjusts the parameters of the genetic algorithm to adapt to the environmental changes, and improves the efficiency of the algorithm. The DAGOA represents a significant advancement in the field of evolutionary computation. Its adaptive mechanisms and efficient termination criteria make it a powerful tool for handling complex optimization challenges, providing an optimal balance between exploration and exploitation while ensuring computational efficiency.
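To make the procedure above concrete, the following Python sketch implements the main DAGOA loop (steps 1–7). The population size, patience, improvement threshold, and tournament size are illustrative assumptions rather than the exact experimental settings, and `fitness_fn` stands for the adaptation function $f(x_i)$ defined earlier.

```python
import numpy as np

def dagoa(fitness_fn, user_num, pop_size=50, generations=200,
          patience=10, epsilon=1e-4, tournament_size=3):
    """Minimal DAGOA sketch: tournament selection, one-point crossover,
    adaptive mutation rate, and early stopping (illustrative parameters)."""
    pop = np.random.uniform(0.0, 1.0, size=(pop_size, user_num))     # step 1: initialization
    best_fitness, no_improvement = -np.inf, 0

    for _ in range(generations):
        scores = np.array([fitness_fn(ind) for ind in pop])          # step 2: fitness evaluation

        # step 6: adaptation and early stopping
        if scores.max() - best_fitness <= epsilon:
            no_improvement += 1
        else:
            no_improvement = 0
        best_fitness = max(best_fitness, scores.max())
        if no_improvement >= patience:
            break
        # adaptive mutation rate: raise the probability when the search stagnates
        p_mutate = 0.05 if no_improvement >= patience // 2 else 0.01

        def select():                                                 # step 3: tournament selection
            idx = np.random.choice(pop_size, tournament_size, replace=False)
            return pop[idx[np.argmax(scores[idx])]]

        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = select(), select()
            c = np.random.randint(1, user_num) if user_num > 1 else 1  # step 4: crossover point
            child1 = np.concatenate([p1[:c], p2[c:]])
            child2 = np.concatenate([p2[:c], p1[c:]])
            for child in (child1, child2):
                mask = np.random.rand(user_num) < p_mutate            # step 5: mutation
                child[mask] = np.random.uniform(0.0, 1.0, mask.sum())
                offspring.append(child)
        pop = np.array(offspring[:pop_size])

    scores = np.array([fitness_fn(ind) for ind in pop])
    return pop[np.argmax(scores)]                                     # step 7: best solution
```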

3.2. Embedded SAC Algorithm

The soft actor–critic (SAC) algorithm is an advanced, model-free, off-policy reinforcement learning method [31]. It was designed to address the exploration–exploitation trade-off by combining the benefits of entropy maximization with policy learning. The principle behind SAC is to learn a stochastic policy that not only maximizes expected returns but also maximizes entropy, leading to more robust and exploratory behavior. In this section, we propose enhancements to the SAC algorithm within the hybrid decision-making framework for UAV-assisted mobile edge computing (MEC) systems. The goal is to improve the robustness and decision efficiency of the framework, addressing the challenges of dynamic environments and multi-objective optimization.
SAC, as a deep reinforcement learning algorithm based on the principle of maximum entropy, is able to learn efficiently in high-dimensional continuous action space. SAC optimizes the strategy learning process and enhances its adaptability to dynamic changes in the environment by introducing a dual structure of the value function and policy network. In this framework, the SAC is responsible for the path planning of the UAV, which directly interacts with the environment by obtaining an effective representation of the environment state, including user position, task type, and obstacle information, and it updates its strategy in real time to maximize the cumulative reward. In addition to SAC, this paper also investigates the way in which algorithms such as PPO, DDPG, and TD3 [32] are combined with the DAGOA algorithm and tests their performance in the same simulation environment, thus verifying the extensibility of the proposed framework and the superiority of the proposed SAC.
The proposed soft actor–critic (SAC) algorithm forms the core of UAV path planning in our hybrid decision framework. It operates in a continuous action space, $\mathcal{A} \subset \mathbb{R}^2$, where actions correspond to directional control, $a_\theta \in [0, \pi/2]$, and speed modulation, $a_v \in [1, 10]$. The state space $s^t$ integrates four critical components, including the user context, $s_{\mathrm{user}}^t$, environmental constraints, $s_{\mathrm{obstacle}}$, the UAV status, $s_{\mathrm{uav}}^t$, and the trajectory history, $s_{\mathrm{history}}^t$, which are introduced in Section 4.1.
The policy network $\pi_\phi$ and the Q-function ensemble $\{ Q_{\theta_i} \}_{i=1}^{N}$ jointly optimize the objective:
$\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^{t} \big( r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \big) \right]$
where $\mathcal{H}(\pi) = -\mathbb{E}_{\pi}[\log \pi(a \mid s)]$ is the policy entropy. On the basis of maximizing rewards in traditional reinforcement learning, this objective function introduces the policy entropy, $\mathcal{H}(\pi)$, as a regularization term. Maximizing entropy encourages the policy to maintain randomness and avoid premature convergence to local optima.
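To illustrate how the entropy term enters the policy update in practice, the following PyTorch-style sketch computes an entropy-regularized actor loss for a squashed Gaussian policy with a single critic; it is a simplified illustration, and the assumed `policy_net`/`q_net` interfaces are placeholders rather than the exact networks used in this work, which employ the critic ensemble described below.

```python
import torch
from torch.distributions import Normal

def actor_loss(policy_net, q_net, states, alpha):
    """Entropy-regularized SAC actor loss sketch: maximize Q + alpha * entropy,
    i.e., minimize E[alpha * log_pi - Q]. Assumes policy_net returns (mean, log_std)."""
    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    raw_action = dist.rsample()                       # reparameterized sample
    action = torch.tanh(raw_action)                   # squash to the bounded action range
    # log-probability with the tanh change-of-variables correction
    log_pi = dist.log_prob(raw_action).sum(-1) \
             - torch.log(1 - action.pow(2) + 1e-6).sum(-1)
    q_value = q_net(states, action).squeeze(-1)
    return (alpha * log_pi - q_value).mean()
```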

3.2.1. Hierarchical Action Decomposition

Because exploring an optimized overall control strategy directly is difficult, the action space is decomposed into two coupled sub-spaces to reduce exploration complexity, denoted as a high-level planner and a low-level controller. The high-level planner generates the target direction $\tilde{a}_\theta$ using attention-weighted state features:
$\tilde{a}_\theta = \mathrm{MLP}\big( \mathrm{Att}(W_q s_v, W_k s_u)\, s_e \big)$
where $\mathrm{Att}(\cdot)$ computes multi-head attention weights over the user states. The low-level controller outputs the speed modulation, $\Delta a_v$, conditioned on obstacle proximity:
$\Delta a_v = \sigma\big( W_l \cdot \big[ s_v,\ \min_j \lVert s_v - s_{e,j} \rVert_2 \big] \big)$
where $\sigma$ is a sigmoid function constraining the output to $[0, 1]$.
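A possible realization of the two-level decomposition is sketched below in PyTorch; the feature dimensions, head sizes, and the way the UAV state $s_v$, user states $s_u$, and obstacle positions are passed in are assumptions for illustration only.

```python
import math
import torch
import torch.nn as nn

class HierarchicalPolicy(nn.Module):
    """Sketch of the hierarchical action decomposition: a high-level planner that
    attends over user features to propose a direction, and a low-level controller
    that modulates speed from the UAV state and the nearest-obstacle distance."""
    def __init__(self, uav_dim=4, user_dim=4, embed_dim=32):
        super().__init__()
        self.q_proj = nn.Linear(uav_dim, embed_dim)     # W_q applied to the UAV state s_v
        self.k_proj = nn.Linear(user_dim, embed_dim)    # W_k applied to the user states s_u
        self.v_proj = nn.Linear(user_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.direction_head = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                                            nn.Linear(64, 1))         # produces a_theta
        self.speed_head = nn.Linear(uav_dim + 1, 1)                    # W_l

    def forward(self, s_uav, s_users, obstacle_positions):
        # High-level planner: attention over user states, then MLP -> target direction
        q = self.q_proj(s_uav).unsqueeze(1)                  # (B, 1, E)
        k = self.k_proj(s_users)                             # (B, N_users, E)
        v = self.v_proj(s_users)
        context, _ = self.attn(q, k, v)                      # attention-weighted features
        a_theta = torch.sigmoid(self.direction_head(context.squeeze(1))) * (math.pi / 2)

        # Low-level controller: speed modulation from the nearest-obstacle distance
        uav_xy = s_uav[:, :2].unsqueeze(1)                   # (B, 1, 2)
        dist = torch.norm(uav_xy - obstacle_positions, dim=-1).min(dim=1, keepdim=True).values
        delta_v = torch.sigmoid(self.speed_head(torch.cat([s_uav, dist], dim=-1)))
        return a_theta, delta_v
```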
The design of the hierarchical action decomposition of the proposed SAC could reduce gradient variance, which is similar to the framework proposed by M. Daniel et al. [33]. In our work, let $\nabla_\phi J$ and $\nabla_\phi J_{\mathrm{HAD}}$ denote the policy gradients under the standard and hierarchical action spaces, respectively. Then,
$\mathrm{Var}(\nabla_\phi J_{\mathrm{HAD}}) \leq \mathrm{Var}(\nabla_\phi J)$
The complete policy gradient for standard SAC is as follows:
$\nabla_\phi J = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi} \left[ \nabla_\phi \log \pi(a \mid s) \cdot Q(s, a) \right]$
Under HAD, the gradient decomposes into the following:
$\nabla_\phi J_{\mathrm{HAD}} = \mathbb{E}\left[ \nabla_{\phi_h} \log \pi_h(\tilde{a}_\theta \mid s)\, Q_h(s, \tilde{a}_\theta) \right] + \mathbb{E}\left[ \nabla_{\phi_l} \log \pi_l(a_v \mid s, \tilde{a}_\theta)\, Q_l(s, a_v) \right]$
Decomposing the policy gradient using the law of total variance gives the following:
$\mathrm{Var}(\nabla J) = \mathrm{Var}\big(\mathbb{E}[\nabla J \mid a_0]\big) + \mathbb{E}\big[\mathrm{Var}(\nabla J \mid a_0)\big]$
$\mathrm{Var}(\nabla J_{\mathrm{HAD}}) = \mathrm{Var}\big(\mathbb{E}[\nabla J_h \mid a_0]\big) + \mathbb{E}\big[\mathrm{Var}(\nabla J_l \mid a_0)\big]$
Hierarchical decomposition minimizes cross-term correlations in the second moment matrix, thereby reducing overall variance.

3.2.2. Uncertainty-Quantified Critic Ensemble

To overcome the challenge of dynamics and partial observability in the environment, the Uncertainty-Quantified Critic Ensemble (UQCE) mechanism is employed, introducing an ensemble of $N = 5$ Q-networks with randomized initializations, which provides robust value estimation [34]. The pessimistic Q-learning target values that incorporate epistemic uncertainty are as follows:
$y_t = r_t + \gamma \min_{i \in \mathcal{K}} Q_{\theta_i}(s, a) - \beta \sigma_Q(s)$
where $\mathcal{K} \subset \{1, \ldots, N\}$ is a random subset ($|\mathcal{K}| = 3$) per update, and $\sigma_Q(s)$ represents the ensemble standard deviation. $\beta$ is dynamically adjusted to increase the penalty in high-uncertainty environments (such as areas with dense obstacles) according to Formula (17) to avoid high-risk actions.
$\sigma_Q(s) = \sqrt{ \dfrac{1}{N} \sum_{i=1}^{N} \big( Q_{\theta_i}(s, a) - \bar{Q}(s, a) \big)^2 }$
The dynamic uncertainty penalty mechanism can adaptively adjust the penalty coefficient, $\beta$, balancing exploration and risk avoidance.
$\beta_{t+1} = \beta_t + \eta_\beta \big( \mathbb{E}_{\mathcal{D}}[\sigma_Q(s)] - \sigma_{\mathrm{target}} \big)$
where $\sigma_{\mathrm{target}}$ is set to 0.2 to maintain moderate conservatism. The Uncertainty-Quantified Critic Ensemble mechanism can address model bias issues in dynamic environments and enhance the robustness of strategies in partially observable scenarios. The ensemble mean $\bar{Q}(s, a) = \frac{1}{N} \sum_{i=1}^{N} Q_{\theta_i}(s, a)$ reduces the estimation variance, while the minimum operator constrains the maximum deviation.
$\mathbb{E}\big[\min_i Q_{\theta_i}\big] \leq Q^{*} \leq \mathbb{E}\big[\max_i Q_{\theta_i}\big]$
When the strategy is updated, the uncertainty penalty term $\beta \sigma_Q(s)$ establishes a probabilistic robust boundary. Assuming the true Q-value $Q^{*}$ and estimation errors $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, the following holds:
$P\left( \min_i Q_{\theta_i} \geq Q^{*} - \beta\sigma \right) \geq 1 - \Phi\big(-\beta\sqrt{N}\big)^{N}$
When $N = 5$ and $\beta = 1.6$, this probability exceeds 95% ($\Phi$ is the standard normal CDF), which can effectively avoid high-risk action choices.
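The pessimistic target computation can be sketched as follows, with the ensemble size, random subset size, and uncertainty penalty following the description above; the critic interfaces are assumed placeholders.

```python
import random
import torch

def uqce_target(q_targets, rewards, next_states, next_actions,
                beta, gamma=0.99, subset_size=3):
    """Pessimistic ensemble target sketch: minimum over a random critic subset
    minus an uncertainty penalty beta * ensemble standard deviation."""
    with torch.no_grad():
        # Q estimates from all N target critics: shape (N, batch)
        q_all = torch.stack([q(next_states, next_actions).squeeze(-1)
                             for q in q_targets])
        sigma_q = q_all.std(dim=0, unbiased=False)             # ensemble standard deviation
        subset = random.sample(range(len(q_targets)), subset_size)
        q_min = q_all[subset].min(dim=0).values                # min over the random subset K
        return rewards + gamma * q_min - beta * sigma_q
```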

3.2.3. Adaptive Entropy Temperature

The temperature parameter $\alpha$ in SAC is crucial for balancing exploration and exploitation. In dynamic environments, a fixed $\alpha$ may not be optimal. We employ an adaptive temperature tuning mechanism that adjusts $\alpha$ based on the system's current state and the complexity of the environment. The adaptive entropy temperature mechanism can dynamically balance exploration and exploitation of the policy, solving the problems of over-exploration at an early stage and under-exploration at a later stage caused by the fixed temperature parameter of traditional SAC [35]. In addition to the uncertainty penalty, the dynamic adjustment principle of the strategy entropy weight $\alpha$ is as follows:
$\alpha_{t+1} = \alpha_t + \eta_\alpha \big( \mathbb{E}_{\mathcal{D}}[\mathcal{H}(\pi_t)] - \mathcal{H}_{\mathrm{target}} \big)$
where $\mathcal{H}_{\mathrm{target}} = \dim(\mathcal{A}) = 2$, corresponding to the two-dimensional action space. Through the coupled adjustment strategy, the update of $\beta$ is associated with the uncertainty of the Q-value, thereby enhancing the conservatism of the algorithm in high-uncertainty observation states. The coupled update of $\beta$ and $\alpha$ forms a dual closed-loop control: during the high-exploration period (when $\alpha$ is large), higher uncertainty penalties ($\beta$ increases) suppress risky exploration; during the high-exploitation period (when $\alpha$ is small), the algorithm reduces the penalty ($\beta$ decreases) and fully exploits the known optimal strategy.
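Both adjustment rules share the same "track a target level" form, as the minimal sketch below shows; the learning rates are assumed values.

```python
import numpy as np

def update_temperatures(alpha, beta, entropy_batch, sigma_q_batch,
                        eta_alpha=1e-3, eta_beta=1e-3,
                        h_target=2.0, sigma_target=0.2):
    """Coupled adaptive updates sketch: alpha tracks the target policy entropy,
    beta tracks the target ensemble uncertainty (learning rates are assumptions)."""
    alpha += eta_alpha * (np.mean(entropy_batch) - h_target)
    beta += eta_beta * (np.mean(sigma_q_batch) - sigma_target)
    return max(alpha, 0.0), max(beta, 0.0)   # keep both coefficients non-negative
```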
The exploratory nature of SAC complements the exploitation focus of genetic algorithms, creating a balanced and holistic decision-making approach. In the context of the hybrid decision framework, the SAC algorithm was chosen due to its unique ability to address the challenges of dynamic and uncertain environments. SAC’s robustness in various settings ensures that it can adapt to different scenarios within the hybrid framework, maintaining high performance despite environmental changes. The integration of the SAC algorithm into the hybrid decision framework represents a strategic innovation, driven by SAC’s strengths in efficient exploration, sample efficiency, and adaptability. These attributes make it an ideal candidate for addressing the complex challenges present in optimizing UAV path planning and task offloading. As a result, SAC forms a critical component of the framework, driving its effectiveness and success in achieving optimal UAV path planning outcomes.
The uncertainty-quantified critic ensemble and adaptive entropy temperature improve the decision performance of the SAC algorithm through dual aspects. At the level of value estimation, the ensemble critic reduces Q-value variance and adapts β -dynamic constraints to estimate bias; at the level of strategic optimization, entropy regulation maintains exploration, while β regulation avoids high-risk areas. Therefore, the improvement in strategies can be decomposed into the following:
$\mathbb{E}[Q_{\mathrm{new}} - Q_{\mathrm{old}}] = \underbrace{\mathbb{E}[\Delta Q_{\mathrm{optim}}]}_{\text{optimized reward}} - \underbrace{\beta\, \mathbb{E}[\Delta \sigma_Q]}_{\text{risk penalty}} + \underbrace{\alpha\, \mathbb{E}[\Delta \mathcal{H}]}_{\text{exploration incentive}}$
The dynamic balance of the above three factors ensures that strategy updates always move towards the Pareto-optimal direction of high returns, low risks, and moderate exploration.
By combining DAGOA with SAC, this study realizes an efficient decision-making mechanism for UAVs when performing MEC missions. In the hybrid decision-making framework, SAC is responsible for rapidly responding to environmental changes and optimizing path planning, while DAGOA improves the overall resource utilization efficiency of the system by evaluating and optimizing mission offloading decisions offline. In the experiments, the hybrid framework is found to outperform the decision-making strategy using SAC alone, as well as the strategy combining a fixed-parameter genetic algorithm and SAC, especially in the face of dynamic user demands and complex environments, improving the performance of the system significantly.

3.3. Integration

Current UAV-assisted mobile edge computing systems usually demand high efficiency in the utilization of computational, communication, and energy resources, so research on optimizing the task offloading decision and path planning for UAVs is of great significance in this field. An efficient decision-making framework is essential for optimizing resource allocation and improving system performance. In this paper, we propose an innovative hybrid decision-making framework combining deep reinforcement learning algorithms and DAGOA, which is designed to achieve the synergistic optimization of UAV path planning and ground-based mission offloading. By assigning the path planning task to deep reinforcement learning algorithms (e.g., SAC and PPO) while the ground task offloading decision is handled via DAGOA, the task execution efficiency and resource utilization of UAVs can be improved. The hybrid framework realizes efficient UAV path-planning and task-offloading decisions, demonstrating its innovation and technical superiority in complex environments.
The convergence of DAGOA is guaranteed by its adaptive mechanism, which prevents premature convergence. The convergence of SAC is ensured by its off-policy nature and entropy regularization. In terms of stability, the novelty of the hybrid decision-making framework proposed in this paper is mainly reflected in the following three aspects.
  • Improved end-to-end decision making: While traditional decision-making frameworks often require the collaborative work of multiple subsystems, this hybrid decision-making framework realizes end-to-end decision making without additional layers. By integrating two subtasks, path planning and task offloading, into a unified framework, the proposed framework is able to consider the constraints and objectives of these two subtasks simultaneously, leading to better global optimization. In this case, this end-to-end hybrid decision-making framework is highly innovative in current academic research.
  • Task decomposition and co-optimization: The framework decomposes, for the first time, the UAV path planning and ground task offloading into two independent tasks for decision-making via the DAGOA algorithm and the DRL algorithm, and this decomposition not only reduces the complexity of the tasks but also achieves overall performance enhancement through the co-optimization.
  • Adaptive mechanism: The adaptive mechanism in DAGOA is innovatively proposed to dynamically adjust the crossover and mutation probabilities according to the evolutionary state of the population, which improves the search efficiency. At the same time, the deep reinforcement learning algorithm realizes the steady updating of strategies through methods such as the strategy gradient, and the two complement each other to jointly improve the decision-making performance of the framework.
The core idea of the hybrid framework proposed in this paper is to make full use of the global search capability of genetic algorithms and the local optimization capability of deep reinforcement learning and to further improve the robustness and decision-making efficiency of the hybrid decision-making framework by realizing adaptive adjustment of the parameters of the DAGOA. DAGOA is an improved genetic algorithm that improves the search efficiency and convergence speed by dynamically adjusting the crossover and mutation probabilities. In this framework, DAGOA is used to solve the ground mission offloading tasks. DAGOA generates and optimizes the mission offloading strategy by simulating natural selection and genetic mechanisms. The innovation is the introduction of an adaptive mechanism, which dynamically adjusts the algorithm parameters according to the evolutionary state of the population, thus accelerating convergence and avoiding local optima. DAGOA dynamically adjusts the fitness of the individuals in the population by simulating the process of natural selection to ensure that suitable mission offloading strategies can be found in complex environments. Specifically, DAGOA evaluates the fitness of individuals in each generation and utilizes a tournament selection strategy to select the optimal individuals for crossover and mutation to generate a new population. In order to avoid converging too quickly early on and thus falling into a local optimum, DAGOA introduces the Early Stop mechanism, which ensures that the genetic operation can be terminated when the fitness has not improved significantly.
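The per-step interaction between the two components can be summarized by the following illustrative Python sketch; the method names (`select_action`, `env.step`, the DAGOA call) are placeholders for the modules described above rather than the exact implementation.

```python
def hybrid_decision_episode(env, sac_agent, dagoa_offload, max_steps=200):
    """One episode of the hybrid framework sketch: SAC picks the continuous
    flight action (direction, speed), DAGOA picks the offloading ratios for the
    ground users, and both are applied jointly to the environment.
    The interfaces shown here are illustrative placeholders."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        flight_action = sac_agent.select_action(state)      # continuous path control
        offload_ratios = dagoa_offload(state)                # task offloading per user
        next_state, reward, done, _ = env.step(flight_action, offload_ratios)
        sac_agent.store(state, flight_action, reward, next_state, done)
        sac_agent.update()                                   # off-policy SAC update
        state = next_state
        total_reward += reward
        if done:
            break
    return total_reward
```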
The integration of DAGOA and SAC not only enhances the autonomous decision-making ability of UAVs in MEC systems but also provides new ideas for research in related fields, demonstrating the potential of combining optimization algorithms with deep learning methods. Future research can further explore the synergistic mechanisms of other optimization algorithms with deep reinforcement learning to expand the applicability and effectiveness of this framework in a wider range of application scenarios.

4. Experimental Design

In this paper, we design a training environment for simulating UAV-assisted mobile edge computing (MEC) systems, especially for the application of the hybrid decision-making framework proposed in this paper for UAV path planning and mission offloading decisions. This environment is able to access deep reinforcement learning algorithms for UAV path planning and DAGOA for the optimization of ground-based task offloading decisions. The design of this environment takes into account a variety of practical factors, introducing users, obstacles, dynamic noise, diverse task types, and multi-dimensional reward mechanisms in the three-dimensional space to simulate the complexity of real application scenarios and to ensure that the trained model has a high practical application value.
The environment is implemented based on the Python 3.10.15 programming language and employs the Box space from the gym library to define the observation and action spaces. The main components of the environment include the UAV, the user, obstacles, and task types. The action state of the UAV is represented by position, velocity, and direction; the user’s position and task requirements change dynamically, and the position of obstacles is randomly generated at each reset.

4.1. Experimental Environment Components

The size of the environment was 5 km × 5 km × 1 km, and the ground users, the airborne UAV, and the airborne obstacles were the three main components of this environment. At each environment reset, the location of each user was randomly generated within a 2D plane of 5 km × 5 km and maintained random motion, with the UAV flying at a fixed altitude, $H$, of 1 km above the ground. The location of each obstacle was randomly generated within the 5 km × 5 km area at a height of 1 km, and the distance between the center of an obstacle and the UAV and the users was ensured to be greater than the size of the obstacle. The task load, $L_j$, for each user was randomly generated between $1 \times 10^6$ and $1 \times 10^7$ bits and reallocated at each environment reset. The task types were categorized into real-time tasks and batch tasks, which were randomly assigned at each environment reset. Detailed parameters of the proposed environment are shown in Table 1:
The roles and main tasks of each entity in this system are shown in Table 2.
The user's initial position $u_j = (u_{jx}, u_{jy})$ is randomly distributed and may move at each time interval. The user wanders randomly on the ground with randomized motion step lengths and directions, subject to boundary conditions. The task size per user is denoted as $L_j$ (in bits). The task type can be divided into real-time and batch (labeled 0 and 1, respectively). The SAC algorithm realizes the path planning for the UAV by controlling the speed, $v$, and direction, $\theta$, of the UAV. The state space defined in this environment is a multidimensional vector containing elements such as the UAV state, user information, obstacle locations, and the historical trajectory of the UAV, and the expression of the complete state space at time $t$ is as follows:
$s^t = \big( s_{\mathrm{uav}}^t,\ s_{\mathrm{user}}^t,\ s_{\mathrm{obstacle}},\ s_{\mathrm{history}}^t \big)$
The UAV state can be represented as follows:
$s_{\mathrm{uav}}^t = (p_x^t, p_y^t, v^t, \theta^t)$
where $p^t = (p_x^t, p_y^t)$ is the location of the UAV, $v^t$ represents the speed of the UAV, and $\theta^t$ represents the direction of the UAV.
The N user states can be expressed as follows:
$s_{\mathrm{user}}^t = \{ (u_j^t, L_j^t, T_j^t) \mid j = 1, 2, \ldots, N \}$
where $u_j^t = (u_{jx}^t, u_{jy}^t)$ is the location, $L_j^t$ is the task size, and $T_j^t$ is the task type of each user.
The M obstacle states can be expressed as follows:
$s_{\mathrm{obstacle}} = \{ (o_{ix}, o_{iy}) \mid i = 1, 2, \ldots, M \}$
where $o_i = (o_{ix}, o_{iy})$ is the location of each obstacle.
The historical trajectory of the UAV within k time intervals in the current episodes can be represented as follows:
$s_{\mathrm{history}}^t = \{ (p_x^{t-k}, p_y^{t-k}) \mid k = 1, 2, \ldots, K \}$
The action space of the SAC algorithm can be represented as follows:
$a = (a_\theta, a_v)$
As a result, the update function of the UAV’s location is determined via the control of the SAC algorithm:
$p_x' = p_x + v' \cdot \cos(\theta'), \qquad p_y' = p_y + v' \cdot \sin(\theta'),$
$v' = \mathrm{clip}(v + a_v,\ v_{\min},\ v_{\max}), \qquad \theta' = \theta + a_\theta.$
The random wandering behavior of ground users can be represented in the following equation:
$u_{jx}' = u_{jx} + S_j \cdot \cos(\phi_j), \qquad u_{jy}' = u_{jy} + S_j \cdot \sin(\phi_j),$
where $S_j$ is the random speed, and $\phi_j$ is the random direction.
The simulation environment proposed in this paper is set up with observation noise and reward noise, where observation noise refers to the introduction of Gaussian noise in acquiring state information to simulate sensor measurement errors, and reward noise refers to the introduction of random noise in the reward function to simulate the uncertainty of the environment.
The observation noise can be denoted as follows:
$\tilde{p}_x = p_x + \mathcal{N}(0, \sigma_p^2), \qquad \tilde{p}_y = p_y + \mathcal{N}(0, \sigma_p^2).$
The reward noise can be denoted as follows:
$N_{\mathrm{reward}} \sim \mathcal{N}(0, \sigma_r^2)$
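The following simplified Python sketch shows how the state vector, kinematic updates, and noise terms described above fit together; task generation and the reward computation are omitted or assumed, and all numeric values are illustrative placeholders rather than the settings of Table 1.

```python
import numpy as np

class UavMecEnvSketch:
    """Simplified sketch of the simulation components in Section 4.1:
    UAV kinematics driven by (a_theta, a_v), random user wandering, and
    Gaussian observation/reward noise. Values here are illustrative."""
    def __init__(self, n_users=5, n_obstacles=3, area=5000.0,
                 v_min=1.0, v_max=10.0, sigma_p=5.0, sigma_r=0.1):
        self.n_users, self.n_obstacles, self.area = n_users, n_obstacles, area
        self.v_min, self.v_max = v_min, v_max
        self.sigma_p, self.sigma_r = sigma_p, sigma_r
        self.reset()

    def reset(self):
        self.uav = np.array([self.area / 2, self.area / 2, 1.0, 0.0])  # (p_x, p_y, v, theta)
        self.users = np.random.uniform(0, self.area, size=(self.n_users, 2))
        self.obstacles = np.random.uniform(0, self.area, size=(self.n_obstacles, 2))
        return self._observe()

    def _observe(self):
        noisy_uav = self.uav.copy()
        noisy_uav[:2] += np.random.normal(0, self.sigma_p, size=2)     # observation noise
        return np.concatenate([noisy_uav, self.users.ravel(), self.obstacles.ravel()])

    def step(self, a_theta, a_v, base_reward=0.0):
        # UAV kinematics: update speed and heading, then position
        self.uav[2] = np.clip(self.uav[2] + a_v, self.v_min, self.v_max)
        self.uav[3] += a_theta
        self.uav[0] += self.uav[2] * np.cos(self.uav[3])
        self.uav[1] += self.uav[2] * np.sin(self.uav[3])
        # Random wandering of ground users (random step length and direction)
        steps = np.random.uniform(0, 10, size=self.n_users)
        phis = np.random.uniform(0, 2 * np.pi, size=self.n_users)
        self.users[:, 0] = np.clip(self.users[:, 0] + steps * np.cos(phis), 0, self.area)
        self.users[:, 1] = np.clip(self.users[:, 1] + steps * np.sin(phis), 0, self.area)
        # Reward noise added to whatever reward the system model produces
        reward = base_reward + np.random.normal(0, self.sigma_r)
        return self._observe(), reward, False, {}
```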

4.2. System Model Design

The system model is a core component in this environment for the quantitative description of the task offloading and path-planning processes in UAV-assisted MEC systems. This component evaluates the total energy consumption and total delay of the system by calculating the communication and computation processes between the user device and the UAV, thus supporting the evaluation of the overall performance metrics of the system.
The channel gain, g j , and transmission rate, r j , between the UAV and the ground user are calculated as follows:
$g_j = \dfrac{\beta_0}{H^2 + \lVert u_j - p \rVert^2}$
$r_j = B_0 \log_2\left( 1 + \dfrac{P_{\mathrm{user}} \cdot g_j}{\delta^2} \right)$
where $\beta_0$ is the path loss constant defined according to the environment characteristics and carrier frequency, $H$ is the altitude of the UAV, $P_{\mathrm{user}}$ is the transmission power of the users, and $\delta^2$ is the noise power at the UAV. The communication parameters are shown in Table 3.
Tasks for each user device can be partially processed locally and partially offloaded to the UAV. The task offloading decision, o j , determines the number of tasks that are offloaded to the UAV and processed locally. The number of tasks processed via the UAV can be represented as follows:
$L_{\mathrm{uav},j} = o_j \cdot L_j$
The amount of tasks processed by users can be represented as follows:
$L_{\mathrm{user},j} = (1 - o_j) \cdot L_j$
The time consumption, t up , j , and energy consumption, e up , j , of uploading tasks from users to the UAV can be expressed as follows:
$t_{\mathrm{up},j} = \dfrac{L_{\mathrm{uav},j}}{r_j}, \qquad e_{\mathrm{up},j} = P_{\mathrm{user}} \cdot t_{\mathrm{up},j}$
The time consumption, t local , j , and energy consumption, e local , j , of processing tasks locally at users can be expressed as follows:
$t_{\mathrm{local},j} = \dfrac{L_{\mathrm{user},j} \cdot C_{\mathrm{user}}}{f_{\mathrm{user}}}, \qquad e_{\mathrm{local},j} = K_{\mathrm{user}} \cdot f_{\mathrm{user}}^{3} \cdot t_{\mathrm{local},j}$
where $C_{\mathrm{user}}$ is the number of CPU computation cycles required by a user to process 1 bit of data, $f_{\mathrm{user}}$ is the local computation frequency of each user, and $K_{\mathrm{user}}$ is the capacitance factor of the user CPU. The time consumption, $t_{\mathrm{uav},j}$, and energy consumption, $e_{\mathrm{uav},j}$, of processing tasks remotely at the UAV can be expressed as follows:
$t_{\mathrm{uav},j} = \dfrac{L_{\mathrm{uav},j} \cdot C_{\mathrm{uav}}}{f_{\mathrm{uav}}}, \qquad e_{\mathrm{uav},j} = K_{\mathrm{uav}} \cdot f_{\mathrm{uav}}^{3} \cdot t_{\mathrm{uav},j}$
where $C_{\mathrm{uav}}$ is the number of CPU computation cycles required by the UAV to process 1 bit of data, $f_{\mathrm{uav}}$ is the local computation frequency of the UAV, and $K_{\mathrm{uav}}$ is the capacitance factor of the UAV CPU.
Items in the above equations are shown in Table 4.
The total delay and total energy consumption of each user are denoted as follows:
$t_{\mathrm{load},j} = \max\big( t_{\mathrm{local},j},\ t_{\mathrm{up},j} + t_{\mathrm{uav},j} \big) \times \mathrm{adjustment}(T_j)$
$e_{\mathrm{load},j} = e_{\mathrm{up},j} + e_{\mathrm{local},j} + e_{\mathrm{uav},j}$
where $\mathrm{adjustment}(T_j)$ is the task-type adjustment factor, which is defined as follows:
$\mathrm{adjustment}(T_j) = \begin{cases} 0.8, & \text{if } T_j = \text{Real-Time Task} \\ 1.2, & \text{if } T_j = \text{Batch Task} \end{cases}$
The total delay and total energy consumption of the system are denoted as follows:
$E_{\mathrm{total}} = \sum_{j=1}^{N} e_{\mathrm{load},j}, \qquad D_{\mathrm{total}} = \sum_{j=1}^{N} t_{\mathrm{load},j}$
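The per-user delay and energy computations above can be composed end to end as in the following sketch; the parameter values shown are placeholders for the entries of Tables 3 and 4.

```python
import numpy as np

def system_cost(offload, loads, user_pos, uav_pos, task_types,
                H=1000.0, beta0=1e-3, B0=1e6, P_user=0.1, delta2=1e-10,
                C_user=1000, f_user=1e9, K_user=1e-27,
                C_uav=1000, f_uav=1e10, K_uav=1e-27):
    """Sketch of the system model: channel gain, rate, partial offloading,
    per-user delay/energy, and system totals. Parameter values are placeholders."""
    dist2 = np.sum((user_pos - uav_pos) ** 2, axis=1)          # squared horizontal distance
    g = beta0 / (H ** 2 + dist2)                                # channel gain g_j
    r = B0 * np.log2(1 + P_user * g / delta2)                   # transmission rate r_j

    l_uav = offload * loads                                     # bits offloaded to the UAV
    l_user = (1 - offload) * loads                              # bits processed locally

    t_up = l_uav / r
    e_up = P_user * t_up
    t_local = l_user * C_user / f_user
    e_local = K_user * f_user ** 3 * t_local
    t_uav = l_uav * C_uav / f_uav
    e_uav = K_uav * f_uav ** 3 * t_uav

    adjust = np.where(task_types == 0, 0.8, 1.2)                # real-time vs. batch tasks
    t_load = np.maximum(t_local, t_up + t_uav) * adjust
    e_load = e_up + e_local + e_uav
    return e_load.sum(), t_load.sum()                           # E_total, D_total
```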
In the environment proposed in this paper, the reward function not only includes a weighted sum of energy consumption, delay, and coverage but also introduces a dynamic weight-adjustment mechanism. The innovation of this mechanism lies in its ability to continuously and adaptively optimize weights in order to ensure that the system maintains optimal performance in changing environments. After a fixed time interval, the system re-evaluates the weights. After the weights are adjusted, the environment will renormalize the weights to ensure that the sum is 1. This normalization ensures the stability and consistency of the evaluation function of the system after the weights are adjusted.
The design of the reward function integrates the system energy consumption, delay, and coverage and is defined as follows:
$\mathrm{reward} = \omega_{\mathrm{energy}} E_{\mathrm{total}} + \omega_{\mathrm{delay}} D_{\mathrm{total}} - \omega_{\mathrm{coverage}} C + N_{\mathrm{reward}}$
where the coverage reward C is defined as follows:
$C = \dfrac{ \sum_{j=1}^{N} \mathbb{I}\big( g_j \geq \mathrm{threshold} \big) }{N}$
The adjustment mechanism of the reward factor for each item can be expressed as follows:
$\omega_{\mathrm{energy}}' = \mathrm{clip}(\omega_{\mathrm{energy}} + \Delta\omega_{\mathrm{energy}},\ 0,\ 1),$
$\omega_{\mathrm{delay}}' = \mathrm{clip}(\omega_{\mathrm{delay}} + \Delta\omega_{\mathrm{delay}},\ 0,\ 1),$
$\omega_{\mathrm{coverage}}' = \mathrm{clip}(\omega_{\mathrm{coverage}} + \Delta\omega_{\mathrm{coverage}},\ 0,\ 1).$
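A short sketch of the reward computation and the periodic weight adjustment with renormalization is given below; the weight increments and noise level are assumed inputs rather than the exact values used in the experiments.

```python
import numpy as np

def coverage(channel_gains, threshold):
    """Fraction of users whose channel gain meets the coverage threshold."""
    return float(np.mean(channel_gains >= threshold))

def compute_reward(E_total, D_total, C, weights, sigma_r=0.1):
    """Weighted reward with Gaussian reward noise, following the formula above."""
    w_e, w_d, w_c = weights
    return w_e * E_total + w_d * D_total - w_c * C + np.random.normal(0, sigma_r)

def adjust_weights(weights, deltas):
    """Dynamic weight adjustment sketch: clip each updated weight to [0, 1],
    then renormalize so that the weights sum to 1 (deltas are assumed inputs)."""
    updated = np.clip(np.asarray(weights, dtype=float) + np.asarray(deltas, dtype=float), 0.0, 1.0)
    total = updated.sum()
    return updated / total if total > 0 else np.ones_like(updated) / len(updated)
```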
By regularly and dynamically adjusting the weights, the system is able to autonomously adjust the priority of optimization objectives according to real-time environmental changes and task requirements. The dynamic adjustment mechanism can guide the algorithm to focus on the system delay in the early stage to ensure the system performance and focus on the system energy consumption in the later stage to extend the mission timeframe, and thus, it can improve the practicality of the system, as well as the robustness of the system to environmental changes. This mechanism provides strong support for the optimization of UAV path planning and mission offloading strategies, enabling the system to always pursue the all-around optimization of energy consumption, delay, and coverage under extreme conditions, thus improving the overall performance and efficiency of the system.

5. Results

This paper has conducted a series of experiments based on the described environment, aiming to provide insights into the key metrics and their interrelationships with the hybrid decision-making framework in this environment. In this section, the experimental data are analyzed in detail to reveal the superiority demonstrated by the hybrid decision-making framework proposed in this paper.
The performance baselines selected for the experiment are four representative high-performance deep reinforcement learning algorithms, including PPO, SAC, DDPG, and TD3, and their performance is analyzed by comparing the reward scores of the deep reinforcement learning algorithms with the hybrid decision-making framework in the same environment. In order to reflect the superiority of DAGOA, a hybrid decision framework that combines a genetic algorithm without an adaptive mechanism with a deep reinforcement learning algorithm is also set as a control in the experiment.
The reward scores of each algorithm with the hybrid decision framework based on each algorithm during the same number of steps of training (one million steps) are shown in Figure 2. Each solid line in the figure represents the reward score after smoothing with a sliding window; each dashed line represents the original reward score, and each color represents an algorithm. Among them, the hybrid decision frameworks generally achieved higher scores compared to the deep reinforcement learning algorithms, with the best performance being the hybrid decision framework consisting of DAGOA and SAC (the green solid line in the figure), which had an average score of −1149.23.
The training results of each baseline algorithm and of the hybrid decision frameworks it forms with GA and with DAGOA, following the approach proposed in this paper, are shown in Figure 3. After DAGOA and PPO formed a hybrid decision framework, the score improved by 153.53% compared with PPO alone (Figure 3c), the largest increase among all baseline algorithms, although it still did not reach the highest score achieved in this paper. The hybrid decision-making framework consisting of DAGOA and SAC achieved the highest score among all algorithms, improving on SAC by 41.42% (Figure 3a). The DDPG and TD3 algorithms also gained improvements of 5.76% (Figure 3d) and 43.63% (Figure 3b), respectively, after the hybrid decision framework was applied. This set of experiments demonstrates the effectiveness of the hybrid decision framework and the superior performance of the DAGOA–SAC combination.
PPO stabilizes the training process by limiting the magnitude of policy updates. The significant improvement of PPO in combination with DAGOA indicates that DAGOA effectively explores the policy space and helps PPO escape convergence near locally optimal solutions. The diversity provided by DAGOA allows PPO to jump out of local optima and explore better solutions, but the inherent restriction on PPO's policy updates prevents it from reaching the highest score of the algorithm proposed in this paper. SAC, with its entropy regularization mechanism, maintains a degree of randomness in strategy exploration to avoid premature convergence; combined with the global search capability of DAGOA, it explores the policy space more comprehensively. This combination makes full use of the exploration ability of the genetic algorithm and the stochasticity of SAC, balancing policy diversity against the degree of optimization and thus achieving the highest score. DDPG and TD3, as policy-gradient-based methods, are prone to falling into local optima. The addition of DAGOA introduces new policy variants through genetic operations, which enhances the diversity of the policy search [36]. In particular, the twin-delayed updating in TD3 further mitigates policy estimation bias, producing a more pronounced improvement when combined with DAGOA.
In the complex dynamic environments of MEC-oriented systems, the action space of reinforcement learning is high-dimensional, path planning and task offloading face multiple uncertainties, and a single deep reinforcement learning algorithm may struggle to optimize all objectives simultaneously. DAGOA provides stronger environmental adaptability through its adaptive genetic mechanism, making the decision-making process more flexible and robust. These results demonstrate the effectiveness of the hybrid decision-making framework composed of DAGOA and deep reinforcement learning algorithms; the combination with SAC, in particular, exhibits significant performance advantages in complex dynamic environments.
In UAV-assisted MEC systems, the total energy consumption and total delay increase as the number of ground users grows, but the magnitude of the increase differs across control algorithms. To verify the robustness of the proposed hybrid decision-making framework to the number of users in the MEC system, we conducted multiple sets of experiments with the number of users ranging from 3 to 30, comparing two series of algorithms based on SAC and PPO; the results are shown in Figure 4c. The experiments show that the hybrid decision framework of DAGOA and SAC achieves the best robustness to changes in the number of users, with higher scores than the PPO-based algorithms for every user count. With the hybrid decision framework, the SAC algorithm improves its score at every tested user count except 10, with a maximum improvement of 146.26% in the three-user environment (Figure 4a). As a control, the PPO algorithm also improved its scores for all user counts after the introduction of the DAGOA hybrid decision framework, with improvements ranging from 36.45% to 200.22% (Figure 4b).
Deep reinforcement learning algorithms such as SAC and PPO learn optimal strategies by continuously interacting with the environment when dealing with complex, dynamic settings. However, these algorithms may suffer from insufficient generalization when facing large-scale changes in the number of users. The introduction of DAGOA compensates for this deficiency through its global search capability and adaptive tuning mechanism, improving the robustness of the algorithms in multi-user environments.
Building on the strong exploration capability inherent to the SAC algorithm, the introduction of DAGOA further optimizes its decision-making process, enabling it to find better path planning and task offloading strategies across a wider range of user environments. In the three-user environment in particular, the score improves by as much as 146.26%, demonstrating the effectiveness of the hybrid decision-making framework. After the introduction of DAGOA, the PPO algorithm improves its score for every user count but lags slightly behind SAC in the magnitude of improvement. This may be related to characteristics of PPO itself; for example, it relies more on optimizing the policy with known information and is comparatively weak in global search and adaptive tuning [37]. Nevertheless, the addition of DAGOA still significantly improves the performance of PPO in multi-user environments, suggesting that its enhancement of algorithm performance generalizes across algorithms.
In the simulation environment designed in this paper, the UAV maneuvers autonomously while offloading tasks; Figure 5 shows the resulting trajectories. In Figure 5, the red dot represents the UAV, the blue dots represent ground users, the green dotted line represents the UAV trajectory, the red lines represent task offloading, and the gray areas represent obstacles. The trajectories show that the UAV not only adapts its path to the distribution of users in the mission area but also avoids obstacles. The hybrid decision framework of DAGOA and SAC optimizes the overall energy consumption and delay of the MEC system through real-time UAV control and real-time task offloading while protecting the flight safety and endurance of the UAV, providing a solid foundation for the application of MEC systems.
Figure 5a shows that the hybrid decision-making framework of DAGOA and SAC strives to cover more users while planning the UAV path in a narrow space, safeguarding the overall latency and energy consumption of the system while providing safe and reliable service. Figure 5b shows that the proposed framework is capable not only of long-distance maneuvering but also of hovering according to dynamic demand: in scenarios where users are relatively concentrated and the area is safe, the UAV selects a near-optimal hovering position to keep the overall system latency low. Figure 5c,d show the UAV performing long-distance maneuvers to reach the real-time optimal location; during these maneuvers, it not only avoids obstacles but also handles task offloading for users near its waypoints. The proposed hybrid decision-making framework therefore shows superior performance in dynamic scenarios.
In the SAC algorithm proposed in this article, the uncertainty-quantified critic ensemble (UQCE) and the adaptive entropy temperature produce a synergistic enhancement at the levels of value estimation and policy optimization through rigorous theoretical design. Experiments show that together they reduce the collision rate by up to 15.6% across the static-user, dynamic-user, and dynamic-task scenarios and jointly improve user coverage by 21.1%, providing a robust and efficient decision-making basis for UAV edge computing. Table 5 verifies the effectiveness of the UQCE mechanism: the proposed SAC with UQCE outperforms the baseline SAC in multiple scenarios.
Table 6 shows that the coupled update of α and β avoids the limitations of single-parameter adjustment; in the experiment, the adaptive α mechanism exhibits a significant advantage in the coverage metric.
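For illustration only, the following sketch shows one common way to couple an uncertainty-quantified critic ensemble with SAC's learnable entropy temperature: the ensemble's target Q value is penalized by its standard deviation (weighted by β), and the temperature α is tuned toward a target entropy. The default coefficient values and function names here are assumptions and do not reproduce the exact formulation used in this paper.

```python
import torch

def uncertainty_penalized_target(q_targets: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Conservative ensemble target: mean Q minus beta times the ensemble standard deviation.

    q_targets: shape (ensemble_size, batch), one target-critic Q estimate per ensemble member.
    """
    return q_targets.mean(dim=0) - beta * q_targets.std(dim=0)

def temperature_loss(log_alpha: torch.Tensor, log_probs: torch.Tensor,
                     target_entropy: float) -> torch.Tensor:
    """Standard SAC temperature objective: increase alpha when policy entropy falls below the target."""
    return -(log_alpha * (log_probs + target_entropy).detach()).mean()

# Usage sketch: a gradient step on log_alpha keeps alpha = exp(log_alpha) positive,
# and the penalized ensemble target replaces the usual min(Q1, Q2) in the critic update.
```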
The hybrid framework's training time (an average of 6.2 h for one million steps on an NVIDIA RTX 4090 GPU) is higher than that of simpler baselines such as PPO (5.1 h) because of the dual-algorithm interaction. However, the computational overhead in the inference phase is comparable to that of standalone DRL algorithms (an average of 12 ms per decision), making the framework feasible for real-time deployment.

6. Conclusions

In summary, validating the hybrid decision-making framework composed of the proposed DAGOA and deep reinforcement learning is of great significance for the application of UAV-assisted MEC systems, and the proposed improved SAC algorithm adds further innovativeness. Through interdisciplinary algorithm fusion, dynamic task allocation and optimization, the introduction of adaptive mechanisms, an end-to-end decision-making framework, and the combination of empirical research with theoretical analysis, the framework achieves an efficient, robust, and scalable decision-making process. Adopting the hybrid decision-making framework alleviates the limitations of pure DRL methods in complex dynamic environments and multi-objective optimization and resolves the supply–demand contradiction in resource allocation and the heterogeneity of supply and demand in computational offloading, thereby improving resource utilization and efficiency. By combining DAGOA with SAC or other DRL algorithms, the framework realizes the joint optimization of computational resource allocation, task offloading, and trajectory control on different time scales, improving the overall performance of the system. At the same time, it achieves energy saving and real-time trajectory control of the UAV in the MEC system, prolonging UAV endurance and reducing energy consumption, and provides a new idea and methodology for research and applications in related fields.
In future work, the proposed framework can be tested under varying channel models (e.g., urban versus rural path loss and dynamic interference) and in multi-UAV coordination scenarios. Energy-constrained UAVs with heterogeneous computational capacities in dynamic environments can also be considered to extend the framework. While the current framework supports up to 30 users, scaling to ultra-dense networks (100+ users) will require distributed training strategies, which we also plan to explore in future work.

Author Contributions

Software, Y.Y.; supervision, Y.S.; validation, X.Z. and X.C.; writing—original draft, Y.Y.; writing—review and editing, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hazarika, B.; Singh, K.; Li, C.-P.; Schmeink, A.; Tsang, K.F. RADiT: Resource Allocation in Digital Twin-Driven UAV-Aided Internet of Vehicle Networks. IEEE J. Sel. Areas Commun. 2023, 41, 3369–3385. [Google Scholar] [CrossRef]
  2. Dhingan, D.; Ghosh, S.; Naik, B.B.; Kuila, P. Energy and Delay Efficient Partial Offloading for UAV-assisted MEC Systems using Differential Evolution Algorithm. In Proceedings of the 2023 Third International Conference on Secure Cyber Computing and Communication (ICSCCC), Jalandhar, India, 26–28 May 2023; pp. 415–420. [Google Scholar] [CrossRef]
  3. Rio, A.d.; Jimenez, D.; Serrano, J. Comparative Analysis of A3C and PPO Algorithms in Reinforcement Learning: A Survey on General Environments. IEEE Access 2024, 12, 146795–146806. [Google Scholar] [CrossRef]
  4. Hejres, S.; Mahjoub, A.; Hewahi, N. Routing Approaches used for Electrical Vehicles Navigation: A Survey. Int. J. Comput. Digit. Syst. 2024, 15, 801–819. [Google Scholar] [CrossRef]
  5. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
  6. Yan, H.; Fu, X. Dynamic Path Planning of UAV Based on KF-RRT Algorithm. In Proceedings of the 2023 IEEE 12th Data Driven Control and Learning Systems Conference (DDCLS), Xiangtan, China, 9–11 October 2023; pp. 1348–1352. [Google Scholar] [CrossRef]
  7. Wang, K.; Li, S.; Liang, C.; Chen, Y.; Zhang, F.; Guo, Y. Path Planning for UAV Based on MPSO. In Proceedings of the 2023 International Conference on Computer Science and Automation Technology (CSAT), Shanghai, China, 6–8 July 2023; pp. 447–452. [Google Scholar] [CrossRef]
  8. Baroomi, B.; Myo, T.; Ahmed, M.R.; Al Shibli, A.; Marhaban, M.H.; Kaiser, M.S. Ant Colony Optimization-Based Path Planning for UAV Navigation in Dynamic Environments. In Proceedings of the 2023 7th International Conference on Automation, Control and Robots (ICACR), Kuala Lumpur, Malaysia, 4–6 August 2023; pp. 168–173. [Google Scholar] [CrossRef]
  9. Zhang, H.; Xie, X.; Wei, M.; Wang, X.; Song, D.; Luo, J. An Improved Goal-bias RRT algorithm for Unmanned Aerial Vehicle Path Planning. In Proceedings of the 2024 IEEE International Conference on Mechatronics and Automation (ICMA), Tianjin, China, 22–25 July 2024; pp. 1360–1365. [Google Scholar] [CrossRef]
  10. Zhou, Z.; Xing, X.; Li, Y.; Wang, R. Multi-UAV Path Planning Based on Potential Field Dense Reward in Unknown Environments with Static and Dynamic Obstacles. In Proceedings of the 2023 China Automation Congress (CAC), Chongqing, China, 12–14 November 2023; pp. 1289–1294. [Google Scholar] [CrossRef]
  11. Li, C.; Gan, Y.; Zhang, Y.; Luo, Y. A Cooperative Computation Offloading Strategy With On-Demand Deployment of Multi-UAVs in UAV-Aided Mobile Edge Computing. IEEE Trans. Netw. Serv. Manag. 2024, 21, 2095–2110. [Google Scholar] [CrossRef]
  12. Xiang, K.; He, Y. UAV-Assisted MEC System Considering UAV Trajectory and Task Offloading Strategy. In Proceedings of the 2023 IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 4677–4682. [Google Scholar] [CrossRef]
  13. Gao, Y.; Lu, F.; Wang, P.; Lu, W.; Ding, Y.; Cao, J. Resource Optimization of Secure Data Transmission for UAV-Relay Assisted Maritime MEC System. In Proceedings of the 2023 IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 3345–3350. [Google Scholar] [CrossRef]
  14. Gao, Y.; Liu, S.; Zhang, H.; Zhou, L. Deep Reinforcement Learning-Based Trajectory Optimization and Resource Allocation for Secure UAV-Enabled MEC Networks. In Proceedings of the 2024 IEEE INFOCOM Conference on Computer Communications Workshops (INFOCOM WKSHPS), Vancouver, BC, Canada, 12–15 May 2024; pp. 01–05. [Google Scholar] [CrossRef]
  15. Wang, J.; Sun, H. Joint Resource Allocation and Trajectory Optimization for Computation Offloading in UAV-Enabled Mobile Edge Computing. In Proceedings of the 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), Guangzhou, China, 15–17 March 2024; pp. 302–307. [Google Scholar] [CrossRef]
  16. Zhang, X.; Wang, J.; Wang, B.; Jiang, F. Offloading strategy for UAV-assisted mobile edge computing based on reinforcement learning. In Proceedings of the 2022 IEEE/CIC International Conference on Communications in China (ICCC), Foshan, China, 11–14 August 2022; pp. 702–707. [Google Scholar] [CrossRef]
  17. Zhou, S.; Fei, S.; Feng, Y. Deep Reinforcement Learning based UAV-Assisted Maritime Network Computation Offloading Strategy. In Proceedings of the 2022 IEEE/CIC International Conference on Communications in China (ICCC), Foshan, China, 11–14 August 2022; pp. 890–895. [Google Scholar] [CrossRef]
  18. Yin, R.; Tian, H. Computing Offloading for Energy Conservation in UAV-Assisted Mobile Edge Computing. In Proceedings of the 2024 4th International Conference on Neural Networks, Information and Communication Engineering (NNICE), Guangzhou, China, 5–7 March 2024; pp. 1782–1787. [Google Scholar] [CrossRef]
  19. Han, Z.; Zhou, T.; Xu, T.; Hu, H. Joint User Association and Deployment Optimization for Delay-Minimized UAV-Aided MEC Networks. IEEE Wireless Commun. Lett. 2023, 12, 1791–1795. [Google Scholar] [CrossRef]
  20. Yao, Z.; Wang, H.; Hu, R. Improved Genetic Algorithm and Coding Method for Cooperative Search of UAV Group. In Proceedings of the 2021 36th Youth Academic Annual Conference of Chinese Association of Automation (YAC), Nanchang, China, 19–21 May 2021; pp. 141–144. [Google Scholar] [CrossRef]
  21. Wang, Z.; Wang, B.; Wei, Y.; Liu, P.; Zhang, L. Cooperative Multi-task Assignment of Multiple UAVs with Improved Genetic Algorithm Based on Beetle Antennae Search. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 1605–1610. [Google Scholar] [CrossRef]
  22. Chen, X.; Qi, L. UAV Path Planning Based on The Fusion Algorithm of Genetic and Improved Ant Colony. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 28–30 November 2020; pp. 307–312. [Google Scholar] [CrossRef]
  23. Gao, M.; Liu, Y.; Wei, P. Opposite and Chaos Searching Genetic Algorithm Based for UAV Path Planning. In Proceedings of the 2020 IEEE 6th International Conference on Computer and Communications (ICCC), Chengdu, China, 11–14 December 2020; pp. 2364–2369. [Google Scholar] [CrossRef]
  24. Li, G.; Wang, Y.; Lu, C.; Zhang, Z. Multi-UAV Air Combat Weapon-Target Assignment Based On Genetic Algorithm And Deep Learning. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 28–30 November 2020; pp. 3418–3423. [Google Scholar] [CrossRef]
  25. Su, J.; Qi, J.; Wu, C.; Wang, M.; Guo, J. Multi-UAVs Target Attack Based on Improved Genetic Algorithm. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 1466–1471. [Google Scholar]
  26. Li, L.; Gu, Q.; Liu, L. Research on Path Planning Algorithm for Multi-UAV Maritime Targets Search Based on Genetic Algorithm. In Proceedings of the 2020 IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 14–16 December 2020; pp. 840–843. [Google Scholar] [CrossRef]
  27. Peng, L.; Chen, Y.; Sun, Y.; Huang, Y.; Li, W. An information entropy-driven evolutionary algorithm based on reinforcement learning for many-objective optimization. Expert Syst. Appl. 2024, 238, 122164. [Google Scholar]
  28. Wei, L.; Peng, L.; Sun, B.; Sun, Y.; Huang, Y. Reinforcement learning-based particle swarm optimization with neighborhood differential mutation strategy. Swarm Evol. Comput. 2023, 78, 101274. [Google Scholar]
  29. Hao, H.; Zhou, Y.; Zhang, Z.; Peng, X. Underwater glider motion parameter generation based on structure-optimized deep belief network and BP neural network. Appl. Soft Comput. 2025, 169, 112646. [Google Scholar]
  30. Wang, T.; Peng, X.; Wang, T.; Liu, T.; Xu, D. Automated design of action advising trigger conditions for multiagent reinforcement learning: A genetic programming-based approach. Swarm Evol. Comput. 2024, 85, 101475. [Google Scholar] [CrossRef]
  31. Chethana, S.; Charan, S.S.; Srihitha, V.; Amudha, J. Humanoid Robot Gait Control Using PPO, SAC, and ES Algorithms. In Proceedings of the 2023 4th IEEE Global Conference for Advancement in Technology (GCAT), Bangalore, India, 17–19 December 2023; pp. 1–7. [Google Scholar] [CrossRef]
  32. Shehab, M.; Zaghloul, A.; El-Badawy, A. Low-Level Control of a Quadrotor using Twin Delayed Deep Deterministic Policy Gradient (TD3). In Proceedings of the 2021 18th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 25–27 November 2021; pp. 1–6. [Google Scholar] [CrossRef]
  33. Daniel, M.; Magassouba, A.; Aranda, M.; Lequièvre, L.; Ramón, J.A.C.; Rodriguez, R.I.; Mezouar, Y. Multi Actor-Critic DDPG for Robot Action Space Decomposition: A Framework to Control Large 3D Deformation of Soft Linear Objects. IEEE Robotics Autom. Lett. 2024, 9, 1318–1325. [Google Scholar] [CrossRef]
  34. Kanazawa, T.; Wang, H.; Gupta, C. Distributional Actor-Critic Ensemble for Uncertainty-Aware Continuous Control. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–10. [Google Scholar] [CrossRef]
  35. Wang, Y.; Ni, T. Meta-sac: Auto-tune the entropy temperature of soft actor-critic via metagradient. arXiv 2020, arXiv:2007.01932. [Google Scholar]
  36. Kurunathan, H.; Li, K.; Ni, W.; Tovar, E.; Dressler, F. Deep Reinforcement Learning for Persistent Cruise Control in UAV-aided Data Collection. In Proceedings of the 2021 IEEE 46th Conference on Local Computer Networks (LCN), Edmonton, AB, Canada, 28–30 September 2021; pp. 347–350. [Google Scholar] [CrossRef]
  37. De Almeida, A.G.; Colombini, E.L.; Simões, A.D. Controlling Tiltrotors Unmanned Aerial Vehicles (UAVs) with Deep Reinforcement Learning. In Proceedings of the 2023 Latin American Robotics Symposium (LARS), 2023 Brazilian Symposium on Robotics (SBR), and 2023 Workshop on Robotics in Education (WRE), Salvador, Brazil, 15–17 November 2023; pp. 107–112. [Google Scholar] [CrossRef]
Figure 1. The UAV-assisted MEC system application.
Figure 2. The training results of baseline algorithms and hybrid decision frameworks.
Figure 3. Training results of SAC, DDPG, PPO, TD3, and hybrid decision frameworks.
Figure 4. Rewards of SAC-based and PPO-based algorithms with different users.
Figure 5. The trajectories and offloading decisions of the UAV.
Table 1. Detailed parameters of the proposed environment.
Environment Parameter | Value
Maximum user number | 30
Maximum obstacle number | 10
UAV number | 1
Maximum UAV speed | 10 m/s
Minimum UAV speed | 1 m/s
Maximum UAV acceleration | 3 m/s²
Size | 5 km × 5 km × 1 km
Table 2. Entity roles and tasks.
Entity | Role | Major Tasks
UAV | Mobile edge server for dynamic resource allocation and path planning | Path planning, task offloading, and energy management
Ground Users | Generate computational tasks and request UAV services | Task generation, dynamic mobility, and task offloading
Obstacles | Physical constraints simulating dynamic environmental interference | Path blocking
MEC system | Global optimization of energy, latency, and coverage | Joint optimization and dynamic reward weight adjustment
Table 3. Communication parameters between the UAV and users.
Communication Parameter | Value
Path loss constant β₀ | 1 × 10⁻⁵
Bandwidth | 1 × 10⁷ Hz
δ | 1 × 10⁻⁵ W
P_user | 0.5 W
Table 4. Users and the UAV attributes.
Attribute Parameter | Value
C_user | 800
C_uav | 1 × 10³
f_user | 1 × 10⁹ Hz
f_uav | 3 × 10⁹ Hz
K_user | 1 × 10⁻²⁷
K_uav | 1 × 10⁻²⁸
Table 5. Comparison of collision rates under different environmental disturbances.
Tasks | Baseline SAC | Proposed SAC with UQCE
Static users | 5.2% | 1.1%
Dynamic users | 15.7% | 4.3%
Dynamic tasks | 23.4% | 7.8%
Table 6. Adaptive temperature ablation experiment.
Temperature Regulation Mechanism | User Coverage Rate | Energy Consumption
Static α = 0.2 | 73.5% | 1.43 kJ
Adaptive α | 89.2% | 1.29 kJ
Static α = 0.2 and β = 1.0 | 68.1% | 1.57 kJ
