3. Framework Design
3.1. Design of DAGOA
Recent studies have demonstrated the complementary advantages of combining evolutionary algorithms with deep reinforcement learning. Liang et al. [27] proposed RL-RVEA, which integrates reinforcement learning with reference vector adaptation, showing RL's capability to dynamically adjust search strategies while maintaining EA's global exploration. In the PSO domain, Li et al. [28] developed NRLPSO, which uses Q-learning to guide velocity vector generation and achieves a 61.09% performance improvement over standalone RL. For complex action spaces, Wang et al. [29] employed genetic programming to automatically design trigger conditions in multi-agent RL systems, leveraging EA's structure search ability. These hybrid approaches inspire our framework design: DAGOA inherits GA's global search capability through dynamic mutation rate adaptation and an early stopping mechanism, while SAC with hierarchical action decomposition enables fine-grained policy learning. Compared with existing GA-DRL hybrids [30], our innovation lies in the uncertainty-quantified critic ensemble that overcomes the challenge of dynamic and partial observability in the environment.
The proposed Dynamic Adaptive Genetic Optimization Algorithm (DAGOA) is an advanced evolutionary computation technique that enhances traditional genetic algorithms with adaptivity and dynamic parameter tuning. This section details its operational principles, step-by-step procedure, mathematical foundations, and the innovative mechanisms that contribute to its superior performance on complex optimization problems; an empirical examination of DAGOA follows in the experiment sections.
Genetic algorithms (GAs) are stochastic search methods inspired by the principles of natural evolution and genetics. While effective at exploring large solution spaces, traditional GAs can suffer from premature convergence and inefficiency in dynamic environments. DAGOA addresses these limitations through adaptive mechanisms and early stopping criteria, optimizing the task offloading decisions in user-centric systems. DAGOA dynamically adjusts parameters such as the mutation rate and population size according to the real-time state of the optimization process, which increases its adaptability to complex tasks and dynamic environments. This adaptability helps the algorithm maintain population diversity, thereby avoiding local optima and increasing convergence speed. The integration of early stopping criteria further enhances computational efficiency by terminating the process when only negligible improvements are observed. The following formulation describes the main mechanism of DAGOA as implemented in the framework.
- 1. Initialization: DAGOA begins by initializing a population of candidate solutions, $P = \{x_1, x_2, \ldots, x_N\}$, where $N$ is the number of individuals and each $x_i$ is a vector of decision variables for a specific number of users (user_num), populated with random values drawn from a uniform distribution.
- 2. Fitness evaluation: The fitness of each individual is assessed using a predefined fitness function, which evaluates the quality of the solution based on the state of the system and its parameters. A fitness function, $f(x_i, s, l)$, is defined for each individual to evaluate its offloading decision performance under a specific environment state, $s$, and UAV location, $l$; it combines $E$, $T$, and $C$, which denote the total energy consumption, the total latency, and the coverage, respectively, as described in the experiment design section.
- 3. Selection: DAGOA employs tournament selection to choose individuals for reproduction. This method randomly selects a subset of individuals, and the one with the highest fitness is chosen as a parent, promoting high-quality solutions while maintaining genetic diversity.
- 4. Crossover: Genetic diversity is further enhanced through crossover, where pairs of individuals exchange segments of their decision variables, producing offspring that inherit characteristics from both parents.
- 5. Mutation: The mutation operation introduces random variations in the offspring to explore new regions of the solution space. DAGOA adaptively adjusts the mutation rate, increasing the mutation probability, $P_m$, when the population shows signs of stagnation (i.e., after prolonged periods without improvement) so as to restore population diversity.
- 6. Adaptation and early stopping: The algorithm proceeds through generations, tracking improvements in fitness scores. Early stopping is triggered if the best fitness improvement remains below a defined threshold over several generations, conserving computational resources.
- 7. Extracting the best solution: Upon termination, the algorithm selects the individual with the highest fitness as the final solution.
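To make the above procedure concrete, the following Python sketch outlines the DAGOA loop under simplifying assumptions: the fitness function is treated as a black box that scores a per-user offloading vector, and the parameter values (population size, mutation schedule, patience) are illustrative placeholders rather than the values tuned in our experiments.

```python
import numpy as np

def dagoa(fitness_fn, user_num, pop_size=50, generations=200,
          base_mutation=0.05, patience=20, tol=1e-4, tournament_k=3, rng=None):
    """Minimal sketch of the DAGOA loop described above.

    `fitness_fn` maps an offloading-decision vector (one value in [0, 1]
    per user) to a scalar score; its exact form (energy/latency/coverage
    trade-off) is defined in the experiment design section.
    """
    rng = np.random.default_rng(rng)
    # 1. Initialization: uniform random offloading ratios, one per user.
    pop = rng.uniform(0.0, 1.0, size=(pop_size, user_num))
    best_x, best_f = None, -np.inf
    stall, mutation_rate = 0, base_mutation

    for _ in range(generations):
        # 2. Fitness evaluation.
        fitness = np.array([fitness_fn(ind) for ind in pop])
        gen_best = fitness.max()
        if gen_best > best_f + tol:
            best_f, best_x = gen_best, pop[fitness.argmax()].copy()
            stall, mutation_rate = 0, base_mutation           # reset adaptation
        else:
            stall += 1
            mutation_rate = min(0.5, mutation_rate * 1.5)      # 5. adaptive mutation on stagnation
        # 6. Early stopping when improvement stays negligible.
        if stall >= patience:
            break

        # 3. Tournament selection.
        def select():
            idx = rng.choice(pop_size, tournament_k, replace=False)
            return pop[idx[np.argmax(fitness[idx])]]

        # 4. Single-point crossover followed by 5. mutation.
        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            cut = rng.integers(1, user_num) if user_num > 1 else 0
            child = np.concatenate([p1[:cut], p2[cut:]])
            mask = rng.random(user_num) < mutation_rate
            child[mask] = rng.uniform(0.0, 1.0, mask.sum())
            children.append(child)
        pop = np.asarray(children)

    # 7. Extract the best solution found.
    return best_x, best_f
```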
DAGOA offers several notable advantages, including adaptive mutation, an early stopping mechanism, and tournament selection, which together ensure its robustness and flexibility. Through the above steps, DAGOA optimizes the task offloading problem, dynamically adjusting the parameters of the genetic algorithm to adapt to environmental changes and improving the efficiency of the search. DAGOA represents a significant advancement in the field of evolutionary computation: its adaptive mechanisms and efficient termination criteria make it a powerful tool for handling complex optimization challenges, providing a balance between exploration and exploitation while ensuring computational efficiency.
3.2. Embedded SAC Algorithm
The soft actor–critic (SAC) algorithm is an advanced, model-free, off-policy reinforcement learning method [31]. It was designed to address the exploration–exploitation trade-off by combining entropy maximization with policy learning: SAC learns a stochastic policy that maximizes not only the expected return but also the policy entropy, leading to more robust and exploratory behavior. In this section, we propose enhancements to the SAC algorithm within the hybrid decision-making framework for UAV-assisted mobile edge computing (MEC) systems. The goal is to improve the robustness and decision efficiency of the framework, addressing the challenges of dynamic environments and multi-objective optimization.
As a deep reinforcement learning algorithm based on the maximum-entropy principle, SAC learns efficiently in high-dimensional continuous action spaces. By combining a value function with a policy network, SAC optimizes the policy learning process and adapts to dynamic changes in the environment. In this framework, SAC is responsible for the path planning of the UAV: it interacts directly with the environment, obtains an effective representation of the environment state (including user positions, task types, and obstacle information), and updates its policy in real time to maximize the cumulative reward. In addition to SAC, this paper also investigates how algorithms such as PPO, DDPG, and TD3 [32] can be combined with DAGOA and tests their performance in the same simulation environment, thereby verifying the extensibility of the proposed framework and the superiority of the proposed SAC.
The proposed soft actor–critic (SAC) algorithm forms the core of UAV path planning in our hybrid decision framework. It operates in a continuous action space whose actions correspond to directional control and speed modulation. The state space integrates four critical components, namely the user context, environmental constraints, UAV status, and trajectory history, which are introduced in Section 4.1.
The policy network $\pi_\phi$ and the Q-function ensemble jointly optimize a maximum-entropy objective. On top of the reward maximized in traditional reinforcement learning, this objective introduces the policy entropy, $\mathcal{H}(\pi(\cdot \mid s))$, as a regularization term weighted by the temperature $\alpha$. Maximizing entropy encourages the policy to maintain randomness and avoid premature convergence to local optima.
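For reference, the standard maximum-entropy SAC objective from [31], which matches the description above (the exact equation in the original, including the ensemble indexing, is not reproduced here), can be written as:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big)
= - \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\big[\log \pi(a \mid s_t)\big].
```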
3.2.1. Hierarchical Action Decomposition
Due to the difficulty of exploring the optimized overall control strategy, the action space is decomposed into two coupled sub-spaces to reduce exploration complexity, which can be denoted as a high-level planner and a low-level controller. The high-level planner generates target direction
using attention-weighted state features:
where
computes multi-head attention weights over user states. The low-level controller outputs speed modulation,
, conditioned on obstacle proximity:
where
is a sigmoid function constraining output to
.
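The sketch below illustrates one way such a two-level actor could be structured in PyTorch; the feature dimensions, layer sizes, and the way obstacle proximity is encoded are placeholders, not the architecture used in our implementation.

```python
import torch
import torch.nn as nn

class HierarchicalActor(nn.Module):
    """Illustrative two-level actor; dimensions are hypothetical placeholders."""

    def __init__(self, user_feat_dim=4, obs_feat_dim=2, hidden=128, n_heads=4):
        super().__init__()
        # High-level planner: multi-head attention over per-user state vectors.
        self.user_proj = nn.Linear(user_feat_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.direction_head = nn.Linear(hidden, 1)     # target direction in [-pi, pi]
        # Low-level controller: speed conditioned on obstacle proximity.
        self.speed_head = nn.Sequential(
            nn.Linear(hidden + obs_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),        # sigmoid keeps speed in (0, 1)
        )

    def forward(self, user_states, obstacle_feats):
        # user_states: (batch, n_users, user_feat_dim); obstacle_feats: (batch, obs_feat_dim)
        h = self.user_proj(user_states)
        ctx, _ = self.attn(h, h, h)                    # attention-weighted user context
        ctx = ctx.mean(dim=1)
        theta = torch.tanh(self.direction_head(ctx)) * torch.pi
        v = self.speed_head(torch.cat([ctx, obstacle_feats], dim=-1))
        return theta, v
```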
The hierarchical action decomposition (HAD) of the proposed SAC reduces gradient variance, similar to the framework proposed by M. Daniel et al. [33]. In our work, let $g$ and $g_{\mathrm{HAD}}$ denote the policy gradients under the standard and hierarchical action spaces, respectively. Under HAD, the complete policy gradient of standard SAC decomposes into a direction component and a speed component. Applying the law of total variance to this decomposition shows that hierarchical decomposition minimizes the cross-term correlations in the second-moment matrix, thereby reducing the overall variance.
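A hedged sketch of the variance argument, with $g_\theta$ and $g_v$ denoting the direction and speed components of the hierarchical gradient (the exact statement in the original equations may differ):

```latex
\operatorname{Var}(g) =
\mathbb{E}\big[\operatorname{Var}(g \mid \text{sub-policy})\big]
+ \operatorname{Var}\big(\mathbb{E}[g \mid \text{sub-policy}]\big),
\qquad
\operatorname{Var}(g_{\mathrm{HAD}})
= \operatorname{Var}(g_\theta) + \operatorname{Var}(g_v)
+ 2\operatorname{Cov}(g_\theta, g_v),
```

so that driving the covariance (cross) term toward zero, as the decoupled sub-policies do, lowers the total gradient variance.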
3.2.2. Uncertainty-Quantified Critic Ensemble
To overcome the challenge of dynamic and partial observability in the environment, the Uncertainty-Quantified Critic Ensemble (UQCE) mechanism is employed. It introduces an ensemble of Q-networks with randomized initializations, which provides robust value estimation [34]. The pessimistic Q-learning target incorporates epistemic uncertainty: a random subset of the ensemble is drawn per update, and the target is penalized by the ensemble standard deviation scaled by a penalty coefficient, $\lambda$. According to Formula (17), $\lambda$ is dynamically adjusted to increase the penalty in high-uncertainty environments (such as areas with dense obstacles) so as to avoid high-risk actions. This dynamic uncertainty penalty mechanism adaptively adjusts $\lambda$ to balance exploration and risk avoidance.
The base penalty coefficient is set to 0.2 to maintain moderate conservatism. The UQCE mechanism addresses model-bias issues in dynamic environments and enhances the robustness of the policy in partially observable scenarios: the ensemble mean reduces the estimation variance, while the minimum operator bounds the maximum deviation.
When the policy is updated, the uncertainty penalty term establishes a probabilistic robustness bound. Assuming a Gaussian estimation error around the true Q-value, the probability that the penalized target does not overestimate the true value exceeds 95% for a suitably chosen penalty coefficient (with $\Phi$ denoting the standard normal CDF), which effectively avoids high-risk action choices.
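The following Python sketch shows how such a pessimistic target could be computed from an ensemble of target critics; the subset size, discount factor, and exact penalty form are assumptions and may differ from Formula (17) in the paper.

```python
import torch

def pessimistic_target(rewards, next_q_values, dones, lam, gamma=0.99, subset_size=2):
    """Illustrative pessimistic TD target with an ensemble uncertainty penalty.

    next_q_values: (K, batch) tensor of target-critic estimates at the next
    state/action; `lam` is the (dynamically adjusted) uncertainty penalty.
    """
    K = next_q_values.shape[0]
    idx = torch.randperm(K)[:subset_size]               # random subset of critics per update
    subset_min = next_q_values[idx].min(dim=0).values   # clipped double-Q style minimum
    std = next_q_values.std(dim=0)                      # epistemic uncertainty estimate
    pessimistic_q = subset_min - lam * std              # penalize uncertain states
    return rewards + gamma * (1.0 - dones) * pessimistic_q
```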
3.2.3. Adaptive Entropy Temperature
The temperature parameter $\alpha$ in SAC is crucial for balancing exploration and exploitation. In dynamic environments, a fixed $\alpha$ may not be optimal. We employ an adaptive temperature tuning mechanism that adjusts $\alpha$ based on the system's current state and the complexity of the environment. This adaptive entropy temperature dynamically balances exploration and exploitation, alleviating the over-exploration in early training and the under-exploration in late training caused by the fixed temperature of traditional SAC [35]. In addition to the uncertainty penalty, the entropy weight $\alpha$ is adjusted dynamically against a target entropy corresponding to the two-dimensional action space. Through this coupled adjustment, the update of $\alpha$ is tied to the uncertainty of the Q-value estimates, which makes the algorithm more conservative in high-uncertainty observation states. The coupled update of $\alpha$ and $\lambda$ forms a dual closed-loop control: during high-exploration periods (where $\alpha$ is large), higher uncertainty penalties (larger $\lambda$) suppress risky exploration; during high-exploitation periods (where $\alpha$ is small), the penalty is reduced and the known optimal policy is fully exploited.
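As a point of reference, the standard automatic temperature update from SAC [35] is sketched below; the coupling with the uncertainty penalty $\lambda$ described above would add an extra term whose exact form follows the paper's update rule and is not reproduced here. The learning rate and initialization are placeholders.

```python
import torch

action_dim = 2
target_entropy = -float(action_dim)          # -dim(A) for the 2-D action space
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_probs):
    """log_probs: log pi(a|s) for a batch of actions sampled from the current policy."""
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    alpha_optim.zero_grad()
    alpha_loss.backward()
    alpha_optim.step()
    return log_alpha.exp().item()            # current temperature alpha
```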
The exploratory nature of SAC complements the exploitation focus of genetic algorithms, creating a balanced and holistic decision-making approach. In the context of the hybrid decision framework, the SAC algorithm was chosen due to its unique ability to address the challenges of dynamic and uncertain environments. SAC’s robustness in various settings ensures that it can adapt to different scenarios within the hybrid framework, maintaining high performance despite environmental changes. The integration of the SAC algorithm into the hybrid decision framework represents a strategic innovation, driven by SAC’s strengths in efficient exploration, sample efficiency, and adaptability. These attributes make it an ideal candidate for addressing the complex challenges present in optimizing UAV path planning and task offloading. As a result, SAC forms a critical component of the framework, driving its effectiveness and success in achieving optimal UAV path planning outcomes.
The uncertainty-quantified critic ensemble and the adaptive entropy temperature improve the decision performance of the SAC algorithm on two levels. At the level of value estimation, the critic ensemble reduces the Q-value variance, while the dynamically adjusted $\lambda$ constrains the estimation bias; at the level of policy optimization, entropy regulation maintains exploration, while the $\lambda$-weighted penalty steers the policy away from high-risk regions. The policy improvement can therefore be decomposed into three factors, namely return, risk, and exploration, and the dynamic balance of these three factors ensures that policy updates always move towards the Pareto-optimal direction of high returns, low risk, and moderate exploration.
By combining DAGOA with SAC, this study realizes an efficient decision-making mechanism for UAVs performing MEC missions. In the hybrid decision-making framework, SAC is responsible for rapidly responding to environmental changes and optimizing path planning, while DAGOA improves the overall resource utilization efficiency of the system by evaluating and optimizing mission offloading decisions offline. In the experiments, the hybrid framework outperforms both the strategy using SAC alone and the strategy combining a fixed-parameter genetic algorithm with SAC, improving system performance significantly, especially under dynamic user demands and in complex environments.
3.3. Integration
Current UAV-assisted mobile edge computing systems demand highly efficient use of computational, communication, and energy resources, so research on optimizing task offloading decisions and path planning for UAVs is of great significance in this field. An efficient decision-making framework is essential for optimizing resource allocation and improving system performance. In this paper, we propose an innovative hybrid decision-making framework combining deep reinforcement learning algorithms and DAGOA, which is designed to achieve the synergistic optimization of UAV path planning and ground-based mission offloading. By assigning the path planning task to deep reinforcement learning algorithms (e.g., SAC or PPO) while handling the ground task offloading decision via DAGOA, the task execution efficiency and resource utilization of UAVs can be improved. The hybrid framework realizes efficient UAV path-planning and task-offloading decisions, demonstrating its innovation and technical superiority in complex environments.
The convergence of DAGOA is guaranteed by its adaptive mechanism, which prevents premature convergence, while the convergence of SAC is ensured by its off-policy nature and entropy regularization. In terms of stability, the novelty of the hybrid decision-making framework proposed in this paper is mainly reflected in the following three aspects.
Improved end-to-end decision making: While traditional decision-making frameworks often require the collaborative work of multiple subsystems, the proposed hybrid decision-making framework realizes end-to-end decision making without additional layers. By integrating the two subtasks of path planning and task offloading into a unified framework, it is able to consider the constraints and objectives of both subtasks simultaneously, leading to better global optimization. This end-to-end hybrid decision-making framework is therefore highly innovative in current academic research.
Task decomposition and co-optimization: The framework decomposes, for the first time, UAV path planning and ground task offloading into two independent tasks handled by the DAGOA and DRL algorithms, respectively. This decomposition not only reduces the complexity of each task but also achieves overall performance enhancement through co-optimization.
Adaptive mechanism: The adaptive mechanism in DAGOA is innovatively proposed to dynamically adjust the crossover and mutation probabilities according to the evolutionary state of the population, which improves the search efficiency. At the same time, the deep reinforcement learning algorithm realizes the steady updating of strategies through methods such as the strategy gradient, and the two complement each other to jointly improve the decision-making performance of the framework.
The core idea of the proposed hybrid framework is to make full use of the global search capability of genetic algorithms and the local optimization capability of deep reinforcement learning, and to further improve the robustness and decision-making efficiency of the framework through the adaptive parameter adjustment of DAGOA. DAGOA is an improved genetic algorithm that increases search efficiency and convergence speed by dynamically adjusting the crossover and mutation probabilities. In this framework, DAGOA is used to solve the ground mission offloading tasks: it generates and optimizes the mission offloading strategy by simulating natural selection and genetic mechanisms. The innovation is the introduction of an adaptive mechanism that dynamically adjusts the algorithm parameters according to the evolutionary state of the population, thus accelerating convergence and avoiding local optima. By simulating the process of natural selection, DAGOA ensures that suitable mission offloading strategies can be found in complex environments. Specifically, DAGOA evaluates the fitness of individuals in each generation and uses a tournament selection strategy to select the best individuals for crossover and mutation, generating a new population. To avoid converging too quickly early on and falling into a local optimum, DAGOA introduces an early stopping mechanism, which terminates the genetic operation when the fitness has not improved significantly.
The integration of DAGOA and SAC not only enhances the autonomous decision-making ability of UAVs in MEC systems but also provides new ideas for research in related fields, demonstrating the potential of combining optimization algorithms with deep learning methods. Future research can further explore the synergistic mechanisms of other optimization algorithms with deep reinforcement learning to expand the applicability and effectiveness of this framework in a wider range of application scenarios.
4. Experimental Design
In this paper, we design a training environment for simulating UAV-assisted mobile edge computing (MEC) systems, tailored to the proposed hybrid decision-making framework for UAV path planning and mission offloading decisions. The environment can interface with deep reinforcement learning algorithms for UAV path planning and with DAGOA for optimizing ground-based task offloading decisions. Its design takes into account a variety of practical factors, introducing users, obstacles, dynamic noise, diverse task types, and multi-dimensional reward mechanisms in three-dimensional space to simulate the complexity of real application scenarios and to ensure that the trained model has high practical value.
The environment is implemented in Python 3.10.15 and uses the Box space from the gym library to define the observation and action spaces. The main components of the environment are the UAV, the users, the obstacles, and the task types. The state of the UAV is represented by its position, velocity, and direction; the users' positions and task requirements change dynamically, and the positions of obstacles are randomly generated at each reset.
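For illustration, a minimal sketch of how such spaces could be declared with gym's Box is given below; the per-entity feature counts (and hence the observation dimension) are assumptions based on the state description in Section 4.1, not the exact layout used in our environment.

```python
import numpy as np
from gym import spaces

# Assumed feature layout: UAV state (4) + per-user state (4 each) +
# per-obstacle position (2 each) + trajectory history (2 per step).
N, M, k = 5, 3, 10                     # users, obstacles, history steps (placeholders)
obs_dim = 4 + 4 * N + 2 * M + 2 * k
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32)
action_space = spaces.Box(low=np.array([-1.0, 0.0], dtype=np.float32),   # direction, speed
                          high=np.array([1.0, 1.0], dtype=np.float32))
```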
4.1. Experimental Environment Components
The size of the environment is 5 km × 5 km × 1 km, and the ground users, the airborne UAV, and the airborne obstacles are its three main components. At each environment reset, the users' locations are randomly generated within a 2D plane of 5 km × 5 km, and the users then move randomly, while the UAV flies at a fixed altitude, H, of 1 km above the ground. The obstacle locations are randomly generated within the 5 km × 5 km area at a height of 1 km, and the distance between the center of each obstacle and the UAV and users is ensured to be greater than the size of the obstacle. The task load of each user is randomly generated within a predefined range and reallocated at each environment reset. Task types are categorized into real-time tasks and batch tasks, which are randomly assigned at each environment reset. Detailed parameters of the proposed environment are shown in Table 1. The roles and main tasks of each entity in this system are shown in Table 2.
The users' initial positions are randomly distributed and may change at each time interval: each user wanders randomly on the ground with randomized step lengths and directions, subject to boundary conditions. The task size of each user is denoted by $D_i$ (in bits). Task types are divided into real-time and batch tasks (labeled 0 and 1, respectively). The SAC algorithm performs path planning for the UAV by controlling its speed, $v$, and direction, $\theta$. The state space defined in this environment is a multidimensional vector containing the UAV state, user information, obstacle locations, and the historical trajectory of the UAV; the complete state at time $t$ is composed of the four components described below.
The UAV state can be represented as $s_{\mathrm{UAV}} = [x_u, y_u, v, \theta]$, where $(x_u, y_u)$ is the location of the UAV, $v$ is its speed, and $\theta$ is its direction.
The $N$ user states can be expressed as $s_{\mathrm{user}} = \{[x_i, y_i, D_i, \tau_i]\}_{i=1}^{N}$, where $(x_i, y_i)$ is the location, $D_i$ is the task size, and $\tau_i$ is the task type of each user.
The $M$ obstacle states can be expressed as $s_{\mathrm{obs}} = \{[x_j^{o}, y_j^{o}]\}_{j=1}^{M}$, where $(x_j^{o}, y_j^{o})$ is the location of each obstacle.
The historical trajectory of the UAV within the last $k$ time intervals of the current episode can be represented as $s_{\mathrm{traj}} = [(x_u^{t-k+1}, y_u^{t-k+1}), \ldots, (x_u^{t}, y_u^{t})]$.
The action space of the SAC algorithm consists of the direction and speed commands, $a_t = [\theta_t, v_t]$. As a result, the UAV's location is updated under the control of the SAC algorithm through the chosen speed and direction at each time step. The random wandering behavior of the ground users is governed analogously by a random speed, $v_i^{t}$, and a random direction, $\theta_i^{t}$, drawn at each time step.
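A hedged reconstruction of the kinematic updates implied by these definitions, assuming a simple constant-velocity step over each interval $\Delta t$ (the original equations may differ in detail, e.g., in how boundaries are handled):

```latex
\begin{aligned}
x_u^{t+1} &= x_u^{t} + v_t \,\Delta t \cos\theta_t, &\qquad
y_u^{t+1} &= y_u^{t} + v_t \,\Delta t \sin\theta_t, \\
x_i^{t+1} &= x_i^{t} + v_i^{t} \,\Delta t \cos\theta_i^{t}, &\qquad
y_i^{t+1} &= y_i^{t} + v_i^{t} \,\Delta t \sin\theta_i^{t},
\end{aligned}
```

with positions clipped to the 5 km × 5 km area.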
The simulation environment is also set up with observation noise and reward noise. Observation noise introduces Gaussian noise into the acquired state information to simulate sensor measurement errors and can be denoted as $\tilde{s}_t = s_t + \epsilon_{\mathrm{obs}}$ with $\epsilon_{\mathrm{obs}} \sim \mathcal{N}(0, \sigma_{\mathrm{obs}}^2 I)$. Reward noise introduces a random perturbation into the reward function to simulate the uncertainty of the environment and can be denoted as $\tilde{r}_t = r_t + \epsilon_{r}$.
4.2. System Model Design
The system model is a core component in this environment for the quantitative description of the task offloading and path-planning processes in UAV-assisted MEC systems. This component evaluates the total energy consumption and total delay of the system by calculating the communication and computation processes between the user device and the UAV, thus supporting the evaluation of the overall performance metrics of the system.
The channel gain, $h_i$, and transmission rate, $R_i$, between the UAV and ground user $i$ are calculated from the air-to-ground channel model, where $\beta_0$ is the path-loss constant defined according to the environment characteristics and carrier frequency, $H$ is the altitude of the UAV, $p_i$ is the transmission power of the users, and $\sigma^2$ is the noise power at the UAV. The communication parameters are shown in Table 3.
The tasks of each user device can be partially processed locally and partially offloaded to the UAV. The task offloading decision, $\rho_i \in [0, 1]$, determines how the tasks are split between the UAV and local processing: the amount of tasks processed by the UAV is $\rho_i D_i$, and the amount processed locally by the user is $(1 - \rho_i) D_i$.
The time consumption, $T_i^{\mathrm{up}}$, and energy consumption, $E_i^{\mathrm{up}}$, of uploading tasks from a user to the UAV follow from the transmission rate and the user's transmission power. The time consumption, $T_i^{\mathrm{loc}}$, and energy consumption, $E_i^{\mathrm{loc}}$, of processing tasks locally at a user depend on $c_u$, the number of CPU cycles required by a user to process 1 bit of data, $f_u$, the local computation frequency of each user, and $\kappa_u$, the capacitance factor of the user CPU. Likewise, the time consumption, $T_i^{\mathrm{uav}}$, and energy consumption, $E_i^{\mathrm{uav}}$, of processing tasks remotely at the UAV depend on $c_v$, the number of CPU cycles required by the UAV to process 1 bit of data, $f_v$, the computation frequency of the UAV, and $\kappa_v$, the capacitance factor of the UAV CPU. The notation used in these definitions is summarized in Table 4.
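For concreteness, a hedged reconstruction of the standard MEC communication and computation model consistent with the definitions above is given below (with $B$ the channel bandwidth and $d_i$ the horizontal UAV-user distance); the exact equations and constants used in the paper may differ:

```latex
\begin{aligned}
h_i &= \frac{\beta_0}{H^2 + d_i^2}, &
R_i &= B \log_2\!\Big(1 + \frac{p_i h_i}{\sigma^2}\Big), \\
T_i^{\mathrm{up}} &= \frac{\rho_i D_i}{R_i}, &
E_i^{\mathrm{up}} &= p_i \, T_i^{\mathrm{up}}, \\
T_i^{\mathrm{loc}} &= \frac{c_u (1-\rho_i) D_i}{f_u}, &
E_i^{\mathrm{loc}} &= \kappa_u \, c_u (1-\rho_i) D_i \, f_u^2, \\
T_i^{\mathrm{uav}} &= \frac{c_v \rho_i D_i}{f_v}, &
E_i^{\mathrm{uav}} &= \kappa_v \, c_v \rho_i D_i \, f_v^2 .
\end{aligned}
```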
The total delay and total energy consumption of each user combine the uploading, local processing, and remote processing components, scaled by a task-type adjustment factor whose value depends on whether the task is a real-time or a batch task. Summing over all users yields the total delay, $T$, and total energy consumption, $E$, of the system.
In the environment proposed in this paper, the reward function not only includes a weighted sum of energy consumption, delay, and coverage but also introduces a dynamic weight-adjustment mechanism. The innovation of this mechanism lies in its ability to continuously and adaptively optimize weights in order to ensure that the system maintains optimal performance in changing environments. After a fixed time interval, the system re-evaluates the weights. After the weights are adjusted, the environment will renormalize the weights to ensure that the sum is 1. This normalization ensures the stability and consistency of the evaluation function of the system after the weights are adjusted.
The reward function integrates the system energy consumption, delay, and coverage as a weighted sum in which the energy consumption and delay enter as penalties and the coverage reward $C$ enters as a bonus. The weight of each term is adjusted at regular intervals by the dynamic mechanism described above and then renormalized.
By regularly and dynamically adjusting the weights, the system is able to autonomously adjust the priority of optimization objectives according to real-time environmental changes and task requirements. The dynamic adjustment mechanism can guide the algorithm to focus on the system delay in the early stage to ensure the system performance and focus on the system energy consumption in the later stage to extend the mission timeframe, and thus, it can improve the practicality of the system, as well as the robustness of the system to environmental changes. This mechanism provides strong support for the optimization of UAV path planning and mission offloading strategies, enabling the system to always pursue the all-around optimization of energy consumption, delay, and coverage under extreme conditions, thus improving the overall performance and efficiency of the system.
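The following Python sketch illustrates one possible implementation of such a reward with periodically renormalized weights; the initial weight values and the adjustment rule are placeholders rather than the ones used in the experiments.

```python
import numpy as np

weights = np.array([0.4, 0.4, 0.2])   # (energy, delay, coverage); illustrative initial values

def reward(energy_total, delay_total, coverage):
    """Weighted sum: energy and delay are penalized, coverage is rewarded."""
    w_e, w_t, w_c = weights
    return -(w_e * energy_total + w_t * delay_total) + w_c * coverage

def adjust_weights(delta):
    """delta: additive adjustments produced by the environment every fixed interval."""
    global weights
    weights = np.clip(weights + delta, 1e-3, None)
    weights = weights / weights.sum()   # renormalize so the weights sum to 1
```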
5. Results
This paper conducts a series of experiments in the described environment, aiming to provide insights into the key metrics and their relationship with the hybrid decision-making framework. In this section, the experimental data are analyzed in detail to reveal the superiority of the hybrid decision-making framework proposed in this paper.
The performance baselines selected for the experiment are four representative high-performance deep reinforcement learning algorithms, including PPO, SAC, DDPG, and TD3, and their performance is analyzed by comparing the reward scores of the deep reinforcement learning algorithms with the hybrid decision-making framework in the same environment. In order to reflect the superiority of DAGOA, a hybrid decision framework that combines a genetic algorithm without an adaptive mechanism with a deep reinforcement learning algorithm is also set as a control in the experiment.
The reward scores of each algorithm and of the hybrid decision framework based on each algorithm over the same number of training steps (one million steps) are shown in Figure 2. Each solid line in the figure represents the reward score after smoothing with a sliding window, each dashed line represents the original reward score, and each color represents an algorithm. The hybrid decision frameworks generally achieved higher scores than the standalone deep reinforcement learning algorithms, with the best performance achieved by the hybrid decision framework consisting of DAGOA and SAC (the green solid line in the figure), which had an average score of −1149.23.
The training performance of each baseline algorithm and of its hybrid decision framework formed with GA and DAGOA according to the proposed approach is shown in Figure 3. After DAGOA and PPO formed the hybrid decision framework, the score improved by 153.53% compared to PPO alone in Figure 3c, the largest increase among all baseline algorithms, but it still failed to reach the highest score of the algorithm proposed in this paper. The hybrid decision-making framework consisting of DAGOA and SAC achieved the highest score among all algorithms, improving the score by 41.42% compared to SAC in Figure 3a. The DDPG and TD3 algorithms also gained improvements of 5.76% in Figure 3d and 43.63% in Figure 3b, respectively, after the hybrid decision framework was applied. This set of experiments demonstrates the effectiveness of the hybrid decision framework and the superior performance of the DAGOA and SAC combination.
PPO stabilizes the training process by limiting the magnitude of policy updates. The significant improvement in the score of PPO in combination with DAGOA indicates that DAGOA effectively explores the policy space and overcomes PPO's tendency to converge near locally optimal solutions. The diversity provided by DAGOA allows PPO to escape local optima and explore better solutions, but due to the inherent limitation on its policy updates, PPO still fails to reach the highest score of the algorithm proposed in this paper. SAC, with its entropy regularization mechanism, maintains a certain degree of randomness in strategy exploration to avoid premature convergence, and combined with the global search capability of DAGOA, it explores the strategy space more comprehensively. This combination makes full use of the exploration ability of the genetic algorithm and the randomness of SAC, balancing strategy diversity and optimization, and thus achieves the highest score. DDPG and TD3, as policy-gradient-based methods, are susceptible to falling into local optima. The addition of DAGOA introduces new policy variants through genetic operations, which enhances the diversity of the policy search [36]. In particular, the double-delayed updating introduced in TD3 further mitigates the policy estimation bias, which produces a more significant enhancement in combination with DAGOA.
In the complex dynamic environments of MEC-oriented systems, the action space of reinforcement learning is high-dimensional, path planning and task offloading face multiple uncertainties, and a single deep reinforcement learning algorithm may struggle to optimize all the objectives at the same time. DAGOA provides a stronger ability to adapt to the environment through its adaptive genetic mechanism, making the decision-making process more flexible and robust. These results fully demonstrate the effectiveness of the hybrid decision-making framework composed of DAGOA and deep reinforcement learning algorithms, which, especially in combination with SAC, exhibits significant performance advantages in complex dynamic environments.
In UAV-assisted MEC systems, the total energy consumption and total delay of the system increase as the number of ground users increases, but the magnitude of the increase differs among systems controlled by different algorithms. To verify the robustness of the proposed hybrid decision-making framework against the number of users in the MEC system, we conducted multiple sets of experiments with the number of users ranging from 3 to 30, comparing two series of algorithms based on SAC and PPO; the results are shown in Figure 4c. The experimental results show that the hybrid decision framework of DAGOA and SAC achieves the best robustness to changes in the number of users, with higher scores than the PPO-based algorithms for all numbers of users. With the introduction of the hybrid decision framework, the SAC algorithm improves its scores for all tested numbers of users except 10, with a maximum improvement of 146.26% in the three-user environment in Figure 4a. As a control, the PPO algorithm also improves its scores for all numbers of users after the introduction of the DAGOA hybrid decision framework, with improvements ranging between 36.45% and 200.22% in Figure 4b.
Deep reinforcement learning algorithms such as SAC and PPO learn optimal strategies by continuously interacting with the environment when dealing with complex, dynamic environments. However, these algorithms may encounter the problem of insufficient generalization ability when facing large-scale changes in the number of users. The introduction of the DAGOA compensates for this deficiency through its global search capability and adaptive tuning mechanism, which improves the algorithm’s robustness in multi-user environments.
Building on the strong exploration capability inherent to the SAC algorithm, the introduction of DAGOA further optimizes its decision-making process, enabling the algorithm to find better path-planning and task-offloading strategies in environments with more users. In the three-user environment in particular, the score improves by as much as 146.26%, demonstrating the effectiveness of the hybrid decision-making framework. The PPO algorithm, after the introduction of DAGOA, lags slightly behind SAC in terms of improvement, although its score improves for every number of users. This may be related to the characteristics of the PPO algorithm itself; for example, it focuses more on strategy optimization using known information and is relatively weak in global search and adaptive tuning [37]. Nevertheless, the addition of DAGOA still significantly improves the performance of PPO in a multi-user environment, suggesting that its enhancement of algorithm performance generalizes across algorithms.
In the simulation environment designed in this paper, the UAV performs autonomous maneuvering along with mission offloading, and Figure 5 shows the trajectory of the autonomous maneuvering of the UAV with mission offloading. In Figure 5, the red dot represents the UAV, the blue dots represent the ground users, the green dotted line represents the trajectory of the UAV, the red lines represent the UAV mission offloading, and the gray areas represent the obstacles. The trajectories in the figure show that the UAV not only optimizes its trajectory for the distribution of users in the mission area but also avoids obstacles. The hybrid decision framework of the DAGOA and SAC algorithms optimizes the overall energy consumption and overall delay of the MEC system through real-time control of the UAV and real-time mission offloading while protecting the flight safety and endurance of the UAV, providing a solid application foundation for MEC systems.
Figure 5a demonstrates that the hybrid decision-making framework of DAGOA and SAC strives to cover more users while performing path planning for UAVs in a narrow space, which not only safeguards the overall latency and energy consumption of the system but also realizes safe and reliable services.
Figure 5b demonstrates that the hybrid decision-making framework proposed in this paper is not only capable of long-distance maneuvering but also enables the UAV to hover according to the dynamic demand, and in scenarios where the users are relatively concentrated and safe, the UAV is able to select the relatively optimal position for hovering to maintain a low overall system latency.
Figure 5c,d show the process of long-distance maneuvering by the UAV to find the real-time optimal location, in which the maneuvering not only avoids obstacles but also takes into account the task offloading demands in the vicinity of the waypoints. The hybrid decision-making framework proposed in this paper therefore shows superior performance in dynamic scenarios.
In the SAC algorithm proposed in this article, the Uncertainty-Quantified Critic Ensemble and adaptive entropy temperature form a synergistic enhancement effect at the level of value estimation and strategy optimization through rigorous theoretical design. Experiments show that the two can reduce the collision rate by 15.6% in static user, dynamic user, and dynamic task scenarios and jointly improve user coverage by 21.1%, providing a robust and efficient decision-making basis for UAV edge computing scenarios.
Table 5 verifies the effectiveness of the UQCE mechanism: the proposed SAC with UQCE performs better than the baseline SAC in multiple scenarios. Table 6 indicates that the coupled update of $\alpha$ and $\lambda$ avoids the limitations of adjusting a single parameter; the adaptive $\alpha$ mechanism exhibits significant advantages in the coverage metric in the experiment.
The hybrid framework's training time (6.2 h on average for one million steps on an NVIDIA RTX 4090 GPU) is higher than that of simpler baselines such as PPO (5.1 h) due to the dual-algorithm interaction. However, its inference-phase computational overhead is comparable to that of standalone DRL algorithms (12 ms per decision on average), making it feasible for real-time deployment.
6. Conclusions
In summary, validating the hybrid decision-making framework consisting of the proposed DAGOA and deep reinforcement learning is of great significance for the application of UAV-assisted MEC systems, and the proposed improved SAC algorithm also demonstrates significant innovation. Through interdisciplinary algorithm fusion, dynamic task allocation and optimization, the introduction of adaptive mechanisms, an end-to-end decision-making framework, and the combination of empirical research and theoretical analysis, the framework achieves an efficient, robust, and scalable decision-making process. By adopting the hybrid decision-making framework, the limitations of pure DRL methods in dealing with complex dynamic environments and multi-objective optimization problems can be effectively alleviated, and the mismatch between supply and demand in resource allocation and the heterogeneity of computational offloading can be resolved, improving resource utilization and efficiency. Through the combination of DAGOA with SAC or other DRL algorithms, the framework realizes the joint optimization of computational resource allocation, task offloading, and trajectory control on different time scales, improving the overall performance of the system. At the same time, it enables energy-saving and real-time trajectory control of UAVs in the MEC system, prolonging the UAVs' endurance and reducing energy consumption, which provides new ideas and methodology for research and applications in related fields.
In future work, the proposed framework can be tested under varying channel models (e.g., urban vs. rural path loss and dynamic interference) and in multi-UAV coordination scenarios. Energy-constrained UAVs with heterogeneous computational capacities in dynamic environments can also be considered when extending this framework. While the current framework supports up to 30 users, scaling to ultra-dense networks (100+ users) will require distributed training strategies, which we also plan to explore in future work.