1. Introduction
Unmanned Aerial Vehicle (UAV) technology has made tremendous progress in recent years, with an increasingly wide range of applications, from military reconnaissance [1] to logistics and distribution [2], and from agriculture and plant protection [3] to disaster relief [4]. The development of UAV technology has provided efficient and economical solutions for these industries [5]. UAVs are popular for monitoring complex terrain due to their high flexibility, mobility, and low deployment costs [6]. Simultaneously, utilizing multiple UAVs to explore target areas and establish a communication system for information sharing can significantly enhance exploration efficiency, reduce overall costs, and minimize the impact of individual UAV failures on terrain exploration. However, limited resources make informative path planning (IPP) a critical issue.
IPP methods include grid-based search, potential field methods, and heuristic algorithms such as A* [7]. Grid-based search ensures thorough coverage by dividing the area into a grid and systematically exploring each cell, but it is computationally expensive and inefficient for large areas. Potential field methods create a virtual field where UAVs are attracted to the target and repelled by obstacles, resulting in smooth paths; however, they can become trapped in local minima. Heuristic algorithms such as A* provide optimal paths by evaluating multiple potential routes and selecting the best one based on cost. Still, they struggle with dynamic and complex scenarios where environmental conditions constantly change.
Reinforcement learning (RL) [8] has emerged as a promising solution to address these limitations. RL enables UAVs to learn and adapt to dynamic environments by interacting with the environment and receiving feedback on their actions. This learning process allows for the development of optimal policies that improve over time, providing robust solutions to complex scenarios. However, traditional RL methods are primarily designed for single-agent systems and face significant challenges when extended to multi-agent scenarios, such as scalability and coordination among agents.
To overcome these challenges, multi-agent reinforcement learning (MARL) was developed [9,10], which extends RL to multiple interacting agents. MARL enables UAVs to handle dynamic and uncertain environments collaboratively by learning and adapting through interaction with the environment and each other, allowing for the development of optimal policies that improve over time and provide robust solutions to complex scenarios. This makes it particularly powerful in situations where traditional methods fall short. Additionally, MARL facilitates coordination among multiple UAVs, enhancing their collective performance. However, despite these advantages, MARL approaches can be computationally intensive and may suffer from efficiency and scalability issues when applied to large-scale multi-UAV systems.
In our method, images captured by the UAVs are first processed through a combination of Autoencoders (AEs) and Principal Component Analysis (PCA) as part of the Adaptive Dimensionality Reduction (ADR) framework. This preprocessing step significantly reduces the computational load by simplifying high-dimensional input data while preserving essential features. AEs are primarily used for feature extraction and handling complex data, while PCA provides an efficient means of further reducing the dimensionality, particularly in less complex scenarios. The output from the ADR framework is then fed into the actor network of the MARL framework, which is responsible for decision-making and path planning. Furthermore, we deploy communication modules that allow UAVs to share information and coordinate their actions effectively. This integrated approach not only enhances the overall efficiency of the system but also improves the performance and robustness of UAV navigation in complex and dynamic environments.
This paper focuses on active data collection using a team of UAVs for terrain monitoring scenarios. The objective is to create a map of an initially unknown, non-homogeneous binary target variable on a 2D terrain, such as identifying crop infestations in an agricultural setting or locating victims in a disaster scenario, utilizing image measurements captured by the UAVs.
Figure 1 illustrates the UAV working scenario more intuitively. We address the challenge of multi-agent IPP, where we design information-rich paths for the UAVs to cooperatively gather sensor data while adhering to energy, time, or distance constraints. The goal is to enable the UAVs to dynamically monitor the terrain, concentrating on areas of interest with high information value.
The main contribution of this paper is the introduction of a novel multi-agent deep reinforcement learning-based IPP approach for adaptive terrain monitoring using UAV teams. Our approach supports decentralized on-board decision-making and achieves cooperative 3D path planning with variable team sizes. Our main contributions include the following:
1. Markov Modeling and Environment Modeling: We develop a comprehensive Markov model to represent the states of UAVs and the environment dynamics, enabling more accurate predictions and decision-making in complex terrains.
2. ADR framework: We introduce the ADR framework, which integrates AEs and PCA to preprocess image data captured by the UAVs. This dual-method approach allows for efficient dimensionality reduction and feature extraction, enhancing computational efficiency and improving the quality of the input for the reinforcement learning framework.
3. Implementation of Communication Modules: We establish robust communication protocols that allow UAVs to share information and coordinate their actions effectively. This enhances overall system performance, ensuring better coverage and data collection.
Additionally, we address the credit assignment problem in cooperative IPP using counterfactual multi-agent policy gradients (COMA). Our approach significantly improves planning performance and computational efficiency, demonstrating its potential for UAV navigation in complex and dynamic environments.
This article is organized into six sections.
Section 2 reviews the relevant literature, covering UAV IPP methods, RL, MARL, AE, and PCA.
Section 3 details our proposed method, introducing the Markov model, our ADR framework, policy learning model, and inter-agent communication model. Mathematical formulations and figures are used to define our methodology.
Section 4 outlines our experimental setup, including parameter configurations and justifications. We present heat maps of three environments, illustrating the significance of different regions.
Section 5 analyzes results from various scenarios and examines factors contributing to the outcomes.
Section 6 concludes by summarizing the ADR framework’s contributions to MARL in UAV path planning and suggests future research directions.
2. Related Work
Traditional UAV IPP methods include the grid-based search proposed by Yamauchi [11], the potential field methods introduced by Khatib [12], and heuristic algorithms like A* developed by Hart, Nilsson, and Raphael [13]. Grid-based search ensures thorough coverage but is computationally expensive and inefficient for large areas. Potential field methods create smooth paths using virtual forces but often get trapped in local minima. The A* algorithm selects optimal paths by evaluating multiple potential routes but struggles in dynamic and complex scenarios. To address these limitations, LaValle [14] proposed the Rapidly-exploring Random Trees (RRT) algorithm, which efficiently explores large spaces but often produces non-smooth paths. The Particle Swarm Optimization (PSO) method, introduced by Kennedy and Eberhart [15], optimizes flight paths through iterative improvements but requires careful parameter tuning and is computationally intensive. The limitations of these traditional methods highlight the need for more advanced solutions, such as RL, to achieve more effective UAV path planning in dynamic and complex environments.
RL has also been widely applied to UAVs. Wei et al. [16] designed a constrained exploration and exploitation strategy using Q-networks, but this method is limited to single-agent intelligence. Vashisth et al. [17] proposed a dynamically constructed graph restricting agents to local planning actions, allowing for better path exploration; however, it cannot transfer policies to real robots without localization and perception uncertainties. Rueckin et al. [10] combined tree search with an offline-trained neural network, significantly improving information perception capabilities with small data sets. Pirinen et al. [7] introduced a strategy for using UAVs to find unknown target regions based on limited visual cues. Chen et al. [18] developed a graph-based deep RL method for exploration, selecting map frontiers that reduce map uncertainty and travel time; however, their approach is limited to 2D workspaces, whereas we consider 3D planning. Westheimer et al. [19] introduced new network feature representations to effectively learn path planning in a 3D workspace, but their approach faces significant efficiency issues as the amount of data increases.
MARL is a more promising direction for IPP, as it is closer to real-world scenarios and introduces more complex environments and higher requirements. Foerster et al. [20] introduced counterfactual multi-agent policy gradients (COMA) to address the credit assignment problem. Lowe et al. [21] proposed the Multi-Agent Deep Deterministic Policy Gradient (MADDPG), which allows for centralized training and decentralized execution, improving performance in complex environments. Yousef et al. [22] proposed a novel mean-field flight resource allocation optimization method to minimize the Age of Information (AoI) of perceived data. This method successfully optimized the UAV flight trajectories and the data collection scheduling of ground sensors, showing significant improvements compared to DQN. Iqbal and Sha [23] proposed Independent Q-Learning (IQL), which simplifies the training process of multiple agents while maintaining high performance. However, the computational complexity and scalability issues of MARL in large-scale systems still require further research and solutions. To address this, recent studies such as Zhang et al. [24] have proposed a hierarchical MARL method to reduce computational burden and improve scalability through hierarchical decision-making. Additionally, Peng et al. [25] introduced a distributed MARL framework that reduces communication overhead and enhances system resilience, demonstrating potential applications in large-scale multi-agent systems.
In order to improve the training effect of the network, David et al. [26] proposed the autoencoder (AE), based on the backpropagation of error through networks of neuron-like units. Hinton et al. [27] first proposed the use of gradient descent to fine-tune the weights of AE networks, reducing the data dimension to improve the training effect. Although the AE can effectively address insufficient feature extraction and overfitting, it also has some drawbacks, such as long training times and insufficient accuracy. To solve these problems, scholars have conducted research and made several improvements to the AE.
For image processing, the working principle of the AE requires the data to be flattened into a one-dimensional vector for post-processing, which leads to the loss of the image's two-dimensional structural information. Therefore, Ranjan et al. [28] introduced convolutional neural networks to preserve two-dimensional spatial information by replacing the fully connected layers with convolution and pooling operations. Zhao et al. [29] proposed image classification based on DSAE dimensionality reduction, but it suffers from vanishing gradients as the number of convolutional neural network (CNN) layers increases. To this end, they added a residual network module to the 3D CNN and proposed 3DDRN [30]. For samples of various sizes, DSAE can effectively extract low-dimensional features from the original image and perform well in classification. Guo et al. [31] performed feature extraction by adding a CNN to a stacked autoencoder (SAE). This method achieves high accuracy through simple unsupervised pre-training and a single supervised training stage, simplifying the complex calculation process. For noise in images, Revathi et al. [32] proposed a DCNN-based AE method to handle image noise, which effectively improved accuracy.
To further exploit the potential of the AE, Cheng et al. [33] added a regularization term to the hidden neurons of a discriminative stacked autoencoder (DSAE) to bring similar samples closer together in the mapping space. At the same time, this approach can use the regularized optimization objective function to fine-tune the DSAE. Inspired by them, Wang et al. [34,35] studied the relationship between the number of hidden-layer neurons and classification accuracy and found that, for simple images, the gain becomes negligible once the number of neurons approaches the input data dimension. For complex images, however, classification accuracy continues to improve as the number of neurons increases.
PCA is another linear feature projection method used to reduce the dimensionality of data. It simplifies classifier design and reduces the computational burden of pattern recognition by lowering the dimension while retaining most of the relevant features in the data. Qifa et al. [36] pointed out that the traditional singular value decomposition (SVD) is not well suited when there are outliers and missing data in the measurements. To address the sensitivity of PCA to outliers, they applied iteratively reweighted least squares (IRLS) to decompose each element, leading to the development of their L1-PCA approach. Chris et al. [37] used PCA to minimize the sum of squared errors and proposed rotation-invariant L1-norm PCA (R1-PCA). This method guarantees a unique global solution that is rotation invariant. In comparisons, R1-PCA handles outliers more effectively; when extended to K-means clustering, L1-norm K-means performs worse, while the R1 K-means method performs significantly better. However, PCA only describes a good coordinate system for the overall feature distribution and does not consider class separation. Therefore, to improve recognition accuracy, highly separable features must be provided. Matsumura et al. [38] used PCA to analyze EMG data, which effectively improved recognition accuracy and speed.
To further enhance the efficiency of MARL in UAV path planning, we propose an innovative approach incorporating ADR. By preprocessing the image data captured by the UAVs, the ADR module performs dimensionality reduction and feature extraction, significantly reducing the computational burden of high-dimensional input data. The output of the ADR module is then fed into the actor network of the MARL framework, thereby improving overall system efficiency and performance and enhancing the robustness of UAV navigation in complex and dynamic environments.
3. Method
This section presents our proposed method. It covers these key components: the Markov Decision Process modeling, the ADR framework, the policy learning module, the Actor–Critic Network, and the Communication module.
3.1. Markov Decision Process Modeling
To address the problem of optimizing UAV collaborative navigation in multi-agent reinforcement learning, we model it as a Markov Decision Process (MDP). The MDP, as illustrated in Figure 2, serves as a mathematical framework to model decision-making in scenarios where outcomes are influenced by both random factors and the decisions made by an agent. It has been demonstrated that formulating such problems as an MDP is optimal [39].
State space. This encompasses all possible states where an agent can exist. We take a picture of the measurement range through the camera of the UAV and obtain the likelihood estimate of UAV $i$ projected onto the flat terrain at time $t$. Each UAV $i$ stores a local posterior map confidence $\mathcal{M}^i_t$. From the measurement value of UAV $i$ and its altitude, its position $p^i_t$ can be obtained. We define $s_t = (p_t, b)$ as the state of the global environment at time $t$, where $p_t$ represents the current location and $b$ represents the remaining mission budget.
Action space. This comprises all possible actions an agent can take to influence the environment. Each agent takes an action of a fixed step size in a 3D discrete environment, which includes east, south, west, north, up, and down. They can only move within a prescribed discrete location grid P and cannot move outside the environment.
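To make the discrete action space concrete, the following minimal Python sketch enumerates the six fixed-step moves and rejects any move that would leave the prescribed location grid P; the variable and function names are illustrative, not taken from the paper's code.

```python
# Minimal sketch (hypothetical names): the six fixed-step actions and a bounds
# check against the discrete location grid P, modeled here as a 3D box of indices.
import numpy as np

ACTIONS = {
    "east":  np.array([ 1,  0,  0]),
    "west":  np.array([-1,  0,  0]),
    "north": np.array([ 0,  1,  0]),
    "south": np.array([ 0, -1,  0]),
    "up":    np.array([ 0,  0,  1]),
    "down":  np.array([ 0,  0, -1]),
}

def step_position(pos, action, grid_shape):
    """Apply one fixed-step action; reject moves that leave the grid."""
    new_pos = pos + ACTIONS[action]
    if np.all(new_pos >= 0) and np.all(new_pos < np.array(grid_shape)):
        return new_pos
    return pos  # invalid moves keep the UAV in place

# Example: a 5 x 5 x 3 location grid
print(step_position(np.array([0, 0, 0]), "west", (5, 5, 3)))  # stays at [0 0 0]
```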
Reward function. The reward function maps state–action pairs to real numbers and provides feedback to the agent by assigning a numerical reward for each action taken in a given state:

$$r(\mathcal{M}_t, \mathbf{a}_t) = \alpha \left[ H(\mathcal{M}_t) - H(\mathcal{M}_{t+1}) \right] + \beta,$$

where $\mathbf{a}_t$ represents the joint action of all UAVs, and $\mathcal{M}_t$ is the map state at time $t$. The reduction in entropy from the current map state to the next, $H(\mathcal{M}_t) - H(\mathcal{M}_{t+1})$, is rewarded, as it reflects a decrease in environmental uncertainty, indicating that the UAVs have gathered more valuable information and improved task completion. In the reward function, the parameter $\alpha$ controls the weight of the entropy reduction, influencing the focus on reducing uncertainty. The parameter $\beta$, as a bias term, adjusts the baseline reward, encouraging behaviors such as exploring new areas or maintaining specific flight paths. $\gamma$ is the discount factor, a value between 0 and 1, that determines the importance of future rewards, balancing short-term and long-term gains.
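As a concrete illustration, the sketch below computes an entropy-reduction reward for a binary belief map. The cell-wise Shannon entropy and the default values of alpha and beta are assumptions chosen for the example, not the paper's exact settings.

```python
# Hedged sketch of the entropy-reduction reward described above. The map is a
# grid of per-cell occupancy probabilities; alpha and beta are the weight and bias.
import numpy as np

def map_entropy(prob_map, eps=1e-9):
    """Shannon entropy (in nats) summed over all cells of a binary belief map."""
    p = np.clip(prob_map, eps, 1.0 - eps)
    return float(np.sum(-p * np.log(p) - (1.0 - p) * np.log(1.0 - p)))

def reward(prev_map, next_map, alpha=1.0, beta=0.0):
    """r_t = alpha * (H(M_t) - H(M_{t+1})) + beta."""
    return alpha * (map_entropy(prev_map) - map_entropy(next_map)) + beta

# Example: measurements sharpen the belief map, so the entropy drops and r_t > 0
print(reward(np.full((4, 4), 0.5), np.full((4, 4), 0.9)))
```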
3.2. The Proposed Framework
In this section, we present the proposed framework designed for optimizing multi-UAV collaboration in MARL tasks. The framework consists of three key modules: the ADR module, the policy learning module, and the communication module. First, the ADR module performs dimensionality reduction on high-dimensional image data, which is then used in the policy learning module to guide UAV decision-making through an actor–critic architecture. The communication module enables efficient data exchange and coordination between UAVs, allowing for real-time collaboration.
3.2.1. The ADR Framework
The ADR framework, which includes AE and PCA, is specifically designed to optimize data processing efficiency and accuracy in MARL. In our work, the ADR framework is applied to a multi-UAV collaborative task where drones equipped with RGB cameras scan specific terrains, and the framework manages the data flow from the cameras to the Action Network.
The AE achieves data encoding and decoding through two main components:
Encoder: $z = f_\theta(x)$, where $x$ is the input data.
Decoder: $\hat{x} = g_\phi(z)$, where $z$ is the encoded latent representation.
In processing image data, the AE uses this structure to extract key features, aiming to improve model robustness to environmental changes. Standard references on autoencoders, such as Ref. [34], discuss in detail the effects of autoencoders in dimensionality reduction and feature learning.
In our experiments, the AE was deployed as shown in Figure 3:
Specifically, we tested three data processing methods in order: none, AE, and PCA. During the ‘none’ runs, 6000 images of 493 × 493 pixels were randomly sampled from the UAV cameras and subsequently used to train the AE and PCA.
The AE is divided into an encoder and a decoder. The 493 × 493 images are compressed into an 11 × 11 matrix through multiple convolutions in the encoder and then expanded back to the original image size by the decoder in a symmetric inverse process, known as reconstruction. By optimizing the mean squared error (MSE) between the reconstructed and original images, the AE network parameters are adjusted until the reconstructed image closely matches the original. We then extract the encoder part of the network and use it directly to process camera data in the actual MARL tasks, compressing the data before it is submitted to the actor network.
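The following simplified PyTorch sketch illustrates such a convolutional AE. The paper does not specify the exact layer configuration, so the filter counts here are assumptions, and an adaptive pooling layer is used to force the 11 × 11 bottleneck while the decoder upsamples back to 493 × 493 for the MSE loss.

```python
# Simplified sketch of a convolutional AE with an 11 x 11 bottleneck (layer
# sizes are assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, stride=1, padding=1),
            nn.AdaptiveAvgPool2d(11),          # 11 x 11 latent map
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(size=(493, 493), mode="bilinear", align_corners=False),
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = ConvAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

batch = torch.rand(4, 3, 493, 493)            # stand-in for sampled UAV images
recon, latent = model(batch)
loss = loss_fn(recon, batch)                  # reconstruction error (MSE)
loss.backward()
optimizer.step()
print(latent.shape)                           # torch.Size([4, 1, 11, 11])
```

After training, only the encoder part would be kept and applied to camera data before it reaches the actor network, mirroring the deployment step described above.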
PCA reduces dimensions by linearly transforming the original data matrix $X$ into $Y = XP$, where $P$ is the matrix of principal components extracted from the covariance matrix of the data.
Ref. [37] provides a thorough overview of PCA techniques, particularly emphasizing their application in multivariate data analysis. The integration of these theories and methods, particularly when combined with AE technology within our ADR framework, is intended to improve terrain scanning and feature extraction, potentially offering a viable strategy for managing complex data in practical scenarios.
In our experiments, we organized the 6000 images of 493 × 493 pixels sampled during the ‘none’ runs into a large data matrix X, where each row represents a flattened, camera-captured original image. We first applied mean centering on X to ensure the mean of each feature is zero. Next, we calculated the covariance matrix C and performed eigendecomposition to extract the principal components. Finally, based on the input shape required by the actor network, we selected the 121 principal components with the highest eigenvalues to form the matrix P. In subsequent MARL tasks, the camera data X is multiplied by P to compress the dimensions to 11 × 11 before being submitted to the actor network.
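The NumPy sketch below walks through the same pipeline (mean centering, covariance, eigendecomposition, projection onto the 121 leading components) on toy-sized data; for the full 493 × 493 images the experiments use incremental PCA, as described in Section 4.

```python
# NumPy sketch of the PCA pipeline above on toy-sized data. Keeping the 121
# leading components lets the projection reshape to the 11 x 11 input map.
import numpy as np

def fit_pca(X, n_components=121):
    """X: (n_samples, n_features) matrix of flattened images."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # mean centering
    C = np.cov(Xc, rowvar=False)                   # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # eigendecomposition (ascending)
    order = np.argsort(eigvals)[::-1][:n_components]
    P = eigvecs[:, order]                          # (n_features, 121)
    return mean, P

def project(x_flat, mean, P):
    z = (x_flat - mean) @ P                        # (121,)
    return z.reshape(11, 11)                       # actor-network input shape

# Toy example: 200 flattened "images" with 400 features each
X = np.random.rand(200, 400)
mean, P = fit_pca(X)
print(project(X[0], mean, P).shape)                # (11, 11)
```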
3.2.2. Policy Learning Module
In this section, we introduce the policy learning module, which plays a critical role in enabling cooperative behavior among UAVs for effective path planning and decision-making. The objective of this module is to guide the UAVs in dynamically adapting to various environmental conditions while ensuring optimal coordination. By leveraging reinforcement learning techniques, specifically the actor–critic framework, the policy learning module allows UAVs to learn and execute robust strategies in real time.
As illustrated in Figure 4, the overall architecture is designed to facilitate cooperative behavior among UAVs by combining dimensionality reduction and reinforcement learning. Each UAV collects sensor data from its environment, generating a sensor map which is then processed into a local map. To ensure computational efficiency, the ADR framework is applied to these local maps, performing dimensionality reduction and extracting relevant features from the high-dimensional image data. After this step, the reduced information is passed into the actor network, where a policy is generated to guide each UAV’s actions. These actions are performed during the deployment phase, allowing the UAVs to make decisions independently based on their local observations.
In the training phase, the critic network evaluates the actions taken by the UAVs using a global map that provides an overall view of the environment. The critic network computes the Q-values for the actions and feeds this information back to the actor network to optimize the policy. Through this combination of dimensionality reduction, policy learning, and centralized evaluation, the architecture enables efficient, cooperative UAV operations across different environments.
The structure of the actor network is shown in Figure 5. The actor network utilizes a deep convolutional neural network (CNN) architecture to process multiple input data sources. The network consists of three convolutional layers (Conv1, Conv2, Conv3), followed by fully connected layers (FC layers). Each convolutional layer is followed by a ReLU activation function, and the number of filters increases with each layer, allowing the network to capture complex spatial relationships in the input data. After the final convolutional layer, the output is flattened and passed through the fully connected layers (FC1, FC2, FC3), gradually reducing the feature dimensions before generating the UAV’s possible actions. The final output layer contains 6 nodes, representing the specific actions the UAV can take based on the processed information, guiding its behavior during the deployment phase.
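A hedged PyTorch sketch of this layout is shown below. The filter counts and hidden sizes are assumptions, and for brevity the sketch only takes the 11 × 11 map produced by the ADR module as input, whereas the real network also consumes the additional inputs described next (identifier, budget, entropy maps, footprint map).

```python
# Illustrative actor network: Conv1-3 with increasing filters, then FC1-3 and
# a 6-way output over the discrete actions (sizes are assumptions).
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    def __init__(self, in_channels=1, n_actions=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),   # Conv1
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),            # Conv2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),            # Conv3
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 11 * 11, 256), nn.ReLU(),               # FC1
            nn.Linear(256, 64), nn.ReLU(),                         # FC2
            nn.Linear(64, n_actions),                              # FC3: 6 action logits
        )

    def forward(self, x):
        logits = self.fc(self.conv(x))
        return torch.softmax(logits, dim=-1)       # action probabilities

probs = ActorNet()(torch.rand(1, 1, 11, 11))       # ADR-reduced local map
print(probs.shape)                                  # torch.Size([1, 6])
```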
The actor network processes various pieces of input information, including the UAV’s identifier and remaining mission budget, a local map centered on the UAV’s position, the weighted entropy of the local map to provide environmental uncertainty information, the weighted entropy of the UAV’s measurements, and a footprint map indicating the areas observed by all UAVs within communication range. These inputs are processed by the actor network to generate effective strategies, enabling the UAVs to collaborate and achieve the mission objectives. The image data among these inputs are processed by the ADR module to ensure that key information is retained while reducing dimensionality.
The structure of the critic network is similar to that of the actor network, with both networks sharing the same core architecture. However, compared to the actor network, the critic network processes additional input information. In addition to the inputs received by the actor network, the critic network further incorporates global environment information. These additional inputs include a global position map encoding the positions of all agents, the global map state, the weighted entropy of the global map, the map cells currently spanned by all agents’ fields of view, and the actions of other agents.
Next, SCOMAP (State-Compressed Multi-Agent Policy Gradients) serves as the core algorithm to optimize the actions of each agent through the actor and critic networks. SCOMAP builds on the COMA framework but integrates state compression to enhance computational efficiency and scalability. The centralized critic network evaluates the joint state–action values of all agents by utilizing compressed state representations, which are output from the ADR module.
The critic network receives additional global environment information, including the global map state, the positions of all agents, and their actions, as mentioned earlier. This allows the critic network to compute the advantage function using counterfactual baselines, which measure the impact of each agent’s action by comparing it to the baseline actions of other agents. This method ensures that each agent is credited for its individual contribution to the team’s performance while accounting for the actions of others.
This function compares the joint state–action value to the baseline, which is computed by marginalizing over all possible actions of the agent while keeping the actions of the other agents constant. By using state compression, SCOMAP achieves more efficient policy updates and optimized cooperation among UAVs. More detailed equations and derivations can be found in Ref. [20].
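The sketch below shows the core of this counterfactual advantage for a single agent, following the COMA formulation cited above: the baseline marginalizes the agent's own action under its current policy while the other agents' actions stay fixed. The tensors and their values are illustrative.

```python
# Sketch of the counterfactual advantage used by COMA/SCOMAP (cf. Ref. [20]).
# q_values holds the critic's Q(s, (a_-i, a')) for every alternative action a'
# of agent i, with the other agents' actions held constant.
import torch

def counterfactual_advantage(q_values, policy_probs, taken_action):
    """
    q_values:     (n_actions,) critic estimates with agent i's action varied
    policy_probs: (n_actions,) agent i's current policy pi(a | obs_i)
    taken_action: index of the action agent i actually executed
    """
    baseline = torch.dot(policy_probs, q_values)        # counterfactual baseline
    return q_values[taken_action] - baseline

q = torch.tensor([0.2, 0.5, 0.1, 0.4, 0.3, 0.0])
pi = torch.tensor([0.1, 0.4, 0.1, 0.2, 0.1, 0.1])
print(counterfactual_advantage(q, pi, taken_action=1))  # advantage of action 1
```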
3.2.3. Communication Module
To improve the efficiency of multi-agent reinforcement learning (MARL) in UAV systems, we developed a communication module that operates within a limited range. This module enables UAVs to exchange field-of-view information with nearby agents, thereby enhancing collaboration and task execution. The communication module allows each UAV to share critical environmental and status data within a specified communication range, facilitating real-time decision-making. As shown in Figure 6, each UAV is equipped with a communication system and an RGB camera sensor to capture image data reflecting its current state. The communication module consists of two key components, the data transmission unit and the data processing unit, which enable efficient data exchange and fusion; the interaction between these components is shown in Figure 7.
The data transmission unit ensures low-latency and high-bandwidth data transmission within the communication range. It allows UAVs to share their current state information, such as position and speed, and environmental data when they are within communication range. This unit also handles potential random communication failures, ensuring that critical data are transmitted efficiently. On the other hand, the data processing unit receives field-of-view data from other UAVs and processes them, extracting meaningful information to update each UAV’s global situational awareness, which is crucial for optimizing path planning and decision-making.
In the simulation environment, each UAV periodically broadcasts its map data and key observations to other UAVs within its communication range. These shared data include image data captured by the camera, which are used to extract the UAV’s current state, such as altitude, speed, and direction, as well as environmental information. Specifically, the information shared between UAVs can be categorized into two types: position and status information, which includes the current position, flight speed, direction, and altitude of each UAV, and environmental observation information, which includes detailed map data collected via virtual sensors. The latter type of information includes terrain details, obstacles, and other significant environmental features, allowing each UAV to sense its operational environment and adjust its flight trajectory to optimize task efficiency and ensure safety.
Upon receiving the field-of-view data from other UAVs within the communication range, the data processing unit in the communications module performs data fusion. This process involves aligning data from different UAVs to ensure temporal and spatial consistency. Given that UAVs might be in different locations and data transmission may experience delays, the data are aligned based on timestamps and position information to mitigate errors caused by time lag. This alignment ensures that all the UAV information can be combined on a unified temporal and spatial framework. After alignment, the field-of-view data from multiple UAVs are integrated using a weighted averaging method, which assigns weights based on communication quality, field coverage scope, and data accuracy. This integration produces a comprehensive global environmental view, combining multiple observations for a more accurate and complete understanding of the environment.
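A minimal sketch of this weighted-average fusion step is given below; the use of a single normalized scalar weight per source (standing in for communication quality, coverage, and accuracy) is an assumption made for the example.

```python
# Illustrative weighted-average fusion of field-of-view belief maps received
# from UAVs within communication range (weighting scheme is an assumption).
import numpy as np

def fuse_views(local_map, received_maps, weights):
    """
    local_map:     (H, W) this UAV's own belief map
    received_maps: list of (H, W) maps from UAVs in communication range
    weights:       per-source scores (communication quality, coverage, accuracy)
    """
    maps = np.stack([local_map, *received_maps])          # (n_sources, H, W)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                        # normalize weights
    return np.tensordot(w, maps, axes=1)                   # fused global view

fused = fuse_views(np.full((4, 4), 0.5),
                   [np.full((4, 4), 0.8), np.full((4, 4), 0.2)],
                   weights=[1.0, 0.7, 0.5])
print(round(fused[0, 0], 2))                               # ~0.53
```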
The integrated data provide each UAV with a global view of the environment, significantly enhancing situational awareness and enabling real-time decision-making. This global view includes a detailed map of the terrain, the locations of obstacles, and the positions and statuses of other UAVs. Based on this global information, each UAV can dynamically adjust its flight path to avoid potential risks and optimize task execution. The integrated global awareness data are then fed back into the UAV’s path planning system to improve mission performance, allowing UAVs to collaborate more effectively in a multi-agent environment and address complex flight tasks and environmental challenges.
4. Experiment
In this study, we used Python 3.8 to develop the experimental code, with key libraries including pygame for graphical interface support, flake8 for code style checks, and pytest for unit testing.
To verify that drones can efficiently collaborate in different scenarios, we conducted experiments in three distinct environments. These scenarios are as follows: Scenario 1—a strip area occupying a certain percentage of the space, Scenario 2—a rectangular area located in the corner of the scene, and Scenario 3—a strip area in the middle. The images of Scenarios 1, 2, and 3 are presented in Figure 8, Figure 9 and Figure 10, respectively. We identified real terrains matching these shapes and created heatmaps based on the areas of interest. In these heatmaps, the dark (especially red) parts are recognized as areas of high interest. During the mission, drones need to collect as much information as possible from these high-interest areas.
In the experiments, we focused on the separate training processes of AE and PCA, using large-scale image data with a resolution of 493 × 493 pixels captured by UAVs for dimensionality reduction to improve computational efficiency and reduce data complexity. The AE was trained on 6000 images captured by the UAVs, with a batch size of 128, using the Adam optimizer with an initial learning rate set to 0.001, for a total of 100 epochs. On the other hand, PCA used Incremental Principal Component Analysis (IPCA) to process the same data, reducing them to 121 principal components. Given the large data volume, IPCA employed a batch size of 4096 and utilized incremental fitting to manage the computational demands of large-scale data processing.
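For the IPCA side of this setup, the following sketch shows how the 121-component incremental fit can be expressed with scikit-learn's IncrementalPCA; toy-sized random batches stand in for the flattened 493 × 493 UAV images so the example stays lightweight.

```python
# Sketch of the incremental PCA fit described above (121 components; the paper
# uses a batch size of 4096 on flattened 493 x 493 images).
import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=121, batch_size=4096)

for _ in range(3):                                   # stream image batches through partial_fit
    batch = np.random.rand(512, 2500).astype(np.float32)   # toy: 512 images, 50 x 50 px
    ipca.partial_fit(batch)

reduced = ipca.transform(batch[:1]).reshape(11, 11)  # 121 components -> 11 x 11 input
print(reduced.shape)
```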
For each experiment, we executed 50 terrain monitoring missions in a 50 m by 50 m area with a map resolution of 10 cm. The planning resolution was set to 10 m, and the flight altitude was limited to between 20 m and 30 m. We used a camera with a 60-degree field of view, ensuring adjacent measurements do not overlap when taken from the lowest altitude. Considering increased sensor noise at higher altitudes, we simulated noise probabilities of 0.98, 0.75, and 0.625 for altitudes of 5, 10, and 15 m, respectively. The UAV team consisted of 4 agents with a communication radius of 25 m.
The experimental environment included a seed value of 3 to ensure varied start positions for each UAV. The sensor was an RGB camera with a resolution of 10 cm (57 pixels in both x and y directions). The simulation used a random field with a cluster radius of 5 m. The mapping prior was set to 0.5. The experiment included constraints such as a spacing of 5 m, a minimum altitude of 5 m, a maximum altitude of 15 m, and a budget of 14 actions. UAVs have a maximum speed of 5 m per second, a maximum acceleration of 2 m per second squared, and a sampling time of 2 s. Missions were conducted using the COMA method in training mode, with 1500 episodes and a patience of 100. The remaining relevant hyperparameters are presented in Table 1.
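For readability, the settings listed above can be consolidated into a single configuration object, as in the sketch below; the key names are illustrative and not the identifiers used in our code.

```python
# Consolidated view of the experiment settings described in the text
# (key names are illustrative placeholders).
CONFIG = {
    "n_agents": 4,
    "communication_radius_m": 25,
    "seed": 3,
    "sensor": {"type": "rgb", "resolution_cm": 10, "pixels": 57, "fov_deg": 60},
    "altitude_m": {"min": 5, "max": 15},
    "budget_actions": 14,
    "max_speed_mps": 5,
    "max_acceleration_mps2": 2,
    "sampling_time_s": 2,
    "training": {"method": "COMA", "episodes": 1500, "patience": 100},
}
```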
In all experiments, we primarily measured the performance of the algorithms using rewards, supplemented by Q-values and entropy to further evaluate the experimental outcomes. The reward value directly reflects the effectiveness of a strategy in completing tasks within a specific environment and serves as the core metric for evaluating the quality of the strategy. It clearly indicates whether the strategy meets the expected goals, which is particularly crucial in complex scenarios. The Q-value assesses the long-term value of the strategy, offering insights into future decisions; a high Q-value suggests that the strategy is effective not only in the current step but also in multiple future steps, which is essential for the long-term optimization of path planning. Entropy evaluates the stability and diversity of the strategy; low entropy indicates that the strategy is becoming stable, while high entropy may suggest that the strategy is still exploring different decision paths. By considering the reward, Q-value, and entropy together, a more comprehensive understanding of the overall performance of the algorithm can be achieved, ensuring more stable and efficient path planning in practical applications. Our experimental results were averaged over ten trials, and the plotted curves reflect these averages, with the shaded areas representing the variance across the experiments.
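The averaged curves and shaded variance bands reported in the next section can be produced as in the sketch below; the random stand-in data and variable names are assumptions for illustration only.

```python
# Sketch of averaging per-trial reward curves and shading the variance band
# (10 trials, as in our experiments; random data stands in for logged rewards).
import numpy as np
import matplotlib.pyplot as plt

runs = np.random.rand(10, 1500).cumsum(axis=1)   # stand-in: 10 trials x 1500 episodes
mean, var = runs.mean(axis=0), runs.var(axis=0)

episodes = np.arange(runs.shape[1])
plt.plot(episodes, mean, label="mean reward")
plt.fill_between(episodes, mean - var, mean + var, alpha=0.3, label="variance")
plt.xlabel("episode"); plt.ylabel("reward"); plt.legend(); plt.show()
```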
6. Conclusions
In this study, we proposed and explored the application of the ADR framework in multi-agent reinforcement learning, particularly focusing on its impact on UAV path planning tasks. The ADR framework optimizes system performance by flexibly selecting appropriate dimensionality reduction methods, such as the AE or PCA, to enhance computational efficiency and strategy stability.
This framework effectively handles complex, high-dimensional data while reducing data dimensionality in a way that preserves essential information, thereby lowering computational complexity and accelerating the training and exploration processes of the model. It provides a flexible and efficient dimensionality reduction solution for multi-agent systems, suitable for various task requirements and environmental complexities. By intelligently selecting the most suitable dimensionality reduction method, ADR significantly enhances the efficiency and performance of multi-agent reinforcement learning. Currently, the ADR framework primarily selects dimensionality reduction methods based on predefined scenario complexities. Future research could introduce more intelligent adaptive strategies, enabling the system to analyze environmental complexity and task requirements in real time and dynamically adjust the choice of dimensionality reduction methods, thereby further improving the adaptability and robustness of the system. Additionally, while this study focuses on UAV path planning, the ADR framework could extend to other fields, such as autonomous driving, intelligent surveillance, and robotic collaboration. Exploring the performance of ADR in these areas and its potential for broader applications will be an important direction for future research.