Article

Multiple Unmanned Aerial Vehicle (multi-UAV) Reconnaissance and Search with Limited Communication Range Using Semantic Episodic Memory in Reinforcement Learning

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(8), 393; https://doi.org/10.3390/drones8080393
Submission received: 23 July 2024 / Revised: 7 August 2024 / Accepted: 12 August 2024 / Published: 14 August 2024
(This article belongs to the Special Issue Distributed Control, Optimization, and Game of UAV Swarm Systems)

Abstract
Unmanned Aerial Vehicles (UAVs) have garnered widespread attention in reconnaissance and search operations due to their low cost and high flexibility. However, when multiple UAVs (multi-UAV) collaborate on these tasks, a limited communication range can restrict their efficiency. This paper investigates the problem of multi-UAV collaborative reconnaissance and search for static targets with a limited communication range (MCRS-LCR). To address communication limitations, we designed a communication and information fusion model based on belief maps and modeled MCRS-LCR as a multi-objective optimization problem. We further reformulated this problem as a decentralized partially observable Markov decision process (Dec-POMDP). We introduced episodic memory into the reinforcement learning framework, proposing the CNN-Semantic Episodic Memory Utilization (CNN-SEMU) algorithm. Specifically, CNN-SEMU uses an encoder–decoder structure with a CNN to learn state embedding patterns influenced by the highest returns. It extracts semantic features from the high-dimensional map state space to construct a smoother memory embedding space, ultimately enhancing reinforcement learning performance by recalling the highest returns of historical states. Extensive simulation experiments demonstrate that in reconnaissance and search tasks of various scales, CNN-SEMU surpasses state-of-the-art multi-agent reinforcement learning methods in episodic rewards, search efficiency, and collision frequency.

1. Introduction

Advancements in flight control systems, endurance capabilities, and sensors have significantly expanded the use of Unmanned Aerial Vehicles (UAVs) in fields such as search and rescue (SAR) [1], military reconnaissance [2], and environmental monitoring [3]. Compared to traditional human-crewed aircraft or satellite reconnaissance methods, coordinating multiple UAVs (multi-UAV) for reconnaissance and search operations offers lower operational costs and greater flexibility. Additionally, UAVs can be equipped with various sensors to perform diverse reconnaissance and search tasks. Therefore, studying multi-UAV collaborative reconnaissance and search (MCRS) is highly important.
However, coordinating multi-UAVs for reconnaissance and search presents numerous challenges. First, effective communication between UAVs is necessary for information fusion. Thus, it is crucial to consider multi-UAV cooperative reconnaissance and search with a limited communication range (MCRS-LCR). Second, precise planning of each UAV’s flight path is required to ensure comprehensive search area coverage while avoiding collisions and path overlaps. Finally, UAVs need to make distributed decisions based on environmental changes and mission requirements, maintaining system robustness in case of a single-point failure. Additionally, target locations are often unknown and may be sparsely distributed over a vast, unexplored area [4]. If reconnaissance sensors have errors and UAVs fail to detect targets on the first attempt, revisiting the area becomes challenging. These factors all contribute to the complexity of the MCRS-LCR problem.
To address the MCRS-LCR problem, current research primarily focuses on UAV cooperative control and search strategies. Cooperative control can be divided into centralized and distributed structures. Distributed control allows UAVs to make autonomous decisions without a command center, enhancing system scalability and resistance to interference. This has become the leading research direction in UAV control [5]. Regardless of the control structure, communication between UAVs is crucial for effective coordination. Oliehoek [6] categorizes communication among multiple agents into explicit, implicit, delayed, costly, and local communication. Based on these classifications, many researchers consider communication factors when exploring UAV target search problems [7,8,9]. However, for the MCRS-LCR problem, questions like “when to communicate” and “what to communicate” still require further investigation.
Regarding reconnaissance search strategies, some studies discretize the reconnaissance space and generate target probability maps to represent the probability distribution of target locations [7,8,10,11,12]. This method accounts for UAV reconnaissance sensor errors and performs well in practical applications. Other studies have developed optimization methods for multi-UAV systems based on heuristic algorithms [13], ant colony algorithms [14], and particle swarm algorithms [15]. However, as the number of UAVs increases and the environmental state space expands, these methods are prone to local optima.
Deep reinforcement learning (DRL) [16] offers powerful learning capabilities, providing new solutions to the MCRS-LCR problem. Multi-agent deep reinforcement learning (MADRL) further extends DRL’s application scope. MADRL commonly employs the Centralized Training with Decentralized Execution (CTDE) framework. During training, agents use global information and share knowledge to enhance collaboration. During execution, each agent chooses actions based on local observations, enabling distributed decision-making. This approach has succeeded significantly in StarCraft II [17], autonomous driving [18], and multi-robot formation [19].
In deep reinforcement learning, agents need to interact with the environment hundreds or thousands of times to reach human-level performance [20], resulting in data inefficiency. Unlike traditional reinforcement learning methods, human learning relies not only on interactions with the environment but also on past experiences stored in the hippocampus [21], a process known as episodic memory. However, recalling detailed episodes is often impossible. Instead, the human brain typically recalls features associated with high rewards, and episodic memory uses these features to guide future behavior (Figure 1). Research suggests that the ventral hippocampus may encode reward-related information and extract abstract features from environmental states [22,23,24], known as memory representation. This suggests that embedding semantic information with rewards into state space encoding can enhance the effectiveness of episodic memory.
Inspired by hippocampal episodic memory, Blundell et al. [25] proposed introducing episodic memory into reinforcement learning to improve learning efficiency. Zheng et al. [26], Hyungho et al. [27], and Ma [28] have used episodic memory in MADRL. However, the state space dimension in MADRL increases with the number of agents, making extracting semantic features from high-dimensional map state spaces in the MCRS-LCR problem particularly challenging. To efficiently utilize episodic memory by extracting semantic features from high-dimensional map state spaces, we draw an analogy to vision-based state spaces. An encoder–decoder structure with a CNN is used to learn state embedding patterns influenced by the highest returns. We call this method CNN-Semantic Episodic Memory Utilization (CNN-SEMU).
This paper explores the cooperative reconnaissance and search of unknown static targets using multiple UAVs with limited communication range. The main contributions are as follows:
  • A communication and information fusion model for the MCRS-LCR problem based on belief maps is proposed. Each UAV maintains a belief map for all UAVs and uses a max-plus-sum approach for information fusion, enabling effective communication.
  • Episodic memory is introduced into the MCRS-LCR problem. Under the value-factorization centralized training with decentralized execution (CTDE) framework, episodic memory leverages the highest state values stored in memory to generate better temporal-difference targets, enhancing MADRL performance.
  • A new MADRL method called CNN-Semantic Episodic Memory Utilization (CNN-SEMU) is proposed. CNN-SEMU employs an encoder–decoder structure with a CNN to extract semantic features with the highest returns from high-dimensional map state spaces, enhancing the effectiveness of episodic memory.
The rest of the paper is structured as follows: Section 2 introduces related work. Section 3 provides the system description and problem formulation. Section 4 reformulates the MCRS-LCR problem within the reinforcement learning framework. Section 5 introduces episodic memory into multi-UAV reinforcement learning and proposes the CNN-SEMU algorithm. Section 6 presents the experiments. Section 7 concludes the paper.

2. Related Work

This section briefly reviews current research on UAV communication modes, reconnaissance and search strategies, and episodic memory in reinforcement learning. It also identifies the gaps in these studies regarding the MCRS-LCR problem.

2.1. UAV Communication and Search Strategies

Communication between UAVs is crucial for efficient coordination. Explicit and implicit communication models do not consider communication costs, noise, and timing. In explicit communication [6], communication actions are added to the UAV’s action set, allowing UAVs to communicate by executing these actions. In contrast, implicit communication influences other UAVs’ observations through behavior without direct communication actions.
In [9], a communication topology matrix for UAVs is designed, represented as the Nth power of the UAV adjacency communication matrix. Communication between UAVs is achieved by exchanging digital pheromones. This method effectively represents the communication links between UAVs, but the matrix power calculation is computationally expensive. In [8], UAVs communicate by broadcasting the positive and negative detection times to their neighbors, taking the maximum value within the communication range. Although this method reduces communication volume, it does not accurately reflect the cumulative detection counts by neighboring UAVs. In [7], it is assumed that all UAVs are directly interconnected, simplifying calculations, but it may lead to network congestion with many UAVs communicating simultaneously. If UAV reconnaissance and search are likened to marking a chalkboard, implicit communication can be represented as other UAVs within the communication range instantly observing the mark. In this paper, we continue to study multi-UAV information fusion based on implicit communication.
The target probability map provides UAVs with probabilistic information about target distribution, which they use to optimize reconnaissance and search strategies. The studies [8,10,12] treat the existence of a target in each cell as a 0–1 distributed random variable. The target probability map can be updated using Bayes’ rule given an initial prior probability. However, if the prior probability is unavailable or incorrect, this method may affect the posterior probability calculation. The studies [7,11] use sensor readings as evidence sources and update the target probability map using Dempster–Shafer (DS) evidence theory. This method does not depend on prior probabilities and can effectively handle conflicting evidence when sensor readings are erroneous.

2.2. Episodic Memory in Reinforcement Learning

In single-agent reinforcement learning, Model-Free Episodic Control (MFEC) [25] uses episodic memory to solve sequential decision tasks. It stores the highest returns of explored state–action pairs in a tabular memory and greedily selects actions based on the maximum returns in the table. Since identical states are challenging to reproduce in natural environments, MFEC generalizes and embeds similar states using k-nearest neighbors and random projection techniques. Episodic Memory Deep Q-Networks (EMDQN) [29] use episodic memory to supervise agent training, combining the generalization strength of Deep Q-Networks (DQN) with the fast convergence of Episodic Control (EC), enabling quicker learning of better policies.
In multi-agent reinforcement learning, the state space dimension is usually much larger, requiring adjustments and extensions to the feature embedding structures and learning frameworks used in single-agent reinforcement learning. Zheng et al. [26] proposed Episodic Multi-agent Reinforcement Learning with Curiosity-driven Exploration (EMC), which introduces episodic memory into cooperative MADRL. EMC uses episodic memory to remember high-return states and employs one-step Temporal Difference (TD) memory targets for regularized learning. However, EMC still uses random projection for state embedding. According to the Johnson–Lindenstrauss lemma [30], random projection can approximately preserve the distance relationships of the original space; however, when the original state changes slightly, the embedded space may fluctuate significantly. Hyungho et al. [27] proposed Efficient Episodic Memory Utilization (EMU), which uses an encoder–decoder structure to learn semantic state embeddings, addressing the sparse selection of semantically similar memories caused by random projection. However, extracting semantic features directly from high-dimensional map state spaces is challenging. Additionally, EMU employs episodic incentives, providing extra rewards for states within desirable trajectories, thereby encouraging desirable state transitions. Desirable trajectories are defined as those that achieve the overall goal or exceed a preset reward threshold. However, achieving the overall goal may be difficult in practical tasks, and the preset reward threshold requires additional domain knowledge. For the MCRS-LCR problem, if targets are sparsely distributed over a large area, it is challenging to discover all targets, making additional episodic incentives less effective. In this paper, we continue to use regularized memory targets to improve one-step TD learning and employ an encoder–decoder structure with a CNN to learn state embedding patterns influenced by the highest returns.

3. System Description and Problem Formulation

This section describes the system model, including the grid reconnaissance environment, the UAV model, the belief probability map model, and the UAV communication and information fusion model. Building on this, the MCRS-LCR is framed as a multi-objective optimization problem. Table 1 summarizes the main symbols used in this section.

3.1. Reconnaissance Environment Model

As shown in Figure 2, the task area is divided into $L_1 \times L_2$ square cells, with the cell in the $x$th row and $y$th column denoted as $c_{x,y}$, where $x \in \{1, 2, \ldots, L_1\}$ and $y \in \{1, 2, \ldots, L_2\}$. It is assumed that several static targets are distributed within the task area, each occupying one cell, with at most one target per cell. Each cell has two possible states: empty (E), indicating no target is present, or full (F), indicating a target is present. Since there is no prior information about target locations, UAVs must use reconnaissance sensors to search for targets.
Additionally, specific cells in the map are assumed to be inaccessible to UAVs due to no-fly zones or obstacles. UAVs can avoid obstacles using a circular field of view (FOV). Let z k represent the location of the kth no-fly zone cell, and assume no targets exist within these no-fly zones.

3.2. UAV Model

The distributed UAV system comprises N homogeneous UAVs that collaboratively perform reconnaissance and search tasks. The entire task process is divided into T time steps. At each time step, a UAV searches only the cell at its current position and decides the next step’s search direction, making the action space discrete. The size of each cell depends on the UAV’s reconnaissance performance and the time step setting. For example, if a UAV can complete the reconnaissance of a cell with 100-m sides in 10 min, then the cell size is set to 100 m, and the time step is set to 10 min.
Each UAV has eight possible search directions at each time step. The position of a UAV can be identified by the cell it occupies. Let the position of UAV $i$ at time step $t$ be $u_{i,t} = (x, y)$, which is updated by the chosen action $a$ as

$u_{i,t+1}(x) = u_{i,t}(x) + \Delta x, \quad u_{i,t+1}(y) = u_{i,t}(y) + \Delta y, \qquad (1)$

where

$\Delta(x, y) = \begin{cases} (-1, 0), & \text{if } a = 0 \text{ (left)} \\ (1, 0), & \text{if } a = 1 \text{ (right)} \\ (0, -1), & \text{if } a = 2 \text{ (down)} \\ (0, 1), & \text{if } a = 3 \text{ (up)} \\ (-1, -1), & \text{if } a = 4 \text{ (lower left)} \\ (1, -1), & \text{if } a = 5 \text{ (lower right)} \\ (1, 1), & \text{if } a = 6 \text{ (upper right)} \\ (-1, 1), & \text{if } a = 7 \text{ (upper left)}. \end{cases} \qquad (2)$
If a UAV’s action would cause it to exceed the task area’s boundaries, that action is removed from the set of possible actions. This discrete action space modeling for UAVs eliminates the need to consider complex dynamics and control issues, such as motion posture and turning radius in continuous space, thereby simplifying the problem and improving path planning efficiency.
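For illustration, a minimal Python sketch of this action model (assuming 1-indexed cells and the displacement convention of Equation (2); names are ours, not from the paper) is:

```python
# Sketch of the discrete action model in Section 3.2: action index -> (dx, dy) displacement,
# with out-of-bounds actions masked out. Indexing convention is an assumption.
ACTION_DELTAS = {
    0: (-1, 0),   # left
    1: (1, 0),    # right
    2: (0, -1),   # down
    3: (0, 1),    # up
    4: (-1, -1),  # lower left
    5: (1, -1),   # lower right
    6: (1, 1),    # upper right
    7: (-1, 1),   # upper left
}

def valid_actions(pos, L1, L2):
    """Return the action indices that keep the UAV inside the L1 x L2 task area."""
    x, y = pos
    return [a for a, (dx, dy) in ACTION_DELTAS.items()
            if 1 <= x + dx <= L1 and 1 <= y + dy <= L2]

def step(pos, action):
    """Apply Equation (1): the next position is the current cell plus the chosen displacement."""
    dx, dy = ACTION_DELTAS[action]
    return (pos[0] + dx, pos[1] + dy)
```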
Assume all UAVs are equipped with identical reconnaissance sensors and that each UAV completes the reconnaissance of its current cell within one time step. If UAV $i$ performs reconnaissance in $c_{x,y}$ at time step $t$, the sensor reading is $f_{x,y}^{i}(t)$. If $f_{x,y}^{i}(t) = F$, a target is found in $c_{x,y}$; if $f_{x,y}^{i}(t) = E$, no target is found. Due to noise and other errors, the sensors may produce false readings, leading to conflicting reconnaissance results. To address this issue, we treat the sensor readings as evidence and use Dempster–Shafer (DS) evidence theory, as established in [7], to handle these conflicts. By fusing the sensor readings, we can measure the uncertainty of each cell and represent the uncertainty of all cells as a belief probability map.

3.3. Belief Probability Maps

Given that each cell can be in one of two states, E or F, the frame of discernment $\Lambda$ is defined from the power set of these states based on DS evidence theory:

$\Lambda = \{E, F, U\}, \qquad (3)$

where $U = \{E, F\}$ represents the uncertain state. A probability can be assigned to each subset in the frame of discernment using the basic probability assignment (BPA) function $m$. This results in the belief probability map:

$\sum_{A \subseteq \Lambda} m_{x,y}(A) = \underbrace{m_{x,y}(E) + m_{x,y}(F)}_{\text{Target Map}} + \underbrace{m_{x,y}(U)}_{\text{Uncertainty Map}} = 1, \qquad (4)$
where the target map components $m_{x,y}(F)$ and $m_{x,y}(E)$ represent the belief in the presence and absence of a target in $c_{x,y}$, respectively, while the uncertainty map $m_{x,y}(U)$ represents the uncertainty of $c_{x,y}$. Without prior information, the belief probability map is initialized as $m_{x,y}^{t=0}(E) = 0$, $m_{x,y}^{t=0}(F) = 0$, and $m_{x,y}^{t=0}(U) = 1$.
The core concepts of DS evidence theory are “evidence” and “combination”. “Evidence” refers to sensor readings containing uncertain information, while “combination” refers to the combination rules. According to the DS combination rule, sensor readings can be fused as evidence into the belief map at the current time step, thereby updating the belief map for the next time step:

$m_{x,y}^{t+1}(E/F/U) = m_{x,y}^{t} \oplus m_b(E/F/U) = \frac{1}{k} \sum_{B \cap C = E/F/U} m_{x,y}^{t}(B) \times m_b(C), \quad B, C \subseteq \Lambda, \qquad (5)$

where

$k = 1 - \sum_{B \cap C = \varnothing} m_{x,y}^{t}(B) \times m_b(C), \quad B, C \subseteq \Lambda, \qquad (6)$
where $m_b$ represents the BPA of the sensor evidence. When a sensor detects a target, this reading can be used as evidence to increase the belief in F. Since no information is provided about the absence of a target, the probability assigned to E is 0. Due to sensor errors, this evidence is not entirely reliable, so the remaining probability is assigned to U. Thus, the evidence for detecting a target is represented as

$m_b(F) = m_f, \quad m_b(E) = 0, \quad m_b(U) = 1 - m_f. \qquad (7)$

Likewise, the evidence for the sensor not detecting a target is represented as

$m_b(F) = 0, \quad m_b(E) = m_e, \quad m_b(U) = 1 - m_e, \qquad (8)$

where $m_f$ and $m_e$ represent the sensor's confidence in detecting or not detecting a target, respectively. These values can be obtained from historical statistical data.
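As a minimal sketch of the update in Equations (5)–(8), assuming each cell's BPA is stored as a dictionary and using illustrative confidence values for $m_f$ and $m_e$ (not values from the paper):

```python
# Dempster-Shafer update for the frame {E, F, U}; not the authors' implementation.
def ds_combine(cell, evidence):
    """Fuse a sensor-evidence BPA into the cell's current BPA (Equations (5)-(6))."""
    # Conflict mass: E from one source combined with F from the other.
    conflict = cell["E"] * evidence["F"] + cell["F"] * evidence["E"]
    k = 1.0 - conflict  # normalization constant of Equation (6)
    return {
        "E": (cell["E"] * evidence["E"] + cell["E"] * evidence["U"] + cell["U"] * evidence["E"]) / k,
        "F": (cell["F"] * evidence["F"] + cell["F"] * evidence["U"] + cell["U"] * evidence["F"]) / k,
        "U": (cell["U"] * evidence["U"]) / k,
    }

def sensor_evidence(reading, m_f=0.8, m_e=0.7):
    """Equations (7)-(8): evidence BPA for a positive ('F') or negative ('E') reading.
    m_f and m_e are illustrative confidence values only."""
    if reading == "F":
        return {"E": 0.0, "F": m_f, "U": 1.0 - m_f}
    return {"E": m_e, "F": 0.0, "U": 1.0 - m_e}

# Example: start from total uncertainty and fuse one positive reading.
cell = {"E": 0.0, "F": 0.0, "U": 1.0}
cell = ds_combine(cell, sensor_evidence("F"))   # -> F = 0.8, U ~ 0.2, E = 0.0
```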

3.4. UAV Communication and Information Fusion Model

Each UAV maintains a distributed belief map and cannot access the global map. Therefore, UAVs must enhance their collaboration through information exchange and fusion. While [7] assumes fully connected communication among UAVs, in our setting UAVs can only exchange information with neighbors within their communication range. The neighbors of UAV $i$ are defined as

$Ne_{i,t} = \{ u_{j,t} \mid \| u_{j,t} - u_{i,t} \|_2 < R_c,\ j = 1, 2, \ldots, N,\ j \neq i \}, \qquad (9)$
where $R_c$ is the communication range of the UAV. The premise of information fusion is that the belief probability map is independent of the order of detection results. Therefore, we introduce positive and negative detection times. For each cell, UAV $i$ records the positive detection times $N_{x,y}^{i,t}(+)$ and the negative detection times $N_{x,y}^{i,t}(-)$, updated as

$N_{x,y}^{i,t+1}(+) = N_{x,y}^{i,t}(+) + 1 \ \text{if } f_{x,y}^{i}(t) = F, \qquad N_{x,y}^{i,t+1}(-) = N_{x,y}^{i,t}(-) + 1 \ \text{if } f_{x,y}^{i}(t) = E. \qquad (10)$
Ref. [7] demonstrated that the state of the belief map depends only on the positive and negative detection times, regardless of the order of detection results. Thus, the recursive formula in Equation (5) can be written to include only the positive and negative detection times. Each UAV can broadcast its positive and negative detection times to its neighbors within the communication range for information fusion, resulting in a distributed belief map for each UAV. It is important to note that our communication model assumes noiseless, instantaneous broadcast communication. This means each UAV broadcasts its messages and immediately receives messages from all other UAVs within its communication range without errors.
Assume each UAV maintains a map of the positive and negative detection times for all UAVs. When UAV j enters the communication range of UAV i, UAV j merges its distributed map with UAV i’s distributed map. After information fusion, the positive and negative detection times for UAV i are represented as
$N^{i} = \sum_{q=1}^{N} \max\big( N^{i,q}, N^{j,q} \big), \quad j \neq i. \qquad (11)$
In the above equation, we denote $N_{x,y}^{i,j,t}(+/-)$ simply as $N^{i,j}$ for conciseness. Equation (11) takes the maximum value for each UAV and then sums them. For example, in a system of three UAVs, at time step $t$, UAV A's positive detection times for $c_{x,y}$ are (A: 3, B: 1, C: 0). This indicates that UAV A's distributed map records $c_{x,y}$ being detected by UAV A three times, by UAV B once, and by UAV C zero times. When UAV B enters A's communication range with positive detection times (A: 1, B: 3, C: 1) for $c_{x,y}$, the information fusion updates UAV A's positive detection times to (A: 3, B: 3, C: 1), resulting in seven positive detections. In contrast, using the method from [8], which takes the maximum detection times within the communication range, the fused positive detection count would be five, which does not accurately reflect the actual situation.
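A small sketch of this maximum-then-sum fusion (Equation (11)), reproducing the worked example above with hypothetical per-UAV counts:

```python
# Max-then-sum fusion of per-UAV detection counts for one cell c_{x,y} (illustrative only).
def fuse_counts(own, received):
    """Element-wise maximum over per-UAV counts; the fused total is their sum."""
    merged = {uav: max(own.get(uav, 0), received.get(uav, 0))
              for uav in set(own) | set(received)}
    return merged, sum(merged.values())

counts_A = {"A": 3, "B": 1, "C": 0}   # UAV A's distributed record for the cell
counts_B = {"A": 1, "B": 3, "C": 1}   # broadcast by UAV B when it enters A's range
merged, total = fuse_counts(counts_A, counts_B)
# merged == {'A': 3, 'B': 3, 'C': 1}, total == 7 (vs. 5 if only the map-wide maximum were kept)
```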
Denote $(1 - m_f)^{N_{x,y}^{i,t}(+)}$ as $f_{x,y}^{i,t}$ and $(1 - m_e)^{N_{x,y}^{i,t}(-)}$ as $e_{x,y}^{i,t}$. Under the initial conditions, the distributed target map and uncertainty map of UAV $i$ after information fusion can be recursively calculated as

$m_{x,y}^{i,t}(E) = \frac{f_{x,y}^{i,t} - e_{x,y}^{i,t} f_{x,y}^{i,t}}{f_{x,y}^{i,t} + e_{x,y}^{i,t} - e_{x,y}^{i,t} f_{x,y}^{i,t}}, \quad m_{x,y}^{i,t}(F) = \frac{e_{x,y}^{i,t} - e_{x,y}^{i,t} f_{x,y}^{i,t}}{f_{x,y}^{i,t} + e_{x,y}^{i,t} - e_{x,y}^{i,t} f_{x,y}^{i,t}}, \quad m_{x,y}^{i,t}(U) = \frac{e_{x,y}^{i,t} f_{x,y}^{i,t}}{f_{x,y}^{i,t} + e_{x,y}^{i,t} - e_{x,y}^{i,t} f_{x,y}^{i,t}}. \qquad (12)$
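Assuming the closed form of Equation (12) and illustrative sensor confidences, the fused belief of a cell can be recovered directly from its detection counts, e.g.:

```python
# Belief of one cell from its positive/negative detection counts (m_f, m_e illustrative).
def belief_from_counts(n_pos, n_neg, m_f=0.8, m_e=0.7):
    f = (1.0 - m_f) ** n_pos     # residual uncertainty after n_pos positive readings
    e = (1.0 - m_e) ** n_neg     # residual uncertainty after n_neg negative readings
    denom = f + e - e * f
    return {"E": (f - e * f) / denom,
            "F": (e - e * f) / denom,
            "U": (e * f) / denom}

# A cell detected positively 7 times and never negatively is almost certainly occupied.
print(belief_from_counts(7, 0))   # F close to 1, U close to 0
```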

3.5. Problem Formulation

The MCRS-LCR problem requires the multi-UAV system to reduce uncertainty in the task area and discover as many targets as possible under limited communication. UAVs must avoid no-fly zones and prevent collisions with other UAVs during flight. Therefore, a multi-objective optimization model is established as follows:
Objective function:
$\min \sum_{x=1}^{L_1} \sum_{y=1}^{L_2} m_{x,y}^{t=T}(U), \qquad (13)$

$\max \sum_{x=1}^{L_1} \sum_{y=1}^{L_2} I\big( m_{x,y}^{t=T}(F) \geq \tau \big), \qquad (14)$

where $I$ is an indicator function that equals 1 when the condition inside the parentheses is met, and 0 otherwise. $\tau$ is the threshold for the presence of a target in a cell; when $m_{x,y}^{t}(F) \geq \tau$, the cell is considered to contain a target. Equation (13) minimizes the uncertainty in the task area, while Equation (14) maximizes the number of targets discovered. These are complementary objective functions.
Constraints:
$0 \leq u_{i,t}(x) \leq L_1, \quad 0 \leq u_{i,t}(y) \leq L_2, \qquad (15)$

$\| z_k - u_{i,t} \|_2 > d_1, \quad k \in \{1, 2, \ldots, N_Z\}, \qquad (16)$

$\| u_{j,t} - u_{i,t} \|_2 > d_2, \quad i \neq j, \qquad (17)$

where Equation (15) is the UAV boundary constraint, and Equations (16) and (17) are the collision avoidance constraints. $d_1$ and $d_2$ are the safe distances for no-fly zones and inter-UAV collision avoidance, respectively. Note that communication only affects the UAV's observation space and is therefore not included in the constraints.

4. Reformulation

With a large number of UAVs and a vast state space dimension, conventional optimization methods can easily fall into local optima when solving the multi-objective optimization problem established in Section 3.5. Therefore, this section reformulates the multi-objective optimization problem within a decentralized partially observable Markov decision process (Dec-POMDP). Based on this, the state and action spaces of the UAVs are redefined, and a new reward function is designed according to the objectives and constraints of the original problem.

4.1. Dec-POMDP

With limited communication, UAVs cannot observe the global state, necessitating the reformulation of the MCRS-LCR problem as a decentralized partially observable Markov decision process (Dec-POMDP) [6]. A Dec-POMDP is defined as a tuple M = ( D , S , A , P , Ω , O , R , γ ) , where D = { 1 , , N } is the set of UAVs, S is the finite set of environmental states, A is the joint action set of multi-UAV, P is the state transition function, Ω is the joint observation set, O is the observation function, R is the immediate reward, and γ is the discount factor for rewards.
At each time step, each UAV $i$ receives a local observation $o_i \in \Omega$ and selects an action $a_i$ based on this observation. The individual actions collectively form the joint action $\mathbf{a} \in A$. The environment state $s \in S$ transitions to the next state $s'$ according to the state transition function $P(s' \mid s, \mathbf{a})$, and the UAV team receives a shared immediate reward $r = R(s, \mathbf{a}, s')$. Under partial observability, UAV $i$ uses its action–observation history $\tau_i \in \Gamma \equiv (\Omega \times A)^{*}$ to estimate the environment state. The joint action–observation history is denoted as $\boldsymbol{\tau} \in \boldsymbol{\Gamma} \equiv \Gamma^{N}$.
The policy of UAV $i$ is denoted as $\pi_i(a_i \mid \tau_i)$. The goal of cooperative MADRL is to learn a joint policy $\boldsymbol{\pi}$ that maximizes the joint value function $V^{\boldsymbol{\pi}}(\boldsymbol{\tau}) = \mathbb{E}\big[ \sum_{t=0}^{\infty} \gamma^{t} r_t \mid \boldsymbol{\tau}_0 = \boldsymbol{\tau}, \boldsymbol{\pi} \big]$ or the joint action-value function $Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}, \mathbf{a}) = r + \gamma \mathbb{E}_{\boldsymbol{\tau}'}\big[ V^{\boldsymbol{\pi}}(\boldsymbol{\tau}') \big]$.

4.2. Observation, State, and Action Spaces

(1) Observation space: The observation space refers to the portion of environmental information that a single UAV can directly perceive. For the MCRS-LCR problem, due to communication constraints, the observation space of UAV i is represented as
$o_i = \Big\{ \frac{u_i(x)}{L_1}, \frac{u_i(y)}{L_2}, MapU_{surround}, MapF_{surround}, RelPos_{agents}, RelPos_{traps} \Big\}, \qquad (18)$

where $u_i(x)/L_1$ and $u_i(y)/L_2$ are the coordinates of UAV $i$'s current position normalized by the map width and height. Normalizing to the range $[0, 1]$ helps maintain scale consistency among different features.
$MapU_{surround}$ and $MapF_{surround}$ are the distributed uncertainty map and target map of the eight surrounding movable cells of UAV $i$, respectively. Each is represented as an eight-dimensional vector:

$MapU_{surround} = \{ m_{x+\Delta x, y+\Delta y}^{i}(U) \mid (\Delta x, \Delta y) \in D_s \}, \qquad (19)$

$MapF_{surround} = \{ m_{x+\Delta x, y+\Delta y}^{i}(F) \mid (\Delta x, \Delta y) \in D_s \}, \qquad (20)$

where $D_s$ defines the relative positions of the eight surrounding cells of UAV $i$:

$D_s = \{ (-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1) \}. \qquad (21)$
The observation space includes only the belief maps of the eight surrounding movable cells of the UAV, not the entire local belief map. This reduces input dimensions and simplifies the UAV’s perception mechanism. If a direction is not movable, the corresponding observation vector is set to zero.
$RelPos_{agents}$ and $RelPos_{traps}$ represent the relative positions of UAV $i$ to other UAVs and to obstacles, respectively. Using relative positions helps improve the model's ability to generalize across different environments. UAVs perceive other UAVs and obstacles through a circular field of view (FOV). To ensure fixed input dimensions, each UAV or obstacle is represented by a three-dimensional vector (distance, relative x-position, relative y-position). Vectors outside the FOV are filled with zeros.
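A sketch of how such an observation vector could be assembled (Equation (18)); the helper names, FOV handling, and padding scheme are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

NEIGH = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def build_observation(uav, L1, L2, belief_U, belief_F, neighbors, traps, fov, n_uavs, max_traps):
    """uav: (x, y) cell (1-indexed); belief_U/belief_F: L1 x L2 arrays of m(U), m(F)."""
    x, y = uav
    obs = [x / L1, y / L2]                                   # normalized position
    for belief in (belief_U, belief_F):                      # MapU_surround, MapF_surround
        for dx, dy in NEIGH:
            inside = 1 <= x + dx <= L1 and 1 <= y + dy <= L2
            obs.append(belief[x + dx - 1, y + dy - 1] if inside else 0.0)
    def rel_block(points, max_len):                          # RelPos_agents / RelPos_traps
        block = []
        for px, py in points:
            d = float(np.hypot(px - x, py - y))
            if d < fov:
                block += [d, px - x, py - y]                 # (distance, rel x, rel y)
        block = block[:3 * max_len]
        return block + [0.0] * (3 * max_len - len(block))    # zero-fill outside the FOV
    obs += rel_block(neighbors, n_uavs - 1)
    obs += rel_block(traps, max_traps)
    return np.array(obs, dtype=np.float32)
```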
(2) State space: The state space provides a comprehensive view of the environment. For algorithms like QMIX [31] and QPLEX [32], the global state s is used during centralized training to calculate the global action-value function. In the MCRS-LCR problem, the state space is represented as
$s = \{ m(U), m(F), s_{agents}, s_{traps}, s_{targets} \}, \qquad (22)$

where $m(U)$ and $m(F)$ are the global belief maps, indicating the uncertainty and target presence on the map; $s_{agents}$ is the agent state, containing the normalized positions of all UAVs; $s_{traps}$ is the no-fly-zone state, containing the normalized positions of all no-fly zones; and $s_{targets}$ is the target state, containing the normalized positions of all targets.
(3) Action space: The UAV’s action space includes eight possible actions. If an action would cause the UAV to exceed the task area boundaries, it is removed from the set of potential actions.

4.3. Reward Functions

In cooperative MADRL, a well-designed reward function can incentivize agents to collaborate and complete tasks. Based on the objective function and constraints of the MCRS-LCR problem, the reward function is designed as follows:
(1) Exploration reward: The exploration reward guides UAVs to explore the task area and reduce environmental uncertainty, thereby minimizing the uncertainty map. The exploration reward at time step t is given by
$r_{1,t} = \omega_1 \sum_{x=1}^{L_1} \sum_{y=1}^{L_2} \big( m_{x,y}^{t-1}(U) - m_{x,y}^{t}(U) \big). \qquad (23)$
(2) Target discovery reward: For cells with contradictory reconnaissance results, UAVs must conduct multiple reconnaissances to strengthen the belief in the target presence. Therefore, the target discovery reward is given by
$r_{2,t} = \omega_2 \sum_{x=1}^{L_1} \sum_{y=1}^{L_2} \big[ I\big( m_{x,y}^{t}(F) \geq \tau \big) \times I\big( (x, y) \notin D_{find} \big) \big], \qquad (24)$

where $I$ is the indicator function and $D_{find}$ is the set of already discovered targets. The reward is given only when a target is found for the first time.
(3) Collision prevention: A penalty is applied when a UAV enters a no-fly zone:

$r_{3,t} = -\omega_3 \sum_{i=1}^{N} \sum_{k=1}^{N_Z} I\big( \| u_{i,t} - z_k \|_2 \leq d_1 \big). \qquad (25)$

To avoid collisions between UAVs, a penalty is also applied when the distance between two UAVs falls below the safe distance:

$r_{4,t} = -\omega_4 \sum_{i=1}^{N} \sum_{j=i+1}^{N} I\big( \| u_{i,t} - u_{j,t} \|_2 \leq d_2 \big). \qquad (26)$
Therefore, the total reward at time step t can be expressed as
$r_t = r_{1,t} + r_{2,t} + r_{3,t} + r_{4,t}. \qquad (27)$
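A compact sketch of Equations (23)–(27); the weights $\omega_1$–$\omega_4$, threshold $\tau$, and safe distances used below are placeholders, not the values used in the experiments:

```python
import numpy as np

def total_reward(U_prev, U_curr, F_curr, uav_pos, traps, found, tau=0.9,
                 w=(1.0, 10.0, 5.0, 5.0), d1=1.0, d2=1.0):
    """U_prev, U_curr, F_curr: global belief maps; found: set of already discovered cells."""
    r1 = w[0] * np.sum(U_prev - U_curr)                      # uncertainty reduction, Eq. (23)
    newly_found = {(x, y) for (x, y) in zip(*np.where(F_curr >= tau)) if (x, y) not in found}
    r2 = w[1] * len(newly_found)                             # first-time discoveries, Eq. (24)
    found |= newly_found
    r3 = -w[2] * sum(np.hypot(px - zx, py - zy) <= d1        # no-fly-zone penalty, Eq. (25)
                     for (px, py) in uav_pos for (zx, zy) in traps)
    r4 = -w[3] * sum(np.hypot(*(np.subtract(uav_pos[i], uav_pos[j]))) <= d2
                     for i in range(len(uav_pos))            # inter-UAV collision penalty, Eq. (26)
                     for j in range(i + 1, len(uav_pos)))
    return r1 + r2 + r3 + r4
```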

5. Methodology

This section introduces the CNN-Semantic Episodic Memory Utilization (CNN-SEMU) algorithm (Figure 3), which comprises six modules. First, we explain how to construct and update the episodic buffer. Next, we highlight the shortcomings of current embedding methods in addressing the MCRS-LCR problem. Then, we incorporate the CNN module into the state embedding network by drawing an analogy to vision-based state spaces. Finally, we detail the algorithm’s training process.

5.1. Value Factorization CTDE Framework

Centralized training with decentralized execution (CTDE) is a commonly used learning paradigm in multi-agent systems [33,34]. This approach also applies to the MCRS-LCR problem reformulated under the Dec-POMDP framework. During the centralized training phase, all UAVs collaborate using global information to learn and determine the optimal joint action-value function $Q_{tot}(\boldsymbol{\tau}, \mathbf{a}; \theta)$. The parameter $\theta$ is learned by minimizing the one-step TD loss:

$\mathcal{L}(\theta) = \mathbb{E}_{\boldsymbol{\tau}, \mathbf{a}, r, \boldsymbol{\tau}' \sim D} \big[ \big( y(\boldsymbol{\tau}, \mathbf{a}) - Q_{tot}(\boldsymbol{\tau}, \mathbf{a}; \theta) \big)^2 \big], \qquad (28)$

where $y(\boldsymbol{\tau}, \mathbf{a}) = r + \gamma \max_{\mathbf{a}'} Q_{tot}(\boldsymbol{\tau}', \mathbf{a}'; \theta^{-})$ is the one-step TD target, $D$ is the replay buffer, and $\theta^{-}$ is the target network parameter.
During the decentralized execution phase, due to partial observability of the environment, each UAV can only access its local observation history and makes decisions based on its individual action-value function $Q_i(\tau_i, a_i)$. To achieve this, value decomposition is typically used: according to the IGM principle [35], the joint action-value function $Q_{tot}$ is decomposed into the individual contributions $Q_i(\tau_i, a_i)$ of each agent.
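As a minimal illustration of Equation (28) under value factorization, the sketch below uses a VDN-style additive mixer (the simplest choice of mixing function); tensor shapes and names are assumptions:

```python
import torch

def td_loss(agent_qs, target_agent_qs, rewards, dones, gamma=0.99):
    """agent_qs: [batch, n_agents] Q_i of the chosen actions;
    target_agent_qs: [batch, n_agents] max-action Q_i from the target network theta^-."""
    q_tot = agent_qs.sum(dim=1)                        # VDN mixer: Q_tot = sum_i Q_i
    with torch.no_grad():
        target_q_tot = target_agent_qs.sum(dim=1)
        y = rewards + gamma * (1.0 - dones) * target_q_tot   # one-step TD target
    return ((y - q_tot) ** 2).mean()
```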

5.2. Episodic Memory

Figure 3 illustrates the framework of the CNN-Semantic Episodic Memory Utilization (CNN-SEMU) algorithm. Figure 3a shows the standard value factorization CTDE module, where the Mixing Network combines each agent's action-value function into the joint action-value function $Q_{tot}$ using a function $f$. The function $f$ can be implemented using methods like QMIX [31], QPLEX [32], and VDN [36]. Figure 3b depicts the replay buffer, which stores the trajectories $(r, \boldsymbol{\tau}, \mathbf{a}, \boldsymbol{\tau}', s)$ generated from agent–environment interactions. In addition to the replay buffer, CNN-SEMU includes an episodic buffer (Figure 3c). The episodic buffer stores the highest returns $H(s)$ of historical states, used for the regularization part of the one-step TD loss in the value factorization module, expressed as

$\mathcal{L}_{memory}(\theta) = \mathbb{E}_{\boldsymbol{\tau}, \mathbf{a}, r, \boldsymbol{\tau}', s \sim D} \big[ \big( H(s) - Q_{tot}(\boldsymbol{\tau}, \mathbf{a}; \theta) \big)^2 \big]. \qquad (29)$

However, as the number of agents increases, the state space of MADRL grows exponentially, so using the high-dimensional state space directly would require a vast amount of memory. Therefore, researchers typically do not use the global state $s$ directly. Instead, they employ a state embedding function $\phi(s): S \to \mathbb{R}^{k}$ to map the global state $s$ into a $k$-dimensional vector space, which aligns with how memory is represented in the human brain [24]. The embedded state is stored in the episodic buffer as $x = \phi(s)$. Additionally, $x$ is used as the key to look up the highest return of the corresponding original state: $H(\phi(s)): S \to \mathbb{R}^{k} \to H$.
When the replay buffer collects a new trajectory with timestamp $t$, the value $H(x_t)$ in the episodic buffer is updated as

$H(x_t) = \begin{cases} \max\{ H(\hat{x}_t), R_t(s_t, \mathbf{a}_t) \}, & \text{if } \| \hat{x}_t - x_t \|_2 < \delta \\ R_t(s_t, \mathbf{a}_t), & \text{otherwise}, \end{cases} \qquad (30)$

where $\hat{x}_t$ is the nearest neighbor of $x_t$ in the episodic buffer, found using the nearest neighbor algorithm [37], $\delta$ is the threshold measuring the approximate relationship between $\hat{x}_t$ and $x_t$, and $R_t(s_t, \mathbf{a}_t)$ is the return given the global state $s_t$ and joint action $\mathbf{a}_t$.
The construction and update rules for the episodic buffer are given in Algorithm 1. First, a trajectory is collected from the replay buffer (line 1). Then, a reverse chronological traversal is used to calculate the return $R_t$ (lines 2–4). Next, the encoder computes $x_t$ (line 5), and the nearest neighbor algorithm finds the closest $\hat{x}_t$ in the current episodic buffer (line 6). Finally, lines 7–13 handle the update and addition of memories. If the embedded state $x_t$ is sufficiently close to an existing embedded state $\hat{x}_t$ (within the threshold $\delta$) and $x_t$ has a higher return $R_t$, the return in memory is updated (line 9) and the embedded state in memory is replaced with the new embedded state (line 10: memory shift). If $x_t$ lies outside the threshold $\delta$, it is a new state not present in memory and is added to the episodic buffer (line 13).
Algorithm 1 Construction and Update of Episodic Buffer $D_E$
1: $\mathcal{T} = \{ r_0, \ldots, r_T, s_0, \ldots, s_T \}$: trajectory
2: Initialize $R_{T+1} = 0$.
3: for $t = T$ to $0$ do
4:   Set $R_t = r_t + \gamma \times R_{t+1}$.
5:   Compute $x_t = \phi(s_t)$.
6:   Use the nearest neighbor algorithm to select the closest neighbor $\hat{x}_t \in D_E$.
7:   if $\| \hat{x}_t - x_t \|_2 < \delta$ then
8:     if $H(\hat{x}_t) < R_t$ then
9:       $H(\hat{x}_t) \leftarrow R_t$
10:      $\hat{x}_t \leftarrow x_t$, $\hat{s}_t \leftarrow s_t$ (memory shift)
11:    end if
12:  else
13:    Add memory $D_E \leftarrow (x_t, H(x_t), s_t)$
14:  end if
15: end for
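A compact Python sketch of Algorithm 1, following the description above (brute-force nearest-neighbor search for clarity; the paper uses a dedicated nearest-neighbor algorithm [37], and the embedding function $\phi$ is supplied externally):

```python
import numpy as np

class EpisodicBuffer:
    def __init__(self, delta=1.3e-3, gamma=0.99):
        self.keys, self.returns, self.states = [], [], []
        self.delta, self.gamma = delta, gamma

    def nearest(self, x):
        """Brute-force nearest neighbor over stored embedded states."""
        if not self.keys:
            return None, np.inf
        dists = np.linalg.norm(np.stack(self.keys) - x, axis=1)
        idx = int(np.argmin(dists))
        return idx, dists[idx]

    def update(self, rewards, states, embed):
        """rewards, states: one trajectory; embed: the state-embedding function phi."""
        R = 0.0
        for r_t, s_t in zip(reversed(rewards), reversed(states)):
            R = r_t + self.gamma * R                 # discounted return, computed backwards
            x_t = embed(s_t)
            idx, dist = self.nearest(x_t)
            if dist < self.delta:
                if self.returns[idx] < R:            # keep only the highest return
                    self.returns[idx] = R
                    self.keys[idx], self.states[idx] = x_t, s_t   # memory shift
            else:                                    # unseen state: add a new memory
                self.keys.append(x_t); self.returns.append(R); self.states.append(s_t)
```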
For the state embedding function ϕ ( s ) , EMC [26] uses random projection. While this method effectively reduces dimensionality, it imposes random weights on the state vectors, causing them to be randomly distributed in the embedding space. This reduces the range of the threshold δ , potentially preventing the recall of similar states with higher returns and only recalling identical states. EMU [27] proposes a trainable encoder–decoder structure to extract important features with reward semantics from the global state. Although the embedding space is smoother compared to random projection, the state space dimension of MCRS-LCR is enormous (as indicated by Equation (22)). Extracting features with return semantics in such a high-dimensional state space is very challenging.
In vision-based environments such as autonomous driving [38] and robotic navigation [39], the input to the state space is typically images or video frames. Agents can use computer vision techniques, such as convolutional neural networks (CNNs), to extract image features and understand the current environmental state. Both images and grid maps discretize the environment into small units (pixels or grids), each with specific values representing certain environmental features. Inspired by this, convolution operations can extract features from grid maps. Additionally, Equation (4) indicates that the belief map comprises three parts, E, F, and U, whose sum always equals 1. If each map is considered a channel, there must be correlations between these channels.
Figure 3d shows the state embedding network structure of CNN-SEMU. As illustrated, we split the state into the map state $s_{map}$ and the remaining state components $s_{other}$. The map state $s_{map}$ includes the target map F and the uncertainty map U, which are processed by a CNN to extract grid map features. The remaining components $s_{other}$ are processed by a fully connected network (FC), and the extracted features are concatenated with the grid map features from the CNN to form the final embedded state $x$. The decoder reconstructs the embedded state $x$ into a mirror image $\bar{s}$ of the original state $s$ and, through a second head, outputs a predicted return $\bar{H}$ to ensure that the embedded state $x$ contains information about the highest return. The final loss function is expressed as
$\mathcal{L}(\phi, \psi) = \mathbb{E}_{(s, H, t) \sim D_E} \big[ \| H - \psi_H(\phi(s \mid t)) \|^2 + \lambda_1 \| s - \psi_s(\phi(s \mid t)) \|_2^2 \big], \qquad (31)$

where $\phi(s \mid t)$ is the encoder state embedding function, with the time step $t$ as an additional input to improve feature extraction quality [27]; $D_E$ is the episodic buffer; $\psi_H$ is the decoder branch predicting the highest return; $\psi_s$ is the decoder branch reconstructing the global state $s$; and $\lambda_1$ is a scaling factor adjusting the ratio between the prediction loss and the reconstruction loss.
When the encoder–decoder structure is updated, the mapping between original and embedded states ($\phi(s): S \to \mathbb{R}^{k}$) and the key–value mapping ($\mathbb{R}^{k} \to H$) change. Therefore, it is necessary to store the state key–value pairs $(s, H(x), x)$ in the episodic buffer and periodically update the embeddings in the episodic buffer (update embedding).
Figure 3f shows the CNN network structure used to extract map features. Using a 20 × 20 × 2 map as an example, the first and second convolutional layers use 3 × 3 kernels with a stride of 2 to extract primary and higher-level features from the input map. The third layer uses a 5 × 5 kernel with a stride of 1 to cover a larger receptive field, integrating the local features extracted by the previous layers into a more global feature representation.
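A sketch of this embedding network in PyTorch, following Figure 3d,f for the $20 \times 20 \times 2$ example; channel counts, paddings, and hidden widths are assumptions chosen so the stated kernel sizes and strides produce valid shapes:

```python
import torch
import torch.nn as nn

class CNNSemanticEncoder(nn.Module):
    def __init__(self, other_dim, embed_dim=4):
        super().__init__()
        self.conv = nn.Sequential(                   # input: [B, 2, 20, 20] (target map F, uncertainty map U)
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> [B, 16, 10, 10]
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> [B, 32, 5, 5]
            nn.Conv2d(32, 32, 5, stride=1), nn.ReLU(),             # -> [B, 32, 1, 1]
            nn.Flatten(),
        )
        self.fc_other = nn.Sequential(nn.Linear(other_dim + 1, 32), nn.ReLU())  # +1 for time step t
        self.head = nn.Linear(32 + 32, embed_dim)

    def forward(self, s_map, s_other, t):
        """s_map: [B, 2, 20, 20]; s_other: [B, other_dim]; t: [B, 1] normalized time step."""
        h = torch.cat([self.conv(s_map), self.fc_other(torch.cat([s_other, t], dim=1))], dim=1)
        return self.head(h)                          # embedded state x

class Decoder(nn.Module):
    def __init__(self, embed_dim, state_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU())
        self.recon_head = nn.Linear(128, state_dim)  # psi_s: reconstructs the global state
        self.return_head = nn.Linear(128, 1)         # psi_H: predicts the highest return

    def forward(self, x):
        h = self.body(x)
        return self.recon_head(h), self.return_head(h)
```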
Figure 3e visualizes the utilization of episodic memory, where the rainbow-colored circles from red to purple represent embedded states with low to high return values. During model training, a batch of global states s t with timestamps is sampled from the replay buffer and encoded into x t through state embedding. Then, x t queries past experiences in the episodic buffer. Using an encoder–decoder structure with CNN allows the embedded space to maintain a more approximate positional structure to the original state space. Therefore, x t can retrieve memories from x ^ t within a larger threshold δ , efficiently utilizing episodic memory.
The total loss function for training the Q t o t network is expressed as
$\mathcal{L}(\theta) = \mathbb{E}_{\boldsymbol{\tau}, \mathbf{a}, r, \boldsymbol{\tau}' \sim D} \big[ \big( y(\boldsymbol{\tau}, \mathbf{a}) - Q_{tot}(\boldsymbol{\tau}, \mathbf{a}; \theta) \big)^2 + \lambda \big( H(\phi(s), \mathbf{a}) - Q_{tot}(\boldsymbol{\tau}, \mathbf{a}; \theta) \big)^2 \big], \qquad (32)$

where $H(\phi(s_t), \mathbf{a}_t) = r_t + \gamma H(\phi(s_{t+1}))$ is the one-step TD memory target and $\lambda$ is the regularization factor.
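A minimal sketch of Equation (32), assuming $Q_{tot}$ values and the recalled memory values $H(\phi(s_{t+1}))$ are already computed per batch element:

```python
import torch

def semu_loss(q_tot, target_q_tot_next, rewards, dones, H_next, gamma=0.99, lam=0.1):
    """q_tot: Q_tot(tau, a; theta); target_q_tot_next: max_a' Q_tot(tau', a'; theta^-);
    H_next: H(phi(s_{t+1})) recalled from the episodic buffer."""
    y_td = rewards + gamma * (1.0 - dones) * target_q_tot_next   # standard one-step TD target
    y_mem = rewards + gamma * (1.0 - dones) * H_next             # one-step memory TD target
    return ((y_td - q_tot) ** 2 + lam * (y_mem - q_tot) ** 2).mean()
```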

5.3. Model Training

Algorithm 2 outlines the training process of CNN-SEMU. In line 7, the value factorization CTDE algorithm can be any of VDN, QMIX, or QPLEX. In line 8, $t_{emb}$ is the update period of the episodic buffer. Line 11 (update embedding) indicates that whenever the encoder–decoder structure is updated, all embedded states $x$ in the episodic buffer must be updated immediately to maintain the matching relationship between key–value pairs.
Algorithm 2 CNN-SEMU
1: Initialize parameters $\theta$, $\theta^{-}$, $\phi$, and $\psi$.
2: while episode ≤ max-episode do
3:   UAVs interact with the environment using the $\varepsilon$-greedy algorithm based on $Q_i(\tau_i, a_i; \theta)$, execute an episode, and obtain a trajectory $\mathcal{T}$.
4:   Run Algorithm 1 to update the episodic buffer $D_E$.
5:   Store $\mathcal{T}$ in the replay buffer $D$.
6:   Sample a random batch of trajectories $\{\mathcal{T}_i\}_{i=1}^{M_1}$ from $D$ with batch size $M_1$.
7:   Run the value factorization CTDE algorithm with $\{\mathcal{T}_i\}_{i=1}^{M_1}$ and update $\theta$ using Equation (32).
8:   if $t_{env} \bmod t_{emb} == 0$ then
9:     Sample a random batch $\{(s, H, t)_i\}_{i=1}^{M_2}$ from $D_E$ with batch size $M_2$.
10:    Update $\phi$ and $\psi$ using Equation (31).
11:    Update all $x \in D_E$ with the new $\phi$ (update embedding).
12:  end if
13: end while

6. Experiments

This section evaluates the effectiveness of the model and algorithm through extensive experiments. First, the experimental setup is introduced. Next, the impact of communication conditions on reconnaissance search effectiveness is analyzed. Then, CNN-SEMU is compared with other baselines, and the effect of state embedding on algorithm performance is examined. Finally, key parameters influencing the algorithm are analyzed through parameter experiments, and their impact on the embedding space is discussed.

6.1. Experimental Environment Setup

To ensure the generality of the results, we evaluated the algorithm on maps of three different sizes: 10 × 10 , 15 × 15 , and 20 × 20 , with detailed explanations provided for the 20 × 20 map. The specific environmental settings are shown in Table 2. To ensure target discovery, the episodic length was set so that the UAVs could cover the task area at least twice.
In each episode, the positions of the UAVs, targets, and no-fly zones are initialized randomly. During UAV movement, the positions of the targets and no-fly zones are unknown and fixed. UAVs can detect the positions of no-fly zones and other UAVs through their FOV. Other parameters used in the experiments are shown in Table 3. To better compare with baseline algorithms, the episodic latent dimension follows the settings in [26,27], and the reconstruction loss scaling factor λ 1 follows the settings in [27].
The experiments were conducted on a server with multiple NVIDIA GeForce RTX 4090 GPUs (each with 24 GB of memory) and 90 GB of RAM. Unless otherwise specified, each experimental data point is the average of 160 samples: 32 test episodes per random seed across five random seeds during training.

6.2. Communication Model Analysis

To test the effectiveness of the proposed UAV communication and information fusion model, this section analyzes the performance of multi-UAV cooperative reconnaissance under different communication conditions. The communication model was tested using the standard QMIX [31] in a 20 × 20 map.
Figure 4 shows the distributed uncertainty map U of UAV i under different communication conditions, depicted using contour maps. Redder colors indicate higher uncertainty, while bluer colors indicate lower uncertainty. Compared to heat maps, contour maps more clearly describe the numerical changes in uncertainty across the map. Figure 4a,b shows the uncertainty maps with communication, where UAV i achieves effective environmental exploration through real-time information sharing. Over time, the overall map becomes bluer, indicating reduced uncertainty. In contrast, Figure 4c,d shows the uncertainty maps without communication, where UAV i is unaware of the reconnaissance results of other UAVs, leaving most of the map red. The scattered blue areas on the right side of the map indicate no-fly zones. Comparing Figure 4a and Figure 4c at the same time step shows that UAV i in Figure 4c is likely to continue exploring the red areas already covered by other UAVs, while UAV i in Figure 4a uses the communication mechanism to gain a comprehensive understanding of the entire map, effectively avoiding overlapping trajectories and improving reconnaissance efficiency.
Figure 5 tests the impact of communication distance on the number of targets discovered and the level of uncertainty. As the communication distance increases, the number of targets discovered gradually rises, and uncertainty gradually decreases, eventually converging to a stable value. This indicates that increasing the communication distance enhances information exchange opportunities among UAVs, improving their cooperative reconnaissance and search capabilities. In the 20 × 20 map, when the communication distance of the UAVs reaches 5, the number of targets discovered and the level of uncertainty have already converged to optimal values, indicating that a larger communication distance is not necessary to achieve the best communication effect.

6.3. Baseline Experiment Analysis

CNN-SEMU can be integrated with any value decomposition-based MADRL framework. In this paper, we implemented CNN-SEMU based on the QMIX algorithm. The experiments compared EMC and EMU, both using QMIX as the basic framework, and included state-of-the-art MADRL algorithms like QMIX and QPLEX as baselines. Additionally, we conducted an ablation study by removing the CNN structure from CNN-SEMU (denoted as SEMU).
Figure 6 compares the episodic rewards of CNN-SEMU and baseline algorithms on 10 × 10 and 15 × 15 maps. The results show that CNN-SEMU achieves the best performance in smaller task scenarios.
Figure 7 shows the experimental results of 10 UAVs on 20 × 20 map. The experiment compared four metrics: episodic reward, uncertainty, number of collisions, and number of targets found. As shown in Figure 7a, CNN-SEMU achieved the best performance, and all three episodic memory-based methods outperformed QMIX and QPLEX, demonstrating the effectiveness of episodic memory. QPLEX had the fastest convergence in uncertainty and the number of targets found in the early stages of training but the slowest convergence in the number of collisions. Additionally, CNN-SEMU converged the fastest and to better values in uncertainty, the number of collisions, and the number of targets found.
Episodic methods improve sampling efficiency by recalling the highest returns of historical states and using one-step memory TD loss to estimate the action–state value function more quickly and accurately. This results in better performance compared to non-episodic QMIX and QPLEX. As shown in Figure 7a, SEMU, which only uses the encoder–decoder structure, performs worse than EMC with random projection in the later stages of training. This is because it is challenging to directly extract features in the high-dimensional (880) map state space. In contrast, CNN-SEMU uses CNN to extract map features, achieving semantically meaningful embeddings that can recall similar states with higher returns. It is worth noting that CNN-SEMU initially performs worse than QMIX in the early stages of training, as the embedding network has not yet captured the important features of the high-dimensional map state space.

Comparison with EMU

EMU also uses an encoder–decoder structure for state embedding. However, unlike the one-step TD memory target, EMU introduces episodic incentives to provide additional rewards for states on desirable trajectories. Episodic incentives are defined as
$r_p = \gamma \frac{N_{\xi}(s)}{N_{call}(s)} \Big( H(f_{\phi}(s)) - \max_{\mathbf{a}} Q_{\theta}(s, \mathbf{a}) \Big), \qquad (33)$

where $N_{call}(s)$ is the total number of times state $s$ has been visited in the episodic buffer and $N_{\xi}(s)$ is the number of desirable visits. The final loss function is expressed as

$\mathcal{L}(\theta) = \mathbb{E}_{\boldsymbol{\tau}, \mathbf{a}, r, \boldsymbol{\tau}' \sim D} \big[ \big( r_p + y(\boldsymbol{\tau}, \mathbf{a}) - Q_{tot}(\boldsymbol{\tau}, \mathbf{a}; \theta) \big)^2 \big]. \qquad (34)$
EMU defines desirable trajectories as those whose episodic rewards exceed a preset threshold $R_{thr}$. In StarCraft II, the maximum reward is obtained when all enemy units are eliminated, and [27] sets this maximum reward as the threshold $R_{thr}$. For the MCRS-LCR problem, discovering all targets is challenging, making it difficult to determine the maximum overall episodic reward. In the comparative experiments, using the $20 \times 20$ map as an example, we set $R_{thr}$ to 365 and 370 based on the results in Figure 7a. Additionally, we tested a periodically updated $R_{thr}$ that is adjusted based on the episodic rewards measured during training. This allows states initially deemed desirable to be re-evaluated as undesirable as cognition updates. However, when the desirability of states in the episodic buffer changes, it becomes difficult to determine $N_{\xi}(s)$. Therefore, we omit $N_{\xi}(s)/N_{call}(s)$ from Equation (33) and denote this method as EMU-change.
In the comparative experiments with EMU, we used the same CNN-based embedding structure. As shown in Figure 8a, CNN-SEMU exhibited the best performance, while EMU-365/370 performed similarly to QMIX. This is because EMU receives episodic incentives only when the recalled state is desirable, so EMU-365/370 rarely received the $r_p$ signal during training. Figure 8b, which displays the average $Q_{tot}$ value over all UAVs, shows that EMU-365/370 maintains an average Q-value similar to QMIX, indicating that the episodic incentive $r_p$ did not take effect. In contrast, EMU-change has a much higher average Q-value than the other algorithms, indicating that it did receive the $r_p$ signal. However, as shown in Figure 8a, EMU-change performs worse than QMIX. This may be because, although the desirability update strengthens the episodic incentive, it overestimates the action–state value function. CNN-SEMU, by contrast, estimates the action–state value function quickly and accurately.

6.4. State Embedding Analysis

This subsection examines how the proposed algorithm alters the embedding space, affecting performance. Figure 9 shows the t-SNE [40] visualization results of 20 K randomly sampled state samples from a 1 M memory episodic buffer in a 20 × 20 map task. Since the state vectors have no labels, we label each state vector with its corresponding highest return H.
Figure 9a shows the state vectors before embedding. State vectors with similar colors are clustered together, indicating that state vectors with similar returns are closely arranged in space before embedding. Figure 9b shows the embedded state space using EMC random projection. Although random projection approximately preserves the distance relationships between state vectors, the embedded states are scattered in space with almost no clustering effect. Figure 9c shows the embedding representation of CNN-SEMU without the CNN structure, achieving primary clustering but still less effective than in Figure 9d.
CNN-SEMU achieves good clustering of similar return embeddings, resulting in a smoother embedding space. This indicates that the encoder–decoder with a CNN structure can effectively extract features with reward semantics. Additionally, this clustering makes selecting episodic memory around x t safer (Figure 3e), ensuring that the memories queried within a certain threshold are closely related states. This allows a larger δ to be chosen, effectively utilizing more memories.

6.5. Hyperparametric Analysis

Selecting appropriate hyperparameters is essential for model training. This section examines the impact of critical hyperparameters λ and δ on algorithm performance. The hyperparameter experiments are conducted using a 20 × 20 map as an example.
(1) Sensitivity analysis of regularization factor λ
λ controls the proportion of the one-step TD memory loss in Equation (32). In this subsection, we examine the stability of CNN-SEMU's performance under different settings of $\lambda \in \{0.01, 0.05, 0.1, 0.5, 1\}$. As shown in Figure 10, when λ is 0.01, 0.05, or 0.1, CNN-SEMU achieves optimal performance. However, when λ is too large (0.5 or 1), the algorithm's performance degrades significantly. This may be because excessive recall of historical states causes the algorithm to get stuck in local optima.
(2) Sensitivity analysis of the state-embedding difference threshold δ
δ influences the update and usage of the episodic buffer during training through Equation (30). In this section, we study the impact of different δ values on three episodic-based methods: EMC, SEMU, and CNN-SEMU. EMC constructs the embedding space using random projection, CNN-SEMU uses an encoder–decoder structure with CNN, and SEMU is a variant of CNN-SEMU without the CNN structure.
Ref. [27] proposed a method to determine an appropriate threshold $\delta$, expressed as

$\delta \approx \frac{(2 \times 3\sigma_x)^{\dim(x)}}{M}, \qquad (35)$

where $M$ is the capacity of the episodic buffer and $\sigma_x$ is the standard deviation of $x$ in the episodic buffer. With $\dim(x) = 4$, $M = 1 \times 10^{6}$, and $\sigma_x \approx 1$, this gives $\delta \approx 0.0013$. Therefore, in this subsection, we set the $\delta$ range to $\{1.3 \times 10^{-4}, 1.3 \times 10^{-3}, 1.3 \times 10^{-2}, 1.3 \times 10^{-1}\}$ to study its impact on the episodic-based methods.
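A quick numeric check of Equation (35) with the values quoted above:

```python
# Worked check of the threshold heuristic in Equation (35).
sigma_x, dim_x, M = 1.0, 4, 1e6
delta = (2 * 3 * sigma_x) ** dim_x / M
print(delta)   # 6**4 / 1e6 = 0.001296, i.e. approximately 1.3e-3
```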
Figure 11 shows the episodic reward at the end of training for different $\delta$ settings, with the horizontal axis on a logarithmic scale. From Figure 11, it is evident that CNN-SEMU achieves the best training performance across the various $\delta$ values and performs well over a broader range of $\delta$. When $\delta = 1.3 \times 10^{-1}$, the episodic reward of EMC drops sharply. This is because EMC uses random projection to construct the embedding space, resulting in a disordered distribution (as shown in Figure 9b). Within the $\delta = 1.3 \times 10^{-1}$ range, it is possible to retrieve states with significantly different reward values, and this excessive recall can disrupt the training process.
To demonstrate both training efficiency and effectiveness, Table 4 lists the episodic reward as a function of training time, and Figure 12 shows the training curves for different $\delta$ values. As shown in Table 4, CNN-SEMU, which uses an encoder–decoder structure with a CNN, converges faster and to a higher episodic reward in most cases. This indicates that in the smoother embedding space constructed by CNN-SEMU, it is safe to retrieve memories over a wider range of $\delta$. However, when $\delta = 1.3 \times 10^{-1}$, CNN-SEMU's performance declines, suggesting that although CNN-SEMU is robust to a wide range of $\delta$, an excessively large $\delta$ can still recall completely unrelated states. Therefore, selecting an appropriate $\delta$ helps improve algorithm performance. When $\delta = 1.3 \times 10^{-3}$, CNN-SEMU converges to the highest value, indicating that the choice of $\delta$ in Equation (35) is reasonable.
As shown in Figure 12, a larger δ results in higher episodic reward variance than a smaller δ. When δ = 1.3 × 10^{-1}, both EMC and SEMU fail to find the optimal strategy and exhibit larger error bands, validating the preceding analysis. Although a larger δ can recall more similar states, it may also recall completely unrelated ones. In contrast, CNN-SEMU uses CNN-based semantic embedding, which clusters states with the same reward in the embedding space; this makes the space smoother and reduces the likelihood of recalling unrelated states.

7. Conclusions and Future Work

This paper focuses on the multi-UAV cooperative reconnaissance and search problem with a limited communication range (MCRS-LCR) for unknown static targets. We propose a cooperative communication and information fusion model based on belief maps. For the MCRS-LCR problem, we introduce a new episodic-memory-based multi-agent reinforcement learning method called CNN-SEMU. CNN-SEMU uses an encoder–decoder structure with a CNN to construct the embedding space, extracting high-return semantic features from the high-dimensional map state space and making the embedding space smoother. CNN-SEMU then improves reinforcement learning sampling efficiency through one-step memory TD targets. Extensive baseline simulations demonstrate that CNN-SEMU outperforms state-of-the-art multi-agent reinforcement learning methods in episodic rewards, search efficiency, and collision frequency. State-embedding and parameter sensitivity analyses attribute this superior performance to the CNN-based encoder–decoder, which produces better semantic state embeddings.
Future research will optimize inter-UAV communication methods to reduce unnecessary communication load, and will consider practical constraints such as communication delay and bandwidth. Our current study addresses the reconnaissance and search of static targets; future work will extend it to dynamic targets and address the challenges of transitioning from simulation to reality.

Author Contributions

Conceptualization, B.Z. and T.W.; methodology, B.Z., M.L. and Y.C.; software, B.Z.; data curation, B.Z.; writing—original draft preparation, B.Z.; writing—review and editing, X.L. and Z.Z.; visualization, B.Z. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China (62003359) and the Basic Strengthening Plan Project of China (2023-JCJQ-JJ-0795).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Wu, J.; Sun, Y.; Li, D.; Shi, J.; Li, X.; Gao, L.; Yu, L.; Han, G.; Wu, J. An Adaptive Conversion Speed Q-Learning Algorithm for Search and Rescue UAV Path Planning in Unknown Environments. IEEE Trans. Veh. Technol. 2023, 72, 15391–15404. [Google Scholar] [CrossRef]
  2. Li, X.; Lu, X.; Chen, W.; Ge, D.; Zhu, J. Research on UAVs Reconnaissance Task Allocation Method Based on Communication Preservation. IEEE Trans. Consum. Electron. 2024, 70, 684–695. [Google Scholar] [CrossRef]
  3. Liu, K.; Zheng, J. UAV Trajectory Optimization for Time-Constrained Data Collection in UAV-Enabled Environmental Monitoring Systems. IEEE Internet Things J. 2022, 9, 24300–24314. [Google Scholar] [CrossRef]
  4. Senthilnath, J.; Harikumar, K.; Sundaram, S. Metacognitive Decision-Making Framework for Multi-UAV Target Search without Communication. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 3195–3206. [Google Scholar] [CrossRef]
  5. Xia, J.; Zhou, Z. The Modeling and Control of a Distributed-Vector-Propulsion UAV with Aero-Propulsion Coupling Effect. Aerospace 2024, 11, 284. [Google Scholar] [CrossRef]
  6. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; SpringerBriefs in Intelligent Systems, Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar] [CrossRef]
  7. Zhang, B.; Lin, X.; Zhu, Y.; Tian, J.; Zhu, Z. Enhancing Multi-UAV Reconnaissance and Search Through Double Critic DDPG with Belief Probability Maps. IEEE Trans. Intell. Veh. 2024, 9, 3827–3842. [Google Scholar] [CrossRef]
  8. Shen, G.; Lei, L.; Zhang, X.; Li, Z.; Cai, S.; Zhang, L. Multi-UAV Cooperative Search Based on Reinforcement Learning with a Digital Twin Driven Training Framework. IEEE Trans. Veh. Technol. 2023, 72, 8354–8368. [Google Scholar] [CrossRef]
  9. Yan, K.; Xiang, L.; Yang, K. Cooperative Target Search Algorithm for UAV Swarms with Limited Communication and Energy Capacity. IEEE Commun. Lett. 2024, 28, 1102–1106. [Google Scholar] [CrossRef]
  10. Chung, T.H.; Burdick, J.W. Analysis of Search Decision Making Using Probabilistic Search Strategies. IEEE Trans. Rob. 2012, 28, 132–144. [Google Scholar] [CrossRef]
  11. Yang, Y.; Polycarpou, M.M.; Minai, A.A. Multi-UAV Cooperative Search Using an Opportunistic Learning Method. J. Dyn. Syst. Meas. Contr. 2007, 129, 716–728. [Google Scholar] [CrossRef]
  12. Liu, S.; Yao, W.; Zhu, X.; Zuo, Y.; Zhou, B. Emergent Search of UAV Swarm Guided by the Target Probability Map. Appl. Sci. 2022, 12, 5086. [Google Scholar] [CrossRef]
  13. Zhang, C.; Zhou, W.; Qin, W.; Tang, W. A novel UAV path planning approach: Heuristic crossing search and rescue optimization algorithm. Expert Syst. Appl. 2023, 215, 119243. [Google Scholar] [CrossRef]
  14. Yue, W.; Xi, Y.; Guan, X. A New Searching Approach Using Improved Multi-Ant Colony Scheme for Multi-UAVs in Unknown Environments. IEEE Access 2019, 7, 161094–161102. [Google Scholar] [CrossRef]
  15. Zhang, W.; Zhang, W. An Efficient UAV Localization Technique Based on Particle Swarm Optimization. IEEE Trans. Veh. Technol. 2022, 71, 9544–9557. [Google Scholar] [CrossRef]
  16. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  17. Samvelyan, M.; Rashid, T.; De Witt, C.S.; Farquhar, G.; Nardelli, N.; Rudner, T.G.J.; Hung, C.M.; Torr, P.H.S.; Foerster, J.; Whiteson, S. The StarCraft multi-agent challenge. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, Montreal, Canada, 13–17 May 2019; Volume 4, pp. 2186–2188. [Google Scholar]
  18. Wang, Y.; Zhang, J.; Chen, Y.; Yuan, H.; Wu, C. An Automated Learning Method of Semantic Segmentation for Train Autonomous Driving Environment Understanding. IEEE Trans. Ind. Inf. 2024, 20, 6913–6922. [Google Scholar] [CrossRef]
  19. Li, J.; Liu, Q.; Chi, G. Distributed deep reinforcement learning based on bi-objective framework for multi-robot formation. Neural Netw. 2024, 171, 61–72. [Google Scholar] [CrossRef] [PubMed]
  20. Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. The Arcade Learning Environment: An evaluation platform for general agents. J. Artif. Intell. Res. 2013, 47, 253–279. [Google Scholar] [CrossRef]
  21. Squire, L.R. Memory systems of the brain: A brief history and current perspective. Neurobiol. Learn. Mem. 2004, 82, 171–177. [Google Scholar] [CrossRef]
  22. Biane, J.S.; Ladow, M.A.; Stefanini, F.; Boddu, S.P.; Fan, A.; Hassan, S.; Dundar, N.; Apodaca-Montano, D.L.; Zhou, L.Z.; Fayner, V.; et al. Neural dynamics underlying associative learning in the dorsal and ventral hippocampus. Nat. Neurosci. 2023, 26, 798–809. [Google Scholar] [CrossRef]
  23. Turner, V.S.; O’Sullivan, R.O.; Kheirbek, M.A. Linking external stimuli with internal drives: A role for the ventral hippocampus. Curr. Opin. Neurobiol. 2022, 76, 102590. [Google Scholar] [CrossRef]
  24. Eichenbaum, H. Prefrontal–hippocampal interactions in episodic memory. Nat. Rev. Neurosci. 2017, 18, 547–558. [Google Scholar] [CrossRef] [PubMed]
  25. Blundell, C.; Uria, B.; Pritzel, A.; Li, Y.; Ruderman, A.; Leibo, J.Z.; Rae, J.; Wierstra, D.; Hassabis, D. Model-free episodic control. arXiv 2016, arXiv:1606.04460. [Google Scholar]
  26. Zheng, L.; Chen, J.; Wang, J.; He, J.; Hu, Y.; Chen, Y.; Fan, C.; Gao, Y.; Zhang, C. Episodic Multi-agent Reinforcement Learning with Curiosity-driven Exploration. Adv. Neural Inf. Process. Syst. 2021, 5, 3757–3769. [Google Scholar]
  27. Na, H.; Seo, Y.; Moon, I.C. Efficient episodic memory utilization of cooperative multi-agent reinforcement learning. arXiv 2024, arXiv:2403.01112. [Google Scholar]
  28. Ma, X.; Li, W.J. State-based episodic memory for multi-agent reinforcement learning. Mach. Learn. 2023, 112, 5163–5190. [Google Scholar] [CrossRef]
  29. Lin, Z.; Zhao, T.; Yang, G.; Zhang, L. Episodic memory deep q-networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI, Stockholm, Sweden, 13–19 July 2018; Volume 2018-July, pp. 2433–2439. [Google Scholar] [CrossRef]
  30. Johnson, W.B.; Lindenstrauss, J.; Schechtman, G. Extensions of Lipschitz maps into Banach spaces. Isr. J. Math. 1986, 54, 129–138. [Google Scholar] [CrossRef]
  31. Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; Volume 10, pp. 6846–6859. [Google Scholar]
  32. Wang, J.; Ren, Z.; Liu, T.; Yu, Y.; Zhang, C. QPLEX: Duplex Dueling Multi-Agent Q-Learning. arXiv 2020, arXiv:2008.01062. [Google Scholar]
  33. Azzam, R.; Boiko, I.; Zweiri, Y. Swarm Cooperative Navigation Using Centralized Training and Decentralized Execution. Drones 2023, 7, 193. [Google Scholar] [CrossRef]
  34. Khan, A.A.; Adve, R.S. Centralized and distributed deep reinforcement learning methods for downlink sum-rate optimization. IEEE Trans. Wireless Commun. 2020, 19, 8410–8426. [Google Scholar] [CrossRef]
  35. Son, K.; Kim, D.; Kang, W.J.; Hostallero, D.; Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; Volume 2019-June, pp. 10329–10346. [Google Scholar]
  36. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, Stockholm, Sweden, 10–15 July 2018; Volume 3, pp. 2085–2087. [Google Scholar]
  37. Wang, M.; Xu, X.; Yue, Q.; Wang, Y. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proc. VLDB Endow. 2021, 14, 1964–1978. [Google Scholar] [CrossRef]
  38. Liu, Q.; Zhou, S. LightFusion: Lightweight CNN Architecture for Enabling Efficient Sensor Fusion in Free Road Segmentation of Autonomous Driving. IEEE Trans. Circuits Syst. II Express Briefs 2024. early access. [Google Scholar] [CrossRef]
  39. Zhang, Y.; Wilker, K. Visual-and-Language Multimodal Fusion for Sweeping Robot Navigation Based on CNN and GRU. J. Organ. End User Comput. 2024, 36, 1–21. [Google Scholar] [CrossRef]
  40. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. An example of episodic memory is a child’s encounter with an ice cream truck, a highly rewarding experience. When the child hears the truck’s music, they recall the previous scene of buying ice cream, including the music (a), the park location (b), the queue of customers (c), and the truck itself (d). The brain constructs episodic memories by recognizing these key features and associating them with positive experiences, guiding the child’s behavior, such as running outside when the music is heard.
Figure 2. System model: UAVs have eight possible movement directions. They coordinate reconnaissance and target search through a limited communication range with other UAVs while avoiding no-fly zones.
Figure 3. An overview of CNN-SEMU. (a) Standard value factorization CTDE framework. (b) Replay buffer. (c) Episodic buffer. (d) State embedding module. (e) Scenario memory utilization. (f) CNN structure.
Figure 4. Distributed uncertainty map U of UAV i under different communication conditions.
Figure 5. The effect of the communication range on the number of targets found and the uncertainty.
Figure 6. Comparison of CNN-SEMU and baseline algorithms on episodic reward.
Figure 7. Performance comparison of CNN-SEMU and baseline algorithms on 20 × 20 Map.
Figure 8. Performance comparison of CNN-SEMU and EMU.
Figure 9. Visualization of embedded and original states sampled from D E using t-SNE. The rainbow-colored circles, from red to purple, represent embedded states with low to high reward values.
Figure 10. The effect of hyperparameter λ on CNN-SEMU performance.
Figure 11. Episodic reward at the end of training for different values of δ .
Figure 12. The effect of δ on episodic reward.
Table 1. Main notations in Section 3.
Notation | Description
c_{x,y} | Cell at row x and column y in the map.
u_{i,t} | The position of UAV i at time step t.
z_k | The position of no-fly zone k.
N | Number of UAVs.
N_Z | Number of no-fly zones.
f^i_{x,y}(t) | Sensor data from UAV i at time step t for c_{x,y}.
m^t_{x,y}(U), m^t_{x,y}(F), m^t_{x,y}(E) | Basic probability assignments for the uncertainty and target maps at time step t for c_{x,y}.
m_b(U), m_b(F), m_b(E) | The sensor's basic probability assignments for the uncertainty and target maps.
N^{i,t}_{x,y}(+), N^{i,t}_{x,y}(-) | At time step t, the numbers of positive and negative detections of c_{x,y} by UAV i.
m_f, m_e | The beliefs that the sensor identified and failed to identify the target, respectively.
R_c | The communication range of the UAV.
Ne_{i,t} | Neighbors of UAV i at time step t.
τ | The threshold at which a target is deemed to exist.
d_1 | No-fly zone safety distance.
d_2 | Safe distance for UAVs to avoid collisions.
Table 2. Map scene settings.
Map Size | Number of UAVs | Number of Targets | Number of No-Fly Zones | Dimension of State Space | Dimension of Observation Space | Dimension of Action Space | Episodic Length
10 × 10 | 4 | 6 | 3 | 226 | 36 | 8 | 50
15 × 15 | 8 | 12 | 6 | 502 | 57 | 8 | 70
20 × 20 | 10 | 20 | 10 | 880 | 75 | 8 | 100
Table 3. Experimental parameter settings.
Description | Value
UAV communication range (R_c) | 5
FOV radius (R_sight) | 4
No-fly zone safety distances (d_1) | 1
Collision safety distance between UAVs (d_2) | 1
m_f | 0.8
m_e | 0.8
ω_1 | 1
ω_2 | 1
ω_3 | 1
ω_4 | 1
Episodic latent dimension (dim(x)) | 4
Episodic buffer capacity | 1 M
Target discovery threshold (τ) | 0.95
Reconstruction loss scale factor (λ_1) | 0.1
Regularized scale factor (λ) | 0.1
State-embedding difference threshold (δ) | 0.0013
Table 4. The effect of δ on episodic reward according to training time.
Timesteps (M) | 0.67 | 1.33 | 2.00
δ | EMC / SEMU / CNN-SEMU | EMC / SEMU / CNN-SEMU | EMC / SEMU / CNN-SEMU
1.3 × 10^{-4} | 355.69 / 363.22 / 359.17 | 360.20 / 365.63 / 369.16 | 367.30 / 370.01 / 372.11
1.3 × 10^{-3} | 356.03 / 358.62 / 361.91 | 365.62 / 360.33 / 363.02 | 369.75 / 368.35 / 372.34
1.3 × 10^{-2} | 359.43 / 358.84 / 359.77 | 361.68 / 361.30 / 368.24 | 366.49 / 369.74 / 371.91
1.3 × 10^{-1} | 339.95 / 343.10 / 337.03 | 321.97 / 342.96 / 358.17 | 311.41 / 349.96 / 362.51
The optimal results of each parameter experiment are highlighted in bold.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
