Search Results (23)

Search Parameters:
Keywords = decentralized partially observable Markov decision process

36 pages, 1495 KB  
Review
Decision-Making for Path Planning of Mobile Robots Under Uncertainty: A Review of Belief-Space Planning Simplifications
by Vineetha Malathi, Pramod Sreedharan, Rthuraj P R, Vyshnavi Anil Kumar, Anil Lal Sadasivan, Ganesha Udupa, Liam Pastorelli and Andrea Troppina
Robotics 2025, 14(9), 127; https://doi.org/10.3390/robotics14090127 - 15 Sep 2025
Viewed by 1487
Abstract
Uncertainty remains a central challenge in robotic navigation, exploration, and coordination. This paper examines how Partially Observable Markov Decision Processes (POMDPs) and their decentralized variants (Dec-POMDPs) provide a rigorous foundation for decision-making under partial observability across tasks such as Active Simultaneous Localization and Mapping (A-SLAM), adaptive informative path planning, and multi-robot coordination. We review recent advances that integrate deep reinforcement learning (DRL) with POMDP formulations, highlighting improvements in scalability and adaptability as well as unresolved challenges of robustness, interpretability, and sim-to-real transfer. To complement learning-driven methods, we discuss emerging strategies that embed probabilistic reasoning directly into navigation, including belief-space planning, distributionally robust control formulations, and probabilistic graph models such as enhanced probabilistic roadmaps (PRMs) and Canadian Traveler Problem-based roadmaps. These approaches collectively demonstrate that uncertainty can be managed more effectively by coupling structured inference with data-driven adaptation. The survey concludes by outlining future research directions, emphasizing hybrid learning–planning architectures, neuro-symbolic reasoning, and socially aware navigation frameworks as critical steps toward resilient, transparent, and human-centered autonomy. Full article
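Many of the approaches surveyed above reduce to maintaining a belief over hidden state. The discrete Bayes-filter belief update at the core of POMDP planning can be sketched minimally as follows; the two-state world and its transition and observation matrices are illustrative assumptions, not taken from any surveyed paper.

```python
# Minimal discrete POMDP belief update (Bayes filter): an illustrative
# sketch of belief-space planning, not any surveyed paper's code.
# The transition model T[s][s'] and observation model O[s'][o] below
# are made-up numbers for a toy two-state "robot localized?" world.

def belief_update(belief, T, O, obs):
    """Propagate the belief through the dynamics, then condition on an observation."""
    n = len(belief)
    # Prediction step: b'(s') = sum_s T[s][s'] * b(s)
    predicted = [sum(T[s][sp] * belief[s] for s in range(n)) for sp in range(n)]
    # Correction step: b''(s') proportional to O[s'][obs] * b'(s')
    unnorm = [O[sp][obs] * predicted[sp] for sp in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

T = [[0.9, 0.1], [0.2, 0.8]]   # assumed state persistence under one action
O = [[0.8, 0.2], [0.3, 0.7]]   # assumed sensor likelihoods P(obs | state)
b = belief_update([0.5, 0.5], T, O, obs=0)   # observing obs=0 shifts mass to state 0
```

Belief-space planners then choose actions by reasoning over such updated beliefs rather than over raw states.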
(This article belongs to the Section Sensors and Control in Robotics)

22 pages, 2971 KB  
Article
Cooperative Schemes for Joint Latency and Energy Consumption Minimization in UAV-MEC Networks
by Ming Cheng, Saifei He, Yijin Pan, Min Lin and Wei-Ping Zhu
Sensors 2025, 25(17), 5234; https://doi.org/10.3390/s25175234 - 22 Aug 2025
Viewed by 867
Abstract
The Internet of Things (IoT) has promoted emerging applications that require massive device collaboration, heavy computation, and stringent latency. Unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) systems can provide flexible services for user devices (UDs) with wide coverage. The optimization of both latency and energy consumption remains a critical yet challenging task due to the inherent trade-off between them. Joint association, offloading, and computing resource allocation are essential to achieving satisfactory system performance, but these processes are difficult to optimize owing to the highly dynamic environment and the exponentially increasing complexity of large-scale networks. To address these challenges, we introduce a carefully designed cost function to balance latency and energy consumption, formulate the joint problem as a partially observable Markov decision process, and propose two multi-agent deep-reinforcement-learning-based schemes to tackle the long-term problem. Specifically, the multi-agent proximal policy optimization (MAPPO)-based scheme uses centralized learning and decentralized execution, while the closed-form enhanced multi-armed bandit (CF-MAB)-based scheme decouples association from offloading and computing resource allocation. In both schemes, UDs act as independent agents that learn from environmental interactions and historical decisions, make decisions to maximize their individual reward functions, and achieve implicit collaboration through the reward mechanism. Numerical results validate the effectiveness and superiority of the proposed schemes: the MAPPO-based scheme enables collaborative agent decisions for high performance in complex dynamic environments, while the CF-MAB-based scheme supports independent, rapid-response decisions. Full article
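The association step of a bandit-based scheme like CF-MAB can be viewed abstractly as each UD treating candidate servers as arms. Below is a generic UCB1 sketch of that idea; the Bernoulli reward means and the plain UCB1 rule are illustrative assumptions, and the paper's closed-form enhancement is not reproduced here.

```python
# Generic multi-armed bandit sketch of user-to-server association:
# each arm is a candidate server, rewards are a toy stand-in for the
# paper's latency/energy cost. Plain UCB1, an assumption for illustration.
import math, random

def ucb1_select(counts, values, t):
    """Pick the arm maximizing empirical mean plus exploration bonus."""
    for a, n in enumerate(counts):
        if n == 0:                      # try every arm once first
            return a
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

random.seed(0)
true_means = [0.3, 0.7, 0.5]            # assumed per-server reward probabilities
counts, values = [0] * 3, [0.0] * 3
for t in range(1, 2001):
    a = ucb1_select(counts, values, t)
    r = 1.0 if random.random() < true_means[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update
```

Over time the selection concentrates on the best arm while still occasionally probing the others, which is the implicit exploration/exploitation balance such schemes rely on.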

21 pages, 4738 KB  
Article
Research on Computation Offloading and Resource Allocation Strategy Based on MADDPG for Integrated Space–Air–Marine Network
by Haixiang Gao
Entropy 2025, 27(8), 803; https://doi.org/10.3390/e27080803 - 28 Jul 2025
Cited by 1 | Viewed by 712
Abstract
This paper investigates the problem of computation offloading and resource allocation in an integrated space–air–sea network based on unmanned aerial vehicles (UAVs) and low Earth orbit (LEO) satellites supporting Maritime Internet of Things (M-IoT) devices. In the complex, dynamic environment comprising M-IoT devices, UAVs, and LEO satellites, traditional optimization methods encounter significant limitations due to non-convexity and the combinatorial explosion of possible solutions. A multi-agent deep deterministic policy gradient (MADDPG)-based optimization algorithm is proposed to address these challenges. This algorithm is designed to minimize the total system cost, balancing energy consumption and latency through partial task offloading within a cloud–edge–device collaborative mobile edge computing (MEC) system. A comprehensive system model is proposed, with the problem formulated as a partially observable Markov decision process (POMDP) that integrates association control, power control, computing resource allocation, and task distribution. Each M-IoT device and UAV acts as an intelligent agent, collaboratively learning optimal offloading strategies through the centralized training and decentralized execution framework inherent in MADDPG. Numerical simulations validate the effectiveness of the proposed MADDPG-based approach, which demonstrates rapid convergence, significantly outperforms baseline methods, and reduces the total system cost by 15–60%. Full article
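The latency–energy trade-off behind such offloading rewards can be illustrated with a toy partial-offloading cost model. Every constant below (cycles per bit, CPU frequencies, transmit power, the weight w) is an assumed placeholder, not the paper's parameterization.

```python
# Toy latency/energy cost for partial offloading: a fraction x of a
# task runs locally, the rest is transmitted to an edge server.
# All constants are illustrative assumptions.

def system_cost(x, bits=1e6, f_local=1e8, f_edge=1e9,
                rate=2e6, kappa=1e-27, p_tx=0.5, w=0.5):
    cycles = bits * 100                        # assumed CPU cycles per task
    t_local = x * cycles / f_local             # local execution time
    e_local = kappa * f_local**2 * x * cycles  # common CMOS energy model
    t_tx = (1 - x) * bits / rate               # uplink transmission time
    t_edge = (1 - x) * cycles / f_edge         # edge execution time
    e_tx = p_tx * t_tx                         # transmit energy
    latency = max(t_local, t_tx + t_edge)      # local and remote run in parallel
    energy = e_local + e_tx
    return w * latency + (1 - w) * energy      # weighted cost, as in such schemes

# Grid-search the offloading split that minimizes the weighted cost.
best_x = min((i / 100 for i in range(101)), key=system_cost)
```

An RL agent replaces this grid search with a learned policy, but the shape of the objective (parallel local/remote latency plus summed energies) is the same.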
(This article belongs to the Special Issue Space-Air-Ground-Sea Integrated Communication Networks)

33 pages, 4841 KB  
Article
Research on Task Allocation in Four-Way Shuttle Storage and Retrieval Systems Based on Deep Reinforcement Learning
by Zhongwei Zhang, Jingrui Wang, Jie Jin, Zhaoyun Wu, Lihui Wu, Tao Peng and Peng Li
Sustainability 2025, 17(15), 6772; https://doi.org/10.3390/su17156772 - 25 Jul 2025
Viewed by 756
Abstract
The four-way shuttle storage and retrieval system (FWSS/RS) is an advanced automated warehousing solution for achieving green and intelligent logistics, and task allocation is crucial to its logistics efficiency. However, current research on task allocation in three-dimensional storage environments mostly addresses the single-operation mode, which handles inbound or outbound tasks individually, with limited attention paid to the more prevalent composite operation mode in which inbound and outbound tasks coexist. To bridge this gap, this study investigates the task allocation problem in an FWSS/RS under the composite operation mode and introduces deep reinforcement learning (DRL) to solve it. Initially, the FWSS/RS operational workflows and equipment motion characteristics are analyzed, and a task allocation model with the total task completion time as the optimization objective is established. The task allocation problem is then transformed into a partially observable Markov decision process: each shuttle is regarded as an independent agent that receives localized observations, including shuttle position information and task completion status, as inputs, and a deep neural network is employed to fit value functions that output action selections. Correspondingly, all agents are trained within an independent deep Q-network (IDQN) framework that facilitates collaborative learning through experience sharing while maintaining decentralized decision-making based on individual observations. To validate the efficiency and effectiveness of the proposed model and method, experiments were conducted across various problem scales and transport resource configurations. The results demonstrate that the DRL-based approach outperforms conventional task allocation methods, including the auction algorithm and the genetic algorithm: the proposed IDQN-based method reduces the task completion time by up to 12.88% compared to the auction algorithm and up to 8.64% compared to the genetic algorithm across multiple scenarios. In addition, task-related factors are found to have a more significant impact on the optimization objectives of task allocation than transport-resource-related factors. Full article
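The independent-learner idea behind IDQN can be sketched in tabular form: each agent updates its own value estimates from local observations. Here the deep network is replaced by a Q-table over a toy 4-state chain task, an assumption purely for illustration.

```python
# Tabular sketch of independent Q-learning, the idea IDQN builds on.
# One agent, a toy 4-state chain where reaching the rightmost state
# (the assigned task) pays reward 1. IDQN swaps the table for a deep
# network over richer observations; this is only the update rule.
import random

random.seed(1)
N_STATES, ACTIONS = 4, (0, 1)          # action 0: move left, 1: move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s2 = min(N_STATES - 1, s + 1) if a == 1 else max(0, s - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

alpha, gamma, eps = 0.5, 0.9, 0.2
for episode in range(300):
    s = 0
    while s != N_STATES - 1:
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s2, r = step(s, a)
        target = r + gamma * max(Q[(s2, a_)] for a_ in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # temporal-difference update
        s = s2

# Greedy policy after training: always move toward the task.
greedy = [max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in range(N_STATES - 1)]
```

Experience sharing in IDQN amounts to multiple such learners drawing from a common replay pool while keeping their decision-making decentralized.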

31 pages, 1576 KB  
Article
Joint Caching and Computation in UAV-Assisted Vehicle Networks via Multi-Agent Deep Reinforcement Learning
by Yuhua Wu, Yuchao Huang, Ziyou Wang and Changming Xu
Drones 2025, 9(7), 456; https://doi.org/10.3390/drones9070456 - 24 Jun 2025
Viewed by 1022
Abstract
Intelligent Connected Vehicles (ICVs) impose stringent requirements on real-time computational services. However, limited onboard resources and the high latency of remote cloud servers restrict traditional solutions. Unmanned Aerial Vehicle (UAV)-assisted Mobile Edge Computing (MEC), which deploys computing and storage resources at the network edge, offers a promising alternative. In UAV-assisted vehicular networks, jointly optimizing content and service caching, computation offloading, and UAV trajectories to maximize system performance is a critical challenge: it requires balancing system energy consumption and resource allocation fairness while maximizing the cache hit rate and minimizing task latency. To this end, we introduce system efficiency as a unified metric that comprehensively considers cache hit rate, task computation latency, system energy consumption, and resource allocation fairness, and we aim to maximize it through joint optimization. The problem involves discrete decisions (caching, offloading) and continuous variables (UAV trajectories), and its high dynamism and non-convexity make it challenging for traditional optimization methods. Concurrently, existing multi-agent deep reinforcement learning (MADRL) methods often encounter training instability and convergence issues in such dynamic, non-stationary environments. To address these challenges, this paper proposes a MADRL-based joint optimization approach: we model the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) and adopt the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm, which follows the Centralized Training Decentralized Execution (CTDE) paradigm, to maximize system efficiency by judiciously balancing cache hit rate, task delay, energy consumption, and fairness. Simulation results demonstrate that, compared to various representative baseline methods, the proposed MAPPO algorithm achieves higher cumulative rewards and a cache hit rate of approximately 82%. Full article
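One common way to score the fairness term inside such a system-efficiency metric (an assumption here; the paper may define it differently) is Jain's index, which equals 1 for perfectly equal allocations and approaches 1/n as one user dominates:

```python
# Jain's fairness index over a list of per-vehicle resource allocations.
# Used here only to illustrate how a fairness term can enter a unified
# efficiency metric; the paper's exact definition may differ.

def jain_index(allocs):
    n = len(allocs)
    return sum(allocs) ** 2 / (n * sum(a * a for a in allocs))

equal = jain_index([10, 10, 10, 10])   # perfectly fair allocation
skewed = jain_index([37, 1, 1, 1])     # one vehicle hogs the resources
```

A weighted combination of such a fairness score with hit rate, latency, and energy terms then gives a single scalar reward for the agents to maximize.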

20 pages, 1778 KB  
Article
Energy Management for Distributed Carbon-Neutral Data Centers
by Wenting Chang, Chuyi Liu, Guanyu Ren and Jianxiong Wan
Energies 2025, 18(11), 2861; https://doi.org/10.3390/en18112861 - 30 May 2025
Cited by 1 | Viewed by 721
Abstract
With the continuous expansion of data centers, their carbon emissions have become a serious issue, and a number of studies are committed to reducing them. Carbon trading, carbon capture, and power-to-gas technologies are promising emission reduction techniques that are, however, seldom applied to data centers. To bridge this gap, we propose a carbon-neutral architecture for distributed data centers in which each data center consists of three subsystems: an energy subsystem for energy supply, a thermal subsystem for data center cooling, and a carbon subsystem for carbon trading. We then formulate the energy management problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) and develop a distributed solution framework using Multi-Agent Deep Deterministic Policy Gradient (MADDPG). Finally, simulations using real-world data show that the proposed framework provides a cost saving of 20.3%. Full article

21 pages, 9553 KB  
Article
Assisted-Value Factorization with Latent Interaction in Cooperate Multi-Agent Reinforcement Learning
by Zhitong Zhao, Ya Zhang, Siying Wang, Yang Zhou, Ruoning Zhang and Wenyu Chen
Mathematics 2025, 13(9), 1429; https://doi.org/10.3390/math13091429 - 27 Apr 2025
Cited by 1 | Viewed by 747
Abstract
With the development of value decomposition methods, multi-agent reinforcement learning (MARL) has made significant progress in balancing autonomous decision making with collective cooperation. However, the collaborative dynamics among agents are continuously changing, and current value decomposition methods struggle to handle these dynamic changes, impairing the effectiveness of cooperative policies. In this paper, we introduce the concept of latent interaction, upon which an innovative weight-generation method is developed. The proposed method derives weights from historical information, thereby enhancing the accuracy of value estimations. Building upon this, we further propose a dynamic masking mechanism that recalibrates historical information in response to the activity level of agents, improving the precision of latent interaction assessments. Experimental results demonstrate the improved training speed and superior performance of the proposed method in both a multi-agent particle environment and the StarCraft Multi-Agent Challenge. Full article
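The weighted value-decomposition idea can be sketched as a positively-weighted sum of per-agent utilities, so that each agent's own greedy action also maximizes the joint value (the individual-global-max property). The fixed toy weights below stand in for the history-derived weights the paper proposes, which this sketch does not reproduce.

```python
# Minimal weighted value decomposition: Q_tot is a positively-weighted
# sum of per-agent utilities. With positive weights, decentralized
# per-agent argmax coincides with the joint argmax. Weights here are
# toy constants, not the paper's history-derived ones.

def q_tot(per_agent_q, chosen, weights):
    """Mix the chosen per-agent values into a joint value estimate."""
    assert all(w > 0 for w in weights)   # monotonic mixing requirement
    return sum(w * q[a] for w, q, a in zip(weights, per_agent_q, chosen))

q1 = {0: 0.2, 1: 0.9}                    # agent 1's utility per action
q2 = {0: 0.6, 1: 0.1}                    # agent 2's utility per action
w = [1.5, 0.5]                           # toy interaction weights
greedy = [max(q, key=q.get) for q in (q1, q2)]   # decentralized argmax
best = q_tot([q1, q2], greedy, w)
```

Making the weights depend on agents' histories, as the paper does, lets the mixing adapt as collaborative dynamics change.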
(This article belongs to the Section E1: Mathematics and Computer Science)

30 pages, 3310 KB  
Article
Enhancing Scalability and Network Efficiency in IOTA Tangle Networks: A POMDP-Based Tip Selection Algorithm
by Mays Alshaikhli, Somaya Al-Maadeed and Moutaz Saleh
Computers 2025, 14(4), 117; https://doi.org/10.3390/computers14040117 - 24 Mar 2025
Cited by 1 | Viewed by 1477
Abstract
The fairness problem in the IOTA (Internet of Things Application) Tangle network has significant implications for transaction efficiency, scalability, and security, particularly concerning orphan transactions and lazy tips. Traditional tip selection algorithms (TSAs) struggle to ensure fair tip selection, leading to inefficient transaction confirmations and network congestion. This research proposes a novel partially observable Markov decision process (POMDP)-based TSA, which dynamically prioritizes tips with lower confirmation likelihood, reducing orphan transactions and enhancing network throughput. By leveraging probabilistic decision making and the Monte Carlo tree search, the proposed TSA efficiently selects tips based on long-term impact rather than immediate transaction weight. The algorithm is rigorously evaluated against seven existing TSAs, including Random Walk, Unweighted TSA, Weighted TSA, Hybrid TSA-1, Hybrid TSA-2, E-IOTA, and G-IOTA, under various network conditions. The experimental results demonstrate that the POMDP-based TSA achieves a confirmation rate of 89–94%, reduces the orphan tip rate to 1–5%, and completely eliminates lazy tips (0%). Additionally, the proposed method ensures stable scalability and high security resilience, making it a robust and efficient solution for decentralized ledger networks. These findings highlight the potential of reinforcement learning-driven TSAs to enhance fairness, efficiency, and robustness in DAG-based blockchain systems. This work paves the way for future research into adaptive and scalable consensus mechanisms for the IOTA Tangle. Full article

27 pages, 3088 KB  
Article
Research on Integrated Control Strategy for Highway Merging Bottlenecks Based on Collaborative Multi-Agent Reinforcement Learning
by Juan Du, Anshuang Yu, Hao Zhou, Qianli Jiang and Xueying Bai
Appl. Sci. 2025, 15(2), 836; https://doi.org/10.3390/app15020836 - 16 Jan 2025
Cited by 1 | Viewed by 1404
Abstract
The merging behavior of vehicles at entry ramps and the speed differences between ramps and mainline traffic cause merging traffic bottlenecks. Current research, primarily focusing on single traffic control strategies, fails to achieve the desired outcomes. To address this issue, this paper explores an integrated control strategy combining Variable Speed Limits (VSL) and Lane Change Control (LCC) to optimize traffic efficiency in ramp merging areas. For scenarios involving multiple ramp merges, a multi-agent reinforcement learning approach is introduced to optimize control strategies in these areas. An integrated control system based on the Factored Multi-Agent Centralized Policy Gradients (FACMAC) algorithm is developed. By transforming the control framework into a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), state and action spaces for heterogeneous agents are designed. These agents dynamically adjust control strategies and control area lengths based on real-time traffic conditions, adapting to the changing traffic environment. The proposed Factored Multi-Agent Centralized Policy Gradients for Integrated Traffic Control in Dynamic Areas (FM-ITC-Darea) control strategy is simulated and tested on a multi-ramp scenario built on a multi-lane Cell Transmission Model (CTM) simulation platform. Comparisons are made with no control and Factored Multi-Agent Centralized Policy Gradients for Integrated Traffic Control (FM-ITC) strategies, demonstrating the effectiveness of the proposed integrated control strategy in alleviating highway ramp merging bottlenecks. Full article

19 pages, 929 KB  
Article
Task Offloading with LLM-Enhanced Multi-Agent Reinforcement Learning in UAV-Assisted Edge Computing
by Feifan Zhu, Fei Huang, Yantao Yu, Guojin Liu and Tiancong Huang
Sensors 2025, 25(1), 175; https://doi.org/10.3390/s25010175 - 31 Dec 2024
Cited by 7 | Viewed by 4106
Abstract
Unmanned aerial vehicles (UAVs) furnished with computational servers enable user equipment (UE) to offload complex computational tasks, thereby addressing the limitations of edge computing in remote or resource-constrained environments. The application of value decomposition algorithms for UAV trajectory planning has drawn considerable research attention. However, existing value decomposition algorithms commonly encounter obstacles in effectively associating local observations with the global state of UAV clusters, which hinders their task-solving capabilities and gives rise to reduced task completion rates and prolonged convergence times. To address these challenges, this paper introduces an innovative multi-agent deep learning framework that conceptualizes multi-UAV trajectory optimization as a decentralized partially observable Markov decision process (Dec-POMDP). This framework integrates the QTRAN algorithm with a large language model (LLM) for efficient region decomposition and employs graph convolutional networks (GCNs) combined with self-attention mechanisms to adeptly manage inter-subregion relationships. The simulation results demonstrate that the proposed method significantly outperforms existing deep reinforcement learning methods, with improvements in convergence speed and task completion rate exceeding 10%. Overall, this framework significantly advances UAV trajectory optimization and enhances the performance of multi-agent systems within UAV-assisted edge computing environments. Full article
(This article belongs to the Section Sensors and Robotics)

18 pages, 4538 KB  
Article
Multi-UAV Escape Target Search: A Multi-Agent Reinforcement Learning Method
by Guang Liao, Jian Wang, Dujia Yang and Junan Yang
Sensors 2024, 24(21), 6859; https://doi.org/10.3390/s24216859 - 25 Oct 2024
Cited by 4 | Viewed by 3108
Abstract
The multi-UAV target search problem is crucial in the field of autonomous Unmanned Aerial Vehicle (UAV) decision-making. The algorithm design of Multi-Agent Reinforcement Learning (MARL) methods has become integral to research on multi-UAV target search owing to its adaptability to the rapid online decision-making required by UAVs in complex, uncertain environments. In non-cooperative target search scenarios, targets may have the ability to escape. Target probability maps are used in many studies to characterize the likelihood of a target’s existence, guiding the UAV to efficiently explore the task area and locate the target more quickly. However, the escape behavior of the target causes the target probability map to deviate from the actual target’s position, thereby reducing its effectiveness in measuring the target’s probability of existence and diminishing the efficiency of the UAV search. This paper investigates the multi-UAV target search problem in scenarios involving static obstacles and dynamic escape targets, modeling the problem within the framework of decentralized partially observable Markov decision process. Based on this model, a spatio-temporal efficient exploration network and a global convolutional local ascent mechanism are proposed. Subsequently, we introduce a multi-UAV Escape Target Search algorithm based on MAPPO (ETS–MAPPO) for addressing the escape target search difficulty problem. Simulation results demonstrate that the ETS–MAPPO algorithm outperforms five classic MARL algorithms in terms of the number of target searches, area coverage rate, and other metrics. Full article
(This article belongs to the Section Remote Sensors)

27 pages, 1181 KB  
Article
Joint Resource Scheduling of the Time Slot, Power, and Main Lobe Direction in Directional UAV Ad Hoc Networks: A Multi-Agent Deep Reinforcement Learning Approach
by Shijie Liang, Haitao Zhao, Li Zhou, Zhe Wang, Kuo Cao and Junfang Wang
Drones 2024, 8(9), 478; https://doi.org/10.3390/drones8090478 - 12 Sep 2024
Cited by 2 | Viewed by 1422
Abstract
Directional unmanned aerial vehicle (UAV) ad hoc networks (DUANETs) are widely applied due to their high flexibility, strong anti-interference capability, and high transmission rates. However, complex mutual interference persists within directional networks, necessitating scheduling of the time slot, power, and main lobe direction for all links to improve transmission performance. To ensure transmission fairness while maximizing the total count of transmitted data packets under dynamic data transmission demands, a scheduling algorithm for the time slot, power, and main lobe direction based on multi-agent deep reinforcement learning (MADRL) is proposed. Specifically, the problem is modeled with the links as the core, optimizing the time slot, power, and main lobe direction variables for the fairness-weighted count of transmitted data packets, and a decentralized partially observable Markov decision process (Dec-POMDP) is constructed for it. To process the observations in the Dec-POMDP, an attention-mechanism-based observation processing method is proposed that extracts observation features of UAVs and their neighbors within the main lobe range, enhancing algorithm performance. The proposed Dec-POMDP and MADRL algorithms enable distributed autonomous decision-making for the resource scheduling of time slots, power, and main lobe directions. Finally, the simulation and analysis focus on the performance of the proposed and existing algorithms across varying data packet generation rates, main lobe gains, and main lobe widths. The simulation results show that the proposed attention-mechanism-based MADRL algorithm enhances the performance of the plain MADRL algorithm by 22.17%, and that scheduling the main lobe direction improves performance by 67.06% over the algorithm without main lobe direction scheduling. Full article
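The attention-based observation processing can be sketched as scaled dot-product attention of a link's own features over its neighbors' features; the softmax-weighted sum becomes the policy input. The vectors and dimensions below are toy assumptions, and the learned projections the paper would train are omitted.

```python
# Scaled dot-product attention of one query (a link's own features)
# over neighbor feature vectors. Toy vectors; learned query/key/value
# projections are omitted for brevity.
import math

def attend(query, neighbors):
    """Softmax-weighted combination of neighbor features, scored by dot product."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, nb)) / math.sqrt(d)
              for nb in neighbors]
    m = max(scores)                               # numerically stable softmax
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    weights = [e / z for e in exp]
    return [sum(w * nb[i] for w, nb in zip(weights, neighbors))
            for i in range(d)]

own = [1.0, 0.0]
nbrs = [[1.0, 0.0], [0.0, 1.0]]   # one aligned neighbor, one orthogonal
ctx = attend(own, nbrs)            # aligned neighbor receives more weight
```

Feeding such a context vector to each agent lets the policy weight nearby interfering links by relevance rather than treating all neighbors uniformly.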
(This article belongs to the Special Issue Space–Air–Ground Integrated Networks for 6G)

25 pages, 11275 KB  
Article
Multiple Unmanned Aerial Vehicle (multi-UAV) Reconnaissance and Search with Limited Communication Range Using Semantic Episodic Memory in Reinforcement Learning
by Boquan Zhang, Tao Wang, Mingxuan Li, Yanru Cui, Xiang Lin and Zhi Zhu
Drones 2024, 8(8), 393; https://doi.org/10.3390/drones8080393 - 14 Aug 2024
Cited by 2 | Viewed by 1845
Abstract
Unmanned Aerial Vehicles (UAVs) have garnered widespread attention in reconnaissance and search operations due to their low cost and high flexibility. However, when multiple UAVs (multi-UAV) collaborate on these tasks, a limited communication range can restrict their efficiency. This paper investigates the problem of multi-UAV collaborative reconnaissance and search for static targets with a limited communication range (MCRS-LCR). To address communication limitations, we designed a communication and information fusion model based on belief maps and modeled MCRS-LCR as a multi-objective optimization problem. We further reformulated this problem as a decentralized partially observable Markov decision process (Dec-POMDP). We introduced episodic memory into the reinforcement learning framework, proposing the CNN-Semantic Episodic Memory Utilization (CNN-SEMU) algorithm. Specifically, CNN-SEMU uses an encoder–decoder structure with a CNN to learn state embedding patterns influenced by the highest returns. It extracts semantic features from the high-dimensional map state space to construct a smoother memory embedding space, ultimately enhancing reinforcement learning performance by recalling the highest returns of historical states. Extensive simulation experiments demonstrate that in reconnaissance and search tasks of various scales, CNN-SEMU surpasses state-of-the-art multi-agent reinforcement learning methods in episodic rewards, search efficiency, and collision frequency. Full article
(This article belongs to the Special Issue Distributed Control, Optimization, and Game of UAV Swarm Systems)

30 pages, 6135 KB  
Article
A Method for Multi-AUV Cooperative Area Search in Unknown Environment Based on Reinforcement Learning
by Yueming Li, Mingquan Ma, Jian Cao, Guobin Luo, Depeng Wang and Weiqiang Chen
J. Mar. Sci. Eng. 2024, 12(7), 1194; https://doi.org/10.3390/jmse12071194 - 16 Jul 2024
Cited by 7 | Viewed by 2130
Abstract
As an emerging direction of multi-agent collaborative control technology, multiple autonomous underwater vehicle (multi-AUV) cooperative area search technology has played an important role in civilian fields such as marine resource exploration and development, marine rescue, and marine scientific expeditions, as well as in military fields such as mine countermeasures and military underwater reconnaissance. At present, as we continue to explore the ocean, the environment in which AUVs perform search tasks is mostly unknown, with many uncertainties such as obstacles, which places high demands on the autonomous decision-making capabilities of AUVs. Moreover, considering the limited detection capability of a single AUV in underwater environments, while the area searched by the AUV is constantly expanding, a single AUV cannot obtain global state information in real time and can only make behavioral decisions based on local observation information, which adversely affects the coordination between AUVs and the search efficiency of multi-AUV systems. Therefore, in order to face increasingly challenging search tasks, we adopt multi-agent reinforcement learning (MARL) to study the problem of multi-AUV cooperative area search from the perspective of improving autonomous decision-making capabilities and collaboration between AUVs. First, we modeled the search task as a decentralized partial observation Markov decision process (Dec-POMDP) and established a search information map. Each AUV updates the information map based on sonar detection information and information fusion between AUVs, and makes real-time decisions based on this to better address the problem of insufficient observation information caused by the weak perception ability of AUVs in underwater environments. Secondly, we established a multi-AUV cooperative area search system (MACASS), which employs a search strategy based on multi-agent reinforcement learning. 
The system integrates the AUVs into a unified whole through a distributed control approach. During search task execution, each AUV makes action decisions using the MARL-based search strategy, drawing on its sonar detections and the information exchanged among the AUVs in the system. As a result, the AUVs possess enhanced decision-making autonomy, enabling them to better handle challenges such as limited detection capability and insufficient observational information. Full article
(This article belongs to the Special Issue Unmanned Marine Vehicles: Perception, Planning, Control and Swarm)
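The search information map and inter-AUV fusion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the grid size, circular sonar footprint, coverage probability `p_detect`, and the probabilistic-union fusion rule are all assumptions made for the example.

```python
import numpy as np

class SearchInfoMap:
    """Per-AUV search information map: each cell holds the probability
    that the cell has been covered by some AUV's sonar (hypothetical
    simplification of the map described in the abstract)."""

    def __init__(self, width, height):
        self.grid = np.zeros((height, width))

    def update_from_sonar(self, pos, sonar_range, p_detect=0.9):
        """Mark cells within sonar_range of pos as searched with
        confidence p_detect (simplified circular detection model)."""
        ys, xs = np.indices(self.grid.shape)
        mask = (xs - pos[0]) ** 2 + (ys - pos[1]) ** 2 <= sonar_range ** 2
        # Probabilistic union: a cell is covered if it was already
        # covered OR this scan covers it.
        self.grid[mask] = 1 - (1 - self.grid[mask]) * (1 - p_detect)

    def fuse(self, other):
        """Fuse another AUV's map into this one (element-wise union),
        modelling the inter-AUV information exchange."""
        self.grid = 1 - (1 - self.grid) * (1 - other.grid)

# Two AUVs scan different regions, then exchange maps.
a, b = SearchInfoMap(20, 20), SearchInfoMap(20, 20)
a.update_from_sonar((5, 5), sonar_range=3)
b.update_from_sonar((15, 15), sonar_range=3)
a.fuse(b)
```

After fusion, AUV `a`'s map reflects coverage from both vehicles, so its next decision can steer toward cells neither AUV has searched.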
18 pages, 836 KB  
Article
Reinforcement Learning-Based Resource Allocation for Multiple Vehicles with Communication-Assisted Sensing Mechanism
by Yuxin Fan, Zesong Fei, Jingxuan Huang and Xinyi Wang
Electronics 2024, 13(13), 2442; https://doi.org/10.3390/electronics13132442 - 21 Jun 2024
Cited by 2 | Viewed by 1711
Abstract
Autonomous vehicles (AVs) can be equipped with integrated sensing and communication (ISAC) devices to realize sensing and communication functions simultaneously. Time-division ISAC (TD-ISAC) is attractive because it is easy to implement and can be efficiently deployed and integrated into any system. TD-ISAC greatly enhances spectrum efficiency and equipment utilization and reduces system energy consumption. In this paper, we propose a communication-assisted sensing mechanism based on TD-ISAC to support multi-vehicle collaborative sensing. However, applying TD-ISAC to AVs poses two challenges. First, AVs must allocate sensing and communication resources in a dynamically changing environment. Second, limited spectrum resources lead to mutual interference among the signals of multiple vehicles. To address these issues, we construct a multi-vehicle signal interference model, formulate an optimization problem within the partially observable Markov decision process (POMDP) framework, and design a decentralized dynamic allocation scheme for multi-vehicle time–frequency resources based on a deep reinforcement learning (DRL) algorithm. Simulation results show that the proposed scheme achieves a lower miss-detection probability and lower average system interference power than both the DRQN algorithm without the communication-assisted sensing mechanism and a random algorithm without reinforcement learning. We conclude that the proposed scheme can effectively allocate the resources of the TD-ISAC system and reduce interference between multiple vehicles. Full article
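The decentralized allocation idea can be sketched with a toy example. The paper uses a DRQN-based DRL scheme; the sketch below substitutes tabular Q-learning for simplicity, and the reward shape, observation encoding, and channel count are all assumptions made for illustration.

```python
import random
from collections import defaultdict

class SlotAllocator:
    """Hypothetical per-vehicle Q-learning agent that picks one of
    n_channels time-frequency resource blocks each frame, observing
    only its own local state (partial observability)."""

    def __init__(self, n_channels, eps=0.1, alpha=0.5, gamma=0.9):
        self.n = n_channels
        self.eps, self.alpha, self.gamma = eps, alpha, gamma
        self.q = defaultdict(float)  # (obs, action) -> estimated value

    def act(self, obs):
        # Epsilon-greedy channel selection.
        if random.random() < self.eps:
            return random.randrange(self.n)
        return max(range(self.n), key=lambda a: self.q[(obs, a)])

    def learn(self, obs, action, reward, next_obs):
        # One-step Q-learning temporal-difference update.
        best_next = max(self.q[(next_obs, a)] for a in range(self.n))
        td = reward + self.gamma * best_next - self.q[(obs, action)]
        self.q[(obs, action)] += self.alpha * td

# Two vehicles learn to occupy different resource blocks: reward is
# -1 on a collision (mutual interference), +1 otherwise.
agents = [SlotAllocator(2), SlotAllocator(2)]
obs = (0, 0)
for _ in range(500):
    acts = [ag.act(obs) for ag in agents]
    rewards = [-1.0 if acts[0] == acts[1] else 1.0] * 2
    next_obs = tuple(acts)
    for ag, a, r in zip(agents, acts, rewards):
        ag.learn(obs, a, r, next_obs)
    obs = next_obs
```

Each agent updates only from its own observation and reward, so no central coordinator is needed, which is the core property the decentralized POMDP formulation captures.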
