Search Results (48)

Search Parameters:
Keywords = Upper-Confidence Bound algorithm

23 pages, 3099 KB  
Article
A Two-Stage Algorithm for Time Series Compression: ARIMA-Based Pre-Compression and Reinforcement Learning Optimized Chunking
by Miao Chi, Su Pan, Jiaji Feng, Zhe Ding and Zhaowei Zhang
Mathematics 2026, 14(5), 841; https://doi.org/10.3390/math14050841 - 1 Mar 2026
Viewed by 377
Abstract
The explosive growth of time series gives rise to a large amount of data, which emphasizes the importance of data compression. Data compression not only reduces storage costs but also enhances data transmission efficiency and processing speed. However, traditional compression algorithms usually suffer from an insufficient compression ratio and an excessive computational cost. To address these problems, we propose a two-stage compression algorithm for large-scale time series data. In the first stage, we transform the time series data into low-volatility residual data using Autoregressive Integrated Moving Average (ARIMA) modeling and apply adaptive precision quantization to improve compressibility. In the second stage, we implement a reinforcement learning-based compression strategy that uses Q-learning to select the number of blocks into which each quantized data segment is divided, and that achieves compression by storing content shared between the blocks only once and storing differing content separately. We also incorporate the Upper Confidence Bound (UCB) to balance exploration and exploitation, in order to track changes in data patterns and improve compression performance. Experimental results demonstrate that our algorithm achieves a higher compression ratio while maintaining low computational complexity compared with traditional compression algorithms. Full article

16 pages, 1542 KB  
Article
User Authentication Using Inner-Wrist Skin Prints: Feasibility and Performance Assessment with Off-the-Shelf Fingerprint Sensor
by Szymon Cygan, Patryk Lamprecht, Jakub Żmigrodzki, Jan Łusakowski-Milencki, Nikolaos Simopulos, Adrian Zarycki and Piotr Muranty
Sensors 2026, 26(4), 1103; https://doi.org/10.3390/s26041103 - 8 Feb 2026
Viewed by 480
Abstract
Wrist-worn devices enable new paradigms of implicit and continuous user authentication; however, identifying biometric modalities that combine reliability with practical integrability remains challenging. Inner-wrist skin texture represents a relatively unexplored biometric characteristic that may be acquired unobtrusively using commodity hardware. This study evaluates biometric verification based on inner-wrist skin texture using an off-the-shelf capacitive fingerprint sensor and an unmodified, manufacturer-provided fingerprint verification algorithm. Two experiments were conducted. Experiment 1 assessed baseline verification performance under controlled acquisition conditions in a cohort of 33 participants (21 male, 12 female; mean age 30.0 ± 16.9 years, range 10–71 years), yielding 1768 genuine authentication trials. Experiment 2 examined the effect of wrist posture variation under controlled flexion in a separate cohort of 15 participants (11 male, 4 female; mean age 30.9 years, range 18–49 years), with 3900 authentication trials recorded. Across 86,897 impostor comparisons in Experiment 1, no false acceptances were observed, corresponding to a conservative upper bound on the false acceptance rate of 6.7 × 10⁻⁵ at the 99.7% confidence level, while the false rejection rate was approximately 2.93%. In Experiment 2, the overall false rejection rate increased to 3.52%, with no clear monotonic relationship between wrist angle and verification performance within the tested range. The results demonstrate that inner-wrist skin texture can be captured and matched using fingerprint-oriented sensing and matching technology under controlled conditions, providing an experimental baseline for this biometric modality. At the same time, the use of a closed matching algorithm and a sensor designed for fingerprints limits interpretability and generalization. 
These findings motivate further investigation using dedicated recognition methods, larger sensing areas, and extended evaluation protocols tailored specifically to wrist skin print biometrics. Full article
(This article belongs to the Special Issue Biomedical Electronics and Wearable Systems—2nd Edition)
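
The false acceptance bound quoted in this abstract can be checked with a short calculation. The sketch below uses one standard approach for a zero-event outcome (solving (1 − p)^n = 1 − confidence for p); this is an assumption about the method, and the authors' own calculation may differ. The function names are hypothetical.

```python
import math


def zero_failure_upper_bound(n_trials: int, confidence: float) -> float:
    """Exact one-sided upper bound on an event rate when 0 events occur.

    If the true rate were p, the chance of seeing no events in n_trials
    independent trials is (1 - p) ** n_trials. Setting that equal to
    1 - confidence and solving for p gives the bound.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def zero_failure_approx(n_trials: int, confidence: float) -> float:
    """First-order approximation: -ln(1 - confidence) / n_trials."""
    return -math.log(1.0 - confidence) / n_trials


# Numbers taken from the abstract: 86,897 impostor comparisons, 99.7% level.
far_bound = zero_failure_upper_bound(86_897, 0.997)
far_approx = zero_failure_approx(86_897, 0.997)
print(f"exact bound:  {far_bound:.3e}")
print(f"approximation: {far_approx:.3e}")
```

Under these assumptions both values round to about 6.7 × 10⁻⁵, which agrees with the figure reported in the abstract.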

23 pages, 677 KB  
Article
Hierarchical MAB Framework for Energy-Aware Beam Training for Near-Field Communications
by Yunxing Xiang, Yi Yan, Yunchao Song, Jing Gao, Xiaohui You, Jun Wang, Huibin Liang and Yixin Jiang
Sensors 2026, 26(1), 60; https://doi.org/10.3390/s26010060 - 21 Dec 2025
Viewed by 485
Abstract
For XL-MIMO multi-user frequency division duplex systems, this paper proposes a near-field beam training scheme using a two-phase combinatorial multi-armed bandit (MAB) framework. This scheme leverages the MAB framework, integrating energy-aware user scheduling and hierarchical beam training to balance communication quality and device battery level, thereby effectively enhancing system energy efficiency and extending the device’s lifespan. Specifically, in the first phase, we account for user battery levels by designing an energy-aware upper confidence bound (UCB) algorithm for user scheduling. This algorithm effectively balances exploration and exploitation, prioritizing users with higher achievable rates and sufficient battery level. In the second phase, based on the scheduled users, two UCB algorithms are employed for beam training. In the first layer, discrete Fourier transform codebook-based beam scanning is utilized, and a UCB algorithm is applied to initially acquire angle information for scheduled users. In the second layer, based on the obtained angle information, a candidate set of polar-domain codewords is constructed. Another UCB algorithm is then employed to select the optimal polar-domain codewords. The effectiveness of our scheme is confirmed by simulations, demonstrating notable achievable rate gains for multi-user communications. Full article

52 pages, 782 KB  
Article
Single-Stage Causal Incentive Design via Optimal Interventions
by Sebastián Bejos, Eduardo F. Morales, Luis Enrique Sucar and Enrique Munoz de Cote
Entropy 2026, 28(1), 4; https://doi.org/10.3390/e28010004 - 19 Dec 2025
Cited by 1 | Viewed by 625
Abstract
We introduce Causal Incentive Design (CID), a framework that applies causal inference to canonical single-stage principal–agent problems (PAPs) characterized by bilateral private information. Within CID, the operating rules of PAPs are formalized using an additive-noise causal graphical model (CGM). Incentives are modeled as interventions on a function space variable, Γ, which correspond to policy interventions in the principal–follower causal relation. The causal inference target estimand V(Γ) is defined as the expected value of the principal’s utility variable under a specified policy intervention in the post-intervention distribution. In the context of additive-Gaussian independent noise, the estimand V(Γ) decomposes into a two-layer expectation: (i) an inner Gaussian smoothing of the principal’s utility regression; and (ii) an outer averaging over the conditional probability of the follower’s action given the incentive policy. A Gauss–Hermite quadrature method is employed to efficiently estimate the first layer, while a policy-local kernel reweighting approach is used for the second. For offline selection of a single incentive policy, a Functional Causal Bayesian Optimization (FCBO) algorithm is introduced. This algorithm models the objective functional γ ↦ V(γ) using a functional Gaussian process surrogate defined on a Reproducing Kernel Hilbert Space (RKHS) domain and utilizes an Upper Confidence Bound (UCB) acquisition functional. Consequently, the policy value V(γ) becomes an interventional query that can be answered using offline observational data under standard identifiability assumptions. High-probability cumulative-regret bounds are established in terms of differential information gain for the proposed FCBO algorithm. Collectively, these elements constitute the central contributions of the CID framework, which integrates causal inference through identification and estimation with policy search in principal–agent problems under private information. 
This approach establishes a causal decision-making pipeline that enables commitment to a high-performing incentive in a single-shot game, supported by regret guarantees. Provided that the data used for estimation is sufficient, the resulting offline pipeline is appropriate for scenarios where adaptive deployment is impractical or costly. Beyond the methodological contribution, this work introduces a novel application of causal graphical models and causal reasoning to incentive design and principal–agent problems, which are central to economics and multi-agent systems. Full article
(This article belongs to the Special Issue Causal Graphical Models and Their Applications)

25 pages, 1703 KB  
Article
Design and Optimization Method for Scaled Equivalent Model of T-Tail Configuration Structural Dynamics Simulating Fuselage Stiffness
by Zheng Chen, Xinyu Ai, Weizhe Feng, Rui Yang and Wei Qian
Aerospace 2025, 12(12), 1063; https://doi.org/10.3390/aerospace12121063 - 30 Nov 2025
Cited by 1 | Viewed by 587
Abstract
The T-tail configuration, while offering advantages for large transport aircraft, is susceptible to peculiar aerodynamic phenomena such as deep stall and flutter, necessitating high-fidelity dynamic scaling for wind tunnel testing. In order to address the issue of similarity in the dynamic characteristics of scaled T-tail models, we propose a comprehensive optimization design method for dynamic scaled equivalent models of T-tail structures with rear fuselages. The development of an elastic-scaled model is accomplished through the integration of the least squares method with a genetic sensitivity hybrid algorithm. In this framework, the objective function is defined as minimizing a weighted sum of the frequency errors and the modal shape discrepancies (1 − Modal Assurance Criterion) for the first five modes, subject to lower and upper bound constraints on the design variables (e.g., beam cross-sectional dimensions). The findings indicate that the application of finite element modelling in conjunction with multi-objective optimization results in a scaled model that closely aligns with the dynamic characteristics of the actual aircraft structure. Specifically, the frequency error of the optimized model is maintained below 2%, while the modal confidence level exceeds 95%. A ground vibration test (GVT) was conducted on a fabricated scaled model, with all frequency errors below 3%, successfully validating the optimization approach. This GVT-validated high-fidelity model establishes a reliable foundation for subsequent wind tunnel tests, such as flutter and buffet experiments, the results of which are vital for validating the full-scale aircraft’s aeroelastic model and informing critical flight safety assessments. The T-tail elastic model design methodology presented in this study serves as a valuable reference for the analysis of T-tail characteristics and the design of wind tunnel models. 
Furthermore, it provides insights applicable to multidisciplinary optimisation and the design of wind tunnel models for other similar elastic scaled-down configurations. Full article

27 pages, 4763 KB  
Article
Lightweight Reinforcement Learning for Priority-Aware Spectrum Management in Vehicular IoT Networks
by Adeel Iqbal, Ali Nauman and Tahir Khurshaid
Sensors 2025, 25(21), 6777; https://doi.org/10.3390/s25216777 - 5 Nov 2025
Cited by 1 | Viewed by 871
Abstract
The Vehicular Internet of Things (V-IoT) has emerged as a cornerstone of next-generation intelligent transportation systems (ITSs), enabling applications ranging from safety-critical collision avoidance and cooperative awareness to infotainment and fleet management. These heterogeneous services impose stringent quality-of-service (QoS) demands for latency, reliability, and fairness while competing for limited and dynamically varying spectrum resources. Conventional schedulers, such as round-robin or static priority queues, lack adaptability, whereas deep reinforcement learning (DRL) solutions, though powerful, remain computationally intensive and unsuitable for real-time roadside unit (RSU) deployment. This paper proposes a lightweight and interpretable reinforcement learning (RL)-based spectrum management framework for Vehicular Internet of Things (V-IoT) networks. Two enhanced Q-Learning variants are introduced: a Value-Prioritized Action Double Q-Learning with Constraints (VPADQ-C) algorithm that enforces reliability and blocking constraints through a Constrained Markov Decision Process (CMDP) with online primal–dual optimization, and a contextual Q-Learning with Upper Confidence Bound (Q-UCB) method that integrates uncertainty-aware exploration and a Success-Rate Prior (SRP) to accelerate convergence. A Risk-Aware Heuristic baseline is also designed as a transparent, low-complexity benchmark to illustrate the interpretability–performance trade-off between rule-based and learning-driven approaches. A comprehensive simulation framework incorporating heterogeneous traffic classes, physical-layer fading, and energy-consumption dynamics is developed to evaluate throughput, delay, blocking probability, fairness, and energy efficiency. The results demonstrate that the proposed methods consistently outperform conventional Q-Learning and Double Q-Learning methods. 
VPADQ-C achieves the highest energy efficiency (≈8.425 × 10⁷ bits/J) and reduces interruption probability by over 60%, while Q-UCB achieves the fastest convergence (within ≈190 episodes), lowest blocking probability (≈0.0135), and lowest mean delay (≈0.351 ms). Both schemes maintain fairness near 0.364, preserve throughput around 28 Mbps, and exhibit sublinear training-time scaling with O(1) per-update complexity and O(N²) overall runtime growth. Scalability analysis confirms that the proposed frameworks sustain URLLC-grade latency (<0.2 ms) and reliability under dense vehicular loads, validating their suitability for real-time, large-scale V-IoT deployments. Full article
(This article belongs to the Section Internet of Things)

35 pages, 10688 KB  
Article
Multi-Armed Bandit Optimization for Explainable AI Models in Chronic Kidney Disease Risk Evaluation
by Jianbo Huang, Long Li and Jia Chen
Symmetry 2025, 17(11), 1808; https://doi.org/10.3390/sym17111808 - 27 Oct 2025
Cited by 1 | Viewed by 999
Abstract
Chronic kidney disease (CKD) impacts over 850 million people globally, representing a critical public health issue, yet existing risk assessment methodologies inadequately address the complexity of disease progression trajectories. Traditional machine learning approaches encounter critical limitations including inefficient hyperparameter selection and lack of clinical transparency, hindering their deployment in healthcare settings. This study introduces an innovative computational framework that integrates adaptive Multi-Armed Bandit (MAB) strategies with BorderlineSMOTE sampling techniques to improve CKD risk assessment. The proposed methodology leverages XGBoost within an ensemble learning paradigm enhanced by Upper Confidence Bound exploration strategy, coupled with a comprehensive interpretability system incorporating SHAP and LIME analytical tools to ensure model transparency. To address the challenge of algorithmic interpretability while maintaining clinical utility, a four-level risk categorization framework was developed, employing cross-validated stratification methods and balanced performance evaluation metrics, thereby ensuring fair predictive accuracy across diverse patient populations and minimizing bias toward dominant risk categories. Through rigorous empirical evaluation on clinical datasets, we performed extensive comparative analysis against sixteen established algorithms using paired statistical testing with Bonferroni correction. The MAB-optimized framework achieved superior predictive performance with accuracy of 91.8%, F1-score of 91.0%, and ROC-AUC of 97.8%, demonstrating superior performance within the evaluated cohort of reference algorithms (p-value < 0.001). Remarkably, our optimized framework delivered nearly ten-fold computational efficiency gains relative to conventional grid search methods while preserving robust classification performance. 
Feature importance analysis identified albumin-to-creatinine ratio, eGFR measurements, and CKD staging as dominant prognostic factors, demonstrating concordance with established clinical nephrology practice. This research addresses three core limitations in healthcare artificial intelligence: optimization computational cost, model interpretability, and consistent performance across heterogeneous clinical populations, offering a practical solution for improved CKD risk stratification in clinical practice. Full article

36 pages, 6309 KB  
Article
Utilization of Upper Confidence Bound Algorithms for Effective Subproblem Selection in Cooperative Coevolution Frameworks
by Kyung-Soo Kim
Mathematics 2025, 13(18), 3052; https://doi.org/10.3390/math13183052 - 22 Sep 2025
Viewed by 696
Abstract
In cooperative coevolution (CC) frameworks, it is essential to identify the subproblems that can significantly contribute to finding the optimal solutions of the objective function. In traditional CC frameworks, subproblems are selected either sequentially or based on the degree of improvement in the fitness of the optimal solution. However, these classical methods have limitations in balancing between exploration and exploitation when selecting the subproblems. To overcome these weaknesses, we propose upper confidence bound (UCB)-based new subproblem selection methods for the CC frameworks. Our proposed methods utilize UCB algorithms to strike a balance between exploration and exploitation in subproblem selection, while also incorporating a non-stationary mechanism to account for the convergence of evolutionary algorithms. These strategies possess novel characteristics that distinguish our methods from existing approaches. In comprehensive experiments, the CC frameworks using our proposed subproblem selectors achieved remarkable optimization results when solving most benchmark functions comprised of 1000 interdependent variables. Thus, we found that our UCB-based subproblem selectors can significantly contribute to searching for optimal solutions in CC frameworks by elaborately balancing exploration and exploitation when selecting subproblems. Full article
(This article belongs to the Section E1: Mathematics and Computer Science)
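
The selection idea described in this abstract can be illustrated with the classical UCB1 rule. The sketch below is a generic bandit loop, not the paper's method: the non-stationary mechanism mentioned in the abstract is not reproduced, and treating each arm as a subproblem whose reward is a normalized fitness improvement in [0, 1] is an assumption made purely for illustration. All names are hypothetical.

```python
import math
import random


def ucb1_select(counts, sums, t, c=2.0):
    """Pick the arm maximizing mean + sqrt(c * ln(t) / n); untried arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_ucb1(reward_fns, rounds, seed=0):
    """Run UCB1 for a fixed number of rounds and return per-arm pull counts."""
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    sums = [0.0] * len(reward_fns)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        reward = reward_fns[arm](rng)
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Hypothetical stand-in for three subproblems: each "improves fitness"
# (reward 1.0) with a fixed probability, otherwise yields reward 0.0.
arms = [
    (lambda rng, p=p: 1.0 if rng.random() < p else 0.0)
    for p in (0.2, 0.5, 0.8)
]
pull_counts = run_ucb1(arms, rounds=5000)
print(pull_counts)
```

The confidence term shrinks as an arm is pulled more often, so rarely tried arms keep receiving occasional pulls while most of the budget goes to the arm with the best observed mean.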

55 pages, 29751 KB  
Article
Multi-Objective Combinatorial Optimization for Dynamic Inspection Scheduling and Skill-Based Team Formation in Distributed Solar Energy Infrastructure
by Mazin Alahmadi
Systems 2025, 13(9), 822; https://doi.org/10.3390/systems13090822 - 19 Sep 2025
Cited by 3 | Viewed by 1842
Abstract
Maintaining operational efficiency in distributed solar energy systems requires intelligent coordination of inspection tasks and workforce resources to handle diverse fault conditions. This study presents a bi-level multi-objective optimization framework that addresses two tightly coupled problems: dynamic job scheduling and skill-based team formation. The job scheduling component assigns geographically dispersed inspection tasks to mobile teams while minimizing multiple conflicting objectives, including travel distance, tardiness, and workload imbalance. Concurrently, the team formation component ensures that each team satisfies fault-specific skill requirements by optimizing team cohesion and compactness. To solve the bi-objective team formation problem, we propose HMOO-AOS, a hybrid algorithm integrating six metaheuristic operators under an NSGA-II framework with an Upper Confidence Bound-based Adaptive Operator Selection. Experiments on datasets of up to seven instances demonstrate statistically significant improvements (p < 0.05) in solution quality, skill coverage, and computational efficiency compared to NSGA-II, NSGA-III, and MOEA/D variants, with computational complexity O(G·N·(M + log N)) in time and O(N·L) in space. A cloud-integrated system architecture is also proposed to contextualize the framework within real-world solar inspection operations, supporting real-time data integration, dynamic rescheduling, and mobile workforce coordination. These contributions provide scalable, practical tools for solar operators, maintenance planners, and energy system managers, establishing a robust and adaptive approach to intelligent inspection planning in renewable energy operations. Full article
(This article belongs to the Special Issue Advances in Operations and Production Management Systems)

27 pages, 520 KB  
Article
QiMARL: Quantum-Inspired Multi-Agent Reinforcement Learning Strategy for Efficient Resource Energy Distribution in Nodal Power Stations
by Sapthak Mohajon Turjya, Anjan Bandyopadhyay, M. Shamim Kaiser and Kanad Ray
AI 2025, 6(9), 209; https://doi.org/10.3390/ai6090209 - 1 Sep 2025
Cited by 3 | Viewed by 3317
Abstract
The coupling of quantum computing with multi-agent reinforcement learning (MARL) provides an exciting direction to tackle intricate decision-making tasks in high-dimensional spaces. This work introduces a new quantum-inspired multi-agent reinforcement learning (QiMARL) model, utilizing quantum parallelism to achieve learning efficiency and scalability improvement. The QiMARL model is tested on an energy distribution task, which optimizes power distribution between generating and demanding nodal power stations. We compare the convergence time, reward performance, and scalability of QiMARL with traditional Multi-Armed Bandit (MAB) and Multi-Agent Reinforcement Learning methods, such as Greedy, Upper Confidence Bound (UCB), Thompson Sampling, MADDPG, QMIX, and PPO methods with a comprehensive ablation study. Our findings show that QiMARL yields better performance in high-dimensional systems, decreasing the number of training epochs needed for convergence while enhancing overall reward maximization. We also compare the algorithm’s computational complexity, indicating that QiMARL is more scalable to high-dimensional quantum environments. This research opens the door to future studies of quantum-enhanced reinforcement learning (RL) with potential applications to energy optimization, traffic management, and other multi-agent coordination problems. Full article
(This article belongs to the Special Issue Advances in Quantum Computing and Quantum Machine Learning)

24 pages, 1346 KB  
Article
Energy-Efficient Resource Allocation Scheme Based on Reinforcement Learning in Distributed LoRa Networks
by Ryota Ariyoshi, Aohan Li, Mikio Hasegawa and Tomoaki Ohtsuki
Sensors 2025, 25(16), 4996; https://doi.org/10.3390/s25164996 - 12 Aug 2025
Cited by 1 | Viewed by 1693
Abstract
The rapid growth of Long Range (LoRa) devices has led to network congestion, reducing spectrum and energy efficiency. To address this problem, we propose an energy-efficient reinforcement learning method for distributed LoRa networks, enabling each device to independently select appropriate transmission parameters, i.e., channel, transmission power (TP), and bandwidth (BW), based on acknowledgment (ACK) feedback and energy consumption. Our method employs the Upper Confidence Bound 1-tuned (UCB1-tuned) algorithm and incorporates energy metrics into the reward function, achieving lower power consumption and high transmission success rates. Designed to be lightweight for resource-constrained IoT devices, it was implemented on real LoRa hardware and tested in dense network scenarios. Experimental results show that the proposed method outperforms fixed allocation, adaptive data rate low-complexity (ADR-Lite), and ϵ-greedy methods in both transmission success rate and energy efficiency. Full article
(This article belongs to the Section Internet of Things)
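
For readers unfamiliar with UCB1-tuned, the sketch below shows the standard index from the bandit literature, which replaces the fixed exploration constant of UCB1 with a per-arm variance estimate capped at 1/4. The reward function is a hypothetical illustration of how ACK feedback and energy consumption might be combined into a value in [0, 1]; the abstract does not specify the authors' actual reward, so `energy_aware_reward` and its parameters are assumptions.

```python
import math


def ucb1_tuned_index(n_pulls, reward_sum, reward_sq_sum, t):
    """UCB1-tuned index for one arm, for rewards bounded in [0, 1].

    n_pulls: times this arm was chosen; reward_sum / reward_sq_sum: running
    sums of its rewards and squared rewards; t: total pulls so far.
    """
    mean = reward_sum / n_pulls
    log_term = math.log(t) / n_pulls
    # Empirical variance plus a confidence correction on the variance itself.
    variance_bound = reward_sq_sum / n_pulls - mean**2 + math.sqrt(2.0 * log_term)
    # 1/4 is the largest possible variance of a [0, 1]-valued reward.
    return mean + math.sqrt(log_term * min(0.25, variance_bound))


def energy_aware_reward(ack_received, energy_mj, max_energy_mj):
    """Hypothetical reward: 0 without an ACK, else larger for cheaper sends."""
    if not ack_received:
        return 0.0
    return 1.0 - min(energy_mj, max_energy_mj) / max_energy_mj


# One arm = one (channel, TP, BW) combination; the values are made up.
index = ucb1_tuned_index(n_pulls=10, reward_sum=5.0, reward_sq_sum=5.0, t=100)
print(round(index, 4))
```

Because the exploration bonus scales with the estimated variance, arms whose rewards are consistent are explored less than under plain UCB1, which suits devices with a small transmission budget.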

22 pages, 1271 KB  
Article
Modified Index Policies for Multi-Armed Bandits with Network-like Markovian Dependencies
by Abdalaziz Sawwan and Jie Wu
Network 2025, 5(1), 3; https://doi.org/10.3390/network5010003 - 29 Jan 2025
Viewed by 1892
Abstract
Sequential decision-making in dynamic and interconnected environments is a cornerstone of numerous applications, ranging from communication networks and finance to distributed blockchain systems and IoT frameworks. The multi-armed bandit (MAB) problem is a fundamental model in this domain that traditionally assumes independent and identically distributed (iid) rewards, which limits its effectiveness in capturing the inherent dependencies and state dynamics present in some real-world scenarios. In this paper, we lay a theoretical framework for a modified MAB model in which each arm’s reward is generated by a hidden Markov process. In our model, each arm undergoes Markov state transitions independent of play in a way that results in varying reward distributions and heightened uncertainty in reward observations. Each arm can have up to three states. A key challenge arises from the fact that the underlying states governing each arm’s rewards remain hidden at the time of selection. To address this, we adapt traditional index-based policies and develop a modified index approach tailored to accommodate Markovian transitions and enhance selection efficiency for our model. Our proposed Markovian Upper Confidence Bound (MC-UCB) policy achieves logarithmic regret. Comparative analysis with the classical UCB algorithm reveals that MC-UCB consistently achieves approximately a 15% reduction in cumulative regret. This work provides significant theoretical insights and lays a robust foundation for future research aimed at optimizing decision-making processes in complex, networked systems with hidden state dependencies. Full article

17 pages, 2767 KB  
Article
Adaptive Noise Exploration for Neural Contextual Multi-Armed Bandits
by Chi Wang, Lin Shi and Junru Luo
Algorithms 2025, 18(2), 56; https://doi.org/10.3390/a18020056 - 21 Jan 2025
Viewed by 2401
Abstract
In contextual multi-armed bandits, the relationship between contextual information and rewards is typically unknown, complicating the trade-off between exploration and exploitation. A common approach to address this challenge is the Upper Confidence Bound (UCB) method, which constructs confidence intervals to guide exploration. However, the UCB method becomes computationally expensive in environments with numerous arms and dynamic contexts. This paper presents an adaptive noise exploration framework to reduce computational complexity and introduces two novel algorithms: EAD (Exploring Adaptive Noise in Decision-Making Processes) and EAP (Exploring Adaptive Noise in Parameter Spaces). EAD injects adaptive noise into the reward signals based on arm selection frequency, while EAP adds adaptive noise to the hidden layer of the neural network for more stable exploration. Experimental results on recommendation and classification tasks show that both algorithms significantly surpass traditional linear and neural methods in computational efficiency and overall performance. Full article
(This article belongs to the Section Algorithms for Multidisciplinary Applications)

20 pages, 351 KB  
Article
Multilevel Constrained Bandits: A Hierarchical Upper Confidence Bound Approach with Safety Guarantees
by Ali Baheri
Mathematics 2025, 13(1), 149; https://doi.org/10.3390/math13010149 - 3 Jan 2025
Cited by 7 | Viewed by 6583
Abstract
The multi-armed bandit (MAB) problem is a foundational model for sequential decision-making under uncertainty. While MAB has proven valuable in applications such as clinical trials and online advertising, traditional formulations have limitations; specifically, they struggle to handle three key real-world scenarios: (1) when decisions must follow a hierarchical structure (as in autonomous systems where high-level strategy guides low-level actions); (2) when there are constraints at multiple levels of decision-making (such as both system-wide and component-level resource limits); and (3) when available actions depend on previous choices or context. To address these challenges, we introduce the hierarchical constrained bandits (HCB) framework, which extends contextual bandits to incorporate both hierarchical decisions and multilevel constraints. We propose the HC-UCB (hierarchical constrained upper confidence bound) algorithm to solve the HCB problem. The algorithm uses confidence bounds within a hierarchical setting to balance exploration and exploitation while respecting constraints at all levels. Our theoretical analysis establishes that HC-UCB achieves sublinear regret, guarantees constraint satisfaction at all hierarchical levels, and is near-optimal in terms of achievable performance. Simple experimental results demonstrate the algorithm’s effectiveness in balancing reward maximization with constraint satisfaction.

29 pages, 1715 KB  
Article
Multi-Armed Bandit Approaches for Location Planning with Dynamic Relief Supplies Allocation Under Disaster Uncertainty
by Jun Liang, Zongjia Zhang and Yanpeng Zhi
Smart Cities 2025, 8(1), 5; https://doi.org/10.3390/smartcities8010005 - 25 Dec 2024
Cited by 3 | Viewed by 2363
Abstract
Natural disasters (e.g., floods, earthquakes) significantly impact citizens, economies, and the environment worldwide. Due to their sudden onset, devastating effects, and high uncertainty, it is crucial for emergency departments to take swift action to minimize losses. Among these actions, planning the locations of relief supply distribution centers and dynamically allocating supplies is paramount, as governments must prioritize citizens’ safety and basic living needs following disasters. To address this challenge, this paper develops a three-layer emergency logistics network to manage the flow of emergency materials, from warehouses to transfer stations to disaster sites. A bi-objective, multi-period stochastic integer programming model is proposed to solve the emergency location, distribution, and allocation problem under uncertainty, focusing on three key decisions: transfer station selection, upstream emergency material distribution, and downstream emergency material allocation. We introduce a multi-armed bandit algorithm, named the Geometric Greedy algorithm, to optimize transfer station planning while accounting for subsequent dynamic relief supply distribution and allocation in a stochastic environment. The new algorithm is compared with two widely used multi-armed bandit algorithms: the ϵ-Greedy algorithm and the Upper Confidence Bound (UCB) algorithm. A case study in the Futian District of Shenzhen, China, demonstrates the practicality of our model and algorithms. The results show that the Geometric Greedy algorithm excels in both computational efficiency and convergence stability. This research offers valuable guidelines for emergency departments in optimizing the layout and flow of emergency logistics networks.
(This article belongs to the Section Applied Science and Humanities for Smart Cities)
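
For readers unfamiliar with the two baselines named in the abstract above, the following is a minimal sketch of textbook ϵ-Greedy and UCB1 arm selection on Bernoulli arms. It is not taken from the article: the function names, the shared `select(counts, means, t, rng)` signature, the exploration constant 2 in the UCB1 bonus, and the simulated arm probabilities are all illustrative assumptions. The article's Geometric Greedy algorithm is not reproduced here.

```python
import math
import random


def ucb1_select(counts, means, t, rng=None):
    """Textbook UCB1: play each arm once, then maximize mean + bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # unplayed arms are tried first
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def epsilon_greedy_select(counts, means, t, rng, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: means[a])


def run_bandit(select, probs, horizon, seed=0):
    """Simulate Bernoulli arms (assumed setup) and return pull counts."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    means = [0.0] * len(probs)
    for t in range(1, horizon + 1):
        arm = select(counts, means, t, rng)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the empirical mean reward.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts


if __name__ == "__main__":
    arm_probs = [0.2, 0.5, 0.8]  # hypothetical success probabilities
    print("UCB1 pulls:    ", run_bandit(ucb1_select, arm_probs, 2000))
    print("eps-greedy pulls:", run_bandit(epsilon_greedy_select, arm_probs, 2000))
```

The UCB1 bonus shrinks as an arm is pulled more often, so exploration is driven by uncertainty, whereas ϵ-Greedy explores at a fixed rate regardless of how much is already known about each arm.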
