Search Results (221)

Search Parameters:
Keywords = reward shaping

32 pages, 9226 KB  
Article
Regenerative–Frictional Brake Blending in Electric Vehicles Considering Energy Recovery and Dynamic Battery Charging Limit: A Reinforcement Learning-Based Approach
by Farshid Naseri, Bjartur Ragnarsson á Norði, Konstantinos Spiliotopoulos and Erik Schaltz
Machines 2026, 14(4), 416; https://doi.org/10.3390/machines14040416 - 9 Apr 2026
Abstract
This paper presents the design, development, and evaluation of a Reinforcement Learning (RL)-based torque-split controller for the regenerative braking system (RBS) in battery electric vehicles (BEVs). The controller employs a Deep Deterministic Policy Gradient (DDPG) agent to distribute the braking demand between the regenerative and frictional braking systems, with the aim of maximizing energy recovery while adhering to physical and operational constraints. To capture the charging limitation of the battery, a State-of-Power (SoP) calculation mechanism is incorporated, providing a time-varying bound on the regenerative charge power. The agent is trained in a MATLAB/Simulink environment representing a digital twin of a BEV drivetrain, covering a mix of braking scenarios: light, medium, hard, and emergency braking. The reward shaping promotes efficient utilization of the SoP-limited regenerative capability while discouraging constraint violations and aggressive control behavior. Across a range of State-of-Charge (SoC) conditions and driving cycles, including the Worldwide Harmonized Light Vehicles Test Procedure (WLTP) and a synthetic random-rich driving cycle, the RL controller consistently delivers promising performance, recovering up to ~98% of the total available braking energy on the WLTP type 3 driving cycle while operating close to the battery SoP limit. The results demonstrate the proposed controller's capability for adaptive, constraint-aware energy management in BEVs and underline its potential for future intelligent braking strategies.
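To make the shaping idea concrete: a reward of this kind typically trades off recovered energy against SoP violations and control aggressiveness. The following minimal Python sketch illustrates one plausible form; every name, signal, and coefficient is a hypothetical illustration, not taken from the paper.

```python
def braking_reward(p_regen_kw: float, p_sop_kw: float, p_brake_kw: float,
                   torque_rate: float, w_viol: float = 10.0, w_smooth: float = 0.1) -> float:
    """Hypothetical shaping reward for a regenerative/friction torque split.

    p_regen_kw  -- regenerative charge power commanded by the agent
    p_sop_kw    -- time-varying State-of-Power charging limit of the battery
    p_brake_kw  -- total braking power demanded by the driver
    torque_rate -- rate of change of braking torque (proxy for aggressiveness)
    """
    # Reward energy recovery as the fraction of braking demand regenerated.
    recovery = p_regen_kw / max(p_brake_kw, 1e-6)
    # Penalize commanding regen power beyond the SoP limit.
    violation = max(0.0, p_regen_kw - p_sop_kw)
    # Discourage aggressive control behavior.
    return recovery - w_viol * violation - w_smooth * abs(torque_rate)

# e.g., 40 kW regenerated out of a 50 kW demand under a 45 kW SoP limit
print(braking_reward(p_regen_kw=40.0, p_sop_kw=45.0, p_brake_kw=50.0, torque_rate=0.2))
```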

20 pages, 1083 KB  
Article
FGeo-ISRL: A MCTS-Enhanced Deep Reinforcement Learning System for Plane Geometry Problem-Solving via Inverse Search
by Yang Li, Xiaokai Zhang, Cheng Qin, Zhengyu Hu and Tuo Leng
Symmetry 2026, 18(4), 628; https://doi.org/10.3390/sym18040628 - 9 Apr 2026
Abstract
Geometric problem-solving has long been a major challenge in deductive reasoning and artificial intelligence. Symmetry is a defining characteristic of geometric shapes and properties, so applying symmetry principles to geometric reasoning is a natural choice. To address efficiency degradation and limited generalization, we propose FGeo-ISRL, a neural-symbolic inverse search framework whose core is the synergistic integration of a task-fine-tuned large language model and Monte Carlo Tree Search. Under the formal framework of FormalGeo, geometric theorems are iteratively applied, starting from the given conditions and the target conclusion, to infer the necessary supporting premises. The large language model serves simultaneously as a policy network and a value network, guiding theorem application decisions and evaluating intermediate proof states, while Monte Carlo Tree Search performs structured exploration over the state space, supporting both training-time policy refinement and inference-time online search. The reinforcement learning agent is trained with a hybrid reward scheme that combines immediate feedback from the value difference with a sparse success reward. Experiments demonstrate the effectiveness and correctness of FGeo-ISRL: it not only achieves a Single-Step Theorem Accuracy of 90.2% and a Geometric Problem-Solving Accuracy of 83.14%, but also ensures that every step of the proof process remains readable, verifiable, and traceable.
(This article belongs to the Section Computer)
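The hybrid reward described above (dense value-difference feedback plus a sparse success bonus) can be sketched in a few lines; the function name and constants below are illustrative assumptions, not the paper's implementation.

```python
def hybrid_reward(v_prev: float, v_curr: float, solved: bool,
                  beta: float = 1.0, success_bonus: float = 10.0) -> float:
    """Hypothetical hybrid reward: immediate feedback from the change in
    the value network's estimate of the proof state, plus a sparse bonus
    when the proof is completed."""
    dense = beta * (v_curr - v_prev)           # value-difference signal
    sparse = success_bonus if solved else 0.0  # sparse success reward
    return dense + sparse

# A step that raises the value estimate from 0.3 to 0.5 without finishing:
print(hybrid_reward(v_prev=0.3, v_curr=0.5, solved=False))  # 0.2
```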

10 pages, 218 KB  
Entry
Serious Video Games: Tools for Learning, Training and Health
by Caroline Hands
Encyclopedia 2026, 6(4), 83; https://doi.org/10.3390/encyclopedia6040083 - 6 Apr 2026
Viewed by 201
Definition
Serious video games are digital games designed for purposes beyond entertainment, typically to support education, training, health interventions, or behaviour change. They combine game mechanics with psychological and pedagogical principles, such as feedback, repetition, goal-setting, and scaffolding, to create interactive environments that facilitate learning, skill development, and sustained engagement. In many cases, they are built to simulate realistic tasks or decision contexts, allowing users to practise skills, test strategies, and learn from consequences in a low-risk setting. Within cyberpsychology, serious video games are particularly valuable because they provide structured digital contexts for examining how technology shapes cognition, emotion, motivation, and behaviour. They enable researchers and practitioners to observe how users respond to digital rewards, challenges, social features, and immersive environments, as well as how these features influence outcomes such as self-efficacy, persistence, attention, and emotion regulation. As a result, serious video games operate at the intersection of psychological theory, human–technology interaction, and applied digital intervention design. This entry provides an overview of their development, theoretical foundations, applications, effectiveness, and associated ethical considerations.
(This article belongs to the Collection Encyclopedia of Digital Society, Industry 5.0 and Smart City)
20 pages, 1088 KB  
Article
Users’ Perspectives of Bidirectional Charging in Public Environments
by Érika Martins Silva Ramos, Thomas Lindgren, Jonas Andersson and Jens Hagman
World Electr. Veh. J. 2026, 17(4), 176; https://doi.org/10.3390/wevj17040176 - 26 Mar 2026
Viewed by 365
Abstract
Technological advances such as Vehicle-to-Grid (V2G) have the potential to support renewable energy integration and grid stability, but large-scale deployment depends on users' willingness to participate, particularly in public charging environments. While prior research has examined V2G from technical feasibility and system-level perspectives, everyday public settings remain unexplored. This study investigates electric vehicle (EV) users' willingness to engage in V2G services in public spaces, with a focus on incentives, expectations, and how participation aligns with existing routines and parking conditions. A mixed-method approach was applied, combining a survey of 544 car users with two waves of user-centered interviews. The survey data were analyzed using factor analysis and linear regression models, while the interview data were thematically analyzed. The results show that users' evaluations of V2G are shaped by sustainability expectations, perceived efficiency, and uncertainties, and preferences for public V2G participation are strongly influenced by convenience, clarity of the offer, and perceived control. Home charging practices emerged as a key reference point shaping expectations of public V2G services. Across both methods, simple and transparent incentives, such as reduced charging or parking costs, were consistently preferred over more complex reward models, including point-based systems or dynamic energy trading. Concerns related to control over trips, battery degradation, trust in service providers, and added complexity remain important barriers to participation. The findings highlight the need for user-centered and socio-technical design of public V2G services that align with users' everyday routines, parking conditions, and expectations to support broader adoption beyond the home context.

34 pages, 63807 KB  
Article
Research on Path Planning Methods and Characteristics of Urban Unmanned Aerial Vehicles Under Noise Constraints
by Yaqing Chen, Yunfei Jin, Xin He and Yumei Zhang
Drones 2026, 10(3), 227; https://doi.org/10.3390/drones10030227 - 23 Mar 2026
Viewed by 349
Abstract
This study proposes TNAP-DDQN, a deep reinforcement learning method for urban low-altitude UAV path planning under residential noise threshold constraints. With time cost and safety risk as the optimization objectives, operational constraints such as collision risk and maximum above-ground-level (AGL) altitude are incorporated to achieve coordinated optimization of noise compliance, operational safety, and efficiency. To mitigate the action-space contraction and training instability induced by multiple constraints, a Noise-Degradation-Mask-based Action Bias Network (NDM-ABN) is introduced at the action selection layer: a three-tier degradation scheme prevents empty candidate sets, while bias-based decision making is applied to approximately tied actions to stabilize the policy. Moreover, multi-step prioritized experience replay (PER) improves sample efficiency and long-horizon return modeling, and potential-based reward shaping (PBRS) transforms sparse constraint signals into auxiliary rewards. Simulation results indicate that (1) NDM-ABN is the key module for stabilizing the noise-exposure process by suppressing high-noise actions; (2) the required AGL altitude depends on the UAV source noise level and local noise limits, implying the need for differentiated AGL altitude classes; and (3) the maximum admissible UAV source noise level increases as the threshold is relaxed. The proposed method provides quantitative guidance for noise-entry and AGL altitude regulation; future work will incorporate additional metrics (e.g., A-weighted equivalent sound level) to better capture noise fluctuations and short-term peaks.
(This article belongs to the Section Innovative Urban Mobility)
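Potential-based reward shaping (PBRS), used above to densify sparse constraint signals, has the standard form F(s, s') = γΦ(s') − Φ(s), which is known to leave the optimal policy unchanged (Ng et al., 1999). A minimal sketch with a hypothetical noise-margin potential (not the paper's actual Φ):

```python
GAMMA = 0.99

def potential(noise_margin_db: float) -> float:
    """Hypothetical potential: a larger margin below the residential
    noise threshold is better."""
    return noise_margin_db

def shaped_reward(r_env: float, margin_prev_db: float, margin_curr_db: float) -> float:
    # PBRS term F = gamma * phi(s') - phi(s) added to the environment reward.
    return r_env + GAMMA * potential(margin_curr_db) - potential(margin_prev_db)

# Moving from a 2 dB margin to a 5 dB margin earns a positive shaping bonus:
print(shaped_reward(r_env=0.0, margin_prev_db=2.0, margin_curr_db=5.0))
```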

48 pages, 4538 KB  
Review
Beyond Sensory Properties: Molecular Interactions of Antioxidant Flavour-Active Polyphenols Across the Food-Oral-Gut Axis
by Inês M. Ferreira, Sara A. Martins, Leonor Gonçalves, Mónica Jesus, Elsa Brandão and Susana Soares
Antioxidants 2026, 15(3), 397; https://doi.org/10.3390/antiox15030397 - 21 Mar 2026
Viewed by 646
Abstract
Dietary antioxidants are widely valued for their potential health benefits, but incorporating them into functional foods is not straightforward. Polyphenols are among the most abundant and important antioxidants in foods, and this review focuses on them because the same structural features linked to their health-promoting effects can also cause pronounced bitterness and astringency, ultimately limiting consumer acceptance. This review examines how these challenges are interconnected across three levels: food matrix interactions, bioavailability, and consumer psychobiology. We describe how non-covalent interactions between polyphenols, proteins, and polysaccharides can have both positive and negative effects. While these interactions may alter oral lubrication and flavour release, they also protect highly reactive bioactive compounds from gastric degradation. Furthermore, we broaden the concept of bioavailability by exploring the microbiota-mediated “colonic rescue” of polyphenols that are not released during earlier digestion. We also highlight the role of extraoral bitter taste receptors (TAS2Rs) along the gastrointestinal (GI) tract. Activation of these receptors during digestion can trigger relevant metabolic and endocrine responses, indicating that systemic absorption is not the only pathway to bioactivity. Finally, we connect these mechanisms to individual differences in food acceptance, showing that genetic factors (e.g., TAS2R38 and the salivary proteome) and psychological traits (such as neophobia and reward sensitivity) can shape rejection or flavour-nutrient learning. Overall, the successful development of functional foods will require a “sensory-by-design” approach. This strategy utilises matrix interactions strategically to improve both consumer acceptance and physiological efficacy.
(This article belongs to the Section Natural and Synthetic Antioxidants)

26 pages, 21346 KB  
Article
A Load-Balancing-Aware Learning Framework for Collaborative UAV-MEC Computation Offloading
by Huafeng Li, Yuxuan Wang, Hengming Liu, Jiaxuan Li, Xu Wang, Qun Lei, Ke Xiao and Hongliang Zhu
Sensors 2026, 26(6), 1920; https://doi.org/10.3390/s26061920 - 18 Mar 2026
Viewed by 324
Abstract
Unmanned Aerial Vehicle (UAV) computing clusters face severe operational constraints due to limited computing capabilities and battery capacities, which complicate the simultaneous optimization of low offloading latency, long task endurance, and high cluster efficiency. To address these challenges, this paper proposes a Multi-Objective Reinforcement Learning framework based on Latency and Power Balance (MORL-LAPB). Instead of broad situational awareness descriptions, our framework directly combines a reward-shaping reinforcement learning algorithm with an evolutionary mechanism to construct a closed-loop optimization paradigm. Crucially, in this context, 'balancing' extends beyond traditional computational workload distribution; it represents a joint optimization that balances task allocation to ensure short service delays while simultaneously equating the energy depletion rates across UAV nodes to maximize overall cluster efficiency and operational duration. By efficiently identifying Pareto-optimal trade-offs, MORL-LAPB dynamically regulates UAV energy allocation and computational resource scheduling. Experimental results demonstrate that, compared to RSO, NSO, and DRLSO baselines, the proposed MORL-LAPB significantly reduces offloading latency, extends effective task execution duration, and improves cluster energy efficiency. The framework offers flexible adaptability and long-term sustainability for diverse operational scenarios under strict multi-objective constraints.
(This article belongs to the Special Issue Communications and Networking Based on Artificial Intelligence)
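The 'balancing' notion above (short delays plus equalized energy depletion across nodes) amounts to a scalarized multi-objective reward; the sketch below shows one plausible scalarization, with all names and weights assumed rather than taken from MORL-LAPB.

```python
import numpy as np

def balance_reward(latencies_s, battery_levels, w_lat: float = 1.0, w_bal: float = 1.0) -> float:
    """Hypothetical scalarization: penalize mean offloading latency and the
    spread of remaining battery energy across UAV nodes."""
    latency_cost = float(np.mean(latencies_s))
    imbalance_cost = float(np.std(battery_levels))  # equalize depletion rates
    return -(w_lat * latency_cost + w_bal * imbalance_cost)

# Three UAVs: modest latencies, slightly uneven batteries.
print(balance_reward([0.12, 0.30, 0.18], [0.80, 0.75, 0.82]))
```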

24 pages, 4975 KB  
Article
Disturbance Observer-Based Actor–Critic Reinforcement Learning with Adaptive Reward for Energy-Efficient Control of Robotic Manipulators
by Le Thi Minh Tam, Nguyen Viet Ngu, Duc Hung Pham and V. T. Mai
Actuators 2026, 15(3), 167; https://doi.org/10.3390/act15030167 - 16 Mar 2026
Viewed by 360
Abstract
Reinforcement learning controllers for robot manipulators depend strongly on reward tuning, and fixed weights may yield poor trade-offs under uncertainty and disturbances. This paper proposes a disturbance observer-based actor–critic RL (DOB–ACRL) scheme with adaptive multi-objective reward shaping for a torque-saturated 2-DOF manipulator, in which the reward weights are updated online using normalized indicators of tracking error, control energy, and control effort. A Lyapunov analysis guarantees the uniform ultimate boundedness of the closed-loop signals. Simulations show improved learning and performance over a static-reward actor–critic baseline, reducing the RMS tracking error by up to 22.8%, the control energy by ~4.6%, the control effort by 1.9%, and the settling time by up to 29.2%.
(This article belongs to the Section Actuators for Robotics)
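One way to realize the online weight update described above is to increase the weight of whichever normalized indicator is currently worst and renormalize; the rule and constants below are illustrative assumptions, not the paper's update law.

```python
import numpy as np

def update_reward_weights(weights, indicators, lr: float = 0.05) -> np.ndarray:
    """Hypothetical adaptive reward weighting from normalized indicators of
    tracking error, control energy, and control effort (each in [0, 1])."""
    w = np.asarray(weights, dtype=float) + lr * np.asarray(indicators, dtype=float)
    return w / w.sum()  # keep the weights on the simplex

# A large tracking error shifts weight toward the tracking objective:
print(update_reward_weights([1/3, 1/3, 1/3], indicators=[0.9, 0.2, 0.1]))
```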

17 pages, 566 KB  
Article
Analyst-of-Record: A Proof-of-Concept for Influence-Based Analyst Credit Assignment in Human-Feedback Decision Support
by Devon L. Brown and Danda B. Rawat
Electronics 2026, 15(6), 1210; https://doi.org/10.3390/electronics15061210 - 13 Mar 2026
Viewed by 329
Abstract
The purpose of this study is to examine whether analyst-level credit can be assigned quantitatively in a lightweight human-feedback decision-support pipeline. In intelligence and national security workflows, analysts often provide edits, comments, and evaluative feedback during the production of analytic products, yet these intermediate contributions are usually discarded, leaving no auditable record of how individual feedback shaped the final output. To address this problem, this study proposes a proof-of-concept Analyst-of-Record framework that combines synthetic analyst feedback, a linear ridge reward model, first-order influence functions, and additive Shapley aggregation to estimate both feedback-item and analyst-level contribution scores. The research design uses the Fact Extraction and VERification (FEVER) fact-verification dataset under controlled experimental settings. The pipeline retrieves evidence with Best Matching 25 (BM25), generates a grounded template-based response, derives three synthetic analyst feedback channels from FEVER annotations, trains a reward model on simple claim–answer and analyst-identity features, and aggregates per-feedback influence scores into an Analyst Contribution Index (ACI). The main experiments are conducted on a 500-claim subset across five random seeds, with additional ablation and bootstrap analyses used to assess sensitivity and stability. The findings show that the reward model achieves a mean validation R² of 0.801 ± 0.037, indicating that the synthetic feedback signals are learnable under the selected featurization. The analyst-level contribution scores remain stable across random seeds, with approximately half of the total influence magnitude attributed to the explanation-quality channel and the remainder split across the other two channels. Ablation results further show that removing the explanation-quality channel collapses the validation fit, while bootstrap resampling demonstrates tight concentration of absolute ACI magnitudes. Theoretically, this study extends attribution research beyond document-only grounding by showing how analyst feedback itself can be modeled as an object of contribution analysis. It also demonstrates that influence functions and Shapley-style aggregation can be adapted into a tractable framework for estimating interpretable analyst-level credit in a reproducible experimental setting. Practically, the proposed framework offers an initial foundation for more traceable and accountable decision-support workflows in which intermediate analyst contributions can be preserved rather than lost. The results also provide a feasible implementation path for future systems that incorporate stronger generators, richer evidence representations, and real analyst annotations.
(This article belongs to the Section Computer Science & Engineering)
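The influence-then-aggregate pipeline can be miniaturized as follows. This sketch uses exact leave-one-out retraining of a ridge reward model in place of the paper's first-order influence approximation, and sums absolute per-item influence per analyst as an ACI-style score; the data, features, and names are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([0.5, -0.2, 0.8, 0.1]) + rng.normal(0.0, 0.1, 60)
analyst = rng.integers(0, 3, size=60)        # which analyst produced each item
Xv, yv = X[:15], y[:15]                      # validation split
Xt, yt, at = X[15:], y[15:], analyst[15:]    # training split

def val_loss(Xtr, ytr) -> float:
    return mean_squared_error(yv, Ridge(alpha=1.0).fit(Xtr, ytr).predict(Xv))

base = val_loss(Xt, yt)
# Influence of each feedback item = change in validation loss on removal
# (computed exactly here; first-order influence functions scale better).
influence = np.array([val_loss(np.delete(Xt, i, 0), np.delete(yt, i)) - base
                      for i in range(len(yt))])
# Additive aggregation into analyst-level contribution scores.
aci = {int(a): float(np.abs(influence[at == a]).sum()) for a in np.unique(at)}
print(aci)
```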

25 pages, 2560 KB  
Article
Statistical Reward Shaping for Reinforcement Learning in Bipedal Locomotion
by Shuhan Yan, Chuan Chen, Xinliang Zhou and Jiaping Xiao
Electronics 2026, 15(6), 1203; https://doi.org/10.3390/electronics15061203 - 13 Mar 2026
Viewed by 481
Abstract
Achieving stable bipedal locomotion for humanoid robots remains a central challenge in reinforcement learning (RL), in which the design of reward functions is pivotal but non-trivial. This paper proposes a three-tier statistical reward shaping framework to optimize bipedal gait learning. First, training outcomes are diagnostically monitored using forward distance, fall rate, and posture score. Pearson correlation and regression analyses are then employed to identify trade-offs and isolate the direct effects of reward components. Finally, targeted parameter sweeps enable directionally guided optimization, substantially reducing heuristic parameter tuning while refining a reward function for the H1 robot in Isaac Lab. Experimental results demonstrate clear improvements over the baseline. The optimized policy reduces convergence time by 14% and increases forward distance by 186%. Stability is markedly enhanced, with fall rate decreasing from 75% to 2% and active locomotion efficiency nearly doubling (0.339 to 0.678). These results validate a reproducible, data-driven framework for reward design, highlighting the importance of principled statistical analysis in complex RL-based humanoid locomotion.
(This article belongs to the Special Issue Advances in Intelligent Computing and Systems Design)
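The second tier of the framework, correlating reward components with diagnostic metrics, is straightforward to reproduce in miniature; the logged values below are invented for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical sweep: a posture-reward weight varied across six runs,
# with the resulting diagnostic metrics.
posture_weight = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
forward_dist_m = np.array([4.1, 5.0, 5.8, 5.5, 4.9, 4.2])
fall_rate      = np.array([0.60, 0.35, 0.10, 0.08, 0.12, 0.30])

# Correlate the reward component with each outcome to expose trade-offs
# before committing to a targeted parameter sweep.
for name, metric in [("forward distance", forward_dist_m), ("fall rate", fall_rate)]:
    r, p = pearsonr(posture_weight, metric)
    print(f"posture weight vs {name}: r = {r:+.2f} (p = {p:.3f})")
```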

33 pages, 1249 KB  
Article
Degradation-Aware Learning-Based Control for Residential PV–Battery Systems
by Ahmed Chiheb Ammari
Energies 2026, 19(6), 1434; https://doi.org/10.3390/en19061434 - 12 Mar 2026
Viewed by 329
Abstract
Residential photovoltaic (PV)–battery systems are increasingly deployed to reduce electricity costs under time-of-use and demand-charge tariffs, yet their economic value depends critically on how storage is operated over time. Effective control must simultaneously address short-term energy costs, peak-demand exposure, and long-term battery degradation, all under substantial uncertainty in load and PV generation. While optimization-based approaches can achieve strong performance with accurate forecasts, they are sensitive to forecast errors, whereas learning-based methods often neglect degradation effects or deplete the battery prematurely, leading to suboptimal peak-shaving behavior. This paper proposes a forecast-free, degradation-aware reinforcement learning (RL) framework for residential PV–battery energy management that jointly addresses demand-charge mitigation and battery aging. The proposed controller internalizes both calendar aging and rainflow-based cycling degradation within its objective and incorporates demand-aware reward shaping with time-varying penalties on on-peak grid imports. In addition, a complementary state-of-charge reserve mechanism discourages premature battery depletion and improves responsiveness to late on-peak demand surges, despite the absence of explicit load or PV forecasts. Physical feasibility is guaranteed through an execution-time safety layer that enforces all device and operational constraints by construction. The proposed framework is evaluated on high-resolution residential datasets and compared against optimization-based baselines, including a day-ahead scheduler with perfect foresight and a receding-horizon MPC controller using short-horizon forecasts. Overall, the results show that the proposed RL controller substantially reduces demand charges and total electricity costs relative to forecast-based MPC while maintaining degradation-aware operation, demonstrating the potential of forecast-free reinforcement learning as a practical control strategy for residential PV–battery systems under demand-charge tariffs.
(This article belongs to the Section A: Sustainable Energy)
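The demand-aware shaping and SoC reserve described above might be combined as in this sketch; the tariff, on-peak window, reserve level, and the simple clipping safety layer are all assumptions for illustration, not the paper's configuration.

```python
def pv_battery_reward(grid_import_kw: float, hour: int, soc: float,
                      price_kwh: float = 0.25, peak_penalty: float = 0.5,
                      soc_reserve: float = 0.4, reserve_penalty: float = 1.0) -> float:
    """Hypothetical shaped reward: energy cost, a time-varying penalty on
    on-peak grid imports, and a reserve term discouraging depletion
    before an assumed 16:00-21:00 on-peak window."""
    on_peak = 16 <= hour < 21
    cost = price_kwh * max(grid_import_kw, 0.0)
    peak = peak_penalty * grid_import_kw if on_peak and grid_import_kw > 0 else 0.0
    reserve = reserve_penalty * max(0.0, soc_reserve - soc) if hour < 16 else 0.0
    return -(cost + peak + reserve)

def safety_layer(p_batt_kw: float, p_max_kw: float = 5.0) -> float:
    # Execution-time clipping enforces device power limits by construction.
    return min(max(p_batt_kw, -p_max_kw), p_max_kw)

print(pv_battery_reward(grid_import_kw=2.0, hour=18, soc=0.5))  # on-peak import penalized
```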

39 pages, 67440 KB  
Article
LLM-TOC: LLM-Driven Theory-of-Mind Adversarial Curriculum for Multi-Agent Generalization
by Chenxu Wang, Jiang Yuan, Tianqi Yu, Xinyue Jiang, Liuyu Xiang, Junge Zhang and Zhaofeng He
Mathematics 2026, 14(5), 915; https://doi.org/10.3390/math14050915 - 8 Mar 2026
Viewed by 469
Abstract
Zero-shot generalization to out-of-distribution (OOD) teammates and opponents in multi-agent systems (MASs) remains a fundamental challenge for general-purpose AI, especially in open-ended interaction scenarios. Existing multi-agent reinforcement learning (MARL) paradigms, such as self-play and population-based training, often collapse to a limited subset of Nash equilibria, leaving agents brittle when faced with semantically diverse, unseen behaviors. Recent approaches that invoke Large Language Models (LLMs) at run time can improve adaptability but introduce substantial latency and can become less reliable as task horizons grow; in contrast, LLM-assisted reward-shaping methods remain constrained by the inefficiency of the inner reinforcement-learning loop. To address these limitations, we propose LLM-TOC (LLM-Driven Theory-of-Mind Adversarial Curriculum), which casts generalization as a bi-level Stackelberg game: in the inner loop, a MARL agent (the follower) minimizes regret against a fixed population, while in the outer loop, an LLM serves as a semantic oracle that generates executable adversarial or cooperative strategies in a Turing-complete code space to maximize the agent's regret. To cope with the absence of gradients in discrete code generation, we introduce Gradient Saliency Feedback, which transforms pixel-level value fluctuations into semantically meaningful causal cues to steer the LLM toward targeted strategy synthesis. We further provide motivating theoretical analysis via the PAC-Bayes framework, showing that LLM-TOC converges at rate O(1/K) and yields a tighter generalization error bound than parameter-space exploration under reasonable preconditions. Experiments on the Melting Pot benchmark demonstrate that, with expected cumulative collective return as the core zero-shot generalization metric, LLM-TOC consistently outperforms self-play baselines (IPPO and MAPPO) and the LLM-inference method Hypothetical Minds across all held-out test scenarios, reaching 75% to 85% of the upper-bound performance of Oracle PPO. Meanwhile, taking the number of RL environment interaction steps needed to reach a target relative performance as the core efficiency metric, our framework reduces total training computational cost by more than 60% compared with mainstream baselines.
(This article belongs to the Special Issue Applications of Intelligent Game and Reinforcement Learning)
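The bi-level structure (an inner follower minimizing regret against a fixed population, an outer leader proposing strategies that maximize it) reduces to a simple loop skeleton. The sketch below only illustrates that control flow; the strategy objects, returns, and the LLM synthesis step are stubbed placeholders.

```python
def regret(agent_return: float, best_response_return: float) -> float:
    """Follower's regret: gap to the best achievable return."""
    return best_response_return - agent_return

population = []                              # strategies generated so far
for k in range(3):                           # outer (leader) iterations
    # Outer loop: an LLM would synthesize executable adversarial or
    # cooperative strategy code here; we append a trivial stand-in.
    population.append(lambda obs, k=k: k)
    # Inner loop: MARL training against the fixed population (stubbed
    # with illustrative returns).
    agent_return, best_response_return = 8.0 + k, 10.0
    print(f"iteration {k}: regret = {regret(agent_return, best_response_return):.1f}")
```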

24 pages, 525 KB  
Systematic Review
Gender Diversity and Psychosocial Work Risks from a Non-Binary Perspective: A Systematic Review
by Abel Perez-Gonzalez, Ferdinando Tuscani, Raul Pelagaggi and Mohamed Nasser
Merits 2026, 6(1), 6; https://doi.org/10.3390/merits6010006 - 27 Feb 2026
Viewed by 545
Abstract
This systematic review examines how gender shapes exposure to and experiences of psychosocial risks in the workplace. Drawing on 89 empirical studies published between 2010 and 2024, the review synthesizes evidence from occupational health psychology, gender studies, and organizational research. Searches were conducted in PubMed, Web of Science, Scopus, CINAHL, and PsycINFO, and included empirical studies published in English and Spanish. Following PRISMA guidelines, a qualitative thematic synthesis was conducted to integrate findings across diverse sectors, populations, and methodological approaches. The evidence reveals persistent gendered patterns in psychosocial risk exposure and outcomes: women are more frequently exposed to emotionally demanding and relational forms of work and report poorer mental health outcomes; men more often experience performance-driven strain linked to workload, competition, and reward insecurity; and transgender and non-binary workers face additional psychosocial burdens associated with stigma, discrimination, and minority stress. Across the literature, structural and cultural determinants, such as occupational segregation, unequal recognition, and gendered organizational norms, emerge as central mechanisms underlying these disparities. Theoretical frameworks including effort–reward imbalance, demand–control, work–family conflict, organizational climate, and minority stress collectively help explain how gendered psychosocial risks are produced and sustained. Overall, the review underscores the need to move beyond individualistic and binary models of psychosocial risk toward gender-responsive approaches that account for structural, relational, and identity-based dimensions of work, thereby informing research and organizational strategies aimed at promoting equitable and sustainable well-being at work.

38 pages, 16228 KB  
Article
Deep Q-Network Agents for Game Playing: Systematic Evaluation Across Eight Benchmark and Custom Environments
by Časlav Livada, Marko Duka, Tomislav Keser and Krešimir Nenadić
Electronics 2026, 15(5), 958; https://doi.org/10.3390/electronics15050958 - 26 Feb 2026
Viewed by 493
Abstract
Deep Q-Networks (DQNs) have achieved strong performance across a range of benchmark tasks; however, their reliability under varying reward structures and planning horizons remains insufficiently characterized. This study presents a systematic cross-environment analysis of DQN agents evaluated across eight environments spanning simple control, arcade, and strategic domains. Rather than pursuing state-of-the-art performance, the objective is to investigate structural conditions under which standard value-based reinforcement learning succeeds, degrades, or fails. Across controlled experiments with consistent training budgets and statistical validation, three recurring failure patterns are identified: (i) sparse-reward exploration failure, (ii) reward exploitation without functional task competence, and (iii) strategic planning limitations in long-horizon or adversarial environments. Within-environment ablation studies further demonstrate that moderate network scaling (2–4× parameter increases) does not significantly alter learning outcomes when reward functions remain unchanged, suggesting that reward alignment and task horizon dominate architectural capacity as determinants of performance. The results provide a structured diagnostic perspective on DQN reliability, clarify the limits of reward shaping in complex environments, and offer practical guidance for identifying when standard value-based methods are likely to become unstable or insufficient.
(This article belongs to the Special Issue Machine/Deep Learning Applications and Intelligent Systems)
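For reference, the value-based update whose failure modes the study catalogs is the standard DQN bootstrapped target; this minimal PyTorch sketch (not the paper's code) shows the loss computation.

```python
import torch

def dqn_loss(q_net, target_net, batch, gamma: float = 0.99) -> torch.Tensor:
    """Standard DQN TD loss. `batch` holds tensors s, a, r, s2, done."""
    with torch.no_grad():
        # Bootstrap from the max next-state value of the frozen target net.
        q_next = target_net(batch["s2"]).max(dim=1).values
        target = batch["r"] + gamma * (1.0 - batch["done"]) * q_next
    q_sa = q_net(batch["s"]).gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    return torch.nn.functional.mse_loss(q_sa, target)

net = torch.nn.Sequential(torch.nn.Linear(4, 2))   # toy 4-dim state, 2 actions
batch = {"s": torch.randn(8, 4), "a": torch.randint(0, 2, (8,)),
         "r": torch.randn(8), "s2": torch.randn(8, 4), "done": torch.zeros(8)}
print(dqn_loss(net, net, batch))
```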

19 pages, 3606 KB  
Article
Autonomous Navigation of an Unmanned Underwater Vehicle via Safe Reinforcement Learning and Active Disturbance Rejection Control
by Qinze Chen, Yun Cheng, Yinlong Yuan and Liang Hua
J. Mar. Sci. Eng. 2026, 14(5), 425; https://doi.org/10.3390/jmse14050425 - 25 Feb 2026
Viewed by 385
Abstract
A two-layer control framework for unmanned underwater vehicle (UUV) navigation is proposed, combining a lower-layer active disturbance rejection controller (ADRC) with an upper-layer safe reinforcement learning (RL) policy for obstacle-avoidance navigation. The lower layer, utilizing ADRC, ensures high tracking accuracy and effective disturbance rejection, while the upper layer integrates the twin delayed deep deterministic policy gradient (TD3) algorithm, combined with a control barrier function (CBF)-based quadratic programming (QP) safety filter and safety-inspired reward shaping (SR). The method is evaluated in two simulation studies: (i) velocity and attitude control to assess tracking and disturbance rejection, and (ii) obstacle-avoidance navigation to assess learning efficiency, trajectory smoothness, and safety-related metrics. Simulation results show that ADRC achieves faster tracking and stronger disturbance rejection than a conventional proportional–integral–derivative (PID) controller. Moreover, the proposed TD3 + QP + SR scheme exhibits faster learning, smoother trajectories, and improved safety performance compared with RL baselines. These results indicate that the proposed framework enables efficient and safe UUV navigation in simulation scenarios with obstacles and disturbances.
(This article belongs to the Section Ocean Engineering)
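When the CBF condition yields a single affine constraint on the control, the QP safety filter min ||u − u_RL||² s.t. g·u ≥ b reduces to a halfspace projection with a closed form; the sketch below assumes that single-constraint case (the names and the reduction are ours, not the paper's formulation).

```python
import numpy as np

def cbf_qp_filter(u_rl: np.ndarray, g: np.ndarray, b: float) -> np.ndarray:
    """Minimally invasive safety filter: solve
        min ||u - u_rl||^2  subject to  g @ u >= b,
    where the constraint encodes h_dot(x, u) >= -alpha * h(x) for
    control-affine dynamics."""
    if g @ u_rl >= b:
        return u_rl                              # RL action already safe
    # Otherwise project onto the boundary of the safe halfspace.
    return u_rl + (b - g @ u_rl) / (g @ g) * g

u_safe = cbf_qp_filter(u_rl=np.array([1.0, 0.5]), g=np.array([0.0, 1.0]), b=0.8)
print(u_safe)  # second component raised to the constraint boundary: [1.0, 0.8]
```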
