Search Results (331)

Search Parameters:
Keywords = spatiotemporal feature fusion

15 pages, 1780 KB  
Article
Prosodic Spatio-Temporal Feature Fusion with Attention Mechanisms for Speech Emotion Recognition
by Kristiawan Nugroho, Imam Husni Al Amin, Nina Anggraeni Noviasari and De Rosal Ignatius Moses Setiadi
Computers 2025, 14(9), 361; https://doi.org/10.3390/computers14090361 - 31 Aug 2025
Viewed by 215
Abstract
Speech Emotion Recognition (SER) plays a vital role in supporting applications such as healthcare, human–computer interaction, and security. However, many existing approaches still face challenges in achieving robust generalization and maintaining high recall, particularly for emotions related to stress and anxiety. This study proposes a dual-stream hybrid model that combines prosodic features with spatio-temporal representations derived from the Multitaper Mel-Frequency Spectrogram (MTMFS) and the Constant-Q Transform Spectrogram (CQTS). Prosodic cues, including pitch, intensity, jitter, shimmer, HNR, pause rate, and speech rate, were processed using dense layers, while MTMFS and CQTS features were encoded with CNN and BiGRU. A Multi-Head Attention mechanism was then applied to adaptively fuse the two feature streams, allowing the model to focus on the most relevant emotional cues. Evaluations conducted on the RAVDESS dataset with subject-independent 5-fold cross-validation demonstrated an accuracy of 97.64% and a macro F1-score of 0.9745. These results confirm that combining prosodic and advanced spectrogram features with attention-based fusion improves precision, recall, and overall robustness, offering a promising framework for more reliable SER systems.
(This article belongs to the Special Issue Multimodal Pattern Recognition of Social Signals in HCI (2nd Edition))
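Editor's note: as an illustration of the attention-based fusion the abstract describes, here is a minimal PyTorch sketch (not the authors' implementation). A seven-value prosodic vector is encoded by dense layers, a single spectrogram by a small CNN plus BiGRU, and the prosodic embedding then attends over the spectrogram time steps via multi-head attention. All layer sizes, the single-spectrogram simplification, and the class count are assumptions.

```python
import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    """Hypothetical sketch of prosodic + spectrogram fusion via attention."""
    def __init__(self, n_prosodic=7, d_model=128, n_heads=4, n_classes=8):
        super().__init__()
        # Prosodic stream: dense layers over pitch, intensity, jitter, etc.
        self.prosodic = nn.Sequential(
            nn.Linear(n_prosodic, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # Spectrogram stream: CNN over (freq, time), then BiGRU over time.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.bigru = nn.GRU(64 * 16, d_model // 2, batch_first=True,
                            bidirectional=True)  # assumes 64 freq bins -> 16 after pooling
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, prosodic, spec):               # spec: (B, 1, 64, T)
        p = self.prosodic(prosodic).unsqueeze(1)     # (B, 1, d)
        f = self.cnn(spec)                           # (B, 64, 16, T/4)
        f = f.permute(0, 3, 1, 2).flatten(2)         # (B, T/4, 64*16)
        s, _ = self.bigru(f)                         # (B, T/4, d)
        # Prosodic query attends over spectrogram time steps.
        fused, _ = self.fusion(p, s, s)              # (B, 1, d)
        return self.head(fused.squeeze(1))

logits = DualStreamFusion()(torch.randn(2, 7), torch.randn(2, 1, 64, 100))
```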

21 pages, 3121 KB  
Article
An Interpretable Stacked Ensemble Learning Framework for Wheat Storage Quality Prediction
by Xinze Li, Wenyue Wang, Bing Pan, Siyu Zhu, Junhui Zhang, Yunzhao Ma, Hongpeng Guo, Zhe Liu, Wenfu Wu and Yan Xu
Agriculture 2025, 15(17), 1844; https://doi.org/10.3390/agriculture15171844 - 29 Aug 2025
Viewed by 172
Abstract
Accurate prediction of wheat storage quality is essential for ensuring storage safety and providing early warnings of quality deterioration. However, existing methods focus solely on storage environmental conditions, neglecting the spatial distribution of temperature within grain piles, lacking interpretability, and generally failing to provide reliable forecasts of future quality changes. To overcome these challenges, an interpretable prediction framework for wheat storage quality based on stacked ensemble learning is proposed. Three key features, Effective Accumulated Temperature (EAT), Cumulative High Temperature Deviation (CHTD), and Cumulative Temperature Gradient (CTG), were derived from grain temperature data to capture the spatiotemporal dynamics of the internal temperature field. These features were then input into the stacked ensemble learning model to accurately predict historical quality changes. In addition, future grain temperatures were predicted with high precision using a Graph Convolutional Network-Temporal Fusion Transformer (GCN-TFT) model. The temperature prediction results were then employed to construct features and were fed into the stacked ensemble learning model to enable future quality change prediction. Baseline experiments indicated that the stacked model significantly outperformed individual models, achieving R2 = 0.94, MAE = 0.44 mg KOH/100 g, and RMSE = 0.59 mg KOH/100 g. SHAP interpretability analysis revealed that EAT constituted the primary driver of wheat quality deterioration, followed by CHTD and CTG. Moreover, in future quality prediction experiments, the GCN-TFT model demonstrated high accuracy in 60-day grain temperature forecasts, and although the prediction accuracy of fatty acid value changes based on features derived from predicted temperatures slightly declined compared to features based on actual temperature data, it remained within an acceptable precision range, achieving an MAE of 0.28 mg KOH/100 g and an RMSE of 0.33 mg KOH/100 g. The experiments validated that the overall technical route from grain temperature prediction to quality prediction exhibited good accuracy and feasibility, providing an efficient, stable, and interpretable quality monitoring and early warning tool for grain storage management, which assists managers in making scientific decisions and interventions to ensure storage safety.
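Editor's note: for readers unfamiliar with stacked ensembles, the scikit-learn sketch below shows the general pattern the paper builds on: several base regressors are trained on the temperature-derived features and a meta-learner combines their cross-validated predictions. The base learners, meta-learner, and synthetic [EAT, CHTD, CTG] data are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import (StackingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import Ridge

# Toy feature matrix: [EAT, CHTD, CTG] per storage period (values synthetic).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 30 - 8 * X[:, 0] - 3 * X[:, 1] - 2 * X[:, 2] + rng.normal(0, 0.3, 200)  # fatty acid value

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
                ("gbm", GradientBoostingRegressor(random_state=0))],
    final_estimator=Ridge(),   # meta-learner combines base predictions
    cv=5,
)
stack.fit(X[:150], y[:150])
print("R^2 on held-out data:", stack.score(X[150:], y[150:]))
```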

23 pages, 1466 KB  
Article
TMU-Net: A Transformer-Based Multimodal Framework with Uncertainty Quantification for Driver Fatigue Detection
by Yaxin Zhang, Xuegang Xu, Yuetao Du and Ningchao Zhang
Sensors 2025, 25(17), 5364; https://doi.org/10.3390/s25175364 - 29 Aug 2025
Viewed by 252
Abstract
Driver fatigue is a prevalent issue that frequently contributes to traffic accidents, prompting the development of automated fatigue detection methods based on various data sources, particularly reliable physiological signals. However, challenges in accuracy, robustness, and practicality persist, especially for cross-subject detection. Multimodal data fusion can enhance the effective estimation of driver fatigue. In this work, we leverage the advantages of multimodal signals to propose a novel Multimodal Attention Network (TMU-Net) for driver fatigue detection, achieving precise fatigue assessment by integrating electroencephalogram (EEG) and electrooculogram (EOG) signals. The core innovation of TMU-Net lies in its unimodal feature extraction module, which combines causal convolution, ConvSparseAttention, and Transformer encoders to effectively capture spatiotemporal features, and a multimodal fusion module that employs cross-modal attention and uncertainty-weighted gating to dynamically integrate complementary information. By incorporating uncertainty quantification, TMU-Net significantly enhances robustness to noise and individual variability. Experimental validation on the SEED-VIG dataset demonstrates TMU-Net’s superior performance stability across 23 subjects in cross-subject testing, effectively leveraging the complementary strengths of EEG (2 Hz full-band and five-band features) and EOG signals for high-precision fatigue detection. Furthermore, attention heatmap visualization reveals the dynamic interaction mechanisms between EEG and EOG signals, confirming the physiological rationality of TMU-Net’s feature fusion strategy. Practical challenges and future research directions for fatigue detection methods are also discussed.
(This article belongs to the Special Issue AI and Smart Sensors for Intelligent Transportation Systems)
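Editor's note: the uncertainty-weighted gating idea can be sketched compactly: each modality predicts a log-variance for its own representation, and the normalized inverse variances act as fusion weights. This is a generic sketch under assumed dimensions, not TMU-Net's actual module.

```python
import torch
import torch.nn as nn

class UncertaintyGatedFusion(nn.Module):
    """Sketch: weight each modality by its predicted (inverse) uncertainty."""
    def __init__(self, d=128):
        super().__init__()
        # One log-variance head per modality (aleatoric-style uncertainty).
        self.logvar_eeg = nn.Linear(d, 1)
        self.logvar_eog = nn.Linear(d, 1)

    def forward(self, h_eeg, h_eog):              # both (B, d)
        # Precision (inverse variance) acts as a confidence score.
        prec = torch.stack([torch.exp(-self.logvar_eeg(h_eeg)),
                            torch.exp(-self.logvar_eog(h_eog))], dim=1)  # (B, 2, 1)
        w = prec / prec.sum(dim=1, keepdim=True)  # normalized gate weights
        h = torch.stack([h_eeg, h_eog], dim=1)    # (B, 2, d)
        return (w * h).sum(dim=1)                 # (B, d) fused representation

fused = UncertaintyGatedFusion()(torch.randn(4, 128), torch.randn(4, 128))
```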

12 pages, 2370 KB  
Article
Streak Tube-Based LiDAR for 3D Imaging
by Houzhi Cai, Zeng Ye, Fangding Yao, Chao Lv, Xiaohan Cheng and Lijuan Xiang
Sensors 2025, 25(17), 5348; https://doi.org/10.3390/s25175348 - 28 Aug 2025
Viewed by 313
Abstract
Streak cameras, essential for ultrahigh temporal resolution diagnostics in laser-driven inertial confinement fusion, underpin the streak tube imaging LiDAR (STIL) system—a flash LiDAR technology offering high spatiotemporal resolution, precise ranging, enhanced sensitivity, and wide field of view. This study establishes a theoretical model of the STIL system, with numerical simulations predicting limits of temporal and spatial resolutions of ~6 ps and 22.8 lp/mm, respectively. Dynamic simulations of laser backscatter signals from targets at varying depths demonstrate an optimal distance reconstruction accuracy of 98%. An experimental STIL platform was developed, with the key parameters calibrated as follows: scanning speed (16.78 ps/pixel), temporal resolution (14.47 ps), and central cathode spatial resolution (20 lp/mm). The system achieved target imaging through streak camera detection of azimuth-resolved intensity profiles, generating raw streak images. Feature extraction and neural network-based three-dimensional (3D) reconstruction algorithms enabled target reconstruction from the time-of-flight data of short laser pulses, achieving a minimum distance reconstruction error of 3.57%. Experimental results validate the capability of the system to detect fast, low-intensity optical signals while acquiring target range information, ultimately achieving high-frame-rate, high-resolution 3D imaging. These advancements position STIL technology as a promising solution for applications that require micron-scale depth discrimination under dynamic conditions.
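Editor's note: as a quick plausibility check, single-shot temporal resolution converts to depth resolution via Δd = c·Δt/2. The snippet below applies this to the quoted figures; note that it yields millimeter-scale raw resolution, so the micron-scale depth discrimination mentioned presumably relies on sub-resolution processing of the streak image (e.g., centroid fitting) rather than on the raw temporal width alone.

```python
c = 299_792_458.0           # speed of light, m/s
for dt_ps in (6.0, 14.47):  # simulated limit vs. calibrated temporal resolution
    dd = c * dt_ps * 1e-12 / 2   # round trip: range resolution = c * dt / 2
    print(f"{dt_ps:6.2f} ps  ->  {dd * 1e3:.2f} mm depth resolution")
```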

21 pages, 5171 KB  
Article
FDBRP: A Data–Model Co-Optimization Framework Towards Higher-Accuracy Bearing RUL Prediction
by Muyu Lin, Qing Ye, Shiyue Na, Dongmei Qin, Xiaoyu Gao and Qiang Liu
Sensors 2025, 25(17), 5347; https://doi.org/10.3390/s25175347 - 28 Aug 2025
Viewed by 326
Abstract
This paper proposes the Feature fusion and Dilated causal convolution model for Bearing Remaining useful life Prediction (FDBRP), an integrated framework for accurate Remaining Useful Life (RUL) prediction of rolling bearings that combines three key innovations: (1) a data augmentation module employing sliding-window processing and two-dimensional feature concatenation with label normalization to enhance signal representation and improve model generalizability, (2) a feature fusion module incorporating an enhanced graph convolutional network for spatial modeling, an improved multi-scale temporal convolution for dynamic pattern extraction, and an efficient multi-scale attention mechanism to optimize spatiotemporal feature consistency, and (3) an optimized dilated convolution module that uses interval sampling to expand the receptive field and residual connections to regularize the network and strengthen its ability to capture long-range dependencies. Experimental validation showcases the effectiveness of the proposed approach, which achieves a high average score of 0.756564 and a lower average error of 10.903656 in RUL prediction for test bearings compared to state-of-the-art benchmarks. This highlights the superior RUL prediction capability of the proposed methodology.
(This article belongs to the Section Industrial Sensors)
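Editor's note: the third module's core ingredient, causal convolution with dilation ("interval sampling") plus a residual connection, is a standard block and easy to sketch in PyTorch; the channel count, kernel size, and dilation below are arbitrary placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class DilatedCausalBlock(nn.Module):
    """Sketch: causal 1-D convolution with dilation plus a residual connection."""
    def __init__(self, channels=64, kernel=3, dilation=4):
        super().__init__()
        self.pad = (kernel - 1) * dilation           # left-pad only => causal
        self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                            # x: (B, C, T)
        y = nn.functional.pad(x, (self.pad, 0))      # no access to future steps
        y = self.relu(self.conv(y))
        return x + y                                 # residual aids regularization

out = DilatedCausalBlock()(torch.randn(2, 64, 100))  # shape preserved: (2, 64, 100)
```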

16 pages, 306 KB  
Article
Adaptive Cross-Scale Graph Fusion with Spatio-Temporal Attention for Traffic Prediction
by Zihao Zhao, Xingzheng Zhu and Ziyun Ye
Electronics 2025, 14(17), 3399; https://doi.org/10.3390/electronics14173399 - 26 Aug 2025
Viewed by 298
Abstract
Traffic flow prediction is a critical component of intelligent transportation systems, playing a vital role in alleviating congestion, improving road resource utilization, and supporting traffic management decisions. Although deep learning methods have made remarkable progress in this field in recent years, current studies still face challenges in modeling complex spatio-temporal dependencies, adapting to anomalous events, and generalizing to large-scale real-world scenarios. To address these issues, this paper proposes a novel traffic flow prediction model. The proposed approach simultaneously leverages temporal and frequency domain information and introduces adaptive graph convolutional layers to replace traditional graph convolutions, enabling dynamic capture of traffic network structural features. Furthermore, we design a frequency–temporal multi-head attention mechanism for effective multi-scale spatio-temporal feature extraction and develop a cross-multi-scale graph fusion strategy to enhance predictive performance. Extensive experiments on real-world datasets, PeMS and Beijing, demonstrate that our method significantly outperforms state-of-the-art (SOTA) baselines. For example, on the PeMS20 dataset, our model achieves a 53.6% lower MAE, a 12.3% lower NRMSE, and a 3.2% lower MAPE than the best existing method (STFGNN). Moreover, the proposed model achieves competitive computational efficiency and inference speed, making it well-suited for practical deployment.
(This article belongs to the Special Issue Graph-Based Learning Methods in Intelligent Transportation Systems)
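Editor's note: adaptive graph convolution layers of the kind described typically learn the adjacency from node embeddings instead of using a fixed road graph. Below is a minimal PyTorch sketch under assumed shapes (307 sensors, 12 input features); it is one common construction, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Sketch: graph convolution over an adjacency learned from node embeddings."""
    def __init__(self, n_nodes, in_dim, out_dim, emb_dim=10):
        super().__init__()
        self.emb = nn.Parameter(torch.randn(n_nodes, emb_dim))  # per-sensor embedding
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):                        # x: (B, N, in_dim)
        # Data-driven adjacency: softmax over pairwise embedding affinity.
        adj = torch.softmax(torch.relu(self.emb @ self.emb.T), dim=-1)  # (N, N)
        return torch.relu(adj @ self.lin(x))     # propagate along learned edges

h = AdaptiveGraphConv(n_nodes=307, in_dim=12, out_dim=64)(torch.randn(8, 307, 12))
```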

25 pages, 1900 KB  
Article
Collision Risk Assessment of Lane-Changing Vehicles Based on Spatio-Temporal Feature Fusion Trajectory Prediction
by Hongtao Su, Ning Wang and Xiangmin Wang
Electronics 2025, 14(17), 3388; https://doi.org/10.3390/electronics14173388 - 26 Aug 2025
Viewed by 351
Abstract
Accurate forecasting of potential collision risk in dense traffic is addressed by a framework grounded in multi-vehicle trajectory prediction. A spatio-temporal fusion architecture, STGAT-EDGRU, is proposed. A Transformer encoder learns temporal motion patterns from each vehicle’s history; a boundary-aware graph attention network (GAT) models inter-vehicle interactions; and a Gated Multimodal Unit (GMU) adaptively fuses the temporal and spatial streams. Future positions are parameterized as bivariate Gaussians and decoded by a two-layer GRU. Using probabilistic trajectory forecasts for the main vehicle and its surrounding vehicles, collision probability and collision intensity are computed at each prediction instant and integrated via a weighted scheme into a Collision Risk Index (CRI) that characterizes risk over the entire horizon. On HighD, for 3–5 s horizons, average RMSE reductions of 0.02 m, 0.12 m, and 0.26 m over a GAT-Transformer baseline are achieved. In high-risk lane-change scenarios, CRI issues warnings 0.4–0.6 s earlier and maintains a stable response across the high-risk interval. These findings substantiate improved long-horizon accuracy together with earlier and more reliable risk perception, and indicate practical utility for lane-change assistance, where CRI can trigger early deceleration or abort decisions, and for risk-aware motion planning in intelligent driving.
(This article belongs to the Special Issue Feature Papers in Electrical and Autonomous Vehicles, Volume 2)
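Editor's note: the abstract does not spell out the weighting scheme behind the CRI, so the sketch below is only one plausible reading: per-instant risk is a convex mix of collision probability and intensity, and instants are combined with a time-discounting weight. Every coefficient here is invented for illustration.

```python
import numpy as np

def collision_risk_index(p_col, intensity, dt=0.2, tau=1.0, alpha=0.7):
    """Hypothetical CRI: weighted mix of collision probability and intensity,
    discounted so that nearer prediction instants count more (weights assumed)."""
    t = np.arange(len(p_col)) * dt                 # prediction instants (s)
    w = np.exp(-t / tau)                           # earlier steps weighted higher
    risk = alpha * np.asarray(p_col) + (1 - alpha) * np.asarray(intensity)
    return float((w * risk).sum() / w.sum())       # scalar index over the horizon

print(collision_risk_index(p_col=[0.05, 0.2, 0.5, 0.7],
                           intensity=[0.1, 0.3, 0.6, 0.8]))
```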

30 pages, 10140 KB  
Article
High-Accuracy Cotton Field Mapping and Spatiotemporal Evolution Analysis of Continuous Cropping Using Multi-Source Remote Sensing Feature Fusion and Advanced Deep Learning
by Xiao Zhang, Zenglu Liu, Xuan Li, Hao Bao, Nannan Zhang and Tiecheng Bai
Agriculture 2025, 15(17), 1814; https://doi.org/10.3390/agriculture15171814 - 25 Aug 2025
Viewed by 364
Abstract
Cotton is a globally strategic crop that plays a crucial role in sustaining national economies and livelihoods. To address the challenges of accurate cotton field extraction in the complex planting environments of Xinjiang’s Alaer reclamation area, a cotton field identification model was developed that integrates multi-source satellite remote sensing data with machine learning methods. Using imagery from Sentinel-2, GF-1, and Landsat 8, we performed feature fusion using principal component analysis, Gram–Schmidt (GS), and neural network techniques. Analyses of spectral, vegetation, and texture features revealed that the GS-fused blue bands of Sentinel-2 and Landsat 8 exhibited optimal performance, with a mean value of 16,725, a standard deviation of 2290, and an information entropy of 8.55. These metrics improved by 10,529, 168, and 0.28, respectively, compared with the original Landsat 8 data. In comparative classification experiments, the endmember-based random forest classifier (RFC) achieved the best traditional classification performance, with a kappa value of 0.963 and an overall accuracy (OA) of 97.22% based on 250 samples, resulting in a cotton-field extraction error of 38.58 km2. To enhance the deep learning approach, we proposed a U-Net architecture incorporating a Convolutional Block Attention Module and Atrous Spatial Pyramid Pooling. Using the GS-fused blue band data, the model achieved significantly improved accuracy, with a kappa coefficient of 0.988 and an OA of 98.56%. This advancement reduced the area estimation error to 25.42 km2, representing a 34.1% decrease compared with that of the RFC. Based on the optimal model, we constructed a digital map of continuous cotton cropping from 2021 to 2023, which revealed a consistent decline in cotton acreage within the reclaimed areas. This finding underscores the effectiveness of crop rotation policies in mitigating the adverse effects of large-scale monoculture practices. This study confirms that the synergistic integration of multi-source satellite feature fusion and deep learning significantly improves crop identification accuracy, providing reliable technical support for agricultural policy formulation and sustainable farmland management.
(This article belongs to the Special Issue Computers and IT Solutions for Agriculture and Their Application)
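Editor's note: the reported accuracy pair (kappa and OA) comes from a standard confusion-matrix computation; the short scikit-learn example below reproduces the mechanics on synthetic cotton/non-cotton labels, not on the paper's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Toy labels: 1 = cotton, 0 = non-cotton (250-sample validation, counts invented).
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 250)
y_pred = np.where(rng.random(250) < 0.97, y_true, 1 - y_true)  # ~97% agreement

print("OA    :", accuracy_score(y_true, y_pred))    # overall accuracy
print("kappa :", cohen_kappa_score(y_true, y_pred)) # chance-corrected agreement
```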

31 pages, 3129 KB  
Review
A Review on Gas Pipeline Leak Detection: Acoustic-Based, OGI-Based, and Multimodal Fusion Methods
by Yankun Gong, Chao Bao, Zhengxi He, Yifan Jian, Xiaoye Wang, Haineng Huang and Xintai Song
Information 2025, 16(9), 731; https://doi.org/10.3390/info16090731 - 25 Aug 2025
Viewed by 512
Abstract
Pipelines play a vital role in material transportation within industrial settings. This review synthesizes detection technologies for early-stage small gas leaks from pipelines in the industrial sector, with a focus on acoustic-based methods, optical gas imaging (OGI), and multimodal fusion approaches. It encompasses detection principles, inherent challenges, mitigation strategies, and the state of the art (SOTA). Small leaks refer to low-flow leakage originating from defects with apertures at millimeter or submillimeter scales, posing significant detection difficulties. Acoustic detection leverages the acoustic wave signals generated by gas leaks for non-contact monitoring, offering advantages such as rapid response and broad coverage. However, its susceptibility to environmental noise interference often triggers false alarms. This limitation can be mitigated through time-frequency analysis, multi-sensor fusion, and deep-learning algorithms, which enhance leak signals, suppress background noise, and thereby improve the system’s detection robustness and accuracy. OGI utilizes infrared imaging technology to visualize leakage gas and is applicable to the detection of various polar gases. Its primary limitations include low image resolution, low contrast, and interference from complex backgrounds. Mitigation techniques involve background subtraction, optical flow estimation, fully convolutional neural networks (FCNNs), and vision transformers (ViTs), which enhance image contrast and extract multi-scale features to boost detection precision. Multimodal fusion technology integrates data from diverse sensors, such as acoustic and optical devices. Key challenges lie in achieving spatiotemporal synchronization across multiple sensors and effectively fusing heterogeneous data streams. Current methodologies primarily utilize decision-level fusion and feature-level fusion techniques. Decision-level fusion offers high flexibility and ease of implementation but lacks inter-feature interaction; it is less effective than feature-level fusion when correlations exist between heterogeneous features. Feature-level fusion amalgamates data from different modalities during the feature extraction phase, generating a unified cross-modal representation that effectively resolves inter-modal heterogeneity. In conclusion, we posit that multimodal fusion holds significant potential for further enhancing detection accuracy beyond the capabilities of existing single-modality technologies and is poised to become a major focus of future research in this domain.
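Editor's note: the decision-level versus feature-level distinction the review draws maps onto a few lines of code: fuse at the score level (no cross-feature interaction) or concatenate features before a joint classifier (cross-modal interactions possible). The PyTorch sketch below uses assumed feature sizes for the acoustic and OGI streams.

```python
import torch
import torch.nn as nn

d_ac, d_im, n_cls = 64, 64, 2          # acoustic / OGI feature sizes (assumed)
h_ac, h_im = torch.randn(8, d_ac), torch.randn(8, d_im)

# Decision-level fusion: independent heads, then average the class scores.
head_ac, head_im = nn.Linear(d_ac, n_cls), nn.Linear(d_im, n_cls)
logits_decision = (head_ac(h_ac) + head_im(h_im)) / 2

# Feature-level fusion: concatenate modalities before a joint head,
# letting the classifier model cross-feature interactions.
head_joint = nn.Linear(d_ac + d_im, n_cls)
logits_feature = head_joint(torch.cat([h_ac, h_im], dim=-1))
```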

22 pages, 6754 KB  
Article
Railway Intrusion Risk Quantification with Track Semantic Segmentation and Spatiotemporal Features
by Shanping Ning, Feng Ding, Bangbang Chen and Yuanfang Huang
Sensors 2025, 25(17), 5266; https://doi.org/10.3390/s25175266 - 24 Aug 2025
Viewed by 597
Abstract
Foreign object intrusion in railway perimeter areas poses significant risks to train operation safety. To address the limitation of current visual detection technologies that overly focus on target identification while lacking quantitative risk assessment, this paper proposes a railway intrusion risk quantification method integrating track semantic segmentation and spatiotemporal features. An improved BiSeNetV2 network is employed to accurately extract track regions, while physically constrained risk zones are constructed based on railway structure gauge standards. The lateral spatial distance of intruding objects is precisely calculated using track gauge prior knowledge. A lightweight detection architecture is designed, adopting ShuffleNetV2 as the backbone to reduce computational complexity, with an incorporated Dilated Transformer module to enhance global context awareness and sparse feature extraction, significantly improving detection accuracy for small-scale objects. The comprehensive risk assessment formula integrates object category weights, lateral risk coefficients in intrusion zones, longitudinal distance decay factors, and dynamic velocity compensation. Experimental results demonstrate that the proposed method achieves 84.9% mean average precision (mAP) on our proprietary dataset, outperforming baseline models by 3.3%. By combining lateral distance detection with multidimensional risk indicators, the method enables quantitative intrusion risk assessment and graded early warning, providing data-driven decision support for active train protection systems and substantially enhancing intelligent safety protection capabilities.
(This article belongs to the Section Intelligent Sensors)
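Editor's note: the abstract lists the ingredients of the risk formula without giving its exact form, so the following is a hypothetical composition for illustration only: a multiplicative score over category weight, lateral-zone coefficient, longitudinal distance decay, and velocity compensation, with all coefficients invented.

```python
import numpy as np

def intrusion_risk(category_w, lateral_coef, distance_m, speed_mps,
                   decay=0.02, beta=0.05):
    """Hypothetical risk score: object class weight x lateral-zone coefficient
    x longitudinal distance decay x velocity compensation (all values assumed)."""
    decay_term = np.exp(-decay * distance_m)       # farther objects matter less
    speed_term = 1.0 + beta * speed_mps            # faster approach raises risk
    return category_w * lateral_coef * decay_term * speed_term

print(intrusion_risk(category_w=1.0, lateral_coef=0.9,
                     distance_m=120, speed_mps=2.0))
```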

21 pages, 6738 KB  
Article
Dynamic Demand Forecasting for Bike-Sharing E-Fences Using a Hybrid Deep Learning Framework with Spatio-Temporal Attention
by Chen Deng and Yunxuan Li
Sustainability 2025, 17(17), 7586; https://doi.org/10.3390/su17177586 - 22 Aug 2025
Viewed by 427
Abstract
The rapid expansion of bike-sharing systems has introduced significant management challenges related to spatio-temporal demand fluctuations and inefficient e-fence capacity allocation. This study proposes a Spatio-Temporal Graph Attention Transformer Network (STGATN), a novel hybrid deep learning framework for dynamic demand forecasting in bike-sharing e-fence systems. The model integrates Graph Convolutional Networks to capture complex spatial dependencies among urban functional zones, Bi-LSTM networks to model temporal patterns with periodic variations, and attention mechanisms to dynamically incorporate weather impacts. By constructing a city-level graph based on POI-derived e-fences and implementing multi-source feature fusion through a Transformer architecture, the STGATN effectively addresses the limitations of static capacity allocation strategies. The experimental results from Shenzhen’s Nanshan District demonstrate strong performance, with the STGATN model achieving an overall Mean Absolute Error (MAE) of 0.0992 and a Coefficient of Determination (R2) of 0.8426. This significantly outperforms baseline models such as LSTM (R2: 0.6215) and a GCN (R2: 0.5488). Ablation studies confirm the model’s key components are critical; removing the GCN module decreased R2 by 12 percentage points to 0.7411, while removing the weather attention mechanism reduced R2 by nearly 5 percentage points to 0.8034. The framework provides a scientific basis for dynamic e-fence capacity management, advancing spatio-temporal prediction methodologies for sustainable transportation.
(This article belongs to the Section Sustainable Transportation)
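Editor's note: the weather-attention component can be sketched generically: scalar weather variables are embedded as tokens and the demand state attends over them, so their influence is reweighted per sample. The dimensions and residual formulation below are assumptions, not the STGATN specification.

```python
import torch
import torch.nn as nn

class WeatherAttention(nn.Module):
    """Sketch: let the demand state attend over exogenous weather features
    (temperature, rain, wind, ...) so their influence is weighted dynamically."""
    def __init__(self, d=64, n_weather=4):
        super().__init__()
        self.w_proj = nn.Linear(1, d)              # embed each scalar weather variable
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, demand_state, weather):      # (B, d), (B, n_weather)
        w = self.w_proj(weather.unsqueeze(-1))     # (B, n_weather, d)
        q = demand_state.unsqueeze(1)              # (B, 1, d)
        ctx, _ = self.attn(q, w, w)                # attend over weather tokens
        return demand_state + ctx.squeeze(1)       # residual weather correction

out = WeatherAttention()(torch.randn(8, 64), torch.randn(8, 4))
```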

19 pages, 738 KB  
Article
Short-Term Multi-Energy Load Forecasting Method Based on Transformer Spatio-Temporal Graph Neural Network
by Heng Zhou, Qing Ai and Ruiting Li
Energies 2025, 18(17), 4466; https://doi.org/10.3390/en18174466 - 22 Aug 2025
Viewed by 536
Abstract
To tackle the limitations of existing methods in simultaneously modeling long-term dependencies in the time dimension and nonlinear interactions in the feature dimension, as well as their inability to fully reflect the impact of real-time load changes on spatial dependencies, a short-term multi-energy load forecasting method based on a Transformer Spatio-Temporal Graph neural network (TSTG) is proposed. This method employs a multi-head spatio-temporal attention module to model long-term dependencies in the time dimension and nonlinear interactions in the feature dimension in parallel across multiple subspaces. Additionally, a dynamic adaptive graph convolution module is designed to construct adaptive adjacency matrices by combining physical topology and feature similarity, dynamically adjusting node connection weights based on real-time load characteristics to more accurately characterize the spatial dynamics of multi-energy interactions. Furthermore, TSTG adopts an end-to-end spatio-temporal joint optimization framework, achieving synchronous extraction and fusion of spatio-temporal features through an encoder–decoder architecture. Experimental results show that TSTG significantly outperforms existing methods in short-term load forecasting tasks, providing an effective solution for refined forecasting in integrated energy systems.
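Editor's note: the dynamic adaptive adjacency described (physical topology blended with feature similarity) can be illustrated in a few lines; the cosine-similarity blend and row normalization below are one plausible construction, not necessarily TSTG's.

```python
import torch

def dynamic_adjacency(a_phys, h, eps=1e-8):
    """Sketch: blend fixed physical topology with load-feature similarity.
    a_phys: (N, N) 0/1 physical links; h: (N, d) current node load features."""
    hn = h / (h.norm(dim=1, keepdim=True) + eps)
    sim = torch.relu(hn @ hn.T)                    # cosine similarity, clipped at 0
    adj = a_phys + sim                             # physical + data-driven edges
    deg = adj.sum(dim=1, keepdim=True)
    return adj / deg                               # row-normalized adjacency

A = dynamic_adjacency(torch.eye(5), torch.randn(5, 8))
```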

17 pages, 3907 KB  
Article
Motion Intention Prediction for Lumbar Exoskeletons Based on Attention-Enhanced sEMG Inference
by Mingming Wang, Linsen Xu, Zhihuan Wang, Qi Zhu and Tao Wu
Biomimetics 2025, 10(9), 556; https://doi.org/10.3390/biomimetics10090556 - 22 Aug 2025
Viewed by 383
Abstract
Exoskeleton robots function as augmentation systems that establish mechanical couplings with the human body, substantially enhancing the wearer’s biomechanical capabilities through assistive torques. We introduce a lumbar spine-assisted exoskeleton design based on Variable-Stiffness Pneumatic Artificial Muscles (VSPAM) and develop a dynamic adaptation mechanism bridging the pneumatic drive module with human kinematic intent to facilitate human–robot cooperative control. For kinematic intent resolution, we propose a multimodal fusion architecture integrating the VGG16 convolutional network with Long Short-Term Memory (LSTM) networks. By incorporating self-attention mechanisms, we construct a fine-grained relational inference module that leverages multi-head attention weight matrices to capture global spatio-temporal feature dependencies, overcoming local feature constraints inherent in traditional algorithms. We further employ cross-attention mechanisms to achieve deep fusion of visual and kinematic features, establishing aligned intermodal correspondence to mitigate unimodal perception limitations. Experimental validation demonstrates 96.1% ± 1.2% motion classification accuracy, offering a novel technical solution for rehabilitation robotics and industrial assistance.
(This article belongs to the Special Issue Advanced Service Robots: Exoskeleton Robots 2025)
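Editor's note: cross-attention between kinematic and visual streams follows the standard query/key/value recipe; a minimal PyTorch sketch with assumed token counts and width is shown below (the generic mechanism, not the paper's exact module).

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch: sEMG tokens query visual tokens so each stream borrows context."""
    def __init__(self, d=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, semg, vision):               # (B, Ts, d), (B, Tv, d)
        ctx, _ = self.attn(query=semg, key=vision, value=vision)
        return self.norm(semg + ctx)               # residual + norm, standard recipe

out = CrossModalAttention()(torch.randn(2, 50, 128), torch.randn(2, 20, 128))
```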

15 pages, 3090 KB  
Article
Diagnosing Faults of Pneumatic Soft Actuators Based on Multimodal Spatiotemporal Features and Ensemble Learning
by Tao Duan, Yi Lv, Liyuan Wang, Haifan Li, Teng Yi, Yigang He and Zhongming Lv
Machines 2025, 13(8), 749; https://doi.org/10.3390/machines13080749 - 21 Aug 2025
Viewed by 291
Abstract
Soft robots demonstrate significant advantages in applications within complex environments due to their unique material properties and structural designs. However, they also face challenges in fault diagnosis, such as nonlinearity, time variability, and the difficulty of precise modeling. To address these issues, this paper proposes a fault diagnosis method based on multimodal spatiotemporal features and ensemble learning. First, a sliding-window Kalman filter is utilized to eliminate noise interference from multi-source signals, constructing separate temporal and spatial representation spaces. Subsequently, an adaptive weight strategy for feature fusion is applied to train a heterogeneous decision tree model, followed by a dynamic weighted voting mechanism based on confidence levels to obtain diagnostic results. This method optimizes the feature extraction and fusion process in stages, combined with a dynamic ensemble strategy. Experimental results indicate a significant improvement in diagnostic accuracy and model robustness, achieving precise identification of faults in soft robots.
(This article belongs to the Section Machines Testing and Maintenance)
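Editor's note: a confidence-based dynamic weighted vote can be sketched with scikit-learn: each heterogeneous tree model's per-sample vote is scaled by its own predictive confidence before summing. The models and synthetic data below are stand-ins for the paper's multimodal features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=0)
Xtr, Xte, ytr, yte = X[:300], X[300:], y[:300], y[300:]

models = [DecisionTreeClassifier(random_state=0).fit(Xtr, ytr),
          RandomForestClassifier(random_state=0).fit(Xtr, ytr),
          ExtraTreesClassifier(random_state=0).fit(Xtr, ytr)]

# Dynamic weighted vote: each model's vote is scaled by its own confidence
# (max predicted probability) on the specific sample.
probas = np.stack([m.predict_proba(Xte) for m in models])      # (M, N, C)
conf = probas.max(axis=2, keepdims=True)                        # per-sample confidence
vote = (probas * conf).sum(axis=0)                              # confidence-weighted sum
print("accuracy:", (vote.argmax(axis=1) == yte).mean())
```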

17 pages, 1594 KB  
Article
TransMODAL: A Dual-Stream Transformer with Adaptive Co-Attention for Efficient Human Action Recognition
by Majid Joudaki, Mehdi Imani and Hamid R. Arabnia
Electronics 2025, 14(16), 3326; https://doi.org/10.3390/electronics14163326 - 21 Aug 2025
Viewed by 637
Abstract
Human Action Recognition has seen significant advances through transformer-based architectures, yet achieving a nuanced understanding often requires fusing multiple data modalities. Standard models relying solely on RGB video can struggle with actions defined by subtle motion cues rather than appearance. This paper introduces TransMODAL, a novel dual-stream transformer that synergistically fuses spatiotemporal appearance features from a pre-trained VideoMAE (Video Masked AutoEncoders) backbone with explicit skeletal kinematics from a state-of-the-art pose estimation pipeline (RT-DETR (Real-Time DEtection Transformer) + ViTPose++). We propose two key architectural innovations to enable effective and efficient fusion: a CoAttentionFusion module that facilitates deep, iterative cross-modal feature exchange between the RGB and pose streams, and an efficient AdaptiveSelector mechanism that dynamically prunes less informative spatiotemporal tokens to reduce computational overhead. Evaluated on three challenging benchmarks, TransMODAL demonstrates robust generalization, achieving accuracies of 98.5% on KTH, 96.9% on UCF101, and 84.2% on HMDB51. These results significantly outperform a strong VideoMAE-only baseline and are competitive with state-of-the-art methods, demonstrating the profound impact of explicit pose guidance. TransMODAL thus presents a powerful and efficient paradigm for composing pre-trained foundation models to tackle complex video understanding tasks, and it is accompanied by a fully reproducible implementation and strong benchmark results.
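Editor's note: token pruning of the kind AdaptiveSelector performs is commonly implemented as learned scoring plus top-k selection; the sketch below shows that generic pattern under assumed dimensions, not TransMODAL's exact mechanism.

```python
import torch
import torch.nn as nn

class AdaptiveTokenSelector(nn.Module):
    """Sketch: score tokens and keep the top-k, shrinking later-layer cost."""
    def __init__(self, d=256, keep_ratio=0.5):
        super().__init__()
        self.score = nn.Linear(d, 1)               # learned importance score
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                     # (B, N, d)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        s = self.score(tokens).squeeze(-1)         # (B, N)
        idx = s.topk(k, dim=1).indices             # indices of the k best tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)               # (B, k, d)

kept = AdaptiveTokenSelector()(torch.randn(2, 196, 256))  # 196 -> 98 tokens
```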