Search Results (670)

Search Parameters:
Keywords = audio datasets

18 pages, 741 KB  
Review
A Review of Tools and Technologies to Combat Deepfakes
by Dmitry Erokhin and Nadejda Komendantova
Information 2026, 17(4), 347; https://doi.org/10.3390/info17040347 - 3 Apr 2026
Viewed by 259
Abstract
Deepfakes and adjacent synthetic-media capabilities have become a systemic challenge for information integrity, security, and digital trust. Countermeasures now span passive detection methods that infer manipulation from content traces, active provenance systems that cryptographically bind metadata to media, and watermarking approaches that embed detectable signals into content or generative processes. This review presents a rigorous synthesis of tools and technologies to combat deepfakes across modalities (image, video, audio, and selected multimodal settings), drawing primarily from the peer-reviewed literature, standardized benchmarks, and official technical specifications and reports. The review analyzes detection methods; provenance and authentication technologies, with emphasis on cryptographic manifests and threat models; watermarking and content provenance, including diffusion-era watermarking and industrial deployments; adversarial robustness and attacker adaptation; datasets and benchmarks; evaluation metrics across tasks; and deployment and scalability constraints. A dedicated section addresses legal, ethical, and policy issues, focusing on emerging transparency obligations and platform governance. The review finds that no single countermeasure is sufficient in realistic adversarial settings. The strongest practical approach is a layered defense that combines provenance, watermarking, content-based detection, and human oversight. The study concludes with limitations of the current evidence base and prioritized research directions to improve generalization, interoperability, and trustworthy user experiences.
(This article belongs to the Special Issue Surveys in Information Systems and Applications)
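To make the provenance idea concrete, here is a minimal Python sketch of cryptographically binding a metadata manifest to media bytes, in the spirit of the manifest-based systems the review covers. It is an illustration only, not the C2PA specification; the manifest fields and the symmetric demo key are hypothetical (production systems use asymmetric signatures).

```python
# Minimal sketch of "active provenance": a signed manifest covers both the
# media hash and its metadata, so any edit to the media breaks the binding.
# Illustration only; fields and the HMAC key are hypothetical placeholders.
import hashlib, hmac, json

SIGNING_KEY = b"demo-key"  # hypothetical; real systems use asymmetric keys

def bind_manifest(media_bytes: bytes, metadata: dict) -> dict:
    """Attach a signed manifest whose signature covers media hash + metadata."""
    digest = hashlib.sha256(media_bytes).hexdigest()
    payload = json.dumps({"media_sha256": digest, **metadata}, sort_keys=True)
    signature = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

def verify_manifest(media_bytes: bytes, manifest: dict) -> bool:
    """Recompute the media hash and check both signature and hash."""
    expected = hmac.new(SIGNING_KEY, manifest["payload"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False
    claimed = json.loads(manifest["payload"])["media_sha256"]
    return claimed == hashlib.sha256(media_bytes).hexdigest()

media = b"\x00fake-video-bytes"
m = bind_manifest(media, {"device": "cam-01", "captured": "2026-04-03"})
print(verify_manifest(media, m))          # True
print(verify_manifest(media + b"x", m))   # False: any edit breaks the binding
```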

35 pages, 3098 KB  
Article
ImmerseFM-3D: A Foundation Model Framework for Generalizable 360-Degree Video Streaming with Cross-Modal Scene Understanding
by Reka Sandaruwan Gallena Watthage and Anil Fernando
Appl. Sci. 2026, 16(7), 3424; https://doi.org/10.3390/app16073424 - 1 Apr 2026
Viewed by 121
Abstract
Current 360-degree video streaming systems treat viewport prediction, adaptive bitrate allocation, tile selection, and quality-of-experience (QoE) estimation as independent activities, yielding fragmented pipelines that generalize poorly across content types and network conditions and fail to adapt to individual users. We propose ImmerseFM-3D, a foundation model that jointly solves all four sub-tasks through a single shared representation. Seven input modalities, namely video frames, network traces, head-motion trajectories, ambisonics audio, depth maps, eye-tracking signals, and CLIP scene semantics, are fused by four-layer cross-modal attention and compressed into a 256-dimensional bottleneck latent via a variational information bottleneck. Four task-specific decoders operate on this shared latent simultaneously. A model-agnostic meta-learning adapter augmented with episodic memory and a hypernetwork personalizes the model from as little as 1 s of user interaction data. An extended branch supports six-degrees-of-freedom volumetric content through spherical harmonic viewport decoding and depth-aware tile importance weighting. Trained and evaluated on the IMMERSE-1M combined dataset (1000 h of 360° and volumetric video, 524 users, and over 50,000 mean opinion scores), ImmerseFM-3D reduces the mean angular viewport error by 34%, lowers the bandwidth violation rate from 8.3% to 3.1%, and achieves a QoE Pearson correlation of 0.891. The personalization adapter reaches 90% of peak performance in 22 s, while zero-shot cross-format transfer attains 72% of full in-domain accuracy.
(This article belongs to the Section Computing and Artificial Intelligence)
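As an illustration of the variational-information-bottleneck stage the abstract describes, here is a minimal PyTorch sketch that compresses fused features into a 256-dimensional latent. Only the 256-d bottleneck width comes from the abstract; all other sizes are assumptions.

```python
# Sketch of a variational information bottleneck: fused multimodal features
# are compressed into a 256-d latent. Only the bottleneck width (256) is
# taken from the abstract; the fused dimension is an assumption.
import torch
import torch.nn as nn

class VIBottleneck(nn.Module):
    def __init__(self, fused_dim: int = 1024, latent_dim: int = 256):
        super().__init__()
        self.mu = nn.Linear(fused_dim, latent_dim)
        self.logvar = nn.Linear(fused_dim, latent_dim)

    def forward(self, fused):
        mu, logvar = self.mu(fused), self.logvar(fused)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        # KL term regularizes the latent toward a unit Gaussian prior
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

fused = torch.randn(8, 1024)        # batch of fused cross-modal features
z, kl = VIBottleneck()(fused)
print(z.shape, kl.item())           # torch.Size([8, 256])
```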

21 pages, 10218 KB  
Article
Interaction-Driven Dynamic Fusion for Multimodal Depression Detection: A Controlled Analysis of Gating and Cross-Attention Under Class Imbalance
by Kazuyuki Matsumoto, Keita Kiuchi, Hidehiro Umehara, Masahito Nakataki and Shusuke Numata
Brain Sci. 2026, 16(4), 366; https://doi.org/10.3390/brainsci16040366 - 28 Mar 2026
Viewed by 272
Abstract
Background/Objectives: Multimodal depression detection research has traditionally relied on early or hybrid fusion strategies without systematically analyzing how dynamic fusion mechanisms interact with modality-specific pretraining. Although gated and attention-based architectures are increasingly adopted, their behavior is rarely examined within a structured fusion taxonomy framework. Methods: In this study, we conduct a controlled taxonomy-level evaluation of multimodal fusion strategies on a Japanese PHQ-9-annotated depression dataset. We compare four fusion paradigms (concatenation, summation, gated fusion, and cross-attention) across three integration stages, crossed with modality-specific affective pretraining configurations for visual (CMU-MOSI/MOSEI), acoustic (JTES), and textual (WRIME) encoders, yielding 512 experimental conditions. Results: The results reveal strong position-dependent effects of fusion strategy. Cross-attention fusion at the audio integration stage achieved the highest mean AUC (0.774) and PR-AUC (0.606), with statistically significant superiority over gated and concatenation-based fusion (Kruskal–Wallis H = 86.28, p < 0.001). In contrast, fusion effects at the text stage were non-significant in AUC but significant in PR-AUC, highlighting metric-sensitive behavior under class imbalance. Pretraining effects were modality-specific: SigLIP initialization produced significant positive transfer (Δ = +0.018, p < 0.001), whereas audio pretraining on JTES resulted in negative transfer (Δ = −0.014, p = 0.004), suggesting domain mismatch effects. Gate analysis further revealed condition-dependent modality dominance, including cases of semantic–geometric reversal under joint auxiliary augmentation. Conclusions: Our findings suggest that multimodal depression detection systems should not be interpreted through static fusion categories alone. Instead, modality contribution appears to be associated with structured interaction effects between fusion strategy, integration position, and affective pretraining. This work provides a controlled empirical bridge between fusion taxonomy and dynamic modality weighting in clinical multimodal modeling.
(This article belongs to the Section Cognitive, Social and Affective Neuroscience)
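For readers unfamiliar with the fusion paradigms being compared, here is a minimal PyTorch sketch of a gated-fusion block: a learned sigmoid gate mixes two modality embeddings, and inspecting the gate values is what "gate analysis" refers to. The dimensions and the exact gating form are assumptions, not the authors' model.

```python
# Illustrative gated-fusion block: a learned gate weighs audio vs. text
# embeddings per sample and per dimension. Sizes and the sigmoid gating
# form are assumptions for illustration.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio, text):
        g = self.gate(torch.cat([audio, text], dim=-1))  # values in (0, 1)
        return g * audio + (1 - g) * text  # convex mixture; g reveals dominance

audio, text = torch.randn(4, 128), torch.randn(4, 128)
print(GatedFusion()(audio, text).shape)  # torch.Size([4, 128])
```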

24 pages, 1020 KB  
Article
Research on the Diagnosis of Abnormal Sound Defects in Automobile Engines Based on Fusion of Multi-Modal Images and Audio
by Yi Xu, Wenbo Chen and Xuedong Jing
Electronics 2026, 15(7), 1406; https://doi.org/10.3390/electronics15071406 - 27 Mar 2026
Viewed by 285
Abstract
Against the backdrop of the global carbon-neutrality target, predictive maintenance (PdM) of automotive engines represents a core technical strategy for advancing the sustainable development of the automotive industry. Conventional single-modal diagnostic approaches for engine abnormal sound defects suffer from low accuracy and weak anti-interference capability. Existing multi-modal fusion methods fail to deeply mine the physical coupling between cross-modal features and often entail excessive model complexity, hindering deployment on resource-constrained on-board edge devices. To resolve these limitations, this study proposes a Physical Prior-Embedded Cross-Modal Attention (PPE-CMA) mechanism for lightweight multi-modal fusion diagnosis of engine abnormal sound defects. First, wavelet packet decomposition (WPD) and mel-frequency cepstral coefficients (MFCC) are integrated to extract time-frequency features from engine audio signals, while a channel-pruned ResNet18 is employed to extract spatial features from engine thermal imaging and vibration visualization images. Second, the PPE-CMA module is designed to adaptively assign attention weights to audio and image features by exploiting the physical coupling between engine fault acoustic and visual characteristics, enabling efficient cross-modal feature fusion with redundant information suppression. A rigorous theoretical derivation is provided to link cosine similarity with the physical correlation of engine fault acoustic-visual features, justifying the attention weight constraint (β = 1 − α) from the perspective of fault feature physical coupling. Third, an improved lightweight XGBoost classifier is constructed for fault classification, and a hybrid data augmentation strategy customized for engine multi-modal data is proposed to address the small-sample challenge in industrial applications. Ablation experiments on ResNet18 pruning ratios verify the optimal trade-off between diagnostic performance and computational efficiency, while feature distribution analysis validates the authenticity and effectiveness of the hybrid augmentation strategy. Experimental results on a self-constructed multi-modal dataset show that the proposed method achieves 98.7% diagnostic accuracy and a 98.2% F1-score, retaining 96.5% accuracy under 90 dB high-level environmental noise, with an end-to-end inference speed of 0.8 ms per sample (including preprocessing, feature extraction, and classification). Cross-engine and cross-domain validation on a 2.0T diesel engine small-sample dataset and the open-source SEMFault-2024 dataset yield average accuracies of 94.8% and 95.2%, respectively, demonstrating strong generalization. This method effectively enhances the accuracy and robustness of engine abnormal sound defect diagnosis, offering a lightweight technical solution for on-board real-time fault diagnosis and in-plant online quality inspection. By reducing engine fault-induced energy loss and spare parts waste, it further promotes energy conservation and emission reduction in the automotive industry. Quantified experimental data on fuel efficiency improvement and carbon emission reduction are provided to substantiate the ecological benefits of the proposed framework.
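A minimal sketch of the stated weight constraint: the audio attention weight α is derived from the cosine similarity between audio and image features, and the image weight is β = 1 − α, as the abstract states. The rescaling of similarity into [0, 1] is an assumption for illustration, not the authors' derivation.

```python
# Sketch of the attention-weight constraint beta = 1 - alpha, with alpha
# tied to audio-image cosine similarity. The similarity-to-alpha mapping is
# an assumption; only the constraint itself comes from the abstract.
import torch
import torch.nn.functional as F

def ppe_cma_weights(audio_feat, image_feat):
    sim = F.cosine_similarity(audio_feat, image_feat, dim=-1)  # in [-1, 1]
    alpha = 0.5 * (sim + 1.0)   # assumed rescaling to [0, 1]
    beta = 1.0 - alpha          # constraint stated in the abstract
    return alpha, beta

a, v = torch.randn(2, 64), torch.randn(2, 64)
alpha, beta = ppe_cma_weights(a, v)
fused = alpha.unsqueeze(-1) * a + beta.unsqueeze(-1) * v
print(alpha + beta)  # always 1 by construction
```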

26 pages, 953 KB  
Article
A Modular Approach to Automated News Generation Using Large Language Models
by Omar Juárez Gambino, Consuelo Varinia García Mendoza, Braulio Hernandez Minutti, Carol-Michelle Zapata-Manilla, Marco-Antonio Bernal-Trani and Hiram Calvo
Information 2026, 17(4), 319; https://doi.org/10.3390/info17040319 - 25 Mar 2026
Viewed by 329
Abstract
Advances in Generative Artificial Intelligence have enabled the development of models capable of generating text, images, and audio that are similar to what humans can create. These models often have valuable general knowledge thanks to their training on large datasets. Through fine-tuning or prompt-based adaptation, this knowledge can be applied to specific tasks. In this work, we propose a modular approach to automated news generation using Large Language Models, composed of an information retrieval module and a text generation module. The proposed system leverages both publicly available (open-weight) and proprietary Large Language Models, enabling a comparative evaluation of their behavior within the proposed news generation pipeline. We describe the experiments carried out with a total of five representative Large Language Models spanning both categories, detailing their configurations and performance. The results demonstrate the feasibility of using Large Language Models to automate this task and identify systematic differences in behavior across model categories, as well as the problems that remain to be solved to enable fully autonomous news generation.
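A toy sketch of the two-module decomposition described above, with retrieval feeding generation; the lexical scorer, data types, and prompt format are placeholders, not the authors' pipeline.

```python
# Toy sketch of a retrieval module feeding a generation module. The Source
# type, lexical ranking, and prompt template are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class Source:
    title: str
    body: str

def retrieve(query: str, corpus: list, k: int = 3) -> list:
    """Toy lexical retrieval: rank sources by query-term overlap."""
    terms = set(query.lower().split())
    return sorted(corpus, key=lambda s: -len(terms & set(s.body.lower().split())))[:k]

def build_prompt(query: str, sources: list) -> str:
    """Assemble the generation-module prompt from retrieved context."""
    context = "\n\n".join(f"[{s.title}]\n{s.body}" for s in sources)
    return f"Write a news article about: {query}\n\nVerified sources:\n{context}"

corpus = [Source("wire-1", "city council approves new transit budget"),
          Source("wire-2", "local team wins championship")]
print(build_prompt("transit budget vote", retrieve("transit budget", corpus)))
```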

16 pages, 53570 KB  
Article
A Multimodal In-Ear Audio and Physiological Dataset for Swallowing and Non-Verbal Event Classification
by Elyes Ben Cheikh, Yassine Mrabet, Catherine Laporte and Rachel E. Bouserhal
Sensors 2026, 26(7), 2019; https://doi.org/10.3390/s26072019 - 24 Mar 2026
Viewed by 365
Abstract
Swallowing is a critical marker of neurological and emotional health. The ability to monitor it continuously and non-invasively, especially through smart ear-worn devices, holds significant promise for clinical applications. Despite this potential, no public audio datasets currently support reliable swallowing sound detection. Existing datasets focus primarily on speech and breathing, offering limited coverage and lacking detailed annotations for swallowing events. To address this gap, we introduce an in-ear audio dataset specifically designed to capture a wide range of verbal and non-verbal sounds. It includes comprehensive labeling focused on swallowing. The dataset was collected from 34 healthy adults (14 females and 20 males) between the ages of 20 and 29. Each participant performed a series of predefined tasks involving both non-verbal and verbal events. Non-verbal tasks included swallowing, clicking, forceful blinking, touching the scalp, and physical movements such as squatting or walking in place. Verbal tasks consisted of speaking (e.g., describing an image). Recordings were conducted in both quiet and noisy environments to better reflect real-world conditions. Data were captured using a combination of in-/outer-ear microphones, a chest belt to record electrocardiogram (ECG), respiration and acceleration signals, and an ultrasound probe to track tongue movement, which served as a reference for swallowing annotation. All signals were precisely synchronized. To ensure high data quality, the recordings were reviewed using both algorithmic analysis and manual inspection. Swallowing events were identified based on ultrasound signals and validated by an expert to guarantee accurate labeling. As a proof of concept that in-ear audio supports swallow classification, we fine-tune a fully connected neural network on YAMNet embeddings plus zero-crossing rate (ZCR) features. Across the completed folds, the model reaches an F1 score of 0.875 ± 0.013.
(This article belongs to the Special Issue Sensors for Physiological Monitoring and Digital Health: 2nd Edition)
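A minimal sketch of the proof-of-concept classifier described in the abstract: clip-level YAMNet embeddings concatenated with a zero-crossing-rate feature, fed to a small dense head. The head architecture is an assumption; YAMNet is the public TF-Hub model and expects 16 kHz mono float32 audio.

```python
# Sketch of YAMNet embeddings + ZCR feeding a small dense head. The head
# sizes are assumptions; the YAMNet model and its 1024-d embeddings are the
# public TF-Hub release.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def features(waveform_16k: np.ndarray) -> np.ndarray:
    """Clip-level YAMNet embedding plus a simple zero-crossing-rate feature."""
    _, embeddings, _ = yamnet(waveform_16k)           # (frames, 1024)
    emb = tf.reduce_mean(embeddings, axis=0).numpy()  # mean-pool over frames
    zcr = np.mean(np.abs(np.diff(np.sign(waveform_16k))) > 0)  # crossing rate
    return np.concatenate([emb, [zcr]])               # (1025,)

head = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(1025,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # swallow vs. non-swallow
])
head.compile(optimizer="adam", loss="binary_crossentropy")
# feats = features(np.zeros(16000, dtype=np.float32))  # one second of silence
```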

37 pages, 5953 KB  
Article
Fire Detection Using Sound Analysis Based on a Hybrid Artificial Intelligence Algorithm
by Robert-Nicolae Boştinaru, Sebastian-Alexandru Drǎguşin, Nicu Bizon, Dumitru Cazacu and Gabriel-Vasile Iana
Algorithms 2026, 19(3), 240; https://doi.org/10.3390/a19030240 - 23 Mar 2026
Viewed by 285
Abstract
Fire detection is a critical task for early warning systems, particularly in environments where visual sensing is unreliable. While most existing approaches rely on image-based or smoke-based detection, acoustic signals provide complementary information capable of capturing early combustion-related events. This study investigates deep learning models for sound-based fire detection, focusing on convolutional and Transformer-based architectures. VGG16 and VGG19 convolutional neural networks are adapted to process time-frequency audio representations for binary classification into Fire and No-Fire classes. An Audio Spectrogram Transformer (AST) is further employed to model long-range temporal dependencies in acoustic data. Finally, a hybrid VGG19-AST architecture is proposed, in which convolutional layers extract local spectral–temporal features, and Transformer-based self-attention performs global sequence modeling. The models are evaluated on a curated dataset containing fire sounds and diverse environmental background noises under multiple noise conditions. Experimental results demonstrate competitive performance across convolutional and Transformer-based models, while the proposed hybrid VGG19-AST architecture achieves the most consistent overall results. The findings suggest that integrating convolutional feature extraction with self-attention-based global modeling enhances robustness under complex acoustic variability. The proposed hybrid framework provides a scalable and cost-effective solution for sound-based fire detection, particularly in scenarios where visual monitoring may be obstructed or ineffective.
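To illustrate the convolution-plus-self-attention pattern that the hybrid model exemplifies, here is a compact PyTorch sketch; all layer sizes are assumptions, and this is not the VGG19-AST configuration.

```python
# Illustrative conv + self-attention hybrid: convolutions extract local
# spectro-temporal features, a Transformer encoder models the global
# sequence. Sizes are assumptions, not the paper's VGG19-AST setup.
import torch
import torch.nn as nn

class ConvAttnHybrid(nn.Module):
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, d_model, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 2)
        self.head = nn.Linear(d_model, 2)  # Fire / No-Fire

    def forward(self, spec):               # spec: (batch, 1, mels, frames)
        x = self.conv(spec)                # local spectro-temporal features
        x = x.flatten(2).transpose(1, 2)   # tokens: (batch, seq, d_model)
        return self.head(self.attn(x).mean(dim=1))

logits = ConvAttnHybrid()(torch.randn(2, 1, 64, 128))
print(logits.shape)  # torch.Size([2, 2])
```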

19 pages, 992 KB  
Article
Hybrid Music Similarity with Hypergraph and Siamese Network
by Sera Kim, Youngjun Kim, Jaewon Lee and Dalwon Jang
Big Data Cogn. Comput. 2026, 10(3), 96; https://doi.org/10.3390/bdcc10030096 - 21 Mar 2026
Viewed by 333
Abstract
This paper proposes a novel method for measuring music similarity. Whereas existing music similarity measures have mostly served music appreciation, the proposed method measures similarity between the music samples used in music production. Conventional music recommendation approaches often rely on either metadata-based similarity or audio-based feature similarity in isolation, which limits their effectiveness in sample-based recommendation scenarios where both compositional context and acoustic characteristics are important. To address this limitation, the proposed framework combines a hypergraph-based information similarity module with a feature-based similarity module learned using Siamese networks and triplet loss. In the information-based module, metadata attributes such as beats per minute (BPM), genre, chord, key, and instrument are modeled as vertices in a hypergraph, and Random Walk–Word2Vec embeddings are learned to capture structural relationships between music samples and their attributes. In parallel, the feature-based module employs vertex-specific Siamese networks trained on instrument and key classification tasks to learn perceptual similarity directly from audio signals. The two modules are trained independently and jointly utilized at the recommendation stage to provide attribute-specific similarity results for a given query sample. Results show that the proposed system achieves high Precision@k across multiple attributes and forms stable similarity structures in the embedding space, even without relying on user interaction data. These results reflect embedding consistency evaluated over the entire dataset, where training and retrieval are performed on the same sample pool, rather than generalization to unseen samples. These results demonstrate that the proposed hybrid framework effectively captures both structural and perceptual similarity among music samples and is well suited for sample-based music recommendation in music production environments.
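A minimal sketch of the feature-based module's training signal: a Siamese embedding network optimized with triplet loss, so that same-attribute samples embed closer together than different-attribute ones. The input dimensionality and network depth are assumptions.

```python
# Sketch of Siamese embedding training with triplet loss: the shared
# network pulls anchor-positive pairs together and pushes anchor-negative
# pairs apart. Feature and embedding sizes are assumptions.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 64))
loss_fn = nn.TripletMarginLoss(margin=0.2)

anchor, positive, negative = (torch.randn(16, 512) for _ in range(3))
loss = loss_fn(embed(anchor), embed(positive), embed(negative))
loss.backward()  # gradients shape the embedding space for similarity search
print(loss.item())
```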

30 pages, 9811 KB  
Article
Audio-Based Screening of Respiratory Diseases Using Machine Learning: A Methodological Framework Evaluated on a Clinically Validated COVID-19 Cough Dataset
by Arley Magnolia Aquino-García, Humberto Pérez-Espinosa, Javier Andreu-Perez and Ansel Y. Rodríguez González
Mach. Learn. Knowl. Extr. 2026, 8(3), 80; https://doi.org/10.3390/make8030080 - 20 Mar 2026
Viewed by 343
Abstract
The development of AI-driven computational methods has enabled rapid and non-invasive analysis of respiratory sounds using acoustic data, particularly cough recordings. Although the COVID-19 pandemic accelerated research on cough-based acoustic analysis, many early studies were limited by insufficient data quality, lack of standardized protocols, and limited reproducibility due to data scarcity. In this study, we propose an audio analysis framework for cough-based respiratory disease screening research, using COVID-19 as a clinically validated case study. All analyses were conducted on a single clinically acquired multicentric dataset collected under standardized conditions in certified laboratories in Mexico and Spain, comprising cough recordings from 1105 individuals. Model training and testing were performed exclusively within this dataset. The framework incorporates signal preprocessing and a comparative evaluation of segmentation strategies, showing that segmented cough analysis significantly outperforms full-signal analysis. Class imbalance was addressed using the Synthetic Minority Over-sampling Technique (SMOTE) for CNN2D models and the supervised Resample filter implemented in WEKA for classical machine learning models, both applied exclusively to the training subset to generate balanced training sets and prevent data leakage. Feature extraction and classification were carried out using Random Forest, Support Vector Machine (SVM), XGBoost, and a 2D Convolutional Neural Network (CNN2D), with hyperparameter optimization via AutoML. The proposed framework achieved a best balanced screening performance of 85.58% sensitivity and 86.65% specificity (Random Forest with GeMAPSvB01), while the highest-specificity configuration reached 93.90% specificity with 18.14% sensitivity (CNN2D with SMOTE and AutoML). These results demonstrate the methodological feasibility of the proposed framework under the evaluated conditions.
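A minimal sketch of the leakage-safe balancing step the authors emphasize: SMOTE is fit on the training split only, leaving the test split at its natural class ratio. Feature dimensionality and classifier settings are placeholders.

```python
# Sketch of leakage-safe oversampling: SMOTE resamples only the training
# split; the held-out test split keeps its natural imbalance. Feature
# dimensions and labels here are synthetic placeholders.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.randn(1105, 88)                      # e.g., GeMAPS-style features
y = (np.random.rand(1105) < 0.3).astype(int)       # imbalanced labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # train only
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print(clf.score(X_te, y_te))                       # untouched test set
```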

19 pages, 7310 KB  
Article
Mathematical Benchmarking of Convolutional Neural Networks for Thai Dialect Recognition: A Spectrogram Texture Classification Approach
by Porawat Visutsak, Duongduen Ongrungruaeng, Surapong Wiriya and Keun Ho Ryu
Electronics 2026, 15(6), 1271; https://doi.org/10.3390/electronics15061271 - 18 Mar 2026
Viewed by 298
Abstract
This study rigorously evaluates 13 Convolutional Neural Network (CNN) architectures for Thai dialect recognition. By treating Automatic Speech Recognition (ASR) as a computer vision texture classification task, we processed an extensive 840-hour dataset from the Spoken Language Systems, Chulalongkorn University (SLSCU) corpus. Raw audio from four major dialects—Central, Northern (Khummuang), Northeastern (Korat), and Southern (Pattani)—was transformed into 2D Mel-spectrograms using the Short-Time Fourier Transform (STFT). We analyzed a diverse range of architectures, including the VGG, Inception, ResNet, DenseNet, and MobileNet families, to establish the optimal trade-off between mathematical complexity and spectral feature extraction. Our experimental results identify NASNet-Mobile as the most effective model, achieving a macro-average F1-score of 0.9425. The analysis suggests that NASNet's search-optimized cell structure is uniquely capable of capturing the multiscale texture of phonetic formants. In contrast, we observed a catastrophic mode collapse in VGG16 (32.97% accuracy), likely due to excessive parameter bloat, while Xception and MobileNetV2 maintained robust generalization. Confusion matrix analysis reveals high acoustic distinctiveness for Southern Thai (96.7% recall), whereas Northern Thai exhibits significant spectral overlap with Central Thai. These results support the hypothesis that CNNs interpret spectrograms as textures rather than discrete objects, positioning NASNet-Mobile as a high-performance, low-latency baseline for edge-device deployment in resource-constrained environments.
(This article belongs to the Special Issue Advances in Machine Learning for Image Classification)
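A minimal sketch of the front end described above: raw audio is converted to a 2D Mel-spectrogram via the STFT and then treated as an image texture. The sample rate and STFT parameters are assumptions.

```python
# Sketch of the STFT -> Mel-spectrogram front end that turns speech into a
# CNN-ready "texture". FFT size, hop length, and mel-band count are
# assumptions, not the paper's settings.
import numpy as np
import librosa

def mel_image(path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr, mono=True)
    stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256)) ** 2  # power STFT
    mel = librosa.feature.melspectrogram(S=stft, sr=sr, n_mels=128)
    return librosa.power_to_db(mel, ref=np.max)  # (128, frames), dB scale

# img = mel_image("dialect_clip.wav")  # input "texture" for NASNet-Mobile etc.
```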

17 pages, 4901 KB  
Article
A New Portable Smart Percussion System Embedded on Raspberry Pi for Bolt Looseness Detection
by Weiliang Zheng, Duanhang Zhang, Keyu Du and Furui Wang
Machines 2026, 14(3), 337; https://doi.org/10.3390/machines14030337 - 16 Mar 2026
Viewed by 331
Abstract
Bolted joints are extensively used in a wide range of industrial and commercial structures, making their condition monitoring essential for ensuring structural integrity and operational safety. Under the influence of vibration, cyclic loading, and environmental factors, bolts may gradually lose preload, which can degrade joint stiffness and eventually lead to structural failure. To address this issue, this study presents a smart percussion system developed on a Raspberry Pi platform that integrates acoustic signal acquisition, real-time signal processing, and visualization of diagnostic results. A bolt looseness detection strategy combining audio feature extraction with unsupervised learning is proposed. In contrast to traditional percussion-based approaches that depend on supervised learning and predefined baseline datasets, the proposed method does not require prior reference data, significantly improving its adaptability and ease of deployment across different structures, a property of clear practical significance. Experimental investigations demonstrate the effectiveness and advantages of the proposed system, indicating its strong potential to enhance percussion-based bolt looseness detection and to support real-time structural health monitoring in real-world engineering applications.
(This article belongs to the Special Issue AI-Driven Reliability Analysis and Predictive Maintenance)
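A minimal sketch of baseline-free screening in the spirit of the paper's unsupervised approach: cluster per-tap audio features and flag the minority cluster as suspect, with no prior reference data. The synthetic MFCC-like features and the two-cluster assumption are illustrative only.

```python
# Sketch of unsupervised, baseline-free percussion screening: cluster tap
# features and flag the minority cluster. The stand-in MFCC vectors and
# two-cluster assumption are illustrative, not the authors' pipeline.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.3, size=(40, 13))   # stand-in MFCC vectors
loose = rng.normal(2.0, 0.3, size=(5, 13))    # looser bolts sound different
feats = np.vstack([tight, loose])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
minority = np.argmin(np.bincount(labels))      # smaller cluster = suspect taps
print(np.where(labels == minority)[0])         # indices flagged as loose
```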

25 pages, 3328 KB  
Article
End-to-End Acoustic Classification of Respiratory Sounds Using Multi-Architecture Deep Neural Networks
by Btissam Bouzammour, Ghita Zaz, Malika Alami Marktani, Abdellah Touhafi, Anas El Ouali and Mohammed Jorio
Technologies 2026, 14(3), 178; https://doi.org/10.3390/technologies14030178 - 16 Mar 2026
Viewed by 318
Abstract
Respiratory diseases constitute a major global health burden, necessitating accurate and reliable diagnostic support tools. Conventional auscultation, despite its widespread clinical use, remains inherently subjective and susceptible to inter-observer variability. In this study, we propose a unified deep learning framework for the automated classification of respiratory sound recordings into four clinically relevant categories: Normal, Crackles, Wheezes, and Crackles + Wheezes. The experimental evaluation was conducted on a publicly available dataset comprising heterogeneous respiratory recordings collected from both patients with pulmonary pathologies and healthy individuals. All audio signals were subjected to standardized preprocessing procedures to enhance signal consistency and ensure reliable feature extraction across acquisition conditions. To ensure methodological rigor and prevent optimistic bias, a strict subject-independent validation strategy was adopted using 5-fold GroupKFold cross-validation based on patient identifiers. Six deep learning architectures were systematically implemented and comparatively evaluated under a controlled and reproducible training protocol, including convolutional (1D-CNN, Deep-CNN), recurrent hybrid (CNN–LSTM, CNN–BiLSTM), and attention-based (CNN–Attention, CNN–Transformer) models. Performance metrics were reported as mean ± standard deviation across folds. The CNN–Attention architecture achieved the best overall performance, yielding a Balanced Accuracy of 90.1% ± 1.8% and a macro F1-score of 89.7% ± 2.1%, demonstrating stable inter-patient generalization. These findings indicate that attention-enhanced hybrid architectures effectively capture both local spectral structures and long-range temporal dependencies inherent in respiratory signals. The proposed framework provides a robust foundation for subject-independent automated lung sound classification and contributes to the development of clinically reliable decision-support systems.
(This article belongs to the Section Assistive Technologies)
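A minimal sketch of the subject-independent protocol: scikit-learn's GroupKFold splits by patient identifier so that no subject appears in both the training and test folds. Data shapes here are placeholders.

```python
# Sketch of subject-independent validation with GroupKFold: folds are split
# by patient ID, so no patient leaks across train/test. Shapes are
# synthetic placeholders.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.randn(200, 40)               # one row per recording segment
y = np.random.randint(0, 4, 200)            # Normal / Crackles / Wheezes / Both
patients = np.random.randint(0, 30, 200)    # patient identifier per segment

for tr, te in GroupKFold(n_splits=5).split(X, y, groups=patients):
    assert not set(patients[tr]) & set(patients[te])  # no subject leakage
print("all folds subject-disjoint")
```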

21 pages, 656 KB  
Article
Acoustic Violence Detection Using Cascade Strategy for Computationally Constrained Scenarios
by Fangfang Zhu-Zhou, Diana Tejera-Berengué, Roberto Gil-Pita, Manuel Utrilla-Manso and Manuel Rosa-Zurera
Electronics 2026, 15(6), 1227; https://doi.org/10.3390/electronics15061227 - 16 Mar 2026
Viewed by 228
Abstract
Detecting violent content in audio recordings is crucial for public safety, autonomous surveillance, and content moderation, particularly when visual cues are unreliable or unavailable. A resource-aware two-stage cascade system is proposed for acoustic violence detection that combines a lightweight Least Squares Linear Detector (LSLD) as a first-stage screener with a trimmed version of YAMNet as a second-stage classifier. A percentile-based forwarding rule controls the fraction of segments routed to the deep stage, turning the accuracy–cost trade-off into an explicit operating parameter for always-on deployment. The approach is evaluated on a publicly released dataset of real-world violent audio augmented with background noise and artificial reverberation. The results in the low-false-alarm regime show that the proposed cascade preserves performance close to a Stage 2-only baseline while substantially reducing average deep-inference workload. An ablation study validates the role of the LSLD as an inexpensive pre-filter, and robustness is assessed under clean, reverberant, and 12 dB noise conditions. Finally, an analytic energy consumption model is provided, which links computational workload to daily energy demand and photovoltaic sizing on ultra-low-power hardware, supporting sustainable off-grid deployment.
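A minimal sketch of the percentile-based forwarding rule: every segment gets a cheap stage-1 score, and only the top fraction is routed to the expensive stage-2 model, making the accuracy–cost trade-off an explicit operating parameter. Both scoring functions below are stand-ins.

```python
# Sketch of a two-stage cascade with percentile forwarding: the forwarded
# fraction is an explicit operating parameter. Both stage functions are
# placeholders, not the LSLD or trimmed YAMNet.
import numpy as np

def cascade(segments, stage1, stage2, forward_pct=20.0):
    s1 = np.array([stage1(s) for s in segments])
    cut = np.percentile(s1, 100.0 - forward_pct)     # forwarding threshold
    return [stage2(s) if v >= cut else 0 for s, v in zip(segments, s1)]

rng = np.random.default_rng(0)
segs = list(rng.standard_normal((100, 16000)))       # 1 s segments at 16 kHz
out = cascade(segs,
              stage1=lambda s: float(np.mean(s ** 2)),  # cheap energy score
              stage2=lambda s: 1)                       # stand-in deep model
print(sum(out), "segments reached the deep stage")      # ~20 of 100
```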

32 pages, 7928 KB  
Article
eXCube2: Explainable Brain-Inspired Spiking Neural Network Framework for Emotion Recognition from Audio, Visual and Multimodal Audio–Visual Data
by N. K. Kasabov, A. Yang, Z. Wang, I. Abouhassan, A. Kassabova and T. Lappas
Biomimetics 2026, 11(3), 208; https://doi.org/10.3390/biomimetics11030208 - 14 Mar 2026
Viewed by 411
Abstract
This paper introduces a biomimetic framework and novel brain-inspired AI (BIAI) models based on spiking neural networks (SNNs) for emotional state recognition from audio (speech), visual (face), and integrated multimodal audio–visual data. The developed framework, named eXCube2, uses the three-dimensional SNN architecture NeuCube, which is spatially structured according to a human brain template. The BIAI models developed in eXCube2 are trainable on spatio- and spectro-temporal data using brain-inspired learning rules. Such models are explainable in terms of revealing patterns in data and are adaptable to new data. The eXCube2 models are implemented as software systems and tested on speech and video data of subjects expressing emotional states. The use of a brain template for the SNN structure enables brain-inspired tonotopic and stereo mapping of audio inputs, topographic mapping of visual data, and the combined use of both modalities. This novel approach brings AI-based emotional state recognition closer to human perception and provides better explainability and adaptability than existing AI systems. It also results in higher or competitive accuracy, even though this was not the main goal here. This is demonstrated through experiments on benchmark datasets, achieving classification accuracy above 80% on single-modality data and 88.9% when multimodal audio–visual data are used and a "don't know" output is introduced. The paper further discusses possible applications of the proposed eXCube2 framework to other audio, visual, and audio–visual data for solving challenging problems, such as recognizing emotional states of people from different origins; brain state diagnosis (e.g., Parkinson's disease, Alzheimer's disease, ADHD, dementia); measuring response to treatment over time; evaluating satisfaction responses from online clients; cognitive robotics; human–robot interaction; chatbots; and interactive computer games. The SNN-based implementation of BIAI also enables the use of neuromorphic chips and platforms, leading to reduced power consumption, smaller device size, higher performance accuracy, and improved adaptability and explainability. This research shows a step toward building brain-inspired AI systems.
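A minimal sketch of the "don't know" output mentioned in the abstract, implemented here as confidence-threshold rejection: predictions whose top class probability falls below a threshold are withheld rather than forced into an emotion class. The threshold value is an assumption, not the authors' mechanism.

```python
# Sketch of a "don't know" reject option via confidence thresholding. The
# 0.6 threshold is an assumed value for illustration.
import numpy as np

def classify_with_reject(probs: np.ndarray, threshold: float = 0.6):
    top = probs.argmax(axis=-1)
    return np.where(probs.max(axis=-1) >= threshold, top, -1)  # -1 = don't know

probs = np.array([[0.9, 0.05, 0.05],   # confident -> class 0
                  [0.4, 0.35, 0.25]])  # ambiguous -> don't know
print(classify_with_reject(probs))     # [ 0 -1]
```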

22 pages, 1747 KB  
Review
Talking Head Generation Through Generative Models and Cross-Modal Synthesis Techniques
by Hira Nisar, Salman Masood, Zaki Malik and Adnan Abid
J. Imaging 2026, 12(3), 119; https://doi.org/10.3390/jimaging12030119 - 10 Mar 2026
Viewed by 574
Abstract
Talking Head Generation (THG) is a rapidly advancing field at the intersection of computer vision, deep learning, and speech synthesis, enabling the creation of animated human-like heads that can produce speech and express emotions with high visual realism. The core objective of THG systems is to synthesize coherent and natural audio–visual outputs by modeling the intricate relationship between speech signals, facial dynamics, and emotional cues. These systems find widespread applications in virtual assistants, interactive avatars, video dubbing for multilingual content, educational technologies, and immersive virtual and augmented reality environments. Moreover, the development of THG has significant implications for accessibility technologies, cultural preservation, and remote healthcare interfaces. This survey paper presents a comprehensive and systematic overview of the technological landscape of Talking Head Generation. We begin by outlining the foundational methodologies that underpin the synthesis process, including generative adversarial networks (GANs), motion-aware recurrent architectures, and attention-based models. A taxonomy is introduced to organize the diverse approaches based on the nature of input modalities and generation goals. We further examine the contributions of various domains such as computer vision, speech processing, and human–robot interaction, each of which plays a critical role in advancing the capabilities of THG systems. The paper also provides a detailed review of datasets used for training and evaluating THG models, highlighting their coverage, structure, and relevance. In parallel, we analyze widely adopted evaluation metrics, categorized by their focus on image quality, motion accuracy, synchronization, and semantic fidelity. Operating parameters such as latency, frame rate, resolution, and real-time capability are also discussed to assess deployment feasibility. Special emphasis is placed on the integration of generative artificial intelligence (GenAI), which has significantly enhanced the adaptability and realism of talking head systems through more powerful and generalizable learning frameworks.
