Search Results (478)

Search Parameters:
Keywords = audio classification

21 pages, 26320 KB  
Article
Agent-Based Models of Sexual Selection in Bird Vocalizations Using Generative Approaches
by Hao Zhao, Takaya Arita and Reiji Suzuki
Appl. Sci. 2025, 15(19), 10481; https://doi.org/10.3390/app151910481 (registering DOI) - 27 Sep 2025
Abstract
The current agent-based evolutionary models for animal communication rely on simplified signal representations that differ significantly from natural vocalizations. We propose a novel agent-based evolutionary model based on text-to-audio (TTA) models to generate realistic animal vocalizations, advancing from VAE-based real-valued genotypes to TTA-based textual genotypes that generate bird songs using a fine-tuned Stable Audio Open 1.0 model. In our sexual selection framework, males vocalize songs encoded by their genotypes while females probabilistically select mates based on the similarity between males’ songs and their preference patterns, with mutations and crossovers applied to textual genotypes using a large language model (Gemma-3). As a proof of concept, we compared TTA-based and VAE-based sexual selection models for the Blue-and-white Flycatcher (Cyanoptila cyanomelana)’s songs and preferences. While the VAE-based model produces population clustering but constrains the evolution to a narrow region near the latent space’s origin where reconstructed songs remain clear, the TTA-based model enhances the genotypic and phenotypic diversity, drives song diversification, and fosters the creation of novel bird songs. Generated songs were validated by a virtual expert using the BirdNET classifier, confirming their acoustic realism through classification into related taxa. These findings highlight the potential of combining large language models and TTA models in agent-based evolutionary models for animal communication. Full article
(This article belongs to the Special Issue Evolutionary Algorithms and Their Real-World Applications)
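
For readers unfamiliar with this kind of model, the toy sketch below illustrates the probabilistic mate-choice rule summarized in the abstract: each female selects a male with probability increasing in the similarity between his song and her preference pattern. The embeddings, the cosine similarity, and the softmax-style selection are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of similarity-based mate choice; all quantities are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
male_songs   = rng.normal(size=(20, 8))   # hypothetical song embeddings, one row per male
female_prefs = rng.normal(size=(20, 8))   # hypothetical preference patterns, one row per female

def choose_mates(songs, prefs, temperature=1.0):
    # Cosine similarity between every female preference and every male song.
    s = songs / np.linalg.norm(songs, axis=1, keepdims=True)
    p = prefs / np.linalg.norm(prefs, axis=1, keepdims=True)
    sim = p @ s.T                                    # (n_females, n_males)
    probs = np.exp(sim / temperature)
    probs /= probs.sum(axis=1, keepdims=True)        # each row is a choice distribution
    return np.array([rng.choice(len(songs), p=row) for row in probs])

mates = choose_mates(male_songs, female_prefs)       # index of the chosen male per female
```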

15 pages, 10412 KB  
Article
Application of Foundation Models for Colorectal Cancer Tissue Classification in Mass Spectrometry Imaging
by Alon Gabriel, Amoon Jamzad, Mohammad Farahmand, Martin Kaufmann, Natasha Iaboni, David Hurlbut, Kevin Yi Mi Ren, Christopher J. B. Nicol, John F. Rudan, Sonal Varma, Gabor Fichtinger and Parvin Mousavi
Technologies 2025, 13(10), 434; https://doi.org/10.3390/technologies13100434 (registering DOI) - 27 Sep 2025
Abstract
Colorectal cancer (CRC) remains a leading global health challenge, with early and accurate diagnosis crucial for effective treatment. Histopathological evaluation, the current diagnostic gold standard, faces limitations including subjectivity, delayed results, and reliance on well-prepared tissue slides. Mass spectrometry imaging (MSI) offers a complementary approach by providing molecular-level information, but its high dimensionality and the scarcity of labeled data present unique challenges for traditional supervised learning. In this study, we present the first implementation of foundation models for MSI-based cancer classification using desorption electrospray ionization (DESI) data. We evaluate multiple architectures adapted from other domains, including a spectral classification model known as FACT, which leverages audio–language pretraining. Compared to conventional machine learning approaches, these foundation models achieved superior performance, with FACT achieving the highest cross-validated balanced accuracy (93.27%±3.25%) and AUROC (98.4%±0.7%). Ablation studies demonstrate that these models retain strong performance even under reduced data conditions, highlighting their potential for generalizable and scalable MSI-based cancer diagnostics. Future work will explore the integration of spatial and multi-modal data to enhance clinical utility. Full article
(This article belongs to the Special Issue Application of Artificial Intelligence in Medical Image Analysis)
15 pages, 2557 KB  
Article
Heart Murmur Detection in Phonocardiogram Data Leveraging Data Augmentation and Artificial Intelligence
by Melissa Valaee and Shahram Shirani
Diagnostics 2025, 15(19), 2471; https://doi.org/10.3390/diagnostics15192471 (registering DOI) - 27 Sep 2025
Abstract
Background/Objectives: With a 17.9 million annual mortality rate, cardiovascular disease is the leading global cause of death. As such, early detection and disease diagnosis are critical for effective treatment and symptom management. Cardiac auscultation, the process of listening to the heartbeat, often provides the first indication of underlying cardiac conditions. This practice allows for the identification of heart murmurs caused by turbulent blood flow. In this exploratory research paper, we propose an AI model to streamline this process to improve diagnostic accuracy and efficiency. Methods: We utilized data from the 2022 George Moody PhysioNet Heart Sound Classification Challenge, comprising phonocardiogram recordings of individuals under 21 years of age in Northeast Brazil. Only patients who had recordings from all four heart valves were included in our dataset. Audio files were synchronized across all recordings and converted to Mel spectrograms before being passed into a pre-trained Vision Transformer, and finally a MiniROCKET model. Additionally, data augmentation was conducted on audio files and spectrograms to generate new data, extending our total sample size from 928 spectrograms to 14,848. Results: Compared to the existing methods in the literature, our model yielded significantly enhanced quality assessment metrics, including Weighted Accuracy, Sensitivity, and F-Score, and resulted in a fast evaluation speed of 0.02 s per patient. Conclusions: The implementation of our method for the detection of heart murmurs can supplement physician diagnosis and contribute to earlier detection of underlying cardiovascular conditions, fast diagnosis times, increased scalability, and enhanced adaptability. Full article
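
As a point of reference for the preprocessing described above, here is a minimal sketch of turning a phonocardiogram recording into a log-scaled Mel spectrogram with librosa. The file name, sampling rate, and spectrogram settings are assumptions, not the authors' exact configuration.

```python
# Minimal Mel-spectrogram preprocessing sketch for a phonocardiogram clip (assumed settings).
import numpy as np
import librosa

def pcg_to_mel_db(path, sr=4000, n_mels=128, n_fft=1024, hop_length=256):
    """Load a heart-sound recording and return a dB-scaled Mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)                # resample to a common rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# Hypothetical usage: one spectrogram per heart-valve recording of a patient.
# spectrograms = [pcg_to_mel_db(f"patient_001_{valve}.wav") for valve in ("AV", "PV", "TV", "MV")]
```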

36 pages, 35564 KB  
Article
Enhancing Soundscape Characterization and Pattern Analysis Using Low-Dimensional Deep Embeddings on a Large-Scale Dataset
by Daniel Alexis Nieto Mora, Leonardo Duque-Muñoz and Juan David Martínez Vargas
Mach. Learn. Knowl. Extr. 2025, 7(4), 109; https://doi.org/10.3390/make7040109 - 24 Sep 2025
Abstract
Soundscape monitoring has become an increasingly important tool for studying ecological processes and supporting habitat conservation. While many recent advances focus on identifying species through supervised learning, there is growing interest in understanding the soundscape as a whole while considering patterns that extend beyond individual vocalizations. This broader view requires unsupervised approaches capable of capturing meaningful structures related to temporal dynamics, frequency content, spatial distribution, and ecological variability. In this study, we present a fully unsupervised framework for analyzing large-scale soundscape data using deep learning. We applied a convolutional autoencoder (Soundscape-Net) to extract acoustic representations from over 60,000 recordings collected across a grid-based sampling design in the Rey Zamuro Reserve in Colombia. These features were initially compared with other audio characterization methods, showing superior performance in multiclass classification, with accuracies of 0.85 for habitat cover identification and 0.89 for time-of-day classification across 13 days. For the unsupervised study, optimized dimensionality reduction methods (Uniform Manifold Approximation and Projection and Pairwise Controlled Manifold Approximation and Projection) were applied to project the learned features, achieving trustworthiness scores above 0.96. Subsequently, clustering was performed using KMeans and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), with evaluations based on metrics such as the silhouette, where scores above 0.45 were obtained, thus supporting the robustness of the discovered latent acoustic structures. To interpret and validate the resulting clusters, we combined multiple strategies: spatial mapping through interpolation, analysis of acoustic index variance to understand the cluster structure, and graph-based connectivity analysis to identify ecological relationships between the recording sites. Our results demonstrate that this approach can uncover both local and broad-scale patterns in the soundscape, providing a flexible and interpretable pathway for unsupervised ecological monitoring. Full article
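
A hedged sketch of the unsupervised stage outlined above: project learned embeddings with UMAP, cluster the projection with KMeans and DBSCAN, and score the result with the silhouette. The embedding file and all hyperparameters are placeholders rather than the paper's settings.

```python
# Dimensionality reduction + clustering sketch on precomputed embeddings (assumed file and parameters).
import numpy as np
import umap                                             # from the umap-learn package
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

embeddings = np.load("soundscape_net_embeddings.npy")   # hypothetical (n_recordings, dim) array

projection = umap.UMAP(n_neighbors=30, min_dist=0.1, random_state=0).fit_transform(embeddings)

kmeans_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(projection)
dbscan_labels = DBSCAN(eps=0.5, min_samples=20).fit_predict(projection)   # -1 marks noise points

print("KMeans silhouette:", silhouette_score(projection, kmeans_labels))
```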

20 pages, 2930 KB  
Article
Pain Level Classification from Speech Using GRU-Mixer Architecture with Log-Mel Spectrogram Features
by Adi Alhudhaif
Diagnostics 2025, 15(18), 2362; https://doi.org/10.3390/diagnostics15182362 - 17 Sep 2025
Abstract
Background/Objectives: Automatic pain detection from speech signals holds strong promise for non-invasive and real-time assessment in clinical and caregiving settings, particularly for populations with limited capacity for self-report. Methods: In this study, we introduce a lightweight recurrent deep learning approach, namely the Gated Recurrent Unit (GRU)-Mixer model for pain level classification based on speech signals. The proposed model maps raw audio inputs into Log-Mel spectrogram features, which are passed through a stacked bidirectional GRU for modeling the spectral and temporal dynamics of vocal expressions. To extract compact utterance-level embeddings, an adaptive average pooling-based temporal mixing mechanism is applied over the GRU outputs, followed by a fully connected classification head alongside dropout regularization. This architecture is used for several supervised classification tasks, including binary (pain/non-pain), graded intensity (mild, moderate, severe), and thermal-state (cold/warm) classification. End-to-end training is done using speaker-independent splits and class-balanced loss to promote generalization and discourage bias. The provided audio inputs are normalized to a consistent 3-s window and resampled at 8 kHz for consistency and computational efficiency. Results: Experiments on the TAME Pain dataset showcase strong classification performance, achieving 83.86% accuracy for binary pain detection and as high as 75.36% for multiclass pain intensity classification. Conclusions: As the first deep learning based classification work on the TAME Pain dataset, this work introduces the GRU-Mixer as an effective benchmark architecture for future studies on speech-based pain recognition and affective computing. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
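
The following PyTorch sketch captures the overall shape of the pipeline described above: log-Mel frames fed to a bidirectional GRU, averaged over time, and classified through a dropout-regularized linear head. Layer sizes and the frame count are assumptions, not the published architecture.

```python
# GRU-over-log-Mel classifier sketch (assumed sizes; not the authors' exact GRU-Mixer).
import torch
import torch.nn as nn

class GRUMixerSketch(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=2, dropout=0.3):
        super().__init__()
        self.gru = nn.GRU(input_size=n_mels, hidden_size=hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.pool = nn.AdaptiveAvgPool1d(1)                      # temporal mixing by average pooling
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(2 * hidden, n_classes))

    def forward(self, logmel):                                   # logmel: (batch, time, n_mels)
        seq, _ = self.gru(logmel)                                # (batch, time, 2 * hidden)
        pooled = self.pool(seq.transpose(1, 2)).squeeze(-1)      # (batch, 2 * hidden)
        return self.head(pooled)

# A 3-second clip at 8 kHz gives roughly 95 log-Mel frames with a 256-sample hop.
logits = GRUMixerSketch()(torch.randn(4, 95, 64))
```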

24 pages, 3485 KB  
Article
Impact Evaluation of Sound Dataset Augmentation and Synthetic Generation upon Classification Accuracy
by Eleni Tsalera, Andreas Papadakis, Gerasimos Pagiatakis and Maria Samarakou
J. Sens. Actuator Netw. 2025, 14(5), 91; https://doi.org/10.3390/jsan14050091 - 9 Sep 2025
Abstract
We investigate the impact of dataset augmentation and synthetic generation techniques on the accuracy of supervised audio classification based on state-of-the-art neural networks used as classifiers. Dataset augmentation techniques are applied upon the raw sound and its transformed image format. Specifically, sound augmentation techniques are applied prior to spectral-based transformation and include time stretching, pitch shifting, noise addition, volume controlling, and time shifting. Image augmentation techniques are applied after the transformation of the sound into a scalogram, involving scaling, shearing, rotation, and translation. Synthetic sound generation is based on the AudioGen generative model, triggered through a series of customized prompts. Augmentation and synthetic generation are applied to three sound categories: (a) human sounds, (b) animal sounds, and (c) sounds of things, with each category containing ten sound classes with 20 samples retrieved from the ESC-50 dataset. Sound- and image-orientated neural network classifiers have been used to classify the augmented datasets and their synthetic additions. VGGish and YAMNet (sound classifiers) employ spectrograms, while ResNet50 and DarkNet53 (image classifiers) employ scalograms. The streamlined AI-based process of augmentation and synthetic generation, enhanced classifier fine-tuning and inference allowed for a consistent, multicriteria-comparison of the impact. Classification accuracy has increased for all augmentation and synthetic generation scenarios; however, the increase has not been uniform among the techniques, the sound types, and the percentage of the training set population increase. The average increase in classification accuracy ranged from 2.05% for ResNet50 to 9.05% for VGGish. Our findings reinforce the benefit of audio augmentation and synthetic generation, providing guidelines to avoid accuracy degradation due to overuse and distortion of key audio features. Full article
(This article belongs to the Special Issue AI-Assisted Machine-Environment Interaction)
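
For illustration, a small librosa/NumPy sketch of the five raw-audio augmentations named above (time stretching, pitch shifting, noise addition, volume control, time shifting); the parameter values are examples, not the ones used in the study.

```python
# Waveform augmentation sketch; factors, gains, and shifts are illustrative values.
import numpy as np
import librosa

def augment_waveform(y, sr, rng=np.random.default_rng(0)):
    return {
        "stretch": librosa.effects.time_stretch(y, rate=1.1),          # 10% faster
        "pitch":   librosa.effects.pitch_shift(y, sr=sr, n_steps=2),   # up two semitones
        "noise":   y + 0.005 * rng.standard_normal(len(y)),            # additive Gaussian noise
        "volume":  0.7 * y,                                            # gain change
        "shift":   np.roll(y, int(0.1 * sr)),                          # 100 ms circular time shift
    }

sr = 22050
y = librosa.tone(440, sr=sr, duration=2.0)   # stand-in clip; ESC-50 files would be loaded instead
variants = augment_waveform(y, sr)
```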

20 pages, 6876 KB  
Article
Spatiotemporal Heterogeneity of Forest Park Soundscapes Based on Deep Learning: A Case Study of Zhangjiajie National Forest Park
by Debing Zhuo, Chenguang Yan, Wenhai Xie, Zheqian He and Zhongyu Hu
Forests 2025, 16(9), 1416; https://doi.org/10.3390/f16091416 - 4 Sep 2025
Abstract
As a perceptual representation of ecosystem structure and function, the soundscape has become an important indicator for evaluating ecological health and assessing the impacts of human disturbances. Understanding the spatiotemporal heterogeneity of soundscapes is essential for revealing ecological processes and human impacts in protected areas. This study investigates such heterogeneity in Zhangjiajie National Forest Park using deep learning approaches. To this end, we constructed a dataset comprising eight representative sound source categories by integrating field recordings with online audio (BBC Sound Effects Archive and Freesound), and trained a classification model to accurately identify biophony, geophony, and anthrophony, which enabled the subsequent analysis of spatiotemporal distribution patterns. Our results indicate that temporal variations in the soundscape are closely associated with circadian rhythms and tourist activities, while spatial patterns are strongly shaped by topography, vegetation, and human interference. Biophony is primarily concentrated in areas with minimal ecological disturbance, geophony is regulated by landforms and microclimatic conditions, and anthrophony tends to mask natural sound sources. Overall, the study highlights how deep learning-based soundscape classification can reveal the mechanisms by which natural and anthropogenic factors structure acoustic environments, offering methodological references and practical insights for ecological management and soundscape conservation. Full article
(This article belongs to the Section Forest Ecology and Management)

29 pages, 2766 KB  
Article
Sound-Based Detection of Slip and Trip Incidents Among Construction Workers Using Machine and Deep Learning
by Fangxin Li, Francis Xavier Duorinaah, Min-Koo Kim, Julian Thedja, JoonOh Seo and Dong-Eun Lee
Buildings 2025, 15(17), 3136; https://doi.org/10.3390/buildings15173136 - 1 Sep 2025
Abstract
Unsafe events such as slips and trips occur regularly on construction sites. Efficient identification of these events can help protect workers from accidents and improve site safety. However, current detection methods rely on subjective reporting, which has several limitations. To address these limitations, this study presents a sound-based slip and trip classification method using wearable sound sensors and machine learning. Audio signals were recorded using a smartwatch during simulated slip and trip events. Various 1D and 2D features were extracted from the processed audio signals and used to train several classifiers. Three key findings are as follows: (1) The hybrid CNN-LSTM network achieved the highest classification accuracy of 0.966 with 2D MFCC features, while GMM-HMM achieved the highest accuracy of 0.918 with 1D sound features. (2) 1D MFCC features achieved an accuracy of 0.867, outperforming time- and frequency-domain 1D features. (3) MFCC images were the best 2D features for slip and trip classification. This study presents an objective method for detecting slip and trip events, thereby providing a complementary approach to manual assessments. Practically, the findings serve as a foundation for developing automated near-miss detection systems, identification of workers constantly vulnerable to unsafe events, and detection of unsafe and hazardous areas on construction sites. Full article
(This article belongs to the Section Construction Management, and Computers & Digitization)
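
To make the feature comparison above concrete, here is an illustrative sketch (not the authors' code) of the two MFCC views: a 2-D MFCC matrix used as an image-like input, and a 1-D summary vector for classical classifiers. Frame settings are assumptions.

```python
# MFCC feature extraction sketch: 2-D matrix vs. 1-D summary statistics (assumed settings).
import numpy as np
import librosa

def mfcc_features(y, sr, n_mfcc=13):
    mfcc_2d = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames), image-like
    mfcc_1d = np.concatenate([mfcc_2d.mean(axis=1),             # per-coefficient mean
                              mfcc_2d.std(axis=1)])             # per-coefficient std
    return mfcc_2d, mfcc_1d

# Hypothetical smartwatch clip: y, sr = librosa.load("slip_event.wav", sr=16000)
```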

20 pages, 2241 KB  
Article
HarmonyTok: Comparing Methods for Harmony Tokenization for Machine Learning
by Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Soiledis, Konstantinos-Theodoros Tsamis, Vassilis Katsouros and Emilios Cambouropoulos
Information 2025, 16(9), 759; https://doi.org/10.3390/info16090759 - 1 Sep 2025
Abstract
This paper explores different approaches to harmony tokenization in symbolic music for transformer-based models, focusing on two tasks: masked language modeling (MLM) and melodic harmonization generation. Four tokenization strategies are compared, each varying in how chord information is encoded: (1) as full chord symbols, (2) separated into root and quality, (3) as sets of pitch classes, and (4) as sets of pitch classes where one is designated as a root. A dataset of over 17,000 lead sheet charts is used to train and evaluate RoBERTa for MLM and GPT-2/BART for harmonization. The results show that chord spelling methods—those breaking chords into pitch-class tokens—achieve higher accuracy and lower perplexity, indicating more confident predictions. These methods also produce fewer token-level errors. In harmonization tasks, chunkier tokenizations (with more information per token) generate chords more similar to the original data, while spelling-based methods better preserve structural aspects such as harmonic rhythm and melody–harmony alignment. Audio evaluations reveal that spelling-based models tend toward more generic pop-like harmonizations, while chunkier tokenizations more faithfully reflect the dataset’s style. Overall, while no single tokenization method dominates across all tasks, different strategies may be preferable for specific applications, such as classification or generative style transfer. Full article
(This article belongs to the Special Issue Machine Learning and Artificial Intelligence with Applications)
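
A toy illustration of the four chord-tokenization strategies compared above, applied to a single G7 chord; the token spellings are invented for readability and do not reproduce the paper's vocabulary.

```python
# Four ways to tokenize one G7 chord (G B D F); token names are illustrative only.
pitch_classes = [7, 11, 2, 5]                                    # G, B, D, F as pitch classes

tok_full_symbol   = ["G7"]                                       # (1) whole chord symbol
tok_root_quality  = ["root_G", "quality_7"]                      # (2) root and quality split
tok_pitch_classes = [f"pc_{p}" for p in pitch_classes]           # (3) pitch-class set
tok_pc_with_root  = ["root_pc_7"] + [f"pc_{p}" for p in pitch_classes[1:]]  # (4) root marked

print(tok_full_symbol, tok_root_quality, tok_pitch_classes, tok_pc_with_root, sep="\n")
```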

21 pages, 3700 KB  
Article
Lung Sound Classification Model for On-Device AI
by Jinho Park, Chanhee Jeong, Yeonshik Choi, Hyuck-ki Hong and Youngchang Jo
Appl. Sci. 2025, 15(17), 9361; https://doi.org/10.3390/app15179361 - 26 Aug 2025
Abstract
Following the COVID-19 pandemic, public interest in healthcare has significantly increased, emphasizing the importance of early disease detection through lung sound analysis. Lung sounds serve as a critical biomarker in the diagnosis of pulmonary diseases, and numerous deep learning-based approaches have been actively explored for this purpose. Existing lung sound classification models have demonstrated high accuracy, benefiting from recent advances in artificial intelligence (AI) technologies. However, these models often rely on transmitting data to computationally intensive servers for processing, introducing potential security risks due to the transfer of sensitive medical information over networks. To mitigate these concerns, on-device AI has garnered growing attention as a promising solution for protecting healthcare data. On-device AI enables local data processing and inference directly on the device, thereby enhancing data security compared to server-based schemes. Despite these advantages, on-device AI is inherently limited by computational constraints, while conventional models typically require substantial processing power to maintain high performance. In this study, we propose a lightweight lung sound classification model designed specifically for on-device environments. The proposed scheme extracts audio features using Mel spectrograms, chromagrams, and Mel-Frequency Cepstral Coefficients (MFCC), which are converted into image representations and stacked to form the model input. The lightweight model performs convolution operations tailored to both temporal and frequency-domain characteristics of lung sounds. Comparative experimental results demonstrate that the proposed model achieves superior inference performance while maintaining a significantly smaller model size than conventional classification schemes, making it well-suited for deployment on resource-constrained devices. Full article
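
The following sketch shows one way the three-feature stacking described above could look: Mel spectrogram, chromagram, and MFCCs resized to a common grid and stacked as image channels. Shapes, the sampling rate, and the resize target are assumptions rather than the paper's values.

```python
# Three acoustic features stacked into a (3, H, W) input tensor (assumed sizes).
import numpy as np
import librosa
from skimage.transform import resize

def lung_sound_tensor(path, sr=4000, size=(64, 64)):
    y, sr = librosa.load(path, sr=sr)
    mel    = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    mfcc   = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    channels = [resize(f, size) for f in (mel, chroma, mfcc)]    # common spatial grid
    return np.stack(channels, axis=0)                            # (3, 64, 64) model input

# x = lung_sound_tensor("lung_cycle.wav")                        # hypothetical recording path
```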

18 pages, 917 KB  
Article
ATA-MSTF-Net: An Audio Texture-Aware MultiSpectro-Temporal Attention Fusion Network
by Yubo Su, Haolin Wang, Zhihao Xu, Chengxi Yin, Fucheng Chen and Zhaoguo Wang
Mathematics 2025, 13(17), 2719; https://doi.org/10.3390/math13172719 - 24 Aug 2025
Abstract
Unsupervised anomalous sound detection (ASD) models the normal sounds of machinery through classification operations, thereby identifying anomalies by quantifying deviations. Most recent approaches adopt depthwise separable modules from MobileNetV2. Extensive studies demonstrate that squeeze-and-excitation (SE) modules can enhance model fitting by dynamically weighting input features to adjust output distributions. However, we observe that conventional SE modules fail to adapt to the complex spectral textures of audio data. To address this, we propose an Audio Texture Attention (ATA) specifically designed for machine noise data, improving model robustness. Additionally, we integrate an LSTM layer and refine the temporal feature extraction architecture to strengthen the model’s sensitivity to sequential noise patterns. Experimental results on the DCASE 2020 Challenge Task 2 dataset show that our method achieves state-of-the-art performance, with AUC, pAUC, and mAUC scores of 96.15%, 90.58%, and 90.63%, respectively. Full article
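
For context, this is the textbook squeeze-and-excitation block in PyTorch, i.e. the generic channel-attention module the paper argues is poorly matched to audio spectral textures and replaces with its Audio Texture Attention; it is not the proposed ATA module itself.

```python
# Standard squeeze-and-excitation (SE) block; the baseline design, not the paper's ATA.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (batch, channels, freq, time)
        w = x.mean(dim=(2, 3))                         # squeeze: global average over the spectrogram
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)     # excitation: per-channel weights in (0, 1)
        return x * w                                   # reweight the input feature map

out = SEBlock(64)(torch.randn(2, 64, 128, 128))
```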

26 pages, 6425 KB  
Article
Deep Spectrogram Learning for Gunshot Classification: A Comparative Study of CNN Architectures and Time-Frequency Representations
by Pafan Doungpaisan and Peerapol Khunarsa
J. Imaging 2025, 11(8), 281; https://doi.org/10.3390/jimaging11080281 - 21 Aug 2025
Abstract
Gunshot sound classification plays a crucial role in public safety, forensic investigations, and intelligent surveillance systems. This study evaluates the performance of deep learning models in classifying firearm sounds by analyzing twelve time–frequency spectrogram representations, including Mel, Bark, MFCC, CQT, Cochleagram, STFT, FFT, Reassigned, Chroma, Spectral Contrast, and Wavelet. The dataset consists of 2148 gunshot recordings from four firearm types, collected in a semi-controlled outdoor environment under multi-orientation conditions. To leverage advanced computer vision techniques, all spectrograms were converted into RGB images using perceptually informed colormaps. This enabled the application of image processing approaches and fine-tuning of pre-trained Convolutional Neural Networks (CNNs) originally developed for natural image classification. Six CNN architectures—ResNet18, ResNet50, ResNet101, GoogLeNet, Inception-v3, and InceptionResNetV2—were trained on these spectrogram images. Experimental results indicate that CQT, Cochleagram, and Mel spectrograms consistently achieved high classification accuracy, exceeding 94% when paired with deep CNNs such as ResNet101 and InceptionResNetV2. These findings demonstrate that transforming time–frequency features into RGB images not only facilitates the use of image-based processing but also allows deep models to capture rich spectral–temporal patterns, providing a robust framework for accurate firearm sound classification. Full article
(This article belongs to the Section Image and Video Processing)
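
A hedged sketch of the spectrogram-to-RGB idea described above: map a dB-scaled Mel spectrogram through a perceptual colormap and fine-tune a pre-trained CNN on the resulting images. The colormap choice, normalization, and four-class head are assumptions for illustration.

```python
# Spectrogram-to-RGB conversion plus a fine-tuning head swap (assumed colormap and class count).
import numpy as np
import librosa
import matplotlib.pyplot as plt
import torch.nn as nn
from torchvision import models

def spectrogram_rgb(y, sr, cmap=plt.get_cmap("magma")):
    s_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
    norm = (s_db - s_db.min()) / (s_db.max() - s_db.min() + 1e-8)   # scale to [0, 1]
    return cmap(norm)[..., :3]                                      # drop alpha: (H, W, 3) RGB image

# Replace the ImageNet head with a 4-way firearm classifier before fine-tuning.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 4)
```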

42 pages, 5531 KB  
Article
Preliminary Analysis and Proof-of-Concept Validation of a Neuronally Controlled Visual Assistive Device Integrating Computer Vision with EEG-Based Binary Control
by Preetam Kumar Khuntia, Prajwal Sanjay Bhide and Pudureddiyur Venkataraman Manivannan
Sensors 2025, 25(16), 5187; https://doi.org/10.3390/s25165187 - 21 Aug 2025
Abstract
Contemporary visual assistive devices often lack immersive user experience due to passive control systems. This study introduces a neuronally controlled visual assistive device (NCVAD) that aims to assist visually impaired users in performing reach tasks with active, intuitive control. The developed NCVAD integrates computer vision, electroencephalogram (EEG) signal processing, and robotic manipulation to facilitate object detection, selection, and assistive guidance. The monocular vision-based subsystem implements the YOLOv8n algorithm to detect objects of daily use. Then, audio prompting conveys the detected objects’ information to the user, who selects their targeted object using a voluntary trigger decoded through real-time EEG classification. The target’s physical coordinates are extracted using ArUco markers, and a gradient descent-based path optimization algorithm (POA) guides a 3-DoF robotic arm to reach the target. The classification algorithm achieves over 85% precision and recall in decoding EEG data, even with coexisting physiological artifacts. Similarly, the POA achieves approximately 650 ms of actuation time with a 0.001 learning rate and 0.1 cm2 error threshold settings. In conclusion, the study also validates the preliminary analysis results on a working physical model and benchmarks the robotic arm’s performance against human users, establishing the proof-of-concept for future assistive technologies integrating EEG and computer vision paradigms. Full article
(This article belongs to the Section Intelligent Sensors)
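
As a rough illustration of the gradient-descent path optimization mentioned above, the toy planner below descends a squared-distance cost toward a target coordinate with a small learning rate and an error threshold; the cost function and coordinates are illustrative, not the authors' kinematic model.

```python
# Toy gradient-descent step planner toward a target point (illustrative cost and parameters).
import numpy as np

def descend_to_target(start, target, lr=0.001, tol=1e-4, max_iter=100_000):
    p = np.asarray(start, dtype=float)
    t = np.asarray(target, dtype=float)
    for _ in range(max_iter):
        grad = 2.0 * (p - t)                 # gradient of the squared-distance cost
        p -= lr * grad
        if np.sum((p - t) ** 2) < tol:       # stop once inside the error threshold
            break
    return p

end_effector = descend_to_target(start=[0.0, 0.0, 0.0], target=[0.12, 0.30, 0.05])
```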

26 pages, 663 KB  
Article
Multi-Scale Temporal Fusion Network for Real-Time Multimodal Emotion Recognition in IoT Environments
by Sungwook Yoon and Byungmun Kim
Sensors 2025, 25(16), 5066; https://doi.org/10.3390/s25165066 - 14 Aug 2025
Abstract
This paper introduces EmotionTFN (Emotion-Multi-Scale Temporal Fusion Network), a novel hierarchical temporal fusion architecture that addresses key challenges in IoT emotion recognition by processing diverse sensor data while maintaining accuracy across multiple temporal scales. The architecture integrates physiological signals (EEG, PPG, and GSR), visual, and audio data using hierarchical temporal attention across short-term (0.5–2 s), medium-term (2–10 s), and long-term (10–60 s) windows. Edge computing optimizations, including model compression, quantization, and adaptive sampling, enable deployment on resource-constrained devices. Extensive experiments on MELD, DEAP, and G-REx datasets demonstrate 94.2% accuracy on discrete emotion classification and 0.087 mean absolute error on dimensional prediction, outperforming the best baseline (87.4%). The system maintains sub-200 ms latency on IoT hardware while achieving a 40% improvement in energy efficiency. Real-world deployment validation over four weeks achieved 97.2% uptime and user satisfaction scores of 4.1/5.0 while ensuring privacy through local processing. Full article
(This article belongs to the Section Internet of Things)
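
To make the three temporal scales above concrete, here is a small sketch (an assumption, not the paper's code) that slices one synchronized signal into non-overlapping short-, medium-, and long-term windows before per-scale encoding.

```python
# Multi-scale windowing sketch; window lengths follow the scale ranges quoted in the abstract.
import numpy as np

def multi_scale_windows(signal, sr, scales=((0.5, 2), (2, 10), (10, 60))):
    """For each scale, cut non-overlapping windows at that scale's upper length (in seconds)."""
    views = {}
    for lo, hi in scales:
        win = int(hi * sr)
        n = len(signal) // win
        views[f"{lo}-{hi}s"] = signal[: n * win].reshape(n, win)
    return views

windows = multi_scale_windows(np.random.randn(60 * 128), sr=128)   # e.g. 60 s of 128 Hz EEG
```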

15 pages, 1111 KB  
Article
A Novel Methodology for Data Augmentation in Cognitive Impairment Subjects Using Semantic and Pragmatic Features Through Large Language Models
by Luis Roberto García-Noguez, Sebastián Salazar-Colores, Siddhartha Mondragón-Rodríguez and Saúl Tovar-Arriaga
Technologies 2025, 13(8), 344; https://doi.org/10.3390/technologies13080344 - 7 Aug 2025
Abstract
In recent years, researchers have become increasingly interested in identifying traits of cognitive impairment using audio from neuropsychological tests. Unfortunately, there is no universally accepted terminology system that can be used to describe language impairment, and considerable variability exists between clinicians, making detection particularly challenging. Furthermore, databases commonly used by the scientific community present sparse or unbalanced data, which hinders the optimal performance of machine learning models. Therefore, this study aims to test a new methodology for augmenting text data from neuropsychological tests in the Pitt Corpus database to increase classification and interpretability results. The proposed method involves augmenting text data with symptoms commonly present in subjects with cognitive impairment. This innovative approach has enabled us to differentiate between two groups in the database better than widely used text augmentation techniques. The proposed method yielded an increase in the metrics, achieving 0.8742 accuracy, 0.8744 F1-score, 0.8736 precision, and 0.8781 recall. It is shown that implementing large language models with commonly observed symptoms in the language of patients with cognitive impairment in text augmentation can improve the results in low-resource scenarios. Full article
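
A hedged sketch of the augmentation idea described above: prompt a large language model to rewrite a transcript so that it carries language symptoms typical of cognitive impairment. The symptom list, prompt wording, and the generate call are placeholders, not the paper's prompts or model interface.

```python
# Prompt construction for LLM-based text augmentation; symptom cues and wording are illustrative.
SYMPTOMS = ["word-finding pauses", "repetition of phrases", "vague pronouns", "topic drift"]

def augmentation_prompt(transcript, k=2):
    cues = ", ".join(SYMPTOMS[:k])
    return ("Rewrite the following picture-description transcript so it exhibits these "
            f"language symptoms while preserving its meaning: {cues}.\n\n"
            f"Transcript: {transcript}")

# augmented = generate(augmentation_prompt("The boy is taking a cookie ..."))  # hypothetical LLM call
```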
