Search Results (94)

Search Parameters:
Keywords = audio event detection

27 pages, 7248 KB  
Article
Fine-Grained and Lightweight OSA Detection: A CRNN-Based Model for Precise Temporal Localization of Respiratory Events in Sleep Audio
by Mengyu Xu, Yanru Li and Demin Han
Diagnostics 2026, 16(4), 577; https://doi.org/10.3390/diagnostics16040577 - 14 Feb 2026
Viewed by 319
Abstract
Background: Obstructive Sleep Apnea (OSA) is highly prevalent yet underdiagnosed due to the scarcity of Polysomnography (PSG) resources. Audio-based screening offers a scalable solution, but often lacks the granularity to precisely localize respiratory events or accurately estimate the Apnea-Hypopnea Index (AHI). This study aims to develop a fine-grained and lightweight detection framework for OSA screening, enabling precise respiratory event localization and AHI estimation using non-contact audio signals. Methods: A Dual-Stream Convolutional Recurrent Neural Network (CRNN), integrating Log Mel-spectrograms and energy profiles with BiLSTM, was proposed. The model was trained on the PSG-Audio dataset (Sismanoglio Hospital cohort, 286 subjects) and subjected to a comprehensive three-level evaluation: (1) frame-level classification performance; (2) event-level temporal localization precision, quantified by Intersection over Union (IoU) and onset/offset boundary errors; and (3) patient-level clinical utility, assessing AHI correlation, error margins, and screening performance across different severity thresholds. Generalization was rigorously validated on an independent external cohort from Beijing Tongren Hospital (60 subjects), which was specifically curated to ensure a relatively balanced distribution of disease severity. Results: On the internal test set, the model achieved a frame level macro F1 score of 0.64 and demonstrated accurate event localization, with an IoU of 0.82. In the external validation, the audio derived AHI showed a strong correlation with PSG-AHI (r = 0.96, MAE = 6.03 events/h). For screening, the model achieved sensitivities of 98.0%, 89.5%, and 89.3%, and specificities of 88.9%, 90.9%, and 100.0% at AHI thresholds of 5, 15, and 30 events per hour, respectively. Conclusions: The Fine-Grained and Lightweight Dual-Stream CRNN provides a robust, clinically interpretable solution for non-contact OSA screening. 
The favorable screening performance observed in the external cohort, characterized by high sensitivity for mild cases and high specificity for severe disease, highlights its potential as a reliable tool for accessible home-based screening. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
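The event-level IoU and patient-level AHI evaluation described in this abstract can be sketched in a few lines; the interval representation and function names below are illustrative assumptions, not the paper's implementation.

```python
def interval_iou(pred, ref):
    """Intersection over Union of two (onset, offset) event intervals in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

def estimate_ahi(num_events, recording_hours):
    """Apnea-Hypopnea Index: detected respiratory events per hour of recording."""
    return num_events / recording_hours
```

For example, a detection spanning 12–22 s against a reference of 10–20 s overlaps for 8 s over a 12 s union, giving an IoU of about 0.67.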

27 pages, 6058 KB  
Article
Hierarchical Self-Distillation with Attention for Class-Imbalanced Acoustic Event Classification in Elevators
by Shengying Yang, Lingyan Chou, He Li, Zhenyu Xu, Boyang Feng and Jingsheng Lei
Sensors 2026, 26(2), 589; https://doi.org/10.3390/s26020589 - 15 Jan 2026
Viewed by 345
Abstract
Acoustic-based anomaly detection in elevators is crucial for predictive maintenance and operational safety, yet it faces significant challenges in real-world settings, including pervasive multi-source acoustic interference within confined spaces and severe class imbalance in collected data, which critically degrades the detection performance for minority yet critical acoustic events. To address these issues, this study proposes a novel hierarchical self-distillation framework. The method embeds auxiliary classifiers into the intermediate layers of a backbone network, creating a deep teacher–shallow student knowledge transfer paradigm optimized jointly via Kullback–Leibler divergence and feature alignment losses. A self-attentive temporal pooling layer is introduced to adaptively weigh discriminative time-frequency features, thereby mitigating temporal overlap interference, while a focal loss function is employed specifically in the teacher model to recalibrate the learning focus towards hard-to-classify minority samples. Extensive evaluations on the public UrbanSound8K dataset and a proprietary industrial elevator audio dataset demonstrate that the proposed model achieves superior performance, exceeding 90% in both accuracy and F1-score. Notably, it yields substantial improvements in recognizing rare events, validating its robustness for elevator acoustic monitoring. Full article
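Two ingredients of the loss described above, the KL-divergence term for teacher-to-student transfer and the focal loss for hard minority samples, can be sketched as follows; the temperature and gamma values are generic defaults, not the paper's settings.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution; higher temperature softens it."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how much the student distribution q diverges from teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def focal_loss(p_true, gamma=2.0):
    """Focal loss on the true-class probability: down-weights easy examples."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

With gamma = 2, a well-classified sample (p = 0.9) contributes far less loss than a hard one (p = 0.5), which is what recalibrates learning toward rare events.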

23 pages, 725 KB  
Article
From Sound to Risk: Streaming Audio Flags for Real-World Hazard Inference Based on AI
by Ilyas Potamitis
J. Sens. Actuator Netw. 2026, 15(1), 6; https://doi.org/10.3390/jsan15010006 - 1 Jan 2026
Viewed by 1043
Abstract
Seconds count differently for people in danger. We present a real-time streaming pipeline for audio-based detection of hazardous life events affecting life and property. The system operates online rather than as a retrospective analysis tool. Its objective is to reduce the latency between the occurrence of a crime, conflict, or accident and the corresponding response by authorities. The key idea is to map reality as perceived by audio into a written story and question the text via a large language model. The method integrates streaming, zero-shot algorithms in an online decoding mode that convert sound into short, interpretable tokens, which are processed by a lightweight language model. CLAP text–audio prompting identifies agitation, panic, and distress cues, combined with conversational dynamics derived from speaker diarization. Lexical information is obtained through streaming automatic speech recognition, while general audio events are detected by a streaming version of Audio Spectrogram Transformer tagger. Prosodic features are incorporated using pitch- and energy-based rules derived from robust F0 tracking and periodicity measures. The system uses a large language model configured for online decoding and outputs binary (YES/NO) life-threatening risk decisions every two seconds, along with a brief justification and a final session-level verdict. The system emphasizes interpretability and accountability. We evaluate it on a subset of the X-Violence dataset, comprising only real-world videos. We release code, prompts, decision policies, evaluation splits, and example logs to enable the community to replicate, critique, and extend our blueprint. Full article
(This article belongs to the Topic Trends and Prospects in Security, Encryption and Encoding)
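The per-window YES/NO policy could look like the toy rule combiner below; the cue names and score weights are invented for illustration, and the paper delegates this judgment to a prompted language model rather than fixed rules.

```python
def risk_decision(flags):
    """Toy fusion of per-window cues into a binary life-threat decision."""
    score = 0
    if flags.get("panic_cue"):       score += 2  # CLAP-style distress match
    if flags.get("violent_lexeme"):  score += 2  # keyword from streaming ASR
    if flags.get("raised_f0"):       score += 1  # prosodic agitation rule
    if flags.get("loud_impact"):     score += 1  # audio-tagger event
    return "YES" if score >= 3 else "NO"
```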

25 pages, 1229 KB  
Article
YOLO-Based Transfer Learning for Sound Event Detection Using Visual Object Detection Techniques
by Sergio Segovia González, Sara Barahona Quiros and Doroteo T. Toledano
Appl. Sci. 2026, 16(1), 205; https://doi.org/10.3390/app16010205 - 24 Dec 2025
Viewed by 608
Abstract
Traditional Sound Event Detection (SED) approaches are based on either specialized models or these models in combination with general audio embedding extractors. In this article, we propose to reframe SED as an object detection task in the time–frequency plane and introduce a direct adaptation of modern YOLO detectors to audio. To our knowledge, this is among the first works to employ YOLOv8 and YOLOv11 not merely as feature extractors but as end-to-end models that localize and classify sound events on mel-spectrograms. Methodologically, our approach (i) generates mel-spectrograms on the fly from raw audio to streamline the pipeline and enable transfer learning from vision models; (ii) applies curriculum learning that exposes the detector to progressively more complex mixtures, improving robustness to overlaps; and (iii) augments training with synthetic audio constructed under DCASE 2023 guidelines to enrich rare classes and challenging scenarios. Comprehensive experiments compare our YOLO-based framework against strong CRNN and Conformer baselines. In our experiments on the DCASE-style setting, the method achieves competitive detection accuracy relative to CRNN and Conformer baselines, with gains in some overlapping/noisy conditions and shortcomings for several short-duration classes. These results suggest that adapting modern object detectors to audio can be effective in this setting, while broader generalization and encoder-augmented comparisons remain open. Full article
(This article belongs to the Special Issue Advances in Audio Signal Processing)
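Reframing SED as object detection means every annotated event becomes a box on the spectrogram image. A minimal sketch of that conversion, assuming a linear frequency axis (a mel axis would need the same mapping through the mel scale):

```python
def event_to_yolo_box(onset_s, offset_s, fmin_hz, fmax_hz, clip_s, nyquist_hz):
    """Map an event's time-frequency extent to a normalized YOLO label:
    (x_center, y_center, width, height), each in [0, 1]."""
    x_c = (onset_s + offset_s) / 2.0 / clip_s
    y_c = (fmin_hz + fmax_hz) / 2.0 / nyquist_hz
    w = (offset_s - onset_s) / clip_s
    h = (fmax_hz - fmin_hz) / nyquist_hz
    return x_c, y_c, w, h
```

Decoding a predicted box back to (onset, offset) then yields the usual SED output; the event class comes from the box's label.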

28 pages, 3628 KB  
Article
ADFF-Net: An Attention-Based Dual-Stream Feature Fusion Network for Respiratory Sound Classification
by Bing Zhu, Lijun Chen, Xiaoling Li, Songnan Zhao, Shaode Yu and Qiurui Sun
Technologies 2026, 14(1), 12; https://doi.org/10.3390/technologies14010012 - 24 Dec 2025
Viewed by 653
Abstract
Deep learning-based respiratory sound classification (RSC) has emerged as a promising non-invasive approach to assist clinical diagnosis. However, existing methods often face challenges, such as sub-optimal feature representation and limited model expressiveness. To address these issues, we propose an Attention-based Dual-stream Feature Fusion Network (ADFF-Net). Built upon the pre-trained Audio Spectrogram Transformer, ADFF-Net takes Mel-filter bank and Mel-spectrogram features as dual-stream inputs, while an attention-based fusion module with a skip connection is introduced to preserve both the raw energy and the relevant tonal variations within the multi-scale time–frequency representation. Extensive experiments on the ICBHI2017 database with the official train–test split show that, despite critical failure in sensitivity of 42.91%, ADFF-Net achieves state-of-the-art performance in terms of aggregated metrics in the four-class RSC task, with an overall accuracy of 64.95%, specificity of 81.39%, and harmonic score of 62.14%. The results confirm the effectiveness of the proposed attention-based dual-stream acoustic feature fusion module for the RSC task, while also highlighting substantial room for improving the detection of abnormal respiratory events. Furthermore, we outline several promising research directions, including addressing class imbalance, enriching signal diversity, advancing network design, and enhancing model interpretability. Full article
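The reported metrics follow from binary confusion counts (normal vs. abnormal). The aggregate below is the usual ICBHI average of sensitivity and specificity, which matches the reported 62.14% to within rounding, though equating it with the paper's "harmonic score" is an assumption.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = recall on abnormal cases; specificity = recall on normal cases."""
    return tp / (tp + fn), tn / (tn + fp)

def icbhi_score(se, sp):
    """ICBHI-style aggregate: the mean of sensitivity and specificity."""
    return (se + sp) / 2.0
```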

32 pages, 5708 KB  
Article
Affordable Audio Hardware and Artificial Intelligence Can Transform the Dementia Care Pipeline
by Ilyas Potamitis
Algorithms 2025, 18(12), 787; https://doi.org/10.3390/a18120787 - 12 Dec 2025
Viewed by 2269
Abstract
Population aging is increasing dementia care demand. We present an audio-driven monitoring pipeline that operates either on mobile phones, microcontroller nodes, or smart television sets. The system combines audio signal processing with AI tools for structured interpretation. Preprocessing includes voice activity detection, speaker diarization, automatic speech recognition for dialogs, and speech-emotion recognition. An audio classifier detects home-care–relevant events (cough, cane taps, thuds, knocks, and speech). A large language model integrates transcripts, acoustic features, and a consented household knowledge base to produce a daily caregiver report covering orientation/disorientation (person, place, and time), delusion themes, agitation events, health proxies, and safety flags (e.g., exit seeking and falling). The pipeline targets real-time monitoring in homes and facilities, and it is an adjunct to caregiving, not a diagnostic device. Evaluation focuses on human-in-the-loop review, various audio/speech modalities, and the ability of AI to integrate information and reason. Intended users are low-income households in remote settings where in-person caregiving cannot be secured, enabling remote monitoring support for older adults with dementia. Full article
(This article belongs to the Special Issue AI-Assisted Medical Diagnostics)
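The report-generation step can be caricatured as event aggregation; the labels and flag rules below are invented placeholders, since the actual system delegates interpretation to a large language model over transcripts and the household knowledge base.

```python
def daily_report(events):
    """Aggregate (timestamp, label) detections into caregiver-report counts
    and simple safety flags (labels are illustrative, not the paper's schema)."""
    counts = {}
    for _, label in events:
        counts[label] = counts.get(label, 0) + 1
    flags = []
    if counts.get("thud", 0) > 0:
        flags.append("possible fall")
    if counts.get("exit_phrase", 0) >= 2:
        flags.append("exit seeking")
    return {"counts": counts, "safety_flags": flags}
```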

15 pages, 3774 KB  
Article
MSFDnet: A Multi-Scale Feature Dual-Layer Fusion Model for Sound Event Localization and Detection
by Yi Chen, Zhenyu Huang, Liang Lei and Yu Yuan
Sensors 2025, 25(20), 6479; https://doi.org/10.3390/s25206479 - 20 Oct 2025
Viewed by 891
Abstract
The task of Sound Event Localization and Detection (SELD) aims to simultaneously address sound event recognition and spatial localization. However, existing SELD methods face limitations in long-duration dynamic audio scenarios, as they do not fully leverage the complementarity between multi-task features and lack depth in feature extraction, leading to restricted system performance. To address these issues, we propose a novel SELD model, MSDFnet. By introducing a Multi-Scale Feature Aggregation (MSFA) module and a Dual-Layer Feature Fusion strategy (DLFF), MSDFnet captures rich spatial features at multiple scales and establishes a stronger complementary relationship between SED and DOA features, thereby enhancing detection and localization accuracy. On the DCASE2020 Task 3 dataset, our model achieved scores of 0.319, 76%, 10.2°, 82.4%, and 0.198 on the ER20°, F20°, LECD, LRCD, and SELD-score metrics, respectively. Experimental results demonstrate that MSDFnet performs excellently in complex audio scenarios. Additionally, ablation studies further confirm the effectiveness of the MSFA and DLFF modules in enhancing SELD task performance. Full article
(This article belongs to the Special Issue Sensors and Machine-Learning Based Signal Processing)
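The five reported numbers collapse into the single SELD score via the standard DCASE aggregation, shown below as an assumption; it does reproduce the reported 0.198 from ER = 0.319, F = 76%, LE = 10.2°, and LR = 82.4%.

```python
def seld_score(er, f, le_deg, lr):
    """DCASE-style SELD score (lower is better): mean of the error rate, the
    miss in F-score, the normalized localization error, and the miss in
    localization recall."""
    return (er + (1.0 - f) + le_deg / 180.0 + (1.0 - lr)) / 4.0
```

seld_score(0.319, 0.76, 10.2, 0.824) evaluates to approximately 0.198.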

16 pages, 5544 KB  
Article
Visual Feature Domain Audio Coding for Anomaly Sound Detection Application
by Subin Byun and Jeongil Seo
Algorithms 2025, 18(10), 646; https://doi.org/10.3390/a18100646 - 15 Oct 2025
Viewed by 767
Abstract
Conventional audio and video codecs are designed for human perception, often discarding subtle spectral cues that are essential for machine-based analysis. To overcome this limitation, we propose a machine-oriented compression framework that reinterprets spectrograms as visual objects and applies Feature Coding for Machines (FCM) to anomalous sound detection (ASD). In our approach, audio signals are transformed into log-mel spectrograms, from which intermediate feature maps are extracted, compressed, and reconstructed through the FCM pipeline. For comparison, we implement AAC-LC (Advanced Audio Coding Low Complexity) as a representative perceptual audio codec and VVC (Versatile Video Coding) as a spectrogram-based video codec. Experiments were conducted on the DCASE (Detection and Classification of Acoustic Scenes and Events) 2023 Task 2 dataset, covering four machine types (fan, valve, toycar, slider), with anomaly detection performed using the official Autoencoder baseline model released in DCASE 2024. Detection scores were computed from reconstruction error and Mahalanobis distance. The results show that the proposed FCM-based ACoM (Audio Coding for Machines) achieves comparable or superior performance to AAC at less than half the bitrate, reliably preserving critical features even under ultra-low bitrate conditions (1.3–6.3 kbps). While VVC retains competitive performance only at high bitrates, it degrades sharply at low bitrates. These findings demonstrate that feature-based compression offers a promising direction for next-generation ACoM standardization, enabling efficient and robust ASD in bandwidth-constrained industrial environments. Full article
(This article belongs to the Special Issue Visual Attributes in Computer Vision Applications)
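Of the two detection scores mentioned, the reconstruction-error half is the simpler; a minimal sketch, with the Mahalanobis term and the actual feature dimensions omitted:

```python
def reconstruction_anomaly_score(x, x_hat):
    """Mean squared reconstruction error between an input feature vector and the
    autoencoder's reconstruction; larger values suggest anomalous sounds."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```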

18 pages, 2459 KB  
Article
FFMamba: Feature Fusion State Space Model Based on Sound Event Localization and Detection
by Yibo Li, Dongyuan Ge, Jieke Xu and Xifan Yao
Electronics 2025, 14(19), 3874; https://doi.org/10.3390/electronics14193874 - 29 Sep 2025
Viewed by 781
Abstract
Previous studies on Sound Event Localization and Detection (SELD) have primarily focused on CNN- and Transformer-based designs. While CNNs possess local receptive fields, making it difficult to capture global dependencies over long sequences, Transformers excel at modeling long-range dependencies but have limited sensitivity to local time–frequency features. Recently, the VMamba architecture, built upon the Visual State Space (VSS) model, has shown great promise in handling long sequences, yet it remains limited in modeling local spatial details. To address this issue, we propose a novel state space model with an attention-enhanced feature fusion mechanism, termed FFMamba, which balances both local spatial modeling and long-range dependency capture. At a fine-grained level, we design two key modules: the Multi-Scale Fusion Visual State Space (MSFVSS) module and the Wavelet Transform-Enhanced Downsampling (WTED) module. Specifically, the MSFVSS module integrates a Multi-Scale Fusion (MSF) component into the VSS framework, enhancing its ability to capture both long-range temporal dependencies and detailed local spatial information. Meanwhile, the WTED module employs a dual-branch design to fuse spatial and frequency domain features, improving the richness of feature representations. Comparative experiments were conducted on the DCASE2021 Task 3 and DCASE2022 Task 3 datasets. The results demonstrate that the proposed FFMamba model outperforms recent approaches in capturing long-range temporal dependencies and effectively integrating multi-scale audio features. In addition, ablation studies confirmed the effectiveness of the MSFVSS and WTED modules. Full article
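The wavelet branch of the WTED module rests on the idea that a wavelet step is simultaneously a 2x downsampling and a frequency split. A generic single-level Haar step illustrates this; the paper's exact wavelet and normalization are not specified here.

```python
def haar_step(signal):
    """One level of a Haar wavelet transform: pairwise averages (low-pass,
    a 2x downsampling) and pairwise differences (high-pass detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]
    return approx, detail
```

approx is the half-rate signal a downsampling layer would keep; detail retains the high-frequency residue that plain strided pooling discards, which is what the dual-branch fusion exploits.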

68 pages, 8643 KB  
Article
From Sensors to Insights: Interpretable Audio-Based Machine Learning for Real-Time Vehicle Fault and Emergency Sound Classification
by Mahmoud Badawy, Amr Rashed, Amna Bamaqa, Hanaa A. Sayed, Rasha Elagamy, Malik Almaliki, Tamer Ahmed Farrag and Mostafa A. Elhosseini
Machines 2025, 13(10), 888; https://doi.org/10.3390/machines13100888 - 28 Sep 2025
Viewed by 2182
Abstract
Unrecognized mechanical faults and emergency sounds in vehicles can compromise safety, particularly for individuals with hearing impairments and in sound-insulated or autonomous driving environments. As intelligent transportation systems (ITSs) evolve, there is a growing need for inclusive, non-intrusive, and real-time diagnostic solutions that enhance situational awareness and accessibility. This study introduces an interpretable, sound-based machine learning framework to detect vehicle faults and emergency sound events using acoustic signals as a scalable diagnostic source. Three purpose-built datasets were developed: one for vehicular fault detection, another for emergency and environmental sounds, and a third integrating both to reflect real-world ITS acoustic scenarios. Audio data were preprocessed through normalization, resampling, and segmentation and transformed into numerical vectors using Mel-Frequency Cepstral Coefficients (MFCCs), Mel spectrograms, and Chroma features. To ensure performance and interpretability, feature selection was conducted using SHAP (explainability), Boruta (relevance), and ANOVA (statistical significance). A two-phase experimental workflow was implemented: Phase 1 evaluated 15 classical models, identifying ensemble classifiers and multi-layer perceptrons (MLPs) as top performers; Phase 2 applied advanced feature selection to refine model accuracy and transparency. Ensemble models such as Extra Trees, LightGBM, and XGBoost achieved over 91% accuracy and AUC scores exceeding 0.99. SHAP provided model transparency without performance loss, while ANOVA achieved high accuracy with fewer features. The proposed framework enhances accessibility by translating auditory alarms into visual/haptic alerts for hearing-impaired drivers and can be integrated into smart city ITS platforms via roadside monitoring systems. Full article
(This article belongs to the Section Vehicle Engineering)
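Of the three selection methods named above, the ANOVA criterion is the most self-contained: rank each feature by its one-way F-statistic across classes. A sketch for a single feature, written from the textbook definition rather than the paper's code:

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for one feature: between-class variance
    over within-class variance across the sample groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A feature whose per-class means are far apart relative to the within-class spread gets a large F and survives selection.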

29 pages, 2766 KB  
Article
Sound-Based Detection of Slip and Trip Incidents Among Construction Workers Using Machine and Deep Learning
by Fangxin Li, Francis Xavier Duorinaah, Min-Koo Kim, Julian Thedja, JoonOh Seo and Dong-Eun Lee
Buildings 2025, 15(17), 3136; https://doi.org/10.3390/buildings15173136 - 1 Sep 2025
Viewed by 1165
Abstract
Unsafe events such as slips and trips occur regularly on construction sites. Efficient identification of these events can help protect workers from accidents and improve site safety. However, current detection methods rely on subjective reporting, which has several limitations. To address these limitations, this study presents a sound-based slip and trip classification method using wearable sound sensors and machine learning. Audio signals were recorded using a smartwatch during simulated slip and trip events. Various 1D and 2D features were extracted from the processed audio signals and used to train several classifiers. Three key findings are as follows: (1) The hybrid CNN-LSTM network achieved the highest classification accuracy of 0.966 with 2D MFCC features, while GMM-HMM achieved the highest accuracy of 0.918 with 1D sound features. (2) 1D MFCC features achieved an accuracy of 0.867, outperforming time- and frequency-domain 1D features. (3) MFCC images were the best 2D features for slip and trip classification. This study presents an objective method for detecting slip and trip events, thereby providing a complementary approach to manual assessments. Practically, the findings serve as a foundation for developing automated near-miss detection systems, identification of workers constantly vulnerable to unsafe events, and detection of unsafe and hazardous areas on construction sites. Full article
(This article belongs to the Section Construction Management, and Computers & Digitization)
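Typical 1D features of the kind compared above include frame-level RMS energy (time domain) and zero-crossing rate; a minimal sketch, noting the study's actual feature set is broader:

```python
import math

def frame_features(frame):
    """Two classic 1D features per audio frame: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)
    return rms, zcr
```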

19 pages, 5808 KB  
Article
From Convolution to Spikes for Mental Health: A CNN-to-SNN Approach Using the DAIC-WOZ Dataset
by Victor Triohin, Monica Leba and Andreea Cristina Ionica
Appl. Sci. 2025, 15(16), 9032; https://doi.org/10.3390/app15169032 - 15 Aug 2025
Viewed by 4121
Abstract
Depression remains a leading cause of global disability, yet scalable and objective diagnostic tools are still lacking. Speech has emerged as a promising non-invasive modality for automated depression detection, due to its strong correlation with emotional state and ease of acquisition. While convolutional neural networks (CNNs) have achieved state-of-the-art performance in this domain, their high computational demands limit deployment in low-resource or real-time settings. Spiking neural networks (SNNs), by contrast, offer energy-efficient, event-driven computation inspired by biological neurons, but they are difficult to train directly and often exhibit degraded performance on complex tasks. This study investigates whether CNNs trained on audio data from the clinically annotated DAIC-WOZ dataset can be effectively converted into SNNs while preserving diagnostic accuracy. We evaluate multiple conversion thresholds using the SpikingJelly framework and find that the 99.9% mode yields an SNN that matches the original CNN in both accuracy (82.5%) and macro F1 score (0.8254). Lower threshold settings offer increased sensitivity to depressive speech at the cost of overall accuracy, while naïve conversion strategies result in significant performance loss. These findings support the feasibility of CNN-to-SNN conversion for real-world mental health applications and underscore the importance of precise calibration in achieving clinically meaningful results. Full article
(This article belongs to the Special Issue eHealth Innovative Approaches and Applications: 2nd Edition)
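The "99.9% mode" refers to threshold balancing in rate-based conversion: scale each layer by a high percentile of its observed activations rather than the maximum, so a single outlier does not suppress firing rates. A sketch of that scale-factor choice (SpikingJelly's internals may differ):

```python
def percentile_scale(activations, pct=99.9):
    """Return the pct-th percentile (nearest-rank) of a layer's activations,
    used as the normalization scale in CNN-to-SNN conversion."""
    xs = sorted(activations)
    idx = min(len(xs) - 1, int(round(pct / 100.0 * (len(xs) - 1))))
    return xs[idx]
```

With pct = 100 this degenerates to max-based normalization, the naïve strategy the abstract reports as lossy.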

22 pages, 6359 KB  
Article
Development and Testing of an AI-Based Specific Sound Detection System Integrated on a Fixed-Wing VTOL UAV
by Gabriel-Petre Badea, Mădălin Dombrovschi, Tiberius-Florian Frigioescu, Maria Căldărar and Daniel-Eugeniu Crunteanu
Acoustics 2025, 7(3), 48; https://doi.org/10.3390/acoustics7030048 - 30 Jul 2025
Viewed by 3126
Abstract
This study presents the development and validation of an AI-based system for detecting chainsaw sounds, integrated into a fixed-wing VTOL UAV. The system employs a convolutional neural network trained on log-mel spectrograms derived from four sound classes: chainsaw, music, electric drill, and human voices. Initial validation was performed through ground testing. Acoustic data acquisition is optimized during cruise flight, when wing-mounted motors are shut down and the rear motor operates at 40–60% capacity, significantly reducing noise interference. To address residual motor noise, a preprocessing module was developed using reference recordings obtained in an anechoic chamber. Two configurations were tested to capture the motor’s acoustic profile by changing the UAV’s orientation relative to the fixed microphone. The embedded system processes incoming audio in real time, enabling low-latency classification without data transmission. Field experiments confirmed the model’s high precision and robustness under varying flight and environmental conditions. Results validate the feasibility of real-time, onboard acoustic event detection using spectrogram-based deep learning on UAV platforms, and support its applicability for scalable aerial monitoring tasks. Full article
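The log-mel front end rests on the HTK mel scale, approximately linear below 1 kHz and logarithmic above; the standard conversion pair (a textbook formula, not code from the paper):

```python
import math

def hz_to_mel(f):
    """HTK-style mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping back to frequency in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Filter-bank center frequencies are placed uniformly on the mel axis and mapped back to Hz with mel_to_hz before building the triangular filters.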

22 pages, 3768 KB  
Article
MWB_Analyzer: An Automated Embedded System for Real-Time Quantitative Analysis of Morphine Withdrawal Behaviors in Rodents
by Moran Zhang, Qianqian Li, Shunhang Li, Binxian Sun, Zhuli Wu, Jinxuan Liu, Xingchao Geng and Fangyi Chen
Toxics 2025, 13(7), 586; https://doi.org/10.3390/toxics13070586 - 14 Jul 2025
Viewed by 3141
Abstract
Background/Objectives: Substance use disorders, particularly opioid addiction, continue to pose a major global health and toxicological challenge. Morphine dependence represents a significant problem in both clinical practice and preclinical research, particularly in modeling the pharmacodynamics of withdrawal. Rodent models remain indispensable for investigating the neurotoxicological effects of chronic opioid exposure and withdrawal. However, conventional behavioral assessments rely on manual observation, limiting objectivity, reproducibility, and scalability—critical constraints in modern drug toxicity evaluation. This study introduces MWB_Analyzer, an automated and high-throughput system designed to quantitatively and objectively assess morphine withdrawal behaviors in rats. The goal is to enhance toxicological assessments of CNS-active substances through robust, scalable behavioral phenotyping. Methods: MWB_Analyzer integrates optimized multi-angle video capture, real-time signal processing, and machine learning-driven behavioral classification. An improved YOLO-based architecture was developed for the accurate detection and categorization of withdrawal-associated behaviors in video frames, while a parallel pipeline processed audio signals. The system incorporates behavior-specific duration thresholds to isolate pharmacologically and toxicologically relevant behavioral events. Experimental animals were assigned to high-dose, low-dose, and control groups. Withdrawal was induced and monitored under standardized toxicological protocols. Results: MWB_Analyzer achieved over 95% reduction in redundant frame processing, markedly improving computational efficiency. It demonstrated high classification accuracy: >94% for video-based behaviors (93% on edge devices) and >92% for audio-based events. 
The use of behavioral thresholds enabled sensitive differentiation between dosage groups, revealing clear dose–response relationships and supporting its application in neuropharmacological and neurotoxicological profiling. Conclusions: MWB_Analyzer offers a robust, reproducible, and objective platform for the automated evaluation of opioid withdrawal syndromes in rodent models. It enhances throughput, precision, and standardization in addiction research. Importantly, this tool supports toxicological investigations of CNS drug effects, preclinical pharmacokinetic and pharmacodynamic evaluations, drug safety profiling, and regulatory assessment of novel opioid and CNS-active therapeutics. Full article
(This article belongs to the Section Drugs Toxicity)
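The behavior-specific duration thresholds amount to suppressing detections shorter than a per-behavior minimum; a sketch with an invented threshold value:

```python
def filter_by_duration(segments, min_dur):
    """Keep only detected behavior segments lasting at least min_dur seconds,
    discarding spurious near-instantaneous detections."""
    return [(start, end) for start, end in segments if end - start >= min_dur]
```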

27 pages, 1533 KB  
Article
Sound Source Localization Using Hybrid Convolutional Recurrent Neural Networks in Undesirable Conditions
by Bastian Estay Zamorano, Ali Dehghan Firoozabadi, Alessio Brutti, Pablo Adasme, David Zabala-Blanco, Pablo Palacios Játiva and Cesar A. Azurdia-Meza
Electronics 2025, 14(14), 2778; https://doi.org/10.3390/electronics14142778 - 10 Jul 2025
Cited by 2 | Viewed by 2060
Abstract
Sound event localization and detection (SELD) is a fundamental task in spatial audio processing that involves identifying both the type and location of sound events in acoustic scenes. Current SELD models often struggle with low signal-to-noise ratios (SNRs) and high reverberation. This article addresses SELD by reformulating direction of arrival (DOA) estimation as a multi-class classification task, leveraging deep convolutional recurrent neural networks (CRNNs). We propose and evaluate two modified architectures: M-DOAnet, an optimized version of DOAnet for localization and tracking, and M-SELDnet, a modified version of SELDnet, which has been designed for joint SELD. Both modified models were rigorously evaluated on the STARSS23 dataset, which comprises 13-class, real-world indoor scenes totaling over 7 h of audio, using spectrograms and acoustic intensity maps from first-order Ambisonics (FOA) signals. M-DOAnet achieved exceptional localization (6.00° DOA error, 72.8% F1-score) and perfect tracking (100% MOTA with zero identity switches). It also demonstrated high computational efficiency, training in 4.5 h (164 s/epoch). In contrast, M-SELDnet delivered strong overall SELD performance (0.32 rad DOA error, 0.75 F1-score, 0.38 error rate, 0.20 SELD score), but with significantly higher resource demands, training in 45 h (1620 s/epoch). Our findings underscore a clear trade-off between model specialization and multifunctionality, providing practical insights for designing SELD systems in real-time and computationally constrained environments. Full article
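Reformulating DOA estimation as multi-class classification means binning the continuous angle; a sketch for azimuth with an assumed 10° grid (the abstract does not state the paper's actual resolution):

```python
def azimuth_to_class(azimuth_deg, resolution_deg=10):
    """Map a continuous azimuth in [-180, 180) to a discrete DOA class index."""
    return int((azimuth_deg + 180) // resolution_deg)

def class_to_azimuth(idx, resolution_deg=10):
    """Recover the center angle of a class's bin for evaluation."""
    return idx * resolution_deg - 180 + resolution_deg / 2.0
```

The network then predicts a softmax over the class indices, and DOA error is measured between the bin center and the ground-truth angle.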
