Search Results (88)

Search Parameters:
Keywords = audio event detection

15 pages, 3774 KB  
Article
MSFDnet: A Multi-Scale Feature Dual-Layer Fusion Model for Sound Event Localization and Detection
by Yi Chen, Zhenyu Huang, Liang Lei and Yu Yuan
Sensors 2025, 25(20), 6479; https://doi.org/10.3390/s25206479 - 20 Oct 2025
Viewed by 463
Abstract
The task of Sound Event Localization and Detection (SELD) aims to simultaneously address sound event recognition and spatial localization. However, existing SELD methods face limitations in long-duration dynamic audio scenarios, as they do not fully leverage the complementarity between multi-task features and lack depth in feature extraction, leading to restricted system performance. To address these issues, we propose a novel SELD model, MSFDnet. By introducing a Multi-Scale Feature Aggregation (MSFA) module and a Dual-Layer Feature Fusion (DLFF) strategy, MSFDnet captures rich spatial features at multiple scales and establishes a stronger complementary relationship between SED and DOA features, thereby enhancing detection and localization accuracy. On the DCASE2020 Task 3 dataset, our model achieved scores of 0.319, 76%, 10.2°, 82.4%, and 0.198 in the ER20, F20, LEcd, LRcd, and SELD score metrics, respectively. Experimental results demonstrate that MSFDnet performs excellently in complex audio scenarios. Additionally, ablation studies further confirm the effectiveness of the MSFA and DLFF modules in enhancing SELD task performance. Full article
(This article belongs to the Special Issue Sensors and Machine-Learning Based Signal Processing)
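The reported numbers follow the usual DCASE-style aggregation of the four SELD components into a single score; a minimal sketch of that aggregation (the standard formula, not code from the paper) reproduces the reported 0.198:

```python
# Hypothetical sketch: aggregate the four DCASE-style SELD metrics into one
# SELD score. The input values are the ones reported in the abstract.
def seld_score(er, f, le_deg, lr):
    """Average of the four error components; F and LR are fractions in [0, 1],
    the localization error LE is normalized by 180 degrees."""
    return (er + (1.0 - f) + le_deg / 180.0 + (1.0 - lr)) / 4.0

print(round(seld_score(er=0.319, f=0.76, le_deg=10.2, lr=0.824), 3))  # -> 0.198
```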

16 pages, 5544 KB  
Article
Visual Feature Domain Audio Coding for Anomaly Sound Detection Application
by Subin Byun and Jeongil Seo
Algorithms 2025, 18(10), 646; https://doi.org/10.3390/a18100646 - 15 Oct 2025
Viewed by 350
Abstract
Conventional audio and video codecs are designed for human perception, often discarding subtle spectral cues that are essential for machine-based analysis. To overcome this limitation, we propose a machine-oriented compression framework that reinterprets spectrograms as visual objects and applies Feature Coding for Machines (FCM) to anomalous sound detection (ASD). In our approach, audio signals are transformed into log-mel spectrograms, from which intermediate feature maps are extracted, compressed, and reconstructed through the FCM pipeline. For comparison, we implement AAC-LC (Advanced Audio Coding Low Complexity) as a representative perceptual audio codec and VVC (Versatile Video Coding) as a spectrogram-based video codec. Experiments were conducted on the DCASE (Detection and Classification of Acoustic Scenes and Events) 2023 Task 2 dataset, covering four machine types (fan, valve, ToyCar, slider), with anomaly detection performed using the official Autoencoder baseline model released in DCASE 2024. Detection scores were computed from reconstruction error and Mahalanobis distance. The results show that the proposed FCM-based ACoM (Audio Coding for Machines) achieves comparable or superior performance to AAC at less than half the bitrate, reliably preserving critical features even under ultra-low bitrate conditions (1.3–6.3 kbps). While VVC retains competitive performance only at high bitrates, it degrades sharply at low bitrates. These findings demonstrate that feature-based compression offers a promising direction for next-generation ACoM standardization, enabling efficient and robust ASD in bandwidth-constrained industrial environments. Full article
(This article belongs to the Special Issue Visual Attributes in Computer Vision Applications)
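As a rough illustration of the feature path the abstract describes (log-mel spectrogram in, anomaly score out), the sketch below extracts a log-mel feature vector and scores it with a Mahalanobis distance against statistics of normal clips; the file name, feature dimensions, and placeholder data are assumptions, not the paper's pipeline.

```python
# Illustrative sketch only: log-mel extraction plus a Mahalanobis-based anomaly
# score, loosely mirroring the DCASE-style ASD scoring mentioned in the abstract.
import numpy as np
import librosa

def log_mel(path, sr=16000, n_mels=128):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)           # shape: (n_mels, frames)

def mahalanobis_score(x, mean, cov_inv):
    """Distance of a feature vector x from the distribution of normal sounds."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# normal_feats: (N, D) matrix of features from normal machine sounds (placeholder data).
normal_feats = np.random.randn(200, 128)
mean = normal_feats.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(normal_feats, rowvar=False))
test_vec = log_mel("machine_clip.wav").mean(axis=1)        # hypothetical test clip
print(mahalanobis_score(test_vec, mean, cov_inv))
```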

18 pages, 2459 KB  
Article
FFMamba: Feature Fusion State Space Model Based on Sound Event Localization and Detection
by Yibo Li, Dongyuan Ge, Jieke Xu and Xifan Yao
Electronics 2025, 14(19), 3874; https://doi.org/10.3390/electronics14193874 - 29 Sep 2025
Viewed by 398
Abstract
Previous studies on Sound Event Localization and Detection (SELD) have primarily focused on CNN- and Transformer-based designs. While CNNs possess local receptive fields, making it difficult to capture global dependencies over long sequences, Transformers excel at modeling long-range dependencies but have limited sensitivity to local time–frequency features. Recently, the VMamba architecture, built upon the Visual State Space (VSS) model, has shown great promise in handling long sequences, yet it remains limited in modeling local spatial details. To address this issue, we propose a novel state space model with an attention-enhanced feature fusion mechanism, termed FFMamba, which balances both local spatial modeling and long-range dependency capture. At a fine-grained level, we design two key modules: the Multi-Scale Fusion Visual State Space (MSFVSS) module and the Wavelet Transform-Enhanced Downsampling (WTED) module. Specifically, the MSFVSS module integrates a Multi-Scale Fusion (MSF) component into the VSS framework, enhancing its ability to capture both long-range temporal dependencies and detailed local spatial information. Meanwhile, the WTED module employs a dual-branch design to fuse spatial and frequency domain features, improving the richness of feature representations. Comparative experiments were conducted on the DCASE2021 Task 3 and DCASE2022 Task 3 datasets. The results demonstrate that the proposed FFMamba model outperforms recent approaches in capturing long-range temporal dependencies and effectively integrating multi-scale audio features. In addition, ablation studies confirmed the effectiveness of the MSFVSS and WTED modules. Full article
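The multi-scale fusion idea (parallel receptive fields whose outputs are merged back together) can be sketched generically in PyTorch; this is an illustration of the concept only, not the paper's MSFVSS or WTED modules.

```python
# Generic multi-scale fusion block (illustrative, not the MSFVSS/WTED code):
# parallel convolutions with different kernel sizes capture local detail at
# several scales, and a 1x1 convolution fuses them back to the input width.
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (batch, channels, time, freq)
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1)) + x   # residual connection

x = torch.randn(2, 64, 100, 64)                 # fake batch of spectrogram features
print(MultiScaleFusion(64)(x).shape)            # torch.Size([2, 64, 100, 64])
```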

68 pages, 8643 KB  
Article
From Sensors to Insights: Interpretable Audio-Based Machine Learning for Real-Time Vehicle Fault and Emergency Sound Classification
by Mahmoud Badawy, Amr Rashed, Amna Bamaqa, Hanaa A. Sayed, Rasha Elagamy, Malik Almaliki, Tamer Ahmed Farrag and Mostafa A. Elhosseini
Machines 2025, 13(10), 888; https://doi.org/10.3390/machines13100888 - 28 Sep 2025
Viewed by 863
Abstract
Unrecognized mechanical faults and emergency sounds in vehicles can compromise safety, particularly for individuals with hearing impairments and in sound-insulated or autonomous driving environments. As intelligent transportation systems (ITSs) evolve, there is a growing need for inclusive, non-intrusive, and real-time diagnostic solutions that enhance situational awareness and accessibility. This study introduces an interpretable, sound-based machine learning framework to detect vehicle faults and emergency sound events using acoustic signals as a scalable diagnostic source. Three purpose-built datasets were developed: one for vehicular fault detection, another for emergency and environmental sounds, and a third integrating both to reflect real-world ITS acoustic scenarios. Audio data were preprocessed through normalization, resampling, and segmentation and transformed into numerical vectors using Mel-Frequency Cepstral Coefficients (MFCCs), Mel spectrograms, and Chroma features. To ensure performance and interpretability, feature selection was conducted using SHAP (explainability), Boruta (relevance), and ANOVA (statistical significance). A two-phase experimental workflow was implemented: Phase 1 evaluated 15 classical models, identifying ensemble classifiers and multi-layer perceptrons (MLPs) as top performers; Phase 2 applied advanced feature selection to refine model accuracy and transparency. Ensemble models such as Extra Trees, LightGBM, and XGBoost achieved over 91% accuracy and AUC scores exceeding 0.99. SHAP provided model transparency without performance loss, while ANOVA achieved high accuracy with fewer features. The proposed framework enhances accessibility by translating auditory alarms into visual/haptic alerts for hearing-impaired drivers and can be integrated into smart city ITS platforms via roadside monitoring systems. Full article
(This article belongs to the Section Vehicle Engineering)
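A hedged sketch of the kind of pipeline the abstract describes (MFCC-style summary features, ANOVA-based feature selection, an ensemble classifier); the feature dimensions, placeholder data, and parameters are made up for illustration and are not the authors' code.

```python
# Illustrative pipeline sketch: MFCC/Chroma summary features, ANOVA feature
# selection, and an Extra Trees classifier (all settings are assumptions).
import numpy as np
import librosa
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import ExtraTreesClassifier

def clip_features(path, sr=22050, n_mfcc=20):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    # Summarize each coefficient over time with its mean and standard deviation.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           chroma.mean(axis=1), chroma.std(axis=1)])

# In practice X would stack clip_features(...) per recording; random placeholders here.
X, y = np.random.randn(300, 64), np.random.randint(0, 4, size=300)
selector = SelectKBest(f_classif, k=32).fit(X, y)       # ANOVA F-test selection
clf = ExtraTreesClassifier(n_estimators=300, random_state=0).fit(selector.transform(X), y)
print(clf.score(selector.transform(X), y))
```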

29 pages, 2766 KB  
Article
Sound-Based Detection of Slip and Trip Incidents Among Construction Workers Using Machine and Deep Learning
by Fangxin Li, Francis Xavier Duorinaah, Min-Koo Kim, Julian Thedja, JoonOh Seo and Dong-Eun Lee
Buildings 2025, 15(17), 3136; https://doi.org/10.3390/buildings15173136 - 1 Sep 2025
Viewed by 697
Abstract
Unsafe events such as slips and trips occur regularly on construction sites. Efficient identification of these events can help protect workers from accidents and improve site safety. However, current detection methods rely on subjective reporting, which has several limitations. To address these limitations, this study presents a sound-based slip and trip classification method using wearable sound sensors and machine learning. Audio signals were recorded using a smartwatch during simulated slip and trip events. Various 1D and 2D features were extracted from the processed audio signals and used to train several classifiers. Three key findings are as follows: (1) The hybrid CNN-LSTM network achieved the highest classification accuracy of 0.966 with 2D MFCC features, while GMM-HMM achieved the highest accuracy of 0.918 with 1D sound features. (2) 1D MFCC features achieved an accuracy of 0.867, outperforming time- and frequency-domain 1D features. (3) MFCC images were the best 2D features for slip and trip classification. This study presents an objective method for detecting slip and trip events, thereby providing a complementary approach to manual assessments. Practically, the findings serve as a foundation for developing automated near-miss detection systems, identification of workers constantly vulnerable to unsafe events, and detection of unsafe and hazardous areas on construction sites. Full article
(This article belongs to the Section Construction Management, and Computers & Digitization)
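The 1D-versus-2D distinction in the findings can be made concrete with a small librosa sketch: the same MFCC matrix is either summarized over time (a 1D vector for classical models) or kept as an image-like array (2D input for a CNN-LSTM). The file name, window settings, and coefficient count below are arbitrary assumptions.

```python
# Illustrative only: turning a smartwatch audio clip into 1D and 2D MFCC features.
import numpy as np
import librosa

y, sr = librosa.load("slip_event.wav", sr=16000)      # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)    # shape: (40, frames)

mfcc_1d = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # 80-dim vector
mfcc_2d = mfcc[np.newaxis, ...]                        # (1, 40, frames) "image"

print(mfcc_1d.shape, mfcc_2d.shape)
```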

19 pages, 5808 KB  
Article
From Convolution to Spikes for Mental Health: A CNN-to-SNN Approach Using the DAIC-WOZ Dataset
by Victor Triohin, Monica Leba and Andreea Cristina Ionica
Appl. Sci. 2025, 15(16), 9032; https://doi.org/10.3390/app15169032 - 15 Aug 2025
Viewed by 1583
Abstract
Depression remains a leading cause of global disability, yet scalable and objective diagnostic tools are still lacking. Speech has emerged as a promising non-invasive modality for automated depression detection, due to its strong correlation with emotional state and ease of acquisition. While convolutional neural networks (CNNs) have achieved state-of-the-art performance in this domain, their high computational demands limit deployment in low-resource or real-time settings. Spiking neural networks (SNNs), by contrast, offer energy-efficient, event-driven computation inspired by biological neurons, but they are difficult to train directly and often exhibit degraded performance on complex tasks. This study investigates whether CNNs trained on audio data from the clinically annotated DAIC-WOZ dataset can be effectively converted into SNNs while preserving diagnostic accuracy. We evaluate multiple conversion thresholds using the SpikingJelly framework and find that the 99.9% mode yields an SNN that matches the original CNN in both accuracy (82.5%) and macro F1 score (0.8254). Lower threshold settings offer increased sensitivity to depressive speech at the cost of overall accuracy, while naïve conversion strategies result in significant performance loss. These findings support the feasibility of CNN-to-SNN conversion for real-world mental health applications and underscore the importance of precise calibration in achieving clinically meaningful results. Full article
(This article belongs to the Special Issue eHealth Innovative Approaches and Applications: 2nd Edition)
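The "99.9% mode" refers to how conversion thresholds are calibrated from the CNN's activation statistics; the sketch below shows the generic idea (a high per-layer percentile of observed activations becomes the spiking threshold) rather than SpikingJelly's actual API, and it assumes the network uses nn.ReLU modules.

```python
# Generic illustration of percentile-based threshold calibration for ANN-to-SNN
# conversion; not SpikingJelly code, just the idea behind a "99.9% mode".
import torch
import torch.nn as nn

def calibrate_thresholds(model, data_loader, percentile=99.9):
    """Record post-ReLU activations on calibration data and take a high
    percentile per layer as that layer's spiking firing threshold."""
    acts = {name: [] for name, m in model.named_modules() if isinstance(m, nn.ReLU)}
    hooks = [m.register_forward_hook(
                 lambda mod, inp, out, name=name: acts[name].append(out.detach().flatten()))
             for name, m in model.named_modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        for x, _ in data_loader:
            model(x)
    for h in hooks:
        h.remove()
    return {name: torch.quantile(torch.cat(v), percentile / 100.0).item()
            for name, v in acts.items()}
```

Lowering the percentile, as the abstract notes, trades overall accuracy for sensitivity, since smaller thresholds make neurons fire more readily.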

22 pages, 6359 KB  
Article
Development and Testing of an AI-Based Specific Sound Detection System Integrated on a Fixed-Wing VTOL UAV
by Gabriel-Petre Badea, Mădălin Dombrovschi, Tiberius-Florian Frigioescu, Maria Căldărar and Daniel-Eugeniu Crunteanu
Acoustics 2025, 7(3), 48; https://doi.org/10.3390/acoustics7030048 - 30 Jul 2025
Viewed by 2217
Abstract
This study presents the development and validation of an AI-based system for detecting chainsaw sounds, integrated into a fixed-wing VTOL UAV. The system employs a convolutional neural network trained on log-mel spectrograms derived from four sound classes: chainsaw, music, electric drill, and human voices. Initial validation was performed through ground testing. Acoustic data acquisition is optimized during cruise flight, when wing-mounted motors are shut down and the rear motor operates at 40–60% capacity, significantly reducing noise interference. To address residual motor noise, a preprocessing module was developed using reference recordings obtained in an anechoic chamber. Two configurations were tested to capture the motor’s acoustic profile by changing the UAV’s orientation relative to the fixed microphone. The embedded system processes incoming audio in real time, enabling low-latency classification without data transmission. Field experiments confirmed the model’s high precision and robustness under varying flight and environmental conditions. Results validate the feasibility of real-time, onboard acoustic event detection using spectrogram-based deep learning on UAV platforms, and support its applicability for scalable aerial monitoring tasks. Full article
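One common way to use reference noise recordings is basic spectral subtraction; the sketch below is that generic technique under assumed file names, not necessarily the preprocessing module the authors built.

```python
# Generic spectral-subtraction sketch (illustrative; not the authors' module):
# subtract the average magnitude spectrum of a motor-noise reference recording
# from the magnitude spectrogram of the in-flight audio before classification.
import numpy as np
import librosa

def denoise(flight_wav, noise_wav, sr=22050, n_fft=1024, hop=256):
    y, _ = librosa.load(flight_wav, sr=sr)
    n, _ = librosa.load(noise_wav, sr=sr)
    Y = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    noise_mag = np.abs(librosa.stft(n, n_fft=n_fft, hop_length=hop)).mean(axis=1, keepdims=True)
    clean_mag = np.maximum(np.abs(Y) - noise_mag, 0.0)          # floor at zero
    return librosa.istft(clean_mag * np.exp(1j * np.angle(Y)), hop_length=hop)

cleaned = denoise("cruise_segment.wav", "anechoic_motor_reference.wav")  # hypothetical files
```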

22 pages, 3768 KB  
Article
MWB_Analyzer: An Automated Embedded System for Real-Time Quantitative Analysis of Morphine Withdrawal Behaviors in Rodents
by Moran Zhang, Qianqian Li, Shunhang Li, Binxian Sun, Zhuli Wu, Jinxuan Liu, Xingchao Geng and Fangyi Chen
Toxics 2025, 13(7), 586; https://doi.org/10.3390/toxics13070586 - 14 Jul 2025
Viewed by 2815
Abstract
Background/Objectives: Substance use disorders, particularly opioid addiction, continue to pose a major global health and toxicological challenge. Morphine dependence represents a significant problem in both clinical practice and preclinical research, particularly in modeling the pharmacodynamics of withdrawal. Rodent models remain indispensable for investigating the neurotoxicological effects of chronic opioid exposure and withdrawal. However, conventional behavioral assessments rely on manual observation, limiting objectivity, reproducibility, and scalability—critical constraints in modern drug toxicity evaluation. This study introduces MWB_Analyzer, an automated and high-throughput system designed to quantitatively and objectively assess morphine withdrawal behaviors in rats. The goal is to enhance toxicological assessments of CNS-active substances through robust, scalable behavioral phenotyping. Methods: MWB_Analyzer integrates optimized multi-angle video capture, real-time signal processing, and machine learning-driven behavioral classification. An improved YOLO-based architecture was developed for the accurate detection and categorization of withdrawal-associated behaviors in video frames, while a parallel pipeline processed audio signals. The system incorporates behavior-specific duration thresholds to isolate pharmacologically and toxicologically relevant behavioral events. Experimental animals were assigned to high-dose, low-dose, and control groups. Withdrawal was induced and monitored under standardized toxicological protocols. Results: MWB_Analyzer achieved over 95% reduction in redundant frame processing, markedly improving computational efficiency. It demonstrated high classification accuracy: >94% for video-based behaviors (93% on edge devices) and >92% for audio-based events. The use of behavioral thresholds enabled sensitive differentiation between dosage groups, revealing clear dose–response relationships and supporting its application in neuropharmacological and neurotoxicological profiling. Conclusions: MWB_Analyzer offers a robust, reproducible, and objective platform for the automated evaluation of opioid withdrawal syndromes in rodent models. It enhances throughput, precision, and standardization in addiction research. Importantly, this tool supports toxicological investigations of CNS drug effects, preclinical pharmacokinetic and pharmacodynamic evaluations, drug safety profiling, and regulatory assessment of novel opioid and CNS-active therapeutics. Full article
(This article belongs to the Section Drugs Toxicity)
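The behavior-specific duration thresholds described above amount to discarding detections shorter than a per-behavior minimum; a minimal sketch with hypothetical thresholds and event format:

```python
# Minimal illustration (thresholds and event format are hypothetical): keep only
# behavior events whose duration exceeds the behavior-specific minimum.
MIN_DURATION_S = {"wet_dog_shake": 0.3, "teeth_chattering": 0.5, "jumping": 0.2}

def filter_events(events):
    """events: list of (behavior, start_s, end_s) tuples from the detector."""
    return [(b, s, e) for b, s, e in events
            if (e - s) >= MIN_DURATION_S.get(b, 0.0)]

print(filter_events([("jumping", 10.0, 10.1), ("wet_dog_shake", 12.0, 12.6)]))
# -> [('wet_dog_shake', 12.0, 12.6)]
```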

27 pages, 1533 KB  
Article
Sound Source Localization Using Hybrid Convolutional Recurrent Neural Networks in Undesirable Conditions
by Bastian Estay Zamorano, Ali Dehghan Firoozabadi, Alessio Brutti, Pablo Adasme, David Zabala-Blanco, Pablo Palacios Játiva and Cesar A. Azurdia-Meza
Electronics 2025, 14(14), 2778; https://doi.org/10.3390/electronics14142778 - 10 Jul 2025
Viewed by 1251
Abstract
Sound event localization and detection (SELD) is a fundamental task in spatial audio processing that involves identifying both the type and location of sound events in acoustic scenes. Current SELD models often struggle with low signal-to-noise ratios (SNRs) and high reverberation. This article addresses SELD by reformulating direction of arrival (DOA) estimation as a multi-class classification task, leveraging deep convolutional recurrent neural networks (CRNNs). We propose and evaluate two modified architectures: M-DOAnet, an optimized version of DOAnet for localization and tracking, and M-SELDnet, a modified version of SELDnet designed for joint SELD. Both modified models were rigorously evaluated on the STARSS23 dataset, which comprises real-world indoor scenes annotated with 13 sound classes and totaling over 7 h of audio, using spectrograms and acoustic intensity maps from first-order Ambisonics (FOA) signals. M-DOAnet achieved exceptional localization (6.00° DOA error, 72.8% F1-score) and perfect tracking (100% MOTA with zero identity switches). It also demonstrated high computational efficiency, training in 4.5 h (164 s/epoch). In contrast, M-SELDnet delivered strong overall SELD performance (0.32 rad DOA error, 0.75 F1-score, 0.38 error rate, 0.20 SELD score), but with significantly higher resource demands, training in 45 h (1620 s/epoch). Our findings underscore a clear trade-off between model specialization and multifunctionality, providing practical insights for designing SELD systems in real-time and computationally constrained environments. Full article
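Reformulating DOA estimation as multi-class classification means discretizing the sphere of directions into a fixed set of classes; the sketch below uses an assumed 10° azimuth / 20° elevation grid purely for illustration, not the resolution used in the paper.

```python
# Illustration of casting DOA estimation as classification (grid resolution is
# an assumption): each (azimuth, elevation) pair maps to one class index on a
# fixed angular grid, so a softmax output can predict the direction.
AZ_STEP, EL_STEP = 10, 20                       # degrees
N_AZ = 360 // AZ_STEP                           # 36 azimuth bins
N_EL = 180 // EL_STEP + 1                       # elevations -90..90 -> 10 bins

def doa_to_class(azimuth_deg, elevation_deg):
    az_bin = int(azimuth_deg % 360) // AZ_STEP
    el_bin = int(elevation_deg + 90) // EL_STEP
    return el_bin * N_AZ + az_bin               # single class index

print(doa_to_class(45, 0), "of", N_AZ * N_EL, "classes")   # 148 of 360 classes
```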

73 pages, 2833 KB  
Article
A Comprehensive Methodological Survey of Human Activity Recognition Across Diverse Data Modalities
by Jungpil Shin, Najmul Hassan, Abu Saleh Musa Miah and Satoshi Nishimura
Sensors 2025, 25(13), 4028; https://doi.org/10.3390/s25134028 - 27 Jun 2025
Cited by 4 | Viewed by 4026
Abstract
Human Activity Recognition (HAR) systems aim to understand human behavior and assign a label to each action, attracting significant attention in computer vision due to their wide range of applications. HAR can leverage various data modalities, such as RGB images and video, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, and radar signals. Each modality provides unique and complementary information suited to different application scenarios. Consequently, numerous studies have investigated diverse approaches for HAR using these modalities. This survey includes only peer-reviewed research papers published in English to ensure linguistic consistency and academic integrity. This paper presents a comprehensive survey of the latest advancements in HAR from 2014 to 2025, focusing on Machine Learning (ML) and Deep Learning (DL) approaches categorized by input data modalities. We review both single-modality and multi-modality techniques, highlighting fusion-based and co-learning frameworks. Additionally, we cover advancements in hand-crafted action features, methods for recognizing human–object interactions, and activity detection. Our survey includes a detailed dataset description for each modality, as well as a summary of the latest HAR systems, accompanied by a mathematical derivation for evaluating the deep learning model for each modality, and it also provides comparative results on benchmark datasets. Finally, we provide insightful observations and propose effective future research directions in HAR. Full article
(This article belongs to the Special Issue Computer Vision and Sensors-Based Application for Intelligent Systems)

26 pages, 7054 KB  
Article
An Ensemble of Convolutional Neural Networks for Sound Event Detection
by Abdinabi Mukhamadiyev, Ilyos Khujayarov, Dilorom Nabieva and Jinsoo Cho
Mathematics 2025, 13(9), 1502; https://doi.org/10.3390/math13091502 - 1 May 2025
Cited by 1 | Viewed by 2730
Abstract
Sound event detection tasks are rapidly advancing in the field of pattern recognition, and deep learning methods are particularly well suited for such tasks. One of the important directions in this field is to detect the sounds of emotional events around residential buildings in smart cities and quickly assess the situation for security purposes. This research presents a comprehensive study of an ensemble convolutional recurrent neural network (CRNN) model designed for sound event detection (SED) in residential and public safety contexts. The work focuses on extracting meaningful features from audio signals using image-based representations, such as Discrete Cosine Transform (DCT) spectrograms, Cochleagrams, and Mel spectrograms, to enhance robustness against noise and improve feature extraction. In collaboration with police officers, a two-hour dataset consisting of 112 clips related to four classes of emotional sounds, such as harassment, quarrels, screams, and breaking sounds, was prepared. In addition to the crowdsourced dataset, publicly available datasets were used to broaden the study’s applicability. Our dataset contains 5055 strongly labeled audio files of different lengths, totaling 14.14 h and covering 13 separate sound categories. The proposed CRNN model integrates spatial and temporal feature extraction by processing these spectrograms through convolutional and bi-directional gated recurrent unit (GRU) layers. An ensemble approach combines predictions from three models, achieving F1 scores of 71.5% for segment-based metrics and 46% for event-based metrics. The results demonstrate the model’s effectiveness in detecting sound events under noisy conditions, even with a small, unbalanced dataset. This research highlights the potential of the model for real-time audio surveillance systems using mini-computers, offering cost-effective and accurate solutions for maintaining public order. Full article
(This article belongs to the Special Issue Advanced Machine Vision with Mathematics)
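The ensemble step can be illustrated as simple averaging of the three models' frame-wise class probabilities followed by thresholding into per-frame activity; the array shapes and the 0.5 threshold are assumptions for illustration.

```python
# Illustrative ensemble sketch (shapes and threshold are assumptions): average
# frame-wise class probabilities from three CRNNs, then threshold to obtain
# active/inactive decisions per frame and class.
import numpy as np

def ensemble_decisions(pred_a, pred_b, pred_c, threshold=0.5):
    """Each pred_*: (frames, classes) array of probabilities in [0, 1]."""
    avg = (pred_a + pred_b + pred_c) / 3.0
    return avg >= threshold                     # boolean activity matrix

frames, classes = 100, 13
preds = [np.random.rand(frames, classes) for _ in range(3)]
print(ensemble_decisions(*preds).sum(axis=0))   # active frame count per class
```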

28 pages, 6222 KB  
Article
IoTBystander: A Non-Intrusive Dual-Channel-Based Smart Home Security Monitoring Framework
by Haotian Chi, Qi Ma, Yuwei Wang, Jing Yang and Haijun Geng
Appl. Sci. 2025, 15(9), 4795; https://doi.org/10.3390/app15094795 - 25 Apr 2025
Viewed by 1151
Abstract
The increasing prevalence of IoT technology in smart homes has significantly enhanced convenience but also introduced new security and safety challenges. Traditional security solutions, reliant on sequences of IoT-generated event data (e.g., notifications of device status changes and sensor readings), are vulnerable to cyberattacks, such as message forgery, interception, and delaying attacks, and fail to monitor non-smart devices. Moreover, fragmented smart home ecosystems require vendor cooperation or system modifications for comprehensive monitoring, limiting the practicality of the existing approaches. To address these issues, we propose IoTBystander, a non-intrusive dual-channel smart home security monitoring framework that utilizes two ubiquitous platform-agnostic signals, i.e., audio and network, to monitor user and device activities. We introduce a novel dual-channel aggregation mechanism that integrates insights from both channels and cross-verifies the integrity of monitoring results. This approach expands the monitoring scope to include non-smart devices and provides richer context for anomaly detection, failure diagnosis, and configuration debugging. Empirical evaluations on a real-world testbed with nine smart and eleven non-smart devices demonstrate the high accuracy of IoTBystander in event recognition: 92.86% for recognizing events of smart devices, 95.09% for non-smart devices, and 94.27% for all devices. A case study on five anomaly scenarios further shows significant improvements in anomaly detection performance by combining the strengths of both channels. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
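The cross-verification idea (an event is trusted only if both channels agree within a short window) can be sketched as follows; the event format and the 5 s window are assumptions for illustration, not the paper's aggregation mechanism.

```python
# Hypothetical sketch of dual-channel cross-verification: an event reported by
# the audio channel is confirmed only if the network channel reports a matching
# event within a small time window (format and window size are assumptions).
def cross_verify(audio_events, network_events, window_s=5.0):
    """Each event: (timestamp_s, label). Returns audio events confirmed by the network channel."""
    confirmed = []
    for t_a, label in audio_events:
        if any(abs(t_a - t_n) <= window_s and label == l_n
               for t_n, l_n in network_events):
            confirmed.append((t_a, label))
    return confirmed

audio = [(10.2, "door_open"), (42.0, "vacuum_on")]
network = [(11.0, "door_open")]
print(cross_verify(audio, network))             # [(10.2, 'door_open')]
```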

24 pages, 2941 KB  
Article
Real-Time Acoustic Detection of Critical Incidents in Smart Cities Using Artificial Intelligence and Edge Networks
by Ioannis Saradopoulos, Ilyas Potamitis, Stavros Ntalampiras, Iraklis Rigakis, Charalampos Manifavas and Antonios Konstantaras
Sensors 2025, 25(8), 2597; https://doi.org/10.3390/s25082597 - 20 Apr 2025
Viewed by 2385
Abstract
We present a system that integrates diverse technologies to achieve real-time, distributed audio surveillance. The system employs a network of microphones mounted on ESP32 platforms, which transmit compressed audio chunks via the MQTT protocol to Raspberry Pi 5 devices for acoustic classification. These devices host an audio transformer model trained on the AudioSet dataset, enabling real-time classification and timestamping of audio events with high accuracy. The output of the transformer is kept in a database of events and is subsequently converted into JSON format. The latter is further parsed into a graph structure that encapsulates the annotated soundscape, providing a rich and dynamic representation of audio environments. These graphs are subsequently traversed and analyzed using dedicated Python code and large language models (LLMs), enabling the system to answer complex queries about the nature, relationships, and context of detected audio events. We introduce a novel graph parsing method that achieves low false-alarm rates. In the task of analyzing the audio from a 1 h and 40 min long movie featuring hazardous driving practices, our approach achieved an accuracy of 0.882, precision of 0.8, recall of 1.0, and an F1 score of 0.89. By combining the robustness of distributed sensing with the precision of transformer-based audio classification, our approach of treating audio as text paves the way for advanced applications in acoustic surveillance, environmental monitoring, and beyond. Full article
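The events-to-graph step can be illustrated with a small sketch: timestamped classifier outputs stored as JSON are loaded into a graph whose edges encode temporal adjacency, which can then be traversed to answer queries. The node and edge conventions here are assumptions, not the paper's parser.

```python
# Illustrative only (node/edge conventions are assumptions): turn a JSON list of
# timestamped audio events into a graph linking consecutive events, then query it.
import json
import networkx as nx

events_json = '[{"t": 3.1, "label": "car horn"}, {"t": 4.0, "label": "tire squeal"}, {"t": 4.8, "label": "crash"}]'
events = json.loads(events_json)

g = nx.DiGraph()
for i, ev in enumerate(events):
    g.add_node(i, **ev)
    if i > 0:
        g.add_edge(i - 1, i, gap=ev["t"] - events[i - 1]["t"])  # temporal adjacency

# Query: does a "crash" directly follow a "tire squeal" within 2 seconds?
hits = [(u, v) for u, v in g.edges
        if g.nodes[u]["label"] == "tire squeal" and g.nodes[v]["label"] == "crash"
        and g.edges[u, v]["gap"] <= 2.0]
print(bool(hits))                                # True
```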

26 pages, 15804 KB  
Article
Acoustic Event Detection in Vehicles: A Multi-Label Classification Approach
by Anaswara Antony, Wolfgang Theimer, Giovanni Grossetti and Christoph M. Friedrich
Sensors 2025, 25(8), 2591; https://doi.org/10.3390/s25082591 - 19 Apr 2025
Viewed by 1684
Abstract
Autonomous driving technologies for environmental perception are mostly based on visual cues obtained from sensors like cameras, RADAR, or LiDAR. They capture the environment as if seen through “human eyes”. If this visual information is complemented with auditory information, thereby also providing “ears”, driverless cars can become more reliable and safer. In this paper, an Acoustic Event Detection model is presented that can detect various acoustic events in an automotive context along with their time of occurrence to create an audio scene description. The proposed detection methodology uses the pre-trained network Bidirectional Encoder representation from Audio Transformers (BEATs) and a single-layer neural network trained on the database of real audio recordings collected from different cars. The performance of the model is evaluated for different parameters and datasets. The segment-based results for a duration of 1 s show that the model performs well for 11 sound classes with a mean accuracy of 0.93 and F1-Score of 0.39 for a confidence threshold of 0.5. The threshold-independent metric mAP has a value of 0.77. The model also performs well for sound mixtures containing two overlapping events with mean accuracy, F1-Score, and mAP equal to 0.89, 0.42, and 0.658, respectively. Full article
(This article belongs to the Section Vehicular Sensing)
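The classification stage described above (a single linear layer on top of pre-trained BEATs embeddings, with multi-label outputs) can be sketched generically; the embedding size, class count, and data are assumptions, not the authors' configuration.

```python
# Generic multi-label head sketch (embedding size and class count are assumed):
# a single linear layer over pre-extracted BEATs-style embeddings, with sigmoid
# outputs thresholded at 0.5 independently per class.
import torch
import torch.nn as nn

EMB_DIM, N_CLASSES = 768, 11                    # assumptions for illustration
head = nn.Linear(EMB_DIM, N_CLASSES)

embeddings = torch.randn(4, EMB_DIM)            # four one-second audio segments
probs = torch.sigmoid(head(embeddings))         # independent per-class probabilities
active = probs >= 0.5                           # multi-label decisions per segment
print(active.int())
```

Because the labels are independent, training such a head would typically use a binary cross-entropy loss per class rather than a softmax, which is what allows overlapping events in a segment.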

43 pages, 2542 KB  
Article
Mathematical Background and Algorithms of a Collection of Android Apps for a Google Play Store Page
by Roland Szabo
Appl. Sci. 2025, 15(8), 4431; https://doi.org/10.3390/app15084431 - 17 Apr 2025
Viewed by 609
Abstract
This paper discusses three algorithmic strategies tailored for distinct applications, each aiming to tackle specific operational challenges. The first application unveils an innovative SMS messaging system that substitutes manual typing with voice interaction. The key algorithm facilitates real-time conversion from speech to text for message creation and from text to speech for message playback, thus turning SMS communication into an audio-focused exchange while preserving conventional messaging standards. The second application suggests a secure file management system for Android, utilizing encryption and access control algorithms to safeguard user privacy. Its mathematical framework centers on cryptographic methods for file security and authentication processes to prevent unauthorized access. The third application redefines flashlight functionality using an optimized touch interface algorithm. By employing a screen-wide double-tap gesture recognition system, this approach removes the reliance on a physical button, depending instead on advanced event detection and hardware control logic to activate the device’s flash. All applications are fundamentally based on mathematical modeling and algorithmic effectiveness, emphasizing computational approaches over implementation specifics. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
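The double-tap recognition described for the third application reduces to timing logic: two taps anywhere on the screen within a short interval trigger the flash. A language-neutral sketch of that event logic (the 300 ms interval is an assumption; real Android code would use the platform's GestureDetector):

```python
# Minimal double-tap timing sketch (interval is an assumption); this only
# illustrates the event-detection logic, not the Android implementation.
class DoubleTapDetector:
    def __init__(self, max_interval_s=0.3):
        self.max_interval_s = max_interval_s
        self.last_tap_time = None

    def on_tap(self, t_s):
        """Return True when this tap completes a double tap."""
        is_double = (self.last_tap_time is not None
                     and (t_s - self.last_tap_time) <= self.max_interval_s)
        self.last_tap_time = None if is_double else t_s
        return is_double

d = DoubleTapDetector()
print(d.on_tap(1.00), d.on_tap(1.20))           # False True -> toggle flashlight
```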
