Search Results (265)

Search Parameters:
Keywords = MFCC

13 pages, 3377 KiB  
Article
Development of a Baby Cry Identification System Using a Raspberry Pi-Based Embedded System and Machine Learning
by Mohcin Mekhfioui, Wiam Fadel, Fatima Ezzahra Hammouch, Oussama Laayati, Marouan Bouchouirbat, Nabil El Bazi, Amal Satif, Tarik Boujiha and Ahmed Chebak
Technologies 2025, 13(4), 130; https://doi.org/10.3390/technologies13040130 - 31 Mar 2025
Viewed by 46
Abstract
Newborns cry intensely, and most parents struggle to understand the reason behind their crying, as the baby cannot verbally express their needs. This makes it challenging for parents to know whether their child has a need or a health issue. An embedded solution based on a Raspberry Pi is presented to address this problem. The module applies audio-processing techniques to capture, analyze, classify, and remotely monitor a baby’s cries. These techniques rely on prosodic and cepstral features, such as MFCCs, and can differentiate the reason behind a baby’s cry, such as hunger, stomach pain, or discomfort. A machine learning model was trained to predict the reason from the audio features. The embedded system includes a microphone to capture cries in real time and a display screen to show the predicted reason. In addition, the system sends the collected data to a web server for storage, enabling remote monitoring and more detailed data analysis. A cell phone application has also been developed to notify parents in real time of why their baby is crying, enabling them to respond quickly and efficiently to their infant’s needs, even when they are not nearby.
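The pipeline this abstract describes, cepstral features feeding a lightweight classifier on an embedded board, can be approximated in a few lines. A minimal sketch assuming librosa and scikit-learn; the file names and cry categories are illustrative, not the authors' implementation:

```python
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cry_features(wav_path, sr=16000, n_mfcc=13):
    """Summarize a cry recording as the mean/std of its MFCCs."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labeled recordings: hunger, pain, discomfort.
X = np.array([cry_features(p) for p in ["hunger1.wav", "pain1.wav", "discomfort1.wav"]])
y = np.array(["hunger", "pain", "discomfort"])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([cry_features("new_cry.wav")]))  # predicted reason shown on the display
```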

14 pages, 4290 KiB  
Article
Acoustic Identification Method of Partial Discharge in GIS Based on Improved MFCC and DBO-RF
by Xueqiong Zhu, Chengbo Hu, Jinggang Yang, Ziquan Liu, Zhen Wang, Zheng Liu and Yiming Zang
Energies 2025, 18(7), 1619; https://doi.org/10.3390/en18071619 - 24 Mar 2025
Viewed by 102
Abstract
Gas Insulated Switchgear (GIS) is critical substation equipment in the power system, and its safe and stable operation is essential to the reliability of the power system. To accurately identify partial discharge in GIS, this paper proposes an ultrasonic acoustic identification method based on improved mel-frequency cepstral coefficients (MFCC) and a random forest optimized by the dung beetle algorithm (DBO-RF). Firstly, three typical GIS partial discharge defects, namely free metal particles, suspended potential, and surface discharge, were designed and constructed. Secondly, wavelet denoising was used to weaken the influence of noise on the ultrasonic signals, and conventional, first-order, and second-order differential MFCC feature parameters were extracted, followed by principal component analysis for dimensionality reduction. Finally, the reduced feature set was input into the DBO-RF model for fault identification. The results show that this method can accurately identify partial discharge for the typical GIS defects, with a recognition accuracy of 92.2%. The research results can provide a basis for GIS insulation fault detection and diagnosis.
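The feature chain in this abstract (MFCCs plus first- and second-order deltas, PCA, then a random forest) maps onto standard tooling. A rough sketch assuming librosa and scikit-learn; the dung beetle optimizer step is replaced here by default hyperparameters:

```python
import librosa
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

def ultrasonic_mfcc_features(y, sr, n_mfcc=13):
    """Conventional MFCCs with first- and second-order differences, frame-averaged."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, d1, d2]).mean(axis=1)  # 3 * n_mfcc values per recording

# X: one feature vector per denoised ultrasonic recording; y: defect type
# ("particle", "suspended", "surface") -- placeholders for the three defect classes.
model = make_pipeline(PCA(n_components=0.95), RandomForestClassifier(n_estimators=200))
# model.fit(X, y); model.predict(X_new)
```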

36 pages, 4990 KiB  
Article
Toward Inclusive Smart Cities: Sound-Based Vehicle Diagnostics, Emergency Signal Recognition, and Beyond
by Amr Rashed, Yousry Abdulazeem, Tamer Ahmed Farrag, Amna Bamaqa, Malik Almaliki, Mahmoud Badawy and Mostafa A. Elhosseini
Machines 2025, 13(4), 258; https://doi.org/10.3390/machines13040258 - 21 Mar 2025
Viewed by 281
Abstract
Sound-based early fault detection for vehicles is a critical yet underexplored area, particularly within Intelligent Transportation Systems (ITSs) for smart cities. Despite the clear necessity for sound-based diagnostic systems, the scarcity of specialized publicly available datasets presents a major challenge. This study addresses this gap by contributing in multiple dimensions. Firstly, it emphasizes the significance of sound-based diagnostics for real-time fault detection by analyzing sounds directly generated by vehicles, such as engine or brake noises, and by classifying external emergency sounds, like sirens, relevant to vehicle safety. Secondly, this paper introduces a novel dataset encompassing vehicle fault sounds, emergency sirens, and environmental noises, specifically curated to address the absence of such specialized datasets. A comprehensive framework is proposed, combining audio preprocessing, feature extraction (via Mel Spectrograms, MFCCs, and Chromatograms), and classification using 11 models. Evaluations using both compact (52 features) and expanded (126 features) representations show that several classes (e.g., Engine Misfire, Fuel Pump Cartridge Fault, Radiator Fan Failure) achieve near-perfect accuracy, though acoustically similar classes like Universal Joint Failure, Knocking, and Pre-ignition Problem remain challenging. Logistic Regression yielded the highest accuracy of 86.5% for the vehicle fault dataset (DB1) using compact features, while neural networks performed best for datasets DB2 and DB3, achieving 88.4% and 85.5%, respectively. In the second scenario, a Bayesian-Optimized Weighted Soft Voting with Feature Selection (BOWSVFS) approach is proposed, significantly enhancing accuracy to 91.04% for DB1, 88.85% for DB2, and 86.85% for DB3. These results highlight the effectiveness of the proposed methods in addressing key ITS limitations and in enhancing accessibility for individuals with disabilities through auditory-based vehicle diagnostics and emergency recognition systems.
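The weighted soft voting used in the second scenario can be sketched with scikit-learn's VotingClassifier. The base models and the weights below are placeholders standing in for whatever the paper's Bayesian optimization (BOWSVFS) step would select:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Soft voting averages predicted class probabilities, weighted per model.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("nn", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),
    ],
    voting="soft",
    weights=[0.4, 0.3, 0.3],  # placeholder weights; a Bayesian optimizer would tune these
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)
```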
(This article belongs to the Special Issue Recent Developments in Machine Design, Automation and Robotics)

25 pages, 10241 KiB  
Article
Machine Learning-Based Acoustic Analysis of Stingless Bee (Heterotrigona itama) Alarm Signals During Intruder Events
by Ashan Milinda Bandara Ratnayake, Hartini Mohd Yasin, Abdul Ghani Naim, Rahayu Sukmaria Sukri, Norhayati Ahmad, Nurul Hazlina Zaini, Soon Boon Yu, Mohammad Amiruddin Ruslan and Pg Emeroylariffion Abas
Agriculture 2025, 15(6), 591; https://doi.org/10.3390/agriculture15060591 - 11 Mar 2025
Viewed by 295
Abstract
Heterotrigona itama, a widely reared stingless bee species, produces highly valued honey. These bees naturally secure their colonies within logs, accessed via a single entrance tube, but remain vulnerable to intruders and predators. Guard bees play a critical role in colony defense, exhibiting the ability to discriminate between nestmates and non-nestmates and employing strategies such as pheromone release, buzzing, hissing, and vibrations to alert and recruit hive mates during intrusions. This study investigated the acoustic signals produced by H. itama guard bees during intrusions to determine their potential for intrusion detection. Using a Jetson Nano equipped with a microphone and camera, guard bee sounds were recorded and labeled. After preprocessing the sound data, Mel Frequency Cepstral Coefficients (MFCCs) were extracted as features, and various dimensionality reduction techniques were explored. Among them, Linear Discriminant Analysis (LDA) demonstrated the best performance in improving class separability. The reduced feature set was used to train both Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) classifiers. KNN outperformed SVM, achieving a Precision of 0.9527, a Recall of 0.9586, and an F1 Score of 0.9556. Additionally, KNN attained an Overall Cross-Validation Accuracy of 95.54% (±0.67%), demonstrating its superior classification performance. These findings confirm that H. itama produces distinct alarm sounds during intrusions, which can be effectively classified using machine learning, demonstrating the feasibility of sound-based intrusion detection as a cost-effective alternative to image-based approaches. Future research should explore real-world implementation under varying environmental conditions and extend the study to other stingless bee species.
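The reported classification chain, MFCC features reduced with LDA and classified with KNN, is straightforward to reproduce in outline. A sketch assuming scikit-learn, with hypothetical pre-extracted MFCC feature matrices and an assumed two-class (alarm vs. normal) labeling:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# X: MFCC feature vectors per audio segment; y: e.g., "alarm" vs. "normal" guard-bee sound.
pipeline = make_pipeline(
    LinearDiscriminantAnalysis(n_components=1),  # a two-class task allows one discriminant axis
    KNeighborsClassifier(n_neighbors=5),
)
# scores = cross_val_score(pipeline, X, y, cv=5)  # cross-validated accuracy, as reported above
```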

19 pages, 6983 KiB  
Article
Cochleogram-Based Speech Emotion Recognition with the Cascade of Asymmetric Resonators with Fast-Acting Compression Using Time-Distributed Convolutional Long Short-Term Memory and Support Vector Machines
by Cevahir Parlak
Biomimetics 2025, 10(3), 167; https://doi.org/10.3390/biomimetics10030167 - 10 Mar 2025
Viewed by 274
Abstract
Feature extraction is a crucial stage in speech emotion recognition applications, and filter banks with their related statistical functions are widely used for this purpose. Although Mel filters and MFCCs achieve outstanding results, they do not perfectly model the structure of the human ear, as they use a simplified mechanism to simulate the functioning of human cochlear structures. The Mel filter system is not a perfect representation of human hearing, but merely an engineering shortcut to suppress the pitch and low-frequency components, which have little use in traditional speech recognition applications. Speech emotion recognition, however, relies heavily on pitch and low-frequency features. The newly tailored CARFAC 24 model is a sophisticated system for analyzing human speech and is designed to closely simulate the functioning of the human cochlea. In this study, we use the CARFAC 24 system for speech emotion recognition and compare it with state-of-the-art systems in speaker-independent experiments conducted with Time-Distributed Convolutional LSTM networks and Support Vector Machines on the ASED and NEMO emotional speech datasets. The results demonstrate that CARFAC 24 is a valuable alternative to Mel and MFCC features in speech emotion recognition applications.

27 pages, 4269 KiB  
Article
A Self-Supervised Method for Speaker Recognition in Real Sound Fields with Low SNR and Strong Reverberation
by Xuan Zhang, Jun Tang, Huiliang Cao, Chenguang Wang, Chong Shen and Jun Liu
Appl. Sci. 2025, 15(6), 2924; https://doi.org/10.3390/app15062924 - 7 Mar 2025
Viewed by 434
Abstract
Speaker recognition is essential in smart voice applications for personal identification. Current state-of-the-art techniques primarily focus on ideal acoustic conditions. However, the traditional spectrogram struggles to differentiate between noise, reverberation, and speech. To overcome this challenge, MFCC can be replaced with the output from a self-supervised learning model. This study introduces a TDNN enhanced with a pre-trained model for robust performance in noisy and reverberant environments, referred to as PNR-TDNN. The PNR-TDNN employs HuBERT as its backbone, while the TDNN is an improved ECAPA-TDNN. The pre-trained model employs the Canopy/Mini Batch k-means++ strategy. In the TDNN architecture, several enhancements are implemented, including a cross-channel fusion mechanism based on Res2Net. Additionally, a non-average attention mechanism is applied to the pooling operation, focusing on the weight information of each channel within the Squeeze-and-Excitation Net. Furthermore, the contribution of individual channels to the pooling of time-domain frames is enhanced by substituting attentive statistics with multi-head attention statistics. Validated on the zhvoice dataset in noisy conditions, the minimized PNR-TDNN demonstrates a 5.19% improvement in EER compared to CAM++. In more challenging environments with noise and reverberation, the minimized PNR-TDNN further improves EER by 3.71% and 9.6%, respectively, and MinDCF by 3.14% and 3.77%, respectively. The proposed method has also been validated on the VoxCeleb1 and cn-celeb_v2 datasets, representing a significant breakthrough in speaker recognition under challenging conditions. This advancement is particularly important for enhancing safety and protecting personal identification in voice-enabled applications.
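The core substitution described, replacing MFCCs with frame-level embeddings from a pre-trained self-supervised model, can be sketched with the Hugging Face transformers package. The checkpoint name is an assumption (the abstract names HuBERT as the backbone but not a specific checkpoint), and the downstream TDNN is omitted:

```python
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

ckpt = "facebook/hubert-base-ls960"  # assumed public checkpoint; the paper's backbone may differ
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
hubert = HubertModel.from_pretrained(ckpt).eval()

def ssl_features(waveform_16k):
    """Frame-level self-supervised features used in place of an MFCC/spectrogram front end."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = hubert(inputs.input_values)
    return out.last_hidden_state  # (1, frames, hidden_dim), fed to the TDNN back end
```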
(This article belongs to the Section Computing and Artificial Intelligence)

20 pages, 3271 KiB  
Article
Fine-Tuned Machine Learning Classifiers for Diagnosing Parkinson’s Disease Using Vocal Characteristics: A Comparative Analysis
by Mehmet Meral, Ferdi Ozbilgin and Fatih Durmus
Diagnostics 2025, 15(5), 645; https://doi.org/10.3390/diagnostics15050645 - 6 Mar 2025
Viewed by 349
Abstract
Background/Objectives: Early and precise diagnosis of Parkinson’s Disease (PD), which affects both motor and non-motor functions, is important for achieving better disease control and patient outcomes. This study seeks to assess the effectiveness of machine learning algorithms optimized to classify PD based on vocal characteristics, to serve as a non-invasive and easily accessible diagnostic tool. Methods: This study used a publicly available dataset of vocal samples from 188 people with PD and 64 controls. Acoustic features such as baseline characteristics, time-frequency components, Mel Frequency Cepstral Coefficients (MFCCs), and wavelet transform-based metrics were extracted and analyzed. The Chi-Square test was used for feature selection to determine the most important attributes for classification accuracy. Six machine learning classifiers, namely SVM, k-NN, DT, NN, Ensemble and Stacking models, were developed and optimized via Bayesian Optimization (BO), Grid Search (GS) and Random Search (RS). Accuracy, precision, recall, F1-score and AUC-ROC were used for evaluation. Results: Stacking models, especially those fine-tuned via Grid Search, yielded the best performance, with 92.07% accuracy and an F1-score of 0.95. In addition, the choice of relevant vocal features, in conjunction with the Chi-Square feature selection method, greatly enhanced computational efficiency and classification performance. Conclusions: This study highlights the potential of combining advanced feature selection techniques with hyperparameter optimization strategies to enhance machine learning-based PD diagnosis using vocal characteristics. Ensemble models proved particularly effective in handling complex datasets, demonstrating robust diagnostic performance. Future research may focus on deep learning approaches and temporal feature integration to further improve diagnostic accuracy and scalability for clinical applications.
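Chi-square feature selection, a stacking ensemble, and grid-search tuning are all standard scikit-learn building blocks. A rough sketch of how they could be combined for a binary PD-vs-control task; the base learners, grid values, and feature counts are illustrative, not the authors' exact configuration:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("knn", KNeighborsClassifier()),
                ("dt", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
)
pipe = Pipeline([
    ("scale", MinMaxScaler()),        # chi2 requires non-negative features
    ("select", SelectKBest(chi2)),
    ("clf", stack),
])
grid = GridSearchCV(
    pipe,
    {"select__k": [50, 100, 200], "clf__svm__C": [0.1, 1, 10]},  # illustrative grid
    scoring="f1", cv=5,
)
# grid.fit(X, y); grid.best_params_, grid.best_score_
```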

18 pages, 992 KiB  
Article
Baby Cry Classification Using Structure-Tuned Artificial Neural Networks with Data Augmentation and MFCC Features
by Tayyip Ozcan and Hafize Gungor
Appl. Sci. 2025, 15(5), 2648; https://doi.org/10.3390/app15052648 - 1 Mar 2025
Viewed by 478
Abstract
Babies express their needs, such as hunger, discomfort, or sleeplessness, by crying. However, understanding these cries correctly can be challenging for parents. This can delay the baby’s needs, increase parents’ stress levels, and negatively affect the baby’s development. In this paper, an integrated system for the classification of baby sounds is proposed. The proposed method includes data augmentation, feature extraction, hyperparameter tuning, and model training steps. In the first step, various data augmentation techniques were applied to increase the training data’s diversity and strengthen the model’s generalization capacity. The MFCC (Mel-Frequency Cepstral Coefficients) method was used in the second step to extract meaningful and distinctive features from the sound data. MFCC represents sound signals based on the frequencies the human ear perceives and provides a strong basis for classification. The obtained features were classified with an artificial neural network (ANN) model with optimized hyperparameters. The hyperparameter optimization of the model was performed using the grid search algorithm, and the most appropriate parameters were determined. The training, validation, and test data sets were separated at 75%, 10%, and 15% ratios, respectively. The model’s performance was tested on mixed sounds. The test results were analyzed, and the proposed method showed the highest performance, with a 90% accuracy rate. In the comparison study with an artificial neural network (ANN) on the Donate a Cry data set, the F1 score was reported as 46.99% and the test accuracy as 85.93%. In this paper, additional techniques such as data augmentation, hyperparameter tuning, and MFCC feature extraction allowed the model accuracy to reach 90%. The proposed method offers an effective solution for classifying baby sounds and brings a new approach to this field.
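A sketch of the kind of waveform augmentation plus MFCC extraction described here, assuming librosa; the specific augmentations, parameters, and file names are illustrative rather than taken from the paper:

```python
import librosa
import numpy as np

def augment(y, sr):
    """Yield simple augmented variants of a cry recording."""
    yield y                                                  # original
    yield y + 0.005 * np.random.randn(len(y))                # additive noise
    yield librosa.effects.time_stretch(y, rate=1.1)          # slightly faster
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # pitch up two semitones

def mfcc_vector(y, sr, n_mfcc=40):
    """Frame-averaged MFCC vector used as the classifier input."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

y, sr = librosa.load("cry.wav", sr=22050)  # hypothetical input file
X_aug = np.stack([mfcc_vector(v, sr) for v in augment(y, sr)])  # one row per variant
```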

21 pages, 4009 KiB  
Article
Applying Acoustic Signals to Monitor Hybrid Electrical Discharge-Turning with Artificial Neural Networks
by Mehdi Soleymani and Mohammadjafar Hadad
Micromachines 2025, 16(3), 274; https://doi.org/10.3390/mi16030274 - 27 Feb 2025
Viewed by 254
Abstract
Artificial intelligence (AI) models have demonstrated their capabilities across various fields by performing tasks that are currently handled by humans. However, the training of these models faces several limitations, such as the need for sufficient data. This study proposes the use of acoustic signals as training data, as this offers a simpler way to obtain a large dataset compared to traditional approaches. Acoustic signals contain valuable information about process behavior. We investigated whether useful features could be extracted from acoustic data so that labels could be predicted separately by a multi-label classifier rather than by a single multi-class classifier. This study focuses on electrical discharge turning (EDT), a hybrid of electrical discharge machining (EDM) and turning and an intricate process with multiple influencing parameters. The sounds generated during EDT were recorded and used as training data. The sounds underwent preprocessing to examine the effects of the parameters used for feature extraction prior to feeding the data into the ANN model. The parameters investigated included the sample rate, the length of the FFT window, the hop length, and the number of mel-frequency cepstral coefficients (MFCC). The study aimed to determine the preprocessing parameters that yielded the highest precision, recall, and F1 scores. The results revealed that, instead of using the default values in the Python packages, it is necessary to investigate the preprocessing parameters to find the optimal values for maximum classification performance. The promising results of the multi-label classification model showed that various aspects of a process can be detected simultaneously from a single input, which is very beneficial for monitoring. The results also indicated that the highest prediction scores could be achieved by setting the sample rate, length of the FFT window, hop length, and number of MFCC to 4500 Hz, 1024, 256, and 80, respectively.
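The preprocessing parameters the study settles on map directly onto librosa's MFCC arguments. A minimal sketch using the reported values (4500 Hz sample rate, FFT window 1024, hop length 256, 80 MFCCs); the file name and downstream multi-label ANN are assumptions:

```python
import librosa

# Load a recorded EDT sound at the reported optimal sample rate.
y, sr = librosa.load("edt_recording.wav", sr=4500)  # hypothetical file name

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_fft=1024,      # length of the FFT window
    hop_length=256,  # hop between successive frames
    n_mfcc=80,       # number of coefficients kept per frame
)
# mfcc.shape == (80, n_frames); the frames become inputs to the multi-label ANN
```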
(This article belongs to the Special Issue Future Prospects of Additive Manufacturing)

19 pages, 1668 KiB  
Article
Acoustic-Based Industrial Diagnostics: A Scalable Noise-Robust Multiclass Framework for Anomaly Detection
by Bo Peng, Danlei Li, Kevin I-Kai Wang and Waleed H. Abdulla
Processes 2025, 13(2), 544; https://doi.org/10.3390/pr13020544 - 14 Feb 2025
Viewed by 507
Abstract
This study proposes a framework for anomaly detection in industrial machines with a focus on robust multiclass classification using acoustic data. Many state-of-the-art methods only have binary classification capabilities for each machine, and suffer from poor scalability and noise robustness. In this context, we propose the use of Smoothed Pseudo Wigner–Ville Distribution-based Mel-Frequency Cepstral Coefficients (SPWVD-MFCCs) in the framework which are specifically tailored for noisy environments. SPWVD-MFCCs, with better time–frequency resolution and perceptual audio features, improve the accuracy of detecting anomalies in a more generalized way under variable signal-to-noise ratio (SNR) conditions. This framework integrates a CNN-LSTM model that efficiently and accurately analyzes spectral and temporal information separately for anomaly detection. Meanwhile, the dimensionality reduction strategy ensures good computational efficiency without losing critical information. On the MIMII dataset involving multiple machine types and noise levels, it has shown robustness and scalability. Key findings include significant improvements in classification accuracy and F1-scores, particularly in low-SNR scenarios, showcasing its adaptability to real-world industrial environments. This study represents the first application of SPWVD-MFCCs in industrial diagnostics and provides a noise-robust and scalable method for the detection of anomalies and fault classification, which is bound to improve operational safety and efficiency within complex industrial scenarios.
(This article belongs to the Special Issue Research on Intelligent Fault Diagnosis Based on Neural Network)

18 pages, 1573 KiB  
Article
PD-Net: Parkinson’s Disease Detection Through Fusion of Two Spectral Features Using Attention-Based Hybrid Deep Neural Network
by Munira Islam, Khadija Akter, Md. Azad Hossain and M. Ali Akber Dewan
Information 2025, 16(2), 135; https://doi.org/10.3390/info16020135 - 12 Feb 2025
Viewed by 851
Abstract
Parkinson’s disease (PD) is a progressive degenerative brain disease that worsens with age, causing areas of the brain to weaken. Vocal dysfunction often emerges as one of the earliest and most prominent indicators of Parkinson’s disease, with a significant number of patients exhibiting vocal impairments during the initial stages of the illness. In view of this, to facilitate the diagnosis of Parkinson’s disease through the analysis of these vocal characteristics, this study uses a combination of the mel spectrogram and MFCCs as spectral features. This study adopts Italian raw audio data to establish an efficient detection framework specifically designed to classify the vocal data into two distinct categories: healthy individuals and patients diagnosed with Parkinson’s disease. To this end, the study proposes a hybrid model that integrates Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs) for the detection of Parkinson’s disease. Specifically, CNNs are employed to extract spatial features from the extracted spectro-temporal characteristics of the vocal data, while LSTMs capture temporal dependencies, enabling a comprehensive analysis of how vocal patterns develop over time. Additionally, the addition of a multi-head attention mechanism significantly enhances the model’s ability to concentrate on essential details, improving its overall performance. This unified method aims to enhance the detection of subtle vocal changes associated with Parkinson’s, improving overall diagnostic accuracy. The findings show that this model achieves a noteworthy accuracy of 99.00% for the Parkinson’s disease detection task.
(This article belongs to the Special Issue Feature Papers in Information in 2024–2025)

24 pages, 5922 KiB  
Article
Age Prediction from Korean Speech Data Using Neural Networks with Diverse Voice Features
by Hayeon Ku, Jiho Lee, Minseo Lee, Seulgi Kim and Janghyeok Yoon
Appl. Sci. 2025, 15(3), 1337; https://doi.org/10.3390/app15031337 - 27 Jan 2025
Viewed by 753
Abstract
A person’s voice serves as an indicator of age, as it changes with anatomical and physiological influences throughout their life. Although age prediction is a subject of interest across various disciplines, age-prediction studies using Korean voices are limited. The few studies that have been conducted have limitations, such as the absence of specific age groups or detailed age categories. Therefore, this study proposes an optimal combination of speech features and deep-learning models to recognize detailed age groups using a large Korean-speech dataset. From the speech dataset, recorded by individuals ranging from their teens to their 50s, four speech features were extracted: the Mel spectrogram, log-Mel spectrogram, Mel-frequency cepstral coefficients (MFCCs), and ΔMFCCs. Using these speech features, four deep-learning models were trained: ResNet-50, 1D-CNN, 2D-CNN, and a vision transformer. A performance comparison of speech feature-extraction methods and models indicated that MFCCs + ΔMFCCs was the best for both sexes when trained on the 1D-CNN model; it achieved an accuracy of 88.16% for males and 81.95% for females. The results of this study are expected to contribute to the future development of Korean speaker-recognition systems.
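The best-performing feature set reported here, MFCCs concatenated with their deltas, is easy to reproduce. A sketch assuming librosa, with the 1D-CNN itself omitted and the file name hypothetical:

```python
import librosa
import numpy as np

def mfcc_with_deltas(path, sr=16000, n_mfcc=40):
    """Stack MFCCs and their first-order deltas into one feature map per utterance."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    return np.concatenate([mfcc, delta], axis=0)  # shape: (2 * n_mfcc, n_frames)

# Each (2*n_mfcc, n_frames) map would be fed to a 1D-CNN over the time axis,
# with the age bracket (teens ... 50s) as the target label.
features = mfcc_with_deltas("speaker_001.wav")
```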
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

21 pages, 2188 KiB  
Article
Urban Sound Recognition in Smart Cities Using an IoT–Fog Computing Framework and Deep Learning Models: A Performance Comparison
by Buket İşler
Appl. Sci. 2025, 15(3), 1201; https://doi.org/10.3390/app15031201 - 24 Jan 2025
Viewed by 617
Abstract
Rapid urbanization presents significant challenges in energy consumption, noise control, and environmental sustainability. Smart cities aim to address these issues by leveraging information technologies to enhance operational efficiency and urban liveability. In this context, urban sound recognition supports environmental monitoring and public safety. This study provides a comparative evaluation of three machine learning models—convolutional neural networks (CNNs), long short-term memory (LSTM), and dense neural networks (Dense)—for classifying urban sounds. The analysis used the UrbanSound8K dataset, a static dataset designed for environmental sound classification, with mel-frequency cepstral coefficients (MFCCs) applied to extract core sound features. The models were tested in a fog computing architecture on AWS to simulate a smart city environment, chosen for its potential to reduce latency and optimize bandwidth for future real-time sound-recognition applications. Although real-time data were not used, the simulated setup effectively assessed model performance under conditions relevant to smart city applications. According to macro and weighted F1-score metrics, the CNN model achieved the highest accuracy at 90%, followed by the Dense model at 84% and the LSTM model at 81%, with the LSTM model showing limitations in distinguishing overlapping sound categories. These simulations demonstrated the framework’s capacity to enable efficient urban sound recognition within a fog-enabled architecture, underscoring its potential for real-time environmental monitoring and public safety applications.
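A minimal sketch of the kind of CNN classifier compared in this study, operating on fixed-size MFCC patches from UrbanSound8K's ten classes. It assumes TensorFlow/Keras, and the layer sizes and patch shape are illustrative rather than the paper's architecture:

```python
import tensorflow as tf

NUM_CLASSES = 10             # UrbanSound8K has ten sound categories
INPUT_SHAPE = (40, 174, 1)   # e.g., 40 MFCCs x 174 frames, single channel (assumed patch size)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=INPUT_SHAPE),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.1, epochs=30)
```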

20 pages, 1849 KiB  
Article
Speech Emotion Recognition Model Based on Joint Modeling of Discrete and Dimensional Emotion Representation
by John Lorenzo Bautista and Hyun Soon Shin
Appl. Sci. 2025, 15(2), 623; https://doi.org/10.3390/app15020623 - 10 Jan 2025
Viewed by 740
Abstract
This paper introduces a novel joint model architecture for Speech Emotion Recognition (SER) that integrates both discrete and dimensional emotional representations, allowing for the simultaneous training of classification and regression tasks to improve the comprehensiveness and interpretability of emotion recognition. By employing a joint loss function that combines categorical and regression losses, the model ensures balanced optimization across tasks, with experiments exploring various weighting schemes using a tunable parameter to adjust task importance. Two adaptive weight balancing schemes, Dynamic Weighting and Joint Weighting, further enhance performance by dynamically adjusting task weights based on optimization progress and ensuring balanced emotion representation during backpropagation. The architecture employs parallel feature extraction through independent encoders, designed to capture unique features from multiple modalities, including Mel-frequency Cepstral Coefficients (MFCC), Short-term Features (STF), Mel-spectrograms, and raw audio signals. Additionally, pre-trained models such as Wav2Vec 2.0 and HuBERT are integrated to leverage their robust latent features. The inclusion of self-attention and co-attention mechanisms allows the model to capture relationships between input modalities and interdependencies among features, further improving its interpretability and integration capabilities. Experiments conducted on the IEMOCAP dataset using a leave-one-subject-out approach demonstrate the model’s effectiveness, with results showing a 1–2% accuracy improvement over classification-only models. The optimal configuration, incorporating the joint architecture, dynamic weighting, and parallel processing of multimodal features, achieves a weighted accuracy of 72.66%, an unweighted accuracy of 73.22%, and a mean Concordance Correlation Coefficient (CCC) of 0.3717. These results validate the effectiveness of the proposed joint model architecture and adaptive balancing weight schemes in improving SER performance.
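The joint loss idea, a tunable mix of a categorical term and a regression term over continuous emotion dimensions, can be written compactly. A sketch in PyTorch with an assumed fixed weighting parameter alpha; the paper's dynamic and joint weighting schemes are not shown:

```python
import torch
import torch.nn.functional as F

def joint_ser_loss(class_logits, class_targets, dim_preds, dim_targets, alpha=0.5):
    """Weighted sum of a discrete-emotion loss and a dimensional-emotion loss.

    class_logits: (batch, n_emotions) scores for the discrete categories
    dim_preds / dim_targets: (batch, n_dims) continuous values (e.g., valence, arousal)
    alpha: tunable weight balancing classification against regression
    """
    cls_loss = F.cross_entropy(class_logits, class_targets)
    reg_loss = F.mse_loss(dim_preds, dim_targets)
    return alpha * cls_loss + (1.0 - alpha) * reg_loss
```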
(This article belongs to the Section Computing and Artificial Intelligence)

19 pages, 855 KiB  
Article
Comparative Analysis of Audio Features for Unsupervised Speaker Change Detection
by Alymzhan Toleu, Gulmira Tolegen, Rustam Mussabayev, Alexander Krassovitskiy and Bagashar Zhumazhanov
Appl. Sci. 2024, 14(24), 12026; https://doi.org/10.3390/app142412026 - 23 Dec 2024
Viewed by 728
Abstract
This study examines how ten different audio features, including MFCC, mel-spectrogram, chroma, and spectral contrast, influence speaker change detection (SCD) performance. The analysis is conducted using two unsupervised methods: the Bayesian information criterion with a Gaussian mixture model (BIC-GMM), a model-based approach, and Kullback-Leibler divergence with a Gaussian mixture model (KL-GMM), a metric-based approach. Evaluation involved statistical analysis of feature changes in relation to speaker changes (and vice versa), supported by comprehensive experimental validation. Experimental results show MFCC to be the most effective feature, demonstrating consistently good performance across both methods. Features such as zero crossing rate, chroma, and spectral contrast also showed notable effectiveness within the BIC-GMM framework, while the mel-spectrogram consistently ranked as the least influential feature in both approaches. Further analysis revealed that BIC-GMM exhibits greater stability in managing variations in feature performance, whereas KL-GMM is more sensitive to threshold optimization. Nevertheless, KL-GMM achieved competitive results when paired with specific features, such as MFCC and zero crossing rate. These findings offer valuable insights into the impact of feature selection on unsupervised SCD, providing guidance for the development of more robust and accurate algorithms for practical applications.
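In its simplest single-Gaussian form, the model-based BIC criterion reduces to comparing the covariances of two candidate segments against their pooled covariance. A numpy sketch of that ΔBIC test, a simplification of the paper's GMM-based variant:

```python
import numpy as np

def delta_bic(seg1, seg2, lam=1.0):
    """Positive value suggests a speaker change between two feature segments.

    seg1, seg2: (frames, dims) arrays of per-frame features (e.g., MFCCs).
    lam: penalty weight for the extra model parameters.
    """
    full = np.vstack([seg1, seg2])
    n1, n2, n = len(seg1), len(seg2), len(full)
    d = full.shape[1]
    logdet = lambda x: np.linalg.slogdet(np.cov(x, rowvar=False))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(full) - n1 * logdet(seg1) - n2 * logdet(seg2)) - penalty

# Sliding a window over consecutive MFCC segments and flagging points where
# delta_bic(...) > 0 yields candidate speaker-change locations.
```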
